ThoughtWorks

Let Data Scientists be Data Mungers

David Johnston

Published: Aug 5, 2015

There is a story going around about data science that you’ve surely heard. It's the statement that 80% of the work a data scientist does is collecting, cleaning and organizing data and that only 20% is their real specialty: building models and making discoveries.

Those who make this statement, often those selling something, usually claim that what data science needs most of all is better tools for doing these so-called data munging tasks, to free up data scientists to concentrate on what they’re best at. For example, Steve Lohr of The New York Times recently said:

"Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

I am not going to argue the 80% statistic. At the early stage of a project, it’s probably true that 80% of the work is data munging. In the middle stages most of the work is spent learning how to model the data and discovering how to harness the information in the data for some useful purpose. In the later stages, most of the work is application development. What I do wish to debunk, however, is the idea that data munging is mundane, easy and something that could be automated completely or outsourced to less experienced people. Data scientists are not wasting their time or skills at this task.

The first step when starting a new data science project is twofold:

  1. One is gaining an understanding of the problem needing to be solved.
  2. The other is getting a high-level overview of the data landscape which is relevant to the problem’s domain.

While they may appear to be separate steps, they are in fact tightly coupled. Without a problem in mind, there is little sense in learning about the data. Likewise, there is no reason to treat a problem as a data science problem if you know nothing of the data landscape. Being able to see how data is relevant to or capable of solving problems is one of the key skills of a data scientist.

In the early stage of the project the data landscape is explored only lightly. There are still plenty of unknowns. Data relationships are merely guessed at. Data quality is uncertain. Statistical relationships are checked only for a sampling of variables. It is at this stage of only partial knowledge that the data scientist must choose an initial direction of work. It takes intuition gained from experience to be able to pick a useful direction with such imprecise knowledge. Once a direction is chosen, the data must be explored further and continually in order to steer a useful course.


When watching a data scientist at this data munging stage, it may appear that all they are doing is writing file reading scripts, converting date formats, transforming and aggregating data fields and moving data around. The fact is that a great deal of high-level thinking goes into every choice. There may be hundreds of variables to look at. From those, there is a nearly infinite number of combinations that could be computed and compared. Transforming raw data into variables bearing an estimable, statistical relationship to the problem is really the heart and soul of data science. While the actual model building occurs later, the distillation of the data into useful combinations is done at this stage.
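As a minimal sketch of the kind of choice hidden inside "converting date formats" and "aggregating data fields", consider the Python below. The records, field names and the two date formats are hypothetical, invented for illustration. Even here a judgment call appears: deciding that `08/03/2015` means August 3 rather than March 8 depends on knowing where the data came from.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw export: dates arrive in more than one format.
RAW_ORDERS = [
    {"customer": "a", "date": "2015-08-01", "amount": "20.00"},
    {"customer": "a", "date": "08/03/2015", "amount": "35.50"},
    {"customer": "b", "date": "2015-08-02", "amount": "12.25"},
]

# Assumed formats for this sketch. Treating the second one as month-first
# is a decision about the data source, not something the code can infer.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; raise if none match so bad rows are not hidden."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def customer_features(orders):
    """Distill raw order rows into candidate variables per customer."""
    by_customer = defaultdict(list)
    for row in orders:
        by_customer[row["customer"]].append(
            (parse_date(row["date"]), float(row["amount"]))
        )
    features = {}
    for customer, rows in by_customer.items():
        dates = [d for d, _ in rows]
        features[customer] = {
            "total_spend": sum(amount for _, amount in rows),
            "active_days": (max(dates) - min(dates)).days,
        }
    return features
```

Whether `total_spend` and `active_days` are the right variables, or whether something like spend per active day would be more informative, is exactly the kind of question that gets answered while munging.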

The process of actually performing these calculations and transformations can involve some repetitive tasks, and non-experts can help with this. However, professional data scientists do these tasks daily and typically do them very efficiently. In addition, the work is useful because it’s one of the best ways the data scientist can learn about the problem and its domain. They typically utilize a large toolkit of reusable pieces which can be combined in myriad ways to accomplish the task at hand. Often it is not efficient to try to delegate these tasks away by explaining them to others.

Furthermore, there are so many alternatives, exceptions and ways of treating edge cases that these calculations often can’t easily be expressed in some declarative language to automate their implementation. Often we see database developers trying to use SQL to accomplish every such data transformation task. This results in an unwieldy mess that is inflexible, non-performant and difficult or impossible to test. SQL is a query language. It was not designed for building data transformation pipelines. It’s far better to write these scripts in a general purpose programming language, using software development best practices.
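One way to picture the alternative, as a rough sketch rather than a prescribed design, is a pipeline of small functions in Python, each handling one transformation and its edge cases and each testable on its own. The function names, the field names and the alias table are all hypothetical.

```python
def normalize_country(record):
    """Map free-text country values to a standard code.

    The alias table is illustrative only; unknown values are upper-cased
    and missing or blank values become None.
    """
    aliases = {"usa": "US", "u.s.": "US", "united states": "US", "uk": "GB"}
    raw = (record.get("country") or "").strip().lower()
    return {**record, "country": aliases.get(raw, raw.upper() or None)}


def parse_amount(record):
    """Convert amounts like '1,200.50' to float; keep None when missing."""
    text = record.get("amount")
    value = float(text.replace(",", "")) if text else None
    return {**record, "amount": value}


def run_pipeline(records, steps):
    """Apply each step, in order, to every record."""
    for step in steps:
        records = [step(r) for r in records]
    return records


cleaned = run_pipeline(
    [{"country": " U.S. ", "amount": "1,200.50"}],
    [normalize_country, parse_amount],
)
```

Because each step is an ordinary function, a new exception or edge case becomes one more branch plus one more unit test, and steps can be reordered or reused across projects.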
   
So even if we don’t try to take the data munging away from the data scientist, one might think that we should at least provide them with clean data. Do expert data scientists really need to be utilized for the task of cleaning data? The trouble here is that the concept of “cleaning” is not a perfect metaphor for what happens in the data cleansing process. Cleaning, in normal usage, implies a fairly basic task of separating the useless “dirt” from whatever it is that we want cleaned. Where is the “dirt” in the data? What is called cleaning data is really either filtering data or transforming it into a more standard form. What is clean, versus dirty, can depend heavily on what you are trying to accomplish. Missing data fields must be removed for some analyses and not others. Sometimes a missing field in one column of tabular data requires removal of the entire record. Sometimes not. Similarly, filtering data to remove certain records is dependent on the task. One person’s dirt can be another person’s clean data.
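To make the point about missing fields concrete, here is a small Python sketch with made-up records and hypothetical field names. The same three records are "cleaned" differently depending on the question being asked.

```python
# Hypothetical records; None marks a missing field.
RECORDS = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
]


def complete_cases(records, required):
    """Keep only records where every required field is present."""
    return [r for r in records if all(r.get(f) is not None for f in required)]


def mean_income(records):
    """Only income matters here, so a missing age is not 'dirt'."""
    rows = complete_cases(records, ["income"])
    return sum(r["income"] for r in rows) / len(rows)


def age_income_pairs(records):
    """Relating age to income needs both fields, so more records drop out."""
    rows = complete_cases(records, ["age", "income"])
    return [(r["age"], r["income"]) for r in rows]
```

Record 2 is perfectly usable for the income average but has to be dropped when age and income are compared, so no single pre-cleaned dataset would serve both analyses.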

Data munging is a challenging, high-level activity because data itself is not as simple as many believe. It’s a mistake to think of data as being the same as what it is supposed to represent. It is rare that we can simply ignore the data generation process. What we know is the data itself and the process that created that data from something in the real world that impacts our goals. A good data scientist stays focused on those goals rather than on the unimportant concepts that are imperfectly represented by the data. Thinking of it in this way implies that data is never clean. It is only more or less informative about something important to us. The more remote the connection between the data and what we care about, the greater the need for high-level data science, but rarely is it so simple that the cleaning metaphor fits well.

Keep developing libraries to aid the process but don’t expect this data munging phase to be solved separately. Data scientists do need help though. When building a team around a data scientist, include junior data scientists who can pick up and apply data munging skills as well as more involved model building. Include software developers and data engineers to help build supporting software and scaling production software products.

But in the end, let data scientists be data mungers. It’s more complicated than it looks. 

© 2021 ThoughtWorks, Inc.