There is a story going around about data science that you’ve surely heard. It's the statement that 80% of the work a data scientist does is collecting, cleaning and organizing data and that only 20% is their real specialty: building models and making discoveries.
Those who make this statement, often those selling something, usually claim that what data science needs most of all is better tools for these so-called data munging tasks, so that data scientists are free to concentrate on what they’re best at. For example, Steve Lohr of The New York Times recently said:
"Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
I am not going to argue the 80% statistic. At the early stage of a project, it’s probably true that 80% of the work is data munging. In the middle stages most of the work is spent learning how to model the data and discovering how to harness the information in the data for some useful purpose. In the later stages, most of the work is application development. What I do wish to debunk, however, is the idea that data munging is mundane, easy and something that could be automated completely or outsourced to less experienced people. Data scientists are not wasting their time or skills at this task.
The first step when starting a new data science project is twofold:
- One is gaining an understanding of the problem needing to be solved.
- The other is getting a high-level overview of the data landscape which is relevant to the problem’s domain.
While they may appear to be separate steps, they are in fact tightly coupled. Without a problem in mind, there is little sense in learning about the data. Likewise, there is no reason to treat a problem as a data science problem if you know nothing of the data landscape. Being able to see how data is relevant to or capable of solving problems is one of the key skills of a data scientist.
In the early stage of the project the data landscape is explored only lightly. There are still plenty of unknowns. Data relationships are merely guessed at. Data quality is uncertain. Statistical relationships are checked only for a sampling of variables. It is at this stage of only partial knowledge that the data scientist must choose an initial direction of work. It takes intuition gained from experience to be able to pick a useful direction with such imprecise knowledge. Once a direction is chosen, the data must be explored further and continually in order to steer a useful course.
When watching a data scientist at this data munging stage, it may appear that all they are doing is writing file-reading scripts, converting date formats, transforming and aggregating data fields and moving data around. The fact is that much high-level thinking goes into every choice. There may be hundreds of variables to look at. From those, there is a nearly infinite number of combinations that could be computed and compared. Transforming raw data into variables bearing an estimable, statistical relationship to the problem is really the heart and soul of data science. While the actual model building occurs later, the distillation of the data into useful combinations is done at this munging stage.
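To put a rough number on how quickly the candidate combinations pile up, here is a small arithmetic sketch. The figure of 200 variables is an assumption chosen purely for illustration. The count covers only which variables to combine, before considering how to combine them (ratios, differences, lags and so on).

```python
from math import comb

# Assumed, illustrative count of raw variables; not taken from any real project.
n_variables = 200

pairs = comb(n_variables, 2)    # ways to pick two variables to combine
triples = comb(n_variables, 3)  # ways to pick three variables to combine

print(pairs)    # 19900
print(triples)  # 1313400
```

Nobody computes all of these. Choosing the handful worth computing is where the judgment comes in.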
The process of actually performing these calculations and transformations can involve some repetitive tasks, and non-experts can help with this. However, professional data scientists do these tasks daily and typically do them very efficiently. Doing the work is also useful in itself, as it’s one of the best ways the data scientist can learn about the problem and its domain. They typically utilize a large toolkit of reusable pieces which can be combined in myriad ways to accomplish the task at hand. Often it is not efficient to try to delegate these tasks away by explaining them to others.
Furthermore, there are so many alternatives, exceptions and ways of treating edge cases that these calculations often can’t easily be expressed in some declarative language to automate their implementation. Often we see database developers trying to use SQL to accomplish every such data transformation task. This results in an unwieldy mess that is inflexible, slow and difficult or impossible to test. SQL is a query language. It was not designed for building data transformation pipelines. It’s far better to write these scripts in a general-purpose programming language, using software development best practices.
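As a minimal sketch of what that can look like, here is one small transformation step written in Python as a plain function, so the edge cases are explicit and each one can be covered by a unit test. The function name `parse_date` and the two date formats are hypothetical, chosen only for illustration.

```python
from datetime import date, datetime

# Hypothetical formats assumed to appear in the raw files; illustrative only.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(raw):
    """Return a date for a raw string, or None if it is blank or unrecognized."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    # Try each known format in turn; the first one that parses wins.
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    # Unrecognized values are reported as missing rather than guessed at.
    return None

print(parse_date("2014-03-05"))
print(parse_date(" 05/03/2014 "))
print(parse_date("not a date"))
```

Whether an unparseable value should become missing, raise an error or be logged for review is exactly the kind of decision that depends on the problem at hand. Keeping it in a small function makes that decision visible and easy to change.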
So even if we don’t try to take the data munging away from the data scientist, one might think that we should at least provide them with clean data. Do expert data scientists really need to be utilized for the task of cleaning data? The trouble here is that the concept of “cleaning” is not a perfect metaphor for what happens in the data cleansing process. Cleaning, in normal usage, implies a fairly basic task of separating the useless “dirt” from whatever it is that we want cleaned. Where is the “dirt” in the data? What is called cleaning data is really either filtering data or transforming it into a more standard form. What is clean, versus dirty, can depend heavily on what you are trying to accomplish. Missing data fields must be removed for some analyses and not others. Sometimes a missing field in one column of tabular data requires removal of the entire record. Sometimes not. Similarly, filtering data to remove certain records is dependent on the task. One person’s dirt can be another person’s clean data.
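A small Python sketch makes the point concrete. The records, field names and the helper `complete_cases` are all made up for illustration. The same three records produce two different “clean” datasets depending on the question being asked.

```python
# Made-up records for illustration; None marks a missing field.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
]

def complete_cases(rows, fields):
    """Keep only the rows in which every listed field is present."""
    return [r for r in rows if all(r[f] is not None for f in fields)]

# Task A: average income. A missing age is irrelevant, so record 2 stays.
income_rows = complete_cases(records, ["income"])
avg_income = sum(r["income"] for r in income_rows) / len(income_rows)

# Task B: relate age to income. Both fields are needed, so only record 1 survives.
paired_rows = complete_cases(records, ["age", "income"])

print([r["id"] for r in income_rows])  # [1, 2]
print(avg_income)                      # 56500.0
print([r["id"] for r in paired_rows])  # [1]
```

Record 2 is clean for the first task and dirty for the second. No upstream cleaning step could have made that call without knowing the analysis.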
Data munging is a challenging, high-level activity because data itself is not as simple as many believe. It’s a mistake to think of data as being the same as what it is supposed to represent. It is rare that we can simply ignore the data generation process. What we know is the data itself and the process that created that data from something in the real world that impacts our goals. A good data scientist stays focused on those goals rather than on the unimportant concepts that are imperfectly represented by the data. Thinking of it in this way implies that data is never clean. It is only more or less informative about something important to us. The more remote the connection between the data and what we care about, the greater the need for high-level data science, but rarely is it so simple that the cleaning metaphor fits well.
Keep developing libraries to aid the process, but don’t expect this data munging phase to be solved separately. Data scientists do need help, though. When building a team around a data scientist, include junior data scientists who can pick up and apply data munging skills as well as more involved model building. Include software developers and data engineers to help build supporting software and scale production software products.
But in the end, let data scientists be data mungers. It’s more complicated than it looks.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.