In my experience of working on data science projects, I’ve noticed that stakeholders differ considerably in how well they understand the way such projects work. Software development is well understood: people know what to expect and how to communicate. Data science projects are not, because they are relatively new. They also tend to be non-deterministic, since they depend on the information contained in the data, which cannot be known before the project starts. In this article, I outline the typical flow of a data science project. I hope this helps stakeholders understand how data science projects are executed end-to-end.
Figure 1: Lifecycle of a data science project
Each part of the lifecycle of a data science project is typically an experiment. Every data science project begins by defining the problem statement, to get a clear understanding of what we’re trying to solve. This also includes setting the success criteria. These could be a business metric, e.g. lift in sales, or a model performance metric, e.g. F1/F2 score, precision, recall or accuracy.
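To make the model performance metrics mentioned above concrete, here is a minimal sketch of how precision, recall, accuracy and the F1/F2 scores relate to each other for a binary classifier. The function name `classification_metrics` and the toy labels are assumptions for illustration only; in practice a team would usually rely on an established library rather than hand-rolled code.

```python
def classification_metrics(y_true, y_pred, beta=1.0):
    """Return precision, recall, accuracy and F-beta for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    # F-beta weighs recall beta times as heavily as precision.
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "fbeta": fbeta}


# Hypothetical toy labels: 4 positives, 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

f1 = classification_metrics(y_true, y_pred, beta=1.0)
f2 = classification_metrics(y_true, y_pred, beta=2.0)
print(f"precision={f1['precision']:.3f} recall={f1['recall']:.3f} "
      f"accuracy={f1['accuracy']:.3f} F1={f1['fbeta']:.3f} F2={f2['fbeta']:.3f}")
# precision=0.667 recall=0.500 accuracy=0.625 F1=0.571 F2=0.526
```

Which of these numbers becomes the success criterion depends on the business context: F2, for example, favors recall, which tends to matter more when missing a positive case is costlier than raising a false alarm.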
In some projects, defining the problem statement itself can be the first stage. This is typically the case when a client says that they have some data available and asks the data science team to glean useful insights or make predictions from it. While doing this, the data science team will also identify the uncertainties. For instance, if they’re trying to make a forecast, they would seek clarity on specifics such as ‘how far into the future should we forecast: the next week or the next month?’ Understanding the business context with the help of subject matter experts, users and stakeholders helps build an acceptable and robust solution.
The next stage is research. The data science team will read up on the domain, understand the various kinds of problem statements that apply, estimate the data they would need, and so on. If the problem statement is vague, then research and problem definition may overlap. This collective effort helps answer some of the questions that arise at the problem definition stage.
After this, three tracks usually run in parallel: data processes, literature review and ideation.
In the data processes track, data science teams will understand the data in the system. They will gather the data useful for solving the problem. They will then identify gaps in the data that need to be filled before starting the project. This step is like a feasibility study with the data. Broadly, the outcome can be thought of as the answer to “Do we have enough data to start the project, both in terms of the number of data points and the number of features in the data?”
In my experience, most of the challenges that arise pertain to the data rather than the modeling. If the data is clean, with few or no missing values, and the required features are present, then the modeling becomes an easier task.
Stakeholders can look at this as an opportunity to learn more about their data.
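The feasibility question described in this track can be sketched in a few lines of code. The example below is only an illustration under assumptions: the function name `profile_dataset`, the thresholds `min_rows` and `max_missing_rate`, and the sample records are all hypothetical, and a real project would agree on such thresholds with stakeholders.

```python
def profile_dataset(rows, required_features, min_rows=100, max_missing_rate=0.2):
    """Summarize whether a dataset has enough rows and sufficiently complete features.

    rows: list of dicts, one per data point.
    required_features: feature names the team believes it needs.
    """
    n_rows = len(rows)
    missing_rate = {}
    for feature in required_features:
        # A value counts as missing if the key is absent, None, or an empty string.
        missing = sum(1 for row in rows if row.get(feature) in (None, ""))
        missing_rate[feature] = missing / n_rows if n_rows else 1.0

    # Features whose missing rate exceeds the (assumed) tolerance are flagged as gaps.
    gaps = [f for f, rate in missing_rate.items() if rate > max_missing_rate]
    enough_rows = n_rows >= min_rows
    return {
        "n_rows": n_rows,
        "missing_rate": missing_rate,
        "gaps": gaps,
        "enough_rows": enough_rows,
        "feasible": enough_rows and not gaps,
    }


# Hypothetical sample records.
records = [
    {"price": 10.0, "region": "north"},
    {"price": None, "region": "south"},
    {"price": 12.5},                      # "region" is absent
    {"price": 11.0, "region": ""},        # "region" is empty
]

report = profile_dataset(records, ["price", "region"], min_rows=3, max_missing_rate=0.3)
print(report["missing_rate"])  # {'price': 0.25, 'region': 0.5}
print(report["gaps"])          # ['region']
print(report["feasible"])      # False
```

A report like this is also something stakeholders can read directly: it shows which features are incomplete before any modeling starts.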
The literature review track involves a lot of reading to find the latest developments in the algorithms, or the types of algorithms, used in the domain.
While the research stage involves studying the domain, the literature review is more specific. Here, data scientists will closely survey the literature that already exists on the problems they are trying to solve. The literature could include research papers, blogs, GitHub repositories, technical reports, etc. On rare occasions, they might come across a problem that has never been looked at before. In such cases, they will learn from similar or adjacent projects.
Literature review is usually done in two stages — breadth-wise first and then depth-wise. The data science team will first understand the different ways and methodologies in which the problem has been tackled so far. Then, from these, they would identify the few relevant to the problem at hand and study them in depth.
In the ideation track, the data science team will take information from the data processes and literature review tracks and use it to brainstorm, discuss, hypothesize and whiteboard. They would also explore new ways to approach the problem. Active involvement of stakeholders and subject matter experts (SMEs) with relevant experience in the domain and industry can bring valuable ideas to the table. The goal here is to maximize the flow of ideas and to look at the problem from many different angles. Once all the ideas are brought together, the team will combine, discard, dwell on and evolve some of them. The ideation track may also include proofs of concept (PoCs).
All of the above activities can be thought of as a data discovery or pre-study phase. Some examples of possible outcomes of this phase could be:
A decision on whether the data is suited to statistical modeling or machine learning — or neither
The problem is solvable, but we need more data, more labeled data or more features. This may require further cost-benefit analysis: is the cost of collecting the data justifiable?
In the case of reinforcement learning, the possible reward functions and their pros and cons
The team has the required information to carry on with the next phase
Combining the outputs of the above three tracks, the data science team will identify possible approaches to solving the problem, prioritize them and finalize the initial approach. For example, when they have a classification problem, they would outline the kinds of models to explore and prioritize them based on experience and the time-effort tradeoff. They would also discuss with relevant stakeholders which features are important and whether any need to be engineered. They would take the top few approaches into the first cycle. You can also think of each cycle as a milestone.
The team would then implement these algorithms and analyze results. They may look beyond just the accuracy of the model to understand input features and their impact on the model (feature importance). They would also perform comparative analysis with past results.
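One common way to estimate feature importance is permutation importance: shuffle one feature at a time and measure how much the model’s score drops. The sketch below is a minimal, standard-library illustration of that idea and is not necessarily the method a given team would use. The names `toy_model`, `accuracy` and `permutation_importance`, the synthetic data, and the parameters `n_repeats` and `seed` are all assumptions made for the example.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model that only looks at the first feature."""
    return 1 if row[0] > 0.5 else 0


# Synthetic data: feature 0 drives the label, feature 1 is noise.
X = [[i / 19, (i * 7) % 5] for i in range(20)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)  # feature 0 gets a positive score, feature 1 gets 0.0
```

Because the toy model ignores the second feature, shuffling it cannot change any prediction, so its importance is exactly zero. With a real model the scores are rarely this clean, which is part of what the team would discuss with stakeholders.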
Based on the results, they would typically go back to the data and ideation tracks, and sometimes to the literature review too. With a deeper understanding of the data and of how the initial ideas and hypotheses have worked, they’d tweak the approach and optimize cyclically. Once they have good results and sufficient confidence in the model, they would integrate it into the tool/product/software.
Before the model is put into production, they would define monitoring metrics to track model performance and model drift, so that adequate adjustments can be made. Based on these metrics, the model would need to be maintained, retrained at identified intervals and deployed again.
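As an illustration of what a drift metric can look like, the sketch below computes the Population Stability Index (PSI), one widely used way to compare the distribution of a feature or model score at training time with what is seen in production. The function name `population_stability_index`, the bin count, the synthetic scores and the 0.2 alert threshold are assumptions for this example; the threshold is a commonly cited rule of thumb, not a fixed standard.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a live sample (actual).

    Bins are equal-width over the range of the reference sample; live values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic scores: live data has shifted towards the upper half of the range.
training_scores = [i / 100 for i in range(100)]
live_scores = [0.5 + i / 200 for i in range(100)]

no_drift = population_stability_index(training_scores, training_scores)
drift = population_stability_index(training_scores, live_scores)
needs_review = drift > 0.2  # assumed alert threshold

print(f"PSI (same data) = {no_drift:.3f}, PSI (shifted data) = {drift:.3f}, "
      f"needs_review = {needs_review}")
```

A check like this could run on a schedule, with a breach of the agreed threshold triggering a review or a retraining cycle.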
I invite the stakeholders of a data science project to look at it not only in terms of end-result accuracy, but as an evolving experiment. Useful information can be obtained at each stage of the project. Both the data science team and the stakeholders are learning together as the team explores the data to uncover the information it may hold. I hope this article helps you understand the kinds of activities involved in such a project and why they matter.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.