Continuous delivery for machine learning
Getting machine learning applications into production is hard
In modern software development, we’ve grown to expect that new software features and enhancements will simply appear incrementally, on any given day. This applies to consumer applications such as mobile, web and desktop apps, as well as to modern enterprise software. We’re no longer tolerant of big, disruptive deployments of software. ThoughtWorks has been a pioneer in Continuous Delivery (CD), a set of principles and practices that improve the throughput of delivering software to production in a safe and reliable way.
As organizations move to become more “data-driven” or “AI-driven”, it’s increasingly important to incorporate data science and data engineering approaches into the software development process to avoid silos that hinder efficient collaboration and alignment. However, this integration also brings new challenges when compared to traditional software development. These include:
A higher number of changing artifacts: Not only do we have to manage the software code artifacts, but also the data sets, the machine learning models, and the parameters and hyperparameters used by those models. All these artifacts have to be managed, versioned and promoted through different stages until they’re deployed to production. This makes it harder to achieve versioning, quality control, reliability, repeatability and auditability in that process.
Size and portability: Training data and machine learning models are usually orders of magnitude larger than the software code. As such, they require different tools that are able to handle them efficiently. These specialized tools impede the use of a single unified format to share those artifacts along the path to production, which can lead to a “throw over the wall” attitude between different teams.
Different skills and working processes in the workforce: Developing machine learning applications requires experts with complementary skills, and they sometimes have conflicting goals, approaches and working processes:
- Data scientists look into the data, extract features and try to find the models that best fit the data to achieve the predictive and prescriptive insights they seek. They prefer a scientific approach, defining hypotheses and verifying or rejecting them based on the data. They need tools for data wrangling, parallel experimentation, rapid prototyping, data visualization, and for training multiple models at scale.
- Developers and machine learning engineers aim for a clear path to incorporate and use the models in a real application or service. They want to ensure that these models run as reliably, securely, efficiently and scalably as possible.
- Data engineers do the work needed to ensure that the right data is always up-to-date and accessible, in the required amount, shape, speed and granularity, with high quality and at minimal cost.
- Business representatives define the outcomes that guide the data scientists’ research and exploration, and the KPIs used to evaluate whether the machine learning system is achieving the desired results at the desired quality levels.
Continuous Delivery for Machine Learning (CD4ML) is the technical approach to solving these challenges, bringing these groups together to develop, deliver, and continuously improve machine learning applications.
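To make the first challenge more concrete, here is a minimal sketch of what versioning code, data, model and parameters together could look like. It fingerprints each artifact and derives a combined release fingerprint from them. The function names (`fingerprint`, `build_manifest`), the manifest layout and the sample byte strings are illustrative assumptions, not part of CD4ML or of any particular tool; a real project would usually rely on dedicated versioning tools rather than hand-rolled hashing.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest that identifies this exact content."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(code: bytes, data: bytes, model: bytes, params: dict) -> dict:
    """Record one fingerprint per artifact, plus one for the combination.

    Hypothetical helper for illustration only.
    """
    # Serialize parameters with sorted keys so key order does not affect the hash.
    params_bytes = json.dumps(params, sort_keys=True).encode("utf-8")
    manifest = {
        "code": fingerprint(code),
        "data": fingerprint(data),
        "model": fingerprint(model),
        "params": fingerprint(params_bytes),
    }
    # The release fingerprint changes whenever any single artifact changes.
    combined = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["release"] = fingerprint(combined)
    return manifest


# Made-up stand-ins for real files; in practice these would be read from disk.
CODE = b"def predict(x): ..."
DATA = b"id,label\n1,0\n2,1\n"
MODEL = b"serialized-model-bytes"

baseline = build_manifest(CODE, DATA, MODEL, {"learning_rate": 0.1, "max_depth": 3})
retuned = build_manifest(CODE, DATA, MODEL, {"learning_rate": 0.05, "max_depth": 3})

print(baseline["release"] == retuned["release"])
```

In this sketch, changing only a hyperparameter leaves the code and data fingerprints unchanged but produces a different release fingerprint, so the exact combination that reached production can be identified later. In practice a retrained model would also change the model fingerprint.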