Brief summary
Join Thoughtworks’ global head of technology Mike Mason and his guests Christoph Windheuser, Thoughtworks’ global lead in artificial intelligence, and Sheroy Marker, head of technology at Thoughtworks Products, as they explore how to choose the right data and AI model to create complex models that are capable of continuously learning.
Podcast Transcript
Mike Mason:
My name is Mike Mason from Thoughtworks. I'm here with my two colleagues, Christoph Windheuser and Sheroy Marker and we're going to talk about continuous intelligence today. So Christoph, why don't you give us a little introduction of yourself and roughly what you do for Thoughtworks.
Christoph Windheuser:
Hello everybody. My name is Christoph Windheuser. I'm based in Germany, but right now I'm actually here in San Francisco in the Thoughtworks office. At Thoughtworks, I'm the global SME for the Intelligent Empowerment offering, so I'm looking after this offering and rolling it out to the different Thoughtworks countries. I'm supporting our sales teams, our CPs, working with them on different pursuits, and I'm also pushing this topic with our capability teams so that we build up these offerings and make it one of the really important offerings for our clients.
Mike Mason:
And Sheroy.
Sheroy Marker:
Hi. I'm Sheroy Marker. I currently work as the Head of Technology for Thoughtworks Products, and one of the products we build is GoCD, which is our continuous delivery server. I'm here to talk about some CD-related aspects of machine learning.
Mike Mason:
Yeah. Today's topic is broadly continuous intelligence. I think everybody realizes the importance of data in today's world: the increasing emphasis on the things that organizations can do with data, the importance of being data-driven as opposed to just the HiPPO model, the highest paid person's opinion, for decision making. More and more we're seeing this drive towards data being a decision-making tool.
Mike Mason:
But more than that, we're also seeing machine learning and other advanced techniques becoming much more powerful tools for building interesting software. Machine learning lets you build things that can better support someone's experience, better anticipate a consumer's needs and provide things to them in a more frictionless manner. With the rise of data and the rise of machine learning, what is the problem that we run into? Christoph.
Christoph Windheuser:
That is actually a very good point. The data today is there, we see it at our clients, they have tons of data. It usually sits in silos or in ERP systems or traditional databases, so it is not easily accessible for data scientists and machine learning; that is one of the problems. After doing some simple proofs of concept they approach us and say, "We want to scale this. We want to do it in a broader way. We want to use it in the whole company and not just in little proofs of concept." And then they suddenly need a data infrastructure, something like a data lake, maybe in a cloud environment or on premise, that the data scientists and the machine learning programs can really work on. That's the first issue.
Christoph Windheuser:
The other problem, and that is something we would like to talk about today in this podcast, is that with tutorials it is kind of easy to train TensorFlow on a problem, on a given data set. Maybe the client even has his own data and trains a decision, for example: this is fraud, this is not fraud. It seems to work, you get some good results in the learning curve. But what then? Because when you train on some data, the data is already outdated; data is continuously changing in our environment. Think of a recommender for a retailer: you can train it on what the clients are buying on the platform, so you can recommend things to other clients who have a similar buying behavior. That works fine, but it is changing all the time. You have different weather, different seasons, different [inaudible 00:03:53] and so on. So we have to continuously retrain this recommender.
Mike Mason:
Even the products being sold would be dynamic, right?
Christoph Windheuser:
Absolutely.
Mike Mason:
New products would be coming in and you wouldn't have any recommendation behavior for them if the model had not been trained on that...
Christoph Windheuser:
Exactly. So what the retailer has to do is retrain. The data is changing, the behavior is changing, so he has to retrain and get new parameters, new weights. But doing this by hand, changing the weights in a production platform and doing the testing, is very cumbersome and very error prone. Thoughtworks actually has very good experience in doing things in a continuous way, what we call continuous delivery and continuous integration, and we have GoCD, for example, as a great tool to do that. That is something we are bringing together, showing the customer how to do this retraining continuously, more or less automatically, with pipelines, with GoCD and so on, in excellent quality.
Sheroy Marker:
So we are talking about continuous intelligence in the context of data engineering pipelines. Typically you would develop machine learning models for inference purposes, train them initially with a training data set and then put them out in production. But the data that you used to train these models is changing rapidly. So how do you improve these machine learning models on a continuous basis? Can we borrow some of the CD concepts that we're used to for making improvements or rolling out changes to software frequently? Are the same concepts still applicable in this context? So Christoph, traditionally continuous delivery was all about making sure that any changes you made to software were tested and then rolled out in a sustainable sort of manner. Do those principles still apply to these machine learning models, or do you also now include things like adding new data to training sets before you actually make any changes to your models on a continuous basis?
Christoph Windheuser:
The principles are the same. Even if you add new data to the training set or you change the training set, that is part of the continuous change. You have new products, as we said for the recommender, so your training set is changing. But the principle of continuous delivery and continuous integration is that you are able to deliver into the production system at any time, in high quality. And that is the same thing we would like to do for machine learning. So not just, "Okay, oh shit, I've trained, now I have to copy the weights somewhere else. I change, I don't know, from Python to Java by hand, somebody reprograms it, it takes me weeks, I have to test again, and then I hit the button and make it productive." No, it should be, again, an automatic process. I should be able at any time to take the stuff I have retrained in the test and development system and put it into the production system.
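(A minimal sketch of what such an automated retraining step might look like, assuming a scikit-learn classifier; the CSV file, feature columns and is_fraud label are hypothetical placeholders.)

```python
# Sketch of an automated retraining step that a pipeline could run on a schedule.
# The file name, feature columns and "is_fraud" label are hypothetical placeholders.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def retrain(data_path: str = "latest_transactions.csv") -> None:
    data = pd.read_csv(data_path)
    X = data.drop(columns=["is_fraud"])
    y = data["is_fraud"]

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)

    # Serialize the retrained model as an artifact for the next pipeline stage.
    joblib.dump(model, "model_candidate.joblib")


if __name__ == "__main__":
    retrain()
```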
Sheroy Marker:
So which aspects of this are on the critical path to production for these recommender systems, or any sort of machine learning algorithms mostly used for inference purposes or data engineering? Would you say all of this is on the critical path to production? And is there any difference between training a model per se and making changes to a model? Are both of those kinds of changes to be considered synonymous, or similar?
Christoph Windheuser:
Changing the training set of course is easier because you're not changing the model. If you change the model, that has a bigger impact.
Mike Mason:
Just for clarity, by changing the model you mean something like moving from, I don't know, I'm not an expert here, but say from linear regression to a random forest, to a neural network of some sort. That would be a model change?
Christoph Windheuser:
Yeah.
Mike Mason:
Whereas a training data change would be the data that you feed into that model in order to set it up ready for production.
Christoph Windheuser:
Yeah. And there are a lot of different levels of what you can change. You can completely change the learning model, as you said. You could also change the architecture of the model: if you have a neural network, a backpropagation network, you could add additional layers, and then you have a different architecture. You could put in convolutional networks or recurrent networks, and you'd change the architecture. You have hyperparameters, like learning rates, which you can change; with hyperparameters you also change the model in some way. Or you just change the trained parameters which came out of the training. So there are different levels of possible changes.
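(To make those levels concrete, here is a small illustrative sketch using scikit-learn; it is not tied to any particular client system, and the class and parameter choices are just examples.)

```python
# Illustrating the levels of change described above, using scikit-learn as an example.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# 1. Changing the learning model entirely: linear model -> random forest -> neural network.
model = LogisticRegression()
model = RandomForestClassifier()
model = MLPClassifier(hidden_layer_sizes=(64,))

# 2. Changing the architecture: add another layer to the neural network.
model = MLPClassifier(hidden_layer_sizes=(64, 32))

# 3. Changing a hyperparameter: a different learning rate for the same architecture.
model = MLPClassifier(hidden_layer_sizes=(64, 32), learning_rate_init=0.0005)

# 4. Changing only the trained parameters (the weights): retrain on new data.
#    X_new and y_new are placeholders for the latest training set.
# model.fit(X_new, y_new)
```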
Sheroy Marker:
So that's interesting. There are various things to consider in terms of what changes a model can undergo. When you talk about changing, say, the learning rate, that seems to be a small incremental change you might want to apply to the model and then test multiple times in an iterative manner to see if the model is actually improving or getting worse. So do you think we now need a specialized set of tooling to enable you to do these sorts of variations on a model and see how it behaves under certain stresses? Is the current set of tooling sufficient for these sorts of purposes, or do you think we need more specialized tooling?
Christoph Windheuser:
That is an excellent question that I would really play back to you, because you are developing these tools like GoCD, and we are in the process of using GoCD in the machine learning environment. We are trying that out and setting up architectures to see how far we can really use, for example, GoCD to do that, and what additional features we would need. For example, one point is really important, and I'm sure GoCD is able to do that: safeguarding the parameters, all of the parameters, so making the training experiments repeatable. We worked with a client who didn't do that. They had their machine learning experts working on their laptops, optimizing the models, optimizing the parameters, everything. Then they said everything was ready and they moved the stuff to production. But when this guy was ill or had an accident, the company couldn't repeat these experiments, because everything was on his laptop or in his brain, and this is something that should not happen.
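(A minimal sketch of the kind of record keeping that avoids the "everything on one laptop" problem: persist the hyperparameters and a fingerprint of the training data alongside the trained model so anyone can repeat the experiment. The synthetic data and file names are placeholders.)

```python
# Persist everything needed to repeat a training run: hyperparameters, data fingerprint, model.
import hashlib
import json

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

params = {"n_estimators": 100, "learning_rate": 0.1, "random_state": 42}
X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real training set

model = GradientBoostingClassifier(**params)
model.fit(X, y)

experiment = {
    "hyperparameters": params,
    "training_data_sha256": hashlib.sha256(X.tobytes()).hexdigest(),
}
with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
joblib.dump(model, "model.joblib")
# Version experiment.json, the data and the model artifact so the run is reproducible by anyone.
```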
Sheroy Marker:
That's an interesting point, because that's also a very core tenet of CD per se. When we look at a CD pipeline and generate artifacts upstream in our pipeline, it's very important that we get the same artifact through the pipeline in a repeatable manner. So if you trigger this pipeline multiple times, it's triggered with exactly the same artifacts, and artifacts don't vary anywhere from the time they were created to the final stages in the path to production. Similar concepts could certainly be applied to the parameters that go into training a machine learning model. I do think there are some opportunities for specialized tooling when it comes to the ways in which machine learning models are tested and retested before they're certified to be good, because that's a pattern that's very specific to machine learning.
Mike Mason:
But I think some of that also comes down to the nature of the data platform that you might have in house for managing this stuff. I read about, I think it's the Uber Michelangelo platform, and the interesting thing about that is that it's a tool at the Uber scale of data, which is obviously a lot of bits and bytes, but the platform is managing datasets in a kind of self-service way for your data scientists to be able to get the right data set, run models against it and all of that kind of thing, but then also to be able to take the output from those models and move it into production.
Mike Mason:
So I guess the question would be, what is the Michelangelo platform for the rest of us who are not at Uber, and is there something in between the magical laptop for the data scientist and doing every single model tweak in your CD pipeline? I think that also would not be a recommended solution, because that's not really what you do. You still do some experimental work, but the intent would be to do it in an environment that tracks the experiments you're doing so you can reproduce them.
Christoph Windheuser:
Yeah, that is actually the problem. You have different environments. You need an environment for your data scientists; they feel at home on their laptops with their Python notebooks, for example. They love R or Python, but these languages are not really well suited for high-performance, highly scalable production environments. We are not using Michelangelo, that's proprietary to Uber, but what we have used in several of our projects is the H2O environment, you might know it. It has excellent learning models included, so random forests, also neural networks, and what is really nice is that it can do an automatic migration or translation from Python to Java. That was something we used at AutoScout, which is a client in Europe, a used-car internet platform, and we worked on the client's price estimation engine, the machine learning algorithm to estimate prices for cars.
Christoph Windheuser:
The data scientists did this in the H2O environment. Then we translated that automatically into Java, a J-A-R, a JAR file. And then we used the pipelines in GoCD to test it, to test the results of our version: put a test data set into the algorithm, see the result, make the transition to Java, do the same test again and then compare the results. Only if the results are exactly the same do we know the translation was correct and the parameters have been correctly moved over to the production system. Then we could give it the green light and put it automatically into the production system.
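(The comparison step can be as simple as running the same test set through both versions and failing the pipeline if the outputs differ. A sketch, where the two prediction files are hypothetical outputs of the Python model and the exported Java model from earlier pipeline stages:)

```python
# Compare predictions from the original model and its Java translation on the same test set.
# Both CSV files are hypothetical outputs produced by earlier pipeline stages.
import sys

import numpy as np
import pandas as pd

python_preds = pd.read_csv("predictions_python.csv")["price"].to_numpy()
java_preds = pd.read_csv("predictions_java.csv")["price"].to_numpy()

if np.allclose(python_preds, java_preds, rtol=0, atol=1e-9):
    print("Translation verified: predictions match.")
else:
    print("Mismatch between Python and Java model outputs.")
    sys.exit(1)  # a non-zero exit fails the pipeline stage and blocks promotion
```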
Mike Mason:
So you mentioned testing there, and you were talking about testing the translation between the two languages for the model. But I'm actually curious about testing machine learning in general. ML models are often accused of being black boxes. People might ask, "Well, why was I not approved for my mortgage?" And unfortunately the model can only tell you, "Because I added up these two numbers and it was less than 0.4, so you don't get approved." Which isn't particularly comforting to me as someone who's seeking a mortgage. But the general question is: we're trying to provide high quality, but we have this slight black-box element to what we're doing. How do we test machine learning models?
Christoph Windheuser:
Usually you test the performance of machine learning algorithms with dedicated test sets. You usually have a training set, and of course you're not testing on the training set, because then you're not testing the generalization ability of the machine learning algorithm. So you have an extra set which the algorithm hasn't seen during training, it's new, and then you test the performance on that. This gives you some idea of how good your training has been and how good the performance of your algorithm is.
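(In code, that evaluation on unseen data is just a held-out split, something like the sketch below; synthetic data stands in for a real data set.)

```python
# Hold out a test set the model never sees during training, then measure generalization on it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)  # stand-in for real data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on unseen test data:", accuracy_score(y_test, model.predict(X_test)))
```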
Mike Mason:
So you might be able to use, say, a GoCD pipeline stage to run that testing against previously unseen test data and say it needs to be at least this good by some quality measure.
Christoph Windheuser:
Exactly.
Sheroy Marker:
By some threshold of some sort.
Christoph Windheuser:
Yeah. That is something that you can set up with GoCD: you automate this testing, you get the result, you compare the result with the threshold, and this then gives you a green light to transport this stuff to the production system.
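(As a sketch, that gate can be an ordinary script run as a pipeline task: it loads the candidate model and the held-out test set, and a non-zero exit code fails the stage when the score is below the agreed threshold. The file names, label column and threshold value are placeholders.)

```python
# Quality gate: fail the pipeline stage if the candidate model scores below the threshold.
import sys

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

THRESHOLD = 0.90  # agreed minimum quality; a placeholder value

model = joblib.load("model_candidate.joblib")
test_set = pd.read_csv("holdout_test_set.csv")
X_test = test_set.drop(columns=["label"])
y_test = test_set["label"]

score = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {score:.3f}")
sys.exit(0 if score >= THRESHOLD else 1)  # green light only above the threshold
```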
Mike Mason:
But then doesn't that run into the same issue as we had in the first place, which is that production data is continually evolving and new things will happen all the time? Doesn't that mean that we then have to evolve the test data, or is that part of it?
Christoph Windheuser:
Yeah, that is part of it. You have to evolve both the training data and the test data. So what you do is, you have a big bunch of data and you take some part away, maybe 20%, 10%, it depends. You don't show that part to the machine during training, but use it just for your own testing.
Mike Mason:
And that data set that you're showing to the machine, that must also be dynamic.
Christoph Windheuser:
Exactly.
Mike Mason:
Is that coming from yesterday's production dump or something like that?
Christoph Windheuser:
Yeah. We have to update this as well. Otherwise you would test against old data, which is not what you want.
Sheroy Marker:
And so how do some of these concepts map to unsupervised learning? It seems like a lot of the concepts we talked about were around supervised learning, fixed training sets and test sets and stuff like that. How does it work with unsupervised learning?
Christoph Windheuser:
To be honest, we don't have that many applications with clients for unsupervised learning. Unsupervised learning is something where you can get statistical knowledge out of data. You use unsupervised learning, for example, for dimension reduction: when you have data with big vectors and big data sets and you want to reduce that to smaller dimensionalities so that it's easier to handle, you can do this with unsupervised learning. Where unsupervised learning can also play a role is in the financial area with fraud detection, because you do not know exactly what the fraud is. Maybe you know some frauds from history; in your training set you might have seen, "Okay, this is a fraud." But usually fraud is something which is weird in some way, which is different, and that is something you can find with unsupervised learning, because it's just different from the others.
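(Both of those uses can be sketched with standard tooling, for example PCA for dimension reduction and an isolation forest flagging transactions that look different from the others; the synthetic data below stands in for real transactions.)

```python
# Unsupervised examples: reduce dimensionality with PCA, flag outliers with an isolation forest.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
transactions = rng.normal(size=(1000, 50))  # stand-in for high-dimensional transaction vectors

# Dimension reduction: 50 features down to 10, without any labels.
reduced = PCA(n_components=10).fit_transform(transactions)

# Anomaly detection: the model learns what "normal" looks like and flags the rest as -1.
detector = IsolationForest(contamination=0.01, random_state=0).fit(reduced)
flags = detector.predict(reduced)
print("suspicious transactions:", int((flags == -1).sum()))
```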
Mike Mason:
You gave an example earlier of AutoScout, the car retail company. I can give an example from a financial services company we worked with. They had a system where it would take up to six months to get a new machine learning model deployed into production; fraud detection is actually the topic area there as well. What would happen is their data scientists would work with their data and, over the course of months, validate the new model. Validation was a very slow process with lots of sign-offs and so on, and then they would put it into production. But you can imagine that a six-month-old model is actually not that great for keeping up with fraudsters.
Mike Mason:
So the team built an interesting piece of platform, which we've actually released as open source software, that allows you to easily promote machine learning models through a deployment pipeline and into production, and it lets you run multiple models in production. They have a concept of the current master model, which is the one that's actually making live decisions about fraud or not fraud, but then you can have other models running against the same production data, and if they start to perform better, you can even have the platform switch models and start running those instead. So I thought that was a good example of the continuous intelligence concept.
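(The master/challenger idea can be sketched very simply: score every candidate against the same recent, labelled production data and promote whichever does best. This is only an illustration of the concept, not the open source platform mentioned above; the file and column names are hypothetical.)

```python
# Champion/challenger sketch: score all models on the same recent production data,
# and promote the challenger only if it beats the current master model.
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

recent = pd.read_csv("recent_labelled_transactions.csv")  # hypothetical labelled production data
X, y = recent.drop(columns=["is_fraud"]), recent["is_fraud"]

models = {
    "master": joblib.load("model_master.joblib"),
    "challenger": joblib.load("model_challenger.joblib"),
}
scores = {name: roc_auc_score(y, m.predict_proba(X)[:, 1]) for name, m in models.items()}
print(scores)

if scores["challenger"] > scores["master"]:
    print("Challenger outperforms master; switch live decisions to the challenger.")
```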
Christoph Windheuser:
I actually have a question for Sheroy. You and the development group for GoCD, do you get a lot of requests from Thoughtworkers about how to use GoCD in a machine learning environment, and are you planning to develop new features in that direction?
Sheroy Marker:
I think we've seen GoCD used quite substantially with data engineering pipelines in one instance. We've also seen other CD tools used in conjunction with data engineering pipeline tooling, but so far they've mostly been used as workflow orchestrators to build and deploy machine learning models into data engineering pipelines. There hasn't been a layer of abstraction on top of that workflow orchestration capability that additionally helps you with training or retraining models. So that is something that's on our backlog that we will dig into at some point, to see if that's tooling we should start building.
Christoph Windheuser:
Yeah, I think those will become really, really important features. Maybe one hint for those who are interested: Thoughtworks will be present at the World Summit AI, which is a big AI event in Amsterdam in October, actually the 10th and 11th of October. We will run a workshop there, a hands-on workshop on intelligent empowerment. At the moment we are building up an infrastructure with a GoCD server, a machine learning environment and tasks, and we will show live, on the laptops of the participants, how to change the training set, how to transport that through the pipelines, make the tests show a green light and then put it into a production environment.
Mike Mason:
So I'd like to thank both Christoph and Sheroy for joining me on today's podcast. And if you are interested in continuous intelligence, please look it up, have a look at thoughtworks.com and you can find out more there.
Mike Mason:
Thanks very much everyone.
Christoph Windheuser:
Thank you very much for listening. Bye bye.