I am part of a project where we run a pretty big CI build system. We had been facing a few issues with it, and we wanted to do some work around build monitoring to improve the system as a whole. The next couple of posts (written in collaboration with Rohith Rajagopal) will talk about the problems and the approaches we considered. In this post we will describe our build system, the tool we use, build systems in general, and approaches towards monitoring a build system.
Most build systems typically consist of Compile -> Unit Tests -> Integration Tests -> Functional Tests (+ Smoke) -> Artifacts, or CUIFA (I just coined this!). The artifacts are then deployed onto QA environments. We follow something very similar.
The build tool we use is Thoughtworks' Go, which allows you to define your CI setup as a series of pipelines. Each pipeline can have multiple stages that run serially. Within each stage, you can define multiple jobs, which all run in parallel. So, to map CUIFA onto this setup, you would define each of the CUIFA phases as a stage of one build pipeline. If there are multiple modules in your code, you would then define each of them as a job in one of the stages. The sketch below might help you map this.
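As a rough illustration, here is the hierarchy as a Python data structure. This is just a sketch, not Go's actual configuration format, and all pipeline, stage, and job names here are hypothetical:

```python
# Hypothetical mapping of CUIFA onto Go's pipeline > stage > job hierarchy.
# Stages run serially; jobs within a stage run in parallel on free agents.
pipeline = {
    "name": "my-app",  # hypothetical pipeline name
    "stages": [
        {"name": "compile",           "jobs": ["module-a", "module-b"]},
        {"name": "unit-tests",        "jobs": ["module-a", "module-b"]},
        {"name": "integration-tests", "jobs": ["module-a", "module-b"]},
        {"name": "functional-tests",  "jobs": ["smoke", "regression"]},
        {"name": "artifacts",         "jobs": ["package"]},
    ],
}
```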
So, getting back to our situation: at any time, we have 3 branches to monitor (the branch that is in production, the branch that is about to be released, and the branch that is in development), with each branch roughly consisting of 10-15 pipelines. The jobs across these 30-40 pipelines run on a build farm consisting of approximately 140 virtual machines (VMs), or Agents, as they are called. (Note: they have nothing to do with James Bond! Bad joke.) On average, a particular pipeline has anywhere between 5 and 40 compile/test jobs that need to run in parallel.
Without churning out metrics and monitoring various aspects of the build, managing these pipelines was a full-time job for the build team: every day we would click through different pipelines and jobs, trying to get them to go green. This approach isn't completely bad, but it means missing out on a lot of metrics and might fail to provide the right insight into your build system. As a result, we've had situations where:
Along with taking care of the situations above and making the system more predictive in nature, there were two other kinds of problems that we wanted to solve:
At a high level, irrespective of the build system, we found these steps to be really helpful:
What should the granularity be? Depending on the size of your build system, you'll have to decide on the granularity. Suppose your build system has just 2 jobs, one compile job and one job to run all the tests: you might want to check how much time each task takes (assuming you have defined your tasks in Ant/NAnt) and how much time each test takes. In this case, maybe an extra 10 seconds on a test indicates "bad code". To implement this, you might have to write scripts or use plugins to emit times per test. In our case, we had a much wider range to play with, so we didn't have to measure how much time every test takes. We decided to monitor at the job level, where every job would typically run the compile and all the unit tests. We started to emit these job metrics using a "push" mechanism: the build server pushed metrics to the monitoring tool, roughly as in the sketch below.
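For concreteness, here is a minimal Python sketch of such a push, assuming a Graphite-style plaintext listener on the monitoring side; the host, port, and metric path are hypothetical, and we cover the actual tools we used in Part 2:

```python
import socket
import time

def push_job_metric(job_name, duration_secs,
                    host="metrics.example.com", port=2003):
    """Push one job's duration to a Graphite-style plaintext listener.

    The plaintext protocol is one newline-terminated line per data point:
    '<metric.path> <value> <unix-timestamp>'.
    """
    line = "build.jobs.%s.duration %d %d\n" % (
        job_name, duration_secs, int(time.time()))
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# e.g. invoked from a post-build task once a job finishes:
# push_job_metric("app.compile-and-unit-tests", 412)
```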
How often would you want to emit metrics? If your build lasts less than 10 minutes and you have readily available plugins, then you can emit metrics as soon as the pipeline goes green. In our case, we found that a particular pipeline, and consequently all the jobs in that pipeline, runs at most 2-3 times per hour. We had a couple of options.
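Purely as an illustration of one possible approach (not necessarily one of the options we actually weighed, which we get into in Part 2), a low-frequency poller could walk Go's pipeline-history API and emit a metric per new run. The server URL and credentials below are hypothetical, and the endpoint path and JSON field names should be verified against your Go server's API documentation:

```python
import time
import requests

GO_SERVER = "https://go.example.com"  # hypothetical Go server URL

def poll_and_emit(pipeline, interval_secs=20 * 60, auth=("user", "secret")):
    """Poll the pipeline's run history and emit metrics for unseen runs.

    Since a pipeline runs at most 2-3 times an hour, polling every
    20 minutes catches every run without hammering the Go server.
    """
    seen = set()
    while True:
        resp = requests.get(
            "%s/go/api/pipelines/%s/history" % (GO_SERVER, pipeline),
            auth=auth, timeout=10)
        resp.raise_for_status()
        for run in resp.json().get("pipelines", []):
            if run["counter"] not in seen:
                seen.add(run["counter"])
                # emit per-job durations here, e.g. with
                # push_job_metric() from the earlier sketch
        time.sleep(interval_secs)
```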
We followed the above 4 steps as part of our attempt to define the monitoring system. Check out Part 2, where we discuss the exact tools and workflows we used for our build system.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.