This is part 3 of a series. Read part 1 and part 2.
The Thoughtworks Digital team has recently completed a journey to change our infrastructure to use the Phoenix Server Pattern. Many lessons were learned and you can read about them in the previous articles in this series: Introduction to Moving to the Phoenix Server Pattern and Things to Consider before diving in.
Irrespective of how mature your development process is, or how thoroughly an application is tested, disasters or incidents will always strike. There are just too many factors that are beyond your control: outages of third-party providers, security incidents, natural disasters hitting the data centres where your application is hosted - to name a few.
The natural impulse would be to make sure that problems with your system never happen and focus on increasing the mean time between failures. While it is important to take pre-emptive steps, it is also worth investing time to ensure that you can cope with problems when they happen. Instead of focusing all of your efforts on preventing an incident, a better strategy is to accept that it might occur and focus your efforts instead on decreasing your mean time to recovery.
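To make the trade-off concrete, here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The figures are hypothetical and chosen only for illustration; they are not measurements from thoughtworks.com.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up, given the mean time between
    failures (MTBF) and the mean time to recovery (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 30 days on average.
mtbf = 30 * 24.0

slow_recovery = availability(mtbf, mttr_hours=4.0)     # four-hour recovery
fast_recovery = availability(mtbf, mttr_hours=3 / 60)  # three-minute recovery

print(f"4 h recovery:   {slow_recovery:.5f}")
print(f"3 min recovery: {fast_recovery:.5f}")
```

With the failure rate held constant, shortening the recovery time is what moves the availability figure.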
Today, most businesses offer online services, and downtime of those applications costs money and reputation. Reducing the time that your services are unavailable makes a big difference to the impact on your business. It’s important to devise a robust disaster and incident recovery strategy, and this article shares some experiences from the Thoughtworks Digital team to help you do that.
We have faced situations in which we’ve had to resort to a ‘plan B’. These were scenarios that resulted in, or came close to causing, downtime on thoughtworks.com. Examples of what we’ve experienced range from DoS attacks and memory leaks to the regular application of security updates.
In each of these situations, we’ve had an appropriate alerting system in place to notify us of the problem in real time. We also spotted issues before they resulted in downtime, via monitoring tools.
After a number of incidents were caused by our snowflake servers, the Thoughtworks Digital team adopted the Phoenix Server Pattern. One important step in our transition was to find a suitable incident recovery strategy that would allow us to reduce the mean time to recovery.
Using snowflake servers, our previous recovery mechanism was simple. We had a cluster of disaster recovery snowflake servers that replicated the production ones. These would be updated regularly with the previous stable version deployed to production. The release deployed to the disaster recovery machines was subject to regular testing and monitoring. In case of an incident or a disaster on thoughtworks.com, we would add the disaster recovery servers into the production load balancers and remove the faulty nodes. On average, this process would take less than three minutes.
While this strategy was simple and quick, one disadvantage was that it was hard to keep the disaster recovery servers in sync with the more recent changes in our production environment. As we practise continuous delivery, we deploy changes to the live site several times a week. This meant that the recovery servers would introduce a regression in the features delivered, while the problem was being fixed.
We were aware of the disadvantages of our incident recovery plan, but our focus was on making sure that incidents did not happen frequently. Minimising downtime on thoughtworks.com was more important than the drawbacks of our strategy.
Updating the recovery servers even more frequently would have been an easy fix to the problem of feature regression. But even with this, a more fundamental disadvantage remained - the disaster recovery servers were ‘unique’ snowflakes too and therefore, in time, had slightly drifted apart from each other and from the production servers.
After radically changing our infrastructure model to the Phoenix pattern, we had to devise a new incident recovery plan to fit with it. One very important question that was raised was - how many releases do we maintain in parallel with the current release, for use as disaster or incident recovery?
When a new code release causes a major issue in the production environment (one that has not been spotted by tests in the release pipeline), it is likely to be noticed soon after the changes go live. In this case, we want to go back to one of the previous stable versions.
If the incident is instead caused by a server failing due to misconfiguration or other issues not related to the application code, then the same version of the code can be deployed to a new server.
One solution that addresses our question is to separate server creation from deployment, using two different continuous integration pipelines: one for creating an application server pool and one for deploying a pool to the load balancer.
Our first pipeline creates a pool of production servers and deploys the necessary application code. The next stage of this pipeline is to run smoke tests on the newly created instances. All of the other tests are run on the pre-production pipelines. Therefore, a production server will be created only from a configuration that has passed all the tests.
The servers are not added to the load balancer in this pipeline. As a result, we can create as many clusters of production servers as we want, with different code changes. The servers are versioned using the pipeline label.
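The first pipeline can be pictured roughly as follows. This is a minimal in-memory sketch, not our actual pipeline code: `ServerPool`, `PoolRegistry` and the `smoke_test` callable are hypothetical names, and a real implementation would call the cloud provider's API instead of storing hostnames in a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServerPool:
    label: str          # pipeline label, used as the pool's version
    servers: List[str]  # hostnames of the instances in this pool


class PoolRegistry:
    """Tracks every pool that has been created, keyed by pipeline label."""

    def __init__(self) -> None:
        self.pools: Dict[str, ServerPool] = {}

    def create_pool(
        self,
        label: str,
        hostnames: List[str],
        smoke_test: Callable[[str], bool],
    ) -> ServerPool:
        # Run the smoke test against every new instance before the pool
        # is registered; a single failure rejects the whole pool.
        failed = [host for host in hostnames if not smoke_test(host)]
        if failed:
            raise RuntimeError(f"smoke tests failed on {failed}")
        pool = ServerPool(label=label, servers=list(hostnames))
        # Note: nothing here touches the load balancer. The pool only
        # becomes available for a later, separate deployment step.
        self.pools[label] = pool
        return pool


registry = PoolRegistry()
registry.create_pool("41", ["app-41-a", "app-41-b"], smoke_test=lambda host: True)
registry.create_pool("42", ["app-42-a", "app-42-b"], smoke_test=lambda host: True)
```

Because creation never changes what is serving traffic, several pools with different code changes can exist side by side.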
A final, manual step of this pipeline deletes the servers of previous releases. When triggered, it deletes the pool of servers corresponding to its label. In this way, we keep only the instances that we need. This ‘smart delete’ step checks that the servers it is about to destroy are not currently serving traffic in the load balancer. This removes the risk of accidentally destroying servers that are in use and causing another incident.
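The safety check in the ‘smart delete’ step could look something like the sketch below. The function name, the `pools` dictionary and the `in_rotation` set are assumptions made for illustration; in practice the set of live servers would be queried from the load balancer.

```python
from typing import Dict, List, Set


def smart_delete(
    label: str,
    pools: Dict[str, List[str]],
    in_rotation: Set[str],
) -> List[str]:
    """Delete the pool for `label` unless any of its servers is still
    behind the load balancer. Returns the servers that were deleted."""
    live = [server for server in pools[label] if server in in_rotation]
    if live:
        # Refuse outright rather than deleting part of the pool.
        raise RuntimeError(
            f"refusing to delete pool {label}: {live} still serving traffic"
        )
    return pools.pop(label)


pools = {"41": ["app-41-a", "app-41-b"], "42": ["app-42-a", "app-42-b"]}
in_rotation = {"app-42-a", "app-42-b"}  # release 42 is live

deleted = smart_delete("41", pools, in_rotation)  # allowed: 41 is idle
```

Calling `smart_delete("42", pools, in_rotation)` here would raise instead of destroying the servers that are in use.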
The second production pipeline is solely in charge of deploying production servers behind the load balancers. With this pipeline, any of the previously created pools of servers can be deployed, as long as they still exist. The deployment script adds the new servers to the load balancers and then removes the old servers, without destroying them.
When an incident occurs, this pipeline replaces the servers behind the load balancer with a previous known-good release.
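As a rough illustration of the second pipeline, the sketch below swaps the pool behind a load balancer by adding the target servers first and removing the others afterwards. `LoadBalancer` and `switch_pool` are hypothetical stand-ins, not a real load balancer API.

```python
from typing import Dict, Iterable, List, Set


class LoadBalancer:
    """Hypothetical in-memory stand-in for a real load balancer."""

    def __init__(self, nodes: Iterable[str] = ()) -> None:
        self.nodes: Set[str] = set(nodes)

    def add(self, node: str) -> None:
        self.nodes.add(node)

    def remove(self, node: str) -> None:
        self.nodes.discard(node)


def switch_pool(
    lb: LoadBalancer, pools: Dict[str, List[str]], target_label: str
) -> List[str]:
    """Put the pool for `target_label` behind the load balancer and take
    every other server out of rotation. Removed servers are NOT destroyed,
    so their pools stay available for a later switch."""
    if target_label not in pools:
        raise KeyError(f"pool {target_label} no longer exists")
    new_nodes = set(pools[target_label])
    old_nodes = sorted(lb.nodes - new_nodes)
    for node in sorted(new_nodes):  # add first, so capacity never drops
        lb.add(node)
    for node in old_nodes:          # then drain the previous pool
        lb.remove(node)
    return old_nodes


pools = {"41": ["app-41-a", "app-41-b"], "42": ["app-42-a", "app-42-b"]}
lb = LoadBalancer(pools["42"])        # release 42 is live and misbehaving

removed = switch_pool(lb, pools, "41")  # fall back to known-good release 41
```

Deploying a new release and recovering from an incident are the same operation here; only the target label differs.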
The separation of server creation and deployment to the load balancer is the essence of our incident recovery strategy. This is very similar to blue-green deployment, where we have several pools of production servers and we can switch between them in a single step.
As the recovery strategy relies on the continuous integration server and pipelines, it is very important to make sure that we can recover in case this system fails. One way we ensure resilience of the CI servers and pipelines is to have all the configuration management in code, so that the support servers are reproducible. The pipeline configuration is also backed up to a repository every time we make a change to it.
It takes less than three minutes to switch between versions of our application. The greatest advantages of this strategy are the simplicity of the approach and the short time it takes to recover. While three minutes to recover is acceptable in our case, that might not hold for all businesses. Therefore, the time to restore service is a requirement that needs to be discussed when developing an incident recovery strategy.
Another advantage is that we can easily roll back to any release that is still available, because we do not automatically delete the older server pools. Depending on how many of those versions we have kept, we can switch back to any of them, if required.
It is worth noting that the strategy is suitable for both incidents (e.g. a server being compromised by an attacker) and disasters (e.g. network failures taking out production servers).
One obvious downside to this approach is the price of keeping so many redundant servers. Depending on the prices charged by the cloud provider, you can incur a hefty bill if unused instances are not deleted on a regular basis. For our production servers, we keep 2-3 versions behind the current one.
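A retention rule of this kind can be expressed in a few lines. The sketch below is only an illustration: the function name and the `keep_behind` parameter are assumptions, and it only selects candidates, leaving the actual deletion to a separate, checked step.

```python
from typing import List


def pools_to_delete(
    labels_oldest_first: List[str], current_label: str, keep_behind: int = 3
) -> List[str]:
    """Return the labels of pools that are further behind the current
    release than `keep_behind` versions. Pools newer than the current
    release are never selected."""
    current_index = labels_oldest_first.index(current_label)
    older = labels_oldest_first[:current_index]
    if keep_behind <= 0:
        return older
    # Keep the `keep_behind` most recent older pools, select the rest.
    return older[:-keep_behind]


labels = ["38", "39", "40", "41", "42"]
stale = pools_to_delete(labels, current_label="42", keep_behind=3)
```

Running a check like this regularly keeps the number of idle instances, and therefore the bill, bounded.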
Instead of being deleted, the servers could be shut down. Their boot time would then be added to our recovery time, but this is negligible. This would mitigate the cost of keeping a few unused instances for incident recovery. However, not all cloud providers offer the option of shutting down servers. Due to the constraints imposed by our cloud provider, the only option for us is to delete them.
Evolving a recovery strategy poses many challenges, from both a technical and a business perspective. It is also a continuous effort, as both the application and the business requirements keep evolving. To minimise the likelihood of getting that dreaded middle-of-the-night support call, you’ll need to devise a robust and well-tested plan that is suitable for the services you offer and the requirements imposed by your service level agreement.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.