At many organizations where I’ve worked over the last few years, I’ve seen a common anti-pattern: configuration management (CM) tools used incorrectly as provisioning tools. This has been frustrating because using CM tools to provision infrastructure tends to produce complex code that is hard to maintain and hard to extend.
Let’s take a look at the pitfalls of misusing CM tools for provisioning and why provisioning tools are better equipped for this purpose.
What is Configuration Management?
Configuration management tools are used to repeatably and consistently manage the configuration of systems and the services that they provide entirely through code. Many of them achieve this in three ways: an intuitive command line interface, a lightweight and easily readable domain-specific language (DSL) and a comprehensive REST-based API that lowers the barrier to entry for integrations with other tools. Chef, Ansible, Puppet and SaltStack are popular, open-source examples of these tools.
I've seen many companies use these tools to create and modify, or provision, new infrastructure and configure it afterwards. In theory, this seems like a job that these tools are well suited for given their advantages. However, my experience has shown that what actually happens is that much more code is written in order to take small edge cases into account. If I had to sum up what the resulting code often looks like, it would be something like this:
Let me explain.
Complexity in simplicity
CM code written in a particular tool’s DSL is meant to be simple. Copying and modifying files, installing packages or setting environment variables are examples of configurations that are straightforward to implement in any popular CM tool. However, many of the details that go into provisioning new machines (number of servers, metadata, disks associated with those servers, etc.) increase the complexity quickly.
One of the more attractive features of modern CM tools is their ability to express complex system and application configurations in an easily readable way. This works well for describing how applications should be installed on a system.
Let’s use an example to see what I mean here.
You are responsible for a team that maintains a web-based application. The web servers in your environment run on nginx. Your team currently configures these servers by hand, and you would like to use a CM tool, like Ansible, to automate this process entirely with code and source control.
Creating an Ansible playbook to configure these nginx instances enables your engineers to do this. The process is straightforward: create a “playbook” that contains “plays,” each of which runs tasks that do things like install nginx from yum or apt, copy configuration to the host and start the nginx service. All of these are expressed in human-readable YAML that makes it easy to see what is happening and the order in which it happens. An example of this is shown here.
This simplicity begins to fall apart when one tries to provision the instances on top of which nginx will be running. It usually starts off simple: create a playbook that fetches SSH keys from a backing store and uses them to deploy a static number of instances into a single region within AWS or your private cloud. All of the popular CM tools have modules or resources available for doing things in various cloud providers. But what happens when you want to deploy into multiple regions with different IDs and, potentially, different keys? What about when you need to perform modifications to the operating system after the machine is already created? Let’s say you want to use properties from this playbook to deploy a set of database servers and load balancers with them. How do you relate them?
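To make the shape of such a playbook concrete, here is a toy model in Python. It is not Ansible code and does not use any Ansible API; the `Host` class, the `ensure_*` functions and `NGINX_PLAYBOOK` are all hypothetical names invented for this sketch. It only illustrates the two properties described above: steps run in a readable, fixed order, and each step changes the host only when the host is not already in the wanted state.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """In-memory stand-in for a server; a real tool would reach it over SSH."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)
    running: set = field(default_factory=set)


def ensure_package(host, name):
    """Install a package only if it is missing. Returns True if the host changed."""
    if name in host.packages:
        return False
    host.packages.add(name)
    return True


def ensure_file(host, path, content):
    """Write a file only if its content differs from what is wanted."""
    if host.files.get(path) == content:
        return False
    host.files[path] = content
    return True


def ensure_service_started(host, name):
    """Start a service only if it is not already running."""
    if name in host.running:
        return False
    host.running.add(name)
    return True


# Hypothetical "playbook": an ordered list of (description, step, arguments).
NGINX_PLAYBOOK = [
    ("install nginx", ensure_package, ("nginx",)),
    ("copy nginx.conf", ensure_file, ("/etc/nginx/nginx.conf", "worker_processes auto;\n")),
    ("start nginx", ensure_service_started, ("nginx",)),
]


def run_playbook(host, playbook):
    """Run every step in order and report which ones changed the host."""
    return [(desc, step(host, *args)) for desc, step, args in playbook]


if __name__ == "__main__":
    web = Host()
    print(run_playbook(web, NGINX_PLAYBOOK))  # first run: every step reports a change
    print(run_playbook(web, NGINX_PLAYBOOK))  # second run: nothing left to change
```

Running the same list twice is safe because every step checks before it acts, which is the kind of work CM tools handle well.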
What I've usually seen happen is that custom modules get created to fill in these gaps. These modules might do things like define an Amazon EC2 instance template or OpenStack Cinder volume. This isn't so bad on its own, but the devil in these details is in controlling dependencies between them. Telling Chef, for example, to create a set of volumes with one resource call and then bind them all to one instance isn't impossible, but attempting to express it without writing “real” Ruby is quite challenging.
This may not seem problematic to those who know Ruby (or whichever language the CM tool is based on) well enough to put this together, but there are hidden costs: stricter testing requirements, reduced readability for engineers less familiar with the language, more complexity and more fragility.
Provisioning tools abstract away many of these details. Expressing relationships between disparate pieces of infrastructure with something like CloudFormation or Terraform is quite straightforward, and displaying a graphical representation of one’s current stack is similarly easy.
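The idea of declaring relationships and letting the tool work out the order can be sketched with Python's standard `graphlib` module. This is only an illustration of dependency ordering, not how Terraform or CloudFormation are implemented; the resource names in `STACK` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each resource maps to the set of resources it depends on.
STACK = {
    "volume_a": set(),
    "volume_b": set(),
    "web_instance": {"volume_a", "volume_b"},
    "load_balancer": {"web_instance"},
    "dns_record": {"load_balancer"},
}


def creation_order(stack):
    """Return an order in which every resource comes after its dependencies."""
    return list(TopologicalSorter(stack).static_order())


def deletion_order(stack):
    """Tear down in reverse, so nothing is removed while something still uses it."""
    return list(reversed(creation_order(stack)))


if __name__ == "__main__":
    print(creation_order(STACK))
    print(deletion_order(STACK))
```

The author of the stack only states what depends on what. The ordering, including "create both volumes before binding them to the instance", falls out of the graph instead of being hand-written in Ruby or another general-purpose language.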
Maintaining state is important
There is also a major cost to consider in this approach: maintaining state.
Continuing with our web application example, let’s say that traffic to your site has spiked recently. To accommodate this demand, you would like to have three database servers in your environment instead of two. (We will assume that this environment does not currently have auto-scaling built in.) Configuration management tools like Chef or Puppet are designed to modify existing infrastructure, not to keep a history of past configurations. To add one database server, you would therefore need to write code that works out the current state of your environment before taking any action against it. Otherwise, your cookbook will deploy three more database servers instead of just the one additional server that you needed.
Provisioning tools like Terraform and AWS CloudFormation have logic built into them to prevent this problem from happening without needing to write this code yourself. They can show you what your infrastructure currently looks like and, in the case of Terraform, what it will look like after your desired changes take effect. This way, you can know exactly what is going to happen before it actually happens.
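The difference between a stateless run and a state-aware plan can be shown with a small Python sketch. It is a simplified model under stated assumptions, not the logic of Terraform or CloudFormation: servers are plain strings, and the function names `stateless_apply`, `plan` and `apply` are hypothetical.

```python
def stateless_apply(existing, count):
    """Create `count` servers without checking what already exists."""
    start = len(existing)
    return existing + [f"db-{start + i + 1}" for i in range(count)]


def plan(recorded, desired_count):
    """Compare recorded state with the desired count and say what would change."""
    diff = desired_count - len(recorded)
    if diff > 0:
        start = len(recorded)
        return {"create": [f"db-{start + i + 1}" for i in range(diff)], "destroy": []}
    if diff < 0:
        return {"create": [], "destroy": recorded[desired_count:]}
    return {"create": [], "destroy": []}


def apply(recorded, changes):
    """Carry out a plan and return the new recorded state."""
    kept = [name for name in recorded if name not in changes["destroy"]]
    return kept + changes["create"]


if __name__ == "__main__":
    current = ["db-1", "db-2"]
    print(stateless_apply(current, 3))  # five servers: two existing plus three new
    changes = plan(current, 3)
    print(changes)                      # only db-3 is scheduled for creation
    print(apply(current, changes))      # three servers in total
```

The `plan` step is what lets you review the outcome before anything is created, and running it again after `apply` yields no further changes.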
Rollbacks are hard too
Now, imagine that during your database server deployment, your monitoring system alerts you that the disks being allocated to your new servers will risk exceeding your team’s infrastructure budget. To prevent this from happening, you want to cancel the deployment and roll your environment back to its previous state before troubleshooting the problem and re-attempting.
As discussed above, Chef and other configuration management tools do not keep a record of the state that their actions produce. Consequently, unless you write additional code to account for scenarios like this, performing this rollback with cookbooks will be impossible. You will need to wait for the cookbook run to fail and then go into your AWS console and delete the failed servers manually.
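Provisioning tools like Terraform and CloudFormation make this far easier. CloudFormation rolls a failed stack operation back automatically, and because Terraform records state, you can revert your configuration and re-apply it to return to the previous setup. What’s more, if the rollback fails for some reason, you can modify the attribute describing the number of servers in your environment and rerun the deployment to purge any failed servers or other infrastructure, such as load balancers and DNS records, from it.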
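The bookkeeping that makes a rollback possible can be sketched in a few lines of Python. This is a minimal illustration, assuming an in-memory `FakeCloud` class in place of a real cloud API; the class, the `deploy_with_rollback` function and the resource names are all hypothetical, and real tools handle many more failure modes.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API. `fail_on` names a resource whose creation fails."""

    def __init__(self, fail_on=None):
        self.resources = set()
        self.fail_on = fail_on

    def create(self, name):
        if name == self.fail_on:
            raise RuntimeError(f"could not create {name}")
        self.resources.add(name)

    def delete(self, name):
        self.resources.discard(name)


def deploy_with_rollback(cloud, to_create):
    """Create resources in order; on failure, delete the ones this run created."""
    created = []  # the recorded state of this deployment
    try:
        for name in to_create:
            cloud.create(name)
            created.append(name)
    except RuntimeError:
        # Undo in reverse order so dependents go before what they depend on.
        for name in reversed(created):
            cloud.delete(name)
        return False
    return True


if __name__ == "__main__":
    cloud = FakeCloud(fail_on="db-3-disk")
    cloud.resources = {"db-1", "db-2"}
    ok = deploy_with_rollback(cloud, ["db-3", "db-3-disk", "db-3-dns"])
    print(ok, sorted(cloud.resources))
```

Because the run records exactly what it created, the cleanup touches only those resources and leaves the servers that existed beforehand alone. Without that record, someone has to work out by hand which servers belong to the failed deployment.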
Use the right tools
There are a few tools available that can make provisioning your infrastructure simpler. Terraform, Heat and CloudFormation are the most popular at the moment. Terraform is a better choice if the majority of your infrastructure is not within AWS or if reducing cloud platform lock-in is a priority. Several of our developers have used it with great success; one such example can be found here. Both Terraform and CloudFormation have established enterprise support options available as well as extremely active communities backing them.
There are also a few upstarts in this space to choose from. Cumulus and SparkleFormation share many features with their incumbents, though their communities are smaller and their support options less established.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.