My goal with this article is to sketch what that path looks like. My thoughts are based on several interviews, chats, and discussions with many developers and systems administrators. This article will be split into two parts. The first part of this series provides a brief, anecdotal look into the world of IT before the ideas behind DevOps took the world by storm. We will also explore what hiring managers looking to bring DevOps into their companies want, along with skills that are helpful for engineers looking to become DevOps champions. The second part will outline common implementations of DevOps concepts that I’ve seen in the industry, along with some helpful books that discuss the ideas behind this post in more depth.
Olde World IT
Understanding history is key to understanding the future, and DevOps is no exception. To understand the pervasiveness and popularity of the DevOps movement, it’s helpful to understand what IT was like in the late-90’s and most of the '00s. This was my experience.
I started my career as a Windows systems administrator in a large multi-national financial services firm in late 2006. In those days, expanding infrastructure involved calling Dell (or, in our case, CDW) and placing a multi-hundred-thousand dollar order of servers, networking equipment, cables, and software, all destined for your on- and off-site data centers. While VMware was still convincing companies that using virtual machines was, indeed, a cost-effective way of hosting their “performance-sensitive” applications, many companies, including mine, pledged allegiance to running applications on their physical hardware. Our Technology department had an entire group dedicated to Datacenter Engineering and Operations, and their job was to negotiate our leasing rates down to some slightly-less-astronomical monthly rate, ensure that our systems were being cooled properly (an incredibly difficult problem if you have enough equipment) and, if you were lucky/wealthy enough, that your off-shored datacenter crew knew enough about all of your server models to not accidentally pull the wrong plug during after-hours trading.
Amazon Web Services and Rackspace were slowly beginning to pick up steam but were far from attaining critical mass.
In those days, we also had teams dedicated to ensuring that the operating systems and software running on top of that hardware worked when they were supposed to. These engineers were responsible for designing reliable processes for patching, monitoring, and alerting on these systems, as well as defining what the “gold image” looked like. Most of this work involved a lot of manual experimentation, and the extent of most of this testing was writing a runbook describing what you did, then following it to confirm that it produced the results you expected. This was important in a large organization like ours, since most of the level 1 and 2 support was offshore, and the extent of their training ended with those runbooks.
(This is the world that I lived in for the first three years of my career. My dream back then was to be the one who made the gold standard!)
Software releases were another beast altogether. Admittedly, I didn’t gain a lot of experience working on that side of the fence. However, from stories that I’ve gathered (and recent experience), much of the daily grind for software development during this time went something like this:
Developers wrote code as specified by the technical and functional requirements laid out by business analysts from meetings they weren’t invited to.
Optionally, developers wrote unit tests for their code to ensure that it didn’t do anything obviously crazy, like try to divide by zero without throwing an exception.
When done, developers marked their code as “ready for QA.” A QA engineer picked up the code and ran it in their own environment, which might or might not resemble production, or might even be the same environment used by the original developer.
Failures were sent back to the developers within “a few days or weeks” depending on other business activities and/or where priorities fell.
While system admins and developers didn’t see eye to eye often, the one thing they shared a common resistance toward was “change management.” This was a set of highly regulated (and, in the case of my employer at the time, highly necessary) rules and procedures governing when and how technical changes happened in a company. Most companies followed the Information Technology Infrastructure Library, or ITIL, process, which, in a nutshell, asked a lot of questions about why, when, where, and how things happened, along with a process for establishing an audit trail of the decisions that led to those answers.
As you can probably gather from my short snippet of history above, many things within IT were done manually. This led to a lot of mistakes, which in turn led to a lot of lost revenue. Change management’s job was to minimize that lost revenue, and this usually came in the form of releases happening only every two weeks and changes to servers, regardless of their impact or size, being queued up for sometime between Friday at 4pm and Monday at 5:59am. (Ironically, this batching of work led to even more mistakes, usually more serious ones.)
DevOps Isn’t A Tiger Team
You might be thinking “What is Carlos going on about, and when is he going to talk about Ansible playbooks?” I love Ansible tons, but hang on; this is important.
Have you ever been assigned to a project and had to interact with the “DevOps” team? Did you have to rely on a “Configuration Management” or “CI/CD” team to ensure that your pipeline was set up properly? Were you ever obligated to attend meetings about your release, and everything it pertained to, weeks after the work was marked “code complete”? Have you ever worked with a “tiger team” that was created as a short-term fix to bring teams together but ended up becoming permanent?
If so, then you’re reliving history. All of this comes from the world described above. Calling your team “DevOps” isn’t going to fix it.
Functional silos form out of an instinctual draw to working with people like ourselves. Naturally, it’s no surprise that this human trait also manifests in the workplace. I even saw this play out at a company I worked at prior to joining ThoughtWorks. When I started, all developers worked in common pods and worked closely with each other. As the codebase grew in complexity, developers who worked on common features naturally aligned with each other to try and tackle the complexity within their own feature. Soon afterwards, feature teams were officially formed.
(To be clear, I don’t think feature teams are universally good or bad. I thought it was the right decision for the aforementioned company. They are silos, however, and they did occur somewhat naturally.)
System admins and developers at many of the companies I worked at not only formed natural silos like this, but also blindly advocated for their own side and fiercely competed against each other. Developers were mad at sysadmins when their environments were broken. Developers were mad at sysadmins when their environments were too locked down. Sysadmins were mad at developers for breaking their environments in arbitrary ways all of the time. Sysadmins were mad at developers for asking for way more computing power than they needed.
Neither side understood each other, and worse yet, neither side wanted to.
The purpose of DevOps was to put an end to this.
DevOps isn’t a team. It’s not a group in Jira. It’s a way of thinking. According to the movement, in an ideal world, developers, sysadmins and business stakeholders would work as one team, and while they might not know everything about each other’s worlds, they know enough to understand each other and their backlogs, and can, for the most part, speak the same language. Everybody is winning because everyone understands each other.
Adam Jacob said it best: “DevOps is the word we will use to describe the operational side of the transition to enterprises being software-led.”
A common question I’ve gotten asked is “What do I need to know to get into DevOps?” The answer, like most open-ended questions like this, is “it depends.”
Learn The Basics

At the moment, the “DevOps engineer” role varies from company to company. Smaller companies that have plenty of software developers but few people who understand infrastructure will likely look for candidates with more experience administering systems. Other, usually larger and/or older, companies that have a solid sysadmin organization will likely optimize for something closer to a Google SRE, i.e. “a software engineer to design an operations function.” This isn’t written in stone, however; as with any technology job, the decision largely depends on the hiring manager sponsoring it.
That said, we at ThoughtWorks typically look for Infrastructure Developers and DevOps Champions who are interested in learning more about:
- How to administer and architect secure and scalable cloud platforms (usually on AWS, but Azure, Google Cloud Platform, and PaaS providers like DigitalOcean and Heroku are popular too),
- How to build and optimize deployment pipelines and deployment strategies on popular CI/CD tools like Jenkins, GoCD and cloud-based ones like Travis CI or CircleCI,
- How to monitor, log, and alert on changes in your system with visualization tools like Kibana and Grafana and log aggregation tools like Splunk, Loggly, or Logstash, and,
- How to maintain infrastructure as code with configuration management tools like Chef, Puppet or Ansible, as well as deploy said infrastructure with tools like Terraform or CloudFormation.
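To make the infrastructure-as-code idea concrete, here is a minimal Ansible playbook sketch. The `web` host group, the choice of nginx, and the Debian-style `apt` module are illustrative assumptions, not a prescription:

```yaml
# Hypothetical playbook: converge hosts in the "web" group to a known state.
- hosts: web
  become: true
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
        update_cache: true

    - name: Ensure nginx is running and starts on boot
      service:
        name: nginx
        state: started
        enabled: true
```

The value here is repeatability: running the playbook twice produces the same result, which is what separates infrastructure as code from a runbook of manual steps.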
Containers are increasingly popular as well, as they are quickly becoming a great way of achieving extremely high density of services and applications running on fewer systems while increasing their reliability. (Orchestration tools like Kubernetes and Mesos can spin up new containers in seconds if their host fails.)
Having the ability to quickly and easily embed your application into a container image during a deployment pipeline, then deploy it onto any hardware, physical or virtual, running (almost) any operating system on any provider, makes containers highly attractive for those looking to deploy quickly and often. Containers also mitigate many of the system-configuration challenges that configuration management platforms have long been solving. Patching, environment variable management, and system state are much smaller concerns in a container-driven infrastructure; all that matters is that the kernel on which the container service runs can support containers, and that any external dependencies required to run them are present.
Why is this important? There is a very real possibility that configuration management skills of today will become completely obsolete in less than five years.
Spend some time learning how to create a container image in a pipeline. Explore container orchestration platforms. They are quite powerful.
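As a sketch of what embedding an application into a container image looks like, here is a hypothetical Dockerfile for a small Python web service; the file names, port, and base image are assumptions for illustration:

```dockerfile
# Build a self-contained image for a hypothetical Python service.
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code itself.
COPY . .

EXPOSE 8080
CMD ["python", "app.py"]
```

In a pipeline, a step like `docker build -t myapp:$GIT_COMMIT .` followed by a push to a registry produces an artifact that runs the same way on any host with a container runtime.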
Cross-Skill: Look Left and Right

If you’re a systems administrator who’s looking to make this change, you will also need to know how to write code. Python and Ruby are popular languages for this purpose, as they are portable (they run on any major operating system) and easy to read and learn. They also form the underpinnings of the industry’s most popular configuration management tools (Python for Ansible; Ruby for Chef and Puppet), and both are commonly used for AWS, Azure, and GCP API clients.
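As a taste of the kind of glue scripting this involves, here is a small Python sketch that flags hosts reporting high disk usage. The `hostname=NN%` report format, the host names, and the threshold are all made up for illustration:

```python
def hosts_over_threshold(report_lines, threshold=80):
    """Return hostnames whose reported disk usage exceeds threshold percent.

    Each line is expected to look like "hostname=NN%"; this format is a
    made-up example, not any real tool's output.
    """
    flagged = []
    for line in report_lines:
        host, _, pct = line.partition("=")
        if pct and int(pct.rstrip("%")) > threshold:
            flagged.append(host)
    return flagged

sample = ["web01=91%", "web02=45%", "db01=88%", "cache01=12%"]
print(hosts_over_threshold(sample))  # → ['web01', 'db01']
```

Small scripts like this are the gateway: the same logic, pointed at a real inventory, is one step away from an automated alert or a remediation playbook.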
If you’re a developer, I highly recommend learning more about UNIX, Windows and networking fundamentals. Even though the cloud abstracts away many of the complications of administering a system, debugging slow application performance is aided greatly by knowing how these things work. I’ve included a few books on this topic in the next section.
Practice, Practice, Practice
If this sounds overwhelming, you aren’t alone. Fortunately, there are plenty of small projects you can use to get your feet wet. One such toy project is Gary Stafford’s Voter Service, a simple Java-based voting platform. We ask our candidates to take the service from GitHub to production infrastructure through a pipeline. You can combine that with Max Griffiths’ excellent intro to Ansible and CD to learn about ways of doing this.
Another great way of becoming familiar with these tools is taking popular services and setting up an infrastructure for them using nothing but AWS and configuration management. Set it up manually first to get a good idea of what to do, and then replicate what you just did using nothing but CloudFormation (or Terraform) and Ansible. Surprisingly, this is a large part of the work that we Infrastructure Devs do for our clients on a daily basis. Our clients find this work to be highly valuable!
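A hypothetical starting point for the Terraform half of that exercise might look like the following; the region, AMI ID, and instance type are placeholders, not recommendations:

```hcl
# Hypothetical Terraform sketch: one EC2 instance for a practice service.
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "practice_app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  tags = {
    Name = "practice-app"
  }
}
```

`terraform plan` shows what would change before `terraform apply` makes it so, which mirrors the “do it manually first, then codify it” workflow described above.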