Building Reliable Digital Operations

Dan McClure and Jim Highsmith

Published: August 18, 2016

In order for you to thrive in the digital environment, you need to understand the implications of the changing technology landscape on your organization. This is the third article in Technology Radar Echoes, a series where authors share their insights and experience on the technology problems and solutions driving business differentiation for enterprise leaders. Now in its seventh year, the Thoughtworks Technology Radar is an assessment of trends significantly impacting software development and business strategy created by an international advisory board of technologists.

The Fourth Industrial Revolution is filled with the promise of exciting new technologies. Klaus Schwab, the Executive Chairman of the World Economic Forum, spared little hyperbole when he said, “We stand on the brink of a technological revolution that will fundamentally alter the way we live, work, and relate to one another. In its scale, scope, and complexity, the transformation will be unlike anything humankind has experienced before.”

Big Data, the Internet of Things, and flexible maker technologies have the potential to transform the way that organizations create value and compete in the marketplace. The temptation is to focus on these hot new technologies, pulling them into the enterprise strategy with the assumption that disruptive market impact will inevitably follow.

In fact, as a business leader, it might seem that this kind of cutting-edge thinking is your primary responsibility. Certainly embracing the possibilities of new technology is important, but other foundational capabilities must also be in place if those complex, fast-moving innovations are ever to reach your customers. Operational excellence becomes a critical capability in this Fourth Industrial Revolution world. We have to deliver and deploy these new technologies, but we also need to ensure their reliability and adaptability.

Perhaps one of the least glamorous parts of this creative foundation is the technology infrastructure that underlies trend-setting new tools like the cloud. The teams that support the enterprise infrastructure are increasingly called on to reconcile two very different business needs. On the one hand, they are asked to facilitate a new era of rapid, complex, market-focused innovation. At the same time, as technology moves into the center of business strategy, what we’ve called tech at the core, there is less and less tolerance for any gap in reliability.

Adaptability and flawless reliability are not concepts that go together easily. The organizations that master these conflicting demands will be far better positioned to take advantage of all that "cool new technology" and disruptive business opportunity.

Reliability + Speed Create Business Value

Have you ever upgraded your computer operating system or browser and discovered a host of applications that didn’t work anymore? Have you ever made what seemed like a small change to some piece of personal technology only to find that everything goes to pieces? Multiply these issues by the hundreds of technical components residing on the thousands of physical, virtual, and cloud-based servers that any sizeable business has in operation, and it’s a wonder that outages and downtime in IT operations aren’t greater.

The reason is that IT infrastructure teams, the often unsung heroes of reliability, have established strong processes and standards to keep these systems operational, despite their complexity. These teams face an increasing challenge in sustaining this performance. When digital systems are the business, reliability becomes an executive concern. Applications that crash (think of the initial Affordable Care Act website) quickly undo the benefits of speedy implementation. The cost of downtime on critical systems, often including damage to reputation, adds up rapidly.

So what we need is greater reliability, right? In a previous series article about tech stack complexity, we examined the explosion of tech components in all those Fourth Industrial Revolution technologies like big data, mobility, social media, and the Internet of Things. Greater complexity usually means that digital systems are moving from supporting the business to becoming the business, but it also means there are more components, and therefore a greater chance of failure.

Early in Jim’s career he worked as a reliability engineer on the Apollo spacecraft program. Reliability was a paramount concern: critical systems had backups, and occasionally the backups had backups. The calculation of the reliability of components in series is sobering when dealing with astronauts’ lives: you have to multiply the reliabilities of each component in series. For example, if the reliability of each component is 0.99, then the reliability of 10 components in series is only about 90% (0.99 multiplied by itself 10 times is roughly 0.904), so the combination fails nearly 10% of the time. The cost of complexity and scale adds up quickly. Even if you raise the reliability of each component to 0.999, with 20 of them in series the combination will fail about 2% of the time. Few executives will find a two percent failure rate on today’s essential digital operations to be acceptable.
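The series calculation described above can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent; the function names are illustrative and not part of the original article.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of a chain in which every component must work.

    Assumes failures are independent, so the individual
    reliabilities are simply multiplied together.
    """
    return prod(component_reliabilities)


def identical_series(per_component, count):
    """Shortcut for `count` components that share one reliability."""
    return per_component ** count


if __name__ == "__main__":
    for reliability, count in [(0.99, 10), (0.999, 20)]:
        total = identical_series(reliability, count)
        print(f"{count} components at {reliability}: "
              f"works {total:.1%}, fails {1 - total:.1%}")
    # 10 components at 0.99: works 90.4%, fails 9.6%
    # 20 components at 0.999: works 98.0%, fails 2.0%
```

Because the reliabilities multiply, every component added to the chain lowers the overall figure, which is why larger technology stacks need more reliable parts just to stay even.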

All this complex technology is increasingly difficult to deploy and test as the pace of innovation and creative change accelerates. In the rush to digital innovation, it’s easy to forget about operations until disaster strikes. Imagine an infrastructure team stumbling unexpectedly on the configuration of some seemingly esoteric component, the kind of problem that can take hours to untangle. Can you imagine the cost to Amazon, Google, Facebook, or your company of an hours-long outage?

Speed is also part of the new challenge. Increasingly, the speed at which a business adapts to its shifting market is a key measure of the organization’s technical prowess. You can have a talented agile innovation team and an effective agile process, but adaptability and speed will still suffer when infrastructure debt is too high. Years ago Jim was involved with a project that built an application in one month of one-week cycles, and then it languished in an antiquated operations deployment process for six months. Technology-enabled innovators can’t afford that kind of tax on agility.

Infrastructure as Code – Speed, Adaptability and Reliability

It’s easy to miss operations personnel toiling away trying to keep all the technology balls in the air. Without reliability, the speed and adaptability of business innovation teams don’t mean much. However, what if speed and adaptability could accompany reliability?

Operationally, infrastructure refers to all the components that must be in place to enable application software to run. It consists of cloud-based components of hardware (servers, storage, etc.), databases, networking software, libraries, and more. Each of these components may have configuration settings. Small, seemingly innocuous differences or changes in any of these components might result in a failure. Asking people to manually manage this minefield of dependencies is unrealistic. As Kief Morris writes in his book Infrastructure as Code: Managing Servers in the Cloud,

“Unfortunately, not enough organizations see these benefits even with the latest and best new tools and platforms. IT operations teams still find that they can’t keep up with their daily workload. They don’t have the time to fix longstanding problems with their systems, much less revamp them to make the best use of new tools. In fact, cloud and automation often makes things worse. The ease of provisioning new infrastructure leads to an ever-growing portfolio of systems, and it takes an ever-increasing amount of time just to keep everything from collapsing.”

The solution to these problems seems obvious: automation. Yet many organizations’ IT operations in this area are still manual. Shifting the infrastructure’s physical and logical complexity into code offers a way out. Morris defines Infrastructure as Code as follows:

“Infrastructure as code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration.”
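To make the "consistent, repeatable routines" in this definition concrete, here is a minimal Python sketch. It keeps the desired configuration as plain data and computes the changes needed to reach it. The setting names, values, and function names are hypothetical, chosen only for illustration; they are not taken from Morris's book or from any particular tool, and the code touches no real systems.

```python
# Hypothetical desired state for one server, kept in version control as data.
DESIRED = {
    "web_server_version": "1.10",
    "max_connections": 1024,
    "tls_enabled": True,
}


def plan_changes(current, desired):
    """Return {setting: (current_value, desired_value)} for every mismatch."""
    changes = {}
    for key, want in desired.items():
        have = current.get(key)  # None when the setting is missing entirely
        if have != want:
            changes[key] = (have, want)
    return changes


def apply_changes(current, changes):
    """Return a new state dict with the planned changes applied."""
    updated = dict(current)
    for key, (_, want) in changes.items():
        updated[key] = want
    return updated


if __name__ == "__main__":
    current = {"web_server_version": "1.8", "max_connections": 1024}
    plan = plan_changes(current, DESIRED)
    print(plan)
    # {'web_server_version': ('1.8', '1.10'), 'tls_enabled': (None, True)}
    converged = apply_changes(current, plan)
    print(plan_changes(converged, DESIRED))
    # {}
```

The second run finds nothing to do. That repeatability is the point of the sketch: the same definition applied to any server yields the same result, whoever runs it and however often.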

Basically, deploying infrastructure moves from a manual, or partially manual, operation to a fully automated one as the deployment components are treated as code and data. Effectively, the tools, techniques, and practices of agile software development are applied to the infrastructure environment. Ultimately, the measure of progress that touches on this entire range of issues is cycle time: the time from a feature's release into development until its deployment. The number of deployments per day is a secondary measure.

Investing in Innovation: Reliability + Speed

So why isn’t everyone automating their infrastructure deployment? Of course the answer is investment of time and money. As organizations rush forward into an exciting future, making fundamental changes to the foundation of their technology is seldom a first priority.

But the challenges in implementing Infrastructure as Code are as much organizational as technological. Even as organizations move towards their digital future, the gap between software development and operations teams remains a barrier to success. Development teams often work in an iterative, agile world while operations remain in a serial, waterfall mode (often because of the lack of deployment automation and therefore the need for manual control processes).

Business leaders are often at risk of undervaluing this core enterprise capability. It is seemingly a long way from the pressing new business opportunities of the digital marketplace, and discussions of the fine points of server configuration can be less than exciting to someone outside the field. Nonetheless, it is an area of investment (if not of technical detail) that is of direct interest to the leaders of business innovation.

Klaus Schwab sees this as a broad-based executive challenge: “The bottom line, however, is the same: business leaders and senior executives need to understand their changing environment, challenge the assumptions of their operating teams, and relentlessly and continuously innovate.”

This can only be done on a solid foundation.

Embracing Infrastructure as Code is one critical capability needed to improve reliability, with speed and adaptability coming along for the ride. Everyone needs to get behind the urgency of this hidden, but essential, investment.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
