Not too long ago, cloud infrastructure was being widely praised for its incredible simplicity and manageability. For companies and teams burdened by complex on-premises infrastructure, it presented an intuitive, scalable, and cost-effective alternative that promised to transform how we approach IT infrastructure as a whole.
But as offerings from the major public cloud providers have grown and become more sophisticated, that simplicity has fallen by the wayside. Today, cloud infrastructure service portfolios are incredibly large, increasingly complex, and harder than ever for engineering teams to keep pace with.
Amazon Web Services alone now offers 175 different cloud infrastructure services across its portfolio. We’re no longer dealing with a short, simple selection of intuitive services that can quickly be provisioned with just a few clicks.
Instead of relieving pressure from businesses and simplifying their infrastructure approach, cloud is now pushing in the opposite direction — creating new management challenges, demanding new (virtual) data center skills, and increasing average time to market.
Why rising cloud complexity is a business problem
When cloud-based infrastructure first emerged, businesses and their IT teams were eager to embrace it and gain the flexibility, scalability, and management simplicity benefits it offered.
It enabled developers and infrastructure teams to provision new services in minutes — dramatically improving time to market, and enabling organizations to gain an all-important competitive edge in crowded markets.
But as the complexity of cloud infrastructure offerings has increased, those benefits have shrunk for many organizations. More complex services and decisions mean the teams managing cloud infrastructure need more specialized skills, and developers need deep expertise just to be able to self-serve. That pushes the definition of ‘full-stack’ to the point where it becomes a disadvantage.
Put simply, rising cloud complexity is putting many organizations and their infrastructure teams right back to where they were 15 years ago — struggling to keep up with demand for new services and instances, and stay on top of an increasingly unmanageable infrastructure footprint.
The solution? Start viewing and managing infrastructure as a set of internal platform products
Major cloud service providers may have moved away from providing simple and intuitive infrastructure options that anyone can provision themselves. But that doesn’t mean you can’t create internal platforms and processes to provide them yourself.
That’s the essence of a Platform Engineering approach. Rather than having your cloud infrastructure team spend all of its time configuring and deploying new instances, environments, and services as developers need them, you have the team create a platform of self-service components that developers can easily choose and deploy themselves.
It’s an approach that started to see widespread adoption in 2020, featuring significantly in Puppet’s State of DevOps report. According to the report, 63 percent of organizations have at least one internal self-service platform in place.
With reusable, self-service solutions available on a platform that’s easy to understand and engage with, cloud infrastructure teams can spend less time configuring individual solutions, and give the business a rapid path to market for new digital services and capabilities.
The diagram above shows how speed to value and speed to market both increase exponentially through this approach. As the platform matures and the team creates more reusable elements, these can easily be combined to create new product experiences for the product teams and, by extension, the business.
As each reusable product or component is created, the build-measure-learn cycle gets shorter and quicker, delivering value to everyone faster. However, the platform must be managed carefully as it evolves to ensure that it remains composable and cohesive. In other words, it must not become complex to engage with or restrictive in the product choices it offers.
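To make the idea of composable, self-service platform products a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and purely illustrative: the product names (web-service, postgres-db), their parameters, and the functions self_serve and compose are assumptions, not part of any real platform or cloud provider API. It only shows the shape of the interaction, where the platform team publishes a small catalog with sensible defaults and developers request what they need with a handful of parameters.

```python
from dataclasses import dataclass, field


@dataclass
class PlatformProduct:
    """A reusable building block published by the platform team (hypothetical)."""
    name: str
    required_params: tuple
    defaults: dict = field(default_factory=dict)

    def render(self, **params) -> dict:
        """Turn a developer's request into a full component description."""
        missing = [p for p in self.required_params if p not in params]
        if missing:
            raise ValueError(f"{self.name}: missing parameters {missing}")
        # Platform-owned defaults come first; the developer's choices override them.
        return {"product": self.name, **self.defaults, **params}


# Illustrative catalog. These products and defaults are invented for the example.
CATALOG = {
    product.name: product
    for product in (
        PlatformProduct("web-service", ("team", "image"),
                        {"replicas": 2, "load_balanced": True}),
        PlatformProduct("postgres-db", ("team", "size_gb"),
                        {"backups": "daily"}),
    )
}


def self_serve(product_name: str, **params) -> dict:
    """Request a single product from the catalog without a platform-team ticket."""
    if product_name not in CATALOG:
        raise KeyError(f"unknown product: {product_name}")
    return CATALOG[product_name].render(**params)


def compose(*requests) -> list:
    """Combine several (product_name, params) requests into one environment."""
    return [self_serve(name, **params) for name, params in requests]


environment = compose(
    ("web-service", {"team": "payments", "image": "payments-api:1.4"}),
    ("postgres-db", {"team": "payments", "size_gb": 50}),
)
for component in environment:
    print(component)
```

In this sketch, developers supply only the parameters they care about, while the platform team keeps control of the defaults. Whether developers should be allowed to override those defaults at all is a design decision each platform team would need to make for itself.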
Why the approach works
This platform-based approach to cloud infrastructure management works because it considers everyone’s needs, and gives every group the time and capabilities they need to focus on the things they do best:
- Development teams and other internal customers don’t need to become masters of new languages or develop specialist cloud infrastructure skills to help themselves to the services they need.
- The platform team has visibility of all cloud infrastructure requests, and can broaden its expertise as cloud offerings and the infrastructure landscape advance, ensuring that products are both relevant and highly valuable to the organization as a whole.
However, it’s important to note that it can only deliver these results when planned, implemented, and managed correctly.
What can go wrong?
Thoughtworks recently worked with a fintech enterprise that had attempted to implement a platform-based infrastructure approach, but faced many challenges that limited its effectiveness.
The company had built a team called Platform Engineering, seeded with some new cloud engineer hires and a few members of the existing data center team who had demonstrated an ability to code and an appetite to stretch out of their comfort zone.
The team worked for 18 months to build a platform that enabled developers to provision their own cloud components by submitting infrastructure-as-code via pull requests to the Platform Engineering team. The platform team would then coordinate and deploy the applications to a load-balanced set of VMs, via a continuous delivery (CD) pipeline.
But after a year and a half of coding, only 2 percent of the company’s production workloads were flowing through the new platform. Why?
The diagram above shows the cross-backlog dependencies that this approach created. It wasn’t a true self-service approach, because in many cases, other teams and individuals had to contribute to and edit a request before it was given the go-ahead. Ownership of the final product was unclear.
As new dev teams submitted more code, that inefficiency began to scale (shown in the bottom half of the diagram). There were more requests to review, but still only one platform team to review them all, creating a major bottleneck that slowed deployments further.
The end result was a platform that struggled to deliver the business value it initially promised. Backlogs and bottlenecks slowed time to market, new languages and processes made the internal platform hard to engage with, and ultimately, it didn’t meet developers’ needs.
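The way a single review team turns into a bottleneck can be illustrated with some back-of-the-envelope arithmetic. The sketch below is a deliberately simplified model, and every number in it is made up for illustration. None of the figures come from the engagement described above. It assumes each dev team submits a fixed number of pull requests per week and that the platform team can review a fixed number per week.

```python
def review_backlog(dev_teams, requests_per_team_per_week, reviews_per_week, weeks):
    """Return the number of unreviewed requests at the end of each week.

    A toy model: arrivals and review capacity are constant (an assumption).
    """
    backlog = 0
    history = []
    for _ in range(weeks):
        backlog += dev_teams * requests_per_team_per_week  # new pull requests arrive
        backlog -= min(backlog, reviews_per_week)          # the one team reviews what it can
        history.append(backlog)
    return history


# Three dev teams: 15 requests a week against a capacity of 20, so nothing queues.
print(review_backlog(3, 5, 20, 4))  # [0, 0, 0, 0]

# Eight dev teams: 40 requests a week against the same capacity of 20.
print(review_backlog(8, 5, 20, 4))  # [20, 40, 60, 80]
```

Once incoming requests exceed what one team can review, the queue grows every week without limit. Onboarding more dev teams only makes it grow faster, which is why a pull-request gate in front of the platform team does not behave like true self-service.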
Keys to success
To ensure the success of this approach in your organization, there are three key factors you must consider as you create and execute your cloud infrastructure platform strategy: