Towards self-serve infrastructure

13 January, 2020 | 27 min 21 sec
Podcast Host Neal Ford and Mike Mason | Podcast Guest Evan Bottcher and Zhamak Dehghani

Brief Summary

Traditional, centralized approaches to infrastructure management risk creating organizational friction and bottlenecks for dev teams. By defining a standardized tech stack, you immediately make it harder to satisfy your teams’ diverse needs. Co-hosts Neal Ford and Mike Mason are joined by Evan Bottcher and Zhamak Dehghani to explore how to benefit from self-service options.

Podcast Transcript


Neal Ford:

Welcome everyone to the ThoughtWorks Technology podcast. I'm one of your regular hosts Neal Ford.


Mike Mason:

And I'm Mike Mason, also one of your regular hosts.


Neal Ford:

And we're joined today by two of our colleagues on the Doppler group within ThoughtWorks, which is the group that puts together the Technology Radar.


Evan Bottcher:

I'm Evan Bottcher, I'm from Melbourne, Australia and I'm a principal consultant.


Zhamak Dehghani:

I'm Zhamak Dehghani and I'm from the San Francisco office and I'm a tech principal here.


Neal Ford:

And Zhamak is normally one of our hosts, but today she's a guest so that she gets to talk more, because she has strong opinions about this. So one of the things that frequently comes up when we meet face to face to put together the Technology Radar is a lot of very interesting conversations that get spawned by what we call blips on our radar. And this podcast is a good example of a very interesting conversation that spun up, and we wanted to capture some of the essence of it as a podcast.


Mike Mason:

Yeah. So this specific discussion that we had in the meeting was around using pull requests somehow to cause infrastructure changes to happen rather than using a ticketing system. But that's not where we're going to start. This kind of spawned a whole discussion about what we mean by self-serve infrastructure, because I think it's kind of a pillar of the platform strategy that we talk about at ThoughtWorks.


Mike Mason:

We talk about reducing friction for teams. We talk about self-serve infrastructure, but we wanted to kind of dive in, in a podcast, to what we mean by that and why that's important. So maybe you guys could start with talking about some of the traditional challenges that teams face and how self-serve infrastructure, whatever that is, might help.


Zhamak Dehghani:

Sure. So I think traditionally operations or infrastructure management has often been under the control of a centralized team. And that centralized team had the best intentions in their hearts: they wanted to provide a quality service or quality infrastructure to the rest of the developers in the organization while putting guardrails and security and all the quality concerns into that infrastructure.


Zhamak Dehghani:

But organizationally and process-wise, the way it has been organized is that the rest of the organization, for their needs, would come to the centralized team and ask for the pieces of infrastructure that would enable the business value they're trying to unlock in the application they're trying to build. And they come with their tickets, and they're basically contributing to a centralized backlog, and they're competing for where on that backlog the work needs to be placed.


Mike Mason:

And when you say infrastructure, you mean like servers and firewalls and bits of... What do you mean?


Zhamak Dehghani:

Yeah, so all of the above. I think you can really apply the same paradigm at different layers of that stack. Even for clients that are, let's say, on the cloud and have pieces of that infrastructure already kind of self-serve and managed, provided by the cloud, their compute, their storage, there's still another level of governance or guardrails or ways of using that, which gets implemented and controlled often by a centralized team.


Zhamak Dehghani:

So it could range from your compute, your storage, your network, your observability, your monitoring tools, to your storage or data pipelining infrastructure. It's a very diverse range, and then you can go up and down that stack in terms of organizational maturity: how much the organization has invested to build other, higher levels of abstraction on top of it. But I think that centralization has led to a lot of organizational friction that hasn't worked out really well, because you've essentially created a centralized bottleneck.


Mike Mason:

So it's a bottleneck in that there's kind of a slow response for teams. Is it that kind of a thing? Like, I know we often talk about governance kind of in a derogatory way, but you know, you can see why organizations would want to put some controls around infrastructure so it's not the wild West. But is it slowness or inflexibility? Or what's the key problem?


Zhamak Dehghani:

I think, I mean, when you centralize control, ownership and capability into one team, you have immediately created a high point of friction, absolutely in terms of response time. In terms of flexibility as well, because once you centralize something in one group, that group tends to optimize locally for their efficiency. So what you would then see is this idea of harmonization: they would try to satisfy a very diverse set of needs with one way of doing things. So you don't get the level of diversity that you would need when you have very different types of needs and applications.


Zhamak Dehghani:

So imagine the CI/CD pipeline; it's a good example. Often a centralized team becomes responsible for not only providing your CI/CD agents, but also providing templates for your pipelines. So then local optimization means they're going to provide a typical template of a CI/CD pipeline that might be okay for a typical application, but as soon as you are an anomaly in the organization or you move away from that typical configuration, then that doesn't fit your needs. So I think you become rigid, you're not nimble in responding to change, and you're also very slow to respond to the needs of the rest of the organization.


Neal Ford:

Let's play devil's advocate for a second. You also get consistency and consolidation and canonical representations of things and a single source of updates. So how do you address some of those problems if you move? I mean, I don't suspect you're advocating here that it becomes the wild West where you let a thousand flowers bloom on your infrastructure, right?


Zhamak Dehghani:

Okay, so go ahead.


Evan Bottcher:

Some organizations have gone that way. And so they've allowed for a proliferation of team-managed infrastructure, which has the upside of that fast turnaround time and the ability to tailor the underlying infrastructure to their particular needs in their particular domains, which is great. But as you say, that leads to this excessive proliferation of different technologies. So there's somewhere in between. The cost of over-centralization and too many constraints without enough self-service is very real.


Evan Bottcher:

It costs you at all levels of your delivery lifecycle, from standing up new services and environments and configuring them correctly, through to delivering change in your software and systems, and identifying problems and troubleshooting. It's usually a problem of access and availability of tools to be able to resolve incidents quickly. So it's a real impact on customers, not just in time to market but also in time to restore service when something goes wrong. So it is something that you definitely need to address.


Mike Mason:

And I really want to echo what you said there. It's not just about developers being grumpy because they've got to ask somebody else for stuff. There are real, tangible customer negatives to the slowness and to them not being able to do a deploy, figure out what's going wrong, look at the logs. All of that stuff has real negative impacts for the organization.


Evan Bottcher:

Well, inability to scale too, because ultimately you need to scale. Quite often with these platform teams, as you scale up the number of teams that are building product, you need to scale up the number of humans that are doing work in the centralized ops or platform team, or whatever you've labeled it, in order to service all those teams' needs in a timely manner. And the demand for new environments and new tooling is always outstripping the capability of a centralized team.


Neal Ford:

I think one of the dysfunctions you see in organizations like that is if you have a centralized team and you view software mostly as overhead then you immediately start cranking down the budget as much as you can on that centralized team, which makes their response time even slower and it makes the problem even worse within the organization. So freeing some of that up to individual teams will help you spread some of that pain around the organization, but you’ll get a much faster response time for individual teams. And it has to do with the attitude about how you use software.


Evan Bottcher:

Well yeah, as you say, it's the attitude about how you view software and infrastructure and everything. So if you're seeing these sorts of operations and platform teams as a cost center, as something that you need to manage to take cost out... And certainly consolidating these things is an effort to reduce diversity or variation, to allow more freedom for people to move around the organization, to have better cost optimization, but that's not the primary objective. The primary objective is to speed time to market and speed resolution of problems; that's where the focus needs to be. A lot of that comes down to how we view platforms and infrastructure internally: not just as servicing demand, but actually as a compelling product.


Evan Bottcher:

And this is a really key bit that we've been talking about for quite a while. We've had some things on the radar in the past around this, where we see that the successful organizations are those where people have a clear idea of what their internal platforms are and how they service their internal teams as a product.


Evan Bottcher:

The role of the technical product manager is a really key thing here: someone who can advocate for a better experience for teams within the organization, for reducing friction, for taking away burden and reducing risk. They understand the benefits of using their platform and are able to guide the development of self-service and better quality, more reliable platforms, even if those things take away a little bit of choice for the teams around what tools and things they use. They understand that that's going to be a compelling place or a compelling product that the other teams would want to use.


Zhamak Dehghani:

To build upon that, I think with the idea of a product, there are a few things that are kind of in that word "product". One is this experience of the users: that they can just pick up the product and use it and, in a self-serve way, solve their problems. So that comes with a lot of other things that have to be put in place. All the guardrails that we talked about, the constraints that we want to put on the infrastructure or on the teams to make sure we are secure and we're not compromising availability, can all now be abstracted in the implementation of this self-serve infrastructure, or self-serve interface to the infrastructure, while still giving flexibility to the teams to use the product within those guardrails. So the abstraction of the complexity and the abstraction of the guardrails go into that layer, the capabilities that we call self-serve.
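
As an illustration of guardrails being abstracted into a self-serve interface, here is a minimal sketch in Python. It is not from the episode: the `ServiceRequest` type, the `provision` function, the approved regions and the instance limit are all hypothetical names and values chosen for the example. The point is only that the consuming team calls one function, and the platform's constraints are checked inside it rather than through a ticket.

```python
from dataclasses import dataclass

# Hypothetical guardrails owned by the platform team (illustrative values only).
ALLOWED_REGIONS = {"us-west-2", "ap-southeast-2"}
MAX_INSTANCES = 10


@dataclass(frozen=True)
class ServiceRequest:
    """What a consuming team asks for through the self-serve interface."""
    team: str
    region: str
    instances: int
    public: bool = False


def check_guardrails(request: ServiceRequest) -> list:
    """Return a list of guardrail violations; an empty list means the request may proceed."""
    violations = []
    if request.region not in ALLOWED_REGIONS:
        violations.append(f"region {request.region!r} is not approved")
    if not 1 <= request.instances <= MAX_INSTANCES:
        violations.append(f"instances must be between 1 and {MAX_INSTANCES}")
    if request.public:
        violations.append("public exposure needs a review by the platform team")
    return violations


def provision(request: ServiceRequest) -> str:
    """Self-serve entry point: no ticket is needed as long as the guardrails hold."""
    violations = check_guardrails(request)
    if violations:
        raise ValueError("; ".join(violations))
    # A real platform would call the cloud provider here; this sketch only reports the decision.
    return f"provisioned {request.instances} instance(s) for {request.team} in {request.region}"
```

A team that stays inside the guardrails gets its infrastructure immediately; a request outside them fails with a message that says which constraint was hit.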


Zhamak Dehghani:

The other thing is that role. Evan mentioned the role of the technical product owner, or a platform product owner or infrastructure product owner. That person is essentially an evangelist for the organization, to go out and market this as a product to developers, and to produce great documentation and examples of how to use the code, so that those self-serve infrastructure pieces can be discovered easily and used easily, with good documentation and examples, and they don't need a lot of handholding or yet another ticket on the backlog to use the infrastructure pieces.


Zhamak Dehghani:

And the third piece of that is: how do we measure success for this piece of infrastructure as a platform? What do we really care about here? Whenever I've built any sort of self-serve technical product with our teams, one of the key measures that I have asked to put in place and measure over time is the lead time to use all of that infrastructure: the decreased lead time to create something valuable on top of that infrastructure. For example, we're building a self-serve data infrastructure right now at a client. So one of the metrics for that self-serve data infrastructure is how long it takes for a data engineer to come and use this self-serve infrastructure to build their first pipeline and get it to production. So it's the lead time to creating what we call data products or data pipelines, in a way. And I think that's a good starting place to see how effectively you are serving your customers.


Evan Bottcher:

That's an excellent place to start. And quite commonly, in talking to organizations that are trying to do this, a trap is to start to look at cycle time as the time to make changes to your infrastructure for the platform team itself. But that's looking very much internal to the team, and it can look really good while the level of service you're providing to the organization is actually quite poor. You really need to start the clock on cycle time when the tenant, the consumer of the platform, a delivery team or product team in your organization, says, "I want to provision a new piece of infrastructure. I need to make a change, or I need to be onboarded." That's when the clock starts, and that's really important to measure.
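
The difference between the two clocks can be made concrete with a small Python sketch. The request records and timestamps below are invented for illustration and are not data from the episode; the sketch assumes you can record when a consumer asked, when the platform team started the work, and when the result was live.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: when the consumer asked, when the platform team picked
# the work up, and when the result was usable in production.
requests = [
    {"requested": datetime(2020, 1, 6, 9), "started": datetime(2020, 1, 13, 9), "live": datetime(2020, 1, 14, 9)},
    {"requested": datetime(2020, 1, 7, 9), "started": datetime(2020, 1, 9, 9), "live": datetime(2020, 1, 10, 9)},
    {"requested": datetime(2020, 1, 8, 9), "started": datetime(2020, 1, 20, 9), "live": datetime(2020, 1, 22, 9)},
]


def days_between(start: datetime, end: datetime) -> float:
    """Elapsed time in days between two timestamps."""
    return (end - start).total_seconds() / 86400


# Clock started when the platform team begins work: the team-internal view.
internal_cycle_time = median(days_between(r["started"], r["live"]) for r in requests)

# Clock started when the consumer asks: the view the rest of the organization has.
consumer_lead_time = median(days_between(r["requested"], r["live"]) for r in requests)

print(f"median internal cycle time: {internal_cycle_time:.1f} days")
print(f"median consumer lead time:  {consumer_lead_time:.1f} days")
```

With these made-up numbers the internal median is one day while the consumer waits a median of eight, which is the gap the discussion warns about.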


Evan Bottcher:

Another interesting measure is to look at the level of manual effort that the team's doing. Just in general terms, we talk about automation, but automation isn't binary: you don't have to have absolutely everything self-service. Otherwise we would just pass through every product in the cloud catalog and let every team use every product in AWS or GCP or Azure, whatever the cloud provider is. And we would have this proliferation. We wouldn't have the value add. If your platform team is doing manual work, you need to be looking at that work suspiciously to understand what categories of work that is. How much of it is failure demand?


Evan Bottcher:

That is, something that you should have had an automated or self-serve way of doing. And how much of it was value add, as in supporting another team performing something that's kind of in that "Okay, now and then we'll need to do this type of work" category? The traditional way was: I'm measured and incentivized based on the number of tickets, the number of inbound requests, that I can service. Actually we need to look at how many tickets we didn't have to service. And I think what the technical product manager will do is look at what's missing in our product catalog: what's missing in our offering that's causing all of this extra work for us to do.
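
One way to look at manual work "suspiciously" is to tag each ticket and see how much of the load is failure demand. The following Python sketch assumes a hypothetical ticket log where someone has already labeled each ticket as `failure_demand` or `value_add`; the ticket kinds and labels are invented for the example.

```python
from collections import Counter

# Hypothetical ticket log; the kinds and category labels are assumptions for illustration.
tickets = [
    {"kind": "create-database", "category": "failure_demand"},
    {"kind": "create-database", "category": "failure_demand"},
    {"kind": "create-database", "category": "failure_demand"},
    {"kind": "open-firewall-port", "category": "failure_demand"},
    {"kind": "open-firewall-port", "category": "failure_demand"},
    {"kind": "architecture-review", "category": "value_add"},
]


def failure_demand_share(log: list) -> float:
    """Fraction of tickets that exist only because a self-serve path is missing."""
    if not log:
        return 0.0
    failures = sum(1 for t in log if t["category"] == "failure_demand")
    return failures / len(log)


def automation_candidates(log: list, n: int = 3) -> list:
    """Most frequent failure-demand ticket kinds: likely gaps in the product catalog."""
    counts = Counter(t["kind"] for t in log if t["category"] == "failure_demand")
    return counts.most_common(n)


print(f"failure demand share: {failure_demand_share(tickets):.0%}")
print("candidates:", automation_candidates(tickets, 2))
```

The ranked list is a starting point for the product manager's question of what is missing from the offering, not a substitute for talking to the teams raising the tickets.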


Mike Mason:

How do you feel about consuming teams' choices in the platforms they use? Because a lot of the time we talk about how, if this was a product, you would be competing in the marketplace with other products. But developers within an organization often have a mandate: you've got to go use this thing. And there are a lot of sunk cost problems where some part of an organization says we're going to go build this super valuable thing, but then they just force people to use it, and they don't really compete for the customers of that internal product. Do you think it's acceptable to mandate stuff? Because, I mean, if you're not mandating use of your fancy new self-serve infrastructure, then we're back in the wild West, if teams can just choose whatever they're going to use.


Zhamak Dehghani:

It's a wonderful question. I have two thoughts on that, and maybe I'm a little bit idealistic, but if you go right now to the market and you want to buy a product, does the product creator come and mandate that you use their watch or wear their shoes? It's kind of your choice, and at the end of the day it's the quality of the service or the quality of the product and how it fits your needs that would help you make a decision, hopefully. And I think it should be the same for technical products within the organization as well. If the technical infrastructure is built in a way that really removes a whole heap of overhead, hoops that I would otherwise have to go through to create it myself, and using it is really easier and more convenient than building it myself, then hopefully that's a good reason for me to use that product.


Zhamak Dehghani:

So even if we don't have diversity in the offerings of self-serve infrastructure and we just have one, it should be the evangelism and the success stories... The first use case is really important to show the success of that product so that, kind of virally, that product starts being used. So hopefully that's one side of it, and maybe it's a bit too idealistic, but I hope that instead of forcing, we can create gravity and create pull for other folks to come and use it by really showing and demonstrating the benefits. On the other side, I actually think it would be wonderful to have diversity in the offerings of infrastructure as code and not have a monopoly where all of the infrastructure is within one team.


Zhamak Dehghani:

But what often happens is that when you get started, you have no choice but to start somewhere, right? Start with the 80% of the population that you need to serve rather than the 20%, the really different population that you need to serve. So like any product owner, you would do a market fit analysis and find out what fits the majority of the market that you have, which in this case is developers that want to build applications or data pipelines or whatever it is that you're trying to solve, the market population that you're trying to serve, and build for that first.


Zhamak Dehghani:

And that means yes, the rest of the organization will be in the wild West, and they will be doing their own thing, and they may not have the best experience building it, but we have to get started somewhere. And I think it's a problem if we think that how we bootstrapped building any infrastructure, and how it looks at bootstrapping time, is exactly the same model of operation three years down the track. This would evolve over time.


Neal Ford:

Toward that end, how difficult is it to get this started? What challenges do you find getting this in place within the organization?


Evan Bottcher:

One of the common things that I observe is that we've tried to take what was a data center-based centralized operations team and just translate it one for one into modern technology. What used to be virtualization is now cloud, and so you get traditional ticketing systems, provisioning, needing a project code for provisioning a tiny piece of infrastructure: essentially not looking at the capabilities of the underlying cloud provider, but really just thinking about this as a data center shift into the cloud.


Evan Bottcher:

Another really common challenge is where people have decided to go and build the platform in isolation from use. So: we're going to spend three months, we're going to stand up our self-service portal and... I've come along and seen what's happened, and there's a webpage where you click five times and hit the checkout because I'm purchasing or provisioning some cloud-based compute, and the teams have built a web-based checkout system that I can't automate.


Evan Bottcher:

One of the values of the cloud is being able to integrate automatable provisioning, scriptable environments and infrastructure as code into my deployment lifecycle and my development lifecycle. And so that portal has been built completely detached from how I actually work as a consumer of the platform. So I always warn that building the platform in isolation from customers is going to lead you to a bad place. Finding a consumer who needs the service, building it in situ, allowing the people to use the platform capabilities and then harvesting those for the next consumer seems to be a much better strategy.


Zhamak Dehghani:

Plus one to everything Evan said, it's just so spot on. This bottom-up approach to building platforms in isolation is just a recipe for failure. But it's an art as well, because if you think about using these usage patterns or use cases, or the teams that need the platform, as a vehicle to execute building the platform, you need to be careful not to overfit the implementation of that platform to one use case or to one team. So maybe it's one team, maybe it's two or three teams that have a slightly common cluster of capabilities that they need, but they are slightly different. So you build a platform through use cases, in collaboration with your consumers. You have the luxury in the organization of being able to work with your consumers, but you also should not overfit to their needs and not overcommit to features that are not needed.


Zhamak Dehghani:

And again, bring that product thinking to building it. I think, going back to where we started with Mike, the ability to do a pull request on a piece of infrastructure configuration by the consumer team is a good self-serve function when you get started. So think about that minimum viable, I don't know, experience for the developers to experience self-serve to a degree. It may not be a fancy website to go and click, which is kind of useless, and it's more lower level, but they still have a sense of autonomy and a sense of contribution to the change they need, and maybe that's a good place to get started. So don't overcommit to what this platform experience should look like; ask what the acceptable experience is when you get started, and then build and evolve it from there.


Evan Bottcher:

So to expand on that a little bit, which was the original discussion in the radar sessions this week: investing in automation and self-service and APIs and tooling is expensive when you're trying to build a compelling offering on top of whatever underlying cloud provider, with whatever the constraints or added value are. So we have seen a number of times that some very frequently used things become something that is completely automated and self-service. But it's actually quite reasonable to have a middle ground for certain types of cloud platform provisioning. Examples I've seen are around data pipelines on a centralized compute cluster, or core infrastructure, networking and things like that, that you change infrequently, where the platform team provides a repository that's connected up to a fully automated build and deploy pipeline.


Evan Bottcher:

But the teams who make use of the platform contribute by making code changes in a branch and then making a pull request, which is reviewed by the platform team and merged into master, and then the automation takes over and deploys whatever changed. So it's balancing the appropriate investment in platform automation and self-service against the frequency of the change that's required. For something you do once every six months, maybe you don't need an API upfront, and you wait until you observe all the different types of manual effort that the centralized platform team needs to do, and focus the investment in automation on those things.
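
The middle-ground workflow described here (consumer proposes a change on a branch, the platform team reviews, automation deploys after merge) can be modeled as a tiny Python sketch. This is a toy in-memory model, not a real Git or CI integration; `PlatformRepo`, `PullRequest` and the example subnet value are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    """A consumer team's proposed change to shared infrastructure configuration."""
    author_team: str
    change: dict
    approved_by: Optional[str] = None
    merged: bool = False


class PlatformRepo:
    """Toy model of a platform-owned repository wired to an automated deploy pipeline."""

    def __init__(self, platform_team: str):
        self.platform_team = platform_team
        self.config = {}    # configuration on master
        self.deployed = {}  # what the automated pipeline has applied

    def open_pull_request(self, author_team: str, change: dict) -> PullRequest:
        # Any consuming team may propose a change; nothing is applied yet.
        return PullRequest(author_team=author_team, change=dict(change))

    def review(self, pr: PullRequest, reviewer: str) -> None:
        # The guardrail in this model: only the platform team can approve.
        if reviewer != self.platform_team:
            raise PermissionError("only the platform team reviews infrastructure changes")
        pr.approved_by = reviewer

    def merge(self, pr: PullRequest) -> None:
        if pr.approved_by is None:
            raise RuntimeError("pull request has not been reviewed")
        self.config.update(pr.change)
        pr.merged = True
        self._deploy()  # automation takes over after merge

    def _deploy(self) -> None:
        # Stand-in for the build and deploy pipeline.
        self.deployed = dict(self.config)


repo = PlatformRepo(platform_team="platform")
pr = repo.open_pull_request("checkout", {"checkout_subnet": "10.0.4.0/24"})
repo.review(pr, reviewer="platform")
repo.merge(pr)
```

The consuming team writes the change itself, so the platform team's manual effort shrinks to a review, which fits changes that happen too rarely to justify a dedicated API.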


Neal Ford:

But it strikes me as a good way to evolve capabilities too, if you allow teams that need a new capability to deliver that to the centralized platform through a pull request, and it becomes available to other teams over time. And so that's kind of an on-demand way of evolving capabilities from a centralized store.


Zhamak Dehghani:

Yeah, absolutely. I think we talk about harvesting quite a lot, as opposed to building up front. But to be able to even harvest what people are building, you need to give some sort of a framework for that harvest to be contributed back to the shared infrastructure, and pull requests are a good place to start.


Neal Ford:

Well, we could definitely talk about this subject for quite a while, but the reason we're in San Francisco is to actually build a radar and we have to go do that now. So we want to thank Zhamak and Evan for their great contributions this morning.


Mike Mason:

Yeah, thanks very much. Thanks for listening.


Neal Ford:

We'll see you next time.


Zhamak Dehghani:

Thank you.
