Brief summary
In this episode, our regular co-hosts, Mike Mason and Zhamak Dehghani, are joined by Scott Shaw, Head of Technology for Thoughtworks Australia, and James Lewis, Principal Consultant, Thoughtworks UK. They explore the challenges of delivering an effective multicloud solution and how to assess the criticality and sensitivity of workloads.
Podcast Transcript
Mike Mason:
Hello everyone and welcome to the Thoughtworks podcast. My name is Mike Mason and I'm here with my cohost.
Zhamak Dehghani:
Zhamak Dehghani.
Mike Mason:
Today we're going to be talking about multi-cloud strategy and especially within a regulated environment. We're joined by two of our extreme experts here in the room who we always, always enjoy having around. There's Scott Shaw, he's here from Sydney. Hi Scott.
Scott Shaw:
Melbourne actually.
Mike Mason:
Oh, Melbourne.
Scott Shaw:
I'll make sure that's clear. There's a lot of rivalry and Melbourne's way better.
Mike Mason:
Do you want to tell us a little bit about yourself?
Scott Shaw:
Oh, I'm the head of technology for Thoughtworks in Australia. I've lived there for many years despite my American accent. And I help to look after the technology end of our business in Australia, but also I do quite a bit of consulting as well, so mostly around technology strategy and making larger scale technology decisions.
Mike Mason:
And we also have James Lewis. That's James "microservices" Lewis with us from the UK. Do you want to say a little word about yourself, James?
James Lewis:
Yeah, hi. Hey everyone. So I'm based in the UK, been with Thoughtworks a fairly long time now, about 13 years. And much like Scott, actually I spend most of my time advising clients or helping clients to understand how best they can make decisions about technology choice, technology strategy, that kind of thing. Don't write as much code as I'd like, I think it's fair to say.
Scott Shaw:
I think we're both in that position.
Mike Mason:
Probably true for most of us actually is not writing as much code as we'd like. So the hook for today's discussion is really multi cloud. And as we've been discussing uses of multiple clouds over the past couple of days, there's been a few times, Scott, where you've kind of... you've almost face palmed over a discussion or something that reminded you of a story in the past. And so we thought multi cloud was worth talking about. So first of all, let's do 30 seconds on what multi cloud actually is and then maybe why you need to think about it in a particular way.
Scott Shaw:
Yeah, I think... so we had on the radar a time or two ago, something about poly cloud. And I think I make this distinction in my head at least between poly cloud where you have a variety of cloud providers and you choose the one that's best suited for the task, you may have a particular reason they may specialize in some way. So I think that's kind of a separate issue from do you need to have a selection of cloud providers to manage risk for some reason? And I think we're kind of moving in... in Australia at least the regulator for the financial industry, APRA, has just released some new guidance for the first time in a couple of years.
Scott Shaw:
They're sort of catching up to the industry in terms of their guidance around cloud usage and it's all about managing risk and managing... and a lot of that is managing the risk of having a single cloud provider. Which for ordinary reasons like disaster recovery or availability, a single cloud provider can probably give you everything that you're getting from your on prem solutions. But when it comes to managing prudential risk, then you really need to have, in certain cases for certain extremely critical workloads or extremely sensitive data, some strategy in place to be able to move between cloud providers if necessary.
Mike Mason:
So what you're saying is, so for disaster recovery and availability kinds of things, multiple availability zones within a single cloud provider would be okay, but for these other use cases you're talking about, it needs to be an entirely different company. So Amazon plus Google as a backup or Google as a live thing that you would be using every day. What kind of timeline are we talking on the mitigation strategy?
Scott Shaw:
That all depends on the criticality of the workload. So I'm talking about... when I say, okay, it's relative to the criticality and sensitivity of the data. We used to talk about materiality a lot and businesses were hesitant about material workloads. Workloads that were core to the functioning of the business, they were concerned about putting that in the cloud. We've kind of moved beyond that discussion. So a lot of businesses, in Australia at least, have moved material workloads into the cloud and shown the regulator that they have the right kind of risk management processes around that and that they can apply the right level of security. We're really talking about certain, a very small percentage of workloads that are extremely critical. Maybe they have sort of real time criticality where you need to be able to move over between cloud providers, and I guess it's hard to put a timeframe on it, but in days in some cases.
Mike Mason:
Could you give an example of one of those kinds of workloads?
Scott Shaw:
Well, people being able to do online banking transactions, for example. Or perhaps transferring money between banks, a critical transfer that had to happen, that was time linked, or a foreign exchange transfer or something like that.
James Lewis:
Just to interject at that point, I mean we... see, I'm in the UK so this is local to the UK, but there's been quite a lot in the press recently of banks exactly [inaudible 00:05:58] and the impact of their failure is actually... you'll read about it immediately on the Twitters. If something like we can't access our business banking accounts, we can't make payroll. That's potentially thousands of people who are going to have a... that there will be a knock on effect on, right? I mean it's, I can't pay my mortgage suddenly because I've not been paid because all this kind of stuff. So I mean, I used to... when I was working with a big Spanish bank actually, one of the... they were an awesome team and one of their sort of quite senior folks used to say, "We have to be quite careful about this bit because that's where the money is," right? Because I mean, the money doesn't sit in a vault anymore, the money sits in ones and zeros on some of these extremely critical systems.
Scott Shaw:
It's true.
Zhamak Dehghani:
And I guess the mitigation strategy or techniques that you would use, they're still applicable on a spectrum of cases. Right now we're talking about this extreme case of the core business of a large, regulated financial institution. And you may want to... for some of those have an opportunity of like live switching between one or the other. But if you think about the same requirements but for a business that is not so regulated, clients still may want to have the opportunity of changing their minds about their cloud provider and moving from one to another. So maybe the time for switching from one cloud provider to another expands, but it's still a capability that a lot of clients, I assume, would want to have.
Scott Shaw:
I think that's the main reason to consider multi cloud. The timeframe, you're right, is a lot longer. You have time to plan for those kinds of migrations. But if you have a lot of data or you have a lot of applications in one cloud provider, the time... you can't just do it instantaneously. You probably have limited bandwidth to be able to move those things. So really that's the main reason you would want to have a strategy in place is to manage the business or commercial risk of being tied in with a single vendor.
Mike Mason:
And I think I remember you also use the phrase competitive pressure as well, right? In order to be able to maintain a negotiating position with your, maybe, your main cloud provider, you would want to have plausible other options in order to kind of keep them honest while you were negotiating prices of services and so on.
Scott Shaw:
Yeah.
James Lewis:
It's interesting, right. It reminds me a little bit of the old backup restore kind of thing. It's like, "Great, we've got backups, right? We're backing up everything but we've never practiced restoring. I wonder how much..." Which almost makes the backups useless, right? Because if you can't actually restore any data, then what's the point of doing the backup? Kind of reminds me a little of that. I mean how much planning do you have to do for this? How much thought needs to go in? How much work is that stuff?
Scott Shaw:
Well, potentially a lot. And that's where you have to... that costs money and time to do that, to put that work in up front, to have a plan, to have a backup strategy whatever it is. It requires investment. And so that's why you want to be selective as to where you apply that and which workloads you want to go to that level of risk management on.
Mike Mason:
I know we've sort of seen some organizations kind of think about a multi cloud strategy and we've seen it drive them towards some behaviors that we think are not optimal. Do you want to talk about any of those behaviors? I mean, lowest common denominator, usage of cloud is an obvious one, but I think you mentioned some over the past couple of days as well.
Scott Shaw:
I think the first thing people have to consider is if they're thinking about moving something from on prem into the cloud, they have to ask themselves, is everything that they have in their architecture on prem really necessary to replicate in the cloud? It's probably... there's this idea of cloud native architecture and you probably want to be designing things quite different. And there's a tendency for people, I think, to say, "I took advantage of this feature," whether it's a particular kind of switch or load balancer or whatever they may have had in hardware on prem, it was available there. That was the thing that the capital investment had already been made and there was no additional overhead to using that feature. They never really asked themselves, is this necessary? If I'm going to cloud, I'm going to pay for everything on an individual basis. Do I really need that feature anymore?
Scott Shaw:
So that's the first thing that people need to ask. I don't know whether it's multi cloud or not, but the thing is that if you have to consider a multi cloud solution, those things become exponentially more expensive and difficult to replicate from an on prem solution into a cloud solution. So that's the first thing. I think the other thing is you need to have some rational basis of assessing the risk and criticality of your workload. Because everybody thinks their particular application is ultra critical. There's a tendency to overestimate, I guess, the criticality and sensitivity of what you have. And of course, you need to be conservative and prudent in how you manage those things, but you need to have some sort of rational predetermined way of scoring, I think, the criticality to know what sort of controls to put in place because they're expensive.
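One way to picture the "rational predetermined way of scoring" described here is a small rubric that is agreed before anyone argues about their own application. The sketch below is only an illustration: the three factors, the 1-to-5 scale, the thresholds and every name in it are assumptions made for this example, not APRA guidance and not a method described in the episode. Only the tier labels echo words used in the conversation.

```python
from dataclasses import dataclass

# Hypothetical rubric: factors, scale and thresholds are illustrative assumptions.
@dataclass(frozen=True)
class Workload:
    name: str
    customer_impact: int    # 1 (minor inconvenience) to 5 (customers cannot transact)
    time_sensitivity: int   # 1 (weeks are fine) to 5 (must recover within days or less)
    data_sensitivity: int   # 1 (public data) to 5 (highly sensitive data)


def criticality_score(workload: Workload) -> int:
    """Sum the three factor ratings after checking each is on the 1-5 scale."""
    ratings = (
        workload.customer_impact,
        workload.time_sensitivity,
        workload.data_sensitivity,
    )
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return sum(ratings)


def criticality_tier(workload: Workload) -> str:
    """Map a score (3 to 15) onto three tiers using fixed, pre-agreed thresholds."""
    score = criticality_score(workload)
    if score >= 13:
        return "extreme"      # candidates for a tested cross-provider move plan
    if score >= 9:
        return "heightened"   # middle ground: some portability controls
    return "low"              # rebuild elsewhere only if it ever becomes necessary


if __name__ == "__main__":
    for w in (
        Workload("interbank transfers", 5, 5, 5),
        Workload("monthly reporting", 3, 3, 3),
        Workload("intranet wiki", 1, 1, 2),
    ):
        print(w.name, criticality_score(w), criticality_tier(w))
```

The point of fixing the thresholds up front is that the expensive multi cloud controls then follow from the score, rather than from how critical each team feels its own system is.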
James Lewis:
Yes, interesting. I mean, it's a hedge essentially against some future downside. To play devil's advocate, I mean how actually likely... I'm going to ask the group, how actually likely are these risks to manifest? And I guess what we're saying here, the elephant in the room, is we're basically, either for commercial reasons or for risk management reasons, saying that at some point Amazon or Microsoft or Google will go out of business or decide to stop offering cloud solutions. Is there a fractionally small chance of that happening? [crosstalk 00:12:33].
Scott Shaw:
I don't think it's so much them [crosstalk 00:12:34] I don't think it's them going out of business. I think it's you as a business making a decision to move away from them. That's what the risk is. There may be some reason... yes, it's a low probability. If you're talking about certain prominent cloud vendors, then yeah, it's a low probability event. But there may be reasons: you may be in a competitive situation with them in another line of business, or there may be some kind of sovereignty concerns. So there may be some kind of foreign government interference that people are worried about in some way. I mean, there's a variety of reasons why you might want to move. And that's not just cloud providers, really, I think this is causing people to stand back and look at their suppliers, their IT suppliers as a whole, and be more cautious about getting too heavily involved with any one and creating that really strong dependence. Because there's a variety of reasons you might want to move away from a given vendor.
Zhamak Dehghani:
I didn't do what Scott said on the... I mean, the clients that I have, they haven't been on that extreme end, but they still wanted the option of being able to move from one cloud to another cloud. And one technique that we have been kind of emphasizing is automate, automate, automate. So automate all the cloud infrastructure setup, deployment of your applications, environment configurations, so that the move from one cloud to another cloud is rewriting those scripts and modifying those. Are there any other techniques? I know that we talked about how we choose different capabilities that cloud providers give us to not lock us in.
James Lewis:
I think we talked about this in the last edition of the radar, certainly in the discussions we had in the room about how much Docker and Kubernetes are going to unlock in this sort of space. But I think that goes back to what you were saying Scott, about cloud lift and shift, which is also on the radar. There are some issues around how it's not easy to suddenly deploy a 10, 8, 5, 2 year old system into a Kubernetes cluster and expect it to work, right? I mean, the classic set of features that you were sort of talking about, I mean, the classic one is, "Oh, we've got this SAN. We rely on this SAN across multi-sites to do active-active or whatever to do replication. We don't have that anymore. How do you do that in Kubernetes?" But interestingly I think as an abstraction layer, I think that is something that potentially gives us the ability to, if you build with that as a target, to move between clouds more easily I think.
Scott Shaw:
I think it gives you a walk-up kind of strategy. You can start with the cloud provider's native Kubernetes support, and that's much easier than having to build and manage your own Kubernetes cluster. And then if the time comes that you need to be able to consider portability, then you could implement your own Kubernetes infrastructure that goes across clouds. And I think there's a lot of promise there with containers and container orchestration to buy you some independence from the cloud provider. But I wouldn't go into it lightly. It sounds nice, but I think there's a lot to learn about building and operating your own cluster.
Zhamak Dehghani:
And another conversation we had around this was different services that cloud providers give us and how easy it is to pick a managed service. Like a managed event streaming capability, even if it's half-baked, versus operating your own Kafka or an open source equivalent of that. I guess that becomes another axis to evaluate lock-in with...
Scott Shaw:
I think that's the trickiest part of a multi-cloud strategy, understanding which of those platform services you want to go all in on with a single vendor. Because it's a slippery slope: if you start using one, then there's a lot of incentive to use another, and because they're often so intimately linked, you may not even have a choice. If you use one platform service, like log management, for example, you may have to consume a whole raft of other platform services. And so that's where there probably needs to be some governance or some guidelines to application developers as to which ones they are going to consume and which ones they aren't. If you're in this middle ground of moderately... I think what APRA in Australia called heightened risk or heightened criticality. It's not the extreme end, but it's not at the low end where you just want to rebuild everything in another cloud if you have to move over, it's kind of in the middle. And that's where it's a bit tricky.
Zhamak Dehghani:
And I guess another... like in terms of... I know we moved away from this idea of generic cloud. I think we called it out on the radar, that you try to abstract everything away so that at the code level you have no idea whether you're talking to an Amazon service or a Google service. And then the extreme of that could be quite, I don't know, not very effective. But there are some libraries that help you with that. A lot of libraries in the Spring ecosystem try to abstract those services away. Is that something you think we would advise?
James Lewis:
I think that's a really hard question. Because it's all context dependent, right? Apart from be very careful, as Scott was saying, consider the criticality of the systems that you... because you're always going to lose something when you're not targeting a specific provider. There's always going to be a cost to doing that. And so the generic advice around always try and target lots of different clouds is, I think, probably wrong in many situations.
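To make the "generic cloud" idea under discussion concrete, here is a minimal sketch of what such an abstraction can look like in application code. Everything in it is hypothetical: the `ObjectStore` interface, the `InMemoryStore` stand-in and the `archive_statement` function are invented for this illustration and do not correspond to any cloud SDK or to the Spring libraries mentioned. A real adapter for a given provider would wrap that provider's own client behind the same two methods.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical provider-neutral interface: only what every provider can offer."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in implementation used here so the sketch runs without any cloud account."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_statement(store: ObjectStore, account_id: str, pdf: bytes) -> str:
    """Application code depends on the interface, not on a particular provider."""
    key = f"statements/{account_id}.pdf"
    store.put(key, pdf)
    return key


if __name__ == "__main__":
    store = InMemoryStore()
    key = archive_statement(store, "acct-1", b"%PDF-example")
    print(key, len(store.get(key)))
```

The cost described in the conversation shows up in what the interface leaves out: anything one provider offers that another does not has no place behind `put` and `get`, so the application only ever sees the common subset.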
James Lewis:
I wonder whether we are sort of lagging and whether our thinking actually as an industry is lagging somewhat. Over the last couple of years we've seen so much change and it's very hard to keep up. And so there's sort of new patterns I think we're sort of seeing starting to emerge. There's a lag between what we understand that we can do with the technology we've got available now and what we can actually do-
Scott Shaw:
People's skills are catching up and-
James Lewis:
Yeah, right. The implication [crosstalk 00:20:03] things are changing.
Scott Shaw:
The tools are still... seems like we've been talking about cloud forever now. But we're kind of entering a new wave of tooling. And the Amazon APIs were really great and when we first started using them it was a revelation. And they provide a really good developer experience compared to any alternative people had at the time. Now there's a new generation of cloud APIs and developer tools that are built on all that experience and are even easier to use and provide an even better developer experience. And it's taken time for, I think, the cloud providers to understand what it is that people want and what makes it easy and what sort of operations you're going to have to do repetitively and so on. And to create that abstraction layer that you were talking about, Zhamak, means that you have to forego all of that experience that's been built up, and the understanding of how developers actually use these things. Because the cloud providers have already created a set of tools that are meant to be directly consumed by developers and people doing the deployment and operating these systems in the cloud.
James Lewis:
I guess where I was going, maybe to clarify, is that the global is lagging the local with this kind of stuff. Like a bunch of developers running code can do all this really exciting, cool new stuff, right? But then this sort of strategy is then lagging behind the abilities that we get from using a lot of this stuff. It's sad, I mean you're talking about the Australian regulation, how they just issued this new advice. I think it's the same in the UK. The regulators in the UK, the government regulators, have sort of said actually here’s our new cloud UK policy, we can go to the cloud if we want under these circumstances. But it's several years after this sort of stuff has been available and there's a lot of people who are sort of, "We can just do so much more." I wonder... one of the things I've been sort of thinking about is, what does business continuity mean now?
James Lewis:
Or what does disaster recovery mean if you're cloud native? If I can start an entire new environment, entire new kind of production data center up on demand by pushing a button, what does disaster recovery mean? We don't run active-passive anymore.
Scott Shaw:
I think people talk about resilience a lot more than they talk about disaster [crosstalk 00:22:44].
James Lewis:
But that sort of level of thinking, I'm not sure how much that's permeating through to some of the bigger organizations yet. Where we're, I think we're still, maybe... I don't know if you know this thing about Eli Goldratt and his four questions you should ask when adopting a new technology. So he's got this lovely set of things... you should ask what's the power of the new thing that you're looking to adopt.
James Lewis:
You should ask what limitations do we currently have that this new thing could help us overcome. So things like we can run always on, right? We can be resilient by just being highly available. He then sort of says, "What current rules do we have to manage the existing limitations?" And all of that stuff is built into process and built into deployment processes and risk management and all these kinds of things that we've sort of built up over time because we've had to have them to manage our data centers and deployment processes and all this kind of stuff. And then he sort of says, "But then we forget to ask the final question, which is what new rules do we need?" And this is where we're sort of lagging.
James Lewis:
That is, what new things do we need to put in place to take advantage of the new technology of the cloud? So you sort of said this, the Australian regulators just got to the point of, "Okay, well maybe we can use the cloud now, right." But the new rules really, really lag.
Scott Shaw:
And I think everybody's trying to figure out what that is.
James Lewis:
Yeah, yeah. I see this a lot in big organizations where it's not just the regulators, it's ingrained thinking. It's 20 years of thinking about how we work with a particular set of tools and technologies, our data centers. And then you try and apply the same thinking to the new stuff but-
Mike Mason:
On that note, with those four questions to ask, I'd like to thank our guests here, Scott Shaw and James Lewis. My name's Mike Mason.
Zhamak Dehghani:
And I'm Zhamak Dehghani.
Mike Mason:
And thanks for listening. Please tune into the next one.