Brief summary
Managing distributed systems and complex workflows can be challenging. What happens when something fails? If a task isn't executed to completion, that can lead to serious problems. From transaction and billing failures to deploying software, even small issues can have significant consequences.
This is one of the motivations for durable computing. Designed to isolate code from crashes, it preserves state so a task completes even when something fails.
Brandon Cook and John Coleman join host Alexey Boas on the Technology Podcast to discuss durable computing, why it matters today and how we've been using it at Thoughtworks. They dive into the current platform ecosystem, what it means for developers and what it requires of them. They also talk about the value of durable computing for AI, explaining why the concept of 'durable agents' offers an important avenue of investigation in a world eager to embrace agentic systems.
Learn more about durable computing in this blog post from July 2025.
Alexey Bôas: Hello, and welcome to the Thoughtworks Technology Podcast. My name is Alexey. I am one of your regular hosts, and I'm speaking to you from São Paulo in Brazil. This time around, we're here to talk about durable computing, and I am thrilled to have with us Brandon Cook and John Coleman to help us navigate through this very interesting topic. Hello to both of you. Brandon, maybe, would you mind introducing yourself?
Brandon Cook: Yes. I'm Brandon Cook, principal software engineer at Thoughtworks, based out of New York. Excited to chat today.
Alexey: It's amazing to have you with us. Thank you so much for joining. How about you, John?
John Coleman: Hi. Yes, my name's John Coleman. I'm from the Bangkok Thoughtworks office, and I'm a lead consultant.
Alexey: Amazing. Thank you so much for being here with us to talk about this. Maybe we can get started by talking about some motivation, if you don't mind. The foundations of the topic go back to the '70s, right? ACID properties, two-phase commits. Why are we discussing these things now in the context of distributed systems? What has changed, or what is relevant to the topic at the moment?
Brandon: I got started in the durable computing space after an assessment with a client. They were focused on, "Oh, are we building the right event-driven architecture patterns in our system?" Right? What we were finding is that they had all the good patterns and principles nailed down for more or less the happy path, but were really lacking in the sad paths or the failure paths.
Then that's where a lot of the durable computing comes into play: building in those complex event-driven patterns like event replay, recovering gracefully from different failures, and being able to continue things in your distributed system if something goes down. Those are the main key outcomes you get from durable computing. It offloads that operational burden from your individual teams so they can focus on the design of the business domain and business functions rather than having to re-solve patterns that are well known across the industry, right?
Alexey: Yes, that's cool. That makes a lot of sense, and the architectures are becoming "more distributed" in a way, so more microservices, and the scale of those kinds of things. Maybe before we go any further, maybe we can try to explain to the audience or define the concept itself. If we go back to the question, what is durable computing, how can we explain that in a simple way, maybe?
John: I think there's some slightly different angles on it. When we had some discussion before about this, for me, I tend to focus on the state part of it. It's this ability for a program to recover its state and continue from where it left off. It depends on the implementation. There's different ways that that's achieved, but essentially it's that ability to recover the process and continue from where it left off and, usually, to guarantee the process completes.
The typical application, if you think of a workflow, is that you might want to call various systems. You might have two or three other APIs that you want to call something like that in a sequence or a chain, and you want to guarantee you get to the end and that, also, the chain of events is correctly handled at each step.
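The guarantee John describes can be made concrete with a short sketch. This is plain, illustrative Python, not any platform's real SDK; the journal class, step names, and workflow are all made up. The core idea is that each step's result is recorded before the workflow moves on, so a restart replays recorded results instead of re-running side effects:

```python
# Illustrative sketch (not a real SDK): a workflow journals each step's
# result so that, after a crash, replay returns recorded results instead
# of re-running the side effect. All names here are hypothetical.

class DurableJournal:
    """Persists step results; in practice this would live in a database."""
    def __init__(self):
        self.entries = {}      # step name -> recorded result
        self.executions = []   # audit of steps that actually ran

    def run_step(self, name, fn):
        if name in self.entries:          # completed before the crash:
            return self.entries[name]     # replay the recorded result
        result = fn()                     # first time: do the side effect
        self.executions.append(name)
        self.entries[name] = result       # journal it before moving on
        return result

def payment_workflow(journal):
    journal.run_step("reserve", lambda: {"order_id": 42})
    journal.run_step("charge", lambda: {"amount": 100})
    return journal.run_step("confirm", lambda: {"ok": True})

journal = DurableJournal()
payment_workflow(journal)   # each step actually runs once
payment_workflow(journal)   # "crash and restart": pure replay, no re-execution
assert journal.executions == ["reserve", "charge", "confirm"]
```

Real platforms differ mainly in where the journal lives and how much state they capture, which is the granularity point discussed next.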
Alexey: It's a lot about coordination and making sure that something either happened or doesn't do any harm, and that you can have those kinds of guarantees in a distributed context?
John: Yes, and it can be more fine-grained than that. Some of the platforms have more superpowers, so when we talk about the state, there's levels of granularity that you can have around that. Some of the platforms are much more granular and can really recover the program and its internal memory stuff more precisely, so they're much more fine-grained. Some of them are a bit more basic, and they just record things like the effects and what were the results of the calls, and then they just start the whole process again, but will give you the effects of the calls, something like that, so there's a broad array of how it's done, but it's all aimed towards the same kind of outcome.
Alexey: That's great. Thanks, John. Just more for the sake of clarity, what kinds of guarantees are we talking about? How is this different from a simple orchestration, for example, when we think about a workflow, and/or when we compare to sagas, for example, how is it different from those kinds of patterns?
John: Well, I think there's some overlap. You'll find similar terminology, like assured delivery and once-only execution; similar terms get bounced about. There's nothing conceptually new about this. I think it's just the power of the platform to take away the pains of handling failure, retries, and recovery. It's basically taking the pain of those typically programmatic solutions and putting it on the platform.
Brandon: It's essentially that all these teams and orgs that started these durable computing platforms had the realization that in all distributed systems, we have to solve these types of problems, and they're focused on abstracting that away for users. Yes, nothing new, right? These are known problems, known things that you have to solve in distributed systems.
It's not a matter of if you're going to face that problem or not; it's more of a when, and that's where these platforms are really focused on saying, "Okay, this is going to happen. This failure is going to occur, but we're here to help you recover from that."
John: Yes. If you think about the history of how people are doing these distributed systems, quite often, if you're writing the code, you actually don't really know what to do when something fails at a particular point, so you might just cancel the whole process, and the user gets a 500 error on their API or whatever. It just breaks. It's quite challenging as a programmer to figure out what exactly you have to do.
You may not necessarily know exactly how to solve a failure at a particular point, but if you could have more assurance of the delivery, that makes your life easier as a coder as well. If we're developing these fallible systems, you're taking a lot of the coding headache away, potentially.
Alexey: Yes, and it's interesting. Brandon, you were mentioning that you came across it when you were doing an architectural assessment that looked at platforms, and John, you were talking about some of the origins. It looks like it was an emerging challenge in the industry, and based on the technologies we are using these days. We saw platforms coming out of Uber, Airbnb, Netflix, and that project by Apache. It's interesting to see that conjunction of factors that led to that, isn't that right?
Brandon: Yes, and funny enough, after that assessment, we were able to move on and create a platform team that was focused on building out a lot of these common capabilities. Yes, one of the first things that we did was basically try to assess a variety of these platforms, and there's quite a few and a lot of considerations that need to be taken into account. Yes, it is essentially a way for you to not need to have a set--
Maybe you still have a platform team that knows how to host these tools or enable teams on how to use those tools in the most effective way, but then they don't also have to go build all of the platform and orchestration components that are in these tools. As before, you can see all these organizations who came out with these platforms, if you take Temporal, for example. I think that came out of Uber's Cadence platform.
These are all startups or organizations that realized this built-in resiliency into distributed systems was needed, and so they built it, and then now they had to maintain it and run it, and that, in and of itself, it's its own product, and it becomes its own beast of itself just to say, "Okay, we're building resiliency into the system." Now it's like, "Okay, how do we take these platforms and, I guess, in a way, democratize it and share it with the rest of the industry?"
John: I worked on Akka Actors quite a few years ago. They have this replay capability, so you can persist the state as a stream and a log, and then they can replay that to recover. I think that's something that's been around quite a long time now, but I don't think anyone called it durable computing at the time, because it isn't quite that, perhaps. There's been these ideas.
We mentioned the database before as well: your ACID transactions, commit and roll back, all of that. I think there's been a progression of these developments, and now we've kind of arrived at the point where it's pulled together. You don't have to conform to some sort of model. Before, it was various pieces; now we can just write some business application code and we get the guarantees. We don't have to be thinking about all those different bits and pieces and making them work together coherently ourselves; it's all formed into one easier-to-use way of working.
Brandon: Funny enough that you mentioned Akka, because I think it highlights a lot of key context around some of these platforms: whether they are more heavily opinionated and heavyweight, or more lightweight. I would say Akka is probably more in the lightweight area of the world, because it is more of a framework that you can build into your code base. I think the organization that builds Akka is called Lightbend.
They also had this more heavily opinionated system called Kalix as well that does this sort of-- they don't brand it as a durable computing platform, but it does the same principles. It was once you're locked in and you're building in Kalix, that's the way you need to do it, similar with other event-driven, platformy-type experiences that are trying to build resiliency, like Axon or anything like that.
Some of those are like, once you start building in them, you're stuck in that way of working and that way of designing the system, so you essentially have to become an expert in that framework. The newer ones are starting to become a little more flexible in terms of how you want to work and what language you want to work in, and a little more lightweight in that regard, while still providing that key durable execution backend for you.
Alexey: Yes, the lock-in aspect is interesting. Cloud providers also have some of their own platforms. AWS has Step Functions, and Azure also has some flavor of that. That is definitely one key trade-off to keep in mind, right?
Brandon: Yes. Funny enough, the org that I was working with on the assessment, they were really itching for AWS to build durable Lambdas. They were like, "We really want this." It wasn't out yet. I think it was last year they announced those durable Lambdas at probably re:Invent or something like that. I imagine that that org is probably looking at those now because they were very much embedded and locked into AWS, but that was a decision they've made from a technology strategy, and they were quite effective at deploying and working with AWS. They saw AWS as their platform of choice.
Alexey: Yes, and that's always a trade-off, right? You're deeper into a platform. You can leverage more of the resources the platform will offer you and be more efficient. Maybe I'll ask you this, Brandon, and John, feel free to add, but building on the experience you had doing an assessment and looking at this from making those architectural decisions and considering trade-offs and those kinds of things, what should teams evaluate when looking at those technologies?
What are some of the key dimensions that are relevant, some of the questions you need to ask yourself when thinking about these kinds of things, and making those architectural decisions?
Brandon: First, I always look at the hosting model. Where are you essentially going to host this durable computing platform? Because it is literally going to store everything that you execute. Some of them will literally store everything that you're executing. Enterprises like, I don't know, financial systems and things like that, they're not going to want to use the SaaS version of these tools; they're going to want to self-host.
The operational burden of "Yes, you don't have to build it, but now you still have to host it and make sure that it's running," there's some operational cost to running that infrastructure, so looking at the hosting model and understanding what you need to do. If you're a startup, you're trying to get something going, you want to build a resilient system, maybe that SaaS product is something you'd reach for quickly.
If you're trying to self-host, then looking at, I guess, a Temporal versus a Restate from a platform perspective, maybe those folks are reaching more for Restate because it's a single binary; it's easier to deploy. I think Temporal is now making strides to make their self-hosted option a little easier to run, but I'm not sure what the developments are there. It is part of the decision criteria that you would think through.
Another thing I would assess is "What languages do your teams know? How do they actually develop on the day-to-day, and what's supported by the various platforms?" Because, at the end of the day, you're still going to need to understand what the SDKs that integrate with those platforms actually look like. Are your teams going to be able to understand them? Are they going to be able to understand the workflows and all the idiomatic aspects of these different SDKs and languages?
That's probably a key aspect to consider, and then: what workflows do you have? I think really understanding the business domain, obviously, as a first principle, still doing domain-driven design and saying, "Okay, what workflows do we have? Do we have a bunch of long-running processes? Do we have a clear, distinct workflow? Do we have workflows that fan out or fan in?" I think all of those have to be considered when choosing which of these durable computing platforms to reach for.
John: You can dig into the details of-- you mentioned the idioms. Although there's a lot of overlap between them, they can have slightly different idioms and slightly different applications. Some of the platforms cross over a lot, so you've definitely got things you can choose from. Some of them are a little bit more distinct and have particular use cases that they might be better for versus other ones. You have to dig in and understand what it is that the platforms are offering and how that fits your use case. They're not all going to be equal. In fact, they're all quite different in the whole.
Alexey: Yes, that's great. Are there obvious scenarios in which you should not be using durable computing? Are there a couple of factors to consider to say, "Hey, it's not needed in this case"?
Brandon: Yes, I think you've got to think about what your scale looks like. What uptime do you need? Can you deal with some failures and recover without hurting the business? I know we always put these in the heavy technical bucket, but at the end of the day, they're very much business-driven decisions. If you need high scalability, high recoverability, all these different aspects, maybe a durable computing platform is something you need to reach for; if not, maybe it's a little overkill for your system, and you're spending a bunch of money that you don't need to spend.
Alexey: Cool. Any considerations for testing? Obviously, we're leveraging the platform to bring a lot of resilience to the workflow and those kinds of things, but is there anything we still need to be mindful of, or other scenarios we need to consider? How does that change the testing strategy overall?
Brandon: Yes, the testing strategy, I think, changes significantly, particularly if you're just trying to think about "What does a normal unit test look like, or what does a full test of the whole program look like?" Are you going to spin up all the infrastructure for your tests to get the feedback? "What testing support do these platforms provide you?" is also probably a key decision criteria.
The last I remember, Temporal's testing support was growing significantly, giving you that fast feedback so you don't have to spin up the whole infrastructure to make sure something is working. Another key aspect is that with these systems, when you're writing the code, it may feel like you're writing synchronous code, but at the end of the day, it's still distributed. It's all async. It's all event-driven under the hood.
Maybe you don't need to deal with the event-driven pieces a lot. That also has an effect on how you want to test or how you approach testing as well, because it is a mental mind shift to test synchronous code versus more event-based or asynchronous code.
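One common way to get the fast unit-test feedback Brandon mentions is to structure workflow code so its external calls are injected, letting a test supply canned results instead of real infrastructure. This is a hypothetical sketch; the services and payloads are made up, and real platform test harnesses offer much richer support than this:

```python
# Hypothetical sketch: the workflow's external effects go through an
# injected `call` function, so a unit test can supply deterministic
# canned responses without spinning up any infrastructure.

def fulfilment_workflow(call):
    """`call(service, payload)` performs an external effect and returns its result."""
    stock = call("inventory", {"sku": "ABC"})
    if stock["available"]:
        return call("shipping", {"sku": "ABC"})
    return {"status": "backordered"}

# In a test, `call` is a fake with recorded, deterministic responses:
canned = {
    "inventory": {"available": True},
    "shipping": {"status": "shipped"},
}
result = fulfilment_workflow(lambda service, payload: canned[service])
assert result == {"status": "shipped"}

# The sad path is just different canned data, no real failure needed:
out_of_stock = {"inventory": {"available": False}}
assert fulfilment_workflow(
    lambda service, payload: out_of_stock[service]
) == {"status": "backordered"}
```

The design choice here is the same one the platforms' own test environments encourage: keep workflow logic deterministic and push all I/O behind an injectable boundary.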
John: I'm just thinking back. I did a proof of concept with Temporal. The way I did that was just using Docker containers. I ended up having to have a container for each of the external services that I was linking together in the workflow. You can imagine that quite easily becoming unmanageable if you had a substantial number of services, or an external provider. Those are problems you're going to face anyway if you're doing integration testing, but now you've added a bit more complexity to that story as well.
Alexey: Good. There's the accidental complexity, but there's always the essential complexity as well that you can't remove, right? That's part of the problem itself and of the technologies we're using.
John: Once you start testing locally and faking things, you are in a false world as well; you can't guarantee that-- At some point, you really want to get to the production thing. You always have to be progressive with your tests, and you start with something local just to check the contracts and the behaviors, and then you wishfully think it's going to work like that in production as well. You're hopeful it'll be the same. At some point, you're going to need those other services sandboxed or something like that. You face a lot of the same challenges as regular end-to-end testing.
Alexey: Good. Moving beyond testing, I'm also sure that resilience, those things the platform brings, they don't come free, so they're probably things you need to implement in the code, and some gotchas that you need to be mindful of. What are some of those things, in your experience, one needs to pay close attention to when developing under those platforms?
John: One of the ones that I've been looking at and considering is latency. There are various points at which it might fail. Some of the ways a durable computing platform works are like a full recovery: your process is going to start from the beginning again. Now, it might be that, in theory, all the effects, the async stuff, should happen quickly when it's being replayed, because it's being replayed out of some durable state in memory or on disk somewhere.
It should progress a lot faster because it's not doing the I/O anymore, but that latency is still there. If you potentially had a very long or complex process or something with a lot of computation inside it that had to be replayed, you could still have an additive effect to your latency that could be quite significant. Some of the platforms, the more granular ones, will recover more of the memory state.
They will actually genuinely continue the process where it left off, so you wouldn't have that replay latency coming into play. The other kind of latency you can get is when a node in the durable platform itself fails, and then that node has to be recovered and given its state. That recovery process itself can take a while, so that would depend on the oplog, or whatever it is behind the scenes that helps restore the state.
That can take a while to replay as well. There's various strategies you can use to try and tweak that on some of these platforms, but you are going to get some latency side effects, potentially, and of course, there's resource overhead involved with that as well.
Brandon: Another key thing is idempotency. All of these platforms rely heavily on determinism. Just as John was saying, if something spins back up and tries to replay, it's very important to have idempotency built into the domain you're building, as well as idempotency with any third parties, because the last thing you want is for something to fail, spin back up, and have a financial transaction processed twice because the platform replayed the last step it was in the middle of doing.
I think that's really key, and it's very similar to any other event-driven architecture, having idempotency as a focus. I've seen in a lot of areas where idempotency isn't considered. People just assume that, "Okay, yes, it should be fine. Happy path is working. Oh, I didn't receive the event twice, so it's been working." Then, okay, when it does happen, trying to figure out or debug becomes a very stressful nightmare there.
It's understanding what idempotency is, and understanding it primarily on the consumer side. I've seen teams focus on the producer side from an event standpoint and say, "Oh, we built idempotency into our producer," and then the consumers downstream are just like, "Oh, yes, well, they've built idempotency up there, so we don't have to worry about it." We all know that's probably not going to happen, especially with events able to fire off multiple times just by the nature of those systems.
Really focus on consumer idempotency and guard yourself against multiple events or multiple replays of the same thing happening, ensuring that the same state results each time so that state is maintained throughout the system.
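The consumer-side idempotency Brandon describes usually comes down to a deduplication key. A minimal sketch, with made-up event and consumer names (in production the processed-key set would live in a durable store, checked and updated atomically with the state change):

```python
# Sketch of consumer-side idempotency: each event carries a key, and the
# consumer records processed keys so a redelivered or replayed event
# does not change state twice. All names here are illustrative.

class IdempotentConsumer:
    def __init__(self):
        self.processed = set()   # in production: a durable store
        self.balance = 0

    def handle(self, event):
        key = event["idempotency_key"]
        if key in self.processed:
            return "duplicate-ignored"    # replay or redelivery: no-op
        self.processed.add(key)
        self.balance += event["amount"]   # the actual state change
        return "applied"

consumer = IdempotentConsumer()
event = {"idempotency_key": "txn-001", "amount": 100}
consumer.handle(event)           # first delivery applies the charge
consumer.handle(event)           # replayed delivery is a no-op
assert consumer.balance == 100   # charged once, not twice
```

This is exactly the double-charge scenario from the financial-transaction example: without the key check, the replayed event would leave the balance at 200.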
Alexey: Yes. As far as I know, you can also have long-running workflows with states, like months-long. What do you need specifically regarding, perhaps, backward compatibility, when a new version of a service comes up, and then that workflow is faced with different versions of a service available, and those kinds of things? How does that kind of thing work?
Brandon: I don't think these platforms have really built that out for you, so you need to think about what that versioning looks like. Once you deploy a new version, and maybe some long-running process, like you're saying, is still running against the old one, and then some technical failure occurs there, that can be a problem.
Another place where these durable computing platforms aren't going to save you is a business failure. If how you perceived you were doing something was incorrect, and now you need to compensate for it down the line, that goes hand in hand with the versioning aspect, right? If you have a bug in the system that needs fixing, or someone inputted incorrect data that affected downstream systems, the resiliency of the platforms isn't going to save you from that. So really consider that versioning strategy: how you're going to migrate over, which processes are running in that workflow, and how long they have been running, to ensure that everything continues and works as expected.
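A common shape for the versioning strategy is to pin each workflow run to the code version it started under, so a deploy changes behavior only for new runs while in-flight runs replay deterministically. This sketch is loosely inspired by the version-gating APIs some platforms offer, but everything here (the constant, class, and step) is hypothetical:

```python
# Sketch: long-running workflow runs record which code version they
# started under, so a deploy doesn't change the path of in-flight runs
# mid-replay. All names here are made up for illustration.

CURRENT_VERSION = 2   # bumped when the discount logic below was changed

class WorkflowRun:
    def __init__(self, version=None):
        # A brand-new run pins the current version; a replaying run
        # keeps the version it originally recorded in its history.
        self.version = CURRENT_VERSION if version is None else version

    def discount_step(self, total):
        if self.version >= 2:
            return total * 0.9   # new logic, only for new runs
        return total             # old in-flight runs stay deterministic

new_run = WorkflowRun()             # started after the deploy
old_run = WorkflowRun(version=1)    # started before the deploy, replaying
assert new_run.discount_step(100) == 90.0
assert old_run.discount_step(100) == 100
```

The key property is that a replayed old run takes exactly the branch it took originally, which is what keeps replay deterministic across deploys.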
Alexey: I'm just curious. From more of a developer mindset perspective, I know many, maybe most, developers are used to a request-response style of developing. You get a request, you produce a response, and then that's it, or everything that's related to the transaction happened within the production of that response, and you see that as a unit; you don't have to worry about other things happening in parallel.
When we're talking about these long-running processes, stateful with retries, and those things, the way of approaching and thinking about that shifts, doesn't it? What have you seen? How has that shift been for you? What have you seen in teams and the way developers approach it?
Brandon: It's a mental model shift. Obviously, moving from that more request-response type of function to understanding, "Okay, this is event-based. This event triggered this, maybe now this process has been running this long, then it kicks off something else." It's focused less on looking at, I guess, a stack trace and more on, "Okay, what's the event log or the history of events that have occurred?"
That will translate into how you design the system: really fundamentally trying to understand the workflow interactions, understanding, "Okay, we can kick this off, let it run for this time period, and know that we're still going to continue at some point once it completes," rather than always waiting for that instant feedback. It gets into debugging as well, because the debugging does feel quite different. When a failure occurs or something goes wrong, how do I actually debug the system? That's quite different as well.
Alexey: Great, great. Maybe the last topic I am quite curious about and wanted to hear your thoughts on this is, people have been talking about durable computing, some of these platforms connected to agentic development and the use of agents. What's the connection there? Why are people talking about these platforms as enabling AI orchestration and those kinds of things? What's the connection? What have you seen related to that?
Brandon: Yes, there's, I guess, a new term, new technique. Obviously, with AI, there's a million new terms and techniques popping up every second, but I guess they're calling it durable agents. You can see Temporal, Restate. I think even Vercel has their own durable workflow thing that is focused on durable agents that they've released. I think it's almost a convergence of the two technologies coming together while people are building agentic architectures and making them also distributed, potentially with multi-agent architectures and orchestration.
They're starting to realize that "Okay, what if I can't reach that LLM provider, or what if I can't search the database for some RAG operation? How do I actually then recover when those things are available again?" I think any of the human-in-the-loop interactions that you have in those systems as well, because maybe it's in the middle of a workflow, but they're waiting on a human response.
That thing could sit for days. Someone doesn't respond for days, but you don't want to have that agent up and running just waiting and listening. With these platforms, you can just have it tear down and then once someone responds, it will kick off that workflow again and spin everything back up. Yes, it's a very interesting space, and it's definitely emerging. Yes, it's exciting to see where this goes.
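The human-in-the-loop pattern Brandon describes, tearing the agent down while it waits and rehydrating it when the response arrives, can be sketched as a function that persists its state and exits instead of idling. This is purely illustrative; the state shape and function names are made up:

```python
# Illustrative sketch of a durable human-in-the-loop step: the "agent"
# persists its state and tears down instead of idling, and a later
# human reply rehydrates it and continues. All names are hypothetical.

import json

def run_agent(saved_state, human_reply=None):
    """Returns (state_to_persist, final_result); exactly one is non-None
    until the workflow completes."""
    state = json.loads(saved_state) if saved_state else {"step": "ask"}

    if state["step"] == "ask":
        # Pose the question, persist, and shut down; nothing keeps running.
        state = {"step": "await_human", "question": "Approve refund?"}
        return json.dumps(state), None

    if state["step"] == "await_human" and human_reply is not None:
        # Days later: the reply arrives, the workflow rehydrates and finishes.
        return None, f"done: {human_reply}"

    return json.dumps(state), None   # still waiting, stay torn down

saved, result = run_agent(None)            # agent asks, then shuts down
assert result is None                      # nothing running while we wait
saved, result = run_agent(saved, "approved")   # reply finally arrives
assert result == "done: approved"
```

The saved-state blob stands in for what a durable platform would journal for you; the point is that no process sits alive for days waiting on the human.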
John: It should make the life of a developer a lot easier around these agent-based solutions. You want to interact with those different systems. Maybe you want to call a Lambda function, then you want to call a database or whatever, and then you've got some APIs you need to call. Durable computing is going to make developing those agents a lot easier. It fits very nicely; it's come along in a timely fashion for the AI solutions.
Alexey: Yes, amazing, amazing. Let's keep an eye on that. It's definitely an exciting field, so let's see. Let's see how it evolves. We're coming to the end of the episode. Any parting thoughts you want to share? Any ideas for the future, where is this headed, or if someone wants to learn more about these platforms, where to start? Anything you want to share before we close?
Brandon: Yes, I guess the best way to get started is just to look up one of these platforms and start playing with them and seeing how maybe they can fit into your systems, especially if you're building a distributed system and your team is struggling with a bunch of failures and recoveries and starting to look and see if this is a viable option for you to maybe start incorporating into your system.
We listed off a bunch of them. There's Restate, Temporal, Golem. I think it's quite overwhelming at the moment, definitely. The explosion of these platforms is warranted because they do fit a need in the industry, particularly around building distributed systems.
Alexey: Is there any one of those platforms, Brandon, that would be, if you want to start here, or they just fit different needs and each of them has their applicability?
Brandon: I think maybe the easiest ones to get your head around, and maybe the most accessible, are the cloud platform ones, like Azure Durable Functions or, obviously, the AWS durable Lambdas that we mentioned before. Those may be the most accessible for you to play with, spin up, and test out. I think the other platforms, like the ones I listed before, provide a lot more of the bells and whistles that you might need in terms of observability and all these other things that the cloud providers haven't really focused on.
There's some stuff there, but there's a lot more tooling with these other platforms, so maybe it's worthwhile starting with those to get your feet wet and understand them, but then once you want to start adopting, start considering some of these other platforms.
John: Yes, you can join us in the durable computing space as well if you have questions or want to talk about these things. We're trying to gather information together to help with the process of how you might select the right technology, and what the features are. I also put together a GitHub project, which you can check out. I did a POC with Temporal. You can pull that and play with it, and it's quite simple.
I also had experimentation with Golem. Golem is quite fresh. I wouldn't say it's particularly production-ready yet, but it's a very interesting new player on the scene. It takes quite a radical and different approach to durable computing than the more established solutions. Yes, you can have a look at the code and play with it. Yes, of course, you can always check the websites and have a look at the platforms themselves and the websites. They're going to tell you everything you need to know.
Alexey: All right, then. I guess this brings us to the end of this episode. Brandon, John, thank you very much for joining. It's been an amazing conversation. Lots of fun. Thank you very much. Bye.