
Durable computing: Building resilience into distributed systems

Durable computing is a technique that makes it easier for teams to address resiliency and reliability in distributed systems. It involves handling state in such a way that requests can always be run to completion, reducing the risk of failure or downtime.

 

Although it isn’t a new idea, dating back to the 1970s and Jim Gray’s innovations in database transactions, it’s becoming more important today in an age of microservices and increasingly complex distributed systems. And, with agentic AI set to potentially reshape the entire digital landscape, ensuring systems can handle complex workflows is only going to become a bigger concern for engineering teams.

 

Let’s take a deeper look at why it matters and its relevance in 2025.

Durable computing: a faster route to resilience

 

Although system complexity is a key driver of durable computing, it’s also being motivated by growing pressure on resources — personnel, time and money. This is because, while it’s perfectly possible to build resiliency and reliability into complex systems, doing so is expensive and requires significant up-front investment.

 

As Thoughtworker Brandon Cook explained, “organizations are gravitating towards durable computing because they don't want to have to build all of that resiliency into their systems. It's a huge lift — you need a whole other platform team just to enable folks to do these loosely coupled event-driven-type patterns.”

 

So, while a platform engineering team can provide the foundations for resiliency and consistency, durable computing products and platforms enable — as Cook himself puts it — “faster delivery of independently evolving services without having to pay the cost of building in a lot of the resiliency in it.”
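
To make that concrete, here’s a minimal, illustrative sketch of what handing resiliency over to a durable execution platform can look like, written against Temporal’s Python SDK (one of the products discussed below). The workflow and activity names, payloads and timeouts are invented for this example; the point is that the platform checkpoints each step, so a request can resume and run to completion even if the process handling it crashes part-way through.

```python
# A minimal sketch of the durable execution idea using Temporal's Python SDK.
# The workflow's progress is recorded as durable state, so if the process
# crashes part-way through, the platform replays the history and resumes from
# the last completed step. Names and timeouts here are illustrative only.
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def charge_payment(order_id: str) -> str:
    # Would call an external payment service; may fail and be retried.
    return f"receipt-for-{order_id}"


@activity.defn
async def ship_order(order_id: str) -> str:
    # Would call a fulfilment service.
    return f"shipment-for-{order_id}"


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Each activity call is checkpointed and retried by the platform,
        # so the workflow as a whole runs to completion despite transient
        # failures in the services it depends on.
        await workflow.execute_activity(
            charge_payment, order_id, start_to_close_timeout=timedelta(seconds=30)
        )
        return await workflow.execute_activity(
            ship_order, order_id, start_to_close_timeout=timedelta(minutes=5)
        )
```

Actually running this would also require a Temporal server and a worker process; the snippet only shows the workflow definition, not the deployment.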

The evolving durable computing tooling ecosystem

 

The challenges of state management and resilience are well-known, but the emergence of products that specifically respond to these challenges has made the term itself more prominent in industry conversations.

 

Cook sees internal teams at large organizations, working on the challenges of distributed systems, as a key driver of the current durable computing landscape. There are a number of examples: Temporal has its origins in Uber, while Apache Airflow comes from a team at Airbnb. The open source Conductor was developed inside Netflix, and its creators went on to found Orkes, its enterprise version. Clearly, given the scale at which such companies operate, the challenges of distributed systems are particularly pronounced.

 

In a not dissimilar fashion, the team behind Apache Flink also used their own experiences — and user challenges — to drive the development of Restate. (The team has written specifically on the story behind the tool.) 

 

Another durable computing platform worth calling out is Golem. We featured Golem in Vol. 31 of the Technology Radar in the latter part of 2024, highlighting the fact that it’s backed by a WebAssembly runtime. As the Golem team explains in a blog post, “WASM gives Golem Cloud the capability to make programs in any programming language invincible — a feat that would be completely impossible with machine code, due to its highly unconstrained nature.”

 

There are durable computing offerings from major cloud vendors too, with key examples being AWS Step Functions and Azure Durable Functions. However, these are more specifically focused on building and orchestrating serverless workflows within their respective ecosystems. They offer a nice way into durable computing if you’re an Azure or AWS customer, but don’t necessarily represent the cutting edge of the field.

The limitations of durable computing 

 

Durable computing has many benefits for teams managing complex and intensive workflows. However, while there are clearly advantages to circumventing the need to build out resiliency in a centralized manner, leveraging platforms that offer such capabilities can create some degree of lock-in.

 

Cook talks about the risks of “getting locked into opinionated framework platform patterns,” whereby the immediate advantages of a platform that seemingly does everything for you from a durability standpoint end up making you inflexible.

 

Of course, the less opinionated the platform, the more that needs to be done by an engineering team. “I guess down the stack there's less opinionated [durable computing] frameworks. But then you have to identify whether you want to use the cloud services of these platforms, whether they're hosted elsewhere, or whether you want to host it yourselves.”

 

In other words, it’s a question, as always, of trade-offs. 

 

“It's also not all sunshine and rainbows where you don't have to think about any of the key patterns,” Cook warns. “There's things like idempotency that teams still need to build into their services. There's still things like understanding how the workflows interact with each other so that you're designing your services or individual workflows or different patterns to work well in the system. Then there's also interesting failover and multi-region difficulties that a lot of these orgs that built like a lot of the platforms themselves are starting to build out as well.”
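
As a hypothetical illustration of the kind of idempotency Cook is referring to, the sketch below de-duplicates a payment request using a caller-supplied idempotency key, so that a retried or re-delivered step doesn’t charge a card twice. The store and the charge_card function are stand-ins invented for this example, not part of any particular platform’s API.

```python
# Hypothetical sketch of the idempotency teams still have to build themselves:
# a durable platform may re-deliver a step after a failure, so the service has
# to make sure the side effect (here, charging a card) only happens once per
# request. The in-memory dict stands in for a durable store; charge_card is a
# stand-in for a real payment call. Neither belongs to any platform's API.
completed: dict[str, str] = {}


def charge_card(order_id: str, amount_cents: int) -> str:
    # Imagine this actually charges a card and returns a receipt ID.
    return f"receipt-{order_id}-{amount_cents}"


def handle_payment(idempotency_key: str, order_id: str, amount_cents: int) -> str:
    # If this key has been processed before, return the original result
    # instead of charging the card a second time.
    if idempotency_key in completed:
        return completed[idempotency_key]

    receipt = charge_card(order_id, amount_cents)
    completed[idempotency_key] = receipt
    return receipt
```

In a real service the completed-requests store would be a database or cache shared across instances, rather than an in-process dictionary.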

Being able to build durability into your agentic architecture earlier on is an important place of exploration at the moment.
Brandon Cook
Principal Software Engineer, Thoughtworks

Durable computing in the era of agentic AI

 

One of the reasons durable computing matters today is how it intersects with agentic AI. The link should be clear: agentic AI is geared towards handling complex workflows, making decisions based on diverse and dispersed sources of data. The success of AI agents, and indeed their reliability, depends on system durability.

 

“I've seen a lot of folks start thinking these durable computing frameworks could be great for agentic architectures as well, because they're very workflow driven,” Cook says. “Being able to build durability into your agentic architecture earlier on is an important place of exploration at the moment.”

 

It shouldn’t be surprising, then, that the likes of Golem and Orkes are leaning into AI in their marketing. The first thing you see on Golem’s landing page is the words “the AI orchestration platform”, with the promise to help users “deploy secure AI apps that run reliably, remember everything, and scale effortlessly.” Orkes, meanwhile, describes itself as “the enterprise platform for building highly reliable applications and AI agents.” This could well be where we see durable computing really take off, helping mitigate some of the well-known risks of agentic AI. Just as Docker helped to popularize microservices, maybe durable computing platforms will drive adoption of AI agents.

Putting durability at the heart of innovation

 

Innovation in technology is largely about handling increasing complexity in faster and more elegant ways. That challenge will remain; however extensive the agentic AI revolution really is, core issues around stability and reliability will still need to be addressed. In fact, those issues may well come more clearly into view and become even more urgent for software development teams. 

 

If that’s the case, we’ll likely be hearing much more about durable computing across the industry. How the field evolves will be something we’ll be watching closely.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
