What Does Durability Mean for Agentic Software?

2026-04-01

Durability frameworks were built for deterministic workflows. Agents are probabilistic, stateful, and long-running. Here's why the old playbook breaks and what should replace it.

The Claude Agent SDK marks a fundamental shift in software development. Unlike traditional systems where users interact with pre-written logic, these agents function as autonomous developers, capable of accessing file systems, writing code, and executing terminal commands.

This capability introduces unique infrastructure challenges. Because agents actively modify their environment, traditional multi-tenant server architectures are unsafe.

But today I want to talk about something more specific: durability. What does it mean? How does it change when the software is intelligent? And why is the playbook we've relied on for a decade suddenly insufficient?

What is durability?

I define durability as a subset of reliability: it is a promise to execute a specific set of logic on a given piece of data, regardless of underlying conditions or failures.

This guarantee is critical for the user experience: every customer input eventually produces a result, no matter what fails underneath. Architecting for durability also significantly reduces the operational burden on developers, because far less manual intervention is needed when infrastructure misbehaves.

How durability has traditionally worked

Traditional durability frameworks (e.g., Temporal) maintain state by checkpointing the results of non-deterministic operations. Network calls, file I/O, anything that talks to the outside world. These results get persisted into a centralized datastore.

The process typically follows this flow:

  1. Execution & Retry: Operations are attempted and retried according to configured policies.
  2. Persistence: Once a non-deterministic operation succeeds, its result is recorded in the execution history.
  3. Replay: If the process restarts, the framework uses the saved result from the datastore instead of re-triggering the external operation.

This mechanism ensures the system is 'replayable,' allowing it to recover and resume from a specific point of failure without duplicating side effects.

This works beautifully for deterministic workflows. A payment flow: charge card → create order → send email. If the worker crashes after charging the card, the framework replays the workflow, skips the card charge (already recorded), and picks up at order creation. Clean, predictable, battle-tested.

Why agents break this model

Here's the fundamental problem: agents are not deterministic workflows.

When you tell an agent to 'Analyze the logs,' it doesn't follow a fixed execution path. Run 1, it might call Tool_Search_Logs. Run 2, given the exact same input, it might decide Get_Request_Ids is the better first move. The model is probabilistic by nature. Every invocation can produce a different sequence of actions.

Traditional durability frameworks assume that if you replay the code from the beginning, it will make the same decisions it made last time. The checkpoint data fills in the results, and execution proceeds identically. Agents violate this assumption at a foundational level.

But that's just the beginning. The architectural mismatches run much deeper.

Agents are stateful, locally

The Claude Agent SDK is architected for local persistence. It relies heavily on a hidden local directory (.claude/) to store critical session context, interaction history, and resume flags. These session files are preserved locally and are not synchronized to Anthropic's servers.

.claude/
├── debug/
├── plans/
├── projects/
│   └── -mnt-c-code-project/
│       ├── 42a94450-751b-41b2-8660-532bb470e4c8.jsonl
│       ├── agent-a40f53c.jsonl
│       └── ...
├── session-env/
├── shell-snapshots/
├── statsig/
├── telemetry/
└── todos/

Traditional durability frameworks assume that if Worker A crashes, the workflow can resume on Worker B seamlessly, because state is stored centrally, not on the worker. But when a Claude Agent crashes, Worker B wakes up without the .claude/ directory. It has no session context, no conversation history, no resume flags. It's starting blind.
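One mitigation is to treat the session directory itself as the artifact to persist: archive it to object storage at checkpoint time and rehydrate it on the replacement worker before resuming. A minimal sketch, with an in-memory dict standing in for S3/GCS and illustrative paths:

```python
import io
import tarfile
from pathlib import Path

remote_store: dict[str, bytes] = {}  # stand-in for an object store (S3, GCS, ...)

def snapshot_session(session_dir: Path, session_id: str) -> None:
    """Archive the local session directory and push it to durable storage."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(session_dir, arcname=".")
    remote_store[session_id] = buf.getvalue()

def restore_session(session_dir: Path, session_id: str) -> None:
    """On a fresh worker, rebuild the session directory before resuming."""
    session_dir.mkdir(parents=True, exist_ok=True)
    buf = io.BytesIO(remote_store[session_id])
    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
        tar.extractall(session_dir)

# Worker A: write some session state, snapshot it, then "crash".
a = Path("worker-a/.claude/projects/demo")
a.mkdir(parents=True, exist_ok=True)
(a / "session.jsonl").write_text('{"role": "user", "content": "Analyze the logs"}\n')
snapshot_session(Path("worker-a/.claude"), "sess-1")

# Worker B: starts with no .claude/ directory and restores it from the store.
b = Path("worker-b/.claude")
restore_session(b, "sess-1")
assert (b / "projects/demo/session.jsonl").exists()
```

Even this sketch exposes the hard part: someone has to decide *when* to snapshot, because the SDK mutates the directory continuously while the agent runs.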

Agents are long-running

Kubernetes defaults to a 30-second termination grace period. Agentic workflows often run for minutes. If a pod gets evicted during a scale-down event and the agent cannot finish within the grace window, the process is killed mid-thought. The session state on local disk is gone. The durability framework retries on a surviving pod that lacks the context to continue. Scaling down infrastructure actively corrupts running jobs.
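The least a worker can do is catch the SIGTERM that precedes the kill and flush session state within the grace window. A sketch of that shutdown hook, with the eviction simulated in-process and the checkpoint itself stubbed:

```python
import os
import signal
import time

shutdown_requested = False

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first, then SIGKILL once the termination
    # grace period (30s by default) expires. This handler is the one
    # chance to checkpoint before the hard kill.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, on_sigterm)

def run_agent_step() -> None:
    time.sleep(0.01)  # stand-in for one agent turn

checkpoints = []

def agent_loop(max_steps: int) -> None:
    for step in range(max_steps):
        run_agent_step()
        if shutdown_requested:
            # Flush session state to durable storage while we still can,
            # so a surviving pod can resume instead of starting blind.
            checkpoints.append(step)
            return

# Simulate an eviction: the kubelet's SIGTERM arrives before the loop runs.
os.kill(os.getpid(), signal.SIGTERM)
agent_loop(max_steps=100)
assert checkpoints == [0]  # checkpointed at the first safe boundary
```

Note the limitation: this only helps if an agent turn finishes inside the grace window. A single long tool call or model invocation can blow straight past it, which is why graceful shutdown alone does not make agents durable.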

Agents generate massive context

A single agent response can easily exceed 100k tokens. Traditional durability frameworks are designed to transport small parameters, not multi-megabyte conversation histories. When you try to persist the full context of an agentic session through a checkpointing system designed for payment flows, you hit hard limits fast.
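The standard workaround is the claim-check pattern: persist the oversized context in blob storage and pass only a small reference through the checkpoint system. A sketch, with an in-memory dict standing in for real object storage:

```python
import hashlib

blob_store: dict[str, bytes] = {}  # stand-in for S3/GCS

def put_context(messages_json: bytes) -> str:
    """Store the full conversation outside the workflow history and return
    a small content-addressed reference (the 'claim check')."""
    ref = hashlib.sha256(messages_json).hexdigest()
    blob_store[ref] = messages_json
    return ref

def get_context(ref: str) -> bytes:
    return blob_store[ref]

# A multi-megabyte conversation history...
history = b'{"role": "assistant", "content": "..."}' * 100_000

# ...travels through the checkpoint system as a 64-character reference.
ref = put_context(history)
assert len(ref) == 64
assert get_context(ref) == history
```

This keeps payloads under the framework's limits, but at a cost: the durability framework can no longer see or validate the state it is supposedly guaranteeing, and every activity boundary pays a round trip to blob storage.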

The 'retry' isn't a retry

In traditional systems, if an activity fails, you retry it with the exact same input. Same arguments, same expected behavior. But agents have session memory.

Consider two failure modes:

Timeout: The prompt 'Start Research on Temporal' reached the model. Claude started thinking. The activity timed out waiting for the answer. If you send 'Start Research on Temporal' again on the same session, Claude now has duplicate context. If you send 'Continue' instead, that works. But 'Continue' is a different input.

Network Failure: The prompt never reached the model, but a session was already created in the first system call. If you send 'Continue Research,' Claude is confused. It never received the first message. It has nothing to continue.

The correct retry prompt is not just different from the original. It's dependent on the specific error. Traditional frameworks have no mechanism for this.
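What an agent-aware framework needs instead is an error-dependent resume policy. A sketch of the two failure modes above; the prompts and error classes are illustrative, not a real API:

```python
def resume_prompt(original_prompt: str, error: Exception, session_exists: bool) -> str:
    """Pick the retry input based on *how* the previous attempt failed.
    Classic retry policies ('re-send the same input N times') cannot
    express this dependency."""
    if isinstance(error, TimeoutError) and session_exists:
        # The model received the prompt and was mid-response:
        # re-sending the original would duplicate context in the session.
        return "Continue"
    if isinstance(error, ConnectionError):
        # The prompt never arrived; the session may exist but is empty,
        # so we must send the original message, not a continuation.
        return original_prompt
    return original_prompt  # default: treat the prompt as never delivered

assert resume_prompt("Start Research on Temporal", TimeoutError(), True) == "Continue"
assert resume_prompt("Start Research on Temporal", ConnectionError(), True) == "Start Research on Temporal"
```

The function is trivial; the point is where it lives. The retry input is computed from the error, so the retry logic must sit next to the agent session, not in a generic retry policy configured once at deploy time.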

Durability vs. reliability

This is where the distinction matters.

Durability is the promise that a specific piece of logic will execute to completion on a given input. It's an infrastructure concern. It answers: 'Did the thing run?'

Reliability, in the agentic world, is broader. It's not just about uptime. It's about outcome. The definition of a 'successful run' isn't that the code completed without crashing. It's that the agent actually accomplished the task correctly.

An agent can run to completion, never throw an exception, never crash, and still completely fail the user. It can hallucinate an API that doesn't exist. It can get stuck in a reasoning loop. It can drift from the original objective. It can fill its context window with garbage and start producing incoherent output.

Traditional durability says: 'The workflow finished.' Agentic reliability asks: 'Did the agent actually solve the problem?'

The future of reliability for intelligent software is no longer just about retrying failed infrastructure. It's about detecting confusion, diagnosing cognitive failures, and applying targeted recovery. Not blind retries.

Rethinking the checkpoint

Here's the insight that changes everything: in the agentic paradigm, the persisted messages array effectively replaces the instruction pointer as the source of truth.

The state is not held in the execution path of the code. It's in the accumulated history of the conversation. By treating the context window as a durable log, we eliminate the need for a framework to replay code and intercept side effects during recovery.

Instead, we simply reload the history. The agent observes its previous tool outputs and naturally determines the next logical step. The model creates its own 'instruction pointer' by analyzing the conversation history. It doesn't need the surrounding code to re-execute in a specific order. It reads the transcript of tools called and results received, then generates the next logical step as if no interruption occurred.

This replaces rigid activity wrappers with a simpler, data-centric form of resilience.
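Concretely, recovery reduces to 'append every turn to a durable log, then reload and continue.' A sketch with a stubbed model standing in for Claude:

```python
import json
from pathlib import Path

LOG = Path("messages.jsonl")

def append_message(msg: dict) -> None:
    """The messages array *is* the checkpoint: every turn is appended to a
    durable log before execution proceeds."""
    with LOG.open("a") as f:
        f.write(json.dumps(msg) + "\n")

def load_messages() -> list[dict]:
    if not LOG.exists():
        return []
    return [json.loads(line) for line in LOG.read_text().splitlines()]

def next_step(messages: list[dict]) -> dict:
    # Stand-in for the model: in reality the agent reads the transcript of
    # tool calls and results and decides the next action itself.
    last = messages[-1]
    if last.get("type") == "tool_result":
        return {"role": "assistant", "content": "summarize findings"}
    return {"role": "assistant", "type": "tool_call", "content": "search_logs"}

# First process: a user prompt and a tool result are logged, then it crashes.
LOG.unlink(missing_ok=True)
append_message({"role": "user", "content": "Analyze the logs"})
append_message({"role": "tool", "type": "tool_result", "content": "3 errors found"})

# Recovery: no code replay, no intercepted side effects. Reload the log and
# ask the model for the next step as if no interruption occurred.
messages = load_messages()
step = next_step(messages)
assert step["content"] == "summarize findings"
```

Notice what is absent: there is no execution history to replay and no requirement that the surrounding code be deterministic. The only invariant is that the log is written before anything else happens.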

This is Part 1 of a series on building durable infrastructure for AI agents. In Part 2, we'll walk through every specific way Temporal, the industry standard for durable execution, breaks down when you put an AI agent inside it. Not in theory. We tried it.

Exosphere is building the reliability runtime for AI agents in production →