Why Temporal Breaks When You Put an AI Agent Inside It
2026-04-03
Temporal is the industry standard for durable execution. We tried using it as the durability layer for Claude Agents. Here are the 12 specific ways it falls apart.
Temporal excels as a reliability engine for deterministic workflows. But the dynamic nature of agents calls for an architecture ready to handle non-determinism. We explored using Temporal as the durability layer for the Claude Agents SDK and hit wall after wall.
This is not a critique of Temporal as a product. It's a documentation of what happens when you take a framework built for payment flows and try to run an autonomous coding agent inside it.
1. The file system checkpoint mismatch
Temporal assumes that if a Worker crashes, the Workflow can resume on a different machine seamlessly, because the state is stored in the Temporal Service, not the worker.
The Claude Agents SDK does not work this way. It relies heavily on a hidden local directory (.claude/) to store critical session context, interaction history, and resume flags. These session files are preserved locally and are not synchronized to Anthropic's servers.
.claude/
├── debug/
├── plans/
├── projects/
│   └── -mnt-c-code-project/
│       ├── 42a94450-751b-41b2-8660-532bb470e4c8.jsonl
│       ├── agent-a40f53c.jsonl
│       └── ...
├── session-env/
├── shell-snapshots/
├── statsig/
├── telemetry/
└── todos/

When Temporal migrates a failed workflow to a new worker, this local directory is left behind on the failed machine. The agent wakes up on the new machine lacking the necessary context to resume, leading to logical errors or a forced restart from scratch.
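One way to see the mismatch is to imagine what a checkpoint/restore step would have to do by hand. This is a minimal sketch, assuming you archive the session directory to durable storage after each step (the function names and the plain directory standing in for an object store are our inventions, not part of any SDK):

```python
import tarfile
from pathlib import Path

def checkpoint_session(claude_dir: Path, store: Path, session_id: str) -> Path:
    """Archive the local .claude/ directory so another worker can restore it."""
    archive = store / f"{session_id}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(claude_dir, arcname=".claude")
    return archive

def restore_session(archive: Path, workdir: Path) -> Path:
    """Unpack an archived session on a fresh worker before resuming the agent."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(workdir)
    return workdir / ".claude"
```

Nothing in Temporal triggers this for you: you would have to wire these calls into every activity boundary yourself, and pay the upload/download cost on every step.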
2. Eviction risks with auto-scaling
Standard Kubernetes Horizontal Pod Autoscalers and KEDA operate on the assumption that all pods are stateless and fungible. Any pod can be terminated at any time without data loss, as long as total capacity remains sufficient.
A running Claude Agent violates this assumption. Once a workflow starts on a specific worker, that worker becomes stateful. It holds the unique .claude/ session context and the local file history required for that specific user request.
The failure mode plays out in three steps:
- Arbitrary Termination: The scheduler may evict a pod that is mid-execution of a long-running agent task.
- Grace Period Mismatch: Kubernetes defaults to a 30-second termination grace period (SIGTERM to SIGKILL). Agentic workflows often run for minutes. If the agent cannot finish and checkpoint its state within 30 seconds, the process is killed.
- Recovery Failure: When Temporal detects the node death and retries the task on a surviving pod, that new pod lacks the local filesystem state of the evicted pod. The agent resumes with an empty context, leading to immediate failure or hallucinated actions.
Temporal workers are designed to stop accepting new tasks on SIGTERM. They stop polling but continue processing current tasks. However, if the task takes longer than the terminationGracePeriodSeconds (default 30s), Kubernetes sends a SIGKILL. Since agentic tasks like 'Refactor this codebase' can take minutes, the default grace period is insufficient. Your agents get killed mid-thought during a scale-down event.
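The drain behavior itself is easy to sketch. This toy worker (our own illustration, not Temporal SDK code; Unix-only signal semantics) stops accepting work on SIGTERM but keeps its in-flight task, which is exactly the part that gets SIGKILLed when the grace period expires:

```python
import os
import signal

class GracefulWorker:
    """Sketch: stop polling for new tasks on SIGTERM, drain the current one."""

    def __init__(self):
        self.accepting_tasks = True
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Temporal workers behave similarly: stop polling, finish in-flight work.
        # But if the in-flight agent task outlives terminationGracePeriodSeconds,
        # Kubernetes follows up with an unblockable SIGKILL.
        self.accepting_tasks = False

worker = GracefulWorker()
os.kill(os.getpid(), signal.SIGTERM)
print(worker.accepting_tasks)  # -> False: draining, but the long task is still running
```

The fix is operational, not code: raise `terminationGracePeriodSeconds` well above your worst-case agent step, and accept that the scheduler holds the node that long.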
3. The sticky worker problem
Let's assume we route a user's request to a specific worker to use the local file system via sticky session affinity. While this ensures the agent can find its previous files, it creates a critical performance vulnerability.
The scenario: User A starts a session and is pinned to Worker 1. They perform a few tasks and pause to read the output. During this pause, User B starts a session and is also assigned to Worker 1. User B submits a heavy, blocking task like 'Refactor this entire repository.' User A returns to send a simple, quick follow-up command.
Even though other workers in the fleet are completely idle, User A is forced to wait. Their request is stuck in Worker 1's local backlog, blocked behind User B's heavy computation. A user's experience is no longer determined by system capacity, but by the luck of who they are sharing a worker with.
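A toy FIFO simulation makes the head-of-line blocking concrete (durations and task names are made up; the point is only the ordering):

```python
from collections import deque

def completion_times(queue):
    """Serial FIFO execution on one sticky worker: list of (task, finish_time)."""
    t, out = 0.0, []
    for name, duration in queue:
        t += duration
        out.append((name, t))
    return out

# User B's 300s refactor lands first; User A's 1s follow-up is pinned behind it.
worker1 = deque([("B: refactor repo", 300.0), ("A: quick follow-up", 1.0)])
print(completion_times(worker1))  # A's 1-second request finishes at t=301
```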
4. Streaming
A superior user experience relies on low-latency streaming. Users should see the agent's thought process and code generation unfold live, rather than waiting for a bulk response.
By default, a Temporal Activity accepts an input, processes it for the duration (seconds or minutes), and returns a single final result upon completion. There is no efficient native mechanism to stream partial results from inside a running Activity back to the Workflow, and therefore to the Client, in real time.
You could technically use Temporal Signals to push updates, but sending a signal for every generated token (roughly 50 to 100 signals per second) would bloat the Workflow History and overwhelm the Temporal database.
To bypass this, developers often introduce a side-channel, pushing real-time tokens directly from the Worker to the Client via Redis, WebSockets, or S3 while the Activity runs. If a worker crashes mid-stream, Temporal will retry the activity from the beginning, but it has no awareness of what data was already delivered to the user via the side-channel.
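The side-channel pattern, in miniature (an in-memory queue stands in for Redis pub/sub, and `agent_activity` is a stand-in, not a real Temporal activity):

```python
import queue

# Stand-in for Redis/WebSockets: tokens flow here, invisible to Temporal.
side_channel: "queue.Queue[str]" = queue.Queue()

def agent_activity(prompt: str) -> str:
    """Pretend activity: streams tokens out-of-band, returns only the final text."""
    tokens = ["Analyzing", " the", " logs", "..."]
    for tok in tokens:
        side_channel.put(tok)   # the client subscribes here for live output
    return "".join(tokens)      # only this single blob enters Temporal history

final = agent_activity("Analyze the logs")
streamed = []
while not side_channel.empty():
    streamed.append(side_channel.get())
```

Note what this buys and loses: the client sees tokens live, but if the activity is retried, Temporal re-runs it from token zero with no knowledge of what the side channel already delivered, so the client sees the stream repeat.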
5. The replay problem
Orchestrating complex agent behaviors faces a rigid architectural ceiling. We cannot spawn sub-activities from within a running activity. Triggering Child Workflows directly from an activity is generally discouraged due to visibility and lifecycle management issues.
To achieve multi-step orchestration where an agent needs to dynamically spawn sub-tasks, we explored embedding the Claude Agents SDK directly within the Temporal Workflow definition. This introduces a critical conflict with Temporal's core execution model.
By design, Temporal expects the workflow execution to be deterministic in each replay. The Claude Agents SDK, by nature, is probabilistic.
Here's what happens:
- Run 1: You ask Claude to 'Analyze the logs.' Claude decides to call Tool_Search_Logs. Temporal records ActivityTaskScheduled: Tool_Search_Logs.
- The Pause: The workflow waits for the tool.
- Run 2 (The Replay): The workflow wakes up to process the tool result. It replays the code from the start. It calls the Claude SDK again with 'Analyze the logs.' But this time Claude decides to use Get_Request_Ids.
- The Clash: Temporal looks at the history and sees that last time you scheduled Tool_Search_Logs, but now you are trying to execute Get_Request_Ids. This is a Non-Deterministic Error.
Temporal blocks progress and throws a WorkflowTaskFailed error, or falls back to re-evaluating everything from the start.
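The clash can be reproduced in a few lines. Here a counter-based fake stands in for the probabilistic model (all names are illustrative, not SDK code), and `replay` mimics Temporal comparing the command it would schedule against recorded history:

```python
calls = {"n": 0}

def choose_tool(prompt: str) -> str:
    # Stand-in for an LLM: it may pick a different tool on each invocation.
    calls["n"] += 1
    return "Tool_Search_Logs" if calls["n"] == 1 else "Get_Request_Ids"

def replay(history: list[str]) -> None:
    # Temporal re-executes workflow code and checks each command against history.
    scheduled = choose_tool("Analyze the logs")
    if history and scheduled != history[0]:
        raise RuntimeError(
            f"non-determinism: history recorded {history[0]}, "
            f"replay scheduled {scheduled}"
        )

history = [choose_tool("Analyze the logs")]  # run 1 records Tool_Search_Logs
try:
    replay(history)  # run 2: the model changes its mind -> WorkflowTaskFailed
except RuntimeError as e:
    print(e)
```

The standard mitigation is to move every model call into an Activity so its output is recorded once and replayed from history, which is precisely what makes dynamic in-workflow agent loops so awkward.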
6. Different retry arguments
If an activity fails, Temporal simply retries it with the exact same input arguments. When orchestrating agents, this 'blind retry' mechanism creates a critical failure mode.
Scenario A (Response Timeout): The prompt 'Start Research on Temporal' reached Claude and was passed as a parameter to the activity. Claude started thinking. The Temporal Activity timed out waiting for the answer. If we send 'Start Research on Temporal' again on the same session, Claude now has two parallel contexts or a duplicated conversation history. Sending 'Continue' would work perfectly, but 'Continue' is a different input from the original, which a blind retry can never produce.
Scenario B (Network Request Failed): The activity tried to send the prompt 'Start Research' to Claude, but the HTTP request timed out before reaching Claude. However, a session was created in the first system call from Claude CLI. If we send 'Continue Research' on the same session, Claude will be confused. It never received the first message. It has nothing to continue.
Because the Activity failed, we often lose the specific error context. The retry prompt is not only different from the original prompt but also dependent on the error we got.
7. Size limits
Claude can easily generate 100k+ tokens. Temporal is not designed for this. It's designed to transport small parameters.
Temporal enforces a strict size limit on the data payloads recorded in its Event History:
- Max Blob Size: 2MB (default) for a single Activity result or Signal.
- The Scenario: You ask Claude to 'Refactor the legacy reporting module.' Claude writes a massive response containing the reasoning, a plan, and the full code for 6 different files.
- The Crash: The Activity attempts to return this 4MB string to the Workflow. Temporal rejects the payload with a BlobSizeLimitError. The workflow crashes immediately.
Even if your responses average 500KB (safe for a single blob), they accumulate in the history. 500KB per step multiplied by 100 agent steps equals 50MB. Temporal's default History Size Limit is 50MB. Your long-running agent workflow is terminated by the server with WorkflowExecutionHistorySizeLimitExceeded, not because it failed a task, but because it 'remembered too much.'
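The usual escape hatch is the claim-check pattern: store big payloads externally and put only a reference into history. A minimal sketch (a dict stands in for S3; threshold and helper names are assumptions):

```python
import hashlib

BLOB_STORE: dict[str, str] = {}      # stand-in for S3/GCS
MAX_INLINE_BYTES = 2 * 1024 * 1024   # Temporal's default 2MB payload cap

def offload_if_large(result: str) -> dict:
    """Return the result inline if small, else store it and return a reference."""
    data = result.encode()
    if len(data) <= MAX_INLINE_BYTES:
        return {"inline": result}
    key = hashlib.sha256(data).hexdigest()
    BLOB_STORE[key] = result
    return {"ref": key, "size": len(data)}  # tiny payload enters history

def resolve(payload: dict) -> str:
    """Inverse: fetch the blob back when the next step needs it."""
    return payload.get("inline") or BLOB_STORE[payload["ref"]]
```

This works, but now every activity boundary pays an extra network round trip, and you own the lifecycle (and deletion) of an entire external blob store.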
8. The latency multiplier
Temporal adds a small, fixed latency penalty of approximately 50ms to 100ms for every Activity transition due to the mechanics of Polling, Persistence, and Scheduling.
In a standard payment flow with 3 steps, adding 300ms total latency is irrelevant. Agents operate in a tight cognitive loop: Think, Tool Call, Parse, Observe. A complex agent might run this loop 20 to 50 times to solve a single request.
If you wrap every step of the agent's thought process in a Temporal Activity to get visibility, you pay that 100ms tax 50 times. This accumulates to seconds of pure infrastructure lag, making the agent feel sluggish and unresponsive. This problem is further aggravated by size limits and external buckets you will need to maintain.
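The arithmetic, spelled out (overhead figures are the estimates from the text, not measurements):

```python
def orchestration_tax(steps: int, per_step_overhead_ms: float) -> float:
    """Total infrastructure latency in seconds added across an agent loop."""
    return steps * per_step_overhead_ms / 1000.0

print(orchestration_tax(3, 100))   # payment flow: ~0.3s, irrelevant
print(orchestration_tax(50, 100))  # agent loop: 5.0s of pure orchestration lag
```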
9. Cold boot time
For Claude Agents, a cold start is not measured in milliseconds but in seconds or minutes, typically 10 seconds to 2+ minutes. This latency stems from two distinct initialization phases: provisioning the environment (pulling container images and booting the instance) and the protocol overhead described below.
Protocol Overhead: Even when running on a compute instance with the Claude CLI pre-installed, the session cannot start immediately. Upon initialization, the SDK must perform a mandatory sequence of network operations with Anthropic's servers, including authentication handshakes, capability negotiation, and session registration. These blocking API calls occur before a single line of agent logic can be executed.
To solve this, you are forced to maintain a pool of pre-warmed workers (which is expensive) or implement complex caching layers for container images.
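The warm-pool workaround, sketched (the pool size and `boot` callable are illustrative; in practice `boot` is the expensive auth/handshake/registration sequence):

```python
import collections

class WarmPool:
    """Keep N pre-initialized sessions so requests skip the cold boot."""

    def __init__(self, size: int, boot):
        self._boot = boot  # expensive init: auth handshake, session registration
        self._pool = collections.deque(boot() for _ in range(size))  # paid upfront

    def acquire(self):
        if self._pool:
            return self._pool.popleft()  # warm path: no boot on the request path
        return self._boot()              # pool exhausted: cold boot anyway

    def release(self, session):
        self._pool.append(session)
```

The trade-off is visible in the constructor: you pay for idle, fully booted capacity around the clock to hide a latency spike that only matters at request time.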
10. The zombie thread issue
A critical operational hazard arises from the disconnect between how Temporal manages task cancellation and how the Claude SDK executes commands. Temporal manages the lifecycle of the Worker thread (the Python wrapper). However, the Claude SDK spawns a separate OS-level subprocess (the binary CLI) to execute the actual agent logic.
The failure mode:
- Cancellation Signal: When Temporal cancels an activity due to a timeout or user request, it raises a CancelledError inside the Python thread.
- The Orphan: If the Python wrapper simply exits without explicitly intercepting this error and sending a SIGKILL to the child process, the CLI binary is left running in the background.
- The Accumulation: These zombie processes continue to consume CPU and Memory, eventually starving the worker node and requiring a hard reboot.
This often leads to race conditions and data corruption. A zombie agent that started a development server on localhost:3000 and never shut it down will cause the next agent attempting to use the same port to fail every time.
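The defensive pattern is to run the CLI in its own process group and reap the whole group on cancellation. A sketch under Unix assumptions (the wrapper function is ours, not SDK code):

```python
import os
import signal
import subprocess

def run_agent_cli(cmd: list[str], timeout: float) -> int:
    # start_new_session puts the CLI in its own process group, so we can kill
    # the whole tree, including anything the agent itself spawned (dev servers,
    # scrapers, shells).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Mirror Temporal's CancelledError path: without this branch, the
        # wrapper exits and the CLI binary keeps running as a zombie.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()
        raise
```

The same cleanup must also run on `asyncio.CancelledError` in async activities; Temporal cancels the Python coroutine, never the OS subprocess beneath it.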
11. Lack of agent-specific observability
Temporal provides world-class monitoring for Workflow Execution: success rates, latencies, stack traces. But it has no concept of Agentic Execution at all.
If an agent gets stuck in a loop, hallucinates a non-existent API, or refuses a task due to safety filters, Temporal reports this merely as an ActivityTimeout or ApplicationError. You get zero visibility into why the failure happened. Key agentic metrics are completely absent from the Temporal UI: cost per turn, loop detection, tool usage, safety triggers.
Temporal treats an agent's failure the same as a network timeout.
To debug a failed agent in Temporal, a developer must download the raw history JSON, manually parse the huge input/output blobs (if not encrypted), and reconstruct the conversation thread. There is no native visualization of the tree of thought or the agent's decision-making process, making root cause analysis incredibly slow.
Developers often try to connect specialized third-party observability platforms like LangSmith or Arize. But standard tracing SDKs assume a single continuous process, while Temporal scatters execution across multiple workers and machines. You cannot simply plug these tools in. You must manually inject Trace IDs into every Activity input and extract them on the other side to stitch together a fragmented history.
12. PII and sensitive data leakage
Temporal's architecture is designed to be an immutable evidence locker. It records the full serialized input and output of every single Activity and Workflow transition to enable replayability.
For agentic apps, you pass massive, unstructured context windows. The input to an agent activity often includes the entire conversation history, the contents of files read from disk, and environment variables. If a user pastes a customer email, a medical diagnosis, or a private API key into the chat, Temporal permanently writes that plain-text data into its database history.
Temporal History is immutable and often stored as binary blobs. You cannot surgically remove line 42 of a conversation from a closed workflow's history without corrupting the entire cryptographic chain or breaking the replay capability. To comply with a deletion request, you might have to nuke the entire workflow history, losing all audit trails and debugging context for that session.
A support engineer debugging a stuck agent might open the UI and instantly view highly sensitive proprietary code, PII, or credentials that the agent was processing.
To use Temporal safely with agents, you cannot use the default configuration. You are forced to implement complex Custom Data Converters, encrypting every payload before it leaves the SDK, managing encryption keys that allow the Worker to decrypt the data for processing but prevent the Temporal Server and the UI from ever seeing it. Once encrypted, you lose the ability to search your workflows by content.
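Conceptually, such a codec encrypts on the way out and decrypts on the way in, with the key held only by Workers. The toy XOR keystream below illustrates the shape only and is NOT real cryptography; a production Temporal codec would use AES-GCM behind a custom Data Converter:

```python
import hashlib

KEY = b"worker-side-secret"  # held by Workers, never by the Temporal Server/UI

def _keystream(key: bytes, length: int) -> bytes:
    """Derive a deterministic byte stream from the key (illustration only)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encode(plaintext: bytes) -> bytes:
    """What leaves the SDK: ciphertext. The server stores only this."""
    ks = _keystream(KEY, len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, ks))

decode = encode  # an XOR stream cipher is its own inverse
```

The consequence follows directly: since the server only ever holds `encode(...)` output, the UI shows opaque bytes and content-based search across workflows is gone.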
What if we tried it anyway?
We did. Here is what the low-level design actually looks like when you force Temporal to work with Claude Agents:
1. Attach a network volume from the cloud provider to your pods/containers, ensuring the same disk is mounted across all instances. Set the CLAUDE_CONFIG_DIR environment variable to point to a fixed path on this shared volume.
2. The first activity in the workflow creates a Sandbox and checkpoints its ID.
3. Within an activity, create a session with the Claude Agents CLI using the Sandbox details from step 2. Return the session ID.
4. Create the main activity. This activity passes a different prompt to the model based on the retry count, using resume = session_id from step 3 and the Sandbox details from step 2.
5. Since Tools, Hooks, and Sub-agents execute within the context of an Activity, design these components as accessible APIs to be called from the Activity.
6. Clean up all reserved resources, such as the Sandbox, on completion.
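The orchestration skeleton of those steps can be sketched as follows. Every name here (`create_sandbox`, `create_session`, `run_agent`, `cleanup`, and the `execute` callable standing in for a Temporal activity invocation) is invented for illustration:

```python
def run_workflow(execute):
    """Drive the steps in order; `execute` stands in for an activity call."""
    sandbox_id = execute("create_sandbox")                  # step 2: checkpointed
    session_id = execute("create_session", sandbox_id)      # step 3: returns session ID
    result = execute("run_agent", session_id, sandbox_id)   # step 4: retry-aware prompt
    execute("cleanup", sandbox_id)                          # step 6: release resources
    return result

# Trace the call order with a fake executor instead of a real Temporal client.
calls = []
def fake_execute(name, *args):
    calls.append(name)
    return f"{name}-id"

run_workflow(fake_execute)
print(calls)  # ['create_sandbox', 'create_session', 'run_agent', 'cleanup']
```

Note what the skeleton hides: the entire agent loop lives inside the `run_agent` step, opaque to Temporal, which is exactly the observability problem from section 11 resurfacing.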
While this looks promising on paper, it comes with its own set of limitations:
- Agents generate high-frequency, small writes. Cloud network volumes throttle based on IOPS, not just bandwidth. If 100 agents try to checkpoint their state simultaneously, you hit the volume's IOPS limit immediately. AWS General Purpose EFS supports around 7,000 write IOPS. With 100 parallel users, each user gets roughly 70 IOPS, which is extremely low for a Claude Agent running multiple sub-agents and tools.
- AWS EFS Max I/O mode offers theoretically unlimited IOPS, but the round-trip time of each operation increases. This is a huge problem for the Claude Agents SDK, which by design is very chatty with the file system. While a single round trip might be insignificant, the total across 100 to 200 file operations can balloon.
- Network storage adds additional durability concerns. EFS assumes eventual consistency for reads unless you close the file. Agent A writes to log.txt but keeps the file handle open (streaming logs). Agent B tries to read log.txt. It might see empty content or old data until Agent A actually sends a close() signal.
- As your CLAUDE_CONFIG_DIR fills with thousands of session files, simple startup checks will hang. Operations like ls take IOPS roughly equivalent to the number of files in the folder.
- Activities are still opaque. By encapsulating the entire agent loop inside a single Activity, we hide the agent's internal logic from Temporal.
- Zombie threads can now write to the shared file system, creating system states which are extremely hard to debug.
- If a tool triggers a long-running sub-process (e.g., scraping a website taking 30 seconds), the main Activity is blocked. The CLI expects a synchronous return from the tool function. Meanwhile the Sandbox is also running throughout the lifecycle.
- We are creating multiple Claude Sessions (step 3 then step 4) which each take time to initialize, increasing the overall latency.
- Tool calls are not truly durable as they are being triggered as API calls rather than Temporal workflows or activities.
- Management overhead of at least 2 extra services: Network File Storage and Sandboxes.
Every workaround introduces new failure modes. The complexity compounds rather than resolving.
The takeaway
Temporal is excellent software built for a different problem. The 12 issues above are not bugs. They are fundamental architectural mismatches between a deterministic workflow engine and a probabilistic, stateful, long-running, streaming, context-heavy runtime like an AI agent.
The question is not how to make Temporal work for agents. The question is what a durability runtime designed from the ground up for agents should look like.
This is Part 2 of a series on building durable infrastructure for AI agents. In Part 3, we explore why retry alone is not enough for intelligent software and introduce the 3U framework for agentic reliability.
Exosphere is building the reliability runtime for AI agents in production →