The Future of Reliable Orchestration: Warm Pools, Sticky Routing, and Cognitive Durability
2026-04-06
What does infrastructure designed from the ground up for AI agents actually look like? Pre-warmed sessions, direct context streaming, stateful routing, sandboxed isolation, and a unified control plane.
In the previous parts of this series we established that traditional durability frameworks were built for deterministic workflows, that Temporal specifically breaks in 12 distinct ways when you put an AI agent inside it, and that retry alone is insufficient because agent failures are cognitive, not just infrastructural.
The natural question is: what should we actually build instead?
This post lays out what we believe the future of agent orchestration looks like. Not a patched version of existing workflow engines. A runtime designed from the ground up for the specific demands of production AI agents.
The problems that remain
Even after solving for cognitive reliability with the 3U framework (Uncover, Understand, Utilize), there are hard infrastructure problems that still need answers.
- Streams: Users need to see the agent think in real-time. Token-by-token streaming cannot be an afterthought bolted on through side-channels.
- Cold Boot: Claude Agent sessions take 10 seconds to 2+ minutes to initialize. Authentication handshakes, capability negotiation, session registration. All of this happens before a single line of agent logic executes.
- Auto-Scaling: Standard pod autoscalers assume stateless, fungible instances. Agents are stateful. Scaling down kills running sessions. Scaling up doesn't help if sessions are pinned to specific nodes.
- Sticky Routing: If you pin sessions to specific workers for filesystem locality, you create head-of-line blocking. One user's heavy task starves another user on the same worker, even if the rest of the fleet is idle.
These are not theoretical concerns. These are the problems you hit in the first week of running agents in production at any meaningful scale.
Pre-warmed session pools
Cold boot is the first thing users feel. A 30-second wait before the agent starts thinking is unacceptable for any interactive product.
The orchestrator should maintain a fixed-size buffer of initialized Claude sessions. These sessions have already completed the authentication handshake, capability negotiation, and session registration with Anthropic's servers. They are warm and ready to accept a prompt immediately.
When a new task arrives, it is assigned a session from the warm pool. The pool is instantly replenished in the background, ensuring zero-latency starts for subsequent tasks. The pool size is tunable based on expected traffic, and the replenishment is asynchronous so it never blocks the critical path.
This turns a 10-second to 2-minute cold start into a near-instant session acquisition.
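The mechanics can be sketched in a few lines. This is a minimal illustration, not a real implementation: `init_session` stands in for whatever expensive handshake and registration work happens at session creation, and the threading model is deliberately simplistic.

```python
import queue
import threading

class WarmSessionPool:
    """Fixed-size buffer of pre-initialized sessions.

    `init_session` is a hypothetical factory that does the expensive
    handshake / capability negotiation / registration work up front.
    """

    def __init__(self, init_session, size=4):
        self._init_session = init_session
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(init_session())  # pay the cold-boot cost now

    def acquire(self):
        # Near-instant: the session handed back is already warm.
        session = self._pool.get()
        # Replenish asynchronously so the critical path never blocks.
        threading.Thread(target=self._replenish, daemon=True).start()
        return session

    def _replenish(self):
        try:
            self._pool.put_nowait(self._init_session())
        except queue.Full:
            pass  # another replenisher already topped the pool up
```

The point of the sketch is the shape of the tradeoff: initialization cost is paid ahead of demand, and acquisition is a queue pop.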
Direct context streaming
Streaming should not be a side-channel bolted onto a request-response framework. It should be the primary data path.
User inputs stream directly into the active session's context window, bypassing traditional queues and activity wrappers. Response tokens are piped back to the client in real-time as the model generates them. Nothing buffers the entire response to return it as a single blob.
This removes the fundamental incompatibility between streaming and traditional durability frameworks, where an Activity must complete and return a final result before the Workflow can proceed. The stream is the interface. The agent thinks and the user watches it think. That is the product.
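Conceptually, the streaming path is just a pass-through with no terminal buffer on the critical path. A minimal sketch, where `model_tokens` stands in for the model's token stream and `sink` for the client connection (both hypothetical):

```python
from typing import Callable, Iterator

def stream_response(model_tokens: Iterator[str], sink: Callable[[str], None]) -> str:
    """Forward each token to the client the moment it arrives.

    The joined transcript is returned only for post-hoc durability
    logging; the client never waits for it.
    """
    parts = []
    for token in model_tokens:
        sink(token)          # the stream is the interface: forward immediately
        parts.append(token)  # retained for the durable record, off the hot path
    return "".join(parts)
```

Contrast this with an Activity wrapper, where `sink` would only fire after the full loop completes and returns.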
Stateful sticky routing
The sticky routing problem is real but solvable. The key insight is that routing must be deterministic at the stream level, not at the user level.
Each stream ID maps deterministically to a specific node and local disk, ensuring execution always lands where the context is physically stored. This is not traditional session affinity where a user is pinned to a worker for the duration of their relationship with the service. It is per-task pinning. Each task knows exactly where its filesystem state lives and routes there directly.
The difference matters because it avoids the head-of-line blocking problem. User A's follow-up task does not compete with User B's heavy task on the same worker unless their filesystem state happens to live on the same node. The routing is determined by data locality, not user identity.
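One way to get this determinism is rendezvous (highest-random-weight) hashing: every stream scores every node, and the top-scoring node wins. The sketch below is illustrative; a production router would also account for node health and capacity.

```python
import hashlib

def route(stream_id: str, nodes: list[str]) -> str:
    """Deterministically map a stream ID to a node via rendezvous hashing.

    The same stream always lands on the same node while that node is in
    the fleet, and removing a node remaps only the streams it hosted --
    exactly the data-locality property per-task pinning needs.
    """
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{stream_id}:{node}".encode()).hexdigest()
        return int(digest, 16)

    return max(nodes, key=score)
```

Because the mapping depends only on the stream ID and the fleet membership, any router replica computes the same answer with no shared routing table.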
Sandboxed isolation
Every session operates within a dedicated local sandbox, strictly isolating the agent's filesystem, side-effects, and dependencies. One agent cannot see another agent's files. One agent's zombie process cannot corrupt another agent's state.
Critically, this sandboxing must happen with minimal or no latency overhead. If creating a sandbox adds 5 seconds to every session start, you have traded one cold boot problem for another. The sandbox lifecycle is managed by the orchestrator and tied directly to the session lifecycle. When the session ends, the sandbox is cleaned up. No orphaned resources. No zombie filesystems.
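The lifecycle coupling can be expressed as a scoped resource. This sketch isolates only the filesystem (real isolation would also namespace processes, network, and dependencies), but it shows the invariant: the sandbox cannot outlive the session.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def session_sandbox():
    """Dedicated scratch filesystem tied to the session lifecycle.

    mkdtemp is cheap, so the sandbox adds effectively no latency to
    session start; teardown runs unconditionally when the session ends.
    """
    root = Path(tempfile.mkdtemp(prefix="agent-session-"))
    try:
        yield root
    finally:
        shutil.rmtree(root, ignore_errors=True)  # no orphaned resources
```

Anything the agent writes under `root` disappears with the session, so one session's debris can never leak into another's state.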
Tiered storage offloading
Not all session state needs to live on local disk forever. The orchestrator implements a tiered storage strategy.
Hot sessions, those actively executing or recently active, keep their state on local disk for maximum I/O performance. Cold sessions, those that have been idle beyond a configurable threshold, have their state asynchronously flushed to central durable storage.
When a cold session is resumed, the orchestrator retrieves the state from central storage, hydrates the local filesystem, and routes the task to the appropriate node. This retrieval adheres to a strict SLA so that even cold resumes feel responsive.
This solves the problem of local disk filling up with thousands of session files, which we saw causes IOPS degradation and startup hangs when using shared network volumes. Hot state stays fast. Cold state stays durable. The orchestrator manages the transition transparently.
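The offload sweep itself is a simple idle-threshold scan. In this sketch, `sessions` maps a session ID to its last-active timestamp and `flush` is a hypothetical writer to central durable storage; a real sweep would flush asynchronously rather than inline.

```python
import time

def offload_cold_sessions(sessions, flush, idle_threshold_s=900, now=None):
    """Flush sessions idle past the threshold to durable storage.

    Returns the IDs that were offloaded; their local-disk state is
    reclaimed so hot sessions keep fast I/O.
    """
    now = time.time() if now is None else now
    cold = [sid for sid, last_active in sessions.items()
            if now - last_active > idle_threshold_s]
    for sid in cold:
        flush(sid)        # write state to central storage (inline here)
        del sessions[sid] # reclaim the local-disk footprint
    return cold
```

Resume is the inverse: fetch from central storage, hydrate the local filesystem, then route the task to that node.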
Optimized keep-alive
Sessions enforce an extended idle timeout, preserving hot context in memory to instantly serve follow-up requests without reloading history.
In practice, users interact with agents in bursts. They send a task, read the output for 30 seconds to a few minutes, then send a follow-up. If the session was torn down after 30 seconds of inactivity, every follow-up would require a cold restart. The keep-alive window is tuned to match real user interaction patterns, keeping sessions warm through natural pauses without wasting resources on truly abandoned sessions.
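The policy is a sliding idle deadline: activity pushes teardown forward. A minimal sketch with an injectable clock (so the policy can be tested without sleeping); the timeout value itself is the tunable that gets matched to real interaction patterns.

```python
import time

class KeepAlive:
    """Sliding idle deadline tuned to bursty interaction patterns."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self._timeout = idle_timeout_s
        self._clock = clock
        self.touch()

    def touch(self):
        # Any user activity pushes the teardown deadline forward.
        self._deadline = self._clock() + self._timeout

    def expired(self):
        return self._clock() >= self._deadline
```

A session is torn down (or offloaded to cold storage) only once `expired()` is true, so natural pauses inside a burst never force a cold restart.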
Predictive auto-scaling
Traditional autoscalers react to load. By the time CPU spikes and the autoscaler provisions new nodes, users are already waiting.
The orchestrator dynamically expands the warm pool capacity based on real-time traffic velocity and historical patterns. If traffic has been ramping at 10% per hour for the last 3 hours, the orchestrator pre-provisions capacity for the expected demand 15 minutes from now, not the demand right now.
This ensures zero-latency session acquisition even during sudden demand spikes. The scaling is predictive, not reactive. The warm pool grows before users arrive, not after they start queuing.
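A toy version of the prediction is a linear extrapolation of recent traffic velocity. This sketch uses only the recent ramp; a production forecaster would blend in historical seasonality as described above. `sessions_per_rps` is a hypothetical conversion factor from request rate to warm-pool size.

```python
import math

def predicted_pool_size(samples, horizon_s=900, sessions_per_rps=1.0):
    """Project request rate `horizon_s` seconds ahead from observed samples.

    `samples` is a list of (timestamp_s, requests_per_second) observations,
    oldest first.
    """
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    velocity = (r1 - r0) / (t1 - t0)       # rps change per second
    projected = r1 + velocity * horizon_s  # demand at now + horizon
    return max(0, math.ceil(projected * sessions_per_rps))
```

Traffic ramping from 10 to 13 rps over an hour projects to roughly 13.75 rps fifteen minutes out, so the pool is sized for 14 sessions before that demand arrives.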
Unified control plane
All of the above (warm pools, streaming, routing, sandboxing, tiered storage, keep-alive, and scaling) is governed by a centralized control plane.
This control plane exposes clean, abstract interfaces and strictly enforces policies for scaling, routing, and state management across the distributed fleet. It is the single source of truth for the orchestration lifecycle. Operators interact with the control plane to set policies. The control plane translates those policies into concrete decisions about where sessions run, when they scale, and how they recover.
Without a unified control plane, each of these capabilities becomes an isolated system with its own configuration, its own failure modes, and its own operational overhead. The control plane is what makes the whole thing coherent instead of a collection of duct-taped solutions.
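Concretely, "single source of truth" can be as simple as one immutable policy object that every subsystem reads instead of carrying its own configuration. The field names below are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrchestrationPolicy:
    """Operator-set policy consumed by every subsystem.

    One object to audit, version, and roll back; no per-subsystem
    config drift. All values are hypothetical defaults.
    """
    warm_pool_size: int = 8          # pre-warmed session pool
    idle_keepalive_s: int = 600      # sliding idle deadline
    cold_offload_after_s: int = 900  # tiered-storage threshold
    scale_horizon_s: int = 900       # predictive scaling lookahead
```

The pool, keep-alive, storage, and scaling subsystems each read their knob from this one object, which is what keeps their behavior coherent.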
Putting it all together
Here is the full picture. A request arrives. The orchestrator assigns a pre-warmed session from the pool. The user's input streams directly into the session's context window. Response tokens stream back in real-time. The session runs inside a dedicated sandbox on a node selected by deterministic routing based on data locality.
If the agent drifts, the 3U sidecar detects it, diagnoses the cause, and applies a targeted recovery. If the user pauses, the session stays warm. If the user leaves, the state is flushed to central storage. If the user returns hours later, the state is retrieved, hydrated, and the agent picks up where it left off.
If traffic spikes, the warm pool has already expanded. If traffic drops, cold sessions are offloaded and resources are reclaimed without killing active work.
This is not a theoretical architecture. This is what we believe production agent infrastructure must look like to deliver the reliability and responsiveness that users expect from intelligent software.
This is Part 4 of a series on building durable infrastructure for AI agents.
Exosphere is building the reliability runtime for AI agents in production →