Stop paying twice: the gateway buffer fix for agent crashes

When an agent process dies mid-stream, every major LLM provider except OpenAI forces a full re-prompt and re-bills every output token from the start.

Alessandro Benigni

PUBLISHED JUN 18, 2026

3 MIN READ


Every token an LLM provider generates is billed the moment it is produced, not the moment your application receives it. That distinction has no practical cost when your code runs reliably. In agentic systems, it is a silent budget leak.

Sunil Pai documented the problem on his engineering blog on June 17: when an agent process crashes while streaming from a provider, the in-flight request cannot be recovered, because the connection died with the process. Recovery means calling the API again from scratch, which re-bills every output token for identical content. On a premium model, each such retry costs roughly 15x what it would on a cheap model. In a multi-tool-call agent loop, every deployment event or container restart compounds that loss across potentially dozens of in-flight turns.
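To make the arithmetic concrete, here is a rough sketch of the wasted spend from one deployment event. The prices, token counts and turn count are hypothetical placeholders, chosen only so the premium model is 15x the cheap one; they are not figures from Pai's essay or from any provider's price list.

```python
def wasted_output_cost(tokens_before_crash: int, price_per_million: float) -> float:
    """Cost of output tokens that were billed but lost when the process died."""
    return tokens_before_crash * price_per_million / 1_000_000


# Hypothetical prices in USD per million output tokens (assumption: 15x apart).
CHEAP_PRICE = 1.00
PREMIUM_PRICE = 15.00

tokens_lost = 4_000    # assumed output tokens already generated when the crash hit
turns_in_flight = 24   # assumed number of turns interrupted by one deployment

cheap_loss = wasted_output_cost(tokens_lost, CHEAP_PRICE) * turns_in_flight
premium_loss = wasted_output_cost(tokens_lost, PREMIUM_PRICE) * turns_in_flight

print(f"cheap model: ${cheap_loss:.2f} re-billed, premium model: ${premium_loss:.2f} re-billed")
```

The absolute numbers are small for a single event; the point is that the loss scales linearly with model price, tokens already streamed, and the number of turns in flight at each restart.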

The structural fix Pai proposes is a gateway layer buffer: a durable, independently deployed component that owns the provider connection and writes each chunk to SQLite as it arrives. The agent process never holds the stream directly. When the agent dies or redeploys, the buffer keeps draining. On restart, the agent calls a resume endpoint and retrieves missed chunks without triggering a new provider request.
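A minimal sketch of that idea, using Python's standard sqlite3 module, might look like the following. The class name, table layout and method names are illustrative assumptions, not Pai's implementation; the provider stream is stood in for by a plain list of byte chunks.

```python
import sqlite3
from typing import Iterable, Iterator


class StreamBuffer:
    """Hypothetical gateway-side buffer: one SQLite row per provider chunk."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            " stream_id TEXT NOT NULL,"
            " idx INTEGER NOT NULL,"
            " data BLOB NOT NULL,"
            " PRIMARY KEY (stream_id, idx))"
        )

    def drain(self, stream_id: str, provider_chunks: Iterable[bytes]) -> None:
        """Persist every chunk as it arrives, whether or not anyone is reading."""
        for idx, data in enumerate(provider_chunks):
            with self.db:  # commit per chunk so an agent restart loses nothing
                self.db.execute(
                    "INSERT INTO chunks (stream_id, idx, data) VALUES (?, ?, ?)",
                    (stream_id, idx, data),
                )

    def resume(self, stream_id: str, after: int = -1) -> Iterator[tuple[int, bytes]]:
        """Yield stored chunks with an index greater than `after`, in order."""
        rows = self.db.execute(
            "SELECT idx, data FROM chunks WHERE stream_id = ? AND idx > ? ORDER BY idx",
            (stream_id, after),
        )
        yield from rows


buffer = StreamBuffer()
buffer.drain("turn-1", [b"data: Hel", b"data: lo ", b"data: world"])

# The agent had processed chunks 0 and 1 before it crashed, so it resumes after 1.
missed = list(buffer.resume("turn-1", after=1))
print(missed)
```

The restarted agent reads only from the buffer, so no second request reaches the provider and no output token is billed twice.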

The implementation details keep the footprint small. Each chunk is stored by sequential index. A single read function serves two consumers: a reconnected client tailing the live stream, and a restarted process catching up from a checkpoint. To avoid polling, the drain loop resolves a notification promise on each insert so waiting readers wake immediately. The buffer runs on Cloudflare Durable Objects, which are single-threaded, so the insert and the notification form one synchronous block and readers always see committed rows. The buffer stores raw provider bytes and replays them through each provider’s own SSE parser, so the approach is vendor-neutral by design.
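The wake-on-insert read path can be sketched with Python's asyncio, whose single-threaded event loop gives a similar guarantee. This is an analogy under stated assumptions, not Pai's code: an in-memory list stands in for the SQLite table, an asyncio.Event stands in for the notification promise, and the ChunkLog class and its methods are hypothetical names.

```python
import asyncio
from typing import AsyncIterator


class ChunkLog:
    """In-memory stand-in for the chunk table, to keep the sketch short."""

    def __init__(self) -> None:
        self.chunks: list[bytes] = []
        self.done = False
        self._changed = asyncio.Event()

    def _notify(self) -> None:
        # Wake every current waiter, then arm a fresh event for the next insert.
        self._changed.set()
        self._changed = asyncio.Event()

    def append(self, data: bytes) -> None:
        # No await between insert and notify: on a single-threaded loop a
        # reader cannot wake up before the new chunk is visible.
        self.chunks.append(data)
        self._notify()

    def close(self) -> None:
        self.done = True
        self._notify()

    async def read(self, start: int = 0) -> AsyncIterator[tuple[int, bytes]]:
        """Yield chunks from `start`, then wait for new ones until closed."""
        idx = start
        while True:
            while idx < len(self.chunks):
                yield idx, self.chunks[idx]
                idx += 1
            if self.done:
                return
            await self._changed.wait()  # sleeps until the next append or close


async def demo() -> tuple[list, list]:
    log = ChunkLog()

    async def collect(start: int) -> list:
        return [item async for item in log.read(start)]

    live = asyncio.create_task(collect(0))      # client tailing from the start
    for part in (b"a", b"b", b"c"):
        log.append(part)
        await asyncio.sleep(0)                  # give the reader a turn
    catch_up = asyncio.create_task(collect(2))  # restarted process, checkpoint 2
    log.append(b"d")
    log.close()
    return await live, await catch_up


live_chunks, resumed_chunks = asyncio.run(demo())
```

Both consumers go through the same `read` function; the only difference is the starting index, which is what lets one code path serve live tailing and catch-up from a checkpoint.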

That vendor-neutrality is the point. OpenAI’s Responses API already supports server-side resume via a sequence cursor, but only for OpenAI. Anthropic and Google’s Gemini have no native resume mechanism; a crashed agent rebuilds context and pays again from the first output token. Vercel’s resumable-stream library implements a similar buffer pattern, but it runs inside the agent process. A redeploy still kills it. Moving the buffer to the gateway layer means it survives agent code changes entirely.

For teams currently building on Anthropic or Gemini in production, the provider-resume gap is real and ongoing. Neither provider has announced a server-side resume feature. Until they do, any agent that runs for more than a few seconds per turn and sits inside a redeployable container is exposed to duplicate billing on every deployment event. The gateway buffer pattern turns a platform gap into a solved infrastructure problem.

The pattern also generalizes beyond crash recovery. A durable gateway buffer enables cost auditing at the chunk level, rate-limit backpressure without provider re-calls, and cross-request token accounting that survives process restarts. For teams planning agent infrastructure over the next quarter, this is the layer worth investing in before it becomes a line item you notice too late.

Analysis based on “Never waste a token,” an engineering essay by Sunil Pai published June 17, 2026 at sunilpai.dev.
