January 2026 · 9 min read

Prompt caching with Claude and GPT: significant token cost reduction on long debates

As debates grow longer, the transcript passed to each model grows too. We implemented KV caching strategies for Anthropic and OpenAI/Gemini that meaningfully reduce cost on extended conversations.

The cost structure of long debates

In a multi-model debate with a shared transcript, token costs grow with debate length. In round one, each model call processes roughly the system prompt plus the user question. By round five, each model call processes the system prompt, the user question, and four rounds of responses from every model in the debate. Per-call input therefore grows linearly with the round number, and total input tokens across an N-round debate grow roughly quadratically with N. A five-round debate between three models can cost many times as much per model call in round five as in round one, even though the incremental information added by that turn is small.
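To make the growth concrete, here is a back-of-the-envelope sketch in Python. All token counts are illustrative assumptions, not measurements from our system.

```python
# Back-of-the-envelope input-token growth for a shared-transcript debate.
# All numbers below are illustrative assumptions, not measured values.
SYSTEM_TOKENS = 1_000    # system prompt
QUESTION_TOKENS = 500    # user question
RESPONSE_TOKENS = 400    # average response per model per round
MODELS = 3
ROUNDS = 5

for rnd in range(1, ROUNDS + 1):
    # Each call in round `rnd` sees every response from rounds 1..rnd-1.
    transcript = (rnd - 1) * MODELS * RESPONSE_TOKENS
    per_call = SYSTEM_TOKENS + QUESTION_TOKENS + transcript
    print(f"round {rnd}: ~{per_call:,} input tokens per model call")
```

With these assumed numbers, a round-five call processes about 6,300 input tokens against 1,500 in round one, and the gap widens with more models or longer responses.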

KV caching: the basic idea

Key-value (KV) caching lets the model provider skip the processing of context tokens that have not changed between calls. If the first 80% of a prompt is identical to the previous call (the system prompt, the early transcript), and only the last 20% is new (the most recent exchange), the provider can serve the first 80% from cache at a fraction of the cost. For OpenAI and Gemini via OpenRouter, this happens automatically through sticky routing — requests from the same conversation are routed to the same inference instance, where the KV cache is preserved.
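The savings arithmetic is simple enough to sketch. The function below is illustrative: the 0.1 default matches the cached-read price ratio we use for Anthropic in the next section, and other providers discount cached tokens differently.

```python
def effective_input_tokens(total_tokens: int, cached_fraction: float,
                           cached_price_ratio: float = 0.1) -> float:
    """Bill-equivalent input tokens when a prefix is served from cache.

    cached_price_ratio is the cached-token price relative to the normal
    input price (0.1 matches the 10% figure we use for Anthropic reads).
    """
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    return fresh + cached * cached_price_ratio

# An 80% cache hit on a 6,300-token prompt bills like ~1,764 tokens.
print(effective_input_tokens(6_300, 0.8))
```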

Anthropic: explicit cache_control

Anthropic's caching implementation is different — it requires explicit annotation. We mark the system message with a cache_control block requesting a 1-hour TTL (the maximum available). This covers the case where a user returns to an ongoing debate after a break: the cache remains warm, so the reconnection does not reset the cost savings. We read cache_read_input_tokens from Anthropic's usage response and count cached tokens at 10% of the normal input cost when tracking total debate cost.
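A minimal sketch of the annotation using the Anthropic Python SDK. The model ID and prompt contents are placeholders, and depending on API version the 1-hour TTL may require an extended-cache-TTL beta header, so check the current docs before relying on this exact shape.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

DEBATE_SYSTEM_PROMPT = "You are moderating a debate..."  # placeholder
transcript_messages = [                                  # placeholder transcript
    {"role": "user", "content": "Question plus prior rounds..."},
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any cache-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": DEBATE_SYSTEM_PROMPT,
            # Explicit cache annotation; "1h" requests the extended TTL.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=transcript_messages,
)

u = response.usage
# Cached reads are billed at a discount; we count them at 10% of the
# normal input price. Cache writes (u.cache_creation_input_tokens) are
# billed at a premium and omitted here for brevity.
effective_input = u.input_tokens + 0.1 * (u.cache_read_input_tokens or 0)
print(f"bill-equivalent input tokens: {effective_input:,.0f}")
```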

Measuring the impact

The cost reduction is most significant on long debates (round 5+) and when using providers where the cache hit rate is high. For Claude in a typical 8-round debate, caching reduces effective input token cost substantially from round 3 onward, assuming the user does not take a break longer than the TTL. For GPT-4o and Gemini via OpenRouter, automatic sticky routing provides similar benefits without explicit annotation. For shorter debates (1–3 rounds), the caching benefit is smaller because the transcript has not grown large enough for the cache ratio to matter significantly.
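To measure hit rates on the OpenAI/Gemini side, we can read the cached-token counts the API reports back. The sketch below assumes OpenRouter mirrors OpenAI's usage schema, where cached counts live under prompt_tokens_details; treat the field names as assumptions to verify against the provider's actual responses.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Question plus prior rounds..."}],
)

usage = resp.usage
# OpenAI-style usage nests cached counts under prompt_tokens_details;
# the getattr guards cover providers that omit the field entirely.
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) or 0
hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
print(f"cache hit rate this call: {hit_rate:.0%}")
```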

Implementation considerations

One subtlety: cache_control must be applied carefully. If you apply it to a message that changes frequently (like the most recent user message), you will pay the cache write cost on every call without getting cache read savings. We apply it only to the system prompt, which is truly stable across all turns of a debate. The transcript itself is not annotated for caching — each turn adds to it, so the cached portion is only the prefix, and the savings scale proportionally with how much of the context is stable versus new.
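In practice this means the request builder only ever attaches cache_control to the system block. A sketch of that shape (build_request is a hypothetical helper, not our actual code):

```python
def build_request(system_prompt: str, transcript: list[dict]) -> dict:
    """Assemble an Anthropic-style request where only the stable system
    prompt is marked for caching. Sketch only; build_request is a
    hypothetical helper."""
    return {
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }],
        # The transcript is the changing suffix: no cache_control here,
        # so we never pay a cache write for content that changes next turn.
        "messages": transcript,
    }
```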