Ensemble's cache optimization is the primary driver of cost savings. Anthropic's prompt caching gives a 90% discount on cached input tokens, so routing a request to an endpoint that already has your prompt prefix cached cuts the input cost of that prefix by up to 90%.
How It Works
CRC-Based Prefix Fingerprinting
Instead of comparing full message content (expensive at 150k-800k token contexts), Ensemble computes lightweight CRC fingerprints of message prefixes:
1. On each request: Compute CRC of the conversation prefix (system prompt + first N messages)
2. Session affinity: Route follow-up messages in the same session to the same endpoint
3. Cross-session sharing: Detect when different sessions share the same system prompt prefix
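The fingerprinting step above can be sketched as follows. This is a minimal illustration, not Ensemble's actual implementation: the function name `prefix_fingerprint`, the default of `n = 2` messages, the JSON serialization, and the use of CRC-32 from Python's `zlib` are all assumptions.

```python
import json
import zlib


def prefix_fingerprint(system_prompt: str, messages: list, n: int = 2) -> int:
    """Return a CRC-32 over the system prompt plus the first n messages.

    Hypothetical helper: the real prefix length and serialization are
    not specified in this document.
    """
    crc = zlib.crc32(system_prompt.encode("utf-8"))
    for msg in messages[:n]:
        # Separator byte so part boundaries affect the checksum.
        crc = zlib.crc32(b"\x00", crc)
        # Deterministic serialization: identical messages give identical bytes.
        payload = json.dumps(msg, sort_keys=True, separators=(",", ":"))
        crc = zlib.crc32(payload.encode("utf-8"), crc)
    return crc
```

Two sessions that share a system prompt and opening messages produce the same fingerprint even if their later turns differ, which is what makes cross-session detection cheap. CRC-32 is not collision-resistant, so a fingerprint match is a routing hint rather than proof of identical content.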
Cache Value Estimation
For each routing decision, Ensemble estimates the cache value:
cache_value = cached_tokens × cache_savings_per_token × confidence
Where cache_savings_per_token varies by provider (from config):
- Anthropic: 90% discount — cached reads at $0.30/MTok vs $3.00/MTok input (multi-breakpoint, 5m–1h TTL, 1024 min tokens)
- OpenAI: 90% discount — cached reads at $0.125/MTok vs $1.25/MTok input (prefix-based, 24h TTL, 1024 min tokens)
- Gemini: 90% discount — cached reads at $0.125/MTok vs $1.25/MTok input (implicit ephemeral, 60m TTL, 2048 min tokens)
- xAI: 75% discount — cached reads at $0.75/MTok vs $3.00/MTok input (prefix-based, 10m TTL)
- Fireworks: 50–90% discount depending on model
- OpenRouter: 75% discount (varies by underlying provider)
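As a worked example of the formula, the sketch below derives the per-token saving from a provider's input price and discount. The function name and parameters are illustrative, and reading `confidence` as the estimated probability that the prefix is still cached is an assumption.

```python
def cache_value(cached_tokens: int, input_price_per_mtok: float,
                discount: float, confidence: float) -> float:
    """Estimated dollars saved by routing to an endpoint with a warm cache.

    Assumption: confidence is a 0..1 estimate that the prefix is still cached.
    """
    # Saving per token = full input price per token times the cache discount.
    savings_per_token = input_price_per_mtok * discount / 1_000_000
    return cached_tokens * savings_per_token * confidence


# 150k cached tokens at the Anthropic rates listed above, with 0.8 confidence:
# 150,000 x ($3.00 x 0.90 / 1,000,000) x 0.8, roughly $0.324
value = cache_value(150_000, 3.00, 0.90, 0.8)
```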
Routing Thresholds
| Cache Value | Routing Decision |
|---|---|
| High (>$0.25) | Strong affinity — route to cached endpoint even if busier |
| Medium ($0.05–$0.25) | Moderate affinity — prefer cached endpoint if available |
| Low (<$0.05) | Cost/load balance — route to least-utilized endpoint |
These thresholds are configurable:
cache:
  high_value_threshold: "0.25"
  low_value_threshold: "0.05"
  stickiness_factor: 0.8
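The threshold table can be read as a three-way classification. The sketch below is one possible interpretation: the function name and return labels are hypothetical, and treating values exactly equal to a threshold as "moderate" follows the table's `>` and `<` bounds. The thresholds are parsed with `Decimal` because the config stores them as strings.

```python
from decimal import Decimal

# Defaults mirror the config values shown above.
HIGH_VALUE_THRESHOLD = Decimal("0.25")
LOW_VALUE_THRESHOLD = Decimal("0.05")


def routing_decision(cache_value: Decimal,
                     high: Decimal = HIGH_VALUE_THRESHOLD,
                     low: Decimal = LOW_VALUE_THRESHOLD) -> str:
    """Map an estimated cache value in dollars to a routing tier."""
    if cache_value > high:
        return "strong_affinity"    # route to cached endpoint even if busier
    if cache_value >= low:
        return "moderate_affinity"  # prefer cached endpoint if available
    return "load_balance"           # route to least-utilized endpoint
```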
Multi-Endpoint Cache Sharing
Capacity pools (multiple endpoints for the same provider) share cached content for failover. When endpoint A is rate-limited, endpoint B in the same pool often has overlapping cache from shared system prompts.
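A rough sketch of failover within a pool is shown below. The `Endpoint` class, its fields, and `pick_endpoint` are hypothetical names for illustration. The sketch assumes the router remembers which prefix fingerprints each endpoint has served and prefers an available endpoint that has already seen the prefix.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    name: str
    rate_limited: bool = False
    # Prefix fingerprints this endpoint has served (assumed bookkeeping).
    seen_prefixes: set = field(default_factory=set)


def pick_endpoint(pool: list, fingerprint: int) -> Optional[Endpoint]:
    """Prefer an available endpoint with a warm cache for this prefix."""
    available = [e for e in pool if not e.rate_limited]
    if not available:
        return None
    warm = [e for e in available if fingerprint in e.seen_prefixes]
    # Fall back to any available endpoint when none is warm.
    return (warm or available)[0]
```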
Session Affinity
The X-Session-ID header controls cache affinity:
# First request creates cache on endpoint A
curl -H "X-Session-ID: sess_123" -d '{"model":"claude-sonnet-4",...}'
# Follow-up routes to endpoint A (cache hit)
curl -H "X-Session-ID: sess_123" -d '{"model":"claude-sonnet-4",...}'
Affinity is weighted, not absolute — rate limits and errors can override the cache preference.
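One way to picture weighted affinity is a score that blends cache warmth with spare capacity. The formula below is an assumption made for illustration. In particular, using `stickiness_factor` as the blend weight is a guess at its meaning, and `endpoint_score` is a hypothetical name.

```python
def endpoint_score(has_cache: bool, utilization: float, healthy: bool,
                   stickiness: float = 0.8) -> float:
    """Higher is better. Unhealthy or rate-limited endpoints are never chosen."""
    if not healthy:
        # Rate limits and errors override any cache preference.
        return float("-inf")
    cache_term = stickiness * (1.0 if has_cache else 0.0)
    # utilization is assumed to be in the range 0.0 (idle) to 1.0 (saturated).
    load_term = (1.0 - stickiness) * (1.0 - utilization)
    return cache_term + load_term
```

With a stickiness of 0.8, a cached endpoint at 90% utilization scores about 0.82 and still beats an idle uncached endpoint at 0.2, but it loses to any healthy endpoint once it is rate-limited.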