Routing Decision Flow
Request arrives
       │
       ▼
┌──────────────────┐
│ Model lookup     │ → Which providers support this model?
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Cache affinity   │ → Does any endpoint have cached context for this session?
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Rate limit check │ → Filter out endpoints at capacity
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Cost/load balance│ → Among remaining, pick optimal endpoint
└──────┬───────────┘
       │
       ▼
Selected endpoint
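The four stages above can be sketched as a single selection function. This is a minimal illustration under stated assumptions, not Ensemble's actual router: the `Endpoint` type, its fields, `Select`, and `ErrNoEndpoint` are hypothetical names, cost is reduced to a single relative number, and cache affinity is modeled as a preference that applies only when a cached endpoint still has capacity.

```go
package main

import (
	"errors"
	"fmt"
)

// Endpoint is a hypothetical, simplified view of a routable endpoint.
type Endpoint struct {
	ID         string
	Models     []string // models this endpoint can serve
	HasCache   bool     // holds cached context for the current session
	AtCapacity bool     // a rate limit has been reached
	Cost       float64  // illustrative relative cost; lower is better
}

// ErrNoEndpoint is returned when every candidate has been filtered out.
var ErrNoEndpoint = errors.New("no endpoint available")

func supports(e Endpoint, model string) bool {
	for _, m := range e.Models {
		if m == model {
			return true
		}
	}
	return false
}

// Select applies model lookup, cache affinity, the rate limit check,
// and a cost-based pick, in that spirit (assumed semantics).
func Select(endpoints []Endpoint, model string) (Endpoint, error) {
	var available, cached []Endpoint
	for _, e := range endpoints {
		// Model lookup and rate limit check: drop unsuitable endpoints.
		if !supports(e, model) || e.AtCapacity {
			continue
		}
		available = append(available, e)
		// Cache affinity: remember survivors that hold session cache.
		if e.HasCache {
			cached = append(cached, e)
		}
	}
	if len(available) == 0 {
		return Endpoint{}, ErrNoEndpoint
	}
	// Cost/load balance: prefer cached endpoints, then pick the cheapest.
	pool := available
	if len(cached) > 0 {
		pool = cached
	}
	best := pool[0]
	for _, e := range pool[1:] {
		if e.Cost < best.Cost {
			best = e
		}
	}
	return best, nil
}

func main() {
	endpoints := []Endpoint{
		{ID: "anthropic-1", Models: []string{"model-a"}, Cost: 1.0},
		{ID: "anthropic-2", Models: []string{"model-a"}, HasCache: true, Cost: 1.2},
	}
	chosen, err := Select(endpoints, "model-a")
	if err != nil {
		fmt.Println("routing failed:", err)
		return
	}
	fmt.Println("selected:", chosen.ID)
}
```

In this sketch the cached endpoint wins even though it is slightly more expensive, which is the kind of trade-off a cost penalty would record.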
RoutingDecision
Every request produces a RoutingDecision:
type RoutingDecision struct {
    Provider       string          // "anthropic", "openai", etc.
    Endpoint       string          // Endpoint display name
    EndpointID     string          // Internal endpoint ID
    Reason         string          // Human-readable reason
    EstimatedValue decimal.Decimal // Estimated cache value
    CacheOptimized bool            // Whether cache influenced the decision
    CostPenalty    decimal.Decimal // Cost delta vs cheapest option
}
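As a rough illustration of how CostPenalty could be derived, the sketch below computes the chosen endpoint's cost minus the cheapest candidate's cost. The exact formula is an assumption, since the field is only described as a "cost delta vs cheapest option". To stay within the standard library, money is held as int64 micro-dollars instead of decimal.Decimal, and `decision` and `costPenalty` are hypothetical names.

```go
package main

import "fmt"

// decision mirrors a subset of RoutingDecision. CostPenalty is held in
// micro-dollars here (an assumption) rather than decimal.Decimal.
type decision struct {
	EndpointID     string
	Reason         string
	CacheOptimized bool
	CostPenalty    int64
}

// costPenalty returns the chosen cost minus the cheapest candidate cost.
// It is zero when the chosen endpoint is already the cheapest.
func costPenalty(chosen int64, candidates []int64) int64 {
	cheapest := chosen
	for _, c := range candidates {
		if c < cheapest {
			cheapest = c
		}
	}
	return chosen - cheapest
}

func main() {
	// Hypothetical per-request costs for two endpoints, in micro-dollars.
	candidates := []int64{1500, 1200}
	d := decision{
		EndpointID:     "anthropic-1",
		Reason:         "session cache hit",
		CacheOptimized: true,
		CostPenalty:    costPenalty(1500, candidates),
	}
	fmt.Printf("%s: penalty=%d micro-dollars (%s)\n", d.EndpointID, d.CostPenalty, d.Reason)
}
```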
Routing Strategies
Configurable per provider:
| Strategy | Behavior |
|---|---|
| session_affinity | Prefer endpoint with session cache (default) |
| round_robin | Equal distribution across endpoints |
| least_used | Route to endpoint with lowest utilization |
providers:
  anthropic:
    strategy: session_affinity
  openai:
    strategy: least_used
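A minimal sketch of how the three strategies might differ in code follows. The `endpoint` and `picker` types are hypothetical, utilization is assumed to be a fraction between 0 and 1, an empty strategy string is treated as the default, and the fallback to least_used when no endpoint holds session cache is an assumption rather than documented behavior.

```go
package main

import "fmt"

// endpoint is a hypothetical view of one endpoint of a provider.
type endpoint struct {
	ID          string
	Utilization float64 // assumed metric: fraction of capacity in use
	HasSession  bool    // holds cache for the current session
}

// picker keeps the round-robin cursor between calls.
type picker struct{ next int }

func (p *picker) pick(strategy string, eps []endpoint) (endpoint, error) {
	if len(eps) == 0 {
		return endpoint{}, fmt.Errorf("no endpoints configured")
	}
	switch strategy {
	case "round_robin":
		e := eps[p.next%len(eps)]
		p.next++
		return e, nil
	case "least_used":
		best := eps[0]
		for _, e := range eps[1:] {
			if e.Utilization < best.Utilization {
				best = e
			}
		}
		return best, nil
	case "session_affinity", "": // session_affinity is the default
		for _, e := range eps {
			if e.HasSession {
				return e, nil
			}
		}
		// Assumed fallback when no endpoint has session cache.
		return p.pick("least_used", eps)
	default:
		return endpoint{}, fmt.Errorf("unknown strategy %q", strategy)
	}
}

func main() {
	eps := []endpoint{
		{ID: "anthropic-1", Utilization: 0.8},
		{ID: "anthropic-2", Utilization: 0.3, HasSession: true},
	}
	p := &picker{}
	for _, s := range []string{"session_affinity", "round_robin", "least_used"} {
		e, err := p.pick(s, eps)
		if err != nil {
			fmt.Println(s, "error:", err)
			continue
		}
		fmt.Println(s, "->", e.ID)
	}
}
```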
Error-Aware Routing
When a request fails on the selected endpoint:
1. Rate limit (429): Automatically retry on the next available endpoint in the capacity pool
2. Server error (5xx): Retry on a different provider if available
3. Permanent error (other 4xx): Return the error immediately (no retry)
Error classification is provider-specific: Ensemble distinguishes, for example, between Anthropic's overloaded (retryable) and invalid_api_key (permanent).
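The classification rules could be sketched as a small function that maps a failure to a retry action. This is an illustration under assumptions: `action`, `classify`, and the override table are hypothetical, the error codes are the labels used in this section rather than verified provider identifiers, and treating overloaded like a server error is a guess.

```go
package main

import "fmt"

// action is what the router does after a failed request.
type action int

const (
	retryNextEndpoint  action = iota // rate limit: try the next endpoint in the pool
	retryOtherProvider               // server error: try a different provider
	failImmediately                  // permanent error: return it to the caller
)

// classify maps a provider, HTTP status, and provider error code to an
// action. Provider-specific codes take precedence over the status code.
func classify(provider string, status int, code string) action {
	if provider == "anthropic" {
		switch code {
		case "overloaded": // retryable per this section; action is assumed
			return retryOtherProvider
		case "invalid_api_key":
			return failImmediately
		}
	}
	switch {
	case status == 429:
		return retryNextEndpoint
	case status >= 500 && status <= 599:
		return retryOtherProvider
	default: // remaining 4xx and anything unrecognized
		return failImmediately
	}
}

func main() {
	names := map[action]string{
		retryNextEndpoint:  "retry on next endpoint",
		retryOtherProvider: "retry on other provider",
		failImmediately:    "fail immediately",
	}
	fmt.Println(names[classify("anthropic", 429, "")])
	fmt.Println(names[classify("openai", 503, "")])
	fmt.Println(names[classify("anthropic", 401, "invalid_api_key")])
}
```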
Capacity Pools
Multiple API keys for the same provider form a capacity pool:
providers:
  anthropic:
    keys:
      - name: primary
        endpoints:
          - id: anthropic-1
            rpm_limit: 1000
            tpm_limit: 100000
      - name: secondary
        endpoints:
          - id: anthropic-2
            rpm_limit: 500
            tpm_limit: 50000
The router distributes load across the pool and fails over between endpoints.
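One way to picture distribution and failover is a pool that hands each request to the endpoint with the lowest share of its rpm_limit in use and skips endpoints that are full. This is a simplified sketch, not the actual router: `pool`, `poolEndpoint`, and `acquire` are hypothetical names, tpm_limit and the per-minute window reset are left out, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// poolEndpoint tracks usage of one endpoint against its rpm_limit.
type poolEndpoint struct {
	ID       string
	RPMLimit int
	used     int // requests counted in the current one-minute window
}

// pool is a capacity pool of endpoints for a single provider.
type pool struct{ endpoints []*poolEndpoint }

var errPoolExhausted = errors.New("all endpoints at rpm_limit")

// acquire reserves one request on the endpoint with the lowest
// used/limit ratio, skipping endpoints that have reached their limit.
func (p *pool) acquire() (string, error) {
	var best *poolEndpoint
	for _, e := range p.endpoints {
		if e.used >= e.RPMLimit {
			continue // fail over past a saturated endpoint
		}
		// Compare used/limit ratios by cross-multiplying to avoid floats.
		if best == nil || e.used*best.RPMLimit < best.used*e.RPMLimit {
			best = e
		}
	}
	if best == nil {
		return "", errPoolExhausted
	}
	best.used++
	return best.ID, nil
}

func main() {
	p := &pool{endpoints: []*poolEndpoint{
		{ID: "anthropic-1", RPMLimit: 1000},
		{ID: "anthropic-2", RPMLimit: 500},
	}}
	counts := map[string]int{}
	for i := 0; i < 300; i++ {
		id, err := p.acquire()
		if err != nil {
			fmt.Println("pool exhausted:", err)
			break
		}
		counts[id]++
	}
	fmt.Println(counts)
}
```

With limits of 1000 and 500, this policy sends roughly twice as many requests to anthropic-1 as to anthropic-2, so both endpoints approach their limits at about the same time.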