Architecture
Ensemble uses a local-first rate management design that adds no network latency to the hot path:
Request path (microseconds):
  Local atomic counter check → Allow/Deny
Background (every 1 second):
  Local counters → Redis → Global view update
The request path never queries Redis. Local counters use atomic.Int64 from Go's sync/atomic package, so updates are lock-free.
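The split between the two paths can be sketched roughly as follows. This is a minimal illustration, not Ensemble's actual code: the `localCounter` type, its `Allow` method, `syncLoop`, and the `publish` callback (a stand-in for the Redis write) are all hypothetical names.

```go
package main

import (
	"sync/atomic"
	"time"
)

// localCounter is a hypothetical per-endpoint request counter.
type localCounter struct {
	count atomic.Int64 // requests admitted in the current window
	limit int64        // RPM limit for this endpoint
}

// Allow is the request-path check: one atomic add, no locks, no network I/O.
func (c *localCounter) Allow() bool {
	if c.count.Add(1) > c.limit {
		c.count.Add(-1) // roll back so denied requests are not counted
		return false
	}
	return true
}

// syncLoop is the background path. On every tick it passes the current local
// count to publish, which stands in for the Redis write and global-view read.
func syncLoop(c *localCounter, every time.Duration, publish func(int64), stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			publish(c.count.Load())
		case <-stop:
			return
		}
	}
}
```

A caller would start the background path once with something like `go syncLoop(c, time.Second, publish, stop)` and call `c.Allow()` on each request.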
Rate Limit Tracking
Per-Endpoint Limits
Each endpoint has RPM (requests per minute) and TPM (tokens per minute) limits:
endpoints:
  - id: anthropic-primary
    rpm_limit: 1000
    tpm_limit: 100000
Local Counter Structure
Counters are packed into a single atomic.Uint64 for cache-line efficiency:
// Upper 32 bits: TPM count
// Lower 32 bits: RPM count
var packed atomic.Uint64 // zero value is ready to use
Window rollover uses CompareAndSwap, so it needs no mutex and never blocks a request.
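One way the packing and the CAS-based rollover could fit together is sketched below. The `packedCounter` type, the separate `window` field, and the `Record` method are assumptions made for illustration. The sketch also assumes windows are identified by a caller-supplied minute number.

```go
package main

import "sync/atomic"

// packedCounter is a hypothetical layout: one word holds both counts.
type packedCounter struct {
	window atomic.Int64  // id of the current window (for example, the Unix minute)
	packed atomic.Uint64 // upper 32 bits: TPM count, lower 32 bits: RPM count
}

func pack(tpm, rpm uint32) uint64 { return uint64(tpm)<<32 | uint64(rpm) }

func unpack(v uint64) (tpm, rpm uint32) { return uint32(v >> 32), uint32(v) }

// Record adds one request and its tokens to the window for the given minute
// and returns the updated counts.
func (c *packedCounter) Record(minute int64, tokens uint32) (tpm, rpm uint32) {
	// Rollover: only the goroutine that wins the CAS resets the counts.
	// A request that lands between the CAS and the Store may be dropped from
	// the new window; this sketch accepts that small undercount.
	if old := c.window.Load(); minute > old && c.window.CompareAndSwap(old, minute) {
		c.packed.Store(0)
	}
	for {
		cur := c.packed.Load()
		t, r := unpack(cur)
		next := pack(t+tokens, r+1) // each half wraps at 2^32; real code should saturate
		if c.packed.CompareAndSwap(cur, next) {
			return unpack(next)
		}
		// Another goroutine updated the word first; reload and retry.
	}
}
```

Updating both counts in one CompareAndSwap means a reader never sees a request counted without its tokens, or the reverse.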
Global View
Background sync publishes local counters to Redis and reads global aggregates:
Redis key: {namespace}:ratelimit:{endpoint_id}:{model_id}
The global view is used for routing decisions (avoid sending traffic to endpoints that other instances have already saturated) but is never on the request path.
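As a rough sketch of how the key format and the routing filter might look: `rateLimitKey`, `globalUsage`, and `routable` are hypothetical names. The sketch also assumes the global view is held in an in-memory map that the background sync refreshes.

```go
package main

import "fmt"

// rateLimitKey builds the Redis key in the documented format
// {namespace}:ratelimit:{endpoint_id}:{model_id}.
func rateLimitKey(namespace, endpointID, modelID string) string {
	return fmt.Sprintf("%s:ratelimit:%s:%s", namespace, endpointID, modelID)
}

// globalUsage is a hypothetical snapshot of one endpoint's aggregate usage
// across all instances, as last read by the background sync.
type globalUsage struct {
	TPM      int64
	LimitTPM int64
}

// routable returns the candidates that are not saturated in the global view.
// Endpoints missing from the view are kept, since nothing is known about them.
func routable(view map[string]globalUsage, candidates []string) []string {
	var out []string
	for _, id := range candidates {
		if u, ok := view[id]; ok && u.LimitTPM > 0 && u.TPM >= u.LimitTPM {
			continue // other instances have already used up this endpoint
		}
		out = append(out, id)
	}
	return out
}
```

Because `routable` reads only the cached map, the routing decision stays off the Redis round trip.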
Mock Endpoint Detection
Endpoints with TPM limits above a configurable threshold (MockEndpointTPMThreshold) are treated as unlimited locally, which is useful for testing and development.
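A minimal sketch of the threshold check, assuming hypothetical `endpoint` and `limiterConfig` types. Only the name MockEndpointTPMThreshold comes from the text above. Treating a zero threshold as "detection disabled" is also an assumption.

```go
package main

// endpoint and limiterConfig are hypothetical shapes for this sketch.
type endpoint struct {
	ID       string
	TPMLimit int64
}

type limiterConfig struct {
	MockEndpointTPMThreshold int64
}

// isMock reports whether the endpoint's TPM limit is above the threshold.
// Assumption: a zero threshold disables mock detection.
func isMock(e endpoint, cfg limiterConfig) bool {
	return cfg.MockEndpointTPMThreshold > 0 && e.TPMLimit > cfg.MockEndpointTPMThreshold
}

// allowTokens skips the local TPM check for mock endpoints and applies it
// to every other endpoint.
func allowTokens(e endpoint, cfg limiterConfig, currentTPM, tokens int64) bool {
	if isMock(e, cfg) {
		return true
	}
	return currentTPM+tokens <= e.TPMLimit
}
```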
RateDecision
Every rate limit check produces a RateDecision:
type RateDecision struct {
	Allowed     bool
	CurrentTPM  int64
	LimitTPM    int64
	Utilization float64        // 0.0-1.0
	WindowStart time.Time
	RetryAfter  *time.Duration // Set when rate limited
}
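To illustrate how the fields relate, here is a sketch of a function that fills in a RateDecision for a TPM check. The `decide` helper is hypothetical. It assumes fixed one-minute windows, so RetryAfter is the time left until the current window ends. Ensemble's real windowing may differ.

```go
package main

import "time"

type RateDecision struct {
	Allowed     bool
	CurrentTPM  int64
	LimitTPM    int64
	Utilization float64        // 0.0-1.0
	WindowStart time.Time
	RetryAfter  *time.Duration // Set when rate limited
}

// decide evaluates a request for `tokens` tokens against the TPM limit.
// Assumption: the window is a fixed minute starting at windowStart.
func decide(currentTPM, tokens, limitTPM int64, windowStart, now time.Time) RateDecision {
	d := RateDecision{
		Allowed:     currentTPM+tokens <= limitTPM,
		CurrentTPM:  currentTPM,
		LimitTPM:    limitTPM,
		WindowStart: windowStart,
	}
	if limitTPM > 0 {
		d.Utilization = float64(currentTPM) / float64(limitTPM)
		if d.Utilization > 1 {
			d.Utilization = 1 // clamp to the documented 0.0-1.0 range
		}
	}
	if !d.Allowed {
		wait := windowStart.Add(time.Minute).Sub(now)
		if wait < 0 {
			wait = 0
		}
		d.RetryAfter = &wait
	}
	return d
}
```

A caller can treat a non-nil RetryAfter as the signal to return a rate-limit response, for example by using it to populate a Retry-After header.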
Per-Key Rate Limits
API keys can have their own rate limits (in addition to endpoint limits):
{
  "rate_limit_rpm": 100,
  "rate_limit_tpm": 50000
}
Key-level limits are checked before endpoint-level limits.
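The ordering can be sketched as a two-stage check. The `limits` and `usage` types and the `admit` function are hypothetical. The sketch shows only the order of the checks, not how Ensemble stores the counters.

```go
package main

// limits and usage are hypothetical shapes for this sketch.
type limits struct{ RPM, TPM int64 }
type usage struct{ RPM, TPM int64 }

// within reports whether one more request of `tokens` tokens fits the limits.
func within(u usage, l limits, tokens int64) bool {
	return u.RPM+1 <= l.RPM && u.TPM+tokens <= l.TPM
}

// admit checks the API key's limits first, then the endpoint's limits.
// It returns the scope that denied the request, or "" when allowed.
func admit(keyUse usage, keyLim limits, epUse usage, epLim limits, tokens int64) (bool, string) {
	if !within(keyUse, keyLim, tokens) {
		return false, "key"
	}
	if !within(epUse, epLim, tokens) {
		return false, "endpoint"
	}
	return true, ""
}
```

With this ordering, a key that has used up its own quota is rejected before the shared endpoint limits are consulted.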
Redis Namespace Isolation
Rate limit keys are namespaced by environment. The namespace is resolved in priority order:
Priority 1: REDIS_NAMESPACE env var
Priority 2: ENSEMBLE_ENVIRONMENT env var (dev/staging/production)
Priority 3: Test mode detection (unique per test run)
Default: "default"
This prevents development instances from interfering with production rate limit state.
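The resolution order above might be implemented along these lines. `resolveNamespace` and `namespaceFromEnv` are hypothetical names. The text does not say how test mode is detected, so the sketch takes a `testRunID` argument and uses an invented `test-` prefix as placeholders.

```go
package main

import "os"

// resolveNamespace applies the documented priority order. getenv is injected
// so the function can be exercised without touching the real environment.
// Assumption: testRunID is non-empty only when test mode has been detected.
func resolveNamespace(getenv func(string) string, testRunID string) string {
	if ns := getenv("REDIS_NAMESPACE"); ns != "" {
		return ns // Priority 1: explicit override
	}
	if env := getenv("ENSEMBLE_ENVIRONMENT"); env != "" {
		return env // Priority 2: dev/staging/production
	}
	if testRunID != "" {
		return "test-" + testRunID // Priority 3: unique per test run
	}
	return "default"
}

// namespaceFromEnv resolves the namespace from the process environment
// outside of test mode.
func namespaceFromEnv() string {
	return resolveNamespace(os.Getenv, "")
}
```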