The Problem
LLM inference is expensive and non-deterministic. If a client disconnects mid-stream:
- The provider has already charged for the request
- The generated output is lost
- Re-running the request incurs the cost again and may produce different output
The Solution
Ensemble persists every completed response to S3:
1. Provider continues: Even after client disconnection, the provider call runs to completion
2. S3 storage: The complete response (blocks, tokens, cost) is stored in S3
3. Client recovery: The client can retrieve the response later via GET /api/v1/retrieve/{request_id}
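The three steps above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not Ensemble's actual implementation: `fake_provider`, `pump`, `handle`, and the in-memory `STORE` (standing in for the S3 bucket) are all hypothetical names. The point it shows is that the provider stream is consumed by its own task, so a client that stops reading does not cancel it.

```python
import asyncio
from typing import AsyncIterator, Dict, List

# Hypothetical in-memory stand-in for the S3 bucket, keyed by request ID.
STORE: Dict[str, str] = {}


async def fake_provider() -> AsyncIterator[str]:
    """Stand-in for a streaming provider call (illustrative only)."""
    for chunk in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk


async def pump(request_id: str, client_queue: "asyncio.Queue[str]") -> None:
    """Drain the provider to completion, then persist the full response."""
    chunks: List[str] = []
    async for chunk in fake_provider():
        chunks.append(chunk)
        client_queue.put_nowait(chunk)  # unbounded queue, never blocks
    STORE[request_id] = "".join(chunks)


async def handle(request_id: str, disconnect_after: int) -> List[str]:
    """Forward chunks until the client 'disconnects'; return what it received."""
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    task = asyncio.create_task(pump(request_id, queue))
    delivered: List[str] = []
    for _ in range(disconnect_after):
        delivered.append(await queue.get())
    # The client is gone at this point. The pump task is not cancelled;
    # it is awaited here only so the demo can observe the stored result.
    await task
    return delivered


if __name__ == "__main__":
    seen = asyncio.run(handle("req-1", disconnect_after=1))
    print("client saw:", seen)
    print("stored:", STORE["req-1"])
```

A real server would return from the handler on disconnect and let the background task finish on its own; the final `await task` exists only to make the sketch deterministic.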
Flow
Client ──── Ensemble ──── Provider
  │            │             │
  │  request   │   forward   │
  │───────────>│────────────>│
  │            │             │
  │ streaming  │  streaming  │
  │<───────────│<────────────│
  │            │             │
  ╳ disconnect │  continues  │
               │<────────────│
               │             │
               │  complete   │
               │   ──> S3    │
               │             │
  │  retrieve  │             │
  │───────────>│             │
  │  response  │             │
  │<───────────│             │
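The final retrieve step of the flow could look like this from the client side. It is a sketch using only the Python standard library. The base URL is an assumption, any authentication Ensemble may require is omitted, and the body is returned as raw bytes because the response schema is not specified here. `retrieve_url` and `retrieve` are hypothetical helper names.

```python
import urllib.parse
import urllib.request


def retrieve_url(base_url: str, request_id: str) -> str:
    """Build the retrieval URL for a request ID, percent-encoding the ID."""
    quoted = urllib.parse.quote(request_id, safe="")
    return f"{base_url.rstrip('/')}/api/v1/retrieve/{quoted}"


def retrieve(base_url: str, request_id: str, timeout: float = 10.0) -> bytes:
    """Fetch the persisted response body as raw bytes (no parsing attempted)."""
    url = retrieve_url(base_url, request_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A client that lost its stream would call `retrieve(base_url, request_id)` with the X-Request-ID it received or supplied on the original request.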
Configuration
Response persistence is enabled by configuring S3:
# S3 configuration (also supports MinIO)
# Set via environment variables:
# AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# AWS_ENDPOINT_URL (for MinIO)
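A small preflight check can confirm that the variables listed above are set before starting the service. This sketch assumes that the region and both credential variables are needed and that AWS_ENDPOINT_URL is optional (used only for MinIO or another S3-compatible endpoint). Ensemble may accept other credential sources, so treat the "required" list as an assumption; `missing_s3_vars` is a hypothetical helper.

```python
import os
from typing import List, Mapping, Optional

# Assumed to be required; AWS_ENDPOINT_URL is treated as optional (MinIO).
REQUIRED_VARS = ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of assumed-required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_s3_vars()
    if missing:
        print("Missing S3 settings:", ", ".join(missing))
    else:
        endpoint = os.environ.get("AWS_ENDPOINT_URL", "(default AWS endpoint)")
        print("S3 settings present; endpoint:", endpoint)
```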
Request ID Tracking
Every request gets a unique X-Request-ID (auto-generated or client-provided). This ID is:
- Returned in response headers
- Used as the S3 storage key
- Required for status checks and retrieval
- Included in logs and traces for correlation
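The "auto-generated or client-provided" rule might be implemented along these lines. This is a sketch only: the UUID format for generated IDs is an assumption, since the ID format is not specified here, and `resolve_request_id` and `response_headers` are hypothetical helpers.

```python
import uuid
from typing import Dict, Mapping


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Return the client-provided X-Request-ID, or generate a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-request-id" and value.strip():
            return value.strip()
    return str(uuid.uuid4())  # assumed format for generated IDs


def response_headers(request_id: str) -> Dict[str, str]:
    """Echo the ID back so the client can use it for retrieval later."""
    return {"X-Request-ID": request_id}
```

A client that sets its own X-Request-ID knows the retrieval key before the first byte arrives, which helps if the connection drops before response headers are read.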