Response Persistence - iGent Concert

The Problem

LLM inference is expensive and non-deterministic. If a client disconnects mid-stream:

The provider has already charged for the request
The generated output is lost
Re-running the request incurs the cost again and may produce a different result

The Solution

Ensemble persists every completed response to S3:

1. Provider continues: Even after client disconnection, the provider call runs to completion
2. S3 storage: The complete response (blocks, tokens, cost) is stored in S3
3. Client recovery: The client can retrieve the response later via GET /api/v1/retrieve/{request_id}
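
As a rough illustration of the client recovery step, the sketch below builds the retrieval URL and fetches the stored response. Only the path /api/v1/retrieve/{request_id} comes from this page. The base URL, the absence of authentication, and the assumption that the body is JSON are all hypothetical and should be checked against your deployment.

```python
import json
import urllib.request
from urllib.parse import quote


def retrieve_url(base_url: str, request_id: str) -> str:
    """Build the retrieval URL for a previously issued request ID."""
    # Percent-encode the ID so unusual client-provided IDs stay one path segment.
    return f"{base_url.rstrip('/')}/api/v1/retrieve/{quote(request_id, safe='')}"


def retrieve_response(base_url: str, request_id: str, timeout: float = 10.0) -> dict:
    """Fetch a persisted response. Assumes (not confirmed here) a JSON body."""
    with urllib.request.urlopen(retrieve_url(base_url, request_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical usage; "http://localhost:8080" is a placeholder base URL:
# response = retrieve_response("http://localhost:8080", "abc-123")
```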

Flow

Client ──── Ensemble ──── Provider
  │              │             │
  │  request     │  forward    │
  │─────────────>│────────────>│
  │              │             │
  │  streaming   │  streaming  │
  │<─────────────│<────────────│
  │              │             │
  ╳ disconnect   │  continues  │
                 │<────────────│
                 │             │
                 │  complete   │
                 │  ──> S3     │
                 │             │
  │  retrieve    │             │
  │─────────────>│             │
  │  response    │             │
  │<─────────────│             │
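
The flow above can be approximated with a small asyncio simulation. This is not Ensemble's implementation: fake_provider, run_to_completion, and the in-memory dict standing in for S3 are all hypothetical names chosen for illustration. The point it shows is that the provider stream is consumed by its own task, so the response is persisted even when the client stops reading.

```python
import asyncio


async def fake_provider():
    """Stand-in for a streaming provider call (hypothetical)."""
    for chunk in ["Hel", "lo ", "wor", "ld"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk


async def run_to_completion(request_id, provider, client_queue, store):
    """Drain the provider stream regardless of whether the client is still reading."""
    chunks = []
    async for chunk in provider:
        chunks.append(chunk)
        client_queue.put_nowait(chunk)  # unbounded queue, so this never blocks
    store[request_id] = "".join(chunks)  # a dict stands in for S3 here


async def demo():
    store = {}
    queue = asyncio.Queue()
    task = asyncio.create_task(
        run_to_completion("req-1", fake_provider(), queue, store)
    )
    first = await queue.get()  # the client reads one chunk, then disconnects
    await task                 # the provider call still runs to completion
    return first, store


first_chunk, store = asyncio.run(demo())
```

In a real server the client handler would simply return on disconnect instead of awaiting the task. The await here only lets the demo inspect the stored result.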

Configuration

Response persistence is enabled by configuring S3:

# S3 configuration (also supports MinIO)
# Set via environment variables:
# AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# AWS_ENDPOINT_URL (for MinIO)
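
For scripts that need to read the same bucket directly, the sketch below maps these environment variables to the keyword arguments accepted by boto3.client("s3", ...). Whether Ensemble itself uses boto3 is not stated here, and treating the first three variables as mandatory is an assumption. The helper name s3_client_kwargs is hypothetical.

```python
import os
from typing import Mapping, Optional


def s3_client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Translate the documented environment variables into S3 client settings."""
    env = os.environ if env is None else env
    required = {
        "region_name": "AWS_REGION",
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    }
    # Assumption: all three are required; a deployment may obtain them differently.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing S3 settings: " + ", ".join(missing))
    kwargs = {arg: env[var] for arg, var in required.items()}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:  # only set for S3-compatible stores such as MinIO
        kwargs["endpoint_url"] = endpoint
    return kwargs


# Hypothetical usage, if boto3 is installed:
# import boto3
# s3 = boto3.client("s3", **s3_client_kwargs())
```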

Request ID Tracking

Every request gets a unique X-Request-ID (auto-generated or client-provided). This ID is:

Returned in response headers
Used as the S3 storage key
Required for status checks and retrieval
Included in logs and traces for correlation
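
One way to picture the "auto-generated or client-provided" rule is the sketch below. It reuses a non-empty X-Request-ID header when the client sends one and otherwise generates an ID. The function name resolve_request_id and the choice of UUID4 for generated IDs are assumptions, since the actual ID format is not specified here.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers: dict) -> str:
    """Return the client-provided request ID, or generate a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())  # assumption: generated IDs are UUID4 strings


provided = resolve_request_id({"x-request-id": "job-42"})
generated = resolve_request_id({"Content-Type": "application/json"})
```

Because the same ID serves as the storage key and the retrieval handle, a client that supplies its own ID should keep it unique per request.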