Ensemble normalizes multimodal content across providers, handling format differences transparently.
Provider Capabilities
| Provider | Images | PDFs | Audio | Video |
|---|---|---|---|---|
| Anthropic (Claude Opus/Sonnet/Haiku) | Yes | Yes (document blocks) | No | No |
| OpenAI (GPT-5 series) | Yes | Yes (Responses API) | No | No |
| Gemini (2.5/3.0/3.1 Pro/Flash) | Yes | Yes | Yes (mp3, wav, etc.) | Yes (mp4, 300s max) |
| Gemini (image models) | Image generation (1024x1024) | No | No | No |
| Gemini (Veo 3.1) | No | No | No | Video generation (8–60s, up to 4K) |
| xAI (Grok 4/4.1/4.20) | Yes | No | No | No |
| Fireworks (Kimi K2.5) | Yes | No | No | Yes (input only) |
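The input columns of the table can be read as a lookup that a caller might consult before sending a request. The following is a minimal sketch, not Ensemble's actual API: the provider keys, the `CAPABILITIES` map, the `unsupported_blocks` helper, and the `"video"` block type name are all hypothetical, and the generation-only Gemini rows are left out.

```python
# Hypothetical capability map transcribed from the table above (input support only).
# Provider keys and the "video" block type name are assumptions for illustration.
CAPABILITIES = {
    "anthropic": {"image", "document"},
    "openai": {"image", "document"},
    "gemini": {"image", "document", "audio", "video"},
    "xai": {"image"},
    "fireworks": {"image", "video"},
}

# Block types that count as multimodal; anything else is ignored by the check.
MODALITIES = {"image", "document", "audio", "video"}


def unsupported_blocks(provider, content_blocks):
    """Return the multimodal blocks that the provider cannot accept as input."""
    supported = CAPABILITIES.get(provider, set())
    return [
        block
        for block in content_blocks
        if block["type"] in MODALITIES and block["type"] not in supported
    ]


blocks = [{"type": "image"}, {"type": "audio"}]
print(unsupported_blocks("xai", blocks))     # [{'type': 'audio'}]
print(unsupported_blocks("gemini", blocks))  # []
```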
Content Blocks
Multimodal content is sent as content_blocks in messages:
Image
{
  "type": "image",
  "source": {
    "type": "base64",
    "media_type": "image/png",
    "data": "iVBORw0KGgo..."
  }
}
Document (PDF)
{
  "type": "document",
  "source": {
    "type": "base64",
    "media_type": "application/pdf",
    "data": "JVBERi0xLjQ..."
  }
}
Audio (Gemini only)
{
  "type": "audio",
  "source": {
    "type": "base64",
    "media_type": "audio/mp3",
    "data": "..."
  }
}
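All three blocks share one shape: a `type` plus a base64 `source`. As a rough sketch of how such a block could be built from raw bytes, the helper below picks the block type from the media type. The `make_content_block` function and the `_BLOCK_TYPES` mapping are hypothetical names for illustration, not part of Ensemble.

```python
import base64

# Media-type prefix -> block type, following the three examples above.
# This mapping is an assumption for illustration.
_BLOCK_TYPES = {
    "image/": "image",
    "application/pdf": "document",
    "audio/": "audio",
}


def make_content_block(data: bytes, media_type: str) -> dict:
    """Wrap raw bytes in a base64 content block of the matching type."""
    for prefix, block_type in _BLOCK_TYPES.items():
        if media_type.startswith(prefix):
            return {
                "type": block_type,
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64.b64encode(data).decode("ascii"),
                },
            }
    raise ValueError(f"unsupported media type: {media_type}")


# A PDF file starts with the bytes "%PDF-", which is why the encoded
# document data in the example above begins with "JVBERi0x".
block = make_content_block(b"%PDF-1.4", "application/pdf")
print(block["type"])            # document
print(block["source"]["data"])  # JVBERi0xLjQ=
```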
Provider-Specific Handling
Ensemble translates content blocks to each provider's native format:
- Anthropic: Uses document blocks with cache control for PDFs
- OpenAI: Converts PDFs to the Responses API format
- Gemini: Uses native multimodal parts with media resolution options
- xAI (Grok): Accepts images only, sent in the standard image_url format
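To make the translation step concrete, here is a sketch of two of the conversions in the list above. It is not Ensemble's implementation: the function names are hypothetical, the target shapes follow the publicly documented OpenAI-style `image_url` part and Gemini REST `inline_data` part as commonly used, and Gemini's media resolution options are omitted.

```python
def to_image_url_part(block: dict) -> dict:
    """Convert an image block to an OpenAI-style image_url part (data URL)."""
    if block["type"] != "image":
        raise ValueError("the image_url format carries images only")
    source = block["source"]
    url = f"data:{source['media_type']};base64,{source['data']}"
    return {"type": "image_url", "image_url": {"url": url}}


def to_gemini_part(block: dict) -> dict:
    """Convert any base64 content block to a Gemini-style inline_data part."""
    source = block["source"]
    return {
        "inline_data": {
            "mime_type": source["media_type"],
            "data": source["data"],
        }
    }


image_block = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png", "data": "iVBORw0KGgo"},
}
print(to_image_url_part(image_block)["image_url"]["url"])
# data:image/png;base64,iVBORw0KGgo
```

Keeping each provider's conversion in its own small function means the provider-neutral block format stays the only shape callers need to know.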
Prompt Caching with Multimodal
Anthropic supports prompt caching for document blocks. Ensemble automatically sets cache_control on large documents so that they are eligible for caching:
{
  "type": "document",
  "source": {"type": "base64", "media_type": "application/pdf", "data": "..."},
  "cache_control": {"type": "ephemeral"}
}
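The automatic step could look roughly like the sketch below. The source does not say what counts as "large", so the `LARGE_DOCUMENT_CHARS` threshold is a placeholder assumption, and `with_cache_control` is a hypothetical helper rather than Ensemble's own function.

```python
import copy

# Placeholder threshold on the base64 payload length. The real cutoff
# Ensemble uses is not stated in this document.
LARGE_DOCUMENT_CHARS = 100_000


def with_cache_control(blocks, threshold=LARGE_DOCUMENT_CHARS):
    """Return copies of the blocks, marking large documents as cacheable."""
    result = []
    for block in blocks:
        block = copy.deepcopy(block)  # leave the caller's blocks untouched
        is_document = block.get("type") == "document"
        if is_document and len(block["source"]["data"]) >= threshold:
            # Keep any cache_control the caller already set.
            block.setdefault("cache_control", {"type": "ephemeral"})
        result.append(block)
    return result


small = {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "JVBERi0x"}}
large = {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "A" * 200_000}}
marked = with_cache_control([small, large])
print("cache_control" in marked[0])  # False
print(marked[1]["cache_control"])    # {'type': 'ephemeral'}
```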