Gateway
OpenAI-compatible Chat Completions — multi-turn, multimodal content (text + base64 images), tool-calling, SSE streaming. Semantic reranking and embeddings as native endpoints.
Each model alias maps to a primary backend with an optional configurable fallback. A per-provider circuit-breaker (closed / open / half-open) trips on transient errors, and a smart router applies per-context defaults or header-driven overrides.
Every LLM call is recorded in an append-only journal — alias, effective provider, route, latency, tokens, HTTP code, role id — with 30-day retention, a 5M-event cap, and Prometheus export. Conversation content is never captured.
Endpoints are guarded by Bearer token with constant-time comparison and an optional loopback bypass keyed on the real TCP address. Business-context ids (feature, agent) are propagated into telemetry — never exposed in logs.