Gradatum · Agent Layer

Foundations first.
Runtime next.

Two of the three layers that make a self-hosted agent stack are already in place. The memory store and the compute backbone are live — sovereign, embedded, no SaaS round-trip. The agent runtime that orchestrates them comes with v0.6.0.

Memory + Compute live Runtime · v0.6.0

Live since v0.3.0

Coming · v0.6.0

Live since v0.3.0

Two foundations.
Already in place.

The runtime coming in v0.6.0 builds on two live components — a compute layer for inference routing, and a memory layer for persistent knowledge. Both are sovereign, embedded, and running today.

LIVE Compute

Gateway — unified LLM router

Gateway

Context

OpenAI-compatible Chat Completions — multi-turn, multimodal content (text + base64 images), tool-calling, SSE streaming. Semantic reranking and embeddings as native endpoints.

Routing & fallback

Each model alias maps to a primary backend with an optional configurable fallback. A per-provider circuit-breaker (closed / open / half-open) trips on transient errors, and a smart router applies per-context defaults or header-driven overrides.

Logs

Every LLM call is recorded in an append-only journal — alias, effective provider, route, latency, tokens, HTTP code, role id — with 30-day retention, a 5M-event cap, and Prometheus export. Conversation content is never captured.

Identity

Endpoints are guarded by Bearer token with constant-time comparison and an optional loopback bypass keyed on the real TCP address. Business-context ids (feature, agent) are propagated into telemetry — never exposed in logs.

Engine fleet — local inference

Engine fleet

Optimization

A native Rust supervisor runs inference processes directly — managing GPU, context window, slot parallelism and sampling per model instance, under strict execution-environment isolation and graceful shutdown.

Monitoring

Each inference instance exposes a real-time health endpoint (starting / ok / unhealthy) and Prometheus metrics (requests, latencies), with automatic supervision under a bounded restart budget and flapping detection.

Shared config & logs

Per-instance configuration is declarative TOML with environment overrides. Every request emits an async, non-blocking telemetry event (route, model, latency, HTTP code, role id) to the central journal — engine and server share the same core types and auth handshake.

LIVE Memory

Vault

Sovereign knowledge store — BM25 + cosine + RRF, MCP, history, temporal index.

the memory layer → /vault

Roadmap · v0.6.0

The agent runtime.
What comes next.

With the memory store and compute backbone already in place, v0.6.0 adds the orchestration layer — the code that consumes both foundations to run autonomous agent loops.

coming

Context assembly

Sliding-window continuity over the conversation, combined with proactive recall from the Vault — relevant memories injected at the right moment, not on every turn.

coming

ReAct runtime

Agent loop with tool dispatch via MCP, built-in guards against runaway loops, and structured observation → thought → action traces for auditability.

coming

Skill selection

MCP-native dispatch matched to agent intent. The gateway routes to the right engine role; the skill layer decides which capability to invoke.

All three components build on the foundations already LIVE. No new storage engine, no new inference stack — the same discipline, extended upward.

Foundations first. Runtime next.

Two foundations. Already in place.

Gateway

Engine fleet

Vault

The agent runtime. What comes next.

Context assembly

ReAct runtime

Skill selection

Foundations first.
Runtime next.

Two foundations.
Already in place.

The agent runtime.
What comes next.