Gradatum · Agent Layer

Foundations first.
Runtime next.

Two of the three layers that make a self-hosted agent stack are already in place. The memory store and the compute backbone are live — sovereign, embedded, no SaaS round-trip. The agent runtime that orchestrates them comes with v0.6.0.

Memory + Compute live Runtime · v0.6.0

Live since v0.3.0
Coming · v0.6.0

Live since v0.3.0

Two foundations.
Already in place.

The runtime coming in v0.6.0 builds on two live components — a compute layer for inference routing, and a memory layer for persistent knowledge. Both are sovereign, embedded, and running today.

LIVE Compute
Gateway — unified LLM router

Gateway

Context

OpenAI-compatible Chat Completions — multi-turn, multimodal content (text + base64 images), tool-calling, SSE streaming. Semantic reranking and embeddings as native endpoints.

Routing & fallback

Each model alias maps to a primary backend with an optional configurable fallback. A per-provider circuit-breaker (closed / open / half-open) trips on transient errors, and a smart router applies per-context defaults or header-driven overrides.

Logs

Every LLM call is recorded in an append-only journal — alias, effective provider, route, latency, tokens, HTTP code, role id — with 30-day retention, a 5M-event cap, and Prometheus export. Conversation content is never captured.

Identity

Endpoints are guarded by Bearer token with constant-time comparison and an optional loopback bypass keyed on the real TCP address. Business-context ids (feature, agent) are propagated into telemetry — never exposed in logs.

Engine fleet — local inference

Engine fleet

Optimization

A native Rust supervisor runs inference processes directly — managing GPU, context window, slot parallelism and sampling per model instance, under strict execution-environment isolation and graceful shutdown.

Monitoring

Each inference instance exposes a real-time health endpoint (starting / ok / unhealthy) and Prometheus metrics (requests, latencies), with automatic supervision under a bounded restart budget and flapping detection.

Shared config & logs

Per-instance configuration is declarative TOML with environment overrides. Every request emits an async, non-blocking telemetry event (route, model, latency, HTTP code, role id) to the central journal — engine and server share the same core types and auth handshake.

LIVE Memory

Vault

Sovereign knowledge store — BM25 + cosine + RRF, MCP, history, temporal index.


Roadmap · v0.6.0

The agent runtime.
What comes next.

With the memory store and compute backbone already in place, v0.6.0 adds the orchestration layer — the code that consumes both foundations to run autonomous agent loops.

coming

Context assembly

Sliding-window continuity over the conversation, combined with proactive recall from the Vault — relevant memories injected at the right moment, not on every turn.

coming

ReAct runtime

Agent loop with tool dispatch via MCP, built-in guards against runaway loops, and structured observation → thought → action traces for auditability.

coming

Skill selection

MCP-native dispatch matched to agent intent. The gateway routes to the right engine role; the skill layer decides which capability to invoke.

All three components build on the foundations already LIVE. No new storage engine, no new inference stack — the same discipline, extended upward.