Status: Proposed, for review. This is an RFC, a position proposal. It is not
an approvable architecture decision record, and it is not a contract. Several
open questions (see Open questions), including who owns capacity, the lifecycle
shape, and how providers are discovered, can still change what the chosen
boundary means in practice. The contract sketches in it are illustrative. The target
design is not built yet. What exists on main today is described in
The dataplane today and [Current
Stand up the agent-substrate micro-VM runtime (cloud-hypervisor + kata-agent, PR #287) on a kind cluster on bigbox and run two demos, then tear it down:
- Part A — counter-microvm: in-RAM counter → suspend (VM memory snapshot) → resume on another worker → count continues. The runtime's own demo; pure substrate.
- Part B — helpdesk-microvm: the OpenShell helpdesk agent running as a micro-VM actor →
`/status` + `/chat` (real `gpt-oss:20b-cloud` completions via Ollama Cloud) → suspend → resume → state continues. Proves a real ~3 GB agent workload boots, does name-based egress, snapshots, and
Status: prototype against Kata 3.31.0 (runtime-rs). Validated end-to-end (via shim-ctl):
a counter's in-memory state survives 3 checkpoint/restore cycles (monotonic, no reset), a
live TCP LISTEN socket survives, a ~10 MB memory buffer survives, and a checkpoint
restores in a fresh microVM (migration-style). Engine-driven: both ctr (containerd
native) and crictl (CRI, no kubelet) do the full checkpoint→restore cycle with the counter
surviving; crictl restore needs containerd ≥ 2.3.0. See Proof of concept for the branches.
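
The two checks named above (the counter stays monotonic across checkpoint/restore cycles, and `crictl` restore requires containerd ≥ 2.3.0) can be expressed as a small validation sketch. This is illustrative only and not part of the prototype; the function names and the `(before_checkpoint, after_restore)` reading format are assumptions made for the example.

```python
def parse_version(text):
    """Turn "2.3.0" into (2, 3, 0). Assumes plain dotted integers, no suffix."""
    return tuple(int(part) for part in text.split("."))


def supports_crictl_restore(containerd_version, minimum="2.3.0"):
    """True if the containerd version meets the minimum stated for crictl restore."""
    return parse_version(containerd_version) >= parse_version(minimum)


def counter_survived(cycles):
    """Check a counter across checkpoint/restore cycles.

    cycles is a list of (before_checkpoint, after_restore) readings
    (hypothetical format). The counter survived if it never goes
    backwards, either within a cycle or between consecutive cycles.
    """
    previous_after = None
    for before, after in cycles:
        if previous_after is not None and before < previous_after:
            return False  # counter went backwards between cycles
        if after < before:
            return False  # counter reset or lost state on restore
        previous_after = after
    return True
```

Comparing versions as integer tuples avoids the string-comparison trap where `"2.10.0"` sorts before `"2.3.0"`.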
Generated: 2026-06-03
Repo: agent-substrate/substrate — "Agent Substrate: the core system" (public, 468★, Apache-2.0)
What it is: a system on top of Kubernetes that manages agent-like workloads at higher scale/lower latency by taking the K8s control-plane out of the critical path — actors run in gVisor sandboxes (ateom), managed by a kubelet-like agent (atelet), with GCS checkpoint/restore (ategcs) and a router (atenet).
Window: 2026-05-13 → 2026-06-03 (~3 weeks — a brand-new seed project)
Volume analyzed: 95 commits · 117 PRs (all states) · 63 issues (all states)
Analysis basis: upstream agent-substrate/substrate@main (e26cfa22), cloned fresh — the local dims/substrate fork checkout (4cbac18) was a few commits behind.
Framing: Unlike the NVIDIA-owned reports (nvsentinel / dra-driver / aicr / OpenShell), this repo is not NVIDIA-owned — it's a
- Generated: 2026-05-26
- Repository: `nvidia/OpenShell` (working copy: `/Users/dsrinivas/go/src/github.com/nvidia/OpenShell`)
- Total commits analyzed (full `main` history): 754
- Total unique commit-author emails: 58
- Total unique GitHub handles (resolved): 51 (excluding bots)
Generated: 2026-06-03
Repo: NVIDIA/aicr — "Tooling for optimized, validated, and reproducible GPU-accelerated AI runtime in Kubernetes" (323★)
History analyzed: 2026-01-30 → 2026-06-03 (~4 months), main @ f65d7b0
Total commits analyzed: 1,205 (44 unique author emails → 35 distinct GitHub handles + 3 bots)
Analysis basis: working copy is the dims/aicr2 fork; its main HEAD (f65d7b0eddcda…) is identical to upstream NVIDIA/aicr@main, so the local history faithfully represents upstream.
Methodology: Extracted every commit author via git log (email, name, date, and Signed-off-by trailer via %(trailers)) → resolved each email to a GitHub login through the upstream commit API (GET /repos/NVIDIA/aicr/commits/{sha} → .author.login) → classified each handle by (1) Helios LDAP match, (2) @nvidia.com commit email, (3) NVIDIA GitHub-org membership (`GET /orgs/NVIDIA/member
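
The ordered classification described in the methodology can be sketched as a pure function. This is a minimal illustration, not the script used for the report: the `Author` record, its field names, and the label strings are hypothetical, and it assumes the org-membership check has already been made and its HTTP status recorded (204 meaning confirmed member).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    # All field names are hypothetical, chosen for this sketch.
    email: str
    login: Optional[str] = None              # GitHub login resolved from the commit API
    helios_match: bool = False               # result of the Helios LDAP lookup
    org_member_status: Optional[int] = None  # HTTP status of the org-membership check


def classify(author: Author) -> str:
    """Apply the three signals in the order the methodology lists them."""
    if author.helios_match:
        return "nvidia (helios)"
    if author.email.lower().endswith("@nvidia.com"):
        return "nvidia (email)"
    if author.org_member_status == 204:
        return "nvidia (org member)"
    return "external candidate"
```

The first matching signal wins, so an author with a personal email but a Helios match is still counted as NVIDIA.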
Update (2026-05-29): this standalone PoC has since been turned into a full in-repo implementation (Phases 0–3) and a cluster e2e — a counter actor on a Firecracker worker driven through the real control plane (ate-api-server + atenet), state preserved across suspend/resume, on the existing kind cluster. Branch
`firecracker-backend` (pushed to `dims/substrate`, commit `bc533f5`; worktree `~/go/src/github.com/agent-substrate/substrate-firecracker`). Full journal: `~/notes/agent-substrate/2026-05-29-firecracker-backend-implementation-log.md`. The PoC notes below are retained for the from-scratch microVM bring-up details (rootfs build, Firecracker API sequence, gotchas).
- Date: 2026-05-29 · Host: `bigbox` (Ubuntu 24.04, AMD EPYC 7763, nested KVM) · Firecracker: v1.15.1 · Guest kernel: `vmlinux-6.1.128`
- Goal: prove a Firecracker backend can satisfy substrate's `ateom` Run/Checkpoint/Restore contract, preserving
| Field | Value |
|---|---|
| Status | Implementable minimal alpha |
| Feature gate | HostManagedIMEX |
| Scope | Install-wide, not per-ComputeDomain |
| Primary goal | Stop launching per-ComputeDomain IMEX DaemonSets when the host already runs nvidia-imex |
| Primary non-goal | Per-ComputeDomain channel isolation across an IMEX fabric |

    # set PATH and check if cluster is present (all terminals)
    export PATH=$HOME/go/bin:$PATH
    kubectl version

    # ============================================================
    # Terminal A - keep this running, watches and port-forwards.
    # ============================================================
    kubectl port-forward -n ate-system svc/atenet-router 8000:80 &
    kubectl port-forward -n ate-openshell-m0 svc/openshell-gateway-substrate 50051:50051 &
Generated: 2026-05-11 (rev. 2 — Helios cross-check added)
Repo: kubernetes-sigs/dra-driver-nvidia-gpu
Repo history: 2022-07-14 → 2026-05-11 (~3.8 years)
Total commits analyzed: 1,853 (47 unique author emails)
Methodology: Extracted all unique commit authors via git log → classified by email domain (@nvidia.com = NVIDIA, all others = candidates) → mapped commits to GitHub logins via GET /repos/.../commits/{sha} → verified every candidate against GET /orgs/NVIDIA/members/{username} (HTTP 204 = confirmed member, 404 = not a member) → for ambiguous cases, additionally cross-referenced against NVIDIA Helios LDAP (helios-cli user search) to detect NVIDIA employees who contribute via personal GitHub accounts not registered in the NVIDIA org → cross-referenced GitHub profiles, DCO Signed-off-by trailers, LinkedIn, and corporate-email patterns → folded NVIDIA-personal-e