Davanum Srinivas dims

Dataplane Pluggability

Status: Proposed, for review. This is an RFC, a position proposal. It is not an approvable architecture decision record, and it is not a contract. Several questions below, including who owns capacity, the lifecycle shape, and how providers are discovered, can still change what the chosen boundary means in practice. The contract sketches in it are illustrative, and a number of questions are still open (see Open questions). The target design is not built yet. What exists on main today is described in The dataplane today and [Current

Micro-VM (substrate PR #287) on bigbox — runbook

Stand up the agent-substrate micro-VM runtime (cloud-hypervisor + kata-agent, PR #287) on a kind cluster on bigbox and run two demos, then tear it down:

Part A — counter-microvm: in-RAM counter → suspend (VM memory snapshot) → resume on another worker → count continues. The runtime's own demo; pure substrate.
Part B — helpdesk-microvm: the OpenShell helpdesk agent running as a micro-VM actor → /status + /chat (real gpt-oss:20b-cloud completions via Ollama Cloud) → suspend → resume → state continues. Proves a real ~3 GB agent workload boots, does name-based egress, snapshots, and

CRIU Checkpoint / Restore for Kata Containers

Status: prototype against Kata 3.31.0 (runtime-rs). Validated end-to-end (via shim-ctl): a counter's in-memory state survives 3 checkpoint/restore cycles (monotonic, no reset), a live TCP LISTEN socket survives, a ~10 MB memory buffer survives, and a checkpoint restores in a fresh microVM (migration-style). Engine-driven: both ctr (containerd native) and crictl (CRI, no kubelet) do the full checkpoint→restore cycle with the counter surviving; crictl restore needs containerd ≥ 2.3.0. See Proof of concept for the branches.
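The "monotonic, no reset" claim above can be checked mechanically once counter samples have been collected across the checkpoint/restore cycles. The following is a minimal sketch, not part of the prototype: the function name `counter_never_decreases` is hypothetical, and it assumes the samples are plain integers passed in the order they were observed, treating any decrease as a reset.

```shell
# Hypothetical helper (not from the prototype): succeeds only if the
# integer samples, given in observation order, never decrease.
counter_never_decreases() {
  prev=""
  for value in "$@"; do
    # A value lower than its predecessor means the counter was reset.
    if [ -n "$prev" ] && [ "$value" -lt "$prev" ]; then
      echo "counter dropped from $prev to $value" >&2
      return 1
    fi
    prev=$value
  done
  return 0
}
```

For example, `counter_never_decreases 3 7 12 18` succeeds, while `counter_never_decreases 3 7 1 2` fails and reports the drop on stderr.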

Motivation

Agent Substrate — Cross-Vendor Contributor & Affiliation Report

Generated: 2026-06-03
Repo: agent-substrate/substrate, "Agent Substrate: the core system" (public, 468★, Apache-2.0)
What it is: a system on top of Kubernetes that manages agent-like workloads at higher scale/lower latency by taking the K8s control-plane out of the critical path. Actors run in gVisor sandboxes (ateom), managed by a kubelet-like agent (atelet), with GCS checkpoint/restore (ategcs) and a router (atenet).
Window: 2026-05-13 → 2026-06-03 (~3 weeks; a brand-new seed project)
Volume analyzed: 95 commits · 117 PRs (all states) · 63 issues (all states)
Analysis basis: upstream agent-substrate/substrate@main (e26cfa22), cloned fresh; the local dims/substrate fork checkout (4cbac18) was a few commits behind.

Framing: Unlike the NVIDIA-owned reports (nvsentinel / dra-driver / aicr / OpenShell), this repo is not NVIDIA-owned — it's a

External Contributor & DCO-Hygiene Report — `nvidia/OpenShell`

Generated: 2026-05-26
Repository: nvidia/OpenShell (working copy: /Users/dsrinivas/go/src/github.com/nvidia/OpenShell)
Total commits analyzed (full main history): 754
Total unique commit-author emails: 58
Total unique GitHub handles (resolved): 51 (excluding bots)

Methodology summary

aicr — External Contributor & DCO-Hygiene Report

Generated: 2026-06-03
Repo: NVIDIA/aicr, "Tooling for optimized, validated, and reproducible GPU-accelerated AI runtime in Kubernetes" (323★)
History analyzed: 2026-01-30 → 2026-06-03 (~4 months), main @ f65d7b0
Total commits analyzed: 1,205 (44 unique author emails → 35 distinct GitHub handles + 3 bots)
Analysis basis: working copy is the dims/aicr2 fork; its main HEAD (f65d7b0eddcda…) is identical to upstream NVIDIA/aicr@main, so the local history faithfully represents upstream.

Methodology: Extracted every commit author via git log (email, name, date, and Signed-off-by trailer via %(trailers)) → resolved each email to a GitHub login through the upstream commit API (GET /repos/NVIDIA/aicr/commits/{sha} → .author.login) → classified each handle by (1) Helios LDAP match, (2) @nvidia.com commit email, (3) NVIDIA GitHub-org membership (`GET /orgs/NVIDIA/member

Firecracker `ateom` Backend — Working PoC on bigbox (counter demo)

Update (2026-05-29): this standalone PoC has since been turned into a full in-repo implementation (Phases 0–3) and a cluster e2e — a counter actor on a Firecracker worker driven through the real control plane (ate-api-server + atenet), state preserved across suspend/resume, on the existing kind cluster. Branch firecracker-backend (pushed to dims/substrate, commit bc533f5; worktree ~/go/src/github.com/agent-substrate/substrate-firecracker). Full journal: ~/notes/agent-substrate/2026-05-29-firecracker-backend-implementation-log.md. The PoC notes below are retained for the from-scratch microVM bring-up details (rootfs build, Firecracker API sequence, gotchas).

Date: 2026-05-29 · Host: bigbox (Ubuntu 24.04, AMD EPYC 7763, nested KVM) · Firecracker: v1.15.1 · Guest kernel: vmlinux-6.1.128
Goal: prove a Firecracker backend can satisfy substrate's ateom Run/Checkpoint/Restore contract, preserving

Design v2: Host-Managed IMEX, Minimal Alpha

| Field | Value |
| --- | --- |
| Status | Implementable minimal alpha |
| Feature gate | `HostManagedIMEX` |
| Scope | Install-wide, not per-`ComputeDomain` |
| Primary goal | Stop launching per-`ComputeDomain` IMEX DaemonSets when the host already runs `nvidia-imex` |
| Primary non-goal | Per-`ComputeDomain` channel isolation across an IMEX fabric |

dra-driver-nvidia-gpu — External Contributor Report

Generated: 2026-05-11 (rev. 2; Helios cross-check added)
Repo: kubernetes-sigs/dra-driver-nvidia-gpu
Repo history: 2022-07-14 → 2026-05-11 (~3.8 years)
Total commits analyzed: 1,853 (47 unique author emails)
Methodology: Extracted all unique commit authors via git log → classified by email domain (@nvidia.com = NVIDIA, all others = candidates) → mapped commits to GitHub logins via GET /repos/.../commits/{sha} → verified every candidate against GET /orgs/NVIDIA/members/{username} (HTTP 204 = confirmed member, 404 = not a member) → for ambiguous cases, additionally cross-referenced against NVIDIA Helios LDAP (helios-cli user search) to detect NVIDIA employees who contribute via personal GitHub accounts not registered in the NVIDIA org → cross-referenced GitHub profiles, DCO Signed-off-by trailers, LinkedIn, and corporate-email patterns → folded NVIDIA-personal-e

	# set PATH and check if cluster is present (all terminals)
	export PATH="$HOME/go/bin:$PATH"
	kubectl version

	# ============================================================
	# Terminal A — keep this running, watches and port-forwards.
	# ============================================================
	kubectl port-forward -n ate-system svc/atenet-router 8000:80 &
	kubectl port-forward -n ate-openshell-m0 svc/openshell-gateway-substrate 50051:50051 &
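
Because both port-forwards above are started in the background with `&`, they keep running until they are killed explicitly. One way to make teardown less error-prone is to record each background PID and stop them together. The sketch below is an assumption-laden illustration, not part of the runbook: `start_bg`, `stop_all`, and the `bg_pids` variable are hypothetical names.

```shell
# Hypothetical helpers (not from the runbook): track background jobs
# so they can all be stopped in one step during teardown.
bg_pids=""

# Run a command in the background and remember its PID.
start_bg() {
  "$@" &
  bg_pids="$bg_pids $!"
}

# Send SIGTERM to every recorded PID, then clear the list.
stop_all() {
  for pid in $bg_pids; do
    kill "$pid" 2>/dev/null
  done
  bg_pids=""
}
```

With these helpers, a port-forward would be launched as `start_bg kubectl port-forward -n ate-system svc/atenet-router 8000:80`, and `stop_all` (optionally registered with `trap stop_all EXIT`) would end every forward started this way.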