@yoniLavi
Last active May 11, 2026 10:22

Persistent Experience for Language Model Agents - A research proposal

Persistent Experience for Language Model Agents

A research proposal on hybrid memory architectures

Budget envelope: $100,000 in cloud compute

Timeframe: ~6 months of focused work

Date: May 2026


Summary

Current language model agents have two kinds of memory, and neither does what we need. Weights encode genuine learning — the model "knows" something in the same way it knows the contents of its training data — but they're updated only during expensive offline training runs and can't easily incorporate new experience. Context (including retrieved text from external memory stores) is updated freely as the agent operates, but lives outside the model's actual cognition: reading a stored memory of an experience is structurally different from having lived it.

This proposal is for a six-month research program to build and evaluate a hybrid architecture that integrates three recent threads of work:

  1. Letta's sleep-time compute (April 2025) — agents consolidate accumulated context into refined memory blocks during idle periods, in token space.
  2. Google's Titans (January 2025) — neural memory that updates during inference using gradient-derived "surprise" as a salience signal, in weight space.
  3. TTT-E2E (Astera/NVIDIA/Stanford/Berkeley, December 2025) — long-context modeling reformulated as continual learning, with sliding-window attention plus targeted weight updates to specific MLP layers, achieving constant-latency inference at long context.

Each addresses a different piece of the puzzle. None combines them. The program targets four open questions:

  1. How do you persist within-session weight adaptations across sessions selectively?
  2. How do you handle proactive interference — the failure mode where models can't track which value of a repeatedly-updated variable is current?
  3. How do you make weight-encoded memories auditable and reversible?
  4. What salience signal should gate consolidation, and what does the answer reveal about model self-knowledge?

The deliverable is a working hybrid architecture with comparative evaluation, mechanistic interpretability snapshots of what consolidation does to the model, novel evaluations for cross-session persistence and rollback, and a written analysis. We use the word experiential in this proposal to describe architectures that are temporally continuous, integrate past states into current processing, and consolidate selectively — the structural features associated with experience. We make no claims about phenomenal experience or moral status; see the framing note below for why we use the word at all.

A note on framing

We describe the target architecture as more experientially structured than current retrieval-based memory systems. By this we mean that it exhibits temporal continuity across sessions, integrated accumulation of past states into current processing rather than retrieval-as-lookup, and selective consolidation that resembles biological memory dynamics. We make no claim about phenomenal experience or moral status. Whether richer functional analogs of experience constitute or imply phenomenal experience is a contested philosophical question that architectural work alone cannot address.

We use the word at all, rather than perpetually hedging, because the alternative cedes conceptual vocabulary to people willing to overclaim in either direction. There exists an actual middle phenomenon — systems with genuine continuity and integration that are neither obviously conscious nor merely retrieval-based — and it deserves precise description. Throughout the proposal, "experiential" is used in the structural sense defined here.


Background and motivation

The motivating scenario: imagine an AI agent that could meaningfully accumulate experience the way a human collaborator does — that could watch a TV series across multiple sessions and have its understanding of episode 30 informed by what happened in episode 1, that could work on a codebase over months and develop genuine familiarity with it, that could have an ongoing relationship with a user that builds on what came before.

Current systems approximate this with retrieval-augmented generation: store text from past sessions, retrieve relevant chunks, stuff them into context. This works for fact lookup but doesn't produce anything resembling having learned the material. The retrieved text is processed as new information at every retrieval, with no integration into the model's underlying representations.

The alternative — fine-tuning on accumulated experience — produces real learning but is too slow, expensive, and uninterpretable to use as a continuous mechanism. You can't fine-tune after every conversation, and even if you could, there's no good way to selectively undo a bad update or audit what was learned.

The research community has converged on the recognition that this gap matters. The existing work cited above represents three credible attacks on the problem from different angles. The opportunity now is integration: building something that combines their strengths while addressing the failure modes none of them individually handles.


What's already been built

Letta's sleep-time compute is a productionized framework where dedicated "sleep-time agents" run asynchronously during idle periods to reorganize, consolidate, and improve memory blocks shared with primary agents. Reported gains: 5x reduction in test-time compute at equal accuracy, with up to 18% accuracy improvements. Operates entirely in token space. Subsequent releases include Context Repositories (git-versioned context management), a Letta Leaderboard for evaluating agent memory, and Skill Learning for dynamic skill acquisition.

Titans introduces a deep neural long-term memory module that learns at test time. Its key innovation is using the gradient of the memory's loss on an incoming input as a "surprise" signal: a large gradient means the memory must change substantially to fit the input, so the input is unexpected and worth remembering. It maintains both momentary surprise and a decaying memory of past surprises, with a forgetting mechanism implemented as weight decay.
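To make the surprise mechanism concrete, here is a minimal sketch of a Titans-style update on a toy linear memory. It assumes an associative loss of the form ||M k - v||^2; the function names and the values of `lr`, `decay`, and `forget` are illustrative choices of ours, not settings taken from the Titans paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def surprise_gradient(memory: Matrix, key: Vector, value: Vector) -> Matrix:
    """Gradient of ||memory @ key - value||^2 with respect to the memory weights."""
    error = [p - v for p, v in zip(matvec(memory, key), value)]
    return [[2.0 * e * k for k in key] for e in error]


def surprise_norm(grad: Matrix) -> float:
    """Scalar surprise: Frobenius norm of the gradient."""
    return sum(g * g for row in grad for g in row) ** 0.5


def titans_style_step(memory: Matrix, momentum: Matrix, key: Vector, value: Vector,
                      lr: float = 0.1, decay: float = 0.9,
                      forget: float = 0.01) -> Tuple[Matrix, Matrix]:
    """One test-time update: decayed past surprise plus momentary surprise."""
    grad = surprise_gradient(memory, key, value)
    # Past surprise decays; momentary surprise is a negative gradient step.
    new_momentum = [[decay * s - lr * g for s, g in zip(s_row, g_row)]
                    for s_row, g_row in zip(momentum, grad)]
    # Forgetting acts like weight decay on what the memory already stores.
    new_memory = [[(1.0 - forget) * m + s for m, s in zip(m_row, s_row)]
                  for m_row, s_row in zip(memory, new_momentum)]
    return new_memory, new_momentum


memory = [[0.0, 0.0], [0.0, 0.0]]
momentum = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [0.0, 1.0]

initial_surprise = surprise_norm(surprise_gradient(memory, key, value))
for _ in range(200):
    memory, momentum = titans_style_step(memory, momentum, key, value)
final_surprise = surprise_norm(surprise_gradient(memory, key, value))
```

Repeating the same key-value pair drives the surprise toward a small residual that the forgetting term keeps from reaching zero, which is the behavior a salience gate would exploit: familiar inputs stop registering as worth remembering.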

TTT-E2E reformulates long-context language modeling as continual learning. The architecture: standard Transformer with sliding-window attention as "working memory," plus targeted weight updates to MLP layers in the final 25% of blocks during inference. Dual-track storage prevents forgetting general capabilities while learning a new document. Meta-learned initialization makes the test-time updates work. Reports 2.7x speedup over full attention at 128K context, 35x at 2M context, with constant inference latency regardless of context length.
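As a small illustration of the two structural pieces named above, the sketch below computes which blocks would be mutable under a "final 25%" rule and builds a causal sliding-window mask. The helper names and the choice to round up are our assumptions; the TTT-E2E implementation may differ in both.

```python
import math
from typing import List


def mutable_block_indices(n_blocks: int, fraction: float = 0.25) -> List[int]:
    """Indices of the last `fraction` of blocks (rounding up is an assumption)."""
    n_mutable = math.ceil(n_blocks * fraction)
    return list(range(n_blocks - n_mutable, n_blocks))


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Causal, and limited to the most recent `window` positions including i.
    """
    return [[(i - window < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]


blocks = mutable_block_indices(32)   # a 32-block model: blocks 24..31
mask = sliding_window_mask(4, 2)
```

For a 32-block model this marks blocks 24 through 31 as mutable, and with a window of 2 each position sees only itself and its immediate predecessor.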

Adjacent work that matters:

  • SuRe (surprise-driven prioritized replay) found that surprise-based selection alone underperforms reservoir sampling for continual LLM fine-tuning, but combines well with slow-weight consolidation. This complicates the Titans story.
  • SleepGate identified that LLMs suffer from proactive interference (PI) — when given a stream of semantically related key-value pairs where later entries overwrite earlier ones, models fail to retrieve the most recent value. This is a working memory bottleneck independent of context length.
  • MemoryAgentBench and MemoryBench are recent evaluation suites probing accurate retrieval, test-time learning, long-range understanding, and selective forgetting through multi-session interactions.
  • A March 2026 paper on Selective Memory offers both a critique and a positive proposal: weight-encoded memories in Titans-style architectures exist as continuous matrices with no representation of individual stored facts, no deletion mechanism, and no provenance. The paper proposes hierarchical archiving with discrete addressable units. This work is worth direct engagement, not just citation, and Project 3 builds on its framing.
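The proactive-interference failure described in the SleepGate item above can be illustrated with a toy probe: a stream of key-value updates in which later writes overwrite earlier ones, scored against the last value written per key. This is a sketch under our own assumptions; the function names and stream format are hypothetical and are not the format used by SleepGate or any published benchmark.

```python
import random
from typing import Dict, List, Sequence, Tuple

Update = Tuple[str, str]


def make_update_stream(keys: Sequence[str], values: Sequence[str],
                       updates_per_key: int, seed: int = 0) -> List[Update]:
    """Interleave repeated updates so each key is written several times."""
    rng = random.Random(seed)
    stream = [(k, rng.choice(values)) for k in keys for _ in range(updates_per_key)]
    rng.shuffle(stream)
    return stream


def current_values(stream: Sequence[Update]) -> Dict[str, str]:
    """Ground truth: the last value written for each key."""
    latest: Dict[str, str] = {}
    for key, value in stream:
        latest[key] = value
    return latest


def pi_accuracy(answers: Dict[str, str], stream: Sequence[Update]) -> float:
    """Fraction of keys for which the answer matches the current value."""
    truth = current_values(stream)
    return sum(answers.get(k) == v for k, v in truth.items()) / len(truth)


stream = [("blood_pressure", "120"), ("heart_rate", "70"), ("blood_pressure", "135")]
stale_answers = {"blood_pressure": "120", "heart_rate": "70"}
score = pi_accuracy(stale_answers, stream)  # 0.5: the stale reading counts as wrong
```

An agent that answers with the first value it saw for `blood_pressure` scores 0.5 here, even though every value it reports did appear in its input.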

The field has moved fast. Anything synthesized here has perhaps a three-month half-life before being partially superseded.


Proposed work

Four projects. The allocation reflects expected difficulty and expected information value, with a deliberate weighting toward Project 4 because the salience-and-self-knowledge question is the most underexplored in the existing literature.

The projects have dependencies. Project 1 builds the hybrid architecture that Projects 2, 3, and 4 evaluate or extend, so it should start first. Project 2 and Project 4 can run partially in parallel with Project 1 once the basic hybrid is operational. Project 3 depends on Project 1 being far enough along that adapter-level rollback is meaningful to test.

Project 1: Hybrid TTT-E2E + sleep-time consolidation

Allocation: ~$35,000

The core engineering experiment. Build an architecture that performs TTT-E2E-style within-session weight adaptation and then, during sleep-time periods, selectively promotes some of the targeted MLP-layer updates into a persistent adapter while letting the others decay back to baseline.

Implementation sketch:

  • Base model: a small open model (Llama 3.1 8B or comparable) where repeated fine-tuning experiments are affordable.
  • Inference path: TTT-E2E-style sliding-window attention plus mutable MLP layers in the final 25% of blocks.
  • At session end: snapshot the weight delta produced by within-session learning.
  • During sleep-time: a consolidation process decides which deltas to promote into a persistent LoRA adapter, which to discard, and which to merge with existing persistent learning.
  • Salience gating for promotion is itself a variable (see Project 4).
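The snapshot-then-promote flow in the list above can be sketched as follows. This is a deliberately simplified illustration: deltas are stored as dense per-layer lists rather than as a low-rank LoRA adapter, and the salience scores, the `threshold`, the `merge_rate`, and the layer names are all hypothetical placeholders.

```python
from typing import Dict, List

Delta = Dict[str, List[float]]


def snapshot_delta(before: Delta, after: Delta) -> Delta:
    """Per-layer weight change produced by within-session learning."""
    return {name: [a - b for a, b in zip(after[name], before[name])]
            for name in after}


def consolidate(persistent: Delta, session: Delta, salience: Dict[str, float],
                threshold: float = 0.5, merge_rate: float = 1.0) -> Delta:
    """Promote salient session deltas into the persistent store; drop the rest."""
    updated = {name: list(values) for name, values in persistent.items()}
    for name, delta in session.items():
        if salience.get(name, 0.0) < threshold:
            continue  # discarded: this layer decays back to baseline
        base = updated.get(name, [0.0] * len(delta))
        # Merge with any existing persistent learning for the same layer.
        updated[name] = [p + merge_rate * d for p, d in zip(base, delta)]
    return updated


before = {"mlp.30": [1.0, 2.0], "mlp.31": [0.0, 0.0]}
after = {"mlp.30": [1.5, 2.0], "mlp.31": [0.25, -0.25]}
session_delta = snapshot_delta(before, after)
persistent = consolidate({}, session_delta, {"mlp.30": 0.9, "mlp.31": 0.1})
```

Gating per layer is only one possible granularity. Gating per experience, or per low-rank component, would fit the same interface with a different key space.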

Interpretability hook: Each consolidation event produces a snapshot of (a) the experience that was consolidated, (b) the resulting adapter delta, (c) which features and circuits were most affected. Run mechanistic interpretability tools on the deltas — at minimum, activation patching, feature attribution, and sparse-autoencoder-based feature analysis where available. The goal is to be able to say something concrete about what's actually happening when the system "learns from experience," not just measure behavioral outcomes. This addresses the steelmanned skeptical position that "we don't know what mechanism produces the outputs": the response is mechanistic understanding, not more behavior.

Behavioral-divergence eval: Compare the hybrid system's responses on edge cases and out-of-distribution probes against (a) the same model without consolidated experience, and (b) the same model with the experience in raw context rather than consolidated. The interesting question: does consolidation produce different behavior than context-stuffing, in ways that pattern-match more to having learned than to having access to information? This is a direct probe of whether the architectural changes produce integration rather than retrieval.

Standard metrics:

  • Longitudinal performance on MemoryAgentBench (selective forgetting, long-range understanding).
  • Custom evaluation for cross-session persistence (built for this project).
  • Catastrophic forgetting on a held-out general capability suite.
  • Calibration: does the model accurately report what it has and hasn't consolidated? Operationalized as: present probes about specific past experiences, measure agreement between model's reported "I remember/don't remember this" and behavioral evidence of consolidation status.
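One way the calibration measure in the last item could be computed is sketched below. It assumes each probe yields a boolean self-report and a boolean behavioral verdict; how the behavioral verdict is derived is left open, and the metric names are ours.

```python
from typing import Dict, Sequence


def calibration_agreement(reported: Sequence[bool],
                          behavioral: Sequence[bool]) -> Dict[str, float]:
    """Compare 'I remember this' self-reports with behavioral evidence.

    Rates are fractions of all probes, so the three values sum to 1.0.
    """
    if len(reported) != len(behavioral) or not reported:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(reported, behavioral))
    n = len(pairs)
    return {
        "agreement": sum(r == b for r, b in pairs) / n,
        # Model claims a memory that behavior does not support.
        "overclaim": sum(r and not b for r, b in pairs) / n,
        # Behavior shows consolidation the model does not report.
        "underclaim": sum(b and not r for r, b in pairs) / n,
    }


result = calibration_agreement([True, True, False, False],
                               [True, False, False, True])
```

Separating overclaiming from underclaiming matters because the two errors have different consequences: the first is confabulation, the second is learning the model cannot report.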

Expected results: Demonstrating measurable cross-session learning without catastrophic forgetting would be a significant result. The negative result — that within-session deltas don't compose well across sessions — would also be valuable and would point toward needing more substantial architectural changes.

Project 2: Proactive interference as forcing function

Allocation: ~$18,000

Run all candidate architectures (vanilla long-context, Letta sleep-time, Titans, TTT-E2E, the Project 1 hybrid) against the PI-LLM benchmark and SleepGate's evaluations. Identify which architectural features actually help.

Comparatively cheap because the architectures exist (or are being built in Project 1) and the eval is well-defined. The expected output is a clear empirical picture of which approaches handle PI and which don't, and a hypothesis about why.

Why this matters: PI is the kind of failure mode that doesn't show up on standard memory benchmarks but cripples real-world utility. An agent that can't reliably tell you the current state of something it's been tracking is broken in a way that good benchmark numbers can mask.

Project 3: Auditable memory with rollback

Allocation: ~$18,000

Build a wrapper around the Project 1 architecture that maintains both the weight delta AND a discrete log of the experiences that produced it. Make selective unlearning possible by recomputing the delta excluding specified experiences.

Computationally expensive at consolidation time (effectively leave-one-out retraining) but tractable for small adapters. The research question: can you get clean unlearning — removing a specific memory without degrading unrelated capabilities — and how does the cost scale?
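A minimal sketch of the wrapper idea: keep a discrete log next to the delta, and implement forgetting by re-running consolidation over the log with the specified entries removed. The class and field names are hypothetical, and `toy_consolidate` is a stand-in for the real (and expensive) consolidation procedure.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class Experience:
    exp_id: str
    text: str


@dataclass
class AuditableMemory:
    """Keeps the weight delta and the log of experiences that produced it."""
    consolidate_fn: Callable[[Sequence[Experience]], List[float]]
    log: List[Experience] = field(default_factory=list)
    delta: List[float] = field(default_factory=list)

    def record(self, experience: Experience) -> None:
        self.log.append(experience)
        self.delta = self.consolidate_fn(self.log)

    def forget(self, exp_ids: Iterable[str]) -> None:
        """Selective unlearning: recompute the delta without these experiences."""
        drop = set(exp_ids)
        self.log = [e for e in self.log if e.exp_id not in drop]
        self.delta = self.consolidate_fn(self.log)


def toy_consolidate(experiences: Sequence[Experience]) -> List[float]:
    # Placeholder for real consolidation: [count, total characters].
    return [float(len(experiences)), float(sum(len(e.text) for e in experiences))]


memory = AuditableMemory(consolidate_fn=toy_consolidate)
memory.record(Experience("a", "hello"))
memory.record(Experience("b", "hi"))
delta_before = list(memory.delta)
memory.forget(["a"])
```

Because the delta is always a function of the log, the log doubles as provenance: any state of the adapter can be reproduced, inspected, or rolled back by replaying a subset of entries. The cost is one full recomputation per rollback.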

Why this matters:

  • Enables the "git reflog for memory" use case.
  • Provides a credible story for regulatory compliance with right-to-be-forgotten.
  • Makes research iteration feasible (try a consolidation strategy, roll back if it didn't work).
  • Addresses a real interpretability concern: you can actually see what was learned.

Metric: Successful unlearning rate (does the target fact disappear) versus collateral damage rate (do unrelated capabilities degrade). Include a representational check via interpretability tools, not just behavioral unlearning — recent work suggests "removing" a fact often leaves residual traces that behavioral tests miss.
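The two behavioral rates could be computed along these lines. The sketch assumes per-target recall flags and per-task control scores measured before and after unlearning; the `tolerance` for counting a control task as damaged is an arbitrary illustrative value, and the representational check is not covered here.

```python
from typing import Dict, Sequence


def unlearning_report(target_still_recalled: Sequence[bool],
                      control_before: Sequence[float],
                      control_after: Sequence[float],
                      tolerance: float = 0.01) -> Dict[str, float]:
    """Behavioral unlearning success versus collateral damage."""
    success = sum(not r for r in target_still_recalled) / len(target_still_recalled)
    # A control task is damaged if its score drops by more than `tolerance`.
    damaged = sum((b - a) > tolerance for b, a in zip(control_before, control_after))
    return {
        "unlearning_success_rate": success,
        "collateral_damage_rate": damaged / len(control_before),
    }


report = unlearning_report(
    target_still_recalled=[False, False, True, False],
    control_before=[0.9, 0.5, 0.75],
    control_after=[0.8, 0.5, 0.745],
)
```

In this example three of four target facts are gone, and one of three control tasks dropped by more than the tolerance.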

Project 4: Self-rated salience as a probe of model self-knowledge

Allocation: ~$24,000

Run head-to-head comparison: Titans-style gradient surprise vs. asking the model "is this worth remembering?" as the consolidation gate, plus reservoir sampling and frequency-based selection as baselines.

Three possible outcomes, all interesting:

  • Strong correlation between self-rated importance and gradient surprise: evidence that current models have meaningful access to their own learning signal. Substantive finding for interpretability, for the broader question of model self-knowledge, and for how much evidential weight to assign model self-reports about their own states.
  • Substantial divergence: the pattern of divergence becomes the result. What does the model think is important that the gradient signal misses? What does the gradient signal flag that the model dismisses? This could reveal something about the gap between what the model is actually learning and how it represents that learning to itself.
  • Self-report tracks something neither gradient surprise nor downstream usefulness captures: the most surprising outcome, suggesting the model has access to a salience signal we haven't yet characterized.
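To distinguish the first two outcomes, one natural first statistic is a rank correlation between the two salience signals, alongside the reservoir-sampling baseline. The sketch below implements Spearman correlation with average ranks for ties and the classic reservoir algorithm; the example scores are made up for illustration.

```python
import random
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def average_ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of ranks. Undefined if either input is constant."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    kept: List[T] = []
    for n, item in enumerate(stream):
        if n < k:
            kept.append(item)
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                kept[j] = item
    return kept


self_rated = [0.1, 0.4, 0.2, 0.9]          # hypothetical self-report scores
gradient_surprise = [1.0, 4.0, 2.0, 5.0]   # hypothetical gradient norms
rho = spearman(self_rated, gradient_surprise)
baseline_keep = reservoir_sample(range(100), 5)
```

A rank-based statistic is used because the two signals live on unrelated scales. A single correlation will not detect the third outcome, which requires comparing each signal against downstream usefulness as well.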

Extension: Combine with the interpretability hook from Project 1. When self-report and gradient surprise diverge, look at which one better predicts the actual mechanistic changes consolidation produces. This gives a more direct test of which signal tracks reality.

Why the budget is what it is: This project is mostly inference rather than training, so the dollar cost per experiment is relatively low — but the experimental design space is large (different prompting strategies for self-report, different model scales, different domains where salience might mean different things). The budget buys breadth and statistical confidence rather than more compute per experiment.

Reserve

Allocation: ~$5,000

For the experiment that surprises you and demands follow-up.


Methodology and infrastructure

Evaluation: Use existing benchmarks (MemoryAgentBench, MemoryBench, PI-LLM) as primary measures. Build custom evaluations only for the questions they don't cover (cross-session persistence, unlearning fidelity, behavioral divergence between consolidated and context-stuffed conditions, self-knowledge of consolidation). Resist the temptation to build new benchmarks for things existing ones already measure.

Interpretability tooling: Use existing tools rather than building new ones. Activation patching, attribution methods, and sparse autoencoder analysis (where available for the chosen base model) are the workhorses. Budget assumes inheriting tooling from existing interp libraries; substantial tool-building would require additional resources.

Models: Default to small open models (8B parameter range) for the bulk of experiments. This keeps each experiment cheap and lets us run more comparisons. Findings may not all transfer to frontier-scale models, but architectural patterns and which-signal-works questions probably will.

Infrastructure: $100K buys roughly 10–30K H100-hours, depending on provider and commitment (an implied rate of about $3.33 to $10 per H100-hour). A reasonable split: ~70% to training/consolidation experiments, ~30% to evaluation and interpretability runs. Commit to a single cloud provider with reserved capacity rather than spot pricing, because the experiments are sensitive to interruption.
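The budget arithmetic can be checked directly. The allocations and the 70/30 split are taken from this proposal; the hourly rates are simply what the quoted 10–30K H100-hour range implies, not quotes from any provider.

```python
BUDGET = 100_000

ALLOCATIONS = {
    "Project 1": 35_000,
    "Project 2": 18_000,
    "Project 3": 18_000,
    "Project 4": 24_000,
    "Reserve": 5_000,
}


def implied_hourly_rate(budget: float, gpu_hours: float) -> float:
    """Dollars per GPU-hour implied by spending `budget` on `gpu_hours`."""
    return budget / gpu_hours


def split_hours(gpu_hours: int, training_share: float = 0.7) -> tuple:
    """Split GPU-hours between training/consolidation and eval/interp runs."""
    training = round(gpu_hours * training_share)
    return training, gpu_hours - training


total_allocated = sum(ALLOCATIONS.values())
high_rate = implied_hourly_rate(BUDGET, 10_000)   # 10.0 dollars per hour
low_rate = implied_hourly_rate(BUDGET, 30_000)    # about 3.33 dollars per hour
pessimistic_split = split_hours(10_000)
optimistic_split = split_hours(30_000)
```

At the pessimistic end of the range, the 70/30 split leaves about 3,000 H100-hours for all evaluation and interpretability runs, which is the figure to keep in mind when scoping the interpretability hook.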

Open science: Default to publishing negative results. The space has a lot of plausible-sounding architectures that don't actually work, and the field would benefit from clearer signal about what fails.

External engagement: Both the Letta team and the TTT-E2E authors are publicly active and seem interested in collaboration. Reaching out before significant compute spend is sensible — they may have unpublished results that change the priorities, and they may want to collaborate or co-author on the integration work.


Risks and known unknowns

Catastrophic forgetting in the hybrid. Combining within-session weight adaptation with cross-session persistence may exacerbate forgetting in ways neither approach exhibits alone. The dual-track storage in TTT-E2E protects general capabilities from being forgotten but isn't designed to handle accumulated personal/situational learning. Addressing this may require architectural innovation beyond what's proposed here.

Salience signal gaming. Training a gate to predict "usefulness" based on downstream performance risks consolidating only narrowly task-relevant experiences and losing the diffuse-but-important context (mood, style, relationship texture) that makes memory feel like memory. Build qualitative review into the eval loop, not just metrics.

PI may not be solvable architecturally. It's possible that PI is deep enough that no consolidation architecture handles it cleanly without explicit symbolic mechanisms (typed memory slots, explicit versioning of facts). If Project 2 finds all candidates fail, that's a genuinely important result but it changes the proposal's direction.

Unlearning may be harder than expected. Recent work on machine unlearning suggests that "removing" a fact from a fine-tuned model often leaves residual traces. The Project 3 metric needs to include not just behavioral unlearning but also a representational check via interpretability tools.

Self-report may be too prompt-dependent to be informative. Project 4 risks finding that "is this worth remembering?" produces wildly different answers depending on how you ask. Mitigation: explicitly test sensitivity to prompt formulation as part of the experimental design, and treat low robustness as itself a finding rather than a failure.
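One simple way to quantify that sensitivity is the fraction of items whose keep/discard decision is the same under every prompt formulation. This is a sketch; the variant names and the all-variants-agree criterion are our assumptions, and a pairwise or chance-corrected agreement statistic may be a better fit in practice.

```python
from typing import Dict, List


def prompt_stability(decisions: Dict[str, List[bool]]) -> float:
    """Fraction of items with identical decisions across all prompt variants.

    `decisions` maps a prompt-variant name to one boolean per item, in the
    same item order for every variant.
    """
    columns = list(decisions.values())
    if not columns or len({len(c) for c in columns}) != 1 or not columns[0]:
        raise ValueError("need at least one variant and equal, non-zero item counts")
    per_item = list(zip(*columns))
    return sum(len(set(item)) == 1 for item in per_item) / len(per_item)


stability = prompt_stability({
    "is_this_worth_remembering": [True, True, False],
    "will_this_matter_later":    [True, False, False],
    "rate_importance_binary":    [True, True, False],
})
```

Here the second item flips under one formulation, so two of three items are stable.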

Interpretability budget may be insufficient. The interp hook on Project 1 is scoped tightly; if mechanistic findings turn out richer than expected, there will be pressure to expand. Plan for the possibility of follow-on work specifically on the mechanism of consolidation.


What success looks like

At six months, the program should produce:

  1. A working hybrid architecture (Project 1) with quantitative results on at least three standard memory benchmarks, one custom cross-session evaluation, one behavioral-divergence evaluation, and mechanistic interpretability snapshots of consolidation events, compared against at least three baselines.
  2. A clear empirical picture of which existing approaches handle proactive interference and why (Project 2).
  3. A demonstrated unlearning capability (Project 3) with measured fidelity, collateral damage rates, and representational verification.
  4. A characterization of the relationship between gradient-surprise and model-self-reported salience (Project 4), with hypotheses about what the relationship reveals about model self-knowledge.
  5. A written analysis (paper or substantive technical report) integrating the four projects into a coherent picture of where memory architecture for LLM agents stands and where it should go next.

The most interesting possible outcome — interesting in the sense of changing how the field thinks rather than just incrementally improving numbers — is probably from Project 4. If model self-reported salience turns out to be a better gate than gradient surprise, that's surprising and tells us something real about what current models can introspect on. If it turns out to be a worse gate but in systematic ways, that's also interesting. If the relationship is more subtle than either — say, self-report tracks one kind of importance while gradient surprise tracks another — that's potentially the most informative result of all.

The least interesting (but most likely) outcome: the hybrid in Project 1 works incrementally better than the components individually, the unlearning in Project 3 mostly works with some collateral damage, PI in Project 2 is partially solved by some approaches but not cleanly by any, and the salience comparison in Project 4 shows the signals are correlated but distinguishable. This would still clarify the design space and provide a reference point for future work.


On framing the eventual writeup

When this work is published, we'll be deliberate about what we claim. The architectural improvements may produce systems that are more experientially structured in the structural sense defined at the top; they will not, on the basis of this work alone, justify claims of phenomenal experience or moral status. The contribution is to clarify the architectural and behavioral landscape, not to settle philosophical debates that architecture cannot settle.

This restraint is partly epistemic honesty and partly strategic. The current public discourse around AI consciousness oscillates between two unhelpful poles: confident attribution of rich inner life based on conversational impressions, and confident denial based on caricatured "it's just autocomplete" mechanism stories. Neither pole engages with the actual architectural question of what kinds of integration, continuity, and self-modeling current and future systems exhibit. Solid technical work that's precise about what it shows does more for that debate than rhetorical claims in either direction.

The motivating scenario at the start of this proposal — agents that accumulate experience like collaborators do — is a long-term vision, not the immediate deliverable. The immediate deliverable is more modest: a clearer empirical picture of how to architect persistent memory for language model agents, with a working integrated system as the artifact, and mechanistic understanding of what changes when systems consolidate experience.

But the long-term vision matters because it shapes which research questions are worth asking. "How do we get better numbers on MemoryBench?" is a different question from "what architecture would let an AI agent meaningfully accumulate experience over time?" The proposed work tries to make progress on the latter while measuring against the former.
