
@abhishekmishragithub
Last active September 10, 2025 12:22
mental model (what to remember)

Think in layers of memory, each with different lifetime & size:

  1. Working buffer (short-term)
    The last K messages (e.g., 10–30) from the current chat. Fast, no processing.

  2. Running summary (compressed short-term)
    When the buffer gets long, summarize it into ~300–600 tokens and keep that instead of the raw earlier turns. This keeps tokens under control.

  3. Episodic session memory (last 2–3 chats)
    For each past chat in this browser session, store a 4–8 sentence summary + key facts (a few bullet points). Prepend these summaries to the system/context when a new chat starts. Limit to N=2–3.

  4. Document memory (PDF RAG)
    Split the PDF into chunks, embed them, and at question time pull the top 3–5 chunks most relevant to the user query. Keep these vectors in memory for the session only.

All of the above are ephemeral: kept in RAM (server process or the browser tab) with a TTL. No DB persistence unless you decide to.


architecture (minimal)

  • Frontend (React/Next.js)

    • Holds a sessionId (random UUID) for the tab.

    • Sends user messages to a /chat endpoint with sessionId.

    • For PDF: upload once per session to /pdf/index?sessionId=....

  • Backend (Node/Express or Next.js Route Handlers)

    • MemoryStore (in-memory Map) keyed by sessionId.

    • For each session: { buffer, summary, episodicSummaries[], pdfIndex, lastActiveAt }.

    • TTL sweeper clears sessions after, say, 60 minutes of inactivity.
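
A minimal sketch of the store and sweeper described above, assuming a single server process. The field names follow the session shape listed above and `SESSIONS` is the map named later in this note; `touchSession`, `sweepExpired`, and `SESSION_TTL_MS` are hypothetical names introduced for illustration.

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Per-session state, mirroring the fields listed above.
interface SessionState {
  buffer: Message[];
  summary: string;
  episodicSummaries: string[];
  pdfIndex: unknown | null; // whatever index structure you build for the PDF
  lastActiveAt: number; // epoch milliseconds
}

const SESSION_TTL_MS = 60 * 60 * 1000; // 60 minutes of inactivity

const SESSIONS = new Map<string, SessionState>();

// Fetch the session (creating it on first use) and refresh its activity timestamp.
function touchSession(sessionId: string, now: number = Date.now()): SessionState {
  let session = SESSIONS.get(sessionId);
  if (!session) {
    session = { buffer: [], summary: "", episodicSummaries: [], pdfIndex: null, lastActiveAt: now };
    SESSIONS.set(sessionId, session);
  }
  session.lastActiveAt = now;
  return session;
}

// Delete every session idle for longer than the TTL; returns how many were removed.
function sweepExpired(now: number = Date.now()): number {
  let removed = 0;
  SESSIONS.forEach((session, id) => {
    if (now - session.lastActiveAt > SESSION_TTL_MS) {
      SESSIONS.delete(id);
      removed++;
    }
  });
  return removed;
}
```

One way to run it is `setInterval(() => sweepExpired(), 60000)` at server start. Keeping the sweep as a plain function with an injectable `now` makes it easy to test without timers.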


memory policy (simple rules)

  • Buffer: keep last 12–20 messages.

  • Summarize trigger: if the buffer exceeds 20 messages or the token estimate exceeds X (a threshold you choose), summarize the older turns → append to summary, drop the old turns.

  • Episodic: when a chat “ends” (user hits “new chat”), produce a compact summary + key facts and push into episodicSummaries (max 3; drop oldest).

  • PDF index: build once per session; store vectors/inverted index in RAM; discard on TTL.
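
The buffer and episodic rules above could be sketched roughly as follows. The constants use values from the ranges given above; `splitBuffer`, `pushEpisode`, `compactBuffer`, and the `Summarize` callback are hypothetical names, and the callback stands in for whatever LLM call you use to summarize.

```typescript
type Message = { role: "user" | "assistant"; content: string };

const MAX_BUFFER = 20; // summarize trigger
const KEEP_RECENT = 12; // raw turns kept after summarizing
const MAX_EPISODES = 3; // episodic summaries kept per session

// Decide which turns to fold into the summary and which to keep raw.
function splitBuffer(buffer: Message[]): { toSummarize: Message[]; keep: Message[] } {
  if (buffer.length <= MAX_BUFFER) return { toSummarize: [], keep: buffer };
  const cut = buffer.length - KEEP_RECENT;
  return { toSummarize: buffer.slice(0, cut), keep: buffer.slice(cut) };
}

// Append an episode summary and drop the oldest entries beyond MAX_EPISODES.
function pushEpisode(episodes: string[], episode: string): string[] {
  return [...episodes, episode].slice(-MAX_EPISODES);
}

// Hypothetical summarizer: in practice this would call your LLM.
type Summarize = (transcript: string) => Promise<string>;

// Fold older turns into the running summary when the buffer is over the limit.
async function compactBuffer(
  buffer: Message[],
  summary: string,
  summarize: Summarize
): Promise<{ buffer: Message[]; summary: string }> {
  const { toSummarize, keep } = splitBuffer(buffer);
  if (toSummarize.length === 0) return { buffer, summary };
  const transcript = toSummarize.map((m) => `${m.role}: ${m.content}`).join("\n");
  const addition = await summarize(transcript);
  return { buffer: keep, summary: summary ? `${summary}\n${addition}` : addition };
}
```

Only the message-count trigger is shown here; a token-based trigger would add a second condition to `splitBuffer`.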


prompt assembly (per LLM call)

When you call the model:

[System]: role, guardrails, how to use tools
[Context]: 
  - Episodic summaries (last 2–3 chats)
  - Running summary (if any)
  - Top-k retrieved PDF chunks (if any)
[Conversation buffer]: last ~12–20 messages
[User]: current message

Keep the context compact; don’t dump the entire chat history.
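
As a rough illustration, the assembly order above might look like this. The `{ role, content }` message shape is an assumption modeled on common chat APIs, and `assemblePrompt` and `PromptParts` are hypothetical names. Some APIs accept only one system message, in which case the two system entries would need to be merged.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface PromptParts {
  system: string; // role, guardrails, how to use tools
  episodicSummaries: string[]; // last 2–3 chats
  summary: string; // running summary, may be empty
  pdfChunks: string[]; // top-k retrieved chunks, may be empty
  buffer: ChatMessage[]; // recent raw turns
  userMessage: string; // current message
}

function assemblePrompt(p: PromptParts): ChatMessage[] {
  // Build the context block in the order: episodic → summary → PDF chunks.
  const context: string[] = [];
  if (p.episodicSummaries.length > 0) {
    context.push("Previous chats:\n" + p.episodicSummaries.map((s) => `- ${s}`).join("\n"));
  }
  if (p.summary) {
    context.push("Summary of earlier turns:\n" + p.summary);
  }
  if (p.pdfChunks.length > 0) {
    context.push("Relevant PDF excerpts:\n" + p.pdfChunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n"));
  }

  const messages: ChatMessage[] = [{ role: "system", content: p.system }];
  if (context.length > 0) {
    messages.push({ role: "system", content: context.join("\n\n") });
  }
  // Recent buffer, then the current user message last.
  messages.push(...p.buffer, { role: "user", content: p.userMessage });
  return messages;
}
```

Empty sections are skipped so that a fresh session sends only the system prompt and the user message.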

PDF: super-simple in-memory RAG

  • Parse with pdfjs-dist (frontend) or pdf-parse (backend).

  • Split into ~500–1000 char chunks with overlaps.

  • For quick & dirty retrieval (no embeddings): use a small TF-IDF/BM25 lib (e.g., minisearch) stored in RAM.

  • If you have an embeddings API, store vectors in RAM and cosine-search top-k.
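
A sketch of the chunking and the cosine top-k search for the embeddings variant. It assumes the text has already been extracted from the PDF and that the vectors come from whichever embeddings API you use (not shown); `chunkText`, `cosine`, `topK`, and `IndexedChunk` are hypothetical names.

```typescript
// Split text into fixed-size character chunks; consecutive chunks share `overlap` characters.
function chunkText(text: string, size: number = 800, overlap: number = 100): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end of the text
  }
  return chunks;
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface IndexedChunk {
  text: string;
  vector: number[]; // embedding of `text`, produced elsewhere
}

// Return the k chunks most similar to the query vector, best first.
function topK(query: number[], index: IndexedChunk[], k: number = 4): IndexedChunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.chunk);
}
```

A linear scan like this is usually fine for a single PDF held in RAM; the per-session index is then just the `IndexedChunk[]` array.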

expiring memory

  • The TTL sweeper from the architecture section handles expiry.

  • Everything is in-RAM; when TTL hits (e.g., 60 min idle) → delete sessionId entry.

  • If you also want “manual end of session,” expose an endpoint that calls SESSIONS.delete(sessionId).


practical tips

  • Token budget: estimate the token count of the assembled prompt. If it is too large, first trim the PDF chunks (k→3), then compress the summary again (ask the LLM to reduce it by half), and finally keep only the last 8–12 buffer turns.

  • Hallucination control: add a system rule: “If PDF retrieval returns low confidence, ask a follow-up instead of guessing.”

  • PII safety: never persist to disk if you promise ephemeral memory; don’t log full prompts server-side.

  • Client-only option: if you truly want no backend, you can store everything in a React context + Map + Web Worker; but embeddings + API keys are safer on a backend.
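
The token-budget tip above could be sketched like this. The 4-characters-per-token estimate is a rough heuristic assumed here, not an exact tokenizer; `fitToBudget`, `estimateTokens`, and the `compressSummary` callback (standing in for an LLM call) are hypothetical names.

```typescript
interface BudgetParts {
  pdfChunks: string[]; // assumed sorted by relevance, best first
  summary: string;
  buffer: string[]; // rendered recent turns, oldest first
}

// Rough heuristic (assumption): about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function totalTokens(p: BudgetParts): number {
  return [...p.pdfChunks, p.summary, ...p.buffer].reduce((n, t) => n + estimateTokens(t), 0);
}

// Apply the trims in order and stop as soon as the estimate fits.
async function fitToBudget(
  parts: BudgetParts,
  maxTokens: number,
  compressSummary: (summary: string) => Promise<string> // hypothetical LLM call
): Promise<BudgetParts> {
  let p: BudgetParts = { ...parts };
  if (totalTokens(p) <= maxTokens) return p;

  // Step 1: keep only the 3 most relevant PDF chunks.
  p = { ...p, pdfChunks: p.pdfChunks.slice(0, 3) };
  if (totalTokens(p) <= maxTokens) return p;

  // Step 2: ask the model to shorten the running summary.
  if (p.summary) p = { ...p, summary: await compressSummary(p.summary) };
  if (totalTokens(p) <= maxTokens) return p;

  // Step 3: keep only the last 10 buffer turns (within the 8–12 range above).
  return { ...p, buffer: p.buffer.slice(-10) };
}
```

The result can still exceed the budget after all three steps, so a caller may want to check `totalTokens` once more before sending.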

TL;DR implementation checklist

  • Generate sessionId in the tab; send with all requests

  • Keep {buffer, summary, episodic[], pdfIndex} in a server-side Map by sessionId

  • Summarize when the buffer grows; keep the last ~12–20 raw turns

  • On “new chat” within the session: save a compact episode summary (keep last 2–3)

  • Build an in-memory PDF index per session; retrieve top-k chunks on each query

  • Assemble prompt: episodic → summary → pdf chunks → recent buffer → user message

  • Sweep & delete sessions after TTL (or on manual end)
