Think in layers of memory, each with a different lifetime and size:
- Working buffer (short-term): the last K messages (e.g., 10–30) from the current chat. Fast, no processing.
- Running summary (compressed short-term): when the buffer gets long, summarize it into ~300–600 tokens and keep that instead of the raw earlier turns. This keeps tokens under control.
- Episodic session memory (last 2–3 chats): for each past chat in this browser session, store a 4–8 sentence summary + key facts (a few bullet points). Prepend these summaries to the system/context when a new chat starts. Limit to N=2–3.
- Document memory (PDF RAG): parse the PDF into chunks, embed them, and at question time pull the top 3–5 chunks relevant to the user query. Keep these vectors in-memory for the session only.
All of the above are ephemeral: kept in RAM (server process or the browser tab) with a TTL. No DB persistence unless you decide to add it.
- Frontend (React/Next.js)
  - Holds a sessionId (random UUID) for the tab.
  - Sends user messages to a /chat endpoint with sessionId.
  - For PDF: upload once per session to /pdf/index?sessionId=....
- Backend (Node/Express or Next.js Route Handlers)
  - MemoryStore (in-memory Map) keyed by sessionId.
  - For each session: { buffer, summary, episodicSummaries[], pdfIndex, lastActiveAt }.
  - TTL sweeper clears sessions after, say, 60 minutes of inactivity.
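As a rough illustration of the frontend/backend split above, here is a minimal TypeScript sketch. The session fields and the SESSIONS map come from the description; everything else (the `message` field in the request body, the value types, and the `buildChatRequest` and `getSession` helpers) is an assumption made for the example. In the browser, the id would typically be generated once per tab, for instance with `crypto.randomUUID()`.

```typescript
// --- Client side ---
// Builds the arguments for fetch(); the "message" field name is an assumption,
// only the /chat endpoint and sessionId are given in the text.
function buildChatRequest(sessionId: string, message: string) {
  return {
    url: "/chat",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ sessionId, message }),
    },
  };
}

// --- Server side ---
type Message = { role: "user" | "assistant"; content: string };

// Field names follow the session shape above; the value types are assumptions.
type Session = {
  buffer: Message[];
  summary: string;
  episodicSummaries: string[];
  pdfIndex: unknown; // whatever the PDF indexer produces, or null
  lastActiveAt: number; // epoch milliseconds
};

const SESSIONS = new Map<string, Session>();

// Returns the session for this id, creating an empty one on first use,
// and refreshes lastActiveAt so a TTL sweeper sees the session as active.
function getSession(sessionId: string, now: number = Date.now()): Session {
  let session = SESSIONS.get(sessionId);
  if (!session) {
    session = {
      buffer: [],
      summary: "",
      episodicSummaries: [],
      pdfIndex: null,
      lastActiveAt: now,
    };
    SESSIONS.set(sessionId, session);
  }
  session.lastActiveAt = now;
  return session;
}
```

On the client, the result would be passed along as `fetch(req.url, req.init)`; on the server, each route handler would start with `getSession(sessionId)`.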
- Buffer: keep the last 12–20 messages.
- Summarize trigger: if the buffer exceeds 20 messages or the token estimate exceeds X, summarize the older turns, append the result to summary, and drop the old turns.
- Episodic: when a chat “ends” (user hits “new chat”), produce a compact summary + key facts and push it into episodicSummaries (max 3; drop oldest).
- PDF index: build once per session; store vectors/inverted index in RAM; discard on TTL.
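The buffer and episodic rules above could be sketched roughly as follows. This is only an illustration: the `Summarizer` callback is a synchronous stand-in for what would really be an asynchronous LLM call, the constants pick one value from the suggested ranges, and only the message-count trigger is shown (not the token estimate). The names `addMessage` and `endChat` are hypothetical.

```typescript
type Message = { role: "user" | "assistant"; content: string };
type ChatState = { buffer: Message[]; summary: string; episodicSummaries: string[] };

// Stand-in for an LLM call: folds the given turns into the previous summary.
type Summarizer = (previousSummary: string, turns: Message[]) => string;

const MAX_BUFFER = 20; // summarize once the buffer grows past this
const KEEP_RECENT = 12; // raw turns kept after summarizing
const MAX_EPISODES = 3; // episodic summaries kept per session

// Appends a message; when the buffer is too long, the oldest turns are
// removed from the buffer and folded into the running summary.
function addMessage(state: ChatState, msg: Message, summarize: Summarizer): void {
  state.buffer.push(msg);
  if (state.buffer.length > MAX_BUFFER) {
    const older = state.buffer.splice(0, state.buffer.length - KEEP_RECENT);
    state.summary = summarize(state.summary, older);
  }
}

// Called when the user starts a new chat: stores a compact episode summary,
// drops the oldest episode if there are more than MAX_EPISODES, and resets
// the short-term memory.
function endChat(state: ChatState, summarize: Summarizer): void {
  if (state.buffer.length === 0 && state.summary === "") return; // nothing to remember
  state.episodicSummaries.push(summarize(state.summary, state.buffer));
  if (state.episodicSummaries.length > MAX_EPISODES) {
    state.episodicSummaries.shift();
  }
  state.buffer = [];
  state.summary = "";
}
```

Keeping the summarizer as a parameter makes the policy testable without a model and leaves the choice of prompt and provider open.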
When you call the model:
[System]: role, guardrails, how to use tools
[Context]:
- Episodic summaries (last 2–3 chats)
- Running summary (if any)
- Top-k retrieved PDF chunks (if any)
[Conversation buffer]: last ~12–20 messages
[User]: current message
Keep the context compact; don’t dump the entire chat history.
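One way to turn that layout into code is sketched below. The `PromptParts` shape, the section labels inside the context message, and the choice to send the context as a second system message are all assumptions for illustration; the ordering (episodic, summary, PDF chunks, buffer, user message) follows the layout above.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

type PromptParts = {
  systemPrompt: string; // role, guardrails, how to use tools
  episodicSummaries: string[]; // last 2-3 chats
  summary: string; // running summary, "" if none
  pdfChunks: string[]; // top-k retrieved chunks, [] if none
  buffer: Message[]; // recent raw turns
  userMessage: string;
};

// Builds the message list in the order: system, context, buffer, user.
// Empty context sections are skipped so the prompt stays compact.
function assemblePrompt(p: PromptParts): Message[] {
  const context: string[] = [];
  if (p.episodicSummaries.length > 0) {
    context.push("Previous chats:\n" + p.episodicSummaries.map((s) => "- " + s).join("\n"));
  }
  if (p.summary !== "") {
    context.push("Summary of earlier turns:\n" + p.summary);
  }
  if (p.pdfChunks.length > 0) {
    context.push("Relevant PDF excerpts:\n" + p.pdfChunks.join("\n---\n"));
  }

  const messages: Message[] = [{ role: "system", content: p.systemPrompt }];
  if (context.length > 0) {
    messages.push({ role: "system", content: context.join("\n\n") });
  }
  for (const m of p.buffer) messages.push(m);
  messages.push({ role: "user", content: p.userMessage });
  return messages;
}
```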
- Parse with pdfjs-dist (frontend) or pdf-parse (backend).
- Split into ~500–1000 char chunks with overlaps.
- For quick & dirty retrieval (no embeddings): use a small TF-IDF/BM25 lib (e.g., minisearch) stored in RAM.
- If you have an embeddings API, store vectors in RAM and cosine-search top-k.
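The chunking and cosine-search steps above might look something like this. It is a dependency-free sketch: the vectors are assumed to come from whatever embeddings API is available (not shown), the default chunk size and overlap are arbitrary picks from the suggested range, and the function names are hypothetical.

```typescript
// Splits text into fixed-size character chunks where each chunk repeats the
// last `overlap` characters of the previous one.
function chunkText(text: string, size = 800, overlap = 100): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

type IndexedChunk = { text: string; vector: number[] };

// Brute-force search: scores every chunk and returns the k most similar.
// Fine for a single PDF held in RAM; larger corpora would need a real index.
function topK(query: number[], index: IndexedChunk[], k = 4): IndexedChunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```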
- You already have a TTL sweeper.
- Everything is in RAM; when the TTL hits (e.g., 60 min idle), delete the sessionId entry.
- If you also want “manual end of session,” expose an endpoint that calls SESSIONS.delete(sessionId).
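A minimal sketch of the sweeper and the manual-end hook, assuming each session records a lastActiveAt timestamp in milliseconds. SESSIONS is the map named above; `sweepSessions`, `endSession`, and the reduced `Session` type are hypothetical names for the example.

```typescript
// Only the field the sweeper needs; a real session would hold more.
type Session = { lastActiveAt: number };

const SESSIONS = new Map<string, Session>();
const TTL_MS = 60 * 60 * 1000; // 60 minutes of inactivity

// Deletes every session idle for longer than TTL_MS and returns how many
// were removed. Deleting entries while iterating a Map with forEach is safe.
function sweepSessions(now: number = Date.now()): number {
  let removed = 0;
  SESSIONS.forEach((session, id) => {
    if (now - session.lastActiveAt > TTL_MS) {
      SESSIONS.delete(id);
      removed++;
    }
  });
  return removed;
}

// Manual end of session, e.g. called from a route handler.
// Returns false if the session was already gone.
function endSession(sessionId: string): boolean {
  return SESSIONS.delete(sessionId);
}

// In a long-running server process the sweeper would be scheduled, e.g.:
//   setInterval(() => sweepSessions(), 60 * 1000);
```

Note that an in-process timer like this only makes sense when the server process is long-lived; in serverless deployments the Map itself may not survive between requests.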
- Token budget: do a quick token estimate of the assembled prompt; if it is too large, first trim PDF chunks (k→3), then compress summary again (ask the LLM to reduce it by half), and keep only the last 8–12 buffer turns.
- Hallucination control: add a system rule: “If PDF retrieval returns low confidence, ask a follow-up instead of guessing.”
- PII safety: never persist to disk if you promise ephemeral memory; don’t log full prompts server-side.
- Client-only option: if you truly want no backend, you can store everything in a React context + Map + Web Worker; but embeddings + API keys are safer on a backend.
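The token-budget rule above can be sketched as a sequence of progressively more aggressive trims. The 4-characters-per-token estimate is a crude assumption (a real tokenizer would be more accurate), the `compress` callback is a synchronous stand-in for asking the LLM to halve the summary, and the `Draft` shape and function names are hypothetical.

```typescript
type Message = { role: string; content: string };
type Draft = { summary: string; pdfChunks: string[]; buffer: Message[] };

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function draftTokens(d: Draft): number {
  let total = estimateTokens(d.summary);
  for (const chunk of d.pdfChunks) total += estimateTokens(chunk);
  for (const m of d.buffer) total += estimateTokens(m.content);
  return total;
}

// Applies the trims in order and stops as soon as the draft fits:
// 1) keep only the top 3 PDF chunks, 2) compress the summary,
// 3) keep only the most recent 12 buffer turns.
// Returns a new draft; the input is not modified.
function fitToBudget(
  d: Draft,
  maxTokens: number,
  compress: (summary: string) => string
): Draft {
  let out: Draft = { summary: d.summary, pdfChunks: d.pdfChunks, buffer: d.buffer };
  if (draftTokens(out) > maxTokens) {
    out = { summary: out.summary, pdfChunks: out.pdfChunks.slice(0, 3), buffer: out.buffer };
  }
  if (draftTokens(out) > maxTokens) {
    out = { summary: compress(out.summary), pdfChunks: out.pdfChunks, buffer: out.buffer };
  }
  if (draftTokens(out) > maxTokens) {
    out = { summary: out.summary, pdfChunks: out.pdfChunks, buffer: out.buffer.slice(-12) };
  }
  return out;
}
```

The result can still exceed the budget if a single chunk or message is very large; a production version would need a final hard cutoff.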
- Generate sessionId in the tab; send it with all requests.
- Keep {buffer, summary, episodic[], pdfIndex} in a server-side Map keyed by sessionId.
- Summarize when the buffer grows; keep the last ~10–20 raw turns.
- On “new chat” within the session: save a compact episode summary (keep the last 2–3).
- Build an in-memory PDF index per session; retrieve top-k chunks on each query.
- Assemble the prompt: episodic → summary → PDF chunks → recent buffer → user message.
- Sweep and delete sessions after the TTL (or on manual end).