| title | Portable, Server-Free, Concurrent SQLite for Local Applications |
|---|---|
| description | A domain-agnostic reference on building a locally-friendly, portable, thread-safe, NFS-safe, data-loss-safe usage pattern (package/wrapper/adapter) around SQLite — covering the locking/journaling/filesystem background, the session/changeset extension, source-grounded node:sqlite findings, the prior-art landscape, and a validated snapshot-publish and transaction-envelope design with a working prototype. |
| author | Claude Code agent (finterm), with prior survey work by Joshua Levy and LLM agents |
Date: 2026-06-06 (last updated 2026-06-06)
Author: Claude Code agent, building on the finterm embeddable-databases survey and a validated standalone spike
Status: Complete (factual survey and a tentative design; prototype validated)
A domain-agnostic reference on one question:
How do you make a SQLite database safe and convenient to use concurrently, from local applications, with no database server, on any filesystem (including NFS and consumer sync folders), while keeping the file a portable, durable, tool- compatible artifact and never losing more than the current transaction?
SQLite is the obvious archival/interchange format — a single public-domain file readable
by nearly every language and tool.
But its built-in concurrency story (one writer at a time; WAL needing shared memory;
fcntl/flock locks unreliable on network mounts) is a poor fit for “many local agents
and processes write one file, sometimes on a network drive, with no daemon.”
The gap between “great file format” and “safe concurrent local access with no
server” is exactly what this research maps, and it ends with a concrete,
prototype-validated design for closing it as a generic wrapper around SQLite.
The motivating use cases are deliberately kept generic — durable local agent state, local data tools, offline-first caches, content-addressed metadata catalogs, build/CI artifacts, “zip-and-go” datasets. (finterm’s own dataroom catalog is one such consumer, but nothing here is specific to it; the design takes a caller-supplied schema and merge policy and knows nothing about the domain.)
This builds directly on a prior finterm survey, Modern Embeddable / File-Based Databases (§G–§J there), and on a from-scratch standalone spike with 55 passing tests and source-grounded checkouts of SQLite and Node. It restates the relevant parts so it stands alone.
A framing question worth settling up front, because it shapes what the deliverable actually is (and is more fundamental than any product name): if we build something here, what category is it? In short, primarily a storage format and commit protocol — a set of conventions over standard SQLite files — not a wrapper or a new engine.
The accurate answer: primarily a storage format and commit protocol — a set of
conventions over standard SQLite files — with per-language reference libraries.
Concretely, the durable core is (a) an on-disk layout (immutable versioned
snapshots, a current pointer, changeset transaction envelopes, an applied_tx ledger,
and a mkdir lease) and (b) the protocol for writing, reconciling, and reading it.
The library code is one implementation of that protocol, not the thing itself.
The closest precedents are Apache Iceberg and Delta Lake: table formats and commit
protocols layered over Parquet files, with multiple engine implementations — nobody
calls them “Parquet wrappers.”
This is the same shape, over SQLite.
Why the alternative framings mislead:
- “A wrapper around SQLite” undersells and misdescribes it. A wrapper implies intercepting or replacing the SQLite API (a custom VFS, or a drop-in connection object). This design never touches the SQL surface — you still query with plain SQLite — it orchestrates files around SQLite. It is an overlay/coordination layer, not an API wrapper.
- “A complete repackaging as a file-based database that uses SQLite” oversells it. There is no new query language and no new file format; the published snapshots are ordinary SQLite. To a consumer it behaves like a small embeddable multi-writer database, but it is not a new engine.
- “A set of formats and conventions” is the core truth — that is the durable, standardize-and-version deliverable; the libraries follow from it.
Is it Node-specific?
No. The format and the entire reader path are language-agnostic: a reader just
opens the standard SQLite file named by current, needing zero special support in any
language or tool. Even the writer’s changeset envelope is a SQLite-defined binary blob
that any session-capable binding can produce or consume.
Only the writer/reconcile role needs such a binding — Node node:sqlite is the
natural first reference implementation (it ships sessions with zero native deps;
Finding G2), with Python (APSW) and Rust (rusqlite) able to implement the same
format later. So: a language-agnostic format and reader story, with writer libraries
wherever a session-capable binding exists, Node first.
The practical consequence for the rest of this document: the format/protocol is the
primary artifact (its layout, manifest schema, application_id/user_version
stamping, and reconcile/conflict rules are what must be precise, stable, and versioned),
and any package — in any language — is a reference implementation of it.
Almost every approach in this document leans on one specific SQLite feature — the session/changeset extension — so it is worth foregrounding what it is, why you have probably never used it, and why it is the key that makes the rest possible. (The mechanism in detail is Finding F; this is the orientation.)
What it is. The session extension lets a connection record the exact row-level
changes (INSERT/UPDATE/DELETE) made during a unit of work and serialize them into
a changeset — an ordinary binary blob you can write to a file.
A changeset can later be applied to another database, with a per-change conflict
handler deciding OMIT/REPLACE/ABORT. In effect it is logical, row-level
replication built into SQLite, where the unit of replication is just a file.
History. The extension has been in SQLite since version 3.13.0 (2016) — it is not new or experimental at the engine level; it is a decade-old, stable part of the codebase maintained by the core SQLite team, with a documented C API (sessionintro) and an accompanying rebaser (for merging divergent changesets) and changegroup (for concatenating them). It was built for the natural “session” use case: capture what changed here, carry it elsewhere, merge it — i.e. database sync.
Why you have probably never used it. It is disabled by default. The feature only
compiles in when SQLite is built with
-DSQLITE_ENABLE_SESSION -DSQLITE_ENABLE_PREUPDATE_HOOK, and most distributions and
bindings ship without those flags — confirmed by source in this pass: macOS system
SQLite, the default better-sqlite3 build (deps/defines.gypi), Python’s stdlib
sqlite3, and Go’s mattn/go-sqlite3 all omit it.
So for years the feature was present in the amalgamation but invisible at the
application layer. Compounding this, the famous “distributed SQLite” projects chose
other mechanisms — LiteFS and Litestream replicate the WAL byte stream; rqlite
and dqlite use Raft over SQL statements — so the changeset (logical, row-level) lane
stayed in the mobile / offline-first sync corner (e.g. SQLiteChangesetSync,
sqlite-sync, Tencent WCDB, Bloomberg comdb2), out of the server-infrastructure
conversation. The net effect: a powerful, mature primitive that very few people reach
for.
Why it is newly relevant — and promising. Its visibility is rising precisely because
a first-class runtime finally exposes it by default: Node’s built-in node:sqlite
ships with SESSION compiled in (confirmed: node/deps/sqlite/sqlite.gyp:24,27), and
Python’s APSW and Rust’s rusqlite (session feature) expose it too.
That matters here because a changeset is exactly the file-native, server-free,
optimistic commit-log this problem needs: writers each emit a uniquely-named changeset
file (conflict-free even on NFS), a merge step folds them with a domain conflict
policy, and — the key property — readers never need the extension at all, since they
just open the resulting standard SQLite file.
In other words, the thing that makes “concurrent, server-free, NFS-safe,
per-session-optimistic SQLite” tractable has been sitting in the engine for ~10 years;
what changed is that it is finally reachable from a zero-dependency local runtime.
The rest of this document is, in large part, an argument that this underused primitive
is the right foundation — and a check of where its binding-level rough edges (Findings
G/G2) constrain a v1.
The target properties (the “contours”), restated as testable requirements:
- R-FORMAT — durable, portable, tool-compatible archival format. The at-rest artifact is a standard SQLite file (or a directory of plain files and standard SQLite files) that any SQLite tool/binding/language can open, today and in years. No engine-specific or proprietary container as the interchange surface.
- R-LOCAL — direct local access via in-process APIs. First-class Node, with a
path to Python and Rust (and the
sqlite3CLI) for at least the reader role. No network client required to read or write. - R-XPLAT — cross-platform. macOS, Linux, Windows; common CPU architectures.
- R-NOSERVER — no traditional server/daemon. Any process just opens files. A minimal, ephemeral, auto-managed coordinator is acceptable only as an optional optimization, never required — cross-platform daemon lifecycle management is itself a liability.
- R-FS — filesystem-agnostic, including NFS/SMB/FUSE/sync. Correct where POSIX
flock/fcntlbyte-range locks and WAL shared memory do not work. - R-DURABLE — bounded loss and hot-copy-safe. A crash, an interrupted write, or a file copy taken while the DB is live loses at most the current/in-flight transaction — never a previously-committed state, never the whole dataset. A repair path exists for anomalies (stale locks, recovering a hot copy).
- R-CONCURRENCY — wrapper-managed, session/optimistic, not one-big-lock. A thin wrapper API may manage concurrency, but the goal is optimistic, session-/transaction-grained concurrency (worst-case loss = one session), not a single coarse whole-DB mutex held across work.
Requirement review (refinements and one addition). These are sound and mutually achievable, with three notes:
- R-CONCURRENCY “optimistic, per-session” and R-NOSERVER together force a commit-log/MVCC-style design, not a lock-wrapper. A pure wrapper-lock (even a good one) is still one-writer-at-a-time; true per-session optimism on a single file with no server is achieved by logical row-level changesets merged optimistically, not by finer file locks (see Findings B). So “wrapper-based locking” is best read as “wrapper-managed concurrency,” whose best form is a changeset commit-log.
- R-FS and R-CONCURRENCY mean the live multi-writer working directory is a directory of files, even though the archival artifact is a single SQLite file. These are not in tension: the live form has a tail of pending change-files that a finalize/seal step folds into one clean file. Worth stating explicitly as part of R-FORMAT.
- Add R-DETECT — no silent corruption. Implied by R-DURABLE but worth elevating: every inconsistency (torn copy, partial write, bit-rot, dangling reference) must be detectable and surfaced loudly, not silently accepted. Content hashing, integrity checks, and structural “in-flight tells” make this cheap.
- Is SQLite the right at-rest format, and what makes a SQLite file portable/archival?
- What exactly breaks when you use SQLite concurrently, and on NFS/SMB/sync folders?
- Why is there “no magic file lock,” and what does that imply for the design?
- What does the existing landscape (LiteFS, rqlite, dqlite, mvSQLite, Litestream, libSQL/Turso, CRDT layers, mobile sync) actually do, and what’s the gap?
- How does SQLite’s session/changeset extension work, and is it the right primitive?
- What are the concrete binding constraints (Node
node:sqlitevsbetter-sqlite3vsbun:sqlite; Python; Rust)? - What design satisfies all the requirements, and does a prototype bear it out?
In scope: in-process/local concurrent access to a single logical SQLite store with no required server; locking/journaling/filesystem behavior; the session/changeset extension; the snapshot/commit-log pattern; crash/durability/hot-copy safety; cross-language reader portability; a generic wrapper design and prototype.
Out of scope: analytical/columnar engines (DuckDB/Parquet) and full-text/vector search (covered in the prior survey); choosing a specific application schema; remote sync/replication and object-storage transport (a separable layer); and any single application’s domain model.
SQLite is the strongest database archival format in wide use: a single file, public
domain, a documented and decades-stable on-disk format, a US Library of Congress
recommended storage format, and readable by ubiquitous tooling (sqlite3 CLI,
Datasette, every major language’s stdlib or a maintained binding).
It also bundles FTS5 and JSON1, so metadata, key/value (a trivial
kv(key PRIMARY KEY, value) table), and lexical search can live in one inspectable
file. The on-disk format is cross-platform and endianness-independent.
Portability is mostly a discipline, not luck:
- Store room-relative, slash-separated UTF-8 paths in any path columns; never
absolute or platform-specific paths (a real bug observed in
qmd, whoseinitrecorded absolute paths). - Keep the authoritative tables on core SQLite features only. FTS5 is broadly
available; loadable extensions (e.g.
sqlite-vec’svec0virtual table), custom collations, and compile-time options are not guaranteed. A database that embeds avec0table is no longer openable by a stock reader (touching the table throwsno such module: vec0). Vectors/large indexes belong in separate, rebuildable files. - macOS system SQLite ships
SQLITE_OMIT_LOAD_EXTENSION, so never assume extension loading — probe it. - A packaged/sealed file must be finalized (no
-wal/-shmside files) via checkpoint, the backup API, orVACUUM INTO, and passPRAGMA integrity_check. - Stamp
PRAGMA application_idandPRAGMA user_versionso foreign readers can recognize and version-gate the file.
Conclusion: keep the content/bytes and the interchange artifact as standard SQLite (plus plain files for large blobs); never hand an outside consumer an engine-specific KV store (LMDB/RocksDB/redb) — those are fine only as a rebuildable local cache.
Most confusion comes from one word spanning three independent layers:
- File/OS ownership — which process may write the file:
flock/fcntl/LockFileEx/mkdir/ dotfile. Coarse, whole-file. - In-engine concurrency control — how concurrent transactions isolate: MVCC, two-phase locking, a lock manager with table/page/row granularity. This is the “per-row / per-section” part.
- Durability / commit — WAL / rollback journal.
The key realization: layer 2 — the fine-grained part — is an in-memory data structure
owned by one process.
It is not, and never has been, a filesystem feature. No database does per-row locking
by taking a filesystem lock per row.
Every engine that allows concurrent local writers uses a shared-memory lock table
(Postgres' lock table in shared memory; InnoDB’s in the buffer pool; Berkeley DB’s lock
subsystem in a shared-memory environment; LMDB’s reader table and write mutex in
lock.mdb; SQLite’s WAL-index in -shm). Shared memory is same-host only — it does
not cross machines.
This yields the cross-host theorem: fine-grained locking requires a per-lock acquisition; across hosts “acquire lock” is a network round-trip, and a service that answers network lock requests is a database server. Therefore you cannot have fine-grained, cross-host locking without a server. Across hosts you stop holding locks and turn the lock into a commit — an atomic-create/lease to elect a writer, or a compare-and-swap pointer-swap (optimistic concurrency) to commit a change — held for milliseconds, not across the work.
Better-than-flock local primitives exist (mmap atomic-CAS slots; Linux robust
futexes/mutexes with EOWNERDEAD recovery; Windows named mutexes with WAIT_ABANDONED;
OFD byte-range locks F_OFD_SETLK) but they all inherit the network wall, and robust
mutexes have a macOS gap (historically unimplemented), so a portable shared-memory
locker can’t rely on them.
The honest consequence for our requirements: don’t look for a cleverer file lock; put
the serialization point in the right place — a millisecond commit/election, made
correct by idempotency, not by lock perfection.
- Default (rollback journal) mode uses
fcntlbyte-range locks at fixed offsets ("lock bytes") plus a-journalfile for crash atomicity. One writer at a time; readers blocked during the brief write commit. - WAL mode adds a
-walfile and a-shmshared-memory wal-index, giving concurrent readers and one writer with much less blocking. Per SQLite’s own WAL docs, the-shmwal-index does not work on a network filesystem, so WAL is local-disk-only. busy_timeout+BEGIN IMMEDIATE+ retry lets multiple local processes write one WAL DB safely with brief contention. On local disk this already mostly “just works.”- URI flags:
nolock=1disables SQLite’s file locking entirely (safe only when you guarantee a single writer / own an external mutex);immutable=1opens a file promised not to change (fast, safe reads of a finalized snapshot). - Never treat copying an open
.dbas a backup. Use the online backup API orVACUUM INTO, thenintegrity_check/quick_check. A naivecp/rsync/tar/sync of a live WAL DB can capture a torn page state and is oblivious to-wal/-shm— the exact reason replication tools (Litestream) exist instead of file copy.
Two layers fail on network/sync storage: WAL’s -shm shared memory does not exist
there, and fcntl/flock byte-range locks are unreliable (notably flaky for
flock-on-NFS on macOS). Consumer cloud-sync folders (Dropbox, iCloud Drive, OneDrive,
Google Drive) are the worst case: the sync client copies the .sqlite mid-write and
ignores -wal/-shm, corrupting it.
This is not hypothetical. The beads issue tracker (bd) shipped a SQLite backend
that was effectively unusable on Claude Code Cloud agent workspaces because they run on
network drives; the project removed SQLite entirely, moved its working store to
Dolt (still “single-writer only, file locking enforced”) with an optional server for
sharing, and kept a plain git-committed .jsonl as the portable interchange surface.
Its successor (tbd) went fully file/git-native.
The lesson: a live, network-mounted SQLite file as a shared multi-writer store is the
specific thing to avoid. The safe patterns are (a) a plain file-native interchange
layer as the portable truth and (b) confining any live single-writer DB to local disk or
a coordinator — never a shared network-drive DB file.
The one primitive that is reliable across local disk and NFS is atomic directory
create (mkdir(2)) and atomic rename — which is why the git/npm world uses
mkdir lock dirs rather than flock. The design leans entirely on these.
Existing “distributed/concurrent SQLite” efforts cluster into a few shapes, none of which is a thin, packaged, portable wrapper for the local+NFS+no-server case:
| Project / approach | What it is | Where it sits |
|---|---|---|
WAL + busy_timeout + retry |
SQLite’s built-in local concurrency | local-disk only; no NFS; one writer |
libSQL BEGIN CONCURRENT |
MVCC/OCC multi-writer, still a SQLite file | same -shm limit; experimental; local |
| LiteFS (Fly.io) | FUSE FS replicating SQLite; lease-based single primary | ephemeral elected coordinator (a server, productized) |
| rqlite / dqlite | HTTP/Raft (rqlite) or embeddable C Raft (dqlite) over SQLite | long-lived consensus server |
| mvSQLite | SQLite storage swapped onto FoundationDB; distributed MVCC | replace the storage layer (a cluster) |
| Litestream | single local writer + streaming WAL replication to object store | durability/replication, not concurrency |
| cr-sqlite (Vlcn) | CRDT extension on SQLite | dormant; a merge layer, not a store |
Mobile changeset sync (SQLiteChangesetSync, sqlite-sync, WCDB) |
offline-first device sync via the session extension | the closest analogue — but app/device-specific |
DIY proper-lockfile + write queue |
a generic cross-process lock around a SQLite file | the thin end; folk knowledge, not a marquee package |
The serious multi-node efforts all centralize the writer (LiteFS lease,
rqlite/dqlite Raft) or replace the storage layer (mvSQLite).
The thin “external lock around a shared/NFS SQLite file” is DIY folk knowledge — there
is no marquee packaged library, partly because the SQLite project actively discourages
SQLite-on-NFS. The genuinely file-native row-level mechanism (the session/changeset
extension) lived in the mobile-offline-sync corner, out of the “distributed SQLite”
spotlight, because it was compile-time-disabled in most builds — until node:sqlite
exposed it first-class.
The gap: a small, dependency-light, cross-platform package that gives optimistic, session-grained, no-server, NFS-safe, hot-copy-safe concurrent access while keeping the artifact a standard SQLite file. That is the thing worth building.
The session extension (in SQLite since 3.13.0, behind
-DSQLITE_ENABLE_SESSION -DSQLITE_ENABLE_PREUPDATE_HOOK) records the row changes a
connection makes and packages them into a changeset (or smaller patchset) —
just a binary blob / file:
sqlite3session_create+sqlite3session_attach(NULL)→ record all tables.- run normal
INSERT/UPDATE/DELETE. sqlite3session_changeset→ extract the changes as a blob.sqlite3changeset_apply(target, blob, xFilter, xConflict, …)→ apply to another DB, with a filter (per-table include/exclude) and a conflict handler returningOMIT/REPLACE/ABORTper change.sqlite3changegroup/changeset_concatcombine;changeset_invertundoes; the rebaser (sqlite3rebaser_*, EXPERIMENTAL) rebases one writer’s changeset onto another’s already-resolved conflicts.
Conflict types the handler sees: DATA (target row changed out from under the
writer), NOTFOUND (no row with that PK), CONFLICT (PK/UNIQUE violation on insert),
CONSTRAINT, FOREIGN_KEY. This is where merge policy lives (e.g. cache = last-writer-
wins by timestamp; edges = set-union; events = append).
Changeset vs patchset: a changeset carries old+new values, so it can detect DATA conflicts; a patchset is smaller (PK and new values) but cannot. Measured (update 1 of 4 columns × 500 rows): changeset 15009 B vs patchset 10009 B (~33% smaller). For a metadata store where big bytes live as files (so changesets are already small), the conflict detection is worth more than the size — use changesets.
Documented limitations, and why they fit a clean design:
- Every table needs a non-null
PRIMARY KEY— rows with NULL PK are silently ignored. (So: declare explicit PKs everywhere.) - Virtual tables are not captured (FTS5/
sqlite-vec). (So: those are rebuildable projections, excluded from the merge and rebuilt after — exactly where they belong.) - DDL/schema changes are not captured. (So: migrations are a separate deterministic step; all writers must share a schema version.)
- Big values are embedded in the changeset. (So: keep large bytes as files; catalog rows stay small and so do changesets.)
Binding availability (the practical crux): because a changeset is just a file, readers never need the extension — they open the SQLite file normally with any binding. Only the writer (to record) and the merge step (to apply) need it:
- Node: built-in
node:sqlitehas it (verified).better-sqlite3’s bundled build does not enableSQLITE_ENABLE_SESSION;bun:sqlitelacks the session API and Bun does not exposenode:sqlite. - Python: stdlib
sqlite3lacks it; APSW has full session support. - Rust:
rusqliteexposes it behind itssessionfeature.
Verified empirically and confirmed against checkouts of sqlite/sqlite (002d33d,
ext/session/sqlite3session.h) and nodejs/node @ v22.22.2 (src/node_sqlite.cc):
node:sqlite’s conflict handler receives only the integer conflict type — not the table name or row values. The CxConflict(pCtx, eConflict, pIter)has the iteratorpIter(from whichsqlite3changeset_opyields the table name), but the binding passes JS onlyeConflict(node_sqlite.cc:conflictCallback(eConflict)). So per-table policy cannot be routed inside the conflict handler — you must use theapplyChangesetfilter(tableName)callback, which does receive the table name, running one filtered apply pass per policy action. (This is a binding limitation; an upstream feature request to forward the iterator/table name is the clean fix.)SQLITE_CHANGESET_ABORTworks andnode:sqlitesignals it as afalsereturn value, not a throw. Per the C docs, ABORT “rolls back any changes applied so far and returnsSQLITE_ABORT”; the binding maps that toapplyChangeset(...) === false(node_sqlite.cc: “this is not an error, return false”). Verified: a two-insert changeset where one row conflicts, under ABORT, returnsfalseand rolls back both; under OMIT, returnstrueand keeps the non-conflicting row. So a strict/quarantine policy is detected by the boolean, then the whole transaction is rolled back.REPLACEis only legal forDATA/CONFLICTconflicts — returning it forNOTFOUND/CONSTRAINT/FOREIGN_KEYisSQLITE_MISUSE(a throw). So last-writer-wins must be conflict-type-aware (REPLACE on DATA/CONFLICT, OMIT otherwise). (A latent bug caught by reading the header; covered by a NOTFOUND test.)- No
sqlite3_rebaserin the binding (the C API has it;node_sqlite.ccdoes not expose it). So a v1 design must avoid needing a rebaser. - The session API works on the repo’s Node v22 baseline (v22.22.2), though
node:sqliteis still behind--experimental-sqlite, so production use should gate on it leaving experimental (or a native fallback).
All five are pinned to C source in the prototype’s SOURCES.md (SQLite
ext/session/sqlite3session.c:4952-4978 for the conflict dispatch and MISUSE rules;
sqlite3session.h:192-202 for the PK/NULL-PK rules and :1592-1622 for the rebaser;
src/wal.c:129-133 for “WAL is not supported on a network filesystem”;
src/os_unix.c:1191 “POSIX advisory locks are broken by design”; and
node/src/node_sqlite.cc for the binding behavior, with
node/deps/sqlite/sqlite.gyp:24,27 confirming SESSION/PREUPDATE are compiled in).
A second source pass checked the two other obvious building blocks:
better-sqlite3 (v12.10.0, bundles SQLite 3.53.2). Its actual compile flags
(deps/defines.gypi) enable FTS5/JSON1/DESERIALIZE/THREADSAFE=2/USE_URI=0 but
neither SQLITE_ENABLE_SESSION nor SQLITE_ENABLE_PREUPDATE_HOOK — the session
code is #if-compiled-out — and it exposes no session/changeset/rebaser JS API
(only prepare/transaction/pragma/backup/serialize/function/aggregate/loadExtension).
It compiles the standard amalgamation, so its on-disk file is the identical
SQLite format 3, cross-readable with node:sqlite/CLI/Python/Rust.
Verdict: not usable as the changeset writer/reconcile runtime without a custom
native rebuild (re-introducing the prebuild burden node:sqlite avoids), but fully
usable as a reader and as the direct-mode writer (which needs no session), and a
good reference for the driver API shape.
A CatalogDriver abstraction should require node:sqlite only for the changeset path
and allow either binding elsewhere.
node-proper-lockfile (v4.1.2). It locks with mkdir ("atomic on any kind of
file system, even network based ones") and explicitly avoids O_EXCL/flock because
“O_EXCL is broken on NFS”; it uses an mtime lease and heartbeat with a
compromised threshold and mtime-precision probing for coarse-granularity
filesystems.
This is independent confirmation that mkdir, mtime-lease, and heartbeat is
the correct, battle-tested pattern — the exact shape of the spike’s lock.
Verdict: don’t depend on it directly (it is async/Promise-based while the publish
path is synchronous, pulls a graceful-fs dep, and has no owner-token fencing — only an
onCompromised notification), but adopt its mtime-precision handling and default
constants (stale ≥ 5000, update = stale/2) on top of the spike’s owner-token
revalidate-before-publish fencing.
Crucially, proper-lockfile alone does not make a live SQLite DB NFS-safe — it only
protects the lock — which is why the design still needs the immutable-snapshot +
idempotent-ledger model so correctness does not depend on the lock at all.
Combining the above, the pattern that meets every requirement with no server and no fragile file lock is optimistic concurrency over an append-only changeset commit-log, with immutable published snapshots selected by an atomic pointer. On disk:
<dir>/
current # tiny text pointer → the published snapshot version
snapshots/
000000000000.sqlite # immutable, versioned published heads (standard SQLite)
000000000001.sqlite
tx/
pending/<txid>.txn/ # one transaction envelope per writer commit
manifest.json # tx_id, writer_id, base_version, schema_version+hash,
# changeset_sha256, logical_clock
changeset # the SQLite changeset blob
COMMITTED # marker written LAST → the tx becomes visible atomically
applied/<txid>.txn/ # archived after folding into a published snapshot
quarantine/<txid>.txn/ # corrupt / schema-mismatch / strict-conflict (evidence kept)
leases/<token>.lease # read-leases pinning a version against GC
publish.lock/ # mkdir lease lock (owner token); held only during publish
Protocol:
- Write (lock-free). Seed a session from the current immutable snapshot (safe — it
is never mutated), run the caller’s work, extract a changeset, and write an envelope
with the
COMMITTEDmarker last. Unique paths ⇒ conflict-free concurrent writes on local disk and NFS; partial/uncommitted envelopes are ignored. - Reconcile (out-of-band). Build the next snapshot in a temp file from the current
snapshot and committed pending envelopes (skipping any already in the snapshot’s
applied_txledger; quarantining digest/schema-mismatched or strict-conflicting ones), runquick_check, then publish by atomically swappingcurrent(writecurrent.tmp,rename). Building is lock-free; only the short publish (re-check pointer and rename) is fenced by themkdirlease lock, with optimistic concurrency oncurrent. - Read. Open only the immutable snapshot named by
current— any binding, any language, no session extension, and (since it never changes) safe toimmutable=1.
Why this satisfies each requirement:
- R-DURABLE / hot-copy-safe / R-DETECT. Published heads are immutable and the
pointer swap is atomic, so a crash, an interrupted write, or a file copy taken
while live can never damage a committed state — the worst case rolls back to the last
quick_check-validatedcurrent, losing at most the uncommitted tail (≤ one session per writer). A naive hot copy captures, at worst, some extra pending envelopes (harmless — ignored unlessCOMMITTED) and a consistent older snapshot. Content digests,integrity_check, and structural “in-flight tells” make every anomaly detectable. - R-CONCURRENCY. Writers never block each other (unique envelopes); only genuinely conflicting transactions are resolved by policy at reconcile (last-writer-wins / set- union / strict-quarantine). This is per-session optimism, not a whole-DB lock.
- R-FS / R-NOSERVER. The only shared mutations are atomic file/dir creates and one
atomic rename — reliable on local disk and NFS. No WAL
-shm, nofcntl, no daemon. - R-FORMAT / R-LOCAL / R-XPLAT. Every published snapshot is a standard SQLite file;
readers are universal;
node:sqliteis zero-dependency so install is trivial cross- platform.
Crucially, correctness does not depend on the lock being perfect. The mkdir lease
is belt-and-suspenders: a per-snapshot applied_tx ledger (written in the same
transaction that applies each changeset) makes folding exactly-once, and two
overlapping reconcilers build separate temp files and commit via optimistic
concurrency on current (the loser retries to a no-op).
A stale-lock break or a process killed mid-publish therefore cannot corrupt or
double-apply. This is what makes a portable design feasible where a perfect
cross-platform lock is not.
For the common single-writer-on-local-disk case, the same immutable-snapshot format
supports a direct strategy: one writer holds the publish lock and does copy →
mutate → publish synchronously — no changesets, no envelopes, no reconcile, and no
session extension even for the writer, so it runs on any SQLite binding
(better-sqlite3, bun:sqlite). It is a coarse single-writer serializer (the lock is
held across the short write), but still immutable-snapshot-safe (R-DURABLE) and NFS-safe
(mkdir and atomic rename).
Because it writes the identical on-disk shape, a room can be written directly when alone
and read/operated by the concurrent engine later — two write strategies, one format.
This is the “legacy/simple, easier-locking-locally” option alongside the full concurrent
engine.
A from-scratch, dependency-free spike (node:sqlite, Node v22.22.2) implements the
design as a generic package (caller supplies schema and per-table policy) and is
exercised by 55 passing tests, including real multi-process (child_process)
concurrency:
- Concurrency: N lock-free writer processes lose zero committed tx after
reconcile; racing reconciler processes publish exactly-once with an
integrity_check-clean head; readers looping during concurrent publishes never see a torn or regressing snapshot. - Crash recovery at every publish step (
after-build,after-snapshot,after-pointer) recovers to exactly-once; a published snapshot is never lost. - True SIGKILL: a child killed while holding
publish.lockdoes not wedge the store (the stale lease is reclaimed; recovery is exactly-once); a reconciler killed mid-build orphans only a temp. - Randomized chaos: writers, reconcilers, and readers all at once converge with zero loss, no double-apply, integrity-clean head (stable across repeated runs).
- Fencing/lease: revalidate-before-publish aborts a holder whose lease was stolen; a stale lock is reclaimable; OCC on the pointer turns a superseded-base reconcile into a safe no-op.
- GC and read-lease: a snapshot pinned by an active lease survives GC; an abandoned
lease self-clears; a packager reads a byte-stable pinned snapshot while
currentadvances. - Portability: a published/sealed head begins with
SQLite format 3\0and carriesapplication_id/user_version; opens read-only with no session API. - doctor/repair: restores the last consistent point from a torn room (re-publish to the highest integrity-clean snapshot, roll back orphans, quarantine uncommitted envelopes, clear transients and the stale lock).
- Direct mode and interop: synchronous direct writes; a directly-written room is read by the concurrent engine and vice versa; N direct-writer processes serialize with zero loss.
Performance (indicative, small metadata rows on local tmpfs):
- Write (envelope, default snapshot-seed): ~4–5 ms/write at a few-hundred-row catalog, growing with catalog size because the writer copies the current snapshot to seed its session.
- Seeding crossover (2 KB rows): snapshot-seed 10.5 → 28.6 ms/write as the catalog grows 500 → 5500 rows, while an opt-in schema-only seed (build an empty schema-only working DB, valid for inserts of new keys) stays flat at ~16 ms/write. So: default to snapshot-seed when small/for any mutation; switch insert-heavy writers to schema-only once the catalog is large.
- Reconcile: ~6 ms/tx (one base copy per pass + per-tx apply + ledger insert + quick_check).
Exactly what each model offers, and where the snapshot-publish design lands.
WAL (the fair baseline). WAL is the strong-isolation incumbent: many readers run
concurrently and lock-free (each sees a consistent snapshot as of its transaction
start — snapshot isolation), and a reader never blocks the writer or vice versa.
But there is exactly one writer at a time, and the write lock is whole-database — a
second writer gets SQLITE_BUSY and must wait/retry.
Isolation is serializable for writers (each write transaction sees the latest
committed state and is fully isolated).
The price: it needs shared memory (-shm), so it is local-disk only, and a live
file copy is not safe (R-FS and R-DURABLE both fail).
So WAL = best isolation, worst portability, single writer.
The snapshot-publish design, stated precisely. A “write” has two phases with very different concurrency:
- Reads: unlimited and never blocked. Every reader opens an immutable published
snapshot (optionally
immutable=1), so any number of readers across any processes/hosts proceed in parallel and are never blocked by writers, by reconcile, or by each other. Isolation is snapshot isolation at the granularity of a published version — a reader sees a single consistent committed snapshot N. (Same guarantee class as WAL readers, and strictly safer because readers never touch a live/mutating file.) The one difference from WAL: a just- committed write is not visible untilreconcilepublishes it (a visibility lag, not an isolation violation). - Write capture: unlimited and lock-free. Any number of writers, in any processes, on any hosts, record their changes and drop envelopes simultaneously with no write lock. This is more write concurrency than WAL, not less — there is no global write lock on the application path.
- The only serialized step is the out-of-band merge (
reconcile), fenced by themkdirlease and OCC on the pointer. It is brief, runs on its own cadence, and writers and readers never block on it. It serializes the production of a new published snapshot, not the acceptance of independent writes. (And even two merges racing are safe — OCC; the loser retries to a no-op.)
Conflict granularity is per-row, not whole-DB and not even per-table. Because
merging is done with changesets, two writers that touch different rows never
conflict — even in the same table — and both land with zero loss.
Two writers that touch different tables never conflict.
A conflict arises only when two writers touch the same primary key, and it is
resolved by that table’s policy: lww (last-writer-wins by deterministic order),
union (keep first / drop duplicate), or strict (quarantine the losing tx).
Changesets carry old+new values, so the engine also detects a genuine DATA conflict
— two writers that each modified the same existing row from the same base — not just
insert collisions. So the write-concurrency granularity the requirements asked for
("different tables/sections instead of a global write lock") is already exceeded: it is
row-level.
Isolation: honestly weaker than serializable. The design is not serializable
across writers. It tracks each transaction’s write set (the rows it changed), not
its read set. So it detects write-write conflicts on the same row, but it cannot
detect a read-write anti-dependency — e.g. writer A reads row X and writes row Y
accordingly while writer B changes X; both commit and the X→Y invariant can be violated.
WAL’s single-writer serializability does not have this hole.
The honest framing: the design provides “row-level optimistic concurrency with
write-set conflict detection and read-your-own-writes within a session”, not full
serializable isolation.
Applications that need a cross-row invariant must put those rows under a strict policy
and use retry (below), or accept last-writer-wins semantics, or keep that invariant in a
single-writer (DirectStore) lane.
Optimistic concurrency and granular retry — yes, at the row/transaction level. The model is optimistic at row granularity (writers never lock; conflicts are detected at merge by PK and old-value). Granular retry is available in three escalating forms:
- Policy resolution (no retry needed): for
lww/uniontables the merge deterministically resolves same-row conflicts — the common case needs no retry at all. - Application-level OCC retry (recommended for
strict): on a quarantined/aborted tx, awriteWithRetry(fn)helper re-seeds from the new current snapshot and re-runs the caller’sfnagainst fresh data, up to N times — a true per-transaction optimistic retry loop (the classic compare-and-retry). This is straightforward to add and is the right answer for “I need this write to see the latest state.” - Engine rebase (future): SQLite’s
sqlite3rebaser_*can rebase a conflicting changeset onto the already-resolved base and re-apply per change, the finest-grained merge.node:sqlitedoes not expose it yet (Finding G), so v1 uses (1)+(2); (3) is the enhancement that would let evenstrictconflicts auto-merge without re-runningfn.
Could writes be even more parallel — independent “sections”? The merge is already
not the write bottleneck (writes are lock-free; the merge is a small, out-of-band
pointer swap). If it ever were, the store can be sharded into independent stores
(e.g. one per table-group), each with its own current pointer and reconcile — giving
fully independent publish lanes — at the cost of no cross-shard atomic snapshot.
This is a real option but unnecessary unless profiling shows merge contention; it is
listed as a possible extension, not a v1 need.
Summary of the trade. Versus WAL: this design offers strictly more read concurrency in practice (readers never touch a live file) and far more write concurrency (many lock-free writers and row-level conflicts vs one global writer), and it works on NFS and survives hot copies — in exchange for weaker isolation (row-level merge with write-set conflict detection, plus a reconcile visibility lag) instead of WAL’s serializable single-writer. That trade is the whole point: it buys the requirements (R-FS, R-DURABLE, R-CONCURRENCY) at the cost of serializability, which the target use cases (caches, agent state, content- addressed metadata) do not need.
- Stop hunting for a better file lock. The fine-grained part of “locking” is always in-memory and same-host; cross-host it must become a commit. Put the serialization at a millisecond atomic pointer-swap and make it idempotent — then lock perfection is unnecessary.
- A changeset is the OCC commit-log in SQLite-native form. It turns “concurrent writers” into “each writer drops a uniquely-named file” and “a merge step folds them,” all plain files, conflict-free even on NFS — exactly the per-session optimism the requirements ask for.
- Immutability is what buys hot-copy and crash safety for free. Because published heads never mutate and the pointer swap is atomic, any copy of the directory yields a consistent older state plus an ignorable pending tail. R-DURABLE falls out of the data model, not out of careful locking.
- Readers are universal; only writers/merge need the extension. This keeps the
tri-language and tool-compatibility goal intact — Python/Rust/
sqlite3read a published snapshot today; the session requirement is confined to the writer/reconcile runtime. - The constraints are in the binding, not SQLite.
node:sqlite’s missing table-name in the conflict handler, ABORT-as-false, and missing rebaser are all binding-level; the C API is richer. This shapes v1 (filter-based per-table policy; constrained writable surface so no rebaser is needed) and gives a clean upstream ask. - Two write strategies, one format cleanly serves both “many agents, maybe on NFS” and “one local tool, keep it dead simple” without forking the on-disk format or the reader path.
Concurrency approaches for a single SQLite store with no server. Read vs write concurrency stated separately, with the conflict/lock granularity and the isolation each provides (see Finding K for the precise definitions and caveats):
| Approach | Read concurrency | Write concurrency | Conflict / lock granularity | Isolation | Server-free | Any FS (incl. NFS) | Hot-copy-safe |
|---|---|---|---|---|---|---|---|
| Single connection, rollback journal (default) | readers exclude the writer | 1 writer | whole-DB | serializable | ✓ | ✗ (locks) | ✗ |
WAL + busy_timeout |
many, lock-free (snapshot) | 1 writer (global) | whole-DB write lock | writer serializable; readers snapshot | ✓ | ✗ (-shm) |
✗ |
libSQL BEGIN CONCURRENT |
many | many (commit-time check) | page-level | snapshot + OCC | ✓ | ✗ (-shm) |
✗ |
Wrapper mkdir-lock + rollback (≈ DirectStore) |
many (immutable snapshots) | 1 writer (lease) | whole-DB | each publish atomic | ✓ | ✓ | ✓ |
| Elected coordinator (LiteFS) | many | 1 primary (replicas read-only) | engine-level | serializable | ✗ (FUSE daemon + lease) | n/a — own FUSE FS + replication | n/a (replicated) |
| Raft server (rqlite/dqlite) | many | many | engine-level | serializable | ✗ (cluster) | n/a — replicates over the network | n/a (replicated) |
| Snapshot-publish + changeset log (this doc) | unlimited, lock-free (snapshot per version) | unlimited at capture + one out-of-band merge | per-row (per-PK), per-table policy | row-level write-set merge + DATA-conflict detection (not serializable) | ✓ | ✓ | ✓✓ |
A fairness note on the “Any FS” column: LiteFS and rqlite/dqlite are not “unsafe on NFS” — they don’t use a shared filesystem at all. LiteFS gives each node its own FUSE filesystem and replicates writes from a single elected primary; rqlite/dqlite replicate via Raft over the network. Both require a running process (a daemon/cluster), which is exactly the cost the server-free approaches avoid. So they solve a different problem (multi-node replication) rather than “share one file on whatever disk you happen to be on.”
The headline contrast: WAL gives you the strongest isolation (true serializable) but a
single global writer and only on local disk; this design gives you many concurrent
writers resolved at row granularity and works on any filesystem / survives hot copies,
but with weaker (merge-based) isolation. They sit at opposite ends of the
concurrency-vs-isolation trade, and the right pick depends on whether you need
serializability or many-writer, any-FS, and hot-copy safety with no server.
(DirectStore is the middle: WAL’s single-writer simplicity but any-FS/hot-copy-safe
and on any binding.)
A cross-process mkdir/flock lease + BEGIN IMMEDIATE + nolock/rollback-journal +
open-fresh-on-NFS around each transaction.
Pros: tiny (~40 lines over an existing lock); correct for modest local/NFS writes; keeps one standard SQLite file. Cons: it is the coarse one-writer-at-a-time model the requirements want to avoid; the lease-break/coherence hazards on NFS are managed, not eliminated; no per-session optimism. Good as a fallback or the “direct mode” floor, not as the primary concurrency story.
First participant wins a lease, opens the DB, serves a socket; others RPC in; lease auto- releases on death.
Pros: real in-memory MVCC, strong immediate consistency, cross-host. Cons: there is a process (transient, but a process); cross-platform socket/lifecycle management is exactly the daemon-seamlessness liability R-NOSERVER warns about; more moving parts. Best reserved as a future optimization for hot, strongly-consistent cross-host work.
The Findings-H design: lock-free changeset writers, immutable snapshots, atomic pointer,
idempotent reconcile, mkdir lease only around publish; plus the Findings-I direct mode
for the simple local case.
Pros: satisfies every requirement (per-session optimism; no server; NFS-safe;
hot-copy and crash safe; standard-SQLite readers; cross-platform via node:sqlite);
validated by 55 multi-process tests; correctness independent of lock perfection.
Cons: writers/reconcile need the session extension (Node node:sqlite / Python APSW
/ Rust rusqlite), which on Node is still --experimental; reader latency lags the
pending tail until reconcile; per-write snapshot-seed cost grows with catalog size
(mitigated by schema-only seed / compaction); no auto-rebase merge in v1 (constrained
writable surface instead).
- Raft servers (rqlite/dqlite), mvSQLite/FoundationDB: violate R-NOSERVER; cluster- scale answers to a local-scale problem.
- CRDT layer (cr-sqlite): dormant; a merge layer, not a store; over-general for our conflict needs.
- Pure-KV engines (LMDB/RocksDB/redb) for the interchange artifact: fail R-FORMAT (engine-specific, poor external tooling); fine only as a rebuildable local cache.
- A custom VFS that intercepts
xLock/xShmMap: native C, large build, and the serialization point can live at one app-level chokepoint anyway — transparency buys little here.
Build a small, dependency-free, cross-platform package that implements Option C, with this surface:
const store = new SnapshotStore(dir, {
schema: SCHEMA_SQL, // caller DDL applied to snapshot 0; every table NON-NULL PK
schemaVersion: 1,
appId: 0x........, // PRAGMA application_id
conflictPolicy: { // per-table merge semantics
cache_table: 'lww', // last-writer-wins (REPLACE on DATA/CONFLICT, OMIT else)
edges: 'union', // set-union (OMIT duplicates)
'*': 'strict', // deny-by-default: a conflict quarantines the tx
},
}).init()
const txId = store.write(writerId, (db) => { /* INSERT/UPDATE/DELETE */ }) // lock-free envelope
const rows = store.read((db) => db.prepare('…').all()) // immutable snapshot
const res = store.reconcile() // build → quick_check → atomic publish; exactly-once
const lease = store.readLease() // pin current for a packager; lease.release() after
store.gc({ retain: 3 }); store.validate(); store.repair(); store.seal('out.sqlite')
// Easy-local alternative over the same format (no session extension, any binding):
const v = new DirectStore(dir, opts).write((db) => { /* … */ })Module/code outline (what a reproducer needs to build):
lock—withLock(lockPath, fn(token), {timeoutMs>staleMs, staleMs, pollMs})+ownerOf(lockPath)+heartbeat: the NFS-safemkdirlease lock with an owner token, held only around publish; supports revalidate-before-publish (fencing). InvarianttimeoutMs > staleMs. ~50 lines.driver— thinnode:sqlitewrapper:open/openReadonly,recordChangeset(db, fn)(session over a single tx),vacuumInto,quickCheck,integrityCheck. The only place the session extension is touched; lazy-loaded for write/reconcile.conflictPolicy—normalize(policy)andapplyWithPolicy(db, changeset, policy): per-table policy via one filteredapplyChangesetpass per action (the binding gives the table name only tofilter), conflict-type-aware'replace', and'strict'detected via thefalsereturn of an ABORT pass → roll back the whole tx → quarantine.snapshotStore— the engine:init(snapshot 0 with schema +application_id/user_version+ anapplied_txledger table),read,write(seed-from-snapshot orseed:'schema-only'; envelope withCOMMITTEDlast; manifest carriesschema_version+schema_hash+changeset_sha256),reconcile(build-temp → apply non- ledgered committed tx with the policy →quick_check→ atomic publish under the lease, with OCC re-check + exactly-once viaapplied_tx),gc(retain N + leased),readLease,validate(classifysealed/live/in-flight-capture/corruptfrom the structural tells),repair(restore last consistent point),seal(VACUUM INTOfinalize).directStore— the easy-local strategy:write(fn)= lock + copy current → mutate →quick_check→ atomic publish, with OCC; reuses the same layout/read.- CLI — schema-agnostic
info/validate/gc/seal/repair(these operate on the snapshot/pointer/envelope layer, so they need no schema or policy — an operator can maintain a store they did not create).validateexits 0/2/3 for sealed-or-live / in-flight / corrupt. - Tests — port the spike’s multi-process harness (writer/reconciler/reader/direct- writer/lock-holder child processes): zero-loss concurrency, racing reconcilers exactly- once, readers-never-torn, crash-at-every-publish-step, true SIGKILL, randomized chaos, fencing/lease, GC+lease, portability (magic header + pragmas), repair, direct-mode interop.
Prototype is done. A complete reference implementation of all of the above lives at
explorations/spikes/snaplog-standalone-package/ (the finterm repo) with 55 passing
tests, a runnable demo.mjs, a worked generic-consumer example, and SOURCES.md
recording the exact upstream source lines.
It is intentionally domain-agnostic — the prototype’s tests inject a toy schema, and a
separate example injects an unrelated one — so it transfers directly to a standalone
package.
Recommended sequencing for productionizing:
- TypeScript port of the prototype (types; lazy
node:sqlite; the same tests). - Cross-platform install/runtime matrix (macOS arm64/x64, Linux glibc/musl
x64/arm64, Windows x64) installing the packed tarball, plus a real
NFSv4.1/EFS/Filestore gate (the one thing a local spike cannot settle —
mkdir/rename atomicity under contention,actimeo/close-to-open, lease-break under partition,fsyncdurability). - Schema migration across snapshot versions (deterministic DDL step; reconcile
quarantines schema-mismatch via
schema_hash). - Cheaper seeding for large catalogs (schema-only default for insert-heavy writers, or a shared read-only base and overlay) and snapshot compaction/retention.
- Optional later: Python (APSW) / Rust (
rusqlite) writer parity against the same envelope format; an ephemeral-coordinator mode for hot strongly-consistent cross-host work; upstreamnode:sqliteasks (forward the conflict iterator/table name; expose the rebaser).
Open design questions to settle with maintainers: changeset vs patchset (recommend
changeset); rebaser-gap handling (recommend constrained writable surface for v1);
direct-mode strictly opt-in (never inferred from filesystem detection — the dangerous
cases look local); and whether bit-rot repair should replay tx/applied/ onto the
prior snapshot rather than fall back to it.
- Decide whether to extract the prototype into a standalone open-source package and under what name/scope.
- TypeScript port, plus the cross-platform and real-NFS test matrices.
- Schema migration, cheaper seeding, and compaction.
- File the upstream
node:sqliteenhancement requests (conflict iterator/table name; rebaser). - Evaluate Python/Rust writer parity if multi-language writing (not just reading) is needed.
Empirical probes were run against node:sqlite (Node v22.22.2) and cross-checked
against upstream source checked out per the repo’s third-party-checkout workflow.
Four repos were read at the source level (commits in the prototype’s SOURCES.md):
sqlite/sqlite (002d33d) for the session/locking/WAL semantics
(ext/session/sqlite3session.{h,c}, src/wal.c, src/os_unix.c); nodejs/node
(v22.22.2, SQLite 3.51.2) for the binding (src/node_sqlite.cc,
deps/sqlite/sqlite.gyp); WiseLibs/better-sqlite3 (f93f490, v12.10.0, SQLite
3.53.2) for binding/format/session-availability (deps/defines.gypi, lib/); and
moxystudio/node-proper-lockfile (9f8c303, v4.1.2) for the mkdir-lease prior art
(lib/lockfile.js, README.md). Each technical claim in this doc carries a file:line
reference verifiable against those checkouts.
A complete prototype with 55 multi-process tests was built and run end-to-end (zero-loss
concurrency, racing reconcilers, crash-at-every-publish-step, SIGKILL, chaos, fencing,
GC/lease, portability, repair, direct-mode interop) plus performance/seeding/patchset
benchmarks.
The prior-art landscape and the locking/journaling/filesystem background also
draw on the finterm embeddable-databases survey (2026-06-04, §G–§J); point-in-time
version/maturity facts should be re-verified before relying on them.
SQLite — format, locking, journaling, sessions (primary):
- SQLite as a Library of Congress recommended storage format · SQLite as an application file format · single-file DB
- File locking (lockingv3) ·
WAL mode (incl. network-FS limits) ·
URI filenames (
nolock/immutable) · alternate VFSes (unix-dotfile/unix-flock/unix-none) - Backup API ·
VACUUM INTO· How to corrupt a SQLite database · security / untrusted DBs · limits - Session extension intro ·
sqlite3session_changeset·sqlite3changeset_apply+ conflict handler · rebaser · FTS5
Bindings:
- Node
node:sqlite· session support commit (source pass:attic/nodesrc/node_sqlite.cc,deps/sqlite/sqlite.gyp:24,27— SESSION/PREUPDATE on) better-sqlite3(source pass:attic/better-sqlite3deps/defines.gypi— SESSION off;lib/— no changeset API; standard on-disk format) · APSW session (Python) ·rusqlitesessionfeature ·mattn/go-sqlite3session request
Concurrency / locking mechanisms (background):
- PostgreSQL explicit locking
·
Berkeley DB locking subsystem
·
robust mutexes (
pthread_mutexattr_setrobust) · OFD locks (F_OFD_SETLK) · Windows named mutexes (WAIT_ABANDONED) - Web Locks API
(leader election analogue) ·
proper-lockfile(source pass:attic/node-proper-lockfilelib/lockfile.js—mkdirlease + mtime heartbeat;README— “O_EXCLis broken on NFS,mkdiris not”)
Prior art — distributed/concurrent/replicated SQLite:
- LiteFS (how it works) · Litestream · rqlite · dqlite · mvSQLite · cr-sqlite (Vlcn)
- libSQL ·
Turso
BEGIN CONCURRENT· Turso Database (pure-Rust rewrite, beta) - Changeset-sync prior art: SQLiteChangesetSync · sqliteai/sqlite-sync · Bloomberg comdb2 · GRDB session discussion
Commit-log / OCC pattern (the cross-host “lock becomes a commit”):
- Apache Iceberg · Delta Lake (atomic pointer-swap / optimistic-concurrency commit logs over object storage)
finterm internal background (motivation and deeper survey):
- Modern Embeddable / File-Based Databases survey (§G concurrency theorem, §H beads network-drive pitfall, §I write-lock wrapper, §J session/changeset) — the parent survey this generalizes
- Prototype:
explorations/spikes/snaplog-standalone-package/(55 tests;SOURCES.mdrecords the exact SQLite/node:sqlitesource lines) · earlier in-domain proofsexplorations/spikes/dataroom-changeset-catalog/anddataroom-snapshot-catalog/