I like to model all programs as state machines with pure functions within them. These individual state machines compose into larger ones where sometimes it’s easier to visualize as a hierarchy and other times as a distributed network of actors. One system that applies this model is Erlang’s OTP, but this model is very general and comes up just about everywhere.
An issue with obsessing over generalization is the “Turing tarpit” — that every computation can be modelled on a Turing machine’s tape or using lambda calculus or some other simple system. The key to abstraction is to purposefully attempt to model things with the least powerful/expressive language reasonably possible. If you can solve a problem correctly with a simple construct, then you should — which is why “goto” lost to structured programming and why pure functional programs are so much easier to reason about than their equivalent object-oriented or imperative counterparts. The extra constraints tend to make abstraction easier because logic becomes much more localized.
One issue that always comes up with functional programming is the concept of an effect. I like to visualize an effect as a state machine transition. From this perspective, effects are always relative to your viewpoint. If an internal state machine has multiple state changes but resets before yielding a result, then it is equivalent to a pure function from the outside. As a note: some types of effects are a bit more implicit — like the resources that a program uses. If a program takes a long time to run, then that will affect me as a user regardless of whether it was written with the IO monad or whether that effect was tracked algebraically.
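To make that viewpoint-relativity concrete, here is a minimal TypeScript sketch (the function and its name are purely illustrative): internally it is a tiny state machine that transitions an accumulator on every step, but none of those transitions are observable from outside, so to the caller it is a pure function.

```ts
// Internally a state machine: `acc` is mutated on every step.
// Externally pure: the machine is created and discarded within the call,
// so no intermediate state ever leaks to the caller.
function sumOfSquares(xs: number[]): number {
  let acc = 0;                 // internal state
  for (const x of xs) {
    acc += x * x;              // state transition, invisible from outside
  }
  return acc;                  // only the final state escapes
}

console.log(sumOfSquares([1, 2, 3])); // always 14: the effect only exists from the inside view
```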
In this model, state machines have internal state and can communicate that state with each other. Imagine a situation where there are 4 machines: A has a number, B doubles A’s number, C adds 1 to A’s number, and D sums B and C. If we have A set to 1, then D = 2*1 + (1+1) = 4. Let’s say A increments to 2, so D should be 2*2 + (2+1) = 7. However, if D gets B’s value before C updates, then D will have a glitch causing it to transiently result in 6 instead of 7. This can be solved by ensuring every derived state is updated in waves — breadth-first/topologically sorted. Every single update happens in lockstep.
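As a rough sketch of that wave-based update (hand-rolled here rather than using a real signals library), the diamond A, B, C, D from above can be updated one topological generation at a time so that D never observes a half-updated world:

```ts
// A = source, B = 2*A, C = A + 1, D = B + C.
type Cell = { value: number; compute?: () => number };

const A: Cell = { value: 1 };
const B: Cell = { value: 0, compute: () => 2 * A.value };
const C: Cell = { value: 0, compute: () => A.value + 1 };
const D: Cell = { value: 0, compute: () => B.value + C.value };

// Waves in topological order: a node only recomputes after every one of its
// dependencies was recomputed in an earlier wave, so the glitch value 6
// (new B plus stale C) can never be observed.
const waves: Cell[][] = [[B, C], [D]];

function propagate() {
  for (const wave of waves) {
    for (const cell of wave) cell.value = cell.compute!();
  }
}

propagate();          // A = 1, so D = 4
A.value = 2;
propagate();          // A = 2, so D = 7, never transiently 6
console.log(D.value);
```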
As soon as asynchronous derivations are involved, consistency falls apart again. The answer is to not only update derivations in the right order, but to also tie each value to a logical time. A derivation can wait until every dependency is at the same logical time, or it can intentionally mix values from different times and eventually end in a consistent state after handling all updates. There are many different consistency models, some with very strange behavior, which leads me to the conclusion that strict serializability is the only acceptable consistency level for non-commutative transactions and that every weaker consistency model must be opt-in and explicitly handled as speculative.
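Here is one way to picture the first option, as a hedged sketch with invented names: every value carries a logical tick, and a derivation only recombines its inputs once they agree on that tick; otherwise it keeps serving its last internally consistent result.

```ts
type Timed<T> = { value: T; time: number };

// Recompute only when both inputs have reached the same logical time;
// otherwise keep the previous (older but consistent) output.
function deriveAligned<A, B, R>(
  a: Timed<A>,
  b: Timed<B>,
  f: (a: A, b: B) => R,
  previous: Timed<R>,
): Timed<R> {
  if (a.time !== b.time) return previous;
  return { value: f(a.value, b.value), time: a.time };
}

// B has already reached tick 2 but C is still at tick 1, so D stays at tick 1.
let d: Timed<number> = { value: 4, time: 1 };
d = deriveAligned({ value: 4, time: 2 }, { value: 2, time: 1 }, (x, y) => x + y, d);
console.log(d); // { value: 4, time: 1 }
// Once the slower input catches up, D advances and is consistent at tick 2.
d = deriveAligned({ value: 4, time: 2 }, { value: 3, time: 2 }, (x, y) => x + y, d);
console.log(d); // { value: 7, time: 2 }
```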
Databases use weaker consistency models only to improve their performance at the cost of inexplicable bugs in edge-cases. Most of the time, for most business use cases, this does not lead to disaster because users can just reload their browser and re-submit a form. It’s a similar trade-off to how dynamically typed JavaScript is more productive to use than strongly typed and formally verified Ada. However, we choose the secret third thing (gradual typing) by using TypeScript, which lets you invest exactly as much effort in precisely modeling types as you want. We ought to be able to build eventual consistency on top of a strictly serializable core as optional latency optimizations.
Imagine a social media service that uses a database, an app server with a search index, and a web app. There is a straight line of derivations from the source database to the server to the user. When a user wants to update their bio, they edit a text input that has their current bio in it. If we wanted to be completely strict and correct, then the user’s edit would be relayed after validation via the server to the database. The server would have to run a query to update its internal state, including its search index, and let the browser know that everything went well. You can imagine the act of editing the text box to be manipulating the underlying source data in the authoritative database in such a way that the bio will look that way when all intermediaries re-run their derivations. The edit UI knows how to extract the desired changes and propose them to the app server, which in turn knows how to propose its changes to the database.
A key idea here is that each layer bets on what changes can be made and could choose to behave optimistically. The browser can instantly submit the edit as if it was already done with a small indicator somewhere that it’s still saving in the background. The app server can update its search index and immediately allow any users to search with the new bio in the mix, or maybe even just allow the user to see those updates but not everyone else. All it would require is that the index adds the new bio alongside the old one with a recorded mark that it is tied to a specific in-flight database proposal. As soon as the database confirms success, the new entry can fully replace the old one; if the database rejects the change for some reason, it is simply removed. The point is that by making this opt-in and keeping everything tracked, we get the minimum possible latency without compromising the core consistency model.
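A small sketch of what this could look like on the app server side, with all names invented for illustration: the index keeps the confirmed bio plus an optional speculative one tagged with the id of the in-flight proposal, and the proposal’s outcome either promotes or discards it.

```ts
type IndexEntry = {
  confirmed: string;
  speculative?: { proposalId: string; value: string };
};

const bios = new Map<string, IndexEntry>();

// Record the optimistic value alongside the confirmed one, tied to a proposal.
function proposeBio(userId: string, proposalId: string, newBio: string) {
  const entry = bios.get(userId) ?? { confirmed: "" };
  entry.speculative = { proposalId, value: newBio };
  bios.set(userId, entry);
}

// When the database answers, promote the speculative value or throw it away.
function resolveProposal(userId: string, proposalId: string, accepted: boolean) {
  const entry = bios.get(userId);
  if (!entry?.speculative || entry.speculative.proposalId !== proposalId) return;
  if (accepted) entry.confirmed = entry.speculative.value;
  delete entry.speculative; // either way, nothing is in flight anymore
}

// Reads opt in to seeing speculative values (e.g. only for the editing user).
function readBio(userId: string, includeSpeculative: boolean): string | undefined {
  const entry = bios.get(userId);
  return includeSpeculative ? entry?.speculative?.value ?? entry?.confirmed : entry?.confirmed;
}
```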
Another example would be in a headless CMS where editors would want to be able to edit content directly on the rendered web page instead of in the CMS backend. A big issue is that even if you can associate rendered text to its source data, there is a black-box rendering process between that and the web page. However, if each rendered component on the page declared that it has scoped control over specific pieces, then the full re-render can be bypassed. The idea is that an outer render function may care about inner data to only a certain degree, where e.g. a section might not render if its heading is empty. The entire page might declare that it treats all data nested within certain fields as opaque and delegates to inner components. That section component could declare a predicate that simply says it considers the heading text opaque unless it’s empty. When every conditional delegation is tracked, only the minimal nested re-renders are required and simple edits can take place with the absolute minimum possible latency.
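As a sketch of how those declarations might compose (the component names and the shape of an edit are made up here), each component exposes a predicate saying whether an incoming edit only touches data it treats as opaque; descending while every level delegates finds the minimal component that actually has to re-render.

```ts
type Edit = { path: string[]; newValue: unknown };

type Component = {
  name: string;
  // true means "this edit only touches data I treat as opaque; delegate it inward"
  delegates: (edit: Edit) => boolean;
  child?: Component;
};

const heading: Component = {
  name: "Heading",
  delegates: () => false, // leaf: it renders the text itself
};

const section: Component = {
  name: "Section",
  // The section only cares whether the heading is empty (it might not render at
  // all); any non-empty heading text is opaque and delegated to the Heading.
  delegates: (edit) =>
    edit.path[edit.path.length - 1] === "headingText" && edit.newValue !== "",
  child: heading,
};

const page: Component = {
  name: "Page",
  delegates: (edit) => edit.path[0] === "sections", // all section data is opaque to the page
  child: section,
};

// Descend while every level declares the edit opaque; where we stop is the
// minimal re-render needed for this edit.
function renderTarget(root: Component, edit: Edit): Component {
  let current = root;
  while (current.child && current.delegates(edit)) current = current.child;
  return current;
}

console.log(renderTarget(page, { path: ["sections", "0", "headingText"], newValue: "Hi" }).name); // Heading
console.log(renderTarget(page, { path: ["sections", "0", "headingText"], newValue: "" }).name);   // Section
```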
Mapping out just about every real service reveals the same pattern of storing data in some machine and deriving that state into different representations in other machines. It’s the exact same pattern even within each physical computer. The databases themselves manage their own dependency graph of which indices and materialized views to update when particular tables change, which itself is derived from the write-ahead log of every transaction. Application servers manage their own data structures derived from the databases they query while also relying on a soup of auxiliary services and caches to speed up refreshing those data structures in mostly ad-hoc ways. End-user applications apply the exact same approach.
Luckily for us, things are improving. In the world of frontend web dev, the concept of reactive signals is even potentially going to be standardized in JS. Some databases are built to sync reactively from the ground up and some libraries bring a database transparently to the browser. Perfect incremental updates to arbitrary materialized views expressed with SQL have been solved by two close competitors: Materialize and Feldera. The future looks pretty bright. Except all of these ideas are ancient and have been re-discovered over the decades multiple times. Signals are such a simple concept that the oldest implementation in a UI that I can recall is from Smalltalk in the late 1970s. Incremental materialized view updates have been around in (proprietary and/or uncommon) databases forever. A cynical reader might wonder what makes the new tech more promising. My expectation is that all of these solutions will be popular but within limited scopes because we are missing a deeper, more unified model.
Databases simply need to stop being seen as isolated nodes in a system of custom services; the database needs to almost dissolve as a concept altogether. I’m advocating for databases to turn inside out and become conceptually simpler, higher performance, and more correct all at once. It starts by separating the concepts of storage durability, authority, and synchronization.
In a contemporary database, SQL mutation results are logged into a linear append-only WAL and then every data structure derived from that is updated — tables, indices, and materialized views. Snapshots of the modified rows are marked with a logical time in order to isolate concurrent transactions so that they don’t see each others’ changes. Every modification is append-only until a compaction/VACUUM phase deletes the old rows. Since there are a lot of rows sort of floating around and pointer-chasing is slow, the tables are periodically materialized into snapshots. Once this is done, it conveniently also allows the DBMS to drop the redundant older entries in the WAL.
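As a simplified sketch of that snapshot-isolation mechanism (this is the general MVCC idea, not any specific engine’s layout): every write appends a new row version stamped with a logical time, older versions are merely marked as superseded, and a reader only sees versions that existed at its snapshot time.

```ts
type RowVersion = { key: string; value: string; createdAt: number; supersededAt?: number };

const versions: RowVersion[] = [];

// Reads at a given snapshot time see the newest version created at or before
// that time and not yet superseded at that time.
function read(key: string, snapshotTime: number): RowVersion | undefined {
  return versions
    .filter(
      (v) =>
        v.key === key &&
        v.createdAt <= snapshotTime &&
        !(v.supersededAt !== undefined && v.supersededAt <= snapshotTime),
    )
    .sort((a, b) => b.createdAt - a.createdAt)[0];
}

// Writes never modify in place; they append a version and mark the old one.
function write(key: string, value: string, txTime: number) {
  const current = read(key, Number.MAX_SAFE_INTEGER);
  if (current) current.supersededAt = txTime;
  versions.push({ key, value, createdAt: txTime });
}

write("bio", "hello", 1);
write("bio", "hello world", 2);
console.log(read("bio", 1)?.value); // "hello": a transaction at time 1 stays isolated
console.log(read("bio", 2)?.value); // "hello world"
// A VACUUM-style pass could now drop the superseded version once no snapshot needs it.
```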
I want to bring attention to a couple of important aspects of this design. The first is that the WAL is linear with a concurrency control feature hacked on top. The second is that history is usually treated as garbage that needs to be periodically collected. These imply that there is no real concern with being able to check out specific transaction commits, revert, or branch and merge like in version control. The data structures are almost exclusively managed by the DBMS itself and any extensions to this must be done with a plugin system. Custom indices are not particularly flexible or popular and the methods to query them are expressed with hacks on top of SQL. Concurrency control is limited to checking whether read values have been written over, with no real attempt to automatically fix conflicts. Many of these issues have solutions out there that somewhat improve the situation, but the reality is that there is too much momentum to build on top of designs that we outgrew.
We should take inspiration from Pijul for the way that the WAL is structured, where it is a DAG instead of a stream of commits. Each commit is made up of a set of operations which have implicit and/or explicit commutativity predicates. Commits have an ordered antichain of parents (sequence of unique commits where none are ancestors of each other) that we can call a frontier. We can replay every operation of every commit as we walk from the root node to any frontier, recording conflicting operations between each parent. All conflicts are tracked and resolved automatically based on the order of the frontier, so every operation commutes automatically yet surfaces the conflicts to be adjusted separately if desired. The concept of a frontier with all paths recorded as operations allows for ad-hoc branching and merging. This is a more powerful version of multi-version optimistic concurrency control and therefore the same semantics and algorithms apply to both concurrent transactions and VCS-style branching for development. The commit DAG becomes the canonical source of truth that all other state derives from, replacing the WAL cleanly. But it gets even better, because the DAG is distributed, meaning that every database shard or replica and every machine that wants to propose transactions becomes a peer in a unified distributed database system. The speculative stacks of changes I described earlier are just implicit branches of this DAG.
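To pin down the frontier idea, here is a loose sketch (this is not Pijul’s actual data model, just the shape of it): a commit carries operations and an ordered list of parents, and materializing a frontier means replaying the operations of every ancestor, with the frontier’s order deciding which branch wins when two writes collide while both remain recorded in the DAG.

```ts
type Op = { key: string; value: string };
type Commit = { id: string; parents: Commit[]; ops: Op[] };

// Collect all ancestors of a frontier in parents-first (topological) order.
function ancestry(frontier: Commit[]): Commit[] {
  const seen = new Set<string>();
  const out: Commit[] = [];
  const visit = (c: Commit) => {
    if (seen.has(c.id)) return;
    seen.add(c.id);
    c.parents.forEach(visit);
    out.push(c);
  };
  frontier.forEach(visit);
  return out;
}

// Replay to a plain key-value state; in this sketch a later parent in the
// frontier order wins when two branches wrote the same key, but both writes
// remain in the DAG and could be surfaced as a conflict to adjust separately.
function materialize(frontier: Commit[]): Map<string, string> {
  const state = new Map<string, string>();
  for (const commit of ancestry(frontier)) {
    for (const op of commit.ops) state.set(op.key, op.value);
  }
  return state;
}

const root: Commit = { id: "root", parents: [], ops: [{ key: "bio", value: "hi" }] };
const a: Commit = { id: "a", parents: [root], ops: [{ key: "bio", value: "hello" }] };
const b: Commit = { id: "b", parents: [root], ops: [{ key: "name", value: "Ada" }] };
console.log(materialize([a, b])); // Map { "bio" => "hello", "name" => "Ada" }
```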
It’s wasteful to replay every operation to materialize every part of the database every time, but it’s hard to draw the lines of where to store snapshots. The general solution is to make indices for exactly the objects desired in a way that resembles a skip list. Operations often overwrite aspects of earlier operations, which is exactly why the accumulated/reduced state is smaller than the full record of operations. We can record the running sums of the sizes of the operations and compare them with their compressed versions. Once a compacted operation is, say, at most half the size of its corresponding original operations, we should store that compacted operation along with its start and end commits. If we repeat this process, we have a structure that allows us to materialize snapshots at any frontier in logarithmic instead of linear time. Further skip indices can be made to isolate the operations that apply to certain indices like tables, records, or any other object that makes sense to independently materialize at arbitrary frontiers. In order to allow for true deletion (e.g. for compliance or limited resources), deletion operations can be compacted the same way where the original operations are simply deleted afterwards.
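Here is a toy version of that rule (operations reduced to sized key writes, names invented): accumulate operations for an object, compact the span by keeping only the last write per key, and once the compacted form is at most half the size of the raw span, store it as a skip entry covering that commit range.

```ts
type WriteOp = { commit: number; key: string; size: number };
type SkipEntry = { startCommit: number; endCommit: number; ops: WriteOp[] };

// The compacted form of a span keeps only the last write per key.
function compactSpan(span: WriteOp[]): WriteOp[] {
  const lastPerKey = new Map<string, WriteOp>();
  for (const op of span) lastPerKey.set(op.key, op);
  return [...lastPerKey.values()];
}

function sizeOf(ops: WriteOp[]): number {
  return ops.reduce((sum, op) => sum + op.size, 0);
}

// Store a skip entry once the compacted span is at most half the raw size.
function buildSkipEntries(ops: WriteOp[]): SkipEntry[] {
  const entries: SkipEntry[] = [];
  let span: WriteOp[] = [];
  for (const op of ops) {
    span.push(op);
    const compacted = compactSpan(span);
    if (sizeOf(compacted) * 2 <= sizeOf(span)) {
      entries.push({ startCommit: span[0].commit, endCommit: op.commit, ops: compacted });
      span = []; // start accumulating the next span
    }
  }
  return entries;
}

// Repeated overwrites of the same key compact well, so spans close quickly;
// write-once data never meets the threshold and simply stays uncompacted.
const ops: WriteOp[] = [
  { commit: 1, key: "bio", size: 10 },
  { commit: 2, key: "bio", size: 12 },
  { commit: 3, key: "name", size: 4 },
  { commit: 4, key: "bio", size: 11 },
];
console.log(buildSkipEntries(ops)); // one entry covering commits 1..4
```

Running the same rule over the skip entries themselves builds the next level up, which is what gives roughly logarithmic rather than linear materialization at any frontier.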
In the same way that operations carry predicates which let us detect conflicts quickly, queries also have predicates for when they are invalidated. Queries (arbitrary derivations) themselves can be expressed using differential dataflow, DBSP, or with any custom function by just tracking active query subscriptions’ predicates. When those dependencies change, the database can just inform the derivation of the change. Since every single index or other derived view of the underlying commit DAG can be expressed in these terms, we can reduce all index and query handling to the same problem of incremental view maintenance. Every derivation tracks the logical time frontier that each part of the derivation is based on. If we model the entire derivation graph across all machines — including database servers, app servers, browsers, or anything else — it makes sense to talk about consistency envelopes. Any part of the derivation graph can update out of step with other parts, tracking their logical frontiers, but the graph can be divided into fragments that each present an internally consistent view. This is most useful at async boundaries.
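A hedged sketch of the subscription side, with invented names and shapes: each active query registers an invalidation predicate plus the logical frontier it has incorporated. Commits are only routed to the subscriptions whose predicates match, and each subscription advances its own frontier independently, which is the out-of-step-but-internally-consistent behavior the envelopes describe.

```ts
type Change = { table: string; key: string; commit: number };

type Subscription = {
  name: string;
  matches: (change: Change) => boolean; // invalidation predicate
  frontier: number;                     // last commit incorporated
  apply: (change: Change) => void;      // incremental update, not a full re-run
};

const subscriptions: Subscription[] = [];

function publish(change: Change) {
  for (const sub of subscriptions) {
    if (!sub.matches(change)) continue; // untouched queries stay valid as-is
    sub.apply(change);
    sub.frontier = change.commit;
  }
}

// Example: a search index derivation that only cares about the `users` table.
subscriptions.push({
  name: "search-index",
  matches: (c) => c.table === "users",
  frontier: 0,
  apply: (c) => console.log(`re-index user ${c.key} as of commit ${c.commit}`),
});

publish({ table: "users", key: "42", commit: 7 }); // search index advances to 7
publish({ table: "posts", key: "9", commit: 8 });  // search index untouched, still at 7
```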
Since certain operations on CRDTs are fully commutative with zero conflicts by construction, objects with CRDT operations inherently do not require strict serializability. For these objects, all peers are considered equals. For any operations that have predicates, the requirement for serializability implies a central authority. High-availability replicas or shards must be abstracted using consensus algorithms like Paxos or Raft to become a single logical machine with respect to the objects in question. This means that any sharding should be done along the lines between objects that rarely interact with each other in order to avoid excess consensus round-trips for those transactions. Even for cases where usually you would think there has to be a global consensus, there are tricks to allow at least partial independence. For example, a uniqueness constraint can be ensured for generated ids by simply only allowing certain nodes to generate within exclusive parts of the id distribution. The point is that we explicitly opt in to exactly the trade-offs we want to make.
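The id trick in particular is small enough to show directly; a minimal sketch, assuming a fixed number of nodes known up front:

```ts
// Each node only ever generates ids inside its own exclusive slice of the id
// space, so uniqueness holds globally without any consensus round-trip.
function makeIdGenerator(nodeIndex: number, totalNodes: number) {
  let counter = 0;
  // ids are counter * totalNodes + nodeIndex, so two nodes can never collide
  return () => counter++ * totalNodes + nodeIndex;
}

const nodeA = makeIdGenerator(0, 3); // node 0 of 3
const nodeB = makeIdGenerator(1, 3); // node 1 of 3
console.log(nodeA(), nodeA(), nodeB()); // 0, 3, 1: disjoint by construction
```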
What I’m really craving to see is a unified model of programs that makes the database itself dissolve into the rest of the entire application. I want something like Feldera to be integrated first-class into the database itself, exposing logical time as a frontier and allowing edits to bubble up directly from arbitrary derivations. I want queries to be arbitrary code that opts to use libraries which make logical constraints easy to express instead of being forced to use SQL with a query planner which acts like a completely separate JIT. Consistency envelopes generalize async derivations in functional reactive programming and I hope to see that apply in signal libraries. I want all database operations like schema migrations to be downright trivial and apply with cheap branching, merging, and reverting that updates with live production data. I want transactions to encode arbitrary precision just like gradual type systems. Essentially, I want consistency to take center-stage and do so in a way that makes everyone wonder why we ever did anything differently.
Note that this article is certified organic. All em-dashes, good ideas, and obvious misunderstandings are completely human-generated — let me know what you think.