@sskeirik
Last active August 15, 2025 05:53

RocksDB Page Access

Problem: Fetching pages from the DB is too slow

Possible Solutions:

  1. Make fetching a single page faster
  2. Fetch multiple pages in parallel

Approach

First, measure the relative overhead of the different parts of the page-fetching process to understand where time is being spent:

  • Measure RPC Time (from RPC start to RPC finish)
  • Measure Page Fetch Time (from DB entry to DB exit)
  • Subtract Page Fetch Time from RPC Time to obtain the RPC overhead (which includes, e.g., DB connection setup and result serialization)

If RPC overhead dominates, we should move to a more efficient RPC method. Otherwise, we should parallelize reads using the non-sequential page token API described below.
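As a minimal sketch of this measurement (the handler and timer names are hypothetical; an in-process timer stands in for real server-side instrumentation):

```python
import time

class FetchTimer:
    """Accumulates time spent inside DB page fetches (DB entry to DB exit)."""
    def __init__(self):
        self.fetch_time = 0.0

    def timed_fetch(self, fetch_fn):
        t0 = time.perf_counter()
        result = fetch_fn()
        self.fetch_time += time.perf_counter() - t0
        return result

def measure_rpc(handler):
    """Time a full RPC handler call (RPC start to RPC finish)."""
    t0 = time.perf_counter()
    handler()
    return time.perf_counter() - t0

# Toy usage: sleeps stand in for real serialization and DB work.
timer = FetchTimer()

def handler():
    time.sleep(0.01)                             # RPC overhead (serialization, etc.)
    timer.timed_fetch(lambda: time.sleep(0.02))  # the page fetch itself

rpc_time = measure_rpc(handler)
overhead = rpc_time - timer.fetch_time           # RPC Time - Page Fetch Time
```

Comparing `overhead` against `timer.fetch_time` tells us which of the two solutions to pursue.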

RPC Alternatives

The current RPC uses JSON to transmit an array of object records containing lots of binary data. This has the following issues:

  • it requires binary data to be encoded using text formats like hexadecimal or base64, which have an overhead of 100% or 33% respectively;
  • arrays of human-readable JSON objects have redundant field names which appear repeatedly and are pure overhead.

Some possible alternatives:

  • gRPC with either FlatBuffers or Protobuf;
  • Cap'n Proto with its associated RPC mechanism.

Non-Sequential Page Token API

To implement this, we will need to add a page index column to the DB for a fixed page size.

This page index will be defined as follows:

  • keys are page numbers;
  • values are keys in the transaction log column.

To properly build the page index, as transactions get inserted into the log, we will add them to the appropriate page. Once this page index is built, we will be able to offer an API based on page numbers instead of page tokens.
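As a sketch of the index shape (a plain Python dict stands in for the RocksDB page index column; the function name and toy page size are illustrative):

```python
PAGE_SIZE = 3  # toy fixed page size for the example

def build_page_index(sorted_txn_keys):
    """Page index: page number -> txn-log key that starts that page.

    Given the txn-log keys in sorted order, every PAGE_SIZE-th key
    becomes the start of a new page.
    """
    index = {}
    for i, key in enumerate(sorted_txn_keys):
        if i % PAGE_SIZE == 0:
            index[i // PAGE_SIZE] = key
    return index
```

With keys `a` through `g` and a page size of 3, this yields pages starting at `a`, `d`, and `g`.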

Handling Append Writes

Since transaction keys are prefixed by the transaction timestamp and key-value pairs are stored in sorted order, transactions will generally be inserted near the end of the log (with some variance due to clock skew, etc...). Let's first handle this easy case. Assuming all transactions are appended to the end of the log, we will:

  • maintain a counter of how many transactions have been appended;
  • once the counter reaches the desired page size, reset the counter and insert a new key-value pair into the page index with key set to prev_max_page + 1 and value set to the most recently inserted transaction key.
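The steps above can be sketched as follows (in-memory stand-ins and hypothetical names; per the description, the recorded value is the last key of the just-completed page):

```python
class AppendOnlyPageIndex:
    """Maintains the page index while txns are appended to the log end."""
    def __init__(self, page_size):
        self.page_size = page_size
        self.counter = 0   # txns appended since the last page boundary
        self.max_page = 0
        self.index = {}    # page number -> txn key at the page boundary

    def on_append(self, txn_key):
        self.counter += 1
        if self.counter == self.page_size:
            # Page is full: record the boundary and reset the counter.
            self.counter = 0
            self.max_page += 1
            self.index[self.max_page] = txn_key

# Usage: with page size 3, every third append records a boundary.
idx = AppendOnlyPageIndex(3)
for k in ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]:
    idx.on_append(k)
```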

Handling Non-Append Writes

However, we need to handle the case where a transaction is not appended to the end of the log but instead inserted somewhere near the end.

To handle this, we will use one or more of the following approaches:

  1. Flexible Page Sizes: have an initial page size and a larger maximum page size; by allowing pages to grow, they can absorb out of order writes;
  2. Page Shifting: shift the page token of pages that occur after the modified page;
  3. Page Delay: delay providing access to the N most recent pages to allow for pages to "settle" before access;
  4. Fallback to Sequential Page Access: if a page grows beyond the maximum page size, instead of doing page shifting, we return the first page size records to the client but indicate that more records are available in the extended page.
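Strategy 1 might look roughly like this (illustrative names; `page_sizes` stands in for the page index values):

```python
def absorb_out_of_order(page_sizes, page_no, max_page_size):
    """Strategy 1: let the affected page grow by one to absorb a late write.

    Returns False when the page is already at max_page_size, signalling
    that we must fall back to one of strategies 2-4.
    """
    if page_sizes[page_no] < max_page_size:
        page_sizes[page_no] += 1
        return True
    return False

# Usage: a page that started at size 4 can absorb writes up to a max of 6.
sizes = {7: 4}
```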

Deleting Old Pages

Over time, if the DB runs out of space, we can delete pages at the beginning of the transaction log along with their corresponding page index entries. If a client requests a deleted page, we can detect this case (by the missing entry in the page index) and report that the desired page was culled.
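A sketch of culling and the corresponding detection, with a dict standing in for the page index (function names are illustrative):

```python
def cull_pages_before(page_index, min_page):
    """Delete page-index entries for pages older than min_page
    (done alongside deleting the corresponding txn-log range)."""
    for p in [p for p in page_index if p < min_page]:
        del page_index[p]

def lookup_page(page_index, page_no):
    """Return the txn key starting the page, or None if it was culled."""
    return page_index.get(page_no)

# Usage: cull pages 1 and 2; a later request for page 1 is detectable.
index = {1: "a", 2: "b", 3: "c"}
cull_pages_before(index, 3)
```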

RocksDB Page Access Strategies

The RocksDB Iterator API

Note: I'm summarizing information from these pages:

  1. To iterate over entries in the DB, we proceed entry-by-entry, i.e., there is no built-in way to use the iterator API to "skip" the next 1000 keys.
  2. Since keys are stored in sorted order, if we know a key we want to read or start iterating from, we can seek to it quickly.

Is the RocksDB API Enough?

Suppose we want an API that:

  • takes a page size S;
  • takes a number of pages N; and
  • returns N page tokens (where each page token is a key in the transaction column, with consecutive tokens separated by S entries).

Unfortunately, we cannot implement this efficiently with the simple iteration APIs that the DB directly provides --- we will end up scanning the key range twice.
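To see the cost, here is what token computation looks like with iterator-style access only (a sketch: a sorted Python list models the column and `bisect` models Seek; names are illustrative):

```python
import bisect

def page_tokens_by_scan(sorted_keys, start_key, page_size, n_pages):
    """Compute n_pages page tokens using iterator-style access only.

    bisect models Seek(); each jump of page_size models page_size Next()
    calls, so computing the tokens alone costs O(page_size * n_pages)
    iterator steps per request, before any page data is actually read.
    """
    i = bisect.bisect_left(sorted_keys, start_key)  # Seek(start_key)
    tokens = []
    while len(tokens) < n_pages and i < len(sorted_keys):
        tokens.append(sorted_keys[i])
        i += page_size  # in RocksDB this is page_size Next() calls
    return tokens
```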

A Page Access API

However, I believe there is a straightforward way to implement an API for efficient page token lookup assuming a (roughly) fixed page size.

Data Setup


Firstly, we maintain the following data:

  1. we have two columns (i.e., independent sorted key-value maps) in the DB: one for txns and one for pages;

  2. the txns column will keep the same structure it has now, i.e., a transaction's key is the concatenation of its timestamp and hash;

  3. the pages column will have keys that are page tokens and values that are page sizes where:

    • a page token is just a key in the txn column that we can seek to and start iteration from;

    • a page size is either:

      • 0 (in which case this page must be the current page whose size is less than the desired page size) or;
      • non-zero (in which case this page has a fixed size because it is not the current page).

      We use the special 0 page size to avoid having to update the page size after every txn insert.

  4. we maintain a cached copy of the N most recent (i.e. greatest) page token values in the DB client---let the greatest page token value be referred to as the current page;

  5. we maintain a cached copy of the current page size (i.e., the number of txns that are >= the greatest page token value) in the DB client.

Since a transaction key's prefix is a timestamp, assuming distributed clocks are mostly synchronized, we will usually have inserts into the txn column in sorted order.
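For illustration, the data above might be declared as (all names are stand-ins for the real schema; plain dicts model the sorted columns):

```python
# Two sorted columns, modelled here by plain dicts:
txns = {}    # txn column: key = timestamp + hash, value = txn record
pages = {}   # pages column: key = page token (a txn key), value = page size
             # (size 0 marks the current, still-filling page)

# Client-side caches:
CACHE_DEPTH = 8        # N, how many recent page tokens to keep cached
recent_tokens = []     # the N greatest page tokens, sorted ascending
current_page_size = 0  # number of txns >= the greatest page token
```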

Page Maintenance Algorithm


When we insert the first txn into the DB, we also:

  • insert that txn key into the page token column and;
  • insert that txn key into the greatest page token cache (it will become the current page token) with size 0.

For subsequent inserts into the txn column, we check if they are greater than the cached current page token:

  • If yes, we increment the cached current page size.

  • If the current page size reaches the desired page size, we also:

    • update the value of the current page key with its actual page size;
    • insert the transaction key into the page column (it must be the highest value and so it will become the new current page) with size 0;
    • reset the cached current page size to 0.
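The append path of this algorithm can be sketched as (in-memory stand-ins; the non-append correction is handled below):

```python
class PageMaintainer:
    """Append-path page maintenance: plain dicts and lists stand in for
    the pages column and the client-side caches."""
    def __init__(self, desired_page_size):
        self.desired = desired_page_size
        self.pages = {}        # page token -> page size (0 = current page)
        self.token_cache = []  # recent page tokens, ascending
        self.current_size = 0  # cached size of the current page

    def insert_txn(self, txn_key):
        if not self.token_cache:
            # First txn: it becomes the current page token with size 0.
            self.pages[txn_key] = 0
            self.token_cache.append(txn_key)
            return
        # Append case only; non-append keys need the correction step below.
        assert txn_key > self.token_cache[-1]
        self.current_size += 1
        if self.current_size >= self.desired:
            # Close out the current page and start a new one at txn_key.
            self.pages[self.token_cache[-1]] = self.current_size
            self.pages[txn_key] = 0
            self.token_cache.append(txn_key)
            self.current_size = 0

# Usage: with a desired page size of 3, seven appends yield two full pages
# plus a new (empty-size-marked) current page.
pm = PageMaintainer(3)
for k in ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]:
    pm.insert_txn(k)
```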

If an inserted txn is not greater than the current page token, we will:

  • walk backwards over our page token cache and find the first page token that is smaller than the inserted txn key;

  • if we find one, we can do one of the following:

    • increment that page's size value by 1 (if we give our pages room to grow, this makes corrections very cheap) or;
    • we shift the value of subsequent page tokens by 1 entry.

Note: Due to txn keys being prefixed with timestamps, it is highly unlikely that we will need to walk far back in the cache to perform a page update.
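A sketch of this correction using the first option (growing the page), with `bisect` standing in for the backwards walk over the cache:

```python
import bisect

def absorb_non_append(token_cache, pages, txn_key, max_page_size):
    """Find the page that txn_key falls into and grow it by one.

    Returns False if the txn predates every cached token or the page is
    already at max_page_size, in which case we fall back to page shifting.
    """
    i = bisect.bisect_right(token_cache, txn_key) - 1
    if i < 0:
        return False                 # older than every cached page token
    token = token_cache[i]
    if pages[token] < max_page_size:
        pages[token] += 1
        return True
    return False                     # page full: shift instead

# Usage: "d" sorts between tokens "b" and "f", so page "b" grows by one.
cache = ["b", "f"]
page_sizes = {"b": 4, "f": 0}
```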

Multiple Page Sizes?


If we want to have multiple page sizes, we can maintain multiple page columns, each with a different desired size. If distributed clocks are mostly synchronized, I believe the cost of maintaining a page index will be amortized O(1).

Page Access API


When we want to access multiple pages from the DB, we can just do a linear scan over the page column. Each page can represent thousands or tens of thousands of entries.
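A sketch of the access path, scanning the page column for a run of page tokens (a dict stands in for the column; the function name is illustrative):

```python
import bisect

def read_page_tokens(pages, start_token, n):
    """Linear scan over the pages column: return up to n consecutive
    (page token, page size) pairs starting at start_token."""
    tokens = sorted(pages)                       # column keys are sorted
    i = bisect.bisect_left(tokens, start_token)  # Seek(start_token)
    return [(t, pages[t]) for t in tokens[i:i + n]]
```

Each returned token can then be handed to clients as a parallel fetch target.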
