Goal: de-risk the head-row + S3-log design with the smallest set of experiments that could actually change it, and decide the schema-growth strategy from data rather than assumption.
- Two kinds of result, never mixed. Shape / correctness / relative results (HOT-update ratio, bloat trend, CAS correctness, MinIO-vs-SeaweedFS comparison) are valid in containers on modest hardware. Absolute throughput/latency needs a production-representative instance and same-region S3; a laptop number there is fiction. Every chart is labelled which it is.
- Earn the complexity. Start from the simplest viable schema (one unpartitioned table, delete-on-terminal, single primary). Partitioning, forward-migration, and sharding are each gated behind a specific metric that fails without them. If the simple thing holds at target scale, we ship the simple thing.
- Representative workload. Synthesize the real shape: per-run transition counts of 5 to 20+, a create/transition/complete mix derived from ~1B runs/month, and a "sleeper" cohort (a few percent of runs left open for the whole window) to model the
wait()-for-weeks runs. - Reproducible. Pinned versions, scripted harness, fixed seeds, warmup then a steady-state measurement window, report p50/p99/p999 (not means), repeat trials, capture raw data.
What we're proving: the head row stays small, updates stay HOT, and growth is manageable with the least machinery.
- Setup: single
run_headtable, PK onrun_id, no partitioning, delete-on-terminal. Tunefillfactor. Postgres pinned (e.g. 16). - Workload: drive create (INSERT) / transition (CAS UPDATE) / complete (DELETE) at steady state for a multi-hour window, at active-run counts of 1e5, 1e6, 1e7. Hold a sleeper cohort open.
- Measure: HOT-update ratio (
n_tup_hot_updvsn_tup_upd), table + index bloat over time (pgstattuple), autovacuum keeping up (dead tuples, autovacuum runs/duration), per-op p50/p99, steady-state table/index size vs active-run count. - Pass: HOT ratio near 100%, bloat flat, size tracks the active set not cumulative history. If it passes, partitioning is unnecessary and we stop here (simplest wins). If autovacuum can't keep up or bloat climbs, escalate to A1.
- Variables: active-run count,
fillfactor(70/80/90/100, to show HOT sensitivity), sleeper %.
- Setup: range-partition
run_headbycreated_at; retention viaDETACH PARTITION CONCURRENTLY+DROP. - Compare: partition-drop reclamation vs bulk
DELETE+ autovacuum (reclaim time, lock held, bloat after, insert p99 during retention). - Sleeper handling: forward-migrate the open rows out of the oldest partition, then drop it; measure how many rows move and the cost.
- Pass: drop is sub-second and lock-light; insert p99 stays flat during retention; no vacuum storm. This is the "real solution for table growth" only if A0 says we need it.
- Setup: N concurrent writers racing the same run through the head-row CAS.
- Assert: exactly one writer wins each step,
seqis gap-free and never forks, even with induced redlock expiry (kill/expire the lock mid-transition) to prove the CAS is a genuine backstop, not redundant with the lock.
Tooling: custom load generator or pgbench custom scripts; pg_stat_user_tables, pgstattuple, pg_stat_activity, auto_explain.
What we're proving: the body-log write path scales, the batching tradeoff is real, and the cold read path is acceptable.
- B1 (write latency/throughput): PUT small objects (1/4/16/64 KB) across many run prefixes at increasing concurrency. Measure PUT p50/p99/p999, throughput, and 503/SlowDown throttle rate.
- B2 (batching tradeoff): vary batch factor (1, 10, 20, 50 snapshots/object). Measure PUT request count, extrapolated monthly cost, and the buffering latency added to the write path.
- B3 (cold reads): "all states" via prefix
ListObjectsV2, and "since N" viastart-after, at per-run object counts of 10/100/1000/5000. Measure LIST pages, total latency, and GET fan-out. This locates where the cold path would get expensive for very long runs. - B4 (consistency): write-then-immediately list/get to confirm read-after-write and list-after-write on each backend.
- Variables: object size, concurrency, prefix fan-out, batch factor, per-run object count.
What we're proving: which self-hostable backend fits this workload (many small objects, high write rate, prefix listing), plus self-host correctness.
- Setup: run the B1/B3/B4 harness against AWS S3 (managed baseline, same-region if possible), MinIO (container), and SeaweedFS (container). Equalize hardware for the two self-hosted backends; treat S3 as a baseline for behaviour, not an absolute comparison against laptop containers.
- Hypothesis to test: SeaweedFS, built for large numbers of small files, may beat MinIO on small-object PUT throughput and on LIST latency at high object counts. Verify rather than assume. Note MinIO CE's maintenance-only status as an operational risk regardless of the numbers.
- Capture: small-object PUT throughput, LIST latency as object count grows, read-after-write and list-after-write correctness, and operational notes (deploy complexity, memory, consistency caveats).
- Caveat: container-on-one-box numbers are relative only; absolute figures need representative nodes and disks.
The short A0 runs proved the steady-state shape (HOT-heavy, heap flat, autovacuum sawtooth). They do not prove the long-horizon behaviours that actually decide whether the simple schema survives at Trigger's scale. This is the test that settles the partitioning-vs-simple call I deferred.
What it must run: hours-to-days of sustained churn at a representative per-table transition rate and active-set size (1M-10M). Decouple from the laptop's fsync ceiling (use synchronous_commit=off or batched commits, plus higher concurrency) so the rate is representative even if the hardware isn't, bloat/vacuum dynamics extrapolate on rate-vs-vacuum-throughput, not total data.
Hypotheses / what to watch over time:
- Autovacuum keeps up. Dead-tuple % stays in a bounded band over hours, doesn't slowly creep. Sweep
autovacuum_vacuum_scale_factor, cost limits, naptime, and per-table settings. - PK-index bloat trajectory. The one warning sign from A0 (0.4→0.9 MB in 120s). Does it plateau or grow until
REINDEX CONCURRENTLYis needed, and at what cadence? Heap reuse is proven; the btree is the open question. - Freeze / wraparound. On a table taking billions of writes/month, characterise autovacuum freeze behaviour and
autovacuum_freeze_max_age. - Sleeper xmin holdback. Confirm a weeks-open sleeper run does not pin dead tuples (each transition is its own short transaction, so it shouldn't, but prove it; also test a genuinely long-open transaction elsewhere blocking vacuum).
- fillfactor sweep at the long horizon, since the HOT ratio interacts with how fast pages fill between vacuums.
Decision: if dead% and index size plateau under representative rate, the simple unpartitioned schema is vindicated for real. If autovacuum falls behind or index bloat forces frequent reindexing, partitioning (A1) earns its complexity back because a partition drop reclaims the index for free.
- Phase 0: A0 + A2. Cheapest and most decisive: does the simple schema hold, and is the CAS correct? Outcome gates whether we ever build partitioning.
- Phase 1: B1-B4 against S3, plus C (MinIO/SeaweedFS comparison). Characterizes the body path and picks the self-host backend.
- Phase 2: A1, only if Phase 0 shows a growth/bloat problem. Justifies partitioning with data.
- Phase 3: absolute numbers on a production-representative Postgres instance and same-region S3, replaying real transition traces. Confirms headroom and the shard threshold.
Each phase ends with a one-page report: the metric tables, the decision (does this change the design?), and a go/no-go.
- A0 flat and HOT-heavy → drop partitioning from the design entirely (simpler).
- A0 bloat/vacuum problems → partitioning is justified; A1 picks drop-vs-delete and sleeper handling.
- B2 shows batching latency is intolerable → bodies stay one-per-snapshot and we lean on Express/cheaper backend instead.
- B3 shows cold "since N" is slow for long runs → consider a small per-run index or capping cold scans.
- C shows SeaweedFS clearly wins for small objects → it becomes the recommended self-host default over MinIO.
Phase 0: ~1 to 2 days (harness + runs + report). Phase 1 + C: ~2 to 3 days. Phase 2: ~1 day if triggered. Phase 3: depends on provisioning a representative environment.