Investigated suspected memory leak in metabase.util.malli.registry/cache
atom. Initial reports suggested unbounded memory growth during test runs.
The real issue wasn't a memory leak but fundamentally broken caching for function schemas. With the original schema-cache-key implementation, schemas containing function objects generated unstable cache keys because:
- (fn [x] ...) creates a new function object each time it is evaluated
- Function objects have different identities even when their code is identical
- This caused cache misses for functionally identical schemas
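The instability is easy to demonstrate at the REPL: function equality in Clojure is object identity, so two evaluations of the same fn form never compare equal.

```clojure
;; Each evaluation of an identical fn form yields a distinct object,
;; so any cache key derived from the object itself is unstable:
(= (fn [x] (int? x))
   (fn [x] (int? x)))
;; => false
```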
Cache Miss Reduction
- Before fix: 100+ cache misses when loading the app for common schemas like keyword?
- After fix: 8-13 total misses (from unavoidable race conditions in parallel tests)
Memory pattern: 2GB → manual GC → 800MB (same on both branches)
Conclusion: normal JVM behavior, not a memory leak
The real issue was CPU performance, not memory retention
;; OLD IMPLEMENTATION - BAD (cache grows every time)
[(count @@#'mr/cache)
(mu/defn my-fn :- [:fn (fn [x] (int? x))]
[a :- [:fn (fn [x] (int? x))]
b :- [:fn (fn [x] (int? x))]] 1)
(count @@#'mr/cache)]
;; => [954 1 957] <- grew by 3 entries
;; => [951 1 954] <- grew by 3 entries
;; => [948 1 951] <- grew by 3 entries
[(count @@#'mr/cache)
(do (mu/defn my-fn :- [:fn #_:clj-kondo/ignore (fn [x] (int? x))]
[a :- [:fn #_:clj-kondo/ignore (fn [x] (int? x))]
b :- [:fn #_:clj-kondo/ignore (fn [x] (string? x))]]
1)
(dotimes [_ 100]
(my-fn 1 "")))
(count @@#'mr/cache)]
;; => [1364 nil 1664] <-- each call adds a value !!!!
;; NEW IMPLEMENTATION - GOOD (stable cache)
[(count @@#'mr/cache)
(mu/defn my-fn :- [:fn (mr/with-key (fn [x] (int? x)))] [...] 1)
(count @@#'mr/cache)]
;; => [388 1 388] <- no growth
;; => [388 1 388] <- no growth
Each mu/defn redefinition created 3 cache entries (return type + 2 parameters).
- Enhanced schema-cache-key Function
Problem: only handled regex patterns in the first position and ignored function objects
Solution: added function-object serialization using pr-str + postwalk
Result: stable keys for functionally identical schemas
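A minimal sketch of the idea (not the exact implementation; the real signature may differ): postwalk the schema and replace unstable values with printable stand-ins, preferring an explicit key from metadata when one is attached.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical sketch of the enhanced key function.
(defn schema-cache-key [schema]
  (walk/postwalk
   (fn [x]
     (cond
       ;; prefer an explicit key attached via mr/with-key, else pr-str
       (fn? x)
       (or (:metabase.util.malli.registry/key (meta x)) (pr-str x))

       ;; regex patterns lack value equality too; use their string form
       (instance? java.util.regex.Pattern x)
       (str x)

       :else x))
   schema))
```

Note that pr-str of a bare fn object still embeds its identity hash, so wrapping with mr/with-key remains necessary for truly stable keys.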
- Cache Validation Utilities
;; Ensures schemas generate stable cache keys
(defmacro stable-key? [schema]
`(= (schema-cache-key ~schema) (schema-cache-key ~schema)))
;; Provides explicit cache keys for problematic schemas
(defmacro with-key [body]
`(with-meta ~body {::key ~(pr-str body)}))
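Usage sketch for with-key: because the key is computed from the source form at macroexpansion time with pr-str, re-evaluating the same form always produces the same key.

```clojure
(def int-check (mr/with-key (fn [x] (int? x))))

(meta int-check)
;; => {:metabase.util.malli.registry/key "(fn [x] (int? x))"}
```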
- Compile-time Schema Validation
Added checks in mr/def to reject unstable schemas
Evaluates the schema twice and throws if the cache keys differ
Forces developers to use cacheable patterns
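A sketch of how the check could slot into mr/def, using the stable-key? macro above; register! is a hypothetical placeholder for the actual registration step.

```clojure
(defmacro def
  "Register `schema` under `type`, rejecting schemas whose cache key
  differs across two evaluations (e.g. bare anonymous fns)."
  [type schema]
  `(do
     (when-not (stable-key? ~schema)
       (throw (ex-info "Schema does not produce a stable cache key; wrap fns with mr/with-key"
                       {:type ~type, :schema '~schema})))
     (register! ~type ~schema)))
```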
- Cache Structure Optimization
Before: {cache-type {schema-key value}}
After: {[cache-type schema-key] value}
Easier to analyze, count, and migrate to real cache libraries
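One payoff of the flat layout: analysis that previously needed nested traversal becomes a one-liner (sketch, assuming the cache atom holds the flat map).

```clojure
;; Entries per cache type with the flat {[cache-type schema-key] value} layout:
(frequencies (map first (keys @cache)))
;; => a map of cache-type -> entry count

;; Total entry count:
(count @cache)
```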
Before the fix:
- Function schemas never cached properly
- Every validation recompiled from scratch
- Constant cache pollution during development
- Slow REPL experience

After the fix:
- Function schemas cached correctly
- Validations reuse compiled schemas
- Stable cache during development
- Significantly faster validation performance
✅ mr/def schemas with functions
✅ Direct mr/explain and mr/validate calls
✅ App startup schema compilation
✅ mu/defn input/output validation (TODO)
- Is it worth imposing mr/with-key on devs?
  - It makes some schemas more annoying to write.
  - For mr/def I already rewrote them because it was easy enough, though it wasn't the root issue.
  - If we take it, where should the guidelines for how to write a cachable schema go?
- Related: how should I fix mu/fn schema handling?
  - Either add the mr/stable-key? check, or try auto-wrapping them. Auto-wrapping can cause cache issues if you pass in something like :- my-fn and then redef it.
  - I am leaning toward adding the check for mu/defn too.
- defendpoint (similar pattern to mu/defn) has the same auto-wrapping cache-issue risk.
- Do you have an idea for good tests to make sure this never breaks again?
- Ship the current fixes - proven performance improvements with no downsides
- Apply similar fixes to mu/defn + defendpoint - likely has same cache pollution issues
- Add more cache tests - instrumentation to catch future regressions
- Use a real cache library: Current flat map could be replaced with TTL/LRU cache
- Benchmarking (could be a test): establish baselines for cache hit rates
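For the "real cache library" item, a sketch using clojure.core.cache (assumed as a dependency; the names cache* and cached, and the threshold value, are illustrative):

```clojure
(require '[clojure.core.cache.wrapped :as cw])

;; LRU-bounded replacement for the flat map.
(defonce cache* (cw/lru-cache-factory {} :threshold 2048))

(defn cached
  "Look up [cache-type schema-key], computing and caching via `make-fn`
  on a miss."
  [cache-type schema-key make-fn]
  (cw/lookup-or-miss cache* [cache-type schema-key] (fn [_k] (make-fn))))
```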
What appeared to be a memory leak was actually a fundamental caching performance issue. The fixes provide significant CPU performance improvements while maintaining the same memory footprint. The investigation revealed that malli's caching was effectively broken for function schemas, which are pervasive in our codebase.