Investigated suspected memory leak in metabase.util.malli.registry/cache
atom. Initial reports suggested unbounded memory growth during test runs.
The real issue wasn't a memory leak but fundamentally broken caching for function schemas. With the original schema-cache-key implementation, schemas containing function objects generated unstable cache keys because:
- (fn [x] ...) creates a new function object each time it is evaluated
- Function objects have different identities even when their code is identical
- This caused cache misses for functionally identical schemas
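The instability is easy to demonstrate at the REPL: function equality in Clojure is object identity, so two evaluations of the same fn form never compare equal.

```clojure
;; Each evaluation of an identical fn form yields a distinct object,
;; so any cache key derived from the object itself is unstable:
(= (fn [x] (int? x))
   (fn [x] (int? x)))
;; => false
```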
Cache Miss Reduction
- Before fix: 100+ cache misses when loading the app for common schemas like keyword?
- After fix: 8-13 total misses (from unavoidable race conditions in parallel tests)
Memory pattern: 2GB → manual GC → 800MB (same on both branches)
Conclusion: normal JVM behavior, not a memory leak
The real issue was CPU performance, not memory retention
;; OLD IMPLEMENTATION - BAD (cache grows every time)
[(count @@#'mr/cache)
(mu/defn my-fn :- [:fn (fn [x] (int? x))]
[a :- [:fn (fn [x] (int? x))]
b :- [:fn (fn [x] (int? x))]] 1)
(count @@#'mr/cache)]
;; => [954 1 957] <- grew by 3 entries
;; => [951 1 954] <- grew by 3 entries
;; => [948 1 951] <- grew by 3 entries
[(count @@#'mr/cache)
(do (mu/defn my-fn :- [:fn #_:clj-kondo/ignore (fn [x] (int? x))]
[a :- [:fn #_:clj-kondo/ignore (fn [x] (int? x))]
b :- [:fn #_:clj-kondo/ignore (fn [x] (string? x))]]
1)
(dotimes [_ 100]
(my-fn 1 "")))
(count @@#'mr/cache)]
;; => [1364 nil 1664] <-- each call adds a value !!!!
;; NEW IMPLEMENTATION - GOOD (stable cache)
[(count @@#'mr/cache)
(mu/defn my-fn :- [:fn (mr/with-key (fn [x] (int? x)))] [...] 1)
(count @@#'mr/cache)]
;; => [388 1 388] <- no growth
;; => [388 1 388] <- no growth
Each mu/defn redefinition created 3 cache entries (return type + 2 parameters).
- Enhanced schema-cache-key Function
Problem: only handled regex patterns in the first position and ignored function objects
Solution: added function-object serialization using pr-str + postwalk
Result: stable keys for functionally identical schemas
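A minimal sketch of the idea (not the exact implementation; the real signature may differ): postwalk the schema and replace unstable values with printable stand-ins, preferring an explicit key from metadata when one is attached.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical sketch of the enhanced key function.
(defn schema-cache-key [schema]
  (walk/postwalk
   (fn [x]
     (cond
       ;; prefer an explicit key attached via mr/with-key, else pr-str
       (fn? x)
       (or (:metabase.util.malli.registry/key (meta x)) (pr-str x))

       ;; regex patterns lack value equality too; use their string form
       (instance? java.util.regex.Pattern x)
       (str x)

       :else x))
   schema))
```

Note that pr-str of a bare fn object still embeds its identity hash, so wrapping with mr/with-key remains necessary for truly stable keys.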
- Cache Validation Utilities
;; Ensures schemas generate stable cache keys
(defmacro stable-key? [schema]
`(= (schema-cache-key ~schema) (schema-cache-key ~schema)))
;; Provides explicit cache keys for problematic schemas
(defmacro with-key [body]
`(with-meta ~body {::key ~(pr-str body)}))
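Usage sketch for with-key: because the key is computed from the source form at macroexpansion time with pr-str, re-evaluating the same form always produces the same key.

```clojure
(def int-check (mr/with-key (fn [x] (int? x))))

(meta int-check)
;; => {:metabase.util.malli.registry/key "(fn [x] (int? x))"}
```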
- Compile-time Schema Validation
Added checks in mr/def to reject unstable schemas
Evaluates the schema twice and throws if the cache keys differ
Forces developers to use cacheable patterns
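A sketch of how the check could slot into mr/def, using the stable-key? macro above; register! is a hypothetical placeholder for the actual registration step.

```clojure
(defmacro def
  "Register `schema` under `type`, rejecting schemas whose cache key
  differs across two evaluations (e.g. bare anonymous fns)."
  [type schema]
  `(do
     (when-not (stable-key? ~schema)
       (throw (ex-info "Schema does not produce a stable cache key; wrap fns with mr/with-key"
                       {:type ~type, :schema '~schema})))
     (register! ~type ~schema)))
```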
- Cache Structure Optimization
Before: {cache-type {schema-key value}}
After: {[cache-type schema-key] value}
Easier to analyze, count, and migrate to real cache libraries
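One payoff of the flat layout: analysis that previously needed nested traversal becomes a one-liner (sketch, assuming the cache atom holds the flat map).

```clojure
;; Entries per cache type with the flat {[cache-type schema-key] value} layout:
(frequencies (map first (keys @cache)))
;; => a map of cache-type -> entry count

;; Total entry count:
(count @cache)
```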
Before the fix:
- Function schemas never cached properly
- Every validation recompiled from scratch
- Constant cache pollution during development
- Slow REPL experience

After the fix:
- Function schemas cached correctly
- Validations reuse compiled schemas
- Stable cache during development
- Significantly faster validation performance
✅ mr/def schemas with functions
✅ Direct mr/explain and mr/validate calls
✅ App startup schema compilation
✅ mu/defn input/output validation (TODO)
- Is it worth imposing mr/with-key on devs?
  - It makes some schemas more annoying to write.
  - For mr/def I already rewrote them because it was easy enough, though it wasn't the root issue.
  - If we take it, where should the guidelines for how to write a cachable schema go?
- Related: how should I fix mu/fn schema handling?
  - Either add the mr/stable-key? check, or try auto-wrapping them. Auto-wrapping can cause cache issues if you pass in something like :- my-fn and then redef it.
  - I am leaning toward adding the check for mu/defn too.
- defendpoint (similar pattern to mu/defn) has the same auto-wrapping cache-issue risk.
- Do you have an idea for good tests to make sure this never breaks again?
- Ship the current fixes - proven performance improvements with no downsides
- Apply similar fixes to mu/defn + defendpoint - likely has same cache pollution issues
- Add more cache tests - instrumentation to catch future regressions
- Use a real cache library: Current flat map could be replaced with TTL/LRU cache
- Benchmarking (could be a test): establish baselines for cache hit rates
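For the "real cache library" item, a sketch using clojure.core.cache (assumed as a dependency; the names cache* and cached, and the threshold value, are illustrative):

```clojure
(require '[clojure.core.cache.wrapped :as cw])

;; LRU-bounded replacement for the flat map.
(defonce cache* (cw/lru-cache-factory {} :threshold 2048))

(defn cached
  "Look up [cache-type schema-key], computing and caching via `make-fn`
  on a miss."
  [cache-type schema-key make-fn]
  (cw/lookup-or-miss cache* [cache-type schema-key] (fn [_k] (make-fn))))
```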
What appeared to be a memory leak was actually a fundamental caching performance issue. The fixes provide significant CPU performance improvements while maintaining the same memory footprint. The investigation revealed that malli's caching was effectively broken for function schemas, which are pervasive in our codebase.