The file racket/collects/racket/private/map.rkt contains optimized implementations of map, for-each, andmap, and ormap. A comment at the top states:
;; #%kernel implements `map', `for-each', `andmap', and `ormap',
;; but the JIT generates faster code, especially for the common cases.
This comment was written for Racket BC (the bytecode/JIT implementation). This benchmark investigates whether these optimizations are still beneficial in Racket CS (Chez Scheme-based Racket).
The benchmarks compare:
- Optimized versions: The implementations from
racket/private/map.rkt(exposed as the standardmap,for-each,andmap,ormap) - Kernel versions: The built-in implementations from
#%kernel(accessed via(require (prefix-in k: '#%kernel)))
Tests were run with:
- List sizes: 10, 100, 1000, 10000 elements
- Single-list and two-list variants
- Auto-calibrated iteration counts to ensure each benchmark runs for at least 10ms
- Three garbage collections before each timing run, plus three more between optimized and kernel measurements
| Function | Optimized (ms) | Kernel (ms) | Iterations | Ratio | Winner |
|---|---|---|---|---|---|
| map (1 list) | 3.90 | 4.83 | 227,556 | 1.24 | optimized |
| map (2 lists) | 4.43 | 4.40 | 146,286 | 0.99 | roughly equal |
| for-each (1 list) | 3.86 | 5.06 | 372,364 | 1.31 | optimized |
| for-each (2 lists) | 4.71 | 5.62 | 186,182 | 1.19 | optimized |
| andmap (1 list) | 4.18 | 6.26 | 455,112 | 1.50 | optimized |
| andmap (2 lists) | 5.08 | 4.41 | 186,182 | 0.87 | kernel |
| ormap (1 list) | 3.12 | 4.24 | 292,572 | 1.36 | optimized |
| ormap (2 lists) | 5.19 | 4.54 | 186,182 | 0.87 | kernel |
| Function | Optimized (ms) | Kernel (ms) | Iterations | Ratio | Winner |
|---|---|---|---|---|---|
| map (1 list) | 7.40 | 10.40 | 48,762 | 1.40 | optimized |
| map (2 lists) | 9.23 | 10.05 | 28,055 | 1.09 | roughly equal |
| for-each (1 list) | 6.87 | 7.13 | 70,621 | 1.04 | roughly equal |
| for-each (2 lists) | 9.32 | 9.40 | 34,421 | 1.01 | roughly equal |
| andmap (1 list) | 6.76 | 6.93 | 61,135 | 1.03 | roughly equal |
| andmap (2 lists) | 7.62 | 7.69 | 27,864 | 1.01 | roughly equal |
| ormap (1 list) | 6.08 | 6.37 | 61,135 | 1.05 | roughly equal |
| ormap (2 lists) | 9.27 | 9.38 | 33,301 | 1.01 | roughly equal |
| Function | Optimized (ms) | Kernel (ms) | Iterations | Ratio | Winner |
|---|---|---|---|---|---|
| map (1 list) | 7.87 | 9.86 | 4,730 | 1.25 | optimized |
| map (2 lists) | 9.17 | 11.81 | 2,711 | 1.29 | optimized |
| for-each (1 list) | 8.42 | 8.82 | 10,215 | 1.05 | roughly equal |
| for-each (2 lists) | 9.76 | 10.02 | 4,036 | 1.03 | roughly equal |
| andmap (1 list) | 8.39 | 9.12 | 10,064 | 1.09 | roughly equal |
| andmap (2 lists) | 9.58 | 10.07 | 3,943 | 1.05 | roughly equal |
| ormap (1 list) | 8.38 | 9.29 | 10,015 | 1.11 | optimized |
| ormap (2 lists) | 9.74 | 10.23 | 4,016 | 1.05 | roughly equal |
| Function | Optimized (ms) | Kernel (ms) | Iterations | Ratio | Winner |
|---|---|---|---|---|---|
| map (1 list) | 8.21 | 9.56 | 379 | 1.16 | optimized |
| map (2 lists) | 9.48 | 10.16 | 221 | 1.07 | roughly equal |
| for-each (1 list) | 9.73 | 10.23 | 1,102 | 1.05 | roughly equal |
| for-each (2 lists) | 9.94 | 10.23 | 369 | 1.03 | roughly equal |
| andmap (1 list) | 9.72 | 10.66 | 1,108 | 1.10 | roughly equal |
| andmap (2 lists) | 9.93 | 10.28 | 368 | 1.04 | roughly equal |
| ormap (1 list) | 9.61 | 10.61 | 1,090 | 1.10 | optimized |
| ormap (2 lists) | 9.87 | 10.49 | 366 | 1.06 | roughly equal |
-
mapbenefits most from the optimization: The optimizedmapis consistently 16-40% faster for single-list operations across all list sizes. This is the clearest win. -
Single-list operations benefit more than two-list: Single-list variants show consistent improvement, while two-list variants are often "roughly equal".
-
Small list two-list variants: kernel wins: For 10-element lists,
andmapandormapwith two lists are actually ~13% faster with the kernel implementation. This suggests the optimized code's two-list fast path has some overhead that hurts small lists. -
Medium/large lists converge: For 100+ element lists, most operations are "roughly equal" except for
map, which maintains its advantage. -
andmapsingle-list with small lists shows 50% improvement: The optimized version is significantly faster for the common case of checking a predicate over a small list.
| Function | Verdict |
|---|---|
map (1 list) |
Keep optimization - 16-40% faster across all sizes |
map (2 lists) |
Keep - 9-29% faster for larger lists |
for-each (1 list) |
Keep - 5-31% faster, especially small lists |
for-each (2 lists) |
Marginal - 1-19% faster |
andmap (1 list) |
Keep - 3-50% faster, especially small lists |
andmap (2 lists) |
Mixed - kernel faster for small lists |
ormap (1 list) |
Keep - 5-36% faster |
ormap (2 lists) |
Mixed - kernel faster for small lists |
The optimization is still beneficial in Racket CS, particularly for:
mapoperations (the most common use case)- Single-list variants of all functions
- Small lists with single-list operations
However, there's a small regression for two-list andmap and ormap with small (10-element) lists, where the kernel is ~13% faster. This might be worth investigating - the two-list fast path may have overhead that doesn't pay off for very small lists.
Overall, the optimizations are worth keeping because map is the most commonly used function, and it shows consistent improvement.
To understand why the optimized version is faster, we can examine the Chez Scheme cp0 optimizer output using PLT_LINKLET_SHOW_CP0=1.
Optimized version (map-opt.rkt):
#lang racket/base
(define lst '(1 2 3 4 5))
(define (test) (map add1 lst))
(test)Kernel version (map-kern.rkt):
#lang racket/base
(require (prefix-in k: '#%kernel))
(define lst '(1 2 3 4 5))
(define (test) (k:map add1 lst))
(test)Optimized version - the map is completely inlined:
(letrec ([test.58 (lambda ()
(letrec ([loop_711.59 (lambda (l_812.60)
(if (#2%null? l_812.60)
'()
(let ([r_913.61 (#2%cdr l_812.60)])
(let ([app_25.62 (#2%add1 (#3%car l_812.60))])
(#2%cons app_25.62
(loop_711.59 r_913.61))))))])
(loop_711.59 '(1 2 3 4 5))))])Kernel version - remains an indirect function call:
(letrec ([test.56 (lambda ()
((#3%$top-level-value '#%map.rkt-rumble.sls-#%map-0)
#2%add1
'(1 2 3 4 5)))])| Aspect | Optimized | Kernel |
|---|---|---|
| Inlining | Completely inlined into a direct loop | Indirect call through $top-level-value |
list? check |
Constant-folded away (literal list is known to be a list) | Check happens at runtime inside callee |
| Function call overhead | None - direct recursive loop | Runtime symbol lookup + function call |
| Loop structure | Explicit letrec loop in generated code |
Hidden inside #%map primitive |
The optimized map in racket/private/map.rkt uses begin-encourage-inline and has an explicit loop structure that the Chez Scheme cp0 optimizer can inline. The key code pattern:
(if (or-unsafe (and (procedure? f)
(procedure-arity-includes? f 1)
(list? l)))
(let loop ([l l])
(cond
[(null? l) null]
[else
(let ([r (cdr l)])
(cons (f (car l)) (loop r)))]))
(gen-map f (list l)))When cp0 sees this with a known list argument, it:
- Constant-folds the
list?check to#t - Inlines the entire
let loopinto the call site - Eliminates the branch to
gen-map
The kernel's map is a primitive defined in Chez Scheme's runtime, so it cannot be inlined across the linklet boundary.
Based on the analysis above, an optimization was added to the schemify optimizer (racket/src/schemify/optimize.rkt) that recognizes calls to the kernel's map when the function argument is known to be a procedure, and transforms them to explicit loops:
;; Optimize (map f lst) to an explicit loop when f is known to be a procedure.
;; This allows cp0 to inline the loop, avoiding the overhead of #%map which
;; uses #%app to handle applicable structs.
[`(map ,f ,lst)
#:guard (hash-ref prim-knowns 'map #f) ; map is the kernel primitive
(define u-f (unwrap f))
(cond
[(or (lambda? f)
(and (symbol? u-f)
(known-procedure? (find-known u-f prim-knowns knowns imports mutated))))
(define loop-id (deterministic-gensym "loop"))
(define l-id (deterministic-gensym "l"))
(define r-id (deterministic-gensym "r"))
`(letrec* ([,loop-id (lambda (,l-id)
(if (null? ,l-id)
null
(let ([,r-id (cdr ,l-id)])
(cons (,f (car ,l-id))
(,loop-id ,r-id)))))])
(,loop-id ,lst))]
[else v])]This optimization:
- Matches
(map f lst)wheremapis the kernel primitive - Checks if
fis known to be a procedure (either a lambda or aknown-procedure?) - Generates an explicit loop that can be inlined by cp0
The result is that kernel map calls now get the same inlining treatment as the optimized map from racket/private/map.rkt.
See the attached map-benchmark2.rkt file for the complete benchmark code.