Last night was my first time doing LLM-based performance engineering, and wow, what fun! I token-maxxed $20 of Codex 5.5, then wrapped up with Opus 4.7, along with Gemini for some code review.
After six hours, I had a 165x improvement to PIXterm! What follows is the performance log we kept.
This comparison used the benchmark suite in pkg/ansimage/ansimage_bench_test.go on an Apple M3 Max (darwin/arm64). The current working tree was on branch nm-ansiimage at d576e05; v1.3.2 was tested in a temporary worktree with the same benchmark and correctness test files copied in.
Benchmark command:
```
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
```

Correctness result:
- Current tree passes go test ./...
- v1.3.2 fails the strengthened suite. The main failures are missing render rows, SetMaxProcs(0/-n) not being clamped, and wrong dimensions/pixels for sub-images with a non-zero Bounds().Min.
Performance result, v1.3.2 -> current:
| Benchmark | v1.3.2 | Current | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 19.84 ms/op | 10.37 ms/op | 1.91x faster |
| Render/NoDithering/GoCode | 20.82 ms/op | 12.42 ms/op | 1.68x faster |
| Render/DitheringWithBlocks/Serial | 2.59 ms/op | 1.35 ms/op | 1.92x faster |
| Render/DitheringWithChars/Serial | 2.55 ms/op | 1.38 ms/op | 1.84x faster |
| Pipeline/NoDithering | 16.79 ms/op | 10.75 ms/op | 1.56x faster |
| Pipeline/DitheringWithBlocks | 5.26 ms/op | 4.31 ms/op | 1.22x faster |
Allocation result, v1.3.2 -> current:
| Benchmark | v1.3.2 | Current |
|---|---|---|
| Render/NoDithering/Serial | 101 MB/op, 91k allocs/op | 52 MB/op, 47k allocs/op |
| Render/DitheringWithBlocks/Serial | 6.0 MB/op, 26k allocs/op | 3.2 MB/op, 14k allocs/op |
| Pipeline/NoDithering | 103 MB/op, 107k allocs/op | 53 MB/op, 62k allocs/op |
Create/scale-only benchmarks are basically unchanged, within normal run-to-run noise. The comparison has an important caveat: v1.3.2 often renders fewer rows because of its correctness bugs, so the current tree is faster on the render-heavy paths while also producing complete output.
Next optimization target: reduce render allocation pressure, especially no-dither rendering. The current no-dither render path still allocates roughly 52 MB/op for a 160x96 benchmark image.
This run used the same benchmark suite and command as above. The module is now set to go 1.25.0, while the active local toolchain reports go1.26.2 darwin/arm64.
Current module versions:
- github.com/disintegration/imaging v1.6.2
- github.com/lucasb-eyer/go-colorful v1.4.0
- golang.org/x/image v0.39.0
- golang.org/x/sys v0.43.0
- golang.org/x/term v0.42.0
- golang.org/x/text v0.36.0
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
```

Benchmark result compared to the previous current-tree baseline:
| Benchmark | Previous | Go 1.25 module + upgraded packages | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 10.37 ms/op | 11.37 ms/op | 9.6% slower |
| Render/NoDithering/Parallel | 14.47 ms/op | 14.46 ms/op | flat |
| Render/NoDithering/GoCode | 12.42 ms/op | 12.31 ms/op | 0.9% faster |
| Render/DitheringWithBlocks/Serial | 1.35 ms/op | 1.38 ms/op | 2.4% slower |
| Render/DitheringWithBlocks/NoBg/Serial | 1.19 ms/op | 1.09 ms/op | 8.6% faster |
| Render/DitheringWithChars/Serial | 1.38 ms/op | 1.40 ms/op | 0.9% slower |
| Pipeline/NoDithering | 10.75 ms/op | 10.43 ms/op | 3.0% faster |
| Pipeline/DitheringWithBlocks | 4.31 ms/op | 4.29 ms/op | flat |
Allocation counts are effectively unchanged on the render-heavy paths:
| Benchmark | Result |
|---|---|
| Render/NoDithering/Serial | 52 MB/op, 46.6k allocs/op |
| Render/DitheringWithBlocks/Serial | 3.2 MB/op, 13.6k allocs/op |
| Pipeline/NoDithering | 53 MB/op, 62.1k allocs/op |
| Pipeline/DitheringWithBlocks | 4.8 MB/op, 77.1k allocs/op |
Takeaway: upgrading the module version and packages does not materially change the performance profile. Render allocation pressure remains the first optimization target.
Change: replaced render-time str += loops with strings.Builder and replaced per-pixel fmt.Sprintf color formatting with strconv.AppendUint-based helpers. Public render output remains byte-for-byte covered by exact ANSI tests.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Render|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median render/pipeline result from the focused 3-run benchmark, compared to the Go 1.25 module + upgraded packages baseline:
| Benchmark | Previous | Builder + append | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 11.37 ms/op | 0.749 ms/op | 15.2x faster |
| Render/NoDithering/Parallel | 14.46 ms/op | 0.332 ms/op | 43.5x faster |
| Render/NoDithering/GoCode | 12.31 ms/op | 0.858 ms/op | 14.4x faster |
| Render/DitheringWithBlocks/Serial | 1.38 ms/op | 0.222 ms/op | 6.2x faster |
| Render/DitheringWithBlocks/Parallel | 1.51 ms/op | 0.110 ms/op | 13.7x faster |
| Render/DitheringWithBlocks/NoBg/Serial | 1.09 ms/op | 0.134 ms/op | 8.1x faster |
| Render/DitheringWithChars/Serial | 1.40 ms/op | 0.229 ms/op | 6.1x faster |
| Render/DitheringWithChars/Parallel | 1.43 ms/op | 0.115 ms/op | 12.5x faster |
| Pipeline/NoDithering | 10.43 ms/op | 1.79 ms/op | 5.8x faster |
| Pipeline/DitheringWithBlocks | 4.29 ms/op | 3.13 ms/op | 1.4x faster |
Allocation result after the change:
| Benchmark | Result |
|---|---|
| Render/NoDithering/Serial | 708 KB/op, 243 allocs/op |
| Render/NoDithering/Parallel | 703 KB/op, 153 allocs/op |
| Render/DitheringWithBlocks/Serial | 170 KB/op, 123 allocs/op |
| Render/DitheringWithBlocks/Parallel | 167 KB/op, 79 allocs/op |
| Pipeline/NoDithering | 1.67 MB/op, 15.8k allocs/op |
| Pipeline/DitheringWithBlocks | 1.67 MB/op, 63.6k allocs/op |
Takeaway: render formatting was the dominant cost. The next likely target is conversion-time allocation and pointer chasing in [][]*ANSIpixel, especially for pipeline benchmarks where creation still contributes most of the remaining allocations.
Change: replaced ANSImage.pixmap [][]*ANSIpixel with a single contiguous []ANSIpixel. This keeps the public ANSIpixel type and behavior intact, including GetAt returning a copy that can still render through its source pointer.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median create/scale/pipeline result from the focused 3-run benchmark:
| Benchmark | Previous | Contiguous pixmap | Change |
|---|---|---|---|
| Create/NoDithering/RGBA_160x96 | 286 us/op | 73 us/op | 3.9x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 321 us/op | 125 us/op | 2.6x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.68 ms/op | 1.67 ms/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 1.71 ms/op | 1.68 ms/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 973 us/op | 724 us/op | 1.3x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 971 us/op | 723 us/op | 1.3x faster |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 2.84 ms/op | 2.80 ms/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 2.79 ms/op | 2.80 ms/op | flat |
| Pipeline/NoDithering | 1.79 ms/op | 1.55 ms/op | 1.2x faster |
| Pipeline/DitheringWithBlocks | 3.13 ms/op | 2.99 ms/op | 1.1x faster |
Allocation result after the change:
| Benchmark | Previous | Contiguous pixmap |
|---|---|---|
| Create/NoDithering/RGBA_160x96 | 384 KB/op, 15.5k allocs/op | 246 KB/op, 2 allocs/op |
| Create/NoDithering/NRGBAComposite_160x96 | 449 KB/op, 15.5k allocs/op | 311 KB/op, 5 allocs/op |
| Create/DitheringWithBlocks/RGBA_320x192 | 294 KB/op, 63.4k allocs/op | 279 KB/op, 61.4k allocs/op |
| ScaleAndCreate/NoDithering/Resize_160x96 | 964 KB/op, 15.5k allocs/op | 826 KB/op, 84 allocs/op |
| Pipeline/NoDithering | 1.67 MB/op, 15.8k allocs/op | 1.53 MB/op, 327 allocs/op |
| Pipeline/DitheringWithBlocks | 1.67 MB/op, 63.6k allocs/op | 1.66 MB/op, 61.6k allocs/op |
Takeaway: the contiguous pixmap removes nearly all output-pixmap allocation overhead. Dither-heavy creation is now dominated by per-source-pixel color conversion, not output storage.
Change: replaced the internal []ANSIpixel pixmap with private packed pixel data containing only R, G, B, and Brightness. Public ANSIpixel values are reconstructed at API boundaries, so GetAt still returns a renderable copy with source and upper populated.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Render|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median benchmark result compared to Contiguous Pixmap Stage 1:
| Benchmark | Stage 1 | Packed pixmap | Change |
|---|---|---|---|
| Create/NoDithering/RGBA_160x96 | 73 us/op | 41 us/op | 1.8x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 125 us/op | 89 us/op | 1.4x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.67 ms/op | 1.65 ms/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 1.68 ms/op | 1.66 ms/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 724 us/op | 676 us/op | 1.1x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 723 us/op | 675 us/op | 1.1x faster |
| Pipeline/NoDithering | 1.55 ms/op | 1.52 ms/op | flat |
| Pipeline/DitheringWithBlocks | 2.99 ms/op | 2.99 ms/op | flat |
| Render/NoDithering/Serial | 0.749 ms/op | 0.798 ms/op | 6.5% slower |
| Render/NoDithering/Parallel | 0.332 ms/op | 0.295 ms/op | 1.1x faster |
| Render/DitheringWithBlocks/Serial | 0.222 ms/op | 0.197 ms/op | 1.1x faster |
| Render/DitheringWithBlocks/Parallel | 0.110 ms/op | 0.089 ms/op | 1.2x faster |
Allocation and memory result after the change:
| Benchmark | Stage 1 | Packed pixmap |
|---|---|---|
| Create/NoDithering/RGBA_160x96 | 246 KB/op, 2 allocs/op | 66 KB/op, 2 allocs/op |
| Create/NoDithering/NRGBAComposite_160x96 | 311 KB/op, 5 allocs/op | 131 KB/op, 5 allocs/op |
| Create/DitheringWithBlocks/RGBA_320x192 | 279 KB/op, 61.4k allocs/op | 254 KB/op, 61.4k allocs/op |
| ScaleAndCreate/NoDithering/Resize_160x96 | 826 KB/op, 84 allocs/op | 646 KB/op, 84 allocs/op |
| Pipeline/NoDithering | 1.53 MB/op, 327 allocs/op | 1.35 MB/op, 327 allocs/op |
| Pipeline/DitheringWithBlocks | 1.66 MB/op, 61.6k allocs/op | 1.63 MB/op, 61.6k allocs/op |
Takeaway: packing the pixmap mainly reduces memory traffic and no-dither creation cost. It does not address dither allocation count, which remains dominated by per-source-pixel color conversion in colorful.MakeColor/Hsv.
Change: removed colorful.MakeColor and HSV conversion from the dither conversion loop. The loop now reads *image.RGBA.Pix directly, computes brightness as max(R,G,B), and uses integer sums over each fixed 8x4 block. Transparent pixels still contribute zero, and partial-alpha RGBA pixels are unpremultiplied before averaging.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median create/scale/pipeline result compared to Packed Pixmap Stage 2:
| Benchmark | Packed pixmap | Direct dither RGB | Change |
|---|---|---|---|
| Create/DitheringWithBlocks/RGBA_320x192 | 1.65 ms/op | 108 us/op | 15.3x faster |
| Create/DitheringWithChars/RGBA_320x192 | 1.66 ms/op | 110 us/op | 15.1x faster |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 2.82 ms/op | 971 us/op | 2.9x faster |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 2.82 ms/op | 969 us/op | 2.9x faster |
| Pipeline/DitheringWithBlocks | 2.99 ms/op | 1.21 ms/op | 2.5x faster |
Allocation result after the change:
| Benchmark | Packed pixmap | Direct dither RGB |
|---|---|---|
| Create/DitheringWithBlocks/RGBA_320x192 | 254 KB/op, 61.4k allocs/op | 8 KB/op, 2 allocs/op |
| Create/DitheringWithChars/RGBA_320x192 | 254 KB/op, 61.4k allocs/op | 8 KB/op, 2 allocs/op |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 1.46 MB/op, 61.5k allocs/op | 1.21 MB/op, 84 allocs/op |
| Pipeline/DitheringWithBlocks | 1.63 MB/op, 61.6k allocs/op | 1.39 MB/op, 207 allocs/op |
Takeaway: dither conversion allocation pressure is effectively gone. Remaining dither pipeline time is now mostly image scaling plus final render string construction.
Change: bypassed public SetAt and RGBAAt in createANSImage. No-dither conversion now reads *image.RGBA.Pix directly and writes packed ansiPixelData directly; dither conversion also assigns output cells directly after block aggregation.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median create/scale/pipeline result compared to Direct Dither RGB Aggregation:
| Benchmark | Previous | Direct pixmap writes | Change |
|---|---|---|---|
| Create/NoDithering/RGBA_160x96 | 41 us/op | 15 us/op | 2.7x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 89 us/op | 64 us/op | 1.4x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 108 us/op | 104 us/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 110 us/op | 105 us/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 676 us/op | 643 us/op | 1.1x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 675 us/op | 649 us/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 971 us/op | 965 us/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 969 us/op | 964 us/op | flat |
| Pipeline/NoDithering | 1.52 ms/op | 1.48 ms/op | flat |
| Pipeline/DitheringWithBlocks | 1.21 ms/op | 1.20 ms/op | flat |
Allocation counts are unchanged; this pass removes method-call and color-access overhead from validated private conversion loops.
Change: added a dedicated maxprocs == 1 render path that writes all rows into one final strings.Builder. The parallel path still renders independent row strings and joins them, but now shares the same row-writing helpers. This avoids per-row string allocations, channel/goroutine overhead, and the final strings.Join for the default serial render path.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImageRender' -benchmem -benchtime=1s -count=3
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImagePipeline' -benchmem -benchtime=1s -count=3
```

Median render result compared to the prior builder/packed-pixmap render baseline:
| Benchmark | Previous | Single builder | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 0.798 ms/op | 0.520 ms/op | 1.5x faster |
| Render/NoDithering/Parallel | 0.295 ms/op | 0.288 ms/op | flat |
| Render/DitheringWithBlocks/Serial | 0.197 ms/op | 0.122 ms/op | 1.6x faster |
| Render/DitheringWithBlocks/Parallel | 0.089 ms/op | 0.091 ms/op | flat |
Serial allocation result compared to the original builder row-join path:
| Benchmark | Previous | Single builder |
|---|---|---|
| Render/NoDithering/Serial | 708 KB/op, 243 allocs/op | 377 KB/op, 2 allocs/op |
| Render/DitheringWithBlocks/Serial | 170 KB/op, 123 allocs/op | 98 KB/op, 2 allocs/op |
Pipeline result compared to Direct Pixmap Writes During Conversion:
| Benchmark | Previous | Single builder | Change |
|---|---|---|---|
| Pipeline/NoDithering | 1.48 ms/op | 1.22 ms/op | 1.2x faster |
| Pipeline/DitheringWithBlocks | 1.20 ms/op | 1.10 ms/op | 1.1x faster |
Takeaway: the default serial render path no longer pays the row string and join tax. Parallel rendering remains faster for these render-only sizes, but serial is now much closer and substantially lighter on allocations, which improves the default pipeline path.
Final summary, v1.3.2 -> current. Render results:

| Benchmark | v1.3.2 | Current | Speedup |
|---|---|---|---|
| Render/NoDithering/Serial | 19.0 ms | 0.193 ms | 98× |
| Render/NoDithering/Parallel | 15.2 ms | 0.114 ms | 134× |
| Render/NoDithering/GoCode/Serial | 19.9 ms | 0.190 ms | 105× |
| Render/DitheringWithBlocks/Serial | 2.49 ms | 0.037 ms | 68× |
| Render/DitheringWithBlocks/Parallel | 1.23 ms | 0.036 ms | 35× |
| Render/DitheringWithBlocks/NoBg/Serial | 1.95 ms | 0.037 ms | 53× |
| Render/DitheringWithChars/Serial | 2.51 ms | 0.039 ms | 64× |
| Render/DitheringWithChars/Parallel | 1.24 ms | 0.036 ms | 35× |
Pipeline results:

| Benchmark | v1.3.2 | Current | Speedup |
|---|---|---|---|
| Pipeline/NoDithering_160x96 | 16.2 ms | 0.663 ms | 24.5× |
| Pipeline/DitheringWithBlocks_320x192 | 5.25 ms | 0.836 ms | 6.3× |
Allocation results:

| Benchmark | v1.3.2 bytes/op | Current bytes/op | v1.3.2 allocs/op | Current allocs/op |
|---|---|---|---|---|
| Render/NoDithering/Serial | 101 MB | 377 KB | 91,518 | 1 |
| Render/NoDithering/Parallel | 54 MB | 385 KB | 49,196 | 100 |
| Render/NoDithering/GoCode/Serial | 118 MB | 377 KB | 91,688 | 1 |
| Render/DitheringWithBlocks/Serial | 5.98 MB | 98 KB | 25,554 | 1 |
| Render/DitheringWithBlocks/Parallel | 3.22 MB | 102 KB | 13,648 | 52 |
| Render/DitheringWithBlocks/NoBg/Serial | 3.72 MB | 98 KB | 21,920 | 1 |
| Render/DitheringWithChars/Serial | 5.63 MB | 98 KB | 25,553 | 1 |
| Render/DitheringWithChars/Parallel | 3.04 MB | 102 KB | 13,648 | 52 |
| Pipeline/NoDithering_160x96 | 103 MB | 932 KB | 106,844 | 85 |
| Pipeline/DitheringWithBlocks_320x192 | 7.58 MB | 1.23 MB | 89,031 | 85 |
Headline numbers
- Render path: 35–134× faster.
- Full pipeline (load → scale → render): 6–25× faster.
- Allocation count on serial render: 91,518 → 1.
- Memory pressure: ~270× reduction on no-dither serial render.
Caveats
- v1.3.2 has the line-count, sub-image, and SetMaxProcs(0) correctness bugs documented in the perf notes. Its render benchmarks are doing slightly less work than current (one fewer row in some cases), so the quality-adjusted speedup is marginally larger than these numbers show.
- The render path's B/op doesn't quite go down at the same rate as allocs/op because the buffer is sized for upper-bound output. Actual terminal-output bytes go down too (the per-row SGR dedup shrinks dither output by ~50%), but the bench reports allocation size, not content size.
- Pipeline numbers are now dominated by imaging.Resize Lanczos (~60-80% of pipeline time). That's the next target if you ever want to push pipeline below 200 µs.