Last night was my first time doing LLM-based performance engineering, and wow, what fun! I token-maxxed $20 of Codex 5.5, then wrapped up with Opus 4.7, along with Gemini for some code review.
After six hours, I had a 165x improvement to PIXterm! What follows is the performance log we kept.
This comparison used the benchmark suite in pkg/ansimage/ansimage_bench_test.go on an Apple M3 Max (darwin/arm64). The current working tree was on branch nm-ansiimage at d576e05; v1.3.2 was tested in a temporary worktree with the same benchmark and correctness test files copied in.
Benchmark command:
```
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
```

Correctness result:
- Current tree passes go test ./...
- v1.3.2 fails the strengthened suite. The main failures are missing render rows, SetMaxProcs(0/-n) not being clamped, and wrong dimensions/pixels for sub-images with a non-zero Bounds().Min.
Performance result, v1.3.2 -> current:
| Benchmark | v1.3.2 | Current | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 19.84 ms/op | 10.37 ms/op | 1.91x faster |
| Render/NoDithering/GoCode | 20.82 ms/op | 12.42 ms/op | 1.68x faster |
| Render/DitheringWithBlocks/Serial | 2.59 ms/op | 1.35 ms/op | 1.92x faster |
| Render/DitheringWithChars/Serial | 2.55 ms/op | 1.38 ms/op | 1.84x faster |
| Pipeline/NoDithering | 16.79 ms/op | 10.75 ms/op | 1.56x faster |
| Pipeline/DitheringWithBlocks | 5.26 ms/op | 4.31 ms/op | 1.22x faster |
Allocation result, v1.3.2 -> current:
| Benchmark | v1.3.2 | Current |
|---|---|---|
| Render/NoDithering/Serial | 101 MB/op, 91k allocs/op | 52 MB/op, 47k allocs/op |
| Render/DitheringWithBlocks/Serial | 6.0 MB/op, 26k allocs/op | 3.2 MB/op, 14k allocs/op |
| Pipeline/NoDithering | 103 MB/op, 107k allocs/op | 53 MB/op, 62k allocs/op |
Create/scale-only benchmarks are basically unchanged, within normal run-to-run noise. The comparison has an important caveat: v1.3.2 often renders fewer rows because of its correctness bugs, so the current tree is faster on the render-heavy paths while also producing complete output.
Next optimization target: reduce render allocation pressure, especially no-dither rendering. The current no-dither render path still allocates roughly 52 MB/op for a 160x96 benchmark image.
This run used the same benchmark suite and command as above. The module is now set to go 1.25.0, while the active local toolchain reports go1.26.2 darwin/arm64.
Current module versions:
- github.com/disintegration/imaging v1.6.2
- github.com/lucasb-eyer/go-colorful v1.4.0
- golang.org/x/image v0.39.0
- golang.org/x/sys v0.43.0
- golang.org/x/term v0.42.0
- golang.org/x/text v0.36.0
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
```

Benchmark result compared to the previous current-tree baseline:
| Benchmark | Previous | Go 1.25 module + upgraded packages | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 10.37 ms/op | 11.37 ms/op | 9.6% slower |
| Render/NoDithering/Parallel | 14.47 ms/op | 14.46 ms/op | flat |
| Render/NoDithering/GoCode | 12.42 ms/op | 12.31 ms/op | 0.9% faster |
| Render/DitheringWithBlocks/Serial | 1.35 ms/op | 1.38 ms/op | 2.4% slower |
| Render/DitheringWithBlocks/NoBg/Serial | 1.19 ms/op | 1.09 ms/op | 8.6% faster |
| Render/DitheringWithChars/Serial | 1.38 ms/op | 1.40 ms/op | 0.9% slower |
| Pipeline/NoDithering | 10.75 ms/op | 10.43 ms/op | 3.0% faster |
| Pipeline/DitheringWithBlocks | 4.31 ms/op | 4.29 ms/op | flat |
Allocation counts are effectively unchanged on the render-heavy paths:
| Benchmark | Result |
|---|---|
| Render/NoDithering/Serial | 52 MB/op, 46.6k allocs/op |
| Render/DitheringWithBlocks/Serial | 3.2 MB/op, 13.6k allocs/op |
| Pipeline/NoDithering | 53 MB/op, 62.1k allocs/op |
| Pipeline/DitheringWithBlocks | 4.8 MB/op, 77.1k allocs/op |
Takeaway: upgrading the module version and packages does not materially change the performance profile. Render allocation pressure remains the first optimization target.
Change: replaced render-time str += loops with strings.Builder and replaced per-pixel fmt.Sprintf color formatting with strconv.AppendUint-based helpers. Public render output remains byte-for-byte covered by exact ANSI tests.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Render|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median render/pipeline result from the focused 3-run benchmark, compared to the Go 1.25 module + upgraded packages baseline:
| Benchmark | Previous | Builder + append | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 11.37 ms/op | 0.749 ms/op | 15.2x faster |
| Render/NoDithering/Parallel | 14.46 ms/op | 0.332 ms/op | 43.5x faster |
| Render/NoDithering/GoCode | 12.31 ms/op | 0.858 ms/op | 14.4x faster |
| Render/DitheringWithBlocks/Serial | 1.38 ms/op | 0.222 ms/op | 6.2x faster |
| Render/DitheringWithBlocks/Parallel | 1.51 ms/op | 0.110 ms/op | 13.7x faster |
| Render/DitheringWithBlocks/NoBg/Serial | 1.09 ms/op | 0.134 ms/op | 8.1x faster |
| Render/DitheringWithChars/Serial | 1.40 ms/op | 0.229 ms/op | 6.1x faster |
| Render/DitheringWithChars/Parallel | 1.43 ms/op | 0.115 ms/op | 12.5x faster |
| Pipeline/NoDithering | 10.43 ms/op | 1.79 ms/op | 5.8x faster |
| Pipeline/DitheringWithBlocks | 4.29 ms/op | 3.13 ms/op | 1.4x faster |
Allocation result after the change:
| Benchmark | Result |
|---|---|
| Render/NoDithering/Serial | 708 KB/op, 243 allocs/op |
| Render/NoDithering/Parallel | 703 KB/op, 153 allocs/op |
| Render/DitheringWithBlocks/Serial | 170 KB/op, 123 allocs/op |
| Render/DitheringWithBlocks/Parallel | 167 KB/op, 79 allocs/op |
| Pipeline/NoDithering | 1.67 MB/op, 15.8k allocs/op |
| Pipeline/DitheringWithBlocks | 1.67 MB/op, 63.6k allocs/op |
Takeaway: render formatting was the dominant cost. The next likely target is conversion-time allocation and pointer chasing in [][]*ANSIpixel, especially for pipeline benchmarks where creation still contributes most of the remaining allocations.
Change: replaced ANSImage.pixmap [][]*ANSIpixel with a single contiguous []ANSIpixel. This keeps the public ANSIpixel type and behavior intact, including GetAt returning a copy that can still render through its source pointer.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median create/scale/pipeline result from the focused 3-run benchmark:
| Benchmark | Previous | Contiguous pixmap | Change |
|---|---|---|---|
| Create/NoDithering/RGBA_160x96 | 286 us/op | 73 us/op | 3.9x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 321 us/op | 125 us/op | 2.6x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.68 ms/op | 1.67 ms/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 1.71 ms/op | 1.68 ms/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 973 us/op | 724 us/op | 1.3x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 971 us/op | 723 us/op | 1.3x faster |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 2.84 ms/op | 2.80 ms/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 2.79 ms/op | 2.80 ms/op | flat |
| Pipeline/NoDithering | 1.79 ms/op | 1.55 ms/op | 1.2x faster |
| Pipeline/DitheringWithBlocks | 3.13 ms/op | 2.99 ms/op | 1.1x faster |
Allocation result after the change:
| Benchmark | Previous | Contiguous pixmap |
|---|---|---|
| Create/NoDithering/RGBA_160x96 | 384 KB/op, 15.5k allocs/op | 246 KB/op, 2 allocs/op |
| Create/NoDithering/NRGBAComposite_160x96 | 449 KB/op, 15.5k allocs/op | 311 KB/op, 5 allocs/op |
| Create/DitheringWithBlocks/RGBA_320x192 | 294 KB/op, 63.4k allocs/op | 279 KB/op, 61.4k allocs/op |
| ScaleAndCreate/NoDithering/Resize_160x96 | 964 KB/op, 15.5k allocs/op | 826 KB/op, 84 allocs/op |
| Pipeline/NoDithering | 1.67 MB/op, 15.8k allocs/op | 1.53 MB/op, 327 allocs/op |
| Pipeline/DitheringWithBlocks | 1.67 MB/op, 63.6k allocs/op | 1.66 MB/op, 61.6k allocs/op |
Takeaway: the contiguous pixmap removes nearly all output-pixmap allocation overhead. Dither-heavy creation is now dominated by per-source-pixel color conversion, not output storage.
Change: replaced the internal []ANSIpixel pixmap with private packed pixel data containing only R, G, B, and Brightness. Public ANSIpixel values are reconstructed at API boundaries, so GetAt still returns a renderable copy with source and upper populated.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Render|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median benchmark result compared to Contiguous Pixmap Stage 1:
| Benchmark | Stage 1 | Packed pixmap | Change |
|---|---|---|---|
| Create/NoDithering/RGBA_160x96 | 73 us/op | 41 us/op | 1.8x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 125 us/op | 89 us/op | 1.4x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.67 ms/op | 1.65 ms/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 1.68 ms/op | 1.66 ms/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 724 us/op | 676 us/op | 1.1x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 723 us/op | 675 us/op | 1.1x faster |
| Pipeline/NoDithering | 1.55 ms/op | 1.52 ms/op | flat |
| Pipeline/DitheringWithBlocks | 2.99 ms/op | 2.99 ms/op | flat |
| Render/NoDithering/Serial | 0.749 ms/op | 0.798 ms/op | 6.5% slower |
| Render/NoDithering/Parallel | 0.332 ms/op | 0.295 ms/op | 1.1x faster |
| Render/DitheringWithBlocks/Serial | 0.222 ms/op | 0.197 ms/op | 1.1x faster |
| Render/DitheringWithBlocks/Parallel | 0.110 ms/op | 0.089 ms/op | 1.2x faster |
Allocation and memory result after the change:
| Benchmark | Stage 1 | Packed pixmap |
|---|---|---|
| Create/NoDithering/RGBA_160x96 | 246 KB/op, 2 allocs/op | 66 KB/op, 2 allocs/op |
| Create/NoDithering/NRGBAComposite_160x96 | 311 KB/op, 5 allocs/op | 131 KB/op, 5 allocs/op |
| Create/DitheringWithBlocks/RGBA_320x192 | 279 KB/op, 61.4k allocs/op | 254 KB/op, 61.4k allocs/op |
| ScaleAndCreate/NoDithering/Resize_160x96 | 826 KB/op, 84 allocs/op | 646 KB/op, 84 allocs/op |
| Pipeline/NoDithering | 1.53 MB/op, 327 allocs/op | 1.35 MB/op, 327 allocs/op |
| Pipeline/DitheringWithBlocks | 1.66 MB/op, 61.6k allocs/op | 1.63 MB/op, 61.6k allocs/op |
Takeaway: packing the pixmap mainly reduces memory traffic and no-dither creation cost. It does not address dither allocation count, which remains dominated by per-source-pixel color conversion in colorful.MakeColor/Hsv.
Change: removed colorful.MakeColor and HSV conversion from the dither conversion loop. The loop now reads *image.RGBA.Pix directly, computes brightness as max(R,G,B), and uses integer sums over each fixed 8x4 block. Transparent pixels still contribute zero, and partial-alpha RGBA pixels are unpremultiplied before averaging.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median create/scale/pipeline result compared to Packed Pixmap Stage 2:
| Benchmark | Packed pixmap | Direct dither RGB | Change |
|---|---|---|---|
| Create/DitheringWithBlocks/RGBA_320x192 | 1.65 ms/op | 108 us/op | 15.3x faster |
| Create/DitheringWithChars/RGBA_320x192 | 1.66 ms/op | 110 us/op | 15.1x faster |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 2.82 ms/op | 971 us/op | 2.9x faster |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 2.82 ms/op | 969 us/op | 2.9x faster |
| Pipeline/DitheringWithBlocks | 2.99 ms/op | 1.21 ms/op | 2.5x faster |
Allocation result after the change:
| Benchmark | Packed pixmap | Direct dither RGB |
|---|---|---|
| Create/DitheringWithBlocks/RGBA_320x192 | 254 KB/op, 61.4k allocs/op | 8 KB/op, 2 allocs/op |
| Create/DitheringWithChars/RGBA_320x192 | 254 KB/op, 61.4k allocs/op | 8 KB/op, 2 allocs/op |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 1.46 MB/op, 61.5k allocs/op | 1.21 MB/op, 84 allocs/op |
| Pipeline/DitheringWithBlocks | 1.63 MB/op, 61.6k allocs/op | 1.39 MB/op, 207 allocs/op |
Takeaway: dither conversion allocation pressure is effectively gone. Remaining dither pipeline time is now mostly image scaling plus final render string construction.
Change: bypassed public SetAt and RGBAAt in createANSImage. No-dither conversion now reads *image.RGBA.Pix directly and writes packed ansiPixelData directly; dither conversion also assigns output cells directly after block aggregation.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3
```

Median create/scale/pipeline result compared to Direct Dither RGB Aggregation:
| Benchmark | Previous | Direct pixmap writes | Change |
|---|---|---|---|
| Create/NoDithering/RGBA_160x96 | 41 us/op | 15 us/op | 2.7x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 89 us/op | 64 us/op | 1.4x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 108 us/op | 104 us/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 110 us/op | 105 us/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 676 us/op | 643 us/op | 1.1x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 675 us/op | 649 us/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 971 us/op | 965 us/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 969 us/op | 964 us/op | flat |
| Pipeline/NoDithering | 1.52 ms/op | 1.48 ms/op | flat |
| Pipeline/DitheringWithBlocks | 1.21 ms/op | 1.20 ms/op | flat |
Allocation counts are unchanged; this pass removes method-call and color-access overhead from validated private conversion loops.
Change: added a dedicated maxprocs == 1 render path that writes all rows into one final strings.Builder. The parallel path still renders independent row strings and joins them, but now shares the same row-writing helpers. This avoids per-row string allocations, channel/goroutine overhead, and the final strings.Join for the default serial render path.
Verification:
```
go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImageRender' -benchmem -benchtime=1s -count=3
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImagePipeline' -benchmem -benchtime=1s -count=3
```

Median render result compared to the prior builder/packed-pixmap render baseline:
| Benchmark | Previous | Single builder | Change |
|---|---|---|---|
| Render/NoDithering/Serial | 0.798 ms/op | 0.520 ms/op | 1.5x faster |
| Render/NoDithering/Parallel | 0.295 ms/op | 0.288 ms/op | flat |
| Render/DitheringWithBlocks/Serial | 0.197 ms/op | 0.122 ms/op | 1.6x faster |
| Render/DitheringWithBlocks/Parallel | 0.089 ms/op | 0.091 ms/op | flat |
Serial allocation result compared to the original builder row-join path:
| Benchmark | Previous | Single builder |
|---|---|---|
| Render/NoDithering/Serial | 708 KB/op, 243 allocs/op | 377 KB/op, 2 allocs/op |
| Render/DitheringWithBlocks/Serial | 170 KB/op, 123 allocs/op | 98 KB/op, 2 allocs/op |
Pipeline result compared to Direct Pixmap Writes During Conversion:
| Benchmark | Previous | Single builder | Change |
|---|---|---|---|
| Pipeline/NoDithering | 1.48 ms/op | 1.22 ms/op | 1.2x faster |
| Pipeline/DitheringWithBlocks | 1.20 ms/op | 1.10 ms/op | 1.1x faster |
Takeaway: the default serial render path no longer pays the row string and join tax. Parallel rendering remains faster for these render-only sizes, but serial is now much closer and substantially lighter on allocations, which improves the default pipeline path.
Final summary, v1.3.2 -> current. Render results:

| Benchmark | v1.3.2 | Current | Speedup |
|---|---|---|---|
| Render/NoDithering/Serial | 19.0 ms | 0.193 ms | 98× |
| Render/NoDithering/Parallel | 15.2 ms | 0.114 ms | 134× |
| Render/NoDithering/GoCode/Serial | 19.9 ms | 0.190 ms | 105× |
| Render/DitheringWithBlocks/Serial | 2.49 ms | 0.037 ms | 68× |
| Render/DitheringWithBlocks/Parallel | 1.23 ms | 0.036 ms | 35× |
| Render/DitheringWithBlocks/NoBg/Serial | 1.95 ms | 0.037 ms | 53× |
| Render/DitheringWithChars/Serial | 2.51 ms | 0.039 ms | 64× |
| Render/DitheringWithChars/Parallel | 1.24 ms | 0.036 ms | 35× |
Pipeline results:

| Benchmark | v1.3.2 | Current | Speedup |
|---|---|---|---|
| Pipeline/NoDithering_160x96 | 16.2 ms | 0.663 ms | 24.5× |
| Pipeline/DitheringWithBlocks_320x192 | 5.25 ms | 0.836 ms | 6.3× |
Allocation results:

| Benchmark | v1.3.2 bytes/op | Current bytes/op | v1.3.2 allocs/op | Current allocs/op |
|---|---|---|---|---|
| Render/NoDithering/Serial | 101 MB | 377 KB | 91,518 | 1 |
| Render/NoDithering/Parallel | 54 MB | 385 KB | 49,196 | 100 |
| Render/NoDithering/GoCode/Serial | 118 MB | 377 KB | 91,688 | 1 |
| Render/DitheringWithBlocks/Serial | 5.98 MB | 98 KB | 25,554 | 1 |
| Render/DitheringWithBlocks/Parallel | 3.22 MB | 102 KB | 13,648 | 52 |
| Render/DitheringWithBlocks/NoBg/Serial | 3.72 MB | 98 KB | 21,920 | 1 |
| Render/DitheringWithChars/Serial | 5.63 MB | 98 KB | 25,553 | 1 |
| Render/DitheringWithChars/Parallel | 3.04 MB | 102 KB | 13,648 | 52 |
| Pipeline/NoDithering_160x96 | 103 MB | 932 KB | 106,844 | 85 |
| Pipeline/DitheringWithBlocks_320x192 | 7.58 MB | 1.23 MB | 89,031 | 85 |
Headline numbers
- Render path: 35–134× faster.
- Full pipeline (load → scale → render): 6–25× faster.
- Allocation count on serial render: 91,518 → 1.
- Memory pressure: ~270× reduction on no-dither serial render.
Caveats
- v1.3.2 has the line-count, sub-image, and SetMaxProcs(0) correctness bugs documented in the perf notes. Its render benchmarks are doing slightly less work than current (one fewer row in some cases), so the quality-adjusted speedup is marginally larger than these numbers show.
- The render path's B/op doesn't quite go down at the same rate as allocs/op because the buffer is sized for upper-bound output. Actual terminal-output bytes go down too (the per-row SGR dedup shrinks dither output by ~50%), but the bench reports allocation size, not content size.
- Pipeline numbers are now dominated by imaging.Resize Lanczos (~60-80% of pipeline time). That's the next target if you ever want to push pipeline below 200 µs.