@neomantra
Created April 30, 2026 23:47
PIXterm Performance Engineering

Last night was my first time doing LLM-based performance engineering, and wow, what fun! I token-maxxed $20 of Codex 5.5, then wrapped up with Opus 4.7, along with Gemini for some code review.

After six hours, I ended up with a 165x improvement in PIXterm! What follows is the performance log we kept.


Performance Notes

2026-04-30: Baseline vs v1.3.2

This comparison used the benchmark suite in pkg/ansimage/ansimage_bench_test.go on an Apple M3 Max (darwin/arm64). The current working tree was on branch nm-ansiimage at d576e05; v1.3.2 was tested in a temporary worktree with the same benchmark and correctness test files copied in.

Benchmark command:

go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s

Correctness result:

  • The current tree passes go test ./...
  • v1.3.2 fails the strengthened suite. The main failures are missing render rows, SetMaxProcs(0/-n) not being clamped, and wrong dimensions/pixels for sub-images with a non-zero Bounds().Min.

Performance result, v1.3.2 -> current:

| Benchmark | v1.3.2 | Current | Change |
| --- | --- | --- | --- |
| Render/NoDithering/Serial | 19.84 ms/op | 10.37 ms/op | 1.91x faster |
| Render/NoDithering/GoCode | 20.82 ms/op | 12.42 ms/op | 1.68x faster |
| Render/DitheringWithBlocks/Serial | 2.59 ms/op | 1.35 ms/op | 1.92x faster |
| Render/DitheringWithChars/Serial | 2.55 ms/op | 1.38 ms/op | 1.84x faster |
| Pipeline/NoDithering | 16.79 ms/op | 10.75 ms/op | 1.56x faster |
| Pipeline/DitheringWithBlocks | 5.26 ms/op | 4.31 ms/op | 1.22x faster |

Allocation result, v1.3.2 -> current:

| Benchmark | v1.3.2 | Current |
| --- | --- | --- |
| Render/NoDithering/Serial | 101 MB/op, 91k allocs/op | 52 MB/op, 47k allocs/op |
| Render/DitheringWithBlocks/Serial | 6.0 MB/op, 26k allocs/op | 3.2 MB/op, 14k allocs/op |
| Pipeline/NoDithering | 103 MB/op, 107k allocs/op | 53 MB/op, 62k allocs/op |

Create/scale-only benchmarks are basically unchanged, within normal run-to-run noise. One important caveat: v1.3.2 often renders fewer rows because of its correctness bugs, so the current tree is faster on render-heavy paths even while producing complete output.

Next optimization target: reduce render allocation pressure, especially no-dither rendering. The current no-dither render path still allocates roughly 52 MB/op for a 160x96 benchmark image.

2026-04-30: Go 1.25 Module + Upgraded Packages

This run used the same benchmark suite and command as above. The module is now set to go 1.25.0, while the active local toolchain reports go1.26.2 darwin/arm64.

Current module versions:

github.com/disintegration/imaging v1.6.2
github.com/lucasb-eyer/go-colorful v1.4.0
golang.org/x/image v0.39.0
golang.org/x/sys v0.43.0
golang.org/x/term v0.42.0
golang.org/x/text v0.36.0

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s

Benchmark result compared to the previous current-tree baseline:

| Benchmark | Previous | Go 1.25 module + upgraded packages | Change |
| --- | --- | --- | --- |
| Render/NoDithering/Serial | 10.37 ms/op | 11.37 ms/op | 9.6% slower |
| Render/NoDithering/Parallel | 14.47 ms/op | 14.46 ms/op | flat |
| Render/NoDithering/GoCode | 12.42 ms/op | 12.31 ms/op | 0.9% faster |
| Render/DitheringWithBlocks/Serial | 1.35 ms/op | 1.38 ms/op | 2.4% slower |
| Render/DitheringWithBlocks/NoBg/Serial | 1.19 ms/op | 1.09 ms/op | 8.6% faster |
| Render/DitheringWithChars/Serial | 1.38 ms/op | 1.40 ms/op | 0.9% slower |
| Pipeline/NoDithering | 10.75 ms/op | 10.43 ms/op | 3.0% faster |
| Pipeline/DitheringWithBlocks | 4.31 ms/op | 4.29 ms/op | flat |

Allocation counts are effectively unchanged on the render-heavy paths:

| Benchmark | Result |
| --- | --- |
| Render/NoDithering/Serial | 52 MB/op, 46.6k allocs/op |
| Render/DitheringWithBlocks/Serial | 3.2 MB/op, 13.6k allocs/op |
| Pipeline/NoDithering | 53 MB/op, 62.1k allocs/op |
| Pipeline/DitheringWithBlocks | 4.8 MB/op, 77.1k allocs/op |

Takeaway: upgrading the module version and packages does not materially change the performance profile. Render allocation pressure remains the first optimization target.

2026-04-30: Render Builder + Append Formatting

Change: replaced render-time str += loops with strings.Builder and replaced per-pixel fmt.Sprintf color formatting with strconv.AppendUint-based helpers. Public render output remains byte-for-byte covered by exact ANSI tests.
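The core of this change can be sketched as a small append-based SGR helper. This is illustrative, not PIXterm's actual code: the helper name, buffer shape, and escape layout are assumptions; the point is that strconv.AppendUint into a reused byte slice avoids the interface boxing and format-string parsing that fmt.Sprintf pays per pixel.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// appendFG appends a 24-bit foreground SGR sequence ("\x1b[38;2;R;G;Bm")
// using strconv.AppendUint instead of fmt.Sprintf, avoiding per-pixel
// boxing and format parsing. Name and shape are illustrative.
func appendFG(dst []byte, r, g, b uint8) []byte {
	dst = append(dst, "\x1b[38;2;"...)
	dst = strconv.AppendUint(dst, uint64(r), 10)
	dst = append(dst, ';')
	dst = strconv.AppendUint(dst, uint64(g), 10)
	dst = append(dst, ';')
	dst = strconv.AppendUint(dst, uint64(b), 10)
	dst = append(dst, 'm')
	return dst
}

func main() {
	var sb strings.Builder
	buf := make([]byte, 0, 32) // scratch buffer reused across pixels
	for _, px := range [][3]uint8{{255, 0, 0}, {0, 255, 0}} {
		buf = appendFG(buf[:0], px[0], px[1], px[2])
		sb.Write(buf)
		sb.WriteRune('▀') // upper-half block, as in block rendering
	}
	fmt.Printf("%q\n", sb.String())
}
```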

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Render|Pipeline)' -benchmem -benchtime=1s -count=3

Median render/pipeline result from the focused 3-run benchmark, compared to the Go 1.25 module + upgraded packages baseline:

| Benchmark | Previous | Builder + append | Change |
| --- | --- | --- | --- |
| Render/NoDithering/Serial | 11.37 ms/op | 0.749 ms/op | 15.2x faster |
| Render/NoDithering/Parallel | 14.46 ms/op | 0.332 ms/op | 43.5x faster |
| Render/NoDithering/GoCode | 12.31 ms/op | 0.858 ms/op | 14.4x faster |
| Render/DitheringWithBlocks/Serial | 1.38 ms/op | 0.222 ms/op | 6.2x faster |
| Render/DitheringWithBlocks/Parallel | 1.51 ms/op | 0.110 ms/op | 13.7x faster |
| Render/DitheringWithBlocks/NoBg/Serial | 1.09 ms/op | 0.134 ms/op | 8.1x faster |
| Render/DitheringWithChars/Serial | 1.40 ms/op | 0.229 ms/op | 6.1x faster |
| Render/DitheringWithChars/Parallel | 1.43 ms/op | 0.115 ms/op | 12.5x faster |
| Pipeline/NoDithering | 10.43 ms/op | 1.79 ms/op | 5.8x faster |
| Pipeline/DitheringWithBlocks | 4.29 ms/op | 3.13 ms/op | 1.4x faster |

Allocation result after the change:

| Benchmark | Result |
| --- | --- |
| Render/NoDithering/Serial | 708 KB/op, 243 allocs/op |
| Render/NoDithering/Parallel | 703 KB/op, 153 allocs/op |
| Render/DitheringWithBlocks/Serial | 170 KB/op, 123 allocs/op |
| Render/DitheringWithBlocks/Parallel | 167 KB/op, 79 allocs/op |
| Pipeline/NoDithering | 1.67 MB/op, 15.8k allocs/op |
| Pipeline/DitheringWithBlocks | 1.67 MB/op, 63.6k allocs/op |

Takeaway: render formatting was the dominant cost. The next likely target is conversion-time allocation and pointer chasing in [][]*ANSIpixel, especially for pipeline benchmarks where creation still contributes most of the remaining allocations.

2026-04-30: Contiguous Pixmap Stage 1

Change: replaced ANSImage.pixmap [][]*ANSIpixel with a single contiguous []ANSIpixel. This keeps the public ANSIpixel type and behavior intact, including GetAt returning a copy that can still render through its source pointer.
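The shape of this change can be sketched in a few lines. This is a minimal illustration, not the real type: one contiguous backing slice plus row-major index math replaces h row slices, w*h individually heap-allocated pixels, and the double pointer chase of p[y][x].

```go
package main

import "fmt"

// ANSIpixel stands in for the library's public pixel cell; fields trimmed.
type ANSIpixel struct{ R, G, B uint8 }

// pixmap holds every cell in one contiguous slice: a single allocation
// for the whole image instead of per-row and per-pixel allocations.
type pixmap struct {
	w, h int
	data []ANSIpixel
}

func newPixmap(w, h int) *pixmap {
	return &pixmap{w: w, h: h, data: make([]ANSIpixel, w*h)}
}

// at replaces the old double indirection p[y][x] with row-major indexing.
func (p *pixmap) at(x, y int) *ANSIpixel { return &p.data[y*p.w+x] }

func main() {
	p := newPixmap(160, 96)
	p.at(3, 2).R = 200
	fmt.Println(p.data[2*160+3].R) // the same cell via the raw index
}
```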

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3

Median create/scale/pipeline result from the focused 3-run benchmark:

| Benchmark | Previous | Contiguous pixmap | Change |
| --- | --- | --- | --- |
| Create/NoDithering/RGBA_160x96 | 286 us/op | 73 us/op | 3.9x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 321 us/op | 125 us/op | 2.6x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.68 ms/op | 1.67 ms/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 1.71 ms/op | 1.68 ms/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 973 us/op | 724 us/op | 1.3x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 971 us/op | 723 us/op | 1.3x faster |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 2.84 ms/op | 2.80 ms/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 2.79 ms/op | 2.80 ms/op | flat |
| Pipeline/NoDithering | 1.79 ms/op | 1.55 ms/op | 1.2x faster |
| Pipeline/DitheringWithBlocks | 3.13 ms/op | 2.99 ms/op | 1.1x faster |

Allocation result after the change:

| Benchmark | Previous | Contiguous pixmap |
| --- | --- | --- |
| Create/NoDithering/RGBA_160x96 | 384 KB/op, 15.5k allocs/op | 246 KB/op, 2 allocs/op |
| Create/NoDithering/NRGBAComposite_160x96 | 449 KB/op, 15.5k allocs/op | 311 KB/op, 5 allocs/op |
| Create/DitheringWithBlocks/RGBA_320x192 | 294 KB/op, 63.4k allocs/op | 279 KB/op, 61.4k allocs/op |
| ScaleAndCreate/NoDithering/Resize_160x96 | 964 KB/op, 15.5k allocs/op | 826 KB/op, 84 allocs/op |
| Pipeline/NoDithering | 1.67 MB/op, 15.8k allocs/op | 1.53 MB/op, 327 allocs/op |
| Pipeline/DitheringWithBlocks | 1.67 MB/op, 63.6k allocs/op | 1.66 MB/op, 61.6k allocs/op |

Takeaway: the contiguous pixmap removes nearly all output-pixmap allocation overhead. Dither-heavy creation is now dominated by per-source-pixel color conversion, not output storage.

2026-04-30: Packed Pixmap Stage 2

Change: replaced the internal []ANSIpixel pixmap with private packed pixel data containing only R, G, B, and Brightness. Public ANSIpixel values are reconstructed at API boundaries, so GetAt still returns a renderable copy with source and upper populated.

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Render|Pipeline)' -benchmem -benchtime=1s -count=3

Median benchmark result compared to Contiguous Pixmap Stage 1:

| Benchmark | Stage 1 | Packed pixmap | Change |
| --- | --- | --- | --- |
| Create/NoDithering/RGBA_160x96 | 73 us/op | 41 us/op | 1.8x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 125 us/op | 89 us/op | 1.4x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.67 ms/op | 1.65 ms/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 1.68 ms/op | 1.66 ms/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 724 us/op | 676 us/op | 1.1x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 723 us/op | 675 us/op | 1.1x faster |
| Pipeline/NoDithering | 1.55 ms/op | 1.52 ms/op | flat |
| Pipeline/DitheringWithBlocks | 2.99 ms/op | 2.99 ms/op | flat |
| Render/NoDithering/Serial | 0.749 ms/op | 0.798 ms/op | 6.5% slower |
| Render/NoDithering/Parallel | 0.332 ms/op | 0.295 ms/op | 1.1x faster |
| Render/DitheringWithBlocks/Serial | 0.222 ms/op | 0.197 ms/op | 1.1x faster |
| Render/DitheringWithBlocks/Parallel | 0.110 ms/op | 0.089 ms/op | 1.2x faster |

Allocation and memory result after the change:

| Benchmark | Stage 1 | Packed pixmap |
| --- | --- | --- |
| Create/NoDithering/RGBA_160x96 | 246 KB/op, 2 allocs/op | 66 KB/op, 2 allocs/op |
| Create/NoDithering/NRGBAComposite_160x96 | 311 KB/op, 5 allocs/op | 131 KB/op, 5 allocs/op |
| Create/DitheringWithBlocks/RGBA_320x192 | 279 KB/op, 61.4k allocs/op | 254 KB/op, 61.4k allocs/op |
| ScaleAndCreate/NoDithering/Resize_160x96 | 826 KB/op, 84 allocs/op | 646 KB/op, 84 allocs/op |
| Pipeline/NoDithering | 1.53 MB/op, 327 allocs/op | 1.35 MB/op, 327 allocs/op |
| Pipeline/DitheringWithBlocks | 1.66 MB/op, 61.6k allocs/op | 1.63 MB/op, 61.6k allocs/op |

Takeaway: packing the pixmap mainly reduces memory traffic and no-dither creation cost. It does not address dither allocation count, which remains dominated by per-source-pixel color conversion in colorful.MakeColor/Hsv.

2026-04-30: Direct Dither RGB Aggregation

Change: removed colorful.MakeColor and HSV conversion from the dither conversion loop. The loop now reads *image.RGBA.Pix directly, computes brightness as max(R,G,B), and uses integer sums over each fixed 8x4 block. Transparent pixels still contribute zero, and partial-alpha RGBA pixels are unpremultiplied before averaging.

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3

Median create/scale/pipeline result compared to Packed Pixmap Stage 2:

| Benchmark | Packed pixmap | Direct dither RGB | Change |
| --- | --- | --- | --- |
| Create/DitheringWithBlocks/RGBA_320x192 | 1.65 ms/op | 108 us/op | 15.3x faster |
| Create/DitheringWithChars/RGBA_320x192 | 1.66 ms/op | 110 us/op | 15.1x faster |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 2.82 ms/op | 971 us/op | 2.9x faster |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 2.82 ms/op | 969 us/op | 2.9x faster |
| Pipeline/DitheringWithBlocks | 2.99 ms/op | 1.21 ms/op | 2.5x faster |

Allocation result after the change:

| Benchmark | Packed pixmap | Direct dither RGB |
| --- | --- | --- |
| Create/DitheringWithBlocks/RGBA_320x192 | 254 KB/op, 61.4k allocs/op | 8 KB/op, 2 allocs/op |
| Create/DitheringWithChars/RGBA_320x192 | 254 KB/op, 61.4k allocs/op | 8 KB/op, 2 allocs/op |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 1.46 MB/op, 61.5k allocs/op | 1.21 MB/op, 84 allocs/op |
| Pipeline/DitheringWithBlocks | 1.63 MB/op, 61.6k allocs/op | 1.39 MB/op, 207 allocs/op |

Takeaway: dither conversion allocation pressure is effectively gone. Remaining dither pipeline time is now mostly image scaling plus final render string construction.

2026-04-30: Direct Pixmap Writes During Conversion

Change: bypassed public SetAt and RGBAAt in createANSImage. No-dither conversion now reads *image.RGBA.Pix directly and writes packed ansiPixelData directly; dither conversion also assigns output cells directly after block aggregation.

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage' -benchmem -benchtime=1s
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImage(Create|ScaleAndCreate|Pipeline)' -benchmem -benchtime=1s -count=3

Median create/scale/pipeline result compared to Direct Dither RGB Aggregation:

| Benchmark | Previous | Direct pixmap writes | Change |
| --- | --- | --- | --- |
| Create/NoDithering/RGBA_160x96 | 41 us/op | 15 us/op | 2.7x faster |
| Create/NoDithering/NRGBAComposite_160x96 | 89 us/op | 64 us/op | 1.4x faster |
| Create/DitheringWithBlocks/RGBA_320x192 | 108 us/op | 104 us/op | flat |
| Create/DitheringWithChars/RGBA_320x192 | 110 us/op | 105 us/op | flat |
| ScaleAndCreate/NoDithering/Resize_160x96 | 676 us/op | 643 us/op | 1.1x faster |
| ScaleAndCreate/NoDithering/Fit_160x96 | 675 us/op | 649 us/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Resize_320x192 | 971 us/op | 965 us/op | flat |
| ScaleAndCreate/DitheringWithBlocks/Fit_320x192 | 969 us/op | 964 us/op | flat |
| Pipeline/NoDithering | 1.52 ms/op | 1.48 ms/op | flat |
| Pipeline/DitheringWithBlocks | 1.21 ms/op | 1.20 ms/op | flat |

Allocation counts are unchanged; this pass removes method-call and color-access overhead from validated private conversion loops.

2026-04-30: Serial Render Single Builder

Change: added a dedicated maxprocs == 1 render path that writes all rows into one final strings.Builder. The parallel path still renders independent row strings and joins them, but now shares the same row-writing helpers. This avoids per-row string allocations, channel/goroutine overhead, and the final strings.Join for the default serial render path.

Verification:

go test ./...
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImageRender' -benchmem -benchtime=1s -count=3
go test ./pkg/ansimage -run '^$' -bench '^BenchmarkANSImagePipeline' -benchmem -benchtime=1s -count=3

Median render result compared to the prior builder/packed-pixmap render baseline:

| Benchmark | Previous | Single builder | Change |
| --- | --- | --- | --- |
| Render/NoDithering/Serial | 0.798 ms/op | 0.520 ms/op | 1.5x faster |
| Render/NoDithering/Parallel | 0.295 ms/op | 0.288 ms/op | flat |
| Render/DitheringWithBlocks/Serial | 0.197 ms/op | 0.122 ms/op | 1.6x faster |
| Render/DitheringWithBlocks/Parallel | 0.089 ms/op | 0.091 ms/op | flat |

Serial allocation result compared to the original builder row-join path:

| Benchmark | Previous | Single builder |
| --- | --- | --- |
| Render/NoDithering/Serial | 708 KB/op, 243 allocs/op | 377 KB/op, 2 allocs/op |
| Render/DitheringWithBlocks/Serial | 170 KB/op, 123 allocs/op | 98 KB/op, 2 allocs/op |

Pipeline result compared to Direct Pixmap Writes During Conversion:

| Benchmark | Previous | Single builder | Change |
| --- | --- | --- | --- |
| Pipeline/NoDithering | 1.48 ms/op | 1.22 ms/op | 1.2x faster |
| Pipeline/DitheringWithBlocks | 1.20 ms/op | 1.10 ms/op | 1.1x faster |

Takeaway: the default serial render path no longer pays the row string and join tax. Parallel rendering remains faster for these render-only sizes, but serial is now much closer and substantially lighter on allocations, which improves the default pipeline path.


2026-04-30: Final Comparison

Render benchmarks

| Benchmark | v1.3.2 | Current | Speedup |
| --- | --- | --- | --- |
| Render/NoDithering/Serial | 19.0 ms | 0.193 ms | 98× |
| Render/NoDithering/Parallel | 15.2 ms | 0.114 ms | 134× |
| Render/NoDithering/GoCode/Serial | 19.9 ms | 0.190 ms | 105× |
| Render/DitheringWithBlocks/Serial | 2.49 ms | 0.037 ms | 68× |
| Render/DitheringWithBlocks/Parallel | 1.23 ms | 0.036 ms | 35× |
| Render/DitheringWithBlocks/NoBg/Serial | 1.95 ms | 0.037 ms | 53× |
| Render/DitheringWithChars/Serial | 2.51 ms | 0.039 ms | 64× |
| Render/DitheringWithChars/Parallel | 1.24 ms | 0.036 ms | 35× |

Pipeline benchmarks (decode + scale + create + render)

| Benchmark | v1.3.2 | Current | Speedup |
| --- | --- | --- | --- |
| Pipeline/NoDithering_160x96 | 16.2 ms | 0.663 ms | 24.5× |
| Pipeline/DitheringWithBlocks_320x192 | 5.25 ms | 0.836 ms | 6.3× |

Memory pressure

| Benchmark | v1.3.2 bytes/op | Current bytes/op | v1.3.2 allocs/op | Current allocs/op |
| --- | --- | --- | --- | --- |
| Render/NoDithering/Serial | 101 MB | 377 KB | 91,518 | 1 |
| Render/NoDithering/Parallel | 54 MB | 385 KB | 49,196 | 100 |
| Render/NoDithering/GoCode/Serial | 118 MB | 377 KB | 91,688 | 1 |
| Render/DitheringWithBlocks/Serial | 5.98 MB | 98 KB | 25,554 | 1 |
| Render/DitheringWithBlocks/Parallel | 3.22 MB | 102 KB | 13,648 | 52 |
| Render/DitheringWithBlocks/NoBg/Serial | 3.72 MB | 98 KB | 21,920 | 1 |
| Render/DitheringWithChars/Serial | 5.63 MB | 98 KB | 25,553 | 1 |
| Render/DitheringWithChars/Parallel | 3.04 MB | 102 KB | 13,648 | 52 |
| Pipeline/NoDithering_160x96 | 103 MB | 932 KB | 106,844 | 85 |
| Pipeline/DitheringWithBlocks_320x192 | 7.58 MB | 1.23 MB | 89,031 | 85 |

Headline numbers

  • Render path: 35–134× faster.
  • Full pipeline (load → scale → render): 6–25× faster.
  • Allocation count on serial render: 91,518 → 1.
  • Memory pressure: ~270× reduction on no-dither serial render.

Caveats

  • v1.3.2 has the line-count, sub-image, and SetMaxProcs(0) correctness bugs documented in the perf notes. Its render benchmarks are doing slightly less work than current (one fewer row in some cases), so the quality-adjusted speedup is marginally larger than these numbers show.
  • The render path's B/op doesn't quite go down at the same rate as allocs/op because the buffer is sized for upper-bound output. Actual terminal-output bytes go down too (the per-row SGR dedup shrinks dither output by ~50%), but the bench reports allocation size, not content size.
  • Pipeline numbers are now dominated by imaging.Resize Lanczos (~60-80% of pipeline time). That's the next target if you ever want to push pipeline below 200 µs.