Debugging Flaky Tests: A Systematic Approach

Reference Issue: This methodology was developed while investigating quarto-dev/quarto-cli#13647

Problem: tufte.qmd hanging after running a bucket of tests in CI

Root Cause: Outdated elsarticle.cls v3.3 bundled in test extension causing TinyTeX corruption

Solution: Update extension to use elsarticle.cls v3.4c

Problem Pattern

Tests hang or time out when run in CI or as part of a test suite, but work fine when run in isolation.

Root Cause Categories

State pollution: One test modifies global state that affects subsequent tests
Resource leaks: File handles, processes, or network connections not cleaned up
Environment corruption: Package managers (TinyTeX, npm, etc.) get into inconsistent state
Timing/race conditions: Tests depend on specific execution order or timing

Our Case Study: tufte.qmd Hanging After Test Bucket

Symptoms

tufte.qmd hangs for 10+ minutes when run after a bucket of tests
Same document renders fine in ~30s when run alone
The lualatex engine gets stuck during "running lualatex - 1"
Only happens in CI or when running specific test combinations

Investigation Methodology

Phase 1: Reproduce Locally

Goal: Confirm you can reproduce the issue outside of CI

Identify the failing test bucket from CI logs
Extract the test file list from CI configuration
Create a test script to run the bucket sequentially:

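# array.sh - Run tests sequentially
shopt -s globstar  # enable ** before the loop so glob patterns can expand
readarray -t my_array < <(echo '[...]' | jq -rc '.[]')
haserror=0
for file in "${my_array[@]}"; do
  echo ">>> ./run-tests.sh ${file}"
  # $file is left unquoted so that glob patterns in the bucket, if any, expand
  ./run-tests.sh $file
  status=$?
  [ $status -eq 0 ] && echo ">>> No error" || haserror=1
done
# Report overall failure to the caller
exit $haserror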

Run and confirm the hang occurs locally
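
Because the failure mode here is a hang rather than an error, the sequential script can block indefinitely. One option is to wrap each invocation in a time limit so that a hang is reported as an ordinary failure. The sketch below assumes GNU coreutils `timeout` is available (it is not installed by default on macOS); `run_with_limit` is a hypothetical helper name, not part of the Quarto test tooling.

```bash
#!/usr/bin/env bash
# run_with_limit SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under GNU `timeout`, which exits with
# status 124 when the time limit is reached.
run_with_limit() {
  local limit="$1"; shift
  timeout "$limit" "$@"
  local status=$?
  if [ "$status" -eq 124 ]; then
    echo ">>> HANG: '$*' exceeded ${limit}s" >&2
  fi
  return "$status"
}

# Possible use inside the loop of array.sh (900s is an arbitrary limit):
# run_with_limit 900 ./run-tests.sh $file
```

The limit should sit well above the normal runtime of the slowest test, so that only a genuine hang trips it.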

Phase 2: Binary Search to Isolate Culprit

Goal: Find which specific test file causes the issue

Key Insight: If test N causes the state pollution, the hanging test still passes after tests 1 to N-1 and starts hanging only once test N has run before it.

Split your test list in half

Run first half + the hanging test:

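# test-binary-search.sh
# Replace the placeholder names with the first half of the bucket; keep the hanging test last
readarray -t tests < <(echo '["first-half-test-1.test.ts", "first-half-test-2.test.ts", "hanging-test.qmd"]' | jq -rc '.[]')
for file in "${tests[@]}"; do
  ./run-tests.sh $file || exit 1
done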

If it hangs: culprit is in first half → repeat with first half
If it passes: culprit is in second half → repeat with second half
Continue until you identify the single test file

Our result: ./smoke/render/render-format-extension.test.ts was the culprit
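
The halving procedure can also be automated. The following is a minimal sketch, not the script used in the investigation. It assumes exactly one test in the list causes the pollution and that the result is deterministic. `bisect_culprit` and the check command passed to it are hypothetical: the check receives a list of tests, resets the environment, runs those tests followed by the hanging test, and returns 0 if the hang reproduces.

```bash
#!/usr/bin/env bash
# bisect_culprit CHECK TEST...
# Hypothetical helper: CHECK must return 0 when the hang reproduces after
# running the tests it is given, and non-zero otherwise.
bisect_culprit() {
  local check="$1"; shift
  local candidates=("$@")
  while [ "${#candidates[@]}" -gt 1 ]; do
    local half=$(( ${#candidates[@]} / 2 ))
    local first=("${candidates[@]:0:half}")
    local second=("${candidates[@]:half}")
    if "$check" "${first[@]}"; then
      candidates=("${first[@]}")   # hang reproduced: culprit is in the first half
    else
      candidates=("${second[@]}")  # no hang: culprit must be in the second half
    fi
  done
  echo "${candidates[0]}"
}

# Possible use, with `reproduces` as a wrapper you would write around
# ./run-tests.sh (for example with a time limit to detect the hang):
# bisect_culprit reproduces "${my_array[@]}"
```

Each call to the check must start from a clean environment, otherwise pollution left by a previous round makes every later round look like a hit.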

Phase 3: Narrow Down Within Test File

Goal: Find which specific operation in the test file causes pollution

Read the test file to understand what it does
Identify distinct operations (e.g., rendering different formats)

Comment out sections and retest:

// Test all formats
// test("academic/document.qmd elsevier-pdf", ...)
// test("academic/document.qmd springer-pdf", ...)
test("academic/document.qmd acm-pdf", ...)

Binary search through the operations to find the specific one

Our result: Rendering academic/document.qmd with elsevier-pdf format

Phase 4: Understand the State Change

Goal: Determine what environmental change causes the issue

Common suspects:

Package installations (TinyTeX, npm, pip)
Configuration file modifications
Cache pollution
File system changes

Investigation approach:

Create a clean test environment (fresh TinyTeX install)

Take snapshots before/after the problematic operation:

# Before snapshot
tlmgr list --only-installed > before.txt

# Run problematic test
./run-tests.sh problematic-test.ts

# After snapshot
tlmgr list --only-installed > after.txt
diff before.txt after.txt

For TinyTeX issues, check:
- Installed packages: tlmgr list --only-installed
- Package versions: tlmgr info <package>
- Format files: ls -la $(kpsewhich -var-value TEXMFSYSVAR)/web2c/luatex/
- Update logs: Check what tlmgr update --all installs
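
A package list only reveals version changes. To also catch files that were regenerated in place, such as format files, a checksum snapshot of a directory can be diffed in the same before/after fashion. This is a minimal sketch assuming the standard `cksum`, `find`, and `sort` utilities; `snapshot_dir` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# snapshot_dir DIR OUT
# Hypothetical helper: writes one "checksum size path" line per file under
# DIR to OUT, sorted by path so two snapshots can be compared with diff.
snapshot_dir() {
  local dir="$1" out="$2"
  ( cd "$dir" && find . -type f -exec cksum {} + | sort -k 3 ) > "$out"
}

# Possible use around the problematic test:
# fmt_dir="$(kpsewhich -var-value TEXMFSYSVAR)/web2c"
# snapshot_dir "$fmt_dir" before-formats.txt
# ./run-tests.sh problematic-test.ts
# snapshot_dir "$fmt_dir" after-formats.txt
# diff before-formats.txt after-formats.txt
```

Any line that appears in the diff points at a file that was added, removed, or rewritten by the test.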

Our findings:

elsevier-pdf format uses bundled elsarticle.cls v3.3 (from 2020)
Rendering triggers tlmgr update --all which updates core packages
Updates regenerate lualatex format files expecting modern conventions
Old class file incompatible with regenerated format files
TinyTeX environment corrupted for subsequent renders

Phase 5: Identify Root Cause

Goal: Understand WHY the state change causes the failure

Compare working vs broken states in detail
For package version issues:
- Check if test bundles old versions of libraries/classes
- Compare with system-installed versions
- Review changelogs between versions

Create minimal reproduction:

# verify-root-cause.sh
echo "=== Test 1: Old version ==="
# Setup with old version
# Run problematic operation
# Run hanging test

echo "=== Test 2: New version ==="
# Setup with new version
# Run problematic operation
# Run hanging test

Our root cause:

Bundled elsarticle.cls v3.3 missing \RequirePackage[T1]{fontenc}
TinyTeX's elsarticle.cls v3.4c includes it
Font encoding mismatch corrupts lualatex format files
Subsequent lualatex renders hang
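
The bundled-versus-installed comparison that led to this finding can be scripted. The sketch below is one possible approach, not the procedure used in the investigation. `report_bundled_tex_files` is a hypothetical helper. It looks up each bundled `.cls` or `.sty` file by name with `kpsewhich` (or another lookup command given as the second argument) and reports whether the installed copy differs.

```bash
#!/usr/bin/env bash
# report_bundled_tex_files DIR [LOOKUP]
# Hypothetical helper: for every .cls/.sty file under DIR, ask LOOKUP
# (default: kpsewhich) for the installed copy and compare the two files.
report_bundled_tex_files() {
  local dir="$1" lookup="${2:-kpsewhich}"
  local bundled name installed
  find "$dir" -type f \( -name '*.cls' -o -name '*.sty' \) |
  while IFS= read -r bundled; do
    name="$(basename "$bundled")"
    installed="$("$lookup" "$name" 2>/dev/null)" || installed=""
    if [ -z "$installed" ]; then
      echo "bundled-only: $bundled"
    elif cmp -s "$bundled" "$installed"; then
      echo "identical: $bundled"
    else
      echo "differs: $bundled (installed: $installed)"
    fi
  done
}

# Possible use from the repository root:
# report_bundled_tex_files .
```

Run it from a directory that does not itself contain the bundled files, since `kpsewhich` also searches the current directory and could return the bundled copy. Files reported as `differs` are candidates for a `diff` and a changelog review.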

Phase 6: Verify Solution

Goal: Confirm your fix resolves the issue

Apply the fix (update package, patch code, etc.)

Create verification script:

#!/bin/bash
# verify-fix.sh

echo ">>> Fresh environment setup"
# Clean install

echo ">>> Running problematic test (with fix)"
./run-tests.sh problematic-test.ts || exit 1

echo ">>> Testing previously-hanging test"
./run-tests.sh hanging-test.qmd || exit 1

echo "✅ SUCCESS: Fix verified!"

Run multiple times to ensure consistency
Test with clean environment each time (critical for environment pollution issues)
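
Repeating the verification can be scripted as well. This is a minimal sketch; `repeat_check` is a hypothetical helper that runs a command a fixed number of times and reports how many runs failed. It assumes the command itself resets the environment on each run, as the clean-install step in verify-fix.sh is meant to do.

```bash
#!/usr/bin/env bash
# repeat_check RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND RUNS times, prints a per-run result and
# a summary, and returns non-zero if any run failed.
repeat_check() {
  local runs="$1"; shift
  local i=1 failures=0
  while [ "$i" -le "$runs" ]; do
    if "$@"; then
      echo "run $i: pass"
    else
      echo "run $i: FAIL"
      failures=$(( failures + 1 ))
    fi
    i=$(( i + 1 ))
  done
  echo "$failures of $runs runs failed"
  [ "$failures" -eq 0 ]
}

# Possible use:
# repeat_check 5 ./verify-fix.sh
```

A single failing run is enough to reopen the investigation, since an intermittent failure suggests the fix addresses only part of the cause.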

Key Debugging Tools

For TinyTeX Issues

# List installed packages
tlmgr list --only-installed

# Check package info
tlmgr info <package>

# Find file locations
kpsewhich elsarticle.cls

# Check format files
ls -la $(kpsewhich -var-value TEXMFSYSVAR)/web2c/luatex/

# Clean TinyTeX (for fresh start)
# ~/.TinyTeX is the Linux location; the install directory differs on macOS and Windows
rm -rf ~/.TinyTeX
quarto install tinytex

For General Test Issues

# Run single test
./run-tests.sh path/to/test.ts

# Run test sequence
for test in test1.ts test2.ts test3.ts; do
  ./run-tests.sh "$test" || break
done

# Check environment differences
diff <(env | sort) <(docker run ... env | sort)

For Package/Dependency Issues

# Compare package versions
npm list
tlmgr list --only-installed
pip list

# Check for bundled vs system versions
find . -name "package.json"
find . -name "*.cls" -o -name "*.sty"

Best Practices

Always reproduce locally first - CI is too slow for debugging
Use binary search - Most efficient way to isolate culprits
Test with clean environments - Especially for environment pollution issues
Take snapshots - Before/after comparisons are invaluable
Create verification scripts - Automate testing your fix
Document the root cause - Help others (and future you) understand the issue

Common Pitfalls

Testing with polluted environment - Always start fresh for environment issues
Assuming causation from correlation - Just because test A runs before test B doesn't mean A causes B's failure
Stopping too early - Finding the problematic test isn't enough; understand WHY it causes issues
Not verifying the fix - Always confirm your solution actually works
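
To guard against the correlation pitfall, a control run helps: the hanging test should pass on its own and fail only when the suspect runs first. The sketch below encodes that check. `confirm_culprit` is a hypothetical helper, and the check command is an assumption: it receives a list of tests, resets the environment, runs them in order, and returns 0 if the last one hangs or fails.

```bash
#!/usr/bin/env bash
# confirm_culprit CHECK SUSPECT TARGET
# Hypothetical helper: CHECK must return 0 when the failure reproduces for
# the test sequence it is given. Returns 0 if SUSPECT is confirmed,
# 1 if not confirmed, 2 if TARGET already fails without SUSPECT.
confirm_culprit() {
  local check="$1" suspect="$2" target="$3"
  if "$check" "$target"; then
    echo "inconclusive: $target fails even on its own"
    return 2
  fi
  if "$check" "$suspect" "$target"; then
    echo "confirmed: $target fails only after $suspect"
    return 0
  fi
  echo "not confirmed: $target passes after $suspect"
  return 1
}

# Possible use, with `reproduces` as a wrapper you would write:
# confirm_culprit reproduces ./smoke/render/render-format-extension.test.ts tufte.qmd
```

An "inconclusive" result usually means the environment was already polluted before the control run started.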

Checklist for Flaky Test Investigation

Time Investment

Phase 1 (Reproduce): 30 minutes - 2 hours
Phase 2 (Binary Search): 1-4 hours (depending on test suite size)
Phase 3 (Narrow Down): 30 minutes - 1 hour
Phase 4 (State Change): 1-3 hours
Phase 5 (Root Cause): 2-6 hours (hardest part)
Phase 6 (Verify): 30 minutes - 1 hour

Total: Typically 1-2 days of focused investigation

Example: Our Complete Investigation

# 1. Reproduce locally
./array.sh  # Confirmed hang after bucket

# 2. Binary search (from 51 tests down to 1)
./test-binary-search.sh  # Found: render-format-extension.test.ts

# 3. Narrow down within test
# Commented out formats one by one → Found: elsevier-pdf

# 4. State change investigation
tlmgr list --only-installed  # Before/after comparison
# Found: Package updates triggered by elsevier render

# 5. Root cause analysis
kpsewhich elsarticle.cls  # Found bundled v3.3 vs TinyTeX v3.4c
diff elsarticle-v3.3.cls elsarticle-v3.4c.cls  # Found missing fontenc

# 6. Verify solution
quarto update extension quarto-journals/elsevier
./verify-extension-update.sh  # ✅ Confirmed fix

Conclusion

Debugging flaky tests requires patience, systematic methodology, and understanding of the underlying systems. Binary search is your best friend for isolation, and clean environments are critical for verification. Always dig deep enough to understand the root cause - surface-level fixes often don't hold up.

Related Issues and References

quarto-dev/quarto-cli#13647 - tufte.qmd hanging in CI (solved with this methodology)
quarto-journals/elsevier#38 - Update elsarticle.cls to v3.4c (upstream fix)
quarto-journals/elsevier#40 - CTAN update for elsarticle class (upstream fix)

cderv/DEBUGGING-FLAKY-TESTS.md
