Reference Issue: This methodology was developed while investigating quarto-dev/quarto-cli#13647
Problem: tufte.qmd hanging after running a bucket of tests in CI
Root Cause: Outdated elsarticle.cls v3.3 bundled in test extension causing TinyTeX corruption
Solution: Update extension to use elsarticle.cls v3.4c
Tests that hang or time out when run in CI or as part of a test suite, but work fine when run in isolation.
- State pollution: One test modifies global state that affects subsequent tests
- Resource leaks: File handles, processes, or network connections not cleaned up
- Environment corruption: Package managers (TinyTeX, npm, etc.) get into inconsistent state
- Timing/race conditions: Tests depend on specific execution order or timing
- tufte.qmd hangs after 10+ minutes when run after a bucket of tests
- Same document renders fine in ~30s when run alone
- Lualatex engine gets stuck during "running lualatex - 1"
- Only happens in CI or when running specific test combinations
Goal: Confirm you can reproduce the issue outside of CI
- Identify the failing test bucket from CI logs
- Extract the test file list from CI configuration
- Create a test script to run the bucket sequentially:
#!/bin/bash
# array.sh - Run tests sequentially
shopt -s globstar
readarray -t my_array < <(echo '[...]' | jq -rc '.[]')
haserror=0
for file in "${my_array[@]}"; do
  echo ">>> ./run-tests.sh ${file}"
  # $file is left unquoted so that any glob patterns in the list can expand
  ./run-tests.sh $file
  status=$?
  [ $status -eq 0 ] && echo ">>> No error" || haserror=1
done
# Report failure to the caller if any test failed
exit $haserror
- Run and confirm the hang occurs locally
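When reproducing a hang, it helps to bound each run so the script reports the hang instead of blocking forever. The following is a minimal sketch, assuming GNU coreutils `timeout` is available (it is not installed by default on macOS); `run_with_timeout` is a hypothetical helper name, not part of the original scripts.

```bash
#!/usr/bin/env bash
# run_with_timeout LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under GNU `timeout` and reports a hang.
# GNU timeout exits with status 124 when the time limit is reached.
run_with_timeout() {
  local limit=$1
  shift
  timeout "$limit" "$@"
  local status=$?
  if [ "$status" -eq 124 ]; then
    echo ">>> HANG: '$*' exceeded ${limit}s" >&2
  fi
  return "$status"
}
```

For example, `run_with_timeout 900 ./run-tests.sh docs/tufte.qmd` would stop a stuck render after 15 minutes; both the limit and the path are illustrative values.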
Goal: Find which specific test file causes the issue
Key Insight: If test N pollutes state, tests 1 through N-1 pass and the failure only shows up in tests that run after N.
- Split your test list in half
- Run first half + the hanging test:
# test-binary-search.sh
readarray -t tests < <(echo '[first_half_tests, "hanging-test.qmd"]' | jq -rc '.[]')
for file in "${tests[@]}"; do
  ./run-tests.sh $file || exit 1
done
- If it hangs: culprit is in first half → repeat with first half
- If it passes: culprit is in second half → repeat with second half
- Continue until you identify the single test file
Our result: ./smoke/render/render-format-extension.test.ts was the culprit
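The halving steps can also be automated. This is a sketch under stated assumptions, not the script used in the investigation: it assumes exactly one culprit, and `bisect_culprit` and `triggers_failure` are hypothetical names. The default `triggers_failure` assumes GNU `timeout`, a 600-second limit, and that you reset the environment yourself at the marked spot.

```bash
#!/usr/bin/env bash
# triggers_failure CANDIDATE...
# Hypothetical hook: returns 0 when running the candidates followed by the
# hanging test reproduces the hang. Replace the body to suit your setup.
triggers_failure() {
  local file
  # Assumption: reset the environment here (e.g. reinstall TinyTeX).
  for file in "$@"; do
    ./run-tests.sh "$file"   # results of the candidate tests are ignored
  done
  # Exit status 124 from GNU timeout means the time limit was reached.
  timeout 600 ./run-tests.sh hanging-test.qmd
  [ $? -eq 124 ]
}

# bisect_culprit CANDIDATE...
# Halves the candidate list until one test remains, then prints it.
# Assumes exactly one candidate causes the failure.
bisect_culprit() {
  local -a candidates=("$@")
  local mid
  while (( ${#candidates[@]} > 1 )); do
    mid=$(( ${#candidates[@]} / 2 ))
    local -a first=("${candidates[@]:0:mid}")
    local -a second=("${candidates[@]:mid}")
    if triggers_failure "${first[@]}"; then
      candidates=("${first[@]}")    # culprit is in the first half
    else
      candidates=("${second[@]}")   # otherwise it must be in the second half
    fi
  done
  echo "${candidates[0]}"
}
```

Each iteration costs one full run of the hook, so a bucket of 51 tests needs about 6 iterations. If more than one test pollutes state, the single-culprit assumption breaks and the result should be confirmed by hand.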
Goal: Find which specific operation in the test file causes pollution
- Read the test file to understand what it does
- Identify distinct operations (e.g., rendering different formats)
- Comment out sections and retest:
// Test all formats
// test("academic/document.qmd elsevier-pdf", ...)
// test("academic/document.qmd springer-pdf", ...)
test("academic/document.qmd acm-pdf", ...)
- Binary search through the operations to find the specific one
Our result: Rendering academic/document.qmd with elsevier-pdf format
Goal: Determine what environmental change causes the issue
Common suspects:
- Package installations (TinyTeX, npm, pip)
- Configuration file modifications
- Cache pollution
- File system changes
Investigation approach:
- Create a clean test environment (fresh TinyTeX install)
- Take snapshots before/after the problematic operation:
# Before snapshot
tlmgr list --only-installed > before.txt
# Run problematic test
./run-tests.sh problematic-test.ts
# After snapshot
tlmgr list --only-installed > after.txt
diff before.txt after.txt
- For TinyTeX issues, check:
  - Installed packages: tlmgr list --only-installed
  - Package versions: tlmgr info <package>
  - Format files: ls -la $(kpsewhich -var-value TEXMFSYSVAR)/web2c/luatex/
  - Update logs: Check what tlmgr update --all installs
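The before/after snapshot pattern generalizes to any state you can list. A minimal sketch follows; `snapshot_around` and `tinytex_packages` are hypothetical helper names, and the only real command assumed is `tlmgr list --only-installed` from the steps above.

```bash
#!/usr/bin/env bash
# snapshot_around SNAPSHOT_FN COMMAND [ARGS...]
# Hypothetical helper: captures the output of SNAPSHOT_FN before and after
# running COMMAND, then prints the diff. Returns diff's exit status
# (0 = no change, 1 = state changed).
snapshot_around() {
  local snapshot_fn=$1
  shift
  local before after
  before=$(mktemp)
  after=$(mktemp)
  "$snapshot_fn" > "$before"
  "$@"                       # the operation under suspicion
  "$snapshot_fn" > "$after"
  diff "$before" "$after"
  local changed=$?
  rm -f "$before" "$after"
  return "$changed"
}

# Example snapshot function for TinyTeX package state.
tinytex_packages() {
  tlmgr list --only-installed
}
```

A call such as `snapshot_around tinytex_packages ./run-tests.sh problematic-test.ts` would print any packages added, removed, or updated by the test. Other snapshot functions (for example one wrapping `npm list` or `pip list`) can be swapped in the same way.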
Our findings:
- elsevier-pdf format uses bundled elsarticle.cls v3.3 (from 2020)
- Rendering triggers tlmgr update --all which updates core packages
- Updates regenerate lualatex format files expecting modern conventions
- Old class file incompatible with regenerated format files
- TinyTeX environment corrupted for subsequent renders
Goal: Understand WHY the state change causes the failure
- Compare working vs broken states in detail
- For package version issues:
  - Check if test bundles old versions of libraries/classes
  - Compare with system-installed versions
  - Review changelogs between versions
- Create minimal reproduction:
# verify-root-cause.sh
echo "=== Test 1: Old version ==="
# Setup with old version
# Run problematic operation
# Run hanging test
echo "=== Test 2: New version ==="
# Setup with new version
# Run problematic operation
# Run hanging test
Our root cause:
- Bundled elsarticle.cls v3.3 missing \RequirePackage[T1]{fontenc}
- TinyTeX's elsarticle.cls v3.4c includes it
- Font encoding mismatch corrupts lualatex format files
- Subsequent lualatex renders hang
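A check for bundled class or style files that differ from the installed copy can surface this kind of version drift early. The sketch below is one possible approach, not the method used in the investigation: `find_shadowed` is a hypothetical name, and it assumes `kpsewhich` prints the installed path of a file and fails when the file is not found. The resolver is a parameter so the function can be tried without a TeX installation.

```bash
#!/usr/bin/env bash
# find_shadowed DIR [RESOLVER]
# Hypothetical helper: for every .cls/.sty under DIR, asks RESOLVER
# (default: kpsewhich) for the installed copy and reports files whose
# contents differ from it.
find_shadowed() {
  local dir=$1
  local resolver=${2:-kpsewhich}
  local bundled installed
  find "$dir" -type f \( -name '*.cls' -o -name '*.sty' \) -print0 |
    while IFS= read -r -d '' bundled; do
      # Skip files the resolver does not know about.
      installed=$("$resolver" "$(basename "$bundled")") || continue
      [ -n "$installed" ] || continue
      # Skip when the resolver points back at the bundled file itself.
      if [ "$bundled" -ef "$installed" ]; then continue; fi
      cmp -s "$bundled" "$installed" ||
        echo "DIFFERS: $bundled vs $installed"
    done
}
```

Running `find_shadowed .` from a test extension directory would list candidates to inspect with `diff`. A reported difference is only a lead; it does not by itself show which copy is older or that the difference causes a failure.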
Goal: Confirm your fix resolves the issue
- Apply the fix (update package, patch code, etc.)
- Create verification script:
#!/bin/bash
# verify-fix.sh
echo ">>> Fresh environment setup"
# Clean install
echo ">>> Running problematic test (with fix)"
./run-tests.sh problematic-test.ts || exit 1
echo ">>> Testing previously-hanging test"
./run-tests.sh hanging-test.qmd || exit 1
echo "✅ SUCCESS: Fix verified!"
- Run multiple times to ensure consistency
- Test with clean environment each time (critical for environment pollution issues)
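Repeating the verification can be scripted so that a single intermittent failure is not missed. A minimal sketch, where `verify_repeatedly` is a hypothetical helper name and the command passed to it is expected to set up its own clean environment:

```bash
#!/usr/bin/env bash
# verify_repeatedly RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to RUNS times and stops at the
# first failure. COMMAND should reset the environment itself so every
# run starts clean.
verify_repeatedly() {
  local runs=$1
  shift
  local i
  for (( i = 1; i <= runs; i++ )); do
    echo ">>> Verification run $i of $runs"
    if ! "$@"; then
      echo ">>> Run $i failed" >&2
      return 1
    fi
  done
  echo ">>> All $runs runs passed"
}
```

For instance, `verify_repeatedly 3 ./verify-fix.sh` would run the verification script three times in a row; the count of 3 is an arbitrary illustration.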
# List installed packages
tlmgr list --only-installed
# Check package info
tlmgr info <package>
# Find file locations
kpsewhich elsarticle.cls
# Check format files
ls -la $(kpsewhich -var-value TEXMFSYSVAR)/web2c/luatex/
# Clean TinyTeX (for fresh start)
rm -rf ~/.TinyTeX
quarto install tinytex

# Run single test
./run-tests.sh path/to/test.ts
# Run test sequence
for test in test1.ts test2.ts test3.ts; do
./run-tests.sh $test || break
done
# Check environment differences
diff <(env | sort) <(docker run ... env | sort)

# Compare package versions
npm list
tlmgr list --only-installed
pip list
# Check for bundled vs system versions
find . -name "package.json"
find . -name "*.cls" -o -name "*.sty"

- Always reproduce locally first - CI is too slow for debugging
- Use binary search - Most efficient way to isolate culprits
- Test with clean environments - Especially for environment pollution issues
- Take snapshots - Before/after comparisons are invaluable
- Create verification scripts - Automate testing your fix
- Document the root cause - Help others (and future you) understand the issue
- Testing with polluted environment - Always start fresh for environment issues
- Assuming causation from correlation - Just because test A runs before test B doesn't mean A causes B's failure
- Stopping too early - Finding the problematic test isn't enough; understand WHY it causes issues
- Not verifying the fix - Always confirm your solution actually works
- Reproduce the issue locally
- Identify the specific test bucket that triggers the issue
- Use binary search to isolate the culprit test file
- Narrow down to specific operation within the test
- Take environment snapshots before/after
- Identify what environmental change occurs
- Understand WHY the change causes the failure
- Develop and apply a fix
- Verify the fix with clean environments
- Document the root cause and solution
- Phase 1 (Reproduce): 30 minutes - 2 hours
- Phase 2 (Binary Search): 1-4 hours (depending on test suite size)
- Phase 3 (Narrow Down): 30 minutes - 1 hour
- Phase 4 (State Change): 1-3 hours
- Phase 5 (Root Cause): 2-6 hours (hardest part)
- Phase 6 (Verify): 30 minutes - 1 hour
Total: Typically 1-2 days of focused investigation
# 1. Reproduce locally
./array.sh # Confirmed hang after bucket
# 2. Binary search (from 51 tests down to 1)
./test-binary-search.sh # Found: render-format-extension.test.ts
# 3. Narrow down within test
# Commented out formats one by one → Found: elsevier-pdf
# 4. State change investigation
tlmgr list --only-installed # Before/after comparison
# Found: Package updates triggered by elsevier render
# 5. Root cause analysis
kpsewhich elsarticle.cls # Found bundled v3.3 vs TinyTeX v3.4c
diff elsarticle-v3.3.cls elsarticle-v3.4c.cls # Found missing fontenc
# 6. Verify solution
quarto update extension quarto-journals/elsevier
./verify-extension-update.sh # ✅ Confirmed fix

Debugging flaky tests requires patience, systematic methodology, and understanding of the underlying systems. Binary search is your best friend for isolation, and clean environments are critical for verification. Always dig deep enough to understand the root cause - surface-level fixes often don't hold up.
- quarto-dev/quarto-cli#13647 - tufte.qmd hanging in CI (solved with this methodology)
- quarto-journals/elsevier#38 - Update elsarticle.cls to v3.4c (upstream fix)
- quarto-journals/elsevier#40 - CTAN update for elsarticle class (upstream fix)