Problem: Matplotlib's baseline image tests produce a 507MB .git directory (40MB of current baselines plus 467MB of history). With 2,330 baseline images and roughly 106 baseline-touching commits per year, the repository keeps growing, which makes FreeType updates painful and new-contributor clones slow.
Solution: Store perceptual hashes (~200KB total) in the main repo instead of images, and download the actual images on demand only when a test fails.

import imagehash
from PIL import Image

# Generate and compare hashes
baseline_hash = imagehash.phash(Image.open('baseline.png'))
generated_hash = imagehash.phash(Image.open('test_output.png'))

# Compare with tolerance (Hamming distance)
tolerance = 5  # bits difference allowed
if baseline_hash - generated_hash <= tolerance:
    # Test passes - images perceptually similar
    pass

Hash Properties:
- Perceptual hash (not cryptographic): similar images = similar hashes
- 64-bit hash (imagehash's default hash_size=8) = 16 hex characters stored
- Hamming distance comparison allows configurable tolerance
- Tolerances: 0 (pixel-perfect), 1-3 (minor antialiasing), 5-8 (small differences), 10+ (likely failure)
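
To make the distance semantics concrete, here is a minimal sketch that computes the Hamming distance between two hex-encoded 64-bit hashes with the standard library only, then maps it onto the tolerance bands listed above. The names `hamming_distance` and `classify_distance` are illustrative and not part of imagehash or Matplotlib, and the sample hashes are made-up hex strings. Distances of 4 and 9 fall between the listed bands; assigning them to the next band up is an assumption of this sketch.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def classify_distance(distance: int) -> str:
    """Map a bit distance onto the rough bands from the list above."""
    if distance == 0:
        return "pixel-perfect"
    if distance <= 3:
        return "minor antialiasing"
    if distance <= 8:
        return "small differences"
    return "likely failure"


baseline = "a1b2c3d4e5f60718"   # hypothetical 64-bit hash, 16 hex characters
generated = "a1b2c3d4e5f70718"  # one hex digit changed from 6 to 7
distance = hamming_distance(baseline, generated)
print(distance, classify_distance(distance))
```

Changing a single hex digit can flip between one and four bits, so two hashes that look almost identical as text may still differ by several bits.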

{
  "test_backend_pdf::test_kerning": {
    "primary": "a1b2c3d4e5f6g7h8",
    "variants": {
      "macos-arm64": "a1b2c3d4e5f7g7h8",
      "windows": "a1b2c3d4e5f9g7h8",
      "freetype-2.13": "a1b2c3d4e6f6g7h8"
    },
    "tolerance": 5,
    "metadata": {
      "created": "2024-01-15",
      "last_updated": "2026-03-01",
      "format": "pdf"
    }
  },
  "test_backend_pdf::test_hatching_legend": {
    "primary": "x9y8z7w6v5u4t3s2",
    "tolerance": 3,
    "metadata": {
      "created": "2023-05-20",
      "format": "pdf"
    }
  }
}

Automatic Tolerance: Most platform differences (antialiasing, minor font rendering) are handled automatically by the hash tolerance.
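
The entry layout above could be checked before it is committed. The following is a minimal sketch of such a check, assuming that `primary` is the only required key, that `tolerance` defaults to 5 when absent, and that all hashes in one entry must be hex strings of equal length so that bit distances are well defined. `validate_entry` is a hypothetical helper, not an existing Matplotlib function.

```python
import re

HEX_RE = re.compile(r"[0-9a-fA-F]+")


def validate_entry(name: str, entry: dict) -> list:
    """Return a list of problems found in one baseline_hashes.json entry."""
    problems = []
    if "primary" not in entry:
        problems.append(f"{name}: missing 'primary' hash")

    tolerance = entry.get("tolerance", 5)  # assumed default
    if isinstance(tolerance, bool) or not isinstance(tolerance, int) or tolerance < 0:
        problems.append(f"{name}: tolerance must be a non-negative integer")

    # Gather the primary hash and all variants under readable labels.
    hashes = {}
    if "primary" in entry:
        hashes["primary"] = entry["primary"]
    for platform, value in entry.get("variants", {}).items():
        hashes[f"variant {platform}"] = value

    lengths = set()
    for label, value in hashes.items():
        if isinstance(value, str) and HEX_RE.fullmatch(value):
            lengths.add(len(value))
        else:
            problems.append(f"{name}: {label} hash {value!r} is not a hex string")
    if len(lengths) > 1:
        problems.append(f"{name}: hashes have differing lengths")
    return problems


good = {
    "primary": "a1b2c3d4e5f60718",
    "variants": {"windows": "a1b2c3d4e5f90718"},
    "tolerance": 5,
}
placeholder = {"primary": "a1b2c3d4e5f6g7h8", "tolerance": 5}

print(validate_entry("test_backend_pdf::test_kerning", good))
print(validate_entry("test_backend_pdf::test_kerning", placeholder))
```

The second call reports a problem because the illustrative hash values in the JSON example contain letters outside the hex range; real perceptual hashes serialized by imagehash would be plain hex strings.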
Explicit Variants: For legitimate platform-specific rendering:
- Detection: Test runs on macOS, generates hash a1b2c3d4e5f7g7h8
- Comparison: Primary hash a1b2c3d4e5f6g7h8 has distance = 1 (within tolerance!)
- CI Action: Detects acceptable-but-new hash, uploads image, requests approval
- Approval: Maintainer reviews side-by-side images, approves via comment
- Update: CI commits variant to baseline_hashes.json
Safety: New variants must be within tolerance of primary hash AND require human visual approval.
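
The detection and comparison steps, together with the safety rule, can be sketched as a small classifier. This is an illustration under stated assumptions: the status names `known`, `candidate-variant`, and `mismatch` and the function `classify_result` are hypothetical, and the hashes are made-up hex strings. A hash that is within tolerance of the primary but not yet recorded is only flagged for review here; the human approval step is not modeled.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def classify_result(entry: dict, result_hash: str, default_tolerance: int = 5) -> str:
    """Classify a generated hash against one baseline entry."""
    known = [entry["primary"], *entry.get("variants", {}).values()]
    if result_hash in known:
        return "known"
    tolerance = entry.get("tolerance", default_tolerance)
    # Safety rule: a new variant must be within tolerance of the primary hash.
    if hamming_distance(result_hash, entry["primary"]) <= tolerance:
        return "candidate-variant"  # acceptable but new: still needs approval
    return "mismatch"


entry = {
    "primary": "a1b2c3d4e5f60718",
    "variants": {"windows": "a1b2c3d4e5f90718"},
    "tolerance": 5,
}
print(classify_result(entry, "a1b2c3d4e5f70718"))
```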

# 1. Write test with @image_comparison decorator
@image_comparison(['my_new_test.pdf'])
def test_my_feature():
    # ... test code ...

# 2. Generate baseline locally
pytest test_backend_pdf.py::test_my_feature --accept-new-baseline
# This creates:
# - result_images/test_backend_pdf/my_new_test.pdf
# - Updates baseline_hashes.json with computed hash

# 3. Commit hash file (not the image!)
git add lib/matplotlib/tests/baseline_hashes.json
git commit -m "Add test_my_feature with baseline"
git push

# 4. CI automatically:
# - Runs test, generates image
# - Computes hash, verifies match with baseline_hashes.json
# - Uploads image to storage with hash-based filename
# - Posts image URL in PR for reviewer inspection

# 1. Make code changes that affect rendering
# 2. Regenerate hash
pytest test_backend_pdf.py::test_kerning --accept-new-baseline
# 3. Review diff in baseline_hashes.json
git diff lib/matplotlib/tests/baseline_hashes.json
# 4. Commit and push - CI handles image upload

# Normal test run (no image downloads)
pytest test_backend_pdf.py::test_kerning
# If hash matches: test passes immediately
# If hash misses: downloads baseline, compares pixels, shows diff
# Accept hash on mismatch (update local baseline)
pytest test_backend_pdf.py::test_kerning --accept-hash
# Accept platform variant
pytest test_backend_pdf.py::test_kerning --accept-hash-variant=macos-arm64

# .github/workflows/tests.yml
- name: Upload new baseline images
  if: github.event_name == 'pull_request'
  run: |
    python tools/upload_baseline_images.py \
      --source result_images/ \
      --hashes lib/matplotlib/tests/baseline_hashes.json \
      --storage github-release
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Script Logic:
- Find all images in result_images/ from test run
- Check which hashes exist in baseline_hashes.json but not in storage
- Upload missing images: storage/{hash}.pdf
- Comment on PR with uploaded image list and preview URLs
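
The selection step in the script logic above amounts to a set difference. A minimal sketch follows, assuming the storage listing and the per-image hashes have already been obtained and are passed in as plain data; the actual upload call and the PR comment are left out. `referenced_hashes` and `plan_uploads` are hypothetical names, and the short hash values are placeholders.

```python
from pathlib import Path


def referenced_hashes(hashes: dict) -> set:
    """All primary and variant hashes mentioned in baseline_hashes.json."""
    found = set()
    for entry in hashes.values():
        found.add(entry["primary"])
        found.update(entry.get("variants", {}).values())
    return found


def plan_uploads(hashes: dict, stored_names: set, result_hashes: dict) -> list:
    """Return (local path, object name) pairs that still need uploading.

    hashes        -- parsed baseline_hashes.json
    stored_names  -- object names already in storage, e.g. {"<hash>.pdf"}
    result_hashes -- mapping of generated image path -> its computed hash
    """
    wanted = referenced_hashes(hashes)
    plan = []
    for path, image_hash in sorted(result_hashes.items()):
        object_name = f"{image_hash}{path.suffix}"
        if image_hash in wanted and object_name not in stored_names:
            plan.append((path, object_name))
    return plan


hashes = {"test_backend_pdf::test_kerning": {"primary": "aa", "variants": {"windows": "ab"}}}
results = {
    Path("result_images/test_backend_pdf/kerning.pdf"): "aa",
    Path("result_images/test_backend_pdf/kerning_win.pdf"): "ab",
    Path("result_images/test_backend_pdf/unrelated.pdf"): "ff",
}
print(plan_uploads(hashes, {"ab.pdf"}, results))
```

Keeping the planning step free of network calls makes it easy to dry-run in CI before anything is written to storage.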
⚠️ New platform variant detected for test_kerning
Platform: macos-arm64
Hash distance from primary: 1 (within tolerance 5)
Hash: a1b2c3d4e5f7g7h8
Side-by-side comparison:
Primary: https://artifacts.github.com/{primary_hash}.pdf
Variant: https://artifacts.github.com/{variant_hash}.pdf
To approve: Comment "@matplotlib-bot approve-hash test_kerning macos-arm64"
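
On the bot side, handling the approval comment could look roughly like the sketch below. It assumes the command format shown in the message above and that the caller resolves the short test name to the full `module::test` key used in baseline_hashes.json. `parse_approval` and `record_variant` are hypothetical helpers; checking that the commenter is allowed to approve is outside this sketch.

```python
import re

# One approval command per line: "@matplotlib-bot approve-hash <test> <platform>"
APPROVE_RE = re.compile(
    r"^@matplotlib-bot approve-hash (?P<test>\S+) (?P<platform>\S+)\s*$",
    re.MULTILINE,
)


def parse_approval(comment_body: str):
    """Return (test, platform) if the comment contains an approval command."""
    match = APPROVE_RE.search(comment_body)
    if match is None:
        return None
    return match["test"], match["platform"]


def record_variant(hashes: dict, test_key: str, platform: str, variant_hash: str) -> None:
    """Store an approved variant hash under the given test entry (in place)."""
    entry = hashes[test_key]
    entry.setdefault("variants", {})[platform] = variant_hash


hashes = {"test_backend_pdf::test_kerning": {"primary": "a1b2c3d4e5f60718", "tolerance": 5}}
comment = "Looks fine to me.\n@matplotlib-bot approve-hash test_kerning macos-arm64\n"

approval = parse_approval(comment)
if approval is not None:
    _test, platform = approval
    record_variant(hashes, "test_backend_pdf::test_kerning", platform, "a1b2c3d4e5f70718")
print(hashes)
```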

import json
from pathlib import Path

import imagehash
from PIL import Image

def compute_image_hash(image_path, hash_size=8):
    """Compute perceptual hash of an image.

    hash_size=8 yields a 64-bit hash (16 hex characters). image_path must be
    a raster image that Pillow can open; PDF or SVG output has to be converted
    to PNG first.
    """
    with Image.open(image_path) as img:
        return str(imagehash.phash(img, hash_size=hash_size))

def hash_distance(hash_a, hash_b):
    """Hamming distance in bits between two hex hash strings."""
    return imagehash.hex_to_hash(hash_a) - imagehash.hex_to_hash(hash_b)

def load_baseline_hashes():
    """Load baseline_hashes.json from tests directory."""
    hash_file = Path(__file__).parent.parent / 'tests/baseline_hashes.json'
    if hash_file.exists():
        return json.loads(hash_file.read_text())
    return {}

def fetch_baseline_image(test_name, image_name, baseline_hash):
    """Download baseline image from storage if not cached locally."""
    cache_dir = Path.home() / '.matplotlib/baseline_cache'
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached_path = cache_dir / f"{baseline_hash}.png"
    if cached_path.exists():
        return cached_path
    # Download from storage
    url = f"https://storage.example.com/baselines/{baseline_hash}.png"
    # ... download logic ...
    return cached_path

def image_comparison(...):
    # Wrapper modification
    # (test_module, test_name, image_name and tol come from the enclosing decorator)
    def compare_with_hash_first(fig, result_path, expected_path):
        # Compute hash of generated image
        result_hash = compute_image_hash(result_path)

        # Load expected hashes
        hashes = load_baseline_hashes()
        expected_hash_data = hashes.get(f"{test_module}::{test_name}")

        if expected_hash_data:
            # Try primary hash
            primary_hash = expected_hash_data['primary']
            tolerance = expected_hash_data.get('tolerance', 5)
            if hash_distance(result_hash, primary_hash) <= tolerance:
                return  # Test passes!

            # Try platform variants
            for variant_name, variant_hash in expected_hash_data.get('variants', {}).items():
                if hash_distance(result_hash, variant_hash) <= tolerance:
                    return  # Test passes!

            # Hash mismatch - download baseline for pixel comparison
            expected_path = fetch_baseline_image(test_name, image_name, primary_hash)

        # Fall back to traditional pixel comparison; compare_images returns
        # None on success, so a non-empty result has to fail the test
        err = compare_images(expected_path, result_path, tol=tol)
        if err:
            raise ImageComparisonFailure(err)  # from matplotlib.testing.exceptions

def pytest_addoption(parser):
    parser.addoption('--accept-new-baseline', action='store_true',
                     help='Accept new baseline images and update hashes')
    parser.addoption('--accept-hash', action='store_true',
                     help='Accept current output hash as new baseline')
    parser.addoption('--accept-hash-variant',
                     help='Accept hash as platform variant (e.g., macos-arm64)')
    parser.addoption('--baseline-source', default='storage',
                     choices=['storage', 'ci-artifacts', 'local'],
                     help='Where to fetch baseline images from')

#!/usr/bin/env python
"""Upload baseline images to storage for hash entries."""
import argparse
import json
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--source', required=True, help='result_images/ directory')
    parser.add_argument('--hashes', required=True, help='baseline_hashes.json path')
    parser.add_argument('--storage', choices=['github-release', 's3', 'gcs'])
    args = parser.parse_args()

    hashes = json.loads(Path(args.hashes).read_text())
    for test_name, hash_data in hashes.items():
        # Find corresponding image in result_images/
        # Compute its hash, verify match
        # Upload if not already in storage
        ...

if __name__ == '__main__':
    main()

- Add imagehash dependency
- Implement hash computation/comparison in compare.py
- Create baseline_hashes.json schema
- Add pytest flags (--accept-new-baseline, etc.)
- Decide: GitHub releases vs separate repo vs cloud storage
- Implement upload_baseline_images.py script
- Configure CI workflow for image uploads
- Migrate PDF backend tests (~20 tests)
- Generate hashes for existing baselines
- Upload existing images to storage
- Test workflows on real PRs
- Migrate backend-by-backend
- Monitor hash tolerance effectiveness
- Collect platform variant data
- Adjust tolerances per test as needed
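
The "generate hashes for existing baselines" step could start from a script along these lines. It is a sketch under several assumptions: baselines live in one subdirectory per test module, only PNG files are hashed (PDF and SVG baselines would need rasterizing first), and entries are keyed by `module::image_stem` because the test function name cannot be recovered from the file name alone. `build_hash_index` is a hypothetical helper, and the hash function is passed in so the sketch does not depend on a particular hashing library.

```python
from datetime import date
from pathlib import Path
from typing import Callable


def build_hash_index(
    baseline_root: Path,
    hash_fn: Callable[[Path], str],
    tolerance: int = 5,
) -> dict:
    """Build a baseline_hashes.json-style mapping from existing PNG baselines."""
    today = date.today().isoformat()
    index = {}
    # Expected layout (assumed): <baseline_root>/<test_module>/<image>.png
    for image in sorted(baseline_root.glob("*/*.png")):
        key = f"{image.parent.name}::{image.stem}"
        index[key] = {
            "primary": hash_fn(image),
            "tolerance": tolerance,
            "metadata": {"created": today, "format": "png"},
        }
    return index
```

With imagehash installed, `hash_fn` could be `lambda p: str(imagehash.phash(Image.open(p)))`, matching the hashing call used earlier in this proposal, and the result can be written out with `json.dumps(index, indent=2, sort_keys=True)` to keep diffs stable.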
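Repo Size: 507MB → ~50MB (remove baseline images, keep hashes); because Git retains old blobs, the full reduction also requires dropping the image history, e.g. by rewriting it
Clone Speed: at least 40MB less data for new contributors, more if the image history is dropped
FreeType Updates: Commit a single updated JSON file instead of 100+ regenerated images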
Platform Flexibility: Tolerance automatically handles minor differences
Developer UX: Same @image_comparison decorator, faster tests (no downloads on pass)
Backward Compatible: Traditional pixel comparison still available as fallback
- imagehash library (BSD 2-Clause license; small pure-Python package, but it pulls in numpy, scipy, and PyWavelets)
- pillow (already a matplotlib dependency)
- Storage solution (GitHub releases = free, no new infrastructure)
- Storage location: GitHub releases (free, simple) vs dedicated repo (cleaner) vs cloud (more complex)?
- Hash tolerance defaults: Single global default or per-backend defaults?
- Variant approval: Fully automated for core devs or always require review?
- Migration timeline: All at once vs gradual backend-by-backend?