The CSRD document processing code in the carbon-txt-validator has several performance bottlenecks that can be addressed to improve speed and resource utilization. The main issues revolve around Arelle initialization, file loading, and redundant processing.
Location: src/carbon_txt/processors/csrd_document.py:35-86
Issue: The ArelleProcessor.__init__() method creates a new Arelle session and loads the report from scratch every time an instance is created. This involves:
- Starting a new Arelle session
- Loading taxonomy files
- Parsing the XBRL document
- Validating the document structure
Impact: This is extremely expensive and can take several seconds per report initialization. When processing multiple reports or when the CSRD plugin is called repeatedly, this overhead accumulates significantly.
Estimated time saved: 2-5 seconds per report initialization (depending on report size and network latency for taxonomy files)
Location: src/carbon_txt/processors/csrd_document.py:180-266
Issue: There's no caching mechanism for already-processed reports. If the same report URL is processed multiple times (which can happen in development/testing scenarios or when validating the same domain repeatedly), the entire expensive parsing process repeats.
Impact: Repeated validations of the same carbon.txt file with CSRD reports incur duplicate processing costs.
Location: src/carbon_txt/process_csrd_document.py:39-46
Issue: The ArelleProcessor is instantiated immediately when a "csrd-report" document type is detected, even if no datapoints will be queried or if the report fails to load. This means:
- File download and parsing happens before we know if we need the data
- Expensive operations occur even when they might not be necessary
Impact: Reports that fail to load or contain no matching datapoints still incur full processing costs.
Location: src/carbon_txt/processors/csrd_document.py:80-86
Issue: Error checking for unloadable files only happens after the entire Arelle session has completed. There's no quick validation to check if a file is even a valid XBRL document before attempting full processing.
Impact: Invalid files (HTML, text files, broken XML) still go through the full Arelle initialization process.
Location: src/carbon_txt/processors/csrd_document.py:98-160
Issue: The _get_datapoints_for_datapoint_code method performs individual lookups for each datapoint code using factsByLocalName.get(). This results in multiple dictionary accesses through the same structure.
Impact: Slight performance degradation, especially when querying many datapoints.
Issue: The keepOpen=True flag is set in Arelle options (line 76), but there's no corresponding mechanism to explicitly close resources when done; cleanup relies on garbage collection.
Impact: Memory usage can grow over time.
Implementation (Priority 1, lazy loading): Modify the GreenwebCSRDProcessor to defer Arelle initialization until datapoints are actually requested.

    class GreenwebCSRDProcessor:
        # ... existing code ...

        def __init__(self, report_url: typing.Optional[str] = None, ...):
            self.report_url = report_url
            self._arelle_processor = None  # Deferred initialization

        @property
        def arelle_processor(self):
            if self._arelle_processor is None:
                if not self.report_url:
                    raise ValueError("Report URL not set")
                self._arelle_processor = ArelleProcessor(self.report_url)
            return self._arelle_processor

        # ... rest of the code ...

Benefits:
- Skips expensive Arelle initialization if no datapoints are queried
- Constructing the processor no longer blocks or fails on an invalid report file; load errors surface only when datapoints are requested
- Can save 2-5 seconds per validation when CSRD reports are present but not needed
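
To show the deferral pattern in isolation, here is a minimal, self-contained sketch that uses `functools.cached_property` in place of the hand-written property. `ExpensiveLoader` and `LazyProcessor` are hypothetical stand-ins for `ArelleProcessor` and `GreenwebCSRDProcessor`, not the project's actual classes.

```python
from functools import cached_property
from typing import Optional


class ExpensiveLoader:
    """Hypothetical stand-in for ArelleProcessor; counts how often it is built."""

    instances_created = 0

    def __init__(self, report_url: str) -> None:
        ExpensiveLoader.instances_created += 1
        self.report_url = report_url


class LazyProcessor:
    """Hypothetical stand-in for GreenwebCSRDProcessor."""

    def __init__(self, report_url: Optional[str] = None) -> None:
        # Storing the URL is cheap; nothing is loaded yet.
        self.report_url = report_url

    @cached_property
    def loader(self) -> ExpensiveLoader:
        # Runs on first access only; the result is then stored on the instance.
        if not self.report_url:
            raise ValueError("Report URL not set")
        return ExpensiveLoader(self.report_url)
```

Constructing `LazyProcessor(url)` builds no loader. The first read of `.loader` creates it, and later reads reuse the same object.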
Implementation (Priority 2, quick file validation): Add a lightweight validation step before full Arelle processing:

    class ArelleProcessor:
        def __init__(self, report_url: str) -> None:
            # Quick validation first
            self._validate_file_type(report_url)
            # Then proceed with full Arelle processing
            self.report_url = report_url
            # ... rest of initialization ...

        def _validate_file_type(self, url: str) -> None:
            """Quick validation to ensure the file is likely an XBRL document"""
            try:
                response = HTTPClient().get_url(url, allow_redirects=True)
                content = response.text
            except Exception as e:
                raise NoLoadableCSRDFile(f"Could not validate file at {url}: {e}") from e
            # Check for basic XBRL markers outside the try block, so this
            # specific error is not caught and re-wrapped by the handler above
            if "<?xml" not in content and "xbrl" not in content.lower():
                raise NoLoadableCSRDFile(f"File at {url} does not appear to be an XBRL document")

Benefits:
- Immediate rejection of non-XBRL files
- Avoids Arelle initialization for invalid files
- Faster failure for broken links or wrong file types
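
The check above reads the whole response body. A cheaper variant might inspect only the first few kilobytes. The sketch below shows the sniffing step on its own; `looks_like_xbrl` and the marker list are assumptions made for illustration, and how the leading bytes are fetched (for example a ranged or streamed request) is left out because it depends on the HTTP client in use.

```python
# Markers chosen for illustration (assumption): an XML declaration, an
# <xbrl> root element, or the XBRL / Inline XBRL namespace URIs.
# All markers are lowercase because the sample is lowercased before matching.
XBRL_MARKERS = (
    b"<?xml",
    b"<xbrl",
    b"http://www.xbrl.org/2003/instance",
    b"http://www.xbrl.org/2013/inlinexbrl",
)


def looks_like_xbrl(head: bytes, sample_size: int = 4096) -> bool:
    """Return True if the leading bytes of a file suggest XBRL or iXBRL."""
    sample = head[:sample_size].lower()
    return any(marker in sample for marker in XBRL_MARKERS)
```

This is a coarse filter: like the original check, it accepts any file with an XML declaration, so it only rules out obviously wrong content such as plain text or an HTML error page.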
Implementation (Priority 3, caching): Add a simple cache for processed reports (can be enhanced later with LRU or time-based eviction):

    # In validators.py or a new cache module
    from functools import lru_cache

    # Or custom cache with URL key
    class CSRDCache:
        _cache = {}

        @classmethod
        def get(cls, report_url: str):
            return cls._cache.get(report_url)

        @classmethod
        def set(cls, report_url: str, processor: ArelleProcessor):
            cls._cache[report_url] = processor

Usage in process_csrd_document.py:

    def process_document(document, logs):
        if document.doc_type == "csrd-report":
            cached_processor = CSRDCache.get(document.url)
            if cached_processor is not None:
                processor = GreenwebCSRDProcessor(arelle_processor=cached_processor)
            else:
                try:
                    processor = GreenwebCSRDProcessor(report_url=document.url)
                    # Reading arelle_processor triggers the Arelle load,
                    # so load failures are raised and handled here
                    CSRDCache.set(document.url, processor.arelle_processor)
                except Exception as e:
                    log_safely(f"Failed to process report: {e}", logs)
                    return {"logs": logs}
            # ... rest of processing ...

Benefits:
- Eliminates duplicate processing for the same report
- Especially beneficial during development/testing
- Reduces server load when multiple validations occur
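
An unbounded dictionary keeps every processor alive for the life of the process. As one possible shape for the LRU and time-based eviction mentioned above, here is a minimal sketch. `BoundedTTLCache`, its defaults, and the injectable `clock` parameter are all hypothetical, and the class is not thread-safe as written.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class BoundedTTLCache:
    """Hypothetical URL-keyed cache with a size cap and per-entry expiry."""

    def __init__(
        self,
        max_size: int = 32,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # entry has expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

Because `get` and `set` have the same shape as `CSRDCache`, an instance could stand in for the class-level dictionary. If validations run concurrently, the methods would additionally need a lock.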
Implementation (Priority 4, datapoint lookups): Cache the factsByLocalName dictionary and perform batch lookups:

    class ArelleProcessor:
        def __init__(self, report_url: str) -> None:
            # ... existing code ...
            self._facts_cache = None

        @property
        def facts(self):
            if self._facts_cache is None:
                self._facts_cache = self.xbrls[0].factsByLocalName
            return self._facts_cache

        def _get_datapoints_for_datapoint_code(self, datapoint_code: str, esrs_datapoints: dict):
            res = self.facts.get(datapoint_code)  # Use cached version
            # ... rest of method ...

Benefits:
- Reduces repeated dictionary access overhead
- Slightly cleaner API for accessing facts
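
The snippet above caches the dictionary reference but does not show the batch lookup itself. A minimal sketch of that step follows, assuming `factsByLocalName` behaves like a mapping from a local name to a collection of facts. `lookup_datapoints` is a hypothetical helper, and the keys in the usage example are placeholders rather than real datapoint codes.

```python
from typing import Dict, Iterable, List, Mapping


def lookup_datapoints(
    facts_by_name: Mapping[str, Iterable[object]],
    datapoint_codes: Iterable[str],
) -> Dict[str, List[object]]:
    """Resolve many datapoint codes against one facts mapping in a single pass."""
    found: Dict[str, List[object]] = {}
    for code in datapoint_codes:
        facts = facts_by_name.get(code)
        if facts:  # skip codes with no matching facts
            found[code] = list(facts)
    return found


# Usage with placeholder data:
facts = {"CodeA": {"fact-1"}, "CodeB": set()}
matches = lookup_datapoints(facts, ["CodeA", "CodeB", "CodeC"])
```

Using `.get()` rather than indexing means a missing code returns `None` instead of raising, and it does not add empty entries if the mapping happens to be a `defaultdict`.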
Implementation (Priority 5, error handling): Add better error handling and retry logic:

    # In process_csrd_document.py
    import time

    MAX_RETRIES = 2  # total number of attempts
    RETRY_DELAY = 1.0  # seconds to wait between attempts

    def process_document(document, logs):
        if document.doc_type == "csrd-report":
            for attempt in range(MAX_RETRIES):
                try:
                    processor = GreenwebCSRDProcessor(report_url=document.url)
                    # ... rest of processing ...
                    break
                except NoLoadableCSRDFile as e:
                    if attempt == MAX_RETRIES - 1:
                        log_safely(f"Failed to load CSRD report after {MAX_RETRIES} attempts: {e}", logs)
                        return {"logs": logs}
                    # Log before sleeping so the message is not delayed,
                    # and report the number of the upcoming attempt
                    log_safely(f"Retrying CSRD report loading (attempt {attempt + 2} of {MAX_RETRIES})...", logs)
                    time.sleep(RETRY_DELAY)

Benefits:
- Handles transient network issues gracefully
- Provides feedback about retry attempts
- Better user experience during temporary failures
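
If a fixed delay proves too aggressive or too slow, one common alternative is exponential backoff. The sketch below is a generic helper, not part of the validator. `call_with_retries` and its parameters are hypothetical, and the `sleep` argument is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on the given exceptions with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller log and handle it
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Retrying only makes sense for transient failures such as network errors. A file that is simply not XBRL will fail the same way on every attempt, so it may be worth using a separate exception type for that case and excluding it from `retry_on`.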
Based on the optimizations above:
- Best-case scenario (non-CSRD reports or quick failures):
  - Current: 2-5 seconds for Arelle initialization
  - After: <1 second (quick validation only)
  - Speedup: 2-5x faster
- Typical case (valid CSRD reports processed once):
  - Current: 5-10 seconds total (init + processing)
  - After: 3-8 seconds total
  - Speedup: ~1.25-1.7x faster
- Repeated validations (same report multiple times):
  - Current: 5-10 seconds each time
  - After: 5-10 seconds first time, <1 second subsequent times (with caching)
  - Speedup: 5-10x faster for repeated access
- Overall system resource usage:
  - Current: High memory/CPU for every validation with CSRD reports
  - After: Lower baseline, spikes only when actually processing reports
  - Resource savings: an estimated 30-50% reduction in average usage
- Phase 1 (Quick Wins - 1-2 days):
  - Add quick file validation (Priority 2)
  - Implement lazy loading (Priority 1)
  - Add simple caching (Priority 3)
  - Fix error handling (Priority 5)
- Phase 2 (Optimizations - 1-2 days):
  - Optimize datapoint lookups (Priority 4)
  - Add proper resource cleanup
  - Consider adding timeout controls for Arelle operations
- Ensure Arelle resources are properly closed when the processor is done
- Consider using context managers for resource cleanup
- Monitor memory usage in production to ensure no leaks
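
As a sketch of the context-manager approach, the standard library's `contextlib.closing` guarantees that `close()` is called when the block exits. `FakeSession` and `read_report` below are hypothetical stand-ins; whether the real session object exposes a `close()` method is an assumption that would need checking against the Arelle API.

```python
from contextlib import closing
from typing import Callable


class FakeSession:
    """Hypothetical stand-in for a session object that holds open resources."""

    def __init__(self) -> None:
        self.closed = False

    def load(self, url: str) -> str:
        if self.closed:
            raise RuntimeError("session already closed")
        return f"loaded {url}"

    def close(self) -> None:
        self.closed = True


def read_report(url: str, session_factory: Callable[[], FakeSession] = FakeSession) -> str:
    # closing() calls session.close() on exit, even if load() raises.
    with closing(session_factory()) as session:
        return session.load(url)
```

With this shape, cleanup no longer depends on when the garbage collector runs. The trade-off is that anything read from the session must be extracted before the block exits.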
Add configuration knobs for:
- Cache size/TTL
- Maximum report size
- Timeout settings for different operations
- Logging verbosity for Arelle operations
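
One way to group these knobs is a small settings object populated from environment variables. Everything in the sketch below is hypothetical: the class name, the field names, the defaults, and the `CSRD_*` variable names are placeholders, not existing configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CSRDSettings:
    """Hypothetical settings object; names and defaults are illustrative."""

    cache_max_entries: int = 32
    cache_ttl_seconds: float = 300.0
    max_report_bytes: int = 50 * 1024 * 1024
    load_timeout_seconds: float = 30.0
    arelle_log_level: str = "WARNING"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "CSRDSettings":
        # Unset variables fall back to the dataclass defaults.
        d = cls()
        return cls(
            cache_max_entries=int(env.get("CSRD_CACHE_MAX_ENTRIES", d.cache_max_entries)),
            cache_ttl_seconds=float(env.get("CSRD_CACHE_TTL_SECONDS", d.cache_ttl_seconds)),
            max_report_bytes=int(env.get("CSRD_MAX_REPORT_BYTES", d.max_report_bytes)),
            load_timeout_seconds=float(env.get("CSRD_LOAD_TIMEOUT_SECONDS", d.load_timeout_seconds)),
            arelle_log_level=str(env.get("CSRD_ARELLE_LOG_LEVEL", d.arelle_log_level)).upper(),
        )
```

Passing the environment in as a mapping keeps the loader easy to test, and the frozen dataclass prevents settings from being changed halfway through a validation run.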
Add instrumentation to track:
- Time spent in Arelle initialization vs processing
- Cache hit/miss ratios
- Report processing success/failure rates
- Memory usage during processing
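
A lightweight starting point for the timing and cache metrics is an in-process collector, sketched below. The `Metrics` class, the phase names, and the counter names are hypothetical; a production setup would more likely forward these values to whatever monitoring system is already in place.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List


class Metrics:
    """Hypothetical in-process metrics holder: phase timings and counters."""

    def __init__(self) -> None:
        self.timings: DefaultDict[str, List[float]] = defaultdict(list)
        self.counters: DefaultDict[str, int] = defaultdict(int)

    @contextmanager
    def timed(self, phase: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even when the timed block raises.
            self.timings[phase].append(time.perf_counter() - start)

    def count(self, name: str) -> None:
        self.counters[name] += 1

    def hit_ratio(self) -> float:
        hits = self.counters["cache_hit"]
        misses = self.counters["cache_miss"]
        total = hits + misses
        return hits / total if total else 0.0
```

Wrapping the Arelle load in `with metrics.timed("arelle_init"):` and the datapoint queries in `with metrics.timed("processing"):` would separate the two costs that the estimates in this document currently lump together.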
- Performance benchmarking:
  - Measure before/after timings for different scenarios
  - Test with various report sizes and network conditions
- Functional testing:
  - Ensure all existing functionality works with changes
  - Test edge cases (invalid files, network failures)
  - Verify cache invalidation works correctly
- Load testing:
  - Test with multiple concurrent validations
  - Measure resource usage under load
  - Ensure stability with cached vs non-cached reports
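
For the concurrent case, a small harness built on `concurrent.futures` may be enough to get started. `run_concurrently` is a hypothetical helper, and `validate` stands for whatever callable runs one validation for a URL.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_concurrently(
    validate: Callable[[str], object],
    urls: Sequence[str],
    workers: int = 4,
) -> List[float]:
    """Run validate(url) for every URL on a thread pool; return per-call seconds."""

    def timed_call(url: str) -> float:
        start = time.perf_counter()
        validate(url)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so durations line up with urls.
        return list(pool.map(timed_call, urls))
```

Passing the same URL several times exercises the cached path, while distinct URLs exercise the uncached one. A shared cache backed by a plain dictionary is not synchronized, so this kind of run is also a reasonable place to look for race conditions.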
By implementing these optimizations in priority order, the CSRD processing code should become significantly faster while keeping all existing functionality. Lazy loading alone could cut validation time by an estimated 50-80% in cases where a CSRD report is present but its datapoints are never queried, making the validator noticeably more responsive overall.