This guide provides a streamlined approach to implementing RAGAS evaluation while managing OpenAI API rate limits effectively. It's designed to be straightforward, visual, and actionable.
RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating RAG systems with:
- Objective metrics without human annotations
- Synthetic test data generation
- Comprehensive evaluation workflows
The challenge: RAGAS metrics that use LLMs make multiple API calls, which can quickly hit rate limits.
Start by installing RAGAS:

```bash
pip install ragas==0.2.15
```
```python
import time

from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextRecall,
    ContextPrecision,
)
from ragas.run_config import RunConfig
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Set up evaluator LLM
evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o", temperature=0)
)

# Create rate-friendly configuration
rate_friendly_config = RunConfig(
    timeout=300,        # 5 minutes max for operations
    max_retries=15,     # More retries for rate limits
    max_wait=90,        # Longer wait between retries
    max_workers=8,      # Fewer concurrent API calls
    log_tenacity=True   # Log retry attempts
)
```
🚦 Traffic Light System for Metrics
| Metric | API Usage | Rate Impact | When to Use |
|---|---|---|---|
| 🔴 Faithfulness | Very High | Critical | Small datasets, critical evaluation |
| 🔴 LLMContextRecall | Very High | Critical | Small datasets, retrieval quality |
| 🟠 FactualCorrectness | High | Significant | Medium datasets, factual accuracy |
| 🟠 AspectCritic | High | Significant | Custom evaluation criteria |
| 🟢 ResponseRelevancy | Medium | Moderate | Larger datasets, user experience |
| 🟢 ContextPrecision | Medium | Moderate | Larger datasets, retrieval precision |
| 🔵 StringPresence | None | None | Any dataset size, basic checks |
| 🔵 ExactMatch | None | None | Any dataset size, direct matching |
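If you want the traffic-light tiers available in code, one option is a simple lookup table. This grouping is just a convenience for this guide, not something RAGAS provides, and AspectCritic needs a name and definition of your own:

```python
from ragas.metrics import (
    Faithfulness, LLMContextRecall,       # 🔴 very high API usage
    FactualCorrectness, AspectCritic,     # 🟠 high API usage
    ResponseRelevancy, ContextPrecision,  # 🟢 medium API usage
    StringPresence, ExactMatch,           # 🔵 no LLM calls
)

METRIC_TIERS = {
    "red": [Faithfulness(), LLMContextRecall()],
    "orange": [
        FactualCorrectness(),
        AspectCritic(
            name="correctness",
            definition="Is the response factually consistent with the reference?",
        ),
    ],
    "green": [ResponseRelevancy(), ContextPrecision()],
    "blue": [StringPresence(), ExactMatch()],
}
```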
For datasets with more than 10 samples:
```python
def batch_process_evaluation(dataset, metrics, batch_size=5, pause_seconds=120):
    """Process evaluations in batches to avoid rate limits."""
    # Split dataset into batches
    batches = [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
    print(f"Processing {len(dataset)} samples in {len(batches)} batches")

    all_results = []
    for i, batch in enumerate(batches):
        print(f"Processing batch {i + 1}/{len(batches)}")

        # Run evaluation for this batch
        batch_result = evaluate(
            dataset=batch,
            metrics=metrics,
            llm=evaluator_llm,
            run_config=rate_friendly_config
        )
        all_results.append(batch_result)

        # Pause between batches (except after the last one)
        if i < len(batches) - 1:
            print(f"Pausing for {pause_seconds} seconds...")
            time.sleep(pause_seconds)

    return all_results
```
You can combine both approaches: run the heavy metrics on a small subset and the lighter metrics over the full dataset in batches.

```python
# Heavy metrics on small subset
# (full_dataset is assumed to be your complete evaluation dataset)
heavy_metrics = [Faithfulness(), LLMContextRecall()]
subset_size = min(10, len(full_dataset))
subset = full_dataset[:subset_size]

heavy_results = evaluate(
    dataset=subset,
    metrics=heavy_metrics,
    llm=evaluator_llm,
    run_config=rate_friendly_config
)

# Lighter metrics on full dataset
light_metrics = [ResponseRelevancy(), ContextPrecision()]
light_results = batch_process_evaluation(
    dataset=full_dataset,
    metrics=light_metrics,
    batch_size=10,
    pause_seconds=60
)
```
The decision flow below summarizes how to choose an evaluation strategy:

```text
START
├── Dataset Size?
│   ├── Small (<10 samples)
│   │   └── Use All Metrics with Default Config
│   │
│   ├── Medium (10-50 samples)
│   │   ├── Heavy Metrics → On Small Subset
│   │   └── Medium/Light Metrics → Batch Process
│   │
│   └── Large (>50 samples)
│       ├── Heavy Metrics → On Tiny Subset
│       ├── Medium Metrics → On Small Subset
│       └── Light Metrics → Batch Process
│
├── Rate Limit Encountered?
│   ├── YES
│   │   ├── Reduce max_workers
│   │   ├── Increase pause between batches
│   │   └── Consider model fallback
│   │
│   └── NO
│       └── Continue processing
│
└── END
```
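If you prefer to encode that flow, a small helper might look like the sketch below; the thresholds mirror the diagram, and the subset and batch sizes are illustrative defaults rather than RAGAS settings.

```python
def plan_evaluation(n_samples: int) -> dict:
    """Turn the decision tree above into concrete knobs (illustrative values)."""
    if n_samples < 10:      # Small: run everything with the default config
        return {"heavy_subset": n_samples, "batch_size": n_samples, "pause": 0}
    if n_samples <= 50:     # Medium: heavy metrics on a subset, the rest batched
        return {"heavy_subset": 10, "batch_size": 10, "pause": 60}
    return {"heavy_subset": 5, "batch_size": 10, "pause": 90}  # Large
```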
Recommended models and rate parameters by evaluation need:

| Evaluation Need | Recommended Model | Rate Parameters |
|---|---|---|
| Development/Testing | GPT-3.5-Turbo | max_workers=12, batch_size=15 |
| Standard Evaluation | GPT-4o-mini | max_workers=10, batch_size=10 |
| Critical Evaluation | GPT-4o | max_workers=8, batch_size=8 |
| Research Grade | GPT-4.1 | max_workers=6, batch_size=5 |
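These presets can be captured in a small lookup so the RunConfig and batch parameters stay in sync with the table. The values are copied from above; the dictionary and helper names are our own, not part of RAGAS.

```python
# Presets mirroring the table above; tune them to your own rate limits
RATE_PRESETS = {
    "gpt-3.5-turbo": {"max_workers": 12, "batch_size": 15},
    "gpt-4o-mini":   {"max_workers": 10, "batch_size": 10},
    "gpt-4o":        {"max_workers": 8,  "batch_size": 8},
    "gpt-4.1":       {"max_workers": 6,  "batch_size": 5},
}

def run_config_for(model_name: str) -> RunConfig:
    """Build a RunConfig using the preset for the given model."""
    preset = RATE_PRESETS.get(model_name, {"max_workers": 4, "batch_size": 5})
    return RunConfig(
        timeout=300,
        max_retries=15,
        max_wait=90,
        max_workers=preset["max_workers"],
    )
```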
Problem: Rate limit errors persist despite the rate-friendly configuration.
Solution: Reduce max_workers by 50% and increase the pause between batches.
```python
# More conservative config
conservative_config = RunConfig(
    max_workers=4,   # Very conservative
    timeout=300,
    max_retries=15,
    max_wait=120
)
```
Problem: Evaluation runs take too long.
Solution: Balance processing time against rate limits by tuning batch size to the model's throughput.
```python
# Faster processing with careful rate management
model = "gpt-4o-mini"  # for example; use the name of your evaluator model

if model.startswith("gpt-3.5"):
    batch_size = 15  # Higher throughput models
    pause = 45       # Shorter pause
elif model.startswith("gpt-4o-mini"):
    batch_size = 10
    pause = 60
else:  # GPT-4.1 or GPT-4o
    batch_size = 5   # Lower throughput models
    pause = 90       # Longer pause
```
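These values can then be passed straight to the batching helper defined earlier:

```python
results = batch_process_evaluation(
    dataset=full_dataset,
    metrics=light_metrics,
    batch_size=batch_size,
    pause_seconds=pause,
)
```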
Problem: Complex evaluations time out.
Solution: Increase the timeout parameter in RunConfig.
```python
# For complex evaluations
timeout_config = RunConfig(
    timeout=600,  # 10 minutes for complex operations
    max_retries=15
)
```
- Start small and scale gradually: Begin with smaller test sets and simpler metrics.
- Monitor API usage: Keep track of rate limit errors to adjust parameters.
- Batch strategically: Use smaller batches for heavy metrics, larger for lighter ones.
- Use fallbacks: Have a backup plan when primary models hit rate limits (see the sketch after this list).
- Iterate and optimize: Refine your approach based on results and rate limit experiences.
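For the fallback point above, a rough sketch is shown below. This is our own helper, not part of RAGAS; note that evaluate() only propagates errors when raise_exceptions=True, and the exact exception type that surfaces can vary with your client setup.

```python
from openai import RateLimitError

def evaluate_with_fallback(dataset, metrics, primary_llm, fallback_llm, run_config):
    """Retry a whole evaluation on a cheaper model if the primary one
    keeps hitting rate limits even after RAGAS's internal retries."""
    try:
        return evaluate(
            dataset=dataset,
            metrics=metrics,
            llm=primary_llm,
            run_config=run_config,
            raise_exceptions=True,  # surface failures instead of silent NaN scores
        )
    except RateLimitError:
        # Depending on your setup the error may surface as a different type;
        # adjust the except clause to match what your client actually raises.
        print("Primary model rate-limited; retrying with the fallback model")
        return evaluate(
            dataset=dataset,
            metrics=metrics,
            llm=fallback_llm,
            run_config=run_config,
        )
```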
By following this user-friendly guide, you can implement RAGAS evaluation effectively while managing OpenAI API rate limits.