RAGAS Implementation Guide

This guide provides a streamlined approach to implementing RAGAS evaluation while managing OpenAI API rate limits effectively. It's designed to be straightforward, visual, and actionable.

Quick Overview

RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating RAG systems with:

  • Objective metrics without human annotations
  • Synthetic test data generation
  • Comprehensive evaluation workflows

The challenge: RAGAS metrics that use LLMs make multiple API calls per sample, so evaluations can hit rate limits quickly.

Step-by-Step Implementation

Step 1: Install RAGAS

pip install ragas==0.2.15

Step 2: Basic RAGAS Setup

from ragas.metrics import Faithfulness, ResponseRelevancy
from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Set up evaluator LLM
evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o", temperature=0)
)

# Create rate-friendly configuration
rate_friendly_config = RunConfig(
    timeout=300,          # 5 minutes max for operations
    max_retries=15,       # More retries for rate limits
    max_wait=90,          # Longer wait between retries
    max_workers=8,        # Fewer concurrent API calls
    log_tenacity=True     # Log retry attempts
)
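
The `evaluate` and batching calls below expect a RAGAS evaluation dataset. Here is a minimal sketch of building one, assuming the `SingleTurnSample` and `EvaluationDataset` classes from ragas 0.2 and purely illustrative field values:

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset

# Illustrative sample; in practice these values come from your RAG pipeline
samples = [
    SingleTurnSample(
        user_input="What is RAGAS?",
        retrieved_contexts=["RAGAS is a framework for evaluating RAG pipelines."],
        response="RAGAS is an evaluation framework for RAG systems.",
        reference="RAGAS is a framework for evaluating retrieval augmented generation.",
    )
]

full_dataset = EvaluationDataset(samples=samples)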

Step 3: Choose Metrics Strategically

🚦 Traffic Light System for Metrics

| Metric | API Usage | Rate Impact | When to Use |
|--------|-----------|-------------|-------------|
| 🔴 Faithfulness | Very High | Critical | Small datasets, critical evaluation |
| 🔴 LLMContextRecall | Very High | Critical | Small datasets, retrieval quality |
| 🟠 FactualCorrectness | High | Significant | Medium datasets, factual accuracy |
| 🟠 AspectCritic | High | Significant | Custom evaluation criteria |
| 🟢 ResponseRelevancy | Medium | Moderate | Larger datasets, user experience |
| 🟢 ContextPrecision | Medium | Moderate | Larger datasets, retrieval precision |
| 🔵 StringPresence | None | None | Any dataset size, basic checks |
| 🔵 ExactMatch | None | None | Any dataset size, direct matching |
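
As a sketch of how these tiers might be grouped in code (metric class names as exported by ragas 0.2; the `AspectCritic` name and definition are illustrative):

from ragas.metrics import (
    Faithfulness, LLMContextRecall,       # red: very high impact
    FactualCorrectness, AspectCritic,     # orange: high impact
    ResponseRelevancy, ContextPrecision,  # green: moderate impact
    StringPresence, ExactMatch,           # blue: no API calls
)

# Group metrics by rate-limit impact so each tier can be scheduled differently
red_metrics    = [Faithfulness(), LLMContextRecall()]
orange_metrics = [FactualCorrectness(),
                  AspectCritic(name="conciseness",
                               definition="Is the response free of unnecessary detail?")]
green_metrics  = [ResponseRelevancy(), ContextPrecision()]
blue_metrics   = [StringPresence(), ExactMatch()]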

Step 4: Implement Batch Processing

For datasets with more than 10 samples:

import time

def batch_process_evaluation(dataset, metrics, batch_size=5, pause_seconds=120):
    """Process evaluations in batches to avoid rate limits."""
    # Split dataset into batches
    batches = [dataset[i:i+batch_size] for i in range(0, len(dataset), batch_size)]
    print(f"Processing {len(dataset)} samples in {len(batches)} batches")
    
    all_results = []
    for i, batch in enumerate(batches):
        print(f"Processing batch {i+1}/{len(batches)}")
        
        # Run evaluation for this batch
        batch_result = evaluate(
            dataset=batch,
            metrics=metrics,
            llm=evaluator_llm,
            run_config=rate_friendly_config
        )
        
        all_results.append(batch_result)
        
        # Pause between batches (except after the last one)
        if i < len(batches) - 1:
            print(f"Pausing for {pause_seconds} seconds...")
            time.sleep(pause_seconds)
    
    return all_results
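
Each batch produces its own result object. To get a single view across batches, one option is to concatenate the per-batch score tables (this sketch assumes the result objects expose `to_pandas()`, as ragas 0.2 results do, and that `full_dataset` and the Step 2 configuration are already defined):

import pandas as pd

results = batch_process_evaluation(full_dataset, metrics=[ResponseRelevancy()])

# Combine per-batch score tables into one DataFrame and report mean scores
scores_df = pd.concat([r.to_pandas() for r in results], ignore_index=True)
print(scores_df.mean(numeric_only=True))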

Step 5: Use Multi-Tier Evaluation Approach

# Import the additional metrics used in this step
from ragas.metrics import LLMContextRecall, ContextPrecision

# Heavy metrics on a small subset
heavy_metrics = [Faithfulness(), LLMContextRecall()]
subset_size = min(10, len(full_dataset))
subset = full_dataset[:subset_size]

heavy_results = evaluate(
    dataset=subset,
    metrics=heavy_metrics,
    llm=evaluator_llm,
    run_config=rate_friendly_config
)

# Lighter metrics on full dataset
light_metrics = [ResponseRelevancy(), ContextPrecision()]
light_results = batch_process_evaluation(
    dataset=full_dataset,
    metrics=light_metrics,
    batch_size=10,
    pause_seconds=60
)

Visual Decision Tree for RAGAS Implementation

START
  ├── Dataset Size?
  │   ├── Small (<10 samples)
  │   │   └── Use All Metrics with Default Config
  │   │
  │   ├── Medium (10-50 samples)
  │   │   ├── Heavy Metrics → On Small Subset
  │   │   └── Medium/Light Metrics → Batch Process
  │   │
  │   └── Large (>50 samples)
  │       ├── Heavy Metrics → On Tiny Subset
  │       ├── Medium Metrics → On Small Subset
  │       └── Light Metrics → Batch Process
  │
  ├── Rate Limit Encountered?
  │   ├── YES
  │   │   ├── Reduce max_workers
  │   │   ├── Increase pause between batches
  │   │   └── Consider model fallback
  │   │
  │   └── NO
  │       └── Continue processing
  │
  └── END
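
The tree can also be encoded as a small helper that picks an evaluation plan from the dataset size. This is a hypothetical sketch; the `choose_eval_plan` name, thresholds, and numbers simply mirror the tree and tables in this guide, not anything built into RAGAS:

def choose_eval_plan(num_samples: int) -> dict:
    """Map dataset size to an evaluation plan, following the decision tree above."""
    if num_samples < 10:     # Small: all metrics, default-style settings
        return {"heavy_subset": num_samples, "batch_size": num_samples,
                "pause_seconds": 0, "max_workers": 16}
    if num_samples <= 50:    # Medium: heavy metrics on a subset, batch the rest
        return {"heavy_subset": 10, "batch_size": 10,
                "pause_seconds": 60, "max_workers": 8}
    # Large: heavy metrics on a tiny subset, light metrics batched conservatively
    return {"heavy_subset": 5, "batch_size": 10,
            "pause_seconds": 90, "max_workers": 4}

plan = choose_eval_plan(len(full_dataset))
rate_friendly_config = RunConfig(max_workers=plan["max_workers"],
                                 timeout=300, max_retries=15, max_wait=90)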

Model Selection Guide

| Evaluation Need | Recommended Model | Rate Parameters |
|-----------------|-------------------|-----------------|
| Development/Testing | GPT-3.5-Turbo | max_workers=12, batch_size=15 |
| Standard Evaluation | GPT-4o-mini | max_workers=10, batch_size=10 |
| Critical Evaluation | GPT-4o | max_workers=8, batch_size=8 |
| Research Grade | GPT-4.1 | max_workers=6, batch_size=5 |
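
If you switch evaluator models often, the same table can live in code as a simple lookup (values copied from the table above; tune them to your own account's rate limits):

# Rate parameters per evaluator model, mirroring the table above
MODEL_RATE_PARAMS = {
    "gpt-3.5-turbo": {"max_workers": 12, "batch_size": 15},
    "gpt-4o-mini":   {"max_workers": 10, "batch_size": 10},
    "gpt-4o":        {"max_workers": 8,  "batch_size": 8},
    "gpt-4.1":       {"max_workers": 6,  "batch_size": 5},
}

model_name = "gpt-4o-mini"
params = MODEL_RATE_PARAMS[model_name]
run_config = RunConfig(max_workers=params["max_workers"], timeout=300,
                       max_retries=15, max_wait=90)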

Troubleshooting Common Issues

1. "Rate limit exceeded" errors

Solution: Reduce max_workers by 50% and increase pause between batches.

# More conservative config
conservative_config = RunConfig(
    max_workers=4,  # Very conservative
    timeout=300,
    max_retries=15,
    max_wait=120
)

2. Slow evaluation progress

Solution: Balance processing time with rate limits by optimizing batch size.

# Faster processing with careful rate management
model = "gpt-4o-mini"  # the evaluator model name in use (example value)
if model.startswith("gpt-3.5"):
    batch_size = 15  # Higher throughput models
    pause = 45       # Shorter pause
elif model.startswith("gpt-4o-mini"):
    batch_size = 10
    pause = 60
else:  # GPT-4.1 or GPT-4o
    batch_size = 5   # Lower throughput models
    pause = 90       # Longer pause

3. Evaluation timeouts

Solution: Increase timeout parameter in RunConfig.

# For complex evaluations
timeout_config = RunConfig(
    timeout=600,  # 10 minutes for complex operations
    max_retries=15
)

Key Takeaways

  1. Start small and scale gradually: Begin with smaller test sets and simpler metrics.
  2. Monitor API usage: Keep track of rate limit errors to adjust parameters.
  3. Batch strategically: Use smaller batches for heavy metrics, larger for lighter ones.
  4. Use fallbacks: Have a backup plan for when your primary model hits rate limits (see the sketch after this list).
  5. Iterate and optimize: Refine your approach based on results and rate limit experiences.
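
A minimal fallback sketch for takeaway 4, assuming the OpenAI Python client (v1+) is installed, that `openai.RateLimitError` propagates when `evaluate` is called with `raise_exceptions=True`, and that gpt-4o-mini is an acceptable backup evaluator:

import openai

def evaluate_with_fallback(dataset, metrics, run_config):
    """Try the primary evaluator first; fall back to a cheaper model if rate limits persist."""
    try:
        return evaluate(dataset=dataset, metrics=metrics, llm=evaluator_llm,
                        run_config=run_config, raise_exceptions=True)
    except openai.RateLimitError:
        # Hypothetical fallback: switch to gpt-4o-mini as a cheaper backup evaluator
        print("Primary model rate-limited; retrying with gpt-4o-mini")
        fallback_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
        return evaluate(dataset=dataset, metrics=metrics, llm=fallback_llm,
                        run_config=run_config, raise_exceptions=True)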

By following this user-friendly guide, you can implement RAGAS evaluation effectively while managing OpenAI API rate limits.

Metric Rate-Impact Diagram (Mermaid)

flowchart TB
    classDef highImpact fill:#ff9999,stroke:#333,stroke-width:1px
    classDef mediumImpact fill:#ffcc99,stroke:#333,stroke-width:1px
    classDef lowImpact fill:#99ff99,stroke:#333,stroke-width:1px
    classDef noImpact fill:#99ccff,stroke:#333,stroke-width:1px
    classDef framework fill:#f5f5f5,stroke:#333,stroke-width:2px
    
    A["RAGAS Evaluation<br>Framework v0.2.15"]:::framework
    A --> B["LLM-Based Metrics<br>(API Calls Required)"]
    A --> C["Non-LLM Metrics<br>(No API Calls)"]
    
    subgraph HighImpactMetrics [High Rate Limit Impact Metrics]
        direction TB
        H["Faithfulness"]:::highImpact --- H1["⚠️ Very High Impact<br>Multiple complex LLM calls"]:::highImpact
        I["LLMContextRecall"]:::highImpact --- I1["⚠️ Very High Impact<br>Detailed context analysis"]:::highImpact
        J["FactualCorrectness"]:::mediumImpact --- J1["⚠️ High Impact<br>Reference comparison"]:::mediumImpact
        AG["AspectCritic"]:::mediumImpact --- AG1["⚠️ High Impact<br>Custom evaluation criteria"]:::mediumImpact
        AH["AgentGoalAccuracy"]:::mediumImpact --- AH1["⚠️ High Impact<br>Complex goal assessment"]:::mediumImpact
    end
    
    subgraph NoImpactMetrics [No Rate Limit Impact Metrics]
        direction TB
        K["StringPresence"]:::noImpact --- K1["βœ… No Impact<br>Simple text matching"]:::noImpact
        L["ExactMatch"]:::noImpact --- L1["βœ… No Impact<br>Binary comparison"]:::noImpact
        M["BLEUScore"]:::noImpact --- M1["βœ… No Impact<br>Translation metric"]:::noImpact
        N["ROUGEScore"]:::noImpact --- N1["βœ… No Impact<br>Summarization metric"]:::noImpact
    end
    
    B --- HighImpactMetrics
    C --- NoImpactMetrics