RAGAS Implementation Guide

This guide provides a streamlined approach to implementing RAGAS evaluation while managing OpenAI API rate limits effectively. It's designed to be straightforward, visual, and actionable.

Quick Overview

RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating RAG systems with:

  • Objective metrics without human annotations
  • Synthetic test data generation
  • Comprehensive evaluation workflows

The challenge: RAGAS metrics that use LLMs make multiple API calls per sample, so evaluations can hit rate limits quickly.

Step-by-Step Implementation

Step 1: Install RAGAS

pip install ragas==0.2.15

Step 2: Basic RAGAS Setup

from ragas.metrics import Faithfulness, ResponseRelevancy
from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Set up evaluator LLM
evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o", temperature=0)
)

# Create rate-friendly configuration
rate_friendly_config = RunConfig(
    timeout=300,          # 5 minutes max for operations
    max_retries=15,       # More retries for rate limits
    max_wait=90,          # Longer wait between retries
    max_workers=8,        # Fewer concurrent API calls
    log_tenacity=True     # Log retry attempts
)
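
The `evaluate` and batching calls below expect a RAGAS evaluation dataset. Here is a minimal sketch of building one, assuming the `SingleTurnSample` and `EvaluationDataset` classes from ragas 0.2 and purely illustrative field values:

from ragas.dataset_schema import SingleTurnSample, EvaluationDataset

# Illustrative sample; in practice these values come from your RAG pipeline
samples = [
    SingleTurnSample(
        user_input="What is RAGAS?",
        retrieved_contexts=["RAGAS is a framework for evaluating RAG pipelines."],
        response="RAGAS is an evaluation framework for RAG systems.",
        reference="RAGAS is a framework for evaluating retrieval augmented generation.",
    )
]

full_dataset = EvaluationDataset(samples=samples)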

Step 3: Choose Metrics Strategically

🚦 Traffic Light System for Metrics

| Metric | API Usage | Rate Impact | When to Use |
|--------|-----------|-------------|-------------|
| 🔴 Faithfulness | Very High | Critical | Small datasets, critical evaluation |
| 🔴 LLMContextRecall | Very High | Critical | Small datasets, retrieval quality |
| 🟠 FactualCorrectness | High | Significant | Medium datasets, factual accuracy |
| 🟠 AspectCritic | High | Significant | Custom evaluation criteria |
| 🟢 ResponseRelevancy | Medium | Moderate | Larger datasets, user experience |
| 🟢 ContextPrecision | Medium | Moderate | Larger datasets, retrieval precision |
| 🔵 StringPresence | None | None | Any dataset size, basic checks |
| 🔵 ExactMatch | None | None | Any dataset size, direct matching |
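
As a sketch of how these tiers might be grouped in code (metric class names as exported by ragas 0.2; the `AspectCritic` name and definition are illustrative):

from ragas.metrics import (
    Faithfulness, LLMContextRecall,       # red: very high impact
    FactualCorrectness, AspectCritic,     # orange: high impact
    ResponseRelevancy, ContextPrecision,  # green: moderate impact
    StringPresence, ExactMatch,           # blue: no API calls
)

# Group metrics by rate-limit impact so each tier can be scheduled differently
red_metrics    = [Faithfulness(), LLMContextRecall()]
orange_metrics = [FactualCorrectness(),
                  AspectCritic(name="conciseness",
                               definition="Is the response free of unnecessary detail?")]
green_metrics  = [ResponseRelevancy(), ContextPrecision()]
blue_metrics   = [StringPresence(), ExactMatch()]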

Step 4: Implement Batch Processing

For datasets with more than 10 samples:

import time

def batch_process_evaluation(dataset, metrics, batch_size=5, pause_seconds=120):
    """Process evaluations in batches to avoid rate limits."""
    # Split dataset into batches
    batches = [dataset[i:i+batch_size] for i in range(0, len(dataset), batch_size)]
    print(f"Processing {len(dataset)} samples in {len(batches)} batches")
    
    all_results = []
    for i, batch in enumerate(batches):
        print(f"Processing batch {i+1}/{len(batches)}")
        
        # Run evaluation for this batch
        batch_result = evaluate(
            dataset=batch,
            metrics=metrics,
            llm=evaluator_llm,
            run_config=rate_friendly_config
        )
        
        all_results.append(batch_result)
        
        # Pause between batches (except after the last one)
        if i < len(batches) - 1:
            print(f"Pausing for {pause_seconds} seconds...")
            time.sleep(pause_seconds)
    
    return all_results
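
Each batch produces its own result object. To get a single view across batches, one option is to concatenate the per-batch score tables (this sketch assumes the result objects expose `to_pandas()`, as ragas 0.2 results do, and that `full_dataset` and the Step 2 configuration are already defined):

import pandas as pd

results = batch_process_evaluation(full_dataset, metrics=[ResponseRelevancy()])

# Combine per-batch score tables into one DataFrame and report mean scores
scores_df = pd.concat([r.to_pandas() for r in results], ignore_index=True)
print(scores_df.mean(numeric_only=True))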

Step 5: Use Multi-Tier Evaluation Approach

# Import the additional metrics used in this step
from ragas.metrics import LLMContextRecall, ContextPrecision

# Heavy metrics on a small subset
heavy_metrics = [Faithfulness(), LLMContextRecall()]
subset_size = min(10, len(full_dataset))
subset = full_dataset[:subset_size]

heavy_results = evaluate(
    dataset=subset,
    metrics=heavy_metrics,
    llm=evaluator_llm,
    run_config=rate_friendly_config
)

# Lighter metrics on full dataset
light_metrics = [ResponseRelevancy(), ContextPrecision()]
light_results = batch_process_evaluation(
    dataset=full_dataset,
    metrics=light_metrics,
    batch_size=10,
    pause_seconds=60
)

Visual Decision Tree for RAGAS Implementation

START
  ├── Dataset Size?
  │   ├── Small (<10 samples)
  │   │   └── Use All Metrics with Default Config
  │   │
  │   ├── Medium (10-50 samples)
  │   │   ├── Heavy Metrics → On Small Subset
  │   │   └── Medium/Light Metrics → Batch Process
  │   │
  │   └── Large (>50 samples)
  │       ├── Heavy Metrics → On Tiny Subset
  │       ├── Medium Metrics → On Small Subset
  │       └── Light Metrics → Batch Process
  │
  ├── Rate Limit Encountered?
  │   ├── YES
  │   │   ├── Reduce max_workers
  │   │   ├── Increase pause between batches
  │   │   └── Consider model fallback
  │   │
  │   └── NO
  │       └── Continue processing
  │
  └── END
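
The tree can also be encoded as a small helper that picks an evaluation plan from the dataset size. This is a hypothetical sketch; the `choose_eval_plan` name, thresholds, and numbers simply mirror the tree and tables in this guide, not anything built into RAGAS:

def choose_eval_plan(num_samples: int) -> dict:
    """Map dataset size to an evaluation plan, following the decision tree above."""
    if num_samples < 10:     # Small: all metrics, default-style settings
        return {"heavy_subset": num_samples, "batch_size": num_samples,
                "pause_seconds": 0, "max_workers": 16}
    if num_samples <= 50:    # Medium: heavy metrics on a subset, batch the rest
        return {"heavy_subset": 10, "batch_size": 10,
                "pause_seconds": 60, "max_workers": 8}
    # Large: heavy metrics on a tiny subset, light metrics batched conservatively
    return {"heavy_subset": 5, "batch_size": 10,
            "pause_seconds": 90, "max_workers": 4}

plan = choose_eval_plan(len(full_dataset))
rate_friendly_config = RunConfig(max_workers=plan["max_workers"],
                                 timeout=300, max_retries=15, max_wait=90)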

Model Selection Guide

| Evaluation Need | Recommended Model | Rate Parameters |
|-----------------|-------------------|-----------------|
| Development/Testing | GPT-3.5-Turbo | max_workers=12, batch_size=15 |
| Standard Evaluation | GPT-4o-mini | max_workers=10, batch_size=10 |
| Critical Evaluation | GPT-4o | max_workers=8, batch_size=8 |
| Research Grade | GPT-4.1 | max_workers=6, batch_size=5 |
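
If you switch evaluator models often, the same table can live in code as a simple lookup (values copied from the table above; tune them to your own account's rate limits):

# Rate parameters per evaluator model, mirroring the table above
MODEL_RATE_PARAMS = {
    "gpt-3.5-turbo": {"max_workers": 12, "batch_size": 15},
    "gpt-4o-mini":   {"max_workers": 10, "batch_size": 10},
    "gpt-4o":        {"max_workers": 8,  "batch_size": 8},
    "gpt-4.1":       {"max_workers": 6,  "batch_size": 5},
}

model_name = "gpt-4o-mini"
params = MODEL_RATE_PARAMS[model_name]
run_config = RunConfig(max_workers=params["max_workers"], timeout=300,
                       max_retries=15, max_wait=90)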

Troubleshooting Common Issues

1. "Rate limit exceeded" errors

Solution: Reduce max_workers by 50% and increase pause between batches.

# More conservative config
conservative_config = RunConfig(
    max_workers=4,  # Very conservative
    timeout=300,
    max_retries=15,
    max_wait=120
)

2. Slow evaluation progress

Solution: Balance processing time with rate limits by optimizing batch size.

# Faster processing with careful rate management
model = "gpt-4o-mini"  # the evaluator model name in use (example value)
if model.startswith("gpt-3.5"):
    batch_size = 15  # Higher throughput models
    pause = 45       # Shorter pause
elif model.startswith("gpt-4o-mini"):
    batch_size = 10
    pause = 60
else:  # GPT-4.1 or GPT-4o
    batch_size = 5   # Lower throughput models
    pause = 90       # Longer pause

3. Evaluation timeouts

Solution: Increase timeout parameter in RunConfig.

# For complex evaluations
timeout_config = RunConfig(
    timeout=600,  # 10 minutes for complex operations
    max_retries=15
)

Key Takeaways

  1. Start small and scale gradually: Begin with smaller test sets and simpler metrics.
  2. Monitor API usage: Keep track of rate limit errors to adjust parameters.
  3. Batch strategically: Use smaller batches for heavy metrics, larger for lighter ones.
  4. Use fallbacks: Have a backup plan for when your primary model hits rate limits (see the sketch after this list).
  5. Iterate and optimize: Refine your approach based on results and rate limit experiences.
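
A minimal fallback sketch for takeaway 4, assuming the OpenAI Python client (v1+) is installed, that `openai.RateLimitError` propagates when `evaluate` is called with `raise_exceptions=True`, and that gpt-4o-mini is an acceptable backup evaluator:

import openai

def evaluate_with_fallback(dataset, metrics, run_config):
    """Try the primary evaluator first; fall back to a cheaper model if rate limits persist."""
    try:
        return evaluate(dataset=dataset, metrics=metrics, llm=evaluator_llm,
                        run_config=run_config, raise_exceptions=True)
    except openai.RateLimitError:
        # Hypothetical fallback: switch to gpt-4o-mini as a cheaper backup evaluator
        print("Primary model rate-limited; retrying with gpt-4o-mini")
        fallback_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
        return evaluate(dataset=dataset, metrics=metrics, llm=fallback_llm,
                        run_config=run_config, raise_exceptions=True)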

By following this user-friendly guide, you can implement RAGAS evaluation effectively while managing OpenAI API rate limits.

Metric Rate-Impact Diagram (Mermaid)

flowchart TB
    classDef highImpact fill:#ff9999,stroke:#333,stroke-width:1px
    classDef mediumImpact fill:#ffcc99,stroke:#333,stroke-width:1px
    classDef lowImpact fill:#99ff99,stroke:#333,stroke-width:1px
    classDef noImpact fill:#99ccff,stroke:#333,stroke-width:1px
    classDef framework fill:#f5f5f5,stroke:#333,stroke-width:2px
    
    A["RAGAS Evaluation<br>Framework v0.2.15"]:::framework
    A --> B["LLM-Based Metrics<br>(API Calls Required)"]
    A --> C["Non-LLM Metrics<br>(No API Calls)"]
    
    subgraph HighImpactMetrics [High Rate Limit Impact Metrics]
        direction TB
        H["Faithfulness"]:::highImpact --- H1["⚠️ Very High Impact<br>Multiple complex LLM calls"]:::highImpact
        I["LLMContextRecall"]:::highImpact --- I1["⚠️ Very High Impact<br>Detailed context analysis"]:::highImpact
        J["FactualCorrectness"]:::mediumImpact --- J1["⚠️ High Impact<br>Reference comparison"]:::mediumImpact
        AG["AspectCritic"]:::mediumImpact --- AG1["⚠️ High Impact<br>Custom evaluation criteria"]:::mediumImpact
        AH["AgentGoalAccuracy"]:::mediumImpact --- AH1["⚠️ High Impact<br>Complex goal assessment"]:::mediumImpact
    end
    
    subgraph NoImpactMetrics [No Rate Limit Impact Metrics]
        direction TB
        K["StringPresence"]:::noImpact --- K1["βœ… No Impact<br>Simple text matching"]:::noImpact
        L["ExactMatch"]:::noImpact --- L1["βœ… No Impact<br>Binary comparison"]:::noImpact
        M["BLEUScore"]:::noImpact --- M1["βœ… No Impact<br>Translation metric"]:::noImpact
        N["ROUGEScore"]:::noImpact --- N1["βœ… No Impact<br>Summarization metric"]:::noImpact
    end
    
    B --- HighImpactMetrics
    C --- NoImpactMetrics