
Evaluating LLM Applications: From RAG to Agents with Ragas

1. Introduction

Large Language Models (LLMs) have revolutionized AI applications by enabling natural language understanding and generation capabilities. However, as these applications grow more sophisticated, ensuring their quality, reliability, and accuracy becomes increasingly challenging. Two key architectures in the LLM ecosystem are Retrieval-Augmented Generation (RAG) systems and LLM-powered agents.

This guide introduces the concepts of RAG systems and agents, explains their relationship, and presents the Ragas framework for evaluating their performance. We'll explore examples from two practical implementations: evaluating a RAG system and evaluating an agent application.

2. Understanding RAG Systems

What is RAG?

Retrieval-Augmented Generation (RAG) is an approach that enhances LLM outputs by retrieving relevant information from external knowledge sources before generating responses. This method addresses limitations in LLMs' internal knowledge by providing up-to-date and domain-specific information [1].

Components of a RAG System

```mermaid
graph LR
    User([User]) -- "Query" --> RAG
    subgraph RAG[RAG System]
        Query[Query Processing] --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Context" --> Generator[Generator]
        Generator -- "Augmented Response" --> Output
    end
    Output --> User
```

A RAG system typically consists of the following components (a toy sketch follows the list):

  1. Query Processor: Prepares the user's query for retrieval
  2. Retriever: Finds relevant information from a knowledge base
  3. Knowledge Base: Contains documents, chunks, or data points
  4. Generator: An LLM that produces responses using retrieved context
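
Here is a minimal, illustrative sketch of these four components in Python. The knowledge base, the keyword-overlap retriever, and the `call_llm` placeholder are all hypothetical stand-ins for real infrastructure (a vector store, an embedding-based retriever, and an actual LLM client):

```python
# Toy RAG pipeline: every name here is illustrative, not a real library API.

KNOWLEDGE_BASE = [
    "Ragas is a framework for evaluating LLM applications.",
    "RAG systems retrieve context from a knowledge base before generating.",
    "Reranking can improve the quality of retrieved context.",
]

def process_query(query: str) -> str:
    """1. Query Processor: normalize the user's query for retrieval."""
    return query.strip().lower()

def retrieve(query: str, k: int = 2) -> list[str]:
    """2. Retriever: naive keyword-overlap scoring against 3. the Knowledge Base."""
    words = set(process_query(query).split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real chat-model call here."""
    raise NotImplementedError

def answer(query: str) -> str:
    """4. Generator: produce a response grounded in the retrieved context."""
    context = "\n".join(retrieve(query))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```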

Benefits and Limitations

RAG systems offer several advantages:

  • Provide access to information not in the LLM's training data
  • Reduce hallucination by grounding responses in retrieved facts
  • Enable easy knowledge updates without retraining the LLM
  • Allow for domain-specific knowledge incorporation

However, they also face challenges:

  • Retrieval quality heavily impacts response quality
  • May struggle with complex reasoning tasks
  • Lack persistence in multi-turn conversations

3. Understanding LLM Agents

What are LLM Agents?

LLM agents are systems that enhance LLMs with the ability to interact with external tools, maintain state across interactions, and perform complex reasoning tasks. They can make decisions about when to use tools and how to interpret their outputs [2].

The ReAct Pattern

```mermaid
graph TD
    User([User]) -- "Query" --> Agent
    subgraph Agent[Agent System]
        State[State Management] --> LLM[Language Model]
        LLM -- "Reasoning" --> ToolUse[Tool Decision]
        ToolUse -- "If needed" --> Tools[External Tools]
        Tools -- "Results" --> State
        ToolUse -- "If not needed" --> Response[Response Generation]
    end
    Response --> User
```

The ReAct (Reasoning + Acting) pattern is a common approach for building agents (a skeletal loop is sketched after the steps):

  1. Reasoning: The agent analyzes the current state and decides on a course of action
  2. Acting: The agent executes actions using tools
  3. Observing: The agent processes the results of its actions
  4. Updating: The agent updates its state based on observations
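
A bare-bones version of this loop, with `llm_decide` standing in for a structured LLM call and `tools` as a plain dict of callables (all names here are hypothetical):

```python
# Skeletal ReAct loop; `llm_decide` is assumed to return a dict such as
# {"action": "calculator", "input": "2+2"} or {"action": "final_answer", "content": "..."}.

def react_agent(query: str, tools: dict, llm_decide, max_steps: int = 5) -> str:
    state = [f"User question: {query}"]                 # state management: a scratchpad
    for _ in range(max_steps):
        decision = llm_decide("\n".join(state))         # 1. Reasoning
        if decision["action"] == "final_answer":
            return decision["content"]
        observation = tools[decision["action"]](decision["input"])  # 2. Acting
        state.append(f"Observation: {observation}")     # 3. Observing + 4. Updating
    return "Stopped after reaching max_steps."
```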

State Management and Tools

Agents require three core capabilities, illustrated in the framework sketch after this list:

  • State Management: Tracking conversation history, tool outputs, and intermediate reasoning
  • Tool Integration: The ability to use external tools like APIs, calculators, or databases
  • Decision Logic: Rules for determining when to use tools vs. direct responses
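
In practice, frameworks such as LangGraph [2] bundle all three concerns. A sketch using its prebuilt ReAct helper (the exact API varies across LangGraph versions, so treat this as indicative; the metal-price tool is a stub that foreshadows the case study in section 7):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # any LangChain chat model works here
from langgraph.prebuilt import create_react_agent

@tool
def get_metal_price(metal: str) -> float:
    """Return the price of a metal in USD per ounce (hard-coded stub)."""
    return {"gold": 2350.0, "silver": 28.0}.get(metal.lower(), 0.0)

# State management and the tool-decision loop are handled by the prebuilt agent.
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), tools=[get_metal_price])
result = agent.invoke({"messages": [("user", "What is the price of gold?")]})
print(result["messages"][-1].content)
```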

4. Evolution to Agentic RAG

Combining RAG and Agents

Agentic RAG represents an evolution where retrieval becomes an integral part of an agent's toolkit rather than a separate pipeline [3]. In this approach, an agent orchestrates the retrieval process strategically.

```mermaid
graph TB
    User([User]) -- "Query" --> AgenticRAG
    subgraph AgenticRAG[Agentic RAG System]
        Agent[Agent Coordinator] -- "Decides strategy" --> QueryProcessor[Query Processor]
        QueryProcessor --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Documents" --> AgentTools[Agent Tools]
        AgentTools -- "Processed Context" --> Generator[Generator]
        Generator -- "Enhanced Response" --> Agent
    end
    AgenticRAG -- "Final Response" --> User
```

Advantages of Agentic RAG

Agentic RAG offers several improvements over standard RAG (retrieval-as-a-tool is sketched after the list):

  • More sophisticated retrieval strategies adapted to the query
  • Better handling of complex information needs
  • Ability to reform queries based on initial results
  • Integration of multiple sources of information
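
Concretely, the retriever is exposed to the agent as a tool, so the agent can decide when to search, reformulate the query, and search again. A sketch reusing the toy `retrieve` function from section 2 (the tool name and wiring are hypothetical):

```python
from langchain_core.tools import tool

@tool
def search_knowledge_base(query: str) -> str:
    """Retrieve passages relevant to the query from the knowledge base."""
    return "\n".join(retrieve(query))  # the toy retriever sketched in section 2

# The agent, not a fixed pipeline, now orchestrates retrieval:
# agent = create_react_agent(chat_model, tools=[search_knowledge_base, ...])
```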

5. Evaluating LLM Applications with Ragas

The Evaluation Workflow

Ragas provides a framework for evaluating both RAG systems and agents through objective metrics [4]. The typical evaluation workflow follows these steps:

```mermaid
graph LR
    Build[Build System] --> Test[Test System]
    Test -- "Generate traces" --> Evaluate[Evaluate with Ragas]
    Evaluate -- "Analyze metrics" --> Improve[Improve System]
    Improve --> Test
```

  1. Build: Develop the initial RAG or agent system
  2. Test: Generate interaction traces with test queries
  3. Evaluate: Apply Ragas metrics to measure performance
  4. Improve: Make targeted improvements based on metrics
  5. Repeat: Continue the cycle to refine the system (a minimal skeleton of this loop follows)
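
As a skeleton, the cycle looks like this; `system`, `run_ragas`, and `improve` are all placeholders for your pipeline, the actual Ragas evaluation call (shown in section 6), and your tuning step:

```python
def evaluation_cycle(system, test_queries, run_ragas, improve, threshold=0.8):
    """Loop: test -> evaluate -> improve, until every metric clears the bar."""
    while True:
        traces = [system.run(q) for q in test_queries]  # 2. Test: generate traces
        scores = run_ragas(traces)                      # 3. Evaluate with Ragas
        if min(scores.values()) >= threshold:
            return system, scores                       # good enough: stop iterating
        system = improve(system, scores)                # 4. Improve, 5. Repeat
```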

Synthetic Data Generation

Ragas can generate synthetic test data for evaluation (a generation sketch follows the list):

  • Creates realistic queries based on your knowledge base
  • Provides expected answers as a reference
  • Generates different query types (simple, complex, multi-hop)
  • Creates test scenarios that challenge different aspects of your system
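
A sketch using Ragas' test-set generator (the constructor and method names follow the v0.2-style documentation and have changed between releases, so verify against your installed version):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embedding_model=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
)
# `docs` is a list of LangChain Document objects loaded from your corpus.
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
print(testset.to_pandas())  # questions, reference answers, source contexts
```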

Objective vs. Subjective Evaluation

Traditional evaluation often relies on subjective human judgment, which is:

  • Time-consuming and expensive
  • Inconsistent across evaluators
  • Difficult to scale

Ragas provides objective, consistent metrics that:

  • Can be automatically calculated
  • Provide comparable scores across systems
  • Identify specific areas for improvement
  • Scale to large test sets

6. Key Metrics for Evaluation

Ragas offers different metrics for evaluating RAG systems and agents:

```mermaid
graph TD
    Ragas[Ragas Framework] --> RAGMetrics[RAG Metrics]
    Ragas --> AgentMetrics[Agent Metrics]

    RAGMetrics --> ContextRecall[Context Recall]
    RAGMetrics --> Faithfulness[Faithfulness]
    RAGMetrics --> FactualCorrectness[Factual Correctness]

    AgentMetrics --> ToolCallAccuracy[Tool Call Accuracy]
    AgentMetrics --> GoalCompletion[Goal Completion]
    AgentMetrics --> TopicAdherence[Topic Adherence]
```

Note: this is a partial list; see the Ragas documentation for the full set of available metrics [6].

RAG-specific Metrics

  1. Context Recall: Measures whether the retrieved context contains all necessary information to answer the query correctly [5]

  2. Faithfulness: Assesses if the generated answer is consistent with the provided context rather than made up (hallucinated) [6]

  3. Factual Correctness: Evaluates if the answer is factually accurate compared to a reference answer

  4. Answer Relevancy: Measures how directly the response addresses the user's question (a usage sketch for these metrics follows)
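
A minimal evaluation run over a single hand-built sample, using the classic `ragas.evaluate` API (metric objects and dataset column names have shifted between Ragas versions; the data below is fabricated purely for illustration):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

data = Dataset.from_dict({
    "question":     ["What does Ragas evaluate?"],
    "answer":       ["Ragas evaluates RAG systems and agents."],
    "contexts":     [["Ragas is a framework for evaluating LLM applications."]],
    "ground_truth": ["Ragas evaluates LLM applications such as RAG pipelines."],
})
result = evaluate(data, metrics=[context_recall, faithfulness, answer_relevancy])
print(result)  # per-metric scores between 0 and 1
```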

Agent-specific Metrics

  1. Tool Call Accuracy: Evaluates if the agent correctly identifies when to use tools and calls them with proper parameters [7]

  2. Goal Completion: Assesses if the agent accomplishes the user's intended goal

  3. Topic Adherence: Measures if the agent stays on topic and within its domain of expertise (a Tool Call Accuracy sketch follows)
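
A Tool Call Accuracy sketch in the style of the Ragas agent-metrics docs [7] (v0.2-style message classes; again, verify against your installed version):

```python
import asyncio
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the price of gold?"),
        AIMessage(content="Let me look that up.",
                  tool_calls=[ToolCall(name="get_metal_price", args={"metal": "gold"})]),
        ToolMessage(content="2350.0"),
        AIMessage(content="Gold is trading at about $2,350 per ounce."),
    ],
    reference_tool_calls=[ToolCall(name="get_metal_price", args={"metal": "gold"})],
)
score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when the agent's calls match the reference exactly
```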

Interpreting Results

Ragas metrics are typically scored from 0 to 1, where:

  • Higher scores indicate better performance
  • Scores can be compared across system iterations
  • Relative improvements across iterations are more meaningful than absolute scores, as the snippet below illustrates
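
The result object returned by `evaluate` converts to a pandas DataFrame, which makes iteration-over-iteration comparison straightforward (this continues from the hypothetical RAG metrics snippet in section 6):

```python
df_baseline = result.to_pandas()           # per-sample scores for the baseline
# ...rerun evaluate() after a change (e.g., adding reranking), then compare:
# df_reranked.mean(numeric_only=True) - df_baseline.mean(numeric_only=True)
```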

7. Case Studies from Notebooks

RAG Evaluation Example

In the RAG evaluation notebook, we see:

  • Creation of synthetic test data from document collections
  • Evaluation of a baseline RAG system using multiple metrics
  • Implementation of reranking to improve retrieval quality
  • Re-evaluation to demonstrate metric improvements

The notebook showed significant improvements in Context Recall after reranking was added, demonstrating how targeted, metric-informed changes can enhance system performance.

Agent Evaluation Example

In the agent evaluation notebook, we see:

  • Building a ReAct agent with a metal price lookup tool
  • Converting agent interaction traces to Ragas format
  • Evaluating Tool Call Accuracy and Goal Completion
  • Testing Topic Adherence with an out-of-domain query

The agent showed perfect scores for Tool Call Accuracy and Goal Completion but failed the Topic Adherence test, highlighting a need for better domain enforcement.

8. Best Practices for Evaluation

Based on both notebooks, here are key best practices for evaluating LLM applications:

  1. Establish Baselines: Start with a baseline implementation and metrics
  2. Use Multiple Metrics: Different metrics reveal different aspects of performance
  3. Target Improvements: Make focused changes based on metric insights
  4. Test Edge Cases: Include challenging queries that test system boundaries
  5. Combine Automatic and Manual Evaluation: Use metrics for quantitative assessment and human review for qualitative insights
  6. Continuous Evaluation: Incorporate evaluation into your development workflow

9. Conclusion

Evaluation is a critical aspect of building reliable and effective LLM applications. The Ragas framework provides powerful tools for objectively measuring the performance of both RAG systems and agents, enabling data-driven improvements.

As LLM applications continue to evolve from simple RAG systems to sophisticated agents and agentic RAG, robust evaluation frameworks like Ragas will become increasingly important for ensuring quality and reliability.

References

  1. "What is retrieval-augmented generation (RAG)?" IBM Research. https://research.ibm.com/blog/retrieval-augmented-generation-RAG

  2. "LangGraph: A stateful orchestration framework for LLM applications." LangChain. https://www.langchain.com/langgraph

  3. "Agentic RAG: Extending traditional Retrieval-Augmented Generation(RAG) pipelines with intelligent agents." LeewayHertz. https://www.leewayhertz.com/agentic-rag/

  4. "GitHub - explodinggradients/ragas: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas

  5. "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." Medium. https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38

  6. "Overview of Metrics - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/overview/

  7. "Agentic or Tool use - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/
