Large Language Models (LLMs) have revolutionized AI applications by enabling natural language understanding and generation capabilities. However, as these applications grow more sophisticated, ensuring their quality, reliability, and accuracy becomes increasingly challenging. Two key architectures in the LLM ecosystem are Retrieval-Augmented Generation (RAG) systems and LLM-powered agents.
This guide introduces the concepts of RAG systems and agents, explains their relationship, and presents the Ragas framework for evaluating their performance. We'll explore examples from two practical implementations: evaluating a RAG system and evaluating an agent application.
Retrieval-Augmented Generation (RAG) is an approach that enhances LLM outputs by retrieving relevant information from external knowledge sources before generating responses. This method addresses limitations in LLMs' internal knowledge by providing up-to-date and domain-specific information[^1].
```mermaid
graph LR
    User([User]) -- "Query" --> RAG
    subgraph RAG[RAG System]
        Query[Query Processing] --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Context" --> Generator[Generator]
        Generator -- "Augmented Response" --> Output
    end
    Output --> User
```
A RAG system typically consists of:
- Query Processor: Prepares the user's query for retrieval
- Retriever: Finds relevant information from a knowledge base
- Knowledge Base: Contains documents, chunks, or data points
- Generator: An LLM that produces responses using retrieved context
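To make these components concrete, here is a minimal, self-contained sketch of the pipeline. The keyword-overlap retriever and the `call_llm` placeholder are illustrative stand-ins rather than any particular library's API; a production system would use embedding-based retrieval and a real LLM client.

```python
# Minimal RAG pipeline sketch. The retriever uses naive keyword overlap purely
# for illustration, and `call_llm` is a placeholder for whatever LLM client you use.

KNOWLEDGE_BASE = [
    "Ragas is a framework for evaluating LLM applications.",
    "Retrieval-Augmented Generation combines a retriever with a generator.",
    "Context Recall measures whether retrieved context covers the reference answer.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a hosted API or a local model)."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    """Query processing -> retrieval -> augmented generation."""
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("What does Context Recall measure?"))
```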
RAG systems offer several advantages:
- Provide access to information not in the LLM's training data
- Reduce hallucination by grounding responses in retrieved facts
- Enable easy knowledge updates without retraining the LLM
- Allow for domain-specific knowledge incorporation
However, they also face challenges:
- Retrieval quality heavily impacts response quality
- May struggle with complex reasoning tasks
- Lack built-in persistence across multi-turn conversations
LLM agents are systems that enhance LLMs with the ability to interact with external tools, maintain state across interactions, and perform complex reasoning tasks. They can make decisions about when to use tools and how to interpret their outputs[^2].
```mermaid
graph TD
    User([User]) -- "Query" --> Agent
    subgraph Agent[Agent System]
        State[State Management] --> LLM[Language Model]
        LLM -- "Reasoning" --> ToolUse[Tool Decision]
        ToolUse -- "If needed" --> Tools[External Tools]
        Tools -- "Results" --> State
        ToolUse -- "If not needed" --> Response[Response Generation]
    end
    Response --> User
```
The ReAct (Reasoning + Acting) pattern is a common approach for building agents:
- Reasoning: The agent analyzes the current state and decides on a course of action
- Acting: The agent executes actions using tools
- Observing: The agent processes the results of its actions
- Updating: The agent updates its state based on observations
Agents require:
- State Management: Tracking conversation history, tool outputs, and intermediate reasoning
- Tool Integration: The ability to use external tools like APIs, calculators, or databases
- Decision Logic: Rules for determining when to use tools vs. direct responses
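The loop below sketches how these pieces fit together, under stated assumptions: `get_metal_price` is a toy tool with hard-coded prices (echoing the metal-price agent evaluated later), and `decide` is a stand-in for the LLM's reasoning step rather than a real model call.

```python
# Sketch of a ReAct-style loop: reason, optionally act with a tool, observe
# the result, and update state. The tool registry and `decide` heuristic are
# hypothetical stand-ins for an LLM's reasoning.

def get_metal_price(metal: str) -> float:
    """Toy tool: return a hard-coded spot price for a metal."""
    prices = {"gold": 2300.0, "silver": 27.5, "copper": 4.2}
    return prices.get(metal.lower(), float("nan"))

TOOLS = {"get_metal_price": get_metal_price}

def decide(query: str, state: list) -> tuple[str, str | None]:
    """Stand-in for LLM reasoning: choose an action and its argument."""
    for metal in ("gold", "silver", "copper"):
        if metal in query.lower() and not state:
            return "get_metal_price", metal           # a tool is needed
    return "respond", None                            # answer directly

def react_agent(query: str) -> str:
    state: list = []                                  # state management
    for _ in range(3):                                # bounded reasoning loop
        action, arg = decide(query, state)            # reasoning
        if action in TOOLS:
            observation = TOOLS[action](arg)          # acting
            state.append((action, arg, observation))  # observing / updating
        else:
            break
    return f"Answer based on observations: {state}"

print(react_agent("What is the price of gold?"))
```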
Agentic RAG represents an evolution where retrieval becomes an integral part of an agent's toolkit rather than a separate pipeline[^3]. In this approach, an agent orchestrates the retrieval process strategically.
```mermaid
graph TB
    User([User]) -- "Query" --> AgenticRAG
    subgraph AgenticRAG[Agentic RAG System]
        Agent[Agent Coordinator] -- "Decides strategy" --> QueryProcessor[Query Processor]
        QueryProcessor --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Documents" --> AgentTools[Agent Tools]
        AgentTools -- "Processed Context" --> Generator[Generator]
        Generator -- "Enhanced Response" --> Agent
    end
    AgenticRAG -- "Final Response" --> User
```
Agentic RAG offers several improvements over standard RAG:
- More sophisticated retrieval strategies adapted to the query
- Better handling of complex information needs
- Ability to reformulate queries based on initial results
- Integration of multiple sources of information
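As a rough illustration, the sketch below treats retrieval as a step the agent controls, retrieving repeatedly and reformulating the query until the gathered context looks sufficient. The `search`, `is_sufficient`, and `reformulate` helpers are hypothetical placeholders for a vector-store lookup and two LLM judgments.

```python
# Agentic RAG sketch: the agent decides when to retrieve, judges whether the
# context is sufficient, and reformulates the query between attempts.

def search(query: str) -> list[str]:
    """Placeholder for a vector-store lookup."""
    return [f"Document relevant to: {query}"]

def is_sufficient(context: list[str], query: str) -> bool:
    """Placeholder for an LLM check that the context can answer the query."""
    return len(context) >= 2

def reformulate(query: str, context: list[str]) -> str:
    """Placeholder for an LLM-generated follow-up search query."""
    return f"{query} (additional details)"

def agentic_rag(query: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    search_query = query
    for _ in range(max_rounds):
        context.extend(search(search_query))        # agent chooses to retrieve
        if is_sufficient(context, query):           # agent judges coverage
            break
        search_query = reformulate(query, context)  # agent reforms the query
    return f"Answer to {query!r} grounded in {len(context)} retrieved passages."

print(agentic_rag("How does Ragas score Faithfulness?"))
```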
Ragas provides a framework for evaluating both RAG systems and agents through objective metrics[^4]. The typical evaluation workflow follows these steps:
```mermaid
graph LR
    Build[Build System] --> Test[Test System]
    Test -- "Generate traces" --> Evaluate[Evaluate with Ragas]
    Evaluate -- "Analyze metrics" --> Improve[Improve System]
    Improve --> Test
```
- Build: Develop the initial RAG or agent system
- Test: Generate interaction traces with test queries
- Evaluate: Apply Ragas metrics to measure performance
- Improve: Make targeted improvements based on metrics
- Repeat: Continue the cycle to refine the system
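The evaluate step might look like the following sketch, which assumes the ragas 0.1-style `evaluate` API; exact imports, metric names, and expected column names have shifted between Ragas releases, so check the documentation for your version. Because these metrics are LLM-judged, running it also requires credentials for an LLM provider.

```python
# Hedged sketch of evaluating RAG traces with Ragas (0.1-style API assumed).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One trace per test query: question, generated answer, retrieved contexts,
# and a reference (ground truth) answer.
traces = {
    "question": ["What does Context Recall measure?"],
    "answer": ["It measures whether the retrieved context covers the reference answer."],
    "contexts": [["Context Recall measures whether retrieved context covers the reference answer."]],
    "ground_truth": ["Context Recall checks that retrieval surfaced the information needed to answer."],
}

result = evaluate(
    Dataset.from_dict(traces),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(result)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., 'context_recall': ...}
```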
Ragas can generate synthetic test data for evaluation:
- Creates realistic queries based on your knowledge base
- Provides expected answers as a reference
- Generates different query types (simple, complex, multi-hop)
- Creates test scenarios that challenge different aspects of your system
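A hedged sketch of that generation step, again assuming the 0.1-style testset API (`TestsetGenerator.with_openai` and evolution-based distributions); newer releases reorganize this module, so treat the names as illustrative and check the current docs.

```python
# Hedged sketch of synthetic test set generation with Ragas (0.1-style API).
# Assumes a local ./knowledge_base folder of markdown files and an OpenAI API key.
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

documents = DirectoryLoader("./knowledge_base", glob="**/*.md").load()

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    # Mix of query types: simple, reasoning-heavy, and multi-hop questions.
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
print(testset.to_pandas()[["question", "ground_truth"]].head())
```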
Traditional evaluation often relies on subjective human judgment, which is:
- Time-consuming and expensive
- Inconsistent across evaluators
- Difficult to scale
Ragas provides objective, consistent metrics that:
- Can be automatically calculated
- Provide comparable scores across systems
- Identify specific areas for improvement
- Scale to large test sets
Ragas offers different metrics for evaluating RAG systems and agents:
```mermaid
graph TD
    Ragas[Ragas Framework] --> RAGMetrics[RAG Metrics]
    Ragas --> AgentMetrics[Agent Metrics]
    RAGMetrics --> ContextRecall[Context Recall]
    RAGMetrics --> Faithfulness[Faithfulness]
    RAGMetrics --> FactualCorrectness[Factual Correctness]
    AgentMetrics --> ToolCallAccuracy[Tool Call Accuracy]
    AgentMetrics --> GoalCompletion[Goal Completion]
    AgentMetrics --> TopicAdherence[Topic Adherence]
```
Note: this is a partial list of the available metrics.
- Context Recall: Measures whether the retrieved context contains all the information necessary to answer the query correctly[^5]
- Faithfulness: Assesses whether the generated answer is consistent with the provided context rather than made up (hallucinated)[^6]
- Factual Correctness: Evaluates whether the answer is factually accurate compared to a reference answer
- Answer Relevancy: Measures how directly the response addresses the user's question
- Tool Call Accuracy: Evaluates whether the agent correctly identifies when to use tools and calls them with the proper parameters[^7]
- Goal Completion: Assesses whether the agent accomplishes the user's intended goal
- Topic Adherence: Measures whether the agent stays on topic and within its domain of expertise
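For the agent-side metrics, traces are expressed as multi-turn message sequences. The sketch below scores Tool Call Accuracy on a single trace, following the pattern in the Ragas agent-metrics documentation; class and module paths may differ in your installed version.

```python
# Hedged sketch of scoring Tool Call Accuracy on one agent trace.
import asyncio
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the price of copper?"),
        AIMessage(
            content="Let me look that up.",
            tool_calls=[ToolCall(name="get_metal_price", args={"metal": "copper"})],
        ),
        ToolMessage(content="4.2"),
        AIMessage(content="Copper is currently trading at about $4.20."),
    ],
    # The tool call we expect the agent to have made for this query.
    reference_tool_calls=[ToolCall(name="get_metal_price", args={"metal": "copper"})],
)

score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when the called tool and its arguments match the reference
```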
Ragas metrics are typically scored from 0 to 1, where:
- Higher scores indicate better performance
- Scores can be compared across system iterations
- Relative improvements are more meaningful than absolute scores
In the RAG evaluation notebook, we see:
- Creation of synthetic test data from document collections
- Evaluation of a baseline RAG system using multiple metrics
- Implementation of reranking to improve retrieval quality
- Re-evaluation to demonstrate metric improvements
The notebook showed significant improvements in Context Recall when adding reranking, demonstrating how targeted improvements informed by metrics can enhance system performance.
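For reference, a reranking stage can be as simple as scoring each retrieved passage against the query with a cross-encoder and keeping only the top few. The sketch below uses a sentence-transformers cross-encoder as one common choice; the notebook's actual reranker and model may differ.

```python
# Hedged sketch of a cross-encoder reranking step.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, document) pair and keep the highest-scoring passages."""
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Over-retrieve first (e.g., top 10 from the vector store), then keep the best few.
candidates = [
    "Context Recall measures whether retrieval covered the reference answer.",
    "Faithfulness checks that the answer is grounded in the retrieved context.",
    "Reranking reorders retrieved passages by relevance to the query.",
]
print(rerank("What does reranking do?", candidates, top_k=2))
```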
In the agent evaluation notebook, we see:
- Building a ReAct agent with a metal price lookup tool
- Converting agent interaction traces to Ragas format
- Evaluating Tool Call Accuracy and Goal Completion
- Testing Topic Adherence with an out-of-domain query
The agent showed perfect scores for Tool Call Accuracy and Goal Completion but failed the Topic Adherence test, highlighting a need for better domain enforcement.
Based on both notebooks, here are key best practices for evaluating LLM applications:
- Establish Baselines: Start with a baseline implementation and metrics
- Use Multiple Metrics: Different metrics reveal different aspects of performance
- Target Improvements: Make focused changes based on metric insights
- Test Edge Cases: Include challenging queries that test system boundaries
- Combine Automatic and Manual Evaluation: Use metrics for quantitative assessment and human review for qualitative insights
- Continuous Evaluation: Incorporate evaluation into your development workflow
Evaluation is a critical aspect of building reliable and effective LLM applications. The Ragas framework provides powerful tools for objectively measuring the performance of both RAG systems and agents, enabling data-driven improvements.
As LLM applications continue to evolve from simple RAG systems to sophisticated agents and agentic RAG, robust evaluation frameworks like Ragas will become increasingly important for ensuring quality and reliability.
[^1]: "What is retrieval-augmented generation (RAG)?" IBM Research. https://research.ibm.com/blog/retrieval-augmented-generation-RAG
[^2]: "LangGraph: A stateful orchestration framework for LLM applications." LangChain. https://www.langchain.com/langgraph
[^3]: "Agentic RAG: Extending traditional Retrieval-Augmented Generation (RAG) pipelines with intelligent agents." LeewayHertz. https://www.leewayhertz.com/agentic-rag/
[^4]: "GitHub - explodinggradients/ragas: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas
[^5]: "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." Medium. https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38
[^6]: "Overview of Metrics - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/overview/
[^7]: "Agentic or Tool use - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/