Large Language Models (LLMs) have revolutionized AI applications by enabling natural language understanding and generation capabilities. However, as these applications grow more sophisticated, ensuring their quality, reliability, and accuracy becomes increasingly challenging. Two key architectures in the LLM ecosystem are Retrieval-Augmented Generation (RAG) systems and LLM-powered agents.
This guide introduces the concepts of RAG systems and agents, explains their relationship, and presents the Ragas framework for evaluating their performance. We'll explore examples from two practical implementations: evaluating a RAG system and evaluating an agent application.
Retrieval-Augmented Generation (RAG) is an approach that enhances LLM outputs by retrieving relevant information from external knowledge sources before generating responses. This method addresses limitations in LLMs' internal knowledge by providing up-to-date and domain-specific information[^1].
```mermaid
graph LR
    User([User]) -- "Query" --> RAG
    subgraph RAG[RAG System]
        Query[Query Processing] --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Context" --> Generator[Generator]
        Generator -- "Augmented Response" --> Output
    end
    Output --> User
```
A RAG system typically consists of:
- Query Processor: Prepares the user's query for retrieval
- Retriever: Finds relevant information from a knowledge base
- Knowledge Base: Contains documents, chunks, or data points
- Generator: An LLM that produces responses using retrieved context
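To make these components concrete, here is a minimal, self-contained sketch of the pipeline. The keyword-overlap retriever and the `call_llm` placeholder are illustrative stand-ins rather than any particular library's API; a production system would use embedding-based retrieval and a real LLM client.

```python
# Minimal RAG pipeline sketch. The retriever uses naive keyword overlap purely
# for illustration, and `call_llm` is a placeholder for whatever LLM client you use.

KNOWLEDGE_BASE = [
    "Ragas is a framework for evaluating LLM applications.",
    "Retrieval-Augmented Generation combines a retriever with a generator.",
    "Context Recall measures whether retrieved context covers the reference answer.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a hosted API or a local model)."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    """Query processing -> retrieval -> augmented generation."""
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("What does Context Recall measure?"))
```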
RAG systems offer several advantages:
- Provide access to information not in the LLM's training data
- Reduce hallucination by grounding responses in retrieved facts
- Enable easy knowledge updates without retraining the LLM
- Allow for domain-specific knowledge incorporation
However, they also face challenges:
- Retrieval quality heavily impacts response quality
- May struggle with complex reasoning tasks
- Lack built-in persistence across multi-turn conversations
LLM agents are systems that enhance LLMs with the ability to interact with external tools, maintain state across interactions, and perform complex reasoning tasks. They can make decisions about when to use tools and how to interpret their outputs[^2].
```mermaid
graph TD
    User([User]) -- "Query" --> Agent
    subgraph Agent[Agent System]
        State[State Management] --> LLM[Language Model]
        LLM -- "Reasoning" --> ToolUse[Tool Decision]
        ToolUse -- "If needed" --> Tools[External Tools]
        Tools -- "Results" --> State
        ToolUse -- "If not needed" --> Response[Response Generation]
    end
    Response --> User
```
The ReAct (Reasoning + Acting) pattern is a common approach for building agents:
- Reasoning: The agent analyzes the current state and decides on a course of action
- Acting: The agent executes actions using tools
- Observing: The agent processes the results of its actions
- Updating: The agent updates its state based on observations
Agents require:
- State Management: Tracking conversation history, tool outputs, and intermediate reasoning
- Tool Integration: The ability to use external tools like APIs, calculators, or databases
- Decision Logic: Rules for determining when to use tools vs. direct responses
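The loop below sketches how these pieces fit together, under stated assumptions: `get_metal_price` is a toy tool with hard-coded prices (echoing the metal-price agent evaluated later), and `decide` is a stand-in for the LLM's reasoning step rather than a real model call.

```python
# Sketch of a ReAct-style loop: reason, optionally act with a tool, observe
# the result, and update state. The tool registry and `decide` heuristic are
# hypothetical stand-ins for an LLM's reasoning.

def get_metal_price(metal: str) -> float:
    """Toy tool: return a hard-coded spot price for a metal."""
    prices = {"gold": 2300.0, "silver": 27.5, "copper": 4.2}
    return prices.get(metal.lower(), float("nan"))

TOOLS = {"get_metal_price": get_metal_price}

def decide(query: str, state: list) -> tuple[str, str | None]:
    """Stand-in for LLM reasoning: choose an action and its argument."""
    for metal in ("gold", "silver", "copper"):
        if metal in query.lower() and not state:
            return "get_metal_price", metal           # a tool is needed
    return "respond", None                            # answer directly

def react_agent(query: str) -> str:
    state: list = []                                  # state management
    for _ in range(3):                                # bounded reasoning loop
        action, arg = decide(query, state)            # reasoning
        if action in TOOLS:
            observation = TOOLS[action](arg)          # acting
            state.append((action, arg, observation))  # observing / updating
        else:
            break
    return f"Answer based on observations: {state}"

print(react_agent("What is the price of gold?"))
```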
Agentic RAG represents an evolution where retrieval becomes an integral part of an agent's toolkit rather than a separate pipeline[^3]. In this approach, an agent orchestrates the retrieval process strategically.
```mermaid
graph TB
    User([User]) -- "Query" --> AgenticRAG
    subgraph AgenticRAG[Agentic RAG System]
        Agent[Agent Coordinator] -- "Decides strategy" --> QueryProcessor[Query Processor]
        QueryProcessor --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Documents" --> AgentTools[Agent Tools]
        AgentTools -- "Processed Context" --> Generator[Generator]
        Generator -- "Enhanced Response" --> Agent
    end
    AgenticRAG -- "Final Response" --> User
```
Agentic RAG offers several improvements over standard RAG:
- More sophisticated retrieval strategies adapted to the query
- Better handling of complex information needs
- Ability to reformulate queries based on initial results
- Integration of multiple sources of information
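As a rough illustration, the sketch below treats retrieval as a step the agent controls, retrieving repeatedly and reformulating the query until the gathered context looks sufficient. The `search`, `is_sufficient`, and `reformulate` helpers are hypothetical placeholders for a vector-store lookup and two LLM judgments.

```python
# Agentic RAG sketch: the agent decides when to retrieve, judges whether the
# context is sufficient, and reformulates the query between attempts.

def search(query: str) -> list[str]:
    """Placeholder for a vector-store lookup."""
    return [f"Document relevant to: {query}"]

def is_sufficient(context: list[str], query: str) -> bool:
    """Placeholder for an LLM check that the context can answer the query."""
    return len(context) >= 2

def reformulate(query: str, context: list[str]) -> str:
    """Placeholder for an LLM-generated follow-up search query."""
    return f"{query} (additional details)"

def agentic_rag(query: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    search_query = query
    for _ in range(max_rounds):
        context.extend(search(search_query))        # agent chooses to retrieve
        if is_sufficient(context, query):           # agent judges coverage
            break
        search_query = reformulate(query, context)  # agent reforms the query
    return f"Answer to {query!r} grounded in {len(context)} retrieved passages."

print(agentic_rag("How does Ragas score Faithfulness?"))
```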
Ragas provides a framework for evaluating both RAG systems and agents through objective metrics[^4]. The typical evaluation workflow follows these steps:
```mermaid
graph LR
    Build[Build System] --> Test[Test System]
    Test -- "Generate traces" --> Evaluate[Evaluate with Ragas]
    Evaluate -- "Analyze metrics" --> Improve[Improve System]
    Improve --> Test
```
- Build: Develop the initial RAG or agent system
- Test: Generate interaction traces with test queries
- Evaluate: Apply Ragas metrics to measure performance
- Improve: Make targeted improvements based on metrics
- Repeat: Continue the cycle to refine the system
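The evaluate step might look like the following sketch, which assumes the ragas 0.1-style `evaluate` API; exact imports, metric names, and expected column names have shifted between Ragas releases, so check the documentation for your version. Because these metrics are LLM-judged, running it also requires credentials for an LLM provider.

```python
# Hedged sketch of evaluating RAG traces with Ragas (0.1-style API assumed).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One trace per test query: question, generated answer, retrieved contexts,
# and a reference (ground truth) answer.
traces = {
    "question": ["What does Context Recall measure?"],
    "answer": ["It measures whether the retrieved context covers the reference answer."],
    "contexts": [["Context Recall measures whether retrieved context covers the reference answer."]],
    "ground_truth": ["Context Recall checks that retrieval surfaced the information needed to answer."],
}

result = evaluate(
    Dataset.from_dict(traces),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(result)  # e.g. {'faithfulness': ..., 'answer_relevancy': ..., 'context_recall': ...}
```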
Ragas can generate synthetic test data for evaluation:
- Creates realistic queries based on your knowledge base
- Provides expected answers as a reference
- Generates different query types (simple, complex, multi-hop)
- Creates test scenarios that challenge different aspects of your system
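A hedged sketch of that generation step, again assuming the 0.1-style testset API (`TestsetGenerator.with_openai` and evolution-based distributions); newer releases reorganize this module, so treat the names as illustrative and check the current docs.

```python
# Hedged sketch of synthetic test set generation with Ragas (0.1-style API).
# Assumes a local ./knowledge_base folder of markdown files and an OpenAI API key.
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

documents = DirectoryLoader("./knowledge_base", glob="**/*.md").load()

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    # Mix of query types: simple, reasoning-heavy, and multi-hop questions.
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
print(testset.to_pandas()[["question", "ground_truth"]].head())
```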
Traditional evaluation often relies on subjective human judgment, which is:
- Time-consuming and expensive
- Inconsistent across evaluators
- Difficult to scale
Ragas provides objective, consistent metrics that:
- Can be automatically calculated
- Provide comparable scores across systems
- Identify specific areas for improvement
- Scale to large test sets
Ragas offers different metrics for evaluating RAG systems and agents:
```mermaid
graph TD
    Ragas[Ragas Framework] --> RAGMetrics[RAG Metrics]
    Ragas --> AgentMetrics[Agent Metrics]
    RAGMetrics --> ContextRecall[Context Recall]
    RAGMetrics --> Faithfulness[Faithfulness]
    RAGMetrics --> FactualCorrectness[Factual Correctness]
    AgentMetrics --> ToolCallAccuracy[Tool Call Accuracy]
    AgentMetrics --> GoalCompletion[Goal Completion]
    AgentMetrics --> TopicAdherence[Topic Adherence]
```
Note: this is a partial list of the available metrics.
- Context Recall: Measures whether the retrieved context contains all the information necessary to answer the query correctly[^5]
- Faithfulness: Assesses whether the generated answer is consistent with the provided context rather than made up (hallucinated)[^6]
- Factual Correctness: Evaluates whether the answer is factually accurate compared to a reference answer
- Answer Relevancy: Measures how directly the response addresses the user's question
- Tool Call Accuracy: Evaluates whether the agent correctly identifies when to use tools and calls them with the proper parameters[^7]
- Goal Completion: Assesses whether the agent accomplishes the user's intended goal
- Topic Adherence: Measures whether the agent stays on topic and within its domain of expertise
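For the agent-side metrics, traces are expressed as multi-turn message sequences. The sketch below scores Tool Call Accuracy on a single trace, following the pattern in the Ragas agent-metrics documentation; class and module paths may differ in your installed version.

```python
# Hedged sketch of scoring Tool Call Accuracy on one agent trace.
import asyncio
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the price of copper?"),
        AIMessage(
            content="Let me look that up.",
            tool_calls=[ToolCall(name="get_metal_price", args={"metal": "copper"})],
        ),
        ToolMessage(content="4.2"),
        AIMessage(content="Copper is currently trading at about $4.20."),
    ],
    # The tool call we expect the agent to have made for this query.
    reference_tool_calls=[ToolCall(name="get_metal_price", args={"metal": "copper"})],
)

score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when the called tool and its arguments match the reference
```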
Ragas metrics are typically scored from 0 to 1, where:
- Higher scores indicate better performance
- Scores can be compared across system iterations
- Relative improvements are more meaningful than absolute scores
In the RAG evaluation notebook, we see:
- Creation of synthetic test data from document collections
- Evaluation of a baseline RAG system using multiple metrics
- Implementation of reranking to improve retrieval quality
- Re-evaluation to demonstrate metric improvements
The notebook showed significant improvements in Context Recall when adding reranking, demonstrating how targeted improvements informed by metrics can enhance system performance.
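For reference, a reranking stage can be as simple as scoring each retrieved passage against the query with a cross-encoder and keeping only the top few. The sketch below uses a sentence-transformers cross-encoder as one common choice; the notebook's actual reranker and model may differ.

```python
# Hedged sketch of a cross-encoder reranking step.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, document) pair and keep the highest-scoring passages."""
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Over-retrieve first (e.g., top 10 from the vector store), then keep the best few.
candidates = [
    "Context Recall measures whether retrieval covered the reference answer.",
    "Faithfulness checks that the answer is grounded in the retrieved context.",
    "Reranking reorders retrieved passages by relevance to the query.",
]
print(rerank("What does reranking do?", candidates, top_k=2))
```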
In the agent evaluation notebook, we see:
- Building a ReAct agent with a metal price lookup tool
- Converting agent interaction traces to Ragas format
- Evaluating Tool Call Accuracy and Goal Completion
- Testing Topic Adherence with an out-of-domain query
The agent showed perfect scores for Tool Call Accuracy and Goal Completion but failed the Topic Adherence test, highlighting a need for better domain enforcement.
Based on both notebooks, here are key best practices for evaluating LLM applications:
- Establish Baselines: Start with a baseline implementation and metrics
- Use Multiple Metrics: Different metrics reveal different aspects of performance
- Target Improvements: Make focused changes based on metric insights
- Test Edge Cases: Include challenging queries that test system boundaries
- Combine Automatic and Manual Evaluation: Use metrics for quantitative assessment and human review for qualitative insights
- Continuous Evaluation: Incorporate evaluation into your development workflow
Evaluation is a critical aspect of building reliable and effective LLM applications. The Ragas framework provides powerful tools for objectively measuring the performance of both RAG systems and agents, enabling data-driven improvements.
As LLM applications continue to evolve from simple RAG systems to sophisticated agents and agentic RAG, robust evaluation frameworks like Ragas will become increasingly important for ensuring quality and reliability.
[^1]: "What is retrieval-augmented generation (RAG)?" IBM Research. https://research.ibm.com/blog/retrieval-augmented-generation-RAG
[^2]: "LangGraph: A stateful orchestration framework for LLM applications." LangChain. https://www.langchain.com/langgraph
[^3]: "Agentic RAG: Extending traditional Retrieval-Augmented Generation (RAG) pipelines with intelligent agents." LeewayHertz. https://www.leewayhertz.com/agentic-rag/
[^4]: "GitHub - explodinggradients/ragas: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas
[^5]: "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." Medium. https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38
[^6]: "Overview of Metrics - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/overview/
[^7]: "Agentic or Tool use - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/