# Evaluating LLM Applications: From RAG to Agents with Ragas

## 1. Introduction

Large Language Models (LLMs) have revolutionized AI applications by enabling natural language understanding and generation capabilities. However, as these applications grow more sophisticated, ensuring their quality, reliability, and accuracy becomes increasingly challenging. Two key architectures in the LLM ecosystem are Retrieval-Augmented Generation (RAG) systems and LLM-powered agents.

This guide introduces the concepts of RAG systems and agents, explains their relationship, and presents the Ragas framework for evaluating their performance. We'll explore examples from two practical implementations: evaluating a RAG system and evaluating an agent application.

## 2. Understanding RAG Systems

### What is RAG?

Retrieval-Augmented Generation (RAG) is an approach that enhances LLM outputs by retrieving relevant information from external knowledge sources before generating responses. This method addresses limitations in LLMs' internal knowledge by providing up-to-date and domain-specific information[^1].

### Components of a RAG System

```mermaid
graph LR
    User([User]) -- "Query" --> RAG
    subgraph RAG[RAG System]
        Query[Query Processing] --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Context" --> Generator[Generator]
        Generator -- "Augmented Response" --> Output
    end
    Output --> User
```

A RAG system typically consists of:

1. **Query Processor**: Prepares the user's query for retrieval
2. **Retriever**: Finds relevant information from a knowledge base
3. **Knowledge Base**: Contains documents, chunks, or data points
4. **Generator**: An LLM that produces responses using retrieved context
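To make these four components concrete, here is a minimal, self-contained sketch in plain Python: the retriever ranks a tiny in-memory knowledge base by naive keyword overlap, and the generator is a stub standing in for an LLM call. The function names and the toy knowledge base are illustrative only; a real pipeline would substitute a vector store and an actual model.

```python
# Toy in-memory knowledge base: a handful of text chunks.
KNOWLEDGE_BASE = [
    "Ragas is a framework for evaluating LLM applications such as RAG pipelines.",
    "Retrieval-Augmented Generation retrieves external context before generating an answer.",
    "Context Recall measures whether the retrieved context covers the reference answer.",
]


def process_query(query: str) -> set[str]:
    """Query processor: normalize the query into a set of lowercase terms."""
    return {term.strip("?.,!").lower() for term in query.split()}


def retrieve(query_terms: set[str], k: int = 2) -> list[str]:
    """Retriever: rank chunks by keyword overlap (a stand-in for vector search)."""
    def overlap(chunk: str) -> int:
        return len(query_terms & {word.strip("?.,!").lower() for word in chunk.split()})

    return sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)[:k]


def generate(query: str, context: list[str]) -> str:
    """Generator: stub for the LLM call that would produce the augmented response."""
    return f"Answer to '{query}', grounded in: {' | '.join(context)}"


def rag_answer(query: str) -> str:
    """End-to-end pipeline: process the query, retrieve context, generate a response."""
    context = retrieve(process_query(query))
    return generate(query, context)


print(rag_answer("What does Context Recall measure?"))
```

Even in this toy form, the separation of stages matters for evaluation: the retrieved context and the generated response are exactly the artifacts that Ragas metrics score later in this guide.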
### Benefits and Limitations

RAG systems offer several advantages:

- Provide access to information not in the LLM's training data
- Reduce hallucination by grounding responses in retrieved facts
- Enable easy knowledge updates without retraining the LLM
- Allow for domain-specific knowledge incorporation

However, they also face challenges:

- Retrieval quality heavily impacts response quality
- May struggle with complex reasoning tasks
- Lack persistence in multi-turn conversations

## 3. Understanding LLM Agents

### What are LLM Agents?

LLM agents are systems that enhance LLMs with the ability to interact with external tools, maintain state across interactions, and perform complex reasoning tasks. They can make decisions about when to use tools and how to interpret their outputs[^2].

### The ReAct Pattern

```mermaid
graph TD
    User([User]) -- "Query" --> Agent
    subgraph Agent[Agent System]
        State[State Management] --> LLM[Language Model]
        LLM -- "Reasoning" --> ToolUse[Tool Decision]
        ToolUse -- "If needed" --> Tools[External Tools]
        Tools -- "Results" --> State
        ToolUse -- "If not needed" --> Response[Response Generation]
    end
    Response --> User
```

The ReAct (Reasoning + Acting) pattern is a common approach for building agents:

1. **Reasoning**: The agent analyzes the current state and decides on a course of action
2. **Acting**: The agent executes actions using tools
3. **Observing**: The agent processes the results of its actions
4. **Updating**: The agent updates its state based on observations

### State Management and Tools

Agents require:

- **State Management**: Tracking conversation history, tool outputs, and intermediate reasoning
- **Tool Integration**: The ability to use external tools like APIs, calculators, or databases
- **Decision Logic**: Rules for determining when to use tools vs. direct responses
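To ground the pattern, here is a rough sketch of how such an agent might be assembled with LangGraph (the orchestration framework cited in [^2]), whose prebuilt ReAct helper supplies the reason/act/observe/update loop and the message-history state. The tool, its hard-coded prices, and the model name are placeholders, not part of the original notebooks.

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def lookup_price(item: str) -> float:
    """Return the price of an item in USD per gram (hard-coded placeholder data)."""
    prices = {"gold": 88.16, "silver": 1.05, "copper": 0.0098}
    return prices.get(item.lower(), float("nan"))


# The LLM performs the "Reasoning" step; create_react_agent wires up the
# act/observe/update loop and keeps the conversation state for us.
llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model name
agent = create_react_agent(llm, tools=[lookup_price])

# One turn: the agent decides whether the tool is needed, observes its
# result, and then generates the final response.
result = agent.invoke({"messages": [("user", "How much does silver cost?")]})
print(result["messages"][-1].content)
```

A message trace like `result["messages"]` is the kind of interaction record that is later converted into Ragas' evaluation format (see the agent case study in Section 7).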
## 4. Evolution to Agentic RAG

### Combining RAG and Agents

Agentic RAG represents an evolution where retrieval becomes an integral part of an agent's toolkit rather than a separate pipeline[^3]. In this approach, an agent orchestrates the retrieval process strategically.

```mermaid
graph TB
    User([User]) -- "Query" --> AgenticRAG
    subgraph AgenticRAG[Agentic RAG System]
        Agent[Agent Coordinator] -- "Decides strategy" --> QueryProcessor[Query Processor]
        QueryProcessor --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Documents" --> AgentTools[Agent Tools]
        AgentTools -- "Processed Context" --> Generator[Generator]
        Generator -- "Enhanced Response" --> Agent
    end
    AgenticRAG -- "Final Response" --> User
```

### Advantages of Agentic RAG

Agentic RAG offers several improvements over standard RAG:

- More sophisticated retrieval strategies adapted to the query
- Better handling of complex information needs
- Ability to reformulate queries based on initial results
- Integration of multiple sources of information

## 5. Evaluating LLM Applications with Ragas

### The Evaluation Workflow

Ragas provides a framework for evaluating both RAG systems and agents through objective metrics[^4]. The typical evaluation workflow follows these steps:

```mermaid
graph LR
    Build[Build System] --> Test[Test System]
    Test -- "Generate traces" --> Evaluate[Evaluate with Ragas]
    Evaluate -- "Analyze metrics" --> Improve[Improve System]
    Improve --> Test
```

1. **Build**: Develop the initial RAG or agent system
2. **Test**: Generate interaction traces with test queries
3. **Evaluate**: Apply Ragas metrics to measure performance
4. **Improve**: Make targeted improvements based on metrics
5. **Repeat**: Continue the cycle to refine the system

### Synthetic Data Generation

Ragas can generate synthetic test data for evaluation:

- Creates realistic queries based on your knowledge base
- Provides expected answers as a reference
- Generates different query types (simple, complex, multi-hop)
- Creates test scenarios that challenge different aspects of your system

### Objective vs. Subjective Evaluation

Traditional evaluation often relies on subjective human judgment, which is:

- Time-consuming and expensive
- Inconsistent across evaluators
- Difficult to scale

Ragas provides objective, consistent metrics that:

- Can be automatically calculated
- Provide comparable scores across systems
- Identify specific areas for improvement
- Scale to large test sets

## 6. Key Metrics for Evaluation

Ragas offers different metrics for evaluating RAG systems and agents:

```mermaid
graph TD
    Ragas[Ragas Framework] --> RAGMetrics[RAG Metrics]
    Ragas --> AgentMetrics[Agent Metrics]
    RAGMetrics --> ContextRecall[Context Recall]
    RAGMetrics --> Faithfulness[Faithfulness]
    RAGMetrics --> FactualCorrectness[Factual Correctness]
    AgentMetrics --> ToolCallAccuracy[Tool Call Accuracy]
    AgentMetrics --> GoalCompletion[Goal Completion]
    AgentMetrics --> TopicAdherence[Topic Adherence]
```

> **Note:** this is a partial list

### RAG-specific Metrics

1. **Context Recall**: Measures whether the retrieved context contains all necessary information to answer the query correctly[^5]
2. **Faithfulness**: Assesses if the generated answer is consistent with the provided context rather than made up (hallucinated)[^6]
3. **Factual Correctness**: Evaluates if the answer is factually accurate compared to a reference answer
4. **Answer Relevancy**: Measures how directly the response addresses the user's question

### Agent-specific Metrics

1. **Tool Call Accuracy**: Evaluates if the agent correctly identifies when to use tools and calls them with proper parameters[^7]
2. **Goal Completion**: Assesses if the agent accomplishes the user's intended goal
3. **Topic Adherence**: Measures if the agent stays on topic and within its domain of expertise
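As a rough illustration of how the RAG-side metrics are computed in code, the sketch below follows the Ragas 0.2-style API; exact class names, sample fields, and wrapper types can vary between versions, and the sample data and model name are invented for illustration. Each evaluation sample bundles the question, retrieved contexts, generated response, and a reference answer; the agent metrics are applied the same way to multi-turn samples built from message traces.

```python
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness, FactualCorrectness, LLMContextRecall

# Ragas metrics are judged by an LLM, so an evaluator model is required.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # placeholder model

# One evaluation sample: question, retrieved context, generated answer, reference.
sample = SingleTurnSample(
    user_input="When was the Eiffel Tower completed?",
    retrieved_contexts=["The Eiffel Tower was completed in 1889 for the World's Fair."],
    response="The Eiffel Tower was completed in 1889.",
    reference="The Eiffel Tower was finished in 1889.",
)

# Score the dataset; each metric yields a value between 0 and 1.
result = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)
print(result)
```

Run over a full test set, the per-metric scores provide the baseline that the case studies in Section 7 compare against after changes such as adding a reranker.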
### Interpreting Results

Ragas metrics are typically scored from 0 to 1, where:

- Higher scores indicate better performance
- Scores can be compared across system iterations
- Relative improvements are more meaningful than absolute scores

## 7. Case Studies from Notebooks

### RAG Evaluation Example

In the RAG evaluation notebook, we see:

- Creation of synthetic test data from document collections
- Evaluation of a baseline RAG system using multiple metrics
- Implementation of reranking to improve retrieval quality
- Re-evaluation to demonstrate metric improvements

The notebook showed significant improvements in Context Recall when adding reranking, demonstrating how targeted improvements informed by metrics can enhance system performance.

### Agent Evaluation Example

In the agent evaluation notebook, we see:

- Building a ReAct agent with a metal price lookup tool
- Converting agent interaction traces to Ragas format
- Evaluating Tool Call Accuracy and Goal Completion
- Testing Topic Adherence with an out-of-domain query

The agent showed perfect scores for Tool Call Accuracy and Goal Completion but failed the Topic Adherence test, highlighting a need for better domain enforcement.

## 8. Best Practices for Evaluation

Based on both notebooks, here are key best practices for evaluating LLM applications:

1. **Establish Baselines**: Start with a baseline implementation and metrics
2. **Use Multiple Metrics**: Different metrics reveal different aspects of performance
3. **Target Improvements**: Make focused changes based on metric insights
4. **Test Edge Cases**: Include challenging queries that test system boundaries
5. **Combine Automatic and Manual Evaluation**: Use metrics for quantitative assessment and human review for qualitative insights
6. **Continuous Evaluation**: Incorporate evaluation into your development workflow

## 9. Conclusion

Evaluation is a critical aspect of building reliable and effective LLM applications. The Ragas framework provides powerful tools for objectively measuring the performance of both RAG systems and agents, enabling data-driven improvements.

As LLM applications continue to evolve from simple RAG systems to sophisticated agents and agentic RAG, robust evaluation frameworks like Ragas will become increasingly important for ensuring quality and reliability.

## References

[^1]: "What is retrieval-augmented generation (RAG)?" IBM Research. https://research.ibm.com/blog/retrieval-augmented-generation-RAG
[^2]: "LangGraph: A stateful orchestration framework for LLM applications." LangChain. https://www.langchain.com/langgraph
[^3]: "Agentic RAG: Extending traditional Retrieval-Augmented Generation (RAG) pipelines with intelligent agents." LeewayHertz. https://www.leewayhertz.com/agentic-rag/
[^4]: "GitHub - explodinggradients/ragas: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas
[^5]: "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." Medium. https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38
[^6]: "Overview of Metrics - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/overview/
[^7]: "Agentic or Tool use - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/