# Evaluating LLM Applications: From RAG to Agents with Ragas

## 1. Introduction

Large Language Models (LLMs) have revolutionized AI applications by enabling natural language understanding and generation capabilities. However, as these applications grow more sophisticated, ensuring their quality, reliability, and accuracy becomes increasingly challenging. Two key architectures in the LLM ecosystem are Retrieval-Augmented Generation (RAG) systems and LLM-powered agents.

This guide introduces the concepts of RAG systems and agents, explains their relationship, and presents the Ragas framework for evaluating their performance. We'll explore examples from two practical implementations: evaluating a RAG system and evaluating an agent application.

## 2. Understanding RAG Systems

### What is RAG?

Retrieval-Augmented Generation (RAG) is an approach that enhances LLM outputs by retrieving relevant information from external knowledge sources before generating responses. This method addresses limitations in LLMs' internal knowledge by providing up-to-date and domain-specific information[^1].

### Components of a RAG System

```mermaid
graph LR
    User([User]) -- "Query" --> RAG
    subgraph RAG[RAG System]
        Query[Query Processing] --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Context" --> Generator[Generator]
        Generator -- "Augmented Response" --> Output
    end
    Output --> User
```

A RAG system typically consists of:

1. **Query Processor**: Prepares the user's query for retrieval
2. **Retriever**: Finds relevant information from a knowledge base
3. **Knowledge Base**: Contains documents, chunks, or data points
4. **Generator**: An LLM that produces responses using retrieved context
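To make these four components concrete, here is a minimal, self-contained sketch in plain Python: the retriever ranks a tiny in-memory knowledge base by naive keyword overlap, and the generator is a stub standing in for an LLM call. The function names and the toy knowledge base are illustrative only; a real pipeline would substitute a vector store and an actual model.

```python
# Toy in-memory knowledge base: a handful of text chunks.
KNOWLEDGE_BASE = [
    "Ragas is a framework for evaluating LLM applications such as RAG pipelines.",
    "Retrieval-Augmented Generation retrieves external context before generating an answer.",
    "Context Recall measures whether the retrieved context covers the reference answer.",
]


def process_query(query: str) -> set[str]:
    """Query processor: normalize the query into a set of lowercase terms."""
    return {term.strip("?.,!").lower() for term in query.split()}


def retrieve(query_terms: set[str], k: int = 2) -> list[str]:
    """Retriever: rank chunks by keyword overlap (a stand-in for vector search)."""
    def overlap(chunk: str) -> int:
        return len(query_terms & {word.strip("?.,!").lower() for word in chunk.split()})

    return sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)[:k]


def generate(query: str, context: list[str]) -> str:
    """Generator: stub for the LLM call that would produce the augmented response."""
    return f"Answer to '{query}', grounded in: {' | '.join(context)}"


def rag_answer(query: str) -> str:
    """End-to-end pipeline: process the query, retrieve context, generate a response."""
    context = retrieve(process_query(query))
    return generate(query, context)


print(rag_answer("What does Context Recall measure?"))
```

Even in this toy form, the separation of stages matters for evaluation: the retrieved context and the generated response are exactly the artifacts that Ragas metrics score later in this guide.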
### Benefits and Limitations

RAG systems offer several advantages:

- Provide access to information not in the LLM's training data
- Reduce hallucination by grounding responses in retrieved facts
- Enable easy knowledge updates without retraining the LLM
- Allow for domain-specific knowledge incorporation

However, they also face challenges:

- Retrieval quality heavily impacts response quality
- May struggle with complex reasoning tasks
- Lack persistence in multi-turn conversations

## 3. Understanding LLM Agents

### What are LLM Agents?

LLM agents are systems that enhance LLMs with the ability to interact with external tools, maintain state across interactions, and perform complex reasoning tasks. They can make decisions about when to use tools and how to interpret their outputs[^2].

### The ReAct Pattern

```mermaid
graph TD
    User([User]) -- "Query" --> Agent
    subgraph Agent[Agent System]
        State[State Management] --> LLM[Language Model]
        LLM -- "Reasoning" --> ToolUse[Tool Decision]
        ToolUse -- "If needed" --> Tools[External Tools]
        Tools -- "Results" --> State
        ToolUse -- "If not needed" --> Response[Response Generation]
    end
    Response --> User
```

The ReAct (Reasoning + Acting) pattern is a common approach for building agents:

1. **Reasoning**: The agent analyzes the current state and decides on a course of action
2. **Acting**: The agent executes actions using tools
3. **Observing**: The agent processes the results of its actions
4. **Updating**: The agent updates its state based on observations

### State Management and Tools

Agents require:

- **State Management**: Tracking conversation history, tool outputs, and intermediate reasoning
- **Tool Integration**: The ability to use external tools like APIs, calculators, or databases
- **Decision Logic**: Rules for determining when to use tools vs. direct responses
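To ground the pattern, here is a rough sketch of how such an agent might be assembled with LangGraph (the orchestration framework cited in [^2]), whose prebuilt ReAct helper supplies the reason/act/observe/update loop and the message-history state. The tool, its hard-coded prices, and the model name are placeholders, not part of the original notebooks.

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def lookup_price(item: str) -> float:
    """Return the price of an item in USD per gram (hard-coded placeholder data)."""
    prices = {"gold": 88.16, "silver": 1.05, "copper": 0.0098}
    return prices.get(item.lower(), float("nan"))


# The LLM performs the "Reasoning" step; create_react_agent wires up the
# act/observe/update loop and keeps the conversation state for us.
llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model name
agent = create_react_agent(llm, tools=[lookup_price])

# One turn: the agent decides whether the tool is needed, observes its
# result, and then generates the final response.
result = agent.invoke({"messages": [("user", "How much does silver cost?")]})
print(result["messages"][-1].content)
```

A message trace like `result["messages"]` is the kind of interaction record that is later converted into Ragas' evaluation format (see the agent case study in Section 7).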
## 4. Evolution to Agentic RAG

### Combining RAG and Agents

Agentic RAG represents an evolution where retrieval becomes an integral part of an agent's toolkit rather than a separate pipeline[^3]. In this approach, an agent orchestrates the retrieval process strategically.

```mermaid
graph TB
    User([User]) -- "Query" --> AgenticRAG
    subgraph AgenticRAG[Agentic RAG System]
        Agent[Agent Coordinator] -- "Decides strategy" --> QueryProcessor[Query Processor]
        QueryProcessor --> Retriever[Retriever]
        Retriever -- "Fetches" --> KnowledgeBase[(Knowledge Base)]
        Retriever -- "Relevant Documents" --> AgentTools[Agent Tools]
        AgentTools -- "Processed Context" --> Generator[Generator]
        Generator -- "Enhanced Response" --> Agent
    end
    AgenticRAG -- "Final Response" --> User
```

### Advantages of Agentic RAG

Agentic RAG offers several improvements over standard RAG:

- More sophisticated retrieval strategies adapted to the query
- Better handling of complex information needs
- Ability to reformulate queries based on initial results
- Integration of multiple sources of information

## 5. Evaluating LLM Applications with Ragas

### The Evaluation Workflow

Ragas provides a framework for evaluating both RAG systems and agents through objective metrics[^4]. The typical evaluation workflow follows these steps:

```mermaid
graph LR
    Build[Build System] --> Test[Test System]
    Test -- "Generate traces" --> Evaluate[Evaluate with Ragas]
    Evaluate -- "Analyze metrics" --> Improve[Improve System]
    Improve --> Test
```

1. **Build**: Develop the initial RAG or agent system
2. **Test**: Generate interaction traces with test queries
3. **Evaluate**: Apply Ragas metrics to measure performance
4. **Improve**: Make targeted improvements based on metrics
5. **Repeat**: Continue the cycle to refine the system

### Synthetic Data Generation

Ragas can generate synthetic test data for evaluation:

- Creates realistic queries based on your knowledge base
- Provides expected answers as a reference
- Generates different query types (simple, complex, multi-hop)
- Creates test scenarios that challenge different aspects of your system

### Objective vs. Subjective Evaluation

Traditional evaluation often relies on subjective human judgment, which is:

- Time-consuming and expensive
- Inconsistent across evaluators
- Difficult to scale

Ragas provides objective, consistent metrics that:

- Can be automatically calculated
- Provide comparable scores across systems
- Identify specific areas for improvement
- Scale to large test sets

## 6. Key Metrics for Evaluation

Ragas offers different metrics for evaluating RAG systems and agents:

```mermaid
graph TD
    Ragas[Ragas Framework] --> RAGMetrics[RAG Metrics]
    Ragas --> AgentMetrics[Agent Metrics]
    RAGMetrics --> ContextRecall[Context Recall]
    RAGMetrics --> Faithfulness[Faithfulness]
    RAGMetrics --> FactualCorrectness[Factual Correctness]
    AgentMetrics --> ToolCallAccuracy[Tool Call Accuracy]
    AgentMetrics --> GoalCompletion[Goal Completion]
    AgentMetrics --> TopicAdherence[Topic Adherence]
```

> **Note:** this is a partial list

### RAG-specific Metrics

1. **Context Recall**: Measures whether the retrieved context contains all necessary information to answer the query correctly[^5]
2. **Faithfulness**: Assesses if the generated answer is consistent with the provided context rather than made up (hallucinated)[^6]
3. **Factual Correctness**: Evaluates if the answer is factually accurate compared to a reference answer
4. **Answer Relevancy**: Measures how directly the response addresses the user's question

### Agent-specific Metrics

1. **Tool Call Accuracy**: Evaluates if the agent correctly identifies when to use tools and calls them with proper parameters[^7]
2. **Goal Completion**: Assesses if the agent accomplishes the user's intended goal
3. **Topic Adherence**: Measures if the agent stays on topic and within its domain of expertise
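As a rough illustration of how the RAG-side metrics are computed in code, the sketch below follows the Ragas 0.2-style API; exact class names, sample fields, and wrapper types can vary between versions, and the sample data and model name are invented for illustration. Each evaluation sample bundles the question, retrieved contexts, generated response, and a reference answer; the agent metrics are applied the same way to multi-turn samples built from message traces.

```python
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness, FactualCorrectness, LLMContextRecall

# Ragas metrics are judged by an LLM, so an evaluator model is required.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # placeholder model

# One evaluation sample: question, retrieved context, generated answer, reference.
sample = SingleTurnSample(
    user_input="When was the Eiffel Tower completed?",
    retrieved_contexts=["The Eiffel Tower was completed in 1889 for the World's Fair."],
    response="The Eiffel Tower was completed in 1889.",
    reference="The Eiffel Tower was finished in 1889.",
)

# Score the dataset; each metric yields a value between 0 and 1.
result = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)
print(result)
```

Run over a full test set, the per-metric scores provide the baseline that the case studies in Section 7 compare against after changes such as adding a reranker.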
### Interpreting Results

Ragas metrics are typically scored from 0 to 1, where:

- Higher scores indicate better performance
- Scores can be compared across system iterations
- Relative improvements are more meaningful than absolute scores

## 7. Case Studies from Notebooks

### RAG Evaluation Example

In the RAG evaluation notebook, we see:

- Creation of synthetic test data from document collections
- Evaluation of a baseline RAG system using multiple metrics
- Implementation of reranking to improve retrieval quality
- Re-evaluation to demonstrate metric improvements

The notebook showed significant improvements in Context Recall when adding reranking, demonstrating how targeted improvements informed by metrics can enhance system performance.

### Agent Evaluation Example

In the agent evaluation notebook, we see:

- Building a ReAct agent with a metal price lookup tool
- Converting agent interaction traces to Ragas format
- Evaluating Tool Call Accuracy and Goal Completion
- Testing Topic Adherence with an out-of-domain query

The agent showed perfect scores for Tool Call Accuracy and Goal Completion but failed the Topic Adherence test, highlighting a need for better domain enforcement.

## 8. Best Practices for Evaluation

Based on both notebooks, here are key best practices for evaluating LLM applications:

1. **Establish Baselines**: Start with a baseline implementation and metrics
2. **Use Multiple Metrics**: Different metrics reveal different aspects of performance
3. **Target Improvements**: Make focused changes based on metric insights
4. **Test Edge Cases**: Include challenging queries that test system boundaries
5. **Combine Automatic and Manual Evaluation**: Use metrics for quantitative assessment and human review for qualitative insights
6. **Continuous Evaluation**: Incorporate evaluation into your development workflow

## 9. Conclusion

Evaluation is a critical aspect of building reliable and effective LLM applications. The Ragas framework provides powerful tools for objectively measuring the performance of both RAG systems and agents, enabling data-driven improvements.

As LLM applications continue to evolve from simple RAG systems to sophisticated agents and agentic RAG, robust evaluation frameworks like Ragas will become increasingly important for ensuring quality and reliability.

## References

[^1]: "What is retrieval-augmented generation (RAG)?" IBM Research. https://research.ibm.com/blog/retrieval-augmented-generation-RAG
[^2]: "LangGraph: A stateful orchestration framework for LLM applications." LangChain. https://www.langchain.com/langgraph
[^3]: "Agentic RAG: Extending traditional Retrieval-Augmented Generation (RAG) pipelines with intelligent agents." LeewayHertz. https://www.leewayhertz.com/agentic-rag/
[^4]: "GitHub - explodinggradients/ragas: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas
[^5]: "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." Medium. https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38
[^6]: "Overview of Metrics - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/overview/
[^7]: "Agentic or Tool use - Ragas." Ragas Documentation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/