Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing Large Language Models (LLMs) with external knowledge. However, evaluating RAG pipelines presents significant challenges due to the complexity of retrieval quality, generation accuracy, and the overall coherence of responses. This document provides a comprehensive analysis of using RAGAS (Retrieval Augmented Generation Assessment) for synthetic test data generation and LangSmith for RAG pipeline evaluation, based on the Jupyter notebook example provided.
Retrieval-Augmented Generation is a technique that enhances LLMs by providing them with relevant external knowledge. A typical RAG system consists of two main components[1]:
- Retriever: Identifies and retrieves relevant information needed to answer a query
- Generator: Generates the answer using the retrieved information
RAG systems allow LLMs to access external knowledge sources, reducing the risk of hallucinations and enabling more factual responses.
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework specifically designed for the evaluation of RAG pipelines. It provides comprehensive metrics and tools to assess RAG implementations without requiring human-annotated reference data[2].
RAGAS offers several specialized metrics to evaluate different aspects of a RAG system[3]:
- Faithfulness: Measures how factually consistent the generated answer is with the retrieved context, helping identify hallucinations
- Answer Relevancy: Evaluates how well the generated answer addresses the original query
- Context Precision: Assesses how much of the retrieved context is actually relevant to the question
- Context Recall: Measures how much of the necessary information to answer the question is contained in the retrieved context
- Context Relevance: Evaluates the overall relevance of the retrieved documents to the question
- Citation Accuracy: Assesses whether citations in the generated response correctly refer to information in the retrieved context
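In code, scoring a RAG run against these metrics is typically a single evaluate call. The sketch below is a minimal example, assuming the ragas.metrics interface that exposes faithfulness, answer_relevancy, context_precision, and context_recall as importable metric instances, a Hugging Face Dataset with question/answer/contexts/ground_truth columns (the exact schema varies by RAGAS version), and an OpenAI API key for the judge LLM; the example record is purely illustrative:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One toy record: the question, the generated answer, the retrieved contexts,
# and a reference (ground-truth) answer
eval_data = Dataset.from_dict({
    "question": ["What does the retriever in a RAG system do?"],
    "answer": ["It fetches passages relevant to the query for the generator."],
    "contexts": [["The retriever identifies and retrieves relevant information needed to answer a query."]],
    "ground_truth": ["It identifies and retrieves the information needed to answer the query."],
})

# Each metric is scored by an LLM judge (requires OPENAI_API_KEY by default)
scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)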
One of RAGAS' most valuable features is its ability to generate synthetic test data for RAG evaluation, addressing the significant challenge of creating extensive, domain-specific evaluation datasets.
RAGAS employs a sophisticated knowledge graph-based approach for synthetic data generation[4]:
Document Loading ──▶ Knowledge Graph Construction ──▶ Graph Transformations ──▶ Query Generation
- Knowledge Graph Creation: RAGAS constructs a knowledge graph from the input documents and enriches it with additional information through various transformations:

from ragas.testset.graph import KnowledgeGraph, Node, NodeType

kg = KnowledgeGraph()

# Add each document as a node in the knowledge graph
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
- Node Generation: The framework extracts entities, themes, summaries, and headlines from the documents to create nodes in the knowledge graph:

from ragas.testset.transforms import default_transforms, apply_transforms

# Build and apply the default transformations to enrich the knowledge graph
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
- Query Synthesis: Using the knowledge graph, RAGAS generates different types of queries:

from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
  - Single-hop specific queries: Focus on single pieces of information
  - Multi-hop abstract queries: Require synthesis across multiple pieces of information
  - Multi-hop specific queries: Focus on specific details across multiple documents
- Testset Generation: Finally, RAGAS generates a comprehensive testset from a TestsetGenerator built on the enriched knowledge graph (a construction sketch follows the code):

testset = generator.generate(testset_size=10, query_distribution=query_distribution)
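A minimal sketch of how that generator can be constructed from the enriched knowledge graph, assuming generator_llm and embedding_model are the same RAGAS LLM and embedding wrappers used for the transforms:

from ragas.testset import TestsetGenerator

# Build the testset generator on top of the enriched knowledge graph
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=embedding_model,
    knowledge_graph=kg,
)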
This approach allows developers to automatically generate comprehensive test datasets that can evaluate RAG systems across various query types and complexity levels. According to industry research, synthetic data generation can reduce developer time by up to 90% compared to manual test set creation[5].
LangSmith is a unified platform for observability and evaluation where teams can debug, test, and monitor AI application performance[6]. It plays a critical role in RAG evaluation by providing:
- A platform to create and store test datasets
- Tools to run evaluations and visualize results
- Features to make metrics explainable and reproducible
- Capabilities to continuously add test examples from production logs
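Connecting an application to LangSmith is usually just a matter of exporting a few environment variables before any chains run. A minimal setup sketch; the project name below is illustrative:

import os

# Enable LangSmith tracing and point runs at a project (name is illustrative)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "ragas-langsmith-rag-eval"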
After generating synthetic data with RAGAS, the next step is to create a dataset in LangSmith:
from langsmith import Client
client = Client()
dataset_name = "State of AI Across the Years2!"
langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years2!"
)

# Add the synthetic examples to the LangSmith dataset
for data_row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={
            "question": data_row[1]["user_input"]
        },
        outputs={
            "answer": data_row[1]["reference"]
        },
        metadata={
            "context": data_row[1]["reference_contexts"]
        },
        dataset_id=langsmith_dataset.id
    )
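As a quick sanity check, the uploaded examples can be listed back from the dataset using the same client:

# Confirm the synthetic examples landed in the LangSmith dataset
examples = list(client.list_examples(dataset_id=langsmith_dataset.id))
print(f"Dataset '{dataset_name}' now contains {len(examples)} examples")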
The example demonstrates building a simple RAG pipeline using the following components:
- Document Processing:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

rag_documents = text_splitter.split_documents(rag_documents)
- Embeddings Generation:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
- Vector Store Creation:

from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
- RAG Prompt Template:

from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
- LLM Integration:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")
- RAG Chain Assembly:

from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt
    | llm
    | StrOutputParser()
)
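With the chain assembled, a single call exercises retrieval and generation end to end; the question below is illustrative:

response = rag_chain.invoke({"question": "What were the major AI developments covered in the report?"})
print(response)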
The notebook sets up several evaluators to assess different aspects of the RAG system:
from langsmith.evaluation import LangChainStringEvaluator, evaluate
from langchain_openai import ChatOpenAI

# Judge LLM shared by all three evaluators (GPT-4.1, per the discussion below),
# kept separate from the generation model
eval_llm = ChatOpenAI(model="gpt-4.1")

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm": eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm": eval_llm
    }
)
- QA Evaluator (qa_evaluator):
  - Assesses whether the generated answer is correct based on the reference answer
  - Uses GPT-4.1 to determine if the answer is accurate
- Labeled Helpfulness Evaluator (labeled_helpfulness_evaluator):
  - Measures whether the response is helpful to the user, taking into account the correct reference answer
  - Uses the reference answer as ground truth to contextualize helpfulness
- "Dope or Nope" Evaluator (dope_or_nope_evaluator):
  - Evaluates whether the submission is "dope, lit, or cool"
  - Assesses the style and engagement factor of the response
  - Provides a more subjective measure of response quality
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)
The notebook demonstrates how to improve the RAG pipeline based on evaluation results:
- Enhanced Prompt:

DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)
- Larger Chunk Size:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50
)
- Improved Embedding Model:

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
- Chunk Size Impact:
  - Larger chunks provide more context for the LLM, potentially improving the coherence and completeness of answers
  - However, larger chunks might also include irrelevant information, potentially reducing precision
  - The optimal chunk size depends on the specific use case and document structure
- Embedding Model Impact:
  - Higher-quality embeddings can better capture semantic relationships in the text
  - This can lead to more accurate retrieval, ensuring the most relevant information is found
  - Improved embeddings can reduce the "lost in embedding space" problem, where semantically similar content is not properly matched
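Before re-running the evaluation, these improved components have to be wired back into a new chain, dope_rag_chain. A reassembly sketch mirroring the original chain; it assumes documents is a hypothetical handle to the original, unsplit source documents, and the collection name is illustrative:

from operator import itemgetter
from langchain_community.vectorstores import Qdrant
from langchain.schema import StrOutputParser

# Re-split the original (unsplit) documents with the larger chunk size;
# `documents` is a hypothetical name for the freshly loaded source docs
dope_documents = text_splitter.split_documents(documents)

# Rebuild the vector store with the new chunks and the larger embedding model
dope_vectorstore = Qdrant.from_documents(
    documents=dope_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI (dope)",  # illustrative collection name
)
dope_retriever = dope_vectorstore.as_retriever(search_kwargs={"k": 10})

# Reassemble the chain around the enhanced prompt
dope_rag_chain = (
    {"context": itemgetter("question") | dope_retriever, "question": itemgetter("question")}
    | dope_rag_prompt
    | llm
    | StrOutputParser()
)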
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)
The combination of RAGAS for synthetic data generation and LangSmith for evaluation provides a powerful workflow for developing and refining RAG systems:
- Efficient Testing: Synthetic data generation reduces the time and effort required to create comprehensive test datasets
- Comprehensive Evaluation: The multiple evaluators in LangSmith provide a holistic view of system performance
- Iterative Improvement: The ability to quickly run evaluations enables rapid iteration and improvement
- Objective Measurement: Standardized metrics help quantify improvements across different versions of the RAG pipeline
This approach enables developers to build more robust, accurate, and engaging RAG systems with less manual effort and more systematic evaluation.
[1] Superlinked. "Evaluating Retrieval Augmented Generation using RAGAS." VectorHub. https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas
[2] Explodinggradients. "RAGAS: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas
[3] Medium. "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38
[4] Ragas Documentation. "Generate Synthetic Testset for RAG." https://docs.ragas.io/en/latest/getstarted/rag_testset_generation/
[5] Advancing Analytics. "Streamline Your RAG Pipeline Evaluation with Synthetic Data." https://www.advancinganalytics.co.uk/blog/just-built-a-rag-pipeline-and-need-to-evaluate-its-performance
[6] LangChain. "LangSmith." https://www.langchain.com/langsmith
[7] LangChain. "Evaluating RAG pipelines with Ragas + LangSmith." https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/
[8] LangSmith Documentation. "Evaluation concepts." https://docs.smith.langchain.com/evaluation/concepts