Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing Large Language Models (LLMs) with external knowledge. However, evaluating RAG pipelines presents significant challenges due to the complexity of retrieval quality, generation accuracy, and the overall coherence of responses. This document provides a comprehensive analysis of using RAGAS (Retrieval Augmented Generation Assessment) for synthetic test data generation and LangSmith for RAG pipeline evaluation, based on the Jupyter notebook example provided.
Retrieval-Augmented Generation is a technique that enhances LLMs by providing them with relevant external knowledge. A typical RAG system consists of two main components[1]:
- Retriever: Identifies and retrieves relevant information needed to answer a query
- Generator: Generates the answer using the retrieved information
RAG systems allow LLMs to access external knowledge sources, reducing the risk of hallucinations and enabling more factual responses.
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework specifically designed for the evaluation of RAG pipelines. It provides comprehensive metrics and tools to assess RAG implementations without requiring human-annotated reference data[2].
RAGAS offers several specialized metrics to evaluate different aspects of a RAG system[3]:
- Faithfulness: Measures how factually consistent the generated answer is with the retrieved context, helping identify hallucinations
- Answer Relevancy: Evaluates how well the generated answer addresses the original query
- Context Precision: Assesses how much of the retrieved context is actually relevant to the question
- Context Recall: Measures how much of the necessary information to answer the question is contained in the retrieved context
- Context Relevance: Evaluates the overall relevance of the retrieved documents to the question
- Citation Accuracy: Assesses whether citations in the generated response correctly refer to information in the retrieved context
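In code, scoring a RAG run against these metrics is typically a single evaluate call. The sketch below is a minimal example, assuming the ragas.metrics interface that exposes faithfulness, answer_relevancy, context_precision, and context_recall as importable metric instances, a Hugging Face Dataset with question/answer/contexts/ground_truth columns (the exact schema varies by RAGAS version), and an OpenAI API key for the judge LLM; the example record is purely illustrative:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One toy record: the question, the generated answer, the retrieved contexts,
# and a reference (ground-truth) answer
eval_data = Dataset.from_dict({
    "question": ["What does the retriever in a RAG system do?"],
    "answer": ["It fetches passages relevant to the query for the generator."],
    "contexts": [["The retriever identifies and retrieves relevant information needed to answer a query."]],
    "ground_truth": ["It identifies and retrieves the information needed to answer the query."],
})

# Each metric is scored by an LLM judge (requires OPENAI_API_KEY by default)
scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)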
One of RAGAS' most valuable features is its ability to generate synthetic test data for RAG evaluation, addressing the significant challenge of creating extensive, domain-specific evaluation datasets.
RAGAS employs a sophisticated knowledge graph-based approach for synthetic data generation[4]:
Document Loading ──▶ Knowledge Graph Construction ──▶ Graph Transformations ──▶ Query Generation
- Knowledge Graph Creation: RAGAS constructs a knowledge graph from the input documents and enriches it with additional information through various transformations:

from ragas.testset.graph import KnowledgeGraph, Node, NodeType

kg = KnowledgeGraph()

# Add each document as a node in the knowledge graph
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
- Node Generation: The framework extracts entities, themes, summaries, and headlines from the documents to create nodes in the knowledge graph:

from ragas.testset.transforms import default_transforms, apply_transforms

# Build and apply the default transformations to enrich the knowledge graph
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
- Query Synthesis: Using the knowledge graph, RAGAS generates different types of queries:

from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
  - Single-hop specific queries: Focus on single pieces of information
  - Multi-hop abstract queries: Require synthesis across multiple pieces of information
  - Multi-hop specific queries: Focus on specific details across multiple documents
- Testset Generation: Finally, RAGAS generates a comprehensive testset from a TestsetGenerator built on the enriched knowledge graph (a construction sketch follows the code):

testset = generator.generate(testset_size=10, query_distribution=query_distribution)
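A minimal sketch of how that generator can be constructed from the enriched knowledge graph, assuming generator_llm and embedding_model are the same RAGAS LLM and embedding wrappers used for the transforms:

from ragas.testset import TestsetGenerator

# Build the testset generator on top of the enriched knowledge graph
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=embedding_model,
    knowledge_graph=kg,
)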
This approach allows developers to automatically generate comprehensive test datasets that can evaluate RAG systems across various query types and complexity levels. According to industry research, synthetic data generation can reduce developer time by up to 90% compared to manual test set creation[5].
LangSmith is a unified platform for observability and evaluation where teams can debug, test, and monitor AI application performance[6]. It plays a critical role in RAG evaluation by providing:
- A platform to create and store test datasets
- Tools to run evaluations and visualize results
- Features to make metrics explainable and reproducible
- Capabilities to continuously add test examples from production logs
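Connecting an application to LangSmith is usually just a matter of exporting a few environment variables before any chains run. A minimal setup sketch; the project name below is illustrative:

import os

# Enable LangSmith tracing and point runs at a project (name is illustrative)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "ragas-langsmith-rag-eval"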
After generating synthetic data with RAGAS, the next step is to create a dataset in LangSmith:
from langsmith import Client
client = Client()
dataset_name = "State of AI Across the Years2!"
langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years2!"
)

# Add the synthetic examples to the LangSmith dataset
for data_row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={
            "question": data_row[1]["user_input"]
        },
        outputs={
            "answer": data_row[1]["reference"]
        },
        metadata={
            "context": data_row[1]["reference_contexts"]
        },
        dataset_id=langsmith_dataset.id
    )
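As a quick sanity check, the uploaded examples can be listed back from the dataset using the same client:

# Confirm the synthetic examples landed in the LangSmith dataset
examples = list(client.list_examples(dataset_id=langsmith_dataset.id))
print(f"Dataset '{dataset_name}' now contains {len(examples)} examples")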
The example demonstrates building a simple RAG pipeline using the following components:
- Document Processing:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

rag_documents = text_splitter.split_documents(rag_documents)
- Embeddings Generation:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
- Vector Store Creation:

from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
- RAG Prompt Template:

from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
- LLM Integration:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")
- RAG Chain Assembly:

from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt
    | llm
    | StrOutputParser()
)
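With the chain assembled, a single call exercises retrieval and generation end to end; the question below is illustrative:

response = rag_chain.invoke({"question": "What were the major AI developments covered in the report?"})
print(response)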
The notebook sets up several evaluators to assess different aspects of the RAG system:
from langsmith.evaluation import LangChainStringEvaluator, evaluate
from langchain_openai import ChatOpenAI

# Judge LLM shared by all three evaluators (GPT-4.1, per the discussion below),
# kept separate from the generation model
eval_llm = ChatOpenAI(model="gpt-4.1")

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm": eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm": eval_llm
    }
)
- QA Evaluator (qa_evaluator):
  - Assesses whether the generated answer is correct based on the reference answer
  - Uses GPT-4.1 to determine if the answer is accurate
- Labeled Helpfulness Evaluator (labeled_helpfulness_evaluator):
  - Measures whether the response is helpful to the user, taking into account the correct reference answer
  - Uses the reference answer as ground truth to contextualize helpfulness
- "Dope or Nope" Evaluator (dope_or_nope_evaluator):
  - Evaluates whether the submission is "dope, lit, or cool"
  - Assesses the style and engagement factor of the response
  - Provides a more subjective measure of response quality
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)
The notebook demonstrates how to improve the RAG pipeline based on evaluation results:
- Enhanced Prompt:

DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)
- Larger Chunk Size:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50
)
- Improved Embedding Model:

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
- Chunk Size Impact:
  - Larger chunks provide more context for the LLM, potentially improving the coherence and completeness of answers
  - However, larger chunks might also include irrelevant information, potentially reducing precision
  - The optimal chunk size depends on the specific use case and document structure
- Embedding Model Impact:
  - Higher-quality embeddings can better capture semantic relationships in the text
  - This can lead to more accurate retrieval, ensuring the most relevant information is found
  - Improved embeddings can reduce the "lost in embedding space" problem, where semantically similar content is not properly matched
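Before re-running the evaluation, these improved components have to be wired back into a new chain, dope_rag_chain. A reassembly sketch mirroring the original chain; it assumes documents is a hypothetical handle to the original, unsplit source documents, and the collection name is illustrative:

from operator import itemgetter
from langchain_community.vectorstores import Qdrant
from langchain.schema import StrOutputParser

# Re-split the original (unsplit) documents with the larger chunk size;
# `documents` is a hypothetical name for the freshly loaded source docs
dope_documents = text_splitter.split_documents(documents)

# Rebuild the vector store with the new chunks and the larger embedding model
dope_vectorstore = Qdrant.from_documents(
    documents=dope_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI (dope)",  # illustrative collection name
)
dope_retriever = dope_vectorstore.as_retriever(search_kwargs={"k": 10})

# Reassemble the chain around the enhanced prompt
dope_rag_chain = (
    {"context": itemgetter("question") | dope_retriever, "question": itemgetter("question")}
    | dope_rag_prompt
    | llm
    | StrOutputParser()
)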
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)
The combination of RAGAS for synthetic data generation and LangSmith for evaluation provides a powerful workflow for developing and refining RAG systems:
- Efficient Testing: Synthetic data generation reduces the time and effort required to create comprehensive test datasets
- Comprehensive Evaluation: The multiple evaluators in LangSmith provide a holistic view of system performance
- Iterative Improvement: The ability to quickly run evaluations enables rapid iteration and improvement
- Objective Measurement: Standardized metrics help quantify improvements across different versions of the RAG pipeline
This approach enables developers to build more robust, accurate, and engaging RAG systems with less manual effort and more systematic evaluation.
[1] Superlinked. "Evaluating Retrieval Augmented Generation using RAGAS." VectorHub. https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas
[2] Explodinggradients. "RAGAS: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas
[3] Medium. "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38
[4] Ragas Documentation. "Generate Synthetic Testset for RAG." https://docs.ragas.io/en/latest/getstarted/rag_testset_generation/
[5] Advancing Analytics. "Streamline Your RAG Pipeline Evaluation with Synthetic Data." https://www.advancinganalytics.co.uk/blog/just-built-a-rag-pipeline-and-need-to-evaluate-its-performance
[6] LangChain. "LangSmith." https://www.langchain.com/langsmith
[7] LangChain. "Evaluating RAG pipelines with Ragas + LangSmith." https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/
[8] LangSmith Documentation. "Evaluation concepts." https://docs.smith.langchain.com/evaluation/concepts