Synthetic Data Generation & RAG Evaluation: RAGAS + LangSmith

Introduction

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing Large Language Models (LLMs) with external knowledge. However, evaluating RAG pipelines is challenging because retrieval quality, generation accuracy, and the overall coherence of responses must all be assessed. This document walks through using RAGAS (Retrieval Augmented Generation Assessment) for synthetic test data generation and LangSmith for RAG pipeline evaluation, following the accompanying Jupyter notebook example.

What is RAG?

Retrieval-Augmented Generation is a technique that enhances LLMs by providing them with relevant external knowledge. A typical RAG system consists of two main components[1]:

  1. Retriever: Identifies and retrieves relevant information needed to answer a query
  2. Generator: Generates the answer using the retrieved information

RAG systems allow LLMs to access external knowledge sources, reducing the risk of hallucinations and enabling more factual responses.

RAGAS: A Framework for RAG Evaluation

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework specifically designed for the evaluation of RAG pipelines. It provides comprehensive metrics and tools to assess RAG implementations without requiring human-annotated reference data[2].

Key Evaluation Metrics in RAGAS

RAGAS offers several specialized metrics to evaluate different aspects of a RAG system[3] (a brief usage sketch follows the list):

  1. Faithfulness: Measures how factually consistent the generated answer is with the retrieved context, helping identify hallucinations
  2. Answer Relevancy: Evaluates how well the generated answer addresses the original query
  3. Context Precision: Assesses how much of the retrieved context is actually relevant to the question
  4. Context Recall: Measures how much of the necessary information to answer the question is contained in the retrieved context
  5. Context Relevance: Evaluates the overall relevance of the retrieved documents to the question
  6. Citation Accuracy: Assesses whether citations in the generated response correctly refer to information in the retrieved context
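
These metrics can be computed directly with RAGAS's evaluate function. The snippet below is a minimal sketch (not taken from the notebook), assuming the Hugging Face datasets package and the column names used in the RAGAS quickstart; metric and column names vary somewhat across RAGAS versions, and a configured judge LLM (e.g., an OpenAI API key) is required.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One evaluation row: the question, the pipeline's answer, the retrieved
# chunks, and a reference ("ground truth") answer.
eval_data = Dataset.from_dict({
    "question": ["What is RAGAS?"],
    "answer": ["RAGAS is an open-source framework for evaluating RAG pipelines."],
    "contexts": [["RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating RAG pipelines."]],
    "ground_truth": ["RAGAS is an open-source framework for evaluating RAG pipelines."],
})

results = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)  # per-metric scores between 0 and 1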

Synthetic Data Generation with RAGAS

One of RAGAS' most valuable features is its ability to generate synthetic test data for RAG evaluation, addressing the significant challenge of creating extensive, domain-specific evaluation datasets.

Knowledge Graph-Based Approach

RAGAS employs a sophisticated knowledge graph-based approach for synthetic data generation[4]:

┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│                     │     │                     │     │                     │
│  Document Loading   │────▶│ Knowledge Graph     │────▶│  Query Generation   │
│                     │     │  Construction       │     │                     │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
                                      │
                                      ▼
                            ┌─────────────────────┐
                            │  Graph              │
                            │  Transformations    │
                            └─────────────────────┘
  1. Knowledge Graph Creation: RAGAS constructs a knowledge graph from input documents and enriches it with additional information through various transformations:

    from ragas.testset.graph import KnowledgeGraph, Node, NodeType
    kg = KnowledgeGraph()
    
    # Add documents to the knowledge graph
    for doc in docs:
        kg.nodes.append(
            Node(
                type=NodeType.DOCUMENT,
                properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
            )
        )
  2. Node Generation: The framework extracts entities, themes, summaries, and headlines from documents to create nodes in the knowledge graph:

    from ragas.testset.transforms import default_transforms, apply_transforms
    
    # Apply transformations to enrich the knowledge graph
    transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
    apply_transforms(kg, transforms)
  3. Query Synthesis: Using the knowledge graph, RAGAS generates different types of queries:

    from ragas.testset.synthesizers import SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

    query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
    ]
    • Single-hop specific queries: Focus on single pieces of information
    • Multi-hop abstract queries: Require synthesis across multiple pieces of information
    • Multi-hop specific queries: Focus on specific details across multiple documents
  4. Testset Generation: Finally, RAGAS generates a comprehensive testset (the generator object is sketched below):

    testset = generator.generate(testset_size=10, query_distribution=query_distribution)
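
The generator object used above is a RAGAS TestsetGenerator built from the knowledge graph and the wrapped models used in the earlier steps. The following is a minimal construction sketch; the constructor argument names follow the ragas 0.2 API and may differ in other versions.

from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm,                # wrapped LLM used to synthesize queries
    embedding_model=embedding_model,  # wrapped embedding model
    knowledge_graph=kg,               # the enriched knowledge graph built above
)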

This approach allows developers to automatically generate comprehensive test datasets that can evaluate RAG systems across various query types and complexity levels. According to industry research, synthetic data generation can reduce developer time by up to 90% compared to manual test set creation[5].

LangSmith: A Comprehensive Evaluation Platform

LangSmith is a unified platform for observability and evaluation where teams can debug, test, and monitor AI application performance[6]. It plays a critical role in RAG evaluation by providing:

  1. A platform to create and store test datasets
  2. Tools to run evaluations and visualize results
  3. Features to make metrics explainable and reproducible
  4. Capabilities to continuously add test examples from production logs

Creating a LangSmith Dataset from RAGAS

After generating synthetic data with RAGAS, the next step is to create a dataset in LangSmith:

from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years2!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years2!"
)

# Add each synthetic example (question, reference answer, source contexts) to the LangSmith dataset
for _, data_row in testset.to_pandas().iterrows():
    client.create_example(
        inputs={
            "question": data_row["user_input"]
        },
        outputs={
            "answer": data_row["reference"]
        },
        metadata={
            "context": data_row["reference_contexts"]
        },
        dataset_id=langsmith_dataset.id
    )

Building a Basic RAG Pipeline

The example demonstrates building a simple RAG pipeline using the following components:

  1. Document Processing:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 500,
        chunk_overlap = 50
    )
    
    rag_documents = text_splitter.split_documents(rag_documents)
  2. Embeddings Generation:

    from langchain_openai import OpenAIEmbeddings
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
  3. Vector Store Creation:

    from langchain_community.vectorstores import Qdrant
    
    vectorstore = Qdrant.from_documents(
        documents=rag_documents,
        embedding=embeddings,
        location=":memory:",
        collection_name="State of AI"
    )
    
    retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
  4. RAG Prompt Template:

    from langchain.prompts import ChatPromptTemplate
    
    RAG_PROMPT = """\
    Given a provided context and question, you must answer the question based only on context.
    
    If you cannot answer the question based on the context - you must say "I don't know".
    
    Context: {context}
    Question: {question}
    """
    
    rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
  5. LLM Integration:

    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(model="gpt-4.1-mini")
  6. RAG Chain Assembly (a usage example follows this list):

    from operator import itemgetter
    from langchain_core.runnables import RunnablePassthrough, RunnableParallel
    from langchain.schema import StrOutputParser
    
    rag_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | rag_prompt | llm | StrOutputParser()
    )
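
Once assembled, the chain can be invoked with a single question. The example below is a usage sketch: the question text is made up, and rag_documents is assumed to have been loaded earlier with a LangChain document loader, a step not shown in the excerpt.

answer = rag_chain.invoke({"question": "What were the key findings about AI adoption?"})
print(answer)  # plain string output, thanks to StrOutputParser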

LangSmith Evaluation Setup

The notebook sets up several evaluators to assess different aspects of the RAG system:

from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Judge model shared by all three evaluators (GPT-4.1, per the evaluator descriptions below)
eval_llm = ChatOpenAI(model="gpt-4.1")

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)

Understanding the Evaluators

  1. QA Evaluator (qa_evaluator):

    • Assesses whether the generated answer is correct based on the reference answer
    • Uses GPT-4.1 to determine if the answer is accurate
  2. Labeled Helpfulness Evaluator (labeled_helpfulness_evaluator):

    • Measures whether the response is helpful to the user, taking into account the correct reference answer
    • Uses the reference answer as ground truth to contextualize helpfulness
  3. "Dope or Nope" Evaluator (dope_or_nope_evaluator):

    • Evaluates whether the submission is "dope, lit, or cool"
    • Assesses the style and engagement factor of the response
    • Provides a more subjective measure of response quality

Running the Evaluation

evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

Improving the RAG Pipeline

The notebook demonstrates how to improve the RAG pipeline based on the evaluation results (a sketch of the reassembled chain follows the list):

  1. Enhanced Prompt:

    DOPE_RAG_PROMPT = """\
    Given a provided context and question, you must answer the question based only on context.
    
    If you cannot answer the question based on the context - you must say "I don't know".
    
    You must answer the questions in a dope way, be cool!
    
    Context: {context}
    Question: {question}
    """
    
    dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)
  2. Larger Chunk Size:

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1000,
        chunk_overlap = 50
    )
  3. Improved Embedding Model:

    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
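
The re-evaluation further below invokes a dope_rag_chain that is not shown in the excerpt. The following sketch shows how it might be assembled from the improved pieces; the new variable names and the collection name are assumptions, and it presumes the original (unsplit) documents have been re-loaded into rag_documents before re-splitting with the larger chunk size.

# Re-split with the 1000-character splitter defined above
dope_documents = text_splitter.split_documents(rag_documents)

dope_vectorstore = Qdrant.from_documents(
    documents=dope_documents,
    embedding=embeddings,  # text-embedding-3-large from step 3
    location=":memory:",
    collection_name="State of AI - improved",  # assumed name
)
dope_retriever = dope_vectorstore.as_retriever(search_kwargs={"k": 10})

dope_rag_chain = (
    {"context": itemgetter("question") | dope_retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)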

Impact of Changes

  1. Chunk Size Impact:

    • Larger chunks provide more context for the LLM, potentially improving the coherence and completeness of answers
    • However, larger chunks might also include irrelevant information, potentially reducing precision
    • The optimal chunk size depends on the specific use case and document structure
  2. Embedding Model Impact:

    • Higher-quality embeddings can better capture semantic relationships in the text
    • This can lead to more accurate retrieval, ensuring the most relevant information is found
    • Improved embeddings can reduce the "lost in embedding space" problem where semantically similar content is not properly matched
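
To make the chunk-size trade-off concrete, one can compare how many chunks each splitter produces from the same source material. This is a quick sketch, assuming the un-split source documents are available as docs (as in the knowledge-graph section).

small_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
large_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

print(len(small_splitter.split_documents(docs)))  # more, smaller chunks: higher precision, less context each
print(len(large_splitter.split_documents(docs)))  # fewer, larger chunks: more context each, more potential noise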

Re-evaluating the Improved Pipeline

evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

Conclusion

The combination of RAGAS for synthetic data generation and LangSmith for evaluation provides a powerful workflow for developing and refining RAG systems:

  1. Efficient Testing: Synthetic data generation reduces the time and effort required to create comprehensive test datasets
  2. Comprehensive Evaluation: The multiple evaluators in LangSmith provide a holistic view of system performance
  3. Iterative Improvement: The ability to quickly run evaluations enables rapid iteration and improvement
  4. Objective Measurement: Standardized metrics help quantify improvements across different versions of the RAG pipeline

This approach enables developers to build more robust, accurate, and engaging RAG systems with less manual effort and more systematic evaluation.

References

[1] Superlinked. "Evaluating Retrieval Augmented Generation using RAGAS." VectorHub. https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas

[2] Explodinggradients. "RAGAS: Supercharge Your LLM Application Evaluations." GitHub. https://github.com/explodinggradients/ragas

[3] Medium. "RAGAS for RAG in LLMs: A Comprehensive Guide to Evaluation Metrics." https://dkaarthick.medium.com/ragas-for-rag-in-llms-a-comprehensive-guide-to-evaluation-metrics-3aca142d6e38

[4] Ragas Documentation. "Generate Synthetic Testset for RAG." https://docs.ragas.io/en/latest/getstarted/rag_testset_generation/

[5] Advancing Analytics. "Streamline Your RAG Pipeline Evaluation with Synthetic Data." https://www.advancinganalytics.co.uk/blog/just-built-a-rag-pipeline-and-need-to-evaluate-its-performance

[6] LangChain. "LangSmith." https://www.langchain.com/langsmith

[7] LangChain. "Evaluating RAG pipelines with Ragas + LangSmith." https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/

[8] LangSmith Documentation. "Evaluation concepts." https://docs.smith.langchain.com/evaluation/concepts
