RAGAs: 如何测试评估你构建的LLM RAG系统faithfulness,answer_relevancy,context,ground_truth

本篇通过RAGAs开源库来评估你的RAG系统的LLM响应的准确性，主要通过四个方面faithfulness, answer_relevancy, context,ground_truth

Johnny.Cheung

595人浏览 · 2026-03-01 23:58:29

Johnny.Cheung · 2026-03-01 23:58:29 发布

测量 RAG 的黄金三要素：

Context Relevance（检索相关性）、
Faithfulness（忠实度，即是否胡说八道）
和 Answer Relevance（答案相关性）。

工具名称	核心优势	适用场景
Ragas	RAG 评估的事实标准。基于“无参考”评估，能自动生成测试数据集。	实验室阶段、科研评估、快速刷分
DeepEval	类似 Pytest 的单元测试体验，集成非常简单。支持超过 14 种评估指标。	追求工程化、CI/CD 自动化集成、生产级测试
TruLens	提出 RAG Triad（三元组）评估理论，可视化反馈做得非常好。	需要深度调优模型、对比不同提示词或检索策略效果
LangSmith	可视化UI，自动打分, LangChain开发完美继承，基础追踪好	闭源，付费，支持多个A/B

0x0. RAGAs (Retrieval Augmented Generation Assessment)

核心逻辑

一组“问题-答案-上下文”的数据集
然后由 Ragas 调用一个高性能模型（如GPT-4o），作为“裁判”来打分。

Ragas 要求的输入数据格式是一个包含以下字段的字典（或 Pandas DataFrame）：

question: 用户提出的问题。
answer: 你的 RAG 系统生成的回答。
contexts: 检索到的文档切片列表（List of strings）。
ground_truth:（可选）标准答案，用于计算召回率（Context Recall）。

0x1. 最简单的单文件RAG评分-示例

from datasets import Dataset
from dotenv import load_dotenv
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

load_dotenv() 
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

# 1. Define your Test Dataset
# In a real scenario, these come from your RAG pipeline's logs
data_samples = {
    'question': [
        'When was the Great Wall of China built?', 
        'What is the capital of France?'
    ],
    'answer': [
        'The majority of the existing wall is from the Ming Dynasty (1368–1644).', 
        'The capital is Paris.'
    ],
    'contexts': [
        ['The Great Wall was built across historical northern borders of China. Most of the current wall dates to the Ming Dynasty.'],
        ['Paris is the capital and most populous city of France.']
    ],
    'ground_truth': [
        'The Great Wall was built over many centuries, but the Ming Dynasty (1368–1644) built the most famous parts.',
        'Paris is the capital of France.'
    ]
}

# 2. Convert dictionary to a Dataset object
dataset = Dataset.from_dict(data_samples)

# 3. Run the Evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,        # Checks: Is the answer derived ONLY from the context?
        answer_relevancy,    # Checks: Does the answer actually address the question?
        context_precision,   # Checks: Is the useful information ranked high in the context?
        context_recall       # Checks: Does the context contain the info in the ground_truth?
    ]
)

# 4. Export and Print Results
df = dataset.to_pandas().join(results.to_pandas())
print("--- Ragas Evaluation Results ---")
print(df[['question', 'faithfulness', 'answer_relevancy']])
print("\n--- Summary Scores ---")
print(results)

输出结果是

--- Ragas Evaluation Results ---
                                    question  faithfulness  answer_relevancy
  0  When was the Great Wall of China built?           0.5               NaN
  1           What is the capital of France?           1.0               NaN

--- Summary Scores ---
{
 'faithfulness': 0.7500, 
 'answer_relevancy': nan, 
 'context_precision': 1.0000, 
 'context_recall': 1.0000
}

0x2 真实例子(19页PDF)的Testset和RAGAs评分

完整示例代码
https://github.com/VictorZhang2014/toutcas/tree/main/test_RAGAs

生成Testset as ground truth

import json
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper


load_dotenv() 

# 1. Load your local JSON file
with open('Guide_for_applicants_MSCA_Postdoctoral.pdf.embeddings.json', 'r', encoding='utf-8') as f:
    local_data = json.load(f)

# 2. Convert JSON chunks to LangChain Document objects
langchain_docs = []
for item in local_data:
    # item is your dict: {"id": 0, "chunk": "...", "embedding": [...]}
    doc = Document(
        page_content=item["chunk"],
        metadata={
            "id": item["id"], 
            "filename": "source_pdf_name.pdf"  # Required for Ragas grouping
        }
    )
    langchain_docs.append(doc)


# 3. Wrap Models
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# 4. Initialize Generator (Flat import)
generator = TestsetGenerator(
    llm=generator_llm, 
    embedding_model=generator_embeddings
)

# 5. Generate Testset
# Distributions are now handled by 'query_distribution' internally or passed as synthesizers
dataset = generator.generate_with_langchain_docs(
    documents=langchain_docs,
    testset_size=10
)

# 6. Result
df = dataset.to_pandas()
print(df.head())

# 2. Save locally
df.to_csv("Guide_for_applicants_MSCA_Postdoctoral.rag.testset.csv", index=False)
df.to_json("Guide_for_applicants_MSCA_Postdoctoral.rag.testset.json", orient="records", force_ascii=False, indent=4)

print("Test set saved successfully!")

验证->评分

from dotenv import load_dotenv
import pandas as pd
import ast 
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import requests

load_dotenv() 

# ==========================================
# 1. DATA LOADING & PREPARATION
# ==========================================
print("Loading and cleaning test set...")
test_df = pd.read_csv("Guide_for_applicants_MSCA_Postdoctoral.rag.testset.csv")

test_df = test_df.rename(columns={
    "user_input": "question", 
    "reference": "ground_truth"
})

# Fix the Pydantic/String-to-List error
def safe_eval(val):
    if isinstance(val, list): return val
    try: return ast.literal_eval(val)
    except: return []

test_df['reference_contexts'] = test_df['reference_contexts'].apply(safe_eval)

# ==========================================
# 2. THE EVALUATION LOOP
# ========================================== 
answers = []
retrieved_contexts = []

for i, row in test_df.iterrows():
    query = row['question']
    print(f"\nEvaluating Question {i+1}/{len(test_df)}: {query}")

    payload = {
        "embedding_filename": "Guide_for_applicants_MSCA_Postdoctoral.pdf.embeddings.json",
        "messages": [{"content": "You are helpful assistant","role": "system"}],
        "model": "openai/gpt-oss-120b:novita",
        "text": query
    }

    response = requests.post(
        "http://localhost:20250/pdf_analyzer/getchunks",
        json=payload
    )

    if response.status_code != 200:
        print("Error:", response.status_code, response.text) 
        answers.append("")
        retrieved_contexts.append([])
        continue

    llm_resp = response.json()

    chunks = llm_resp["context"]
    ans = llm_resp["llm_response"]

    answers.append(ans)
    retrieved_contexts.append(chunks)

# Append results to dataframe
test_df['answer'] = answers
test_df['contexts'] = retrieved_contexts

# ==========================================
# 3. RAGAS SCORING
#    Initialize the LLM with a much higher max_tokens limit
#    2048 or 4096 is usually safe for complex RAG evaluations
# ==========================================
eval_llm = ChatOpenAI(
    model="gpt-4o", 
    max_tokens=4096, 
    temperature=0
) 
eval_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
) 

print("Calculating Ragas metrics...")
eval_dataset = Dataset.from_pandas(
    test_df[['question','answer','contexts','ground_truth']],
    preserve_index=False
)

result = evaluate(
    eval_dataset,
    metrics=[
        faithfulness,       # Checks if Answer matches Contexts
        answer_relevancy,   # Checks if Answer matches Question
        context_recall      # Checks if Contexts matches Ground Truth
    ],
    llm=eval_llm,
    embeddings=eval_embeddings
)

# ==========================================
# FINAL OUTPUT
# ==========================================
print("\n--- Evaluation Summary ---")
print(result)

scores_df = result.to_pandas()
print(scores_df.describe())

low_faith = scores_df.nsmallest(5, "faithfulness")
print(low_faith)

Evaluation Summary

{'faithfulness': 0.5486, 'answer_relevancy': 0.8122, 'context_recall': 0.4556}

Pandas Describe

|       faithfulness | answer_relevancy | context_recall |
|--------------------|------------------|----------------|
|count  |    4.000000    |     12.000000    |   12.000000 |
|mean   |    0.548579    |      0.812217    |    0.455556 |
|std    |    0.216621    |      0.118089    |    0.401596 |
|min    |    0.241935    |      0.534869    |    0.000000 |
|25%    |    0.497984    |      0.759610    |    0.000000 |
|50%    |    0.601190    |      0.834552    |    0.450000 |
|75%    |    0.651786    |      0.857521    |    0.750000 |
|max    |    0.750000    |      0.986943    |    1.000000 |

注意

The "LLM-as-a-Judge" is a Paradox。LLM作为判官是一个悖论。也是很有挑战的。
关于数据集，RAGAs不能100%生成准确的Ground Truth；而Human expert也不能编写出100%正确的Ground Truth。

Ground Truth : Manual Ground Truth is the Gold Standard for accuracy. If a human expert writes the answer, you know it’s 100% correct. However, it is rarely the “best” way for a production system because a man takes much of time writing question&answers pairs, and if we update the Vector DB, the human force is huge.
- Use Ragas to generate 200 synthetic samples.
- Have a human expert audit and fix the top 50.
- Use those 50 Golden Sets for final validation and the other 150 for rough directional testing.

Feature	Ground Truth (Human-Experts)	Ragas Synthetic (LLMs-Gen)
Accuracy	“99-100% (The ““Gold”” Standard)”	80-90% (Depends on LLM)
Effort	High (Weeks)	Low (Minutes)
Diversity	Low (Humans repeat patterns)	High (Can evolve complex logic)
Cost	Expensive (Expert salaries)	Cheap (API tokens)