RAGAs: How to Test and Evaluate the LLM Responses of Your RAG System
This article uses the open-source RAGAs library to evaluate the accuracy of your RAG system's LLM responses, mainly along four dimensions: faithfulness, answer relevancy, context precision, and context recall.
Measuring the RAG triad:
Context Relevance (is the retrieved context relevant to the question?), Faithfulness (does the answer stay grounded in the context, i.e., no hallucination?), and Answer Relevance (does the answer actually address the question?).
| Tool | Core Strength | Best For |
|---|---|---|
| Ragas | The de facto standard for RAG evaluation. "Reference-free" metrics; can auto-generate test datasets. | Lab-stage work, research evaluation, quick benchmarking |
| DeepEval | Pytest-like unit-testing experience, very easy to integrate. Supports 14+ evaluation metrics. | Engineering-focused teams, CI/CD automation, production-grade testing |
| TruLens | Proposed the RAG Triad evaluation framework; excellent feedback visualization. | Deep model tuning, comparing prompts or retrieval strategies |
| LangSmith | Visual UI, automatic scoring, seamless LangChain integration, solid tracing. | Closed-source and paid; supports multiple A/B comparisons |
0x0. RAGAs (Retrieval Augmented Generation Assessment)
Core logic:
- Prepare a dataset of "question - answer - context" samples.
- Ragas then calls a strong model (e.g., GPT-4o) as a "judge" to score each sample.
Ragas expects its input as a dictionary (or a Pandas DataFrame) with the following fields:
- question: the question asked by the user.
- answer: the answer generated by your RAG system.
- contexts: the retrieved document chunks (a list of strings).
- ground_truth: (optional) the reference answer, needed to compute Context Recall.
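Before handing data to Ragas, it can save a debugging round to sanity-check that each record matches this schema. The `validate_sample` helper below is purely illustrative (it is not part of Ragas), showing the expected field shapes in plain Python:

```python
# Illustrative helper (not part of Ragas): checks that one evaluation
# record has the field shapes Ragas expects.
def validate_sample(sample: dict) -> list:
    errors = []
    if not isinstance(sample.get("question"), str):
        errors.append("question must be a string")
    if not isinstance(sample.get("answer"), str):
        errors.append("answer must be a string")
    ctx = sample.get("contexts")
    if not (isinstance(ctx, list) and all(isinstance(c, str) for c in ctx)):
        errors.append("contexts must be a list of strings")
    # ground_truth is optional, but must be a string when present
    if "ground_truth" in sample and not isinstance(sample["ground_truth"], str):
        errors.append("ground_truth must be a string")
    return errors

sample = {
    "question": "What is the capital of France?",
    "answer": "The capital is Paris.",
    "contexts": ["Paris is the capital and most populous city of France."],
    "ground_truth": "Paris is the capital of France.",
}
print(validate_sample(sample))  # → []
```

A common mistake this catches is passing `contexts` as one string instead of a list of strings, which silently degrades the context metrics.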
0x1. A Minimal Single-File RAG Evaluation Example
from datasets import Dataset
from dotenv import load_dotenv
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
load_dotenv()  # loads OPENAI_API_KEY from .env into the environment
# 1. Define your Test Dataset
# In a real scenario, these come from your RAG pipeline's logs
data_samples = {
'question': [
'When was the Great Wall of China built?',
'What is the capital of France?'
],
'answer': [
'The majority of the existing wall is from the Ming Dynasty (1368–1644).',
'The capital is Paris.'
],
'contexts': [
['The Great Wall was built across historical northern borders of China. Most of the current wall dates to the Ming Dynasty.'],
['Paris is the capital and most populous city of France.']
],
'ground_truth': [
'The Great Wall was built over many centuries, but the Ming Dynasty (1368–1644) built the most famous parts.',
'Paris is the capital of France.'
]
}
# 2. Convert dictionary to a Dataset object
dataset = Dataset.from_dict(data_samples)
# 3. Run the Evaluation
results = evaluate(
dataset,
metrics=[
faithfulness, # Checks: Is the answer derived ONLY from the context?
answer_relevancy, # Checks: Does the answer actually address the question?
context_precision, # Checks: Is the useful information ranked high in the context?
context_recall # Checks: Does the context contain the info in the ground_truth?
]
)
# 4. Export and Print Results
# results.to_pandas() already includes the input columns plus the metric
# scores, so no join with the original dataset is needed
df = results.to_pandas()
print("--- Ragas Evaluation Results ---")
print(df[['question', 'faithfulness', 'answer_relevancy']])
print("\n--- Summary Scores ---")
print(results)
The output:
--- Ragas Evaluation Results ---
question faithfulness answer_relevancy
0 When was the Great Wall of China built? 0.5 NaN
1 What is the capital of France? 1.0 NaN
--- Summary Scores ---
{
'faithfulness': 0.7500,
'answer_relevancy': nan,
'context_precision': 1.0000,
'context_recall': 1.0000
}
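Faithfulness is defined in Ragas as the fraction of claims in the answer that are supported by the retrieved context, so the 0.5 above corresponds to one of two extracted claims being verified. (The NaN for answer_relevancy typically indicates the metric failed to compute, e.g., the embedding call or judge output parsing did not succeed.) A minimal sketch of the faithfulness arithmetic, with hard-coded verdicts standing in for the judge LLM's claim verification:

```python
# Sketch of the faithfulness formula: supported claims / total claims.
# In Ragas, the judge LLM extracts claims from the answer and verifies
# each one against the retrieved context; the verdict lists below are
# illustrative stand-ins for that step.
def faithfulness_score(claim_verdicts: list) -> float:
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Two claims extracted from the Great Wall answer, one supported:
print(faithfulness_score([True, False]))  # → 0.5
# One claim extracted from the Paris answer, supported:
print(faithfulness_score([True]))         # → 1.0
```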
0x2. A Real Example (a 19-page PDF): Test Set Generation and Ragas Scoring
Full example code:
https://github.com/VictorZhang2014/toutcas/tree/main/test_RAGAs
Generating the test set as ground truth:
import json
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
load_dotenv()
# 1. Load your local JSON file
with open('Guide_for_applicants_MSCA_Postdoctoral.pdf.embeddings.json', 'r', encoding='utf-8') as f:
local_data = json.load(f)
# 2. Convert JSON chunks to LangChain Document objects
langchain_docs = []
for item in local_data:
# item is your dict: {"id": 0, "chunk": "...", "embedding": [...]}
doc = Document(
page_content=item["chunk"],
metadata={
"id": item["id"],
"filename": "source_pdf_name.pdf" # Required for Ragas grouping
}
)
langchain_docs.append(doc)
# 3. Wrap Models
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# 4. Initialize Generator (Flat import)
generator = TestsetGenerator(
llm=generator_llm,
embedding_model=generator_embeddings
)
# 5. Generate Testset
# Distributions are now handled by 'query_distribution' internally or passed as synthesizers
dataset = generator.generate_with_langchain_docs(
documents=langchain_docs,
testset_size=10
)
# 6. Result
df = dataset.to_pandas()
print(df.head())
# 2. Save locally
df.to_csv("Guide_for_applicants_MSCA_Postdoctoral.rag.testset.csv", index=False)
df.to_json("Guide_for_applicants_MSCA_Postdoctoral.rag.testset.json", orient="records", force_ascii=False, indent=4)
print("Test set saved successfully!")
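In recent Ragas versions (0.2+), the generated testset typically carries columns such as `user_input`, `reference_contexts`, and `reference`, which is why the evaluation script that follows renames them into the classic `question`/`ground_truth` schema. A sketch of that mapping (the left-hand column names are assumptions based on Ragas 0.2 output, so verify against your own CSV):

```python
# Mapping from assumed Ragas 0.2 testset column names to the classic
# evaluation schema. "reference_contexts" keeps its name: it holds the
# chunks each question was generated from.
COLUMN_MAP = {
    "user_input": "question",     # generated question
    "reference": "ground_truth",  # generated reference answer
}

row = {
    "user_input": "What is the eligibility window?",
    "reference": "Applicants must ...",
    "reference_contexts": ["chunk text ..."],
}
renamed = {COLUMN_MAP.get(k, k): v for k, v in row.items()}
print(sorted(renamed))  # → ['ground_truth', 'question', 'reference_contexts']
```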
Validation and scoring:
from dotenv import load_dotenv
import pandas as pd
import ast
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import requests
load_dotenv()
# ==========================================
# 1. DATA LOADING & PREPARATION
# ==========================================
print("Loading and cleaning test set...")
test_df = pd.read_csv("Guide_for_applicants_MSCA_Postdoctoral.rag.testset.csv")
test_df = test_df.rename(columns={
"user_input": "question",
"reference": "ground_truth"
})
# Fix the string-to-list issue: pandas reads the list column back as a string
def safe_eval(val):
    if isinstance(val, list):
        return val
    try:
        return ast.literal_eval(val)
    except (ValueError, SyntaxError):
        return []
test_df['reference_contexts'] = test_df['reference_contexts'].apply(safe_eval)
# ==========================================
# 2. THE EVALUATION LOOP
# ==========================================
answers = []
retrieved_contexts = []
for i, row in test_df.iterrows():
query = row['question']
print(f"\nEvaluating Question {i+1}/{len(test_df)}: {query}")
payload = {
"embedding_filename": "Guide_for_applicants_MSCA_Postdoctoral.pdf.embeddings.json",
"messages": [{"content": "You are a helpful assistant", "role": "system"}],
"model": "openai/gpt-oss-120b:novita",
"text": query
}
response = requests.post(
"http://localhost:20250/pdf_analyzer/getchunks",
json=payload
)
if response.status_code != 200:
print("Error:", response.status_code, response.text)
answers.append("")
retrieved_contexts.append([])
continue
llm_resp = response.json()
chunks = llm_resp["context"]
ans = llm_resp["llm_response"]
answers.append(ans)
retrieved_contexts.append(chunks)
# Append results to dataframe
test_df['answer'] = answers
test_df['contexts'] = retrieved_contexts
# ==========================================
# 3. RAGAS SCORING
# Initialize the LLM with a much higher max_tokens limit
# 2048 or 4096 is usually safe for complex RAG evaluations
# ==========================================
eval_llm = ChatOpenAI(
model="gpt-4o",
max_tokens=4096,
temperature=0
)
eval_embeddings = OpenAIEmbeddings(
model="text-embedding-3-small"
)
print("Calculating Ragas metrics...")
eval_dataset = Dataset.from_pandas(
test_df[['question','answer','contexts','ground_truth']],
preserve_index=False
)
result = evaluate(
eval_dataset,
metrics=[
faithfulness, # Checks if Answer matches Contexts
answer_relevancy, # Checks if Answer matches Question
context_recall # Checks if Contexts matches Ground Truth
],
llm=eval_llm,
embeddings=eval_embeddings
)
# ==========================================
# FINAL OUTPUT
# ==========================================
print("\n--- Evaluation Summary ---")
print(result)
scores_df = result.to_pandas()
print(scores_df.describe())
low_faith = scores_df.nsmallest(5, "faithfulness")
print(low_faith)
Evaluation Summary
{'faithfulness': 0.5486, 'answer_relevancy': 0.8122, 'context_recall': 0.4556}
Pandas `describe()` output:
| | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|
| count | 4.000000 | 12.000000 | 12.000000 |
| mean | 0.548579 | 0.812217 | 0.455556 |
| std | 0.216621 | 0.118089 | 0.401596 |
| min | 0.241935 | 0.534869 | 0.000000 |
| 25% | 0.497984 | 0.759610 | 0.000000 |
| 50% | 0.601190 | 0.834552 | 0.450000 |
| 75% | 0.651786 | 0.857521 | 0.750000 |
| max | 0.750000 | 0.986943 | 1.000000 |
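Note the `count` row: pandas `describe()` ignores NaN, so a faithfulness count of 4 out of 12 means the judge failed to return a faithfulness verdict for 8 samples (commonly a token-limit or output-parsing failure), and the summary score averages only the 4 that succeeded. A pure-Python sketch of that NaN-aware mean (the four concrete values are reconstructed to be consistent with the describe() table, not taken from the actual run):

```python
import math

# NaN-aware mean, mirroring what pandas describe() reports:
def nan_mean(values):
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")

# 8 of the 12 faithfulness scores were NaN; only 4 values contribute.
# (Values chosen to match the min/median/max in the table above.)
scores = [0.241935, 0.601190, 0.601191, 0.750000] + [float("nan")] * 8
print(round(nan_mean(scores), 4))  # → 0.5486
```

When a metric's count is far below the sample size, treat its summary score with suspicion before comparing pipelines.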
Notes
"LLM-as-a-Judge" is a paradox, and a genuinely hard problem: we are using an LLM to grade another LLM's output.
On datasets: RAGAs cannot generate 100% accurate ground truth, but human experts cannot write 100% correct ground truth at scale either.
- Ground Truth: manually written ground truth is the gold standard for accuracy. If a human expert writes the answer, you know it is correct. However, it is rarely the best approach for a production system: writing question-answer pairs by hand takes weeks, and every update to the Vector DB multiplies the manual effort. A pragmatic workflow:
  - Use Ragas to generate 200 synthetic samples.
  - Have a human expert audit and fix the top 50.
  - Use those 50 as the "golden set" for final validation, and the other 150 for rough directional testing.
| Feature | Ground Truth (Human-Experts) | Ragas Synthetic (LLMs-Gen) |
|---|---|---|
| Accuracy | 99-100% (the "Gold" Standard) | 80-90% (depends on the LLM) |
| Effort | High (Weeks) | Low (Minutes) |
| Diversity | Low (Humans repeat patterns) | High (Can evolve complex logic) |
| Cost | Expensive (Expert salaries) | Cheap (API tokens) |