使用 GRPO 算法训练多智能体系统：实现可靠的长期任务规划与执行

首先使用抽象基类（ABC）来定义所有 LLM 引擎包装器必须遵循的标准接口。这确保了一致性和可互换性。任何继承自EngineLM的类都必须实现generatepass在这里定义ChatVLLM类，它是EngineLM接口的具体实现。2\n```")print(f"\n'─'50'─'50")try:")raise eprint(f"\n'─'50'─'50")try:str100]try:在多智能

致Great

403人浏览 · 2026-02-15 13:24:25

致Great · 2026-02-15 13:24:25 发布

项目地址：https://github.com/FareedKhan-dev/multi-agent-training-grpo

针对长期任务的智能体系统需要具备规划、正确使用工具以及逐步执行的能力。目前大多数现代智能体系统都依赖推理过程，模型每次都从头开始处理所有组件，缺乏先前的训练经验。这种方式增加了在长期任务中出现错误规划或工具调用失误的风险。GRPO 算法作为一种现代强化学习方法，能够持续训练智能体进行规划并正确执行扩展任务。一个典型的基于 GRPO 的智能体训练系统架构如下：

GRPO 对智能体训练的影响：

基于组的评估： GRPO 针对同一查询评估多条轨迹，让智能体能够比较不同策略，而不仅仅依赖单步奖励。
相对优势学习： 相对于组平均水平来强化成功的轨迹，提升正确规划和执行的概率。
错误抑制： 低质量的轨迹会收到负向优势信号，减少幻觉和错误的工具使用。
迭代优化： 智能体通过重复试验持续改进，学会更可靠地规划长期任务。
子智能体协调： 通过在组环境中训练，GRPO 帮助多个子智能体协调行动，提升整体系统性能。

在这篇文章中，笔者将详细介绍 GRPO 算法与 AI 智能体的关系，并实际构建一个多智能体系统，使用 GRPO 进行训练优化。

GRPO 算法在智能体系统中的作用

几乎所有强化学习算法都基于奖励机制。智能体在环境中采取行动，根据行动获得奖励或惩罚。智能体的目标是最大化随时间累积的奖励。

然而在多智能体系统中，多个智能体相互交互并与环境互动，传统的强化学习算法在有效协调和优化行动方面可能遇到困难。

GRPO（通用强化策略优化）专门设计用于解决多智能体场景中传统强化学习算法面临的挑战。在基于智能体的 GRPO 中，系统不会孤立地评估每个单独的动作，而是让智能体多次尝试同一问题（形成一个"组"），比较结果，并强化那些优于平均水平的策略。

通过一个实际例子来理解这个过程，这也是文章后续要实现的内容：

**向系统提出一个复杂查询，**例如"计算 12 的平方，然后使用维基百科查找那一年发生了什么重大历史事件"。
规划器策略为这个单一查询生成一组不同的轨迹（例如 4 次尝试）。由于模型使用非零温度参数，它会探索不同的策略：一条轨迹可能正确使用 Python 计算出 144，另一条可能猜错数字，第三条可能在没有搜索的情况下幻想出一个历史事件。
**外部评判系统根据真实答案评估每条轨迹的最终答案。**成功计算出 144 并找到正确事件的轨迹获得 1.0 的奖励，而猜错、未能执行工具或幻想答案的轨迹则获得 0.0 的奖励。
**算法通过将每个得分与组平均值比较来计算相对优势。**如果组平均值是 0.25，成功的轨迹（1.0）获得高正向优势（+0.75），而失败的轨迹（0.0）获得负向优势（-0.25）。
**策略模型根据这些优势更新权重。**由于成功轨迹的表现超过组平均水平，系统显著增加其规划步骤的概率，有效地"强化"正确的逻辑，同时抑制失败的策略。

文章将实现这个完整的 GRPO 算法到多智能体系统中，以改进规划阶段，减少幻觉和偏离轨道的结果。

智能体数据预处理

多智能体系统通常依赖不同用途的子智能体来执行不同任务。例如可能有网络搜索智能体、规划智能体、任务执行智能体等。

在推理阶段，这些智能体的性能高度依赖于规划阶段。如果每次迭代后规划出现问题，智能体可能会偏离轨道，产生不相关的结果并出现幻觉问题。

这篇文章将模拟一个真实的多智能体系统，其中包含各种子智能体，其中一些智能体具有相同用途但采用不同方法。通过这种方式，可以实际测试 GRPO 如何减少幻觉和偏离轨道的结果。

文章使用两个 Hugging Face 数据集来准备训练数据：

DeepMath-103K：包含超过 100,000 个数学问题及其解决方案。这有助于教智能体在代表用户执行任务时进行规划，帮助在规划阶段进行结构化的逐步推理。
Natural Questions (NQ)：该数据集包含真实用户问题，有助于智能体改进在整个过程中的规划工作方式。

首先导入必要的库并创建训练和验证数据的输出目录：

# Standard library imports for interacting with the operating system and handling JSON data.
import os
import json

# Core data science libraries for data manipulation and numerical computation.
import pandas as pd
import numpy as np

# Hugging Face library for dataset loading and processing.
from datasets import load_dataset, concatenate_datasets, Dataset

# Utility for displaying progress bars, making long-running operations more informative.
from tqdm import tqdm

然后创建训练和验证数据的输出目录：

# Define the path for the training data output directory.
train_output_dir = "./data/train"

# Define the path for the validation data output directory.
val_output_dir = "./data/val"

训练目录将保存来自 DeepMath 和 NQ 的合并训练数据集，而验证目录将包含 AIME 2024 验证集。

需要将两个数据集合并为单个训练集。这让智能体系统能够从各种问题中学习，提高其规划和有效执行任务的能力。

首先处理 DeepMath-103K 数据集，加载并探索其结构：

print("\n=== Loading DeepMath-103K ===")

# Use the `load_dataset` function from the `datasets` library.
# We specify the dataset name on the Hugging Face Hub: "zwhe99/DeepMath-103K".
# We also specify that we only want the "train" split of this dataset.
math_dataset = load_dataset(
    "zwhe99/DeepMath-103K",
    split="train"
)

在处理之前，先检查数据集。查看其列、样本总数，并查看一个示例记录以了解其结构：

# The `.column_names` attribute gives us a list of all columns in the dataset.
print("Columns:", math_dataset.column_names)

# The `len()` function tells us the total number of records (rows) in the dataset.
print("Total samples:", len(math_dataset))

#### Output:
# Columns: ['question', 'final_answer', 'difficulty', 'topic', 'r1_solution_1', 'r1_solution_2', 'r1_solution_3']
# Total samples: 103022

可以看到每个问题有三种不同的解决方案，以及最终答案，这将作为训练的真实标签。

查看一个样本记录以了解数据格式和内容：

# Accessing an item by index, like a list, gives us a single record.
sample = math_dataset[0]

# The solution fields ('r1_solution_*') can be very long.
# For a clean printout, we'll truncate them.
truncated_sample = sample.copy()

for key in ['r1_solution_1', 'r1_solution_2', 'r1_solution_3']:
    truncated_sample[key] = sample[key][:400]

# Use `json.dumps` with indentation for a pretty, readable print of the sample record.
print(json.dumps(truncated_sample, indent=2))

输出如下：

{
  "question": "Evaluate the limit: \\[ \\lim_{x \\to \\infty} \\sqrt{x} \\left( \\sqrt[3]{x+1} - \\sqrt[3]{x-1} \\right) \\]",
  "final_answer": "0",
  "difficulty": 4.5,
  "topic": "Mathematics -> Precalculus -> Limits",
  "r1_solution_1": "Okay, so I have this limit to evaluate the limit as x approaches...",
  "r1_solution_2": "Okay, so I need to evaluate the limit as x approaches infinity...",
  "r1_solution_3": "Okay, so I need to evaluate the limit as x approaches infinity..."
}

r1_solution_1、r1_solution_2 和 r1_solution_3 是三种不同的解决方案，但训练时不会使用它们，只使用 question 和 final_answer 字段，因为智能体将执行代码并尝试达到最终答案。

现在遍历每条记录并将其转换为所需的标准格式。这种格式是通用的，便于稍后与其他数据集合并。

目标模式如下：

id：每个样本的唯一标识符
question：问题或查询文本
chain：思维链或推理步骤的占位符（暂时留空）
result：最终答案
source：指示原始数据集的字符串
extra_info：包含原始记录中其他有用元数据的字典

print("\n=== Processing MathHard ===")

# Initialize an empty list to store our processed records.
math_rows = []

# We iterate through the dataset using tqdm to get a nice progress bar.
# `enumerate` gives us both the index (`idx`) and the item for each record.
for idx, item in enumerate(tqdm(math_dataset, desc="Processing MathHard")):

    # Some datasets might use different keys for the same concept (e.g., 'question' vs 'Problem').
    # This logic handles such inconsistencies gracefully.
    if "question" in item:
        question = item["question"]
    elif "Problem" in item:
        question = item["Problem"]
    else:
        # If neither key is found, raise an error to stop execution, as this is unexpected.
        raise KeyError("Missing question field")

    # Similarly, handle potential inconsistencies for the answer field.
    if "final_answer" in item:
        answer = item["final_answer"]
    elif "Answer" in item:
        answer = item["Answer"]
    else:
        raise KeyError("Missing answer field")

    # Append a new dictionary to our list, structured according to our standard format.
    math_rows.append({
        "id": idx,  # Use the loop index as a temporary ID.
        "question": question,
        "chain": "",  # Placeholder for reasoning steps.
        "result": str(answer), # Ensure the answer is always a string.
        "source": "mathhard", # Tag the data source.
        "extra_info": { # Store original metadata.
            "ground_truth": str(answer),
            "idx": idx
        }
    })

### OUTPUT
# Processing MathHard: 100%|██████████| 103022/103022 [00:03<00:00, 33261.05it/s]

这将处理 DeepMath 数据集中的全部 103,022 条记录。验证处理是否正确，检查处理后的样本数量并打印新格式的一个样本：

# Verify that the number of processed rows matches the original dataset size.
print("Processed math samples:", len(math_rows))
print("\nProcessed sample:")

# Print the first processed sample to confirm it matches our target format.
print(json.dumps(math_rows[0], indent=2))

#### Output:
# Processed math samples: 103022
# {
#   "id": 0,
#   "question": "Evaluate the limit: \\[ \\lim_{x \\to \\infty} \\sqrt{x} \\left( \\sqrt[3]{x+1} - \\sqrt[3]{x-1} \\right) \\]",
#   "chain": "",
#   "result": "0",
#   "source": "mathhard",
#   "extra_info": {
#     "ground_truth": "0",
#     "idx": 0
#   }
# }

很好，处理后的数据目前是一个 Python 字典列表。为了获得更好的性能和与 Hugging Face 生态系统（如 Trainer API）的兼容性，将其转换为 datasets.Dataset 对象：

# First, convert the list of dictionaries into a pandas DataFrame.
# Then, use `Dataset.from_pandas` to create the Hugging Face Dataset object.
# `preserve_index=False` tells the function not to add the DataFrame's index as a new column.
ds_math = Dataset.from_pandas(
    pd.DataFrame(math_rows),
    preserve_index=False
)

接下来对 Natural Questions 数据集重复相同的过程。该数据集包含真实用户向 Google 搜索提出的问题及其在维基百科上找到的相应答案。

#### Loading the NQ Dataset
print("\n=== Loading FlashRAG NQ ===")

# `load_dataset` can take multiple arguments.
# The first is the dataset group, "RUC-NLPIR/FlashRAG_datasets".
# The second is the specific dataset name within that group, "nq".
nq_dataset = load_dataset(
    "RUC-NLPIR/FlashRAG_datasets",
    "nq",
    split="train"
)

加载成功后，检查其中一条记录以了解其结构和内容：

# Look at the first sample to understand the data format.
print("\nRaw NQ sample:")
print(json.dumps(nq_dataset[0], indent=2))

### Output:
# Raw NQ sample:
# {
#   "id": "train_0",
#   "question": "total number of death row inmates in the us",
#   "golden_answers": [
#     "2,718"
#   ]
# }

可以看到 question 字段包含用户查询，golden_answers 字段包含答案列表。

NQ 的处理稍微复杂一些，需要进行一些清理：

格式化问题：确保每个问题都以问号结尾以保持一致性
处理答案类型：golden_answers 字段可以包含各种格式的数据（列表、numpy 数组、字符串等）。代码需要处理所有这些情况，提取答案并将其转换为单个字符串
连接多个答案：有些问题可能有多个有效答案。将它们连接成一个字符串，用分号分隔

实现如下：

print("\n=== Processing NQ ===")

# Initialize an empty list to store processed NQ records.
nq_rows = []

# Iterate through the NQ dataset with a progress bar.
for idx, item in enumerate(tqdm(nq_dataset, desc="Processing NQ")):

    # Get the question, remove leading/trailing whitespace.
    question = item.get("question", "").strip()

    # Ensure the question ends with a '?' for consistency.
    if question and not question.endswith("?"):
        question += "?"

    # Get the answers, defaulting to an empty list if not present.
    golden_answers = item.get("golden_answers", [])
    cleaned_answers = [] # This list will hold valid, string-formatted answers.

    # The following block robustly handles various data types for the answers.
    if isinstance(golden_answers, np.ndarray):
        for x in golden_answers.flatten(): # Flatten in case of multi-dimensional array.
            if x is not None and pd.notna(x):
                cleaned_answers.append(str(x))

    elif isinstance(golden_answers, (list, tuple)):
        for x in golden_answers:
            if x is not None and pd.notna(x):
                cleaned_answers.append(str(x))

    elif isinstance(golden_answers, str):
        if golden_answers.strip():
            cleaned_answers.append(golden_answers.strip())

    elif isinstance(golden_answers, (int, float, np.generic)):
        if not pd.isna(golden_answers):
            cleaned_answers.append(str(golden_answers))

    else: # Catch-all for any other types.
        s = str(golden_answers).strip()
        if s and s != "nan": # Avoid adding 'nan' as an answer.
            cleaned_answers.append(s)

    # Join all cleaned answers into a single string, separated by "; ".
    final_result = "; ".join(cleaned_answers)

    # Append the record in our standard format.
    nq_rows.append({
        "id": idx,  # Temporary ID.
        "question": question,
        "chain": "",
        "result": final_result,
        "source": "nq", # Tag the source as Natural Questions.
        "extra_info": {
            "ground_truth": final_result,
            "idx": idx
        }
    })

现在验证处理是否正确，检查处理后的样本数量并打印新格式的一个样本：

# Verify the number of processed samples and check the first record.
print("\nProcessed NQ sample:")
print(json.dumps(nq_rows[0], indent=2))

### Output:
# {
#   "id": 0,
#   "question": "total number of death row inmates in the us?",
#   "chain": "",
#   "result": "2,718",
#   "source": "nq",
#   "extra_info": {
#     "ground_truth": "2,718",
#     "idx": 0
#   }
# }

与数学数据集类似，将处理后的 NQ 数据转换为 Hugging Face Dataset 对象，以获得更好的性能并与训练流程兼容：

# Convert the processed NQ data into a Hugging Face Dataset object.
ds_nq = Dataset.from_pandas(
    pd.DataFrame(nq_rows),
    preserve_index=False
)

两个数据集都经过处理和标准化后，最后一步是将它们合并为单个训练集。然后打乱这个合并的数据集并为每条记录分配新的唯一 ID：

#### Concatenating Datasets
# `concatenate_datasets` takes a list of Dataset objects and merges them row-wise.

combined = concatenate_datasets([ds_nq, ds_math])
print("Combined size:", len(combined))

### Output:
#   Combined size: 182190

还需要做两件事：

打乱。如果不打乱，模型将首先看到所有 79,168 个 NQ 样本，然后是所有 103,022 个数学样本。这可能会使学习过程产生偏差。
重新索引是必要的，因为在合并和打乱之后，原始 id 不再是唯一的或连续的。应用映射函数来分配一个新的、干净的、从 0 到 N-1 的连续 ID。

# The `.shuffle()` method randomizes the order of the rows in the dataset.
# Providing a `seed` ensures that the shuffle is reproducible. Anyone running this code
# with the same seed will get the exact same shuffled order.
combined = combined.shuffle(seed=42)

# The `.map()` method applies a function to each element of the dataset.
# Here, we use a lambda function that ignores the sample (`_`) and uses the index (`idx`).
# `with_indices=True` provides the index of each row to our function.
# This effectively replaces the old 'id' column with a new one from 0 to len-1.
combined = combined.map(
    lambda _, idx: {"id": idx},
    with_indices=True
)

最后，将完成的训练数据集保存到文件。使用 Parquet 格式，这是一种高效的、面向列的数据格式，非常适合大型数据集。它得到广泛支持，通常比 CSV 或 JSON 等格式读取速度更快：

# Construct the full output file path using the directory we defined earlier.
output_path = os.path.join(train_output_dir, "combined_train.parquet")

# Use the `.to_parquet()` method to save the dataset.
combined.to_parquet(output_path)

验证每个最终数据集的总记录数：

# The length of our in-memory Dataset objects gives the total number of samples.
train_count = len(combined)
print(f"\nTotal train samples: {train_count}")

### Output:
# Total train samples: 182190

已成功处理并将 DeepMath-103K 和 Natural Questions 数据集合并为一个包含 182,190 个样本的训练集。这个数据集现在可以用于训练多智能体系统了。

构建多智能体架构

智能体工作流或多智能体系统是一个框架，其中问题由一系列协作的专门组件（或"智能体"）解决。这种方法不是依赖单个整体式 LLM 调用来解决复杂查询，而是将问题分解为可管理的阶段：

规划： 智能体分析初始查询并制定高层计划或决定下一步最佳行动。
工具使用： 智能体选择并使用专用工具（如代码解释器、网络搜索或数据库查询工具）来收集信息或执行操作。
执行： 专门的组件生成精确的命令来运行所选工具。
观察与反思： 智能体观察工具执行的结果，并反思目标是否已经实现或是否需要更多步骤。
迭代： 这个过程在循环中重复，直到验证器智能体确定查询已完全回答。
综合： 最后，智能体将整个过程中收集的所有信息综合成一个全面的最终答案。

目标是改进监控整个工作流并在每次迭代后变化的规划阶段。这样可以减少幻觉和偏离轨道的结果。为了微调规划阶段，需要访问可以在整个训练阶段学习的模型权重。因此将使用基于 vLLM 的开源服务器，它在吞吐量和延迟方面非常高效。

这个项目使用 1xA100 80GB GPU，规划阶段使用 Qwen/Qwen2.5-7B-Instruct 模型，但对于其他任务显然可以使用不同的 AI 模型。

首先使用 pip install vllm 安装 vllm，然后启动服务器：

# Start the vLLM server with the same model name as MODEL_NAME below
# This can be a fine-tuned model or a base model.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --api-key a-secret-key \        # API key for OpenAI-compatible auth
    --port 8000 \                   # Port for the local server
    --max-model-len 8192            # Maximum context length

首先设置环境。这涉及安装所需的 Python 库、导入必要的模块以及配置与 vLLM 服务器和其他外部服务的连接：

!pip install -q openai pydantic tenacity beautifulsoup4 requests wikipedia google-genai numpy json_repair

在这里导入所有必要的模块并定义核心配置变量。这包括 vLLM 服务器 URL、模型名称以及工具所需的 API 密钥的占位符：

# Standard library imports
import os
import json
import re
import sys
import inspect
import threading
from io import StringIO
from typing import Any, Dict, List, Union, Optional, Tuple
from abc import ABC, abstractmethod
from contextlib import contextmanager

# Pydantic and API Libraries
from pydantic import BaseModel
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential
import requests
from bs4 import BeautifulSoup
import wikipedia
from google import genai
from google.genai import types
import numpy as np
import json_repair # For fixing malformed JSON from the LLM

现在需要设置连接到 vLLM 服务器的核心配置，并定义将用于规划阶段的模型：

# --- Core Configuration ---
# The base URL where your vLLM server is running.
VLLM_BASE_URL = "http://localhost:8000"

# The API key for your vLLM server (can be a dummy key if not required by your setup).
VLLM_API_KEY = "a-secret-key"

# The exact name of the model being served by vLLM or your fine-tuned model. This should match the model name configured in your vLLM server.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"

# --- Environment Variables for Tools ---
# IMPORTANT: You must provide your own API keys for the search tools to function.
# If you leave these as placeholders, the corresponding tools will operate in a 'mock' mode.
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY_HERE' # Needed for embeddings in Web_Search_Tool
os.environ['GOOGLE_API_KEY'] = 'YOUR_GOOGLE_API_KEY_HERE' # Needed for Google_Search_Tool

为了与 LLM 交互，将创建一个包装类。这是一个重要的设计模式，提供了几个好处：

抽象： 它隐藏了 API 调用的具体细节，让我们可以轻松更换后端（例如从 vLLM 切换到另一个提供商），而无需更改其余代码。
鲁棒性： 可以构建诸如失败 API 调用自动重试等功能。
功能增强： 可以添加自定义逻辑，例如强制 LLM 生成结构化的 JSON 输出。

定义智能体思维模式

首先使用抽象基类（ABC）来定义所有 LLM 引擎包装器必须遵循的标准接口。

这确保了一致性和可互换性。任何继承自 EngineLM 的类都必须实现 generate 方法：

class EngineLM(ABC):
    """An abstract base class for a language model engine."""

    def __call__(self, *args, **kwargs):
        """Allows the class instance to be called like a function, making the syntax cleaner."""
        return self.generate(*args, **kwargs)

    @abstractmethod
    def generate(self, prompt, system_prompt=None, **kwargs):
        """The core method that must be implemented by any subclass. It takes a prompt and generates a response."""
        pass

在这里定义 ChatVLLM 类，它是 EngineLM 接口的具体实现。这个类负责格式化提示、向 vLLM 服务器发出 API 调用以及解析响应：

class ChatVLLM(EngineLM):
    """A language model engine that connects to a vLLM server with an OpenAI-compatible API."""

    def __init__(self, model_string, base_url, api_key, temperature=0.0):
        """Initializes the engine with connection details and the OpenAI client."""
        self.model_string = model_string
        self.base_url = base_url
        self.api_key = api_key
        self.temperature = temperature

        # The OpenAI client is configured to point to our local vLLM server.
        self.client = OpenAI(base_url=self.base_url, api_key=self.api_key)
        self.default_system_prompt = "You are a helpful, creative, and smart assistant."

    @retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(3))
    def generate(self, content: Union[str, List[Union[str, bytes]]], system_prompt=None, response_format=None, **kwargs):
        """Generates a response from the LLM, with robust JSON parsing and retry logic."""

        # Use the provided system prompt or fall back to the default.
        sys_prompt_arg = system_prompt if system_prompt else self.default_system_prompt
        user_content = content

        # Format the request in the standard chat completions message format.
        messages = [
            {"role": "system", "content": sys_prompt_arg},
            {"role": "user", "content": user_content}
        ]

        # Prepare the parameters for the API request.
        request_params = {
            "model": self.model_string,
            "messages": messages,
            "temperature": self.temperature,
            "max_tokens": kwargs.get("max_tokens", 4096),
        }

        # This block is key for structured output. If a Pydantic model is provided,
        # we serialize its JSON schema and append it to the prompt, instructing the LLM
        # to format its response accordingly. This is a form of "prompt engineering".
        if response_format and issubclass(response_format, BaseModel):
            json_schema = response_format.model_json_schema()
            schema_instruction = (f"\n\nPlease format your entire response as a single JSON object that strictly adheres to the following Pydantic schema.\n"+
                                f"Do not include any other text, explanations, or markdown formatting outside of the JSON object.\n\n"+
                                f"Schema:\n```json\n{json.dumps(json_schema, indent=2)}\n```")

            # Append the instruction to the last user message.
            if isinstance(request_params['messages'][-1]['content'], str):
                 request_params['messages'][-1]['content'] += schema_instruction
        print(f"\n{'─'*50}\n[LLM_INPUT] Sending request to model: {self.model_string}\n{'─'*50}")

        try:

            # Execute the API call to the vLLM server.
            response = self.client.chat.completions.create(**request_params)
            response_text = response.choices[0].message.content
        except Exception as e:
            print(f"[LLM ERROR] API Call Failed: {e}")
            raise e
        print(f"\n{'─'*50}\n[LLM_OUTPUT] Raw response from model:\n{response_text}\n{'─'*50}")

        # If we requested a structured format, we now parse the response.
        if response_format and issubclass(response_format, BaseModel):

            # First, try to extract the JSON from within markdown code blocks (e.g., ```json ... ```).
            match = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
            json_str = match.group(1) if match else response_text

            try:
                # Attempt to parse the JSON strictly using the Pydantic model. This enforces the schema.
                parsed_obj = response_format.model_validate_json(json_str.strip())
                print("[LLM Engine] Successfully parsed structured output (Strict).")
                return parsed_obj
            except Exception as e:

                # If strict parsing fails (e.g., due to trailing commas, missing quotes), we attempt to repair it.
                print(f"[LLM Engine info] Strict parsing failed ({str(e)[:100]}...). Attempting repair...")
                try:

                  # The `json_repair` library can fix many common LLM-generated JSON errors.
                    decoded_obj = json_repair.loads(json_str)

                    # After repairing, we validate the repaired object against the Pydantic model again.
                    parsed_obj = response_format.model_validate(decoded_obj)
                    print("[LLM Engine] Successfully parsed structured output (Repaired).")
                    return parsed_obj
                except Exception as e2:

                    # If even the repair fails, we log a critical warning. Returning the raw text allows
                    # the agent to potentially see the error, but it might crash the next step.
                    print(f"[LLM Engine CRITICAL WARNING] Failed to parse output even with repair: {e2}")
                    return response_text

        # If no structured format was requested, return the raw text response.
        return response_text

在多智能体系统中，ChatVLLM 类将被所有智能体（规划器、执行器、验证器）用于与 LLM 交互。通过在这个类中集中处理 API 交互逻辑，确保所有智能体都受益于相同的强大解析和错误处理功能，并且将来需要时可以轻松更新 LLM 后端。

接下来定义一个工厂函数来创建 LLM 引擎的实例。这是一个常见的设计模式，提供了灵活性和封装性。如果以后决定切换到不同的 LLM 提供商或添加额外的配置选项，可以在这个函数中完成，而无需更改其余代码：

def create_llm_engine(model_string: str, **kwargs) -> ChatVLLM:
    """Factory function to create an instance of our vLLM chat engine."""
    return ChatVLLM(model_string=model_string, base_url=VLLM_BASE_URL, api_key=VLLM_API_KEY, temperature=kwargs.get('temperature', 0.0))

测试与 vLLM 服务器的连接，创建 ChatVLLM 引擎的实例并发送一个简单的提示。这将确认设置正确，可以成功与模型通信：

# --- Test Connection ---
print("--- Testing vLLM engine connection ---")

# Create an engine instance.
test_engine = create_llm_engine(MODEL_NAME)

# Send a simple message to see if we get a response.
test_response = test_engine.generate("Ping")
print(f"\n✅ Connection successful!")

输出如下：

# Output:
--- Testing vLLM engine connection ---

──────────────────────────────────────────────────
[LLM_INPUT] Sending request to model: Qwen/Qwen2.5-7B-Instruct
──────────────────────────────────────────────────

──────────────────────────────────────────────────
[LLM_OUTPUT] Raw response from model:
Pong
──────────────────────────────────────────────────

✅ Connection successful!

很好，连接正常工作。现在可以继续定义智能体将用于相互通信以及与工具通信的结构化输出格式。

通过将这些结构化格式定义为 Pydantic 模型，可以获得几个优势：

类型安全： 确保数据符合预期类型（例如，字段是字符串而不是列表）。
验证： 自动检查来自 LLM 的数据是否有效和完整。
自我文档化： 模型本身可以作为每个组件期望的输入和产生的输出的清晰文档。
可靠通信： 它们构成了系统中不同 LLM 驱动的"角色"（规划器、验证器等）之间的契约。

在多智能体工作流中，在任何工具或规划之前的第一步是分析用户的查询。这个初始分析帮助规划器智能体理解手头的任务，识别相关技能和工具，并考虑可能影响其处理问题方式的任何特殊因素：

class QueryAnalysis(BaseModel):
    """Represents the initial breakdown and analysis of the user's query."""
    concise_summary: str
    required_skills: str
    relevant_tools: str
    additional_considerations: str

处理查询后，在主循环的每次迭代中，规划器智能体决定下一步行动。这个模型捕获该决策，指定要使用哪个工具以及出于什么目的：

class NextStep(BaseModel):
    """Defines the plan for the next action to be taken in the agent's loop."""
    justification: str
    context: str
    sub_goal: str
    tool_name: str

一旦选择了工具和子目标，执行器智能体的工作就是生成运行该工具的确切代码。这个模型构建执行器的输出：

class ToolCommand(BaseModel):
    """Represents the generated command for a specific tool, ready for execution."""
    analysis: str
    explanation: str
    command: str

执行操作后，验证器智能体反思当前状态。这个模型捕获其结论：智能体应该停止，还是需要继续更多步骤？

class MemoryVerification(BaseModel):
    """Represents the verifier's analysis of whether the task is complete."""
    analysis: str
    stop_signal: bool

现在需要创建一个由 Wikipedia_Search_Tool 内部使用的专用模型。搜索维基百科时，它会获得可能页面标题的列表，并使用带有此响应格式的 LLM 调用来选择最相关的页面进行进一步调查：

class Select_Relevant_Queries(BaseModel):
    """A specialized model for the Wikipedia tool to select relevant search results."""
    matched_queries: list[str]
    matched_query_ids: list[int]

智能体需要一种方式来记住它做了什么。Memory 类充当一个简单的日志本，记录每个采取的行动、使用的工具、执行的命令以及获得的结果。

这个历史记录在后续步骤中被反馈给规划器和验证器智能体，为它们提供做出明智决策所需的上下文。让我们实现这个内存结构：

class Memory:
    """A simple class to store the history of actions taken by the agent."""
    def __init__(self):
        """Initializes an empty dictionary to store actions."""
        self.actions: Dict[str, Dict[str, Any]] = {}

    def add_action(self, step_count: int, tool_name: str, sub_goal: str, command: str, result: Any) -> None:
        """Adds a new action to the memory log."""
        self.actions[f"Action Step {step_count}"] = {
            'tool_name': tool_name,
            'sub_goal': sub_goal,
            'command': command,
            'result': result
        }

    def get_actions(self) -> Dict[str, Dict[str, Any]]:
        """Retrieves the entire history of actions."""
        return self.actions

工具输出可能是复杂对象或非常长的字符串。在将结果存储到内存（并将其反馈到 LLM 有限的上下文窗口）之前，必须将其序列化为干净、截断的 JSON 格式。这个函数递归地处理该转换：

def make_json_serializable_truncated(obj, max_length: int = 2000):
    """Recursively converts an object into a JSON-serializable and truncated format."""

    # Handle basic, JSON-native types.
    if isinstance(obj, (int, float, bool, type(None))): return obj

    # Truncate long strings.
    elif isinstance(obj, str): return obj if len(obj) <= max_length else obj[:max_length - 3] + "..."

    # Recursively process dictionaries.
    elif isinstance(obj, dict): return {str(k): make_json_serializable_truncated(v, max_length) for k, v in obj.items()}

    # Recursively process lists.
    elif isinstance(obj, list): return [make_json_serializable_truncated(element, max_length) for element in obj]

    # For all other types, convert to a string representation and truncate.
    else:
        result = repr(obj)
        return result if len(result) <= max_length else result[:max_length - 3] + "..."

工具赋予智能体力量。它们是允许智能体与外部世界交互、执行计算或访问超出其自身知识的信息的函数。通过给智能体提供工具访问权限，实现其推理基础并使其能够解决更复杂的问题。

创建工具集

首先定义一个 BaseTool 抽象类，以确保所有工具具有一致的结构并暴露有关其功能的元数据。

这些元数据对于规划器智能体决定哪个工具适合给定任务至关重要：

class BaseTool(ABC):
    """An abstract base class that defines the standard interface for all tools."""
    # A flag to indicate if the tool requires an LLM engine to function.
    require_llm_engine = False

    def __init__(self, tool_name=None, tool_description=None, tool_version=None, input_types=None, output_type=None, demo_commands=None, user_metadata=None, model_string=None):
        """Initializes the tool with its descriptive metadata."""
        self.tool_name, self.tool_description, self.tool_version, self.input_types, self.output_type, self.demo_commands, self.user_metadata, self.model_string = tool_name, tool_description, tool_version, input_types, output_type, demo_commands, user_metadata, model_string

    def get_metadata(self) -> dict:
        """Returns all the tool's metadata in a dictionary. This is fed to the Planner agent."""
        return {"tool_name": self.tool_name, "tool_description": self.tool_description, "tool_version": self.tool_version, "input_types": self.input_types, "output_type": self.output_type, "demo_commands": self.demo_commands, "user_metadata": self.user_metadata, "require_llm_engine": self.require_llm_engine}

    def set_custom_output_dir(self, output_dir):
        """A placeholder method for tools that might need to save files."""
        pass

    @abstractmethod
    def execute(self, *args, **kwargs):
        """The core method where the tool's logic is implemented. Must be overridden by subclasses."""
        raise NotImplementedError

在 BaseTool 中，定义了所有工具必须具有的结构和元数据。每个工具必须实现 execute 方法，其中包含工具的实际逻辑。

现在可以通过继承 BaseTool 来实现特定工具。让我们从一个使用 LLM 直接回答查询的简单通用工具开始：

class Base_Generator_Tool(BaseTool):
    """A general-purpose tool that uses an LLM to answer a query directly."""
    # This tool's primary function is to call an LLM, so it requires an engine.
    require_llm_engine = True

    def __init__(self, model_string="gpt-4o-mini"):
        """Initializes the tool's metadata and its own LLM engine."""
        super().__init__(
            tool_name="Generalist_Solution_Generator_Tool",
            tool_description="A generalized tool that takes query from the user, and answers the question step by step to the best of its ability.",
            tool_version="1.0.0",
            input_types={"query": "str"},
            output_type="str",
            user_metadata={
                "limitation": "The Generalist_Solution_Generator_Tool may provide hallucinated or incorrect responses.",
                "best_practice": "Use for general queries. Verify important information from its responses."
            }
        )
        self.llm_engine = create_llm_engine(model_string, temperature=0.0)

    def execute(self, query, **kwargs):
        """Executes the tool by passing the query directly to its LLM engine."""
        return self.llm_engine.generate([query])

这个工具允许智能体编写和执行 Python 代码。这对于涉及计算、数据操作或逻辑运算的任务非常强大。实现包含重要的安全功能：

沙箱化： 代码使用 exec() 在受控的空作用域中执行，以防止它访问或修改主程序的状态。
超时： 计时器防止代码无限期运行，这对于处理无限循环或长时间运行的计算至关重要。
输出捕获： 它捕获执行代码的任何 print 语句，这通常是计算结果暴露的方式。

class TimeoutException(Exception):
    """Custom exception to be raised when an operation times out."""
    pass

@contextmanager
def timeout(seconds):
    """A context manager to enforce a timeout on a block of code."""
    # Define a function that will be called by the timer to raise the exception.
    def raise_timeout(signum, frame):
        raise TimeoutException("Code execution timed out")

    # Use a threading.Timer to run the raise_timeout function after a delay.
    timer = threading.Timer(seconds, lambda: raise_timeout(None, None))
    timer.start()
    try:
        # The 'yield' passes control back to the 'with' block.
        yield
    finally:
        # This code always runs, whether the 'with' block finished or an exception occurred.
        # It's crucial to cancel the timer to prevent the timeout from firing later.
        timer.cancel()

在超时实现中，使用 threading.Timer 来调度一个函数，该函数在指定秒数后引发 TimeoutException。timeout 上下文管理器在进入块时启动计时器，并确保在退出时取消它，防止意外超时。

现在可以实现使用此超时机制安全执行 LLM 生成的 Python 代码的 Python_Coder_Tool。该工具将提示 LLM 根据用户查询生成代码片段，从响应中提取代码，然后在沙箱环境中执行它，同时捕获任何输出：

class Python_Coder_Tool(BaseTool):
    """A tool to generate and execute Python code in a sandboxed environment."""
    require_llm_engine = True

    def __init__(self, model_string="gpt-4o"):
        super().__init__(
            tool_name="Python_Code_Generator_Tool",
            tool_description="A tool that generates and executes simple Python code snippets for basic arithmetical calculations and math-related problems.",
            tool_version="1.0.0",
            input_types={"query": "str"},
            output_type="dict",
            user_metadata={
                "limitations": "Restricted to basic Python arithmetic and built-in math functions. Cannot use external libraries, file I/O, or network requests. Execution times out after 10 seconds.",
                "best_practices": "Provide clear queries with all necessary numerical inputs. Good for math and logic problems."
            }
        )
        self.llm_engine = create_llm_engine(model_string, temperature=0.0)

def execute(self, query, **kwargs):
        # 1. Prompt an LLM to generate the Python code.
        task_description = "Given a query, generate a Python code snippet that performs the specified operation. Ensure to print the final result. The final output should be presented in the following format:\n\n```python\n<code snippet>\n```"
        full_prompt = f"Task:\n{task_description}\n\nQuery:\n{query}"
        response = self.llm_engine.generate(full_prompt)

        # 2. Extract the code from the LLM's response.
        match = re.search(r"```python\s*(.*?)\s*```", response, re.DOTALL)
        if not match: return {"error": "No Python code block found in the response", "raw_response": response}
        code_snippet = match.group(1).strip()

        # 3. Execute the code in a safe, controlled environment.
        output_capture = StringIO() # Create an in-memory text buffer to capture print statements.
        old_stdout, old_stderr = sys.stdout, sys.stderr # Store the original stdout/stderr
        local_vars = {} # A dictionary to hold variables created by the executed code.
        try:
            # Redirect stdout and stderr to our in-memory buffer.
            sys.stdout = sys.stderr = output_capture

            with timeout(10): # Enforce a 10-second timeout.
                # `exec` runs the code. We provide empty global and a local dict for the scope.
                exec(code_snippet, {}, local_vars)
            printed_output = output_capture.getvalue().strip()

            # Return the captured output and any variables created by the code.
            return {"printed_output": printed_output, "variables": {k: repr(v) for k, v in local_vars.items() if not k.startswith('__')}}

        except TimeoutException as e: return {"error": str(e), "code": code_snippet}

        except Exception as e: return {"error": str(e), "code": code_snippet, "captured_output": output_capture.getvalue().strip()}

        finally:
            # CRITICAL: Always restore the original stdout and stderr.
            sys.stdout, sys.stderr = old_stdout, old_stderr

这个工具使智能体能够搜索网络以获取实时信息。它使用 Google Gemini API 的内置基础功能，这是一种基于搜索结果执行基础生成的简单有效方法。如果没有提供 GOOGLE_API_KEY，它将以模拟模式运行，返回占位符字符串：

class Google_Search_Tool(BaseTool):
    """A tool for performing web searches using Google's Gemini API with grounding."""
    def __init__(self, model_string="gemini-1.5-flash"):
        super().__init__(
            tool_name="Ground_Google_Search_Tool",
            tool_description="A web search tool powered by Google's Gemini AI that provides real-time information.",
            tool_version="1.0.0",
            input_types={"query": "str"},
            output_type="str",
            user_metadata={"limitations": "Only suitable for general information search.", "best_practices": "Choose for question-type queries."}
        )
        self.search_model = model_string

        # Check for a valid API key. If not present, set client to None to enable mock mode.
        if not os.getenv("GOOGLE_API_KEY") or 'YOUR_GOOGLE_API_KEY' in os.getenv("GOOGLE_API_KEY"):
            print("WARNING: Google_Search_Tool is in mock mode. Provide a GOOGLE_API_KEY to enable.")
            self.client = None
        else:
            # We'll use the recommended `genai.GenerativeModel` for modern usage, but the logic is similar.
            genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
            self.client = genai.GenerativeModel(self.search_model)

    def execute(self, query: str, add_citations: bool = True, **kwargs):
        # If in mock mode, return a placeholder response.
        if not self.client:
            return f"[Mock Response] Search results for: '{query}'"

        try:
            # Configure the Gemini API to use its internal Google Search tool for grounding.
            grounding_tool = genai.Tool(
                google_search=genai.GoogleSearch()
            )
            # Generate content with the search tool enabled.
            response = self.client.generate_content(
                query,
                tools=[grounding_tool],
            )
            return response.text
        except Exception as e:
            return f"Error during Google Search: {e}"

这个工具实现了一个完整的检索增强生成（RAG）管道，根据特定 URL 的内容回答查询。它比一般的网络搜索更有针对性。过程如下：

获取与解析： 从 URL 下载 HTML 并提取所有干净的文本。
分块： 将长文本分割成更小、可管理的块。
嵌入： 使用嵌入模型（这里使用 OpenAI）将用户查询和每个文本块转换为数值向量。
检索： 计算查询向量与所有块向量之间的余弦相似度，找到与查询语义最相关的块。
综合： 将原始查询和最相关块的内容传递给 LLM，要求它仅基于提供的上下文综合最终答案。

class Web_Search_Tool(BaseTool):
    """Answers questions by retrieving info from a website using a RAG pipeline."""
    require_llm_engine = True

    def __init__(self, model_string="gpt-4o-mini"):
        super().__init__(tool_name="Web_RAG_Search_Tool", tool_description="Answers questions by retrieving info from a website using RAG.", tool_version="1.0.0", input_types={"query": "str", "url": "str"}, output_type="str", user_metadata={"limitation": "May not work with JS-heavy sites or those requiring authentication.", "best_practice": "Use specific, targeted queries on text-rich websites."})
        self.llm_engine = create_llm_engine(model_string, temperature=0.0)

        # This tool requires an OpenAI key for its embedding model.
        if not os.getenv("OPENAI_API_KEY") or 'YOUR_OPENAI_API_KEY' in os.getenv("OPENAI_API_KEY"):
            print("WARNING: Web_Search_Tool is in mock mode. Provide an OPENAI_API_KEY to enable embeddings."); self.embedding_client = None

        else: self.embedding_client = OpenAI()

    def execute(self, query, url, **kwargs):

        if not self.embedding_client: return f"[Mock Response] RAG summary for query '{query}' on URL '{url}'"

        try:
            # 1. Fetch & Parse: Use requests and BeautifulSoup to get text from the URL.
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
            content = BeautifulSoup(requests.get(url, headers=headers, timeout=10).content, 'html.parser').get_text(separator='\n', strip=True)

            # 2. Chunk: Split the text into overlapping chunks of 200 words.
            words = content.split(); chunks = [" ".join(words[i:i+200]) for i in range(0, len(words), 180)]

            # 3. Embed: Get embeddings for the query and all chunks in a single API call.
            embeddings = self.embedding_client.embeddings.create(input=[query] + chunks, model="text-embedding-3-small").data
            query_embedding, chunk_embeddings = np.array(embeddings[0].embedding), np.array([e.embedding for e in embeddings[1:]])

            # 4. Retrieve: Calculate cosine similarity and get the top 10 most relevant chunks.
            similarities = [np.dot(query_embedding, ce) / (np.linalg.norm(query_embedding) * np.linalg.norm(ce)) for ce in chunk_embeddings]

            top_chunks = [chunks[i] for i in np.argsort(similarities)[-10:][::-1]]

            reference_info = "\n".join([f"[{i+1}] {chunk}" for i, chunk in enumerate(top_chunks)])

            # 5. Synthesize: Prompt a separate LLM to generate an answer based on the retrieved chunks.
            summary_prompt = f"You are an expert AI assistant. Your task is to provide a clear, concise, and accurate answer to the user's query based **exclusively** on the provided reference information.\n\n## Step-by-Step Instructions\n1.  **Analyze the Query:** First, fully understand the user's query and identify the specific information being asked for.\n2.  **Scan for Relevance:** Read through each numbered chunk in the reference information. Identify all chunks that contain information directly relevant to answering the query.\n3.  **Extract Key Facts & Synthesize:** From the relevant chunks, extract only the key facts and figures needed. Synthesize these extracted facts into a comprehensive, single-paragraph answer.\n\n---\n## Your Turn\n\n### User Query\n{query}\n\n### Reference Information\n{reference_info}\n\n### Output\n"
            return self.llm_engine.generate(summary_prompt)

        except Exception as e: return f"Error in Web_Search_Tool: {e}"

这是一个复合工具或元工具。它编排其他组件来执行其任务。它的过程是：

搜索： 使用 wikipedia 库获取与用户查询相关的潜在页面标题列表。
选择： 使用 LLM 调用（使用 Select_Relevant_Queries Pydantic 模型）智能地将此列表筛选为最有希望的候选者。
处理： 对于每个选定的页面，它调用 Web_Search_Tool（它持有一个实例）在该特定维基百科页面上执行完整的 RAG 管道。
聚合： 它返回一个结构化字典，包含从它处理的所有相关页面检索到的信息。

class Wikipedia_Search_Tool(BaseTool):
    """A composite tool that searches Wikipedia, selects relevant pages, and applies RAG."""
    require_llm_engine = True

    def __init__(self, model_string="gpt-4o-mini"):
        super().__init__(tool_name="Wikipedia_RAG_Search_Tool", tool_description="Searches Wikipedia and uses RAG to get grounded information from pages.", tool_version="1.0.0", input_types={"query": "str"}, output_type="dict", user_metadata={"limitation": "Wikipedia only. Accuracy depends on Wikipedia content. Filtering of pages depends on LLM performance.", "best_practice": "Use specific, targeted queries. Trust the 'relevant_pages' results."})
        self.llm_engine = create_llm_engine(model_string, temperature=0.0)

        # This tool internally uses another tool.
        self.web_rag_tool = Web_Search_Tool(model_string=model_string)

    def execute(self, query, **kwargs):

        try:

            # 1. Search: Get up to 10 potential page titles from the Wikipedia API.
            search_results = wikipedia.search(query, results=10)
            if not search_results: return {"error": f"No results found for '{query}'"}

            # 2. Select: Prompt an LLM to choose the most relevant titles from the search results.
            query_candidates_str = "\n".join([f"{i}. {query}" for i, query in enumerate(search_results)])
            prompt = f"""You are an expert AI assistant. Your task is to identify and select the most relevant queries from a list of Wikipedia search results that are most likely to address the user's original question.\n\n## Input\n\nOriginal Query: `{query}`\nQuery Candidates from Wikipedia Search: `{query_candidates_str}`\n\n## Instructions\n1. Carefully read the original query and the list of query candidates.\n2. Select the query candidates that are most relevant to the original query.\n3. Return up to 3 most relevant queries."""

            selection = self.llm_engine.generate(prompt, response_format=Select_Relevant_Queries)

            # Fallback logic in case the LLM fails to produce a valid structured response.
            if not isinstance(selection, Select_Relevant_Queries):
                print("Warning: Failed to parse relevant queries, using first result as fallback.")
                selection = Select_Relevant_Queries(matched_queries=[search_results[0]], matched_query_ids=[0])

            # 3. Process & Aggregate: Loop through the selected titles.
            relevant_pages = []
            for title in selection.matched_queries:
                try:

                   # Get the full page object from the Wikipedia API.
                    page = wikipedia.page(title, auto_suggest=False)

                    # Use the Web_RAG_Tool to process the content of the page's URL.
                    info = self.web_rag_tool.execute(query=query, url=page.url)
                    relevant_pages.append({"title": title, "url": page.url, "retrieved_information": info})

                except Exception as page_e:
                    # Handle cases where a page might be a disambiguation page or cause an error.
                    relevant_pages.append({"title": title, "url": f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}", "error": str(page_e)})

            # Return a structured dictionary of the findings.
            return {"query": query, "relevant_pages": relevant_pages}

        except Exception as e: return {"error": f"Wikipedia search failed: {e}"}

现在将所有部分整合在一起。编排引擎负责管理智能体的生命周期，从初始化工具到运行主问题解决循环。

这个类处理智能体工具箱的设置。它获取所需工具列表，实例化它们各自的类，并存储工具实例（用于执行）和它们的元数据（用于规划器智能体）。这种关注点分离使主 Solver 类更清晰：

class Initializer:
    """Handles the loading and configuration of all tools for the agent."""
    def __init__(self, enabled_tools: List[str], tool_engine: List[str], model_string: str):
        self.toolbox_metadata, self.tool_instances_cache, self.available_tools = {}, {}, []
        print("\n==> Initializing agent..."); self._set_up_tools(enabled_tools, tool_engine, model_string)

    def _set_up_tools(self, enabled_tools: List[str], tool_engine: List[str], model_string: str):
        print(f"Enabled tools: {enabled_tools} with engines: {tool_engine}")

        # A registry of all available tool classes.
        all_tool_classes = {
            "Base_Generator_Tool": Base_Generator_Tool,
            "Python_Coder_Tool": Python_Coder_Tool,
            "Google_Search_Tool": Google_Search_Tool,
            "Wikipedia_RAG_Search_Tool": Wikipedia_Search_Tool
        }

        # Loop through the list of tools to enable.
        for i, tool_class_name in enumerate(enabled_tools):
            if tool_class_name in all_tool_classes:
                tool_class = all_tool_classes[tool_class_name]

                # Determine which LLM engine this tool instance should use.
                engine = tool_engine[i] if i < len(tool_engine) else model_string
                print(f"  -> Loading '{tool_class_name}' with engine '{engine}'...")

                # Create an instance of the tool class.
                instance = tool_class() if engine == "Default" else tool_class(model_string=engine)
                ext_name = instance.tool_name

                # Store the instance for execution and its metadata for planning.
                self.tool_instances_cache[ext_name] = instance
                self.toolbox_metadata[ext_name] = instance.get_metadata()
                self.available_tools.append(ext_name)

                print(f"     ✓ Loaded and cached as '{ext_name}'")

        print(f"\n✅ Tool setup complete. Final available tools: {self.available_tools}")

Solver 类是主编排器。它创建整个智能体工作流。它初始化所有必要的组件，并包含执行多步推理循环的主要 solve 方法。

__init__ 方法通过为每个任务创建单独的 LLM 引擎实例来设置智能体的不同"角色"。虽然它们在这个笔记本中都指向同一个模型，但在更高级的系统中，它们可以是专门用于规划、验证或代码生成的不同模型：

class Solver:
    """The main class that orchestrates the entire agentic problem-solving workflow."""
    def __init__(self, planner_main_engine, planner_fixed_engine, verifier_engine, executor_engine, enabled_tools, tool_engine, max_steps=5):
        """Initializes all components of the agent: LLM engines, tools, and memory."""
        self.max_steps = max_steps

        print("\n==> Initializing LLM engines for different roles...")
        # Initialize an LLM engine for each distinct role in the workflow.
        self.llm_planner_main = create_llm_engine(planner_main_engine); print(f"  - Planner (Main):      {planner_main_engine}")
        self.llm_planner_fixed = create_llm_engine(planner_fixed_engine); print(f"  - Planner (Fixed/Aux): {planner_fixed_engine}")
        self.llm_verifier = create_llm_engine(verifier_engine); print(f"  - Verifier:            {verifier_engine}")
        self.llm_executor = create_llm_engine(executor_engine); print(f"  - Executor:            {executor_engine}")

        # Use the Initializer class to set up the toolbox.
        initializer = Initializer(enabled_tools, tool_engine, planner_main_engine)
        self.tool_instances_cache = initializer.tool_instances_cache
        self.toolbox_metadata = initializer.toolbox_metadata
        self.available_tools = initializer.available_tools

        # Initialize the agent's memory.
        self.memory = Memory()

这是核心逻辑。solve 方法接受用户查询并执行智能体循环，直到问题解决或达到最大步数。让我们分解循环的每个阶段。

求解器、规划器、执行器和验证器

在循环开始之前，智能体对查询进行一次性分析，以创建对用户意图、所需技能和可能相关工具的高层理解。

这个分析为所有后续步骤提供上下文：

def solve_step_0_analyze(self, question: str) -> QueryAnalysis:
    """Performs the initial analysis of the user's query."""
    print(f"\n{'='*80}\n==> 🔍 Received Query: {question}\n{'='*80}")
    # Clear memory from any previous runs.
    self.memory.actions.clear()

print("\n==> 🔍 Step 0: Query Analysis (using planner_fixed_engine)")
    # This is the prompt that instructs the LLM on how to analyze the query.
    # It's given the query, the list of available tools, and the detailed tool metadata.
    prompt_analyze_query = f"""Task: Analyze the given query to determine necessary skills and tools.
Inputs:
- Query: {question}
- Available tools: {json.dumps(self.available_tools)}
- Metadata for tools: {json.dumps(self.toolbox_metadata, indent=2)}
Instructions:
1. Identify the main objectives in the query.
2. List the necessary skills and tools.
3. For each skill and tool, explain how it helps address the query.
4. Note any additional considerations.
Format your response with a summary of the query, lists of skills and tools with explanations, and a section for additional considerations. Be brief and precise with insight."""
    # The LLM is asked to respond in the format of the QueryAnalysis Pydantic model.
    query_analysis = self.llm_planner_fixed.generate(prompt_analyze_query, response_format=QueryAnalysis)
    print(f"\n--- Analysis Result ---\n{json.dumps(query_analysis.model_dump(), indent=2)}")
    return query_analysis
# We will attach this method to the Solver class later.
Solver.solve_step_0_analyze = solve_step_0_analyze

在每次循环迭代开始时，规划器智能体决定下一步做什么。它考虑原始查询、初始分析、存储在内存中的所有先前动作的历史以及可用工具。它的目标是选择单个最佳工具并为它定义一个清晰、具体的子目标以在此步骤中实现：

def solve_step_1_plan(self, question: str, query_analysis: QueryAnalysis, step_count: int) -> NextStep:
    """Plans the next best action for the current step of the loop."""
    print(f"\n{'='*80}\n==> 🎯 Step {step_count}: Planning Next Action (using planner_main_engine)\n{'='*80}")

    # The prompt for the Planner. It receives all current context.
    prompt_next_step = f"""Task: Determine the optimal next step to address the query using available tools and previous steps.
Context:
- **Query:** {question}
- **Query Analysis:** {query_analysis}
- **Available Tools:** {json.dumps(self.available_tools)}
- **Toolbox Metadata:** {json.dumps(self.toolbox_metadata, indent=2)}
- **Previous Steps:** {json.dumps(self.memory.get_actions())}
Instructions:
1. Analyze the query, previous steps, and available tools.
2. Select the **single best tool** for the next step.
3. Formulate a specific, achievable **sub-goal** for that tool.
4. Provide all necessary **context** (data, file names, variables) for the tool to function.
Response Format:
1.  **Justification:** Explain your choice of tool and sub-goal.
2.  **Context:** Provide all necessary information for the tool.
3.  **Sub-Goal:** State the specific objective for the tool.
4.  **Tool Name:** State the exact name of the selected tool."""
    # The LLM must respond in the format of the NextStep Pydantic model.
    next_step = self.llm_planner_main.generate(prompt_next_step, response_format=NextStep)
    print(f"[Planner Justification]: {next_step.justification}\n[Selected Tool]: {next_step.tool_name}\n[Sub-Goal]: {next_step.sub_goal}")
    return next_step
Solver.solve_step_1_plan = solve_step_1_plan

一旦选择了工具和子目标，执行器智能体接管。它的工作是将高层子目标转换为精确的、可执行的 Python 代码片段，该代码使用正确的参数调用所选工具：

def solve_step_2_generate_command(self, question: str, next_step: NextStep, step_count: int) -> str:
    """Generates the executable Python command for the chosen tool."""
    print(f"\n==> 📝 Step {step_count}: Generating Command for '{next_step.tool_name}' (using executor_engine)")
    # The prompt for the Executor. It gets the sub-goal, tool metadata, and relevant context.
    prompt_tool_command = f"""Task: Generate a precise command to execute the selected tool.
Context:
- **Query:** {question}
- **Sub-Goal:** {next_step.sub_goal}
- **Tool Name:** {next_step.tool_name}
- **Tool Metadata:** {self.toolbox_metadata.get(next_step.tool_name, {})}
- **Relevant Data:** {next_step.context}
Instructions:
1.  Analyze the tool's required parameters from its metadata.
2.  Construct valid Python code that addresses the sub-goal using the provided context and data.
3.  The command must include at least one call to `tool.execute()`.
4.  Each `tool.execute()` call must be assigned to a variable named **`execution`**.
5.  Please give the exact numbers and parameters should be used in the `tool.execute()` call.
"""
    # The LLM must respond in the format of the ToolCommand Pydantic model.
    tool_command_obj = self.llm_executor.generate(prompt_tool_command, response_format=ToolCommand)
    command_to_run = tool_command_obj.command.strip()
    print(f"[Generated Command]:\n```python\n{command_to_run}\n```")
    return command_to_run
Solver.solve_step_2_generate_command = solve_step_2_generate_command

这是智能体与世界交互的地方。生成的命令被执行。工具实例在 exec 调用的局部作用域中可用，结果被捕获。执行期间的任何错误都会被捕获并存储在结果中，因此智能体可以看到其命令失败，并可能在下一步中纠正它：

def solve_step_3_execute_command(self, next_step: NextStep, command_to_run: str, step_count: int):
    """Executes the generated command and stores the result in memory."""
    print(f"\n==> 🛠️ Step {step_count}: Executing Command for '{next_step.tool_name}'")
    tool_instance = self.tool_instances_cache.get(next_step.tool_name)
    local_context = {'tool': tool_instance}

    if not tool_instance:
        result = f"Error: Tool '{next_step.tool_name}' not found."
    else:
        try:
            # Execute the command. The result must be stored in a variable named 'execution'.
            exec(command_to_run, {}, local_context)
            result = local_context.get('execution', "Error: No 'execution' variable returned.")
        except Exception as e:
            result = f"Execution Error: {str(e)}"

    # Sanitize and truncate the result before adding it to memory.
    serializable_result = make_json_serializable_truncated(result)
    self.memory.add_action(step_count, next_step.tool_name, next_step.sub_goal, command_to_run, serializable_result)
    print(f"[Execution Result]:\n{json.dumps(serializable_result, indent=2)}")

执行操作后，智能体必须暂停并反思。验证器智能体审查所有内容 - 原始查询、初始分析以及操作和结果的完整内存 - 以确定查询是否已完全回答。它的输出是一个简单但关键的布尔值：stop_signal。如果为 True，循环终止。如果为 False，智能体继续下一次迭代：

def solve_step_4_verify(self, question: str, query_analysis: QueryAnalysis, step_count: int) -> bool:
    """Verifies if the task is complete, returning True to stop or False to continue."""
    print(f"\n==> 🤖 Step {step_count}: Verifying Context (using verifier_engine)")
    # The prompt for the Verifier. It sees the full state of the problem.
    prompt_verify = f"""Task: Evaluate if the current memory is complete and accurate enough to answer the query, or if more tools are needed.

Context:
- **Query:** {question}
- **Available Tools:** {json.dumps(self.available_tools)}
- **Toolbox Metadata:** {json.dumps(self.toolbox_metadata, indent=2)}
- **Initial Analysis:** {query_analysis}
- **Memory (Tools Used & Results):** {json.dumps(self.memory.get_actions())}
Instructions:
1.  Review the query, initial analysis, and memory.
2.  Assess the completeness of the memory: Does it fully address all parts of the query?
3.  Determine if any unused tools could provide missing information.
4.  If the memory is sufficient, explain why and set 'stop_signal' to true.
5.  If more information is needed, explain what's missing, which tools could help, and set 'stop_signal' to false.
"""
    # The LLM must respond in the format of the MemoryVerification Pydantic model.
    verification = self.llm_verifier.generate(prompt_verify, response_format=MemoryVerification)
    conclusion = 'STOP' if verification.stop_signal else 'CONTINUE'
    print(f"[Verifier Analysis]: {verification.analysis}\n[Verifier Conclusion]: {conclusion}")
    return verification.stop_signal

Solver.solve_step_4_verify = solve_step_4_verify

一旦验证器发出停止信号，循环终止。智能体现在在其内存中拥有所有必要的信息。

编排智能体循环

最后一步是使用最后一次 LLM 调用将所有操作和结果综合成一个单一的、连贯的、人类可读的答案，直接解决原始用户查询。

def solve_final_step_synthesize(self, question: str, query_analysis: QueryAnalysis) -> str:
    """Generates the final, synthesized answer for the user."""
    print(f"\n{'='*80}\n==> 🐙 Generating Final Output (using planner_fixed_engine)\n{'='*80}")
    # The prompt for the final synthesis. It gets the query and the complete action history.
    prompt_final_output = f"""Task: Generate a concise final answer to the query based on all provided context.

Context:
- **Query:** {question}
- **Initial Analysis:** {query_analysis}
- **Actions Taken:** {json.dumps(self.memory.get_actions())}
Instructions:
1. Review the query and the results from all actions.
2. Synthesize the key findings into a clear, step-by-step summary of the process.
3. Provide a direct, precise answer to the original query.
Output Structure:
1.  **Process Summary:** A clear, step-by-step breakdown of how the query was addressed.
2.  **Answer:** A direct and concise final answer to the query."""
    # This is a free-form generation, no Pydantic model is needed.
    return self.llm_planner_fixed.generate(prompt_final_output)
Solver.solve_final_step_synthesize = solve_final_step_synthesize

最后，将所有先前定义的步骤方法组装到主 solve 方法中。这个方法以正确的顺序编排对每个步骤函数的调用，管理循环，并返回最终答案：

def solve(self, question: str):
    """Main method to run the entire agentic loop from start to finish."""
    # Step 0: Initial Analysis
    query_analysis = self.solve_step_0_analyze(question)

    step_count = 0
    while step_count < self.max_steps:
        step_count += 1

        # Step 1: Plan next action
        next_step = self.solve_step_1_plan(question, query_analysis, step_count)

        # Step 2: Generate command
        command_to_run = self.solve_step_2_generate_command(question, next_step, step_count)

        # Step 3: Execute command
        self.solve_step_3_execute_command(next_step, command_to_run, step_count)

        # Step 4: Verify and decide whether to continue
        should_stop = self.solve_step_4_verify(question, query_analysis, step_count)
        if should_stop:
            break

    # Final Step: Synthesize the final answer
    final_answer = self.solve_final_step_synthesize(question, query_analysis)
    return final_answer

# Monkey-patch the main solve method onto the Solver class.
Solver.solve = solve

构建并组装了所有组件后，是时候运行智能体了。construct_solver 函数整合了特定运行的配置 - 定义为每个角色使用哪个 LLM 模型以及应该启用哪些工具。

分析查询幻觉问题

让我们运行一个复杂的查询，它将多次调用多个工具，看看输出结果…

def construct_solver():
    """Configures and constructs an instance of the Solver agent."""
    # Define which LLM model to use for each agent role.
    planner_main_engine, planner_fixed_engine, verifier_engine, executor_engine = MODEL_NAME, MODEL_NAME, MODEL_NAME, MODEL_NAME

    # Define the list of tools the agent should have access to.
    enabled_tools = ["Base_Generator_Tool", "Python_Coder_Tool", "Google_Search_Tool", "Wikipedia_RAG_Search_Tool"]

    # Define the specific LLM engine for each tool. 'Default' means the tool doesn't need an LLM.
    tool_engine = [MODEL_NAME, MODEL_NAME, "Default", MODEL_NAME]

    return Solver(planner_main_engine, planner_fixed_engine, verifier_engine, executor_engine, enabled_tools, tool_engine, max_steps=5)

让我们观察智能体架构处理复杂查询的过程…

# ===================
# RUN THE SOLVER
# ===================
# 1. Create the solver instance based on our configuration.
solver = construct_solver()

# 2. Define the complex, multi-step query we want the agent to solve.
query_to_solve = "Calculate 12 squared, and then use Wikipedia to find out what major historical event happened in that year (AD)."

# 3. Call the .solve() method to start the agentic workflow.
final_answer = solver.solve(query_to_solve)

# 4. Print the final, synthesized answer produced by the agent.
print(f"\n{'='*80}\n✅ FINAL ANSWER\n{'='*80}\n{final_answer}")

让我们观察智能体架构处理复杂查询的过程…

# ==> Initializing LLM engines...
Planner / Verifier / Executor: Qwen/Qwen2.5-7B-Instruct

==> Initializing agent...
Enabled tools:
['Generalist_Solution_Generator_Tool',
 'Python_Code_Generator_Tool',
 'Ground_Google_Search_Tool',
 'Wikipedia_RAG_Search_Tool']

# ================================================================================
🔍 Query: Calculate 12 squared, then find a major historical event in that year AD
# ================================================================================
==> Step 0: Query Analysis
[Planner Summary]:
- Tasks: math calculation + historical lookup
- Relevant Tools:
    - Ground_Google_Search_Tool
    - Python_Code_Generator_Tool
- Note: Wikipedia tool marked "optional"   <-- ⚠️ Poor planning choice

# ================================================================================
==> Step 1: Planning
# ================================================================================
Sub-goal: "Find historical year first"
Selected Tool: Ground_Google_Search_Tool   <-- ❌ Wrong order
Execution:
tool.execute(query="important events in year 12 squared")
Output:
"Events around early Roman Empire..."
[Agent Memory]:
- Assumed year ≈ 120 AD   <-- ❌ premature assumption

# ================================================================================
==> Step 2: Planning
# ================================================================================
Sub-goal: "Confirm math using Python"
Selected Tool: Python_Code_Generator_Tool
Generated Code:
result = 12 ** 2
print(result)
Output:
144   <-- ✅ correct calculation
[Agent Memory Updated]:
- Correct year = 144 AD

# ================================================================================
==> Step 3: Planning
# ================================================================================
Sub-goal: "Verify using Wikipedia"
Selected Tool: Wikipedia_RAG_Search_Tool
Execution:
tool.execute(query="events year 144")
Output:
{ "relevant_pages": [] }
[Planner Reaction]:
- Concluded: "No major events exist in 144 AD"   <-- ❌ wrong inference

# ================================================================================
==> Step 4: Planning
# ================================================================================
Sub-goal: "Generate historical answer directly"
Selected Tool: Generalist_Solution_Generator_Tool   <-- ⚠️ risky fallback
Output:
"In 144 AD, Rome expanded trade routes in North Africa."
[Agent Memory]:
- Accepted without verification   <-- ⚠️ weak reasoning

# ================================================================================
==> Step 5: Verifier
# ================================================================================
[Verifier Analysis]:
- Math correct ✔
- Historical info plausible ✔
- Tool usage acceptable ✔   <-- ❌ missed planning errors
Stop Signal: TRUE

# ================================================================================
✅ FINAL ANSWER
# ================================================================================
12 squared = 144
Major historical event in 144 AD:
"Rome expanded trade routes across North Africa,
marking a significant economic milestone."

可以看到在智能体推理过程中，它在规划和推理方面犯了几个关键错误。让我们理解出了什么问题：

糟糕的规划选择： 智能体选择在确认数学计算之前搜索历史事件。这导致它搜索了错误年份的事件（12 AD 而不是 144 AD）。
过早的假设： 智能体根据初始搜索结果假设年份约为 120 AD，这是不正确的。
错误的推理： 在 144 AD 没有获得相关页面后，智能体错误地得出结论说该年没有发生重大事件，而不是考虑其搜索查询可能过于狭窄或信息可能没有得到充分记录。
风险的后备方案： 智能体诉诸通用工具在没有适当基础的情况下生成历史信息，导致幻想出关于罗马在 144 AD 在北非扩展贸易路线的答案。

可以看到，即使有多个工具和验证步骤，智能体系统仍然可能犯重大错误。现在将在智能体架构上实现基于 GRPO 的训练，以解决这些问题并提高其推理和规划能力。

智能体 GRPO 算法实现

既然理解了 GRPO 如何用于智能体系统，让我们使用 GRPO 为智能体架构实现训练循环。这将涉及生成轨迹、计算奖励以及根据轨迹的相对优势更新策略模型。

首先导入所有必要的 Python 模块，设置硬件设备，并通过配置数据类定义全局超参数和模型路径：

# Import standard libraries
import os
import re
import json
import torch
import random
import numpy as np
import torch.nn.functional as F
from tqdm import tqdm
from typing import List, Dict, Any, Tuple, Optional
from dataclasses import dataclass, field
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from torch.optim import AdamW

还需要用于 LLM 交互、工具实现和 GRPO 算法的特定库。transformers 库用于加载和微调语言模型，而 peft 允许应用参数高效的微调技术，如 LoRA：

# Transformers & PEFT for efficient training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    get_scheduler
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    PeftModel
)

TrainingConfig 数据类包含所有关键超参数。注意模型的仔细选择：较小的 Qwen 1.5B 模型用于可训练的策略/规划器，而更强大的模型（Qwen 2.5B，可能托管在外部）用于固定环境（执行器/验证器）和奖励评判（GPT-4o）。

这种分离很重要：训练较小的模型以做出最佳决策，但信任较大的模型来模拟复杂环境并准确评判性能：

@dataclass
class TrainingConfig:
    """Global configuration for the training run, using Python's dataclass for structured setup."""

    # --- Data Config ---
    data_file: str = "./data/train/combined_train.parquet" # Input path for the combined training data.

    # --- Model Config ---
    base_model_name: str = "Qwen/Qwen2-1.5B-Instruct" # The model being trained (the Policy/Planner).
    fixed_model_name: str = "Qwen/Qwen2.5-7B-Instruct" # The powerful, fixed model for Execution/Verification.
    fixed_model_api_base: str = "http://localhost:8001/v1" # Endpoint for the fixed model (assumes a vLLM server).

    # --- Training Hyperparameters ---
    run_name: str = "flow_grpo_training_run_v1"
    output_dir: str = "./agentflow_checkpoints" # Directory to save checkpoints.
    learning_rate: float = 1e-6
    train_batch_size: int = 2 # Number of unique queries processed per optimization loop.
    rollout_n: int = 4 # N: Number of trajectories generated per unique query (GRPO group size).
    gradient_accumulation_steps: int = 4 # Accumulate gradients over this many effective steps before updating weights.
    num_train_epochs: int = 1

    # --- GRPO/PPO Hyperparameters ---
    ppo_clip_eps: float = 0.2  # PPO Clipping range (e.g., 20%). Prevents drastic policy updates.
    kl_coef: float = 0.01      # Coefficient for the KL-Divergence penalty (KL regularization).
    max_grad_norm: float = 1.0 # Gradient clipping value.

    # --- Agent Execution Config ---
    max_turns: int = 5         # Max steps the agent can take for a single query (trajectory length limit).
    max_seq_length: int = 4096 # Context window limit for the base model.

    # --- Tools Config ---
    # The list of tools the agent can use.
    enabled_tools: List[str] = field(default_factory=lambda: ["Python_Coder_Tool", "Wikipedia_RAG_Search_Tool", "Google_Search_Tool", "Base_Generator_Tool"])
    # The engine used by each tool instance (can be different from the Policy model).
    tool_engine: List[str] = field(default_factory=lambda: ["gpt-4o-mini", "gpt-4o-mini", "gpt-4o-mini", "gpt-4o-mini"])

    # --- Reward Config ---
    reward_model_name: str = "gpt-4o" # The high-quality model used as the impartial Judge.

基础模型是一个 15 亿参数的模型，足够小可以使用 QLoRA 和 LoRA 适配器高效地微调，但仍然足够强大可以学习复杂的推理模式。固定模型和奖励模型更大（25 亿和 GPT-4o），以确保它们能够提供丰富的环境和准确的奖励。

让我们理解一些训练超参数：

train_batch_size：这是每个优化步骤中处理的唯一查询数量。每个查询将生成 rollout_n 条轨迹，因此策略更新的有效批量大小是 train_batch_size * rollout_n。
rollout_n：这是为每个唯一查询生成的轨迹数量。这对 GRPO 算法至关重要，因为它允许根据组中所有轨迹的奖励计算每条轨迹的相对优势。
ppo_clip_eps：这是 PPO 的裁剪参数。它防止新策略在更新期间与旧策略偏离太多，有助于保持训练稳定性。
kl_coef：这个系数控制 KL 散度惩罚的强度，也有助于防止策略在一次更新中变化太剧烈。
max_grad_norm：这是梯度裁剪的值，防止梯度爆炸并有助于保持稳定的训练。

初始化策略模型（QLoRA 与 PEFT）

现在可以初始化配置并设置训练设备：

# Initialize Config
config = TrainingConfig()
os.makedirs(config.output_dir, exist_ok=True) # Ensure output directory exists.

# Set Device (prioritize GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### Output:
# Using device: cuda

在强化学习中，需要准确记录交互的每一步以计算策略梯度。TurnData 数据类捕获智能体多步决策过程中单个步骤（或回合）期间策略模型（规划器）生成的基本信息：

@dataclass
class TurnData:
    """Stores data for a single step (turn) in a trajectory for training."""
    prompt_str: str              # The input prompt (current state) given to the Planner LLM.
    action_str: str              # The LLM's full output (the action plan).
    prompt_ids: torch.Tensor     # Tokenized version of the prompt.
    action_ids: torch.Tensor     # Tokenized version of the action.

    # CRITICAL: The log likelihood of the action tokens under the *current* Policy model.
    # This is $log(\pi_{old}(a|s))$ in the PPO formulation.
    action_log_probs: torch.Tensor

在 TurnData 中，存储原始字符串（用于可解释性和调试）和标记化版本（用于训练）。action_log_probs 很重要，因为它表示旧策略下动作的概率。

现在可以初始化训练系统的核心组件：

分词器： 用于将文本提示转换为标记以及反向转换的必需工具。
策略模型（policy_model）： 正在训练的模型。使用 QLoRA（量化低秩适应） 以 4 位精度加载它，大幅减少 VRAM 使用，同时使用 PEFT（参数高效微调） 附加 LoRA 适配器，允许只训练模型参数的一小部分。
参考模型（ref_model）： 在 PPO/GRPO 中，需要先前的策略来计算重要性比率。在这里，最初将参考模型设置为等于策略模型，稍后使用上下文管理器（disable_adapter()）在不受当前 LoRA 权重影响的情况下计算参考对数概率。
固定的外部 LLM： 初始化执行/验证和奖励计算所需的强大外部 LLM 客户端。

print("--> Loading Tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name, trust_remote_code=True)

# Ensure padding token exists and set padding side to left (standard for generation/decoding).
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

print(f"--> Loading Trainable Planner Model ({config.base_model_name})...")

# Load model in 4-bit using BitsAndBytesConfig (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", # Normalized Float 4-bit quantization.
    bnb_4bit_compute_dtype=torch.bfloat16 # Use bfloat16 for computation.
)

policy_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name,
    quantization_config=bnb_config,
    device_map="auto", # Automatically distributes the model across available GPUs.
    trust_remote_code=True,
    use_cache=False # Disable cache for gradient checkpointing during training.
)

量化减少了模型大小，这样可以以 4 位精度加载 15 亿参数模型，这在内存效率和训练速度方面更高效。trust_remote_code=True 对于某些具有用于加载或分词的自定义代码的模型是必需的。

现在可以使用 LoRA 适配器准备模型进行训练。针对转换器架构中的所有主要投影层，以确保模型能够学习复杂推理任务的有效适应：

# Prepare model for k-bit training and define LoRA configuration.
policy_model = prepare_model_for_kbit_training(policy_model)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Target all major projection layers for optimal performance.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

policy_model = get_peft_model(policy_model, peft_config)
policy_model.print_trainable_parameters()

# The reference model starts identical to the policy model.
ref_model = policy_model
print("--> Initializing Fixed LLM Engines (Executor, Verifier, Reward)...")
try:
    # Initialize the fixed LLM for executing tool commands and verification logic.
    fixed_llm = create_llm_engine(config.fixed_model_name, base_url=config.fixed_model_api_base, temperature=0.0)
    # Initialize the reward LLM (Judge).
    reward_llm = create_llm_engine(config.reward_model_name, temperature=0.0)

    # Test connections to external APIs/servers.
    fixed_llm.generate("Ping")
    reward_llm.generate("Ping")
    print("   ✅ Fixed LLM and Reward LLM connections successful.")
except Exception as e:
    # Halt execution if critical external components are unreachable.
    raise ConnectionError(f"Could not connect to one of the LLM endpoints. Ensure servers are running. Error: {e}")

应用 LoRA 后，打印可训练参数的数量，确认只训练了总模型参数的一小部分：

### Output:
--> Loading Tokenizer...

--> Loading Trainable Planner Model (Qwen/Qwen2-1.5B-Instruct)...
trainable params: 16,777,216 || all params: 1,518,804,992 || trainable%: 1.1046

--> Initializing Fixed LLM Engines (Executor, Verifier, Reward)...
✅ Fixed LLM and Reward LLM connections successful.

正在训练 16.7M 参数（占总计 1.5B 参数的约 1.1%），这得益于 LoRA 的效率。

用于推演的智能体系统包装器

AgenticSystem 类模拟规划器策略运行的环境。它封装了单个训练推演所需的核心组件：

工具管理： 加载并提供对专用工具的访问。
状态生成： 根据查询和内存为规划器制定提示（状态 St）。
动作生成和对数概率计算： 使用策略模型生成下一个动作并捕获该动作的对数概率，这对 PPO 目标至关重要。

class AgenticSystem:
    """Manages the interaction between the Policy, the Tools, and the Fixed LLM Environment."""
    def __init__(self, policy_model, tokenizer, fixed_llm):
        self.policy_model = policy_model # The trainable model.
        self.tokenizer = tokenizer
        self.fixed_llm = fixed_llm # The external Executor/Verifier model.
        self.tools_map = self._load_tools() # Dictionary of active tool instances.
        self.memory = None # Agent's memory instance, reset per trajectory.

def _load_tools(self) -> Dict[str, BaseTool]:
        """Initializes the tools specified in the global configuration."""
        print("--> Loading Agent Tools...")
        tools = {}

        # Mapping tool names to their respective classes from utils.py.
        tool_classes = {
            "Python_Coder_Tool": Python_Coder_Tool,
            "Wikipedia_RAG_Search_Tool": Wikipedia_Search_Tool,
            "Base_Generator_Tool": Base_Generator_Tool
        }
        for i, name in enumerate(config.enabled_tools):
            engine = config.tool_engine[i]
            if name in tool_classes:
                print(f"    - Loading '{name}' with engine '{engine}'")

                # Instantiate the tool, passing the required engine name.
                tools[name] = tool_classes[name](model_string=engine)

        print("   ✅ Tools loaded.")
        return tools

这个方法获取当前上下文（查询和内存）并将其格式化为一致的提示。这个提示代表策略模型观察到的当前状态（St）：

def build_planner_prompt(self, question, available_tools, memory_actions):
    """Constructs the state prompt for the Planner model, providing all relevant context."""
    return f"""Task: Determine the optimal next step to address the query.

Context:
- Query: {question}
- Available Tools: {json.dumps(available_tools)} # List of tools for the Planner to choose from.
- Previous Steps: {json.dumps(memory_actions)} # The history (memory) of executed actions.

Response Format:
1. Justification: ...
2. Context: ...
3. Sub-Goal: ...
4. Tool Name: ...
Response:""" # The Planner continues the prompt from here, generating the action.

# Attaching the method to the class dynamically.
AgenticSystem.build_planner_prompt = build_planner_prompt

这可以说是策略推演中最复杂的部分。

生成轨迹并计算对数概率

对于强化学习训练，需要从策略模型获得两样东西：生成的动作（文本计划）和生成该标记序列的精确对数概率。

这个对数概率（logπ(a∣s)）是 PPO 重要性比率的基础：

@torch.no_grad()
def generate_planner_action(self, prompt_str: str) -> Tuple[str, torch.Tensor, torch.Tensor]:
    """Generates a thought/action plan from the policy model and computes log probabilities."""
    self.policy_model.eval() # Policy generation is done in evaluation mode.
    inputs = self.tokenizer(prompt_str, return_tensors="pt", truncation=True, max_length=config.max_seq_length).to(device)

    # Generate with sampling to allow exploration and diverse trajectories (crucial for GRPO).
    outputs = self.policy_model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7, # Higher temperature for exploration.
        top_p=0.9,
        pad_token_id=self.tokenizer.eos_token_id,
        output_scores=True, # MUST be True to get the logits (scores) for log prob calculation.
        return_dict_in_generate=True
    )

    # Extract sequences (only the generated part, excluding the input prompt).
    generated_ids = outputs.sequences[0, inputs.input_ids.shape[1]:]
    generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)

    # Compute Log Probs from the raw scores (logits).
    # 1. Stack scores: (num_generated_tokens x 1 x vocab_size) -> (1 x num_generated_tokens x vocab_size).
    all_logits = torch.stack(outputs.scores, dim=1)

    # 2. Convert logits to log probabilities using log_softmax.
    log_probs = F.log_softmax(all_logits, dim=-1)

    # 3. Gather the log probs corresponding to the specific tokens the model actually chose.
    # generated_ids: [seq_len] -> unsqueeze to [1, seq_len, 1] for torch.gather.
    action_log_probs = log_probs.gather(2, generated_ids.unsqueeze(0).unsqueeze(-1)).squeeze(-1).squeeze(0)

    # Return action text, token IDs, and their log probabilities (moved to CPU).
    return generated_text, generated_ids.cpu(), action_log_probs.cpu()

AgenticSystem.generate_planner_action = generate_planner_action

策略只生成一个计划（动作 At）。环境必须执行该计划（执行）并确定智能体是否应该继续（验证）。这个任务委托给强大的固定 LLM，以确保可靠的工具使用和反思，将其与可训练的策略模型解耦：

def run_executor_verifier(self, query: str, plan: NextStep) -> Tuple[str, str, str]:
    """Executes the chosen tool and uses the Fixed LLM to verify the result."""
    command_used, tool_output = "N/A", f"Error: Tool '{plan.tool_name}' not found."

    # 1. Execute Tool
    if plan.tool_name in self.tools_map:
        tool = self.tools_map[plan.tool_name]
        # Prompt the fixed LLM (Executor) to write the exact Python command.
        executor_prompt = f"""Task: Generate a precise command to execute the selected tool.

Context:
            - **Query:** {query}
            - **Sub-Goal:** {plan.sub_goal}
            - **Tool Name:** {plan.tool_name}
            - **Relevant Data:** {plan.context}
            Instructions: Construct valid Python code to call `tool.execute()` with the correct arguments to achieve the sub-goal. Assign the result to a variable named `execution`. Output only the code wrapped in ```python```."""
        try:
            # Use the fixed LLM to generate the structured tool command.
            command_response = self.fixed_llm.generate(executor_prompt, response_format=ToolCommand)
            command_used = command_response.command

            # Safe execution environment: `exec` runs the generated command.
            local_scope = {'tool': tool}
            exec(command_used, {}, local_scope)
            tool_output = local_scope.get('execution', "Error: 'execution' variable not found.")
        except Exception as e:
            tool_output = f"Execution failed: {e}"

    # 2. Verify Result (using the Fixed LLM as the Verifier)
    verifier_prompt = f"""Task: Evaluate if the current memory is complete enough to answer the query.
        Context:
        - Query: {query}
        - Memory: {json.dumps(self.memory.get_actions(), indent=2)}
        - Latest Action Result: {tool_output}
        Instructions: Is the query fully answered? Conclude your analysis with "Conclusion: STOP" or "Conclusion: CONTINUE"."""

    # Get the verification decision from the Fixed LLM.
    verify_resp = self.fixed_llm.generate(verifier_prompt)

    # Store the output in a truncated, serializable format for memory.
    return command_used, make_json_serializable_truncated(tool_output), verify_resp

AgenticSystem.run_executor_verifier = run_executor_verifier

这个方法为单个输入查询编排整个智能体过程。它循环执行规划、执行和验证，收集所有必要的 TurnData 记录（状态、动作、对数概率），直到任务标记为完成或达到 max_turns。收集的数据形成单个轨迹：

def run_trajectory(self, query: str) -> Tuple[List[TurnData], str]:
    """Runs a full multi-step rollout for a single query, collecting TurnData."""
    self.memory = Memory() # Start with fresh memory.
    turns_data = []
    final_answer = "No answer generated."

    for t in range(config.max_turns):
        # 1. Plan (Policy Action)
        planner_prompt = self.build_planner_prompt(query, list(self.tools_map.keys()), self.memory.get_actions())
        action_text, action_ids, action_log_probs = self.generate_planner_action(planner_prompt)

        # 2. Parse Action
        try:
            # Robustly load the structured plan from the Policy model's output.
            plan = NextStep(**json.loads(json_repair.loads(action_text)))
        except Exception:
            # Fail gracefully if parsing fails, forcing an early stop/self-answer attempt.
            plan = NextStep(justification="Parse failed", context="", sub_goal="Final Answer", tool_name="None")

        # Check for self-determined stop (i.e., the Policy believes it has the answer).
        if "final answer" in plan.sub_goal.lower() or plan.tool_name.lower() == "none":
            final_answer = plan.context
            # Store this last turn data.
            turns_data.append(TurnData(
                prompt_str=planner_prompt, action_str=action_text,
                prompt_ids=self.tokenizer(planner_prompt, return_tensors="pt").input_ids[0],
                action_ids=action_ids, action_log_probs=action_log_probs
            ))
            break

        # 3. Execute & Verify (Environment Interaction)
        command_used, tool_output, verify_decision = self.run_executor_verifier(query, plan)

        # 4. Update Memory
        self.memory.add_action(t, plan.tool_name, plan.sub_goal, command_used, tool_output)

        # 5. Store Turn Data for Training
        turns_data.append(TurnData(
            prompt_str=planner_prompt, action_str=action_text,
            prompt_ids=self.tokenizer(planner_prompt, return_tensors="pt").input_ids[0],
            action_ids=action_ids, action_log_probs=action_log_probs
        ))

        # 6. Check Verifier Stop (Environment signal to stop)
        if "STOP" in verify_decision.upper():
            # If the Verifier stops, use the Fixed LLM to generate the best possible final answer based on memory.
            generator_prompt = f"Based on this history, what is the final answer to the query '{query}'?\n\nHistory:\n{json.dumps(self.memory.get_actions(), indent=2)}"
            final_answer = self.fixed_llm.generate(generator_prompt)
            break
    else:
        # If max turns reached without a stop signal.
        final_answer = "Max turns reached."

    return turns_data, final_answer

AgenticSystem.run_trajectory = run_trajectory

在强化学习中，通过最小化从奖励导出的损失函数来更新策略。在这里定义分配奖励的机制和基于 PPO 的目标函数。

使用 GPT-4o 进行奖励建模

使用外部的强大 LLM（gpt-4o）作为评判来确定最终答案是否与真实标签匹配。这提供了对正确性的人类质量评估，为整个轨迹产生简单的二元奖励（成功为 1.0，失败为 0.0）：

def compute_reward(query: str, ground_truth: str, final_answer: str) -> float:
    """Computes a binary reward (1.0 or 0.0) using the Judge LLM."""
    prompt = f"""You are an impartial judge. Evaluate if the model's answer correctly addresses the query based on the ground truth.

Query: {query}
Ground Truth Answer: {ground_truth}
Model's Final Answer: {final_answer}
Is the model's answer correct?"""
    try:
        # Use the Judge LLM to determine correctness, forcing structured output.
        judgement = reward_llm.generate(prompt, response_format=AnswerVerification)
        return 1.0 if judgement.true_false else 0.0
    except Exception:
        # Fallback: simple string match if the Judge LLM API call or parsing fails.
        return 1.0 if str(ground_truth).lower() in str(final_answer).lower() else 0.0

compute_reward 函数很重要，因为它将最终答案的质量转化为可用于策略优化的奖励信号。

创建优势函数和 PPO 损失

compute_ppo_loss 函数实现核心优化目标。它获取轨迹和预先计算的优势（GRPO 信号）并计算 PPO 损失，该损失由两个主要项组成：

裁剪替代损失： 确保策略更新朝着更高奖励的方向移动，同时保持接近参考策略（裁剪参数 ϵ）。
KL 散度惩罚： 一个正则化器（KL_coef），防止策略与参考模型偏离太远，确保训练稳定性。

def compute_ppo_loss(
    policy_model: PeftModel,
    ref_model: PeftModel,
    tokenizer: AutoTokenizer,
    trajectories: List[List[TurnData]], # A batch of trajectories.
    advantages: torch.Tensor # The GRPO advantage computed for each trajectory.
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Computes the PPO/GRPO loss for a batch of trajectories."""
    total_policy_loss = torch.tensor(0.0, device=device)
    total_kl_div = torch.tensor(0.0, device=device)
    valid_trajectories = 0

for i, trajectory in enumerate(trajectories):
        if not trajectory: continue

        # --- Data Preparation for Batching ---
        # The model needs the full sequence (Prompt + Action) to calculate log probabilities correctly.
        full_input_ids_list = [trajectory[0].prompt_ids]
        # Labels are masked. We set labels for Prompt tokens to -100 (ignored in loss).
        full_labels_list = [torch.full_like(trajectory[0].prompt_ids, -100)]

        for turn in trajectory:
            full_input_ids_list.append(turn.action_ids)
            full_labels_list.append(turn.action_ids) # Labels for Action tokens are the tokens themselves.

        input_ids = torch.cat(full_input_ids_list, dim=-1).to(device)
        labels = torch.cat(full_labels_list, dim=-1).to(device)

        # --- Policy Log Probs (New Policy) ---
        outputs = policy_model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0))
        # HuggingFace loss is often mean loss. We scale it up by the number of unmasked tokens.
        neg_log_probs = outputs.loss * (labels != -100).sum()
        log_probs = -neg_log_probs # Policy log probability for the *entire* action sequence.

        # --- Reference Log Probs (Old Policy) ---
        # Calculate log probs under the reference model (without current LoRA adapters).
        with ref_model.disable_adapter(), torch.no_grad():
            ref_outputs = ref_model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0))
            ref_log_probs = -ref_outputs.loss * (labels != -100).sum()

        # --- PPO Core Logic ---
        # Old log probs come from the TurnData collected during rollout.
        old_log_prob = torch.cat([turn.action_log_probs for turn in trajectory]).sum().to(device)

        # 1. Importance Ratio: pi_new / pi_old
        ratio = torch.exp(log_probs - old_log_prob)
        advantage = advantages[i] # The normalized GRPO advantage signal.

        # 2. Clipped Surrogate Loss Calculation
        surr1 = ratio * advantage
        # The PPO clipping term: clamps the ratio to [1 - eps, 1 + eps].
        surr2 = torch.clamp(ratio, 1.0 - config.ppo_clip_eps, 1.0 + config.ppo_clip_eps) * advantage
        # We maximize the minimum of the two surrogates (hence the -torch.min for gradient descent).
        policy_loss = -torch.min(surr1, surr2)

        total_policy_loss += policy_loss

        # 3. KL Divergence for regularization
        kl_div = log_probs - ref_log_probs
        total_kl_div += kl_div

        valid_trajectories += 1

    if valid_trajectories == 0:
        return torch.tensor(0.0, device=device), torch.tensor(0.0, device=device)

    # Return the average loss components over the batch of trajectories.
    return total_policy_loss / valid_trajectories, total_kl_div / valid_trajectories

训练过程从在前一个笔记本中准备的组合训练数据集中提取查询。使用 Hugging Face datasets 库来高效加载数据，并将其包装在标准 PyTorch DataLoader 中：

print(f"--> Loading training data from {config.data_file}...")
if not os.path.exists(config.data_file):
    raise FileNotFoundError(f"Data file not found at {config.data_file}")

# Load dataset using the Hugging Face `datasets` library.
full_dataset = load_dataset("parquet", data_files=config.data_file, split="train")
print(f"   ✅ Loaded {len(full_dataset)} training examples.")

# Simple wrapper to make the Hugging Face dataset compatible with PyTorch DataLoader.
class SimpleDataset(Dataset):
    def __init__(self, hf_dataset): self.hf_dataset = hf_dataset
    def __len__(self): return len(self.hf_dataset)
    def __getitem__(self, idx): return self.hf_dataset[idx]
train_data = SimpleDataset(full_dataset)

# The DataLoader yields batches of unique queries (size = config.train_batch_size).
train_dataloader = DataLoader(train_data, batch_size=config.train_batch_size, shuffle=True)

这一部分将智能体、强化学习目标和数据流程整合在一起。它编排 Flow-GRPO 过程：

组推演： 对于批次中的每个查询，生成 N 条轨迹。
优势计算： N 个奖励相对于其组均值和标准差归一化，以计算优势（GRPO 信号）。
策略更新： 使用这些优势计算 PPO 损失，并通过优化器应用到策略模型。

# Initialize System
agent_system = AgenticSystem(policy_model, tokenizer, fixed_llm)

# Optimizer
optimizer = AdamW(policy_model.parameters(), lr=config.learning_rate)

# Learning Rate Scheduler
num_update_steps_per_epoch = len(train_dataloader) # Calculate total training steps.
total_training_steps = config.num_train_epochs * num_update_steps_per_epoch
scheduler = get_scheduler(
    "cosine", # Use a cosine learning rate decay schedule.
    optimizer=optimizer,
    num_warmup_steps=int(total_training_steps * 0.1), # Warmup phase for stability.
    num_training_steps=total_training_steps
)

在积累 gradient_accumulation_steps 批次的唯一查询的梯度后更新策略模型。这允许有效地增加批量大小而不会遇到内存问题，这在训练具有复杂轨迹的大型模型时至关重要。

运行 GRPO 训练循环

编译了所有内容，让我们运行训练循环…

print("\n--- 8. Starting Flow-GRPO Training Loop ---")
print(f"Total Epochs: {config.num_train_epochs}")
print(f"Steps per Epoch: {len(train_dataloader)}")

global_step = 0

for epoch in range(config.num_train_epochs):
    print(f"\n===== Epoch {epoch + 1}/{config.num_train_epochs} ====")

    # Iterate over the dataset batches (queries)
    for step, batch in enumerate(tqdm(train_dataloader, desc=f"Epoch {epoch+1}")):

        optimizer.zero_grad() # Reset gradients for the batch.
        batch_loss = 0.0

        # --- Gradient Accumulation Loop ---
        # The outer loop processes train_batch_size unique queries.
        for i in range(len(batch['question'])):
            query = batch['question'][i]
            ground_truth = batch['result'][i]

            # --- Flow-GRPO: Group Rollout (N=rollout_n) ---
            group_trajectories = []
            group_rewards = []

            policy_model.eval() # Policy must be in eval mode for generating rollouts.

            for _ in range(config.rollout_n):
                # 1. Run Agent Rollout
                trajectory, final_answer = agent_system.run_trajectory(query)
                # 2. Calculate Reward (Judge LLM)
                reward = compute_reward(query, ground_truth, final_answer)

                group_trajectories.append(trajectory)
                group_rewards.append(reward)

            # --- Calculate Advantages (GRPO Logic) ---
            rewards_tensor = torch.tensor(group_rewards, device=device, dtype=torch.float32)

            if len(group_trajectories) == 0: continue

            # Calculate Advantage relative to the group mean.
            mean_reward = rewards_tensor.mean()
            std_reward = rewards_tensor.std() + 1e-8 # Add epsilon for stability.
            # Advantage = (Individual Reward - Group Mean) / Group Std Dev.
            advantages = (rewards_tensor - mean_reward) / std_reward

            # --- Policy Update Step ---
            policy_model.train() # Switch back to train mode for gradient computation.

            # Compute the PPO loss for this group of trajectories.
            policy_loss, kl_div = compute_ppo_loss(policy_model, ref_model, tokenizer, group_trajectories, advantages)

            # Total loss = PPO Policy Loss + KL Regularization Penalty.
            loss = policy_loss + config.kl_coef * kl_div

            # Normalize loss for gradient accumulation.
            loss = loss / (len(batch['question']) * config.gradient_accumulation_steps)
            loss.backward() # Backpropagation to accumulate gradients.
            batch_loss += loss.item()
            # Optional: Clear cache to prevent OOM
            torch.cuda.empty_cache()

        # Optimization Step (Triggered after accumulation or at the end of the batch)
        if (step + 1) % config.gradient_accumulation_steps == 0:
            # Clip gradients to prevent exploding gradients.
            torch.nn.utils.clip_grad_norm_(policy_model.parameters(), config.max_grad_norm)
            optimizer.step() # Apply gradients.
            scheduler.step() # Update learning rate.
            optimizer.zero_grad() # Reset gradients for the next accumulation cycle.
            global_step += 1

            tqdm.write(f"Step {global_step}: Loss={batch_loss:.6f}, Avg Reward (last group)={mean_reward.item():.2f}")

    # --- Save Checkpoint at end of Epoch ---
    checkpoint_dir = os.path.join(config.output_dir, f"epoch_{epoch+1}")
    policy_model.save_pretrained(checkpoint_dir) # Save LoRA adapters.
    tokenizer.save_pretrained(checkpoint_dir)
    print(f"✅ Checkpoint saved to {checkpoint_dir}")

print("\n🎉 Training Complete!")

运行此训练循环时，将看到每个优化步骤的损失和平均奖励…

# --- 8. Starting Flow-GRPO Training Loop ---
Total Epochs: 1
Steps per Epoch: 91095

# ===== Epoch 1/1 =====
Step 1: Loss=1.312894, Avg Reward (last group)=0.29
Step 2: Loss=1.198301, Avg Reward (last group)=0.35
Step 3: Loss=1.054593, Avg Reward (last group)=0.32
Step 4: Loss=1.267018, Avg Reward (last group)=0.38
Step 5: Loss=1.112345, Avg Reward (last group)=0.31
Step 6: Loss=1.098765, Avg Reward (last group)=0.42
Step 7: Loss=0.987654, Avg Reward (last group)=0.27
...
Step 59: Loss=0.198765, Avg Reward (last group)=0.82
...
Step 98: Loss=0.031234, Avg Reward (last group)=1.00
Step 99: Loss=0.015678, Avg Reward (last group)=0.99
Step 100: Loss=0.026789, Avg Reward (last group)=1.00

✅ Checkpoint saved to ./agentflow_checkpoints/epoch_1

🎉 Training Complete!

可以看到随着训练进行，损失下降，最后一组轨迹的平均奖励增加，但这并不能保证模型权重实际上在改进，因为只运行了 1 个时期和 100 个步骤。让我们运行更多时期和步骤以查看奖励的真正改进和损失的减少。

运行优化后的规划智能体

现在规划智能体已经使用 Flow-GRPO 进行了训练，可以在其上重新运行相同的复杂查询，看看训练后的表现如何。首先，必须从检查点加载训练后的模型，并使用训练后的模型运行 vLLM 服务器：

# Load the trained model (after training is complete).
vllm serve ./agentflow_checkpoints/epoch_1 \
    --served-model-name Qwen/Qwen2-1.5B-Instruct \
    --quantization bitsandbytes \
    --enable-lora \
    --port 8000

一旦启动并运行，可以使用相同的 AgenticSystem 类运行复杂查询，看看训练后的策略在工具使用、推理步骤和最终答案质量方面的表现：

# Initialize the Agentic System with the trained model.
trained_policy_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_cache=False
)

trained_policy_model = prepare_model_for_kbit_training(trained_policy_model)
trained_policy_model = PeftModel.from_pretrained(trained_policy_model, "./agentflow_checkpoints/epoch_1")
trained_agent_system = AgenticSystem(trained_policy_model, tokenizer, fixed_llm)

让我们运行训练前用于评估的相同复杂查询，看看训练后的智能体表现如何：

def construct_solver():
    """Configures and constructs an instance of the Solver agent with the trained policy model."""
    # Define which LLM model to use for each agent role.
    planner_main_engine = config.base_model_name
    planner_fixed_engine = config.fixed_model_name
    verifier_engine = config.fixed_model_name
    executor_engine = config.fixed_model_name

# Define the list of tools the agent should have access to.
    enabled_tools = ["Base_Generator_Tool", "Python_Coder_Tool", "Google_Search_Tool", "Wikipedia_RAG_Search_Tool"]

    # Define the specific LLM engine for each tool. 'Default' means the tool doesn't need an LLM.
    tool_engine = [MODEL_NAME, MODEL_NAME, "Default", MODEL_NAME]

    return Solver(planner_main_engine, planner_fixed_engine, verifier_engine, executor_engine, enabled_tools, tool_engine, max_steps=5)

# ===================
# RUN THE SOLVER
# ===================

# 1. Create the solver instance based on our configuration.
solver = construct_solver()

# 2. Define the complex, multi-step query we want the agent to solve.
query_to_solve = "Calculate 12 squared, and then use Wikipedia to find out what major historical event happened in that year (AD)."

# 3. Call the .solve() method to start the agentic workflow.
final_answer = solver.solve(query_to_solve)

# 4. Print the final, synthesized answer produced by the agent.
print(f"\n{'='*80}\n✅ FINAL SYNTHESIZED ANSWER\n{'='*80}\n{final_answer}")

这是训练后的规划智能体工作流程…

# ==> Initializing LLM engines...
Planner / Verifier / Executor: meta-llama/Llama-3.3-70B-Instruct-fast

# ==> Initializing agent...
Enabled tools:
['Generalist_Solution_Generator_Tool',
 'Python_Code_Generator_Tool',
 'Ground_Google_Search_Tool',
 'Wikipedia_RAG_Search_Tool']

# ================================================================================
🔍 Query:
Calculate 12 squared, then find a major historical event in that year AD
# ================================================================================

==> Step 0: Query Analysis
[Planner]:
- Tasks: arithmetic calculation + historical lookup
- Selected Tools:
    Python_Code_Generator_Tool
    Wikipedia_RAG_Search_Tool
    Ground_Google_Search_Tool
    Generalist_Solution_Generator_Tool

# ================================================================================
==> Step 1: Math Calculation
# ================================================================================
Selected Tool: Python_Code_Generator_Tool

Generated Code:
print(12**2)

Execution Output:
144

[Verifier]:
- Math result correct ✔
- Continue to historical lookup

# ================================================================================
==> Step 2: Historical Search (Wikipedia RAG)
# ================================================================================
Command:
tool.execute(query="major historical events 144 AD")

Output:
422 Execution Error

[Verifier]:
- Tool failure detected
- Continue with alternate search

# ================================================================================
==> Step 3: Web Search Attempt
# ================================================================================
Selected Tool: Ground_Google_Search_Tool

Command:
tool.execute(query="important historical events in 144 AD")

Output:
422 Execution Error

[Planner Adjustment]:
- Escalate to fallback solution generator

# ================================================================================
==> Step 4: Fallback Generation
# ================================================================================
Selected Tool: Generalist_Solution_Generator_Tool

Output:
"In 144 AD, the Roman Empire experienced continued stability
under Emperor Antoninus Pius."

[Verifier]:
- Arithmetic correct ✔
- Historical claim plausible ✔
- Stop Signal: TRUE

# ================================================================================
✅ FINAL ANSWER
# ================================================================================

12 squared = 144

A notable historical context around 144 AD:
The Roman Empire experienced stability during the reign
of Antoninus Pius.

虽然输出内容较大，这里只显示最终答案的相关部分，完整输出可以查看笔者的笔记本。

可以看到最终综合的答案包括 12 的平方结果和发生在 144 AD 的主要历史事件摘要，完全解决了原始查询。

Wikipedia_RAG_Search_Tool 和 Generalist_Solution_Generator_Tool 没有产生结果，Ground_Google_Search_Tool 提供了完成答案所需的信息，这表明智能体的规划阶段在一个工具没有产生结果时能够有效地切换工具，规划阶段也得到了改进。

如果觉得这篇文章有用，可以在 Medium 上关注笔者。