Tool-style applications such as MCP have become an essential capability of LLM-based agent systems.

However, not every LLM implements with_structured_output, since not all of them support tool calling or a JSON output mode.

For such models, this post explores a custom approach to extracting information in a fixed, structured format.

1 LLM setup

Create the LLM with LangChain's ChatOpenAI module; a sample program is shown below.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"
os.environ['OPENAI_BASE_URL'] = "https://model_provider_url/v1"

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="qwen3-70b", temperature=0)

response=llm.invoke("张三25岁,并且有168厘米")
print(response.content)

response=llm.invoke("李四28岁,并且有172厘米")
print(response.content)

Sample output is shown below. Clearly, calling the LLM directly like this cannot satisfy the structured-extraction needs of agent tool applications.

好的,张三今年25岁,身高168厘米。请问您需要根据这些信息进行什么操作或分析?例如计算BMI、评估健康状况,还是用于其他用途?欢迎补充更多细节!
好的,李四今年28岁,身高172厘米。  
如果您需要根据这些信息进行进一步的分析(比如健康评估、BMI计算、或其他用途),请告诉我具体需求,我可以帮您继续处理!

2 Demonstrating the extraction problem

This section uses concrete examples to demonstrate structured information extraction based on with_structured_output.

2.1 with_structured_output example

The with_structured_output method of a LangChain LLM lets the model return its output as content with a fixed structure.

For a usage example of with_structured_output, see the following link:

https://blog.csdn.net/liliang199/article/details/155109021

Below is an example where with_structured_output works as expected.

from typing import Optional
from pydantic import BaseModel, Field
from langchain_ollama import ChatOllama
 
# Initialize the Ollama LLM; the ollama service must be running in the background
model_name = "gemma3n:e2b"
llm = ChatOllama(model=model_name)

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")
 
structured_llm = llm.with_structured_output(Person)  # bind the structured output schema
structured_llm.invoke("张三25岁,并且有168厘米")

The output is shown below: the LLM parses the information successfully and returns it as a fixed-structure object.

Person(name='张三', height_in_meters=1.68)
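The result is an ordinary Pydantic object, so downstream code can read its fields directly. Below is a minimal usage sketch based on the structured_llm defined above (the exact values depend on the model):

person = structured_llm.invoke("李四28岁,并且有172厘米")
print(person.name)              # expected: 李四
print(person.height_in_meters)  # expected: 1.72 (converted from 172 cm)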

2.2 with_structured_output failure

However, not every model properly supports the with_structured_output method.

Take the following example: the code of an agent submodule that decides whether a user question is related to movies.

The module calls with_structured_output(GuardrailsOutput) directly, expecting the LLM to return a structured GuardrailsOutput object whose decision field can then be matched.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"
os.environ['OPENAI_BASE_URL'] = "https://model_provider_url/v1"

from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(model="qwen3-70b", temperature=0)


guardrails_system = """
As an intelligent assistant, your primary objective is to decide whether a given question is related to movies or not. 
If the question is related to movies, output "movie". Otherwise, output "end".
To make this decision, assess the content of the question and determine if it refers to any movie, actor, director, film industry, 
or related topics. Provide only the specified output: "movie" or "end".
"""
guardrails_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            guardrails_system,
        ),
        (
            "human",
            ("{question}"),
        ),
    ]
)


class GuardrailsOutput(BaseModel):
    decision: Literal["movie", "end"] = Field(
        description="Decision on whether the question is related to movies"
    )

guardrails_chain = guardrails_prompt_template | llm.with_structured_output(GuardrailsOutput)


question = "What was the cast of the Casino?"
checked = guardrails_chain.invoke({"question": question})
print(f"checked: {checked}")

The output is shown below; clearly the current LLM does not support with_structured_output. The provider rejects the request because, when response_format is of type json_object, it requires the word "json" to appear somewhere in the messages.

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[361], line 36
     31 guardrails_chain = guardrails_prompt_template | llm.with_structured_output(GuardrailsOutput)
     35 question = "What was the cast of the Casino?"
---> 36 checked = guardrails_chain.invoke({"question": question})
     37 print(f"checked: {checked}")

]
............
/site-packages/openai/_base_client.py:1047, in SyncAPIClient.request(self, cast_to, options, stream, stream_cls)
   1044             err.response.read()
   1046         log.debug("Re-raising status error")
-> 1047         raise self._make_status_error_from_response(err.response) from None
   1049     break
   1051 assert response is not None, "could not resolve response (should never happen)"

BadRequestError: Error code: 400 - {'error': {'message': "<400> InternalError.Algo.InvalidParameter: 'messages' must contain the word 'json' in some form, to use 'response_format' of type 'json_object'.", 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_parameter_error'}, 'id': 'xxxxx', 'request_id': 'xxxx'}
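Before writing a custom parser, one option worth trying is the method argument of with_structured_output, which switches between tool calling and JSON modes; whether this helps depends entirely on the model behind the OpenAI-compatible endpoint, so treat the following as a hedged sketch rather than a guaranteed fix.

# May or may not work: only helps if the endpoint supports OpenAI-style tool calling
structured_llm = llm.with_structured_output(GuardrailsOutput, method="function_calling")

If the endpoint supports neither tool calling nor a JSON mode, the custom parser approach in the next section is the more robust fallback.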

3 Custom information extraction

This section shows how to fix the extraction failure above by defining a custom parser.

3.1 Custom parser

For the failing case in section 2.2, we build a custom parser with PydanticOutputParser to perform the structured extraction.

The core logic is as follows:

1) Define a custom parser with PydanticOutputParser that describes the fields of GuardrailsOutput to be extracted.

2) Add the following explicit hint to the system prompt:

By the way, wrap the output in `json` tags\n{format_instructions}

Its purpose is to tell the LLM to follow the given format instructions when producing output.

Here {format_instructions} comes from parser.get_format_instructions().

3) Parse the LLM output.

Because the output is constrained by the format_instructions, the parser can parse it and return a GuardrailsOutput object.

Sample code is shown below.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"
os.environ['OPENAI_BASE_URL'] = "https://model_provider_url/v1"

from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(model="qwen3-70b", temperature=0)

guardrails_system = """
As an intelligent assistant, your primary objective is to decide whether a given question is related to movies or not. 
If the question is related to movies, output "movie". Otherwise, output "end".
To make this decision, assess the content of the question and determine if it refers to any movie, actor, director, film industry, 
or related topics. Provide only the specified output: "movie" or "end".

By the way, wrap the output in `json` tags\n{format_instructions}
"""
guardrails_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            guardrails_system,
        ),
        (
            "human",
            ("{question}"),
        ),
    ]
)


class GuardrailsOutput(BaseModel):
    decision: Literal["movie", "end"] = Field(
        description="Decision on whether the question is related to movies"
    )

parser = PydanticOutputParser(pydantic_object=GuardrailsOutput)
guardrails_prompt = guardrails_prompt_template.partial(format_instructions=parser.get_format_instructions())
guardrails_chain = guardrails_prompt | llm | parser  # replaces llm.with_structured_output(GuardrailsOutput)
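To confirm that the format instructions were actually injected into the system message, the prompt can be rendered before invoking the chain; a minimal sketch (output abridged):

messages = guardrails_prompt.format_messages(question="What was the cast of the Casino?")
print(messages[0].content)  # the system message now ends with the JSON schema of GuardrailsOutput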

3.2 Testing the parser

With the chain defined, run a concrete question to test whether the parser takes effect.

question = "What was the cast of the Casino?"
checked = guardrails_chain.invoke({"question": question,})
print(f"checked: {checked}")

The output is shown below. With the new prompt instructions, the parser correctly recognizes the LLM's output and the program runs without errors.

checked: decision='movie'
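Because checked is a GuardrailsOutput instance, its decision field can be used directly for routing inside the agent; a minimal sketch of how a downstream branch might consume it:

if checked.decision == "movie":
    # hand the question to the movie-related branch of the agent
    ...
else:
    # end the conversation or fall back to a default handler
    ...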

In this way, a custom parser gives the LLM the ability to output fixed, structured information, a capability LLM systems should have in an era where tools are so widely used.

reference

---

How LangChain decides whether a Neo4j knowledge graph can answer a question

https://blog.csdn.net/liliang199/article/details/155074227

An example of custom structured output with LLMs

https://blog.csdn.net/liliang199/article/details/155109021

LangChain advanced 01: structured output (StructuredOutput)

https://zhuanlan.zhihu.com/p/15349262663
