Tool-style applications such as MCP have become an essential capability of LLM-based agent systems.

However, not every LLM implements with_structured_output, since not all of them support tool calling or a JSON output mode.

For such models, this post explores a custom approach to extracting information in a fixed, structured format.

1 LLM setup

Create the LLM with LangChain's ChatOpenAI module; a sample program is shown below.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"
os.environ['OPENAI_BASE_URL'] = "https://model_provider_url/v1"

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="qwen3-70b", temperature=0)

response=llm.invoke("张三25岁,并且有168厘米")
print(response.content)

response=llm.invoke("李四28岁,并且有172厘米")
print(response.content)

Sample output is shown below. Clearly, calling the LLM directly like this cannot satisfy the structured-extraction needs of agent tool applications.

好的,张三今年25岁,身高168厘米。请问您需要根据这些信息进行什么操作或分析?例如计算BMI、评估健康状况,还是用于其他用途?欢迎补充更多细节!
好的,李四今年28岁,身高172厘米。  
如果您需要根据这些信息进行进一步的分析(比如健康评估、BMI计算、或其他用途),请告诉我具体需求,我可以帮您继续处理!

2 Demonstrating the extraction problem

This section uses concrete examples to demonstrate structured information extraction based on with_structured_output.

2.1 with_structured_output example

The with_structured_output method of a LangChain LLM lets the model return its output as content with a fixed structure.

For a usage example of with_structured_output, see the following link:

https://blog.csdn.net/liliang199/article/details/155109021

Below is an example where with_structured_output works as expected.

from typing import Optional
from pydantic import BaseModel, Field
from langchain_ollama import ChatOllama
 
# Initialize the Ollama LLM; the ollama service must be running in the background
model_name = "gemma3n:e2b"
llm = ChatOllama(model=model_name)

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")
 
structured_llm = llm.with_structured_output(Person)  # bind the structured output schema
structured_llm.invoke("张三25岁,并且有168厘米")

The output is shown below: the LLM parses the information successfully and returns it as a fixed-structure object.

Person(name='张三', height_in_meters=1.68)
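The result is an ordinary Pydantic object, so downstream code can read its fields directly. Below is a minimal usage sketch based on the structured_llm defined above (the exact values depend on the model):

person = structured_llm.invoke("李四28岁,并且有172厘米")
print(person.name)              # expected: 李四
print(person.height_in_meters)  # expected: 1.72 (converted from 172 cm)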

2.2 with_structured_output failure

However, not every model properly supports the with_structured_output method.

Take the following example: the code of an agent submodule that decides whether a user question is related to movies.

The module calls with_structured_output(GuardrailsOutput) directly, expecting the LLM to return a structured GuardrailsOutput object whose decision field can then be matched.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"
os.environ['OPENAI_BASE_URL'] = "https://model_provider_url/v1"

from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(model="qwen3-70b", temperature=0)


guardrails_system = """
As an intelligent assistant, your primary objective is to decide whether a given question is related to movies or not. 
If the question is related to movies, output "movie". Otherwise, output "end".
To make this decision, assess the content of the question and determine if it refers to any movie, actor, director, film industry, 
or related topics. Provide only the specified output: "movie" or "end".
"""
guardrails_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            guardrails_system,
        ),
        (
            "human",
            ("{question}"),
        ),
    ]
)


class GuardrailsOutput(BaseModel):
    decision: Literal["movie", "end"] = Field(
        description="Decision on whether the question is related to movies"
    )

guardrails_chain = guardrails_prompt_template | llm.with_structured_output(GuardrailsOutput)


question = "What was the cast of the Casino?"
checked = guardrails_chain.invoke({"question": question})
print(f"checked: {checked}")

The output is shown below; clearly the current LLM does not support with_structured_output. The provider rejects the request because, when response_format is of type json_object, it requires the word "json" to appear somewhere in the messages.

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[361], line 36
     31 guardrails_chain = guardrails_prompt_template | llm.with_structured_output(GuardrailsOutput)
     35 question = "What was the cast of the Casino?"
---> 36 checked = guardrails_chain.invoke({"question": question})
     37 print(f"checked: {checked}")

]
............
/site-packages/openai/_base_client.py:1047, in SyncAPIClient.request(self, cast_to, options, stream, stream_cls)
   1044             err.response.read()
   1046         log.debug("Re-raising status error")
-> 1047         raise self._make_status_error_from_response(err.response) from None
   1049     break
   1051 assert response is not None, "could not resolve response (should never happen)"

BadRequestError: Error code: 400 - {'error': {'message': "<400> InternalError.Algo.InvalidParameter: 'messages' must contain the word 'json' in some form, to use 'response_format' of type 'json_object'.", 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_parameter_error'}, 'id': 'xxxxx', 'request_id': 'xxxx'}
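Before writing a custom parser, one option worth trying is the method argument of with_structured_output, which switches between tool calling and JSON modes; whether this helps depends entirely on the model behind the OpenAI-compatible endpoint, so treat the following as a hedged sketch rather than a guaranteed fix.

# May or may not work: only helps if the endpoint supports OpenAI-style tool calling
structured_llm = llm.with_structured_output(GuardrailsOutput, method="function_calling")

If the endpoint supports neither tool calling nor a JSON mode, the custom parser approach in the next section is the more robust fallback.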

3 Custom information extraction

This section shows how to fix the extraction failure above by defining a custom parser.

3.1 Custom parser

For the failing case in section 2.2, we build a custom parser with PydanticOutputParser to perform the structured extraction.

The core logic is as follows:

1) Define a custom parser with PydanticOutputParser that describes the fields of GuardrailsOutput to be extracted.

2) Add the following explicit hint to the system prompt:

By the way, wrap the output in `json` tags\n{format_instructions}

Its purpose is to tell the LLM to follow the given format instructions when producing output.

Here {format_instructions} comes from parser.get_format_instructions().

3) Parse the LLM output.

Because the output is constrained by the format_instructions, the parser can parse it and return a GuardrailsOutput object.

Sample code is shown below.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxx"
os.environ['OPENAI_BASE_URL'] = "https://model_provider_url/v1"

from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(model="qwen3-70b", temperature=0)

guardrails_system = """
As an intelligent assistant, your primary objective is to decide whether a given question is related to movies or not. 
If the question is related to movies, output "movie". Otherwise, output "end".
To make this decision, assess the content of the question and determine if it refers to any movie, actor, director, film industry, 
or related topics. Provide only the specified output: "movie" or "end".

By the way, wrap the output in `json` tags\n{format_instructions}
"""
guardrails_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            guardrails_system,
        ),
        (
            "human",
            ("{question}"),
        ),
    ]
)


class GuardrailsOutput(BaseModel):
    decision: Literal["movie", "end"] = Field(
        description="Decision on whether the question is related to movies"
    )

parser = PydanticOutputParser(pydantic_object=GuardrailsOutput)
guardrails_prompt = guardrails_prompt_template.partial(format_instructions=parser.get_format_instructions())
guardrails_chain = guardrails_prompt | llm | parser  # replaces llm.with_structured_output(GuardrailsOutput)
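To confirm that the format instructions were actually injected into the system message, the prompt can be rendered before invoking the chain; a minimal sketch (output abridged):

messages = guardrails_prompt.format_messages(question="What was the cast of the Casino?")
print(messages[0].content)  # the system message now ends with the JSON schema of GuardrailsOutput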

3.2 Testing the parser

With the chain defined, run a concrete question to test whether the parser takes effect.

question = "What was the cast of the Casino?"
checked = guardrails_chain.invoke({"question": question,})
print(f"checked: {checked}")

The output is shown below. With the new prompt instructions, the parser correctly recognizes the LLM's output and the program runs without errors.

checked: decision='movie'
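Because checked is a GuardrailsOutput instance, its decision field can be used directly for routing inside the agent; a minimal sketch of how a downstream branch might consume it:

if checked.decision == "movie":
    # hand the question to the movie-related branch of the agent
    ...
else:
    # end the conversation or fall back to a default handler
    ...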

In this way, a custom parser gives the LLM the ability to output fixed, structured information, a capability LLM systems should have in an era where tools are so widely used.

reference

---

How LangChain decides whether a Neo4j knowledge graph can answer a question

https://blog.csdn.net/liliang199/article/details/155074227

An example of custom structured output with LLMs

https://blog.csdn.net/liliang199/article/details/155109021

LangChain advanced 01: structured output (StructuredOutput)

https://zhuanlan.zhihu.com/p/15349262663
