Examples of Custom Structured Output with LLMs

liliangcsdn · Published 2025-11-21 23:42:17

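An LLM can parse the information in its input and return it in a structured form such as JSON, TypeScript, or XML.

However, LLM output is not controllable: across repeated runs there is no guarantee that every reply is valid structured data.

Here, on top of a LangChain llm, we use with_structured_output or a custom approach so that the output comes back in a fixed structure.

The test cases are adapted from online material.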

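1 Creating the LLM

We use an Ollama model, and assume that Ollama is installed and that the gemma3n:e2b model has been pulled successfully.

An example of testing and calling the model is shown below.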

from langchain_ollama import ChatOllama

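# Initialize the Ollama LLM; the ollama service must be running in the background
model_name = "gemma3n:e2b"
llm = ChatOllama(model=model_name)

# Query: "张三 is 25 years old and 168 cm tall"
response = llm.invoke("张三25岁，并且有168厘米")
print(response.content)

# Query: "李四 is 28 years old and 172 cm tall"
response = llm.invoke("李四28岁，并且有172厘米")
print(response.content)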

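As the output below shows (translated here from the model's Chinese replies), the model understands the input, but the form of its output is uncontrolled, which makes downstream processing awkward.

OK, 张三 is 25 years old and 168 cm tall.

This gives some basic information about 张三. What would you like to know about him? For example:

* **What is his approximate weight?** (This has to be estimated from his height; at 168 cm, a weight of 55-75 kg is fairly common)
* **How is his health?** (This needs more information, such as any medical history)
* **What are his hobbies?**
* **What is his occupation?**
* **What would you like me to do?** (For example, look up related information for him, or offer some suggestions)

Please tell me what exactly you would like to know, and I will do my best to help.

OK, 李四 is 28 years old and 172 cm tall.

This gives some basic information about 李四. What would you like to learn about him? For example:

* **His physical condition?** (For example, whether he is healthy, whether he has any illnesses)
* **His occupation?**
* **His lifestyle?** (For example, diet, exercise)
* **What would you like me to do?** (For example, calculate his BMI, or offer advice based on his height and age)

Please tell me what exactly you would like to learn, and I will do my best to help.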
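To make the inconvenience concrete, here is a minimal sketch, using only the standard library, that checks whether a reply can be consumed as JSON. The helper name try_parse_json and both sample strings are illustrative assumptions, not part of the original test.

```python
import json
from typing import Optional


def try_parse_json(text: str) -> Optional[dict]:
    """Return the parsed object if `text` is a JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


# A shortened, hypothetical free-form chat reply: not parseable as JSON.
free_form = "OK, 张三 is 25 years old and 168 cm tall. What would you like to know?"
# The kind of reply that downstream code can consume directly.
structured = '{"name": "张三", "height_in_meters": 1.68}'

print(try_parse_json(free_form))   # None
print(try_parse_json(structured))  # {'name': '张三', 'height_in_meters': 1.68}
```

The rest of the article is about getting the model to produce the second kind of reply reliably.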

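2 Structured output

LangChain llms provide with_structured_output, which returns information in a fixed format predefined with pydantic.

First, define Person with pydantic, declaring the fields to extract: name and height_in_meters.

Then pass Person to the llm through with_structured_output, so that structured information is extracted from the llm's output.

Example code is shown below.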

from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")

structured_llm = llm.with_structured_output(Person)  # bind the schema for structured output
structured_llm.invoke("张三25岁，并且有168厘米")

The output is as follows: the llm parses the information successfully and returns it in the fixed structure.

Person(name='张三', height_in_meters=1.68)
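Because the result is an ordinary Pydantic object, downstream code can use typed attributes instead of parsing text. The sketch below builds a Person by hand as a stand-in for the value returned by structured_llm.invoke, so it runs without Ollama; it assumes Pydantic v2, where model_dump and model_dump_json are available.

```python
from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")


# Stand-in for the object returned by structured_llm.invoke(...).
person = Person(name="张三", height_in_meters=1.68)

# Fields are typed attributes, so no string parsing is needed.
height_cm = round(person.height_in_meters * 100)
print(person.name, height_cm)

# Convert to a plain dict or a JSON string for storage or transport.
print(person.model_dump())
print(person.model_dump_json())
```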

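3 Custom structured output

However, not every model implements with_structured_output, and not every model supports tool calling or JSON mode.

Here we use PydanticOutputParser to try a custom approach that gives such models structured output as well.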
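One way to combine the two approaches is to try with_structured_output first and fall back to the custom chain only when the model lacks it. The sketch below assumes that an unsupported model raises NotImplementedError from with_structured_output, which is how the langchain_core base classes appear to behave; verify this against your installed version. build_extractor and the stub classes are hypothetical names, used only so that the example runs without a real model.

```python
from typing import Any, Callable


def build_extractor(llm: Any, schema: type, make_fallback: Callable[[], Any]) -> Any:
    """Prefer native structured output; otherwise build the custom chain."""
    try:
        return llm.with_structured_output(schema)
    except NotImplementedError:
        # Assumption: unsupported models signal this with NotImplementedError.
        return make_fallback()


# Hypothetical stubs standing in for real chat models.
class SupportsStructured:
    def with_structured_output(self, schema: type) -> str:
        return f"native:{schema.__name__}"


class NoStructured:
    def with_structured_output(self, schema: type) -> str:
        raise NotImplementedError


class Person:  # placeholder for the Pydantic model used in this article
    pass


print(build_extractor(SupportsStructured(), Person, lambda: "custom-chain"))
print(build_extractor(NoStructured(), Person, lambda: "custom-chain"))
```

With real objects, make_fallback would return a prompt-plus-parser chain of the kind built in the following subsections.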

3.1 PydanticOutputParser

Here we use a built-in class: PydanticOutputParser parses llm output that matches a given Pydantic schema.

The code is shown below; its core is the following instruction:

"Answer the user query. Wrap the output in `json` tags\n{format_instructions}",

from typing import List
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")

class People(BaseModel):
    """Identifying information about all people in a text."""
    people: List[Person]

# Set up a parser
parser = PydanticOutputParser(pydantic_object=People)

# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())

query = "张三25岁，并且有168厘米"
print(prompt.invoke(query).to_string())

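This code prints the prompt that is submitted to the llm, as shown below.

The prompt contains a system message and a human message.

The system message instructs the model to produce its answer in the custom structure.

The format instructions ask for a JSON instance and spell out the expected shape with an example: {"foo": ["bar", "baz"]} is well-formatted, whereas {"properties": {"foo": ["bar", "baz"]}} is not.

They also include the concrete output schema, so that the llm is expected to follow it, which makes the JSON easier to recognize and extract later.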

System: Answer the user query. Wrap the output in `json` tags
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"Person": {"description": "Information about a person.", "properties": {"name": {"description": "The name of the person", "title": "Name", "type": "string"}, "height_in_meters": {"description": "The height of the person expressed in meters.", "title": "Height In Meters", "type": "number"}}, "required": ["name", "height_in_meters"], "title": "Person", "type": "object"}}, "description": "Identifying information about all people in a text.", "properties": {"people": {"items": {"$ref": "#/$defs/Person"}, "title": "People", "type": "array"}}, "required": ["people"]}
```
Human: 张三25岁，并且有168厘米

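3.2 LLM output

This is an example of the structured output produced by the prompt above.

First, we feed in that prompt and look directly at the llm's raw output.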

chain = prompt | llm
query = "张三25岁，并且有168厘米"
chain.invoke({"query": query}).content

The llm's output is shown below; the LLM returns well-formed JSON that follows the definition.

'```json\n{\n "people": [\n {\n "name": "张三",\n "height_in_meters": 1.68\n }\n ]\n}\n```'
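The reply is still a string wrapped in a json code fence, so it has to be unwrapped before it can be used. The sketch below shows the idea with the standard library only; it is a simplified illustration of this kind of step, not LangChain's actual implementation, and extract_json_block is a hypothetical helper name.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this listing easy to embed

# The raw string returned by the llm in the example above.
raw = FENCE + 'json\n{\n "people": [\n {\n "name": "张三",\n "height_in_meters": 1.68\n }\n ]\n}\n' + FENCE

# Matches an optional "json" language tag, then captures everything up to the closing fence.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_block(text: str) -> dict:
    """Parse the first fenced block in `text`, or the whole text if it is unfenced."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


data = extract_json_block(raw)
print(data["people"][0]["name"])              # 张三
print(data["people"][0]["height_in_meters"])  # 1.68
```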

3.3 Structured parsing with the parser

Next, we add LangChain's PydanticOutputParser to the chain to perform the structured parsing, as in the following code.

chain = prompt | llm | parser
query = "张三25岁，并且有168厘米"
chain.invoke({"query": query})

As shown below, once it receives the llm's output, the parser turns the JSON content into structured data in the fixed format.

People(people=[Person(name='张三', height_in_meters=1.68)])
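The parser step does more than decode JSON: the result is also validated against the People schema, so a reply that does not conform is rejected instead of being passed on. Below is a minimal sketch of that validation step, using Pydantic v2 directly rather than LangChain; parse_people and the bad sample are hypothetical.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    height_in_meters: float


class People(BaseModel):
    people: List[Person]


def parse_people(json_text: str) -> Optional[People]:
    """Validate JSON text against People; return None if it does not conform."""
    try:
        return People.model_validate_json(json_text)
    except ValidationError as err:
        print(f"invalid output: {err.error_count()} error(s)")
        return None


good = '{"people": [{"name": "张三", "height_in_meters": 1.68}]}'
bad = '{"people": [{"name": "张三"}]}'  # hypothetical reply missing a required field

print(parse_people(good))
print(parse_people(bad))
```

In a real pipeline the None branch is where a retry or a corrective follow-up prompt would go.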

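3.4 Parsing multiple records

Because the schema in the prompt defines people as an array, information about two people can be supplied in a single input.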

chain = prompt | llm | parser

query = "张三25岁，并且有168厘米并且李四28岁，并且有172厘米"
chain.invoke({"query": query})  # an input containing several people is also parsed into the structure

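As shown below, the parser extracts the name and height_in_meters of both people and returns structured data in the fixed format. Note, however, that in this run the heights come back as 168.0 and 172.0, that is, in centimeters rather than meters: the parser guarantees the structure of the output, not the correctness of the values the model fills in.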

People(people=[Person(name='张三', height_in_meters=168.0), Person(name='李四', height_in_meters=172.0)])
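Values such as 168.0 in a field meant for meters can be caught at validation time by constraining the field. The sketch below is one possible safeguard, assuming Pydantic v2; the gt/lt bounds (0 to 3 meters) and the helper is_plausible are additions for illustration and do not appear in the original code.

```python
from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    # Assumed bounds: a human height in meters lies strictly between 0 and 3.
    height_in_meters: float = Field(
        ...,
        gt=0,
        lt=3,
        description="The height of the person expressed in meters.",
    )


def is_plausible(name: str, height: float) -> bool:
    """Return True if the values pass Person validation."""
    try:
        Person(name=name, height_in_meters=height)
        return True
    except ValidationError:
        return False


print(is_plausible("张三", 1.68))   # True
print(is_plausible("李四", 172.0))  # False: a centimeter value slipped through
```

With such a constraint in the schema, a reply like the one above would fail in the parser step rather than silently carrying centimeter values downstream.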

References

---

langchian进阶01-结构化输出(StructedOutput) [LangChain Advanced 01: Structured Output]

https://zhuanlan.zhihu.com/p/15349262663

Structured output

https://blog.frognew.com/library/agi/langchain/techniques/structured-output.html
