[Foundation Model Team Internship] LLM Chat Essentials: From Building the API Request to Internal Processing


fifi的大脑正在加载中... · Published 2026-01-07 22:12:38


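① What the chat fields of each mainstream model family (GPT / Claude / Llama / GLM) look like;
② The role that delimiter tokens (<|im_start|>, <|user|>, <s>, …) play in system prompts, multi-turn dialogue, and tool calls;
③ Building the HTTP request body step by step;
④ The full pipeline after the model receives messages (tokenize → positional encoding → continuous batching → sampling).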

I. Quick Reference: Chat Fields by Model Family

Model family	Official field name	Required keys	Optional / extension keys	Delimiters (special tokens)
OpenAI GPT-3.5/4	`messages[]`	`role` `content`	`name` `tool_calls` `tool_call_id`	internal `<|im_start|>` `<|im_end|>`
Anthropic Claude	`messages[]`	`role` `content`		`\n\nHuman:` `\n\nAssistant:` (legacy); internal `<|…`
Meta Llama-2-chat	`messages[]`	`role` `content`	system prompt goes in a `system` field	`<s>` `</s>` `[INST]` `[/INST]` `<<SYS>>\nSYS_PROMPT\n<</SYS>>`
ChatGLM-3/4	`messages[]`	`role` `content`	`tools`	`<|…`
Qwen-VL	`messages[]`	`role` `content`	`image` (base64)	`<|…`
InternLM-2	`messages[]`	`role` `content`	`tools`	`<|…`

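Takeaway: externally, every model exposes role/content semantics; internally, each one relies on special tokens to mark boundaries, and the token sets of different families are not interchangeable.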
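To make the takeaway concrete, here is a minimal sketch of how an API-side `messages` list could be flattened into a ChatML-style string. The function name `render_chatml` is hypothetical, and the layout simply mirrors the `<|im_start|>` / `<|im_end|>` prompt shown in the pipeline section of this article; real servers apply their own templates, which may differ.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten role/content messages into a ChatML-style prompt (illustrative only)."""
    parts = []
    for msg in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
prompt = render_chatml(messages)
print(prompt)
```

Feeding this string to a model trained on a different template (for example Llama-2's `[INST]` layout) would present it with delimiters it never saw during training, which is why the formats cannot be mixed.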

II. Delimiter Tokens in Depth

1. Functional categories

Type	Example	Purpose
Turn boundary	`<|im_start|>` …	…
System prompt	`<<SYS>>…<</SYS>>` `<|system|>` …	…
Tool call	`<|tool_call|>` …	…
Multimodal	`<|vision_start|>` …	…
Early stop / padding	`<|endoftext|>` …	…

2. A visual example (Llama-2 template)

<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Hi! [/INST] Hello 👋 </s>
<s>[INST] What is 2+2? [/INST] 4 </s>

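<s> / </s> wrap each complete turn; [INST] … [/INST] wrap the user instruction inside it.
When the model sees [INST], it knows a user prompt follows; when it sees [/INST], it switches to generation mode.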
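The template above can be reproduced with a few lines of string formatting. This is a sketch under assumptions: `render_llama2` is a hypothetical helper, and it follows the commonly documented Llama-2 chat layout in which the `<<SYS>>` block sits inside the first `[INST]`. In practice `tokenizer.apply_chat_template` should be preferred, since it uses the template shipped with the model.

```python
def render_llama2(messages):
    """Render role/content messages in a Llama-2-chat style layout (illustrative only)."""
    system = ""
    if messages and messages[0]["role"] == "system":
        # The system prompt is folded into the first user instruction.
        system = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]
    out = []
    for i in range(0, len(messages), 2):
        user = messages[i]
        assert user["role"] == "user", "turns must alternate user/assistant"
        prefix = system if i == 0 else ""
        out.append(f"<s>[INST] {prefix}{user['content']} [/INST]")
        if i + 1 < len(messages):
            assistant = messages[i + 1]
            assert assistant["role"] == "assistant", "turns must alternate user/assistant"
            # A completed turn is closed with </s>; an open one is left for the model.
            out.append(f" {assistant['content']} </s>")
    return "".join(out)


llama_prompt = render_llama2([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "What is 2+2?"},
])
print(llama_prompt)
```

The prompt ends right after the last `[/INST]`, which is the point where the model is expected to start generating.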

III. How to Build messages and Send a Request (copy-paste Python)

1. OpenAI style (the most common)

import os

from openai import OpenAI

# Requires openai>=1.0; the legacy openai.ChatCompletion interface was removed in 1.0.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about AI."}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0.7,
    max_tokens=60,
    stream=False
)
print(response.choices[0].message.content)

2. Claude official SDK (roles in messages are only user/assistant)

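import os

import anthropic

# Read the key from the environment instead of hard-coding it.
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
resp = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=60,
    # The system prompt is a top-level parameter, not a message with role "system".
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Write a haiku about AI."}
    ]
)
print(resp.content[0].text)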

3. Llama-2-chat on local vLLM (requires the chat template)

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", gpu_memory_utilization=0.8)
sampling = SamplingParams(temperature=0.7, max_tokens=60)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about AI."}
]
# vLLM applies the model's chat template (tokenizer.apply_chat_template) to messages internally
outputs = llm.chat(messages, sampling_params=sampling)
print(outputs[0].outputs[0].text)

Key point: local inference also has to turn messages into a prompt with delimiter tokens; apply_chat_template simply does it for you.

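IV. The Full Pipeline After the Model Receives Input

The walkthrough below uses a "GPT-4-like architecture + vLLM continuous batching" setup as the example and follows it in chronological order: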

① Prompt → Token IDs

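# `tokenizer` is assumed to be any tokenizer whose vocabulary contains the ChatML special tokens
prompt_str = "<|im_start|>system\nYou are helpful.<|im_end|>\n<|im_start|>user\nHi!<|im_end|>\n<|im_start|>assistant\n"
ids = tokenizer.encode(prompt_str)   # e.g. [198, 1000, 596, 389, …] (illustrative IDs; real values depend on the tokenizer)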
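What matters in this step is that each delimiter maps to a single reserved ID rather than being split into ordinary sub-word pieces. The toy encoder below illustrates only that idea: the IDs in `SPECIAL_TOKENS` are made up, and ordinary text falls back to raw UTF-8 bytes instead of a real BPE vocabulary.

```python
import re

# Hypothetical reserved IDs, chosen above the 0-255 byte range used for normal text.
SPECIAL_TOKENS = {"<|im_start|>": 100000, "<|im_end|>": 100001}
_SPLIT = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def toy_encode(text):
    """Encode text so every special token becomes exactly one ID (toy scheme)."""
    ids = []
    for piece in _SPLIT.split(text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])
        else:
            # Stand-in for BPE: one ID per UTF-8 byte.
            ids.extend(piece.encode("utf-8"))
    return ids


toy_ids = toy_encode("<|im_start|>user\nHi!<|im_end|>\n")
print(toy_ids)
```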

② Build the attention mask & position ids

mask = [1] * len(ids)
pos_ids = list(range(len(ids)))
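For a single sequence the mask and positions are trivial, as above. The sketch below extends the idea to a batch of unequal lengths, assuming left padding (sequences right-aligned) and a hypothetical `pad_id` of 0; `build_batch` is an illustrative helper, not a library function.

```python
def build_batch(seqs, pad_id=0):
    """Left-pad sequences to one length; return (input_ids, masks, position_ids)."""
    max_len = max(len(s) for s in seqs)
    input_ids, masks, pos_ids = [], [], []
    for s in seqs:
        pad = max_len - len(s)
        input_ids.append([pad_id] * pad + list(s))
        # 0 marks padding that attention must ignore, 1 marks real tokens.
        masks.append([0] * pad + [1] * len(s))
        # Padding slots get position 0; real tokens count up from 0.
        pos_ids.append([0] * pad + list(range(len(s))))
    return input_ids, masks, pos_ids


batch_ids, batch_masks, batch_pos = build_batch([[11, 12, 13], [21]])
print(batch_ids, batch_masks, batch_pos)
```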

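③ Continuous batching scheduler

New requests join the running batch; finished sequences give up their slots
The scheduler builds a single tensor for the current step, input_ids: [batch, seq] (right-aligned with padding, or packed)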
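The slot hand-off described above can be simulated without any GPU code. This is a toy model only: `run_continuous_batching` is a hypothetical function, each request is reduced to "number of tokens left to generate" (assumed to be at least 1), and real schedulers such as vLLM's also account for KV-cache memory and prefill cost.

```python
from collections import deque


def run_continuous_batching(requests, max_batch=2):
    """Toy scheduler. requests: list of (request_id, tokens_to_generate).

    Returns the sorted IDs of the running batch at every decode step.
    """
    waiting = deque(requests)
    running = {}   # request_id -> tokens still to generate
    timeline = []
    while waiting or running:
        # Admit waiting requests into free slots before each step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        timeline.append(sorted(running))
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # a finished sequence frees its slot
    return timeline


timeline = run_continuous_batching([("A", 1), ("B", 3), ("C", 2)])
print(timeline)
```

Request C starts as soon as A finishes, without waiting for B, which is the difference from static batching.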

④ Model forward pass (one decode step)

input_embeds = wte(input_ids) + wpe(position_ids)
→ Multi-head Attention (KV cache already paged)  
→ FFN + residual  
→ Logits: [batch, vocab]
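The attention step is where the KV cache pays off: at decode time only the newest token's query is computed, and it attends over all cached keys and values. The sketch below shows this for a single head in plain Python. It is a simplification: the cache is an ordinary list rather than a paged structure, and `attend_with_cache` is a hypothetical name.

```python
import math


def attend_with_cache(q, k_new, v_new, cache):
    """One decode step of single-head attention over a growing KV cache."""
    # Append this step's key/value so later steps can reuse them.
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in cache["k"]]
    # Numerically stable softmax over all cached positions.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of cached values.
    return [sum(w * v[j] for w, v in zip(weights, cache["v"])) for j in range(len(v_new))]


cache = {"k": [], "v": []}
out1 = attend_with_cache([1.0, 0.0], [1.0, 0.0], [2.0, 4.0], cache)
out2 = attend_with_cache([0.0, 0.0], [0.0, 1.0], [4.0, 8.0], cache)
print(out1, out2)
```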

⑤ Sampling (Temperature / Top-p / Beam)

Produce the list of next token IDs, next_ids[batch]
Update the KV-cache page table; append the generated token to each sequence
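As a rough illustration of the sampling step for one sequence, the sketch below combines temperature scaling with top-p (nucleus) filtering. Beam search is left out. The function name `sample_top_p` and its `rng` parameter are assumptions for this example, not part of any particular inference engine.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token ID using temperature scaling and nucleus (top-p) filtering."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the kept tokens in proportion to their probability.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]


greedy_id = sample_top_p([1.0, 5.0, 2.0], temperature=0)
narrow_id = sample_top_p([1.0, 5.0, 2.0], temperature=0.7, top_p=0.1)
print(greedy_id, narrow_id)
```

With a very small `top_p` only the single most likely token survives the filter, so the result matches greedy decoding.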

⑥ Termination check

On <|im_end|>, </s>, or length ≥ max_tokens → remove the sequence from the running queue
Return {"choices":[{"message":{"content":"…"}}]}
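The termination rule can be written as a small predicate that runs after every sampled token. In this sketch `finish_reason` is a hypothetical helper, and the stop-token IDs are placeholders; the returned strings mirror the `stop` / `length` values that OpenAI-style responses report as `finish_reason`.

```python
def finish_reason(generated_ids, stop_ids, max_tokens):
    """Return "stop", "length", or None (keep decoding) for one sequence."""
    if generated_ids and generated_ids[-1] in stop_ids:
        return "stop"      # the model emitted an end-of-turn token
    if len(generated_ids) >= max_tokens:
        return "length"    # the generation budget is exhausted
    return None


STOP_IDS = {100001}  # placeholder ID standing in for <|im_end|> or </s>
print(finish_reason([5, 6, 100001], STOP_IDS, max_tokens=60))
```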

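⑦ Streaming

Each generated token → immediately pushed over SSE → the front end renders it incrementally; the continuous batching loop keeps running in the background.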
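On the client side, consuming such a stream means reading `data:` lines and stitching the pieces together. The sketch below assumes the OpenAI-style chunk shape (`choices[0].delta.content`, terminated by `data: [DONE]`); `collect_sse_text` and the sample lines are illustrative, and a real client would read the lines from an HTTP response instead of a list.

```python
import json


def collect_sse_text(lines):
    """Concatenate delta.content from OpenAI-style SSE lines until [DONE]."""
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        # The first chunk usually carries only the role, so content may be absent.
        text.append(delta.get("content") or "")
    return "".join(text)


sample_stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
streamed_text = collect_sse_text(sample_stream)
print(streamed_text)
```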

V. Full Multi-Turn + Tool-Call Message Body Example (GPT-4)

POST https://api.openai.com/v1/chat/completions
Content-Type: application/json
Authorization: Bearer $OPENAI_API_KEY

{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that can use tools."},
    {"role": "user", "content": "What is the weather in Shanghai?"},
    {"role": "assistant", "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"Shanghai\"}"
        }
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp\": 28, \"unit\": \"C\"}"},
    {"role": "assistant", "content": "The temperature in Shanghai is 28 °C."}
  ],
  "temperature": 0.7,
  "stream": true
}

Internal sequence (schematic):
<|im_start|>system…<|im_end|><|im_start|>user…<|im_end|><|im_start|>assistant<|tool_call|>…<|im_end|><|im_start|>tool…<|im_end|><|im_start|>assistant…
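A common mistake when hand-building a body like the one above is a `tool` message whose `tool_call_id` does not match any earlier assistant `tool_calls` entry. The check below is a small sketch of that consistency rule; `check_tool_pairing` is a hypothetical helper, not part of any SDK.

```python
import json


def check_tool_pairing(messages):
    """Return tool_call_ids the assistant requested that have no tool result yet."""
    pending = []
    for msg in messages:
        if msg["role"] == "assistant":
            for call in msg.get("tool_calls", []):
                # "arguments" is a JSON-encoded string, so it must parse.
                json.loads(call["function"]["arguments"])
                pending.append(call["id"])
        elif msg["role"] == "tool":
            if msg["tool_call_id"] not in pending:
                raise ValueError(f"unknown tool_call_id: {msg['tool_call_id']}")
            pending.remove(msg["tool_call_id"])
    return pending


conversation = [
    {"role": "user", "content": "What is the weather in Shanghai?"},
    {"role": "assistant", "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\": \"Shanghai\"}"},
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp\": 28, \"unit\": \"C\"}"},
]
unanswered = check_tool_pairing(conversation)
print(unanswered)
```

An empty result means every tool call has a matching result and the conversation is ready for the next assistant turn.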

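VI. Summary & Quick Mnemonics

Externally a unified role/content; internally special tokens mark the boundaries
System prompts, tool results, and multimodal image payloads are each wrapped in dedicated tokens
Send a request → build messages first → the SDK/template adds the tokens automatically → the model runs continuous batching → sampling → streamed or whole-response return
When converting formats, key names, delimiter tokens, and tensor slices must all line up; otherwise you get garbled output or OOM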
