Ollama Usage Guide (Notes on the Official Documentation)

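This article covers several core Ollama features: 1) streaming in three modes (chat, thinking, and tool calling); 2) CLI commands for controlling thinking mode; 3) structured JSON output combined with Pydantic; 4) recommended embedding models; 5) an agent loop for multi-turn tool calling; 6) integration with the web search and web fetch APIs; 7) connecting to MCP servers. It focuses on the implementation details of streaming, best practices for structured JSON output, and the agent loop mechanism for multi-turn tool calling, and provides complete API usage…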

FRIEDHELM02 · Published 2025-12-23 14:11:19

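1. Core Streaming Concepts

1.1 Three Streaming Modes

Mode	Field	Purpose
Chat	`content`	Streams the assistant message so it can be rendered in real time
Thinking	`thinking`	Exposes the model's reasoning process
Tool calling	`tool_calls`	Streams tool-call requests so external tools can be run and their results returned

1.2 Streaming Example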

from ollama import chat

stream = chat(
  model='qwen3:1.7b',
  messages=[{'role': 'user', 'content': 'What is 17 × 23?'}],
  stream=True,
  think=True,  # enable thinking so chunk.message.thinking is populated
)

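# Initialize the state flag and the accumulators
in_thinking = False  # True while thinking chunks are being received
content = ''         # Accumulates the final answer
thinking = ''        # Accumulates the reasoning trace

# Iterate over each chunk of the streamed response
for chunk in stream:
  # The chunk carries part of the thinking trace
  if chunk.message.thinking:
    # Print a header the first time thinking content arrives
    if not in_thinking:
      in_thinking = True
      print('Thinking:\n', end='', flush=True)

    # Print this piece of the thinking trace without a trailing newline
    print(chunk.message.thinking, end='', flush=True)

    # Append it to the accumulated thinking string
    thinking += chunk.message.thinking

  # The chunk carries part of the actual answer
  elif chunk.message.content:
    # If thinking was being shown, switch state and print the answer header
    if in_thinking:
      in_thinking = False
      print('\n\nAnswer:\n', end='', flush=True)

    # Print this piece of the answer without a trailing newline
    print(chunk.message.content, end='', flush=True)

    # Append it to the accumulated answer string
    content += chunk.message.content

# After the stream ends, build the assistant message to append to the
# history for the next request (the dict keys must be quoted strings)
new_messages = [{'role': 'assistant', 'thinking': thinking, 'content': content}]

print('\n\nNew messages:\n', new_messages)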
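The accumulation logic above can also be factored into a helper that is easy to test without a running model. The following is a minimal sketch, not the library's own API: `accumulate` and `fake_chunk` are hypothetical names, and the chunks are `SimpleNamespace` stand-ins that only mimic the `message.thinking`, `message.content`, and `message.tool_calls` attributes used in this section.

```python
from types import SimpleNamespace


def accumulate(stream):
    """Collect thinking, content, and tool calls from an iterable of chunks."""
    thinking, content, tool_calls = '', '', []
    for chunk in stream:
        msg = chunk.message
        if getattr(msg, 'thinking', None):
            thinking += msg.thinking
        if getattr(msg, 'content', None):
            content += msg.content
        if getattr(msg, 'tool_calls', None):
            tool_calls.extend(msg.tool_calls)
    # Assistant message to append to the history before the next request.
    message = {'role': 'assistant', 'thinking': thinking, 'content': content}
    if tool_calls:
        message['tool_calls'] = tool_calls
    return message


def fake_chunk(**fields):
    # Hypothetical stand-in for a streamed chunk; not part of the ollama package.
    return SimpleNamespace(message=SimpleNamespace(**fields))


demo_stream = [
    fake_chunk(thinking='17 x 23 = 17 x 20 + 17 x 3. '),
    fake_chunk(thinking='340 + 51 = 391.'),
    fake_chunk(content='17 x 23 '),
    fake_chunk(content='= 391'),
]
message = accumulate(demo_stream)
print(message['content'])
```

With a real stream, the same helper could be called as `accumulate(stream)` in place of the printing loop when live rendering is not needed.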

2. CLI Quick Reference

2.1 Controlling Thinking Mode

# Enable thinking
ollama run deepseek-r1 --think "Where should I visit in Lisbon?"

# Disable thinking
ollama run deepseek-r1 --think=false "Summarize this article"

# Hide the thinking trace (while still using a thinking model)
ollama run deepseek-r1 --hidethinking "Is 9.9 bigger or 9.11?"

2.2 Controlling Thinking in an Interactive Session

# Enable thinking
/set think

# Disable thinking
/set nothink

2.3 GPT-OSS

# Three thinking levels are supported: low, medium, high
ollama run gpt-oss --think=low "Draft a headline"
ollama run gpt-oss --think=medium "Draft a headline"
ollama run gpt-oss --think=high "Draft a headline"
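For comparison, the same setting can be expressed when calling the API instead of the CLI. The sketch below only builds a request body and sends nothing. It assumes the chat endpoint accepts a `think` field that is either a boolean or, for GPT-OSS, one of the three level strings; `build_chat_payload` and `THINK_LEVELS` are hypothetical names introduced for illustration.

```python
THINK_LEVELS = {'low', 'medium', 'high'}


def build_chat_payload(model, prompt, think=None):
    """Build a chat request body with an optional think setting."""
    # Assumption: think is a bool (on/off) or a level string for gpt-oss.
    if think is not None and not isinstance(think, bool):
        if not isinstance(think, str) or think not in THINK_LEVELS:
            raise ValueError(f'think must be a bool or one of {sorted(THINK_LEVELS)}')
    payload = {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt}],
        'stream': False,
    }
    if think is not None:
        payload['think'] = think
    return payload


payload = build_chat_payload('gpt-oss', 'Draft a headline', think='low')
print(payload)
```

Validating the value before sending keeps a typo such as `think='hi'` from reaching the server.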

3. Structured JSON Output

3.1 Defining a Schema with Pydantic

from ollama import chat
from pydantic import BaseModel
from typing import Literal, Optional

# Text-only variant (commented out):
# class BlurLocation(BaseModel):
#     province: str
#     city: str
#     district: str
#
# response = chat(
#     model='qwen3:1.7b',
#     messages=[{'role': 'user', 'content': 'Where exactly is Cao County located?'}],
#     think=False,
#     format=BlurLocation.model_json_schema()
# )
#
# loc = BlurLocation.model_validate_json(response.message.content)
# print(loc)

class Object(BaseModel):
    name: str
    confidence: float
    attributes: str

class ImageDes(BaseModel):
    summary: str
    objects: list[Object]
    scene: str
    colors: list[str]
    time_of_day: Literal['Morning', 'Afternoon', 'Evening', 'Night']
    setting: Literal['Outdoor', 'Indoor', 'Unknown']
    text_content: Optional[str]=None

response = chat(
    model='qwen3-vl:2b',
    messages=[{
        'role': 'user',
        'content': 'Use Chinese to describe this photo and list the objects you detect.',
        'images': ['/Users/okonma/Desktop/pics/misato.jpg']
    }],
    format=ImageDes.model_json_schema(),
    options={
        'temperature': 0
    }
)

print(response.message.content)

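3.2 Best Practices

✅ Recommended:

Define the schema with Pydantic (Python) or Zod (JavaScript)
Set temperature=0 for more deterministic output
Include the JSON schema in the prompt
When using the OpenAI-compatible API, pass the schema through the response_format parameter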
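Two of these practices, putting the schema in the prompt and checking the reply before using it, can be sketched with the standard library alone. This is an illustrative stand-in for Pydantic validation, not a replacement for it: `LOCATION_SCHEMA`, `build_prompt`, and `parse_reply` are hypothetical names, and `sample_reply` is a hand-written string, not real model output.

```python
import json

# Hand-written JSON schema; with Pydantic it would come from model_json_schema().
LOCATION_SCHEMA = {
    'type': 'object',
    'properties': {
        'province': {'type': 'string'},
        'city': {'type': 'string'},
        'district': {'type': 'string'},
    },
    'required': ['province', 'city', 'district'],
}


def build_prompt(question, schema):
    """Append the schema to the prompt so the model sees the expected shape."""
    return f'{question}\nReply with JSON matching this schema:\n{json.dumps(schema)}'


def parse_reply(raw, schema):
    """Parse the reply and check that every required field is a string."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError('expected a JSON object')
    for key in schema['required']:
        if not isinstance(data.get(key), str):
            raise ValueError(f'missing or non-string field: {key}')
    return data


# Hand-written sample reply used only to exercise the parser.
sample_reply = '{"province": "Shandong", "city": "Heze", "district": "Cao County"}'
location = parse_reply(sample_reply, LOCATION_SCHEMA)
print(location['city'])
```

The check here covers only required string fields; a full validator such as Pydantic also handles nested models, enums, and type coercion.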

4. Embeddings

4.1 Recommended Models

Model	Link
embeddinggemma	ollama.com/library/embeddinggemma
qwen3-embedding	ollama.com/library/qwen3-embedding
all-minilm	ollama.com/library/all-minilm

4.2 Usage Tips

Use cosine similarity for semantic search
Use the same embedding model for indexing and for querying
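As a rough illustration of the first tip, cosine similarity can be computed in a few lines of plain Python. The vectors below are toy values chosen by hand, not real embeddings, and `cosine_similarity` and `rank` are hypothetical helper names. In practice both the query vector and the indexed vectors would come from the same embedding model, which is why their dimensions are required to match.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same dimension')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, index):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for stored document embeddings.
index = {
    'doc_a': [1.0, 0.0, 0.0],
    'doc_b': [0.6, 0.8, 0.0],
    'doc_c': [0.0, 0.0, 1.0],
}
results = rank([1.0, 0.0, 0.0], index)
print([doc_id for doc_id, _ in results])
```

Mixing models between indexing and querying would typically trip the dimension check, and even when dimensions happen to match, the scores would not be meaningful.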

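5. Agent Loop (Multi-Turn Tool Calling)

5.1 Concept

An agent loop lets the model:

Decide on its own when to call a tool
Incorporate tool results into its reply
Make tool calls across multiple turns

5.2 Implementation Notes

💡 Tip: Tell the model in the prompt that it is running in a loop and may make multiple tool calls

5.3 Full Example: Search Agent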

from ollama import chat, web_fetch, web_search

# Map tool names to the functions that implement them
available_tools = {'web_search': web_search, 'web_fetch': web_fetch}

messages = [{'role': 'user', 'content': "what is ollama's new engine"}]

while True:
    response = chat(
        model='qwen3:4b',
        messages=messages,
        tools=[web_search, web_fetch],
        think=True
    )

    # Show the thinking trace
    if response.message.thinking:
        print('Thinking: ', response.message.thinking)

    # Show the answer content
    if response.message.content:
        print('Content: ', response.message.content)

    messages.append(response.message)

    # Handle tool calls
    if response.message.tool_calls:
        print('Tool calls: ', response.message.tool_calls)
        for tool_call in response.message.tool_calls:
            function_to_call = available_tools.get(tool_call.function.name)
            if function_to_call:
                args = tool_call.function.arguments
                result = function_to_call(**args)
                print('Result: ', str(result)[:200]+'...')
                # Truncate the result to limit context length
                messages.append({
                    'role': 'tool',
                    'content': str(result)[:2000 * 4],
                    'tool_name': tool_call.function.name
                })
            else:
                messages.append({
                    'role': 'tool',
                    'content': f'Tool {tool_call.function.name} not found',
                    'tool_name': tool_call.function.name
                })
    else:
        break  # No tool calls, so exit the loop

⚠️ Important:

Increasing the model's context length to at least 32000 tokens is recommended
Ollama's cloud models run at their full context length
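The loop in 5.3 runs until the model stops requesting tools. A bounded variant that can be exercised offline might look like the sketch below. It is an illustration under stated assumptions: messages and tool calls are plain dicts, not the objects returned by the ollama package, and `run_agent`, `scripted_chat`, `max_turns`, and `max_chars` are hypothetical names. The `max_chars` default of 8000 mirrors the `2000 * 4` truncation used above.

```python
def run_agent(chat_fn, tools, messages, max_turns=5, max_chars=8000):
    """Run a tool-calling loop until the model stops requesting tools."""
    for _ in range(max_turns):
        message = chat_fn(messages)
        messages.append(message)
        calls = message.get('tool_calls') or []
        if not calls:
            return message['content']  # no tool calls: this is the final answer
        for call in calls:
            fn = tools.get(call['name'])
            if fn is None:
                result = f"Tool {call['name']} not found"
            else:
                # Truncate the result to limit context growth.
                result = str(fn(**call['arguments']))[:max_chars]
            messages.append({'role': 'tool', 'content': result, 'tool_name': call['name']})
    raise RuntimeError(f'no final answer after {max_turns} turns')


def scripted_chat(messages):
    # Hypothetical stub standing in for a chat call: it requests one tool call,
    # then answers once a tool result is present in the history.
    if any(m['role'] == 'tool' for m in messages):
        return {'role': 'assistant', 'content': 'Done: ' + messages[-1]['content']}
    return {
        'role': 'assistant',
        'content': '',
        'tool_calls': [{'name': 'add', 'arguments': {'a': 2, 'b': 3}}],
    }


history = [{'role': 'user', 'content': 'What is 2 + 3?'}]
answer = run_agent(scripted_chat, {'add': lambda a, b: a + b}, history)
print(answer)
```

Capping the number of turns guards against a model that keeps requesting tools indefinitely, at the cost of occasionally stopping a legitimate long chain early.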

6. Web Search & Fetch API

6.1 Initializing the Client

import ollama

# Configure the API key (replace API_KEY with your own key)
client = ollama.Client(
    headers={
        "Authorization": "Bearer API_KEY"
    }
)

6.2 Web Search API

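Endpoint: POST https://ollama.com/api/web_search

Parameters:

query (string, required): the search query string
max_results (integer, optional): maximum number of results to return (default 5, max 10)

Response format:

{
    "results": [
        {
            "title": "page title",
            "url": "page URL",
            "content": "relevant content snippet"
        }
    ]
}

Usage example: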

import ollama

response = ollama.web_search("What is Ollama?")
print(response)
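For callers who are not using the Python client, the endpoint and parameters listed above map onto a plain HTTP request. The sketch below builds the request with the standard library and does not send it. `build_search_request` is a hypothetical helper, the lower bound of 1 on `max_results` is an assumption (only the default of 5 and the maximum of 10 are stated above), and the bearer-token header follows the client configuration in 6.1.

```python
import json
import urllib.request

SEARCH_URL = 'https://ollama.com/api/web_search'


def build_search_request(query, api_key, max_results=5):
    """Build (but do not send) a POST request for the web search endpoint."""
    if not query:
        raise ValueError('query is required')
    # Assumption: at least 1 result; the documented maximum is 10.
    if not 1 <= max_results <= 10:
        raise ValueError('max_results must be between 1 and 10')
    body = json.dumps({'query': query, 'max_results': max_results}).encode('utf-8')
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json',
        },
        method='POST',
    )


request = build_search_request('What is Ollama?', api_key='your_api_key_here')
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and decoding the JSON body, which should contain the `results` list shown in the response format.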

6.3 Web Fetch API

Endpoint: POST https://ollama.com/api/web_fetch

Parameters:

url (string, required): the URL to fetch

Response format:

WebFetchResponse(
    title='page title',
    content='main content of the page',
    links=['link 1', 'link 2', ...]
)

Usage example:

from ollama import web_fetch

result = web_fetch('https://ollama.com')
print(result)

7. MCP Server Integration

Ollama's web search can be enabled in any MCP client through a Python MCP server.

7.1 Cline Integration

In Cline, configure it under Manage MCP Servers > Configure MCP Servers:

{
  "mcpServers": {
    "web_search_and_fetch": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "path/to/web-search-mcp.py"],
      "env": { "OLLAMA_API_KEY": "your_api_key_here" }
    }
  }
}

7.2 Codex Integration

Config file location: ~/.codex/config.toml

[mcp_servers.web_search]
command = "uv"
args = ["run", "path/to/web-search-mcp.py"]
env = { "OLLAMA_API_KEY" = "your_api_key_here" }

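7.3 Goose Integration

Ollama can be integrated with Goose through its MCP support.

Appendix: Key Terms

Term	Description
Stream	Streaming response; data is returned chunk by chunk
Chunk	A single piece of data in a streaming response
Thinking	The model's reasoning process (similar to chain of thought)
Tool Call	An invocation of an external tool or function
Schema	Definition of a JSON data structure
Embedding	Vector representation of text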