Context Engineering for AI Agents: Lessons from Building Manus / Yichao 'Peak' Ji

2025/7/18 --Yichao 'Peak' Ji
At the very beginning of the Manus project, my team and I faced a key decision: should we train an end-to-end agentic model using open-source foundations, or build an agent on top of the in-context learning abilities of frontier models?
Back in my first decade in NLP, we didn't have the luxury of that choice. In the distant days of BERT (yes, it's been seven years), models had to be fine-tuned—and evaluated—before they could transfer to a new task. That process often took weeks per iteration, even though the models were tiny compared to today's LLMs. For fast-moving applications, especially pre-PMF, such slow feedback loops are a deal-breaker. That was a bitter lesson from my last startup, where I trained models from scratch for open information extraction and semantic search. Then came GPT-3 and Flan-T5, and my in-house models became irrelevant overnight. Ironically, those same models marked the beginning of in-context learning—and a whole new path forward.
That hard-earned lesson made the choice clear: Manus would bet on context engineering. This allowed us to ship improvements in hours instead of weeks, and kept our product orthogonal to the underlying models: if model progress is the rising tide, we want Manus to be the boat, not the pillar stuck to the seabed.
Still, context engineering turned out to be anything but straightforward. It's an experimental science—and we've rebuilt our agent framework four times, each time after discovering a better way to shape context. We affectionately refer to this manual process of architecture searching, prompt fiddling, and empirical guesswork as "Stochastic Graduate Descent". It's not elegant, but it works.
This post shares the local optima we arrived at through our own "SGD". If you're building your own AI agent, I hope these principles help you converge faster.
Design Around the KV-Cache
If I had to choose just one metric, I'd argue that the KV-cache hit rate is the single most important metric for a production-stage AI agent. It directly affects both latency and cost. To understand why, let's look at how a typical agent operates:
After receiving a user input, the agent proceeds through a chain of tool uses to complete the task. In each iteration, the model selects an action from a predefined action space based on the current context. That action is then executed in the environment (e.g., Manus's virtual machine sandbox) to produce an observation. The action and observation are appended to the context, forming the input for the next iteration. This loop continues until the task is complete.
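To make the shape of this loop concrete, here is a minimal sketch in Python (not Manus's actual implementation; call_model and execute are hypothetical placeholders). Note that the context is append-only and serialization is deterministic, both of which matter for caching below.

```python
import json

def run_agent(user_input, tools, call_model, execute, max_steps=50):
    """Minimal agent loop: pick an action, execute it, append the result, repeat."""
    context = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        # The model selects the next action from the predefined action space.
        action = call_model(context, tools)   # e.g. {"name": "shell_exec", "arguments": {...}}
        if action is None:                    # the model chose to answer instead of acting
            break
        observation = execute(action)         # run the action in the sandbox / environment
        # Append-only context with deterministic serialization keeps the prefix cacheable.
        context.append({"role": "assistant", "content": json.dumps(action, sort_keys=True)})
        context.append({"role": "tool", "content": json.dumps(observation, sort_keys=True)})
    return context
```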
As you can imagine, the context grows with every step, while the output—usually a structured function call—remains relatively short. This makes the ratio between prefilling and decoding highly skewed in agents compared to chatbots. In Manus, for example, the average input-to-output token ratio is around 100:1.
Fortunately, contexts with identical prefixes can take advantage of the KV-cache, which drastically reduces time-to-first-token (TTFT) and inference cost—whether you're using a self-hosted model or calling an inference API. And we're not talking about small savings: with Claude Sonnet, for instance, cached input tokens cost 0.30 USD/MTok, while uncached ones cost 3 USD/MTok—a 10x difference.
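To put rough numbers on that 10x (all figures below are hypothetical except the per-token prices quoted above): a 50-step task whose context averages 40K tokens per step prefills about 2M input tokens in total.

```python
steps = 50
avg_prefix_tokens = 40_000       # hypothetical average context length per step
uncached_per_mtok = 3.00         # USD per million input tokens (Claude Sonnet, uncached)
cached_per_mtok = 0.30           # USD per million input tokens (cache hit)

total_input = steps * avg_prefix_tokens            # 2,000,000 tokens prefilled over the task
cold = total_input / 1e6 * uncached_per_mtok       # ~$6.00 if nothing hits the cache
warm = total_input / 1e6 * cached_per_mtok         # ~$0.60 if the shared prefix stays cached
print(f"cold: ${cold:.2f}  warm: ${warm:.2f}")     # ignores cache-write surcharges and the first cold pass
```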
From a context engineering perspective, improving KV-cache hit rate involves a few key practices:
1. Keep your prompt prefix stable. Due to the autoregressive nature of LLMs, even a single-token difference can invalidate the cache from that token onward. A common mistake is including a timestamp—especially one precise to the second—at the beginning of the system prompt. Sure, it lets the model tell you the current time, but it also kills your cache hit rate.
2. Make your context append-only. Avoid modifying previous actions or observations. Ensure your serialization is deterministic. Many programming languages and libraries don't guarantee stable key ordering when serializing JSON objects, which can silently break the cache.
3. Mark cache breakpoints explicitly when needed. Some model providers or inference frameworks don't support automatic incremental prefix caching, and instead require manual insertion of cache breakpoints in the context. When assigning these, account for potential cache expiration and, at minimum, ensure the breakpoint includes the end of the system prompt (see the sketch below).
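For instance, with Anthropic's Messages API, one provider that uses explicit breakpoints, a cache_control marker on the last system block caches everything up to that point. A minimal sketch assuming the official anthropic Python SDK; the model name and prompt contents are illustrative:

```python
import anthropic

LONG_STABLE_SYSTEM_PROMPT = "You are an agent... (tool definitions, guidelines, examples, etc.)"
conversation_so_far = [{"role": "user", "content": "Summarize today's top HN stories."}]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",            # illustrative model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STABLE_SYSTEM_PROMPT,       # keep this prefix byte-identical across calls
        "cache_control": {"type": "ephemeral"},  # breakpoint: cache everything up to here
    }],
    messages=conversation_so_far,                # the append-only history goes after the breakpoint
)
```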
Additionally, if you're self-hosting models using frameworks like vLLM, make sure prefix/prompt caching is enabled, and that you're using techniques like session IDs to route requests consistently across distributed workers.
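The routing part can be as simple as hashing a session ID to a worker, so that every request from one agent session lands on the same vLLM instance and finds its prefix still cached. A rough sketch; the worker URLs and session scheme are assumptions:

```python
import hashlib

# vLLM workers, each started with something like:
#   vllm serve <model> --enable-prefix-caching
WORKERS = ["http://10.0.0.1:8000", "http://10.0.0.2:8000", "http://10.0.0.3:8000"]

def pick_worker(session_id: str) -> str:
    """Deterministically route all requests of one session to the same worker,
    so its KV-cache prefix stays warm on that node."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return WORKERS[int(digest, 16) % len(WORKERS)]

print(pick_worker("manus-session-42"))
```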
Mask, Don't Remove
As your agent takes on more capabilities, its action space naturally grows more complex—in plain terms, the number of tools explodes. The recent popularity of MCP only adds fuel to the fire. If you allow user-configurable tools, trust me: someone will inevitably plug hundreds of mysterious tools into your carefully curated action space. As a result, the model is more likely to select the wrong action or take an inefficient path. In short, your heavily armed agent gets dumber.
A natural reaction is to design a dynamic action space—perhaps loading tools on demand using something RAG-like. We tried that in Manus too. But our experiments suggest a clear rule: unless absolutely necessary, avoid dynamically adding or removing tools mid-iteration. There are two main reasons for this:
1. In most LLMs, tool definitions live near the front of the context after serialization, typically before or after the system prompt. So any change will invalidate the KV-cache for all subsequent actions and observations.
2. When previous actions and observations still refer to tools that are no longer defined in the current context, the model gets confused. Without constrained decoding, this often leads to schema violations or hallucinated actions.
To solve this while still improving action selection, Manus uses a context-aware state machine to manage tool availability. Rather than removing tools, it masks the token logits during decoding to prevent (or enforce) the selection of certain actions based on the current context.
In practice, most model providers and inference frameworks support some form of response prefill, which allows you to constrain the action space without modifying the tool definitions. There are generally three modes of function calling (we'll use the Hermes format from NousResearch as an example):
Auto – The model may choose to call a function or not. Implemented by prefilling only the reply prefix: <|im_start|>assistant
Required – The model must call a function, but the choice is unconstrained. Implemented by prefilling up to the tool call token: <|im_start|>assistant<tool_call>
Specified – The model must call a function from a specific subset. Implemented by prefilling up to the beginning of the function name: <|im_start|>assistant<tool_call>{"name": "browser_
Using this, we constrain action selection by masking token logits directly. For example, when the user provides a new input, Manus must reply immediately instead of taking an action. We've also deliberately designed action names with consistent prefixes—e.g., all browser-related tools start with browser_, and command-line tools with shell_. This allows us to easily enforce that the agent only chooses from a certain group of tools at a given state without using stateful logits processors.
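Here is a rough illustration of that idea: a small state machine maps the agent's current state to a decode-time prefill in the Hermes-style format above, which in turn constrains which tools are reachable. The states and prefixes below are simplified assumptions, not Manus's real ones.

```python
# Map agent state -> how the next assistant turn is forced to begin.
PREFILLS = {
    # The user just sent new input: reply directly, no tool call is forced (Auto).
    "reply_to_user": "<|im_start|>assistant",
    # Mid-task: some tool call is required, but any tool may be chosen (Required).
    "must_act": "<|im_start|>assistant<tool_call>",
    # Browsing state: only tools whose names start with browser_ are reachable (Specified).
    "browser_only": '<|im_start|>assistant<tool_call>{"name": "browser_',
    # Shell state: only shell_* tools are reachable (Specified).
    "shell_only": '<|im_start|>assistant<tool_call>{"name": "shell_',
}

def build_prompt(serialized_context: str, state: str) -> str:
    """Append the state-dependent prefill; decoding continues from it, so logits
    inconsistent with the forced prefix are effectively masked out."""
    return serialized_context + PREFILLS[state]
```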
These designs help ensure that the Manus agent loop remains stable—even under a model-driven architecture.
Use the File System as Context
Modern frontier LLMs now offer context windows of 128K tokens or more. But in real-world agentic scenarios, that's often not enough, and sometimes even a liability. There are three common pain points:
1. Observations can be huge, especially when agents interact with unstructured data like web pages or PDFs. It's easy to blow past the context limit.
2. Model performance tends to degrade beyond a certain context length, even if the window technically supports it.
3. Long inputs are expensive, even with prefix caching. You're still paying to transmit and prefill every token.
To deal with this, many agent systems implement context truncation or compression strategies. But overly aggressive compression inevitably leads to information loss. The problem is fundamental: an agent, by nature, must predict the next action based on all prior state—and you can't reliably predict which observation might become critical ten steps later. From a logical standpoint, any irreversible compression carries risk.
That's why we treat the file system as the ultimate context in Manus: unlimited in size, persistent by nature, and directly operable by the agent itself. The model learns to write to and read from files on demand—using the file system not just as storage, but as structured, externalized memory.
Our compression strategies are always designed to be restorable. For instance, the content of a web page can be dropped from the context as long as the URL is preserved, and a document's contents can be omitted if its path remains available in the sandbox. This allows Manus to shrink context length without permanently losing information.
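A minimal sketch of what restorable compression can look like (the observation shape and field names are assumptions): drop the bulky content, but keep a handle, a URL or a sandbox path, that lets the agent re-fetch or re-read it later.

```python
def compress_observation(obs: dict, max_chars: int = 2000) -> dict:
    """Shrink an observation without losing it irreversibly: keep a restorable handle."""
    content = obs.get("content", "")
    if len(content) <= max_chars:
        return obs
    compressed = {k: v for k, v in obs.items() if k != "content"}
    if "url" in obs:
        compressed["note"] = "content dropped; re-open the URL to restore it"
    elif "path" in obs:
        compressed["note"] = "content dropped; re-read the file at 'path' to restore it"
    else:
        # No handle to restore from, so keep a truncated preview rather than lose everything.
        compressed["content_preview"] = content[:max_chars]
    return compressed

page = {"tool": "browser_open", "url": "https://example.com/report", "content": "<html>... 200 KB ...</html>"}
print(compress_observation(page, max_chars=20))
```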
While developing this feature, I found myself imagining what it would take for a State Space Model (SSM) to work effectively in an agentic setting. Unlike Transformers, SSMs lack full attention and struggle with long-range backward dependencies. But if they could master file-based memory—externalizing long-term state instead of holding it in context—then their speed and efficiency might unlock a new class of agents. Agentic SSMs could be the real successors to Neural Turing Machines.
Manipulate Attention Through Recitation
If you've worked with Manus, you've probably noticed something curious: when handling complex tasks, it tends to create a todo.md file—and update it step-by-step as the task progresses, checking off completed items.
That's not just cute behavior—it's a deliberate mechanism to manipulate attention.
A typical task in Manus requires around 50 tool calls on average. That's a long loop—and since Manus relies on LLMs for decision-making, it's vulnerable to drifting off-topic or forgetting earlier goals, especially in long contexts or complicated tasks.
By constantly rewriting the todo list, Manus is reciting its objectives into the end of the context. This pushes the global plan into the model's recent attention span, avoiding "lost-in-the-middle" issues and reducing goal misalignment. In effect, it's using natural language to bias its own focus toward the task objective—without needing special architectural changes.
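A sketch of the recitation pattern (the file name, helpers, and message roles here are illustrative): each turn, the agent rewrites todo.md and the fresh copy is appended at the tail of the context, where recent attention is strongest.

```python
from pathlib import Path

TODO = Path("todo.md")

def update_todo(items: dict[str, bool]) -> str:
    """Rewrite todo.md as a checklist, e.g. {'collect sources': True, 'draft report': False}."""
    lines = [f"- [{'x' if done else ' '}] {task}" for task, done in items.items()]
    TODO.write_text("\n".join(lines) + "\n")
    return TODO.read_text()

def recite_plan(context: list[dict], items: dict[str, bool]) -> None:
    # Re-appending the current plan at the end of the context pulls the global
    # objective back into the model's recent attention span.
    context.append({"role": "tool", "content": "Current plan:\n" + update_todo(items)})
```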
Keep the Wrong Stuff In
Agents make mistakes. That's not a bug—it's reality. Language models hallucinate, environments return errors, external tools misbehave, and unexpected edge cases show up all the time. In multi-step tasks, failure is not the exception; it's part of the loop.
And yet, a common impulse is to hide these errors: clean up the trace, retry the action, or reset the model's state and leave it to the magical "temperature". That feels safer, more controlled. But it comes at a cost: Erasing failure removes evidence. And without evidence, the model can't adapt.
In our experience, one of the most effective ways to improve agent behavior is deceptively simple: leave the wrong turns in the context. When the model sees a failed action—and the resulting observation or stack trace—it implicitly updates its internal beliefs. This shifts its prior away from similar actions, reducing the chance of repeating the same mistake. In fact, we believe error recovery is one of the clearest indicators of true agentic behavior. Yet it's still underrepresented in most academic work and public benchmarks, which often focus on task success under ideal conditions.
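Concretely, that means the failure path produces an observation just like the success path does, instead of being swallowed by a silent retry. A sketch, with a hypothetical run_tool helper:

```python
import json
import traceback

def execute_and_record(action, context, run_tool):
    """Run a tool call and append whatever happened (success or failure),
    so the model can see, and adapt to, its own mistakes."""
    try:
        observation = {"status": "ok", "result": run_tool(action)}
    except Exception as exc:
        # Keep the evidence: the error and a short stack trace become part of the context.
        observation = {"status": "error",
                       "error": repr(exc),
                       "trace": traceback.format_exc(limit=3)}
    context.append({"role": "tool", "content": json.dumps(observation, sort_keys=True)})
    return observation
```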
Don't Get Few-Shotted
Few-shot prompting is a common technique for improving LLM outputs. But in agent systems, it can backfire in subtle ways.
Language models are excellent mimics; they imitate the pattern of behavior in the context. If your context is full of similar past action-observation pairs, the model will tend to follow that pattern, even when it's no longer optimal.
This can be dangerous in tasks that involve repetitive decisions or actions. For example, when using Manus to help review a batch of 20 resumes, the agent often falls into a rhythm—repeating similar actions simply because that's what it sees in the context. This leads to drift, overgeneralization, or sometimes hallucination.
The fix is to increase diversity. Manus introduces small amounts of structured variation in actions and observations—different serialization templates, alternate phrasing, minor noise in order or formatting. This controlled randomness helps break the pattern and tweaks the model's attention. In other words, don't few-shot yourself into a rut. The more uniform your context, the more brittle your agent becomes.
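One way to picture this (the templates and sampling below are illustrative, not Manus's actual ones): serialize the same action-observation pair through one of several equivalent templates, so the context never settles into a single rigid pattern for the model to mimic.

```python
import json
import random

TEMPLATES = [
    lambda a, o: f"Action: {json.dumps(a, sort_keys=True)}\nObservation: {o}",
    lambda a, o: f"[tool] {a['name']}({json.dumps(a['arguments'], sort_keys=True)}) -> {o}",
    lambda a, o: f"Result of {a['name']}:\n{o}",
]

def render_step(action: dict, observation: str) -> str:
    """Pick one of several equivalent serializations to avoid a wall of
    identical few-shot-like examples that the model will blindly imitate."""
    return random.choice(TEMPLATES)(action, observation)

print(render_step({"name": "shell_exec", "arguments": {"cmd": "ls"}}, "README.md  src/"))
```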
Conclusion
Context engineering is still an emerging science—but for agent systems, it's already essential. Models may be getting stronger, faster, and cheaper, but no amount of raw capability replaces the need for memory, environment, and feedback. How you shape the context ultimately defines how your agent behaves: how fast it runs, how well it recovers, and how far it scales.
At Manus, we've learned these lessons through repeated rewrites, dead ends, and real-world testing across millions of users. None of what we've shared here is universal truth—but these are the patterns that worked for us. If they help you avoid even one painful iteration, then this post did its job.
The agentic future will be built one context at a time. Engineer them well.