AutoGen Agent Development: The autogen_ext.agents.web_surfer Package

autogen_ext.agents.web_surfer

王国平

王国平 · Published 2025-12-11 00:13:40

class MultimodalWebSurfer(name: str, model_client: ChatCompletionClient, downloads_folder: str | None = None, description: str = DEFAULT_DESCRIPTION, debug_dir: str | None = None, headless: bool = True, start_page: str | None = DEFAULT_START_PAGE, animate_actions: bool = False, to_save_screenshots: bool = False, use_ocr: bool = False, browser_channel: str | None = None, browser_data_dir: str | None = None, to_resize_viewport: bool = True, playwright: Playwright | None = None, context: BrowserContext | None = None)

Bases: BaseChatAgent, Component[MultimodalWebSurferConfig]

MultimodalWebSurfer is a multimodal agent that acts as a web surfer, able to search the web and visit web pages.

Installation

pip install "autogen-ext[web-surfer]"

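It launches a Chromium browser and lets Playwright interact with it to perform a variety of actions. The browser is launched on the first call to the agent and is reused for subsequent calls.

It must be used with a multimodal model client that supports function/tool calling, currently ideally GPT-4o.

When on_messages() or on_messages_stream() is called, the following occurs:

If this is the first call, the browser is initialized and the page is loaded. This is done in _lazy_init(). The browser is only closed when close() is called.
The method _generate_reply() is called, which then creates the final response as follows.
The agent takes a screenshot of the page, extracts the interactive elements, and prepares a set-of-mark (SOM) screenshot with bounding boxes drawn around the interactive elements.
The agent calls model_client with the SOM screenshot, the message history, and the list of available tools.
- If the model returns a string, the agent returns the string as the final response.
- If the model returns a list of tool calls, the agent executes them with _execute_tool() using _playwright_controller.
- The agent returns a final response that includes a screenshot of the page, the page metadata, a description of the action taken, and the inner text of the web page.
If the agent encounters an error at any point, it returns the error message as the final response.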
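
The branching in the steps above can be pictured with a small, library-free sketch. It is not the agent's actual code: ToolCall, handle_model_output, and the execute_tool callback are hypothetical stand-ins, and the real final response also carries a screenshot and page metadata that are left out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ToolCall:
    """Hypothetical stand-in for one tool call returned by the model."""
    name: str
    arguments: dict


def handle_model_output(
    output: Union[str, List[ToolCall]],
    execute_tool: Callable[[ToolCall], str],
) -> str:
    """Turn one model output into the text of a final response."""
    try:
        if isinstance(output, str):
            # A plain string is passed through as the final response.
            return output
        # Otherwise every tool call is executed and its description collected.
        return "\n".join(execute_tool(call) for call in output)
    except Exception as exc:
        # Any error is reported as the final response instead of propagating.
        return f"Error: {exc}"
```

For example, handle_model_output("done", print) returns "done" without ever calling the callback, while a list of ToolCall objects is routed through execute_tool one by one.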

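Note

Be aware that using MultimodalWebSurfer involves interacting with a digital world designed for humans, which carries inherent risks. The agent may occasionally attempt risky actions, such as recruiting humans for help or accepting cookie agreements without human involvement. Always ensure the agent is monitored and runs in a controlled environment to prevent unintended consequences. In addition, MultimodalWebSurfer may be vulnerable to prompt injection attacks from web pages.

Note

On Windows, the event loop policy must be set to WindowsProactorEventLoopPolicy to avoid issues with subprocesses.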

import sys
import asyncio

if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

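Parameters:

name (str) – The name of the agent.
model_client (ChatCompletionClient) – The model client used by the agent. Must be multimodal and support function calling.
downloads_folder (str, optional) – The folder where downloads are saved. Defaults to None, in which case downloads are not saved.
description (str, optional) – The description of the agent. Defaults to MultimodalWebSurfer.DEFAULT_DESCRIPTION.
debug_dir (str, optional) – The directory where debug information is saved. Defaults to None.
headless (bool, optional) – Whether the browser should run headless. Defaults to True.
start_page (str, optional) – The start page of the browser. Defaults to MultimodalWebSurfer.DEFAULT_START_PAGE.
animate_actions (bool, optional) – Whether to animate actions. Defaults to False.
to_save_screenshots (bool, optional) – Whether to save screenshots. Defaults to False.
use_ocr (bool, optional) – Whether to use OCR. Defaults to False.
browser_channel (str, optional) – The browser channel. Defaults to None.
browser_data_dir (str, optional) – The browser data directory. Defaults to None.
to_resize_viewport (bool, optional) – Whether to resize the viewport. Defaults to True.
playwright (Playwright, optional) – The Playwright instance. Defaults to None.
context (BrowserContext, optional) – The browser context. Defaults to None.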
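
As a rough illustration of how these parameters combine, the sketch below collects a debugging-oriented set of keyword arguments. The parameter names come from the list above; the values, the directory paths, and the helper build_surfer are assumptions made for the example. The import is deferred into the function so the options can be inspected without the package installed.

```python
from typing import Any, Dict

# Illustrative values only; the keys are parameter names from the list above.
surfer_options: Dict[str, Any] = {
    "name": "web_surfer",
    "headless": False,                  # show the browser window while debugging
    "animate_actions": True,            # animate the actions the agent performs
    "to_save_screenshots": True,        # keep screenshots for later inspection
    "debug_dir": "./debug",             # hypothetical directory for debug output
    "downloads_folder": "./downloads",  # hypothetical directory for downloads
    "start_page": "https://www.bing.com/",
}


def build_surfer(model_client: Any) -> Any:
    """Create the agent from the options above (hypothetical helper)."""
    # Deferred import: requires `pip install "autogen-ext[web-surfer]"`.
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer

    return MultimodalWebSurfer(model_client=model_client, **surfer_options)
```

Keeping the options in a plain dictionary makes it easy to switch between a visible debugging setup and the headless defaults.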

Example usage

The following example demonstrates how to create a web surfing agent with a model client and run it for multiple turns.

import asyncio
from autogen_agentchat.ui import Console
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.web_surfer import MultimodalWebSurfer


async def main() -> None:
    # Define an agent
    web_surfer_agent = MultimodalWebSurfer(
        name="MultimodalWebSurfer",
        model_client=OpenAIChatCompletionClient(model="gpt-4o-2024-08-06"),
    )

    # Define a team
    agent_team = RoundRobinGroupChat([web_surfer_agent], max_turns=3)

    # Run the team and stream messages to the console
    stream = agent_team.run_stream(task="Navigate to the AutoGen readme on GitHub.")
    await Console(stream)
    # Close the browser controlled by the agent
    await web_surfer_agent.close()


asyncio.run(main())

component_type: ClassVar[ComponentType] = 'agent'

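The logical type of the component.

component_config_schema

alias of MultimodalWebSurferConfig

component_provider_override: ClassVar[str | None] = 'autogen_ext.agents.web_surfer.MultimodalWebSurfer'

Override the provider string for the component. This should be used to prevent internal module names from becoming part of the module name.

DEFAULT_DESCRIPTION = '\n A helpful assistant with access to a web browser.\n Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, filling in form fields, etc.).\n It can also summarize the entire page, or answer questions based on the content of the page.\n It can also be asked to sleep and wait for pages to load, in cases where the page seems not yet fully loaded.\n '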

DEFAULT_START_PAGE = 'https://www.bing.com/'

VIEWPORT_HEIGHT = 900

VIEWPORT_WIDTH = 1440

MLM_HEIGHT = 765

MLM_WIDTH = 1224

SCREENSHOT_TOKENS = 1105
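
The viewport and MLM constants are related by a fixed ratio, which the short calculation below makes explicit. It assumes, as the names suggest but this page does not state, that MLM_WIDTH and MLM_HEIGHT give the size of the screenshot sent to the model; the helper to_viewport is hypothetical and only shows how a point in such a scaled image would map back to viewport pixels.

```python
from fractions import Fraction
from typing import Tuple

VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900
MLM_WIDTH, MLM_HEIGHT = 1224, 765

# Exact scale factors; Fraction avoids floating-point rounding.
scale_x = Fraction(MLM_WIDTH, VIEWPORT_WIDTH)
scale_y = Fraction(MLM_HEIGHT, VIEWPORT_HEIGHT)


def to_viewport(x: int, y: int) -> Tuple[float, float]:
    """Map a pixel in the scaled image back to viewport coordinates."""
    return (float(x / scale_x), float(y / scale_y))
```

Both factors reduce to 17/20, so the scaled image is 85% of the viewport in each dimension and the aspect ratio is unchanged.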

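async close() → None

Close the browser and the page. This should be called when the agent is no longer needed.

property produced_message_types: Sequence[type[BaseChatMessage]]

The types of messages that the agent produces in the Response.chat_message field. They must be BaseChatMessage types.

async on_reset(cancellation_token: CancellationToken) → None

Resets the agent to its initialization state.

async on_messages(messages: Sequence[BaseChatMessage], cancellation_token: CancellationToken) → Response

Handles incoming messages and returns a response.

Note

Agents are stateful, and the messages passed to this method should be the new messages since the last call to this method. The agent should maintain its state between calls. For example, if the agent needs to remember previous messages to respond to the current one, it should store them in its own state.

async on_messages_stream(messages: Sequence[BaseChatMessage], cancellation_token: CancellationToken) → AsyncGenerator[BaseAgentEvent | BaseChatMessage | Response, None]

Handles incoming messages and returns a stream of messages, the last item being the response. The base implementation in BaseChatAgent simply calls on_messages() and yields the messages in the response.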
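
The "call on_messages() and yield the messages in the response" pattern can be sketched without the library. SketchResponse and SketchAgent below are hypothetical stand-ins, and the cancellation token is omitted for brevity; the field name inner_messages is an assumption about where the intermediate messages live.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncGenerator, List, Union


@dataclass
class SketchResponse:
    """Hypothetical stand-in for Response."""
    chat_message: str
    inner_messages: List[str] = field(default_factory=list)


class SketchAgent:
    async def on_messages(self, messages: List[str]) -> SketchResponse:
        # Pretend two intermediate events happened before the final reply.
        return SketchResponse(
            chat_message=f"handled {len(messages)} message(s)",
            inner_messages=["taking screenshot", "clicking link"],
        )

    async def on_messages_stream(
        self, messages: List[str]
    ) -> AsyncGenerator[Union[str, SketchResponse], None]:
        # Delegate to on_messages(), replay its intermediate messages,
        # then yield the response itself as the last item.
        response = await self.on_messages(messages)
        for inner in response.inner_messages:
            yield inner
        yield response


async def collect(messages: List[str]) -> list:
    """Gather everything the stream yields into a list."""
    return [item async for item in SketchAgent().on_messages_stream(messages)]
```

A consumer can therefore treat every item except the last as progress output and read the final answer from the last item's chat_message.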

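_to_config() → MultimodalWebSurferConfig

Dump the configuration needed to create a new instance of the component that matches this instance's configuration.

Returns:

T – The configuration of the component.

classmethod _from_config(config: MultimodalWebSurferConfig) → Self

Create a new instance of the component from a configuration object.

Parameters:

config (T) – The configuration object.

Returns:

Self – The new instance of the component.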

class PlaywrightController(downloads_folder: str | None = None, animate_actions: bool = False, viewport_width: int = 1440, viewport_height: int = 900, _download_handler: Callable[[Download], None] | None = None, to_resize_viewport: bool = True)

Bases: object

A helper class that lets Playwright interact with web pages to perform actions such as clicking, filling, and scrolling.

Parameters:

downloads_folder (str | None) – The folder where downloads are saved. If None, downloads are not saved.
animate_actions (bool) – Whether to animate actions (creates a fake cursor for clicks).
viewport_width (int) – The width of the viewport.
viewport_height (int) – The height of the viewport.
_download_handler (Optional[Callable[[Download], None]]) – A function that handles downloads.
to_resize_viewport (bool) – Whether to resize the viewport.

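async sleep(page: Page, duration: int | float) → None

Pause execution for the specified duration.

Parameters:

page (Page) – The Playwright page object.
duration (Union[int, float]) – The duration of the pause, in milliseconds.

async get_interactive_rects(page: Page) → Dict[str, InteractiveRegion]

Retrieve the interactive regions from the web page.

Parameters:

page (Page) – The Playwright page object.

Returns:

Dict[str, InteractiveRegion] – A dictionary of interactive regions.

async get_visual_viewport(page: Page) → VisualViewport

Retrieve the visual viewport of the web page.

Parameters:

page (Page) – The Playwright page object.

Returns:

VisualViewport – The visual viewport of the page.

async get_focused_rect_id(page: Page) → str | None

Retrieve the ID of the currently focused element.

Parameters:

page (Page) – The Playwright page object.

Returns:

str | None – The ID of the focused element, or None if no control has focus.

async get_page_metadata(page: Page) → Dict[str, Any]

Retrieve metadata from the web page.

Parameters:

page (Page) – The Playwright page object.

Returns:

Dict[str, Any] – A dictionary of page metadata.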

async on_new_page(page: Page) → None

Handle the actions to perform on a new page.

Parameters:

page (Page) – The Playwright page object.

async back(page: Page) → None

Navigate back to the previous page.

Parameters:

page (Page) – The Playwright page object.

async visit_page(page: Page, url: str) → Tuple[bool, bool]

Visit the specified URL.

Parameters:

page (Page) – The Playwright page object.
url (str) – The URL to visit.

Returns:

Tuple[bool, bool] – A tuple indicating whether to reset the previous metadata hash and the last download.

async page_down(page: Page) → None

Scroll the page down by one viewport height minus 50 pixels.

Parameters:

page (Page) – The Playwright page object.

async page_up(page: Page) → None

Scroll the page up by one viewport height minus 50 pixels.

Parameters:

page (Page) – The Playwright page object.
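
The scroll distance used by page_down and page_up follows directly from the description: with the default viewport_height of 900, each call moves the page by 850 pixels, leaving a 50-pixel overlap between consecutive views. The helpers below are hypothetical and only work through that arithmetic; they are not part of PlaywrightController.

```python
import math

VIEWPORT_HEIGHT = 900  # default viewport_height of PlaywrightController
SCROLL_OVERLAP = 50    # pixels subtracted from the viewport height per scroll


def scroll_step(viewport_height: int = VIEWPORT_HEIGHT) -> int:
    """Pixels moved by a single page_down or page_up call."""
    return viewport_height - SCROLL_OVERLAP


def page_downs_needed(page_height: int, viewport_height: int = VIEWPORT_HEIGHT) -> int:
    """Number of page_down calls needed to reach the bottom of a page."""
    hidden = max(0, page_height - viewport_height)
    return math.ceil(hidden / scroll_step(viewport_height))
```

Under these assumptions a 2600-pixel page needs two scrolls from the top, since the 1700 hidden pixels are covered by two 850-pixel steps.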

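async gradual_cursor_animation(page: Page, start_x: float, start_y: float, end_x: float, end_y: float) → None

Gradually animate the cursor movement from the start coordinates to the end coordinates.

Parameters:

page (Page) – The Playwright page object.
start_x (float) – The starting x-coordinate.
start_y (float) – The starting y-coordinate.
end_x (float) – The ending x-coordinate.
end_y (float) – The ending y-coordinate.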
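
One simple way to picture a gradual cursor movement is linear interpolation between the two points. The sketch below is an assumption for illustration: this page does not say how many intermediate positions the controller uses or whether the path is linear, and cursor_path is a hypothetical helper.

```python
from typing import List, Tuple


def cursor_path(
    start_x: float,
    start_y: float,
    end_x: float,
    end_y: float,
    steps: int = 10,
) -> List[Tuple[float, float]]:
    """Return evenly spaced cursor positions, ending exactly at the target."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        (
            start_x + (end_x - start_x) * i / steps,
            start_y + (end_y - start_y) * i / steps,
        )
        for i in range(1, steps + 1)
    ]
```

Moving the cursor through each position in turn, with a short pause between them, would produce the kind of visible glide that animate_actions is meant to show.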

async add_cursor_box(page: Page, identifier: str) → None

Add a red cursor box around the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

async remove_cursor_box(page: Page, identifier: str) → None

Remove the red cursor box around the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

async click_id(page: Page, identifier: str) → Page | None

Click the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

Returns:

Page | None – The new page if one was opened, otherwise None.

async hover_id(page: Page, identifier: str) → None

Hover the mouse over the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

async fill_id(page: Page, identifier: str, value: str, press_enter: bool = True) → None

Fill the element with the given identifier with the specified value.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.
value (str) – The value to fill in.

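async scroll_id(page: Page, identifier: str, direction: str) → None

Scroll the element with the given identifier in the specified direction.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.
direction (str) – The direction to scroll ("up" or "down").

async get_webpage_text(page: Page, n_lines: int = 50) → str

Retrieve the text content of the web page.

Parameters:

page (Page) – The Playwright page object.
n_lines (int) – The number of lines to return from the page's inner text.

Returns:

str – The text content of the page.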
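
The effect of n_lines can be approximated on a plain string. The sketch assumes the limit keeps the first n_lines lines of the inner text, which this page implies but does not spell out; first_lines is a hypothetical helper, not the controller's implementation.

```python
def first_lines(text: str, n_lines: int = 50) -> str:
    """Keep at most the first n_lines lines of a block of text."""
    return "\n".join(text.splitlines()[:n_lines])
```

With the default of 50, a long page is cut down to a preview, which keeps the text portion of the agent's response bounded.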

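async get_visible_text(page: Page) → str

Retrieve the text content of the browser viewport (approximately).

Parameters:

page (Page) – The Playwright page object.

Returns:

str – The text content of the page.

async get_page_markdown(page: Page) → str

Retrieve the Markdown content of the web page. Currently not implemented.

Parameters:

page (Page) – The Playwright page object.

Returns:

str – The Markdown content of the page.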
