CrewAI Agent Development: The Scrapfly Website Scraping Tool

ScrapflyScrapeWebsiteTool uses Scrapfly's web scraping API to extract content from websites in several formats.

王国平 · Published 2026-01-01 09:17:59

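Description

ScrapflyScrapeWebsiteTool extracts content from websites through Scrapfly's web scraping API. It offers advanced scraping features, including headless browser support, proxies, and anti-bot bypass. Page data can be returned as raw HTML, Markdown, or plain text, which makes the tool suitable for a wide range of scraping tasks.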

Installation

To use this tool, install the Scrapfly SDK:

uv add scrapfly-sdk

You also need a Scrapfly API key, which you can get by registering at scrapfly.io/register.
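The code in this article passes the key as a string literal for brevity. A minimal sketch of reading it from the environment instead is shown here. The variable name SCRAPFLY_API_KEY and the helper load_api_key are assumptions made for illustration; the tool itself only needs the key passed as api_key.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "SCRAPFLY_API_KEY",  # assumed name, not required by the tool
) -> str:
    """Return the Scrapfly API key from the environment, or fail loudly."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the tool")
    return key
```

The result could then be passed along as ScrapflyScrapeWebsiteTool(api_key=load_api_key()), which keeps the key out of source control.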

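Steps to Get Started

To use ScrapflyScrapeWebsiteTool effectively, follow these steps:

Install dependencies: install the Scrapfly SDK with the command above.
Get an API key: register with Scrapfly to obtain your API key.
Initialize the tool: create a tool instance with your API key.
Configure scraping parameters: customize the scraping parameters to fit your needs.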

Example

The following example shows how to use ScrapflyScrapeWebsiteTool to extract content from a website:

from crewai import Agent, Task, Crew
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract content from a website
scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products and summarize the available products.",
    expected_output="A summary of the products available on the website.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

You can also customize the scraping parameters:

# Example with custom scraping parameters
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites with custom parameters",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# The agent will use the tool with parameters like:
# url="https://web-scraping.dev/products"
# scrape_format="markdown"
# ignore_scrape_failures=True
# scrape_config={
#     "asp": True,  # Bypass scraping blocking solutions, like Cloudflare
#     "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
#     "proxy_pool": "public_residential_pool",  # Select a proxy pool
#     "country": "us",  # Select a proxy location
#     "auto_scroll": True,  # Auto scroll the page
# }

scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products using advanced scraping options including JavaScript rendering and proxy settings.",
    expected_output="A detailed summary of the products with all available information.",
    agent=web_scraper_agent,
)

Parameters

ScrapflyScrapeWebsiteTool accepts the following parameters.

Initialization parameters

api_key: Required. Your Scrapfly API key.

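Run parameters

url: Required. The URL of the website to scrape.
scrape_format: Optional. The format in which to extract the page content. Options are "raw" (HTML), "markdown", or "text". Defaults to "markdown".
scrape_config: Optional. A dictionary of additional Scrapfly scraping configuration options.
ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool returns None instead of raising an exception when scraping fails.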
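As an illustration of how the run parameters fit together, the sketch below collects and checks them before a call. The helper build_run_args is hypothetical and not part of crewai_tools, and the absolute-URL check is an added assumption rather than a documented requirement.

```python
from typing import Any, Dict, Optional

# Formats listed in the run-parameter description above.
ALLOWED_FORMATS = {"raw", "markdown", "text"}


def build_run_args(
    url: str,
    scrape_format: str = "markdown",
    scrape_config: Optional[Dict[str, Any]] = None,
    ignore_scrape_failures: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and collect the tool's run parameters."""
    # Assumption for this sketch: only absolute http(s) URLs are accepted.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"url must be absolute, got {url!r}")
    if scrape_format not in ALLOWED_FORMATS:
        raise ValueError(
            f"scrape_format must be one of {sorted(ALLOWED_FORMATS)}"
        )
    args: Dict[str, Any] = {"url": url, "scrape_format": scrape_format}
    # Optional parameters are included only when the caller set them.
    if scrape_config is not None:
        args["scrape_config"] = dict(scrape_config)
    if ignore_scrape_failures is not None:
        args["ignore_scrape_failures"] = ignore_scrape_failures
    return args
```

When the tool is invoked directly rather than through an agent, the resulting dictionary would presumably be unpacked into the call, for example scrape_tool.run(**build_run_args("https://web-scraping.dev/products")).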

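Scrapfly Configuration Options

The scrape_config parameter lets you customize scraping behavior with the following options:

asp: Enables anti-scraping protection bypass.
render_js: Enables JavaScript rendering with a cloud headless browser.
proxy_pool: Selects a proxy pool (for example, "public_residential_pool" or "datacenter").
country: Selects the proxy location (for example, "us" or "uk").
auto_scroll: Automatically scrolls the page to load lazy-loaded content.
js: Custom JavaScript code to be executed by the headless browser.

For the full list of configuration options, see the Scrapfly API documentation.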
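The options listed above map onto a plain dictionary. The following sketch builds one and leaves out anything the caller did not enable. The helper make_scrape_config is hypothetical, and it covers only the six options named in this article.

```python
from typing import Any, Dict, Optional


def make_scrape_config(
    asp: bool = False,
    render_js: bool = False,
    proxy_pool: Optional[str] = None,
    country: Optional[str] = None,
    auto_scroll: bool = False,
    js: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a scrape_config dict from the listed options."""
    options: Dict[str, Any] = {
        "asp": asp,
        "render_js": render_js,
        "proxy_pool": proxy_pool,
        "country": country,
        "auto_scroll": auto_scroll,
        "js": js,
    }
    # Drop unset values so only explicitly requested options are sent.
    return {
        name: value
        for name, value in options.items()
        if value is not None and value is not False
    }


config = make_scrape_config(asp=True, render_js=True, country="us")
print(config)
```

Omitting unset keys keeps the request close to Scrapfly's own defaults instead of overriding them with explicit False values.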

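Usage

When ScrapflyScrapeWebsiteTool is used with an agent, the agent must supply the URL of the website to scrape and can optionally specify the format and other configuration options: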

# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent
scrape_task = Task(
    description="Extract the main content from example.com in markdown format.",
    expected_output="The main content of example.com in markdown format.",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

For more advanced usage with custom configuration:

# Create a task with more specific instructions
advanced_scrape_task = Task(
    description="""
    Extract content from example.com with the following requirements:
    - Convert the content to plain text format
    - Enable JavaScript rendering
    - Use a US-based proxy
    - Handle any scraping failures gracefully
    """,
    expected_output="The extracted content from example.com",
    agent=web_scraper_agent,
)

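Error Handling

By default, ScrapflyScrapeWebsiteTool raises an exception if scraping fails. You can instruct the agent to handle failures gracefully by specifying the ignore_scrape_failures parameter: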

# Create a task that instructs the agent to handle errors
error_handling_task = Task(
    description="""
    Extract content from a potentially problematic website and make sure to handle any 
    scraping failures gracefully by setting ignore_scrape_failures to True.
    """,
    expected_output="Either the extracted content or a graceful error message",
    agent=web_scraper_agent,
)
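The two failure modes described in this section can be tried without network access. The sketch below is a stand-in, not the tool itself: scrape_or_none, failing_fetch, and content_or_message are hypothetical names, and fetch replaces the real Scrapfly call.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def scrape_or_none(
    fetch: Callable[[str], str],
    url: str,
    ignore_scrape_failures: bool = False,
) -> Optional[str]:
    """Stand-in for the tool's failure handling; `fetch` replaces the API call."""
    try:
        return fetch(url)
    except Exception as exc:
        if ignore_scrape_failures:
            # Log the failure and signal it with None instead of raising.
            logger.error("Error fetching data from %s, exception: %s", url, exc)
            return None
        raise


def failing_fetch(url: str) -> str:
    # Simulates a blocked or unreachable page.
    raise ConnectionError(f"could not reach {url}")


def content_or_message(content: Optional[str]) -> str:
    """Turn a None result into a readable message for downstream steps."""
    if content is None:
        return "Scrape failed; no content available."
    return content
```

Because a None result is easy to mistake for an empty page, code that consumes the tool's output should check for it explicitly, as content_or_message does.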

Implementation Details

ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:

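import logging
from typing import Any, Dict, Optional

from crewai.tools import BaseTool

logger = logging.getLogger(__name__)


class ScrapflyScrapeWebsiteTool(BaseTool):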
    name: str = "Scrapfly web scraping API tool"
    description: str = (
        "Scrape a webpage url using Scrapfly and return its content as markdown or text"
    )
    
    # Implementation details...
    
    def _run(
        self,
        url: str,
        scrape_format: str = "markdown",
        scrape_config: Optional[Dict[str, Any]] = None,
        ignore_scrape_failures: Optional[bool] = None,
    ):
        from scrapfly import ScrapeApiResponse, ScrapeConfig

        scrape_config = scrape_config if scrape_config is not None else {}
        try:
            response: ScrapeApiResponse = self.scrapfly.scrape(
                ScrapeConfig(url, format=scrape_format, **scrape_config)
            )
            return response.scrape_result["content"]
        except Exception as e:
            if ignore_scrape_failures:
                logger.error(f"Error fetching data from {url}, exception: {e}")
                return None
            else:
                raise e

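Conclusion

ScrapflyScrapeWebsiteTool provides a powerful way to extract content from websites using Scrapfly's advanced scraping capabilities. With headless browser support, proxies, and anti-bot bypass, it can handle complex websites and return content in several formats. It is especially useful for data extraction, content monitoring, and research tasks that depend on reliable web scraping.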
