Design and Implementation of a Web-Based News Monitoring and Alerting System Built on Scrapy

I put this system together while learning web scraping; working on a real project is the fastest way to learn. It uses the Scrapy framework to crawl multiple sites incrementally, in parallel, and on a schedule, and saves the results to a MySQL database. The front end is built with plain HTML and talks to the back end through Flask and WebSocket, providing sensitive-word monitoring with alerts and news search; there is also a word-cloud view for quick visual analysis. The back end calls a large language model through an Ollama service for translation and content rewriting. Overall it is a full-stack project and a good case study for learning or for a graduation project. Feedback and corrections are welcome!

I. System Overview

This article walks through the design and implementation of a web-based news crawler system. The system automatically crawls content from multiple news sites, stores it in a database, presents and manages it through a front-end page, and offers AI translation and content rewriting. Scrapy serves as the crawler framework, MySQL as the database, and Python as the back-end language, with Tailwind CSS used to build the front-end interface.
System architecture
Data collection layer: Scrapy-based incremental crawlers covering multiple sites
Data storage layer: MySQL database
Business logic layer: data processing and AI services
Presentation layer: front-end pages built with Tailwind CSS

(Figure: system design flowchart)

II. Environment Setup

Tech stack

Crawler framework: Scrapy 2.6+
Database: MySQL 8.0
Back-end language: Python 3.8+
Front end: HTML5, Tailwind CSS, JavaScript
Other dependencies: pymysql, python-dotenv, pytz, etc.
Installing dependencies

bash
pip install scrapy pymysql python-dotenv pytz python-dateutil
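
Since db.py (shown below) reads its connection settings through python-dotenv, a .env file can be placed in the project root with entries such as DB_HOST=localhost, DB_USER=root, DB_PASSWORD=... and DB_NAME=news_DB. The variable names match what db.py expects; the values here are placeholders, so use your own credentials.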

III. Core Implementation

1. Database Design and Access

Database connection helpers (db.py)
First, implement the database connection and the basic access helpers:

python

import pymysql
from dotenv import load_dotenv
import os
from datetime import datetime

# Load environment variables from the .env file
load_dotenv()

def get_db_connection():
    return pymysql.connect(
        host=os.getenv('DB_HOST', 'localhost'),
        user=os.getenv('DB_USER', 'root'),
        password=os.getenv('DB_PASSWORD', '060109yzf'),
        db=os.getenv('DB_NAME', 'news_DB'),
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )

# Key to incremental crawling: check whether a URL is already in the database
def is_url_exists(url):
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            # Query news_table; returns a row if the URL already exists
            cursor.execute("SELECT 1 FROM news_table WHERE url = %s", (url,))
            return cursor.fetchone() is not None
    finally:
        conn.close()

def save_news(item):
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            sql = """
            INSERT INTO news_table (title, pub_time, url, source1, content)
            VALUES (%s, %s, %s, %s, %s)
            """
            cursor.execute(sql, (
                item['title'],
                item['pub_time'],
                item['url'],
                item['source1'],
                item['content']
            ))
        conn.commit()
    finally:
        conn.close()
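
Both spiders shown later import a convert_datetime_format helper from this module alongside is_url_exists; its implementation is not reproduced in this post. A minimal sketch of what it could look like, assuming the publish-time strings are either dateutil-parsable or use the Chinese "YYYY年M月D日" style, is:

python

from dateutil import parser

def convert_datetime_format(time_str):
    """Normalize a publish-time string to 'YYYY-MM-DD HH:MM:SS'.

    Sketch only: the real formats depend on each site. This version assumes either
    a dateutil-parsable string or a Chinese-style date such as '2023年8月15日 22:30'.
    """
    cleaned = time_str.strip()
    # Turn '2023年8月15日' style dates into '2023-8-15' so dateutil can parse them
    cleaned = cleaned.replace('年', '-').replace('月', '-').replace('日', '')
    try:
        return parser.parse(cleaned, fuzzy=True).strftime("%Y-%m-%d %H:%M:%S")
    except (ValueError, OverflowError):
        return cleaned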

Table creation and the storage pipeline (mysqlpipeline_news.py)
The pipeline below creates the table (if it does not exist yet) and stores crawled items:

python

import pymysql

class MysqlPipeline_News(object):
    """新闻数据存储管道"""

    def __init__(self):
        # Open the database connection (credentials are hardcoded here; see the deployment notes about environment variables)
        self.conn = pymysql.connect(host='localhost', user='root', password='060109yzf', database='news_DB',
                                    charset='utf8mb4')
        # Create a cursor
        try:
            self.cursor = self.conn.cursor()
            # Create the table if it does not exist
            self.cursor.execute("""
                        CREATE TABLE IF NOT EXISTS news_table (
                            id INT AUTO_INCREMENT PRIMARY KEY,
                            source1 TEXT NOT NULL,
                            title TEXT NOT NULL,
                            pub_time VARCHAR(255) NOT NULL,
                            url VARCHAR(255),
                            description TEXT,
                            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                            chinese_title TEXT,
                            chinese_content TEXT,
                            summary TEXT,
                            ai_processed TINYINT NOT NULL DEFAULT 0,  # 0: not processed, 1: processed
                            language_type VARCHAR(10) NOT NULL DEFAULT 'unknown'
                        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
                    """)
            self.conn.commit()
        except pymysql.MySQLError as e:
            print(f"游标创建失败: {e}")
            return

    def process_item(self, item, spider):
        # Insert statement
        insert_sql = """
        INSERT INTO news_table(source1,title,pub_time,url,description,language_type) VALUES(%s,%s,%s,%s,%s,%s)
        """
        # Insert the crawled item into the database
        self.cursor.execute(insert_sql,
                            (item['source1'], item['title'], item['pub_time'], item['url'], item['description'], item['language_type']))
        # Commit; without a commit nothing is persisted
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and the connection
        self.cursor.close()
        self.conn.close()
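
One practical note: is_url_exists() in db.py runs SELECT 1 FROM news_table WHERE url = %s for every candidate link, so the incremental check becomes a full table scan once the table grows. It may be worth adding an index on the url column, for example ALTER TABLE news_table ADD INDEX idx_url (url); (on MySQL 8.0 with InnoDB this works directly on the utf8mb4 VARCHAR(255) column).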

(Figure: table structure and crawled records)

2. Spider Implementation

Shared news Item definition (items.py)
Define a unified data structure for news items:

python

import scrapy

# Unified news item
class NewsItem(scrapy.Item):
    title = scrapy.Field()          # headline
    pub_time = scrapy.Field()       # publish time
    source1 = scrapy.Field()        # source
    url = scrapy.Field()            # link
    description = scrapy.Field()    # body text
    language_type = scrapy.Field()  # language type

CNN spider (cnn_news.py)
The CNN spider walks the site's multi-level navigation and extracts article details:

python

import scrapy
import re
from quotes_toscrape.items import NewsItem
from ..db import is_url_exists, convert_datetime_format
from urllib.parse import urljoin
import pytz
from dateutil import parser

class CNNSpider(scrapy.Spider):
    name = 'cnn_news'
    allowed_domains = ['cnn.com']
    start_urls = ['https://edition.cnn.com/']
    base_url = 'https://edition.cnn.com/'
    
    custom_settings = {
        'ITEM_PIPELINES': {'quotes_toscrape.mysqlpipeline_news.MysqlPipeline_News': 202},
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
        },
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True
    }
    
    def detect_url_lang(self, response):
        """检测页面语言类型"""
        html_lang = response.xpath('//html/@lang').get()
        meta_lang = None
        if not html_lang:
            meta_lang = response.xpath('//meta[@http-equiv="content-language"]/@content').get()
        og_locale = None
        if not html_lang and not meta_lang:
            og_locale = response.xpath('//meta[@property="og:locale"]/@content').get()
        return html_lang or meta_lang or og_locale or 'unknown'

    def convert_to_beijing_time(self, time_str):
        """将ET/EDT/EST时区时间转换为北京时间"""
        cleaned = re.sub(r'[^\x00-\x7F]', '', time_str.strip())
        cleaned = re.sub(r'\s+', ' ', cleaned.replace('\n', ' '))
        cleaned = re.sub(r'^(\w+)\s+', '', cleaned, flags=re.IGNORECASE)
        
        tz_match = re.search(r'\s*(ET|EDT|EST)\s*', cleaned, flags=re.IGNORECASE)
        if not tz_match:
            raise ValueError(f"未找到有效时区(ET/EDT/EST): {cleaned}")
        tz_abbr = tz_match.group(1).upper()
        
        time_part = re.sub(r'\s*(ET|EDT|EST)\s*', '', cleaned, flags=re.IGNORECASE).strip()
        try:
            naive_dt = parser.parse(time_part)
        except parser.ParserError as e:
            raise ValueError(f"解析时间失败: {time_part}(错误:{str(e)})")
            
        source_tz = pytz.timezone('America/New_York')
        is_dst = (tz_abbr == 'EDT')
        localized_dt = source_tz.localize(naive_dt, is_dst=is_dst)
        
        beijing_tz = pytz.timezone('Asia/Shanghai')
        beijing_dt = localized_dt.astimezone(beijing_tz)
        
        return beijing_dt.strftime("%Y-%m-%d %H:%M:%S")

    def parse(self, response):
        """第一层解析:提取所有第一层导航链接"""
        first_level_links = response.xpath(
            "//div[@class='header__nav-item']/a[@class='header__nav-item-link']/@href"
        ).getall()
        
        self.log(f"找到{len(first_level_links)}个第一层导航链接")
        limited_links = first_level_links[:2]  # 限制爬取数量,避免请求过多
        
        for link in limited_links:
            full_url = urljoin(response.url, link)
            self.log(f"跟进第一层链接: {full_url}")
            yield scrapy.Request(
                url=full_url,
                callback=self.parse_second_level,
                meta={'first_level_url': full_url}
            )

    def parse_second_level(self, response):
        """第二层解析:从第一层链接页面提取第二层导航链接"""
        second_level_links = response.xpath(
            "//div[@class='header__nav-item']/a[@class='header__nav-item-link']/@href"
        ).getall()
        
        self.log(f"从{response.meta['first_level_url']}找到{len(second_level_links)}个第二层导航链接")
        
        for link in second_level_links:
            full_url = urljoin(response.url, link)
            self.log(f"跟进第二层链接: {full_url}")
            yield scrapy.Request(
                url=full_url,
                callback=self.parse_news_section,
                meta={
                    'first_level_url': response.meta['first_level_url'],
                    'second_level_url': full_url
                }
            )

    def parse_news_section(self, response):
        """解析新闻版块,提取新闻链接"""
        news_cards = response.xpath('//li[contains(@class, "container__item--type-media-image")]')
        
        for card in news_cards:
            link = card.xpath('.//a[contains(@class, "container__link--type-article")]/@href').get()
            
            if not link:
                continue
            url = response.urljoin(link)
            if is_url_exists(url):  # incremental crawl: skip URLs already in the database
                continue
                
            title = card.xpath('.//span[contains(@class, "container__headline-text")]/text()').get()
            
            yield scrapy.Request(url=url, meta={'title': title, 'url': url},
                                 callback=self.parse_news_detail)

    def parse_news_detail(self, response):
        """解析新闻详情页,提取新闻原文内容"""
        self.log(f"开始解析新闻详情: {response.url}")
        item = NewsItem()
        item['source1'] = "美国有线电视"
        item['title'] = response.meta['title']
        item['url'] = response.meta['url']
        item['language_type'] = self.detect_url_lang(response)
        
        # Extract and normalize the publish time
        raw_time = response.xpath("//div[@class='timestamp__published']/text() | "
                                  "//div[contains(@class, 'timestamp vossi-timestamp')]//text()"
                                  ).getall()
                                  
        if raw_time:
            raw_time_str = ' '.join(raw_time).strip()
            item['pub_time'] = self.convert_to_beijing_time(raw_time_str)
        else:
            # Guard against pages without a timestamp instead of passing None to the converter
            self.log(f"No publish time found on {item['url']}")
            item['pub_time'] = ''

        # ... remainder omitted ...
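
convert_to_beijing_time can be sanity-checked on its own; the sample string below is an assumed CNN-style timestamp ("Updated ... EDT ..."), not one taken from a real page:

python

# Quick manual check with an assumed CNN-style timestamp
spider = CNNSpider()
print(spider.convert_to_beijing_time("Updated 10:30 AM EDT, Tue August 15, 2023"))
# EDT is UTC-4 and Beijing is UTC+8, so this should print 2023-08-15 22:30:00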

Lianhe Zaobao spider (zaobao.py)
The spider for the Lianhe Zaobao site:

python

import scrapy
from quotes_toscrape.items import NewsItem
from ..db import is_url_exists, convert_datetime_format

class ZaobaoSpider(scrapy.Spider):
    name = 'zaobao'
    allowed_domains = ['zaobao.com']
    start_urls = ['https://www.zaobao.com/realtime']
    base_url = 'https://www.zaobao.com/realtime'
    
    custom_settings = {
        'ITEM_PIPELINES': {'quotes_toscrape.mysqlpipeline_news.MysqlPipeline_News': 202}
    }
    
    def detect_url_lang(self, response):
        """检测页面语言类型"""
        html_lang = response.xpath('//html/@lang').get()
        meta_lang = None
        if not html_lang:
            meta_lang = response.xpath('//meta[@http-equiv="content-language"]/@content').get()
        og_locale = None
        if not html_lang and not meta_lang:
            og_locale = response.xpath('//meta[@property="og:locale"]/@content').get()
        return html_lang or meta_lang or og_locale or 'unknown'

    def parse(self, response):
        # Lianhe Zaobao "realtime" section
        source1 = '联合早报-即时'
        news_list = response.xpath(
            '//a[@class="py-4 flex gap-2 border-solid border-b-[1px] border-grey-150 last:border-none"]')

        for news in news_list:
            url = news.xpath('./@href').get()
            if not url:
                continue
            url = response.urljoin(url)

            # Incremental crawling: skip URLs that are already stored
            if is_url_exists(url):
                continue

            title = news.xpath('.//article/text()').get()

            yield scrapy.Request(url=url, meta={'source1': source1, 'title': title, 'url': url},
                                 callback=self.parse_news, dont_filter=True)

        print(f"{source1}: section crawl finished")

    def parse_news(self, response):
        item = NewsItem()
        item['title'] = response.meta['title']
        item['source1'] = response.meta['source1']
        item['url'] = response.meta['url']
        item['language_type'] = self.detect_url_lang(response)
        
        # Extract the publish time
        time_div = response.xpath('//div[@class="byline_area line-clamp-2 text-grey-400"]')
        time_text = time_div.xpath('string(.)').get()
        
        if time_text:
            # Split out the time portion of the byline text
            publish_time = time_text.split(' / ')[-1].strip()
            item['pub_time'] = convert_datetime_format(publish_time)
        
        # Extract the body content
        # (omitted)
        
        yield item

3. Front-End Pages

Main page layout (index.html)
A responsive page for browsing the collected news:

html

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>新闻监控告警分析系统</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link href="https://cdn.jsdelivr.net/npm/font-awesome@4.7.0/css/font-awesome.min.css" rel="stylesheet">
    <script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/4.0.1/socket.io.js"></script>
    <link rel="stylesheet" href="styles.css">
</head>
<body class="bg-gray-100">
<div class="container mx-auto px-4 py-8">
    <header class="mb-8">
        <h1 class="text-3xl font-bold text-center text-gray-800">新闻监控告警系统 V1.0.1</h1>
        <!-- 告警容器 -->
        <div id="alert-container"
             class="mt-4 hidden bg-red-100 border border-red-400 text-red-700 px-4 py-3 rounded relative">
            <span class="block sm:inline" id="alert-message">敏感词告警</span>
            <div class="absolute top-0 right-0 px-4 py-3 flex items-center gap-2">
                <button id="stop-alert-sound" class="text-sm bg-red-500 text-white px-2 py-1 rounded hover:bg-red-600">
                    <i class="fa fa-volume-off mr-1"></i> 关闭声音
                </button>
                <button onclick="document.getElementById('alert-container').classList.add('hidden')"
                        class="text-gray-500 hover:text-gray-700">
                    <i class="fa fa-times"></i>
                </button>
            </div>
        </div>
    </header>

    <!-- 搜索和筛选区域 -->
    <div class="bg-white p-4 rounded-lg shadow mb-6">
        <h2 class="text-xl font-semibold mb-4 flex items-center">
            <i class="fa fa-search text-blue-500 mr-2"></i>新闻搜索
        </h2>
        <div class="grid grid-cols-1 md:grid-cols-2 gap-4">
            <!-- 来源筛选 -->
            <div>
                <label class="block text-sm font-medium text-gray-700 mb-1">新闻来源</label>
                <select id="source-filter" class="w-full border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500">
                    <option value="">全部来源</option>
                    <option value="美国有线电视">美国有线电视</option>
                    <option value="联合早报-即时">联合早报</option>
                    <option value="凤凰网-军事">凤凰网</option>
                </select>
            </div>
            
            <!-- 发布时间范围 -->
            <div>
                <label class="block text-sm font-medium text-gray-700 mb-1">发布时间范围</label>
                <div class="flex gap-2">
                    <input type="date" id="pub-start-date"
                           class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
                           placeholder="开始日期">
                    <input type="date" id="pub-end-date"
                           class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
                           placeholder="结束日期">
                </div>
            </div>

            <!-- 关键词搜索 -->
            <div class="md:col-span-2">
                <label class="block text-sm font-medium text-gray-700 mb-1">关键词</label>
                <div class="flex gap-2">
                    <input type="text" id="keyword" placeholder="标题或内容关键词"
                           class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500">
                    <button id="search-btn" class="bg-blue-500 hover:bg-blue-600 text-white px-6 py-2 rounded">
                        <i class="fa fa-search mr-1"></i> 搜索
                    </button>
                    <button id="reset-btn" class="bg-gray-200 hover:bg-gray-300 text-gray-700 px-6 py-2 rounded">
                        <i class="fa fa-refresh mr-1"></i> 重置
                    </button>
                </div>
            </div>
        </div>
    </div>

    <!-- 新闻列表和详情区域 -->
    <div class="grid grid-cols-1 lg:grid-cols-3 gap-6">
        <!-- 左侧:新闻列表 -->
        <div class="lg:col-span-1">
            <div class="bg-white rounded-lg shadow mb-6">
                <div class="bg-white p-4 rounded-lg shadow mb-6 flex flex-wrap items-center justify-between">
                    <!-- 总新闻数统计 -->
                    <div class="flex items-center mb-2 sm:mb-0">
                        <span class="text-gray-700"><span id="news-total-count"
                                                            class="font-bold text-blue-600">0</span> 条新闻</span>
                    </div>

                    <!-- 敏感词新闻统计 -->
                    <div class="flex items-center ml-0 sm:ml-auto">
                        <i class="fa fa-exclamation-circle text-red-500 mr-2"></i>
                        <span class="text-gray-700">含有敏感词的新闻总数: </span>
                        <span id="sensitive-news-count" class="ml-2 font-bold text-red-600">0</span>
                    </div>
                </div>
                
                <!-- 新闻列表表头 -->
                <div class="grid grid-cols-12 bg-gray-50 py-2 px-4 font-medium border-b">
                    <div class="col-span-4">标题</div>
                    <div class="col-span-2 hidden md:block">来源</div>
                    <div class="col-span-3 hidden sm:block">出版时间</div>
                    <div class="col-span-3 hidden lg:block">爬取时间</div>
                </div>
                
                <!-- 新闻列表内容 -->
                <div id="news-list" class="max-h-[500px] overflow-y-auto">
                    <!-- 新闻列表将在这里动态生成 -->
                    <div class="flex justify-center items-center h-32 text-gray-500">
                        请点击搜索按钮加载新闻
                    </div>
                </div>
                
                <!-- 分页控件 -->
                <div id="pagination" class="py-3 px-4 border-t flex justify-center">
                    <!-- 分页控件将在这里动态生成 -->
                </div>
            </div>
        </div>

        <!-- 右侧:新闻详情 -->
        <div class="lg:col-span-2">
            <div class="w-full bg-white rounded-lg shadow p-5">
                <h2 class="text-lg font-semibold mb-4 flex items-center">
                    <i class="fa fa-newspaper-o text-primary mr-2"></i>新闻详情
                </h2>

                <!-- 新闻标题区域 -->
                <div class="mb-6 pb-4 border-b border-gray-200">
                    <div class="inline-flex items-center">
                        <h3 id="news-detail-title" class="text-xl font-bold text-dark"></h3>
                        <a id="preview-url" href="" target="_blank" class="text-blue-500 hover:underline ml-2 text-sm">
                            <i class="fa fa-external-link mr-1"></i> 查看原文
                        </a>
                    </div>

                    <!-- 来源和时间区域 -->
                    <div class="mt-2 flex justify-start text-sm text-gray-500">
                        <span id="news-detail-source"></span>
                        <span id="news-detail-time" class="ml-4"></span>
                    </div>
                </div>

                <!-- 新闻内容标签切换 -->
                <div class="border-b border-gray-200 mb-4">
                    <div class="flex">
                        <button class="content-tab px-4 py-2 font-medium border-b-2 border-primary text-primary"
                                data-tab="original">
                            原文
                        </button>
                        <button class="content-tab px-4 py-2 font-medium border-b-2 border-transparent text-gray-500 hover:text-gray-700"
                                data-tab="chinese">
                            中文
                        </button>
                        <button class="content-tab px-4 py-2 font-medium border-b-2 border-transparent text-gray-500 hover:text-gray-700"
                                data-tab="summary">
                            转写后
                        </button>
                    </div>
                </div>

                <!-- 新闻内容区域 -->
                <div class="min-h-[500px]">
                    <!-- 原文区域 -->
                    <div id="original-content" class="content-panel">
                        <div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
                            请选择一条新闻查看详情
                        </div>
                    </div>

                    <!-- 中文翻译区域 -->
                    <div id="chinese-content" class="content-panel hidden">
                        <div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
                            请选择一条新闻查看详情
                        </div>
                    </div>

                    <!-- 转写后区域 -->
                    <div id="summary-content" class="content-panel hidden">
                        <div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
                            请选择一条新闻查看详情
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>

<script src="app.js"></script>
</body>
</html>

(Figure: front-end interface, screenshot 1)

(Figure: front-end interface, screenshot 2)

(Figure: Chinese translation generated by the LLM)
(Figure: automatic summary generated by the LLM)

4. Main Program and Scheduled Crawling

main.py schedules all spiders to run periodically:

python

import json
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
import logging
import signal
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("spider.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Spider names to run
SPIDER_NAMES = ['cnn_news', 'zaobao', 'fenghuang']  # multiple spiders
# Interval between crawl rounds (seconds)
CRAWL_INTERVAL = 60

# Initialize the CrawlerRunner
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_all_spiders():
    """运行所有爬虫"""
    logger.info("===== 开始新一轮爬取 =====")
    try:
        # 依次启动所有爬虫
        for spider_name in SPIDER_NAMES:
            yield runner.crawl(spider_name)
        logger.info("===== 本轮爬取完成 =====")
    except Exception as e:
        logger.error(f"爬取过程中出错: {str(e)}", exc_info=True)

    # 安排下一次爬取
    schedule_next_crawl()

def schedule_next_crawl():
    """安排下一次爬取任务"""
    logger.info(f"将在 {CRAWL_INTERVAL} 秒后进行下一轮爬取...")
    reactor.callLater(CRAWL_INTERVAL, crawl_all_spiders)

def shutdown(signum, frame):
    """Handle exit signals"""
    logger.info("Received exit signal, stopping the crawler...")
    reactor.stop()

if __name__ == "__main__":
    # Register signal handlers so the program can exit gracefully
    signal.signal(signal.SIGINT, shutdown)   # Ctrl+C
    signal.signal(signal.SIGTERM, shutdown)

    logger.info("Program started, running the first crawl...")
    # Kick off the first crawl round
    crawl_all_spiders()

    # Start the Twisted reactor (event loop)
    reactor.run()
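
The loop above starts the spiders one after another within each round. Since the introduction mentions parallel crawling, the same CrawlerRunner can also start all spiders at once and wait for them together. A sketch of that variant, meant to live in the same main.py and reusing the runner, logger, SPIDER_NAMES, and schedule_next_crawl defined above (schedule_next_crawl would then need to call this function instead):

python

@defer.inlineCallbacks
def crawl_all_spiders_parallel():
    """Variant: start all spiders at the same time and wait for all of them to finish."""
    logger.info("===== Starting a new crawl round (parallel) =====")
    try:
        # runner.crawl() returns a Deferred immediately, so the spiders run concurrently
        deferreds = [runner.crawl(spider_name) for spider_name in SPIDER_NAMES]
        yield defer.DeferredList(deferreds)
        logger.info("===== Crawl round finished =====")
    except Exception as e:
        logger.error(f"Error during crawl: {str(e)}", exc_info=True)

    schedule_next_crawl()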

5. AI Processing Module

News translation and rewriting are implemented in backend/app.py (excerpt):

python

"""优化逻辑:先判断语种,再选择性翻译"""
"""独立线程处理AI任务"""
def ai_processing_worker():
    """独立线程处理AI任务:翻译和内容转写"""
    init_ollama_client()

    while True:
        try:
            # Take the next pending news item off the queue
            news_item = None
            with queue_lock:
                if news_process_queue:
                    candidate = news_process_queue.pop(0)
                    # If the same news id is queued again later, drop this copy to avoid double processing
                    if any(item['id'] == candidate['id'] for item in news_process_queue):
                        print(f"News id {candidate['id']} is queued again, skipping this duplicate")
                    else:
                        news_item = candidate
            if news_item is None:
                time.sleep(5)  # sleep outside the lock when there is nothing to do
                continue

            news_id = news_item['id']
            title = news_item['title']
            content = news_item['description']
            language_type = news_item['language_type']

            conn = get_db_connection()

            try:
                # Take a row lock to avoid concurrent updates
                with conn.cursor(pymysql.cursors.DictCursor) as cursor:
                    cursor.execute(
                        "SELECT ai_processed FROM news_table WHERE id = %s FOR UPDATE",
                        (news_id,)
                    )
                    result = cursor.fetchone()

                    if not result or result['ai_processed'] == 1:
                        print(f"新闻ID {news_id} 已处理,跳过")
                        continue

                print(f"开始智能处理新闻ID: {news_id}的翻译和转写\n")
                start_time = time.time()
                
                # 1. 判断是否为中文
                is_chinese = language_type.startswith("zh")

                # 2. 初始化变量
                chinese_title = title
                chinese_content = content

                # 3. 非中文时调用翻译
                if not is_chinese:
                    chinese_title = call_LLM_for_translate(title, True, language_type)
                    print(f"新闻ID {news_id} 标题翻译成功!!!")
                    chinese_content = call_LLM_for_translate(content, False, language_type)
                    print(f"新闻ID {news_id} 内容翻译成功!!!")

                # 4. 内容转写
                paraphrased_content = call_large_model_for_paraphrase(chinese_content)

                # 5. Update the database, reusing the connection that holds the row lock
                #    (opening a second connection here would block on the FOR UPDATE lock taken above)
                update_query = """
                    UPDATE news_table
                    SET chinese_title = %s, chinese_content = %s,
                        summary = %s, ai_processed = 1
                    WHERE id = %s
                    """
                with conn.cursor(pymysql.cursors.DictCursor) as cursor:
                    cursor.execute(update_query, (chinese_title, chinese_content,
                                                  paraphrased_content, news_id))
                conn.commit()
                
                # Check for sensitive content
                news_item['chinese_title'] = chinese_title
                news_item['chinese_content'] = chinese_content
                check_sensitive_content(news_item)
                
                # 6. Notify the front end that AI processing is complete
                socketio.emit('ai_process_complete', {
                    'news_id': news_id,
                    'is_chinese_original': is_chinese,
                    'message': '处理完成'
                })
                print(f"新闻ID {news_id} 处理完成,耗时{time.time() - start_time:.2f}秒")
            except Exception as e:
                conn.rollback()
                print(f"AI任务处理失败(news_id={news_id}): {str(e)}")
                # 重新入队
                with queue_lock:
                    if not any(item['id'] == news_id for item in news_process_queue):
                        news_process_queue.append(news_item)
                time.sleep(10)
            finally:
                conn.close()
        except Exception as e:
            print(f"线程循环异常: {str(e)}")
            time.sleep(10)


IV. Testing and Deployment

Test steps

Database connection test: confirm that MySQL can be reached and the table structure created (a minimal check is sketched after this list)
Single-spider test: run each spider on its own (e.g. scrapy crawl cnn_news) and verify that data is crawled and stored correctly
Scheduled-crawl test: verify that the periodic crawling works as expected
Front-end interaction test: exercise search, filtering, and the detail view
AI processing test: check the accuracy of the translation and rewriting
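
For the first step, a short script is enough to confirm both the connection and the table, assuming db.py lives in the quotes_toscrape package as the spiders' relative imports suggest:

python

from quotes_toscrape.db import get_db_connection

def check_db():
    """Connect, make sure news_table is reachable, and print the row count."""
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT COUNT(*) AS cnt FROM news_table")
            print("news_table rows:", cursor.fetchone()['cnt'])
    finally:
        conn.close()

if __name__ == '__main__':
    check_db()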

Deployment notes

Store the database password and other secrets in environment variables in production, not in the source code
Tune the crawl interval and concurrency so the target sites are not put under excessive load
Consider containerizing the deployment with Docker to simplify environment setup
For large-scale crawling, consider a distributed crawler architecture

V. Summary and Outlook

This project implements a fairly complete news crawler system with multi-site crawling, data storage, a front-end interface, and AI processing. Scrapy handles efficient data collection, MySQL stores the structured data, Tailwind CSS provides a clean front end, and an LLM is used to improve the readability of the content.

Possible future improvements:

Add more news sources to improve coverage
Improve the scheduling strategy, for example adjusting crawl frequency based on how active a topic is
Strengthen sensitive-word detection and alerting to make the system more practical
Add data visualization for news trends and hot-topic analysis

With further tuning and extension, the system can grow into a capable news monitoring and analysis platform.
