Design and Implementation of a Web-Based News Monitoring and Alerting System Built on Scrapy

I put this system together while learning web scraping; working on a real project is the fastest way to learn. It uses the Scrapy framework to crawl multiple sites incrementally, in parallel, and on a schedule, and saves the results to a MySQL database. The front end is built with plain HTML and talks to the back end through Flask and WebSocket, providing sensitive-word monitoring with alerts and news search; there is also a word-cloud view for quick visual analysis. The back end calls a large language model through an Ollama service for translation and content rewriting. Overall it is a full-stack project and a good case study for learning or for a graduation project. Feedback and corrections are welcome!

I. System Overview

This article walks through the design and implementation of a web-based news crawler system. The system automatically crawls content from multiple news sites, stores it in a database, presents and manages it through a front-end page, and offers AI translation and content rewriting. Scrapy serves as the crawler framework, MySQL as the database, and Python as the back-end language, with Tailwind CSS used to build the front-end interface.
System architecture
Data collection layer: Scrapy-based incremental crawlers covering multiple sites
Data storage layer: MySQL database
Business logic layer: data processing and AI services
Presentation layer: front-end pages built with Tailwind CSS

(Figure: system design flowchart)

II. Environment Setup

Tech stack

Crawler framework: Scrapy 2.6+
Database: MySQL 8.0
Back-end language: Python 3.8+
Front end: HTML5, Tailwind CSS, JavaScript
Other dependencies: pymysql, python-dotenv, pytz, etc.
Installing dependencies

bash
pip install scrapy pymysql python-dotenv pytz python-dateutil
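
Since db.py (shown below) reads its connection settings through python-dotenv, a .env file can be placed in the project root with entries such as DB_HOST=localhost, DB_USER=root, DB_PASSWORD=... and DB_NAME=news_DB. The variable names match what db.py expects; the values here are placeholders, so use your own credentials.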

III. Core Implementation

1. Database Design and Access

Database connection helpers (db.py)
First, implement the database connection and the basic access helpers:

python

import pymysql
from dotenv import load_dotenv
import os
from datetime import datetime

# Load environment variables from the .env file
load_dotenv()

def get_db_connection():
    return pymysql.connect(
        host=os.getenv('DB_HOST', 'localhost'),
        user=os.getenv('DB_USER', 'root'),
        password=os.getenv('DB_PASSWORD', '060109yzf'),
        db=os.getenv('DB_NAME', 'news_DB'),
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )

# Key to incremental crawling: check whether a URL is already in the database
def is_url_exists(url):
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            # Query news_table; returns a row if the URL already exists
            cursor.execute("SELECT 1 FROM news_table WHERE url = %s", (url,))
            return cursor.fetchone() is not None
    finally:
        conn.close()

def save_news(item):
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            sql = """
            INSERT INTO news_table (title, pub_time, url, source1, content)
            VALUES (%s, %s, %s, %s, %s)
            """
            cursor.execute(sql, (
                item['title'],
                item['pub_time'],
                item['url'],
                item['source1'],
                item['content']
            ))
        conn.commit()
    finally:
        conn.close()
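
Both spiders shown later import a convert_datetime_format helper from this module alongside is_url_exists; its implementation is not reproduced in this post. A minimal sketch of what it could look like, assuming the publish-time strings are either dateutil-parsable or use the Chinese "YYYY年M月D日" style, is:

python

from dateutil import parser

def convert_datetime_format(time_str):
    """Normalize a publish-time string to 'YYYY-MM-DD HH:MM:SS'.

    Sketch only: the real formats depend on each site. This version assumes either
    a dateutil-parsable string or a Chinese-style date such as '2023年8月15日 22:30'.
    """
    cleaned = time_str.strip()
    # Turn '2023年8月15日' style dates into '2023-8-15' so dateutil can parse them
    cleaned = cleaned.replace('年', '-').replace('月', '-').replace('日', '')
    try:
        return parser.parse(cleaned, fuzzy=True).strftime("%Y-%m-%d %H:%M:%S")
    except (ValueError, OverflowError):
        return cleaned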

Table creation and the storage pipeline (mysqlpipeline_news.py)
The pipeline below creates the table (if it does not exist yet) and stores crawled items:

python

import pymysql

class MysqlPipeline_News(object):
    """新闻数据存储管道"""

    def __init__(self):
        # Open the database connection (credentials are hardcoded here; see the deployment notes about environment variables)
        self.conn = pymysql.connect(host='localhost', user='root', password='060109yzf', database='news_DB',
                                    charset='utf8mb4')
        # Create a cursor
        try:
            self.cursor = self.conn.cursor()
            # Create the table if it does not exist
            self.cursor.execute("""
                        CREATE TABLE IF NOT EXISTS news_table (
                            id INT AUTO_INCREMENT PRIMARY KEY,
                            source1 TEXT NOT NULL,
                            title TEXT NOT NULL,
                            pub_time VARCHAR(255) NOT NULL,
                            url VARCHAR(255),
                            description TEXT,
                            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                            chinese_title TEXT,
                            chinese_content TEXT,
                            summary TEXT,
                            ai_processed TINYINT NOT NULL DEFAULT 0,  # 0: not processed, 1: processed
                            language_type VARCHAR(10) NOT NULL DEFAULT 'unknown'
                        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
                    """)
            self.conn.commit()
        except pymysql.MySQLError as e:
            print(f"游标创建失败: {e}")
            return

    def process_item(self, item, spider):
        # Insert statement
        insert_sql = """
        INSERT INTO news_table(source1,title,pub_time,url,description,language_type) VALUES(%s,%s,%s,%s,%s,%s)
        """
        # Insert the crawled item into the database
        self.cursor.execute(insert_sql,
                            (item['source1'], item['title'], item['pub_time'], item['url'], item['description'], item['language_type']))
        # Commit; without a commit nothing is persisted
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and the connection
        self.cursor.close()
        self.conn.close()
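
One practical note: is_url_exists() in db.py runs SELECT 1 FROM news_table WHERE url = %s for every candidate link, so the incremental check becomes a full table scan once the table grows. It may be worth adding an index on the url column, for example ALTER TABLE news_table ADD INDEX idx_url (url); (on MySQL 8.0 with InnoDB this works directly on the utf8mb4 VARCHAR(255) column).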

(Figure: table structure and crawled records)

2. Spider Implementation

Shared news Item definition (items.py)
Define a unified data structure for news items:

python

import scrapy

# Unified news item
class NewsItem(scrapy.Item):
    title = scrapy.Field()          # headline
    pub_time = scrapy.Field()       # publish time
    source1 = scrapy.Field()        # source
    url = scrapy.Field()            # link
    description = scrapy.Field()    # body text
    language_type = scrapy.Field()  # language type

CNN spider (cnn_news.py)
The CNN spider walks the site's multi-level navigation and extracts article details:

python

import scrapy
import re
from quotes_toscrape.items import NewsItem
from ..db import is_url_exists, convert_datetime_format
from urllib.parse import urljoin
import pytz
from dateutil import parser

class CNNSpider(scrapy.Spider):
    name = 'cnn_news'
    allowed_domains = ['cnn.com']
    start_urls = ['https://edition.cnn.com/']
    base_url = 'https://edition.cnn.com/'
    
    custom_settings = {
        'ITEM_PIPELINES': {'quotes_toscrape.mysqlpipeline_news.MysqlPipeline_News': 202},
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
        },
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True
    }
    
    def detect_url_lang(self, response):
        """检测页面语言类型"""
        html_lang = response.xpath('//html/@lang').get()
        meta_lang = None
        if not html_lang:
            meta_lang = response.xpath('//meta[@http-equiv="content-language"]/@content').get()
        og_locale = None
        if not html_lang and not meta_lang:
            og_locale = response.xpath('//meta[@property="og:locale"]/@content').get()
        return html_lang or meta_lang or og_locale or 'unknown'

    def convert_to_beijing_time(self, time_str):
        """将ET/EDT/EST时区时间转换为北京时间"""
        cleaned = re.sub(r'[^\x00-\x7F]', '', time_str.strip())
        cleaned = re.sub(r'\s+', ' ', cleaned.replace('\n', ' '))
        cleaned = re.sub(r'^(\w+)\s+', '', cleaned, flags=re.IGNORECASE)
        
        tz_match = re.search(r'\s*(ET|EDT|EST)\s*', cleaned, flags=re.IGNORECASE)
        if not tz_match:
            raise ValueError(f"未找到有效时区(ET/EDT/EST): {cleaned}")
        tz_abbr = tz_match.group(1).upper()
        
        time_part = re.sub(r'\s*(ET|EDT|EST)\s*', '', cleaned, flags=re.IGNORECASE).strip()
        try:
            naive_dt = parser.parse(time_part)
        except parser.ParserError as e:
            raise ValueError(f"解析时间失败: {time_part}(错误:{str(e)})")
            
        source_tz = pytz.timezone('America/New_York')
        is_dst = (tz_abbr == 'EDT')
        localized_dt = source_tz.localize(naive_dt, is_dst=is_dst)
        
        beijing_tz = pytz.timezone('Asia/Shanghai')
        beijing_dt = localized_dt.astimezone(beijing_tz)
        
        return beijing_dt.strftime("%Y-%m-%d %H:%M:%S")

    def parse(self, response):
        """第一层解析:提取所有第一层导航链接"""
        first_level_links = response.xpath(
            "//div[@class='header__nav-item']/a[@class='header__nav-item-link']/@href"
        ).getall()
        
        self.log(f"找到{len(first_level_links)}个第一层导航链接")
        limited_links = first_level_links[:2]  # 限制爬取数量,避免请求过多
        
        for link in limited_links:
            full_url = urljoin(response.url, link)
            self.log(f"跟进第一层链接: {full_url}")
            yield scrapy.Request(
                url=full_url,
                callback=self.parse_second_level,
                meta={'first_level_url': full_url}
            )

    def parse_second_level(self, response):
        """第二层解析:从第一层链接页面提取第二层导航链接"""
        second_level_links = response.xpath(
            "//div[@class='header__nav-item']/a[@class='header__nav-item-link']/@href"
        ).getall()
        
        self.log(f"从{response.meta['first_level_url']}找到{len(second_level_links)}个第二层导航链接")
        
        for link in second_level_links:
            full_url = urljoin(response.url, link)
            self.log(f"跟进第二层链接: {full_url}")
            yield scrapy.Request(
                url=full_url,
                callback=self.parse_news_section,
                meta={
                    'first_level_url': response.meta['first_level_url'],
                    'second_level_url': full_url
                }
            )

    def parse_news_section(self, response):
        """解析新闻版块,提取新闻链接"""
        news_cards = response.xpath('//li[contains(@class, "container__item--type-media-image")]')
        
        for card in news_cards:
            link = card.xpath('.//a[contains(@class, "container__link--type-article")]/@href').get()
            
            if not link:
                continue
            url = response.urljoin(link)
            if is_url_exists(url):  # incremental crawl: skip URLs already in the database
                continue
                
            title = card.xpath('.//span[contains(@class, "container__headline-text")]/text()').get()
            
            yield scrapy.Request(url=url, meta={'title': title, 'url': url},
                                 callback=self.parse_news_detail)

    def parse_news_detail(self, response):
        """解析新闻详情页,提取新闻原文内容"""
        self.log(f"开始解析新闻详情: {response.url}")
        item = NewsItem()
        item['source1'] = "美国有线电视"
        item['title'] = response.meta['title']
        item['url'] = response.meta['url']
        item['language_type'] = self.detect_url_lang(response)
        
        # Extract and normalize the publish time
        raw_time = response.xpath("//div[@class='timestamp__published']/text() | "
                                  "//div[contains(@class, 'timestamp vossi-timestamp')]//text()"
                                  ).getall()
                                  
        if raw_time:
            raw_time_str = ' '.join(raw_time).strip()
            item['pub_time'] = self.convert_to_beijing_time(raw_time_str)
        else:
            # Guard against pages without a timestamp instead of passing None to the converter
            self.log(f"No publish time found on {item['url']}")
            item['pub_time'] = ''

        # ... remainder omitted ...
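
convert_to_beijing_time can be sanity-checked on its own; the sample string below is an assumed CNN-style timestamp ("Updated ... EDT ..."), not one taken from a real page:

python

# Quick manual check with an assumed CNN-style timestamp
spider = CNNSpider()
print(spider.convert_to_beijing_time("Updated 10:30 AM EDT, Tue August 15, 2023"))
# EDT is UTC-4 and Beijing is UTC+8, so this should print 2023-08-15 22:30:00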

Lianhe Zaobao spider (zaobao.py)
The spider for the Lianhe Zaobao site:

python

import scrapy
from quotes_toscrape.items import NewsItem
from ..db import is_url_exists, convert_datetime_format

class ZaobaoSpider(scrapy.Spider):
    name = 'zaobao'
    allowed_domains = ['zaobao.com']
    start_urls = ['https://www.zaobao.com/realtime']
    base_url = 'https://www.zaobao.com/realtime'
    
    custom_settings = {
        'ITEM_PIPELINES': {'quotes_toscrape.mysqlpipeline_news.MysqlPipeline_News': 202}
    }
    
    def detect_url_lang(self, response):
        """检测页面语言类型"""
        html_lang = response.xpath('//html/@lang').get()
        meta_lang = None
        if not html_lang:
            meta_lang = response.xpath('//meta[@http-equiv="content-language"]/@content').get()
        og_locale = None
        if not html_lang and not meta_lang:
            og_locale = response.xpath('//meta[@property="og:locale"]/@content').get()
        return html_lang or meta_lang or og_locale or 'unknown'

    def parse(self, response):
        # Lianhe Zaobao "realtime" section
        source1 = '联合早报-即时'
        news_list = response.xpath(
            '//a[@class="py-4 flex gap-2 border-solid border-b-[1px] border-grey-150 last:border-none"]')

        for news in news_list:
            url = news.xpath('./@href').get()
            if not url:
                continue
            url = response.urljoin(url)

            # Incremental crawling: skip URLs that are already stored
            if is_url_exists(url):
                continue

            title = news.xpath('.//article/text()').get()

            yield scrapy.Request(url=url, meta={'source1': source1, 'title': title, 'url': url},
                                 callback=self.parse_news, dont_filter=True)

        print(f"{source1}: section crawl finished")

    def parse_news(self, response):
        item = NewsItem()
        item['title'] = response.meta['title']
        item['source1'] = response.meta['source1']
        item['url'] = response.meta['url']
        item['language_type'] = self.detect_url_lang(response)
        
        # Extract the publish time
        time_div = response.xpath('//div[@class="byline_area line-clamp-2 text-grey-400"]')
        time_text = time_div.xpath('string(.)').get()
        
        if time_text:
            # Split out the time portion of the byline text
            publish_time = time_text.split(' / ')[-1].strip()
            item['pub_time'] = convert_datetime_format(publish_time)
        
        # Extract the body content
        # (omitted)
        
        yield item

3. Front-End Pages

Main page layout (index.html)
A responsive page for browsing the collected news:

html

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>新闻监控告警分析系统</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link href="https://cdn.jsdelivr.net/npm/font-awesome@4.7.0/css/font-awesome.min.css" rel="stylesheet">
    <script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/4.0.1/socket.io.js"></script>
    <link rel="stylesheet" href="styles.css">
</head>
<body class="bg-gray-100">
<div class="container mx-auto px-4 py-8">
    <header class="mb-8">
        <h1 class="text-3xl font-bold text-center text-gray-800">新闻监控告警系统 V1.0.1</h1>
        <!-- 告警容器 -->
        <div id="alert-container"
             class="mt-4 hidden bg-red-100 border border-red-400 text-red-700 px-4 py-3 rounded relative">
            <span class="block sm:inline" id="alert-message">敏感词告警</span>
            <div class="absolute top-0 right-0 px-4 py-3 flex items-center gap-2">
                <button id="stop-alert-sound" class="text-sm bg-red-500 text-white px-2 py-1 rounded hover:bg-red-600">
                    <i class="fa fa-volume-off mr-1"></i> 关闭声音
                </button>
                <button onclick="document.getElementById('alert-container').classList.add('hidden')"
                        class="text-gray-500 hover:text-gray-700">
                    <i class="fa fa-times"></i>
                </button>
            </div>
        </div>
    </header>

    <!-- 搜索和筛选区域 -->
    <div class="bg-white p-4 rounded-lg shadow mb-6">
        <h2 class="text-xl font-semibold mb-4 flex items-center">
            <i class="fa fa-search text-blue-500 mr-2"></i>新闻搜索
        </h2>
        <div class="grid grid-cols-1 md:grid-cols-2 gap-4">
            <!-- 来源筛选 -->
            <div>
                <label class="block text-sm font-medium text-gray-700 mb-1">新闻来源</label>
                <select id="source-filter" class="w-full border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500">
                    <option value="">全部来源</option>
                    <option value="美国有线电视">美国有线电视</option>
                    <option value="联合早报-即时">联合早报</option>
                    <option value="凤凰网-军事">凤凰网</option>
                </select>
            </div>
            
            <!-- 发布时间范围 -->
            <div>
                <label class="block text-sm font-medium text-gray-700 mb-1">发布时间范围</label>
                <div class="flex gap-2">
                    <input type="date" id="pub-start-date"
                           class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
                           placeholder="开始日期">
                    <input type="date" id="pub-end-date"
                           class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
                           placeholder="结束日期">
                </div>
            </div>

            <!-- 关键词搜索 -->
            <div class="md:col-span-2">
                <label class="block text-sm font-medium text-gray-700 mb-1">关键词</label>
                <div class="flex gap-2">
                    <input type="text" id="keyword" placeholder="标题或内容关键词"
                           class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500">
                    <button id="search-btn" class="bg-blue-500 hover:bg-blue-600 text-white px-6 py-2 rounded">
                        <i class="fa fa-search mr-1"></i> 搜索
                    </button>
                    <button id="reset-btn" class="bg-gray-200 hover:bg-gray-300 text-gray-700 px-6 py-2 rounded">
                        <i class="fa fa-refresh mr-1"></i> 重置
                    </button>
                </div>
            </div>
        </div>
    </div>

    <!-- 新闻列表和详情区域 -->
    <div class="grid grid-cols-1 lg:grid-cols-3 gap-6">
        <!-- 左侧:新闻列表 -->
        <div class="lg:col-span-1">
            <div class="bg-white rounded-lg shadow mb-6">
                <div class="bg-white p-4 rounded-lg shadow mb-6 flex flex-wrap items-center justify-between">
                    <!-- 总新闻数统计 -->
                    <div class="flex items-center mb-2 sm:mb-0">
                        <span class="text-gray-700"><span id="news-total-count"
                                                            class="font-bold text-blue-600">0</span> 条新闻</span>
                    </div>

                    <!-- 敏感词新闻统计 -->
                    <div class="flex items-center ml-0 sm:ml-auto">
                        <i class="fa fa-exclamation-circle text-red-500 mr-2"></i>
                        <span class="text-gray-700">含有敏感词的新闻总数: </span>
                        <span id="sensitive-news-count" class="ml-2 font-bold text-red-600">0</span>
                    </div>
                </div>
                
                <!-- 新闻列表表头 -->
                <div class="grid grid-cols-12 bg-gray-50 py-2 px-4 font-medium border-b">
                    <div class="col-span-4">标题</div>
                    <div class="col-span-2 hidden md:block">来源</div>
                    <div class="col-span-3 hidden sm:block">出版时间</div>
                    <div class="col-span-3 hidden lg:block">爬取时间</div>
                </div>
                
                <!-- 新闻列表内容 -->
                <div id="news-list" class="max-h-[500px] overflow-y-auto">
                    <!-- 新闻列表将在这里动态生成 -->
                    <div class="flex justify-center items-center h-32 text-gray-500">
                        请点击搜索按钮加载新闻
                    </div>
                </div>
                
                <!-- 分页控件 -->
                <div id="pagination" class="py-3 px-4 border-t flex justify-center">
                    <!-- 分页控件将在这里动态生成 -->
                </div>
            </div>
        </div>

        <!-- 右侧:新闻详情 -->
        <div class="lg:col-span-2">
            <div class="w-full bg-white rounded-lg shadow p-5">
                <h2 class="text-lg font-semibold mb-4 flex items-center">
                    <i class="fa fa-newspaper-o text-primary mr-2"></i>新闻详情
                </h2>

                <!-- 新闻标题区域 -->
                <div class="mb-6 pb-4 border-b border-gray-200">
                    <div class="inline-flex items-center">
                        <h3 id="news-detail-title" class="text-xl font-bold text-dark"></h3>
                        <a id="preview-url" href="" target="_blank" class="text-blue-500 hover:underline ml-2 text-sm">
                            <i class="fa fa-external-link mr-1"></i> 查看原文
                        </a>
                    </div>

                    <!-- 来源和时间区域 -->
                    <div class="mt-2 flex justify-start text-sm text-gray-500">
                        <span id="news-detail-source"></span>
                        <span id="news-detail-time" class="ml-4"></span>
                    </div>
                </div>

                <!-- 新闻内容标签切换 -->
                <div class="border-b border-gray-200 mb-4">
                    <div class="flex">
                        <button class="content-tab px-4 py-2 font-medium border-b-2 border-primary text-primary"
                                data-tab="original">
                            原文
                        </button>
                        <button class="content-tab px-4 py-2 font-medium border-b-2 border-transparent text-gray-500 hover:text-gray-700"
                                data-tab="chinese">
                            中文
                        </button>
                        <button class="content-tab px-4 py-2 font-medium border-b-2 border-transparent text-gray-500 hover:text-gray-700"
                                data-tab="summary">
                            转写后
                        </button>
                    </div>
                </div>

                <!-- 新闻内容区域 -->
                <div class="min-h-[500px]">
                    <!-- 原文区域 -->
                    <div id="original-content" class="content-panel">
                        <div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
                            请选择一条新闻查看详情
                        </div>
                    </div>

                    <!-- 中文翻译区域 -->
                    <div id="chinese-content" class="content-panel hidden">
                        <div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
                            请选择一条新闻查看详情
                        </div>
                    </div>

                    <!-- 转写后区域 -->
                    <div id="summary-content" class="content-panel hidden">
                        <div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
                            请选择一条新闻查看详情
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>

<script src="app.js"></script>
</body>
</html>

(Figure: front-end interface, screenshot 1)

(Figure: front-end interface, screenshot 2)

(Figure: Chinese translation generated by the LLM)
(Figure: automatic summary generated by the LLM)

4. Main Program and Scheduled Crawling

main.py schedules all spiders to run periodically:

python

import json
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
import logging
import signal
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("spider.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Spider names to run
SPIDER_NAMES = ['cnn_news', 'zaobao', 'fenghuang']  # multiple spiders
# Interval between crawl rounds (seconds)
CRAWL_INTERVAL = 60

# Initialize the CrawlerRunner
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_all_spiders():
    """运行所有爬虫"""
    logger.info("===== 开始新一轮爬取 =====")
    try:
        # 依次启动所有爬虫
        for spider_name in SPIDER_NAMES:
            yield runner.crawl(spider_name)
        logger.info("===== 本轮爬取完成 =====")
    except Exception as e:
        logger.error(f"爬取过程中出错: {str(e)}", exc_info=True)

    # 安排下一次爬取
    schedule_next_crawl()

def schedule_next_crawl():
    """安排下一次爬取任务"""
    logger.info(f"将在 {CRAWL_INTERVAL} 秒后进行下一轮爬取...")
    reactor.callLater(CRAWL_INTERVAL, crawl_all_spiders)

def shutdown(signum, frame):
    """Handle exit signals"""
    logger.info("Received exit signal, stopping the crawler...")
    reactor.stop()

if __name__ == "__main__":
    # Register signal handlers so the program can exit gracefully
    signal.signal(signal.SIGINT, shutdown)   # Ctrl+C
    signal.signal(signal.SIGTERM, shutdown)

    logger.info("Program started, running the first crawl...")
    # Kick off the first crawl round
    crawl_all_spiders()

    # Start the Twisted reactor (event loop)
    reactor.run()
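
The loop above starts the spiders one after another within each round. Since the introduction mentions parallel crawling, the same CrawlerRunner can also start all spiders at once and wait for them together. A sketch of that variant, meant to live in the same main.py and reusing the runner, logger, SPIDER_NAMES, and schedule_next_crawl defined above (schedule_next_crawl would then need to call this function instead):

python

@defer.inlineCallbacks
def crawl_all_spiders_parallel():
    """Variant: start all spiders at the same time and wait for all of them to finish."""
    logger.info("===== Starting a new crawl round (parallel) =====")
    try:
        # runner.crawl() returns a Deferred immediately, so the spiders run concurrently
        deferreds = [runner.crawl(spider_name) for spider_name in SPIDER_NAMES]
        yield defer.DeferredList(deferreds)
        logger.info("===== Crawl round finished =====")
    except Exception as e:
        logger.error(f"Error during crawl: {str(e)}", exc_info=True)

    schedule_next_crawl()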

5. AI Processing Module

News translation and rewriting are implemented in backend/app.py (excerpt):

python

"""优化逻辑:先判断语种,再选择性翻译"""
"""独立线程处理AI任务"""
def ai_processing_worker():
    """独立线程处理AI任务:翻译和内容转写"""
    init_ollama_client()

    while True:
        try:
            # Take the next pending news item off the queue
            news_item = None
            with queue_lock:
                if news_process_queue:
                    candidate = news_process_queue.pop(0)
                    # If the same news id is queued again later, drop this copy to avoid double processing
                    if any(item['id'] == candidate['id'] for item in news_process_queue):
                        print(f"News id {candidate['id']} is queued again, skipping this duplicate")
                    else:
                        news_item = candidate
            if news_item is None:
                time.sleep(5)  # sleep outside the lock when there is nothing to do
                continue

            news_id = news_item['id']
            title = news_item['title']
            content = news_item['description']
            language_type = news_item['language_type']

            conn = get_db_connection()

            try:
                # Take a row lock to avoid concurrent updates
                with conn.cursor(pymysql.cursors.DictCursor) as cursor:
                    cursor.execute(
                        "SELECT ai_processed FROM news_table WHERE id = %s FOR UPDATE",
                        (news_id,)
                    )
                    result = cursor.fetchone()

                    if not result or result['ai_processed'] == 1:
                        print(f"新闻ID {news_id} 已处理,跳过")
                        continue

                print(f"开始智能处理新闻ID: {news_id}的翻译和转写\n")
                start_time = time.time()
                
                # 1. 判断是否为中文
                is_chinese = language_type.startswith("zh")

                # 2. 初始化变量
                chinese_title = title
                chinese_content = content

                # 3. 非中文时调用翻译
                if not is_chinese:
                    chinese_title = call_LLM_for_translate(title, True, language_type)
                    print(f"新闻ID {news_id} 标题翻译成功!!!")
                    chinese_content = call_LLM_for_translate(content, False, language_type)
                    print(f"新闻ID {news_id} 内容翻译成功!!!")

                # 4. 内容转写
                paraphrased_content = call_large_model_for_paraphrase(chinese_content)

                # 5. Update the database, reusing the connection that holds the row lock
                #    (opening a second connection here would block on the FOR UPDATE lock taken above)
                update_query = """
                    UPDATE news_table
                    SET chinese_title = %s, chinese_content = %s,
                        summary = %s, ai_processed = 1
                    WHERE id = %s
                    """
                with conn.cursor(pymysql.cursors.DictCursor) as cursor:
                    cursor.execute(update_query, (chinese_title, chinese_content,
                                                  paraphrased_content, news_id))
                conn.commit()
                
                # Check for sensitive content
                news_item['chinese_title'] = chinese_title
                news_item['chinese_content'] = chinese_content
                check_sensitive_content(news_item)
                
                # 6. Notify the front end that AI processing is complete
                socketio.emit('ai_process_complete', {
                    'news_id': news_id,
                    'is_chinese_original': is_chinese,
                    'message': '处理完成'
                })
                print(f"新闻ID {news_id} 处理完成,耗时{time.time() - start_time:.2f}秒")
            except Exception as e:
                conn.rollback()
                print(f"AI任务处理失败(news_id={news_id}): {str(e)}")
                # 重新入队
                with queue_lock:
                    if not any(item['id'] == news_id for item in news_process_queue):
                        news_process_queue.append(news_item)
                time.sleep(10)
            finally:
                conn.close()
        except Exception as e:
            print(f"线程循环异常: {str(e)}")
            time.sleep(10)


IV. Testing and Deployment

Test steps

Database connection test: confirm that MySQL can be reached and the table structure created (a minimal check is sketched after this list)
Single-spider test: run each spider on its own (e.g. scrapy crawl cnn_news) and verify that data is crawled and stored correctly
Scheduled-crawl test: verify that the periodic crawling works as expected
Front-end interaction test: exercise search, filtering, and the detail view
AI processing test: check the accuracy of the translation and rewriting
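
For the first step, a short script is enough to confirm both the connection and the table, assuming db.py lives in the quotes_toscrape package as the spiders' relative imports suggest:

python

from quotes_toscrape.db import get_db_connection

def check_db():
    """Connect, make sure news_table is reachable, and print the row count."""
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT COUNT(*) AS cnt FROM news_table")
            print("news_table rows:", cursor.fetchone()['cnt'])
    finally:
        conn.close()

if __name__ == '__main__':
    check_db()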

Deployment notes

Store the database password and other secrets in environment variables in production, not in the source code
Tune the crawl interval and concurrency so the target sites are not put under excessive load
Consider containerizing the deployment with Docker to simplify environment setup
For large-scale crawling, consider a distributed crawler architecture

V. Summary and Outlook

This project implements a fairly complete news crawler system with multi-site crawling, data storage, a front-end interface, and AI processing. Scrapy handles efficient data collection, MySQL stores the structured data, Tailwind CSS provides a clean front end, and an LLM is used to improve the readability of the content.

Possible future improvements:

Add more news sources to improve coverage
Improve the scheduling strategy, for example adjusting crawl frequency based on how active a topic is
Strengthen sensitive-word detection and alerting to make the system more practical
Add data visualization for news trends and hot-topic analysis

With further tuning and extension, the system can grow into a capable news monitoring and analysis platform.
