Design and Implementation of a Web-Based News-Site Monitoring and Alerting System Built on Scrapy Crawlers
I built this system while learning web scraping; working through a real project is the fastest way to learn. The Scrapy framework drives incremental, parallel, scheduled crawling of multiple news sites, with the results saved to a MySQL database. The front end is plain HTML and communicates with the back end through Flask and WebSocket, providing sensitive-word monitoring with alerts as well as news search and retrieval; a word-cloud display and search feature was also added for quick visual analysis. The back end calls a large language model through an Ollama service to translate and rewrite content. Overall it is a full-stack project and a good case study for learning or for a graduation project. Feedback and corrections are welcome!
I. System Overview
This article describes the design and implementation of a web-based news crawling system. The system automatically crawls content from multiple news sites, stores it in a database, and presents it through a front-end page for browsing and management, with AI translation and content rewriting on top. It uses Scrapy as the crawler framework, MySQL as the database, Python for the back end, and Tailwind CSS for the front-end interface.
System architecture
Data collection layer: incremental multi-site crawlers built on Scrapy
Data storage layer: MySQL database
Business logic layer: data processing and AI services
Presentation layer: front-end pages built with Tailwind CSS
(Figure: system design flowchart)
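The module layout implied by the imports and file names in the code below is roughly the following (the Scrapy project name quotes_toscrape comes from the import statements, and the spider file names are assumed from the spider names; treat this as an approximate sketch rather than the exact repository layout):

quotes_toscrape/
    items.py               - shared NewsItem definition
    db.py                  - database connection and helper functions
    mysqlpipeline_news.py  - MySQL storage pipeline
    spiders/
        cnn_news.py        - CNN spider
        zaobao.py          - Lianhe Zaobao spider
        fenghuang.py       - Phoenix Net (凤凰网) spider
main.py                    - scheduled crawling entry point
backend/app.py             - Flask/WebSocket back end and AI worker
index.html, app.js, styles.css - front-end pages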
II. Environment Setup
Technology stack
Crawler framework: Scrapy 2.6+
Database: MySQL 8.0
Back-end language: Python 3.8+
Front-end: HTML5, Tailwind CSS, JavaScript
Other dependencies: pymysql, python-dotenv, pytz, etc.
Installing dependencies
bash
pip install scrapy pymysql python-dotenv pytz python-dateutil
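Note that this command only covers the crawler side. Running the Flask/WebSocket back end and the Ollama-based AI processing described later will likely require a few more packages (for example flask, flask-socketio and the ollama Python client); the exact dependency list for the back end is not shown here, so treat those package names as an assumption.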
III. Core Features
1. Database design and operations
Database connection utilities (db.py)
First, implement the database connection and basic helper functions:
python
import pymysql
from dotenv import load_dotenv
import os
from datetime import datetime
# Load environment variables from the .env file
load_dotenv()
def get_db_connection():
return pymysql.connect(
host=os.getenv('DB_HOST', 'localhost'),
user=os.getenv('DB_USER', 'root'),
password=os.getenv('DB_PASSWORD', '060109yzf'),
db=os.getenv('DB_NAME', 'news_DB'),
charset='utf8mb4',
cursorclass=pymysql.cursors.DictCursor
)
# Key to incremental crawling: check whether a URL already exists in the database
def is_url_exists(url):
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            # Query news_table; a row comes back if the URL already exists
            cursor.execute("SELECT 1 FROM news_table WHERE url = %s", (url,))
            return cursor.fetchone() is not None
    finally:
        conn.close()
def save_news(item):
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            # Column names must match the news_table schema created by the pipeline below
            sql = """
                INSERT INTO news_table (title, pub_time, url, source1, description)
                VALUES (%s, %s, %s, %s, %s)
            """
            cursor.execute(sql, (
                item['title'],
                item['pub_time'],
                item['url'],
                item['source1'],
                item['description']
            ))
        conn.commit()
    finally:
        conn.close()
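Both spiders below also import a convert_datetime_format helper from db.py that is not shown above. A minimal sketch of what it could look like, assuming it only normalizes a raw date string into the 'YYYY-MM-DD HH:MM:SS' format used elsewhere (the real helper may differ):
python
from dateutil import parser

def convert_datetime_format(time_str):
    """Normalize a raw date/time string to 'YYYY-MM-DD HH:MM:SS' (sketch only)."""
    try:
        dt = parser.parse(time_str, fuzzy=True)
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    except (ValueError, TypeError):
        # pub_time is stored as VARCHAR, so fall back to the raw string if parsing fails
        return time_str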
Database table design and storage pipeline (mysqlpipeline_news.py)
The pipeline that creates the table (if needed) and stores the crawled data:
python
import pymysql


class MysqlPipeline_News(object):
    """Pipeline that stores news items in MySQL"""
    def __init__(self):
        # Open the database connection
        self.conn = pymysql.connect(host='localhost', user='root', password='060109yzf', database='news_DB',
                                    charset='utf8mb4')
        # Create a cursor
        try:
            self.cursor = self.conn.cursor()
            # Create the table if it does not exist
            self.cursor.execute("""
                CREATE TABLE IF NOT EXISTS news_table (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    source1 TEXT NOT NULL,
                    title TEXT NOT NULL,
                    pub_time VARCHAR(255) NOT NULL,
                    url VARCHAR(255),
                    description TEXT,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    chinese_title TEXT,
                    chinese_content TEXT,
                    summary TEXT,
                    ai_processed TINYINT NOT NULL DEFAULT 0,  # 0: not processed, 1: processed
                    language_type VARCHAR(10) NOT NULL DEFAULT 'unknown'
                ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
            """)
            self.conn.commit()
        except pymysql.MySQLError as e:
            print(f"游标创建失败: {e}")
            return

    def process_item(self, item, spider):
        # SQL statement
        insert_sql = """
            INSERT INTO news_table(source1, title, pub_time, url, description, language_type)
            VALUES (%s, %s, %s, %s, %s, %s)
        """
        # Insert the item into the database
        self.cursor.execute(insert_sql,
                            (item['source1'], item['title'], item['pub_time'], item['url'],
                             item['description'], item['language_type']))
        # Commit; without a commit nothing is saved
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and the connection
        self.cursor.close()
        self.conn.close()
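Each spider below registers this pipeline explicitly through its custom_settings (ITEM_PIPELINES with priority 202) rather than relying on a project-wide setting, so it only runs for the news spiders.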
(Figure: database table structure and crawled results)
2. Crawler module
Shared news Item definition (items.py)
Define a unified data structure for news items:
python
import scrapy
# Unified news item
class NewsItem(scrapy.Item):
    title = scrapy.Field()          # title
    pub_time = scrapy.Field()       # publish time
    source1 = scrapy.Field()        # source
    url = scrapy.Field()            # link
    description = scrapy.Field()    # body text
    language_type = scrapy.Field()  # language type
CNN news spider (cnn_news.py)
The CNN spider walks through several levels of section pages and extracts article details:
python
import scrapy
import re
from quotes_toscrape.items import NewsItem
from ..db import is_url_exists, convert_datetime_format
from urllib.parse import urljoin
import pytz
from dateutil import parser
class CNNSpider(scrapy.Spider):
name = 'cnn_news'
allowed_domains = ['cnn.com']
start_urls = ['https://edition.cnn.com/']
base_url = 'https://edition.cnn.com/'
custom_settings = {
'ITEM_PIPELINES': {'quotes_toscrape.mysqlpipeline_news.MysqlPipeline_News': 202},
'DEFAULT_REQUEST_HEADERS': {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
},
'DOWNLOAD_DELAY': 1,
'RANDOMIZE_DOWNLOAD_DELAY': True
}
    def detect_url_lang(self, response):
        """Detect the page language"""
html_lang = response.xpath('//html/@lang').get()
meta_lang = None
if not html_lang:
meta_lang = response.xpath('//meta[@http-equiv="content-language"]/@content').get()
og_locale = None
if not html_lang and not meta_lang:
og_locale = response.xpath('//meta[@property="og:locale"]/@content').get()
return html_lang or meta_lang or og_locale or 'unknown'
    def convert_to_beijing_time(self, time_str):
        """Convert an ET/EDT/EST timestamp to Beijing time"""
cleaned = re.sub(r'[^\x00-\x7F]', '', time_str.strip())
cleaned = re.sub(r'\s+', ' ', cleaned.replace('\n', ' '))
cleaned = re.sub(r'^(\w+)\s+', '', cleaned, flags=re.IGNORECASE)
tz_match = re.search(r'\s*(ET|EDT|EST)\s*', cleaned, flags=re.IGNORECASE)
if not tz_match:
raise ValueError(f"未找到有效时区(ET/EDT/EST): {cleaned}")
tz_abbr = tz_match.group(1).upper()
time_part = re.sub(r'\s*(ET|EDT|EST)\s*', '', cleaned, flags=re.IGNORECASE).strip()
try:
naive_dt = parser.parse(time_part)
except parser.ParserError as e:
raise ValueError(f"解析时间失败: {time_part}(错误:{str(e)})")
source_tz = pytz.timezone('America/New_York')
is_dst = (tz_abbr == 'EDT')
localized_dt = source_tz.localize(naive_dt, is_dst=is_dst)
beijing_tz = pytz.timezone('Asia/Shanghai')
beijing_dt = localized_dt.astimezone(beijing_tz)
return beijing_dt.strftime("%Y-%m-%d %H:%M:%S")
    def parse(self, response):
        """First level: extract the top navigation links"""
first_level_links = response.xpath(
"//div[@class='header__nav-item']/a[@class='header__nav-item-link']/@href"
).getall()
self.log(f"找到{len(first_level_links)}个第一层导航链接")
        limited_links = first_level_links[:2]  # limit the number of links followed, to avoid too many requests
for link in limited_links:
full_url = urljoin(response.url, link)
self.log(f"跟进第一层链接: {full_url}")
yield scrapy.Request(
url=full_url,
callback=self.parse_second_level,
meta={'first_level_url': full_url}
)
    def parse_second_level(self, response):
        """Second level: extract the second-level navigation links from each first-level page"""
second_level_links = response.xpath(
"//div[@class='header__nav-item']/a[@class='header__nav-item-link']/@href"
).getall()
self.log(f"从{response.meta['first_level_url']}找到{len(second_level_links)}个第二层导航链接")
for link in second_level_links:
full_url = urljoin(response.url, link)
self.log(f"跟进第二层链接: {full_url}")
yield scrapy.Request(
url=full_url,
callback=self.parse_news_section,
meta={
'first_level_url': response.meta['first_level_url'],
'second_level_url': full_url
}
)
    def parse_news_section(self, response):
        """Parse a news section page and extract article links"""
news_cards = response.xpath('//li[contains(@class, "container__item--type-media-image")]')
for card in news_cards:
link = card.xpath('.//a[contains(@class, "container__link--type-article")]/@href').get()
if not link:
continue
url = response.urljoin(link)
            if is_url_exists(url):  # incremental crawling: skip URLs that are already in the database
continue
title = card.xpath('.//span[contains(@class, "container__headline-text")]/text()').get()
yield scrapy.Request(url=url, meta={'title': title, 'url': url},
callback=self.parse_news_detail)
    def parse_news_detail(self, response):
        """Parse the article detail page and extract the original content"""
self.log(f"开始解析新闻详情: {response.url}")
item = NewsItem()
item['source1'] = "美国有线电视"
item['title'] = response.meta['title']
item['url'] = response.meta['url']
item['language_type'] = self.detect_url_lang(response)
        # Extract and normalize the publish time
raw_time = response.xpath("//div[@class='timestamp__published']/text() | "
"//div[contains(@class, 'timestamp vossi-timestamp')]//text()"
).getall()
        if raw_time:
            raw_time_str = ' '.join(raw_time).strip()
            item['pub_time'] = self.convert_to_beijing_time(raw_time_str)
        else:
            # No timestamp found on the page; skip the timezone conversion instead of passing None
            self.log(f"网站:{item['url']}没有提取到时间")
            item['pub_time'] = ''
        # ... (the rest of the detail extraction is omitted) ...
Lianhe Zaobao spider (zaobao.py)
The spider for the Lianhe Zaobao (联合早报) site:
python
import scrapy
from quotes_toscrape.items import NewsItem
from ..db import is_url_exists, convert_datetime_format
class ZaobaoSpider(scrapy.Spider):
name = 'zaobao'
    allowed_domains = ['zaobao.com']  # domain only, no trailing slash
start_urls = ['https://www.zaobao.com/realtime']
base_url = 'https://www.zaobao.com/realtime'
custom_settings = {
'ITEM_PIPELINES': {'quotes_toscrape.mysqlpipeline_news.MysqlPipeline_News': 202}
}
    def detect_url_lang(self, response):
        """Detect the page language"""
html_lang = response.xpath('//html/@lang').get()
meta_lang = None
if not html_lang:
meta_lang = response.xpath('//meta[@http-equiv="content-language"]/@content').get()
og_locale = None
if not html_lang and not meta_lang:
og_locale = response.xpath('//meta[@property="og:locale"]/@content').get()
return html_lang or meta_lang or og_locale or 'unknown'
def parse(self, response):
        # The Lianhe Zaobao "realtime" (即时) section
source1 = '联合早报-即时'
news_list = response.xpath(
'//a[@class="py-4 flex gap-2 border-solid border-b-[1px] border-grey-150 last:border-none"]')
for news in news_list:
url = news.xpath('./@href').get()
if url:
url = response.urljoin(url)
                # Incremental crawl check
if is_url_exists(url):
continue
title = news.xpath('.//article/text()').extract()[0]
yield scrapy.Request(url=url, meta={'source1': source1, 'title': title, 'url': url},
callback=self.parse_news, dont_filter=True)
print(f"{source1} 网站爬取完毕……")
def parse_news(self, response):
item = NewsItem()
item['title'] = response.meta['title']
item['source1'] = response.meta['source1']
item['url'] = response.meta['url']
item['language_type'] = self.detect_url_lang(response)
        # Extract the publish time
time_div = response.xpath('//div[@class="byline_area line-clamp-2 text-grey-400"]')
time_text = time_div.xpath('string(.)').get()
if time_text:
            # Clean up the text and keep only the time part
publish_time = time_text.split(' / ')[-1].strip()
item['pub_time'] = convert_datetime_format(publish_time)
        # Extract the body text
        # (omitted)
yield item
3. Front-end pages
Main page layout (index.html)
A responsive page for displaying the news:
The HTML code is as follows:
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>新闻监控告警分析系统</title>
<script src="https://cdn.tailwindcss.com"></script>
<link href="https://cdn.jsdelivr.net/npm/font-awesome@4.7.0/css/font-awesome.min.css" rel="stylesheet">
<script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/4.0.1/socket.io.js"></script>
<link rel="stylesheet" href="styles.css">
</head>
<body class="bg-gray-100">
<div class="container mx-auto px-4 py-8">
<header class="mb-8">
<h1 class="text-3xl font-bold text-center text-gray-800">新闻监控告警系统 V1.0.1</h1>
<!-- 告警容器 -->
<div id="alert-container"
class="mt-4 hidden bg-red-100 border border-red-400 text-red-700 px-4 py-3 rounded relative">
<span class="block sm:inline" id="alert-message">敏感词告警</span>
<div class="absolute top-0 right-0 px-4 py-3 flex items-center gap-2">
<button id="stop-alert-sound" class="text-sm bg-red-500 text-white px-2 py-1 rounded hover:bg-red-600">
<i class="fa fa-volume-off mr-1"></i> 关闭声音
</button>
<button onclick="document.getElementById('alert-container').classList.add('hidden')"
class="text-gray-500 hover:text-gray-700">
<i class="fa fa-times"></i>
</button>
</div>
</div>
</header>
<!-- 搜索和筛选区域 -->
<div class="bg-white p-4 rounded-lg shadow mb-6">
<h2 class="text-xl font-semibold mb-4 flex items-center">
<i class="fa fa-search text-blue-500 mr-2"></i>新闻搜索
</h2>
<div class="grid grid-cols-1 md:grid-cols-2 gap-4">
<!-- 来源筛选 -->
<div>
<label class="block text-sm font-medium text-gray-700 mb-1">新闻来源</label>
<select id="source-filter" class="w-full border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500">
<option value="">全部来源</option>
<option value="美国有线电视">美国有线电视</option>
<option value="联合早报-即时">联合早报</option>
<option value="凤凰网-军事">凤凰网</option>
</select>
</div>
<!-- 发布时间范围 -->
<div>
<label class="block text-sm font-medium text-gray-700 mb-1">发布时间范围</label>
<div class="flex gap-2">
<input type="date" id="pub-start-date"
class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
placeholder="开始日期">
<input type="date" id="pub-end-date"
class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
placeholder="结束日期">
</div>
</div>
<!-- 关键词搜索 -->
<div class="md:col-span-2">
<label class="block text-sm font-medium text-gray-700 mb-1">关键词</label>
<div class="flex gap-2">
<input type="text" id="keyword" placeholder="标题或内容关键词"
class="flex-1 border border-gray-300 rounded px-4 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500">
<button id="search-btn" class="bg-blue-500 hover:bg-blue-600 text-white px-6 py-2 rounded">
<i class="fa fa-search mr-1"></i> 搜索
</button>
<button id="reset-btn" class="bg-gray-200 hover:bg-gray-300 text-gray-700 px-6 py-2 rounded">
<i class="fa fa-refresh mr-1"></i> 重置
</button>
</div>
</div>
</div>
</div>
<!-- 新闻列表和详情区域 -->
<div class="grid grid-cols-1 lg:grid-cols-3 gap-6">
<!-- 左侧:新闻列表 -->
<div class="lg:col-span-1">
<div class="bg-white rounded-lg shadow mb-6">
<div class="bg-white p-4 rounded-lg shadow mb-6 flex flex-wrap items-center justify-between">
<!-- 总新闻数统计 -->
<div class="flex items-center mb-2 sm:mb-0">
<span class="text-gray-700">共 <span id="news-total-count"
class="font-bold text-blue-600">0</span> 条新闻</span>
</div>
<!-- 敏感词新闻统计 -->
<div class="flex items-center ml-0 sm:ml-auto">
<i class="fa fa-exclamation-circle text-red-500 mr-2"></i>
<span class="text-gray-700">含有敏感词的新闻总数: </span>
<span id="sensitive-news-count" class="ml-2 font-bold text-red-600">0</span>
</div>
</div>
<!-- 新闻列表表头 -->
<div class="grid grid-cols-12 bg-gray-50 py-2 px-4 font-medium border-b">
<div class="col-span-4">标题</div>
<div class="col-span-2 hidden md:block">来源</div>
<div class="col-span-3 hidden sm:block">出版时间</div>
<div class="col-span-3 hidden lg:block">爬取时间</div>
</div>
<!-- 新闻列表内容 -->
<div id="news-list" class="max-h-[500px] overflow-y-auto">
<!-- 新闻列表将在这里动态生成 -->
<div class="flex justify-center items-center h-32 text-gray-500">
请点击搜索按钮加载新闻
</div>
</div>
<!-- 分页控件 -->
<div id="pagination" class="py-3 px-4 border-t flex justify-center">
<!-- 分页控件将在这里动态生成 -->
</div>
</div>
</div>
<!-- 右侧:新闻详情 -->
<div class="lg:col-span-2">
<div class="w-full bg-white rounded-lg shadow p-5">
<h2 class="text-lg font-semibold mb-4 flex items-center">
<i class="fa fa-newspaper-o text-primary mr-2"></i>新闻详情
</h2>
<!-- 新闻标题区域 -->
<div class="mb-6 pb-4 border-b border-gray-200">
<div class="inline-flex items-center">
<h3 id="news-detail-title" class="text-xl font-bold text-dark"></h3>
<a id="preview-url" href="" target="_blank" class="text-blue-500 hover:underline ml-2 text-sm">
<i class="fa fa-external-link mr-1"></i> 查看原文
</a>
</div>
<!-- 来源和时间区域 -->
<div class="mt-2 flex justify-start text-sm text-gray-500">
<span id="news-detail-source"></span>
<span id="news-detail-time" class="ml-4"></span>
</div>
</div>
<!-- 新闻内容标签切换 -->
<div class="border-b border-gray-200 mb-4">
<div class="flex">
<button class="content-tab px-4 py-2 font-medium border-b-2 border-primary text-primary"
data-tab="original">
原文
</button>
<button class="content-tab px-4 py-2 font-medium border-b-2 border-transparent text-gray-500 hover:text-gray-700"
data-tab="chinese">
中文
</button>
<button class="content-tab px-4 py-2 font-medium border-b-2 border-transparent text-gray-500 hover:text-gray-700"
data-tab="summary">
转写后
</button>
</div>
</div>
<!-- 新闻内容区域 -->
<div class="min-h-[500px]">
<!-- 原文区域 -->
<div id="original-content" class="content-panel">
<div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
请选择一条新闻查看详情
</div>
</div>
<!-- 中文翻译区域 -->
<div id="chinese-content" class="content-panel hidden">
<div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
请选择一条新闻查看详情
</div>
</div>
<!-- 转写后区域 -->
<div id="summary-content" class="content-panel hidden">
<div class="text-gray-500 flex items-center justify-center h-full min-h-[500px]">
请选择一条新闻查看详情
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<script src="app.js"></script>
</body>
</html>
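The alert container in the header above is driven by the Socket.IO client loaded in the page head together with app.js. On the server side, the check_sensitive_content function called from the AI worker in section 5 (not shown in full in this article) pushes an event to the browser when a sensitive word appears in the translated title or content. A minimal sketch of how this could look with Flask-SocketIO follows; the event name sensitive_alert and the hard-coded word list are assumptions for illustration only:
python
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")

# Assumed word list; in practice this could come from a config file or a database table
SENSITIVE_WORDS = ["敏感词A", "敏感词B"]

def check_sensitive_content(news_item):
    """Check the translated title/content for sensitive words and notify the front end (sketch)."""
    text = f"{news_item.get('chinese_title', '')} {news_item.get('chinese_content', '')}"
    hits = [word for word in SENSITIVE_WORDS if word in text]
    if hits:
        # 'sensitive_alert' is a hypothetical event name; app.js would listen for it,
        # un-hide the #alert-container element and optionally play the alert sound
        socketio.emit('sensitive_alert', {
            'news_id': news_item.get('id'),
            'words': hits,
            'title': news_item.get('chinese_title', '')
        })

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=5000)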
(Figure: front-end interface, screenshot 1)
(Figure: front-end interface, screenshot 2)
(Figure: Chinese translation produced by the large language model)
(Figure: automatic summary produced by the large language model)
4. Main program and scheduled crawling
main.py schedules the spiders to run periodically:
python
import json
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
import logging
import signal
from pathlib import Path
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler("spider.log"),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
# List of spider names
SPIDER_NAMES = ['cnn_news', 'zaobao', 'fenghuang']  # multiple spiders
# Interval between crawl rounds (seconds)
CRAWL_INTERVAL = 60
# Initialize the CrawlerRunner
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl_all_spiders():
    """Run all spiders"""
logger.info("===== 开始新一轮爬取 =====")
try:
        # Start the spiders one after another
for spider_name in SPIDER_NAMES:
yield runner.crawl(spider_name)
logger.info("===== 本轮爬取完成 =====")
except Exception as e:
logger.error(f"爬取过程中出错: {str(e)}", exc_info=True)
    # Schedule the next round
schedule_next_crawl()
def schedule_next_crawl():
    """Schedule the next crawl round"""
logger.info(f"将在 {CRAWL_INTERVAL} 秒后进行下一轮爬取...")
reactor.callLater(CRAWL_INTERVAL, crawl_all_spiders)
def shutdown(signal, frame):
    """Handle shutdown signals"""
logger.info("接收到退出信号,正在停止爬虫...")
reactor.stop()
if __name__ == "__main__":
    # Register signal handlers so the program can exit gracefully
    signal.signal(signal.SIGINT, shutdown)  # handle Ctrl+C
signal.signal(signal.SIGTERM, shutdown)
logger.info("程序启动,开始首次爬取...")
    # Start the first crawl
crawl_all_spiders()
    # Start the Twisted reactor (event loop)
reactor.run()
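One detail worth noting: because crawl_all_spiders yields each runner.crawl() call in turn, the three spiders actually run one after another within a round. To run them concurrently inside the same reactor, as the overview describes, one option (a sketch, not part of the original code) is to start every crawl first and then wait for all of them together:
python
@defer.inlineCallbacks
def crawl_all_spiders_parallel():
    """Start all spiders at once and wait until every one of them has finished."""
    logger.info("===== 开始新一轮爬取 =====")
    deferreds = [runner.crawl(name) for name in SPIDER_NAMES]  # each crawl starts immediately
    yield defer.DeferredList(deferreds, consumeErrors=True)    # wait for all of them to finish
    logger.info("===== 本轮爬取完成 =====")
    schedule_next_crawl()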
5. AI processing module
backend/app.py implements the translation and rewriting of news content:
python
"""Optimized logic: detect the language first, then translate only when necessary"""
"""AI tasks are handled in a dedicated worker thread"""
def ai_processing_worker():
    """Worker thread for AI tasks: translation and content rewriting"""
init_ollama_client()
while True:
try:
            # Take the next news item from the queue
            with queue_lock:
                if not news_process_queue:
                    time.sleep(5)  # sleep while the queue is empty
                    continue
                news_item = news_process_queue.pop(0)
                news_id = news_item['id']
                # Mark the item as in progress to avoid handling duplicates still in the queue
                processing_flag = f"processing_{news_id}"
                if processing_flag in [f"processing_{item['id']}" for item in news_process_queue]:
                    print(f"新闻ID {news_id} 正在处理中,跳过重复项")
                    continue
                title = news_item['title']
                content = news_item['description']
                language_type = news_item['language_type']
conn = get_db_connection()
try:
                # Lock the row to avoid concurrent processing of the same item
with conn.cursor(pymysql.cursors.DictCursor) as cursor:
cursor.execute(
"SELECT ai_processed FROM news_table WHERE id = %s FOR UPDATE",
(news_id,)
)
result = cursor.fetchone()
if not result or result['ai_processed'] == 1:
print(f"新闻ID {news_id} 已处理,跳过")
continue
print(f"开始智能处理新闻ID: {news_id}的翻译和转写\n")
start_time = time.time()
                    # 1. Check whether the original is already Chinese
                    is_chinese = language_type.startswith("zh")
                    # 2. Default to the original title and content
                    chinese_title = title
                    chinese_content = content
                    # 3. Call the translation model only when the original is not Chinese
                    if not is_chinese:
                        chinese_title = call_LLM_for_translate(title, True, language_type)
                        print(f"新闻ID {news_id} 标题翻译成功!!!")
                        chinese_content = call_LLM_for_translate(content, False, language_type)
                        print(f"新闻ID {news_id} 内容翻译成功!!!")
                    # 4. Rewrite the content
                    paraphrased_content = call_large_model_for_paraphrase(chinese_content)
                    # 5. Update the database; reuse the same connection so the row lock
                    #    taken by the SELECT ... FOR UPDATE above is still held
                    cursor = conn.cursor(pymysql.cursors.DictCursor)
update_query = """
UPDATE news_table
SET chinese_title = %s, chinese_content = %s,
summary = %s, ai_processed = 1
WHERE id = %s
"""
cursor.execute(update_query, (chinese_title, chinese_content,
paraphrased_content, news_id))
conn.commit()
                    # Check for sensitive content
news_item['chinese_title'] = chinese_title
news_item['chinese_content'] = chinese_content
check_sensitive_content(news_item)
                    # 6. Notify the front end that AI processing is complete
socketio.emit('ai_process_complete', {
'news_id': news_id,
'is_chinese_original': is_chinese,
'message': '处理完成'
})
print(f"新闻ID {news_id} 处理完成,耗时{time.time() - start_time:.2f}秒")
except Exception as e:
conn.rollback()
print(f"AI任务处理失败(news_id={news_id}): {str(e)}")
                # Put the item back on the queue for a retry
with queue_lock:
if not any(item['id'] == news_id for item in news_process_queue):
news_process_queue.append(news_item)
time.sleep(10)
finally:
conn.close()
except Exception as e:
print(f"线程循环异常: {str(e)}")
time.sleep(10)
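The worker above depends on three helpers that are not shown: init_ollama_client, call_LLM_for_translate and call_large_model_for_paraphrase. Since the overview says the back end calls a large model through an Ollama service, a minimal sketch of the translation helper built on the official ollama Python client could look like the following; the model name qwen2.5:7b and the prompt wording are assumptions:
python
import ollama

OLLAMA_MODEL = "qwen2.5:7b"  # assumed model name; use whatever model is pulled locally

def call_LLM_for_translate(text, is_title, language_type):
    """Translate a title or article body into Chinese via the local Ollama service (sketch)."""
    kind = "title" if is_title else "article body"
    prompt = (
        f"Translate the following {kind} (source language: {language_type}) into Chinese. "
        f"Return only the translation, with no extra commentary:\n\n{text}"
    )
    response = ollama.chat(
        model=OLLAMA_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"].strip()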
IV. Testing and Deployment
Test steps
Database connection test: make sure MySQL can be reached and the table structure is created correctly (a minimal check is sketched after this list)
Single-spider test: run each spider on its own and verify that data is crawled and stored correctly
Scheduled-task test: verify that periodic crawling works as expected
Front-end interaction test: exercise search, filtering, and the detail view
AI processing test: check the accuracy of the translation and rewriting output
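For the first step, a small script along these lines (assuming it is run from the project root, next to the quotes_toscrape package) reuses get_db_connection from db.py to confirm the connection and list the columns of news_table:
python
from quotes_toscrape.db import get_db_connection

def test_connection():
    """Open a connection, list the columns of news_table, and close it again."""
    conn = get_db_connection()
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW COLUMNS FROM news_table")
            for column in cursor.fetchall():
                print(column)
    finally:
        conn.close()

if __name__ == "__main__":
    test_connection()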
Deployment notes
In production, store the database password and other secrets in environment variables rather than in the code (see the .env sketch after this list)
Tune the crawl interval and concurrency so the target sites are not put under too much load
Consider containerizing the deployment with Docker to simplify environment setup
For large-scale crawling, a distributed crawler architecture is recommended
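Since db.py already reads its settings with python-dotenv, the first point mostly comes down to keeping a .env file (excluded from version control) next to the project, along these lines; the values below are placeholders:
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=your_password_here
DB_NAME=news_DB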
V. Summary and Outlook
The system described here is a reasonably complete news-crawling platform, covering multi-site crawling, data storage, a front-end UI, and AI processing. Scrapy handles efficient data collection, MySQL stores the structured data, Tailwind CSS provides a clean front-end interface, and AI is used to improve the readability of the content.
Future improvements could include:
Adding more news sources to improve coverage
Improving the crawl scheduling strategy, e.g. adjusting the crawl frequency dynamically based on content popularity
Strengthening the sensitive-word detection and alerting mechanism to make the system more practical
Adding data visualization to show news trends and hot-topic analysis
With continued optimization and extension, the system can grow into a powerful platform for news monitoring and analysis.