Last year I taught a friend with zero background how to write web crawlers. At first he didn't even know what a request header was; after working through 10 hands-on cases with me, he can now write distributed crawlers that collect e-commerce data on his own. This article lays out the whole path I used to get him started: from basic requests calls, to data parsing and basic anti-scraping countermeasures, all the way to async and distributed crawlers. Every stage comes with a hands-on case and runnable code, so a beginner can follow along step by step.

Stage 1: Crawler Basics - Fetching Web Pages with requests

The essence of a crawler is "impersonate a browser, send a request, then parse the response". Step one is the most fundamental tool: the requests library.

Knowledge Point 1: GET requests with requests (90% of crawlers use GET)

Install requests:

pip install requests

Case 1: Scraping Douban Movie Top 250 (basic GET request)

Goal: get the title, rating, and tagline of each Top 250 movie.

import requests
from bs4 import BeautifulSoup  # used for parsing below; install first: pip install beautifulsoup4

def crawl_douban_top250():
    url = 'https://movie.douban.com/top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    # Send the GET request
    response = requests.get(url, headers=headers)
    # Parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the list of movie items
    movies = soup.find_all('div', class_='item')
    for movie in movies:
        title = movie.find('span', class_='title').text
        rating = movie.find('span', class_='rating_num').text
        quote_tag = movie.find('span', class_='inq')  # look the tag up once instead of twice
        quote = quote_tag.text if quote_tag else 'No tagline'
        print(f'Movie: {title} | Rating: {rating} | Tagline: {quote}')

if __name__ == '__main__':
    crawl_douban_top250()

Common beginner pitfalls

  • Always set a User-Agent! Without one, Douban detects the crawler and returns a 403 error;
  • Match your selectors to the page's actual HTML structure (press F12 to inspect it), and double-check the class names.
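Before pointing the crawler at the live site, you can verify your selectors offline against a small inline HTML fragment. A minimal sketch; the fragment below is a made-up snippet imitating the Top 250 item structure, not real Douban markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one Douban list item, for selector testing only
html = '''
<div class="item">
  <span class="title">The Shawshank Redemption</span>
  <span class="rating_num">9.7</span>
  <span class="inq">Hope is a good thing.</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', class_='item')
print(item.find('span', class_='title').text)       # The Shawshank Redemption
print(item.find('span', class_='rating_num').text)  # 9.7
```

If the selectors work on a saved copy of the real page the same way, they will work in the crawler; if not, you caught the typo without burning requests.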
Case 2: Scraping the Zhihu hot list (GET request with parameters)

Goal: get the title and link of each item on the Zhihu hot list.

import requests

def crawl_zhihu_hot():
    url = 'https://www.zhihu.com/api/v3/feed/topstory/hot-list-web'
    params = {
        'limit': 10,  # top 10 items only
        'desktop': True
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    # GET request with query parameters
    response = requests.get(url, headers=headers, params=params)
    # Zhihu returns JSON, so parse it directly
    data = response.json()
    for item in data['data']:
        title = item['target']['title']
        link = item['target']['url']  # use a new name; don't shadow the outer url variable
        print(f'Title: {title}\nLink: {link}\n---')

if __name__ == '__main__':
    crawl_zhihu_hot()
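The nested `data['data'][...]['target']` access is easy to get wrong, so it helps to rehearse it offline on a hand-written payload. The JSON below is made up; it only mirrors the shape the case above assumes the API returns:

```python
import json

# Made-up payload mirroring the assumed response shape: {"data": [{"target": {...}}]}
sample = json.loads('''
{
  "data": [
    {"target": {"title": "Example question", "url": "https://www.zhihu.com/question/1"}},
    {"target": {"title": "Another question", "url": "https://www.zhihu.com/question/2"}}
  ]
}
''')

for item in sample['data']:
    # Same navigation path the crawler uses on the real response
    print(item['target']['title'], '->', item['target']['url'])
```

If the real API ever changes its shape, this is also where you'd update the access path first.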

Knowledge Point 2: POST requests with requests (for logins and form submissions)

Case 3: Simulating a login against a test endpoint (POST request)

Goal: submit a username and password to the test endpoint http://httpbin.org/post (it simply echoes back the form data it receives, which makes it safe for practice).

import requests

def simulate_login():
    url = 'http://httpbin.org/post'
    # Form data for the POST body
    data = {
        'username': 'test_user',
        'password': 'test_pass'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    # Send the POST request
    response = requests.post(url, headers=headers, data=data)
    print(response.json())  # inspect the echoed form data

if __name__ == '__main__':
    simulate_login()

Stage 2: Data Parsing - Turning Raw Pages into Structured Information

Once you have the page, you need to pull the useful information out of the HTML/JSON. Three methods cover most needs: BeautifulSoup (HTML), regular expressions (plain text), and XPath (XML/HTML).

Knowledge Point 3: Parsing HTML with BeautifulSoup (the most common choice)

Case 4: Scraping a novel chapter (BeautifulSoup in practice)

Goal: scrape the chapter title and body text from a web-novel site (Biquge is used as the example).

import requests
from bs4 import BeautifulSoup

def crawl_novel_chapter():
    url = 'https://www.biquge.com.cn/book/12345/67890.html'  # example chapter URL; replace with a real one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'  # set the encoding explicitly to avoid garbled text
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the chapter title
    title = soup.find('h1').text
    # Extract the chapter body (this site keeps it in <div id="content">)
    content = soup.find('div', id='content').text.replace('\xa0', ' ')  # replace non-breaking spaces
    # Save it to a file
    with open(f'{title}.txt', 'w', encoding='utf-8') as f:
        f.write(title + '\n\n' + content)
    print(f'Saved chapter: {title}')

if __name__ == '__main__':
    crawl_novel_chapter()

Knowledge Point 4: Regular expressions for plain text (handling loosely structured strings)

Case 5: Extracting every email address from a page (regular expressions)

Goal: extract all email addresses from an arbitrary web page.

import requests
import re

def crawl_emails(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # Match email-like strings: local part, '@', domain, dot, suffix
    emails = re.findall(r'[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+', response.text)
    print(f'Emails found: {list(set(emails))}')  # set() removes duplicates

if __name__ == '__main__':
    crawl_emails('https://www.example.com/contact')  # replace with your target URL
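You can sanity-check the pattern offline on a literal string before crawling anything; the sample text below simply stands in for response.text:

```python
import re

# Same pattern as the case above
EMAIL_RE = r'[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+'

# Made-up sample text standing in for response.text
text = 'Contact us at support@example.com or sales@example.com, support@example.com again.'

# set() deduplicates; sorted() gives a stable order for display
emails = sorted(set(re.findall(EMAIL_RE, text)))
print(emails)  # ['sales@example.com', 'support@example.com']
```

Note that this deliberately simple pattern will miss some valid addresses (e.g. those containing dots in the local part); tighten it as your targets require.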

Knowledge Point 5: XPath for HTML/XML (more flexible than BeautifulSoup)

First install the lxml library (required for XPath):

pip install lxml

Case 6: Scraping Bilibili video info (XPath)

Goal: get a Bilibili video's title, view count, and danmaku (bullet-comment) count. Note that Bilibili renders much of its page with JavaScript, so the raw HTML may not always contain these nodes; treat this case as a demonstration of the XPath mechanics.

import requests
from lxml import etree

def crawl_bilibili_video():
    url = 'https://www.bilibili.com/video/BV1xx411c7mZ'  # replace with your target video URL
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Referer': 'https://www.bilibili.com/'
    }
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)

    # Extract the fields with XPath
    title = html.xpath('//h1[@class="video-title"]/text()')[0].strip()
    play_count = html.xpath('//span[@class="view"]/text()')[0]
    danmu_count = html.xpath('//span[@class="danmaku"]/text()')[0]

    print(f'Title: {title}\nViews: {play_count}\nDanmaku: {danmu_count}')

if __name__ == '__main__':
    crawl_bilibili_video()
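As with BeautifulSoup, XPath expressions are easiest to debug against an inline fragment. The snippet below is invented to match the structure the XPaths above target, so you can confirm the query syntax without a network request:

```python
from lxml import etree

# Made-up fragment imitating the structure the XPaths above target
html = etree.HTML('''
<div>
  <h1 class="video-title">Sample video</h1>
  <span class="view">12345</span>
  <span class="danmaku">678</span>
</div>
''')

# Same XPath expressions as in the case above
title = html.xpath('//h1[@class="video-title"]/text()')[0].strip()
views = html.xpath('//span[@class="view"]/text()')[0]
danmaku = html.xpath('//span[@class="danmaku"]/text()')[0]
print(title, views, danmaku)
```

Once the expressions behave on the fragment, any remaining failure on the live site points at the page itself (e.g. JavaScript rendering), not at your XPath.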

Stage 3: Anti-Scraping Basics - Keeping Your IP from Getting Banned

The problem beginners hit most often is getting their IP banned. This stage covers the basic countermeasures.

Knowledge Point 6: A User-Agent pool plus random delays

Case 7: Scraping an e-commerce site (with random User-Agents and delays)
import requests
import random
import time
from fake_useragent import UserAgent  # generates User-Agents automatically; install: pip install fake-useragent

def crawl_shop_products():
    url = 'https://www.taobao.com'  # example URL
    # User-Agent pool
    ua = UserAgent()
    headers = {
        'User-Agent': ua.random  # pick a random User-Agent
    }
    # Random delay of 1-3 seconds before each request
    time.sleep(random.uniform(1, 3))

    response = requests.get(url, headers=headers)
    print(f'Request succeeded, status code: {response.status_code}')

if __name__ == '__main__':
    for _ in range(5):
        crawl_shop_products()
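One caveat: fake_useragent fetches its User-Agent data over the network and can fail or go stale. A dependency-free fallback is a small hand-maintained pool; a minimal sketch (the strings below are examples; keep your own list current):

```python
import random

# Hand-maintained pool; extend with real, current browser User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

def random_headers():
    # Pick a different User-Agent on every call
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers())
```

Drop `random_headers()` in wherever the cases above build a `headers` dict by hand.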

Knowledge Point 7: Using proxy IPs (when your own IP gets banned)

Case 8: Crawling through a proxy IP (free proxy example)
import requests

def crawl_with_proxy():
    url = 'https://httpbin.org/ip'  # echoes back the IP the request came from
    # Free proxies are unreliable; use a paid provider (e.g. Abuyun) in production
    proxies = {
        'http': 'http://123.45.67.89:8080',   # placeholder address; replace with a live proxy
        'https': 'http://123.45.67.89:8080'   # HTTPS traffic is tunneled through the HTTP proxy
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        print(f'Current IP: {response.json()["origin"]}')
    except Exception as e:
        print(f'Proxy failed: {e}')

if __name__ == '__main__':
    crawl_with_proxy()
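Because free proxies die quickly, it's common to rotate through a pool instead of relying on one address. A stdlib-only sketch of the rotation logic (the addresses are placeholders; plug `next_proxies()` into `requests.get(..., proxies=...)` yourself):

```python
from itertools import cycle

# Hypothetical proxy pool; replace with live addresses from your provider
PROXY_POOL = [
    'http://111.11.11.11:8080',
    'http://122.22.22.22:8080',
    'http://133.33.33.33:8080',
]
_proxy_iter = cycle(PROXY_POOL)  # endless round-robin over the pool

def next_proxies():
    # Advance to the next proxy; requests expects a scheme -> proxy mapping
    proxy = next(_proxy_iter)
    return {'http': proxy, 'https': proxy}

for _ in range(4):
    print(next_proxies()['http'])  # wraps back to the first entry after the third
```

In practice you'd also drop proxies from the pool when they throw connection errors, so dead entries aren't retried forever.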

Knowledge Point 8: Logging in with Cookies (scraping pages behind a login)

Case 9: Scraping a Zhihu profile page (with a Cookie)
import requests

def crawl_zhihu_profile():
    url = 'https://www.zhihu.com/people/your-profile'  # replace with the target user's profile URL
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Cookie': 'your cookie string here'  # copy it from the Network tab (press F12 in your browser)
    }
    response = requests.get(url, headers=headers)
    print(f'Profile page length: {len(response.text)}')

if __name__ == '__main__':
    crawl_zhihu_profile()
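Instead of pasting the raw Cookie header, you can also hand requests a cookies dict via its `cookies=` parameter. A stdlib-only sketch that splits a copied cookie string into one (the sample string below is made up):

```python
def parse_cookie_string(raw):
    # Turn 'k1=v1; k2=v2' (as copied from the browser) into a dict
    cookies = {}
    for pair in raw.split(';'):
        if '=' in pair:
            # partition splits on the FIRST '=' only, so values may contain '='
            key, _, value = pair.strip().partition('=')
            cookies[key] = value
    return cookies

# Made-up sample cookie string
sample = '_zap=abc123; d_c0=xyz789; session=token=with=equals'
print(parse_cookie_string(sample))
```

Then `requests.get(url, headers=headers, cookies=parse_cookie_string(raw))` works without putting the Cookie in headers, which keeps the secret out of any headers you log.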

Stage 4: Going Further - Async Crawlers (for Better Throughput)

A synchronous crawler fetches one page at a time; an async crawler overlaps many requests, which for I/O-bound crawling can easily mean a tenfold or greater speedup.

Knowledge Point 9: Async crawling with aiohttp

Install aiohttp:

pip install aiohttp

Case 10: Fetching multiple pages concurrently (aiohttp basics)
import aiohttp
import asyncio

async def fetch(session, url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    async with session.get(url, headers=headers) as response:
        return await response.text()

async def batch_fetch(urls):
    async with aiohttp.ClientSession() as session:
        # Create one task per URL
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        # Wait for all of them to finish
        results = await asyncio.gather(*tasks)
        return results

if __name__ == '__main__':
    urls = [
        'https://www.baidu.com',
        'https://www.zhihu.com',
        'https://www.douban.com'
    ]
    # Run the async batch
    results = asyncio.run(batch_fetch(urls))
    for i, result in enumerate(results):
        print(f'Length of URL {i+1}: {len(result)}')
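One thing the basic version above doesn't do is limit concurrency; launching hundreds of requests at once is a fast way to get banned. A common pattern is to cap in-flight requests with asyncio.Semaphore. A stdlib-only sketch where asyncio.sleep stands in for the real network call, so it runs without aiohttp:

```python
import asyncio

CONCURRENCY = 2  # at most 2 simulated requests in flight at once

async def fetch(sem, url):
    async with sem:                # wait for a free slot before "requesting"
        await asyncio.sleep(0.01)  # stand-in for the real network I/O
        return f'fetched {url}'

async def batch_fetch(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(fetch(sem, u) for u in urls))

results = asyncio.run(batch_fetch([f'https://example.com/{i}' for i in range(5)]))
print(results)
```

To combine it with aiohttp, pass the same semaphore into the `fetch` coroutine from Case 10 and wrap the `session.get` call in `async with sem:`.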

Stage 5: The Endgame - Distributed Crawlers (for Massive Datasets)

An async crawler is still bound by a single machine's resources; a distributed crawler spreads the work across multiple machines, which is what makes crawls of hundreds of millions of pages feasible.

Knowledge Point 10: Distributed crawling with Scrapy-Redis

Scrapy is a production-grade crawling framework; Scrapy-Redis adds distributed scheduling on top of it.

Step 1: Install Scrapy and Scrapy-Redis
pip install scrapy scrapy-redis
Step 2: Create a Scrapy project
scrapy startproject distributed_spider
cd distributed_spider
Step 3: Configure the distributed parts (settings.py)
# Enable the Scrapy-Redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Enable Scrapy-Redis request deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Redis connection URL
REDIS_URL = "redis://localhost:6379/0"
# Persist the queue so pending tasks survive a restart
SCHEDULER_PERSIST = True
Step 4: Write the spider (spiders/douban_spider.py)
import scrapy
from scrapy_redis.spiders import RedisSpider

class DoubanSpider(RedisSpider):
    name = 'douban_spider'
    redis_key = 'douban:start_urls'  # the task queue key in Redis

    def parse(self, response):
        # Extract the movie info (same idea as Case 1)
        movies = response.css('div.item')
        for movie in movies:
            title = movie.css('span.title::text').get()
            rating = movie.css('span.rating_num::text').get()
            yield {
                'title': title,
                'rating': rating
            }
        # Follow the next page
        next_page = response.css('span.next a::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Step 5: Launch the distributed crawl
  1. Start a Redis server;
  2. Push the start URL into Redis:
    redis-cli lpush douban:start_urls https://movie.douban.com/top250

  3. Start several crawler nodes (on multiple machines or in multiple terminals):
    scrapy crawl douban_spider


Learning Path Recap for Beginners

  1. Basics: master sending requests with requests and parsing with BeautifulSoup/XPath (Cases 1-6);
  2. Anti-scraping: learn to use User-Agents, proxies, and Cookies (Cases 7-9);
  3. Throughput: build async crawlers with aiohttp (Case 10);
  4. Advanced: build distributed crawlers with Scrapy-Redis.

Beginners should practice first on sites with lenient anti-scraping, such as Douban or Zhihu, and stay away from e-commerce or financial sites at the start (their anti-scraping is strict, and scraping them can carry legal risk). If you hit problems like "parsing fails" or "my IP got banned", leave a comment and I'll explain the fix in beginner-friendly terms.

If this step-by-step tutorial helped, like and bookmark it so you can follow along next time you study crawlers!
