[Datawhale] Full-Stack Guide to RAG Technology, Task 2

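This article covers two key steps in document processing: data loading and text chunking. The data loading part explains how to convert documents in various formats into structured data, with a usage example for the Unstructured tool and fixes for common errors. The text chunking part explains why chunking is necessary (for example, model length limits), the common strategies (fixed-size, recursive character, semantic chunking, and so on), and the tooling (Unstructured, LlamaIndex). The article stresses choosing a chunking method that fits the document, and avoiding oversized chunks that blur the information.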

小汤圆不甜不要钱

Published 2026-01-17 00:49:47


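Section 1: Data Loading

Document loaders

Function

A document loader converts unstructured documents in various formats (PDF, Word, Markdown, HTML, etc.) into data structures a program can process. Steps: extract the content of each format as processable plain text >>> extract key information such as document source, page number, and author as metadata >>> structured data.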
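
To make the three steps concrete, here is a minimal sketch for the simplest case, a plain-text or Markdown file. The `Document` class, the `load_text_document` function, and the metadata keys are names I made up for illustration; they are not the API of any particular library, and real loaders for PDF or Word do far more work in the first step.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical container: plain text plus metadata about where it came from."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_text_document(path):
    """Read a text-like file and wrap it as a structured Document."""
    file_path = Path(path)
    # Step 1: extract the content as plain text
    text = file_path.read_text(encoding="utf-8")
    # Step 2: collect key information as metadata (keys are assumptions)
    metadata = {
        "source": str(file_path),
        "file_type": file_path.suffix.lstrip(".").lower(),
        "num_chars": len(text),
    }
    # Step 3: return structured data
    return Document(text=text, metadata=metadata)
```

Keeping the metadata next to the text is what later lets a retrieved chunk be traced back to its source file.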

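Mainstream loaders

Click the links for concrete examples. Study time was tight, so I did not try them all; I will update this post with the corresponding code examples when they come up in later scenarios.

Unstructured: code example

Problems encountered and fixes that worked in practice:

The first error when running the code: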

Traceback (most recent call last):                                                                                                                               
  File "/workspaces/all-in-rag/code/C2/01_unstructured_example.py", line 7, in <module>                                                                          
    elements = partition(                                                                                                                                        
               ^^^^^^^^^^                                                                                                                                        
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/auto.py", line 211, in partition                                          
    partition_pdf = partitioner_loader.get(file_type)                                                                                                            
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                            
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/auto.py", line 364, in get                                                
    self._partitioners[file_type] = self._load_partitioner(file_type)                                                                                            
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                            
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/auto.py", line 382, in _load_partitioner                                  
    partitioner_module = importlib.import_module(file_type.partitioner_module_qname)                                                                             
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                             
  File "/opt/conda/envs/all-in-rag/lib/python3.12/importlib/__init__.py", line 90, in import_module                                                              
    return _bootstrap._gcd_import(name[level:], package, level)                                                                                                  
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                  
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import                                                                                                
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load                                                                                             
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 19, in <module>
    from unstructured_inference.inference.layout import DocumentLayout
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured_inference/inference/layout.py", line 18, in <module>
    from unstructured_inference.models.base import get_model
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured_inference/models/base.py", line 8, in <module>
    from unstructured_inference.models.detectron2onnx import (
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured_inference/models/detectron2onnx.py", line 4, in <module>
    import cv2
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/all-in-rag/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Fix:

Install the missing library on the system:

sudo apt-get update
sudo apt-get install -y libgl1
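
A quick way to confirm the fix before rerunning the whole script is to try loading the shared library directly. This is only a sketch: `shared_library_available` is a helper name I made up, and it relies on nothing but the standard `ctypes` module.

```python
import ctypes


def shared_library_available(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Same failure mode as the ImportError above: the file cannot be opened
        return False
    return True


# libGL.so.1 is the library named in the traceback
print("libGL.so.1 available:", shared_library_available("libGL.so.1"))
```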

After installing the package, a new error appeared:

Traceback (most recent call last):
  File "/workspaces/all-in-rag/code/C2/01_unstructured_example.py", line 7, in <module>
    elements = partition(
               ^^^^^^^^^^
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/auto.py", line 211, in partition
    partition_pdf = partitioner_loader.get(file_type)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/auto.py", line 364, in get
    self._partitioners[file_type] = self._load_partitioner(file_type)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/auto.py", line 382, in _load_partitioner
    partitioner_module = importlib.import_module(file_type.partitioner_module_qname)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/all-in-rag/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 70, in <module>
    from unstructured.partition.pdf_image.pdfminer_processing import (
  File "/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured/partition/pdf_image/pdfminer_processing.py", line 11, in <module>
    from unstructured_inference.constants import FULL_PAGE_REGION_THRESHOLD, IsExtracted
ImportError: cannot import name 'IsExtracted' from 'unstructured_inference.constants' (/opt/conda/envs/all-in-rag/lib/python3.12/site-packages/unstructured_inference/constants.py)

Fix:

This is caused by incompatible package versions in the environment, a typical unstructured ↔ unstructured-inference version mismatch. Upgrading both to matching versions fixes it:

pip install -U "unstructured[pdf]" unstructured-inference
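
To see which versions are actually installed before and after the upgrade, a small check with the standard `importlib.metadata` module is enough. The helper name `installed_versions` is my own; the sketch only reports versions and does not know which pairs are compatible.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The two packages named in the ImportError above
print(installed_versions(["unstructured", "unstructured-inference"]))
```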

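Exercise:

Replace the current partition function with partition_pdf, parse with hi_res and ocr_only in turn, and observe how the output changes.

Problems encountered and fixes that worked in practice:

I hit a few errors about missing packages; installing them with apt fixed them. (For PDF parsing with OCR, Unstructured's documentation lists system packages such as poppler-utils and tesseract-ocr.)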

from unstructured.partition.pdf import partition_pdf

# Path to the PDF file
pdf_path = "../../data/C2/pdf/rag.pdf"


def summarize_elements(elements, title, n=15):
    """Print the element count, the type distribution, and a preview of the first n elements."""
    print(f"\n{'='*20} {title} {'='*20}")
    print(f"Total elements: {len(elements)}")
    # Count how many elements there are of each type
    type_counts = {}
    for el in elements:
        t = type(el).__name__
        type_counts[t] = type_counts.get(t, 0) + 1
    # Sort by count (descending), then by type name
    print("Type distribution:", dict(sorted(type_counts.items(), key=lambda x: (-x[1], x[0]))))

    # Preview the first n elements (type + first 120 characters)
    print(f"\nPreview of the first {min(n, len(elements))} elements:")
    for i, el in enumerate(elements[:n], 1):
        text = (getattr(el, "text", "") or "").strip().replace("\n", " ")
        if len(text) > 120:
            text = text[:120] + "..."
        print(f"{i:02d}. {type(el).__name__}: {text}")


# 1) hi_res: stronger layout analysis / structure extraction (usually more "document-like")
elements_hi_res = partition_pdf(
    filename=pdf_path,
    strategy="hi_res",
)

# 2) ocr_only: everything goes through OCR (usually a better fit for scanned PDFs)
elements_ocr_only = partition_pdf(
    filename=pdf_path,
    strategy="ocr_only",
)

# Print both summaries for comparison
summarize_elements(elements_hi_res, "strategy=hi_res")
summarize_elements(elements_ocr_only, "strategy=ocr_only")

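hi_res: first runs high-precision layout analysis and partitioning (trying to recover titles, paragraphs, tables, and other structure), then combines it with OCR where needed.
ocr_only: skips layout inference and runs OCR on each page as an image to pull out the text (a better fit for pure scans, but with less structural information, and slower than direct text extraction such as the fast strategy).

The output shows this clearly too: hi_res yields more elements and tracks the structure of the original PDF more closely, while ocr_only reads like plain OCR text.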

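Section 2: Text Chunking

1) What text chunking actually is

Simply put: cut one long document into small units of text. Downstream, both vector retrieval and feeding content to the LLM for answering basically work chunk by chunk.

2) Why chunking is a must (the core reason)

(1) Models have length limits

If the input to an embedding model is too long it gets truncated, so the vector is inaccurate and information is simply lost.
The LLM context is limited too: the retrieved content, the prompt, and the question all have to fit together. If chunks are too big, only a few of them fit, and you end up with too little information.

Conclusion: chunking makes sure each piece can be processed in full and can fit into the context.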
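
The "can it fit" part is plain arithmetic, which the sketch below spells out. All numbers are assumed for illustration (an 8192-token window, 500 tokens of prompt, 100 tokens of question, 1000 tokens reserved for the answer); they are not the limits of any specific model, and `max_chunks_in_context` is a hypothetical helper.

```python
def max_chunks_in_context(context_window, prompt_tokens, question_tokens,
                          answer_reserve, chunk_tokens):
    """How many retrieved chunks of a given size fit next to everything else."""
    available = context_window - prompt_tokens - question_tokens - answer_reserve
    if available <= 0 or chunk_tokens <= 0:
        return 0
    return available // chunk_tokens


# Assumed budget: 8192 - 500 - 100 - 1000 = 6592 tokens left for chunks
print(max_chunks_in_context(8192, 500, 100, 1000, chunk_tokens=512))   # 12
print(max_chunks_in_context(8192, 500, 100, 1000, chunk_tokens=2048))  # 3
```

With the same budget, 512-token chunks leave room for 12 retrieved passages, while 2048-token chunks leave room for only 3.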

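3) Bigger chunks are not better (three pitfalls of large chunks)

Pitfall 1: the vector representation gets "blurry": the semantics become generic.

Pitfall 2: the LLM gets "lost in the middle" (Lost in the Middle): it overlooks key sentences in the middle of the context.

Pitfall 3: the topic gets diluted and retrieval recall fails: no topic stands out, the overall semantics are vague, and relevant queries fail to retrieve the chunk.

4) Common chunking strategies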

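(1) Fixed size / split by character (LangChain's CharacterTextSplitter)
It looks like "fixed size", but in practice it behaves more like "split by paragraph, then pack pieces up to about chunk_size", and an overlap can be added so that adjacent chunks share a little text, which softens breaks at the boundaries.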
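
The "split by paragraph, then pack" behaviour can be sketched in plain Python. This is my own simplified version for illustration, not LangChain's code; the parameter names just mirror the ones used above. In this sketch a single paragraph longer than chunk_size is emitted as-is, which is exactly the weakness the recursive strategy addresses.

```python
def merge_paragraphs(text, chunk_size=100, chunk_overlap=0, separator="\n\n"):
    """Split on the separator, then greedily pack pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward while they fit in chunk_overlap
            kept = []
            for prev in reversed(current):
                if len(separator.join([prev] + kept)) > chunk_overlap:
                    break
                kept.insert(0, prev)
            # Drop the overlap if it would push the new chunk over the limit
            if kept and len(separator.join(kept + [piece])) > chunk_size:
                kept = []
            current = kept
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "aaaa\n\nbbbb\n\ncccc"
print(merge_paragraphs(text, chunk_size=10))                   # ['aaaa\n\nbbbb', 'cccc']
print(merge_paragraphs(text, chunk_size=10, chunk_overlap=4))  # ['aaaa\n\nbbbb', 'bbbb\n\ncccc']
```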

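(2) Recursive character chunking (RecursiveCharacterTextSplitter)
It tries separators recursively from coarse to fine, "paragraph → line → sentence/punctuation → word/character", so it copes better when one paragraph is extremely long. (The default separators in LangChain are "\n\n", "\n", " ", and ""; sentence punctuation has to be added through the separators argument.)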
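
Here is a minimal sketch of the recursion, again written from scratch rather than copied from LangChain, and without overlap handling. The separator order is an assumption that follows the "paragraph → line → word → character" idea.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the coarsest separator first; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "short para\n\nthis paragraph is much longer than the limit"
print(recursive_split(text, chunk_size=20))
# ['short para', 'this paragraph is', 'much longer than the', 'limit']
```

The short paragraph survives intact, and only the long one gets cut at word boundaries.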

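(3) Semantic chunking (Semantic Chunking)
Instead of cutting strictly by length, it looks at the semantic distance between adjacent sentences and cuts where the meaning shifts sharply. The upside is that each chunk stays on one topic; the downside is higher cost (embeddings have to be computed) and a threshold that needs tuning.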
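
The breakpoint logic is easy to sketch. Note the big assumption: the `embed` function below is a toy bag-of-words stand-in so the example runs without any model; a real setup would call an embedding model there. The 0.8 threshold is also just a value picked for this toy data.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(sentences, threshold=0.8):
    """Start a new chunk whenever adjacent sentences are farther apart than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(cur)) > threshold:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return [" ".join(group) for group in groups]


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Vector databases store embeddings.",
    "Vector databases index embeddings quickly.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

On this toy input the only cut lands between the cat sentences and the database sentences, where the two neighbours share no words at all.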

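(4) Split by document structure (for example, Markdown by heading)
A good fit for clearly structured documents: first split into sections by heading, then split each section further with recursive or sentence splitting, and keep the heading hierarchy as metadata, so that retrieval and answering "know which section this passage belongs to".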
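
A first pass by heading can be sketched with a regular expression. The function name and the output keys ("headings", "text") are my own choices for illustration. The sketch only handles `#`-style headings and does not skip `#` lines inside fenced code blocks, so treat it as a starting point.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown_text):
    """Return one section per heading, each carrying its full heading path."""
    sections = []
    path = []    # stack of (level, title) for the current position
    buffer = []  # body lines collected since the last heading

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"headings": [title for _, title in path], "text": body})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or a deeper level
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return sections


doc = "# Guide\nintro\n## Loading\nload text\n## Chunking\nchunk text"
for section in split_markdown_by_heading(doc):
    print(section)
```

Each section can then go through a recursive or sentence splitter, with the "headings" list copied into every resulting chunk.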

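5) Specific tools

Unstructured: parse the document into "elements" first, then assemble them into chunks
It does not cut raw text blindly. It first runs partition (recognizing Title, paragraphs, ListItem, Table, and so on), and then:

basic: packs consecutive elements together until max_characters is reached
by_title: starts a new chunk at every title, preserving section boundaries
The approach is closer to "understand the structure first, then chunk".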
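
The by_title idea can be sketched on a list of (type, text) pairs. This is a simplified illustration of the behaviour described above, not Unstructured's implementation: the tuples stand in for real element objects, and `chunk_by_title_sketch` is a hypothetical name.

```python
def chunk_by_title_sketch(elements, max_characters=200):
    """Pack consecutive elements; start a new chunk at a Title or when the limit is hit."""
    chunks, current = [], []
    for kind, text in elements:
        joined_length = len("\n".join(current + [text]))
        if current and (kind == "Title" or joined_length > max_characters):
            chunks.append("\n".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


elements = [
    ("Title", "Loading"),
    ("NarrativeText", "Loaders extract text."),
    ("Title", "Chunking"),
    ("NarrativeText", "Chunks are small."),
    ("ListItem", "Keep them focused."),
]
for chunk in chunk_by_title_sketch(elements):
    print(repr(chunk))
```

Dropping the `kind == "Title"` condition turns this into the basic behaviour: elements are packed purely by size.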

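LlamaIndex: treats chunking as part of Node Parser / Transformation
It ships many node parsers, for example:

MarkdownNodeParser (by structure)
SentenceSplitter / TokenTextSplitter (by sentence / token)
SemanticSplitterNodeParser (semantic breakpoints)
SentenceWindowNodeParser (sentence-level retrieval, with a window of surrounding context)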
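
The sentence-window idea is the least obvious of the four, so here is a plain-Python sketch of it: each sentence is indexed on its own, but carries its neighbours along as context. This illustrates the concept only; it is not LlamaIndex code, and the dictionary keys and the `sentence_windows` name are my own.

```python
def sentence_windows(sentences, window_size=1):
    """Pair every sentence with a window of window_size neighbours on each side."""
    nodes = []
    for index, sentence in enumerate(sentences):
        start = max(0, index - window_size)
        end = min(len(sentences), index + window_size + 1)
        nodes.append({
            "text": sentence,                          # what gets embedded and matched
            "window": " ".join(sentences[start:end]),  # what gets handed to the LLM
        })
    return nodes


sentences = ["A is first.", "B is second.", "C is third."]
for node in sentence_windows(sentences):
    print(node)
```

Matching on the single sentence keeps retrieval precise, while the window restores the context the LLM needs to answer.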

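6) Summary

Clearly structured documents: split by structure first (headings/sections), then add a second, finer split.
Messy but long documents: recursive chunking is the safer choice.
When topical coherence and retrieval precision matter most: try semantic chunking, but accept that it is slower and needs more tuning.
Do not get greedy with chunk size: otherwise you hit blurry embeddings and the model's "lost in the middle" problem at the same time.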
