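Is Traditional Chunking Dead? Agentic Chunking: A Smart Chunking Strategy That Keeps RAG from Missing Key Information
This article introduces Agentic Chunking, a newer text chunking technique in which a large language model actively evaluates each sentence and assigns related sentences to the same chunk. This addresses a weakness of traditional chunking, which cannot handle content that is related across paragraphs. Through propositioning and dynamically created chunks, it groups sentences that are far apart but share a topic, which can noticeably improve RAG accuracy (a colleague reported a gain of about 40% in his project). It costs more, but it is a good fit for unstructured text, content that switches topics repeatedly, and QA systems that need to connect information across paragraphs.
Recently a colleague working on an LLM project asked me about a problem: a proper noun appeared many times in his documents, yet RAG kept missing key information. After some digging, the cause turned out to be the traditional chunking method: sentences that were several pages apart but closely related had been split up. I gave some general advice, such as replacing pure semantic retrieval with hybrid retrieval and generating QA pairs from each chunk. He then asked whether any chunking technique could reduce this kind of problem. I told him he could also try a recently proposed chunking strategy: Agentic Chunking.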
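Why does chunking matter so much?
In a RAG system, text chunking is the first step and also the most critical one. Traditional methods such as recursive character splitting are simple and easy to use, but they have an obvious weakness: they split at a fixed chunk length (measured in characters or tokens), so a single topic can end up spread across different chunks, which breaks the continuity of the context.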
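The weakness is easy to see in a small sketch. The function below is a deliberately simplified stand-in for fixed-length splitting (it is not Langchain's recursive splitter, which also tries separators such as newlines first); the name `fixed_size_split`, the sample text and the chunk size of 60 characters are all illustrative assumptions.

```python
def fixed_size_split(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Two sentences about the same topic, separated by an unrelated one.
text = (
    "Xiao Ming introduced the Transformer architecture. "
    "Lunch was served at noon. "
    "The core of the Transformer is self-attention."
)

chunks = fixed_size_split(text, chunk_size=60)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

With this sample text, the first chunk holds the sentence that introduces the Transformer, while the statement about self-attention falls outside it. The splitter only counts characters, so it has no way to know the two sentences belong together.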
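Another common method is semantic splitting, which splits wherever it detects a change in meaning between neighboring sentences. It is smarter than recursive character splitting, but it has limits too. For example, when a document switches back and forth between topics, semantic splitting may place related content in different chunks, leaving the information disconnected.
Both methods break down in a scenario like this:
“Xiao Ming introduced the Transformer architecture…(five paragraphs of other content in between)…Finally, he emphasized that the core of the Transformer is the self-attention mechanism.”
Traditional methods either put these two sentences in different chunks, or let the content in between break the semantic connection. When chunking by hand, we would naturally group both under "model principles". This kind of association across textual distance is exactly what Agentic Chunking aims to handle.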
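The same failure can be reproduced for semantic splitting with a toy sketch. Real semantic splitters compare sentence embeddings; here word overlap (Jaccard similarity) stands in for embedding similarity so the example runs without any model. The function names, the sample sentences and the threshold of 0.2 are assumptions made for illustration.

```python
import re


def words(sentence: str) -> set[str]:
    """Lowercased word set of a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity, used here in place of embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adjacent_split(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk whenever a sentence differs too much from the previous one."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(words(prev), words(cur)) < threshold:
            chunks.append([cur])      # topic changed: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return chunks


sentences = [
    "Xiao Ming introduced the Transformer architecture.",
    "The Transformer architecture replaced recurrent networks.",
    "Lunch was served in the cafeteria.",
    "The cafeteria lunch was vegetarian.",
    "The core of the Transformer is self-attention.",
]

chunks = adjacent_split(sentences, threshold=0.2)
for chunk in chunks:
    print(chunk)
```

Because the splitter only compares each sentence with its immediate neighbor, the last sentence opens a third chunk instead of rejoining the first one, even though both are about the Transformer.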
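How Agentic Chunking works
The core idea of Agentic Chunking is to let a large language model (LLM) actively evaluate every sentence and assign it to the most suitable chunk. Unlike traditional methods, Agentic Chunking does not rely on a fixed token length or on semantic changes between neighboring sentences. Instead, the LLM's judgment is used to group sentences that are far apart in the document but related in topic.
As an example, suppose we have the following text:
On July 20, 1969, astronaut Neil Armstrong walked on the moon. He was leading the NASA’s Apollo 11 mission. Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface.
In Agentic Chunking, the LLM first applies propositioning to these sentences. That is, it makes each sentence stand on its own, with its own explicit subject. The processed text looks like this:
On July 20, 1969, astronaut Neil Armstrong walked on the moon. Neil Armstrong was leading the NASA’s Apollo 11 mission. Neil Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface.
This way, the LLM can examine each sentence individually and assign it to the most suitable chunk.
Propositioning can be thought of as a sentence-level "makeover" for the document, making sure every sentence is independent and complete.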
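How to implement Agentic Chunking
The key to implementing Agentic Chunking lies in propositioning and in dynamically creating and updating chunks. Tools such as Langchain and Pydantic can be used for this. The process has three steps: propositioning the text, creating and updating chunks, and pushing each proposition to a suitable chunk.
- Propositioning the text
First, every sentence in the text needs to go through propositioning. We can use a prompt template available through Langchain and let the LLM do this automatically. Here is a simple code example: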
    from typing import List

    from langchain import hub
    from langchain_core.prompts import ChatPromptTemplate  # used by the later snippets
    from langchain_openai import ChatOpenAI
    from pydantic import BaseModel

    # Pull the propositioning prompt from the Langchain hub.
    obj = hub.pull("wfh/proposal-indexing")
    llm = ChatOpenAI(model="gpt-4o")

    class Sentences(BaseModel):
        sentences: List[str]

    # Have the LLM return the propositions as a structured list of strings.
    extraction_llm = llm.with_structured_output(Sentences)
    extraction_chain = obj | extraction_llm

    # The result is a Sentences object; the propositions are in its `sentences` field.
    sentences = extraction_chain.invoke(
        """
        On July 20, 1969, astronaut Neil Armstrong walked on the moon.
        He was leading the NASA's Apollo 11 mission.
        Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
        """
    )
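- Creating and updating chunks
Next, we need a function that dynamically creates and updates chunks. Each chunk holds propositions on a similar topic, and as new propositions are added, the chunk's title and summary are updated as well.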
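    from pydantic import BaseModel, Field

    # ChunkMeta and chunks are used but never defined in the original snippet;
    # the definitions below are assumptions inferred from how they are used.
    class ChunkMeta(BaseModel):
        title: str = Field(description="The title of the chunk.")
        summary: str = Field(description="The summary of the chunk.")

    chunks = {}  # chunk_id -> {"summary": ..., "title": ..., "propositions": [...]}

    def create_new_chunk(chunk_id, proposition):
        # Ask the LLM for a title and a summary covering the chunk's propositions.
        summary_llm = llm.with_structured_output(ChunkMeta)
        summary_prompt_template = ChatPromptTemplate.from_messages([
            ("system", "Generate a new summary and a title based on the propositions."),
            ("user", "propositions:{propositions}"),
        ])
        summary_chain = summary_prompt_template | summary_llm
        chunk_meta = summary_chain.invoke({"propositions": [proposition]})
        chunks[chunk_id] = {
            "summary": chunk_meta.summary,
            "title": chunk_meta.title,
            "propositions": [proposition],
        }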
- Pushing each proposition to a suitable chunk
Finally, we need an AI agent that decides which chunk a new proposition should be added to. If no suitable chunk exists, the agent creates a new one.
    from pydantic import BaseModel, Field

    def find_chunk_and_push_proposition(proposition):
        class ChunkID(BaseModel):
            chunk_id: int = Field(description="The chunk id.")

        allocation_llm = llm.with_structured_output(ChunkID)
        allocation_prompt = ChatPromptTemplate.from_messages([
            ("system", "Find the chunk that best matches the proposition. If no chunk matches, return a new chunk id."),
            ("user", "proposition:{proposition} chunks_summaries:{chunks_summaries}"),
        ])
        allocation_chain = allocation_prompt | allocation_llm

        # Only the summaries are shown to the LLM, not every stored proposition.
        chunks_summaries = {chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()}
        best_chunk_id = allocation_chain.invoke(
            {"proposition": proposition, "chunks_summaries": chunks_summaries}
        ).chunk_id

        if best_chunk_id not in chunks:
            create_new_chunk(best_chunk_id, proposition)
        else:
            # add_proposition is not shown in the original; it is expected to append
            # the proposition to the chunk and refresh the chunk's title and summary.
            add_proposition(best_chunk_id, proposition)
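To make the control flow easier to follow without an API key, here is a minimal offline sketch of the same loop. It is not the implementation above: the LLM allocation call is replaced by a keyword-overlap heuristic (`pick_chunk`), the chunk title is just the first proposition rather than an LLM-generated one, and the names `Chunk`, `keywords`, `pick_chunk` and `agentic_chunk` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

STOP_WORDS = {"the", "of", "is", "a", "an", "to", "for", "at", "on", "in", "and"}


@dataclass
class Chunk:
    title: str
    propositions: list[str] = field(default_factory=list)


def keywords(text: str) -> set[str]:
    """Lowercased words of the text, minus a few stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def pick_chunk(proposition: str, chunks: dict[int, Chunk]) -> Optional[int]:
    """Stand-in for the LLM allocation call: return the id of the chunk that
    shares the most keywords with the proposition, or None if none shares any."""
    best_id, best_overlap = None, 0
    for chunk_id, chunk in chunks.items():
        overlap = len(keywords(proposition) & keywords(" ".join(chunk.propositions)))
        if overlap > best_overlap:
            best_id, best_overlap = chunk_id, overlap
    return best_id


def agentic_chunk(propositions: list[str]) -> dict[int, Chunk]:
    """Assign each proposition to an existing chunk, or open a new chunk."""
    chunks: dict[int, Chunk] = {}
    for proposition in propositions:
        chunk_id = pick_chunk(proposition, chunks)
        if chunk_id is None:
            chunk_id = len(chunks)
            # Placeholder title; the real flow asks the LLM for title and summary.
            chunks[chunk_id] = Chunk(title=proposition)
        chunks[chunk_id].propositions.append(proposition)
    return chunks


propositions = [
    "Xiao Ming introduced the Transformer architecture.",
    "The show tickets cost 19 dollars.",
    "Online tickets get a discount.",
    "The core of the Transformer is self-attention.",
]

result = agentic_chunk(propositions)
for chunk_id, chunk in result.items():
    print(chunk_id, chunk.propositions)
```

The point of the sketch is the order-independent grouping: the fourth proposition joins the first chunk even though two unrelated propositions were processed in between. In the real flow, the LLM makes that decision from the chunk summaries instead of from shared keywords.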
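How well does it work in practice?
As a test, I took the introduction text for Wings of Time, a well-known attraction on Sentosa in Singapore, and processed it with a GPT-4 model. The text covers the attraction's description, ticketing information, opening hours and more, which makes it a good test sample.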
Product Name: Wings of Time
Product Description: Wings of Time is one of Sentosa's most breathtaking attractions, combining water, laser, fire, and music to create a mesmerizing night show about friendship and courage. Situated on the scenic (https://www.sentosa.com.sg/en/things-to-do/attractions/siloso-beach/) Siloso Beach , this award-winning spectacle is staged nightly, promising an unforgettable experience for visitors of all ages. Be wowed by spellbinding laser, fire, and water effects set to a majestic soundtrack, complete with a jaw-dropping fireworks display. A fitting end to your day out at Sentosa, it’s possibly the only place in Singapore where you can witness such an awe-inspiring performance. Get ready for an even better experience starting 1 February 2025 ! Wings of Time Fireworks Symphony, Singapore’s only daily fireworks show, now features a fireworks display that is four times longer! Important Note: Please visit (https://www.sentosa.com.sg/sentosa-reservation) here if you need to change your visit date. All changes must be made at least 1 day prior to the visit date.
Product Category: Shows
Product Type: Attraction
Keywords: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets
Meta Description: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening.
Product Tags: Family Fun,Popular experiences,Frequently Bought
Locations: Beach Station
[Tickets]
Name: Wings of Time (Std)
Terms:
• All Wings of Time (WOT) Open-Dated tickets require prior redemption at Singapore Cable Car Ticketing counters and are subjected to seats availability on a first come first serve basis.
• This is a rain or shine event. Tickets are non-exchangeable or nonrefundable under any circumstances.
• Once timeslot is confirmed, no further amendments are allowed. Please proceed to WOT admission gates to scan your issued QR code via mobile or physical printout for admission.
• Gates will open 15 minutes prior to the start of the show.
• Show Duration: 20 minutes per show.
• Please be punctual for your booked time slot.
• Admission will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host.
• Standard seats are applicable to guest aged 4 years and above.
• No outside Food & Drinks are allowed.
• Refer to (https://www.mountfaberleisure.com/attraction/wings-of-time/) https://www.mountfaberleisure.com/attraction/wings-of-time/ for more information on Wings of Time.
Pax Type: Standard
Promotion A: Enjoy $1.90 off when you purchase online! Discount will automatically be applied upon checkout.
Price: 19
Opening Hours: Daily Show 1: 7.40pm Show 2: 8.40pm
Accessibilities: Wheelchair
[Information]
Title: Terms & Conditions
Description: For more information, click (https://www.sentosa.com.sg/en/promotional-general-store-terms-and-conditions) here for Terms & Conditions
Title: Getting Here
Description: By Sentosa Express: Alight at Beach Station By Public Bus: Board Bus 123 and alight at Beach Station By Intra-Island Bus: Board Sentosa Bus A or B and alight at Beach Station Nearest Car Park Beach Station Car Park
Title: Contact Us
Description: Beach Station +65 6361 0088 (mailto:guestrelations@mflg.com.sg) guestrelations@mflg.com.sg
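The system first converted the original text into just over 50 independent statements (propositions). Interestingly, in doing so it rewrote nearly every sentence so that it refers to "Wings of Time" explicitly, which shows that the model grasped the main topic of the text.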
[
    "Wings of Time is one of Sentosa's most breathtaking attractions.",
    'Wings of Time combines water, laser, fire, and music to create a mesmerizing night show.',
    'The night show of Wings of Time is about friendship and courage.',
    'Wings of Time is situated on the scenic Siloso Beach.',
    'Wings of Time is an award-winning spectacle staged nightly.',
    'Wings of Time promises an unforgettable experience for visitors of all ages.',
    'Wings of Time features spellbinding laser, fire, and water effects set to a majestic soundtrack.',
    'Wings of Time includes a jaw-dropping fireworks display.',
    'Wings of Time is a fitting end to a day out at Sentosa.',
    'Wings of Time is possibly the only place in Singapore where such an awe-inspiring performance can be witnessed.',
    'Wings of Time will offer an even better experience starting 1 February 2025.',
    'Wings of Time Fireworks Symphony is Singapore’s only daily fireworks show.',
    'Wings of Time Fireworks Symphony now features a fireworks display that is four times longer.',
    'Visitors should visit the provided link if they need to change their visit date to Wings of Time.',
    'All changes to the visit date must be made at least 1 day prior to the visit date.',
    'Wings of Time is categorized as a show.',
    'Wings of Time is a type of attraction.',
    'Keywords for Wings of Time include: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets.',
    'The meta description for Wings of Time is: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening.',
    'Product tags for Wings of Time include: Family Fun, Popular experiences, Frequently Bought.',
    'Wings of Time is located at Beach Station.',
    'Wings of Time (Std) tickets require prior redemption at Singapore Cable Car Ticketing counters.',
    'Wings of Time (Std) tickets are subjected to seats availability on a first come first serve basis.',
    'Wings of Time is a rain or shine event.',
    'Tickets for Wings of Time are non-exchangeable or nonrefundable under any circumstances.',
    'Once the timeslot for Wings of Time is confirmed, no further amendments are allowed.',
    'Visitors should proceed to Wings of Time admission gates to scan their issued QR code via mobile or physical printout for admission.',
    'Gates for Wings of Time will open 15 minutes prior to the start of the show.',
    'The show duration for Wings of Time is 20 minutes per show.',
    'Visitors should be punctual for their booked time slot for Wings of Time.',
    'Admission to Wings of Time will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host.',
    'Standard seats for Wings of Time are applicable to guests aged 4 years and above.',
    'No outside food and drinks are allowed at Wings of Time.',
    'More information on Wings of Time can be found at the provided link.',
    'The pax type for Wings of Time is Standard.',
    'Promotion A for Wings of Time offers $1.90 off when purchased online.',
    'The discount for Promotion A will automatically be applied upon checkout.',
    'The price for Wings of Time is 19.',
    'Wings of Time has opening hours daily with Show 1 at 7.40pm and Show 2 at 8.40pm.',
    'Wings of Time is accessible by wheelchair.',
    "The title for terms and conditions is 'Terms & Conditions'.",
    'More information on terms and conditions can be found at the provided link.',
    "The title for getting to Wings of Time is 'Getting Here'.",
    'Visitors can get to Wings of Time by Sentosa Express by alighting at Beach Station.',
    'Visitors can get to Wings of Time by Public Bus by boarding Bus 123 and alighting at Beach Station.',
    'Visitors can get to Wings of Time by Intra-Island Bus by boarding Sentosa Bus A or B and alighting at Beach Station.',
    'The nearest car park to Wings of Time is Beach Station Car Park.',
    "The title for contacting Wings of Time is 'Contact Us'.",
    'The contact location for Wings of Time is Beach Station.',
    'The contact phone number for Wings of Time is +65 6361 0088.',
    'The contact email for Wings of Time is guestrelations@mflg.com.sg.'
]
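After agentic chunking, the whole text was divided into four main parts:
- Main information chunk: the core introduction to Wings of Time, its highlights, its location and other general information
- Scheduling policy chunk: information about changing a booking
- Pricing and discount chunk: content about discounts and payment
- Legal terms chunk: the various terms and conditions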
Chunk (a641f): Sentosa's Wings of Time Show & Visitor Information
Summary: This chunk contains comprehensive details about the Wings of Time attraction in Sentosa, including its features, themes, location, visitor experience, ticketing and admission procedures, future enhancements, promotions, classification as a show and attraction, unique fireworks display, daily show schedule, accessibility options, importance of punctuality and ticket redemption, extended fireworks display in the Fireworks Symphony, transportation options to reach the venue, and the necessity of adhering to non-exchangeable ticket policies, with a focus on the standard ticketing process and visitor guidelines, and the recent update on the extended fireworks display, as well as the contact information and accessibility details, and the new experience starting February 2025.
Chunk (ae2b8): Scheduling Policies
Summary: This chunk contains information about policies regarding changes to scheduled dates and times.
Chunk (dadbb): Retail & Discounts
Summary: This chunk contains information about the application of discounts during the checkout process.
Chunk (3347c): Legal Terms & Conditions
Summary: This chunk contains information about terms and conditions, including their titles and where to find more information.
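After chunking this way, each chunk has a clear topic and the chunks do not overlap. The important information comes first, and supporting information is stored by category. Keeping related information together also helps improve recall from the vector store, which in turn improves RAG accuracy.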
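Summary
Agentic Chunking is a powerful text chunking technique. It groups sentences that are far apart in a document but related in topic, which improves the results of a RAG system, but it is comparatively expensive in both cost and latency. After trying Agentic Chunking, my colleague told me that accuracy improved by 40%, while cost went up about 3x. So when should we use Agentic Chunking?
Based on my project experience, the following scenarios are a particularly good fit:
- Unstructured text (such as customer service conversation logs)
- Content that jumps back and forth between topics (such as transcripts of tech meetups)
- QA systems that need to connect information across paragraphs
For clearly structured papers, manuals and similar documents, traditional chunking and semantic chunking remain the more cost-effective choice.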
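To see where the extra cost comes from, it helps to count LLM calls at indexing time. The rough estimate below assumes one propositioning call per document, one allocation call per proposition, and one title-and-summary call per proposition (whether the proposition opens a new chunk or updates an existing one). The function name and this call pattern are assumptions based on the snippets in this article, not measured figures.

```python
def estimate_llm_calls(num_propositions: int, num_documents: int = 1) -> int:
    """Rough count of LLM calls needed to index documents with agentic chunking."""
    propositioning_calls = num_documents   # one call to rewrite each document
    allocation_calls = num_propositions    # one "which chunk?" call per proposition
    summary_calls = num_propositions       # one title/summary call per proposition
    return propositioning_calls + allocation_calls + summary_calls


# The Wings of Time sample produced 51 propositions.
print(estimate_llm_calls(51))  # 103
```

Recursive character splitting needs no LLM calls at all, so for long documents the number of propositions dominates the bill. Under these assumptions, cost and latency grow linearly with the number of propositions.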