Large Language Models (LLM) - In Practice - Text Processing: Classification, Information Extraction, and Matching


广东蚂蚁

Published 2025-09-18 16:05:16

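In recent years, large language models (LLMs) have performed strongly on natural language processing (NLP) tasks, and in finance in particular the demand for turning text data into structured form keeps growing. This article shows how to use the open-source model ChatGLM2-6B (the code loads its INT4-quantized checkpoint, chatglm2-6b-int4) for three typical financial text processing tasks: text classification, information extraction, and text matching, with complete code and result output for each.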

1. Environment Setup and Model Loading

1.1 Installing Dependencies

bash

pip install transformers rich torch sentencepiece cpm_kernels

1.2 Model Loading Code Explained

python

import torch
from transformers import AutoTokenizer, AutoModel
from rich.console import Console

# Initialize console output
console = Console()

# Select the device: prefer the GPU and fall back to the CPU (much slower)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Load the tokenizer and the model
model_path = r"D:\02-weights\chatglm2-6b-int4"  # local path to the model weights
# Or use the model from the Hugging Face Hub: "THUDM/chatglm2-6b-int4"

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True  # allow running the custom model code shipped with the checkpoint
)

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True
)

# Precision and device placement
if device.startswith('cuda'):
    model = model.half().cuda()  # half precision reduces GPU memory usage
else:
    model = model.float()  # full precision on the CPU

model = model.eval()  # switch to evaluation mode

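2. Text Classification

2.1 Task Overview and Code Walkthrough

Text classification assigns a financial text to one of a set of predefined categories. The complete implementation follows; it relies on the model and tokenizer loaded in Section 1.2.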

python

# -*- coding: utf-8 -*-
"""
Financial text classification with an LLM
"""
from rich import print
from rich.console import Console
from transformers import AutoTokenizer, AutoModel

# Categories and one example text for each
class_examples = {
    '新闻报道': '今日，股市经历了一轮震荡，受到宏观经济数据和全球贸易紧张局势的影响。投资者密切关注美联储可能的政策调整，以适应市场的不确定性。',
    '财务报告': '本公司年度财务报告显示，去年公司实现了稳步增长的盈利，同时资产负债表呈现强劲的状况。经济环境的稳定和管理层的有效战略执行为公司的健康发展奠定了基础。',
    '公司公告': '本公司高兴地宣布成功完成最新一轮并购交易，收购了一家在人工智能领域领先的公司。这一战略举措将有助于扩大我们的业务领域，提高市场竞争力',
    '分析师报告': '最新的行业分析报告指出，科技公司的创新将成为未来增长的主要推动力。云计算、人工智能和数字化转型被认为是引领行业发展的关键因素，投资者应关注这些趋势'
}

def init_prompts():
    """
    Build the preset prompt: a task instruction plus few-shot examples.
    Returns a dict holding the category list and the preset chat history.
    """
    class_list = list(class_examples.keys())
    print(f'Available categories: {class_list}')

    # Preset chat history: the task instruction and the model's acknowledgement
    pre_history = [
        (f'现在你是一个文本分类器，你需要按照要求将我给你的句子分类到：{class_list}类别中。',
         '好的。')
    ]

    # Append one few-shot example per category
    for _type, example in class_examples.items():
        pre_history.append((f'"{example}"是{class_list}里的什么类别？', _type))

    return {"class_list": class_list, "pre_history": pre_history}

def inference(sentences: list, custom_settings: dict):
    """
    Classify each sentence with the model.

    Args:
        sentences: list of sentences to classify
        custom_settings: dict holding the category list and the preset chat history
    """
    for sentence in sentences:
        with console.status("[bold bright_green] Running inference..."):
            # Build the query in the same format as the few-shot examples
            sentence_prompt = f'"{sentence}"是{custom_settings["class_list"]}里的什么类别？'

            # Ask the model; the preset history supplies the few-shot context
            response, history = model.chat(
                tokenizer,
                sentence_prompt,
                history=custom_settings['pre_history']
            )

        # Print the result
        print(f'>>> [bold bright_red]Sentence: {sentence}')
        print(f'>>> [bold bright_green]Category: {response}')
        print("*" * 80)

if __name__ == '__main__':
    # Initialize the console (model and tokenizer are loaded as in Section 1.2)
    console = Console()

    # Sentences to classify
    sentences = [
        "今日，央行发布公告宣布降低利率，以刺激经济增长。这一降息举措将影响贷款利率，并在未来几个季度内对金融市场产生影响。",
        "ABC公司今日发布公告称，已成功完成对XYZ公司股权的收购交易。本次交易是ABC公司在扩大业务范围、加强市场竞争力方面的重要举措。",
        "公司资产负债表显示，公司偿债能力强劲，现金流充足，为未来投资和扩张提供了坚实的财务基础。",
        "最新的分析报告指出，可再生能源行业预计将在未来几年经历持续增长，投资者应该关注这一领域的投资机会",
        "金融系统是建设金融强国责无旁贷的主力军，必须切实把思想和行动统一到党中央决策部署上来。"
    ]
    
    # Build the few-shot prompt settings
    custom_settings = init_prompts()

    # Run inference
    inference(sentences, custom_settings)

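2.2 Key Technical Points

Few-shot learning: one example per category helps the model understand the classification criteria.
Chat history management: pre_history carries the task instruction and the examples, so every query is answered in the same context.
Prompt engineering: each query uses the same format as the examples, whose answers consist of the category name alone, which steers the model toward returning just the category name.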
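
Even with few-shot examples, the model may wrap the category in a longer sentence. A minimal sketch of mapping a free-form reply back to one of the known labels is shown below. The helper name normalize_label and the constant CLASS_LIST are hypothetical and not part of the original script; the labels are copied from class_examples.

```python
from typing import List, Optional

# Labels copied from class_examples in the classification script (assumption:
# the same four categories are in use).
CLASS_LIST = ['新闻报道', '财务报告', '公司公告', '分析师报告']


def normalize_label(response: str, class_list: List[str]) -> Optional[str]:
    """Return the single label mentioned in `response`, or None if unclear."""
    hits = [label for label in class_list if label in response]
    # Accept only an unambiguous answer; zero or several hits are rejected.
    return hits[0] if len(hits) == 1 else None


print(normalize_label('这句话属于"公司公告"类别。', CLASS_LIST))
print(normalize_label('无法判断', CLASS_LIST))
```

Returning None for ambiguous replies keeps the decision about retries or manual review with the caller instead of guessing a label.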

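3. Information Extraction

3.1 Task Overview and Code Walkthrough

This task extracts structured information, such as stock names and prices, from financial text. The complete implementation follows; it also relies on the model and tokenizer loaded in Section 1.2.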

python

import re
import json
from rich import print
from transformers import AutoTokenizer, AutoModel

# Schema: entity type and the fields to extract for it
schema = {
    '金融': ['日期', '股票名称', '开盘价', '收盘价', '成交量'],
}

# Prompt template for information extraction
IE_PATTERN = "{}\n\n提取上述句子中{}的实体，并按照JSON格式输出，上述句子中不存在的信息用['原文中未提及']来表示，多个值之间用','分隔。"

# Few-shot examples
ie_examples = {
    '金融': [
        {
            'content': '2023-01-10，股市震荡。股票古哥-D[EOOE]美股今日开盘价100美元，一度飙升至105美元，随后回落至98美元，最终以102美元收盘，成交量达到520000。',
            'answers': {
                '日期': ['2023-01-10'],
                '股票名称': ['古哥-D[EOOE]美股'],
                '开盘价': ['100美元'],
                '收盘价': ['102美元'],
                '成交量': ['520000'],
            }
        }
    ]
}

def init_prompts():
    """
    Build the preset prompt for information extraction with in-context examples.
    """
    ie_pre_history = [
        (
            "现在你需要帮助我完成信息抽取任务，当我给你一个句子时，你需要帮我抽取出句子中实体信息，并按照JSON的格式输出，上述句子中没有的信息用['原文中未提及']来表示，多个值之间用','分隔。",
            '好的，请输入您的句子。'
        )
    ]

    # Append the few-shot examples to the chat history
    for _type, example_list in ie_examples.items():
        for example in example_list:
            sentence = example["content"]
            properties_str = ', '.join(schema[_type])
            schema_str_list = f'"{_type}"({properties_str})'

            # Fill the template with the sentence and the schema description
            sentence_with_prompt = IE_PATTERN.format(sentence, schema_str_list)

            # Store the prompt together with the expected JSON answer
            ie_pre_history.append((
                sentence_with_prompt,
                json.dumps(example['answers'], ensure_ascii=False)
            ))

    return {"ie_pre_history": ie_pre_history}

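def clean_response(response: str):
    """
    Clean the model response and parse the JSON it contains.

    Args:
        response: raw response text from the model

    Returns:
        The parsed JSON object, or the cleaned text if parsing fails
    """
    # The model may wrap its answer in a ```json code block
    if '```json' in response:
        res = re.findall(r'```json(.*?)```', response, re.DOTALL)
        if len(res) and res[0]:
            response = res[0]

    # Replace the Chinese enumeration comma with an ASCII comma
    response = response.replace('、', ',')

    try:
        # Try to parse the JSON
        return json.loads(response)
    except json.JSONDecodeError:
        # Fall back to the cleaned text when parsing fails
        print(f"JSON parsing failed, response: {response}")
        return response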

def inference(sentences: list, custom_settings: dict):
    """
    Run information extraction on each sentence.

    Args:
        sentences: list of sentences to extract from
        custom_settings: dict holding the preset chat history
    """
    for sentence in sentences:
        # Entity type to extract (fixed here for simplicity)
        cls_res = "金融"
        if cls_res not in schema:
            print(f'Type {cls_res} is not in the schema dict; exiting.')
            raise SystemExit(1)

        # Build the field list and the schema description
        properties_str = ', '.join(schema[cls_res])
        schema_str_list = f'"{cls_res}"({properties_str})'

        # Fill the template with the sentence and the schema description
        sentence_with_ie_prompt = IE_PATTERN.format(sentence, schema_str_list)

        # Ask the model; the preset history supplies the few-shot context
        ie_res, history = model.chat(
            tokenizer,
            sentence_with_ie_prompt,
            history=custom_settings["ie_pre_history"]
        )

        # Clean and parse the response
        ie_res = clean_response(ie_res)

        # Print the result
        print(f'>>> [bold bright_red]Sentence: {sentence}')
        print(f'>>> [bold bright_green]Extraction result: {ie_res}')
        print("*" * 80)

if __name__ == '__main__':
    # Sentences to extract from
    sentences = [
        '2023-02-15，寓意吉祥的节日，股票佰笃[BD]美股开盘价10美元，虽然经历了波动，但最终以13美元收盘，成交量微幅增加至460,000，投资者情绪较为平稳。',
        '2023-04-05，市场迎来轻松氛围，股票盘古(0021)开盘价23元，尽管经历了波动，但最终以26美元收盘，成交量缩小至310,000，投资者保持观望态度。',
    ]
    
    # Build the few-shot prompt settings
    custom_settings = init_prompts()

    # Run inference
    inference(sentences, custom_settings)

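3.2 Key Technical Points

Schema design: the entity type and the fields to extract are defined explicitly.
JSON output: the model is asked for structured data, which simplifies downstream processing.
Response cleaning: code-block wrappers that the model may add are stripped before parsing, so JSON parsing succeeds more often; when it still fails, the text is returned unparsed.
Missing values: ['原文中未提及'] ("not mentioned in the source text") explicitly marks information that was not found.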
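
Because clean_response can return either a parsed dict or plain text, downstream code usually benefits from one uniform shape. The sketch below is one possible way to coerce a result into a dict with every schema field present. The names normalize_extraction, MISSING, and FIELDS are hypothetical helpers introduced for illustration; the field list is copied from schema['金融'].

```python
from typing import Any, Dict, List

# Placeholder used by the extraction prompt for absent information.
MISSING = ['原文中未提及']
# Fields copied from schema['金融'] (assumption: the same schema is in use).
FIELDS = ['日期', '股票名称', '开盘价', '收盘价', '成交量']


def normalize_extraction(result: Any, fields: List[str] = FIELDS) -> Dict[str, List[str]]:
    """Coerce a parsed model answer into {field: list of str} for every field."""
    if not isinstance(result, dict):
        # clean_response hands back text when JSON parsing fails.
        result = {}
    normalized = {}
    for field in fields:
        value = result.get(field)
        if value is None or value == [] or value == '':
            normalized[field] = list(MISSING)      # field absent or empty
        elif isinstance(value, list):
            normalized[field] = [str(v) for v in value]
        else:
            normalized[field] = [str(value)]       # wrap a bare value in a list
    return normalized


print(normalize_extraction({'日期': '2023-02-15', '开盘价': ['10美元']}))
```

Keys that are not in the schema are dropped, so a reply with extra fields cannot leak unexpected columns into later processing.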

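4. Text Matching

4.1 Task Overview and Code Walkthrough

This task decides whether two sentences are semantically similar. The complete implementation follows; it also relies on the model and tokenizer loaded in Section 1.2.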

python

from rich import print
from transformers import AutoTokenizer, AutoModel

# Few-shot sentence pairs, keyed by the expected answer: 是 (similar) / 不是 (not similar)
examples = {
    '是': [
        ('公司ABC发布了季度财报，显示盈利增长。', '财报披露，公司ABC利润上升。'),
    ],
    '不是': [
        ('黄金价格下跌，投资者抛售。', '外汇市场交易额创下新高。'),
        ('央行降息，刺激经济增长。', '新能源技术的创新。')
    ]
}

def init_prompts():
    """
    Build the preset prompt for text matching with in-context examples.
    """
    pre_history = [
        (
            '现在你需要帮助我完成文本匹配任务，当我给你两个句子时，你需要回答我这两句话语义是否相似。只需要回答是否相似，不要做多余的回答。',
            '好的，我将只回答"是"或"不是"。'
        )
    ]

    # Append the few-shot examples; the dict key is the expected answer
    for key, sentence_pairs in examples.items():
        for sentence_pair in sentence_pairs:
            sentence1, sentence2 = sentence_pair
            pre_history.append((
                f'句子一:{sentence1}\n句子二:{sentence2}\n上面两句话是相似的语义吗？',
                key
            ))

    return {"pre_history": pre_history}

def inference(sentence_pairs: list, custom_settings: dict):
    """
    Judge the semantic similarity of each sentence pair.

    Args:
        sentence_pairs: list of sentence pairs to compare
        custom_settings: dict holding the preset chat history
    """
    for sentence_pair in sentence_pairs:
        sentence1, sentence2 = sentence_pair

        # Build the query in exactly the same format as the few-shot examples
        sentence_with_prompt = f'句子一:{sentence1}\n句子二:{sentence2}\n上面两句话是相似的语义吗？'

        # Ask the model; the preset history supplies the few-shot context
        response, history = model.chat(
            tokenizer,
            sentence_with_prompt,
            history=custom_settings['pre_history']
        )

        # Print the result
        print(f'>>> [bold bright_red]Sentence pair: {sentence_pair}')
        print(f'>>> [bold bright_green]Match result: {response}')
        print("*" * 80)

if __name__ == '__main__':
    # Sentence pairs to compare
    sentence_pairs = [
        ('股票市场今日大涨，投资者乐观。', '持续上涨的市场让投资者感到满意。'),
        ('油价大幅下跌，能源公司面临挑战。', '未来智能城市的建设趋势愈发明显。'),
        ('利率上升，影响房地产市场。', '高利率对房地产有一定冲击。'),
    ]
    
    # Build the few-shot prompt settings
    custom_settings = init_prompts()

    # Run inference
    inference(sentence_pairs, custom_settings)

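4.2 Key Technical Points

Binary classification: semantic matching is reduced to a two-way decision between 是 (similar) and 不是 (not similar).
Explicit instruction: the model is told to output only the judgment, without extra explanation.
Positive and negative examples: the few-shot examples cover both outcomes (one similar pair and two dissimilar pairs here), which helps the model understand the task.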
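
To use the judgment programmatically, the reply has to be turned into a boolean. The sketch below shows one way to do that; parse_match is a hypothetical helper that is not part of the original script, and the accepted answer forms beyond 是 and 不是 are assumptions about what the model might return.

```python
from typing import Optional


def parse_match(response: str) -> Optional[bool]:
    """Map a free-form answer to True (similar), False (not similar), or None."""
    # Drop surrounding whitespace, quotes, and sentence-final punctuation.
    text = response.strip().strip('"“”。.！! ')
    # Check the negative forms first, because "不是" contains "是".
    if text.startswith(('不是', '不相似', '否')):
        return False
    if text.startswith(('是', '相似')):
        return True
    return None  # anything else is treated as undecided


print(parse_match('是'))
print(parse_match('不是。'))
```

Testing the negative prefixes before the positive ones matters here: a plain substring check for 是 would classify 不是 as a match.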

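5. Optimization Tips and Caveats

5.1 Performance Optimization

Model quantization: use 4-bit or 8-bit quantization to reduce GPU memory usage.
Batching: process multiple inputs in one batch to improve inference throughput.
Caching: cache the results of repeated queries.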
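
The caching idea can be sketched as a thin wrapper around any function that maps a prompt to a response. This is only an illustration under two assumptions: the few-shot history stays fixed between calls, and reusing an earlier answer for an identical prompt is acceptable. make_cached_chat and fake_chat are hypothetical names; fake_chat stands in for a call such as model.chat(tokenizer, prompt, history=...)[0].

```python
from typing import Callable, Dict


def make_cached_chat(chat_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a prompt -> response function with an in-memory cache."""
    cache: Dict[str, str] = {}

    def cached(prompt: str) -> str:
        if prompt not in cache:
            # Only the first occurrence of a prompt reaches the model.
            cache[prompt] = chat_fn(prompt)
        return cache[prompt]

    return cached


calls = []


def fake_chat(prompt: str) -> str:
    # Stand-in for the real model call; records how often it is invoked.
    calls.append(prompt)
    return f'echo: {prompt}'


chat = make_cached_chat(fake_chat)
chat('a')
chat('a')
chat('b')
print(len(calls))  # 2: the repeated prompt was served from the cache
```

The cache is keyed by the prompt alone, so it should be rebuilt whenever the preset history or the model changes.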

5.2 Error Handling

python

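import time

# Error handling with retries
def safe_chat(model, tokenizer, prompt, history, max_retries=3):
    for attempt in range(max_retries):
        try:
            response, new_history = model.chat(tokenizer, prompt, history=history)
            return response, new_history
        except Exception as e:
            print(f"Request failed, attempt {attempt + 1}/{max_retries}: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)  # wait before retrying
    # All attempts failed: return a fallback message and the unchanged history
    return "Request failed, please try again later", history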