Easy-to-Learn Deep Learning Basics: Commands and Steps (7) Natural Language Processing


minyshi · Published 2025-03-10 14:23:27

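Natural Language Processing

Use a tokenizer to prepare text for a neural network.
See how embeddings capture numerical features of text data.

Tokenization

Convert text into numerical tokens by loading BERT's tokenizer:

(1) Use torch.hub.load to load BERT's tokenizer from Hugging Face's model hub. The specific model is bert-base-cased, which is case-sensitive.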

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

BERT's tokenizer can encode a pair of texts in a single call:

text_1 = "I understand equations, both the simple and quadratical."
text_2 = "What kind of equations do I understand?"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
# Encode text_1 and text_2 with the loaded tokenizer, adding the special tokens.
# The result is a list of integer IDs, one per token, indexing into the model's vocabulary.
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)
indexed_tokens

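What tokenizer.encode does:

It splits words into pieces (for example, "quadratical" → "q", "##uad", "##ratic", "##al").
It maps each piece to its numeric ID in the vocabulary.
It adds markers: [CLS] at the start, [SEP] between the two sentences, and another [SEP] at the end.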
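The three steps can be imitated without downloading anything. The sketch below is only an illustration: TOY_VOCAB and its IDs are made up (they are not the bert-base-cased vocabulary), the input is assumed to be already split into words and punctuation, and the greedy longest-match rule is a simplified stand-in for what the real tokenizer does.

```python
# Toy vocabulary with hypothetical IDs (NOT the real bert-base-cased vocabulary).
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "simple": 3, "and": 4,
             "q": 5, "##uad": 6, "##ratic": 7, "##al": 8, ".": 9}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode_pair(words_a, words_b, vocab):
    """Split both texts into pieces, wrap them in [CLS]/[SEP], and look up the IDs."""
    tokens = ["[CLS]"]
    for word in words_a:
        tokens += wordpiece(word, vocab)
    tokens.append("[SEP]")
    for word in words_b:
        tokens += wordpiece(word, vocab)
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode_pair(["simple", "and", "quadratical", "."], ["simple"], TOY_VOCAB)
print(tokens)
print(ids)
```

With this toy vocabulary, "quadratical" has no entry of its own, so it falls back to the four pieces shown in the text.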

You can use convert_ids_to_tokens to see which tokens were used:

tokenizer.convert_ids_to_tokens(indexed_tokens)

>>>
['[CLS]',
 'I',
 'understand',
 'equations',
 ',',
 'both',
 'the',
 'simple',
 'and',
 'q',
 '##uad',
 '##ratic',
 '##al',
 '.',
 '[SEP]',
 'What',
 'kind',
 'of',
 'equations',
 'do',
 'I',
 'understand',
 '?',
 '[SEP]']

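The sentences contain more tokens than words. The index list is longer than the original input for two reasons:

The tokenizer adds special_tokens to mark the start of the sequence ([CLS]) and the separation between sentences ([SEP]).

The tokenizer can break a single word into several pieces.

Many languages have word roots, the components from which words are built. Rather than using linguistically defined roots, BERT uses a WordPiece model to learn how to split words into pieces.

You can decode the encoded text directly with tokenizer.decode. Note that the special_tokens have been added.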

tokenizer.decode(indexed_tokens)

>>> [CLS] I understand equations, both the simple and quadratical. [SEP] What kind of equations do I understand? [SEP]

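Text Segmentation

To make predictions with the BERT model, we also need a list of segment_ids. This is a vector the same length as our token list that indicates which segment (sentence) each token belongs to.

Understanding sentence relationships: which tokens belong to the question and which to the answer.
Feeding the attention mechanism: segment labels enter the model as token_type_embeddings added to each token's embedding, so they influence the attention computation.
Handling multi-sentence input: segment labels make sentence boundaries explicit to the model.

Because our tokenizer added special_tokens, we can use them to find the segments.

(1) Define which index corresponds to which special token.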

cls_token = 101
sep_token = 102

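(2) Write a for loop that starts with segment_id set to 0 and increments it every time a [SEP] token is encountered. The function returns segment_ids and indexed_tokens as tensors, because we will feed both to the model later.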

def get_segment_ids(indexed_tokens):
    segment_ids = []
    segment_id = 0
    for token in indexed_tokens:
        if token == sep_token:
            segment_id += 1
        segment_ids.append(segment_id)
    segment_ids[-1] -= 1  # Last [SEP] is ignored
    return torch.tensor([segment_ids]), torch.tensor([indexed_tokens])

segments_tensors, tokens_tensor = get_segment_ids(indexed_tokens)
segments_tensors

# Output below: tokens from the same sentence share a label; 0 marks the first sentence and 1 the second
>>> tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
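The loop above counts the first [SEP] as part of segment 1. As far as I can tell, the token_type_ids that Hugging Face's BERT tokenizers produce themselves count each [SEP] with the segment it closes, so the two labelings differ at exactly one position. The plain-list sketch below compares both rules on a made-up ID sequence; the function names and toy_ids are hypothetical, and only 101 and 102 come from the text.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs used in the text above


def segment_ids_sep_opens(ids):
    """Same rule as get_segment_ids above: a [SEP] is labeled with the NEXT segment."""
    out, seg = [], 0
    for token in ids:
        if token == SEP_ID:
            seg += 1
        out.append(seg)
    out[-1] -= 1  # the final [SEP] does not open a new segment
    return out


def segment_ids_sep_closes(ids):
    """Alternative rule: a [SEP] is labeled with the segment it closes."""
    out, seg = [], 0
    for token in ids:
        out.append(seg)
        if token == SEP_ID:
            seg += 1
    return out


# Hypothetical IDs standing for "[CLS] a b [SEP] c [SEP]"
toy_ids = [CLS_ID, 7, 8, SEP_ID, 9, SEP_ID]
print(segment_ids_sep_opens(toy_ids))
print(segment_ids_sep_closes(toy_ids))
```

Either list has the same length as the input, which is the property the model requires; which convention matches the model's pretraining is worth checking against the tokenizer's own output.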

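Text Masking

To train word embeddings, BERT masks out a word in a sequence of words. The mask has its own special token, [MASK].

BERT learns by blanking out some words and filling them in from context:

Which word the masked token is likely to be
How neighbouring words relate to each other
What a word means in different contexts

(1) Set the mask
Choose the mask position → convert to a tensor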

tokenizer.mask_token  # the mask symbol (usually [MASK])
>>> [MASK]

tokenizer.mask_token_id  # its numeric ID (e.g. 103)
>>> 103

masked_index = 5  # mask the 6th token (indices start at 0)

# Replace the 6th token with [MASK]
indexed_tokens[masked_index] = tokenizer.mask_token_id

# Convert the list of IDs to a tensor.
# The extra pair of brackets [...] adds a batch dimension.
tokens_tensor = torch.tensor([indexed_tokens])

# Decode the IDs back to text to confirm the mask is in the right place
tokenizer.decode(indexed_tokens)

>>> [CLS] I understand equations, [MASK] the simple and quadratical. [SEP] What kind of equations do I understand? [SEP]
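The assignment above overwrites indexed_tokens in place, so the original ID at that position is lost. If you want to compare the prediction with the true token later, one option is a small helper that masks a copy and returns the original ID. This is a sketch only: mask_position is a hypothetical helper, the sample IDs are made up, and 103 is the mask ID reported above.

```python
def mask_position(token_ids, index, mask_id=103):
    """Return (masked copy, original ID at index) without modifying token_ids."""
    if not 0 <= index < len(token_ids):
        raise IndexError("mask index out of range")
    masked = list(token_ids)   # copy, so the caller's list stays intact
    original_id = masked[index]
    masked[index] = mask_id
    return masked, original_id


# Hypothetical IDs standing for "[CLS] w1 w2 w3 [SEP]"
sample_ids = [101, 11, 22, 33, 102]
masked_ids, answer_id = mask_position(sample_ids, 2)
print(masked_ids, answer_id)
```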

(2) Load the model

Load the model used to predict the missing word: modelForMaskedLM (the pretrained BERT model, cased version).

masked_lm_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForMaskedLM', 'bert-base-cased')

masked_lm_model
>>>
BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (transform_act_fn): GELUActivation()
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=28996, bias=True)
    )
  )
)

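The model is a BertForMaskedLM: it does masked language modeling, that is, it predicts masked-out words. It has two main parts, bert and cls.

The bert part is a BertModel containing embeddings and encoder.
- embeddings
  - three embedding layers:
    - word_embeddings (the embedding of the token itself)
    - position_embeddings (position information)
    - token_type_embeddings (segment type)
  - LayerNorm and dropout, for normalization and to reduce overfitting
- encoder, made of 12 BertLayer blocks, each with attention, intermediate, and output
  - attention is split into self-attention and output processing
    - in self-attention, the query, key, and value linear layers produce the three vectors used by the attention mechanism
  - intermediate expands the 768-dimensional vector to 3072 dimensions and applies the GELU activation
  - output projects back down to 768 dimensions and applies dropout and LayerNorm

The cls part is a BertOnlyMLMHead whose predictions module makes the final prediction.
- transform passes the 768-dimensional vector through a linear layer, an activation function, and then LayerNorm
- decoder is a linear layer that maps to the vocabulary size, 28996, giving a score for every possible token at each position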
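To get a feel for the size of these layers, the shapes in the printout can be turned into parameter counts with plain arithmetic. The sketch below assumes the usual PyTorch layout (a Linear layer stores an in × out weight plus a bias, a LayerNorm stores a scale and a shift vector) and covers only the embeddings and the encoder; the cls head is left out.

```python
# Sizes read off the printed model above.
HIDDEN, FFN, VOCAB, POSITIONS, TYPES, LAYERS = 768, 3072, 28996, 512, 2, 12


def linear(n_in, n_out):
    """Parameters of a Linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm with elementwise_affine=True: scale plus shift."""
    return 2 * n


# word, position and token-type tables, followed by one LayerNorm
embeddings = (VOCAB + POSITIONS + TYPES) * HIDDEN + layer_norm(HIDDEN)

# one BertLayer: query/key/value/output projections, then the feed-forward pair
per_layer = (
    4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    + linear(HIDDEN, FFN)
    + linear(FFN, HIDDEN) + layer_norm(HIDDEN)
)
encoder = LAYERS * per_layer

print(f"embeddings: {embeddings:,}")
print(f"per layer:  {per_layer:,}")
print(f"encoder:    {encoder:,}")
```

Most of the embedding parameters sit in the word table (28996 × 768), while each of the 12 encoder layers contributes roughly 7 million parameters.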

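Look at the word embeddings BERT has learned for each token

# masked_lm_model: the BERT fill-in-the-blank model loaded above
# .bert: the core of the model
# .embeddings: the embedding module
# .word_embeddings: the token → vector lookup table
# .parameters(): returns an iterator over the module's parameters
# next(...): takes the first (and only) parameter, the weight matrix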
embedding_table = next(masked_lm_model.bert.embeddings.word_embeddings.parameters())
embedding_table

>>> Parameter containing:
tensor([[-0.0005, -0.0416,  0.0131,  ..., -0.0039, -0.0335,  0.0150],
        [ 0.0169, -0.0311,  0.0042,  ..., -0.0147, -0.0356, -0.0036],
        [-0.0006, -0.0267,  0.0080,  ..., -0.0100, -0.0331, -0.0165],
        ...,
        [-0.0064,  0.0166, -0.0204,  ..., -0.0418, -0.0492,  0.0042],
        [-0.0048, -0.0027, -0.0290,  ..., -0.0512,  0.0045, -0.0118],
        [ 0.0313, -0.0297, -0.0230,  ..., -0.0145, -0.0525,  0.0284]],
       requires_grad=True)

Each of the 28996 tokens in BERT's vocabulary has an embedding of size 768

embedding_table.shape
>>> torch.Size([28996, 768])

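embedding_table is a large matrix of shape (28996 rows, 768 columns):

28996 rows → the number of tokens in BERT's vocabulary (close to 30,000)
768 columns → the length of each token's vector (each token is described by 768 features)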
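Conceptually, the embedding layer is a row lookup: token ID i selects row i of the table. The sketch below shows this with a made-up 4 × 3 table in place of the real 28996 × 768 matrix; toy_table, its values, and the embed helper are all hypothetical.

```python
# Made-up table: 4 "tokens", 3 features each (the real one is 28996 x 768).
toy_table = [
    [0.0, 0.1, 0.2],  # row for token ID 0
    [1.0, 1.1, 1.2],  # row for token ID 1
    [2.0, 2.1, 2.2],  # row for token ID 2
    [3.0, 3.1, 3.2],  # row for token ID 3
]


def embed(token_ids, table):
    """Look up one row per token ID; the result has shape (len(token_ids), features)."""
    return [table[token_id] for token_id in token_ids]


vectors = embed([2, 0, 2], toy_table)
print(len(vectors), len(vectors[0]))  # sequence length, feature size
print(vectors)
```

The same ID always maps to the same row, which is why the two occurrences of ID 2 get identical vectors at this stage; context only enters in the later layers.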

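Let's test whether the model can correctly predict the missing word in the sentence we provided.
Feed the numeric sentence tokens_tensor and the segment labels segments_tensors to BERT together.
Inside masked_lm_model: the embedding layer turns IDs into vectors → 12 Transformer layers analyse the context → the head scores every candidate token at the [MASK] position.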

with torch.no_grad():  # do not compute gradients
    predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)
predictions

>>> MaskedLMOutput(loss=None, logits=tensor([[[ -7.3832,  -7.2504,  -7.4539,  ...,  -6.0597,  -5.7928,  -6.2133],
         [ -6.7681,  -6.7896,  -6.8316,  ...,  -5.4655,  -5.4048,  -6.0682],
         [ -7.7323,  -7.9597,  -7.7348,  ...,  -5.7611,  -5.3566,  -4.3361],
         ...,
         [ -6.1213,  -6.3311,  -6.4144,  ...,  -5.8884,  -4.1157,  -3.1189],
         [-12.3216, -12.4479, -11.9787,  ..., -10.6539,  -8.7396, -11.0487],
         [-13.4115, -13.7876, -13.5183,  ..., -10.6359, -11.6582, -10.9009]]]), hidden_states=None, attentions=None)

predictions[0].shape
>>> torch.Size([1, 24, 28996])

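predictions[0] (the logits) is a tensor of shape (batch_size, sequence length, vocabulary size). Here [1, 24, 28996] means:

1 input sequence was processed
the sequence has 24 tokens (including special tokens)
28996: at each position every token in the vocabulary gets a score, and the highest-scoring token is the prediction

To find the highest-scoring token over the whole vocabulary, we can use torch.argmax.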

# Get the predicted token
predicted_index = torch.argmax(predictions[0][0, masked_index]).item()
predicted_index

>>> 1241

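The highest-scoring token has index 1241.
Let's see which token 1241 corresponds to.
tokenizer.convert_ids_to_tokens([predicted_index]) converts a list of IDs into a list of tokens, so we take its first element.

predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]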
predicted_token

>>> 'both'
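The scores returned by the model are logits, not probabilities. If you want probabilities, or the runner-up candidates rather than only the winner, a softmax followed by a top-k selection is one way to get them. The sketch below uses a made-up five-entry score list in place of the 28996 real scores; softmax and top_k are illustrative helpers, not part of the tokenizer or model API.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return the k highest-scoring (index, probability) pairs, best first."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


# Hypothetical scores for a five-token vocabulary at the masked position.
toy_logits = [-7.4, 2.1, 0.3, 5.0, -1.2]
for index, prob in top_k(toy_logits, 2):
    print(index, round(prob, 4))
```

Softmax preserves the ordering of the scores, so the argmax of the logits and the argmax of the probabilities pick the same token.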

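Question Answering

BERT can solve more complex problems, such as sentence prediction. It does so by building on the attention-based Transformer architecture.
We will use a different version of BERT and load the tokenizer for the question-answering task, 'bert-large-uncased-whole-word-masking-finetuned-squad'.
Workflow: load the model → encode the text, adding special tokens (such as [CLS] and [SEP]) to separate the sentences → set the segment IDs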

cls_token = 101
sep_token = 102

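# get_segment_ids: builds the segment IDs and the input tensors. Starting from segment 0, it walks over the tokens, moves to the next segment each time it meets [SEP] (ID=102), and records a segment ID for every token.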
def get_segment_ids(indexed_tokens):
    segment_ids = []
    segment_id = 0
    for token in indexed_tokens:
        if token == sep_token:
            segment_id += 1
        segment_ids.append(segment_id)
    segment_ids[-1] -= 1  # Last [SEP] is ignored
    return torch.tensor([segment_ids]), torch.tensor([indexed_tokens])
    
text_1 = "I understand equations, both the simple and quadratical."
text_2 = "What kind of equations do I understand?"

# Load the tokenizer of the BERT-large model fine-tuned for question answering
question_answering_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-large-uncased-whole-word-masking-finetuned-squad')

# add_special_tokens=True adds the special tokens automatically: [CLS] at the start, [SEP] between context and question
# question_answering_tokenizer.encode tokenizes the text and converts it to a sequence of IDs
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)

# Build the segment IDs and the input tensors with get_segment_ids (defined above)
segments_tensors, tokens_tensor = get_segment_ids(indexed_tokens)
segments_tensors, tokens_tensor

>>> (tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 tensor([[  101,  1045,  3305, 11380,  1010,  2119,  1996,  3722,  1998, 17718,
          23671,  2389,  1012,   102,  2054,  2785,  1997, 11380,  2079,  1045,
           3305,  1029,   102]]))

Load question_answering_model
The question-answering model scans our input sequence to find the subsequence that best answers the question.

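# Load the matching question-answering model (the 'modelForQuestionAnswering' entry point of the same hub repo)
question_answering_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-large-uncased-whole-word-masking-finetuned-squad')

# Predict the start and end positions logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)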
out

>>> QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-5.5943, -4.2960, -5.2682, -1.2511, -6.8350, -0.3992,  2.2274,  2.4654,
         -6.6066,  2.5014, -4.4613, -4.8040, -7.8383, -5.5944, -4.7833, -6.9730,
         -7.1477, -5.2967, -7.4825, -6.7737, -6.8806, -8.6612, -5.5944]]), end_logits=tensor([[-0.7409, -5.3478, -4.2317, -0.0275, -2.6293, -5.9589, -2.8828,  2.7770,
         -4.8512, -2.2092, -2.2413,  4.4412, -0.7181, -0.7411, -3.8988, -5.3865,
         -5.0452, -4.4974, -6.3098, -5.5937, -5.5562, -5.3034, -0.7412]]), hidden_states=None, attentions=None)

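The highest start_logits score is at index 9 (value 2.5014)
The highest end_logits score is at index 11 (value 4.4412)

start_logits has shape [1, 23] → one score (logit) per position for being the start of the answer
23 is the total number of tokens in the input sequence (including special tokens)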

out.start_logits

>>> tensor([[-5.5943, -4.2960, -5.2682, -1.2511, -6.8350, -0.3992,  2.2274,  2.4654,
         -6.6066,  2.5014, -4.4613, -4.8040, -7.8383, -5.5944, -4.7833, -6.9730,
         -7.1477, -5.2967, -7.4825, -6.7737, -6.8806, -8.6612, -5.5944]])

The higher a value in end_logits, the more likely the answer ends on that token.

out.end_logits

>>> tensor([[-0.7409, -5.3478, -4.2317, -0.0275, -2.6293, -5.9589, -2.8828,  2.7770,
         -4.8512, -2.2092, -2.2413,  4.4412, -0.7181, -0.7411, -3.8988, -5.3865,
         -5.0452, -4.4974, -6.3098, -5.5937, -5.5562, -5.3034, -0.7412]])

Use torch.argmax to find the answer_sequence running from the start position to the end position:

answer_sequence = indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1]
answer_sequence

>>> [17718, 23671, 2389]
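Taking the two argmax positions independently works here, but nothing guarantees that the best end comes after the best start. A common safeguard is to score every (start, end) pair with start ≤ end and keep the best one. The sketch below does this in plain Python on the logits and token IDs printed above; best_span and the max_len limit of 15 are illustrative choices, not something the model prescribes.

```python
# Values copied from the output shown above.
start_logits = [-5.5943, -4.2960, -5.2682, -1.2511, -6.8350, -0.3992, 2.2274, 2.4654,
                -6.6066, 2.5014, -4.4613, -4.8040, -7.8383, -5.5944, -4.7833, -6.9730,
                -7.1477, -5.2967, -7.4825, -6.7737, -6.8806, -8.6612, -5.5944]
end_logits = [-0.7409, -5.3478, -4.2317, -0.0275, -2.6293, -5.9589, -2.8828, 2.7770,
              -4.8512, -2.2092, -2.2413, 4.4412, -0.7181, -0.7411, -3.8988, -5.3865,
              -5.0452, -4.4974, -6.3098, -5.5937, -5.5562, -5.3034, -0.7412]
token_ids = [101, 1045, 3305, 11380, 1010, 2119, 1996, 3722, 1998, 17718,
             23671, 2389, 1012, 102, 2054, 2785, 1997, 11380, 2079, 1045,
             3305, 1029, 102]


def best_span(start, end, max_len=15):
    """Return the (start, end) pair with the highest summed score and start <= end."""
    best = None
    for i, start_score in enumerate(start):
        for j in range(i, min(i + max_len, len(end))):
            score = start_score + end[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


span_start, span_end = best_span(start_logits, end_logits)
print(span_start, span_end)
print(token_ids[span_start:span_end + 1])
```

On this input the pairwise search agrees with the two independent argmax calls: positions 9 to 11, the same three IDs as above.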

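Decode these tokens

answer_sequence holds the token IDs at the positions found above
convert_ids_to_tokens() turns the IDs back into words or sub-word pieces
decode automatically merges sub-word pieces into whole words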

question_answering_tokenizer.convert_ids_to_tokens(answer_sequence)
>>> ['quad', '##ratic', '##al']

question_answering_tokenizer.decode(answer_sequence)
>>> 'quadratical'
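The merging that decode performs on sub-word pieces can be approximated in a few lines: a piece that starts with ## is glued onto the previous token, and everything else starts a new word. The sketch below covers only that rule; merge_wordpieces is a hypothetical helper, and the real decode also handles special tokens and the spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words by attaching '##' continuation pieces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the ## marker and extend the last word
        else:
            words.append(token)
    return " ".join(words)


print(merge_wordpieces(["quad", "##ratic", "##al"]))
print(merge_wordpieces(["simple", "and", "quad", "##ratic", "##al"]))
```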