转载-BERT源码详解（一）——HuggingFace Transformers源码解

众所周知，BERT模型自2018年问世起就各种屠榜，开启了NLP领域预训练+微调的范式。到现在，BERT的相关衍生模型层出不穷（XL-Net、RoBERTa、ALBERT、ELECTRA、ERNIE等），要理解它们可以先从BERT这个始祖入手。HuggingFace是一家总部位于纽约的聊天机器人初创服务商，很早就捕捉到BERT大潮流的信号并着手实现基于pytorch的BERT模型。这一项目最初名为

lucas-nlp

1751人浏览 · 2022-01-02 22:21:59

lucas-nlp · 2022-01-02 22:21:59 发布

众所周知，BERT模型自2018年问世起就各种屠榜，开启了NLP领域预训练+微调的范式。到现在，BERT的相关衍生模型层出不穷（XL-Net、RoBERTa、ALBERT、ELECTRA、ERNIE等），要理解它们可以先从BERT这个始祖入手。

HuggingFace是一家总部位于纽约的聊天机器人初创服务商，很早就捕捉到BERT大潮流的信号并着手实现基于pytorch的BERT模型。这一项目最初名为pytorch-pretrained-bert，在复现了原始效果的同时，提供了易用的方法以方便在这一强大模型的基础上进行各种玩耍和研究。

随着使用人数的增加，这一项目也发展成为一个较大的开源社区，合并了各种预训练语言模型以及增加了Tensorflow的实现，并且在2019年下半年改名为Transformers。截止写文章时（2021年3月30日）这一项目已经拥有43k+的star，可以说Transformers已经成为事实上的NLP基本工具。

https://github.com/huggingface/transformersgithub.com/huggingface/transformers

本文基于Transformers版本4.4.2（2021年3月19日发布）项目中，pytorch版的BERT相关代码，从代码结构、具体实现与原理，以及使用的角度进行分析，包含以下内容：

BERT Tokenization分词模型（BertTokenizer）
BERT Model本体模型（BertModel）
1. BertEmbeddings
2. BertEncoder
  1. BertLayer
    1. BertAttention
      1. BertSelfAttention
      2. BertSelfOutput
    2. BertIntermediate
    3. BertOutput
  2. BertPooler
BERT-based Models应用模型（请看下篇）
1. BertForPreTraining
2. BertForSequenceClassification
3. BertForMultiChoice
4. BertForTokenClassification
5. BertForQuestionAnswering
BERT训练与优化（请看下篇）
1. Pre-Training
2. Fine-Tuning
  1. AdamW
  2. Warmup

下篇传送门：

Riroaki：BERT源码详解（二）——HuggingFace Transformers最新版本源码解读134 赞同 · 25 评论文章正在上传…重新上传取消

1 Tokenization（BertTokenizer）

和BERT有关的Tokenizer主要写在/models/bert/tokenization_bert.py和/models/bert/tokenization_bert_fast.py 中。

这两份代码分别对应基本的BertTokenizer，以及不进行token到index映射的BertTokenizerFast，这里主要讲解第一个。

class BertTokenizer(PreTrainedTokenizer):
"""
Construct a BERT tokenizer. Based on WordPiece.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
...
"""

BertTokenizer 是基于BasicTokenizer和WordPieceTokenizer 的分词器：

BasicTokenizer负责处理的第一步——按标点、空格等分割句子，并处理是否统一小写，以及清理非法字符。
- 对于中文字符，通过预处理（加空格）来按字分割；
- 同时可以通过never_split指定对某些词不进行分割；
- 这一步是可选的（默认执行）。
WordPieceTokenizer在词的基础上，进一步将词分解为子词（subword）。
- subword介于char和word之间，既在一定程度保留了词的含义，又能够照顾到英文中单复数、时态导致的词表爆炸和未登录词的OOV（Out-Of-Vocabulary）问题，将词根与时态词缀等分割出来，从而减小词表，也降低了训练难度；
- 例如，tokenizer这个词就可以拆解为“token”和“##izer”两部分，注意后面一个词的“##”表示接在前一个词后面。

BertTokenizer 有以下常用方法：

from_pretrained：从包含词表文件（vocab.txt）的目录中初始化一个分词器；
tokenize：将文本（词或者句子）分解为子词列表；
convert_tokens_to_ids：将子词列表转化为子词对应下标的列表；
convert_ids_to_tokens ：与上一个相反；
convert_tokens_to_string：将subword列表按“##”拼接回词或者句子；
encode：对于单个句子输入，分解词并加入特殊词形成“[CLS], x, [SEP]”的结构并转换为词表对应下标的列表；对于两个句子输入（多个句子只取前两个），分解词并加入特殊词形成“[CLS], x1, [SEP], x2, [SEP]”的结构并转换为下标列表；
decode：可以将encode方法的输出变为完整句子。

以及，类自身的方法：

>>> from transformers import BertTokenizer
>>> bt = BertTokenizer.from_pretrained('./bert-base-uncased/')
>>> bt('I like natural language progressing!')
{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

2 Model（BertModel）

和BERT模型有关的代码主要写在/models/bert/modeling_bert.py中，这一份代码有一千多行，包含BERT模型的基本结构和基于它的微调模型等。

下面从BERT模型本体入手分析：

class BertModel(BertPreTrainedModel):
"""
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in `Attention is
all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration
set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`
argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
input to the forward pass.
"""

BertModel主要为transformer encoder结构，包含三个部分：

embeddings，即BertEmbeddings类的实体，对应词嵌入；
encoder，即BertEncoder类的实体；
pooler，即BertPooler类的实体，这一部分是可选的。

补充：注意BertModel也可以配置为Decoder，不过下文中不包含对这一部分的讨论。

下面将介绍BertModel的前向传播过程中各个参数的含义以及返回值：

def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_values=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
): ...

input_ids：经过tokenizer分词后的subword对应的下标列表；
attention_mask：在self-attention过程中，这一块mask用于标记subword所处句子和padding的区别，将padding部分填充为0；
token_type_ids：标记subword当前所处句子（第一句/第二句/padding）；
position_ids：标记当前词所在句子的位置下标；
head_mask：用于将某些层的某些注意力计算无效化；
inputs_embeds：如果提供了，那就不需要input_ids，跨过embedding lookup过程直接作为Embedding进入Encoder计算；
encoder_hidden_states：这一部分在BertModel配置为decoder时起作用，将执行cross-attention而不是self-attention；
encoder_attention_mask：同上，在cross-attention中用于标记encoder端输入的padding；
past_key_values：这个参数貌似是把预先计算好的K-V乘积传入，以降低cross-attention的开销（因为原本这部分是重复计算）；
use_cache：将保存上一个参数并传回，加速decoding；
output_attentions：是否返回中间每层的attention输出；
output_hidden_states：是否返回中间每层的输出；
return_dict：是否按键值对的形式（ModelOutput类，也可以当作tuple用）返回输出，默认为真。

补充：注意，这里的head_mask对注意力计算的无效化，和下文提到的注意力头剪枝不同，而仅仅把某些注意力的计算结果给乘以这一系数。

返回部分如下：

# BertModel的前向传播返回部分
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPoolingAndCrossAttentions(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
past_key_values=encoder_outputs.past_key_values,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
cross_attentions=encoder_outputs.cross_attentions,
)

可以看出，返回值不但包含了encoder和pooler的输出，也包含了其他指定输出的部分（hidden_states和attention等，这一部分在encoder_outputs[1:]）方便取用：

# BertEncoder的前向传播返回部分，即上面的encoder_outputs
if not return_dict:
return tuple(
v
for v in [
hidden_states,
next_decoder_cache,
all_hidden_states,
all_self_attentions,
all_cross_attentions,
]
if v is not None
)
return BaseModelOutputWithPastAndCrossAttentions(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
cross_attentions=all_cross_attentions,
)

此外，BertModel还有以下的方法，方便BERT玩家进行各种骚操作：

get_input_embeddings：提取embedding中的word_embeddings即词向量部分；
set_input_embeddings：为embedding中的word_embeddings赋值；
_prune_heads：提供了将注意力头剪枝的函数，输入为{layer_num: list of heads to prune in this layer}的字典，可以将指定层的某些注意力头剪枝。

补充：剪枝是一个复杂的操作，需要将保留的注意力头部分的Wq、Kq、Vq和拼接后全连接部分的权重拷贝到一个新的较小的权重矩阵（注意先禁止grad再拷贝），并实时记录被剪掉的头以防下标出错。具体参考BertAttention部分的prune_heads方法。

2.1 BertEmbeddings

包含三个部分求和得到：

word_embeddings，上文中subword对应的嵌入。
token_type_embeddings，用于表示当前词所在的句子，辅助区别句子与padding、句子对间的差异。
position_embeddings，句子中每个词的位置嵌入，用于区别词的顺序。和transformer论文中的设计不同，这一块是训练出来的，而不是通过Sinusoidal函数计算得到的固定嵌入。一般认为这种实现不利于拓展性（难以直接迁移到更长的句子中）。

三个embedding不带权重相加，并通过一层LayerNorm+dropout后输出，其大小为(batch_size, sequence_length, hidden_size)。

补充：这里为什么要用LayerNorm+Dropout呢？为什么要用LayerNorm而不是BatchNorm？可以参考一个不错的回答：

transformer 为什么使用 layer normalization，而不是其他的归一化方法？202 赞同 · 12 评论回答正在上传…重新上传取消

2.2 BertEncoder

包含多层BertLayer，这一块本身没有特别需要说明的地方，不过有一个细节值得参考：

利用gradient checkpointing技术以降低训练时的显存占用。

补充：gradient checkpointing即梯度检查点，通过减少保存的计算图节点压缩模型占用空间，但是在计算梯度的时候需要重新计算没有存储的值，参考论文《Training Deep Nets with Sublinear Memory Cost》，过程如下示意图：

在BertEncoder中，gradient checkpoint是通过torch.utils.checkpoint.checkpoint实现的，使用起来比较方便，可以参考文档：

torch.utils.checkpoint - PyTorch 1.8.1 documentationpytorch.org/docs/stable/checkpoint.html

这一机制的具体实现比较复杂（没看懂），在此不作展开。

再往深一层走，就进入了Encoder的某一层：

2.2.1 BertLayer

这一层包装了BertAttention和BertIntermediate+BertOutput（即Attention后的FFN部分），以及这里直接忽略的cross-attention部分（将BERT作为Decoder时涉及的部分）。

理论上，这里顺序调用三个子模块就可以，没有什么值得说明的地方。

然而这里又出现了一个细节：

# 这是forward的一部分
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
past_key_value=self_attn_past_key_value,
)
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
# 中间省略一部分……
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
outputs = (layer_output,) + outputs
# 省略一部分……
return outputs
# 这是feed_forward_chunk的部分
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output

看到上面那个apply_chunking_to_forward和feed_forward_chunk了吗（为什么要整这么复杂，直接调用它不香吗）？

那么这个apply_chunking_to_forward到底是啥？深入看看……

def apply_chunking_to_forward(
forward_fn: Callable[..., torch.Tensor], chunk_size: int, chunk_dim: int, *input_tensors
) -> torch.Tensor:
"""
This function chunks the :obj:`input_tensors` into smaller input tensor parts of size :obj:`chunk_size` over the
dimension :obj:`chunk_dim`. It then applies a layer :obj:`forward_fn` to each chunk independently to save memory.
If the :obj:`forward_fn` is independent across the :obj:`chunk_dim` this function will yield the same result as
directly applying :obj:`forward_fn` to :obj:`input_tensors`.
...
"""

哦，又是一个节约显存的技术——包装了一个切分小batch或者低维数操作的功能：这里参数chunk_size其实就是切分的batch大小，而chunk_dim就是一次计算维数的大小，最后拼接起来返回。

不过，在默认操作中不会特意设置这两个值（在源代码中默认为0和1），所以会直接等效于正常的forward过程。

继续往下深入，就是Transformer的核心：BertAttention部分，以及紧随其后的FFN部分。

2.2.1.1 BertAttention

本以为attention的实现就在这里，没想到还要再下一层……其中，self成员就是多头注意力的实现，而output成员实现attention后的全连接+dropout+residual+LayerNorm一系列操作。

class BertAttention(nn.Module):
def __init__(self, config):
super().__init__()
self.self = BertSelfAttention(config)
self.output = BertSelfOutput(config)
self.pruned_heads = set()

首先还是回到这一层。这里出现了上文提到的剪枝操作，即prune_heads方法：

def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)

这里的具体实现概括如下：

find_pruneable_heads_and_indices是定位需要剪掉的head，以及需要保留的维度下标index；
prune_linear_layer则负责将Wk/Wq/Wv权重矩阵（连同bias）中按照index保留没有被剪枝的维度后转移到新的矩阵。

接下来就到重头戏——Self-Attention的具体实现。

2.2.1.1.1 BertSelfAttention

预警：这一块可以说是模型的核心区域，也是唯一涉及到公式的地方，所以将贴出大量代码。

初始化部分：

class BertSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (config.hidden_size, config.num_attention_heads)
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
self.max_position_embeddings = config.max_position_embeddings
self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
self.is_decoder = config.is_decoder

除掉熟悉的query、key、value三个权重和一个dropout，这里还有一个谜一样的position_embedding_type，以及decoder标记（当然，我不打算介绍cross-attenton部分）；
注意，hidden_size和all_head_size在一开始是一样的。至于为什么要看起来多此一举地设置这一个变量——显然是因为上面那个剪枝函数，剪掉几个attention head以后all_head_size自然就小了；
hidden_size必须是num_attention_heads的整数倍，以bert-base为例，每个attention包含12个head，hidden_size是768，所以每个head大小即attention_head_size=768/12=64；
position_embedding_type是什么？继续往下看就知道了……

然后是重点，也就是前向传播过程。

首先回顾一下multi-head self-attention的基本公式：

其中表示注意力头的个数，表示向量拼接，。

而这些注意力头，众所周知是并行计算的，所以上面的query、key、value三个权重是唯一的——这并不是所有heads共享了权重，而是“拼接”起来了。

补充：原论文中多头的理由为Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.而另一个比较靠谱的分析有：

看看forward方法：

def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
):
mixed_query_layer = self.query(hidden_states)
# 省略一部分cross-attention的计算
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
query_layer = self.transpose_for_scores(mixed_query_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
# ...

这里的transpose_for_scores用来把hidden_size拆成多个头输出的形状，并且将中间两维转置以进行矩阵相乘；
这里key_layer/value_layer/query_layer的形状为：(batch_size, num_attention_heads, sequence_length, attention_head_size)；
这里attention_scores的形状为：(batch_size, num_attention_heads, sequence_length, sequence_length)，符合多个头单独计算获得的attention map形状。

到这里实现了K与Q相乘，获得raw attention scores的部分，按公式接下来应该是按dk进行scaling并做softmax的操作。然而——

先出现在眼前的是一个奇怪的positional_embedding，以及一堆爱因斯坦求和：

# ...
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
seq_length = hidden_states.size()[1]
position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
distance = position_ids_l - position_ids_r
positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility
if self.position_embedding_type == "relative_key":
relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores
elif self.position_embedding_type == "relative_key_query":
relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
# ...

补充：关于爱因斯坦求和约定，参考以下文档

torch.einsum - PyTorch 1.8.1 documentationpytorch.org/docs/stable/generated/torch.einsum.html

补充：这里的positional_embedding引入了attention map中的位置嵌入——为什么要这么做呢？我目前还没搞明白……

对于不同的positional_embedding_type，有三种操作：

absolute：默认值，这部分就不用处理；
relative_key：对key_layer作处理，将其与这里的positional_embedding和key矩阵相乘作为key相关的位置编码；
relative_key_query：对key和value都进行相乘以作为位置编码。

暂时跳过这一迷惑的部分，回到正常attention的流程：

# ...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask # 这里为什么是+而不是*？
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
# 省略decoder返回值部分……
return outputs

重大疑问：这里的attention_scores = attention_scores + attention_mask是在做什么？难道不应该是乘mask吗？

因为这里的attention_mask已经【被动过手脚】，将原本为1的部分变为0，而原本为0的部分（即padding）变为一个较大的负数，这样相加就得到了一个较大的负值：

至于为什么要用【一个较大的负数】？因为这样一来经过softmax操作以后这一项就会变成接近0的小数。

(Pdb) attention_mask
tensor([[[[ -0., -0., -0., ..., -10000., -10000., -10000.]]],
[[[ -0., -0., -0., ..., -10000., -10000., -10000.]]],
[[[ -0., -0., -0., ..., -10000., -10000., -10000.]]],
...,
[[[ -0., -0., -0., ..., -10000., -10000., -10000.]]],
[[[ -0., -0., -0., ..., -10000., -10000., -10000.]]],
[[[ -0., -0., -0., ..., -10000., -10000., -10000.]]]],
device='cuda:0')

那么，这一步是在哪里执行的呢？

我在modeling_bert.py中没有找到答案，但是在modeling_utils.py中找到了一个特别的类：class ModuleUtilsMixin，在它的get_extended_attention_mask方法中发现了端倪：

def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple[int], device: device) -> Tensor:
"""
Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
Arguments:
attention_mask (:obj:`torch.Tensor`):
Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
input_shape (:obj:`Tuple[int]`):
The shape of the input to the model.
device: (:obj:`torch.device`):
The device of the input to the model.
Returns:
:obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.
"""
# 省略一部分……
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
return extended_attention_mask

那么，这个函数是在什么时候被调用的呢？和BertModel有什么关系呢？

OK，这里涉及到BertModel的继承细节了：BertModel继承自BertPreTrainedModel ，后者继承自PreTrainedModel，而PreTrainedModel继承自[nn.Module, ModuleUtilsMixin, GenerationMixin] 三个基类。——好复杂的封装！

这也就是说， BertModel必然在中间的某个步骤对原始的attention_mask调用了get_extended_attention_mask ，导致attention_mask从原始的[1, 0]变为[0, -1e4]的取值。

真相只有一个！

最终在BertModel的前向传播过程中找到了这一调用（第944行）：

# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)

问题解决了：这一方法不但实现了改变mask的值，还将其广播（broadcast）为可以直接与attention map相加的形状。

不愧是你，HuggingFace。

抱脸虫可不是乱叫的！

除此之外，值得注意的细节有：

按照每个头的维度进行缩放，对于bert-base就是64的平方根即8；
attention_probs不但做了softmax，还用了一次dropout，这是担心attention矩阵太稠密吗……这里也提到很不寻常，但是原始Transformer论文就是这么做的；
head_mask就是之前提到的对多头计算的mask，如果不设置默认是全1，在这里就不会起作用；
context_layer即attention矩阵与value矩阵的乘积，原始的大小为：(batch_size, num_attention_heads, sequence_length, attention_head_size) ；
context_layer进行转置和view操作以后，形状就恢复了(batch_size, sequence_length, hidden_size)。

OK, that's all for attention.

2.2.1.1.2 BertSelfOutput

这一块操作略多但不复杂，一目了然：

class BertSelfOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states

补充：这里又出现了LayerNorm和Dropout的组合，只不过这里是先Dropout，进行残差连接后再进行LayerNorm。至于为什么要做残差连接，最直接的目的就是降低网络层数过深带来的训练难度，对原始输入更加敏感～

2.2.1.2 BertIntermediate

看完了BertAttention，在Attention后面还有一个全连接+激活的操作：

class BertIntermediate(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states

这里的全连接做了一个扩展，以bert-base为例，扩展维度为3072，是原始维度768的4倍之多；

补充：为什么要过一个FFN？不知道……谷歌最近的论文貌似说明只有attention的模型什么用都没有：

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Deptharxiv.org/abs/2103.03404

这里的激活函数默认实现为gelu（Gaussian Error Linerar Units(GELUS）：；当然，它是无法直接计算的，可以用一个包含tanh的表达式进行近似（略）。

作为参考（图源网络）：

至于为什么在transformer中要用这个激活函数……

补充：看了一些研究，应该是说GeLU比ReLU这些表现都好，以至于后续的语言模型都沿用了这一激活函数。

2.2.1.3 BertOutput

在这里又是一个全连接+dropout+LayerNorm，还有一个残差连接residual connect：

class BertOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states

这里的操作和BertSelfOutput不能说没有关系，只能说一模一样……非常容易混淆的两个组件。

以下内容还包含基于BERT的应用模型，以及BERT相关的优化器和用法，将在下一篇文章作详细介绍。

2.2.3 BertPooler

这一层只是简单地取出了句子的第一个token，即[CLS]对应的向量，然后过一个全连接层和一个激活函数后输出：

（这一部分是可选的，因为pooling有很多不同的操作）

class BertPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output

Takeaways·小结

在HuggingFace实现的Bert模型中，使用了多种节约显存的技术：
- gradient checkpoint，不保留前向传播节点，只在用时计算；
- apply_chunking_to_forward，按多个小批量和低维度计算FFN部分；
BertModel包含复杂的封装和较多的组件。以bert-base为例，主要组件如下：
- 总计Dropout出现了1+(1+1+1)x12=37次；
- 总计LayerNorm出现了1+(1+1)x12=25次；
- 总计dense全连接层出现了(1+1+1)x12+1=37次，并不是每个dense都配了激活函数……

BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
# 如此重复11层，直到大厦崩塌：）
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)

BertModel有极大的参数量。以bert-base为例，其参数量为109M，具体计算过程可以参考：

HiroLin：小白Bert系列-参数计算29 赞同 · 7 评论文章正在上传…重新上传取消

（未完待续）

后记

只看论文未免不接地气，偶尔整点代码分析也是必要的。这么想着，于是我选了最近一直在用的BERT模型进行介绍，细看还是有很多地方自己原先没有考虑过的，在此记录。

2048 AI社区

有“AI”的1024 = 2048，欢迎大家加入2048 AI社区

更多推荐

TIKTOK × Sora 2：AI视频时代的跨境电商内容革命

2048 AI社区

【必学收藏】AI智能体全解析：从基础概念到实际应用场景

2048 AI社区

什么是大模型？一文读懂大模型的基本概念

本文系统梳理了大模型的关键概念与发展历程。首先明确了大模型（参数规模达数十亿至数千亿）与小模型的本质区别在于涌现能力，并区分了大模型、超大模型、大语言模型等相关概念。文章回顾了大模型从传统神经网络到Transformer架构的演进过程，重点分析了GPT系列模型的突破性进展。详细阐释了大模型的八大核心特征，包括巨大规模、涌现能力、多任务学习等，并按输入数据类型和应用领域进行了分类。最后探讨了大模型的