In this article, we discuss how Large Language Models (LLMs) work, from scratch, assuming only that you know how to add and multiply two numbers. The article is meant to be fully self-contained. We start by building a simple generative AI with pen and paper, and then walk through everything we need for a firm understanding of modern LLMs and the Transformer architecture. The article strips out all the fancy language and jargon in ML and represents everything simply as what it is: numbers. We will still call out what things are called, to tether your thoughts when you read jargon-y content.

Going from addition/multiplication to the most advanced AI models today, without assuming other knowledge or referring to other sources, means we cover a LOT of ground. This is NOT a toy LLM explanation — a determined person can theoretically recreate a modern LLM from all the information here. I have cut out every word/line that was unnecessary, and as such this article isn’t really meant to be browsed.

What will we cover?

  1. A simple neural network
  2. How are these models trained?
  3. How does all this generate language?
  4. What makes LLMs work so well?
  5. Embeddings
  6. Sub-word tokenizers
  7. Self-attention
  8. Softmax
  9. Residual connections
  10. Layer Normalization
  11. Dropout
  12. Multi-head attention
  13. Positional embeddings
  14. The GPT architecture
  15. The transformer architecture

Let’s dive in.

The first thing to note is that neural networks can only take numbers as inputs and can only output numbers. No exceptions. The art is in figuring out how to feed your inputs as numbers and how to interpret the output numbers in a way that achieves your goals. And finally, in building neural nets that will take the inputs you provide and give you the outputs you want (given the interpretation you chose for these outputs). Let’s walk through how we get from adding and multiplying numbers to things like Llama 3.1.

A simple neural network:

Let’s work through a simple neural network that can classify an object:

  • Object data available: Dominant color (RGB) & Volume (in milliliters)
  • Classify into: Leaf OR Flower

Here’s what the data for a leaf and a sunflower can look like:


Image by author

Let’s now build a neural net that does this classification. We need to decide on input/output interpretations. Our inputs are already numbers, so we can feed them directly into the network. Our outputs are two objects, leaf and flower, which the neural network cannot output directly. Let’s look at a couple of schemes we can use here:

  • We can make the network output a single number. If the number is positive we say it’s a leaf, and if it is negative we say it’s a flower
  • OR, we can make the network output two numbers. We interpret the first one as the number for leaf and the second one as the number for flower, and we will say that the selection is whichever number is larger

Both schemes allow the network to output number(s) that we can interpret as leaf or flower. Let’s pick the second scheme here because it generalizes well to other things we will look at later. And here’s a neural network that does the classification using this scheme. Let’s work through it:


Image by author

Some jargon:

Neurons/nodes: The numbers in the circles

Weights: The colored numbers on the lines

Layers: A collection of neurons is called a layer. You could think of this network as having 3 layers: an input layer with 4 neurons, a middle layer with 3 neurons, and an output layer with 2 neurons.

To calculate the prediction/output from this network (called a “forward pass”), you start from the left. We have the data available for the neurons in the input layer. To move “forward” to the next layer, you multiply the number in the circle with the weight for the corresponding neuron pairing, and you add them all up. We demonstrate the math for the blue and orange circles above. Running the whole network, we see that the first number in the output layer comes out higher, so we interpret it as “the network classified these (RGB,Vol) values as leaf”. A well trained network can take various inputs for (RGB,Vol) and correctly classify the object.
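To make the arithmetic concrete, here is a minimal sketch of this forward pass in Python. Only the first row of middle-layer weights is real (it is the blue-circle math quoted in the bias note below); every other weight is a made-up placeholder, since the actual values live in the figure.

```python
inputs = [32, 107, 56, 11.2]            # R, G, B, Vol for the leaf

middle_weights = [
    [0.10, -0.29, -0.07, 0.46],         # blue circle: evaluates to about -26.6
    [0.15, -0.41, 0.02, 0.31],          # placeholder
    [-0.22, 0.18, -0.09, 0.27],         # placeholder
]
output_weights = [
    [0.35, -0.12, 0.24],                # placeholder
    [-0.08, 0.29, -0.16],               # placeholder
]

def layer(values, weight_rows):
    # each neuron in the next layer = sum of (incoming value * its weight)
    return [sum(v * w for v, w in zip(values, row)) for row in weight_rows]

middle = layer(inputs, middle_weights)
output = layer(middle, output_weights)
print("leaf" if output[0] > output[1] else "flower")
```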

The model has no notion of what a leaf or a flower is, or what (RGB,Vol) are. It has a job of taking in exactly 4 numbers and giving out exactly 2 numbers. It is our interpretation that the 4 input numbers are (RGB,Vol) and it is also our decision to look at the output numbers and infer that if the first number is larger it’s a leaf and so on. And finally, it is also up to us to choose the right weights such that the model will take our input numbers and give us the right two numbers such that when we interpret them we get the interpretation we want.

An interesting side effect of this is that you can take the same network and, instead of feeding it (RGB,Vol), feed it four other numbers like cloud cover, humidity, etc., and interpret the two outputs as “sunny in an hour” or “rainy in an hour”. Then, if you have the weights well calibrated, you can get the exact same network to do two things at the same time — classify leaf/flower and predict rain in an hour! The network just gives you two numbers; whether you interpret them as classification or prediction or something else is entirely up to you.

Stuff left out for simplification (feel free to ignore without compromising comprehensibility):

  • Activation layer: A critical thing missing from this network is an “activation layer”. That’s a fancy word for saying that we take the number in each circle and apply a nonlinear function to it (RELU is a common function where you take the number and set it to zero if it is negative, and leave it unchanged if it is positive). So basically in our case above, we would take the middle layer and replace the two numbers (-26.6 and -47.1) with zeros before we proceed further to the next layer. Of course, we would have to re-train the weights here to make the network useful again. Without the activation layer all the additions and multiplications in the network can be collapsed to a single layer. In our case, you could write the green circle as a sum of RGB directly with some weights and you would not need the middle layer. It would be something like (0.10 * -0.17 + 0.12 * 0.39 - 0.36 * 0.1) * R + (-0.29 * -0.17 - 0.05 * 0.39 - 0.21 * 0.1) * G ... and so on. This is usually not possible if we have a nonlinearity there. This helps networks deal with more complex situations.
  • Bias: Networks will usually also contain another number associated with each node; this number is simply added to the product to calculate the value of the node, and it is called the “bias”. So if the bias for the top blue node was 0.25 then the value in the node would be: (32 * 0.10) + (107 * -0.29) + (56 * -0.07) + (11.2 * 0.46) + 0.25 = -26.35. The word parameters is usually used to refer to all these numbers in the model that are not neurons/nodes. A small sketch of the activation and bias math follows this list.
  • Softmax: We don’t usually interpret the output layer directly as shown in our models. We convert the numbers into probabilities (i.e. make it so that all numbers are positive and add up to 1). If all the numbers in the output layer were already positive, one way you could achieve this is by dividing each number by the sum of all numbers in the output layer. Though a “softmax” function is normally used, which can handle both positive and negative numbers.
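Here is that minimal sketch of the activation and bias pieces (softmax gets its own section, and its own sketch, later):

```python
def relu(x):
    # zero if the number is negative, unchanged if it is positive
    return max(0.0, x)

def node_value(inputs, weights, bias):
    # weighted sum of the incoming values, plus the node's bias
    return sum(i * w for i, w in zip(inputs, weights)) + bias

v = node_value([32, 107, 56, 11.2], [0.10, -0.29, -0.07, 0.46], 0.25)
print(v)          # about -26.35, matching the bias example above
print(relu(v))    # 0.0, since the value is negative
```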

How are these models trained?

In the example above, we magically had the weights that allowed us to put data into the model and get a good output. But how are these weights determined? The process of setting these weights (or “parameters”) is called “training the model”, and we need some training data to train the model.

Let’s say we have some data where we have the inputs and we already know whether each input corresponds to leaf or flower; this is our “training data”. And since we have the leaf/flower label for each set of (R,G,B,Vol) numbers, this is “labeled data”.

Here’s how it works:

  • Start with random numbers, i.e. set each parameter/weight to a random number
  • Now, suppose we input the data corresponding to the leaf (R=32, G=107, B=56, Vol=11.2) and we want a larger number for leaf in the output layer. Let’s say we want the number corresponding to leaf to be 0.8 and the one corresponding to flower to be 0.2 (as shown in the example above, but these are illustrative numbers to demonstrate training; in reality we would not want 0.8 and 0.2. In reality these would be probabilities, which they are not here, and we would want them to be 1 and 0)
  • We know the numbers we want in the output layer, and the numbers we are getting from the randomly selected parameters (which are different from what we want). So for all the neurons in the output layer, let’s take the difference between the number we want and the number we have. Then add up the differences. E.g., if the output layer is 0.6 and 0.4 in the two neurons, then we get: (0.8 - 0.6) = 0.2 and (0.2 - 0.4) = -0.2, so we get a total of 0.4 (ignoring minus signs before adding). We can call this our “loss”. Ideally we want the loss to be close to zero, i.e. we want to “minimize the loss”.
  • Once we have the loss, we can slightly change each parameter to see if increasing or decreasing it will increase or decrease the loss. This is called the “gradient” of that parameter. Then we can move each of the parameters by a small amount in the direction where the loss goes down (opposite the direction of the gradient). Once we have moved all the parameters slightly, the loss should be lower
  • Keep repeating the process and you will reduce the loss, and eventually have a set of weights/parameters that are “trained”. This whole process is called “gradient descent” (a small sketch follows this list).
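Here is a minimal sketch of that loop. To keep it self-contained, the “network” is a single hypothetical output neuron and the two training examples are made up; the gradient is estimated exactly as described: nudge a parameter, see how the average loss moves, then step the other way.

```python
examples = [([32, 107, 56, 11.2], 0.8),    # (inputs, number we want out)
            ([29, 56, 85, 17.6], 0.2)]     # hypothetical second example

def forward(params, inputs):
    # stand-in network: one output neuron, no middle layer
    return sum(p * x for p, x in zip(params, inputs))

def avg_loss(params):
    # average over all examples of |number we want - number we got|
    return sum(abs(want - forward(params, inp))
               for inp, want in examples) / len(examples)

params = [0.01, -0.02, 0.03, 0.005]        # step 1: random-ish starting weights
for epoch in range(1000):                  # each full pass is an "epoch"
    for i in range(len(params)):
        base = avg_loss(params)
        params[i] += 1e-6                  # nudge one parameter slightly...
        gradient = (avg_loss(params) - base) / 1e-6   # ...to estimate its gradient
        params[i] -= 1e-6                  # undo the nudge
        params[i] -= 1e-5 * gradient       # move opposite the gradient
print(avg_loss(params))                    # lower than where we started
```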

Couple of notes:

  • You often have multiple training examples, so when you change the weights slightly to minimize the loss for one example it might make the loss worse for another example. The way to deal with this is to define loss as average loss over all the examples and then take the gradient over that average loss. This reduces the average loss over the entire training data set. Each such cycle is called an “epoch”. Then you can keep repeating the epochs, thus finding weights that reduce average loss.
  • We don’t actually need to “move weights around” to calculate the gradient for each weight — we can just infer it from the formula (e.g. if the weight is 0.17 in the last step, and the value of the neuron is positive, and we want a larger number in the output, we can see that increasing this number to 0.18 will help).

In practice, training deep networks is a hard and complex process because gradients can easily spiral out of control, going to zero or infinity during training (called “vanishing gradient” and “exploding gradient” problems). The simple definition of loss that we talked about here is perfectly valid, but rarely used as there are better functional forms that work well for specific purposes. With modern models containing billions of parameters, training a model requires massive compute resources which has its own problems (memory limitations, parallelization etc.)

How does all this help generate language?

Remember, neural nets take in some numbers, do some math based on the trained parameters, and give out some other numbers. Everything is about interpretation and training the parameters (i.e. setting them to some numbers). If we can interpret the two numbers as “leaf/flower” or “rain or sun in an hour”, we can also interpret them as “next character in a sentence”.

But there are more than 2 letters in English, and so we must expand the number of neurons in the output layer to, say, the 26 letters in the English language (let’s also throw in some symbols like space, period etc.). Each neuron can correspond to a character, and we look at the (26 or so) neurons in the output layer and say that the character corresponding to the neuron with the highest value in the output layer is the output character. Now we have a network that can take some inputs and output a character.

What if we replaced the input in our network with these characters: “Humpty Dumpt”, asked it to output a character, and interpreted it as the “network’s suggestion of the next character in the sequence that we just entered”? We can probably set the weights well enough for it to output “y” — thereby completing “Humpty Dumpty”. Except for one problem: how do we input these lists of characters into the network? Our network only accepts numbers!!

One simple solution is to assign a number to each character. Let’s say a=1, b=2 and so on. Now we can input “humpty dumpt” and train it to give us “y”. Our network looks something like this:


Image by author

Ok, so now we can predict one character ahead by providing the network a list of characters. We can use this fact to build a whole sentence. For example, once we have the “y” predicted, we can append that “y” to the list of characters we have and feed it to the network and ask it to predict the next character. And if well trained it should give us a space, and so on and so forth. By the end, we should be able to recursively generate “Humpty Dumpty sat on a wall”. We have Generative AI. Moreover, we now have a network capable of generating language! Now, nobody ever actually puts in randomly assigned numbers and we will see more sensible schemes down the line. If you cannot wait, feel free to check out the one-hot encoding section in the appendix.
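Here is that recursive generation loop as a sketch; predict_next_character() is a hypothetical wrapper around a trained network that takes the 12 most recent characters (the fixed-size window discussed next).

```python
def generate(seed, n_chars):
    text = seed
    for _ in range(n_chars):
        context = text[-12:]                       # only the latest 12 characters fit
        text += predict_next_character(context)    # hypothetical trained model
    return text

# generate("Humpty Dumpt", 15) would ideally complete "Humpty Dumpty sat on a wall"
```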

Astute readers will note that we can’t actually input “Humpty Dumpty” into the network since, the way the diagram is drawn, it only has 12 neurons in the input layer, one for each character in “humpty dumpt” (including the space). So how can we put in the “y” for the next pass? Putting a 13th neuron there would require us to modify the entire network; that’s not workable. The solution is simple: let’s kick the “h” out and send the 12 most recent characters. So we would be sending “umpty dumpty” and the network will predict a space. Then we would input “mpty dumpty ” and it will produce an “s”, and so on. It looks something like this:


Image by author

We’re throwing away a lot of information in the last line by feeding the model only “ sat on the wal”. So what do the latest and greatest networks of today do? More or less exactly that. The length of inputs we can put into a network is fixed (determined by the size of the input layer). This is called “context length” — the context that is provided to the network to make future predictions. Modern networks can have very large context lengths (several thousand words) and that helps. There are some ways of inputting infinite length sequences but the performance of those methods, while impressive, has since been surpassed by other models with large (but fixed) context length.

One other thing careful readers will notice is that we have different interpretations for inputs and outputs for the same letters! For example, when inputting “h” we are simply denoting it with the number 8, but on the output layer we are not asking the model to output a single number (8 for “h”, 9 for “i” and so on); instead we are asking the model to output 26 numbers, and then we see which one is the highest, and if the 8th number is highest we interpret the output as “h”. Why don’t we use the same, consistent interpretation on both ends? We could, it’s just that in the case of language, freeing yourself to choose between different interpretations gives you a better chance of building better models. And it just so happens that the most effective currently known interpretations for the input and output are different. In fact, the way we are inputting numbers in this model is not the best way to do it; we will look at better ways to do that shortly.

What makes large language models work so well?

Generating “Humpty Dumpty sat on a wall” character-by-character is a far cry from what modern LLMs can do. There are a number of differences and innovations that get us from the simple generative AI that we discussed above to the human-like bot. Let’s go through them:

Embeddings

Remember we said that the way that we are inputting characters into the model isn’t the best way to do it. We just arbitrarily selected a number for each character. What if there were better numbers we could assign that would make it possible for us to train better networks? How do we find these better numbers? Here’s a clever trick:

When we trained the models above, the way we did it was by moving the weights around and seeing what gives us a smaller loss in the end. And then slowly and recursively changing the weights. At each turn we would:

  • Feed in the inputs
  • Calculate the output layer
  • Compare it to the output we ideally want and calculate the average loss
  • Adjust the weights and start again

In this process, the inputs are fixed. This made sense when the inputs were (RGB, Vol). But the numbers we are putting in now for a, b, c etc. are arbitrarily picked by us. What if at every iteration, in addition to moving the weights around by a bit, we also moved the inputs around, and saw if we could get a lower loss by using a different number to represent “a” and so on? We are definitely reducing the loss and making the model better (that’s the direction we moved a’s input in, by design). Basically, apply gradient descent not just to the weights but also to the number representations for the inputs, since they are arbitrarily picked numbers anyway. This is called an “embedding”. It is a mapping of inputs to numbers, and as you just saw, it needs to be trained. The process of training an embedding is much like that of training a parameter. One big advantage of this though is that once you train an embedding you can use it in another model if you wish. Keep in mind that you will consistently use the same embedding to represent a single token/character/word.
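As a minimal sketch of the idea (all names here are illustrative): each character’s input is a lookup into a trainable table, and gradient descent nudges the table entries exactly as it nudges weights.

```python
# One trainable number per character; the starting values are arbitrary.
embedding = {ch: 0.0 for ch in "abcdefghijklmnopqrstuvwxyz "}

def inputs_for(text):
    # the network's inputs are looked up, not hard-coded like a=1, b=2
    return [embedding[ch] for ch in text]

# During training, embedding["a"] is treated like any other parameter:
# nudge it, see how the loss moves, step it in the direction that lowers loss.
```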

We talked about embeddings that are just one number per character. However, in reality embeddings have more than one number. That’s because it is hard to capture the richness of a concept with a single number. If we look at our leaf and flower example, we have four numbers for each object (the size of the input layer). Each of these four numbers conveyed a property, and the model was able to use all of them to effectively guess the object. If we had only one number, say the red channel of the color, it might have been a lot harder for the model. We’re trying to capture human language here — we’re going to need more than one number.

So instead of representing each character by a single number, maybe we can represent it by multiple numbers to capture the richness? Let’s assign a bunch of numbers to each character. Let’s call an ordered collection of numbers a “vector” (ordered as in each number has a position, and if we swap the positions of two numbers it gives us a different vector. This was the case with our leaf/flower data: if we swapped the R and G numbers for the leaf, we would get a different color, and it would not be the same vector anymore). The length of a vector is simply how many numbers it contains. We’ll assign a vector to each character. Two questions arise:

  • If we have a vector assigned to each character instead of a number, how do we now feed “humpty dumpt” to the network? The answer is simple. Let’s say we assigned a vector of 10 numbers to each character. Then instead of the input layer having 12 neurons we would just put 120 neurons there since each of the 12 characters in “humpty dumpt” has 10 numbers to input. Now we just put the neurons next to each other and we are good to go
  • How do we find these vectors? Thankfully, we just learned how to train embedding numbers. Training an embedding vector is no different. You now have 120 inputs instead of 12 but all you are doing is moving them around to see how you can minimize loss. And then you take the first 10 of those and that’s the vector corresponding to “h” and so on.

All the embedding vectors must of course be the same length, otherwise we would not have a way of entering all the character combinations into the network. E.g. “humpty dumpt” and in the next iteration “umpty dumpty” — in both cases we are entering 12 characters in the network and if each of the 12 characters was not represented by vectors of length 10 we won’t be able to reliably feed them all into a 120-long input layer. Let’s visualize these embedding vectors:

Image by author

Let’s call an ordered collection of same-sized vectors a matrix. The matrix above is called an embedding matrix. You tell it a column number corresponding to your letter, and looking at that column in the matrix will give you the vector that you are using to represent that letter. This can be applied more generally for embedding any arbitrary collection of things; you would just need to have as many columns in this matrix as the things you have.
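A minimal sketch of the lookup, with placeholder entries (real ones would be trained):

```python
vocab = "abcdefghijklmnopqrstuvwxyz "            # one column per character
embedding_matrix = [[0.0] * 10 for _ in vocab]    # stored as a list of length-10 columns

def embed(ch):
    # the character's column number tells us which vector to fetch
    return embedding_matrix[vocab.index(ch)]

# "humpty dumpt" -> 12 vectors of 10 numbers = 120 input neurons, as above
input_neurons = [number for ch in "humpty dumpt" for number in embed(ch)]
assert len(input_neurons) == 120
```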

Subword Tokenizers

So far, we have been working with characters as the basic building blocks of language. This has its limitations. The neural network weights have to do a lot of the heavy lifting where they must make sense of certain sequences of characters (i.e. words) appearing next to each other and then next to other words. What if we directly assigned embeddings to words and made the network predict the next word? The network doesn’t understand anything more than numbers anyway, so we can assign a 10-length vector to each of the words “humpty”, “dumpty”, “sat”, “on” etc., and then we just feed it two words and it can give us the next word. “Token” is the term for a single unit that we embed and then feed to the model. Our models so far were using characters as tokens; now we are proposing to use entire words as a token (you can of course use entire sentences or phrases as tokens if you like).

Using word tokenization has one profound effect on our model. There are more than 180K words in the English language. Using our output interpretation scheme of having a neuron per possible output, we need hundreds of thousands of neurons in the output layer instead of the 26 or so. With the size of the hidden layers needed to achieve meaningful results for modern networks, this issue becomes less pressing. What is however worth noting is that since we are treating each word separately, and we are starting with random embeddings for each — very similar words (e.g. “cat” and “cats”) will start with no relationship. You would expect that embeddings for the two words should be close to each other — which undoubtedly the model will learn. But, can we somehow use this obvious similarity to get a jumpstart and simplify matters?

Yes we can. The most common embedding scheme in language models today is something where you break words down into subwords and then embed them. In the cat example, we would break down cats into two tokens “cat” and “s”. Now it is easier for the model to understand the concept of “s” followed by other familiar words and so on. This also reduces the number of tokens we need (SentencePiece is a common tokenizer with vocab size options in the tens of thousands, vs hundreds of thousands of words in English). A tokenizer is something that takes your input text (e.g. “Humpty Dumpt”), splits it into tokens, and gives you the corresponding numbers that you need to look up the embedding vector for each token in the embedding matrix. For example, in the case of “humpty dumpty”, if we’re using a character-level tokenizer and we arranged our embedding matrix as in the picture above, then the tokenizer will first split humpty dumpt into characters [‘h’,’u’,…’t’] and then give you back the numbers [8,21,…20], because you need to look up the 8th column of the embedding matrix to get the embedding vector for ‘h’ (the embedding vector is what you will feed into the model, not the number 8, unlike before). The arrangement of the columns in the matrix is completely irrelevant; we could assign any column to ‘h’ and as long as we look up the same vector every time we input ‘h’ we should be good. Tokenizers just give us an arbitrary (but fixed) number to make lookup easy. The main task we really need them for is splitting the sentence into tokens.
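A minimal character-level tokenizer as a sketch (zero-indexed here, whereas the text counts columns from 1; a real subword tokenizer such as SentencePiece splits into subwords instead, but the interface is the same):

```python
vocab = "abcdefghijklmnopqrstuvwxyz "      # column order is arbitrary but fixed

def tokenize(text):
    # text in, column numbers out; the embedding lookup happens afterwards
    return [vocab.index(ch) for ch in text.lower()]

print(tokenize("humpty dumpt"))
# [7, 20, 12, 15, 19, 24, 26, 3, 20, 12, 15, 19]
```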

With embeddings and subword tokenization, a model could look something like this:


Image by author

The next few sections deal with more recent advances in language modeling, and the ones that made LLMs as powerful as they are today. However, to understand these there are a few basic math concepts you need to know. Here are the concepts:

  • Matrices and matrix multiplication
  • General concept of functions in mathematics
  • Raising numbers to powers (e.g. a^3 = a*a*a)
  • Sample mean, variance, and standard deviation

I have added summaries of these concepts in the appendix.

Self Attention

So far we have seen only one simple neural network structure (called feedforward network), one which contains a number of layers and each layer is fully connected to the next (i.e., there is a line connecting any two neurons in consecutive layers), and it is only connected to the next layer (e.g. no lines between layer 1 and layer 3 etc..). However, as you can imagine there is nothing stopping us from removing or making other connections. Or even making more complex structures. Let’s explore a particularly important structure: self-attention.

If you look at the structure of human language, the next word that we want to predict will depend on all the words that came before. However, it may depend on some words to a greater degree than others. For example, suppose we are trying to predict the next word in “Damian had a secret child, a girl, and he had written in his will that all his belongings, along with the magical orb, will belong to ____”. This word could be “her” or “him”, and it depends specifically on a much earlier word in the sentence: girl/boy.

The good news is, our simple feedforward model connects to all the words in the context, and so it can learn the appropriate weights for the important words. But here’s the problem: the weights connecting specific positions in our model through feedforward layers are fixed (for every position). If the important word were always in the same position, the network would learn the weights appropriately and we would be fine. However, the relevant word for the next prediction could be anywhere in the sequence. We could paraphrase the sentence above, and when guessing “her vs him”, one very important word for this prediction would be boy/girl no matter where it appeared in that sentence. So, we need weights that depend not only on the position but also on the content in that position. How do we achieve this?

Self-attention does something like adding up the embedding vectors for each of the words, but instead of directly adding them up it applies some weights to each. So if the embedding vectors for “humpty”, “dumpty”, “sat” are x1, x2, x3 respectively, then it will multiply each one with a weight (a number) before adding them up. Something like output = 0.5 x1 + 0.25 x2 + 0.25 x3, where output is the self-attention output. If we write the weights as u1, u2, u3 such that output = u1x1 + u2x2 + u3x3, then how do we find these weights u1, u2, u3?

Ideally, we want these weights to be dependent on the vector we are adding — as we saw, some may be more important than others. But important to whom? To the word we are about to predict. So we also want the weights to depend on the word we are about to predict. Now that’s an issue: we of course don’t know the word we are about to predict before we predict it. So, self-attention uses the word immediately preceding the word we are about to predict, i.e., the last word available in the sentence (I don’t really know why this and not something else, but a lot of things in deep learning are trial and error and I suspect this works well).

Great, so we want weights for these vectors, and we want each weight to depend on the word that we are aggregating and on the word immediately preceding the one we are going to predict. Basically, we want a function u1 = F(x1, x3), where x1 is the word we will weight and x3 is the last word in the sequence we have (assuming we have only 3 words). Now, a straightforward way of achieving this is to have a vector for x1 (let’s call it k1) and a separate vector for x3 (let’s call it q3) and then simply take their dot product. This will give us a number, and it will depend on both x1 and x3. How do we get these vectors k1 and q3? We build a tiny single-layer neural network to go from x1 to k1 (or x2 to k2, x3 to k3 and so on). And we build another network going from x3 to q3 etc. Using our matrix notation, we basically come up with weight matrices Wk and Wq such that k1 = Wk x1 and q1 = Wq x1, and so on. Now we can take a dot product of k1 and q3 to get a scalar, so u1 = F(x1, x3) = Wk x1 · Wq x3.

One additional thing that happens in self-attention is that we don’t directly take the weighted sum of the embedding vectors themselves. Instead, we take the weighted sum of some “value” of each embedding vector, which is obtained by another small single-layer network. What this means is that, similar to k1 and q1, we now also have a v1 for the word x1, and we obtain it through a matrix Wv such that v1 = Wv x1. These v’s are then aggregated. So it all looks something like this if we only have 3 words and we are trying to predict the fourth:


Self attention. Image by author

The plus sign represents a simple addition of the vectors, implying they have to have the same length. One last modification not shown here is that the scalars u1, u2, u3 etc. won’t necessarily add up to 1. If we need them to be weights, we should make them add up to 1. So we will apply a familiar trick here and use the softmax function.
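Putting the pieces together, here is a minimal sketch of self-attention over three words. The matrices are random placeholders (in a real model Wk, Wq, Wv are trained), and the softmax makes the u’s add up to 1:

```python
import math, random

def matvec(W, x):                        # matrix times vector
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(values):
    exps = [math.exp(v) for v in values]
    return [e / sum(exps) for e in exps]

d = 4                                                         # embedding length
rand = lambda: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
Wk, Wq, Wv = rand(), rand(), rand()

xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(3)]   # x1, x2, x3

q = matvec(Wq, xs[-1])                    # query from the last available word
us = softmax([dot(matvec(Wk, x), q) for x in xs])   # u1, u2, u3, adding up to 1
vs = [matvec(Wv, x) for x in xs]
output = [sum(u * v[j] for u, v in zip(us, vs)) for j in range(d)]
```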

This is self-attention. There is also cross-attention, where the q3 comes from the last word but the k’s and the v’s can come from another sentence altogether. This is, for example, valuable in translation tasks. Now we know what attention is.

This whole thing can now be put in a box and be called a “self-attention block”. Basically, this self-attention block takes in the embedding vectors and spits out a single output vector of any user-chosen length. This block has three parameters, Wk, Wq, Wv — it doesn’t need to be more complicated than that. There are many such blocks in the machine learning literature, and they are usually represented by boxes in diagrams with their name on them. Something like this:


Image by author

One of the things that you will notice with self-attention is that the position of things so far does not seem relevant. We are using the same W’s across the board, so switching Humpty and Dumpty won’t really make a difference here — all numbers will end up being the same. This means that while attention can figure out what to pay attention to, this won’t depend on word position. However, we do know that word positions are important in English, and we can probably improve performance by giving the model some sense of a word’s position.

And so, when attention is used, we don’t often feed the embedding vectors directly to the self attention block. We will later see how “positional encoding” is added to embedding vectors before feeding to attention blocks.

Note for the pre-initiated: Those for whom this isn’t the first time reading about self-attention will note that we are not referencing any K and Q matrices, or applying masks etc.. That is because those things are implementation details arising out of how these models are commonly trained. A batch of data is fed and the model is simultaneously trained to predict dumpty from humpty, sat from humpty dumpty and so on. This is a matter of gaining efficiency and does not affect interpretation or even model outputs, and we have chosen to omit training efficiency hacks here.

Softmax

We talked briefly about softmax in the very first note. Here’s the problem softmax is trying to solve: in our output interpretation we have as many neurons as the options from which we want the network to select one. And we said that we are going to interpret the network’s choice as the highest-value neuron. Then we said we are going to calculate loss as the difference between the value that the network provides and the ideal value we want. But what’s that ideal value we want? We set it to 0.8 in the leaf/flower example. But why 0.8? Why not 5, or 10, or 10 million? The higher the better for that training example. Ideally we want infinity there! Now that would make the problem intractable — all loss would be infinite and our plan of minimizing loss by moving around parameters (remember “gradient descent”) fails. How do we deal with this?

One simple thing we can do is cap the values we want. Let’s say between 0 and 1? This would make all loss finite, but now we have the issue of what happens when the network overshoots. Let’s say it outputs (5,1) for (leaf,flower) in one case, and (0,1) in another. The first case made the right choice but the loss is worse! Ok, so now we need a way to also convert the outputs of the last layer into the (0,1) range so that it preserves the order. We could use any function (a “function” in mathematics is simply a mapping of one number to another — in goes one number, out comes another — it’s rule-based in terms of what will be output for a given input) here to get the job done. One possible option is the logistic function (see graph below), which maps all numbers to numbers between (0,1) and preserves the order:


Image by author
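In code, the logistic function is simply:

```python
import math

def logistic(x):
    # squashes any number into (0, 1) while preserving order
    return 1 / (1 + math.exp(-x))

print(logistic(5), logistic(1))   # ~0.993 and ~0.731: both in (0, 1), order kept
```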

Now, we have a number between 0 and 1 for each of the neurons in the last layer and we can calculate loss by setting the correct neuron to 1, others to 0 and taking the difference of that from what the network provides us. This will work, but can we do better?

Going back to our “Humpty dumpty” example, let’s say we are trying to generate dumpty character-by-character and our model makes a mistake when predicting “m” in dumpty. Instead of giving us the last layer with “m” as the highest value, it gives us “u” as the highest value but “m” is a close second.

Now we can continue with “duu” and try to predict next character and so on, but the model confidence will be low because there are not that many good continuations from “humpty duu..”. On the other hand, “m” was a close second, so we can also give “m” a shot, predict the next few characters, and see what happens? Maybe it gives us a better overall word?

So what we are talking about here is not just blindly selecting the max value, but trying a few. What’s a good way to do it? Well we have to assign a chance to each one — say we will pick the top one with 50%, second one with 25% and so on. That’s a good way to do it. But maybe we would want the chance to be dependent on the underlying model predictions. If the model predicts values for m and u to be really close to each other here (compared to other values) — then maybe a close 50–50 chance of exploring the two is a good idea?

So we need a nice rule that takes all these numbers and converts them into chances. That’s what softmax does. It is a generalization of the logistic function above, but with additional features. If you give it 10 arbitrary numbers, it will give you 10 outputs, each between 0 and 1, and importantly, all 10 adding up to 1, so that we can interpret them as chances. You will find softmax as the last layer in nearly every language model.
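A minimal sketch of softmax followed by sampling-by-chance (the scores are made up; note how two close scores get close chances):

```python
import math, random

def softmax(values):
    exps = [math.exp(v) for v in values]   # exponentiating handles negatives too
    total = sum(exps)
    return [e / total for e in exps]

chances = softmax([2.0, 1.9, -1.0])        # made-up scores for "u", "m", "x"
print(chances)                             # ~[0.51, 0.46, 0.03], adds up to 1
next_char = random.choices(["u", "m", "x"], weights=chances)[0]
```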

Residual connections

We have slowly changed our visualization of networks as the sections progress. We are now using boxes/blocks to denote certain concepts. This notation is useful for denoting the particularly useful concept of residual connections. Let’s look at a residual connection combined with a self-attention block:


A residual connection. Image by author

Note that we put “Input” and “Output” as boxes to make things simpler, but these are still basically just a collection of neurons/numbers same as shown above.

So what’s going on here? We are basically taking the output of the self-attention block and, before passing it to the next block, adding to it the original input. The first thing to note is that this requires that the dimensions of the self-attention block output be the same as those of the input. This is not a problem since, as we noted, the self-attention output length is determined by the user. But why do this? We won’t get into all the details here, but the key thing is that as networks get deeper (more layers between input and output) it gets increasingly harder to train them. Residual connections have been shown to help with these training challenges.

Layer Normalization

Layer normalization is a fairly simple layer that takes the data coming into the layer and normalizes it by subtracting the mean and dividing it by standard deviation (maybe a bit more, as we see below). For example, if we were to apply layer normalization immediately after the input, it would take all the neurons in the input layer and then it would calculate two statistics: their mean and their standard deviation. Let’s say the mean is M and the standard deviation is D then what layer norm is doing is taking each of these neurons and replacing it with (x-M)/D where x denotes any given neuron’s original value.

Now how does this help? It basically stabilizes the input vector and helps with training deep networks. One concern is that by normalizing inputs, are we removing some useful information from them that may be helpful in learning something valuable about our goal? To address this, the layer norm layer has a scale and a bias parameter. Basically, for each neuron you just multiply it with a scalar and then add a bias to it. These scalar and bias values are parameters that can be trained. This allows the network to learn some of the variation that may be valuable to the predictions. And since these are the only parameters, the LayerNorm block doesn’t have a lot of parameters to train. The whole thing looks something like this:


Layer Normalization. Image by author

The Scale and Bias are trainable parameters. You can see that layer norm is a relatively simple block where each number is only operated on pointwise (after the initial mean and std calculation). It reminds us of the activation layer (e.g. RELU), with the key difference being that here we have some trainable parameters (albeit a lot fewer than other layers, because of the simple pointwise operation).

Standard deviation is a statistical measure of how spread out the values are; e.g., if the values are all the same you would say the standard deviation is zero. If, in general, each value is really far from the mean of these very same values, then you will have a high standard deviation. The formula to calculate the standard deviation for a set of numbers a1, a2, a3… (say N numbers) goes something like this: subtract the mean (of these numbers) from each of the numbers, then square the answer for each of the N numbers. Add up all these numbers and then divide by N. Now take a square root of the answer.
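The whole block as a minimal sketch (scale and bias start at 1 and 0 here; in a real model they are trained, and the std line is exactly the formula just described):

```python
import math

def layer_norm(values, scale=1.0, bias=0.0, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)     # tiny eps guards against dividing by zero
    return [scale * (v - mean) / std + bias for v in values]

print(layer_norm([32, 107, 56, 11.2]))  # now centered near 0 with spread near 1
```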

Note for the pre-initiated: Experienced ML professionals will note that there is no discussion of batch norm here. In fact, we haven’t even introduced the concept of batches in this article at all. For the most part, I believe batches are another training accelerant not related to the understanding of core concepts (except perhaps batch norm, which we do not need here).

Dropout

Dropout is a simple but effective method to avoid model overfitting. Overfitting is a term for when you train the model on your training data, and it works well on that dataset but does not generalize well to the examples the model has not seen. Techniques that help us avoid overfitting are called “regularization techniques”, and dropout is one of them.

If you train a model, it might make errors on the data and/or overfit it in a particular way. If you train another model, it might do the same, but in a different way. What if you trained a number of these models and averaged the outputs? These are typically called “ensemble models” because they predict the outputs by combining outputs from an ensemble of models, and ensemble models generally perform better than any of the individual models.

In neural networks, you could do the same. You could build multiple (slightly different) models and then combine their outputs to get a better model. However, this can be computationally expensive. Dropout is a technique that doesn’t quite build ensemble models but does capture some of the essence of the concept.

The concept is simple: by inserting a dropout layer during training, what you are doing is randomly deleting a certain percentage of the direct neuron connections between the layers where dropout is inserted. Considering our initial network, inserting a dropout layer between the input and the middle layer with a 50% dropout rate can look something like this:


Image by author

Now, this forces the network to train with a lot of redundancy. Essentially, you are training a number of different models at the same time — but they share weights.

Now for making inferences, we could follow the same approach as an ensemble model. We could make multiple predictions using dropouts and then combine them. However, since that is computationally intensive — and since our models share common weights — why don’t we just do a prediction using all the weights (so instead of using 50% of the weights at a time, we use all at the same time)? This should give us some approximation of what an ensemble will provide.

One issue though: the model trained with 50% of the weights will have very different numbers in the middle neurons than one using all the weights. What we want is more ensemble-style averaging here. How do we do this? Well, a simple way is to simply take all the weights and multiply them by 0.5, since we are now using twice as many weights. This is what dropout does during inference. It will use the full network with all the weights and simply multiply the weights by (1 - p), where p is the deletion probability. And this has been shown to work rather well as a regularization technique.
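A minimal sketch of both modes; scaling the values here has the same effect as scaling the weights they feed into:

```python
import random

def dropout(values, p=0.5, training=True):
    if training:
        # randomly delete: each value is zeroed with probability p
        return [0.0 if random.random() < p else v for v in values]
    # inference: keep everything, scaled by (1 - p) to match training averages
    return [v * (1 - p) for v in values]
```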

Multi-head Attention

This is the key block in the transformer architecture. We’ve already seen what an attention block is. Remember that the length of an attention block’s output was chosen by us: it is the length of the v vectors. In multi-head attention, you basically run several attention heads in parallel (they all take the same inputs). Then we take all their outputs and simply concatenate them. It looks something like this:

Multi-head attention. Image by author

Keep in mind the arrows going from v1 -> v1h1 are linear layers; there’s a matrix on each arrow that applies a transformation. I just did not show them to avoid clutter.

What is going on here is that we are generating the same keys, queries and values for each of the heads. But then we apply a linear transformation on top of that (separately to each k, q, v and separately for each head) before we use those k, q, v values. This extra layer did not exist in self-attention.

A side note: to me, this is a slightly surprising way of creating multi-headed attention. For example, why not create separate Wk, Wq, Wv matrices for each of the heads rather than adding a new layer and sharing these weights? Let me know if you know; I really have no idea.
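
To make the arrangement concrete, here is a minimal numpy sketch (all sizes and weight matrices below are made up for illustration): the shared q, k, v each pass through a separate linear transformation per head, each head runs attention, and the outputs are concatenated. The scores are divided by the square root of the key length, the scaled dot-product form:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over a sequence
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def multi_head_attention(q, k, v, heads):
    # heads is a list of (Wq, Wk, Wv) matrices, one triple per head.
    # Each head applies its own linear transformation to the shared
    # q, k, v before running attention; the outputs are concatenated.
    outputs = [attention(q @ Wq, k @ Wk, v @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)

seq_len, d = 4, 8                      # 4 tokens, embedding size 8 (toy sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d))      # shared q, k, v all come from x here
heads = [tuple(rng.normal(size=(d, d // 2)) for _ in range(3)) for _ in range(2)]
print(multi_head_attention(x, x, x, heads).shape)  # (4, 8): 2 heads of size 4 each
```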

Positional encoding and embedding

We briefly talked about the motivation for using positional encoding in the self-attention section. What are these? While the picture shows positional encoding, using a positional embedding is more common than using an encoding. As such, we talk about a common positional embedding here, but the appendix also covers the positional encoding used in the original paper. A positional embedding is no different from any other embedding, except that instead of embedding the word vocabulary we embed the numbers 1, 2, 3 and so on. So this embedding is a matrix with the same embedding dimension as the word embedding, and each column corresponds to a position. That’s really all there is to it.
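
A small numpy sketch of this (the sizes and token ids are invented): look up a vector for each token, look up a vector for each position, and add them:

```python
import numpy as np

vocab_size, max_positions, emb_dim = 32000, 16, 8   # made-up sizes
rng = np.random.default_rng(0)
word_embedding = rng.normal(size=(vocab_size, emb_dim))
# The positional embedding is just another trainable matrix, indexed
# by position instead of by token id:
pos_embedding = rng.normal(size=(max_positions, emb_dim))

tokens = [17, 4, 992]                  # hypothetical token ids
x = word_embedding[tokens] + pos_embedding[np.arange(len(tokens))]
print(x.shape)                         # (3, 8): one vector per token
```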

The GPT architecture

Let’s talk about the GPT architecture. This is what is used in most GPT models (with variations across them). If you have been following the article thus far, this should be fairly trivial to understand. Using the box notation, this is what the architecture looks like at a high level:

The GPT Architecture. Image by author

At this point, other than the “GPT Transformer Block”, all the other blocks have been discussed in great detail. The + sign here simply means that the two vectors are added together (which means the two embeddings must be the same size). Let’s look at this GPT Transformer Block:


And that’s pretty much it. It is called a “transformer” here because it is derived from, and is a type of, transformer, which is an architecture we will look at in the next section. This doesn’t affect understanding, as we’ve already covered all the building blocks shown here. Let’s recap everything we’ve covered so far building up to this GPT architecture:

  • We saw how neural nets take numbers, output other numbers, and have weights as parameters which can be trained
  • We can attach interpretations to these input/output numbers and give real-world meaning to a neural network
  • We can chain neural networks to create bigger ones, and we can call each one a “block” and denote it with a box to make diagrams easier. Each block still does the same thing: take in a bunch of numbers and output another bunch of numbers
  • We learned a lot of different types of blocks that serve different purposes
  • GPT is just a special arrangement of these blocks, shown above, with the interpretation that we discussed in Part 1

Modifications have been made over time as companies have built up to powerful modern LLMs, but the basics remain the same.
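
To tie the recap together, here is a structural sketch of the forward pass in the diagram above, with the GPT Transformer Block stubbed out as an identity function so only the wiring is visible. All names and sizes here are mine:

```python
import numpy as np

def transformer_block(x):
    return x  # placeholder for attention + feedforward + add & norm

def gpt_forward(tokens, word_emb, pos_emb, unembed, n_layers=2):
    x = word_emb[tokens] + pos_emb[np.arange(len(tokens))]  # the "+" in the diagram
    for _ in range(n_layers):          # stacked GPT Transformer Blocks
        x = transformer_block(x)
    logits = x @ unembed               # project back to vocabulary size
    return logits[-1]                  # scores for the next token

vocab, d = 100, 8                      # toy sizes
rng = np.random.default_rng(0)
scores = gpt_forward([3, 14, 15], rng.normal(size=(vocab, d)),
                     rng.normal(size=(32, d)), rng.normal(size=(d, vocab)))
print(scores.shape)                    # (100,): one score per vocabulary token
```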

Now, this GPT transformer is actually what is called a “decoder” in the original transformer paper that introduced the transformer architecture. Let’s take a look at that.

The transformer architecture

This is one of the key innovations driving the recent rapid acceleration in the capabilities of language models. Transformers not only improved prediction accuracy, they are also easier and more efficient to train than previous models, allowing for larger model sizes. This is what the GPT architecture above is based on.

If you look at the GPT architecture, you can see that it is great for generating the next word in a sequence. It fundamentally follows the same logic we discussed in Part 1: start with a few words and then continue generating one at a time. But what if you wanted to do translation? What if you had a sentence in German (e.g. “Wo wohnst du?” = “Where do you live?”) and you wanted to translate it to English? How would we train the model to do this?

Well, the first thing we would need to do is figure out a way to input German words. Which means we have to expand our embedding to include both German and English. Now, here is a simple way of inputting the information: why don’t we just concatenate the German sentence in front of whatever English has been generated so far and feed that as the context? To make it easier for the model, we can add a separator. At each step this would look something like this:

Image by author
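
Here is a tiny sketch of this scheme. The English words below are hard-coded stand-ins for what the model would generate one at a time:

```python
german = ["Wo", "wohnst", "du", "?"]
generated = []                         # English words produced so far

def context(german, generated):
    # the model's input at each step: German sentence, separator,
    # then the English generated so far
    return german + ["<SEP>"] + generated

for word in ["Where", "do", "you", "live", "?"]:   # pretend model outputs
    print(context(german, generated))
    generated.append(word)
```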

This will work, but it has room for improvement:

  • If the context length is fixed, sometimes the original sentence is lost
  • The model has a lot to learn here: two languages simultaneously, but also that <SEP> is the separator token where it needs to start translating
  • You are processing the entire German sentence, at different offsets, for every word you generate. This means there will be different internal representations of the same thing, and the model should be able to work through it all for translation

The transformer was originally created for this task and consists of an “encoder” and a “decoder”, which are basically two separate blocks. One block simply takes the German sentence and gives out an intermediate representation (again, basically a bunch of numbers); this is called the encoder.

The second block generates words (we’ve seen a lot of this so far). The only difference is that in addition to feeding it the words generated so far, we also feed it the encoded German sentence (from the encoder block). So as it generates language, its context is basically all the words generated so far, plus the German. This block is called the decoder.

Each of these encoders and decoders consists of a few blocks, notably the attention block sandwiched between other layers. Let’s look at the illustration of a transformer from the paper “Attention is all you need” and try to understand it:

Image from Vaswani et al. (2017)

The vertical set of blocks on the left is called the “encoder” and the one on the right is called the “decoder”. Let’s go over and understand anything that we have not already covered before:

Recap on how to read the diagram: Each of the boxes here is a block that takes in some inputs in the form of neurons and spits out a set of neurons as output, which can then either be processed by the next block or interpreted by us. The arrows show where the output of a block goes. As you can see, we will often take the output of one block and feed it in as input into multiple blocks. Let’s go through each thing here:

Feed forward: A feedforward network is one that does not contain cycles. Our original network in section 1 is a feedforward network. In fact, this block uses very much the same structure: it contains two linear layers with a RELU in between (see the note on RELU in the first section) and a dropout layer. Keep in mind that this feedforward network applies to each position independently. What this means is that the information at position 0 passes through a feedforward network, the information at position 1 passes through one, and so on. But the neurons from position x do not have a linkage to the feedforward network of position y. This is important because if we did not do this, it would allow the network to cheat during training by looking forward.
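
A minimal numpy sketch of this block (sizes are made up, and dropout is omitted for brevity). Note that the same weights are applied to every row, i.e. every position, independently:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def feed_forward(x, W1, b1, W2, b2):
    # Applied to each position (row of x) independently: the same
    # weights are used at every position, and no information moves
    # between positions inside this block.
    return relu(x @ W1 + b1) @ W2 + b2

seq_len, d, hidden = 4, 8, 32            # toy sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d))
out = feed_forward(x, rng.normal(size=(d, hidden)), np.zeros(hidden),
                   rng.normal(size=(hidden, d)), np.zeros(d))
print(out.shape)                         # (4, 8): same shape, per position
```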

Cross-attention: You will notice that the decoder has a multi-head attention block with arrows coming from the encoder. What is going on here? Remember the value, key, and query in self-attention and multi-head attention? They all came from the same sequence (in fact, the query came just from the last word of the sequence). So what if we kept the query but fetched the value and key from a completely different sequence altogether? That is what is happening here: the value and key come from the output of the encoder. Nothing has changed mathematically except where the inputs for key and value come from.
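
A minimal sketch, reusing the attention math from before: the only change is that the keys and values are computed from the encoder output rather than from the decoder’s own sequence. All sizes and matrices below are invented:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    # Queries come from the decoder's own sequence; keys and values
    # come from the encoder's output. The math is unchanged.
    q = decoder_x @ Wq
    k = encoder_out @ Wk
    v = encoder_out @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8
english_so_far = rng.normal(size=(3, d))   # decoder side (3 tokens)
encoded_german = rng.normal(size=(5, d))   # encoder output (5 tokens)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(english_so_far, encoded_german, Wq, Wk, Wv).shape)  # (3, 8)
```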

Nx: The Nx here simply represents that this block is chain-repeated N times. So basically you are stacking the block back-to-back, passing the output of the previous block to the next one. This is a way to make the neural network deeper. Now, looking at the diagram, there is room for confusion about how the encoder output is fed to the decoder. Let’s say N=5. Do we feed the output of each encoder layer to the corresponding decoder layer? No. Basically you run the encoder all the way through once, and only once. Then you take that representation and feed the same thing to every one of the 5 decoder layers.
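
A toy sketch of just this wiring (the layer bodies are meaningless stand-ins): the encoder stack runs once, and its final output is handed unchanged to every decoder layer:

```python
def encoder_layer(x):
    return x + 1             # stand-in for attention + feedforward

def decoder_layer(x, encoder_out):
    return x + encoder_out   # stand-in for self-attention + cross-attention + feedforward

def transformer(src, tgt, n=5):
    enc = src
    for _ in range(n):               # encoder runs all the way through once
        enc = encoder_layer(enc)
    out = tgt
    for _ in range(n):               # the same enc is given to all n decoder layers
        out = decoder_layer(out, enc)
    return out

print(transformer(src=0.0, tgt=0.0))   # toy scalar "sequences": 25.0
```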

Add & Norm block: This is basically the same as below (presumably the authors were just trying to save space)

Image by author

Everything else has already been discussed. Now you have a complete explanation of the transformer architecture, built up from simple sum and product operations and fully self-contained! You know what every line, every sum, every box and word means in terms of how to build them from scratch. Theoretically, these notes contain what you need to code up a transformer from scratch. In fact, if you are interested, this repo does that for the GPT architecture above.

Building a pre-trained model

At this point, we have all the parts necessary to design and train an LLM. Let’s put the pieces together for an English language model:

  • First, we build a tokenizer that can encode and decode the English language, as we talked about in the subword tokenizer section. Let’s assume a vocabulary size of 32k
  • Next, we build an LLM with the transformer architecture. The output vector and the embedding matrix must both have 32k elements, equal to the size of the vocabulary
  • Now we collect a corpus of English text to train the model on. This is our training data. Often the key part of this dataset is a crawl of the entire internet, something like the Common Crawl
  • We can now start training the model by asking it to predict the next token, as discussed in the training section. You do this by defining a loss function for the next token (which you already know from your training data) and pushing the network to predict that token; a minimal sketch of this loop follows the list
  • Doing this over tens of billions or trillions of tokens will give you a set of weights: your pre-trained model. This model is capable of predicting the next token, and it can now be used to complete sentences, or even entire articles
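
Here is a heavily simplified sketch of that training loop, with the transformer replaced by a single embedding/unembedding pair and a context of one token, so the gradient can be written by hand. A real implementation would feed the full context through the transformer and use a framework with automatic differentiation; all names and numbers here are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab, d = 100, 8
rng = np.random.default_rng(0)
emb = rng.normal(size=(vocab, d)) * 0.1      # toy "model": embedding matrix
unemb = rng.normal(size=(d, vocab)) * 0.1    # ...and unembedding matrix

tokens = [5, 42, 7, 42, 7]                   # a tiny "corpus" of token ids
lr = 0.1
for step in range(100):
    for i in range(len(tokens) - 1):
        x = emb[tokens[i]]                   # context of one token, for simplicity
        probs = softmax(x @ unemb)
        target = tokens[i + 1]
        loss = -np.log(probs[target])        # cross-entropy for the next token
        grad_logits = probs.copy()           # gradient of loss w.r.t. the logits
        grad_logits[target] -= 1.0
        unemb -= lr * np.outer(x, grad_logits)   # one gradient-descent step
print(round(loss, 3))                        # loss shrinks as training proceeds
```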

Appendix

Matrix Multiplication

We introduced vectors and matrices above in the context of embeddings. A matrix has two dimensions (a number of rows and a number of columns). A vector can also be thought of as a matrix where one of the dimensions equals one. The product of two matrices is defined as:

Image by author

Dots represent multiplication. Now let’s take a second look at the calculation of the blue and orange neurons in the very first picture. If we write the weights as a matrix and the inputs as vectors, we can write the whole operation in the following way:

Image by author

If the weight matrix is called “W” and the inputs are called “x”, then Wx is the result (the middle layer in this case). We can also transpose the two and write it as xW; this is a matter of preference.
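
For example (with made-up numbers), each row of W below holds the weights of one middle-layer neuron, and Wx computes all the weighted sums at once:

```python
import numpy as np

W = np.array([[0.1, 0.2, 0.3, 0.4],    # weights of the first middle neuron
              [0.5, 0.6, 0.7, 0.8]])   # weights of the second middle neuron
x = np.array([1.0, 2.0, 3.0, 4.0])     # inputs (R, G, B, Vol, say)

print(W @ x)          # [3. 7.], same as doing each dot product by hand
print(x @ W.T)        # the transposed xW form gives the same numbers
```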

Standard deviation

We use the concept of standard deviation in the Layer Normalization section. Standard deviation is a statistical measure of how spread out the values in a set of numbers are. For example, if the values are all the same, the standard deviation is zero; if, in general, each value is really far from the mean of those very same values, the standard deviation is high. To calculate the standard deviation of a set of N numbers a1, a2, a3, …: subtract the mean (of these numbers) from each of the numbers, then square each of the N answers. Add up all these squares, divide by N, and take the square root of the result.
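
A quick worked example in numpy, following that recipe step by step:

```python
import numpy as np

a = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = a.sum() / len(a)                      # 5.0
variance = ((a - mean) ** 2).sum() / len(a)  # 4.0
std = variance ** 0.5                        # 2.0
print(mean, variance, std)                   # matches np.std(a)
```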

Positional Encoding

(this part is beyond middle school math)

We talked about positional embeddings above. A positional encoding is simply a vector of the same length as the word embedding vector, except it is not an embedding in the sense that it is not trained. We simply assign a unique vector to every position, e.g. one vector for position 1, a different one for position 2, and so on.

A simple way of doing this is to fill the vector for each position with the position number. So the vector for position 1 would be [1,1,1…1], for position 2 it would be [2,2,2…2], and so on (remember the length of each vector must match the embedding length for the addition to work). This is problematic because we can end up with large numbers in the vectors, which creates challenges during training. We could, of course, normalize these vectors by dividing every number by the maximum position: if there are 3 words total, then position 1 is [.33,.33,…,.33] and position 2 is [.67,.67,…,.67], and so on. But now we are constantly changing the encoding for position 1 (those numbers will be different when we feed a 4-word sentence as input), which creates challenges for the network to learn.

So here, we want a scheme that allocates a unique vector to each position, where the numbers don’t explode. Basically, if the context length is d (i.e., the maximum number of tokens/words that we can feed into the network for predicting the next token/word; see the discussion in the “how does it all generate language?” section) and the length of the embedding vector is, say, 10, then we need a matrix with 10 rows and d columns where all the columns are unique and all the numbers lie between 0 and 1. Given that there are infinitely many numbers between 0 and 1, and the matrix is finitely sized, this can be done in many ways.

The approach used in the “Attention is all you need” paper goes something like this:

  • Draw 10 sine curves, each being si(p) = sin(p / 10000^(i/d)) (that’s 10000 raised to the power i/d)
  • Fill the encoding matrix with numbers such that the (i,p)th number is si(p); e.g., for position 1 the 5th element of the encoding vector is s5(1) = sin(1 / 10000^(5/d))

Why choose this method? By changing the power on 10000, you are changing the wavelength of the sine function when viewed on the p-axis. And if you have 10 different sine functions with 10 different wavelengths, then it will be a long time before you get a repetition (i.e. all 10 values are the same) as p increases. This helps give us unique values. Now, the actual paper uses both sine and cosine functions, and the form of the encoding is: si(p) = sin(p / 10000^(i/d)) if i is even, and si(p) = cos(p / 10000^(i/d)) if i is odd.
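
A small sketch that fills such a matrix following the even/odd scheme above (here d is taken to be the embedding length, as in the original paper; the paper’s exact indexing pairs sin and cos slightly differently, but the idea is the same):

```python
import numpy as np

def positional_encoding(context_len, d):
    # pe[p, i] = sin(p / 10000^(i/d)) for even i, cos(...) for odd i,
    # following the scheme described above.
    pe = np.zeros((context_len, d))
    for p in range(context_len):
        for i in range(d):
            angle = p / (10000 ** (i / d))
            pe[p, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

pe = positional_encoding(context_len=6, d=10)
print(pe.shape)        # (6, 10): one unique 10-number vector per position
```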
