The Illustrated Transformer

  • A High-Level Look
  • Bringing The Tensors Into The Picture
  • Now We’re Encoding!
  • Self-Attention at a High Level
  • Self-Attention in Detail
  • Matrix Calculation of Self-Attention
  • The Beast With Many Heads
  • Representing The Order of The Sequence Using Positional Encoding
  • The Residuals
  • The Decoder Side
  • The Final Linear and Softmax Layer
  • Recap Of Training
  • The Loss Function
  • Go Forth And Transform

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make them easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.


The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).


Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.


As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

Each word is embedded into a vector of size 512. We’ll represent those vectors with these simple boxes:

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
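To make the "list of vectors" concrete, here is a minimal NumPy sketch of an embedding lookup. The three-word vocabulary and the random table are made up for illustration; in a real model the table holds trained values and the vocabulary is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"je": 0, "suis": 1, "étudiant": 2}  # toy vocabulary, for illustration only
d_model = 512

# Embedding table: one 512-dimensional row per word (random stand-ins for trained values).
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["je", "suis", "étudiant"]
token_ids = [vocab[w] for w in sentence]

# Row lookup: the (3, 512) list of vectors that the bottom encoder receives.
X = embedding_table[token_ids]
```

Every encoder above the bottom one receives an array of the same shape, produced by the encoder below it rather than by this lookup.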

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.


Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.


Now We’re Encoding!

As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network – the exact same network with each vector flowing through it separately.

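A minimal NumPy sketch of that "same network, each vector separately" idea. The hidden width of 2048 and the ReLU activation are taken from the paper rather than from this post, the weights are random stand-ins, and `feed_forward` is a name chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # d_ff = 2048 is the paper's value, assumed here

# One set of weights, shared by every position (random stand-ins for trained values).
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between; accepts one vector or a stack of them."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Three positions, each a 512-dimensional vector coming out of self-attention.
z = rng.normal(size=(3, d_model))

all_at_once = feed_forward(z)                             # whole sequence in one call
one_by_one = np.stack([feed_forward(row) for row in z])   # each position on its own
```

Both calls give the same result, which is why the feed-forward step can run over all positions in parallel.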

Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:


The animal didn't cross the street because it was too tired

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.


If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.


As we are encoding the word “it” in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on “The Animal”, and baked a part of its representation into the encoding of “it”.

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.


The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They don’t HAVE to be smaller; this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
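A minimal NumPy sketch of this first step for a single word. The three matrices are random stand-ins for trained weights, and the names `W_Q`, `W_K`, `W_V` and `x1` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Random stand-ins for the three trained projection matrices.
W_Q = rng.normal(scale=0.02, size=(d_model, d_k))
W_K = rng.normal(scale=0.02, size=(d_model, d_k))
W_V = rng.normal(scale=0.02, size=(d_model, d_k))

x1 = rng.normal(size=d_model)  # embedding of one word

q1 = x1 @ W_Q  # query vector, 64 values
k1 = x1 @ W_K  # key vector, 64 values
v1 = x1 @ W_V  # value vector, 64 values
```

With eight heads of 64 dimensions each (as in the paper), the heads together span 8 × 64 = 512 dimensions, so the total work stays close to that of a single full-width head.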


What are the “query”, “key”, and “value” vectors?


They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients, and other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. The word at this position will often have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

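Steps two through six for the first word can be sketched in a few lines of NumPy. The query, key and value vectors below are random placeholders for the output of step one, and the two-word input is an assumption made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Placeholder query/key/value vectors for a two-word input (one row per word).
q = rng.normal(size=(2, d_k))
k = rng.normal(size=(2, d_k))
v = rng.normal(size=(2, d_k))

# Step 2: score every word against the first word's query.
scores = np.array([q[0] @ k[0], q[0] @ k[1]])

# Step 3: divide by sqrt(d_k), which is 8 for d_k = 64.
scaled = scores / np.sqrt(d_k)

# Step 4: softmax (subtracting the max first is a standard numerical-stability trick).
exp = np.exp(scaled - scaled.max())
weights = exp / exp.sum()

# Steps 5 and 6: weight each value vector, then sum them.
z1 = weights[0] * v[0] + weights[1] * v[1]
```

`z1` is the self-attention output for the first position; repeating the same steps with `q[1]` gives the output for the second.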

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure).

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
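In the paper's notation that single formula is Z = softmax(Q Kᵀ / √d_k) V, where d_k is the key dimension (64). A minimal NumPy sketch, with random stand-ins for the trained weights and helper names (`softmax`, `self_attention`) chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along one axis, with the usual max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for a whole sentence at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_words, n_words)
    return weights @ V                         # (n_words, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))  # two words, one row each
W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(512, 64)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
```

Row i of `Z` is the same vector the word-by-word procedure would produce for position i.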



The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

  1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. This matters when we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, where we want to know which word “it” refers to.

  2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concatenate the matrices, then multiply them by an additional weights matrix WO.
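A minimal NumPy sketch of the whole multi-headed computation: eight independent heads, concatenation, then the projection by WO. All weights are random stand-ins, and the variable names are chosen for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 2, 512, 8
d_k = d_model // n_heads  # 64

X = rng.normal(size=(n_words, d_model))

# One (W_Q, W_K, W_V) triple per head, plus the output matrix W_O (all random stand-ins).
heads = [tuple(rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(scale=0.02, size=(n_heads * d_k, d_model))

Z_per_head = []
for W_Q, W_K, W_V in heads:
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    Z_per_head.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # (n_words, 64)

Z_concat = np.concatenate(Z_per_head, axis=-1)  # (n_words, 512)
Z = Z_concat @ W_O                              # (n_words, 512), one vector per word
```

The result has one 512-dimensional vector per word, which is the shape the feed-forward layer expects.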

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:


As we encode the word “it”, one attention head is focusing most on “the animal”, while another is focusing on “tired” – in a sense, the model’s representation of the word “it” bakes in some of the representation of both “animal” and “tired”.

If we add all the attention heads to the picture, however, things can be harder to interpret:



Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.


To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

To give the model a sense of the order of the words, we add positional encoding vectors – the values of which follow a specific pattern.


If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

A real example of positional encoding with a toy embedding size of 4

What might this pattern look like?


In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That’s because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They’re then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).

July 2020 Update: The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn’t directly concatenate, but interweaves the two signals. The following figure shows what that looks like. Here’s the code to generate it:
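As a rough illustration of the two layouts, here is a minimal NumPy sketch written for this walkthrough. It follows the paper's formula (sine and cosine of pos / 10000^(2i/d_model)); the concatenated branch only mimics the left/right split described above and is not a reproduction of `get_timing_signal_1d()`. The function name and the `interleave` flag are made up for the example.

```python
import numpy as np

def positional_encoding(n_positions, d_model, interleave=True):
    """Sinusoidal position signals; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]   # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    if interleave:
        # Paper layout: sine on even indices, cosine on odd indices.
        pe = np.empty((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe
    # Concatenated layout: all sines on the left, all cosines on the right.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe_interleaved = positional_encoding(20, 512)
pe_concat = positional_encoding(20, 512, interleave=False)
```

Both arrays hold the same values in a different column order, and each row is what gets added to the embedding at that position.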


The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
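
In the paper this "add and normalize" step is written as LayerNorm(x + Sublayer(x)). A minimal NumPy sketch, leaving out the learned gain and bias that a full layer normalization carries; `add_and_norm` is a name chosen for illustration, and a plain matrix product stands in for the sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection around the sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))
W = rng.normal(scale=0.02, size=(512, 512))

# Any shape-preserving function can stand in for self-attention or the FFN here.
out = add_and_norm(x, lambda h: h @ W)
```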



If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:



This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:


The Decoder Side

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence:


After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

The self-attention layers in the decoder operate in a slightly different way than those in the encoder:


In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.

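A minimal NumPy sketch of that masking step on a 4-position sequence. The scores are random placeholders for the scaled query-key dot products.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # row i: scores of position i against every position

# True strictly above the diagonal, i.e. wherever a position would look at a later one.
future = np.triu(np.ones((n, n), dtype=bool), k=1)

masked_scores = np.where(future, -np.inf, scores)
weights = softmax(masked_scores)  # exp(-inf) is 0, so future positions get zero weight
```

The first row of `weights` puts all of its weight on position 0, the second row spreads it over positions 0 and 1, and so on.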

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrices from the output of the encoder stack.
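A minimal single-head NumPy sketch of where the three matrices come from. The sequence lengths (5 source words, 3 target positions) and the random weights are assumptions for illustration; a full layer would repeat this for every head.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
encoder_output = rng.normal(size=(5, d_model))  # 5 source words, from the top encoder
decoder_hidden = rng.normal(size=(3, d_model))  # 3 target positions, from the layer below

W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))

Q = decoder_hidden @ W_Q  # queries come from the decoder
K = encoder_output @ W_K  # keys come from the encoder stack
V = encoder_output @ W_V  # values come from the encoder stack

weights = softmax(Q @ K.T / np.sqrt(d_k))  # (3, 5): each target position over the source
Z = weights @ V                            # (3, 64)
```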

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model after the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

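A minimal NumPy sketch of this last stage for one time step, using the 10,000-word vocabulary from the example. The Linear layer's weights and the decoder output are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

# Random stand-ins for the trained Linear layer.
W = rng.normal(scale=0.02, size=(d_model, vocab_size))
b = np.zeros(vocab_size)

decoder_output = rng.normal(size=d_model)  # vector for one time step

logits = decoder_output @ W + b            # one score per vocabulary word
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()                    # softmax: all positive, sums to 1

word_index = int(np.argmax(probs))         # index of the word chosen for this step
```

Looking up `word_index` in the output vocabulary gives the word emitted at this time step.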

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).

The output vocabulary of our model is created in the preprocessing phase before we even begin training.


Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:
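A minimal Python sketch of one-hot encoding over the six-word toy vocabulary, with the end-of-sentence token written as `<eos>`. The word order follows the list given earlier, and `one_hot` is a helper name chosen for illustration.

```python
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word, vocab):
    """Return a list with 1.0 at the word's index and 0.0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]

am = one_hot("am", vocab)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```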

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

Since the model’s parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model’s weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? As a first intuition, we simply subtract one from the other. In practice the comparison is made with measures such as cross-entropy and Kullback–Leibler divergence; look those up for more details.
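A minimal NumPy sketch of those two measures on the six-word toy vocabulary. The "untrained" and "trained" distributions are made-up numbers for illustration, not model outputs.

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum(target * log(predicted))."""
    return float(-np.sum(target * np.log(predicted + eps)))

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) = sum(target * log(target / predicted))."""
    return float(np.sum(target * (np.log(target + eps) - np.log(predicted + eps))))

target = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])          # one-hot for "thanks"
untrained = np.full(6, 1 / 6)                               # made-up untrained output
trained = np.array([0.01, 0.01, 0.01, 0.95, 0.01, 0.01])    # made-up trained output

loss_before = cross_entropy(target, untrained)
loss_after = cross_entropy(target, trained)
```

The loss drops as the predicted distribution moves toward the target. For a one-hot target the two measures coincide, because the target's own entropy is zero.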

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.

The targeted probability distributions we’ll train our model against in the training example for one sample sentence.


After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:


Hopefully upon training, the model would output the right translation we expect. Of course it’s no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it’s unlikely to be the output of that time step – that’s a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.

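A minimal Python sketch contrasting the two strategies. `next_word_probs` is a hypothetical stand-in for the decoder, backed by a hand-written table of made-up probabilities. Hypotheses are ranked by total log-probability with no length normalization, which is one common reading of "less error" and an assumption of this sketch.

```python
import math

def next_word_probs(prefix):
    """Hypothetical decoder stand-in: next-word probabilities for a given prefix."""
    table = {
        (): {"i": 0.6, "a": 0.4},
        ("i",): {"am": 0.9, "<eos>": 0.1},
        ("a",): {"student": 0.6, "<eos>": 0.4},
        ("i", "am"): {"a": 0.5, "student": 0.3, "<eos>": 0.2},
        ("i", "am", "a"): {"student": 1.0},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

def greedy_decode(max_len=10):
    """Keep only the single most probable word at every step."""
    prefix = []
    while len(prefix) < max_len:
        probs = next_word_probs(prefix)
        word = max(probs, key=probs.get)
        prefix.append(word)
        if word == "<eos>":
            break
    return prefix

def beam_search(beam_size=2, top_beams=2, max_len=10):
    """Keep the beam_size best partial translations; return the top_beams finished ones."""
    beams = [(0.0, [])]  # (total log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for word, p in next_word_probs(prefix).items():
                candidates.append((score + math.log(p), prefix + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            if cand[1][-1] == "<eos>":
                finished.append(cand)  # complete hypothesis
            else:
                beams.append(cand)     # still being extended
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return [words for _, words in finished[:top_beams]]

greedy = greedy_decode()
beams = beam_search()
```

With this table, greedy decoding returns “i am a student <eos>”, and beam search returns that same sentence first with “a student <eos>” as the runner-up.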

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

  • Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
  • Watch Łukasz Kaiser’s talk walking through the model and its details.
  • Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo.
  • Explore the Tensor2Tensor repo.