Review of
“Attention Is All You Need,”
Vaswani et al., 2017 (Google)
Juan M. Bello-Rivas
jmbr@jhu.edu
April 29, 2022
Outline
1 Introduction
2 Architecture
3 Additional pointers
Attention Is All You Need
Ashish Vaswani (Google Brain), avaswani@google.com
Noam Shazeer (Google Brain), noam@google.com
Niki Parmar (Google Research), nikip@google.com
Jakob Uszkoreit (Google Research), usz@google.com
Llion Jones (Google Research), llion@google.com
Aidan N. Gomez (University of Toronto), aidan@cs.toronto.edu
Łukasz Kaiser (Google Brain), lukaszkaiser@google.com
Illia Polosukhin, illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.0 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature.
1 Introduction
Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].
Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and
has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
our research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Transformers
Self-supervised.
Initially proposed for translation.
Can be used for a variety of purposes:
Zero-shot classification.
Text classification.
Sentiment analysis.
Named entity recognition.
Question answering.
Summarization.
Text generation.
Sentence completion.
etc.
More applications in image/video processing.
GPT-* (OpenAI), BERT (Google), etc. are examples of transformers.
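These tasks can be tried directly with pretrained transformers. A minimal sketch using the Hugging Face transformers library; the default checkpoints it downloads are an assumption here, not something from the paper:

```python
# Sketch: pretrained transformers applied to two of the tasks listed above.
# Assumes the `transformers` library is installed; default models are downloaded on first use.
from transformers import pipeline

# Sentiment analysis with the library's default English sentiment model.
sentiment = pipeline("sentiment-analysis")
print(sentiment("There's no place like home"))
# e.g. [{'label': 'POSITIVE', 'score': ...}]

# Zero-shot classification: score a sentence against labels it was never trained on.
zero_shot = pipeline("zero-shot-classification")
print(zero_shot("The Transformer dispenses with recurrence entirely",
                candidate_labels=["machine learning", "cooking", "sports"]))
```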
Outline
1 Introduction
2 Architecture
3 Additional pointers
Encoder
Let’s peel the layers one by one.
Encoder (cont.)
The input tokens $(x_1, \dots, x_n) \in \mathbb{Z}^n$ are determined by the sentence.
Example
“There’s no place like home” is mapped by the WordPiece tokenizer (BERT model) to:

 i   x_i    Correspondence
 1   101    [CLS]
 2   1247   There
 3   112    '
 4   188    s
 5   1185   no
 6   1282   place
 7   1176   like
 8   1313   home
 9   102    [SEP]
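A tokenization like the one above can be reproduced with the transformers library. A minimal sketch, assuming the bert-base-cased checkpoint (the exact ids depend on the vocabulary used):

```python
# Sketch: tokenizing a sentence with a WordPiece tokenizer (BERT).
# bert-base-cased is an assumed checkpoint; ids vary with the vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("There's no place like home")
print(encoding["input_ids"])                                   # integer token ids x_1, ..., x_n
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # e.g. ['[CLS]', 'There', "'", 's', ...]
```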
Encoder (cont.)
The inputs $(x_1, \dots, x_n) \in \mathbb{Z}^n$ are embedded into $(z_1, \dots, z_n)$, where $z_i \in \mathbb{R}^{d_\text{model}}$.
In the original paper, $d_\text{model} = 512$. All layers have width equal to $d_\text{model}$.
The embedding is either learned along with the model or pre-trained (torch.nn.Embedding, keras.layers.Embedding, etc.).
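A minimal sketch of a learned embedding in PyTorch; the vocabulary size below is illustrative, only $d_\text{model} = 512$ comes from the paper:

```python
import torch
import torch.nn as nn

d_model = 512        # width used throughout the model (as in the paper)
vocab_size = 30000   # illustrative vocabulary size, not from the paper

embedding = nn.Embedding(vocab_size, d_model)   # learned jointly with the rest of the model

x = torch.tensor([[101, 1247, 112, 188, 1185, 1282, 1176, 1313, 102]])  # token ids (batch of 1)
z = embedding(x)     # shape (1, 9, 512): one d_model-dimensional vector z_i per token
print(z.shape)
```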
Encoder (cont.)
The rationale for a positional encoding $p_i \in \mathbb{R}^{d_\text{model}}$, applied as
\[
z_i \mapsto z_i + p_i,
\]
is to give the model a sense of where each token is in the sentence.
\[
(p_i)_j =
\begin{cases}
\sin\bigl(i \cdot 10^{-4j/d_\text{model}}\bigr), & \text{if } j \text{ is even},\\
\cos\bigl(i \cdot 10^{-4(j-1)/d_\text{model}}\bigr), & \text{if } j \text{ is odd}.
\end{cases}
\]
[Plot: the components $(p_i)_j$ for $j = 0, \dots, 64$, with values in $[-1, 1]$, shown for $i = 0, 1, 5, 10, 15$.]
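A small sketch that evaluates the positional-encoding formula above (a direct transcription, not the original implementation):

```python
import torch

def positional_encoding(n: int, d_model: int) -> torch.Tensor:
    """Return the n x d_model matrix whose i-th row is p_i from the formula above."""
    p = torch.zeros(n, d_model)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # positions 0, ..., n-1
    j = torch.arange(0, d_model, 2, dtype=torch.float32)    # even component indices
    angle = i * 10.0 ** (-4.0 * j / d_model)                # i * 10^{-4j / d_model}
    p[:, 0::2] = torch.sin(angle)                           # even components
    p[:, 1::2] = torch.cos(angle)                           # odd components reuse the even angle (j-1)
    return p

pe = positional_encoding(n=9, d_model=512)
print(pe.shape)   # torch.Size([9, 512]); applied as z_i -> z_i + p_i, i.e. z = z + pe
```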
Encoder (cont.)
Fully connected feed-forward network:
\[
z \mapsto W_2\, \mathrm{ReLU}(W_1 z + b_1) + b_2.
\]
Residual layer (doi:10.1109/CVPR.2016.90):
\[
z \mapsto z + f(z).
\]
Repeat $N = 6$ times.
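A sketch of the feed-forward sublayer with its residual connection; $d_\text{ff} = 2048$ is the inner width reported in the paper, the rest is a minimal illustration:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048   # widths from the paper

class FeedForward(nn.Module):
    """z -> W2 ReLU(W1 z + b1) + b2, applied independently at every position."""
    def __init__(self) -> None:
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(z)))

ff = FeedForward()
z = torch.randn(1, 9, d_model)   # (batch, positions, d_model)
out = z + ff(z)                  # residual connection: z -> z + f(z)
```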
Encoder (cont.)
Layer normalization (arXiv:1607.06450) normalizes the activations of the network.
Ideally, one would center and rescale all the activations over the entire dataset so that they have zero mean and unit variance.
This is impractical, so layer normalization instead normalizes over the neurons within each layer, independently for each input.
Unlike batch normalization, it does not depend on the batch size.
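A quick sketch showing that torch.nn.LayerNorm normalizes over the feature dimension of each individual input, independently of the batch size:

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(512)          # normalizes over the last (feature) dimension

z = torch.randn(4, 9, 512)              # (batch, positions, d_model); the batch size plays no role
out = layer_norm(z)
print(out.mean(dim=-1).abs().max())     # ~0: each position's features are centered
print(out.std(dim=-1).mean())           # ~1: and rescaled to roughly unit variance
```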
Attention
Attention (cont.)
“Attention is something where you make a query
with a vector and then you basically look at
similar things in your past. [. . .] [A]ttention looks
at everything but gets things that are similar.
[. . .] When you retrieve similar things you can
look at a very long context.”
L. Kaiser, contrasting the attention
architecture with convolutional neural networks
(https://youtu.be/rBCqOTEfxvg).
Attention (cont.)
Let $q_i, k_i, v_i \in \mathbb{R}^{d_\text{model}}$ for $i = 1, \dots, n$ be parameters of the NN.
Consider
\[
Q = \begin{bmatrix} q_1^\top \\ \vdots \\ q_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_\text{model}}
\quad \text{and} \quad
K^\top = [k_1, \dots, k_n] \in \mathbb{R}^{d_\text{model} \times n}.
\]
Thus, $Q K^\top \in \mathbb{R}^{n \times n}$.
Attention (cont.)
Applying
\[
\mathrm{softmax}(u)_i = \frac{e^{u_i}}{\sum_{j=1}^{n} e^{u_j}}
\]
row-wise to $Q K^\top$, we obtain the stochastic matrix
\[
\mathrm{softmax}\Bigl(\tfrac{1}{\sqrt{d_\text{model}}}\, Q K^\top\Bigr).
\]
The rows of the above matrix are probability mass functions of discrete distributions.
The scaling by $1/\sqrt{d_\text{model}}$ keeps the dot products $q_i^\top k_j$ at unit variance (assuming the components of the $q_i$ and $k_j$ have zero mean and unit variance), which keeps the softmax away from its saturated regime.
Attention (cont.)
Let
\[
V = \begin{bmatrix} v_1^\top \\ \vdots \\ v_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_\text{model}}.
\]
Then, attention is defined by
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Bigl(\tfrac{1}{\sqrt{d_\text{model}}}\, Q K^\top\Bigr) V.
\]
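The definition above transcribes directly into code. A sketch, scaling by $1/\sqrt{d_\text{model}}$ as in the single-head formulation on this slide:

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with the softmax taken row-wise."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # n x n matrix of scaled dot products
    weights = torch.softmax(scores, dim=-1)       # each row is a probability mass function
    return weights @ V                            # rows of the output are weighted averages of the v_j

n, d_model = 9, 512
Q, K, V = torch.randn(n, d_model), torch.randn(n, d_model), torch.randn(n, d_model)
print(attention(Q, K, V).shape)   # torch.Size([9, 512])
```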
Attention (cont.)
The intuition here seems to be that each word is embedded in three different spaces: the “query space,” the “key space,” and the “value space.”
The alignment between pairs of vectors, $q_i^\top k_j$, determines how related the two words are (the embedding is asymmetric: queries and keys play different roles).
The dot product between each row of $\mathrm{softmax}\bigl(\tfrac{1}{\sqrt{d_\text{model}}}\, Q K^\top\bigr)$ and each column of $V$ is an expectation value: a weighted average of the corresponding components of the $v_j$.
Multihead attention
Rationale: multiple parallel levels of attention.
Let $h$ be the number of heads and let $d_k, d_v \in \mathbb{N}$ such that $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$.
Attention is rescaled as
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Bigl(\tfrac{1}{\sqrt{d_k}}\, Q K^\top\Bigr) V.
\]
In the paper, $d_k = d_v = d_\text{model} / h$.
Multihead attention (cont.)
The inputs here are linearly mapped prior to the attention mechanism:
\[
\mathrm{head}_\ell = \mathrm{Attention}\bigl(Q W_\ell^Q,\; K W_\ell^K,\; V W_\ell^V\bigr),
\]
for $\ell = 1, \dots, h$, where $W_\ell^Q, W_\ell^K \in \mathbb{R}^{d_\text{model} \times d_k}$ and $W_\ell^V \in \mathbb{R}^{d_\text{model} \times d_v}$.
Multihead attention (cont.)
Finally,
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\]
where $W^O \in \mathbb{R}^{h d_v \times d_\text{model}}$.
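A sketch of multi-head attention in which the per-head projections $W_\ell^Q, W_\ell^K, W_\ell^V$ are packed into single linear layers (torch.nn.MultiheadAttention provides an equivalent built-in):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, with d_k = d_v = d_model / h."""
    def __init__(self, d_model: int = 512, h: int = 8) -> None:
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One linear layer per role packs the h projection matrices W_l^{Q,K,V}.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        n = Q.shape[0]
        def split(x: torch.Tensor) -> torch.Tensor:
            return x.view(n, self.h, self.d_k).transpose(0, 1)      # (h, n, d_k)
        q, k, v = split(self.w_q(Q)), split(self.w_k(K)), split(self.w_v(V))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5          # (h, n, n)
        heads = torch.softmax(scores, dim=-1) @ v                   # (h, n, d_k)
        concat = heads.transpose(0, 1).reshape(n, self.h * self.d_k)
        return self.w_o(concat)

mha = MultiHeadAttention()
x = torch.randn(9, 512)
print(mha(x, x, x).shape)   # self-attention: torch.Size([9, 512])
```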
Examples
[Attention visualization (Figure 3 of the paper, “Input-Input Layer 5”): encoder self-attention weights between the sentence “It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult. <EOS>” and itself.]
Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.
Examples (cont.)
[Attention visualization (Figure 4 of the paper, “Input-Input Layer 5”): two panels of encoder self-attention weights over the sentence “The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion. <EOS>”.]
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6. Note that the attentions are very sharp for this word.
Examples (cont.)
[Attention visualization (Figure 5 of the paper, “Input-Input Layer 5”): encoder self-attention weights over the same sentence, “The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion. <EOS>”.]
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
Decoder
The decoder is mostly like the encoder but with
an additional set of layers that perform
multi-head attention on the output of the
encoder.
In the encoder, we have self-attention layers. In the decoder, we mix self-attention with cross-attention over the encoder output.
Decoder (cont.)
The output from the encoder is used as K and V
in the decoder.
The decoder's self-attention is masked so that each position only sees the previous words.
Auto-regressive: each output word is appended to
the previous outputs.
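A sketch of the masking: a causal mask forbids each position from attending to later positions (the helper below is illustrative, not the paper's code):

```python
import torch

def masked_self_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where position i may only attend to positions j <= i."""
    n, d = Q.shape
    scores = Q @ K.T / d ** 0.5
    causal_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()  # True above the diagonal
    scores = scores.masked_fill(causal_mask, float("-inf"))        # forbidden positions get zero weight
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(5, 512)
out = masked_self_attention(x, x, x)   # row i depends only on rows 0..i of the input
```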
Outline
1 Introduction
2 Architecture
3 Additional pointers
Additional pointers
https://geometricdeeplearning.com/
https://youtu.be/rBCqOTEfxvg
http://nlp.seas.harvard.edu/2018/04/03/attention.html
https://jalammar.github.io/illustrated-transformer/
https://huggingface.co