Review of
“Attention Is All You Need,”
Vaswani et al., 2017 (Google)
Juan M. Bello-Rivas
jmbr@jhu.edu
April 29, 2022
Outline
1 Introduction
2 Architecture
3 Additional pointers
Attention Is All You Need
Ashish Vaswani (Google Brain), avaswani@google.com
Noam Shazeer (Google Brain), noam@google.com
Niki Parmar (Google Research), nikip@google.com
Jakob Uszkoreit (Google Research), usz@google.com
Llion Jones (Google Research), llion@google.com
Aidan N. Gomez (University of Toronto), aidan@cs.toronto.edu
Łukasz Kaiser (Google Brain), lukaszkaiser@google.com
Illia Polosukhin, illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.0 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature.
1 Introduction
Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].
Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and
has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
our research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Transformers
Self-supervised.
Initially proposed for translation.
Can be used for a variety of purposes:
Zero-shot classification.
Text classification.
Sentiment analysis.
Named entity recognition.
Question answering.
Summarization.
Text generation.
Sentence completion.
etc.
More applications in image/video processing.
GPT-* (OpenAI), BERT (Google), etc. are examples of transformers.
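These tasks can be tried directly with pretrained transformers. A minimal sketch using the Hugging Face transformers library; the default checkpoints it downloads are an assumption here, not something from the paper:

```python
# Sketch: pretrained transformers applied to two of the tasks listed above.
# Assumes the `transformers` library is installed; default models are downloaded on first use.
from transformers import pipeline

# Sentiment analysis with the library's default English sentiment model.
sentiment = pipeline("sentiment-analysis")
print(sentiment("There's no place like home"))
# e.g. [{'label': 'POSITIVE', 'score': ...}]

# Zero-shot classification: score a sentence against labels it was never trained on.
zero_shot = pipeline("zero-shot-classification")
print(zero_shot("The Transformer dispenses with recurrence entirely",
                candidate_labels=["machine learning", "cooking", "sports"]))
```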
Outline
1 Introduction
2 Architecture
3 Additional pointers
Encoder
Let’s peel the layers one by one.
Encoder (cont.)
The input tokens $(x_1, \dots, x_n) \in \mathbb{Z}^n$ are determined by the sentence.
Example
“There’s no place like home” is mapped by the WordPiece tokenizer (BERT model) to:

 i   x_i    Correspondence
 1   101    [CLS]
 2   1247   There
 3   112    '
 4   188    s
 5   1185   no
 6   1282   place
 7   1176   like
 8   1313   home
 9   102    [SEP]
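A tokenization like the one above can be reproduced with the transformers library. A minimal sketch, assuming the bert-base-cased checkpoint (the exact ids depend on the vocabulary used):

```python
# Sketch: tokenizing a sentence with a WordPiece tokenizer (BERT).
# bert-base-cased is an assumed checkpoint; ids vary with the vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("There's no place like home")
print(encoding["input_ids"])                                   # integer token ids x_1, ..., x_n
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # e.g. ['[CLS]', 'There', "'", 's', ...]
```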
Encoder (cont.)
The inputs $(x_1, \dots, x_n) \in \mathbb{Z}^n$ are embedded into $(z_1, \dots, z_n)$, where $z_i \in \mathbb{R}^{d_\text{model}}$.
In the original paper, $d_\text{model} = 512$. All layers have width equal to $d_\text{model}$.
The embedding is either learned along with the model or pre-trained (torch.nn.Embedding, keras.layers.Embedding, etc.).
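A minimal sketch of a learned embedding in PyTorch; the vocabulary size below is illustrative, only $d_\text{model} = 512$ comes from the paper:

```python
import torch
import torch.nn as nn

d_model = 512        # width used throughout the model (as in the paper)
vocab_size = 30000   # illustrative vocabulary size, not from the paper

embedding = nn.Embedding(vocab_size, d_model)   # learned jointly with the rest of the model

x = torch.tensor([[101, 1247, 112, 188, 1185, 1282, 1176, 1313, 102]])  # token ids (batch of 1)
z = embedding(x)     # shape (1, 9, 512): one d_model-dimensional vector z_i per token
print(z.shape)
```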
Encoder (cont.)
The rationale for a positional encoding $p_i \in \mathbb{R}^{d_\text{model}}$, applied as
\[
z_i \mapsto z_i + p_i,
\]
is to give the model a sense of where each token is in the sentence.
\[
(p_i)_j =
\begin{cases}
\sin\bigl(i \cdot 10^{-4j/d_\text{model}}\bigr), & \text{if } j \text{ is even},\\
\cos\bigl(i \cdot 10^{-4(j-1)/d_\text{model}}\bigr), & \text{if } j \text{ is odd}.
\end{cases}
\]
[Plot: the components $(p_i)_j$ for $j = 0, \dots, 64$, with values in $[-1, 1]$, shown for $i = 0, 1, 5, 10, 15$.]
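A small sketch that evaluates the positional-encoding formula above (a direct transcription, not the original implementation):

```python
import torch

def positional_encoding(n: int, d_model: int) -> torch.Tensor:
    """Return the n x d_model matrix whose i-th row is p_i from the formula above."""
    p = torch.zeros(n, d_model)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # positions 0, ..., n-1
    j = torch.arange(0, d_model, 2, dtype=torch.float32)    # even component indices
    angle = i * 10.0 ** (-4.0 * j / d_model)                # i * 10^{-4j / d_model}
    p[:, 0::2] = torch.sin(angle)                           # even components
    p[:, 1::2] = torch.cos(angle)                           # odd components reuse the even angle (j-1)
    return p

pe = positional_encoding(n=9, d_model=512)
print(pe.shape)   # torch.Size([9, 512]); applied as z_i -> z_i + p_i, i.e. z = z + pe
```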
Encoder (cont.)
Fully connected feed-forward network:
\[
z \mapsto W_2\, \mathrm{ReLU}(W_1 z + b_1) + b_2.
\]
Residual layer (doi:10.1109/CVPR.2016.90):
\[
z \mapsto z + f(z).
\]
Repeat $N = 6$ times.
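A sketch of the feed-forward sublayer with its residual connection; $d_\text{ff} = 2048$ is the inner width reported in the paper, the rest is a minimal illustration:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048   # widths from the paper

class FeedForward(nn.Module):
    """z -> W2 ReLU(W1 z + b1) + b2, applied independently at every position."""
    def __init__(self) -> None:
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(z)))

ff = FeedForward()
z = torch.randn(1, 9, d_model)   # (batch, positions, d_model)
out = z + ff(z)                  # residual connection: z -> z + f(z)
```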
Encoder (cont.)
Layer normalization (arXiv:1607.06450) normalizes the activations of the network.
Ideally, one would center and rescale all the activations over the entire dataset so that they have zero mean and unit variance.
This is impractical, so layer normalization instead normalizes over the neurons within each layer, independently for each input.
Unlike batch normalization, it does not depend on the batch size.
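A quick sketch showing that torch.nn.LayerNorm normalizes over the feature dimension of each individual input, independently of the batch size:

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(512)          # normalizes over the last (feature) dimension

z = torch.randn(4, 9, 512)              # (batch, positions, d_model); the batch size plays no role
out = layer_norm(z)
print(out.mean(dim=-1).abs().max())     # ~0: each position's features are centered
print(out.std(dim=-1).mean())           # ~1: and rescaled to roughly unit variance
```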
Attention
Attention (cont.)
“Attention is something where you make a query
with a vector and then you basically look at
similar things in your past. [. . .] [A]ttention looks
at everything but gets things that are similar.
[. . .] When you retrieve similar things you can
look at a very long context.”
L. Kaiser, contrasting the attention
architecture with convolutional neural networks
(https://youtu.be/rBCqOTEfxvg).
Attention (cont.)
Let $q_i, k_i, v_i \in \mathbb{R}^{d_\text{model}}$ for $i = 1, \dots, n$ be parameters of the NN.
Consider
\[
Q = \begin{bmatrix} q_1^\top \\ \vdots \\ q_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_\text{model}}
\quad \text{and} \quad
K^\top = [k_1, \dots, k_n] \in \mathbb{R}^{d_\text{model} \times n}.
\]
Thus, $Q K^\top \in \mathbb{R}^{n \times n}$.
Attention (cont.)
Applying
\[
\mathrm{softmax}(u)_i = \frac{e^{u_i}}{\sum_{j=1}^{n} e^{u_j}}
\]
row-wise to $Q K^\top$, we obtain the stochastic matrix
\[
\mathrm{softmax}\Bigl(\tfrac{1}{\sqrt{d_\text{model}}}\, Q K^\top\Bigr).
\]
The rows of the above matrix are probability mass functions of discrete distributions.
The scaling by $1/\sqrt{d_\text{model}}$ keeps the dot products $q_i^\top k_j$ at unit variance (assuming the components of the $q_i$ and $k_j$ have zero mean and unit variance), which keeps the softmax away from its saturated regime.
Attention (cont.)
Let
\[
V = \begin{bmatrix} v_1^\top \\ \vdots \\ v_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d_\text{model}}.
\]
Then, attention is defined by
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Bigl(\tfrac{1}{\sqrt{d_\text{model}}}\, Q K^\top\Bigr) V.
\]
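The definition above transcribes directly into code. A sketch, scaling by $1/\sqrt{d_\text{model}}$ as in the single-head formulation on this slide:

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with the softmax taken row-wise."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # n x n matrix of scaled dot products
    weights = torch.softmax(scores, dim=-1)       # each row is a probability mass function
    return weights @ V                            # rows of the output are weighted averages of the v_j

n, d_model = 9, 512
Q, K, V = torch.randn(n, d_model), torch.randn(n, d_model), torch.randn(n, d_model)
print(attention(Q, K, V).shape)   # torch.Size([9, 512])
```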
Attention (cont.)
The intuition here seems to be that each word is embedded in three different spaces: the “query space,” the “key space,” and the “value space.”
The alignment between pairs of vectors, $q_i^\top k_j$, determines how related the two words are (the embedding is asymmetric: queries and keys play different roles).
The dot product between each row of $\mathrm{softmax}\bigl(\tfrac{1}{\sqrt{d_\text{model}}}\, Q K^\top\bigr)$ and each column of $V$ is an expectation value: a weighted average of the corresponding components of the $v_j$.
Multihead attention
Rationale: multiple parallel levels of attention.
Let $h$ be the number of heads and let $d_k, d_v \in \mathbb{N}$ such that $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$.
Attention is rescaled as
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Bigl(\tfrac{1}{\sqrt{d_k}}\, Q K^\top\Bigr) V.
\]
In the paper, $d_k = d_v = d_\text{model} / h$.
Multihead attention (cont.)
The inputs here are linearly mapped prior to the attention mechanism:
\[
\mathrm{head}_\ell = \mathrm{Attention}\bigl(Q W_\ell^Q,\; K W_\ell^K,\; V W_\ell^V\bigr),
\]
for $\ell = 1, \dots, h$, where $W_\ell^Q, W_\ell^K \in \mathbb{R}^{d_\text{model} \times d_k}$ and $W_\ell^V \in \mathbb{R}^{d_\text{model} \times d_v}$.
Multihead attention (cont.)
Finally,
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\]
where $W^O \in \mathbb{R}^{h d_v \times d_\text{model}}$.
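A sketch of multi-head attention in which the per-head projections $W_\ell^Q, W_\ell^K, W_\ell^V$ are packed into single linear layers (torch.nn.MultiheadAttention provides an equivalent built-in):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, with d_k = d_v = d_model / h."""
    def __init__(self, d_model: int = 512, h: int = 8) -> None:
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One linear layer per role packs the h projection matrices W_l^{Q,K,V}.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        n = Q.shape[0]
        def split(x: torch.Tensor) -> torch.Tensor:
            return x.view(n, self.h, self.d_k).transpose(0, 1)      # (h, n, d_k)
        q, k, v = split(self.w_q(Q)), split(self.w_k(K)), split(self.w_v(V))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5          # (h, n, n)
        heads = torch.softmax(scores, dim=-1) @ v                   # (h, n, d_k)
        concat = heads.transpose(0, 1).reshape(n, self.h * self.d_k)
        return self.w_o(concat)

mha = MultiHeadAttention()
x = torch.randn(9, 512)
print(mha(x, x, x).shape)   # self-attention: torch.Size([9, 512])
```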
Examples
[Attention visualization (Figure 3 of the paper, “Input-Input Layer 5”): encoder self-attention weights between the sentence “It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult. <EOS>” and itself.]
Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.
Examples (cont.)
[Attention visualization (Figure 4 of the paper, “Input-Input Layer 5”): two panels of encoder self-attention weights over the sentence “The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion. <EOS>”.]
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6. Note that the attentions are very sharp for this word.
Examples (cont.)
[Attention visualization (Figure 5 of the paper, “Input-Input Layer 5”): encoder self-attention weights over the same sentence, “The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion. <EOS>”.]
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
Decoder
The decoder is mostly like the encoder but with
an additional set of layers that perform
multi-head attention on the output of the
encoder.
In the encoder, we have self-attention layers. In the decoder, we mix self-attention with cross-attention over the encoder output.
Decoder (cont.)
The output from the encoder is used as K and V
in the decoder.
The decoder's self-attention is masked so that each position only sees the previous words.
Auto-regressive: each output word is appended to
the previous outputs.
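A sketch of the masking: a causal mask forbids each position from attending to later positions (the helper below is illustrative, not the paper's code):

```python
import torch

def masked_self_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where position i may only attend to positions j <= i."""
    n, d = Q.shape
    scores = Q @ K.T / d ** 0.5
    causal_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()  # True above the diagonal
    scores = scores.masked_fill(causal_mask, float("-inf"))        # forbidden positions get zero weight
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(5, 512)
out = masked_self_attention(x, x, x)   # row i depends only on rows 0..i of the input
```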
Outline
1 Introduction
2 Architecture
3 Additional pointers
Additional pointers
https://geometricdeeplearning.com/
https://youtu.be/rBCqOTEfxvg
http://nlp.seas.harvard.edu/2018/04/03/attention.html
https://jalammar.github.io/illustrated-transformer/
https://huggingface.co