Introduction to Graph Neural Networks
Juan M. Bello-Rivas
jmbr@jhu.edu
November 29, 2022
Outline
1 Graph Neural Networks
Graphs
A graph $G$ is a pair $G = (V, E)$, where $V$ is a set of vertices (or nodes) and $E \subseteq V \times V$ is a set of edges. The graph is said to be undirected if the ordering of the vertices within the edges does not matter and directed otherwise.
[Figure: two example graphs with numbered vertices illustrating the definition.]
Motivation
[Figure: examples of graphs, including a complete graph on vertices $a, b, c, d$ and a $4 \times 4$ grid graph with vertices labeled $(i, j)$.]
Adjacency and incidence matrices
The adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$ of a graph $G$ is given by
$$A_{ij} = \begin{cases} 1, & \text{if } (i, j) \in E, \\ 0, & \text{otherwise.} \end{cases}$$
The incidence matrix $B \in \mathbb{R}^{|V| \times |E|}$ of the graph $G$ is such that each of its column vectors $b_\ell \in \mathbb{R}^{|V|}$ for $\ell = 1, \ldots, |E|$ corresponds to an edge $(i, j) \in E$ and
$$(b_\ell)_k = \begin{cases} -1, & \text{if } k = i, \\ 1, & \text{if } k = j, \\ 0, & \text{otherwise.} \end{cases}$$
Combinatorial Laplacian
The degree $d_u$ of a vertex $u$ is the number of adjacent nodes.
The adjacency and incidence matrices are related by the formula
$$B B^\top = D - A,$$
where $D$ is a diagonal matrix whose entries are the degrees of each vertex (i.e., $D_{ii} = d_i$).
The matrix
$$L = D - A$$
is the combinatorial Laplacian.
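The following NumPy sketch (using an arbitrary four-vertex graph, not one from the slides) builds $A$, $B$, and $D$ for a small undirected graph and checks the identity $B B^\top = D - A = L$ numerically:

```python
# A minimal sketch: build the adjacency and incidence matrices of a small
# undirected graph and check that B B^T = D - A = L.
import numpy as np

# Undirected triangle plus a pendant vertex: V = {0, 1, 2, 3}.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n, m = 4, len(edges)

A = np.zeros((n, n))
B = np.zeros((n, m))
for ell, (i, j) in enumerate(edges):
    A[i, j] = A[j, i] = 1.0           # adjacency: symmetric for undirected graphs
    B[i, ell], B[j, ell] = -1.0, 1.0  # incidence: -1 at one endpoint, +1 at the other

D = np.diag(A.sum(axis=1))            # degree matrix
L = D - A                             # combinatorial Laplacian

assert np.allclose(B @ B.T, L)        # B B^T = D - A
print(L)
```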
Combinatorial Laplacian on a grid
[Figure: a $4 \times 4$ grid graph with vertices labeled by their coordinates $(i, j)$.]
$$\Delta f = f_{xx} + f_{yy} \approx \frac{1}{h^2}\left(f_{i+1,j} - 2 f_{i,j} + f_{i-1,j}\right) + \frac{1}{h^2}\left(f_{i,j+1} - 2 f_{i,j} + f_{i,j-1}\right)$$
$$\Delta = -(D - A)/h^2 = -L/h^2$$
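A small sketch of this correspondence, assuming a $4 \times 4$ grid with spacing $h = 1$, row-major node numbering, and a sample grid function chosen for illustration; the match with the five-point stencil holds at interior nodes:

```python
# Sketch: the combinatorial Laplacian of the grid graph reproduces the negative
# of the 5-point finite-difference stencil at interior nodes.
import numpy as np

n, h = 4, 1.0
idx = lambda i, j: i * n + j          # row-major node numbering

A = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ii, jj = i + di, j + dj
            if 0 <= ii < n and 0 <= jj < n:
                A[idx(i, j), idx(ii, jj)] = 1.0

L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian D - A

f = lambda i, j: (i * h) ** 2 + 3 * (j * h)   # sample grid function
fvec = np.array([f(i, j) for i in range(n) for j in range(n)])

# At the interior node (1, 1), -(L f)/h^2 equals the 5-point stencil value.
k = idx(1, 1)
stencil = (f(2, 1) - 2 * f(1, 1) + f(0, 1)) / h**2 \
        + (f(1, 2) - 2 * f(1, 1) + f(1, 0)) / h**2
print(-(L @ fvec)[k] / h**2, stencil)  # both equal 2 (the Laplacian of x^2 + 3y)
```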
Calculus on graphs
Let $f \colon \mathbb{R}^3 \to \mathbb{R}$ be a smooth function and consider the integral
$$\mathcal{E}[f] = \int \left[ \left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2 + \left(\frac{\partial f}{\partial z}\right)^2 \right] \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z = \int \|\nabla f\|^2 \, \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z.$$
Suppose we want to minimize $\mathcal{E}$ subject to the constraint
$$\int f(x, y, z)^2 \, \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z = \text{non-zero constant}.$$
The corresponding Euler-Lagrange equation turns out to be
$$\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} = \lambda f.$$
Calculus on graphs (cont.)
Two important points we can extract from the above:
1 The Dirichlet energy $\int \|\nabla f\|^2$ is a regularization term when $f$ is a loss function.
2 The Laplacian gives rise to Fourier series (given $f \in L^2$, we write $f = \sum_{n \in \mathbb{N}} \langle f, \varphi_n \rangle \varphi_n$, where $\langle f, \varphi_n \rangle = \int f \varphi_n$ and $\Delta \varphi_n = \lambda_n \varphi_n$).
It turns out that there exist counterparts of these ideas on graphs!
Calculus on graphs (cont.)
An undirected graph is weighted$^{1,2}$ if there is a function $w \colon V \times V \to \mathbb{R}_{\geq 0}$ such that $w(i, j) = w(j, i)$. Let $N_i = \{ j \in V \mid (i, j) \in E \}$ be the set of vertices adjacent to $i$. The degree of a vertex $i$ in a weighted graph is defined as
$$d(i) \triangleq \sum_{j \in N_i} w(i, j).$$
Let $H(V)$ denote the vector space of vertex functions endowed with the inner product
$$\langle f, g \rangle \triangleq \sum_{i \in V} f(i)\, g(i),$$
where $f, g \colon V \to \mathbb{R}$.
$^1$ D. Zhou and B. Schölkopf, A Regularization Framework for Learning from Graph Data, ICML, 2004.
$^2$ D. Zhou and B. Schölkopf, Regularization on Discrete Spaces, Pattern Recognition, Lecture Notes in Computer Science, Springer, 2005.
Calculus on graphs (cont.)
The graph gradient is an operator that maps vertex functions to edge functions as
$$(\nabla f)((i, j)) \triangleq \sqrt{\frac{w((i, j))}{d(j)}}\, f(j) - \sqrt{\frac{w((i, j))}{d(i)}}\, f(i),$$
for all $(i, j) \in E$.
Observe that $\nabla$ is skew-symmetric:
$$(\nabla f)((i, j)) = -(\nabla f)((j, i)).$$
Calculus on graphs (cont.)
The graph gradient may also be defined at each vertex. Given $f \colon V \to \mathbb{R}$, the gradient of $f$ at $j \in V$ is defined by
$$\nabla f(j) \triangleq \{ (\nabla f)((j, i)) \mid (j, i) \in E \}.$$
The norm of the graph gradient $\nabla f$ at vertex $j$ is
$$\|\nabla_j f\| \triangleq \sqrt{\sum_{i \in N_j} (\nabla f)((i, j))^2}.$$
The Dirichlet energy of $f$ is
$$\mathcal{E}[f] \triangleq \frac{1}{2} \sum_{j \in V} \|\nabla_j f\|^2.$$
Calculus on graphs (cont.)
There is a way to define the notion of divergence on graphs and to write the Laplacian $\Delta f$ in terms of the divergence of the gradient $\nabla f$. The resulting Laplacian is
$$(\Delta f)(j) \triangleq f(j) - \sum_{i \in N_j} \frac{w(i, j)}{\sqrt{d(i)\, d(j)}}\, f(i).$$
In vector notation,
$$\Delta f = (I - D^{-1/2} W D^{-1/2}) f \quad \text{for } f \colon V \to \mathbb{R}.$$
If $W$ is the adjacency matrix $A$, then we recover the graph Laplacian$^3$.
$^3$ F. Chung, Spectral Graph Theory, CBMS, vol. 92, AMS, 1996.
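To make the definitions concrete, here is a sketch (with an arbitrary three-vertex weighted graph and vertex function, both invented for illustration) that computes the graph gradient and the Dirichlet energy and checks that $\mathcal{E}[f] = \langle f, \Delta f \rangle$ with $\Delta = I - D^{-1/2} W D^{-1/2}$:

```python
# Sketch: graph gradient, Dirichlet energy E[f] = 1/2 sum_j ||grad_j f||^2,
# and agreement with f^T (I - D^{-1/2} W D^{-1/2}) f on a small weighted graph.
import numpy as np

W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])        # symmetric edge weights w(i, j)
d = W.sum(axis=1)                      # weighted degrees d(i)
f = np.array([1.0, -2.0, 0.5])         # a vertex function f: V -> R

def grad(i, j):
    """(grad f)((i, j)) = sqrt(w/d(j)) f(j) - sqrt(w/d(i)) f(i)."""
    return np.sqrt(W[i, j] / d[j]) * f[j] - np.sqrt(W[i, j] / d[i]) * f[i]

# Dirichlet energy via the per-vertex gradient norms.
energy = 0.5 * sum(
    sum(grad(i, j) ** 2 for i in range(len(f)) if W[i, j] > 0)
    for j in range(len(f))
)

# Matrix form: Delta = I - D^{-1/2} W D^{-1/2}.
Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
Delta = np.eye(len(f)) - Dinv_sqrt @ W @ Dinv_sqrt
assert np.isclose(energy, f @ Delta @ f)
print(energy)
```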
Calculus on graphs (cont.)
Using these ideas, it is possible to formulate partial differential equations on graphs$^4$ and to study gradient flows of the Dirichlet energy (these give rise to GNNs)$^5$.
$^4$ B. P. Chamberlain, J. Rowbottom, M. Gorinova, S. Webb, E. Rossi, and M. M. Bronstein, GRAND: Graph Neural Diffusion, 2021, arXiv:2106.10934.
$^5$ F. Di Giovanni, J. Rowbottom, B. P. Chamberlain, T. Markovich, and M. M. Bronstein, Graph Neural Networks as Gradient Flows, 2022, arXiv:2206.10991.
Outline
1 Graph Neural Networks
Feature maps on graphs
Message-passing GNNs
Convolutional GNNs
Attention GNNs
Motivation
Graph Neural Networks (GNNs) generalize:
∙ Convolutional Neural Networks.
∙ Attention Networks.
∙ Recurrent Neural Networks.
∙ Dynamic Programming algorithms.
∙ etc.
Feature maps
Consider a feature map $x \colon V \to \mathbb{R}^s$ and arrange the image of $x$ for each node as the rows of a matrix:
$$X = \begin{bmatrix} x(1) \\ x(2) \\ \vdots \end{bmatrix} \in \mathbb{R}^{|V| \times s}.$$
It is also possible to endow the edges with feature maps, but we won't consider that case.
Feature maps (cont.)
The nodes $V$ are unordered, and a function $f \colon \mathbb{R}^s \to \mathbb{R}$ of the features of the vertices ought to be invariant to vertex permutations. However, the matrix $X$ induces an ordering of the nodes, and so invariant functions must satisfy
$$f(P X) = f(X),$$
where $P$ is a permutation matrix and $f(X) \in \mathbb{R}^s$ is the vector with components $f(x(i))$ for $i \in V$.
For example, the map
$$f(X) = \varphi\left( \sum_{i \in V} \psi(x(i)) \right),$$
for some suitable maps $\psi$ and $\varphi$, is permutation-invariant.
Now let $F = (f_1, \ldots, f_k) \colon V \to \mathbb{R}^k$ and write the matrix $H = F(X) \in \mathbb{R}^{|V| \times k}$. The so-called latent matrix of node features $H$ is no longer permutation-invariant. The function (matrix) $F(X)$ is permutation-equivariant if $F(P X) = P F(X)$.
Example
Given a weight matrix $\Theta \in \mathbb{R}^{d \times d'}$, the linear map
$$F_\Theta(X) = X \Theta$$
is equivariant.
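A quick numerical check of both properties (a sketch with randomly generated features, a sum-of-$\tanh$ readout standing in for $\varphi \circ \sum \psi$, and a random weight matrix $\Theta$):

```python
# Sketch: a sum-pooled readout is permutation-invariant, and a shared linear
# map X -> X @ Theta is permutation-equivariant.
import numpy as np

rng = np.random.default_rng(0)
n, s, k = 5, 3, 4
X = rng.normal(size=(n, s))            # node feature matrix
Theta = rng.normal(size=(s, k))        # shared weight matrix

P = np.eye(n)[rng.permutation(n)]      # random permutation matrix

f = lambda X: np.tanh(X).sum(axis=0)   # f(X) = phi(sum_i psi(x_i)), phi = id, psi = tanh
F = lambda X: X @ Theta                # F_Theta(X) = X Theta

assert np.allclose(f(P @ X), f(X))         # invariance:   f(PX) = f(X)
assert np.allclose(F(P @ X), P @ F(X))     # equivariance: F(PX) = P F(X)
```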
Let $A$ be the adjacency matrix of $G$. Applying a permutation matrix to the node features $X$ must induce a transformation on $A$'s rows and columns. We say that $f = f(X, A)$ is permutation-invariant if
$$f(P X, P A P^\top) = f(X, A)$$
and it is permutation-equivariant if
$$F(P X, P A P^\top) = P F(X, A).$$
A 1-neighborhood of node $u \in V$ is the set of adjacent nodes
$$N_u = \{ v \in V \mid (u, v) \in E \}.$$
The neighborhood features are the multiset
$$X_{N_u} = \{\!\{ x(v) \mid v \in N_u \}\!\}.$$
We can now specify a local function $\varphi$ that operates over the features of a node and its neighborhood, $\varphi(x(u), X_{N_u})$. This, in turn, leads to a permutation-equivariant function $F$ defined as
$$F(X, A) = \begin{bmatrix} \varphi(x_1, X_{N_1}) \\ \vdots \\ \varphi(x_n, X_{N_n}) \end{bmatrix}.$$
As long as $\varphi$ is permutation-invariant, $F$ will be equivariant.
[Figure: node $b$ and its neighbors $a$, $c$, $d$, $e$ with features $x_a, \ldots, x_e$; the latent feature is $h_b = \varphi(x_b, X_{N_b})$.]
An illustration$^6$ of constructing permutation-equivariant functions over graphs, by applying a permutation-invariant function $\varphi$ to every neighbourhood. In this case, $\varphi$ is applied to the features $x_b$ of node $b$ as well as the multiset of its neighbourhood features, $X_{N_b} = \{\!\{ x_a, x_b, x_c, x_d, x_e \}\!\}$. Applying $\varphi$ in this manner to every node's neighbourhood recovers the rows of the resulting matrix of latent features $H = F(X, A)$.
$^6$ M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, arXiv:2104.13478 [cs, stat] (2021).
Flavors of GNNs
[Figure: three panels (Convolutional, Attentional, Message-passing) showing node $b$, its neighbors, and the per-edge quantities $c_{bv}$, $\alpha_{bv}$, and $m_{bv}$.]
A visualisation of the dataflow for the three flavours of GNN layers, $g$. We use the neighbourhood of node $b$ from the previous figure to illustrate this. Left-to-right: convolutional, where sender node features are multiplied with a constant, $c_{uv}$; attentional, where this multiplier is implicitly computed via an attention mechanism of the receiver over the sender: $\alpha_{uv} = a(x_u, x_v)$; and message-passing, where vector-based messages are computed based on both the sender and receiver: $m_{uv} = \psi(x_u, x_v)$.
Message-passing GNNs
[Figure: the message-passing panel from the previous figure, showing messages $m_{bv}$ flowing from the neighbors of node $b$.]
Message-passing GNNs
$$h_u = \varphi\Big( x_u, \bigoplus_{v \in N_u} \psi(x_u, x_v) \Big)$$
Here $h_u$ is the output, $\varphi$ is the read-out function, $x_u$ are the node features, $\bigoplus$ is the aggregation operation, and $\psi$ is the message function.
The functions $\varphi$ and $\psi$ can be realized as multi-layer perceptrons, but we are about to see that they often have more concrete architectures.
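A minimal message-passing layer in NumPy, sketched with single-layer ReLU networks standing in for $\psi$ and $\varphi$ and a sum as the aggregation $\oplus$ (the graph, weights, and dimensions are arbitrary placeholders):

```python
# Sketch of a message-passing layer: h_u = phi(x_u, sum_{v in N_u} psi(x_u, x_v)).
import numpy as np

rng = np.random.default_rng(1)
n, s, k = 4, 3, 5
X = rng.normal(size=(n, s))                      # node features
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # adjacency matrix

W_psi = rng.normal(size=(2 * s, k))              # message network weights
W_phi = rng.normal(size=(s + k, k))              # read-out network weights
relu = lambda z: np.maximum(z, 0.0)

def psi(x_u, x_v):
    """Message from v to u, computed from both endpoints."""
    return relu(np.concatenate([x_u, x_v]) @ W_psi)

def phi(x_u, m_u):
    """Read-out combining the node's own features with the aggregated message."""
    return relu(np.concatenate([x_u, m_u]) @ W_phi)

H = np.stack([
    phi(X[u], sum(psi(X[u], X[v]) for v in range(n) if A[u, v] > 0))
    for u in range(n)
])
print(H.shape)  # (n, k): one latent feature vector per node
```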
Message-passing GNNs (example)
The paper I. Batatia, D. P. Kovács, G. N. C. Simm, C. Ortner, and G. Csányi, MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields, 2022, arXiv:2206.07697 introduces an $n$-body force field that attains state-of-the-art (SotA) accuracy.
Message-passing GNNs (example)
The graph is $G = (V, E)$, with the vertices $V$ representing the atoms and the edges $E$ representing interatomic interactions within a cut-off distance. The architecture at the $\ell$-th layer is given by
$$x_i^{(\ell+1)} = \varphi^{(\ell)}\Big( x_i^{(\ell)}, \bigoplus_{j \in N_i} \psi\big( x_i^{(\ell)}, x_j^{(\ell)} \big) \Big), \qquad x_i^{(\ell)} = \big( r_i, z_i, h_i^{(\ell)} \big),$$
where $r_i$ is the 3D position, $z_i$ the chemical element, and $h_i^{(\ell)}$ the learnable features.
The readout function $\varphi$ is of the form $\varphi(x_i, m_i) = (r_i, z_i, \kappa(x_i, m_i))$. The functions $\psi$, $\kappa$, and $\varphi$ are learnable and $\oplus$ is some permutation-invariant pooling operation.
Equivariance
The features $h$ should change in a specific way under the action of the group of rigid body motions $O(3)$:
$$h_i(Q r_1, \ldots, Q r_N) = D(Q)\, h_i(r_1, \ldots, r_N)$$
for some matrix $D(Q)$ representing the rotation $Q$ acting on the features $h_i$.
Skip connections in the energy
The energy of the $i$-th atom is obtained via skip connections from all $L$ layers:
$$E_i = \sum_{\ell=1}^{L} R^{(\ell)}\big( x_i^{(\ell)} \big).$$
The functions $R^{(\ell)}$ only depend on the $h_i^{(\ell)}$ to ensure that the energy is invariant to rigid body motions.
Message
The message is more complicated in practice than
$$m_i^{(\ell)} = \bigoplus_{j \in N_i} \psi^{(\ell)}\big( x_i^{(\ell)}, x_j^{(\ell)} \big).$$
The actual message is
$$m_i^{(\ell)} = \sum_{j=1}^{N} u_1\big( x_i^{(\ell)}, x_j^{(\ell)} \big) + \sum_{j_1=1, j_2=1}^{N} u_2\big( x_i^{(\ell)}, x_{j_1}^{(\ell)}, x_{j_2}^{(\ell)} \big) + \cdots + \sum_{j_1=1, \ldots, j_\nu=1}^{N} u_\nu\big( x_i^{(\ell)}, x_{j_1}^{(\ell)}, \ldots, x_{j_\nu}^{(\ell)} \big),$$
where $u_1, \ldots, u_\nu$ are learnable functions.
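The nested sums can be sketched directly; the functions u1 and u2 below are hypothetical stand-ins (simple $\tanh$ layers), not MACE's actual parametrization, and the message is truncated at correlation order $\nu = 2$:

```python
# Sketch of a two- and three-body message built from the nested sums above.
import numpy as np

rng = np.random.default_rng(2)
N, s = 5, 3
x = rng.normal(size=(N, s))                  # per-atom states x_j^(l)
W1 = rng.normal(size=(2 * s, s))             # weights of the 2-body term u_1
W2 = rng.normal(size=(3 * s, s))             # weights of the 3-body term u_2

u1 = lambda xi, xj: np.tanh(np.concatenate([xi, xj]) @ W1)
u2 = lambda xi, xj1, xj2: np.tanh(np.concatenate([xi, xj1, xj2]) @ W2)

i = 0
m_i = sum(u1(x[i], x[j]) for j in range(N)) \
    + sum(u2(x[i], x[j1], x[j2]) for j1 in range(N) for j2 in range(N))
print(m_i)   # the message m_i^(l), truncated at correlation order nu = 2
```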
Dynamic programming, etc.
Convolutional GNNs
[Figure: the convolutional panel from the previous figure, showing fixed coefficients $c_{bv}$ on the edges into node $b$.]
Convolutional GNNs (cont.)
$$h_i = \varphi\Big( x_i, \bigoplus_{j \in N_i} c_{ij}\, \psi(x_j) \Big)$$
Here $c_{ij}$ is the importance of node $j$ to node $i$.
The convolution stencil is $C = (c_{ij})_{i, j \in V}$. When $\oplus = +$, it is said that the architecture above implements “linear diffusion” or “position-dependent linear filtering.”
Convolutional GNNs (cont.)
Recall that the convolution of two functions $f$ and $g$ is defined as
$$(f \star g)(x) \triangleq \int_{\mathbb{R}} f(y)\, g(x - y)\, \mathrm{d}y.$$
One of the properties of the convolution is that
$$\widehat{f \star g} = \hat{f}\, \hat{g},$$
where
$$\hat{f}(\omega) \triangleq \int_{\mathbb{R}} f(x)\, e^{-i \omega \cdot x}\, \mathrm{d}x$$
is the Fourier transform of $f$.
Convolutional GNNs (cont.)
Note that
$$\Delta\big( e^{i \omega \cdot x} \big) = (-\omega^2)\, e^{i \omega \cdot x}.$$
The Fourier transform $\hat{f}(\omega)$ is the component of the orthogonal projection of the function $f$ onto the eigenfunction $e^{i \omega \cdot x}$ under the inner product $\langle f, g \rangle = \int_{\mathbb{R}} f(x)\, g(x)\, \mathrm{d}x$. In other words,
$$\hat{f}(\omega) = \langle f, e^{i \omega \cdot x} \rangle.$$
Convolutional GNNs (cont.)
If the graph Laplacian is diagonalized as
$$\mathcal{L} = I - D^{-1/2} A D^{-1/2} = U \Lambda U^\top,$$
then we can define$^7$ the graph Fourier transform of a vertex function $f$ as
$$\hat{f} = U^\top f.$$
The convolution of $f$ and $g$ can be defined on graphs based on the identity $\widehat{f \star g} = \hat{f}\, \hat{g}$ as
$$f \star g = U \big( (U^\top f) \odot (U^\top g) \big) = \big( U \operatorname{diag}(\hat{g})\, U^\top \big) f.$$
$^7$ T. N. Kipf and M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017.
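A sketch of the graph Fourier transform and the resulting convolution on a small example graph (the graph and signals are arbitrary), checking that $U((U^\top f) \odot (U^\top g)) = U \operatorname{diag}(\hat{g})\, U^\top f$:

```python
# Sketch: spectral convolution via the eigendecomposition of the normalized Laplacian.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(4) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)   # L = I - D^{-1/2} A D^{-1/2}

lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T (L is symmetric)

f = np.array([1.0, 0.0, -1.0, 2.0])   # vertex functions (signals)
g = np.array([0.5, 0.3, 0.1, 0.0])

f_hat, g_hat = U.T @ f, U.T @ g       # graph Fourier transforms
conv = U @ (f_hat * g_hat)            # f * g = U ((U^T f) elementwise (U^T g))
assert np.allclose(conv, U @ np.diag(g_hat) @ U.T @ f)
print(conv)
```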
Convolutional GNNs (cont.)
Consider a function that can be written as a power series
$$h(z) = \sum_{n=0}^{\infty} c_n z^n.$$
The function applied to a diagonalizable matrix $M = Q \Lambda Q^\top$ is then equal to
$$h(M) = Q \Big( \sum_{n=0}^{\infty} c_n \Lambda^n \Big) Q^\top = Q\, h(\Lambda)\, Q^\top,$$
so it is natural to interpret $\operatorname{diag}(\hat{g})$ in $U \operatorname{diag}(\hat{g})\, U^\top$ as a function $\hat{g}(\Lambda)$ of the matrix of eigenvalues of $\mathcal{L}$.
To recap,
$$f \star g = U\, \hat{g}(\Lambda)\, U^\top f.$$
Convolutional GNNs (cont.)
Instead of diagonalizing the graph Laplacian, convolutional filters are defined by approximating $\hat{g}(\Lambda)$ by a polynomial expansion (e.g., in terms of Chebyshev polynomials), and one arrives at
$$f \star g = \Big( \sum_{k=0}^{K} \theta_k\, T_k(\tilde{\mathcal{L}}) \Big) f,$$
where $T_0(x) = 1$, $T_1(x) = x$, $T_2(x) = 2 x^2 - 1, \ldots$ for $-1 < x < 1$. The rescaled Laplacian is
$$\tilde{\mathcal{L}} = \frac{2}{\lambda_{\max}} (\mathcal{L} - I).$$
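In practice the filter is applied with the Chebyshev three-term recurrence $T_k(x) = 2 x\, T_{k-1}(x) - T_{k-2}(x)$ rather than by diagonalization. A sketch (with an arbitrary small graph, arbitrary coefficients $\theta_k$, and the rescaling as written above):

```python
# Sketch of a Chebyshev filter: evaluate (sum_k theta_k T_k(L~)) f via the
# recurrence T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x), without diagonalizing L.
import numpy as np

def cheb_filter(L_tilde, f, theta):
    """Apply sum_k theta[k] * T_k(L_tilde) to the signal f."""
    T_prev, T_curr = f, L_tilde @ f          # T_0(L~) f = f,  T_1(L~) f = L~ f
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        out = out + theta[k] * T_curr
    return out

# Small example with the rescaled Laplacian defined above.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(3) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
lam_max = np.linalg.eigvalsh(L).max()
L_tilde = (2.0 / lam_max) * (L - np.eye(3))
print(cheb_filter(L_tilde, np.array([1.0, -1.0, 0.5]), theta=[0.2, 0.5, 0.3]))
```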
Convolutional GNNs (cont.)
If $K = 1$, we have
$$f \star g = \theta_0 f + \theta_1 \tilde{\mathcal{L}} f = \theta_0 f + \theta_1 D^{-1/2} A D^{-1/2} f.$$
A convolutional layer from the paper by Kipf and Welling (2017) is
$$h_i = \varphi\Big( x_i, \bigoplus_{j \in N_i} c_{ij}\, \psi(x_j) \Big) = \sigma\Big( W_0^\top x_i + \sum_{j \in N_i} \frac{1}{\sqrt{d(i)\, d(j)}}\, W_1^\top x_j \Big),$$
where the message function is $\psi(x_j) = W_1^\top x_j$ and the coefficients are $c_{ij} = 1/\sqrt{d(i)\, d(j)}$.
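A sketch of this layer in matrix form (arbitrary graph, features, and weights; $\sigma = \tanh$ is an assumption, and self-loops are handled by the separate $W_0$ term as in the displayed equation above):

```python
# Sketch of the convolutional layer h_i = sigma(W0^T x_i + sum_j c_ij W1^T x_j)
# with c_ij = 1 / sqrt(d(i) d(j)).
import numpy as np

rng = np.random.default_rng(3)
n, s, k = 4, 3, 2
X = rng.normal(size=(n, s))
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)

W0 = rng.normal(size=(s, k))
W1 = rng.normal(size=(s, k))
sigma = np.tanh

# C has entries c_ij = 1 / sqrt(d(i) d(j)) for neighbours j of i, zero otherwise.
C = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
H = sigma(X @ W0 + C @ (X @ W1))
print(H)
```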
Attention GNNs
[Figure: the attentional panel from the previous figure, showing attention coefficients $\alpha_{bv}$ on the edges into node $b$.]
Attention GNNs
$$h_u = \varphi\Big( x_u, \bigoplus_{v \in N_u} a(x_u, x_v)\, \psi(x_v) \Big)$$
Here $a(x_u, x_v)$ is the importance of node $v$ to node $u$.
The bivariate map $a(x_u, x_v)$, known as the self-attention mechanism, is computed implicitly. When $\oplus = +$, this generalizes convolutional GNNs to the case in which the importance coefficients are feature-dependent (i.e., $a(x_u, x_v) = c_{uv}$).
Attention
Attention (cont.)
Let $q_i, k_i, v_i \in \mathbb{R}^s$ for $i = 1, \ldots, n$ be parameters of the NN.
Consider
$$Q = \begin{bmatrix} q_1^\top \\ \vdots \\ q_n^\top \end{bmatrix} \in \mathbb{R}^{n \times s} \quad \text{and} \quad K^\top = [k_1, \ldots, k_n] \in \mathbb{R}^{s \times n}.$$
Thus, $Q K^\top \in \mathbb{R}^{n \times n}$.
Attention (cont.)
Applying
$$\operatorname{softmax}(u)_i = \frac{e^{u_i}}{\sum_{j=1}^{d_k} e^{u_j}}$$
row-wise to $Q K^\top$, we obtain the stochastic matrix
$$\operatorname{softmax}\Big( \tfrac{1}{\sqrt{s}}\, Q K^\top \Big).$$
The rows of the above matrix are probability mass functions of discrete distributions with unit variance (hence the scaling).
Attention (cont.)
Let
$$V = \begin{bmatrix} v_1^\top \\ \vdots \\ v_n^\top \end{bmatrix} \in \mathbb{R}^{n \times s}.$$
Then, attention is defined by
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\Big( \tfrac{1}{\sqrt{s}}\, Q K^\top \Big) V.$$
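A sketch of this operation with randomly generated $Q$, $K$, $V$ (the max-subtraction inside the softmax is a standard numerical-stability detail, not part of the definition above):

```python
# Sketch: Attention(Q, K, V) = softmax(Q K^T / sqrt(s)) V with a row-wise softmax.
import numpy as np

def attention(Q, K, V):
    s = Q.shape[1]
    scores = Q @ K.T / np.sqrt(s)                         # (n, n) matrix of dot products
    M = np.exp(scores - scores.max(axis=1, keepdims=True))
    M = M / M.sum(axis=1, keepdims=True)                  # row-wise softmax: a stochastic matrix
    return M @ V                                          # Attention(Q, K, V) = M V

rng = np.random.default_rng(4)
n, s = 5, 8
Q, K, V = (rng.normal(size=(n, s)) for _ in range(3))
print(attention(Q, K, V).shape)   # (n, s)
```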
Attention
Comparing
$$\big( \operatorname{Attention}(Q, K, V) \big)_i = (M V)_i = \sum_{j=1}^{n} M_{ij} V_j,$$
where $M = \operatorname{softmax}\big( \tfrac{1}{\sqrt{s}}\, Q K^\top \big)$, with
$$h_i = \varphi\Big( x_i, \bigoplus_{j \in N_i} a(x_i, x_j)\, \psi(x_j) \Big),$$
we see that $G$ is the complete graph, $\varphi(x, m) = a(x, x)\, \psi(x) + m$, and $a(x_i, x_j) = M_{ij}$. Indeed,
$$h_i = a(x_i, x_i)\, \psi(x_i) + \sum_{j \in N_i} a(x_i, x_j)\, \psi(x_j).$$
Training message-passing GNNs is more complicated than training attentional
GNNs, which are in turn more complicated to train than convolutional GNNs.
Interpretability is also easier to attain in convolutional GNNs, less so in attentional
GNNs, and even less so in message-passing GNNs.
References
∙ M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, arXiv:2104.13478 [cs, stat] (2021).
∙ I. Batatia, D. P. Kovács, G. N. C. Simm, C. Ortner, and G. Csányi, MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields, June 2022, arXiv:2206.07697.
∙ F. Chung, Spectral Graph Theory, CBMS, American Mathematical Society, 1996.
∙ B. P. Chamberlain, J. Rowbottom, M. Gorinova, S. Webb, E. Rossi, and M. M. Bronstein, GRAND: Graph Neural Diffusion, arXiv:2106.10934 [cs, stat] (2021).
∙ F. Di Giovanni, J. Rowbottom, B. P. Chamberlain, T. Markovich, and M. M. Bronstein, Graph Neural Networks as Gradient Flows, August 2022, arXiv:2206.10991 [cs, stat].
References
∙ A. Dudzik and P. Veličković, Graph Neural Networks are Dynamic Programmers, arXiv:2203.15544 [cs, math, stat] (2022).
∙ A. Grigoryan, Introduction to Analysis on Graphs, University Lecture Series, no. 71, American Mathematical Society, 2018.
∙ T. N. Kipf and M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017.
∙ D. Zhou and B. Schölkopf, A Regularization Framework for Learning from Graph Data, ICML, 2004.
∙ D. Zhou and B. Schölkopf, Regularization on Discrete Spaces, Pattern Recognition, Lecture Notes in Computer Science, Springer, 2005.