Introduction to Graph Neural Networks
Juan M. Bello-Rivas
jmbr@jhu.edu
November 29, 2022
Outline
1 Graph Neural Networks
Graphs
A graph $G$ is a pair $G = (V, E)$, where $V$ is a set of vertices (or nodes) and $E \subseteq V \times V$ is a set of edges. The graph is said to be undirected if the ordering of the vertices within the edges does not matter and directed otherwise.
[Figure: two example graphs with numbered vertices illustrating the definition.]
Motivation
[Figure: examples of graphs, including a complete graph on vertices $a, b, c, d$ and a $4 \times 4$ grid graph with vertices labeled $(i, j)$.]
Adjacency and incidence matrices
The adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$ of a graph $G$ is given by
$$A_{ij} = \begin{cases} 1, & \text{if } (i, j) \in E, \\ 0, & \text{otherwise.} \end{cases}$$
The incidence matrix $B \in \mathbb{R}^{|V| \times |E|}$ of the graph $G$ is such that each of its column vectors $b_\ell \in \mathbb{R}^{|V|}$ for $\ell = 1, \ldots, |E|$ corresponds to an edge $(i, j) \in E$ and
$$(b_\ell)_k = \begin{cases} -1, & \text{if } k = i, \\ 1, & \text{if } k = j, \\ 0, & \text{otherwise.} \end{cases}$$
Combinatorial Laplacian
The degree $d_u$ of a vertex $u$ is the number of adjacent nodes.
The adjacency and incidence matrices are related by the formula
$$B B^\top = D - A,$$
where $D$ is a diagonal matrix whose entries are the degrees of each vertex (i.e., $D_{ii} = d_i$).
The matrix
$$L = D - A$$
is the combinatorial Laplacian.
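The following NumPy sketch (using an arbitrary four-vertex graph, not one from the slides) builds $A$, $B$, and $D$ for a small undirected graph and checks the identity $B B^\top = D - A = L$ numerically:

```python
# A minimal sketch: build the adjacency and incidence matrices of a small
# undirected graph and check that B B^T = D - A = L.
import numpy as np

# Undirected triangle plus a pendant vertex: V = {0, 1, 2, 3}.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n, m = 4, len(edges)

A = np.zeros((n, n))
B = np.zeros((n, m))
for ell, (i, j) in enumerate(edges):
    A[i, j] = A[j, i] = 1.0           # adjacency: symmetric for undirected graphs
    B[i, ell], B[j, ell] = -1.0, 1.0  # incidence: -1 at one endpoint, +1 at the other

D = np.diag(A.sum(axis=1))            # degree matrix
L = D - A                             # combinatorial Laplacian

assert np.allclose(B @ B.T, L)        # B B^T = D - A
print(L)
```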
Combinatorial Laplacian on a grid
[Figure: a $4 \times 4$ grid graph with vertices labeled by their coordinates $(i, j)$.]
$$\Delta f = f_{xx} + f_{yy} \approx \frac{1}{h^2}\left(f_{i+1,j} - 2 f_{i,j} + f_{i-1,j}\right) + \frac{1}{h^2}\left(f_{i,j+1} - 2 f_{i,j} + f_{i,j-1}\right)$$
$$\Delta = -(D - A)/h^2 = -L/h^2$$
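A small sketch of this correspondence, assuming a $4 \times 4$ grid with spacing $h = 1$, row-major node numbering, and a sample grid function chosen for illustration; the match with the five-point stencil holds at interior nodes:

```python
# Sketch: the combinatorial Laplacian of the grid graph reproduces the negative
# of the 5-point finite-difference stencil at interior nodes.
import numpy as np

n, h = 4, 1.0
idx = lambda i, j: i * n + j          # row-major node numbering

A = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ii, jj = i + di, j + dj
            if 0 <= ii < n and 0 <= jj < n:
                A[idx(i, j), idx(ii, jj)] = 1.0

L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian D - A

f = lambda i, j: (i * h) ** 2 + 3 * (j * h)   # sample grid function
fvec = np.array([f(i, j) for i in range(n) for j in range(n)])

# At the interior node (1, 1), -(L f)/h^2 equals the 5-point stencil value.
k = idx(1, 1)
stencil = (f(2, 1) - 2 * f(1, 1) + f(0, 1)) / h**2 \
        + (f(1, 2) - 2 * f(1, 1) + f(1, 0)) / h**2
print(-(L @ fvec)[k] / h**2, stencil)  # both equal 2 (the Laplacian of x^2 + 3y)
```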
Calculus on graphs
Let $f \colon \mathbb{R}^3 \to \mathbb{R}$ be a smooth function and consider the integral
$$\mathcal{E}[f] = \int \left[ \left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2 + \left(\frac{\partial f}{\partial z}\right)^2 \right] \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z = \int \|\nabla f\|^2 \, \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z.$$
Suppose we want to minimize $\mathcal{E}$ subject to the constraint
$$\int f(x, y, z)^2 \, \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z = \text{non-zero constant}.$$
The corresponding Euler-Lagrange equation turns out to be
$$\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} = \lambda f.$$
Calculus on graphs (cont.)
Two important points we can extract from the above:
1 The Dirichlet energy $\int \|\nabla f\|^2$ is a regularization term when $f$ is a loss function.
2 The Laplacian gives rise to Fourier series (given $f \in L^2$, we write $f = \sum_{n \in \mathbb{N}} \langle f, \varphi_n \rangle \varphi_n$, where $\langle f, \varphi_n \rangle = \int f \varphi_n$ and $\Delta \varphi_n = \lambda_n \varphi_n$).
It turns out that there exist counterparts of these ideas on graphs!
Calculus on graphs (cont.)
An undirected graph is weighted$^{1,2}$ if there is a function $w \colon V \times V \to \mathbb{R}_{\geq 0}$ such that $w(i, j) = w(j, i)$. Let $N_i = \{ j \in V \mid (i, j) \in E \}$ be the set of vertices adjacent to $i$. The degree of a vertex $i$ in a weighted graph is defined as
$$d(i) \triangleq \sum_{j \in N_i} w(i, j).$$
Let $H(V)$ denote the vector space of vertex functions endowed with the inner product
$$\langle f, g \rangle \triangleq \sum_{i \in V} f(i)\, g(i),$$
where $f, g \colon V \to \mathbb{R}$.
$^1$ D. Zhou and B. Schölkopf, A Regularization Framework for Learning from Graph Data, ICML, 2004.
$^2$ D. Zhou and B. Schölkopf, Regularization on Discrete Spaces, Pattern Recognition, Lecture Notes in Computer Science, Springer, 2005.
Calculus on graphs (cont.)
The graph gradient is an operator that maps vertex functions to edge functions as
$$(\nabla f)((i, j)) \triangleq \sqrt{\frac{w((i, j))}{d(j)}}\, f(j) - \sqrt{\frac{w((i, j))}{d(i)}}\, f(i),$$
for all $(i, j) \in E$.
Observe that $\nabla$ is skew-symmetric:
$$(\nabla f)((i, j)) = -(\nabla f)((j, i)).$$
Calculus on graphs (cont.)
The graph gradient may also be defined at each vertex. Given $f \colon V \to \mathbb{R}$, the gradient of $f$ at $j \in V$ is defined by
$$\nabla f(j) \triangleq \{ (\nabla f)((j, i)) \mid (j, i) \in E \}.$$
The norm of the graph gradient $\nabla f$ at vertex $j$ is
$$\|\nabla_j f\| \triangleq \sqrt{\sum_{i \in N_j} (\nabla f)((i, j))^2}.$$
The Dirichlet energy of $f$ is
$$\mathcal{E}[f] \triangleq \frac{1}{2} \sum_{j \in V} \|\nabla_j f\|^2.$$
Calculus on graphs (cont.)
There is a way to define the notion of divergence on graphs and to write the Laplacian $\Delta f$ in terms of the divergence of the gradient $\nabla f$. The resulting Laplacian is
$$(\Delta f)(j) \triangleq f(j) - \sum_{i \in N_j} \frac{w(i, j)}{\sqrt{d(i)\, d(j)}}\, f(i).$$
In vector notation,
$$\Delta f = (I - D^{-1/2} W D^{-1/2}) f \quad \text{for } f \colon V \to \mathbb{R}.$$
If $W$ is the adjacency matrix $A$, then we recover the graph Laplacian$^3$.
$^3$ F. Chung, Spectral Graph Theory, CBMS, vol. 92, AMS, 1996.
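To make the definitions concrete, here is a sketch (with an arbitrary three-vertex weighted graph and vertex function, both invented for illustration) that computes the graph gradient and the Dirichlet energy and checks that $\mathcal{E}[f] = \langle f, \Delta f \rangle$ with $\Delta = I - D^{-1/2} W D^{-1/2}$:

```python
# Sketch: graph gradient, Dirichlet energy E[f] = 1/2 sum_j ||grad_j f||^2,
# and agreement with f^T (I - D^{-1/2} W D^{-1/2}) f on a small weighted graph.
import numpy as np

W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])        # symmetric edge weights w(i, j)
d = W.sum(axis=1)                      # weighted degrees d(i)
f = np.array([1.0, -2.0, 0.5])         # a vertex function f: V -> R

def grad(i, j):
    """(grad f)((i, j)) = sqrt(w/d(j)) f(j) - sqrt(w/d(i)) f(i)."""
    return np.sqrt(W[i, j] / d[j]) * f[j] - np.sqrt(W[i, j] / d[i]) * f[i]

# Dirichlet energy via the per-vertex gradient norms.
energy = 0.5 * sum(
    sum(grad(i, j) ** 2 for i in range(len(f)) if W[i, j] > 0)
    for j in range(len(f))
)

# Matrix form: Delta = I - D^{-1/2} W D^{-1/2}.
Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
Delta = np.eye(len(f)) - Dinv_sqrt @ W @ Dinv_sqrt
assert np.isclose(energy, f @ Delta @ f)
print(energy)
```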
Calculus on graphs (cont.)
Using these ideas, it is possible to formulate partial differential equations on graphs$^4$ and to study gradient flows of the Dirichlet energy (these give rise to GNNs)$^5$.
$^4$ B. P. Chamberlain, J. Rowbottom, M. Gorinova, S. Webb, E. Rossi, and M. M. Bronstein, GRAND: Graph Neural Diffusion, 2021, arXiv:2106.10934.
$^5$ F. Di Giovanni, J. Rowbottom, B. P. Chamberlain, T. Markovich, and M. M. Bronstein, Graph Neural Networks as Gradient Flows, 2022, arXiv:2206.10991.
Outline
1 Graph Neural Networks
Feature maps on graphs
Message-passing GNNs
Convolutional GNNs
Attention GNNs
Motivation
Graph Neural Networks (GNNs) generalize:
∙ Convolutional Neural Networks.
∙ Attention Networks.
∙ Recurrent Neural Networks.
∙ Dynamic Programming algorithms.
∙ etc.
Feature maps
Consider a feature map $x \colon V \to \mathbb{R}^s$ and arrange the image of $x$ for each node as the rows of a matrix:
$$X = \begin{bmatrix} x(1) \\ x(2) \\ \vdots \end{bmatrix} \in \mathbb{R}^{|V| \times s}.$$
It is also possible to endow the edges with feature maps, but we won't consider that case.
Feature maps (cont.)
The nodes $V$ are unordered, and a function $f \colon \mathbb{R}^s \to \mathbb{R}$ of the features of the vertices ought to be invariant to vertex permutations. However, the matrix $X$ induces an ordering of the nodes, and so invariant functions must satisfy
$$f(P X) = f(X),$$
where $P$ is a permutation matrix and $f(X) \in \mathbb{R}^s$ is the vector with components $f(x(i))$ for $i \in V$.
For example, the map
$$f(X) = \varphi\left( \sum_{i \in V} \psi(x(i)) \right),$$
for some suitable maps $\psi$ and $\varphi$, is permutation-invariant.
Now let $F = (f_1, \ldots, f_k) \colon V \to \mathbb{R}^k$ and write the matrix $H = F(X) \in \mathbb{R}^{|V| \times k}$. The so-called latent matrix of node features $H$ is no longer permutation-invariant. The function (matrix) $F(X)$ is permutation-equivariant if $F(P X) = P F(X)$.
Example
Given a weight matrix $\Theta \in \mathbb{R}^{d \times d'}$, the linear map
$$F_\Theta(X) = X \Theta$$
is equivariant.
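A quick numerical check of both properties (a sketch with randomly generated features, a sum-of-$\tanh$ readout standing in for $\varphi \circ \sum \psi$, and a random weight matrix $\Theta$):

```python
# Sketch: a sum-pooled readout is permutation-invariant, and a shared linear
# map X -> X @ Theta is permutation-equivariant.
import numpy as np

rng = np.random.default_rng(0)
n, s, k = 5, 3, 4
X = rng.normal(size=(n, s))            # node feature matrix
Theta = rng.normal(size=(s, k))        # shared weight matrix

P = np.eye(n)[rng.permutation(n)]      # random permutation matrix

f = lambda X: np.tanh(X).sum(axis=0)   # f(X) = phi(sum_i psi(x_i)), phi = id, psi = tanh
F = lambda X: X @ Theta                # F_Theta(X) = X Theta

assert np.allclose(f(P @ X), f(X))         # invariance:   f(PX) = f(X)
assert np.allclose(F(P @ X), P @ F(X))     # equivariance: F(PX) = P F(X)
```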
Let $A$ be the adjacency matrix of $G$. Applying a permutation matrix to the node features $X$ must induce a transformation on $A$'s rows and columns. We say that $f = f(X, A)$ is permutation-invariant if
$$f(P X, P A P^\top) = f(X, A)$$
and it is permutation-equivariant if
$$F(P X, P A P^\top) = P F(X, A).$$
A 1-neighborhood of node $u \in V$ is the set of adjacent nodes
$$N_u = \{ v \in V \mid (u, v) \in E \}.$$
The neighborhood features are the multiset
$$X_{N_u} = \{\!\{ x(v) \mid v \in N_u \}\!\}.$$
We can now specify a local function $\varphi$ that operates over the features of a node and its neighborhood, $\varphi(x(u), X_{N_u})$. This, in turn, leads to a permutation-equivariant function $F$ defined as
$$F(X, A) = \begin{bmatrix} \varphi(x_1, X_{N_1}) \\ \vdots \\ \varphi(x_n, X_{N_n}) \end{bmatrix}.$$
As long as $\varphi$ is permutation-invariant, $F$ will be equivariant.
[Figure: node $b$ and its neighbors $a$, $c$, $d$, $e$ with features $x_a, \ldots, x_e$; the latent feature is $h_b = \varphi(x_b, X_{N_b})$.]
An illustration$^6$ of constructing permutation-equivariant functions over graphs, by applying a permutation-invariant function $\varphi$ to every neighbourhood. In this case, $\varphi$ is applied to the features $x_b$ of node $b$ as well as the multiset of its neighbourhood features, $X_{N_b} = \{\!\{ x_a, x_b, x_c, x_d, x_e \}\!\}$. Applying $\varphi$ in this manner to every node's neighbourhood recovers the rows of the resulting matrix of latent features $H = F(X, A)$.
$^6$ M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, arXiv:2104.13478 [cs, stat] (2021).
Flavors of GNNs
[Figure: three panels (Convolutional, Attentional, Message-passing) showing node $b$, its neighbors, and the per-edge quantities $c_{bv}$, $\alpha_{bv}$, and $m_{bv}$.]
A visualisation of the dataflow for the three flavours of GNN layers, $g$. We use the neighbourhood of node $b$ from the previous figure to illustrate this. Left-to-right: convolutional, where sender node features are multiplied with a constant, $c_{uv}$; attentional, where this multiplier is implicitly computed via an attention mechanism of the receiver over the sender: $\alpha_{uv} = a(x_u, x_v)$; and message-passing, where vector-based messages are computed based on both the sender and receiver: $m_{uv} = \psi(x_u, x_v)$.
Message-passing GNNs
[Figure: the message-passing panel from the previous figure, showing messages $m_{bv}$ flowing from the neighbors of node $b$.]
Message-passing GNNs
$$h_u = \varphi\Big( x_u, \bigoplus_{v \in N_u} \psi(x_u, x_v) \Big)$$
Here $h_u$ is the output, $\varphi$ is the read-out function, $x_u$ are the node features, $\bigoplus$ is the aggregation operation, and $\psi$ is the message function.
The functions $\varphi$ and $\psi$ can be realized as multi-layer perceptrons, but we are about to see that they often have more concrete architectures.
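A minimal message-passing layer in NumPy, sketched with single-layer ReLU networks standing in for $\psi$ and $\varphi$ and a sum as the aggregation $\oplus$ (the graph, weights, and dimensions are arbitrary placeholders):

```python
# Sketch of a message-passing layer: h_u = phi(x_u, sum_{v in N_u} psi(x_u, x_v)).
import numpy as np

rng = np.random.default_rng(1)
n, s, k = 4, 3, 5
X = rng.normal(size=(n, s))                      # node features
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # adjacency matrix

W_psi = rng.normal(size=(2 * s, k))              # message network weights
W_phi = rng.normal(size=(s + k, k))              # read-out network weights
relu = lambda z: np.maximum(z, 0.0)

def psi(x_u, x_v):
    """Message from v to u, computed from both endpoints."""
    return relu(np.concatenate([x_u, x_v]) @ W_psi)

def phi(x_u, m_u):
    """Read-out combining the node's own features with the aggregated message."""
    return relu(np.concatenate([x_u, m_u]) @ W_phi)

H = np.stack([
    phi(X[u], sum(psi(X[u], X[v]) for v in range(n) if A[u, v] > 0))
    for u in range(n)
])
print(H.shape)  # (n, k): one latent feature vector per node
```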
Message-passing GNNs (example)
The paper I. Batatia, D. P. Kovács, G. N. C. Simm, C. Ortner, and G. Csányi, MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields, 2022, arXiv:2206.07697 introduces an $n$-body force field that attains state-of-the-art (SotA) accuracy.
Message-passing GNNs (example)
The graph is $G = (V, E)$, with the vertices $V$ representing the atoms and the edges $E$ representing interatomic interactions within a cut-off distance. The architecture at the $\ell$-th layer is given by
$$x_i^{(\ell+1)} = \varphi^{(\ell)}\Big( x_i^{(\ell)}, \bigoplus_{j \in N_i} \psi\big( x_i^{(\ell)}, x_j^{(\ell)} \big) \Big), \qquad x_i^{(\ell)} = \big( r_i, z_i, h_i^{(\ell)} \big),$$
where $r_i$ is the 3D position, $z_i$ the chemical element, and $h_i^{(\ell)}$ the learnable features.
The readout function $\varphi$ is of the form $\varphi(x_i, m_i) = (r_i, z_i, \kappa(x_i, m_i))$. The functions $\psi$, $\kappa$, and $\varphi$ are learnable and $\oplus$ is some permutation-invariant pooling operation.
Equivariance
The features $h$ should change in a specific way under the action of the group of rigid body motions $O(3)$:
$$h_i(Q r_1, \ldots, Q r_N) = D(Q)\, h_i(r_1, \ldots, r_N)$$
for some matrix $D(Q)$ representing the rotation $Q$ acting on the features $h_i$.
Skip connections in the energy
The energy of the $i$-th atom is obtained via skip connections from all $L$ layers:
$$E_i = \sum_{\ell=1}^{L} R^{(\ell)}\big( x_i^{(\ell)} \big).$$
The functions $R^{(\ell)}$ only depend on the $h_i^{(\ell)}$ to ensure that the energy is invariant to rigid body motions.
Message
The message is more complicated in practice than
$$m_i^{(\ell)} = \bigoplus_{j \in N_i} \psi^{(\ell)}\big( x_i^{(\ell)}, x_j^{(\ell)} \big).$$
The actual message is
$$m_i^{(\ell)} = \sum_{j=1}^{N} u_1\big( x_i^{(\ell)}, x_j^{(\ell)} \big) + \sum_{j_1=1, j_2=1}^{N} u_2\big( x_i^{(\ell)}, x_{j_1}^{(\ell)}, x_{j_2}^{(\ell)} \big) + \cdots + \sum_{j_1=1, \ldots, j_\nu=1}^{N} u_\nu\big( x_i^{(\ell)}, x_{j_1}^{(\ell)}, \ldots, x_{j_\nu}^{(\ell)} \big),$$
where $u_1, \ldots, u_\nu$ are learnable functions.
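The nested sums can be sketched directly; the functions u1 and u2 below are hypothetical stand-ins (simple $\tanh$ layers), not MACE's actual parametrization, and the message is truncated at correlation order $\nu = 2$:

```python
# Sketch of a two- and three-body message built from the nested sums above.
import numpy as np

rng = np.random.default_rng(2)
N, s = 5, 3
x = rng.normal(size=(N, s))                  # per-atom states x_j^(l)
W1 = rng.normal(size=(2 * s, s))             # weights of the 2-body term u_1
W2 = rng.normal(size=(3 * s, s))             # weights of the 3-body term u_2

u1 = lambda xi, xj: np.tanh(np.concatenate([xi, xj]) @ W1)
u2 = lambda xi, xj1, xj2: np.tanh(np.concatenate([xi, xj1, xj2]) @ W2)

i = 0
m_i = sum(u1(x[i], x[j]) for j in range(N)) \
    + sum(u2(x[i], x[j1], x[j2]) for j1 in range(N) for j2 in range(N))
print(m_i)   # the message m_i^(l), truncated at correlation order nu = 2
```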
Dynamic programming, etc.
Convolutional GNNs
[Figure: the convolutional panel from the previous figure, showing fixed coefficients $c_{bv}$ on the edges into node $b$.]
Convolutional GNNs (cont.)
$$h_i = \varphi\Big( x_i, \bigoplus_{j \in N_i} c_{ij}\, \psi(x_j) \Big)$$
Here $c_{ij}$ is the importance of node $j$ to node $i$.
The convolution stencil is $C = (c_{ij})_{i, j \in V}$. When $\oplus = +$, it is said that the architecture above implements “linear diffusion” or “position-dependent linear filtering.”
Convolutional GNNs (cont.)
Recall that the convolution of two functions $f$ and $g$ is defined as
$$(f \star g)(x) \triangleq \int_{\mathbb{R}} f(y)\, g(x - y)\, \mathrm{d}y.$$
One of the properties of the convolution is that
$$\widehat{f \star g} = \hat{f}\, \hat{g},$$
where
$$\hat{f}(\omega) \triangleq \int_{\mathbb{R}} f(x)\, e^{-i \omega \cdot x}\, \mathrm{d}x$$
is the Fourier transform of $f$.
Convolutional GNNs (cont.)
Note that
$$\Delta\big( e^{i \omega \cdot x} \big) = (-\omega^2)\, e^{i \omega \cdot x}.$$
The Fourier transform $\hat{f}(\omega)$ is the component of the orthogonal projection of the function $f$ onto the eigenfunction $e^{i \omega \cdot x}$ under the inner product $\langle f, g \rangle = \int_{\mathbb{R}} f(x)\, g(x)\, \mathrm{d}x$. In other words,
$$\hat{f}(\omega) = \langle f, e^{i \omega \cdot x} \rangle.$$
Convolutional GNNs (cont.)
If the graph Laplacian is diagonalized as
$$\mathcal{L} = I - D^{-1/2} A D^{-1/2} = U \Lambda U^\top,$$
then we can define$^7$ the graph Fourier transform of a vertex function $f$ as
$$\hat{f} = U^\top f.$$
The convolution of $f$ and $g$ can be defined on graphs based on the identity $\widehat{f \star g} = \hat{f}\, \hat{g}$ as
$$f \star g = U \big( (U^\top f) \odot (U^\top g) \big) = \big( U \operatorname{diag}(\hat{g})\, U^\top \big) f.$$
$^7$ T. N. Kipf and M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017.
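A sketch of the graph Fourier transform and the resulting convolution on a small example graph (the graph and signals are arbitrary), checking that $U((U^\top f) \odot (U^\top g)) = U \operatorname{diag}(\hat{g})\, U^\top f$:

```python
# Sketch: spectral convolution via the eigendecomposition of the normalized Laplacian.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(4) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)   # L = I - D^{-1/2} A D^{-1/2}

lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T (L is symmetric)

f = np.array([1.0, 0.0, -1.0, 2.0])   # vertex functions (signals)
g = np.array([0.5, 0.3, 0.1, 0.0])

f_hat, g_hat = U.T @ f, U.T @ g       # graph Fourier transforms
conv = U @ (f_hat * g_hat)            # f * g = U ((U^T f) elementwise (U^T g))
assert np.allclose(conv, U @ np.diag(g_hat) @ U.T @ f)
print(conv)
```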
Convolutional GNNs (cont.)
Consider a function that can be written as a power series
$$h(z) = \sum_{n=0}^{\infty} c_n z^n.$$
The function applied to a diagonalizable matrix $M = Q \Lambda Q^\top$ is then equal to
$$h(M) = Q \Big( \sum_{n=0}^{\infty} c_n \Lambda^n \Big) Q^\top = Q\, h(\Lambda)\, Q^\top,$$
so it is natural to interpret $\operatorname{diag}(\hat{g})$ in $U \operatorname{diag}(\hat{g})\, U^\top$ as a function $\hat{g}(\Lambda)$ of the matrix of eigenvalues of $\mathcal{L}$.
To recap,
$$f \star g = U\, \hat{g}(\Lambda)\, U^\top f.$$
Convolutional GNNs (cont.)
Instead of diagonalizing the graph Laplacian, convolutional filters are defined by approximating $\hat{g}(\Lambda)$ by a polynomial expansion (e.g., in terms of Chebyshev polynomials), and one arrives at
$$f \star g = \Big( \sum_{k=0}^{K} \theta_k\, T_k(\tilde{\mathcal{L}}) \Big) f,$$
where $T_0(x) = 1$, $T_1(x) = x$, $T_2(x) = 2 x^2 - 1, \ldots$ for $-1 < x < 1$. The rescaled Laplacian is
$$\tilde{\mathcal{L}} = \frac{2}{\lambda_{\max}} (\mathcal{L} - I).$$
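In practice the filter is applied with the Chebyshev three-term recurrence $T_k(x) = 2 x\, T_{k-1}(x) - T_{k-2}(x)$ rather than by diagonalization. A sketch (with an arbitrary small graph, arbitrary coefficients $\theta_k$, and the rescaling as written above):

```python
# Sketch of a Chebyshev filter: evaluate (sum_k theta_k T_k(L~)) f via the
# recurrence T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x), without diagonalizing L.
import numpy as np

def cheb_filter(L_tilde, f, theta):
    """Apply sum_k theta[k] * T_k(L_tilde) to the signal f."""
    T_prev, T_curr = f, L_tilde @ f          # T_0(L~) f = f,  T_1(L~) f = L~ f
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        out = out + theta[k] * T_curr
    return out

# Small example with the rescaled Laplacian defined above.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(3) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
lam_max = np.linalg.eigvalsh(L).max()
L_tilde = (2.0 / lam_max) * (L - np.eye(3))
print(cheb_filter(L_tilde, np.array([1.0, -1.0, 0.5]), theta=[0.2, 0.5, 0.3]))
```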
Convolutional GNNs (cont.)
If $K = 1$, we have
$$f \star g = \theta_0 f + \theta_1 \tilde{\mathcal{L}} f = \theta_0 f + \theta_1 D^{-1/2} A D^{-1/2} f.$$
A convolutional layer from the paper by Kipf and Welling (2017) is
$$h_i = \varphi\Big( x_i, \bigoplus_{j \in N_i} c_{ij}\, \psi(x_j) \Big) = \sigma\Big( W_0^\top x_i + \sum_{j \in N_i} \frac{1}{\sqrt{d(i)\, d(j)}}\, W_1^\top x_j \Big),$$
where the message function is $\psi(x_j) = W_1^\top x_j$ and the coefficients are $c_{ij} = 1/\sqrt{d(i)\, d(j)}$.
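A sketch of this layer in matrix form (arbitrary graph, features, and weights; $\sigma = \tanh$ is an assumption, and self-loops are handled by the separate $W_0$ term as in the displayed equation above):

```python
# Sketch of the convolutional layer h_i = sigma(W0^T x_i + sum_j c_ij W1^T x_j)
# with c_ij = 1 / sqrt(d(i) d(j)).
import numpy as np

rng = np.random.default_rng(3)
n, s, k = 4, 3, 2
X = rng.normal(size=(n, s))
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)

W0 = rng.normal(size=(s, k))
W1 = rng.normal(size=(s, k))
sigma = np.tanh

# C has entries c_ij = 1 / sqrt(d(i) d(j)) for neighbours j of i, zero otherwise.
C = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
H = sigma(X @ W0 + C @ (X @ W1))
print(H)
```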
Attention GNNs
[Figure: the attentional panel from the previous figure, showing attention coefficients $\alpha_{bv}$ on the edges into node $b$.]
Attention GNNs
$$h_u = \varphi\Big( x_u, \bigoplus_{v \in N_u} a(x_u, x_v)\, \psi(x_v) \Big)$$
Here $a(x_u, x_v)$ is the importance of node $v$ to node $u$.
The bivariate map $a(x_u, x_v)$, known as the self-attention mechanism, is computed implicitly. When $\oplus = +$, this generalizes convolutional GNNs to the case in which the importance coefficients are feature-dependent (i.e., $a(x_u, x_v) = c_{uv}$).
Attention
Attention (cont.)
Let $q_i, k_i, v_i \in \mathbb{R}^s$ for $i = 1, \ldots, n$ be parameters of the NN.
Consider
$$Q = \begin{bmatrix} q_1^\top \\ \vdots \\ q_n^\top \end{bmatrix} \in \mathbb{R}^{n \times s} \quad \text{and} \quad K^\top = [k_1, \ldots, k_n] \in \mathbb{R}^{s \times n}.$$
Thus, $Q K^\top \in \mathbb{R}^{n \times n}$.
Attention (cont.)
Applying
$$\operatorname{softmax}(u)_i = \frac{e^{u_i}}{\sum_{j=1}^{d_k} e^{u_j}}$$
row-wise to $Q K^\top$, we obtain the stochastic matrix
$$\operatorname{softmax}\Big( \tfrac{1}{\sqrt{s}}\, Q K^\top \Big).$$
The rows of the above matrix are probability mass functions of discrete distributions with unit variance (hence the scaling).
Attention (cont.)
Let
$$V = \begin{bmatrix} v_1^\top \\ \vdots \\ v_n^\top \end{bmatrix} \in \mathbb{R}^{n \times s}.$$
Then, attention is defined by
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\Big( \tfrac{1}{\sqrt{s}}\, Q K^\top \Big) V.$$
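A sketch of this operation with randomly generated $Q$, $K$, $V$ (the max-subtraction inside the softmax is a standard numerical-stability detail, not part of the definition above):

```python
# Sketch: Attention(Q, K, V) = softmax(Q K^T / sqrt(s)) V with a row-wise softmax.
import numpy as np

def attention(Q, K, V):
    s = Q.shape[1]
    scores = Q @ K.T / np.sqrt(s)                         # (n, n) matrix of dot products
    M = np.exp(scores - scores.max(axis=1, keepdims=True))
    M = M / M.sum(axis=1, keepdims=True)                  # row-wise softmax: a stochastic matrix
    return M @ V                                          # Attention(Q, K, V) = M V

rng = np.random.default_rng(4)
n, s = 5, 8
Q, K, V = (rng.normal(size=(n, s)) for _ in range(3))
print(attention(Q, K, V).shape)   # (n, s)
```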
Attention
Comparing
$$\big( \operatorname{Attention}(Q, K, V) \big)_i = (M V)_i = \sum_{j=1}^{n} M_{ij} V_j,$$
where $M = \operatorname{softmax}\big( \tfrac{1}{\sqrt{s}}\, Q K^\top \big)$, with
$$h_i = \varphi\Big( x_i, \bigoplus_{j \in N_i} a(x_i, x_j)\, \psi(x_j) \Big),$$
we see that $G$ is the complete graph, $\varphi(x, m) = a(x, x)\, \psi(x) + m$, and $a(x_i, x_j) = M_{ij}$. Indeed,
$$h_i = a(x_i, x_i)\, \psi(x_i) + \sum_{j \in N_i} a(x_i, x_j)\, \psi(x_j).$$
Training message-passing GNNs is more complicated than training attentional
GNNs, which are in turn more complicated to train than convolutional GNNs.
Interpretability is also easier to attain in convolutional GNNs, less so in attentional
GNNs, and even less so in message-passing GNNs.
References
∙ M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, arXiv:2104.13478 [cs, stat] (2021).
∙ I. Batatia, D. P. Kovács, G. N. C. Simm, C. Ortner, and G. Csányi, MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields, June 2022, arXiv:2206.07697.
∙ F. Chung, Spectral Graph Theory, CBMS, American Mathematical Society, 1996.
∙ B. P. Chamberlain, J. Rowbottom, M. Gorinova, S. Webb, E. Rossi, and M. M. Bronstein, GRAND: Graph Neural Diffusion, arXiv:2106.10934 [cs, stat] (2021).
∙ F. Di Giovanni, J. Rowbottom, B. P. Chamberlain, T. Markovich, and M. M. Bronstein, Graph Neural Networks as Gradient Flows, August 2022, arXiv:2206.10991 [cs, stat].
References
∙ A. Dudzik and P. Veličković, Graph Neural Networks are Dynamic Programmers, arXiv:2203.15544 [cs, math, stat] (2022).
∙ A. Grigoryan, Introduction to Analysis on Graphs, University Lecture Series, no. 71, American Mathematical Society, 2018.
∙ T. N. Kipf and M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017.
∙ D. Zhou and B. Schölkopf, A Regularization Framework for Learning from Graph Data, ICML, 2004.
∙ D. Zhou and B. Schölkopf, Regularization on Discrete Spaces, Pattern Recognition, Lecture Notes in Computer Science, Springer, 2005.