The Monte Carlo method, Markov chains,
and the Metropolis algorithm
Juan M. Bello-Rivas
jmb@ices.utexas.edu
Outline
1 High-dimensional integration
2 Markov chains
3 Metropolis algorithm
High-dimensional integration
Suppose we have
\[ \int_X f(x)\, \pi(x)\, \mathrm{d}x \]
where $X \subset \mathbb{R}^n$, $f : X \to \mathbb{R}$, and $\pi : X \to \mathbb{R}_{\geq 0}$ is a probability density function.
High-dimensional integration
Traditional approach: quadrature rules.
In 1D, if $X = (a, b) \subset \mathbb{R}$, then divide $X$ into $I \in \mathbb{N}$ subintervals. Then,
\[ \int_X f(x)\, \pi(x)\, \mathrm{d}x = \sum_{i=1}^{I} w_i\, f(x_i)\, \pi(x_i) + O(I^{-r}) \]
as $I \to \infty$ with $r \in \mathbb{N}$.
Problem: if $X = (a, b)^n$, then the number of nodes grows with $n$ as $I^n$.
High-dimensional integration
Another problem: the curse of dimensionality. The volume of the $n$-dimensional unit ball,
\[ V(n) = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)}, \]
goes to zero as $n \to \infty$.
[Figure: plot of the volume $V(n)$ against the dimension $n$ for $n = 1, \dots, 30$.]
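The vanishing volume is easy to check numerically; a minimal Python sketch (not part of the original slides):

```python
from math import pi, gamma

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return pi ** (n / 2) / gamma(n / 2 + 1)

# The volume peaks around n = 5 and then decays rapidly toward zero.
for n in (1, 2, 3, 5, 10, 30):
    print(n, unit_ball_volume(n))
```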
High-dimensional integration
Solution: Monte Carlo integration. Note that
\[ \int_X f(x)\, \pi(x)\, \mathrm{d}x = \mathbb{E}_\pi[f(x)] \]
and
\[ \mathbb{E}_\pi[f(x)] = \frac{1}{I} \sum_{i=1}^{I} f(x_i) + O(I^{-1/2}) \]
as $I \to \infty$, where the $x_i$ are random samples from the probability distribution with density $\pi$.
The $O(I^{-1/2})$ error decays slowly, but it is independent of $n$. Also note that, unlike traditional quadrature rules, the error in the Monte Carlo method is independent of the smoothness of the integrand.
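As a quick illustration, a Python sketch estimating $\mathbb{E}_\pi[f(x)]$ with $f(x) = x^2$ and $\pi$ the standard normal density (choices made for this demo only; the exact answer is $1$):

```python
import random

random.seed(0)

def mc_estimate(f, sampler, n_samples):
    """Plain Monte Carlo estimate of E_pi[f(x)] from samples of pi."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

# f(x) = x^2 under the standard normal density: exact expectation is 1.
est = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0), 100_000)
print(est)
```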
Markov chains
Definition
Let $\mathcal{X}$ be a finite set called the state space. A Markov chain is a real matrix $A(x, y)$ such that
1. $A(x, y) \geq 0$ for all $y \in \mathcal{X}$, and
2. $\sum_{y \in \mathcal{X}} A(x, y) = 1$
for each $x \in \mathcal{X}$.
Definition
We can regard each row of $A(x, y)$ as a probability mass function and then consider a random walk directed by the following procedure:
1. from state $x$, choose a new state $y$ with probability $A(x, y)$,
2. from state $y$, choose a new state $z$ with probability $A(y, z)$,
3. and so on.
The outcomes $X_0 = x$, $X_1 = y$, $X_2 = z, \dots$ are referred to as a run of the chain starting at $x$.
Remark
In other words, the value of A(x, y) completely determines the
conditional probability of reaching y from x.
Example
Let $p \in (0, 1)$ and $\mathcal{X} = \{R, I, P\}$ with $R = 1$, $I = 2$, $P = 3$. We can think of the Markov chain
\[ A = \begin{pmatrix} 0 & 1 & 0 \\ 1 - p & 0 & p \\ 1 & 0 & 0 \end{pmatrix} \]
as a graph whose edges are $R \to I$ (probability $1$), $I \to R$ (probability $1 - p$), $I \to P$ (probability $p$), and $P \to R$ (probability $1$).
[Figure: transition graph of $A$.]
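A run of this chain can be simulated directly by sampling from the rows of $A$; a Python sketch (the value $p = 1/2$ is an arbitrary choice for the demo):

```python
import random

random.seed(0)

p = 0.5
# Rows of A: transition probabilities out of states 0 (R), 1 (I), 2 (P).
A = [[0.0, 1.0, 0.0],
     [1.0 - p, 0.0, p],
     [1.0, 0.0, 0.0]]

def run_chain(A, x0, n_steps):
    """Generate a run X_0, X_1, ..., X_n by sampling each row of A."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(random.choices(range(len(A)), weights=A[xs[-1]])[0])
    return xs

xs = run_chain(A, 0, 100_000)
freq = [xs.count(s) / len(xs) for s in range(3)]
print(freq)  # long-run visit frequencies of R, I, P
```

For $p = 1/2$ the stationary distribution works out to $(0.4, 0.4, 0.2)$, which the empirical frequencies approach.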
From the previous definitions and the chain rule of conditional probability, $P(A, B) = P(A \mid B)\, P(B)$, we know that
\[ P(X_1 = y \mid X_0 = x) = A(x, y) \]
and
\begin{align*}
P(X_2 = z, X_1 = y \mid X_0 = x) &= P(X_2 = z \mid X_1 = y, X_0 = x)\, P(X_1 = y \mid X_0 = x) \\
&= P(X_2 = z \mid X_1 = y)\, P(X_1 = y \mid X_0 = x) \\
&= A(y, z)\, A(x, y).
\end{align*}
Thus, since
\[ P(X_2 = z, X_1 = y \mid X_0 = x) = A(y, z)\, A(x, y), \]
we conclude that
\begin{align*}
P(X_2 = z \mid X_0 = x) &= \sum_{y \in \mathcal{X}} P(X_2 = z, X_1 = y \mid X_0 = x) \\
&= \sum_{y \in \mathcal{X}} A(x, y)\, A(y, z) \\
&= A^{(2)}(x, z).
\end{align*}
This leads us to the expression for the $n$-step Markov chain:
\[ A^{(n)}(x, y) = P(X_n = y \mid X_0 = x). \]
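The sum over $y$ is exactly matrix multiplication, so $A^{(n)}$ is the $n$-th matrix power of $A$. A quick numerical check in Python, reusing the three-state example (with $p = 1/2$, an arbitrary choice for the demo):

```python
import numpy as np

p = 0.5
A = np.array([[0.0, 1.0, 0.0],
              [1.0 - p, 0.0, p],
              [1.0, 0.0, 0.0]])

# Two-step transition probabilities: A^(2)(x, z) = sum_y A(x, y) A(y, z).
A2 = A @ A
print(A2)

# Each A^(n) is again a Markov chain: its rows sum to one.
A5 = np.linalg.matrix_power(A, 5)
print(A5.sum(axis=1))
```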
Definition
A Markov chain $A$ is said to be irreducible if for all $x, y \in \mathcal{X}$ there exists $n \in \mathbb{N}$ such that $A^{(n)}(x, y) > 0$.
Definition
An irreducible Markov chain is aperiodic if
\[ \gcd\bigl(\{ n \in \mathbb{N} \mid A^{(n)}(x, x) > 0 \}\bigr) = 1 \]
for some $x \in \mathcal{X}$.
Example
The Markov chain given by
\[ \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix} \]
is aperiodic, whereas the Markov chain given by
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \]
is periodic with period two. Both chains are irreducible.
Definition
Let $A(x, y)$ be a Markov chain over the state space $\mathcal{X}$. A stationary distribution of $A$ is a probability mass function $\pi(x)$ such that
\[ \pi(x) \geq 0 \text{ for each } x \in \mathcal{X} \tag{1} \]
and
\[ \pi(x) = \sum_{y \in \mathcal{X}} \pi(y)\, A(y, x). \tag{2} \]
Remark
Note that condition (2) can be interpreted as saying that A(x, y)
has a left eigenvector with corresponding eigenvalue equal to 1
that coincides with the probability mass function π(x).
This suggests a way of sampling from π(x).
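The left-eigenvector characterization gives one way to compute $\pi$ numerically; a Python sketch using the three-state chain from the earlier example ($p = 1/2$, an arbitrary choice for the demo):

```python
import numpy as np

p = 0.5
A = np.array([[0.0, 1.0, 0.0],
              [1.0 - p, 0.0, p],
              [1.0, 0.0, 0.0]])

# Left eigenvectors of A are (right) eigenvectors of the transpose of A.
eigvals, eigvecs = np.linalg.eig(A.T)

# Pick the eigenvector for the eigenvalue 1 and normalize it to sum to one.
i = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, i])
pi /= pi.sum()
print(pi)
```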
Definition
Let $\mu$ and $\nu$ be two probability mass functions over the same state space $\mathcal{X}$. The total variation distance between $\mu$ and $\nu$ is defined as
\[ \| \mu - \nu \|_{TV} = \max_{A \subset \mathcal{X}} | \mu(A) - \nu(A) |. \]
[Figure: interpretation of the TV distance between $\mu$ and $\nu$.]
Theorem (Fundamental theorem of Markov chains)
If $A$ is an irreducible Markov chain, then there exists a unique stationary distribution $\pi$ of $A$. Moreover, if $A$ is aperiodic, then for all $x \in \mathcal{X}$,
\[ \lim_{n \to \infty} A^{(n)}(x, \cdot) = \pi(\cdot) \quad \text{in } \| \cdot \|_{TV}. \]
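The convergence can be observed numerically. For probability mass functions the TV distance equals half the $\ell_1$ distance, which the Python sketch below uses; the chain is again the three-state example with $p = 1/2$ (an arbitrary choice for the demo):

```python
import numpy as np

p = 0.5
A = np.array([[0.0, 1.0, 0.0],
              [1.0 - p, 0.0, p],
              [1.0, 0.0, 0.0]])
pi = np.array([0.4, 0.4, 0.2])  # stationary distribution of A for p = 1/2

def tv(mu, nu):
    """Total variation distance; for pmfs it is half the l1 distance."""
    return 0.5 * np.abs(mu - nu).sum()

# Every row of A^(n) converges to pi in total variation as n grows.
for n in (1, 5, 20, 50):
    An = np.linalg.matrix_power(A, n)
    print(n, max(tv(An[x], pi) for x in range(3)))
```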
Effect of break-down of irreducibility
Let $\mathcal{X} = \{1, 2, 3, 4\}$. Consider the family of matrices given by
\[ A_t = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 - t/2 & 0 & t/2 & 0 \\ 0 & t/2 & 0 & 1 - t/2 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \]
where $0 \leq t \leq 1$. Clearly,
\[ A_0 = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad \text{and} \quad A_1 = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \]
Effect of break-down of irreducibility
[Figure: transition graphs of (a) $A_0$ and (b) $A_1$.]
Effect of break-down of irreducibility
We can characterize the invariant subspaces of $A_t^T$. Indeed, the eigenvalues are
\[ \{ 1,\ 1 - t/2,\ -1 + t/2,\ -1 \}, \]
and the corresponding eigenspaces are spanned by the vectors
\begin{align*}
E_{1} &= (1,\ 2/(2 - t),\ 2/(2 - t),\ 1), \\
E_{1 - t/2} &= (1,\ 1,\ -1,\ -1), \\
E_{-1 + t/2} &= (1,\ -1,\ -1,\ 1), \\
E_{-1} &= (1,\ -2/(2 - t),\ 2/(2 - t),\ -1).
\end{align*}
The negative and positive parts of $E_{1 - t/2}$ when $t \approx 0$ allow us to identify metastable states.
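These eigenvalues are easy to verify numerically; a Python sketch ($t = 1/2$ is an arbitrary choice for the demo):

```python
import numpy as np

t = 0.5  # any 0 < t < 1 works; t = 1/2 is an arbitrary choice
At = np.array([[0.0, 1.0, 0.0, 0.0],
               [1.0 - t / 2, 0.0, t / 2, 0.0],
               [0.0, t / 2, 0.0, 1.0 - t / 2],
               [0.0, 0.0, 1.0, 0.0]])

# Eigenvalues of A_t^T (the same as those of A_t); the chain is
# reversible, so they are real.
eigvals = np.sort(np.linalg.eigvals(At.T).real)
print(eigvals)

expected = np.sort([1.0, 1.0 - t / 2, -1.0 + t / 2, -1.0])
```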
Metropolis algorithm
Let π(x) be a probability density function (known up to a nonzero
scale factor) on a state space X .
We want to generate random samples from π and, in order to
achieve this goal, we are going to run a Markov chain in X that
has π as its stationary distribution.
The Metropolis algorithm allows us to construct and run that
Markov chain.
Definition
A Markov chain $T(x, y)$ is said to be symmetric if $T(x, y) = T(y, x)$ for all $x, y \in \mathcal{X}$.
Definition
A Markov chain $A(x, y)$ is said to satisfy the detailed balance equations if
\[ \pi(x)\, A(x, y) = \pi(y)\, A(y, x). \tag{3} \]
Proposition
If the chain A satisfies (3) for π, then π is stationary for A.
Proof.
Summing over $y \in \mathcal{X}$ on both sides of (3), we have
\[ \sum_{y \in \mathcal{X}} \pi(y)\, A(y, x) = \sum_{y \in \mathcal{X}} \pi(x)\, A(x, y) = \pi(x) \underbrace{\sum_{y \in \mathcal{X}} A(x, y)}_{=1} = \pi(x). \]
Let $T(x, y)$ be a symmetric Markov chain on $\mathcal{X}$.

Algorithm 1 Metropolis.
Input: current state $x^{(n)}$.
Output: next state $x^{(n+1)}$.
1. $y \leftarrow$ random sample from the distribution given by $T(x^{(n)}, \cdot)$
2. $C \leftarrow$ random sample from the uniform distribution on the interval $[0, 1]$
3. if $C \leq \min\{ 1,\ \pi(y) / \pi(x^{(n)}) \}$ then
4.   $x^{(n+1)} \leftarrow y$
5. else
6.   $x^{(n+1)} \leftarrow x^{(n)}$
7. end if
8. return $x^{(n+1)}$
Remark
It is clear that we only need knowledge of π up to a constant.
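The algorithm translates almost line by line into code; a Python sketch that uses a symmetric random-walk proposal to sample from the unnormalized standard normal density $e^{-x^2/2}$ (the target density is an assumption for this demo):

```python
import math
import random

random.seed(0)

def metropolis_step(x, pi, dx=1.0):
    """One Metropolis update: symmetric uniform proposal, accept/reject."""
    y = x + random.uniform(-dx, dx)   # symmetric proposal T(x, .)
    c = random.uniform(0.0, 1.0)
    if c <= min(1.0, pi(y) / pi(x)):  # pi enters only through a ratio
        return y
    return x

# Unnormalized standard normal density: the constant cancels in the ratio.
pi = lambda x: math.exp(-0.5 * x * x)

xs = [0.0]
for _ in range(100_000):
    xs.append(metropolis_step(xs[-1], pi))

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)
```

Note that, as in the remark above, the acceptance test uses only the ratio $\pi(y)/\pi(x)$, so the normalizing constant is never needed.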
Illustration
[Figure: the density $\pi(x)$ with the current state $x^{(n)}$ and a proposed state $y$ marked on the $x$ axis.]
Demo
clear all
close all

beta = 0.25;
U = @(x) (x.^2 - 1).^2 - x;
p = @(x) exp(-beta * U(x));

dx = 0.25;
xs = linspace(-3, 3, 2000);
px = p(xs);

K = 100000;
x = zeros(K, 1);
x(1) = -2;

for k = 2:K
  y = x(k-1) + (2*rand - 1) * dx;
  c = rand;
  if c <= p(y) / p(x(k-1))
    x(k) = y;
  else
    x(k) = x(k-1);
  end

  clf; subplot(1, 2, 1); hold all; plot(xs, px, 'k', 'LineWidth', 2);
  plot([x(k)], [p(x(k))], 'MarkerSize', 10, 'Marker', 'o', 'LineWidth', 4, ...
       'LineStyle', 'none'); xlim([-2.5, 2.5]); hold off;
  subplot(1, 2, 2);
  hist(x(1:k), 25); xlim([-2.5, 2.5]); drawnow;
end
Questions?