In statistics and in statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sequence can be used to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value). Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, other methods are usually available (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and are free from the problem of auto-correlated samples that is inherent in MCMC methods.
|
Contents
|
The algorithm was named after Nicholas Metropolis, who was an author along with Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller of the 1953 paper Equation of State Calculations by Fast Computing Machines which first proposed the algorithm for the specific case of the Boltzmann distribution;[1] and W. Keith Hastings,[2] who extended it to the more general case in 1970.[3] There is controversy over the credit for discovery of the algorithm. Edward Teller states in his memoirs that the five authors of the 1953 paper worked together for "days (and nights)".[4] M. Rosenbluth, in an oral history recorded shortly before his death [5] credits E. Teller with posing the original problem, himself with solving it, and A.W. Rosenbluth (his wife) with programming the computer. According to M. Rosenbluth, neither Metropolis nor A.H. Teller participated in any way. Rosenbluth's account of events is supported by other contemporary recollections.[6]
The Gibbs sampling algorithm is a special case of the Metropolis–Hastings algorithm which is usually faster and easier to use but is less generally applicable.
The Metropolis–Hastings algorithm can draw samples from any probability distribution
, requiring only that a function proportional to the density be calculable. In Bayesian applications, the normalization factor is often extremely difficult to compute, so the ability to generate a sample without knowing this constant of proportionality is an important feature of this and other commonly used sampling algorithms.
The general idea of the algorithm is to generate a series of samples that are linked in a Markov chain (where each sample is correlated only with the directly preceding sample). At sufficiently long times, the distribution of the generated samples matches the
distribution. The algorithm essentially works as follows (this is actually a description of the Metropolis algorithm, a special case of Metropolis–Hastings):
(the proposal density or jumping distribution), which suggests a new sample value
given a sample value
. For the simple Metropolis algorithm described here, this proposal density must be symmetric in that
. A commonly used symmetric jumping distribution is a Gaussian distribution centered at
, which tends to move to points nearby
(and thus explores the probability space using a random walk).
as the first sample.
given the most recent sample
, we proceed as follows:
from the jumping distribution
.
.
, accept
by setting
.
with probability
. That is, pick a uniformly distributed random number
between 0 and 1, and if
set
, else set
.This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio
indicates how probable the new proposed sample is with respect to the current sample, according to the distribution
. If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of
), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the more the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of
, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works, and returns samples that follow the desired distribution
.
Compared with an algorithm like adaptive rejection sampling that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:
, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every nth sample, for some value of n (typically determined by examining the auto-correlation between adjacent samples). Auto-correlation can be reduced by increasing the jumping width (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a slow-mixing Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution.On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from hierarchical Bayesian models and other high-dimensional statistical models used nowadays in many disciplines.
In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the right jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. This is especially applicable when the multivariate distribution is composed out of a set of individual random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are adaptive rejection sampling, a one-dimensional Metropolis–Hastings step, or slice sampling.
The objective of the Metropolis-Hastings algorithm is to asymptotically generate states x according to a desired distribution
. Asymptotic here means that the Markov process must first converge to the asymptotic distribution, i.e. there is a transient time before the states are being sampled according to
. As any Markov process, each new state x' is generated from the current state x and this transition does not depend on the history of the process. The transition probability can then be written as a Markov matrix
.
The question is how this matrix can be chosen such that, asymptotically, the states are generated according to
. It can be shown that two conditions are necessary and sufficient: ergodicity of the process, and condition of balance of the matrix which is generally satisfied if the matrix fulfills detailed balance. The ergodicity condition ensures that, at most, one asymptotic distribution exists. The detailed balance condition ensures that there exists at least one asymptotic distribution and that this asymptotic distribution is
.Formally, the detailed balance states that the probability to be in the state x and transit to x' must be the same as be in the state x' and transit to x. This can be written has

which can be rearranged to

One important issue to address is how the dynamics of the process can be set using this transition matrix. One normal approach is to separate each step of the process in two sub-steps; the proposal and the acceptance-refusal. The idea is to first propose a state x' according to a given distribution
and accept it according to
. This means that for the system to transit from x to x', the state x' must be both proposed and accepted. These two sub-processes are equivalente to a transition, and so
.
This relation can be inserted on the previous equation to obtain
.
This equation is still redundant in the sense that there are several matrixes A's that lead to this same relationship. A typical choice for the acceptance is

which is known as the Metropolis choice. This equation completes the formal derivation of the process. From the practical point of view the Metropolis-Hastings algorithm consists in the following steps:
;
. If not accepted, that means that x' = x, and so there is no need to update anything. Else, the system transits to x';This procedure in principle ensures that the saved states are not correlated, and T must be chosen according to different factors such as the distribution g, the acceptance it leads, etc.
One important remark is that it is not clear which distribution
one should use; it is a free parameter of the method. However, in the complete lack of knowledge, one can always use a uniform distribution. On this case, T can be set to 1 because the proposed state x' is always de-correlated from the present state x.
Suppose the most recent value sampled is
. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state
with probability density
, and calculate a value

where

is the likelihood ratio between the proposed sample
and the previous sample
, and

is the ratio of the proposal density in two directions (from
to
and vice versa). This is equal to 1 if the proposal density is symmetric. Then the new state
is chosen according to the following rules.


The Markov chain is started from an arbitrary initial value
and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as burn-in. The remaining set of accepted values of
represent a sample from the distribution
.
The algorithm works best if the proposal density matches the shape of the target distribution
from which direct sampling is difficult, that is
. If a Gaussian proposal density
is used the variance parameter
has to be tuned during the burn-in period. This is usually done by calculating the acceptance rate, which is the fraction of proposed samples that is accepted in a window of the last
samples. The desired acceptance rate depends on the target distribution, however it has been shown theoretically that the ideal acceptance rate for a one dimensional Gaussian distribution is approx 50%, decreasing to approx 23% for an
-dimensional Gaussian target distribution.[7] If
is too small the chain will mix slowly (i.e., the acceptance rate will be high but successive samples will move around the space slowly and the chain will converge only slowly to
). On the other hand, if
is too large the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so
will be very small and again the chain will converge very slowly.
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)