A really first conceptual introduction to Hamiltonian Monte Carlo

0
105
A really first conceptual introduction to Hamiltonian Monte Carlo


Why a very (that means: VERY!) first conceptual introduction to Hamiltonian Monte Carlo (HMC) on this weblog?

Well, in our endeavor to characteristic the assorted capabilities of TensorFlow Probability (TFP) / tfprobability, we began displaying examples of easy methods to match hierarchical fashions, utilizing certainly one of TFP’s joint distribution courses and HMC. The technical facets being advanced sufficient in themselves, we by no means gave an introduction to the “math side of things.” Here we are attempting to make up for this.

Seeing how it’s not possible, in a brief weblog submit, to offer an introduction to Bayesian modeling and Markov Chain Monte Carlo normally, and the way there are such a lot of glorious texts doing this already, we’ll presuppose some prior information. Our particular focus then is on the most recent and biggest, the magic buzzwords, the well-known incantations: Hamiltonian Monte Carlo, leapfrog steps, NUTS – as all the time, making an attempt to demystify, to make issues as comprehensible as attainable.
In that spirit, welcome to a “glossary with a narrative.”

So what’s it for?

Sampling, or Monte Carlo, strategies normally are used once we need to produce samples from, or statistically describe a distribution we don’t have a closed-form formulation of. Sometimes, we would actually have an interest within the samples; generally we simply need them so we are able to compute, for instance, the imply and variance of the distribution.

What distribution? In the kind of purposes we’re speaking about, we’ve a mannequin, a joint distribution, which is meant to explain some actuality. Starting from essentially the most primary situation, it’d appear to be this:

[
x sim mathcal{Poisson}(lambda)
]

This “joint distribution” solely has a single member, a Poisson distribution, that’s presupposed to mannequin, say, the variety of feedback in a code assessment. We even have knowledge on precise code evaluations, like this, say:

We now need to decide the parameter, (lambda), of the Poisson that make these knowledge most probably. So far, we’re not even being Bayesian but: There isn’t any prior on this parameter. But after all, we need to be Bayesian, so we add one – think about fastened priors on its parameters:

[
x sim mathcal{Poisson}(lambda)
lambda sim gamma(alpha, beta)
alpha sim […]
beta sim […]
]

This being a joint distribution, we’ve three parameters to find out: (lambda), (alpha) and (beta).
And what we’re concerned about is the posterior distribution of the parameters given the info.

Now, relying on the distributions concerned, we normally can not calculate the posterior distributions in closed type. Instead, we’ve to make use of sampling strategies to find out these parameters. What we’d prefer to level out as an alternative is the next: In the upcoming discussions of sampling, HMC & co., it’s very easy to neglect what’s it that we’re sampling. Try to all the time remember that what we’re sampling isn’t the info, it’s parameters: the parameters of the posterior distributions we’re concerned about.

Sampling

Sampling strategies normally include two steps: producing a pattern (“proposal”) and deciding whether or not to maintain it or to throw it away (“acceptance”). Intuitively, in our given situation – the place we’ve measured one thing and are actually searching for a mechanism that explains these measurements – the latter must be simpler: We “just” want to find out the probability of the info underneath these hypothetical mannequin parameters. But how can we give you recommendations to begin with?

In principle, easy(-ish) strategies exist that might be used to generate samples from an unknown (in closed type) distribution – so long as their unnormalized chances might be evaluated, and the issue is (very) low-dimensional. (For concise portraits of these strategies, equivalent to uniform sampling, significance sampling, and rejection sampling, see(MacKay 2002).) Those aren’t utilized in MCMC software program although, for lack of effectivity and non-suitability in excessive dimensions. Before HMC turned the dominant algorithm in such software program, the Metropolis and Gibbs strategies have been the algorithms of selection. Both are properly and understandably defined – within the case of Metropolis, usually exemplified by good tales –, and we refer the reader to the go-to references, equivalent to (McElreath 2016) and (Kruschke 2010). Both have been proven to be much less environment friendly than HMC, the principle matter of this submit, as a consequence of their random-walk conduct: Every proposal is predicated on the present place in state house, that means that samples could also be extremely correlated and state house exploration proceeds slowly.

HMC

So HMC is widespread as a result of in comparison with random-walk-based algorithms, it’s a lot extra environment friendly. Unfortunately, additionally it is much more tough to “get.” As mentioned in Math, code, ideas: A 3rd highway to deep studying, there appear to be (at the very least) three languages to precise an algorithm: Math; code (together with pseudo-code, which can or is probably not on the verge to math notation); and one I name conceptual which spans the entire vary from very summary to very concrete, even visible. To me personally, HMC is totally different from most different instances in that despite the fact that I discover the conceptual explanations fascinating, they lead to much less “perceived understanding” than both the equations or the code. For individuals with backgrounds in physics, statistical mechanics and/or differential geometry this may most likely be totally different!

In any case, bodily analogies make for the very best begin.

Physical analogies

The basic bodily analogy is given within the reference article, Radford Neal’s “MCMC using Hamiltonian dynamics” (Neal 2012), and properly defined in a video by Ben Lambert.

So there’s this “thing” we need to maximize, the loglikelihood of the info underneath the mannequin parameters. Alternatively we are able to say, we need to reduce the detrimental loglikelihood (like loss in a neural community). This “thing” to be optimized can then be visualized as an object sliding over a panorama with hills and valleys, and like with gradient descent in deep studying, we would like it to finish up deep down in some valley.

In Neal’s personal phrases

In two dimensions, we are able to visualize the dynamics as that of a frictionless puck that slides over a floor of various peak. The state of this method consists of the place of the puck, given by a 2D vector q, and the momentum of the puck (its mass instances its velocity), given by a 2D vector p.

Now if you hear “momentum” (and provided that I’ve primed you to think about deep studying) it’s possible you’ll really feel that sounds acquainted, however despite the fact that the respective analogies are associated the affiliation doesn’t assist that a lot. In deep studying, momentum is usually praised for its avoidance of ineffective oscillations in imbalanced optimization landscapes.
With HMC nonetheless, the main target is on the idea of vitality.

In statistical mechanics, the likelihood of being in some state (i) is inverse-exponentially associated to its vitality. (Here (T) is the temperature; we gained’t concentrate on this so simply think about it being set to 1 on this and subsequent equations.)

[P(E_i) sim e^{frac{-E_i}{T}} ]

As you would possibly or may not keep in mind from college physics, vitality is available in two kinds: potential vitality and kinetic vitality. In the sliding-object situation, the item’s potential vitality corresponds to its peak (place), whereas its kinetic vitality is expounded to its momentum, (m), by the system

[K(m) = frac{m^2}{2 * mass} ]

Now with out kinetic vitality, the item would slide downhill all the time, and as quickly because the panorama slopes up once more, would come to a halt. Through its momentum although, it is ready to proceed uphill for some time, simply as if, going downhill in your bike, you decide up velocity it’s possible you’ll make it over the subsequent (quick) hill with out pedaling.

So that’s kinetic vitality. The different half, potential vitality, corresponds to the factor we actually need to know – the detrimental log posterior of the parameters we’re actually after:

[U(theta) sim – log (P(x | theta) P(theta))]

So the “trick” of HMC is augmenting the state house of curiosity – the vector of posterior parameters – by a momentum vector, to enhance optimization effectivity. When we’re completed, the momentum half is simply thrown away. (This facet is very properly defined in Ben Lambert’s video.)

Following his exposition and notation, right here we’ve the vitality of a state of parameter and momentum vectors, equaling a sum of potential and kinetic energies:

[E(theta, m) = U(theta) + K(m)]

The corresponding likelihood, as per the connection given above, then is

[P(E) sim e^{frac{-E}{T}} = e^{frac{- U(theta)}{T}} e^{frac{- K(m)}{T}}]

We now substitute into this equation, assuming a temperature (T) of 1 and a mass of 1:

[P(E) sim P(x | theta) P(theta) e^{frac{- m^2}{2}}]

Now on this formulation, the distribution of momentum is simply a regular regular ((e^{frac{- m^2}{2}}))! Thus, we are able to simply combine out the momentum and take (P(theta)) as samples from the posterior distribution:

[
begin{aligned}
& P(theta) =
int ! P(theta, m) mathrm{d}m = frac{1}{Z} int ! P(x | theta) P(theta) mathcal{N}(m|0,1) mathrm{d}m
& P(theta) = frac{1}{Z} int ! P(x | theta) P(theta)
end{aligned}
]

How does this work in observe? At each step, we

  • pattern a brand new momentum worth from its marginal distribution (which is similar because the conditional distribution given (U), as they’re unbiased), and
  • remedy for the trail of the particle. This is the place Hamilton’s equations come into play.

Hamilton’s equations (equations of movement)

For the sake of much less confusion, must you determine to learn the paper, right here we swap to Radford Neal’s notation.

Hamiltonian dynamics operates on a d-dimensional place vector, (q), and a d-dimensional momentum vector, (p). The state house is described by the Hamiltonian, a perform of (p) and (q):

[H(q, p) =U(q) +K(p)]

Here (U(q)) is the potential vitality (referred to as (U(theta)) above), and (Ok(p)) is the kinetic vitality as a perform of momentum (referred to as (Ok(m)) above).

The partial derivatives of the Hamiltonian decide how (p) and (q) change over time, (t), in line with Hamilton’s equations:

[
begin{aligned}
& frac{dq}{dt} = frac{partial H}{partial p}
& frac{dp}{dt} = – frac{partial H}{partial q}
end{aligned}
]

How can we remedy this method of partial differential equations? The primary workhorse in numerical integration is Euler’s technique, the place time (or the unbiased variable, normally) is superior by a step of measurement (epsilon), and a brand new worth of the dependent variable is computed by taking the (partial) by-product and including it to its present worth. For the Hamiltonian system, doing this one equation after the opposite appears like this:

[
begin{aligned}
& p(t+epsilon) = p(t) + epsilon frac{dp}{dt}(t) = p(t) − epsilon frac{partial U}{partial q}(q(t))
& q(t+epsilon) = q(t) + epsilon frac{dq}{dt}(t) = q(t) + epsilon frac{p(t)}{m})
end{aligned}
]

Here first a brand new place is computed for time (t + 1), making use of the present momentum at time (t); then a brand new momentum is computed, additionally for time (t + 1), making use of the present place at time (t).

This course of might be improved if in step 2, we make use of the new place we simply freshly computed in step 1; however let’s immediately go to what’s truly utilized in up to date software program, the leapfrog technique.

Leapfrog algorithm

So after Hamiltonian, we’ve hit the second magic phrase: leapfrog. Unlike Hamiltonian nonetheless, there’s much less thriller right here. The leapfrog technique is “just” a extra environment friendly method to carry out the numerical integration.

It consists of three steps, mainly splitting up the Euler step 1 into two elements, earlier than and after the momentum replace:

[
begin{aligned}
& p(t+frac{epsilon}{2}) = p(t) − frac{epsilon}{2} frac{partial U}{partial q}(q(t))
& q(t+epsilon) = q(t) + epsilon frac{p(t + frac{epsilon}{2})}{m}
& p(t+ epsilon) = p(t+frac{epsilon}{2}) − frac{epsilon}{2} frac{partial U}{partial q}(q(t + epsilon))
end{aligned}
]

As you’ll be able to see, every step makes use of the corresponding variable-to-differentiate’s worth computed within the previous step. In observe, a number of leapfrog steps are executed earlier than a proposal is made; so steps 3 and 1 (of the next iteration) are mixed.

Proposal – this key phrase brings us again to the higher-level “plan.” All this – Hamiltonian equations, leapfrog integration – served to generate a proposal for a brand new worth of the parameters, which might be accepted or not. The manner that call is taken will not be specific to HMC and defined intimately within the above-mentioned expositions on the Metropolis algorithm, so we simply cowl it briefly.

Acceptance: Metropolis algorithm

Under the Metropolis algorithm, proposed new vectors (q*) and (p*) are accepted with likelihood

[
min(1, exp(−H(q∗, p∗) +H(q, p)))
]

That is, if the proposed parameters yield the next probability, they’re accepted; if not, they’re accepted solely with a sure likelihood that is dependent upon the ratio between previous and new likelihoods.
In principle, vitality staying fixed in a Hamiltonian system, proposals ought to all the time be accepted; in observe, lack of precision as a consequence of numerical integration might yield an acceptance charge lower than 1.

HMC in just a few traces of code

We’ve talked about ideas, and we’ve seen the mathematics, however between analogies and equations, it’s straightforward to lose monitor of the general algorithm. Nicely, Radford Neal’s paper (Neal 2012) has some code, too! Here it’s reproduced, with only a few extra feedback added (many feedback have been preexisting):

# U is a perform that returns the potential vitality given q
# grad_U returns the respective partial derivatives
# epsilon stepsize
# L variety of leapfrog steps
# current_q present place

# kinetic vitality is assumed to be sum(p^2/2) (mass == 1)
HMC <- perform (U, grad_U, epsilon, L, current_q) {
  q <- current_q
  # unbiased customary regular variates
  p <- rnorm(size(q), 0, 1)  
  # Make a half step for momentum firstly
  current_p <- p 
  # Alternate full steps for place and momentum
  p <- p - epsilon * grad_U(q) / 2 
  for (i in 1:L) {
    # Make a full step for the place
    q <- q + epsilon * p
    # Make a full step for the momentum, besides at finish of trajectory
    if (i != L) p <- p - epsilon * grad_U(q)
    }
  # Make a half step for momentum on the finish
  p <- p - epsilon * grad_U(q) / 2
  # Negate momentum at finish of trajectory to make the proposal symmetric
  p <- -p
  # Evaluate potential and kinetic energies at begin and finish of trajectory 
  current_U <- U(current_q)
  current_K <- sum(current_p^2) / 2
  proposed_U <- U(q)
  proposed_K <- sum(p^2) / 2
  # Accept or reject the state at finish of trajectory, returning both
  # the place on the finish of the trajectory or the preliminary place
  if (runif(1) < exp(current_U-proposed_U+current_K-proposed_K)) {
    return (q)  # settle for
  } else {
    return (current_q)  # reject
  }
}

Hopefully, you discover this piece of code as useful as I do. Are we by means of but? Well, up to now we haven’t encountered the final magic phrase: NUTS. What, or who, is NUTS?

NUTS

NUTS, added to Stan in 2011 and a few month in the past, to TensorFlow Probability’s grasp department, is an algorithm that goals to bypass one of many sensible difficulties in utilizing HMC: The selection of variety of leapfrog steps to carry out earlier than making a proposal. The acronym stands for No-U-Turn Sampler, alluding to the avoidance of U-turn-shaped curves within the optimization panorama when the variety of leapfrog steps is chosen too excessive.

The reference paper by Hoffman & Gelman (Hoffman and Gelman 2011) additionally describes an answer to a associated issue: selecting the step measurement (epsilon). The respective algorithm, twin averaging, was additionally not too long ago added to TFP.

NUTS being extra of algorithm within the laptop science utilization of the phrase than a factor to elucidate conceptually, we’ll go away it at that, and ask the reader to learn the paper – and even, seek the advice of the TFP documentation to see how NUTS is carried out there. Instead, we’ll spherical up with one other conceptual analogy, Michael Bétancourts crashing (or not!) satellite tv for pc (Betancourt 2017).

How to keep away from crashes

Bétancourt’s article is an superior learn, and a paragraph specializing in a single level made within the paper might be nothing than a “teaser” (which is why we’ll have an image, too!).

To introduce the upcoming analogy, the issue begins with excessive dimensionality, which is a given in most real-world issues. In excessive dimensions, as traditional, the density perform has a mode (the place the place it’s maximal), however essentially, there can’t be a lot quantity round it – identical to with k-nearest neighbors, the extra dimensions you add, the farther your nearest neighbor can be.
A product of quantity and density, the one vital likelihood mass resides within the so-called typical set, which turns into an increasing number of slender in excessive dimensions.

So, the everyday set is what we need to discover, but it surely will get an increasing number of tough to search out it (and keep there). Now as we noticed above, HMC makes use of gradient data to get close to the mode, but when it simply adopted the gradient of the log likelihood (the place) it will go away the everyday set and cease on the mode.

This is the place momentum is available in – it counteracts the gradient, and each collectively be certain that the Markov chain stays on the everyday set. Now right here’s the satellite tv for pc analogy, in Bétancourt’s personal phrases:

For instance, as an alternative of making an attempt to cause a few mode, a gradient, and a typical set, we are able to equivalently cause a few planet, a gravitational subject, and an orbit (Figure 14). The probabilistic endeavor of exploring the everyday set then turns into a bodily endeavor of inserting a satellite tv for pc in a steady orbit across the hypothetical planet. Because these are simply two totally different views of the identical mathematical system, they’ll endure from the identical pathologies. Indeed, if we place a satellite tv for pc at relaxation out in house it should fall within the gravitational subject and crash into the floor of the planet, simply as naive gradient-driven trajectories crash into the mode (Figure 15). From both the probabilistic or bodily perspective we’re left with a catastrophic consequence.

The bodily image, nonetheless, supplies a direct resolution: though objects at relaxation will crash into the planet, we are able to keep a steady orbit by endowing our satellite tv for pc with sufficient momentum to counteract the gravitational attraction. We must watch out, nonetheless, in how precisely we add momentum to our satellite tv for pc. If we add too little momentum transverse to the gravitational subject, for instance, then the gravitational attraction can be too sturdy and the satellite tv for pc will nonetheless crash into the planet (Figure 16a). On the opposite hand, if we add an excessive amount of momentum then the gravitational attraction can be too weak to seize the satellite tv for pc in any respect and it’ll as an alternative fly out into the depths of house (Figure 16b).

And right here’s the image I promised (Figure 16 from the paper):

And with this, we conclude. Hopefully, you’ll have discovered this useful – until you knew all of it (or extra) beforehand, through which case you most likely wouldn’t have learn this submit 🙂

Thanks for studying!

Betancourt, Michael. 2017. A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv e-Prints, January, arXiv:1701.02434. https://arxiv.org/abs/1701.02434.
Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77. https://doi.org/10.1080/01621459.2017.1285773.
Hoffman, Matthew D., and Andrew Gelman. 2011. “The No-u-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” https://arxiv.org/abs/1111.4246.

Kruschke, John Ok. 2010. Doing Bayesian Data Analysis: A Tutorial with r and BUGS. 1st ed. Orlando, FL, USA: Academic Press, Inc.

MacKay, David J. C. 2002. Information Theory, Inference & Learning Algorithms. New York, NY, USA: Cambridge University Press.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in r and Stan. CRC Press. http://xcelab.net/rm/statistical-rethinking/.
Neal, Radford M. 2012. MCMC using Hamiltonian dynamics.” arXiv e-Prints, June, arXiv:1206.1901. https://arxiv.org/abs/1206.1901.

LEAVE A REPLY

Please enter your comment!
Please enter your name here