
Image by Gabriela Morgenshtern

Hands-on Tutorials

Baby’s First Algorithmic Sampling from a Distribution: Methods Available and How to Use Them

A slow and approachable introduction to sampling from and approximating complex distributions

7 min read · Apr 15, 2021


TL;DR

This writeup draws on descriptions from a recent paper on algorithmic sampling [1] to explain, in simpler terms, the motivation and approach for sampling with simple or Markov Chain Monte Carlo methods. It builds up to the Metropolis-Hastings algorithm as an example of how these ideas can be implemented in the reader's practice, and briefly lays out considerations and heuristics for the hyperparameters involved.

Sampling as a concept

Here we refer to samples xᵢ from a distribution p(x) as single realizations whose probability distribution is p(x). This is in contrast to some other places in statistics, where a sample x might refer to a collection of realizations.

Monte Carlo

Monte Carlo methods have 2 goals: (1) to generate samples from a given distribution with good coverage of that distribution; and (2) to estimate the expectation of functions under some given distribution. If we use Monte Carlo methods for sampling (i.e. solving our first problem), we will also be able to provide an unbiased estimator for the solution of our second problem — good coverage sampling will allow us to approximate the integral in the expectation expression for x~P(X):

Eₓ = E[ f(x) ] = ∫ f(x) p(x) dx

And so the fundamental computing problem we should think about here is how best to take these samples. Here we discuss one algorithmic approach for taking these samples according to Monte Carlo and Markovian approaches, as well as some considerations for improving the compute time of this one Markov Chain Monte Carlo method: the Metropolis-Hastings sampling algorithm.
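To make this concrete, here is a minimal NumPy sketch of the basic Monte Carlo estimator for E[f(x)]; the choices of p(x) (a standard normal) and f(x) = x² are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A minimal sketch of the Monte Carlo estimator for E[f(x)], x ~ p(x):
# average f over samples drawn from p. Here p(x) is a standard normal and
# f(x) = x**2 (illustrative choices), so the true expectation is 1.0.
rng = np.random.default_rng(0)

def f(x):
    return x ** 2

samples = rng.normal(loc=0.0, scale=1.0, size=10_000)  # x_i ~ p(x)
estimate = f(samples).mean()                            # (1/R) * sum_i f(x_i)
print(f"Monte Carlo estimate of E[f(x)]: {estimate:.3f} (true value: 1.0)")
```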

Markov Property/Process

A short but important note is that what we discuss here is not the Markov chain itself, which is a type of probabilistic model that incorporates ideas from the Markovian process. Markov-process methods operate under the assumption that, in a time series of random variables (RVs), every RV is independent of every other RV except the one immediately preceding it. This translates to the probability chain rule applying to the joint probability distribution of a series of samples thusly:

P(X₁, …, Xₙ) = P(X₁) P(X₂ | X₁) … P(Xₙ | Xₙ₋₁)

Monte Carlo Methods

If the distribution of our data is simple (i.e. well defined by one of our known distribution formulas), or if our cumulative distribution function (CDF) is known, sampling is easy. If the distribution is not simple, we can't directly sample from it. Such a case exists when we work with energy-based models, for example, because the normalization constant requires a computationally expensive summation over all the values in the input domain.
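For the easy case, here is a minimal sketch of inverse-transform sampling, assuming an exponential target whose CDF we can invert in closed form (the distribution is an illustrative choice):

```python
import numpy as np

# A minimal sketch of inverse-transform sampling, which is what makes a known
# CDF so convenient. For an exponential distribution with rate lam, the inverse
# CDF is F^{-1}(u) = -ln(1 - u) / lam (the distribution choice is illustrative).
rng = np.random.default_rng(0)
lam = 2.0

u = rng.uniform(size=10_000)   # uniform draws on [0, 1)
x = -np.log(1.0 - u) / lam     # push them through the inverse CDF

print(f"sample mean: {x.mean():.3f} (true mean: {1 / lam:.3f})")
```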

Monte Carlo (MC) methods are iterative methods of exploring (sampling with good coverage of) the domain space. There are 2 overarching categorizations for these methods: simple and Markov chain.

Simple Sampling

Here, every iteration is independent, because each is performed blindly (without knowledge of the previous sampling result). Simple MC uses a simple-to-sample (well-defined) distribution. Though we won’t discuss these extensively here, simple MC sampling is further subcategorized into:

Importance Sampling:

Approximates an expectation under a complex (not simple-to-sample) distribution over the domain by drawing from a simple-to-sample distribution and reweighting the samples
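As a rough illustration (with assumed, illustrative distributions rather than anything from the paper), the following sketch estimates E_p[f(x)] by sampling from a wider, simple-to-sample q(x) and reweighting each draw by p(x)/q(x):

```python
import numpy as np

# A minimal sketch of importance sampling with illustrative choices: the target
# p(x) is a standard normal (standing in for something harder to sample), the
# simple-to-sample proposal q(x) is a wider normal, and f(x) = x**2, so the
# true value of E_p[f(x)] is 1.0.
rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=10_000)                        # draws from q, not p
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # importance weights p(x)/q(x)
estimate = np.mean(weights * x ** 2)                         # weighted mean approximates E_p[f(x)]
print(f"importance-sampling estimate of E_p[f(x)]: {estimate:.3f}")
```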

Rejection Sampling:

Samples from the complex (not simple-to-sample) distribution through a simple-to-sample upper-bound distribution. That is, we approximate our complex function by taking some well-defined distribution Q(x) which, when multiplied by some positive constant c, always lies above our distribution of interest P(x) on the Cartesian axis; we then draw a candidate from Q(x), draw a height uniformly at random from the vertical slice between 0 and cQ(x) at that candidate, and keep the candidate only if the height falls under P(x). An example of and pseudocode for such an approach can be seen in this figure:

Rejection sampling with simultaneous calculation of c. Figure and pseudocode are an adaptation of those produced by Ghojogh, Nekoei, and Ghojogh et al. [1]
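Since the figure itself isn't reproduced here, the following is a minimal NumPy sketch of the same idea; the two-bump target P*(x), the Gaussian proposal Q(x), and the constant c are illustrative assumptions, and c is fixed up front rather than computed on the fly as in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized target P*(x): a two-bump mixture of Gaussians (illustrative).
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def q_pdf(x, mu=0.0, sigma=3.0):
    # Simple-to-sample proposal Q(x): a wide Gaussian density.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

c = 12.0  # assumed constant chosen so that c * Q(x) >= P*(x) across the domain

samples = []
while len(samples) < 5_000:
    candidate = rng.normal(0.0, 3.0)              # draw a candidate from Q
    u = rng.uniform(0.0, c * q_pdf(candidate))    # uniform height under c * Q(candidate)
    if u <= p_star(candidate):                    # keep it only if it lands under P*(x)
        samples.append(candidate)
samples = np.array(samples)
```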

Our unbiased estimate for the expectation of interest under P(x) could then be computed through a mean across all of our samples {x} ∈ S. It's unbiased due to the fact that we approach the true value of the distribution-of-interest's expectation as the size of our sample set approaches the size of the domain. Additionally, by increasing the cardinality R = |S| (i.e. by taking more samples), our estimator's variance will decrease at a rate proportional to 1/R. A proof of these two properties, common to all Monte Carlo estimators, can be found in Ghojogh et al. [1].

As you can see, this estimator is a very simple calculation to include in your Evidence Lower Bound (ELBO) computation when training models. And, as you increase your training set size, or the number of minibatches used for training, you will more closely approach the real expectation of your probabilistic, intractable loss function.

Markov Chain Sampling

In Markov chain sampling algorithms, there is a candidate-generating distribution Q, the choice of which is made by the implementer. The candidate at Φᵢ may or may not depend directly on Φᵢ₋₁, but will always in some way depend on the previous iteration, thus allowing for the Markovian property (as defined above). An alternative to having Φᵢ depend on Φᵢ₋₁ directly is to keep the same q(Φ) distribution, with the same parameterization, and then hinge the dependency on some other property related to the (i−1)th iteration. In either case, q(Φᵢ) must be similar to our target distribution to ensure that it is a good approximation to it.

Metropolis-Hastings Sampling Algorithm

In Metropolis algorithms, there is a simple-to-sample candidate (or proposal) distribution conditioned on the sample value of the previous iteration, Φᵢ₋₁. We use this candidate proposal as a means to sample from a complicated distribution of interest, P*(X).

If we choose the candidate distribution to be a family of Gaussians with one constant 𝝈 but an iteration-dependent 𝝁, for example, we simplify the computation of the α value due to the symmetry this dependency introduces, since q(Φᵢ₋₁ | Φ*) = q(Φ* | Φᵢ₋₁). The value of 𝝈 in such cases is also referred to as the step size, which is a hyperparameter one must consider carefully, as it affects the time it takes to explore the sample space with sufficient coverage, and thus the algorithm's runtime. Such a symmetric choice for the proposal function is what defines the Metropolis algorithm. The Hastings augmentation occurs when the algorithm's α-value calculation is generalized to accept a not-necessarily-symmetric proposal. Metropolis-Hastings builds on the Metropolis approach by using ideas from importance sampling: it weighs both the new and the old samples by the candidate distribution.

LEFT: Pseudocode of the Metropolis algorithm. RIGHT: Hastings augmentation allowing for asymmetric candidate distributions for approximation. Adapted from Ghojogh, Nekoei, and Ghojogh et al.

Note that in both the Metropolis and the Metropolis-Hastings alpha computation, the P*(X) term appears in both the numerator and denominator — a clever way to cancel out any normalization factors that don’t depend on X but are nonetheless intractable, such as those present in energy-based models.
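As a concrete companion to the pseudocode, here is a minimal NumPy sketch of the symmetric-proposal (Metropolis) case with a Gaussian random-walk proposal; the unnormalized target and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized target P*(x); its normalization constant never appears,
    # because it cancels in the acceptance ratio below (illustrative choice).
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis(p_star, phi0=0.0, step_size=1.0, n_samples=10_000):
    phi = phi0
    chain = []
    for _ in range(n_samples):
        candidate = rng.normal(phi, step_size)    # symmetric proposal q(Φ* | Φᵢ₋₁)
        alpha = p_star(candidate) / p_star(phi)   # q terms cancel for a symmetric proposal
        if rng.uniform() < alpha:                 # accept with probability min(1, alpha)
            phi = candidate
        chain.append(phi)                         # on rejection, the old sample is recorded again
    return np.array(chain)

chain = metropolis(p_star, step_size=1.0)
```

Because the proposal is symmetric, the q terms drop out of α entirely; swapping in an asymmetric proposal would require the Hastings correction shown in the right panel of the figure above.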

On the Random-Walk Metropolis (RWM)

In RWM algorithms such as Metropolis-Hastings, the current candidate sample depends on the previous iteration. Generally in RWM, the candidate Φᵢ may or may not depend directly on Φᵢ₋₁, but will always in some way depend on the previous iteration, thereby utilizing the Markovian property. We have two important choices to make when using this algorithm family: the candidate distribution and the step size. The step size directly affects the rate at which we accept samples: too high an acceptance rate (too small a step size) will result in very slow domain exploration; too low a rate (too large a step size) will result in incomplete coverage of the domain. One paper [5] suggests that an appropriate step size for a Metropolis implementation is, in most cases, on the order of 𝝈 = 2.38 (scaled by the target distribution's spread and dimensionality). One proposed heuristic for the appropriate acceptance rate is 23–50% of samples, and in any case, the candidate distribution should have higher variance than P*(X) (the distribution of interest for approximation).
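One pragmatic way to apply these heuristics is to run short pilot chains at a few candidate step sizes and keep the one whose acceptance rate lands in the suggested range. A rough sketch, continuing directly from the p_star() and metropolis() sketch above (the candidate step sizes are illustrative):

```python
# A pilot-run check of the acceptance-rate heuristic, reusing the p_star()
# and metropolis() sketches defined in the previous code block.
for step_size in (0.1, 1.0, 2.38, 10.0):
    chain = metropolis(p_star, step_size=step_size, n_samples=5_000)
    accept_rate = np.mean(chain[1:] != chain[:-1])  # fraction of iterations that moved
    print(f"step size {step_size:>5}: acceptance rate {accept_rate:.2f}")
```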

Sources

1. Ghojogh, Nekoei, Ghojogh, et al. Sampling Algorithms, from Survey Sampling to Monte Carlo Methods: Tutorial and Literature Review. Nov. 2020: https://arxiv.org/pdf/2011.00901.pdf

2. Giorgi. University of Toronto, CSC412 Winter 2020 Lecture 5: https://probmlcourse.github.io/csc412/lectures/week_6/

3. Giorgi. University of Toronto, CSC412 Winter 2020 Lecture 7: https://probmlcourse.github.io/csc412/lectures/week_10/#optimizing-the-elbo

4. Understanding Metropolis-Hastings algorithm. Machine Learning TV (UC Santa Cruz). Feb 2020: https://www.youtube.com/watch?v=0lpT-yveuIA

5. Rosenthal. Optimising and Adapting the Metropolis Algorithm. Feb. 2013: http://probability.ca/jeff/ftpdir/lawlessart.pdf

Written by Gabriela Morgenshtern

research_interests = [HCI, ML, healthcare]
