In the previous post, we looked at rejection sampling and how it can be used to sample from a distribution when inverse sampling is not feasible. In this post, we will discuss a method called importance sampling. Unlike rejection sampling or inverse sampling, importance sampling is not used to draw samples from a distribution. Instead, it is a method for approximating integrals of the form

$$\mathbb{E}_p[f(x)] = \int f(x)\, p(x)\, dx$$

Figure 1.

where p is a probability density function.

A naive approach would be to sample from p using either inverse or rejection sampling and then use a Monte Carlo estimate of the integral, as shown below:

$$\mathbb{E}_p[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \sim p$$

Figure 2.

This method, however, requires a large number of samples to achieve a low variance in the estimate of the integral. Importance sampling is a method to achieve lower variance with the same number of samples.
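As a concrete illustration of the naive approach, here is a minimal sketch. The setup (p a standard normal, f(x) = x²) is my own choice for illustration; the true value of the integral is then 1, the variance of a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: p = N(0, 1) and f(x) = x**2.
# The integral E_p[f(x)] is then Var(X) = 1 for X ~ N(0, 1).
def f(x):
    return x ** 2

samples = rng.standard_normal(100_000)   # draw x_i ~ p
naive_estimate = np.mean(f(samples))     # Monte Carlo average of f(x_i)
print(naive_estimate)                    # close to 1.0
```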

The main idea of importance sampling is to sample from the parts of the distribution that are “important”, i.e. from regions that have high probability under p and a high absolute value of f. This helps because we want to evaluate f at points that are highly likely under p and where the value of f is large enough to contribute significantly to the sum in figure 2.

This is achieved by sampling from a proposal distribution q and rewriting the integral as:

$$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\!\left[f(x)\, \frac{p(x)}{q(x)}\right]$$

Figure 3.

The above expectation over q can in turn be computed using Monte Carlo estimation as

$$\mathbb{E}_q\!\left[f(x)\, \frac{p(x)}{q(x)}\right] \approx \frac{1}{N}\sum_{i=1}^{N} w_i\, f(x_i), \qquad x_i \sim q,$$

Figure 4.

where

$$w_i = \frac{p(x_i)}{q(x_i)}$$

Figure 5.

are the importance weights.
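To make this concrete, here is a small sketch of the importance sampling estimator. The setup (standard normal target p, f a tail indicator, and a shifted normal proposal q) is assumed for illustration; it is a case where naive sampling from p performs poorly because f is almost always zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), written out to keep the sketch self-contained.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed toy problem: p = N(0, 1) and f(x) = 1[x > 3], so
# E_p[f(x)] = P(X > 3), roughly 1.35e-3.  Samples from p almost never
# land in the "important" region x > 3, so we instead sample from a
# proposal q = N(3, 1) centred on that region.
f = lambda x: (x > 3).astype(float)

n = 100_000
x = rng.normal(3.0, 1.0, size=n)                        # x_i ~ q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 3.0, 1.0)   # w_i = p(x_i) / q(x_i)
is_estimate = np.mean(w * f(x))                         # (1/N) * sum_i w_i f(x_i)
print(is_estimate)                                      # close to 1.35e-3
```

Note that the estimator only needs pointwise evaluations of p and q; the proposal just has to be easy to sample from and nonzero wherever f(x) p(x) is nonzero.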

One of the motivations for using importance sampling is to reduce the variance of the estimate. Hence a natural criterion for selecting a proposal distribution is to pick one that minimizes the variance of the estimate.

Let us derive what such a proposal would look like.

$$\operatorname{Var}_q\!\left[f(x)\, \frac{p(x)}{q(x)}\right] = \mathbb{E}_q\!\left[f^2(x)\, \frac{p^2(x)}{q^2(x)}\right] - \left(\mathbb{E}_p\!\left[f(x)\right]\right)^2$$

Figure 6.

Since the second term is independent of q, we can ignore it. Using Jensen’s inequality, we can find a lower bound on the first term.

$$\mathbb{E}_q\!\left[f^2(x)\, \frac{p^2(x)}{q^2(x)}\right] \ge \left(\mathbb{E}_q\!\left[|f(x)|\, \frac{p(x)}{q(x)}\right]\right)^2 = \left(\int |f(x)|\, p(x)\, dx\right)^2$$

Figure 7.

The lower bound is obtained when

$$q^*(x) = \frac{|f(x)|\, p(x)}{\int |f(x')|\, p(x')\, dx'}$$

Figure 8.
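When f is an indicator function, this optimal proposal can even be sampled from directly: q* is just p restricted to the region where f is nonzero, and the weighted integrand f(x) p(x)/q*(x) becomes constant, so the estimator has zero variance. A small sketch of this, reusing the same assumed toy tail-probability setup and SciPy's truncated normal:

```python
import numpy as np
from scipy import stats

# Assumed toy problem: p = N(0, 1), f(x) = 1[x > 3].
# Then q*(x), proportional to |f(x)| p(x), is the standard normal
# truncated to the interval (3, inf).
q_star = stats.truncnorm(3.0, np.inf)       # N(0, 1) restricted to x > 3

x = q_star.rvs(size=10, random_state=0)     # a handful of samples suffices
w = stats.norm.pdf(x) / q_star.pdf(x)       # w_i = p(x_i) / q*(x_i)
estimate = np.mean(w * (x > 3))

# Every weight equals the normalizing constant P(X > 3), so the
# estimate is exact even with 10 samples.
print(estimate)
```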

This post concludes the series on sampling. In the next post we will look into variational inference.

[1] Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 23