Importance Sampling
So, last week I was learning a great deal about something called the VEGAS algorithm, which
was first presented by G. Peter Lepage in the late 1970s. It's a very clever way of
going about importance sampling, which is a Monte Carlo technique for numerical
integration.
To start, let's say we would like to evaluate the integral:
\begin{eqnarray}
I &=& \int f(\mathbf{x})d^n x
\end{eqnarray}
where $\mathbf{x}$ is an $n$-dimensional vector and we are integrating over
$n$-dimensional space with volume element $d^n x$. Let's say that this function, our
integrand $f(\mathbf{x})$, is intractable by analytic means -- impossible to integrate
with pen and paper. Luckily, we can estimate this integral by other (Monte Carlo) means.
A naive strategy would be to construct a uniform probability density function
(PDF) on the integration region, and sample randomly from this PDF to get the
independent variable $\mathbf{x}$ and the corresponding integrand value
$f(\mathbf{x})$. This essentially amounts to computing an expectation value.
Let's take a look in 1-D:
\begin{eqnarray}
I &=& \int_a^b f(x) dx \\
g(x) &=& \frac{1}{b-a} \\
\eta &\sim& g(x) \\
\langle f(x) \rangle_{g(x)} &=& \int_a^b f(x) g(x) dx = \frac{1}{b-a}I
\end{eqnarray}
As the number of random samples $\eta_i$ gets large, the law of large numbers tells us
that our discrete mean closes in on the "true" mean, which, in this case, is the value
of the integral $I$ divided by the interval length $b-a$:
\begin{eqnarray}
\langle f(x) \rangle_g &=& \lim_{N \to \infty} \sum_i^N
\frac{f(\eta_i)}{N} \ \ \ \ \ \eta \sim g(x) \\
&=& \frac{1}{b-a} \int_a^b f(x) dx
\end{eqnarray}
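To make this concrete, here is a minimal sketch of the naive uniform-sampling estimate in
Python. The integrand $f(x)=x^2$, the interval $[0,2]$, and the sample size are made-up
choices for illustration, not anything forced by the derivation above:
\begin{verbatim}
import numpy as np

# Minimal sketch of the naive uniform-sampling estimate described above.
# The integrand f(x) = x**2, the interval [0, 2], and the sample size are
# made-up choices for illustration; the exact answer is 8/3.
rng = np.random.default_rng(42)
a, b = 0.0, 2.0
f = lambda x: x ** 2

N = 100_000
eta = rng.uniform(a, b, size=N)   # eta ~ g(x) = 1/(b - a)
I_hat = (b - a) * f(eta).mean()   # undo the 1/(b - a) carried by the uniform PDF
print(I_hat)                      # should land close to 2.666...
\end{verbatim}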
So that's pretty cool! We have estimated an integral by taking a "strange type
of mean", drawing random variables from some arbitrary -- well, in this case
uniform -- PDF $g(x)$. One way to take care of that leftover factor of
$\frac{1}{b-a}$ above is to "normalize":
\begin{eqnarray}
I &=& \int f(x) dx = \int \frac{f(x)}{g(x)}g(x)dx
\end{eqnarray}
We see that the above is just our old integral $I$, but numerically, the last term on the
right amounts to finding the expectation value of the ratio $f/g$ "under" $g(x)$, which
means:
\begin{eqnarray}
I &=& \int \frac{f(x)}{g(x)}g(x)dx = \langle \frac{f(x)}{g(x)}
\rangle_g \\
&=& \lim_{N\to \infty} \sum_i^N \frac{1}{N}\frac{f(\eta_i)}{g(\eta_i)}
\ \ \ \ \ \eta \sim g(x)
\end{eqnarray}
So our random samples $\eta_i$ are drawn from the PDF $g(x)$, and we are
searching for the mean value of this \textbf{ratio}, $\frac{f}{g}$.
Interesting trick, right?
Our next step in this numerical procedure is to estimate the variance of our
"integral estimator", which is the name we will give to the sum of ratios above:
\begin{eqnarray}
\hat{I} &=& \sum_i^N \frac{1}{N}\frac{f(\eta_i)}{g(\eta_i)} \ \ \ \ \
\eta \sim g(x) \\
\lim_{N\to \infty} \hat{I} &=& I
\end{eqnarray}
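Here is a small sketch of that ratio estimator in code. The integrand, the particular
choice of $g(x)$, and the sample size are illustrative assumptions on my part; the only
requirement is that we can both sample from $g$ and evaluate it:
\begin{verbatim}
import numpy as np

# Sketch of the ratio estimator I_hat = (1/N) * sum f(eta_i) / g(eta_i).
# The integrand, the particular g, and N are illustrative assumptions:
# here f(x) = exp(x) on [0, 1] (so I = e - 1), and g(x) = 2(1 + x)/3 is a
# PDF on [0, 1] that roughly tracks the shape of f.
rng = np.random.default_rng(0)
N = 100_000

f = lambda x: np.exp(x)
g = lambda x: 2.0 * (1.0 + x) / 3.0

# draw eta ~ g by inverting its CDF, G(x) = (x + x**2 / 2) / 1.5
u = rng.random(N)
eta = np.sqrt(1.0 + 3.0 * u) - 1.0

I_hat = np.mean(f(eta) / g(eta))
print(I_hat)                      # close to e - 1 ~ 1.71828
\end{verbatim}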
What is $\sigma^2_I$? We construct the variance in the usual way -- the mean of the
square minus the square of the mean -- remembering that, since $\hat{I}$ is an average of
$N$ independent draws, its variance is $1/N$ times the variance of a single ratio $f/g$:
\begin{eqnarray}
\sigma^2_I &=& \langle (\hat{I}-I)^2 \rangle = \langle \hat{I}^2
\rangle - \langle \hat{I} \rangle^2 \\
&=& \frac{1}{N}\left[ \left\langle \left(\frac{f(x)}{g(x)}\right)^2 \right\rangle_g - \left\langle
\frac{f(x)}{g(x)}\right\rangle_g^2 \right] \\
&=& \frac{1}{N}\left[ \int \left(\frac{f(x)}{g(x)}\right)^2 g(x) dx - \left( \int f(x)
dx\right)^2 \right] \\
&=& \frac{1}{N}\left[ \int \frac{\vert f(x) \vert^2}{g(x)} dx - \left( \int f(x)
dx\right)^2 \right]
\end{eqnarray}
Numerically, this final equation can be represented by another discrete sum,
which we will call our estimate of the variance:
\begin{eqnarray}
\hat{\sigma}^2_I &=& \frac{ \sum_i^N \frac{\vert f(\eta_i)
\vert^2}{g(\eta_i)^2}- \frac{1}{N}\left( \sum_i^N
\frac{f(\eta_i)}{g(\eta_i)}\right)^2}{N^2} \ \ \ \ \ \ \ \eta \sim g(x)
\end{eqnarray}
Now, since we are subtracting an estimate of the mean rather than the true mean, the
above estimator of the variance is biased by a factor of $\frac{N-1}{N}$, so we should
replace the $N^2$ in the denominator above with $N(N-1)$, as we are used to.
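Putting the estimator and its bias-corrected error bar together, a quick self-contained
sketch (with the same made-up $f$ and $g$ as in the snippet above) might look like:
\begin{verbatim}
import numpy as np

# Same made-up f and g as in the previous snippet, now also computing the
# bias-corrected variance estimate for I_hat, i.e. with the N(N - 1)
# denominator (equivalently, the sample variance of the ratios over N).
rng = np.random.default_rng(0)
N = 100_000
f = lambda x: np.exp(x)
g = lambda x: 2.0 * (1.0 + x) / 3.0
eta = np.sqrt(1.0 + 3.0 * rng.random(N)) - 1.0   # eta ~ g

ratios = f(eta) / g(eta)
I_hat = ratios.mean()
var_hat = (np.sum(ratios ** 2) - np.sum(ratios) ** 2 / N) / (N * (N - 1))
print(f"{I_hat:.5f} +/- {np.sqrt(var_hat):.5f}")
\end{verbatim}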
-----------------------------------------------------------------------------------------------------------
So what choice of the probability density $g(x)$ minimizes the variance in our
integral estimate? Intuitively, we can expect that if $g(x)$ closely matches
the behaviour of $f(x)$ over the whole domain of integration, the ratio $f/g$
stays nearly constant, and the variance should be very small. Let's take a
functional derivative of our expression for the variance -- dropping the overall factor
of $1/N$, since it doesn't move the minimum -- with a Lagrange multiplier $\lambda$ in
place to maintain the constraint that $g(x)$ is normalized as a probability density
function:
\begin{eqnarray}
\sigma^2_I &=& \int \frac{\vert f(x)\vert^2}{g(x)}dx - \left(\int f(x)
dx\right)^2 + \lambda \left(\int g(x) dx -1 \right) \\
\delta \sigma^2_I &=& \int \left[ \frac{-\vert
f(x)\vert^2}{g(x)^2}+\lambda \right] \delta g(x) \, dx
\end{eqnarray}
We see that this variation vanishes for an arbitrary perturbation $\delta g$ only when
the bracketed term is zero for all $x$. Solving for $g$ (and absorbing the constant into
a redefined $\lambda$), we have
\begin{eqnarray}
g(x) &=& \lambda \vert f(x) \vert
\end{eqnarray}
Since we have already assumed $g(x)$ to be a PDF, we need not worry about an absolute
value sign on the LHS above, because PDFs are non-negative (which is what makes their
cumulative distribution functions $G(x)=\int_{-\infty}^x g(x^\prime)dx^\prime$
monotonically non-decreasing). Plugging in our normalization constraint once again, we
find:
\begin{eqnarray}
\int g(x) dx &=& \lambda \int \vert f(x) \vert dx \\
1 &=& \lambda \int \vert f(x) \vert dx \\
\frac{1}{\int \vert f(x) \vert dx} &=& \lambda \\
\end{eqnarray}
and so the optimum PDF is
\begin{eqnarray}
g(x) &=& \frac{\vert f(x) \vert}{\int \vert f(x) \vert dx}
\end{eqnarray}
that is, the optimal PDF is proportional to the absolute value of our integrand -- up to
a normalization constant -- at every $x$. This is pretty cool! One can find, by
taking a similar functional derivative in $n$ dimensions, that if the
multidimensional PDF $g(\mathbf{x})$ is separable across the components of our vector
$\mathbf{x}$,
\begin{eqnarray}
g(\mathbf{x}) &=& \prod_i^n g_i (x_i)
\end{eqnarray}
then the optimum $g(\mathbf{x})$ takes the following form:
\begin{eqnarray}
g_j (x_j) &=& \frac{\left[ \int \frac{\vert f(\mathbf{x}) \vert^2}{\prod_{i \ne
j} g_i (x_i) } d^{n-1}x\right]^{1/2}}{\int dx_j \left[ \int \frac{\vert f(\mathbf{x})
\vert^2}{\prod_{i \ne j} g_i (x_i) } d^{n-1}x\right]^{1/2}}
\end{eqnarray}
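Before moving on to VEGAS, here is a quick 1-D numerical check of the optimal-$g$ result.
The integrand $f(x)=x^4$ on $[0,1]$ and the three trial PDFs are toy choices of mine;
notice how the error bar shrinks as $g$ gets closer to being proportional to
$\vert f \vert$, collapsing to essentially zero when it is exactly proportional:
\begin{verbatim}
import numpy as np

# Toy check of the "optimal g is proportional to |f|" result in 1-D.
# f(x) = x**4 on [0, 1] (I = 1/5) and the three trial PDFs are made-up
# examples.  Each g is sampled by inverting its CDF.  For the exactly
# proportional choice, f/g is constant and the error bar collapses to
# essentially zero (up to floating point).
rng = np.random.default_rng(1)
N = 100_000
f = lambda x: x ** 4

def estimate(inv_cdf, pdf):
    eta = inv_cdf(rng.random(N))                 # eta ~ g
    ratios = f(eta) / pdf(eta)
    return ratios.mean(), ratios.std(ddof=1) / np.sqrt(N)

print(estimate(lambda u: u,            lambda x: np.ones_like(x)))  # uniform g
print(estimate(lambda u: u ** (1 / 3), lambda x: 3 * x ** 2))       # rough match
print(estimate(lambda u: u ** (1 / 5), lambda x: 5 * x ** 4))       # exactly proportional
\end{verbatim}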
-----------------------------------------------------------------------------------------------------------
So how does VEGAS go about approximating these "optimum" multidimensional
distributions? By changing the sampling grid. Lepage did a very clever thing
when he constructed an algorithm that initially samples the integrand
uniformly, essentially adopting the following PDF:
\begin{eqnarray}
g_0(\mathbf{x}) &=& \frac{1}{V_n}
\end{eqnarray}
where $V_n$ is the $n$-dimensional volume of the integration region. But then,
once these samples are taken, he perturbs the grid spacings -- keeping the
number of grid divisions constant -- so that the divisions crowd together where the
absolute value of the integrand, $\vert f(\mathbf{x}) \vert$, is largest. What this
essentially does is match $g(\mathbf{x})$'s peaks to $f(\mathbf{x})$'s peaks, and
over successive sampling rounds -- let's say M of them -- the "grid", or
$g(\mathbf{x})$, becomes better and better attuned to the characteristics of
$f(\mathbf{x})$.
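To get a feel for the mechanism, here is a toy 1-D sketch of that grid-adaptation idea.
This is my own stripped-down caricature, not Lepage's actual algorithm: the sharply
peaked integrand, the number of bins, and the rebinning rule are all illustrative
assumptions:
\begin{verbatim}
import numpy as np

# Stripped-down, 1-D caricature of the grid idea described above (not
# Lepage's actual algorithm): sample from the piecewise-uniform PDF defined
# by a grid, then resize the bins so each one carries an equal share of the
# accumulated |f|.  The peaked integrand, the number of bins K, and the
# number of samples per round are all illustrative assumptions.
def f(x):
    # narrow Gaussian bump; exact integral over [0, 1] is ~ 0.0250663
    return np.exp(-((x - 0.5) ** 2) / (2 * 0.01 ** 2))

def sample_from_grid(edges, n, rng):
    """Pick a bin uniformly, then a point uniformly inside it."""
    K = len(edges) - 1
    bins = rng.integers(0, K, size=n)
    lo, hi = edges[bins], edges[bins + 1]
    x = lo + (hi - lo) * rng.random(n)
    g = 1.0 / (K * (hi - lo))         # value of the piecewise-uniform PDF
    return x, g, bins

def refine_grid(edges, bins, weights, K):
    """Move the bin edges so each bin holds an equal share of sum |f|/g."""
    per_bin = np.bincount(bins, weights=weights, minlength=K) + 1e-12
    cdf = np.concatenate([[0.0], np.cumsum(per_bin)])
    cdf /= cdf[-1]
    return np.interp(np.linspace(0.0, 1.0, K + 1), cdf, edges)

rng = np.random.default_rng(0)
K, n = 50, 2000
edges = np.linspace(0.0, 1.0, K + 1)

for it in range(5):                   # M = 5 sampling rounds
    x, g, bins = sample_from_grid(edges, n, rng)
    ratios = f(x) / g
    err = ratios.std(ddof=1) / np.sqrt(n)
    print(f"round {it}: I ~ {ratios.mean():.5f} +/- {err:.5f}")
    edges = refine_grid(edges, bins, np.abs(ratios), K)
\end{verbatim}
In the real algorithm the grid is adapted independently along each axis in $n$
dimensions, and the rebinning is damped so the grid doesn't over-react to any single
round of samples.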