Differential Regularizers as Priors over Functions $f(x)$: Gaussian Process Kernels as Green's Functions
Often, when we are
solving a regression problem, we are given the following functional:
\begin{eqnarray}
J[f] &=& l\left[f,y\right]+\Omega \left[f \right]
\end{eqnarray}
where $l$ is some loss functional and $\Omega$ is some regularizer. Given some
finite dataset, in the case of squared loss this might look like:
\begin{eqnarray}
X, \mathbf{y} &=& \lbrace \mathbf{x}_n, y_n \rbrace_{n=1}^N \\
J &=& \sum_{n=1}^N \left( f(\mathbf{x}_n)-y_n \right)^2 + \Omega(f)
\end{eqnarray}
For a linear basis expansion model, we have the following:
\begin{eqnarray}
f(\mathbf{x}) &=& \mathbf{w}\cdot \phi(\mathbf{x}) \\
J(\mathbf{w}) &=& \sum_{n=1}^N \left( \mathbf{w}\cdot \phi(\mathbf{x}_n) -y_n
\right)^2 + \lambda \vert \mathbf{w} \vert^2
\end{eqnarray}
where $\lambda \vert \mathbf{w} \vert^2$ plays the role of a prior over
functions. The cost function in this example is proportional to the negative
log posterior; what we have is essentially:
\begin{eqnarray}
-\log P(\mathbf{w} \vert X, \mathbf{y}, \phi) & \sim & J(\mathbf{w}) \\
-\log \mathcal{L}(X, \mathbf{y} \vert \mathbf{w}, \phi) & \sim &
\sum_{n=1}^N \left( \mathbf{w}\cdot \phi(\mathbf{x}_n) -y_n \right)^2 \\
-\log P(\mathbf{w} \vert \phi) & \sim & \Omega(f) = \lambda \vert
\mathbf{w} \vert^2
\end{eqnarray}
(So that's a Gaussian likelihood and a Gaussian prior.)
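As a minimal numerical sketch of this ridge setup (the polynomial basis, toy
data, and value of $\lambda$ below are arbitrary choices for illustration):
\begin{verbatim}
import numpy as np

# Ridge regression: minimize sum_n (w . phi(x_n) - y_n)^2 + lam * |w|^2.
# The basis, data, and lam are made-up choices purely for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

def phi(x, degree=5):
    """Polynomial feature map: one row per input point."""
    return np.vander(x, degree + 1, increasing=True)

lam = 1e-3
Phi = phi(x)                                      # N x D design matrix
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
f_hat = Phi @ w                                   # fitted values at the inputs
\end{verbatim}
The solve line is just the closed-form minimizer $\mathbf{w} =
(\Phi^\top \Phi + \lambda I)^{-1}\Phi^\top \mathbf{y}$.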
Minimizing the cost
with respect to $\mathbf{w}$ is the same thing as finding the mode of the
posterior, i.e. the maximum a posteriori (MAP) estimate. We've already talked
about how, for such a regularized regression problem, we can write the solution
as a linear combination of kernels centered on the data (a manifestation of the
representer theorem):
\begin{eqnarray}
f(\mathbf{x}) &=& \sum_{n=1}^N \alpha_n K(\mathbf{x},\mathbf{x}_n )
\end{eqnarray}
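For squared loss, the expansion coefficients can be found by solving a linear
system, $\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$. A minimal
sketch (the squared-exponential kernel, toy data, and $\lambda$ here are again
just placeholder choices):
\begin{verbatim}
import numpy as np

# Kernel ridge regression: f(x) = sum_n alpha_n K(x, x_n), with
# alpha = (K + lam * I)^{-1} y for squared loss.  The kernel, data,
# and lam are placeholder choices for illustration.
rng = np.random.default_rng(1)
xn = np.linspace(-3.0, 3.0, 40)
yn = np.tanh(xn) + 0.1 * rng.standard_normal(xn.size)

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel matrix K[i, j] = k(a_i, b_j)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

lam = 1e-2
alpha = np.linalg.solve(rbf(xn, xn) + lam * np.eye(xn.size), yn)

x_star = np.linspace(-3.0, 3.0, 200)
f_star = rbf(x_star, xn) @ alpha    # predictions via the kernel expansion
\end{verbatim}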
But one important question to ask,
given some regularization functional, is: what is the ``best'' kernel? Let's
take, for example, the regularizer:
\begin{eqnarray}
J[f] &=& \sum_{n=1}^N \left( f(x_n)-y_n \right)^2 + \lambda \int dx\,
\left\vert\frac{\partial^2 f(x)}{\partial x^2}\right\vert^2
\end{eqnarray}
This $\Omega$ functional penalizes curvature in our fitting function $f(x)$,
and we can note that such a regularizer really is a prior over
functions, since:
\begin{eqnarray}
P(f \vert X, \mathbf{y} ) &=& \frac{P(X,\mathbf{y} \vert f)\, P(f)}{P(X,
\mathbf{y})} \\
&=& \frac{\mathrm{exp}\left[- \sum_{n=1}^N \left(f(x_n)-y_n
\right)^2 - \lambda \int dx\, \left\vert\frac{\partial^2 f(x)}{\partial
x^2}\right\vert^2 \right] }{P(X, \mathbf{y})}
\end{eqnarray}
We see that the prior on our function is:
\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\lambda \int \vert f^{\prime
\prime}(x)\vert^2 dx \right)
\end{eqnarray}
where $\lambda$ controls the ``strength'' of our penalty for curvature. To be
more general, we could have written the prior on our function as a
superposition of squared-derivative penalties:
\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(- \sum_{m=1}^\infty a_m \int
\left\vert \frac{\partial^m f(x)}{\partial x^m} \right\vert^2 dx \right)
\end{eqnarray}
If we integrate by parts (assuming boundary terms vanish; the curvature term
is worked out explicitly below), we note that this prior functional can be put
into the form:
\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\int dx\, dx^\prime\, f(x)
K^{-1}(x,x^\prime) f(x^\prime) \right)
\end{eqnarray}
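To make the integration by parts explicit for the curvature term (with
boundary contributions assumed to vanish):
\begin{eqnarray}
\int dx \, \vert f^{\prime \prime}(x)\vert^2 &=& \int dx \, f^{\prime
\prime}(x) f^{\prime \prime}(x) = \int dx \, f(x) \frac{\partial^4
f(x)}{\partial x^4}
\end{eqnarray}
and in general $a_m \int \vert \partial^m f/\partial x^m \vert^2 dx = (-1)^m
a_m \int f(x)\, \partial^{2m} f(x)/\partial x^{2m}\, dx$, so the inverse kernel
is the differential operator $K^{-1}(x,x^\prime) = \sum_m (-1)^m a_m\,
\partial_x^{2m}\, \delta(x-x^\prime)$.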
Written this way, the prior of course gives us the familiar Gaussian-process
assumptions that $f(x)$ has zero mean and covariance $K$:
\begin{eqnarray}
\langle f(x) \rangle &=& 0 \\
\mathrm{Cov}\left(f(x), f(x^\prime) \right) &=& K(x,x^\prime)
\end{eqnarray}
But, for a given differential regularizer, how do we find the associated
kernel? The answer is simple: it's just the Green's function of the operator
that the integration by parts produces,
\begin{eqnarray}
\hat{L} &=& \sum_m (-1)^m a_m \frac{\partial^{2m}}{\partial x^{2m}}\\
\hat{L}K(x,x^\prime) &=& \sum_m (-1)^m a_m \frac{\partial^{2m}}{\partial
x^{2m}} K(x,x^\prime) = \delta(x-x^\prime)
\end{eqnarray}
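This is also where the kernel expansion above comes from: writing the
regularizer as $\lambda \int dx\, f(x) \hat{L} f(x)$ and setting the functional
derivative of $J[f]$ to zero gives
\begin{eqnarray}
\lambda \hat{L} f(x) &=& -\sum_{n=1}^N \left( f(x_n) - y_n \right)
\delta(x-x_n)
\end{eqnarray}
and acting on both sides with the Green's function $K$ (using the
self-adjointness of $\hat{L}$) recovers $f(x) = \sum_n \alpha_n K(x,x_n)$ with
$\alpha_n = -\left(f(x_n)-y_n\right)/\lambda$.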
An easy way to get the Green's function -- or in this case the kernel -- is to
Fourier transform. We can re-write our prior in $s$ space, dropping factors of
$2\pi$ throughout:
\begin{eqnarray}
\tilde{f}(s) &=& \int dx\, e^{-isx}f(x) \\
P[f] &\sim & \mathrm{exp}\left(-\int ds \sum_{m=1}^\infty a_m
s^{2m} \vert \tilde{f}(s)\vert^2 \right)
\end{eqnarray}
Since $\vert \tilde{f}(s)\vert^2 = \tilde{f}(s)\tilde{f}(-s)$ for real $f$, the
Fourier transform of our inverse kernel is:
\begin{eqnarray}
K^{-1}(s,s^\prime) &=& \sum_m a_m s^{2m}\, \delta(s+s^\prime)
\end{eqnarray}
We see that the inverse kernel is diagonal in $s$ space and positive
semi-definite (for $a_m \geq 0$), which means that the kernel is
translationally invariant in $x$ space. Inverting and transforming back, we
have:
\begin{eqnarray}
K(x,x^\prime) &=& \int ds\, ds^\prime\, e^{isx+is^\prime x^\prime}
\frac{\delta(s+s^\prime)}{\sum_m a_m s^{2m}}\\
K(x-x^\prime) &=& \int ds\, e^{is(x-x^\prime)} \frac{1}{\sum_m a_m s^{2m}}
\end{eqnarray}
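As a sanity check on this recipe, we can evaluate the last integral
numerically, restoring the factor of $2\pi$ we have been dropping; the
coefficients and the simple Riemann sum below are arbitrary illustrative
choices. For $a_0 = \mu^2$, $a_1 = 1$, and all other $a_m = 0$, the integral
can be done exactly and gives the exponential (Ornstein-Uhlenbeck) kernel
$e^{-\mu\vert x-x^\prime\vert}/2\mu$, which the numerics reproduce up to
truncation of the $s$ integral:
\begin{verbatim}
import numpy as np

# Numerically invert the Fourier-space kernel 1 / sum_m a_m s^{2m}.
# The coefficients a_0 = mu^2, a_1 = 1 and the s grid are illustrative;
# for this choice the exact kernel is exp(-mu * |dx|) / (2 * mu).
mu = 2.0
a = {0: mu**2, 1: 1.0}                       # nonzero a_m, keyed by m

s = np.linspace(-1000.0, 1000.0, 2_000_001)  # truncated frequency grid
ds = s[1] - s[0]
spectrum = sum(a_m * s**(2 * m) for m, a_m in a.items())

def kernel(delta):
    """K(x - x') = (1/2pi) * integral ds cos(s * delta) / spectrum(s)."""
    return np.sum(np.cos(s * delta) / spectrum) * ds / (2.0 * np.pi)

for delta in (0.0, 0.5, 1.0):
    print(delta, kernel(delta), np.exp(-mu * delta) / (2.0 * mu))
\end{verbatim}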
We see that, indeed, our kernel is translationally invariant. When we put such
a prior over functions, we are writing down what is essentially a Lagrangian in
physics:
\begin{eqnarray}
\Omega[f] \sim L &=& \int dx\, f(x) \left(\sum_m (-1)^m a_m
\frac{\partial^{2m}}{\partial x^{2m}}\right) f(x)
\end{eqnarray}
We find that the kernel is just the correlation function -- or, the propagator
-- of the resulting stochastic process. One example is the massive free scalar
field in three dimensions -- or the free process in a three-dimensional
feature space -- for which we get the Yukawa potential:
\begin{eqnarray}
\Omega[f] \sim L &=& \int d^3x\, \left[\frac{m^2}{2}f(\mathbf{x})^2+
\frac{1}{2}\vert \nabla f(\mathbf{x})\vert^2 \right] \\
K(\mathbf{x}-\mathbf{x}^\prime) &\sim & \frac{1}{\vert \mathbf{x} -
\mathbf{x}^\prime \vert }e^{- m \vert \mathbf{x}-\mathbf{x}^\prime \vert}
\end{eqnarray}
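For reference, the inverse Fourier transforms behind this (and behind its
one-dimensional analogue, which gives the exponential, or Ornstein-Uhlenbeck,
kernel) are the standard integrals:
\begin{eqnarray}
\frac{1}{2\pi}\int ds\, \frac{e^{is(x-x^\prime)}}{s^2+m^2} &=&
\frac{1}{2m}e^{-m\vert x-x^\prime\vert} \\
\frac{1}{(2\pi)^3}\int d^3s\,
\frac{e^{i\mathbf{s}\cdot(\mathbf{x}-\mathbf{x}^\prime)}}{\vert\mathbf{s}\vert^2+m^2}
&=& \frac{1}{4\pi\vert\mathbf{x}-\mathbf{x}^\prime\vert}
e^{-m\vert\mathbf{x}-\mathbf{x}^\prime\vert}
\end{eqnarray}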
So, the kernel we use when interpolating some field, or when specifying the
prior of our Gaussian process, can be viewed as the Green's function of our
penalizing ``regularizer''. If we want functions that are smooth up to
derivative order $m=M$, we know precisely how to formulate the associated
$K(x-x^\prime)$. Note that all differential regularizers such as those
discussed here lead to stationary kernels, and thus stationary processes.
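As a final illustration, we can draw (approximate) samples from these
stationary priors directly in $s$ space by scaling white noise by the square
root of the power spectrum $1/\sum_m a_m s^{2m}$ and transforming back; the
coefficients, grid, and periodic boundary conditions below are arbitrary
choices, but penalizing higher derivative orders visibly smooths the draws:
\begin{verbatim}
import numpy as np

# Approximate draws from the stationary prior with power spectrum
# S(s) = 1 / sum_m a_m s^{2m}: scale white noise by sqrt(S) in Fourier
# space and transform back.  Grid, coefficients, and periodic boundary
# conditions are illustrative choices only.
rng = np.random.default_rng(3)
n, L = 1024, 20.0
s = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)   # angular frequencies

def sample(a):
    """One approximate prior draw for coefficients a = {m: a_m}."""
    spectrum = sum(a_m * s**(2 * m) for m, a_m in a.items())
    noise = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    return np.real(np.fft.ifft(noise / np.sqrt(spectrum)))

rough = sample({0: 1.0, 1: 1.0})               # penalize only f'
smooth = sample({0: 1.0, 1: 1.0, 2: 1.0})      # also penalize f'' (M = 2)
\end{verbatim}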