Gaussian Process and Defining a Stochastic Process (Connection with Rasmussen)

23 Oct 2015

About a week ago I posted about how to define a stochastic process, and the conclusion was that if we want to draw a random function \begin{eqnarray} f(x) \sim P[f(x)] &=& \frac{1}{Z}e^{-\frac{1}{2}\int dx dx^\prime f(x) K^{-1}(x,x^\prime)f(x^\prime)} \end{eqnarray} with mean zero and covariance $\mathrm{Cov}[f(x),f(x^\prime)]=K(x,x^\prime)$, we can decompose the kernel into basis functions through Mercer's theorem and write \begin{eqnarray} K(x,x^\prime) &=& \sum_n \lambda_n \phi_n(x) \phi_{n}(x^\prime) \\ f(x) &=& \sum_m \phi_m(x) \sqrt{2\lambda_m} \mathrm{erf}^{-1}(X_m) \end{eqnarray} where the $X_m$ are independent uniform variates on $(-1,1)$.

This is precisely the basis function regression we see so often in Rasmussen: \begin{eqnarray} f(x) &=& \vec{\mathbf{w}}\cdot \vec{\phi}(x) \\ w_m &=& \sqrt{2\lambda_m} \mathrm{erf}^{-1}(X_m)\\ \langle w_m w_n \rangle &=& 2 \sqrt{\lambda_m \lambda_n} \langle \mathrm{erf}^{-1}(X_m) \mathrm{erf}^{-1}(X_n)\rangle \end{eqnarray} For $X_m$ uniform on $(-1,1)$, $\sqrt{2}\,\mathrm{erf}^{-1}(X_m)$ is unit normal, so the erf-transformed variables are independent Gaussians with variance $1/2$. We get a Kronecker delta in $n,m$ and \begin{eqnarray} \langle w_m w_n \rangle &=& \lambda_m \delta^K_{mn} \\ \langle f(x) f(x^\prime) \rangle &=& \sum_m \lambda_m \phi_m(x)\phi_m(x^\prime) = K(x,x^\prime) \end{eqnarray} The factor of two in the weights is exactly what cancels that $1/2$, so the covariance comes out as the kernel we started with.

The basic point comes across. We can define an arbitrary Gaussian process with an arbitrary kernel by expanding in basis functions $\phi(x)$, much like we did when moving from the univariate Gaussian to the multivariate Gaussian, with non-trivial covariance between the components of our random vector. In the case of the squared exponential kernel, the set of basis functions is formally infinite, but ultimately this doesn't matter for the conditional mean, \begin{eqnarray} \mathrm{E}[f(x^*) \vert X] &=& \sum_{i} \alpha_i K(x^*, X_i) \end{eqnarray} where the coefficients $\alpha_i$ are fixed by the observed values. This is a piece of wisdom from the representer theorem. For any regularized regression problem with quadratic cost (which is the same thing as putting a Gaussian prior on the weights $\vec{w}$ in our model and maximizing the posterior) we can represent our smooth solution completely in terms of kernel functions centered on the input data $X$.

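
As a rough numerical check of the basis-function construction above, here is a minimal sketch on a finite grid. It makes several assumptions that are not in the original post: the grid, the length scale, the helper names `se_kernel` and `erfinv`, and the use of the eigenvectors of the kernel matrix as a discrete stand-in for the Mercer eigenfunctions $\phi_m$. The $X_m$ are taken to be uniform on $(-1,1)$. The inverse error function is built from the standard library's normal quantile, purely to avoid an extra dependency.

```python
from statistics import NormalDist

import numpy as np

_std_normal = NormalDist()


def erfinv(x):
    """Inverse error function via erfinv(x) = Phi^{-1}((1 + x) / 2) / sqrt(2)."""
    return _std_normal.inv_cdf((1.0 + x) / 2.0) / np.sqrt(2.0)


def se_kernel(a, b, length=0.5):
    """Squared exponential kernel K(a, b) on 1-D inputs (length scale is an assumption)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 40)  # hypothetical evaluation grid
K = se_kernel(x, x)

# Discrete analogue of Mercer's theorem: K = sum_m lam_m phi_m phi_m^T.
lam, phi = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)  # remove tiny negative round-off eigenvalues

n_samples = 5000
X = rng.uniform(-1.0, 1.0, size=(n_samples, x.size))
X = np.clip(X, -1.0 + 1e-12, 1.0 - 1e-12)  # keep erfinv finite at the edges

Z = np.vectorize(erfinv)(X)   # Gaussian, but with variance 1/2 rather than 1
w = np.sqrt(2.0 * lam) * Z    # weights w_m = sqrt(2 lam_m) erfinv(X_m)
f = w @ phi.T                 # each row is one sample path f(x) on the grid

weight_var = w.var(axis=0)    # compare with lam
K_hat = f.T @ f / n_samples   # empirical covariance, compare with K

print("variance of erfinv(X):", Z.var())
print("max |K_hat - K|:", np.abs(K_hat - K).max())
```

If the sketch is right, the printed variance of $\mathrm{erf}^{-1}(X_m)$ sits near $1/2$, the weight variances track $\lambda_m$, and the empirical covariance of the sampled paths approaches $K$ as the number of samples grows.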
This may have a simple connection, in the process view, to the "family of functions" picture: when we observe our stochastic process at discrete points $X$ with values $\vec{f}$, we are really hitting our original pdf with a series of Dirac delta functions.
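
To make the conditioning step concrete, the following is a small sketch of noise-free Gaussian conditioning, which is one way to read "hitting the pdf with delta functions". The observation points, the target values, the test grid, the `jitter` term, and the helper `se_kernel` are all illustrative assumptions rather than anything from the original post. The conditional mean comes out in the representer form $\sum_i \alpha_i K(x^*, X_i)$ with $\vec{\alpha} = K(X,X)^{-1}\vec{f}$.

```python
import numpy as np


def se_kernel(a, b, length=0.5):
    """Squared exponential kernel on 1-D inputs (length scale is an assumption)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


# Hypothetical observations of the process at a few discrete inputs.
X_obs = np.array([-1.5, -0.5, 0.3, 1.2])
f_obs = np.sin(2.0 * X_obs)
x_star = np.linspace(-2.0, 2.0, 9)  # test inputs x*

jitter = 1e-10  # small diagonal term for numerical stability only
K_XX = se_kernel(X_obs, X_obs) + jitter * np.eye(X_obs.size)
K_sX = se_kernel(x_star, X_obs)
K_ss = se_kernel(x_star, x_star)

# Representer form: one coefficient alpha_i per observed input X_i.
alpha = np.linalg.solve(K_XX, f_obs)
mean_star = K_sX @ alpha  # sum_i alpha_i K(x*, X_i)

# Conditional covariance at the test inputs.
cov_star = K_ss - K_sX @ np.linalg.solve(K_XX, K_sX.T)

# The delta functions pin the process at the observed inputs:
# the conditional mean evaluated at X_obs reproduces f_obs.
mean_at_obs = se_kernel(X_obs, X_obs) @ alpha

print("mean at test inputs:", mean_star)
print("max pinning error:", np.abs(mean_at_obs - f_obs).max())
```

The number of coefficients equals the number of observations, regardless of how many basis functions the kernel formally contains, which is the point of the representer form.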