A Brief Note on the Bias-Variance Decomposition
When you train a highly flexible, highly ``articulate'' model on a
training set, you often get a great training score -- be it AUC, accuracy,
MSE, etc. But such models -- as I've talked about before -- often have trouble
generalizing to ``test'' sets, or the real world. One of the easiest ways to
see this is with a simple FOIL expansion of the following quantity:
\begin{eqnarray}
Y&=& f(X)+\epsilon
\end{eqnarray}
Here $X$ is a random variable -- the independent inputs -- $f(X)$ is the
generative process that creates our object of interest $Y$ -- be it cardinal
or $\in \mathbb{R}$ -- and $\epsilon$ is drawn from some zero-mean noise
distribution, say a Gaussian process, $\mathcal{N}(0,K(X,X^\prime))$. Let
$g(X)$ be our chosen model for $Y$. (Normally people write $\hat{f}(X)=g(X)$,
but I chose $g$ to avoid confusion.) If we take a look at the mean squared
error, we get
\begin{eqnarray}
MSE &=& \langle \vert f(X)+\epsilon-g(X) \vert^2 \rangle\\
&=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle
g(X)^2 \rangle + 2 \langle \epsilon \rangle \langle f(X) \rangle - 2 \langle
\epsilon \rangle \langle g(X) \rangle - 2 \langle f(X) \rangle \langle g(X)
\rangle
\end{eqnarray}
where I've assumed the noise is uncorrelated with $f$ and $g$, and that the
averages run over draws of the training set at fixed $X$, so that $f(X)$ is
effectively deterministic and $\langle f(X) g(X) \rangle$ factors. Since
$\epsilon$ has zero mean, the terms linear in $\epsilon$ fall away and we can
write:
\begin{eqnarray}
MSE &=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle
g(X)^2 \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}
Adding and subtracting the squared mean of our model, $\langle g(X) \rangle^2$,
and regrouping, we get:
\begin{eqnarray}
MSE &=& \left( \langle f(X)^2 \rangle - 2 \langle f(X) \rangle \langle g(X)
\rangle + \langle g(X) \rangle^2 \right) + \left( \langle g(X)^2 \rangle -
\langle g(X) \rangle^2 \right) + \langle \epsilon^2 \rangle \\
&=& \langle \left(f(X)-\langle g(X)\rangle \right)^2 \rangle
+\mathrm{Var}\left(g(X) \right)+\mathrm{Var}\left(\epsilon \right)
\end{eqnarray}
So now, term by term, we see we have: the squared difference between the true
function and our average model -- the (squared) Bias -- which quantifies how
much our model is ``off'' (a quantity that will be quite low for complex models
and high for simple ones); the variance of the model itself, which quantifies
how much our $g(X)$ changes given different training inputs (a quantity that
will be high for complex models and low for simple ones); and the variance of
the noise variable $\epsilon$, which is an ineluctable contribution to our
error.
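To make this concrete, here's a minimal numerical sketch -- my own, not part
of the derivation above -- that estimates each term by Monte Carlo. I'm
assuming a toy $f(x)=\sin(x)$, i.i.d. Gaussian noise, and numpy polynomial
fits of degree 1 (simple) and degree 9 (complex) as stand-ins for $g$:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin                  # toy "true" generative process (an assumption)
sigma = 0.3                 # noise scale, so Var(eps) = sigma**2
x_grid = np.linspace(0, 2 * np.pi, 50)   # fixed test inputs X
n_train, n_trials = 30, 500

for degree in (1, 9):       # simple vs. complex g(X)
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        # fresh training set each trial: Y = f(X) + eps
        x = rng.uniform(0, 2 * np.pi, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)     # our fitted model g
        preds[t] = np.polyval(coeffs, x_grid)
    g_bar = preds.mean(axis=0)                       # <g(X)> over training draws
    bias_sq = np.mean((f(x_grid) - g_bar) ** 2)      # <(f - <g>)^2>
    variance = preds.var(axis=0).mean()              # Var(g(X))
    print(f"degree {degree}: bias^2={bias_sq:.3f}, "
          f"variance={variance:.3f}, noise={sigma**2:.3f}")
```

The simple fit should show high bias and low variance, the flexible fit the
reverse, with the noise floor $\sigma^2$ common to both.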
Recall from the [last
post](http://rspeare.blogspot.com/2015/11/learning-theory-and-xgboost-failure-on.html)
that we had a very similar bias-variance decomposition:
\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon
(h_{\theta}(X)) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta}
\right)}{N}}
\end{eqnarray}
The bias is the first term and the variance the second, which grows like the
square root of the logarithm of the degrees of freedom $K$ and shrinks like
one over the square root of the sample size $N$. This decomposition
illustrates the balancing act we, as model builders, have to play between
simplicity and goodness of fit. Refer to these when explaining why your highly
under-regularized gradient boosted decision tree -- or boosted anything -- did
a poor job of generalizing! (This doesn't always happen, of course; just a
word of caution.)