Dual Lagrangians and Ordinary Least Squares
The ordinary least
squares problem can be set up as a simple norm-minimization problem (here with a ridge penalty included):
\begin{eqnarray}
\mathrm{minimize} \ \ \frac{\| Ax-t \|^2}{2}+l \frac{x^T x}{2}
\end{eqnarray}
where $x$ is some vector that lives in $D$-dimensional ``feature'' space,
and the matrix $A$ is $N \times D$ -- essentially the design matrix $\Phi(X)$
common in the machine learning literature. We can write down the ``primal
objective'' of this problem by forming the ``Lagrangian'':
\begin{eqnarray}
L(x)&=& \frac{(Ax-t)^T (Ax-t)}{2}+l \frac{x^T x}{2} \ \ \
\mathrm{or,}\\
L(x,z,\nu) &=& \frac{z^T z}{2}+\nu^T \left(Ax-t-z \right)+l
\frac{x^Tx}{2}
\end{eqnarray}
where $\nu$ is a vector-valued Lagrange multiplier that \textbf{imposes} our
constraint $z=Ax-t$. This reformulation of the problem is just one of the many
simple tricks you can do with Lagrange multipliers. The canonical form of
such a Lagrangian in optimization theory is the following:
\begin{eqnarray}
L(x,\lambda,\nu)&=& f_0(x)+ \lambda_i f_i(x)+\nu_j h_j(x)
\end{eqnarray}
where the Einstein summation convention is used. Pardon the confusion of
$\lambda$'s: normally this Greek letter is used for regularization, but in
this problem I am going to use $l$ as the regularization parameter and
$\lambda$ as the Lagrange multiplier that imposes some inequality constraint.
We require, for $x$ to be a feasible solution to the problem:
\begin{eqnarray}
h_i(x) &=& 0 \ \ \forall i \\
f_i(x) &\le & 0 \ \ \forall i
\end{eqnarray}
Notice that in this example we have no inequality constraints (the only
constraint is the equality $h(x,z)=Ax-t-z=0$), so $\lambda$ will be
irrelevant. If we were to minimize this ordinary least squares problem
directly -- analytically solving what's called the primal objective -- we get:
\begin{eqnarray}
\partial_x L(x) &=& A^T(Ax-t)+l x =0 \\
x_\mathrm{min} &=& (A^TA+l)^{-1} A^T t
\end{eqnarray}
We see that the solution involves the inversion of a $D \times D$ matrix --
quite difficult if the length of our feature vector is large. This is what's
called the ordinary least squares solution. We have let $l$ be non-zero in
order to create something called a ``ridge'', which smooths out our solution.
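Here is a minimal numerical sketch of this primal (ridge) solution using
numpy; the data, sizes, and variable names are made up for illustration, and
I solve the $D \times D$ linear system rather than forming the inverse
explicitly:
\begin{verbatim}
import numpy as np

# Toy problem: N = 50 data points, D = 3 features (illustrative only).
rng = np.random.default_rng(0)
N, D = 50, 3
A = rng.normal(size=(N, D))   # design matrix, N x D
t = rng.normal(size=N)        # targets
l = 0.1                       # ridge parameter

# Primal / ridge solution: x_min = (A^T A + l I)^{-1} A^T t,
# computed via a D x D linear solve.
x_primal = np.linalg.solve(A.T @ A + l * np.eye(D), A.T @ t)
\end{verbatim}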
Now, we can define the dual Lagrangian as:
\begin{eqnarray}
g(\lambda, \nu)&=& \mathrm{min}_{x,z}\left(L(x,z,\lambda,\nu) \right)
\end{eqnarray}
which we can find by taking the proper derivatives and plugging in:
\begin{eqnarray}
\partial_x L(x,z,\lambda,\nu) &=& \nu^T A + l x^T = 0 \\
\Rightarrow x&=& -\frac{A^T \nu}{l}\\
\partial_z L(x,z,\lambda,\nu) &=& z^T-\nu^T = 0 \\
\Rightarrow z &=& \nu
\end{eqnarray}
So we get
\begin{eqnarray}
g(\lambda, \nu) &=& \frac{\nu^T \nu}{2}+\nu^T \left(-AA^T \nu/l - t -
\nu \right) + \frac{\nu^T A A^T \nu}{2l}\\
&=& -\frac{\nu^T \nu}{2} - \frac{\nu^T AA^T \nu}{2l} - \nu^T t \\
&=& -\frac{1}{2}\nu^T \left(\frac{AA^T}{l}+1\right)\nu-\nu^Tt
\end{eqnarray}
Now, we define $\mathrm{max}(g)=d^*$ as the maximum of the dual, which is in
general less than or equal to the minimum of the primal, $\mathrm{min}(L)=p^*$
(this is called weak duality). When $d^*=p^*$, we have what's called strong
duality. The conditions for this to be the case are called the KKT conditions,
and boil down to essentially all derivatives being equal to zero at the
solution:
\begin{eqnarray}
\partial_{x,z,\lambda,\nu}L=0
\end{eqnarray}
and all Lagrange multiplier terms being equal to zero when evaluated at the
solution (complementary slackness). If we maximize our dual, we find:
\begin{eqnarray}
\partial_\nu g &=& - \left(\frac{AA^T}{l}+1 \right)\nu-t=0 \\
\Rightarrow \nu &=& -\left(AA^T/l+1 \right)^{-1} t
\end{eqnarray}
and, using our expression before, we find the dual $x$ is:
\begin{eqnarray}
x &=& -\frac{A^T \nu}{l} = \frac{A^T}{l}\left(AA^T/l+1 \right)^{-1}t
= A^T \left(AA^T+l \right)^{-1}t
\end{eqnarray}
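Strong duality for this problem can also be checked numerically; the
following sketch (toy data and names, purely illustrative) compares the
primal optimum $p^*$ against the dual optimum $d^*=g(\nu)$ at the maximizing
$\nu$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
A = rng.normal(size=(N, D))
t = rng.normal(size=N)
l = 0.1

# Primal optimum p* = value of the objective at x_min
x_min = np.linalg.solve(A.T @ A + l * np.eye(D), A.T @ t)
r = A @ x_min - t
p_star = 0.5 * r @ r + 0.5 * l * x_min @ x_min

# Dual optimum d* = g(nu*), with nu* = -(AA^T/l + I)^{-1} t
M = A @ A.T / l + np.eye(N)
nu_star = -np.linalg.solve(M, t)
d_star = -0.5 * nu_star @ M @ nu_star - nu_star @ t

print(np.isclose(p_star, d_star))   # True: strong duality holds here
\end{verbatim}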
These two solutions are equivalent whenever $d^*=p^*$, and in this case they
are! Note that the second solution involves the inversion of an $N \times N$
matrix, and may be a much better route if the number of features outweighs the
number of data points, $N \ll D$. This dual solution is precisely the
kernel ridge regression solution, because $AA^T$ is our kernel ``Gram
matrix''. The model becomes:
\begin{eqnarray}
\mathrm{model} &=& Ax = AA^T \left(AA^T+l \right)^{-1}t\\
AA^T &=& K \\
\mathrm{model} &=& Ax = K \left(K+l \right)^{-1}t
\end{eqnarray}
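To close the loop, here is a numerical sketch (again with made-up toy data
and names) showing that the kernel-form fit $K(K+l)^{-1}t$ reproduces the
feature-space fit $A x_\mathrm{min}$ while only ever solving an $N \times N$
system:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
A = rng.normal(size=(N, D))
t = rng.normal(size=N)
l = 0.1

# Feature-space (primal) route: D x D solve
x_min = np.linalg.solve(A.T @ A + l * np.eye(D), A.T @ t)
fit_primal = A @ x_min

# Kernel (dual) route: N x N solve using only the Gram matrix K = AA^T
K = A @ A.T
fit_kernel = K @ np.linalg.solve(K + l * np.eye(N), t)

# The dual weights also recover x = A^T (AA^T + l I)^{-1} t
x_dual = A.T @ np.linalg.solve(A @ A.T + l * np.eye(N), t)

print(np.allclose(fit_primal, fit_kernel), np.allclose(x_min, x_dual))
\end{verbatim}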