Basic Logistic Regression and $\vec{\theta}$ as the Gradient of the Decision Boundary
A few months ago I was
asked about a classification problem in a phone interview, and, not knowing
much about machine learning or basic regression at that time, I didn't realize
what was really being asked. I fumbled about in the dark, suggesting only
basic use of histograms and correlations, when the answer was really quite
simple: logistic regression.
Linear regression takes the following form: you have some input vector
$\vec{x}$, which lives in a ``feature space,'' and you'd like to map that
$\vec{x}$ onto a corresponding $y$, which has some scalar value. If you have a
set, or a ``census'' of sorts, of such pairs $\lbrace \vec{x}_i, y_i
\rbrace_{i=1}^N$, you'd like to learn from this data and be able to derive an
analytical relationship between $\vec{x}$ and $y$.
It turns out this problem is analytically solvable with some nasty linear
algebra -- I did it once years ago, following David Hogg and Dustin Lang's
paper on Bayesian inference -- if you model the errors, or deviations from
your model, as Gaussian. The cost function is then essentially a $\chi^2$
statistic, and maximizing the likelihood gives you something called the
normal equation:
\begin{eqnarray}
y_i &=& \vec{\theta} \cdot \vec{x}_i \\
- \log \left(\mathcal{L}(y \vert \vec{x}, \vec{\theta})\right) &\propto&
\sum_{i=1}^N
\left(\vec{\theta}\cdot \vec{x}_i-y_i\right)^2\\
\vec{\theta} &=& \left(X^T X\right)^{-1} X^T \vec{y}
\end{eqnarray}
where $X$ is something called the data matrix, which I won't take the time to
define here. The point is that for linear regression we get a nice plane in
hyper-space that maps our inputs $\vec{x}$ onto scalar values $y$.
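For concreteness, here is a minimal numpy sketch of the normal equation on
made-up data; the names (`X`, `y`, `theta`) and the toy numbers are purely
illustrative, and I'm assuming the intercept is handled by a column of ones
in the data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "census" of N pairs {x_i, y_i}, with a column of ones for the intercept.
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # the data matrix
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=N)          # Gaussian errors

# The normal equation: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # lands very close to true_theta
```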
Logistic regression works the same way, except that now $y$ can only take on
two values, $0$ or $1$, and we wrap our linear model in a sigmoid function:
\begin{eqnarray}
y_i &=& \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}_i}}
\end{eqnarray}
And what this model really means is the probability of our output $y$ being
``yes'' rather than ``no'' -- that is, 1 rather than 0:
\begin{eqnarray}
P(y=1 \vert \vec{\theta}, \vec{x}) &=& \frac{1}{1+e^{-\vec{\theta}
\cdot \vec{x}}}
\end{eqnarray}
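As a quick sketch -- the vectors here are made up purely for illustration --
the whole model is one dot product and one exponential, and thresholding the
probability at $1/2$ is the same as asking whether $\vec{\theta}\cdot\vec{x}$
is positive, a fact we'll lean on below:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_yes(theta, x):
    """P(y = 1 | theta, x) under the logistic model."""
    return sigmoid(np.dot(theta, x))

# Made-up parameter and feature vectors, purely for illustration.
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])

p = prob_yes(theta, x)
print(p, p > 0.5, np.dot(theta, x) > 0)  # the last two always agree
```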
Now, when you maximize the likelihood -- which, by any method I know of, can
no longer be done analytically, but must be done numerically, with something
like stochastic gradient descent -- you will find that the same $\vec{\theta}$
that defined the hyperplane in the linear model now describes the **gradient**
across the **decision boundary** in $\vec{x}$ space. By the properties of the
sigmoid function, we classify $y$ as a ``yes,'' or a 1, when
$\vec{\theta}\cdot \vec{x}$ is greater than zero, and as a ``no,'' or a 0, when
$\vec{\theta}\cdot \vec{x}$ is less than zero. Let's look at what happens when
we are right on the ``cusp'' of a decision:
\begin{eqnarray}
\frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}& \approx& \frac{1}{2}\\
e^{-\vec{\theta} \cdot \vec{x}} & \approx & 1\\
\vec{\theta} \cdot \vec{x} & \approx & 0
\end{eqnarray}
So we can Taylor expand our hypothesis $P(y=1 \vert \vec{x},\vec{\theta})$
about $\vec{\theta}\cdot\vec{x} = 0$, where the sigmoid has slope $1/4$, and
get
\begin{eqnarray}
P(y=1 \vert \vec{\theta}, \vec{x}) &=&\frac{1}{2}+\frac{\vec{\theta} \cdot
\vec{x}}{4}+O\!\left(\left(\vec{\theta}\cdot\vec{x}\right)^{3}\right)
\end{eqnarray}
If we take the derivative of this function with respect to the vector
$\vec{x}$, we see that the normal vector to our decision boundary in feature
space points precisely along $\vec{\theta}$. Mathematically, the gradient
points in the direction of **fastest increase**, and so relatively dominant
components of our $\vec{\theta}$ vector correspond to relatively dominant
features in classification.
\begin{eqnarray}
\frac{\partial P(y=1 \vert \vec{\theta}, \vec{x})}{\partial \vec{x}}
&=&\frac{\vec{\theta}}{4}+O\!\left(\left(\vec{\theta}\cdot\vec{x}\right)^{2}\right)
\end{eqnarray}
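If you don't trust the expansion, here is a quick finite-difference check --
the numbers are made up, chosen only so that the point sits exactly on the
decision boundary -- showing that the numerical gradient of
$P(y=1 \vert \vec{\theta}, \vec{x})$ with respect to $\vec{x}$ comes out to
$\vec{\theta}/4$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up theta and a point chosen so that theta . x = 0, i.e. on the boundary.
theta = np.array([2.0, -1.0])
x = np.array([0.5, 1.0])
assert np.isclose(theta @ x, 0.0)

# Central finite differences of P(y = 1 | theta, x) with respect to x.
eps = 1e-5
grad = np.array([
    (sigmoid(theta @ (x + eps * e)) - sigmoid(theta @ (x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
print(grad, theta / 4.0)  # the two agree to high precision
```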
So, at least near the decision boundary, regression coefficients play the
same intuitive role they do in linear regression. If a feature has a large
coefficient, we know that it plays a prominent role in the classification.
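To make that concrete, here is a minimal sketch on synthetic data -- plain
batch gradient descent on the negative log-likelihood rather than anything
stochastic, with all names and numbers made up -- showing that the fitted
$\vec{\theta}$ recovers the true one, and that the dominant coefficient flags
the dominant feature:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-feature data: labels depend strongly on the first feature
# and only weakly on the second.
true_theta = np.array([3.0, 0.3])
X = rng.normal(size=(500, 2))
p = 1.0 / (1.0 + np.exp(-X @ true_theta))
y = (rng.uniform(size=500) < p).astype(float)

# Plain batch gradient descent on the mean negative log-likelihood.
theta = np.zeros(2)
learning_rate = 0.1
for _ in range(5000):
    p_hat = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (p_hat - y) / len(y)  # gradient of the mean negative log-likelihood
    theta -= learning_rate * grad

print(theta)  # lands near the true [3, 0.3], up to sampling noise:
              # the first feature dominates the classification
```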
------------------------------------------------------------------------
Of course, adding higher-order terms to the argument of our sigmoid
function, say:
\begin{eqnarray}
\frac{1}{1+e^{-\left(\sum_j \theta_{2j} x_j + \sum_j \theta_{2j+1} x_j^{2}\right)}}
\end{eqnarray}
allows for a curved decision boundary in feature space $\vec{x}$ -- not just a
hyperplane.
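In practice this just means augmenting the feature vector with squared terms
and fitting the same model. Here is a sketch on made-up data labeled by
whether the point falls inside the unit circle, something a purely linear
boundary cannot capture:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up raw features: label 1 for points inside the unit circle.
X = rng.uniform(-2, 2, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(float)

# Augment with squared terms: the boundary is still theta . x_aug = 0, a
# hyperplane in the augmented space, but a curve in the original (x_1, x_2) plane.
X_aug = np.column_stack([np.ones(len(X)), X, X ** 2])

# Same batch gradient descent as before.
theta = np.zeros(X_aug.shape[1])
learning_rate = 0.1
for _ in range(5000):
    p_hat = 1.0 / (1.0 + np.exp(-X_aug @ theta))
    theta -= learning_rate * X_aug.T @ (p_hat - y) / len(y)

print(theta)  # the coefficients on x_1^2 and x_2^2 pick up the curvature
```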