Basic Logistic Regression and $\vec{\theta}$ as the gradient of the Decision Boundary
A few months ago I was 
asked about a classification problem in a phone interview, and, not knowing 
much about machine learning or basic regression at that time, I didn't realize 
what was really being asked. I fumbled about in the dark only to suggest basic 
use of histograms and correlations, when the answer was really quite simple: 
Logistic regression. 
Linear regression takes the following form: you have some input vector 
$\vec{x}$, which lives in a ``feature space'' and you'd like to map that 
$\vec{x}$ onto a corresponding $y$, which has some scalar value. If you have a 
set, or a ``census'' of sorts, of such pairs $\lbrace \vec{x}_i, y_i 
\rbrace_{i=1}^N$, you'd like to learn from this data and be able derive an 
analytical relationship between $\vec{x}$ and $y$. 
It turns out this problem is analytically solve-able by some nasty linear 
algebra -- I did it once years ago, following David Hogg and Dustin Lang's 
paper on Bayesian inference -- if you model the errors, or deviations from 
your model as being Gaussian. Then the cost function is a $\chi^2$ statistic, 
essentially, and if you maximize the likelihood function you get something 
called the normal equation: 
\begin{eqnarray} 
y_i &=& \vec{\theta} \cdot \vec{x}_i \\ 
- \log \left(\mathcal{L}(\vec{x},y \vert \vec{\theta})\right) &=& 
\sum_{i=1}^N 
\frac{(\vec{\theta}\cdot \vec{x}_i-y_i)^2}{N}\\ 
\theta_j &=& \frac{X_{ji} y_i}{(X^TX)} 
\end{eqnarray} 
Where $X$ is something called the data matrix, which I won't take the time to 
define here. The point is that for linear regression we get a nice plane in 
hyper-space, that maps our inputs $\vec{x}$ onto scalar values $y$. And, 
logistic regression works the same way, but now $y$ can only take on two 
values, $0$ or $1$, and we wrap our linear model in a sigmoid function: 
\begin{eqnarray} 
y_i &=& \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}_i}} 
\end{eqnarray} 
And what this model really means is the probability of our output $y$ being 
``yes'' or ``no'', 0 or 1: 
\begin{eqnarray} 
P(y=1 \vert \vec{\theta}, \vec{x}) &=& \frac{1}{1+e^{-\vec{\theta} 
\cdot \vec{x}_i}} 
\end{eqnarray} 
Now, when you maximize the likelihood, which can no longer be done 
analytically, by any method I know of, but must be done numerically, with 
something like stochastic gradient descent, you will find that the same 
hyperplane defined by the $\vec{\theta}$ in the linear model now describes the 
**gradient** of the **decision** **boundary** in $\vec{x}$ space. By the 
properties of the sigmoid function, we will classify $y$ with a ``yes'' or a 
1, when $\vec{\theta}\cdot \vec{x}$ is greater than zero and a ``no'' or a 0, 
when $\vec{\theta}\cdot \vec{x}$ is less than zero. Let's look at when we are 
right on the ``cusp'' of a decision: 
\begin{eqnarray} 
 \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}_i}}& \approx& \frac{1}{2}\\ 
e^{-\vec{\theta} \cdot \vec{x}_i} & \approx & 1 
\end{eqnarray} 
So we can taylor expand our hypothesis $P(y=1 \vert \vec{x},\vec{\theta})$ and 
get 
\begin{eqnarray} 
P(y=1 \vert \vec{\theta}, \vec{x}) &=&\frac{1}{2}+\vec{\theta} \cdot 
\vec{x}_i+O()^2 
\end{eqnarray} 
If we take the derivative of this function with respect to the vector 
$\vec{x}$, we see that the normal vector to our decision boundary in feature 
space is precisely $\vec{\theta}$. Mathematically, gradient presents the 
direction of **fastest increase, **and so relatively dominant components of 
our $\vec{\theta}$ vector correspond to relatively dominant features in 
classification. 
\begin{eqnarray} 
\frac{P(y=1 \vert \vec{\theta}, \vec{x})}{\partial \vec{x}} 
&=&\vec{\theta}+O()^2 
\end{eqnarray} 
So, regression coefficients, at least near the decision boundary, play the 
same intuitive role. If we have a large coefficient, we know that the 
associated parameter plays a prominent role in classification. 
------------------------------------------------------------------------ 
Of course, adding higher order terms to the expression in our sigmoid 
function, say: 
\begin{eqnarray} 
 \frac{1}{1+e^{\sum_j \theta_{2j} x_j^ + \sum_j \theta_{2j+1} x_j^{2}}} 
\end{eqnarray} 
allows for a curved decision boundary in feature space $\vec{x}$ -- not just a 
hyperplane.