Basic Logistic Regression and $\vec{\theta}$ as the Gradient of the Decision Boundary
A few months ago I was
asked about a classification problem in a phone interview, and, not knowing
much about machine learning or basic regression at that time, I didn't realize
what was really being asked. I fumbled about in the dark only to suggest basic
use of histograms and correlations, when the answer was really quite simple:
Logistic regression.
Linear regression takes the following form: you have some input vector
$\vec{x}$, which lives in a ``feature space'', and you'd like to map that
$\vec{x}$ onto a corresponding $y$, which has some scalar value. If you have a
set, or a ``census'' of sorts, of such pairs $\{\vec{x}_i, y_i\}_{i=1}^{N}$, you'd like to learn from this data and be able to derive an
analytical relationship between $\vec{x}$ and $y$.
It turns out this problem is analytically solvable with some nasty linear
algebra -- I did it once years ago, following David Hogg and Dustin Lang's
paper on Bayesian inference -- if you model the errors, or deviations from
your model, as Gaussian. Then the cost function is essentially a $\chi^2$ statistic,
and if you maximize the likelihood function you get something
called the normal equation:
$$y_i = \vec{\theta} \cdot \vec{x}_i$$
$$-\log\left(L(\vec{x}, y \mid \vec{\theta})\right) = \sum_{i=1}^{N} \left(\vec{\theta} \cdot \vec{x}_i - y_i\right)^2$$
$$\vec{\theta} = \left(X^T X\right)^{-1} X^T \vec{y}$$
Where $X$ is something called the data matrix, which I won't take the time to
define here. The point is that for linear regression we get a nice plane in
hyperspace that maps our inputs $\vec{x}$ onto scalar values $y$.
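To make that concrete, here is a minimal numpy sketch of the normal equation at work; the data matrix, weights, and noise level are all made up for illustration.

```python
import numpy as np

# Made-up data: N points in a 2-d feature space, plus a bias column of ones.
rng = np.random.default_rng(0)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # the "data matrix"
theta_true = np.array([0.5, 2.0, -1.0])
y = X @ theta_true + rng.normal(scale=0.1, size=N)          # Gaussian errors on y

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # should land close to theta_true
```

(`np.linalg.solve` is just a more stable way of applying the $(X^T X)^{-1}$ above.)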
Logistic regression works the same way, but now $y$ can take on only two
values, 0 or 1, and we wrap our linear model in a sigmoid function:
$$y_i = \frac{1}{1 + e^{-\vec{\theta} \cdot \vec{x}_i}}$$
And what this model really means is the probability of our output $y$ being
``yes'' or ``no'', 1 or 0:
$$P(y=1 \mid \vec{\theta}, \vec{x}) = \frac{1}{1 + e^{-\vec{\theta} \cdot \vec{x}}}$$
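As a quick illustration (the weights and the point below are made up), the model is nothing more than a dot product pushed through a sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, 2.0, -1.0])   # made-up weights; theta[0] acts as a bias
x = np.array([1.0, 0.3, -0.8])       # leading 1.0 is the constant feature for the bias
print(sigmoid(theta @ x))            # P(y = 1 | theta, x), a number between 0 and 1
```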
Now, the likelihood can no longer be maximized analytically, by any method I
know of; it must be maximized numerically, with something like stochastic
gradient descent (a rough sketch of which follows below).
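Here is a rough, from-scratch sketch of that numerical fit -- plain full-batch gradient ascent on the log-likelihood rather than a truly stochastic variant, on made-up data -- just to show the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: labels drawn from a logistic model with a known theta_true.
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # bias column plus two features
theta_true = np.array([-0.5, 3.0, -1.5])
y = (rng.uniform(size=N) < sigmoid(X @ theta_true)).astype(float)

# Gradient of the log-likelihood is X^T (y - p); climb it.
theta = np.zeros(3)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ theta)
    theta += lr * X.T @ (y - p) / N

print(theta)  # should land in the neighborhood of theta_true
```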
When you do, you will find that the same $\vec{\theta}$ that defined the hyperplane
in the linear model now describes the **gradient** of the **decision boundary**
in $\vec{x}$ space. By the properties of the sigmoid function, we will classify
$y$ with a ``yes'', or a 1, when $\vec{\theta} \cdot \vec{x}$ is greater than zero, and a ``no'', or a 0,
when $\vec{\theta} \cdot \vec{x}$ is less than zero. Let's look at what happens when we are
right on the ``cusp'' of a decision:
$$\frac{1}{1 + e^{-\vec{\theta} \cdot \vec{x}}} \approx \frac{1}{2} \quad \Longrightarrow \quad e^{-\vec{\theta} \cdot \vec{x}} \approx 1$$
So we can Taylor expand our hypothesis $P(y=1 \mid \vec{x}, \vec{\theta})$ about
$\vec{\theta} \cdot \vec{x} = 0$; since the sigmoid has slope $\tfrac{1}{4}$ there, we get
$$P(y=1 \mid \vec{\theta}, \vec{x}) = \frac{1}{2} + \frac{\vec{\theta} \cdot \vec{x}}{4} + \mathcal{O}\left(\left(\vec{\theta} \cdot \vec{x}\right)^2\right)$$
If we take the derivative of this function with respect to the vector
$\vec{x}$, we see that the normal vector to our decision boundary in feature
space points precisely along $\vec{\theta}$. Mathematically, the gradient points in the
direction of **fastest increase**, and so relatively dominant components of
our $\vec{\theta}$ vector correspond to relatively dominant features in
classification.
$$\frac{\partial P(y=1 \mid \vec{\theta}, \vec{x})}{\partial \vec{x}} = \frac{\vec{\theta}}{4} + \mathcal{O}\left(\vec{\theta} \cdot \vec{x}\right)$$
So regression coefficients, at least near the decision boundary, play the
same intuitive role they do in linear regression. If we have a large coefficient,
we know that the associated feature plays a prominent role in classification.
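A quick numerical check of that claim, with made-up weights standing in for a fitted $\vec{\theta}$: at a point sitting exactly on the decision boundary, a finite-difference gradient of $P(y=1)$ with respect to $\vec{x}$ comes out parallel to $\vec{\theta}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-0.5, 3.0, -1.5])  # pretend these came out of a fit; theta[0] is the bias

def p_yes(x):
    # x is the raw 2-d feature vector; prepend the constant 1 for the bias term
    return sigmoid(theta @ np.concatenate(([1.0], x)))

# A point exactly on the decision boundary: theta[0] + theta[1]*x1 + theta[2]*x2 = 0
x0 = np.array([1.0, -(theta[0] + theta[1] * 1.0) / theta[2]])

# Central-difference gradient of P(y=1) with respect to x at that point
eps = 1e-6
grad = np.array([
    (p_yes(x0 + eps * np.eye(2)[j]) - p_yes(x0 - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])

print(p_yes(x0))      # ~0.5: we are on the cusp
print(grad)           # ~theta[1:] / 4, i.e. parallel to theta
print(theta[1:] / 4)
```

Here the first feature carries the larger coefficient, and it is also the direction along which the predicted probability climbs fastest.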
------------------------------------------------------------------------
Of course, adding higher order terms to the expression in our sigmoid
function, say:
$$\frac{1}{1 + e^{-\left(\sum_j \theta_{2j} x_j + \sum_j \theta_{2j+1} x_j^2\right)}}$$
allows for a curved decision boundary in feature space $\vec{x}$ -- not just a
hyperplane.
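Sticking with the same from-scratch style, here is a sketch of that idea on made-up data whose true boundary is a circle; with squared features in the exponent, the fitted boundary in the raw $(x_1, x_2)$ plane is a conic rather than a straight line.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: label 1 inside the unit circle, 0 outside.
rng = np.random.default_rng(2)
N = 1000
X_raw = rng.uniform(-2, 2, size=(N, 2))
y = (X_raw[:, 0] ** 2 + X_raw[:, 1] ** 2 < 1.0).astype(float)

# Augment the feature space with squared terms, as in the expression above.
X = np.column_stack([np.ones(N), X_raw, X_raw ** 2])

# Same full-batch gradient ascent as before, just on the augmented features.
theta = np.zeros(X.shape[1])
lr = 0.5
for _ in range(5000):
    p = sigmoid(X @ theta)
    theta += lr * X.T @ (y - p) / N

# The fitted boundary is theta @ [1, x1, x2, x1^2, x2^2] = 0: a conic, not a line.
accuracy = np.mean((sigmoid(X @ theta) > 0.5) == (y > 0.5))
print(theta)
print(accuracy)  # a straight-line boundary tops out around 0.80 here (always guess "outside");
                 # with the squared features it should come out noticeably higher
```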