Basic Logistic Regression and θ as the Gradient of the Decision Boundary

11 Jul 2015

A few months ago I was asked about a classification problem in a phone interview, and, not knowing much about machine learning or basic regression at the time, I didn't realize what was really being asked. I fumbled about in the dark, only to suggest basic use of histograms and correlations, when the answer was really quite simple: logistic regression.

Linear regression takes the following form: you have some input vector x, which lives in a ``feature space'', and you'd like to map that x onto a corresponding y, which has some scalar value. If you have a set, or a ``census'' of sorts, of such pairs {x_i, y_i}, i = 1, ..., N, you'd like to learn from this data and be able to derive an analytical relationship between x and y. It turns out this problem is analytically solvable by some nasty linear algebra -- I did it once years ago, following David Hogg and Dustin Lang's paper on Bayesian inference -- if you model the errors, or deviations from your model, as being Gaussian. Then the cost function is essentially a χ² statistic, and if you maximize the likelihood function you get something called the normal equation:

$$y_i = \boldsymbol{\theta} \cdot \mathbf{x}_i$$

$$-\log L(\mathbf{x}, y \mid \boldsymbol{\theta}) \propto \sum_{i=1}^N \left(\boldsymbol{\theta} \cdot \mathbf{x}_i - y_i\right)^2$$

$$\boldsymbol{\theta} = \left(X^T X\right)^{-1} X^T \mathbf{y}$$

where X is something called the data matrix, which I won't take the time to define here. The point is that for linear regression we get a nice plane in hyperspace that maps our inputs x onto scalar values y. (The first sketch at the end of this post works this out numerically.)

Logistic regression works the same way, but now y can only take on two values, 0 or 1, and we wrap our linear model in a sigmoid function:

$$y_i = \frac{1}{1 + e^{-\boldsymbol{\theta} \cdot \mathbf{x}_i}}$$

And what this model really means is the probability of our output y being ``yes'' or ``no'' -- 1 or 0:

$$P(y = 1 \mid \boldsymbol{\theta}, \mathbf{x}_i) = \frac{1}{1 + e^{-\boldsymbol{\theta} \cdot \mathbf{x}_i}}$$

Now, when you maximize the likelihood -- which can no longer be done analytically by any method I know of, but must be done numerically, with something like stochastic gradient descent -- you will find that the same θ that defined a hyperplane in the linear model now gives the **gradient** of P(y=1|θ,x) at the **decision boundary** in x space. By the properties of the sigmoid function, we will classify y with a ``yes'' or a 1 when θ·x is greater than zero, and a ``no'' or a 0 when θ·x is less than zero. Let's look at when we are right on the ``cusp'' of a decision:

$$\frac{1}{1 + e^{-\boldsymbol{\theta} \cdot \mathbf{x}_i}} \approx \frac{1}{2} \quad \Longleftrightarrow \quad e^{-\boldsymbol{\theta} \cdot \mathbf{x}_i} \approx 1 \quad \Longleftrightarrow \quad \boldsymbol{\theta} \cdot \mathbf{x}_i \approx 0$$

So we can Taylor expand our hypothesis P(y=1|θ,x) about θ·x_i = 0 and get

$$P(y = 1 \mid \boldsymbol{\theta}, \mathbf{x}_i) = \frac{1}{2} + \frac{\boldsymbol{\theta} \cdot \mathbf{x}_i}{4} + \mathcal{O}\!\left((\boldsymbol{\theta} \cdot \mathbf{x}_i)^3\right)$$

If we take the derivative of this function with respect to the vector x, we see that the normal vector to our decision boundary in feature space points precisely along θ. Mathematically, the gradient gives the direction of **fastest increase**, and so relatively dominant components of our θ vector correspond to relatively dominant features in classification:

$$\frac{\partial P(y = 1 \mid \boldsymbol{\theta}, \mathbf{x})}{\partial \mathbf{x}} = \frac{\boldsymbol{\theta}}{4} + \mathcal{O}\!\left((\boldsymbol{\theta} \cdot \mathbf{x})^2\right)$$

So regression coefficients, at least near the decision boundary, play the same intuitive role as in linear regression: if we have a large coefficient, we know that the associated feature plays a prominent role in classification. (The second sketch at the end of the post checks this numerically.)

------------------------------------------------------------------------

Of course, adding higher order terms to the argument of our sigmoid function, say

$$\frac{1}{1 + e^{-\left(\sum_j \theta_{2j}\, x_j + \sum_j \theta_{2j+1}\, x_j^2\right)}}$$

allows for a curved decision boundary in feature space x -- not just a hyperplane.
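
As a footnote, here is a minimal numerical sketch of the linear case -- the normal equation applied to made-up synthetic data. It assumes NumPy; the data, the ``true'' weights, and the variable names are all invented for illustration:

```python
import numpy as np

# Made-up synthetic data: N points with d features, Gaussian noise on y.
rng = np.random.RandomState(1)
N, d = 200, 3
X = rng.randn(N, d)                         # the data matrix: one row per x_i
theta_true = np.array([2.0, -1.0, 0.5])     # "true" weights for the fake data
y = X.dot(theta_true) + 0.1 * rng.randn(N)  # y_i = theta . x_i + Gaussian error

# Normal equation: theta = (X^T X)^{-1} X^T y, the maximum-likelihood
# solution under Gaussian errors (equivalently, the least-squares solution).
theta_hat = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(theta_hat)  # should land close to [2.0, -1.0, 0.5]
```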
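
And a companion sketch for the logistic case: fitting θ by plain batch gradient ascent on the log-likelihood (a stand-in for the stochastic gradient descent mentioned above), then checking that θ is normal to the decision boundary θ·x = 0. Again, the data and the numbers are made up for illustration:

```python
import numpy as np

# Made-up 2D synthetic data with labels drawn from P(y=1 | theta, x).
rng = np.random.RandomState(0)
N = 500
X = rng.randn(N, 2)                          # 2D feature vectors x_i
theta_true = np.array([3.0, -1.5])           # first feature dominates
p_true = 1.0 / (1.0 + np.exp(-X.dot(theta_true)))
y = (rng.rand(N) < p_true).astype(float)

# Batch gradient ascent on the mean log-likelihood -- no closed form here.
theta = np.zeros(2)
lr = 0.5
for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-X.dot(theta)))
    grad = X.T.dot(y - p_hat) / N            # gradient of the mean log-likelihood
    theta += lr * grad

print("fitted theta:", theta)                # roughly recovers theta_true

# The decision boundary is {x : theta . x = 0}. Any vector v along that
# boundary satisfies theta . v = 0, so theta is its normal; moving along
# theta changes P(y=1 | theta, x) fastest, and the larger |theta_j|, the
# more feature j matters to the classification.
v_boundary = np.array([theta[1], -theta[0]])           # a vector along the boundary
print("theta . v_boundary =", theta.dot(v_boundary))   # 0 by construction
```

I used full-batch updates rather than stochastic ones only to keep the sketch short; the stochastic version just swaps the full gradient for a gradient computed on one (or a few) randomly chosen points per step.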