Colinearity (Part 1)
I've heard a lot of comments
-- some cautious, some cowardly -- about "co-linearity" recently, from both
colleagues at work and friends using statistics in their jobs. And, well,
GUESS WHAT? Co-linearity is not as scary as it used to be! Many people don't
realize that there are a variety of ways to avoid or control "co-linearity" in
data when performing basic regressions, and I want to take some time to
outline them.
Let's begin by saying we've got a regression problem on our hands: one where we
have $N$ examples of some feature vectors $\mathbf{x}$, of dimension $D$,
contained in an $N \times D$ data matrix:
\begin{eqnarray}
X &=& \left( \begin{array}{c}
\leftarrow \vec{x_1} \rightarrow \\
\leftarrow \vec{x_2} \rightarrow \\
\vdots \\
\leftarrow \vec{x_N} \rightarrow
\end{array}\right)
\end{eqnarray}
and, $N$ real-valued response variable examples, $Y$:
\begin{eqnarray}
Y &=& \left( \begin{array}{c}
y_1\\
y_2 \\
\vdots \\
y_N
\end{array}\right)
\end{eqnarray}
Our goal is to write a linear model of some sort:
\begin{eqnarray}
y_n &=& \mathbf{\beta} \cdot \mathbf{x_n} + \epsilon
\end{eqnarray}
Where we assume the errors are drawn from some probability distribution. In
standard regression problems it's Normal, so we'll keep it that way:
\begin{eqnarray}
\epsilon & \sim & \mathcal{N}(0,\sigma^2)
\end{eqnarray}
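To make this setup concrete, here's a minimal numpy sketch that simulates data from exactly this model; the particular values of $N$, $D$, $\mathbf{\beta}$, and $\sigma$ are made up, just for illustration.

```python
import numpy as np

# Hypothetical sizes and parameters, just to make the model concrete.
rng = np.random.default_rng(0)
N, D = 200, 3                              # number of examples, feature dimension
beta_true = np.array([1.5, -2.0, 0.5])     # "true" regression coefficients
sigma = 0.3                                # noise standard deviation

X = rng.normal(size=(N, D))                # N x D data matrix of feature vectors
eps = rng.normal(scale=sigma, size=N)      # Gaussian noise, N(0, sigma^2)
Y = X @ beta_true + eps                    # real-valued responses
```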
Now, as we've covered before in this blog, the likelihood of the data -- or,
the co-occurrence of features and labels we see in the world $(X,Y)$ -- given
some model, specified by $\mathbf{\beta}$, is equal to:
\begin{eqnarray}
P(X,Y \vert \mathbf{\beta}) &=& \prod_{n=1}^N \frac{1}{\sqrt{2\pi
\sigma^2}}\mathrm{exp}\left(-(y_n - \beta \cdot x_n)^2/ 2\sigma^2 \right)
\end{eqnarray}
Where, by taking a product above, we assume each data instance $n = 1, \dots
N$ is independent. Taking the log of this likelihood we get:
\begin{eqnarray}
\mathcal{L}(X,Y \vert \beta) &=& -\frac{N}{2} \log(2\pi \sigma^2) -
\sum_{n=1}^N \frac{(y_n - \beta \cdot x_n)^2}{ 2\sigma^2 }
\end{eqnarray}
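In code, this log likelihood is just a few lines of numpy. A quick sketch, where X, Y, beta, and sigma are assumed inputs with X of shape (N, D):

```python
import numpy as np

def log_likelihood(X, Y, beta, sigma):
    """Gaussian log likelihood of (X, Y) given coefficients beta and noise scale sigma."""
    N = X.shape[0]
    residuals = Y - X @ beta                      # y_n - beta . x_n for each n
    return (-0.5 * N * np.log(2 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2 * sigma**2))
```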
The log likelihood is a concave function of $\beta$, meaning that if we set the
derivative equal to zero, we are guaranteed to find the global maximum (very
good). To get our MLE, or maximum likelihood estimate, of the model $\beta$,
let's first rewrite the log likelihood in index notation, with sums over
repeated indices implied:
\begin{eqnarray}
\mathcal{L}(X,Y \vert \beta) &=& -\frac{N}{2} \log(2\pi \sigma^2) -
\frac{(\beta_l X_{nl}-Y_n)^2}{ 2\sigma^2 }
\end{eqnarray}
Taking the derivative with respect to $\beta_d$ (a gradient) we get:
\begin{eqnarray}
\frac{\partial \mathcal{L}(X,Y \vert \beta)}{\partial \beta_d}\vert_{\beta =
\beta^\prime} &=& (\beta_l X_{nl}-Y_n) X_{nd} = 0 \\
\beta_l X_{nl}X_{nd} &=& X_{nd} Y_n\\
\beta_l &=& (X_{nl}X_{nd})^{-1} X_{nd} Y_n
\end{eqnarray}
This is called the Normal Equation, and it can actually be well understood by
noting that:
\begin{eqnarray}
X_{nl}X_{nd} & \approx & N \mathrm{Cov}(x_l, x_d) \\
X_{nl}Y_{n} & \approx & N \mathrm{Cov}(x_l, y)
\end{eqnarray}
which, in words, are the covariance between the $l^{\mathrm{th}}$ and $d^{\mathrm{th}}$
components of the feature vector and the covariance between the
$l^{\mathrm{th}}$ feature and the target, $y$.
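For reference, here's what solving the Normal Equation looks like in numpy. This is a minimal sketch (no intercept, no sanity checks); solving the linear system with np.linalg.solve is preferred over forming the explicit inverse.

```python
import numpy as np

def normal_equation(X, Y):
    """OLS coefficients via the Normal Equation: solve (X^T X) beta = X^T Y."""
    # Solving the linear system is more stable than inverting X^T X directly.
    return np.linalg.solve(X.T @ X, X.T @ Y)
```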
There's a very important thing to notice here, straight off the bat, which is
that the Normal Equation -- which is the standard way of solving regression,
or OLS (ordinary least squares) problems -- accounts for interactions between
the features: we can see it in the "discounting" factor of the inverted
matrix, above. Highly correlated features will dampen each other's effect,
which is very, very cool. Regression coefficients $\beta_d$ represent the
``net'' effect of the $d^{\mathrm{th}}$ feature, not the ``gross'' effect one would get
by doing a single, univariate regression of $y$ on $x_d$ alone. This is important to
keep in mind, but we're not out of the -- or even into the -- co-linearity
woods yet.
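To see the ``net'' versus ``gross'' distinction in action, here's a toy sketch with made-up numbers, where one feature is nearly a copy of another:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

x1 = rng.normal(size=N)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=N)    # x2 is almost a copy of x1
y = 2.0 * x1 + 0.1 * rng.normal(size=N)       # y depends only on x1

X = np.column_stack([x1, x2])

# "Net" effects: joint OLS fit over both features.
beta_net, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Gross" effects: univariate regressions of y on each feature alone.
beta_gross = np.array([x @ y / (x @ x) for x in (x1, x2)])

print(beta_net)    # roughly [2, 0]: x2 gets little credit once x1 is included
print(beta_gross)  # both near 2: each feature alone appears to "explain" y
```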
----------------------------------------
People get upset or concerned about colinearity when they want:
1. Interpretable Models.
2. Stable Regression Coefficients $\mathbf{\beta}$ in the face of changing data.
3. A bone to pick with a model or feature set that they don't trust or understand.
Now, most of the "spookiness" of co-linearity comes from linear algebra, and
**from the complete absence of Bayesian statistics in traditional circles of past
statisticians: putting priors on regression coefficients is equivalent to
regularizing, and therefore controlling and containing, co-linearity.**
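As a quick preview of that point: putting a zero-mean Gaussian prior on $\mathbf{\beta}$ and maximizing the posterior, rather than the likelihood, gives the familiar ridge estimate. A minimal sketch, where the penalty lam (set by the prior width) is a hypothetical choice:

```python
import numpy as np

def ridge(X, Y, lam):
    """MAP / ridge estimate of beta under a zero-mean Gaussian prior."""
    D = X.shape[1]
    # The prior adds lam * I to X^T X, which keeps the matrix invertible
    # even when features are highly correlated (or when D > N).
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
```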
Take for example a data matrix where we have $N=15$ data points in our set,
but $D=45$ features. The old-time statisticians might tell you that the
problem is ill-specified or ill-defined, because if we create a regression
model with 45 degrees of freedom, there simply aren't enough data points to
"figure out" what's going on. And that's true, but it really comes from the
fact that when inverting a matrix -- as we're doing above -- with linearly
dependent columns, we could run into a lot of numerical trouble.
This has to be the situation in the case I just mentioned above, as it is
impossible for the square matrix $X_{nl}X_{nd}$ -- which is $D \times D$ -- to
be properly invertible. And this is simply because the rank of $X$ is at most
$N=15$, so the $45 \times 45$ matrix $X_{nl}X_{nd}$ has rank at most 15 and is
singular. I won't get into the dirty details, but suffice it to say that when
you tried to regress in such a situation, solving with the old methods, you
were hosed.
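You can see the problem directly in numpy, using the hypothetical $N=15$, $D=45$ numbers from above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 15, 45

X = rng.normal(size=(N, D))
XtX = X.T @ X                           # D x D, but built from only N rows

print(np.linalg.matrix_rank(XtX))       # at most 15, far short of 45
print(np.linalg.cond(XtX))              # enormous: numerically singular

# Trying np.linalg.solve(XtX, ...) here would raise an error or return
# garbage; np.linalg.lstsq falls back to a pseudo-inverse instead.
```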
And so how did you wiggle out of it, in the "old days"?
1. Sub-select the features, such that $D < N$, and move on with the analysis.
2. Use PCA to "represent" the matrix $X_{nl}X_{nd}$ in terms of its most
prominent eigenvectors, and once again make sure that $D<N$ (see the sketch
after this list).
3. Add some other method I'm not currently remembering here.
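For completeness, here's a rough sketch of option 2, often called principal components regression: project the (centered) features onto their top $k$ principal components and regress on those. The choice of $k$ here is hypothetical.

```python
import numpy as np

def pca_regression(X, Y, k):
    """Fit OLS on the top-k principal components of X (no intercept term)."""
    Xc = X - X.mean(axis=0)                       # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                             # N x k component scores
    gamma, *_ = np.linalg.lstsq(Z, Y, rcond=None) # regress Y on the components
    return Vt[:k].T @ gamma                       # map back to feature space
```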
But in modern times this is a brutal way to do things, especially with
genetic data, where the feature space has far more dimensions than the dataset
has examples (often by a factor of 10, 100, or even 1,000). We can do better
than this.