Brief Note on Information Gain Definition
After being confused for
quite some time, I've realized that information gain and mutual information
are essentially the same thing; the only wrinkle is that information gain is
written in a form that looks asymmetric, even though its value is symmetric
in the two variables.
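Writing the conditional entropy in terms of the joint entropy makes the symmetry explicit:
\begin{eqnarray}
IG(x \vert y) &=& H(x) - H(x \vert y) \\
&=& H(x) + H(y) - H(x,y) \\
&=& H(y) - H(y \vert x) \\
&=& IG(y \vert x)
\end{eqnarray}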
That mutual information is an intuitive measure of correlation can be seen by
noting that for a bivariate normal distribution, the mutual information
between the two variables is:
\begin{eqnarray}
I(x,y) &=& IG(x \vert y) \\
&=& H(x)-H(x \vert y) \\
&=& \frac{1}{2}\log\left(2\pi e \sigma_x^2\right)-\frac{1}{2}\log\left(2\pi e \sigma_x^2 (1-\rho^2)\right) \\
&=& -\frac{1}{2}\log(1-\rho^2)
\end{eqnarray}
where $\rho$ is the Pearson correlation coefficient; the result depends only
on $\rho^2$, which varies between zero and one, and diverges as
$\vert \rho \vert \to 1$. So, when choosing variables to split on in a
decision tree, or designing an experiment that probes some underlying model
space, we want pairs $x,y$ with $\vert \rho \vert \to 1$, in order to yield
as much information as possible.
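
As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and scikit-learn are available; `mutual_info_regression` is scikit-learn's k-nearest-neighbor mutual information estimator) that samples a correlated bivariate normal and compares the estimate against the closed form:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
rho, n = 0.9, 20_000

# Sample n points from a standard bivariate normal with correlation rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n).T

# Closed-form mutual information for the bivariate normal, in nats.
closed_form = -0.5 * np.log(1.0 - rho**2)

# Nonparametric k-NN estimate from the samples (also in nats).
estimate = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"closed form: {closed_form:.3f} nats, k-NN estimate: {estimate:.3f} nats")
```

With $\rho = 0.9$ the closed form gives $-\frac{1}{2}\log(0.19) \approx 0.83$ nats, and the k-NN estimate should agree to within sampling error.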
