ANOVA and the "Temperature" of a data set
Typically, with analysis
of variance, one chooses to weight all data points -- or rather, their
deviations from our model -- equally in the chi-squared statistic, assuming
\begin{eqnarray}
\chi^2 &=& \sum_i^N \frac{(x_i-f(x_i))^2}{\sigma^2_i}\\
\sigma_i &=& \sigma \ \ \forall \ i
\end{eqnarray}
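Since the data points are assumed independent and Gaussian, this collapses the
likelihood of the whole data set into a single prefactor times an exponential in
chi-squared:
\begin{eqnarray}
P(D \vert \mu, \sigma^2) &=& \prod_i^N \frac{1}{\sqrt{2\pi \sigma^2}}\, e^{-(x_i-f(x_i))^2/2\sigma^2} = \left(2\pi \sigma^2\right)^{-N/2} e^{-\chi^2/2}
\end{eqnarray}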
Bayes' theorem then yields a very simple posterior distribution:
\begin{eqnarray}
P(\mu, \sigma^2 \vert D) &=& \frac{\left(2\pi
\sigma^2\right)^{-N/2}e^{-\chi^2/2}P(\mu,\sigma^2)}{Z}
\end{eqnarray}
Here Z is our awful, marginalized normalizing factor from before. We can write
our posterior distribution much like the Boltzmann distribution of a
statistical-mechanics ensemble, with Z playing the role of the partition
function, and associate our chi-squared statistic -- or deviation from our model
f(x) -- with an 'Energy' (taking a flat prior for now, and setting aside the
Gaussian prefactor, which we will restore in a moment as a chemical potential):
\begin{eqnarray}
P(\mu, \sigma^2 \vert D) &=& \frac{e^{-\beta E(\mu,\sigma^2)}}{Z} \\
Z &=& \int \int e^{-\beta E(\mu,\sigma^2)} d\mu d\sigma^2 \\
\beta E(\mu,\sigma^2) &=& \sum_i^N \frac{(x_i-f(x_i))^2}{2\sigma^2}
\end{eqnarray}
In this case the variance of each and every data point x_i -- the sigma^2 in the
denominator of our chi-squared expression above -- acts much like the
"temperature" of our distribution: crank sigma^2 up and the posterior flattens
out, just as a hotter ensemble spreads over more states.
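To make that concrete, here is a minimal numerical sketch (not part of the
original derivation): it assumes the simplest possible model, f(x) = mu (a
constant mean), treats sigma^2 as a fixed, known "temperature" rather than
marginalizing over it, and evaluates e^{-beta E(mu)} on a grid of mu values for a
small synthetic data set. The helper name posterior_over_mu is just for this
sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data set; the model here is just f(x) = mu, a constant mean.
x = rng.normal(loc=3.0, scale=1.0, size=20)

# Grid of candidate mu values to evaluate the posterior on.
mu_grid = np.linspace(0.0, 6.0, 601)


def posterior_over_mu(x, mu_grid, sigma2):
    """e^{-beta E(mu)} on a grid, normalized numerically (the normalization plays the role of Z)."""
    # beta * E(mu) = chi^2 / 2 = sum_i (x_i - mu)^2 / (2 * sigma^2)
    beta_E = ((x[:, None] - mu_grid[None, :]) ** 2).sum(axis=0) / (2.0 * sigma2)
    w = np.exp(beta_E.min() - beta_E)  # shift by the minimum for numerical stability
    return w / np.trapz(w, mu_grid)


# Larger sigma^2 = higher "temperature": the posterior over mu flattens and broadens.
for sigma2 in (0.5, 1.0, 4.0):
    p = posterior_over_mu(x, mu_grid, sigma2)
    mean = np.trapz(mu_grid * p, mu_grid)
    std = np.sqrt(np.trapz((mu_grid - mean) ** 2 * p, mu_grid))
    print(f"sigma^2 = {sigma2:3.1f}  ->  posterior std of mu = {std:.3f}")
```

With twenty points the reported widths land near sigma/sqrt(N), which is exactly
the flattening the temperature analogy predicts.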
We can make things even clearer by putting in the chemical potential, or the
energy associated with adding a member to our dataset:
\begin{eqnarray}
\mu_c &=& \mathrm{chemical}\ \ \mathrm{potential}\\
P(\mu, \sigma^2 \vert D) &=& \frac{e^{N\beta\mu_c-\beta E}}{Z} \\
\beta \mu_c &=& \log\left(\frac{1}{\sqrt{2\pi \sigma^2}} \right)
\end{eqnarray}
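Exponentiating that last line shows the chemical-potential term is nothing more
than the Gaussian prefactor from our first posterior above:
\begin{eqnarray}
e^{N\beta\mu_c} &=& \exp\left[N \log\left(\frac{1}{\sqrt{2\pi \sigma^2}}\right)\right] = \left(2\pi \sigma^2\right)^{-N/2}
\end{eqnarray}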
This manifests itself as the extra normalizing factor we pick up for each added
data point -- exactly like the chemical potential's job in the Bose-Einstein
distribution, where it sets the cost of adding a particle! (Pardon the
overloading of mu; hence the subscript c.)