Least Chi-Squared -- or Maximum Log-Likelihood -- Estimation of Parameters
Let's say you're given some data, a list of $x$ and $y$ values, which seem to be
related by a model. One example is a linear relationship with some
scatter, where
\begin{eqnarray}
y_i \approx m x_i + b
\end{eqnarray}
where $m$ is a best-fit slope and $b$ a vertical offset. Since this is not a
perfect relationship, we must introduce an error term into our model:
\begin{eqnarray}
y_i &=& m x_i + b + e_i\\
y_i &=& f(x_i) + e_i \\
f(x) &=& mx+b
\end{eqnarray}
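Just to make this concrete, here is a short Python sketch (not part of the
derivation; the slope, offset, and scatter values are arbitrary choices) that
generates data of exactly this form:
\begin{verbatim}
# Sketch: simulate data obeying y_i = m*x_i + b + e_i.
# The values of m, b, and sigma below are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
m_true, b_true, sigma = 2.0, 1.0, 0.5      # assumed "true" parameters
x = np.linspace(0.0, 10.0, 50)             # the x_i values
e = rng.normal(0.0, sigma, size=x.size)    # the error terms e_i
y = m_true * x + b_true + e                # the observed y_i
\end{verbatim}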
$f(x_i)$, in this instance, represents our model: our "perceived"
relationship between the $x$ and $y$ values. Another example is when the $y_i$
values are just centered around some single value -- i.e. a random variable.
Then we have:
\begin{eqnarray}
y_i &=& f(x_i) + e_i \\
f(x) &=& \mu
\end{eqnarray}
For a random variable $Y$ centered about zero, we might say $f(x)=0$. This is
STILL a model!
The point is that if we want to make a "best guess" at the parameters
within our model -- such as $m$ and $b$ above, or $\mu$ -- we are going to
have to come up with some way of describing the error terms, $e_i$.
A reasonable assumption is that those error terms are drawn from some
probability density; let's say a Gaussian:
\begin{eqnarray}
e_i &\sim & \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{e_i^2}{2\sigma^2}}
\end{eqnarray}
We can isolate the error term by subtracting the model $f(x_i)$ from our
"output" value $y_i$. For a random variable centered about a true value
$\mu$, where $f(x)=\mu$, this gives:
\begin{eqnarray}
e_i &=& y_i - f(x_i) \\
P(y_i \vert \mu, \sigma^2) &=& \frac{1}{\sqrt{2\pi
\sigma^2}}e^{-\frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}
Now, this is the probability of a single data-point pair, $(x_i,y_i)$, assuming
our model $f(x)$ and an error term described by this Gaussian
distribution. If we were to count up all $N$ data-point pairs and calculate
the probability of observing the whole set $D=\{(x_i,y_i)\}$, we would take the
product of the individual probabilities (assuming the errors are independent):
\begin{eqnarray}
P(D \vert \mu, \sigma^2) &=& \prod_i P(y_i \vert \mu, \sigma^2)\\
&=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\sum_i \frac{\left(y_i -
f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}
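As a quick numerical sanity check (a sketch, with my own function names and
made-up numbers), the product of the single-point Gaussian densities agrees
with the closed form above:
\begin{verbatim}
# Sketch: likelihood of the whole data set under the constant model f(x) = mu,
# as a product of single-point densities and in the closed form above.
import numpy as np

def single_point_prob(y_i, mu, sigma2):
    # P(y_i | mu, sigma^2): Gaussian density of one point about the model value
    return np.exp(-(y_i - mu)**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def data_prob(y, mu, sigma2):
    # (2*pi*sigma^2)^(-N/2) * exp(-sum_i (y_i - mu)^2 / (2*sigma^2))
    N = y.size
    return (2.0 * np.pi * sigma2)**(-N / 2.0) * np.exp(-np.sum((y - mu)**2) / (2.0 * sigma2))

y = np.array([1.2, 0.8, 1.1])
print(np.prod(single_point_prob(y, 1.0, 0.25)), data_prob(y, 1.0, 0.25))  # should agree
\end{verbatim}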
Notice that the conditional probability $P(D \vert \mu, \sigma^2)$ has the same
structure as our $P(D\vert H_i)$ term from Bayes' Theorem, before. Our "hypothesis" in this
case, is the mean and variance; or, the parameters of our model. We can
construct posterior probability densities on those parameters in the usual
way:
\begin{eqnarray}
P(H_i \vert D) &=& \frac{P(D \vert H_i)P(H_i)}{P(D)}\\
&=& \frac{P(D \vert H_i)P(H_i)}{\sum_j P(D \vert H_j)P(H_j)}\\
P(\mu, \sigma^2 \vert D) &=& \frac{P(D \vert \mu, \sigma^2)P(\mu,
\sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}
\end{eqnarray}
The denominator in these expressions is simply a normalizing factor. If we want
to construct the posterior on a single parameter, say $\mu$, we merely perform
what's called marginalization:
\begin{eqnarray}
P(\mu \vert D) &=& \int P(\mu, \sigma^2 \vert D) d\sigma^2
\end{eqnarray}
This equation reads "the probability of some mean $\mu$, given the data $D$,
an array of $x$ and $y$ values".
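To make this concrete, here is a brute-force sketch: evaluate the posterior on
a $(\mu, \sigma^2)$ grid assuming a flat prior (an assumption made purely for
illustration), then integrate out $\sigma^2$ numerically:
\begin{verbatim}
# Sketch: grid evaluation of P(mu, sigma^2 | D) with an assumed flat prior,
# followed by marginalization over sigma^2. Grid ranges are arbitrary.
import numpy as np

def grid_posterior(y, mu_grid, s2_grid):
    N = y.size
    MU, S2 = np.meshgrid(mu_grid, s2_grid, indexing="ij")
    # log-likelihood on the grid (work in logs for numerical stability)
    loglike = (-0.5 * N * np.log(2.0 * np.pi * S2)
               - np.sum((y[None, None, :] - MU[..., None])**2, axis=-1) / (2.0 * S2))
    post = np.exp(loglike - loglike.max())   # flat prior; scale fixed by normalizing
    dmu = mu_grid[1] - mu_grid[0]
    ds2 = s2_grid[1] - s2_grid[0]
    return post / (post.sum() * dmu * ds2)   # normalized joint posterior

y = np.random.default_rng(0).normal(1.0, 0.5, size=20)
mu_grid = np.linspace(0.0, 2.0, 201)
s2_grid = np.linspace(0.05, 1.0, 200)
post = grid_posterior(y, mu_grid, s2_grid)
p_mu = post.sum(axis=1) * (s2_grid[1] - s2_grid[0])   # P(mu | D): sigma^2 integrated out
\end{verbatim}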
Ok, so why do we care? Well, in Bayes' Theorem above, we implicitly used
something called a Likelihood function, which is the Likelihood of seeing the
data, given some model and its parameters:
\begin{eqnarray}
\mathcal{L}(D \vert \mu, \sigma^2)=P(D \vert \mu, \sigma^2) &=& (2\pi
\sigma^2)^{-\frac{N}{2}} e^{-\sum_i \frac{\left(y_i -
f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}
Now the term in our exponential above can be simplified. Defining the $\chi^2$
statistic for our data:
\begin{eqnarray}
\chi^2 &=& \sum_i \frac{\left(y_i - f(x_i)\right)^2}{\sigma^2}\\
\mathcal{L}(D \vert \mu, \sigma^2)&=& (2\pi \sigma^2)^{-\frac{N}{2}}
e^{-\frac{\chi^2}{2}}
\end{eqnarray}
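Since every data point here shares the same $\sigma$, minimizing $\chi^2$ for
the linear model is just ordinary least squares, so a standard fitting routine
returns the maximum-Likelihood slope and intercept. A sketch with simulated
data:
\begin{verbatim}
# Sketch: chi^2 for a generic model f, and a least-squares (= least-chi^2) fit
# of the linear model to simulated data.
import numpy as np

def chi_squared(x, y, f, sigma2):
    return np.sum((y - f(x))**2) / sigma2

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)   # toy data, sigma = 0.5

m_hat, b_hat = np.polyfit(x, y, 1)    # least-squares slope and intercept
print(m_hat, b_hat)
print(chi_squared(x, y, lambda t: m_hat * t + b_hat, 0.5**2))
\end{verbatim}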
Pretty cool, right? So, maximizing Likelihood is minimizing $\chi^2$. We can
take the log of this Likelihood function to write things in a more
illuminating fashion:
\begin{eqnarray}
\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right]&=&
-\frac{N}{2}\log (2\pi \sigma^2)-\frac{\chi^2}{2}
\end{eqnarray}
Taking the derivative with respect to $\mu$ and setting it equal to zero yields
a critical value of $\mu$ -- let's call it $\hat{\mu}$ -- that gives
maximum Likelihood:
\begin{eqnarray}
\frac{\partial}{\partial \mu}\log\left[\mathcal{L}(D \vert \mu,
\sigma^2)\right] &=&-\frac{\partial}{\partial \mu}\chi^2/2\\
0&=&\frac{\sum_i y_i - N\hat{\mu}}{\sigma^2}\\
\hat{\mu} &=& \sum_i \frac{y_i}{N}
\end{eqnarray}
our standard "average". This is our estimator of the mean.
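As a quick check (a sketch with made-up numbers), the log-Likelihood of the
constant model really is maximized at the sample average:
\begin{verbatim}
# Sketch: numerically maximize the log-Likelihood over mu and compare with the
# sample average. The data values and sigma^2 are arbitrary.
import numpy as np

y = np.array([1.2, 0.8, 1.1, 0.9, 1.3])
sigma2 = 0.25
N = y.size

def log_likelihood(mu):
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - np.sum((y - mu)**2) / (2.0 * sigma2)

mu_grid = np.linspace(0.0, 2.0, 2001)
mu_best = mu_grid[np.argmax([log_likelihood(mu) for mu in mu_grid])]
print(mu_best, y.mean())    # both come out at ~1.06
\end{verbatim}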
Now for an estimator of the variance -- or, in our older terms, the value of
$\sigma$ for which Likelihood is maximized:
\begin{eqnarray}
\frac{\partial}{\partial \sigma}\log\left[\mathcal{L}(D \vert \mu,
\sigma^2)\right] &=& 0\\
\frac{N}{\hat{\sigma}}&=&\frac{\sum_i (y_i-\mu)^2}{\hat{\sigma}^3}\\
\hat{\sigma}^2 &=&\frac{\sum_i (y_i-\mu)^2}{N}
\end{eqnarray}
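In code (a sketch using simulated data, so the true mean really is known):
\begin{verbatim}
# Sketch: the maximum-Likelihood variance estimate when the true mean is known.
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true = 0.0, 1.0
y = rng.normal(mu_true, sigma_true, size=10)

sigma2_hat = np.sum((y - mu_true)**2) / y.size   # divide by N, using the true mean
print(sigma2_hat)
\end{verbatim}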
So the maximum-Likelihood estimate is just our standard sum of squares divided
by $N$. Notice, however, that we've used our "true" mean $\mu$ in the equations
above, not our estimator $\hat{\mu}$. In order to account for this, one can write
\begin{eqnarray}
\hat{\sigma}^2 &=&\frac{\sum_i (y_i-\mu)^2}{N}\\
N\hat{\sigma}^2 &=&\sum_i (y_i-\mu)^2\\
&=&\sum_i (y_i-\hat{\mu})^2+N(\hat{\mu}-\mu)^2\\
&=& Ns^2+N\left(\sum_i \frac{y_i}{N}-\mu\right)^2\\
N\hat{\sigma}^2 &\approx& Ns^2+N\sum_i \frac{(y_i-\mu)^2}{N^2}\\
N\hat{\sigma}^2 &\approx& Ns^2+\hat{\sigma}^2\\
\frac{(N-1)}{N}\hat{\sigma}^2 &\approx& s^2
\end{eqnarray}
where the cross terms dropped in expanding the square average to zero over many
data sets (hence the $\approx$). We now have a new estimator for the variance,
which requires no "true" information:
\begin{eqnarray}
s^2&=& \frac{\sum_i (y_i-\hat{\mu})^2}{N}
\end{eqnarray}
If we want to best emulate the "true information estimator", we had best
divide by $N-1$ instead of $N$. Another way to show this is that $s^2$ is a
biased estimator: its expectation value, or average over an infinite number
of data sets, is not $\sigma^2$ but in fact:
\begin{eqnarray}
\langle s^2 \rangle &=& \frac{N-1}{N}\sigma^2
\end{eqnarray}
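A quick Monte Carlo sketch (with arbitrary choices of $N$, $\sigma$, and the
number of trials) makes the bias visible: averaged over many simulated data
sets, $s^2$ lands near $\frac{N-1}{N}\sigma^2$, while the true-mean estimator
and the divide-by-$(N-1)$ version land near $\sigma^2$:
\begin{verbatim}
# Sketch: Monte Carlo check of the bias in s^2. N, sigma, and the number of
# trials are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(2)
N, sigma, trials = 5, 1.0, 200000
data = rng.normal(0.0, sigma, size=(trials, N))   # many data sets, true mean = 0

s2        = np.mean((data - data.mean(axis=1, keepdims=True))**2, axis=1)  # /N, mu-hat
sigma2_mu = np.mean(data**2, axis=1)              # /N, true mean
s2_nm1    = s2 * N / (N - 1)                      # /(N-1), mu-hat

print(s2.mean(), (N - 1) / N * sigma**2)   # ~0.80 vs 0.80
print(sigma2_mu.mean(), sigma**2)          # ~1.00 vs 1.00
print(s2_nm1.mean(), sigma**2)             # ~1.00 vs 1.00
\end{verbatim}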
To review, let us write our Posterior distribution one more time:
\begin{eqnarray}
P(H_i \vert D) &=& \frac{P(D \vert H_i)P(H_i)}{\sum_j P(D \vert
H_j)P(H_j)}\\
P(\mu, \sigma^2 \vert D) &=& \frac{\mathcal{L}(D \vert \mu,
\sigma^2)P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu,
\sigma^2)d\mu d\sigma^2}\\
P(\mu, \sigma^2 \vert D) &=& \frac{(2\pi \sigma^2)^{-\frac{N}{2}}
e^{-\frac{\chi^2}{2}}P(\mu, \sigma^2)}{\int \int P(D \vert \mu,
\sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}
\end{eqnarray}
The factor $P(\mu,\sigma^2)$ on the right hand side is called a prior, and
determines our a priori information about the parameters $\mu$ and $\sigma^2$.
What's nice about Bayes' construction is that the posterior from one data set
can be used as the prior for another -- the old data simply become part of our
initial information. (Perhaps we should have written the prior as
\begin{eqnarray}
P(\mu,\sigma^2 \vert \mathrm{old\ dataset})
\end{eqnarray}
to make this explicit.)
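As a small sketch of this "yesterday's posterior is today's prior" updating
(grid-based, with $\sigma^2$ held fixed purely for simplicity, and all numbers
illustrative):
\begin{verbatim}
# Sketch: sequential updating on a grid in mu, with sigma^2 held fixed for
# simplicity. The posterior from each data set becomes the prior for the next.
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0
mu_grid = np.linspace(-3.0, 3.0, 601)
dmu = mu_grid[1] - mu_grid[0]

def likelihood_on_grid(y):
    # unnormalized Likelihood of the data at each grid value of mu
    return np.exp(-np.sum((y[None, :] - mu_grid[:, None])**2, axis=1) / (2.0 * sigma2))

prior = np.ones_like(mu_grid)                     # start from a flat prior
for dataset in (rng.normal(0.5, 1.0, 20), rng.normal(0.5, 1.0, 20)):
    post = likelihood_on_grid(dataset) * prior
    post /= post.sum() * dmu                      # normalize
    prior = post                                  # posterior -> prior for the next set

print(mu_grid[np.argmax(prior)])                  # concentrates near the true mean, 0.5
\end{verbatim}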