Facility Location as Adaptive K-means
In linear programming, a
typical problem is the following: we have $N$ facilities, which we can choose
to open or not open, in order to serve $M$ customers, who are strewn
throughout the world. We would like to minimize the operational cost of
serving our customers from these facilities, which can be summarized as:
\begin{eqnarray}
\mathrm{startup} \ \mathrm{costs} &=& \sum_f s_f y_f\\
y_f &=& 0,1
\end{eqnarray}
Where $y_f$ is a binary variable -- zero or one -- based on whether we choose
to open facility $f$. $s_f$ is obviously the associated cost for opening that
specific facility. We also have transit costs from serving our customers,
which might be of the very simple form:
\begin{eqnarray}
\mathrm{transit} \ \mathrm{costs} &=& \sum_c \sum_f X_{cf}
D(\mathbf{x}_c, \mathbf{x}_f)\\
X_{cf} &=& 0,1
\end{eqnarray}
Where $X_{cf}$ is a binary matrix that denotes whether customer $c$ was
assigned to facility $f$. $D(x_1,x_2)$, is our distance metric, which could be
Euclidean or "Manhattan" -- where we only take steps to the left or right, up
or down, no diagonal lines between destinations -- or maybe just a query in
google maps.
Adding these to costs together we get a loss function of sorts:
\begin{eqnarray}
J &=& \sum_c \sum_f X_{cf} D(\mathbf{x}_c, \mathbf{x}_f)+ \sum_f s_f
y_f
\end{eqnarray}
which we would like to minimize. There are some simple constraints on our
binary variables that we can articulate mathematically. Say each facility has
an associated production capacity $c_f$ and each customer has an associated
demand $d_c$. Then we have to make sure that the assignments are not
overworking our facilities:
\begin{eqnarray}
\sum_c d_c X_{cf} & \le & c_f y_f \ \forall \ f
\end{eqnarray}
Notice I've multiplied by $y_f$ on the RHS. Another constraint we need to have
is that every customer is served by exactly one facility -- no redundancy:
\begin{eqnarray}
\sum_f X_{cf} &=& 1\ \forall \ c
\end{eqnarray}
We can also require more explicitly that facilities who are turned off do not
get any assignments:
\begin{eqnarray}
X_{cf} & \le & y_f \ \ \ \forall \ c,f
\end{eqnarray}
That's a ton of constraints, but each of them are linear and so we have well
defined linear program. Given $N$ facility locations and $M$ customer
locations, with the associated capacities, demands, and start up costs, we can
find the global optimum of this thing with $N(M+1)$ decision variables, $X$
and $y$.
Turns out this problem is very similar to K means. Because what we are
essentially doing is breaking down our customers into "facility clusters".
What if we promoted all customers to facilities themselves? (Or at least gave
them the option). Then the distance function $D(\mathbf{x}_1,\mathbf{x}_2)$ is
simply the square root of the $L_2$ norm associated with a mixture of
Gaussians with a diagonal covariance matrix. Let $\mathbf{x}_2$ be the
facility, or centroid of the $k^\mathrm{th}$ cluster, then we have:
\begin{eqnarray}
\mathrm{Let} \ \ \mathbf{x}_2 &=& \mu_k \\
\mathbf{\Sigma} &=& \mathbf{1} \\
\mathrm{Then} \ \ D(\mathbf{x}_1,\mathbf{x}_1) &=&
\sqrt{\left(\mathbf{x}_1-\mathbf{\vec{\mu}}_k\right)
\mathbf{\Sigma}^{-1}\left(\mathbf{x}_1-\mathbf{\vec{\mu}}_k\right)}
\end{eqnarray}
For a mixture of Gaussians density Estimation, our PDF is:
\begin{eqnarray}
P(\mathbf{x}) &=& \sum_{k=1}^K \mathcal{N}\left(\mu_k,\mathbf{\Sigma}
\right)
\end{eqnarray}
and the log likelihood for our sequence of customers/facilities is:
\begin{eqnarray}
-\log \mathcal{L}(X \vert \lbrace \mu \rbrace_{k=1}^K )&=& \sum_k
\sum_{ n \in c_k} \frac{\left(\mathbf{x}_n-\mathbf{\mu}_k
\right)^2}{2\sigma^2}+K\sqrt{2\pi \sigma^2}
\end{eqnarray}
I've kept the $\sigma^2$ terms for clarity. Notice that the second term in
this negative log likelihood, or loss function is the normalization factor: it
penalizes a high level of total clusters $K$, preventing overfitting. This
loss function -- apart from the quadratic nature of the summand -- is exactly
like the objective in our facility location problem. The startup cost for
every "facility" in this example would be
\begin{eqnarray}
s_f &=& \sqrt{2\pi \sigma^2} \ \ \forall f
\end{eqnarray}
And the variance of our gaussians we would set to unity: $\sigma \to 1$. Let's
rewrite the negative log likelihood, but in decision variable language:
\begin{eqnarray}
J^\prime &=& \frac{1}{2}\sum_i \sum_j X_{ij} D(\mathbf{x}_i,
\mathbf{x}_j)^2+ \sum_j s_j y_j \\
s_i &=& \sqrt{2\pi}\\
\end{eqnarray}
We have not yet specified the capacity of our clusters or the "demand" of each
data point. A simple choice would be to set some upper limit on the size of
each cluster, and give every data point the same weight or demand:
\begin{eqnarray}
d_i &=& 1 \ \ \forall \ i \\
c_j &=& c_\mathrm{max} \ \ \forall \ j
\end{eqnarray}
This is still a linear program, just with a slightly different list of
coefficients on the first term! To solve such a system we would need $N(N+1)$
decision variables, which could be prohibitive with large systems, but thanks
to interior point solvers and very clever routines for integer programming we
can actually run this "adaptive" K-means algorithm in polynomial time!