Facility location optimization using a variant of the k-means algorithm

How to modify a basic k-means algorithm to solve a famous constrained optimization problem

Hedi
TotalEnergies Digital Factory
Jun 8, 2021 · 10 min read


Photo by Mario Caruso on Unsplash

Authors: Hedi Hargam, Pierre de Fréminville

In this article we are interested in the optimization of object placement under constraints. As an example (this is not a real use case), we take the placement of cell phone towers that must be within reach of their users. The goal is to place a limited number of cell towers as close as possible to housing areas, so as to cover as much territory as possible. To this end, we chose as the optimization objective the sum of the distances between each user and the closest cell tower. The cell towers also have maximum and minimum capacities that must be respected for them to be considered functional.

Example of 12 samples with k=4 cell towers. Condition on the capacity C is 1 < C < 5

In the following, we propose an algorithm to solve this problem and a new solution, developed in Python, based on a modified Expectation-Maximization algorithm. We also present the computational results obtained on a fictitious dataset for our problem.

Mathematical modeling

In the following, we will use the bird's-eye distance between each customer and the tower, i.e. the classical Euclidean distance in the plane.

Our problem is close to a well-known problem in the mathematical optimization literature, generally referred to as the capacitated k-geometric-median problem, the Fermat-Weber localization problem, or the facility location problem. In our case, the centers can be chosen freely in the plane, whereas in some variants of the problem found in the literature, centers must be chosen among the given points.

The k-geometric-median problem is a combinatorial optimization problem which consists in determining k locations (geometric medians) such that the sum of the distances between each demand point (client) and the nearest of the selected facilities is minimal. Additional constraints can be added, like a maximum capacity limit (capacitated k-geometric-median problem) and/or a minimum capacity limit.

The k-geometric-median problem (KGMP) belongs to the class of NP-hard problems.

The KGMP can be seen as a clustering problem close to the well-known k-means problem. The major difference is that in the KGMP we minimize the sum of Euclidean distances from each point to its cluster centroid, whereas in the k-means problem we minimize the sum of squared Euclidean distances from each point to its cluster centroid (which is equivalent to minimizing the sum of the intra-cluster variances).

The centroid of a cluster in the k-means problem is simply the mean of the cluster’s points. In the k-geometric-median problem the centroid is different from the mean and is called the geometric median.

Let’s have a look at an example on a line and an example in the plane to compare the mean with the geometric median.

1D example: with samples (0, 0, 9)

  • Mean is 9/3 = 3 → total mean-sample distance = (3-0) + (3-0) + (9-3) = 12
  • Median is 0 → total median-sample distance = (0-0) + (0-0) + (9-0) = 9

It can be noted that the geometric median is often used in machine learning because it is more robust to outliers.
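The 1D example above can be checked in a few lines of Python:

```python
import numpy as np

samples = np.array([0.0, 0.0, 9.0])
mean, median = samples.mean(), np.median(samples)

dist_to_mean = np.abs(samples - mean).sum()      # (3-0) + (3-0) + (9-3) = 12
dist_to_median = np.abs(samples - median).sum()  # (0-0) + (0-0) + (9-0) = 9

print(mean, median)                  # 3.0 0.0
print(dist_to_mean, dist_to_median)  # 12.0 9.0
```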

Geometric median

The point that minimizes the sum of the distances to a set of points is called the geometric median in the literature.

It is also referred to as: the L_1-median, the spatial median, the solution to the Weber problem, or the solution to the Fermat point problem (in the case of triangles).

In the k-means problem, the goal is to minimize, in each cluster, the sum of squared distances to the centroid. The centroid of a cluster is simply the mean of its points, so it can be found by a direct analytical formula.

On the contrary, the geometric median has no explicit formula, but the objective function C(y), the sum of the distances from y to the sample points, is convex (strictly convex in the non-degenerate case), so the median can be found by solving the optimization problem below.

Source: Vardi et al. article

The convexity of C(y) comes from the fact that the distance to each sample point is a convex function of y, and a sum of convex functions remains convex.

3D surface visualisation of the convex function C(y)

Note also that C(y) is differentiable everywhere except at the sample points, and its gradient (first derivatives) and Hessian (second derivatives) can be expressed analytically (this is useful in case we want to use general convex optimization algorithms).
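For instance, the gradient of C(y) is the sum of the unit vectors pointing from each sample point towards y. A quick numerical sanity check (the sample points here are of our own choosing):

```python
import numpy as np

points = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])

def C(y):
    """Sum of Euclidean distances from y to the sample points."""
    return np.linalg.norm(points - y, axis=1).sum()

def grad_C(y):
    """Analytical gradient: sum of unit vectors from each sample point to y
    (valid wherever y is not a sample point)."""
    diff = y - points
    return (diff / np.linalg.norm(diff, axis=1, keepdims=True)).sum(axis=0)

# Central finite differences agree with the analytical gradient
y = np.array([1.0, 1.0])
eps = 1e-6
numeric = np.array([(C(y + eps * e) - C(y - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(grad_C(y), numeric, atol=1e-5))  # True
```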

Resolution method chosen

Basic approach

In order to solve the capacitated KGMP, we took inspiration from the article “Constrained K-Means Clustering” by Microsoft Research. A great open-source implementation, which uses the OR-Tools min-cost-flow solver and is based on the sklearn API, can be found here: joshlk/k-means-constrained (credits to Josh Levy-Kramer).

The main idea of the article is the following: in the classical Expectation-Maximization algorithm used to solve the k-means problem, modify the E (expectation) and M (maximization) steps to match the problem's additional constraints.

In our case, we follow these steps:

  • Initialization: choose the initial cluster center positions using the kmeans++ algorithm to avoid poor local optima
  • Expectation: assign points to a cluster by solving a constrained optimization problem (min-cost flow)
  • Maximization: update the cluster centers by computing the geometric medians (note that, contrary to the k-means centroid update, we have to use an optimization algorithm here as well, because there is no analytical formula)
  • Repeat the expectation and maximization steps until a convergence criterion is reached.
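The steps above can be sketched as follows. The helpers `assign` (capacity-aware E step) and `center` (geometric-median M step) are hypothetical placeholders for the routines discussed in this article, and we use plain random seeding instead of kmeans++ for brevity:

```python
import numpy as np

def constrained_k_geometric_median(points, k, assign, center,
                                   n_iter=100, tol=1e-6, seed=0):
    """EM-style loop: `assign` solves the (capacitated) assignment step,
    `center` computes a cluster's geometric median."""
    rng = np.random.default_rng(seed)
    # Random seeding from the data points (kmeans++ would go here)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    prev_cost = np.inf
    for _ in range(n_iter):
        labels = assign(points, centers)              # E step
        centers = np.array([center(points[labels == j])
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])       # M step
        cost = sum(np.linalg.norm(points[labels == j] - centers[j], axis=1).sum()
                   for j in range(k))
        if prev_cost - cost < tol:                    # convergence criterion
            break
        prev_cost = cost
    return centers, labels
```

Plugging in a nearest-center `assign` and a mean `center` recovers plain k-means.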

The convergence of the original algorithm still holds with our modifications because the geometric-median objective is convex (in Proposition 2.3 of the article, the proof relies on the convexity of the k-means objective).

Constrained K-Geometrical algorithm (k-means comparison)

Expectation part (cluster assignment)

The goal of the expectation step is to assign points to each cluster while satisfying our side constraints.

This can be done by solving a Mixed Integer Program over binary assignment variables T_ij (T_ij = 1 if point x_i is assigned to center c_j): minimize Σ_i Σ_j T_ij · ‖x_i − c_j‖, subject to Σ_j T_ij = 1 for every point i, and C_min ≤ Σ_i T_ij ≤ C_max for every center j.

Instead of using a general MIP solver we can reformulate the assignment problem as a min-cost flow problem:

Equivalent minimum-cost-flow formulation, Bradley et al.
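As an illustration (not the OR-Tools code used by k-means-constrained), the assignment step can be sketched with SciPy's LP solver. The constraint matrix of this transportation problem is totally unimodular, so the LP relaxation already has an integral optimum, which mirrors the min-cost-flow view. All names here are ours:

```python
import numpy as np
from scipy.optimize import linprog

def assign_with_capacities(points, centers, min_cap, max_cap):
    """Capacity-constrained assignment as a transportation LP."""
    n, k = len(points), len(centers)
    # Cost of assigning point i to center j, flattened row-major as T[i, j]
    cost = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).ravel()

    # Each point goes to exactly one center: sum_j T[i, j] == 1
    A_eq = np.zeros((n, n * k))
    for i in range(n):
        A_eq[i, i * k:(i + 1) * k] = 1.0

    # Capacity bounds per center: min_cap <= sum_i T[i, j] <= max_cap
    A_col = np.zeros((k, n * k))
    for j in range(k):
        A_col[j, j::k] = 1.0
    A_ub = np.vstack([A_col, -A_col])
    b_ub = np.concatenate([np.full(k, max_cap), np.full(k, -min_cap)])

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=np.ones(n), bounds=(0, 1))
    return res.x.reshape(n, k).argmax(axis=1)
```

For example, with capacities forcing a balanced split, a point close to a full center gets redirected to the second-closest one.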

Maximization part (cluster centers update)

In this part, we update the cluster centers while keeping the cluster assignments fixed. With our notations, the new center of cluster j is its geometric median, i.e. the minimizer over y of Σ_{i ∈ cluster j} ‖y − x_i‖.

The geometric median has been well studied in the literature (a list can be found in the Wikipedia article: Geometric median) and a couple of algorithms have been proposed, among the most efficient:

The Weiszfeld algorithm: the best-known specific approximate algorithm for finding the geometric median; the principle is described in the article by Yehuda Vardi and Cun-Hui Zhang.
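A minimal sketch of the Weiszfeld iteration (our own implementation, with a crude guard for iterates landing exactly on a sample point): each step is an average of the points re-weighted by the inverse distance to the current estimate.

```python
import numpy as np

def weiszfeld(points, n_iter=200, tol=1e-9):
    """Weiszfeld iteration: iteratively re-weighted average of the points,
    with weights 1 / distance to the current estimate."""
    y = points.mean(axis=0)            # start from the mean
    for _ in range(n_iter):
        d = np.linalg.norm(points - y, axis=1)
        if np.any(d < 1e-12):          # iterate landed on a sample point
            return y
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

On the earlier 1D example embedded in the plane, the iteration converges to the median 0 rather than the mean.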

Divide-and-conquer: using convexity in a binary search, as in Bose, Maheshwari & Morin (2003).

General convex optimization procedures, such as: Nelder-Mead (NM); Broyden-Fletcher-Goldfarb-Shanno (BFGS); conjugate gradient (CG).
The convexity of the objective implies that procedures that strictly decrease the sum of distances at each step cannot get trapped in a local optimum.

In the following computational results, we used the COBYLA algorithm, noting that we already have a good initial solution x_0: the mean of our points. This step is thus fast and precise. Example of Python code:
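A sketch of this M-step update with SciPy's COBYLA (the objective and starting point follow the text; the exact production code is not shown here, so this is only illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def geometric_median(points):
    """Geometric median of a cluster: minimize the convex sum of Euclidean
    distances with COBYLA, starting from the mean as initial guess x_0."""
    def cost(y):
        return np.linalg.norm(points - y, axis=1).sum()
    x0 = points.mean(axis=0)
    return minimize(cost, x0, method="COBYLA", tol=1e-8).x
```

Starting from the mean usually puts x_0 close to the optimum, which keeps the number of COBYLA iterations low.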

Adding operational constraints on the clusters

More constraints can be added, like forbidden areas or a maximum distance between a house and its assigned tower. Forbidden-area constraints can be added in the M step; however, the feasibility set over which we minimize then loses its convexity.

To deal with the loss of convexity, a flexible way is to use a metaheuristic algorithm, such as particle swarm optimization (PSO).

To solve the problem with the PSO framework, we have two options :

  • Add a penalty to candidate positions that violate the constraints (i.e. in a forbidden area or too far from a house)
  • Filter the swarm in order to keep only the particles that satisfy the constraints
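The penalty option can be sketched from scratch for a single cluster center. Here the forbidden zone is simplified to a disk, and all names and parameter values are our own choices, not the production settings:

```python
import numpy as np

def pso_geometric_median(points, forbidden_center, forbidden_radius,
                         n_particles=100, n_iter=200, seed=0):
    """Minimal PSO: sum-of-distances objective plus a large penalty
    when a candidate falls inside a forbidden disk."""
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0) - 1, points.max(axis=0) + 1

    def cost(y):
        c = np.linalg.norm(points - y, axis=1).sum()
        if np.linalg.norm(y - forbidden_center) < forbidden_radius:
            c += 1e6  # penalty: candidate is inside the forbidden area
        return c

    x = rng.uniform(lo, hi, size=(n_particles, 2))   # particle positions
    v = np.zeros_like(x)                             # particle velocities
    pbest, pbest_cost = x.copy(), np.array([cost(p) for p in x])
    g = pbest[pbest_cost.argmin()]                   # global best

    w, c1, c2 = 0.7, 1.5, 1.5                        # common PSO coefficients
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        costs = np.array([cost(p) for p in x])
        better = costs < pbest_cost
        pbest[better], pbest_cost[better] = x[better], costs[better]
        g = pbest[pbest_cost.argmin()]
    return g
```

When the unconstrained geometric median falls inside the forbidden disk, the returned best particle settles just outside its boundary.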

PSO is a very robust algorithm in terms of results, even if the function has many local optima. This is partly due to the high number of particles that we can set.

In our use case, setting the number of particles to 100 seemed to reach a good balance between computation time (the higher the number of particles, the slower the computation) and the quality of the optimum found. As an example, for a problem with 100 cell towers the algorithm runs in 2 seconds on a standard machine with 8 GB of RAM, and in around 10 seconds if we add more constraints like forbidden areas.

Evaluation methods

In order to evaluate the performance of the algorithm, we have set up two different methods.

Offline

In the offline evaluation, we benchmark the algorithm with:

  • historical baselines on the one hand: in our case, the chosen baselines are historical studies where cell towers were placed manually by industry experts. To this day, our algorithm has always outperformed this baseline.
  • the optimal (or near-optimal) solution obtained by discretizing the map and running a brute-force search, where we don't have the computation-time limits of the production tool

Note that other algorithms have been proposed in the literature, such as ones combining discretization of the map with resolution of CPMP (capacitated p-median problem) instances, and could be used to extend our benchmark.

Online

The goal of the online evaluation is to give the user general quality indicators when we don't have access to the optimal solution (due to computation-time constraints). These indicators can also be used for monitoring and can be based on clustering evaluation metrics such as:

Internal Clustering Validation Measures, Yanchi Liu et al.

An interesting metric for our problem is the Silhouette index. A higher Silhouette score indicates a model with better-defined clusters. It is calculated for each sample using:

  • the mean intra-cluster distance (a), and
  • the mean nearest-cluster distance (b), where (b) is the mean distance between the sample and the points of the nearest cluster that the sample is not a part of.

To sum up, a negative Silhouette score indicates a potentially bad positioning of a tower. However, this index is only indicative, because it might not be suited to our additional constraints.
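A small sketch of the per-sample computation (our own implementation of the standard definition, not production code):

```python
import numpy as np

def silhouette_scores(points, labels):
    """Per-sample silhouette: s = (b - a) / max(a, b), where a is the mean
    distance to the sample's own cluster and b the mean distance to the
    nearest other cluster. A negative s flags a possibly misplaced sample."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i, li in enumerate(labels):
        own = (labels == li) & (np.arange(len(points)) != i)
        a = d[i, own].mean() if own.any() else 0.0
        b = min(d[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores[i] = (b - a) / max(a, b)
    return scores
```

On a toy two-blob dataset, assigning a sample to the wrong blob makes its score negative, which is exactly the diagnostic used in the evaluation table below.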

Real case example

Dataset

Here is a real case example. We need to place 3 towers among 34 neighborhoods, where the minimum and maximum capacities are 8 and 12 per cell station.

Unfortunately, a location constraint is added at the last moment: for building-permit reasons, it is not possible to place a station in the red zone.

We include this forbidden area in the constraints of the geometric-median calculation. As we can see, station 3 is now outside this area, and some clusters had to be restructured to keep the total distance minimal. We can see the neighborhoods that switched stations (circles) and the slight shifts in the positions of stations 1 and 2.

Performance evaluation (without forbidden area)

Above is an extract from our performance evaluation table; we can see that the Silhouette score is negative for house n°13. This house is actually closer to cell tower 2 than to tower 3, to which it has been assigned. This is simply because station 2 has reached its maximum capacity of 12, so this house had to be redirected to the second-closest station.

The total sum of all housing-area-to-cell-tower distances is 13.021943852. Now, how far are we from the minimum obtained by brute-force testing of a large number of solutions?

Well, on this quite straightforward example, the best solution that we found by testing a few billion points is 13.021943849. With a difference of less than one millionth of a percent, we are satisfied with our results.

However, we cannot generalize this result to all possible use cases, especially those where the algorithm would get stuck in a local minimum, although these are very rare in our problems.

Conclusion

As a summary, we have proposed a flexible and scalable algorithm, inspired by the classical k-means algorithm, to solve an operational problem. Although the tool might not reach the performance of some more sophisticated state-of-the-art algorithms from the literature on large maps, it gives fast and accurate enough solutions to answer our business problem. Last but not least, the solution framework designed for our use case can be reused for many other similar problems.

Going further / research directions

The facility location problem has been well studied in the literature; we refer the reader interested in going further on the topic to the following articles:

  • Local search heuristics for k-Median and facility location problems, V. Arya et al.
  • A Hybrid Nested Partitions Algorithm for Banking Facility Location Problems, L. Xia et al.
  • Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics, V. Cohen-Addad et al.
  • A nearly linear-time approximation scheme for the Euclidean k-median problem, S. Kolliopoulos et al.
