Clustering the US population: observation-weighted k-means

In this post, I detail a form of k-means clustering in which weights are associated with individual observations.

k-means clustering has been a workhorse of machine learning for almost 60 years. It is simple, effective, and intuitive. One of the reasons I love it is that you can plot cluster assignments over time and watch it learn.

(Image from: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/)

However, it doesn’t scale particularly well. On a laptop, can you cluster 10,000 points? Sure, no problem. A million? Maybe, but it will be slow. A hundred million points? Fuhgeddaboutit!

Imagine that we wanted to cluster the U.S. population. Why? We could set k=48 and determine what the lower 48 states might look like based on current centers of population. Perhaps we want to set up k=1000 distribution centers across the country. Maybe we want to set k=10 million Starbucks locations.

The U.S. population has more than 300 million people. How would you cluster them using k-means? Ignoring parallel k-means, let’s constrain it to run on a single laptop in, say, less than 3 minutes. Ideas?

One thing you could do is sample the data; that is, run with a reduced, hopefully representative, subset of the data. Another approach is to aggregate the data, drastically reducing the number of points, and associate each aggregate point with the number of original observations it represents. In other words, weight each point.

Where might we get such a dataset? The U.S. Census Bureau. They provide a very convenient dataset consisting of the number of inhabitants for each Zip code. There are about 43,000 Zip codes in the U.S., a number we can comfortably cluster on a laptop. Imagine, then, that we have a data file consisting of Zip code, a latitude-longitude pair (which are the x-y coordinates that k-means works on), and the number of inhabitants in that Zip (the weight):

"zip","state","latitude","longitude","population"
"00601","PR",18.180103,-66.74947,19143
"00602","PR",18.363285,-67.18024,42042
...
"99929","AK",56.409507,-132.33822,2424
"99950","AK",55.875767,-131.46633,47
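
Assuming the file is saved as us_census.csv (the name used in the code later in this post), a quick way to load it and sanity-check the totals might look like this; reading "zip" as a string keeps the leading zeros:

import pandas as pd

# Load the Zip-level file shown above; read "zip" as a string so that
# leading zeros (e.g. "00601") are preserved.
data = pd.read_csv("us_census.csv", dtype={"zip": str})

print(len(data))                  # roughly 43,000 Zip codes
print(data["population"].sum())   # should be on the order of 300 million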

Implementation

Surely we can go to scikit-learn, R, or some other major machine learning library and run a weighted k-means algorithm. Unfortunately not. There are weighted k-means implementations in a few of those libraries, but they are not the sort we want: they provide weights for the features, not the observations.

That is, with feature weights we could specify that latitude should influence the centroids more than longitude. However, we want observation weights, such that Zip code 10012 (a dense Manhattan Zip) has a far greater draw on a centroid than 59011 (a vast, low-density region in Montana).

The core idea is very intuitive. Take four equally weighted points in the (x, y) plane: the centroid is (mean(x), mean(y)).

If we apply weights, say w = (13, 7, 4, 4), then the point with weight 13 has far greater gravity and draws the cluster center much closer to it. The weighted centroid is now (weighted.mean(x, w), weighted.mean(y, w)).
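
In NumPy, that is just np.average with a weights argument. A minimal sketch (the coordinates below are made up for illustration; only the weights come from the example above):

import numpy as np

# Four points in the (x, y) plane and their weights.
x = np.array([0.0, 2.0, 5.0, 9.0])
y = np.array([1.0, 4.0, 3.0, 8.0])
w = np.array([13, 7, 4, 4])

print(x.mean(), y.mean())                                  # unweighted centroid
print(np.average(x, weights=w), np.average(y, weights=w))  # weighted centroid, pulled toward the w=13 point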

The distance metric could be our usual Euclidean distance. However, as we are dealing with latitude-longitude pairs, the correct as-the-crow-flies distance metric is the Haversine distance, so we’ll use that.
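
For instance, with the haversine package (the same one imported in the code below), distances come back in kilometers; the city coordinates here are approximate and purely illustrative:

from haversine import haversine

# Great-circle ("as the crow flies") distance between two (latitude, longitude) pairs.
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)
print(haversine(new_york, los_angeles))  # roughly 3,900 km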

I was not able to find an implementation in any major library. Even stackoverflow failed me. One suggestion from there was to replicate the data points: that is, expand the weighted dataset into an equivalent unweighted one by repeating each point as many times as its weight.

Obviously, this has the problem of scaling up the data — the exact thing we wanted to fix — and this only works with small integer weights.
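
For what it’s worth, the replication workaround is a one-liner with np.repeat; the toy weights below are made up just to show the idea (and why it blows up):

import numpy as np

# Two weighted points: each row is (latitude, longitude); weights are inhabitant counts.
points = np.array([[18.180103, -66.74947],
                   [18.363285, -67.18024]])
weights = np.array([3, 2])

# Expand into an equivalent unweighted dataset: the first row 3 times, the second twice.
replicated = np.repeat(points, weights, axis=0)
print(replicated.shape)  # (5, 2) -- with real populations this would be ~300 million rows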

Instead, I found a nice “regular” k-means example from stackexchange (thanks Gareth Rees!) which I stripped down and modified to be observation-weighted:

import numpy as np
import pandas as pd
import scipy.spatial
from haversine import haversine

def distance(p1, p2):
    # Column 0 is the weight (population); columns 1-2 are the latitude-longitude pair.
    return haversine(p1[1:], p2[1:])

def cluster_centroids(data, clusters, k):
    # Weighted centroid of each cluster, using column 0 (population) as the weights.
    results = []
    for i in range(k):
        members = data[clusters == i]
        results.append(np.average(members, weights=members[:, 0], axis=0))
    return results

def observation_weighted_kmeans(data, k=None, centroids=None, steps=20):
    if centroids is None:
        # Forgy initialization method: choose k data points randomly.
        centroids = data[np.random.choice(np.arange(len(data)), k, False)]
    for _ in range(max(steps, 1)):
        # Squared distance from each centroid to each data point.
        sqdists = scipy.spatial.distance.cdist(centroids, data, lambda u, v: distance(u, v) ** 2)
        # Index of the closest centroid to each data point.
        clusters = np.argmin(sqdists, axis=0)
        new_centroids = cluster_centroids(data, clusters, k)
        if np.array_equal(new_centroids, centroids):
            break
        centroids = new_centroids
    return clusters, centroids

# setup
data = pd.read_csv("us_census.csv")
data = data[~data['state'].isin(['AK', 'HI', 'PR'])]
vals = data[['population', 'latitude', 'longitude']].values
k = 3
np.random.seed(42)  # initialization uses np.random, so seed NumPy's generator

# run it
clusters, centroids = observation_weighted_kmeans(vals, k)

# output
data['c'] = [int(c) for c in clusters]
lats = [centroids[i][1] for i in range(k)]
data['clat'] = data['c'].map(lambda x: lats[x])
longs = [centroids[i][2] for i in range(k)]
data['clong'] = data['c'].map(lambda x: longs[x])
data.to_csv("clustered_us_census.csv", index=False)

On an old MacBook Air, run time ranged from 2 seconds (k=1) to 160 seconds (k=48).

Results

Where is the center of mass of the U.S. population (i.e., k=1)? That appears to be in southeastern Missouri.

And for k=10:

Finally, if we were to “redistrict” our states based on population density, we would see a very different picture. At least five states would be divvied up among their neighbors, and CA and TX would each be split into four or five regions.