Data Science based climate zones

Dávid Guszejnov
7 min read · Jan 24, 2023


While I am originally from Europe, I have recently had the opportunity to move to a city of my choosing in the US. Since I had spent a long time living in hot places like Texas, I wanted to move to a city which has a cooler climate, more like what I grew up in.

If you want to find cities with similar climates, you might be tempted to look up a world map of climate zones. However, these zones are fairly broadly defined, so cities with quite different climates can end up in the same zone. Also, due to the hard boundary between climate zones, border cities (like Frankfurt or Philadelphia) can be more similar to cities in a neighboring climate zone than to others in their own zone. So I decided to look at the question in more detail and calculate a more fine-grained similarity metric than climate zones.

This project has three stages:

  • Preparing climate data
  • Clustering cities by climate and qualitatively comparing with the official Köppen climate classification
  • Ranking cities based on how similar their climate is to a chosen hometown

We will be using the Python packages sklearn, pandas, metpy, and geopandas. A web app version of this work is available here, while the detailed code for this project is on GitHub.

Preparing the data

As in any Data Science (DS) project, we first need to gather data, which we can get from Weatherbase.com. We have 163 features describing the different aspects of the local climate (e.g., temperature, wind, humidity) on a monthly basis. This is vastly more information than what the standard Köppen climate zone classification uses, which was established more than a hundred years ago.

Data sample; image by author

While we have a lot of climate data for each city, we are missing perhaps the most important one, the apparent temperature (i.e., how hot/cold a human would feel). We can calculate it using metpy, as it is a function of temperature, humidity and wind speed.

The first rule of any DS project is to normalize the data to put the different features on a more “equal footing”. The usual practice is to use StandardScaler or MinMaxScaler; however, both suffer from issues on this particular dataset. With min-max scaling the rare, extreme climates dominate, so most locations end up clustered in a much smaller range (e.g., the highest yearly precipitation in the world is about 12000 mm, while most cities get between 0 and 1000 mm). Standard scaling works better, but it suffers from the opposite problem for features where most cities are around the median but extremes are still important (e.g., the number of days with snow). So I decided to use a custom min-max-style scaling that centers each feature on its median and divides by the range between the 10th and 90th percentiles, which reduces both effects.

import numpy as np

# Keep only the numerical climate features
df.drop(['City', 'State', 'Country', 'Region', 'Latitude', 'Longitude'], axis=1, inplace=True)

def percentile_scaler(arr, percentile_low=10, percentile_high=90, axis=0):
    # Center on the median and scale by the 10th-90th percentile range, ignoring NaNs
    low = np.nanpercentile(arr, percentile_low, axis=axis)
    median = np.nanpercentile(arr, 50, axis=axis)
    high = np.nanpercentile(arr, percentile_high, axis=axis)
    scale = high - low
    return (arr - median) / scale

df = percentile_scaler(df)

Looking at the column headers, we can use some common sense to realize that some columns are more important than others for describing the local climate (e.g., Average Temperature vs Length of Day). We can increase their weight in further computations by rescaling them. Doing so in a DS project is always risky, as one can very easily introduce unintended biases. On the other hand, integrating domain knowledge can sometimes dramatically improve an algorithm.

df[temperature_columns] *= 5
df[precipitation_columns] *= 2
df[number_of_days_columns] *= 0.1
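The column lists used above are not defined in the snippet. One way they could be built is by matching keywords in the column names, as in the sketch below. The helper columns_containing, the toy table, and the exact column names are hypothetical; the real dataset may label its columns differently.

```python
import pandas as pd

# Toy table of already-scaled features; the names mimic a
# "<feature>, month <n>" pattern and are assumptions for illustration
df = pd.DataFrame({
    "Average Temperature, month 1": [0.2, -0.4],
    "Average Precipitation, month 1": [0.1, 0.3],
    "Average Number of Days With Snow, month 1": [1.0, 0.0],
    "Average Relative Humidity, month 1": [0.5, -0.5],
})

def columns_containing(frame, keyword):
    """Return the column names that contain `keyword` (case-insensitive)."""
    return [col for col in frame.columns if keyword.lower() in col.lower()]

temperature_columns = columns_containing(df, "Temperature")
precipitation_columns = columns_containing(df, "Precipitation")
number_of_days_columns = columns_containing(df, "Number of Days")

# Same weights as in the article; humidity is left at weight 1
df[temperature_columns] *= 5
df[precipitation_columns] *= 2
df[number_of_days_columns] *= 0.1
```

Keyword matching keeps the weighting readable, but it is worth printing the resulting lists once to make sure no column was caught by accident.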

Clustering by Climate

Since we already have a large trove of climate data, we can try to identify clusters, i.e., climate zones. When trying to cluster data, it is often useful to start out with a simple, easy to understand algorithm like KMeans. In our case this has the downside that we need to drop all columns with any missing data. This is because KMeans can’t handle NaNs and substituting in mean values would cause a lot of problems (e.g., if a desert city is missing the average number of snowy days column).
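A minimal sketch of this step on toy data: drop every column that has a missing value, then run KMeans on what is left. The random table and the cluster count of 3 are placeholders, not the values used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for the scaled climate table: 60 cities, 4 features
features = pd.DataFrame(rng.normal(size=(60, 4)), columns=["f0", "f1", "f2", "f3"])
features.loc[3, "f2"] = np.nan  # one city is missing one feature

# Drop every column (not row) that contains at least one NaN
complete = features.dropna(axis=1)

# KMeans needs a fully populated matrix
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(complete)
```

Dropping columns instead of rows keeps every city in the clustering, at the cost of ignoring features that are not available everywhere.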

Since KMeans requires the number of clusters as an input parameter and we don’t really know how many clusters we need, it is a good idea to try out different numbers. We do know that there are 30 Köppen climate zones, and of those 7 are very rare (i.e., they contain only a few outlier cities), so we should try cluster numbers within a factor of a few around 20, then select the optimal one using the elbow method.

Silhouette score, distortion and inertia for KMeans clustering using different cluster numbers; image by author

It seems the silhouette score does not change significantly in the range we probed; however, both inertia and distortion start leveling off between 20 and 30. For now let’s use 20 as our fiducial cluster number.
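The scan over cluster numbers can be sketched as follows. This uses synthetic blobs instead of the climate features, and it only tracks inertia and the silhouette score, so treat it as an outline of the procedure and not the exact code behind the figure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 groups, standing in for the climate features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

cluster_numbers = list(range(2, 9))
inertias, silhouettes = [], []
for k in cluster_numbers:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # sum of squared distances to the nearest centroid
    silhouettes.append(silhouette_score(X, model.labels_))

for k, inertia, sil in zip(cluster_numbers, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Plotting inertia against the cluster number and looking for the point where the curve flattens is the elbow method; the silhouette score serves as a second opinion.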

Now that we have a clustering, let us plot it on the world map and compare it with the Köppen climate zones that are often used to describe climates.

The Köppen climate zones compared to what we get from clustering; image by author

We don’t expect a perfect match, since the Köppen classification only uses seasonal temperature and precipitation, while we used over 100 data columns. Nevertheless, our clustering did recover quite a few climate zones from the Köppen classification.

Note that we can only have climate zones in both the Northern and Southern Hemispheres if we account for the difference in seasons between them. We can accomplish this by simply shifting all southern features by six months during the data preparation phase.
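The six-month shift could look something like the sketch below. It assumes monthly columns named "<feature>, month <n>" and a latitude series that is still available at this point (i.e., the shift happens before the Latitude column is dropped). The function shift_southern_seasons is a hypothetical helper, not the project's actual code.

```python
import pandas as pd

def shift_southern_seasons(frame, feature, latitude):
    """Swap month m with month m+6 (wrapping around) for southern-hemisphere rows."""
    months = range(1, 13)
    target_cols = [f"{feature}, month {m}" for m in months]
    # For each month, the column holding the value six months away
    source_cols = [f"{feature}, month {(m + 5) % 12 + 1}" for m in months]
    shifted = frame.copy()
    southern = latitude < 0
    shifted.loc[southern, target_cols] = frame.loc[southern, source_cols].to_numpy()
    return shifted

# Toy data: the value in "month m" is simply m, for one city per hemisphere
cols = [f"Average Temperature, month {m}" for m in range(1, 13)]
df = pd.DataFrame(
    [list(range(1, 13)), list(range(1, 13))],
    index=["Northern city", "Southern city"],
    columns=cols,
    dtype=float,
)
latitude = pd.Series([47.5, -33.9], index=df.index)

shifted = shift_southern_seasons(df, "Average Temperature", latitude)
```

After the shift, "month 1" for the southern city holds its July value, so month 1 means mid-winter for every row regardless of hemisphere.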

Where our clustering gives markedly different answers is with continental climates (e.g., not putting Northeastern US into the same cluster as Central Europe). Looking into this, we find that the main reason is that Central Europe is far from the ocean, so it has significantly lower precipitation and humidity. The latter means that even though the temperature (which is what Köppen uses) is similar in both regions, the apparent temperature (what humans feel) is higher in the Northeastern US than in Europe at similar latitudes.

Similarity/Distance between Climates

If we want to quantify the similarity/distance between the climates of different locations, we can use the Euclidean distance in our parameter space. Note that we cannot use sklearn.metrics.pairwise_distances if we keep missing values in our data (i.e., some cities are missing a few data points). Since we want to keep as much data as possible, to describe the distance between two climates we will use a dimension-averaged distance along the dimensions that are non-NaN for both locations.
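One possible reading of "dimension-averaged distance" is the root-mean-square difference over the features that both cities have. The sketch below implements that interpretation; the exact normalization in the project code may differ, and climate_distance is a name I made up.

```python
import numpy as np

def climate_distance(a, b):
    """RMS difference over the dimensions that are non-NaN in both vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return float("nan")  # no common features to compare
    diff = a[shared] - b[shared]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy vectors: only dimensions 0 and 3 are present in both
city_a = np.array([0.0, 1.0, np.nan, 2.0])
city_b = np.array([3.0, np.nan, 5.0, 6.0])

distance = climate_distance(city_a, city_b)
print(distance)  # sqrt((3**2 + 4**2) / 2)
```

Averaging over the shared dimensions keeps distances comparable between city pairs that have different numbers of features in common.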

We can quickly test this with some European cities to see if what we get makes sense, i.e., the climate in Paris should be somewhat similar to other European cities but should differ significantly from Cairo or Moscow.

City                                               Climate distance to Paris
-----------------------------------------------    -------------------------
Germany/Berlin                                     0.319866
Russia/Moscow                                      0.686279
United Kingdom/England/London                      0.30342
United States of America/Massachusetts/Boston      0.36103
United States of America/California/Los Angeles    0.528939
Austria/Vienna                                     0.18383
Italy/Rome                                         0.36599
Egypt/Cairo                                        0.865264

It is also useful to check which features are driving the differences between two cities and once again do a sanity check (i.e., LA is much warmer and less humid in the winter than Paris).

Differences between Paris and Los Angeles

Feature                                Distance
-----------------------------------    ---------
Apparent Temperature, month 11         -2.03046
Apparent Temperature, month 12         -1.84454
Apparent Temperature, month 1          -1.71809
Apparent Temperature, month 9          -1.52339
Apparent Temperature, month 2          -1.48279
Apparent Temperature, month 10         -1.46267
Apparent Temperature, month 3          -1.42241
Average Precipitation, month 2         -1.01778
Apparent Temperature, month 4          -0.991551
Apparent Temperature, month 8          -0.860856
Average Relative Humidity, month 12    0.828125
Average Relative Humidity, month 11    0.778146
Apparent Temperature, month 5          -0.748815
Average Relative Humidity, month 1     0.74344
Apparent Temperature, month 7          -0.691615
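A table like this can be produced by taking the signed difference of the two scaled feature vectors and sorting by magnitude. The sketch below assumes the sign convention "first city minus second city", which is consistent with the negative temperature entries above but is my inference; top_feature_differences and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def top_feature_differences(frame, city_a, city_b, n=5):
    """Signed per-feature difference (city_a - city_b), largest magnitude first."""
    diff = (frame.loc[city_a] - frame.loc[city_b]).dropna()
    order = diff.abs().sort_values(ascending=False).index
    return diff.loc[order].head(n)

# Toy scaled features for two cities; one value is missing
scaled = pd.DataFrame(
    {
        "Apparent Temperature, month 1": [-0.5, 1.2],
        "Average Relative Humidity, month 1": [0.6, -0.1],
        "Average Precipitation, month 1": [0.1, np.nan],
    },
    index=["City A", "City B"],
)

result = top_feature_differences(scaled, "City A", "City B")
print(result)
```

Features missing for either city drop out of the comparison, matching the NaN handling of the distance itself.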

Now we can come back to our original question: which places in the world have a climate similar to that of my hometown? Since all cities are potential hometowns, we need to calculate the pairwise distances between all cities.
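For many cities, looping over all pairs gets slow, so here is a vectorized sketch of a NaN-aware pairwise distance matrix, followed by a ranking for one hometown. It assumes the root-mean-square-over-shared-features definition of distance, which is my interpretation; the function name and toy data are made up.

```python
import numpy as np
import pandas as pd

def pairwise_climate_distances(frame):
    """NaN-aware RMS distance between every pair of rows."""
    values = frame.to_numpy(dtype=float)
    valid = ~np.isnan(values)
    v = valid.astype(float)
    filled = np.where(valid, values, 0.0)
    sq = filled ** 2
    # Sum of (a - b)^2 over shared dims, expanded as a^2 + b^2 - 2ab
    sum_sq = sq @ v.T + v @ sq.T - 2.0 * filled @ filled.T
    counts = v @ v.T  # number of shared dims per pair
    with np.errstate(invalid="ignore", divide="ignore"):
        dist = np.sqrt(np.clip(sum_sq, 0.0, None) / counts)
    return pd.DataFrame(dist, index=frame.index, columns=frame.index)

# Toy scaled features with a few gaps
scaled = pd.DataFrame(
    [[0.0, 1.0, np.nan, 2.0],
     [3.0, np.nan, 5.0, 6.0],
     [0.0, 1.0, 1.0, 2.0]],
    index=["Hometown", "City X", "City Y"],
)

distances = pairwise_climate_distances(scaled)

# Rank all other cities by similarity to the chosen hometown
ranking = distances["Hometown"].drop("Hometown").sort_values()
print(ranking)
```

Pairs with no shared features come out as NaN, so they sort to the end of any ranking.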

I grew up in Budapest, Hungary, but I live in the US now. So let’s see where in North America the weather would be most similar to that of my hometown.

Similarity of local climates in North America to that of Budapest, Hungary; image by author

As expected, the best matches are at roughly the same latitude as my hometown and are similarly far from the ocean. To my surprise, most of these are close to the Great Lakes (e.g., Michigan), which leads to higher humidity, one of the main drivers of the remaining difference.

So where else would the climate be similar? Most good matches are in Europe or North America, but there are somewhat similar ones in surprising locations such as New Zealand and New South Wales in Australia.

Similarity of local climates in the world to that of Budapest, Hungary; image by author

I have created a web app where you can look up the same maps for your hometown; check it out here! Also, the detailed code for this project is on GitHub.


Dávid Guszejnov

NASA Hubble Fellow at Harvard | Data Science Enthusiast | Astrophysics PhD | Open to Work | find me at: www.linkedin.com/in/guszejnovdavid