We Analyzed the Most Competitive Consumer Retail Locations in the US. Here Are The Results.

Saurav Modak
Published in ScrapeHero
7 min read · May 11, 2023

Ever since I was a little kid, cities have fascinated me: the big buildings, the hustle and bustle of daily life, the high standards of living, and the associated cultures and communities. When I got into data science, wrangling economic and financial data to gauge the strength of cities was a natural inclination. However, economics is not that straightforward; deep insights are often hidden in plain sight. And there are other ways to arrive at a conclusion, using alternate data sources, without resorting to complex theoretical economic calculations.

The power of alternate data can’t be overstated, and companies are actively finding ways to harness these types of data in their workflows. For us, the question was rather simple: which are the biggest and most competitive consumer retail locations in the United States? If it were a simple question like “which are the biggest cities in the US,” we would have just looked at the GDP figures and called it a day. Answering the question specifically for consumer retail, however, was not that straightforward.

That’s where POI data from ScrapeHero comes in. As we are looking specifically at consumer retail, we chose the grocery store locations bundle. It doesn’t cover all consumer retail stores, but it will work fine and give us a good idea. How will we derive the most competitive consumer retail locations, you ask? That’s where our good old friend clustering comes in.

You see, the data includes the latitude and longitude of each store, and we can represent these coordinates on X-Y Cartesian axes. Then we can cluster the points to find the biggest and most competitive retail store centers.

Enough of the talk. Let’s dive into the code and see how.

Viewing And Combining The Data

The data comes as a zip, which we can extract to find CSV files for the following grocery store chains.

  • Albertsons
  • Food Lion
  • H-E-B
  • Hy-Vee
  • Kroger
  • Publix
  • Safeway Inc
  • Trader Joes
  • Walmart
  • Whole Foods Market

There is an easy way we can combine all this data in Python.

import os

import pandas as pd

data_dir = 'grocery_data'  # path to the extracted CSV files; adjust to your setup

df_grocers = pd.DataFrame()
for csvfile in os.listdir(data_dir):
    df = pd.read_csv(os.path.join(data_dir, csvfile))
    df_grocers = pd.concat([df_grocers, df], ignore_index=True)

When we count the total number of store locations, we get 11,458. That’s a good enough number to work with.

len(df_grocers)

We can also group the stores by provider. As we can see, Walmart has by far the largest number of store locations.

df_grocers.groupby('Provider').describe()
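As an aside, a more direct way to get per-chain store counts is `value_counts`; a minimal sketch with a toy frame standing in for our data:

```python
import pandas as pd

# toy frame standing in for df_grocers: one row per store
df = pd.DataFrame({'Provider': ['Walmart', 'Walmart', 'Walmart', 'Kroger', 'Publix']})

# since each row is a store, value_counts gives per-chain store counts,
# sorted from largest to smallest
counts = df['Provider'].value_counts()
print(counts.index[0])  # Walmart
```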

Let’s look at all the columns this data has:

df_grocers.info()

This has a lot of information that we will not need, so we can select just the columns we will work with.

df_grocers = df_grocers[['Store No.', 'Name', 'Latitude', 'Longitude', 'Street', 'City', 'State', 'Zip_Code', 'County', 'Provider']]

Let’s finally look at what this data looks like:

df_grocers.head(10)

Perfect.

Plotting The Data

The best way to explore something is via visualization, and we will do just that next.

We can use the pandas plotting function to draw a scatter plot of the locations, putting longitude on the x-axis and latitude on the y-axis so that the plot is oriented like a map.

df_grocers.plot.scatter(x='Longitude', y='Latitude', alpha=0.2)

We have set a small alpha here so that overlapping points blend together, making dense areas stand out.

This looks kind of like the whole US, with some islands scattered here and there. We will plot this on an interactive map later for a better view.

The Clustering

The next obvious step is clustering; however, some pre-processing is required first. The algorithm we will use here is a density-based clustering algorithm called HDBSCAN. We will explore other algorithms and explain why this is our choice in a later blog post. However, HDBSCAN’s haversine metric cannot work with latitudes and longitudes as-is; we first need to convert them into radians.

We can write a small utility function to do that.

import numpy as np

def get_point_as_radian(data, column):
    # convert the degree value in the given column to radians
    return np.radians(data[column])

And finally, run this function on latitude and longitude columns to do the operation:

df_grocers['lat_radian'] = df_grocers.apply(get_point_as_radian, axis=1, args=('Latitude', ))
df_grocers['lon_radian'] = df_grocers.apply(get_point_as_radian, axis=1, args=('Longitude', ))

As we see, there are new columns now in the data frame for radians, and we will work on that.
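As an aside, the same conversion can also be done in one vectorized step, since `np.radians` operates element-wise on whole columns; a minimal sketch with toy coordinates:

```python
import numpy as np
import pandas as pd

# toy coordinates standing in for the grocery store locations
df = pd.DataFrame({'Latitude': [29.76, 32.78], 'Longitude': [-95.37, -96.80]})

# np.radians applies element-wise to the whole column, so no per-row apply is needed
df['lat_radian'] = np.radians(df['Latitude'])
df['lon_radian'] = np.radians(df['Longitude'])
```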

Next, we create the HDBSCAN object and use it to cluster our data. We will be using the haversine metric here and the minimum cluster size is set to 15.

import hdbscan
clusterer = hdbscan.HDBSCAN(metric='haversine', min_cluster_size=15)

Make sure you install the hdbscan package correctly by following the instructions in its documentation.

Next, we create an array of the latitude and longitude pairs in radians and fit the clusterer on it. Finally, we create a new column in the data frame and assign a cluster id to each store.

# build a list of [lat, lon] pairs in radians
radians_array = []

for index, row in df_grocers.iterrows():
    radians_array.append([row['lat_radian'], row['lon_radian']])

cluster_labels = clusterer.fit(radians_array)
df_grocers['cluster_id'] = cluster_labels.labels_

When we look at the data frame now, it should look like this. Note that a cluster id of -1 means the algorithm could not assign that store to any cluster.
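To see how much of the data ends up as noise, we can count the -1 labels directly; a minimal sketch with toy labels standing in for our cluster id column:

```python
import pandas as pd

# toy labels standing in for df_grocers['cluster_id']
labels = pd.Series([0, 0, 1, -1, 1, -1, 2])

# -1 marks points HDBSCAN could not place in any cluster
n_noise = int((labels == -1).sum())
print(n_noise)  # 2
```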

Visualizing The Clusters

We can draw a simple scatter plot as we did above to see the clusters. This time, however, we pass arguments to color the points differently based on the cluster id.

df_grocers.plot.scatter(x='Longitude', y='Latitude', c='cluster_id', cmap='RdYlBu')

This arguably looks better. However, the best way to visualize this will be if we can plot it in a real interactive map.

Plotting The Clusters On A Map

The first step we need for this is finding the total number of clusters. We can easily do that using the following:

total_clusters = len(df_grocers['cluster_id'].value_counts())
print(total_clusters)

This gives us a value of 122. Note that this count includes the -1 noise label.
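A minimal sketch with toy labels, showing that counting unique labels this way includes the noise label, while the number of actual clusters is one less:

```python
import pandas as pd

# toy labels standing in for df_grocers['cluster_id']
labels = pd.Series([0, 0, 1, -1, 2, 2])

total = labels.nunique()                        # unique labels, noise included
real_clusters = labels[labels != -1].nunique()  # actual clusters only
print(total, real_clusters)  # 4 3
```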

Next, to display the clusters on the map, we need to mark each cluster with a specific color. We can generate some random colors for that using the randomcolor package, which creates colors that are easily distinguishable from each other and useful for charts.

import randomcolor
rand_color = randomcolor.RandomColor()
colormap = rand_color.generate(count=total_clusters)

Next, we will be using folium, which is useful for displaying interactive maps in Jupyter Notebooks.

import folium

m = folium.Map(location=(37.09, -95.71))
for index, row in df_grocers.iterrows():
    # skip unclustered (noise) points
    if row['cluster_id'] == -1:
        continue

    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=7,
        popup=row['Provider'],
        fill=True,
        color=colormap[row['cluster_id']],
    ).add_to(m)
m

We are adding the provider name to the markers. Once you zoom out of the map, you will find a view something like this:

This is great, and exactly what we wanted. You can zoom into the map to see individual clusters. For example, if I zoom into Texas, I find something like this.

You can click on a marker to know which grocery chain it belongs to.

So What Are The Most Competitive Consumer Retail Locations In The US?

We can find the largest clusters to answer this. Since the largest clusters pack many locations into a small area, they point us to the densest spots where a large number of stores compete.

top_cluster_ids = df_grocers[df_grocers['cluster_id'] != -1].groupby('cluster_id').size().sort_values(ascending=False).index[:10]
top_cluster_ids

Since -1 denotes unclustered points, we filter them out explicitly before taking the ten largest clusters. We can now use a for loop to find the major city in each cluster.
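A minimal sketch of this ranking with toy labels, dropping the noise label before taking the top clusters:

```python
import pandas as pd

# toy labels: 5 noise points, then clusters of size 4, 3, and 2
labels = pd.Series([-1] * 5 + [0] * 4 + [1] * 3 + [2] * 2)

# drop noise first, then rank clusters by size (value_counts sorts descending)
sizes = labels[labels != -1].value_counts()
top = list(sizes.index[:2])
print(top)  # [0, 1]
```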

for cluster_id in top_cluster_ids:
    print(df_grocers[df_grocers['cluster_id'] == cluster_id].groupby('City').size().sort_values(ascending=False).index[0])

These are the results:

  • Atlanta
  • Miami
  • Dallas
  • Houston
  • Los Angeles
  • Seattle
  • Charlotte
  • Tampa
  • Phoenix
  • Virginia Beach

Note that this gives the major city of each of the densest clusters, hence the most competitive locations, which are not necessarily the largest.

What We Have Learnt & What Can Be Improved

So we have seen how to use clustering on store locations to find the biggest grocery retail hotspots in the US. We also now know how to plot locations on maps and visualize them.

The next thing we can do, instead of being limited to just groceries, is to get a list of all retail stores and apply the same analysis. We can also bring in parameters like population and population density, along with economic variables such as per capita income and median income, to further enrich the data. We will try that in a future post and see if we get different results.
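Joining in such external variables would be a simple pandas merge; a minimal sketch with hypothetical city-level figures (the column names and numbers here are made up purely for illustration):

```python
import pandas as pd

# hypothetical cluster summary and census-style table
clusters = pd.DataFrame({'City': ['Atlanta', 'Miami'], 'store_count': [120, 95]})
census = pd.DataFrame({'City': ['Atlanta', 'Miami'], 'population_density': [1500, 4800]})

# left-join the external figures onto the cluster summary by city
enriched = clusters.merge(census, on='City', how='left')
print(enriched.columns.tolist())  # ['City', 'store_count', 'population_density']
```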

Do you think our analysis is accurate? What changes could we have made to get more accurate or meaningful results? Let us know below.

If you’ve found this article helpful or intriguing, don’t hesitate to give it a clap! As a writer, your feedback helps me understand what resonates with my readers.

Follow ScrapeHero for more insightful content like this. Whether you’re a developer, an entrepreneur, or someone interested in web scraping, machine learning, AI, etc., ScrapeHero has compelling articles that will fascinate you.
