Counting Stuff in Geopandas

Alexandra Marshall
5 min readOct 18, 2021

--

In my last blog entry I described python’s Geopandas library with a focus on using it, and some related libraries like Folium and contextily, to create map visualizations. Pretty pretty maps are only one side of the Geopandas coin as you may remember it builds on the spatial analysis capabilities of the Shapely library, making it a powerful tool to analyze geodata.

What kind of analysis you might ask? Types that you see everyday when you use apps like Yelp, or search for businesses on Google Maps: How many restaurants are within one mile of my location? How many hospitals are in my zip code area?

screenshot of Google Maps
After working with Geopandas Google Maps just isn’t as impressive…

To demonstrate what you can do with Geopandas I’m going to walk you through a simple example using two shapefiles:

- One containing the building footprints of all the individual landmark sites in Brooklyn.
- One containing the zip code area boundaries for Brooklyn.

Both shapefiles were downloaded from NYC Open Data and some processing was done “off stage” for the purposes of this tutorial:

  • The CRS of the landmark shapefile was reset using Geopandas’ set_crs method to match that of the area zip code file
  • Both files were reduced to only contain landmarks and zip code areas in Brooklyn

To begin with we’ll want to import the necessary libraries and set our shapefiles as GeoDataFrames.

import geopandas
import matplotlib.pyplot as plt
%matplotlib inline
brooklyn_zip = geopandas.read_file(‘Data/Brookly_Zips.shp’)
brooklyn_landmarks = geopandas.read_file(‘Data/Brooklyn_Landmarks.shp’)

Then it’s always a good idea to plot your data to make sure that everything is lining up as expected:

#Do yourself a favor and check out your matplotlib color options at
# https://matplotlib.org/stable/gallery/color/named_colors.html
base = brooklyn_zip.plot(ax = ax, color=’mediumaquamarine’, edgecolor=’gainsboro’)
brooklyn_landmarks.plot(ax=ax, color=’royalblue’)
ax.set_title(“Brooklyn Landmarks”)
ax.axis(‘off’)
Map of Brooklyn zip code areas overlaid with the locations of NYC landmarks
Map output from above code

So now we have a map that shows us where landmarked properties are in Brooklyn and how that relates to the zip code areas. But what if you need some hard numbers? How can we get the count of landmarks in each zip code? The brooklyn_landmarks GeoDataFrame doesn’t contain the full addresses of the landmarks so we can’t groupby() the zip codes; instead we’re going to get that information by combining our two GDF’s.

This is where the Geopandas overlay function comes into play. Overlay allows us to create new shapes based on where the two GeoDataFrames overlap and outputs a GeoDataFrame which contains not only the new shape but also the attributes from both inputs (like a pandas join with more geometry).

Here you can see the types of methods that can be applied to the shapes in the two files:

Shows visualizations of intersections, unions, sypmetrical differences, and differences
Source: https://geopandas.org/docs/user_guide/set_operations.html

In this case, since we want to count the landmarks in the zip code areas we’ll be using the “intersection” method which returns a new shape for every area where the two GeoDataFrames overlap.

contains = geopandas.overlay(brooklyn_zip, brooklyn_landmarks, how = ‘intersection’)

Taking a quick look at contains.info() you’ll see columns from both files — the ones carried over from the zip code shapefile are in all-caps:

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ZIPCODE 214 non-null object
1 BLDGZIP 214 non-null object
2 PO_NAME 214 non-null object
3 POPULATION 214 non-null float64
4 AREA 214 non-null float64
5 STATE 214 non-null object
6 COUNTY 214 non-null object
7 ST_FIPS 214 non-null object
8 CTY_FIPS 214 non-null object
9 URL 214 non-null object
10 SHAPE_AREA 214 non-null float64
11 SHAPE_LEN 214 non-null float64
12 address 214 non-null object
13 bbl 214 non-null float64
14 block 214 non-null float64
15 borough 214 non-null object
16 date_des_d 214 non-null object
17 time_des_d 214 non-null object
18 landmark_t 214 non-null object
19 lot 214 non-null float64
20 lpc_altern 52 non-null object
21 lpc_lpnumb 214 non-null object
22 lpc_name 214 non-null object
23 lpc_site_d 213 non-null object
24 lpc_site_s 214 non-null object
25 objectid 214 non-null float64
26 objectid_1 214 non-null float64
27 shape_area 214 non-null float64
28 shape_leng 214 non-null float64
29 url_report 214 non-null object
30 geometry 214 non-null geometry
dtypes: float64(11), geometry(1), object(19)
memory usage: 52.0+ KB

With our data consolidated we can group the new GeoDataFrame by the ‘ZIPCODE’ column and use .size() to get a count of the number of landmarks in each area code:

groups = contains.groupby(‘ZIPCODE’).size()# Display the 10 zip codes with the highest number of landmarks 
print(groups.sort_values(ascending=False)[:10])
ZIPCODE
11201 43
11216 15
11215 14
11238 13
11211 11
11205 9
11226 8
11221 8
11213 7
11222 7
dtype: int64

To make this data easier to visualize we’re going to map the groups data into a new column of the brooklyn_zip GeoDataFrame called “num_landmarks”:

brooklyn_zip[‘num_landmark’] = brooklyn_zip[‘ZIPCODE’].map(groups)#some zip code areas did not contain any landmarks and we need to fill the #null value cells or they polygons will not be mappedbrooklyn_zip[‘num_landmark’] = brooklyn_zip[‘num_landmark’].fillna(0)

Finally we come to the visualization itself. The below code will generate a choropleth map color-coding each zip code area by the number of landmarks it contains. Note that .plot() is using “natural_breaks” as the classification scheme. This is a “optimal” classification scheme that minimizes within-class variance and maximizes between-class differences.*

import contextily as cx
fig, ax = plt.subplots(figsize=(10,10))
brooklyn_zip.plot(ax=ax,column = 'num_landmark',scheme='natural_breaks', k=4, vmin = 0, cmap = 'cool',
edgecolor= 'gainsboro', alpha=.85, legend = True, figsize= (10,10))
cx.add_basemap(ax, crs=brooklyn_zip.crs, source=cx.providers.Stamen.TonerLite)
ax.set_title("Brooklyn Zip Code Area Landmark Count")
ax.axis('off')
Map of Brooklyn area codes color coded with an cmap set called “cool”. It’s very 80s, ranging from lower values set to a bright turquoise to high values in bright magenta.
Sadly the ‘cool’ cmap color scheme doesn’t quite live up to its name…

This way of using overlay isn’t the only Geopandas tool that will show you how many of one object lie within another — there’s also the contains() method. Unlike the overlay function, which returns a new GeoDataFrame merging the data from both sets used, contains simply returns a Series where the valueTrue is given for each geometry that contains the input geometry.

test_landmark = brooklyn_landmarks[‘geometry’][0]
print(test_landmark)
results = brooklyn_zip[‘geometry’].contains(test_landmark)
# the index of the True value in results will give us the index of the row with the correct zip code in brooklyn_zip
# in this case row 13 holds the data on the 11201 zip code
results
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True .....

This is, as I’ve said before, only the tip of the iceberg when it comes to Geopandas spatial analysis capabilities, and I hope this blog entry has given you a base from which you can explore further on your own.

--

--