Interactive Choropleth Map of Washington DC using Folium and Python

Lindsay Allen
8 min read · Dec 1, 2019


Creating an interactive map with tooltips using Folium, GeoPandas, and Shapely

Section 1: Introduction

Geospatial analyses are a great way to identify relationships, spot trends, and make inferences that would not be apparent in tabular form. To get started, there are several open source packages that you can use in R and Python. In this post, I’ll cover how to use Folium, a popular Python package, to create an interactive choropleth map, which is a map that shades geographic regions according to a metric. To demonstrate, I’ll create maps of Washington, DC restaurants.

1.1 Obtaining Data:

There are 2 key sources of data that we need to make this map:

1. Washington, DC restaurant data: I am using DC business license data, which I obtained through Open Data DC.

2. Cartographic boundaries, or shapefiles: These allow us to set the geographic regions for our analysis. For this map, we could use common boundaries that are available through Census geographies, such as state, county, city, or zip code. However, since Washington, DC is only 68 square miles, I wanted more granular detail and chose to work with neighborhood boundaries. The neighborhood boundaries I used were created by the Zillow data team.

1.2 Choosing a Mapping Method:

Before moving to the data preparation stage, we need to decide which Folium mapping method to use. This is an important decision because it drives the data transformation and determines the map’s features. I’ve found two main methods in Folium for creating a choropleth map:

1. folium.Choropleth: this method is a great entry point to mapping with Folium and is often used by new users. One limitation is that this method can only display information from the GeoJSON in the tooltip. (See Issue #1074 in the Folium repo for a discussion on this by the amazing folks who built Folium.)

2. folium.GeoJson: this method provides more flexibility in styling and interactivity, but also requires building a colormap. This is the method I’m using because I want users to see the number of restaurants when they hover.

Section 2: Data Preparation

During the data preparation stage, we need to clean, validate, and transform the data. This section summarizes key aspects of my data preparation. If you’d like to see all of my work, feel free to look at my “dataprep” script.

2.1a Cleaning — License Categories:

My Washington, DC business license dataset has 185,688 records, covering a broad range of licenses. Food establishments account for 9.7% of the dataset, and the top two sub-categories of food establishments are restaurants (4.1%) and delicatessens (2.6%). For now I’ll include all categories of Public Health Food Establishments (N = 18,086).

2.1b Cleaning — License Periods:

Within food establishments, 98.2% of restaurants have a license length of 2 years or less. While restaurants have a notoriously short lifetime, I’d expect more restaurants to be open for longer than 2 years.

To take a closer look at this, I examine the records for Washington DC’s oldest continuously operated restaurant. Old Ebbitt Grill has been owned by Clyde’s Restaurant Group since the 1970s and located at 675 15th St NW since 1983, yet my data show the restaurant’s license is from 2018. This implies that the license issue date and start date are updated upon renewal.

To complete this section, I analyze the distribution of license end dates and missing license years. Because the data look reasonable from the mid-2000s onward, I retain food establishment licenses with a non-null license date from the past 15 years (N = 14,710).
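A filter like this can be sketched in pandas. The column names (`license_category`, `license_start`), the toy rows, and the September 30, 2019 reference date are assumptions for illustration; the real dataset's fields may differ.

```python
import pandas as pd

# Toy stand-in for the business license data (hypothetical columns and rows).
licenses = pd.DataFrame({
    "license_category": [
        "Public Health Food Establishment",
        "Public Health Food Establishment",
        "Public Health Food Establishment",
        "General Contractor",
    ],
    "license_start": ["2018-04-01", "1999-06-15", None, "2017-01-10"],
})
licenses["license_start"] = pd.to_datetime(licenses["license_start"])

# Assumed reference date; the 15-year window is counted back from here.
as_of = pd.Timestamp("2019-09-30")
cutoff = as_of - pd.DateOffset(years=15)

recent_food = licenses[
    (licenses["license_category"] == "Public Health Food Establishment")
    & licenses["license_start"].notna()
    & (licenses["license_start"] >= cutoff)
]
```

Only the first toy row survives: the second is too old, the third has no date, and the fourth is not a food establishment.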

2.2a Validations — Benchmarks:

There are two types of validations that I like to complete when working with a new dataset. The first is to tie out summary statistics of the dataset to a benchmark number. I complete this by confirming that my count of active restaurants as of September 2019 is within 2% of Destination DC’s restaurant count.

2.2b Validations — Spot Check:

The second is to spot check individual records. In addition to spot checking the data for Washington DC’s oldest restaurant, I also looked at an area in Washington DC where restaurants have recently opened or closed. While examining the data, I notice that some cancelled licenses have duplicate records that differ only in fields irrelevant to my analysis. After removing the duplicates (N = 1,604), I am able to confirm the data looks accurate for 1) L’Hommage Bistro, which closed in July 2017, and 2) La Betty, which opened in March 2019.
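One way to drop duplicates like these is to deduplicate on only the columns that matter for the analysis. This is a minimal sketch with hypothetical column names (`license_number`, `status`, `last_modified`) and made-up rows.

```python
import pandas as pd

records = pd.DataFrame({
    "license_number": ["A1", "A1", "B2"],
    "status": ["CANCELLED", "CANCELLED", "ACTIVE"],
    # A field that differs between otherwise identical rows (hypothetical).
    "last_modified": ["2017-07-01", "2017-08-15", "2019-03-01"],
})

# Compare rows on the relevant columns only, keeping the first of each group.
key_columns = ["license_number", "status"]
deduped = records.drop_duplicates(subset=key_columns, keep="first")

n_removed = len(records) - len(deduped)
```

Tracking `n_removed` makes it easy to report how many records the step dropped.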

2.3a Transformations — Encoding Neighborhood:

Because I’ve chosen to map restaurants by neighborhood, I need to determine the neighborhood of each establishment. I can do this by projecting each neighborhood onto a 2D plane and checking whether the latitude and longitude of the establishment fall within the shape.

Thankfully, Washington DC’s Master Address Repository has automatically identified latitude and longitude for 95% of the restaurant data, so I iterate through 12,327 food establishments and use another handy Python package, Shapely, to project the neighborhood shapes and test which one contains each establishment.
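The containment check can be sketched with Shapely's `Point` and `Polygon`. The two rectangles and the `find_neighborhood` helper are hypothetical; real neighborhood shapes would come from the boundary file.

```python
from shapely.geometry import Point, Polygon

# Hypothetical rectangles given as (longitude, latitude) pairs.
neighborhoods = {
    "Neighborhood A": Polygon(
        [(-77.04, 38.90), (-77.02, 38.90), (-77.02, 38.92), (-77.04, 38.92)]
    ),
    "Neighborhood B": Polygon(
        [(-77.02, 38.90), (-77.00, 38.90), (-77.00, 38.92), (-77.02, 38.92)]
    ),
}


def find_neighborhood(latitude, longitude):
    """Return the first neighborhood whose shape contains the point, else None."""
    point = Point(longitude, latitude)  # Shapely expects (x, y), i.e. (lon, lat)
    for name, shape in neighborhoods.items():
        if shape.contains(point):
            return name
    return None
```

Note that `contains` is False for points exactly on a boundary, so an establishment sitting on a shared edge would come back as `None` in this sketch.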

2.3b Transformations — GeoPandas Dataframe:

Folium’s GeoJson class works well with a GeoPandas dataframe, and creating one is surprisingly easy. After installing GeoPandas, the GeoJSON file can be opened as a GeoPandas dataframe with the “read_file” function. This essentially creates a pandas dataframe with columns derived from the properties section of the GeoJSON and a geometry column with the shape coordinates.

After loading the data, I group the DC restaurant and deli data by neighborhood and count the number of active and closed establishments. Rolling the data up to the neighborhood level lets me merge the GeoPandas dataframe with my pandas dataframe. This is similar to merging two pandas dataframes and akin to a join in SQL.
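The roll-up and merge might look like the sketch below. The column names (`neighborhood`, `status`, `name`) and the toy shapes are hypothetical, and I've assumed a left join so that neighborhoods with no establishments stay on the map with a count of zero.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# One row per establishment (made-up data).
establishments = pd.DataFrame({
    "neighborhood": ["Neighborhood A", "Neighborhood A", "Neighborhood A", "Neighborhood B"],
    "status": ["active", "active", "closed", "active"],
})

# Roll up to one row per neighborhood with one column per status.
counts = (
    establishments.groupby(["neighborhood", "status"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)
counts.columns.name = None


def rectangle(west):
    """Hypothetical 0.02-degree square used as a stand-in neighborhood shape."""
    return Polygon([(west, 38.90), (west + 0.02, 38.90), (west + 0.02, 38.92), (west, 38.92)])


neighborhoods = gpd.GeoDataFrame(
    {"name": ["Neighborhood A", "Neighborhood B", "Neighborhood C"]},
    geometry=[rectangle(-77.04), rectangle(-77.02), rectangle(-77.00)],
    crs="EPSG:4326",
)

# Left join keeps every shape, much like a SQL LEFT JOIN.
merged = neighborhoods.merge(counts, left_on="name", right_on="neighborhood", how="left")
merged[["active", "closed"]] = merged[["active", "closed"]].fillna(0).astype(int)
```

Calling `merge` on the GeoPandas dataframe (rather than on the pandas one) keeps the result a GeoPandas dataframe with its geometry column intact.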

Section 3: Creating the Map

Now that we have our GeoPandas dataframe, to create our map we’ll need to 1) initiate a map, 2) create a colormap, and 3) set the GeoJson class parameters.

3.1 Initiating the Map

This is super easy. To make my map dynamic I’ve set the “folium.Map” to initiate based on the center of my geographic regions, but you can also hardcode this to a specific latitude and longitude.

3.2 Creating a Colormap

The tricky part of the Folium GeoJson class is creating a colormap, which we use to color the geographic regions. However, when we break it down into its components, it’s not too difficult.

First, we need to use branca to create a colormap, which is a linear interpolation of two or more colors. The branca colormap can be created from RGB tuples or built-in shortcuts. In my map I’ll color from yellow to red using the “YlOrRd_09” shortcut. As its name implies, the colormap maps numbers to colors, so we need to set the endpoints of the map to the minimum and maximum of our variable.

Currently our colormap is a gradient, but we need it to be discrete so that specific intermediate colors can map to ranges of our variable. I’d like to break the gradient into 6 colors, so I need to determine the values of my variable to break at. I do this by sorting my variable in descending order, taking the values at positions 0, 4, 9, 19, 29, and 49 (zero-indexed), and forcing 0 to be the smallest value.
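The break selection can be sketched as follows. The `active` series is a made-up stand-in (60 neighborhoods with counts 1 through 60) so the picks are easy to follow.

```python
import pandas as pd

# Hypothetical counts for 60 neighborhoods: 1, 2, ..., 60.
active = pd.Series(range(1, 61))

# Rank from largest to smallest, then pick the chosen positions.
ranked = active.sort_values(ascending=False).reset_index(drop=True)
positions = [0, 4, 9, 19, 29, 49]
picked = ranked.iloc[positions].tolist()

# Force 0 to be the lowest break and put the edges in ascending order.
leg_brks = sorted(set(picked + [0]))
# leg_brks -> [0, 11, 31, 41, 51, 56, 60]
```

Seven edges define six ranges, one per color. With real data, tied values would collapse in the `set` and leave fewer ranges, so it's worth checking the length of `leg_brks`.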

Now I can use my breaks, which I’ve stored in the variable “leg_brks”, to split the gradient into 6 discrete colors. Although the variable name and intermediate numbers are not showing in my output, they will appear on the map when the color scale is added to it.

3.3 The GeoJson Class

Now that we have our map initiated, we’ll use “folium.GeoJson” to add the GeoJSON layer to the map. Within this method we’ll 1) point it to our GeoPandas dataframe, 2) use “style_function” to color the regions based on the colormap and variable value, and 3) identify the fields that we want the user to see in the tooltip when hovering.

Just like that we have a map of Washington, DC! Now we can easily see that Downtown DC has the highest number of active restaurants, and the majority of neighborhoods in DC have a much smaller number.

Moreover, since we coded this in a dynamic fashion we can wrap the code in a function and quickly plot other metrics. This allows us to see that Chinatown has more than 2.5x as many restaurants per square mile as Downtown.

Although Chinatown has more than 2.5x as many restaurants per square mile as Downtown, the two neighborhoods have seen similar rates of restaurant closings: 22.6% of Chinatown restaurants closed over the past 18 months, compared with 18.6% of Downtown restaurants.

That’s all for today! If you’d like to take a look at my code, feel free to check out my GitHub project.

Additionally, here are some improvements and next steps that I’m considering.

  1. Adding a box around the legend to make it easier to see
  2. Improving the colormap so that there is a better distribution of colors
  3. Creating an animated map to see changes over time

If you’d like to connect on these items feel free to reach out to me via LinkedIn.
