Published in Analytics Vidhya

Folium and Choropleth. Weird Names, Cool Graph.

Let’s just make something cool and call it a day.

The Vessel at Hudson Yards in Manhattan (Photo credit: Me)

I’m a data scientist. I’m a data scientist who is currently working on a time series anomaly detection tool. I’m a data scientist who also misses messing around with datasets for sh!ts and giggles. So, I found a relatively small dataset on Kaggle that features (hehe, machine learning pun) some Airbnb listings and metrics in New York City (NYC) from 2019. I had every intention of spending hours upon hours analyzing the dataset and providing excruciatingly detailed answers to the questions Kaggle posed. About an hour later, I hadn’t gotten around to answering a single question… INSTEAD, I MADE A COOL LOOKING GRAPH!

For some of you, a choropleth graph might seem a little ‘meh’. But I personally think it’s a pretty cool thing. So cool, in fact, that I decided to postpone my analysis of the Airbnb dataset and create a tutorial on how I made a choropleth graph using a Python module called Folium.

I know what you’re thinking. Choropleth and Folium? What the heck? Despite sounding like grass-type Pokémon, they are actually Python/data science-related. A choropleth is a type of graph that represents a geographical space with sub-areas shaded in proportion to some statistic, like shading in different areas of the San Francisco Bay Area based on how many people are playing Animal Crossing: New Horizons (feel free to send me some Ironwood Furniture). Folium is a Python package that lets you easily create interactive maps, and it just so happens to have built-in choropleth capabilities.

The Kaggle page asks, “what can we learn about different areas…?”. Well, let’s see what a choropleth of the average price/night of an Airbnb in different NYC neighborhoods can tell us!

Small talk is over. Let’s begin, shall we?

The Data

Download the ‘AB_NYC_2019.csv’ from this Kaggle link.

Download the GeoJSON file from BetaNYC’s website.

The Code

Let’s import our packages:

import numpy as np
import pandas as pd
import folium

Then import and take a sneak peek at our CSV file:

path = '<path_to_csv_file>'  # replace with the location of AB_NYC_2019.csv
raw_df = pd.read_csv(path)
raw_df.head()

What we want is a dataframe that gives each neighborhood’s average price, which we can easily achieve using the ‘groupby’ command:

choro_df = raw_df.copy()
choro_df = choro_df[['price','neighbourhood']].groupby('neighbourhood').mean().reset_index()
choro_df = choro_df.dropna(subset=['price'])
choro_df.rename(columns={'price':'avg_price'}, inplace=True)
choro_df.head()
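
If the groupby chain feels a bit magical, here is a minimal sketch of the same steps on a tiny DataFrame. The listings and prices in it are invented purely for illustration and are not taken from the Kaggle file:

```python
import pandas as pd

# Invented toy listings, only to illustrate the groupby step
toy_df = pd.DataFrame({
    'neighbourhood': ['Harlem', 'Harlem', 'Chelsea', 'Chelsea', 'Chelsea'],
    'price': [100, 200, 300, 150, 450],
})

# Same chain as above: average the price per neighbourhood,
# then turn the group labels back into a regular column
toy_avg = (toy_df[['price', 'neighbourhood']]
           .groupby('neighbourhood')
           .mean()
           .reset_index()
           .rename(columns={'price': 'avg_price'}))
print(toy_avg)
```

Each neighbourhood collapses to a single row holding its mean price, which is the shape the choropleth needs: one key column and one value column.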

Perfect. Now before we start graphing, we’ll have to bin our ‘avg_price’ column. Each of these bins represents a different color that will be featured on the choropleth! Also, at the time of this post (4/24/2020), Folium only accepts at most 6 bin values.

bins = list(choro_df['avg_price'].quantile([0, 0.25, 0.50, 0.75, 1.0]))
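
To make the binning a little more concrete, here is a small sketch on a made-up Series of average prices (the numbers are invented for illustration). Five quantile cut points act as the edges of four color bins:

```python
import pandas as pd

# Invented average prices, only to illustrate the binning step
toy_prices = pd.Series([50, 100, 150, 200, 250])

# Five quantile cut points (min, quartiles, max) give four color bins
toy_bins = list(toy_prices.quantile([0, 0.25, 0.50, 0.75, 1.0]))
print(toy_bins)
```

Including the 0 and 1.0 quantiles means the lowest and highest edges are the minimum and maximum of the data, so every neighborhood’s average lands inside some bin.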

Simple enough, right? Now we can start coding up the graphics portion of the tutorial. First define a simple function that creates a default Folium Map object:

def get_base_map(default_location, default_zoom_start=11):
    """
    default_location: [latitude, longitude]
    """
    base_map = folium.Map(width=800,
                          height=600,
                          location=default_location,
                          control_scale=True,
                          zoom_start=default_zoom_start)
    return base_map

We need to calculate a default location for our map to start at, so let’s simply calculate the centroid (the mean latitude and longitude) of all the Airbnb listings given to us in the dataset:

avg_lat = raw_df['latitude'].mean()
avg_long = raw_df['longitude'].mean()

Finally, we can create our default Map object and add a choropleth layer on top of it!

base_map = get_base_map(default_location=[avg_lat, avg_long])
folium.Choropleth(geo_data='<location_of_geoJSON_file>',  # replace with the path to the downloaded GeoJSON
                  name='choropleth',
                  data=choro_df,
                  columns=['neighbourhood', 'avg_price'],
                  key_on='feature.properties.neighborhood',
                  fill_opacity=0.5,
                  fill_color='Spectral',
                  line_opacity=1,
                  bins=bins,
                  nan_fill_color='white',
                  nan_fill_opacity=0,
                  legend_name='Price/Night [$]').add_to(base_map)
base_map

All the parameters are pretty straightforward except for the ‘key_on’ parameter. It takes a string in JavaScript object notation (dot-separated, starting with ‘feature’) that names the field in each geoJSON feature to match against the key column of our ‘choro_df’ DataFrame, which is the first column listed in ‘columns’. Since we want to link each neighborhood’s coordinates to a particular price, it makes sense that we use ‘feature.properties.neighborhood’ as our ‘key_on’ value.
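
Here is a rough sketch of how a ‘key_on’ path lines up with the contents of a geoJSON file. The geoJSON fragment below is hand-written for illustration, so the real file’s property names may differ, and ‘resolve_key’ is a hypothetical helper of my own, not part of Folium. It only mimics the lookup so you can check which names in your data have a matching shape:

```python
import json

# A hand-written GeoJSON fragment, invented for illustration; inspect the
# real file to confirm its property names before choosing key_on
toy_geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"neighborhood": "Harlem"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
    {"type": "Feature",
     "properties": {"neighborhood": "Chelsea"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[1, 1], [2, 1], [2, 2], [1, 1]]]}}
  ]
}
""")

def resolve_key(feature, key_on):
    """Hypothetical helper: follow a dotted path such as
    'feature.properties.neighborhood' inside one GeoJSON feature."""
    value = feature
    for part in key_on.split('.')[1:]:  # skip the leading 'feature'
        value = value[part]
    return value

# Every neighborhood name the GeoJSON knows about
geo_names = {resolve_key(f, 'feature.properties.neighborhood')
             for f in toy_geojson['features']}

# Names present in the data but absent from the GeoJSON have no shape to color
data_names = {'Harlem', 'Chelsea', 'Made-Up Heights'}
unmatched = sorted(data_names - geo_names)
print(unmatched)
```

Running the same kind of set difference against the real ‘neighbourhood’ column and the real file is a quick way to spot spelling mismatches before blaming the map.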

The map is actually interactive when you run the code in a Jupyter Notebook!

The perfectionist in me immediately realized the color scale of the legend isn’t spaced out in a very user-friendly way. I haven’t quite figured out how to change the scale of the legend, but let me know if you have in the comments (please… it’s killing me…).

But that’s it! Pretty easy, eh?

Feel free to mention other cool graphs that exist outside the realm of vanilla EDA like box-plots and histograms in the comments section… I’d love to hear about the ones you use when you dissect a dataset!

Until next time, peace!

Mikey Ling

Data Scientist at Veeva Systems
