Geographic Imaging Systems (GIS) for Python — A Brief Introduction

Visualizing hospital facility distribution across NYC

Published in

Analytics Vidhya

6 min readNov 25, 2019

Map Projections/Coordinate Reference Systems:
Map projections are a way of representing 3-dimensional data on a 2-dimensional surface — most frequently used to show mapped data on a computer screen or piece of paper rather than projected onto a globe. Latitudes and longitudes of locations on the Earth’s surface must be transformed into locations on a plane. This inevitably comes with a certain degree of distortion. For our purposes, map projections and coordinate reference systems (crs) are essentially the same thing.

If you want to properly map your data, your choice of projection may matter, because the way data is displayed can change the way it is understood. In the case of global data, this XKCD comic illustrates the ways in which your results may look completely different depending on the projection you choose.

This tutorial is more localized and will focus on NYC, for which we actually have a choice of coordinate systems. Here, we will use the WGS 84/EPSG: 4326 map, representing latitude and longitude coordinates on the WGS 84 reference ellipsoid. A reference ellipsoid is a mathematically defined surface that approximates the true shape of the Earth (a flattened sphere, more or less). The WGS 84 reference ellipsoid is also used by global positioning systems (GPS), as well as OpenStreetMap.

GIS Data Types

Vector: Composed of vertices and paths
* Points: x, y coordinates (think GPS coordinates)
* Lines: lines between points (roads, rivers, pipelines)
* Polygons: used to show boundaries/units with area (zipcode, city council district)

Raster: Composed of pixels/grid cells
* Discrete: categorical data (land vs ocean, urban vs rural)
* Continous: continuous data (elevation, temperature, oil concentrations around an oil spill)

For more information about the differences between GIS data types, see this link.

Using GeoPandas:

This tutorial will focus on vector data types, specifically polygons and points.

# Import necessary libraries
import numpy as np
import pandas as pd
import geopandas as gpd
import scipy
import shapely
import matplotlib.pyplot as plt

For GIS plotting in GeoPandas, our data must have a ‘geometry’ column, with values for the GIS data type in question — point, line, polygon, and so forth. To show an example of this, we use NYC zipcodes, courtesy of NYC Open Data.

Going in blind and using your shapefile as-is:

# Read in shapefile of NYC zipcodes
gdf = gpd.read_file('./zipcodes/ZIP_CODE_040114.shp')

Note: for this to behave, all other related files (as seen below) must be in the same folder as the .shp file-

Check out the first zipcode polygon:

gdf['geometry'][0]

BUT also note the values given for the boundaries of this polygon in the geometry column:

These values are not an easily understandable reference to geographic coordinates, like latitude and longitude would be. Instead, they are coordinates on the reference ellipsoid mentioned above — making them completely meaningless for our purposes, because we have no context for them. Take a look at the map produced with these coordinates:

fig, ax = plt.subplots(figsize = (6, 6))
gdf.plot(ax = ax);
ax.set_title('NYC Zipcodes', fontsize = 20);

Geographic shapes are recognizable, but the coordinates are gobbledygook. How do we fix this? One way might be to pair this zipcode dataset with a dataset that still uses zipcodes…but also contains latitude and longitude coordinates for reference. In this case we will use the Hospital Facility dataset, also from NYC Open Data.

hospdf = pd.read_csv('./NYC_Health___Hospitals_Facilities_-_2011.csv')
hospdf.head()

The next step is to convert this Pandas dataframe to a GeoPandas dataframe, which is essentially a Pandas dataframe with a Geometry column, but with bonus access to certain GIS functions:

hosp_gdf = gpd.GeoDataFrame(hospdf, geometry=gpd.points_from_xy(hospdf.Longitude, hospdf.Latitude))

Now, see what happens when we plot our gridded hospital locations:

hosp_gdf.plot(figsize = (9,9));
plt.title('Hospital Facilities in NYC', fontsize = 20);
plt.xlabel('Longitude', fontsize = 16);
plt.ylabel('Latitude', fontsize = 16);

We have the locations of our hospitals, but no reference geography!! Next, we try to plot our zipcodes and our hospitals on the same plot. That should be helpful, right?

fig, ax = plt.subplots(figsize = (9,9))
gdf.plot(ax = ax);
hosp_gdf.plot(ax = ax, color = 'r');
ax.set_title('Hospital Facilities in NYC', fontsize = 20);

Not so much, as it turns out. The polygon boundaries need to be in the correct reference frame — just having properly gridded points is not enough.

Now that we’ve learned better:

Let’s try this again — this time, by reading in the NYC zipcodes data converted to the EPSG 4326 coordinate reference system mentioned at the top of this article.

gdf = gpd.read_file('./zipcodes/ZIP_CODE_040114.shp').to_crs(epsg=4326)

What does this dataframe look like?

Note that the polygon boundary values here are more recognizable latitude and longitude coordinates.

What happens when we map this data now?

fig, ax = plt.subplots(figsize = (9, 9))
gdf.plot(ax = ax);
ax.set_title('NYC Zipcodes', fontsize = 20);

NYC zipcodes in EPSG 4326 coordinate system

Next, let’s try that plot with the zipcodes and the hospitals together once again:

fig, ax = plt.subplots(figsize = (9,9))
gdf.plot(ax = ax);
hosp_gdf.plot(ax = ax, color = 'r');
ax.set_title('Hospital Facilities in NYC', fontsize = 20)
ax.set_xlabel('Longitude', fontsize = 16);
ax.set_ylabel('Latitude', fontsize = 16);

Next steps:
1. Clean up both dataframes (get rid of NaNs, convert zipcodes to int datatype, etc)
2. GroupBy zipcode on the hospital dataframe to get the number of hospitals per zipcode
3. Merge it with the zipcodes dataframe on their shared zipcode/postcode column
4. Convert that to a GeoPandas dataframe

# Merge hospital data with zipcode polygons for easy plotting
gdf_merge = hosp_agg.merge(gdf, left_on = 'Postcode', right_on = 'ZIPCODE')# Create Gepandas DataFrame from merged data
gdf_geo = gpd.GeoDataFrame(gdf_merge, geometry = gdf_merge.geometry, crs = {'init':'epsg:4326', no_defs':True})

What does this look like when plotted?

# Plot # of hospitals per zip
fig, ax = plt.subplots(figsize = (13, 13))
gdf.plot(ax=ax, color='white', edgecolor='black')
gdf_geo.plot('Facility Type', legend=True, ax=ax,  # All df columns are identical
            cmap='viridis_r', legend_kwds={'fraction':.035}, # Scale legend height to plot
             vmin = 0)
ax.set_title('Hospitals per Zipcode', fontsize = 20)
ax.set_xlabel('Longitude', fontsize = 16)
ax.set_ylabel('Latitude', fontsize = 16)
plt.show()

Instead of plotting individual data points, we can map the distribution of hospitals across the different zipcodes.

TL; DR :
1. Make sure polygon data is referenced to a usable coordinate system.
2. See point #1.
3. Are you sure you’ve completed #1? If so proceed to #4.
4. Merge with the data you wish to map.
5. Produce functional map plots that help you tell a story.

Geographic Imaging Systems (GIS) for Python — A Brief Introduction

Visualizing hospital facility distribution across NYC

GIS Data Types

Using GeoPandas:

Now that we’ve learned better:

Written by Chana Tilevitz