Geo-spatial analysis with Python

The quest

Understanding co-ordinates and projections

We all use GPS tools, so we’re familiar with latitude and longitude: co-ordinates that describe our position on Earth relative to the Equator and the Greenwich Meridian. In South Africa our latitude is always a negative number (we are south of the Equator, therefore negative), and our longitude is always a positive number (we are east of the Greenwich Meridian, therefore positive).
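As a quick sketch of how those signs show up in code (the Cape Town co-ordinates below are approximate, and note that shapely expects x, y, i.e. longitude, latitude):

```python
from shapely.geometry import Point

# Approximate location of Cape Town: Point(x, y) means Point(longitude, latitude)
cape_town = Point(18.42, -33.93)   # longitude positive (east), latitude negative (south)
print(cape_town.x, cape_town.y)    # 18.42 -33.93
```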

Understanding shapefiles

Map data typically comes in shapefile format, so it’s an important concept to understand. Wikipedia describes a shapefile as a format which can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes. The format is composed of several different files which together hold all the required information to draw these objects. Three of them are mandatory, without which you will not be able to work: *.shp, *.shx, and *.dbf. There is also a fourth file, *.prj, which contains the projection and co-ordinate reference system information. If this file is not supplied you will still be able to draw the shapes of states and counties, but you won’t actually know exactly where in the world they are located. The moment you want to combine one dataset with another you may end up superimposing Italy over Texas!
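As a small illustration (the path below is just a placeholder), you can sanity-check that all the companion files are present before trying to work with a shapefile:

```python
from pathlib import Path

shp = Path("path/to/police_districts.shp")   # placeholder path

# .shp, .shx and .dbf are mandatory; .prj carries the CRS information
for ext in (".shp", ".shx", ".dbf", ".prj"):
    companion = shp.with_suffix(ext)
    print(f"{companion.name}: {'found' if companion.exists() else 'MISSING'}")
```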

The magic of GeoPandas

I can’t overstate how much I love this library! Working with it is very much like working in regular Pandas, so it’s easy to get started, and the functionality included is comprehensive, clear and simple to use.

Reading in the data — gpd.read_file()
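A minimal sketch (the path is a placeholder for one of the challenge’s shapefiles):

```python
import geopandas as gpd

# Point read_file() at the .shp; the companion .shx/.dbf/.prj files are picked up automatically
gdf = gpd.read_file("path/to/police_districts.shp")
gdf.head()
```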

Notice the geometry column: in this case it contains polygons so we know we are dealing with shapes, rather than points or lines.

Checking the co-ordinate reference system — .crs

This attribute tells us which co-ordinate reference system the GeoDataFrame is using — for more details see http://www.spatialreference.org/ref/epsg/4269/.
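Continuing with the gdf from the sketch above:

```python
# The CRS is read from the .prj file when the shapefile is loaded
print(gdf.crs)   # e.g. EPSG:4269
```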

Re-projecting to a new CRS — .to_crs()

This is a big part of how we will be solving our challenge, and it’s a one-liner!
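Again continuing with the same gdf, a sketch of re-projecting everything to NAD83 (epsg:4269):

```python
# to_crs() returns a new, re-projected GeoDataFrame; the original is left untouched
gdf_4269 = gdf.to_crs(epsg=4269)
print(gdf_4269.crs)   # EPSG:4269
```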

Convert a Pandas df to a GeoPandas gdf — GeoDataFrame()

Some of our data, such as the use of force files, comes with x,y co-ordinates or latitude/longitude columns, but in *.csv format. We need to convert the resulting Pandas dataframe to a GeoPandas GeoDataFrame:
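A sketch of that conversion, assuming the CSV has latitude/longitude columns (the file path and column names here are placeholders):

```python
import pandas as pd
import geopandas as gpd

df = pd.read_csv("path/to/uof.csv")   # placeholder path

# Build point geometries from the longitude/latitude columns;
# the values are assumed to be plain lat/long (WGS84) here
uof_gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["LONGITUDE"], df["LATITUDE"]),
    crs="EPSG:4326",
)
```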

Geocoding with Google’s API — .tools.geocode()

To do geocoding with the Google Maps API, you’ll need to register — but it’s a very simple process (and free!): you can head here to read more: https://cloud.google.com/maps-platform/. Once you’ve obtained your key, geocoding is a simple one-liner:
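A sketch of that one-liner; the addresses are just examples, the key is a placeholder, and the provider/api_key arguments are passed straight through to geopy’s Google geocoder:

```python
import geopandas as gpd

addresses = ["1600 Pennsylvania Ave NW, Washington, DC",
             "600 E Trade St, Charlotte, NC"]   # example addresses

# Returns a GeoDataFrame with a point geometry plus the matched address
geocoded = gpd.tools.geocode(addresses, provider="googlev3", api_key="MY_GOOGLE_KEY")
print(geocoded)
```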

Measuring distance between points — .distance()

The distance between points in 2 GeoSeries:
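A tiny sketch; note that .distance() matches the two GeoSeries up by index and returns results in the units of their CRS (degrees for lat/long, metres or feet for projected systems):

```python
import geopandas as gpd
from shapely.geometry import Point

gs1 = gpd.GeoSeries([Point(0, 0), Point(1, 1)])
gs2 = gpd.GeoSeries([Point(0, 3), Point(4, 5)])

# Element-wise distances: 3.0 and 5.0
print(gs1.distance(gs2))
```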

And finally, let’s not forget actually plotting the maps! — .plot()

Drawing a basic map is deceptively simple…
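A sketch of a basic plot, overlaying the uof_gdf points built earlier on top of a polygon layer (the tract shapefile path is a placeholder):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

tracts = gpd.read_file("path/to/census_tracts.shp")   # placeholder path

# Plot the polygons first, then draw the points on the same axes
ax = tracts.plot(figsize=(10, 10), color="whitesmoke", edgecolor="grey")
uof_gdf.to_crs(tracts.crs).plot(ax=ax, color="red", markersize=5)
plt.show()
```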

Solving the geospatial challenges

I picked 2 key aspects to address:

What are we dealing with?

The folder structure for our data was a bit of a monster. On Day 1 I spent most of the day just opening folders, wondering what the numbers meant, wondering why folders had different numbers of files in them, and then laboriously opening files to inspect their contents. The questions I wanted to answer (a rough inventory sketch follows the lists below):

  • What county / state combinations does each Dept file represent?
  • Has a use of force (uof) file been provided or not?
  • In the shapefile data, are any of the 4 key files missing?
  • In the ACS data, were the files provided consistently for the same county?

Working through those questions turned up a few issues:

  • We have only been given use of force data for 3 of our 6 police departments: Travis, Mecklenburg and Dallas.
  • The shapefile data from Travis is missing a “prj” file, so we’re going to need to identify the correct co-ordinate reference system before we can even think of analysing or plotting that data further.
  • And for both Travis and Worcester we have troublesome ACS data — notice that for Travis all entries are “48453”, except for rsa (race-sex-age) where we’ve been given the data for FIPS county “48217”. Similarly for Worcester the rsa data is for a different county.
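Here is a rough sketch of the kind of inventory loop I mean, assuming the department folders follow a Dept_* naming pattern and sit under one root folder (both assumptions, so adjust to your layout):

```python
from collections import Counter
from pathlib import Path

data_root = Path("path/to/data-science-for-good")   # placeholder root folder

# Count file extensions per department folder: this makes missing uof CSVs
# and missing shapefile components (such as .prj) jump out immediately
for dept in sorted(data_root.glob("Dept_*")):
    extensions = Counter(p.suffix.lower() for p in dept.rglob("*") if p.is_file())
    print(dept.name, dict(extensions))
```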

How do we standardize the CRS?

The American Community Survey (ACS) data is broken down into “census tracts” but the corresponding shapefiles were not supplied. Fortunately Darren Sholes supplied us with the link to get them. These shapefiles consistently use CRS = epsg:4269, so I decided to use that as my base standard: all other data would need to be converted to this CRS. For the data with no CRS information, my approach to find the best fit was:

  1. Sample 3 random addresses from the UOF file
  2. Geocode them using the Google API
  3. Convert these to our standard CRS = epsg:4269
  4. Then, for each candidate CRS projection, treat the original UOF co-ordinates as being in that projection and convert them to our standard epsg:4269
  5. And finally compare the distance between the resulting sets of points
  6. The combination with the least distance between them represents the “best fit” CRS to use for our data (see the sketch below)
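Here is a rough sketch of those six steps, with plenty of assumptions flagged: the X/Y and ADDRESS column names, the candidate EPSG codes and the API key are all placeholders, and uof_gdf is the GeoDataFrame built from the use of force CSV earlier:

```python
import geopandas as gpd

candidate_crs = [2277, 2278, 3081]            # placeholder candidates (Texas projections)
sample = uof_gdf.sample(3, random_state=42)   # 1. three random UOF records

# 2. + 3. geocode the sampled addresses and convert them to the standard CRS
geocoded = gpd.tools.geocode(
    sample["ADDRESS"], provider="googlev3", api_key="MY_GOOGLE_KEY"
).to_crs(epsg=4269)

best_epsg, best_error = None, float("inf")
for epsg in candidate_crs:
    # 4. treat the raw x/y values as if they were in this candidate CRS,
    #    then re-project them to the standard epsg:4269
    candidate_pts = gpd.GeoSeries(
        gpd.points_from_xy(sample["X"], sample["Y"]), crs=epsg
    ).to_crs(epsg=4269)

    # 5. total distance between the re-projected points and the geocoded "truth"
    error = candidate_pts.distance(geocoded.geometry.reset_index(drop=True)).sum()

    # 6. keep the candidate with the smallest total distance
    if error < best_error:
        best_epsg, best_error = epsg, error

print("Best fit CRS: EPSG:", best_epsg)
```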

And so finally I can do the big reveal…

Just look at the pretty maps that can be made once the data has been standardized! Now, choosing colour schemes: that’s at least another 2 weeks’ worth of work right there :)
