Data science with location data: Visualizations and tools, part 1

Published in That Data Guy · 5 min read · Oct 28, 2019

Data visualization is a key part of the data science process, but geographic data can seem a bit out of reach because it doesn’t fit a simple X-Y axis configuration. This post is meant to help newbies and intermediate data-lovers get more out of their location data sets, with less effort and some useful guidance.

Geographic data and Location data — what are they?

Geographic data and location data are often used interchangeably. Both terms refer to data sets in which some of the features are geographic, or location-based. Geographic data (also known as geospatial data, or geodata for short) can take many forms, but generally speaking it is a dataset with a geographic component: a zip/postal code, an address, a country name, latitude and longitude, or another location indicator. It can also come in many formats (databases, tabular data, shapefiles, raster data and more), and it is worth reading more about if you’re a complete novice in the area (take a look at Esri’s definitions, as well as this easier-to-digest intro on GIS Geography).

In this article, we will focus on some basics, mostly around visualizing geodata with Python, and offer some code snippets for you to use as reference.

Python Libraries

When working with geodata in Python, a few libraries can come in handy, depending on how deep you want to go and how complex your data is. The main ones are pandas, geopandas, plotly, shapely and fiona. In this article we will mostly use pandas and plotly; the coming articles will bring in other libraries as well.

Sample Data

As you might know, Kaggle is a great source for data sets (I also highly recommend taking on a few competitions there for practice), and we’ll use an earthquake data set I found there.

We start by loading the data, and converting the dates to the right format using pandas to_datetime:

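import pandas as pd

# Load the earthquake data set downloaded from Kaggle
df = pd.read_csv('database.csv')
# Parse the 'Date' column into timezone-aware (UTC) datetimes
df['date'] = pd.to_datetime(df['Date'], utc=True)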

We won’t spend much time exploring the dataframe itself but, as usual, we can take a quick look at its contents with these very useful calls:

print(df.describe())
print(df.head(5))
df.info()  # info() prints its summary itself and returns None

We can see that the data has two columns that describe its geographic content: Latitude and Longitude. These give each data point’s location on the globe as angles, measured in degrees (more info here). I’ll just point out that location can be described in different ways (lat/long is just a very common one) and in different projections, and that’s an entire topic on its own!
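Since latitude and longitude are angles, a quick sanity check on their ranges can catch swapped or corrupted columns before plotting. Here is a minimal sketch; the `coords` dataframe holds made-up values purely for illustration (it is not the Kaggle data), and the last row is deliberately invalid:

```python
import pandas as pd

# Made-up coordinates for illustration only; the last row is deliberately invalid
coords = pd.DataFrame({
    "Latitude": [35.0, -20.5, 95.0],
    "Longitude": [140.0, -70.25, 10.0],
})

# Latitude must lie within [-90, 90] degrees and longitude within [-180, 180]
valid = coords["Latitude"].between(-90, 90) & coords["Longitude"].between(-180, 180)

# Show the rows that fall outside the valid ranges
print(coords[~valid])
```

The same two `between` checks should work on the real dataframe, assuming its coordinate columns are numeric.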

Visualizing

After installing plotly, if it wasn’t already installed on your machine, you can start playing around with some basic but helpful visuals. The first we will inspect is scatter_geo.

scatter_geo

This is a plotly function that (as you might’ve guessed) plots geographic data as a scatter plot, meaning markers on a map. It is a very simple, high-level way of plotting map data: it lacks some detail, but it provides a good general understanding and is often sufficient when you don’t need to go deep into geographical details.

The code above generates a scatter plot on a map, and opens it in a browser window (you can modify that by changing pio.renderers.default), while allowing you to interact with it — that’s the best part about plotly, if you ask me. See for yourself:

Neat, eh? We can clearly see where the tectonic plate boundaries are! We can make this visual handier by adding a dimension: let’s add color to highlight the more intense earthquakes.

import plotly.express as px
import plotly.io as pio
fig = px.scatter_geo(df, lat="Latitude", lon="Longitude", color="Magnitude")
pio.show(fig)

We can filter our dataframe to show only earthquakes with a magnitude of 7.5 or higher, and add the date they happened to the hover data:

# Keep only earthquakes with magnitude >= 7.5 and show the date on hover
strong = df[df.Magnitude >= 7.5]
fig = px.scatter_geo(strong, lat="Latitude", lon="Longitude", color="Magnitude", hover_data=["date"])
pio.show(fig)

It’s important to note that all of the sample plots above are much more customizable. Typing px.scatter_geo? in IPython or Jupyter (or help(px.scatter_geo) in plain Python) gives you the full help on the function and how to use it, or you can simply check out plotly’s full documentation page, which is quite helpful! If you want to add another dimension, you can use size: that way, another numeric measure can be presented in the same geoscatter plot, represented by the size of the marker.

graph_objects

All of the above examples used one of plotly’s recent additions, known as plotly-express. However, plotly is a much richer library than that: it is actually built on graph objects, which can still be used to customize your figures even further.

A graph object contains the data and the layout, or looks, of the figure. Here’s how we could create the same figure as before, with a graph object:

Clearly, this is a bit more code than the one-liners used by plotly-express, which might be a downside, but it also means the figure is much more customizable. For example, by breaking the markers down into a dictionary of attributes, we can be much more specific about how we would like them visualized. We can also change the projection to show an earth-like figure, edit the colors of land and ocean, and more…

A different view of a geoscatter

Another advantage of using graph_objects is that we can use subplots! It's a very basic requirement, but at the time of writing this post px still did not allow it, and it's important to know about, as it's a very fundamental and helpful feature.

To create subplots with varying map setups, we need to use plotly’s make_subplots and add each map (or, in fact, any of the plotly graph objects) as a separate trace, defining each one on its own (or better yet, if they're closely related, with some iterative process).

Apparently, while very common, tectonic border quakes are usually lower in magnitude!

Clearly, using graph objects gives much more freedom, but makes the code a bit harder to write. With Google on your side, and community forums and Stack Overflow to help, you should not worry about it too much. If you have an idea, it can probably be achieved; just do the research!

Next article, we will look into another useful plotly plotting technique for geodata, using the mapbox engine, and more is coming soon. If you have any questions or requests, please feel free to shoot me a message or comment :)