In the digital age we never lack for information, but often have trouble finding useful knowledge. The answer is nuance.
Context makes data truly useful.
Add the Pareto Principle: of all the information available on a topic, about 80% of the “useful knowledge” is contained in about 20% of it.
Some context matters more than others.
You will often be reminded of this. We don’t mind petting a salamander but steer clear of alligators, even though their shape and function are largely the same.
It’s why Facebook knows what ads to show you, even though it doesn’t have access to all the thoughts in your brain (yet).
They don’t know “all that’s on your mind”. But analyzing your GPS data, search history (they own Instagram, remember) and yes, listening through your phone’s microphone is enough to understand “what’s on your mind enough that you bother talking and writing about it”.
That’s why you see strangely specific ads for something you talked about with your friend but never searched for. This is context at work.
And as any real estate agent would tell you, there’s one type of context that usually trumps all others: location.
Everything physical (that we know of, at least) exists somewhere in space. If the object or topic you’re studying can be geotagged and recorded on a map, it’s often surprisingly helpful to do so.
Google Maps gives you a handy estimate of traffic on your route: It’s able to do this by tracking (hopefully anonymized) GPS data from phones it’s installed on.
By comparing the distance between consecutive location update “pings” with the time elapsed between them, it can discern in real time when cars are backed up on a highway they normally travel at 60 mph.
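To make the ping arithmetic concrete, here is a minimal sketch of the idea, not Google's actual method. It assumes each ping is a (latitude, longitude, unix_seconds) tuple, uses the haversine formula for distance, and flags congestion when the average speed drops below half the usual speed. The function names, the ping format and the 0.5 threshold are all assumptions made for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000   # mean Earth radius in metres
MPS_TO_MPH = 2.236936        # metres per second -> miles per hour


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def speed_mph(ping_a, ping_b):
    """Average speed between two (lat, lon, unix_seconds) pings."""
    (lat1, lon1, t1), (lat2, lon2, t2) = ping_a, ping_b
    elapsed = t2 - t1
    if elapsed <= 0:
        raise ValueError("pings must be in increasing time order")
    return haversine_m(lat1, lon1, lat2, lon2) / elapsed * MPS_TO_MPH


def is_congested(pings, usual_mph=60.0, threshold=0.5):
    """True if the mean speed over consecutive pings is well below usual.

    usual_mph and threshold are illustrative assumptions.
    """
    speeds = [speed_mph(a, b) for a, b in zip(pings, pings[1:])]
    if not speeds:
        return False
    return sum(speeds) / len(speeds) < usual_mph * threshold
```

Calling is_congested on a short list of pings from one phone gives a rough yes/no answer. A real traffic layer would presumably aggregate many phones per road segment before drawing anything red.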
However, working with GPS data seems like an awfully big field to get into, technically speaking. Fortunately, other people have already done most of the heavy lifting regarding coordinate-friendly data structures and packaging:
ArcGIS is one of the more robust geospatial analysis toolkits out there. It has a host of features, including data storage, server integration, built-in deep learning and smooth visualizations.
It is, without a doubt, the next clear step on our journey to understand why hexagons are so prevalent in nature, and why they’re so useful for geographic data modeling.
If that seems like a stretch, here’s a module diagram straight from their tutorial page. Tessellation is just another word for “efficient Euclidean data compression”.
They have various pricing plans for their full desktop & business suites, but the easiest way forward is to create a free Developer account to use with the API and Python modules.
They have a number of setup options, but I elected for the clean Anaconda command line installation:
conda install -c esri arcgis
If you’re like me and enjoy living dangerously, make sure to install all new packages directly in your main environment.
As is tradition for messing around with a new data science toolkit, open up a Jupyter notebook and install a good-looking theme, as only the most terrifying of beasts code against a bright white background.
Data in Frame
Their tutorial is quite comprehensive. We’ll start by importing the base library and instantiating a GIS object using the account we made earlier.
from arcgis.gis import GIS
gis = GIS("https://www.arcgis.com", "username", "password")
The gis object will be the gateway to access most of the module’s content. We can load a default satellite overview by calling
You can click to drag and zoom in the cell output, which is neat. We could draw all sorts of coordinates and lines on this map. But it doesn’t really feel like data science without a Dataframe (or at least some matrix serving the same purpose).
They’ve actually developed a Spatially Enabled Dataframe that extends Pandas. Tremendous.
# get your imports in order:
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
Already we find ourselves in familiar territory: assigning gis.content.get('item_string') to a variable lets us grab data from publicly hosted map-layer items and store it in memory, much as regular Pandas Dataframes do. They directly extend Pandas, so go ahead and try out the regular operations; slicing and subselection work much the same way.
Rasters are essentially a cell grid where data can be stored in each cell.
A honeycomb is a raster, and fishnet stockings are 3-dimensional raster manifolds. Math is consistent, even when it’s not.
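To see what “a cell grid where data can be stored in each cell” means in practice, here is a toy sketch in plain Python. It is not how ArcGIS stores rasters internally. The Raster class, its method names and the corner-plus-cell-size layout are assumptions made for illustration, with the 30-unit cell size borrowed from Landsat’s 30-meter resolution.

```python
class Raster:
    """A toy square-cell raster anchored at its top-left corner."""

    def __init__(self, x_min, y_max, cell_size, n_rows, n_cols, fill=0):
        self.x_min = x_min          # x coordinate of the left edge
        self.y_max = y_max          # y coordinate of the top edge
        self.cell_size = cell_size  # side length of one cell
        self.n_rows = n_rows
        self.n_cols = n_cols
        # One value per cell, stored row by row from the top down.
        self.cells = [[fill] * n_cols for _ in range(n_rows)]

    def cell_index(self, x, y):
        """Return the (row, col) of the cell containing point (x, y)."""
        col = int((x - self.x_min) // self.cell_size)
        row = int((self.y_max - y) // self.cell_size)
        if not (0 <= row < self.n_rows and 0 <= col < self.n_cols):
            raise IndexError("point lies outside the raster")
        return row, col

    def set(self, x, y, value):
        """Store a value in the cell containing (x, y)."""
        row, col = self.cell_index(x, y)
        self.cells[row][col] = value

    def get(self, x, y):
        """Read the value of the cell containing (x, y)."""
        row, col = self.cell_index(x, y)
        return self.cells[row][col]


# A 3 x 4 grid of 30-unit cells covering x in [0, 120) and y in (0, 90].
grid = Raster(x_min=0, y_max=90, cell_size=30, n_rows=3, n_cols=4)
grid.set(45, 80, 7)   # any point in the same cell now reads back 7
```

Every point that falls inside the same 30 by 30 square shares one stored value, which is the whole trade-off of a raster: coarse position in exchange for a compact, regular grid.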
Let’s draw an image from the NASA-USGS Landsat-8 satellite and unpack the first layer:
l8_lyr.properties['description'] tells us that it’s an image analysis service covering most of the world’s landmass at 30-meter resolution, and can be used for purposes such as vegetation, agriculture and boundary studies.
ArcGIS provides raster functions that can be computed directly on these map layers, so we don’t have to pull the imagery into local memory first, saving a ton of memory.
Let’s apply some to the nyc map we made earlier. The loop below operates on the cell output since we call
We essentially just loop through the list of raster functions (agriculture, bathymetric, infrared, etc.) offered by the first Landsat image layer, adding each one to the map and then removing it.
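The cycling described above could look roughly like the sketch below. It is a guess at the shape of the loop, not the tutorial’s exact code. The function name cycle_raster_functions and its parameters are hypothetical. It assumes the map widget exposes add_layer and remove_layers (as the 1.x MapView does; newer releases may differ) and that the layer lists its functions under properties.rasterFunctionInfos. The function that applies a named raster function to a layer is passed in as apply_fn, so the sketch itself does not import arcgis.

```python
import time


def cycle_raster_functions(map_widget, layer, apply_fn, pause_seconds=2.0):
    """Show each raster function the layer advertises, one at a time.

    map_widget: object with add_layer(layer) and remove_layers([layer])
                (assumed to match the 1.x MapView interface)
    layer:      imagery layer whose properties.rasterFunctionInfos is a
                list of entries with a "name" key
    apply_fn:   callable (layer, function_name) -> renderable layer
    """
    shown = []
    for info in layer.properties.rasterFunctionInfos:
        name = info["name"]
        rendered = apply_fn(layer, name)      # layer with the function applied
        map_widget.add_layer(rendered)        # draw it in the cell output
        time.sleep(pause_seconds)             # leave it visible for a moment
        map_widget.remove_layers([rendered])  # clear it before the next one
        shown.append(name)
    return shown
```

With arcgis installed, the call would presumably be cycle_raster_functions(nyc, l8_lyr, apply), with apply imported from arcgis.raster.functions. Passing the function in also makes the loop easy to dry-run against stand-in objects.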
The Earth Opens Up Before You
You’re now working with ArcGIS data. There are plenty of other things to do with these tools: you could feed ocean buoy data into a live wave-height map or conduct deep learning on Yellowstone wolf movements.
To search for more data, you may want to check out the API’s search functions:
# search for feature layers relating to california
my_content = gis.content.search(query='california', item_type='Feature Layer')
These objects can generally be explored in the same way as above. Next time we’ll look into hexagonal rasters.