De-clutter Your Maps With Simple Outlier Filtering

Published in SafeGraph · 3 min read · Dec 11, 2019

Outliers in datasets are controversial. Are they bad or are they trying to tell you something? Is it an artifact (also sometimes called “contamination”) or an important anomaly? Should you ignore them or make them the target of your focus?

Outliers are particularly evident (and problematic) when visualized on maps. In this blog post, I show real-world examples of how outliers complicate map visualization, describe a simple method to filter outliers, and provide a Google CoLab Jupyter Notebook with sample SafeGraph data so you can replicate the results and play with the data yourself.

Sometimes extreme values (“outliers”) are distracting on a map

What the heck is going on with that one data point in the southern tip of California?

I want to visualize how visitor dwell time varies by geographic location. I can use SafeGraph Core Places to get the centroid location of each McDonald’s, and SafeGraph Patterns data (specifically the column median_dwell) to get the median visit duration at each point-of-interest (POI).

Clearly, there is something unusual about a single POI in the southern tip of California: its median_dwell is 600x larger than that of the average McDonald's and 6x larger than the next-largest median_dwell.

The goal of my map is to illustrate the variance in dwell times for all McDonald’s across the state. Whether we think the extreme POI is interesting or an artifact, this one extreme value is very distracting on my map because it is off-scale from the rest of the dataset.
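To see why a single extreme value is so distracting, here is a minimal sketch of what a linear color scale does to the data. The dwell times below are made up for illustration (they are not SafeGraph data), and min_max_scale is a hypothetical helper, not part of any SafeGraph tooling.

```python
def min_max_scale(values):
    """Linearly rescale values to [0, 1], the way a linear color scale does."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up dwell times in minutes; the last value plays the extreme POI.
dwell_minutes = [8, 10, 12, 15, 20, 6000]

with_extreme = min_max_scale(dwell_minutes)
without_extreme = min_max_scale(dwell_minutes[:-1])

# Fraction of the color range covered by the five ordinary POIs in each case.
spread_with = max(with_extreme[:-1]) - min(with_extreme[:-1])
spread_without = max(without_extreme) - min(without_extreme)

print(f"with extreme value:    {spread_with:.4f}")
print(f"without extreme value: {spread_without:.4f}")
```

With the extreme value included, the ordinary POIs are squeezed into a tiny sliver of the color range, so they all look the same on the map. Without it, they span the full range.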

We could cherry-pick it and drop this one POI from the dataset, but instead, I prefer a simple and generic method for filtering out extreme values.

A Simple Method to Filter Outliers in SafeGraph Patterns

There is no one correct way to “handle” outliers, and sometimes outliers shouldn’t be “handled” at all.

Warning! You should always look at your data; do not blindly filter “outliers” without consideration.

John Tukey’s IQR method for defining outliers.

Here we implement the interquartile range (IQR) based method as originally formulated by John Tukey [1] [2]. Note: This is one of the most common definitions for “whiskers” on a box-and-whiskers plot.

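Typically the Upper Extreme is defined as Upper_Quartile + k * IQR and the Lower Extreme as Lower_Quartile - k * IQR, where IQR = Upper_Quartile - Lower_Quartile. Any value above the Upper Extreme or below the Lower Extreme is treated as an outlier.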

The standard is to use k = 1.5. But for the purposes of visualization, you should use whatever works.
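The rule above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from the notebook: tukey_fences and filter_outliers are hypothetical names, the dwell times are made up, and the quartiles come from the standard library's statistics.quantiles with method="inclusive" (quartile conventions differ slightly between tools, so other libraries may give slightly different fences).

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower_extreme, upper_extreme) from Tukey's IQR rule."""
    lower_q, _median, upper_q = quantiles(values, n=4, method="inclusive")
    iqr = upper_q - lower_q
    return lower_q - k * iqr, upper_q + k * iqr


def filter_outliers(values, k=1.5):
    """Keep only the values that fall within the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if lower <= v <= upper]


# Made-up dwell times in minutes; the last value plays the extreme POI.
dwell_minutes = [8, 10, 12, 15, 20, 6000]

print(tukey_fences(dwell_minutes))
print(filter_outliers(dwell_minutes))
```

Raising k widens the fences and keeps more of the extreme values, which is the knob to turn if the standard k = 1.5 removes more than you want for a given map.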

See the full post on Google CoLab

To read the full post, see the results, and play with the code yourself, open the notebook linked below.

Google Colaboratory

colab.research.google.com

Curious how this works? See the full post on Google CoLab.

Want to see a different question answered with SafeGraph data?

Please send your ideas, feedback, bug discoveries, and suggestions to datastories@safegraph.com, or leave them as a comment below.