Ten famous and useful demo data sets

Hjalmar Gislason
Aug 29, 2017 · 4 min read

When working with data preparation, analytics and visualization software, a few data sets show up again and again. A short while ago, I asked people on Twitter about their favorite demo data sets.

Below you will find a selection of the data sets people mentioned, along with some of my own favorites, each with a short description and a link to the raw data. In selecting the data sets from the larger list of suggestions I aimed for variety in size, format and topic.

If you think I’ve missed something important or have additional suggestions, feel free to leave a comment below.

Titanic passengers

A list of all passengers on the Titanic with several attributes, including name, class, age, gender and whether or not they survived.

Titanic data visualization from Zachary M. Jones

World Population Prospects

Probably the most comprehensive global population database. Available at various levels of detail and focus. The “Population by age groups” files are in many ways the most interesting.

Population pyramid from UN DESA Population Division

Gapminder’s GDP per capita vs. child mortality

The data behind Hans Rosling’s first and most famous TED talk in 2006. The animated scatterplot with Dr. Rosling’s sport-game-like commentary changed data storytelling from a nerd pastime to a world-changing endeavor.

Hans Rosling presenting at TED

Significant Earthquakes Database

A database of large and/or particularly destructive earthquakes over the course of the last ~4000 years.

Map of significant earthquakes from NOAA

Enron emails

The text and meta-data for about 500,000 emails sent by roughly 150 people, mostly senior management, at Enron before its collapse. The emails were made public by the Federal Energy Regulatory Commission during its investigation.

  • Size: 500,000 emails of various lengths with a handful of meta-data fields each.
  • Useful for: Unstructured data. Network analysis.
  • URL: https://www.cs.cmu.edu/~./enron/

Enron email graph by Nathaniel Wroblewski
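For the network-analysis use case, a minimal sketch of turning raw messages into a weighted sender-to-recipient edge list might look like the following. It assumes each message is available as plain RFC 822-style text with From, To and Cc headers; the two sample messages and the example.com addresses are made up for illustration and are not taken from the corpus, and `sender_recipient_edges` is a hypothetical helper name.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def sender_recipient_edges(raw_messages):
    """Count how many messages each sender addressed to each recipient."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses() returns (display name, address) pairs; keep the address.
        senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = [
            addr
            for _, addr in getaddresses(
                msg.get_all("To", []) + msg.get_all("Cc", [])
            )
        ]
        for sender in senders:
            for recipient in recipients:
                edges[(sender, recipient)] += 1
    return edges


# Hypothetical sample messages, for illustration only.
sample_messages = [
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Q3 numbers\n"
    "\n"
    "Body text.\n",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Re: Q3 numbers\n"
    "\n"
    "More text.\n",
]

edges = sender_recipient_edges(sample_messages)
for (sender, recipient), count in sorted(edges.items()):
    print(sender, "->", recipient, count)
```

The resulting counts can be fed to any graph library as edge weights; whether Bcc or forwarded headers should count as edges is a modeling choice this sketch leaves out.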

Anscombe’s quartet

Four data sets of 11 data points each that have nearly identical descriptive statistics, yet are very different when visualized. Constructed by statistician Francis Anscombe. Also, check out the recent Datasaurus Dozen, which pays homage to Anscombe’s (and Alberto Cairo’s) work in an interesting way.

Anscombe’s Quartet as visualized by Justin Matejka and George Fitzmaurice
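To see the "nearly identical descriptive statistics" for yourself, here is a small sketch that computes the means, sample variances and Pearson correlation for each of the four sets using only the Python standard library. The values are the quartet as commonly published; `pearson` and `summarize` are helper names chosen for this example.

```python
from statistics import mean, variance


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Sets I-III share the same x values; set IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the descriptive statistics the quartet is famous for."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows should print almost the same numbers, which is exactly why plotting the points matters.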

Les Misérables

A weighted network of character co-appearances in Victor Hugo’s novel “Les Misérables”. Nodes represent characters, edges connect pairs of characters that appear in the same chapter of the book, and the value on each edge is the number of such chapter co-appearances.

Les Misérables co-occurrence as visualized by Mike Bostock
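The edge weights described above can be sketched in a few lines: for every chapter, each pair of characters appearing in it gets its count bumped by one. The character names are from the novel, but the chapter groupings below are a made-up toy input rather than the real data set, and `co_appearance_weights` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def co_appearance_weights(chapters):
    """Map each unordered character pair to the number of chapters they share."""
    weights = Counter()
    for characters in chapters:
        # Sort and de-duplicate so each pair is counted once per chapter
        # and always stored in the same (alphabetical) order.
        for pair in combinations(sorted(set(characters)), 2):
            weights[pair] += 1
    return weights


# Toy input: hypothetical chapter casts, for illustration only.
toy_chapters = [
    ["Valjean", "Myriel"],
    ["Valjean", "Javert", "Fantine"],
    ["Valjean", "Javert"],
]

weights = co_appearance_weights(toy_chapters)
for (a, b), weight in sorted(weights.items()):
    print(a, "-", b, weight)
```

Each key is a node pair and each value is the edge weight, so the result maps directly onto a weighted, undirected graph.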

TLC Trip Data

Details about millions of taxi & limousine trips in NYC from 2009 through 2016 (updated annually).

Napoleon’s March

The data behind one of the most famous data visualizations of all time, Charles Joseph Minard’s map of Napoleon’s disastrous march into Russia in 1812.

Qlik’s Drew Clarke recreating Minard’s map in Qlik Sense on stage

TPC benchmarks (Transaction Processing Performance Council)

Possibly the most “boring” of this bunch, the TPC benchmarks are not an actual data set, but a set of tools for generating realistic transactional data, data structures, volumes and database loads. Frequently used for benchmarking enterprise software.

TPC-H data structure from Neo4j


Written by Hjalmar Gislason

Adventures in data. Founder and CEO of GRID (@grid_hq). Proud data nerd. Curious about everything. Founder of 5 software companies.
