Looking for Datasets?

Published in Digital Dialects, Jan 31, 2016

My teammate, Yangdong Ye, and I have been scouring the internet for large, multi-dimensional datasets that we can use in our advanced visualization technologies class at Northeastern’s IDV program. There are so many datasets available - big, small, dimension rich, binary. Picking one that meets our criteria is very hard. Here, I present a curated list of five datasets to suit what will hopefully be a range of interests and pursuits.

1. NASA’s Outgassing Database

NASA has a really nice open “data catalog”. Some of the offerings of tabular data, data lenses and maps that may catch your attention are meteorite landings, fireball and bolide reports, a list of NASA facilities, and possibly, the responses of Candida albicans cultures to spaceflight. There is a list of extra-vehicular activities conducted by the US and Russia, and a fairly extensive dataset of NASA patents.

After spending a few hours combing through the data catalog, I kept returning to one particularly challenging dataset, the Outgassing Database. Outgassing, very simply defined, is the release of gases or vapors from some materials when exposed to a certain set of conditions.

Its 12,859 rows present highly descriptive names (a challenge to parse), the total mass loss and the curing time. It even contains information pertaining to the part of a spacecraft in which a material is used. It’s not a very big dataset, and the key is a little hard to identify for someone new to the dataset. To start you off, here is clue one and clue two. Once you really start to get into the processes in which outgassing occurs and the implications of outgassing, magic starts to happen.

2. The Tate Collection

The Moon is Up by Millais / Tate Britain / Image Source

If you’re an art enthusiast, and all this talk about data oftentimes makes you want to run out of the room, maybe this dataset will change your mind. For those who don’t know about it, Tate is a “family of four art galleries”: Tate Modern, Tate Britain, Tate Liverpool and Tate St Ives.

This dataset, available through GitHub, documents over 70,000 artworks owned by Tate (independently or jointly with the National Galleries of Scotland), and metadata for 3,500 associated artists.

3. International Trade & Shipping / 2014

Most of you will already know about enigma.io, an absolutely fantastic dataset resource that lets you browse public datasets from all over the world very intuitively. It is one of my go-to sources for the best kinds of data. You may need to create an account, but it’s worth it. You can search for datasets by topics, by institutions, by industry, what have you.

In the recent past, I had the pleasure of getting to know more about the supply chain and logistics industry — one of the perks of being a designer. It is quite a complicated industry and critically important, too. From a designer’s perspective, it offers us a huge solution space to work with. From an information designer’s perspective, it offers us a huge dataset to work with. Presenting, from the curated collections of enigma, a summary of Bill(s) of Lading in 2014.

With over 36 million rows covering vessel names, ports of unlading, estimated arrival dates, foreign ports of lading, places of receipt, ports of destination, consignee names, shipper names, container numbers, item descriptions and item piece counts, it is as multidimensional as it is engrossing. The entry that caught my eye is the 82 hourglasses that travelled in a container for 20 days from Yantian, China to Tacoma, Washington through a shipper in Guangdong, China for a promotions company in Minnesota on a vessel named “Lions Gate Bridge”.
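With tens of millions of rows, it can help to stream the file rather than load it all at once. Here is a minimal sketch of that idea using only Python's standard library. The column names and the two "Example" rows are invented stand-ins, not the actual schema of the enigma export; only the first row echoes the hourglass entry described above.

```python
import csv
import io
from collections import Counter

# Stand-in for the real export. The column names and the "Example" rows
# are assumptions made for illustration, not the dataset's actual schema.
SAMPLE = """vessel_name,foreign_port_of_lading,port_of_unlading,piece_count
Lions Gate Bridge,Yantian,Tacoma,82
Example Vessel A,Yantian,Tacoma,10
Example Vessel B,Example Port,Tacoma,5
"""

def pieces_by_route(lines):
    """Read rows one at a time and total piece counts per (lading, unlading) pair."""
    totals = Counter()
    for row in csv.DictReader(lines):
        route = (row["foreign_port_of_lading"], row["port_of_unlading"])
        totals[route] += int(row["piece_count"])
    return totals

totals = pieces_by_route(io.StringIO(SAMPLE))
for route, pieces in totals.most_common():
    print(route, pieces)
```

Because `csv.DictReader` pulls one line at a time, the same function should work on a file handle opened with `open(path, newline="")`, keeping memory use roughly constant regardless of file size.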

4. The Crunchbase API

The Crunchbase API will let you explore the relationships between organizations, people and products. Where are the larger VCs investing, and which startups are fireworks? Know all here.

They have three distinct data products, each catering to different access needs and with different license requirements. Examples of the data that you can request access to are here (companies), here (investments) and here (acquisitions). [I’m heavily advocating for this one in my advanced visualization technology class, but let’s see where we go with that].

5. SDSS DR12

The SDSS map of the Universe. Each dot is a galaxy; the color bar shows the local density. Image Source

Back to a fascination with the universe, I would like to introduce you to the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). It’s essentially an initiative that is trying to map the universe. This dataset, comprising imaging statistics and spectroscopic data, was updated in July 2014, and features a total of 469,053,874 objects, including 260,562,744 stars, 208,478,448 galaxies, and 12,682 unknown entities. Spectroscopy is the study of the interaction between matter and electromagnetic radiation as a function of wavelength.
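As a quick sanity check on the figures quoted above, the three class counts should add up to the stated total. A small sketch, using only the numbers from this post (the variable names are mine):

```python
# Object counts as quoted in the post for SDSS DR12.
total_objects = 469_053_874
counts = {
    "stars": 260_562_744,
    "galaxies": 208_478_448,
    "unknown": 12_682,
}

# The per-class counts should sum to the overall total.
class_sum = sum(counts.values())
print(class_sum == total_objects)

# Share of each class, as a percentage of all objects.
shares = {name: 100 * n / total_objects for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.3f}%")
```

The same pattern is a cheap first test to run against any dataset summary before you start visualizing it.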

6. Build your own dataset

Worst case scenario (or best case, for that matter), build your own dataset. For example, let’s take this image:

Identify which region on earth it depicts, and get the topographic data for each location. Map the color data from each point in the image to the topographic data, perhaps population data, perhaps architectural data, perhaps lat/long data. Building your own dataset has its own set of challenges, but it is indeed a (retrospectively) fun process with a high (process) learning curve.
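To make the idea concrete, here is a rough sketch of the pixel-to-location step. It assumes the image is a simple north-up picture whose pixels map linearly onto latitude and longitude, which will not hold for every projection. The bounding box and the tiny 2x2 "image" are hypothetical placeholders; in practice you would read the pixels with an imaging library and look up the real extent of your region.

```python
# Hypothetical bounding box (in degrees) for the region the image depicts.
NORTH, SOUTH, WEST, EAST = 43.0, 42.0, -72.0, -70.0

def pixel_to_latlon(row, col, height, width):
    """Map a pixel's centre to (lat, lon), assuming a linear north-up image."""
    lat = NORTH - (row + 0.5) / height * (NORTH - SOUTH)
    lon = WEST + (col + 0.5) / width * (EAST - WEST)
    return lat, lon

def build_rows(pixels):
    """Yield one record per pixel: its location plus its RGB color."""
    height, width = len(pixels), len(pixels[0])
    for r, line in enumerate(pixels):
        for c, (red, green, blue) in enumerate(line):
            lat, lon = pixel_to_latlon(r, c, height, width)
            yield {"lat": lat, "lon": lon, "r": red, "g": green, "b": blue}

# Placeholder 2x2 image as nested lists of (R, G, B) tuples.
sample = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (100, 110, 120)],
]

rows = list(build_rows(sample))
for record in rows:
    print(record)
```

Each record can then be joined on `lat` and `lon` with whatever else you collect for the same locations, such as elevation or population.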