7 Things Every Data Worker Should Know About Metadata

A Gentle Introduction to Metadata, or Data About Data.

To paraphrase my colleague Mandy Chessell, in a world of increasingly data-driven business and semi-autonomous AI algorithms participating in decision making, you’d better have a great understanding of your data — where it’s coming from, how it’s being used, and what downstream products are being generated from it. This is what metadata is all about.

If you work with data on a regular basis, you definitely know the deep pain of finding, understanding and preparing data sets for your work. How many times have you turned up lackluster data search results on Google? Or found out that a colleague has a great data set you could have really used…months ago? Or spent hours figuring out whether some data was right for your project, and whether you could legally make use of it?

Here is some nasty metadata. We can do better than this.

In all these cases, if we as a community did a better job of publishing metadata, the productivity gains would be tremendous. Data science is the shiny new thing in the tech world, and companies are rushing to adopt a “data-driven” mindset. But behind all the glitz and glamour, metadata is the foundation upon which data-driven companies are built. And as you construct that foundation, you’ll end up painfully re-inventing the wheel if you ignore a generation of prior work.

This article serves as a gentle introduction to metadata, or data about data, and to the current trends in making metadata an integral part of the business process.

Library card catalogs — an early example of metadata (photo by Sanwal Deen on Unsplash)

What is metadata exactly?

Metadata is, simply put, data about data. Or even better, “Metadata is information about the content that provides structure, context, and meaning,” says Rachel Lovinger. It makes it easier for others to discover, assess and use a data set. Discovery and use are self-explanatory, but what is assessment? This term covers all the information that might be useful in determining whether one can or should use the data. It answers questions such as: Does the data come from a trustworthy source? Is it timely? How often is it updated? Is the license compatible with my use case? All the things you might ask a person before using a data set can instead be codified in metadata.
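To make "assessment" a bit more tangible, here is a minimal sketch of how those questions could be answered by code once they are written down as metadata. Everything in it is an assumption made for illustration: the field names (`publisher`, `license`, `last_updated`, `update_frequency_days`), the `assess` helper, and the example publisher are hypothetical and do not come from any metadata standard.

```python
from datetime import date

# Hypothetical metadata record; the field names are illustrative only.
metadata = {
    "publisher": "Example City Water Department",
    "license": "CC-BY-4.0",
    "last_updated": date(2019, 1, 15),
    "update_frequency_days": 30,
}


def assess(record, trusted_publishers, allowed_licenses, today):
    """Return a list of reasons not to use the data (empty if none were found)."""
    problems = []
    # Does the data come from a trustworthy source?
    if record.get("publisher") not in trusted_publishers:
        problems.append("publisher is not on the trusted list")
    # Is the license compatible with my use case?
    if record.get("license") not in allowed_licenses:
        problems.append("license is not compatible with this use case")
    # Is it timely? How often is it updated?
    last_updated = record.get("last_updated")
    frequency = record.get("update_frequency_days")
    if last_updated is None or frequency is None:
        problems.append("update history is not documented")
    elif (today - last_updated).days > frequency:
        problems.append("data is overdue for an update")
    return problems


problems = assess(
    metadata,
    trusted_publishers={"Example City Water Department"},
    allowed_licenses={"CC-BY-4.0", "CC0-1.0"},
    today=date(2019, 2, 1),
)
print(problems or "no objections found")
```

The point is not the particular checks, but that each question you would otherwise ask a colleague becomes a field that a program can read.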

One nice way to know what goes into metadata is to remember the 7 “W”s, which Jason does a good job of describing in this article:

The W7 Ontological Model of Metadata

  1. What are the data’s properties (e.g. schema, size, etc.)?
  2. When does the data apply (what time period does it cover)?
  3. Where does the data apply geographically?
  4. Who created it?
  5. How was it created (survey, IoT device readings, web sales extract, etc.)?
  6. Which instrument or software package created it?
  7. Why was it created (to monitor water levels, track product inventory, etc.)?

If you just think of the metadata process as answering these 7 basic questions, you’re less likely to be intimidated by the often complex standards in use for encoding metadata as JSON, HTML, XML, YAML, or another software-friendly format.
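As a rough sketch of that idea, the seven questions can be modelled as seven optional fields and then serialized to JSON. The `W7Record` class, its `unanswered` helper, and the water-level data set described below are all hypothetical; they are not part of the W7 model or of any published standard.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class W7Record:
    """One field per W7 question; None means 'not answered yet'."""
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None
    who: Optional[str] = None
    how: Optional[str] = None
    which: Optional[str] = None
    why: Optional[str] = None

    def unanswered(self):
        """Names of the questions that still have no answer."""
        return [name for name, value in asdict(self).items() if value is None]


# A made-up data set, used purely for illustration.
record = W7Record(
    what="CSV, 3 columns (timestamp, station_id, level_cm)",
    when="2018-01-01/2018-12-31",
    where="Example River basin",
    who="Example City Water Department",
    how="IoT device readings",
    why="monitor water levels",
)

print(json.dumps(asdict(record), indent=2))
print("Still unanswered:", record.unanswered())
```

Swapping `json.dumps` for a YAML or XML writer changes the encoding but not the seven answers, which is the part that matters.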

An Example: Describing Income from the US Census American Community Survey

Let’s make things a little more concrete and look at an example. One of the most valuable free and open data sets available (at least in the US) is the US Census Bureau’s American Community Survey, or ACS for short. When most people hear “Census data,” they are actually thinking of the decennial census. That’s the classic work of the Census Bureau. The decennial census is taken every 10 years (hence the name) and is used for voting and redistricting concerns, such as how many seats in Congress a state will get based on its population.

However, it doesn’t just count people. At some point a congressional committee decided that, since we’re spending so much money going to every house in the country, we should get some more information about people so we can do a better job of setting policy and allocating funds to federal programs. So the Census began to ask a lot of questions about housing, income, race and more. This information was so valuable that people decided we couldn’t wait 10 years to adjust policy based on it (that just wasn’t agile), so in the 2000s the ACS was born. It’s not a full count, as it only covers a sample of the country, but it gives a yearly perspective on US demographics. If you’re interested, you can learn all about the ACS data collection methodology here.

Now that we know a little bit about the ACS, let’s take a look at a single “data set”: 2017 income. I put data set in quotes because I don’t know if this is actually a table in a database or a file on a disk somewhere deep in some federal server room. But I do know I can bookmark a URL for it and download a CSV file. Here’s that URL:

https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_1YR/S1901/0100000US.05000.003

By scanning the data for a minute or two I can derive the following metadata using the W7 system discussed above.

Wasn’t that easy and straightforward? With this background, it should be relatively easy to understand the official ACS metadata record available from data.gov, in either its JSON or HTML format.

Google Dataset Search

To re-phrase the purpose of metadata, we use this terse, structured summary of the data to get better search results.

When you think of it that way, it makes you wonder what the king of search, Google, thinks about metadata for data sets. Well, the answer is that they’ve been thinking about it for a long time, and they’ve finally launched Google Dataset Search, a search engine specifically tailored to the data community, which Natasha Noy introduces nicely in this blog post. The idea is that instead of building “data portals” that require custom search engines (like data.gov, which uses CKAN, or NASA’s Socrata-powered one), web pages can just embed structured micro-data in their HTML, and search engines can crawl this HTML-friendly metadata so that people can find data the same way they find other information on the Internet.

Screenshot of Google Dataset Search

The metadata format used by Google is called schema.org Dataset, and it has been developed in an open, standards-based process overseen by the World Wide Web Consortium and in partnership with companies such as Microsoft, Pinterest, and Yandex.

As you can see from that page, even though the list of things you could say about your data is long (almost all attributes are optional), you’re still just trying to provide answers to those 7 basic questions above. And the more thorough a job you do, the more fully your data will be utilized.
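To show how the seven questions might map onto schema.org Dataset, here is a minimal sketch that builds a JSON-LD record and wraps it in a script tag for embedding in a web page. The property names (`name`, `temporalCoverage`, `spatialCoverage`, `creator`, `measurementTechnique`, `license`, `distribution`) are real schema.org terms, but every value, the example.org URL, and the `to_script_tag` helper are placeholders I made up for a hypothetical data set. Treat the W-to-property mapping in the comments as one reasonable reading, not an official one.

```python
import json

# All values are placeholders describing a hypothetical data set.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example River water levels, 2018",                    # what
    "description": "Water level readings collected to monitor "
                   "flood risk along the Example River.",          # why
    "temporalCoverage": "2018-01-01/2018-12-31",                   # when
    "spatialCoverage": {"@type": "Place",
                        "name": "Example River basin"},            # where
    "creator": {"@type": "Organization",
                "name": "Example City Water Department"},          # who
    "measurementTechnique": "IoT device readings",                 # how
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/water-levels-2018.csv",
    }],
}


def to_script_tag(metadata):
    """Wrap a metadata dict in a JSON-LD script tag for an HTML page."""
    body = json.dumps(metadata, indent=2)
    # Escape "<" so a stray "</script>" inside a value cannot end the tag early.
    body = body.replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(to_script_tag(dataset))
```

Pasting the printed tag into the HTML of the page that describes the data set is the general idea behind the embedded markup described above. Which properties a given search engine actually requires is something to check against its own documentation.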

Remember, don’t sweat it if you don’t have all this metadata to supply. It’s more important to at least be consistent if you can’t be comprehensive. Your users will thank you, and you’ll be the new office data hero.