Data Discovery in 2020

A brief survey of data catalogs from Big Tech data teams

In May of 2013, as Uber’s new data analyst, I was excited to jump into their data and deliver some juicy analyses. But, in an experience I think many can relate to, I ran into some seemingly simple questions that slowed things down out of the gate:

What data is there and what does it look like?

What joins to what, using what conditions?

When was it updated?

How do I know if there’s anything wrong before I use it?

Who can I ask for help?

The common pattern for answering these questions is called a data catalog: a web UI that lets you search for and learn about data. They’re often described as a Wikipedia for data, but instead of relying entirely on hand-written articles, they automatically crawl and update the most basic information (table name, schema, sample rows, etc.), and human knowledge then augments that further (table and column descriptions).
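To make the crawl-then-augment idea concrete, here is a minimal sketch in Python. It is not how any of the catalogs below are implemented; it assumes a SQLite database as a stand-in for a warehouse, and the function names (`crawl_catalog`, `annotate`) and the `trips` table are hypothetical, made up for illustration.

```python
import sqlite3


def crawl_catalog(conn, sample_size=3):
    """Automatically collect table name, schema, and sample rows."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema = [{"name": col[1], "type": col[2]} for col in columns]
        rows = conn.execute(
            f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
        ).fetchall()
        catalog[table] = {
            "schema": schema,
            "sample_rows": rows,
            # Human-written fields start empty; the crawler never fills them.
            "description": None,
            "column_descriptions": {},
        }
    return catalog


def annotate(catalog, table, description, column_descriptions=None):
    """Layer human knowledge on top of the crawled metadata."""
    entry = catalog[table]
    entry["description"] = description
    entry["column_descriptions"].update(column_descriptions or {})


# Hypothetical example data standing in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, "SF", 12.5), (2, "NYC", 30.0)],
)

catalog = crawl_catalog(conn)
annotate(catalog, "trips", "One row per completed trip.", {"fare": "Fare in USD."})
print(catalog["trips"])
```

The point of the split is that the crawler can be re-run on a schedule to refresh the machine-readable parts, while the descriptions people write live alongside them. A real catalog would also need to merge a fresh crawl into existing entries so those descriptions are not lost, which this sketch does not attempt.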

This gives your team a great starting point for any work with the data: just type into a search box, and boom, you can see the datasets that are relevant, then drill into one to see the schema, sample rows, sometimes freshness and lineage…all useful stuff for any data analyst, scientist, or engineer.
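As a rough illustration of that search box, here is a small Python sketch that ranks datasets by keyword matches. The catalogs in this post typically lean on a search engine like Elasticsearch for this; the `Dataset` class, the scoring weights, and the example datasets below are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str = ""
    columns: list = field(default_factory=list)


def search(datasets, query):
    """Return dataset names ranked by a naive keyword score."""
    terms = query.lower().split()
    scored = []
    for ds in datasets:
        name = ds.name.lower()
        description = ds.description.lower()
        columns = [col.lower() for col in ds.columns]
        score = 0
        for term in terms:
            # Arbitrary weights: table name > column name > description.
            if term in name:
                score += 3
            if any(term in col for col in columns):
                score += 2
            if term in description:
                score += 1
        if score:
            scored.append((score, ds.name))
    # Highest score first; break ties alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical catalog entries.
datasets = [
    Dataset("trips", "One row per completed trip", ["id", "city", "fare"]),
    Dataset("drivers", "Driver profiles", ["id", "city", "signup_date"]),
    Dataset("fare_audit", "Adjustments to trip fares", ["trip_id", "delta"]),
]

print(search(datasets, "fare"))
print(search(datasets, "city"))
```

Even a toy ranking like this shows why the human-written descriptions matter: they give the search more text to match than table and column names alone.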

A few years later, I had the opportunity to lead the team solving these problems within Uber, and to meet some of my peers doing the same at their own companies. It was hugely rewarding to see the level of adoption and appreciation these tools generated: thousands of weekly active users and a consistently positive NPS. There was clear demand from data people across the company for better tools. I wish we had started on it sooner.

Even after leaving Uber to work on my own project, I couldn’t keep what I saw as the eventual rise of the data catalog out of my mind, so I’ve been keeping brief notes on them whenever I see an announcement or a talk.

I started writing this post when a friend suggested that I clean up and share those notes in case they were useful to others.

If you’re on a data team today and don’t have a catalog, I strongly encourage you to look at these solutions, and try one in production if you can. Years ago there were fewer options for doing this, but with more open source offerings these days (e.g. DataHub and Amundsen), the barrier is coming down considerably, and your team’s new data analyst will really appreciate you :)

Note: enterprise-oriented products like Alation and Collibra have been around for a while, but to keep this post manageable in size, I’m only looking at the internal tools being built within “Big Tech” companies. Would you be interested in a similar survey for the commercial products? Let me know in the comments.

On to the Catalogs…

Features matrix, to the best of my knowledge. See bottom of post if you have a correction.

Lyft Amundsen

Announcement https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

Talk/slides https://www.youtube.com/watch?v=EOCYw0yf63k

Github https://github.com/lyft/amundsen

License Apache-2.0

Ingests from Airflow DAG using the Amundsen Python library, or a standalone script using the same library

Stack Python, Node, (Neo4j OR Apache Atlas), Elasticsearch, Flask, React

Features Sample rows, column stats/profiling, data date range, frequent users, owners, tags, table and column descriptions.

Key persons Mark Grover, Alagappan Sethuraman, Daniel Won, Jin Chang, Tamika Tannis, Tao Feng, Matt Spiel, Shenghu Yang, and Philippe Mizrahi.

LinkedIn DataHub (the artist formerly known as WhereHows)

Github https://github.com/linkedin/datahub

License Apache-2.0

Talk/slides Strata 2019

Ingests from LDAP, Hive, Kafka, MySQL, DB2, Firebird, SQL Server, Oracle, Postgres, SQLite, ODBC — all via API calls or Kafka events

Stack Pegasus, Ember, Play

Features Schemas, lineage, ownership, descriptions/notes, dataset lifecycles

Key persons Mars Lan, Seyi Adebajo, Shirshanka Das

Netflix Metacat+BigDataPortal

Announcement Making Big Data Discoverable and Meaningful at Netflix

Github https://github.com/Netflix/metacat

Talk/slides https://www.youtube.com/watch?v=nMyuCdqzpZc

License Apache-2.0

Ingests from Spark, Presto, Pig, Hive, MySQL, Redshift, S3

Stack Java/Groovy, Elasticsearch

Features Table TTL, Hive+S3 partition and table metrics, manual custom metadata (column default values, table validation rules), space/cost metrics, org/subject-matter tagging, table-change-notifications / auditing-events (for Keystone).

Key persons Ajoy Majumdar, Zhen Li

Uber Databook

Announcement https://eng.uber.com/databook/

Github N/A

License N/A

Talk/slides Strata 2019

Ingests from Hive, Presto, Vertica, MySQL, Cassandra, (bonus: Multi-datacenter support)

Stack Java, Elasticsearch, Dropwizard, React, Redux, D3

Features Schemas, descriptions, lineage, freshness, nested data support (e.g. JSON), usage stats (via Queryparser), ingestion anomaly detection at the partition level.

Key persons Kaan Onuk, Lauren Tindal, Luyao Li

Airbnb Dataportal

Announcement https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770

Github N/A

License N/A

Ingests from Unknown

Stack Neo4j, MySQL, Elasticsearch, Flask, React, Redux

Features Schemas, owners, top users, descriptions

Key persons Chris Williams, Eli Brumbaugh, Jeff Feng, John Bodley, and Michelle Thomas

Updates and corrections

Are you a contributor to one of these platforms and have an update or correction to suggest? Let me know in the comments, or email me (kyle at torodata.io) and I’ll fix it right away.

Thanks for reading! 👋

CEO, co-founder at Toro. Former PM of the metadata tools team, and co-founder of the ab-testing team, at Uber. Loves coffee, does not love pie charts.