Data Discovery in 2020
A brief survey of data catalogs from Big Tech data teams
In May of 2013, as Uber’s new data analyst, I was excited to jump into their data and deliver some juicy analyses. But, in an experience I think many can relate to, I ran into some seemingly simple questions that slowed things down out of the gate:
What data is there and what does it look like?
What joins to what, using what conditions?
When was it updated?
How do I know if there’s anything wrong before I use it?
Who can I ask for help?
The common pattern for answering these questions is called a data catalog: a web UI that allows you to search and learn about data. They’re often described as a Wikipedia for data, but instead of relying on completely hand-written articles, they automatically crawl and update the most basic information (table name, schema, sample rows, etc.), and human knowledge then augments that further (table and column descriptions).
This gives your team a great starting point for any work with the data: just type into a search box, and boom, you can see the datasets that are relevant, then click into one to see the schema, sample rows, and sometimes freshness and lineage. All useful stuff for any data analyst, scientist, or engineer.
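To make the crawl-then-search idea concrete, here is a minimal sketch in Python. It is not how any of the catalogs below are implemented: it assumes a SQLite database as the only source, and the function names (crawl_catalog, search_catalog), the example tables, and the substring-count scoring are all made up for illustration. Real catalogs typically hand search off to a dedicated engine such as Elasticsearch.

```python
import sqlite3


def crawl_catalog(conn, descriptions=None, sample_size=3):
    """Build one catalog entry per table: schema, sample rows, description.

    `descriptions` stands in for the human-written knowledge layered on top
    of what the crawler can discover automatically (hypothetical input).
    """
    descriptions = descriptions or {}
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        schema = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        sample_rows = conn.execute(
            f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
        ).fetchall()
        catalog[table] = {
            "schema": schema,
            "sample_rows": sample_rows,
            "description": descriptions.get(table, ""),
        }
    return catalog


def search_catalog(catalog, query):
    """Return table names ranked by how often the query terms appear.

    Toy scoring: count substring hits across the table name, description,
    and column names; ties are broken alphabetically.
    """
    terms = query.lower().split()
    scored = []
    for table, entry in catalog.items():
        column_names = [col["name"] for col in entry["schema"]]
        haystack = " ".join([table, entry["description"], *column_names]).lower()
        score = sum(haystack.count(term) for term in terms)
        if score:
            scored.append((score, table))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [table for _, table in scored]


# Example: crawl a tiny in-memory database, then search it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, city TEXT, fare REAL)")
conn.execute("CREATE TABLE drivers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)", [(1, "SF", 12.5), (2, "NYC", 30.0)]
)
catalog = crawl_catalog(conn, {"trips": "One row per completed trip."})
matches = search_catalog(catalog, "fare")
```

The crawler owns everything it can discover on its own and can be re-run on a schedule, while the descriptions come from people and survive each re-crawl. That split is the one design choice worth taking from the sketch.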
A few years later, I had the opportunity to lead the team solving these problems within Uber, and to meet some of my peers doing the same at their own companies. It was hugely rewarding to see the level of adoption and appreciation these tools generated: thousands of weekly active users and a consistently positive NPS. There was clear demand from data people across the company for better tools. I wish we had started on it sooner.
Even after leaving Uber to work on my own project, I couldn’t keep what I saw as the eventual rise of the data catalog out of my mind, so I’ve been keeping brief notes on them whenever I see an announcement or a talk.
I started writing this post when a friend suggested that I clean up and share those notes in case they were useful to others.
If you’re on a data team today and don’t have a catalog, I strongly encourage you to look at these solutions, and try one in production if you can. Years ago there were fewer options for doing this, but with more open source offerings these days (e.g. Datahub and Amundsen), the barrier is coming down considerably — and your team’s new data analyst will really appreciate you :)
Note: enterprise-oriented products like Alation and Collibra have been around for a while, but to keep this post manageable in size, I’m only looking at the internal tools being built within “Big Tech” companies. Would you be interested in a similar survey for the commercial products? Let me know in the comments.
On to the Catalogs…
Lyft Amundsen
Announcement: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
Talk/slides: https://www.youtube.com/watch?v=EOCYw0yf63k
Github: https://github.com/lyft/amundsen
License: Apache-2.0
Ingests from: an Airflow DAG using the Amundsen Python library, or a standalone script using the same library
Stack: Python, Node, (Neo4j OR Apache Atlas), Elasticsearch, Flask, React
Features: Sample rows, column stats/profiling, data date range, frequent users, owners, tags, table and column descriptions
Key persons: Mark Grover, Alagappan Sethuraman, Daniel Won, Jin Chang, Tamika Tannis, Tao Feng, Matt Spiel, Shenghu Yang, and Philippe Mizrahi
LinkedIn Datahub (the artist formerly known as WhereHows)
Github: https://github.com/linkedin/datahub
License: Apache-2.0
Talk/slides: Strata 2019
Ingests from: LDAP, Hive, Kafka, MySQL, DB2, Firebird, SQL Server, Oracle, Postgres, SQLite, ODBC (all via API calls or Kafka events)
Stack: Pegasus, Ember, Play
Features: Schemas, lineage, ownership, descriptions/notes, dataset lifecycles
Key persons: Mars Lan, Seyi Adebajo, Shirshanka Das
Netflix Metacat+BigDataPortal
Announcement: Making Big Data Discoverable and Meaningful at Netflix
Github: https://github.com/Netflix/metacat
Talk/slides: https://www.youtube.com/watch?v=nMyuCdqzpZc
License: Apache-2.0
Ingests from: Spark, Presto, Pig, Hive, MySQL, Redshift, S3
Stack: Java/Groovy, Elasticsearch
Features: Table TTL, Hive+S3 partition and table metrics, manual custom metadata (column default values, table validation rules), space/cost metrics, org/subject-matter tagging, table-change-notifications / auditing-events (for Keystone)
Key persons: Ajoy Majumdar, Zhen Li
Uber Databook
Announcement: https://eng.uber.com/databook/
Github: N/A
License: N/A
Talk/slides: Strata 2019
Ingests from: Hive, Presto, Vertica, MySQL, Cassandra (bonus: multi-datacenter support)
Stack: Java, Elasticsearch, Dropwizard, React, Redux, D3
Features: Schemas, descriptions, lineage, freshness, nested data support (e.g. JSON), usage stats (via Queryparser), ingestion anomaly detection at the partition level
Key persons: Kaan Onuk, Lauren Tindal, Luyao Li
Airbnb Dataportal
Announcement: https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
Github: N/A
License: N/A
Ingests from: Unknown
Stack: Neo4j, MySQL, Elasticsearch, Flask, React, Redux
Features: Schemas, owners, top users, descriptions
Key persons: Chris Williams, Eli Brumbaugh, Jeff Feng, John Bodley, and Michelle Thomas
Updates and corrections
Are you a contributor to one of these platforms and have an update or correction to suggest? Let me know in the comments, or email me (kyle at torodata.io) and I’ll fix it right away.
Thanks for reading! 👋