Data Discovery in 2020
In May of 2013, as Uber’s new data analyst, I was excited to jump into their data and deliver some juicy analyses. But, in an experience I think many can relate to, I ran into some seemingly simple questions that slowed things down out of the gate:
What data is there and what does it look like?
What joins to what, using what conditions?
When was it updated?
How do I know if there’s anything wrong before I use it?
Who can I ask for help?
The common pattern for answering these questions is the data catalog: a web UI that lets you search for and learn about data. Catalogs are often described as a Wikipedia for data, but instead of relying entirely on hand-written articles, they automatically crawl and update the most basic information (table name, schema, sample rows, etc.), and human knowledge then augments it (table and column descriptions).
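To make the crawl-then-augment idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not how any of the catalogs below are implemented; the `crawl_catalog` and `annotate` helpers, the `trips` table, and its description are all made up for illustration.

```python
import sqlite3


def crawl_catalog(conn, sample_size=3):
    """Collect table name, column schema, and sample rows for every table."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        sample_rows = conn.execute(
            f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
        ).fetchall()
        catalog[table] = {
            "columns": columns,
            "sample_rows": sample_rows,
            "description": None,  # filled in later by a human
        }
    return catalog


def annotate(catalog, descriptions):
    """Layer hand-written descriptions on top of the crawled metadata."""
    for table, text in descriptions.items():
        if table in catalog:
            catalog[table]["description"] = text
    return catalog


# Hypothetical example database with a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips (city, fare) VALUES (?, ?)",
    [("SF", 12.5), ("NYC", 30.0)],
)

catalog = annotate(crawl_catalog(conn), {"trips": "One row per completed trip."})
print(catalog["trips"])
```

The split matters: the crawled fields can be refreshed on a schedule without touching the descriptions that people wrote by hand.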
This gives your team a great starting point for any work with the data: just type into a search box, and boom, you can see the relevant datasets, then click into one to see the schema, sample rows, and sometimes freshness and lineage. All useful stuff for any data analyst, scientist, or engineer.
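The search-box experience can be approximated with a toy keyword ranker. This is only a sketch under simple assumptions: the entries, field names, and scoring weights are invented here, and production catalogs typically delegate this to a search engine such as Elasticsearch.

```python
def search(entries, query):
    """Rank catalog entries by how many query terms they match.

    A match in the table name counts double a match in the
    description or column names (an arbitrary, illustrative weighting).
    """
    terms = query.lower().split()
    scored = []
    for entry in entries:
        name = entry["name"].lower()
        text = " ".join(
            [entry.get("description", ""), *entry.get("columns", [])]
        ).lower()
        score = sum(2 * (term in name) + (term in text) for term in terms)
        if score:
            scored.append((score, entry["name"]))
    # Highest score first; ties broken alphabetically for stable output.
    return [name for _, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical catalog entries.
entries = [
    {"name": "trips", "description": "One row per completed trip.",
     "columns": ["id", "city", "fare"]},
    {"name": "drivers", "description": "Driver profiles.",
     "columns": ["id", "home_city"]},
    {"name": "payments", "description": "Payment records for each trip.",
     "columns": ["trip_id", "amount"]},
]

results = search(entries, "trip city")
print(results)
```

Even this crude substring matching shows why descriptions and column names are worth indexing: two of the three tables are only found through them.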
A few years later, I had the opportunity to lead the team solving these problems within Uber, and to meet some of my peers doing the same at their own companies. It was hugely rewarding to see the level of adoption and appreciation these tools generated: thousands of weekly active users and a consistently positive NPS. There was clear demand from data people across the company for better tools. I wish we had started on it sooner.
Even after leaving Uber to work on my own project, I couldn’t keep what I saw as the eventual rise of the data catalog out of my mind, so I’ve been keeping brief notes on catalogs whenever I see an announcement or a talk.
I started writing this post when a friend suggested that I clean up and share those notes in case they were useful to others.
If you’re on a data team today and don’t have a catalog, I strongly encourage you to look at these solutions, and try one in production if you can. Years ago there were fewer options for doing this, but with more open source offerings these days (e.g. Datahub and Amundsen), the barrier is coming down considerably — and your team’s new data analyst will really appreciate you :)
Note: enterprise-oriented products like Alation and Collibra have been around for a while, but to keep this post manageable in size, I’m only looking at the internal tools being built within “Big Tech” companies. Would you be interested in a similar survey for the commercial products? Let me know in the comments.
On to the Catalogs…
Ingests from: Airflow DAG using the Amundsen Python library, or a script using the same library
Stack: Python, Node, (Neo4j OR Apache Atlas), Elasticsearch, Flask, React
Features: sample rows, column stats/profiling, data date range, frequent users, owners, tags, table and column descriptions.
LinkedIn Datahub (the artist formerly known as WhereHows)
Talk/slides: Strata 2019
Ingests from: LDAP, Hive, Kafka, MySQL, DB2, Firebird, SQL Server, Oracle, Postgres, SQLite, ODBC (all via API calls or Kafka events)
Stack: Pegasus, Ember, Play
Features: Schemas, lineage, ownership, descriptions/notes, dataset lifecycles
Ingests from: Spark, Presto, Pig, Hive, MySQL, Redshift, S3
Stack: Java/Groovy, Elasticsearch
Features: Table TTL, Hive+S3 partition and table metrics, manual custom metadata (column default values, table validation rules), space/cost metrics, org/subject-matter tagging, table-change-notifications / auditing-events (for Keystone).
Talk/slides: Strata 2019
Ingests from: Hive, Presto, Vertica, MySQL, Cassandra (bonus: multi-datacenter support)
Stack: Java, Elasticsearch, Dropwizard, React, Redux, D3
Features: Schemas, descriptions, lineage, freshness, nested data support (e.g. JSON), usage stats (via Queryparser), ingestion anomaly detection at the partition level.
Ingests from: Unknown
Stack: Neo4j, MySQL, Elasticsearch, Flask, React, Redux
Features: Schemas, owners, top users, descriptions
Updates and corrections
Are you a contributor to one of these platforms and have an update or correction to suggest? Let me know in the comments, or email me (kyle at torodata.io) and I’ll fix it right away.
Thanks for reading! 👋