The Grand Rewrite of DataHub

Published in

@metaphor

7 min readSep 20, 2023

The good, the bad, and the ugly of a popular open-source project we created and how to rewrite it (sorry, Joel).

Photo by Fons Heijnsbroek from Unsplash.

When we left LinkedIn to start Metaphor, there were lots of unknowns. The only thing certain was that it would be an open-core company. In other words, we’d just need to follow the trailblazers like MongoDB, Elastic, and Confluent, and take the company to IPO based on our snazzy open-source project. Easy peasy.

Boy, were we naïve?

While creating and open-sourcing DataHub—and subsequently nurturing it into the most popular project in its category — we always knew that there were non-ideal engineering decisions made along the way. Some of these were due to the technology choices afforded to us at LinkedIn, but many were due to our inexperience. Over time, these “skeletons in the closet” slowed the progress and prevented the product from reaching its full potential.

This post outlines “the good, the bad, and the ugly” of DataHub from a technical perspective. It also serves as the living proof of how we went against the common wisdom of “never rewrite the code from scratch” and rebuilt the whole thing from the ground up. Fortunately, not only did we survive the process, but we also produced a more reliable, scalable, and extendable Modern Metadata Platform in record time!

The Good

Push-based Ingestion

DataHub popularized the concept of Push-based Ingestion, i.e., having data systems push the metadata to the server. It is still our opinion that the approach is superior to Pull-based Ingestion, i.e., having the server “crawl” the data systems. The former offers the benefits of real-time updates, reduced traffic, and a well-defined interface to be plugged into. In fact, crawlers can simply “pretend” to be a push-based system by writing the output to the said interface.

Change Stream

The idea of emitting a change stream from the metadata platform has many merits. Internally, we could use the change stream to update the search and graph indices. Externally, the change events can also be used to trigger downstream systems, such as generating alerts/notifications, driving data governance workflows, managing access control, etc.

Modeling

DataHub adopted a model-first approach to development. In a nutshell, metadata models are first defined using a declarative language, and the corresponding metadata events, APIs, and storage formats are automatically generated from the models. This removes a lot of boilerplate code and manual model conversations.

This also forces us to think hard about how the metadata should be structured and the relationships between them. As pointed out by Chad Sanderson, proper data modeling is extremely valuable and can have wide-ranging, unanticipated downstream impacts.

The Bad

PDL

PDL is the modeling language used extensively in DataHub. However, if you ask any seasoned engineer what PDL is, you’ll likely end up with a blank stare. You’ll then explain that, “It’s like an extension to Avro, but in a DSL syntax, and only used by LinkedIn and perhaps two other companies.” — I think you get the idea.

Don’t get me wrong, PDL is a fine language for data and API modeling, but so are Protobuf, Thrift, and JSON schema. The difference is that the latter ones are far more popular and have a much richer ecosystem. The choice should be obvious here.

URN

Using URN as the entity ID doesn’t seem like a terrible choice on paper. After all, URN stands for Uniform Resource Name. However, the version of URN used in DataHub introduced a few non-standard aspects, such as nested components. It also sports a funky hard-coded urn:li: namespace—li stands for LinkedIn, in case you're wondering.

Another issue with URNs is that they are verbose and require escaping when used as part of a URL. Let me try to fit the following example into a single line:

https://demo.datahubproject.io/dataset/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3As3%2Clongtail-core-data%252Fmongo%252Fadoption%252Fpet_profiles%2CPROD%29

One may argue that this is more aesthetic than technical. However, DataHub also uses this long string-based ID in table primary keys. This can lead to potential performance issues and limit scalability.

The solution is rather simple: hashing. Yes, you will lose the human-readable feature of URN once it’s hashed, but the result is an evenly distributed, fixed-length string that is easy on both the eyes and the database.

Stateful Ingestion

Stateful ingestion was designed to solve the thorny issue of deleted assets. Without it, the crawlers only know what exists in the current run but know not what was removed since the last run. Unfortunately, it’s destined to fail in more complex settings.

Depending on where the state is kept, there are different kinds of challenges to deleting the correct entities. When the state is kept locally, the obvious question is, “What if the state is lost?” You’ll miss deleting some entities because there’s no previous state to compare to. Even if the local state is never lost, a trivial configuration change, e.g., including/excluding some tables from the scan, can completely invalidate the previous state.

When the state is kept on the server, you’ll need to deal with race conditions, given that the “diff-then-delete” operation can’t happen atomically. The problem quickly exacerbates when you have multiple crawlers, each in charge of scanning a subset of the same system.

A more robust solution is to keep the ingestion stateless and run a “garbage collector” on the server side. It’ll automatically delete assets that haven’t been reported by crawlers for a while. The trade-off is that entities may be incorrectly deleted when the crawlers fail to run for a prolonged period. Fortunately, since we decided to make everything soft-deleted, all it takes is to run the crawlers again to restore the entities and all associated metadata.

The Ugly

MySQL

One of the most regrettable engineering choices we made was using MySQL. The problem is not MySQL itself but the fact that we “abused” it as a document store. LinkedIn has a proper document store, Espresso, but it’s completely proprietary, so we can’t use it for an open-source project. We were also restricted from using other feature-rich NoSQL databases, such as CouchDB, MongoDB, or Azure Cosmos DB.

As a result, we store the document-oriented metadata as blobs in an opaque column and build makeshift secondary indexes over the blobs with complex custom logic. We’re practically building a poor man’s document store out of MySQL. While it may be an interesting engineering challenge, it’s essentially reinventing an inferior wheel.

Kafka

Kafka is not exactly the easiest thing to set up and manage — just see the number of Kafka-related questions posted in DataHub’s #getting-started & #troubleshoot channels. As correctly pointed out in Suresh’s post, this only makes sense in companies with a large dedicated team that manages and scales the Kafka cluster.

More importantly, Kafka is designed to process large amounts of streaming data. Using it as a low-volume message queue is like using a sledgehammer to crack a nut. There are plenty of message queue services (e.g., AWS SQS, GCP Pub/Sub, or RabbitMQ) that are simpler to provision and manage. They can easily handle thousands of messages per second and require no repartitioning or reconfiguration to scale. The pricing model also makes these options more favorable when compared to a dedicated Kafka cluster.

Data Synchronization

Another ugly secret of DataHub is that it’s quite difficult to keep the 3 data systems (MySQL, Elasticsearch, Neo4j) in sync. The custom logic added to emit the Metadata Change Log event is brittle, and so are the consumer jobs that update Elasticsearch & Neo4j. Even if these were made more robust, whenever the indexing logic changes, one must manually rebuild the search & graph indices, leading to significant downtime and space for errors. In reality, synchronizing data across heterogeneous systems is a hard problem, and that’s why companies like Fivetran and Hightouch exist.

Fortunately, more databases are multi-model now. For example, ArangoDB supports document, graph, and full-text search natively. The database system will automatically handle all the data synchronization between models with guaranteed correctness and minimal latency.

One More Thing…

The beauty of software is that everything is possible. The aforementioned shortcomings can all be overcome by continuously evolving the code base through sheer engineering grit. Indeed, some of them may have already been addressed by the community. Unfortunately, there exist problems that no software in the world can solve.

After being forced out¹ of the open-source project we created² and the community we nurtured, it became clear that betting your startup on a project your competitor took over is not exactly a wise idea. Rewriting the whole thing became the only viable future for the company. Also, let’s face it, few good programmers can resist the urge to—in Joel Spolsky’s words— bulldoze the place flat and build something grand.

The story behind this deserves its own post, which will be highly entertaining but lacks technical depth. We’ll save it for another day.
The definition of “creation” can be contentious, so I’ll let the readers be the better judge of that. Here are some data for your consideration: Seyi, Yi, and I authored 50.4% of the total commits when we left LinkedIn. The remaining half came from members of LinkedIn’s DataHub Team, none of whom chose to join Acryl. Even after nearly three years of being unable to contribute, we’re still the #1, #3 & #8 top contributors as of this writing.