Data Lineage: Just a Feature or Something Else?

Gabs Ferreira
Published in Alvin
5 min read · Jan 5, 2023

Yesterday, while reading the article "The Many Layers of Data Lineage" by Borja Vázquez, I started wondering when companies usually start thinking about implementing data lineage.

This is a topic I’m always curious about, since lineage is the foundation of Alvin and everything we’re building. So I did what I usually do when I come across this kind of question: I asked about it in a community. In this case, it was the Data Quality Camp Slack.

We got lots of answers there and a very good discussion (this is why I love communities). They gave me a lot to think about and I'd like to respond to some of them, but since they would get lost in Slack, I'm bringing them here to an article.

In a perfect world, lineage probably wouldn’t be that necessary

Juan pointed out that if you focus on your modeling and transformation, then lineage is something cool to have, but probably not a huge priority.

As he said: invest in your foundation. Follow the good practices. Stay healthy. But is it that easy?

In the real world, things might be a little different

Well, the reality of most companies is that foundations unfortunately aren’t taken that seriously. Be it for time, budget, people, or even technical reasons.

And, at the end of the day, it’s like Larry said:

I mean, nobody wakes up one day and thinks “what a beautiful day to buy a data lineage tool”. If you want lineage, there is a reason for it. And behind this reason, there are usually strategic objectives, metrics, and obviously money.

Chad shows here specific use cases where lineage was useful in his experience. If the foundations are bad and the customers are in pain now, something must be done now.

The discussion continues here as Juan points out that if you have a mess, lineage is useful. But if you are starting from scratch, maybe there are better things to prioritize.

Is lineage just a feature?

Yeah, I get what he says here: we’re living in a moment in the data space where there are lots of startups coming up with fancy names and trying to be category-defining somehow.

He sees lineage as another tool in the data professional's toolbelt, just like programmers see different programming languages as tools and pick the right one for the job they have to do.

It’s really hard to build good lineage

When we talk about ways of storing, modeling, and transforming data, there is an almost endless number of tools to consider. Building accurate lineage that spans all of those systems is far from an easy task.

How we see lineage at Alvin: a live dataset

One more time, Juan says here that he feels that lineage is a feature, after basically describing the main functionalities we have now.

So, let me tell you how we see data lineage here at Alvin.

If you look at it as just a diagram, I would agree it’s more of a feature than a platform. But hey, forget about the diagram for a while. We actually look at lineage more as a dataset, so it's not something you either have or don’t have. How much of it you need depends on the accuracy, granularity, and tool coverage that a specific company requires to solve its challenges.

For some companies, something more basic may be fine (e.g. dbt lineage). But really I think seeing lineage as just a feature limits its potential and applications. It’s a foundational dataset that use-case-driven features can be built on top of (testing, logging, alerting, cost, discovery, etc.).
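
To make the "lineage as a dataset" idea a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not how Alvin works internally: the table names, the (upstream, downstream, system) row shape, and the downstream_of helper are all assumptions made up for this example. The point is only that once lineage is stored as rows, a feature like impact analysis becomes a small query on top of it.

```python
from collections import defaultdict, deque

# Hypothetical lineage dataset: one row per dependency,
# shaped as (upstream asset, downstream asset, system that reported it).
LINEAGE_EDGES = [
    ("raw.orders", "staging.orders", "dbt"),
    ("staging.orders", "marts.revenue", "dbt"),
    ("marts.revenue", "dashboard.weekly_revenue", "bi_tool"),
    ("raw.customers", "staging.customers", "dbt"),
    ("staging.customers", "marts.revenue", "dbt"),
]


def downstream_of(asset, edges):
    """Return every asset that directly or indirectly depends on `asset`."""
    # Index the rows by upstream asset so lookups are cheap.
    children = defaultdict(list)
    for upstream, downstream, _system in edges:
        children[upstream].append(downstream)

    # Breadth-first walk over the dependency graph.
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Which assets could be affected if raw.customers changes?
print(downstream_of("raw.customers", LINEAGE_EDGES))
```

The same rows could just as well feed alerting (notify the owners of everything downstream) or cost analysis (find assets nobody depends on), which is what I mean by features built on top of the dataset.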

By thinking ‘beyond the diagram’ there is so much value to be gained for data teams!

To wrap it up: I agree with Juan here one more time. Robots should do robot work, and humans should take care of creative and valuable tasks.

We strongly believe that automated data lineage has the potential to empower data professionals and even replace some boring aspects of their daily jobs.

So yeah, from our point of view, data lineage is way more than a feature :)

What do you think?

I love this kind of discussion and would be happy to hear your thoughts. Also, you should join the Data Quality Camp Slack for more high-level discussions about the data space.

Special thanks to Juan, Chad, Larry, and the others who contributed to this conversation. And to Borja, who got me thinking in the first place.
