Data Observability: What’s Changed and Why Now is the Time to Solve It

Manu Bansal · Published in Lightup Data · 11 min read · Aug 1, 2021

An interview between TFiR Media and Lightup.

Photo by Markus Spiske from Pexels

Why has data observability suddenly become such a hot topic?

What’s changed in the data stack that’s made observability a new, different, and solvable problem — and why can’t legacy tools seem to deal with it?

I recently answered these questions, and many more, in a far-ranging chat with TFiR Media’s CEO and host Swapnil Bhartiya.

To watch the full video interview, click here.

Or, read on for a full transcript that has been lightly edited for readability.

Swapnil Bhartiya: Let’s start with the story of Lightup. What was the problem in the data quality space you were trying to solve when you co-founded Lightup?

Manu Bansal: It’s been a very personal journey for me. Before starting Lightup I was building a company called Uhana, which was acquired by VMware in 2019. We were building predictive analytics systems for telcos and processing tremendous amounts of data in real time. At our peak we were handling 3 million events per second at sub-second latency, and running machine-learning models on top, all to predict performance in the telco network.

But we spent most of our time debugging, or fixing issues that caught us by surprise and usually turned out to be unexpected data problems.

When machines went down or infrastructure broke, we would know instantly because we had monitoring there. And if application endpoints became unresponsive we would know that too. But we lacked visibility, or observability, into the data that was feeding our pipeline.

And it was a garbage-in, garbage-out system: if you put in bad data you would just get bizarre results. We simply didn’t have any tools to deal with those issues, and solving them was always a very ad hoc, very reactive process.

We knew we could do better, and should do better; otherwise it would keep creating very embarrassing situations. That’s what led us to look into the problem of data quality in modern data pipelines and data stacks.

And as I looked around I saw the same problem elsewhere: whether it was a retail company or a fintech company, the symptom was the same everywhere.

We realized this problem needed solving, because the world is becoming more and more data-driven. So we began to investigate the story for data observability, and for monitoring data quality as data passes through the pipelines. And that’s what led to Lightup.

SB: Why is data observability a new and different problem? How does it contrast with observability in IT infrastructure monitoring or application monitoring? And how does it compare to previous generations of data quality problems and solutions?

MB: First, let’s talk about how data observability compares to IT infrastructure monitoring.

Let’s say you pick up Datadog to monitor your IT infrastructure. There’s a common denominator. You’re talking about the same kind of metrics. You have CPU, and memory, and disk to monitor across a cluster of machines or containers, and that’s the same regardless of what domain you’re operating in as a business.

But when it comes to monitoring data, things are not as cut-and-dried. Data is very business-specific, and data’s cadence is very business-specific: one business might be processing data every minute, and another once a day, or even more slowly.

And then, when you are deciding on the health of data there’s usually a multi-stakeholder conversation. Sometimes the data engineer is able to make the assessment, but other times you want to include the product manager too. The data monitoring problem is just fundamentally more diverse, which is probably why it has been unsolved in the new stack for such a long time. It just hasn’t been clear how you would go about doing it. It’s a very different kind of a beast.

Now, if you compare today’s data observability to the last generation of data quality tools (say Informatica, Talend, or even Trifacta), those tools come from a different generation altogether, when we thought of data as a static object.

You’re talking about spreadsheets, or a small database like a collection of rows in MySQL. You’re talking about data quality analysis in the same sense as interactive BI analysis, where a human being sits on the other end, runs queries to assess data quality, and decides whether the data is good or not.

That [incumbent approach to data quality] doesn’t work anymore when you’re talking about modern data pipelines where data is flowing in continuously in large volumes, in large cardinality — where it’s simply impossible to inspect manually and it’s instead a continuous monitoring problem.

So, there are similarities with IT monitoring, but data is very diverse and very business-specific, and that creates a fundamentally new problem.

SB: We also now live in a data-driven world. No matter where you look everyone is now sending so much data back and forth, and storing data in warehouses and lakes, and those data centers have their own challenges. Everyone is also moving data around, and needs to extract value from it. So when we look at data observability and debugging we are looking at a new set of challenges.

But how different is it? What are the unique challenges created by everyone generating a huge amount of data and that data changing so fast?

MB: At heart the question you are raising is about timing. Data has looked like such a complex beast that it has not been clear that you can solve this problem. So — what are the signs we are seeing today, and what’s happening in the world today, that now makes this a tenable problem?

In short: We’ve seen several major architectural shifts happen in the last five years or so that have made this problem solvable.

When we’re talking about the evolution of big data, Hadoop came about in the mid-2000s. Then we saw that graduate into the Spark ecosystem, and then Kafka and streaming data came along, and we had this complex real-time stream processing pipeline where data was just processed in-memory.

We are now seeing a big shift that’s very promising from the perspective of data control, data governance, and data quality. We are moving from traditional ETL-style pipelines to an ELT-style way of building the stack.

And at the center of that is the rise of the cloud data warehouse. We saw cloud happen after we talked about big data, and we also saw the warehouse come up to scale — we used to have databases that didn’t scale, then we went to unstructured data lakes, and now we’re starting to get a best-of-both-worlds architecture with the cloud data warehouse, and maybe structured query engines on top of the unstructured data lake.

We’re now seeing this ELT-style architecture, where no matter where your data sits the tendency is to bring that into a central place — no matter if that’s a data lake with a structured query engine on top, or it’s a data warehouse — and then you want to successively refine that data and make it more usable for an end purpose, whether it’s feeding a BI dashboard, or feeding a product, or even feeding a machine learning model.
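
To make that successive refinement concrete, here is a minimal ELT-style sketch. The stage, table, and column names are hypothetical, the SQL is Snowflake-flavored, and conn stands for any Python DB-API connection to the warehouse:

```python
# Minimal ELT sketch (Snowflake-flavored SQL; stage, table, and column
# names are made up for illustration). `conn` is any DB-API connection.

LOAD_RAW = """
COPY INTO raw_events                      -- E+L: land the data unchanged
FROM @landing_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
"""

REFINE = """
CREATE OR REPLACE TABLE clean_events AS   -- T: transform inside the warehouse
SELECT
    event_id,
    TO_TIMESTAMP(event_ts)  AS event_ts,
    LOWER(TRIM(user_email)) AS user_email
FROM raw_events
WHERE event_id IS NOT NULL                -- drop obviously broken rows
"""

def run_elt(conn):
    """Run both ELT steps in the warehouse; raw data never leaves it."""
    with conn.cursor() as cur:
        cur.execute(LOAD_RAW)
        cur.execute(REFINE)
```

The key design point is that both steps run inside the warehouse; nothing is transformed in a separate pipeline on the way in.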

So, this warehouse or the lake at the center is the one place where you can actually run all of those data quality checks regardless of whether your data is getting collected from the edge, or data is already collected in the cloud or coming from user devices.

We’re just starting to bring that into the warehouse or the lake, because it’s economical and it’s very powerful to do so. Now we can just operate on data at rest while still meeting our latency requirements.

That’s the shift I think that’s leading to this new generation of data quality solutions, where it suddenly looks like a very solvable problem. The problem has been important, but it really has not been solvable until the warehouse came along.

SB: If you’re mentioning those technologies, I want to know if you’re also looking at things like Snowflake, or Databricks, or all those other new technologies. These are so complicated and have made complexity part of today’s data world.

From the perspective of observability, debugging, and other data problems — are technologies like Snowflake or Databricks making things better or worse?

MB: There’s something interesting in the way [data] lakes and warehouses evolved, where we have brought back many of the features we took for granted in the world of relational databases.

It’s almost the same view you would get with Databricks on Delta Lake, or on Snowflake as a warehouse. But then there are these subtle differences, because we had to give some of those features up to get that scalability.

For example, Snowflake will not enforce a primary key constraint on your data. The constraint is just a suggestion for another analytics developer to look at; the engine doesn’t enforce it, because enforcement would take away a lot of the scalability you get from a distributed database. And if you drop down to data lakes you give up even more of the structure you would expect healthy data to have.
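
As a minimal sketch of what that means in practice (assuming a hypothetical orders table and a generic DB-API connection): the declared key is accepted, duplicates still land, and uniqueness becomes a check you run on the side.

```python
# Snowflake accepts this DDL but does not enforce the constraint on a
# standard table -- both inserts below succeed, leaving a duplicate key:
#
#   CREATE TABLE orders (id INT PRIMARY KEY, amount NUMBER);
#   INSERT INTO orders VALUES (1, 10.0);
#   INSERT INTO orders VALUES (1, 99.0);   -- not rejected!
#
# So uniqueness becomes a data quality check you have to run yourself:

DUPLICATE_KEY_CHECK = """
SELECT id, COUNT(*) AS copies
FROM orders
GROUP BY id
HAVING COUNT(*) > 1
"""

def find_duplicate_keys(conn):
    """Return an (id, copies) row for every violated 'primary key'."""
    with conn.cursor() as cur:
        cur.execute(DUPLICATE_KEY_CHECK)
        return cur.fetchall()
```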

In many ways the warehouses have made life easier with ELT-style workloads, compared to the previous ETL-style workloads where everything was just data in motion. At least you’re now working with data at rest. At the same time the scalability comes at the expense of traditional data integrity checks, which we need to think about with a clean slate.

These checks are probably not going to make it into the data fabric itself. This has to be a separate tool on the side, which is why we believe a company like Lightup can do really well here and why it fits this conversation. Checking should not, and likely will not, reside in your data storage itself, because it’s an expensive feature and you want to apply it selectively, sometimes with the context of the specific business use-case.

So, to your point the warehouse has made things easy, but it has also created this additional complexity because it is not exactly the same feature set that we used to have in the world of databases.

SB: BI tools have been around for decades, and there are a lot of legacy companies that are trying to solve this problem. But the way legacy systems work is very different from the way we now work in our cloud-native, data-centric world.

So why, exactly, do those tools fail? And how does Lightup come into that picture to solve our new problems in new ways?

MB: I think that’s a great question, in terms of how you would create the ideal solution knowing what you know today, and given the kinds of data problems and workloads we have to deal with.

In a simple, one-line summary — it’s data scaling.

That’s what becomes the biggest challenge for legacy tools. They were designed in a very different world, where we were talking about small, static datasets. Concretely, that means those tools were built around an import-and-inspect workflow: a human runs interactive analysis, pulling the data out of where it resides and into the tool itself before any quality tests can run.

That doesn’t scale anymore. You can’t expect to bring out petabytes of data from a Redshift or a Snowflake instance before you can run checks on it.

Instead, what we are doing (which seems to be what’s needed right now) is using that already scalable warehouse or lake. You can offload those tests down to the scaled-out fabric that already exists to run queries on the data.

The way we have architected the system, we issue data quality queries into the warehouse or the data lake, and we never actually have to pull the raw data into our system. That fundamentally gives you a scaling advantage: you’re not scaling the solution a second time, and you’re not spinning up separate infrastructure to do it.
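
Here is a minimal sketch of that pushdown pattern, again with hypothetical table and column names and Snowflake-flavored SQL. The warehouse does the heavy lifting, and only a few aggregate values ever cross the wire:

```python
# Pushdown-style data quality check: the warehouse computes the metric,
# and only three numbers come back -- never the raw rows.

ROW_HEALTH_QUERY = """
SELECT
    COUNT(*)                                            AS row_count,
    SUM(CASE WHEN user_email IS NULL THEN 1 ELSE 0 END) AS null_emails,
    MAX(event_ts)                                       AS latest_event
FROM clean_events
WHERE event_ts >= DATEADD('day', -1, CURRENT_TIMESTAMP())
"""

def check_row_health(conn):
    """Fetch a handful of aggregates instead of pulling raw data out."""
    with conn.cursor() as cur:
        cur.execute(ROW_HEALTH_QUERY)
        row_count, null_emails, latest_event = cur.fetchone()
    return {"row_count": row_count,
            "null_emails": null_emails,
            "latest_event": latest_event}
```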

SB: Since we are talking about how you do things differently, I also want to know more about your offerings, solutions, and product portfolio. Is Lightup service-based or is it software?

MB: In April we announced a beta program. We are accepting early users and adopters who are interested in solving this problem with us, partnering to take the platform where it needs to go.

We are seeing a lot of interest from large-scale customers, the classic enterprise, whether they come from retail or from food and beverage, but there is also interest from the mid-market. Both segments have great use-cases for our system, and we can already create a lot of value out-of-the-box.

One thing we learned along the way — when we are talking about data quality we are basically touching the crown jewels of the customer. This is business-critical data, very sensitive at times. So, we architected the system so we can deploy it on the cloud as a SaaS service, but we can also bring the system over to a customer’s cloud. And it could also go into a legacy data center just the same.

This is turning out to be a very powerful feature. It’s necessary in cases where you have regulatory or compliance requirements, where even if you trusted Lightup with data privacy and security your own customer base might not permit deploying the solution as a cloud service.

So, we’re very open to bringing the solution into your cloud, and we’ve put in a lot of effort to make Lightup a turn-key deployment regardless of how you choose to deploy the system. We’ve been able to show results within a couple of days in most cases, focusing very heavily on ease-of-use and welcoming customers from every vertical; the larger the data volumes, the more exciting it gets.

This beta version of the Lightup data quality platform is now available. We can quickly onboard your data assets, we can start to run data quality tests on your data warehouse, or data lake, and we can also run tests on data streams if you have data moving over Kafka or Segment, and just get you out-of-the-box data quality checks over your data stack.

SB: What is the roadmap you offer in beta? When do we expect it to become available generally, and what kind of pipeline or roadmap do you expect this year?

MB: We are targeting announcing General Availability (GA) toward the end of the year. The goal is to stress-test the system in the field with early design partners, learn from those engagements, and bring an offering that’s ready for primetime when we announce GA.

Right now we are squarely focused on detecting data quality issues, and doing it in a way that’s generic.

The system is really good at comparing the data at hand with past data observed for the same data asset. For example, we can say the data looks abnormal today relative to what it looked like over the last week. That’s the core problem our system solves, and it generalizes very well across verticals, businesses, and use cases.
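
As a generic illustration of that idea (not necessarily the detection method Lightup itself uses), a rolling z-score over a metric’s recent daily values is one simple way to flag “abnormal relative to last week”:

```python
import statistics

def is_abnormal(history, today, z_threshold=3.0):
    """Flag today's value if it sits more than z_threshold standard
    deviations away from the trailing window (e.g. the last 7 days)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean  # flat history: any change is abnormal
    return abs(today - mean) / stdev > z_threshold

# e.g. daily row counts observed for the same data asset last week
last_week = [10_120, 9_980, 10_340, 10_050, 10_210, 9_890, 10_160]
print(is_abnormal(last_week, today=4_200))   # True: volume fell off a cliff
```

In practice you would track many such metrics per asset (row counts, null rates, freshness) and tune the window and threshold to each asset’s cadence.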

By focusing on solving this core problem, we’ve seen some really promising results with design partners so far, and confirmed that’s the product we really want to home in on by the end of this year, before we start to diversify into other use cases, be it detecting PII or moving more into the cataloging side.

Lightup brings order to data chaos. We give organizations a single, unified platform to accurately detect, investigate, and remediate data outages in real-time.

To see if Lightup can solve your data outages, take the right next step.

CEO & Co-founder of Lightup, previously a Co-founder of Uhana. Stay connected: linkedin.com/in/manukmrbansal/.