From Data Crisis to Data Credibility (Case Study)

Vivek Joshi
Published in Lightup Data
May 18, 2021 · 11 min read

How Big Time Data used Lightup to fix their client’s silent data breaks and restore trust in their data-driven decision-making.

Summary: Big Time Data Case Study

Lightup’s Customer: Big Time Data — A full-service data agency whose client wanted to use customer data to drive product analytics and strategic decision making.

Their Client’s Data Stack: Built around a Snowflake data warehouse, with dbt used for data testing (both discussed below).

Challenge: Their client’s data kept breaking silently. These data outages were often discovered days or weeks after the fact and took significant time to debug manually. For as long as they remained active, they impaired product performance as well as data-driven decision making.

Solution: They connected Lightup to their client’s Snowflake data warehouse and immediately gained fine-grained observability on data quality and accurate, proactive alerting on data outages.

“Lightup takes away an entire class of data quality problems.”

— Rachel Bradley-Haas, Co-Founder, Big Time Data

More Data, More Problems: Silent Data Breaks Start to Appear

Big Time Data had a problem.

Their client developed enterprise messaging solutions and wanted to use customer data to drive product analytics and strategic business decisions.

To feed this strategy, their client upgraded their data stack to collect exponentially more events, while also launching a cloud version of their open-source platform to standardize customer deployments and enforce mandatory data collection.

Their client began to collect a flood of new, granular data and to weave it into every aspect of their organization. They attempted to use this data for advanced analytics that would drive key business decisions, such as which product features to prioritize and which market segments had a low Net Promoter Score (NPS) and needed attention.

But then, the problems began.

Their client’s data kept breaking silently. Something would go wrong in the pipeline without the data team’s knowledge, and they only learned something was wrong when a dashboard went askew and a business stakeholder knocked on their door.

These data breaks chipped away at data credibility and impaired their client’s ability to use data to drive decision making.

In this case study, we’ll show how Big Time Data solved this problem.

To do so, we’ll explore:

  • Why they couldn’t solve their client’s data outages by performing ad hoc debugging, writing dbt checks, or building a homegrown alerting engine.
  • Why they chose Lightup to monitor for data quality outages and how they used the solution to transform their data quality.
  • The benefits they gained using Lightup and how they handle their client’s data outages today.

“We were implementing data quality solutions ourselves, and they worked-ish. But when we saw Lightup, we thought, ‘OK, we don’t have to figure it all out ourselves anymore.’”

— Rachel Bradley-Haas, Co-Founder, Big Time Data

Ad Hoc Is Not Enough: Manual Debugging Creates More Problems Than It Solves

At first, Big Time Data performed manual, ad hoc debugging every time a data outage popped up. But this approach didn’t work well enough for a few core reasons.

  • Their Client’s Data Had High Cardinality. They needed to monitor data quality metrics in hundreds of rapidly changing models spread across hundreds of tables. Manual monitoring could never match this breadth.
  • They Experienced Long Detection Delays. They often didn’t notice problems for days or weeks. By the time they found a data outage it had already caused harm, and often that harm was irreparable (e.g. lost data collection).
  • Their Debugging Took Too Long. When they did find a data outage, they had to spend an average of 4–6 hours to track down, investigate, and debug its root cause. This work began to eat up a lot of their team’s cycles.
  • They Paid a High Cost to Fix Their Outages. Every outage affected several tables, each of which needed to be rebuilt, and some bad data even made it all the way back to Salesforce.

The result: Big Time Data got caught in a cycle of constantly firefighting their client’s data outages. They spent more time resolving data quality issues than they spent building analytics, models, and pipelines, eroding their analysts’ morale.

Even worse, their client began to lose trust in their data. They started to rethink their goal of becoming data-driven entirely, and began to revert to their old approach of relying on “gut feel” to make business decisions.

Every time there was a data outage, Big Time Data had to go to each impacted stakeholder to explain what went wrong, and to attempt to regain trust in what their dashboards displayed.

They knew they couldn’t continue to address their outages in this manner. They began to look for a new approach to data quality monitoring.

“We wanted analytics to be proactive. But with bad data, we were spending more time cleaning it up than we were using data to help the business move forward.”

Attempts at Automated Detection: Why Homegrown Solutions Failed and dbt Checks Weren’t Enough

To start, Big Time Data collected and analyzed the type of data outages their client experienced. They were always complex, elusive problems caused by a combination of issues from multiple data sources, software versions, and deployment options.

For example:

  • The volume of an event might change suddenly — it might drop 50% week-over-week because of a broken integration, or double overnight because of duplicate data.
  • A table’s data delay might suddenly increase by 2 hours because a pipeline failed to run.
  • A null check that expected 2% null values might suddenly spike to 25% nulls because of a bad transformation or a change in upstream data schema.

To try to proactively monitor for these types of outages, Big Time Data first wrote basic checks in dbt. This was a logical decision. dbt’s data testing framework is an excellent choice for stateless, one-off tests for traditional data integrity constraints like “column values are unique” and “column values are non-null”. These constraints used to be enforced in relational databases, but are not always enforced in modern data warehouses due to scalability concerns.

For Big Time Data, these new dbt checks were useful for certain scenarios. However, those checks were not designed to catch the specific class of data quality issues that Big Time Data was wrestling with, which required self-calibrated tests — for example, a drop in event volume compared to the prior week, or an increase in a null fraction compared to its baseline.
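As a rough illustration of what “self-calibrated” means here, the sketch below compares an event count against the same day in the prior week and flags a large drop. The function name and counts are illustrative assumptions, and the 50% cutoff is simply taken from the week-over-week example above; this is not Big Time Data’s actual implementation.

```python
# A minimal sketch of a self-calibrated volume check: flag an event-volume
# drop relative to the same weekday in the prior week.
# The function name, counts, and 50% cutoff are illustrative assumptions.

def week_over_week_drop(current_count: int, prior_week_count: int,
                        max_drop: float = 0.5) -> bool:
    """Return True if volume fell by more than `max_drop` versus a week ago."""
    if prior_week_count == 0:
        return False  # no baseline to compare against
    return (prior_week_count - current_count) / prior_week_count > max_drop

# Yesterday's event count vs. the same weekday one week earlier.
print(week_over_week_drop(current_count=41_000, prior_week_count=98_000))  # True  -> alert
print(week_over_week_drop(current_count=95_000, prior_week_count=98_000))  # False -> healthy
```

Unlike a stateless uniqueness or not-null test, a check like this has to carry state: it only makes sense relative to the metric’s own recent history.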

Next, they tried to develop their own homegrown monitoring solution. It worked better, but it still produced a lot of noise and carried a high configuration burden. To make it work effectively, they would have needed to assign an analyst to spend 60–100% of their time writing data quality checks, and they lacked the internal bandwidth to do so.

Finally, Big Time Data began to see the scope of the project they had on their hands. To make their client’s data dependable, they would need to devote a tremendous amount of resources towards solving the problem of data quality monitoring. They knew they could do it on their own, but it would distract them from their core business.

Seeing this, Big Time Data began to search for a partner with a proven data quality solution.

Finding the Right Data Quality Partner: Big Time Data Meets Lightup

Big Time Data outlined their ideal data quality solution. It would need to offer:

  • Rich, Out-of-the-Box Functionality. They had a small but mighty team, and they needed a solution that delivered value quickly, with minimal intervention.
  • Easy Configuration. They needed to be able to quickly and easily configure self-calibrated data quality tests with historical profiling for hundreds of tables.
  • Accurate, Proactive Detection. The solution needed to accurately detect broken data assets and alert the team, with minimal noise and false positives.

Initially, they were skeptical and didn’t think they would find a solution that hit every point. They knew how complex their unique data model was, and they doubted any outside vendor could accurately detect anomalies within it without significant setup.

They tried a few data quality tools. None of them worked. Then, they discovered Lightup.

Big Time Data connected Lightup to their client’s Snowflake data warehouse. Within minutes, Lightup automatically populated their dashboard with Data Quality Indicators (DQIs) for their client’s entire set of data assets. A few minutes later, Lightup began to surface existing anomalies that Big Time Data had suspected but had been blind to until then.

“Without Lightup knowing our data model, their solution was still able to find outages quickly and easily. That’s impressive.”

Getting Started: How Lightup Delivered Value from Day One

To start, Big Time Data took the 20–30 data assets where they had previously seen outages, and connected them to Lightup. They quickly saw that Lightup met all of the criteria they had defined for an ideal data quality solution.

How Lightup Offered Rich, Out-of-the-Box Functionality

Using Lightup, Big Time Data gained out-of-the-box visibility into the health of their client’s key data assets. They were able to immediately configure 200 metrics across 50+ tables and track data availability and validity for all of their key columns.

Out-of-the-box DQIs for common data quality symptoms.
Instant visibility into data quality for all data assets.

How Lightup Offered Easy Configuration

Using Lightup, they began to monitor their key metrics with one click. They created self-calibrated detection rules that learned from each metric’s past behavior and sent alerts when a metric deviated significantly from expectations based on that history.

Single-click setup of monitoring and alerting on key DQIs.
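As a loose sketch of the general technique, a self-calibrated rule can learn a per-metric baseline from the metric’s own history and alert only on large deviations from that baseline. This is an illustration of the idea, not Lightup’s actual detection model; the history window and the 3-sigma tolerance below are assumptions.

```python
# A rough sketch of a self-calibrated detection rule: learn a baseline from
# the metric's own history and alert on large deviations from it.
# Window size and 3-sigma tolerance are assumptions, not Lightup's model.

from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    """Alert when the latest value falls outside the band the metric itself established."""
    baseline = mean(history)
    spread = stdev(history) if len(history) > 1 else 0.0
    return abs(latest - baseline) > sigmas * max(spread, 1e-9)

# Hourly row counts for a table that normally hovers around 10k rows/hour.
recent_counts = [10_120, 9_870, 10_340, 10_010, 9_950, 10_200, 10_080]
print(is_anomalous(recent_counts, 10_150))  # False -> within the learned band
print(is_anomalous(recent_counts, 3_400))   # True  -> alert: well below baseline
```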

How Lightup Offered Accurate, Proactive Detection

Using Lightup, they discovered unknown data outages on some of their key tables and columns. They scaled anomaly detection across thousands of combinations of data dimensions and measures and found anomalies they didn’t know to look for.
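To give a sense of what scaling a check across dimension combinations involves, here is a minimal sketch that computes an event-volume metric per (platform, version) slice and runs the same baseline comparison on every slice. The dimension names and counts are hypothetical, and this is not a description of Lightup’s internals.

```python
# A rough sketch of scaling one check across many dimension combinations:
# compute the metric per slice and apply the same baseline comparison to each.
# Dimension names and counts are hypothetical.

from itertools import product
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    baseline = mean(history)
    spread = stdev(history) if len(history) > 1 else 0.0
    return abs(latest - baseline) > sigmas * max(spread, 1e-9)

platforms = ["ios", "android", "web"]
versions = ["5.9", "6.0"]

# Per-slice event counts: recent history plus the latest observation.
slices = {
    ("ios", "6.0"):     ([5_100, 5_240, 4_980, 5_050], 1_200),  # broken integration on this slice
    ("android", "6.0"): ([7_800, 7_950, 8_020, 7_880], 7_910),
}

for dims in product(platforms, versions):
    if dims not in slices:
        continue  # no data collected for this combination
    history, latest = slices[dims]
    if is_anomalous(history, latest):
        print(f"alert: event volume anomaly for slice {dims}")
```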

“We didn’t really know the extent of our data quality issues, we just knew they weren’t great. Lightup found data quality issues we didn’t know about that we had been suffering from for months.”

Data Outage Deep-Dive: How Big Time Data Detected and Debugged Silent Data Drops and Delays with Lightup

Here are two examples of how Big Time Data used Lightup for anomaly detection.

Example One: Silent Data Drop

A key data outage that Big Time Data detected with Lightup was a drop in data volume in one of their client’s tables: mobile events. The incident had gone undiscovered for over a week, but with their new solution they found it as soon as they turned on monitoring for that table.

Detection of a significant drop in overall event volume, a weekly seasonal metric.

This was a big outage. Mobile events are the core of how their client’s application tracks customer usage. They use these events to understand what their end users are doing, and this analysis feeds all the way up to topline company metrics like Daily Active Users (DAU) and Monthly Active Users (MAU). And this single outage affected more than 10 tables.

When this outage occurred and mobile events volume appeared to drop, it triggered false alarms around monthly and daily active users. The customer success team also relied on these metrics, and the false alarms led them to believe certain customers were close to attrition. The bad data created by this outage even made it all the way back to usage metrics surfaced in Salesforce.

With Lightup, Big Time Data quickly traced all of these problems to a single broken integration and fixed it.

Example Two: Silent Data Delay

With Lightup, Big Time Data was able to detect increases in data delay, where fresh data was no longer available for a set of tables. These incidents occurred when one or more of their pipelines failed to execute because of broken integrations or other errors.

In one recent case, they detected data delay in their client’s mobile events table, the same table that showed a drop in data volume in the previous example. The pipeline feeding the table had not run for more than an hour, which pushed the delay metric well past its 60-minute baseline, the table’s normal update cycle.

Detection of large increase in data delay due to pipeline failure.

Lightup proactively detected the increase and alerted Big Time Data’s team. They quickly traced the spike to a pipeline execution error and resolved it before stale data could affect their client’s downstream metrics.
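For context, a data delay (freshness) metric like the one in this example can be sketched as the time elapsed since the newest row arrived, compared against the table’s 60-minute update cycle. The 60-minute baseline comes from the example above; everything else is an illustrative assumption.

```python
# A minimal sketch of a data delay check: measure how long it has been since
# the newest row landed, and alert when that lag exceeds the table's normal
# 60-minute update cycle. Timestamps here are illustrative.

from datetime import datetime, timedelta, timezone

def data_delay(latest_row_timestamp: datetime) -> timedelta:
    """Delay = time elapsed since the most recent row arrived."""
    return datetime.now(timezone.utc) - latest_row_timestamp

def delay_breached(latest_row_timestamp: datetime,
                   baseline: timedelta = timedelta(minutes=60)) -> bool:
    """Alert when the observed delay exceeds the table's normal update cycle."""
    return data_delay(latest_row_timestamp) > baseline

# e.g. the newest row is over two hours old -> the pipeline likely failed to run.
stale = datetime.now(timezone.utc) - timedelta(hours=2, minutes=15)
print(delay_breached(stale))  # True -> alert before downstream dashboards go stale
```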

Key Benefits: How Big Time Data Transformed Their Client’s Data Quality by Using Lightup

Using Lightup, Big Time Data has gained a range of benefits and overcome the core challenges they faced when manually debugging data outages. They have:

  • Taken Control Over High Cardinality Data. They quickly developed data quality observability and accurate alerting for over 50 key tables.
  • Reduced Their Detection Delays. They now detect outages as soon as they happen instead of days or weeks later.
  • Shortened Their Long Debugging Cycles. They analyze an issue and debug it within 30 minutes, instead of 4–6 hours.
  • Lowered Their High Costs to Fix Issues. They find and fix broken data assets before downstream data and stakeholders are affected.

The result: Big Time Data broke their cycle of constantly firefighting their client’s data outages. In the process, they also resolved many of the “soft” issues that surrounded their client’s data outages. With Lightup:

  • They no longer worry about data quality. “You don’t have to worry in the back of your mind all the time, ‘When is someone going to come and ask why their dashboard is not right?’”
  • They turned data quality into a manageable issue. “Now that we have Lightup, we can care about data quality as much as we should. It makes data quality more manageable as a trigger system.”
  • They have reduced the burden of data debugging. “Lightup opens the team to spend more time on the things that push the business forward.”
  • They have restored their data’s credibility. “We are no longer playing whack-a-mole every time data comes in, and stakeholders can trust the data on their dashboards again.”

They sum up the value they gained using Lightup with a simple statement.

“Without Lightup, our client would have ceased to be a data-driven company. There were so many data quality issues, there was no trust for the data coming in, and solving every data quality issue was playing whack-a-mole — the organization was about to stop using data entirely and just fall back on gut feeling to make their decisions. But with Lightup, we took away that entire class of data quality problems.”

Learn How Lightup Can Transform Your Data Quality

To use Lightup to transform your own data quality capability, reach out today.

Lightup brings order to data chaos. We give organizations a single, unified platform to accurately detect, investigate, and remediate data outages in real time.

To see if Lightup can solve your data outages, take the right next step.
