Prevent Data Loss With This Free dbt Package

Because your data is worth a million bucks…

Piotr Herstowski
re_data
9 min read · Dec 30, 2022


Photo by Mick Haupt on Unsplash

This article was written by Madison Schott and was first published on the 18th of March, 2022.

Have you ever been writing a data model or completing an in-depth analysis only to find that something is wrong with your data? This may be one of the worst discoveries as a data professional. You spend so much time ensuring your data is reliable, with multiple quality checks, only for something else to go wrong.

I’ve been there. A few weeks ago I was validating a data model I had written, only to find that an extremely low volume of data was being ingested into one of our most important tables. After much digging, we discovered that this issue had been going on for 2 WHOLE WEEKS.

Luckily, we were able to fix it, but 2 weeks’ worth of data is a lot to lose. Especially when you’re a smaller business looking to create major growth using your data.

So, being the data problem solver that I am, I set out to find a solution that would help alert us when there was a discrepancy in our daily data volume.

Surprisingly, this was a difficult thing to find. We use Fivetran to ingest our data into a Snowflake data warehouse and neither of those has alerting for data volume. They don’t even have the option for me to write my own automated script!

What is re_data?

After a lot of frustration, I stumbled upon re_data, “an open-source & dbt native, data reliability framework built for modern data stack”. This framework allows you to create alerts, track metrics, and store dbt test history.

For tables, these metrics include data volume and freshness.

For columns, these metrics include aggregates, nulls, standard deviation, length, and more.

There is no shortage of what metrics you can track to create a more reliable data ecosystem.

re_data works by detecting trends in your data (and alerting you of them) as well as shipping macros that clean, normalize, and validate your data. By adding some configuration details to your YAML files, you can easily set it up to track the data within your warehouse.

In this article, I am going to teach you how to set up an alerting system that will send you a Slack message when your table’s data volume is far from its usual.

Downloading re_data

Whether you’re a dbt user or not, re_data includes installation instructions in their documentation. Luckily, this step is as easy as it gets. If you’re a dbt user, simply add the package name to your packages.yml file and run dbt deps.

Your file should look something like this.

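A minimal sketch of packages.yml follows; the version range is only an example, so check the dbt package hub for the current re_data release:

packages:
  - package: re-data/re_data
    version: [">=0.9.0", "<0.10.0"]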

You’ll also want to install the Python package using the command pip install re-data so that you can utilize the Slack notification feature as well as the re_data UI.

Setting up your monitoring config

The re_data documentation outlines how you can leverage dbt native configuration to work with their framework. For this tutorial, I am only monitoring row_count in one table. The code I’m writing will go at the beginning of the dbt SQL file that corresponds to the table you wish to monitor.

In order to set this up, I set re_data_monitored to be true. I then selected the time filter I wanted to use. Make sure the time filter is a column that specifies when the row in your table was created. In my case, it is event_created_at.

For example, if you’re monitoring a table of clicks on your website, your time_filter column would be the time that click occurred on your site.

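Concretely, the top of the model’s SQL file looks something like the sketch below. I’m assuming re_data_time_filter as the key for the time filter column, which matches recent re_data documentation, but verify it against the version you installed:

{{
    config(
        re_data_monitored=true,
        re_data_time_filter='event_created_at'
    )
}}

-- the model's usual select statement follows below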

Now, let’s navigate to the dbt_project.yml file. Instead of writing the re_data_metrics block within the config on the individual model, we are going to add it here. This will make it so that row_count is calculated for every model we add the re_data config to. You can also add other metrics to this dbt project file if you want them to be calculated for multiple models.

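As a rough sketch, the relevant section of dbt_project.yml could look like the block below. The my_dbt_project name is a placeholder, and the exact key (+re_data_metrics here) has shifted between re_data releases, so treat this as an illustration and confirm it against the docs for your version:

models:
  my_dbt_project:
    +re_data_metrics:
      table:
        - row_count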

You can also specify, in the dbt_project.yml file, the z-score threshold you want re_data to use to determine whether a metric is far from its baseline. The default z-score is 3, but you can always change this depending on the nature of your table. Personally, I changed mine to 2 because I would rather be alerted too much in the beginning than not enough. This is something you can always adjust later if you find it is alerting you too often.

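In practice that is a single var in dbt_project.yml; I’m assuming re_data:alerting_z_score as the var name, which is what the re_data documentation uses, so double-check it for your version:

vars:
  re_data:alerting_z_score: 2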

Now, we can add that same config object to the top of any of our models’ corresponding SQL files and row_count will be computed. For all these models, alerts will be created when the z-score passes 2.

Data warehouse configuration

Now it’s time to run the framework! You can run it using the command:

dbt run -m re_data

Notice that this is the same command you’d use to run any other dbt data model that you’ve created.

After running this command, I want you to navigate to your data warehouse and look for any changes. You will notice that re_data created two schemas, RE and RE_INTERNAL.

Keep in mind that these schemas will be created in the database defined in your profiles.yml file. In my profile, I have two targets: one for development and one for production, each with a different database defined. Depending on your target, these schemas will populate in that database.

For example, when I am developing dbt models locally, my target is set to dev in my profiles.yml, so the two re_data schemas are created in my development database. In production, my target is set to prod, so when I run the same code there, the schemas are created in my production database.

You can read more about setting up your profile in dbt’s documentation.
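
For reference, a stripped-down Snowflake profiles.yml with two targets might look like the sketch below; every name in it is a placeholder rather than a value from my actual project:

my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account
      user: my_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS_DEV    # re_data schemas are created here when the target is dev
      schema: DBT
      threads: 4
    prod:
      type: snowflake
      account: my_account
      user: my_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS_PROD   # and here when the target is prod
      schema: DBT
      threads: 4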

The RE schema in your default data warehouse contains various tables that were created from the macros within the package. The ones we will want to pay attention to are RE_DATA_BASE_METRICS and RE_DATA_Z_SCORE.


You can also see it creates 4 different views. We will want to look at the RE_DATA_ALERTING view. This contains any alerts that are created from the YAML configuration we provided.
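
A quick way to check it is to query the view directly; adjust the database name below to wherever re_data built its models for your target (mine is a placeholder):

-- list any alerts re_data has raised so far
select *
from analytics_prod.re.re_data_alerting;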

As for the RE_INTERNAL schema, you won’t need to look at it. It is used internally by the package to build the end tables and views we just discussed.


re_data tables created

re_data_base_metrics

Let’s look at the “re_data_base_metrics” table. You can see there are different rows depending on when you ran your models and the “metrics_base” we defined in our dbt_project.yml. Since we only defined row_count in that file, we will only see row_count in the metric column.

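You can also query the metric history directly to watch how row_count evolves over time. This is only a sketch; the column names below match what I have seen in recent re_data versions but may differ slightly in yours, and the database name is a placeholder:

-- daily row_count values recorded for the monitored tables
select table_name, metric, value, time_window_start, time_window_end
from analytics_prod.re.re_data_base_metrics
where metric = 'row_count'
order by table_name, time_window_start;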

re_data_z_score

Next, let’s look at the “re_data_z_score” table. When you have at least 2 rows of data for the metric you are trying to calculate, you will see at least one row in this table. If you only have one row, a z-score can’t be calculated. Keep in mind that this also means the tool gets more and more effective the longer you use it. If you recall the Law of Large Numbers, the mean gets closer to the true mean of the population as the sample size increases.


If you recall, we set the z-score to be 2 in our YAML file. Because the z-score in this table is under 2, there will be no alerts triggered. If this z-score was above 2, an alert would be triggered and a row would be created in the “re_data_alerting” table.
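
For intuition, the z-score is simply how many standard deviations the latest value sits from the historical mean: z = (latest_value - mean) / standard_deviation. Say a table usually ingests around 10,000 rows a day with a standard deviation of 500; if only 8,500 rows arrive today, then z = (8,500 - 10,000) / 500 = -3, which is well beyond a threshold of 2 in absolute terms, so re_data would raise an alert. (These numbers are made up purely for illustration.)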

Running re_data in production

Now, you may be wondering, how do I run this in production? Chances are you want this alerting to work automatically, without much thought. You don’t want to have to run this command locally every day.

Personally, I use Prefect for deploying my dbt data models, so I chose to use that for re_data as well. Because of how it’s built, you’ll want to run it the same way you run your dbt models. For you that may be with Airflow or Dagster, but for me it’s with Prefect.

I use dbt_tasks within a Python file to run all my dbt models. I have a file full of these tasks which I then copy onto a Docker container. For re_data, I simply made a new dbt_task where I run the model.

dbt_task(command='dbt run -m re_data', task_args={"name": "re_data monitoring"})

I also make sure I set an upstream dependency for my web events table since that is the table I’m monitoring using re_data. If you’re using it to monitor all of your tables, make sure you set a dependency for all of your data syncs.

Setting up a Slack command

re_data just released a feature that allows you to send a Slack message with alerting updates. This is ideal for an autonomous data environment where you only want to pay attention to things if they go wrong.

I try to set up my data environment so I don’t have to monitor it on a daily basis. When something goes wrong, that’s when I want to be notified. re_data does just this.

Before running the command, make sure you have the most up-to-date version of the package. This is a fairly recent release, so if you’re not using the newest version, the command will fail.

Be sure to run dbt deps after changing the version or this command will once again fail.

The re_data Slack command looks as follows:

re_data notify slack --end-date $(date +"%Y-%m-%d") --webhook-url {webhook_url}

If you want your Slack notification to be generated every day, you can skip specifying a start date. Instead, just specify your end date as the current date in the correct format. $(date +"%Y-%m-%d") does just this when running a Bash command.

You then need to include a webhook-url which you can generate on Slack. These webhooks are created specifically for the channel you want the alert to be sent to.

I included this command within a dbt shell task in my Prefect environment. It utilizes dbt credentials, so you don’t want to use just an ordinary shell task.


Again, if you’re using Prefect, make sure you set your upstream dependencies. For the Slack command, I set a dependency on the re_data dbt run.


If you are running this command locally, rather than deploying to production, re_data has a UI to take full advantage of. You would run a command similar to the Slack command, but this would generate the UI.

re_data overview generate --end-date $(date +"%Y-%m-%d") --interval days:1

Then, as long as you’ve already generated some models using the package, you can “serve” the UI.

re_data overview serve

This will open in your browser and allow you to see all of your alerts as well as the lineage for your models.

Conclusion

Now we have a data environment with a helpful alerting system in place. We don’t have to live in fear that something could be going wrong with our data at the source. This data volume alert is a great way to ensure everything is running as expected.

Don’t wait for something to go wrong in your data pipeline. Implement checks and balances like this dbt package now before running into an issue. Trust me, you don’t want to lose 2 weeks’ worth of data like I did. Data quite literally equals money and important insights! Protect it at all costs.

The best part about re_data is that it’s free! You don’t have to pay an extra cost on top of an already expensive data stack. You can utilize all of the amazing features of dbt to your advantage.

For more advice on tools and packages to improve your modern data stack, subscribe to my newsletter.
