How to Generate Free Data Quality Reports

Using the latest open-source tool, re_cloud

Piotr Herstowski
re_data
8 min read · Dec 30, 2022



This article was written by Madison Schott and was first published here on December 15, 2022.

There’s a reason data quality is talked about so often: it’s important! If you don’t have high-quality data, you might as well have no data at all. The quality of your data directly determines the quality of your insights into the business.

When implementing data quality initiatives, you should always start with your source data. When you focus on data at the source, you can catch issues before they trickle into downstream data models. This ensures business teams are looking at data that has already been tested and that they know is reliable when they use it.

A large part of data quality is also ensuring this source data and its downstream data models are updated as expected. Business teams need to know when they are going to get the data they need, and on the cadence that they need it. The data has to be fresh for insights to be accurate.

re_data is an open-source tool that makes it easy to measure data quality. It allows you to set different alerts that will let you know when there are anomalies in your data. Some of my favorite metrics to track are freshness, variance, and data volume. re_data also allows you to easily set up Slack and email alerts so you get notified in the place you use the most.
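To give a rough sense of how this looks in practice, you mark the models you want monitored inside your dbt project. Below is a minimal sketch; the project name, model name, and time column are placeholders, and you should check the re_data docs for the exact config options your version supports.

models:
  my_project:                           # your dbt project name (placeholder)
    orders:                             # any model you want re_data to monitor (placeholder)
      +re_data_monitored: true          # compute metrics such as freshness, volume, and variance
      +re_data_time_filter: created_at  # time column used to window those metrics (placeholder)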

I recently learned that re_data also has a cloud product to help you gather all of your data quality reports in one place. Let’s explore this a bit more!

What is re_cloud?

re_cloud is a user interface that allows you to store and collaborate on data quality reports from re_data, other open-source tools, and custom-built data tools. It consolidates all of your most important information in one location, allowing you to get a full picture of your data. You simply download HTML reports produced by outside tools, or generate reports from your local environment, and upload them to one central location on the cloud.

re_cloud integrates with many data tools such as Great Expectations, Pandas, and Trino, as well as data warehouses like Postgres, Redshift, BigQuery, and Snowflake. In this tutorial, I will walk you through centralizing re_data reports and dbt docs in re_cloud.

Benefits of having your reports in one place

Before we get into the tutorial, let’s discuss why you would even want all of your data quality reports in one place. I want you to think about your cloud data platform. This acts as your single source of truth: the location where data from all of your various data sources is ingested. The data team and business stakeholders know they can depend on this cloud data platform to have the most accurate, up-to-date data.

Having a central location for the most important data quality information on your data sources and dbt models follows the same idea. Rather than checking dashboards and metrics in individual tools, and not knowing which one to depend on, the data team can navigate to one single source of truth. Here, they can get a holistic view of the entire data ecosystem, from ingestion to orchestration.

re_cloud also deploys dbt documentation alongside these quality reports. Having your documentation sit right next to these reports saves you time and effort moving between the two UIs. Because the documentation is built within the product, you can easily compare expectations to what is actually happening in your data pipeline.

Seeing dbt model lineage, a feature deployed with dbt docs and re_data, allows you to measure the impact of a quality alert on downstream data models. Seeing the dbt tests written on a column allows you to understand what a column’s values should be compared to what they actually are. It’s little things like these that can save a lot of time when you’re attempting to solve a serious production issue.

One of the reasons I started using re_data was how easy it makes it to look at the whole picture. I see the information I want to see, and only that, because the tool makes it easy to deliver alerts to the location I want them. Now, re_data is taking that up a notch by letting you see not only its alerts but alerts from other tools as well.

How to set up the cloud

Keep in mind that re_data assumes you are a dbt user. The cloud product works directly with dbt projects and, optionally, with re_data. First, you need to install the re_cloud Python package.

pip install re_cloud

Next, you need to generate an API key on the re_cloud platform. Be sure to create an account first; then navigate to your user profile in the top right corner. Click “Account Settings” and then “Api Key”. There, you will find an access key that you can copy.


We are going to add this to the YAML file where you have re_data set up already. I recommend putting this in a separate re_data.yml file if you don’t have a file created for re_data yet. Your re_cloud block with the API key should look like this:

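Here is a rough sketch of that block; the placeholder value is where your copied key goes, and it’s worth double-checking the exact key name against the re_cloud docs.

re_cloud:
  api_key: your-api-key-copied-from-account-settings   # paste the access key from Account Settings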

Be sure to save this YAML file in the ~/.re_data/ directory. This ensures re_data is looking in the right place.

Generating reports in re_cloud

Now we are ready to start generating some reports!

dbt docs

Let’s start by generating our dbt documentation. If you didn’t already know, dbt has a feature that populates the information defined in your YAML files into a nicely formatted UI. This acts as a clean data catalog that users on the data team, or even business teams, can look at for more details on a source or model. It also displays model lineage and the tests applied to different data assets. To generate these docs, you run the command:

dbt docs generate
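For reference, the information that ends up in the docs UI comes from the YAML files in your dbt project. A minimal schema.yml sketch, with placeholder model and column names, looks something like this:

version: 2

models:
  - name: orders                      # placeholder model name
    description: "One row per order placed on the site"
    columns:
      - name: order_id
        description: "Primary key for the orders model"
        tests:
          - unique                    # these tests also show up in dbt docs
          - not_null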

Then, to upload them to re_cloud, run the following:

re_cloud upload dbt-docs

Keep in mind that you need to have dbt-core installed here. For some reason, I did not, and this command failed. You can install dbt-core by running the following command:

pip install dbt-core

After running this and running the re_data command again, I got a message that my dbt docs had been uploaded successfully to re_cloud.


If you navigate to re_cloud, you should see your report uploaded.


If you click on the top block, the dbt docs will open in a separate tab. You can now further explore your dbt project, tables, column definitions, and more!


re_data

Now let’s generate a report for the re_data package. I track most of my key metrics, such as data volume and freshness, using re_data, so uploading this report to re_cloud is super important for me. First, to generate the report, run the following command:

re_data overview generate

Then, to upload the report to the UI, run:

re_cloud upload re-data

You should get another 200 status code, meaning the upload succeeded, just as you did for the dbt docs.


When you navigate to the UI, you should now see two blocks: one for dbt_docs and the other for re_data_overview.


After clicking on the re_data_overview block you’ll see the main features of re_data such as alerts, lineage, tests, tables, and metrics. This gives you all of the information you need on data quality in a friendly, visual format.


Scheduling your re_cloud reports

I recommend creating a system where someone on your team generates these reports weekly and uploads them to re_cloud. Better yet, you could automate these commands to run directly within your data pipeline, ensuring re_cloud is updated daily.

Personally, I use Prefect to orchestrate my data pipeline. Using this tool makes it easy to set dependencies between tasks. With Prefect, I already have my daily re_data tests triggered when my models finish running every morning. I do this by running the following command in my Prefect flow:

dbt_task(command="dbt run -m re_data", task_args={"name": "re_data monitoring"})

In order to generate and upload reports to re_cloud, we would use the same commands that we ran locally on the CLI. However, this time we would wrap them in tasks within our Prefect flow, using trigger_dbt_cli_command from the prefect-dbt package.

This would look like so:

generate_re_data = trigger_dbt_cli_command(
    "re_data overview generate"
)
generate_dbt_docs = trigger_dbt_cli_command(
    "dbt docs generate"
)
upload_re_to_re_cloud = trigger_dbt_cli_command(
    "re_cloud upload re-data"
)
upload_docs_to_re_cloud = trigger_dbt_cli_command(
    "re_cloud upload dbt-docs"
)

Here, I am generating my re_data report and dbt docs in two separate tasks. I am then uploading them using two different commands, which I’ve assigned to the variables upload_re_to_re_cloud and upload_docs_to_re_cloud.

Keep in mind that it is important to set dependencies in orchestration tools like Prefect. You don’t want the re_cloud upload commands to run before the reports have been generated, so make sure you set the two generate tasks as upstream dependencies.

Setting these upstream dependencies in Prefect means passing wait_for when calling the upload tasks, so the two upload calls above become:

upload_re_to_re_cloud = trigger_dbt_cli_command(
    "re_cloud upload re-data", wait_for=[generate_re_data]
)
upload_docs_to_re_cloud = trigger_dbt_cli_command(
    "re_cloud upload dbt-docs", wait_for=[generate_dbt_docs]
)
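Putting it all together, a full flow might look roughly like the sketch below. A couple of assumptions worth flagging: the flow name and model selection are made up, and I’m assuming trigger_dbt_cli_command (from the prefect-dbt package) is happy to shell out to the re_data and re_cloud CLIs the way the snippets above suggest; if it isn’t in your setup, a plain shell task works just as well.

from prefect import flow
from prefect_dbt.cli.commands import trigger_dbt_cli_command

@flow
def daily_data_quality_reports():
    # Run the dbt models and the re_data monitoring models first
    run_models = trigger_dbt_cli_command("dbt run")
    run_re_data = trigger_dbt_cli_command(
        "dbt run -m re_data", wait_for=[run_models]
    )

    # Generate both reports once the runs have finished
    generate_re_data = trigger_dbt_cli_command(
        "re_data overview generate", wait_for=[run_re_data]
    )
    generate_dbt_docs = trigger_dbt_cli_command(
        "dbt docs generate", wait_for=[run_models]
    )

    # Upload each report only after it has been generated
    trigger_dbt_cli_command("re_cloud upload re-data", wait_for=[generate_re_data])
    trigger_dbt_cli_command("re_cloud upload dbt-docs", wait_for=[generate_dbt_docs])

if __name__ == "__main__":
    daily_data_quality_reports()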

When you implement these commands directly within your data pipeline, everything runs seamlessly. You know that when your pipeline runs every morning, you will also have updated data quality reports waiting for you in re_cloud.

Conclusion

Having a central location like re_cloud is a great way to keep track of data quality across your entire data stack, no matter the tool. A central location gives you a holistic view of the status of your data, allowing you to easily compare across reports and identify the cause of any quality issues.

Using a tool like re_cloud builds a culture of transparency within your company: one where everyone can look into data quality, not just the data team. Tools like this make it easier for non-technical people to understand the data that they use every day. It’s a powerful way to unite data and business teams around a common goal.

For more on producing high-quality data, check out my free weekly newsletter on analytics engineering.
