Monitoring data quality at scale using Monte Carlo

Create Monte Carlo monitors as code.

Gilboa Reif
Vimeo Engineering Blog
8 min read · Feb 24, 2022


We care a great deal about data quality. As data engineers at Vimeo, we take pains not only to make data accessible to everyone but also to guarantee data quality over time. We ingest and process data into our Snowflake data warehouse from a variety of data sources, which vary by format, frequency, origin, and more. With so much going on, it's challenging to fulfill the mission of ensuring data quality without a holistic view of our data warehouse.

That’s why we’ve turned to Monte Carlo as our data observability tool. If you want to know why and how we scaled its usage, keep reading this post.

Monte Carlo at Vimeo

You can think of Monte Carlo as a New Relic or Datadog for data engineers: a one-stop shop for all of your data ecosystem's observability needs. Monte Carlo offers a lineage feature, which lets you visualize how your data flows: from which data sources, down through which data models, and eventually into which reports (see Figure 1).

Figure 1: This figure is a screenshot from Monte Carlo’s lineage feature. It shows how the data flows in this example from the tables `staging.stg_delighted_data` and `analytics.bi_iso_country_reference` into the table `vimeo.fact_surevery_response`. On the right side you can see two reports running on BI tools consuming the table `vimeo.fact_surevery_response`. One report is running on Mode Analytics and the other on Tableau.

Incident management integrates natively with Slack, enabling you to be notified and to run the whole conversation, including incident status updates, in one place. This makes it super easy to keep your colleagues and stakeholders up to date on the status and impact (see Figure 2).

Figure 2: Monte Carlo Slack integration. Alerts are sent to dedicated Slack channels, and the status can be updated back to the UI through the Slack integration.

And of course — monitoring.

Monte Carlo monitors

Monte Carlo monitors are anomaly detection monitors based on ML, or machine learning. They learn the patterns in your data and alert you whenever your data is behaving abnormally.

Monte Carlo monitors can be categorized into two types: automatic monitors and custom monitors.

Automatic monitors

Automatic monitors are set up out of the box for every data set in your data warehouse. You don't need to worry about creating these monitors whenever you introduce a new data set. These monitors, which include freshness, volume, and schema, might sound basic, but they do a great job of catching staleness in your data sets, suspicious drops or increases in the volume of data, and schema changes.

The main reason that these monitors are created automatically is that they consume a relatively low level of computing resources, because they leverage metadata rather than actual data. Monte Carlo collects metadata continuously from Snowflake's information schema. It uses metadata such as LAST_ALTERED, ROW_COUNT, and BYTES from the information_schema.tables view to monitor freshness and size, and it uses GET_DDL() to monitor schema changes. Check out the Monte Carlo blog to get a sneak peek of how Monte Carlo leverages metadata for the most basic monitors.

Recently, a size anomaly monitor caught an outage affecting a number of backend Big Picture events at Vimeo (see Figure 3). One of my colleagues might have more to say about what Big Picture is in an upcoming post for the Vimeo Engineering Blog.

Figure 3: This figure shows a typical size alert that indicates there is an outage. The alert includes the affected tables and the reason that the alert was raised, which, in this example, is because no new rows were added or removed in the last seven hours. Luckily this was caught by Monte Carlo and resolved relatively quickly.

Custom monitors

The other type of monitor is the custom monitor. These monitors are created manually, since they're potentially more compute-intensive; I'll explain how to scale this option using monitors as code. They're also more sophisticated than the basic automatic monitors.

There are four types of custom monitors: field health, dimension tracking, JSON schema, and SQL rules. I’ll elaborate on the two types our team uses the most:

  • Field health monitors. A field health monitor collects metrics on each column of your data set and alerts when any of them is anomalous. These metrics include the percentage of null values, the percentage of unique values, and, for quantitative columns, summary statistics such as minimum, maximum, average, standard deviation, and percentiles.
  • Dimension tracking monitors. A dimension tracking monitor is best suited for low-cardinality fields. It alerts you if the distribution of values changes significantly.

We have found custom monitors to be useful in catching incidents and bugs that we can't anticipate. As opposed to Great Expectations or dbt tests, where you know what you expect or what you want to test over time, these monitors do a great job of catching everything you don't know you need to test. For example, an anomalous null percentage on a certain column might be a proxy for a deeper issue that is more difficult to anticipate and test.

These monitors can be created relatively easily via a wizard in the Monte Carlo UI (see Figure 4).

Figure 4: This figure is a screenshot of the custom monitor creation wizard in the Monte Carlo UI. Monitor creation is done manually, step by step. First, the table needs to be specified. Second, the fields that should be included in the monitor need to be specified. Last, the schedule for sampling the data is set.

One example where this is useful is setting such a monitor on the platform field of a data set that collects signups for a cross-platform product. Suppose that in your SaaS cross-platform product, 60 percent of users usually onboard from the web, 30 percent from iOS, and 10 percent from Android.

Now say that this ratio changes to 65 percent iOS, 25 percent web, and 10 percent Android (see Figure 5):

Figure 5: This chart shows a sample use case where, prior to the incident, 60 percent of users are onboarded from the web, 30 percent through iOS, and 10 percent through Android. In this use case, Monte Carlo triggered an alert since the percentage of users onboarding from iOS increased to 65 percent at the expense of the web, which decreased to 25 percent.

In this example, iOS signups have increased at the expense of web signups. This is suspicious and can imply that something is broken with web data collection, or something on the web product is broken, leading to fewer leads converting into users.

Monitors as code

As a team, we wanted to leverage Monte Carlo's monitors as code functionality. This feature enables the easy creation of monitors based on configurations stored in a YML file. Creating a Monte Carlo monitor is as simple as creating or updating a YML file with the monitor's config.
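
Here's a minimal sketch of what such a file might look like. The table and field names are hypothetical, and the exact keys can vary by version, so check the Monte Carlo monitors-as-code documentation:

```yaml
# Hypothetical monitors-as-code config (monitors.yml)
montecarlo:
  field_health:
    - table: analytics:vimeo.fact_signup   # database:schema.table (hypothetical)
      timestamp_field: created_at
  dimension_tracking:
    - table: analytics:vimeo.fact_signup
      timestamp_field: created_at
      field: platform                      # low-cardinality field to track
```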

This unleashes the power of their ML-based monitors in a scalable, reproducible manner. Internally our team discussed two complementary approaches for this:

  • Integrate with dbt, taking advantage of how well it fits into a dbt `schema.yml` file. For more information, see the Monte Carlo docs.
  • Create a standalone repo for creating Monte Carlo monitors, decoupled from dbt and for any use case.

We ended up deciding to start with the second option, the standalone repo. In general, monitors created on dbt models fit the first option, and everything else fits the second, including raw data ingested into Snowflake and data models that are transformed without dbt. The main reason we started with the second option is that, at the time, we were still shaping the way we wanted to implement dbt, and it seemed simpler to start with a standalone tool and, when the time came, integrate it into dbt with the lessons learned along the way. You can think of this repo as a centralized "Terraform for monitors": the same way cloud resources are managed and deployed as Terraform code (that is, infrastructure as code), this repo enables anyone, for any use case, to create, get a review for, and deploy a Monte Carlo monitor.

We have defined a CI/CD process with Jenkins.

For the more graphically inclined, the flowchart for this process, which I’ll describe below, appears in Figure 6.

Figure 6: This flowchart shows the shared responsibilities among an engineer creating the monitor, the reviewer, and the automatic steps performed by Jenkins.

Here’s how the process works.

The engineer or analyst who would like to create a Monte Carlo monitor is responsible for creating their monitor's configuration in a YML file and testing it locally by running `montecarlo apply --namespace $MC_MONITORS_NAMESPACE --dry-run`.

This does the following:

  • Fails if the YML file is misconfigured because of indentation and so on.
  • Prints out what the new configuration is going to deploy to the Monte Carlo platform. You can think of this as the equivalent of the `terraform plan` command.

See Figure 7 for an output example.

Figure 7: This figure is a screenshot of the output from running a dry run. This shows that Monte Carlo will deploy four monitors: one field health monitor and three dimension tracking monitors.

Once the config is tested locally, open a pull request, or PR. This kicks off a Jenkins job that does the following:

  • Builds a Docker image.
  • Installs requirements such as `montecarlodata`.
  • Runs `montecarlo apply --namespace $MC_MONITORS_NAMESPACE --dry-run`.
  • Adds a comment to the PR with the dry-run output, making it easy for the reviewer to understand what the PR is going to deploy to the Monte Carlo platform.

See Figure 8 for a typical PR comment added by Jenkins.

Figure 8: This figure shows a PR comment added by Jenkins containing the dry-run output. Having this output on the PR provides a summary of the changes the PR is going to make, which makes the review easy.

Finally, once the PR is approved and merged, Jenkins deploys the changes to Monte Carlo by running `montecarlo apply --namespace $MC_MONITORS_NAMESPACE`.

Next steps

We have recently implemented dbt Cloud at Vimeo. Monte Carlo monitors can easily be coupled with dbt by adding Monte Carlo monitor configurations under the `montecarlo` key in the dbt `schema.yml` file. This enables us to suggest that engineers and analysts create Monte Carlo monitors as part of the same PR that creates their dbt model. In addition, it makes a lot of sense to have all information regarding a dbt model in one place, including documentation, tests, and Monte Carlo monitors.
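
As a rough sketch, such a `schema.yml` entry might look something like the following. The model name and fields here are hypothetical, and the exact placement of the `montecarlo` key (for example, under the model's `meta` block, as shown here) should be checked against the Monte Carlo dbt integration docs:

```yaml
# models/schema.yml -- hypothetical sketch
version: 2

models:
  - name: fact_signup
    description: Signup events across web, iOS, and Android
    meta:
      montecarlo:            # placement under meta is an assumption; see the docs
        field_health:
          - timestamp_field: created_at
        dimension_tracking:
          - timestamp_field: created_at
            field: platform
```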

We also tend to route Monte Carlo notifications to specific Slack channels. This way, the right team gets notified, with the right context; otherwise, these alert notifications become white noise. Another enhancement along these lines would be to add a check to Jenkins that verifies whether a monitor is routed to a specific channel or falls back to the default channel that holds them all, which in our case is #montecarlo-monitors.

In conclusion

I hope this post has inspired you to take the next step in monitoring data quality at scale. Achieving a culture of data quality takes more than a single engineer, and I would say more than a single team. As data engineers, we want to build tools and frameworks that make it easy for anyone to contribute to data quality, and this is one step in that direction.
