Data Observability at scale using Monte Carlo

Serge Bouschet
Checkout.com-techblog
12 min read · Oct 19, 2022

In a previous blog entry, we explored Checkout's data platform solutions to test and monitor data at scale. In this post, we look at the natural next step in that journey: adopting a data observability platform, Monte Carlo. We will cover our motivations, the challenges we faced and how we scaled data observability across different projects, as well as the limitations around designing and scheduling tests, how we set up Monte Carlo and offered it to the wider data community, and lastly the improvements we're looking forward to seeing in our observability platform.

Background

Checkout.com started a proof of concept with Monte Carlo in early 2022. The initiative stemmed from an earlier attempt to build a latency dashboard ourselves on top of dbt freshness monitoring, and from a wish to get more visibility into the data pipelines running on our platform.

While building a latency dashboard based on dbt, we soon faced two problems. Firstly, monitoring was configuration based: each time a new pipeline was built or new data assets were deployed, we had to ensure SLAs were set and correct, which was a real challenge in terms of scalability. Secondly, defining correct freshness rules was problematic. Data ingestion can follow patterns or trends that are difficult to track when thresholds are set in stone. For example, some data is loaded through batch processes that follow morning or weekday patterns. Defining a threshold, in this instance, requires you to determine the biggest acceptable gap between two loads, which means choosing between minimising noise from alerts and running the risk of missing load events. In addition, patterns change over time, and it's not scalable to expect data owners to proactively adjust alert thresholds across all datasets ingested into the warehouse. Our conclusion was that a manual approach leads to many false positives and alert fatigue.

Figure 1: dbt sources yml file with meta and freshness information
Figure 2: Looker Data Latency Dashboard built from dbt project configuration
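
For context, the sketch below shows the kind of dbt sources configuration Figure 1 refers to. The source, table and meta values are illustrative placeholders rather than our actual setup.

```yaml
version: 2

sources:
  - name: payments_raw            # illustrative source name
    database: RAW
    schema: PAYMENTS
    loaded_at_field: _loaded_at   # column dbt uses to compute freshness
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    meta:
      owner: data-platform        # illustrative metadata surfaced in the Looker dashboard
      sla: "24h"
    tables:
      - name: transactions
      - name: disputes
```

The pain point described above is that every new source needs thresholds like these to be chosen by hand and kept up to date as load patterns drift.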

Another driver for seeking a data observability solution was to get more visibility and insights into constantly evolving data pipelines. There are only so many ad-hoc queries you can run to figure out how much data arrived, how often and when, yet that is exactly what we used to do when faced with questions and issues.

Finally, we also had to deal with a number of incidents that, although not always preventable at the time, could have been detected earlier to minimise business impact.

Evaluation and adoption

We first evaluated Monte Carlo’s anomaly detection capability by setting up alerts on a number of datasets deemed “critical”, where load patterns are well known by many. Once Monte Carlo collectors are set up and ready to gather metadata from your Snowflake account, the auto ML monitors need a couple of weeks to gather data and learn patterns. At the start of the PoC we defined a target for the machine learning monitor accuracy, but after allowing the anomaly detection to run and raise alerts, we found more benefits than expected.

Since Monte Carlo was configured to read from production databases, one of the challenges with this evaluation was that we could not trigger incidents on demand. On the one hand, it is important to evaluate a capability on real data: QA environments do not generate enough data, no real incidents occur, and load patterns can be unpredictable and would likely generate many anomalies. The downside is that we simply had to let the tool run and wait for true anomalies to occur.

Finding #1 — lots of metadata available!

Figure 3: Example of a Catalog page

Monte Carlo collects a large amount of metadata from Snowflake in order to build its observability features: freshness, volume, statistical distributions, access patterns, query counts, lineage, etc. Because this data is collected automatically, the tool can offer a rich catalogue of information about your datasets. The first thing we noticed was access to this wealth of information. In many ways it confirmed what we already knew about commonly used datasets, but it also improved the discoverability of many others. When receiving an inquiry from users, we no longer had to wonder how a table was used, how it was updated, how big or small it was, or what changes had occurred. The catalogue page, with its summarised view of each dataset, became our first point of reference to understand and visualise a day in the life of our data.

Finding #2 — no need to set thresholds and run tests!

A PoC domain was defined to select a dozen critical database schemas and datasets we wanted to monitor.

Domains in Monte Carlo simply let you define a collection of tables or views by selecting a combination of tables, schemas or databases. Domains can be used to create notifications and authorization groups as a way to adjust the scope without having to redefine a list of tables every time.

Within a few weeks of running our PoC, we quickly realised that fluctuations in load timings and volumes were successfully detected. Although our initial goal was to achieve close to zero false positive alerts, we soon learned to take advantage of the incident management interface by flagging incidents with the appropriate status. Investigating incidents became more natural, and investing time in incident resolution proved to be time better spent than constantly building tests and adjusting thresholds manually. Our focus shifted from reactively learning patterns and adjusting tests by hand to proactively taking action and investigating potential issues.

Figure 4: Incident raised in Slack
Figure 5: Our definitions for incident statuses

From evaluation to a new data platform service 🚀

With the great potential of data observability at our disposal, the next challenge was: how do we offer this platform to all our data providers and consumers?

Our data ecosystem is organised in such a way that the data platform acts as a central team, offering services and tools to enable other teams to derive insights from the data.

Here are the building blocks we developed to ensure this platform would remain scalable and governed:

Access Control

We enabled access to Monte Carlo through Fresh Service, our service catalogue solution, and Okta for Single Sign-On. In essence, any Snowflake user is able to request access to Monte Carlo and, once approved, can access the platform with the Viewer role.

Figure 6: Access request to Monte Carlo

We then determine which project they belong to and assign a more specific Authorization group associated with their Domain. A project is usually associated with a separate analytics database.

We leverage the following roles:

  • Viewer: Users who are not producing or monitoring data, and who use Monte Carlo to troubleshoot or gain insights from the available metadata
  • Authorization group (Editor): Data producers who need to monitor their project
  • Account Owner: Data Platform administrators

Leveraging Monte Carlo domains

Domains are essentially a way to define a collection of tables in Monte Carlo. The simplest scenario is to assign a project database as a domain and use it to build a new authorization group. However, domains also allow us to group schemas or tables to cover scenarios where a database or schema is shared between more than one project. The intention here is to ensure that project members have permission to create monitors and resolve incidents on the datasets they own.

Figure 7: Domain selection

Monitor as code for the masses

While the UI offers great flexibility for quick experimentation, we wanted a configuration-based solution to manage monitors at scale. Monte Carlo's monitors as code covers this need, and the ability to deploy monitors using Github actions as part of our default dbt project template was an important feature for us. We currently support around twenty dbt projects on our platform; they all leverage generic packages and services that enable them to easily test, deploy and schedule dbt models in Snowflake, as illustrated in a previous blog entry. To enable this, service accounts are created for projects that wish to deploy monitors as code, which avoids relying on API keys created from individual user accounts.
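
To make this more concrete, a monitors-as-code definition committed alongside a dbt project might look like the sketch below. The namespace, database and table names are placeholders, and the keys and CLI invocation should be checked against Monte Carlo's monitors-as-code documentation rather than read as our exact setup.

```yaml
# montecarlo/monitors.yml (illustrative sketch)
namespace: my-dbt-project        # keeps this repository's monitors separate from other projects
montecarlo:
  field_health:
    - table: analytics:reporting.payments_daily   # db:schema.table, placeholder name
      timestamp_field: loaded_at

# Applied from CI with the project's service account API key, e.g.:
#   montecarlo monitors apply --namespace my-dbt-project --project-dir montecarlo
```

Deploying this from CI with a project-level service account, rather than a personal API key, is what keeps the setup governed as the number of projects grows.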

The ability to deploy monitors within a dbt repository is a great addition to our service catalogue! We can offer these solutions as a service, which is critical to harmonising and scaling the platform.

The screenshot below illustrates how actions and monitor deployments are currently set up.

Figure 8: Github action yml file
Figure 9: Github action build summary
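
As a rough illustration of the shape of such a workflow (the real one shown in Figures 8 and 9 differs), here is a hedged sketch assuming the montecarlodata CLI and a service-account API key stored as repository secrets; step names, paths, secret names and CLI flags are illustrative and should be checked against the current Monte Carlo CLI documentation.

```yaml
# .github/workflows/deploy-monitors.yml (illustrative sketch, not our actual workflow)
name: deploy-monte-carlo-monitors

on:
  push:
    branches: [main]
    paths:
      - "montecarlo/**"            # only redeploy when monitor definitions change

jobs:
  deploy-monitors:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install the Monte Carlo CLI
        run: pip install montecarlodata
      - name: Apply monitors
        env:
          # Project-level service account key, not a personal API key
          MCD_DEFAULT_API_ID: ${{ secrets.MCD_API_ID }}
          MCD_DEFAULT_API_TOKEN: ${{ secrets.MCD_API_TOKEN }}
        run: montecarlo monitors apply --namespace my-dbt-project --project-dir montecarlo
```

In this sketch the job only runs when monitor definitions change, so dbt model deployments and monitor deployments stay decoupled.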

Use Cases covered

So, a few months after onboarding the platform to our service catalogue, how is Monte Carlo used at Checkout.com?

These are the main use cases in our community:

  • Leverage no-config, out-of-the-box monitors to create notifications and prevent data incidents
  • Schedule custom SQL tests (see the sketch after this list)
  • Give users the ability to easily add monitors (no dbt or Airflow required)
  • Create advanced monitors: field health / dimension tracking
  • Integration of data incidents with our data catalogue
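
To make the custom SQL and advanced-monitor use cases more concrete, here is a hedged monitors-as-code sketch; the SQL, tables and fields are invented for illustration and the exact keys should be verified against Monte Carlo's documentation.

```yaml
montecarlo:
  custom_sql:
    - description: Every payment must reconcile to an order   # illustrative rule
      sql: |
        select p.payment_id
        from analytics.reporting.payments p
        left join analytics.reporting.orders o
          on o.payment_id = p.payment_id
        where o.payment_id is null    -- offending rows surface as an incident
      schedule:
        type: fixed
        interval_minutes: 60
        start_time: "2022-10-01T00:00:00"
  dimension_tracking:
    - table: analytics:reporting.payments    # placeholder table
      timestamp_field: loaded_at
      field: payment_status                  # distribution of this field is tracked over time
```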

There’s increased flexibility and insight gained since adopting Monte Carlo. Our approach to data monitoring has evolved considerably:

Each project uses the tool to a level that is suitable to their needs. We do not yet enforce standards. However, the more critical data is to a project, the more we expect projects to follow a maturity level.

Figure 10: Internal project maturity levels

Integration with Airflow and dbt

Since Airflow and dbt are important tools for building analytics pipelines, integrating them with Monte Carlo provides many benefits. Monte Carlo integrates with dbt Core by importing the manifest.json and run_results.json artefacts in order to surface dbt's metadata. We have enabled this in Airflow by adding an upload-artefacts task to our workflows. This gives additional context to the tables and views available in the Catalog, with model descriptions and tags immediately available in the Monte Carlo UI. Pushing run_results files to Monte Carlo also provides insights into model execution status and times; this metadata can be used to raise incidents during model refreshes.

Figure 11: Airflow task execution
Figure 12: Airflow task log
Figure 13: Catalog page with dbt metadata in Monte Carlo

Lessons learned with data testing and monitors

We initially migrated a lot of dbt tests to Monte Carlo monitors, taking advantage of the flexibility to create and schedule data tests. However, we found that not all data tests are good candidates for Monte Carlo monitors. For example, running a test straight after a data refresh to make sure a consistent dataset was produced requires precise timing within a workflow. Monte Carlo monitors execute on a periodic schedule, and it can be difficult to debug false positives when a custom monitor happens to run in the middle of a data refresh.

Another challenge was code repetition. While dbt comes with the power of Jinja templating, monitors felt like a step back, with blocks of SQL copied and pasted across multiple monitors.

For these reasons, we are now looking to use dbt tests and Monte Carlo in tandem: dbt tests act as the execution engine, running as soon as jobs finish, while the dbt integration lets us keep using Monte Carlo for the incident management workflow.
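
A minimal example of the kind of test that now stays in dbt rather than becoming a Monte Carlo monitor is sketched below (model and column names are illustrative); its results land in run_results.json, which the Airflow upload task described earlier pushes to Monte Carlo.

```yaml
# models/reporting/schema.yml (illustrative)
version: 2

models:
  - name: payments_daily
    description: Daily aggregation of payment events
    columns:
      - name: payment_id
        tests:
          - not_null
          - unique                 # runs right after the model is built, in the same job
      - name: payment_status
        tests:
          - accepted_values:
              values: ['authorised', 'captured', 'refunded', 'declined']
```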

What we’re missing from Monte Carlo

The tool is constantly evolving, with new features released regularly. At the time of writing, we're aware of many exciting features on the roadmap. However, the list of enhancements below highlights some of the things we're still missing in Monte Carlo.

Domain as code

Domain creation is still a manual process managed within the UI. In many cases this is suitable (e.g. when a project entirely owns a database). However, we do have scenarios where we'd like to track datasets exposed to specific sub-domains, whether based on criticality or on specific use cases, and we need to do this in a more dynamic way. We know domain as code, or domain creation via the API, is coming, and we will certainly leverage this feature.

More Granular RBAC

Custom Authorization Groups are a very useful new feature. However, some of our domains are quite dynamic, and only Admins can edit domains. Ideally, we would like to allow users to add datasets to their own domains or, better, mirror the permissions granted in Snowflake to streamline this process.

Ability to create service accounts

At the moment, we can only invite users with a real email address to Monte Carlo. We need service accounts for projects using the Monte Carlo API for use cases like monitor as code. The Monte Carlo support team can enable this via a workaround. Ideally, we’d like to be able to create service accounts and API Keys via the Admin role.

Tracking data latency

Out-of-the-box monitors track a lot of useful metrics, including freshness and volume. However, we've had to deal with more complex situations where source-to-target latency led to missed SLAs. In this scenario, data keeps arriving regularly, so freshness is unaffected, while the delay between the event timestamp and the time the data is loaded into the warehouse keeps growing because of throughput issues in the pipeline. Volume alerts should arguably fire in this scenario, but we'd love to see a dedicated way to track and notify us of latency issues in our pipelines.
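
As a purely illustrative stop-gap, not a Monte Carlo feature or our production setup, this kind of latency could be approximated with a custom SQL rule along the following lines; the table, columns and threshold are invented.

```yaml
montecarlo:
  custom_sql:
    - description: Source-to-warehouse latency breaching a 60-minute SLA   # illustrative
      sql: |
        select
          avg(datediff('minute', event_timestamp, loaded_at)) as avg_lag_minutes
        from analytics.reporting.payment_events
        where loaded_at >= dateadd('hour', -1, current_timestamp())
        having avg(datediff('minute', event_timestamp, loaded_at)) > 60   -- row only when the SLA is breached
      schedule:
        type: fixed
        interval_minutes: 60
        start_time: "2022-10-01T00:00:00"
```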

More flexible notifications

We love the multiple integrations Monte Carlo offers to raise alerts (Slack, PagerDuty). However, notifications can be hard to manage. The interface is flexible but does not allow users to visualise or test the outcome of settings. On a few occasions, users had to deal with unexpected behaviours which were hard to debug.

Enhanced metadata in Snowflake data share

Having all the insights available in a Snowflake data share is extremely convenient because it removes the development needed to build an extract process using the REST API. We can just query the data using SQL. We’d love to have access to more historical data, and we’d also like to see an increase in the frequency of data collected.

Sandbox/Testing environment

Currently, Monte Carlo does not offer the option to create multiple workspaces or accounts. We run Monte Carlo against our production datasets, which covers most of our needs. However, testing and experimentation can be noisy and generate lots of false positives, and it's difficult to experiment with a new version of a monitor when there is no segregation and no way to flag what is QA and what is live on the same platform.

Conclusion

Our journey through data observability has brought many improvements in terms of the reliability of our pipelines. As explained in this post, we have been able to offer this capability as a new service to our catalogue and integrate Monte Carlo with our CI/CD process and workflows. There’s an increased awareness of the data lifecycle, less room for assumptions, and added control and visibility on incidents. In their Quick Start Guide, Monte Carlo states that one of its objectives is to help data teams manage reliability in the same way that Datadog helps engineers manage the reliability of their platform. A data observability platform certainly fills a gap. We’re looking forward to seeing the platform evolve in order to fully integrate with a data catalogue, enabling an owner-operator model that promotes fixing broken data and processes at the source.

Thanks for reading this blog entry on our adoption of a data observability platform. In future posts, we will cover how we enable dbt CI/CD through Github actions across all projects at Checkout, and how we manage Airflow deployments without any downtime. Please keep an eye out for more in-depth reading on our tech blog.
