Streamlining DataPlatform Monitoring and Alerting on GCP

Leveraging Rule-Based Systems for BigQuery Health and Beyond

Rémy Larroye
Google Cloud - Community
6 min read · Feb 4, 2024


Within the realm of managing a data platform, the sheer variety and quantity of resources can be overwhelming: servers, databases of various types, and data input sources ranging from streaming and batch processing to flat files. Each resource has its own lifecycle and characteristics. This complexity translates into the need for a continuous, automated monitoring mechanism. Overseeing hundreds of data streams, each with its own lifecycle, calls for an approach that is both comprehensive and proactive.

Image generated with DALL·E 3 by Rémy Larroye

Data Lake Health: The Backbone of Data Reliability

For a data team, the health of the data lake is the heartbeat of the analytics ecosystem. Ensuring the integrity and health of its tables is paramount, as they form the foundation for informed decision-making. The automated monitoring and alerting system introduced by our team has therefore extended its scope to include BigQuery health checks. By continuously measuring the health of these tables, the team gains valuable insight into the overall stability of the data platform. This proactive approach acts as an early warning system, ensuring that major errors are flagged and addressed promptly.

The Rule Engine: Powering Efficient Monitoring

At the core of this system lies a rule engine (I’m a big fan of rule engines; I’ve already written an article on using one for automatic data replay in a data pipeline, here).

This framework enables the execution of predefined rules on incoming data, allowing for the identification of anomalies based on various criteria. For instance, rules can be designed to monitor the frequency of updates within specific time windows or to detect extended periods without updates.

Here’s a rule checking that the raw_sales.SALES_DAY table is updated at least every three days (max_diff_time is the maximum allowed gap between two inserts, expressed in minutes, so 4320 corresponds to three days):

{
  "name": "Minimum once every 3 days",
  "max_diff_time": 4320,
  "tables": [
    "raw_sales.SALES_DAY"
  ]
}

Here’s a rule that uses a regex to verify that all tables in the raw_stock dataset are updated once or twice a day between 11 PM and 8 AM. We’ve also added the email address of the team responsible for these tables, so that they are automatically notified by email when errors occur on them.

{
  "name": "Once or twice every day between 11PM and 8 AM",
  "cron": ["* 23 * * {yesterday_week_number}", "* 0-7 * * {day_of_week_number}"],
  "min_number": 1,
  "max_number": 2,
  "maintainer": "support@email",
  "tables": [
    "^raw_stock.*$"
  ]
}
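To give an idea of how such a windowed rule could be evaluated, here is a minimal sketch (not the team’s actual implementation) that counts the inserts falling inside a cron-defined window using the croniter package. It assumes the {day_of_week_number}-style placeholders have already been substituted with concrete values, which is why plain cron expressions appear below.

# Sketch: count inserts that fall inside a cron-defined window and compare the
# count to min_number / max_number. Placeholders are assumed to be resolved upstream.
from datetime import datetime
from croniter import croniter  # pip install croniter

def count_matching_inserts(insert_datetimes, cron_expressions):
    """Count inserts whose timestamp matches at least one of the cron expressions."""
    return sum(
        1
        for ts in insert_datetimes
        if any(croniter.match(expr, ts) for expr in cron_expressions)
    )

# Example: two inserts observed for a raw_stock table.
inserts = [datetime(2024, 2, 3, 23, 15), datetime(2024, 2, 4, 6, 30)]
windows = ["* 23 * * *", "* 0-7 * * *"]  # every minute of 23h, and of 0h-7h

matched = count_matching_inserts(inserts, windows)
rule = {"min_number": 1, "max_number": 2}
print(rule["min_number"] <= matched <= rule["max_number"])  # True: the rule is satisfied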

This proactive approach drastically reduces the risk of issues slipping through manual checks and ensures that potential problems are identified in near real time.

How It Works

Architecture schema of the system

The backbone of the system comprises a series of interconnected components. A Cloud Scheduler triggers a Cloud Function, which extracts rules and logs from BigQuery. The Cloud Function then validates the rules against the logs. If a violation is detected, an error message is dispatched to the team’s Google Chat group or via email to the designated maintainer. Additionally, the details of the violation are stored in BigQuery, feeding a comprehensive Looker Studio dashboard that gives the team a summarized monitoring overview.

1. Collect the necessary data

Our system relies on our ability to retrieve the necessary data and prepare it for use by the rule engine.

To achieve this, we use Cloud Logging to filter the logs generated during data insertion into our data platform running on BigQuery. The first step is to create a sink with a filter so that the logs of interest (BigQuery writes Data Access audit logs by default, unlike most other services) are routed to a BigQuery dataset.

resource.type="bigquery_project" OR resource.type="bigquery_dataset"
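If you prefer to script this setup, here is a small sketch of creating such a sink with the google-cloud-logging Python client. The sink name is a placeholder; the project (prj-training) and destination dataset (log_sink) reuse the names that appear in the query further down.

# Sketch: create a log sink that routes BigQuery audit logs to a BigQuery dataset.
# The sink name is a placeholder; adjust project and dataset to your environment.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="prj-training")

log_filter = 'resource.type="bigquery_project" OR resource.type="bigquery_dataset"'
destination = "bigquery.googleapis.com/projects/prj-training/datasets/log_sink"

sink = client.sink("bq-insert-audit-sink", filter_=log_filter, destination=destination)
if not sink.exists():
    sink.create()
    # The sink's writer identity still needs the BigQuery Data Editor role
    # on the destination dataset before logs start flowing.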

Now that our raw data is streaming into a BigQuery table, we prepare it for use by the rule engine. To do this, we create a view over the raw logs so that, for each table in our data platform, we have the latest data insertion dates and the amount of data inserted.

WITH insert_logs AS (
  SELECT
    JSON_EXTRACT(protopayload_auditlog.metadataJson, '$.jobChange.job.jobStats.queryStats.outputRowCount') AS rows_inserted,
    JSON_EXTRACT(protopayload_auditlog.metadataJson, '$.jobChange.job.jobStats.endTime') AS insert_datetime,
    SPLIT(JSON_EXTRACT_SCALAR(JSON_EXTRACT(protopayload_auditlog.metadataJson, '$.jobChange.job.jobStats.queryStats.referencedTables'), '$[0]'), '/') AS referencedTables
  FROM `prj-training.log_sink.cloudaudit_googleapis_com_data_access_*`
  WHERE
    JSON_EXTRACT(protopayload_auditlog.metadataJson, '$.jobChange.job.jobStatus.jobState') = "\"DONE\""
    AND JSON_EXTRACT(protopayload_auditlog.metadataJson, '$.jobChange.job.jobConfig.queryConfig.statementType') = "\"INSERT\""
)

SELECT
  CAST(REPLACE(rows_inserted, "\"", "") AS INT64) AS rows_inserted,
  TIMESTAMP(REPLACE(insert_datetime, "\"", "")) AS insert_datetime,
  referencedTables[OFFSET(1)] AS project_id,
  referencedTables[OFFSET(3)] AS dataset,
  referencedTables[OFFSET(5)] AS table
FROM insert_logs
WHERE rows_inserted IS NOT NULL;

Here’s what the logs look like once they’ve been prepared:

Example of prepared data

2. Run the rule engine to detect errors

Now that we have aggregated and processed data ready for analysis, let’s move on to the crucial step of ensuring rule compliance. The heart of our system lies in Cloud Scheduler, which periodically triggers the execution of a Cloud Function responsible for the computation. This automated trigger kickstarts the monitoring process, eliminating the need for manual intervention and ensuring a continuous watchful eye over our data.

The Cloud Function begins its task by retrieving both the predefined rules and the aggregated log data, as discussed in the previous section, from our BigQuery tables.

With the logs in hand and the rules at its disposal, the rule engine leaps into action. It systematically evaluates each log entry against the established rules, examining the data’s characteristics to verify that they meet the specified criteria. Should a log entry fail to comply with any of the defined rules, the rule engine promptly generates an error message, alerting the relevant stakeholders. The alert is also stored in a BigQuery table to keep track of what has happened and to feed a dashboard.
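To illustrate, here is a heavily simplified sketch of what that evaluation could look like for the max_diff_time rule type. The rules table, the name of the prepared-logs view and the way rules are stored are assumptions made for the example, not the team’s actual implementation.

# Sketch: evaluate "max_diff_time" rules against the prepared insert logs.
# The rules table and the prepared-logs view names are illustrative assumptions.
import json
import re
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client(project="prj-training")

# 1. Load the rules, assumed to be stored as JSON documents in a BigQuery table.
rules = [
    json.loads(row["rule_json"])
    for row in client.query("SELECT rule_json FROM `prj-training.monitoring.rules`").result()
]

# 2. Latest insert per table, from the prepared view built in the previous section.
latest_inserts = {
    f"{row['dataset']}.{row['table']}": row["insert_datetime"]
    for row in client.query(
        "SELECT dataset, table, MAX(insert_datetime) AS insert_datetime "
        "FROM `prj-training.log_sink.prepared_insert_logs` "
        "GROUP BY dataset, table"
    ).result()
}

# 3. Check each rule; entries in "tables" are treated as regexes so patterns
#    like ^raw_stock.*$ cover every table of a dataset.
alerts = []
now = datetime.now(timezone.utc)
for rule in rules:
    if "max_diff_time" not in rule:
        continue  # other rule types (cron windows, min/max counts) are handled separately
    for pattern in rule["tables"]:
        for table_name, last_insert in latest_inserts.items():
            if not re.match(pattern, table_name):
                continue
            minutes_since_insert = (now - last_insert).total_seconds() / 60
            if minutes_since_insert > rule["max_diff_time"]:
                alerts.append({
                    "rule": rule["name"],
                    "table": table_name,
                    "maintainer": rule.get("maintainer"),
                    "message": f"{table_name} has not been updated for {minutes_since_insert:.0f} minutes",
                })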

In cases where we encounter a table with no predefined rules, a lower-priority alert is triggered. This alert is then sent to our data engineering team, prompting them to create the necessary rule based on the defined specifications.

3. Alert the right people in the right way

Alerts are sent to the responsible data maintenance teams in a manner that best suits their preferences, either via email or Google Chat. For our team, using Google Chat groups is the preferred method. They are user-friendly, easy to create, and allow us to include only the relevant team members from various departments. This approach also facilitates organized discussions on specific topics and the option to mute notifications to avoid unnecessary distractions when needed.

Example of an error message in the team’s Google Chat

In addition to alerts, it’s crucial to establish monitoring to track the situation further and gather more details following an alert. To achieve this, we collect the alerts raised by the cloud function and store them in a BigQuery table, enabling comprehensive tracking and analysis. This data feeds into Looker Studio dashboards, providing a team-wide monitoring overview.
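Here is a short sketch of how an alert produced by the rule engine could be pushed to a Google Chat space through an incoming webhook and archived in BigQuery; the webhook URL and the alerts table name are placeholders.

# Sketch: notify a Google Chat space via an incoming webhook and archive the
# alert in BigQuery for the Looker Studio dashboard. URL and table id are placeholders.
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

CHAT_WEBHOOK_URL = "https://chat.googleapis.com/v1/spaces/XXXX/messages?key=XXXX&token=XXXX"
ALERTS_TABLE = "prj-training.monitoring.alerts"

def dispatch_alert(alert: dict) -> None:
    # Google Chat incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(CHAT_WEBHOOK_URL, json={"text": f"{alert['rule']}: {alert['message']}"}, timeout=10)

    # Keep a history of raised alerts so a dashboard can be plugged on top of it.
    client = bigquery.Client()
    row = {**alert, "raised_at": datetime.now(timezone.utc).isoformat()}
    errors = client.insert_rows_json(ALERTS_TABLE, [row])
    if errors:
        raise RuntimeError(f"Failed to store alert: {errors}")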

Beyond BigQuery: Flat File Monitoring, Vertex AI Pipeline Runs, and More

The data lake is the backbone of our team. To feed it, we use a wide variety of data sources. An important part of these sources are flat files that other teams regularly send us. Monitoring them helps us understand errors on certain data lake tables, since a table cannot be kept up to date if its data was never received.

As a natural extension of this system, flat file monitoring has been seamlessly integrated: these files are now monitored and tracked using the same rule-based methodology.

This inclusion enhances the accuracy and efficiency of monitoring processes while maintaining a consistent approach across different data sources.

Conclusion

We automated the monitoring of our data platform, moving from human checks to alerts raised only when a problem is detected. Not only did this save the team precious time, it also improved error detection and made us more proactive about any errors that data platform users might encounter.

This rule-engine approach also makes it simple for the whole team to update and extend the system to monitor and alert on new data sources and resources managed by the team.

PS: A Proposition for Enhanced Monitoring on GCP

We could imagine that, when defining a BigQuery table, it would be possible to attach log-based metrics and alerts to it that automatically notify the maintainer concerned. Similarly, in GCS, it would be practical to set alerts on object versions, so as to be notified if they are not regularly updated.

That could provide insights into the best practices for monitoring and alerting, offering users actionable recommendations based on industry standards. This step could further streamline the monitoring setup and help teams optimise their data operations.
