Chronicle Forwarder Telemetry via Google Cloud Monitoring

--

Have you ever wanted an alert when a Log Source in your Chronicle SIEM drops below an ingestion threshold, or even goes completely silent?

Well, good news everyone, the new Cloud Monitoring integration preview in Chronicle SIEM can do exactly that! 🎉

The preview documentation has lovely clear and concise instructions (unlike my Professor Farnsworth ramblings), so let’s give it a go and see what it can do.

Ingestion metrics, in GCP Monitoring you say?

📝 Note: you must be using the Chronicle SIEM Bring Your Own Project (BYOP) feature in order to have these metrics enabled. If you’ve not already migrated, migrate already!

Alert when a Chronicle Forwarder goes silent?

From an initial look, there are metrics available for Ingestion related to:

  • Total Ingested Log Count
  • Total Ingested Log Size

These appear to represent Log Sources per Collector, with a Collector being Feed Management, a Chronicle Forwarder, or the Ingestion API.

Active Chronicle Metrics
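If you like to check things programmatically, an optional way to confirm which Chronicle metric types your BYOP project exposes is to list the metric descriptors with the Cloud Monitoring Python client. A minimal sketch follows, assuming a chronicle.googleapis.com metric prefix and a placeholder project ID, so do verify the exact names in Metrics Explorer:

# Minimal sketch: list metric descriptors that look like Chronicle ingestion
# metrics. The "chronicle.googleapis.com/" prefix is an assumption; check
# Metrics Explorer for the exact metric types in your project.
from google.cloud import monitoring_v3

project_id = "your-byop-project-id"  # hypothetical project ID
client = monitoring_v3.MetricServiceClient()

descriptors = client.list_metric_descriptors(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = starts_with("chronicle.googleapis.com/")',
    }
)
for descriptor in descriptors:
    print(descriptor.name, "|", descriptor.display_name)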

A Metrics Explorer graph can help you understand which Forwarders are continually talking, or not…

Creating a Metrics Explorer chart using Total Ingested Log Count, with an exclude filter for all the alphabet Collector IDs (Chronicle SIEM’s Feed Management, Ingestion APIs, etc.), and grouping by Namespace and Collector ID returns the following:

Not all Chronicle Forwarders have continually talking log sources
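For completeness, the same sort of view can be pulled via the API with list_time_series. A rough sketch, with the metric type and label path again being assumptions to verify in your project:

# Sketch: last hour of the (assumed) ingested log count metric, aggregated
# and grouped by Collector ID, roughly mirroring the Metrics Explorer chart.
import time

from google.cloud import monitoring_v3

project_id = "your-byop-project-id"  # hypothetical
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 300},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        # Label path is an assumption; copy it from the console's filter.
        "group_by_fields": ["metric.labels.collector_id"],
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "chronicle.googleapis.com/ingestion/log/record_count"',
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.metric.labels.get("collector_id"), len(series.points))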

This is a challenge, as not all Chronicle Forwarders in this environment are always sending logs. If I were to create an Alert for a Forwarder being offline for 5 minutes it would be valid for 2 of the 4 Forwarders, but generate a lot of false positive alerts for the remaining 2.

A logical approach therefore is to have two Forwarder Alerting policies:

  • No metrics observed for 5 minutes
  • No metrics observed for 1 hour

Let’s create the no metrics observed for 5 minutes alert first.

Within Monitoring, go to Alerting and add an Alert condition, including Filters for the Chronicle Forwarder Collector IDs that we expect to always be communicating.

It seems you can’t have multiple filters, i.e., multiple collector_ids, but rather you need to put all your Collector IDs into a single regex statement:

(?:89135071-2702-4de9-bad4-cbd59415822c|8da7b40f-e661-4b79-bb24-85e9c003d5fd)

And importantly, we need to aggregate across the time series by Collector, or else we’ll get an alert per log source, which isn’t intended.

Configuring the Chronicle Forwarders we wish to monitor

🤷 This whole approach of using a regex to monitor multiple Collectors could be user error on my part, an anti-pattern, or a lack of Operations knowledge in general. If you have suggestions on this, please do let me know, but it works for my end goal.

Click Next, specify a Condition Type of Metric Absence over 5 minutes, and give it a meaningful name, e.g., Chronicle Forwarder - Metric Absence over 5m

The Conditions for the Alert to trigger
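For reference, here’s roughly what that condition looks like if created through the Cloud Monitoring API with the Python client instead of the console. Treat it as a sketch only: the metric type, the collector_id label path, and the project ID are assumptions on my part, so copy the exact filter the console generates for your project:

# Sketch: the "Metric Absence over 5m" policy via the Cloud Monitoring API.
# Metric type and label path are assumptions; copy the console's filter.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "your-byop-project-id"  # hypothetical

# Only the Forwarders we expect to always be communicating.
collector_regex = (
    "(?:89135071-2702-4de9-bad4-cbd59415822c"
    "|8da7b40f-e661-4b79-bb24-85e9c003d5fd)"
)

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Chronicle Forwarder - Metric Absence over 5m",
    condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
        filter=(
            'metric.type = "chronicle.googleapis.com/ingestion/log/record_count" '
            "AND metric.labels.collector_id = "
            f'monitoring.regex.full_match("{collector_regex}")'
        ),
        duration=duration_pb2.Duration(seconds=300),  # 5 minutes of silence
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                # Reduce per Collector, not per log source, so we get one
                # time series (and one potential alert) per Forwarder.
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
                group_by_fields=["metric.labels.collector_id"],
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Chronicle Forwarder - Metric Absence over 5m",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    # Notification channels can be attached here, or added in the next step.
)

client = monitoring_v3.AlertPolicyServiceClient()
created = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(created.name)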

Finally, configure the Notifications and Name of the alert and, optionally but recommended, add detail on the severity and useful information for the Alert recipient, such as the names of the Forwarders.

A high severity alert monitoring specific Chronicle Forwarders.

That was the 5 minute version of the Alert; I created the 1 hour version by copying the 5 minute version as shown below, though further monitoring and tuning may be required.

Quickly copy an existing alert to save time
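If you’d rather script that copy, the same trick works via the API: fetch the 5 minute policy, clear the server-assigned names, bump the duration, and create it as a new policy. A rough sketch, with a placeholder policy ID:

# Sketch: clone the 5 minute policy into a 1 hour variant via the API.
# The policy ID is a placeholder; look it up with list_alert_policies().
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "your-byop-project-id"   # hypothetical
policy_id = "1234567890123456789"     # hypothetical ID of the 5m policy

client = monitoring_v3.AlertPolicyServiceClient()
policy = client.get_alert_policy(
    name=f"projects/{project_id}/alertPolicies/{policy_id}"
)

# Clear server-assigned resource names so create_alert_policy makes a copy.
policy.name = ""
policy.display_name = "Chronicle Forwarder - Metric Absence over 1h"
condition = policy.conditions[0]
condition.name = ""
condition.display_name = "Chronicle Forwarder - Metric Absence over 1h"
condition.condition_absent.duration = duration_pb2.Duration(seconds=3600)

clone = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(clone.name)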

Alert when a log source goes silent?

The next alert I need is for when specific high value log sources stop reporting in.

Sounds simple, but there are a couple of things we need to know first:

  1. Which log sources are always sending data?
  2. Which log sources have late arriving data?

Conveniently, I’ve written on these two topics previously, so an optional read-up on those posts is recommended. The SQL statement I run in Chronicle Data Lake to work out which log sources it makes sense to monitor is as follows:

SELECT
  log_type,
  collector_id,
  COUNT(intervals) AS count
FROM (
  SELECT
    log_type,
    collector_id,
    -- bucket each record by hour as YYYY-MM-DD-HH
    -- (FORMAT_TIMESTAMP assumes end_time is a TIMESTAMP column)
    FORMAT_TIMESTAMP('%F-%H', end_time) AS intervals,
    COUNT(1) AS count
  FROM
    `datalake.ingestion_metrics`
  WHERE
    -- build a 24 hour baseline window, from two days ago up to one day ago
    DATETIME(end_time) BETWEEN DATETIME_SUB(CURRENT_DATETIME, INTERVAL 2 DAY)
    AND DATETIME_SUB(CURRENT_DATETIME, INTERVAL 1 DAY)
    AND log_type IS NOT NULL
    AND collector_id IS NOT NULL
  GROUP BY
    1,
    2,
    3 )
GROUP BY 1, 2
ORDER BY 3 DESC

Which in my case returns each log source with the number of hourly intervals in which it sent logs during the baseline day; those with a count of 24 sent a log at least once every hour.

Many log sources report constantly, but not all…

This is going to get a little more complicated, as we need to factor in both the Log Type and the Collector method (a Forwarder, Feed Management, or the Ingestion API), and the fact that not all log sources continually send logs.

The next consideration is those alphabet Collector IDs, summarised in the snippet after this list:

  • aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa: represents all feeds created using the Feed Management API or page. For more information about feed management, see Feed management and Feed management API.
  • bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb: represents all ingestion sources that use the Ingestion API unstructuredlogentries method. For more information about ingestion API, see Chronicle Ingestion API.
  • cccccccc-cccc-cccc-cccc-cccccccccccc: represents all ingestion sources that use the Ingestion API udmevents method.
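For convenience, here’s a tiny, purely illustrative Python helper that captures those reserved IDs and builds the kind of exclusion regex used in the Metrics Explorer chart earlier:

# The reserved "alphabet" Collector IDs described above, plus an exclusion
# regex for Metrics Explorer or alerting filters, so that only real
# Chronicle Forwarders remain.
RESERVED_COLLECTOR_IDS = {
    "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa": "Feed Management (API or page)",
    "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb": "Ingestion API unstructuredlogentries",
    "cccccccc-cccc-cccc-cccc-cccccccccccc": "Ingestion API udmevents",
}

exclusion_regex = "(?:" + "|".join(RESERVED_COLLECTOR_IDS) + ")"
print(exclusion_regex)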

Given this I’m going to break up my requirements into two logical groups based upon what’s important to me:

  1. Context Sources
  2. High Value Event Sources

My main Context Source is WORKSPACE_USERS, and so I’ll create an Alert for this log source.

log_type: WORKSPACE_USERS, collector_id: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, count: 1

Reviewing the above result I can see this feed comes in via the Feed Management API (the aaaa’s) and can infer it arrives once a day, i.e., in 1 of the 24 hourly intervals.

Creating the Alert condition, I apply the Collector ID filter for Feed Management, filter the log type to WORKSPACE_USERS, and set a rolling window of 1 hour with a trigger absence time of 1 day (as we expect at least one entry per day).
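In API terms this is the same metric-absence pattern as the Forwarder policy, just with a different filter and a longer absence duration; the metric type and label paths are again assumptions to verify against the console:

# Sketch: WORKSPACE_USERS absence condition, filtered to the Feed Management
# collector ID, with a 1 hour alignment window and a 1 day absence trigger.
# Metric type and label paths are assumptions; verify against the console.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

workspace_users_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="WORKSPACE_USERS - Metric Absence over 1d",
    condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
        filter=(
            'metric.type = "chronicle.googleapis.com/ingestion/log/record_count" '
            'AND metric.labels.collector_id = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" '
            'AND metric.labels.log_type = "WORKSPACE_USERS"'
        ),
        duration=duration_pb2.Duration(seconds=86400),  # 1 day of absence
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=3600),  # 1h window
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
                group_by_fields=["metric.labels.log_type"],
            )
        ],
    ),
)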

This will require several Alerts for the log sources of value, but the basics of getting started are as above.

Getting Alerts and Notifications

One detail that I skipped over was Notification channels, which, if they aren’t in place already, you’ll need to set up. What are these? They’re how Operations will notify you of an Alert outside of the console.

In your Google Cloud console browse to Monitoring > Alerting

If not already configured, click EDIT NOTIFICATION CHANNELS

There’s a wide range of options including:

  • Email
  • Mobile App
  • PagerDuty
  • SMS
  • Slack
  • Webhooks
  • Pub/Sub
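Each of these ends up as a notification channel resource in the project, and those channel names are what get attached to the alerting policies. If you want to check what’s already configured, a quick sketch:

# Sketch: list the notification channels already configured in the project,
# whose resource names can then be attached to the alert policies above.
from google.cloud import monitoring_v3

project_id = "your-byop-project-id"  # hypothetical
client = monitoring_v3.NotificationChannelServiceClient()

for channel in client.list_notification_channels(name=f"projects/{project_id}"):
    print(channel.name, "|", channel.display_name)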

Responding to Alerts

In an ideal world everything works, always. But given that’s not reality, things will break and hence we need alerts.

Here’s an example of a Notification as sent via Email:

What’s neat about GCP Operations is it will close an Incident if the metric returns to normal; however, this is where you do need to apply careful research to create alerts that make sense, and don’t end up generating hundreds or thousands of alerts that self-close shortly after.

I definitely didn’t do that…

An example of an Incident in Operations, showing the time and duration of the outage as based on the Alert conditions

Proactive monitoring via Dashboards

Finally, it’s not only metric exploration and metric alerting that GCP Operations can help with; it also provides real-time, interactive Dashboard capabilities.

A quick example showing each Chronicle Forwarder, by Collector, GCP Region, Project ID, and Log Type, comparing the current day against the prior day.

A quick Operations Dashboard, room for improvement, but real-time data is a powerful new capability to ensure continual detection

Random Note

While Google Chat supports Webhooks, it expects data in a specific format, so you can’t use Google Chat as an Operations Webhook destination without some middleware (and there’s a lovely example of doing just that here).
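As a flavour of what that middleware can look like, here’s a rough sketch of a small HTTP Cloud Function that reformats the Cloud Monitoring webhook payload into a simple Google Chat message. The incident fields and the CHAT_WEBHOOK_URL environment variable are assumptions on my part, so inspect a real notification payload before relying on them:

# Rough sketch: accept a Cloud Monitoring webhook notification and re-post
# it to a Google Chat space incoming webhook as a plain text message.
import os

import functions_framework
import requests

CHAT_WEBHOOK_URL = os.environ["CHAT_WEBHOOK_URL"]  # your Chat space webhook URL


@functions_framework.http
def monitoring_to_chat(request):
    payload = request.get_json(silent=True) or {}
    incident = payload.get("incident", {})
    text = (
        f"*{incident.get('policy_name', 'Unknown policy')}* "
        f"is {incident.get('state', 'unknown')}\n"
        f"{incident.get('summary', '')}\n"
        f"{incident.get('url', '')}"
    )
    # Google Chat incoming webhooks accept a simple {"text": ...} payload.
    requests.post(CHAT_WEBHOOK_URL, json={"text": text}, timeout=10)
    return "ok", 200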

Summary

The Cloud Monitoring integration is in preview, so contact your friendly Chronicle account team or Partner for more info, and if you’ve not already done so, start your migration to Bring Your Own Project (BYOP), as that’s a prerequisite for this (and opens up Chronicle Auditing too).

While ingestion monitoring is a more mundane part of Detection, you can’t detect things if your ingestion pipeline isn’t working, and this new preview helps ensure coverage of, and a fast response to, such issues.

--


Chris Martin (@thatsiemguy)

Cloud Security Mechanic, Google Cloud
