Identifying Late Arriving Log Sources
In this post I’ll cover how you can identify sources of ‘late arriving’ data in your Chronicle SIEM.
What is late arriving data? Well, not all log sources generate data as a stream and in near real-time, e.g., Google Cloud Operations exports to Google Cloud Storage happen once an hour, while other log sources will send data when they can, e.g., an EDR agent on an endpoint that is disconnected.
In this post I’ll show an example SQL statement, via Google Sheets, to identify log sources with late arriving data, and cover how that can impact Detection or Search when using Chronicle SIEM.
🆕 Update Dec 23:
- The updated Chronicle SecOps licensing model means the export to BigQuery feature may not be enabled by default.
The below SQL statement attempts to calculate the average latency of a log source based upon the difference between two timestamps:
i) ingested_timestamp
ii) event_timestamp
💭 See Chronicle’s UDM field list for further details on available timestamp fields
WITH
baseline AS (
SELECT
hour_time_bucket,
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS min_diff,
MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS max_diff,
ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)),0) AS avg_diff
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1,
2
--HAVING
-- avg_diff > 360
)
SELECT
log_source,
CAST(AVG(avg_diff) AS INT64) AS average_latency_seconds,
CAST(AVG(avg_diff)/60 AS INT64) AS average_latency_minutes,
CAST(TRUNC((AVG(avg_diff)/60)/60,2) AS FLOAT64) AS average_latency_hours,
MIN(min_diff) AS min_latency_observed,
MAX(max_diff) AS max_latency_observed
FROM
baseline
GROUP BY
1
ORDER BY
2 DESC
- The SQL statement is written for the Chronicle Data Lake events table, but you can make it work against the older udm_events table by adjusting the WHERE clause and switching hour_time_bucket to _PARTITIONTIME (a sketch of the adjusted baseline follows this list).
- The SQL statement uses a CTE to build a baseline over the last 7 days, and then calculates the average ingestion to event timestamp difference.
- As Chronicle’s Data Lake events table does not include the Ingestion Label (at time of writing), I concatenate metadata.vendor_name and metadata.product_name to create a unique log source. If a parser doesn’t populate either of those UDM fields (not common) then it’ll appear as a ":" entry.
- Optionally, if you want to exclude near real-time log sources, un-comment the HAVING clause to show only sources above six minutes (360 seconds).
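For reference, below is a minimal sketch of the baseline CTE adjusted for the older udm_events table; the table name and _PARTITIONTIME column follow the note above, but verify both against your own Data Lake dataset before running it.
-- Sketch: baseline CTE adjusted for the older udm_events table,
-- which is partitioned on _PARTITIONTIME rather than hour_time_bucket.
-- Table and column names are assumptions; verify against your dataset.
baseline AS (
SELECT
_PARTITIONTIME AS partition_time,
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS min_diff,
MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS max_diff,
ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)),0) AS avg_diff
FROM
`datalake.udm_events`
WHERE
DATE(_PARTITIONTIME) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1,
2
)
The outer SELECT over the baseline CTE stays the same, as it only references log_source and the *_diff columns.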
If you’re not familiar with Google Connected Sheets, it lets you run SQL against BigQuery and have the results written directly into Google Sheets, which makes it a bit quicker to create a pivot table or a chart.
An example of the results of the SQL statement, sorted by highest latency descending:
✔️ Note, you have to provide Chronicle with an appropriate Google Group in order to configure authentication via your identity. If in doubt, contact your Chronicle account team or Partner.
Interpreting the results
There are two main reasons for late arriving data:
- collection latency
- generation latency
The above SQL unfortunately will not tell you which of the two is the cause; however, we can infer whether a source is likely experiencing late arriving data or collection latency.
Firstly, identify the log source ingestion mechanism, which is most likely going to be Chronicle SIEM Feed Management, and review the Ingest schedule value, e.g., for Azure Blob Storage the collection schedule is every 15 minutes.
The second element to this is to consider the data type, and whether you would expect late arriving data, e.g., for an EDR log source covering mobile devices that are not always online, this is a likely scenario.
The third part is per-log-source specific mechanics, which is more tribal knowledge, for example:
- Google Workspace (WORKSPACE_ACTIVITY) includes a high watermark value to ensure log data is not lost from the Workspace APIs, which can happen if you collect data more aggressively; this adds a six hour delay, i.e., collection latency
- TANIUM_STREAM and GOOGLE_SECURITY_COMMAND_CENTER are custom integrations in this lab with data replayed via API, so we can ignore those
- Review the range of min to max late arriving data. Is there a high variance of latency, or a relatively small window? e.g., GSuite via Google Cloud in the above chart shows a high variance in late arriving data (a sketch to quantify this follows this list).
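To put a number on that variance, a minimal sketch like the one below (assuming the same datalake.events table and 7-day window as the main query) calculates the latency spread and standard deviation per log source:
-- Sketch: quantify latency spread per log source over the last 7 days.
-- Assumes the same datalake.events table as the main query above.
SELECT
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS min_latency_seconds,
MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS max_latency_seconds,
ROUND(STDDEV(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)),0) AS latency_stddev_seconds
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1
ORDER BY
latency_stddev_seconds DESC
A large standard deviation relative to the average suggests devices reporting in whenever they reconnect, rather than a fixed collection schedule.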
And from this analysis, we can have a level of confidence that we have no other log sources with a high data or collection latency 👍
An additional expansion of this would be to break down the latency times into percentiles to get further insight into the proportion of data that is late arriving (on the todo wishlist).
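As a rough starting point for that, the sketch below uses BigQuery’s APPROX_QUANTILES over the same datalake.events table to approximate the median and 95th percentile latency per log source:
-- Sketch: approximate p50 and p95 latency per log source, to gauge what
-- proportion of a source's data arrives late. Assumes the same
-- datalake.events table and 7-day window as the main query.
SELECT
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
APPROX_QUANTILES(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND), 100)[OFFSET(50)] AS p50_latency_seconds,
APPROX_QUANTILES(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND), 100)[OFFSET(95)] AS p95_latency_seconds
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1
ORDER BY
p95_latency_seconds DESC
A source with a low p50 but a high p95 is mostly near real-time with a long tail of late arrivals, whereas a high p50 points to consistent collection or generation latency.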
Impact on Detection & Response
- Chronicle SIEM’s Detection Engine will successfully generate a detection upon late arriving data, as long as it’s within 24~ hours. Beyond that duration you may not get a Detection, and would need to consider scheduling or running a Retrohunt.
- If you use a multi-event rule over a medium to large window duration, this will be in addition to the data or ingestion latency, e.g., for WORKSPACE_ACTIVITY a multi-event rule with a 6 hour window plus the 6 hour ingestion latency will result in a detection 12~ hours after the event.
- From a UDM Search perspective, obviously, data won’t be returned until the collection latency period has passed, e.g., again using WORKSPACE_ACTIVITY as an example, you’re not going to see results until 6 hours after the event (as you can see in the above chart, 20k seconds being roughly 6 hours). A sketch to check this per log source follows this list.
- This is a case where you can make a choice: the Chronicle default prioritises avoiding data loss at the cost of latency; however, you can use other export mechanisms.
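If you want to see how long a specific log source takes to become searchable, the sketch below breaks average latency down per hour for a single source; the "Vendor:Product" filter value is a placeholder, so substitute one of the log_source values returned by the baseline query.
-- Sketch: hourly average latency for one log source, to gauge when its
-- events become searchable. "Vendor:Product" is a placeholder value;
-- substitute a log_source returned by the baseline query.
SELECT
hour_time_bucket,
ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND))/3600,2) AS avg_latency_hours
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
AND CONCAT(metadata.vendor_name, ":", metadata.product_name) = "Vendor:Product"
GROUP BY
1
ORDER BY
1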
Summary
Chronicle SIEM supports late arriving data, a perennial challenge with SIEM (and data in general), but having an understanding and insight into what those sources of late arriving data are, be that collection latency or data generation delay, is important due to the impact it has on Detection and Search. Hopefully the above post helps you get that insight when using Chronicle SIEM.