Identifying Late Arriving Log Sources
In this post I’ll cover how you can identify sources of ‘late arriving’ data in your Chronicle SIEM.
What is late arriving data? Well, not all log sources generate data as a stream and in near real-time, e.g., Google Cloud Operations exports to Google Cloud Storage happen once an hour, while other log sources will send data when they can, e.g., an EDR agent on an endpoint that is disconnected.
In this post I’ll show an example SQL statement, via Google Sheets, to identify log sources with late arriving data, and cover how that can impact Detection or Search when using Chronicle SIEM.
🆕 Update Dec 23:
- The updated Chronicle SecOps licensing model means the export to BigQuery feature may not be enabled by default.
The below SQL statement attempts to calculate the average latency of a log source based upon the difference between two timestamps:
i) ingested_timestamp
ii) event_timestamp
💭 See Chronicle’s UDM field list for further details on available timestamp fields
WITH
baseline AS (
SELECT
hour_time_bucket,
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS min_diff,
MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS max_diff,
ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)),0) AS avg_diff
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1,
2
--HAVING
-- avg_diff > 360
)
SELECT
log_source,
CAST(AVG(avg_diff) AS INT64) AS average_latency_seconds,
CAST(AVG(avg_diff)/60 AS INT64) AS average_latency_minutes,
CAST(TRUNC((AVG(avg_diff)/60)/60,2) AS FLOAT64) AS average_latency_hours,
MIN(min_diff) AS min_latency_observed,
MAX(max_diff) AS max_latency_observed
FROM
baseline
GROUP BY
1
ORDER BY
2 DESC
- The SQL statement is written for the Chronicle Data Lake events table, but you can make it work against the older udm_events table by adjusting the WHERE clause and switching hour_time_bucket to _PARTITIONTIME (a sketch of the adjusted baseline follows this list).
- The SQL statement uses a CTE to build a baseline over the last 7 days, and then calculates the average ingestion to event timestamp difference.
- As Chronicle’s Data Lake events table does not include the Ingestion Label (at time of writing), I concatenate metadata.vendor_name and metadata.product_name to create a unique log source. If a parser doesn’t populate either of those UDM fields (not common) then it’ll appear as a ":" entry.
- Optionally, if you want to exclude near real-time log sources, un-comment the HAVING clause to show only sources above six minutes (360 seconds).
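For reference, below is a minimal sketch of the baseline CTE adjusted for the older udm_events table; the table name and _PARTITIONTIME column follow the note above, but verify both against your own Data Lake dataset before running it.
-- Sketch: baseline CTE adjusted for the older udm_events table,
-- which is partitioned on _PARTITIONTIME rather than hour_time_bucket.
-- Table and column names are assumptions; verify against your dataset.
baseline AS (
SELECT
_PARTITIONTIME AS partition_time,
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS min_diff,
MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS max_diff,
ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)),0) AS avg_diff
FROM
`datalake.udm_events`
WHERE
DATE(_PARTITIONTIME) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1,
2
)
The outer SELECT over the baseline CTE stays the same, as it only references log_source and the *_diff columns.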
If you’re not familiar with Google Connected Sheets, it lets you run SQL against BigQuery and have the results written directly into Google Sheets, which makes it a bit quicker to create a pivot table or a chart.
An example of the results of the SQL statement, sorted by highest latency descending:
✔️ Note, you have to provide Chronicle with an appropriate Google Group in order to configure authentication via your identity. If in doubt, contact your Chronicle account team or Partner.
Interpreting the results
There are two main reasons for late arriving data:
- collection latency
- generation latency
The above SQL unfortunately will not tell you which of the two is the cause; however, we can infer whether a source is likely experiencing late arriving data or collection latency.
Firstly, identify the log source ingestion mechanism, which is most likely going to be Chronicle SIEM Feed Management, and review the Ingest schedule value, e.g., for Azure Blob Storage the collection schedule is every 15 minutes.
The second element to this is to consider the data type, and whether you would expect late arriving data, e.g., for an EDR log source covering mobile devices that are not always online, this is a likely scenario.
The third part is per-log-source specific mechanics, which is more tribal knowledge, for example:
- Google Workspace (WORKSPACE_ACTIVITY) includes a high watermark value to ensure log data is not lost from the Workspace APIs, which can happen if you collect data more aggressively; this adds a six hour delay, i.e., collection latency
- TANIUM_STREAM and GOOGLE_SECURITY_COMMAND_CENTER are custom integrations in this lab with data replayed via API, so we can ignore those
- Review the range of min to max late arriving data. Is there a high variance of latency, or a relatively small window? e.g., GSuite via Google Cloud in the above chart shows a high variance in late arriving data (a sketch to quantify this follows this list).
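To put a number on that variance, a minimal sketch like the one below (assuming the same datalake.events table and 7-day window as the main query) calculates the latency spread and standard deviation per log source:
-- Sketch: quantify latency spread per log source over the last 7 days.
-- Assumes the same datalake.events table as the main query above.
SELECT
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS min_latency_seconds,
MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)) AS max_latency_seconds,
ROUND(STDDEV(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND)),0) AS latency_stddev_seconds
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1
ORDER BY
latency_stddev_seconds DESC
A large standard deviation relative to the average suggests devices reporting in whenever they reconnect, rather than a fixed collection schedule.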
And from this analysis, we can have a level of confidence that we have no other log sources with a high data or collection latency 👍
An additional expansion of this would be to break down the latency times into percentiles to get further insight into the proportion of data that is late arriving (on the todo wishlist).
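As a rough starting point for that, the sketch below uses BigQuery’s APPROX_QUANTILES over the same datalake.events table to approximate the median and 95th percentile latency per log source:
-- Sketch: approximate p50 and p95 latency per log source, to gauge what
-- proportion of a source's data arrives late. Assumes the same
-- datalake.events table and 7-day window as the main query.
SELECT
CONCAT(metadata.vendor_name, ":", metadata.product_name) AS log_source,
APPROX_QUANTILES(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND), 100)[OFFSET(50)] AS p50_latency_seconds,
APPROX_QUANTILES(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND), 100)[OFFSET(95)] AS p95_latency_seconds
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
1
ORDER BY
p95_latency_seconds DESC
A source with a low p50 but a high p95 is mostly near real-time with a long tail of late arrivals, whereas a high p50 points to consistent collection or generation latency.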
Impact on Detection & Response
- Chronicle SIEM’s Detection Engine will successfully generate a detection upon late arriving data, as long as it’s within 24~ hours. Beyond that duration you may not get a Detection, and would need to consider scheduling or running a Retrohunt.
- If you use a multi-event rule over a medium to large window duration, this will be in addition to the data or ingestion latency, e.g., for WORKSPACE_ACTIVITY a multi-event rule with a 6 hour window plus the 6 hour ingestion latency will result in a detection 12~ hours after the event.
- From a UDM Search perspective, obviously, data won’t be returned until the collection latency period has passed, e.g., again using WORKSPACE_ACTIVITY as an example, you’re not going to see results until 6 hours after the event (as you can see in the above chart, 20k seconds being roughly 6 hours). A sketch to check this per log source follows this list.
- This is a case where you can make a choice: the Chronicle default prioritises avoiding data loss at the cost of latency; however, you can use other export mechanisms.
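If you want to see how long a specific log source takes to become searchable, the sketch below breaks average latency down per hour for a single source; the "Vendor:Product" filter value is a placeholder, so substitute one of the log_source values returned by the baseline query.
-- Sketch: hourly average latency for one log source, to gauge when its
-- events become searchable. "Vendor:Product" is a placeholder value;
-- substitute a log_source returned by the baseline query.
SELECT
hour_time_bucket,
ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(metadata.ingested_timestamp.seconds),TIMESTAMP_SECONDS(metadata.event_timestamp.seconds),SECOND))/3600,2) AS avg_latency_hours
FROM
`datalake.events`
WHERE
DATE(hour_time_bucket) BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
AND CONCAT(metadata.vendor_name, ":", metadata.product_name) = "Vendor:Product"
GROUP BY
1
ORDER BY
1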
Summary
Chronicle SIEM supports late arriving data, a perennial challenge with SIEM (and data in general), but having an understanding and insight into what those sources of late arriving data are, be that collection latency or data generation delay, is important due to the impact it has on Detection and Search. Hopefully the above post helps you get that insight when using Chronicle SIEM.