Utilizing BigQuery to Analyze Exported Chronicle SIEM Archives

Chris Martin (@thatsiemguy)
5 min read · Dec 10, 2023

This post explores how to use the Data Export API and how to effectively query the exported raw logs using SQL statements in GCP BigQuery.

Using BigQuery External Tables to query your Archived Chronicle SIEM raw logs

Chronicle SIEM’s Data Export API allows users to export all or selected raw log data for a specific time period to a Google Cloud Storage bucket. This feature is intended for single-use or point-in-time scenarios, such as compliance or offboarding. This post provides examples of querying the exported raw log data outside Chronicle SIEM.

The Data Export API

To use the Data Export API you will need a Chronicle SIEM Backstory service account with permissions for the Data Export API endpoints.

An API request exports up to 10 terabytes of data to a customer-owned Google Cloud Storage (GCS) bucket.

A CreateDataExport request is made with the following parameters:

  • startTime
  • endTime
  • logType
    - either ALL_TYPES, or a specific Ingestion Label
  • gcsBucket

A successful call to the endpoint will return an object as follows:

{
  "dataExportId": "a0843074-0316-47eb-b564-fc2413636cbf",
  "startTime": "2023-12-06T00:00:00Z",
  "endTime": "2023-12-07T00:00:00Z",
  "logType": "ALL_TYPES",
  "gcsBucket": "projects/<gcp-project>/buckets/<gcp-bucket>",
  "dataExportStatus": {
    "stage": "IN_QUEUE"
  }
}

The dataExportId value from the response can then be passed to the GetDataExport endpoint to query the status of the export request:

{
  "dataExportId": "a0843074-0316-47eb-b564-fc2413636cbf",
  "startTime": "2023-12-06T00:00:00Z",
  "endTime": "2023-12-07T00:00:00Z",
  "logType": "ALL_TYPES",
  "gcsBucket": "projects/<gcp-project>/buckets/<gcp-bucket>",
  "dataExportStatus": {
    "stage": "FINISHED_SUCCESS"
  }
}

Once the data export is complete, you will find a top-level folder named with a GUID in your GCS bucket. Within this folder, there will be sub-folders for each unique log source in your Chronicle SIEM environment. Each sub-folder will contain one or more CSV files named with a sequential number and the “.csv” extension, e.g., -00000.csv.

📝 Note: these are, in effect, single-column CSV files

How the raw log files exported to your GCS bucket will appear

Using a BigQuery External Table

BigQuery is Google Cloud’s managed data warehouse; by defining an external table over the GCS bucket, you can query your exported Chronicle SIEM raw logs directly with SQL.

The external table (its creation is covered in the Creating a BigQuery External Table section later in this post) has only a single raw_log field defined in its schema; however, BigQuery also exposes a pseudo-column named _FILE_NAME that provides valuable metadata about the file each raw log came from. This field follows the format:

gs://<gcs-bucket>/<guid>/<INGESTION_LABEL>-FROM-<DDDD-HH-MM:SS>-TO-<DDDD-HH-MM:SS>/-[00000]-of-[00001].csv
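
For example, a quick way to see which log sources are present in an export is to pull the ingestion label out of _FILE_NAME (a minimal sketch, using the same placeholder table name as the queries later in this post):

SELECT DISTINCT
  -- the folder name before "-FROM" is the ingestion label
  REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') AS ingestion_label
FROM
  `<project>.<data_set>.<table>`
ORDER BY
  ingestion_label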

Searching for a string

The first example queries the raw log data for a given string, win-dc-01, using a case-insensitive regular expression:

SELECT
  _FILE_NAME,
  raw_log
FROM
  `<project>.<data_set>.<table>`
WHERE
  REGEXP_CONTAINS(raw_log, "(?i)win-dc-01")
LIMIT
  1000

If there are matches, the results will show the matching file names and raw logs:

Example of how the SQL statement results look in BigQuery Studio

Extract JSON values from a Raw Log

BigQuery offers various functions to process the raw log data. This SQL statement extracts the following information from the data:

  • The date from the file name
  • The ingestion label (i.e., the log source)
  • The JSON field Hostname
  • The JSON field TargetUserName
  • and applies WHERE filters so that only WINEVTLOG records with Event ID 4624, i.e., a logon event, are returned

SELECT
  REGEXP_EXTRACT(_FILE_NAME, '(\\d{4}-\\d{2})') AS file_date,
  REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') AS ingestion_label,
  JSON_EXTRACT_SCALAR(REGEXP_EXTRACT(raw_log, r'^.*?(?P<_json>{.*?})$'), "$[Hostname]") AS json_hostname,
  JSON_EXTRACT_SCALAR(REGEXP_EXTRACT(raw_log, r'^.*?(?P<_json>{.*?})$'), "$[TargetUserName]") AS json_TargetUserName,
  raw_log
FROM
  `<project>.<data_set>.<table>`
WHERE
  REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') = "WINEVTLOG"
  AND JSON_EXTRACT_SCALAR(REGEXP_EXTRACT(raw_log, r'^.*?(?P<_json>{.*?})$'), "$[EventID]") = "4624"
LIMIT 5

This SQL statement would return results as follows:

Finding who logged on, where, and from what
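
The repeated REGEXP_EXTRACT and JSON_EXTRACT_SCALAR calls get verbose quickly. An alternative sketch is to extract the JSON payload once in a WITH clause and reuse it; this uses the same placeholder table and the same JSONPath expressions as the query above:

WITH parsed AS (
  SELECT
    REGEXP_EXTRACT(_FILE_NAME, '(\\d{4}-\\d{2})') AS file_date,
    REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') AS ingestion_label,
    -- extract the JSON payload once per row
    REGEXP_EXTRACT(raw_log, r'^.*?(?P<_json>{.*?})$') AS json_payload,
    raw_log
  FROM
    `<project>.<data_set>.<table>`
)
SELECT
  file_date,
  ingestion_label,
  JSON_EXTRACT_SCALAR(json_payload, "$[Hostname]") AS json_hostname,
  JSON_EXTRACT_SCALAR(json_payload, "$[TargetUserName]") AS json_TargetUserName,
  raw_log
FROM
  parsed
WHERE
  ingestion_label = "WINEVTLOG"
  AND JSON_EXTRACT_SCALAR(json_payload, "$[EventID]") = "4624"
LIMIT 5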

Aggregate Queries

You can use aggregate functions against your raw log data to answer questions, such as how many events were observed from an entity within a given time frame:

SELECT
  REGEXP_EXTRACT(_FILE_NAME, '(\\d{4}-\\d{2})') AS file_date,
  REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') AS ingestion_label,
  COUNT(raw_log) AS raw_log_count
FROM
  `<project>.<data_set>.<table>`
WHERE
  REGEXP_EXTRACT(_FILE_NAME, '(\\d{4}-\\d{2})') = "2023-06"
  AND REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') = "WINEVTLOG"
  AND REGEXP_CONTAINS(raw_log, "(?i)win-dc-01")
GROUP BY 1, 2

And matching results:

Obtaining aggregate results using Aggregate functions
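
Another useful aggregate is a per-source count across the entire export, i.e., how many raw logs each ingestion label contributed; a sketch using the same placeholder table name:

SELECT
  REGEXP_EXTRACT(_FILE_NAME, '^.*\\/(.*?)-FROM') AS ingestion_label,
  COUNT(*) AS raw_log_count
FROM
  `<project>.<data_set>.<table>`
GROUP BY
  ingestion_label
ORDER BY
  raw_log_count DESC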

⚠️ UDM logs in proto3 format that were ingested into the platform don’t export correctly, causing errors during export or when querying the exported files. This is likely because the platform currently doesn’t format proto3 logs correctly ahead of export.

To avoid errors, I recommend deleting exported UDM logs from the GCS bucket before running queries. Unlike other event types, UDM logs in proto3 format can span multiple lines, leading to errors if not handled correctly. While directly ingested UDM events are not common, this step ensures smooth data export and querying.

Creating a BigQuery External Table

Instructions for creating an external BigQuery table are as follows:

  1. Create a dataset within BigQuery
  • specify a Dataset ID, e.g., chronicle_archive
  • specify the Location to match the region of the GCS bucket that holds the raw logs
  • optionally, enable table expiration
    - if perpetual retention is required do not enable this

2. Create a Table

  • Create table from: Google Cloud Storage
  • Select file from GCS Bucket: <bucket-name>/*
  • File format: CSV
  • Destination: <the project with the BQ dataset>
  • Dataset: <the BQ dataset>
  • Table: <the BQ table>
  • Table Type: External Table
  • Advanced Options:
    — Field delimiter: Custom
    — Custom field delimiter: §
    — Quote character: None
  • Click Create Table

The exported raw logs in the GCS bucket have the .csv extension but contain only a single column, so there is no delimiter character in the data. This presents a challenge because BigQuery’s external table expects CSV-formatted input. To overcome this, I choose a delimiter that will never appear in the exported logs, the section sign (§). This allows the external table to process the single-column data as though it were a standard CSV file.

🤷 There may be a better way of creating an external table than using a field delimiter that will never exist, but I couldn’t work one out

Creating the external table in BigQuery
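
If you prefer to script the setup rather than click through the console, the dataset and external table can also be created with DDL. The sketch below assumes placeholder project, bucket, and table names and a US location; adjust these to your environment (the dataset location must be compatible with the region of the GCS bucket):

-- create the dataset (location is an assumption; use your bucket's region)
CREATE SCHEMA IF NOT EXISTS `<project>.chronicle_archive`
OPTIONS (
  location = 'US'
);

-- define the external table over the exported raw logs
CREATE EXTERNAL TABLE IF NOT EXISTS `<project>.chronicle_archive.raw_logs`
(
  raw_log STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://<bucket-name>/*'],
  field_delimiter = '§',  -- a character that never appears in the raw logs
  quote = ''              -- equivalent to "Quote character: None"
);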

Summary

  • The Data Export API offers the ability to perform a single-use or point-in-time export of all or selected raw logs to a GCS storage bucket.
  • You can leverage BigQuery Studio to run various SQL queries against your raw logs in a GCS bucket, allowing you to analyze and extract insights from your data outside of Chronicle SIEM.
  • As this feature is currently in preview, it might require additional permissions not granted by default. To obtain this access, contact your Chronicle SecOps account team or Chronicle SecOps Partner.
