Automagic JSON Parsing

May 7, 2024

Google SecOps is releasing a feature that changes detection and response for Chronicle SIEM users: Autonomous Parsers, aka Extracted Fields, which automatically extract JSON key-value pairs.

In this post, I’ll dive into the details of this exciting advancement and explain how it can significantly streamline your security operations:

Rapid Onboarding of JSON Log Sources: Seamlessly integrate new JSON-based log sources into Chronicle SIEM without the need for upfront parser creation.
Enhanced Detection and Response: Leverage UDM Search or the Detection Engine for existing log sources, even if key fields haven’t been explicitly mapped to the Unified Data Model (UDM).
Workaround for Corner Cases: Address challenges from Aliasing and Enrichment issues caused by repeated field inconsistencies by directly utilizing values from the original log.

Automatic JSON key value pair extraction in Chronicle SIEM

Overview

Chronicle SIEM, as a schema-on-write platform, employs a familiar Extract, Transform, Load (ETL) process, often referred to as “parsing” in the SIEM world. However, unlike ETL in other domains where data is typically well-structured and predictable, parsing application and operating system logs in the infosec realm presents a unique challenge. This data is often messy, inconsistent, and prone to sudden changes, making it akin to working with actively hostile information.

In Chronicle SIEM, “parsers” and “ingestion labels” play crucial roles. Parsers act as field renaming configuration files, taking original log input tagged with a specific ingestion label and transforming it into the structured UDM schema.

It’s important to note that not all ingestion labels have an associated parser. This information can be verified in the Google SecOps documentation, which differentiates between ingestion labels with parsers and those without.

It’s also worth highlighting that the use of “ingestion labels” to uniquely identify a log source to parser mapping is distinct from the metadata.ingestion_labels field within the UDM schema. While sharing the same name, the latter serves as a user-defined field for arbitrary labeling.

With the above background, let’s explore how Autonomous Parsers can be used in Google SecOps.

Autonomous Parsers

It’s important to note that Autonomous Parsing currently applies exclusively to log sources using valid JSON format.

While default parsers often excel at extracting key fields into the UDM for many common integrations, they may overlook less-frequent yet crucial values. Autonomous Parsing fills this gap by capturing those overlooked details.
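As a rough illustration of what extracting every JSON key-value pair means, here is a minimal Python sketch that flattens a nested JSON log into dotted key paths. The dotted naming is inferred from keys such as "actor.email" used later in this post; the function name flatten_json, the sample log, and the positional handling of arrays are assumptions made for illustration, not a description of how Chronicle SIEM does it internally.

```python
import json


def flatten_json(value, prefix=""):
    """Flatten nested JSON objects into a dict keyed by dotted paths."""
    fields = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            fields.update(flatten_json(child, path))
    elif isinstance(value, list):
        # Assumption: array elements are addressed by position, e.g. events[0].name
        for index, child in enumerate(value):
            fields.update(flatten_json(child, f"{prefix}[{index}]"))
    else:
        # Scalar leaf: store the value under the accumulated path
        fields[prefix] = value
    return fields


# Hypothetical raw log, not a real Workspace record
raw_log = '{"actor": {"email": "admin@foo.bar.com"}, "events": [{"name": "login_success"}]}'
extracted = flatten_json(json.loads(raw_log))
print(extracted["actor.email"])
```

Every leaf value ends up addressable by a single string key, which is the shape the extracted.fields["..."] lookups in the rest of this post rely on.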

UDM Search

Extracted Fields can be used directly as a filter in a UDM Search query, for example:

extracted.fields["actor.email"] = "admin@foo.bar.com"

In addition to filtering, Extracted Fields can also populate columns in the UDM event table.

Extracted Fields are also supported as Field filters within UDM Search.

Extracted Fields used on their own might not match the speed of a UDM search using an Indexed field. You can, however, combine Extracted Fields with Indexed fields to optimize UDM search performance.

Extracted Fields become immensely valuable when a given key-value pair was not mapped into UDM and that gap blocks your response process, i.e., you can’t filter a value in or out, or you have to pivot to Raw Log Search.

Extracted Fields can also be used in a UDM Stats search which, similarly to the point above, is incredibly powerful for response processes when you need to start with high-level aggregate analysis and require access to original log values that were not normalized into UDM by the associated log source parser.

// events
$user = $e.extracted.fields["actor.email"]
// conditions
$e.metadata.log_type = "WORKSPACE_ACTIVITY"
match:
    $user
outcome:
    $total = count($e.metadata.id)
order:
    $total desc

There is no longer an immediate need to either author a Parser Extension, or submit a Feature Request for adding a field extraction into a default UDM parser.
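To make the aggregation in the UDM Stats search above concrete, here is a small Python sketch of the same logic: filter on a log type, group by an extracted field, count events, and sort by the count in descending order. The event list, its dictionary layout, and the function name count_events_by_field are all hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical events; the dict layout is illustrative, not the UDM schema
events = [
    {"id": "e1", "log_type": "WORKSPACE_ACTIVITY", "fields": {"actor.email": "admin@foo.bar.com"}},
    {"id": "e2", "log_type": "WORKSPACE_ACTIVITY", "fields": {"actor.email": "alice@foo.bar.com"}},
    {"id": "e3", "log_type": "WORKSPACE_ACTIVITY", "fields": {"actor.email": "admin@foo.bar.com"}},
    {"id": "e4", "log_type": "OTHER_LOG_TYPE", "fields": {"actor.email": "admin@foo.bar.com"}},
    {"id": "e5", "log_type": "WORKSPACE_ACTIVITY", "fields": {}},
]


def count_events_by_field(events, log_type, field):
    """Count events per value of an extracted field, highest count first."""
    counts = Counter(
        event["fields"][field]
        for event in events
        # Keep only the requested log type, and skip events missing the field
        if event["log_type"] == log_type and field in event["fields"]
    )
    # most_common() returns (value, count) pairs ordered by count, descending
    return counts.most_common()


totals = count_events_by_field(events, "WORKSPACE_ACTIVITY", "actor.email")
for user, total in totals:
    print(user, total)
```

Each event in the sample has a unique id, so counting events here plays the same role as count($e.metadata.id) in the search.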

ℹ️ UDM Stats is in preview, with further information available here.

Dashboards

Extracted Fields are not available in embedded Looker, i.e., the current (read, legacy) Dashboard feature in Chronicle SIEM.

Extracted Fields are available in Native Dashboards, the new preview Dashboard feature in Chronicle SIEM.

Using Extracted Fields in Native Dashboards in Chronicle SIEM

Detection Engine

Similarly to UDM Search, Extracted Fields are available in YARA-L rules within Detection Engine, and you can use them as you would use any UDM field.

Continuing the example above, here’s a line from a rule that uses Extracted Fields to assign the original email address value to the variable $email:

$email = $e.extracted.fields["actor.email"]

And for reference, the entire YARA-L rule:

A multi-event YARA-L Detection rule
Generates an Alert for a User logging in from a new location for the first time in 30 days, where the user was not created in the prior 2 days
Uses UEBA Metrics for the first_seen detection logic

rule ueba_auth_attempts_success_first_seen_login_from_country_in_interval {
  meta:
    author = "@thatsiemguy"
    description = "Detects successful user authentication for the first time in a given interval (30 days)."
    severity = "LOW"
    priority = "LOW"

  events:
    $e1.metadata.log_type = "WORKSPACE_ACTIVITY"
    $e1.metadata.product_event_type = "CREATE_USER"
    $email = $e1.target.user.email_addresses 
    $email != ""     
    $creation_timestamp = $e1.metadata.event_timestamp.seconds

    $e1.target.user.email_addresses = $e.extracted.fields["actor.email"]
    $creation_timestamp < $login_timestamp

    $e.metadata.event_type = "USER_LOGIN"
    $e.security_result.action = "ALLOW"
    $country = $e.principal.ip_geo_artifact.location.country_or_region
    $country != ""
    $email = $e.extracted.fields["actor.email"]
    $email != ""
    $login_timestamp = $e.metadata.event_timestamp.seconds 

  match:
    $email, $country over 2d

  outcome:
    $risk_score = 15
    $usage_past_24h = count($e.metadata.id)  
    $first_seen_today = max(
        metrics.auth_attempts_total(
            period:1h, window:today, 
            metric:first_seen, 
            agg:sum,
            target.user.email_addresses:$email, 
            principal.ip_geo_artifact.location.country_or_region:$country
        )
    )
    $first_seen_monthly = max(
        metrics.auth_attempts_total(
            period:1d, window:30d, 
            metric:first_seen, 
            agg:sum,
            target.user.email_addresses:$email, 
            principal.ip_geo_artifact.location.country_or_region:$country
        )
    )
    $count_intervals_in_baseline = max(
        metrics.auth_attempts_total(
            period:1d, window:30d, 
            metric:event_count_sum, 
            agg:num_metric_periods,
            target.user.email_addresses:$email, 
            principal.ip_geo_artifact.location.country_or_region:$country
        )
    )
  condition:
    $e // login detection
    and !$e1 // but no user created recently
    and ($first_seen_monthly = 0)  // first login in last 30 days
}

This is powerful because it also helps avoid the challenges that can arise from repeated fields and the non-deterministic nature of enrichment, i.e., an enriched value can be added as the index 0 value. Using Extracted Fields provides a reliable workaround to this challenge.
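The repeated field problem can be pictured with a short Python sketch. The event below is entirely made up: it assumes enrichment has placed an alias address at index 0 of a repeated email field, while the extracted field still holds the value from the original log. The helper names first_repeated_value and original_value are hypothetical.

```python
# Hypothetical event: the shape and values are illustrative only
event = {
    "target": {
        "user": {
            # Assumption: enrichment inserted an alias ahead of the logged address
            "email_addresses": ["alias@foo.bar.com", "admin@foo.bar.com"],
        }
    },
    "extracted": {"fields": {"actor.email": "admin@foo.bar.com"}},
}


def first_repeated_value(event):
    """Return index 0 of the repeated field, which depends on enrichment order."""
    return event["target"]["user"]["email_addresses"][0]


def original_value(event, key):
    """Return the value exactly as it appeared in the original log, or None."""
    return event["extracted"]["fields"].get(key)


print(first_repeated_value(event))
print(original_value(event, "actor.email"))
```

Logic that relies on the first element of the repeated field picks up the alias in this sample, whereas the extracted field lookup returns the address that was actually logged.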

Log Sources without a Parser

Autonomous Parsers offer a valuable advantage for scenarios where log sources lack a default parser or are customer-specific: if the log source is in JSON format, all fields are automatically extracted, allowing you to immediately utilize the data.

For instance, when ingesting a raw JSON log into the ASANA label (which lacks a default parser), Autonomous Parsers seamlessly generate a GENERIC_EVENT UDM event. The extracted JSON fields are then automatically populated within the UDM’s extracted field object.
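As a loose sketch of that behaviour, the Python below wraps a raw JSON log in a GENERIC_EVENT-style dictionary and returns nothing for input that is not a valid JSON object. The function name to_generic_event, the dictionary layout, and the sample log keys are assumptions for illustration and do not reflect the real UDM structure or a real Asana record. Only top-level keys are copied to keep the sketch short.

```python
import json


def to_generic_event(raw_log, log_type):
    """Wrap a raw JSON log in a GENERIC_EVENT-style dict, or return None."""
    try:
        parsed = json.loads(raw_log)
    except json.JSONDecodeError:
        # Not valid JSON, so there is nothing to extract
        return None
    if not isinstance(parsed, dict):
        # Valid JSON, but not an object with key-value pairs
        return None
    return {
        "metadata": {"event_type": "GENERIC_EVENT", "log_type": log_type},
        # Assumption: top-level keys only; nested values are kept as-is
        "extracted": {"fields": dict(parsed)},
    }


# Hypothetical sample log; the keys are made up for this example
sample = '{"resource_type": "task", "action": "task_created"}'
generic_event = to_generic_event(sample, "ASANA")
print(generic_event["metadata"]["event_type"])
print(generic_event["extracted"]["fields"]["action"])
```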

Existing Parsers

For existing custom or default Parsers, or Parser Extensions, no change is needed: Extracted Fields are automatically appended into UDM, provided the log source is in JSON format.

Summary

Autonomous Parsing is a seemingly minor yet significantly impactful addition to Google SecOps:

By providing access to all raw log fields for Search, Detection, and Dashboarding, it prevents dead-ends in investigations that would otherwise require resorting to raw log searches.
Moreover, this feature accelerates the onboarding of new log sources and reduces the time needed to develop parsers when integrating with the UDM.
In future, other formats such as XML are on the roadmap, which will help with log sources such as Windows Event Logs in their native format.

If not already part of the preview, Autonomous Parsers should be rolling out to your tenant in the near future 🎉