Working with Repeated Fields in Chronicle SIEM

Chris Martin (@thatsiemguy)
9 min read · Jan 24, 2023


In this post I explore Repeated fields, a field type within Chronicle SIEM’s UDM schema that can store multiple values in a single key, aka an Array.

What they are, how to use them, and challenges you may encounter.

A repeated field 🥁

🆕 Dec 23 Update

  • Added clarification that repeated fields are not guaranteed to be returned in order.
  • This is important to be aware of because, while the vast majority of the time they are ordered, this will not always be the case; if you rely on the ordering of repeated fields, e.g., values added by enrichment, you may get unexpected detection results, i.e., false positives.
  • Array Indexing has been added to YARA-L, which enables more precise selection of repeated field values using an ordinal value (see the sketch below).
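
For example, a minimal sketch of how that could look in an events: section, reusing the DNS example from later in this post. Note the bracketed index syntax and its placement here are my assumptions; check the current YARA-L documentation for the exact form supported in your instance.

events:
$e.metadata.event_type = "NETWORK_DNS"
// select only the first (ordinal 0) answer in the repeated field,
// rather than matching against every element
$e.network.dns.answers[0].name = "mybbc-analytics.files.bbci.co.uk."

condition:
$e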

Understanding Repeated fields in UDM

Repeated fields are a neat feature of UDM, and are used in scenarios where you need to capture all values within a log’s field that may have more than one value, e.g.,

  • network.dns.answers stores all DNS answers to a DNS question
  • user.email_addresses stores all of a user's email addresses
  • user or resource attributes store arbitrary key-value pairs
  • and many, many more examples

The main takeaway here is that, when used properly, you can have confidence that all values within a log were captured and are available for detection and response purposes.

Network DNS Question and Answers

An asset issues a DNS query (question) and receives one or more DNS responses (answers), and those responses are indexed into UDM as network.dns.answers.

Here’s an excerpt of a DNS question and answer as parsed into Chronicle UDM, with one query (question), and three responses (answers):

metadata.event_type = "NETWORK_DNS"
metadata.id = "AAAAAPSrL1HeCmRyB8x3KYCjHRkAAAAAAQAAAE4AAAA="
network.application_protocol = "DNS"

# Question
network.dns.questions.name = "mybbc-analytics.files.bbci.co.uk"
network.dns.questions.type = 1

# Answer 1
network.dns.answers.name = "mybbc-analytics.files.bbci.co.uk."
network.dns.answers.type = 5
network.dns.answers.class = 1
network.dns.answers.ttl = 3600
network.dns.answers.data = "vip1.bbc-a.akadns.net."
# Answer 2
network.dns.answers.name = "vip1.bbc-a.akadns.net."
network.dns.answers.type = 5
network.dns.answers.class = 1
network.dns.answers.ttl = 60
network.dns.answers.data = "e9930.dscapi2.akamaiedge.net."
# Answer 3
network.dns.answers.name = "e9930.dscapi2.akamaiedge.net."
network.dns.answers.type = 1
network.dns.answers.class = 1
network.dns.answers.ttl = 20
network.dns.answers.data = "23.13.33.110"

network.dns.truncated = true

Let’s create a single event and multi-event YARA-L rule and observe the results.

Single event version of the YARA-L detection

A single event YARA-L rule will evaluate a single UDM event only.

For testing purposes like this, a useful approach is to filter on an identifier that is unique to the individual log, e.g., the metadata.id value that is uniquely generated for each individual log, or the metadata.product_log_id used in the rule below. I’m also using the outcome: section to generate insights into the event itself, i.e., a distinct count of the values observed within the network.dns.answers.name field.

events:
$example1.metadata.product_log_id = "1jrw3ksd6llo"
$example1.metadata.event_type = "NETWORK_DNS"
$example1.metadata.vendor_name = "Google Cloud Platform"
$example1.metadata.product_name = "Google Cloud DNS"
$example1.network.dns.answers.name = $network_dns_answers_name

outcome:
$count_of_dns_answer_name = count_distinct($network_dns_answers_name)
$risk_score = 0

condition:
$example1

And, when run, it returns one Detection result.

Notice the outcome: variable count_of_dns_answer_name, which shows there are three responses (answers).

1 Detection result from the single event test rule, but three DNS Answer Names
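
As an aside, if you want the individual answer values surfaced within the single Detection, rather than just a count, one option is to also collect them into an outcome variable. A minimal sketch, assuming the array_distinct() outcome aggregation is available in your instance:

outcome:
$count_of_dns_answer_name = count_distinct($network_dns_answers_name)
// also expose the distinct answer names themselves on the detection
$dns_answer_names = array_distinct($network_dns_answers_name)
$risk_score = 0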

Multi-event version of the YARA-L rule

Now, the multi-event version of the above YARA-L rule, which evaluates multiple events over a given time window via the match: clause.

events:
$example1.metadata.product_log_id = "1jrw3ksd6llo"
$example1.metadata.event_type = "NETWORK_DNS"
$example1.metadata.vendor_name = "Google Cloud Platform"
$example1.metadata.product_name = "Google Cloud DNS"
$example1.network.dns.answers.name = $network_dns_answers_name

match:
$network_dns_answers_name over 1m

outcome:
$count_of_dns_answer_name = count_distinct($network_dns_answers_name)
$risk_score = max(0)

condition:
$example1

And this rule returns three Detection results; again, note the outcome: variable count_of_dns_answer_name, which is now one for each Detection.

3 Detections for the multi-event test rule, and 1 DNS Answer Name per distinct Detection

Note, in production a multi-event rule is realistically going to be joined against i) the Entity Graph, or ii) a Reference List (a sketch of which follows the rule below), applying a specific match, e.g., see the below rule with an explicit filter on network.dns.answers.name.

events:
$example1.metadata.product_log_id = "1jrw3ksd6llo"
$example1.metadata.event_type = "NETWORK_DNS"
$example1.metadata.vendor_name = "Google Cloud Platform"
$example1.metadata.product_name = "Google Cloud DNS"
$example1.network.dns.answers.name = $network_dns_answers_name

// add an explicit match, i.e., if you were using a Reference List or Entity Graph
$example1.network.dns.answers.name = "mybbc-analytics.files.bbci.co.uk."

match:
$network_dns_answers_name over 1m

outcome:
$count_of_dns_answer_name = count_distinct($network_dns_answers_name)
$risk_score = max(0)

condition:
$example1
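
For illustration, a sketch of what the Reference List version of that explicit filter could look like, assuming a hypothetical list named dns_answer_watchlist exists in your instance:

// instead of an explicit literal value, match any answer name present in a Reference List
$example1.network.dns.answers.name in %dns_answer_watchlist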

So, why does the single event rule return one Detection while the multi-event rule returns three?

A useful way to understand this is to run the above YARA-L Rules as SQL statements in Chronicle Datalake, aka BigQuery.

The first YARA-L rule, the single event rule, run as a SQL statement returns one distinct row.

1 Distinct row, with multiple nested rows

Notice that:

  • within that one row there are three nested sub-rows, and the values in answers.name appear in order; however, just because the fields were returned in this order in this example, there is no guarantee in Chronicle SIEM that repeated fields will be returned in order
    - this is a subtle but important detail to be aware of, e.g., if you were to assume that the value parsed from the original log was always in position 0 of a repeated field, you may encounter a scenario where it is in position 1 and an enriched value is in position 0, which could generate a false positive
  • and, it looks very similar to the results from the single event rule 🤔

The single event SQL statement for reference:

SELECT 
metadata.id,
network.dns.answers
FROM `datalake.events`
WHERE
DATE(hour_time_bucket) = CURRENT_DATE()
AND metadata.product_log_id = "1jrw3ksd6llo"
LIMIT 10

⚠️ If you try to use the field network.dns.answers.name in the SQL statement it returns the error “Cannot access field name on a value with type ARRAY<STRUCT<name STRING, type INT64, class INT64,…”, which is because you can’t directly query an element within the array without flattening (unnesting) the data first.

The second SQL statement, i.e., the multi-event rule, returns three distinct rows rather than one distinct row with array values.

3 distinct rows, with each row flattened, or unnested

Notice that:

  • ̶t̶h̶e̶ ̶o̶r̶d̶e̶r̶i̶n̶g̶ ̶i̶s̶ ̶c̶o̶n̶s̶i̶s̶t̶e̶n̶t̶
    - while the ordering can appear consistent, this is not guaranteed, and should not be relied upon
  • similar to the multi-event YARA-L rule there are three distinct rows, three Detections 💡

And the SQL statement for reference:

SELECT
metadata.id,
answer
FROM
`chronicle-coe.datalake.events`,
UNNEST(network.dns.answers) answer
WHERE
DATE(hour_time_bucket) = "2023-01-22"
AND metadata.product_log_id = "1jrw3ksd6llo"
LIMIT 10

Multi-event rules use an Unnest operation on repeated fields

The first SQL statement (the single event equivalent) returned 1 row, just like the single event YARA-L rule, and the second SQL statement (the multi-event equivalent) returned 3 rows, just like the multi-event YARA-L rule.

Given that, we can summarize so far, and infer:

  1. A single event YARA-L rule does not perform an Unnest operation on network.dns.answers
  2. A multi-event YARA-L rule does perform an Unnest operation on network.dns.answers
  3. ̶T̶h̶e̶ ̶v̶a̶l̶u̶e̶s̶ ̶w̶i̶t̶h̶i̶n̶ ̶t̶h̶e̶ ̶n̶e̶t̶w̶o̶r̶k̶.̶d̶n̶s̶.̶a̶n̶s̶w̶e̶r̶ ̶A̶r̶r̶a̶y̶ ̶f̶i̶e̶l̶d̶ ̶a̶r̶e̶ ̶o̶r̶d̶e̶r̶e̶d̶

A single unique row, with multiple nested values in the repeated field

User Email Addresses

The next example of repeated fields I’ll explore is the UDM field user.email_addresses.

Let’s start with an example User Login event via UDM Search. Notice that this user has three email addresses in the email_addresses repeated field.

1 Search result with 3 nested email addresses, and 2 distinct addresses

Let’s run the same tests as before: a single event and a multi-event YARA-L rule.

The single event YARA-L rule returns one Detection.

events:
$example2.metadata.product_log_id = "LWTGkIvpWq_t1HvgOb-J-u25ZbOXMEy7kpw5C2HYRIU/XUWLcR_C42MQL4wOGGAC_W7jqtc"
$example2.metadata.event_type = "USER_LOGIN"
$example2.metadata.vendor_name = "Google Workspace"
$example2.metadata.product_name = "saml"
$example2.metadata.product_event_type = "login_success"
$example2.target.user.email_addresses = $user

outcome:
$count_target_user_email_addresses = count_distinct($user)

condition:
$example2

And a multi-event version of the rule returns two Detections 🤔

events:
$example2.metadata.product_log_id = "LWTGkIvpWq_t1HvgOb-J-u25ZbOXMEy7kpw5C2HYRIU/XUWLcR_C42MQL4wOGGAC_W7jqtc"
$example2.metadata.event_type = "USER_LOGIN"
$example2.metadata.vendor_name = "Google Workspace"
$example2.metadata.product_name = "saml"
$example2.metadata.product_event_type = "login_success"
$example2.target.user.email_addresses = $user

match:
$user over 1m

outcome:
$count_target_user_email_addresses = count_distinct($user)
//$risk_score = max(0)

condition:
$example2

And to expand upon that, here’s a screenshot of the Detection.

Why are there two Detections, and three email addresses shown in the target.user.email_addresses field?

  • The first email address is the value parsed from the log itself
  • The second and third email addresses are context enriched fields from a matching Entity Graph context source.

How did you know this? That’s a bit of a grey area. One option is to check for the metadata.enriched field, but mostly this is going to come down to knowledge of your environment, specifically which Context Log Sources are applicable and could be enriching log data.

Why is there a second detection?

It’s for the same reason as in the first DNS example: a multi-event rule appears to perform an Unnest operation, which creates a unique row for each value in the repeated field.

So why only two Detections when the UI shows three email addresses? This appears to be because one of those emails is the value parsed from the log, and the other two are from a context source, but there are only two unique email addresses (I know, that’s hard to see as I’ve obfuscated the picture).

How can you test or prove this?

  1. You can look at the User in question in User View, or
  2. search your Context Sources via Raw Log Search

How can I prevent a second, duplicate Detection from being generated?

There are a few options available here:

  1. Let your Chronicle SOAR handle it

If you have Chronicle SOAR, then let your SOAR handle it. Chronicle SOAR includes rather clever alert grouping within a Case, and can successfully group together these alerts into a single case.

An example of Chronicle SOAR automatically grouping the two Detections together into one Case.

2. Use a non-repeated field as a match variable

If you have verified that consistent enrichment is being applied, then a good option is to use a non-repeated field, e.g., userid:

events:
$example2.metadata.product_log_id = "LWTGkIvpWq_t1HvgOb-J-u25ZbOXMEy7kpw5C2HYRIU/XUWLcR_C42MQL4wOGGAC_W7jqtc"
$example2.metadata.event_type = "USER_LOGIN"
$example2.metadata.vendor_name = "Google Workspace"
$example2.metadata.product_name = "saml"
$example2.metadata.product_event_type = "login_success"

// switch from a repeated field to a standard field
$example2.target.user.userid = $user

match:
$user over 1m

outcome:
$count_target_user_email_addresses = count_distinct($user)
//$risk_score = max(0)

condition:
$example2

And as a non-repeated field this will return one Detection.

An alternative option here if you need to use a repeated field is to use a Parser Extension and populate a non-enriching UDM object, e.g., security_result.about.user.email_addresses. While not ideal, this does provide a predictable way to have consistent detections.
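
As a sketch of how a rule could then match on that Parser Extension populated field instead (the field path and behaviour here are an assumption on my part, based on the approach described above, rather than a tested rule):

events:
$example2.metadata.event_type = "USER_LOGIN"
$example2.metadata.product_event_type = "login_success"

// populated only by the Parser Extension from the raw log,
// so it should not pick up context enriched values
$example2.security_result.about.user.email_addresses = $user

match:
$user over 1m

condition:
$example2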

How do you confirm consistent enrichment?

That’s a larger topic for another post, but a quick way is to run a UDM Search over a set time interval for the same user, once by Target User ID and once by Target Email Address. The number of events returned will be identical if consistent enrichment is happening, with the caveat that you should check the log source doesn’t include both of these values in the log itself.

What if my log source only includes email addresses?

A great feature of Chronicle SIEM is its continual aliasing and context enrichment. If you have a User Context source set up and ingested, e.g., Cloud Identity, OKTA, Azure AD, this enables Chronicle to alias a user between their different user IDs and email addresses, and to inject related context data into the event, i.e., we can use the userid field in a Detection rule even for logs that don’t include a userid.

3. Apply exclude filters to omit additional values

If you have aliased email addresses that you don’t expect to fire in normal day-to-day activity, you can apply an exclude filter directly in your YARA-L rule, as follows:

events:
$example2.metadata.product_log_id = "LWTGkIvpWq_t1HvgOb-J-u25ZbOXMEy7kpw5C2HYRIU/XUWLcR_C42MQL4wOGGAC_W7jqtc"
$example2.metadata.event_type = "USER_LOGIN"
$example2.metadata.vendor_name = "Google Workspace"
$example2.metadata.product_name = "saml"
$example2.metadata.product_event_type = "login_success"
$example2.target.user.email_addresses = $user

// exclude additional context enriched email addresses
not $example2.target.user.email_addresses = /test-google-a\.com$/

Note,

  • this approach likely will not work if you have multiple addresses on the same domain name
  • this could prevent a detection from firing if the aliased value is the one used, so it is better suited to cases where you are certain of the domains involved, e.g., secondary domains, test domains

4. Apply a Custom Parser to not index additional email addresses into Entity Graph

To achieve the same end result as the above YARA-L Detection filter, you can update the parsing of your Context source to only index the repeated field values you want to use for Context enrichment, e.g., an excerpt from a parser that only indexes email addresses ending with a specific domain:

for email in emails {
  if [email][address] =~ /@altostrat\.com$/ {
    mutate {
      merge => {
        "user.email_addresses" => "email.address"
      }
    }
  }
}

Again, if you have multiple email addresses in the same domain this may not work, and you should consider whether using the above could prevent an aliased detection from firing.

Summary

Repeated fields are pretty awesome for helping ensure all required data from a log is captured, indexed, and available for Detection purposes; however, you do need to understand the nuances of nesting and un-nesting in order to get predictable Detection results and behavior, and hopefully the above helps you do just that.
