Using Entity Graph as a Multi-dimensional List

Nov 17, 2022

Chronicle SIEM includes Reference Lists, a feature that provides lists of the following three types:

i) String
ii) CIDR
iii) RegEx (Regular Expressions)

String-based Reference Lists can be used for include or exclude logic (filtering); however, they’re inherently one-dimensional (at the time of writing), i.e., you can only filter on one UDM Object at a time.

What if you want to apply an include or exclude filter with multiple dimensions? e.g., IP Address, Port, and Protocol? User, Department and Location? Process, Path, and Hash?

In this post I will show how you can do just that using Chronicle SIEM’s Entity Graph.

Entity Graph can act as a super useful multi-dimensional reference list, but without the potential challenges of using a Single Event rule.

🆕 Chronicle SIEM has since released a native Tor Exit Node database in Entity Graph. See the following blog post for more info:

New to Chronicle: Detecting Tor Exit Nodes and Remote Access Tools


Filtering a Detection Rule on IP and Port

Imagine a Single Event YARA-L Detection Rule for matching Tor activity using a String-based Reference List containing a list of Tor exit node IP addresses.

As I finished typing the above sentence, it occurred to me that it is a ridiculous thing to expect someone to imagine, so here’s an actual rule:

rule tor_exitNode_listMatch {
  meta:
    author = "thatsiemguy@"
    description = "Match outbound IP accesses against a Tor Exit Node reference list."
    severity = "MEDIUM"

  events:
    $tor.metadata.event_type = "NETWORK_CONNECTION" and
    ( $tor.target.port = 9001 or
      $tor.target.port = 9030 or 
      $tor.target.port = 443 ) and
    $tor.target.ip in %tor_exit_nodes

  condition:
    $tor
}

This Single Event YARA-L Detection Rule will successfully detect Tor exit node activity, which is pretty neat; however, there are challenges with the above approach:

i) It could generate false positives, e.g., certain Tor nodes could be running a legitimate service on 443

As a workaround you could add negative grouping logic to the rule, e.g.,

not ( $tor.target.ip = "1.2.3.4" and $tor.target.port = 443 )

While this is a valid approach for rules with infrequent changes or small data sets, keep in mind that with a large range of IP addresses you may end up with a large rule that requires frequent tuning.

ii) String Reference Lists are capped at 6MB per list

That’s still a lot of IP addresses, but for a list that could include thousands to tens of thousands of IP addresses, you may end up having to split it across several Reference Lists, which adds some management overhead.

iii) YARA-L Rules have a limit on the number of Reference Lists per rule, which for String Reference Lists is 7 per rule

7 String Reference Lists at 6MB per List is … a lot. Tens of thousands of entries per list, probably, and not a likely limit you will encounter, but one to be aware of.

Given the above, this is where Entity Graph can help produce a higher-fidelity rule.
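To make the false positive problem concrete, here is a small Python sketch (not Chronicle code) contrasting a one-dimensional lookup with a two-dimensional one. The IP addresses, ports, and variable names are all made up for illustration.

```python
# Illustrative only: plain Python sets standing in for a Reference List
# and for multi-dimensional context records. All values are made up.

# One dimension: a String Reference List can only hold the IP.
tor_exit_nodes = {"1.2.3.4", "5.6.7.8"}

# Two dimensions: each record pairs an IP with the port Tor listens on.
tor_ip_port = {("1.2.3.4", 9001), ("5.6.7.8", 443)}

# Hypothetical outbound connections as (target_ip, target_port).
connections = [
    ("1.2.3.4", 443),   # legitimate HTTPS service hosted on a Tor node
    ("1.2.3.4", 9001),  # actual Tor traffic
    ("9.9.9.9", 443),   # unrelated host
]

tor_ports = {9001, 9030, 443}

# Single-dimension logic, mirroring the rule above: port filter + IP list.
list_hits = [c for c in connections if c[1] in tor_ports and c[0] in tor_exit_nodes]

# Multi-dimensional logic: IP and port must match the same record.
pair_hits = [c for c in connections if c in tor_ip_port]

print(list_hits)  # [('1.2.3.4', 443), ('1.2.3.4', 9001)]
print(pair_hits)  # [('1.2.3.4', 9001)]
```

The first result includes the legitimate 443 connection, because the port filter and the IP list are checked independently. The second only matches when IP and port belong to the same record.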

Let’s re-write the initial Tor exit node YARA-L Detection Engine rule, but this time taking into account the port and IP address, using Entity Graph:

rule tor_entity_graph {

  meta:
    author = "thatsiemguy@"
    description = "Match outbound IP accesses against a Tor Exit Node entity graph context record."
    severity = "MEDIUM"

  events:
    $connection.metadata.event_type = "NETWORK_CONNECTION"
    $connection.metadata.vendor_name = "ACME"
    $connection.principal.ip in cidr %cidr_rfc_1918
    $connection.principal.ip = $source_ip
    $connection.target.ip = $target_ip
    $connection.target.port = $target_port

    $tor.graph.metadata.vendor_name = "dan.me.uk"
    $tor.graph.metadata.product_name = "TOR Node List"
    $tor.graph.metadata.entity_type = "IP_ADDRESS"
    $tor.graph.entity.ip = $target_ip
    $tor.graph.entity.port = $target_port

  match:
    $source_ip, $target_ip, $target_port over 10m

  outcome:
    $risk_score = max(0)

  condition:
    $connection and $tor
}

Example data from https://www.dan.me.uk/ 👍

This YARA-L rule is now a multi-event rule, and invokes the Chronicle Entity Graph (more on that below).

The multi-event rule defines 2 event variables: (1) $connection and (2) $tor.

Within $connection we have 3 match variables:

$source_ip
$target_ip
$target_port

Within $tor we have 2 match variables:

$target_ip
$target_port

And, it’s the $target_ip and $target_port match variables we can use to join our UDM Event data ($connection) against UDM Entity data ($tor), aka Entity Graph.

The YARA-L statements below (excerpts from the full rule above) join $connection against $tor in the Detection Rule:

    ...
    $connection.target.ip = $target_ip
    $connection.target.port = $target_port

    ...
    $tor.graph.entity.ip = $target_ip
    $tor.graph.entity.port = $target_port

The same join could also have been expressed in YARA-L explicitly, as below (you don’t need to write this in your YARA-L rule; the former syntax will do the trick):

$connection.target.ip = $tor.graph.entity.ip
$connection.target.port = $tor.graph.entity.port

💡If you’re familiar with SQL, think of it like an Inner Join between two tables
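If SQL isn’t your thing, the same idea can be sketched in a few lines of Python. This is only an analogy for how the shared variables tie the two sides together, not how the Detection Engine is implemented; the flattened field names and sample values are hypothetical.

```python
# Illustrative only: a plain-Python inner join, standing in for how the
# shared placeholder variables tie $connection to $tor. Field names and
# values here are simplified and hypothetical.

events = [
    {"principal_ip": "10.0.0.5", "target_ip": "101.100.139.201", "target_port": 80},
    {"principal_ip": "10.0.0.6", "target_ip": "101.100.139.201", "target_port": 443},
    {"principal_ip": "10.0.0.7", "target_ip": "8.8.8.8", "target_port": 80},
]

entities = [
    {"ip": "101.100.139.201", "port": 80, "threat_feed_name": "TOR Node List"},
]

# Index the entity records by the join key (ip, port).
entity_index = {(e["ip"], e["port"]): e for e in entities}

# Inner join: keep only events whose (target_ip, target_port) has a
# matching entity record, and carry the entity along as context.
matches = [
    (ev, entity_index[(ev["target_ip"], ev["target_port"])])
    for ev in events
    if (ev["target_ip"], ev["target_port"]) in entity_index
]

for ev, ent in matches:
    print(ev["principal_ip"], "->", ent["ip"], ent["port"], ent["threat_feed_name"])
```

Only the first event survives the join: the second has the right IP but the wrong port, and the third has no entity record at all.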

Fun related fact: you aren’t limited to joining on the same type either; you can join different UDM Objects and Nouns too, e.g.:

 $event.network.dns.question.name = $graph.entity.hostname

Viewing a successful match of the Detection Rule, you can see the original UDM Event, a Network Connection, aka $connection, that triggered the rule:

The UDM Network Connection event that matched the Detection Rule

And, you can also see the Entity (the Tor node, aka $tor) that matched the original UDM event:

The UDM Entity event that matched the UDM Event

One of the more powerful features of the YARA-L Detection Engine is that you get the context around the match as well! (It’s that powerful a capability it deserves an exclamation point.) Within the Detection View you can see the exact Entity that generated the finding, without having to pivot or perform an additional lookup.

The key takeaway here: we’re joining on two dimensions at once, and we’re not limited to two dimensions either, with the caveat that those values have to exist in the matching Entity Graph entity record.
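As a rough Python illustration of that caveat, the sketch below joins on an arbitrary list of dimensions and skips any entity record that lacks one of them. The join_on helper, the record shapes, and the protocol dimension are hypothetical, chosen only to show the idea.

```python
# Illustrative only: joining on an arbitrary set of dimensions.
# Record shapes and field names are hypothetical.

def join_on(events, entities, dims):
    """Return (event, entity) pairs that agree on every field in dims.

    Entities missing any of the fields are skipped, mirroring the caveat
    that a value must exist in the entity record to be joined on.
    """
    index = {}
    for ent in entities:
        if all(d in ent for d in dims):
            index[tuple(ent[d] for d in dims)] = ent
    pairs = []
    for ev in events:
        if not all(d in ev for d in dims):
            continue
        key = tuple(ev[d] for d in dims)
        if key in index:
            pairs.append((ev, index[key]))
    return pairs

events = [
    {"ip": "101.100.139.201", "port": 80, "protocol": "TCP"},
    {"ip": "101.100.139.201", "port": 80, "protocol": "UDP"},
]
entities = [
    {"ip": "101.100.139.201", "port": 80, "protocol": "TCP"},
    {"ip": "203.0.113.9", "port": 80},  # no protocol recorded
]

two_dims = join_on(events, entities, ["ip", "port"])
three_dims = join_on(events, entities, ["ip", "port", "protocol"])
print(len(two_dims), len(three_dims))  # 2 1
```

Adding the third dimension narrows two matches down to one, and the entity without a protocol value can never take part in the three-way join.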

Okay, what is the Entity Graph then?

💡If you’re already familiar with Context Aware Analytics, aka the Entity Graph, you can skip ahead to the next section.

UDM Data Models

Chronicle SIEM normalizes input sources into its schema, the Universal Data Model (UDM). Within the UDM schema there are two data models:

1. Event Data Model

Event data is the result of your more traditional input sources to a SIEM, such as Windows Event Logs, Linux Syslog, your EDRs, Public Clouds, and so forth. Chronicle CBNs (Parsers) take raw logs and use GROK to normalize the data into the UDM Event model, i.e., how raw logs end up as Principal IP or Target URL.

2. Entity Data Model

Entity data is the result of contextual sources, aka, enrichment sources. These include your LDAPs and Identity Stores, such as Active Directory, OKTA, Cloud Identity; Asset Inventories, such as JAMF, MDM, Chrome OS; but, the Entity model also extends to include non-user or asset entities such as IP addresses, Domains, URLs, and Resources (like a SQL table, a GCS bucket).
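To give a feel for the normalization step described for the Event Data Model, here is a Grok-like named-group regex in Python that turns a raw log line into a UDM-shaped dict. Real Chronicle parsers are not written like this; the log format and the parse_connection helper are hypothetical.

```python
import re

# Illustrative only: a Grok-like named-group regex in Python. The raw log
# format and the parse_connection helper are hypothetical; real Chronicle
# parsers are not written this way.
RAW_LOG = "2022-11-13T21:46:56Z ALLOW src=10.0.0.5 dst=101.100.139.201 dport=80"

PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<action>\w+) "
    r"src=(?P<src>\S+) dst=(?P<dst>\S+) dport=(?P<dport>\d+)"
)

def parse_connection(line):
    """Map a raw log line onto a UDM-shaped nested dict, or None."""
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    return {
        "metadata": {
            "event_timestamp": m["ts"],
            "event_type": "NETWORK_CONNECTION",
        },
        "principal": {"ip": m["src"]},
        "target": {"ip": m["dst"], "port": int(m["dport"])},
    }

event = parse_connection(RAW_LOG)
print(event["principal"]["ip"], event["target"]["ip"], event["target"]["port"])
```

The point is only the shape of the result: source and destination from the raw log land in principal and target, which is what the Detection Rules earlier in this post match on.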

Entity Types

The UDM Entity Data Model is comprised of Entity Types, which represent… Entities.

A User (USER) is a member of one or more Groups (GROUP), who works for a Company (DOMAIN_NAME), which has a website (IP_ADDRESS), that they access on their Laptop (ASSET).

I could go on, but you hopefully get the idea.

💡If you’re seeing a pattern between these Entities, as if they almost have a relationship, that’s because they do. Relationships are a core component of the UDM Entity Model, but that’s a post for another day.

I’ve included the full list of Entity Types as per the UDM documentation for completeness:

|--------------------|-------------|---------------------------------------------------------------------|
| Enum Value         | Enum Number | Description                                                         |
|--------------------|-------------|---------------------------------------------------------------------|
| ASSET              | 1           | An asset, such as workstation, laptop, phone, virtual machine, etc. |
| DOMAIN_NAME        | 5           | A domain.                                                           |
| FILE               | 4           | A file.                                                             |
| GROUP              | 10001       | Group.                                                              |
| IP_ADDRESS         | 3           | An external IP address.                                             |
| MUTEX              | 7           | A mutex.                                                            |
| RESOURCE           | 2           | Resource.                                                           |
| UNKNOWN_ENTITYTYPE | 0           | An unknown event type.                                              |
| URL                | 6           | A url.                                                              |
| USER               | 10000       | User.                                                               |

Entity Context Sources

We now know there are different Entity Types, but where do these Entity Types come from? Entity Sources, of course! Here again there are several types of Context Source:

Entity Context
- log sources you provide, e.g., Azure AD Context, Anomali IOC
Derived Context
- context that comes from your organization’s data, e.g., domain prevalence as learned from your environment’s network traffic
Global Context
- Chronicle’s (Google’s) own sources, such as VirusTotal and Safe Browsing, or third-party sources, such as WHOIS.

And the official description as per the UDM documentation:

| Enum Value              | Enum Number | Description                                                                                                         |
|-------------------------|-------------|---------------------------------------------------------------------------------------------------------------------|
| DERIVED_CONTEXT         | 2           | Entities derived from customer data such as prevalence, artifact first/last seen, asset/user first seen stats, etc. |
| ENTITY_CONTEXT          | 1           | Entities ingested from customers (e.g. AD_CONTEXT, DLP_CONTEXT)                                                     |
| GLOBAL_CONTEXT          | 3           | Global contextual entities such as WHOIS, Safe Browsing, etc.                                                       |
| SOURCE_TYPE_UNSPECIFIED | 0           | Default source type                                                                                                 |

Entity Graph Log Sources

The Entity Data Model is populated by Entity Types, which come from Entity Sources, which are populated from CBNs, aka Data Sets (or Labels, some may even say Parsers).

What Data Sets are context sources? The Chronicle documentation page Ingest data using the entity data model includes a list of Entity Context generating integrations (not exhaustive though).

Finally, how did you create the Tor Entity Graph entry?

Let’s recap. We want to be able to filter our YARA-L Detection Rules on multiple dimensions. The Entity Graph enabled us to take our original Single Event YARA-L rule, and re-write it as a Multi-Event YARA-L rule joining against the Entity Graph.

So just how did the Entity Graph entry for that Tor Multi-Event YARA-L Detection Rule get into Chronicle? There are two ways that can happen:

1. a CBN parser
2. Chronicle’s Ingestion API

I used option 2, the Chronicle Ingestion API. 🙃 I’ll do a follow-up post on using the Ingestion API to ingest Entity events into Chronicle, but if you’re itching to go right now you can use our public GitHub Python API examples.

Remember the Entity from the Detection View results screen?

In case you didn’t remember, it was this

This was created from the below JSON object, which was submitted to the Chronicle Ingestion API, as follows:

{
   "log_type":"CATCH_ALL",
   "entities":[
      {
         "metadata":{
            "product_entity_id":"cb025d8e-878e-4db4-94a9-b50d0a7c7519",
            "collected_timestamp":"2022-11-13T21:46:56.604323Z",
            "vendor_name":"dan.me.uk",
            "product_name":"TOR Node List",
            "entity_type":"IP_ADDRESS",
            "interval":{
               "start_time":"2022-11-13T21:46:56.604354Z",
               "end_time":"2022-11-23T21:46:56.604363Z"
            },
            "threat":{
               "category_details":"Anonymization",
               "url_back_to_product":"https://www.dan.me.uk/tornodes",
               "threat_id":"unlimitedrelay085",
               "threat_feed_name":"TOR Node List"
            }
         },
         "entity":{
            "ip":"101.100.139.201",
            "port":"80"
         }
      }
   ]
}

Let’s review and break down that JSON object:

"log_type":"CATCH_ALL",

log_type
- a valid Chronicle label is required to ingest the Entity Event into Chronicle
- Best practice is always to use the correct label, and not a generic label like CATCH_ALL (it’s useful for testing and demonstration purposes, but please don’t use it in production)

"entities":[

We can submit an array of entities, i.e., multiple entity context records in one go (but in this example it’s just the single record).

{
         "metadata":{
            "product_entity_id":"cb025d8e-878e-4db4-94a9-b50d0a7c7519",
            "collected_timestamp":"2022-11-13T21:46:56.604323Z",
            "vendor_name":"dan.me.uk",
            "product_name":"TOR Node List",
            "entity_type":"IP_ADDRESS",
            "interval":{
               "start_time":"2022-11-13T21:46:56.604354Z",
               "end_time":"2022-11-23T21:46:56.604363Z"
            },

An Entity Context record must include Metadata fields, and key fields within that include:

product_entity_id
- a mandatory field
- ideally unique and persistent; if there isn’t a consistent value to use for persistence, then at least unique, such as a GUID
entity_type
- a mandatory field
- in this case we’re modelling an IP address, hence IP_ADDRESS
interval
- a key concept
- UDM Entity Data is used to enrich UDM Event data, but is only available within the valid time range between the start and end time (this isn’t 100% accurate; there are mechanisms for lookback +/- the period of an Entity interval, but that’s another post)
- a really powerful capability which helps to ensure up to date and accurate data is applied, as well as applying the context for that point in time too
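The interval idea can be sketched as a simple time-range check. This is a deliberate simplification: it ignores the lookback behaviour mentioned above, and treating the range as start-inclusive and end-exclusive is my assumption for the example. The parse_ts and is_active helpers are hypothetical names.

```python
from datetime import datetime

# Illustrative only: a simplified validity check. The real behaviour also
# involves lookback around the interval, which this sketch ignores.

def parse_ts(value):
    """Parse an RFC 3339 timestamp with a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def is_active(interval, event_time):
    """True if event_time falls within [start_time, end_time).

    The half-open range is an assumption made for this example.
    """
    start = parse_ts(interval["start_time"])
    end = parse_ts(interval["end_time"])
    return start <= parse_ts(event_time) < end

interval = {
    "start_time": "2022-11-13T21:46:56.604354Z",
    "end_time": "2022-11-23T21:46:56.604363Z",
}

print(is_active(interval, "2022-11-15T08:00:00Z"))  # True
print(is_active(interval, "2022-11-30T08:00:00Z"))  # False
```

An event on 15 November would be enriched with the Tor context, while one on 30 November would not, because the record’s interval ended on the 23rd.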

            ...
            "threat":{
               "category_details":"Anonymization",
               "url_back_to_product":"https://www.dan.me.uk/tornodes",
               "threat_id":"unlimitedrelay085",
               "threat_feed_name":"TOR Node List"
            }

Threat is part of the UDM Data Model schema, and in this case useful for storing information relevant to the source context feed, i.e., when we get a Detection Rule Alert we have not just an Alert, but the information behind why it matched too.

         ...
         },
         "entity":{
            "ip":"101.100.139.201",
            "port":"80"
         }

And last but not least, the Entity itself. Again, leveraging the UDM Data model, i.e., UDM Nouns.
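If you have a whole list of nodes rather than one, a payload in the same shape can be generated programmatically. The sketch below only builds the JSON; it does not call the Ingestion API. The build_payload helper and its parameters are hypothetical, the threat block is omitted for brevity, and using uuid5 to derive a stable product_entity_id from the IP and port is my suggestion rather than how the example record above was created.

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

# Illustrative only: builds a payload shaped like the JSON above. It does
# not call the Ingestion API; build_payload and its parameters are
# hypothetical names.

def build_payload(nodes, now, valid_days=10):
    """nodes is an iterable of (ip, port) tuples; now is a UTC datetime."""
    start = now.isoformat().replace("+00:00", "Z")
    end = (now + timedelta(days=valid_days)).isoformat().replace("+00:00", "Z")
    entities = []
    for ip, port in nodes:
        entities.append({
            "metadata": {
                # uuid5 is deterministic, so the same ip:port always
                # yields the same ID (unique and persistent).
                "product_entity_id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{ip}:{port}")),
                "collected_timestamp": start,
                "vendor_name": "dan.me.uk",
                "product_name": "TOR Node List",
                "entity_type": "IP_ADDRESS",
                "interval": {"start_time": start, "end_time": end},
            },
            # Port is kept as a string to match the example record above.
            "entity": {"ip": ip, "port": str(port)},
        })
    return {"log_type": "CATCH_ALL", "entities": entities}

now = datetime(2022, 11, 13, 21, 46, 56, tzinfo=timezone.utc)
payload = build_payload([("101.100.139.201", 80), ("101.100.139.201", 443)], now)
print(json.dumps(payload, indent=2))
```

Because each IP and port pair becomes its own entity record, the same IP can appear more than once with different ports, which is exactly what makes the two-dimensional join in the Detection Rule work.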

Conclusion

What feels like a long time ago, I asked what if you want to filter on multiple dimensions in a YARA-L Detection Rule? Hopefully, this post has demonstrated how you can do just that, using Chronicle SIEM’s Entity Graph.

In a follow-up post I’ll cover more about the Entity Graph, and the Chronicle Ingestion API.