Threat Intelligence data storage: make it easy with ArangoDB!

At SEKOIA we heavily use STIX (Structured Threat Information eXpression) to manipulate Threat Intelligence information. As soon as a piece of information enters our system it is converted to the STIX format. This format lets us match indicators seen on our client networks using STIX patterning. It also allows us to share threat intelligence with other companies (using TAXII) and helps us normalize the data we receive from various sources (OSINT intelligence, partners, premium services, …). Using a common standard presents some advantages but raises a few issues. In this blog post we are going to study how ArangoDB helps us store STIX data in an easy and efficient manner.

11 min readJun 12, 2019

STIX

For the one not familiar with the STIX format you can check the article we published a few days ago: STIX and TAXII. To sum it up really quickly:

STIX […] is a standard for expressing information about computer threats in a structured and unambiguous way. Based on JSON, it has the potential to allow automatic information exchange between the many tools used to ensure the security of an organization.
STIX 2.0 defines two categories of STIX objects: STIX Domain Objects (SDO) and STIX Relationship Objects (SRO).
For simplicity, SDOs can be seen as the nodes of a graph interconnected by SROs.

Storing STIX data

STIX is really efficient to represent threat intelligence data but raises a few issues when we have to store it. In the following, we analyze the pros and cons of relational databases, document related databases and graph databases to store STIX.

Relational Databases

Storing relationships in relational databases is fairly easy, it is one of the use case RDBs (relational databases) are used for, using a join table for example.

However, storing the objects would be more problematic for several reasons. The STIX standard describes multiple types of objects, and each of them has its own set of attributes. The standard approach would then be to have a stix_objects table holding the common fields ( id, type, created, modified, ...) and then another table for each type. These secondary tables would contain the specific fields and each rows would reference a row of the stix_objects table.
Inserting data in this kind of schema is not efficient since we need to write in two tables when adding a new object.
Querying can be even worse. Let's take an example: we would like to retrieve all the objects having the attribute name with the foo value. Not all the objects have a name attribute so we will need to query all the table having this attribute and then join them with the stix_objects table to get the remaining fields. Not efficient at all!

Another issue is that STIX objects may contain non-primitive types. We may have some arrays (for example aliases), or even some sub objects (like external_references). Some RDB engines support it, but for the others we would need to create new tables to store these attributes and sometimes a join table between the two. It adds a lot of complexity and makes the insertions and queries slower.

The last issue is to perform graph queries in RDB engines. Getting the relationships of level 1 is fairly easy. Getting the relations at level 3 is way more complicated and inefficient.

For example, let’s say we want to get all the indicators used by a specific threat actor. In STIX an indicator can be linked to multiple objects: attack-pattern, campaign, infrastructure, intrusion-set, malware, threat-actor and tool. Creating this query in SQL would be complicated, we would need to get all the relationships for the threat actor and get all the relationship from this relationships and so on until we get all the indicators. Painful. With a graph traversal query it is just a matter of specifying the threat actor as the starting node and that we would like to stop the traversal for a path when an indicator is reached.

Document-oriented databases

Using a document-oriented database such as MongoDB or CouchDB solves a lot of the issues we stressed out with RDBs databases. All the objects would be stored in the same collection so we would have to perform only one write to the database. Requesting based on a field value doesn’t need to perform any join operation and is pretty simple: the document having this field will be evaluated and the others will be ignored.

The problem in this case comes with the relationships. Usually document-oriented databases support relationships based on two mechanisms: embedded documents or documents references.

A first approach to model the relationship between two documents is to include the referenced document inside the referencing one. This approach makes the query really easy as no joint queries are required. The issues come when we need to update the embedded document. If the embedded document is present in multiple main documents, then we will need to update all the documents that embed it. Another issue is that we use a lot more storage than necessary, because the information is replicated in multiple places.
Another approach to model relationships with document oriented database is to leverage referencing pointers between related documents. Now updating the related content is easy, we have to update a single document, all the other ones holding the references won’t need any changes. Nice!
Not so fast! What if we remove this referenced document? We would then need to go through all the documents, find the ones holding a reference in it and remove the details of the relation. Also, many to many relationships are handled poorly. Either we store the relation in only one document (so one side of the relationship) and getting the reverse relationship is complicated, or we duplicate the relationship in both of the documents and then any edition or deletion of the relationship needs to be performed multiple times.

Graph databases

Graph databases are awesome to create edges between nodes and query the graph obtained from this relationships. It is designed to make query traversal or shortest path computation really efficient.
The issue with this kind of database is that it is not made to store documents or query them. The supported types for the attributes of the documents is often limited and the query syntax allows only basic matching.

A common solution is to use a document database along with a graph one to handle relationships. This combined solution works pretty well but it creates a lot of complexity. The insertion or removal needs to be done in two databases and we need to ensure consistency between the two at the software level.

ArangoDB

Here comes ArangoDB to the rescue! Created in 2014 ArangoDB is a native multi-model database system supporting three data models: key/value, documents and graphs. It ships with clustering capabilities that enable its deployment over multiple nodes! The project is published under the Apache 2.0 license, but a commercial version offers more options in terms of security and scalability.

To represent documents and relationships two types of collections exists:

Document collections: Collections storing vertex documents
Edge collections: Edges are special documents used for connecting other documents into a graph. An edge describes the connection between two documents using the internal attributes: _from and _to.

Based on this two types of collections it is possible to define graphs. In order to query data, ArangDB provides its own query langage named Arango Query Langage (AQL). The definition of AQL from Arango is:

AQL is mainly a declarative language, meaning that a query expresses what result should be achieved but not how it should be achieved. AQL aims to be human-readable and therefore uses keywords from the English language.[…] Further design goals of AQL were the support of complex query patterns and the different data models ArangoDB offers.
[…]
The syntax of AQL queries is different to SQL, even if some keywords overlap. Nevertheless, AQL should be easy to understand for anyone with an SQL background.

Example

Now let’s take the STIX example from our previous blog post. This example represents two SDOs along with a relationship linking them. The first STIX object is an indicator providing a hash in the pattern attribute of the object. The second object represents the Poison Ivy malware family. Finally, the relationship links the two models stating that the indicator indicates the presence of a malware of type Poison Ivy.

The STIX document looks like follow:

[
 {
 "type": "indicator",
 "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "created": "2016-04-06T20:03:48.000Z",
 "modified": "2016-04-06T20:03:48.000Z",
 "labels": [
 "malicious-activity"
 ],
 "name": "Poison Ivy Malware",
 "description": "This file is part of Poison Ivy",
 "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
 "valid_from": "2016-01-01T00:00:00Z"
 },
 {
 "type": "relationship",
 "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "created": "2016-04-06T20:06:37.000Z",
 "modified": "2016-04-06T20:06:37.000Z",
 "relationship_type": "indicates",
 "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
 },
 {
 "type": "malware",
 "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "created": "2016-04-06T20:07:09.000Z",
 "modified": "2016-04-06T20:07:09.000Z",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "name": "Poison Ivy"
 }
]

Before going further we need two collections that can be created from the Arango UI. This is a Web UI to perform common operations like creating a user, a database or a collection, running queries, … The two collections we must create are:

A regular collection stix_objects to store the SDOs
An edge collection stix_relationships to store the relationships

Based on this collections we will also need a graph, stix_graph with:

stix_relationships as the edge definition
sti_objects for both the from and to collections

Inserting data

First let’s store the SDOs in a new collection. We are going to use the STIX id attribute as the document primary key (_key)so we can easily retrieve the documents later.

Using AQL inserting documents looks like:

LET objects = [
 {
 "_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "type": "indicator",
 "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "created": "2016-04-06T20:03:48.000Z",
 "modified": "2016-04-06T20:03:48.000Z",
 "labels": [
 "malicious-activity"
 ],
 "name": "Poison Ivy Malware",
 "description": "This file is part of Poison Ivy",
 "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
 "valid_from": "2016-01-01T00:00:00Z"
 },
 {
 "_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "type": "malware",
 "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "created": "2016-04-06T20:07:09.000Z",
 "modified": "2016-04-06T20:07:09.000Z",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "name": "Poison Ivy"
 }
]
FOR object IN objects
 INSERT object INTO stix_objects

Now we are going to store the relationship in the edge collection. Three attributes must be set: _key, _from and _to.
For the key we are going to use id the same way we did for the SDOs. Regarding the _from and _to fields we can rely on the attributes source_ref and target_ref to set them. The format to use for this attributes is the following: {collection}/{_key}.

The AQL request looks as follow:

INSERT {
 "_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
 "_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "type": "relationship",
 "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "created": "2016-04-06T20:06:37.000Z",
 "modified": "2016-04-06T20:06:37.000Z",
 "relationship_type": "indicates",
 "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
} IN stix_relationships

Now that the database contains some data we can query it!

Querying

AQL is a powerful query syntax with a lot of features. In this article we focus on the two main queries we use to fetch data out of the database.

Let’s get the ten last created objects of type “malware”:

FOR item IN stix_objects
 FILTER item.type == "malware"
 SORT item.created DESC
 LIMIT 10
 RETURN item.id

As you can see the syntax is easy to understand. First we use the FOR statement to iterate over the elements of the collection. FILTER is then used to restrict the results we want to return while SORT specifies the order we want the results in and LIMIT reduces the number of results to return. Finally the RETURN statement specifies how and what data must be returned.

Now let’s check a graph query! An easy graph traversal query applied to our data would look like the following:

FOR vertex, edge, path IN 1..5
 OUTBOUND 'stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f'
 GRAPH 'stix_graph'
 RETURN path

To understand a bit better this query here’s its decomposition:

The FOR emits three variables, the vertex in the traversal, the edge in the traversal and the current path. The current path contains two members, vertices and edges. Here we decided to return the paths matching the request that's why we can find the edges and vertices keys in the result.
IN 1..5 specifies the minimal and maximal depth for the traversal. 0 would have been a traversal starting from the original vertex.
OUTBOUND specifies the direction to follow Possible values are OUTBOUND|INBOUND|ANY. The object we used as the original vertex for the traversal is the source in the relationship object. For this reason INBOUND wouldn't return any result in our example.
stix_objects/{_key}defines the vertexwhere the traversal originate from.
GRAPH stix_graph is the name identifying the named graph to use for the traversal.

Running the query gives us the following output:

[
 {
 "edges": [
 {
 "_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
 "_id": "stix_relationships1/relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
 "_from": "stix_objects1/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "_to": "stix_objects1/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "_rev": "_YwH2I96---",
 "type": "relationship",
 "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "created": "2016-04-06T20:06:37.000Z",
 "modified": "2016-04-06T20:06:37.000Z",
 "relationship_type": "indicates",
 "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
 }
 ],
 "vertices": [
 {
 "_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "_id": "stix_objects1/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "_rev": "_YwHyQSa---",
 "type": "indicator",
 "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "created": "2016-04-06T20:03:48.000Z",
 "modified": "2016-04-06T20:03:48.000Z",
 "labels": [
 "malicious-activity"
 ],
 "name": "Poison Ivy Malware",
 "description": "This file is part of Poison Ivy",
 "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
 "valid_from": "2016-01-01T00:00:00Z"
 },
 {
 "_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "_id": "stix_objects1/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "_rev": "_YwHyQSa--A",
 "type": "malware",
 "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
 "created": "2016-04-06T20:07:09.000Z",
 "modified": "2016-04-06T20:07:09.000Z",
 "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae5ff",
 "name": "Poison Ivy"
 }
 ]
 }
]

Of course this is a really simple example. We could apply some filter on the path during the traversal to discard some nodes or edges and customize the data to return. AQL is really well documented on ArangoDB website, don’t hesitate to check it!

Conclusion

STIX is an awesome format to share Threat Intelligence data between teams. However as shown in this article, it can be complex to store STIX in regular databases without degrading performances. ArangoDB is a new multi model database engine that lets us store STIX data in JSON and perform complex queries without sacrificing performances.

The AQL query language has been well thought and let us perform document queries (filtering, sorting, ...) and graph query (graph traversal, shortest path, ...) in an easy to read manner.

The clustering capabilities of Arango allow the database to grow and scale with time making it a perfect candidat for long term projects!