Security through Data Fusion: Feature extraction
After Data collection, let's continue along the analysis mainline through Feature Extraction (Data → Features). I have mentioned that features are semantically meaningful chunks of data. In the original field of Multisensor Data Fusion, on which I've based this framework, feature extraction means things like edge detection: the pre-cognitive steps of visual recognition that our brains perform subconsciously. For us, it means pulling out the entity identifiers and application metadata that will help in constructing the higher-level objects.
Now, since the key is semantics: while we initially extract text or bit strings, further analysis will require converting those into objects with operations defined on them. For example, an IP address is not a text string: it's a binary value with structure. We need to be able to perform bitmasking efficiently, to determine whether an IP is in a particular network range. We need methods or flags to tell us whether the address is localhost or multicast. We need to be able to store it in a set, for quick set operations.
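A minimal sketch of the difference, using Python's ipaddress module:

```python
import ipaddress

# The raw feature is text; the semantic version is an object with operations.
addr = ipaddress.ip_address("10.1.2.3")        # IPv4Address, not a string
corp_net = ipaddress.ip_network("10.0.0.0/8")  # IPv4Network

# Range membership is a single operator, backed by bitmasking,
# rather than string manipulation.
print(addr in corp_net)   # True

# Addresses are hashable, so set operations are cheap.
seen = {ipaddress.ip_address(s) for s in ("10.1.2.3", "10.1.2.4")}
print(addr in seen)       # True
```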
Some examples of semantic feature objects would be:
- Python's ipaddress module, with its IPv4Address/IPv6Address and IPv4Network/IPv6Network classes
- The IPAddress class in Microsoft .NET (though I don’t see a corresponding IPNetwork class)
- Python's http.client.HTTPResponse class
We might want to convert a feature to its semantic version already during feature extraction, if we decide certain operations are worth performing up front. For example, we could label an IP address as localhost, private, routable or multicast. If the answer depends only on the value of the feature itself, we should definitely do it: computing the label once saves every downstream consumer from repeating the operation.
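Here is a sketch of that up-front labeling, again using ipaddress; the label names are mine, not a standard:

```python
import ipaddress

def label_ip(text: str) -> str:
    # Label an address using only its own value; label names are illustrative.
    addr = ipaddress.ip_address(text)
    if addr.is_loopback:
        return "localhost"
    if addr.is_multicast:
        return "multicast"
    if addr.is_private:
        return "private"
    return "routable"

print(label_ip("127.0.0.1"))  # localhost
print(label_ip("239.1.2.3"))  # multicast
print(label_ip("10.1.2.3"))   # private
print(label_ip("8.8.8.8"))    # routable
```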
For extracting features from text, the workhorse is the regular expression. Time spent learning the intricacies of regex is time well spent. Even if your parsing framework tries to save you the trouble, trust me: it's regexes all the way down. If you're the one writing regexes, I'm sorry. Spare a thought for the poor fool who has to read them, and use things like meaningfully named groups. Hopefully your parsing framework supports unit tests, so that when you need to edit your regexes, you can be sure you haven't broken your existing parsing.
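As a sketch (the log line format here is hypothetical; real formats vary), named groups plus a unit test might look like this:

```python
import re
import unittest

# A named-group regex for an sshd-style auth failure line (format is illustrative).
AUTH_FAILURE = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<src_ip>\S+) port (?P<src_port>\d+)"
)

class TestAuthFailureRegex(unittest.TestCase):
    def test_extracts_user_and_ip(self):
        line = "Failed password for invalid user admin from 203.0.113.9 port 53122 ssh2"
        m = AUTH_FAILURE.search(line)
        self.assertIsNotNone(m)
        self.assertEqual(m.group("user"), "admin")
        self.assertEqual(m.group("src_ip"), "203.0.113.9")

if __name__ == "__main__":
    unittest.main()
```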
When doing binary parsing, I have one word: [whispered] endianness. Endianness has bitten me more times than I'd care to admit, and will bite me again in the future. Network byte order is big-endian (most significant byte first), while our most common platform, x86, is little-endian (least significant byte first). This means the IP address 1.2.3.4, stored as a 32-bit integer in x86 memory, is laid out byte-for-byte as 4.3.2.1. Endianness: learn it, live it, love it (if you can).
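To see it concretely, here's a small demonstration:

```python
import socket
import struct

# 1.2.3.4 as a 32-bit integer is 0x01020304.
packed = socket.inet_aton("1.2.3.4")   # network byte order (big-endian)
print(packed.hex())                    # 01020304

(as_big,) = struct.unpack("!I", packed)    # '!' = network/big-endian
(as_little,) = struct.unpack("<I", packed) # '<' = little-endian, as x86 reads it
print(hex(as_big))     # 0x1020304
print(hex(as_little))  # 0x4030201, the "4.3.2.1" effect
```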
Thinking ahead to Entities, what we should focus on at this point are the features that will allow us to identify entities and associate activity with them. Recall that for this enterprise example our entities are hosts and users, but once again we'll focus on hosts to demonstrate the concepts.
For hosts, think again of the system layers, and figure out what entity identifiers are available at each:
- Hardware: Serial number, MAC addresses (for all interfaces)
- O/S: Hostname, machine certificate ID/fingerprint
- Application: Not much that would be good as a persistent entity identifier.
- Network: IP addresses are useful only as temporary identifiers, but for many data sources they are all we have to identify hosts. We'll have to make sure we understand the time range over which an IP address assignment is valid, and always try to relate the IP address to a more persistent (or permanent) identifier (see the sketch after this list).
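One way to keep the temporary-vs-persistent distinction honest is to model it in your types. A hypothetical sketch (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import IPv4Address

@dataclass(frozen=True)
class HostIdentity:
    serial_number: str       # hardware: persistent
    mac_addresses: tuple     # hardware: persistent, one per interface
    hostname: str            # O/S: fairly persistent

@dataclass(frozen=True)
class IpAssignment:
    ip: IPv4Address          # network: temporary
    mac: str                 # the persistent identifier it links to
    valid_from: datetime     # the assignment is only meaningful
    valid_to: datetime       # within this window
```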
In our first feedback loop (Features → Data) we should ask whether we're collecting all the data we need for the above identifiers. In particular, we should look for links between identifiers: messages that contain two different identifiers, so that we can reliably tie them together. A good example is DHCP logs, which link a MAC address with its assigned IP address at a particular time.
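A sketch of extracting that link; the line below imitates an ISC dhcpd log, but real formats vary by server and configuration:

```python
import re

# Named groups for the three features we care about: time, IP, MAC.
DHCPACK = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ dhcpd: "
    r"DHCPACK on (?P<ip>\d+\.\d+\.\d+\.\d+) to (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)

line = "Jun  3 10:15:42 dhcp1 dhcpd: DHCPACK on 10.1.2.3 to aa:bb:cc:dd:ee:01 via eth0"
m = DHCPACK.search(line)
if m:
    # The link record: this MAC held this IP starting at this time.
    print((m.group("mac"), m.group("ip"), m.group("ts")))
    # ('aa:bb:cc:dd:ee:01', '10.1.2.3', 'Jun  3 10:15:42')
```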
It's OK if we're not extracting every bit of metadata at first. As you progress through the analysis mainline and build analytics, you'll identify more pieces of metadata you need, and you can add feature extractions to pull them out of the raw data (or identify more data sources you need). A disciplined thought process around the feedback loops (e.g. Entities → Features → Data) will help fill in the gaps.
One caution: when engineering an analysis system, don't roll feature extraction directly into your more advanced stages of analysis. In other words, don't pull raw messages, extract features, and aggregate them all in the same analysis script. If you do, you end up replicating the same feature extractions in multiple (and probably many) places, which makes your whole system fragile. What happens when the input message format changes? You'll have to hunt down every place where you do the extraction and fix each one.
Instead, view Feature Extraction as a distinct step of analysis. From your raw messages, start to create structured messages (that carry along the raw input as one part of the message) and populate each with the extracted features. Then, later stages of analysis can operate on the structured fields and not the raw data. If an input message changes, you fix parsing in one place, and all downstream processes are fixed simultaneously.
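A minimal sketch of that structure (the message shape and the pattern are illustrative):

```python
import re
from dataclasses import dataclass, field

# Stand-in for whatever feature extraction you actually need.
SRC_IP = re.compile(r"from (?P<src_ip>\d+\.\d+\.\d+\.\d+)")

@dataclass
class StructuredMessage:
    raw: str                              # carry the raw input along
    features: dict = field(default_factory=dict)

def extract(raw: str) -> StructuredMessage:
    # The single place where parsing happens.
    msg = StructuredMessage(raw=raw)
    m = SRC_IP.search(raw)
    if m:
        msg.features.update(m.groupdict())
    return msg

msg = extract("Failed password for admin from 203.0.113.9 port 53122")
print(msg.features)   # {'src_ip': '203.0.113.9'}
# Later stages read msg.features and never re-parse msg.raw, so a format
# change means fixing extract() in exactly one place.
```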
By the way, for structured data you should use one of the binary serialization formats (for example Thrift, Avro, or Protobuf). JSON is wasteful as a storage format because of its repeated keys, and don't even think about XML. Which binary format to use will probably depend more on your toolchain than on performance, so go for ease of integration.
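For Avro, a minimal sketch with the fastavro library (one of several Avro bindings for Python) looks like this:

```python
from fastavro import parse_schema, reader, writer

# Field names live once in the schema, not in every record on disk.
schema = parse_schema({
    "type": "record",
    "name": "StructuredMessage",
    "fields": [
        {"name": "raw", "type": "string"},
        {"name": "src_ip", "type": "string"},
    ],
})

records = [{"raw": "Failed password for admin from 203.0.113.9 port 53122",
            "src_ip": "203.0.113.9"}]

with open("events.avro", "wb") as f:
    writer(f, schema, records)

with open("events.avro", "rb") as f:
    for rec in reader(f):
        print(rec["src_ip"])   # 203.0.113.9
```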
In the next article I’ll talk about Entity Characterization, before I come back to some more of the lower-level processes, including feedback loops.
Next: Entity characterization