Security through Data Fusion: constructing an analysis framework

Markus DeShon
3 min read · Sep 23, 2017

Since I started working in network security analysis around 2000, I’ve worked on constructing various models of what analysis is. The better we understand that, the more automation we can apply to the problem. I’ll save a historical reconstruction of how I came to my present ideas for a future post—here I would just like to talk about what I think is a complete framework for understanding analysis, why I think the framework works, and why I think it’s useful.

This model is based on the field of Multisensor Data Fusion (MDF), which came out of work for the Department of Defense to aggregate multiple independent data sources (e.g. radar, infrared images, seismic sensors) into a “common operational picture.” While MDF operates in the physical realm, I believe the conceptual model works for network defense as well; I strongly believe in adapting work from other fields where it makes sense.

The basic idea is to focus on particular objects and transitions between them:

  • Data: raw streams of bits and bytes, which for us could be network data, system and event logs, or application logs.
  • Features: semantically meaningful chunks of data, for example an IP address or a domain name.
  • Entities: the objects of interest in your domain of analysis, like a host or a user.
  • Relations: interactions between entities, such as client-server interaction, or a user logging into a system.
  • Impacts: the security relevance of the observed relations—is a remote host launching an attack against one of our servers?
  • Responses: actions we take in our domain of control, such as disabling a suspicious user’s account until we can verify what’s going on.
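To make the six object types concrete, here is one way they might be sketched as Python data structures. The field names and the severity scale are my own illustrative choices, not part of the framework.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative types for the six objects; the framework itself
# does not prescribe any particular representation.

@dataclass
class Feature:
    """A semantically meaningful chunk of data, e.g. an IP address."""
    kind: str    # e.g. "ip", "domain"
    value: str

@dataclass
class Entity:
    """An object of interest in the domain, e.g. a host or a user."""
    kind: str    # e.g. "host", "user"
    ident: str
    features: list  # Features observed for this entity

@dataclass
class Relation:
    """An interaction between two entities."""
    source: Entity
    target: Entity
    kind: str    # e.g. "client-server", "login"

class Severity(Enum):
    BENIGN = 0
    SUSPICIOUS = 1
    MALICIOUS = 2

@dataclass
class Impact:
    """The security relevance of an observed relation."""
    relation: Relation
    severity: Severity
    reason: str

@dataclass
class Response:
    """An action taken in our domain of control."""
    action: str   # e.g. "disable-account"
    target: Entity
```

Note that each later type is built out of the earlier ones: an Impact wraps a Relation, which wraps Entities, which carry Features extracted from Data.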

This may (now that you’ve seen it) appear to be an obvious way to break down the problem, but we can already draw some useful observations. For example, should we consider IDS alerts to be raw data? Not under this framework, because IDS alerts and other security events are really talking about Impacts, not Data. So we should expect to handle alerts differently than raw data, and by being clear about that up front we avoid some common conceptual errors in setting up analysis automation.

The really interesting concepts start to enter as we look at how to traverse this sequence. There is a kind of mainline of analysis that goes like this:

  • Feature Extraction: Data → Features
  • Entity Characterization: Features → Entities
  • Situation Assessment: Entities → Relations
  • Impact Assessment: Relations → Impacts
  • Incident Response: Impacts → Responses

Each of these steps is a careful progression, and in some cases good tools exist for making those transitions (e.g. Logstash for extracting features from syslog data). As we go through the sequence, though, the steps become conceptually (and cognitively) more difficult, and maybe the best we can do is assist the analyst with good tooling and visualization.
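The mainline can be sketched end to end on a single syslog-style auth log line. Everything here (the regexes, the tuple shapes, the “watch-ip” response) is a toy illustration of the five transitions, not any real tool’s interface.

```python
import re

def extract_features(data: str) -> dict:
    """Feature Extraction: Data -> Features."""
    return {
        "server": re.search(r"^\w+ +\d+ [\d:]+ (\S+)", data).group(1),
        "user":   re.search(r"for (\w+)", data).group(1),
        "ip":     re.search(r"from (\d{1,3}(?:\.\d{1,3}){3})", data).group(1),
    }

def characterize_entities(feats: dict) -> dict:
    """Entity Characterization: Features -> Entities."""
    return {"client": ("host", feats["ip"]),
            "server": ("host", feats["server"]),
            "user":   ("user", feats["user"])}

def assess_situation(entities: dict) -> tuple:
    """Situation Assessment: Entities -> Relations."""
    return ("failed-login", entities["client"],
            entities["server"], entities["user"])

def assess_impact(relation: tuple) -> str:
    """Impact Assessment: Relations -> Impacts."""
    return "suspicious" if relation[0] == "failed-login" else "benign"

def respond(impact: str, relation: tuple) -> tuple:
    """Incident Response: Impacts -> Responses."""
    # Watch the client host that originated the suspicious relation.
    return ("watch-ip", relation[1][1]) if impact == "suspicious" \
        else ("no-action", None)

line = ("Oct 11 22:14:15 srv1 sshd[4242]: "
        "Failed password for root from 203.0.113.7 port 51004")
relation = assess_situation(characterize_entities(extract_features(line)))
response = respond(assess_impact(relation), relation)
```

Even in this toy, each function only consumes the output of the previous stage, which is what makes the stages independently automatable.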

Back to our example of IDS alerts for a moment: I mentioned that IDS alerts are talking about Impacts, but more specifically, what an IDS does is short-circuit this mainline with a direct Data → Impacts transition. A signature-based IDS can do this because a human analyst has spent a lot of time and effort boiling down all the intermediate steps into a specific pattern match on the raw data. The result is fragile, though, because the attacker may be able to avoid detection with simple obfuscation or modification of the exploit. False positives are also an issue if we don’t include any contextual information about the Entities in our environment (e.g. this Apache exploit won’t work because I’m running NGINX instead).
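That short circuit, and the Entity context check that suppresses its false positives, can be sketched like so; the signature, the software inventory, and the function name are all hypothetical.

```python
# Data -> Impacts in one step: a byte-pattern match straight on raw data.
# Signature maps payload bytes to (alert name, targeted server software).
SIGNATURES = {
    b"/cgi-bin/vulnerable.cgi": ("apache-cgi-exploit", "apache"),
}

# Hypothetical Entity context: what server software each host runs.
SERVER_SOFTWARE = {"10.0.0.5": "nginx", "10.0.0.6": "apache"}

def ids_verdict(dst_ip: str, payload: bytes) -> str:
    for pattern, (name, targeted_sw) in SIGNATURES.items():
        if pattern in payload:  # brittle: trivial obfuscation defeats this
            if SERVER_SOFTWARE.get(dst_ip) != targeted_sw:
                # With Entity context we can downgrade the alert.
                return f"{name} (likely false positive: not {targeted_sw})"
            return name
    return "no-alert"
```

The signature match never touches Features, Entities, or Relations; the inventory lookup is the one place where Entity knowledge leaks back in, and it is exactly what the bare pattern match lacks.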

Note that there could be other transitions we could enable, such as Impacts → Data, where we adjust the way we collect data because we think there’s a security issue. Counting every ordered pair of the six object types (including an object type feeding back into itself), there are a total of 36 possible transitions, transitions I would argue analysts already perform, though mostly by hand. Much more about this, and all the above, later.
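As one illustration, an Impacts → Data transition might be as small as this; the capture levels and the hook function are hypothetical.

```python
# Impacts -> Data: an Impact on a host escalates that host's
# collection level so we start capturing richer data about it.
capture_level = {"10.0.0.5": "flow-records"}

def on_impact(host: str, severity: str) -> None:
    """Escalate data collection for a suspect host."""
    if severity in ("suspicious", "malicious"):
        capture_level[host] = "full-packet"

on_impact("10.0.0.5", "suspicious")
```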

For now I will assert, and hopefully over my postings will be able to convince you, that a more deliberate approach to modeling the Entities in our domain will result in more robust detection and protection.

Thanks for reading!

P.S.: This particular formulation of the Multisensor Data Fusion framework (which I’m adapting to our domain) is due to B. Dasarathy, as expanded by Steinberg and Bowman; cf. the Handbook of Multisensor Data Fusion, Ch. 2, CRC Press, 2001.

Next: Data collection
