Security through Data Fusion: Low-level Data Fusion
So far we’ve explored the mainline of analysis up through Entities: Data → Features → Entities. As shown a few times so far, there are opportunities for feedback loops. If we think of the full 36-cell table of possible interactions, we’ve been working in the top-left quadrant of 9 cells, which corresponds to what Multisensor Data Fusion calls ‘Low-level Data Fusion’. Recall that the row headers on the left are inputs, and the column headers on top are outputs (e.g. Data → Features is “Feature extraction”).
Figure: Low-level data fusion (LLDF) processes
The cells in dark gray are the ones we’ve discussed so far. The light gray boxes are additional cells for which I think there are strong, well-known processes (which I will also cover). The unshaded cells are very interesting off-diagonals that deserve some careful consideration. I’m not devoted to the names of those, so they may change as I flesh them out more (and name suggestions for those are welcome). As I’ve mentioned before, I think that all of these are processes analysts actually do already, though perhaps rarely, and maybe not consciously. If our goal is to automate all of LLDF, then we need to make these processes explicit.
Low-level data fusion is critical to the security detection and analysis process. I estimate that analysts spend up to 80% of their time on these tasks (mainly the shaded ones) when they have not been well automated. Thoroughly automating them would therefore save a great deal of analyst time: at the 80% upper bound, full automation leaves each analyst with only a fifth of the original workload, in effect turning 1 analyst into 5.
Let’s take a look at Entity Refinement (Entities → Entities). This is the process of improving the entity database through cross-referencing the characteristics in it, and drawing inferences to enrich the entity model. Some examples:
- Using the list of applications and open ports to characterize a host as a server vs. client, and assigning a high-level type (e.g. web server).
- Linking network indicators with software applications (e.g. the User-Agent → web client software link mentioned before).
- Based upon data stores found on the host, labeling it as holding sensitive data.
- Cross-referencing application versions to a vulnerability database to determine whether the host is in a vulnerable state.
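To make the examples above concrete, here is a minimal sketch of Entity Refinement over a single host record. Everything in it is an assumption made for illustration: the `Host` fields, the `refine` function, and the three lookup tables (`SERVER_PORTS`, `SENSITIVE_STORES`, `VULNERABLE_VERSIONS`) are hypothetical stand-ins for a real asset inventory and vulnerability database. The User-Agent link is left out to keep it short, and "no known server port means client" is a deliberate oversimplification.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables; a real deployment would draw on an asset
# inventory and a vulnerability feed rather than hard-coded values.
SERVER_PORTS = {80: "web server", 443: "web server",
                25: "mail server", 5432: "database server"}
SENSITIVE_STORES = {"customer_db", "payroll_db"}
VULNERABLE_VERSIONS = {("exampled", "1.0")}  # made-up (application, version) pair


@dataclass
class Host:
    name: str
    open_ports: set = field(default_factory=set)
    applications: dict = field(default_factory=dict)  # app name -> version
    data_stores: set = field(default_factory=set)
    # Inferred characteristics, filled in by refine()
    role: str = "unknown"
    host_type: str = "unknown"
    holds_sensitive_data: bool = False
    vulnerable_apps: list = field(default_factory=list)


def refine(host: Host) -> Host:
    """Enrich one entity record using only characteristics already on it."""
    # Server vs. client, plus a high-level type, from the open ports.
    server_types = sorted({SERVER_PORTS[p] for p in host.open_ports
                           if p in SERVER_PORTS})
    if server_types:
        host.role = "server"
        host.host_type = server_types[0]
    else:
        host.role = "client"  # simplification: no known server port open

    # Sensitive-data label from the data stores found on the host.
    host.holds_sensitive_data = bool(host.data_stores & SENSITIVE_STORES)

    # Vulnerable state from cross-referencing application versions.
    host.vulnerable_apps = sorted(
        app for app, version in host.applications.items()
        if (app, version) in VULNERABLE_VERSIONS
    )
    return host


web01 = refine(Host("web01", open_ports={22, 443},
                    applications={"exampled": "1.0"},
                    data_stores={"customer_db"}))
print(web01.role, web01.host_type, web01.holds_sensitive_data, web01.vulnerable_apps)
# server web server True ['exampled']
```

Note that `refine` reads and writes only the entity record itself. That is what makes this an Entities → Entities process rather than feature extraction.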
As we identify these desired characteristics and inferences, we also identify features that we don’t have yet, and traverse backwards (Entities → Features, i.e. Model-based Feature Extraction, and Features → Data, i.e. Model-based Detection) to identify new data sources that we need. Those back-propagation processes in the bottom half of the LLDF quadrant deserve more attention, so I’ll come back to those in later articles.
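One rough way to picture that backward traversal is as a pair of lookups: from each desired inference to the features it needs, and from each missing feature to a data source that could supply it. The sketch below assumes a hand-written catalog. The names `INFERENCE_NEEDS`, `FEATURE_SOURCES`, and `missing_inputs`, and the data source labels, are all hypothetical and not part of any established tooling.

```python
# Hypothetical catalog: which features each desired entity inference needs,
# and which data source could supply each feature.
INFERENCE_NEEDS = {
    "server_vs_client": {"open_ports", "installed_applications"},
    "holds_sensitive_data": {"data_stores"},
    "vulnerable_state": {"installed_applications", "application_versions"},
}
FEATURE_SOURCES = {
    "open_ports": "network scan results",
    "installed_applications": "software inventory",
    "application_versions": "software inventory",
    "data_stores": "file and database discovery",
}


def missing_inputs(desired, available_features):
    """Walk Entities -> Features -> Data to find what is not yet collected."""
    # Entities -> Features: every feature the desired inferences depend on.
    needed = set()
    for inference in desired:
        needed |= INFERENCE_NEEDS[inference]
    missing_features = needed - set(available_features)

    # Features -> Data: the sources that would provide the missing features.
    sources = {FEATURE_SOURCES[f] for f in missing_features}
    return sorted(missing_features), sorted(sources)


features, sources = missing_inputs(
    ["server_vs_client", "vulnerable_state"],
    available_features={"open_ports"},
)
print(features)  # ['application_versions', 'installed_applications']
print(sources)   # ['software inventory']
```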
In any case, once we have an entity database with the kinds of characteristics listed above, we’ll be in a good position to start making higher-level inferences based on the observed activity of those entities. But before we proceed to higher-level data fusion, I’ll discuss other parts of Low-level Data Fusion.
Next article: Entities → Features