Security through Data Fusion: Data collection

Markus DeShon
4 min read · Sep 26, 2017

In my last article, I started outlining a framework for security analytics. Now I’d like to focus on the first step in the mainline of analysis: Data collection. By thinking ahead to Entities, Relations, and so on, we can make some principled decisions about what kinds of data to collect.

It will help to think about a particular problem space: enterprise security. Our primary entities would be hosts and users, where by hosts I mean physical machines. Of course, the contemporary enterprise also includes virtualization, SaaS offerings for critical business functions, and so on, which deserve articles of their own.

We’ll need both host and user data to build the full picture, but let’s focus on hosts so we can think through what kinds of data will carry us further up the mainline of analysis (to Entities, Relations and Impacts).

Since this is an enterprise network, let’s walk through a layered IT architecture to make sure we’re collecting appropriate data at every layer:

  • Hardware → OS → Application → Network

There are some further abstractions that will become interesting later when we need to understand the business logic in order to detect intrusions, but we may catch most of the data sources we need for that in the Application layer. For something other than an enterprise network, use an appropriate layer model (e.g. Cloud Infrastructure → Platform → Service).
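One way to keep that coverage question concrete is to tag each data source you collect with the layer it covers, then check for layers with no coverage at all. Below is a minimal sketch in Python; the layer names and source names are illustrative assumptions, not a prescribed taxonomy.

```python
# A minimal sketch of layer-coverage bookkeeping for data collection.
# The layer and source names here are illustrative, not prescriptive.

LAYERS = ["hardware", "os", "application", "network"]

# Map each data source we actually collect to the layer it covers.
collected_sources = {
    "hardware_inventory": "hardware",
    "os_config": "os",
    "syslog": "os",
    "web_proxy_logs": "application",
    "netflow": "network",
}

def coverage_gaps(sources, layers=LAYERS):
    """Return the layers for which we have no data source at all."""
    covered = set(sources.values())
    return [layer for layer in layers if layer not in covered]

if __name__ == "__main__":
    print("Uncovered layers:", coverage_gaps(collected_sources))
```

The same bookkeeping works for a cloud layer model; only the contents of the layer list change.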

Static or slowly changing data helps us establish what the entities are, which enables Entity Characterization. Some static sources for hosts include:

  • Hardware inventory: some combination of manual and automated collection of permanent or hard-to-change machine identifiers. Some information should be collected when the asset tag is first scanned, such as the machine serial number or other hard-wired identifier, and the MAC addresses for all interfaces (the primary network identifier), so there’s a strong association between them. These will help us identify the known assets in our environment, and we can cross-check them against hosts generating activity on the network to find the ones we don’t know about (a sketch of that cross-check follows this list).
  • Operating system configuration: Configuration files/database and install logs or daemon stop/start logs. These will help us to characterize the host setup: O/S version and patch level, machine name, installed applications.
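As a concrete illustration of that inventory cross-check, here is a minimal Python sketch that compares inventory MAC addresses against MACs observed on the network (say, from DHCP or ARP logs). The CSV column names and record shapes are assumptions for illustration, not a fixed schema.

```python
import csv

def load_inventory_macs(path):
    """Read MAC addresses from a hardware inventory export.
    Assumes a CSV with 'serial' and 'mac' columns; adjust to your schema."""
    with open(path, newline="") as f:
        return {row["mac"].lower() for row in csv.DictReader(f)}

def unknown_hosts(inventory_macs, observed_macs):
    """Return MACs seen on the network that don't match any known asset."""
    return {mac.lower() for mac in observed_macs} - inventory_macs

# Example usage with made-up data:
inventory = {"00:11:22:33:44:55", "66:77:88:99:aa:bb"}
observed = ["00:11:22:33:44:55", "de:ad:be:ef:00:01"]  # e.g. from DHCP or ARP logs
print(unknown_hosts(inventory, observed))  # -> {'de:ad:be:ef:00:01'}
```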

Dynamic data helps us understand what the entities are doing, which supports Situation Assessment (understanding Relations between entities), Impact Assessment and Incident Response. Some dynamic data sources for hosts include:

  • Operating system logs: Log events from the kernel and local daemons (syslog, event logs). One of your primary goals should be to centralize and parse this data; it’s also one of the messiest kinds of data to deal with (see the parsing sketch after this list).
  • Operating system audit logs: special enhanced logging that can be enabled (auditd on Linux, launchd and dtrace on OSX, Windows Security Logs depending on audit policy). Be careful: you can generate tremendous amounts of data depending on configuration, so think through what you really need. In particular, we’re interested in who started or spawned which processes. If possible, we also want to know what network activity is attributable to each process. These logs can wait until we’re already collecting the basics and have a clear need for more detailed logging.
  • Host-based agents (O/S layer): data from a special security agent installed on the host, collecting things that won’t otherwise be logged, such as unknown executables or even memory images. This kind of collection is often on demand, but for known Impacts we’ve identified, we may be able to trigger such collection automatically.
  • Application logs: Log events for server applications, which means activity logs and error(!) logs. These may be external-facing applications, in which case they are a primary vector for intruders to enter your network.
  • Network activity: network flow data, the big three protocols (web, DNS and email, mostly from-to data), and possibly packet data. There’s too much to say about each of these for this article…
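To make the “centralize and parse” goal for operating system logs a bit more tangible, here is a minimal Python sketch that turns a classic BSD-style syslog line into a structured record keyed by host. The regular expression and field names are deliberately simplified assumptions; real syslog traffic is far messier.

```python
import re

# Very loose pattern for classic BSD-style syslog lines, e.g.
#   "Sep 26 12:01:02 web01 sshd[4242]: Accepted publickey for alice"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

def parse_syslog_line(line):
    """Return a dict of parsed fields, or None if the line doesn't match."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else None

line = "Sep 26 12:01:02 web01 sshd[4242]: Accepted publickey for alice"
print(parse_syslog_line(line))
```

In practice you would route the parsed records into central storage keyed by host and timestamp, so that later analysis steps can query them alongside the other sources above.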

Authentication logs would be a special case here, because authentication is primarily about what the users are doing, even though it’s a host (or maybe multiple hosts) that’s running the authentication service. Some data sources will serve multiple purposes (e.g. web proxy logs, which can collect both user-driven and host-driven activity).

Any data we collect comes with costs: storage, backups, and data management (human administration time and compute costs for processing it, moving it around, and hopefully aging it out) are obvious ones. Less obvious is the cognitive load on analysts: just being overwhelmed with the variety and volume of data being collected. I’ve seen multiple environments where large data volumes were being collected, but not analyzed because no one had the time to look at, or sometimes even set up parsing for, all that data.

There is a balance to be struck between completeness and overkill; automation can help reduce the labor and cognitive load. The better we are at fusing data into useful information about entities, the less time analysts will have to spend digging for and fusing that information.
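As a rough illustration of what that fusion might look like, here is a minimal Python sketch that merges static inventory attributes with dynamic observations into a single record per host. The host_id key and the record shapes are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

def fuse_host_records(inventory_rows, observations):
    """Merge static inventory attributes and dynamic observations per host.

    inventory_rows: iterable of dicts with a 'host_id' key plus static fields
    observations:   iterable of (host_id, source, event) tuples
    Both shapes are assumptions for illustration, not a fixed schema.
    """
    hosts = defaultdict(lambda: {"static": {}, "events": []})
    for row in inventory_rows:
        hosts[row["host_id"]]["static"].update(
            {k: v for k, v in row.items() if k != "host_id"}
        )
    for host_id, source, event in observations:
        hosts[host_id]["events"].append({"source": source, "event": event})
    return dict(hosts)

inventory = [{"host_id": "web01", "os": "Ubuntu 16.04", "owner": "web team"}]
observed = [("web01", "netflow", "outbound 443 to 203.0.113.7")]
print(fuse_host_records(inventory, observed))
```

The point is not the particular data structure, but that analysts should be handed an already-assembled picture of each entity rather than raw logs.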

We need a clear understanding of what we’re collecting, and especially why (where I hope the framework will help).

The primary questions to ask when determining which data sources you need:

  • When an entity takes an action, where does it leave traces that I can collect?
  • Which of that data will help me understand what entities are in my environment?

Will I have the data I need to do…

  • situation assessment (determine the relations between entities)?
  • impact assessment (determine whether something security-relevant is happening)?
  • incident response (run suspected intrusions to ground, and forensically analyze actual intrusions)?

It’s OK if we can’t answer all these questions yet. This can be an iterative process of building the system and realizing there’s more data to be collected. You’ll probably never be done tweaking and enhancing your collection capabilities.

Next: Feature extraction
