Security through Data Fusion: Entities → Features

Markus DeShon
Oct 15, 2017


In our data fusion framework, Features are semantically meaningful chunks of data, such as IP addresses or domain names. We extract them from raw data through parsing, and populate an object that lets us perform operations or transformations on the feature (for example, checking an IP address against a network address and netmask through a bitmasking operation).
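For instance, a minimal sketch of such a feature object in Python might look like the following (the class name and interface are illustrative, not from any particular codebase; the standard-library ipaddress module does the bitmask comparison for us):

```python
import ipaddress

class IPAddressFeature:
    """Hypothetical Feature wrapper around an extracted IP address."""

    def __init__(self, raw: str):
        # Parsing is the extraction step; raises ValueError on bad input.
        self.address = ipaddress.ip_address(raw)

    def in_network(self, network: str) -> bool:
        # Membership test against a network address and netmask;
        # ipaddress does the bitmasking under the hood.
        return self.address in ipaddress.ip_network(network)

feature = IPAddressFeature("10.1.2.3")
print(feature.in_network("10.1.0.0/16"))  # True
```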

Some feature extraction steps cannot be done correctly in isolation, however. We actually need to know something about the state of the entity producing the logs in order to extract the feature correctly. When that's the case, we're doing "Model-based Feature Extraction," which is the Entities → Features cell in our framework.

A non-trivial example is timestamps. At first sight, a timestamp is a simple thing: a date and time indicating (for example) when a logged event happened. Anyone who has looked at timestamps in detail will tell you, though, that they are anything but simple.

The first complication is that you need to worry about whether the clock on the system that generated the timestamp is correct. Fortunately, this is a longstanding problem with a longstanding solution: the Network Time Protocol (NTP). You must configure NTP on all hosts that are logging. While there are limited things you can do to correct timestamp errors after the fact, they really aren't worth it. All modern operating systems have NTP support built in, and there's just no excuse for not using it.
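If you want to spot-check a host's clock from code, one option is to ask an NTP server for the estimated offset. This sketch assumes the third-party ntplib package (pip install ntplib) and a reachable pool server:

```python
import ntplib

client = ntplib.NTPClient()
response = client.request("pool.ntp.org", version=3)

# offset is the estimated difference in seconds between the local
# clock and the server's; more than a fraction of a second on a
# log-producing host deserves investigation.
print(f"clock offset: {response.offset:.3f} s")
```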

Once you’re sure the system clock is running correctly, the next major problem is timestamp formats. Here are just a few common ones, all encoding the same time:

  • 2017-10-01T12:00:00.001Z (an ISO 8601 UTC timestamp)
  • 2017-10-01T05:00:00.001-0700 (an ISO 8601 timestamp with UTC offset)
  • Sun Oct 1 05:00:00.001 (syslog timestamp with millisecond precision, no time zone or year)
  • Oct 1 05:00:00 (syslog timestamp with second precision, no time zone or year)
  • 05:00:00.001000 (boot log timestamp with microsecond precision, no date)
  • 01/Oct/2017:05:00:00 -0700 (Apache log timestamp with UTC offset)
  • Sun Oct 1 05:00:00 PDT 2017 (other log timestamp with a time zone designator)
  • 1506859200 (Unix epoch timestamp with second precision)
  • 1506859200001 (Unix epoch timestamp with millisecond precision)
  • 1506859200.001 (Unix epoch timestamp in seconds as a floating-point number with millisecond precision)

Note that the last three could appear as ASCII text, or be in a binary encoding, such as a 32-bit integer or an IEEE floating point number, usually in little-endian byte order.
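The unambiguous forms above can be normalized mechanically. A sketch in Python (3.7+, where strptime's %z directive accepts both Z and numeric offsets):

```python
from datetime import datetime, timezone

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
iso_utc = datetime.strptime("2017-10-01T12:00:00.001Z", fmt)
iso_off = datetime.strptime("2017-10-01T05:00:00.001-0700", fmt)
epoch_s = datetime.fromtimestamp(1506859200.001, tz=timezone.utc)

for ts in (iso_utc, iso_off, epoch_s):
    print(ts.astimezone(timezone.utc).isoformat())
# All three print 2017-10-01T12:00:00.001000+00:00
```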

So, while in a way all of that is a feature extraction problem, it really ends up being model-based feature extraction, because a timestamp read in isolation is often ambiguous: the year may be missing, the field order may be day/month or month/day with no way to tell which, or the time zone abbreviation may be shared by multiple time zones (such as IST, which can be Indian, Irish, or Israel Standard Time).
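A tiny illustration of the day/month problem: both of these parses succeed, and nothing in the string itself says which was intended:

```python
from datetime import datetime

raw = "01/02/2017 05:00:00"
day_first = datetime.strptime(raw, "%d/%m/%Y %H:%M:%S")    # 1 February
month_first = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")  # 2 January
```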

If you look into libraries that support timestamp conversion, such as Python's time module, you'll discover layers of complexity when you get to time zones (take a look at time.tzset()). That's not Python being difficult; it's just an inherently difficult subject.
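For example, tzset() (Unix-only) rereads the TZ environment variable and changes what "local time" means for the whole process:

```python
import os
import time

os.environ["TZ"] = "America/Los_Angeles"
time.tzset()
print(time.strftime("%Z %z"))  # e.g. PDT -0700 (or PST -0800)

os.environ["TZ"] = "UTC"
time.tzset()
print(time.strftime("%Z %z"))  # UTC +0000
```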

For timestamps, and for problems of similar complexity, you'll need components that resolve ambiguity using contextual information. We need to keep state on each host to make reliable inferences about which time zone it is in at any given moment (have laptop, will travel); then, as timestamps come through, use that context to infer what the ambiguous ones mean, and convert them to an unambiguous form (UTC or epoch time).

Reliably keeping that state might mean occasionally injecting a timestamp in ISO 8601 format with a UTC offset, if you don't already have a log source that reports time that way. If the time zone changes, you'll need to be clever about attributing each zone-less timestamp to the zone that was in effect when it was generated.
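One possible shape for that per-host state (the class and method names are hypothetical, and zoneinfo is standard library only in Python 3.9+):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

class HostClockModel:
    """Per-host context for resolving ambiguous timestamps (a sketch)."""

    def __init__(self):
        self.tz = None  # last known IANA zone for this host

    def observe_zone(self, zone_name: str):
        # Call this whenever a trusted log line (such as an injected
        # ISO 8601 record with a UTC offset) reveals the host's zone.
        self.tz = ZoneInfo(zone_name)

    def resolve(self, naive: datetime) -> datetime:
        # Interpret a zone-less timestamp in the host's last known
        # zone, then normalize to UTC.
        if self.tz is None:
            raise ValueError("no time zone context for this host yet")
        return naive.replace(tzinfo=self.tz).astimezone(timezone.utc)

model = HostClockModel()
model.observe_zone("America/Los_Angeles")
print(model.resolve(datetime(2017, 10, 1, 5, 0, 0)))
# 2017-10-01 12:00:00+00:00
```

The observe_zone() call is where injected or otherwise trusted timestamps feed the model; everything else follows from keeping that one piece of state fresh.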

I hope that by walking through the details of the timestamp example, you'll be able to recognize other cases where you're facing a Model-based Feature Extraction problem: one where you need context beyond the feature itself (typically at the entity level) in order to correctly extract the feature.

Top article

Next article: Situation Assessment
