In the previous part we looked at the differences in scalability and nature (the why and the how) between historical analysis and the short-term, real-time analysis that SIEMs provide. Now let’s see how data is treated in both worlds.
There are more aspects than just structured vs. unstructured data that differentiate short- and long-term analysis. In fact, some of them pose a serious challenge to both kinds of analysis, to the point of causing severe limitations or forcing expensive decisions.
Let’s first discuss normalisation of logs, the approach employed by the majority of SIEM vendors. Normalisation by definition involves reducing the data elements (i.e. log record fields) to a known common set. This is a necessary procedure for applying indexing and for consistent storage of events. The downside of this process is data loss: a) direct loss, the discarded data elements, and b) metadata loss, information about the structure of a log record, such as the sequence of fields, data elements within a field, patterns in syntax, etc.
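To make both kinds of loss concrete, here is a minimal sketch; the log line, field names and regex schema are purely illustrative, not taken from any particular SIEM:

```python
import re

# Hypothetical raw access-log line.
RAW = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
       '"GET /cart?sid=42&ref=mail HTTP/1.1" 200 512 "-" "curl/8.1"')

# A normalising parser reduces the record to a fixed, known set of fields.
SCHEMA = re.compile(
    r'(?P<src_ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)')

def normalise(line):
    m = SCHEMA.match(line)
    if not m:
        return None
    # Direct loss: bytes sent, referrer and user-agent are silently discarded.
    # Metadata loss: the query-parameter sequence inside `path` vanishes as
    # soon as the path is split further into a key/value mapping.
    return m.groupdict()

rec = normalise(RAW)
print(sorted(rec))  # only the five fields the schema knows about survive
```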
Data loss is generally not considered an issue with SIEMs, as their main goal is to utilize known patterns for detection purposes. The primary mission of historical analysis, on the other hand, is to discover new information, where the loss of any data diminishes the chances of success. This means persisted normalisation is a big no-go here.
While it’s easy to understand the importance of direct data fields, metadata is often underestimated. Yet in many cases metadata is crucial for distinguishing malicious usage patterns. For instance, in the e-commerce fraud world, fraudsters commonly face the dilemma of scaling up fraudulent transactions while simultaneously mimicking the behaviour of normal customers. They make considerable efforts to blend in with normality, and often only the trails of automation are what enable us to identify them. Trails of automation can be found in metadata, such as a certain sequence (or absence) of query parameters, IP addresses in X-FORWARDED-FOR headers, the composition of JSON payloads, etc. Transforming these data elements into a normalised field set makes this information disappear.
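A tiny illustration of this metadata loss, using made-up request paths: once query parameters are normalised into a key/value mapping, two requests that differed only in parameter order become indistinguishable, and the trail of automation is gone:

```python
from urllib.parse import urlsplit, parse_qsl

# Two hypothetical requests carrying the same parameters in different order;
# scripted clients often emit a fixed, unnatural parameter sequence.
human = "/checkout?item=42&qty=1&token=abc"
bot   = "/checkout?token=abc&item=42&qty=1"

def normalised(path):
    # Normalisation keeps only key->value pairs; ordering is no longer meaningful.
    return dict(parse_qsl(urlsplit(path).query))

def raw_sequence(path):
    # The raw record still carries the parameter sequence - the metadata.
    return [k for k, _ in parse_qsl(urlsplit(path).query)]

print(normalised(human) == normalised(bot))    # True: identical after normalisation
print(raw_sequence(human), raw_sequence(bot))  # the sequences still differ
```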
Secondly, the normalisation process involves Extraction (of fields), Transformation (assigning a type) and Load (into the system). The downside of ETL is that it is not resilient to changes in the record structure of logs that are not in a machine-readable format (JSON, XML): extraction from such logs relies on assumptions about the sequence and format of data elements. When the structure changes, you’re basically left with two options, both with severe consequences:
- The ETL process is aborted (provided that the parser is able to detect parsing failures), resulting in missing data in the system. Consider the impact of losing a data source vital to real-time alerting.
- Invalid data gets loaded into the system and the errors persist. Obviously, this may lead to wrong conclusions from an otherwise correct analysis, regardless of whether it is near-term or historical.
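Both failure modes can be sketched in a few lines; the pipe-delimited format and field names are hypothetical:

```python
import re

# Assumed record format before an upgrade: timestamp|user|action
STRICT = re.compile(r'^(?P<ts>[^|]+)\|(?P<user>[^|]+)\|(?P<action>[^|]+)$')

lines = [
    "2023-10-10T13:55:36|alice|login",
    "2023-10-10T13:55:37|bob|eu-west|login",  # upgrade inserted a region field
]

# Option 1: strict parsing detects the change and drops the record -
# the data source goes dark until the parser is fixed.
strict_ok = [m.groupdict() for m in map(STRICT.match, lines) if m]

# Option 2: naive positional extraction loads wrong data without complaint -
# 'eu-west' is silently persisted as the action of the second record.
naive = [dict(zip(("ts", "user", "action"), line.split("|"))) for line in lines]

print(len(strict_ok))         # 1: only the well-formed record survives
print(naive[1]["action"])     # 'eu-west': mislabelled, stored as-is
```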
Recovering from such situations requires fixing the Extraction (parsing) step and repeating the whole ETL. The second case is actually even worse, since it also involves identifying and removing erroneous records from storage. Either way, both have a high impact on the SIEM monitoring and historical analysis processes. This is the reason why SIEM solutions use input data that is very static: network security appliance logs, commodity web server logs, etc. In fact, I regard this as a major factor preventing the use of application logs for security event monitoring and analysis, which is quite unfortunate, since application logs are very rich in context for identifying many types of malicious acts.
When it comes to historical analysis, the chance of structural changes is even higher, as you’re looking at logs over a longer period. Failure to identify changes equals losing input data, which may lead to wrong results and failure to discover new information.
Vendor Locked Data
It should be pretty obvious by now why attempting to use a SIEM as your long-term data storage is a Very Bad Idea. However, there is one more aspect to it. All the data fed into the SIEM ends up stored in a vendor-specific format. When you wish to try out new algorithms or methods using another tool, you’ll be facing the task of exporting the data. You’ll be lucky if the vendor supplies tools for that; often this is not the case. Since vendors do not generally make their internal formats public, exporting becomes essentially a reverse engineering task, problematic both technically and legally. Recall that every transformation may also cause data loss.
Data Enrichment
The concept of data enrichment means adding supplemental information (such as geo-location, transaction numbers, application data, etc.) to logs and events to enhance analysis and reporting (definition borrowed from here). Essentially there are two ways to add complementary data: i) as additional fields in log records, added at load time, or ii) as additional fields in the result set, added at query time.
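A minimal sketch of the two styles; the geo lookup table, event shape and field names are invented for illustration:

```python
# Hypothetical IP -> country lookup table.
GEO = {"203.0.113.7": "DE", "198.51.100.9": "US"}

events = [
    {"ts": 1, "src_ip": "203.0.113.7",  "action": "login"},
    {"ts": 2, "src_ip": "198.51.100.9", "action": "login"},
]

# i) Load-time enrichment: the extra field is baked in before storage,
#    frozen with whatever the lookup table said at ingest time.
stored = [{**e, "geo": GEO.get(e["src_ip"], "??")} for e in events]

# ii) Query-time enrichment: stored events stay untouched; the lookup is
#     joined into the result set only when the query runs, so a different
#     (or updated) table can be used for every analysis.
def query(evts, lookup):
    return [{**e, "geo": lookup.get(e["src_ip"], "??")} for e in evts]

print(query(events, GEO)[0]["geo"])  # same answer, but chosen at query time
```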
Many SIEMs allow enrichment only by the first method. This is understandable: their need to deliver results in real time explains the desire to offload any additional processing. The fact that supplemental data is added before query execution is not a problem here either, since the mission is to detect known threats.
However, it becomes an issue for historical analysis. With the mission to discover new information, you can’t really predict what supplemental information you will need before the analysis starts. The only reasonable way is to enrich data dynamically, at query time.
Always keep your logs in their original raw form. Feed your SIEM with data in parallel, but never as a replacement for long-term storage (i.e. log management is a must-have). This way you avoid both data loss and vendor lock-in.
Use analysis tools that allow defining the schema/format of data at query time (as opposed to during ETL). This gives you the flexibility to adapt to structural changes in log records. Even better if the tool is able to notify you of the changes.
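The schema-on-read idea can be sketched like this, with hypothetical raw lines and parse patterns supplied only when the query runs:

```python
import re

# Raw lines are stored untouched; no schema is imposed at load time.
RAW_STORE = [
    '203.0.113.7 GET /login 200',
    '203.0.113.7 GET /login?retry=1 200',
]

def run_query(raw_lines, pattern):
    # The schema (a regex here) travels with the query, not with the data.
    schema = re.compile(pattern)
    return [m.groupdict() for m in map(schema.match, raw_lines) if m]

# Today's question needs four fields...
q1 = run_query(RAW_STORE,
               r'(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d+)')

# ...tomorrow's question re-reads the very same raw data with a different
# schema, here pulling out only requests that carry a query string.
q2 = run_query(RAW_STORE, r'(?P<ip>\S+) \S+ \S+\?(?P<qs>\S+) \d+')

print(len(q1), len(q2))  # all records vs. only the one with a query string
```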
To sum up, I believe SIEM vendors will struggle hard to conquer the long-term analytics space.