Is it possible to find Security Value in Logs?

Maybe InfoSec should have just stayed out of the kitchen

Adrian Sanabria
12 min read · Dec 22, 2019

This post is a response to Anton Chuvakin’s piece, Security Correlation Then and Now: A Sad Truth About SIEM. I’d recommend checking that one out first, as I reference it throughout this piece.

Anton took us through a bittersweet (just bitter?) recollection of how SIEMs have progressed over the last 20 years, and it wasn’t pretty. It seemed like a reasonable guess a few decades ago: “there are tons of logs; surely it would be useful to collect them all and look for badness in them!”

I personally joined this narrative in 2004, right around the ‘normalization’ revolution Anton’s piece mentions (you read that one first, right?). I was working for one of the world’s largest payment processors and we were working on getting compliant with the very first release of PCI. We had until October 2005 to be compliant with version 1.0 and it was looking like we’d have to buy a SIEM to comply with most of requirement 10.

SIEM — great in principle

The idea of having access to all logs from one indexed, normalized system was exciting initially. We started evaluating vendors. Each vendor sent a sales engineer on-site for one week to set up and run a POC. We decided on Network Intelligence’s (later acquired by RSA) enVision because it was the only one that worked.

Certification‽ Yeah, this guy was in DEEP.

I don’t mean ‘worked for us, for our use cases’; I mean that every other sales engineer failed to get their product up and functional within that five-business-day window. The fact that only one out of four vendors could get their own product working on a customer site says a lot about the state of the SIEM market at the time. It was the first of many red flags.

Two coworkers and I flew up to Massachusetts to go through training for building custom log parsers and for enVision itself. Network Intelligence would build parsers for you, but at a price: $10,000 per parser. The training was cheaper and we justified it because we had so many in-house applications we needed to get into enVision. At this point, we didn’t yet realize how much of our time the SIEM would demand. We ended up writing only one custom parser.

enVision was an odd beast. Network Intelligence had recently migrated their entire SIEM platform from something UNIX-based that I can’t readily recall, to a highly customized version of Windows 2000 that ran on their proprietary appliances. Sybase ASA provided the database. The architecture was complex. The minimal configuration involved an application server, database server, 7TB SAN and local collector. Each of these talked to each other directly on a private network via crossover cables. Troubleshooting was not for the faint-hearted and I’d soon know more than I ever wanted to know about all of enVision’s quirks, odd UNIX holdovers and limitations. All this complexity, fragility and added work dumped on the customer added up to another red flag.

Then we made a critical mistake.

SIEM — a nightmare in practice

In the absence of any strategic guidance from the vendor, we decided to give enVision everything we had, short of Windows workstation logs. In just a few months, we had over 1,700 devices sending logs to enVision and were pulling in over 100 million events per day. More logs, more value, right? At least, that’s what we naively thought at the time.

We quickly saw the challenges we’d have to struggle with on a daily basis.

With the amount of data we were forcing into the system, an eight-hour query became normal, even for mundane searches related to daily tasks. I was also the chief incident handler, so imagine the frustration of knowing the SIEM likely held the information I was looking for, but having to wait eight hours to see whether I was even querying the right tables.

Keeping enVision running became a daily struggle. Services would fail. Logs would get stuck in processing queues due to parsing issues, unexpected characters, or memory leaks exposed by the vast amounts of data we were shoving through the system.

In a day and age where 4GB OS drives and 18–36GB data/app drives were the norm for most of our servers, seven terabytes seemed like a huge amount of storage. In a very short time, we found ourselves having to closely watch the remaining storage space, eventually having to make some tough choices about which logs to keep and which to let go. We honestly believed we might need anything and everything from our logs at any moment, so the idea of logging less never really occurred to us as an option back then. FOMO hadn’t been coined yet, but definitely applied.

If keeping enVision running wasn’t bad enough, it was also a daily struggle to find out why we would suddenly be missing logs. Systems would get rebuilt and sysadmins would forget to forward Syslog to one of our collectors. Service accounts we used to pull logs would get disabled or their passwords changed. Sometimes systems would get upgraded and the new version would change the log format. We were at RSA’s mercy: those logs would become useless if they never updated the parser (as I recall happening when we upgraded our IronMail appliances from 5.5 to 6.0). To be fair, most often it was an internal IT change that killed incoming logs.

Upon finding out that the security folks had a SIEM containing practically ‘all logs from everything’, IT began to lean hard on us to provide support and assistance. Information Security reported up through IT at the time, so there was little we could do about getting ‘commandeered’ every time IT had an outage and needed root cause analysis done.

While it was possible to run ad-hoc queries with enVision, the primary output was scheduled reports. The product came with pre-built reports, but I never found a single canned report that was useful. We built daily scheduled reports that would just graph the output of every major log format so that we could spot quantitative anomalies (spikes on a timeline, basically). Reports on AD activity were read every morning to look for any strange new users, groups or file shares.

We even had to build custom reports to detect missing logs. To this day, I’ve yet to see a SIEM that will notify you when logs suddenly stop coming in. It boggles my mind that so little practical work went into these SIEMs — they had to have been built by people that had never had to use them and never would. The firewall sends in 7 million events a day — don’t you think you should let me know when that suddenly goes to zero?
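
For what it’s worth, the core of that check is trivial to express. Here’s a minimal sketch of a ‘silent source’ detector in Python, assuming you already have hourly event counts per log source; the source names, counts and threshold below are made-up examples, not what we actually ran.

```python
from statistics import mean

def find_silent_sources(hourly_counts, min_ratio=0.1):
    """Return (source, baseline, latest) for sources whose most recent hourly
    count dropped below min_ratio of their historical average."""
    silent = []
    for source, counts in hourly_counts.items():
        if len(counts) < 2:
            continue  # not enough history to compare against
        baseline = mean(counts[:-1])
        latest = counts[-1]
        if baseline > 0 and latest < baseline * min_ratio:
            silent.append((source, baseline, latest))
    return silent

if __name__ == "__main__":
    # Hourly event counts per source, oldest first; the values are made up.
    counts = {
        "firewall-01": [290_000, 310_000, 305_000, 0],    # went silent
        "ad-dc-02":    [40_000, 42_000, 39_000, 41_000],  # healthy
    }
    for source, baseline, latest in find_silent_sources(counts):
        print(f"ALERT: {source} averages {baseline:,.0f} events/hour, "
              f"but the latest hour saw only {latest:,}")
```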

What about use cases? How did we use the SIEM?

Use cases are worth discussing. There was no threat hunting back then; even threat intelligence was still a nascent concept. Or perhaps there was, and we just weren’t well plugged into the InfoSec community at the time.

There are three use cases I can think of:

  • Manually searching for anomalies (not quite proper detection or threat hunting)
  • Supporting incident response
  • Serving root-cause analysis requests for IT

We looked for anomalies — anything that looked out of the ordinary. To do this, we knew we needed a baseline, so we built a number of daily and weekly reports that we’d look through and seek to understand. It was a painfully manual process and we eventually had two folks doing this full-time.

As previously mentioned, we produced visual graphs that just showed counts of events over a 24-hour timeline. Let’s say you spotted a spike on one of these: a 25% increase over the norm between 2am and 3am. The first step would be to zero in on the exact time when the spike occurred, which would take about an hour of running ad-hoc queries. Then you’d pull the raw logs from that time period and export them to CSV. In Excel, you’d attempt to get a count by event type (the equivalent of a SELECT event_type, COUNT(*) … GROUP BY event_type query) in the hope that it would reveal the cause. Or perhaps you’d just browse through Excel, hoping for a clue to jump out at you.
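
For contrast, that entire Excel exercise boils down to a few lines today. Here’s a rough sketch in Python, assuming the raw events from the spike window were exported to a CSV with an event_type column (the filename and column name are hypothetical):

```python
import csv
from collections import Counter

def count_by_event_type(csv_path):
    """Count rows per event type, the way we used to eyeball it in Excel."""
    with open(csv_path, newline="") as f:
        return Counter(row["event_type"] for row in csv.DictReader(f))

# e.g. the raw logs exported from the 2am-3am spike window
for event_type, count in count_by_event_type("spike_0200_0300.csv").most_common(10):
    print(f"{count:>8}  {event_type}")
```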

Let’s say you determined the spike was caused by scheduled vulnerability scans. We might address this by changing our vulnerability scanning process so that its scans didn’t cross the firewall: no more spike. Or, if that wasn’t feasible, we’d log the anomaly in our growing knowledge base and consider it part of the baseline (“a spike between 2am and 3am is normal activity”). In the future, every new analyst would reference this knowledge base and eventually memorize the fact.

It was a lengthy process, but we eventually established a decent baseline. We never discovered a security event this way, but we did find and fix a LOT of IT issues and misconfigurations.

In hindsight, the only security use-case our SIEM ever helped with was incident response. Even then, as previously mentioned, the process of getting answers from our SIEM was agonizingly slow, especially when under the heat and stress of an active incident.

Conclusions from the first era of SIEMs

I found it more useful to think of a SIEM as a ‘development platform’ rather than a security product or appliance. The amount of work left for the customer was truly enormous. The only organizations I’ve ever met that were happy with their SIEM had full-time SIEM developers.

While there is value in SIEMs, the benefits seemed (ha!) to be significantly greater for IT than security. Why did we cling so tightly to the SIEM then? Why didn’t we just jettison it as an extremely expensive failed experiment? I think it goes back to this concept of FOMO. If I don’t know what data is valuable, how could I possibly be comfortable getting rid of parts of it? As a security analyst, where would I go for answers? The SIEM always gave me something to do. It could keep me busy and give me hope that the answer lurked somewhere within its depths.

The sad reality was that it rarely did.

The biggest mistake, it seemed, was the assumption that all data was equally valuable. We captured as much data as possible and spent untold hours categorizing it, normalizing it and reviewing it — only to get such a small return. Huge efforts went into managing and making sense of logs — security goals were largely forgotten.

The Man in the High Castle

What if SIEM history followed a different path?

If, as I’ve concluded, SIEMs were better positioned to provide value to IT than to security, why did security own the SIEM? Why did the “S” exist in SIEM? I don’t know the full history of the SIEM, but best I can tell, it was PCI that really launched it as a market, by creating hard requirements for log retention.

Is it fair to conclude that IT should have owned log management from the start? A group within IT could have acted as data and intelligence gatherers. We could have called them “ops librarians”. Regardless, in hindsight, it probably wasn’t a good use of time to have multiple InfoSec FTEs dedicated to simply maintaining a massive data store of logs.

Later on, when I was working as an industry analyst, my boss and mentor Wendy Nather began an ambitious project to classify types of threat intelligence and the threat intel market itself. Part of what came out of that exercise was splitting intelligence into two terms: external (threat intel) and internal (security intel). The SIEMs I’ve managed over my career contained security intelligence. The problem was that it was difficult to locate and extract, and it was of low value when we did find it. By low value, I mean that we’d typically have to correlate and enrich this data before it could be useful to us.

But WHY did we have to do this? If these were truly pitched and sold as security products, why weren’t correlation and enrichment done for us?

Strangely, most SIEM platforms never did much in the way of correlating and enriching this data. Back in 2013, QRadar’s “enrichment” was often limited to making it easy to right-click on an IP address to open it in MXToolbox in another tab. With the rise in AI/ML popularity, I fear the separation between marketing and reality has probably only grown.

If SIEMs were truly security products and not just ‘log databases’, why didn’t we see more security-specific use cases and features? Personally, I was always baffled that no SIEM offered the ability to compile employee profiles. All the necessary data was there — identity, logon source IPs, logon destinations, etc. Why couldn’t a product in 2006 build me a profile of sorts for “Adrian”?

  • Here are all the places Adrian logs into on a daily basis.
  • Here are all the devices he has logged in from.
  • Here are the locations he has logged in from.

There’s no reason answering a question like “does Adrian often connect his corporate laptop to public Wi-Fi hotspots” should take more than a few seconds today. Sure, I could build something that does this over the next few days. The point is that, right now, I should be able to choose between a dozen products that can do it right out of the box. This doesn’t seem to exist and I can only guess this is because there’s a significant gap between product managers and practitioners.
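
To show how little is actually required, here’s a minimal sketch of that kind of profile, built from parsed authentication events. The field names (user, src_ip, dest_host, timestamp) are hypothetical placeholders; a real deployment would need per-source parsers and far more nuance than this.

```python
from collections import defaultdict

class UserProfile:
    def __init__(self):
        self.destinations = set()   # hosts the user logs into
        self.source_ips = set()     # where they log in from
        self.first_seen = None
        self.last_seen = None

    def update(self, event):
        self.destinations.add(event["dest_host"])
        self.source_ips.add(event["src_ip"])
        ts = event["timestamp"]
        self.first_seen = ts if self.first_seen is None else min(self.first_seen, ts)
        self.last_seen = ts if self.last_seen is None else max(self.last_seen, ts)

profiles = defaultdict(UserProfile)

def ingest(events):
    """Fold a stream of parsed authentication events into per-user profiles."""
    for event in events:
        profiles[event["user"]].update(event)

def is_new_source(event):
    """Example question: is this a source IP the user has never logged in from?"""
    profile = profiles.get(event["user"])
    return profile is not None and event["src_ip"] not in profile.source_ips
```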

Conflicts of Use Cases

Some of the most common SIEM use cases appear to have conflicting needs. All logs and events have the potential to be useful. Very few are ever useful and even fewer are actionable (at least, in isolation). In support of incident response, it’s easy to justify keeping all logs — even all network traffic for a significant amount of time. I believe that finding the root cause of an incident is a critical part of lessons learned and security program maturity in general.

Unfortunately, it seems that the difficulty of detecting threats increases with the size and types of data that must be searched or investigated. For a threat detection or threat hunting use case, it almost seems that false negatives are a necessary trade-off due to the importance of speed. Put another way, if threat hunting is too slow, it becomes incident response, because you’ll have missed the attack.

This raises the question: should the incident response and threat hunting use cases be separated? Should they be addressed with entirely different products or systems? I’m not sure. Technically, from a data storage perspective, I think the answer is no; advances in database and storage technology make it possible to store massive amounts of data and query it quickly. From a UI/UX perspective, however, I think the incident response and threat hunting use cases are completely different and require different approaches and interfaces.

Unless, that is, we’re doing one or both of them incorrectly.

Incident responders are justified in wanting as much data as possible. Threat hunters, on the other hand, must become comfortable with culling the data they look at as much as possible. A remote chance that a data source could indicate badness isn’t good enough. A threat hunter’s time is finite and they must prioritize time spent on data sources containing the highest quality signal. Trying to look at everything risks finding nothing.

Time to move on

There are a few concepts and approaches that I think could help.

One is the concept of a library of ‘red flags’. This approach accepts false negatives as necessary and rejects uncertain detections and alerts. Following is a definition from my own unpublished writings.

Red flags are events that occur in any computing environment (not necessarily just enterprise networks, though that is the primary focus) that indicate malicious intent or activities without a doubt. These events should never occur within a trusted environment, unless someone is assigned or hired to simulate malicious adversaries. There are also yellow flags, which can sometimes indicate malicious operations are occurring, but won’t always be reliable indicators.

The concept of red flags in an information security context is very similar to how the finance industry uses red flags for detecting various types of fraud and other types of illegal financial activities.
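
As a toy illustration of what a red-flag library could look like in code, here are a few simplified rules expressed as predicates over parsed events. The rules and field names are examples I’m making up for this post, not a vetted detection set.

```python
RED_FLAGS = {
    # Credential-dumping tooling showing up on a command line.
    "mimikatz_commandline": lambda e: "sekurlsa::logonpasswords" in e.get("command_line", ""),
    # Shadow copies being wiped, a classic ransomware precursor.
    "shadow_copy_deletion": lambda e: "vssadmin" in e.get("command_line", "").lower()
                                      and "delete shadows" in e.get("command_line", "").lower(),
    # The Windows Security event log being cleared (event ID 1102).
    "security_log_cleared": lambda e: e.get("event_id") == 1102,
}

def evaluate(event):
    """Return the names of every red flag a parsed event trips."""
    return [name for name, rule in RED_FLAGS.items() if rule(event)]

# Example: evaluate({"command_line": "vssadmin delete shadows /all /quiet"})
# -> ["shadow_copy_deletion"]
```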

Another is the concept of setting detection traps for attackers. Adversaries rely on expected defaults, and by anticipating those assumptions, it becomes possible to catch them doing exactly what we already know they’ll do. (Disclaimer: this is what my employer does.)
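
Here’s a minimal sketch of the trap idea, with made-up decoy names: plant things no legitimate user or process should ever touch, then treat any access as a high-confidence alert. This is just an illustration of the concept, not how any particular product implements it.

```python
# Made-up decoys: things no legitimate user or process should ever touch.
DECOYS = {
    "accounts":  {"svc-backup-legacy"},                    # bait account, never used
    "hostnames": {"fin-archive-01.corp.example"},          # DNS record with no real host behind it
    "files":     {r"\\fileserver\hr\salaries_2019.xlsx"},  # bait document
}

def check_event(event):
    """Return an alert string if a parsed event touches any decoy, else None."""
    if event.get("user") in DECOYS["accounts"]:
        return f"decoy account used from {event.get('src_ip')}"
    if event.get("dns_query") in DECOYS["hostnames"]:
        return f"decoy hostname resolved by {event.get('src_ip')}"
    if event.get("file_path") in DECOYS["files"]:
        return f"decoy file opened by {event.get('user')}"
    return None
```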

The common thread with these approaches is that they prioritize high quality signal over data and alerts of unknown or uncertain quality. With reliable sources of higher-quality alerts, we can finally leave the hunting and guessing behind. While I don’t think red flags and breach detection traps are enough, I think they’re both examples of steps in the right direction.


Adrian Sanabria

Information security veteran blogging primarily about how technology can hinder or help productivity and progress here. Co-founder of Savage Security.