Modern SIEM Mysteries

Anton Chuvakin
Published in Anton on Security
Jul 27, 2020


Look outside, we are in 2020 (can anybody really forget that?). So, we are not in 2002 anymore (perhaps the birth year of modern-ish SIEM), nor are we in 2012…

So, depending on how you count, SIEM technology (and SIM/SEM before it) has existed for almost a quarter of a century. Funny enough, my involvement with SIEM started in exactly 2002.

In light of this, any observer will find it at least a little bit peculiar that in 2020 people are challenged by roughly the same things about SIEM as they were during its younger years.

Here is a SIEM poll that we just ran (Twitter, July 2020):

Note the outsized number of votes (~1,400); the tweet's impression count is rapidly approaching 100K. You should definitely read through the comments if you are curious about the sentiment around this security technology today. (An easy challenge: count [or guess] how many people in that thread stated “I just love my SIEM, it does everything perfectly at a reasonable cost”…)

However, this blog is not about the vote count, but about one of the most popular “write-in” votes. I’d aggregate it under SIEM problems with inputs. Variously expressed as “Garbage in — Garbage out”, “useless logs”, “incomplete log sources”, “getting the data reliably”, “drowning in noise”, “data quality problems”, “not enough sufficient log data” and “lack of good quality input”, this problem is much worse today than many realize. As you can guess, it potentially affects the new entrants into the SIEM market nearly as much as the old ones (although the SaaS model makes it a bit easier to catch and address the issues faster).

Now, in the early days of SIM/SEM (2002 comes to mind again), people believed that “correlation” would fix the data quality problems caused by bad/incomplete logs. We only need to “correlate” those pesky IDS alerts with system log data and we will get to the truth. Generally, that didn’t work out.
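To make the idea concrete, here is a minimal sketch of what such a correlation rule amounts to: pair an IDS alert with host log events seen on the same machine shortly afterwards. The field names, sample data and time window are assumptions for illustration, not any particular SIEM’s rule language; the point is that the joined output is only as trustworthy as the two inputs being joined.

```python
# Minimal sketch of a naive "correlation" rule: join IDS alerts with host
# log events by host and time window. Field names and sample records are
# hypothetical, for illustration only.
from datetime import datetime, timedelta

ids_alerts = [
    {"host": "10.0.0.5", "time": datetime(2020, 7, 27, 10, 15), "sig": "ET EXPLOIT attempt"},
]
host_logs = [
    {"host": "10.0.0.5", "time": datetime(2020, 7, 27, 10, 16), "event": "new admin user created"},
]

WINDOW = timedelta(minutes=5)

def correlate(alerts, logs, window=WINDOW):
    """Yield (alert, log) pairs where the log event follows the alert
    on the same host within the time window."""
    for alert in alerts:
        for log in logs:
            delta = (log["time"] - alert["time"]).total_seconds()
            if log["host"] == alert["host"] and 0 <= delta <= window.total_seconds():
                yield alert, log

for alert, log in correlate(ids_alerts, host_logs):
    print(f"{alert['host']}: {alert['sig']} followed by '{log['event']}'")
```

The join itself is trivial; the hard part (then and now) is that if either input stream is noisy, incomplete or mislabeled, the “correlated” result inherits all of those flaws.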

More recently, people tried ML/”AI”-ing their way out of this problem and quickly got pecked to death by the “chicken and egg” problem of ML requiring high-quality data to run well. BTW, you may remember that I shared that UEBA vendors complained about data quality and data access problems being central to their business. So, that is not going to work out either.

Thus, after trying for many years, I am almost ready to admit defeat in the battle for log quality/fidelity/accuracy. I’ve seen modern “security” logs from some modern applications, and in many cases they look just as bad as old “security” logs from old systems and applications. IMHO, logs are NOT getting better en masse. Sure, we have JSON now, but this fixes only the syntax/format of logs, not the meaning. In the same thread, people ask for syslog instead of some modern XML API for logging… guess why? Also, the security quality of non-security logs (like debugging and performance logs) is exactly where it was in, say, 1997 when Marcus Ranum fiddled with /var/log/syslog.
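To illustrate the syntax-versus-meaning point, here is a hedged example (the field names and the “required fields” set are made up for illustration): both records below are perfectly valid JSON, but only one carries the attributes a detection could actually use.

```python
# Both records parse cleanly as JSON, so the *format* problem is solved.
# Only one of them, however, contains the fields a typical authentication
# detection would need. Field names are hypothetical.
import json

syntactically_fine = json.loads(
    '{"ts": "2020-07-27T10:15:00Z", "level": "INFO", "msg": "operation completed"}'
)
actually_useful = json.loads(
    '{"ts": "2020-07-27T10:15:00Z", "user": "svc_backup", "src_ip": "10.0.0.5",'
    ' "action": "login", "outcome": "failure", "reason": "bad_password"}'
)

# Example set of fields a detection rule might require.
needed = {"user", "src_ip", "action", "outcome"}

for name, record in [("syntactically_fine", syntactically_fine),
                     ("actually_useful", actually_useful)]:
    missing = needed - record.keys()
    print(f"{name}: missing {sorted(missing) or 'nothing'}")
```

In other words, JSON (or any other format) guarantees that a machine can parse the log, not that the log says anything useful about security.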

OK, so what to do? One route that I’ve been, ahem, observing is, of course, to make your own telemetry. This is where I see the expansion of EDR to XDR as a potentially big deal. The visual below comes from my BlackHat 2020 XDR speech (sponsored); even though it oversimplifies a bit, it does make the point well.

Naturally, those of you who still resist anything agent-based will prefer logs, and I respect that. Also, those who want full visibility will choose the triad or “all of the above.” However, the EDR-to-XDR path is becoming more valid and valuable, given the situation with logs…
