[SIEM]thing is Wrong!

Josh-T
12 min read · Jun 18, 2019


Congratulations, you’ve set up your SIEM and you instantly have a single pane of glass for security operations and incident response! Not really.

Photo by Lukas from Pexels

There may have been a time in the past when that term, ‘single pane of glass’, was closer to reality, but it just isn’t a term that should be used these days. Unfortunately, it’s very likely that security will never have a single pane of glass due to the pace of technology growth in the industry. There are too many technologies and platforms that ultimately become data sources, and the SIEM offerings of the world can’t consume all of the data. Each data source has the potential to hold valuable information or clues that security analysts need for their investigations. However, there are more fundamental problems with SIEM implementations across the globe that have less to do with the product and more to do with how it is implemented.

It’s not always the user and it’s not always the product.

Presumably, the leading reason for users’ misgivings about SIEM is usually unspoken. Technology vendors don’t want to offend a customer by telling them that they failed to set the product up correctly. Customers are limited on budget and want to make their technology investment a success, but they are thinking that the product should simply perform. The solution is for both parties to come together and combine product expertise with organizational knowledge. In many instances, the SIEM is technically installed correctly, but the configuration and adaptation to the unique environment and the security team’s capabilities is lacking. Product expertise can only cover so much ground, and the gap is in the fundamental understanding of turning Noise into Signal. Technology vendors don’t know the capabilities of the customer or the lay of the land. If only there were a process to follow to alleviate the stalemate. The negative outcomes of a poor SIEM implementation are endless, and the user is the victim. Users oftentimes look at a SIEM dashboard and can’t tell what is good vs bad or true positive vs false positive. Screen fatigue plagues security analysts because they spend too much time manually sifting through events to stitch them together. After their research, they rarely document which events were valuable or take the critical next step of continuous improvement, because they don’t have time or, even worse, context.

Think about most of the major security breach reports in the media over the last decade. Time and time again, the victim had security tools in place and sometimes even met compliance requirements or regulations such as PCI-DSS. Did any of the security vendors take the blame and state that their product failed to meet promised expectations? I can’t think of a single instance. The flaw was that the security team didn’t see the event or have the context to understand whether the event(s) were malicious. The signal was lost in the noise, and in some cases the signal was never even collected.

Sec Ops management will eventually witness their security analysts hunting for events to investigate. The analysts will probably make mental notes of events that were false positives or outright useless. Detection times will be slow and response times even slower. Soon, the realization sets in that there isn’t a clear indication of success over time. It is very hard to prove that the team is able to raise the security baseline and get better at security when the team is drowning in noise.

After several technology rollouts and onboarding of data sources in the SIEM, the problem still isn’t solved and decisions must be made.

  • Dump the SIEM and buy a new one
  • Hire more experienced analysts
  • Train existing analysts
  • Quit and go where the grass is greener
  • Dig in, review and enhance Process
Photo by Lukas from Pexels

Time to Dig In!

Is it plausible that any NFL football team has won the Super Bowl without a playbook? Zero chance that has happened or ever will happen. The playbooks are the key to offensive and defensive strategy. They align the team and prepare the players to identify and respond to the opponent on every conceivable play. Even more importantly, the playbook design is built around the team’s and individual players’ capabilities.

The same is true for security operations. Playbooks are critical in order to build an effective security operations and incident response team. Create playbooks, a.k.a. runbooks, and begin to explore the data required to perform a tactical investigation for each playbook. This is a critical first step in preparing a SIEM to run the way it is intended. The value in features such as correlation, enrichment, threat intelligence feeds, and asset risk scoring is far more attainable after separating the signal from the noise. Don’t dwell on those features in the early stages. More importantly, those valuable features typically align to the last two steps of the process, Insight and Action. The key to SIEM is to understand what capabilities the security team wants or needs and then gather the right data to fulfill each capability.

Just to make sure I haven’t lost you, here are a few key takeaways so far.

  1. Create playbooks for incident response.
  2. Don’t pull in data from everything in the environment.
  3. Understand the technology in the environment and what types of data exist in each technology.
  4. Enhance the process.
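
To make the playbook idea concrete, here is a minimal sketch of what a playbook definition could look like as data. It is only an illustration under my own assumptions; the field names are hypothetical and not tied to any SIEM product.

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Minimal, illustrative playbook definition (field names are hypothetical)."""
    capability: str                                   # what the team wants to detect and respond to
    data_sources: list = field(default_factory=list)  # technologies expected to supply the signal
    investigation_steps: list = field(default_factory=list)  # tactical questions the analyst answers

malware_playbook = Playbook(
    capability="Malware",
    data_sources=["Endpoint Threat Prevention", "Web Proxy", "Network IPS", "Firewall", "EDR"],
    investigation_steps=[
        "Was the threat handled (cleaned, deleted, quarantined)?",
        "Which host and user were involved?",
        "Was there an outbound connection after execution?",
    ],
)
print(malware_playbook.capability, "->", ", ".join(malware_playbook.data_sources))
```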

Throughout the process review and enhancement, you will be working with data that is used for analysis. Defining what data is needed to fulfill a capability can be difficult without understanding tactical and technical analysis.

Tactical Analysis should be the story.

  • Why did this event happen?
  • What has happened?
  • Who is the adversary / target?

Technical Analysis should be the pieces that make up the story.

  • When did it happen?
  • Where is the source and destination?
  • How was it done?
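
One way to picture the relationship: technical analysis supplies the discrete pieces (when, where, how), and tactical analysis stitches them into the story (who, why, what). The sketch below is purely illustrative; every field name is an assumption, not a schema from any product.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TechnicalFacts:
    """The technical pieces of an event (hypothetical field names)."""
    timestamp: datetime      # when did it happen?
    source_ip: str           # where did it come from?
    destination_ip: str      # where was it aimed?
    method: str              # how was it done?

def tactical_story(facts: TechnicalFacts, adversary: str, motive: str) -> str:
    """Stitch the technical pieces into the tactical story (who, why, what)."""
    return (f"At {facts.timestamp:%Y-%m-%d %H:%M}, {adversary} targeted "
            f"{facts.destination_ip} from {facts.source_ip} via {facts.method}. "
            f"Suspected motive: {motive}.")

facts = TechnicalFacts(datetime(2019, 6, 1, 14, 2), "203.0.113.5", "10.0.0.12",
                       "an HTTP exploit attempt against Apache")
print(tactical_story(facts, adversary="an unknown external actor", motive="establish access"))
```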

All too often, security teams set up a SIEM and begin gathering logs from anything they can get their hands on. While the intent is good, the execution and the process are lacking and negatively affect the outcome because the result is too much noise. Also, many security teams focus on implementing tools, such as a SIEM, to identify security events, but often forget to prepare for response and metrics. Capabilities are about detecting and responding. Responding to a security incident is only possible if the security team can collect the right technical data and reveal the story through tactical analysis. Separating the signal from the noise and building insight or context is the virtual bridge from technical analysis to tactical analysis. There is a process for this too.

The OODA loop should become a core concept in security investigations and should be used in data selection too. More info about the OODA loop: link

OODA

  • Observe
  • Orient
  • Decide
  • Act

Observe events and behaviors that may indicate the adversary’s motives: takedown, theft, establishing access, or control.

  • Example: Inbound Apache Web Server Exploit from a remote IP address.

Orient is gathering situational awareness, insight, or context. This is achieved through data enrichment, correlation rules, normalization, additional data sources, and organizational knowledge.

  • Example: The destination is an Apache server. There are no signs of service takedown, theft, or unauthorized access. The remote IP address is tied to various events every month and appears to be related to a vulnerability scanning service.

Decide involves validation, scoping, and criticality assessment of the event.

  • Example: The Apache server is vulnerable, and it is a critical asset that drives revenue. The exploit event is tied to the organization’s monthly vulnerability scanning activities.

Act is simply responding at the right time with the right resources while considering the outcomes of acting now or later.

  • Example: This event is not an attack. Report the event and request that Apache is patched, OR enable blocking with a countermeasure such as a Network IPS if a patch cannot be installed in a reasonable timeframe. Monitor the vulnerable asset until the patch or tuning is complete. Ensure an exploit attempt is not made from any other source IP addresses.
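
To keep the loop visible during an investigation, it can help to record each stage explicitly. The snippet below is a simple sketch of that idea using the Apache example above; the structure and values are illustrative, not output from a real SIEM.

```python
# Record an investigation as it moves through the OODA stages.
# The wording mirrors the Apache example above; the IP address is from a documentation range.
investigation = {
    "observe": "Inbound Apache web server exploit attempt from remote IP 198.51.100.7",
    "orient": ("Destination is an Apache server; no sign of takedown, theft, or unauthorized "
               "access; the source IP recurs monthly and maps to a vulnerability scanning service"),
    "decide": ("The server is vulnerable and revenue-critical, but the event is tied to the "
               "organization's own monthly vulnerability scan"),
    "act": ("Not an attack: report the finding, request a patch (or IPS blocking), and monitor "
            "for exploit attempts from any other source IP"),
}

for stage, note in investigation.items():
    print(f"{stage.upper():<8}{note}")
```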

With an understanding of the adversary’s motives and the OODA loop, it is time to start mapping capabilities and filtering data.

Capability Examples:

  • Malware
  • Data Exfiltration
  • Remote Exploit Attempts
  • Credentials Compromise or Misuse
  • Denial of Service

To articulate this further, I’ll use malware as the example, but this can be applied to almost any capability. Remember to break the data into chunks that align to capabilities. Be sure to document and understand what technologies are in the environment. Soon, it will be easy to map each of the technologies (data) to the signal, insight, and action phases.

Example — Malware Detection and Response Technologies:

Note: Below is a brief list of technologies using generic names. The ability to detect malware depends on the technology vendor. The ability to consume data into a SIEM may depend on the SIEM and/or the technology vendor as well.

  • Endpoint Threat Prevention
  • Web Proxy
  • Network IPS
  • Firewall
  • Endpoint Detection & Response (EDR)
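
One way to map these data sources is to tag each one with the phase(s) it best supports: signal, insight, or action. The assignments below are my assumption for illustration; the right mapping depends on the vendor and how the technology is deployed.

```python
# Hypothetical mapping of malware-related data sources to the signal / insight / action phases.
technology_phases = {
    "Endpoint Threat Prevention":    {"signal", "action"},
    "Web Proxy":                     {"signal", "insight"},
    "Network IPS":                   {"signal", "action"},
    "Firewall":                      {"insight", "action"},
    "Endpoint Detection & Response": {"signal", "insight", "action"},
}

# Quick view of which sources feed the signal phase for the Malware capability.
signal_sources = [tech for tech, phases in technology_phases.items() if "signal" in phases]
print("Signal sources for Malware:", ", ".join(signal_sources))
```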

It is time to start the process of separating Signal from the Noise.

In the SIEM, create a dashboard that only shows events from the data sources that support the capability (ex. Malware). The wording and configuration of the dashboard may vary depending on the technology vendor (SIEM & Data Source).

Consider creating dashboard tiles based on the following criteria or groupings. These should assist in separating events quickly.

Tile 1 — Events by data source

Tile 2 — Events by event name, summary or description

Tile 3 — Events by result type (block, delete, quarantine, fail, etc.)

Tip: It will be useful to have the ability to drill into an event and see the event details.
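
If it helps to prototype the three tiles outside the SIEM, the same groupings can be built from an event export with a few lines of Python. The field names (data_source, name, result) are assumptions for illustration, not a product schema.

```python
from collections import Counter

# Hypothetical export of events from the malware-related data sources.
events = [
    {"data_source": "Endpoint Threat Prevention", "name": "Trojan detected", "result": "quarantined"},
    {"data_source": "Endpoint Threat Prevention", "name": "Trojan detected", "result": "failed to clean"},
    {"data_source": "Web Proxy", "name": "Malicious URL blocked", "result": "blocked"},
    {"data_source": "Network IPS", "name": "Exploit kit landing page", "result": "blocked"},
]

print(Counter(e["data_source"] for e in events))  # Tile 1 - events by data source
print(Counter(e["name"] for e in events))         # Tile 2 - events by event name / summary
print(Counter(e["result"] for e in events))       # Tile 3 - events by result type
```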

Using this basic dashboard, work through the events one by one. The goal is to review and categorize each event so that future decisions can be made based on the usefulness of the event in supporting the capability.

Separate the events into the following groups:

Noise — The event provides no assistance in detecting a malicious event (malware).

Context — The event may not aid in detecting a malicious event, but it may provide context about an event.

  • Example — A firewall event showing that a successful outbound connection was allowed. This type of event may not prove that malicious activity has taken place. If, for example, a piece of malware was successfully executed on an endpoint AND performed an outbound connection to a remote IP address, the firewall event should provide context about the remote IP, port, and protocol. The endpoint technology may only capture the name of the file and show that the detection was made but failed to handle the potential threat. A correlation rule could be useful to identify a malware detection that was not handled around the same time a callback connection is made to a remote IP (a sketch of that rule follows this list).

Signal — The event shows malicious activity, whether the threat was handled or not handled.

Research — The event shows activity that needs to be validated. Expect tuning of policy as a result.

  • Example — A Network IPS event for IIS 5.0 Server. Research to see if the company has any IIS 5.0 servers running. This will help determine signal and true positive / false positive.

Operational — The event shows activity that is only useful to the deployment or operational health.

  • Note: The structure of the security team will be a factor in deciding if these events are important. Some security teams have separate engineering and analyst teams, while in others the engineers are the analysts. Either way, the events need to be separated.
  • Two Examples:
  • Scheduled Scan Completed
  • Update complete
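
As noted in the Context example, a correlation rule that pairs an unhandled endpoint detection with an outbound firewall connection from the same host is a natural next step. Here is a minimal sketch of that logic under my own assumptions; the normalized field names are hypothetical and this is not the rule syntax of any particular SIEM.

```python
from datetime import datetime, timedelta

# Hypothetical normalized events; real field names depend on the SIEM and the data sources.
endpoint_events = [
    {"host": "ws-042", "time": datetime(2019, 6, 1, 10, 5), "handled": False, "file": "invoice.exe"},
]
firewall_events = [
    {"host": "ws-042", "time": datetime(2019, 6, 1, 10, 7), "dst_ip": "198.51.100.23",
     "dst_port": 443, "action": "allowed"},
]

WINDOW = timedelta(minutes=15)  # how close in time the two events must be to correlate

def correlate(endpoint_events, firewall_events, window=WINDOW):
    """Pair unhandled malware detections with outbound connections from the same host."""
    for ep in endpoint_events:
        if ep["handled"]:
            continue  # handled detections are not what this rule is hunting for
        for fw in firewall_events:
            if fw["host"] == ep["host"] and abs(fw["time"] - ep["time"]) <= window:
                yield {"host": ep["host"], "file": ep["file"],
                       "callback_ip": fw["dst_ip"], "callback_port": fw["dst_port"]}

for match in correlate(endpoint_events, firewall_events):
    print("Possible callback after unhandled detection:", match)
```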

Document the events so that future security team members have a written history of why specific events were kept or dropped.

Capture the event name, description, and ID for any event that is related to Malware.

  • Cleaned
  • Deleted
  • Quarantined
  • Would be blocked
  • Failed to clean
  • Failed to delete

Any event that is noise should be dropped from the SIEM. There are two approaches for this:

  • Configure the data source to stop sharing the event with the SIEM
  • Configure the SIEM to drop / delete the event once it is received
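
If the data source cannot be configured to stop sending the noisy events, the drop-on-receipt approach can be approximated in whatever layer forwards events to the SIEM. The sketch below assumes the noisy event IDs have already been documented and that events carry an event_id field; both are assumptions for illustration.

```python
# Event IDs the team has documented as pure noise for the Malware capability (hypothetical values).
NOISE_EVENT_IDS = {"1001", "1002", "2050"}

def should_forward(event: dict) -> bool:
    """Return True if the event should reach the SIEM; drop documented noise."""
    return event.get("event_id") not in NOISE_EVENT_IDS

incoming = [
    {"event_id": "1001", "name": "Agent heartbeat"},
    {"event_id": "3100", "name": "Trojan detected", "result": "failed to clean"},
]
forwarded = [e for e in incoming if should_forward(e)]
print(f"Forwarded {len(forwarded)} of {len(incoming)} events")
```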

Create a watchlist for Malware events; it should contain all events related to malicious activity (handled & not handled). Add all of those events to the watchlist.

If the security team wants to keep operational events in the SIEM, create a watchlist with all the IDs related to operational health events.

Create a dashboard for Malware (this is a capability-driven dashboard now). The security team may have a preference in what dashboard elements are used.

Create alarms for events where response is necessary, e.g., failed to clean, delete, or quarantine.
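
Conceptually, the watchlist and the alarms boil down to two sets: every malware-related result (handled and not handled) goes on the watchlist, and the “not handled” results raise an alarm. The result names below are illustrative; use whatever your data source actually emits.

```python
# Watchlist of malware-related results, handled and not handled (names are illustrative).
MALWARE_WATCHLIST = {
    "cleaned", "deleted", "quarantined", "would be blocked",
    "failed to clean", "failed to delete", "failed to quarantine",
}
# Results that should raise an alarm because a response is necessary.
ALARM_RESULTS = {"failed to clean", "failed to delete", "failed to quarantine"}

def triage(event: dict) -> str:
    result = event.get("result", "").lower()
    if result in ALARM_RESULTS:
        return "ALARM"       # an analyst needs to respond now
    if result in MALWARE_WATCHLIST:
        return "WATCHLIST"   # keep for context and metrics
    return "IGNORE"

print(triage({"result": "Failed to clean"}))  # ALARM
print(triage({"result": "Quarantined"}))      # WATCHLIST
```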

Now the signal is separated from the noise for the use case.

Repeat for all of the capabilities in the security team. Some may be harder than others, so start with the low-hanging fruit like malware, rogue device detection, and phishing, and build up to more complex capabilities like fileless malware.

Photo by Gerd Altmann from Pexels

Topics for another day… Correlation, enrichment, asset risk scoring, integrations, metrics that matter and all of the features that support insight and action.

Additional Thoughts

You may be wondering why we want the malware handled event…

The SOC and the overall security team must determine if they are getting better at security monthly, quarterly, annually, etc. A quick starting point is to measure events, create a continuous improvement loop, and gain insight into the effectiveness of the changes.

Example Scenario: Endpoint protection events are high in volume and don’t always require response. The SOC should aim for a level of maturity where they respond and eradicate quickly (24 hours or less) as a top priority. Analyzing handled events is important because they can reveal other indicators of attack or compromise. Each set of events should start a continuous improvement effort: tune policies and ensure everything is up to date on product and DAT versions. Analyze the number of malware events that required response over time and aim to decrease it.

How much did the policy tuning and the updating process improve the number of events that were handled vs not handled?

Are response times getting faster?

How many samples were detected on multiple systems?

  • Look at the trending / normalization data and compare standard deviation over months, quarters, etc.
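
A starting point for these measurements can be as simple as grouping events by month and comparing the handled ratio, response times, and the spread of monthly counts. The sketch below uses pandas against a hypothetical event export; the column names are my assumptions.

```python
import pandas as pd

# Hypothetical export: one row per malware event with detection and response timestamps.
df = pd.DataFrame({
    "detected_at":  pd.to_datetime(["2019-04-03", "2019-04-20", "2019-05-11", "2019-06-02"]),
    "responded_at": pd.to_datetime(["2019-04-05", "2019-04-21", "2019-05-11", "2019-06-02"]),
    "handled":      [False, True, True, True],
})

df["month"] = df["detected_at"].dt.to_period("M")
df["hours_to_respond"] = (df["responded_at"] - df["detected_at"]).dt.total_seconds() / 3600

monthly = df.groupby("month").agg(
    events=("handled", "size"),
    handled_ratio=("handled", "mean"),
    mean_hours_to_respond=("hours_to_respond", "mean"),
)
print(monthly)
print("Std deviation of monthly event counts:", monthly["events"].std())
```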

What else can be done to improve? Enhance detection at the perimeter and practice a defense-in-depth strategy? Could SSL decryption be enabled to broaden the scope of traffic analysis? Could a SaaS proxy technology or an always-on VPN strategy be used for devices in the field to practice a scalable defense-in-depth strategy?

  • If yes, then deploy the products, tune them, enable features like SSL decryption and forced proxy, integrate tools, and aim for scalability.

How much did the defense in depth and scalability strategy improve the overall malware detection?

  • Examine the trending and see if malware detections at the perimeter are trending up and endpoint trending down.
  • Create a dashboard element to show system IP addresses at the time of detection. Sort the data to show detections for systems off of the company LAN. Did the mean time to detect improve? Did the detection shift from endpoint to another countermeasure?
  • Note: Setup network zones in the SIEM to enhance metrics by zone. It will provide extra insight into areas of the organization that are higher in activity and may require tweaking.

Are the malware handled events useful?

  • Yes, just because the malware was handled doesn’t mean the work is done.

Use these events to extract additional information and be proactive. Identify the source IP, URL, domain, hostname, user, mail sender, removable media etc. Remember the tactical and technical analysis? Malware that is handled still has a story.

  • Analysts should research each of these observables and determine the root cause of the malware event.

What can be done to prevent future malicious activity?

  • Block a remote IP
  • Block a mail sender
  • Block a URL
  • Block a domain
  • Investigate and remediate an internal host
  • Block autorun on removable media and force an on access scan when media is connected
  • Identify a bad actor
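
Pulling those observables out of a handled event and pairing each with a candidate action can be done in a very simple first pass. The sketch below is illustrative only; the event fields and the observable-to-action mapping are my assumptions, and any real blocking should go through the team’s normal change process.

```python
# Map observable types to candidate preventive actions (illustrative, not prescriptive).
ACTIONS = {
    "source_ip":   "Block the remote IP at the firewall",
    "url":         "Block the URL on the web proxy",
    "domain":      "Block the domain at DNS / proxy",
    "mail_sender": "Block the sender at the mail gateway",
    "host":        "Investigate and remediate the internal host",
}

def extract_observables(event: dict) -> dict:
    """Pull the interesting fields out of a (hypothetical) handled malware event."""
    return {key: event[key] for key in ACTIONS if event.get(key)}

handled_event = {
    "result": "quarantined",
    "host": "ws-042",
    "url": "http://malware.example/payload.exe",
    "source_ip": "198.51.100.23",
}

for observable, value in extract_observables(handled_event).items():
    print(f"{observable}={value} -> {ACTIONS[observable]}")
```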

With all of the activity in tuning and improving, there should be a few significant outcomes. The event volume in the SIEM should go down considerably. The meaningful events should be clearer and more actionable. The starting point for measuring effectiveness has been established. The health of the SIEM (performance in events per second) should be in a better state. A new set of questions and procedures for investigation can be adopted and built upon.
