A SOC-MSSP guide (3 of 4)

Gaston MARTIN
34 min read · Nov 26, 2023


From reporting malicious activity to catching it as it happens

Part 1: https://medium.com/@7rm1ef8/a-soc-mssp-guide-1-of-4-3f5450638a98
Part 2: https://medium.com/@7rm1ef8/a-soc-mssp-guide-2-of-4-f4fb93be2422
Part 4: https://medium.com/@7rm1ef8/a-soc-mssp-guide-4-of-4-a78779d830dd

Summary
4. Operational aspects
4.1. Knowledge management
4.2. Infrastructure lifecycle
4.3. Detection lifecycle
4.4. Incident lifecycle
4.5. Training and monitoring
4.6. Customer service
4.7. Conclusion

4. Operational aspects

This chapter goes over the main operational points to address in a SOC or MSSP. Some of them are easily overlooked and pushed back again and again until "there is time", because "we are too busy at the moment".
The truth is that as long as the time of the people who could work on improving the SOC is spent on quick bug fixes, quick wins or rushing urgent — or more likely already late — projects to production, there will never be any improvements made. It is even worse in the case of an MSSP: current customers will likely keep expanding their monitored environment and new customers will come in, resulting in even less time available for improvement and automation, and in more bug fixing, quick wins and late projects.

Therefore, the time for improving operations has to be made available to the right people to get improvements going. This will hurt operations for a bit, because it will take some time to improve and to see the return on investment come in, but it is the only realistic way — without a budget increase that could translate into recruitments — to transition from a vicious circle to a virtuous circle.

The subchapters in this chapter are not ordered by any kind of to-do priority: they should all be addressed at some point, but the priority depends on the current state of the SOC and on what would give the maximum return on investment.

4.1. Knowledge management

Knowledge management is an actual job that requires specific skills and no amount of good intention or advice can replace that. It is strongly recommended to hire someone for this job either for the whole company or just the SOC, depending on the size and needs.
However, as this tragically falls into the “what is this job, did you make it up?” / “why do we need to pay someone just for this, can’t you handle it?” category, it was important to say a few words about what can be done in this chapter, and a few more about how it can be done in the next.

As a disclaimer, please note that the pointers about knowledge management given in this document do not come from a professional knowledge manager and should not be regarded as best practices, but rather as a small buoy to cling to desperately so as not to drown in the infinite ocean of knowledge.
That was a bit dramatic indeed, but sadly not that much of an overstatement. As previously mentioned, a SOC is a complex entity and the mass of purely job-related knowledge needed is already overwhelming on its own; to that is added the contextual knowledge of the environment to monitor, multiplied by the number of customers in the case of an MSSP.
There is simply, humanly, no way that any one person can know everything he or she needs to work in a SOC by heart, without looking up or checking documentation multiple times per hour. If someone claims otherwise, they are either lying outright to brag or a complete idiot who does not understand a thing they are doing. In both cases, it would be advisable to avoid having such people in a SOC or MSSP.

There are two main points to address when it comes to knowledge management and operations: creating and maintaining the knowledge and having and keeping the knowledge base actionable.

4.1.1. Knowledge creation

Knowledge, be it documentation, processes, procedures, reports, etc., is key in many fields so that things can evolve and move forward instead of repeating themselves. That is true for a SOC and even more so for an MSSP.
Knowledge is perhaps the most worthwhile investment for the future, but writing documentation, for example, can be very tedious. Those who actually enjoy working on it are therefore few; most do it because it is just "that part of the job that you have to do" and some try to avoid it at all costs. However, it is imperative that everything is fully and accurately documented, because everyone should be expendable in the sense that no one is a Single Point Of Failure (SPOF). If anyone were to suddenly disappear, recovery may take some time, but it must be complete, without any loss of knowledge.

Therefore, since this is a crucial task and no one really wants to do it, everything must be done to encourage, enable and facilitate knowledge creation. For example, some of the following could be done:

  • Explain processes and procedures, then demand that they be followed, and enforce it.
  • If there is an issue with a process or procedure, react quickly to correct it and/or make contribution everyone's business — the person pointing out an issue proposes a correction.
  • In every task and mission, plan and allocate time for knowledge creation and insist that the task is not complete until knowledge has been created.
  • Educate everyone to always refer to the knowledge base for everything.

There are many other creative ways to make everyone participate in knowledge creation, but the best advice may be to show everyone that the knowledge they are creating has value. There is always value in knowledge, but it only shows when the knowledge is used, and showing that knowledge creation is meaningful and valuable is often enough to convince people to participate.

Unfortunately, creating knowledge isn’t the end of it, because it’s alive. Granted there are some things that are immutable but the vast majority needs to be updated to stay relevant. Therefore, each piece of knowledge should be in one of the following states:

  • Up to date and used
  • Being created or updated
  • Outdated and being archived (optionally before a later deletion)

This is an ever-ongoing effort that must persist and be encouraged to prevent any loss of information over the long run. The cost of losing knowledge will always be, at best, the time and resources used to create it in the first place, and at worst a net loss plus whatever it costs to find another way to achieve the same goal.

4.1.2. Knowledge organization

For knowledge to have real value, it must be used, and in order to be used it has to be organized, accessible and easily searchable.
This is the hard part when there is no one tasked with knowledge management, because there can either be multiple authorities or none at all when it comes to organizing. If no one has been designated by upper management, then at least someone should be designated within the SOC as a decision maker when it comes to knowledge and it should be made clear to everyone inside (and outside) the SOC that this person and only this person can make decisions about organizing knowledge within the SOC. This would ideally be someone that understands the needs of every team and has some experience — as a witness if nothing else — with successes and failures in knowledge management.

At this point it would be best to limit as much as possible the number of solutions hosting knowledge because it would most likely be far easier to search one source of knowledge than many sources. If there has to be more than one source of knowledge then it should be very explicit and clear to everyone what knowledge can be found where — preferably have each source host specific types or perimeters so that there is no overlap and all the knowledge on a single subject is in a single place.
In any case, the following should be observed:

  • There are regular communications about what knowledge can be found where — or one document explaining that and regular reminders about its existence — so that newcomers get the correct information right away and other staff members are kept updated.
  • Everyone must be able to access every solution hosting knowledge described in the previous point.
  • There must exist an explanation on what knowledge to expect and more importantly on how to use every solution hosting knowledge. That explanation should of course be the landing page when accessing the solution.
  • Processes for creation, modification and deletion of knowledge must exist and be enforced. These should include a validation step from another entity (either a person or a bot).
  • Other quality assurance items, such as taxonomy, nomenclature, templates, etc. must also exist and be enforced.

Knowledge organization is indeed hard because once everyone understands the value of knowledge and keeping it up to date and starts creating and modifying things all around, it can very quickly become a huge mess.
Since it’s not possible for everyone to know by heart how the knowledge is sorted, there must be rules, validations and templates to guide them through the process and make it easier. Keep in mind that creating knowledge is not a sexy task in the first place and now that people start doing it seriously, they should not be slowed or spooked… too much.
That balance is the hardest thing to grasp: how little is not enough and how much is too much? With not enough management, duplicates, outdated knowledge or even misinformation will thrive; with too much, people will give up, not wanting to jump through hoops to do a task they did not want to do in the first place.

4.2. Infrastructure lifecycle

In a SOC and especially an MSSP, there are three main lifecycles that should be defined, enforced and monitored to ensure the continuity of operations.
The first one is about the infrastructure used by the SOC, because it is the foundation on which detection and response are built and if it fails, there is not much the SOC can do.

Every SOC and MSSP needs to define its own lifecycle because it will depend on a number of things such as the choices made, the services available, how SOC jobs are defined, how SOC teams are composed… The lifecycle may or may not vary depending on the hardware and software, but if it does, every possibility needs to be defined, enforced and monitored.
To help with this, a few pointers follow for the two main phases of said lifecycle. As stated before, these pointers are not tied to a specific vendor or even tool and could apply to a SIEM, a SOAR…

4.2.1. Infrastructure architecture and deployment

The first phase is about the architecture and deployment of the infrastructure: its “birth”.

The actual architecture design will of course vary with the goals, the tool and the vendor, but there are some things that might be regarded as obvious or even dumb that should be double-checked to avoid surprises and complications down the line:

  • If possible, use the same hardware/OS specifications across tools and customers, as it will limit surprises and make Maintenance, Repair and Operations (MRO) easier — this could then be templatized to automate deployment.
  • Carefully design and test different “standard architectures” so that when a new customer comes in, little or no time is spent on the design step — only on specific needs and customizations. This allows for a simpler deployment, as it is mostly copy/pasting something that already works elsewhere.
  • Always test — if possible with production load and data — before going into production, especially if there is a need for co-location of multiple tools or roles on the same server.
  • Make sure secrets are unique, strong enough and… stay secret — especially for anything with some admin privileges.
  • Use secure network protocols everywhere, meaning both for the tool’s internal and external traffic.

The last two points may seem painfully obvious in a document about SOCs and MSSPs, but they sadly need to be mentioned.
Keep in mind that without its tools, a SOC is just a bunch of people taking guesses and rolling dice to figure out what may have happened or is still happening — and more and more attackers are figuring this out as well.
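
To make the templating and secret points above more concrete, here is a minimal sketch in Python of what a “standard architecture” template and per-deployment secret generation could look like. All names (StandardArchitecture, instantiate_for_customer, the resource figures) are illustrative assumptions, not tied to any specific tool or vendor.

```python
# Minimal sketch of a "standard architecture" template and basic secret hygiene.
# Names and figures are illustrative, not tied to any specific tool or vendor.
import secrets
from dataclasses import dataclass, field


@dataclass
class StandardArchitecture:
    name: str                       # e.g. "siem-small", "siem-large"
    os_image: str                   # same hardened OS image everywhere to ease MRO
    node_roles: list[str] = field(default_factory=list)
    cpu_per_node: int = 8
    ram_gb_per_node: int = 32


def instantiate_for_customer(template: StandardArchitecture, customer: str) -> dict:
    """Derive a per-customer deployment from a tested, reusable template."""
    return {
        "customer": customer,
        "template": template.name,
        "os_image": template.os_image,
        "nodes": [
            {"role": role, "cpu": template.cpu_per_node, "ram_gb": template.ram_gb_per_node}
            for role in template.node_roles
        ],
        # One strong, unique admin secret per deployment (stored in a vault in practice).
        "admin_secret": secrets.token_urlsafe(32),
        "tls_everywhere": True,  # secure protocols for internal and external traffic
    }


if __name__ == "__main__":
    small_siem = StandardArchitecture(
        name="siem-small",
        os_image="hardened-linux-2023.11",
        node_roles=["collector", "indexer", "search_head"],
    )
    print(instantiate_for_customer(small_siem, "customer-a"))
```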

As much as possible, use the best practices of that tool/vendor for both the back-end and front-end configuration, and when you do not, make sure there is an actual reason and document it. This will always make maintenance and future evolution easier, as the configuration stays standard and vendor-recommended.
Best practices also often help a lot with resource optimization, which is easily overlooked when building a new infrastructure because there is no need for it yet. However, resource optimization becomes key for long-term operation, and it is much easier (and cheaper) when it is part of the design and documented than when some or all of the work has to be redone later on, once freeing up resources has become a must-have.
For this to be possible, the people working on the architecture and deployment must be sufficiently trained to know the best practices in the first place, and to recognize when not applying them might yield better results. Moreover, people who are already trained tend to be less eager to “use that project to test stuff” because they already know the results, or know how to build a lab environment when they are unsure, thanks to said training. There is simply no way to overstate how bad it is when the production environment ends up working somewhat but “it was built years ago by that guy who liked to fiddle, but isn’t there anymore”.

4.2.2. Maintenance, Repair and Operations

The second phase is about the main part of the life of the infrastructure, after it first comes online for production purposes until it is decommissioned.

The single most important point here is that the only way to guarantee the availability of all the tools used by the SOC is to have people dedicated to MRO duties. “Dedicated” means that MRO — and especially fixing whatever is down — should always be at least these people’s first priority, as without the tools the SOC analysts find themselves unable to do their job.
The organizational particulars, such as which team the people dedicated to MRO belong to, do not matter much in the end, as long as these MRO people are dedicated to that job and are accountable to the SOC through some sort of SLAs. Of course, this does not have to be that formal, especially if they are part of a SOC team, but there absolutely cannot, ever, be a time when something is down and it is either no one’s job in particular to fix it, or it is someone’s job but that person cannot do it because he or she has other priorities.

Of course, MRO tasks do not boil down to fixing downed servers or services; the most time-consuming part is often keeping everything up to date. In order to achieve this in a graceful manner, there are a few attention points:

  • Having a precise inventory of what to maintain (hardware and software) and where (if there are multiple environments, as with an MSSP for example) — a minimal sketch of such an inventory follows this list.
  • Robust processes and procedures to ensure that there is very little room to fail even if the person assigned is missing part of the context or is light on some skill.
  • A well-established schedule that is shared with the other SOC teams (and the customers) to limit production issues and avoid panic. This point implies that scenarios and decisions must be established on how to go about upgrading and updating (i.e. by tool? by vendor? by customer? …) and that continued communication with every stakeholder exists. Here it is especially important to keep in mind that the tools are what make the SOC operate and, although upgrades and updates have to be done, they should be done in the least impactful manner possible for the SOC.
  • Skills and knowledge of the person assigned to the task versus what needs to be done. It may or may not be preferable to wait for the right person if it means postponing the job, but this kind of decision is better made in advance as a plan B scenario.
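
As announced in the first bullet, here is a minimal sketch of such an inventory, assuming it is kept as structured data. The component names, versions and the 90-day patch cadence are illustrative assumptions; in practice this would live in a CMDB or asset database rather than in a script.

```python
# Minimal sketch of an MRO inventory and a staleness check.
# Components, versions and cadence are illustrative assumptions.
from datetime import date

INVENTORY = [
    {"customer": "customer-a", "component": "siem-indexer", "version": "9.1.2",
     "environment": "production", "last_patched": date(2023, 9, 4)},
    {"customer": "customer-b", "component": "soar-engine", "version": "6.0.0",
     "environment": "production", "last_patched": date(2023, 6, 12)},
]

# Target versions approved by the MRO team after testing.
APPROVED_VERSIONS = {"siem-indexer": "9.1.4", "soar-engine": "6.0.0"}


def outdated_components(inventory, approved, max_patch_age_days=90):
    """Return components lagging behind the approved version or the patch cadence."""
    today = date.today()
    findings = []
    for item in inventory:
        target = approved.get(item["component"])
        if target and item["version"] != target:
            findings.append((item, f"version {item['version']} != approved {target}"))
        elif (today - item["last_patched"]).days > max_patch_age_days:
            findings.append((item, "patch older than the allowed cadence"))
    return findings


for item, reason in outdated_components(INVENTORY, APPROVED_VERSIONS):
    print(f"{item['customer']}/{item['component']}: {reason}")
```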

Planned MRO activities as described above are implied to be manual, as is often the case. However, they can be very hard to carry out properly in large or multiple environments (e.g. MSSPs), especially if the MRO team is undersized.
That is where a DevOps approach should be considered to automate as much as possible and save a lot of time. DevOps always requires an investment (resources and time) but is simply a must-have for MSSPs, as manual upgrades and updates are too time-consuming and prone to errors, which require troubleshooting, which consumes some more time… This is made easier if the infrastructures deployed in the first place are standardized or even templatized and follow best practices.
Again with DevOps, it does not matter to which team these people belong, but they absolutely need a good understanding of, and close working relations with, all of the SOC teams to correctly capture their needs and implement them.

4.3. Detection lifecycle

The second main lifecycle is the detection lifecycle and it is probably the most important since the goal of a SOC is to detect anomalies.

This lifecycle defines the steps to go through to build and improve detection use cases, and the teams involved at each step. At any time, the state of any detection use case, and who is responsible for the current step, should be clear to everyone in the SOC or MSSP.

4.3.1. Detection engineering workflow

The engineering workflow will of course vary from one SOC or MSSP to another because it depends on a variety of factors, but the steps should more or less always follow the same pattern as shown in the diagram below.

Diagram of an example of detection engineering workflow

The light purple steps represented in the detection engineering workflow diagram are not actual detection engineering steps, because they deal with the initial idea or with interacting with a production SIEM, for example.
Every other step can — and it is strongly recommended that it should — be performed by the same team/people: the “detection specialists” mentioned earlier, who could work together in a “purple team”.

The details of the contents of a detection use case will be discussed later on but the workflow for its engineering should always start with an idea followed by research and simulation and end with a permanent monitoring of whether it works.
If and when it no longer does, either improve it if possible or retire it, because it is worse to have a malfunctioning detection rule in production than none at all: with the latter, it is at least clear to everyone that there is a coverage gap.

4.3.2. Workflow tests

There are 6 tests represented in the detection engineering workflow, all of which are mandatory to ensure that the detection use case works as intended.
It is strongly recommended that these tests be automated and the results only reviewed by a human in case of failure. More on the tools, including automation, in the next chapter.

Unit test
The unit tests imply that the detection engineering is done using a DevOps platform to help manage versions, automate testing, collaborate, etc. — this will be discussed in more detail later on.
Using a DevOps platform, unit tests are tests that are run with each commit — a commit being any file modification pushed to the server. These tests aim to check all modifications in order to control configuration integrity, catching common mistakes and inconsistencies between files, ensure minimum requirements are met and ultimately prevent the deployment of any misconfiguration.
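
As an illustration, here is a minimal sketch of such a unit test as it could run in the pipeline on each commit, assuming detection rules are stored as YAML files with a few mandatory fields. The directory layout and field names are assumptions, not a standard.

```python
# Minimal sketch of a commit-time unit test for detection rules kept as YAML files.
# The rules/ layout and the required field names are illustrative assumptions.
from pathlib import Path
import sys
import yaml  # PyYAML

REQUIRED_FIELDS = {"id", "title", "query", "severity", "response_procedure"}


def check_rules(rules_dir: str = "rules") -> list[str]:
    errors, seen_ids = [], {}
    for path in Path(rules_dir).glob("**/*.yml"):
        try:
            rule = yaml.safe_load(path.read_text())
        except yaml.YAMLError as exc:
            errors.append(f"{path}: invalid YAML ({exc})")
            continue
        missing = REQUIRED_FIELDS - set(rule or {})
        if missing:
            errors.append(f"{path}: missing fields {sorted(missing)}")
            continue
        if rule["id"] in seen_ids:
            errors.append(f"{path}: duplicate id {rule['id']} (also in {seen_ids[rule['id']]})")
        seen_ids[rule["id"]] = path
    return errors


if __name__ == "__main__":
    problems = check_rules()
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)  # non-zero exit blocks the deployment
```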

Pertinence test
A pertinence test checks that the detection rule is working as intended, i.e. it triggers when expected and only then and the alert it creates contains useful data.
This can be achieved by (re)playing a scenario, through scripts in a lab environment to have the devices create logs and submit them to the SIEM, through generated events made with an event generator or by reusing the logs stored from the initial behavior simulation.
Then, the detection rule should correctly detect every anomaly it was meant to detect and ignore the rest of the events. At this point, it is recommended to test the detection rule against more logs than only those which contain the anomaly to detect to ensure that there are no False Positives.
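
Below is a minimal sketch of a pertinence test replaying stored logs against a rule. For illustration the rule is written as a Python predicate and the log fields are made up; in a real setup the rule would be the SIEM query itself, run against a test index containing the replayed logs.

```python
# Minimal sketch of a pertinence test over replayed logs.
# The rule is an illustrative Python predicate standing in for a SIEM query.
def rule_suspicious_admin_logon(event: dict) -> bool:
    """Illustrative rule: admin account logon from outside the admin network."""
    return (
        event.get("event_type") == "logon"
        and event.get("user", "").startswith("adm_")
        and not event.get("src_ip", "").startswith("10.10.")
    )


def pertinence_test(rule, replayed_logs, expected_alert_ids):
    """Pass only if the rule fires on every expected event and nothing else."""
    fired = {e["id"] for e in replayed_logs if rule(e)}
    return fired == set(expected_alert_ids), fired


replayed_logs = [
    {"id": 1, "event_type": "logon", "user": "adm_alice", "src_ip": "203.0.113.7"},
    {"id": 2, "event_type": "logon", "user": "adm_alice", "src_ip": "10.10.3.4"},
    {"id": 3, "event_type": "logon", "user": "bob", "src_ip": "203.0.113.7"},
]

ok, fired = pertinence_test(rule_suspicious_admin_logon, replayed_logs, expected_alert_ids={1})
print("pertinence test:", "PASS" if ok else f"FAIL (fired on {fired})")
```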

Regression test
Regression test aims to confirm that the new version of the detection rule does not introduce any breaking change. A breaking change is a major modification which breaks compatibility with older versions. In the case of a detection rule, this translates into the newer version of the rule not detecting anomalies that it previously detected, effectively breaking (part of) the detection.
Sometimes a breaking change is wanted and expected after a complete overhaul of a malfunctioning detection rule for example. In that case the regression test would fail but the failure would be converted into a pass once the reason has been reviewed.
Since reducing the number of False Negatives is a higher priority for the SOC than reducing the number of False Positives, the regression test guarantees that the newer version of the detection rule detects at least as much as the older versions. There are two checks to perform to conduct a regression test:

  • Run the detection rule on at least the same set of events or logs as the previous version of the detection rule — meaning there can be more logs if the rule now detects other anomalies, but the same set as before must be present. Over that common log set, at least the same alerts must be triggered, proving that the newer version detects at least as much as the previous one.
  • The alerts triggered by the newer version must contain at least the same data in nature and labeling as older versions.

If either check fails then the regression test fails and it must be reviewed to understand the reason. If it was expected, then the failure is converted into a pass, otherwise there is some fixing to do.
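
Here is a minimal sketch of those two checks, assuming both rule versions have been run over the same stored log set and their alerts collected as dictionaries keyed by the triggering event; field and variable names are illustrative.

```python
# Minimal sketch of the two regression checks described above.
# Alerts are assumed to be dicts keyed by the id of the triggering event.
def regression_test(old_alerts: dict, new_alerts: dict) -> list[str]:
    failures = []
    # Check 1: every event detected by the old version is still detected.
    missing = set(old_alerts) - set(new_alerts)
    if missing:
        failures.append(f"events no longer detected: {sorted(missing)}")
    # Check 2: alerts keep at least the same data fields as before.
    for event_id, old_alert in old_alerts.items():
        new_alert = new_alerts.get(event_id, {})
        lost_fields = set(old_alert) - set(new_alert)
        if lost_fields:
            failures.append(f"event {event_id}: fields lost {sorted(lost_fields)}")
    return failures  # empty list == PASS (or a reviewed, expected breaking change)


old = {1: {"user": "adm_alice", "src_ip": "203.0.113.7", "severity": "high"}}
new = {
    1: {"user": "adm_alice", "src_ip": "203.0.113.7", "severity": "high", "mitre": "T1078"},
    4: {"user": "adm_carol", "src_ip": "198.51.100.9", "severity": "high", "mitre": "T1078"},
}
print(regression_test(old, new) or "regression test: PASS")
```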

Performance test
This test’s goal is to determine the resource usage of the detection rule: SIEM resources are limited and the impact of deploying the rule must be known in advance to determine whether it is worth it in a resource-constrained environment.
The relevant indicators to measure vary depending on the SIEM vendor, but the impact of the rule on resources must be known and a threshold should exist for this test which, if breached, fails the test. In that case, a review must be done to either optimize the resources used by the detection rule or document why the high resource consumption of the rule is acceptable.
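
A minimal sketch follows, using wall-clock time per batch of events as the measured indicator; this is only a stand-in for whatever the SIEM vendor actually exposes (search runtime, memory, skipped scheduled searches, …), and the threshold value is an assumption.

```python
# Minimal sketch of a performance test: wall-clock time over a batch of events,
# compared against an assumed threshold. Real indicators depend on the SIEM vendor.
import time


def performance_test(rule, events, threshold_seconds: float) -> tuple[bool, float]:
    start = time.perf_counter()
    for event in events:
        rule(event)
    elapsed = time.perf_counter() - start
    return elapsed <= threshold_seconds, elapsed


events = [{"event_type": "logon", "user": f"user{i}", "src_ip": "10.10.0.1"} for i in range(10_000)]
ok, elapsed = performance_test(
    lambda e: e["event_type"] == "logon" and e["user"].startswith("adm_"),
    events,
    threshold_seconds=0.5,
)
print(f"performance test: {'PASS' if ok else 'FAIL'} ({elapsed:.3f}s for 10k events)")
```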

Validation test
The validation test is the last step before declaring a detection use case ready for production. This test consists in simulating in the environment the behavior the detection rule should detect and checking in the SIEM that an alert has actually been fired by that detection rule proving that the logging, the collection, the parsing and the detection rule are all correctly configured.
This final test determines whether the detection rule delivered lives up to expectations.

Working detection test
The so-called working detection test pays in quality assurance what it lacks in originality. It is basically the same thing as the validation test but repeated (randomly) over time to ensure the system keeps working as expected.
This can be achieved by automating the previous validation test, which would be the better way, but it can also be done by injecting logs into the SIEM and checking that an alert is fired by the detection rule. The former is better because it also tests logging and collection but is not always possible; the latter “only” tests parsing and detection, but is easier to build.
Of course, if any working detection test fails, the matter must be addressed at once, because it means that that detection rule is out of order.
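
Here is a minimal sketch of a scheduled working detection test based on log injection, the second, easier-to-build option described above. The SIEM client interface (send_event, search_alerts) is hypothetical and stands in for the actual SIEM API; the marker field and the timeout are assumptions.

```python
# Minimal sketch of a working detection test: inject a synthetic, marked event,
# then poll for the resulting alert. The siem object and its send_event /
# search_alerts methods are hypothetical placeholders for the real SIEM API.
import time
import uuid


def working_detection_test(siem, rule_name: str, synthetic_event: dict, timeout_s: int = 300) -> bool:
    marker = str(uuid.uuid4())                 # unique marker to find the test alert
    synthetic_event["test_marker"] = marker
    siem.send_event(synthetic_event)           # hypothetical ingestion call
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        alerts = siem.search_alerts(rule=rule_name, marker=marker)  # hypothetical search call
        if alerts:
            return True                        # parsing + detection still work
        time.sleep(15)
    return False                               # out of order: raise it immediately
```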

4.3.3. Purple team

Since the goal of a SOC is to detect anomalies in the environment in order to respond to them afterwards, detection is the key for a SOC or MSSP to be relevant.
There have been and are more and more instances where purple teams are assembled to simulate attacks and improve detection (and prevention), which is the only way to be sure that the detection rules in place are not theoretical but actually work. These purple teams are usually temporary, lasting the predefined amount of time assigned to the mission. Moreover, they are assembled by pulling in some people with offensive skills to simulate attacks, some others with detection know-how to create and improve detection rules, and maybe a few more with incident response experience to make sure that the response process for the use case is correct.
The purple teams described above are simply wrong on so many levels:

  • Detection is the most important part of a SOC or MSSP, and the only way to ensure that the detection rules are not theoretical is to simulate the behavior — yet the purple team is temporary. The team in charge of quality assurance for the most important aspect of the MSSP is not a permanent team.
  • The people participating in the temporary team are yanked from their jobs with little to no training on the skills and context they lack (i.e. offensive, detection, incident response and even how a SOC works) and have to learn, as they go, to communicate with the others and to produce something that is usable by the others. The members of the team working on the most important part of the MSSP have trouble understanding each other.
    This is sadly not an exaggeration: the SOC is a complex environment with a very specific workflow, and that workflow is actually so different from other teams’ (i.e. pentest teams or CERT) that it is really hard to communicate properly without previous experience of the other jobs. To sum up this particular, crucial issue, one could say that pentest teams and CERTs mostly work with a micro view of operational security — i.e. they work directly with systems, from system to system, on a perimeter delimited by the engagement rules for the former and by the attack for the latter — whereas a SOC works with a macro view, having to cover the whole environment. Because of this, SOCs and MSSPs had to adapt to scale up, finding workarounds to perform detection and investigation at scale that would be straightforward on a single system.
  • It gets worse, as this temporary team composed of people having a hard time communicating with each other is not homogeneous: each person has one particular set of skills and cannot be replaced or helped in their tasks. Depending on the size of the team, each role or each member of the most important team is in itself a Single Point Of Failure (SPOF).
  • The worst part is that after all is said and done and the mission finishes, if someone in the SOC realizes that a mistake has been made on a detection rule improvement and the documentation, timeline, report or whatever is incomplete on that matter, there is no way for the SOC to fix it properly on its own. The only thing left is to hope that the person from whichever other team remembers or wrote down something that didn’t go into the report — let’s not even mention that there could be a do-over of the purple team for that issue because this is not happening, ever. Once the most important job for the SOC is done, the team splits and the SOC can only pray that everything is alright.
  • Finally, these purple teams often lack proper hierarchical attachment to any department. The direct consequence of this is that the product of the team, the detection use cases, may or may not be used in the production environment by the SOC or MSSP in the end and they surely escape any kind of controls or detection engineering workflow. The most important deliverables for the SOC may stay unused or be discarded entirely.

All of this could be comical if it were made up, alas, it’s not.

Knowing where classic purple teams fail, it is easy to imagine one that could be beneficial for the MSSP and lives up to its name. A purple team worth its money should:

  • Be permanent.
  • Have its members be willing to work with both a blueteam and a redteam approach and/or have some kind of experience on both sides.
  • Be made up of SOC/CERT skills and pentest/redteam skills.
  • Be part of the SOC or MSSP as everything it produces is for the SOC.
  • Work for the SOC or MSSP to improve the overall detection capabilities.
  • Have each of its members possess the blue/red skills mix to maximize efficiency and synergy.

The purple team must have access to a controlled environment, such as a lab, that it can start up at any time, in order to properly perform its R&D without limitations or risks for production. Ideally, to validate the findings, the scenarios should be played in the company’s or MSSP’s own pre-production or production environment with the same rules of engagement, and precautions, as a pentest or red team exercise.
The purple team can and should be given all missions pertaining to detection and some new missions to get the most benefits possible from the offensive skills:

  • Detection rules conception — from ideas to research and behavior simulation.
  • Detection rules implementation — from a bunch of logs and artifacts gathered from the simulation to an actual detection rule.
  • Detection rules testing — pertinence, regression, performance and validation tests either manually with offensive tooling or preferably with some automation.
  • Detection rules tuning, updates and upgrades — with regard to the evolution of detection and offensive capabilities.
  • Logging configuration best practices can be added here, since the purple team had to go through that to make the best detection rule possible.
  • Subjects related to the SOC data model (see next chapter) and even logs parsing.
  • Live detection and response testing — simulate attacks to test the SOC’s response capability and reliability and give actionable feedback to response specialists.

However, in order to stay relevant, the purple team needs to know precisely how the SOC response specialists (the job described in a previous table) and the redteam or pentest team work:

  • It stays updated on offensive TTPs by interacting with the redteam as this guarantees the pertinence of the simulated offensive behavior and therefore the detection use cases.
  • It takes into account all feedback from the SOC response specialists to maximize detection efficiency and optimize response procedures with every new or improved detection use case.

Another, complementary way to achieve this last point is to have core, permanent positions in the purple team, and some temporary positions. These temporary positions would be filled with people from other teams like CERT, redteam, SOC response specialists, etc, who would join the purple team for a few months. That way, the core members of the purple team would have their knowledge refreshed, and the team as a whole would stay up to date, with an interesting, beneficial mix of ideas and perspectives.

4.4. Incident lifecycle

The third main lifecycle that absolutely needs to be defined in a SOC or MSSP is the incident lifecycle. It is key, especially in an MSSP, to have any chance of delivering a stable, homogeneous response across customers and over time.

The incident lifecycle defines the response phases used by the SOC, the statuses an incident can have, the possible qualifications for an incident along with the conditions to meet to get there and the stakeholders involved at each step.

4.4.1. Detection result

There is a crucial point to discuss about incident qualification: the qualification always refers to the result of the detection.
Yes, this deserves emphasis, because it is the key to understanding incident qualification by a SOC and it is very, very commonly misinterpreted.
The reason for the misinterpretation is that, when presented with an incident once the investigation is done — or even later, once the remediation is done — it is only logical to want to qualify it with regard to the consequences: there was malicious activity with some impact, so it must be a True Positive; or, on the contrary, the actions were those of an administrator, so it is of course a False Positive. Both could not be more wrong.

Represented below is a confusion matrix applied to a SOC. The columns represent the results of detection, i.e. if an alert is fired then it is a “Positive” and if not, it is a “Negative”. The rows represent the expected result of detection, i.e. if there is an anomalous event or events it is “Positive” or “Negative” otherwise.

Confusion matrix applied to a SOC

There are 4 cells in the table, one for each of the possible results:

  • A True Positive (TP) is the qualification given to detection if an alert was created when there was indeed an anomaly to detect. This is what SOCs and MSSPs want to maximize because there is a security issue awaiting response.
  • A False Negative (FN) is the qualification given to detection that did not fire any alert even though there was an anomaly to detect. This is what SOCs and MSSPs fear the most and want to avoid at (almost) all cost, as it means there is a security issue awaiting response going undetected.
  • A False Positive (FP) is the qualification given to detection that fired an alert although there was no anomaly in the environment. This is not a real issue in itself, but it wastes the time of the analyst performing what turns out to be a needless response.
    However, if FPs are numerous, either in absolute numbers or relative to the TPs, serious issues can arise.
    A high absolute number of FPs means that a lot of resources are wasted, translating into a direct cost for the SOC or MSSP — that is, if there are enough analysts staffed; otherwise they are also overworked.
    A high relative number of FPs can be even worse because it cannot be solved by throwing money at the issue: it creates alert fatigue for the analysts, which translates into increased frustration and lowered morale, as they spend lots of time chasing ghosts. The worst part of alert fatigue is that it greatly increases the chance that even a senior, experienced and skilled analyst misses an actual anomaly during response, because he or she is simply used to never finding any.
  • A True Negative (TN) is the qualification given to detection that did not produce an alert when there was no anomaly. This would be the most common qualification if it were used.

In practice, a SOC or MSSP can hardly ever qualify detection as TN or FN. Although FNs may turn into widespread incidents or even a crisis that ends up in the news, the FN qualification can only be given after the incident response, when realizing that some detection rule should have detected such or such an event and did not. At that time, if the SOC still exists, nobody actually bothers to create an incident in the SIRP and qualify it as an FN — if that qualification even exists at all.

The SOC detects anomalies — the detection can take the form of an alert from a SIEM — and the triage step of the response phase is an investigation that determines the result of detection: if the detection worked as expected, the alert is a TP and can be transformed into an incident; otherwise the detection did not work as expected and the alert is an FP. This is precisely why the qualification always refers to the result of the detection.
To circle back to the previous examples of misinterpretation:

  • “There was a malicious activity with some impact so it must be a True Positive”, without knowledge of whether a detection picked up the anomaly in the first place, this scenario could as easily be an FN. Of course someone or something ended up alerting the SOC or CERT to respond but a call from an employee saying “Hello, all the computers on the floor went dark and none will boot up anymore” is not a detection but a mere acknowledgement of catastrophic failure.
  • “The actions were those of an administrator so it is of course a False Positive” is a statement that is wrong in all but a few very rare and specific cases. Either the detection did not fire any alert for this anomaly so it cannot be a “Positive”, period, or the detection did fire an alert because there was an anomaly so it worked and is therefore a TP. The fact that after investigation it comes to light that it was a legitimate action from an administrator does not change anything to the result of detection.

When explained, it becomes more logical to view qualifications this way: they are very often used to measure the detection efficiency of a SOC or MSSP, as they should be. Now imagine that the qualifications referred to the result of the investigation and the SOC were monitoring an environment with very little or no malicious activity: it could only ever get “bad grades” on those qualifications.
Since the environment monitored should not impact the detection efficiency of the SOC, the qualifications must always refer to the result of detection for these metrics to be effective.
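
To make this concrete, here is a minimal sketch of a per-rule efficiency metric (precision) derived from ticket qualifications — which only makes sense if those qualifications refer to the detection result. The ticket structure and rule names are made up for illustration.

```python
# Minimal sketch of a per-rule detection efficiency metric from ticket
# qualifications. Ticket structure and rule names are illustrative.
from collections import Counter

tickets = [
    {"rule": "suspicious_admin_logon", "qualification": "TP"},
    {"rule": "suspicious_admin_logon", "qualification": "TP"},
    {"rule": "suspicious_admin_logon", "qualification": "FP"},
    {"rule": "dns_tunneling", "qualification": "FP"},
    {"rule": "dns_tunneling", "qualification": "FP"},
]


def precision_per_rule(tickets):
    """Share of alerts per rule where the detection itself was correct (TP)."""
    counts = {}
    for ticket in tickets:
        counts.setdefault(ticket["rule"], Counter())[ticket["qualification"]] += 1
    return {
        rule: c["TP"] / (c["TP"] + c["FP"]) if (c["TP"] + c["FP"]) else None
        for rule, c in counts.items()
    }


print(precision_per_rule(tickets))
# {'suspicious_admin_logon': 0.6666666666666666, 'dns_tunneling': 0.0}
```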

4.4.2. Incident evolution

The timeline below represents an example of the phases that could be defined and applied for a SOC or MSSP. The most important point is that it exists and is used as reference whenever needed.

Example of timeline of “SOC actions” vs tools

In this example, the timeline in the middle marks when the key events occurred, while the top part shows the “SOC actions”, for lack of a better expression — the phases mentioned earlier — and the bottom part is a reminder of when some generic tools used by SOCs and MSSPs are relevant.

A few remarks on the timeline:

  • The first (and main) point addressed by a SOC is detection, which lasts from the time an event or anomaly happens until some tool (here a SIEM) fires an alert for it.
  • The response phase represented is a modified version of the SANS Incident Response Cycle that divides the response into two stages:
    > Investigation, comprising a Triage step — in which the analyst performs a primary investigation to determine whether the detection worked and is therefore a True Positive, in which case the alert becomes an incident, or did not work and is a False Positive — and an Identification (as per the SANS Incident Response Cycle terminology) or secondary investigation step to assess the compromised perimeter. This stage can be fully achieved without any active action in the environment if the investigation is performed with passively collected artifacts.
    > Remediation, grouping the four remaining steps of the SANS Incident Response Cycle. At this stage, active actions are performed in the environment to remedy the incident.
  • The sensor’s main use from a SOC point of view is to generate logs for the events it observes.
  • The SIEM is shown to be relevant from “Log created” because it is implied that the SIEM’s infrastructure is also used to collect and centralize the logs, but depending on the technological choices made, there could be other tool(s) involved.

4.4.3. Ticket status

In this subchapter, “ticket” is used to refer to “alert” or “incident” independently from the qualification.

Example of a ticket lifecycle diagram

The previous diagram is an example of a ticket lifecycle with the different statuses it can have. It is specifically intended for an MSSP as it mentions an interaction with the customer.

The detection is either qualified as a False Positive or a True Positive:

  • If it is an FP then something must be improved by adding a WhiteList (WL) entry or reviewing the detection rule in order to not get another unwanted detection such as this one.
  • If it is a TP then two things can happen:
    > After investigation, there was actually nothing malicious about the activity. It was a legitimate action and so the status could be updated to “True Positive Legitimate” (TPL) which basically means that this time there was nothing wrong and there may or may not be a way to determine it at detection time. If there is, a WL entry could be added but this should be done with caution as it can be hard to know for sure, without behavior analysis, that it is actually legitimate.
    > After investigation, malicious activity has been confirmed. The status could then be updated to “True Positive Malicious” (TPM) before moving on to incident response, starting with the Identification or secondary investigation step. This update only aims to make statuses clearer at a glance because TP / TPL only could create some doubts as to “what is a TP”.

The example diagram is a bit less straightforward than it could be because it takes into account misinterpretation of “TP” by the customer and analyst error.
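
As an illustration, here is a minimal sketch of these statuses and their allowed transitions. The exact status names and follow-up actions mirror the example above and would need to be adapted to the SIRP actually in use; the customer-misinterpretation and analyst-error paths of the diagram are left out for brevity.

```python
# Minimal sketch of ticket statuses and allowed transitions, following the
# example above. Status names and follow-up actions are illustrative.
ALLOWED_TRANSITIONS = {
    "NEW": {"IN_TRIAGE"},
    "IN_TRIAGE": {"FP", "TP"},
    "FP": {"CLOSED"},               # after a WhiteList entry or a rule review
    "TP": {"TPL", "TPM"},           # investigation decides legitimate vs malicious
    "TPL": {"CLOSED"},              # optionally with a cautious WhiteList entry
    "TPM": {"IN_RESPONSE"},         # move on to identification / secondary investigation
    "IN_RESPONSE": {"CLOSED"},
}


def transition(ticket: dict, new_status: str) -> dict:
    current = ticket["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_status}")
    ticket["status"] = new_status
    return ticket


ticket = {"id": "INC-2023-1042", "status": "NEW"}
for status in ("IN_TRIAGE", "TP", "TPM", "IN_RESPONSE", "CLOSED"):
    transition(ticket, status)
print(ticket)
```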

4.5. Training and monitoring

It is important to keep SOC or MSSP teams trained as a whole and ready for the worst, because when the worst comes, SOC members will not have time to think and elaborate big plans. There will be panic and confusion — this is extremely important and cannot be overstated — and nothing short of experience can really prepare someone for it.
However, having trained teams in which everyone knows their jobs and what they are supposed to do when, and not if, trouble comes around is the best that can be achieved to enable fast, accurate reactions.

On a brighter note, there are also ways to monitor the SOC environment and the user experience associated with it in order to preemptively allocate resources to stabilizing or fixing instabilities before a breakdown occurs or, on the contrary, if every indicator is green, to divert resources to expanding or developing new features.

4.5.1. Incident simulation

An incident here is defined as a severe security issue that requires immediate response from the SOC or MSSP, e.g. an ongoing attack caught early in which the adversary has yet to spread too much.
Typically an incident could involve a few response specialists and their team leader, depending on the severity.

Each analyst needs to be individually trained so they are skilled enough to perform their jobs correctly but more importantly in this case they also need to know how to work together as a team, which is a whole other story.
Even with technical training, processes, procedures, diagrams, a good knowledge base and whatnot, without proper training specifically on coordination, a bunch of analysts put together will not amount to proper teamwork. This is in fact much like sports teams.

Teamwork is therefore also a skill to train properly and this can be achieved by simulating incidents. The majority of simulations should be prepared and advertised in advance with a clear objective to improve teamwork in an emergency situation. However, once the team performs well on these exercises, it is advisable to throw in an unexpected simulation to check how the team reacts in what is for them a real scenario.
The “only” difference between both is psychological but this is what will actually matter when it comes down to it. If they think it is real then their reaction will be genuine and the simulation can create an experience for them, although it was but another exercise.

These simulations are of course technical but should also include any communication that would occur in a real scenario and especially any panic and/or pressure added by the customer (in case of an MSSP) or the management (for a SOC), if possible.

4.5.2. Crisis simulation

A crisis is defined here as a critical, widespread security incident already impacting production. For an MSSP, it could be one customer being the victim of such an incident, multiple customers having (for the moment) a less impactful incident — such as what can happen when a 0-day on a common product is exploited and there is no patch — or the MSSP’s own company being compromised.
What is described above is where the real fun starts, isn’t it?

This can clearly be a do or die situation for many companies, as reported too often in the news. There will be panic everywhere, the pressure on any stakeholder involved in the resolution (the SOC being one) will suddenly peak and stay at that level until the crisis ends, one way or the other.
If a SOC or MSSP goes through this without previous preparation or training, well… Let’s say the good news is that there are plenty of open jobs for SOC analysts so they won’t stay jobless for long.

The incident described before involved a few response specialists and their team leader, whereas a crisis such as this will involve the whole SOC — every single member — and then some, especially in upper management.
An incident is simulated at the analysts’ level with some help from SOC or MSSP management, and from a technical perspective a crisis differs almost only in what is at stake. Therefore, the simulation of the technical incident part could be more or less the same, with higher stakes.
The real difference lies in the panic of the management, customers or other employees, and it is in handling this that they, especially the former, must be trained. The most important part is that upper management understands its job in a situation like this: how and what to communicate to whom, who will be its (only) contact on the response side reporting the state of things, and what it can do to enable the best response possible.
At upper management levels, it is not realistic to simulate an unexpected crisis, therefore advertised ones will have to do. It is especially important to identify backups for key people in the management chain in case they are not available when the time comes.

4.5.3. Site Reliability Engineering

Site Reliability Engineering (SRE) is a concept that originated at Google in 2003 and has kept evolving since. The reference and most of the best resources can be found on the website https://sre.google/.
SRE aims to keep critical operations running whatever happens, relying on targeted prevention to minimize impacts and therefore reduce the overall direct and indirect costs.
The logic is that everything that is built is made for direct or indirect human usage, and any trouble or discontent the human component has in using it will result in added cost, in some way, for the builder. Therefore, it is more cost-efficient to measure and monitor the level of contentment and take corrective actions when it falls below a certain point than to wait for an SLA breach to try and fix things, as there may be an indirect cost of losing the user or the customer on top of the direct cost of the breach.

To achieve this, Service Level Indicators (SLIs) need to be put in place: these indicators are a quantifiable way to measure the reliability. There is real work to be done here, as most of the time, the reliability of a service is more of a qualitative feeling for its users than directly quantifiable and usable data.
Once the SLIs are defined, the Service Level Objectives (SLOs) can be set: if the SLIs fall to the level of the SLOs, then corrective actions must be taken.
Finally, monitoring the SLIs with regard to the SLOs enables active and optimized resource and budget allocation: more development and new features if the SLIs are OK, or more stability and fixing if the SLIs are dropping.
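
As a toy illustration, here is a minimal sketch of one possible SLI/SLO pair for a SOC or MSSP — the proportion of alerts whose triage started within 30 minutes, against a 99% objective. Both the SLI definition and the objective are assumptions chosen for the example, not recommendations.

```python
# Minimal sketch of an SLI/SLO check. The SLI (triage started within 30 minutes)
# and the 99% objective are illustrative assumptions.
def sli_triage_within(alerts, max_minutes=30):
    """SLI: fraction of alerts whose triage started within max_minutes."""
    if not alerts:
        return 1.0
    on_time = sum(1 for a in alerts if a["triage_delay_minutes"] <= max_minutes)
    return on_time / len(alerts)


def check_slo(sli_value, slo=0.99):
    if sli_value < slo:
        return f"SLO breached ({sli_value:.2%} < {slo:.0%}): shift effort to stability and fixing"
    return f"SLO met ({sli_value:.2%}): error budget available for new features"


alerts = [{"triage_delay_minutes": d} for d in (5, 12, 45, 8, 20, 3, 90, 15, 10, 7)]
print(check_slo(sli_triage_within(alerts)))
```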

Applying the SRE concept to a SOC or even an MSSP can be very complex but is well worth it in the end, especially for an MSSP. That being said, the SOC or MSSP must already be very mature for an implementation to be possible and for it to make sense.
As this is very technical, more of a case by case basis and explained infinitely better on the website mentioned above than it ever could be here, it won’t be discussed any further in this document.

4.6. Customer service

Customer service is an important subject for an MSSP but also for a SOC. It is defined here as all the relations a SOC has with the external parties to which it answers. These parties, be they customers (in the MSSP sense) or the upper management of the company owning the SOC, will always be the ones funding the SOC. It is therefore, to say the least, advisable to work with them to understand their needs and to make them understand the SOC’s constraints and issues.

Example of timeline of “SOC actions” vs time indicators

The diagram above is a modified version of the one used to talk about incident evolution; this one shows the most important time metrics for a SOC or MSSP and its customers. The total time spent represents the time spent by SOC analysts on the response and is actually a SOC internal metric.
However, the dwell time, TTI, TTN and TTR are important indicators for the SOC to measure its own performance, but also for its customers — some of these indicators are usually under SLAs. They should always be explained (graphically) to the customer — what they represent and how they are calculated — to make communication easier.
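
For illustration, here is a minimal sketch of how such indicators can be computed from an incident timeline. The exact definitions should follow the timeline diagram above; the ones assumed here (dwell time from event to detection, TTI/TTN/TTR measured from the moment the alert fired) are assumptions made for the example.

```python
# Minimal sketch of time indicators computed from an incident timeline.
# The definitions below are assumptions for illustration; the real ones
# should match the timeline diagram agreed upon with the customer.
from datetime import datetime

timeline = {
    "event": datetime(2023, 11, 20, 2, 14),
    "alert_fired": datetime(2023, 11, 20, 9, 2),
    "investigation_started": datetime(2023, 11, 20, 9, 10),
    "customer_notified": datetime(2023, 11, 20, 9, 40),
    "remediation_done": datetime(2023, 11, 20, 17, 25),
}

indicators = {
    "dwell_time": timeline["alert_fired"] - timeline["event"],
    "TTI": timeline["investigation_started"] - timeline["alert_fired"],
    "TTN": timeline["customer_notified"] - timeline["alert_fired"],
    "TTR": timeline["remediation_done"] - timeline["alert_fired"],
}

for name, delta in indicators.items():
    print(f"{name}: {delta}")  # candidates for SLA reporting to the customer
```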

There should be communication templates for emails, incidents and other regular committees — each of which needs to be presented and explained in detail to the customer to make sure everyone is on the same page and that there are no misunderstandings.
All templates should be standardized so that, for an MSSP, they are as similar as possible from one customer to another. They can then be automatically generated for the most part, e.g. Key Performance Indicators (KPIs), performance graphics and other task lists. The time people used to spend producing these can then go into the SOC or MSSP added value, such as governance and security advice.

In any case, communication with customers about what the SOC is doing is important, especially if “nothing is happening”, because a SOC is a complex environment almost opaque from the outside. Also, by nature the SOC is reactive so without active communication on the work being done, it is quite easy to conclude, when everything is fine, that there is no actual added value in a SOC or MSSP when it actually costs a lot — in other words that it is a bad investment.

4.7. Conclusion

There are many items that should be implemented, monitored and that can be improved for a SOC and even more for an MSSP. Their importance will vary depending on the current context of the SOC and its planned evolution, as will the priority of implementation.
However, they are all necessary to have and keep a pertinent, stable, efficient, reliable and robust SOC, which is what every SOC should try to be to consistently detect and/or thwart attacks.

Please do not underestimate the importance of items that are not directly related to the SOC’s missions and objectives, such as knowledge management or customer service: the former, acting as the SOC’s memory, helps ensure pertinence, and the latter is what will keep the budget from shrinking.

Summary
4. Operational aspects
4.1. Knowledge management
4.2. Infrastructure lifecycle
4.3. Detection lifecycle
4.4. Incident lifecycle
4.5. Training and monitoring
4.6. Customer service
4.7. Conclusion

Part 1: https://medium.com/@7rm1ef8/a-soc-mssp-guide-1-of-4-3f5450638a98
Part 2: https://medium.com/@7rm1ef8/a-soc-mssp-guide-2-of-4-f4fb93be2422
Part 4: https://medium.com/@7rm1ef8/a-soc-mssp-guide-4-of-4-a78779d830dd
