Measuring a red team or penetration test.

Quantifying “success” after an “unsuccessful” red team.

If no valuable findings were discovered during a penetration test, was it valuable? How do you measure the value of a failure to find vulnerabilities?

Perhaps your team would have one of these opinions:

  • This security firm sucks. We have vulnerabilities and they didn’t find any.
  • Our security must be better than expected, because the security firm is amazing.

Let’s throw another twist into the situation: What if they did find serious issues? Your opinions could vary across the following, too:

  • We weren’t aware of these. This could be a problem.
  • These findings are serious. But, we detected the attack on their first day.
  • These findings aren’t serious. Regardless, I’m worried we didn’t detect it.

All of these are very important and nuanced opinions that could influence a mitigation roadmap. These nuances are harder to express on paper, and individual opinions are subject to pretty drastic bias.

These inferences are usually formed by the blue team, not the red team, which makes them extra meaningful, yet they are usually not captured.

So, how do you measure these important factors, without relying on a single person’s opinion of how it went?

Forecasting becomes an interesting tool to cut through the verbal opinions and measure the before and after metadata of an offensive engagement, to store as data for the longer term.

This method should be simple to experiment with. I’m hoping to sneak this into red teams this year, but these are my thoughts so far.

I have more notes on a forecast-centric risk analysis method that I am toying with, if you require more explanation than is available here.

How do I capture this information?

The previously mentioned opinions are usually verbally expressed in a briefing after an offensive engagement.

They are discussed by the “blue team”, and unless every participant expresses their opinions on each facet of the engagement, these sentiments could be lost or deeply biased by a single person’s interpretation.

I want you to consider a numeric forecast before, and after, an offensive engagement. Let’s discuss designing an offensive engagement that pairs well with a forecastable scenario and outcomes.

Scope the forecasts tightly with the engagement’s scope.

You should already have a reason for the offensive engagement. Perhaps you need to understand lateral movement from one network to another, or flesh out a certain class of vulnerability near a sensitive database, or gauge the quality of your detection mechanisms.

You should also have a great understanding of what “fair game” is, and what methods the offensive team will be employing. You may eventually develop scenario-based forecasts similar to the following:

  • The Red Team will obtain a shell on the production network.
  • The Red Team will be detected.
  • The penetration test will discover an exploitable SQLi.
  • The Red Team will obtain Domain Admin.
  • CERT will discover the “root cause” that began the assessment.

You can split these out into Yes/No forecasts, or into sets of results like “Within a day, two days, more than three days”, based on your team’s goals.

Select and train a diverse group of forecasters.

I discuss this in “Killing Chicken Little”. In general, you want your forecasters to have a little bit of practice and to be very intentional when forecasting. A little bit of training goes a long way.

An example panel forecast for “The Red Team will be detected”:

  • 40% Red team will be detected within an hour.
  • 15% Red team will be detected in more than an hour.
  • 35% Red team won’t be detected.

The above forecast can be read as “I think they’re likely to be caught in the first hour, or not at all. It’s unlikely we’ll find this after an hour or so.”

Of course, this example forecast makes some assumptions about when the red team begins “detectable” actions, and could be improved.
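A panel forecast like the one above can be kept as plain data. The following is a minimal sketch, assuming each forecaster gives one percentage per outcome, that the outcomes are mutually exclusive, and that the panel figure is a simple unweighted average. The forecaster names and their numbers are invented for illustration; they are not data from a real panel.

```python
from statistics import mean

# Outcome buckets taken from the example panel forecast.
OUTCOMES = ("within an hour", "more than an hour", "not detected")

# Hypothetical individual forecasts in percent, one number per outcome.
# Names and figures are invented for illustration only.
PANEL = {
    "forecaster_a": (50, 10, 40),
    "forecaster_b": (40, 20, 40),
    "forecaster_c": (45, 15, 40),
}


def validate(forecast, tolerance=0.5):
    """Raise ValueError unless the forecast covers every outcome and totals 100%."""
    if len(forecast) != len(OUTCOMES):
        raise ValueError(f"expected {len(OUTCOMES)} numbers, got {len(forecast)}")
    total = sum(forecast)
    if abs(total - 100) > tolerance:
        raise ValueError(f"forecast totals {total}%, expected 100%")


def aggregate(panel):
    """Average each outcome across forecasters (simple, unweighted)."""
    for forecast in panel.values():
        validate(forecast)
    columns = zip(*panel.values())  # one column of numbers per outcome
    return dict(zip(OUTCOMES, (mean(column) for column in columns)))


if __name__ == "__main__":
    for outcome, percent in aggregate(PANEL).items():
        print(f"{percent:.0f}% {outcome}")
```

The `validate` step is there because the buckets only mean something if they cover every possibility. Run against the sample figures above (40, 15, 35), it would flag that 10% is left unassigned.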

Run your offensive engagement, and repeat a forecast.

If your scenario has a clear and measurable outcome, then your team will be able to anticipate the results and compare their forecasts with reality afterward.
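One way to compare a forecast with reality is a scoring rule. The sketch below uses the Brier score, which is my choice for illustration and not something this essay prescribes. It assumes probabilities are written as fractions between 0 and 1, and the forecast figures are hypothetical.

```python
def brier_score(forecast, actual):
    """Multi-outcome Brier score for one forecast.

    forecast: dict mapping each outcome to a probability (0.0 to 1.0).
    actual:   the outcome that really happened.
    Returns the sum of squared errors: 0.0 is perfect, 2.0 is the worst case.
    """
    if actual not in forecast:
        raise KeyError(f"unknown outcome: {actual!r}")
    return sum(
        (probability - (1.0 if outcome == actual else 0.0)) ** 2
        for outcome, probability in forecast.items()
    )


# Hypothetical panel forecast made before the engagement.
before = {"within an hour": 0.45, "more than an hour": 0.15, "not detected": 0.40}

if __name__ == "__main__":
    # Suppose the red team was in fact detected within an hour.
    print(round(brier_score(before, "within an hour"), 3))
```

Lower is better. Scores collected across several engagements would give a rough sense of how well calibrated the panel is, which is the kind of longer-term data that verbal debriefs do not leave behind.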

Having a measured outcome puts you in a position to run a hypothetical forecast after the engagement, one that represents the new information you have. For example:

A red team will obtain a shell on the prod cluster from the development network.

If this forecast changes drastically between before and after, it will speak loudly about sentiment toward the firm you hired, and toward your own security. If the team failed, but the consensus believes that others would succeed, then you have captured sentiment that the security firm was sub-par, or that the engagement was scoped to be uninformative.

If the red team failed, and this follow-up forecast comes in very pessimistic about future engagements succeeding, then you have measured an increase in confidence based on the failed result. For example, you can state that your risk has decreased, your confidence has increased, and the failed engagement had some value for future decision making and prioritization around the risks you cared about.

Here’s a rhetorical example with our sample forecast. Let’s pretend that the red team was detected in the engagement. In the follow-up forecast, the panel would be more confident in the detections surrounding this red team’s scope for a future engagement with a different firm.

You have measured a difference that occurred as a result of your red team. You now have a measurable appreciation for your security that didn’t exist before.

Confidence has increased that the red team would be detected in an hour, decreased that they won’t be caught at all, and remained neutral on detecting them long term (more than an hour).
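The shift described above can be computed directly once both forecasts are stored. This is a minimal sketch; the before and after percentages are hypothetical, chosen only to match the direction of change described (up for detection within an hour, flat for more than an hour, down for no detection).

```python
# Hypothetical panel forecasts in percent, before and after the engagement.
before = {"within an hour": 45, "more than an hour": 15, "not detected": 40}
after = {"within an hour": 65, "more than an hour": 15, "not detected": 20}


def shift(before, after):
    """Return the change in percentage points for each outcome."""
    if before.keys() != after.keys():
        raise ValueError("forecasts must cover the same outcomes")
    return {outcome: after[outcome] - before[outcome] for outcome in before}


if __name__ == "__main__":
    for outcome, delta in shift(before, after).items():
        print(f"{outcome}: {delta:+d} points")
```

Stored next to the engagement report, these deltas record how much the exercise moved the team's beliefs, independent of any one person's retelling.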

If the observable red team was successful / undetected, these numbers would surely fall into the “Won’t be detected” bucket, and your outlook would be measurably bleak. A measurable lack of faith in your security is just as valuable as the opposite.


I have long felt that security teams misunderstand the value of offensive exercises, and it can be hard to capture some of the “softer” areas of value they provide without having some method to measure them.

I am spending some time this year (2018) looking for quantitative methods that measure risk and security team performance. This essay is to serve as a proposal for any red team exercise I get involved with, in an effort to capture all of the obscure value that a red team can provide.

I suppose this topic might reveal useful methods for tabletop exercises as well, proving the value of a discussion numerically among participants.

Please contact me, write a blog post, or point me towards any public discussion of using forecasting before and after offensive work. I’d love to hear about it.

Ryan McGeehan writes about security on Medium.


