Killing “Chicken Little”: Measure and eliminate risk through forecasting.

View the “Risk Forecasting” presentation on GitHub.

Ryan McGeehan
Starting Up Security

--

We all rely on intuition when we lack decision making data.

This is the basis for the arguments we make against maturity models, prescriptive standards, and compliance demands. No one wants their time consumed by ‘checkbox security’ while more tangible risks remain unaddressed. We value our intuition about risk, especially as our experience and expertise compound.

Unfortunately, research shows that our intuition is highly problematic. We’ve all been overconfident at some point, and research in other disciplines has repeatedly uncovered that overconfidence.

Findings from “Expert Political Judgment” research

How can we manage risk effectively while still making room for our own intuition?

Sometimes we have clear decision making data.

Imagine for a second, that your boss comes along and says:

Hey, will we see an RCE bug discovered this year on our app?

For a question like this, it’s always best to simply rely on data. You’d refer to all of the RCEs that were discovered and fixed, based on your own hard data.

  • 2015: Eight
  • 2016: Five
  • 2017: One

A very simple statistical model or forecast will be useful here without much debate. Most people would be confident that the answer will land somewhere around 0–10 findings in 2018.
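For illustration, here is a minimal sketch of the kind of “very simple statistical model” described above. The yearly counts come from the example; modeling next year as a Poisson draw with the historical mean as its rate is purely an assumption made for this sketch.

```python
# A minimal sketch of a very simple statistical model for next year's findings.
# The yearly counts are from the example above; the Poisson assumption is illustrative.
from math import exp, factorial

yearly_rce_findings = {2015: 8, 2016: 5, 2017: 1}

# Use the historical mean as the expected rate for 2018.
rate = sum(yearly_rce_findings.values()) / len(yearly_rce_findings)

# Probability that 2018 lands in the 0-10 range most people would guess.
p_0_to_10 = sum(exp(-rate) * rate**k / factorial(k) for k in range(11))
print(f"Expected findings: {rate:.1f}, P(0-10 findings in 2018): {p_0_to_10:.2f}")
```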

But what if your boss says:

“Will we be breached by an RCE this year?”

This has always been the harder question, and some methods exist to provide an answer.

We’re going to forecast it.

In this case you don’t have years of previous identical breaches to apply statistically. Instead, you’ll instantly begin forecasting an answer, whether you like it or not.

The problem is that humans are unreliable at making forecasts. This has been reproduced in a variety of studies showing that even expert forecasts can be pretty useless.

Most of these studies track the Brier score of an individual, a simple calculation that exposes how much we can trust their forecasts when they’re held accountable over time.
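For the curious, here is a minimal sketch of that calculation in Python, using the common two-outcome form of the score (0 is perfect, higher is worse); the track record in the example is hypothetical.

```python
# A minimal sketch of the (two-outcome) Brier score: the mean squared difference
# between stated probabilities and what actually happened. 0 is perfect; higher is worse.
def brier_score(forecasts, outcomes):
    """forecasts: probabilities in [0, 1]; outcomes: 1 if the event happened, else 0."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical track record: three "90% sure" claims, one of which didn't pan out.
print(brier_score([0.9, 0.9, 0.9], [1, 1, 0]))  # ~0.28, and the overconfidence shows
```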

Take this test, and you’ll very quickly understand the Brier score. It’s a measure of how much we should trust you when you claim you’re certain about something.

An example graph of someone who can’t be totally trusted (myself)

In other words, the Brier score can precisely measure whether you’re a Chicken Little.

That’s someone who is wildly overconfident about their claims and rarely held accountable for it afterwards.

Observing the Brier scores of individual experts across many extensive studies suggests that experts might not be good at forecasting. Additionally, a good forecaster might not even be an expert. Isn’t that strange? This is the basis of Philip E. Tetlock’s research, which is summarized in Superforecasting.

But what if you have an expert forecaster?

This contrasts brilliantly with Tetlock’s previous research, which helps us observe a person’s expertise and decouple it from their ability to forecast a future event.

More recent research (2016) shows that effective forecasting is not only possible; when it is effectively structured, it improves. That structure revolves around knowing, improving, and maintaining someone’s Brier score through calibration.

When these conditions are met, you may have elevated some individuals to a place where their high-confidence forecasts are valuable. In other words, when such an individual is measurably confident that something will happen, it probably will. Groups of calibrated individuals are shown to be even better.

When one of these individuals expresses 90% confidence, believe them!

Unfortunately, we aren’t all walking around with a Brier score floating above our heads.

Be careful not to mistake this for fortune telling or predicting the future with certainty. To be clear, if the mystical fortune teller next to a pawn shop has a great Brier score, it’s very likely that they’ll say “your future is bleak, but, well, maybe, kinda sorta, depends, could go either way, I guess” on a regular basis. Someone who is 90% confident will still be wrong 10% of the time, and they won’t express that confidence on issues they don’t know enough about.

In short: A fortune teller with a great Brier score will have a horrible Yelp review.

So how do we answer your boss?

Will you be breached next year? Hurry up and answer!

We can assume your boss is asking you to express high confidence (~90%) in your answer. Using an ideal combination of known research in decision science, we can arrive at an answer with the following steps:

  1. First, we get together a group of your best-informed coworkers who have tuned their ability to measure their own confidence through training and a barrage of tests that expose their Brier scores to themselves (link).
  2. We allow the group to decompose the question into smaller problems, seek and gather data, and explore questions of threat, impact, vulnerability, etc. Seeking “outside” data from across the industry as well as your own “inside” data is most helpful.
  3. The group makes initial forecasts on the threat and gets a chance to straighten out misunderstandings of the problem amongst themselves, reining in massive outliers or differing interpretations of the threat.
  4. Average the forecasts together, which should smooth out individual bias and anomalies (a minimal sketch follows this list).
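As a rough illustration of step 4, here is a minimal sketch of the aggregation; the individual probabilities below are hypothetical.

```python
# A minimal sketch of step 4: averaging independent forecasts from a calibrated group.
# Each number is one person's P(breach via RCE in the next 365 days); all are hypothetical.
group_forecasts = [0.30, 0.20, 0.25, 0.15, 0.35]

group_estimate = sum(group_forecasts) / len(group_forecasts)
print(f"Group forecast: {group_estimate:.0%}")  # -> 25%
```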

As a result, we may get an answer like:

We think there’s a 25% chance that an attacker will access a production system by exploiting a RCE vulnerability in the next 365 days.

This means the group sees the breach as less likely than a coin flip (50%, complete uncertainty), but is not willing to say it will never happen.

Any new information provided to the group would raise or lower that estimate accordingly.

You can even propose theoretical mitigations and forecast the result to measure the size of the probabilistic reduction.
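A minimal sketch of that comparison, with purely hypothetical numbers:

```python
# A minimal sketch of measuring a proposed mitigation: the group re-forecasts the same
# question as if the mitigation were already in place; the delta is the estimated reduction.
# Both probabilities are hypothetical.
baseline = 0.25          # today's group forecast: P(breach via RCE in 365 days)
with_mitigation = 0.10   # re-forecast assuming the proposed control ships

print(f"Forecasted risk reduction: {baseline - with_mitigation:.0%}")  # -> 15%
```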

This is where I think some people will scoff at this method and close the browser tab. It’s important not to, and remember, there’s a whole lot of research around forecasting that should be taken seriously.

“I’m skeptical.”

Me too. If we are not skeptical of this method, we’ll just accept it like our industry has accepted many “High / Medium / Low” frameworks that aren’t scrutinized or tested whatsoever.

Forecasting methods, however, are backed by established research and measurement methods, and are used extensively outside of security for different purposes. We just need to discover direct applications of these lessons to our work.

What I’ve described is roughly the same forecasting method used by individuals in the Good Judgment Project. This was a competition between trained forecasters and intelligence officers, funded by IARPA. Here’s a summary of the result:

Superforecaster predictions are reportedly 30 percent better than intelligence officers with access to actual classified information.

Forecasters won the competition with access to Google.

More recent research shows that strong Brier scores among national security intelligence experts “offer cause for tempered optimism about the accuracy of strategic intelligence forecasts”.

It’s possible to kill Chicken Little. We can obtain nuanced opinions on the future that don’t sow FUD. We can improve our Brier scores as professionals and track improvements to the risks we point ourselves at.

First, we should point ourselves towards a larger question about how we accomplish risk management effectively, and how we can kill the Chicken Little that lives within all of us.

Here is that question:

How do we forecast risk within an organization?

Like I said, I’m skeptical. But I’m also hopeful. Here are areas I struggle with.

I am finding opportunities whenever possible to address the following challenges I’ve collected, and this is the extent of my exploration on this subject so far.

  • We misuse “Black Swans”: I think most security incidents are not black swans and are very much forecastable. However, our industry is fascinated with actual Black Swan incidents and distracted by them. I don’t think we can forecast a Stuxnet, Aurora, Snowden, or Fancy Bear. I do think we can forecast an engineer copy-pasting root credentials to pastebin.
  • Grading your own homework: If you’re forecasting a reality that you yourself are involved in preventing, you may bias towards ever-dwindling probabilities because your raise or promotion depends on it. Maybe forecasts should be explicitly decoupled from performance.
  • Cost of measurement: It seems prohibitive, especially for small teams and startups, to obtain the benefits of group forecasting. Similarly, tracking outcomes over time carries its own expense and burden. Maybe tooling or online training can drive these costs down.
  • Entrenched Policies: It may simply be a long time before most security organizations can experiment with probabilistic methods of risk management. Until then, it will be hard to obtain “outside” data from other organizations. Maybe faster-moving or younger companies will experiment earlier.
  • Lack of Calibration: I imagine it will be tough to have tight feedback loops where we are confronted with the outcomes of our predictions. This is a crucial aspect of improving Brier scores. Maybe we can refine traditional security metrics to act as a stand-in for forecasting, or forecast industry incidents instead, to calibrate employees.
  • Only Trends Matter Anyway: The probability of a risk taking place may only matter to risk management if measurable increases and decreases are taking place. It may only matter if floors or ceilings are discovered in these trends. Maybe we stop measuring only when we hit diminishing returns on our investment. Maybe this is fine? Maybe this is the whole point?
  • Security Talent: When we hire talented individuals, they may not be able to apply their skills towards mitigating the most important risks. We may only be able to tackle risks based on the talent and tooling that is accessible to us. This method may drift way too sharply towards risks we simply can’t mitigate. Maybe we shouldn’t be measuring areas we can’t solve to begin with?

This is not an exhaustive list of problems, but I think these can be overcome with more thought and some experimentation. I am far from calling these fundamental problems; they are simply risks to forecasting itself.

Conclusion

It’s time for practitioners of our discipline to recognize that rigid methods of risk management are not flexible enough to prioritize and tackle complex breach scenarios. We are losing, and we need to start taking risks with our processes.

We need flexible, powerful, and cost-efficient measurement methods to attack problems with. We need more experimentation with new methods and more transparency into their results. I’m excited about probabilistic methods and how forecasting fits into them.

Ryan McGeehan writes about security on Medium.

Appendix

There is a whole world of established research behind these methods, and below is my reading list on the subject. Again, How to Measure Anything is a great starting point that covers many of these topics similarly.

“Subjects trained to apply probabilistic reasoning principles will be more accurate than controls” — (Link)

“Forecasters were often only slightly more accurate than chance, and usually lost to simple extrapolation algorithms” — (Link and Book)

“The findings offer cause for tempered optimism about the accuracy of strategic intelligence forecasts and indicate that intelligence producers aim to promote informativeness while avoiding overstatement” — (Link and Blog and Journalism)

The value of group forecasting — (Book and Book and Study)

Decomposition improves estimation — (Link)

“The analysis favored mechanical modes of combination and caused a considerable stir amongst clinicians.” Predictions vs. basic statistical models — (Link and Link)

Nobel prize-winning research (now a book) on why we make horrible decisions — (Link)
