Vulnerability Management: You should know about EPSS

Ryan McGeehan
Starting Up Security
Oct 9, 2023


The Exploit Prediction Scoring System (EPSS) is great. You might like it, too, if you deal with large numbers of vulnerabilities.

The Hand-Wavy Explanation

The EPSS model spits out a probability of a CVE being exploited in the wild within 30 days. Give the model a CVE, and it returns the probability of exploitation over the next 30 days. That’s it, that’s all! We will discuss the details later.

Let’s take BlueKeep (CVE-2019-0708). EPSS suggests a 97.5% chance of being exploited in the wild within the next 30 days.

Most vulnerabilities (even those with CVSS High and Critical) are not exploited in the wild. This is the primary perspective offered by EPSS. The result is a risk model built on real-world proofs of concept and observed exploitation, rather than on how severe a vulnerability looks on paper.

Yes, your CVE may have a CVSS 10, but it may not be the type of vuln that sees real-world exploitation. EPSS predicts whether a vuln will be seen exploited in the wild.

This makes EPSS a fantastic way to prioritize vulnerabilities at scale quickly. Under a huge pile of CVSS 9–10 vulnerabilities, an additional dimension of EPSS lets you sort them again based on near-term exploitation risk.
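Here is a minimal sketch of that second sort, in Python. The Finding class, the prioritize function, the 9.0 cutoff, and the CVE IDs and scores are all made up for illustration; they are not part of EPSS or any tool.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    cve: str
    cvss: float  # CVSS base score, 0.0 to 10.0
    epss: float  # EPSS probability, 0.0 to 1.0


def prioritize(findings, min_cvss=9.0):
    """Keep findings at or above min_cvss, highest EPSS first."""
    severe = [f for f in findings if f.cvss >= min_cvss]
    return sorted(severe, key=lambda f: f.epss, reverse=True)


# Hypothetical findings: the IDs and scores are invented for this example.
findings = [
    Finding("CVE-0000-0001", 9.8, 0.02),
    Finding("CVE-0000-0002", 10.0, 0.91),
    Finding("CVE-0000-0003", 9.1, 0.40),
    Finding("CVE-0000-0004", 5.3, 0.95),
]

ranked = prioritize(findings)
for f in ranked:
    print(f"{f.cve}  CVSS {f.cvss:>4}  EPSS {f.epss:.0%}")
```

The pile of CVSS 9–10 findings comes out ordered by near-term exploitation risk. Whether a lower-severity finding with a high EPSS score (like the last one above) should also jump the queue is a policy decision this sketch does not make.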

Want to look one up yourself? Plug any CVE into this link (replace the CVE).

https://api.first.org/data/v1/epss?cve=CVE-2019-0708&pretty=true

Here is the example response. We care about epss, which is 0.975050000 (97.5%).

{
  "status": "OK",
  "status-code": 200,
  "version": "1.0",
  "access": "public",
  "total": 1,
  "offset": 0,
  "limit": 100,
  "data": [
    {
      "cve": "CVE-2019-0708",
      "epss": "0.975050000",
      "percentile": "0.999760000",
      "date": "2023-10-03"
    }
  ]
}
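If you would rather script the lookup, here is a rough sketch using only the Python standard library. The function names parse_epss and fetch_epss are my own, not part of any FIRST client, and the code assumes the response keeps the shape shown in the example response. The demo at the bottom parses a trimmed copy of that example, so it runs without network access.

```python
import json
import urllib.parse
import urllib.request

EPSS_API = "https://api.first.org/data/v1/epss"


def parse_epss(payload, cve):
    """Return the EPSS probability for `cve` as a float, or None if absent."""
    for row in payload.get("data", []):
        if row.get("cve") == cve:
            return float(row["epss"])  # the API returns scores as strings
    return None


def fetch_epss(cve, timeout=10.0):
    """Query the public endpoint for one CVE (requires network access)."""
    url = f"{EPSS_API}?{urllib.parse.urlencode({'cve': cve})}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_epss(json.load(resp), cve)


# Offline demo: a trimmed copy of the example response.
sample = json.loads(
    '{"status": "OK", "data": [{"cve": "CVE-2019-0708", '
    '"epss": "0.975050000", "percentile": "0.999760000", '
    '"date": "2023-10-03"}]}'
)
score = parse_epss(sample, "CVE-2019-0708")
print(f"{score:.1%}")
```

Calling fetch_epss("CVE-2019-0708") would hit the live API instead. Expect a different number than the one above, since scores are republished as the model and its data change.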

OK, where do these scores come from?

EPSS scores CVEs with a very different approach than CVSS. With CVSS, you can craft a score by hand. EPSS is fundamentally different. EPSS is statistical!

This means the maintainers of the underlying EPSS model are operationally responsible for ongoing tracking of which CVEs are exploited in the wild, exploit code availability, and so on. That exploitation data is compared with the characteristics of each CVE. A predictive model is born. There’s enough vulnerability and exploit data these days to do this well.

EPSS is positioned to continuously improve the model that predicts what characteristics of CVEs eventually result in exploitation. Those characteristics change all the time based on whatever new data is available.

This means an EPSS score can change daily based on improving information.

How can EPSS improve over time?

Because EPSS uses a probabilistic method, it is accountable for the forecasts it publishes. EPSS has error. You can judge it based on how often it makes mistakes. It can be held accountable for doing better or worse.

Here’s a baseball analogy:

CVSS is like: The Yankees get a 9/10 on the good-at-baseball scale.

Whether the Yankees win or lose a lot, you can’t say exactly how wrong this statement is. These something-out-of-ten types of scores are ordinal and are not accountable to any error.

EPSS is like: The Yankees have a 90% chance of winning this year's World Series.

This is scientific. When the Yankees win (or don’t win) the World Series, you can calculate error scientifically for that statement. You can also review all previous statements and scientifically observe their usefulness. This is done with Brier Scores and similar methods. You can hold it accountable for being right or wrong!

EPSS is similar. It uses a probabilistic statement that can actually be right or wrong.

Exploitation is observed within 30 days (Yes / No)

The EPSS documentation describes how they hold the model accountable for failure.
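To make "calculate error" concrete, here is a small sketch of a Brier score in Python. It is only an illustration of the general method mentioned above, not how the EPSS team evaluates its model, and the forecasts and outcomes are invented.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score; lower is better.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes))
    return total / len(forecasts)


# Hypothetical forecasts; outcome 1 means exploitation was observed in 30 days.
forecasts = [0.95, 0.10, 0.02, 0.60]
outcomes = [1, 0, 0, 1]

print(round(brier_score(forecasts, outcomes), 4))

# A confident forecast that turns out wrong is punished heavily.
print(round(brier_score([0.95], [0]), 4))
```

A model that says 95% and is right contributes almost nothing to the score, while the same forecast followed by no exploitation contributes close to the maximum penalty of 1.0. Averaged over thousands of CVEs, that number is what lets you say one model is better than another.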

What? Show me exactly how you can hold EPSS accountable!

Here’s an example.

You could immediately retrieve the BlueKeep EPSS score when it was announced. No publicly known in-the-wild exploitation existed then, but we all suspected it soon would. The CVSS score was high, but as we’ve said, this doesn’t mean anything because many vulns get a high score. A CVSS 10 alone does not predict in-the-wild exploitation. EPSS tries to do this statistically.

The EPSS score for BlueKeep before in-the-wild exploitation was observed was 95.2%, a highly confident prediction. If we didn’t see exploitation within 30 days, it would mean that EPSS would have been very wrong. But it wasn’t.

In-the-wild exploitation of BlueKeep did appear within the 30 days EPSS expected. That means the 95.2% prediction had a relatively low error.

This is an example of EPSS working as intended! But you can’t judge EPSS based on cherry-picking good and bad examples. Just as you can find a positive example, you can also find examples where EPSS was not confident, and a vulnerability was eventually exploited in the wild. Both cases are the wrong way to look at EPSS.

Instead, it needs to be looked at scientifically, like a weather model.

EPSS is best judged with a large volume, like weather models tracking the rain for years. Sometimes, weather forecasts are (very) wrong, but the overall error is scrutinized over long periods, and models are improved.

The security world has changed. Vulnerability exploitation is not an area of security that lacks data. There’s enough CVE / PoC / Exploit in-the-wild volume to do this right, and it will only get better over time.

What if the data that EPSS uses is bad?

If the data EPSS uses is bad, then EPSS will be a measurably lousy model. It won’t be a matter of debate. The statistical accountability of EPSS is enormously helpful for us because it means we don’t need to care about what datasets it uses!

Weather prediction is similar. I don’t need to go around interviewing meteorologists and kicking weather balloons to trust the daily forecast — NOAA retroactively looks at predictions and improves its models based on whether they are right. I, too, could review their forecasts and score them independently.

This is possible only because EPSS can be explicitly right or wrong. I don’t need to care what data EPSS uses because it has measurable performance.

EPSS is young, but early indicators show that the model is calibrated. This is a considerable test to pass with any predictive model. When a calibrated model says something is 90% likely to happen, it happens about 90% of the time. On a calibration chart, perfect calibration is a straight diagonal line going up and to the right, and EPSS is damn close.

EPSS Calibration from a presentation at FIRSTCON 2023
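If you want to check calibration yourself, the idea behind a chart like that can be sketched in a few lines of Python: bucket the forecasts by probability, then compare the average forecast in each bucket with how often exploitation actually happened. The function name calibration_table, the ten equal-width bins, and the data below are my assumptions for illustration, not the method used in the presentation.

```python
def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins.

    Returns (mean forecast, observed rate, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    rows = []
    for members in bins:
        if not members:
            continue  # nothing was forecast in this probability range
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(o for _, o in members) / len(members)
        rows.append((round(mean_p, 3), round(rate, 3), len(members)))
    return rows


# Hypothetical data: 20 forecasts of 5% and 10 forecasts of 90%.
forecasts = [0.05] * 20 + [0.9] * 10
outcomes = [1] + [0] * 19 + [1] * 9 + [0]

for mean_p, rate, count in calibration_table(forecasts, outcomes):
    print(f"forecast {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

In this made-up data, the observed rate matches the mean forecast in both buckets, which is what a point sitting on the diagonal means. Real data needs far more forecasts per bucket before the comparison says anything.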

EPSS can also attract competing models. This is what happens in meteorology. Many international models numerically predict rain, clouds, etc., and are scored against each other. They can even be combined into ensemble forecasts, too!

Numerous models can compete with EPSS because its output is so simple: the probability that a vulnerability will be exploited within 30 days. I love this as a way to measure vulnerability risk.

Why do I even need scoring?

If you need to ask this, you are lucky not to be triaging painfully large numbers of vulnerabilities. Scores like CVSS or EPSS don’t replace human analysis; they help decide which vulnerabilities get human analysis first.

Think of large volumes of automated tool output, medium-to-extreme scale bug bounty triage, federated GRC situations, etc. Deciding what to focus on can be painful, so help is valuable.

But scoring is not valuable once human analysis has come in. It’s usually disposed of by then. Scoring is just a way to manage things until a decision to perform analysis happens.

What about my non-CVE product vulnerabilities?

You’re out of luck. EPSS strictly focuses on published CVEs because of a data pipeline that has to protect assumptions about how scores will be used. So, producing an EPSS score for a first-party vuln that will never be issued a CVE is currently impossible. Hopefully, FIRST will have solutions for this in the future, which are possible to build but don’t exist yet.

In the meantime, you can try manually forecasting a score based on the EPSS frame of reference (“In the wild exploitation in 30 days”).

What about vulnerabilities and exploits we HAVEN’T seen in the wild?

How can EPSS scores be useful if huge volumes of vulnerabilities are being exploited without observation? Let’s discuss that.

The ratio of observed versus unobserved vulnerabilities and exploitations isn’t knowable. Still, they move from unknown to known through leaks, incident response, threat research, and vulnerability research.

EPSS makes an assumption that observed exploitation is the tip of an iceberg. A question is… How different is the reality below water?

EPSS assumes that the composition of the tip is indicative of the iceberg. This is a useful assumption, but the assumption runs into trouble with oddball vulnerabilities like Log4J. The EPSS folks openly discuss this as being difficult for EPSS. That’s what we want to see. We want to see owners being critical of their models, which is how models should improve over time.

Despite the weirdness of the vulnerability, Log4J still had a useful and high-risk initial EPSS score of 35%.

Here’s how the score dramatically increased as data sources rolled in with more daily information.

https://www.first.org/epss/articles/log4shell

If weird exploit trends are going on below water, we must also acknowledge that those trends are secret. Of course, neither blue teams nor EPSS will learn from them, and those trends would surprise everyone once revealed. So, EPSS is simply operating with best-effort predictions on objective facts.

If a vuln seems higher risk than its score suggests, manual analysis will take you further. That will remain true with EPSS or any other scoring method. Scoring of any kind can be helpful but will never replace analysis.

Ryan McGeehan writes about security on scrty.io
