140T+ Emails/Yr: The Broken Promise of ML in Phishing Protection

Rodney Gainous Jr.
The Best Publication
4 min read · Mar 3, 2024

--

Disclaimer and Focus:

We value machine learning (ML) technology for its role in combating cyber threats, but we aim to highlight the limitations of overly optimistic claims about its capabilities. This effort is not about casting aspersions but about advocating for a realistic understanding of what ML can and cannot do in cybersecurity. Overstating ML’s effectiveness can create a false sense of security, undermining our preparedness for actual cyber threats. It’s crucial to engage in honest discussions about the capabilities and limits of technology to ensure genuine protection against cyber risks.

This article specifically focuses on the challenges ML faces in phishing detection, offering a critical view on current cybersecurity practices without delving deeply into alternative solutions or ML’s positive contributions, which may be explored in future discussions.

Opening Statement:

Let’s cut to the chase — Machine learning (ML) isn’t going to solve the phishing and scam problem. No matter how smart we make these systems, they can’t fully tackle the cunning and ever-changing tactics of scammers. This isn’t about dismissing ML’s strides in cybersecurity; it’s about facing the hard truth that we can’t tech our way out of human deception.

This article dives straight into the heart of the issue — why leaning too hard on ML leaves gaps that scammers happily exploit, and why false positives are more than just a minor nuisance.

We’re talking about a bigger game here, one that requires more than algorithms to win.

While machine learning models offer promise for sifting through massive amounts of email data to detect phishing attempts, the immense volume of daily email traffic poses serious challenges. In 2023 alone, over 347 billion emails were sent per day. And this number is projected to surpass 408 billion emails per day by 2027, driven by factors like a growing user base, effective email marketing, and increased engagement.

Given email’s continued growth, even a 99.9% accuracy rate in an ML phishing filter would allow millions of potentially harmful phishing emails through on a daily basis. This drives home the need for a multi-layered security approach rather than reliance on ML alone.
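That claim is easy to verify with back-of-the-envelope arithmetic. The daily email volume comes from the article; the share of traffic that is phishing is an illustrative assumption, not a measured figure:

```python
# Back-of-the-envelope check: even a 99.9% accurate filter leaks millions.
DAILY_EMAILS = 347_000_000_000   # ~347 billion emails/day (2023 figure above)
PHISHING_RATE = 0.01             # hypothetical: assume ~1% of mail is phishing
DETECTION_ACCURACY = 0.999       # the filter catches 99.9% of phishing

phishing_per_day = DAILY_EMAILS * PHISHING_RATE        # ~3.47 billion
missed_per_day = phishing_per_day * (1 - DETECTION_ACCURACY)

print(f"{missed_per_day:,.0f} phishing emails slip through per day")
# roughly 3.5 million per day under these assumptions
```

Vary the assumed phishing rate by an order of magnitude in either direction and the conclusion holds: at this scale, a 0.1% miss rate still means millions of malicious messages reaching inboxes.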

Machine learning models fundamentally rely on patterns and data from the past to make predictions or classifications. Sophisticated attackers are aware of this and often devise strategies to evade or fool these models by constantly changing their tactics (e.g., mutating phishing email structures, using zero-day exploits, or simulating benign behavior). The following points illustrate why ML alone has limitations in this domain:

1. Adversarial Attacks:

Attackers can employ adversarial techniques to craft inputs that ML models misclassify. In the context of phishing emails, this might involve crafting messages that appear legitimate to the algorithm but contain malicious content or intent.
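One classic evasion technique is homoglyph substitution: swapping Latin letters for visually identical Unicode characters so the text reads the same to a human but no longer matches what the detector learned. The filter below is a deliberately naive, hypothetical sketch, not a real product, but the evasion principle applies to feature-based models as well:

```python
# Sketch: a naive keyword filter defeated by homoglyph substitution.
SUSPICIOUS_KEYWORDS = {"verify", "password", "urgent"}

def naive_filter(text: str) -> bool:
    """Flag the message if any suspicious keyword appears verbatim."""
    lowered = text.lower()
    return any(word in lowered for word in SUSPICIOUS_KEYWORDS)

original = "Urgent: verify your password now"
# Attacker swaps Latin 'e'/'o' for visually identical Cyrillic letters
# ('е' U+0435, 'о' U+043E) — a human reads the same message, the filter doesn't.
evasion = original.replace("e", "\u0435").replace("o", "\u043e")

print(naive_filter(original))  # True  — caught
print(naive_filter(evasion))   # False — slips through, unchanged to the eye
```

Real-world attackers apply the same idea at the feature level: perturbing headers, URLs, and wording until the message crosses to the benign side of the model's decision boundary.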

2. Data Dependency:

ML models are only as good as the data they are trained on. Attackers can exploit this by ensuring that the data needed to detect their new techniques is not present in the training sets.
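A toy model makes the dependency concrete. The "training data" and messages below are entirely hypothetical; the point is that a detector scoring messages against historical phishing vocabulary is blind to a tactic whose vocabulary it has never seen:

```python
# Sketch (toy model, hypothetical data): a frequency-based detector only
# "knows" the tactics present in its training set.
from collections import Counter

TRAINING_PHISH = [
    "your account is suspended click here to verify",
    "urgent verify your password immediately",
]

# "Model": counts of words seen in historical phishing mail.
phish_vocab = Counter(w for msg in TRAINING_PHISH for w in msg.split())

def phish_score(text: str) -> int:
    """Score = total training-set frequency of the message's words."""
    return sum(phish_vocab[w] for w in text.split())

known_style = "verify your suspended account"
novel_style = "invoice attached for recent purchase"  # new tactic, unseen words

print(phish_score(known_style))  # positive — resembles the training data
print(phish_score(novel_style))  # 0 — the novel tactic is invisible
```

Production models are far more sophisticated, but the structural weakness is the same: a genuinely novel campaign contributes nothing to the features the model was fit on.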

3. False Positives and Negatives:

Even with a high accuracy rate, the scale of email traffic means that a small percentage of errors translates into a large volume of security breaches. An error rate acceptable in other domains can be intolerable in cybersecurity, given the high stakes involved.
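False positives scale the same way false negatives do. The daily volume comes from the article; the legitimate share and false-positive rate are illustrative assumptions:

```python
# Sketch: the false-positive burden of a "small" error rate at email scale.
DAILY_EMAILS = 347_000_000_000   # ~347 billion emails/day (2023 figure above)
LEGITIMATE_SHARE = 0.99          # hypothetical: assume 99% of mail is legitimate
FALSE_POSITIVE_RATE = 0.001      # 0.1% of legitimate mail wrongly flagged

legit_per_day = DAILY_EMAILS * LEGITIMATE_SHARE
wrongly_flagged = legit_per_day * FALSE_POSITIVE_RATE

print(f"{wrongly_flagged:,.0f} legitimate emails misflagged per day")
# hundreds of millions of real messages diverted to spam folders daily
```

A filter that quietly buries invoices, password resets, and medical results is a failure mode of its own, which is why the error budget in this domain is so unforgiving in both directions.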

4. Generalization vs. Specificity:

While ML models may generalize well, cyber threats are often highly specific and targeted, requiring bespoke analysis that a general model might not be adept at handling.

5. Lack of Explainability:

Many complex ML models, especially deep learning models, operate as black boxes, making it difficult to understand why a certain input was classified in a particular way. This opaqueness can be a liability in a security context where understanding the “why” behind a decision is critical for trust and effective response.

In light of these challenges, it’s clear that relying solely on ML for phishing detection at scale is not a silver bullet. Advanced and more comprehensive solutions need to be developed.

The real message here is about the inherent limitations of ML models in keeping up with the evolving tactics of cyber attackers. ML models, as currently utilized, fundamentally rely on historical data to predict or identify threats. This characteristic makes them vulnerable to novel attacks, including those that have not yet been encountered in the training data. The article’s emphasis on adversarial attacks, data dependency, and the challenge of generalization versus specificity highlights the cat-and-mouse nature of cybersecurity, where attackers continually evolve their strategies to evade detection.

ML has been marketed as a panacea for cybersecurity threats, but the reality is more nuanced. ML can be effective in certain contexts, but it isn’t a standalone solution, especially against sophisticated, adaptive threats like phishing.

This is a living document, and will continuously be updated and refined.



Once 16 and making $200K off bots, now the CEO of Safe, a security company. Co-host of BeatTheOdds, a podcast for forward-thinkers.