Stories by Jeshua Bratman on Medium

Calibrating Classifiers in Reality

Jeshua Bratman — Mon, 24 Jan 2022 16:50:02 GMT

This post also appears on the Abnormal Blog.

Abnormal’s fundamental job is to detect malicious emails, such as phishing and business email compromise attacks, and other malicious events, such as suspicious sign-ins that indicate an account has been hacked. To do so, we have a complex web of features, sub-models, and classification models that decide whether an event is malicious. Once we’ve built a model, we must turn it into a classifier by selecting a threshold. This sounds easy, but there are many tricky details.

We’ll focus on the email attack detection case for this discussion and simplify it down to the core classification problem. The general approach is straightforward; we want to start with a probabilistic model M(X) that predicts attack probability given our features X.

M(X) = P(attack | X)

Other articles on this blog discuss this attack model itself (here, here, and here), so we won’t discuss that in this article. Here we are interested in what happens after we’ve built a model.

Once we have a probabilistic model, we could easily build an attack detector using thresholding to define a decision boundary.

predict attack if M(X) > threshold

Such a classifier would have nice properties. For example, an event scoring exactly at the threshold is an attack with probability equal to the threshold, so the precision of everything we flag is at least the threshold, and the recall is recoverable from a precision/recall curve. These properties would allow us to easily trade between precision and recall for our business needs by sliding a threshold up and down. This is important because our clients are sensitive to both false positives and false negatives, so tuning our detectors precisely is crucial to maintaining a high-quality product.
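As a minimal sketch of the thresholding rule and the precision/recall trade-off it controls, the snippet below evaluates the rule at a few thresholds. The function name and the toy scores and labels are assumptions made up for illustration, not data from our system.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of the rule `predict attack if score > threshold`."""
    flagged = [y for s, y in zip(scores, labels) if s > threshold]
    true_positives = sum(flagged)
    total_positives = sum(labels)
    # With nothing flagged there are no false positives, so report precision 1.0.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical model scores and ground-truth labels (1 = attack, 0 = safe).
scores = [0.99, 0.97, 0.90, 0.80, 0.60, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for t in (0.5, 0.85, 0.95):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold in this toy data trades recall for precision, which is the knob the rest of this post is about.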

But there’s a problem: in reality, we rarely find that trained models produce accurate probabilities. This is a problem of miscalibration. We will delve into the causes of this miscalibration in a bit, but usually it is due to a combination of the learning algorithm (e.g. neural nets may not produce probabilities) and, more importantly, skewed data and label distributions in the training dataset. But first, let’s focus on why miscalibration is a problem for an ML product.

As we said, the model does not line up with the true probability distribution:

M(X) ≠ P(attack | X)

If it is a good model, we can assume its score is still correlated with the true probability. But because the score is not itself a probability, it throws off our thresholding strategy. Imagine we create a classifier as before:

predict attack if M(X) > threshold

This threshold no longer lines up with an expected precision, and therefore we must tune thresholds to meet desired performance characteristics. To appreciate why this is not ideal in practice, take Abnormal’s use case. We are carefully tuning our detectors to prevent email attacks. Often we may need to move a false positive rate from X% to Y%. To do so, we must go back to our data and solve for the right threshold. If we want to build simple control knobs or even an automated control system, we do not want to require this manual translation from the desired performance to a threshold.
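To make the "go back to our data and solve for the right threshold" step concrete, here is a minimal sketch that searches a labeled set for the lowest cutoff meeting a target precision. The function name and toy data are hypothetical, and the rule here is `score >= cutoff` so that the returned value is an observed score.

```python
from typing import Optional, Sequence


def threshold_for_precision(
    scores: Sequence[float], labels: Sequence[int], target_precision: float
) -> Optional[float]:
    """Lowest cutoff such that flagging `score >= cutoff` meets the target precision.

    Returns None when no cutoff on this dataset reaches the target.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Only evaluate a cutoff once every sample tied at this score is included.
        is_last_of_tie = i == len(ranked) or ranked[i][0] != score
        if is_last_of_tie and true_positives / i >= target_precision:
            best = score
    return best


# Hypothetical uncalibrated scores and labels (1 = attack, 0 = safe).
scores = [0.99, 0.97, 0.90, 0.80, 0.60, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(threshold_for_precision(scores, labels, 0.95))
```

Every new model or client needs this search repeated on fresh data, which is the manual translation the paragraph above wants to avoid.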

In practice this caused some big issues, primarily:

For each new model, we needed to carefully tune thresholds. This made it very hard to compare model A to model B on an even playing field. AUC gives one evaluation, but it largely measures performance outside our operating range, and we need more precise evaluations at particular thresholds. This difficulty slowed down our experimentation and our pace of launching new models.
We needed to control thresholds separately for each client for each model. As our client base grew, this became increasingly problematic. We knew we had to either better bake the client particulars into the model directly or somehow set thresholds automatically from data.

The ideal solution is to calibrate our model by adding an extra layer to do this translation automatically:

Calibrator(M(X)) = P(attack | X)

A common approach is to fit a simple regression function, for example isotonic regression, that maps model scores to the empirical probabilities measured on a calibration dataset (we’ll call this CalibrationDataset). Given a good calibration dataset, this function re-maps the scores into probability space.

The idea is to partition the range of the model’s predictions into N buckets. For each bucket, estimate the fraction of examples that belong to the positive class, then draw a line between the buckets. (Isotonic regression additionally constrains the fitted function to be non-decreasing in the score.) There are many ways to improve this by drawing a smoother function between the points instead of a piecewise linear one. For example, you could interpolate with linear regression or splines. These are simple details; the difficult part is producing the CalibrationDataset on which this all depends.
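For readers who want something concrete: isotonic regression in its standard form fits a non-decreasing step function using the pool-adjacent-violators algorithm. The following is a minimal sketch of that standard algorithm, not necessarily the calibrator we run in production. The function names and toy data are assumptions for illustration, and the sketch assumes all scores are distinct.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns (lowest score in block, calibrated probability) pairs in score order.
    Assumes distinct scores; tied scores would need to be pooled first.
    """
    # Each block holds [lowest score, sum of labels, sample count].
    blocks: List[List[float]] = []
    for score, label in sorted(zip(scores, labels)):
        blocks.append([score, float(label), 1.0])
        # Merge backwards while the previous block's mean exceeds the last block's.
        while len(blocks) > 1 and (
            blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]
        ):
            _, label_sum, count = blocks.pop()
            blocks[-1][1] += label_sum
            blocks[-1][2] += count
    return [(start, label_sum / count) for start, label_sum, count in blocks]


def calibrate(model: List[Tuple[float, float]], score: float) -> float:
    """Map a raw score to the probability of the block it falls into."""
    starts = [start for start, _ in model]
    index = max(bisect_right(starts, score) - 1, 0)
    return model[index][1]


# Hypothetical calibration data (1 = attack, 0 = safe).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
model = fit_isotonic(scores, labels)
print(calibrate(model, 0.45))  # falls in the pooled block covering 0.3 to 0.5
```

The out-of-order labels around 0.3 to 0.5 get pooled into one block whose probability is the block's positive fraction, which is what keeps the mapping monotone.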

Sources of Error in Calibration Data

As we discussed above, to calibrate a classifier (or train a model that predicts probability correctly in the first place) we need a dataset whose distribution matches what the online production system actually sees. Let’s imagine we have some true probability distribution of attacks.

P_true(attack|features)

At Abnormal, we know our training data is quite different from this distribution. One reason for this is that we heavily subsample negative examples (safe emails) and additionally use many positive examples (attack emails) from different time ranges. This is because we want to include the history of all attacks in our models, but we cannot include the history of all safe messages due to the enormity of the data. Safe emails outnumber attacks by a factor of 100,000 or more since the attack base rate is very low.

Additionally, we do not want to force ML engineers to attempt to produce a representative distribution every time they train a model. It may slow down iteration or cause other issues due to missing data, the need to experiment with filtering functions, etc. Instead, we prefer to train uncalibrated models and then fix them afterward with calibration.

That leaves us with the same problem: how do we produce a CalibrationDataset drawn from the true distribution?

First, let’s enumerate some types of distributional errors we commonly encounter:

Sampling distributional errors. The most obvious errors are from how we sample the positive and negative examples. We may select only 10% of negative samples since they are so prevalent.
Label distributional errors. We do not and cannot label every message in a dataset. This means any calibration dataset will be only partially labeled.
Client distributional errors. At Abnormal we have many different clients in different industries and with particular characteristics. While there are other potential complicating variables, the client is a particularly impactful one. We may have a distribution on some clients that does not translate to new clients. Ideally, as our models learn across a larger sampling of clients, this issue will continue to lessen, but it must be taken into account.
Data quality distributional errors. We may also have fundamental differences in feature values in our calibration dataset due to engineering data quality (for example some features may not be possible to backstate for certain samples).
Time distributional shifts. Any calibration dataset will be earlier in time than the distribution on which we will be applying the model. There are fundamental time-based shifts due to naturally changing email traffic. Additionally, due to the adversarial nature of the problem, we expect the attack distribution patterns to change as adversaries evolve their strategy.

Attempting to Correct for These Errors

It’s important to understand each of these possible sources of error and think through others if necessary. Once we understand the errors, we can build correction mechanisms. Ideally, we also produce small datasets that help us evaluate how much our correction methods succeed in this task. If we can correct each error source, we can ideally produce a good calibrator.

Here are some high-level ideas on how to correct for each error:

Correcting for sampling errors. This can be done by understanding the exact mechanism used to sample in the first place and reversing it. As a simple example, if we uniformly sampled 10% of the negative class, the odds of the positive class in that dataset are inflated by a factor of 10, and we can shift them back.
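A minimal sketch of reversing uniform negative subsampling follows. Strictly speaking, the factor applies to the odds rather than to the probability itself, so the correction is done in odds space. The function and parameter names are hypothetical.

```python
def undo_negative_subsampling(p_sampled: float, negative_keep_rate: float) -> float:
    """Convert a probability measured on subsampled data back to the full population.

    Keeping a fraction `negative_keep_rate` of negatives multiplies the odds of the
    positive class by 1 / negative_keep_rate, so we scale the odds back down.
    """
    if not 0.0 < negative_keep_rate <= 1.0:
        raise ValueError("negative_keep_rate must be in (0, 1]")
    if p_sampled >= 1.0:
        return 1.0
    odds_sampled = p_sampled / (1.0 - p_sampled)
    odds_true = odds_sampled * negative_keep_rate
    return odds_true / (1.0 + odds_true)


# A bucket that looks 50% positive after keeping 10% of negatives
# is about 9% positive (1 in 11) in the full population.
print(undo_negative_subsampling(0.5, 0.10))
```

Note that the corrected value is 1/11 rather than 0.05: dividing the probability by 10 is only a good approximation when the probability is small.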

Correcting labeling error. For various reasons we cannot and do not label an entire dataset, but we can control exactly what we do and do not label within a dataset. We can use the label selection criterion (i.e. how we choose which data to label in the first place) together with sample weighting (i.e. weighting labeled samples above unlabeled ones) to help correct errors caused by unlabeled samples in a dataset.
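One standard way to combine a known selection criterion with sample weighting is inverse-probability weighting: each labeled sample is weighted by one over the probability that a sample like it was chosen for labeling. This is a sketch of that general technique under the assumption that those probabilities are known; it is not a description of our exact weighting scheme, and all names and numbers are hypothetical.

```python
from typing import Sequence


def weighted_attack_rate(
    labels: Sequence[int], labeling_probabilities: Sequence[float]
) -> float:
    """Estimate the attack rate of the full dataset from its labeled subset.

    Each labeled sample stands in for 1 / p samples, where p is the (assumed known)
    probability that a sample like it was selected for labeling.
    """
    weights = [1.0 / p for p in labeling_probabilities]
    weighted_positives = sum(w * y for w, y in zip(weights, labels))
    return weighted_positives / sum(weights)


# Hypothetical labeled subset: suspicious-looking messages were labeled with
# probability 0.5, ordinary-looking ones with probability 0.01.
labels = [1, 1, 0, 0, 0]
labeling_probabilities = [0.5, 0.5, 0.5, 0.01, 0.01]

naive_rate = sum(labels) / len(labels)
corrected_rate = weighted_attack_rate(labels, labeling_probabilities)
print(naive_rate, corrected_rate)
```

The naive rate over labeled samples overstates how common attacks are because suspicious messages were far more likely to be labeled; the weighted estimate compensates for that.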

Correcting per-client error. We can manually learn marginal distributions for a customer and shift our calibrator to make up for these. Or if we want to get more sophisticated we can attempt to model clients through some featurization (for example the client’s industry or size) and learn marginal distributions across those features.

Correcting data quality error. This error is easy to measure as we can compare distributions of model scores and features between our online system and historical batch data and then use that measurement to shift our distribution as needed.
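As one possible way to measure the gap between online and batch score distributions, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The choice of this statistic is an assumption for illustration (the post does not prescribe a specific measurement), and the sample values are made up.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # Both CDFs are step functions, so the gap only changes at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical model scores from historical batch data and from the online system.
batch_scores = [0.10, 0.20, 0.30, 0.40]
online_scores = [0.30, 0.40, 0.50, 0.60]
print(ks_statistic(batch_scores, online_scores))
```

A value near zero suggests the batch data is a faithful stand-in for online traffic; a large value flags a feature or score whose historical values should not be trusted for calibration without adjustment.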

Correcting time distributional shifts. This is the hardest to correct for. One possibility is to attempt to model the shift with a time series model. Another method is to monitor and correct for distributional shifts with an online system measuring drift over time. Going into details here would be a blog post in its own right.

Engineering an Imperfect Solution

Unfortunately, even after significant work on this problem attempting to correct all these errors, we failed to produce a perfect all-around calibrator. Too many distributional errors persisted.

This sort of setback is common in ML engineering. Rather than give up, we instead thought creatively. We asked the question: Do we actually need a calibrated classifier?

Well, yes and no. A calibrated classifier is sufficient, but not necessary. That is, could we loosen the requirements? To answer this question, we listed out the actual desiderata for our classifier:

Calibration matters only in a specific operating range. In reality, the model only needs to make good predictions at very high precision, and we care much less about calibration lower down in the PR curve because we are only remediating attacks for which we are quite confident, as almost all emails are safe. This is the key insight.
Score stability property. The range of scores emitted by a classifier should be relatively stable across clients and across versions of the trained model. For example, we would like the score of 0.95 to mean roughly the same thing when we roll out a new model or on a newly onboarded client. Also, the score should be smooth: moving a threshold by some amount should smoothly affect the volume of flagged messages and the precision. If we have a stable score, we can more easily build systems on top of the model, such as a control system.
Ranking property. Perhaps obvious, but the calibrator needs to generally rank messages from least likely to most likely to be attacks. That is, it should have a high AUC.

Simplifying the problem made it more tractable. We ended up building a calibration system that has the following properties:

Calibration is correct at about 0.95 precision.
Performance is not well calibrated below this point, but it is relatively stable between clients and on new versions of the model.
Very low scores are uncalibrated, and we do not trust the model much at all for low-confidence predictions. Luckily, we rarely worry about performance low in the curve because we are using this classifier only to predict the positive attack class.

Below is an illustration of a stabilized predictor and how it might match up to the ideal calibrated predictor in some places and not others.

Conclusion

Once we developed this calibration method, it made many tasks easier. Before calibration, we had to manage thresholds very carefully across clients and between models. Now, there is a single threshold to control for each model and this threshold is stable from one model to another and one client to another. For example, 0.95 means the same thing between models.

This has increased development speed and made it much easier to run experiments. Trusting this calibration method has also removed many moving parts an ML engineer must think about when comparing one model to another.

Key takeaways include:

Start with a theoretical framework for a problem, but don’t be afraid to cut corners and simplify this framework to make progress. For example, the key insight that our model only needs to be calibrated at the top scores helped dramatically simplify the problem.
Reducing degrees of freedom helps with productivity. In our case, manually controlling thresholds per client or for new models could eke out slightly better performance, but sticking to a fixed calibration method and a single threshold allows easier progress because the ML engineer then does not need to think about the calibration problem for every model on top of the core feature and model improvement tasks.
There are many steps beyond the model itself to build a good product on top of ML. Do not focus solely on getting the best AUC; also think about managing thresholds, running experiments, iteration, and so on.

If these problems interest you, check out our open opportunities because we’re hiring!

Calibrating Classifiers in Reality was originally published in Abnormal Security Engineering Blog on Medium.

Abnormal Engineering Stories — Episode #2: Future of ML Platform

Jeshua Bratman — Wed, 07 Jul 2021 18:54:58 GMT

Future of ML Platform w/ Jeshua Bratman & Nico Koumchatzky

Over the last three years building our ML-based cybersecurity products at Abnormal Security, I’ve benefitted enormously from discussions with colleagues in the ML space. This podcast aims to make some of those conversations available.

In our second episode of Abnormal Engineering Stories, Nico Koumchatzky and I discuss the future of ML platform, what it means to be an ML Engineer, and the ML challenges faced at Abnormal and Nvidia. Nico is the Senior Director of AI Infrastructure at Nvidia, and before that, he ran Twitter’s ML Platform team, “Twitter Cortex,” where he and I worked together.

This discussion includes:

A wide-ranging and enjoyable discussion on the current and future state of ML platform with analogies to the history of software engineering
The role of the “ML Engineer” and why any successful ML practitioner needs to have one foot in the software engineering world (code, IDEs, databases, services, etc.) and the other foot in the machine learning world (experimentation, data science, algorithms, etc.)
Challenges we are trying to solve in our organizations including stopping cybercrime (at Abnormal) and building a platform to enable fast and large-scale ML and autonomous vehicles (at Nvidia)!

We hope you enjoy it! Please do subscribe on Apple, Spotify, or Google Podcasts.

Abnormal Engineering Stories, Episode #2: Future of ML Platform was originally published in Abnormal Security Engineering Blog on Medium.

Podcast: Building Applied ML Startups

Jeshua Bratman — Fri, 02 Jul 2021 14:24:29 GMT

It was a pleasure sitting down with Tim Shi of Cresta and Saam Motamedi of Greylock to discuss building companies around machine learning technology!

It’s one thing to add machine learning and artificial intelligence to existing software, but it’s quite another to build a company with ML at its core, like both Abnormal Security and Cresta. In this podcast episode, we discuss common hurdles we face building these companies, especially cold start problems, and the importance of customer partnerships early on.

Listen here:

Practical Innovation | Greylock

You have to be very flexible. When starting a company, you really want to solve the problems that the customers have and not just go and do data science off in a void, but at the same time, these are tough machine learning problems that cannot be solved without careful algorithm and data design.

If solving hard ML and engineering problems for stopping cyber-attacks interests you, yes Abnormal is hiring! Please message me if curious to learn more and see open roles at https://lnkd.in/ePwUxhi

Podcast: Building Applied ML Startups was originally published in Abnormal Security Engineering Blog on Medium.

How you Should Design ML Engineering Projects

Jeshua Bratman — Tue, 06 Apr 2021 00:11:10 GMT

Analysis of ML engineering lifecycles, common pitfalls, and a copy-and-paste template you can use.

Image source: Christopher Lague on pixy

Machine learning engineering is hard, especially when developing products at high velocity (as is the case for us at Abnormal Security). Typical software engineering lifecycles often fail when developing ML systems.

How often have you, or someone on your team, fallen into the endless ML experimentation twiddling paralysis? Found ML projects taking two or three times as long as expected? Pivoted from an elegant ML solution to something simple and limited to ship on time? If you answered yes to any of these questions, this article may be right for you.

Purpose of this article:

Analyze why software engineering lifecycles fail for ML projects
Propose a solution with an accompanying design document template to help you and your team more effectively run ML projects

Software and Data Science Project Lifecycles

Typical software engineering projects are about developing code and systems. They might go something like this:

(1) identify a product or infrastructure problem (2) discuss and design the software system to solve the problem (maybe with a crawl/walk/run) (3) break the problem into pieces and implement over the course of days, weeks, or months often using agile development processes (4) push into production and monitor (5) go back to the beginning of the cycle to improve the system as necessary (Image by Author)

This lifecycle is clearly not what happens for ML engineering projects. What about data experimentation? What about model training and evaluation?

Maybe we should look toward data science research projects and see if their lifecycle is more suited.

A data science research project might go like this: (1) identify a question that can be answered with data (2) design experiments (3) wrangle data (4) evaluate hypotheses with data analysis or modeling (5) publish results or trained models. (Image by Author)

This lifecycle doesn’t seem right either. Pure data science research projects are about answering questions and not about building systems. What’s the middle ground?

Machine Learning Engineering

Machine learning engineering is at a unique crossroads between data science and software engineering. ML engineers will have trouble operating in a software engineering organization if you try to force everyone to operate in the typical software development lifecycle. On the other hand, operating a machine learning team like a pure data science or research team will result in nothing getting shipped to production.

ML engineers can get frustrated when they commit to a project that requires experimentation. When they inevitably have false starts because data does not support their initial hypothesis or because wrangling the data is much more difficult than anticipated, they start falling behind committed timelines. This sense of falling behind makes a crucial part of their job, experimentation, feel like a constant failure compared to the work of colleagues on software engineering tasks.

A typical ML Engineering lifecycle goes as follows: (1) identify a problem (2) design software and experiments (these are interconnected because the models you plan to implement will depend on which experiments work out, but you may need to design feature and model code to run your experiments in the first place) (3) implement code and wrangle data (these are interconnected because you may need to implement software to get the data you need, and you may need the data to write and test the feature extraction or model training code) (4) analyze data, train models, evaluate results (5) publish results (6) test, deploy, and monitor code and models.

Typical ML Engineering lifecycle. The better the software design and experimental design, the less re-visiting required because a good design will anticipate the branches that may need to be taken. (Image by Author)

This ML engineering lifecycle is often invented on the job and not taught. It is possible to do very well by carefully laying out software and experimental design. Still, it is also easy to do poorly, leading to many false starts and winding paths toward a solution (that may never be reached).

Junior ML Engineers vs Senior ML Engineers

In a fantastic article, Julie Zhuo illustratively compares Junior Designers vs. Senior Designers. Her visualization aptly pertains to Junior and Senior ML Engineers as well.

Process for Junior ML Engineer. Often they will meander through the space of implementations, experiments, and data without a clear method. This wastes time and can be frustrating. (Image by Julie Zhuo, used with permission)

Process for Senior ML Engineer. Senior ML engineers will carefully lay out experimental paths, know when to cut them short and proceed in more fruitful directions, and know when one result indicates new directions to go. (Image by Julie Zhuo, used with permission)

Methodical thinking and discipline are a must when iterating on experiments. Can we help ML engineers plan out work to follow this paradigm?

Design documents to aid ML Engineering Lifecycles

How can we encourage better ML engineering design?

A process we’ve implemented at Abnormal is to require all ML Engineering projects to go through a formal design review process using a design document template that helps the engineer do good software and experiment design simultaneously.

What should be encouraged when designing ML Engineering projects?

Put the work into explicit, forward-thinking experiment design before rushing into implementation. This heads off the endless and fruitless cycles of ML and data twiddling we all find ourselves in from time to time.
Call out the *work* of experimentation as useful whether or not the experiment validates the hypothesis. There is value to disproving a hypothesis even if it does not lead to ML product improvements.
Design software with experiments in mind and design experiments with software in mind (i.e., what is capable of shipping to production). Wrangle your data in light of the systems you will be building and how that data will be available in production.

With these in mind, we created this template to fill out at the start of any ML engineering project. An engineer should copy this template, fill in the details for their project, then present the software and experimental design to the team for feedback and iteration. This process has greatly improved the success and velocity of projects, and we highly encourage adopting this design template (or something similar) for your ML Engineering team.

— — — — — — — — — — — — — — — — — — — — — — —
Abnormal Security’s ML Design Document Template (to copy and fill in at the start of a new ML Engineering project). Feel free to use directly, modify and share!

Problem Statement

What are we specifically trying to solve, and why are we solving it now? A strong justification will tie this back to a product or customer problem.

Goals

Software goals

Describe the software system we wish to build and its capabilities.

Metric goals

Desired metric improvement, how are we going to measure the impact of this work, why do we want to improve the system in this way:

Bad Example: Improve model’s performance
OK Example: Improve AUC by X% for the model
Good Example: Improve recall by X% for the class of false negatives without decreasing recall for any other classes by more than Y%.

Expected metrics tradeoffs, if any: For example: Increase recall without decreasing precision by more than 5%.

Experiment Design

Unlike pure software projects, data science / ML projects often require data exploration, experimentation, failure, and changing design along the way once data has been collected. To help make a project successful, it is helpful to lay out your potential branching points and how you will make decisions along the way. Additionally, all experiments should be evaluated against a baseline, which is either a simple solution to the problem (simple algorithm, simple heuristic) or the current production solution if one exists.

Data motivation

Describe the problem that should be solved, and use data to validate that this is indeed a worthy problem to solve. Is this actually going to have a real impact?

Hypotheses

Hypothesis 1: method A will improve metric B by X% over baseline

Method: Describe the methodology you are proposing. For example, this might be a model architecture we are testing, a new feature we are adding, etc.
Metric: Describe the metric or metrics we will use to evaluate the method.
Success criteria: The measured metric results that will indicate success in this hypothesis. Ideally should be measured against a baseline.
Timebox: X days, then check in with the team to decide the next steps
Failure Next Steps: For example, go on to try Hypothesis 2
Success Next Steps: For example, push this model to production.

Hypothesis 2: …
The same set of questions for each hypothesis

…

Software Design

Describe the software systems and data pipelines needed to execute this project. What software needs to be built? What services and databases? What data will need to be available in production to run your model? Feel free to use normal software design documentation principles here.

Execution

What will be delivered, and when will it be delivered? A strong plan will provide incremental value and will allow us to get to the crawl state quickly.

Crawl: Minimum design to prove the efficacy of change before we invest too much time in software development.

Walk: More thorough design aimed to be a relatively complete component.

Run: Long-term design here; how would we make this a really first-class system or model.

Considerations

Success criteria to launch?

Describe metrics evaluated to advocate launching this model or change into production.

What could go wrong?

Describe everything that might go wrong when we launch this.

Which product surfaces could be affected?
How will this impact customers?
How will we monitor?
What will we do to roll back?

Security & Privacy Considerations

What impact on security could this change have?
What impact on privacy could this change have?

Appendix: Experiment log

Keep track of the results of each hypothesis tested and the decisions made along the way, branching points, learnings, revised hypothesis, and so on. It’s beneficial to remind yourself later and share how you approach this type of problem with others on the team.

— — — — — — — — — — — — — — — — — — — — — — —

This template has evolved over the years. Thank you to Dmitry Chechik, Yu Zhou Lee, Kevin Lau, Umut Gültepe, and Abhijit Bagri for all the input on this design process over time.

If you are interested in fascinating Applied ML engineering in the cybersecurity space, yes, we’re hiring!

How you Should Design ML Engineering Projects was originally published in TDS Archive on Medium.

Re-Scoring an ML Detection Engine on Past Attacks (part 1)

Jeshua Bratman — Mon, 29 Mar 2021 18:40:12 GMT

Developing a machine learning product for cybersecurity comes with unique challenges. For a bit of background, Abnormal Security’s products prevent email attacks (think phishing, business email compromise, malware, etc.) and also identify accounts that have been taken over. These attacks are clever social engineering attempts launched to steal money (sometimes in the millions) or gain access to an organization for financial theft or espionage.

Detecting attacks is hard! We’re dealing with rare events: as low as 1 in 10 million messages or sign-ins. The data is high dimensional: all the free-form content in an email, plus anything linked from or attached to it. We require bleedingly high precision and recall. And, crucially, it is adversarial: attackers are constantly trying to outsmart us.

These factors have consequences for our ML system:

The adversarial nature of the problem means the attack distribution is constantly shifting, so we must keep improving models. Attackers are often even using ML systems of their own to adapt their attacks!
Due to the rarity of attacks, we must retain every precious sample. We cannot afford to throw out older attacks simply because they do not have the latest features. We must instead re-compute these features.
Due to high dimensionality and volume, it is a big-data problem and requires efficient distributed data processing.

To build a platform and team that can operate and improve our detection engine at high velocity, we must enable ML engineers to experiment with changes across the entire stack. This includes changes to underlying detection code, new datasets, new features, and the development of new models.

Iteration for ML engineers working on detection problems at Abnormal. A lot goes into improving the detection stack, and it’s crucial to have a robust testing and evaluation suite that can evaluate across new models, datasets, new code, etc.

This loop is reminiscent of a software engineering CI/CD loop, but there are more moving pieces. When developing detectors, there may be new code involved, new datasets (that must be served online and offline), and new models. We must test this entire stack thoroughly, and the easier it is to test, the easier it will be to safely iterate.

Why is testing the detection stack so important? Think about what could go wrong. If, for example, we make an unintentional code change that modifies a feature used by a model (but we do not retrain the model), the change could shift feature distributions, cause incorrect classifications, and make us miss a damaging attack. Since our system operates at incredibly high precision and recall, small changes can cascade to have large consequences.

Re-generating an updated labeled dataset

Our rescoring system has three important components:

Golden Labels: The set of all labeled messages with features updated using the entire detection stack (including recent code, joined data, and prediction models). We affectionately call this dataset “Golden Labels.” It’s the gold standard label set, and it is also precious because it contains all attack examples from the past. This dataset feeds into the following two processes.
Rescoring: The process of evaluating the entire detection system to produce performance analytics like precision and recall. We call this “rescored recall.” For example, we measure what percentage of historical attacks we would catch today with the current detection system. Note: this is different from evaluating precision and recall for individual models on an evaluation set because we are testing an entire stack that includes multiple models, decision logic, feature extraction code, and datasets.
Model Training: Like rescoring, re-training models requires up-to-date Golden Labels data. We must have correct and unbiased features to train our models so that the data observed when training matches up to the data at inference time. The features that feed into model training are organized as a DAG of dependencies where some features rely on other features or models (see our blog post on Graph of Models and Features for more on this)
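
As a rough illustration of what “rescored recall” measures, the sketch below replays a handful of labeled historical samples through a stand-in detection function and computes precision and recall over the final decisions. The `LabeledSample` type, the `detect` callable, and the toy score threshold are all assumptions made for this example, not Abnormal's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class LabeledSample:
    """Hypothetical stand-in for one row of a Golden Labels dataset."""
    features: dict
    is_attack: bool


def rescored_metrics(samples: Iterable[LabeledSample],
                     detect: Callable[[dict], bool]) -> dict:
    """Replay every sample through the current detector and score the result."""
    tp = fp = fn = 0
    for sample in samples:
        flagged = detect(sample.features)
        if flagged and sample.is_attack:
            tp += 1
        elif flagged:
            fp += 1
        elif sample.is_attack:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


# Toy detector: flag anything whose (hypothetical) score exceeds 0.5.
samples = [
    LabeledSample({"score": 0.9}, True),
    LabeledSample({"score": 0.2}, True),
    LabeledSample({"score": 0.7}, False),
    LabeledSample({"score": 0.1}, False),
]
metrics = rescored_metrics(samples, lambda f: f["score"] > 0.5)
```

In a real pipeline the `detect` function would be the entire stack (feature extraction, models, and decision logic), which is what distinguishes this from evaluating a single model.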

Various steps in the model training and rescoring are dependent on the most recent datasets, code, models, and labeled samples. Since we don’t want to lose old attack samples, we must re-hydrate and re-extract features for all these old samples. This rescoring pipeline creates an updated dataset called “Golden Labels” used to ensure accurate data on the historical dataset.

Requirements for Golden Labels Re-Generation

For data that feeds into rescoring and model training to be effective, we have several requirements:

(historical samples support) We should be able to update all historical samples with the latest (1) attribute extraction code, (2) joined datasets, and (3) model predictions, and compute correct, trustworthy analytics from them.
(unbiased feature requirement) Re-computed features on old samples accurately reflect how those features would appear if that sample were encountered today.
To produce unbiased aggregate features, the system must perform time-travel: evaluate samples as they would have looked at the time of the attack. Time travel ensures unbiased data and also helps us avoid future leakage: avoiding bringing any data from the future (to a past sample) that leaks labels into training features.
(developer effectiveness) ML engineers must be able to easily develop, integrate, and deploy their changes to any area of the detection stack. The pipeline must be efficient enough to run frequently, both to retrain and evaluate changes and to run ad-hoc evaluations that answer “what-if” questions.
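
To make the time-travel requirement concrete, here is a minimal sketch of a point-in-time aggregate: a counter that can answer “how many events had we seen for this key before time T?” The `PointInTimeCounter` class and the example key are hypothetical; a production system would back this with a database rather than in-memory lists.

```python
import bisect
from datetime import datetime


class PointInTimeCounter:
    """Counts events per key, queryable as of any past timestamp."""

    def __init__(self):
        self._events = {}  # key -> sorted list of event timestamps

    def record(self, key, timestamp):
        bisect.insort(self._events.setdefault(key, []), timestamp)

    def count_as_of(self, key, as_of):
        """Number of events for `key` strictly before `as_of`."""
        return bisect.bisect_left(self._events.get(key, []), as_of)


history = PointInTimeCounter()
history.record("sender@example.com", datetime(2021, 1, 5))
history.record("sender@example.com", datetime(2021, 2, 10))
history.record("sender@example.com", datetime(2021, 6, 1))

attack_time = datetime(2021, 3, 1)
# Only the two messages sent before the attack are visible to the feature;
# the June message is "from the future" and must not leak in.
prior_messages = history.count_as_of("sender@example.com", attack_time)
```

Querying with the sample's own timestamp, rather than with "now", is what keeps the re-computed feature consistent with what the system could have known at the time.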

Ad-Hoc rescoring experiments

In addition to the automatic daily re-generation of “Golden Labels” (tagged to a particular code branch), we have an Ad-Hoc rescoring pipeline that lets engineers ask “what if” questions: what happens to overall detection performance if we change one or more pieces of the system?

For example, we may not need to re-run the entire feature extraction stage if we are testing only a new model (and the downstream impact of that model). To do so, we rely on the most recent updated Golden Labels from the night before and run additional steps:

Partial re-evaluation allows us to test changes on parts of the stack without re-extracting all features.

Example of an ad-hoc rescoring experiment

We can either set up two configurations, “Baseline” and “Experiment,” or run this on two different code branches. It’s up to the ML engineer to decide how to run their experiment correctly. Eventually, we would like our CI/CD system to run rescoring on stages affected by particular code changes and automatically provide metrics, but for now, it is manual.

This example configuration tests what happens when we swap out a single model.

# Baseline configuration runs model scoring and decisioning.
baseline_config = RescoreConfig(
  [           
    MODEL_SCORING, # Evaluates ML models
    DETECTION_DECISIONS, # Evaluates our detection decisions using the model scores.
  ]
)

# Experimental setup to swap a single model.
experiment_config = RescoreConfig(
  [
    MODEL_SCORING,
    DETECTION_DECISIONS,
  ],
  FinalDetectorRescoreConfig(
    replace_models=[ReplacementModelConfig(
      model_path="/path/to/experimental/model",
      model_id=ATTACK_MODEL
    )]
  )
)

# Runs the rescoring and delivers analytics to the user.
run_rescoring(
  rescore_config=experiment_config,
  baseline_config=baseline_config
)

We can use a similar system to generate model training data with experimental features.

Running rescoring efficiently (part 2…)

Both automatic and ad-hoc rescoring require a lot of heavy lifting behind the scenes. We run everything on Spark, and there are a lot of tricky data engineering problems to solve to satisfy the requirements listed above, especially the time travel problem. Read part 2 of this story here:

Re-Scoring an ML Detection Engine on Past Attacks (part 1)

If you are interested in solving tough Applied ML engineering problems in the cybersecurity space, yes, we’re hiring!

Thanks to Justin Young, Carlos Gasperi, Kevin Lau, Dmitry Chechick, Micah Zirn, and everyone else on the detection team at Abnormal who contributed to this pipeline.

Re-Scoring an ML Detection Engine on Past Attacks (part 1) was originally published in Abnormal Security Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Graph of Models and Features

Jeshua Bratman — Mon, 01 Feb 2021 15:51:46 GMT

At the core of all Abnormal’s detection products sits a sophisticated web of prediction models. For any of these models to function we need deep and thoughtfully engineered features, careful modeling of sub-problems, and the ability to join data from a set of databases.

For example, one type of email attack we detect is called Business Email Compromise (BEC). A common BEC attack is a “VIP impersonation” in which the attacker pretends to be the CEO or other VIP in a company in order to convince an employee to take some action. Some of the inputs to a model detecting this sort of attack include:

Model evaluating how similar the sender’s name appears to match a VIP (indicating impersonation)
NLP models applied to the text of the message
Known communication patterns for the individuals involved
Identity of the individuals involved extracted from an employee database
… and many more

All these attributes are carefully engineered and may rely on one another in a directed graphical fashion.

This article describes Abnormal’s graph-of-attributes system which makes this type of interconnected modeling scalable. This system has enabled us to grow our ML team while continuing to rapidly innovate.

Attributes

We store all original entity data as rich thrift objects (for example a thrift object representing an email or an account sign-in). This allows flexibility in terms of the data types we log, enables easy backward compatibility, and keeps data structures understandable. But as soon as we want to convert this data into something that will be consumed by data science engines and models, we should convert it into attributes. An attribute is a simply-typed object (float / int / string / boolean) with a numeric attribute ID.

Attribute vs Features — Attributes are conceptually similar to features, but they might not be quite ready to feed into an ML model. These should be ready to convert into a form consumable by models. All the heavy lifting should occur at the time of attribute extraction, for example running inference on a model or hydrating from a database.

The core principles we are working off include:

Attributes can rely on multiple modes of inputs (Other raw attributes, Outputs of models, Data hydrated from a database lookup or join)
Attributes should be flat data (i.e. primitives) and representable in a columnar database
Attributes should be simple to convert to features (for example you may need to convert a categorical attribute into a one-hot vector)
We will always need to change and improve attributes over time
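
As a small illustration of the attribute-to-feature step mentioned above, the sketch below one-hot encodes a categorical attribute. The attribute name and vocabulary are made up for the example.

```python
def one_hot(value, vocabulary):
    """Encode a categorical attribute as a one-hot list over `vocabulary`.

    Values outside the vocabulary map to an all-zero vector.
    """
    return [1.0 if value == entry else 0.0 for entry in vocabulary]


# Hypothetical categorical attribute: the type of the first attachment.
ATTACHMENT_TYPES = ["none", "pdf", "html", "zip"]
features = one_hot("html", ATTACHMENT_TYPES)
```

Because the attribute itself stays a flat string, it remains easy to store in a columnar database, and the encoding can change without re-extracting the attribute.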

Consuming Attributes — Once data is converted into a columnar format it can be consumed in many ways: Ingested into a columnar store for analytics, tracked in metrics to monitor distributional shifts, and converted directly into a feature dataframe ready for training with minimal extra logic.

Directed graph of attributes

Computing attributes as a directed graph allows enormous flexibility for parallel development by multiple engineers. If each attribute declares its inputs, we can ensure everything is queried and calculated in the correct order. This enables attributes of multiple types:

Raw features
Heuristics that use many other features as input
Models that make a prediction from many other features
Embeddings

Attribute Hydration Graph:

Explicitly encoding the graph of attributes seems complex but it will save you painful headaches down the road when you want to use one attribute as an input to another.
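
One way to sketch the idea of declared inputs driving evaluation order is with Python's standard-library `graphlib` (Python 3.9+). The attribute names and toy extractor functions below are hypothetical and only illustrate the ordering; this is not the actual hydration graph.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical attributes mapped to the attributes they depend on.
dependencies = {
    "sender_display_name": set(),
    "employee_directory_match": set(),
    "vip_name_similarity": {"sender_display_name", "employee_directory_match"},
    "attack_score": {"vip_name_similarity"},
}

# Toy extractors: each receives the raw event and its already-computed inputs.
extractors = {
    "sender_display_name": lambda event, inputs: event["from_name"],
    "employee_directory_match": lambda event, inputs: "Jane Doe",
    "vip_name_similarity": lambda event, inputs: float(
        inputs["sender_display_name"].lower()
        == inputs["employee_directory_match"].lower()),
    "attack_score": lambda event, inputs: 0.9 * inputs["vip_name_similarity"],
}


def compute_attributes(dependencies, extractors, raw_event):
    """Evaluate every attribute only after all of its declared inputs."""
    values = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = {dep: values[dep] for dep in dependencies[name]}
        values[name] = extractors[name](raw_event, inputs)
    return values


values = compute_attributes(dependencies, extractors, {"from_name": "jane doe"})
```

A side benefit of an explicit graph is that cycles are caught up front: `TopologicalSorter` raises `CycleError` if two attributes depend on each other.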

Attribute versioning

Inevitably we will want to iterate on attributes and the worst feeling is realizing that the attribute you want to modify is used by ten downstream models. How do you make a change without retraining all those models? How do you verify and stage the change?

This situation comes up frequently. Some common cases:

An attribute is the output of a model or an embedding and you want to re-train the model, but the attribute is used by other models or heuristics
An attribute relies on a database serving aggregate features and we would like to experiment with different aggregate bucketizations
We have a carefully engineered heuristic feature and we would like to update the logic

If each attribute is versioned and downstream consumers register which version they wish to consume, then we can easily bump the version (while continuing to compute the previous versions) without affecting the consumers.
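
A minimal sketch of this versioning idea, assuming a simple in-memory registry keyed by attribute name and version. The `AttributeRegistry` class, the attribute name, and the bucketization logic are all hypothetical.

```python
class AttributeRegistry:
    """Keeps every registered version of each attribute computable."""

    def __init__(self):
        self._extractors = {}  # (name, version) -> extractor function

    def register(self, name, version, extractor):
        self._extractors[(name, version)] = extractor

    def compute(self, name, version, event):
        return self._extractors[(name, version)](event)


registry = AttributeRegistry()
# v1 stays registered so existing consumers keep seeing the same values.
registry.register("link_count_bucket", 1, lambda e: min(e["links"], 5))
# v2 experiments with a coarser bucketization.
registry.register("link_count_bucket", 2, lambda e: 0 if e["links"] == 0 else 1)

# Each consumer pins the version it was trained against.
consumer_versions = {"legacy_model": 1, "new_model": 2}
event = {"links": 7}
seen = {
    consumer: registry.compute("link_count_bucket", version, event)
    for consumer, version in consumer_versions.items()
}
```

The legacy model continues to receive the v1 value while the new model consumes v2, so the bump can be staged and verified without retraining everything downstream.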

Scaling an ML Team

In addition to enabling flexible modeling of complex problems, this graph of models enables us to scale our ML engineering team. Previously we had a rigid pipeline of features and models which was really only amenable to a single ML engineer at a time to develop. Now, we can have multiple ML engineers developing models for sub-problems, and then combining the resulting features and models together later.

There’s so much more to do

We need to figure out how to re-extract this graph of attributes for historical data more efficiently, and we need good processes for sunsetting older attributes. We would like to build a system that allows our security analysts and anyone else in the company to easily contribute attributes and have those automatically flow into downstream models and analysis. We need to improve our ability to surface the attributes and model scores relevant to a given decision back to the client, so they can understand why an event was flagged. And so much more… If these problems interest you, yes, we’re hiring!

Graph of Models and Features was originally published in Abnormal Security Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lazily Loading ML Models for Scoring with PySpark

Jeshua Bratman — Fri, 11 Dec 2020 19:47:59 GMT

Authors: Jeshua Bratman and Vineet Edupuganti

Our core email attack detection product at Abnormal works by processing each incoming message, applying a series of classification models, and ultimately deciding if a message might be an attack. This detection system runs in an online distributed system processing millions of messages per day.

Rescoring

One key component in our pipeline is called rescoring. This data pipeline loads historical examples of email attacks in order to evaluate the accuracy of our detection system with respect to historical attacks. Rescoring allows us to easily measure how changes to code and classification models impact the performance of the product.

High-level steps in rescoring:

Re-process messages
Re-extract every attribute and feature
Re-run every detection model
Evaluate precision, recall, and other statistics

Crucially, we need this batch evaluation pipeline to resemble our online decisioning flow as closely as possible, using the exact same code paths for the scoring module, but run over large sets of batch data using Spark. If the code being run were not identical to that of our real-time scoring systems, we could not trust that the rescoring results match true detection performance in our online system.

Lazy evaluation for rescoring models

PySpark has been a very flexible batch processing tool, but one challenge we ran into with the above architecture surfaces when using certain optimized ML libraries. In this article, we’ll focus in particular on Tensorflow models (using Keras), and nearest-neighbor models (using Spotify’s Annoy library). Neither plays nicely with Spark.

The straightforward approach would be:

Load models into memory in the Spark driver equivalently to how our real-time scorer loads models
Broadcast these models to the Spark executors
Run each model across examples in Spark executors

However, for both Tensorflow and Annoy, the loaded models exist outside the Python process memory space and therefore cannot be broadcast using PySpark (which attempts to just pickle the Python object).

Our simple solution is to broadcast only the bytes of the model and load them into memory in a lazy fashion.

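class ModelWrapper:
  def __init__(self, model_data):
    # Only the raw bytes are stored, so the wrapper can be pickled and broadcast.
    self.model_data = model_data
    self.model = None
    self.initialized = False

  def predict(self, features):
    # Load the model the first time a prediction is requested.
    if not self.initialized:
      self._load_model()
    return self.model.predict(features)

  def _load_model(self):
    # Initialize Tensorflow Graph & Session.
    # Load model into memory from self.model_data, e.g.
    #   self.model = <deserialize self.model_data>
    self.initialized = True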
This means the model is not actually loaded until the first time it is used to make a prediction.
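
The same pattern can be sketched without TensorFlow to show which state actually travels with the wrapper. In this hedged example, the `LazyModel` class is hypothetical, `pickle.loads` stands in for the real deserialization routine, and a pickled built-in `sum` stands in for a real model. The `__getstate__` hook is the one `pickle` consults when serializing an object, so it controls what would be shipped in a broadcast.

```python
import pickle


class LazyModel:
    """Carries serialized model bytes and deserializes them on first use."""

    def __init__(self, model_bytes, loader):
        self._model_bytes = model_bytes
        self._loader = loader  # must itself be picklable, e.g. a module-level function
        self._model = None

    def __getstate__(self):
        # Never ship the loaded model; receivers rebuild it from the bytes.
        state = self.__dict__.copy()
        state["_model"] = None
        return state

    def predict(self, features):
        if self._model is None:
            self._model = self._loader(self._model_bytes)
        return self._model(features)


wrapper = LazyModel(pickle.dumps(sum), pickle.loads)
before = wrapper._model                 # None: nothing has been loaded yet
result = wrapper.predict([1, 2, 3])     # first call triggers the load
shipped_state = wrapper.__getstate__()  # what pickle would serialize
```

Even after the model has been loaded locally, the state handed to `pickle` contains only the bytes and the loader, so each receiving process loads its own copy on first use.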

Example 1: Keras+Tensorflow models

We use this lazy loading method, in particular, for Keras/TF models. Since the TensorFlow graph lives in memory outside of the Python process we must ensure this graph and session are created on the Spark executors. The models are loaded and used identically between the batch and realtime pipeline. Here’s how it looks:

The raw data we pass around for Keras models is simply:

The Keras JSON description
Trained weights in the .h5 format

Sketch of loading Keras models from JSON & H5:

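import io

import h5py
from keras.models import model_from_json
from keras.engine import saving

# model_json: the Keras JSON description; model_bytes: contents of the .h5 weights file.
model = model_from_json(model_json)
h5_file = h5py.File(io.BytesIO(model_bytes), "r")

# Files written by model.save() keep the weights under a "model_weights" group.
if "layer_names" not in h5_file.attrs and "model_weights" in h5_file:
    h5_file = h5_file["model_weights"]

saving.load_weights_from_hdf5_group(h5_file, model.layers, reshape=False)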

Example 2: Nearest Neighbor store using Annoy

Just as with our Keras and TF models, we need to handle serialization and lazy loading when working with Annoy objects as well. Properly leveraging Annoy is an important requirement for our work on intelligent signatures and gives us the ability to perform efficient nearest neighbor lookups with large quantities of data (i.e. given an embedding, find the entries in our nearest neighbor store with the most similar embeddings).

Originally developed by Spotify and used for music recommendations, Annoy uses an approximate approach that creates an index based on a set of trees to enable fast lookup. Additionally, the library allows one to build indices offline and share them in memory across processes. Despite these benefits, there are a few challenges with leveraging Annoy objects in production, as we describe below.

Serialization

After building the Annoy index, it is important to be able to save it to a distributed file store, so we can load it in the same manner as our other models. However, a limitation of the Annoy library is that it only enables saving to disk, and other serialization methods (Pickle, for example) will not work. As such, we use an indirect approach whereby we save the index to a temporary file, read the corresponding bytes back, and store it as part of a larger serialized thrift object. To read the index, we load in the bytes directly from our online file store, save to a temporary file, and load in the index with the Annoy API. See below for corresponding code snippets.

import tempfile

import annoy

def get_annoy_bytes(annoy_index):
    # Annoy can only save to disk, so round-trip through a temporary file.
    with tempfile.NamedTemporaryFile(suffix='.ann') as fp:
        annoy_index.save(fp.name)
        with open(fp.name, "rb") as f:
            return f.read()

def read_annoy_bytes(dimension, distance, annoy_bytes):
    annoy_index = annoy.AnnoyIndex(dimension, distance)
    with tempfile.NamedTemporaryFile(suffix='.ann') as fp:
        fp.write(annoy_bytes)
        fp.flush()  # Make sure the bytes are on disk before Annoy reads the file.
        annoy_index.load(fp.name)
    return annoy_index

Lazy loading

In addition to serialization, we have to deal with the same challenges as the Keras models in terms of not being able to broadcast Annoy objects using PySpark. As such, we have to employ the lazy loading mechanism described above. With this approach, we initialize the Annoy index to None, such that we can broadcast without memory errors. Each time a given worker tries to score a message, the lazy init function is called, which loads in the Annoy index object. After the first time the object is loaded no additional read operations are required.

Large scale ML at Abnormal

In summary, this post provides a quick glimpse into a few of the challenges with running large-scale ML systems in production — namely leveraging libraries like Keras and Annoy in conjunction with big data frameworks like PySpark. We hope the methods we introduce for addressing lazy loading and serialization can be useful to you as you face similar challenges working with these tools.

To learn more about the exciting work that Abnormal is doing, check out the rest of our blog here. And if developing machine learning models and software systems to stop cybercrime interests you, yes we’re hiring!

Lazily Loading ML Models for Scoring with PySpark was originally published in Abnormal Security Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Stopping election interference emails attacks using ML

Jeshua Bratman — Wed, 18 Nov 2020 02:45:26 GMT

Stopping election interference attacks using ML

My favorite part of working at Abnormal Security is seeing the myriad nefarious attacks we are able to stop. These attacks include everything from attempts to steal millions of dollars, to installing ransomware that cripples hospitals, to state actors compromising our power grid. And right at the core of the product sit some tough ML problems. How do we robustly identify behavioral anomalies? How do we quickly adapt to an ever-changing attack landscape? How do we catch carefully crafted social engineering strategies aimed at tricking people?

Just before the election, an attack went out to thousands of voters in Florida trying to intimidate them into voting for Donald Trump. Although our system did not stop these initially, we were able to quickly feed examples into our ML system and get it to catch on, and then subsequently prevent other election-based social engineering attacks. This used a pretty cool system we recently built —

Rapidly fine-tunes models based on the newest data
Accepts hints in the form of example emails, phrases, etc
Uses data augmentation to generalize and boost the impact that those hints (along with particular false negatives) have on the model parameters

Not only did this immediately improve our system so that it caught the exact attack we saw, but it also generalized from the text content and behavioral patterns to identify other attacks trying to manipulate recipients using election-related strategies.

Combining ML Models to Detect Email Attacks

Jeshua Bratman — Tue, 17 Nov 2020 23:31:12 GMT

This article is a follow-up to one I wrote a year ago — Lessons from building AI to Stop Cyberattacks — in which I discussed the overall problem of detecting social engineering attacks using ML techniques and our general solution at Abnormal. This post aims to walk through the process we use at Abnormal to model various aspects of a given email and ultimately detect and block attacks.

As discussed in the previous post, sophisticated social engineering email attacks are on the rise and getting more advanced every day. They prey on the trust we put in our business tools and social networks, especially when a message appears to be from someone on our contact list (but is not) or even more insidiously when the attack is actually from a contact whose account has been compromised. The FBI estimates that over the past few years over 75% of cyberattacks start with social engineering, usually through email.

Why is this a hard ML problem?

A needle in a haystack — The first challenge is that the base rate is very low. Advanced attacks are rare in comparison to the overall volume of legitimate email:

1 in 100,000 emails is advanced spear-phishing
less than 1 in 10,000,000 emails is advanced BEC (like invoice fraud) or lateral spear phishing (a compromised account phishing another employee)
When compared to spam, which accounts for 65 in every 100 emails, we have an extremely biased classification problem which raises all sorts of difficulties

Enormous amounts of data — At the same time, the data we have is large (many terabytes), messy, multi-modal, and difficult to collect and serve at low latency for a real-time system. For example, features that an ML system would want to evaluate include:

Text of the email
Metadata and headers
History of communication for parties involved, geo locations, IPs, etc
Account sign-ins, mail filters, browsers used
Content of all attachments
Content of all links and the landing pages those links lead to
…and so much more

Turning all this data into useful features for a detection system is a huge challenge from a data engineering as well as ML point of view.

Adversarial attackers — To make matters worse, attackers actively manipulate the data to make it hard on ML models, constantly improving their techniques and developing entirely new strategies.

The precision must be very high — to build a product to prevent email attacks we must avoid false positives and disruption of legitimate business communications, but at the same time catch every attack. The false-positive rate needs to be as low as one in a million!
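
A quick calculation shows why the low base rate forces such a strict false-positive rate. The sketch below applies Bayes' rule to the spear-phishing base rate quoted above; the 100% recall figure is an optimistic assumption chosen only to isolate the effect of false positives.

```python
def precision_at(base_rate, recall, false_positive_rate):
    """Bayes' rule: fraction of flagged emails that are real attacks."""
    true_pos = recall * base_rate
    false_pos = false_positive_rate * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


base_rate = 1 / 100_000  # spear-phishing rate quoted above

# Assumed recall of 100%, compared at two false-positive rates.
loose = precision_at(base_rate, recall=1.0, false_positive_rate=1 / 1_000)
strict = precision_at(base_rate, recall=1.0, false_positive_rate=1 / 1_000_000)
```

Under these assumptions, a one-in-a-thousand false-positive rate means only about 1% of flagged emails are real attacks, while a one-in-a-million rate brings precision to roughly 91%.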

For more examples of the challenges that go into building ML to stop email attacks, see the discussion Lessons from building AI to Stop Cyberattacks.

To effectively solve this problem we must be diligent and extremely thoughtful about how we break down the overall detection problem into components that are solved carefully.

Example:

Let’s start with this hypothetical email attack and imagine how we could model various dimensions and how those models come together.

Subject: Reset your password
From: Microsoft Support
Content: “Please click _here_ to reset the password to your account.”

This is a simple and prototypical phishing attack.

As with any well-crafted social engineering attack, it appears nearly identical to a legitimate message, in this case, a legitimate password reset message from Microsoft. Because of this, modeling any single dimension of this message will be fruitless for classification purposes. Instead, we need to break up the problem into component sub-problems.

Thinking like the attacker

Our first step is always to put ourselves in the mind of the attacker. To do so we break an attack down into what we call “attack facets”.

Attack Facets:

Attack Goal — What is the attacker trying to accomplish? Steal money? Steal credentials? Etc.
Impersonation Strategy — How is the attacker building credibility with the recipient? Are they impersonating someone? Are they sending from a compromised account?
Impersonated Party — Who is being impersonated? A trusted brand? A known vendor? The CEO of a company?
Payload Vector — How is the actual attack delivered? A link? An Attachment?

If we break down the Microsoft password reset example, we have:

Attack goal: Steal a user's credentials
Impersonation strategy: Impersonate a brand through a lookalike display name (Microsoft)
Impersonated party: The official Microsoft brand
Payload vector: A link to a fake login page.

Modeling the problem

Building ML models to solve a problem with such a low base rate and such strict precision requirements forces a high degree of diligence when modeling sub-problems and engineering features. We cannot rely just on the magic of ML.

In the last section, we described a way to break an attack into components. We can use that same breakdown to help inspire the type of information we would like to model about an email in order to determine if it is an attack.

All these models rely on similar underlying techniques — specifically

Behavior modeling: identifying abnormal behavior by modeling normal communication patterns and finding outliers from that
Content modeling: understanding the content of an email
Identity resolution: matching the identity of individuals and organizations referenced in an email (perhaps in an obfuscated way) to a database of these entities

Attack Goal and Payload

Identifying an attack goal requires modeling the content of a message. We must understand what is being said. Is the email asking the recipient to do anything? Does it have an urgent tone? And so forth. This model may identify not only malicious content but safe content as well, in order to differentiate the two.

Impersonated Party

What does an impersonation look like? First of all, the email must appear to the recipient to come from someone they trust. We build identity models to match various parts of an email against known entities inside and outside an organization. For example, we may identify an employee impersonation by matching against the active directory. We may identify a brand impersonation by matching against the known patterns of brand-originating emails. We might identify a vendor impersonation by matching against our vendor database.

Impersonation Strategy

An impersonation happens when an email is not from the entity it claims to be from. To detect this, we model normal behavior patterns and look for deviations from them. This may be abnormal behavior between the recipient and the sender. It may be unusual sending patterns from the sender. In the simplest case, like the example above, we can simply note that Microsoft never sends from “fakemicrosoft.com”. In more difficult cases, like account takeover and vendor compromise, we must look at more subtle clues like unusual geo-location and IP address of the sender or incorrect authentication (for spoofs).
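
The simplest case can be sketched as a lookup: does the display name claim a brand while the sending domain is not one that brand is known to use? The `KNOWN_BRAND_DOMAINS` table and the substring match below are illustrative assumptions; a real identity model would be far more robust to obfuscation.

```python
# Hypothetical table of brands and the domains they legitimately send from.
KNOWN_BRAND_DOMAINS = {
    "microsoft": {"microsoft.com"},
}


def is_brand_impersonation(display_name, sender_address):
    """Flag a sender whose display name claims a brand but whose domain
    is not one that brand is known to send from."""
    domain = sender_address.rsplit("@", 1)[-1].lower()
    for brand, domains in KNOWN_BRAND_DOMAINS.items():
        if brand in display_name.lower() and domain not in domains:
            return True
    return False


suspicious = is_brand_impersonation("Microsoft Support", "help@fakemicrosoft.com")
legitimate = is_brand_impersonation("Microsoft Support", "no-reply@microsoft.com")
```

A rule like this only covers the trivial lookalike case; compromised accounts send from the correct domain, which is why the behavioral signals described above are needed as well.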

Attack Payload

For the payload, we must understand the content of attachments and links. Modeling these requires a combination of NLP models, computer vision models to identify logos, URL models to identify suspicious links, and so forth.

Modeling each of these dimensions gives our system an understanding of emails, particularly along the dimensions that attackers might use to conduct social engineering attacks. The next step is actually detecting attacks.

Combining Models to Detect Attacks

Ultimately we need to combine these sub-models to produce a classification result (for example P(Attack)). Just like any ML problem, the features given to a classifier are crucial for good performance. The careful modeling described above gives us very high bandwidth features. We can combine these models in a few possible ways.

(1) One humongous classification model: Train a single classifier with all the inputs available to each sub-model. All the input features could be chosen based on the features that worked well within each sub-problem, but this final model combines everything and learns unique combinations and relationships.

(2) Extract features from sub-models and combine to predict target — there are 3 ways we can go about this:

(2.a) Ensemble of Models-as-Features: Each sub-model is a feature. Its output is dependent on the type of model. For example, a content model might predict a vector of binary topic features.

(2.b) Ensemble of Classifiers: Build sub-classifiers that each predict some target and combine them using some kind of ensemble model or set of rules. For example, a content classifier would predict the probability of attack given the content alone.

(2.c) Embeddings: Each sub-model is trained to predict P(attack) like above or some other supervised or unsupervised target, but rather than combining their predictions, we extract embeddings, for example, by taking the penultimate layer of a neural net.
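
As one simple instance of option (2.b), the sketch below combines sub-classifier probabilities with a logistic model over their log-odds. The sub-classifier names, weights, and bias are made-up values for illustration; in practice they would be learned from labeled data, and this is not necessarily the combiner Abnormal uses.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def combine_scores(sub_scores, weights, bias):
    """Combine sub-classifier probabilities into a single P(attack)."""
    z = bias + sum(weights[name] * logit(p) for name, p in sub_scores.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical sub-classifier outputs for one email.
sub_scores = {"content": 0.8, "behavior": 0.9, "url": 0.6}
weights = {"content": 1.0, "behavior": 1.0, "url": 1.0}
combined = combine_scores(sub_scores, weights, bias=-2.0)
```

Keeping each sub-score as a named input preserves interpretability: when a message is flagged, it is easy to see which sub-classifier contributed most.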

Each of the above approaches has advantages and disadvantages. Training one humongous model has the advantage of getting to learn all complex cross dependencies, but it is harder to understand and harder to debug, and more prone to overfitting. It also requires all the data available in one shot, unlike building sub-models that could potentially operate on disparate datasets.

The various methods of extracting features from sub-models also have tradeoffs. Training sub-classifiers is useful because they are very interpretable (for example we could have a signal that represents the suspiciousness of text content alone), but in some cases, it is difficult to predict the attack target directly from a sub-domain of data. For example, a rare communication pattern alone is not sufficient to slice the space meaningfully to predict an attack. Similarly, as discussed above, a pure content model cannot predict an attack without context regarding the communication pattern. The embeddings approach is good but also finicky: it is important to vet your embeddings and not just trust that they will work. Also, the embedding approach is more prone to overfitting or accidental label leakage.

Most importantly with all these approaches, it is crucial to think deeply about all the data going into models and also the actual distribution of outputs. Blindly trusting in the black box of ML is rarely a good idea. Careful modeling and feature engineering are necessary, especially when it comes to the inputs to each of the sub-models.

Our solution at Abnormal

As a fast-growing startup, we originally had a very small ML team which has been growing quickly over the past year. With the growth of the team, we also have adapted our approach to modeling, feature engineering, and training our classifiers. At first, it was easiest to just focus on one large model that combined features carefully engineered to solve subproblems. However, as we’ve added more team members it has become important to split the problem up into various components that can be developed simultaneously.

Our current solution is a combination of all the above approaches depending on the particular sub-model. We still use a large monolithic model as one signal, but our best models use a combination of inputs including embeddings representing an aspect of an email and prediction values from sub-classifiers (for example a suspicious URL score).

Combining models and managing feature dependencies and versioning is also difficult.

Takeaways for solving other ML problems

Deeply understand your domain
Carefully engineer features and sub-models, don’t trust black box ML
Solving many sub-problems and combining them for a classifier works well, but don’t be dogmatic. Sure, embeddings may be the purest solution, but if it’s simpler to just create a sub-classifier or good set of features, start with that.
Breaking up a problem also allows scaling a team. If multiple ML engineers are working on a single problem, they must necessarily focus on separate components.
Modeling a problem as a combination of subproblems also helps with explainability. It’s easier to debug a text model than a giant multi-modal neural net.

But, there’s a ton more to do!

We need to figure out a more general pattern for developing good embeddings and better ways of modeling sub-parts of the problem, better data platforms, and feature engineering tools, and so much more. Attacks are constantly evolving and our client base is ever-growing leading to tons of new challenges every day. If these problems interest you, yes, we’re hiring!

Combining ML Models to Detect Email Attacks was originally published in Abnormal Security Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lessons from building AI to Stop Cyberattacks

Jeshua Bratman — Tue, 19 Nov 2019 05:54:02 GMT

Successful phishing attack on John Podesta that led to the 2016 DNC email leaks. Released by WikiLeaks.

On March 19th, 2016, John Podesta was tricked into revealing his Gmail credentials to a Russian-backed organization that then released emails regarding the Clinton campaign, effectively influencing the 2016 election.

Social engineering attacks like this (and much more sophisticated ones) are on the rise and getting more advanced every day. They prey on the trust we put in our business tools and social networks, especially when a message appears to be from someone on our contact list (but is not) or, even more insidiously, when the attack is actually from a contact whose account has been compromised. The FBI estimates that, over the past few years, more than 75% of cyberattacks have started with social engineering, usually through email.

Why is this a new problem?

Hasn’t email been around forever? Why hasn’t this problem been solved? Why is it a machine learning problem?

With more business being done online and connected to a single cloud login, as well as the increasing use of mobile devices where UX is often designed without security in mind, these forms of attacks have skyrocketed in number and are a threat to every organization that uses email. The sophistication and targeted nature of the attacks have, until recently, outpaced existing email security.

The only way to effectively detect these attacks is by using modern machine learning techniques of deep natural language understanding combined with careful modeling of an organization’s typical communication patterns.

About a year and a half ago I left Twitter’s ML team to help found Abnormal Security where I lead the development of our AI system to detect and prevent these malicious emails.

I’m somewhat new to cybersecurity and to email security (though it has similarities to the problem of abuse detection, which I worked on at Twitter). It is a fascinating and scary world:

There’s a multi-billion dollar economy of criminals using email-based attacks to steal money, account information, and trade secrets, but as many as 50% of these attacks are probably perpetrated by state actors: either spy organizations or politically motivated organizations funded by states such as Russia, China, Iran, and North Korea. For example, Fancy Bear (APT28) was responsible for the successful attack on John Podesta, and is probably funded by the Russian security organization GRU.
Zero-day exploits are not the only tools attackers have. It’s better for criminals to save those for a rainy day or sell them to a state actor. It’s much easier to convince someone to give up their credentials and use those to escalate privileges until you can get what you want, either money or trade secrets.
Existing email security has not been able to stop these attacks and the attacks are only getting more sophisticated.

Can we use modern AI to prevent attacks?

There’s a long history of using ML to stop email fraud; for example, spam filters are taught in most introductory ML courses. However, attackers have far outpaced existing email security. Nearly every kind of target, from local governments, manufacturers, energy companies, and technology companies to individuals, is under attack, and many are successfully breached every day.

Why is it hard to prevent these attacks?

The text in email attacks is often indistinguishable from legitimate communication. It’s been carefully designed to fool both the recipient and security software.
Attackers are actively attempting to evade ML systems and are using their own ML systems to thwart detection (for example, we found spear-phishing A/B testing software for sale on the dark web).
Attackers may launch attacks from compromised accounts, making them even harder to distinguish from regular email.

Before Abnormal Security, I worked at Twitter Cortex bringing deep learning models to many areas of the Twitter product. One particular problem I worked on was abuse detection: How do we identify harassment, bullying, and hate speech in Tweet conversations? We found that a combination of modern NLP (embeddings and LSTMs etc.) alongside features of the communication graph (who is tweeting at who, what communities are they in, what is their communication history etc.) was the crucial combination to build successful detection models.

These lessons learned from Twitter were helpful when approaching the problem of identifying email attacks: (1) is the content of the message suspicious? (2) are the identities of the parties in the communication often targeted? (3) what are the past communication patterns of the recipient and sender?

Interestingly, Twitter and email are two of the only large technologies where anyone can contact anyone else without a prior connection, and both are therefore ripe for abuse. But email breaches are more insidious due to how much access is linked to email accounts.

In this age of cloud identity providers, so much information and access is linked to email accounts on Office 365 and Gmail: documents, voice chat, video, possibly even desktop access.

Why is this a hard ML problem?

A needle in a haystack — The first challenge is that the base rate is very low. Advanced attacks are rare in comparison to the overall volume of legitimate email:

1 in 100,000 emails is advanced spear-phishing
less than 1 in 10,000,000 emails is advanced BEC (like invoice fraud) or lateral spear phishing (a compromised account phishing another employee)
(compare to spam, which accounts for 65 in every 100 emails)

This means we have an extremely imbalanced classification problem, which raises all sorts of difficulties.
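To make the base-rate problem concrete, here is a minimal Python sketch of the arithmetic. Only the 1-in-100,000 base rate comes from the list above; the 90% recall and the 0.1% false-positive rate are made-up numbers chosen for illustration, not measurements from any real detector.

```python
def precision(base_rate: float, recall: float, fpr: float) -> float:
    """Fraction of flagged emails that are truly attacks.

    base_rate: fraction of all emails that are attacks
    recall:    fraction of attacks the detector flags
    fpr:       fraction of legitimate emails the detector flags
    """
    true_positives = base_rate * recall
    false_positives = (1.0 - base_rate) * fpr
    return true_positives / (true_positives + false_positives)


def max_fpr_for_precision(base_rate: float, recall: float, target: float) -> float:
    """Largest false-positive rate that still achieves the target precision."""
    return base_rate * recall * (1.0 - target) / (target * (1.0 - base_rate))


BASE_RATE = 1 / 100_000      # advanced spear-phishing rate from the list above
ASSUMED_RECALL = 0.9         # hypothetical
ASSUMED_FPR = 0.001          # hypothetical: 1 in 1,000 legitimate emails flagged

# Even a seemingly good detector is wrong on the vast majority of its flags.
print(precision(BASE_RATE, ASSUMED_RECALL, ASSUMED_FPR))

# False-positive rate needed to reach 90% precision at the same recall.
print(max_fpr_for_precision(BASE_RATE, ASSUMED_RECALL, 0.9))
```

Under these assumptions the first number comes out below 1% precision, and the second shows that 90% precision requires a false-positive rate of roughly one in a million legitimate emails.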

Enormous amounts of data — At the same time, the data we have is large (many terabytes), messy, multi-modal, and difficult to collect and serve at low latency for a realtime system. For example, features that an ML system would want to evaluate include:

Text of the email
Metadata and headers
History of communication for parties involved
Account sign-ins, mail filters, and other account activity
Identity of parties involved (is it the CEO’s name? an accountant? etc.)
Attachment content
Links in attachments
Images in attachments
Links in body
Contents of linked landing page
Images in linked landing pages
Code in landing pages
Malware in attachments
…

Turning all this data into useful features for a detection system is a huge challenge from a data engineering as well as ML point of view.

Adversarial attackers — To make matters worse, attackers actively manipulate the data to make it hard on ML models:

Attackers encode text with Unicode-lookalike text (e.g. 𝙼𝚒𝚌𝚛𝚘soft)
Attackers insert distracting text hidden in non-displayed HTML (to confuse NLP models)
Attackers encode text in images (to prevent NLP without OCR)
Phishing pages render all content with javascript and require a CAPTCHA to access (to prevent automated crawling)
Attackers send innocuous emails to an organization for months to build up communication/reputation features
Attackers include text in password-protected PDF files attached to an email (to prevent automatic parsing of attachments)
Attackers sit on a compromised account for months waiting for the right moment, for example by inserting an illegitimate invoice into a conversation about payment at just the right time.
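As a small illustration of countering the first trick in the list above, here is a Python sketch that folds Unicode lookalike text back to plain letters using the standard library's NFKC normalization. This is one possible preprocessing step, shown as an assumption about how such text could be handled, not a description of our production pipeline; the function name is hypothetical.

```python
import unicodedata


def normalize_lookalikes(text: str) -> str:
    """Fold compatibility characters (styled letters, fullwidth forms, etc.)
    to their plain equivalents using Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# "Micro" written with Mathematical Monospace letters, followed by ASCII "soft",
# as in the example from the list above.
disguised = "\U0001D67C\U0001D692\U0001D68C\U0001D69B\U0001D698soft"

print(disguised == "Microsoft")                        # False: different code points
print(normalize_lookalikes(disguised) == "Microsoft")  # True after normalization
```

NFKC only handles characters that Unicode defines as compatibility variants. Homoglyphs borrowed from other scripts, such as a Cyrillic "а" standing in for a Latin "a", pass through unchanged and need a separate confusables mapping.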

Successfully solving the problem

At Abnormal we’ve built an effective solution that relies on good data engineering, data science, and robust scalable systems underneath powerful ML models.

The key to detecting malicious emails is good representations and discriminative featurization of the data. At the end of the day, our detection system relies on three dimensions of an email:

All this data is pulled together and made available on the receipt of each email. Our ensemble of detection models uses these data sources in various ways, and we have redundant detectors for various classes of particularly damaging attacks. The core engine of detection is a multi-modal ML model.

Malicious Email Classifier

When we first started solving this problem, we used simple GBDT models on top of basic text and communication-graph based features, which I highly recommend to get started on this, or any other, ML problem. Begin with the simplest models you can before going onto sophisticated deep learning approaches.

Eventually, we outgrew simple models, and have found deep learning architectures particularly useful, not only for their predictive power but also for the convenience of including multiple modalities of data inside the same model. For example, one of our most powerful models needs to consume text data of various forms alongside tabular data:

The ability to easily combine and train models representing multiple dimensions of an email in this way has been extremely powerful and helps us generalize to never-before-seen attacks.

Thresholding the model output is one of the hardest problems. Since the precision must be so high, we must be very careful that the model performs well autonomously and can handle distributional shifts.

(I discuss difficulties around thresholding more in a previous post)
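As a rough sketch of what thresholding for high precision involves, the Python below picks the lowest score threshold whose precision on a labeled validation set meets a target. The function name and the toy scores are hypothetical, and this simple scan ignores the distributional-shift problem mentioned above: a threshold chosen this way is only as good as the validation data it was chosen on.

```python
from typing import Optional, Sequence


def threshold_for_precision(
    scores: Sequence[float], labels: Sequence[int], target: float
) -> Optional[float]:
    """Lowest threshold t such that flagging every item with score >= t
    gives precision >= target on this data. Returns None if no threshold works."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume every item tied at this score so the threshold is well defined.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= target:
            best = score  # a lower threshold means more recall at acceptable precision
    return best


# Toy validation set: model scores and whether each email was truly an attack.
scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.40, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0]

print(threshold_for_precision(scores, labels, 0.75))  # 0.8
print(threshold_for_precision(scores, labels, 1.0))   # 0.95
```

Taking the lowest qualifying threshold maximizes recall for the given precision target. With so few positives in real data, the precision estimate at each threshold is noisy, so in practice one would want a large validation set and some safety margin.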

Some lessons

Study the attacks, trends, and false negatives. Build your data pipelines and features to represent what attackers might try next as well as variations on what you have seen.
Use human experts to create heuristics! Heuristic features can be some of the most powerful inputs into a model even if they are not sufficiently precise to be standalone rules.
Build a portfolio of detection models, heuristics, and signatures. There can never be too much redundancy in a detection stack.
Ensure each sub-model is representing its sub-problem well before trying to combine them into a larger network, or you will never make progress.
Always baseline and iterate. For example, when building a text model, start simple and add complexity only when you can show it is better: (1) try heuristic phrases, (2) try bag of words, (3) try canned embeddings like fastText, (4) try fine-tuning your embeddings, (5) try state-of-the-art techniques like BERT.
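The first two rungs of that ladder need nothing beyond the standard library. The Python sketch below shows heuristic phrase features and a bag-of-words representation; the phrase list and function names are invented for illustration and are not the heuristics we actually use.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical phrases a human expert might flag; a real list would come
# from studying actual attacks.
SUSPICIOUS_PHRASES = ("verify your account", "wire transfer", "password expires")


def heuristic_phrase_features(text: str) -> Dict[str, int]:
    """Step 1: one binary feature per expert-chosen phrase."""
    lowered = text.lower()
    return {f"has_phrase:{p}": int(p in lowered) for p in SUSPICIOUS_PHRASES}


def bag_of_words(text: str) -> Counter:
    """Step 2: token counts, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


email = "Please VERIFY your account today. Your account is locked."

print(heuristic_phrase_features(email))
print(bag_of_words(email).most_common(3))
```

Either representation can be fed to a simple model as a baseline, and each later step (canned embeddings, fine-tuning, BERT) then has a concrete number to beat.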

Bringing it all together

For this whole detection system to work successfully, we must ingest terabytes of data a day (all the incoming emails and other signals we have for an organization) and maintain sophisticated aggregated feature stores to keep track of the communication patterns in realtime. Online, we must process every incoming email at low latency: extract and join data, possibly even crawl links and process attachments, then apply NLP models and featurization. Then we must pass the data through classification models and combine the results of all models for a final decision on whether the email is malicious. All this must be done at latencies of less than a second while maintaining an extremely low false-positive rate. This is a difficult engineering and data science problem.

We’ve built a powerful detection system at Abnormal Security that can prevent the most advanced targeted phishing, business email compromise, lateral phishing, and account compromise. But the problem is ever-changing as attackers learn to thwart these systems, so our work is never done.

Some (of many) major challenges we are continually improving

How can we detect invoice fraud better? To do so we must deeply understand the natural language and images in an invoice to identify abnormal patterns.
How can we detect phishing sites with higher accuracy? For an AI system to understand the content of the page we must render that page and use computer vision and NLP to identify malicious intent.
How can we better identify account takeovers? This is a difficult anomaly detection problem with enormous volumes of data: what constitutes unusual usage behavior, and what do emails from a compromised account look like?

Acknowledgments

All this work at Abnormal wouldn’t have been possible without the amazing team, especially those working closely on building this detection engine from the ground up: Dmitry Chechik, Kevin Lau, Sanny Liao, Yu Zhou Lee, Abhijit Bagri, Carlos Gasperi, James Yeh, Sanjay Jeyakumar, and the rest of the team.

And yes, we’re hiring! abnormalsecurity.com/careers

Lessons from building AI to Stop Cyberattacks was originally published in Abnormal Security Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.