Replacing Static Authentication Detections With Anomaly Based Detections

Ryan Glynn
Compass True North
Jul 27, 2022

Due to the drastic increase in remote work, Multi-Factor Authentication (MFA) has become the primary defense mechanism against account takeover attacks. While this is a great additional layer of security, the rest of the security industry needs to adjust to this trend. Within the threat detection space, for example, detective measures have not caught up to the remote-work shift. We can see this in security vendors and blogs continuing to push detections like impossible distance¹ and counts of authentication attempts² as the most commonly discussed security monitoring use cases regarding authentication.

But just as businesses had to adapt their applications to accommodate multi-factor authentication, detection teams must adapt their use cases to account for the changing landscape. Today, employees log in from home and from all over the world, use personal devices, and in some cases route their entire home network through a VPN. These behaviors make geolocation data less reliable: VPNs mask the employee's true authentication location, and cellular hotspots are notoriously unreliable from a geolocation perspective. The result is that detections like impossible distance generate a lot of noise because of these changes in employee behavior.

Not only are these methods unreliable in catching malicious actors, but reliance on MFA as the primary defense is a dangerous practice to adopt. To make matters worse, Advanced Persistent Threats (APTs) like Cozy Bear have exploited MFA push notifications to capitalize on the common human behavior of always pressing "Accept" on an MFA request. This type of attack has been successful against large companies such as Microsoft, Okta, and Nvidia, with a member of Lapsus$ (a notorious hacking group) claiming that MFA prompt-bombing allowed them to establish simultaneous logins for a Microsoft employee from Germany and the USA.

What can we as a security community do to improve our detection of compromised accounts in a world where people are working from anywhere and on any device?

Luckily for us, humans are still creatures of habit. This opens the door to machine learning algorithms like Isolation Forest as a replacement for antiquated ways of finding compromised accounts. For example, somebody might work from Bora Bora for a week while using a VPN that exits in Australia, but their device (e.g., macOS 12) and browser (e.g., Chrome) will most likely stay the same, along with the cookies the identity platform stores alongside them.

In this post, I will discuss the benefits of creating homegrown machine learning detections. Behavior-based anomaly detection to prevent account takeover (ATO) is difficult when signals such as MFA and geolocation data are viewed in a vacuum, and it is equally difficult to rely on vendor-created data models for "risk-based" authentication determinations because they ignore the realities of your environment. This was the impetus to explore building Isolation Forest machine learning models to correlate our data and build detections that account for the shift to a post-COVID distributed workforce.

What is Isolation Forest?

Isolation Forest is a machine learning algorithm built from decision trees, which are essentially flow charts of yes/no splits. But where a regular decision tree tries to classify the data at each split, Isolation Forest makes random splits and looks for points that can be separated from the rest of the data in very few of them. A point that is isolated after only a handful of splits is an anomaly, because few similar data points exist around it.

Here are some good resources that explain Isolation Forest more deeply: here and here.
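To make that intuition concrete, here is a minimal sketch using scikit-learn. The data is made up: a user's typical login hours plus one 3 a.m. login, which takes the fewest random splits to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: hour-of-day for one user's logins, plus one 3 a.m. login
login_hours = np.array([[9], [10], [9], [11], [10], [9], [10], [11], [9], [3]])

# Each tree makes random splits; the 3 a.m. point is isolated in very few
# splits, so it receives the most anomalous score
model = IsolationForest(random_state=0).fit(login_hours)
labels = model.predict(login_hours)  # -1 = anomaly, 1 = normal

print(labels[-1])  # the 3 a.m. login is flagged as -1
```

The same mechanics apply to the categorical authentication features discussed later, once they are encoded numerically.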

More Flexibility

One reason to build your own Isolation Forest is that vendor-provided anomaly detection solutions are typically rigid in what they expect, accept, and use as data points. Security vendors generalize their models to fit the broadest possible range of customer environments. By developing a custom model, you can improve accuracy by adding context and data points the generic model does not have.

In a similar respect, every business's environment is different, so building your own model lets you add context and enrichment specific to how the employees at your company work. This, again, creates a better overall model.

Exploratory Analysis

Exploratory analysis is essentially a manual investigation of statistical breakdowns of the dataset to identify high-level trends or behaviors; for example, graphing authentications by risk score and whether or not MFA triggered can reveal whether your risk threshold is set too low. This kind of analysis can also identify policies you may want to implement to further curtail unwanted behavior that contributes to anomalous authentications. Not only can a vendor's model perform worse, but relying on one also denies you the knowledge gained from the exploratory analysis needed to roll your own. While this is an additional step that makes the approach less plug-and-play, exploratory analysis can surface overarching themes of employee behavior you may want to correct. As examples, you may find that a large portion of the initial anomalous authentications stem from the way tech support currently troubleshoots individual accounts, or from a habit of credential sharing within specific teams or departments.
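As a sketch of where that first graph might start (column names here are hypothetical stand-ins for whatever your identity logs export), a quick cross-tabulation of risk-score bands against whether MFA triggered shows at a glance whether risky auths are skating past MFA:

```python
import pandas as pd

# Hypothetical export of authentication events
auths = pd.DataFrame({
    "risk_score":    [12, 25, 48, 55, 71, 90, 88, 15],
    "mfa_triggered": [False, False, False, True, True, True, True, False],
})

# Band the risk scores, then look at the MFA rate per band; a high-risk band
# where mfa_triggered is mostly False suggests the MFA policy threshold is off
bands = pd.cut(auths["risk_score"], bins=[0, 30, 60, 100],
               labels=["low", "medium", "high"], include_lowest=True)
mfa_rate = pd.crosstab(bands, auths["mfa_triggered"], normalize="index")
print(mfa_rate)
```

From a table like this it is a short step to the policy questions described above, such as why medium-risk authentications only sometimes trigger MFA.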

This analysis will not only reveal unwanted behaviors but can also help reduce the SOC workload those behaviors generate.

Implementing It Against Authentication Data

Now that we have briefly covered why, let's discuss how. In addition to the Medium posts linked earlier that explain how to implement Isolation Forest generally, I am going to discuss how Compass implemented the model in our security monitoring environment.

Workflow

High Level Workflow for Model Execution and Automatic Response

The above diagram is a high-level workflow of our implementation.

Essentially it performs the following using Apache Airflow:

  1. Pulls the data and model from S3 (we store the latest 6 months of data for baselining).
  2. Queries the newest data from our SIEM.
  3. Combines the new data with the old data, truncates to the latest ~6 months and re-uploads the data to S3.
  4. Executes the model against the newest data and filters to only the anomalies.
  5. Grabs additional contextual data from the SIEM for the anomalies.
  6. Makes a determination about whether to send the user in question a Slack message or an email, or to escalate as an incident to our on-call.
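Steps 5 and 6 are where most of the environment-specific logic lives. A minimal sketch of the routing decision might look like the following; the field names and the thresholds are illustrative placeholders, not our production criteria:

```python
def decide_response(anomaly: dict) -> str:
    """Route an anomalous authentication after contextual enrichment.

    Field names and thresholds are hypothetical placeholders.
    """
    asset_ip, auth_ip = anomaly.get("asset_ip"), anomaly.get("auth_ip")
    # An asset-management check-in from the same IP explains the auth
    if asset_ip is not None and asset_ip == auth_ip:
        return "suppress"
    # High identity-provider risk score: escalate as an incident to on-call
    if anomaly.get("risk_score", 0) >= 80:
        return "page_oncall"
    # MFA push accepted from a different IP than the auth: worth an email
    if anomaly.get("mfa_ip") != auth_ip:
        return "email"
    # Otherwise, ask the user over Slack to confirm it was them
    return "slack"


print(decide_response({"auth_ip": "203.0.113.5", "mfa_ip": "203.0.113.5",
                       "risk_score": 12}))  # slack
```

The ordering encodes a severity ladder: suppress what context explains, page on what context makes worse, and fall back to asking the user.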

We chose Apache Airflow for this task for two reasons:

  1. Airflow is a reliable scheduler system. Just like how a lot of teams may use Airflow to schedule ETL workloads, Airflow also works great when you need to query in continuous intervals (say every hour) against multiple other systems.
  2. Our SIEM does not support Python or ML toolkits out of the box. Because of this, we needed something that could both plug into our SIEM infrastructure and execute Python scripts.

In this specific implementation, a separate model is built for every single employee (since each employee's default behavior may differ). Depending on how large your organization is, or other constraints you may have, it may be better to generalize to one model for the entire company or per team/department. Currently, with approximately 5,000 employees, the hourly run completes in under 5 minutes, so resource constraints are not a major concern.

The Data Points

Below is a list of data points utilized in the workflow. This is not a hard-set list and if you have better or other data points to use, more power to you!

Data points used in the workflow:

  • Operating System
  • OS Major Version
  • Country of Authentication (based on Auth IP)
  • Browser
  • Region (State or Territory) (based on Auth IP)
  • ISP (based on Auth IP)
  • MFA IP
  • Browser Fingerprint
  • Presence of a Risk Cookie (cookie that the identity provider stores from a previously successful authentication)
  • Risk Score
  • Host Attribution Data (e.g. from an asset management system)

The MFA IP and host attribution data are used as post-anomaly filtering criteria. For example, if an authentication was flagged as anomalous but the user in question has an asset check-in at the same IP address around the same time as the authentication, the flagged anomaly won't create an alert.

Similarly with MFA IP: if your identity provider offers push notifications as an MFA option, then it hopefully also logs the IP address of the device that accepted the push notification. This allows you to compare the IP initializing the auth with the IP that accepted the MFA. Assuming you do not have an already-compromised account with a maliciously configured MFA device, you can use these data points to identify benign events.
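A sketch of that post-anomaly filtering is below. All column names are hypothetical, and `checkins` stands in for an export from your asset-management system:

```python
import pandas as pd

def drop_attributed(alerts: pd.DataFrame, checkins: pd.DataFrame,
                    window: str = "1h") -> pd.DataFrame:
    """Drop flagged anomalies explained by benign context.

    An anomaly is treated as benign when the MFA push was accepted from the
    same IP that initiated the auth, or when the user's managed asset checked
    in from that IP within `window` of the authentication.
    """
    alerts = alerts.sort_values("timestamp")
    # Attach the nearest asset check-in (per user) within the time window
    merged = pd.merge_asof(
        alerts,
        checkins.sort_values("timestamp"),
        on="timestamp", by="user",
        tolerance=pd.Timedelta(window), direction="nearest",
    )
    benign = (merged["mfa_ip"] == merged["auth_ip"]) | (
        merged["asset_ip"] == merged["auth_ip"]
    )
    return merged[~benign]
```

`merge_asof` handles the "close to the same time frame" matching: each alert picks up at most one check-in per user within the tolerance, and users with no nearby check-in simply get a null `asset_ip` that never matches.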

Additional fields that were considered for the model but ended up being removed:

  • Browser Version
  • City of Authentication

Both of these fields proved too unreliable. Browser versions change all the time, so every time a browser updated, the user would appear to log in from a version they had never used before. With city-level geolocation, mobile hotspotting can cause IPs to bounce between many neighboring cities. During our implementation, region and country proved more effective.

The Code

Now to the meat of it all. Because certain parts of the code are specific to our environment, only the aspects that can be generalized are shown. The benefit of rolling your own in your environment is that you may have additional contextual data to further improve the model; for the same reason, I will not go over how to handle individual data points, and all code that uses credentials just has generalized variable names.

Assumptions

This section assumes that you already have a method of exporting your historical authentication data into a CSV with correct formatting.

External Libraries

  • Pandas
  • Numpy
  • Scikit-learn
  • Boto3
  • Matplotlib
  • Shap

Grabbing the data from S3 and Loading into Pandas

This snippet does not cover how to grab and clean the data from your identity logs (as that will vary by SIEM platform and identity provider); it just shows the basics of loading the data from S3.
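A minimal sketch of that load step follows. The bucket, key, and column names (including the `timestamp` column) are placeholders for whatever your export produces:

```python
import io

import pandas as pd

def frame_from_csv_bytes(raw: bytes) -> pd.DataFrame:
    """Parse the exported authentication CSV into a DataFrame."""
    # parse_dates assumes the export includes a `timestamp` column
    return pd.read_csv(io.BytesIO(raw), parse_dates=["timestamp"])

def load_auth_data(bucket: str, key: str) -> pd.DataFrame:
    """Download the ~6 month baseline CSV from S3 (names are placeholders)."""
    import boto3  # deferred so the parser above is usable without boto3
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return frame_from_csv_bytes(obj["Body"].read())
```

Splitting the download from the parse keeps the parsing testable locally without S3 credentials.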

Building the Model

The way we implemented this, we build a model not only per employee but also per auth type (cell phone vs. laptop). Additionally, users that have had fewer than 10 auths or have been active for less than 60 days are excluded, as there is not enough data to create a proper baseline.
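A sketch of that loop is below. The column names are hypothetical, and factorizing the categorical features inside the loop is a simplification; in production you would persist a consistent encoding (and the fitted models) across runs:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative subset of the data points listed earlier
FEATURES = ["os", "country", "region", "browser", "isp"]

def build_alerts(df: pd.DataFrame, prior: pd.Timestamp) -> pd.DataFrame:
    """Fit one Isolation Forest per (user, auth_type); return new anomalies."""
    frames = []
    for (user, auth_type), group in df.groupby(["user", "auth_type"]):
        active_days = (group["timestamp"].max() - group["timestamp"].min()).days
        # Skip users without enough history to form a proper baseline
        if len(group) < 10 or active_days < 60:
            continue
        # Encode categorical features as integers for the model
        X = group[FEATURES].apply(lambda col: pd.factorize(col)[0])
        model = IsolationForest(random_state=0).fit(X)
        flagged = group[model.predict(X) == -1]
        # Only alert on anomalies newer than the previous Airflow run
        frames.append(flagged[flagged["timestamp"] > prior])
    if not frames:
        return pd.DataFrame(columns=df.columns)
    return pd.concat(frames)
```

Training on the same window you score is another simplification; the scheme from the workflow section (baseline on ~6 months, score only the newest batch) slots in by fitting on the historical rows and predicting on the new ones.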

The loop above creates a model per employee and per auth type, then builds a dataframe named "alerts" recording the anomalies that occurred after the "prior" date stored in Airflow ("prior" is simply a record of when Airflow last ran).

From here, you can implement additional logic checks against the alerts dataframe, such as removing events where the asset IP matches the authenticating IP. If your environment doesn't need that, the only remaining step is to send the data to your case management system of choice.

Helping Your SOC Understand It

While the above code provides the building blocks to implement the detection, I'd recommend adding historical data and visualizations to help responders understand why an event is considered anomalous.

Adding Historical Context

Since we already pull the last 6 months of data into the Airflow DAG, the history of each user's authentications is readily available. One helpful investigative aid is curating historical behaviors as additional fields, making the comparison of the current event to previous events more easily understandable.

As an example, we can generate two additional columns for the operating system field — one to list the previous operating systems and the other as a binary flag to indicate if the operating system changed in this latest event.
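A sketch of that enrichment (hypothetical column names) for any single field:

```python
import pandas as pd

def with_history(event: pd.Series, history: pd.DataFrame, field: str) -> pd.Series:
    """Annotate a flagged event with the user's prior values for `field`."""
    prior = history.loc[history["timestamp"] < event["timestamp"], field]
    seen = sorted(prior.dropna().unique())
    event = event.copy()
    event[f"previous_{field}s"] = ", ".join(seen)
    # Binary flag: 1 if this value was never seen before, else 0
    event[f"{field}_changed"] = int(event[field] not in seen)
    return event


history = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-06-01", "2022-06-15"]),
    "os": ["macOS", "macOS"],
})
event = pd.Series({"timestamp": pd.Timestamp("2022-07-01"), "os": "Windows"})
enriched = with_history(event, history, "os")
print(enriched["previous_oss"], enriched["os_changed"])  # macOS 1
```

Calling this once per modeled field yields the full historical-profile section shown below.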

We can repeat this for all of the key elements of the model to make investigations very straightforward with a section of the case being dedicated to historical behavior like the following:

Example of a historical profile of a user

Adding Visual Explanations

Additionally, to further understand why the model found an individual event anomalous, we can use the Shap library to visually explain the logic to whoever is investigating.

This saves an image visualizing the SHAP contributions for the event.

This visualization makes it easy to see which features had the greatest impact in flagging the anomaly. The attribute flags let the investigator determine which attributes contributed to the anomalous event; in the example above, the operating system was Windows, which deviates significantly from this individual's baseline. Investigators can work from top to bottom to get a full picture of the event, why it's anomalous, and how it deviates from historical behavior.

Putting this all together gives the full pipeline: pull the baseline from S3, build per-employee models, filter the resulting anomalies with contextual data, and enrich the alerts with history and SHAP explanations before shipping them to case management.

Summary

To summarize:

  1. An Isolation Forest model can identify anomalous authentications that your standard static detections miss.
  2. Using at least 6 months of historical data to create the baseline is optimal.
  3. Creating a model per user, per team, or per department is best (depending on your resource availability, company size, etc.).
  4. Building the model yourself, instead of using a vendor-sourced one, provides more flexibility and greater insight into your employees' typical authentication behaviors.
  5. Airflow is a great mechanism to schedule and automate the continuous execution of these models if your SIEM does not support Python.

While I did not cover all the code necessary to create the entire workflow (since chunks of it depend on upstream platforms that vary by company), this should give a basic overview of how to implement anomaly detection against your authentication logs. I hope you enjoyed the post, and please comment with any questions or issues!
