10min DIY: Detect Insider Threats | IForest on CERT Data 🔥

Ezzeldin Tahoun
cyberdatascience.org
Jul 5, 2022

🎯Motivation

Cyber security professionals (analysts, investigators, engineers, architects, consultants) can benefit from using simple data science models to help them spot suspicious activities. The goal of every post is to give the high-level steps and the commented code needed to perform an end-to-end data science exercise of high threat detection value.

🚪Intro

Definition of Insider Threat: The threat that an insider will use her/his authorized access, wittingly or unwittingly, to do harm to the security of organizational operations and assets, individuals, other organizations, and the Nation. This threat can include damage through espionage, terrorism, unauthorized disclosure of national security information, or through the loss or degradation of organizational resources or capabilities. Ref: https://csrc.nist.gov/glossary/term/insider_threat

Insider threat actions include unsanctioned data transfer or sabotage of business resources. The threat manifests in various forms and stems from a myriad of motivations. More: https://www.cisa.gov/defining-insider-threats

📊Dataset

Carnegie Mellon University Software Engineering Institute CERT Insider Threat

There are several threat scenarios in the dataset, such as leaks, theft, and sabotage.

Ref: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099
Paper: https://ieeexplore.ieee.org/document/6565236
Link: https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1

Data Set Variable Causal Interdependencies. Read A→B as “A influences B” (Glasser et al., 2013)

🏋️‍♂️10 min exercise

Let’s detect the obvious in the data in a 10-minute exercise. The naïve route here is feature engineering based on subject matter expertise, followed by simple time series anomaly detection. We do so in 7 important steps.

Follow along by clicking “Open in Colab” here:

1- 👩🏽‍🔬Data Acquisition. We download the data and load it into pandas data frames. We use version 4.2 as it has more insider threat incidents than the others.

If you wish: you can use a subset of the data on your SIEM.
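Loading a CERT log into pandas can be sketched as below. The real dataset ships large CSVs (logon.csv, device.csv, etc.); for a self-contained illustration we parse a tiny in-memory sample shaped like logon.csv, with made-up ids and values. The column set here follows the dataset's documented schema, but verify it against your extracted copy.

```python
import io

import pandas as pd

# Stand-in for logon.csv; with the real data, pass the CSV path
# instead of this StringIO buffer. The ids and rows are invented.
sample = io.StringIO(
    "id,date,user,pc,activity\n"
    "id1,01/02/2010 07:55:00,AAA0001,PC-1234,Logon\n"
    "id2,01/02/2010 18:10:00,AAA0001,PC-1234,Logoff\n"
)
logon = pd.read_csv(sample, parse_dates=["date"])
print(logon.shape)                    # (2, 5)
print(logon["activity"].tolist())     # ['Logon', 'Logoff']
```

With the full files, `pd.read_csv(path, parse_dates=["date"])` is usually enough; for very large CSVs, consider the `chunksize` parameter to iterate in batches.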

2- 🕵️Data Selection. We read the “device”, “email”, “file”, “logon”, “http” CSVs.
- For device we are interested in the “date”, “user”, “activity” features.
- For email we are interested in the “date”, “user”, “to”, “cc”, “bcc” features.
- For file we are interested in the “date”, “user”, “filename” features.
- For logon we are interested in the “date”, “user”, “activity” features.
- For http we are interested in the “date”, “user”, “url” features.

If you wish: you can use CSVs with data from your organization (e.g., Suricata/Zeek or EDR logs).
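The per-file column selection above can be expressed with `usecols`, which skips parsing the columns we don't need. The mapping mirrors the list above; the sample row is invented for illustration.

```python
import io

import pandas as pd

# Columns of interest per CSV, as listed in the post.
COLS = {
    "device": ["date", "user", "activity"],
    "email":  ["date", "user", "to", "cc", "bcc"],
    "file":   ["date", "user", "filename"],
    "logon":  ["date", "user", "activity"],
    "http":   ["date", "user", "url"],
}

# In-memory stand-in for file.csv; with the real data you would pass
# the CSV path instead of the StringIO buffer.
file_csv = io.StringIO(
    "id,date,user,pc,filename\n"
    "id1,01/04/2010 09:12:00,BBB0002,PC-9,report.pdf\n"
)
file_df = pd.read_csv(file_csv, usecols=COLS["file"])
print(list(file_df.columns))  # ['date', 'user', 'filename']
```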

3- 👷Feature Engineering. We create the following subject matter expertise engineered additional features:
- File type (exe, zip, pdf, etc.)
- Logon time (Business Hours, After Hours, Weekend)
- Removable media time (Business Hours, After Hours, Weekend)
- Email receiver (External address, Internal address)

If you wish: you can improve your detection by engineering more features, such as employee communication sentiment (negative, positive, neutral) or employee browsing and URL intelligence (top 1000, malicious, competitor, filesharing, domain generation algorithm, etc.).
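The four engineered features above can be sketched as follows. The business-hours window (Mon–Fri, 8:00–18:00) and the internal domain value are assumptions for illustration; adjust both to your environment.

```python
import pandas as pd

def time_bucket(ts: pd.Timestamp) -> str:
    """Business Hours (Mon-Fri 8-18), After Hours, or Weekend.
    The 8-18 window is an assumption; adjust to your organization."""
    if ts.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        return "Weekend"
    return "Business Hours" if 8 <= ts.hour < 18 else "After Hours"

# File type: the extension after the last dot.
files = pd.DataFrame({"filename": ["notes.pdf", "tool.exe"]})
files["file_type"] = files["filename"].str.rsplit(".", n=1).str[-1]

# Logon time / removable media time: bucket the timestamp.
logons = pd.DataFrame({"date": pd.to_datetime(
    ["2010-01-04 09:00", "2010-01-09 23:30"])})
logons["logon_time"] = logons["date"].apply(time_bucket)

# Email receiver: internal vs external, keyed on an assumed org domain.
emails = pd.DataFrame({"to": ["bob@dtaa.com", "eve@gmail.com"]})
emails["receiver"] = emails["to"].str.endswith("@dtaa.com").map(
    {True: "Internal address", False: "External address"})

print(files["file_type"].tolist())    # ['pdf', 'exe']
print(logons["logon_time"].tolist())  # ['Business Hours', 'Weekend']
```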

4- 👩‍💻Data Processing. We now vectorize the data and split it into train/test subsets containing anomalous and benign entries.
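A minimal sketch of this step, assuming the categorical features from above and a label column marking known insider activity (the rows here are made up): one-hot encode, then split while preserving the class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the engineered categorical features and a label
# (1 = known insider activity). Values are illustrative only.
df = pd.DataFrame({
    "logon_time": ["Business Hours", "After Hours",
                   "Weekend", "Business Hours"],
    "file_type":  ["pdf", "exe", "zip", "pdf"],
    "label":      [0, 1, 1, 0],
})

# Vectorize categoricals into 0/1 indicator columns.
X = pd.get_dummies(df[["logon_time", "file_type"]])
y = df["label"]

# Stratify so anomalous rows land in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42)
print(X_train.shape[0], X_test.shape[0])  # 2 2
```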

5- 🥼Choosing a model. As per the figure, a variety of models exist.
We choose the Isolation Forest in this case. For simplicity, we don’t run a hyperparameter tuning experiment here; instead, knowing the dataset, we set the contamination parameter to an educated guess.

IForest is an ensemble technique built on numerous isolation trees, tree structures constructed to isolate every single instance efficiently. The idea rests on the susceptibility of anomalies to isolation: anomalies are found closer to the root of the tree, whereas ordinary points are isolated deeper in the tree. Isolation Forest builds an ensemble of isolation trees and defines anomalies as the instances with the shortest average path lengths on the trees formed. (Tahoun, 2022)

If you wish: you can improve your detection by experimenting with different models and hyperparameters. Many outlier detection models exist in the PyOD library. Use sklearn’s grid search if you want to be thorough, or random search for quick exploration of the parameter space.
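Fitting the model can be sketched with scikit-learn's `IsolationForest` on synthetic data (made up here: a benign cluster plus a few planted outliers standing in for insider activity). The contamination value is an educated guess, as in the post.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
benign = rng.normal(0, 1, size=(200, 2))      # ordinary activity
anomalous = rng.normal(8, 1, size=(5, 2))     # planted outliers
X = np.vstack([benign, anomalous])

# contamination = guessed fraction of anomalies in the data.
model = IsolationForest(
    n_estimators=100, contamination=0.03, random_state=0)
preds = model.fit_predict(X)   # -1 = anomaly, +1 = inlier
print((preds[-5:] == -1).all())  # True: planted outliers are flagged
```

`model.score_samples(X)` exposes the underlying anomaly scores (lower means more anomalous), which is what the thresholding step below works with.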

Anomaly detection approaches arranged in the plane categorized by their underlying type. (Tahoun, 2021)

6- 📐Choosing a threshold. We compute anomaly scores for the training data and plot the scores for both anomalous and benign entries. We eyeball the threshold and choose a subjectively good enough cutoff.

If you wish: you can improve your detection by plotting an ROC curve and choosing the optimal threshold (e.g., the point where TPR + FPR = 1, or max[TPR + 1 − FPR]).
FYI: TPR: True Positive Rate, FPR: False Positive Rate
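The max[TPR + 1 − FPR] criterion (Youden's J statistic) can be computed directly from scikit-learn's ROC output. Scores and labels here are illustrative, with higher scores meaning more anomalous.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1])          # 1 = anomalous
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)   # maximizing TPR + 1 - FPR = TPR - FPR + 1
print(thresholds[best])       # 0.7: separates both positives cleanly
```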

7- 🔬Evaluating Results. We observe our confusion matrix. Our model did well enough on this dataset, with a limited number of false positives.
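Reading off the confusion matrix can be sketched as below; the labels and predictions are illustrative, not results from the dataset.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]   # 1 = insider activity
y_pred = [0, 0, 0, 1, 1, 1]   # thresholded anomaly decisions

# sklearn's 2x2 matrix unravels as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 0 2: one false positive, no missed insiders
```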

If you wish: you can iterate on the experiment for another 10 minutes by reviewing the various decisions we made in the previous steps and making them more thorough. As you improve this experiment, jot down your learnings in your copy of the Colab notebook, and I encourage you to test a second dataset from your organization or even your home lab SIEM.

I hope you enjoyed this short exercise and were able to get your hands dirty with some straightforward cyber data science. Don’t hesitate to ask any questions, and let me know what the next DIY exercise should be.

Until next time👋


“From error to error one discovers the entire truth.” -Sigmund Freud