Microsoft Malware Prediction Using Classical Machine Learning Algorithms

Published in

The Startup

14 min read · Nov 11, 2020

As part of a self case study, I picked the Microsoft Malware Prediction problem from Kaggle, an online community of data scientists and machine learning practitioners that hosts competitions aimed at solving real-world problems with machine learning and artificial intelligence.

In this blog, I will explain how I approached and solved this problem using classical machine learning algorithms.

Content:

Business Problem
Dataset
Evaluation Metric
Exploratory Data Analysis
Feature Engineering
Encoding Categorical Features
First Cut Approach
Custom Stacking Classifier
Model Comparisons
Final Model
Model Productionization
Future Work
References

Business Problem:

Malware attacks are a rising concern for the security and integrity of data for users and organizations running Windows systems. According to NetMarketShare (source: Wikipedia), their September 2019 survey of desktop OS market share put Microsoft Windows at 87%. This makes it a major challenge for Microsoft to provide antivirus and other security products that can accurately and efficiently predict whether a machine will be infected with malware, so that users' data stays secure and protected from such attacks.

As part of its overall strategy, Microsoft challenged the data science community to develop machine learning systems that can predict whether a machine will soon be hit by malware.

So, I developed a machine learning system that predicts the probability of a machine getting infected with malware.

Dataset:

For this problem, I collected the data, i.e. train.csv and test.csv, directly from Kaggle, where this competition was hosted.

Train.csv :- This file contains 8921483 entries, where each entry corresponds to a machine uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates whether malware was detected on the machine. The file contains 83 columns, including MachineIdentifier and HasDetections, on which the model is trained. I use the terms features, columns and variables interchangeably throughout this blog.
Test.csv :- This file contains 7853253 entries and 82 columns: the same columns as train.csv, including MachineIdentifier, but without HasDetections.

The dataset can be downloaded from here.

Evaluation Metric:

For this problem, model performance was evaluated on the area under the ROC curve (AUC) between the predicted probability and the true label.

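The area under the ROC curve tells us how well a model distinguishes between the positive and negative classes: it is the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one.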
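As a small illustration of the metric, here is a sketch on toy labels and scores (made up for this example, not competition data). It computes AUC with scikit-learn's roc_auc_score and checks it against the pairwise-ranking view of the same number.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy data for illustration only: 1 = malware detected, 0 = not detected.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of class 1

auc = roc_auc_score(y_true, y_score)

# Equivalent view: the fraction of (positive, negative) pairs in which the
# positive example gets the higher score (ties count as half a win).
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print(auc, pairwise_auc)
```

Three of the four positive/negative pairs are ranked correctly here, so both values come out at 0.75.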

For this problem statement, I achieved a final AUC score of 0.749.

For a detailed explanation of the area under the ROC curve, you can refer to this blog.

Exploratory Data Analysis:

1. Null values analysis:

As part of exploratory data analysis and data preprocessing, I first worked out which features (i.e. columns) contain more than 70 percent null values, because such features contribute little to model training, and removed those features from the dataset.

From the above result, it was clear that 5 features contain more than 70 percent null values, so I removed those features.
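The null-value filter described above can be sketched with pandas roughly as follows. The helper name drop_sparse_columns and the toy columns mostly_missing and partly_missing are made up for this illustration; only the 70 percent threshold comes from my analysis.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.70) -> pd.DataFrame:
    """Drop every column whose fraction of null values exceeds `threshold`."""
    null_fraction = df.isnull().mean()  # per-column share of nulls
    to_drop = null_fraction[null_fraction > threshold].index
    return df.drop(columns=to_drop)


# Toy frame for illustration only (hypothetical column names).
toy = pd.DataFrame({
    "MachineIdentifier": ["a", "b", "c", "d", "e"],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% null
    "partly_missing": [1.0, np.nan, 3.0, 4.0, 5.0],           # 20% null
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Only the column with 80 percent nulls is dropped; the one with 20 percent nulls is kept for the later missing-value filling step.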

2. Types of features:

After removing the mostly-null features from the train data, I identified the type of each feature in the dataset, so that I could preprocess the data according to feature type.

3. Filling Missing values:

After identifying the feature types and removing the features with more than 70 percent null values, some features still had null values. So, as part of data preprocessing, I filled those missing values and cleaned up the categories of the categorical features in order to reduce their cardinality.

4. Removing outliers from data:

Outliers in a dataset can sometimes deteriorate model performance. Tree-based models like Decision Trees, Random Forests, LightGBM and XGBoost, which I used in my first cut approach, are less affected by outliers, but to be on the safer side I removed outliers from the training dataset using the Z-score method. I kept only the data points (rows) whose absolute Z-score is less than 3, because a Z-score beyond 3 is commonly treated as an outlier. The Z-score describes a data point by its distance from the mean of the group, measured in standard deviations, and is calculated for features that contain continuous values.

There are various other techniques to remove outliers from a dataset. You can refer to this good blog on ways to detect and remove outliers.
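A minimal sketch of this Z-score filter is shown below. The helper name remove_outliers_zscore and the toy values are assumptions for illustration; treating rows with a missing Z-score as "keep" is also my choice for this sketch, not something the method prescribes.

```python
import pandas as pd


def remove_outliers_zscore(df, columns, threshold=3.0):
    """Keep rows whose absolute Z-score is below `threshold` in every listed column."""
    values = df[columns]
    z_scores = (values - values.mean()) / values.std(ddof=0)
    # Rows with a missing Z-score (null value or constant column) are kept here.
    within_range = (z_scores.abs() < threshold) | z_scores.isna()
    return df[within_range.all(axis=1)]


# Toy data: 20 ordinary machines and one extreme value.
toy = pd.DataFrame({"Census_TotalPhysicalRAM": [4096.0] * 20 + [1_000_000.0]})
filtered = remove_outliers_zscore(toy, ["Census_TotalPhysicalRAM"])
print(len(toy), "->", len(filtered))
```

In this toy frame the extreme row has a Z-score of about 4.47, so it is the only row removed.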

5. Finding Correlation among features and target variable:

This part of exploratory data analysis helped me a lot in improving model performance. The correlation of features with the target variable helps us find out which features will influence the model's predictions of the target variable, which in this case is ‘HasDetections’. The correlation technique used here was the default one, Pearson correlation; for a detailed explanation of this technique you can refer to this blog.

From the above results, AVProductStatesIdentifier and Census_TotalPhysicalRAM have the strongest positive correlation with the target variable, and various other features are negatively correlated with it, so these features will play a more important role in prediction.
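A ranking like the one above can be produced with pandas along these lines. This is only a sketch: the helper name correlation_with_target and the toy columns feature_pos, feature_neg and category are made up for illustration.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "HasDetections") -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest positive first."""
    numeric = df.select_dtypes(include="number")  # Pearson needs numeric columns
    corr = numeric.corr(method="pearson")[target].drop(target)
    return corr.sort_values(ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "feature_pos": [1, 2, 3, 4, 5, 6],
    "feature_neg": [6, 5, 4, 3, 2, 1],
    "category": ["a", "b", "a", "b", "a", "b"],  # non-numeric, ignored
    "HasDetections": [0, 0, 0, 1, 1, 1],
})

ranking = correlation_with_target(toy)
print(ranking)
```

Note that the categorical column is silently skipped, which is exactly the limitation of Pearson correlation for a dataset made up mostly of categorical features.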

But there is one problem with the Pearson correlation coefficient: it captures only linear relationships between features and works well only for numerical features. For this problem statement most of the features are categorical, so I used a newer correlation analyzer library called the phi_k Correlation Analyzer Library. It is better suited here than the Pearson correlation coefficient, as it works consistently across categorical, ordinal and interval variables, and it also captures non-linear relationships between variables. Its phik_matrix method computes the correlation between all pairs of features. I will provide the link to the research paper I referred to for this library at the end of this blog.

The phi_k library not only measures the correlation between variables but also the statistical significance of those correlations and the global correlation of each variable.

Observations:

From the above plot, some features have a phi_k correlation of zero with the target variable ‘HasDetections’, like Census_IsFlightsDisabled, Census_IsPortableOperatingSystem, IsBeta, Firewall, Census_DeviceFamily, OsVer and Census_PrimaryDiskTotalCapacity. These features should be removed only after analysing the significance matrix, because even if two variables have little or no correlation, a variable may still be a significant predictor of the target variable.
Features like GeoNameIdentifier and LocaleEnglishNameIdentifier are highly correlated with each other.
After the above observations, I removed GeoNameIdentifier.

As mentioned earlier, the phi_k library also measures the statistical significance of variables, so I also used the significance matrix to find out the significance of variables.

Observations:

From the significance matrix plot, it is visible that even though some variables have zero correlation with the target variable, a number of records still co-occur between the target variable and those features.

The global correlation coefficient describes how well each variable can be modelled in terms of the other variables, so it can help in feature engineering.

Observations:

From the above global correlation plot, AppVersion has the highest global correlation value, followed by OsVer and SmartScreen.
These features will be useful for feature engineering tasks.

Similarly, due to computation limitations, I took a subset of features and analysed the correlation between them. At the end of this blog I will provide the link to my GitHub profile, where you can refer to my complete Python notebooks.

6. Analysis of distribution of target variable ‘HasDetections’:

After all the above preprocessing and correlation analysis, it is very important to know the distribution of the target variable in the train dataset, because an imbalanced dataset can lead to poor classification performance. So I checked the distribution of the target variable in the train dataset.

Observations:

As can be seen from the above countplot, the train dataset contains roughly equal numbers of positive detections and negative detections (not detected), so I concluded that this dataset is balanced.

Because the dataset is balanced, the model will not be biased towards either value of the target variable, which also helps it classify data accurately.
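The same balance check can also be done numerically instead of reading it off a countplot. Below is a minimal sketch on a toy label column (made-up values, not the real train.csv); the variable names class_share and imbalance_ratio are illustrative.

```python
import pandas as pd

# Toy labels for illustration only; the real column comes from train.csv.
labels = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="HasDetections")

# Share of each class; values close to 0.5 / 0.5 indicate a balanced target.
class_share = labels.value_counts(normalize=True).sort_index()

# Ratio of the majority to the minority class; 1.0 means perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(imbalance_ratio)
```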

7. Pair plots for analysis of numerical features:

Observations:

There is overlap between negative and positive detections for all three features with respect to the target variable.
The plot also shows that the distribution of Census_OSBuildRevision is right skewed.
Since the distribution of Census_OSBuildRevision is right skewed, we can derive a new feature by applying a log transformation to it.

8. Count plot for the SmartScreen feature, comparing its categories against the target variable ‘HasDetections’:

Analysis of categories in smartscreen feature

Observation:

Machines with SmartScreen set to existnotset, i.e. the entry exists in the registry but is not set, are highly vulnerable to malware.
Machines with SmartScreen set to requireadmin are less likely to be infected and are better secured against malware attacks.
The requireadmin category also dominates this feature.
From this feature we can create a new feature such as ISProtected, which can be useful for modelling: as observed in the above plot, when SmartScreen is enabled there is less chance of getting infected with malware.

9. Count plot for Product Name for checking distribution of categories:

Distribution of categories in ProductName feature

Observations:

From the above plot, this dataset has more machines with the product windows 8 defender than with mse.
windows 8 defender is the dominating category for this feature.

10. Count plot for the Census_OSArchitecture feature, comparing its categories against the target variable ‘HasDetections’:

Observations:

From the above plot, we can conclude that most machines in the dataset have the amd64 architecture, and most malware detections occurred on amd64 machines.
Machines with the amd64 architecture may be more vulnerable to malware, as their detection rate is higher than that of x86 machines.

Feature Engineering:

After exploratory data analysis, feature engineering is considered one of the most important parts of building a good machine learning system.

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
— Prof. Andrew Ng.

As part of feature engineering, I created 12 new features from the preprocessed features, using my domain knowledge about which combinations of features may help predict whether a machine gets infected with malware.

For example, the dataset has features like Wdft_IsGamer and Firewall, so I created a new feature, gamer_with_firewall, from these two features, because gaming machines are more prone to malware infection and the chances are higher still when the firewall is not turned on. So this feature may help in prediction.

Similarly, I created a new feature, Dimensions, from Census_InternalPrimaryDisplayResolutionHorizontal and Census_InternalPrimaryDisplayResolutionVertical.

I created a new feature, transformed_build_revision, by applying a log transformation to Census_OSBuildRevision, because we saw in EDA part 7 that Census_OSBuildRevision follows a right skewed distribution. The log transformation brings the data closer to a normal distribution, which can be useful for modelling.

I created the other features in a similar way, ending up with 12 new features in total.
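To make the three examples above concrete, here is a hedged sketch of how such features could be built with pandas. The exact formulas are assumptions for illustration: gamer_with_firewall is taken as a 0/1 flag for gaming machines with the firewall on, Dimensions as a "WIDTHxHEIGHT" string, and the log transformation as log(1 + x) so that a build revision of 0 stays finite. The helper name add_engineered_features is also hypothetical.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Assumption: flag gaming machines whose firewall is switched on.
    out["gamer_with_firewall"] = (
        (out["Wdft_IsGamer"] == 1) & (out["Firewall"] == 1)
    ).astype(int)
    # Assumption: combine both resolution columns into one categorical value.
    out["Dimensions"] = (
        out["Census_InternalPrimaryDisplayResolutionHorizontal"].astype(str)
        + "x"
        + out["Census_InternalPrimaryDisplayResolutionVertical"].astype(str)
    )
    # log1p(x) = log(1 + x) compresses the long right tail.
    out["transformed_build_revision"] = np.log1p(out["Census_OSBuildRevision"])
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "Wdft_IsGamer": [1, 0, 1],
    "Firewall": [1, 1, 0],
    "Census_InternalPrimaryDisplayResolutionHorizontal": [1920, 1366, 1280],
    "Census_InternalPrimaryDisplayResolutionVertical": [1080, 768, 720],
    "Census_OSBuildRevision": [0, 165, 1000],
})

engineered = add_engineered_features(toy)
print(engineered[["gamer_with_firewall", "Dimensions", "transformed_build_revision"]])
```

On the real data the resolution columns contain missing values, so they would need to be filled before being combined this way.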

Encoding Categorical Features:

The dataset contains categorical features, but models work on numerical inputs, so before implementing the first cut approach I encoded the categorical features. There are various techniques for encoding categorical features. For this problem statement I used frequency encoding for some features, because label encoding is not practical when the cardinality of a categorical feature is very high, and label encoding for the others.

A detailed explanation of the different encoding techniques can be found in this blog.
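The two encodings can be sketched as follows. The helper name frequency_encode and the toy values are made up for illustration; LabelEncoder is scikit-learn's standard label encoder. Frequency encoding here replaces each category with its raw count, which is one common variant (a normalized share would work the same way).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def frequency_encode(series: pd.Series) -> pd.Series:
    """Replace each category with the number of times it occurs in the column."""
    counts = series.value_counts()
    return series.map(counts)


# Toy columns for illustration only.
app_version = pd.Series(["4.18", "4.18", "4.9", "4.18", "4.12"])
architecture = pd.Series(["x86", "amd64", "amd64", "arm64"])

# High-cardinality style feature: frequency encoding.
app_version_encoded = frequency_encode(app_version)

# Low-cardinality feature: label encoding (classes are sorted alphabetically).
architecture_encoded = LabelEncoder().fit_transform(architecture)

print(app_version_encoded.tolist())
print(architecture_encoded.tolist())
```

One thing to keep in mind with frequency encoding is that the counts should be learned on the train data and then applied to the test data, otherwise the same category can end up with different values in the two files.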

First Cut Approach:

After exploratory data analysis and feature engineering, as a first cut approach I trained a model using LightGBM, a tree-based gradient boosting framework.

I chose this algorithm as a first cut approach because of the dataset size: LightGBM is known for its high speed and low memory usage, and it deals very well with data containing a large number of categorical features.

You can refer to this blog for a detailed explanation of the LightGBM algorithm.

To tune the hyperparameters of LightGBM, I used the Optuna library to select the parameters that give the best AUC score.

After tuning the parameters, I used stratified K-fold cross-validation to validate the evaluation metric: the data is divided into folds, and each fold is used in turn as the validation set while the model is trained on the remaining folds.
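The cross-validation loop described above can be sketched as follows. To keep the example self-contained it uses synthetic data from make_classification and scikit-learn's GradientBoostingClassifier as a stand-in for LightGBM; both are assumptions for illustration, and any classifier with fit and predict_proba can be swapped in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the preprocessed train data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# Stratification keeps the class ratio the same in every fold.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
train_aucs, valid_aucs = [], []

for train_idx, valid_idx in skf.split(X, y):
    # Stand-in model; the blog's model is LightGBM with tuned parameters.
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    train_pred = model.predict_proba(X[train_idx])[:, 1]
    valid_pred = model.predict_proba(X[valid_idx])[:, 1]
    train_aucs.append(roc_auc_score(y[train_idx], train_pred))
    valid_aucs.append(roc_auc_score(y[valid_idx], valid_pred))

# Report the mean AUC over all folds.
mean_train_auc = float(np.mean(train_aucs))
mean_valid_auc = float(np.mean(valid_aucs))
print(mean_train_auc, mean_valid_auc)
```

Comparing the mean train and validation AUC this way is also a quick check for overfitting: a large gap between the two means the model is memorizing the training folds.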

With stratified 3-fold cross-validation, the model achieved a training AUC score of 0.724 and a validation AUC score of 0.723, taking the mean AUC score over all folds.

Lightgbm AUC score

After evaluating the metric, I obtained the final predicted probabilities by passing the test data to the trained LightGBM model.

After training, I worked out which variables played an important role in predicting the target variable using LightGBM's feature importances (the feature_importances_ attribute of its scikit-learn style model), and plotted the feature importances of 50 features.

From the above plot, it is visible that the feature AVProductStatesIdentifier was the most useful for predicting the target variable, followed by AppVersion, Wdft_RegionIdentifier and so on. We can also see that transformed_build_revision, which I engineered by applying a log transformation during feature engineering, was found to be useful for predicting the target variable.

Custom Stacking Classifier:

Stacking is an ensemble machine learning technique that combines the predictions of two or more base learners: a meta model is trained on those predictions and produces the final probabilities.

To implement the custom stacking classifier, I first created a custom class inheriting from the BaseEstimator class, with a fit method for training the custom model and a predict_proba method for the final predictions.

I initialized the class with a variable base_learners, a list containing 6 machine learning algorithms, and used XGBClassifier as the meta_learner.

init method of custom stacking model

First, I split the train dataset into train (80%) and validation (20%) data.

Fit method:

Within the fit method of the CustomStackingClassifier class, I further split the 80% train data into two equal halves, D1 and D2 (50% each).

From the D1 dataset, I then created ‘k’ samples with replacement.

After creating the ‘k’ samples, I trained ‘k’ models, one on each sample. Here ‘k’ is the hyperparameter of this custom ensemble model, which I tuned using grid search CV, selecting the best ‘k’ for final training.

After training the ‘k’ models, I passed the D2 dataset to each of them and got ‘k’ sets of predictions. From those predictions I created a new dataset and, since I already had the target values for D2, trained a meta classifier on the ‘k’ model predictions with the D2 targets.

predict_proba method:

After training, I passed the remaining 20% of the dataset, i.e. the validation data, through the predict_proba method to each of the ‘k’ trained models and got their predicted probabilities. From these probabilities I created a new dataset and passed it to the meta classifier to get the final probabilities.
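The fit and predict_proba steps described above can be sketched roughly as follows. This is a simplified illustration, not my exact class: it uses a single decision tree type as every base learner and LogisticRegression as the meta learner (stand-ins for the list of algorithms and the XGBClassifier mentioned earlier), and it runs on synthetic data. The helper _meta_features and the parameter names are assumptions for this sketch.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class CustomStackingClassifier(BaseEstimator):
    def __init__(self, k=5, random_state=0):
        self.k = k  # number of base learners (the tuned hyperparameter)
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        # Split the training data into two equal halves, D1 and D2.
        X1, X2, y1, y2 = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=self.random_state
        )
        # Train k base models, each on a bootstrap sample (with replacement) of D1.
        base = DecisionTreeClassifier(max_depth=3, random_state=self.random_state)
        self.base_models_ = []
        for _ in range(self.k):
            idx = rng.randint(0, len(X1), size=len(X1))
            self.base_models_.append(clone(base).fit(X1[idx], y1[idx]))
        # Train the meta model on the base models' predictions for D2.
        self.meta_model_ = LogisticRegression().fit(self._meta_features(X2), y2)
        return self

    def _meta_features(self, X):
        # One column per base model: its predicted probability of class 1.
        return np.column_stack([m.predict_proba(X)[:, 1] for m in self.base_models_])

    def predict_proba(self, X):
        return self.meta_model_.predict_proba(self._meta_features(X))


# Synthetic stand-in data, split 80/20 into train and validation.
X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

stack = CustomStackingClassifier(k=5).fit(X_train, y_train)
valid_proba = stack.predict_proba(X_valid)
valid_auc = roc_auc_score(y_valid, valid_proba[:, 1])
print(valid_auc)
```

Training the meta model on D2 rather than D1 matters: the base models have never seen D2, so their predictions there are honest inputs for the meta model.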

Tuning the number of base learners:

For this model, I tuned the number of base learners using scikit-learn's GridSearchCV and selected the value that gave the best ROC AUC score, k = 5, for training the model on the train dataset and making final predictions on the test dataset.

Training a custom stacking classifier:

For training, I set the number of base learners (k) to 5, as found by GridSearchCV, and got a roc_auc_score of 0.70 on both train and validation data.

ROC AUC Score of custom model

Final Predictions on test data:

After tuning the number of base models and training on the train data, I passed the whole test data to the saved best model and got the probabilities of each machine getting infected with malware.

Model Comparisons:

After training both models, I compared their performance.

From the above, it was clear that the LightGBM model outperformed the custom stacking model. So, I selected the model trained with LightGBM as the final model.

Final Model:

After model comparison, I selected the trained LightGBM model as the best model because it performed better than the custom stacking model. I then did feature selection based on the number of splits, keeping the features that were used for splitting a tree at least 100 times.

After feature selection, I followed the same procedure of tuning for the best parameters using Optuna and trained the classifier, which gave a train AUC score of 0.75 and a validation AUC score of 0.73.

After evaluating the metric, I predicted the final probabilities of machines getting infected with malware on the test dataset.

After creating a data pipeline for the train dataset to evaluate the metric (see the code in the final.ipynb file in my GitHub repo, linked at the end), I achieved a final AUC score of 0.749.

Final AUC score

Model Productionization:

After getting the probability scores from the best selected model, I deployed it on Heroku using a basic Flask API. Below is the link to the app I deployed.

Malware detection

malware-prediction-flask-api.herokuapp.com

You can refer to my blog on how I deployed my ML model on Heroku.

Future Work:

My work can still be improved in the future as follows:

For the custom stacking model, I tuned only the number of base models, but we can also tune the hyperparameters of each base model in addition to the number of base models.
In LightGBM, there is a device parameter that I left at its default, i.e. CPU, meaning the CPU is used for training. This can be switched to GPU in the future to reduce training time, if sufficient computational power is available.
For encoding categorical features, different encoding techniques can be tried.
One can also try CatBoost, another machine learning algorithm that uses gradient boosting on decision trees, and evaluate its performance.

Here is my GitHub repo where you can find the Python notebooks for reference.