Published in Analytics Vidhya
Machine Learning in Cyber Security — Malicious Software Installation

A real-life use case of monitoring local administrative activity by applying TF-IDF at the character level


Monitoring the activities of local administrators is a perennial challenge for SOC analysts and security professionals. Most security frameworks recommend implementing a whitelist mechanism.

However, the real world is rarely ideal. There will always be developers or users with local administrator rights who can bypass the specified controls. Is there a way to monitor local administrator activity?

Let’s talk about the data source

An example of what the dataset looks like — the 3 entries listed above refer to the same software

We run a regular batch job to retrieve the software installed on each workstation across different regions. Most installed software is displayed in its local language. (Yes, you name it — it could be Japanese, French, Dutch…) So you will meet situations where software that appears under 7 different names actually refers to the same entry in the whitelist. Not to mention, we have thousands of devices.

Attributes of the dataset

  • Hostname — the hostname of the device
  • Publisher Name — the software publisher
  • Software Name — the software name, in the local language and with varying version numbers

Is there a way we could identify non-standard installations?

My idea is that legitimate software used in the company should have more than one installation, often under differing software names. In such a case, I believe machine learning can be effective in classifying the software and highlighting any outliers.

Character processing using Term Frequency — Inverse Document Frequency (TF-IDF)

Natural Language Processing (NLP) is a sub-field of artificial intelligence that deals with understanding and processing human language. In light of new advancements in machine learning, many organizations have begun applying natural language processing for translation, chatbots, and candidate filtering.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
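To make that definition concrete, here is a tiny hand-rolled computation (the three-name corpus is made up for illustration; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its scores will differ from this plain formula):

```python
import math

# Toy corpus of three "documents" (software names)
docs = ["adobe reader", "adobe flash", "notepad"]

def tf(term, doc):
    # term frequency: share of the document's words that are this term
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # inverse document frequency: log of (num docs / docs containing term)
    n_containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    # the two metrics multiplied, as described above
    return tf(term, doc) * idf(term, docs)

# "adobe" appears in 2 of 3 documents, so its idf (and tf-idf) is low;
# "notepad" appears in only 1 of 3, so it scores higher
common = tfidf("adobe", docs[0], docs)
rare = tfidf("notepad", docs[2], docs)
```

The intuition carries over directly to characters: with `analyzer='char'`, each character plays the role a word plays here.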

TF-IDF is usually used for word extraction. However, I wondered whether it could also be applied to character extraction. The idea is to explore how well TF-IDF can extract features for each character in a software name by exposing the importance of each character to that name.
The script below walks through how I apply TF-IDF to the software-name field in my dataset.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Import the dataset
df = pd.read_csv("your dataset")
# Extract the software-name column
field_extracted = df['softwarename']
# Initialize the TF-IDF vectorizer at character level
vectorizer = TfidfVectorizer(analyzer='char')
vectors = vectorizer.fit_transform(field_extracted)
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions use get_feature_names_out() instead
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
result = pd.DataFrame(denselist, columns=feature_names)

A snippet of the result:

The result from the TF-IDF scripts above (with a mix of different languages e.g. Korean, Chinese)

In the diagram above, you can see that a score is calculated to evaluate how “important” each character is to the software name. It can also be read as how “frequent” the specified character is within each software name. In this way, the characteristics of each software name can be represented statistically, and these features can be fed into the machine learning model of your choice.

Other features I extracted that I believe are also meaningful to the model — for example, the entropy of the software name:

import math
from collections import Counter
# Function for calculating the entropy of a string
def eta(data, unit='natural'):
    base = {
        'shannon': 2.,
        'natural': math.exp(1),
        'hartley': 10.
    }
    if len(data) <= 1:
        return 0
    counts = Counter()
    for d in data:
        counts[d] += 1
    ent = 0
    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])
    return ent

entropy = [eta(x) for x in field_extracted]
  • Space Ratio — the proportion of space characters in the software name
  • Vowel Ratio — the proportion of vowels (aeiou) in the software name
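The two ratio features above can be computed with a couple of short helpers (a minimal sketch; the helper names and sample software names are my own, and in practice you would map these over the `field_extracted` series from the earlier snippet):

```python
def space_ratio(name):
    # fraction of characters in the name that are spaces
    return name.count(' ') / len(name) if name else 0.0

def vowel_ratio(name):
    # fraction of characters that are vowels (case-insensitive)
    return sum(ch in 'aeiou' for ch in name.lower()) / len(name) if name else 0.0

# Hypothetical software names, just to illustrate the features
names = ["Adobe Reader DC", "7-Zip 19.00"]
space_ratios = [space_ratio(x) for x in names]
vowel_ratios = [vowel_ratio(x) for x in names]
```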

Finally, I ran the features listed above, together with labels, through a random forest classifier. You can select any classifier of your choice, as long as it gives you a satisfactory result.
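A minimal sketch of that last step, assuming the character TF-IDF frame and the extra features have been joined into one matrix with labels (the column values and labels below are made up purely for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix: in practice this would be the char TF-IDF
# frame joined with the entropy / space-ratio / vowel-ratio columns
X = pd.DataFrame({
    'entropy':     [2.1, 2.3, 3.9, 2.2],
    'space_ratio': [0.10, 0.11, 0.00, 0.10],
    'vowel_ratio': [0.40, 0.38, 0.05, 0.41],
})
y = [0, 0, 1, 0]  # made-up labels: 1 = non-standard installation

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
preds = clf.predict(X)
```

Any other scikit-learn classifier can be dropped in here with the same fit/predict interface.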

Thanks for reading!

Originally published on September 16, 2020.


Elaine Hung — DFIR Specialist | Interested in Machine Learning
