Machine Learning for Malware Detection
My Summary of the Kaspersky Whitepaper
Main Reference
https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf
Machine Learning Approaches
Unsupervised Learning
- Machine learning models that analyze datasets without labeled outputs.
- Goal is to uncover patterns, clusters, or underlying structure in the data.
- Example use case: clustering unknown malware samples to prioritize analysis.
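As a minimal sketch of the clustering use case above, the snippet below groups unlabeled samples by the overlap of their feature sets and lists the largest groups first. The sample names, the imported-API features, and the 0.5 threshold are all hypothetical, and greedy "leader" clustering is just one simple choice of algorithm, not necessarily what any vendor uses.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(samples, threshold=0.5):
    """Greedy 'leader' clustering.

    Each sample joins the first cluster whose leader is at least
    `threshold` similar; otherwise it becomes the leader of a new cluster.
    """
    clusters = []  # list of (leader_features, [member names])
    for name, features in samples.items():
        for leader, members in clusters:
            if jaccard(features, leader) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((features, [name]))
    # Largest clusters first, so analysts can look at them first.
    return sorted((members for _, members in clusters), key=len, reverse=True)


# Hypothetical unlabeled samples, described by the API names they import.
samples = {
    "sample_a": {"CreateRemoteThread", "WriteProcessMemory", "VirtualAllocEx"},
    "sample_b": {"CreateRemoteThread", "WriteProcessMemory", "VirtualAllocEx", "OpenProcess"},
    "sample_c": {"InternetOpenUrl", "URLDownloadToFile"},
    "sample_d": {"CreateRemoteThread", "WriteProcessMemory", "OpenProcess"},
}

clusters = leader_cluster(samples)
for members in clusters:
    print(len(members), members)
```

No labels are involved at any point: the grouping comes only from how similar the samples are to each other.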
Supervised Learning
- Algorithms trained on labeled datasets, where each sample is mapped to a known output (malware or benign).
- Models learn patterns from features to predict the correct output label for new, unseen inputs.
- This allows detecting both known malware and previously unseen variants based on learned decision boundaries.
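To make the idea of learning from labeled samples concrete, here is a minimal sketch using a nearest-centroid classifier written in plain Python. The three features (byte entropy, number of suspicious imports, a packed flag) and all the numbers are made up for illustration, and a real system would use a much stronger model and properly scaled features.

```python
def train_centroids(X, y):
    """Compute the mean feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to `features`."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical features: [byte entropy, suspicious import count, is_packed]
X_train = [
    [7.8, 5, 1], [7.5, 4, 1], [7.9, 6, 1],  # labeled malware
    [5.1, 0, 0], [4.8, 1, 0], [5.5, 0, 0],  # labeled benign
]
y_train = ["malware"] * 3 + ["benign"] * 3

centroids = train_centroids(X_train, y_train)
print(predict(centroids, [7.6, 5, 1]))
print(predict(centroids, [5.0, 0, 0]))
```

The decision boundary here is simply the set of points equally far from both centroids; new samples are labeled by which side of it they fall on.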
Deep Learning
- Class of machine learning models composed of multiple processing layers to learn data representations and features at multiple levels of abstraction.
- Can handle raw inputs such as file content and system behavior logs.
- These models are complex, but they can capture nuanced malware characteristics.
- Used for classification as well as powering components like exemplar networks.
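The "multiple processing layers" idea can be sketched as a forward pass through a small stack of dense layers. This is only an illustration of the structure: the layer sizes are arbitrary, the weights are random and untrained (so the score is meaningless), and the byte histogram is a simplifying assumption standing in for whatever raw-input encoding a real model would use.

```python
import math
import random


def byte_histogram(data):
    """Normalized 256-bin histogram of byte values (a simple raw-content input)."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1
    return [c / total for c in counts]


def dense(inputs, weights, biases):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(rng, n_in, n_out):
    """Random, UNTRAINED weights; a real model would learn these from data."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Hypothetical architecture: 256 -> 16 -> 8 -> 1
layers = [init_layer(rng, 256, 16), init_layer(rng, 16, 8), init_layer(rng, 8, 1)]


def forward(data):
    """Pass the input through each hidden layer, then squash to a (0, 1) score."""
    hidden = byte_histogram(data)
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    out_weights, out_biases = layers[-1]
    return sigmoid(dense(hidden, out_weights, out_biases)[0])


score = forward(b"MZ\x90\x00 example bytes")
print(score)
```

Each hidden layer re-represents the output of the previous one, which is what lets a trained network build up features at increasing levels of abstraction.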
Static vs Dynamic Analysis
Static (Pre-Execution) Analysis
- Analyzes a malware sample without executing its code.
- Examines properties like file metadata, format, header structure, strings, etc.
- The goal is fast screening, so that samples are labeled before they are allowed to execute.
- Limited when content is packed or encrypted, because the real payload is not visible before execution.
- Kaspersky uses similarity hashing and ML at this stage.
- Runs on client devices such as phones and laptops.
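The kind of properties listed above can be extracted without running the file. Below is a small sketch that computes a few illustrative static features from raw bytes: size, byte entropy, a check for the "MZ" magic bytes, and printable strings. The feature set and the sample bytes are assumptions chosen for illustration, not the features from the whitepaper.

```python
import math
import re
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0).

    Values near 8 often indicate compressed, packed, or encrypted content.
    """
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def printable_strings(data: bytes, min_len: int = 4):
    """Runs of at least `min_len` printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def static_features(data: bytes) -> dict:
    """Collect a few pre-execution features from raw file bytes."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "starts_with_mz": data[:2] == b"MZ",  # DOS/PE executable magic bytes
        "strings": printable_strings(data),
    }


# Hypothetical file content, for demonstration only.
sample = b"MZ\x90\x00\x03\x00This program cannot be run in DOS mode\x00\x01\x02"
features = static_features(sample)
print(features)
```

The entropy feature also hints at the limitation noted above: a packed file mostly looks like high-entropy noise, so static features say little about what the payload will do.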
Dynamic (Post-Execution) Analysis
- Monitors malware execution in a safe sandbox environment.
- Logs behaviors like system calls, network activity, and registry/file changes.
- More computationally intensive, but reveals the true payload.
- Detects advanced threats missed by static analysis.
- Kaspersky uses deep learning behavioral models at this stage.
- Runs on servers, where its results are used to label the dataset.
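A logged behavior trace has to be turned into features before a model can use it. One common and simple approach, sketched below, is to count n-grams of consecutive events. The event names and the log itself are hypothetical, and this is not meant to represent how Kaspersky's behavioral models encode their input.

```python
from collections import Counter


def call_ngrams(events, n=2):
    """Count each run of `n` consecutive events in a behavior log."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical sandbox log: API calls in the order they were observed.
behavior_log = [
    "OpenProcess",
    "VirtualAllocEx",
    "WriteProcessMemory",
    "CreateRemoteThread",
    "OpenProcess",
    "VirtualAllocEx",
]

bigrams = call_ngrams(behavior_log, n=2)
for ngram, count in bigrams.most_common():
    print(count, " -> ".join(ngram))
```

Unlike static features, these counts describe what the sample actually did, so they remain informative even when the file on disk is packed or encrypted.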
Locality-Sensitive Hashes (LSH)
Locality-Sensitive Hashing (LSH) is a technique used to efficiently index and search large datasets for similar items.
- It relies on hash functions for which similar inputs map to the same hash codes with high probability.
- So if two items (documents, images, files, etc.) are similar, their locality-sensitive hash codes will be the same or close with high probability.
- Dissimilar objects hash to different codes with high probability.
- This makes it possible to approximate similarity by comparing compact hashes instead of running expensive direct comparisons.
- Similar items can be queried by finding other items mapped to the same hash codes.
- Useful for duplicate detection, nearest neighbor search, clustering, etc.
In the malware detection context, Kaspersky uses similarity hashing to group similar malware samples into families, with hashes computed from file contents, statistics, and metadata. This focuses manual analysis and aids detection of novel variants. The key benefit is efficiency: compact hashes are compared instead of complete files, so whole groups of interest can be identified at once. This scales to malware databases much larger than would otherwise be feasible.
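The properties described in this section can be sketched with MinHash, one well-known LSH family for set similarity, combined with banding to bucket likely-similar samples. This is an illustrative stand-in and not Kaspersky's actual similarity hash; the shingle size, the 32 hash functions, the 8 bands, and the sample data are all assumptions.

```python
import hashlib
from collections import defaultdict


def shingles(data: bytes, k: int = 4) -> set:
    """All overlapping k-byte windows of the input."""
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def minhash(items: set, num_hashes: int = 32) -> list:
    """MinHash signature: for each salted hash function, keep the minimum value."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(salt + item, digest_size=8).digest(), "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of matching positions; approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_groups(signatures: dict, bands: int = 8) -> set:
    """Banding: samples that agree on every row of at least one band share a bucket."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    return {frozenset(names) for names in buckets.values() if len(names) > 1}


# Hypothetical samples: b is a one-byte variant of a, c is unrelated.
sample_a = b"MZ" + bytes(range(200))
sample_b = sample_a[:-1] + b"\xff"
sample_c = bytes(range(255, 55, -1))

signatures = {
    name: minhash(shingles(data))
    for name, data in [("sample_a", sample_a), ("sample_b", sample_b), ("sample_c", sample_c)]
}

print(estimated_similarity(signatures["sample_a"], signatures["sample_b"]))
print(estimated_similarity(signatures["sample_a"], signatures["sample_c"]))
families = candidate_groups(signatures)
print(families)
```

Only the short signatures are compared, never the full files, and the bucket lookup avoids comparing every pair of samples, which is where the scalability comes from.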
About Me
Hi, I’m Seyyed Ali Ayati, a software engineer and a passionate blogger. I’m originally from Iran, but I moved to the US in 2022 to pursue my Ph.D. degree in computer engineering at Texas A&M University.
On this blog, I share my insights and tips on software security, web development, machine learning, and more. I aim to help aspiring and experienced developers learn new skills, improve their code quality, and stay updated on the latest trends and technologies. I also write about my personal projects, challenges, and achievements, as well as my opinions and perspectives on various topics related to tech and society.
If you want to learn more about me and my work, you can visit my LinkedIn page here: https://www.linkedin.com/in/seyyedaliayati/