

Got Malware? — Meet Us

Malware Blues! (Source:

The Problem

What are we trying to do?


  • Idika, Nwokedi, and Aditya P. Mathur. “A survey of malware detection techniques.” Purdue University 48 (2007): This paper discusses two of the most common techniques used in malware detection: anomaly-based detection and signature-based detection.
  • Ahmed, Faraz, et al. “Using spatio-temporal information in API calls with machine learning algorithms for malware detection.” Proceedings of the 2nd ACM workshop on Security and artificial intelligence. ACM, 2009: These researchers ran malware and benign software in a sandbox environment, analyzed their behavior, and used different algorithms to classify each program as malware or non-malware.
  • Liao, Ken. “Solution Corner: Malwarebytes Endpoint Protection.” Blog post. Malwarebytes Labs. Malwarebytes, 27 June 2017. Web. 30 June 2017: This blog post explains how Malwarebytes already incorporates machine learning into its products.
  • Siddiqui, Muazzam, Morgan C. Wang, and Joohan Lee. “A survey of data mining techniques for malware detection using file features.” Proceedings of the 46th annual southeast regional conference on xx. ACM, 2008: This article surveys the data mining techniques used in 19 different studies.
  • Alazab, Mamoun, et al. “Zero-day malware detection based on supervised learning algorithms of API call signatures.” Proceedings of the Ninth Australasian Data Mining Conference-Volume 121. Australian Computer Society, Inc., 2011: This research group used machine learning to identify zero-day malware based on the frequency of its Windows API calls.

Our Approach

  1. Learn the required tools: Our internship includes a Java course, but because Python has much better libraries for data analysis and visualization, we decided to learn and use it for our project.
  2. Create a malware analysis test bed: We are writing a Python program that will index the files (make an organized list of all the files along with their sizes) on multiple virtual machines (software that emulates a separate computer inside your main computer). Then it will compare the directories and generate a report that lists the file changes caused by the malware.
  3. Infect the virtual machines with different types of viruses and compare the files on the infected machines with those on a clean machine.
  4. Extract meaningful features from our samples. These features will be the basis of our study. Features are the measurable properties that describe something; for example, the features of a house include its number of rooms, its area, and its price.
  5. Visualize the data. Malware is a threat to anyone who uses a computer, but many people have only a vague idea of what it is and what its effects can be. We aim to write something that will help people clearly see the effect of malware on their computers.
  6. Use machine learning on the prepared dataset.
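
To make step 2 more concrete, here is a minimal sketch of what indexing and comparing might look like. It is not our actual program: the function names (`index_directory`, `compare_indexes`, `format_report`) are hypothetical, and it assumes each virtual machine's files are reachable as an ordinary directory from the machine running the script.

```python
import os


def index_directory(root):
    """Map each file's path (relative to root) to its size in bytes."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root)
            try:
                index[rel_path] = os.path.getsize(full_path)
            except OSError:
                # Unreadable or vanished file: record it with an unknown size.
                index[rel_path] = None
    return index


def compare_indexes(clean, infected):
    """Return files added, removed, or resized relative to the clean index."""
    added = sorted(set(infected) - set(clean))
    removed = sorted(set(clean) - set(infected))
    resized = sorted(
        path for path in set(clean) & set(infected)
        if clean[path] != infected[path]
    )
    return {"added": added, "removed": removed, "resized": resized}


def format_report(diff):
    """Turn the comparison result into a plain-text report."""
    lines = []
    for label in ("added", "removed", "resized"):
        lines.append(f"{label.upper()} ({len(diff[label])})")
        lines.extend(f"  {path}" for path in diff[label])
    return "\n".join(lines)
```

Calling `format_report(compare_indexes(index_directory(clean_dir), index_directory(infected_dir)))` would produce the report text. Comparing sizes alone misses a file whose contents change while its size stays the same; storing a hash of each file (for example with `hashlib.sha256`) alongside the size would catch those cases at the cost of reading every file.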

Why is it beneficial?

Can this be done in a better way?

What have we done until now?

Code Review — Please?

  • Code we plan to use for line-by-line file comparison: Here
  • Code we plan to use to compare two directories and save the results to a text file: Here
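
For readers who cannot open the links, the sketch below illustrates the general idea of a line-by-line comparison saved to a text file. It is not the linked code: it uses Python's standard `difflib` module, and the function names and the default labels `clean` and `infected` are hypothetical.

```python
import difflib


def compare_lines(old_text, new_text, old_name="clean", new_name="infected"):
    """Return a unified diff of two texts as a list of lines."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return list(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )


def save_report(diff_lines, report_path):
    """Write the diff lines to a text file."""
    with open(report_path, "w", encoding="utf-8") as report:
        report.writelines(diff_lines)
```

In the diff, lines starting with `-` appear only in the first text and lines starting with `+` appear only in the second; two identical texts produce an empty list. This approach suits text files such as configuration files or logs, while binary files are better compared by size or hash.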


