Published in Cyber Defenders Program


Team Aladdin — Working Towards our Goal


Welcome Back!

After experimenting a bit for the first two weeks, we have defined our roles in the group. Vivian is our researcher and networking specialist, Huy is our lead programmer, Tien is our assistant programmer and public affairs rep, and Citlalin is the chief editor and writer. We started the week by reaching out to Mike Sconzo for a meeting, and with his guidance we were able to envision the final product of our project and refine our mission statement:

Improve data quality on existing data sets by publishing a report with information on the datasets and a program that visualizes the primary features to inspect.

After defining our roles and getting a clear idea of our final product, we were able to increase our output and work more efficiently. Instead of focusing on the research aspect, we started applying our knowledge.


For malware feature analysis, we found an academic report, “A learning model to detect maliciousness of portable executable using integrated feature set” by Ajit Kumar. The report explains the meaning of each feature that can be used to distinguish malware from benign files. Those features are stored as header names inside the malware JSON files. The headers we decided to extract are FileName, SectionAlignment, FileAlignment, SizeOfHeaders, TimeDateStamp, ImageBase, SizeOfImage, DllCharacteristics, Characteristics, HighEntropy, LowEntropy, TotalSuspiciousSections, and TotalNonSuspiciousSections.

Here is a brief explanation of what each header name means.

  • FileName — The name of the JSON file the extracted data values originally came from.
  • SectionAlignment — In a benign file, this value must be greater than or equal to FileAlignment.
  • FileAlignment — Used to align the raw data of sections in the image file.
  • SizeOfHeaders — The combined size of the headers, rounded up to a multiple of FileAlignment. In a normal file, dividing SizeOfHeaders by FileAlignment therefore yields a whole number.
  • TimeDateStamp — Used to extract the date when the file was compiled. The year constraint for the files is between 1980 and 2018.
  • ImageBase — Must be a multiple of 64 KB.
  • SizeOfImage — A multiple of SectionAlignment. Most benign files follow this specification, and 94% of malware follow it as well. Malware sometimes hides embedded code in this field.
  • Entropy — Defined as a measure of the efficiency of information storage. Some modern malware and malicious packers try to reduce entropy by inserting zero bytes into their data; this avoids detection, as many AV products only react to high-entropy files.
  • HighEntropy and LowEntropy — If any section of the corresponding PE file has entropy greater than 7, HighEntropy is set to 1; if any section has entropy less than 1, LowEntropy is set to 1.
  • DllCharacteristics — An important field under the Windows-specific portion of the Optional Header. It contains information about DLL behaviors that are used by the linker and loader.
  • Characteristics — One of the most important fields. It comprises flags that indicate attributes of the file, such as the IMAGE_FILE_DEBUG_STRIPPED flag, which shows whether debugging information has been stripped into a separate file.
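Several of the descriptions above are really validity checks on header values. As a rough illustration, here is a hypothetical helper (not part of our programs) that applies the checks from the list to a header dictionary:

```python
def looks_normal(header):
    """Apply the sanity checks described above to a dict of PE header values.

    This is an illustrative sketch; real PE validation is more involved.
    """
    checks = [
        # SectionAlignment must be >= FileAlignment in a benign file.
        header["SectionAlignment"] >= header["FileAlignment"],
        # SizeOfHeaders should be a whole multiple of FileAlignment.
        header["SizeOfHeaders"] % header["FileAlignment"] == 0,
        # ImageBase must be a multiple of 64 KB.
        header["ImageBase"] % 0x10000 == 0,
        # SizeOfImage should be a multiple of SectionAlignment.
        header["SizeOfImage"] % header["SectionAlignment"] == 0,
    ]
    return all(checks)
```

A file that fails any one of these checks is not necessarily malware, but unusual header values are exactly the kind of signal the features above are meant to capture.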


For our report write-up, we studied the format the UC Irvine Machine Learning Repository uses. We focused on its Website Phishing Data Set, which was informative yet simple enough to be easily readable. It includes a table describing the kind of data you will find, such as the attribute characteristics, the date the data was donated to the repository, and the tasks completed with the dataset. It also gives the source the data was collected from. We followed their format:

  • an abstract
  • a section citing the source of the data set
  • brief description of the data and its history
  • a section with the attributes extracted
  • a section with relevant papers that have worked with the dataset before

Something we added that the Phishing Data Set did not have was a section with a link to our Jupyter Notebook analysis.


Programs Created this week

This week we created two programs: before we could visualize the datasets through charts and graphs, we needed a program to extract the main features from the roughly eight thousand JSON files in each data set.

1st Program: Feature Extraction

This program extracts the features thought to be useful from the raw dataset, reads their values, and writes them to a CSV file. First, we wrote code to open and read all the JSON files, one after the other. Then we used a for loop to extract the ‘Entropy’ and ‘Section name’ features in ‘PE Sections’. The JSON library in Python can only extract one line at a time from the JSON dictionary, which proved time-consuming. The next part of the code extracts the rest of the needed features, along with their values, from ‘OPTIONAL_HEADER’. We then append the result and move on to the next JSON file in the folder, so the code extracts the features from every file in the folder. The last part of the code writes all the values to a CSV file.
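The steps above can be sketched roughly as follows. The field names (‘OPTIONAL_HEADER’, ‘PE Sections’, ‘Entropy’) follow the description in this post, but the exact JSON schema of the dataset is an assumption here:

```python
import csv
import glob
import json

# Header fields to pull from OPTIONAL_HEADER (per the feature list above).
OPTIONAL_HEADER_FIELDS = [
    "SectionAlignment", "FileAlignment", "SizeOfHeaders", "TimeDateStamp",
    "ImageBase", "SizeOfImage", "DllCharacteristics", "Characteristics",
]

def extract_features(path):
    """Read one JSON file and return a flat dict of the features we keep."""
    with open(path) as f:
        record = json.load(f)
    row = {"FileName": path}
    header = record.get("OPTIONAL_HEADER", {})
    for field in OPTIONAL_HEADER_FIELDS:
        row[field] = header.get(field)
    # Derive the HighEntropy / LowEntropy flags from the per-section entropies.
    entropies = [s.get("Entropy", 0) for s in record.get("PE Sections", [])]
    row["HighEntropy"] = int(any(e > 7 for e in entropies))
    row["LowEntropy"] = int(any(e < 1 for e in entropies))
    return row

def write_csv(folder, out_path):
    """Extract features from every JSON file in `folder` into one CSV."""
    rows = [extract_features(p) for p in sorted(glob.glob(folder + "/*.json"))]
    fieldnames = ["FileName"] + OPTIONAL_HEADER_FIELDS + ["HighEntropy", "LowEntropy"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

This is a sketch rather than our exact script, but the shape is the same: loop over files, pull the features, append a row, and dump everything to CSV at the end.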

CyberDefenders — Machine Learning

2nd Program: Analysis of Data Set

This program, written in a Jupyter Notebook to make viewing easy for anyone wanting to explore the datasets, visualizes their quality. We create a Data Quality Report table that shows information about the datasets such as the data type of each feature, quantity, number of unique values, and maximum and minimum values. First the program imports the necessary Python packages: pandas, pyplot, and mplot3d. Next we load the data from the CSV file into a DataFrame so pandas can make the process smoother; for example, this is the line of code used: df = pd.read_csv(“VirusShare.csv”). To create the data quality chart with all the descriptions of the data we use df.describe().transpose() as our line of code. This chart contains more data than needed, so to clean it up and create a good Data Quality Report we wrote some more code, shown in the screenshot below along with the resulting graph.
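A minimal sketch of the kind of cleanup described above (the exact code is in the screenshot; the column selection here is an assumption based on the report fields listed):

```python
import pandas as pd

def data_quality_report(df):
    """Build a Data Quality Report: one row per feature.

    Combines dtype, non-null count, and unique-value count, plus min/max
    for the numeric features, instead of everything describe() returns.
    """
    report = pd.DataFrame({
        "Data Type": df.dtypes,
        "Quantity": df.count(),        # non-null values per feature
        "Unique Values": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    report["Min"] = numeric.min()      # NaN for non-numeric features
    report["Max"] = numeric.max()
    return report
```

Usage would be something like `data_quality_report(pd.read_csv("VirusShare.csv"))`, giving one tidy row per extracted feature.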

Data Quality Report Chart

To visualize the important features, we created graphs showing how often the same value appears across the JSON files. The screenshot below describes the values of the Characteristics feature in a bar graph. From the graph it is apparent that 271 is a very common value for Characteristics, so we concluded that it could be a standard value. On the other hand, 783 is not very common, since few files share that value. This graph is distinct from the chart because the chart only shows the minimum and maximum values, while the graph shows which values are more common.
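The frequency counts behind such a bar graph are one line in pandas. A sketch, assuming the CSV produced by the extraction step has a Characteristics column:

```python
import pandas as pd

def characteristic_counts(df):
    """How often each Characteristics value appears across the files.

    Very common values (like 271 in our graph) suggest a standard setting;
    rare values (like 783) stand out as unusual.
    """
    return df["Characteristics"].value_counts()

# Plotting the counts reproduces the bar graph described above:
#   freq = characteristic_counts(pd.read_csv("VirusShare.csv"))
#   freq.plot.bar()
```

`value_counts()` sorts most-common-first, which makes the standard values easy to spot at the left edge of the bar chart.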

Plans for Week 4

Next week we will tidy up our charts and graphs and improve our data report to make it more user-friendly. We will write reports for the other datasets, such as network log files and system files, and make an individual Jupyter Notebook analysis for each file and its features.


  1. Mike Sconzo, “ — Samples of Security Related Data”, Jul 10
  2. Dawn Song, “AI and Security: Lessons, Challenges & Future Directions”, Sept 2017
  3. Ajit Kumar, “A learning model to detect maliciousness of portable executable using integrated feature set”, January 2017
  4. Mike Sconzo, “Honeypot Howto”, 2014
  5. Ilkay Altintas, “Python for Data Science”, UC San Diego course



