In this tutorial we’ll download the MITRE ATT&CK data set as an Excel file and perform basic data cleaning and exploratory data analysis (EDA). The end result will be a bar chart showing a count of MITRE techniques by data source and data component (for example, “Process: Process Creation”).
Step 1: Import Python Modules
In this tutorial we’ll use the Pandas, PyJanitor, and Plotly modules.
import pandas as pd # for data acquisition and manipulation
import janitor as jn # for data cleaning tasks
import plotly.express as px # for visualization
import plotly.io as pio # for visualization
Step 2: Define Settings
When performing initial analysis of a data set, it’s often helpful to remove the Pandas display restrictions for columns and rows.
And we’ll have Plotly send visualization outputs to the browser.
# Pandas
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Plotly
pio.renderers.default = "browser"
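If lifting the display limits for the whole session feels too broad, one option is to scope them to a single block with `pd.option_context`. The sketch below uses a toy data frame (an assumption, not part of the MITRE data) to show that the limits only apply inside the `with` block.

```python
import pandas as pd

# Toy frame, longer than the default row limit (illustrative only)
df = pd.DataFrame({'n': range(100)})

default_rows = pd.get_option('display.max_rows')

# Inside the block the limits are lifted; on exit pandas restores the previous values
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    inside_rows = pd.get_option('display.max_rows')
    shown = repr(df)  # stand-in for displaying the frame in a notebook cell

after_rows = pd.get_option('display.max_rows')
```

This keeps later cells from printing very long tables by accident.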
Step 3: Download the MITRE Enterprise ATT&CK Data Set
There are various ways to access the MITRE data sets. In this tutorial we download the data from an Excel file and convert each sheet to a Pandas data frame.
url_attack = 'https://attack.mitre.org/docs/enterprise-attack-v10.1/enterprise-attack-v10.1.xlsx'
df_datasources = pd.read_excel(url_attack, sheet_name='datasources')
df_tactics = pd.read_excel(url_attack, sheet_name='tactics')
df_techniques = pd.read_excel(url_attack, sheet_name='techniques')
df_relationships = pd.read_excel(url_attack, sheet_name='relationships')
df_mitigations = pd.read_excel(url_attack, sheet_name='mitigations')
df_software = pd.read_excel(url_attack, sheet_name='software')
df_groups = pd.read_excel(url_attack, sheet_name='groups')
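Each `pd.read_excel` call above fetches the workbook again. A possible alternative, sketched here, passes a list of sheet names so that the file is read once and a dictionary of data frames comes back. The helper name `load_sheets` and its `reader` parameter are hypothetical; the parameter exists only so the function can be tried without network access.

```python
import pandas as pd

SHEET_NAMES = ['datasources', 'tactics', 'techniques', 'relationships',
               'mitigations', 'software', 'groups']

def load_sheets(source, sheet_names=SHEET_NAMES, reader=pd.read_excel):
    """Read several sheets in one call and return {sheet name: DataFrame}."""
    # When sheet_name is a list, pd.read_excel returns a dict keyed by sheet name
    return reader(source, sheet_name=list(sheet_names))

# Possible usage with the URL defined above (requires network access):
# sheets = load_sheets(url_attack)
# df_techniques = sheets['techniques']
```

The trade-off is that the frames live in a dictionary rather than in separate variables, which also makes it easy to loop over them in later steps.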
Step 4: Clean Up Column Names
Running the command df_techniques.head() in our notebook, we can see that there are spaces in the column names. We can correct the column names using the PyJanitor function jn.clean_names().
df_datasources = jn.clean_names(df_datasources)
df_tactics = jn.clean_names(df_tactics)
df_techniques = jn.clean_names(df_techniques)
df_relationships = jn.clean_names(df_relationships)
df_mitigations = jn.clean_names(df_mitigations)
df_software = jn.clean_names(df_software)
df_groups = jn.clean_names(df_groups)
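To get a feel for what this kind of cleaning does, here is a rough pandas-only approximation. It is not PyJanitor's implementation, and the helper name `snake_case` and the toy column labels are assumptions made for illustration.

```python
import re
import pandas as pd

def snake_case(label):
    """Lowercase a label and replace runs of non-alphanumeric characters with '_'."""
    return re.sub(r'[^0-9a-z]+', '_', str(label).strip().lower()).strip('_')

# Toy frame with spaces and hyphens in the column names (illustrative only)
raw = pd.DataFrame(columns=['ID', 'data sources', 'is sub-technique'])

# DataFrame.rename accepts a function that is applied to every column label
cleaned = raw.rename(columns=snake_case)
```

After this step the columns can be accessed as plain identifiers, for example `cleaned['data_sources']`.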
Step 5: Sample the Data
Let’s take a look at the data. In the screenshots below we can see that a lot of useful information is available. Having direct access to MITRE ATT&CK data as a table provides options for filtering and customization based on our threat detection research and development needs.
Techniques:
Data Sources:
Relationships:
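As an illustration of the kind of filtering a table makes possible, the sketch below selects sub-techniques that list Windows as a platform. The toy rows, the placeholder IDs, and the column names `platforms` and `is_sub_technique` are assumptions; check them against the actual columns in your download.

```python
import pandas as pd

# Toy stand-in for the cleaned techniques frame (rows and columns are illustrative)
techniques = pd.DataFrame({
    'id': ['T0001', 'T0001.001', 'T0002'],
    'name': ['Example Technique', 'Example Sub-Technique', 'Another Technique'],
    'platforms': ['Windows, Linux', 'Windows', 'macOS'],
    'is_sub_technique': [False, True, False],
})

# Keep only sub-techniques that mention Windows among their platforms
windows_subs = techniques[
    techniques['is_sub_technique']
    & techniques['platforms'].str.contains('Windows', na=False)
]
```

The same pattern works for other columns, such as tactics or data sources.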
Step 6: Perform Data Pre-Processing Tasks
When observing the techniques data, we can see that in the data sources column the data sources are combined into a single string (screenshot below), which isn’t suitable for our desired outcome of visualizing sub-technique counts by data source. In order to use the data source information for a visualization, we’ll need to convert the string to a list of data sources and then use the Pandas explode function to generate separate observations.
# Convert the string of data sources to a list of data sources
df_techniques['data_sources'] = df_techniques['data_sources'].str.split(",")
# Use the Pandas explode function to expand the list of data sources to separate rows
df_techniques = df_techniques.explode('data_sources').reset_index(drop=True)
# Strip any whitespace left around each entry so identical data sources group together
df_techniques['data_sources'] = df_techniques['data_sources'].str.strip()
# Get the technique ID and data sources, then drop duplicate rows; place the output in a new data frame called 'viz_data' that'll be used for our visualization
viz_data = df_techniques[['data_sources','id']].drop_duplicates().groupby(['data_sources']).size().reset_index()
# Rename the column to 'count' in preparation for visualization
viz_data.columns = viz_data.columns.map(str)
viz_data = viz_data.rename(columns={"0": "count"})
Before:
After:
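The split, explode, and count steps can be tried on a small toy frame first. The rows and placeholder IDs below are made up for illustration; only the column names mirror the cleaned techniques data. This sketch also names the count column directly with `reset_index(name='count')`, which is one possible shortcut for the rename step.

```python
import pandas as pd

# Toy frame: one row with two data sources, one with a single source, one with none
toy = pd.DataFrame({
    'id': ['T0001', 'T0002', 'T0003'],
    'data_sources': [
        'Command: Command Execution, Process: Process Creation',
        'Process: Process Creation',
        None,
    ],
})

# Split the string into a list, then give each list element its own row
exploded = (
    toy.assign(data_sources=toy['data_sources'].str.split(','))
       .explode('data_sources')
       .reset_index(drop=True)
)
# Remove the space left after each comma
exploded['data_sources'] = exploded['data_sources'].str.strip()

# Count distinct technique IDs per data source; rows without a data source are dropped by groupby
counts = (
    exploded[['data_sources', 'id']]
    .drop_duplicates()
    .groupby('data_sources')
    .size()
    .reset_index(name='count')
)
```

Here `exploded` has four rows, and `counts` has one row per distinct data source.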
Step 7: Visualize the Data
In our final step, we use Plotly to visualize the data. As shown in the chart, a significant portion of MITRE techniques are related to the data sources “Command: Command Execution” and “Process: Process Creation.” In a future post, we’ll analyze the OSSEM data set to better understand the relationship between techniques, data sources, and event IDs.
fig_te_by_ds = px.bar(
    viz_data.sort_values('count', ascending=False).head(25),
    x='data_sources',
    y='count',
    title='MITRE ATT&CK: Sub-Technique Count by Data Source (Top 25)',
    labels={'count': 'Technique Count', 'data_sources': 'Data Source'},
)
fig_te_by_ds.show()
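One way to keep the number of bars and the chart title in step is to derive both from a single value. The helper name `top_n_with_title` and the toy counts below are hypothetical; this is a sketch of the idea, not part of the original workflow.

```python
import pandas as pd

def top_n_with_title(counts, n):
    """Return the n rows with the largest 'count' and a title that names the same n."""
    top = counts.nlargest(n, 'count').reset_index(drop=True)
    title = f'MITRE ATT&CK: Sub-Technique Count by Data Source (Top {n})'
    return top, title

# Toy counts standing in for viz_data (illustrative values)
toy_counts = pd.DataFrame({
    'data_sources': ['A: a', 'B: b', 'C: c'],
    'count': [3, 10, 7],
})

top, title = top_n_with_title(toy_counts, 2)
```

The returned frame and title could then be passed to `px.bar` as the data and `title` arguments.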
Hope you found this post helpful.
Resources: