Stories by Eli Rizk on Medium

How to build custom CrowdStrike integrations with Foundry apps

Eli Rizk — Mon, 02 Dec 2024 17:53:16 GMT

In this post, I will introduce the Falcon Foundry platform to beginners who would like to get started with building custom security integrations with CrowdStrike. I will also walk through developing a custom integration app with Zoho Desk using CrowdStrike Falcon Foundry, cybersecurity’s first low-code application platform.

Motivation

In today’s rapidly evolving cybersecurity landscape, professionals are constantly challenged to keep up with the latest threats and technologies. Navigating multiple platforms and integrating various tools can be time-consuming and complex. CrowdStrike Falcon Foundry offers a powerful solution by providing a low-code application platform (LCAP) that simplifies the development of custom cybersecurity integration apps. By leveraging this platform, cybersecurity experts can streamline their workflows, enhance their threat detection capabilities, and respond to incidents more efficiently. This blog post aims to guide you through the process of developing a custom app using CrowdStrike Falcon Foundry, empowering you to harness the full potential of this innovative platform to integrate with any other tool at hand.

Background

Foundry applications offer a range of capabilities that allow developers to incorporate a multitude of functionalities within their integrated applications such as storing data in collections, executing code in functions, integration with third-party APIs, and more. The table below describes the capabilities covered by Foundry applications. Note that I will only be covering API integration and function capabilities in this post.

Foundry Functions

Foundry functions allow developers to build custom business logic into the app and run it in the CrowdStrike cloud. Supported languages are Python 3.9 or later and Go 1.19 or later. CrowdStrike caps the execution timeout to 30 seconds. The function capability allows security analysts to execute any custom code and include it as a workflow action in Fusion SOAR. Some examples of logic that can be implemented include modifying variables, writing to collections, executing a LogScale query, and even sending custom HTTP requests.

API Integration

Foundry applications are also able to integrate with HTTP-based web services within Falcon using the OpenAPI specification. Once configured, the Falcon platform becomes able to interact with and orchestrate API requests as a Fusion SOAR workflow action. Each Falcon application is limited to one API host (one domain per app). This capability allows CrowdStrike users to integrate their security solution with any other service through 3rd party API requests even if the integration isn’t natively supported.

Falcon Platform

Foundry gives users two ways to build and manage custom apps:

Command Line Interface (CLI): build locally and deploy from the command line
UI-based tool (App builder): build the application from the Falcon console (over the web)

Note that not all capabilities are offered over CLI or the App builder. The figure below specifies which interface allows you to develop specific capabilities:

In this post, I will use the Foundry CLI exclusively to create, develop, and deploy the Foundry app. Note that the API integration can also be developed using the App builder over the web.

Quickstart

To get started with Foundry app development, you’ll need to install the Foundry CLI. On Windows, you can install it using Scoop:

scoop bucket add foundry https://github.com/crowdstrike/scoop-foundry-cli.git
scoop install foundry

For Linux and macOS users, install using Homebrew:

brew tap crowdstrike/foundry-cli
brew install foundry

Verify your installation by running:

foundry version

Once the Foundry CLI is installed, you will need to create a profile by logging in. This ensures you have the appropriate credentials to build the app on your CrowdStrike console. Run foundry login in your terminal. You should be redirected to a login page where you will set the appropriate permissions, name your credentials and hit authorize.

Here’s an overview of useful commands:

foundry apps create — Create a new Foundry app
foundry apps deploy — Deploy the Foundry app
foundry apps release — Release the Foundry app

Once the app is released, you will have to install it through the App builder in the Falcon console, which might require consenting to the permissions requested by the app.

Create a function

To create a function, run the following command:

foundry functions create --name <name> --description <description> --handler-name <handler-name> -l <language> --handler-method <method> --handler-path <path>

This will create a new function with filler code. To add environment variables, you can include the following in the manifest file:

functions:
  environment_variables:
    variable_name: value

If the function expects input data or returns data, specify the request and response schema as JSON schema files and include their path in the manifest file. If needed, also consider making the function available on Fusion SOAR (through the App builder).

Create an API integration

To create an API integration, you will need an OpenAPI specification file. You can either write it yourself or use the specification published by the API you’re integrating with. This file defines the API endpoints, request and response schemas, and other details. Once you have the OpenAPI specification file, you can create the API integration using the following command:

foundry api-integrations create

This command will prompt you for the name of the API integration, a description, and a path to a local file or a URL for the OpenAPI specification file.
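If you need to write the specification yourself, a minimal sketch in Python might look like the following. It builds a bare-bones OpenAPI 3.0 document with a single server (matching the one-host-per-app limit) and writes it to a local file you could point the prompt at. The host `https://api.example.com/v1`, the `/tickets/{ticketId}` path, and the `getTicket` operation are hypothetical placeholders, not the real Zoho Desk API.

```python
import json
from pathlib import Path


def build_openapi_spec(title: str, server_url: str) -> dict:
    """Return a minimal OpenAPI 3.0 document with a single GET operation."""
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": "1.0.0"},
        # A single server entry, since each Foundry app is limited to one API host
        "servers": [{"url": server_url}],
        "paths": {
            # Hypothetical endpoint used purely for illustration
            "/tickets/{ticketId}": {
                "get": {
                    "operationId": "getTicket",
                    "parameters": [{
                        "name": "ticketId",
                        "in": "path",
                        "required": True,
                        "schema": {"type": "string"},
                    }],
                    "responses": {
                        "200": {
                            "description": "The requested ticket",
                            "content": {
                                "application/json": {"schema": {"type": "object"}}
                            },
                        }
                    },
                }
            }
        },
    }


spec = build_openapi_spec("Example ticketing API", "https://api.example.com/v1")
# Write the spec to a local file that can be passed to the CLI prompt
Path("openapi.json").write_text(json.dumps(spec, indent=2))
```

In practice you would replace the placeholder path and schemas with the endpoints you actually want to expose as workflow actions.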

References

Falcon Foundry documentation (You must be signed in to access the page)

How I used FastAPI to automate phishing email handling in Microsoft Teams

Eli Rizk — Mon, 02 Dec 2024 17:52:04 GMT

In this post, I will highlight how I used Python’s FastAPI to develop a custom phishing email handling functionality in Microsoft Teams using actionable message cards.

Motivation

Email is one of the most prevalent attack vectors for malicious actors. An attacker only needs one unknowing user to fall for a phishing campaign to gain foothold access to company systems and cause significant damage. Consequently, cybersecurity and IT analysts must always be vigilant about suspicious emails. Many email security providers offer solutions to prevent suspicious emails from reaching users’ inboxes, such as Microsoft’s Defender for Office 365, Proofpoint, Mimecast, and Abnormal. However, these products can introduce the risk of false positives, where legitimate emails are undelivered or stuck in quarantine. This requires SOC analysts to regularly monitor and review quarantined emails to release any legitimate ones. This motivated me to build a simple API that alerts analysts when an email is quarantined and allows quick action on the alert, such as releasing or deleting the email, changing the status of the detection, and more.

Project Architecture

This project requires two components: a FastAPI app whose endpoints handle our program’s logic, and a Python job that continuously queries Microsoft Defender for Office 365 for new quarantined emails.

Below is a high-level diagram of the project architecture:

FastAPI setup

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. It is designed to be easy to use and learn, while also providing the performance of asynchronous code. FastAPI is built on top of Starlette for the web parts and Pydantic for the data parts. It allows you to quickly create robust and production-ready APIs with automatic interactive documentation generated by Swagger UI and ReDoc.

To get started with FastAPI, you need to install it using pip:

pip install fastapi
pip install "uvicorn[standard]"

You can then create a simple FastAPI application:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}

To run the application, use the following command:

uvicorn main:app --reload

This will start a development server and you can access the interactive API documentation at http://127.0.0.1:8000/docs.

Now that we have set up the FastAPI instance, we can start building our endpoints. We will start by creating the endpoint that sends a Microsoft Teams alert for an analyzed email object. To do so, we will use the Microsoft Graph analyzedEmail resource type to retrieve the email’s information, followed by a Microsoft Teams webhook to send the message card.

To query the Microsoft Graph API, we need to register an application with Microsoft and generate client credentials. We can do so in Azure using the App registrations service. Make sure to give the application the appropriate application permission: SecurityAnalyzedMessage.ReadWrite.All. You can follow Microsoft’s walkthrough for registering an application that uses the Microsoft Graph API.

import json
import os

import httpx
from fastapi import HTTPException

MICROSOFT_GRAPH_API_URL = "https://graph.microsoft.com/beta/security/collaboration/analyzedEmails/"

async def get_ms_token():
    # Client credentials flow: exchange the app's ID and secret for a Graph token
    tenant_id = os.environ["MICROSOFTGRAPH_TENANT_ID"]
    client_id = os.environ["MICROSOFTGRAPH_CLIENT_ID"]
    client_secret = os.environ["MICROSOFTGRAPH_CLIENT_SECRET"]
    scope = "https://graph.microsoft.com/.default"
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
            data={
                "grant_type": "client_credentials",
                "client_id": client_id,
                "client_secret": client_secret,
                "scope": scope,
            },
        )
    # The request must be awaited before the body can be parsed
    token_result = response.json()
    if "access_token" not in token_result:
        raise Exception(f"Error while getting access token: {json.dumps(token_result, indent=4)}")
    return token_result["access_token"]

@app.get("/analyzedEmail/{analyzedEmailId}")
async def get_analyzed_email(analyzedEmailId: str):
    access_token = await get_ms_token()
    headers = {
        "Authorization": f"Bearer {access_token}"
    }
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{MICROSOFT_GRAPH_API_URL}{analyzedEmailId}", headers=headers)
        if response.status_code != 200:
            raise HTTPException(status_code=response.status_code, detail="Error retrieving analyzed email")
    # Return the analyzedEmail object so callers can use its fields
    return response.json()

Once we have obtained the analyzed email resource object, we are ready to send an alert in Microsoft Teams. The alert will be sent in the format of a message card to allow semi-automated actions by the members of the Teams channel. This will greatly facilitate the need for analysts to jump back and forth through different consoles.

API_ENDPOINT = ""
WEBHOOK_URL = ""
TENANT_ID = os.environ["MICROSOFTGRAPH_TENANT_ID"]

@app.post("/sendToTeams/analyzedEmail/{analyzedEmailId}")
async def send_analyzedEmail(analyzedEmailId: str):
    # Reuse the lookup endpoint defined earlier to fetch the analyzedEmail object
    email_data = await get_analyzed_email(analyzedEmailId)
    message_card = {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": "Email Sent to Quarantine",
        "themeColor": "0076D7",
        "sections": [
            {
                "activityTitle": "Email Sent to Quarantine",
                "activitySubtitle": "Microsoft Defender for Office 365 detected a suspicious email and sent it to quarantine",
                "activityImage": "https://www.hkmu.edu.hk/ito/wp-content/uploads/sites/10/2021/06/phishingicon1.jpg",
                "facts": [
                    {"name": "Logged Timestamp", "value": email_data["loggedDateTime"]},
                    {"name": "Network Message ID", "value": email_data["networkMessageId"]},
                    {"name": "Email Subject", "value": email_data["subject"]},
                    {"name": "Recipient Address", "value": email_data["recipientEmailAddress"]},
                    {"name": "Sender Address", "value": email_data["senderDetail"]["fromAddress"]},
                    {"name": "Return Path", "value": email_data["returnPath"]},
                    {"name": "Policy", "value": email_data["policy"]},
                    {"name": "Latest Action", "value": email_data["latestDelivery"]["action"]},
                    {"name": "DMARC", "value": email_data["authenticationDetails"]["dmarc"]},
                    {"name": "DKIM", "value": email_data["authenticationDetails"]["dkim"]},
                    {"name": "SPF", "value": email_data["authenticationDetails"]["senderPolicyFramework"]},
                    {"name": "Sender IP Address", "value": email_data["senderDetail"]["ipv4"]},
                    {"name": "Message URLs", "value": "\n".join([url["url"] for url in email_data["urls"]])},
                    {"name": "Attachment File Hashes", "value": "\n".join([attachment["sha256"] for attachment in email_data["attachments"]])}
                ],
                "markdown": True
            }
        ],
        "potentialAction": [
            {
                "@type": "OpenUri",
                "name": "View Email",
                "targets": [
                    {
                        "os": "default",
                        "uri": f"https://security.microsoft.com/emailentity?f=summary&startTime={email_data['loggedDateTime']}&endTime={email_data['loggedDateTime']}&id={email_data['networkMessageId']}&recipient={email_data['recipientEmailAddress']}&tid={TENANT_ID}"
                    }
                ]
            },
            {
                "@type": "HttpPOST",
                "name": "Release Email",
                "target": f"{API_ENDPOINT}/analyzedEmail/remediate",
                "body": "{\"networkMessageId\":\"" + email_data["networkMessageId"] + "\", \"recipientEmailAddress\":\"" + email_data["recipientEmailAddress"] + "\", \"action\":\"moveToInbox\"}"
            },
            {
                "@type": "HttpPOST",
                "name": "Delete Email",
                "target": f"{API_ENDPOINT}/analyzedEmail/remediate",
                "body": "{\"networkMessageId\":\"" + email_data["networkMessageId"] + "\", \"recipientEmailAddress\":\"" + email_data["recipientEmailAddress"] + "\", \"action\":\"hardDelete\"}"
            },
            {
                "@type": "HttpPOST",
                "name": "Move to Junk",
                "target": f"{API_ENDPOINT}/analyzedEmail/remediate",
                "body": "{\"networkMessageId\":\"" + email_data["networkMessageId"] + "\", \"recipientEmailAddress\":\"" + email_data["recipientEmailAddress"] + "\", \"action\":\"moveToJunk\"}"
            }
        ]
    }
    async with httpx.AsyncClient() as client:
        response = await client.post(WEBHOOK_URL, json=message_card)
        if response.status_code != 200:
            # 502 signals that the upstream webhook call failed (404 would imply a missing route)
            raise HTTPException(status_code=502, detail="Error sending message card to Teams")
    return {"message": "Message card sent to Teams successfully"}
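The three HttpPOST actions above build their JSON bodies by string concatenation, which breaks if a value contains a quote or backslash. As a hedged alternative, here is a small sketch of a helper that serializes the body with `json.dumps` instead. The helper name `build_remediation_action` and the example values are my own illustration, not part of the original code.

```python
import json


def build_remediation_action(name: str, target: str, network_message_id: str,
                             recipient: str, action: str) -> dict:
    """Build one HttpPOST action for a message card, serializing the body safely."""
    body = {
        "networkMessageId": network_message_id,
        "recipientEmailAddress": recipient,
        "action": action,
    }
    return {
        "@type": "HttpPOST",
        "name": name,
        "target": target,
        # json.dumps escapes quotes and backslashes that concatenation would not
        "body": json.dumps(body),
    }


action = build_remediation_action(
    "Release Email",
    "https://api.example.com/analyzedEmail/remediate",  # hypothetical API_ENDPOINT
    "abc-123",                                           # made-up message ID
    'analyst"test@example.com',                          # a quote that would break concatenation
    "moveToInbox",
)
print(action["body"])
```

The same helper could produce the "Delete Email" and "Move to Junk" actions by changing only the name and action arguments.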

Here’s a sample of what the message card looks like in Teams:

Once we have sent the alert to Teams, we also need to accept the incoming POST requests triggered by the user actions in the message card. We will implement three actions: release, hard delete, and move to junk.

from pydantic import BaseModel

class RemediationRequest(BaseModel):
    # JSON body posted by the message card's HttpPOST actions
    networkMessageId: str
    recipientEmailAddress: str
    action: str

@app.post("/analyzedEmail/remediate")
async def email_action(request: RemediationRequest):
    access_token = await get_ms_token()
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    payload = {
        "action": request.action,
        "networkMessageId": request.networkMessageId,
        "recipientEmailAddress": request.recipientEmailAddress
    }
    async with httpx.AsyncClient() as client:
        response = await client.post(f"{MICROSOFT_GRAPH_API_URL}remediate", headers=headers, json=payload)
        if response.status_code != 200:
            raise HTTPException(status_code=response.status_code, detail="Error performing email action")
    return {"message": f"Email action '{request.action}' performed successfully"}

This concludes our FastAPI instance. Next, we will implement a Python script that runs as a service to continuously poll the Microsoft Graph API for new quarantined emails and uses our API endpoint to send the alert to Teams.

Python Job

We will write our Python service script to check every 5 minutes for analyzed emails that are stuck in quarantine, using the startTime and endTime query parameters. In practice, I have noticed that the Microsoft Graph API has some latency before showing analyzed emails, so instead of querying the past 5 minutes, the code queries the window between 15 and 10 minutes ago. For every quarantined email, the script calls our FastAPI endpoint to send the relevant detection to Teams.

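import asyncio
import time

import httpx

API_ENDPOINT = ""  # base URL of the FastAPI app built in the previous section

async def send_to_teams(analyzed_email_id: str):
    # Ask the FastAPI app to post the message card for this email
    async with httpx.AsyncClient() as client:
        response = await client.post(f"{API_ENDPOINT}/sendToTeams/analyzedEmail/{analyzed_email_id}")
        response.raise_for_status()

async def main():
    while True:
        try:
            now = time.time()
            # Query the window from 15 to 10 minutes ago to allow for Graph API latency
            start_time = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now - 15*60))
            end_time = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now - 10*60))
            # get_ms_token() is the same helper defined in the FastAPI app
            access_token = await get_ms_token()
            headers = {
                "Authorization": f"Bearer {access_token}"
            }
            async with httpx.AsyncClient() as client:
                response = await client.get(f"https://graph.microsoft.com/beta/security/collaboration/analyzedEmails?startTime={start_time}&endTime={end_time}", headers=headers)
                response.raise_for_status()
                emails = response.json()["value"]
                for email in emails:
                    if email["latestDelivery"]["location"] == "quarantine":
                        await send_to_teams(email["id"])
        except Exception as e:
            print(f"Error: {e}")
        # Poll every 5 minutes
        await asyncio.sleep(300)

if __name__ == "__main__":
    asyncio.run(main())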
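The lagged query window described above is easy to get backwards, so it can help to isolate it in a small helper. The sketch below is my own illustration (the name `lagged_window` and its parameters are assumptions, not part of the original script): it returns a window that ends `lag_minutes` ago and is `width_minutes` wide.

```python
from datetime import datetime, timedelta, timezone


def lagged_window(now: datetime, lag_minutes: int = 10, width_minutes: int = 5) -> tuple[str, str]:
    """Return (startTime, endTime) strings for a window that ends `lag_minutes` ago."""
    end = now - timedelta(minutes=lag_minutes)
    start = end - timedelta(minutes=width_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


# Fixed timestamp so the result is reproducible
now = datetime(2024, 12, 2, 17, 30, 0, tzinfo=timezone.utc)
print(lagged_window(now))
# ('2024-12-02T17:15:00Z', '2024-12-02T17:20:00Z')
```

Because the window width equals the 5-minute polling interval, consecutive runs cover adjacent windows: the end of one query is the start of the next.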

Conclusion

By leveraging FastAPI and Microsoft Graph, we have created a powerful tool that enhances the efficiency of cybersecurity analysts. This solution not only automates the detection and alerting process but also provides actionable insights directly within Microsoft Teams. This integration reduces the need for analysts to switch between different platforms, allowing them to respond to threats more quickly and effectively. The project can be easily augmented to interact with a back-end database and offer more functionalities such as assigning a status, an analyst, tags or comments to a detection facilitating team collaboration against possible attacks. As cyber threats continue to evolve, having such automated and integrated systems in place is crucial for maintaining robust security postures. This project demonstrates the potential of combining modern web frameworks with cloud-based APIs to build scalable and responsive security solutions. I hope this guide inspires you to explore further possibilities with FastAPI and Microsoft Graph in your own projects.

What do you think about this solution? Have you tried something similar? Connect with me on LinkedIn and let me know!

References

Microsoft Graph analyzedEmail resource type

Microsoft Graph app registration walkthrough

Actionable message card documentation

Getting started with static code analysis using Joern

Eli Rizk — Sat, 10 Feb 2024 02:56:48 GMT

In this post we will learn the basics of static code analysis and how to use joern for analyzing an application’s source code. This post also appears on my personal website here.

Motivation

Static code analysis tools have come a long way in the past decade. Multiple tools have been developed to run on top of an application’s code to detect bugs, vulnerabilities, or general inefficiencies. For example, SpotBugs has tremendously increased in popularity due to its wide support of bug discovery for Java code. In contrast, Joern provides a general static code analysis tool that supports multiple languages, including C++, Java, and PHP.

Code Property Graphs

In order to extract the multiple properties of source code (e.g., variable names, data dependencies, function calls and definitions, etc…), we will have to obtain an Intermediate Representation (IR) of the code. IR is widely used by compilers to transform high-level code into low-level machine instructions and to perform different types of optimizations on the code before turning it into an executable. IR includes different types of graphs, mainly: Abstract Syntax Trees (AST), Control Flow Graphs (CFG), and Data/Program Dependence Graphs (DDG or PDG). Let’s talk about the details of each below.

We will use the following C++ example to illustrate the differences of each graph type:

void foo()
{
  int x = source();
  if (x < MAX)
  { 
    int y = 2 * x;
    sink(y);
  }
}

You can think of the source function as a way to obtain user input and the sink function as a sensitive function that needs to be safeguarded against malicious input (e.g., database query, printing a value on a webpage, etc…)

Parsing this code, we can obtain the AST, CFG, or PDG. The AST uses the language’s grammar to fill the graph nodes, giving a faithful representation of the original code’s structure; it also maintains the order in which statements were written. The CFG shows how execution can move between statements (which aren’t necessarily next to each other in the original code). It is mainly useful for representing if-else statements, for and while loops, break statements, etc… For example, the execution of the C++ code above can jump from evaluating the if condition straight to the end of the function if the condition evaluates to false; this possible execution path appears as an edge in the CFG. The PDG highlights dependencies between variables, function calls, conditions, etc… For example, the variable x defined on the third line is used in the if condition and in the definition of y, so data dependency edges are added to show the flow of data from one statement to another. The AST, CFG, and PDG are shown below.

AST, CFG, PDG of the C++ code

While each code representation has its advantages, it is also limiting. For instance, ASTs provide a true representation of the code but it’s hard to find data dependencies between different statements. To resolve these conflicting issues, Code Property Graphs (CPG) were invented to merge these three different graphs which will greatly simplify traversing the code and accessing multiple properties we might seek for discovering bugs, vulnerabilities, or inefficiencies. The CPG graph of the C++ code is presented below.

CPG of the C++ code
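To make the idea of merging the three graphs more concrete, here is a simplified Python sketch of the foo example. It is only an illustration under loose assumptions: real CPGs have one node per AST element, whereas this toy version treats whole statements as nodes, and the edge lists are written by hand rather than produced by a parser.

```python
from collections import defaultdict

# Hand-written edges for foo(), one list per representation
AST = [("foo", "x = source()"), ("foo", "if"),
       ("if", "x < MAX"), ("if", "y = 2 * x"), ("if", "sink(y)")]
CFG = [("ENTRY", "x = source()"), ("x = source()", "x < MAX"),
       ("x < MAX", "y = 2 * x"),   # condition is true
       ("x < MAX", "EXIT"),        # condition is false
       ("y = 2 * x", "sink(y)"), ("sink(y)", "EXIT")]
PDG = [("x = source()", "x < MAX"), ("x = source()", "y = 2 * x"),
       ("y = 2 * x", "sink(y)")]


def merge_to_cpg(**layers):
    """Merge per-representation edge lists into one graph with labeled edges."""
    cpg = defaultdict(list)
    for label, edges in layers.items():
        for src, dst in edges:
            cpg[src].append((label, dst))
    return cpg


cpg = merge_to_cpg(AST=AST, CFG=CFG, PDG=PDG)


def successors(node, label):
    """Follow only the edges of one kind (AST, CFG, or PDG) out of a node."""
    return [dst for lbl, dst in cpg[node] if lbl == label]


print(successors("x < MAX", "CFG"))       # ['y = 2 * x', 'EXIT']
print(successors("x = source()", "PDG"))  # ['x < MAX', 'y = 2 * x']
```

The point of the merged structure is that a single traversal can switch between edge kinds, for example following control flow from the condition and then data dependencies into the sink.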

Joern

Joern parses the application code to obtain a Code Property Graph and allows developers to customize their needs as graph traversal algorithms on top of this graph.

To install joern, you can execute the following code on a Linux / Ubuntu system:

mkdir joern && cd joern # optional
curl -L "https://github.com/joernio/joern/releases/latest/download/joern-install.sh" -o joern-install.sh
chmod u+x joern-install.sh
./joern-install.sh --interactive

By default, joern will be installed at ~/bin/joern. Once installed, you can run joern in a command prompt as follows:

cd ~/bin/joern/joern-cli
./joern

Once the joern terminal is open, it will provide you with an interactive way to parse source code and traverse its graph. Joern is written in Scala (a JVM language that supports both object-oriented and functional programming), so the code written in the joern terminal is Scala code. While knowing Scala will allow you to write more complicated traversals within the joern terminal, you don’t need full fluency in Scala to run some useful traversal commands. You also have the option to export the graph as a Neo4j graph (among other graph formats) and write Python code on top of the graph to traverse it. For now, we will work directly inside the joern terminal, which greatly simplifies graph traversals as it provides an extensive list of useful commands.

For example if the C++ code above was saved as foo.c, we can provide its path to the importCode function in joern to parse the code and obtain its CPG as follows.

joern> importCode("foo.c")

Once done, we can obtain the nodes of interest. For example, if we want all references to the identifier y, we can run:

cpg.identifier("y").toList

This will output the following list of Identifier nodes. Notice how we have two nodes representing the identifier y, once when it was defined in line 6 and another when it is passed to the sink function in line 7.

To obtain a Call node, we can use the call function in the same manner.

In order to find if the source function ever reaches a sink function (and potentially patch or sanitize the input to prevent any vulnerabilities), we can use the reachableByFlows function which performs backward traversal from the sink function to the source.

Vulnerability path

As shown, we obtain a detailed data flow from the source function to the sink, which provides useful information for programmers if they need to prevent any malicious input to reach a sensitive function. As such, they might sanitize the input before it reaches the sensitive function.
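The backward traversal idea can be sketched in a few lines of Python. This is not Joern’s implementation of reachableByFlows, only a toy illustration of the principle: the data-dependence edges of foo are written by hand, and the function name `flows_reaching` is my own.

```python
from collections import deque

# Hand-written data-dependence edges of foo(): producer -> consumers
DATA_FLOW = {
    "source()": ["x = source()"],
    "x = source()": ["x < MAX", "y = 2 * x"],
    "y = 2 * x": ["sink(y)"],
}


def flows_reaching(sink, source, edges):
    """Walk data-dependence edges backward from `sink`; return a source-to-sink path or None."""
    # Invert the edges so each node maps to the nodes that feed it
    producers = {}
    for src, dsts in edges.items():
        for dst in dsts:
            producers.setdefault(dst, []).append(src)

    queue = deque([[sink]])
    seen = {sink}
    while queue:
        path = queue.popleft()
        if path[0] == source:
            return path
        for prev in producers.get(path[0], []):
            if prev not in seen:
                seen.add(prev)
                # Prepend so the finished path reads from source to sink
                queue.append([prev] + path)
    return None


print(flows_reaching("sink(y)", "source()", DATA_FLOW))
# ['source()', 'x = source()', 'y = 2 * x', 'sink(y)']
```

Starting from the sink keeps the search focused: only statements that actually feed the sensitive call are visited, rather than everything the source might influence.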

Conclusion

In this post, we saw how code property graphs are used to store different properties of the source code. This allows modeling bugs, vulnerabilities, and inefficiencies as graph traversal algorithms which greatly simplifies the development of static code analysis tools. We also used the joern tool to parse application source code to its CPG representation and performed some interesting traversals provided by the tool. In the next post, we will look at developing more complicated graph traversal algorithms to model the complexities of correctly detecting different types of vulnerabilities.

References

Paper that introduced Code Property Graphs for static code analysis.

Joern installation guide

Joern traversal basic commands

This post is also available on my personal website.

Exploratory Data Analysis for Predicting Migration Rate from Socio-economic Factors

Eli Rizk — Tue, 02 Jan 2024 05:14:06 GMT

In this post we will apply exploratory data analysis concepts to the problem of predicting a country’s net migration rate as a time series from its socio-economic factors. The code of the project can be found on Github. This post is also available on my personal website where the world maps are interactive.

Introduction

In this project, we will perform data analysis followed by ML model fitting to predict a country’s net migration rate as a time series from socio-economic factors. The following are some useful definitions to keep in mind.

Net migration rate: the difference between the number of migrants entering and those leaving a country in a year, per 1,000 midyear population (U.S. Census definition). A positive rate indicates more people entering the country than leaving it (net immigration), and a negative rate indicates more people leaving the country than entering it (net emigration).

DALYs (Disability-Adjusted Life Years): One DALY represents the loss of the equivalent of one year of full health. DALYs for a disease or health condition are the sum of the years of life lost due to premature mortality (YLLs) and the years lived with a disability (YLDs) due to prevalent cases of the disease or health condition in a population (WHO).

HDI (Human Development Index): A composite index measuring average achievement in three basic dimensions of human development: healthcare, education, and economic situation (UNDP).

Other socio-economic factors which include GDP, Life Expectancy, Inflation, Mortality, and Healthcare expenditure (collected from the World Bank) will also guide us in performing our data analysis and ML training.
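As a quick worked example of the net migration rate definition, the sketch below computes the rate from made-up numbers (the figures are purely illustrative and do not come from the dataset).

```python
def net_migration_rate(immigrants: int, emigrants: int, midyear_population: int) -> float:
    """Net migrants (arrivals minus departures) per 1,000 midyear population."""
    return (immigrants - emigrants) * 1000 / midyear_population


# 50,000 arrivals and 20,000 departures in a country of 10 million people
print(net_migration_rate(50_000, 20_000, 10_000_000))  # 3.0
# Swapping the two flows gives the same magnitude with a negative sign
print(net_migration_rate(20_000, 50_000, 10_000_000))  # -3.0
```

A value of 3.0 therefore means a net gain of 3 people per 1,000 inhabitants over the year.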

The final dataset can be found here: https://github.com/elirizk/ML-for-Predicting-Migration/blob/master/Dataset.csv

Feature Correlation

First, we look at the correlations between the different features to get a sense of how they interact with each other, which will help guide the exploratory data analysis. After reading the dataset into a Pandas data frame, we obtain the following correlation matrix.

import pandas as pd
df = pd.read_csv('Dataset.csv', header=0)
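# numeric_only=True skips text columns such as country names (required in pandas 2.x)
df.corr(numeric_only=True)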

On one level, we can look at how the various features correlate with the net migration rate. The year variable has a near-zero correlation with the net migration rate, which is a good property of our dataset: it suggests that the distribution of the net migration rate does not depend on the year considered. The main factors correlated with the net migration rate are the Human Development Index (HDI), Mortality, and disability-adjusted life years (DALYs). The first correlates positively with migration while the other two correlate negatively. This makes sense: the more developed a country is, the more immigrants it might attract, while the higher the mortality and DALYs are, the more likely the country is to lose citizens through emigration. The other indicators, namely life expectancy, healthcare expenditure, and GDP, seem to have limited correlation with the output feature. The inflation indicator has the lowest correlation (-0.5%), which might push us to eliminate the column altogether from the data. However, we still have to aggregate the data and visualize it to make that decision.

Another important aspect to note is how the input features correlate with each other. For example, the HDI and life expectancy are highly correlated (91%). In fact, life expectancy is one of the factors the UN takes into consideration when calculating a country’s HDI, so it makes sense for these two indicators to correlate. However, we might have to remove one of them when training a machine learning model on the data, or alternatively merge the two columns. As expected, mortality and DALYs are also highly correlated (67%): they both reflect the development of the health sector in a country. The variables HDI and DALYs are also highly correlated (-80%). We will keep a close eye on all the above features throughout the EDA so that we can decide whether to remove one of them or merge them through dimensionality reduction at a later stage.

Data Analysis with R

We will perform the first part of our analysis using R. We will import the library needed to visualize our plots (ggplot2) and read our dataset. We will also transform the Year column into a Date type in R, bin the numerical HDI into categories (low, medium, high, or very high), and remove any entry with an unknown continent code.

library(ggplot2)

setwd("dataset_directory")
df <- read.csv("Dataset.csv", header=TRUE, na.strings = "")

for (i in seq_len(length(df$Year))) {
  df$Year[i] <- (paste("01-01-",as.character(df$Year[i]),sep=""))
}

df$Year <- as.Date(df$Year, format="%d-%m-%Y")
unique(df$Year)
sapply(df, class)

df$HDI <- sapply(df$HDI, cut, breaks = c(0, 0.55, 0.7, 0.8, 1),
               labels = c("Low", "Medium", "High", "Very High"))
df <- subset(df, df$Continent.Code!="Unknown")
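For readers more comfortable with Python, the HDI binning above can be sketched with the standard library. This is a hedged equivalent rather than a translation of the project’s code: it assumes the same breaks and R’s default right-closed intervals, so a value exactly on a break falls into the lower category. The names `HDI_UPPER_BOUNDS` and `hdi_category` are my own.

```python
from bisect import bisect_left

# Upper bounds of the first three bins; anything above 0.8 is "Very High"
HDI_UPPER_BOUNDS = [0.55, 0.7, 0.8]
HDI_LABELS = ["Low", "Medium", "High", "Very High"]


def hdi_category(hdi: float) -> str:
    """Map a numerical HDI in (0, 1] to its category using right-closed bins."""
    if not 0 < hdi <= 1:
        raise ValueError(f"HDI must be in (0, 1], got {hdi}")
    # bisect_left keeps boundary values (e.g. 0.55) in the lower bin
    return HDI_LABELS[bisect_left(HDI_UPPER_BOUNDS, hdi)]


for value in (0.4, 0.55, 0.62, 0.8, 0.93):
    print(value, hdi_category(value))
```

With these breaks, 0.55 is still "Low" and 0.8 is still "High", matching the interval convention assumed above.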

Feature Distribution

We will plot the distribution of some of the features throughout the years. The following R code plots the corresponding figures.

ggplot(df, aes(x=Year, y=Net.Migration.Rate, group=Year)) +
  geom_boxplot() +
  coord_cartesian(ylim=c(-50,50)) +
  labs(title = "Variation of Net Migration Rate per Year",
       y = "Net Migration Rate", x = "Year")

ggplot(df[df$DALYs<150000,], aes(x=Year, y=DALYs, group=Year)) +
  geom_boxplot() +
  labs(title = "Variation of DALYs per Year",
       y = "DALYs", x = "Year")

ggplot(df, aes(x=Year, y=GDP, group=Year)) +
  geom_boxplot() +
  coord_cartesian(ylim=c(-30,30)) +
  labs(title = "Variation of GDP growth per Year",
       y = "GDP growth (%)", x = "Year")

ggplot(df, aes(x=Year, y=Inflation, group=Year)) +
  geom_boxplot() +
  coord_cartesian(ylim=c(-25,110)) +
  labs(title = "Variation of Inflation per Year",
       y = "Inflation, consumer prices (annual %)", x = "Year")

Net Migration Rate per Year

As we can see, the median of this distribution remains roughly stable throughout the years: it stays around 0. There are also a lot of outliers in this distribution. This is expected since a large positive migration rate in one country should translate into a negative migration rate in other countries (the immigrants of one are the emigrants of the other). The fact that the median is stable around 0 from 1960 till 2020 supports the reliability of the data.

DALYs

Concerning the boxplot of the DALYs, the data presents a decrease in the median and standard deviation of the DALYs with the progress of the years. This shows the overall increase in the quality of healthcare around the globe, which explains the decrease in the DALYs. The data has become more concentrated in recent years (decrease in the standard deviation) which might be due to globalization and recent efforts by the UN to better the living conditions of underdeveloped nations.

GDP Growth

Concerning the boxplot of the GDP growth (as a percentage) per year, we can see the fluctuations of this feature throughout the years. The median fluctuates around 5% with occasional downfalls. We can point out specific years where the fall of the GDP growth was expected. For example, the US stock market crash of 2008, which affected most countries, caused a sharp decline of the GDP growth in that year. Additionally, the 2020 Coronavirus Stock Market Crash is also clearly visible in the boxplot where the median GDP growth of countries becomes negative for the first time since 1960, which indicates an overall decline in the GDP in most countries of the world due to the pandemic.

Inflation

Concerning inflation, it is striking to see that in the early and later years (1960–1970 and 2000–2020) the median inflation fluctuates a bit above 0 and the standard deviation is smaller than in the other periods. However, from 1970 till 2000, the data becomes more dispersed and the median inflation is significantly larger than before 1970 or after 2000. This might be due to the fact that during this period, a lot of countries experienced political and economic crises, which skewed the data towards larger inflation. We can also see a large number of outliers in the data, which is consistent with this hypothesis.

Data Aggregation

We will perform data aggregation according to the HDI level as well as the country’s continent.

Aggregation by HDI Level

First, the UN divides the HDI into four levels: low, medium, high, and very high. After aggregating according to this classification, we can visualize the variation of the net migration rate per year per HDI level using the R code below.

agg1 <- aggregate(cbind(Net.Migration.Rate) ~ HDI+Year, data=df, FUN=mean)

ggplot(agg1, aes(x=Year, y=Net.Migration.Rate, color=HDI)) +
  geom_line(stat="identity", lwd=1.2) +
  geom_smooth(linetype=2) +
  labs(title = "Variation of Net Migration per Year",
       subtitle = "Divided according to HDI",
       y = "Net Migration Rate", x = "")

We can clearly see a distinction of this evolution according to the HDI level of the countries. As expected, the countries with a very high HDI have the greatest net migrant rate: a lot of immigrants come to highly developed countries. However, there is a decline in this rate starting in 2010. This might be due to stricter immigration policies.

We can also point out that countries that have a high HDI rank second in terms of migration rate. However, it is worth noting that while this rate was positive in 1990, it decreased slowly to become negative in 2020. One hypothesis might be that some countries that had a high HDI in 1990 progressed and developed into having a very high HDI, climbing up the HDI ranking, while the remaining countries might have faced national difficulties preventing them from increasing their HDI. Hence, this decreased the overall average of the net migration rate of high HDI countries (similar to what a sampling bias might do to the statistic).

Surprisingly, although countries with a low and medium HDI both present a negative net migration rate, countries with a medium HDI have the lowest rate. Why don’t countries with a low HDI have the lowest net migration rate? This might be due to the fact that citizens of extremely underdeveloped countries cannot emigrate freely. For instance, in African and Asian countries with a very low HDI, conservative norms, low education, and low financial status might prevent people from emigrating (such as the African tribes and Arab Bedouins), or a political regime might prohibit emigration altogether.

Aggregation by Continent

Furthermore, we can split the data according to the six continents: Asia (AS), Africa (AF), Europe (EU), North America (NA), South America (SA), and Oceania (OC). The code below does this aggregation.

agg2 <- aggregate(cbind(Net.Migration.Rate) ~ Continent.Code+Year, data=df, FUN=mean)

ggplot(agg2, aes(x=Year, y=Net.Migration.Rate, color=Continent.Code)) +
  geom_line(stat="identity", lwd=1, alpha=0.5, linetype=1) +
  #facet_wrap(~ Continent.Code) +
  geom_smooth(linetype=2)+
  labs(title = "Variation of Net Migration per Year",
       subtitle = "Divided according to Continent",
       y = "Net Migration Rate", x = "Year")

We don’t see a strong correlation between the continents. Most continents have a negative migration rate, except for Asia. This might be due to the fact that most immigration happens between neighboring countries in the same continent. Examples include Syrians fleeing war to Lebanon, Venezuelans fleeing inflation to Colombia, Mexicans immigrating to the USA, and Ukrainians fleeing the Russian war to Poland and Moldova. All these examples strengthen our hypothesis that migration is concentrated among neighboring countries. Hence, the emigrants of a country become immigrants in a neighboring country on the same continent, resulting in an overall fluctuation of the net migration rate around 0 for most continents, especially in the past two decades, i.e., from 2000 till 2020. We will try applying this theory when visualizing the migration rates through maps.

Variation of the features in specific countries

We will visualize the effect of inflation on net migration through the case study of a few countries in order to generalize on the interplay of the different features.

Net migration and Inflation

We will visualize the effect of inflation on net migration through their variation in two countries: Honduras, a country in Central America, and Iraq, a Middle Eastern country in Asia. The time series of this evolution are shown below.

countryName <- "Honduras"
plot1 <- ggplot(df[df$Country.Name==countryName,], aes(x=Year, y=Inflation)) +
  geom_line(stat="Identity") +
  geom_smooth() +
  labs(title = "Variation of Net Migration and Inflation",
       y = "Inflation, consumer prices (annual %)", x = "", subtitle = countryName)

plot2 <- ggplot(df[df$Country.Name==countryName,], aes(x=Year, y=Net.Migration.Rate)) +
  geom_line(stat="Identity") +
  geom_smooth() +
  labs(y = "Net Migration Rate", x = "Year")

countryName <- "Iraq"
plot3 <- ggplot(df[df$Country.Name==countryName,], aes(x=Year, y=Inflation)) +
  geom_line(stat="Identity") +
  geom_smooth() +
  labs(title = "Variation of Net Migration and Inflation",
       y = "Inflation, consumer prices (annual %)", x = "", subtitle = countryName)

plot4 <- ggplot(df[df$Country.Name==countryName,], aes(x=Year, y=Net.Migration.Rate)) +
  geom_line(stat="Identity") +
  geom_smooth() +
  labs(y = "Net Migration Rate", x = "Year")

gridExtra::grid.arrange(plot1, plot3, plot2, plot4,nrow = 2)

The negative correlation between these two variables is clear: in both countries, an increase in inflation correlates with a decrease in the net migration rate and vice versa. When these countries experience a peak in inflation, it correlates with a significant decrease in the net migration, i.e., an increase in the number of emigrants fleeing the country. When this inflation decreases, the net migration rate increases, signifying either that the emigrants are coming back to the country or that new immigrants are coming into the country due to the stabilization of its economic situation.

Net migration and neighboring countries

Furthermore, it is interesting to visualize the evolution of net migration in neighboring countries, especially those undergoing economic, financial, or political crises. The graphs below show the variation of the net migration for Venezuela and Colombia, and for Mexico and the United States. As a historical background, we should consider that since 1970 Colombians have been fleeing to Venezuela to avoid the violent conflict of their homeland. However, as of 2016, the roles have been reversed: Venezuelans have been immigrating to Colombia because of the terrible financial crisis of their country.

This trend is clearly shown in the data: in early years, the net migration of Colombia was negative while that of Venezuela was positive. However, in recent years, the migration rate has been increasing in Colombia and decreasing in Venezuela (due to the mass immigration of Venezuelans to Colombia). The sharp decrease in this rate in Venezuela coincides with its spike in Colombia (around 2016–2020), which supports our hypothesis.

While the situation in Colombia and Venezuela might be considered an outlier, we can see a similar trend when neighboring countries have a huge disparity in economic or developmental opportunities. For instance, in Mexico and the United States, the variations of the net migration rates mirror each other (a sharp increase in one is reflected by a decrease in the other, especially around the year 2000).

A similar trend can be seen in other neighboring countries, e.g. Albania & Greece, Bangladesh & India, Syria & Lebanon, Oman & Yemen. The spike in the migration rate in one of these countries is correlated with a fall in its neighboring country around the same period. The widespread nature of this phenomenon (which we can’t disregard as being a few outliers) will require us to encode spatial locality in our model, so that it considers the features of the neighboring countries (including their net migration rate) when predicting the final migration rate of a given country.

Data Analysis with Python

We will now plot the data as scatter plots and visualize it on maps using Python.

Scatter Plots

First, we will divide the countries according to their HDI rank and plot them according to their net migration rate, GDP growth, and inflation percentage. The three-dimensional scatter plot is shown below.

As expected, the countries with a high net migration rate also happen to have a high HDI, a high GDP growth and a low inflation percentage, whereas countries with a negative migrant rate are mostly underdeveloped and suffer from a low GDP growth and a high inflation.

Next, we will visualize the scatter plot of net migration rate, DALYs, and mortality.

Considering the high correlation of DALYs and mortality, we expect to see a clear distinction in the plot. As expected, the countries with high DALYs also have a high mortality and vice versa. We can clearly see in the plot that highly developed countries with low DALYs and mortality experience a high migration rate whereas underdeveloped countries with high DALYs and mortality suffer from increased emigration (negative net migration rate).

We are also able to animate this scatter plot by year to visualize the variation in those different features as a time series. You can refer to the code on GitHub to generate this scatter plot animation. Through this animation, we can learn about the yearly trends of the data. We notice that the DALYs and mortality tend to decrease from year to year and that the GDP growth fluctuates between 0 and 10% for most years except for its occasional fall during a market crash (e.g. 2008–2009 and 2020). Concerning inflation, before 2000 we can see a considerable dispersion of the data (indicating a large standard deviation), while after the year 2000 countries become closer with regard to the percentage of inflation. All these insights confirm what we deduced from the boxplots of the feature distributions in the prior section.

https://medium.com/media/b7af1892617d43885e5de960f0fbb449/href https://medium.com/media/dfb7085617216fda460eee93e1fbee79/href

Animated Heatmap of Migration

Considering the spatial and temporal features of our dataset, the best way to visualize the data would be through an animated world map. To do so, we will use the geopandas Python module and merge our dataset with the world dataset available through the module. That way, the resulting dataset will include the appropriate country names and geometries so that it can be easily converted into a folium Map. In addition, the prepForHeatMap function normalizes the net migration rate because the weight input of the HeatMap function must be between 0 and 1. The normalization disregards the outliers (with a z-score above 3 or below -3) when calculating the normalized score, and then assigns a score of 1 to those with a z-score above 3 and a score of 0 to those below -3. That way the outliers won’t skew the normalized distribution while staying in the data (so that we don’t end up with missing countries on the map). Refer to the GitHub code for details about the implementation in the GetMap.py file.
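The outlier-aware normalization described above can be sketched roughly as follows. This is not the prepForHeatMap implementation from GetMap.py: the function name normalize_weights is hypothetical, and the sketch assumes that inliers are min-max scaled, which is one plausible reading of the description.

```python
from statistics import mean, pstdev

def normalize_weights(values, z_limit=3.0):
    """Scale values to [0, 1] while keeping outliers out of the scaling range."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.5 for _ in values]  # all values equal: mid weight (assumption)
    z_scores = [(v - mu) / sigma for v in values]
    # Only values within the z-score limit define the min-max range.
    inliers = [v for v, z in zip(values, z_scores) if abs(z) <= z_limit]
    low, high = min(inliers), max(inliers)
    weights = []
    for v, z in zip(values, z_scores):
        if z > z_limit:
            weights.append(1.0)   # high outlier pinned to the top of the scale
        elif z < -z_limit:
            weights.append(0.0)   # low outlier pinned to the bottom of the scale
        elif high == low:
            weights.append(0.5)
        else:
            weights.append((v - low) / (high - low))
    return weights

# Made-up rates: twenty ordinary values and one extreme outlier.
rates = [-2, -1, 0, 1, 2] * 4 + [100]
weights = normalize_weights(rates)
print(weights[:5], weights[-1])
```

Because the outlier is excluded from the range, the ordinary values still spread over the whole [0, 1] interval instead of being squeezed near 0.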

https://medium.com/media/701a4b4e0c04809340d562728865f5fe/href

The purple countries are the ones with a low migration rate while the green ones have a high migration rate. This map gives us an overview of the migration flow in the year 2020. The most popular destinations for immigrants appear to be the US, Canada, Eastern Europe, Australia & New Zealand (not shown in the figure) and the Arab states of the Persian Gulf (Saudi Arabia, Kuwait, UAE). There are a few outliers in the map, namely the countries of Central America, which are accepting Venezuelan immigrants fleeing the crisis in their country, along with Lebanon, which housed Syrian immigrants fleeing war.

Stacked Maps of the features

We can also stack multiple features on a static folium map by running the Jupyter notebook GenerateMap.ipynb on GitHub. After doing so, we end up with the folder StackedMaps filled with the static maps for every year from 1990 until 2020. On each map, we can select the specific feature whose distribution we want to see across the world. When we hover over a country, we can check its migration rate, which is included in the map to make the data easier to read.

The figure below is the folium map of the year 2020 when “Net Migration Rate” is selected. The darker the country, the higher the migration rate; the lighter the color, the lower the rate. As expected, the countries with a high migration rate are mostly the countries of Western Europe, North America, Australia, and the Arab Gulf. There are a few countries in South America, Asia, and Africa suffering from a low migration rate. We will see if this data correlates with our other features by deselecting the migration rate and selecting the other features.

Net Migration Rate

First, we will take a look at the distribution of inflation and its correlation with the above results. The figure below shows the inflation and migration per country during the year 2008. The map supports our earlier hypothesis: the larger the inflation, the lower the migration rate.

Inflation

For example, Argentina (the dark purple country in South America), which suffered from a large inflation in 2008, had a net migration rate of -0.625. The rest of the countries show a similar correlation.

The figures below show the distribution of the DALYs and mortality per country in 2008.

DALYs

Mortality

As expected, countries suffering from a high mortality and high DALYs (like the Central African Republic, Afghanistan, Nigeria, etc.) also suffer from a low migration rate: a lot of emigrants are leaving the country.

Conclusion

Throughout this project, we were able to clearly visualize how the different features interact with each other, guiding us in formulating a clear hypothesis before selecting and training a machine learning model. In the next part of this post, we will train various machine learning models and compare their performance as well as their explainability to reach a final recommendation as to how to predict a country’s net migration rate as a time series from socio-economic factors.

Integer Factorization

Eli Rizk — Sun, 24 Dec 2023 19:29:43 GMT

In this post, we will explore the problem of factorizing integers (in particular semi-primes, i.e., products of two primes). We will implement and contrast two different methods to factorize an integer: the brute-force way with the sieve of Eratosthenes and Pollard’s rho algorithm. This post is also available here.

Photo by Nick Hillier on Unsplash

The problem of integer factorization constitutes a fundamental problem in number theory and provides the security basis of multiple modern encryption schemes.

For example, the RSA asymmetric encryption standard for communication between two parties uses a (public key, private key) pair. In simple terms, RSA computes a large integer as the product of two large primes (n = p.q), and derives the public and private keys from the prime factors. In particular, the public key e is chosen such that gcd(e, (p-1).(q-1)) = 1 and the private key d is derived as the inverse of e modulo (p-1).(q-1). The RSA scheme assumes that n and e are public while the prime factors and the private key are kept secret. It’s easy to see from the formulas that given n’s prime factors and the public key, one can easily derive the private key. Hence, it is of utmost importance to the security of the algorithm that it is “hard” to factorize a large integer. Otherwise, the security of the scheme is broken. Given that the security of many asymmetric encryption standards (which are widely used today as part of symmetric key exchange and certificate authority digital signatures) rests on the difficulty of integer factorization, a lot of research has been dedicated to improving the efficiency of integer factorization algorithms.
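The key derivation above can be illustrated with a toy example. The primes 61 and 53, the exponent 17, and the message 65 are textbook-sized values chosen for illustration; real RSA keys use primes that are hundreds of digits long, plus padding that this sketch leaves out.

```python
from math import gcd

# Toy primes, far too small for real use.
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # (p-1).(q-1) in the notation of this post

e = 17                         # public exponent, must be co-prime with phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # private exponent: inverse of e modulo phi (Python 3.8+)

message = 65
cipher = pow(message, e, n)    # encrypt with the public key (n, e)
recovered = pow(cipher, d, n)  # decrypt with the private key d

print(n, phi, d, recovered)
```

Anyone who can factor n back into p and q can recompute phi and therefore d, which is exactly why the hardness of factorization matters.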

We will limit ourselves to the problem of factoring large numbers of the form: n = p.q where p and q are large prime numbers, which is the form used by RSA. Numbers that satisfy this form (a product of exactly two prime numbers) are called semi-primes.

Note that if we are able to decompose a number into the product of two other numbers, we can apply the same algorithm recursively to the newly found numbers, along with a primality check (which runs in time polynomial in log(n)), to obtain the complete prime decomposition of any integer n. Therefore, our above limitation can be easily lifted to solve the general problem of integer factorization. While primality testing can be done in time polynomial in log(n), no classical polynomial-time algorithm for integer factorization is known.
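As a rough sketch of that recursion, the snippet below combines a primality check with a routine that splits a composite number in two. The names is_prime, split, and prime_factors are hypothetical; the primality check is a Miller-Rabin test with a fixed set of bases (known to be deterministic for inputs below roughly 3.3 x 10^24), and split is a plain trial-division placeholder that could be swapped for any faster factor-finding method.

```python
def is_prime(n):
    """Miller-Rabin with fixed bases; deterministic for n below about 3.3e24."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for b in bases:
        if n % b == 0:
            return n == b
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def split(n):
    """Placeholder splitter: return some divisor of a composite n (trial division)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def prime_factors(n):
    """Full prime decomposition by recursively splitting composite factors."""
    if n == 1:
        return []
    if is_prime(n):
        return [n]
    f = split(n)
    return sorted(prime_factors(f) + prime_factors(n // f))

print(prime_factors(360), prime_factors(143))
```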

Brute-Force Factorization with the Sieve of Eratosthenes

We will start with a naïve implementation of our factorization algorithm: list all primes less than n and check which prime in that list is a divisor. An improvement would be to only check primes less than or equal to the square root of n. Because the number is a product of two primes, the smaller prime has to be less than or equal to its square root.

We will use the sieve of Eratosthenes to find primes in a specific range. This approach goes through all numbers starting from 2 to the end of the range and, for each number that is still marked as prime, removes all other numbers that are multiples of it. We end up with a list of primes; all that is left is to find the first one that divides n.

import math

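def naive(n):
    # Integer square root: math.isqrt avoids the float rounding of int(math.sqrt(n)) for large n
    sqrt_n = math.isqrt(n)

    # Sieve of Eratosthenes over the range [0, sqrt_n]
    isPrime = [True] * (sqrt_n + 1)
    primes = []

    for p in range(2, sqrt_n+1):
        if isPrime[p]:
            primes.append(p)
            # Cross out the multiples of p, starting at p**2
            for x in range(p**2, sqrt_n+1, p):
                isPrime[x] = False

    # Return the first prime that divides n, or 0 if none does
    prime1 = 0
    for p in primes:
        if n%p == 0:
            prime1 = p
            break
    return prime1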

Analysis

The code above will factorize the integer n by building up the list of primes up to the square root of n, and checking whether any of these primes divide n. This algorithm has a log-linear every-case time complexity. Its proof is shown below:

Order notation of the brute-force algorithm

Its space complexity is in the order of square root of n (since we have to keep track of all numbers less than the square root of n).

Pollard’s Rho

This algorithm is very simple to implement but is based on a pseudorandom sequence. This sequence is generated by a polynomial function which is generally chosen as g(x) = (x²+1) mod n. The algorithm proceeds by repeatedly applying the polynomial function to a starting seed value (usually 2) and checking whether two different numbers of the sequence are congruent modulo a prime factor p of n. If so, their difference is a multiple of p, and a factor of n is obtained as the greatest common divisor of n and the absolute value of that difference.

This approach can be thought of as a form of branch and bound where the state space consists of all pairs of numbers (x,y); a pair is considered promising once the absolute value of its difference is no longer co-prime with n.

import math

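def g(x,n): return (x**2 + 1)%n

def rho(n):
    x = 2   # "tortoise": advances one step per iteration
    y = x   # "hare": advances two steps per iteration (Floyd's cycle detection)
    d = 1
    while (d==1):
        x = g(x,n)
        y = g(g(y,n),n)
        # d > 1 as soon as x and y agree modulo some factor of n
        d = math.gcd(abs(x-y), n)
    # d is a non-trivial factor of n, or n itself if this run failed
    return d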

Analysis

The algorithm highly depends on the pseudorandom nature of the polynomial function modulo n. If this randomness is assumed, we can base our analysis on the birthday paradox. This paradox is the counterintuitive fact that it only takes 23 people to have a probability greater than 50% that two of them share the same birthday. This is due to the fact that any two people could share a birthday, so an added person’s birthday is compared to every other birthday: the number of possible pairings grows quadratically with the size of the group.
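The 23-person threshold is easy to check numerically. The short sketch below assumes 365 equally likely birthdays and ignores leap years; the function name shared_birthday_probability is made up for this example. It also prints the number of pairs, k(k-1)/2, for each group size k.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

for k in (10, 22, 23, 50):
    pairs = k * (k - 1) // 2
    print(k, pairs, round(shared_birthday_probability(k), 4))
```

The same reasoning applies to Pollard’s Rho: with roughly the square root of p values in hand, there are already about p pairs that could collide modulo p.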

Similarly, Pollard’s Rho algorithm keeps track of two pseudorandom numbers. Its analysis depends on the implicit sequence { x mod p } where p is a non-trivial factor of n. Assuming randomness, we can expect two computed numbers that are different but congruent modulo p to occur after roughly the square root of p iterations. Since the smallest prime factor of n is at most the square root of n, the expected number of iterations is on the order of the fourth root of n.

The space complexity is O(1) since we only keep track of two numbers generated from a pseudorandom function and check whether they’re promising (their difference isn’t co-prime with n).

Implementation Analysis

After implementing both algorithms in Python and running them on a test file containing semi-primes from 8 to 58 bits in size (with values from 143 to 149470864377634489), we measure their time and space complexity and obtain the following dataset description:

Note that to obtain the memory used by a Python script, we use the tracemalloc module in Python as follows:

import tracemalloc

tracemalloc.start() 
# Call the function to analyze
mem = tracemalloc.get_traced_memory()
tracemalloc.stop()

The variable mem will contain a tuple of the current memory allocated for the script as well as the largest memory that the script used. Hence, we obtain mem[1] as the memory needed by the function to execute.
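The same measurement can be wrapped in a small helper. The sketch below is one possible way to do it; peak_memory and build_list are hypothetical names introduced only for this example, and the helper relies solely on the tracemalloc calls shown above.

```python
import tracemalloc

def peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak number of bytes allocated while tracing)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # also clears the collected traces
    return result, peak

def build_list(size):
    # Stand-in for the function under test, e.g. a sieve that allocates a boolean list.
    return [True] * size

_, small = peak_memory(build_list, 1_000)
_, large = peak_memory(build_list, 1_000_000)
print(small, large)
```

Stopping the tracer in a finally block makes sure each measurement starts from a clean state, even if the measured function raises an exception.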

As we can see from the output, the average times of the two algorithms are drastically different. On average, the brute force implementation takes around 22.5 seconds to factorize an integer while Pollard’s Rho only takes 0.008764 seconds. The time taken by the naïve implementation ranges from 0.00003 to 394.77549 seconds as the size of the integer increases, while for Pollard’s Rho this range is only from 0.00002 to 0.08404 seconds across all inputs.

The amount of memory used is also considerably different. The brute force implementation used between 336 and 4.1913 x 10⁹ bytes of memory with an average of 2.365 x 10⁸ bytes. This is considerably more memory intensive than Pollard’s Rho algorithm, which only used between 84 and 504 bytes of memory with an average of 223.38 bytes.

While the brute force implementation was able to factorize all inputs it was given (success rate of 100%), Pollard’s Rho wasn’t able to factorize 4 out of the 205 integers for a success rate of 98%. This could be resolved by repeating Pollard’s Rho with a different starting value for x. However, we didn’t rerun the algorithm to be authentic to its original prediction and to note its limitations due to its randomized nature.
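For completeness, a retry strategy could look like the sketch below. It was not used for the measurements in this post. The names rho_once and rho_with_retries are hypothetical, and changing the constant c of the polynomial along with the starting value is an assumption of this sketch (a common variation), not something the experiment above did.

```python
import math

def rho_once(n, seed, c):
    """One run of Pollard's rho with g(x) = (x*x + c) mod n.

    Returns a divisor of n greater than 1, which may be n itself on failure.
    """
    x = y = seed
    d = 1
    while d == 1:
        x = (x * x + c) % n          # one step
        y = (y * y + c) % n          # two steps
        y = (y * y + c) % n
        d = math.gcd(abs(x - y), n)
    return d

def rho_with_retries(n, attempts=20):
    """Retry with a new starting value and constant until a proper factor appears."""
    if n % 2 == 0:
        return 2
    for attempt in range(attempts):
        d = rho_once(n, seed=2 + attempt, c=1 + attempt)
        if d != n:
            return d
    return None  # every attempt failed (always the case when n is prime)

print(rho_with_retries(8051), rho_with_retries(25), rho_with_retries(13))
```

Each failed run ends with d equal to n, so a failure is cheap to detect before trying the next parameters.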

The analysis we performed above was correct: we predicted that Pollard’s Rho would be more time efficient than the brute force approach. We also correctly determined that the space requirement of Pollard’s Rho is constant, independent of the input size, while the brute force implementation requires additional space as the size of the input increases to store the list of primes.

We can also plot the time needed to execute with respect to the size of the input N as you can see below (the blue plot is for the naïve implementation and the red one is for Pollard’s Rho):

Plotting execution time against the input (right) and its size (left)

The running time of the naïve implementation is exponential in the size of N (measured as the number of bits in the binary representation of n, i.e., ~ log(n) ), but log-linear in N itself.

Conclusion

In conclusion, we can see how two different approaches to solving the problem of integer factorization can result in different time and space complexities. While the brute force algorithm is deterministic and allows us to compute a list of primes as it is solving the problem, it is very inefficient in terms of time and space requirements. On the other hand, Pollard’s Rho algorithm is very efficient in terms of time and space, but it is based on a pseudorandom sequence and doesn’t offer an absolute guarantee of factorizing an integer. In fact, rigorously proving the time complexity of Pollard’s Rho is still an open problem in mathematics and computer science.

While the problem of large integer factorization has been explored for decades, no efficient algorithm has been found on classical computers yet. This has led cryptographic schemes to rely on the difficulty of integer factoring for the confidentiality of the encrypted data. However, efficient quantum algorithms for integer factorization have been shown to exist. And as quantum computers evolve to be faster and more reliable, the difficulty of factoring an integer might become a thing of the past.

ML for Network Intrusion Detection — Part II: Machine Learning Training

Eli Rizk — Mon, 18 Dec 2023 03:44:53 GMT

If you haven’t yet, check out Part I of this blog post where we performed some data pre-processing, exploratory data analysis, and feature engineering (PCA) on the original dataset. This post is also available here.

As a reminder, the dataset we’re using can be found here: https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection. We are also training the machine learning models on the 10 principal components produced by our feature engineering approach, except for the DNN which will be using the original features.

We will start by defining a helpful function to print the error metrics of the ML models we will train.

from sklearn import metrics

def print_error_metrics(y_test, y_pred):
    # Compare the predicted labels against the ground truth
    acc = metrics.accuracy_score(y_test, y_pred)
    prc = metrics.precision_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    print('Accuracy: {:.5f}'.format(acc))
    print('Precision: {:.5f}'.format(prc))
    print('F1 Score: {:.5f}'.format(f1))

Logistic Regression Models

We will use an unregularized logistic regression model to fit on the training data.

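from sklearn.linear_model import LogisticRegression

# Note: recent scikit-learn releases expect penalty=None instead of the string 'none'
regressor = LogisticRegression(max_iter=100, penalty='none')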
regressor.fit(X_train_PCA, y_train)
y_pred = regressor.predict(X_test_PCA)
print_error_metrics(y_test, y_pred)

We obtain the following metrics on this model: Accuracy: 0.95773, Precision: 0.96615, and F1 Score: 0.95379. While these results are promising, we also train a regularized version of the model using elastic net regularization with an L1 ratio of 0.5. This helps avoid overfitting, which can make the model generalize better to unseen data.

regressor = LogisticRegression(max_iter=400, solver='saga', penalty='elasticnet', l1_ratio=0.5)
regressor.fit(X_train_PCA, y_train)
y_pred = regressor.predict(X_test_PCA)
print_error_metrics(y_test, y_pred)
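To give a feel for what the L1 ratio controls, the sketch below computes an elastic net penalty for a weight vector by hand. It follows the form given in the scikit-learn documentation for logistic regression, l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2², and leaves out the data-fit term and the C parameter. The function name and the example weights are made up for illustration.

```python
def elastic_net_penalty(weights, l1_ratio=0.5):
    """Mix of the L1 norm and half the squared L2 norm of the weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return l1_ratio * l1 + (1.0 - l1_ratio) / 2.0 * l2_squared

w = [3.0, -4.0]
for ratio in (0.0, 0.5, 1.0):
    print(ratio, elastic_net_penalty(w, l1_ratio=ratio))
```

With an L1 ratio of 1 the penalty is pure L1 (lasso-like, encouraging sparse weights), with 0 it is pure L2 (ridge-like), and 0.5 weighs both equally.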

Decision Trees

from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(max_depth=None)
dtree.fit(X_train_PCA, y_train)
print("Decision tree maximum depth:", dtree.tree_.max_depth)
y_pred = dtree.predict(X_test_PCA)
print_error_metrics(y_test, y_pred)

After training a decision tree classifier, we obtain the following metrics: Accuracy: 0.99147, Precision: 0.98974, and F1 Score: 0.99080.

While the results are promising, after looking into the decision tree produced, we realize that its maximum depth is 19. This implies that it might be overfitting on the training data: some decisions aren’t obtained until 19 separate splits are made on one of the 10 principal components. A zoomed out snapshot of the decision tree (as well as the code that produces it) is provided below. Note that the code changes the default color of the decision tree to paint the normal class in green and the anomalous class in red.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgb
from sklearn.tree import plot_tree

features = ['PCA_1', 'PCA_2', 'PCA_3', 'PCA_4', 'PCA_5', 'PCA_6', 'PCA_7', 'PCA_8', 'PCA_9', 'PCA_10']
fig = plt.figure(figsize=(100,90))
class_colors=['green', 'red']
artists = plot_tree(dtree, feature_names=features, class_names=['Normal', 'Anomaly'], filled=True, rounded=True, fontsize=10)
for artist, impurity, value in zip(artists, dtree.tree_.impurity, dtree.tree_.value):
    r, g, b = to_rgb(class_colors[np.argmax(value)])
    f = impurity * 2
    artist.get_bbox_patch().set_facecolor((f + (1-f)*r, f + (1-f)*g, f + (1-f)*b))
    artist.get_bbox_patch().set_edgecolor('black')
fig.savefig('decision_tree_1.png')

As you can see from the tree structure, this is an overly complex model considering it’s trained on 10 features. To regularize this model, we manually set the maximum depth of the tree to 5 through the max_depth parameter of DecisionTreeClassifier. As a result, we obtain the following metrics: Accuracy: 0.97658, Precision: 0.98300, and F1 Score: 0.97450. The decision tree produced is found below.

Decision tree model with maximum depth of 5

Note that the red leaves indicate a majority of malicious samples (so it will be predicted as malicious) and the green leaves have a majority of benign samples (predicted as benign). The more transparent a leaf is, the higher its impurity (the Gini impurity, by default in scikit-learn): the model will predict the majority class label but will be unsure about it. Even though we significantly reduced the tree’s depth from 19 to 5, its accuracy and precision barely dropped. This makes us more confident in recommending this version of the regularized model as opposed to the overfit one.
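The link between a leaf's impurity and its color can be reproduced with a few lines of plain Python. The sketch below assumes the default Gini criterion and mirrors the blending formula of the plotting code above; the function names gini and blend_with_white are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def blend_with_white(rgb, impurity):
    """Fade a class color towards white as the node gets more impure."""
    f = impurity * 2  # binary Gini impurity tops out at 0.5, so f is in [0, 1]
    return tuple(f + (1 - f) * channel for channel in rgb)

red = (1.0, 0.0, 0.0)
for counts in ([100, 0], [90, 10], [50, 50]):
    impurity = gini(counts)
    print(counts, round(impurity, 3), blend_with_white(red, impurity))
```

A pure leaf keeps the full class color, while a perfectly mixed leaf comes out white.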

Deep Neural Network

We start by specifying the architecture of the model. The neural network passes the original 118 features to a hidden layer of 20 nodes which then passes them to another hidden layer of 10 nodes which finally sends them to an output node. Note that the ReLU activation function is used (to avoid the problem of vanishing gradients) except for the final layer which uses the sigmoid activation function to output probability values.

class SimpleNN(torch.nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = torch.nn.Linear(118, 20)
        self.relu1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(20, 10)
        self.relu2 = torch.nn.ReLU()
        self.fc3 = torch.nn.Linear(10, 1)
        self.sigm = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.sigm(x)
        return x
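
As a quick check on the size of this architecture, the sketch below counts its trainable parameters from the layer widths given in the text. The helper name count_parameters is made up for this illustration.

```python
# Layer widths from the text: 118 inputs -> 20 -> 10 -> 1 output.
layer_sizes = [118, 20, 10, 1]

def count_parameters(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

total_params = count_parameters(layer_sizes)
print(total_params)  # 2601
```

Almost all of these parameters (2380 of 2601) sit in the first layer, which is a direct consequence of feeding in 118 input features.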

We will use the binary cross entropy loss function as well as the Adam optimizer to train this model on the training data.

model = SimpleNN()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
dataset = TensorDataset(torch.from_numpy(X_train_norm).type(torch.float), torch.from_numpy(y_train.to_numpy()).type(torch.float))
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)
X_test_DNN = torch.from_numpy(X_test_norm).type(torch.float)
y_test_DNN = torch.from_numpy(y_test.to_numpy()).type(torch.float)

epochs = 20
loss_value = 0.0
train_loss = []
test_loss = []
for epoch in range(epochs):
    model.train()
    i = 0
    for _batch_idx, (features, labels) in enumerate(tqdm(train_loader)):
        i += 1
        optimizer.zero_grad()
        outputs = model(features).squeeze()
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        loss_value += loss.item()
    train_loss.append(loss_value/i)
    # Testing
    model.eval()
    with torch.inference_mode():
        test_pred = model(X_test_DNN).squeeze()
        test_loss_val = criterion(test_pred, y_test_DNN).item()
        test_loss.append(test_loss_val)
        print(f'Epoch {epoch+1}/{epochs}, Training Loss: {loss_value/i:.10f}, Test Loss: {test_loss_val:.10f}')
    loss_value = 0.0
print(train_loss)
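
To see what the binary cross entropy criterion in the loop above is measuring, here is a small hand-written version in plain Python. It is a sketch of the formula only. It assumes probabilities strictly between 0 and 1 and is not a replacement for torch.nn.BCELoss.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Same two labels, once with confident correct outputs and once with confident wrong ones.
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is what pushes the sigmoid output toward the true label during training.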

After 20 epochs, we obtain the following train and test loss curves:

DNN training for 20 epochs

Considering that the testing loss fluctuates and increases slightly beyond the fifth epoch while the training loss keeps decreasing, we decide to employ early stopping and stop training the model at epoch 5 (by changing the epochs variable to 5 in the code above). This results in the following learning curve:

DNN model training with 5 epochs
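
The stopping point above was chosen by reading the loss curve. A common way to automate the same decision is a patience rule, sketched below in plain Python. The function name and the loss values are made up for illustration and are not the article's results. In practice the monitored loss is usually computed on a validation split kept separate from the final test set.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest loss seen before `patience`
    consecutive epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch

# Made-up loss values for illustration only.
example_losses = [0.30, 0.12, 0.08, 0.06, 0.05, 0.055, 0.06, 0.052, 0.07]
stop_epoch = early_stopping_epoch(example_losses, patience=2)
print(stop_epoch)  # 5
```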

To evaluate our model, we run the code below:

model.eval()
with torch.inference_mode():
    test_pred = model(X_test_DNN).squeeze()
    test_loss = criterion(test_pred, y_test_DNN)
    y_pred = []
    for pred in test_pred:
        if pred>0.5: y_pred.append(1)
        else: y_pred.append(0)
    print_error_metrics(y_test_DNN, y_pred)

We obtain the following metrics: Accuracy: 0.99484, Precision: 0.99656, and F1 Score: 0.99442.
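
The print_error_metrics helper called above is not shown in this section. As a hypothetical stand-in, the sketch below thresholds probabilities at 0.5 and computes accuracy, precision, and F1 score from the confusion counts on a toy example.

```python
def threshold_predictions(probabilities, threshold=0.5):
    """Map each probability to class 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, f1) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, f1

# Toy values for illustration only.
toy_probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_pred = threshold_predictions(toy_probs)
accuracy, precision, f1 = classification_metrics(toy_labels, toy_pred)
print(toy_pred, accuracy, precision, f1)
```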

Conclusion

In conclusion, it is clear how data pre-processing and feature engineering can be of incredible help in the ML workflow. As for the specific ML models, we recommend the use of decision trees (regularized with a maximum depth of 5) for network intrusion detection as captured by the dataset we used. This model provides high accuracy for intrusion detection while being regularized against overfitting and being explainable in nature as captured by its decision nodes (as opposed to the unexplainable nature of the deep neural network model we developed).

ML for Network Intrusion Detection — Part I: Feature Engineering

Eli Rizk — Mon, 18 Dec 2023 02:23:10 GMT

In this blog post, we will explore the area of developing a machine learning model for network intrusion detection. This is a two-part blog series. The first part focuses on data pre-processing and feature engineering. The second part trains and evaluates different machine learning models. This post is also available here.

Traditional intrusion detection systems have used the notion of an allow list or a block list to only allow harmless operations or to detect known malicious activities, respectively. This requires a regularly updated configuration file listing the user operations to allow or block, e.g. known malicious IP addresses, logins outside working hours, etc. While this type of network intrusion detection system requires constant modification and a considerable initial investment to produce the original configuration, a novel approach to intrusion detection has been emerging: the use of machine learning models to automate the task of detecting potential intruders. As such, in this post, we will develop a machine learning model to recognize hidden patterns in anomalous network activity. To this end, we will perform some data pre-processing on the dataset we obtain, explore various machine learning models, and recommend the best model according to their performance and other metrics.

The dataset we will be using for the rest of our implementation can be found here: https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection.

Import libraries and set display configuration

Before processing our dataset, we will have to import the necessary libraries to use their helpful commands.

import torch
from torch.utils.data import DataLoader, TensorDataset
from torchviz import make_dot
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn import metrics, preprocessing
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm
import seaborn as sns

I also like setting the display configuration of the pandas module to render all columns, rows, and sequence items (as opposed to truncating the output at a maximum count). This can be done with the following code:

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_item', None)

Exploratory Data Analysis

We will start by examining the columns available in the dataset.

df = pd.read_csv('Train_data.csv')
df.columns

We obtain the following list of features and class labels (note that the class column contains the label of each entry: normal or anomaly). We also note that the protocol_type, service, and flag columns are categorical while the rest are numerical. The dataset has 42 columns in total: 41 features and the class label.

['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
'num_access_files', 'num_outbound_cmds', 'is_host_login',
'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
'dst_host_srv_count', 'dst_host_same_srv_rate',
'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
'dst_host_srv_rerror_rate', 'class']

To check for class imbalances, we plot the number of entries that are labelled as suspicious versus normal. This can safeguard against potential biases in the model being trained (adjusting for class imbalances).

sns.countplot(x=df['class'], palette=['green', 'red'], hue=df['class'])

As you can see, both classes are almost equally represented in the dataset. Hence, we don’t need to remove any instances of the majority class.
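
The same balance check can be done numerically instead of visually. The sketch below uses a toy label column as a stand-in for df['class']. The counts are made up and are not taken from the dataset.

```python
import pandas as pd

# Toy label column for illustration; the article's labels live in df['class'].
toy_labels = pd.Series(["normal"] * 53 + ["anomaly"] * 47, name="class")

# Fraction of rows per class; values near 0.5 each indicate a balanced dataset.
class_share = toy_labels.value_counts(normalize=True)
print(class_share)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()
print(round(imbalance_ratio, 2))
```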

Data Correlation

We now transform the class label column into a binary value (0 for normal and 1 for anomaly) and compute the Pearson correlation matrix of the dataset (excluding the categorical feature columns).

categorical_columns = ['protocol_type', 'service', 'flag']
df['class'] = df['class'].apply(lambda x: 0 if x=="normal" else 1)
plt.figure(figsize=(40,30))
sns.heatmap(df.drop(categorical_columns, axis=1).corr(), annot=True)

Correlation matrix with the 38 numerical features and the class label (last column/row)

As can be seen, most features aren’t highly correlated with each other (colored in light purple for a value close to 0), while some feature pairs have either a high positive (close to 1) or a high negative (close to -1) correlation (colored in white or dark purple, respectively). For the class label column (last column and last index), only 9 of the 38 numerical features are highly correlated with the label (with an absolute value greater than 0.5). However, this doesn’t mean that we can disregard the features with a correlation close to 0. The Pearson correlation coefficient only measures the linear relationship between two variables, which is too simplistic in our case: a feature may relate to the label nonlinearly, or only in combination with other features. For example, we plot the same server rate (same_srv_rate) against the destination host same server rate (dst_host_same_srv_rate), producing the following scatter plot. Note that for the rest of this post, entries labelled as normal are shown in green while those labelled as anomalous are shown in red.

labels = df['class']
colors = ['green','red']
y = df['dst_host_same_srv_rate']
x = df['same_srv_rate']
plt.scatter(x,y, c=labels, cmap=matplotlib.colors.ListedColormap(colors), alpha=0.5, s=40)
plt.xlabel("Same Server Rate")
plt.ylabel("Destination Host Same Server Rate")

As can be seen, while some distinction can be drawn between suspicious and benign inputs, there is no clear-cut separation between them that would yield a highly accurate model.
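
The caveat that a Pearson correlation near 0 does not rule out a strong relationship can be illustrated with a small NumPy sketch on synthetic values (not the dataset): a variable that depends perfectly, but quadratically, on another has a correlation of essentially zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # values symmetric around 0
y_linear = 2.0 * x + 1.0          # perfectly linear in x
y_quadratic = x ** 2              # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
print(round(r_linear, 3), round(r_quadratic, 3))
```

Both targets are completely determined by x, yet only the linear one shows up in the correlation coefficient.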

Before processing the data further, we perform one-hot encoding on the categorical features and split the dataset into training and testing data (80% for training and 20% for testing).

df = df.join(pd.get_dummies(df.loc[:, categorical_columns]))
df = df.drop(categorical_columns, axis=1)
X_train, X_test, y_train, y_test = train_test_split(df.drop(['class'], axis=1), df['class'], test_size=0.2, random_state=42)

The shape of the training data becomes 20153 x 118. One-hot encoding significantly increased the number of features we were concerned with: we now have 118 features to train a model on. Considering that most features weren’t highly correlated to begin with, we decide to perform feature engineering on our dataset before training any model on it.
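
To show why one-hot encoding inflates the column count, here is a toy frame with two categorical columns encoded with the same join, get_dummies, and drop pattern. The rows are invented for illustration.

```python
import pandas as pd

# Invented rows: one numerical column and two categorical ones.
toy = pd.DataFrame({
    "duration": [0, 2, 5, 1],
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "flag": ["SF", "S0", "SF", "REJ"],
})
categorical = ["protocol_type", "flag"]

# Each distinct category value becomes its own indicator column.
encoded = toy.join(pd.get_dummies(toy[categorical])).drop(categorical, axis=1)
print(list(encoded.columns))
print(encoded.shape)
```

Two categorical columns with three distinct values each turn into six indicator columns, so the frame grows from 3 to 7 columns. The same mechanism takes the real dataset to 118 features.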

Feature Engineering (PCA)

We begin by standardizing our data so that each feature has a mean of 0 and a standard deviation of 1. Note that we fit the scaler on the training data only and then apply the same transformation to the test data, so that no information from the test set leaks into training.

scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)
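
The following sketch, on synthetic data rather than the dataset, shows what fitting on the training split alone means in practice: the training features come out with mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train and test splits.
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(800, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

toy_scaler = StandardScaler().fit(X_train_toy)       # statistics from training data only
X_train_scaled = toy_scaler.transform(X_train_toy)
X_test_scaled = toy_scaler.transform(X_test_toy)     # reuses the training mean and std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```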

After doing so, we perform Principal Component Analysis (PCA), which reduces the dimensionality of our data. It is an unsupervised algorithm that produces a small set of uncorrelated variables called principal components. Its benefits are manifold, including reducing the complexity of our models, alleviating the “curse of dimensionality”, and increasing the interpretability of the data. We decide to perform PCA to obtain 10 resulting components, i.e., new feature columns to use for our models.

pca = PCA(n_components = 10)

X_train_PCA = pca.fit_transform(X_train_norm)
X_test_PCA = pca.transform(X_test_norm)

explained_variance = pca.explained_variance_ratio_
explained_variance

The explained variance ratios of those 10 components are:

[0.0834838 , 0.05251576, 0.03534561, 0.03201952, 0.02585129,
       0.02270666, 0.01908112, 0.01513291, 0.01363081, 0.01235519]
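
A useful follow-up is the cumulative share of variance that the 10 components retain together. The sketch below sums the ratios printed above. The variable name cumulative is introduced here for illustration.

```python
import numpy as np

# Explained variance ratios copied from the output above.
explained_variance = np.array([0.0834838, 0.05251576, 0.03534561, 0.03201952, 0.02585129,
                               0.02270666, 0.01908112, 0.01513291, 0.01363081, 0.01235519])

# Running total: entry k is the variance retained by the first k+1 components.
cumulative = np.cumsum(explained_variance)
print(cumulative.round(4))
print(f"Total variance retained by 10 components: {cumulative[-1]:.1%}")  # 31.2%
```

In other words, the 10 components keep roughly a third of the variance of the 118 standardized features, which is worth bearing in mind when judging how much information the reduced representation carries.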

Note that the components are sorted by this score. After plotting the Pearson correlation matrix for these new features as well as the class label, we obtain this new heat map:

df_PCA = pd.DataFrame(X_train_PCA)
df_PCA['class'] = y_train.to_numpy()
sns.heatmap(df_PCA.corr(), annot=True)

Correlation heatmap of the 10 principal components with the class label

Note that .corr() is applied once, to the component values together with the label, and that y_train is converted with to_numpy() so that its shuffled index does not misalign the labels with the rows of X_train_PCA. Principal components are uncorrelated with one another by construction, so the entries between components should be close to 0. A uniform value of -0.11 between components is a sign that .corr() was applied twice: the correlation matrix of the components is essentially the 10 x 10 identity, and any two distinct columns of that identity matrix have a Pearson correlation of exactly -1/9. The entries of interest are those against the class label (last column or last index), which show how strongly each component correlates with the label. Compare those with the previous heatmap we calculated on the original features, where most of them had a score close to 0.01 in absolute value.
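
As a sanity check for correlation heatmaps of principal components, the sketch below runs PCA by hand with NumPy's SVD on synthetic correlated data (not the dataset). It shows two things: the component scores are pairwise uncorrelated, and taking the correlation of that near-identity correlation matrix a second time yields -1/9 everywhere off the diagonal when there are 10 components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data standing in for the standardized features.
X = rng.normal(size=(500, 12)) @ rng.normal(size=(12, 12))
X_centered = X - X.mean(axis=0)

# PCA via SVD: project the centered data onto the top 10 right singular vectors.
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ vt[:10].T

corr_once = np.corrcoef(scores, rowvar=False)        # approximately the identity matrix
corr_twice = np.corrcoef(corr_once, rowvar=False)    # correlation of a correlation matrix

off_diagonal = ~np.eye(10, dtype=bool)
print(np.abs(corr_once[off_diagonal]).max())
print(corr_twice[off_diagonal].round(4).min(), corr_twice[off_diagonal].round(4).max())
```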

In order to appreciate the power of PCA, we also perform a scatter plot using the first two PCA components as well as a 3D scatter plot using the first three components. This results in the following two figures:

x = X_train_PCA[:, 0]
y = X_train_PCA[:, 1]
z = X_train_PCA[:, 2]
labels = y_train
colors = matplotlib.colors.ListedColormap(['green','red'])
plt.scatter(x,y, c=labels, cmap=colors, alpha=0.2, s=40)
plt.xlabel("First PCA Component")
plt.ylabel("Second PCA Component")

Scatter plot of the 1st and 2nd principal components

fig = plt.figure(figsize=(14,14))
ax = plt.axes(projection='3d')
labels = y_train

# Creating plot
scatter = ax.scatter3D(x, y, z, c=labels, cmap=colors)
ax.set_zlim(-4,2)
ax.set_xlim(-4,3)
ax.set_xlabel("First PCA Component")
ax.set_ylabel("Second PCA Component")
ax.set_zlabel("Third PCA Component")
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper right", title="Classes")
ax.add_artist(legend1)
plt.show()

3D Scatter plot of the 1st, 2nd, and 3rd principal components

The power of feature dimensionality reduction through PCA allowed us to easily separate the class labels across the first two or three principal components. Surprisingly, the PCA algorithm was able to do that without looking at the class labels! The algorithm works only on the input dataset of features, not the labels. Feature engineering was able to significantly simplify the problem of anomaly detection by using the newly produced 10 principal components instead of the original 118 features.
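
The point that PCA never sees the labels can be reproduced on synthetic data. In the sketch below, two invented Gaussian classes differ in a few of their 20 features. PCA is computed by hand with NumPy's SVD, using only the features, and the labels are used afterwards solely to check how well the first component separates the classes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_class, n_features = 300, 20

# Two synthetic classes that differ in the mean of the first 5 features.
shift = np.zeros(n_features)
shift[:5] = 3.0
X = np.vstack([rng.normal(size=(n_per_class, n_features)),
               rng.normal(size=(n_per_class, n_features)) + shift])
y = np.array([0] * n_per_class + [1] * n_per_class)   # used only for evaluation

# PCA on the features alone: center, then project onto the first principal axis.
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = X_centered @ vt[0]

# Split at 0 along the first component; the sign of a principal axis is arbitrary,
# so take whichever orientation agrees with the labels.
pred = (pc1 > 0).astype(int)
agreement = max((pred == y).mean(), (pred != y).mean())
print(round(float(agreement), 3))
```

This works here because the direction separating the classes is also the direction of greatest variance. PCA gives no such guarantee in general, so the clean separation seen in the plots above is a property of this dataset rather than of the algorithm.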