Graphs and Machine Learning for Cybersecurity

Ryota Yamanaka
Oracle Developers
Jul 8, 2021

OPG4Py is the Python client of Graph Server, a component of the Property Graph feature of Oracle Database. OML4Py is the Python interface of OML (Oracle Machine Learning), another feature of Oracle Database that enables scalable machine learning. The key advantage of OML is that all machine learning functionality can run on data inside the database. This provides several benefits, ranging from scalability (databases can store huge volumes of data) and performance (your data does not need to be constantly fetched from the database and pushed back) to security (your data never leaves the protected environment of the database).

Together, OPG4Py and OML4Py complement each other and enable a user to develop fast, scalable, and secure graph machine-learning applications. In this post, we would like to demonstrate this with the example of intrusion detection.

(The original blog post can be found here, which was edited by Rhicheek Patra, Research Manager at Oracle Labs. The content here is updated for TechCasts event on July 8, 2021.)

Intrusion detection

Intrusion detection is a broad term for the monitoring of a wide range of systems and public/private network traffic. For our use case, we define intrusion detection to be the problem of monitoring public network (e.g. the world wide web) traffic and detecting malicious activity. Network traffic, in our case, takes the form of so-called packet captures: logs of IP addresses interacting with other IP addresses. The following table is an example of such a packet capture:

These packet captures can naturally be expressed as graphs, by treating IP addresses as vertices and an interaction between two IP addresses A and B as an edge between the two vertices A and B. The previous table can thus be regarded as an edge-list representation of a graph. A resulting graph might look something like this:
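As a minimal illustration in plain Python (the IP addresses and rows below are invented), this is how a packet capture maps to a set of vertices and an edge list:

```python
# Hypothetical packet-capture rows: each row records one interaction
# between a source IP address and a destination IP address.
packet_capture = [
    ("192.0.2.10", "198.51.100.7"),
    ("198.51.100.7", "203.0.113.5"),
    ("192.0.2.10", "203.0.113.5"),
]

# The vertices are the distinct IP addresses that appear in the log ...
vertices = {ip for edge in packet_capture for ip in edge}

# ... and each row of the log is an edge between two vertices.
edges = packet_capture

print(sorted(vertices))
# ['192.0.2.10', '198.51.100.7', '203.0.113.5']
print(len(edges))
# 3
```

This is exactly the edge-list view of the capture that the graph creation stage below builds inside the database, at scale.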

This figure shows an example of a malicious packet capture.

The victim searches for something via a web search engine and clicks on one of the results, which belongs to the IP address A and, unfortunately, has been compromised. A then redirects the victim over several links to the IP address B, which makes the victim download an exploit. This exploit tries to gain administrator privileges on the victim’s host machine and, upon success, reports the successful infiltration to its master C. C then redirects the victim to the server D, which downloads further malicious software (for example, ransomware that encrypts the victim’s computer and demands money for its decryption).

For a more detailed explanation of packet captures and the whole problem of intrusion detection, I would like to point the interested reader to this report on DynaMiner and malware detection.

Solving the problem using OPG4Py and OML4Py

For our example, we have a lot of different packet captures, and our goal is to train a classifier that can distinguish malicious packet captures from benign packet captures.

We will solve the problem in three main stages:

  1. Graph creation: In this stage, we load the raw packet capture data into database tables. Using the PGQL CREATE PROPERTY GRAPH syntax, we then create graphs from the data in those tables and load them into Graph Server.
  2. Feature generation: In the second stage, the graphs are analyzed extensively using the rich graph algorithm library provided by OPG4Py. The results of these analyses are initially kept in the graphs themselves, then stored into database tables using PGQL queries.
  3. Classification: In the third stage, we use the feature table we created to train a variety of different classifiers in order to achieve the highest possible prediction accuracy. For this, we will also use OML’s AutoML functionality.

Generate the graph-based features using OPG4Py

In order to use the OPG4Py functionalities, we first need to import the pypgx package and log in to Graph Server with the database user name and password.

import pypgx.pg.rdbms.graph_server as graph_server

base_url = "https://<host_name>:7007"
username = "graphuser"
password = "<password>"
instance = graph_server.get_instance(base_url, username, password)
session = instance.create_session("jupyter")
analyst = session.create_analyst()

Prepare the source datasets

We have two datasets. One dataset for the graph data of malicious packet captures and one dataset for the graph data of benign packet captures. For each of these graphs, we prepare the following two tables, one for vertices and another for edges.

BENIGN_VERTICES table:

BENIGN_EDGES table:

The corresponding tables for the malware graph (MALWARE_VERTICES table and MALWARE_EDGES table) have the exact same schema.
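The table contents are shown as images in the original post. As a rough sketch (the values are invented; the column names follow the properties referenced in the CREATE PROPERTY GRAPH statement below, plus an assumed vertex key column V_ID), the rows could look like this:

```python
# Hypothetical sample rows for the two benign tables. Column names
# mirror the properties used in the CREATE PROPERTY GRAPH statement;
# V_ID is an assumed vertex key column, and all values are made up.
benign_vertices = [
    {"V_ID": 1, "V_LABEL": "192.0.2.10",   "TYPE": "ip_address", "GRAPH_ID": 1},
    {"V_ID": 2, "V_LABEL": "198.51.100.7", "TYPE": "ip_address", "GRAPH_ID": 1},
]
benign_edges = [
    {"E_ID": 1, "SRC_ID": 1, "DST_ID": 2,
     "SRC_TYPE": "ip_address", "DST_TYPE": "ip_address"},
]

# The malware tables (MALWARE_VERTICES / MALWARE_EDGES) share this
# schema; they simply hold the malicious packet captures instead.
print(len(benign_vertices), len(benign_edges))
# 2 1
```

Note the GRAPH_ID column: it tells us which packet capture (graphlet) each vertex belongs to, and it is what we will later group by when building the feature table.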

Load datasets into Graph Server as graphs

In this step we are going to create the graphs inside the Graph Server, using the datasets in the database. For this, we first need a CREATE PROPERTY GRAPH statement for each of our two graphs. The statement tells the Graph Server the mapping between tables and a graph.

statement = '''CREATE PROPERTY GRAPH graphlets_b
VERTEX TABLES (
BENIGN_VERTICES AS ip_address
LABEL ip_address
PROPERTIES (V_LABEL, TYPE, GRAPH_ID)
)
EDGE TABLES (
BENIGN_EDGES
KEY (e_id)
SOURCE KEY(src_id) REFERENCES ip_address
DESTINATION KEY(dst_id) REFERENCES ip_address
LABEL send
PROPERTIES (SRC_ID, DST_ID, SRC_TYPE, DST_TYPE)
)
'''

Once this statement is executed, the graph is created in memory.

session.prepare_pgql(statement).execute()

Attach the graph and get a proxy object.

graphlets_b = session.get_graph("GRAPHLETS_B")

Next step: Using a feature table

Let’s recap for a moment what we have done so far.

First, we confirmed the raw packet capture data is in the database tables. We then created two partitioned graphs: GRAPHLETS_B and GRAPHLETS_M. Both of these graphs are composed of many small subgraphs. Each of these subgraphs represents one of the packet captures.

As a next step, we will analyze each of these small graphlets, and store the information about them in a large feature table. For each graphlet, we will store one row in the feature table. We will then attach a label to each row that tells whether this row contains information about a malicious or benign packet capture.
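Conceptually, this per-graphlet aggregation looks like the following pure-Python sketch (the per-vertex metric values are invented; in reality Graph Server computes them and the PGQL query below does the aggregation):

```python
# Hypothetical per-vertex metrics for two graphlets, keyed by graph_id.
# In practice these values come from the graph algorithms run in
# Graph Server (degree centrality, PageRank, and so on).
vertex_metrics = [
    {"graph_id": 1, "degree": 2, "pagerank": 0.40},
    {"graph_id": 1, "degree": 1, "pagerank": 0.60},
    {"graph_id": 2, "degree": 3, "pagerank": 1.00},
]

def graphlet_features(rows, graph_id, ml_label):
    """Build one feature row per graphlet: a label plus simple aggregates."""
    vs = [r for r in rows if r["graph_id"] == graph_id]
    degrees = [r["degree"] for r in vs]
    return {
        "graph_id": graph_id,
        "ml_label": ml_label,  # 0 = benign, 1 = malicious
        "cnt_vertex": len(vs),
        "avg_degree": sum(degrees) / len(vs),
        "max_degree": max(degrees),
        "min_degree": min(degrees),
    }

row = graphlet_features(vertex_metrics, 1, ml_label=0)
print(row["cnt_vertex"], row["avg_degree"])
# 2 1.5
```

The real feature table simply extends this idea to many more metrics (clustering coefficient, betweenness, PageRank, ...) and to every graphlet in both graphs.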

So without further ado, let’s continue our journey!

Create graphlet features

Now that we created the two graphs inside Graph Server, we are going to analyze them using a wide variety of graph algorithms. We’ll later use the resulting information to train the different classifiers. For each graph, we will analyze it using the rich set of graph algorithms that Graph Server provides to compute a number of properties.

def generate_graphlet_features(graphlet):
    analyst.degree_centrality(graphlet)
    analyst.local_clustering_coefficient(graphlet)
    analyst.out_degree_centrality(graphlet)
    analyst.in_degree_centrality(graphlet)
    analyst.vertex_betweenness_centrality(graphlet)
    analyst.pagerank(graphlet)

generate_graphlet_features(graphlets_b)
generate_graphlet_features(graphlets_m)

Prepare the data for classification

Remember that, until now, we did every step twice: once for the graph which stores the benign packet captures, and a second time for the graph which stores the malicious packet captures. We now append a label to each feature list, denoting whether it belongs to a benign (ml_label = 0) or malicious (ml_label = 1) packet capture.

rs = graphlets_b.query_pgql("""SELECT
v.graph_id
, 0 AS ml_label
, COUNT(v) AS cnt_vertex
, SUM(v.degree) / 2 AS cnt_edge
, AVG(v.degree) AS avg_degree
, AVG(v.lcc) AS avg_lcc
, AVG(v.out_degree) AS avg_out_degree
, AVG(v.in_degree) AS avg_in_degree
, AVG(v.betweenness) AS avg_betweenness
, AVG(v.pagerank) AS avg_pagerank
, MAX(v.degree) AS max_degree
, MAX(v.lcc) AS max_lcc
, MAX(v.out_degree) AS max_out_degree
, MAX(v.in_degree) AS max_in_degree
, MAX(v.betweenness) AS max_betweenness
, MAX(v.pagerank) AS max_pagerank
, MIN(v.degree) AS min_degree
, MIN(v.lcc) AS min_lcc
, MIN(v.out_degree) AS min_out_degree
, MIN(v.in_degree) AS min_in_degree
, MIN(v.betweenness) AS min_betweenness
, MIN(v.pagerank) AS min_pagerank
FROM MATCH (v) ON GRAPHLETS_B
GROUP BY v.graph_id
ORDER BY v.graph_id ASC
""")

Store the query result into a database table.

rs.to_frame().write().db().table_name("FEATURES_BENIGN") \
.overwrite(True).owner("GRAPHUSER").store()

The same query is run against the malware graph, this time with ml_label = 1.

rs = graphlets_m.query_pgql("""SELECT
v.graph_id
, 1 AS ml_label
, COUNT(v) AS cnt_vertex
, SUM(v.degree) / 2 AS cnt_edge
, AVG(v.degree) AS avg_degree
, AVG(v.lcc) AS avg_lcc
, AVG(v.out_degree) AS avg_out_degree
, AVG(v.in_degree) AS avg_in_degree
, AVG(v.betweenness) AS avg_betweenness
, AVG(v.pagerank) AS avg_pagerank
, MAX(v.degree) AS max_degree
, MAX(v.lcc) AS max_lcc
, MAX(v.out_degree) AS max_out_degree
, MAX(v.in_degree) AS max_in_degree
, MAX(v.betweenness) AS max_betweenness
, MAX(v.pagerank) AS max_pagerank
, MIN(v.degree) AS min_degree
, MIN(v.lcc) AS min_lcc
, MIN(v.out_degree) AS min_out_degree
, MIN(v.in_degree) AS min_in_degree
, MIN(v.betweenness) AS min_betweenness
, MIN(v.pagerank) AS min_pagerank
FROM MATCH (v) ON GRAPHLETS_M
GROUP BY v.graph_id
ORDER BY v.graph_id ASC
""")

Store the query result into a database table.

rs.to_frame().write().db().table_name("FEATURES_MALWARE") \
.overwrite(True).owner("GRAPHUSER").store()

Now, the graph-based features for training are prepared and stored in the database.

Visualize Graphlets

Using the Graph Visualization UI, we can visually check the graphlets: how many vertices and edges are involved, how they are connected, and so on. Part of this topological information is exactly what the graph algorithms in the previous step quantified.

The PGQL query below retrieves all pairs of vertices and their connecting edges in a particular graphlet, here the one whose GRAPH_ID is 1.

SELECT *
FROM MATCH (v1)-[e]->(v2)
WHERE v1.graph_id = 1 AND v2.graph_id = 1

Each vertex now holds the new properties, such as pagerank and betweenness. In the screen below, the size of each vertex represents its PageRank score.

Train data with the OML classifier

At this point we have two feature tables, one built from the benign graph and one from the malicious graph. We can finally combine the two feature lists into one table:

import oml

features_oml = oml.sync(query='''SELECT * FROM FEATURES_BENIGN
UNION
SELECT * FROM FEATURES_MALWARE
''')

We have now created the final table, which we will use to train our classifiers. To train and test classifiers on this data, we should split it into a training set and a test set. Note that the data does not need to be fetched from the database to do this. OML provides the split() function for this purpose, which returns two proxy objects to the two splits of the data:

# Split the data for training and test
train_oml, test_oml = features_oml.split(ratio=(.8, .2), seed=0)
# training data
train_x_oml = train_oml.drop('ml_label')
train_y_oml = train_oml['ml_label']
# test data
test_x_oml = test_oml.drop('ml_label')
test_y_oml = test_oml['ml_label']

First classification using OML

After having prepared our data for classification, it’s time to start training our classifiers!

Let’s start with a neural network. OML4Py lets you create a neural network via the oml.nn() constructor. You can provide a large set of parameters to this constructor in order to customize your neural network. In the code below, we create a network with four hidden layers with different numbers of nodes.

params = {
    # The architecture of the neural network
    'nnet_hidden_layers' : 4,
    'nnet_nodes_per_layer' : '50, 30, 50, 30',
    # The activation function used in each layer
    'nnet_activations' : "'NNET_ACTIVATIONS_LINEAR', 'NNET_ACTIVATIONS_LINEAR', 'NNET_ACTIVATIONS_LINEAR', 'NNET_ACTIVATIONS_LINEAR'",
    # The number of rounds to use for fitting
    'nnet_iterations' : 500,
    'nnet_heldaside_max_fail' : 100,
    # A seed for reproducibility
    'odms_random_seed' : 1
}
nn_mod = oml.nn(**params)

Training and scoring the neural network is also quite easy. OML4Py provides two functions, fit() and score(), that each takes OML proxy objects as parameters. These proxy objects tell the functions on which data they should apply the training and testing.

# Fit the NN model according to the training data and parameters
nn_mod = nn_mod.fit(train_x_oml, train_y_oml)
# Score the model
nn_mod.score(test_x_oml, test_y_oml)
0.985251

Training this model on our data and then scoring it yields an accuracy of about 98.5%. But can we do better?
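The score reported here is plain classification accuracy: the fraction of test rows whose label was predicted correctly. As a quick sanity check with toy labels (the values below are invented):

```python
# Accuracy = number of correct predictions / total number of predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # actual labels (0 = benign, 1 = malicious)
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]  # hypothetical classifier output

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
# 0.875
```

Here one of the eight predictions is wrong, giving 7/8 = 0.875; our neural network got about 98.5% of the test rows right by the same measure.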

AutoML: Let OML do the work for you

OML provides a large set of different classifiers, from neural networks and linear classifiers to naive Bayes classifiers. However, it is cumbersome to try out all of these classifiers to find out which one delivers the best results, especially since every single classifier has a large number of hyperparameters that could be tuned. OML has a solution for this: AutoML. The AutoML module provides a variety of functions that let OML find the best model, features, and hyperparameters for you.

Let’s apply the automatic algorithm selection to our data to find out which model yields the best result for our data:

alg_selection = oml.automl.AlgorithmSelection(
    mining_function = 'classification',
    score_metric = 'accuracy',
    parallel = 1
)
algo_ranking = alg_selection.select(
    train_x_oml, train_y_oml,
    cv = None, X_valid = test_x_oml, y_valid = test_y_oml, k = 3
)
print("The best models are: " + str(algo_ranking))

The best models are: [('nb', 0.9970501474926253), ('svm_gaussian', 0.9970501474926253), ('glm_ridge', 0.9941002949852508)]

Running the code above gives us the following ranking:

  1. Naive Bayes (Accuracy: ~99.7%)
  2. Support Vector Machine (Accuracy: ~99.7%)
  3. Generalized linear model (ridge) (Accuracy: ~99.4%)

Graph machine learning features

In this blog post, we cover how graph-based features can boost standard machine learning. However, OPG4Py provides more complex graph machine learning functionalities like vertex representations (DeepWalk, Graph Convolutional Networks) as well as sub-graph representations (PG2Vec). We will discuss more in future blog posts.

If you would like to share your feedback or ask questions to the community, click here to join the AnDOUC (Analytics and Data Oracle User Community) Slack workspace and post your comments at the #graph channel.
