How community detection can now be your superhero for masterful feature engineering.

Louis Pienaar
Published in Project Artemis
Apr 24, 2022 · 7 min read

No one seems to talk about how to use label propagation in your train/test pipeline, until now!

Follow my five steps to implement label propagation, and you will get to witness the powerful features it generates, features that would otherwise be complex or near impossible to build. Plenty of resources explain the family of community detection algorithms, and the bulk of their focus is on label propagation. I encourage you to brush up on your knowledge of them, as this article will not do that for you. Sorry! What is essential, however, is to understand the train-test split and how it uniquely impacts you when you use semi-supervised techniques like label propagation.

Why do we need to talk about this? Why is this article so needed? Your test data should not be part of your community detection efforts. If you agree, you need to ask yourself: How do you generate features for your train data and then subsequently for your test data?

The real world is the actual test for your model.

Jason Brownlee explains the train-test split as a method to estimate your model's performance and a way to combat overfitting. Our aim is to generate features based on communities detected with the label propagation algorithm.

It is time to introduce the example. We will be building features to help predict which domain names are likely to be used for malicious purposes. We will use a dataset created by Team Artemis, a group of Master's students in the University of Michigan Master of Applied Data Science program. You can find Python code for generating this dataset in our GitHub repo: https://github.com/erodium/artemis.

Graph database engines are an exciting new avenue for data science innovations, and now you get to experience some of that here. I'm using the TigerGraph graph engine for this label propagation exercise. Domain record data, much like the internet itself, is often reasonably tightly grouped, since domains naturally share entities like registrars, nameservers or broader organisational records. Below is a representation of this schema:

Domain name graph schema

The graph schema has ten nodes, eleven edges and three derived edges. The important node is the DomainRecord node. It is the central point of the network and carries important attributes like entropy. This exercise aims to build features from entropy values that differ between communities.

Entropy is calculated over the central text portion of the domain name and is a numerical measure of how random the characters and digits in the name are. You can view the entropy calculation code in the GitHub implementation.
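The code in the repo is the authority here, but as a rough illustration, a plain Shannon entropy calculation over the characters of a domain label looks like this:

import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the characters in a string."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A dictionary-like name scores lower than a random-looking, DGA-style name
print(shannon_entropy("google"))        # ~1.92
print(shannon_entropy("xj3k9qpa7z2m"))  # ~3.58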

We use the three derived edges to reduce the complexity of the label propagation algorithm. Each domain record uses a registrar, and different domain records may use the same registrar. Whenever two domain records share a registrar, we create a derived edge called co_registrar between them. For example, suppose we have four domains, all using the same registrar (GoDaddy.com, LLC).

4 Domains using the same registrar

Then we run the script to create the derived edges, co_registrar, between these domains to link them directly:

Four domains are now connected using the derived edge co_registrar

To create this schema yourself, you need Docker installed. Once it is, run a local TigerGraph container, making sure you mount the correct data folder:

docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph --ulimit nofile=1000000:1000000 -v "$(pwd)/artemis/data/external":/home/tigergraph/data -t docker.tigergraph.com/tigergraph:latest

Once your container is running, remember to start your graph engine by SSHing into the container and then running the start command:

ssh -p 14022 tigergraph@localhost
password: tigergraph
tigergraph@38e46715231e:~$ gadmin start all

Once that is done, you can then execute the gsql script that creates the schema and all the loading jobs and queries you need.
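You can run it from the gsql shell inside the container, or from Python with pyTigerGraph. Here is a minimal sketch of the Python route; the script name artemis_schema.gsql and the default credentials are assumptions, so adjust them to whatever the repo ships:

import pyTigerGraph as tg

# Connect to the container we started above (default TigerGraph credentials assumed)
conn = tg.TigerGraphConnection(host="http://localhost", username="tigergraph",
                               password="tigergraph")

# Execute the repo's GSQL script: schema, loading jobs and queries (file name assumed)
with open("artemis_schema.gsql") as f:
    print(conn.gsql(f.read()))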

You now have an empty graph (Take a moment and reflect on how cool this is). Next, you need to load the data. The data files used are generated for you and are available in the data/external directory. You can start the loading jobs using these loading scripts:
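As a sketch of what that looks like with pyTigerGraph (the graph name, loading-job name, file tag and file name below are illustrative; use the ones defined by the GSQL script):

# Point the connection at the graph created by the schema script (name assumed)
conn.graphname = "artemis"

# Run a loading job against one of the generated files in data/external (names assumed)
result = conn.runLoadingJobWithFile("data/external/domain_records.csv",
                                    "MyDataSource", "load_domain_records", sep=",")
print(result)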

Your graph has all the data it needs to start training, but we still need to create those derived edges to help label propagation. To do that, we will use GSQL. The GSQL for creating the derived edges:

This query is pre-installed, and you can execute it with this code:
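Something along these lines; the installed query name is my assumption, so check what the GSQL script actually installed:

# Build the co_registrar (and the other two) derived edges between DomainRecord vertices
derived_edges = conn.runInstalledQuery("create_derived_edges", timeout=30000)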

Now your graph is ready to run the label propagation algorithm. We will be using all three types of derived edges. We use the label propagation algorithm for community detection by initialising every domain record with a random (unique) integer. Once the algorithm executes, this ID will or won’t spread to neighbours based on the label propagation calculations. After a few iterations, the domain records will converge to a stable state where labels are no longer propagated to any neighbours. Once this is done, we should have a natural grouping of domain records based on how well they are connected through our derived edges.

For this, we use the label propagation algorithm implemented by TigerGraph: https://github.com/tigergraph/gsql-graph-algorithms

And we execute it with the following code:
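A sketch of that call. The parameter names below follow one version of tg_label_prop, and the two non-registrar edge names are placeholders, so treat all of them as assumptions to verify against your installed algorithm:

# Run label propagation over DomainRecord vertices, connected by the derived edges
params = {
    "v_type": "DomainRecord",
    "e_type": ["co_registrar", "co_nameserver", "co_ip"],  # last two names assumed
    "max_iter": 10,
    "output_limit": 0,
    "attr": "community_id",  # write the resulting community label back onto each vertex
}
lp_result = conn.runInstalledQuery("tg_label_prop", params=params, timeout=60000)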

Communities both divide us and bring us together

Now we can generate some features unique to each community. We will use entropy and the benign-malicious ratio to calculate community features for this example. Luckily we can easily do this using GSQL.

And execute this pre-installed query with:

community_features = conn.runInstalledQuery("community_features_calc", timeout = 30000)
community_features = conn.runInstalledQuery("community_features_calc", timeout = 30000)

Yes, execute it twice. I have some learning left, as I have no idea why twice is needed. Just trust me on this one (famous last words).

You will then have community features that look like this:

So for each community, we will use the min, max and average entropy and the malicious-to-benign ratio.
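The actual aggregation happens in GSQL inside the graph, but for intuition, the same calculation in pandas over an assumed per-domain table looks like this:

import pandas as pd

# One row per DomainRecord: its community, entropy and label (toy data, assumed layout)
df = pd.DataFrame({
    "community_id": [1, 1, 1, 2, 2],
    "entropy":      [1.9, 2.2, 2.4, 3.6, 3.5],
    "malicious":    [0, 1, 0, 1, 1],
})

community_features = df.groupby("community_id").agg(
    min_entropy=("entropy", "min"),
    max_entropy=("entropy", "max"),
    avg_entropy=("entropy", "mean"),
    malicious_ratio=("malicious", "mean"),  # share of malicious domains in the community
).reset_index()
print(community_features)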

Give yourself a hand; you have come this far

Now we get to do some magic. You now have:

  • Your training data loaded
  • The community detection (label propagation) completed
  • The community features calculated

What comes next is the superhero trick, the part that is not often done or talked about. We will now:

  • Introduce the test data into the graph
  • Assign them to their communities
  • Assign them the features of those communities

What is essential is that the existing communities must not change, and the community features computed on the training data must stay the same. Introducing the test data should not alter the structure of the train communities and cannot be allowed to influence the community features. This is how we prevent any form of data leakage.

How do we get this done? Well, by altering the label propagation algorithm slightly and by making sure that only the newly introduced data can have its community label changed. Once the new (test) data is assigned to a community, we merge in the original (train-based) community features, and you have a test set without data leakage. Enough talk, let's code.
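To make that concrete before we touch GSQL, here is a minimal, pure-Python sketch of the frozen-label idea (an illustration only, not the production query): train vertices keep their community labels forever, and each test vertex simply adopts the most common label among its already-labelled neighbours.

from collections import Counter

def assign_test_communities(adjacency, train_labels, test_nodes, max_iter=10):
    """adjacency: node -> list of neighbours; train_labels: node -> community (frozen);
    test_nodes: new vertices that are allowed to change their label."""
    labels = dict(train_labels)  # train labels are copied and never overwritten below
    for _ in range(max_iter):
        changed = False
        for node in test_nodes:
            neighbour_labels = [labels[n] for n in adjacency.get(node, []) if n in labels]
            if not neighbour_labels:
                continue  # isolated test node: leave it unassigned
            new_label = Counter(neighbour_labels).most_common(1)[0][0]
            if labels.get(node) != new_label:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Example: test domain "d_new" joins community 7 because most of its neighbours are in it
adj = {"d_new": ["d1", "d2", "d3"]}
print(assign_test_communities(adj, {"d1": 7, "d2": 7, "d3": 4}, ["d_new"]))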

First load the test data:
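Again a pyTigerGraph sketch, with the file and job names being assumptions:

# Load the held-out (test) domain records into the same graph
conn.runLoadingJobWithFile("data/external/domain_records_test.csv",
                           "MyDataSource", "load_domain_records", sep=",")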

The altered GSQL for the test set:

Execute it as follows:
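Roughly like this; the installed query name is my assumption:

# Assign community IDs to the newly loaded vertices only; existing labels stay frozen
conn.runInstalledQuery("label_prop_assign_new", timeout=60000)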

Now you have your test data loaded, and all the new DomainRecord vertexes have an assigned community ID. This means we are done. We now have train and test set data, each with its community ID and the associated community features.
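The last step, merging the train-based community features onto each row, is a plain pandas join; the column names and the toy train/test frames below are assumptions (community_features is the aggregate table from earlier):

import pandas as pd

# Toy per-domain frames; in practice these come out of your feature pipeline
train_df = pd.DataFrame({"domain": ["alpha.com", "xj3k9qpa7z2m.net"], "community_id": [1, 2]})
test_df  = pd.DataFrame({"domain": ["beta.org"], "community_id": [1]})

# Join the TRAIN-only community aggregates onto both sets; no test information leaks in
train_df = train_df.merge(community_features, on="community_id", how="left")
test_df  = test_df.merge(community_features, on="community_id", how="left")
print(test_df)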

Model results

So let's set some expectations. Detecting malicious domains is a considerable class-imbalance problem, and the data we are using doesn't have any apparent markers that could indicate fraud. Hence we do not expect the model to perform all that well. I'm kidding; the model is incredible. Artemis features fall into three groups: DGA, entropy and graph community aggregations. These three in combination are lethal; here is why:

An AUC of 0.9821 on our test set data

Yes, we expect some calls from a few interested parties. But let this be an encouragement for us all to use graph/network analysis more and more.
