Classifying behavior sequences in social networks to find bot accounts
The idea for this post comes from the paper Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling.
Just as DNA sequencing works over the alphabet G, C, A, and T, we can define an alphabet for different user-generated events, e.g. ‘upload image’ = A, ‘visit a profile page’ = B, and so on. If we have the infrastructure in place to collect this time series of events (which is quite commonplace today, usually with Apache Kafka), we can compare these event strings across all users and find clusters of different behaviors. If we have historically identified automated bots on the site, we can find which clusters they belong to, and then predict how likely a new account is to be a bot based on cluster similarity.
In this example, assume we’re collecting the following event types on a social network:
- A user uploaded an image (event name ‘user.image.upload’),
- A user registered on the site (event name ‘user.register’),
- A user commented on another user’s status update (event name ‘user.feed.comment’),
- A user rated another user’s uploaded image (event name ‘user.rate’),
- A user sent a friend request to another user (event name ‘user.friend.request’).
A reasonable choice of schema when collecting events (especially in social domains) is the Activity Streams format (it helps to use an already documented standard rather than define your own). For simplicity, we’ll use version 1.0 (much more restrictive, but for illustration purposes it won’t matter). A request might look something like the following:
```json
{
  "id": "<server-generated UUID>",
  "published": "<server-generated timestamp, RFC3339 format>",
  "verb": "rate",
  "actor": {
    "objectType": "person",
    "id": "<ID of the user who rated the image>"
  },
  "object": {
    "objectType": "image",
    "id": "<id of the image that was rated>"
  }
}
```
Let’s create an alphabet from the above listed events:
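One possible mapping, sticking with ‘upload image’ = A from the introduction (the specific letters are arbitrary; any consistent one-to-one mapping works):

```python
# Map each collected event type to a single-character symbol.
# The particular letter assignments are our own choice; only
# consistency matters for the sequence comparisons later.
EVENT_ALPHABET = {
    "user.image.upload": "A",
    "user.register": "B",
    "user.feed.comment": "C",
    "user.rate": "D",
    "user.friend.request": "E",
}
```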
A stream of these events from a new user who just registered might look something like this, converted to our alphabet:
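A sketch of that conversion, assuming a hypothetical event-to-letter mapping (repeated here so the snippet runs on its own) and an invented but plausible stream for a freshly registered account:

```python
# Hypothetical mapping of the five event types to letters A-E.
EVENT_ALPHABET = {
    "user.image.upload": "A",
    "user.register": "B",
    "user.feed.comment": "C",
    "user.rate": "D",
    "user.friend.request": "E",
}

def encode(event_names):
    """Encode a time-ordered list of event names as a sequence string."""
    return "".join(EVENT_ALPHABET[name] for name in event_names)

# An invented stream for a new account: register, upload an image,
# send a friend request, comment, rate an image, send another request.
stream = [
    "user.register",
    "user.image.upload",
    "user.friend.request",
    "user.feed.comment",
    "user.rate",
    "user.friend.request",
]
print(encode(stream))  # -> BAECDE
```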
The above ‘sequence’ is fairly typical of a normal sequence of events for a newly created account. What we have observed from spam accounts, on the other hand, most of the time looks more like the following:
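To make the contrast concrete, here are two purely hypothetical sequences, together with a crude repetitiveness measure (the share of the sequence taken up by its most frequent bigram; this measure is our own illustration, not from the paper):

```python
from collections import Counter

# Purely hypothetical sequences: a varied human-like one, and a
# bot-like one stuck repeating the same two actions after registering.
human_seq = "BAECDABECD"
bot_seq = "BEDEDEDEDE"

def top_bigram_share(seq):
    """Fraction of adjacent pairs taken up by the most common bigram."""
    bigrams = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return max(bigrams.values()) / (len(seq) - 1)

print(round(top_bigram_share(human_seq), 2))  # varied sequence: low share
print(round(top_bigram_share(bot_seq), 2))    # repetitive loop: high share
```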
With some variations, we usually found a longer common subsequence between bot accounts than between human accounts. Finding the longest common subsequence (LCS) of an arbitrary number of sequences is an NP-hard problem, but by limiting the number of examples we compare, and the length of the sequences we compare, the problem becomes viable to solve.
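For just two sequences, the LCS length is computable with the classic quadratic dynamic program; a minimal sketch:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b,
    via the classic O(len(a) * len(b)) dynamic program."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

The rows of the table are kept one at a time, so memory stays linear in the shorter sequence even for long event histories.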
By comparing a random bot account with one other random bot account, then with two others, then three, and so on, we can plot the LCS value against the number of accounts compared. For our data, sampling between 2 and 30 other bot accounts without replacement, averaging over 100 runs, and overlaying the equivalent comparison for human accounts, we got the following graph:
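That sampling experiment can be sketched as follows. Note one simplification on our part: the group-wide LCS is approximated by folding pairwise LCS left to right, which always yields a valid common subsequence of the whole group (a lower bound on the true k-way LCS) rather than the exact optimum:

```python
import random

def lcs_string(a: str, b: str) -> str:
    """One longest common subsequence of a and b (quadratic DP that
    stores strings -- fine for the short sequences used here)."""
    dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + ca
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=len)
    return dp[-1][-1]

def group_lcs(seqs):
    """Fold pairwise LCS over the group. The result is a common
    subsequence of every sequence, hence a lower bound on the
    exact k-way LCS length."""
    common = seqs[0]
    for s in seqs[1:]:
        common = lcs_string(common, s)
    return len(common)

def avg_group_lcs(accounts, k, runs=100, rng=None):
    """Average group LCS over `runs` random samples of one account
    plus k others, drawn without replacement."""
    rng = rng or random.Random(0)
    return sum(
        group_lcs(rng.sample(accounts, k + 1)) for _ in range(runs)
    ) / runs
```

Running `avg_group_lcs` for k = 2..30 over the bot sequences and again over the human sequences gives the two curves of the graph.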
It’s clear that humans behave more randomly than this group of bot accounts. Armed with this knowledge, it’s straightforward to separate the two sets and predict whether a new account is likely to be a bot. Logistic Regression is usually a good baseline, so let’s try that:
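A sketch of what that baseline could look like with scikit-learn. The sequences, the single LCS-similarity feature, and the tiny dataset below are all hypothetical stand-ins for the real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def lcs_length(a, b):
    # Quadratic LCS dynamic program, as sketched earlier in the post.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# Hypothetical encoded sequences for known bots and known humans.
bots = ["BEDEDEDEDE", "BEEDEDEDED", "BDEDEEDEDE"]
humans = ["BAECDABECD", "BCADEBACDE", "BACEDCAEBD"]

def bot_similarity(seq, reference=bots):
    """Average LCS length against a reference set of known bot sequences."""
    return sum(lcs_length(seq, r) for r in reference) / len(reference)

X = np.array([[bot_similarity(s)] for s in bots + humans])
y = np.array([1] * len(bots) + [0] * len(humans))  # 1 = bot, 0 = human

clf = LogisticRegression().fit(X, y)
# On real data you would of course score a held-out test set,
# not the training accounts.
print("f1 score:", f1_score(y, clf.predict(X)))
```

On this toy data a single similarity feature separates the classes perfectly; the score quoted below came from the real dataset, where the overlap between the two groups is larger.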
The output is quite promising:
f1 score: 0.945750452079566
Playing around with other algorithms and ensembles could help, as could adjusting the types of events you include in your alphabet, changing the sequence length you wait for, or including gender and age as features.
The conclusion, though, is that it’s hard to mimic the randomness of human behavior, and bots that fail to do so aren’t too hard to detect without much effort.