Using AI to detect Bitcoin addresses involved in ransomware transactions

Yogesh Chandrasekharuni
Apr 19 · 21 min read


Online scams have been on the rise over the past decade, and with the introduction of cryptocurrency and its pseudo-anonymity, it has never been more convenient for scammers.

This blog will go through an infamous online money-grabbing scheme for hackers: the ‘Bitcoin Heist’. It’s straightforward. A fancy website offers free music downloads. What could go wrong? *click*. That is when you download a special kind of computer virus called ‘ransomware’.

Ransomware spreads through your computer like wildfire and encrypts every last file. Want your sensitive data back? Just pay the hackers a hefty ransom and they will (hopefully) give you a decryption key, which should unlock your data.

Wondering how Bitcoin comes into play and not something like PhonePe? Bitcoin is a cryptocurrency based on peer-to-peer technology that involves no central authority like a bank. For all intents and purposes, transactions made over Bitcoin are extremely difficult to trace. Now you might understand why this is perfect for scammers.

In this project, we will use AI to analyze how these transactions take place and try to build models that predict if a given Bitcoin address is being used for malicious intent or not.


Note: You can find the accompanying Jupyter notebook and all other code in the GitHub repository. For a thorough overview, it is recommended to follow the notebook as well.

1. Introduction

Problem statement

Real-world usage

Business objectives


Business constraints

Note that in this project, we only focus on banning/blacklisting bitcoin addresses used for malicious intent in the past, NOT in real-time. This project could easily be extended to work in real-time as well, by having an API that checks whether a receiver's bitcoin address has been flagged by our model before any new transaction is made and, if so, immediately notifies the authorities, who can take further action. However, for this project, we will not be extending our use-case to real-time.
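The real-time check described above could be sketched as a simple lookup against the set of flagged addresses. All names here are illustrative, not part of the project:

```python
# Minimal sketch of the real-time extension: before a transaction goes
# through, check the receiver's address against the set of addresses our
# model has already flagged. Function and variable names are hypothetical.

def is_flagged(address, flagged_addresses):
    """Return True if the receiver's address was flagged by the model."""
    return address in flagged_addresses

def screen_transaction(receiver, flagged_addresses, notify):
    """Block the transaction and notify the authorities if flagged."""
    if is_flagged(receiver, flagged_addresses):
        notify(receiver)   # e.g. alert the relevant authority
        return False       # the transaction should not proceed
    return True
```

In a real deployment this lookup would sit behind an API endpoint, with the flagged set refreshed whenever the model re-scores the transaction graph.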

Now let us see the main business constraints.

The data

The head of the dataset

2. Data-cleaning and Feature Engineering

The raw data we got from the UCI Repository needs to be processed before we do any modeling. We also need to modify the distributions for our feature columns because some of the predictive models we will use might assume that the features are normally distributed.

Understanding the dataset

Our dataset has ~3,000,000 rows and 10 columns. Out of these 10 columns, we have 9 predictors and one target column.

1. address [String]:
Stores the address of the bitcoin transaction’s recipient.

2. year [int] :
Indicates the year in which the transaction was made.

3. day [int] :
Indicates the day of the year on which the transaction was made.

4. length [int] :
Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.

5. weight [float] :
Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address.

6. count [int] :
Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount transacted.

7. looped [int] :
Loop is intended to count how many transactions i) split their coins; ii) move these coins in the network by using different paths; and finally iii) merge them in a single address. Coins at this final address can then be sold and converted to fiat currency.

8. neighbors [int] :
Indicates the number of neighbors a transaction had.

9. income [int] :
Income in terms of Satoshi amount where a Satoshi is the smallest unit of a bitcoin, equivalent to 100 millionth of a bitcoin.
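To make the schema concrete, here is a minimal sketch of how the predictors and target split apart. The row values are illustrative stand-ins, and the raw file from the UCI repository holds roughly 3 million rows:

```python
import pandas as pd

# Tiny stand-in with the dataset's 10-column schema (values illustrative).
df = pd.DataFrame({
    "address": ["111K8kZAEnJg245r2cM6y9zgJGHZtJPy6"],
    "year": [2017], "day": [11], "length": [18], "weight": [0.008],
    "count": [1], "looped": [0], "neighbors": [2], "income": [100050000],
    "label": ["princetonCerber"],
})

predictors = df.drop(columns=["label"])  # the 9 feature columns
target = df["label"]                     # the column we want to predict
```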

The target

The target, or the entity we want to predict for a given transaction, is a set of labels. A label can either be white, indicating that the address corresponding to a particular transaction was NOT used for malicious intent or, it can belong to a set of labels such as paduaCryptoWall, montrealSam, princetonLocky, etc, each of which is the name of a particular family of ransomware.

For the task at hand, we do not need to worry about which family a particular ransomware belongs to, but only whether an address corresponding to a transaction has been used for malicious intent or not. Thus, we convert this multi-class classification task into a binary classification. Another reason this is beneficial lies in the distribution of our target: it is extremely skewed, i.e., we have an extremely imbalanced dataset.
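Collapsing the family labels into a binary target is a one-liner. A minimal sketch, assuming the label column holds strings like those above:

```python
import pandas as pd

# Collapse the many ransomware families into a single positive class:
# anything that is not "white" is treated as malicious (label 1).
labels = pd.Series(["white", "paduaCryptoWall", "montrealSam", "white"])
binary = (labels != "white").astype(int)

print(binary.tolist())  # [0, 1, 1, 0]
```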

Graph indicating the class-imbalance

Distribution of our features

Let’s take a look at how some of our features are distributed. Let’s start with the feature ‘length’.


PDF of Length

We notice that this distribution is not ideal. Empirically, we know that some models (such as Logistic Regression) have better predictive power when the features are normally distributed and not extremely skewed. Thus, we transform such features to be more Gaussian by using Box-Cox or other transformations.

PDF of length after transformation

We see that the PDF is much closer to the bell curve than earlier. We perform similar transformations on all features and construct new and slightly transformed features.
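A minimal sketch of such a transformation with SciPy's `boxcox`, on toy values; Box-Cox requires strictly positive inputs, hence the +1 shift:

```python
import numpy as np
from scipy import stats

# A heavily right-skewed toy feature standing in for 'length'.
length = np.array([1, 2, 2, 3, 5, 8, 50, 144], dtype=float)

# Box-Cox needs strictly positive values, so shift by 1 if zeros can occur.
transformed, lmbda = stats.boxcox(length + 1)

# The transformed feature is far less skewed than the raw one.
print(stats.skew(length), stats.skew(transformed))
```

The fitted lambda is returned alongside the transformed values, which matters at prediction time: the same lambda must be reused to transform unseen data.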


Let’s see how our income feature is distributed:

PDF of Income

We notice that even our income feature is extremely skewed. Let’s try to fix this by applying a Box-Cox transformation.

PDF of Income after Box-Cox transformation

After applying the Box-Cox transformation, we see a great improvement in the overall distribution of the data.

We perform similar analyses and transformations for other features as well.

Feature Engineering

Fixing skewness

Engineering new features

Creating new features that might correlate with the target will give our models more predictive power. Let us take a look at a few of our newly engineered features.

Number of addresses:

Day of week:

Is close to a holiday:

We engineer nine more features, including skewness-corrected features and interaction features. For the complete list, please refer to 3.1.2 Engineered Features in the Jupyter notebook.
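As a sketch, two of these features could be derived as follows. The holiday day-of-year list and the 3-day window are assumptions made for illustration; the notebook has the exact definitions:

```python
import pandas as pd

# Toy rows with the dataset's (year, day-of-year) columns.
df = pd.DataFrame({"year": [2016, 2016], "day": [1, 170]})

# Convert (year, day-of-year) into a calendar date, then take the weekday.
dates = pd.to_datetime(
    df["year"].astype(str) + df["day"].astype(str).str.zfill(3),
    format="%Y%j",
)
df["day_of_week"] = dates.dt.dayofweek  # 0 = Monday ... 6 = Sunday

# Flag transactions within 3 days of a (hypothetical) holiday day-of-year.
holidays = [1, 359]  # e.g. New Year's Day, Christmas
df["is_close_to_holiday"] = df["day"].apply(
    lambda d: int(any(abs(d - h) <= 3 for h in holidays))
)
```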

Exploratory Data Analysis

In this section, we go over the critical process of performing initial investigations on our dataset to discover any patterns or anomalies. We will do both uni-variate and multi-variate analysis of our features.

Univariate analysis:

Balancing our dataset

Let us take a look at the class-separated PDFs of some of our features and understand how they differ. Note that we want to see separability in these PDFs because that helps the model to distinguish between the two classes.
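A hypothetical way to draw such class-separated PDFs, using SciPy and Matplotlib on toy data where the two classes are well separated:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Toy stand-in for one feature, split by the binary label; well-separated
# class-conditional densities are what we hope to see in the real features.
rng = np.random.default_rng(0)
white = rng.normal(0.0, 1.0, 500)    # label 0 (benign)
ransom = rng.normal(3.0, 1.0, 500)   # label 1 (ransomware)

xs = np.linspace(-4, 7, 300)
fig, ax = plt.subplots()
ax.plot(xs, gaussian_kde(white)(xs), label="white")
ax.plot(xs, gaussian_kde(ransom)(xs), label="ransomware")
ax.legend()
fig.savefig("class_separated_pdf.png")
```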

Number of addresses




Is close to a holiday

Quarter number

Please refer to 3.0 Uni-variate analysis in the accompanying jupyter notebook for a detailed analysis of the remaining features.

Multi-variate analysis

Pair plots

Let us take a look at how some of our features change with respect to each other.

Pair plot

Correlation map

Correlation map between the features

Now that we have a high-level overview of how our features are correlated with each other, let us take a closer look at each pair of features to get a deeper understanding.

Income and count

KDE, or Kernel Density Estimate, plots show the probability density of a continuous variable. In the plot below, the smaller the ‘circle’, the higher the probability density of the combined variables (since we’re using a 2D KDE plot to measure the effect of two variables at the same time). To learn more about KDE plots, please refer to this article.
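A small sketch of the idea with SciPy's `gaussian_kde`, on toy data standing in for income and count: the estimated joint density is higher near the cluster centre than out in the tail.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Two correlated toy variables standing in for income and count.
rng = np.random.default_rng(0)
income = rng.normal(10.0, 1.0, 1000)
count = 0.5 * income + rng.normal(0.0, 0.5, 1000)

# gaussian_kde expects a (n_dims, n_points) array.
kde = gaussian_kde(np.vstack([income, count]))
center = kde(np.array([[10.0], [5.0]]))[0]  # near the cluster centre
tail = kde(np.array([[14.0], [2.0]]))[0]    # far out in the tail
print(center > tail)
```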

Length and weight

Income and day of week

Income, years and interaction_count_income

Swarm plot between income, year and their effect on interaction_count_income


Before we can build predictive models and train them on our dataset, we need to establish a metric that defines our model’s performance.



Other metrics used

Now that we understand all of our features well and know exactly what metric we want to maximize, we can start building our models.

Random model

Random model outputs

Note: We will be using two types of machine-learning algorithms for the rest of the modeling: distance-based and tree-based. Fine-tuned datasets have been constructed separately for each. For example, the distance-based datasets have been one-hot-encoded, scaled, and so on, because this increases those models' predictive power. Tree-based models, however, actually fare worse when the features are one-hot-encoded. This distinction is clearly highlighted in the accompanying notebook, so for further clarification it is recommended you take a look at it.
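A minimal sketch of the two preprocessing variants, with illustrative column names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with one numeric feature and one categorical-like feature.
df = pd.DataFrame({
    "length": [2.0, 18.0, 144.0],
    "year": [2014, 2015, 2016],
})

# Distance-based models: one-hot-encode categoricals and scale numericals.
distance_df = pd.get_dummies(df, columns=["year"])
distance_df[["length"]] = StandardScaler().fit_transform(distance_df[["length"]])

# Tree-based models: leave the raw columns alone; splits are scale-invariant
# and one-hot encoding tends to hurt tree performance.
tree_df = df.copy()
```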

Distance-based models

Logistic Regression

Support Vector Machines

Tree-based models

Random Forest

Gradient Boosted Decision Trees

Stacking Classifier
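A sketch of what a stacking ensemble looks like in scikit-learn, on toy data rather than the BitcoinHeist set: the base learners' predictions feed a logistic-regression meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data with 9 features, echoing our 9 predictors.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)

# Base learners' out-of-fold predictions train the final estimator.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))
```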

Modeling summary

Comparison with stock data


Future work


All of the code is available on the GitHub repository.

Thank you for reading. For any corrections, suggestions, or questions, reach out to me via email or on LinkedIn.

Analytics Vidhya
