Predicting fraudulent crypto addresses with Machine Learning in DirtyHash!

4 min read · Jan 30, 2023

Light at the end of the tunnel

The decentralised nature of crypto, one of its most important features, is also attractive to scammers. There is no bank or other centralised authority to flag suspicious transactions and attempt to stop fraud before it happens. Moreover, the cryptocurrency market is still relatively new and unregulated in many countries, which makes it a breeding ground for scams and fraudulent activity. We at dirtyhash.com recently launched an open source platform that aims to make Web3 safer by defending against crypto fraud.

We have accumulated a large database (more than 10 million entries) of fraudulent addresses and safe addresses of known entities. These addresses are reported by users, obtained by scraping the internet for malicious entities, and taken from publicly available crypto scam databases. Moreover, we map the transaction graphs of blockchains to hunt for related scam entities. We also search for malicious entities reported in the Sanctions List by the US Treasury Department.
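The post does not describe how the transaction graph is traversed, so the following is only a minimal sketch of one plausible approach: a breadth-first search that starts from known scam addresses and collects every address within a few transfers of them. The function name, the `(sender, receiver)` edge format, and the `max_hops` cut-off are all assumptions made for illustration.

```python
from collections import deque


def addresses_near_scams(edges, known_scams, max_hops=1):
    """Return {address: hops} for addresses within max_hops transfers of a known scam.

    edges: iterable of (sender, receiver) pairs (hypothetical format).
    """
    # Treat a transfer in either direction as a link between two addresses.
    neighbours = {}
    for sender, receiver in edges:
        neighbours.setdefault(sender, set()).add(receiver)
        neighbours.setdefault(receiver, set()).add(sender)

    hops = {address: 0 for address in known_scams}
    queue = deque(known_scams)
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(current, ()):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)

    # Report only newly discovered addresses, not the seeds themselves.
    return {address: h for address, h in hops.items() if h > 0}


# Made-up addresses, for illustration only.
edges = [("scam1", "mixerA"), ("mixerA", "walletB"), ("walletC", "walletD")]
related = addresses_near_scams(edges, {"scam1"}, max_hops=2)
print(related)
```

Being close to a scam address in the graph is only a lead for further investigation, not proof of fraud; an exchange hot wallet, for example, can sit one hop away from thousands of scammers.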

Data data everywhere, not any scam to detect?

After training an ML model on the data collected as described above, we can provide a risk score for any wallet address in real time, with high accuracy, as shown below.

Risk Score for a Wallet predicted by Machine Learning

To provide this risk score, we take the transaction data of known fraudulent and safe addresses, extract the details, and build relevant features. We then train supervised machine learning models on those features to predict whether a new address that is not in our database is fraudulent, and with what probability.

OK, that’s the standard ML thing, but can you be more specific?

Secret Sauce Revealed!

Different blockchains, e.g. Ethereum and Bitcoin, follow different protocols and expose different transaction details. Hence, each chain requires a custom machine learning (ML) model. Training a supervised ML model requires labelled wallet addresses, each marked as either fraud or not fraud. As mentioned earlier, fraud wallet addresses are collected primarily from a large number of user reports, complemented by publicly available scam databases (e.g. the Sanctions List by the US Treasury Department). Safe addresses are wallet addresses of known entities, e.g. known crypto users and crypto exchanges.

Once we have a labelled set of fraud and not-fraud addresses, we query the blockchain indexes built by our own blockchain nodes to collect the transaction details of those addresses. These transaction details give us raw features, which we transform and use to train the ML model.

Example features (out of more than 100) that we use for Bitcoin wallet addresses are:

number of transactions
frequency of transactions
final balance of the wallet
mean size of input transaction
mean size of output transaction
presence of the same wallet address on other chains and the corresponding activity
if the address has been reported or appears on any sanctions list, and so on.
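As a rough illustration of how a few of the Bitcoin features above could be derived from raw transactions, here is a minimal sketch. The record layout (`timestamp`, `direction`, `amount`) is a hypothetical simplification, "size" is interpreted here as the transferred amount, and the balance ignores fees; the real feature pipeline is not shown in this post.

```python
from statistics import mean


def bitcoin_wallet_features(transactions):
    """Build a small feature dict for one wallet.

    transactions: list of dicts with hypothetical keys
    'timestamp' (seconds), 'direction' ('in' or 'out') and 'amount' (BTC).
    """
    inputs = [t["amount"] for t in transactions if t["direction"] == "in"]
    outputs = [t["amount"] for t in transactions if t["direction"] == "out"]
    timestamps = [t["timestamp"] for t in transactions]

    # Active period in days, used to turn a count into a frequency.
    if len(timestamps) > 1:
        span_days = (max(timestamps) - min(timestamps)) / 86400
    else:
        span_days = 0.0

    return {
        "n_transactions": len(transactions),
        "tx_per_day": len(transactions) / span_days if span_days > 0 else 0.0,
        "final_balance": sum(inputs) - sum(outputs),  # simplification: no fees
        "mean_input_size": mean(inputs) if inputs else 0.0,
        "mean_output_size": mean(outputs) if outputs else 0.0,
    }


# Made-up transactions, for illustration only.
txs = [
    {"timestamp": 0, "direction": "in", "amount": 2.0},
    {"timestamp": 86400, "direction": "in", "amount": 1.0},
    {"timestamp": 172800, "direction": "out", "amount": 0.5},
]
features = bitcoin_wallet_features(txs)
print(features)
```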

For Ethereum (ETH) format addresses, we extract the details of transactions as well as ERC20 token transactions. Example features (out of >250 features) that we use for the ETH chain are:

total number of transactions
average time between receive transactions
average time between sent transactions
number of receive transactions
number of unique addresses from which transactions are received or to which they are sent
ether sent or received, etc.

Similar features are developed for the ERC20 tokens for that particular wallet address.
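The timing and counterparty features for Ethereum can be sketched in the same spirit. The transaction keys used below (`from`, `to`, `timestamp`, `value`) and both function names are assumptions chosen for the example, not the schema of our indexes.

```python
def mean_gap(timestamps):
    """Average time between consecutive timestamps, or 0.0 if fewer than two."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def eth_wallet_features(wallet, transactions):
    """transactions: dicts with hypothetical keys 'from', 'to', 'timestamp', 'value' (ether)."""
    sent = [t for t in transactions if t["from"] == wallet]
    received = [t for t in transactions if t["to"] == wallet]
    # Every distinct address this wallet has sent to or received from.
    counterparties = {t["to"] for t in sent} | {t["from"] for t in received}
    return {
        "n_transactions": len(transactions),
        "n_received": len(received),
        "avg_time_between_received": mean_gap(t["timestamp"] for t in received),
        "avg_time_between_sent": mean_gap(t["timestamp"] for t in sent),
        "n_unique_counterparties": len(counterparties),
        "ether_sent": sum(t["value"] for t in sent),
        "ether_received": sum(t["value"] for t in received),
    }


# Made-up addresses and values, for illustration only.
eth_txs = [
    {"from": "0xa", "to": "0xme", "timestamp": 100, "value": 1.0},
    {"from": "0xb", "to": "0xme", "timestamp": 400, "value": 2.0},
    {"from": "0xme", "to": "0xa", "timestamp": 500, "value": 0.5},
    {"from": "0xme", "to": "0xc", "timestamp": 700, "value": 0.25},
]
eth_features = eth_wallet_features("0xme", eth_txs)
print(eth_features)
```

The same function could be run over ERC20 token transfers for the wallet, with `value` holding the token amount instead of ether.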

We tried a number of decision-tree-based and deep neural network (DNN) classifiers and, for each chain, selected the ones with high precision and recall. And yes, the best model type turned out to differ from chain to chain. You can play with some test addresses, such as this one, or check the risk score of your own wallet :-) on dirtyhash.com and let us know if the predicted risk scores make sense.
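To make the selection step concrete, here is a small sketch of comparing candidate classifiers by precision and recall on a held-out set. The labels and predictions are made up, and ranking candidates by F1 (the harmonic mean of precision and recall) is an assumption for the example; the post only says that models with high precision and recall were chosen.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed ranking criterion)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up held-out labels and predictions from two hypothetical candidates.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
candidate_predictions = {
    "tree_based": [1, 1, 0, 0, 0, 0, 0, 1],
    "neural_net": [1, 1, 1, 1, 1, 0, 0, 1],
}

scores = {
    name: precision_recall(y_true, preds)
    for name, preds in candidate_predictions.items()
}
best = max(scores, key=lambda name: f1(*scores[name]))

for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
print("selected:", best)
```

In this toy data the tree-based candidate misses one fraud case but raises no false alarms, while the neural net catches every fraud case at the cost of two false alarms, which is exactly the trade-off the two metrics expose.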

Twitter Bonus

We also tackled the problem of impersonated Twitter handles. A number of user reports described scammers using a look-alike, misspelled Twitter handle of a famous influencer to trick people. We used the Levenshtein distance to find look-alike Twitter handles. Check out some Twitter handles on dirtyhash.com yourself! E.g. Vitalik Buterin’s look-alike Twitter handle.
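For readers unfamiliar with it, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal sketch of the standard dynamic-programming computation, plus a hypothetical `find_lookalikes` helper; the `max_distance` threshold of 2 and the case-insensitive comparison are assumptions for the example, not the settings used on dirtyhash.com.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def find_lookalikes(handle, known_handles, max_distance=2):
    """Return (known_handle, distance) pairs that are close to, but not equal to, handle."""
    handle = handle.lower()
    matches = []
    for known in known_handles:
        distance = levenshtein(handle, known.lower())
        if 0 < distance <= max_distance:
            matches.append((known, distance))
    return sorted(matches, key=lambda match: match[1])


# Illustrative handles: the suspect one swaps a lowercase "l" for an uppercase "I".
lookalikes = find_lookalikes("VitaIikButerin", ["VitalikButerin", "elonmusk"])
print(lookalikes)
```

A distance of zero is excluded on purpose, since that is the genuine handle itself rather than an impersonation.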

Call to Action

DirtyHash is an open source project. Please follow us on Twitter and contribute to the GitHub repo.


Written by DirtyHash