Looking at RIPEMD-160 Bitcoin Addresses for Fun and No Profit

Keir Finlow-Bates
May 11 · 7 min read

Sometimes for fun, I poke around in that wonderful messy data pool that is the Bitcoin blockchain. It’s full of interesting transactions, data, and weird quirky goings-on. Today I’m publishing this article, which looks at strange repetitive Bitcoin addresses, and what they could mean.

A large number of Bitcoin addresses are derived from 160-bit strings that look very odd and decidedly unrandom, which initially led me to conclude that many were the result of coding errors. However, the highest-value addresses have output transactions, which suggests they are vanity addresses.

I deduced that about 89.2 BTC has been lost due to coding errors, which is about half a million dollars at today’s exchange rate.

The question still remains: why would someone want to make Bitcoin addresses starting with lots of 1s?

The computer code bit

Last night I found myself wondering, “how many Bitcoin addresses have a balance?”, so I went and searched on GitHub and almost immediately found a neat Python script that trawls through the Bitcoin chainstate database and generates a file with exactly that data. It’s available at https://github.com/graymauser/btcposbal2csv.

Extracting the transactions

To start with, you need an up-to-date copy of the Bitcoin chainstate database. This resides in the chainstate folder of your Bitcoin Core data directory.

It took me about 20 minutes to resolve all the installation problems for the dependencies needed by graymauser’s script, so I took time out to produce https://github.com/kf106/btcposbal2csv, which sorts all that out with a single install script.

It turns out the file it generated on May 9 is 24,157,834 lines long and takes up about 1.2 GB of disk space. Each line shows a Bitcoin address, the balance in satoshis, and the last block in which there was any activity for that address. Very neat.

The file size means that you can’t really use a spreadsheet to start analyzing the contents. Time to break out sed, awk and other Linux command-line tools that can handle that quantity of data…

First step: let’s add up the total number of satoshis the script has recorded:

When I run that, I get 1,571,401,006,662,989 satoshis, which is 15.7 million bitcoins. Hmm, the script did warn that it couldn’t process all transactions, and as I write this about 17.7 million have been issued, so 2 million are missing. Never mind — it’s close enough.
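The same kind of total can be computed without awk. The following is a minimal Python sketch, not the command used above. It assumes the file is comma-separated with the balance in satoshis in the second field, as described earlier, and it skips any row whose balance is not an integer. The function name and the sample rows are made up for illustration.

```python
import csv
import io

SATOSHIS_PER_BTC = 100_000_000

def sum_satoshis(lines, balance_column=1):
    """Add up the balance column of an iterable of CSV lines.

    Assumption: the balance in satoshis is the second field (index 1).
    Rows whose balance field is not an integer (e.g. a header) are skipped.
    """
    total = 0
    for row in csv.reader(lines):
        try:
            total += int(row[balance_column])
        except (IndexError, ValueError):
            continue
    return total

# Tiny made-up sample, only to show the shape of the data.
sample = io.StringIO(
    "address,balance,last_block\n"
    "addrA,150000000,575000\n"
    "addrB,50000000,575001\n"
)
total = sum_satoshis(sample)
print(total, "satoshis =", total / SATOSHIS_PER_BTC, "BTC")
```

For the real file you would pass an open file handle instead of the sample. Because the rows are read one at a time, the 1.2 GB file never has to fit in memory.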

Introducing RIPEMD-160

There is also a script that will add the RIPEMD-160 address to the comma-separated file generated. If you go to https://gobittest.appspot.com/Address you can see a useful page that shows all the steps involved in generating a Bitcoin address from an ECDSA private key (note: for safety’s sake don’t enter your own private key in this or any other web page). The RIPEMD-160 address is halfway down. It is the RIPEMD-160 hash of the SHA-256 hash of the ECDSA public key, and it keeps the public key itself hidden until you first spend from the address, which reduces the risk of address compromise. It is usually encoded in base58 with a version byte and a checksum in order to catch typing errors and make it more readable.
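The last step of that process, turning the 20-byte hash into a familiar address, can be sketched in a few lines of Python. This is a minimal illustration of Base58Check encoding for legacy addresses (version byte 0x00), assuming the hash is supplied as 40 hexadecimal digits. The function name is made up, and the key-generation and hashing steps before it are left out.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def hash160_to_address(hash160_hex, version=b"\x00"):
    """Base58Check-encode a 20-byte hash as a legacy (P2PKH) address."""
    payload = version + bytes.fromhex(hash160_hex)
    if len(payload) != 21:
        raise ValueError("expected a 20-byte (40 hex digit) hash")
    # Checksum: first four bytes of SHA-256 applied twice to the payload.
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    # Treat the bytes as one big number and write it in base 58.
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    # Each leading zero byte is written as a literal '1'.
    leading_zero_bytes = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zero_bytes + encoded

print(hash160_to_address("00" * 20))  # 1111111111111111111114oLvT2
```

One side effect is visible here: every leading zero byte becomes a literal 1, so a hash that starts with ten hexadecimal zeros gives an address that starts with six 1s (five zero bytes plus the version byte).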

When you run the RIPEMD-160 script on the initial output you end up with a 2.2 GB file, with each line containing the Bitcoin address, the balance, the last block in which it was active, and the RIPEMD-160 address. I called this file ripe.csv. If you want to check the number of lines yourself, just run:

Next, I decided to strip off the Bitcoin addresses, balances and block numbers, to just have a look at the RIPEMD-160 addresses. I also decided to sort them. I’m not sure why I did this, but the following commands do just that:
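A rough Python equivalent of that extract-and-sort step is sketched below. It is not the original shell pipeline. It assumes the RIPEMD-160 value is the fourth comma-separated field, and the function name and sample rows are invented for illustration.

```python
import csv

def sorted_ripemd_column(lines, ripemd_column=3):
    """Return the RIPEMD-160 field of every row, sorted as text.

    Assumption: the hash is the fourth comma-separated field (index 3).
    """
    hashes = [row[ripemd_column] for row in csv.reader(lines)
              if len(row) > ripemd_column]
    hashes.sort()
    return hashes

# Made-up rows, only to show the shape of the data.
sample = [
    "addrA,100,575000," + "f" * 40,
    "addrB,200,575001," + "0" * 39 + "1",
    "addrC,300,575002," + "7" * 40,
]
ordered = sorted_ripemd_column(sample)
print(ordered[0])   # smallest hash, the kind of line head shows first
print(ordered[-1])  # largest hash, the kind of line tail shows last
```

This version holds every hash in memory, which for 24 million rows means a few gigabytes of RAM. The Unix sort utility copes better with files of this size because it can spill to temporary files on disk.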

And now, the very interesting bit. With the head or tail command, you can look at the first or last few lines of a file (ten by default). So look at this output from head:

And here’s the output from tail:

These do not look like random addresses!

Being lazy is a lot less work

Now, I could do a lot of calculations to determine the probability of lots of repeated digits appearing in a RIPEMD-160 address, but that sounds like a lot of work. Obviously, a few repeated digits are to be expected. But at what point can we conclude that an address is not the product of pseudo-random chance, but rather due to programmer’s error? (And in the tail output, what is with that “db004301” string that keeps turning up?)

A quick tutorial: RIPEMD-160 takes any string as its input and returns a 160-bit number, which in hexadecimal is 40 digits (20 octets), i.e. forty characters consisting of 0–9 or A–F. That’s about 1.5*10⁴⁸ possible outputs.

So my first thought was to produce a 24,157,834 line file with genuinely random RIPEMD-160 outputs. Time to cobble together a bash script! An initial attempt to generate random numbers and pipe them through the hash function failed because of memory problems (despite 16 GB of RAM in my machine). So instead, how about using the actual address list as input to generate a random RIPEMD-160 list:

For this, I’m hashing each of the RIPEMD-160 addresses again and putting them into a new file. The hash function RIPEMD-160 is meant to be a cryptographic hash function, so its output should be pseudo-random.
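The re-hashing idea can be sketched in Python as follows. Two things here are assumptions rather than the method used above: the sketch hashes the hexadecimal text of each address (not its raw 20 bytes), and, because some Python builds ship without RIPEMD-160, it falls back to SHA-256 truncated to 160 bits, which serves equally well as a source of pseudo-random hex digits. The function names are made up.

```python
import hashlib

def rehash(hex_digest):
    """Hash a hex string again and return 40 pseudo-random hex digits."""
    data = hex_digest.strip().encode("ascii")
    try:
        return hashlib.new("ripemd160", data).hexdigest()
    except ValueError:
        # Assumption: RIPEMD-160 is unavailable in this build, so use
        # SHA-256 cut down to 160 bits (40 hex digits) as a stand-in.
        return hashlib.sha256(data).hexdigest()[:40]

def rehash_lines(lines):
    """Lazily re-hash every non-empty line, one at a time."""
    for line in lines:
        if line.strip():
            yield rehash(line)

# Made-up inputs with obvious patterns; the outputs should show none.
for digest in rehash_lines(["0" * 40, "f" * 40]):
    print(digest)
```

Because `rehash_lines` is a generator, it never holds more than one line at a time, so memory use stays flat however long the input file is.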

Okay, that takes a bit of time. 12 hours in fact — sometimes being lazy takes a long time. So it’s a good thing that delivery pizza has arrived and it’s time to feed the kids and then put them to bed, watch some TV, and then go to bed myself.

In a given RIPEMD-160 hash represented in hexadecimal, if it is truly random, the chance of a given digit being followed by nine more copies of itself is 1/(16⁹): given a digit (e.g. ‘F’), the chance of the next digit being the same is 1/16, and the one after that another 1/16, and so on nine times. That’s 1/68,719,476,736.

Now, in a 40 digit string, there are 31 positions at which such a run can start, and there are 24,157,834 addresses, so a rough estimate of the odds of an address containing ten repeated digits is (31*24,157,834)/68,719,476,736, which is about 0.0109. Or simply put, there is about a 1% chance that this could occur naturally.

In general, the odds of N digits repeating somewhere are about

((41-N)*24157834)/(16^(N-1))

So it turns out that 9 repeating digits are quite likely in our file, at 18%. (These calculations are estimates — to do this properly I’d have to calculate the odds of a run, as shown in this Wolfram page, which is a pain in the neck).
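These estimates are easy to check with a few lines of Python. The sketch below uses the same counting as the ten-digit example above: a run of N identical digits can start at 41-N positions, and the N-1 digits after the first must each match with probability 1/16. The function name is made up, and like the figures in the text this is an approximation, not an exact run probability.

```python
ADDRESS_COUNT = 24_157_834  # number of lines in the balances file
HEX_DIGITS = 40             # length of a RIPEMD-160 hash in hex

def expected_runs(n, addresses=ADDRESS_COUNT, digits=HEX_DIGITS):
    """Rough expected number of runs of n identical hex digits.

    There are (digits - n + 1) start positions per address, each matching
    with probability 1 / 16**(n - 1). Overlapping runs are counted more
    than once, so this is an estimate rather than an exact probability.
    """
    return (digits - n + 1) * addresses / 16 ** (n - 1)

for n in (8, 9, 10):
    print(n, round(expected_runs(n), 4))
```

This gives roughly 3 for eight repeated digits, 0.18 for nine and 0.011 for ten.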

The following command counts the number of lines in a file in which there are ten repeated characters:

So, ripesort.csv contains 5476 matches, and pseudorandsort.csv contains 0, as expected. In fact, there are no nine character repetitions in pseudorandsort.csv either, and only 3 eight character repetitions. There is definitely something going on in the list of real Bitcoin addresses. Here’s a sample of some of the matches:

That does look suspicious, doesn’t it?
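A count like this can be expressed as a regular expression with a back-reference: one captured character followed by nine more copies of itself. The Python sketch below shows the idea; the function name and sample lines are made up, and it is an illustration rather than the command used above.

```python
import re

def count_lines_with_run(lines, run_length=10):
    """Count lines containing run_length identical characters in a row."""
    # (.) captures one character; \1{k} demands k more copies of it.
    pattern = re.compile(r"(.)\1{%d}" % (run_length - 1))
    return sum(1 for line in lines if pattern.search(line))

# Made-up lines, only to show the behaviour.
sample = [
    "0000000000abc",     # ten zeros: counted
    "123456789a",        # no repeats: not counted
    "ab" + "f" * 10,     # ten f's after a prefix: counted
    "f" * 9,             # only nine f's: not counted
]
print(count_lines_with_run(sample))
```

Passing a different `run_length` reuses the same function for the eight- and nine-character checks.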

Adding up the satoshis

It’s time to return to ripe.csv, the original file with all the information, and add up all the satoshis stored in addresses with ten or more repeated digits. These two commands should do that (the first extracts all the rows with strange RIPEMD-160 addresses, and the second is our old satoshi-adding command again):

First I test it on a small sample to check it works (always do this if you are dealing with a lot of data), and then run it over the whole set. It takes about three hours. It’s a good thing I’m doing this in the background while busying myself with other tasks, but I should have done all the file operations in memory using tmpfs rather than on disk. Lesson learned.

The satoshi total is 23734753110. That's about 237 bitcoins, which at today’s prices is about $1.5 million.
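The filter and the sum can also be done in a single pass, with no intermediate file. The Python sketch below assumes the column layout described earlier (balance in the second field, RIPEMD-160 value in the fourth). The function name and sample rows are invented for illustration, and this is not the pair of commands used above.

```python
import csv
import re

TEN_REPEATS = re.compile(r"(.)\1{9}")  # one character plus nine copies

def sum_weird_balances(lines, balance_column=1, ripemd_column=3):
    """Sum the balances of rows whose hash has ten repeated digits.

    Assumptions: balance in field index 1, RIPEMD-160 hex in field index 3.
    Rows that are too short or have a non-integer balance are skipped.
    """
    total = 0
    for row in csv.reader(lines):
        if len(row) <= ripemd_column:
            continue
        if not TEN_REPEATS.search(row[ripemd_column]):
            continue
        try:
            total += int(row[balance_column])
        except ValueError:
            continue
    return total

# Made-up rows, only to show the shape of the data.
sample = [
    "addrA,500,575000," + "0" * 10 + "ab" * 15,  # ten leading zeros
    "addrB,700,575001," + "0123456789" * 4,      # nothing repeated
    "addrC,300,575002," + "12" * 15 + "f" * 10,  # ten trailing f's
]
print(sum_weird_balances(sample))
```

Reading and filtering row by row avoids writing a second large file to disk, which was the slow part of the run described above.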

Vanity of vanities

However, after looking at the addresses that contain ten leading zeros (which were the ones with the highest balances) it turns out that they have output transactions. This means that the private keys for these addresses are known.

Sorting the ripeweird.csv file by the number of satoshis and looking at the addresses with the highest balance might be helpful:

This gives the following output:

The penultimate address with 67 bitcoins is definitely a case of programmer error — no outputs ever, and it’s highly unlikely that anyone has or will generate a private ECDSA key that ends up with that Bitcoin address.

But the other addresses starting with ten zeroes appear to be vanity addresses. With a vanity address, you keep generating keys and deriving Bitcoin addresses from them until you find one that starts with the characters you want.

So, for example, 1111116d87CjjDyP8SF5v1LTvUq22VFg currently has a balance of 89.5 BTC, but has received nearly 400 BTC over its lifetime.

Summing up

I manually removed the vanity addresses that are actually active and ran the satoshi count again. This time I’m left with 8,925,496,933 satoshis, which is 89 bitcoins, and therefore worth about half a million dollars.

I have put “retrieving addresses with inputs but no outputs ever” on the list of features to add to the btcposbal2csv script at some point in the future.

Well, that’s it for this weekend — hope you enjoyed that random exploration of bitcoin transactions, and that it’s given you some inspiration to go exploring yourself. The Bitcoin blockchain is full of interesting stuff, so you can easily spend an enjoyable weekend poking around in it, and hopefully this article has given you some tools to do so.

Written by Keir Finlow-Bates, CEO and co-founder of Chainfrog Oy, a Finnish startup researching and developing advanced blockchain technologies.