Creating a local version of the Haveibeenpwned password database (with Python and SQLite)

Christian Schwarz
Published in Analytics Vidhya
7 min read · Mar 26, 2021

In this post we show how to create a local version of the Haveibeenpwned password database. This can then be used to check passwords for security without the need for an internet connection.

What is Haveibeenpwned?

Haveibeenpwned is a website by security researcher Troy Hunt that collects leaked credentials from data breaches. As a user, you can enter your email address and then find out whether it has already been included in a data breach. You can also test your password in the same way.

If a password is contained in a breach, it should be changed immediately.

What is the point of a local version?

Troy Hunt is a world-renowned security researcher and the website is excellently secured. However, the idea of simply entering your password on this site may sound strange to some. Perhaps such a check is also made difficult or impossible by certain company guidelines.

As an alternative, there is also a Haveibeenpwned API. With this, you hash your password on your own computer with SHA-1 and transmit only the first 5 characters of the hash to the API. The API responds with the remaining characters (the suffixes) of all hashes in its database that begin with these 5 characters, together with their prevalence. On average, you receive about 400 entries. You can then check on your own computer whether one of them matches the rest of your own password hash.
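
The client side of this lookup can be sketched roughly as follows. This is a minimal illustration, not official client code: the function names are our own, and we assume the public range endpoint at api.pwnedpasswords.com answers with one SUFFIX:COUNT entry per line. Only the 5-character prefix ever leaves the machine.

```python
import hashlib
import urllib.request

# Assumed public range endpoint; the 5-character prefix is appended to it.
RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_sha1(password: str) -> tuple[str, str]:
    """Return (first 5 hex chars, remaining 35 hex chars) of the SHA-1 hash."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix: str, response_text: str) -> int:
    """Search a 'SUFFIX:COUNT' response for our suffix; 0 means not found."""
    for line in response_text.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def check_online(password: str) -> int:
    """Query the API with the prefix only and compare the suffix locally."""
    prefix, suffix = split_sha1(password)
    with urllib.request.urlopen(RANGE_URL + prefix) as reply:
        return count_in_response(suffix, reply.read().decode("utf-8"))
```

Calling check_online("password") would perform the network request; the two helper functions can be used without any connection.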

This is actually a nice solution, but if an attacker manages to carry out a man-in-the-middle attack, they could possibly read the API reply. They could then attempt a pass-the-hash attack or try to crack the 400 returned hashes with rainbow tables.

With a local version, on the other hand, no passwords or hashes are transmitted over the network; everything happens on your own PC.

What do you need?

Basically, you need three things:

· A good internet connection: Currently, the smallest file (NTLM format, ordered by hash) is 8.5GB. The largest file (SHA-1 format, ordered by prevalence) is 12.5GB. The download can therefore take a while.

· Storage space: The unzipped text file is about 20GB. If we were to search the passwords directly in this text file, it would take an extremely long time. To ensure fast queries, we need a database. To optimise this even further, we use an index. However, this also increases the size of the database significantly, so you should plan for about 50GB of free disk space.

· Python: I used Python 3.9, but it should also work with other 3.X versions. The nice thing is that the Python installation already includes SQLite, so we don’t need to install an additional database.

Step 1: Download text file

On the page https://haveibeenpwned.com/Passwords the password hashes can be downloaded. SHA-1 and NTLM versions are available, which are ordered either by hash or by prevalence.

Since it does not matter to us whether we hash with SHA-1 or NTLM, and we do not need the hashes in any particular order, we choose the smallest version (NTLM, ordered by hash) to minimise the download time.

Screenshot of the downloadable files on Haveibeenpwned

We download a ZIP file, which we unzip after the download has finished. The result is a simple, but very large text file. Please do not try to open this file with a normal text editor. In the best case nothing will happen; in the worst case your PC will freeze because the editor cannot handle a file of this size. If you want to open the file to have a closer look at the content, you need a special editor such as EmEditor.

The text file contains in each line a hash of a password and the prevalence with which this password has appeared in breaches. Hash and prevalence are separated by a colon. It looks like this:

format of the lines in the unzipped .txt file

For example, the line 00000001F4A473ED6959F04464F91BB5:4 means that the hash 00000001F4A473ED6959F04464F91BB5 has appeared 4 times in breaches.
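
Splitting such a line takes only a few lines of Python. The helper name parse_line is our own choice for illustration:

```python
def parse_line(line: str) -> tuple[str, int]:
    """Split one 'HASH:PREVALENCE' line into a hash string and an integer."""
    hash_value, _, prevalence = line.strip().partition(":")
    return hash_value, int(prevalence)


# The example line from the text file discussed above.
sample = "00000001F4A473ED6959F04464F91BB5:4"
hash_value, prevalence = parse_line(sample)
print(hash_value, prevalence)
```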

Step 2: Create database

At this point, we could also just use the text file to check if our password is in it. We would hash our password and then go through the file line by line. We would have to split each line at the colon and then compare the hash with our password hash. This is feasible, but takes a long time (about 20 minutes on my laptop).
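
Such a line-by-line search could look roughly like the sketch below. The function name and the return convention (the prevalence if found, otherwise None) are assumptions made for this illustration.

```python
def scan_text_file(txt_path, target_hash):
    """Read the file line by line and return the prevalence of target_hash.

    Returns None if the hash is not in the file. The whole file may have to
    be read, which is why this approach is slow for a 20GB file.
    """
    target = target_hash.upper()
    with open(txt_path, "r", encoding="ascii") as handle:
        for line in handle:
            hash_value, _, prevalence = line.strip().partition(":")
            if hash_value == target:
                return int(prevalence)
    return None
```

Because the file is read lazily, memory use stays small; the cost is purely the time needed to stream through every line.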

Since this would be a bit impractical, we create a database in the next step to make queries faster. For this we use SQLite, which is already included in the standard Python installation, so we don’t need any additional database software or Python libraries.

However, if we create a plain SQLite table without an index, the queries are still not very fast. This is not surprising: after all, there are 613 million password hashes in the database, and every query has to scan them one by one.

To increase the performance of the queries, we need to create an index for the Hash column. An index is stored as an additional B-tree keyed on the hash values. This data structure massively reduces the search time, because a lookup only has to follow a few nodes instead of scanning every row. The disadvantage is that the database becomes significantly larger. You should plan for about three times the storage space that the original text file needs.
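
To see the effect of the index, we can ask SQLite how it plans to execute a lookup before and after the index exists. The sketch below uses a tiny in-memory table; the table, column and index names are assumptions for illustration, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passwords (hash TEXT, prevalence INTEGER)")
conn.execute(
    "INSERT INTO passwords VALUES (?, ?)",
    ("00000001F4A473ED6959F04464F91BB5", 4),
)

QUERY = "EXPLAIN QUERY PLAN SELECT prevalence FROM passwords WHERE hash = ?"


def plan(connection):
    """Return the human-readable plan text (last column of each plan row)."""
    rows = connection.execute(QUERY, ("00000001F4A473ED6959F04464F91BB5",)).fetchall()
    return " ".join(row[-1] for row in rows)


plan_without_index = plan(conn)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_passwords_hash ON passwords (hash)")
plan_with_index = plan(conn)  # expected to mention the new index

print(plan_without_index)
print(plan_with_index)
```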

The code to create the database then looks like this:

(Code can be found on my GitHub)

python script for building the SQLite database

First, we create an SQLite database with sqlite3.connect, here named “pwned_indexed”. Since we pass only a file name, the database is created in the current working directory, which is usually the directory from which the script is run. If you want to store the database in a different location, you can specify a full path instead.

Then we create a table “passwords” with the two columns Hash and Prevalence. For the Hash column, we also create an index so that the search for hash values later runs particularly quickly.

Then we read the values from the text file into the database: each line is read and split at the colon, and the first value (index 0) is saved as the hash and the second value (index 1) as the prevalence.

In addition, after every one million inserted hash values, we print an intermediate count on the terminal. The script simply checks whether the count variable is divisible by one million without a remainder. If this is the case, the progress is displayed.

At the very end, we only have to commit everything and close the connection to the database. All this is just 30 lines of Python code (if you can do without the progress output, you can also delete the four lines needed for this).
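
The steps described above can be sketched as follows. This is not the original script from GitHub but a minimal reconstruction under a few assumptions: the .db file extension, the index name idx_passwords_hash and the function wrapper are our own choices.

```python
import sqlite3


def build_database(txt_path, db_path="pwned_indexed.db", progress_every=1_000_000):
    """Read 'HASH:PREVALENCE' lines from txt_path into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS passwords (hash TEXT, prevalence INTEGER)")
    # Index on the hash column so that later lookups do not scan every row.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_passwords_hash ON passwords (hash)")

    count = 0
    with open(txt_path, "r", encoding="ascii") as handle:
        for line in handle:
            parts = line.strip().split(":")
            if len(parts) != 2:
                continue  # skip empty or malformed lines
            cur.execute(
                "INSERT INTO passwords VALUES (?, ?)",
                (parts[0], int(parts[1])),
            )
            count += 1
            # Progress output every progress_every inserted rows.
            if count % progress_every == 0:
                print(f"{count:,} hashes inserted")

    conn.commit()
    conn.close()
    return count
```

One design note: here the index is created before the rows are inserted, as described in the text. For bulk loads it is usually faster to insert all rows first and create the index afterwards, so that may be worth trying if the build takes too long.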

In Windows Explorer, our new SQLite database looks like this:

screenshot from Windows Explorer

If you look at the properties, you will see that the database is 52.5GB in size.

properties of the file

Step 3: Query database

Now that we have a database with 613 million password hashes, we can write another script to query this database.

python script for querying the database

The input() function lets the user enter any password and check it for breaches. The password entered is hashed with NTLM, as this format was also used for the hashes in our database.

Since the hashes in the database are all stored in uppercase letters, we also convert the computed hash into uppercase letters and then decode the bytes object into a string so that comparisons with the strings in the database are possible.

We then connect to the database and check whether the hash is contained. If this is the case (i.e. a result is returned), the script prints “Pwned” and the hash. If you like, you could also output the prevalence at this point, but since this is irrelevant for us (regardless of whether a password has been breached once or fifty thousand times, it should not be used again), we will do without it here.
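
A query along these lines might look like the sketch below. It is a reconstruction, not the original script: the function names are our own, and we use hexdigest(), which returns a string directly, so no separate decode step is needed. NTLM is MD4 over the UTF-16LE encoded password; note that hashlib.new("md4") depends on the underlying OpenSSL build and may raise a ValueError on systems where MD4 is disabled.

```python
import hashlib
import sqlite3


def ntlm_hash(password):
    """Return the uppercase NTLM hash (MD4 over the UTF-16LE password).

    Assumption: the local OpenSSL build still provides MD4; otherwise
    hashlib.new raises a ValueError.
    """
    digest = hashlib.new("md4", password.encode("utf-16-le")).hexdigest()
    return digest.upper()


def lookup(conn, password_hash):
    """Return the prevalence of password_hash, or None if it is not stored."""
    row = conn.execute(
        "SELECT prevalence FROM passwords WHERE hash = ?",
        (password_hash.upper(),),
    ).fetchone()
    return None if row is None else row[0]


def report(conn, password_hash):
    """Build the message for the terminal, in the spirit of the text above."""
    if lookup(conn, password_hash) is not None:
        return f"Pwned: {password_hash}"
    return "Not found in the database"
```

In an interactive script, one would open the database with sqlite3.connect, read the password with input(), and print report(conn, ntlm_hash(password)).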

This script takes only 24 lines. With a total of 50 lines (if you dispense with the progress output), you can recreate the password-check functionality of Haveibeenpwned.

output of the script for: password

The screenshot shows the output of the script for “password”. As expected, this password has already been pwned.

output of the script for a random password

A completely random password, on the other hand, is not yet contained in the database.

By the way, the queries deliver a result almost immediately, because the B-tree structure means that only very few nodes need to be checked until the correct hash is found.

What can be done with it?

You can use the local version of the password database to check all existing passwords for their security and also to check new passwords to see if they are already part of a breach and should therefore not be used.

Alternatively, you can integrate the database into your own applications. This procedure has been explicitly permitted by Troy Hunt and is not associated with any costs.

permission from Troy Hunt to use the password database

If the database is integrated into other applications, they can check at registration whether a user's chosen password is already contained in a data breach. If this is the case, the password can be marked as insecure and the user can be asked to choose another password.
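
Such a registration check could be wired in roughly like this. Everything here is a hypothetical sketch: validate_new_password, its return convention and the hasher parameter are our own inventions, and the hasher must produce the same hash format that was loaded into the database (NTLM in this article).

```python
import sqlite3


def validate_new_password(conn, password, hasher):
    """Return (accepted, message) for a password chosen at registration.

    conn   : open sqlite3 connection to the local hash database
    hasher : function that maps a password to its hex hash string
    """
    password_hash = hasher(password).upper()
    row = conn.execute(
        "SELECT 1 FROM passwords WHERE hash = ? LIMIT 1",
        (password_hash,),
    ).fetchone()
    if row is not None:
        return False, "This password appears in a known data breach. Please choose another one."
    return True, "Password accepted."
```

Passing the hash function in as a parameter keeps the check independent of the hash format, so the same function works with a database built from the SHA-1 files.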
