How I [almost] blew up my computer: my story of trimming LARGE password lists / txt files

Khris Tolbert
Maveris Labs


In the midst of conducting a research project on auditing passwords (which I hope to blog about later!), I had acquired numerous large password leaks. These leaks, however, were sometimes not just simple plain-text passwords; they could also contain hashes, the hash type, a count of hits in the source list, etc. So, before they could be of any use in my experiment (yes, I own a lab coat, if you were wondering), I needed to whittle them down to just the appropriate plain password, followed by a newline, and nothing more. This is not the story of how to do this correctly per se, nay… this is the story of how many different ways I tried to destroy my faithful PC in the name of science! The results may be surprising.

The file used for this foray into destruction was the massive Have I Been Pwned v2 “hash and found” dump that was present on the late Hashes [dot] Org site. After pulling down the entirety of the former Hashes [dot] Org catalog via torrent (which itself was a massive 112 GB), the hibp_v2 zip extracted to an enormous 27 GB text file. EDIT: After writing the first draft of this article, I feel it should be emphasized that this leak only contained allegedly leaked passwords, not full accounts nor the source sites of the leaks.

This particular leak was an ASCII-encoded text file containing the algorithm, hash, and plain-text equivalent on each of its more than 500M lines.

Top 100 lines of the hibp_v2 leak
501,583,335 disparate passwords are in this collection

I needed to cut on the :, but I was afraid a simple cat | cut was going to be computationally expensive (or is it? stay tuned). I pondered this and asked what others would try.

This is like, where the magic happens

My short list of things to try then came to:

python
powershell / .NET
sed / sed -u
terrible attempt at VS c++
Golang
Search / Replace in Notepad++ (RIP)
cat | cut
awk

I then needed to come up with measurables and ways to, well, measure them. I came up with the following:

Time to complete task: time / Measure-Command
CPU / memory impact: Resource Monitor / htop

I also decided, after a few initial attempts, that the initial tests would run on a smaller (but still rather large) subset: the first million leaked passwords, which came out to a text file of about 110 MB. I wanted to be sure the read/write speeds of my separate hard drives wouldn't accidentally skew the results, so this smaller file also ensured I could write numerous output files to my SSD (see the sketch after the table for one way such a subset can be carved off). Below are some of the other pertinent control variables:

Operating systems: Windows 10, Kali Linux (WSL)
Memory: 32 GB
CPU: Intel i7-7700K
Storage: 1 TB M.2 SSD
Notable background programs: Chrome, Slack, Outlook, Notepad++, Spotify (gotta jam while we wait!), etc.
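Carving off a subset like that can be done with head; a minimal sketch using the file names that appear later in the post (the full-dump name is an assumption, and as noted in the C++ section my 1M file somehow came out UTF-16 LE, so my exact step may have differed):

# take the first 1,000,000 records of the full dump as the smaller test file
head -n 1000000 515_have-i-been-pwned-v2_found_hash_algorithm_plain.txt \
  > 515_have-i-been-pwned-v2_found_hash_algorithm_plain_topmillion.txt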

Finally, I would attempt a final test against the large 27 GB file (500M lines), since parsing and cutting that beast is really why I started this in the first place.

Python

The first challenger would be Python. I would argue that the code used may not be the most optimized, but it at least gives the gist of what a typical approach to splitting strings in a large file might entail. The code was essentially the same as what was suggested above: read the file, split each line on the ':', and keep just the plain text.

The million record test yielded pretty quick results, finishing in under a second and too fast to measure via top.

But, oddly enough, 3300 or so records were left out?

Running python on the 500M record file showed considerably more impact on memory than expected:

Peak memory use ended up in the ballpark of… well, 80%. Additionally, a single core peaked at around 100% early on, but settled in at an average load of about 30% via top. I would be lying if I said I didn’t contemplate grabbing the fire extinguisher from the kitchen, though.

Very much my reality at this time

And with the use of all that raw processing power, python clocked in at a respectable 19 minutes, 56 seconds. Or so I thought…

Whelp, whomp whomp. 20 minutes I’ll never get back :-/

It appears somewhere along the way, one of the lines was not formatted properly and broke my script. Still, I think even with a little error checking, the execution time would still be close, so I leave correcting this up to the reader!

PowerShell

Before anyone suggests otherwise, I came into this leg expecting results to be paltry. In fact, when researching potential approaches using PowerShell to parse large files, I found this link which simply suggests, “Don’t.”

I came across some other mad person attempting to do what I sought to do via this link and decided it would be the basis of my test. Again, optimizations could be had, but I felt the approach below, reading the file with the .NET reader and splitting each line, was fair enough.

The result on the smaller 1 million record file confirmed what I suspected and dreaded: PowerShell was not going to do this very fast. PS came in at a whopping 960.59 seconds! That works out to a shade over 16 minutes.

As for the impact to the CPU and RAM, PowerShell peaked at around 12–13% CPU utilization across all 8 cores. Judging from the fan noise, I assume one of the cores was near 100%. Yet, this PowerShell script didn’t seem to tie up RAM in any noticeable way.

For kicks and giggles (yes, the PG version), I wondered how a simple get-content | split-string would perform. Memory consumption was more noticeable at around 2 GB, but nowhere near the levels of the python attempt. The temperature in my office became noticeably warmer too, as yet again the fans whirred in their relentless efforts to keep the i7 happy. The PS “equivalent” to cat | cut clocked in at around 21 minutes and 3 seconds.

Did I dare try to parse the large 27 GB beast? Some say I did, and that it is still running to this day, consuming all the resources it can while doubling as an office space heater.

Don’t fret, PS, we’ve all had days like this

The reality of the situation is actually far worse. I have yet to have an attempt finish, and quite frankly, I don’t see that happening anytime soon.

The first PowerShell script (using the .NET reader), while not very taxing on the RAM, just isn’t fast enough to finish before the inevitable heat death of the universe. In theory, it should technically complete in just over one solid week (21 minutes per 1M * 500 = 10,500 min = 175 hours = 7.29 days), but between other science! and weekly patch reboots, it’s not looking likely.

But wait! There’s more! Get-Content | Split-String, much like my python implementation, grabs every spare byte of memory until the system nearly grinds to a halt.

If bytes were tiny marbles, then PowerShell would be PacMan

The job(s) would not even last until morning. Sometime overnight, without fail in at least four tries, my laptop finally exclaims, ENOUGH!, and reboots to rid itself of this parasitic processing scourge.

sed / sed -u

Using sed, I felt I needed to test it both with and without the unbuffered flag, -u. According to the man page, this flag will:

load minimal amounts of data from the input files and flush the output buffers more often

My (now known to be incorrect) hypothesis was that sed would be the quickest and most efficient way to parse and cut very large files. Upon observing a few test runs, I also noticed that sed without the -u flag may not write to the output file until completion, so I wanted to see whether reading/writing piece by piece would be more or less “efficient”. Well, I’ll let you judge.

The trick with sed was figuring out the proper regex. With my trusty skill of simply asking someone else for the answer because I am too lazy to google it on my own, I received 's/.*:\(.*\)/\1/g'. Running this, sed clocked in at around 11 seconds, and while it pegged a single CPU core at 100%, it did not seem to register much use of RAM.

*EDIT: as user crypticgeek pointed out in this reddit thread, the above regex does not correctly handle passwords that contain ':' characters. A more correct regex is 's/[^:]*:\(.*\)/\1/g', or 's/.*?:\(.*\)/\1/g' as relayed by another reddit user, Grezzo82.
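To see the difference, here is a made-up record whose plain text itself contains a ':' (both the hash and the password are invented for illustration):

echo '5f4dcc3b5aa765d61d8327deb882cf99:pass:word' | sed 's/.*:\(.*\)/\1/g'
# prints "word" (the greedy .* swallows everything up to the last colon)
echo '5f4dcc3b5aa765d61d8327deb882cf99:pass:word' | sed 's/[^:]*:\(.*\)/\1/g'
# prints "pass:word" ([^:]* stops at the first colon, keeping the full plain)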

Clocking in at 11.9 seconds, almost as fast as Olympic gold medalist Florence Griffith Joyner in the 100m
Command top showing sed barely touching the RAM

The -u flag seemed to slow down the process drastically, with the tradeoff being more frequent writes to the output file:

8:24, or roughly half a millisecond per record

Peak CPU utilization again was 100% on a single core, and again, sed was sipping memory as it was without the -u flag.

Intel i7 is getting some work in

For the sake of my laptop and my sanity, I only ran the large 500M record file through plain sed. And like above, sed only seemed to tax that single CPU core and sip memory, while finishing in just under 45 minutes.

sed finishes under an hour!

Custom C++ Application

While I initially thought writing my own cheesy c++ application would be a PITA (it’s a technical term), I very quickly realized I was not as prepared as I had hoped.

One of the things I took for granted was text encoding detection, as writing my own detection and forcing the right encoding depending on the file format was a challenge in its own right. In fact, when I created the top 1M record file using head, it saved as UTF-16 LE, even though the source 500M record file is ASCII. For the curious, here is a link to my abomination of code. I must warn you, I didn’t even attempt memory mapping, but just controlled the size of the chunks I was processing, as I speculated that loading the entire file into memory at once might be a bad idea.

Running the aptly named large_file_splitter yielded a finish time of 18 seconds on the 1M record file. Additionally, the tax on the CPU and RAM seemed quite low.

18.16 seconds and some long file names

So, not bad; it seems somewhat comparable to sed. I also engineered it to write to the output file every 100,000 records, so there is a little bit of crash resilience built in. Parsing the monster 27 GB file, my severely under-optimized c++ application clocked in at 15,892 total seconds, or a little over 264 minutes.

15 thousand eight hundred ninety two seconds… naw doesn’t have the same ring to it…

Go (Golang)

Another take, this time from Rudy, was to try this in Go. And when I pressed him with a “why?”, he informed me that he felt it was indeed “fast”.

Rudy and Go.

As my WSL was not set up for Go, I had to follow a short guide here to get started. Then came the learning curve. One would have hoped that “Hello World” would be the first milestone in my Go programming career, but alas, I wanted something more. With the power of my semi-efficient skim/speed reading, a few Stack Overflow articles, and maybe a glance at some Go documentation, I was off trying to wrestle the 1M record file.

Surprisingly, this Go script fared pretty well parsing the smaller 1M record file. Go clocked in at around 7 seconds while consuming barely any memory, although CPU utilization on a single core (via top) did peak at around 99%.

This Go script is faster than the 0–60 of a Subaru Outback

On the behemoth file, I had to make a few small edits to the script, as I hadn’t spent enough time researching the proper way to detect ASCII vs. UTF-16 text encodings (yup, fighting encodings again), but nevertheless, the Go script clocked in at just shy of an hour, while only taxing a single core as above.

Notepad++

Notepad++ has a hard cap of around 512 MB/2 GB according to this link. Despite this, I foolishly dared to attempt the unthinkable, and the 1M record file was set to run amok. I went in blindly, as I hadn’t even checked to see whether Notepad++ could do a find/replace on a regex (SPOILER: it can!).

Opening the 1M record file did take a few seconds, but using Notepad++’s search/replace feature with the following regex, (.*:), and then clicking “Replace All” (replacing with nothing) trimmed the file down appropriately in roughly 3 minutes and 45 seconds.

Who knew Windows 10 had a built-in stopwatch? I didn’t

Memory consumption seems to be all up front in the initial loading of the file, as Notepad++ didn’t budge much over the 1 GB it grabbed at first to load the 1M record file.

I was hoping Notepad++ would at least try to load the monstrosity that is the full 500M line dump, cycle through until it hit the file size cap, and then either crash or exit, but (un)fortunately, I was immediately met with this message when trying to open it in the application:

Notepad++ gets a DNF just like poor PowerShell in the Gigantic File Olympics

Notepad++ engineers, if y’all need someone to come in and lend their expertise on dealing with 27 GB worth of text in a single file, let me know. I might, uh, know a guy… Kidding aside, kudos to the application for checking the file size before wasting resources!

Cat | Cut — Winner!

The astonishing surprise winner in this race was indeed cat | cut. I had falsely assumed this would be the most inefficient way to split such a large file; I thought that cat would read the entire file first, and only then would cut work on each line. Man, was I wrong (please don’t be too harsh on me, wise old Unix beards): the pipe streams, so cut starts chewing through lines as soon as cat starts writing them, and nothing close to the whole file ever needs to sit in memory. cat | cut was not only the fastest, it was done before any consistent measurement could be taken of the CPU/memory impact:

Cat | cut was so fast, it didn’t even register on top during the 1 million record test
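The exact one-liner isn’t reproduced here, but it would have been something along these lines (filenames illustrative; the field choice depends on how you want to treat colons inside passwords):

time cat 515_have-i-been-pwned-v2_found_hash_algorithm_plain.txt | cut -d ':' -f 2- > hibp_v2_plains.txt
# -f 2- keeps everything after the first colon, so plains containing ':' survive intact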

Running cat | cut on the 500M record file (yup, that 27 GB one) was also alarmingly quick, coming in at just under 2 minutes:

And I was able to capture some impact on the CPU/RAM:

Cat barely using any memory here…
Cut averaging about 5.9% CPU utilization (averaged across all 8 cores) via Resource Monitor
Top showing very little usage on Memory, but one of the cores near 50% for cut

AWK — notable mention!

Another speedy suggestion was awk. On the top million leak, the command finished in under a second, just like cat | cut. I would presume that most of these common *nix commands are far faster than expected.

awk -F ':' '{print $2}' 515_have-i-been-pwned-v2_found_hash_algorithm_plain_topmillion.txt
AWK is surprisingly fast as well

Much as I assumed, awk performs a lot like cut: it finished the task in just under 2 minutes, taxing a single core but not registering much impact on RAM.

Awk comes in a tight second behind cat | cut

awk would be a great alternative, as it makes it easier to move columns around or to add characters to the trimmed string as needed (a couple of quick examples follow below). The time difference between awk and cat | cut is small enough that external processing load could sway the results in favor of one or the other, too.
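For instance, a couple of hypothetical tweaks, assuming the same hash:plain layout the command above implies (input.txt standing in for the leak file):

awk -F ':' '{print $2 ":" $1}' input.txt    # swap the columns: plain first, then hash
awk -F ':' '{print "[" $2 "]"}' input.txt   # wrap each trimmed plain in brackets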

Conclusion

Although my laptop may not agree, this was a fun little project. Built-in Linux utilities cut and awk were surprisingly quick and efficient, python and golang beat my expectations, and I presume my CS112 professor would be pretty upset with the performance of my custom c++ application. In the end, I realized that I already had a trimmed version of this exact password list elsewhere on my SSD, but at least I learned something, right? Maybe?

Maveris is an IT and cybersecurity company committed to helping organizations create secure digital solutions to accelerate their mission. We are Veteran-owned and proud to serve customers across the Federal Government and private sector. Maveris Labs is a space for employees and customers to ask and explore answers to their burning “what if…” questions and to expand the limits of what is possible in IT and cybersecurity. To learn more, go to maveris.com/#maveris-labs.


Khris Tolbert
Maveris Labs

Sometimes things break and I happen to be behind the keyboard. I’m just as confused as you are.