How I [almost] blew up my computer: my story of trimming LARGE password lists / txt files
In the midst of conducting a research project on auditing passwords (which I hope to blog about later!), I had acquired numerous large password leaks. These leaks, however, sometimes were not just simple plain text passwords. They could contain hashes, hash type, count of hits in the source list, etc. So, before they could be of any use in my experiment (yes I own a lab coat if you are wondering), I needed to whittle them down to just the appropriate plain password, followed by a new line, and nothing more. This is not the story of how to do this correctly per se, nay… this is the story of how many different ways I tried to destroy my faithful PC in the name of science! The results may be surprising.
The file used for this foray into destruction was the massive Have I Been Pwned v2 “hash and found” dump that was present on the late Hashes [dot] Org site. After pulling down the entirety of the former Hashes [dot] Org catalog via torrent (which itself was a massive 112 GB), the hibp_v2 zip extracted to an enormous 27 GB text file. EDIT: After writing the first draft of this article, I feel it should be emphasized that this leak only contained allegedly leaked passwords, not full accounts, nor the source sites of the leak.
This particular leak was an ASCII-encoded text file containing the algorithm, hash, and plain text equivalent on each of its 500M+ lines.
I needed to cut on the “:”, but I was afraid a simple cat | cut was going to be computationally expensive (or is it? stay tuned). I pondered this and asked what others would try.
My short list of things to try then came to:
python
powershell / .NET
sed / sed -u
terrible attempt at VS c++
Golang
Search / Replace in Notepad++ (RIP)
cat | cut
awk
I then needed to come up with measurables and ways to, well, measure them. I came up with the following:
Time to complete task: time / Measure-Command
CPU/Memory impact: Resource Monitor / htop
I also decided, after a few initial attempts, that the initial test would be on a smaller (but still rather large) subset of the first million leaked passwords, which resulted in a text file of about 110 MB. Also, I wanted to ensure the read/write times of my separate hard drives wouldn’t accidentally impact the test, so this smaller file ensured I could write numerous files to my SSD. Below are some of the other pertinent control variables:
Operating Systems: Windows 10, Kali Linux (WSL)
Memory: 32 GB
CPU: Intel i7-7700K
HDD: 1 TB m.2 SSD
Notable Background Programs: Chrome, Slack, Outlook, Notepad++, Spotify (gotta jam while we wait!), etc.
Finally, I would attempt a final test that would parse and cut the large 27 GB file (500M lines), since that is really why I started this in the first place.
python
The first challenger would be python. I would argue that the code used may not be the most optimized, but it at least gives a gist of what a typical approach to splitting strings in a large file might entail. The code was essentially the same as what was suggested above. The million record test yielded pretty quick results, finishing under a second, and too fast to measure via top.
But, oddly enough, 3,300 or so records were left out?
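The exact script isn’t reproduced here, but a minimal sketch of the approach (file names and the choice to keep everything after the last colon are my assumptions, not the code I actually ran) would look something like this:

```python
def trim_file(src_path, dst_path):
    """Stream src line by line, keeping only the text after the last ':'."""
    with open(src_path, encoding="ascii", errors="replace") as src, \
            open(dst_path, "w") as dst:
        for line in src:
            # rsplit on the last colon; note this truncates any plain that
            # itself contains ':' (the same pitfall as the sed regex later on)
            plain = line.rstrip("\n").rsplit(":", 1)[-1]
            dst.write(plain + "\n")
```

A streaming loop like this should keep memory flat, so the heavy RAM use I observed suggests my actual implementation read far bigger pieces of the file at once.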
Running python on the 500m record file showed considerably more impact to memory than expected:
Peak memory use ended up in the ballpark of… well 80%. Additionally, the max use on a single core peaked at around 100% early on, but settled in to average about 30% load via top. I would be lying if I said I didn’t contemplate grabbing the fire extinguisher from the kitchen, though.
And with use of all that raw processing power, python clocked in at a respectable 19 minutes, 56 seconds. Or so I thought…
It appears somewhere along the way, one of the lines was not formatted properly and broke my script. Still, I think even with a little error checking, the execution time would still be close, so I leave correcting this up to the reader!
PowerShell
Before anyone suggests otherwise, I came into this leg expecting results to be paltry. In fact, when researching potential approaches using PowerShell to parse large files, I found this link which simply suggests, “Don’t.”
I came across some other mad person attempting to do what I sought to do via this link and decided this would be the basis of my test. Again, optimizations could be had, but the below I felt would be a fair approach:
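My script isn’t shown here, but a sketch of that StreamReader-style approach (the paths and the field kept are placeholders of my own, not the exact code from the link) runs along these lines:

```powershell
# Sketch only: stream the file with .NET readers instead of cmdlets
$reader = [System.IO.StreamReader]::new("C:\leaks\hibp_v2_topmillion.txt")
$writer = [System.IO.StreamWriter]::new("C:\leaks\plains_only.txt")
while ($null -ne ($line = $reader.ReadLine())) {
    # keep everything after the last ':' on the line
    $i = $line.LastIndexOf(":")
    if ($i -ge 0) { $writer.WriteLine($line.Substring($i + 1)) }
}
$reader.Close()
$writer.Close()
```

Timings came from wrapping the run in Measure-Command.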
The result on the smaller 1 million record file confirmed what I suspected and dreaded: PowerShell was not going to do this very fast. PS came in at a whopping 960.59 seconds! That works out to a shade over 16 minutes.
As for the impact to the CPU and RAM, PowerShell peaked at around 12–13% CPU utilization across all 8 cores. Judging from the fan noise, I assume one of the cores was near 100%. Yet, this PowerShell script didn’t seem to tie up RAM in any noticeable way.
For kicks and giggles (yes, the PG version), I wondered how a simple get-content | split-string would perform. Memory consumption was more noticeable at around 2 GB, but nowhere near the levels of the python attempt. The temperature in my office became noticeably warmer too, as yet again the fans whirred in their relentless efforts to keep the i7 happy. The PS “equivalent” to cat | cut clocked in around 21 minutes and 3 seconds.
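For reference, that naive pipeline was something along these lines (a sketch using the -split operator rather than the exact cmdlet I used, and the field kept is my guess):

```powershell
# Naive pipeline: stream lines with Get-Content, cut on ':' per line
Measure-Command {
    Get-Content .\hibp_v2_topmillion.txt |
        ForEach-Object { ($_ -split ':')[-1] } |
        Set-Content .\plains_only.txt
}
```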
Did I dare try to parse the large 27GB beast? Some say I did, and those say it is still running to this day, consuming as many resources while doubling as an office space-heater.
The reality of the situation is actually far worse. I have yet had an attempt finish, and quite frankly, I don’t see this happening anytime soon.
The first PowerShell script (using the .NET reader), while not very taxing on the RAM, just isn’t fast enough to finish before the inevitable heat-death of the universe. In theory, I should have completion in just over one solid week (21 minutes per 1M lines × 500 = 10,500 minutes = 175 hours ≈ 7.3 days), but between other science! and weekly patch reboots, it’s not looking likely.
But wait! There’s more! Get-Content | Split-String, much like my python implementation, grabs every spare byte of memory until the system nearly grinds to a halt.
The job(s) would not even last until morning. Sometime overnight, without fail in at least 4 unsuccessful tries, my laptop finally exclaims, ENOUGH! and reboots to rid itself of this parasitic processing scourge.
sed / sed -u
Using sed, I felt I needed to expand upon executing it with and without the unbuffered flag, -u. According to the man page, this flag will:
load minimal amounts of data from the input files and flush the output buffers more often
My (now known to be incorrect) hypothesis was that sed would be the quickest and most efficient way to parse and cut very large files. Upon observing a few test runs, I also noticed that sed without the -u flag may not write the file until completion, so I wanted to see if reading/writing piece by piece would be more or less “efficient”. Well, I’ll let you judge.
The trick with sed was figuring out the proper regex. With my trusty skills of just asking around, I landed on 's/.*:\(.*\)/\1/g'. Running this, sed clocks in around 11 seconds, and while it hits a single core of the CPU for 100%, it did not seem to register much use of RAM.
EDIT: As user crypticgeek pointed out in this reddit thread, the above regex would truncate passwords that contained “:” characters. A more correct regex is 's/[^:]*:\(.*\)/\1/g', or 's/.*?:\(.*\)/\1/g' as relayed by another reddit user, Grezzo82.
The -u flag seemed to slow down the process drastically, with the tradeoff being more frequent writes to the output file:
Peak CPU utilization again was 100% on a single core, and again, sed was sipping memory just as it had been without the -u flag.
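Put together, the buffered and unbuffered runs look like this (using the corrected regex from the edit above; the tiny sample file stands in for the real dump):

```shell
# Tiny stand-in for the 1M record file
printf 'aaa:password\nbbb:letmein\n' > sample.txt

# Buffered (default) run: output may not hit the disk until sed finishes
time sed 's/[^:]*:\(.*\)/\1/' sample.txt > plains.txt

# Unbuffered run: flushes output frequently, at a steep cost in speed
time sed -u 's/[^:]*:\(.*\)/\1/' sample.txt > plains_u.txt
```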
For the sake of my laptop and my sanity, I only ran the large 500M record file with plain sed. And like above, sed only seemed to tax a single CPU core and sip memory, while finishing in just under 45 minutes.
Custom C++ Application
While I initially thought writing my own cheesy c++ application would be a PITA (it’s a technical term), I very quickly realized I was not as prepared as I had hoped.
One of the things I took for granted was text encoding detection, as writing my own detection and forcing an encoding depending on file format was a challenge in its own right. In fact, when I created the top 1M record file using head, it saved as UTF-16 LE, even though the source 500M record file is ASCII. For the curious, here is a link to my abomination of code. I must warn you, I didn’t even attempt memory mapping; I just controlled the size of the chunks I was processing, as I speculated loading the entire file into memory at once might be a bad idea.
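The linked code is the real thing; purely as a conceptual sketch (line-buffered here rather than my actual fixed-size chunks, and the field kept is an assumption), the core of such a splitter boils down to:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Keep only the text after the last ':' on a record.
std::string extract_plain(const std::string& line) {
    const std::size_t pos = line.rfind(':');
    return pos == std::string::npos ? line : line.substr(pos + 1);
}

// Stream input to output, flushing every `flush_every` records so a
// crash partway through doesn't lose the whole run.
void trim_file(const std::string& in_path, const std::string& out_path,
               std::size_t flush_every = 100000) {
    std::ifstream in(in_path);
    std::ofstream out(out_path);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        out << extract_plain(line) << '\n';
        if (++count % flush_every == 0) out.flush();
    }
}
```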
Running the aptly named large_file_splitter yielded a finish time of 18 seconds on the 1M record file. Additionally, the tax on the CPU and RAM seemed quite low.
So, not bad; it seems somewhat comparable to sed. I also engineered it to write to the output file every 100,000 records, so there is a little crash redundancy built in. Parsing the monster 27 GB file, my severely under-optimized c++ application clocked in at 15,892 total seconds, or a little over 264 minutes.
Go (Golang)
Another take, this time from someone else, was to try this in Go. And when I pressed him, “why?”, he informed me that he felt it was indeed, “fast”. As my WSL was not set up for Go, I had to follow a short guide here to get started. Then came the learning curve. One would have hoped that “Hello World” would be the first milestone in my Go programming career, but alas, I wanted something more. With the power of my semi-efficient skim/speed reading, a few Stack Overflow articles, and maybe a glance at some Go documentation, I was off trying to wrestle the 1M record file.
Surprisingly, this Go script fared pretty well parsing the smaller 1M record file. Go clocked in around 7 seconds while consuming barely any memory, although CPU utilization on a single core (via top) did peak around 99%.
On the behemoth file, I had to make a few small edits to the script, as I didn’t spend enough time researching the proper way to detect ASCII vs. UTF-16 text encodings (yup, again fighting encodings). Nevertheless, the Go script clocked in just shy of an hour, while only taxing a single core like above.
Notepad++
Notepad++ has a hard cap of around 512 MB/2 GB according to this link. Despite this, I foolishly dared to attempt the unthinkable, and the 1M record file was set to run amok. I was blindly trying this, as I hadn’t even checked to see if Notepad++ could even do a find/replace with a regex (SPOILER: it can!).
Opening the 1M record file did take a few seconds, but using Notepad++’s search/replace feature with the regex (.*:), an empty replacement, and a click of “Replace All” trimmed the file down appropriately in roughly 3 minutes and 45 seconds.
Memory consumption seems to be all up front in the initial loading of the file, as Notepad++ didn’t budge much over the 1 GB it first grabbed to load the 1M record file.
I was hoping Notepad++ would at least try to load the monstrosity that is the full 500M line dump, cycle through until it hit the file size cap, and then either crash or exit, but (un)fortunately, I was immediately met with this message when trying to open it in the application:
Notepad++ engineers, if y’all need someone to come in and lend their expertise on dealing with 27 GB worth of text in a single file, let me know. I might, uh, know a guy… Kidding aside, kudos to the application for checking the file size before wasting resources!
Cat | Cut — Winner!
The astonishing surprise winner in this race was indeed cat | cut. I falsely assumed this would be the most inefficient manner to split such a large file: I thought that cat would read the entire file first, and then perform the cut on each line. Man, was I wrong (please don’t be too harsh on me, wise old Unix beards). cat | cut was not only the fastest, but was done before any consistent measurement could be taken of the CPU/memory impact:
Running cat | cut on the 500M record file (yup, that 27 GB one) was also alarmingly quick, coming in just under 2 minutes:
And I was able to capture some impact on the CPU / Ram:
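For reference, the winning pipeline amounts to the following (the demo file stands in for the dump; whether to keep field 2 onward or only the last field depends on the dump’s column order, so -f 2- is an assumption):

```shell
# Tiny stand-in for the 27GB dump
printf 'aaa:password\nbbb:letmein\n' > sample.txt

# cut streams its input, so neither command ever holds the whole file in RAM
time cat sample.txt | cut -d ':' -f 2- > plains.txt
```

Keeping -f 2- (rather than a single field) preserves any “:” characters inside the passwords themselves, the same fix the corrected sed regex makes.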
AWK — notable mention!
Another speedy suggestion was awk. On the top million leak, the command finished in under a second, just like cat | cut. I would presume that most of these common *nix commands are far faster than expected:
awk -F ':' '{print $2}' 515_have-i-been-pwned-v2_found_hash_algorithm_plain_topmillion.txt
Much as I assumed, awk performed a lot like cut, finishing the task in just under 2 minutes, taxing a single core but not registering much impact to RAM.
awk would be a great alternative, as it might make it easier to move columns around or to add characters to the trimmed string as needed. The time difference between awk and cat | cut is small enough that external processing loads could sway the results in favor of one or the other.
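As a quick illustration of that flexibility, reordering columns or decorating the output is a one-line change in awk (the field numbers here are illustrative):

```shell
# Swap the field order and insert a comma between the two fields
printf 'aaa:password\nbbb:letmein\n' | awk -F ':' '{print $2 "," $1}'
# prints "password,aaa" then "letmein,bbb"
```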
Conclusion
Although my laptop may not agree, this was a fun little project. The built-in Linux utilities cut and awk were surprisingly quick and efficient, python and golang beat my expectations, and I presume my CS112 professor would be pretty upset with the performance of my custom c++ application. In the end, I realized that I already had a trimmed version of this exact password list elsewhere on my SSD, but at least I learned something, right? Maybe?
Maveris is an IT and cybersecurity company committed to helping organizations create secure digital solutions to accelerate their mission. We are Veteran-owned and proud to serve customers across the Federal Government and private sector. Maveris Labs is a space for employees and customers to ask and explore answers to their burning “what if…” questions and to expand the limits of what is possible in IT and cybersecurity. To learn more, go to maveris.com/#maveris-labs.