Password Analysis
In late 2016, two databases containing millions of credentials were made public: the AntiPublic Combo List and the Exploit.in Combo List. These databases contain combinations of usernames or email addresses and passwords from different online services.
BinaryEdge presented me with the chance to do a full analysis of the content of these databases. The idea was to gather all the passwords present in both lists and analyse them in order to extract significant statistics.
This was a particularly challenging project for me as I had never done password analysis before.
Although the analysis has already been described in BinaryEdge’s blogpost, this post is more focused on the processes I used when creating visualisations to present the statistics found. I’ll take a couple of examples from the many that were presented there.
So, first things first, I did some research on what was known about these lists. At the time (early 2017), not much was known about them (apart from the content) or where they had come from.
My first thought was “where do I even start?”. After checking the size of the files I was dealing with, I did a basic analysis to check how many lines the files had, how many of those were unique, and how many email addresses, passwords and usernames were present in these lists. This was the easy part. I then went on to analyse emails and passwords separately.
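The basic counts described above can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the original analysis: it assumes each line has the form identifier:password (a common combo list layout, but an assumption here), and the function name and sample lines are made up for illustration. For files with hundreds of millions of lines, keeping everything in in-memory sets may not be practical.

```python
def summarise_combo_lines(lines, sep=":"):
    """Count total lines, unique lines, and unique emails/usernames/passwords.

    Assumes an 'identifier<sep>password' layout, which is an assumption
    about the file format rather than something stated in the post.
    """
    total = 0
    unique_lines = set()
    emails, usernames, passwords = set(), set(), set()

    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        total += 1
        unique_lines.add(line)

        # Split on the first separator only, since passwords may contain it.
        identifier, found, password = line.partition(sep)
        if not found:
            continue  # malformed line: counted, but not split further

        if "@" in identifier:
            emails.add(identifier.lower())
        else:
            usernames.add(identifier)
        passwords.add(password)

    return {
        "total_lines": total,
        "unique_lines": len(unique_lines),
        "unique_emails": len(emails),
        "unique_usernames": len(usernames),
        "unique_passwords": len(passwords),
    }


# Made-up sample lines, for illustration only.
sample = [
    "a@yahoo.com:123456",
    "a@yahoo.com:123456",
    "bob:hunter2",
    "",
    "c@gmail.com:123456",
]
summary = summarise_combo_lines(sample)
print(summary)
```

With a real file, the same function can be fed an open file handle instead of a list, since it only iterates over lines.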
Although the analyses on both databases were done independently, I wanted to present the results together — the way I thought this worked best was by using colours that represented each one, maintaining consistency throughout the visualisations.
Email addresses
Analysing email addresses wasn’t part of the original assignment but, out of curiosity, I ended up analysing them as well. This analysis gave me information on domains and TLDs (top-level domains, such as .com).
The figure below represents the domains present in the databases analysed. For me, it was quite important to emphasise the most prominent numbers (the yahoo numbers in this case) and, once again, I thought that colour would be a good choice here, as it draws readers to that box instantly.
One of the difficulties I had with the results was that they were quite skewed, which means they wouldn’t look good in a regular plot: the smallest values wouldn’t be readable. One alternative would be to use a scale break (a broken y-axis), or to plot the data twice, with the largest values in one plot and the smallest in another. For me, it made more sense to use percentages instead of the raw values, as this makes them easier to compare, and also to create an alternative representation, as seen below.
A small detail that I think makes all the difference is that little black and white box at the top. I always add something that guides readers and makes it easier for them to understand what they’re looking at. In this case, the box lets you know that the percentages you’re seeing refer to the domains in the complete database.
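As a rough illustration of how domain and TLD counts can be obtained, here is a minimal sketch. It treats everything after the last @ as the domain and the last dot-separated label as the TLD; the function name and the sample addresses are hypothetical, and real-world data would likely need stricter validation.

```python
from collections import Counter


def domain_and_tld_counts(emails):
    """Return (domain counts, TLD counts) for an iterable of email addresses."""
    domains, tlds = Counter(), Counter()
    for email in emails:
        # rpartition splits on the LAST "@", so odd local parts don't break it.
        local, at, domain = email.strip().lower().rpartition("@")
        if not at or not local or "." not in domain:
            continue  # skip anything that doesn't look like an address
        domains[domain] += 1
        # The TLD is the last label of the domain, e.g. ".com".
        tlds["." + domain.rsplit(".", 1)[1]] += 1
    return domains, tlds


# Made-up addresses, for illustration only.
emails = ["a@yahoo.com", "B@Yahoo.com", "c@gmail.com", "d@mail.ru", "broken"]
domains, tlds = domain_and_tld_counts(emails)
print(domains.most_common(3))
print(tlds.most_common(3))
```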
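Converting skewed counts into percentages of the total is simple to do in code. The sketch below is only an illustration of that step: the function name is hypothetical and the counts are invented, not the real figures from the combo lists.

```python
def to_percentages(counts, ndigits=1):
    """Convert a {label: count} mapping into {label: percentage of total},
    ordered from the largest share to the smallest."""
    total = sum(counts.values())
    if total == 0:
        return {}
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {label: round(100 * count / total, ndigits) for label, count in ordered}


# Invented counts, for illustration only.
domain_counts = {"gmail.com": 300, "yahoo.com": 600, "other": 100}
shares = to_percentages(domain_counts)
print(shares)
```

Percentages put both databases on the same 0 to 100 scale, which makes them directly comparable even when the databases differ in size.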
I actually hadn’t looked at these figures since we published them. While writing this post, I took the time to look at them again and came to the conclusion that, if I were doing this today, I wouldn’t do it in exactly the same way.
For instance, I remember thinking at the time that such a high number of yahoo.com email addresses (the box on the right) should be mentioned here as it was an important piece of information. However, I realise now that this could have been included in the blogpost itself and didn’t necessarily have to be in the figure.
PACK (Password Analysis and Cracking Kit), a great tool for extracting statistics on passwords, was used to analyse password lengths. The result was a CSV file with the length of the passwords and the number of times each length was found. Then, I loaded this data into Microsoft Excel and created the plot below.
In this specific case, I chose a simple tool such as Excel because I knew I wanted to add a couple more details to my design afterwards. So, by loading the data into Excel, I just had to draw a basic plot and export it.
Please note that plots created with any tool, be it Microsoft Excel, R or Python, could and should look a lot better than the one above: they should at least have a title and labels on both axes. In this specific case, I knew I wanted to add a couple more details to it, so I went with the most basic plot. The figure below is the final design.
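To make the shape of that data concrete, here is a minimal plain-Python stand-in that produces a length/count table of the same kind. It is not PACK and does not reproduce PACK's output format; the function names, the column headers and the sample passwords are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def length_distribution(passwords):
    """Count how many passwords there are of each length."""
    return Counter(len(p) for p in passwords)


def write_length_csv(dist, out):
    """Write 'length,count' rows, sorted by length, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["length", "count"])  # header names are an assumption
    for length in sorted(dist):
        writer.writerow([length, dist[length]])


# Made-up passwords, for illustration only.
dist = length_distribution(["123456", "password", "qwerty", "abc"])
buffer = io.StringIO()
write_length_csv(dist, buffer)
print(buffer.getvalue())
```

A file written this way can be opened directly in Excel, R or any other plotting tool.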
The examples I’ve been showing you (in this and previous blogposts) have been drawn in Adobe Illustrator. However, a good idea is to do the data analysis in R (for example) and create a basic plot there. Then you can export that plot and clean it up in Adobe Illustrator. This is what I did when analysing the length of the passwords present in the combo lists.
My thought here was to include some notes on the most relevant data points in order to make it simpler to compare the results of both databases. It isn’t necessary to add these details all the time, but in this case it made sense as we were comparing two different databases and the plots had very similar shapes. Without the labels on the most relevant data points, I think it would be very difficult to compare both graphs.
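Choosing which data points to label is ultimately a design decision, but a simple rule can give a starting list. The sketch below picks the most frequent lengths in each distribution and reports their share of the total. This is one possible rule, not necessarily how the points in the figure were chosen, and the function name and numbers are invented.

```python
def points_to_label(dist, k=2):
    """Return the k most frequent lengths as (length, count, percentage)."""
    total = sum(dist.values())
    if total == 0:
        return []
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(length, count, round(100 * count / total, 1)) for length, count in top]


# Invented length distributions for two databases, for illustration only.
database_a = {6: 50, 8: 30, 10: 20}
database_b = {6: 20, 8: 70, 10: 10}

labels_a = points_to_label(database_a)
labels_b = points_to_label(database_b)
print(labels_a)
print(labels_b)
```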
Conclusion
In summary, every visualisation should be independent from the others (each one has its own colour guide) and should contain all the information the reader needs to interpret it without having to constantly go back. In the end, though, the most important idea is to make it as easy as possible for someone to interpret your design.