Breached data: stats and graphical representation (haveibeenpwned source)

Andrea Gigante
7 min read · Apr 19, 2019

If you are only interested in the graphical representation of the stats, here is your summary (based on data from 15 April 2019):

Stats and graphical representation of breached data (from HIBP)

If you want more detail on the summary, where I got the data from, or why I even did it, keep reading.

Highlights

  • Since July 2013, there has been at least one confirmed breach every month.
  • On average, there has been almost 1 breach per week since July 2013.
  • From 2014 to 2017, the average number of accounts compromised per breach kept increasing.
  • The average time between a breach and its addition to HIBP has declined to 43 days.
  • Four months into 2019, the number of breached accounts has already reached 86.40% of the total for 2018, the year with the most breached accounts so far.

Which website did I gather this information from?

I am pretty sure everybody is already aware of Have I Been Pwned?, but here is a short summary, thanks to Wikipedia:

Have I Been Pwned? (HIBP, with “Pwned” pronounced like “poned”) is a website that allows internet users to check if their personal data has been compromised by data breaches. (Source: Wikipedia)

The website offers other features as well, but they are outside the scope of this article.

The summary in detail

First of all, what is a breach?
A “breach” is an instance of a system having been compromised by an attacker and the data disclosed.

How many breaches did I take into account, and how far back do they go?
As mentioned, I used the information freely available on HIBP. At the time of writing (15 April 2019), it lists 359 breaches, the oldest of which dates from 2007.

A view of the data

For each breach we have information like:

  • Date of breach
  • Date added to HIBP
  • Total number of compromised accounts
  • List of compromised data types (e.g. email addresses, IP addresses, names, passwords)
  • Boolean flags indicating whether the breach is verified, fabricated, sensitive or a spam list
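To make the shape of one record concrete, here is a minimal sketch of how such an entry could be modelled in Python. The key names (Name, BreachDate, AddedDate, PwnCount, DataClasses and the Is... flags) follow the breach model as I understand the HIBP API to publish it, so treat them as an assumption; the Breach class and the sample values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Tuple


@dataclass(frozen=True)
class Breach:
    """One breach record (hypothetical helper class, not part of HIBP)."""
    name: str
    breach_date: date               # date of breach
    added_date: datetime            # date added to HIBP
    pwn_count: int                  # total number of compromised accounts
    data_classes: Tuple[str, ...]   # compromised types of data
    is_verified: bool
    is_fabricated: bool
    is_sensitive: bool
    is_spam_list: bool

    @classmethod
    def from_json(cls, raw: dict) -> "Breach":
        # Assumed formats: BreachDate as YYYY-MM-DD, AddedDate as an ISO
        # timestamp ending in "Z".
        return cls(
            name=raw["Name"],
            breach_date=date.fromisoformat(raw["BreachDate"]),
            added_date=datetime.strptime(raw["AddedDate"], "%Y-%m-%dT%H:%M:%SZ"),
            pwn_count=int(raw["PwnCount"]),
            data_classes=tuple(raw["DataClasses"]),
            is_verified=bool(raw["IsVerified"]),
            is_fabricated=bool(raw["IsFabricated"]),
            is_sensitive=bool(raw["IsSensitive"]),
            is_spam_list=bool(raw["IsSpamList"]),
        )


# Made-up record, not a real breach.
SAMPLE_RECORD = {
    "Name": "ExampleSite",
    "BreachDate": "2018-01-01",
    "AddedDate": "2018-02-13T10:00:00Z",
    "PwnCount": 1234,
    "DataClasses": ["Email addresses", "Passwords"],
    "IsVerified": True,
    "IsFabricated": False,
    "IsSensitive": False,
    "IsSpamList": False,
}

example = Breach.from_json(SAMPLE_RECORD)
```

Parsing the dates up front makes later steps, such as counting the days between breach and publication, a simple subtraction.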

Everything that follows is a simple graphical representation of the data described above, as published by HIBP on 15 April 2019.

Data in detail

Venn diagram displaying the relations between verified, spam list and sensitive breaches
  • 359 breaches in total
    - 330 verified breaches
    - 11 spam lists
    - 25 sensitive sites
  • Oldest breach: July 2007
  • First breach added to HIBP: 2013
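The numbers behind a three-circle Venn diagram like this one can be derived by counting each combination of the three flags. The sketch below uses only the standard library; the sample records are made up, and the key names IsVerified, IsSpamList and IsSensitive are assumed to match the HIBP breach model. The seven-value ordering returned by venn_regions is the one I understand matplotlib-venn's venn3 to accept for its subsets argument, which is worth checking against its documentation before relying on it.

```python
from collections import Counter


def flag_combinations(breaches):
    """Count breaches per (verified, spam list, sensitive) combination."""
    return Counter(
        (bool(b["IsVerified"]), bool(b["IsSpamList"]), bool(b["IsSensitive"]))
        for b in breaches
    )


def venn_regions(breaches):
    """Return the seven region sizes for circles A=verified, B=spam list, C=sensitive.

    Order: (A only, B only, A&B only, C only, A&C only, B&C only, A&B&C).
    """
    c = flag_combinations(breaches)
    return (
        c[(True, False, False)],
        c[(False, True, False)],
        c[(True, True, False)],
        c[(False, False, True)],
        c[(True, False, True)],
        c[(False, True, True)],
        c[(True, True, True)],
    )


# Made-up records, for illustration only.
SAMPLE_FLAGS = [
    {"IsVerified": True, "IsSpamList": False, "IsSensitive": False},
    {"IsVerified": True, "IsSpamList": False, "IsSensitive": False},
    {"IsVerified": True, "IsSpamList": True, "IsSensitive": False},
    {"IsVerified": True, "IsSpamList": False, "IsSensitive": True},
    {"IsVerified": False, "IsSpamList": False, "IsSensitive": False},
]

regions = venn_regions(SAMPLE_FLAGS)
outside_all_circles = flag_combinations(SAMPLE_FLAGS)[(False, False, False)]
```

Breaches with none of the three flags set fall outside every circle, so they are counted separately.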

From here on, the results are filtered: I only took into consideration verified breaches that are not spam lists.
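A filter of this kind is a one-line predicate. The following is a minimal sketch with made-up records; the IsVerified and IsSpamList keys are assumed to match the HIBP breach model, and keep_for_stats is a hypothetical helper name.

```python
def keep_for_stats(breach: dict) -> bool:
    """Keep verified breaches that are not spam lists."""
    return bool(breach["IsVerified"]) and not bool(breach["IsSpamList"])


# Made-up records, for illustration only.
SAMPLE_BREACHES = [
    {"Name": "SiteA", "IsVerified": True, "IsSpamList": False},
    {"Name": "SpamDump", "IsVerified": True, "IsSpamList": True},
    {"Name": "Rumour", "IsVerified": False, "IsSpamList": False},
]

filtered = [b for b in SAMPLE_BREACHES if keep_for_stats(b)]
filtered_names = [b["Name"] for b in filtered]
```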

  • 319 breaches in total that are verified and are not spam lists
  • About 3.8 billion accounts compromised since 2007 (exact value: 3,817,182,724)
  • 2016 was the year with the highest number of compromised sites (71)
Total of compromised sites per year
  • 2018 was the year with the highest number of compromised accounts
Total of compromised accounts per year
  • Four months into 2019, the number of breached accounts has already reached 86.40% of the total for 2018, the year with the most breached accounts so far
  • On average, almost 12 million accounts are compromised per breach (3,817,182,724 accounts / 319 breaches ≈ 11.97 million)
  • The largest breach has 763,117,241 compromised accounts
  • From 2014 to 2017, the average number of accounts compromised per breach kept increasing
Average accounts compromised per breach per year
  • Since July 2013, there has been at least one confirmed breach every month
Histogram of total breaches per month
  • Since the Nexus Mods breach on 22 July 2013, there have been 278 breaches in total, which means almost 1 breach per week on average (2013-07-22 to 2019-04-15 spans 299 weeks, giving 0.93 breaches per week)
  • The average time elapsed between the breach date and the date added to HIBP has dropped over the last few years and currently stands at 43 days (regarding the averages for years before 2013, please bear in mind that the first site was added to HIBP in November 2013).
Average elapsed time between breach date and added in HIBP
  • On average, 5 fields are compromised per breach (mean 5.3, median 4)
Amount of fields compromised per breach
  • The highest number of data fields exposed in a single breach is 25 (e.g. Dates of birth, Drinking habits, Drug habits, Email addresses, Genders, Geographic locations, Names, Parenting plans, Passwords, Religions, Sexual fetishes)
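Two of the figures in this list can be recomputed directly from the numbers quoted in it: the average accounts per breach (total accounts divided by the 319 filtered breaches) and the breaches-per-week rate since the Nexus Mods breach. The short check below only uses values stated above; the variable names are my own.

```python
from datetime import date

TOTAL_ACCOUNTS = 3_817_182_724      # exact total quoted in the list
FILTERED_BREACHES = 319             # verified, non-spam-list breaches
BREACHES_SINCE_NEXUS_MODS = 278
NEXUS_MODS_BREACH = date(2013, 7, 22)
SNAPSHOT = date(2019, 4, 15)

avg_accounts_per_breach = TOTAL_ACCOUNTS / FILTERED_BREACHES
elapsed_days = (SNAPSHOT - NEXUS_MODS_BREACH).days
elapsed_weeks = elapsed_days / 7
breaches_per_week = BREACHES_SINCE_NEXUS_MODS / elapsed_weeks

print(f"{avg_accounts_per_breach:,.0f}")   # 11,966,090
print(elapsed_days, elapsed_weeks)         # 2093 299.0
print(round(breaches_per_week, 2))         # 0.93
```

The average works out to roughly 12 million accounts per breach, and the 299 weeks and 0.93 breaches per week match the figures in the list.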

Top 5s

Top 5 breaches by number of fields, total accounts and age

As a user, how can I protect myself from a possible breach?

As users, we cannot prevent a company from losing our private information, but we can proactively guard against the re-use of our passwords and monitor our real and virtual lives with a few simple measures:

  • protect our virtual identity: sign up with sites that alert us if our data has been exposed in a breach (e.g. Have I Been Pwned, SpyCloud).
  • protect our passwords: use a password manager to keep a different password for every site, so that a leaked password cannot be reused against us elsewhere (e.g. Dashlane, KeePassX, LastPass, 1Password).
  • protect our real-life identity: if the country we live in offers a credit report monitoring service, I would suggest signing up for it (e.g. Experian or Noddle in the UK).

As a company, what should I take into consideration?

Below you can find an article I wrote about signup/login security practices, with what I believe are good suggestions and ideas to take into account.

Why did I do it?

For personal development and out of curiosity, I was playing around with Python and a few modules I wanted to get more experience with.

I needed sample data and, because I have a personal interest in security, I decided to use the freely available HIBP information to play around with.

How did I do it?

As mentioned, everything started as a personal project. I used the HIBP API to retrieve all breaches in the system in JSON format. The aim was to process the data automatically and generate a graphical representation of the information retrieved.
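As a rough illustration of that retrieval step (not the code used for this article), the sketch below downloads the full breach list with the standard library only. It assumes the current public v3 "breaches" endpoint, which postdates this article, and assumes that the endpoint needs no API key but expects a User-Agent header; both points should be checked against the HIBP API documentation. The function names and the user agent string are hypothetical.

```python
import json
import urllib.request

# Assumption: the public v3 endpoint that returns every breach in the system.
BREACHES_URL = "https://haveibeenpwned.com/api/v3/breaches"


def build_request(user_agent: str = "breach-stats-example") -> urllib.request.Request:
    """Prepare the GET request; the user agent value is a placeholder."""
    return urllib.request.Request(BREACHES_URL, headers={"User-Agent": user_agent})


def fetch_breaches(timeout: float = 30.0) -> list:
    """Download and decode the breach list (performs a network call)."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (not executed here because it needs network access):
#     breaches = fetch_breaches()
#     print(len(breaches))
```

Keeping the request construction separate from the download makes it possible to inspect or test the request without touching the network.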

Before creating the visual representation, I converted the JSON into CSV and explored the data manually to see what would be most interesting.
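A JSON-to-CSV conversion along these lines might look like the sketch below. It flattens each record into one row and adds a few derived columns that are handy in a spreadsheet (year of breach, days between breach and addition to HIBP, number of exposed fields). The column names are my own choice, the sample record is made up, and the input keys are assumed to match the HIBP breach model.

```python
import csv
import io
from datetime import date

FIELDNAMES = [
    "name", "breach_date", "added_date", "year",
    "days_to_hibp", "pwn_count", "field_count", "data_classes",
]


def to_row(breach: dict) -> dict:
    """Flatten one breach record into a CSV-friendly row."""
    breach_day = date.fromisoformat(breach["BreachDate"])
    # AddedDate is assumed to be an ISO timestamp; only the date part is used.
    added_day = date.fromisoformat(breach["AddedDate"][:10])
    return {
        "name": breach["Name"],
        "breach_date": breach_day.isoformat(),
        "added_date": added_day.isoformat(),
        "year": breach_day.year,
        "days_to_hibp": (added_day - breach_day).days,
        "pwn_count": breach["PwnCount"],
        "field_count": len(breach["DataClasses"]),
        "data_classes": "; ".join(breach["DataClasses"]),
    }


def write_csv(breaches, handle) -> None:
    """Write all breaches to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for breach in breaches:
        writer.writerow(to_row(breach))


# Made-up record, for illustration only.
SAMPLE = [{
    "Name": "ExampleSite",
    "BreachDate": "2018-01-01",
    "AddedDate": "2018-02-13T10:00:00Z",
    "PwnCount": 1234,
    "DataClasses": ["Email addresses", "Names", "Passwords"],
}]

buffer = io.StringIO()
write_csv(SAMPLE, buffer)
csv_lines = buffer.getvalue().splitlines()
```

To write a real file, the same function can be given a handle opened with open("breaches.csv", "w", newline="").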

What tools did I use?

You may have recognised some of the tools behind the graphics:

  • for charts, I used Google Sheets
  • to manage the data, of course, I used pivot tables
  • the Venn diagram was created manually, but only after playing a bit with matplotlib-venn (by the way, I love it! ;) )
  • the one-page summary was created with Gliffy
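For readers who prefer code to spreadsheets, the per-year pivot tables behind the charts can be approximated in a few lines of Python. This is only a sketch under the assumption that each record carries BreachDate (YYYY-MM-DD) and PwnCount keys; the sample data is made up and per_year_summary is a hypothetical helper.

```python
from collections import defaultdict


def per_year_summary(breaches):
    """Group breaches by year: count, total accounts and average per breach."""
    totals = defaultdict(lambda: {"breaches": 0, "accounts": 0})
    for breach in breaches:
        year = int(breach["BreachDate"][:4])
        totals[year]["breaches"] += 1
        totals[year]["accounts"] += breach["PwnCount"]
    return {
        year: {**t, "avg_accounts": t["accounts"] / t["breaches"]}
        for year, t in sorted(totals.items())
    }


# Made-up records, for illustration only.
SAMPLE_YEARS = [
    {"BreachDate": "2016-03-01", "PwnCount": 100},
    {"BreachDate": "2016-09-15", "PwnCount": 300},
    {"BreachDate": "2018-05-20", "PwnCount": 1000},
]

summary = per_year_summary(SAMPLE_YEARS)
```

Each entry of the result corresponds to one row of a pivot table keyed by year, with the three values used in the per-year charts.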

Please excuse me if the graphical representation is not that clean: I am neither a graphic designer nor a developer.

If you are interested, feel free to improve the designs or the information I am displaying; I would appreciate a mention.

What will I do next?

As mentioned, all of this was and still is a personal project. I am planning to create a small website that automatically generates this information and updates it periodically.

I will share the code on GitHub and would appreciate any feedback. As mentioned, I am not a developer, so any improvement is welcome. (I will add the GitHub link when possible.)

Disclosure

If you represent a company and you are not happy that your name appears in this graphical representation/summary, or in this article at all, please feel free to contact me. Note, however, that I am simply publishing data that is already available (and which I did not gather personally), so I would suggest contacting HIBP directly, as it is the source of the information and data.
As a reminder, this is a snapshot taken on 15 April 2019.

Andrea Gigante

Agile practitioner, security fanatic, coffee addict, sci-fi fan, chess lover, Linux/Android user, Shorinji Kempo enthusiast. https://www.skytale.it