Deep dive into the Edmodo data breach

Analyzing the largest breach of children’s data exposed so far

Published in @4iQ, Jun 5, 2017

In April 2017, a data dump containing 11.7 GB of data on more than 77 million unique Edmodo users was exposed in underground communities of the deep and dark web.

Edmodo is an educational technology company offering communications, collaboration and coaching tools to K-12 schools and teachers. The Edmodo network enables teachers to share content, distribute quizzes, assignments, and manage communication with students, colleagues and parents.

My friend Julio’s son, his friends and most of the kids in his school were exposed in this breach. Their names are easily derived from the username fields. The school uses Google for Education, which protects users from receiving external emails, so these students and their parents did not get notifications about the breach. Julio estimates, based on the number of students and teachers at his local school, that at least two-thirds of the accounts could belong to students, with the rest belonging to teachers, parents and other adults.

Millions of children’s personal records were exposed in the Anthem breach, and over 6.4 million children were exposed in the VTech hack. Even so, if we use Julio’s rough estimate of a child:adult ratio of 2:1, we believe this is the largest breach of children’s data, with at least 50 million usernames and 29 million emails exposed.
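The 50 million and 29 million figures appear to follow from applying the two-thirds share to the totals reported in this post. Here is a minimal sketch of that arithmetic in Python, assuming the 2:1 ratio holds across the whole dump (the function name is ours, purely for illustration):

```python
# Totals reported in this post for the Edmodo dump.
UNIQUE_USERNAMES = 77_010_322
EMAILS = 43_966_537


def estimated_child_share(total: int, children: int = 2, adults: int = 1) -> int:
    """Scale a total by an assumed child:adult ratio (2:1 is Julio's rough estimate)."""
    return total * children // (children + adults)


child_usernames = estimated_child_share(UNIQUE_USERNAMES)
child_emails = estimated_child_share(EMAILS)

print(f"{child_usernames:,} usernames, {child_emails:,} emails")
```

Both results land just above the rounded "at least" figures quoted in the text.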

According to Mollie Carter, VP of Marketing & Adoption, Edmodo learned about the security incident on the 10th of May. Their team has remediated the breach, and Edmodo was quick to notify users by the 12th of May.

In this post we take you through the 4iQ research methodology to analyze the breach, evaluate the authenticity of the data and assess the risk to Edmodo users:

Research Incident — what, when, who, why, how
Investigate Authenticity
Evaluate Risk

Highlights

The Edmodo Breach exposes data on 77,010,322 unique users. Based on the analysis described in the following sections, we believe:

The data is Real. It was confirmed by the company and Edmodo users.
It’s Fresh, meaning it was recently hacked.
It’s Big — at an estimated 50 million usernames, the largest exposure of children’s accounts.
Probably the result of an intrusion and then exfiltration of data from the Edmodo network.
Notification by Edmodo was fast but didn’t reach all users.
Since passwords were protected using the strong bcrypt hashing function, the risk of account takeover is medium.

Millions of K-12 students have had their usernames and emails exposed for the first time, and in many cases, it is easy to figure out the full name of these children. Although Edmodo notified users within the Edmodo service and quickly sent notification emails, some schools that restrict incoming emails — like the one Julio’s kids attend — have not received the notification.

Luckily, the company did a good job protecting passwords with strong bcrypt hashing. This means that most attackers will not invest the time and effort required to crack the hashes and will probably go on to target less protected accounts exposed in other breaches.

Step 1: What, When, Who, Why, How?

On May 12, 2017, a first reference was published about a possible hack of Edmodo.

Several posts were found later where the data was being sold on the Black Market:

And a first sample was shared:

On May 17th, the company acknowledged a security incident and warned its users by email:

A number of sources wrote about the breach, including Motherboard, Graham Cluley and Infosecurity Magazine, and a number of users confirmed on Twitter that they had received the Edmodo email:

1.1) What was exposed?

The Edmodo data dump is a single 11.7 GB file with 77,039,963 lines and 77,010,322 unique user accounts, containing:

Usernames: 77,010,322
Emails: 43,966,537
Hashed Passwords and Salts: 75,626,136

We estimate at least 50 million of the exposed usernames belong to young students, with the rest associated with teachers, parents and other adults.
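To put these counts in proportion, the short Python sketch below derives a few ratios from the numbers listed above. The variable names are ours, and the interpretation of the extra lines (duplicates or non-record lines) is an assumption, not something the dump documents.

```python
# Counts reported above for the Edmodo dump.
LINES = 77_039_963
UNIQUE_USERS = 77_010_322
EMAILS = 43_966_537
HASHED_PASSWORDS = 75_626_136

# Lines that do not add a new unique account
# (assumed to be duplicates or non-record lines).
extra_lines = LINES - UNIQUE_USERS

# Share of unique accounts that carry each field.
email_share = EMAILS / UNIQUE_USERS
password_share = HASHED_PASSWORDS / UNIQUE_USERS

print(f"extra lines: {extra_lines:,}")
print(f"accounts with an email: {email_share:.1%}")
print(f"accounts with a hashed password: {password_share:.1%}")
```

In other words, a little over half of the accounts include an email address, while nearly all include a password hash.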

1.2) How did it happen?

The data dump starts with two attack functions in Python that the hackers may have used to attack the site. One of them installs a malicious file “xd.exe”, renames it “api.exe”, and is probably a bootloader:

The IP address is known to have launched several web attacks, some of which were captured by network IDS signatures shown in the AlienVault OTX database:

There is also a trace of the execution of the exfiltration, with data sent to IP 80.82.77.46 and port 666:

1.3) When did the breach happen?

Although Edmodo has not confirmed exactly when the breach occurred, we know that the data was available for sale in April 2017 according to this black-market ad.

1.4) Summary Stats

Step 2: Investigate Authenticity

A lot of the information that floats around dark web communities is fake; some dumps combine real but old data from previous breaches (called a Combo List), and others contain random fabricated data. When there are rumors of a data breach, bad actors often try to take advantage of interest in the breach and sell fake data. Data is also, at times, misattributed. For example, one breach with Tumblr data was misattributed to Dropbox.

We used several tests to validate and verify the authenticity of the data.

2.1) Collision — have we seen this data before?

An important way to check for fake data is to compare it with data exposed in prior breaches. We compare this data by matching, or calculating “collisions” on, as many attributes of each identity as possible. For example, if we had a more complete PII data set with addresses, phone numbers, etc., we would have used an identity resolution algorithm to look for collisions. With credentials, we like to find and compare matching pairs of usernames and passwords. But in this case, since Edmodo passwords are hashed with the strong bcrypt algorithm, we were only able to run a username collision check.

A high collision rate indicates a combo list or mashup of data from previous breaches, while a low collision rate indicates a higher probability that the data is fake and fabricated. We are looking for a collision score consistent with the kind of service the data was taken from.

When we compare identities with the more than 5 billion we have indexed, we normally see collision rates of over 60%. In this case we saw a username-collision index of 19%, meaning we have seen only 19% of these usernames in other breaches:

This surprisingly low username-collision rate of 19% indicates that either a big portion of the data is fake and consists of random, non-existent, fabricated emails, or the set of usernames has not appeared in a previous breach. Since Edmodo is an education site used by young students (children) who do not often use Dropbox, LinkedIn and other online services that have been breached, the low rate makes sense in this context.
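The username collision check described above can be sketched in a few lines of Python. This is an illustration only: the function names, the lowercase/trim normalization and the in-memory set are our assumptions, and an index of billions of identities would need a very different storage layer.

```python
from typing import Iterable, Set


def normalize(username: str) -> str:
    """Normalize a username so trivial variations still collide (assumed rule)."""
    return username.strip().lower()


def collision_rate(dump_usernames: Iterable[str], known_usernames: Set[str]) -> float:
    """Fraction of unique usernames in a dump already seen in earlier breaches."""
    unique = {normalize(u) for u in dump_usernames if u.strip()}
    if not unique:
        return 0.0
    known = {normalize(u) for u in known_usernames}
    return len(unique & known) / len(unique)


# Toy data: one of five usernames was seen before, so the rate is 20%.
previously_seen = {"alice"}
dump = ["Alice", "bob ", "carol", "dave", "erin"]
print(f"{collision_rate(dump, previously_seen):.0%}")
```

A rate near the usual 60% or higher would point toward recycled data, whereas a figure like the 19% seen here has to be read against the kind of users the service attracts.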

2.2) Confirmation by Users

The best way to validate the authenticity of data, and the hardest, is to check it with real users. We contacted two dozen users from three different groups:

Twitter users who said that their Edmodo account was hacked or that they received the breach notification email from Edmodo.
Edmodo users including children in our close circle of friends.
Other users with emails found in the data dump.

A surprising 100% of the users that we contacted confirmed that they were active users, giving us very high confidence that the data is real. Just to give an idea, the probability of getting 6 out of 6 consecutive positive results using random emails (the third group listed above) is as low as:

0.00000000000000000000000000000000000000000000000636%
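The post does not spell out the assumptions behind this figure. As a generic illustration of why consecutive hits are so unlikely with fabricated data, the Python sketch below multiplies a per-email chance of hitting a real, active user across independent picks. The per-email probability used here is a made-up placeholder, not a value from our analysis.

```python
def prob_all_hits(p_single: float, trials: int) -> float:
    """Probability that `trials` independent picks all hit a real, active user."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if trials < 0:
        raise ValueError("trials must be non-negative")
    return p_single ** trials


# Hypothetical: if a fabricated email had a 1-in-1,000 chance of belonging
# to a real active user, six hits in a row would be vanishingly rare.
print(prob_all_hits(0.001, 6))
```

The smaller the chance that a random email belongs to a real user, the faster the combined probability collapses as the number of confirmations grows.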

Note that a 100% confirmation rate also provides some insight into the age of the breach, since in old breaches some users may have deleted their accounts.

2.3) Confirmation by the Community and Source

The data breach has been confirmed by the company, media and users and we found several ads in Black Markets.

2.4) Format Consistency — do data sets match?

Consumer internet services and large online portals have a typical distribution of domain names and passwords, where we see many of the most popular passwords repeated over and over again and a large percentage of passwords that include the name of the portal or service. This pattern is different in fake and combo lists. Since passwords in this breach were hashed, we were not able to analyze the format of usernames and passwords.

2.5) Authenticity Scorecard

Step 3: Risk Evaluation

The usernames and passwords stolen in this type of attack are used for credential stuffing, which has been trending up in the last few months: bad actors automatically test stolen passwords and try to log in to other services (file sharing, e-commerce, online banking, etc.) unrelated to Edmodo.

Just put yourself in the shoes of a criminal: why break down the door when you have the keys to the kingdom? Automatically testing usernames and passwords is much easier than creating targeted attacks or crafting complex zero-day malware.

The real problem is that we are human: we re-use passwords! People have, on average, 30 different online accounts but use only 3–4 different passwords. Since we cannot remember so many passwords, we re-use them all the time.

We typically use an easy password for occasional access to unimportant sites, medium level passwords for services like email and social media, and high level passwords for access to important sites holding our confidential information, or finance sites such as online banking.

Some sites force regular password changes, but many sites do not expire passwords and we do not change them ourselves — in fact, many people forget about the numerous online accounts they sign up to and use just once or have not used in a very long time.

Risk is a measure of Impact and Probability. Impact, the harm created by a threat, depends on several factors:

3.1) Type of Data

The type of data affects Impact. In the Edmodo breach, usernames and passwords were exposed.

3.2) Password Level

Another metric to evaluate Impact is the “Password Level”. We assess the “password level” as low, medium or high depending on how important we believe the data used in a site is to the user. For example, users would use very high level passwords to protect their online banking accounts since they protect their money.

We believe people will use a medium level password for an educational site like Edmodo.

3.3) Age of the Data

Let’s now look at Probability. When evaluating the Probability of an attack due to an exposure, we need to look at how old the data is. Old passwords have a lower probability of success than new ones since they might already have been changed. As mentioned earlier, in this case we believe the breach is recent.

3.4) Password Strength — can it be cracked?

Another important aspect in evaluating the Probability of an attack is determining whether the passwords can be cracked. Different sites use different methods to store passwords. Some use cleartext while others use MD5 or other hashing algorithms. The more easily passwords can be cracked, the higher the risk is that they will be used to launch new attacks.

Edmodo’s passwords were hashed using salted bcrypt; the hash has a prefix of “$2b$”, as shown in the sample received from the black market seller. A salted bcrypt hash is definitely more secure than the MD5 or SHA1 hashes that we have been seeing in breaches over the last few years. It is a hashing method resistant to brute-force attacks and considered very difficult to break.
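For readers unfamiliar with the format, a bcrypt string packs the version prefix, a two-digit cost factor, a 22-character salt and a 31-character digest into 60 characters. The Python sketch below recognizes and splits such a string. The sample value is synthetic (it is not taken from the dump), the cost of 12 is a placeholder since we do not report the cost factor here, and the class and function names are ours.

```python
import re
from dataclasses import dataclass
from typing import Optional

# $<version>$<cost>$<22-char salt><31-char digest>, bcrypt's base64 alphabet.
_BCRYPT_RE = re.compile(
    r"^\$(2[abxy]?)\$(\d{2})\$([./A-Za-z0-9]{22})([./A-Za-z0-9]{31})$"
)


@dataclass
class BcryptHash:
    version: str
    cost: int
    salt: str
    digest: str

    @property
    def rounds(self) -> int:
        """Key-expansion iterations: bcrypt's cost is a base-2 logarithm."""
        return 2 ** self.cost


def parse_bcrypt(value: str) -> Optional[BcryptHash]:
    """Split a bcrypt string into its parts, or return None if it does not match."""
    match = _BCRYPT_RE.match(value.strip())
    if match is None:
        return None
    version, cost, salt, digest = match.groups()
    return BcryptHash(version, int(cost), salt, digest)


# Synthetic placeholder with a valid shape; not a real hash.
sample = "$2b$12$" + "a" * 22 + "b" * 31
info = parse_bcrypt(sample)
print(info.version, info.cost, info.rounds)
```

Because each guess costs an attacker 2^cost rounds of key expansion, and every account has its own salt, precomputed tables do not help and large-scale guessing becomes expensive.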

3.5) Risk Scorecard

Breaches like this are happening all the time. Edmodo did the right thing: they hashed their passwords with strong security and quickly notified users. For this reason, the risk level, calculated by combining the high impact of account takeover with the low probability of cracking the passwords, is assessed at medium.
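One simple way to combine the two dimensions is a small impact-by-probability matrix. The Python sketch below is a hypothetical illustration, not the 4iQ scoring model: the three-level scale, the multiplication and the thresholds are all assumptions chosen so that a high impact with a low probability comes out as medium, as in this assessment.

```python
# Ordinal scale for both dimensions (assumed, for illustration only).
LEVELS = {"low": 1, "medium": 2, "high": 3}


def risk_level(impact: str, probability: str) -> str:
    """Combine impact and probability into a coarse risk level."""
    score = LEVELS[impact] * LEVELS[probability]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


# Edmodo: high impact (account takeover), low probability (bcrypt hashes).
print(risk_level("high", "low"))
```

Had the passwords been stored in cleartext or as unsalted MD5, the probability input would rise and the same matrix would return a high risk.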

Learn more about 4iQ Identity Threat Intelligence and protecting your digital assets and identity.