Database Leaks: a View from 2017

Storing sensitive data in a modern system connected to the Internet is always a risk. The worst nightmares for users and business owners come true as adversaries find more and more ways to get in and seize the data.

Unlike malware and DDoS attacks, the biggest threats to keeping the sensitive data private are creeping up slowly, silent and invisible. The attacks that end up as data leaks and cause immense damage are only detected with great delay, if ever! And when they are finally found out, the data has already been stolen and used.

Data leaks, historical perspective

Breaches are security incidents that result in a confirmed loss of sensitive data: it either becomes public or is put up for sale on the black market. Most breaches lead to data leaks, that is, exposure of sensitive data to unauthorised parties. One website that indexes leaked user credentials lists 4 billion accounts leaked from 229 sites. Another lists 2935 breaches whose leaked data has been made public. At the time of writing, the author of this article found out that 3 of his old e-mail addresses and passwords had leaked from more than one source.

But where is the rest of the iceberg?

Confirmed security breaches are only the tip of the iceberg

In 2016, only 5% of security incidents became confirmed breaches. As for the remaining 95%, you never know when a breach will be confirmed or how the leaked data will be used. Some take years to uncover: for instance, when Yahoo announced that the account information of at least 500 million users had been stolen by hackers, the announcement came 2 years after the breach took place. The opposite example is the 2012 LinkedIn breach: the compromise of more than 117 million users’ emails and passwords was detected on time, but the stolen data somehow surfaced with a “for sale” label on the dark web 4 years later.

Poor security always makes the adversary’s job fairly easy. SQL injections, storing passwords in plain text (sometimes even third-party ones), and remote code execution after an upload through a constrained vector all lead to the same outcome: leaked data.

Let’s assume that insider jobs, poor practices, and sheer bad luck have contributed to the scale of the problem. Still, security was at least under the control of the systems’ owners, and it was (and still remains) poor. It should come as no surprise that attacks are frequently carried out through well-known vulnerabilities: it’s not the 0-day you should be afraid of, after all.
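Two of the well-known weaknesses mentioned above, SQL injection and plain-text password storage, are easy to illustrate. The following is a minimal sketch, not a description of any of the breached systems: the `users` table, the function names, and the PBKDF2 iteration count are all assumptions made for the example, and it uses only Python's standard library (`sqlite3`, `hashlib`, `hmac`).

```python
import hashlib
import hmac
import os
import sqlite3


def find_user_unsafe(conn, name):
    # VULNERABLE: user input is pasted straight into the SQL text.
    query = "SELECT name FROM users WHERE name = '%s'" % name
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized query: the driver passes `name` as data, never as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


def hash_password(password, salt=None):
    # Store a salted, slow hash instead of the password itself.
    # The iteration count is an illustrative assumption, not a recommendation.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, 200_000
    )
    return salt, digest


def verify_password(password, salt, expected):
    # Recompute the hash with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Hypothetical in-memory database with two users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, salt BLOB, pw_hash BLOB)")
for user, pw in [("alice", "correct horse"), ("bob", "hunter2")]:
    user_salt, user_hash = hash_password(pw)
    conn.execute(
        "INSERT INTO users VALUES (?, ?, ?)", (user, user_salt, user_hash)
    )

payload = "nobody' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, payload)  # the injected OR matches every row
safe_rows = find_user_safe(conn, payload)      # no user has this literal name
```

With the injected payload, the string-built query returns every user in the table, while the parameterized one returns nothing. Even if the table itself leaks, the attacker gets salted hashes rather than reusable passwords.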

Modern threat landscape

Fast forward to 2017: things have gotten much worse. Verizon’s Data Breach Investigations Report shows the decline of “simple leaks”:

  • Only 25% of leaks are insider-based;
  • Over 50% involve criminal groups, i.e. motivated actors;
  • Only 14% involve “errors”, but more than 60% involve hacking in one way or another, with 80% of cases taking advantage of weak or stolen access credentials.
Data source: Verizon’s Data Breach Investigations Report 2017

More than half of the typical attack cases described in the Data Breach Digest rely on the technical availability of the sensitive data after bypassing access controls of different kinds. The primary attack target in large data leaks is still the server that contains sensitive data, with the user environment ranking second by a wide margin.

In industries that deal with distributed services and accumulate large amounts of personal data, i.e. financial, public, and SaaS services, around 10% of the incidents resulting in data loss involved hacking through a web app.

Why do data leaks happen?

Some of the largest data leaks in history are attributed to human error, poor procedures, and inside jobs. Some of the earliest breaches of the 2000s involved lost or stolen laptops, hard drives, backup tapes, and other storage devices. But since the dawn of the history of data leaks, databases have been the number one target for stealing large chunks of data.

Professor Felix from Cossack Labs doesn’t always teach you ways to fail, but when he does — they are sure-fire

Sensitive data security (and, more specifically, database security) frequently relies on the following three layers of assumptions about access controls:

  • “Nobody will bypass my perimeter and get into my network”: perimeter security, firewalls (useless in a modern “distributed”/“cloud” environment), weak security considerations of modern ecosystems. It basically equates outsiders to insiders.
  • “Nobody will compromise access credentials necessary for leaking the data”: app-level access control, something that barely stands when the host or operations environment is compromised.
  • “Nobody will bypass the access controls of the database”: in-database access controls, equally weak against adversaries who compromise the database host OS, against insiders… and against someone sending out a “broken” hard drive with the database to a third party.

It’s important to understand that apart from the obvious “smash-and-grab” approach via one-time exploitation, many attackers consciously work towards a complete compromise of the target system, and their persistent presence devalues many typical defences against SQL injections or authentication flaws.

“Let’s combine this framework, that API, and this weird set of dependencies we’re seeing for the first time in our lives…”

Some ways to shoot yourself in the foot infrastructure-wise are more obvious than others

With the growing complexity of systems and the modular nature of modern software development, it’s much harder to assess the risks of a security breach, because both the nature of what constitutes a breach and the trust perimeter being broken have been seriously blurred.

Although it’s very appealing to model security around protecting data in motion, only a subset of the protected data is transmitted each time, whereas in the storage back-end (database, document storage, file server) sensitive assets are contained in complete form.

Hence, the first subject to get dedicated attention should be the database.

Why is it so hard to provide proper database security?

The threat model behind database defences is wrong. Apart from the wrong overall assumptions about a modern system, even the additional defences that aim to enhance database security are wrong.

In the minds of those trying to provide at least a semblance of protection to the databases in their charge, the typical portraits of database attackers look like this:

  • Smash-and-grab attacker. The one mounting an attack to steal the disk image (similar to the loss of a physical disk in historic incidents), injecting SQL, or bypassing access control a limited number of times;
  • Snapshot attacker. Takes memory & disk snapshot to later analyze and extract the data;
  • Persistent attacker. Stays in the system long enough to corrupt all defences and seize the data.

However, these days we’re seeing more and more persistent attackers, and there are good reasons why this is happening.

What to do to improve database security?

The threat model itself needs to be redefined so that the mitigation techniques can be redefined too, and the weak link, the database, can be protected.

A database needs to be treated as a storage component of an app, not as a stand-alone service. It then becomes clear that providing database security separately from the application logic protects only a small part of the whole. The attack surface on the data starts in the application, continues in the environment between the application and the database, and ends in the DBMS.

It is worth remembering that connecting various partial defence mechanisms doesn’t protect the whole. You either protect the sensitive data using end-to-end encryption that provides application-level protection, or you stitch together some kind of authentication scheme, SSL for transport security, and data-at-rest protection, which eliminates many, but not all, risks.

Database vendors could also contribute to the overall increase in security by acknowledging that the problem is within their domain, too. The closer the protection is to the subject of protection, and the fewer moving parts there are between the security system and the protected data, the better.

How to protect?

In this threat landscape, the best way to provide protection is to carefully differentiate the data by threat level, tokenize it, and protect it with encryption that terminates elsewhere: in middleware, or, better still, in the user’s application. It slightly limits the database conveniences and data processing logic, but in large modern environments it is considered good form to move as much of such logic as possible outside the database anyway. Remember to manage the keys properly, too.
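The tokenization step can be sketched roughly as follows. This is only an illustration under stated assumptions: the `TokenVault` class, the `customers` table, and the `SENSITIVE_FIELDS` set are hypothetical names invented for the example, and two in-memory SQLite databases stand in for a main database and a separately protected vault. Encryption of the vault contents is deliberately left out, because Python's standard library ships no authenticated cipher; a real deployment would encrypt the stored values with a vetted cryptographic library and properly managed keys.

```python
import secrets
import sqlite3


class TokenVault:
    """Hypothetical token vault, kept apart from the main database."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE vault (token TEXT PRIMARY KEY, value TEXT)"
        )

    def tokenize(self, value):
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(16)
        self._conn.execute("INSERT INTO vault VALUES (?, ?)", (token, value))
        return token

    def detokenize(self, token):
        row = self._conn.execute(
            "SELECT value FROM vault WHERE token = ?", (token,)
        ).fetchone()
        if row is None:
            raise KeyError(token)
        return row[0]


# Differentiating data by threat level: only these fields get tokenized.
SENSITIVE_FIELDS = {"card_number"}


def store_customer(db, vault, record):
    # Replace sensitive fields with tokens before they reach the main database.
    protected = {
        key: vault.tokenize(value) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
    db.execute(
        "INSERT INTO customers (name, card_number) VALUES (?, ?)",
        (protected["name"], protected["card_number"]),
    )
    return protected


main_db = sqlite3.connect(":memory:")
main_db.execute("CREATE TABLE customers (name TEXT, card_number TEXT)")
vault = TokenVault()

store_customer(
    main_db, vault, {"name": "Alice", "card_number": "4111111111111111"}
)
stored_name, stored_card = main_db.execute(
    "SELECT name, card_number FROM customers"
).fetchone()
```

An attacker who dumps `customers` in this sketch gets names and random tokens, but no card numbers. Getting those requires compromising the vault as well, which is the point of terminating the protection outside the database.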

Is it all worth the hassle? In 2015, only 4% of reported data leaks involved successful breaking of cryptography. Encryption is still a good repellent to ward off attackers.

Zero Knowledge Apps

Data-at-rest security provides just that: protection of the data that is stored somewhere. But it’s easier to attack sensitive assets when they are transmitted, used, or referenced rather than stored. This is where end-to-end encryption comes into play to save the day for databases.

A true end-to-end system is a system that achieves its design goals while providing access to protected data only to holders of the appropriate “ends”. If such a system’s storage and transmission model also prevents leakage of metadata about users, their relationships, and the nature of the protected data to outside parties, the system can be called zero-knowledge. Zero-knowledge architecture is a set of typical trust schemes and design patterns that enable distributed systems to implement end-to-end trust.

Traditional database clients contain some binary code and, frequently, a data abstraction model, so they are a perfect place to integrate end-to-end encryption (which is actually one of the use cases of our end-to-end encryption framework Hermes).


The database is one of the primary sources of sensitive data leaks. System engineers and software architects need to start thinking about assets, not “perimeters”, and assess realistic threat models, not idealistic concepts that explain everything but never actually take place in reality.

While protecting the data at rest in databases is an important security task, it should by no means be the be-all and end-all of protecting databases against leaks. Securing the data in motion through zero-knowledge systems, in addition to data-at-rest protection, is what makes for a system that is much more secure (but, unfortunately, still not safe against plain human error or malevolence).

If you have a story to share about database-related security, we’d love to hear from you! Please reach out to us via @cossacklabs.