Stop searching for AWS secrets in code

The secrets’ long tail is what you should really worry about

Oct 26, 2021

Misplaced focus

Secrets scanning has recently become an important topic, triggered by a series of security breaches caused by secrets unintentionally left in code. While assets like AWS credentials are probably the reason companies start to adopt scanning solutions, the secrets’ long tail is what you should really worry about.

The history of finding secrets in code

People tend to be lazy, and in places like public code hosting that can be dangerous. Back in 2015, Sinha et al. were the first to introduce the phenomenon to the world, stating that publicly hosted secrets ‘..can be easily stolen by a malicious user who can authenticate themselves as the developer and misuse the services for their own profit’. Using a few simple regex rules, they were able to find many secrets in public GitHub accounts.

Shortly after, their prophecy was fulfilled with Uber’s 2016 data leak: hackers gained access to the private data of 57 million Uber users using AWS credentials left on Uber’s GitHub account. It ended with Uber paying $100k for the data to be destroyed and millions in fines to the authorities. Later, in 2019, Meli et al. carried out a systematic analysis of secrets in public repos, looking at aspects like time to remediation, secret types and where they were commonly found. According to their research, ‘..the most commonly leaked were Google API keys..’, ‘..81% of the secrets we discover were not removed..’ and a ‘..single user operating legitimately within the API rate limits imposed by GitHub is able to achieve near perfect coverage of all files being committed on GitHub for our sensitive search queries’. This made clear why leaks in public repos are so severe.

The findings distribution from Meli’s paper

Many cite Meli’s paper as the turning point for the secrets-in-code topic. One of them was Adrian Colyer, who said that “..Without protection in place, it’s just too easy of a human mistake to make”. As a result, a wave of interest was generated: a simple GitHub query surfaces dozens of relevant implementations. But the truth is that this is just the tip of the iceberg. The recent SolarWinds breach taught us that the soft spot of most organisations is actually the ‘other’ types of secrets.

Uncommon secrets go under the radar

The SolarWinds breach led to many companies’ private data being hacked. It was due to extremely weak FTP credentials (*****123) left in a public repo. What makes it special is that most (if not all) of the current solutions would miss it, as they target the common API key structures and leave the fuzzy, general structures behind.

The code with the credentials, from Vinoth’s Twitter

Taking a second look at Meli’s paper, given that the highest share of secrets belonged to the very common providers (Google, AWS, etc.), we should ask ourselves: how many of the findings belonged to real companies (like SolarWinds), and how many to dummy or side projects? Since Google credentials, for example, were dozens of times more common than Stripe or Twilio ones, is it fair to draw a conclusion about the likelihood of finding such secrets in a random organisation’s repo?

GitHub is the home of many software projects, and most of them are user owned rather than company owned. In company-owned repos, coding standards are supposed to be higher, secrets are more likely to be test related, and the coding languages and technologies are generally more niche; with more forks and stars, the code is also assumed to be better monitored. If the conclusions are meant for organisations (rather than private users’ repos), this bias should be verified.

To answer that question, we did a small analysis in which we searched for secrets-related terms in public cloud repos. While ~7% of the resulting files turned out to include secrets, only 10% of them were company owned. Moreover, the rate of common providers’ secrets was 25% higher among private users’ repos. This leads us to question whether covering only the common providers’ secrets is enough.

To further demonstrate the point, we analysed public repos of top-tier companies and government organisations, ones that use secret scanning solutions. We found many secrets, and most of them were not on the most common providers list. While the existing solutions concentrate on the common providers, the secrets’ long tail is left unattended.

The unpredictability of secrets

In general, secrets in code can be divided into three main groups:

Structured random API keys (like Google’s AIza..): a well-defined, distinguishable structure; easy to find.
Non-structured random API keys (like Django’s secrets): detection is mostly based on entropy together with context (searching for a prefix like ‘secret=’); moderately hard to find.
General passwords (like ‘welcome123’): non-structured; detection is mostly based on context (‘password=’); hard to find.
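To make the three groups concrete, here is a minimal Python sketch of how a scanner might tell them apart. It is an illustration under stated assumptions, not how any particular product works: the two provider patterns are simplified examples, the keyword list is hypothetical, and the entropy cut-off of 3.5 bits per character is an assumed value that would need tuning on real data.

```python
import math
import re
from collections import Counter

# Group 1: structured keys. Simplified, illustrative patterns only.
STRUCTURED_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
}

# Groups 2 and 3: a context keyword, an assignment, then a quoted value.
# The keyword list is a hypothetical starting point, not a complete one.
CONTEXT_PATTERN = re.compile(
    r"""(secret|token|api_?key|passw(?:or)?d|pwd)\w*\s*[:=]\s*['"]([^'"]+)['"]""",
    re.IGNORECASE,
)

ENTROPY_THRESHOLD = 3.5  # bits per character; an assumed cut-off


def shannon_entropy(value: str) -> float:
    """Return the Shannon entropy of the string, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def classify_line(line: str):
    """Return (group, detail) for a suspicious line, or None if nothing matches."""
    # Group 1: a known structure is enough on its own.
    for name, pattern in STRUCTURED_PATTERNS.items():
        if pattern.search(line):
            return ("structured", name)

    # Groups 2 and 3 both need context; entropy separates them.
    match = CONTEXT_PATTERN.search(line)
    if match is None:
        return None
    keyword, value = match.group(1).lower(), match.group(2)
    if shannon_entropy(value) >= ENTROPY_THRESHOLD:
        return ("unstructured_random", keyword)
    return ("general_password", keyword)
```

Note where the difficulty sits: a password like ‘welcome123’ scores low on entropy, so it is only caught when a recognisable keyword happens to sit next to it. Rename the variable or move the value into a URL, and this sketch misses it.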

A recent analysis by Lounici et al. found that the existing solutions focus on API keys since ‘..it is easier to handle them with simple regular expression classifiers. Passwords, on the other hand, are difficult to identify with classic methods..’. This is why those solutions are ‘not able to detect plaintext passwords, and it only detects a reduced sample of API Keys’. In other words, they look where detection is easiest: at the more common, easy-to-detect secrets. Taking the FTP credentials of the SolarWinds breach as an example, most of the existing solutions would fail to identify such secrets, wasting crucial time until remediation and making them a bigger prize for the common hacker.
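The gap is easy to reproduce. The Python sketch below runs a provider-pattern-only scan over a made-up config line and finds nothing, while a rule that looks for credentials embedded in a URL flags it. Everything here is an assumption for illustration: the sample line, host name and password are invented and are not the leaked SolarWinds file, and the two provider patterns are simplified examples.

```python
import re
from urllib.parse import urlsplit

# Simplified provider patterns, standing in for a scanner that only
# knows the common structured API keys.
PROVIDER_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
]

# Any scheme://... token, stopping at whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"[a-z][a-z0-9+.\-]*://[^\s'\"<>]+", re.IGNORECASE)


def provider_only_scan(text: str) -> list:
    """Return every substring that matches a known provider key structure."""
    return [m.group(0) for p in PROVIDER_PATTERNS for m in p.finditer(text)]


def url_credential_scan(text: str) -> list:
    """Return (scheme, host, user) for every URL that embeds a password."""
    findings = []
    for match in URL_PATTERN.finditer(text):
        try:
            parts = urlsplit(match.group(0))
        except ValueError:
            continue  # malformed URL; skip rather than crash the scan
        if parts.password:
            # Report where the credential lives, not the password itself.
            findings.append((parts.scheme, parts.hostname, parts.username))
    return findings


# Hypothetical config line with a weak, human-chosen password.
SAMPLE = 'upload_url = "ftp://deploy:example123@ftp.example.com/releases"'

print(provider_only_scan(SAMPLE))   # no provider-shaped key in the line
print(url_credential_scan(SAMPLE))  # the embedded FTP credential is flagged
```

The second rule is still only one narrow slice of the long tail. The same password placed in a plain variable, an XML attribute or a shell script would need its own context rule.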

How to keep your secrets safe

While the world is focusing on mitigating the popular, common providers’ secrets, we should assume that the less common and harder-to-detect secrets are the ones attracting hackers’ eyes. A mistaken sense of security can be more dangerous than not using secret detection at all. SpectralOps specialises in covering every secret type. Come have a look.