Seven Best Practices for Keeping Sensitive Data Out of Logs

A few weeks ago, Twitter asked users to reset their passwords. Per Twitter’s announcement, passwords were written to logs before they were hashed. Log data is often consolidated to a central system, sometimes operated by a third party. The security posture and safeguards of your logging system tend to be much more lax than those of a database that stores sensitive data. It’s not uncommon (especially in a startup) for the entire company to have access and tools to query that data.

A very thoroughly redacted document

Whether or not you’re in an industry such as healthcare or finance with strict compliance requirements, it’s important to limit access to a customer’s sensitive data (we’ve all seen how bad it can be for a company to disclose mishandling of data). Keeping track of your sensitive data, and keeping it out of logs, is an important foundational piece of that puzzle. Unfortunately, without the proper processes and tools, it’s all too easy to inadvertently write sensitive data to a log file. I’ve worked on a few projects and with smart people who have good strategies for keeping sensitive data out of logs. There isn’t a one-size-fits-all approach (each one can break down in different ways), so it’s important to have multiple layers of protection. Here are seven of those strategies that you can quickly put in place to build a solid foundation.

Sensitive data

Before we dive into the solutions for keeping sensitive data out of logs, here’s a quick definition of sensitive data and a few examples:

  • Personally identifiable data. While there are some obviously sensitive things like Social Security Number, combinations of data (like first name + date of birth or last name + zip code) or user generated data (like an email or user name, e.g. BillGates@hotmail.com) can also leak information.
  • Health Data
  • Financial Data (like credit card numbers)
  • Passwords
  • IP addresses may be considered sensitive, especially when in combination with personally identifiable data.

This is not an exhaustive list. It’s important to take a thorough look at your own data to determine what is sensitive (this is a key piece of many compliance programs; in an unregulated industry you should do your own analysis). A useful exercise is to ask how much trouble your company would be in if you had to disclose a leak of this data. Would your company go bankrupt from fines or lost customer confidence? Work with your security expert or privacy team (if you have one) to document which data is considered sensitive, what systems process that data, and how access is maintained.

Now that you’ve secured your data at the system level, here are ways to make sure it doesn’t end up in log data exhaust where it doesn’t belong.

#1 Compartmentalize Sensitive Data

When you work with sensitive data, you should minimize which parts of the system work with that data. For example, it might be tempting to use an SSN or an email address as a unique identifier for a person. If you do that, though, many different parts of the system (database tables, API endpoints, etc.) will process and store the sensitive field. A better approach is to isolate the sensitive field and only use it when absolutely necessary.

One common solution is to use a lookup table to replace the sensitive field with a random ID. For example:

SSN         | External ID
-------------------------
999-99-9999 | 5a2_cXKrt32DcWOJpJlyhr7FhTcLPfvlEAb1eA2H

Even though the SSN is the primary key, tables and services outside of the main Person table only use the external id.
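
A minimal TypeScript sketch of that lookup, assuming Node.js and an in-memory Map standing in for the Person table. The names ssnToExternalId and externalIdFor are hypothetical; a real system would persist the mapping in the database.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical in-memory stand-in for the Person table: SSN -> external ID.
const ssnToExternalId = new Map<string, string>();

// Return the external ID for an SSN, minting a random one on first use.
// The ID comes from a CSPRNG, so it carries no information about the SSN.
function externalIdFor(ssn: string): string {
  let id = ssnToExternalId.get(ssn);
  if (id === undefined) {
    id = randomBytes(30).toString("base64url"); // 30 bytes -> 40 URL-safe chars
    ssnToExternalId.set(ssn, id);
  }
  return id;
}
```

Because the ID is random rather than derived from the SSN, nothing outside the lookup table can be used to work backwards to the sensitive value.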

A common pitfall is to attempt to use a hash function to obfuscate the sensitive data and use the hashed value as a key. Although a hash function can’t be easily reversed, when the input domain is relatively small (e.g. all possible SSNs), you can run all inputs through the function to find a match. There are just under 1,000,000,000 possible SSNs; on my laptop it takes just over 1 hour for some unoptimized Python code to compute MD5s for all of them. With a GPU and heavily optimized code, this can take a matter of minutes or seconds, even if you salt the hash (an attacker who knows the salt simply includes it in every guess).
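
To make the enumeration concrete, here is a small sketch assuming Node.js. To keep it instant, it assumes the attacker already knows the first five digits and only tries the 10,000 possible serial numbers; the full attack is the same loop over roughly a billion candidates. The names md5Hex and recoverSsn are illustrative.

```typescript
import { createHash } from "node:crypto";

function md5Hex(input: string): string {
  return createHash("md5").update(input).digest("hex");
}

// Try every 4-digit serial for a known prefix (e.g. "999-99-") and return
// the SSN whose MD5 matches the target, or null if none does.
function recoverSsn(prefix: string, targetHash: string): string | null {
  for (let serial = 0; serial < 10000; serial++) {
    const candidate = prefix + String(serial).padStart(4, "0");
    if (md5Hex(candidate) === targetHash) return candidate;
  }
  return null;
}

const hashedKey = md5Hex("999-99-9999"); // what a "hashed ID" column would hold
console.log(recoverSsn("999-99-", hashedKey)); // the original SSN, found by enumeration
```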

#2 Keep Sensitive Data Out of URLs

If you’re building a RESTful API and your user data is keyed on email address, it might be tempting to have an endpoint like: /user/<email>. Request URLs are typically logged by proxies and web servers, so emails are bound to end up in a log if you do that. To keep that sensitive data out of your URLs, you have a couple of options.

Option 1. Per recommendation #1, don’t use the sensitive field as a unique identifier. For the endpoint URLs, use the external IDs instead.

Option 2. Violate the REST principles and pass along the sensitive value as part of a POST body, even if it’s a read-only request. Web servers don’t typically log the body of a POST request, so your sensitive field stays out of logs.
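
A rough sketch of the two options side by side. The paths /user/<externalId> and /user/lookup are hypothetical, and accessLogLine only approximates what a proxy or web server typically records (method and path, not the body).

```typescript
interface ApiRequest {
  method: "GET" | "POST";
  path: string;  // typically written to access logs
  body?: string; // typically not written to access logs
}

// Option 1: address the user by an opaque external ID.
function getUserByExternalId(externalId: string): ApiRequest {
  return { method: "GET", path: `/user/${encodeURIComponent(externalId)}` };
}

// Option 2: keep the email out of the URL by sending it in a POST body.
function lookupUserByEmail(email: string): ApiRequest {
  return { method: "POST", path: "/user/lookup", body: JSON.stringify({ email }) };
}

// Approximation of an access-log entry: method and path only.
function accessLogLine(req: ApiRequest): string {
  return `${req.method} ${req.path}`;
}
```

In both cases the logged request line contains no email address; with Option 2 the email travels only in the body.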

What can go wrong

Early in your design, you should determine what data in your system is considered sensitive. If you haven’t done that, then it might require a herculean effort to make an API design change like this late in the game.

#3 Redact Data Where Possible

So you’ve compartmentalized your code (#1) and kept data out of URLs (#2). Your user endpoints, though, contain some logging statements to help in debugging the service. It might look something like:

logger.info(`Updating email for user ${user}`);

Somewhere in your codebase is a method that serializes a user to a string. Perhaps it’s in the class definition. In that definition, make sure you are redacting fields with sensitive data:

class UserAccount {
  id: string;
  username: string;
  passwordHash: string;
  firstName: string;
  lastName: string;

  ...

  // Only the id is serialized; every other field stays out of log output.
  public toString() {
    return `UserAccount(${this.id})`;
  }
}

It might be tempting to log all the fields in toString, but it turns out that you really only need the id. If, in the course of debugging, you need to track down more details about the user, you can look them up once you have the id.
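
One caveat worth sketching: toString() only covers string interpolation. JSON.stringify ignores toString() and serializes every enumerable field unless the class also defines toJSON(). The trimmed-down class below is an assumption for illustration, not the full UserAccount above.

```typescript
class RedactedUserAccount {
  constructor(
    public id: string,
    public firstName: string,
    public lastName: string,
  ) {}

  // Used by template strings and string concatenation.
  toString(): string {
    return `UserAccount(${this.id})`;
  }

  // Used by JSON.stringify, which structured loggers often call.
  toJSON(): { id: string } {
    return { id: this.id };
  }
}

const user = new RedactedUserAccount("u-123", "Joe", "Smith");
console.log(`Updating email for user ${user}`); // Updating email for user UserAccount(u-123)
console.log(JSON.stringify({ user }));          // {"user":{"id":"u-123"}}
```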

What can go wrong

This doesn’t stop a developer from logging a field directly, e.g.: logger.info(`The user's details are: ${user.firstName} ${user.lastName}`);

#4 Structured Logging with a Blacklist

Logging via string-based APIs like console.log() or printf, both of which require that you convert data to a string, is typically considered an anti-pattern nowadays. It might be easy to spit data out for debugging, but parsing this data is painful, and it can be missing useful context. With structured logging, rather than strings, you log key/values or nested objects. Certain details about the current context (e.g. a request id or server host name) can automatically be injected into each log entry.

If you’re unfamiliar with structured logging, the honeycomb.io blog has a great intro: You Could Have Invented Structured Logging.

Once your logging is structured, you can now blacklist certain properties to filter them at runtime. For example:

const Blacklist = ["firstName", "lastName", "SSN"];
const SSNRegex = /^\d{3}-?\d{2}-?\d{4}$/;
const EmailRegex = /.+@.+/;

class Logger {
  // Redact blacklisted keys and any value that looks like an SSN or an email.
  log(details: Record<string, string>) {
    const cleanedDetails = Object.fromEntries(
      Object.entries(details).map(([key, value]) =>
        Blacklist.includes(key) || SSNRegex.test(value) || EmailRegex.test(value)
          ? [key, "<redacted>"]
          : [key, value]
      )
    );
    console.log(JSON.stringify(cleanedDetails));
  }
}

In the above logger, we have a simple heuristic based on a blacklist of keys and two regexes to skip values that look like SSNs or emails.
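
Since structured logs often carry nested objects, the same heuristic can be applied recursively. The following is a sketch under assumptions: redact, SENSITIVE_KEYS, and the two patterns are illustrative names mirroring the logger above, not part of any particular logging library.

```typescript
const SENSITIVE_KEYS = new Set(["firstName", "lastName", "SSN"]);
const SSN_PATTERN = /^\d{3}-?\d{2}-?\d{4}$/;
const EMAIL_PATTERN = /.+@.+/;

// Return a copy of `value` with blacklisted keys and suspicious strings replaced.
function redact(value: unknown): unknown {
  if (typeof value === "string") {
    return SSN_PATTERN.test(value) || EMAIL_PATTERN.test(value) ? "<redacted>" : value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? "<redacted>" : redact(inner);
    }
    return out;
  }
  return value; // numbers, booleans, null, and undefined pass through unchanged
}
```

A logger could call redact on the whole event just before serializing it, so that context fields injected elsewhere get the same treatment.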

What can go wrong

This doesn’t stop someone from doing something like:

logger.log({'pleaseLogThis': user.firstName});

or

logger.log('{firstName: Joe}');

It’s also possible to have a false positive in which a logging statement filters out something it shouldn’t—but in my experience that is pretty rare and easy to catch early in development.

#5 Code Review

A reviewer should look for logging statements with sensitive data as part of code review. If you’re using a Pull Request Template, it might be worth having a checkbox in the template for the reviewer to confirm that they’ve verified logging statements in the changes.

What can go wrong

In my experience, reviewers tend to gloss over logging statements. It requires a shift in culture to notice and closely inspect them.

#6 QA and Automated Testing

While your QA team should be testing that the many flows in the system are working, their testing doesn’t have to stop there. If tests are automated and use predictable data, then a test can automatically check that this data doesn’t end up in the logs. For example, if a web form contains a first name, last name, and SSN field, after running the Selenium suite the test should also look in the app server logs for that first name, last name, and SSN.
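
A sketch of such a check, assuming the test run has already captured the app server log as a string (how you fetch it depends on your setup). TEST_FIXTURES and findLeakedFixtures are hypothetical names; distinctive fixture values help avoid accidental matches.

```typescript
// Values the automated suite typed into the form (hypothetical test fixtures).
const TEST_FIXTURES = ["Zaphod", "Beeblebrox", "999-99-9999"];

// Return every fixture value that appears anywhere in the captured log text.
function findLeakedFixtures(logText: string, fixtures: string[]): string[] {
  return fixtures.filter((value) => logText.includes(value));
}
```

The suite would then fail whenever findLeakedFixtures returns a non-empty list, and the returned values tell you which field leaked.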

What can go wrong

A QA team often doesn’t have the right access or even know what systems to check. If they’ve been doing black box testing, then it’ll require a bit of work to get them up to speed.

#7 Automated Alerts in the Logging System

Similar to #4 and #6, you can write a test in your logging system to look for certain patterns of data, for example a regex that looks for SSNs or a search for common test data. This test should be in place in your staging or dev environment so that a leak is caught before the code is promoted to production.
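
Most logging systems express this as a saved search or alert rule; the sketch below only illustrates the matching logic over a batch of log lines. ALERT_PATTERNS and scanLogLines are hypothetical names, and the patterns are deliberately simple.

```typescript
interface LogAlert {
  lineNumber: number; // 1-based position in the scanned batch
  pattern: string;    // name of the pattern that matched
}

// Hypothetical patterns; unanchored so they match inside a longer log line.
const ALERT_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  email: /[^\s@]+@[^\s@]+\.[^\s@]+/,
};

function scanLogLines(lines: string[]): LogAlert[] {
  const alerts: LogAlert[] = [];
  lines.forEach((line, index) => {
    for (const [pattern, regex] of Object.entries(ALERT_PATTERNS)) {
      if (regex.test(line)) alerts.push({ lineNumber: index + 1, pattern });
    }
  });
  return alerts;
}
```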

While this may seem like overkill, I’ve seen this technique catch a number of possible PII-leaks before they made it to production. If you have a complex system, a small change in one part of the system might have unanticipated consequences that are hard to catch in other ways.

What can go wrong

Does the app config (e.g. log levels) for your staging environment match production? Do you have logs in staging funneled to the same logging system as production?

Sometimes, an alert can be too noisy since staging will be configured for DEBUG-level logging, which results in a lot more messages. And sometimes a team just ignores alerts from staging. In my experience, it’s important to treat an alert or outage in staging with the same vigor as one in production. Otherwise, you probably won’t catch these things ahead of time.

Conclusions

These best practices can put you on the right path to keeping sensitive data out of logs. They’re certainly not a complete set that will make you ready for a HIPAA or SOC2 audit, but they can put your startup on a good footing towards that end. And even if you’re not in a high-compliance industry, you should be serious about keeping customer data confidential.

But at the end of the day, a checklist won’t make your system secure†. Your company and software team need to invest in a culture of building secure systems.

If you’d like to learn more about logging and security, OWASP has a number of great resources, including the Logging Cheat Sheet.

† Some other ways I’ve seen data sneak into logs include:

  • DB Logs — if misconfigured, they can end up writing out queries and query params
  • Other misconfigs—I’ve seen a deploy mistake inadvertently change the log to DEBUG in production dozens of times. While your logging might be fine, an ORM or some other third party library might dump lots of information.
  • Client side data leaks—you need to be careful about how you’re storing data on the client side, too, as Healthcare.gov made famous in 2015.