Introducing nxs-data-anonymizer: a convenient open-source tool for anonymizing databases

Nixys
Apr 18, 2024


It is no secret that developing even a small project is closely tied to infrastructure, since any program needs an environment to run in. Often several environments are required: one for production and the rest for other needs, such as testing. Sometimes these environments are dynamic: along with a new branch, an environment is created that runs the development version of the application and everything it depends on. After the branch is merged into main, this environment and all its data are deleted.

And that’s the kind of data (more precisely, databases) that we would like to talk about: where and how to get it, how to make it as close as possible to production data, and how to protect it against leaks. To solve these problems, we at Nixys use our own tool, nxs-data-anonymizer, and we would like to share it with you.

Why do we need anonymizers?

Now let’s go back from environments to data. Where do we actually get it? Use empty databases? Then we won’t be able to check anything. Synthetic data? It still has to be designed and generated correctly, and even then the application will not always work as we need.

The simplest option is to take data directly from production, but the security team will definitely be against it. Besides being illegal in many cases, a leak of sensitive data would mean huge reputational and financial costs for the business. Alternatively, you can take a database from production once, remove all sensitive data from it, and set up a shared database server for developers. The problem is that this is inconvenient for the developers themselves, since an unexpected change made by one colleague can affect everyone else’s development and testing. This can be solved by deploying the same pre-prepared, cleaned dump into isolated dynamic environments. But the industry doesn’t stand still: data and its structure change often, so cleaning the database and preparing a fresh dump for developers becomes a recurring task. That means financial costs for the business and a longer Time-to-Market.

Therefore, the process needs to be automated, and this is what anonymizers were created for. They are not necessarily complex and expensive enterprise solutions; they can also be scripts in PHP or Bash.

Why have we built our own tool?

Most often we come across two DBMSs: PgSQL and MySQL. To anonymize the first, we use the tool by Evrone, but with MySQL things are a little more complicated: Evrone’s solution doesn’t currently support MySQL, and we were unable to find any alternative. As a result, for a long time we used self-written Bash scripts. Their problem is that it is hard to build a complete, unified solution out of them. We had to rework the scripts for almost every task, which ultimately led to a huge number of code variations, and engineers had to study each variant of the original script almost from scratch whenever they joined a new project.

As often happens, at some point the engineers got tired of this, and the idea came up to develop our own tool that we could use as a boxed solution on all our projects. It took about a month to develop and test, and we now use it in our work. Initially the solution was created only for MySQL, but along the way we realized that nothing stopped us from extending it to PgSQL and, in the future, to other types of databases.

The solution turned out to be quite flexible and easy to use. Its core is based on the following ideas:

  • Stream data processing. You don’t have to do any pre-processing or save a dump of the original database somewhere on disk. nxs-data-anonymizer modifies the data as it arrives on stdin and writes the result to stdout, i.e. you can place the tool in a command between two pipes;
  • Values are described by Go templates. Everything you want to replace in the chosen cells of a table is defined by templates, similar to the ones in Helm that many engineers already know. And just like in Helm, you can use familiar functions, for example, to generate random strings or numbers;
  • Conditions and data from other cells in the row. Filters are flexible and can make substitutions depending on the values of other cells in the same row (or of the cell itself);
  • Checking the uniqueness of data. When you change data (for example, API keys) in a column whose values must be unique, you may happen to generate two identical keys, and such a dump can no longer be loaded into the database. Using this option ensures that such values will not be repeated.
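
To make these ideas more concrete, here is a minimal Python sketch of stream processing, row-dependent rules, and uniqueness checking. It is not how nxs-data-anonymizer is implemented or configured: the real tool parses SQL dumps and uses Go templates, while this illustration assumes a simplified tab-separated dump with a header row. All names (`anonymize_stream`, `unique`, `random_token`, `RULES`, the column names) are hypothetical.

```python
import io
import random
import string
import sys


def random_token(rng, length=12):
    """Return a random lowercase alphanumeric string."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def unique(generate, seen, max_attempts=100):
    """Call generate() until it returns a value not yet in `seen`."""
    for _ in range(max_attempts):
        value = generate()
        if value not in seen:
            seen.add(value)
            return value
    raise RuntimeError("could not generate a unique value")


def anonymize_stream(src, dst, rules, unique_columns=(), sep="\t"):
    """Rewrite rows one at a time; only the current row is held in memory."""
    header = src.readline().rstrip("\n").split(sep)
    dst.write(sep.join(header) + "\n")
    # One set of already-used values per column that must stay unique.
    seen = {col: set() for col in unique_columns}
    for line in src:
        row = dict(zip(header, line.rstrip("\n").split(sep)))
        original = dict(row)  # every rule sees the original cells of the row
        for col, rule in rules.items():
            if col in seen:
                row[col] = unique(lambda: rule(original), seen[col])
            else:
                row[col] = rule(original)
        dst.write(sep.join(row[c] for c in header) + "\n")


# Hypothetical rules: each one maps the original row to a new cell value.
_rng = random.Random(0)
RULES = {
    # Built from another cell of the same row.
    "email": lambda row: "user" + row["id"] + "@example.com",
    # Conditional: service accounts keep their name, everyone else is masked.
    "name": lambda row: row["name"] if row["role"] == "service" else "User " + row["id"],
    # Random value; uniqueness is enforced by anonymize_stream.
    "api_key": lambda row: random_token(_rng),
}

if __name__ == "__main__":
    sample = io.StringIO(
        "id\tname\trole\temail\tapi_key\n"
        "1\tAlice\tuser\talice@corp.test\tsecret-1\n"
        "2\tbackup\tservice\tops@corp.test\tsecret-2\n"
    )
    anonymize_stream(sample, sys.stdout, RULES, unique_columns=("api_key",))
```

In a pipeline, the same function would be called with `sys.stdin` and `sys.stdout` instead of the in-memory sample, so the script could sit between the dump command and the import command without writing anything to disk.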

MySQL was the first DBMS that nxs-data-anonymizer supported, so in the second part of the article we’ll show an example of using it in practice on production data. You can find the same information for PostgreSQL on the project page on GitHub, where you can also open an issue or a pull request!

And if you are already familiar with our tool, you can share feedback in the chat. There you will get answers not only from other users but also from the nxs-data-anonymizer development team!
