Data version control in Avito’s Infomodel

Published in AvitoTech, Jul 6, 2021

Infomodel is Avito’s metadata management system. It manages ad categorization, taxonomy, and ad directories. Our recent post discussed how we handle it: why we need Infomodel and how it interacts with the rest of Avito’s systems.

Today I will cover an equally important part of working with data: preparing changes and deploying them to production.

What we used to have

In 2017, when we started work on the Infomodel project, Avito maintained essentially two environments: prod and dev. All Infomodel data was stored in the database. There were no interfaces or processes for modifying the data, so every edit was made in code through migrations. We wrote SQL migrations in the main repository of our monolith, inserting or modifying table rows as needed. On deploy, the DBA team rolled the migrations out in prod, while in dev they ran automatically.

But there was a hidden problem: what to do if your feature is not ready yet? What if you need to add changes that are going to affect others’ work? To avoid new changes causing trouble, we had an is_active (bool) column used when retrieving data from the database. Here is a category table as an illustration:

To get data from it for building, say, a category tree, we made a simple SELECT query as follows:

SELECT * FROM categories WHERE is_active = true;

If we wanted to add a new category and hide it for a while, we created a migration that inserted a new row with is_active set to false. After the code was deployed to the dev environment, the migration ran automatically and added the new row:

In the backend, we had to update the query, adding the following to it:

SELECT * FROM categories WHERE is_active = true OR category_id = 3;

Thus, the new category was visible in our local build, but nobody else could see the change. When the task was finished, we removed the OR, and the DBA team switched is_active to true at the next deploy to production.
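The flag-and-override pattern described above can be reproduced in a few lines. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the production database; the `name` column and the sample rows are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories ("
    "category_id INTEGER PRIMARY KEY, name TEXT, is_active INTEGER)"
)
# Two live categories plus one hidden row added by a migration (sample data).
conn.executemany(
    "INSERT INTO categories VALUES (?, ?, ?)",
    [(1, "Cars", 1), (2, "Real estate", 1), (3, "Drones", 0)],
)


def visible_ids(query):
    """Run a query and return the category ids it selects."""
    return [row[0] for row in conn.execute(query)]


# What every other developer sees: only active rows.
everyone = visible_ids(
    "SELECT category_id FROM categories "
    "WHERE is_active = 1 ORDER BY category_id"
)
# What the feature developer sees with the temporary OR override.
developer = visible_ids(
    "SELECT category_id FROM categories "
    "WHERE is_active = 1 OR category_id = 3 ORDER BY category_id"
)
```

The override lives in application code, which is why every hidden change needed a developer to add it and later remove it.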

Problem

Making frequent changes without involving developers every time is impossible with this approach, especially when it comes to parallel or joint changes.

But the company is growing, dictating the need for ever more frequent changes, so what shall we do?

Solution

When we were designing the architecture of Infomodel’s first version, we set ourselves the requirement to give the business the ability to make changes to Infomodel quickly and, most importantly, to do so without Avito’s teams blocking each other’s work.

As a result, we came up with a rather elegant and effective solution that allows teams to:

work with data in parallel;
release changes whenever they are ready for it;
test changes on the fly;
know who made any changes and when;
see the difference between one’s changes and what is currently in production.

We named this solution “Infomodel Version Control.” What does it look like?

We put Git on top of Postgres. Anyone working with Infomodel works in an isolated branch. In the branch, one can change anything — delete a category, or create a hundred new ones.

All branching is done from one common production branch called master, which cannot be modified directly. From an interface point of view, introducing new branches looks like this:

We have three different types of branches. The differences between them relate to the release process, but before getting to that, it is worth discussing how branching works under the hood.

Technical side

One of our goals for the technical side of version control was simplicity: it should be easy to debug what is happening with the branches.

Data schema

To ensure that branches are isolated from each other, we implemented branching at the Postgres schema level. Each branch is a separate schema with the same set of tables and records as master. When creating a new branch, we duplicate the entire master schema. Users work with their own snapshot of the data without interfering with others.

Of course, we also needed somewhere to store service information, such as the list of branches and schemas, the list of users, and so on. This lives in a separate metadata schema containing the tables that support Infomodel’s machinery.
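A schema-per-branch copy could be sketched as a function that generates the Postgres statements for a new branch. This is an illustration under stated assumptions, not our actual implementation: the function name `branch_schema_sql`, the schema name `branch_im_101`, and the table list are hypothetical, and the sketch does not recreate foreign keys between the copied tables.

```python
import re

# Accept only plain lowercase identifiers so names can be interpolated safely.
IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def branch_schema_sql(branch_schema, tables, source_schema="master"):
    """Return the SQL statements that clone `source_schema` into a new schema."""
    for name in (branch_schema, source_schema, *tables):
        if not IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    statements = [f"CREATE SCHEMA {branch_schema};"]
    for table in tables:
        # Copy the table definition (columns, defaults, indexes) ...
        statements.append(
            f"CREATE TABLE {branch_schema}.{table} "
            f"(LIKE {source_schema}.{table} INCLUDING ALL);"
        )
        # ... and then the rows themselves.
        statements.append(
            f"INSERT INTO {branch_schema}.{table} "
            f"SELECT * FROM {source_schema}.{table};"
        )
    return statements


statements = branch_schema_sql("branch_im_101", ["categories", "attributes"])
```

Generating plain statements keeps the branch easy to inspect: each branch is just another schema that can be queried by hand when debugging.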

The first task was complete: users could make parallel changes to the data without conflict.

But here we hit our first problem: primary keys. When creating new records, users added rows to tables with an autoincrement PK. Two different records added in two different schemas ended up with identical keys, making the entire schema useless. We moved all the sequences into the metadata schema and shared them between the branch schemas. This solved the problem by preventing identical ids.
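The key collision and its fix can be modelled without a database. In this sketch a Python counter plays the role of a Postgres sequence; the `Branch` class, the starting value 100, and the category titles are assumptions for illustration only.

```python
import itertools


class Branch:
    """A toy branch that allocates primary keys from a given id source."""

    def __init__(self, name, id_source):
        self.name = name
        self.id_source = id_source
        self.categories = {}

    def add_category(self, title):
        new_id = next(self.id_source)
        self.categories[new_id] = title
        return new_id


# Per-branch counters: both branches start from the same value and collide.
a = Branch("a", itertools.count(100))
b = Branch("b", itertools.count(100))
collision = a.add_category("Drones") == b.add_category("Scooters")

# One shared counter, playing the role of a sequence in the metadata schema.
shared = itertools.count(100)
c = Branch("c", shared)
d = Branch("d", shared)
ids = [c.add_category("Drones"), d.add_category("Scooters")]
```

With a shared source, an id identifies the same entity in every branch, which is what makes replaying changes onto master safe.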

Change management

Similar to Git, we store all changes made by a user. To do this, each schema (including master) has a service table called changelog. This table keeps records for each change in the current branch:

As you can see from the table, we know who did what, when, and with which entity. As a result, users can always see a list of their own changes in the interface, or find and debug someone else’s:

Let’s imagine that the user wants to delete a category. The category itself and all associated attributes will be deleted in one step, so a single user action creates several entries in the changelog. To group these records, we introduced the batch_hash (string) property. With it, we can identify all the changes made to the database within the same user action. We can also use it to roll changes back one by one.
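Grouping changelog rows by batch_hash could look roughly like the sketch below. Only batch_hash comes from the description above; the other field names, the use of a random UUID as the hash, and the helper functions are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangelogEntry:
    user: str
    entity: str
    entity_id: int
    action: str
    batch_hash: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def delete_category(changelog, user, category_id, attribute_ids):
    """Log one user action as several entries sharing one batch_hash."""
    batch_hash = uuid.uuid4().hex
    for attribute_id in attribute_ids:
        changelog.append(
            ChangelogEntry(user, "attribute", attribute_id, "delete", batch_hash)
        )
    changelog.append(
        ChangelogEntry(user, "category", category_id, "delete", batch_hash)
    )
    return batch_hash


def entries_for_batch(changelog, batch_hash):
    """Return every entry that belongs to the same user action."""
    return [entry for entry in changelog if entry.batch_hash == batch_hash]


changelog = []
batch = delete_category(changelog, "alice", 3, [31, 32])
grouped = entries_for_batch(changelog, batch)
```

Selecting by batch_hash gives the full set of rows touched by one click, which is the unit a rollback would need to undo.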

Thus we achieved two more goals: we know who changed what and when, and we can see the difference between a branch and what is in production.

Deploying changes to production

Because we keep every change, we can merge branches. Releasing changes to production is nothing more than locking the branch for changes and applying its changelog entries one after another to the entities in the master branch, from where they are deployed to production.
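The lock-and-replay release can be sketched with plain dictionaries. The entry shape (`entity`, `action`, `id`, `payload`) and the `locked` flag are assumptions for illustration; the real changelog format is not shown in this post.

```python
def apply_entry(tables, entry):
    """Apply one changelog entry to a dict of {entity: {id: payload}}."""
    rows = tables.setdefault(entry["entity"], {})
    if entry["action"] == "delete":
        rows.pop(entry["id"], None)
    else:
        # "create" and "update" both store the latest payload.
        rows[entry["id"]] = entry["payload"]


def release(branch, master):
    """Lock the branch, then replay its changelog onto master in order."""
    if branch["locked"]:
        raise RuntimeError("branch is already locked for release")
    branch["locked"] = True  # no more edits once the release has started
    for entry in branch["changelog"]:
        apply_entry(master, entry)


master = {"categories": {1: "Cars", 2: "Real estate"}}
branch = {
    "locked": False,
    "changelog": [
        {"entity": "categories", "action": "create", "id": 3, "payload": "Drones"},
        {"entity": "categories", "action": "update", "id": 1, "payload": "Vehicles"},
        {"entity": "categories", "action": "delete", "id": 2, "payload": None},
    ],
}
release(branch, master)
```

Replaying entries in their original order matters: an update that follows a create for the same id only makes sense after the create has been applied.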

We let users decide for themselves when they are ready to push their changes to prod.

Branch names start with the Jira task number, which lets us tell branches apart and link changes in Infomodel with external changes. We know which PRs were made as part of the task and whether they passed the tests. If they did not, we prevent the changes from being released.
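Extracting the task number from a branch name is a small parsing step. The sketch below assumes a conventional Jira key such as `IM-1234` followed by a separator; the exact naming scheme and the example branch names are hypothetical.

```python
import re

# A Jira-style key at the start of the name: project prefix, dash, number.
TASK_KEY = re.compile(r"([A-Z][A-Z0-9]*-\d+)(?=$|[-_/])")


def task_key(branch_name):
    """Return the leading Jira-style key, or None if the name has none."""
    match = TASK_KEY.match(branch_name)
    return match.group(1) if match else None


linked = task_key("IM-1234-new-drone-category")
unlinked = task_key("quick-fix")
```

A branch without a recognizable key cannot be tied to PRs and test results, so a release gate could simply refuse it.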

After the release starts, we run a few more test builds before we merge the changes. In this way, product teams can be confident that the change is not going to break anything.

Branch lag

After adding changes from one branch to prod, the rest of the active branches are automatically locked for release. Users can make changes but cannot release them because the branch owner does not see the whole updated picture.

To fix this, users can pull master into their branch with one click. Under the hood, we create another clean schema from master and reapply all the changes that the user made in the previous branch. The mechanism is similar to git rebase.

But there are pitfalls here, too. What if changes have been added to master that delete some of the entities the user worked with in their branch? The user’s schema will then have rows that refer to nonexistent records via a foreign key.
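The release lock and the one-click update can be modelled together. In this sketch a `master_version` counter decides whether a branch is behind; that counter, the `Branch` class, and the entry shape are assumptions for illustration, not a description of the real schema.

```python
import copy


def apply_entry(tables, entry):
    """Apply one changelog entry to a dict of {entity: {id: payload}}."""
    rows = tables.setdefault(entry["entity"], {})
    if entry["action"] == "delete":
        rows.pop(entry["id"], None)
    else:
        rows[entry["id"]] = entry["payload"]


class Branch:
    def __init__(self, master, master_version):
        self.tables = copy.deepcopy(master)  # snapshot taken at creation
        self.base_version = master_version
        self.changelog = []

    def change(self, entry):
        apply_entry(self.tables, entry)
        self.changelog.append(entry)

    def can_release(self, master_version):
        # A branch that lags behind master is locked for release.
        return self.base_version == master_version

    def update_from_master(self, master, master_version):
        # Start from a clean copy of master, then replay the user's changes.
        self.tables = copy.deepcopy(master)
        for entry in self.changelog:
            apply_entry(self.tables, entry)
        self.base_version = master_version


master = {"categories": {1: "Cars"}}
branch = Branch(master, master_version=1)
branch.change({"entity": "categories", "action": "create", "id": 3, "payload": "Drones"})

# Someone else releases first, so master moves to version 2.
master["categories"][4] = "Boats"
locked_before = not branch.can_release(2)
branch.update_from_master(master, master_version=2)
```

After the update the branch contains both the new master rows and the user's own changes, and it becomes releasable again.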

To solve such problems, we have a tool known as the Garbage Collector. Its task is to keep the user’s branch consistent by deleting records that point to nothing. It runs each time master is pulled into a branch. When it finishes, the tool sends the user a Slack notification confirming that the operation completed successfully. Changes made by the Garbage Collector are marked with the is_auto (bool) property in the changelog and can be found in the UI using a filter.
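The cleanup step can be sketched as a pass that removes rows whose parent no longer exists. Only the is_auto flag comes from the description above; the attribute and category shapes and the function name `collect_garbage` are assumptions for illustration.

```python
def collect_garbage(categories, attributes, changelog):
    """Delete attributes whose category is gone and log each removal."""
    orphan_ids = [
        attribute_id
        for attribute_id, attribute in attributes.items()
        if attribute["category_id"] not in categories
    ]
    for attribute_id in orphan_ids:
        del attributes[attribute_id]
        changelog.append(
            {
                "entity": "attribute",
                "id": attribute_id,
                "action": "delete",
                "is_auto": True,  # marks the change as made by the tool
            }
        )
    return orphan_ids


# Category 2 was deleted in master while the branch still had an attribute for it.
categories = {1: "Cars"}
attributes = {
    10: {"category_id": 1, "name": "Mileage"},
    11: {"category_id": 2, "name": "Floor"},
}
changelog = []
removed = collect_garbage(categories, attributes, changelog)
```

Logging automatic deletions in the same changelog keeps them visible and filterable alongside the user's own edits.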

Testing changes

Before releasing the changes, the user may want to see how they look on Avito. To do this, the user can create a test bench for our monolith or for a specific service and run it against their branch. The backend then switches from master to the given branch and shows the result to the user.

Thus, we have addressed the last item in the specifications, i.e., testing our changes on the fly.

Cons of the approach

Disadvantages include the lack of a dev environment as such. One instance of Infomodel distributes data to the production environment and all dev environments that link to unreleased branches.

Another obvious disadvantage is that changing the data schema requires migrating every existing branch schema. Given how many branches exist, this sometimes causes serious problems.

Results

Over the system’s entire lifetime, we created 2226 branches. Of these, 2073 reached the release stage. On average, rolling out one change takes 2 days.