Metadata management system at Avito

Frol Kryuchkov
AvitoTech
Apr 12, 2021 · 19 min read

Every classifieds site at some point in its growth has to tackle the problem of systematizing, ordering, and organizing its metadata. Why is this such a big deal? To answer this question, we need to understand a couple of basic things about metadata and how it’s used.

The first piece is metadata

You have most likely already worked with metadata. I’ll give a short example so you can get the idea. Suppose you have a site where users can publish things they sell. To post an advertisement, a seller fills in specific fields such as title, category, price, location, etc. These fields, their possible list values, and the types of values users can fill in are metadata. Simply put, it’s data about data.

Picture 1. Metadata examples: title, category, price, description of an advertisement

The business value of metadata

After the seller has submitted an advertisement, he needs to pay for the listing. Those listing fees are usually based on information in the ad. For example, a car’s price can vary significantly depending on the year and the brand. Or it could be the location that influences the listing fees the most. It may not be obvious at first, but the company’s speed and flexibility in changing metadata can become a bottleneck when searching for monetization strategies through A/B tests.

Picture 2. Business usage of metadata

A less obvious example is when metadata helps to tune up SEO. It’s not a secret that the more organic traffic a website has, the better for the business. Search engine result pages (SERP) and advertisement pages are the most popular pages in classifieds. The ad pages are less interesting, so I will describe the SERP case.

There are millions of filter combinations the users can use. Each filter combination gives a unique SERP. And it’s not good for SEO when tons of users’ searches are scattered across tons of SERPs, because each page’s weight is low. To tackle this problem, canonical URLs emerged.

One unique canonical URL groups relatively similar pages and makes them count as one page for the search engines. For example, a user search for all cars older than five years will fall into one canonical URL “5-year-old-car”, even if the user specifies a particular brand or color. Or take the search criteria “a house in front of a beach”: an additional filter for cottage or apartment won’t matter; it’ll still be “in-front-of-a-beach.” These are simple examples, but I think you get the idea. Let me repeat: the faster your system allows you to adapt to the changing world, the better.
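
As a toy illustration of how filter combinations could collapse into canonical URLs, here is a minimal sketch in Go. The filter names and slugs are invented, and in our system such rules actually live in metadata (layouts) rather than in code.

```go
// A toy sketch: collapse raw search filters into a canonical slug.
// Filter names and slugs are hypothetical.
package main

import "fmt"

func canonicalURL(filters map[string]string) string {
	switch {
	case filters["max_age_years"] == "5":
		// Brand and color are deliberately ignored: all such searches share one page.
		return "/cars/5-year-old-car"
	case filters["location"] == "beachfront":
		// Cottage vs. apartment doesn't matter for the canonical page.
		return "/realty/in-front-of-a-beach"
	default:
		return "/search"
	}
}

func main() {
	fmt.Println(canonicalURL(map[string]string{"max_age_years": "5", "brand": "audi"}))
	fmt.Println(canonicalURL(map[string]string{"location": "beachfront", "type": "cottage"}))
}
```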

At this point you can apply the same pattern to many other areas:

  • what data a user needs to submit;
  • what metadata displayed in an advertisement leads to better CTR;
  • detecting fraudulent listings;
  • indexing;
  • price suggestion;
  • you name it.

Our problems and what we wanted to change

We have discussed the benefits of changing metadata rapidly. But what makes it challenging to change fast in the first place?

From picture 2 you can see that metadata cuts across all domains of the application. That’s why it’s tough to make a change in any part of it. Even a minor change in the type of an attribute can break the whole system. For example, changing one attribute in the list of values in the advertisement submission form can break the search indexing and monetization algorithms.

We wanted our different departments to easily make decisions regarding their metadata without interacting with other departments. At the same time, departments must keep metadata in sync between different domains when needed. For example, when we want to add a required field in a particular category without breaking the search indexing or removing fields used in listing fee calculations. Keeping that in mind, we settled on the functional requirements for our metadata management system, or infomodel, as we call it.

Functional requirements:

  • The solution must provide our department engineers and analysts with the ability to easily make metadata changes: adding, removing, and updating attributes and categories as well as their types, values, and options.
  • Changes to the metadata of one department’s space should not affect the metadata of other departments. If that’s not possible, we should warn users. Examples of such spaces: attributes for ad submission on desktop, iOS, and Android of different versions, and attribute spaces for search indexing, moderation, search filters, and visualization on various platforms.
  • Every department (domain space) must have the ability to run many versions of its metadata simultaneously, e.g. for A/B testing purposes.

Non-functional requirements:

  • The system must be easily horizontally scalable.
  • It must be memory efficient and fast at runtime.
  • It must be tolerant of version mismatches.
  • It must be network friendly.

The problem with the classic way of storing metadata

The easiest way to store metadata is to bake it into the database’s schema and hardcode it in the codebase. To show this, let’s proceed with our first example. After a user has submitted the advertisement, it will be stored in a database table with a schema like this:

Picture 3. A schema of the table for storing advertisements
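
For reference, such a schema could be expressed roughly like this; the exact column set is my assumption for illustration, the real one is in Picture 3.

```go
// A plausible sketch of the "classic" advertisements table, kept as a Go
// migration constant. Columns and types are assumptions, not Avito's schema.
package migrations

const createAdvertisements = `
CREATE TABLE advertisements (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT      NOT NULL,
    category_id INT         NOT NULL REFERENCES categories (id),
    title       TEXT        NOT NULL,
    description TEXT,
    price       NUMERIC(12, 2),
    location_id INT         REFERENCES locations (id),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);`
```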

Don’t get me wrong. It is a totally fine way to do your schema in this case, and I am pretty sure it’s good enough in most cases. However, in our case, we need to run several different categories. Each category is a whole vertical in our business, and every vertical team wants to experiment in its category by adding, removing, and changing columns.

We also want to do A/B testing by adding a new field in a particular category to see if users like it. That becomes a problem as soon as you need to change your schema to add a new field, which requires locking the whole table. That is quite difficult if you run a database with billions of advertisements. Even if you shard the database by category, it will still require you to run migrations, set default values, and so on. The second issue is that changing your schema requires a new commit for the database schema change and a new deployment of the service to run schema migrations. It’s not the agility we wanted.

Let’s imagine one possible way of organizing the database schema that wouldn’t require us to run migrations to deliver new attributes:

Picture 4. Imaginable solution for dynamic metadata around attributes of advertisements

Wow, there are five new tables instead of one, and a couple more are not shown for simplicity. But don’t be afraid. The idea is simple: we transform our columns into rows, and the rest are auxiliary tables needed to run the system. This approach is called the entity–attribute–value (EAV) model.

There are so many new tables mainly because our application is now responsible for enforcing the logical schema. In the old approach, the database was responsible for ensuring data consistency by using foreign keys. The EAV approach leads us to two problems:

  1. Our application needs to be responsible for the consistency of data.
  2. The runtime performance would suffer a lot with such a normalized schema.
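
To make the EAV idea from picture 4 more tangible, here is a stripped-down sketch of what such a schema could look like. The table and column names are my guesses, and the auxiliary tables from Picture 4 are omitted.

```go
// A minimal EAV sketch: columns become rows in advertisement_attributes, and
// auxiliary tables describe attributes, their types, and allowed values.
// Names and types are assumptions for illustration.
package migrations

const createEAV = `
CREATE TABLE categories (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE attributes (
    id          SERIAL PRIMARY KEY,
    category_id INT  NOT NULL REFERENCES categories (id),
    name        TEXT NOT NULL,
    data_type   TEXT NOT NULL              -- e.g. 'int', 'text', 'enum'
);

CREATE TABLE attribute_values (            -- allowed values for enum attributes
    id           SERIAL PRIMARY KEY,
    attribute_id INT  NOT NULL REFERENCES attributes (id),
    value        TEXT NOT NULL
);

CREATE TABLE advertisements (
    id          BIGSERIAL PRIMARY KEY,
    category_id INT NOT NULL REFERENCES categories (id)
);

CREATE TABLE advertisement_attributes (    -- one row per filled-in field
    advertisement_id BIGINT NOT NULL REFERENCES advertisements (id),
    attribute_id     INT    NOT NULL REFERENCES attributes (id),
    value            TEXT   NOT NULL,      -- type checks move to the application
    PRIMARY KEY (advertisement_id, attribute_id)
);`
```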

The other approach is to use a document-oriented way of storing attribute structure and data. We evaluated this approach and concluded that if we kept the properties and data in each advertisement document, it would cost us too much memory. And most importantly, dealing with old documents in the codebase is cumbersome.

We didn’t consider graph databases due to the lack of expertise in our company.

A concept of metadata management system

The basic ideas that went into our infomodel design are old and well proven — normalize for consistency, denormalize for performance. We followed this path and came up with two global components:

  1. The metadata management system with a friendly user interface for managing our variation of the EAV pattern.
  2. Frontend for the backend — high-performance microservices that execute operations on data at runtime. They use highly denormalized data to validate, prepare data for rendering, and serve other purposes.

After forming these global components, the two challenges of the EAV pattern remained: how to ensure data consistency at the application level and how to make it fast at runtime.

Picture 5. Core elements of infomodel

The core of the infomodel system consists of two main elements. The first one is a catalog of categories, attributes, values, and possible relations. The second is a layout, an abstraction composed of three other elements, which we will discuss later.

The catalog

The catalog is the normalized storage of categories, attributes, and values, kept in third normal form. It reflects the EAV pattern’s architecture discussed earlier, but only the attribute part of it. The catalog has a category list, which is the root for the other attributes. It’s a list of the categories our business runs, like auto rental, auto sales, real estate sales, real estate rentals, real estate short-term rentals, and others.

The next is a catalog of attributes. Attributes are properties of a category. For example, it can be a brand, a model, or a year of manufacture for car rentals. For real estate, it can be a city, a district, or a floor area. The values are all possible values of attributes of the enumerable type. For example, for the brand attribute of the auto category, possible values are audi, bmw, or ford.

Also, the catalog is responsible for all the possible relations between the attributes and their values:

Picture 6. Format of relations between attributes and values
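
As a rough mental model, the catalog entities and a relation record could be described like this; the real schema and field names are not shown in the article, so these are assumptions.

```go
// Hypothetical Go model of the catalog entities and relation records
// (field names are assumptions, not the actual infomodel schema).
package catalog

type Category struct {
	ID   int
	Name string // e.g. "auto selling"
}

type Attribute struct {
	ID         int
	CategoryID int
	Name       string // e.g. "brand"
}

type Value struct {
	ID          int
	AttributeID int
	Value       string // e.g. "audi"
}

// Relation links a value of one attribute to a value it makes possible in
// another attribute, e.g. brand=audi -> model=a4. The catalog stores every
// such possible pair, uncompressed.
type Relation struct {
	FromValueID   int
	ToAttributeID int
	ToValueID     int
}
```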

To explain why we still need to keep all the possible relations, we need to move to the second core component of the infomodel — the layout.

Layout

Layout is a composition of three different components.

The layout

The layout is a name for three manifests that describe a namespace’s behavior, structure, and form properties. The components of the layout tackle one problem that has two sides:

  • Ability to have a different representation of the same metadata in different namespaces.
  • Isolation of one namespace from the others, so that one team can change any properties of behavior, structure, and form properties without affecting other namespaces.

Relations are responsible for a particular structure of data in a namespace, or a layout, as we call it internally. It’s easier to show with the example from picture 7:

Picture 7. An example of two structures of relations of the same metadata

As you can see, there are two user stories where we use the same metadata in different arrangements.

The first case is submitting a new advertisement for a car. In that flow, the shortest way for a user to specify his vehicle would be first to choose a brand, then select a model of this brand, then set the car’s year of manufacture, and so on. As the user fills in the fields, there are fewer options to choose from. At some step, we can even fill in the rest of the fields automatically because only one option is left.

The other example is when a visitor searches for a car. They usually search with broader criteria. It’s easier for such users to fill in the brand and the model and choose from a list of 4–8 generations rather than pick specific years. In such scenarios, relations come into play. Using the catalog’s uncompressed structure from picture 6, we can set up any layout relations, as sketched below. This idea is also widely used in the validation of the user’s input data.
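
A hypothetical way to picture the two arrangements from Picture 7, assuming relations can be declared per layout as an ordered chain of attributes; the names are mine, not the real manifest format.

```go
// Two hypothetical relation structures over the same catalog: the submission
// flow narrows a car to an exact configuration, while the search flow works
// with broader generations. Names are illustrative only.
package layout

// RelationChain declares the order in which attribute choices narrow each other.
type RelationChain struct {
	Layout string
	Chain  []string // attribute names, from the broadest to the narrowest
}

var (
	submitCar = RelationChain{
		Layout: "new-adv-mobile",
		Chain:  []string{"brand", "model", "year_of_manufacture", "modification"},
	}

	searchCar = RelationChain{
		Layout: "search-web",
		Chain:  []string{"brand", "model", "generation"},
	}
)
```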

Form of fields. The next element of the layout is the form of fields. This component is a declarative way to describe the fields rendered to the user or used by internal backend services.

The form consists of a list of fields that link to particular relations and attributes. The important thing is that all the configurations of the fields are declared here. It means that semantically the same attribute could have different properties in different layouts. For example, we have a brand attribute, and when a user submits an advertisement, the field looks like a single-option select input. However, on a search form, it’s a multi-option field (see the sketch after the list below). The form is responsible for:

  • A list of fields of attributes on the form.
  • The link between the form’s fields and the actual attributes in the catalog.
  • The properties of fields and the form itself.
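
Here is a hypothetical sketch of a form manifest, just to show how the same catalog attribute can carry different field properties in different layouts; the manifest format, IDs, and names are assumptions.

```go
// Hypothetical form manifest structures: the same "brand" attribute is a
// required single-option select on the submission form and an optional
// multi-option field on the search form. IDs and names are invented.
package layout

type Field struct {
	AttributeID int    // link to the attribute in the catalog
	RelationID  int    // link to the relation that narrows its options
	Widget      string // "select", "multiselect", "text", ...
	Required    bool
}

type Form struct {
	Layout string
	Fields []Field
}

var (
	submitForm = Form{
		Layout: "new-adv-mobile",
		Fields: []Field{{AttributeID: 42, RelationID: 7, Widget: "select", Required: true}},
	}
	searchForm = Form{
		Layout: "search-web",
		Fields: []Field{{AttributeID: 42, RelationID: 7, Widget: "multiselect", Required: false}},
	}
)
```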

Rules. The last element of the layout is a rule manifest. It’s a declarative DSL (domain-specific language) for describing the behavior of the form’s fields. This component is responsible for showing/hiding, enabling/disabling, validating fields based on the whole form’s state, and even changing their properties and states.

You can see the result of this element when you choose the brand: it triggers a new state of the form in which we display the model attribute. Another example: when the chosen city has a subway, we show the attribute with subway stations. It’s worth mentioning that nobody writes these manifests by hand; the metadata management system produces them automatically from the user interface.
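
Since the real DSL isn’t shown in this article, here is only a guess at what one generated rule might look like once loaded into the rule engine.

```go
// A hypothetical representation of one generated rule: "when the chosen city
// has a subway, show the subway station attribute". The structure and field
// names are assumptions, not Avito's actual DSL.
package rules

type Condition struct {
	Attribute string
	Equals    string
}

type Action struct {
	Kind   string // "show", "hide", "require", ...
	Target string // the attribute the action applies to
}

type Rule struct {
	When Condition
	Then []Action
}

var showSubwayStations = Rule{
	When: Condition{Attribute: "city_has_subway", Equals: "true"},
	Then: []Action{{Kind: "show", Target: "subway_station"}},
}
```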

Versioning

Layouts are an excellent way to distinguish different platforms, domains, and departments. However, once each department has its own set of layouts, they quickly realize that they want to run several versions of the same layout simultaneously, for A/B tests or when we are stuck with old versions of mobile applications or internal services.

Conceptually, the implementation of layout versioning is not that different from a version control system like Git. It utilizes a branching system. We have entities that can change: catalogs and layouts (rules, relations, forms). We also know that we must be able to run in production as many versions of the same layout as we have A/B tests.

That leads us to the implementation where you have branches for different A/B tests. But to use a particular branch, including the main branch, you must release it. At the release moment, two main things happen.

First, the backend combines all the changes and dumps them into an efficient storage format that can be easily accessed at runtime. Second, a version tag is generated, by which that version of the layout can be accessed in production. To keep A/B test metadata up to date, you can merge the main branch. Merging the main branch into an A/B test branch is required because all the entities are baked in once you release a metadata version, including ones that you haven’t even touched. This happens because we decided to implement an append-only strategy, which requires a less sophisticated implementation.

Routing

What is routing, first of all? We have already discussed a lot of elements like versions and layouts. Routing was invented so that the client (mobile application, frontend browser, frontend service) could specify the layout and the version it wants to use.

Technically, routing is just a string by which other services can access layouts. It has the pattern {version}.{name of layout}.{category}. In real life, it looks like this: REAL-123.new-adv-mobile.13. The version tag usually corresponds to the Jira task in which the changes were requested. However, if you want to debug a layout in staging without releasing a new version, you may put dev.real-123 as the tag name; then all the specifications will be generated on demand. I won’t discuss how that’s done in this article because it’s a whole other topic.
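
Building and parsing such a route is straightforward; here is a small sketch (helper and type names are mine) that also accounts for version tags containing dots, like dev.real-123.

```go
// A minimal sketch of the {version}.{layout}.{category} route string.
// Helper and type names are invented for illustration.
package routing

import (
	"fmt"
	"strings"
)

type Route struct {
	Version  string // e.g. "REAL-123", or "dev.real-123" for on-demand staging builds
	Layout   string // e.g. "new-adv-mobile"
	Category string // e.g. "13"
}

func (r Route) String() string {
	return fmt.Sprintf("%s.%s.%s", r.Version, r.Layout, r.Category)
}

func Parse(s string) (Route, error) {
	parts := strings.Split(s, ".")
	if len(parts) < 3 {
		return Route{}, fmt.Errorf("invalid route %q", s)
	}
	// Everything before the last two segments is the version tag,
	// which may itself contain dots.
	return Route{
		Version:  strings.Join(parts[:len(parts)-2], "."),
		Layout:   parts[len(parts)-2],
		Category: parts[len(parts)-1],
	}, nil
}
```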

Picture 8. Routing

A less obvious observation here: the presence of the category in a route sets the maximum granularity of A/B tests. It’s done on purpose. Each department or business vertical has its own sandbox to run experiments without overlapping with other departments. However, all the conflicting A/B tests within one department must be arranged appropriately inside its routes.

One of the new metadata management system’s main goals was to make sure that we can easily make new changes, including backward-incompatible ones. This is not possible if the backend always serves the latest version, because that forces us to provide a new endpoint version for every breaking change, and it is tough for a huge architecture to move to the next version of an API. So we decided to reverse this paradigm and let clients specify the version they want.

To put it all together

Now that we have looked at all the components, it’s time to show the whole concept. To do so, I will use a prism analogy. At least it works for me, and I hope it will work for you too. So you have a golden source of your metadata: the catalogs. You also have layouts, which are sets of rules, forms, and relations. The idea is pretty simple: different layouts are like prisms that filter and change the representation and behavior of the golden source of catalogs.

Picture 9. Layouts as prisms for particular business domains

There are a few important things that layouts do and don’t do:

  • The layout is not responsible for the look of the forms. It contains the structure of the form, such as steps and field properties, and acts as a configuration for the frontend.
  • The layout does not always act as a visual representation of something. A layout can be a validation mechanism, a data representation for internal use, and the like. A layout can also act as a template engine for the canonical URLs of SERPs.
  • Different versions of the same layout may be used at the same time. That happens mainly with mobile apps, where old app versions use old APIs, and with A/B tests.

At this point we can go into details that are a bit more technical:

Picture 10. Component diagram of the metadata management system

There are three primary layers in the metadata management system: backend of the infomodel, frontend of the infomodel, and consumer services.

  • The backend is responsible for applying changes to metadata, uploading catalogs from external sources, and releasing new versions.
  • The frontend is responsible for real-time access to layouts, catalogs, and attributes. The most common use cases: validating a form for web or mobile, assembling the view of attributes for an advertisement, and assembling the form for rendering on web or mobile.
  • Consumer services implement business logic. They could be a frontend, a mobile application, or an internal service.

The infomodel backend

The backend of the EAV-based metadata management system looks more or less like a typical web application. It consists of a relational database, a single-page application, and several complicated ETL processes. Here you can manage catalog data, rules, and relations, create layouts and branches, and release new versions. However, some things are specific to our internal tooling, such as automatic testing of the whole system when applying changes to metadata. Every release of a new version of the infomodel triggers a bunch of E2E tests to ensure that our users can still submit or search advertisements in each category.

Picture 11. Single Page Application of backend of the infomodel

From a technical perspective, it is a complex web application with lots of domain logic and validators. They ensure that all layouts, relations, categories, and attributes still work after applying metadata changes. For example, we have to check that there are no cyclical dependencies, no unreachable states, and so on. A lot also happens with the generation of layouts on demand for debugging in a staging environment.

Implementation of versioning

It’s worth mentioning how versioning is implemented in the database. We did a lot of research about versioning and decided to go with the append-only log strategy. This strategy means that whenever we make even the tiniest change, for example, fixing a spelling error, we need to release a new version with a duplicate of all the entities: layouts, catalogs, etc.

To implement this versioning technique in the database, we decided to copy the whole Postgres schema with all entities for each new version. We didn’t find any disadvantages in this approach, except that it slows down the UI of tools that scan all the schemas to display and manage them. There are almost no limits on the number of schemas, and it doesn’t matter much anyway, because a per-version schema doesn’t live very long: its life cycle is capped to the duration of developing and testing that version. After a new version of the infomodel is ready and tested, we release it. But what does the release mean?
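
Before getting to the release itself, here is a rough sketch of the schema-per-version idea in Postgres. The statement shapes and names are my assumptions, and row copying would of course cover every entity table.

```go
// A rough sketch: each infomodel version gets its own Postgres schema,
// populated by copying entities from the branch it was created from.
// Schema and table names are assumptions; the version is assumed to be
// sanitized into a valid identifier.
package versioning

import "fmt"

func newVersionDDL(version string) []string {
	schema := fmt.Sprintf("infomodel_%s", version)
	return []string{
		fmt.Sprintf("CREATE SCHEMA %s;", schema),
		// One pair of statements per entity table (attributes, categories,
		// values, layouts, relations, forms, rules, ...):
		fmt.Sprintf("CREATE TABLE %s.attributes (LIKE infomodel_main.attributes INCLUDING ALL);", schema),
		fmt.Sprintf("INSERT INTO %s.attributes SELECT * FROM infomodel_main.attributes;", schema),
	}
}
```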

What happens at the release phase

We release a new version of the infomodel when we are sure that we have applied all the changes and are ready to launch it to production.

To launch the new version into production, we run validator services and a series of E2E tests. They verify the new version of the metadata to ensure that it won’t semantically break the user experience. The next step is to generate manifests from the database’s current state and dump them to the storage of static files. That’s it.

When clients request a released version of a layout, the frontend services go to this storage of manifests and prepare them to process the clients’ requests. These files are distributed by nginx with different levels of caching, for example, ETags. Each new release dumps a new version and never touches the old ones, as we went for append-only versioning.

However, when we want to test a specific version in staging, we don’t do all these steps. Instead, manifests are generated on demand. Manifests can be very large, because some contain hierarchical data dependencies, which can lead to a poor user experience as well as performance and memory issues in production. To tackle this, we split such manifests into hierarchical sections. This improves the speed of the on-demand generator and increases cache hits for manifests in production.

Release distribution

The format of the JSON files that get stored for the frontend to read and interpret will be discussed in the frontend part. For now, it’s worth mentioning that they never get deleted and that they are split in a granular manner for optimization purposes on the frontend side. The storage provides easy access to all the versions of the infomodel that have ever been released, so we don’t have to worry about somebody requesting an obsolete or old version: it will always be available. After the release is performed, we archive the database schema of that version.

Frontend

Frontend services are read-only databases with a built-in custom interpreter for their simplified DSL. There are three services: the layout service, the rule-engine service, and the URL-builder service. However, due to the whole system’s architecture, they share a lot of implementation details and properties. Let’s first see the component diagram for the frontend services:

Picture 12. Component diagram for the frontend services

It may look like the services act as a proxy in front of the storage. But they don’t, because the metadata files that sit in the storage don’t make sense on their own. First, you must “compile” the metadata files to run queries against them. Also, the services don’t have any write load, and they are truly stateless. That property gives us unbounded horizontal scaling (at least until the k8s clusters’ networks crack).

Picture 13. Request flow diagram

The request flow diagram shows us that the frontend services make external outbound requests only if the requested layout is not present in the cache. That is the only case when we need external requests; otherwise, the service replies from its internal in-memory cache. Outbound requests occur rarely. They happen when a new main branch of the infomodel is released, when new A/B tests are launched, or when the services are deployed. But first, we have to talk about caching.
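
A minimal sketch of that read path could look like this; the type and function names are mine, not the real services.

```go
// A simplified read path of a frontend service: answer from the in-memory
// cache; only on a miss fetch the released manifest from static storage and
// "compile" it into a queryable form. Names are illustrative.
package frontend

import (
	"context"
	"sync"
)

type Layout struct{ /* compiled, queryable representation of a manifest */ }

type Service struct {
	mu      sync.RWMutex
	cache   map[string]*Layout // keyed by route, e.g. "REAL-123.new-adv-mobile.13"
	fetch   func(ctx context.Context, route string) ([]byte, error) // manifest JSON from storage via nginx
	compile func(manifest []byte) (*Layout, error)
}

func (s *Service) GetLayout(ctx context.Context, route string) (*Layout, error) {
	s.mu.RLock()
	l, ok := s.cache[route]
	s.mu.RUnlock()
	if ok {
		return l, nil // the common case: no outbound request at all
	}

	raw, err := s.fetch(ctx, route) // the only outbound request in the flow
	if err != nil {
		return nil, err
	}
	l, err = s.compile(raw)
	if err != nil {
		return nil, err
	}

	s.mu.Lock()
	s.cache[route] = l // the eviction policy is discussed below
	s.mu.Unlock()
	return l, nil
}
```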

Cache eviction

How many layouts can an instance keep in memory simultaneously? In the real world, the answer is: it depends. The most impactful factor is the number of dependencies between attributes, which translates into the amount of RAM the service takes. The deepest Avito category, with a high number of attributes and interdependencies between them, is the auto category. Just look at the number of dependencies for the Acura car manufacturer alone:

Picture 14. Number of nodes and dependencies between attributes

At this point, it is clear that we can’t afford to keep all the layouts in memory. We can’t do that because of the huge size and the constantly changing number of layouts due to A/B tests and incremental development of the system. We have to go with a cap on the number of layouts that we can keep in memory at the same time:

Picture 15. In-memory cache slot usage. The green line is the release of a new version of the service

However, in our case, we can’t simply use an LRU or LFU strategy for the eviction of layouts. The reason is the disproportionate usage of different layouts. For example, requests for the layout that validates a new advertisement submission happen a couple of orders of magnitude more rarely than requests for the layouts that show attributes on the advertisement page.

To mitigate that problem, we went for an ARC (adaptive replacement cache). An ARC keeps track of both the frequency and the recency of usage of a particular layout. This helps not to evict layouts that get relatively few requests per second but are still very important. An example of such a layout is the new ad submission one: it has a low number of requests but is very important to keep in the cache, because we can’t afford to miss a user’s form submission.
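
One possible way to wire this up is to wrap an existing ARC implementation, for example the one in hashicorp/golang-lru (v1 API); this is a sketch under that assumption, not the actual implementation.

```go
// A sketch of capping the number of compiled layouts in memory with an ARC,
// using github.com/hashicorp/golang-lru (v1). The Layout type stands for the
// compiled representation from the previous sketch.
package layoutcache

import (
	lru "github.com/hashicorp/golang-lru"
)

type Layout struct{} // placeholder for a compiled layout

type Cache struct {
	arc *lru.ARCCache
}

// New caps how many compiled layouts are kept at once. ARC balances recency
// and frequency, so a rarely requested but critical layout (like ad
// submission) is not pushed out by a burst of traffic to other layouts.
func New(maxLayouts int) (*Cache, error) {
	arc, err := lru.NewARC(maxLayouts)
	if err != nil {
		return nil, err
	}
	return &Cache{arc: arc}, nil
}

func (c *Cache) Get(route string) (*Layout, bool) {
	v, ok := c.arc.Get(route)
	if !ok {
		return nil, false
	}
	return v.(*Layout), true
}

func (c *Cache) Add(route string, l *Layout) {
	c.arc.Add(route, l)
}
```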

Cache warmup

Whenever we deploy a fleet of frontend service instances to production, they spin up with an empty cache. Warming up on real user requests is a bad experience for our consumers, because it can end with a failed request for submitting a new ad. So we came up with a warm-up strategy.

Each service knows what it has in its in-memory cache, so every instance dumps the list of layouts it holds in memory to a Redis cluster.

Picture 17. The process of warming up in-memory cache

The deployment process looks like this:

  • We use a rolling update strategy, which allows us to spread the load of fetching layout specs.
  • The new instance goes to the Redis cluster and gets the list of layouts that are currently in use. After that, it confirms its own successful deployment by returning 200 on the health check to the k8s load balancers.
  • The service goes through the list and loads layouts the same way it does on cache misses. If something goes wrong, the instance skips this process.
  • After all of that, the service tells k8s that it is ready to process requests.

Picture 18. Cache warmup before instances get requests

The green dotted line in the picture shows when new instances started getting requests from the load balancer. The peak before that shows how much time it took to warm up a particular layout. It takes about four minutes to launch approximately 60 instances.
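
Stripped of details, the warm-up described above could be sketched like this with go-redis; the key name, interfaces, and error handling are assumptions.

```go
// A sketch of the cache warm-up: live instances periodically publish the
// routes they hold in memory to a shared Redis set; a freshly deployed
// instance preloads those routes before reporting readiness to k8s.
// Key names and interfaces are assumptions.
package warmup

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

const hotLayoutsKey = "frontend:hot-layouts" // hypothetical Redis set of routes

// loader abstracts "load this layout into the in-memory cache", i.e. the same
// path used on a cache miss.
type loader interface {
	Load(ctx context.Context, route string) error
}

// publishHotLayouts is run periodically by every live instance.
func publishHotLayouts(ctx context.Context, rdb *redis.Client, routes []string) {
	if len(routes) == 0 {
		return
	}
	members := make([]interface{}, len(routes))
	for i, r := range routes {
		members[i] = r
	}
	rdb.SAdd(ctx, hotLayoutsKey, members...)
	rdb.Expire(ctx, hotLayoutsKey, 24*time.Hour)
}

// warmUp runs on start-up, before the readiness probe starts returning OK.
// Warm-up is best effort: on any error the instance simply skips it.
func warmUp(ctx context.Context, rdb *redis.Client, l loader) {
	routes, err := rdb.SMembers(ctx, hotLayoutsKey).Result()
	if err != nil {
		log.Printf("warm-up skipped: %v", err)
		return
	}
	for _, route := range routes {
		if err := l.Load(ctx, route); err != nil {
			log.Printf("warm-up of %s failed: %v", route, err)
		}
	}
}
```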

Conclusion

A metadata management system is an essential part of a high-load classifieds site. In our case, it helps us run A/B tests of any metadata changes, tune up SEO, and dramatically decrease time to market for launching metadata-related features. It took us a lot of work to build the current system, and we’re still looking forward to new improvements. Especially since the system has generated many requests from the internal teams who actively use it daily, there is plenty of work left to do!
