Take a look at the initial steps of Adevinta's journey in developing our data catalogue, and learn about the main challenges and the technology we're using to power this work.
By: Oscar Ompre, Product Manager
Adevinta’s data ecosystem
In my role as Product Manager for Adevinta's DataHub, I'm part of the team building our data catalogue. This is a key undertaking for Adevinta, so I want to share our experience of this process and how we're tackling the different issues around data discovery.
Before I describe the first phase of our journey, I’ll provide some context, just in case you’re not familiar with Adevinta.
Who we are
At Adevinta, we believe everything and everyone has a purpose in life. Our digital brand portfolio unlocks the full value in every person, place and thing by creating perfect matches on the world’s most trusted marketplaces.
What we offer
Adevinta’s products and services include both generalist classifieds sites and specialist real estate, vehicle and job sites. For our professional sellers and listers, we offer professional tools and data services to help them boost their businesses.
Autonomy of our marketplaces
Each marketplace is run by many independent product and operations teams that manage and improve its portal. The marketplaces are independent of one another and have the autonomy to serve their own customers. This autonomy means that each team can decide which stack of tools it builds its solutions with, so each marketplace has a set of solutions tailored to its needs.
Global Teams
At the same time, Adevinta’s Global Teams offer products and services that are available to all marketplaces, and each marketplace can decide which ones are useful. An example would be the ad recommendation systems or the chat component so that buyers and sellers can contact each other.
Local vs global
This setup of marketplaces distributed across different countries, alongside the Global Teams, creates a number of challenges. While Global Teams try to make their tools usable by as many teams as possible, local teams can choose to use global products or go for tailor-made solutions.
Data landscape
In terms of data, this dynamic between local and global teams brings with it a diversity of tools driving each of the solutions on offer. Over time, this has led to a proliferation of platforms and a divergence in the tools used across the company. That divergence is a challenge when it comes to developing scalable ways of working that simplify data discovery.
This divergence also causes difficulties in enabling collaboration between the different global and local teams, as each has its own procedures and tools for sharing data and extracting value from it. This means that, as a company and in our individual job roles, we’re not able to take full advantage of the data we hold and the potential it has to make a positive impact in our work — something we’re determined to change.
Data discovery
Earlier, we mentioned the term "data discovery." We use this term to refer to an ever-evolving process in which data practitioners need to understand where the necessary information resides, and how to access and connect to these data sources in order to explore the data. Let's imagine a data scientist who wants to create a machine learning (ML) model for a certain marketplace. They need to be able to get hold of a dataset that's useful for this purpose. In the current context, finding such data is complex, which makes the data discovery process difficult to carry out.
Data catalogue
Our purpose in building and developing a data catalogue is to make data discovery as efficient as possible for our users. To that end, the data catalogue aims to be the central point in Adevinta where users can go to browse and find the information that’s relevant to their work, with detailed information on each of the datasets. In addition, users need a simple access management system so that they can access the data easily once they’ve found it.
Our data catalogue is a functional part of our DataHub, which is our main product to explore, access, store and share data within Adevinta.
How have we developed the data catalogue so far?
If the main objective is to bring together in one place all the information distributed throughout Adevinta, then someone has to go and find that information and make it available in the data catalogue.
To achieve this, we've made use of an open-source project offered by Acryl Data for the backend. This system provides a variety of connectors for different data sources such as Kafka, S3, Athena, Glue, Hive, Databricks, BigQuery, Redshift, dbt, Snowflake, Looker, Tableau and Datadog, among many others. The team's contributions to this open-source project have been numerous throughout the year.
Each of these connectors allows us to ingest the entities that are part of the source into the catalogue. These entities vary depending on the data source; for example, they can be datasets, tables, databases and projects, among other things.
Each connector runs hourly to obtain new information from its data source and keep the catalogue up to date. When executed, a connector extracts what we call metadata from the information residing in the source.
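To give a flavour of what one of these runs looks like, here's a minimal sketch of a connector execution using the open-source project's Python library (acryl-datahub). The connector type, its configuration and the server address are placeholders for illustration, not our actual setup, and the exact options depend on the connector and library version.

```python
# Minimal sketch of one scheduled connector run using the acryl-datahub
# Python library. All values below are illustrative placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",  # any supported connector: bigquery, redshift, kafka, ...
            "config": {"aws_region": "eu-west-1"},  # connector-specific options
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "https://catalogue.example.internal:8080"},  # hypothetical endpoint
        },
    }
)
pipeline.run()                # extract metadata from the source and push it to the catalogue
pipeline.raise_from_status()  # surface any ingestion errors
```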
What is metadata?
Metadata is information about a particular entity. In the catalogue, we don't ingest the information stored in the different databases; instead, we collect the information that tells us a database exists, that it was created and updated on a certain date, and that it contains tables and columns. It's the information that allows us to identify each of the entities in the catalogue.
The metadata defines the attributes that matter most to the catalogue, such as the entity's name, its description, which platform it comes from, who manages it, who owns the data, what the data schema is, which columns it has, when it was last updated, what the quality of the data is, where the data comes from and much more.
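As a concrete (and simplified) example, the snippet below sketches how one small piece of metadata, a dataset's name and description, could be pushed to the catalogue backend with the same open-source library. The platform, dataset name, custom property and endpoint are invented for illustration.

```python
# Illustrative only: pushing a dataset's descriptive metadata to the catalogue.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("https://catalogue.example.internal:8080")  # hypothetical endpoint

# Identify the entity: platform + name + environment make up its URN.
dataset_urn = make_dataset_urn(platform="hive", name="marketplace_db.orders", env="PROD")

# Describe it: one of the many metadata aspects the catalogue stores.
properties = DatasetPropertiesClass(
    name="orders",
    description="Daily snapshot of orders placed on the marketplace.",
    customProperties={"owner_team": "data-platform"},  # illustrative attribute
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
```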
Making the data marketplace more consumer friendly
We can think of the data catalogue as a marketplace, but for data. Our objective is to facilitate the link between buyers and sellers, or in our case, between data consumers and producers.
But our efforts aren’t solely focused on the creation of these connectors. If information is not correctly presented to consumers, it’s logical to assume that they won’t be able to accurately identify all the entities present. Currently, there are more than 65,000 entities available in the catalogue — the number changes daily. To make life easier for our consumers, we’ve developed a UI with different search capabilities. It offers a search component with different filters, which allows users to execute text-based searches based on the name and description of the entities.
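Under the hood, a text-based search is essentially a query against the catalogue's API. The sketch below shows roughly what that could look like against a DataHub-style GraphQL endpoint; the URL, token and exact query shape are assumptions for the example rather than a description of our production setup.

```python
# Hedged sketch: a simple text search against a DataHub-style GraphQL endpoint.
import requests

GRAPHQL_URL = "https://catalogue.example.internal/api/graphql"  # hypothetical URL

SEARCH_QUERY = """
query search($text: String!) {
  searchAcrossEntities(input: {query: $text, start: 0, count: 10}) {
    total
    searchResults {
      entity { urn type }
    }
  }
}
"""

response = requests.post(
    GRAPHQL_URL,
    json={"query": SEARCH_QUERY, "variables": {"text": "orders"}},
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder credential
    timeout=30,
)
response.raise_for_status()

results = response.json()["data"]["searchAcrossEntities"]
print(f"{results['total']} matching entities")
for hit in results["searchResults"]:
    print(hit["entity"]["urn"])
```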
We have also developed different paths to clearly show what data exists for each of the marketplaces and the different teams. In the same way, you can navigate through the platforms that are part of the catalogue.
One of the main functionalities of the catalogue is lineage. It allows you to see where the information in a dataset comes from and whether it is then consumed by another entity, which helps users build a better understanding of the data landscape.
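For a sense of how lineage is captured, here's a small sketch of declaring that one dataset is derived from another, again using the open-source library's Python classes; the dataset names and endpoint are made up for the example.

```python
# Illustrative only: declaring that one dataset is derived from another.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter("https://catalogue.example.internal:8080")  # hypothetical endpoint

upstream = make_dataset_urn(platform="hive", name="marketplace_db.orders", env="PROD")
downstream = make_dataset_urn(platform="hive", name="marketplace_db.orders_daily_report", env="PROD")

# Attach an "upstream lineage" aspect to the downstream dataset.
lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)]
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream, aspect=lineage))
```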
Simple data access — we’re working on it
Finally, the functionality we're currently focusing on is a data access system. Our goal is to enable consumers to find the information they need, easily request access from the producers and then make use of the available data. To do this, we need a request system so that producers are aware when a user has requested access to their data and can manage the request. This system is already implemented for certain platforms; for others, we're iterating week by week to provide this type of functionality.
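To give an idea of what such a request system needs to track, here's a purely hypothetical sketch of an access-request record; the field names are invented for illustration and don't reflect our internal implementation.

```python
# Hypothetical sketch of an access-request record; not Adevinta's actual schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class RequestStatus(Enum):
    PENDING = "pending"     # waiting for the producer to review
    APPROVED = "approved"   # consumer can now access the data
    REJECTED = "rejected"   # producer declined the request


@dataclass
class AccessRequest:
    dataset_urn: str        # catalogue entity the consumer wants to use
    requester: str          # consumer asking for access
    producer: str           # owner of the data who reviews the request
    justification: str      # why the access is needed
    status: RequestStatus = RequestStatus.PENDING
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```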
Once consumers have the necessary permissions to access the data, they can make use of other tools in DataHub to share this data from one place to another, manage it, transform it, create ML models, analyse it and do so much more in order to extract value from the data.
In summary, we’ve made great progress with our data catalogue journey so far, and we know what we need to do next. If you are interested in sharing more specific information regarding a certain part of the process, don’t hesitate to contact us.