Trendyol’s Data Catalog Journey Series

Ayça Kaypmaz
Published in Trendyol Tech
7 min read · Mar 6, 2023

Data Catalog Series #1

At Trendyol, we run an ODS & DWH platform in the cloud. We extract data from source systems and load it into the ODS layer for near real-time querying. At the DWH layer, we transform the data into the eDWH and, after enriching it, prepare the data mart (publisher) layer for our business users.

Figure 1 : BI Data pipeline

Our business users do their analyses either by writing SQL queries through the database UI or via the BI tool. They need to review data, understand its structure, content, and interrelationships, and also want to know about its outliers and data quality.

Why did we need a Data Catalog?

Users have been asking the "DWH Team" questions about data via Slack, such as:

"From which source can I report this KPI?"

"Does this KPI include return sales orders?"

To answer these questions, the team has been checking various platforms (scheduler, database, BI tool, and execution logs) or asking data-experienced colleagues and waiting for their answers, which was another waste of time.

These Q&As were not stored in any inventory, so the same questions were asked repeatedly. Answers were not linked to the related assets. Moreover, answers were lost once Slack's retention period expired.

So we decided to build a data catalog platform for the data in our BI systems. Instead of investing a huge budget in a licensed product, we chose an open-source data catalog platform.

The most important ingredient of a data catalog is content. While integrating a new data set into the BI platform, the DWH team creates Google Sheets documents at the analysis stage that include the descriptions of tables and columns, the data load frequency, and the SQL code. As a first step, we completed these documents. We then uploaded all of this documentation to the catalog and automated the upload schedule.

Figure 2 : Source-to-target mapping Google Sheets document
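The upload step can be sketched as a small parser that turns an exported mapping sheet into per-table payloads for the catalog's ingestion API. The CSV layout and field names below are illustrative assumptions, not our actual sheet schema:

```python
import csv
import io
from collections import defaultdict

def parse_mapping_sheet(csv_text):
    """Group a source-to-target mapping export (CSV) into per-table
    metadata payloads ready to push to a catalog ingestion API."""
    tables = defaultdict(lambda: {"columns": [], "load_frequency": None})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = tables[row["table"]]
        entry["load_frequency"] = row["load_frequency"]
        entry["columns"].append(
            {"name": row["column"], "description": row["description"]}
        )
    return dict(tables)

# Hypothetical export of the mapping document
sheet = """table,column,description,load_frequency
orders,order_id,Unique id of the order,hourly
orders,order_status,It holds the status value of the order,hourly
"""

payloads = parse_mapping_sheet(sheet)
print(payloads["orders"]["load_frequency"])  # hourly
```

Running a parser like this on a schedule (rather than uploading by hand) is what keeps the catalog's descriptions from drifting away from the sheets.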

What were the product capabilities expected from the data catalog tool?

· A single, searchable platform was needed to merge all the metadata we had: documentation of all data sets, KPIs, data marts, data freshness, current data quality control outputs, and data lineage.

· Search capabilities had to be flexible. Users should be able to search with keywords (AND / OR) and to filter on the platform (e.g., only DB tables, or only the BI platform).

· We also wanted a system capable of collaboration. For example, if an item in the catalog does not cover a business need in enough detail, users should be able to annotate it, and questions asked about a specific item should be attached to that item so that other users can see the questions and answers asked before.

· Apart from descriptions, it should be possible to tag data with specific subjects to improve search results.
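As a toy illustration of the AND/OR keyword search and asset-type filtering described above (the asset fields and names are hypothetical; a real catalog would use a proper search index, not a linear scan):

```python
def search(assets, keywords, mode="and", asset_type=None):
    """Keyword search over catalog assets: 'and' requires every keyword
    to match, 'or' any; asset_type optionally narrows the result set
    (e.g., only DB tables or only BI reports)."""
    results = []
    for asset in assets:
        if asset_type and asset["type"] != asset_type:
            continue
        text = (asset["name"] + " " + asset["description"]).lower()
        hits = [kw.lower() in text for kw in keywords]
        if (mode == "and" and all(hits)) or (mode == "or" and any(hits)):
            results.append(asset["name"])
    return results

# Hypothetical catalog entries
assets = [
    {"name": "fct_orders", "type": "db_table",
     "description": "Order level sales facts including returns"},
    {"name": "sales_dashboard", "type": "bi_report",
     "description": "Daily sales KPIs"},
]

print(search(assets, ["sales", "returns"], mode="and"))   # ['fct_orders']
print(search(assets, ["sales"], asset_type="bi_report"))  # ['sales_dashboard']
```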

What have we done?

We aimed for this project to be a business-driven initiative. We identified a specific business team and asked them what kind of data catalog they would expect to use. Given a blank page, we saw that they did not buy into the product offer: they could not picture what the data catalog product would be or how they would benefit from it.

So we continued to integrate the metadata without getting business product requirements.

After integrating the descriptions and data load frequencies of the tables, we created a business glossary to standardize the business descriptions.

We also prepared a "Frequently Asked Questions (FAQ)" document for our business users. It holds solutions to common problems and questions, e.g., connection problems or questions like "Which data set includes the price change history of listing data?" We linked this document to the landing page of the tool.

After filling in the descriptions, the data catalog was ready for users.

To increase engagement and create awareness of the data catalog product, we took three actions.

  1. We identified business heroes with the potential to use the catalog. We organized meetings to hear their product feature requirements, asking them, "What would you require to use the data catalog in your daily routines?" After checking the feasibility of the requirements, we prioritized them and, within agile sprints, started to create the MVP and continued integrating them.
  2. As mentioned before, Slack is used heavily for internal communication, and users ask their questions via Slack channels. So for these channels, we created bots that reply to the owner of the question with a reference to the data catalog. When someone posts a question, the bot answers: "Have you searched your question within the data catalog?"
  3. After monitoring user behavior following the bot's response, we saw that business users ignored the message and kept asking their questions, so we changed our approach. If a question's answer can be found in the data catalog, instead of answering directly, we now point users to the data catalog and explain how to search for the answer. If they still cannot find it, we ask them to fill out a survey describing their experience with the catalog and their suggestions for improving the tool.
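The bot's nudge logic can be sketched roughly like this. The catalog URL, term list, and substring matching rule are simplified assumptions for illustration, not our production bot:

```python
def catalog_nudge(question, catalog_terms, catalog_url):
    """Draft the bot reply for a Slack question: if the question mentions
    a term documented in the catalog, point the user at a search link
    instead of answering inline; otherwise leave it to a human."""
    matched = [t for t in catalog_terms if t.lower() in question.lower()]
    if matched:
        return ("Have you searched your question within the data catalog? "
                f"Try: {catalog_url}/search?q={matched[0]}")
    return None  # no catalog match; a human should answer

reply = catalog_nudge(
    "Does this KPI include return sales orders?",
    catalog_terms=["return sales", "order_status"],
    catalog_url="https://catalog.example.com",  # hypothetical URL
)
print(reply)
```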

What are we planning next?

I would like to mention the items in our backlog.

Enriching dimension column descriptions with the possible list of values:

The data mart tables contain dimension columns, and as a description we write the description of that dimension. E.g., for the order_status column the description is: "It holds the status value of the order."

Users want to see all the possible values this column may take. So we are planning to develop code that enriches the description of a dimension column with the values found in the dimension's content. For the order_status column, the enriched description will read:

"It holds the status value of the order. The possible values are Delivered, Shipped, Ready-To-Ship, Picking, Invoiced…"
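A minimal sketch of that enrichment step, assuming the distinct values have already been fetched (e.g., with a SELECT DISTINCT query against the dimension):

```python
def enrich_description(base_description, distinct_values, max_values=10):
    """Append the observed domain of a dimension column to its base
    description; truncate long value lists with an ellipsis."""
    shown = distinct_values[:max_values]
    suffix = "..." if len(distinct_values) > max_values else ""
    return (f"{base_description} The possible values are "
            f"{', '.join(shown)}{suffix}")

values = ["Delivered", "Shipped", "Ready-To-Ship", "Picking", "Invoiced"]
print(enrich_description("It holds the status value of the order.", values))
```

Capping the number of listed values keeps high-cardinality dimensions from flooding the description field.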

Links to ERD diagrams:

We have ERD diagrams of the data in our BI systems. Business users who write their own SQL need to understand the relationships between tables. Within the data catalog, they will be able to follow links to the ERD diagrams related to the table or column they are investigating.

Data Freshness:

With our data observability tool, we monitor whether the data on the BI platform is refreshed at the predefined frequencies. We will be updating the data catalog with the last refreshed time of the data.
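Conceptually, the freshness check compares the last refresh time against the expected load frequency; a simplified version of the status we would publish to the catalog (thresholds and status labels are assumptions):

```python
from datetime import datetime, timedelta

def freshness_status(last_refreshed, expected_frequency_minutes, now=None):
    """Return 'fresh' if the data was refreshed within its expected
    load frequency, 'stale' otherwise."""
    now = now or datetime.now()
    age = now - last_refreshed
    limit = timedelta(minutes=expected_frequency_minutes)
    return "fresh" if age <= limit else "stale"

now = datetime(2023, 3, 6, 12, 0)
print(freshness_status(datetime(2023, 3, 6, 11, 30), 60, now=now))  # fresh
print(freshness_status(datetime(2023, 3, 6, 9, 0), 60, now=now))    # stale
```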

Handing over documentation responsibility to data producers:

As part of our data strategy, we plan to hand over the data responsibility of source-aligned data products to data producers. In terms of organization, responsibility will be handed over to the product owners of the data, whose department is not Data Management. This responsibility includes all data-related activities within the data catalog.

Limitations of our data catalog platform

The features we cannot deliver due to the limitations of the data catalog tool are as follows:

Users want to search the data catalog only within the items they have access to.

For example, if a business user is not authorized to access payment data, they do not want to see payment-table-related results while interacting with the data catalog; they say such results make the search output lose focus. On the other hand, those results let them discover different kinds of datasets and, if needed, open a permission ticket to access one.

Users also want to see the most popular items at the top of their search results.

Challenges and things to consider while integrating a data catalog tool

For companies planning to start a data catalog project, there are a few recommendations we would like to share:

  1. Handle the data catalog platform as a product. Work with the business and assign a product owner to the project. For higher engagement, always consider business requirements and focus on MVPs.
  2. Users do not want to use yet another platform while doing their daily jobs. Integrate as many tools and as much data into the platform as possible, so that instead of using all the tools separately, users will prefer a single platform to search and query metadata.
  3. Last but not least, one of the most important factors defining the success of the project is to define and monitor KPIs. "Weekly Active Users" is one of the strongest KPI candidates: since the data catalog is where users search for "what they do not know", they tend to visit the tool a few times a week for new requirements rather than daily.
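For illustration, a Weekly Active Users KPI can be computed from simple visit events like this (the event shape and user names are hypothetical):

```python
from datetime import date

def weekly_active_users(events):
    """Count distinct users per ISO week from (user_id, date) visit
    events -- the kind of engagement KPI we track for the catalog."""
    weeks = {}
    for user_id, day in events:
        iso = tuple(day.isocalendar()[:2])  # (year, week)
        weeks.setdefault(iso, set()).add(user_id)
    return {week: len(users) for week, users in weeks.items()}

events = [
    ("ayse", date(2023, 3, 6)), ("mehmet", date(2023, 3, 7)),
    ("ayse", date(2023, 3, 8)),  # repeat visit, same week
]
print(weekly_active_users(events))  # {(2023, 10): 2}
```

Counting distinct users per week (rather than per day) matches how the catalog is actually used: occasionally, whenever a new question comes up.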

In conclusion, we have not completed our data catalog integration yet. We are still learning from every new integration and implementation of the tool. We understand that it is a journey and that the catalog is a fundamental platform for understanding and using data efficiently.

We will be sharing our experiences in the upcoming articles of this series.
