#1 Data Discovery: how to choose your Data Catalog and why you do need one

Anas El Khaloui
Published in hipay-tech
6 min read · Apr 11, 2022

Hey everyone!
We want to share with you a cool way of deploying a data discovery tool at your company or startup, and why it's probably a good idea, especially if you have a large number of cloud and on-premises resources.

As data consumers, we needed a convenient way to explore and understand ALL the available data, in every single internal database or tool. This requirement raised a number of challenges, and many questions had to be answered before we reached a satisfying result. This first article briefly covers the emergence of the need for such a tool, and the decision-making process behind the choices we made.

I hope that reading this will save you some precious time, and will give you some clarity about how to boost data usage in your company 👍

Data at HiPay

Our mission at HiPay is to provide businesses with a powerful all-in-one payment platform. On top of that platform, we build a variety of analytical products to help our customers understand their business, anticipate, and decide. We also strive to provide executives with strategic decision-making tools and ways to automate their business processes.

Given the nature and complexity of securing payment processing for virtually every method (from VISA to Apple Pay), across all currencies and banks, we do have a jungle of data sources, both on-premises and hosted on SaaS/Cloud services.

Your company’s data, a Terra Incognita to be explored — Photo by Jeremy Bezanger on Unsplash

Recently, as our systems are moving faster towards modularity and data-centricity, we’ve seen a steep growth in the amount of data assets we generate and store. We can sense lots of value behind this mountain of data, but we have to admit that it’s still barely explored and of course highly underexploited.

As a consequence of this evolution, and just like many tech companies, our data landscape is mutating into a galaxy of data sources managed by engineering and data teams, instead of a few big, centrally managed PostgreSQL servers.

Let’s start with our needs

Anticipating this inevitable evolution, and knowing how much data entropy can come with engineering flexibility, we decided to make a move during 2021 and get ourselves a data discovery tool. Here’s what we needed:

  • First, clear data ownership: who is responsible for what? Who can answer questions about a particular field or table?
  • Data segmentation: being able to group data assets into domains using tags and labels,
  • Rich metadata: we certainly want a unique entry point giving a complete macro-vision of our data landscape, but we also want the ability to dive deep and get a precise description of any given field or table: what is it exactly, and where can it be found?
  • Collaborative: teams are autonomous regarding their data, and they carry the responsibility of documenting it. Everyone should be able to contribute a new piece of information,
  • Supports a large array of data sources: BigQuery and PostgreSQL were non-negotiable; Elasticsearch, Salesforce, and MySQL are nice to have. More is better, as we are building things for the long term,
  • Compatible with our existing environment and practices: preferably Python-based and able to run on-premises, as our cloud migration is still in progress and many of our back-end databases sit in data centers on private networks,
  • Automation: this one can seem obvious, but recurring tasks are boring and easily forgotten. We must automate everything that can possibly be automated (like metadata ingestion), in order to keep information up to date.

To sum it up: we need to build a single, comprehensive data inventory and metadata source, maintained collectively. You could call it a data catalog, but it's actually more than that!
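To make the requirements above concrete, here is a toy, stdlib-only sketch of what one catalog entry could look like. The field names are illustrative (they are not Amundsen's actual schema), but they map directly to the needs: ownership, segmentation by tags, and rich per-asset metadata.

```python
from dataclasses import dataclass, field

# Hypothetical, minimal catalog record; field names are illustrative,
# not Amundsen's actual data model.
@dataclass
class TableRecord:
    name: str            # e.g. "payments.transactions"
    owner: str           # clear data ownership: who answers questions
    description: str     # rich metadata, maintained collaboratively
    tags: list = field(default_factory=list)  # segmentation into domains

catalog = [
    TableRecord("payments.transactions", "team-payments",
                "One row per processed payment.", ["payments", "core"]),
    TableRecord("crm.accounts", "team-sales",
                "Merchant accounts synced from the CRM.", ["sales"]),
]

def by_tag(tag):
    """Group data assets into domains using tags."""
    return [t.name for t in catalog if tag in t.tags]

print(by_tag("payments"))  # ['payments.transactions']
```

A real tool stores thousands of such records and keeps them searchable; the point here is only that every requirement from the list becomes a field someone must fill in and maintain.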

Navigating an ocean requires at least a map and a compass (or a GPS!). The same goes for a company's data! — Photo by Denise Jans on Unsplash

The ultimate goal here is to enhance all data-related tasks:

  • Onboarding new recruits on the data ecosystem,
  • Data Science/Engineering and Business Intelligence/Analytics projects,
  • Helping with audits, data governance & compliance (GDPR, PSD2, etc.). This one is a must-have if you operate in a heavily regulated industry like we do,
  • Helping with legacy code refactoring and cloud migration tasks.

Choosing a solution

We like to keep it simple, so we first thought: “Hey, how about a big ol' spreadsheet?” — and then we discovered the marvelous world of data discovery apps with shiny UIs and automated metadata ingestion.

As for any software product need, we had three possibilities:

  • Buy: there are a number of cool-looking SaaS products on the market, like Zeenea and Collibra. But we wanted the long-term flexibility of standard formats & APIs, and we wanted to be able to self-host the solution,
  • Build: building a tailor-made app is possible and could be fun, but it comes at the expense of technical debt and more code to maintain. Also, it's definitely not our core skill,
  • Open source, but customized: this is the usual go-to at HiPay. We love open source! We took a close look at Magda, DataHub, Metacat, Databook/OpenMetadata, and Amundsen.

For the first iteration of the project, our choice was Amundsen, for many reasons. Here are some elements that drove the decision:

  • It's based on a mix of Python scripts and popular micro-services running in Docker: Neo4j for metadata storage, Elasticsearch for indexing and search,
  • A vibrant community on GitHub and Slack. This has been very helpful!
  • Works for all our data sources, CSV files, and way more. Basically, Amundsen can consume metadata from any source that exposes a REST API or is compatible with SQLAlchemy,
  • Integrates with Apache Airflow for automation,
  • Beautiful, rich UI and awesome search capabilities (check it out),
  • Supports user profiles and authentication through OIDC (OpenID Connect), an identity layer on top of OAuth2.
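The automated-ingestion point is the heart of it: the tool harvests table and column metadata from the sources themselves, rather than relying on anyone to type it in. Here is a stdlib-only sketch of that extract step, with an in-memory SQLite database standing in for a real SQLAlchemy-compatible source (Amundsen's databuilder does this with dedicated extractors; this is just the idea):

```python
import sqlite3

# Stand-in source database (SQLite via the stdlib); in practice a
# SQLAlchemy extractor would point at Postgres, BigQuery, etc.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, currency TEXT);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
""")

def extract_metadata(conn):
    """Harvest table and column metadata automatically -- the part
    that must never be maintained by hand."""
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        records.append({
            "table": table,
            "columns": [{"name": c[1], "type": c[2]} for c in cols],
        })
    return records

for rec in extract_metadata(conn):
    print(rec["table"], [c["name"] for c in rec["columns"]])
```

Run on a schedule, a job like this keeps the catalog's structural metadata in sync with reality; humans only add what machines can't know (descriptions, owners, tags).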
Ok, we're all set. Time to think about deployment and the end users — Giphy (source)

Now the real challenges unfold:

  • Technical: how do we get Amundsen ready for production and connect it to all our data sources? How do we manage deployment? How do we set up authentication? How do we integrate it efficiently with Airflow?
  • Organizational: who does what? How do we manage the instance to keep it alive and used? How do we maximize value for our people and get them to actually use it instead of asking around on Slack?
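Part of the adoption challenge is making search good enough that it beats asking on Slack. Conceptually, the Elasticsearch index behind Amundsen's search bar is an inverted index over asset names and metadata. A toy stdlib-only version of that idea (documents and owners are illustrative):

```python
from collections import defaultdict

# Toy documents: table name -> searchable metadata text.
docs = {
    "payments.transactions": "one row per processed payment, owned by team-payments",
    "crm.accounts": "merchant accounts synced from the CRM, owned by team-sales",
}

# Inverted index: token -> set of table names containing it.
index = defaultdict(set)
for table, text in docs.items():
    for token in (table + " " + text).replace(".", " ").replace(",", " ").split():
        index[token.lower()].add(table)

def search(query):
    """Return the tables matching every token in the query."""
    hits = [index.get(tok.lower(), set()) for tok in query.split()]
    return sorted(set.intersection(*hits)) if hits else []

print(search("payment"))  # ['payments.transactions']
```

Real search engines add stemming, ranking, and fuzzy matching on top, but this is why "where is the merchant data?" can be answered by a search bar instead of a colleague.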

A bit more about Amundsen

  • It probably works for your data stores: Redshift, Snowflake, Hive, Athena, Oracle, Kafka, Glue, dbt, and Cassandra are all supported,
  • It indexes your dashboards too: Apache Superset, Mode, Redash & Tableau,
  • It was created at Lyft and it’s currently hosted by the Linux AI & Data Foundation

Giving Amundsen a try is as simple as running these two lines in a terminal and going over to http://localhost:5000 🧙 (quick start guide here):

git clone --recursive https://github.com/amundsen-io/amundsen.git
docker-compose -f docker-amundsen.yml up
It all starts with a simple search bar
Amundsen has a minimal, clean interface

DataHub and OpenMetadata are two more complex but great looking options. Definitely check them out before deciding to get yourself the best match for your needs 😉

Here's how the main open source alternatives compare in terms of GitHub stars history (Amundsen is the blue curve). The number of stars and the curve slopes give a pretty good idea of their popularity and dynamism:

Source (star-history.com)

In future posts, we will talk about deployment and governance. We will also give some feedback about Amundsen.

We might even try another promising solution (LinkedIn DataHub? Who knows?)

Give us a little 👏 if you found this useful, or leave a comment if you want more focus put on specific aspects in the future 😉

Thanks for reading !

Part II is right here: #2 Data Discovery: People, governance and processes

Anas El Khaloui

Data Science Manager ~ anaselk.com / I like understanding how stuff works, among other things : )