Fram, the ship that was used in expeditions by Roald Amundsen

Facilitating Data Discovery with Apache Atlas and Amundsen

Mariusz Górski
Jun 18, 2020

The third wave of Artificial Intelligence, as claimed in DARPA Perspective on AI, revolves around the idea of data democratization. The idea is to make data and analytics tooling easily accessible to all the people who need it and have the skill-set to analyze it.

In this article, I would like to introduce you to the concept of the data discovery service and how we implemented it within our internally developed Big Data platform to support our data democratization efforts.

Data Analytics Platform

As a word of introduction, let’s focus for a moment on one of the core products we have built in ING Wholesale Banking Advanced Analytics (WBAA) to make ING a 10 times more data-driven company.

The Data Analytics Platform (DAP) — developed internally by ING WBAA — is a cloud-native platform for data democratization, created to provide any ING employee working with data (data scientist, analyst, wrangler, etc.) with a modern, reliable and feature-rich environment to experiment with data. It features horizontal scalability, the latest open source technologies, and end-to-end security.

One of the most important parts of a successful platform is the onboarding of new data sources. It must be a process that is trusted, secure, and as frequent as possible, and it should cover a broad range of data sources of different origins and specifics. In the same way that you cannot run a successful restaurant with a poor supply of ingredients, you can’t have a data analytics platform without a catalog of all the data relevant to every angle of the business. The banking business is a complex one, and that’s where the need for a separate Data Assets Squad comes from.

Agata Orszulska — Data Assets Squad PM

"The Data Asset Squad develops and maintains the automated and reliable Generic Ingestion Pipeline with capabilities like Data discovery, Profiling, and Lineage. Our ambition is to prepare a comprehensive Data Catalog with embedded ML algorithms to allow users to perform ‘Google-like’ semantic search over metadata to find the most relevant datasets and then browse and filter underlying datasets as needed. The way to achieve this is by keeping track of the latest technologies and practices available in the market to constantly improve our products."

The range and diversity of the data catalog we own is one of the main factors defining the strength of DAP.

When all of this is covered and your data ingestion pipelines are battle-tested, efficient, and properly monitored, can you just go to your users and say ‘We have all the data in the world, go out and play with it!’? Well, not exactly…

Why Is Data Discovery Important?

Imagine being a Data Scientist who just received a new assignment — to conduct analysis on a new business problem regarding payments and build a Machine Learning (ML) model that would solve it.

The first instinct of a person working with data would be to look for any piece of information that will be relevant and might help along the way. The questions usually coming up in the process are:

  1. What kind of data can/should I use?
  2. Where can I find the data?
  3. Who should I ask about access to the data?
  4. Can I rely on the data we have?
  5. What is the freshness and quality of the data we have?
  6. Who else is using this data?

Data Scientists spend up to one-third of their time in data discovery.

In a world without a data discovery service, our data scientists would reach out to their colleagues, search by browsing all the objects they have access to, and then, after making several assumptions, proceed with the analysis, hoping that those assumptions were correct. If this process sounds time-consuming, that’s because without the right tools it definitely is. It requires collecting bits and pieces of information from several independent places on your own, assuming that this is all there is and that the gathered information is correct.

There certainly has to be a better way to achieve this.

With the growth of the Data Analytics Platform, the demand for more data being available is not surprising, and with new data being available the amount of metadata is also on the rise. This brings a new challenge: discovering the relevant pieces of information about the data is like looking for a needle in a haystack. The way data scientists used to find data relevant to their needs might quickly become counterproductive and unreliable, leading to a lot of frustration, uncertainty, and a decrease in creativity.

The modern answer to those issues is a data discovery service. This topic is gaining more and more popularity these days because any company that owns an enormous amount of data will, in time, encounter a growing difficulty with finding relevant data. Data discovery services aim to resolve this hardship.

Data Discovery Service: The DAP Way

Having a data discovery service as part of DAP means providing our users with a tool that builds awareness of the data we have in the platform and its quality. We want this service to be the first place users visit to figure out what relevant data we have and where they can find it.

The data discovery service on DAP connects information about the data we have on the platform with every interaction users have with it, to provide as relevant and revealing a picture as possible. Let’s take a deep dive into how we are making it possible for hundreds of DAP users.

Since 2019, we have been cooperating closely with Lyft, a US-based, data-driven ridesharing company that is strongly invested in expanding the data side of its business. We have successfully deployed their open-source service Amundsen, a data discovery service named after the great Norwegian explorer who was the first person to reach both the North and South Poles. Lyft’s data discovery service aims to resolve similarly challenging tasks: leaving no stone unturned in the search for valuable information in the metadata. It serves as our users’ search interface to the data discovery service.

Amundsen is advertised as Google for data, with a wide range of functionalities and a large supporting community constantly working on improving it. I highly recommend you take a closer look at it.

Metadata is as self-explanatory as it sounds — it’s data providing information about one or more aspects of the data. The simplest example would be with tables — you store actual data inside the table, and all the information revolving around it — such as table name, schema, etc. is metadata. Even the greatest data discovery service cannot exist without having the right metadata information.

As a metadata and search service we use Apache Atlas, a Big Data metadata management and governance service, to capture every bit of metadata related to the data that is available on the platform. It features a wide range of out-of-the-box hooks collecting metadata from services (like Kafka, Hive, or HBase), out-of-the-box enterprise-grade security (much needed in the banking environment), and a REST API (enabling us to fill any metadata gaps we might have).
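As a rough illustration of filling a metadata gap through the REST API, the sketch below builds the JSON body for the Atlas v2 entity endpoint (POST /api/atlas/v2/entity) using only the Python standard library. The host name, the type name dap_table, the qualified name format, and the attribute values are assumptions made for this example, not our production definitions.

```python
import json
import urllib.request

# Hypothetical Atlas address; 21000 is the default Atlas port.
ATLAS_URL = "http://atlas.example.internal:21000"


def build_entity_payload(type_name, qualified_name, name, description=""):
    """Build the body expected by POST /api/atlas/v2/entity."""
    return {
        "entity": {
            "typeName": type_name,
            "attributes": {
                "qualifiedName": qualified_name,  # unique key within the type
                "name": name,
                "description": description,
            },
        }
    }


def build_entity_request(payload, base_url=ATLAS_URL):
    """Prepare (but do not send) the HTTP request that registers the entity."""
    return urllib.request.Request(
        url=f"{base_url}/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_entity_payload(
    type_name="dap_table",  # hypothetical type; use a type defined in your Atlas
    qualified_name="payments.transactions@dap",  # assumed naming convention
    name="transactions",
    description="Example table registered through the REST API",
)
request = build_entity_request(payload)
# Sending it would be urllib.request.urlopen(request), plus authentication.
```

Building the request separately from sending it keeps the payload easy to inspect and test before anything reaches a real Atlas instance.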

Atlas relies on distributed data stores (HBase or Cassandra, and Solr), enabling both storage of information and search capabilities. This way we build a comprehensive data catalog containing lineage information to identify, trace, and secure the data we have, which can then be consumed in a modern fashion through integration with Amundsen.
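To give a feel for the search side, here is a minimal sketch that builds a URL for the Atlas basic search endpoint (GET /api/atlas/v2/search/basic). It is only meant to show the kind of request Atlas accepts, not how Amundsen itself is wired to Atlas; the host name and the example type name are assumptions.

```python
from urllib.parse import urlencode

# Hypothetical Atlas address; 21000 is the default Atlas port.
ATLAS_URL = "http://atlas.example.internal:21000"


def basic_search_url(query, type_name=None, limit=10, base_url=ATLAS_URL):
    """Build a URL for the Atlas basic search endpoint."""
    params = {"query": query, "limit": limit, "excludeDeletedEntities": "true"}
    if type_name:
        params["typeName"] = type_name  # restrict results to one entity type
    return f"{base_url}/api/atlas/v2/search/basic?{urlencode(params)}"


# Example: look for Hive tables whose metadata mentions "payment".
url = basic_search_url("payment", type_name="hive_table")
print(url)
```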

Apache Atlas — high-level architecture

The diagram below depicts the complete architecture of our solution, followed by a deeper dive into selected elements.

Data discovery on DAP — architecture overview

Popularity Score

The best place to hide a dead body is page 2 of Google.

The most relevant results should preferably be displayed on the first page of the search results. To ensure that, we calculate a popularity score, a measure reflecting the number of queries that users of the platform make against the data. The more interactions users have with a table, the higher it will appear in search results. This is enormously useful, especially when search queries made by users are very general and might return thousands of results (an actual case given our data catalog size).
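The effect of the popularity score on ordering can be sketched in a few lines. This is a simplified illustration that sorts on popularity alone; a real search service would typically combine it with text relevance. The table names and scores are made up.

```python
def rank_results(matches, popularity):
    """Order matching tables by popularity score, highest first.

    Tables without a score are treated as 0; ties are broken alphabetically.
    """
    return sorted(matches, key=lambda table: (-popularity.get(table, 0), table))


# Hypothetical scores and search matches.
popularity = {"payments.transactions": 120, "payments.refunds": 15}
matches = ["payments.refunds", "payments.archive_2015", "payments.transactions"]

ranked = rank_results(matches, popularity)
# ranked == ["payments.transactions", "payments.refunds", "payments.archive_2015"]
```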

To calculate the popularity score, we run an Airflow-scheduled Spark job, which parses Ranger audit logs related to HDFS files and enriches this data with Hive Metastore information. With this, we get a full picture of how our data is used in the platform and which tables are more popular than others.
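The core of that calculation can be sketched in plain Python rather than Spark. The audit record fields and the table-to-location mapping below are simplified assumptions standing in for the real audit log schema and the Hive Metastore; the idea is only to show how file-level reads roll up into per-table counts.

```python
from collections import Counter

# Stand-in for Hive Metastore: table name -> HDFS location (hypothetical values).
TABLE_LOCATIONS = {
    "payments.transactions": "/data/payments/transactions",
    "payments.refunds": "/data/payments/refunds",
}


def table_for_path(path, locations=TABLE_LOCATIONS):
    """Map an HDFS file path to the table whose location contains it."""
    for table, location in locations.items():
        if path == location or path.startswith(location + "/"):
            return table
    return None


def popularity_scores(audit_events, locations=TABLE_LOCATIONS):
    """Count read events per table from simplified audit records."""
    counts = Counter()
    for event in audit_events:
        if event["access"] != "read":
            continue  # only reads count towards popularity in this sketch
        table = table_for_path(event["path"], locations)
        if table is not None:
            counts[table] += 1
    return dict(counts)


events = [
    {"user": "alice", "access": "read", "path": "/data/payments/transactions/part-0"},
    {"user": "bob", "access": "read", "path": "/data/payments/transactions/part-1"},
    {"user": "bob", "access": "write", "path": "/data/payments/refunds/part-0"},
    {"user": "carol", "access": "read", "path": "/data/payments/refunds/part-0"},
]
scores = popularity_scores(events)
# scores == {"payments.transactions": 2, "payments.refunds": 1}
```

Counting raw reads is the simplest possible choice; counting distinct users per table would be a reasonable alternative if a few heavy users dominate the logs.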

Popularity score — architecture overview

Table Metadata

The search is obviously the first step of the journey. When a search result becomes a point of interest for users, they can go into the table details view. There, information such as the description, the timestamp of the latest update, and the most frequent users can be viewed. Users can also find out about the data owner and the data classification. Our data ingestion pipelines ensure this kind of metadata is as recent as possible.
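The fields shown in the table details view can be pictured as a small record. The class below is a hypothetical illustration of that shape, with invented field names and values; it is not the actual Amundsen or Atlas data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TableDetails:
    """Hypothetical shape of the metadata shown for one table."""

    name: str
    description: str
    last_updated: datetime
    owner: str
    classification: str
    frequent_users: list = field(default_factory=list)

    def summary(self):
        """Return a one-line, human-readable summary of the table."""
        users = ", ".join(self.frequent_users) or "none yet"
        return (
            f"{self.name} (owner: {self.owner}, {self.classification}), "
            f"updated {self.last_updated:%Y-%m-%d}, frequent users: {users}"
        )


details = TableDetails(
    name="payments.transactions",
    description="One row per payment transaction",
    last_updated=datetime(2020, 6, 1, tzinfo=timezone.utc),
    owner="data-assets-squad",
    classification="confidential",
    frequent_users=["alice", "bob"],
)
print(details.summary())
```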

Data Profiling

The column list provides an inventory of all columns of a table with their data types. Another step towards raising awareness of the data we store within the platform is data profiling. As a part of our ingestion process, we have an embedded Spark job that uses the AWS Deequ library to calculate descriptive statistics for every numerical column of the table and push them to Atlas. This way, even without access to the data, users can quickly get a sense of what’s inside.
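To show what such column statistics look like, here is a minimal sketch using only the Python standard library rather than a Spark job with a dedicated profiling library. The choice of statistics (count, min, max, mean, population standard deviation) and the sample rows are assumptions for illustration.

```python
import statistics


def profile_numeric_columns(rows):
    """Compute basic descriptive statistics for each numeric column.

    `rows` is a list of dicts. Missing (None) values are skipped and
    non-numeric columns are ignored.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                columns.setdefault(name, []).append(value)

    profile = {}
    for name, values in columns.items():
        profile[name] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "stddev": statistics.pstdev(values),
        }
    return profile


# Hypothetical sample of a payments table.
rows = [
    {"amount": 10.0, "currency": "EUR", "items": 1},
    {"amount": 30.0, "currency": "EUR", "items": 3},
    {"amount": None, "currency": "USD", "items": 2},
]
profile = profile_numeric_columns(rows)
```

Only the aggregated statistics would be pushed to the catalog, which is why users can get a sense of a column without having access to the rows themselves.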

Let’s go back to our Data Scientists. After introducing the data discovery service to the platform, they can just type payment into the search bar (just like they would use Google) to get all the tables related to payments. No more browsing Slack, pinging preoccupied teammates, or digging through git to scrape any kind of valuable information. In a matter of minutes the most critical questions are answered: who (is using the data), what (kind of data should I use), and where (to look for the data). This way our users get much more context on the data since it’s all stored in one place, drawing from both structured and unstructured bits of information available in DAP.

Amundsen table view - table description & column statistics

Believe In Open Source

What’s important about the services we have is that we don’t limit ourselves to just using them. We strongly believe in the power of Open Source and are constantly working with the community on improving them. That includes bug fixing, performance improvements, and extending functionalities. In fact, none of our ideas for improvements are lying around on internal feature branches; we push everything we have upstream.

We make our users’ lives easier by exchanging, sharing, and comparing ideas with the whole community — within and outside of ING. We feed and learn from the collective wisdom on data discovery.

Try It Yourself

The best way to get a feel for any product is to try it on your own. It’s very easy to start your own Amundsen & Atlas deployment. Below are a couple of links to kick it off:

  1. https://github.com/lyft/amundsen — Official Amundsen repository containing sample docker-compose deployment of Amundsen with Atlas
  2. https://github.com/dwarszawski/amundsen-atlas-types — A set of entity definitions required for Amundsen & Atlas integration
  3. https://pypi.org/project/pyatlasclient — An easy way to familiarize yourself with Atlas API

My name is Mariusz Górski and amongst other things, I am a Data Engineer, Open Source and Public Cloud enthusiast. Fan of knowledge sharing, who likes to experiment, break things, and fix them — sometimes in random order.

At the time of writing this article, I am working as a Data Engineer in ING WBAA where I contribute to delivering solutions for data ingestion, analysis, and discoverability.

Privately a snooker geek and a black-as-terminal coffee addict. Connect with me on LinkedIn and GitHub. ✌️

wbaa

Wholesale Banking Advanced Analytics team