A step-by-step guide to deploying Amundsen on Google Cloud Platform

Salma Amr · Published in talabat Tech · 7 min read · Nov 3, 2020

What is Amundsen?

Amundsen is Lyft’s data discovery platform and metadata engine. In 2019, Lyft officially announced that it was open-sourcing Amundsen (https://github.com/lyft/amundsen). It is named after Roald Amundsen, the Norwegian explorer who led the first expedition to reach the South Pole.

Think of it as the “Google Search” for data in your organization. It powers a PageRank-style search that leverages usage patterns: for example, highly queried tables show up earlier in the results than less queried ones.

It works by indexing all the different data sources, dashboards and streams, and ranking results based on relevance and query activity. It keeps track of the most up-to-date definition of every feature, table and dataset, and therefore improves the productivity and efficiency of all data platform users across the organization, not just the data scientists.

Why Amundsen?

There are two main questions often asked within every data team:

Where is this data source?
What does this field mean?

Data scientists spend a lot of time in the “data discovery” phase: trying to understand what data exists, where it resides, who owns it, who uses it, and how to request access. Amundsen helps the data team be more productive by cutting down the time spent in this phase: less time searching, more time finding.

Challenges to solve

Think of all the different projects you have encountered:

  • Different data sources
  • You can’t fit all the data in one single model; a data resource could be a table, a dashboard, or an Airflow DAG
  • Every data source is extracted differently; each dataset’s metadata is stored and fetched differently

These are the types of challenges we were facing, and we needed a single solution that addressed them all.

Amundsen Architecture

Amundsen is built on top of five different micro-services and each needs to be deployed/maintained separately.

  1. Frontend service: serves as the web UI portal that users interact with. It is a Flask-based web app whose presentation layer is built with React with Redux, Bootstrap, Webpack, and Babel.
  2. Metadata service: a thin proxy layer for interacting with the graph database; Neo4j is currently the default option for the graph backend engine, and since open-sourcing there has been community collaboration to support Apache Atlas. It also exposes REST APIs so other services can push or pull metadata directly.
  3. Neo4j: the graph backend server that stores all metadata extracted from the different sources; the metadata is represented as a graph model.
  4. Databuilder: Amundsen’s data ingestion library for building the metadata; Apache Airflow is used to orchestrate Databuilder jobs.
  5. Search service: a thin proxy layer for interacting with the search backend (Elasticsearch by default, or Apache Atlas’s search API if that’s the backend you picked). It provides a RESTful API to serve search requests from the frontend service and supports different search patterns:
  • Normal search: match records based on relevancy.
  • Category search: match records first based on data type, then relevancy.
  • Wildcard search: run a search with missing words.
Amundsen’s architecture
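To make those search patterns more concrete, here is a minimal sketch of how a client could query the search service directly. The port (5001), the /search endpoint and its query_term/page_index parameters are assumptions based on the search service’s default REST API, so double-check them against the version you deploy.

import requests

# Assumption: the search service runs on its default port (5001) and exposes
# a /search endpoint taking query_term and page_index; verify against the
# API of the search service version you actually deployed.
SEARCH_BASE = "http://{{machine ip address}}:5001"

def search_tables(query_term, page_index=0):
    """Normal search: records are matched and ranked by relevancy."""
    response = requests.get(
        SEARCH_BASE + "/search",
        params={"query_term": query_term, "page_index": page_index},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

print(search_tables("orders"))   # normal search, ranked by relevancy
print(search_tables("order*"))   # wildcard search with missing characters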

Amundsen offers several ways of deployment, and since part of its appeal is this flexibility, there’s no one ‘right way’ to install it.

Setup Guide

After considering different approaches and trying out several of them, I am going to demonstrate an easy way to deploy Amundsen with its default settings on a GCP machine (any Linux machine works as well!) and then build an Airflow DAG that orchestrates updating the metadata from different sources.

Setup overview

This guide assumes you have both an Airflow environment on Google Cloud Composer and a GCP machine up and running.

Our setup went as follows:

  1. Wrap four of the services in one docker-compose file so that everything can be installed at once on one machine, instead of installing and maintaining each service independently.
  2. Install this docker-compose file on GCP.
  3. Verify the installation of five Docker images, each holding one of the main micro-services (frontend, metadata, Neo4j, search, Elasticsearch).
  4. Build the metadata extraction DAG and orchestrate it using Airflow.
Our deployment setup of Amundsen at talabat

How to install Amundsen’s default version on GCP?

  1. Make sure that Docker is installed on the machine.
  2. At the command prompt, head to the target directory where you want to install Amundsen and clone the repo together with its submodules by running the following:
    $ git clone --recursive git@github.com:amundsen-io/amundsen.git
  3. Enter the cloned directory and run:
    $ docker-compose -f docker-amundsen.yml up

The latter command installs and runs five Docker images using the default settings of each container; each image holds one of the main micro-services that Amundsen uses. To verify that all five images have been installed and are working correctly, run:
$ docker ps

One Docker image for each of the five services

Troubleshooting:

In some cases, the Docker host does not allow enough virtual memory areas for Elasticsearch, so “es_amundsen”, the Elasticsearch container, will fail during docker-compose. To fix it, you’ll need to raise the vm.max_map_count limit.

The error:

es_amundsen | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

To solve it:

  1. Edit the file “/etc/sysctl.conf”
  2. Add the entry “vm.max_map_count=262144”, then save and exit.
  3. Reload the settings: “$ sysctl -p”
  4. Restart “docker-compose”

Generally, for any other issues, refer to the troubleshooting section in Amundsen’s installation documentation.

Setup Verification

Amundsen databuilder/Neo4j service

You can verify that dummy data has been ingested into Neo4j by visiting

http://{{machine ip address}}:7474/browser/ and running the following in the query box:

MATCH(n:Table) RETURN n LIMIT 25
Snippet of Neo4j (graphical database) interface up and running
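If you prefer to run this check from code rather than the browser, the same query can be issued with the official neo4j Python driver. This is only a sketch: the bolt port (7687) and the neo4j/test credentials are assumptions for a default local setup, so adjust them to whatever your docker-amundsen.yml configures.

from neo4j import GraphDatabase

# Assumptions: default bolt port 7687 and neo4j/test credentials; change both
# to match your docker-amundsen.yml.
driver = GraphDatabase.driver(
    "bolt://{{machine ip address}}:7687",
    auth=("neo4j", "test"),
)

with driver.session() as session:
    # Same verification query as in the Neo4j browser above.
    for record in session.run("MATCH (n:Table) RETURN n LIMIT 25"):
        print(record["n"])

driver.close()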

Amundsen metadata service

In a different terminal, verify that you get an HTTP/1.0 200 OK response:

$ curl -v http://{{machine ip address}}:5002/healthcheck

Amundsen frontend service

You can verify the data has been loaded into the metadata service by visiting

http://{{machine ip address}}:5000/ in your browser.

Snippet of Amundsen’s frontend service
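If you would rather script these two checks than run them by hand, a small sketch using the requests library against the same URLs could look like this:

import requests

HOST = "{{machine ip address}}"  # replace with your machine's IP address

# The two verification URLs from the steps above: the metadata service
# healthcheck and the frontend landing page.
checks = {
    "metadata service": "http://" + HOST + ":5002/healthcheck",
    "frontend service": "http://" + HOST + ":5000/",
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=10).status_code
        print(name + ": " + url + " -> HTTP " + str(status))
    except requests.RequestException as exc:
        print(name + ": " + url + " -> FAILED (" + str(exc) + ")")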

Airflow environment setup

Now, let’s use the amundsendatabuilder library to build our metadata graph and search index. First, let’s run our data-ingestion example using an Airflow DAG. You’ll have to install some extra PyPI packages in your Airflow environment; this requirements.txt holds all the dependencies needed.

$ pip3 install -r requirements.txt

If you are hosting Airflow on Google Cloud Composer, type the following in your gcloud console:

gcloud composer environments update {{environment name}} --location={{machine location}} --update-pypi-packages-from-file={{.../amundsen/amundsendatabuilder/requirements.txt}} --project={{project name}}

Airflow DAG definition

This example DAG fetches dataset schemas, tables, columns, descriptions, labels and pretty much all related metadata available from the Google BigQuery metastore, then publishes this metadata to Neo4j. It saves the data in two .csv files: one holding the nodes, and the other holding the relationships between the nodes.

If you are pulling data from various sources, each distinct metadata source should be fetched through a different databuilder job. Each databuilder job will be an individual task within the DAG. Each type of data resource will have a separate DAG since it may have to run with a different schedule.

  • Step 1: Define the imports, DAG arguments and setup config.
  • Step 2: Create the BigQuery tables extraction job using
    BigQueryMetaDataExtractor()
  • Step 3: Create the index job that updates the graph index after every new data pull.
  • Step 4: Create the publisher job function, which is called several times to update the table, user, and dashboard indexes depending on the passed arguments (“kwargs”).
  • Step 5: Finally, define and call the DAG routine (a sketch combining these steps follows this list).
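Here is the sketch referenced above: a minimal, hedged outline of how the DAG routine can be wired up. The bodies of the create_* helpers (the actual databuilder wiring from Steps 2-4) are omitted, and the owner, schedule and DAG name are illustrative rather than taken from our production setup.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_bigquery_metadata_job(**kwargs):
    """Step 2: extract BigQuery table metadata and publish it to Neo4j."""
    # Wire up the databuilder extractor/loader/publisher here.


def create_es_publisher_sample_job(**kwargs):
    """Steps 3-4: read metadata back from Neo4j and publish a search index;
    the entity being indexed comes in through kwargs."""
    # Wire up the databuilder indexing job here, using kwargs["entity"]
    # to pick which index gets updated.


default_args = {
    "owner": "data-platform",            # illustrative
    "start_date": datetime(2020, 11, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="amundsen_metadata_ingestion",  # illustrative name
    default_args=default_args,
    schedule_interval="@daily",
) as dag:

    extract_bigquery = PythonOperator(
        task_id="extract_bigquery_metadata",
        python_callable=create_bigquery_metadata_job,
    )

    # The same publisher function is called three times with different
    # arguments, one indexing job per entity type.
    index_tasks = []
    for entity in ("table", "user", "dashboard"):
        index_tasks.append(
            PythonOperator(
                task_id="update_" + entity + "_search_index",
                python_callable=create_es_publisher_sample_job,
                op_kwargs={"entity": entity},
            )
        )

    extract_bigquery >> index_tasks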

Below is the graph view of the defined DAG. You’ll notice how the function “create_es_publisher_sample_job” is called three times in the DAG routine, creating a different indexing job according to the passed arguments.

Complete DAG

You can find the complete code here.

Running From a Python Script

In the previous example, I showed how to use Airflow to update the metadata. You can also use a Python script that serves the same purpose and run it regularly by registering it as a cron job. Make sure to set your credentials if your application runs outside a Google Cloud environment. More about authentication here.
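As a rough sketch of that approach, reusing the same illustrative helpers from the DAG sketch above (imported here from a hypothetical metadata_jobs module), the script and its cron entry could look like this:

import os

# Hypothetical module holding the same create_* helpers outlined in the DAG
# sketch above.
from metadata_jobs import (
    create_bigquery_metadata_job,
    create_es_publisher_sample_job,
)

# When running outside a Google Cloud environment, point the client libraries
# at a service-account key file (placeholder path, replace with your own).
os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS", "/path/to/service-account.json"
)

def main():
    create_bigquery_metadata_job()                 # extract from BigQuery, publish to Neo4j
    for entity in ("table", "user", "dashboard"):  # refresh the search indexes
        create_es_publisher_sample_job(entity=entity)

if __name__ == "__main__":
    main()

# Example crontab entry to run the script every night at 02:00:
# 0 2 * * * /usr/bin/python3 /opt/amundsen/ingest_metadata.py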

Final thoughts

In my next post, I am going to show how to customize Amundsen, demonstrate some of its awesome hidden features, and explain how to leverage them. Making it more configurable means you’ll need to deploy and maintain each micro-service independently, which we’ll do in an easy way too!
