A step-by-step guide to deploying Amundsen on Google Cloud Platform

What is Amundsen?

Amundsen is Lyft’s data discovery platform and metadata engine. In 2019, Lyft officially announced that it was open-sourcing Amundsen (https://github.com/lyft/amundsen). It is named after Roald Amundsen, the Norwegian explorer who was the first person to reach the South Pole.

Think of it as the “Google Search” for data in your organization. It powers PageRank-style search that leverages usage patterns: for example, highly queried tables show up earlier than less queried ones.

It works by indexing your different data sources, dashboards and streams, and ranking results based on relevance and query activity. It keeps track of the most up-to-date definition of every feature/table/set, and therefore improves the productivity of data platform users across the organization, not just data scientists.
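To make the ranking idea concrete, here is a minimal Python sketch. It is not Amundsen’s actual scoring: I am assuming each search hit carries a text-relevance score and a recent query count (both hypothetical fields), and that relevance is boosted by the logarithm of usage so heavily queried tables rise without drowning out relevance entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class TableHit:
    name: str
    text_score: float  # hypothetical: how well the table matches the search terms
    query_count: int   # hypothetical: how often the table was queried recently


def usage_boosted_score(hit: TableHit) -> float:
    """Scale text relevance by a logarithmic usage boost (an assumed formula)."""
    return hit.text_score * (1.0 + math.log10(1 + hit.query_count))


def rank(hits):
    """Return hits ordered from highest to lowest boosted score."""
    return sorted(hits, key=usage_boosted_score, reverse=True)


hits = [
    TableHit("orders_backup", text_score=1.0, query_count=3),
    TableHit("orders", text_score=1.0, query_count=1200),
    TableHit("order_items", text_score=0.6, query_count=40),
]
ranked_names = [hit.name for hit in rank(hits)]
```

With equal text scores, the heavily queried "orders" table lands ahead of its rarely used backup, which is the behaviour described above.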

Why Amundsen?

There are two main questions often asked within every data team:

Where is this data source?
What does this field mean?

Data scientists spend lots of time in the “Data Discovery” phase, trying to understand what data exists, where it resides, who owns it, who uses it, and how to request access. Amundsen helps the data team be more productive by saving time spent in the discovery phase: less time searching, more time finding.

Challenges to solve

Think of the data discovery problems you have encountered across your different projects.

These are the types of challenges we were facing, so we needed a solution that resolves them all.

Amundsen Architecture

Amundsen is built on top of five different micro-services and each needs to be deployed/maintained separately.

  1. Frontend service: The frontend service serves as the web UI portal for users to interact with. It is a Flask-based web app with a presentation layer built with React, Redux, Bootstrap, Webpack, and Babel.
  2. Metadata service: A thin proxy layer to interact with the graph database; currently Neo4j is the default option for the graph backend engine. Since Amundsen was open-sourced, there has been community collaboration to support Apache Atlas. It also exposes REST APIs for other services to push or pull metadata directly.
  3. Neo4j: A graph database backend that stores all metadata extracted from the different sources. The metadata is represented as a graph model.
  4. Databuilder: Amundsen provides a data ingestion library for building the metadata, and uses Apache Airflow to orchestrate Databuilder jobs.
  5. Search service: A thin proxy layer to interact with the search backend (Elasticsearch in the default setup, or Apache Atlas’s search API, if that’s the backend you picked). It provides a RESTful API to serve search requests from the frontend service and supports different search patterns.

Amundsen’s architecture

Amundsen offers several ways of deployment, and since part of its appeal is this flexibility, there’s no one ‘right way’ to install it.

Setup Guide

After trying out several approaches, I am going to demonstrate an easy way to deploy Amundsen using its default settings on a GCP machine (any Linux machine works as well!), and then build an Airflow DAG that orchestrates updating the metadata from different sources.

Setup overview

This guide assumes you have both Airflow on a Google Cloud Composer environment and a GCP machine up and running.

Our setup went as follows:

  1. Wrap four of the services in one docker-compose file so they can all be installed at once on one machine, instead of installing and maintaining each service independently.
  2. Install this docker-compose file on GCP.
  3. Verify the installation of the five docker images; each image holds one of the main components (frontend, metadata, Neo4j, search, Elasticsearch).
  4. Build the metadata extractor DAG file and orchestrate it using Airflow.

Our deployment setup of Amundsen at Talabat

How to install Amundsen’s default version on GCP?

  1. Make sure that docker is installed on the machine.
  2. At the command prompt, go to the directory you want to install Amundsen in and clone the repo together with its submodules by running the following:
    $ git clone --recursive git@github.com:amundsen-io/amundsen.git
  3. Enter the cloned directory and run:
    $ docker-compose -f docker-amundsen.yml up

The latter command installs and runs five docker containers using the default settings of each; each container holds one of the main components that Amundsen uses. To verify that all five are up and working correctly, run:
$ docker ps

Five docker images for each service
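If you would rather not eyeball the `docker ps` output, the check can be scripted. The sketch below is only an illustration: the container names are my assumption of what the default docker-amundsen.yml creates (only “es_amundsen” is confirmed elsewhere in this guide), so adjust them to whatever `docker ps` shows on your machine.

```python
import subprocess

# Assumed container names for the default setup; only "es_amundsen" is
# confirmed by this guide, the rest should be checked against `docker ps`.
EXPECTED_CONTAINERS = {
    "amundsenfrontend",
    "amundsenmetadata",
    "amundsensearch",
    "neo4j_amundsen",
    "es_amundsen",
}


def running_container_names() -> str:
    """Return the names of running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def missing_containers(names_output: str, expected=EXPECTED_CONTAINERS) -> set:
    """Compare the `docker ps` names against the expected set."""
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return set(expected) - running
```

Calling `missing_containers(running_container_names())` on the machine returns an empty set when everything expected is up, and the names of the missing containers otherwise.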

Troubleshooting:

In some cases, the host’s virtual memory limit is too low for Elasticsearch, so “es_amundsen”, the Elasticsearch component, fails during docker-compose. To fix it, you’ll need to raise the vm.max_map_count kernel setting.

The error:

es_amundsen | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

To solve it:

  1. Edit the file “/etc/sysctl.conf” (root privileges required).
  2. Add the entry “vm.max_map_count=262144”. Save and exit.
  3. Reload the settings with “$ sudo sysctl -p”.
  4. Restart “docker-compose”.
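Before restarting, you can confirm that the new value took effect. Here is a small sketch that reads the current setting from /proc/sys/vm/max_map_count (the standard Linux location for this kernel parameter) and compares it with the minimum from the error message; the function names are mine.

```python
from pathlib import Path

# Minimum value quoted in the Elasticsearch error message above.
REQUIRED_MAX_MAP_COUNT = 262144


def parse_max_map_count(raw: str) -> int:
    """Parse the contents of /proc/sys/vm/max_map_count."""
    return int(raw.strip())


def is_sufficient(current: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """True when the kernel setting meets the Elasticsearch minimum."""
    return current >= required


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the live kernel setting (Linux only)."""
    return parse_max_map_count(Path(path).read_text())
```

On the machine itself, `is_sufficient(read_max_map_count())` should return True once the sysctl change has been reloaded.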

Generally, for any issues:

Setup Verification

Amundsen databuilder/Neo4j service

You can verify that dummy data has been ingested into Neo4j by visiting

http://{{machine ip address}}:7474/browser/ and running the following in the query box:

MATCH(n:Table) RETURN n LIMIT 25
Snippet of the Neo4j (graph database) interface up and running

Amundsen metadata service

In a different terminal, verify that you get HTTP/1.0 200 OK:

$ curl -v http://{{machine ip address}}:5002/healthcheck

Amundsen frontend service

You can verify the data has been loaded into the metadata service by visiting

http://{{machine ip address}}:5000/ in your browser.

Snippet of Amundsen’s frontend service
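The same checks can be run from a script, which is handy if you want to repeat them after every restart. This is a rough sketch using only the Python standard library; the ports and paths are the ones used in this guide (5002 with /healthcheck for the metadata service, 5000 for the frontend), while the helper names and the example IP address are made up for illustration.

```python
import urllib.error
import urllib.request


def service_url(host: str, port: int, path: str = "/") -> str:
    """Build the URL of a service running on the Amundsen machine."""
    return f"http://{host}:{port}{path}"


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def urls_to_check(host: str) -> dict:
    """URLs for the two services verified in this section."""
    return {
        "metadata": service_url(host, 5002, "/healthcheck"),
        "frontend": service_url(host, 5000),
    }
```

For example, `{name: is_up(url) for name, url in urls_to_check("10.0.0.5").items()}` gives a quick up/down view per service, with 10.0.0.5 standing in for your machine’s IP address.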

Airflow environment setup

Now, let’s use the amundsendatabuilder library to build our metadata graph and search index. First, let’s run our data-ingestion example using an Airflow DAG. You’ll have to install some extra PyPI packages in your Airflow environment. The requirements.txt file holds all the dependencies needed.

$ pip3 install -r requirements.txt

If you are hosting Airflow on Google Cloud Composer, type the following in your gcloud console:

gcloud composer environments update {{environment name}} --location={{machine location}} --update-pypi-packages-from-file={{.../amundsen/amundsendatabuilder/requirements.txt}} --project={{project name}}

Airflow DAG definition

This example DAG fetches dataset schemas, tables, columns, descriptions, labels and pretty much all related metadata available from a Google BigQuery metastore, then publishes this metadata to Neo4j. Along the way, it saves the data in two .csv files: one holding the nodes, and the other holding the relationships between the nodes.
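To picture what those two files contain, here is a toy sketch that writes one table node, one column node and the relationship between them as CSV text. The column headers and key format are illustrative assumptions on my part and may not match exactly what databuilder writes; the point is only the split between a nodes file and a relationships file.

```python
import csv
import io

# Illustrative rows: one table, one of its columns, and the edge between them.
TABLE_KEY = "bigquery://my-project.sales/orders"        # hypothetical key
COLUMN_KEY = "bigquery://my-project.sales/orders/id"    # hypothetical key

nodes = [
    {"KEY": TABLE_KEY, "LABEL": "Table", "name": "orders"},
    {"KEY": COLUMN_KEY, "LABEL": "Column", "name": "id"},
]
relationships = [
    {
        "START_KEY": TABLE_KEY,
        "START_LABEL": "Table",
        "END_KEY": COLUMN_KEY,
        "END_LABEL": "Column",
        "TYPE": "COLUMN",
    },
]


def to_csv(rows) -> str:
    """Serialize a list of same-shaped dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes_csv = to_csv(nodes)
relationships_csv = to_csv(relationships)
```

Keeping nodes and relationships in separate files means the nodes can be loaded first, so that every relationship row refers to keys that already exist in the graph.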

If you are pulling data from various sources, each distinct metadata source should be fetched through a different databuilder job. Each databuilder job will be an individual task within the DAG. Each type of data resource will have a separate DAG since it may have to run with a different schedule.
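The layout described above can be sketched as follows: one task per metadata source, grouped into one DAG per resource type. This is not the DAG from this post, just a skeleton under a few assumptions: Airflow 2.x import paths, hypothetical source names, and a placeholder where the real databuilder job would be built and launched.

```python
from datetime import datetime

# Hypothetical metadata sources; each becomes its own databuilder job and task.
BIGQUERY_SOURCES = ["sales_dataset", "marketing_dataset"]


def task_id_for(source: str) -> str:
    """Derive a stable Airflow task id from a source name."""
    return f"extract_{source}_metadata"


def run_databuilder_job(source: str) -> None:
    """Placeholder: build and launch the databuilder job for `source` here."""
    print(f"Running databuilder job for {source}")


def build_dag(dag_id: str, sources, schedule: str):
    """Create one DAG for a resource type, with one task per source."""
    # Imported lazily so the helpers above work without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator  # Airflow 2.x path

    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    )
    for source in sources:
        PythonOperator(
            task_id=task_id_for(source),
            python_callable=run_databuilder_job,
            op_kwargs={"source": source},
            dag=dag,
        )
    return dag
```

In a real DAG file you would assign the result to a module-level variable, for example `bigquery_dag = build_dag("bigquery_metadata", BIGQUERY_SOURCES, "@daily")`, so the scheduler can discover it. A second resource type with a different schedule would simply get its own `build_dag` call.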

Below is the graph view of the defined DAG. You’ll notice how the function “create_es_publisher_sample_job” is called three times in the DAG routine, creating a different indexing job according to the arguments passed.

Complete DAG

You can find the complete code here.

Running From a Python Script

In the previous example, I showed how to use Airflow to update the metadata. You can also use a Python script that serves the same purpose and run it regularly by registering it as a cron job. Make sure to set your credentials if your application runs outside Google Cloud environments. More about authentication here.
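As a rough sketch of the script approach, the wrapper below refuses to run when no service account key is configured. It assumes you are running outside Google Cloud and rely on the GOOGLE_APPLICATION_CREDENTIALS environment variable, which Google’s client libraries read to locate a key file; the job function, script path and cron schedule are hypothetical.

```python
import os
import sys

# Example crontab entry (hypothetical path and schedule), running daily at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/amundsen/update_metadata.py


def credentials_configured(environ=os.environ) -> bool:
    """True when GOOGLE_APPLICATION_CREDENTIALS points at an existing file."""
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and os.path.isfile(key_path)


def main(run_job, environ=os.environ) -> int:
    """Run the metadata update if credentials are in place; return an exit code."""
    if not credentials_configured(environ):
        print(
            "GOOGLE_APPLICATION_CREDENTIALS is not set or the key file is missing",
            file=sys.stderr,
        )
        return 1
    run_job()
    return 0
```

In the script itself you would end with something like `sys.exit(main(update_metadata))`, where `update_metadata` is whatever function launches your databuilder job, so that cron sees a non-zero exit code when the credentials are missing.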

Final thoughts

In my next post, I am going to show how to customize Amundsen, along with some of its awesome hidden features and how to leverage them. Making it more configurable means you’ll need to deploy and maintain each micro-service independently, which we’ll do in an easy way too!
