Published in talabat Tech
A step-by-step guide to deploying Amundsen on Google Cloud Platform

What is Amundsen?

Why Amundsen?

Challenges to solve

  • Different data sources
  • You can’t fit all the data into one single model; a data resource could be a table, a dashboard, or an Airflow DAG
  • Every data source is extracted differently; each dataset’s metadata is stored and fetched differently

Amundsen Architecture

  1. Frontend service: The frontend service serves as the web UI portal that users interact with. It is a Flask-based web app whose presentation layer is built with React with Redux, Bootstrap, Webpack, and Babel.
  2. Metadata service: A thin proxy layer for interacting with the graph database; Neo4j is currently the default graph backend engine, and since the project was open-sourced the community has collaborated to support Apache Atlas as well. It also exposes REST APIs so other services can push or pull metadata directly.
  3. Neo4j: The graph database backend that stores all metadata extracted from the different sources. The metadata is represented as a graph model.
  4. Databuilder: Amundsen’s data ingestion library for building the metadata; Apache Airflow is used to orchestrate Databuilder jobs.
  5. Search service: A thin proxy layer for interacting with the search backend (or Apache Atlas’s search API, if that’s the backend you picked); it provides a RESTful API to serve search requests from the frontend service and supports different search patterns:
  • Normal Search: match records based on relevancy.
  • Category Search: match records first based on data type, then relevancy.
  • Wildcard Search: match records even when parts of the search term are missing.
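As a rough illustration of these three patterns, the snippet below builds query URLs against the search service's REST API. The `/search` endpoint, the `query_term` parameter, and the `category:term` syntax are assumptions for illustration and may not match your Amundsen version exactly.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed default address of the search service; adjust to your deployment.
SEARCH_BASE = "http://localhost:5001/search"

def search_url(term: str, category: Optional[str] = None) -> str:
    """Build a search-service URL for the three search patterns:

    - normal search:   search_url("orders")          -> match on relevancy
    - category search: search_url("orders", "table") -> filter by data type first
    - wildcard search: search_url("ord*")            -> '*' stands in for missing words
    """
    query_term = f"{category}:{term}" if category else term
    return f"{SEARCH_BASE}?{urlencode({'query_term': query_term, 'page_index': 0})}"

print(search_url("orders"))
print(search_url("orders", "table"))
print(search_url("ord*"))
```

The same helper covers all three patterns because, from the client's point of view, they differ only in how the query term is written, not in which endpoint is called.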
Amundsen’s architecture

Setup Guide

Setup overview

  1. Wrap four of the services in one docker-compose file so they can all be installed at once, on one machine, instead of installing and maintaining each service independently.
  2. Install this docker-compose file on GCP.
  3. Verify that five docker images are running, each holding one of the main components (frontend, metadata, neo4j, search, elasticsearch).
  4. Build the metadata extraction DAG file and orchestrate it using Airflow.
Our deployment setup of Amundsen at talabat

How to install Amundsen’s default version on GCP?

  1. Make sure that Docker is installed on the machine.
  2. In a terminal, head to the target directory you want to install Amundsen in and clone the repo together with its submodules by running the following:
    $ git clone --recursive git@github.com:amundsen-io/amundsen.git
  3. Enter the cloned directory and run:
    $ docker-compose -f docker-amundsen.yml up
Five docker images for each service
The Elasticsearch container may fail to start with the following error:

es_amundsen | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

To fix it:
  1. Edit the file “/etc/sysctl.conf”.
  2. Add the entry “vm.max_map_count=262144”. Save and exit.
  3. Reload the settings: “$ sysctl -p”
  4. Restart “docker-compose”.

Setup Verification

Amundsen databuilder/Neo4j service

MATCH(n:Table) RETURN n LIMIT 25
Snippet of Neo4j (graphical database) interface up and running

Amundsen metadata service

$ curl -v http://{{machine ip address}}:5002/healthcheck
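Beyond a single curl call, a small script can probe all three services at once. This is a sketch under a few assumptions: the port numbers (frontend 5000, search 5001, metadata 5002) are the defaults I'd expect from the docker-amundsen.yml compose file, and each Flask service is assumed to expose a /healthcheck endpoint; adjust both to your setup.

```python
from urllib.request import urlopen
from urllib.error import URLError

# Assumed default ports from the docker-amundsen.yml compose file:
# frontend 5000, search 5001, metadata 5002 -- adjust if yours differ.
SERVICES = {"frontend": 5000, "search": 5001, "metadata": 5002}

def healthcheck_url(host: str, port: int) -> str:
    """Build the /healthcheck URL for one service."""
    return f"http://{host}:{port}/healthcheck"

def check_all(host: str = "localhost", timeout: float = 3.0) -> dict:
    """Hit each service's /healthcheck endpoint; return {service: up?}."""
    results = {}
    for name, port in SERVICES.items():
        try:
            with urlopen(healthcheck_url(host, port), timeout=timeout) as resp:
                results[name] = resp.status == 200
        except (URLError, OSError):
            # Connection refused / timed out -> service is not healthy.
            results[name] = False
    return results

if __name__ == "__main__":
    for service, ok in check_all().items():
        print(f"{service}: {'up' if ok else 'down'}")
```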

Amundsen frontend service

Snippet of Amundsen’s frontend service

Airflow environment setup

$ pip3 install -r requirements.txt
gcloud composer environments update {{environment name}} --location={{machine location}} --update-pypi-packages-from-file={{.../amundsen/amundsendatabuilder/requirements.txt}} --project={{project name}}

Airflow DAG definition

  • Step 1: Define the imports, DAG arguments, and setup config.
  • Step 2: Create the BigQuery tables extraction job using
    BigQueryMetaDataExtractor()
  • Step 3: Create the index job that updates the graph index after every new data pull.
  • Step 4: Create the publisher function job, which is called several times to update the table, user, and dashboard indexes, depending on the passed arguments (“kwargs”).
  • Step 5: Finally, define and call the DAG routine.
Complete DAG
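The reuse in Step 4 — one publisher callable invoked several times with different kwargs — can be sketched in plain Python as below. The names (publish_index, the entity values) are hypothetical placeholders, not Amundsen Databuilder's actual API; in the real DAG each call would run an Elasticsearch publisher task for the chosen index.

```python
# Hypothetical sketch of the Step 4 pattern: a single publisher function
# reused for the table, user, and dashboard indexes via keyword arguments.
# These names are illustrative placeholders, not Amundsen's actual API.

INDEX_NAMES = {
    "tables": "table_search_index",
    "users": "user_search_index",
    "dashboards": "dashboard_search_index",
}

def publish_index(**kwargs):
    """Pick which search index to rebuild based on the passed kwargs."""
    entity = kwargs.get("entity", "tables")
    if entity not in INDEX_NAMES:
        raise ValueError(f"unknown entity: {entity!r}")
    index_name = INDEX_NAMES[entity]
    # A real Databuilder job would publish the extracted records into
    # Elasticsearch under `index_name`; here we only return the target.
    return index_name

# The DAG calls the same callable once per index, as Step 4 describes:
for entity in ("tables", "users", "dashboards"):
    print(entity, "->", publish_index(entity=entity))
```

Parameterizing one function this way keeps the DAG short: adding a new resource type means adding one entry and one task call, not a new operator implementation.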

Running From a Python Script

Final thoughts

At talabat we have an ambitious and exciting journey ahead of us. This can only be achieved with a very reliable and scalable technology stack that will allow us to innovate, experiment, and create great impact for our customers and partners.

Salma Amr