Hacking Data Discovery in AWS with Amundsen at SEEK

Published in SEEK blog, Dec 6, 2020 (4 min read)

Jacking into data across the Asia-Pacific at SEEK is easier with Amundsen

Twice a year, SEEK encourages its people to leave their teams behind, self-organise and come together with people from other teams to brainstorm ideas for Hackathon.

This time, for Hackathon 15, members of the five Asia and ANZ Data Teams and some fellow ❤️ers of Data came together to stand up Amundsen by Lyft.

Amundsen is an Open Source Data Discovery and Metadata Engine for improving the productivity of data analysts, data scientists and engineers when interacting with data.

We called our hack “Google for Data”.

Our purpose: to give back time. According to Lyft, 30% of the time spent creating a data-driven decision-making process is wasted on data discovery, that is, on finding the right data to use.

That is time our fellow SEEKers could spend on their next task and on getting on with their real jobs, increasing their velocity and delivering outcomes to their customers sooner.

So, we stood up Amundsen in AWS and loaded it with metadata. We did this in 48 hours. The speed with which we could do this surprised us.

This speed speaks to the ease of use, the succinct documentation and the elegance of design delivered by the Lyft Engineering Team and the Amundsen Open Source Community 🙏

What did we do?

We integrated Amazon Redshift, Amazon Athena and Amazon Aurora PostgreSQL into Amundsen on EC2 in AWS within 48 hours

We integrated more than 6,000 data objects into Amundsen, which we stood up in AWS with no prior experience.

It just worked…

The Amazon Redshift, Amazon Athena and Amazon Aurora PostgreSQL sources we integrated came from two AWS APAC regions.

We enriched the metadata with information on security, quality, ownership, system, geography and statistics.

With a little more time, we will integrate Tableau, and perhaps Kafka too, so that we have bidirectional lineage between our dashboards and our data sources.

Why is this so important to us?

Our team’s digital analyst Ophir distilled this down perfectly in his use cases:

security: single-click location of PII-sensitive and restricted data objects in the organisation, wherever they are, in whatever system
remote: when you work at home, you cannot turn around and tap your colleague on the shoulder or walk over to another team to ask where you can best find data about something
new projects / new starters: bootstrap and scaffold analysts, developers and consultants who don’t even know whom to ask…
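
To make the security use case concrete, here is a minimal sketch of the idea in plain Python. It is not Amundsen's data model or API: the `TableMetadata` record, the `find_by_tag` helper, the tag names, the region names and the sample entries are all hypothetical and made up for illustration. It only shows why tagging every data object in one place turns "where is our PII?" into a single lookup across systems.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical record for one data object; not Amundsen's own model."""
    system: str   # source system, e.g. "redshift", "athena", "aurora"
    region: str   # AWS region the source lives in (assumed values below)
    name: str     # schema-qualified table name
    owner: str    # owning team
    tags: frozenset = field(default_factory=frozenset)


def find_by_tag(catalogue, tag):
    """Return every table carrying the given tag, whatever system it lives in."""
    return [table for table in catalogue if tag in table.tags]


# Made-up sample entries, for illustration only.
catalogue = [
    TableMetadata("redshift", "ap-southeast-2", "sales.orders", "team-a",
                  frozenset({"restricted"})),
    TableMetadata("athena", "ap-southeast-1", "web.events", "team-b",
                  frozenset({"pii"})),
    TableMetadata("aurora", "ap-southeast-2", "crm.contacts", "team-c",
                  frozenset({"pii", "restricted"})),
]

# One lookup spans all three systems and both regions.
pii_tables = find_by_tag(catalogue, "pii")
for table in pii_tables:
    print(f"{table.system}:{table.name} (owner: {table.owner})")
```

In a real deployment the lookup would go through Amundsen's search rather than a Python list; the sketch only illustrates the shape of the metadata that makes such a search possible.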

Built for the Modern Data Team

Amundsen is written in Python. It’s Open Source and can be easily extended without having to wait for a future feature release.

But please share any enhancements back with the community as this is what makes open source so great…

The Apache Airflow data loader DAGs will run daily to keep the metadata current as sources evolve.

If you need help, it’s got a Slack channel in which you can talk to a vibrant and helpful global community at any time of day.

It’s containerised, so you can stand up a dev instance with Docker Compose on your laptop in minutes and, using the sample data loaders, load it with metadata from your data system(s) in under an hour. This is a great way to develop metadata loaders locally.

What Happened?

Our hack was voted runner-up in the “Most Popular” category, where entries are voted for by SEEKers, so it was great to see it resonate with our colleagues.

Thank you

Hack 15 dataSEEKers: APAC Shop-keepers and ❤️ers of Data at SEEK
Amundsen Open Source Community (we stand on your shoulders)
Our Amundsen OSS friends here in Melbourne at REA and Coles Group
DataEngAU (where I first learned of Amundsen, amongst other things)

Finally…

❤️ SEEK is a great place to work…

References:

Documentation:

Amundsen

www.amundsen.io

Quick Start:

https://www.amundsen.io/amundsen/installation/
We used: docker-compose -f docker-amundsen.yml up -d

Slack Channel:

amundsenworkspace.slack.com

Git Repo:

amundsen-io/amundsen

github.com

Articles:

The data production-consumption gap

mark-grover.medium.com

Amundsen — Lyft’s data discovery & metadata engine

eng.lyft.com

Talks:

Data & AI Summit Europe October 2020 (by Tao Feng, Staff Engineer Lyft)

Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform …

databricks.com

Amundsen Online Meetup November 2020 (REA & Brex)

Melbourne-Data-Engineering-Meetup October 2020 (Alagappan Sethuraman, Software Engineer, Facebook ex Lyft)