Hacking Data Discovery in AWS with Amundsen at SEEK

George Pongracz
Published in SEEK blog
4 min read · Dec 6, 2020
Jacking into Data across the Asia-Pacific at SEEK
Jacking into Data @ SEEK is easier with Amundsen

Twice a year, SEEK encourages its people to leave their teams behind, self-organise and come together with people from other teams to brainstorm ideas for Hackathon.

This time, for Hackathon 15, members from the five Asia and ANZ Data Teams and some fellow ❤️ers of Data came together to stand up Amundsen by Lyft.

Amundsen is an Open Source Data Discovery and Metadata Engine for improving the productivity of data analysts, data scientists and engineers when interacting with data.

We called our hack “Google for Data”.

Our purpose: to give back time. According to Lyft, 30% of the time spent creating a data-driven decision-making process is wasted on data discovery, that is, on finding the right data to use.

Time for our fellow SEEKers to spend on their next task and get on with their real jobs, to increase their velocity and deliver outcomes to their customers sooner.

So, we stood up Amundsen in AWS and loaded it with metadata. We did this in 48 hours. The speed with which we could do this surprised us.

This speed speaks to the ease of use, the succinct documentation and the elegant design delivered by the Lyft Engineering Team and the Amundsen Open Source Community 🙏

What did we do?

We integrated Amazon Redshift, Amazon Athena and Amazon Postgres Aurora into Amundsen on EC2 in AWS within 48 hours

We integrated more than 6,000 data objects into Amundsen, which we stood up in AWS with no prior experience.

It just worked…

We integrated Amazon Redshift, Amazon Athena and Amazon Postgres Aurora from two AWS APAC regions.

We enriched the metadata with information on security, quality, ownership, system, geography and statistics.
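To make the enrichment step a little more concrete, here is a minimal sketch of how the kinds of information listed above could be attached to a data object as tags, with statistics kept alongside. It is illustrative only: `TableRecord`, `enrich`, the `dimension:value` tag format and the sample values are all hypothetical, and none of this is Amundsen's API or the code from the hack.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical enrichment dimensions, taken from the list in the text.
DIMENSIONS = ("security", "quality", "ownership", "system", "geography")


@dataclass
class TableRecord:
    """Illustrative stand-in for one data object in the catalogue."""
    name: str
    tags: Set[str] = field(default_factory=set)
    statistics: Dict[str, int] = field(default_factory=dict)


def enrich(record: TableRecord, attributes: Dict[str, str],
           statistics: Dict[str, int]) -> TableRecord:
    """Return a new record with one 'dimension:value' tag per known dimension."""
    new_tags = {
        f"{dim}:{attributes[dim]}" for dim in DIMENSIONS if dim in attributes
    }
    return TableRecord(
        name=record.name,
        tags=record.tags | new_tags,
        statistics={**record.statistics, **statistics},
    )


# Example values are invented for illustration.
table = enrich(
    TableRecord(name="analytics.job_applications"),
    {"security": "pii", "ownership": "data-platform",
     "system": "redshift", "geography": "apac"},
    {"row_count": 1000},
)
print(sorted(table.tags))
```

Keeping each dimension as a plain tag means a single search term can later cut across every source system.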

With a little more time we will integrate Tableau, giving us bidirectional lineage between our dashboards and our data sources, and perhaps Kafka too.

Why is this so important to us?

Our team’s digital analyst Ophir distilled this down perfectly in his use-cases:

  • security: single-click location of PII, sensitive and restricted data objects in the organisation, wherever they are and in whatever system
  • remote: when you work from home, you cannot turn around and tap a colleague on the shoulder, or walk over to another team, to ask where best to find data about something
  • new projects / new starters: bootstrap and scaffold analysts, developers and consultants who don’t even know whom to ask…
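The security use case boils down to one query over the whole catalogue rather than one query per system. A rough sketch of that idea, assuming each catalogue entry carries a set of tags, is below. The `find_tagged` helper, the tuple layout and the sample entries are hypothetical and are not how Amundsen implements search.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

# (system, table name, tags); invented sample entries, not real SEEK objects.
CatalogueEntry = Tuple[str, str, FrozenSet[str]]


def find_tagged(entries: Iterable[CatalogueEntry],
                wanted: FrozenSet[str]) -> Dict[str, List[str]]:
    """Group the names of entries carrying any wanted tag by source system."""
    hits: Dict[str, List[str]] = defaultdict(list)
    for system, name, tags in entries:
        if tags & wanted:  # non-empty intersection means at least one match
            hits[system].append(name)
    return dict(hits)


catalogue: List[CatalogueEntry] = [
    ("redshift", "analytics.candidates", frozenset({"pii"})),
    ("athena", "logs.page_views", frozenset()),
    ("aurora-postgres", "billing.invoices", frozenset({"restricted"})),
]

print(find_tagged(catalogue, frozenset({"pii", "restricted"})))
```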

Built for the Modern Data Team

Amundsen is written in Python. It’s Open Source and can be easily extended without having to wait for a future feature release.

But please share any enhancements back with the community as this is what makes open source so great…

The Apache Airflow data loader DAGs will run daily to keep the metadata current as sources evolve.

If you need help, it’s got a Slack channel where you can talk to a vibrant and helpful global community at any time of day.

It’s containerised, so you can stand up a dev instance with Docker Compose on your laptop in minutes and load metadata from your own data system(s) with the sample data loaders in under an hour. That makes it a great setup for local development of metadata loaders.

What Happened?

Our hack was voted runner-up in the “Most Popular” category, where entries are voted for by SEEKers, so it was great to see it resonate with our colleagues.

Thank you

Hack 15 dataSEEKers: APAC Shop-keepers and ❤️ers of Data at SEEK
Amundsen Open Source Community (we stand on your shoulders)
Our Amundsen OSS friends here in Melbourne at REA and ColesGroup
DataEngAU (where I first learned of Amundsen, amongst other things)

Finally…

❤️ SEEK is a great place to work…

References:

Documentation:

Quick Start:

https://www.amundsen.io/amundsen/installation/
We used: docker-compose -f docker-amundsen.yml up -d

Slack Channel:

amundsenworkspace.slack.com

Git Repo:

Articles:

Talks:

Data & AI Summit Europe October 2020 (by Tao Feng, Staff Engineer Lyft)

Amundsen Online Meetup November 2020 (REA & Brex)

Melbourne-Data-Engineering-Meetup October 2020 (Alagappan Sethuraman, Software Engineer, Facebook ex Lyft)


Not affiliated with any vendor nor influenced by any commercial relationships, I write about what I develop and live with in production as an AWS Data Engineer.