Hacking Data Discovery in AWS with Amundsen at SEEK

George Pongracz
Dec 6, 2020 · 4 min read
Jacking into Data across the Asia-Pacific at SEEK
Jacking into Data @ SEEK is easier with Amundsen

Twice a year, SEEK encourages its people to leave their teams behind, self-organise and come together with people from other teams to brainstorm ideas for Hackathon.

This time, for Hackathon 15, members of the five Asia and ANZ Data Teams and some fellow ❤️ers of Data came together to stand up Amundsen by Lyft.

Amundsen is an Open Source Data Discovery and Metadata Engine for improving the productivity of data analysts, data scientists and engineers when interacting with data.

We called our hack “Google for Data”.


Our purpose: to give back time. According to Lyft, 30% of the time spent creating a data-driven decision-making process is wasted on data discovery, that is, on finding the right data to use.

That is time our fellow SEEKers could spend on their next task, getting on with their real jobs, increasing their velocity and delivering outcomes to their customers sooner.

So, we stood up Amundsen in AWS and loaded it with metadata. We did this in 48 hours. The speed with which we could do this surprised us.

This speed speaks to the ease of use, the succinct documentation and the elegant design delivered by the Lyft Engineering Team and the Amundsen Open Source Community 🙏

We integrated Amazon Redshift, Amazon Athena & Amazon Postgres Aurora into Amundsen on EC2 in AWS within 48 Hours

We integrated more than 6,000 data objects into Amundsen, which we stood up in AWS with no prior experience.

It just worked…

We integrated Amazon Redshift, Amazon Athena and Amazon Postgres Aurora from two AWS APAC regions.

We enriched the metadata with information on security, quality, ownership, system, geography and statistics.
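To make the enrichment concrete, here is a minimal sketch in Python of the kind of record it produces for a single table. The class, the field names, the key layout and the sample values are all hypothetical, chosen only to mirror the six categories above; this is not Amundsen's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableMetadata:
    """Hypothetical enriched record for one data object (not Amundsen's model)."""

    system: str      # source system, e.g. "redshift" or "athena"
    geography: str   # where the data lives, e.g. an AWS region
    schema: str
    name: str
    owners: List[str] = field(default_factory=list)
    security_tags: List[str] = field(default_factory=list)  # e.g. "pii"
    quality_tags: List[str] = field(default_factory=list)   # e.g. "validated"
    statistics: Dict[str, float] = field(default_factory=dict)

    def key(self) -> str:
        # Illustrative key layout: system://geography.schema/table
        return f"{self.system}://{self.geography}.{self.schema}/{self.name}"


# Made-up example values, for illustration only.
job_ads = TableMetadata(
    system="redshift",
    geography="ap-southeast-2",
    schema="analytics",
    name="job_ads",
    owners=["data-platform@example.com"],
    security_tags=["pii"],
    quality_tags=["validated"],
    statistics={"row_count": 1200000.0},
)

print(job_ads.key())
```

Keeping every category on the same record is what lets a single search answer questions that would otherwise span several systems.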

With a little more time we will integrate Tableau, giving us bidirectional lineage between our dashboards and our data sources, and perhaps Kafka too.

Our team’s digital analyst Ophir distilled this down perfectly in his use-cases:

  • security: single-click location of PII-sensitive and restricted data objects in the organisation, wherever they are, in whatever system
  • remote: when you work from home, you cannot turn around and tap a colleague on the shoulder, or walk over to another team, to ask where you can best find data about something
  • new projects / new starters: bootstrap and scaffold analysts, developers and consultants who don’t even know whom to ask…
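The security use-case can be sketched in a few lines of Python. The catalogue rows, table names and tag names below are made up for illustration, and the lookup is a plain in-memory filter rather than Amundsen's search service; the point is only that one tag query spans every system.

```python
from typing import Dict, Iterable, List

# Hypothetical catalogue rows: one dict per data object, tagged at load time.
CATALOGUE: List[Dict] = [
    {"system": "redshift", "table": "analytics.candidates", "tags": {"pii", "restricted"}},
    {"system": "athena", "table": "events.page_views", "tags": set()},
    {"system": "aurora-postgres", "table": "public.profiles", "tags": {"pii"}},
]


def find_by_tag(rows: Iterable[Dict], tag: str) -> List[str]:
    """Return 'system:table' for every object carrying the tag, in any system."""
    return sorted(
        f"{row['system']}:{row['table']}" for row in rows if tag in row["tags"]
    )


pii_objects = find_by_tag(CATALOGUE, "pii")
print(pii_objects)
# ['aurora-postgres:public.profiles', 'redshift:analytics.candidates']
```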

Amundsen is written in Python. It’s Open Source and can be easily extended without having to wait for a future feature release.

But please share any enhancements back with the community as this is what makes open source so great…

The Apache Airflow Data Loader DAGs will run daily to keep the metadata current as sources evolve.
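To illustrate why a daily run matters, the sketch below compares two hypothetical snapshots of a source's tables and columns and reports what a refresh would need to add, retire or update. It is plain Python with made-up table names, not the Airflow DAG or the Amundsen loader code we ran.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot shape: table name -> tuple of column names.
Snapshot = Dict[str, Tuple[str, ...]]


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Dict[str, List[str]]:
    """Report which tables a refresh would add, retire or update."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        name
        for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Made-up example: one table gained a column, one was dropped, one is new.
yesterday: Snapshot = {
    "analytics.job_ads": ("id", "title"),
    "analytics.legacy_clicks": ("id",),
}
today: Snapshot = {
    "analytics.job_ads": ("id", "title", "location"),
    "analytics.applications": ("id", "job_ad_id"),
}

changes = diff_snapshots(yesterday, today)
print(changes)
```

Without a regular refresh, the dropped table would still show up in search and the new column would be invisible to anyone browsing the catalogue.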

If you need help, it’s got a Slack channel where you can talk to a vibrant and helpful global community at any time of day.

It’s containerised, so you can stand up a dev instance with Docker Compose on your laptop in minutes and load it from your data system(s) with the sample data loaders in under an hour, which is a great way to develop metadata loaders locally.
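Once the containers are up, a quick way to confirm the local instance is reachable is to probe its ports. The sketch below uses only the Python standard library; the service names and port numbers are assumptions based on the defaults in the quick-start compose file, so adjust them if your setup differs.

```python
import socket
from typing import Dict

# Assumed default ports for a local quick-start instance; change as needed.
SERVICES: Dict[str, int] = {
    "frontend": 5000,
    "search": 5001,
    "metadata": 5002,
    "neo4j": 7687,
    "elasticsearch": 9200,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost",
                   services: Dict[str, int] = SERVICES) -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name:15s} {'up' if is_up else 'down'}")
```

An open port only shows that something is listening, so it is a first check before loading metadata rather than proof that each service is healthy.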

Our hack was voted runner-up in the “Most Popular” category, where entries are voted for by SEEKers, so it was great to see it resonate with our colleagues.

Hack 15 dataSEEKers: APAC Shop-keepers and ❤️ers of Data at SEEK
Amundsen OpenSource Community (we stand on your shoulders)
Our Amundsen OSS friends here in Melbourne at REA and ColesGroup
DataEngAU (where I first learned of Amundsen amongst other things too)

❤️ SEEK is a great place to work…

Documentation:

Quick Start:

https://www.amundsen.io/amundsen/installation/
We used: docker-compose -f docker-amundsen.yml up -d

Slack Channel:

amundsenworkspace.slack.com

Git Repo:

Articles:

Talks:

Data & AI Summit Europe October 2020 (by Tao Feng, Staff Engineer Lyft)

Amundsen Online Meetup November 2020 (REA & Brex)

Melbourne-Data-Engineering-Meetup October 2020 (Alagappan Sethuraman, Software Engineer, Facebook ex Lyft)

SEEK blog

At SEEK we’ve created a community of valued, talented…

George Pongracz

Written by

Dad, Husband, Cyclist, SEEKer, AWS Data Platform Engineer, Melbourne, Australia
