We’re All on Our Own, Aren’t We?
Whenever I read use cases about large enterprises running 900+ applications, I always got the chills thinking about data transfer between the systems. It was usually followed by the thought, ‘How does the business know what data exists and where, or, because the company has reached such a scale, does each unit become self-sustained in terms of its data needs?’ And people figured it out — well, apparently, kind of. The answer is ‘data catalogs’. As per Dataversity, ‘A data catalog informs customers about the available data sets and metadata around a topic and assists users in locating it quickly.’ Gartner adds, ‘Data catalogs enable organizations to create an inventory of distributed data assets, thereby underpinning the value of data modernization initiatives. Product managers of data management solutions should introduce data catalog capabilities to create an “upstream hook” for customers and prospects.’
But if data catalogs unite various departments in the organization, why has Gartner been marking ‘Data Catalogs’ as deemed to be obsolete before reaching the Plateau of Productivity? As per Gartner’s explanation of this category, the technology will never reach the Plateau because it will fail in the market or be overtaken by other competing solutions.
As my interest was piqued, I decided to explore the topic in depth and do a small project trying some tools out. As there could be many criteria to evaluate the tools against, I focused on:
— automated population of the catalog;
— business glossary;
— lineage view;
— preview of sample data & profiling,
with pricing being the main criterion. A word of caution: the benchmark below was compiled based on articles on the Internet, and some points could be missed, not to mention that there is extremely limited information on some tools. For more details, please contact the developers/vendors.
Considering that all my data sources run on AWS services, why not AWS Glue? Primarily because Glue is made for developers, not for business users, while I was looking for an end-user tool, preferably an open-source one. I was torn between Marquez and Amundsen, but in the end, because metadata in Marquez is collected during Airflow ETL job execution and I didn’t want to call the Marquez API, I decided to go with Amundsen.
Departments, unite! You have nothing to lose!
Amundsen is a data discovery and metadata platform that originated at Lyft and is now open source. The platform is built on three pillars: (1) search with an augmented data graph, (2) a front-end, and (3) centralized metadata — more here. You can listen to the interview with Mark Grover and Tao Feng, engineers at Lyft, here, or read the blog post from Tao Feng here.
1. Install and start Docker, an operating system virtualization technology that allows applications to be packaged as containers. In my case, I installed Docker Community Edition for Windows, following https://docs.docker.com/get-docker/.
2. Download and install Git Bash at https://git-scm.com/downloads. Git Bash contains both Git, an open-source version control system for tracking source code changes when developing software, and Bash, a Unix command-line shell.
3. Install miniconda or Anaconda on your machine at https://conda.io/en/latest/miniconda.html. Conda is an open-source package management system and environment management system for installing multiple versions of software packages and their dependencies.
3.2 Create a conda environment
conda create --name <yourName>. Verify that the environment was created with
conda info --envs. To activate the environment, type in the Conda prompt
activate <yourName> — for Windows — or
source activate <yourName>, e.g.
source activate myenv.
4. Open Git Bash — if you use Windows, type “git bash” in the Search Box — and clone the Git repository with the Amundsen code
git clone --recursive git@github.com:amundsen-io/amundsen.git or
git clone --recursive https://github.com/amundsen-io/amundsen.git. In my case, I cloned the repository to C:\Users\eponkratova\amundsen\amundsen.
5. Point to the cloned Git repository on your local machine using Git Bash
cd <AmundsenFolder>, e.g. cd /c/Users/eponkratova/amundsen/amundsen.
6. Run the Docker image with
docker-compose -f docker-amundsen.yml up. I picked Neo4j, but you also have the option of using the Atlas backend.
There are six Docker containers, each holding one of the microservices (frontend, metadata, neo4j, atlas, search, and elasticsearch), that need to be running. If, for example, one of the Docker containers stopped after ‘docker run…’, please refer to the ‘Troubleshooting’ section here.
Alternatively, if you use AWS, you can purchase ‘Amundsen: It is a metadata-driven application’ from the Marketplace with Apache Atlas as a backend. The pricing comprises the software license fee of $0.5/hr and the EC2 instance charge; the latter depends on the instance size you select. The vendor offers a few days of trial in case you just want to evaluate the tool.
Wait a few minutes for Docker to get up and running, then open http://localhost:5000/ to access the front-end.
Then, you have two options for collecting metadata for tables: (1) run a Python script that does the extraction, or (2) use the Amundsen integration with Airflow, which will fetch the schema for each data source. As this was a proof of concept, I used the Python scripts, which you can find here in the folder ‘scripts’. On the same level, you will see the folder ‘dags’ with the Python scripts to be scheduled in Airflow.
6. Open another Git Bash and point it to the folder ‘amundsendatabuilder’ you cloned in step (4), e.g. cd /c/Users/eponkratova/amundsen/amundsen/amundsendatabuilder.
7. Activate the environment you created in step (3.2)
source activate <yourName>, e.g.
source activate myenv. Here things might get a bit tricky as it depends on which packages you have installed.
pip3 install -r requirements.txt
python3 setup.py install
8. Then, load the test databases that Amundsen created for you in the ‘example’ folder by running
python3 example/scripts/sample_data_loader.py. Wait a few minutes, open http://localhost:5000/, and click ‘Advanced search’. In the ‘Source’ field, type ‘*’ and click ‘Apply’ to get the data sources.
Sweet, but I wanted my tables!
9. As one of my data sources is a PostgreSQL database, I took the file ‘sample_postgres_loader’ from the ‘\amundsen\amundsendatabuilder\example\scripts’ folder and updated this function with the connection details of my Postgres database — you can see my code here:
user = '<myUserName>'
host = '<myDatabaseURL>'
port = '5432'
db = '<myDatabaseName>'
password = '<myDatabasePassword>'
return "postgresql://%s:%s@%s:%s/%s" % (user, password, host, port, db)
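As a side note, the connection-string function can be made slightly more robust by URL-encoding the credentials, since characters such as ‘@’ or ‘/’ in a password would otherwise break the SQLAlchemy URL. A minimal sketch with a hypothetical helper name and dummy values:

```python
from urllib.parse import quote_plus

def connection_string(user, password, host, port, db):
    # URL-encode the credentials so characters such as '@' or '/'
    # in the password do not break the SQLAlchemy URL.
    return "postgresql://%s:%s@%s:%s/%s" % (
        quote_plus(user), quote_plus(password), host, port, db)

# Example with dummy values:
print(connection_string("amundsen", "p@ss/word", "db.example.com", "5432", "mydb"))
# -> postgresql://amundsen:p%40ss%2Fword@db.example.com:5432/mydb
```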
python3 example/scripts/postgres_data_loader.py and voilà, my test tables appeared on the front-end. Some entries, like column types, column names, and column comments, were fetched from the database directly, but I am not sure whether the fields that are corrected manually on the front-end could be populated automatically in the code.
But I want more! Said and done!
It seems you cannot get the structure of files uploaded to, for example, S3, but you can connect to AWS Athena and AWS Glue. So, provided that you have already retrieved the structure of your files with Glue, these options seem viable.
11. Let’s start with Athena. As there was only a script to be run with Airflow in the folder ‘\amundsen\amundsendatabuilder\example\dags’, I made some adjustments to their ‘athena_sample_dag’ file — you can find my code here. You need to replace some parts in the code — if in doubt about where to find the account details, check out this link from AWS. When I ran the script for the first time, I got the error
‘The error message returned was:\nCan't load plugin: sqlalchemy.dialects:awsathena.jdbc’, which was fixed by running
pip install "PyAthenaJDBC>1.0.9"
AWS_ACCESS = '<yourAWSAccessKeyID>'
AWS_SECRET = '<yourAWSSecretAccessKey>'
access_key = AWS_ACCESS
secret = AWS_SECRET
host = 'athena.us-east-1.amazonaws.com'
extras = 's3_staging_dir=s3://<yourS3Bucket>/'
return "awsathena+rest://%s:%s@%s:443/?%s" % (access_key, secret, host, extras)
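One thing to watch out for: AWS secret keys often contain ‘/’ or ‘+’, which must be URL-encoded before being embedded in the connection URL. A hedged sketch of assembling the Athena URL (the function name and the dummy credentials are mine, not from the Amundsen code):

```python
from urllib.parse import quote_plus

def athena_connection_string(access_key, secret, region, s3_bucket):
    # AWS secret keys often contain '/' or '+', which must be
    # URL-encoded before going into the connection URL.
    host = "athena.%s.amazonaws.com" % region
    extras = "s3_staging_dir=s3://%s/" % s3_bucket
    return "awsathena+rest://%s:%s@%s:443/?%s" % (
        quote_plus(access_key), quote_plus(secret), host, extras)

# Example with dummy values:
print(athena_connection_string("AKIAEXAMPLE", "abc/def+ghi", "us-east-1", "my-bucket"))
# -> awsathena+rest://AKIAEXAMPLE:abc%2Fdef%2Bghi@athena.us-east-1.amazonaws.com:443/?s3_staging_dir=s3://my-bucket/
```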
python3 example/scripts/athena_data_loader.py and I got my Athena tables.
13. To continue with Glue: in the folder ‘\amundsen\amundsendatabuilder\example\scripts’, find ‘sample_glue_loader’ or check my version here, and update this part in the code:
GLUE_CLUSTER_KEY = '<yourAWSSessionToken>'. It took me a while to realize that I needed to provide the session token in GLUE_CLUSTER_KEY. To generate the token:
13.1 Download and install the AWS Command Line Interface.
13.2 Configure the command line interface.
13.3 Request the session token
aws sts get-session-token --duration-seconds 129600 (enter your own duration).
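The CLI command returns a JSON document whose Credentials.SessionToken field is the value that goes into GLUE_CLUSTER_KEY. A minimal sketch of pulling it out with Python's json module — the response below only illustrates the shape of the output, and the values are dummies, not real credentials:

```python
import json

# Illustrative shape of the `aws sts get-session-token` response;
# the values here are dummies, not real credentials.
response = '''
{
  "Credentials": {
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "dummySecret",
    "SessionToken": "dummyToken",
    "Expiration": "2021-01-01T00:00:00Z"
  }
}
'''

creds = json.loads(response)["Credentials"]
GLUE_CLUSTER_KEY = creds["SessionToken"]  # value for the Glue loader script
print(GLUE_CLUSTER_KEY)
# -> dummyToken
```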
python3 example/scripts/glue_data_loader.py and yes, the tables appeared on the front-end, and no, they were not accessible. If you are more successful than me, let me know how you fixed the issue.
Out of curiosity, I also checked the Atlas backend, and it differs drastically from the Neo4j backend: it offers glossary capabilities, and in general the look and feel differ. As per my read, you cannot add a glossary to the Neo4j backend, but I could be wrong.
Just recently, I got a newsletter with one of the articles titled ‘Data Catalogs Are Dead; Long Live Data Discovery’, which said, ‘Data catalogs as we know them are unable to keep pace with this new reality for three primary reasons: (1) lack of automation, (2) inability to scale with the growth and diversity of your data stack, and (3) their undistributed format.’ And the author has a point: for example, what I missed in Amundsen is the ability to add sources from the platform front-end, to empower business users to add the data sources they own. Furthermore, because the tool was developed for the specific needs of Lyft, it supports only certain data sources. Finally, I wished one could indicate the connection between the data sources, aka a data flow, but as per my read, it is not implemented.