Versatile Data Engineering Toolkit for Python

Ari Surana
hipages Engineering
5 min read · Jul 23, 2020

hipages’s newest contribution to the free & open source software (FOSS) ecosystem

Photo by iMattSmart on Unsplash

The what and the why?

Working with the latest and greatest cloud technology and building a diverse data lake can be as exciting as it is daunting. It’s exciting because these tools usually come with a great set of features that solve problems previously considered too hard to overcome, and you get capabilities of scale, often at lower cost. But it can also be daunting, as you may not find the evolved support ecosystem that established solutions enjoy.

Incorporating novel, cutting-edge solutions into our tech stack lets us stand on the shoulders of giants, but it is often up to us data architects and data engineers to build the supporting framework that keeps the data plumbing running smoothly within our organisations: tools that are easy to use not just for our software engineers, but also for our data analysts, data scientists, applications and other data consumers.

We are proud to contribute back to the open-source community, which is as vibrant as it is ingenious, by releasing hip-data-tools, a data engineering toolkit for Python that aligns with our tech stack and addresses our (and, we hope, very common) needs:

  • At hipages we strive for a modern, scalable, self-service-first Data Platform that can support our growth ambitions by leveraging cloud architecture.
  • We use cloud object stores like S3 as our primary Data Lake storage technology. The tools to interact with such technologies are well equipped for low-level access and control, but they lack the higher-level abstractions that data consumers care about. For example, data scientists may want to read and write datasets to S3 as Parquet files and use them in their notebooks as pandas DataFrames on a regular basis. To do this, they end up reinventing the wheel every time with low-level libraries like boto (a typical hand-rolled version is sketched after this list).
  • We use many types of compute and query technologies for our data warehousing needs: AWS Redshift, AWS Athena, Apache Cassandra, plain Kubernetes workloads, and more.
  • Almost every time we want to access or store a dataset, the tools at hand require us to reinvent the wheel, and they force a lot of storage-level decisions onto the data analysts and data scientists who should not have to care about them.
  • Sometimes we need supporting applications such as machine learning model version control systems. (If you are curious as to why, read this article.)
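
To make that S3 pain point concrete, this is roughly the boilerplate that gets rewritten with low-level libraries just to load one Parquet dataset from S3 into a DataFrame; the bucket and prefix are placeholders, and pyarrow (or fastparquet) is needed for pandas to parse Parquet:

import io

import boto3
import pandas as pd

# Hand-rolled S3-to-DataFrame read; every team ends up with a slightly
# different version of this loop.
s3 = boto3.client("s3")
frames = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-data-lake", Prefix="curated/jobs/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            body = s3.get_object(Bucket="my-data-lake", Key=obj["Key"])["Body"].read()
            frames.append(pd.read_parquet(io.BytesIO(body)))

df = pd.concat(frames, ignore_index=True)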

We realised that we were spending considerable time solving these problems again and again, and each time we solved one we ended up maintaining a slightly different solution. Ideally, we would have a seamless, effortless interaction with the data platform, abstracting away the complexities of the multitude of technologies while keeping sane defaults.

The Data Platform team at hipages built reusable, easy-to-drop-in libraries suited for data engineering, analysis and ad hoc access at the same time. This means the interface for accessing data, and the code to access and manipulate it within our data platform, is exactly the same for our data analysts, scientists and engineers. It allows everyone to reuse snippets from data pipelines in their notebooks, or to send over their notebooks to be converted into data pipelines with ease.

We recently open-sourced some of these internal libraries, and we hope they are useful to others trying to solve similar problems. You can use and contribute to the project on GitHub and visit the detailed technical documentation at readthedocs.

You can install the package directly with pip:

pip install hip-data-tools

This project is structured with the modern data platform in mind, and we are adding new tools and utilities every day. The goal of this package is to provide easy-to-use, simplified abstractions for the specialised use cases generally encountered by analysts, scientists and engineers working with data. Hopefully, these tools will help reduce some reinvention of the wheel when interacting with real-world data platforms.

Structure

The project is structured as a set of convenience wrappers around low-level infrastructure.

We currently wrap some commonly used infrastructure services like:

  1. AWS S3
  2. AWS Athena
  3. Apache Cassandra
  4. Apache Kafka
  5. Google AdWords
  6. Google Sheets

These modules provide rich low-level functionality to interact with the data stored in these services and their APIs. Almost all of them offer a performant, easy-to-use abstraction to download data as Python dictionaries or pandas.DataFrame objects.

Moreover, some of these low-level utilities have been bundled together to create higher-level abstractions for data flows (aka ETLs).

These flows enable efficient data transfers between services while making some opinionated choices about data types and storage formats. For example, the utilities and ETLs are optimised for retrieving and storing data as Parquet files on S3, the industry standard for optimal access and storage of data in data lakes.
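
As a small illustration of that storage convention, here is the round trip in plain pandas rather than the toolkit itself; the bucket path is a placeholder, and s3fs plus pyarrow need to be installed for s3:// paths to work:

import pandas as pd

df = pd.DataFrame({"job_id": [1, 2, 3], "category": ["plumbing", "electrical", "roofing"]})

# Parquet on S3 as the interchange format: columnar, compressed, and readable
# by Athena, Spark and pandas alike. The toolkit's utilities and ETLs wrap this
# kind of round trip with connection handling and sane defaults.
df.to_parquet("s3://my-data-lake/curated/jobs/part-000.parquet", index=False)
round_tripped = pd.read_parquet("s3://my-data-lake/curated/jobs/part-000.parquet")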

Examples

The following examples demonstrate how easy it can be to use these tools in your everyday work and interactive analysis, and to use the Data Platform at scale.

Read Data from Athena tables into a Pandas data frame

This operation is highly optimised for reading large files and assumes the data is stored in Parquet format.

The code snippet is self-contained and should run as is: it shows how to access data sitting in S3 behind an Athena table, possibly spread across multiple partitions and keys, in merely three lines of code.
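
A minimal sketch of such a read follows; the class and method names are assumptions modelled on the project's documentation, so check readthedocs for the exact API:

# All names below (AwsConnectionSettings, AwsConnectionManager, AthenaUtil and
# its method) are assumptions based on the hip-data-tools documentation and may
# differ from the released API; treat this as a sketch, not a reference.
from hip_data_tools.aws.athena import AthenaUtil
from hip_data_tools.aws.common import AwsConnectionManager, AwsConnectionSettings

conn = AwsConnectionManager(
    AwsConnectionSettings(region="ap-southeast-2", secrets_manager=None, profile="default")
)
athena = AthenaUtil(database="my_database", conn=conn)

# Pull the table (all partitions and S3 keys behind it) straight into pandas.
df = athena.get_pandas_df_from_table(table="my_table")  # assumed method name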

Convert SQL into a new Athena table

This operation uses existing data in Athena tables and the serverless Athena engine to transform data at scale.

The code snippet is self-contained and should run as is.
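
Again as a sketch: the util class and its run_query call are assumed names, while the CREATE TABLE AS SELECT statement itself is standard Athena SQL:

# AthenaUtil, AwsConnectionManager and run_query are assumed names; check the
# hip-data-tools docs for the exact API. The CTAS statement is standard Athena.
from hip_data_tools.aws.athena import AthenaUtil
from hip_data_tools.aws.common import AwsConnectionManager, AwsConnectionSettings

athena = AthenaUtil(
    database="my_database",
    conn=AwsConnectionManager(
        AwsConnectionSettings(region="ap-southeast-2", secrets_manager=None, profile="default")
    ),
)

# CREATE TABLE AS SELECT: the serverless Athena engine runs the transformation
# and writes the result back to S3 as Parquet, registering a new table over it.
athena.run_query(
    """
    CREATE TABLE my_database.daily_job_counts
    WITH (format = 'PARQUET',
          external_location = 's3://my-data-lake/curated/daily_job_counts/')
    AS SELECT date_trunc('day', created_at) AS day, count(*) AS jobs
    FROM my_database.jobs
    GROUP BY 1
    """
)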

What’s Next?

As you can see, we have tackled only the first challenges in our ambitious plan; we are nowhere close to being done. We believe this solution will need to cater to the ever-widening gamut of cloud technologies that are becoming part of the modern Data Platform.

  1. We plan on improving the ease of use within this project, by adding more examples and documentation.
  2. We are constantly adding new data sources and sinks, and making this tool work with diverse data platforms.
  3. We are considering modularisation of this project for users to selectively install the technologies and libraries required based on their own data platform.

We are always open to suggestions and contributions; please feel free to raise a GitHub issue to get your suggestion prioritised.


Ari Surana
hipages Engineering

Principal Machine Learning Engineer and Technical Leader working towards building more intelligent machines for a bountiful future.