Arbalest: open source data ingestion at scale

If APIs are the manifestation of products and services that enable innovation and collaboration, open source is the foundation upon which APIs are built.


Open source software development is arguably one of the greatest examples of human collaboration. From the announcement of the GNU Project in 1983 and the release of Linux in 1991 to the billions of mobile devices it runs today, open source powers the software all around us. From the programming languages we use to the way we run our systems, Dwolla has benefited immensely from open source.

In addition to benefiting from open source software, we are excited to “pay it forward” and contribute back to the open source community. Open source software encapsulates shared experiences and problem solving. We not only want to share what we have learned, but also learn from the community’s feedback and contributions.

Processing event streams

One area we are contributing to is data and analytics. Events are the atomic building blocks of data at Dwolla. We use some great open source tools to process, analyze, and visualize this data. However, we needed a way to query all this data interactively, using the flexibility and scale of Amazon Web Services.

We are excited to release Arbalest, a Python data pipeline orchestration library for Amazon S3 and Amazon Redshift. It automates data import into Redshift and makes data queryable at scale in AWS.

Arbalest is the backbone of our data pipeline architecture and has enabled Dwolla to query and analyze billions of events. In a few lines of code (see the sketch after this list), it takes care of:

  • Ingesting data into Amazon Redshift
  • Schema creation and validation
  • Creating highly available and scalable data import strategies
  • Generating and uploading prerequisite artifacts for import
  • Running data import jobs
  • Orchestrating idempotent and fault tolerant multi-step ETL pipelines with SQL
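
As a brief illustration, here is roughly what a small pipeline looks like. This is a minimal sketch that loosely follows the project's README; the bucket, S3 key prefix, table, and column names are placeholders, not a definitive implementation.

    #!/usr/bin/env python
    import os
    import psycopg2
    from arbalest.redshift import S3CopyPipeline
    from arbalest.redshift.schema import JsonObject, Property

    if __name__ == '__main__':
        # A pipeline that copies JSON events from S3 into Redshift
        pipeline = S3CopyPipeline(
            aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
            aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
            bucket='example-event-bucket',  # placeholder bucket name
            db_connection=psycopg2.connect(os.environ['REDSHIFT_CONNECTION']))

        # Bulk copy every JSON object under an S3 key prefix into a
        # Redshift table, declaring the schema (and its validation) as code
        pipeline.bulk_copy(
            metadata='pipeline_metadata',
            source='events/2015/12/09',  # placeholder key prefix
            schema=JsonObject('event',
                              Property('id', 'VARCHAR(36)'),
                              Property('type', 'VARCHAR(255)'),
                              Property('createdAt', 'TIMESTAMP')))

        # Follow-up ETL steps are plain SQL run after the copy completes
        pipeline.sql('INSERT INTO event_counts '
                     'SELECT type, COUNT(*) FROM event GROUP BY type')

        pipeline.run()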

Rationale

Arbalest is a lightweight library designed to be composed with existing data tools. It is written in Python, arguably the de facto programming language of data science. Arbalest embraces configuration as code: unlike sometimes unwieldy configuration files, an Arbalest pipeline is only a few lines of code that can be tested, packaged, and reused. Finally, it automates the complicated and potentially brittle process of ETL (extract, transform, load).

We hope Arbalest enables data analysts and developers to spend less time managing their data and more time answering questions.

Use cases

Why use Arbalest? It is not a MapReduce framework; rather, it is designed to make Amazon Redshift (and all its strengths) easy to use within typical data workflows and tools. Here are a few examples:

  • You are already using a MapReduce framework to process data in S3. Arbalest can make the results of an Elastic MapReduce job queryable with SQL in Redshift, and any additional ETL can then be defined in plain old SQL.
  • You treat S3 as a catch-all data sink, perhaps persisting JSON messages or events from a messaging system like Kafka or RabbitMQ. Arbalest can load some or all of this data into a Redshift data warehouse, making the ecosystem of SQL available for dashboards, reports, and ad hoc analysis.
  • You have complex pipelines that could benefit from a fast, SQL-queryable data sink. Arbalest has out-of-the-box support (arbalest.contrib) for integrating with tools like Luigi as part of a multi-dependency, multi-step pipeline topology; a sketch follows this list.
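
For example, an Arbalest pipeline can run as one task in an existing Luigi graph. The sketch below is hypothetical: it uses Luigi's standard Task API rather than the actual arbalest.contrib helpers, and build_event_pipeline is a stand-in for whatever pipeline construction code you already have.

    import luigi

    class CopyEventsToRedshift(luigi.Task):
        """Hypothetical task that runs an Arbalest pipeline as one
        step of a larger, multi-dependency pipeline topology."""
        date = luigi.DateParameter()

        def output(self):
            # Marker file that tells Luigi this step has completed
            return luigi.LocalTarget('copy_events_%s.done' % self.date)

        def run(self):
            # build_event_pipeline is a placeholder for your own code
            # that constructs an Arbalest pipeline for the given date
            pipeline = build_event_pipeline(self.date)
            pipeline.run()
            with self.output().open('w') as marker:
                marker.write('done')

    if __name__ == '__main__':
        luigi.run()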

Getting Started

Arbalest is not meant to replace existing data tools, but to work with them. In our initial release we have included some batteries: for example, strategies for ingesting time series or sparse data, and support for integrating with existing pipeline topologies. We have described a few use cases, but are excited to see more applications of Arbalest in the community.
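
The two included ingestion strategies differ in how they select S3 keys. A bulk copy ingests everything under a key prefix, a natural fit for time series data laid out by date, while a manifest-based copy ingests an explicit list of keys, a better fit for sparse data. The sketch below assumes the bulk_copy and manifest_copy pipeline steps; the pipeline object, schema objects, and paths are placeholders carried over from the earlier example.

    # Time series data: copy everything under a date-based key prefix
    pipeline.bulk_copy(metadata='pipeline_metadata',
                       source='events/2015/12',  # placeholder prefix
                       schema=event_schema)      # placeholder schema

    # Sparse data: copy an explicit manifest of keys, so only new or
    # selected objects are ingested on each run
    pipeline.manifest_copy(metadata='pipeline_metadata',
                           source='transactions',      # placeholder
                           schema=transaction_schema)  # placeholder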

For more information on Arbalest, be sure to install the package, check it out and star it on GitHub, read the documentation, or look at our presentation from Tableau Conference 15.

This blog post shares insights from Fredrick Galoso, a software developer and technical lead for the data and analytics team here at Dwolla. Fred has led the creation of our data platform, including its data pipeline, predictive analytics, and business intelligence platform. In his free time he contributes to a number of open source software projects, including Gauss, a statistics and data analytics library.

Originally published at blog.dwolla.com on December 9, 2015.
