Driving Analysis by Managing Data with ETL (a Ruby gem!)

A look at a Ruby gem that Square’s Business Intelligence team wrote to facilitate ETLs.

Square Engineering
Square Corner Blog
7 min readMar 14, 2013

--

Written by Jeff Iacono.

Heads up, we’ve moved! If you’d like to continue keeping up with the latest technical content from Square please visit us at our new home https://developer.squareup.com/blog

Business Intelligence at Square

At Square there are two main teams who ensure that data is accessible and accurate for analysis across the company. The Data Engineering team takes data from disparate sources and feeds it into databases that are accessible to analysts. The Business Intelligence (BI) team is then responsible for the transformation and organization of this data into highly optimized structures. This foundation empowers analysts within the Business Intelligence team and those across the company to derive insights and provide data driven recommendations.

What problem does this library solve?

The BI team is responsible for understanding complicated data and transforming it into a form that supports fast, accurate, data driven decision-making. The team developed an ETL (Extract, Transform, and Load) library that:

  • Empowers the development of robust, repeatable, and transformative data operations
  • Supports working with and transforming millions of records by leveraging intelligent (and built in) iteration features
  • Facilitates deploying any ETL operation in the same manner as any source code

Who else could benefit from this library?

The ETL library mentioned above will benefit:

  • Analytics teams responsible for quickly creating data structures that support analysis
  • Data teams that are technically focused and dedicated to developing an ETL pipeline
  • Anyone with a need to perform repeated functions over large data sets
  • Analysts that need to transform large datasets into a form that facilitates analysis
  • Anyone that is primarily database focused and looking to learn and apply Ruby

How does it work?

Installation

Add this line to your application’s Gemfile:

And then execute

Or install it yourself as:

Basic ETL

Assume that we have a database connection represented by connection.

To run a basic ETL that is composed of sequential SQL statements, start by creating a new ETL instance:

which can then be configured:

At this point it is possible to run the ETL instance via:

which executes #ensure_destination, #before_etl, #etl, and #after_etl in that order.

ETL with iteration

To add in iteration, simply supply #start, #step, and #stop blocks. This is useful when dealing with large data sets or when executing queries that, while optimized, are still slow.

Again, to kick things off:

where connection is the same as described above.

Next we can configure the ETL:

At this point it is possible to run the ETL instance via:

which executes #ensure_destination, #before_etl, #etl, and #after_etl in that order.

Note that #etl executes #start and #stop once and memoizes the result for each. It then begins to iterate from what #start evaluated to up until what #stopevaluated to by what #step evaluates to.

Check out the source code on Github.

What’s Next?

The next step is to integrate each set of ETL operations that our library manages into a pipeline framework, which will serve as the backbone of our evolving data warehouse.
We’re currently building both the pipeline and data warehouse and look forward to sharing our progress with you in the near future. If you’d like to tackle these problems with us, get in touch by going to https://squareup.com/careers/business and clicking “Finance.”

--

--