Ruby based ETL pipeline with Iron.io and Redshift
As the data requirements of our Business Intelligence team grow we’ve been leveraging Iron.io’s IronWorker as our go-to platform for scheduling and running our Ruby based ETL (Extract, Transform, Load) worker pipeline.
Business units within HotelTonight collect and store data across multiple external services. The ETL pipeline is responsible for gathering all the external data and unifying it into Amazon’s excellent Redshift offering.
Redshift hits a sweet spot for us as it uses familiar SQL query language, and supports connections from any platform using a Postgres Adapter.
This allows our Ruby scripts to connect with the PG Gem, our Business Intelligence team to connect with their favorite SQL Workbench, and anyone in our organization with Looker access to run queries on the data.
The team at Iron.io have been a great partner for us while building the ETL pipeline. Their worker platform gives us a quick and easy mechanism for deploying and managing all our Ruby workers.
The administration area boasts excellent dashboards for reporting worker status and gives us great visibility over the current state of our pipeline.
Single responsibility workers
Keeping the worker components modular allows us to separate the concerns of each worker and create a repeatable process for each of our ETL integrations.
The modular design allows for persistence points along the lifetime of pipeline. This allows us to isolate failures and recover if data integrity issues arise.
Chain of responsibility
Each worker in the pipeline is responsible for its own unit of work and has the ability to kick off the next task in the pipeline.
For example when the Mixpanel Extractor completes the export of data from the Mixpanel API it will fork a new Mixpanel Transformer with the S3 file path as a parameter.
Similarly, the Mixpanel Transformer will kick off the Redshift Load task when it has completed aggregating and transforming the data.
Extracting the data
We build a custom worker for each external data source. This process relies heavily on 3rd party API documentation. Big props to Desk.com for their great API docs.
Transforming the data
Each pipeline gets its own set of transformers. A transformer is responsible for sanitizing data and returning a CSV row in the ordinal column position of the Redshift target table (these rows will be used for the DB copy).
The Transformer is responsible to creating a CSV file on S3 with the import data. It then kicks off a Redshift Load worker with the S3 path of the import file.
Loading the data
The final step of the pipeline involves copying the transformed data into Redshift. This is achieved using the Postgres
The keen observer will notice we’re using the
MANIFEST flag during the
COPY operation. This is a Redshift specific feature that allows us to copy multiple CSV files in parallel (this dramatically speeds up imports when dealing with millions or rows of data).
The compiled manifest file looks like this:
Working with unified data
Previously if we wanted to analyze data across multiple 3rd party providers we needed to export CSV’s from each provider, create pivot tables, and apply lots of Excel wizardry.
Now, with all our data in Redshift we can run SQL queries and
JOIN data from multiple platforms into one result set.
Originally published at engineering.hoteltonight.com on May 27, 2014.