ETL on Hadoop made simple!


May 14, 2015

Gokul Gunasekaran is a software engineer at Cask, where he is building software to enable the next generation of data applications. Prior to Cask, he worked on Architecture Performance and Workload Analysis at Oracle.

In this post, we will walk you through performing common ETL tasks on Hadoop using the open-source Cask Data Application Platform (CDAP).

A typical ETL pipeline consists of a data source, followed by one or more transformations used for filtering or cleaning the data, and ends in a data sink.

For example, an organization might take a snapshot of their production database and load the data into Hadoop for further offline processing. Another example is feeding web log data continuously into an OLAP Cube to derive valuable insights in real time.

To develop an ETL solution from scratch, you would need a deep understanding of the various data stores to be used as sources and sinks, the programming paradigms of the batch and real-time processing frameworks, and how to connect them together. You would also need to understand their consistency models and failure semantics, and find ways to make them work together despite their differences.

CDAP comes with a selection of sources, sinks, and transformations that bridge different technologies and can be connected together via simple configuration to perform useful ETL tasks, delivering data in either batch or real time. The CDAP Documentation has the complete, and growing, list of available sources, sinks, and transformations.

If you don’t find a specific source, sink or transformation you are looking for, you can always build one using CDAP’s Java API and use it along with existing ones. The ETL templates framework will do most of the plumbing for you in that case.
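To make that concrete: once a custom plugin is built and deployed, it is referenced in a pipeline configuration by its name and configured through properties, exactly like the built-in plugins in the examples that follow. The transform name and property below are purely hypothetical:

"transforms": [
  {
    "name": "MyCustomCleaner",
    "properties": {
      "fields.to.clean": "message"
    }
  }
]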

To run an ETL pipeline in CDAP, all you need to do is provide the desired configuration and start it. Let’s take a look at an example (the full example can be found online):

Objective: Fetch tweets from Twitter and write them to an HBase table in real time. In this case, the configuration steps to consider are:

  • Choose the application template: ETLRealtime
  • Select the source type: Twitter; and provide the properties it requires: OAuth credentials
  • Select the sink type: Table; and provide its required properties: the table name and the field whose value is used as the row key

Here’s an example of a complete configuration in JSON format:

{
  "config": {
    "sink": {
      "name": "Table",
      "properties": {
        "name": "tweetTable",
        "schema.row.field": "id"
      }
    },
    "source": {
      "name": "Twitter",
      "properties": {
        "AccessToken": "******",
        "AccessTokenSecret": "******",
        "ConsumerKey": "******",
        "ConsumerSecret": "******"
      }
    },
    "transforms": []
  },
  "template": "ETLRealtime"
}

You can now create the ETL pipeline through the CDAP CLI, the CDAP UI, or the CDAP RESTful interface. After creation, you can manage its lifecycle, such as starting and stopping it and retrieving its metrics and logs. By providing just a bare minimum of information, you have created a production-grade ETL pipeline!

The ETL pipeline as configured above doesn’t perform any transformation of the data. Let’s change that by including a transformation that drops fields from the data record or drops the entire record based on a filter condition. For example, let’s write only tweets that aren’t retweets to the Table. To do that, we can use the ScriptFilter transform, which takes a JavaScript snippet containing the filtering logic:

"transforms": [
{
"name": "ScriptFilter",
"properties": {
"script": "function shouldFilter(tweet) { return tweet.isRetweet; }"
}
}
]

You can chain multiple transformations together by adding a list of them to the configuration.
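For example, to drop a couple of fields before filtering out retweets, the transforms list could look like the following; the transforms are applied in the order they are listed. This sketch assumes a Projection transform with a drop property, and the field names being dropped are illustrative (check the CDAP Documentation for the exact transform names and properties available in your version):

"transforms": [
  {
    "name": "Projection",
    "properties": {
      "drop": "geoLat,geoLong"
    }
  },
  {
    "name": "ScriptFilter",
    "properties": {
      "script": "function shouldFilter(tweet) { return tweet.isRetweet; }"
    }
  }
]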

Under the hood

In a previous blog post, we introduced application templates that allow building reusable data applications, including ETL application templates, which are the focus of this post. CDAP v3.0 ships with two ETL application templates: ETLBatch and ETLRealtime. In the future, the list will grow to help solve other common Big Data problems.

The ETLBatch application template uses a schedulable workflow that runs a mapper-only MapReduce program to execute the ETL pipeline. The schedule is specified using a schedule property, which accepts a cron entry as its value.
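Here is a minimal sketch of what an ETLBatch configuration could look like, with the schedule given as a cron entry that runs the pipeline every ten minutes. The source and sink shown, and their properties, are illustrative assumptions; consult the CDAP Documentation for the exact properties each plugin requires:

{
  "config": {
    "schedule": "*/10 * * * *",
    "source": {
      "name": "Stream",
      "properties": {
        "name": "webLogs",
        "duration": "10m"
      }
    },
    "sink": {
      "name": "Table",
      "properties": {
        "name": "logTable",
        "schema.row.field": "ts"
      }
    },
    "transforms": []
  },
  "template": "ETLBatch"
}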

The ETLRealtime application template, on the other hand, uses a worker: a constantly running process. If you need to scale out, you can run multiple instances of the worker by setting the instances property.
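For example, the Twitter-to-Table configuration from earlier could be scaled out to two workers by adding the instances property. Its placement at the top level of config is an assumption to verify against the CDAP Documentation:

{
  "config": {
    "instances": 2,
    "sink": { "name": "Table", "properties": { "name": "tweetTable", "schema.row.field": "id" } },
    "source": { "name": "Twitter", "properties": { "ConsumerKey": "******", "ConsumerSecret": "******", "AccessToken": "******", "AccessTokenSecret": "******" } },
    "transforms": []
  },
  "template": "ETLRealtime"
}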

Summary

Now you know how to create an ETL pipeline to tackle the ETL challenges in your organization — in just five minutes! I highly encourage you to look at our ETL guide for more examples.

We invite you to download the CDAP SDK today to try this out and let us know what you think!
