
Build Data Pipeline with Apache Hop

Tejas Marvadi · Published in CodeX · Jan 31, 2021 · 8 min read


Apache Airflow has become a de facto tool for Data Engineering, but don’t overlook other tools out there that can boost your productivity. Hop (Hop Orchestration Platform) project is one of the less known yet powerful tools that is incubating at Apache. In this article, I am going to briefly explain what Apache Hop is and will demonstrate how to create a simple pipeline without writing any code with Hop.

Official Apache Hop Project Logo

What is Apache Hop?

Hop is an open-source data integration tool that started as a fork of Pentaho Data Integration (PDI), also known as Kettle. It offers a visual development environment that makes developers more productive and is less daunting for those who prefer building pipelines without writing code. Hop workflows and pipelines can run on various engines, including Hop's own native engine, or on Spark, Flink, Google Dataflow, or AWS EMR through Apache Beam. Hop is one of the first GUI-based designers for building Apache Beam pipelines.

Hop Components

Hop is composed of the following three primary components:

  • Hop GUI. A graphical editor for building pipelines (transformations) and workflows (jobs). It allows you to build complex ETL (Extract, Transform, and Load) jobs without writing any code, and its drag-and-drop interface lets you author, edit, run, or debug a pipeline or workflow.
  • Hop Run. A standalone CLI utility that can be used to execute both pipelines and workflows; see the example after this list.
  • Hop Server. A lightweight web container that allows you to execute your pipelines and workflows on a remote server. It can be deployed on multiple servers and also provides a REST API to invoke your workflows and pipelines remotely.
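For example, once you have saved a pipeline from Hop GUI, you could execute it with Hop Run from the command line along these lines (a sketch only; flag names may vary between builds, and currency-exchange.hpl and local are assumed file and run-configuration names):

./hop-run.sh --file=currency-exchange.hpl --runconfig=local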

Get Started with Hop

To get started, download a recent build from the official Apache Hop project website. Once the file is downloaded, unzip it and start Hop GUI with the following command:

On Linux:

./hop-gui.sh

On Windows:

hop-gui.bat

After starting the Hop GUI, you’ll be presented with the following window.

Hop GUI Editor

This editor allows you to build a complex data pipeline with just a drag-and-drop interface. To give you a better understanding of how Hop works, let me walk through a real-world example.

Real-World Example with Hop

I am going to show a real-world example of how you can pull currency exchange data from an API and store it in a MySQL database table. For this demonstration, I'm going to use the Open Exchange Rates API, which offers a free plan with USD as the base currency. Before we start building the pipeline in Hop, let's first understand how we are going to structure it.

First, we will send an API request to the Open Exchange Rates endpoint to retrieve the latest currency exchange data. Upon a successful request, the endpoint will send back a JSON object similar to the one below.

{
  "disclaimer": "Usage subject to terms: https://openexchangerates.org/terms",
  "license": "https://openexchangerates.org/license",
  "timestamp": 1611871200,
  "base": "USD",
  "rates": {
    "AED": 3.672995,
    "AFN": 77.503946,
    "ALL": 102.210542,
    "AMD": 519.485125,
    "ANG": 1.794941,
    "AOA": 655.668,
    "ARS": 87.146455,
    "AUD": 1.302647,
    /* .... */
  }
}
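For reference, this response can be fetched with a simple request like the one below (YOUR_APP_ID is a placeholder for the key you receive when signing up for the free plan):

curl "https://openexchangerates.org/api/latest.json?app_id=YOUR_APP_ID"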

We will parse this JSON object (without writing any code); it contains a UNIX timestamp (UTC seconds), a base currency (3-letter ISO code), and a rates object with "symbol: value" pairs relative to USD. Subsequently, we will clean up unnecessary data before inserting it into a MySQL database. The end result should fit into a MySQL table with three columns: Exchange_Date, Exchange_Currency, and Exchange_Rate.
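If it helps to see the logic spelled out, here is a rough Python equivalent of what the pipeline will do visually (a sketch only, assuming the requests library and the placeholder YOUR_APP_ID; the Hop pipeline itself requires no code):

from datetime import datetime, timezone

import requests

# Placeholder endpoint; sign up for a free plan to get an app ID.
URL = "https://openexchangerates.org/api/latest.json?app_id=YOUR_APP_ID"
WANTED = {"CAD", "AUD", "GBP", "EUR", "INR", "SGD", "NZD"}

payload = requests.get(URL, timeout=30).json()

# Convert the UNIX timestamp (UTC seconds) to a calendar date.
exchange_date = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc).date()

# Normalize the "symbol: value" pairs into one row per selected currency.
rows = [(exchange_date, symbol, rate)
        for symbol, rate in payload["rates"].items()
        if symbol in WANTED]

for row in rows:
    print(row)  # e.g. (datetime.date(2021, 1, 28), 'CAD', 1.283072)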

Let’s begin by creating a new pipeline by clicking the ‘New’ icon in the upper left corner.

Once you have created a new pipeline, you can start building it by clicking anywhere on the editor canvas. Upon clicking, you will be presented with the dialog shown below.

This dialog box shows the list of transforms from which you build your pipeline. At the time of writing, the Hop team had not yet officially published documentation for these transforms (see the update at the end of this article).

Let's add the following transforms to our pipeline by searching for them in the search box and then clicking on them:

  • Generate Rows
  • JSON Input
  • Calculator
  • Select values
  • Row Normaliser
  • Insert / Update

After adding them all to your pipeline, you can arrange them as indicated below:

These six transforms are all that is required to build our pipeline. Let's configure them one at a time and connect them by creating a hop between each pair.

  • Generate rows: This transform generates a specified number of rows as output. In our case, we will use it to pass the API endpoint URL for Open Exchange Rates, as indicated below. The output of this transform becomes the input for the next one (JSON Input).
  • JSON Input: This transform reads data from JSON objects, files, or incoming fields, using JSONPath expressions to extract data and output rows. We are going to use it to extract data from the URL that was defined in the 'Generate rows' transform.

To keep this demonstration simple, we are going to extract only selected currencies, as indicated below. For each value we provide a JSONPath expression to extract it from the JSON response, the name of the output field, and the field's data type.
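For instance, the JSON Input fields could be configured along these lines (the field names and types here are illustrative):

Field Name | JSONPath Expression | Type
-----------|---------------------|--------
timestamp  | $.timestamp         | Integer
CAD        | $.rates.CAD         | Number
AUD        | $.rates.AUD         | Number
GBP        | $.rates.GBP         | Number
EUR        | $.rates.EUR         | Number
INR        | $.rates.INR         | Number
SGD        | $.rates.SGD         | Number
NZD        | $.rates.NZD         | Number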

  • Calculator: We are using this transform to convert the UNIX timestamp from seconds to milliseconds by multiplying it by 1000 (for example, 1611871200 seconds becomes 1611871200000 milliseconds). We will use this value in the next transform (Select values) to derive a human-readable date.
  • Select values: This transform lets us select, remove, and rename fields, change data types, and configure the length and precision of the fields on the stream. We are using it to remove the unnecessary fields (URL, timestamp, and CONSTANT) that we created in the previous transforms.

Now let’s use this same transform to convert the UNIX timestamp from milliseconds to a normal date.

At this point, let’s preview our pipeline to see what our output would look like.

In order to store the data in the MySQL database, let's normalize the stream so we can store it in three columns, as shown below.

Exchange_Date | Exchange_Currency | Exchange_Rate
--------------|-------------------|--------------
2021-01-28    | CAD               | 1.283072
2021-01-28    | AUD               | 1.303642
2021-01-28    | GBP               | 0.728617
2021-01-28    | EUR               | 0.825002
2021-01-28    | INR               | 72.93905
2021-01-28    | SGD               | 1.32905
2021-01-28    | NZD               | 1.39401

  • Row Normaliser: We are using this transform to convert columns into rows so that the data fits the three-column table layout above.
  • Insert / Update: This transform looks up a row in the target table using one or more key fields; if a match is found, it updates the row, otherwise it inserts a new one.

Now let’s use this transform to create a new database connection.

Click on the database icon to open up a window and fill in the information as follows:

Once you fill out the information, you can click on the “Test” button to ensure the connection is working before hitting the “OK” button.

Now we need to define the table and schema information, as well as the key and update fields, as indicated below. This allows us to run our pipeline multiple times per day without creating duplicate records in the table.
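In effect, the Insert / Update transform behaves much like a MySQL upsert. A rough SQL equivalent, assuming a unique key on (Exchange_Date, Exchange_Currency), would be:

INSERT INTO TEST.currency_exchange (Exchange_Date, Exchange_Currency, Exchange_Rate)
VALUES ('2021-01-28', 'CAD', 1.283072)
ON DUPLICATE KEY UPDATE Exchange_Rate = VALUES(Exchange_Rate);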

Now it's time to create the table in the database by clicking on the "SQL" button, which opens a new window with a generated SQL statement. You can modify the script as you like before clicking the Execute button.
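The generated statement should resemble the sketch below (the column types are my assumption; Hop derives them from the stream metadata, so review them before executing):

CREATE TABLE TEST.currency_exchange (
  Exchange_Date DATE,
  Exchange_Currency VARCHAR(3),
  Exchange_Rate DOUBLE
);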

Once it successfully creates a table, you can come back to the main editor window by closing all pop-up transform windows.

Finally, you are ready to run your first pipeline in Apache Hop. Click the play button in the upper left corner, and you'll be presented with a Run Options window as shown below.

Click on the Launch button, and your pipeline will start executing on your local machine. Once your pipeline runs successfully, you'll see a green checkmark on top of each transform. The "Execution Results" panel at the bottom displays metrics as well as logging information.

Let’s verify the result in the MySQL database table.

mysql> SELECT * FROM TEST.currency_exchange;
+---------------+-------------------+---------------+
| Exchange_Date | Exchange_Currency | Exchange_Rate |
+---------------+-------------------+---------------+
| 2021-01-28    | CAD               |      1.283072 |
| 2021-01-28    | AUD               |      1.303642 |
| 2021-01-28    | GBP               |      0.728617 |
| 2021-01-28    | EUR               |      0.825002 |
| 2021-01-28    | INR               |      72.93905 |
| 2021-01-28    | SGD               |       1.32905 |
| 2021-01-28    | NZD               |       1.39401 |
+---------------+-------------------+---------------+
7 rows in set (0.00 sec)

Where to Go from Here?

Now that you have a working pipeline, you can create other pipelines and orchestrate them by creating a workflow. You may also want to run your workflows or pipelines with Hop Server on a headless machine, or even deploy it on an AWS EC2 instance. You can do all of these; however, keep in mind that the Apache Hop project is still in the incubation phase, so be cautious before deploying any of your pipelines in a production environment.

Conclusion

Apache Hop is a new open-source data integration platform that is still in its incubation phase. I feel this project has a lot of potential, especially because it allows you to run your pipelines on different engines, such as Spark, Flink, Google Dataflow, or AWS EMR through Beam. I believe this platform can be a game-changer for data pipeline development because it lets you visually develop pipelines that target all of these platforms. I am excited about the future of Apache Hop.

[Feb 10, 2021 Update: The official documentation is now available at https://hop.apache.org/manual/latest/index.html]
