Build Your First Data Pipeline in Just Ten Minutes

A step-by-step guide to building your first data pipeline in PDI, using a real-world use case.

Shravankumar Suvarna
The Startup
8 min read · May 2, 2020


Building data pipelines is the need of the hour in the current data age. We constantly deal with large volumes of data, and before we can analyse it we need pipelines that structure it according to our requirements. Data pipelines are used across industries, and mastering this skill is essential in today’s world.

We will go through a real-world use case for building our first data ingestion pipeline. Don’t worry if you have never done this before; we will follow a step-by-step guide.

Photo by Markus Spiske on Unsplash

User Stories - Definition of the Use Case

Stories are a simple and elegant way of defining features; each one describes the goal or purpose of a specific feature. Let’s define the user stories for our first data pipeline. Stories are always written from the user’s perspective, and below are the stories for our use case.

  • I want to read a CSV file from a specified folder.
  • I want to perform a calculation on the input data, e.g. ‘Market Capitalization’.
  • I want to delete a few extra columns.
  • I want to save the output file in Excel format in a specified folder.

In general, we should create a user story backlog before starting any project. This helps us define the high-level architecture of the data ingestion flow. Stories can also be more descriptive and include the intention behind developing a feature.

Now that we have defined our stories, we can start building our data pipeline using Pentaho Data Integration (PDI). If you have not installed PDI on your machine, I recommend going through the link below for a step-by-step installation guide.

Input data

We will be using Apple Inc. historical price data downloaded from Yahoo Finance for our data ingestion pipeline. The file consists of price and volume data starting from 1980; it has 9,932 rows (including the header) and seven columns. A basic understanding of the input data is essential before starting the process.
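If you want a quick feel for the file before wiring it into PDI, a small pandas sketch like the one below can print its shape and columns. This is not part of the PDI flow; the path and the typical Yahoo Finance column names are assumptions.

    import pandas as pd

    # A quick look at the raw Yahoo Finance export before building the pipeline.
    # Path and column names are assumptions based on a standard Yahoo Finance download.
    df = pd.read_csv(r"D:\Work\FirstDataPipeline\Input\AAPL.csv")

    print(df.shape)               # expect roughly (9931, 7): 9,932 rows including the header
    print(df.columns.tolist())    # typically Date, Open, High, Low, Close, Adj Close, Volume
    print(df.head())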

Test Cases

It is also good practice to define the test cases before building the pipeline. We will perform the below checks for our use case (a sketch for automating them follows the list).

  • Check the output file format (.xls).
  • Cross-validate the top five and bottom five price rows against the input file.
  • Cross-validate the calculation of market capitalization for 10 days.
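Once the pipeline has run, these checks can also be scripted. Below is a rough pandas sketch, assuming the output is saved as AAPL.xls with a ‘Data’ sheet, that the xlrd package is available for reading .xls files, and that the column names match the ones used later in this guide.

    import pandas as pd

    # Hypothetical paths and column names; adjust them to your own folders and headers.
    input_df = pd.read_csv(r"D:\Work\FirstDataPipeline\Input\AAPL.csv")
    output_df = pd.read_excel(r"D:\Work\FirstDataPipeline\Output\AAPL.xls", sheet_name="Data")

    # Check 1 is implicit: read_excel only succeeds if the output really is an Excel file.
    # Check 2: eyeball the top five and bottom five price rows against the input.
    print(input_df[["Date", "Adj Close"]].head(), "\n", output_df[["Date", "Adj_Close"]].head())
    print(input_df[["Date", "Adj Close"]].tail(), "\n", output_df[["Date", "Adj_Close"]].tail())

    # Check 3: recompute market capitalization for ten days and compare with the pipeline output.
    recomputed = output_df["Adj_Close"].head(10) * output_df["Volume"].head(10)
    print((output_df["MarketCap"].head(10) - recomputed).abs().max())   # should be ~0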

Step-1: Set-up for the project

We will create the folders listed below. It’s always good practice to store all your work-related files in a designated folder for an easy backup process. I created a ‘Work’ folder on my D: drive. Again, this is completely optional and you can organise it differently.

  1. Create a project folder named ‘FirstDataPipeline’ (you can choose any name).
  2. Create ‘Input’ and ‘Output’ folders inside it, to hold the input and output data.
  3. Open Spoon from the ~/data-integration folder. Once it opens, click the small plus button with the file icon and choose Transformation.
  4. Write a brief description of the data pipeline and save the transformation in our project folder with the name ‘FirstDataPipeline’.

Your set-up folder should look something like the screenshot below.

Set-up folder with Input and Output folders
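If you prefer to script the folder layout, a few lines of Python along the lines below will create it. This is purely optional, and the base path is just my own choice.

    from pathlib import Path

    # Project layout used throughout this guide; adjust the base path to taste.
    base = Path(r"D:\Work\FirstDataPipeline")
    for folder in ("Input", "Output"):
        (base / folder).mkdir(parents=True, exist_ok=True)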

If you have any difficulty understanding the above steps or Spoon (the PDI desktop app), please go through the link below.

Step-2: Reading CSV file from the folder

As per our story, we first need to read the input CSV file using PDI. PDI offers a lot of plugins/steps out of the box, and we will be using one of them: ‘Text file input’.

Type ‘Text file input’ in the Design tab on the left side of Spoon and drag the step onto the canvas.

Searching for the ‘Text file input’ step

Now, double-click on the step, or right-click and choose Edit, to add our custom properties. PDI gives us a lot of configuration options.

Properties to edit within the Text file input step:
  • Change the step name from ‘Text file input’ to ‘input’. This maintains a standard naming convention across steps.
  • In the File tab, click Browse next to the File or Directory option, choose our file (D:\Work\FirstDataPipeline\Input\AAPL.csv) and then click Add.
  • In the Content tab, change the Format value from DOS to mixed; this takes care of CRLF (carriage return/line feed) line endings.
  • In the Fields tab, click the Get Fields button and PDI will auto-populate the columns and data types for us by reading the top 100 sample rows (we can change the number of sample rows).
The Fields tab populated with the input columns and data types
  • Click Preview rows to test if the numbers are properly read by PDI.
Preview of the rows read from the input file

That’s it; we have successfully read the CSV input file.

Step-3: Calculation of Market Capitalization

Market capitalization is strictly derived from the share price and the number of shares outstanding; since our input CSV only carries price and volume, we will multiply the adjusted close price by the volume and call the result ‘MarketCap’ for this exercise. Here again, PDI gives us many options to meet this requirement. We will use the simplest of the lot: the ‘Calculator’ step.

Type ‘Calculator’ in the search box on the left panel and double-click the Calculator step to place it on the canvas. Since the Text file input step is already present on the canvas, PDI will automatically create a hop between the two steps.

Hops are an important concept in PDI: the arrow indicates the direction of data flow. In the below example, data will flow from the text input step to the calculator.

Hop between the input and Calculator steps

Open the ‘Calculator’ step by double-clicking on it to configure our custom calculation.

  • We need to create a new field named MarketCap. We can define this under the New field column in the Calculator step.
  • In the Calculation column, choose the type of calculation to perform. Here we want to multiply the two fields Adj_Close and Volume, so we will choose A*B from the variety of options provided by PDI (see the sketch after this list).
  • Choose Field A and Field B from the drop-down menus. PDI auto-detects the input stream and shows us the columns available for data manipulation. Easy!
  • Choose Number as the value type. Data types matter here: PDI is developed in Java, a strongly typed language, so we need to make sure we provide the correct data type.
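For readers who think in code, this is roughly what the Calculator step does, expressed as a pandas sketch rather than in PDI. The rename is an assumption: a raw Yahoo Finance download names the column ‘Adj Close’, while our PDI field is called Adj_Close.

    import pandas as pd

    df = pd.read_csv(r"D:\Work\FirstDataPipeline\Input\AAPL.csv")
    df = df.rename(columns={"Adj Close": "Adj_Close"})   # raw Yahoo header uses a space

    # The Calculator step's A*B operation: A = Adj_Close, B = Volume, result = MarketCap.
    df["MarketCap"] = df["Adj_Close"] * df["Volume"]

    print(df[["Date", "Adj_Close", "Volume", "MarketCap"]].head())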

Step-4: Delete unnecessary columns

Now, we need to remove unnecessary columns from the input stream. We want to perform further analysis using four columns, namely Date, Adj_Close, Volume and MarketCap. Therefore, we will delete the remaining fields.

We will use the ‘Select values’ step. Search for it in the Design tab and double-click it to add it to the canvas.

  • Open the ‘Select values’ step, go to the Remove tab and click Get fields to remove.
  • Once the fields are populated, delete from that list the ones we actually want to keep, so that only the fields to be removed remain.
Remove tab of the Select values step

Yes! That’s it; we have removed the unnecessary fields.

Now, we don’t necessarily have to use the ‘Select values’ step for our use case, as we can directly define the fields we want in the output step. However, I wanted to demonstrate this step for future reference.
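The pandas analogue of this step is a single column selection. The sketch below repeats the read and MarketCap calculation from the previous sketch so that it runs on its own; the column names are again assumptions.

    import pandas as pd

    # Same read and MarketCap calculation as in the Calculator sketch above.
    df = pd.read_csv(r"D:\Work\FirstDataPipeline\Input\AAPL.csv")
    df = df.rename(columns={"Adj Close": "Adj_Close"})
    df["MarketCap"] = df["Adj_Close"] * df["Volume"]

    # The Select values step, expressed in pandas: keep only the four fields we need,
    # which implicitly removes Open, High, Low and Close.
    df = df[["Date", "Adj_Close", "Volume", "MarketCap"]]
    print(df.head())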

Step-5: Save the output file

So here is the final step of the pipeline. We need to change the format of the data and save the output in a specified folder. We will use the ‘Microsoft Excel output’ step for this story.

As usual, search for ‘Microsoft Excel output’ in the Design tab and double-click it; PDI will automatically create the hop from the previous step.

In case of any issues with the hop, click on a step and several options will appear; choose the output connector and drag it to the next step (for example, from calculate_market_cap to remove_field).

Data flow with all our steps

We need to change a few properties here in the Excel output step, and then our data pipeline will be ready.

  • Change the step name to ‘output’ to match our naming convention.
  • In the File tab, browse to the output location (D:\Work\FirstDataPipeline\Output\AAPL) in the Filename field. Since we have already created the folder, we don’t need to tick the Create Parent folder checkbox.
  • Change the Sheet name to ‘Data’ in the Content tab.

That should be it for this step. Since we are using ‘Select values’ to remove the unnecessary fields, we don’t need to define them again in the Fields tab here.

Microsoft Excel output properties: Filename and Sheet name
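Putting all the stories together, the whole flow corresponds roughly to the pandas sketch below (again only an analogue of the PDI transformation, not a replacement for it). Note that recent pandas versions no longer write legacy .xls files, so the sketch writes .xlsx via openpyxl, whereas the PDI step produces the .xls file our test cases expect.

    import pandas as pd

    # End-to-end analogue of the pipeline: read, calculate, select, write.
    df = pd.read_csv(r"D:\Work\FirstDataPipeline\Input\AAPL.csv")
    df = df.rename(columns={"Adj Close": "Adj_Close"})
    df["MarketCap"] = df["Adj_Close"] * df["Volume"]
    df = df[["Date", "Adj_Close", "Volume", "MarketCap"]]

    # Write the result to an Excel sheet named 'Data' (requires openpyxl).
    df.to_excel(r"D:\Work\FirstDataPipeline\Output\AAPL.xlsx",
                sheet_name="Data", index=False)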

Step-6: Execute the flow and check the log

Let’s test the data pipeline flow that we have created. Save the transformation and click the Run button at the top left corner of the canvas.

Once we run the flow, we will see the below error. I purposely didn’t configure for it above. In PDI, as in any programming language, you will face errors and will have to read the logs to understand the reason. Here, as you can see, it is related to data types: PDI/Java is not able to convert the string ‘null’ to a number (which is the data type we chose).

Quick fix: open the ‘Text file input’ (input) step again, go to the Error Handling tab and tick ‘Ignore errors?’ and ‘Skip error lines?’. This will skip the lines containing the string ‘null’.
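For reference, the pandas analogue of this quick fix is simply dropping the rows whose numeric fields could not be parsed. The column names below are the raw Yahoo Finance headers and are assumptions; adjust them to your file.

    import pandas as pd

    df = pd.read_csv(r"D:\Work\FirstDataPipeline\Input\AAPL.csv")

    # Rows exported as the literal string 'null' become NaN when pandas reads the CSV.
    # Dropping them mirrors ticking 'Ignore errors?' and 'Skip error lines?' in PDI.
    before = len(df)
    df = df.dropna(subset=["Adj Close", "Volume"])
    print(f"Skipped {before - len(df)} rows with missing price/volume values")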

Now, let’s save and run again.

Hurray! We have successfully set up and run our first data pipeline. Please test the output file against the test cases defined above.

Conclusion

This was one real-world problem statement that we solved. We can easily build similar data pipelines by configuring the various plugins/steps and their properties. In the next post, we will take another use case and understand how to handle loops and conditions in PDI.

See you in the next post. Happy ETL
