Processing HTTP Data with CDAP

In this blog post, we will create a CDAP pipeline that makes HTTP requests to fetch data and then processes it. We will be working with police incident data from DataSF (https://datasf.org), specifically the dataset “Police Department Incident Reports: Historical 2003 to May 2018”.

From the dataset page, we can see that this dataset has 2.21 million rows and 13 columns. Let’s start by downloading a sample of the data from that page.

Now, in CDAP, we can go to Wrangler to inspect the data and apply some transformations. Navigate to Wrangler, and browse to the location of the downloaded file.

The downloaded sample contains only 1,000 rows plus a header. The API appears to have pagination built in, and each page contains the same header, so let’s filter out this header (a quick sketch of the idea follows these steps).

  1. Copy the value of the header.
  2. Click on the directive dropdown in the header.
  3. Select Filter.
  4. Choose “Remove rows”.
  5. Under “If”, choose “value is”.
  6. Paste the copied header into the textbox.
  7. Click Apply.
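Outside of Wrangler, the effect of this filter is easy to picture. Here is a minimal Python sketch of the same idea, assuming the sample was saved locally as sample.tsv (a hypothetical file name):

    # Drop repeated header rows from a paginated TSV dump.
    # "sample.tsv" is a hypothetical name for the downloaded sample file.
    header = None
    rows = []
    with open("sample.tsv", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if header is None:
                header = line      # remember the first header we see
            elif line != header:   # every page repeats this same header
                rows.append(line)
    print(f"kept {len(rows)} data rows")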

Now we need to parse this TSV data (see the sketch after these steps).

  1. Click on the directive dropdown.
  2. Choose Parse.
  3. Choose CSV.
  4. Select “Tab” as the delimiter.
  5. Click Apply.
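Under the hood, this is plain CSV parsing with a tab delimiter. A self-contained Python equivalent, with a tiny inline sample standing in for the real data, could look like this:

    import csv
    import io

    # Two illustrative tab-separated rows standing in for the real data.
    sample = "120058272\tWEAPON LAWS\tMonday\n120058273\tVANDALISM\tTuesday\n"
    reader = csv.reader(io.StringIO(sample), delimiter="\t")
    for record in reader:
        print(record)  # each record becomes a list of column values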

Next, delete the original body column.

  1. Click on the directive dropdown for body.
  2. Click Delete column.

Set the column names (the full comma-separated list is shown after these steps).

  1. Click on the Column names dropdown.
  2. Click on “Set all”.
  3. Paste the same header into the textarea.
  4. Remove all quotes and spaces, and put commas in between.
  5. Click Apply.
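For reference, the comma-separated list should end up looking roughly like the following. These names are based on the Socrata field names for this dataset, so double-check them against the header you copied:

    incidntnum,category,descript,dayofweek,date,time,pddistrict,resolution,address,x,y,location,pdid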

Click the “Create a Pipeline” button, and choose Batch pipeline. You will be taken to the Pipeline Studio view. Notice that you have two connected plugins: File and Wrangler.

What we need to do now is replace the File source with the HTTP source, so that we can connect to the actual source of the data. Click on the menu for the File source, and click Delete.

Find the HTTP source in the left panel, and click it to add the plugin. Then connect the HTTP source to Wrangler.

Configuring HTTP Source

Open the properties of the HTTP source. The first thing we have to fill out is the Reference Name. This name is used to track the lineage of this data source.

Next, set the URL: https://data.sfgov.org/resource/tmnf-yvry.tsv

According to the DataSF documentation for this dataset, we should set an app token (X-App-Token) in the request headers.
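In the plugin, this goes into the Headers section as a key-value pair, with X-App-Token as the key and your token as the value. For illustration, here is roughly what the same request looks like outside CDAP in Python (the token value is a placeholder, and the requests library is assumed to be installed):

    import requests

    # "YOUR_APP_TOKEN" is a placeholder; request a real token from DataSF.
    headers = {"X-App-Token": "YOUR_APP_TOKEN"}
    resp = requests.get(
        "https://data.sfgov.org/resource/tmnf-yvry.tsv",
        headers=headers,
        timeout=30,
    )
    print(resp.status_code, len(resp.text))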

Next, set the Format to “text”, and set the Output Schema to a single field named “body”.

Finally, the API provided by DataSF supports pagination. According to the documentation, we can page through the data by setting $limit and $offset in the URL.

First, go down to the Pagination section in the properties. The plugin supports multiple pagination strategies out of the box, but for this use case we are going to use “Increment an index”. Set the Start Index to 0, and the Max Index to 2500000 (2.5 million), because this limit needs to be larger than the number of rows in the dataset. Next, set the Index Increment to 100000; this is the page size.

Now, let’s modify the URL that we set in the beginning. The plugin will substitute the current index wherever the placeholder {pagination.index} appears.

Update the URL to:

https://data.sfgov.org/resource/tmnf-yvry.tsv?$limit=100000&$offset={pagination.index}
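To make the behavior concrete, here is a rough Python sketch of what the “Increment an index” strategy does with these settings. This is only an illustration of the paging logic, not the plugin’s actual implementation:

    import requests

    BASE = "https://data.sfgov.org/resource/tmnf-yvry.tsv"
    PAGE_SIZE = 100_000    # Index Increment, also used as $limit
    MAX_INDEX = 2_500_000  # Max Index: must exceed the dataset's row count

    for offset in range(0, MAX_INDEX + 1, PAGE_SIZE):
        # The plugin replaces {pagination.index} with the current index.
        url = f"{BASE}?$limit={PAGE_SIZE}&$offset={offset}"
        page = requests.get(url, timeout=60).text
        if not page.strip():
            break  # an empty page means there is no more data
        # ... in the pipeline, each page then flows into Wrangler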

To finish the pipeline, add a sink. In this example, I will be using Google BigQuery. Name the pipeline, then click Deploy.

Give it a try, and click Run!

The HTTP batch source is able to do a lot more, such as OAuth authentication, proxy support, and SSL. With all these features, you can now process data from public datasets, SaaS applications, CRMs, and ERP systems, as long as they expose HTTP APIs to the data! Stay tuned for more specific plugins that will unlock some of these use cases.

To learn more about the HTTP plugin, you can go to the repository page https://github.com/data-integrations/http.


CDAP is a 100% open-source framework for building data analytics applications.

Edwin Elia
Edwin is a Software Engineer for Cloud Data Fusion at Google. He specializes in front-end development, specifically data analytics user interfaces.
