Processing HTTP Data with CDAP

Edwin Elia
Published in cdapio
Nov 11, 2019

In this blog post, we will create a CDAP pipeline that makes HTTP requests to fetch data and then processes it. We will be working with police incident data from DataSF (https://datasf.org), specifically the dataset “Police Department Incident Reports: Historical 2003 to May 2018”.

From the dataset page, we can see that this dataset has 2.21 million rows and 13 columns. Let’s start by downloading a sample of the data (one way to fetch a sample from the API is sketched below).
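If you want to grab a sample programmatically, here is a minimal Python sketch. It assumes the requests library and uses the Socrata endpoint that we configure in the HTTP source later in this post; the output file name is just a placeholder.

    import requests

    # Socrata endpoint for this dataset; the same URL is configured in the
    # HTTP source later in this post.
    URL = "https://data.sfgov.org/resource/tmnf-yvry.tsv"

    # Ask for roughly the same 1,000-row sample that the download link returns.
    resp = requests.get(URL, params={"$limit": 1000})
    resp.raise_for_status()

    with open("police_incidents_sample.tsv", "w", encoding="utf-8") as f:
        f.write(resp.text)

    print("Downloaded", len(resp.text.splitlines()), "lines")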

Now, in CDAP, we can go to Wrangler to inspect the data and apply some transformations. Navigate to Wrangler and browse to the location of the downloaded file.

The downloaded sample contains only 1,000 rows plus a header. The API appears to have pagination built in, and each page contains the same header line, so let’s filter it out (a Python sketch of the same idea follows the list below).

  1. Copy the value of the header.
  2. Click on the directive dropdown in the header.
  3. Select Filter.
  4. Choose “Remove rows”.
  5. Select If “value is”.
  6. Paste the copied header value into the textbox.
  7. Click Apply.
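Outside of Wrangler, the same cleanup is just dropping any line that repeats the header. Here is a rough Python sketch, continuing from the download snippet above (and using the same placeholder file name):

    # Treat the first line as the header and drop every later line that
    # repeats it, the equivalent of the "Remove rows" directive above.
    with open("police_incidents_sample.tsv", encoding="utf-8") as f:
        lines = f.read().splitlines()

    header, rows = lines[0], lines[1:]
    data_rows = [line for line in rows if line != header]
    print("Kept", len(data_rows), "data rows")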

Now we need to parse this TSV data (a rough Python equivalent follows the list below).

  1. Click on the directive dropdown.
  2. Choose Parse.
  3. Choose CSV.
  4. Select “Tab” as the delimiter.
  5. Click Apply.
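For reference, this is roughly what the Tab-delimited parse does, expressed with Python’s csv module (continuing the sketch above):

    import csv

    # Split each tab-separated line into fields, which is what the
    # Parse-as-CSV directive with a Tab delimiter does.
    parsed = list(csv.reader(data_rows, delimiter="\t"))
    print("First record has", len(parsed[0]), "fields")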

Next, delete the original body column.

  1. Click on the directive dropdown for body.
  2. Click Delete column.

Set the column names (the sketch after this list shows the same renaming in Python).

  1. Click on the Column names dropdown.
  2. Click on “Set all”.
  3. Paste the same header into the textarea.
  4. Remove all quotes and spaces, and put commas in between.
  5. Click Apply.
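Dropping the raw body column and setting the column names amounts to labeling each parsed record with the cleaned-up header fields. A Python sketch of the same idea, continuing from the snippets above:

    # The column names come from the header line of the file itself,
    # cleaned of quotes and spaces just like step 4 above.
    columns = [c.strip().strip('"') for c in header.split("\t")]

    # Replace the positional fields with named fields, which is roughly what
    # the record looks like once the body column is dropped and names are set.
    records = [dict(zip(columns, row)) for row in parsed]
    print("Columns:", ", ".join(columns))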

Click the “Create a Pipeline” button and choose Batch pipeline. You will be taken to the Pipeline Studio view. Notice that you have two connected plugins: File and Wrangler.

What we need to do now is replace the File source with the HTTP source, so that we can connect to the actual source of data. Click the menu on the File source, and click Delete.

Find the HTTP source in the left panel and click on it to add the plugin. Connect the HTTP source to Wrangler.

Configuring HTTP Source

Open the properties of the HTTP source. The first thing we have to fill out is the Reference Name. This name is used to track the lineage of this data source.

Next, set the URL: https://data.sfgov.org/resource/tmnf-yvry.tsv

According to the documentation of this dataset from DataSF, we should set an app token (X-App-Token) as a request header.
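If you want to sanity-check the token outside of CDAP, the equivalent request in Python looks like this (the token value is a placeholder; request a real one from DataSF):

    import requests

    URL = "https://data.sfgov.org/resource/tmnf-yvry.tsv"
    APP_TOKEN = "YOUR_APP_TOKEN"  # placeholder token

    # The same header the HTTP source sends once X-App-Token is set under Headers.
    resp = requests.get(URL, headers={"X-App-Token": APP_TOKEN}, params={"$limit": 5})
    resp.raise_for_status()
    print(resp.text.splitlines()[0])  # header row of the TSV response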

Next, set the Format to “text”, and set the Output Schema to a single field named “body”.

Finally, the API provided by DataSF supports pagination. According to the documentation, we can page through the data by setting $limit and $offset in the URL.

First, go down to the Pagination section in the properties. The plugin supports multiple pagination methods out of the box, but for this use case we are going to use “Increment an index”. Set the Start Index to 0, and the Max Index to 2500000 (2.5 million), because this limit needs to be larger than the number of rows in the dataset. Next, set the Index Increment to 100000; this is the page size.

Now, let’s modify the URL that we set earlier. On each request, the plugin substitutes the current index value for the placeholder {pagination.index}.

Update the URL to:

https://data.sfgov.org/resource/tmnf-yvry.tsv?$limit=100000&$offset={pagination.index}
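To make the paging behavior concrete, here is a rough Python model of what the plugin does with these settings. This is only an illustration of the mechanics, not how the plugin is implemented; the placeholder is renamed because {pagination.index} is not a valid Python format field.

    import requests

    URL_TEMPLATE = ("https://data.sfgov.org/resource/tmnf-yvry.tsv"
                    "?$limit=100000&$offset={index}")
    START_INDEX, MAX_INDEX, INDEX_INCREMENT = 0, 2_500_000, 100_000

    total_rows = 0
    for index in range(START_INDEX, MAX_INDEX + 1, INDEX_INCREMENT):
        # Substitute the current index for the placeholder, as the plugin does
        # with {pagination.index}, and fetch one page of up to 100,000 rows.
        resp = requests.get(URL_TEMPLATE.format(index=index))
        resp.raise_for_status()
        rows = resp.text.splitlines()[1:]  # drop the repeated header line
        if not rows:
            break  # past the end of the dataset
        total_rows += len(rows)

    print("Fetched", total_rows, "rows in total")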

To finish the pipeline, add a sink. In this example I will be using Google BigQuery. Name the pipeline, then Deploy it.

Give it a try, and click Run!

The HTTP batch source can do a lot more, such as OAuth authentication, proxy support, and SSL. With all these features, you can now process data from public datasets, SaaS applications, CRM, and ERP systems, as long as they expose HTTP APIs to the data! Stay tuned for more specific plugins that will unlock some of these use cases.

To learn more about the HTTP plugin, you can go to the repository page https://github.com/data-integrations/http.

Edwin is a Senior Software Engineer at Netflix, previously at Google Cloud. He specializes in data analytics user interfaces.