Use Case: Extracting Google Analytics data using StreamSets

Back in January 2017, StreamSets had released version ~2.3.0, which lacked the OAuth2 integration needed to connect to the Google Analytics API. To work around this, I created a custom StreamSets Origin (data extractor), and it worked really well for a simple use case!

Fast-forward to 2018: StreamSets ~3.1.0 has integrated OAuth2 into its built-in HTTP Origin. This article gives you a quick recipe for configuring it with the Google Analytics Core Reporting API v3.

Configuring HTTP Origin — HTTP : Use the following/similar settings

Use runtime variables for sensitive data
Use relative dates so the pipeline can pull data in batches on a regular basis. Ideally we would stream the data instead of loading it in batches.
The reason for using GA API v3 is that it is compatible with StreamSets' pagination. Integrating StreamSets' pagination with GA API v4 poses a challenge: each request is a JSON object sent inside an HTTP POST request, and that object has to be updated for the pagination to work. GA API v3, by contrast, handles pagination through HTTP parameters: for instance, nextLink contains a new URL that carries the pagination information in the start-index and max-results parameters.
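As a rough illustration of the two points above, here is a minimal Python sketch of a GA API v3 request URL that uses relative dates and the start-index / max-results parameters. The function name, the view ID, and the chosen metrics and dimensions are placeholders invented for this example. The URL you actually paste into the HTTP Origin may differ.

```python
from urllib.parse import urlencode

# Core Reporting API v3 data endpoint.
GA_V3_ENDPOINT = "https://www.googleapis.com/analytics/v3/data/ga"


def build_ga_v3_url(view_id, metrics, dimensions,
                    days_back=7, start_index=1, max_results=1000):
    """Build a v3 query URL (hypothetical helper, for illustration only)."""
    params = {
        "ids": f"ga:{view_id}",
        # Relative dates: the same URL keeps working on every scheduled run.
        "start-date": f"{days_back}daysAgo",
        "end-date": "yesterday",
        "metrics": ",".join(metrics),
        "dimensions": ",".join(dimensions),
        # Pagination lives in plain query parameters, which is what makes
        # v3 easy to combine with link-based pagination.
        "start-index": start_index,
        "max-results": max_results,
    }
    return f"{GA_V3_ENDPOINT}?{urlencode(params)}"


# "12345" is a made-up view ID.
url = build_ga_v3_url("12345", ["ga:sessions"], ["ga:date"])
print(url)
```

Because the dates are relative, nothing in the URL has to change between runs. Only start-index moves from page to page, and the API supplies that for you through nextLink.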

Configuring HTTP Origin — Pagination: Use the following/similar settings

GA API v3 and StreamSets let you drive the pagination with the /nextLink field.
To make the pipeline stop, use a Java Expression Language conditional: once the API has returned the last page of data, the /nextLink field no longer appears in the result.
Do not include the pagination data in the pipeline; if you do include it, exclude it in a later stage.
When you choose /rows under Result Field Path, StreamSets filters out all other fields and passes only this field downstream.
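The pagination logic described above can be sketched in a few lines of Python. This is not how StreamSets implements it internally, only an illustration of the same idea: follow nextLink until it disappears, and keep only the rows field. The fake in-memory pages and the names iter_rows and FAKE_PAGES are assumptions made for the example, so that it runs without calling the real API.

```python
def iter_rows(fetch_page, first_url):
    """Yield every row from a link-paginated API.

    fetch_page is any callable that takes a URL and returns the parsed
    JSON response as a dict.
    """
    url = first_url
    while url:
        page = fetch_page(url)
        # Keep only the "rows" field, like Result Field Path = /rows.
        yield from page.get("rows", [])
        # Stop condition: the last page has no "nextLink" field.
        url = page.get("nextLink")


# Made-up responses standing in for the API; the keys act as URLs.
FAKE_PAGES = {
    "page-1": {"rows": [["20180101", "10"], ["20180102", "12"]],
               "nextLink": "page-2"},
    "page-2": {"rows": [["20180103", "7"]]},  # no nextLink: last page
}

rows = list(iter_rows(FAKE_PAGES.__getitem__, "page-1"))
print(rows)
```

Note that the pagination field itself never reaches the output. Only the contents of rows do.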

Configuring HTTP Origin — OAuth2: Use the following/similar settings

The JWT key should come from the JSON key file you get when you generate the Google credentials in the Google Console. Locate the “private_key” field in the file, which contains a string version of the key. Place this string into a file and replace all “\n” literals with new lines.
Use a runtime variable to load the file that contains the JWT key.
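If you would rather not edit the key by hand, the extraction step can be scripted. The sketch below assumes the key file is the standard JSON file with a “private_key” field, as described above. The function name and the sample key content are made up for the example. Parsing the file as JSON already turns the “\n” escapes into real new lines. The extra replace only covers the case where the string was copied out raw.

```python
import json


def extract_private_key(key_file_text):
    """Return the private key from a JSON key file with real line breaks."""
    key = json.loads(key_file_text)["private_key"]
    # Safety net: turn any leftover literal backslash-n into a new line.
    return key.replace("\\n", "\n")


# Fake key file content, for illustration only (not a real key).
sample = json.dumps({
    "private_key": "-----BEGIN PRIVATE KEY-----\nABC123\n-----END PRIVATE KEY-----\n"
})

pem = extract_private_key(sample)
print(pem)
```

You can then write the result to a file and point the runtime variable at it.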

StreamSets is an open source ETL/data-streaming tool that lets you create data pipelines in minutes. It has great built-in ETL abstractions and lets you create ad-hoc implementations using Java. If you haven’t heard about it, here’s a great introduction.