Back in January 2017, StreamSets had released version ~2.3.0, which lacked the OAuth2 integration needed to connect to the Google Analytics API. To work around this issue I created a custom StreamSets Origin (data extractor), and it worked really well for a simple use case!
Configuring HTTP Origin — HTTP : Use the following/similar settings
Use runtime variables for sensitive data
Use relative dates so the pipeline can pull data in batches on a regular basis. Ideally we would stream the data instead of loading it in batches.
The reason for using GA API v3 is that it is compatible with StreamSets’ pagination. Integrating StreamSets’ pagination with GA API v4 poses a challenge: requests are sent as a JSON object inside an HTTP POST request, and that object has to be updated on every request to make the pagination work. GA API v3, by contrast, handles pagination through HTTP parameters; for instance, nextLink contains a new URL that carries the pagination information in the start-index and max-results parameters.
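To make the parameter-based pagination concrete, here is a minimal Python sketch of how a GA API v3 request URL with relative dates can be built and how the paging values can be read back from a nextLink-style URL. The view ID, the metric and dimension, and the helper names `build_ga_v3_url` and `paging_params` are assumptions for illustration only; they are not taken from the original pipeline.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Core Reporting API v3 endpoint.
GA_V3_ENDPOINT = "https://www.googleapis.com/analytics/v3/data/ga"


def build_ga_v3_url(view_id, start_index=1, max_results=1000):
    """Build a v3 request URL that uses relative dates and paging parameters."""
    params = {
        "ids": f"ga:{view_id}",       # hypothetical view (profile) ID
        "start-date": "7daysAgo",     # relative date, re-evaluated on every run
        "end-date": "yesterday",
        "metrics": "ga:sessions",     # example metric, an assumption
        "dimensions": "ga:date",      # example dimension, an assumption
        "start-index": start_index,   # 1-based index of the first row to return
        "max-results": max_results,   # page size
    }
    return f"{GA_V3_ENDPOINT}?{urlencode(params)}"


def paging_params(url):
    """Read start-index and max-results back out of a request or nextLink URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["start-index"][0]), int(query["max-results"][0])


first_url = build_ga_v3_url("12345678")
# A nextLink for the second page would look like the first URL with a
# larger start-index; it is built here by hand instead of calling the API.
next_link = build_ga_v3_url("12345678", start_index=1001)

print(paging_params(first_url))
print(paging_params(next_link))
```

Because every paging detail lives in the URL, a client only has to request whatever URL nextLink holds; nothing in a request body needs to be rewritten between pages.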
Configuring HTTP Origin — Pagination: Use the following/similar settings
GA API v3 and StreamSets let you work with the pagination using the /nextLink field.
The pipeline is stopped with a Java Expression Language condition: once the API returns the last page of data, the /nextLink field no longer appears in the result.
Either keep the pagination data out of the pipeline, or exclude it in a later stage.
By choosing /rows under Result Field Path, StreamSets filters out all other fields and passes only this field downstream.
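The pagination settings above boil down to a simple loop, sketched here in Python outside of StreamSets. It follows nextLink from page to page, keeps only the contents of rows, and stops when nextLink is missing. The function `fetch_all_rows` and the canned responses in `FAKE_PAGES` are hypothetical stand-ins for real HTTP calls and real API output.

```python
def fetch_all_rows(fetch_page, first_url):
    """Follow nextLink page by page, yielding only the entries under rows."""
    url = first_url
    while url is not None:
        page = fetch_page(url)
        # Only the rows field moves on; paging metadata is left behind.
        yield from page.get("rows", [])
        # The last page has no nextLink, so get() returns None and the loop ends.
        url = page.get("nextLink")


# Hypothetical canned responses, keyed by a fake URL, standing in for the API.
FAKE_PAGES = {
    "page1": {
        "rows": [["20170101", "10"], ["20170102", "12"]],
        "nextLink": "page2",
        "totalResults": 3,
    },
    "page2": {
        "rows": [["20170103", "7"]],
        "totalResults": 3,
    },
}

rows = list(fetch_all_rows(FAKE_PAGES.__getitem__, "page1"))
print(rows)
```

In the pipeline itself, StreamSets performs this loop for you; the sketch is only meant to show why a missing nextLink is a reliable stop signal.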
Configuring HTTP Origin — OAuth2: Use the following/similar settings
The JWT key comes from the JSON key file that you download when you generate the Google credentials in the Google Console. Locate the “private_key” field in the file, which contains a string version of the key. Place this string into a file and replace all “\n” literals with new lines.
Use a runtime variable to load the file that contains the JWT key.
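The key extraction can also be scripted instead of done by hand. The following Python sketch reads the JSON key file, takes the "private_key" field, and writes it to its own file with real line breaks. The function name `extract_private_key` and both file paths are assumptions for illustration.

```python
import json
from pathlib import Path


def extract_private_key(json_key_path, output_path):
    """Copy the private_key field of a Google JSON key file into its own file."""
    key_data = json.loads(Path(json_key_path).read_text(encoding="utf-8"))
    private_key = key_data["private_key"]
    # json.loads already turns the JSON escape \n into a real line break.
    # The replace below only matters if the value still contains a literal
    # backslash followed by "n", for example after being copied by hand.
    private_key = private_key.replace("\\n", "\n")
    Path(output_path).write_text(private_key, encoding="utf-8")
    return private_key
```

A call such as `extract_private_key("service-account.json", "ga-key.pem")` (both names hypothetical) would produce the file that the runtime variable then points to.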