GETTING STARTED | RECURSIVE LOOPS | KNIME ANALYTICS PLATFORM

Using KNIME Recursive Loops with a GraphQL API

Retrieving large amounts of data in a convenient way

Bob Peers
Low Code for Data Science

--

As first published on Creative Data

Introduction

This guide will show you how to use a recursive loop to download data from an API. I my case I’ll be using a GraphQL API but the approach would work in a similar way for any other REST API.
This guide will not go into the details about the workings of GraphQL as those guides are readily available online.

Paginated APIs

If the API you are accessing has a large number of records to return they will often return data in batches, called pages, and you need to move forward through the data using a method called pagination. The reason for this is to reduce server load, so instead of returning a massive 100,000 record JSON object you get the data in smaller chunks.

Essentially, you call the API requesting say 200 records and you get 200 back, you then request 200 more and so on until there are no more records returned. A typical request looks like the one below:

GraphQL API

Using GraphQL we use a connection to paginate through the records using cursor-based pagination. In this case, I’ll be downloading product data from an eCommerce website using a ProductConnection. The query gets 200 product numbers and descriptions per request.

The important parts of the request object is (first: 200, after: ””). This requests the first 200 records using a cursor which is initially empty, hence the blank string (escaped with backslashes which are required in KNIME, but if you test in Postman, for example, you don’t need the backslashes).

The hasNextPage object will be returned with the value false, if there are no more records to retrieve, or true if there are more records.

The cursor object will also be populated with a string which is a pointer to the record returned. To get the next records we need to find the last cursor returned and on the next iteration ask for the next 200 records after this cursor.

Now we need to put this together into a KNIME workflow.

The KNIME Workflow

To download the data using KNIME, we need to use a combination of the Recursive Loop nodes and the POST Request node to send the query to the GraphQL endpoint. The finished workflow is shown below:

The elements of the workflow are:

  • Input the query string (plus anything else required for the POST request, URL, authentication, etc.).
  • Start the recursive part of the flow.
  • POST the query string to the GraphQL endpoint.
  • Parse the response to collect the product data (middle branch) to feed into the Recursive Loop End node.

Top Branch

  • Use a Column Filter to isolate the hasNextPage value.
  • Convert hasNextPage from text to a boolean and invert the value.
  • Convert the row to a flow variable and feed it into the Recursive Loop End node.
  • Recursive loops can be set to stop using a variable that should be true when you want it to stop. The value hasNextPage returns true when there are more pages. Hence, we need to invert it so when it returns false (no more pages) the variable becomes true and we stop the loop.

Bottom Branch

  • Using a String Manipulation node to replace the after:”” part of the query with the extracted cursor value, so it reads something like after: ”YOURCURSOR”. I use regexReplace for this.
  • Filter the columns and feed them back into the input port of the Recursive End Loop node.

The GraphQL Data

On the first iteration, since we have no cursor, we ask for the first 200 records after an empty cursor (after:””).
We parse the JSON payload returned and get the last cursor which in our case will be the 200th cursor value. We get a cursor for every record returned and it usually looks like a random string something like “AeghYtn=“.

Using this cursor, we place this into the current query object so the next request is for records after:“AeghYtn=“.

At the same time, we check to see if there are more pages to be returned. If not, the loop will end; otherwise another iteration will run using the new query string in the next POST request.

This continues until there are no more records to be read.

The KNIME Details

I haven’t shown all the nodes in details as many are standard to many workflows but I’ve highlighted the ones critical to a recursive loop using GraphQL APIs and numbered them below.

[1] JSON Path

As well as parsing the actual data returned, we get the cursor and the next page value. A trick to get the last cursor is to use the following syntax:

$['data']['productConnection']['edges'][-1:]['cursor']

Using [-1:] to access the last cursor element makes the node dynamic so if we change the workflow to get 500 records at once it will still work. If we had used [199] (remember arrays are zero indexed) to get the last cursor this would only work when we request 200 records.

The path to get the hasNextPage value is much simpler as the element only exists once.

[2] String Manipulation

Here we modify the query string for the next iteration so instead of an empty cursor we populate the query with the last cursor returned from the JSON Path node.

Essentially we need to replace the string in quotation marks used in the after:”” clause. To do this I use a regexReplace function as show below:

regexReplace($query$, "after:\\\"[A-za-z0-9=]*\\\"", join("after:\\\"",$cursor$,"\\\""))

This of course needs to also work on the next iteration when the after:”” part is not an empty string but will contain the cursor from the previous iteration. In my case all cursors contain either letters, numbers or the equals operator so I test for [A-za-z0–9=].

[3] Rule Engine

This is part of the flow to control the iterations. We invert the value of hasNextPage so when there are no more pages it becomes true and the recursive loop stops.

[4] Recursive Loop End

To end the recursion we use the variable we created from hasNextPage. All the data from each iteration will now be collected so you can continue to process the data downstream.

Summary

Recursive loops can be tricky to get working but once you understand the basics of recursion and how KNIME loops work they are a very useful tool and can be used in a wide range of applications where you are required to download large amounts of data via APIs.

The documentation for the Recursive Loop Start can be found on the KNIME Hub.

--

--

Bob Peers
Low Code for Data Science

Builder. Automate all the things. Generally curious person.