GETTING STARTED | LOOPS AND FILE HANDLING | KNIME ANALYTICS PLATFORM

KNIME Snippets (1): Collect and Restore — or how to handle many large files and resume loops

“KNIME Snippets” is a series of short articles, each covering a specific KNIME topic. Follow me on Medium to get more.

Markus Lauber
Low Code for Data Science
5 min read · Apr 22, 2023


With the low-code platform KNIME you might come across the task of processing larger files that do not fit into memory in one go or that require a loop.

If you are interested in more general ideas on how to improve the performance of your KNIME workflows, see “Mastering KNIME: Unlocking Peak Performance with Expert Tips and Smart Settings”.

Restore after a Loop fails

You can always try to split tasks into loops, like in other programming languages. One way to handle a loop step that might fail is to collect the result or an entry/log of each step; if something goes wrong, you can resume the process where you left off.

KNIME workflow — Restart a loop after the last processed RowID (https://forum.knime.com/t/re-start-a-loop-at-a-given-row/57133/4?u=mlauber71).

Here is another example using Python Script nodes, where the names of processed PNG files are stored and the process continues where it left off. In such cases, you will have to delete the collection file if you want to restart the process from scratch. Created with the help of ChatGPT, by the way.
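Stripped down to its core, the logic inside such a Python Script node might look roughly like this (the folder and log file names are placeholders, not the ones from the actual workflow):

```python
# Minimal sketch of the collect-and-restore idea: keep a simple text file
# with the names of already processed PNG files and skip those on the next run.
# "data/png_files" and "data/processed_files.txt" are made-up placeholder paths.
import os

input_folder = "data/png_files"
log_file = "data/processed_files.txt"

# load the names processed in earlier runs (if any)
processed = set()
if os.path.exists(log_file):
    with open(log_file, "r", encoding="utf-8") as f:
        processed = {line.strip() for line in f if line.strip()}

for name in sorted(os.listdir(input_folder)):
    if not name.lower().endswith(".png") or name in processed:
        continue  # already handled in a previous run

    # ... do the actual work on os.path.join(input_folder, name) here ...

    # log the name immediately, so a crash later does not lose the progress
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(name + "\n")
```

Deleting the log file resets the whole process, just as described above.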

This can also come in handy if you want to download files where the connection might break down and you want to resume your work. You check which files have already been processed (List Files/Folders node) and continue with the ones you do not have yet. You might also include a check whether the size of a file is greater than zero, so partially downloaded files are excluded. Or you record the time and additionally exclude the most recent file, to be sure the last one gets re-done properly.
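A rough Python illustration of such a check (the folder and the list of expected files are made up for this example):

```python
# Decide which downloads still need to be (re)done: missing or zero-byte files
# are treated as incomplete; the most recent file is optionally re-done as well.
import os

download_folder = "downloads"  # placeholder
expected_files = ["report_01.csv", "report_02.csv", "report_03.csv"]  # placeholder

def is_complete(path):
    # missing or zero-byte files count as not (fully) downloaded
    return os.path.exists(path) and os.path.getsize(path) > 0

todo = [name for name in expected_files
        if not is_complete(os.path.join(download_folder, name))]

# be extra careful: also re-do the most recently written file
done = [name for name in expected_files if name not in todo]
if done:
    newest = max(done, key=lambda n: os.path.getmtime(os.path.join(download_folder, n)))
    todo.append(newest)
```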

As always: you are encouraged to think about and plan your tasks before you set out to do them :-)

Another option would be to also use Switches to handle several conditions that might or might not materialize: “An example of combining SQL database, loop and case switch”.

Collect results step by step

OK, what if your data is too large to be processed in one go, or you have a lot of very small tasks you want to run in parallel while still collecting the results?

CSV Writer can append data

The easiest (and sometimes overlooked) approach is to just collect the results in a CSV file. The CSV Writer allows you to append data if the file already exists, without writing the headers again (obviously the chunks would have to have the same structure).
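If you prefer to do the same in a Python Script node, a minimal pandas sketch of the append behaviour could look like this (file name and data are placeholders):

```python
# Append a chunk of results to a CSV file; write the header only once.
import os
import pandas as pd

result_file = "results.csv"  # placeholder path
chunk = pd.DataFrame({"id": [1, 2], "value": [0.4, 0.7]})  # made-up result chunk

file_exists = os.path.exists(result_file)
chunk.to_csv(result_file, mode="a", header=not file_exists, index=False)
```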

KNIME workflow — Append data to a CSV file if the file already exists (https://hub.knime.com/-/spaces/-/latest/~-njjNfdjCt05FaKE/).

In this example the data is also processed in chunks: you set the size of the individual chunks and the size of the last one will automatically be calculated and set, which can come in quite handy.

Delete the CSV file in the initial round, calculate the size of the chunks and determine the start and end RowID of each step (https://hub.knime.com/-/spaces/-/latest/~-njjNfdjCt05FaKE/).

You can make sure to delete a file (CSV or Excel) in the first iteration of a loop using a Case Switch.
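Boiled down to plain Python, the chunking and the clean-up in the first iteration amount to something like this (the row count, chunk size and file name are assumptions for illustration):

```python
# Compute start and end RowID for each chunk; the last chunk simply takes
# whatever rows remain. The result file is deleted in the first iteration.
import math
import os

n_rows = 1_000_000           # assumed total number of rows
chunk_size = 50_000          # chosen chunk size
result_file = "results.csv"  # placeholder

n_chunks = math.ceil(n_rows / chunk_size)

for i in range(n_chunks):
    start = i * chunk_size
    end = min(start + chunk_size, n_rows) - 1  # last chunk may be smaller

    if i == 0 and os.path.exists(result_file):
        os.remove(result_file)  # fresh start in the first iteration

    # ... process rows start..end and append them to result_file ...
    print(f"chunk {i}: rows {start} to {end}")
```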

Several Parquet files into one

Parquet is a great data format which compresses the data and keeps the column types at the same time. It is also widely used in Big Data environments, and you can bring several files (of the same structure, of course) together in one folder and treat them as one file. KNIME has a Parquet Reader and Writer.

This is another solution you might use: store the results from one or more of your loops or parallel running jobs as Parquet files in one folder and give them individual names (like timestamps). In the end they will function as one big file and you can handle them further in KNIME, R/Rstats or Python.
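In Python this pattern could look roughly like this (assuming pandas with pyarrow installed; folder, file and column names are made up):

```python
# Write each (loop or job) result as its own Parquet file into one folder,
# using a timestamp in the file name so parallel runs do not collide.
import os
from datetime import datetime

import pandas as pd

out_folder = "parquet_results"  # placeholder folder
os.makedirs(out_folder, exist_ok=True)

result = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})  # made-up chunk

stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
result.to_parquet(os.path.join(out_folder, f"result_{stamp}.parquet"), index=False)
```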

You can save data as Parquet in a folder with files of one size which might be easier to transfer/handle (https://forum.knime.com/t/knime-assign-failed-request-status-data-overflow-incoming-data-too-big/65232/2?u=mlauber71).

You can have one big file (of 16 MB) or several smaller ones of 2 MB each, for example:

Several Parquet files in the end form one file for further use (https://forum.knime.com/t/knime-assign-failed-request-status-data-overflow-incoming-data-too-big/65232/2?u=mlauber71).

You can then import the data as one file:

Import Parquet data from a folder (https://forum.knime.com/t/knime-assign-failed-request-status-data-overflow-incoming-data-too-big/65232/2?u=mlauber71)
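Outside of KNIME, reading such a folder back as a single table could look like this (again assuming pyarrow; the folder name is a placeholder):

```python
# A folder of Parquet files with the same structure can be read as one dataset.
import pyarrow.parquet as pq

table = pq.read_table("parquet_results")  # pass the folder, not a single file
df = table.to_pandas()
print(df.shape)
```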

This might also be useful if you have to transfer large amounts of data over sketchy connections. And of course you could combine this approach with the loops and restore logic above.

On macOS systems it might make sense to add additional filter options, because the file system on Apple machines tends to write additional information into the folder.
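If you collect the files yourself in Python instead of via the KNIME reader, a simple filter on the file names usually does the trick (the folder name is a placeholder):

```python
# Keep only real Parquet files and skip macOS metadata such as ".DS_Store"
# or hidden "._*" companion files.
import os

folder = "parquet_results"  # placeholder
files = sorted(
    os.path.join(folder, name)
    for name in os.listdir(folder)
    if name.endswith(".parquet") and not name.startswith(".")
)
```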

If you want to handle a (very) large SQL dataset you can also use a special setting in a local H2/SQL database and split it into several parts. See also: “KNIME, Databases and SQL”.

CSV, ORC, Parquet and Hive Big Data tables

You can take the concept of reading several CSV or Parquet or ORC files as one to a Big Data system. And when you do not have such an environment at hand you can use KNIME’s Local Big Data Environment.

Upload Parquet files to a Big Data HDFS and create Hive tables there — all with KNIME’s local Big Data Environment (https://hub.knime.com/-/spaces/-/latest/~EGxrQlSfohvC3m3y/).

The same approach can be used to upload ORC files to HDFS and create Hive Tables or do the same with CSV files:

CSV files uploaded to a Big Data HDFS storage can form an EXTERNAL Hive Table (https://forum.knime.com/t/read-and-process-huge-csv-file/31655/12?u=mlauber71).
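The underlying idea of an EXTERNAL Hive table is that the table definition only points to a folder of files in HDFS. As an illustration of that concept (not of the KNIME DB nodes themselves), here is a PySpark sketch for Parquet files, with made-up database, table and folder names; CSV files would need a different STORED AS clause:

```python
# Register a folder of Parquet files in HDFS as an EXTERNAL Hive table.
# Database, table and folder names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_db.results_parquet (
        id    BIGINT,
        value DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/user/data/results_parquet'
""")
```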

In case you enjoyed this story you can follow me on Medium (https://medium.com/@mlxl) or on the KNIME Hub (https://hub.knime.com/mlauber71) or KNIME Forum (https://forum.knime.com/u/mlauber71/summary).


Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry