GETTING STARTED | AUTOMATION | KNIME ANALYTICS PLATFORM

KNIME Batch Processing on Windows and MacOS

The DIY guide to workflow automation

Markus Lauber
Low Code for Data Science

Photo by Gerrie van der Walt on Unsplash.

KNIME batch processing is sometimes shrouded in a bit of mystery. That might be because it is somewhat hacky, and also because there are much better alternatives available that you could and should explore.

DISCLAIMER.

I would *not* advise using these cron and batch processes in real production environments, since they might fail at any time, require a permanently running computer that might not be a proper server, and are not that easy to control.

The new SaaS feature on the KNIME Community Hub offers a nice and much richer alternative to this DIY approach for small teams at a very convenient price.

The Alternatives from the world of KNIME

KNIME is great as a free low-code (desktop) tool, but it is even better when used for team cooperation, know-how sharing and automation in the organization, as provided by the KNIME Hub. To emphasize this, I would like to point out the options you have to productionize your analytics solutions using KNIME Hub, from the newest SaaS feature of the KNIME Community Hub for small teams to the “Enterprise” edition of KNIME Business Hub for large organizations.

Moreover, you can also use KNIME Business Hub as Pay-As-You-Go on AWS and Azure [coming soon!].

But now on to the topic of batch processing. You can call KNIME from the command line (Windows, Mac and Linux) with several parameters (https://www.knime.com/faq#q12) or from a script.

Windows Batch / CMD File to start KNIME

We have a workflow group “KNIME_Batch” containing the workflow to be automated, “KNIME_Simple_Batch_Windows”, and a /script/ folder where you find the batch file that you will use. The /data/ folder holds the train.table and, later, the results and the log files. You can modify this setup to your liking.

The workflow group to automate KNIME under Windows (and MacOS) (https://hub.knime.com/mlauber71/spaces/Examples/KNIME_Batch~DvTbQ6PrJWbXHSDu/).

On Windows, a very basic .CMD file would look like this. You set some variables (referenced later by wrapping their names in % signs, as in %KNIME_PATH%) and then start the batch run. Here the workflow is reset before execution (-reset), is not saved afterwards (-nosave), and KNIME does not show its splash screen (-nosplash). KNIME is still started in the background, though, and will need resources.

knime_simple_batch.cmd

REM Set the path to your KNIME executable (Windows)
REM %USERNAME% is an automatic Windows variable
REM you must adapt the paths to your system and KNIME version (obviously)
set KNIME_PATH=C:\Users\%USERNAME%\software\knime_4.7.0\knime.exe

REM Set the path to your workflow directory (the workflow you want to run)
set WORKFLOW_DIR=C:\Users\%USERNAME%\knime-workspace\KNIME_Batch\KNIME_Simple_Batch_Windows

REM Set a string variable to pass to the workflow (a variable to be used in the workflow)
REM you have to define them as workflow variables in the workflow
set STRING_VAR_NAME=var_data_path
set STRING_VAR_VALUE=C:\Users\%USERNAME%\knime-workspace\KNIME_Batch\data\

REM Run KNIME in batch mode with the specified workflow and string variable
"%KNIME_PATH%" -reset -nosave -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="%WORKFLOW_DIR%" ^
-workflow.variable=%STRING_VAR_NAME%,%STRING_VAR_VALUE%,String

In this case we provide a Flow Variable var_data_path to the KNIME workflow that gives the absolute path of the /data/ folder. In my experience, paths (on Windows) should best not contain any blanks or fancy characters.

You will also have to set these variables inside the workflow you want to use as “Workflow Variables”:

Set a workflow variable in KNIME 1.

The Workflow Variable first holds a fixed default value so you can keep developing your workflow. Once the .CMD file is executed, that value is replaced with the one passed from the .CMD file.

Set a workflow variable in KNIME 2.

The concept of these variables is considered ‘legacy’ by KNIME though, so be aware.

From my experience, the classic detection of environment variables does not seem to work very well, so you are probably best served by passing basic paths via the call if you want to store data from inside the workflow.

Under KNIME 5.x GUI it is possible you have to switch back to the old user interface to access the workflow variables

Advanced configurations of the .CMD file

You can do more elaborate configurations with the batch script, like passing the current date and time (you can also derive that inside the KNIME workflow) and writing a log file. You will have to toy around with the paths and maybe the date and time settings depending on your scenario.

knime_simple_batch.cmd

@echo off
REM This batch file runs a KNIME workflow in batch mode and logs the output.

set CURRENT_DATE=%DATE%
REM %TIME% starts with a blank before 10 o'clock; replace blanks with 0
set CURRENT_TIME=%TIME: =0%

echo Current Date: %CURRENT_DATE%
echo Current Time: %CURRENT_TIME%

REM Get the current date and time in a format suitable for a filename
REM Example format: YYYYMMDD-HHMMSS

REM Depending on your locale, you might need to adjust the delimiters and the order of %%a, %%b, %%c
for /f "tokens=1-3 delims=/" %%a in ('echo %CURRENT_DATE%') do set DATESTR=%%c%%a%%b
for /f "tokens=1-3 delims=:." %%i in ('echo %CURRENT_TIME%') do set TIMESTR=%%i%%j%%k

for /f "tokens=1-3 delims=:." %%i in ('echo %CURRENT_TIME%') do (
set hours=%%i
set minutes=%%j
set seconds=%%k
)

REM check how the time variables will appear
echo hours=%hours%
echo minutes=%minutes%
echo seconds=%seconds%

echo DATESTR=%DATESTR%
echo TIMESTR=%TIMESTR%

REM Combine them to form a timestamp
set TIMESTAMP=%DATESTR%-%TIMESTR%

REM Set the path to your KNIME executable (Windows)
REM %USERNAME% is an automatic Windows variable
set KNIME_PATH=C:\Users\%USERNAME%\software\knime_4.7.0\knime.exe

REM Set the path to your workflow directory
set WORKFLOW_DIR=C:\Users\%USERNAME%\knime-workspace\KNIME_Batch\KNIME_Simple_Batch_Windows

REM Set the path where you want to save the log file
set LOG_FILE_PATH=C:\Users\%USERNAME%\knime-workspace\KNIME_Batch\data\knime_log_%TIMESTAMP%.txt

REM Set a string variable to pass to the workflow
set STRING_VAR_NAME=var_data_path
set STRING_VAR_VALUE=C:\Users\%USERNAME%\knime-workspace\KNIME_Batch\data\

REM Proxy settings
set PROXY_HOST=proxy.my-company.org
set PROXY_PORT=8080

REM Run KNIME in batch mode with the specified workflow, string variable, proxy settings, and save the log output
REM ^ will concatenate the lines into one at the execution
"%KNIME_PATH%" -reset -nosave -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="%WORKFLOW_DIR%" ^
-workflow.variable=%STRING_VAR_NAME%,%STRING_VAR_VALUE%,String ^
-workflow.variable=CURRENT_DATE,%DATESTR%,String ^
-workflow.variable=v_hours,%hours%,int ^
-workflow.variable=v_minutes,%minutes%,int ^
-workflow.variable=v_seconds,"%seconds%",String ^
-vmargs -Dhttp.proxyHost=%PROXY_HOST% -Dhttp.proxyPort=%PROXY_PORT% -Dhttps.proxyHost=%PROXY_HOST% -Dhttps.proxyPort=%PROXY_PORT% > "%LOG_FILE_PATH%" 2>&1

REM Optional: Add a line to indicate completion and log file location
echo Workflow execution complete. Log file saved to "%LOG_FILE_PATH%".

REM Pause the script to see the output when running interactively (optional)
REM pause

You will find even more configuration options at the end of this article.

Inside the KNIME Workflow that is being called

The basic path where the train.table is being stored is sent from the .CMD file as an absolute path and will be used:

Inside the KNIME Workflow that is being called the Flow Variables are being put to use.

You can start this .CMD file by double-clicking it, or you can schedule it with the Windows Task Scheduler.

Windows Task Scheduler

There already is an article about KNIME and batch processing that describes how to set up your .CMD file with the Windows Task Scheduler (“Part 3: Schedule the Batch script”), so best check it out.

You basically configure the “Task Scheduler” (or “Aufgabenplanung” in German) and point it to the .CMD file to run when you want it to do so.

Configure the Windows Task Scheduler and point it towards the .CMD file

On Windows there sometimes seems to be a problem with timeouts. There is a workaround you could try and a ticket to solve it.

Schedule a KNIME workflow on MacOS with cronjob

Scheduling a job on MacOS involves a little more terminal work and coding. First, we again have the scenario with the Workflow Group “KNIME_Batch”.

The folder /data/ will contain the initial data and later the results from the workflow “KNIME_Simple_Batch_MacOS”. The /script/ folder will hold the shell script (the counterpart of the Windows .CMD file) that will be executed by hand or with a cronjob.

Overview of the workflow group for batch use

The KNIME workflow will do a simple import and export. You can make it much more complicated, and I would always encourage you to make your KNIME workflows ‘robust’: check what tasks are there, do them, and maybe note somewhere that they ran, so they can work in a fire-and-forget style … but this is another story …

OK, let’s take this cron thing step by step …

First you need a run_knime_workflow.sh file containing your command to launch the workflow. It might look like this:

#!/bin/bash
/Applications/KNIME\ 5.2.0\ Intel.app/Contents/MacOS/knime -reset -nosave -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="/Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/KNIME_Simple_Batch_MacOS" -workflow.variable=var_data_path,/Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/data,String

Here is the same command broken down into a more readable form: first the ‘invocation’ of the script, then the path to the KNIME application, then the options. The application path may vary and you might have to try a few things or just ask ChatGPT to get it right, as has been mentioned in a previous blog (“How to Schedule KNIME Workflow on a MacOS”). Please note the escaped blanks: “KNIME\ 5.2.0” means “KNIME 5.2.0”.

#!/bin/bash
# the trailing backslashes join the lines into one command
/Applications/KNIME\ 5.2.0\ Intel.app/Contents/MacOS/knime \
-reset -nosave -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION \
-workflowDir="/Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/KNIME_Simple_Batch_MacOS" \
-workflow.variable=var_data_path,/Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/data,String

At the end of the .sh script is a part sending a Flow Variable var_data_path to KNIME as a Workflow Variable (like in the Windows section above). You can have as many such variables as you want. Also other settings should work as in the Windows example.
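If you want to pass several variables, it can get hard to keep the one-liner readable. Here is a minimal sketch of how the call could be assembled from a bash array instead. The paths and the extra variable names v_run_date and v_hours are placeholders I made up for illustration; they would have to exist as Workflow Variables in your workflow. The script only prints the command so you can check it first.

```shell
#!/bin/bash
# Sketch: assemble a KNIME batch call with several workflow variables.
# All paths and the names v_run_date / v_hours are placeholders (assumptions).

KNIME_BIN="/Applications/KNIME 5.2.0 Intel.app/Contents/MacOS/knime"
WORKFLOW_DIR="$HOME/knime-workspace/KNIME_Batch/KNIME_Simple_Batch_MacOS"
DATA_DIR="$HOME/knime-workspace/KNIME_Batch/data"

# Current hour without a leading zero, so it can be passed as an int.
hour="$(date +%H)"
hour="$((10#$hour))"

# One array element per argument keeps paths with blanks intact.
KNIME_ARGS=(
  -reset -nosave -nosplash
  -application org.knime.product.KNIME_BATCH_APPLICATION
  "-workflowDir=$WORKFLOW_DIR"
  "-workflow.variable=var_data_path,$DATA_DIR,String"
  "-workflow.variable=v_run_date,$(date +%Y%m%d),String"
  "-workflow.variable=v_hours,$hour,int"
)

# Print the command in a copy-and-paste safe form instead of running it.
printf '%q ' "$KNIME_BIN" "${KNIME_ARGS[@]}"
echo
```

Once the printed command looks right, you could replace the printf line with "$KNIME_BIN" "${KNIME_ARGS[@]}" to actually start the run.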

The run_knime_workflow.sh file should reside in the /script/ folder of your Workflow group. You will now have to ‘activate’ the file.

  • in the Terminal, navigate to the directory containing your .sh script
  • run chmod +x run_knime_workflow.sh to make it executable

You are now ready to test the script by typing this command:
./run_knime_workflow.sh

If all paths are set right you should see some action in the target folder of your workflow (the /data/ path). You might want to have some indicators in the workflow to show you a result like writing a target file using a timestamp and the path provided as a Flow Variable.

More complicated shell scripts are possible. You can find one example on the KNIME Forum. Please also note that setting the -vmargs option will override the configurations in the knime.ini.
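As one example of a slightly more elaborate script, here is a sketch of a wrapper that writes a log file with a timestamp in its name, similar to the advanced .CMD file in the Windows section. The function name run_with_log and the LOG_DIR variable are my own inventions, and I am assuming that a non-zero exit status means something went wrong, as is the usual convention.

```shell
#!/bin/bash
# Sketch: run a command and keep its output in a timestamped log file.
# run_with_log and LOG_DIR are made-up names for illustration.

LOG_DIR="${LOG_DIR:-/tmp}"

run_with_log() {
  local stamp log_file status
  stamp="$(date +%Y%m%d-%H%M%S)"
  log_file="$LOG_DIR/knime_log_$stamp.txt"

  echo "started:  $(date)" > "$log_file"
  # Run the command given as arguments; capture stdout and stderr.
  "$@" >> "$log_file" 2>&1
  status=$?
  echo "finished: $(date) (exit status $status)" >> "$log_file"

  # Tell the caller where the log is and pass the exit status on.
  echo "$log_file"
  return "$status"
}

# Example call (commented out so nothing starts by accident):
# LOG_DIR="$HOME/knime-workspace/KNIME_Batch/data" run_with_log ./run_knime_workflow.sh
```

The last line of each log then tells you at a glance whether the run finished and with which status.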

Setting up a cronjob to automate the whole thing

The next step would be a setup of a cronjob to automate your script.

Again: your Mac will have to be up and running and there are no guarantees (so you might want to take a look at that KNIME Hub Team Plan instead…)

In your terminal you can stay in the /script/ folder and list existing cronjobs with:
crontab -l

You can now either use a terminal editor like vi or nano or micro, but I will show you just how to employ a standard text editor (we are supposed to be low-code here … sort of …):
crontab -l > my_crontab

This will export your settings to a local text file called “my_crontab” which now should reside in your /script/ folder. You can now enter this scheduling command into the file and save it:

my_crontab

54 21 * * * /Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/script/run_knime_workflow.sh >> /Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/data/logfile_mac.log 2>&1

If you have saved the file you can then import back the settings for your cronjob (and check them again with crontab -l):
crontab my_crontab

Let's break down this command, which will start the job every day at 21:54h: first come the time settings in UNIX style, then the path of the script itself, and finally a redirect that appends the output of the script to a log file in your /data/ folder so you can check whether everything worked out.

54 21 * * *

/Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/script/run_knime_workflow.sh

>> /Users/m_lauber/Dropbox/knime-workspace/KNIME_Batch/data/logfile_mac.log 2>&1
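If you would rather generate such a line than type it, the three parts can be put together with a small helper. This is only a sketch: make_cron_line is a name I made up, and the paths are placeholders you would replace with your own.

```shell
#!/bin/bash
# Sketch: print a crontab line for a daily run at a given hour and minute.
# make_cron_line and the example paths are assumptions for illustration.

make_cron_line() {
  local hour="$1" minute="$2" script="$3" log="$4"
  # Field order: minute hour day-of-month month day-of-week,
  # followed by the command and the redirect that appends to the log.
  printf '%s %s * * * %s >> %s 2>&1\n' "$minute" "$hour" "$script" "$log"
}

make_cron_line 21 54 \
  "$HOME/knime-workspace/KNIME_Batch/script/run_knime_workflow.sh" \
  "$HOME/knime-workspace/KNIME_Batch/data/logfile_mac.log"
```

You could append the output to your my_crontab file (make_cron_line ... >> my_crontab) and then load it with crontab my_crontab as described above. The sketch does not escape blanks, so keep the paths free of them.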

In case you are wondering what these asterisks (* * *) are all about:

Time settings for cronjob

A small guide on how to set up these time settings:

* * * * *  command to execute
│ │ │ │ │
│ │ │ │ └───── day of week (0 - 7) (Sunday=0 or 7)
│ │ │ └────────── month (1 - 12)
│ │ └─────────────── day of month (1 - 31)
│ └──────────────────── hour (0 - 23)
└───────────────────────── min (0 - 59)

Special Characters

  • Asterisk (*): Represents “every” time unit. For example, * in the hour field means "every hour."
  • Comma (,): Separates items in a list. For example, 1,3,5 in the day of the week field means "Monday, Wednesday, and Friday."
  • Hyphen (-): Defines ranges. For example, 9-17 in the hour field means "every hour from 9 AM to 5 PM."
  • Slash (/): Specifies increments. For example, */15 in the minute field means "every 15 minutes."

Examples

  • 30 4 * * * command: Runs command at 4:30 AM every day.
  • 15 14 1 * * command: Runs command at 2:15 PM on the first day of every month.
  • 0 22 * * 1-5 command: Runs command at 10:00 PM every weekday (Monday through Friday).
  • */10 * * * * command: Runs command every 10 minutes.
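To get a feeling for how the special characters behave, you can play with a small matcher. This is a simplified sketch and not how cron itself is implemented: the function name cron_field_matches is made up, it only handles numbers (no names like "mon"), and the optional third argument is the smallest value of the field (0 for minutes and hours, 1 for day of month and month).

```shell
#!/bin/bash
# Sketch: test whether a single cron field matches a number.
# Handles *, lists (1,3,5), ranges (9-17) and steps (*/15). Simplified on purpose.

cron_field_matches() {
  local field="$1" value="$((10#$2))" min="${3:-0}"
  local parts part range step lo hi

  # Split the field at commas; read does not expand the * as a file pattern.
  IFS=',' read -r -a parts <<< "$field"

  for part in "${parts[@]}"; do
    range="$part"
    step=1
    if [[ "$part" == */* ]]; then
      range="${part%%/*}"   # text before the slash
      step="${part##*/}"    # increment after the slash
    fi

    if [[ "$range" == "*" ]]; then
      lo="$min"; hi="$value"
    elif [[ "$range" == *-* ]]; then
      lo="${range%%-*}"; hi="${range##*-}"
    else
      lo="$range"; hi="$range"
    fi

    # 10# forces base 10 so that values like 08 are not read as octal.
    if (( value >= 10#$lo && value <= 10#$hi && (value - 10#$lo) % step == 0 )); then
      return 0
    fi
  done
  return 1
}

cron_field_matches "*/15" 30 && echo "*/15 matches minute 30"
cron_field_matches "1-5" 0 || echo "1-5 does not match Sunday (0)"
```

A full schedule matches when all five fields match the current minute, hour, day of month, month and day of week. Real cron has some extra rules, for example when both day fields are restricted, that this sketch leaves out.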

So now you are ready to automate the hell out of your KNIME workflows with just what you have in Windows and MacOS — and you might get some appetite for the big folks’ toy, the KNIME Business Hub which will make your life much easier and more fun.

In case you cannot get enough of these scripty things there is this list:

ChatGPT is able to provide a list of all options that you can set

Disclaimer: I have not tested them all and might add or remove some as that happens. Always proceed with caution and test on your environment.

Running KNIME Analytics Platform in batch mode offers various command-line options to control the execution of workflows. Below is a list of optional settings that you can use with the command-line interface for KNIME and their explanations:

-nosplash: This option prevents the KNIME splash screen from appearing when starting KNIME in batch mode. It helps in reducing the graphical output for batch operations, making it more suitable for automated scripts or server environments.

-application: Specifies the application to run. For batch executions, you should use org.knime.product.KNIME_BATCH_APPLICATION.

-workflowDir=<path>: Specifies the directory of the workflow to be executed. You need to provide the full path to the workflow directory.

-workflowFile=<path>: Alternatively to -workflowDir, this specifies the path to a KNIME workflow file (.knwf or .zip) to be executed.

-reset: Resets the workflow before executing it. This ensures that the workflow is run from a clean state, without any previous results.

-nosave: Prevents the workflow from being saved after execution. This is useful for automated processes where you do not need to keep the state of the workflow post-execution.

-preferences=<path>: Specifies the path to an Eclipse preference file (.epf) to load preferences from. This can be used to set various KNIME and plugin preferences for the execution.

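-credential=<name;login;password>: Allows specifying credentials for database connections or other nodes requiring authentication. The name corresponds to the credential variable used in the workflow; login and password are optional and separated by semicolons, so put the whole argument in quotes when you use it in a shell script.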

-masterkey=<password>: If the workflow contains encrypted nodes, this option allows providing the master password to decrypt them during execution.

-vmargs: This option allows you to pass arguments directly to the Java Virtual Machine, such as memory settings. For example, -vmargs -Xmx2048m sets the maximum Java heap size to 2048 megabytes.

-consoleLog: Outputs the log messages to the console. This is useful for debugging or monitoring the workflow execution in real-time.

--launcher.suppressErrors (note the two leading dashes): Suppresses error dialogs in the GUI. Since batch mode is typically used in environments without a graphical interface, this option can prevent the system from waiting for user input on errors.

-data=<path>: Specifies the workspace location. This is where KNIME stores its internal data and settings.

-destFile=<path>: Specifies a file path to save the workflow to after execution. This is useful if you want to save the state of the workflow post-execution to a specific location.

These options provide flexibility in how you execute KNIME workflows in batch mode, allowing for automation, integration with other systems, and customization of the execution environment.

When running KNIME (or any Java application) in batch mode, the -vmargs option allows you to pass various arguments to the Java Virtual Machine (JVM) to control its behavior. Besides memory settings, several other -vmargs options can be useful for optimizing performance, debugging, or customizing the runtime environment. Here are some additional -vmargs options that might make sense depending on your requirements:

Garbage Collection Tuning:

  • -XX:+UseG1GC: Enables the G1 garbage collector, which is designed for applications with large heaps and minimizes pause times.
  • -XX:MaxGCPauseMillis=50: Sets a target for the maximum garbage collection pause time. This is useful for applications that require low-latency.

Heap Dump:

  • -XX:+HeapDumpOnOutOfMemoryError: Tells the JVM to generate a heap dump when an out-of-memory error occurs. This is useful for diagnosing memory leaks.
  • -XX:HeapDumpPath=/path/to/dump: Specifies the path to store heap dump files.

Performance Monitoring:

  • -XX:+UseStringDeduplication: Enables string deduplication in the JVM to reduce the memory footprint of String objects.
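  • -Xlog:gc*:file=/path/to/gc.log: Enables detailed GC logging to a specified file, which can help in analyzing garbage collection behavior and performance. This is the syntax for Java 9 and later; the older form -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log only applies to Java 8.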

JVM Behavior:

  • -Djava.awt.headless=true: Ensures that the JVM runs in headless mode, which is useful for environments without a graphical user interface.
  • -Dfile.encoding=UTF-8: Sets the default character encoding used by the JVM. This can help avoid issues with non-ASCII characters.

Security:

  • -Djava.security.egd=file:/dev/./urandom: Speeds up the gathering of entropy for cryptographic operations on Linux systems, which can improve startup times.

Networking:

  • -Dhttp.proxyHost=proxyserver.com -Dhttp.proxyPort=8080: Specifies an HTTP proxy for the JVM to use for outgoing connections.
  • -Dhttps.proxyHost=proxyserver.com -Dhttps.proxyPort=8080: Specifies an HTTPS proxy.

Debugging:

  • -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:LogFile=/path/to/vm.log: Enables logging of VM diagnostic output to a file.
  • -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005: Enables the JVM to listen for a debugger to connect on port 5005 (the older spelling is -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005).

Class Loading:

  • -Xlog:class+load=info: Enables tracing of class loading, which can be useful for debugging classpath issues. This is the syntax for Java 9 and later; on Java 8 the flag was -XX:+TraceClassLoading.

These options can significantly affect the behavior and performance of KNIME and the JVM. It’s important to test these settings in your environment, as their benefits and impacts can vary depending on the specifics of your workflows, the data you’re processing, and the hardware you’re using. Always refer to the latest JVM documentation for your specific Java version, as available options and their effects may change between versions.

In case you enjoyed this story you can follow me on Medium (https://medium.com/@mlxl) or on the KNIME Hub (https://hub.knime.com/mlauber71) or KNIME Forum (https://forum.knime.com/u/mlauber71/summary).

Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry