DATA STORIES | TIPS AND TRICKS | KNIME ANALYTICS PLATFORM

Mastering KNIME: Unlocking Peak Performance with Expert Tips and Smart Settings

The low-code tool KNIME is great and a lot of guides are out there to help you master it. You are encouraged to read the official ones. Here are some more hints…

Markus Lauber
Low Code for Data Science

You can just start using the data analytics tool KNIME and be happy. Over time some questions, quirks and even problems might (or might not) come up, and this blog is here to help, drawing on my 10+ years of experience with KNIME and a lot of conversations in the KNIME Forum (https://forum.knime.com/). What follows applies at least up to version 4.7.x. KNIME 5.1 and 5.2+ have a different user interface, so some details might not apply there, though the general concepts still do.

University of Konstanz where KNIME was initially developed (picture taken by me in June 2020).

If you encounter a problem you also might want to ask a question in the KNIME Forum. To do that: read how to write a good forum entry or ask for help, then provide a sample workflow and maybe provide a detailed log file.

A large open source tool

It may seem obvious, but KNIME is a quite extensive open source tool based on Java (and lately some Python). It is under active development with new releases every few months, and it runs on Windows, Mac and Linux, so some issues might appear, and the graphical presentation needs a fair amount of resources. As with all software, you will need to familiarize yourself with the environment, so maybe start with the User Guide (there is one for the last version of the old GUI, 4.7, and one for the new 5.x interface):

The KNIME Workbench guide
The KNIME User Guide is a good place to start (https://docs.knime.com/latest/analytics_platform_user_guide/index.html#introduction).

Overview of Topics you might want to consider …

Please be aware: there might not be one perfect setting for every task and system. Read the entries and find the ones that fit your needs. Also keep in mind that super clever settings often cannot substitute for a good setup and strong hardware. Having said that, there are some things you could check (more details on each further down the article):

  • Overall performance of your system (allocated RAM, CPU, Disk speed). Give KNIME what you can but leave some for your operating system and other tasks (20+%) (more details below).
  • Permissions: do you (and KNIME) have full access to all files and disk spaces (there was an issue with MacOS once)?
  • Virus scanner — KNIME has a lot of small files, make sure no virus scanner is aggressively blocking them.
  • Settings about data storage and Java heap space (and temporary folders) — we come to that in a moment.
  • If you encounter problems maybe restart your KNIME once with a “-clean” option in the knime.ini.
  • Maybe do a clean new installation in a new folder (if you are on a Windows system just un-zip the zip version into a clean folder) — https://www.knime.com/downloads.
  • Check the integrity of your storage/disk and be aware of possible issues with OneDrive: special characters like the hash (#) and brackets “(” and “)” must be allowed/supported (more details below).
  • Think about a suitable backup strategy for your work. Things can go wrong. You might want to revert to an earlier version of your work (more details below).

On to more details …

Know your paths and folders

I would always encourage people to know their system and their paths: which data is stored where. I often see people who are unaware of where their files are and what they mean, and that is when problems start.

You are encouraged to read the “KNIME File Handling Guide” — yes it is lengthy but then you will know all you need about how KNIME handles files and paths.

These are the three storage locations for KNIME:

  • The folder where the KNIME Software resides. You can delete or re-install the software and even keep several versions in parallel. Just keep in mind: you will be able to open older KNIME workflows, but newer ones might not open in older software versions.
  • The “knime-workspace”. This is where your actual workflows (your work) are stored. This is the data you should protect, version and back up.
  • The KNIME temp space. Here you will have the temporary files KNIME uses.

You should learn how to configure these File and Folder Variables and how to use them:

It is always good to know where your data is and how KNIME does handle paths (https://forum.knime.com/t/write-multiple-csv-file-to-a-loop/44989/2?u=mlauber71)

Also: you might not be immediately aware of it, but you can capture a configuration you have entered (like a path for a file to save) and use that in a Flow Variable. To be honest, I overlooked this option for quite a long time …

“Capture” any setting you make in a Flow Variable like a File Path (https://forum.knime.com/t/input-file-name-to-output-file-name/48532/2?u=mlauber71)

KNIME Memory and Performance

One constant theme is the performance of KNIME workflows. One important factor is the physical memory (RAM) allocated to KNIME, which shows up as “Java Heap Space”. Since KNIME is a visual tool that stores the intermediate data of each node, it needs a good amount of resources, especially memory.

Somewhat older but still relevant blog: “Optimizing KNIME workflows for performance”

The default setting is 2 GB of RAM out of the box. While this is good for a lot of tasks, you might need more. My rule of thumb is half or two thirds of your system's memory if you plan on doing a lot of things with KNIME, also depending on the size of your data.
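To make the rule of thumb concrete, here is a minimal Python sketch. The function name `suggest_heap_gb` and its parameters are made up for illustration, and the 20 % reserve is simply the operating system share mentioned in the overview above, not an official KNIME recommendation.

```python
def suggest_heap_gb(total_ram_gb: float, fraction: float = 0.5,
                    os_reserve: float = 0.2) -> int:
    """Return a whole-GB suggestion for KNIME's -Xmx value (rule of thumb only)."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    # Never hand out more than what remains after the operating system's share.
    capped_fraction = min(fraction, 1.0 - os_reserve)
    # Round down to whole gigabytes, but suggest at least 1 GB.
    return max(1, int(total_ram_gb * capped_fraction))


print(suggest_heap_gb(16))         # half of 16 GB
print(suggest_heap_gb(16, 2 / 3))  # two thirds of 16 GB, rounded down
print(suggest_heap_gb(8, 0.9))     # request is capped at 80 % of 8 GB
```

On a 16 GB machine this suggests 8 GB (half) or 10 GB (two thirds). Treat the number as a starting point and watch the heap status while your workflows run.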

You will have to locate the knime.ini file and change the value “-Xmx” like this:

  • -Xmx4g (4 GB)
  • -Xmx10g (10 GB)
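Most people will simply edit the line in a text editor. If you would rather script the change, the following Python sketch shows one way to do it. It works on the text of the file only; the helper name `set_xmx` and the sample content are made up, and a real knime.ini contains many more lines.

```python
import re


def set_xmx(ini_text: str, gigabytes: int) -> str:
    """Return knime.ini content with the -Xmx line set to the given size in GB."""
    if gigabytes < 1:
        raise ValueError("gigabytes must be at least 1")
    new_line = f"-Xmx{gigabytes}g"
    # Match an existing setting at the start of a line, e.g. -Xmx2048m or -Xmx4g.
    pattern = re.compile(r"^-Xmx\S+", flags=re.MULTILINE)
    if pattern.search(ini_text):
        return pattern.sub(new_line, ini_text, count=1)
    # No -Xmx line yet: append one. This assumes the file ends with the JVM
    # options that follow the -vmargs line.
    separator = "" if ini_text.endswith("\n") or not ini_text else "\n"
    return f"{ini_text}{separator}{new_line}\n"


sample = "-vmargs\n-Xmx2048m\n"  # made-up two-line excerpt
print(set_xmx(sample, 10))
```

To apply it you would read knime.ini, pass the text through `set_xmx`, keep a copy of the original file and write the result back. KNIME needs a restart to pick up the new value.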

While giving KNIME RAM is good you should leave something for your system otherwise everything might get stuck. OK, and disk speed is still a thing. An old HDD might not be the best choice. Opt for a fast SSD!

Giving KNIME enough memory (RAM) is important. Clicking the recycle bin will clear some clutter (https://forum.knime.com/t/knime-3-6-crash-when-dealing-with-massive-data/12145/2?u=mlauber71).
You can (and should) activate “Show heap status” in the KNIME preferences (https://forum.knime.com/t/knime-3-6-crash-when-dealing-with-massive-data/12145/2?u=mlauber71).

KNIME internal data storage

One thing to check is the internal storage format of KNIME. In more recent versions the underlying storage has been shifted to Apache Arrow, which also improves the data transfer between KNIME and Python. Some initial problems with that have been fixed.

Blog: “KNIME Columnar Table Backend Boosts Performance”

Speed up your KNIME workflows

There are several things you can do to speed up your KNIME workflows. More often than not it comes down to planning what you want to do and bringing that in sync with your resources.

You may want to split a very large workflow into several smaller parts and call them from one central instance (example one, example two – download the whole workflow groups).

Also consider this:

  • Java Garbage Collection can help to free memory (other hints in the link).
  • You can use the Cache node to collect data at certain points in your workflow. Might be a good thing to place this at the end of complex operations and before exports.
  • You can decide to not save parts of your workflow to save space.
  • Streaming might be an option to speed up some operations. You get the results for a chunk of rows faster instead of running each step for all rows before moving on. Whether it helps depends on your task.
  • Another option is Parallel Execution — which of course would need more resources but might speed up your tasks.

These are options to consider if you have special use cases or experience problems. Please do not just randomly implement these.
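To illustrate the idea behind streaming from the list above, here is a small Python analogy using generators. It is not KNIME code and says nothing about how KNIME implements streaming internally; all function names are made up. The point is only that the first results arrive after the first chunk, not after the whole table has passed through every step.

```python
from typing import Iterable, Iterator, List


def read_chunks(rows: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield the rows in chunks instead of handing over the whole table."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def double(chunks: Iterable[List[int]]) -> Iterator[List[int]]:
    """First step: runs on each chunk as soon as it arrives."""
    for chunk in chunks:
        yield [value * 2 for value in chunk]


def keep_large(chunks: Iterable[List[int]], threshold: int) -> Iterator[List[int]]:
    """Second step: filters each chunk without waiting for the full table."""
    for chunk in chunks:
        yield [value for value in chunk if value > threshold]


rows = list(range(10))
pipeline = keep_large(double(read_chunks(rows, chunk_size=4)), threshold=5)
first_result = next(pipeline)  # only the first 4 rows have been processed so far
print(first_result)
print([value for chunk in pipeline for value in chunk])  # the remaining chunks
```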

If you have to handle very large files there are some strategies using formats like Parquet or H2/SQL, see: “KNIME Snippets (1): Collect and Restore — or how to handle many large files and resume loops”
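The general idea of a loop that can be resumed is sketched below in plain Python. This is not the workflow from the linked article, just an illustration under my own assumptions: the function name, the JSON checkpoint file and the file names are all made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def process_resumable(items: Iterable[str], work: Callable[[str], None],
                      checkpoint: Path) -> List[str]:
    """Run work() for every item not yet in the checkpoint; return what ran now."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed_now = []
    for item in items:
        if item in done:
            continue  # already finished in an earlier run
        work(item)
        done.add(item)
        processed_now.append(item)
        # Save after every item so a crash loses at most the current one.
        checkpoint.write_text(json.dumps(sorted(done)))
    return processed_now


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "done.json"
    files = ["part_1.parquet", "part_2.parquet", "part_3.parquet"]  # made-up names
    first = process_resumable(files[:2], print, checkpoint)  # pretend the run stopped here
    second = process_resumable(files, print, checkpoint)     # picks up the missing file only
```

Within KNIME the same pattern would be built from loop nodes and a small table or file that records which inputs are already done.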

Split your workflows if they become too large

If your workflows become too large and you can no longer handle them safely (and you cannot increase the resources), one option is to split them into a main workflow and (several) sub-workflows (see example one, example two, example three; best to download the whole workflow group).

Call a KNIME sub-workflow from a main one (https://forum.knime.com/t/call-workflow-table-based-not-accepting-variable-with-path/48189/6?u=mlauber71)

Backup is only for the faint of heart?

Backup is important for all your workflows and data, regardless of KNIME or other programs. Keep in mind: KNIME desktop does not (yet) have a recycle bin mechanism where you can recover workflows that you accidentally deleted in the KNIME Explorer window. The commercial KNIME Business Hub does have such a recovery mechanism. KNIME does, however, have an auto-save mechanism for open workflows.

With backup also think of versioning — the KNIME Business Hub would help with built in versioning and collaboration but you might also consider it for your KNIME desktop version.

You can have your own improvised local Backup with KNIME nodes — it is not pretty but it would work :-)

A do-it-yourself Backup for KNIME (https://forum.knime.com/t/copy-move-files-node/29412/15?u=mlauber71).
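If you prefer a script outside of KNIME, a timestamped zip of the workspace folder is about the simplest thing that could work. The following Python sketch is an assumption on my part and not the node-based solution from the forum post; the function name `backup_workspace` and the folder names in the demo are hypothetical, and the demo runs on a throwaway folder with a dummy workflow file.

```python
import shutil
import tempfile
import zipfile
from datetime import datetime
from pathlib import Path


def backup_workspace(workspace: Path, backup_dir: Path) -> Path:
    """Zip the whole workspace folder into backup_dir and return the archive path."""
    if not workspace.is_dir():
        raise FileNotFoundError(f"workspace not found: {workspace}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive_base = backup_dir / f"{workspace.name}_{stamp}"
    # make_archive appends ".zip" and returns the final file name.
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=str(workspace))
    return Path(archive)


# Demo on a temporary folder with a dummy workflow, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "knime-workspace"
    (workspace / "my_workflow").mkdir(parents=True)
    (workspace / "my_workflow" / "workflow.knime").write_text("<config/>")
    archive = backup_workspace(workspace, Path(tmp) / "backups")
    with zipfile.ZipFile(archive) as zipped:
        names = zipped.namelist()
    print(archive.name, names)
```

Keep the backup folder outside of the workspace (otherwise each archive would try to include the earlier ones), and close KNIME before you run it so that no workflow is being saved halfway through.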

Another way to store your data is using a cloud service. In any case: take care of your workflows and data.

If something happens to your workflows and you do not have a backup: stop what you are doing and seek professional help in restoring the data from your hard drive! Yes, this will cost money, but you have to decide if losing the work will cost you more.

KNIME and Clouds — OneDrive and DropBox

You can have your workflows on a cloud drive as long as the data is (also) present on your local folder and you keep some things in mind. In general very long paths and file names might give you problems, so try to avoid them.
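A quick way to find candidates for trouble is to scan the workspace for very long paths and for the special characters mentioned earlier (hash and brackets). The Python sketch below is only an illustration: the 200-character threshold is an arbitrary assumption and not an official limit of any cloud service, and the function and folder names are made up.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

RISKY_CHARS = "#()"    # the characters flagged earlier in this article
MAX_PATH_LENGTH = 200  # hypothetical threshold, adjust to your own setup


def find_risky_paths(root: Path,
                     max_length: int = MAX_PATH_LENGTH) -> List[Tuple[str, str]]:
    """Return (relative path, reason) pairs for entries that may cause sync trouble."""
    findings = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root).as_posix()
        if len(str(path)) > max_length:
            findings.append((relative, "path too long"))
        if any(char in path.name for char in RISKY_CHARS):
            findings.append((relative, "special character"))
    return findings


# Demo on a temporary folder with a made-up node folder name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    node_dir = root / "workflow" / "CSV Reader (#1)"
    node_dir.mkdir(parents=True)
    (node_dir / "settings.xml").write_text("")
    for relative, reason in find_risky_paths(root):
        print(f"{reason}: {relative}")
```

A finding does not mean something is broken. It only tells you where to look first if the sync client complains.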

Concerning MS OneDrive these are the things to consider:

If you plan on using clouds to work with KNIME on several systems you might want to consider some important aspects:

  • Make sure to fully sync your data after closing KNIME.
  • Do not have KNIME open on both systems at the same time.
  • Make sure you have a backup nonetheless.

Further dark arts and KNIME Tuning

If you still encounter problems with performance you can dive deeper into the dark arts of tuning other parameters in KNIME. One place to explore is the KNIME FAQ.

Again: these could be measures of last resort or for very experienced users. Before diving into these muddy waters, try thinking about your tasks and planning, and maybe ask in the KNIME Forum, which is a good idea anyway :-)

If you enjoyed this blog make sure to follow me on Medium and also on the KNIME Community Hub and Forum (https://hub.knime.com/mlauber71).

Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry