GETTING STARTED | DATA SCIENCE TOOLS | KNIME ANALYTICS PLATFORM
The Best-Kept Secret in Data Science Is KNIME
The ultimate guide to unlocking the secrets of the KNIME ecosystem
Discover KNIME, the best-kept secret in data science. This powerful and versatile open-source platform offers a visual interface and a wide range of built-in algorithms to unlock the full potential of your data. Whether you’re an experienced data scientist or just getting into the field, KNIME is a game-changer for anyone working with data.
Why KNIME?
More and more often, Business Analysts and BI Specialists ask me:
“Why KNIME? Can you show me a demo? And how do you use it in your company?”
Depending on their level of knowledge, they also ask me:
“I have heard that there is a free open source version. But the premium version is surely not for free, right?”
“I can do this and that in SPSS, SAS, Alteryx, Python,… can you do the same with KNIME?”
“I have heard that it is a No-Code tool. But why don’t you code?”
All these people have heard or seen that we have been successfully using KNIME in our company for almost a decade.
Some of those who have evaluated the software say: “It’s a No-Code tool that can be used to create data pipelines. Our XYZ tool can do that too!”
In the past, it was very difficult to get Data Engineers or Business Analysts interested in KNIME. They were happy with their (expensive) tools. Today they come to me on their own.
But why this sudden change of mind?
In fact, if you search Google Trends for the keyword KNIME, you will see how queries have steadily increased since July 2021.
Two reasons seem to be gaining more and more importance:
Companies are starting to save on licensing costs as well
Not long ago, if you worked for a large company, renewing the high license costs for data tools such as SAS or IBM SPSS Modeler was not a big problem.
These days it’s: “we don’t pay anymore. Find a cheaper tool!”
Collaboration between Business Analysts, Developers and Data Scientists
Business analysts used to work only with Excel. Today more and more of them are realizing that it can be done more efficiently with a Low-Code or No-Code tool.
The developers, on the other hand, would prefer to code only. Cooperation on this basis is very difficult: misunderstandings and the repetition of certain tasks are inevitable.
The solution would be a cheaper or open-source tool that gives business analysts and developers a common language, so that both can work together on the same platform and according to the same principles.
And for us, that solution is KNIME.
KNIME covers all these aspects:
- it is and will remain open source, as this is the basis of the KNIME philosophy
- provides a Low-Code platform suitable for business analysts as well as developers, data engineers and data scientists
- it is cross-platform software, with versions for Windows, macOS, and Linux.
At KNIME, they believe in openness and the power of the community. Their philosophy is to maintain and develop an open-source platform that contains all the functionality any individual might require, and to keep adding functionality through both their own work and that of the community.
Unlike other open source products, KNIME is not a cut-down version and there are no artificial limitations on execution environment or data size:
If you have enough local or cloud based space and compute power, you can run projects with billions of rows, as many KNIME users currently do.
The second point covers the Low-Code approach. Because…
the best representation of an ETL pipeline is a visual workflow.
ETL (extract, transform, load) is a type of data integration that refers to the three phases used to combine data from various sources. In this process, data is extracted from one or more source systems, transformed into an analyzable format, and loaded into a data warehouse or another target system.
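To make the three phases concrete, here is a minimal ETL sketch in plain Python using only the standard library. The CSV content, table name, and column names are made up for illustration; an in-memory string stands in for a real source file and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a CSV source (an in-memory string
# stands in for a real file or source system).
raw = io.StringIO("id,amount\n1,10.5\n2,3.2\n3,8.1\n")
rows = list(csv.DictReader(raw))

# Transform: cast the text fields into an analyzable format.
records = [(int(r["id"]), float(r["amount"])) for r in rows]

# Load: write the cleaned rows into the target database (an
# in-memory SQLite database stands in for a data warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", records)

total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))  # 21.8
```

In KNIME, each of these three steps would be its own node (e.g. a reader node, a transformation node, a DB writer node) connected in a workflow.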
Judge it for yourself: Which of the following two representations is closest to the process above?
The visual environment provides just the right amount of abstraction to build and share your work. Yes, because you will always be working with others and therefore need to be able to document and share your work.
Nowadays no data scientist or data engineer works alone anymore. We are all part of teams and we all need to communicate together. Discussion of the tasks, best practices, documentation are all necessary tasks in the daily work.
Starting with KNIME
Installation
You can download KNIME here.
The following video shows how easy the installation is.
Once it’s been installed, locate your instance of KNIME — from the appropriate folder, desktop link, application, or link in the start menu — and start it.
Before we get started with a concrete example, we need to familiarize ourselves with the elementary concepts in KNIME.
The Workspace
When the splash screen appears, a window will ask for the location of your workspace. This workspace is a folder on your machine or on a cloud server that will host all your work. The default workspace folder is called knime-workspace:
Be careful! Not every cloud provider is suitable for hosting a workspace. So far, I have been able to use OneDrive and iCloud successfully, but other clouds failed.
If in doubt, save the workspace locally.
After clicking “Launch”, the workbench for KNIME will open.
The Workbench
The workbench is the place where you will be building your workflows.
It’s also where you’ll find all the resources you need to help you build your workflows.
The Workflow Editor is what you’ll be using to build your workflows. Workflows are made up of individual tasks, which we refer to as “nodes”.
They perform all kinds of operations, for example reading or writing files, transforming data, training models, creating visualizations, etc.
You build your workflow by dragging nodes from the Node Repository to the Workflow Editor, then connecting, configuring, and executing them (see video below).
Nodes and Workflows
In KNIME, individual tasks are represented by nodes. They are the smallest possible unit in KNIME and have been created to perform all sorts of tasks, including reading/writing files, transforming data, training models, creating visualizations, and so on.
A sequence of nodes creates a workflow. A workflow, as a sequence of nodes, is the graphic equivalent of a script or a series of instructions (see video below).
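To make the script analogy concrete, here is a minimal sketch in plain Python. Each node is roughly a function that takes a table and returns a table; connecting nodes is composing those functions. The node names, data, and threshold below are made up for illustration.

```python
# Each KNIME node is roughly a function from table to table;
# a workflow is the chain of those functions.

def read_data():                      # plays the role of a "CSV Reader" node
    return [{"name": "Ada", "score": 91}, {"name": "Bob", "score": 58}]

def filter_rows(table, min_score):    # plays the role of a "Row Filter" node
    return [row for row in table if row["score"] >= min_score]

def add_grade(table):                 # plays the role of a transformation node
    return [{**row, "grade": "pass"} for row in table]

# "Connecting" the nodes is just composing the functions in order.
result = add_grade(filter_rows(read_data(), 60))
print(result)  # [{'name': 'Ada', 'score': 91, 'grade': 'pass'}]
```

The visual workflow simply draws this chain as boxes and arrows instead of nested function calls.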
The Node Repository
You’ll find it in your KNIME workbench in the bottom left-hand corner. It contains the nodes that can be used in your workflow.
The nodes are organized in categories. Each category represents a specific functionality in data analytics.
Let’s have a look at the different categories:
IO — contains the nodes you need to access data: reading and writing files in a number of formats, such as CSV, Excel, PMML, images, tables, and more.
The Manipulation category contains nodes for filtering, aggregating, and transforming data tables. For column operations, for example, there are a number of column filters, conversion, joining, and splitting nodes, plus various transformation nodes.
To name two of the most used nodes here: the Joiner joins two tables, and the String Manipulation node modifies the content of string-type column cells.
The Row subcategory contains numerous filter nodes, as well as nodes for aggregation, partitioning, and sampling (see video below).
Learning Resources
The Learning Curve
Visual programming also makes the learning curve much gentler than that of code-based tools.
If you want to know why every Data Engineer should learn a Visual Programming Language, read my following article:
A GUI-based tool can be learned and applied in less time than a code-based one, again freeing up precious time and resources for more important investigations.
Too often I have seen entire months dedicated to learning coding practices before any data analysis technique is even approached. With KNIME, within a few weeks you can already assemble quite complex workflows for data transformation and for training machine learning algorithms.
KNIME Self-Paced Courses
These courses are free and take you from the basics all the way into machine learning. Courses are organized by level: L1 basic, L2 advanced, L3 deployment, L4 specialized.
In each course, you go through the lessons with ~5-minute videos, hands-on exercises, and knowledge-check questions.
KNIME Cheat Sheets
Cheat sheets are very helpful for beginners and make it quick to find the desired information about a node’s function.
KNIME Community Hub
The KNIME Community Hub is the public repository of the KNIME community. Here, you can share your workflows and download workflows from other KNIME users. Just type in your keywords and you will get a list of related workflows, components, extensions, and more. It is a great place to start, with plenty of examples!
For example, just type in the search box “basic” or “beginners” and you will get a list of example workflows illustrating basic concepts in KNIME Analytics Platform; type in “read file” and you will get a list of example workflows illustrating how to read CSV files, .table files, excel files, etc. Notice that a subset of these example workflows is also reported in the EXAMPLES server in the KNIME Explorer panel in the top left corner of the KNIME workbench.
Once you isolate the workflow of interest, click on it to open its page, and then download it or open it in your own KNIME Analytics Platform. Once it is in your local workspace, you can start adapting it to your data and your needs. Following the popular trend in programming of searching for ready-to-use pieces of code, you can simply download, reuse, and adapt workflows or pieces of workflows from the KNIME Hub to your own problem.
Books about KNIME
Books for becoming successful and efficient with KNIME. They cover beginner and advanced topics, plus how to transition from Alteryx, Excel, SAS, and SPSS for users who already have experience with a similar platform or tool.
KNIME TV Channel on YouTube
Be sure to check out also the KNIME TV Channel on YouTube. With a wide range of tutorials, webinars, and other resources, this channel is an invaluable resource for anyone looking to master KNIME. Whether you’re a beginner just getting started with the platform or an experienced data scientist, the KNIME TV Channel has something for you.
KNIME on Medium
On Medium, the publication Low Code for Data Science features successful data stories, data science theory, tips & tricks to get you started with KNIME, and more. And best of all, it collects articles written by the community, for the community.
Find out how to contribute and share your stories here.
Data Access with KNIME
In short, it is possible to read all kinds of data sources into KNIME.
I have extensively explained this topic in the following article:
Import flat files
Whether you need to import flat files, such as:
- text or CSV files
- Excel files
- SAS files
- SPSS files
Access relational Databases
or need to query any relational database, there are special DB connectors for:
- MySQL
- Oracle
- SQLite
- Snowflake
- PostgreSQL
- H2
- Microsoft Access
- Microsoft SQL Server
It is also possible to connect to cloud-based databases like Google BigQuery and Amazon Redshift.
To query the databases, you have two options. Either you write the SQL code directly in a node, as usual, …
or, for those less experienced in SQL, the same can be done with the appropriate dedicated nodes.
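The two options end up producing the same query: the dedicated DB nodes simply assemble the SQL statement from your dialog settings. The sketch below illustrates that idea in plain Python against an in-memory SQLite database; the table, column names, and the tiny "settings" dictionary are all made up for illustration.

```python
import sqlite3

# A throwaway in-memory database with a made-up "customers" table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (name TEXT, country TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [("Ada", "DE"), ("Bob", "US"), ("Eve", "DE")])

# Option 1: write the SQL yourself, as you would in a query node.
hand_written = db.execute(
    "SELECT name FROM customers WHERE country = 'DE' ORDER BY name"
).fetchall()

# Option 2: let the tool assemble the same statement from dialog
# settings, which is roughly what the dedicated DB nodes do.
settings = {"column": "country", "value": "DE"}
generated_sql = ("SELECT name FROM customers "
                 f"WHERE {settings['column']} = ? ORDER BY name")
node_based = db.execute(generated_sql, (settings["value"],)).fetchall()

print(hand_written == node_based)  # True
```

Either way, the database receives one SQL statement; the node-based route just spares you from writing it by hand.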
Access NoSQL Databases
Finally, NoSQL databases such as MongoDB can also be accessed in KNIME via a corresponding node.
And the best part is that you can combine all data sources together, as Rosaria Silipo and Lada Rudnitckaia show in their example workflow in the e-book “Will they blend?”.
The six databases are: MySQL, MongoDB, MS SQL Server, MariaDB, Oracle and PostgreSQL, and the corresponding workflow can be seen in the following image.
Machine Learning
KNIME provides a wide range of machine learning algorithms. Some of the key machine learning algorithms available in KNIME include:
- Supervised learning algorithms: These include popular algorithms such as Linear, Polynomial, and Logistic Regression, Generalized Linear Models (GLM), Regression Trees, Random Forest, Support Vector Machines (SVMs), and Neural Networks.
- Unsupervised learning algorithms: such as K-Means Clustering, Principal Component Analysis (PCA), and Hierarchical Clustering.
- Semi-supervised learning algorithms: such as Self-Organizing Maps (SOMs) and Expectation Maximization (EM).
- Ensemble learning algorithms: such as Bagging, Boosting and Random Subspace.
- Deep Learning algorithms: These include algorithms such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
All of these algorithms can be easily integrated into KNIME workflows via the corresponding nodes, allowing users to build complex machine learning models quickly and easily.
Additionally, KNIME provides many pre-built nodes to simplify the process of building models, such as feature selection, normalization, and cross-validation.
In the following short video you can see an example of a regression tree learner.
With KNIME, users can build, evaluate, and deploy machine learning models with minimal coding required.
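To give a feel for what a regression tree learner does under the hood, here is a toy one-split "stump" in plain Python: it searches for the threshold on a feature that makes each side of the split as homogeneous as possible (minimum sum of squared errors). The data points are invented for illustration; a real learner, like KNIME's node, repeats this search recursively over many features.

```python
def sse(ys):
    """Sum of squared errors of the values around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Pick the threshold on x that minimizes the total SSE."""
    best_t, best_score = None, None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if best_score is None or score < best_score:
            best_t, best_score = t, score
    return best_t

# Two clusters of target values: low around 5, high around 20.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.8, 20.0, 20.4, 19.6]
print(best_split(xs, ys))  # 3 (splits the low cluster from the high one)
```

Each leaf of the resulting tree then predicts the mean of its side, which is why regression trees produce piecewise-constant predictions.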
Extensions for Python, R and more
KNIME is a highly flexible and extensible platform that integrates with a variety of programming languages, including Python and R, to provide users with even greater power and versatility.
With the integration of Python and R, data scientists can use their preferred programming language to build and deploy models directly within the KNIME environment. This means that they can take advantage of the vast array of packages and libraries available in these languages, including TensorFlow, scikit-learn, and others, to create complex models and workflows.
Coding in Python
Starting with KNIME Analytics Platform 4.7, the Python integration comes pre-installed with a selection of packages (i.e. Python libraries) as a bundled environment. This means you can use the Python Script node without needing to install, configure, or even know about environments.
Add Additional Custom Packages
If you need a Python package that is not available in the bundled environment, there is a solution for that as well.
The Conda Environment Propagation node enables you to snapshot details of your Python environment (the installed packages, or simply the environment name) and “propagate” that environment to any new execution location where the Conda tool is also installed (any system with Anaconda already installed certainly qualifies).
In the node configuration dialog, you can select the required environment and the packages to be available at the execution locations.
Get Started with the Python Script Space
On KNIME Community Hub, you will find the Python Script Space, which contains example workflows for you to quickly learn how to use the Python Script node in your workflows. This space of examples is especially for KNIME users who are keen to use Python scripts inside KNIME.
For more details on coding with Python in KNIME, I recommend the following articles.
Conclusion
KNIME is a highly powerful and versatile tool for data science that has become increasingly popular in recent years.
With its user-friendly interface, extensive library of algorithms and extensions, and ability to integrate with programming languages like Python and R, KNIME is a complete solution for data scientists of all levels of expertise.
Whether you are working on a simple data preprocessing task or building complex deep learning models, KNIME has the tools you need to get the job done.
Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.
If you enjoy reading stories like these and want to support me as a writer, consider signing up to become a Medium member.
It’s $5 a month, giving you unlimited access to thousands of data science articles. If you sign up using my link, I’ll earn a small commission at no extra cost to you.
Follow me on Medium, LinkedIn or Twitter
and follow my Facebook Group “Data Science with Yodime”.