GETTING STARTED | VISUAL PROGRAMMING | KNIME ANALYTICS PLATFORM

Why KNIME?

From extensive coverage of data science techniques and effective deployment to a thriving support community

Rosaria Silipo
Low Code for Data Science
13 min read · Jul 14, 2020


I wrote this preface for a machine learning book. I thought it could be of interest to more people than just the readers of the book, so I am sharing it here. In the preface, I was supposed to describe the advantages of adopting the KNIME software for your data science projects.

---

Mind map of KNIME software features, provided by Vijaykrishnan Venkataram.

The success of data scientists at work depends largely on the tool they rely on.

Mathematical knowledge of the algorithms, experience with the most effective techniques, and domain wisdom are all basic, important, necessary ingredients for the success of a data science project. However, there are other, more contingent factors that also influence the final impression left on the stakeholders.

Clearly, every project has a deadline, often a stringent one that does not leave much margin for a trial-and-error approach. We need to implement a solution in a short time and make sure that it is correct. We need to experiment quickly with different techniques in order to find and adopt the best procedure for the project. Of course, every project has a budget too. The fast implementation of the solution is often further constrained by a limited budget.

Some projects are quite complex and require specialized techniques, beyond the classic general-purpose machine learning algorithms. Sometimes we are forced to learn new techniques and new algorithms on the spot, that is, on the project, and, given the deadline, the learning must happen over a short time. In this case, the less time the tool forces us to spend on theory and math before becoming productive, the faster the learning will proceed.

Finally, the passage from prototype to production must, again, be as fast and as secure as possible. We can run no risk of degrading parts of the solution while moving it into production and exposing it to a different set of less expert users. If required, a range of solutions, with varying degrees of interactivity, should be made available to the larger public.

As you can see, many contingent factors contribute to the success of a data science project beyond the math, the experience, and the domain knowledge: ease of learning, speed of prototype implementation, debugging and testing options to ensure the correctness of the solution, flexibility to experiment with different approaches, availability of help from external contributors and experts, and finally automation and security capabilities. All of these contingent factors depend heavily on the tool the data scientists are using.

Project Constraints: Time and Money

KNIME Analytics Platform is open-source software for all your data needs. It is free to download from the KNIME website and free to use, it covers all the main data wrangling and machine learning techniques, and it is based on visual programming.

The implications of being open source and free to use are self-explanatory: fewer legal licensing headaches and a smaller impact on the project budget.

The impact of visual programming might require a few more words of explanation. Visual programming has become quite popular in recent times, and it aims at partially or completely replacing the practice of coding. In visual programming, a Graphical User Interface (GUI) guides you through all the steps necessary to build a pipeline (workflow) of dedicated blocks (nodes). Each node implements a given task; each workflow of nodes takes your data from the beginning to the end of the designed journey. A workflow replaces a script; a node replaces one or more script lines.
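To make the analogy concrete, here is a minimal, purely illustrative Python sketch of a script that a three-node KNIME workflow (a CSV Reader, a Row Filter, and a GroupBy node) would replace; the file name and column names are invented for the example.

```python
import pandas as pd

# CSV Reader node: read the raw data (file and column names are hypothetical)
df = pd.read_csv("sales.csv")

# Row Filter node: keep only the rows matching a condition
df = df[df["amount"] > 0]

# GroupBy node: aggregate the filtered rows by a grouping column
summary = df.groupby("region", as_index=False)["amount"].sum()

print(summary)
```

In the visual version, each of these steps becomes a configurable node, and the intermediate table can be inspected after each one.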

In KNIME Analytics Platform, nodes are created by drag&drop (or double-click) from the Node Repository into the workflow editor in the central part of the KNIME workbench. Node after node, the pipeline is quickly built, configured, executed, inspected, and documented.

Figure 1. KNIME Analytics Platform. From the top left corner: the KNIME Explorer to store the workflows, the Workflow Coach for node recommendations, the Node Repository as the node storage, the Outline for the workflow overview, the Console for the errors, and the Description panel to learn more. In the middle is the workflow editor, where the KNIME workflow is assembled.

Visual programming is a key feature of KNIME Analytics Platform for quick prototyping. It makes the tool very easy to use. Producing a few different experimental prototypes before deciding the final direction of the project is fast and quite straightforward. The ease of implementation frees up time to think more deeply about possible theoretical alternatives to the current solution.

The Learning Curve

Visual programming also makes learning much faster than with code-based tools.

Data science is now used in more or less all disciplines, including the humanities, languages, life sciences, economics, and other unexpected areas of human knowledge. Not all scientists are expert programmers, and not all scientists have the spare time to become expert programmers. A GUI-based tool can be learned and applied in less time than a code-based tool, again freeing up precious time and resources for more important investigations.

Also, for students preparing to become the future scientists in the humanities, languages, life sciences, economics, or other academic disciplines, a GUI-based tool might free up more time for study and research. Too often I have seen entire months dedicated to learning to code before any data analysis technique is even approached. With KNIME Analytics Platform, within a few weeks you can already assemble quite complex workflows for data transformation and for training machine learning algorithms.

Plenty of educational resources all over the web, and especially on the KNIME site, of course, help speed up the learning even more. Starting from the generic LEARNING page on the KNIME site, you can move on to instructor-led or completely self-paced courses, all leading to a possible certification. You can also decide to learn by yourself with just the help of a book, like this one.

Another massive help for beginners comes from the KNIME Hub, which brings us to another great strength of KNIME Analytics Platform: the KNIME community.

The KNIME Community

The KNIME Hub is the public repository of the KNIME community. Here you can share your workflows and download workflows by other KNIME users. Just type in your keywords and you will get a list of related workflows, components, extensions, and more. It is a great place to start, with plenty of examples! For example, type “basic” or “beginners” into the search box and you will get a list of example workflows illustrating basic concepts in KNIME Analytics Platform; type “read file” and you will get a list of example workflows illustrating how to read CSV files, .table files, Excel files, etc. Notice that a subset of these example workflows is also available on the EXAMPLES server in the KNIME Explorer panel in the top left corner of the KNIME workbench.

Once you have found the workflow of interest, click on it to open its page, and then download it or open it in your own KNIME Analytics Platform installation. Once it is in your local workspace, you can start adapting it to your data and your needs. Following the popular trend in programming of searching for ready-to-use pieces of code, you can simply download, reuse, and adapt workflows or pieces of workflows from the KNIME Hub to your own problem.

Of course, you can also share your work on the KNIME Hub for the public good. Just copy the workflows you want to share from your local workspace into the My-KNIME-Hub/Public folder in the KNIME Explorer panel within the KNIME workbench.

Figure 2. Resulting list of workflows from a search for “read file” on the KNIME Hub.

The KNIME community does not stop at the KNIME Hub. It is also very active with tips and tricks on the KNIME Forum. Here you can ask questions or search for previous answers; it is highly likely that somebody has already asked your question.

Finally, innovative content produced by KNIMErs is available as posts on the KNIME Blog, as books in the KNIME Press, and as videos on the KNIME TV channel on YouTube.

Note. Since June 2021, KNIME has had its own publication on Medium: Low Code for Advanced Data Science. We publish content on successful data stories, data science theory, tips and tricks to get you started with KNIME Analytics Platform, and more. Best of all, it collects articles written by the community for the community. Find out how to contribute and share your stories here.

Correctness and Flexibility

Easy, easy, easy, but can I ensure it is correct? Is it flexible enough to experiment with alternative procedures? This is indeed a key question, because in many of the software tools available nowadays, “easy” comes at the expense of “control” over correctness and of “flexibility” in the choice of alternative options.

Automated Machine Learning has become very popular in recent times. It carries the promise of taking your data and spitting out results without you even lifting a finger. As fascinating as this option might sound, together with the promise it also carries some risks.

First of all, it works as a black box. Its decision process is not transparent. When I feed the box with my data, I have to completely trust the machine that the data analysis process is correct, fits the data characteristics and distributions, and is tuned to the problem I wish to solve. It is essentially a trust exercise, like in those psychology support meetings. I prefer to make sure personally that all steps in the analysis are implemented correctly, according to the original design of the application. When executing a node in KNIME Analytics Platform, you can always inspect the output results by right-clicking the node and selecting the last option in the context menu. This is a way of debugging the application step by step and making sure that all steps work as required by design.

Secondly, the world is never ideal, and neither are the data. Automated analysis might work well on perfect data, with no unbalanced classes, no outliers, no dirty records, etc. This happens sometimes, mainly in toy datasets. In real life, there are always adjustments to make for the unbalanced classes, the dirty data, the missing values, the inconsistencies, and so on. Adding one cleaning step rather than another, or introducing an optimization cycle on one parameter rather than another, might change the final results of the workflow. In order to experiment with new strategies, it is imperative that the tool be flexible and customizable at each step of the analysis. A black-box automated approach does not allow for much flexibility: it does not let you change intermediate steps in the analysis, customize features, or tune parameters. KNIME Analytics Platform, on the other hand, is modular and covers data wrangling operations and machine learning algorithms extensively enough to allow exactly this kind of flexibility. It lets you exchange one data manipulation node for another, introduce an optimization loop, or change the value of a parameter in the training of a machine learning model.

One small note before digging deeper into the “extensive coverage of data wrangling operations and machine learning algorithms”. If you open an old workflow, i.e., a workflow developed with an older version of KNIME Analytics Platform, you will probably find “deprecated” or “legacy” nodes. Their use is discouraged, since newer, better nodes have been developed in the meantime. However, deprecated and legacy nodes still exist and still work. Indeed, backward compatibility with previous versions is a key feature of KNIME Analytics Platform. Only by ensuring that old workflows still work exactly as they were designed can we guarantee the reproducibility of the results. As we all know, reproducibility is extremely important for the validation of results in scientific research.

Extensive coverage of Data Science techniques

Another necessary complement to ease of use is the coverage of data science techniques. Without extensive coverage of the commonly and less commonly used data wrangling techniques, machine learning algorithms, and data types and formats, and without integration with the most widely used database software, data sources, reporting tools, and other scripts and languages, ease of use would be of limited value.

Let’s start with the machine learning algorithms. KNIME Analytics Platform covers most machine learning algorithms: from decision trees to random forests and gradient boosted trees, from recommendation engines to a number of clustering techniques, from Naïve Bayes to linear and logistic regression, from neural networks to deep learning. Most of these algorithms are native to KNIME Analytics Platform; some are integrated from other open-source scripting tools. Specifically, deep learning layers, units, and pre-configured architectures are available through the KNIME Deep Learning — Keras integration. This integrates the Keras libraries within KNIME Analytics Platform and offers them to the user via the familiar KNIME GUI. In this way, it is possible to drag&drop nodes to create layers of neurons for complex neural architectures and to train the final network without writing code.
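For reference, here is a minimal Keras sketch of the kind of small network that such a chain of layer nodes assembles visually; the layer sizes and the binary classification setup are invented for illustration, not taken from any specific KNIME example.

```python
from tensorflow import keras

# A small binary classifier, roughly what a chain of Keras layer nodes
# would define in the KNIME GUI (all sizes are hypothetical)
model = keras.Sequential([
    keras.Input(shape=(10,)),                     # 10 input features
    keras.layers.Dense(32, activation="relu"),    # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # output layer
])

# In KNIME, a learner node would configure and run the equivalent of this step
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

With the Keras integration, each of these layers corresponds to a node you drag into the workflow, and the training is configured in a dialog rather than in code.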

A very large number of nodes is also available to implement a myriad of data wrangling techniques. By combining nodes dedicated to small tasks, you can implement even very complex data transformation operations. Assembling such operations is so easy that KNIME is often used to prepare data for other tools, for example to generate reports or to populate a data warehouse.

KNIME Analytics Platform also connects to most of the required data sources: from databases to cloud repositories, from big data platforms to files. So, no worries here. However exotic your data sources and data types are, it is likely that you can connect to them from within KNIME Analytics Platform.

What if all that is not enough? What if I need a specific procedure for DNA analysis or molecule conversion? What if I need a specific network manipulation function from Python? What if I need to export my results into a Tableau report? Where KNIME Analytics Platform cannot reach, there is usually at least one third-party extension from the KNIME community providing the missing nodes for that particular domain or task. Where KNIME Analytics Platform and its extensions cannot reach, there are the integrations with other scripting and programming languages, such as Python, R, Java, and JavaScript, to mention a few.
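As an example of how such an integration looks in practice, here is a minimal sketch of the body of a Python Script node, assuming the legacy scripting API in which KNIME exposes the input port as a pandas DataFrame named input_table; the column name is invented for illustration.

```python
# Inside a (legacy) KNIME Python Script node: input_table is injected by
# KNIME as a pandas DataFrame, so no explicit reading step is needed
df = input_table.copy()

# Custom logic not available as a native node, e.g. a z-score transformation
# (the "amount" column is hypothetical)
df["amount_zscore"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()

# Whatever is assigned to output_table is passed on to the node's output port
output_table = df
```

The rest of the workflow sees the result as an ordinary KNIME table, so visual nodes and scripted steps can be mixed freely.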

KNIME Analytics Platform has a seamless integration with the BIRT reporting tool. Since BIRT is also open source, its integration fits well with the KNIME Analytics Platform philosophy. However, integrations with other reporting platforms, like Tableau, QlikView, Power BI, and Spotfire, are also available. In those cases, though, you would need to buy and install the destination reporting software.

However, it is not even necessary to move to an external reporting tool to visualize your data or your results. A number of JavaScript-based nodes are available in the Views/JavaScript category in the Node Repository. These nodes implement data visualization plots and charts: from a simple scatter plot to a more complex sunburst chart, from a simple histogram to a parallel coordinates plot, and more. These nodes seem simple but are quite powerful. If you combine them within a component, interactive selection of data points across multiple charts is automatically enabled. The component inherits and combines the views of the contained nodes and connects them so that points selected in one chart are also selected and highlighted in the other charts of the component’s composite view.

Data Science in the Enterprise

One last step is deployment into production and, in the case of an enterprise, easy, comfortable, secure deployment. This is the last step in the chain of actions in a data science project. When the workflow is finished, the model (if any) is trained, and the performance is measured, we need to kick the application out into the real world to deal with real-life data. This process of exposing the application to the real world is called moving into production. The process of including the model in this final application is called deployment. The two phases are deeply connected and can be quite problematic, since all errors in the application design show up at this stage.

It is possible, though in a limited way, to move an application into production using KNIME Analytics Platform alone. If you, as a lone data scientist or a data science student, do not regularly deploy applications and models, KNIME Analytics Platform is probably enough for your needs. However, if you operate in more of an enterprise environment, where scheduling, versioning, access rights, scalability, easy deployment, security, model monitoring, auditing, and all the other typical functions of a production server are needed, then using KNIME Analytics Platform alone for production can be cumbersome.

In this case, the KNIME Server, which is not open source but sold for an annual license fee, can make your life easier. First of all, it fits the governance of the enterprise IT environment better. Then, it offers a protected collaboration environment for your group and the whole data science lab. And of course, its main advantage consists of making model deployment and the move into production easier and safer, for example through the integrated deployment feature and the one-click move into production. End users can then run the application from a KNIME Analytics Platform client or, even better, from a web browser.

Remember those composite views offering interactive, interconnected views of the selected points? They become fully formed web pages when the application is executed in a web browser via the KNIME Server’s WebPortal. By using components as touchpoints within the workflow, we get a guided analytics application in the web browser. Guided analytics inserts touchpoints into the flow of the application for the end user to consume in the browser. The end user can take advantage of these touchpoints to inject knowledge or preferences and steer the analysis in the desired direction.

Summary and Conclusions

I hope I have convinced you by now that KNIME Analytics Platform is an extremely good choice as a data science tool to support you in your future career as a data scientist.

Just remember the ease of use that will save you time and allow you to dedicate more of your brain and energy to research topics more important than programming.

Remember the choice of open-source software, with the price (free) and the community that come with it. The community is especially helpful for tips and tricks, example workflows, new nodes and extensions, and smart advice in general.

Remember the debugging capability to check the correctness of the implemented operations, and remember that each node performs an atomic task, which makes it easy to experiment with different analysis strategies. Easy prototyping is indeed one of the best features of KNIME Analytics Platform, allowing quick evaluation of and experimentation with new techniques.

The large coverage of machine learning algorithms, data wrangling techniques, and accessible data sources makes KNIME Analytics Platform a very reliable yet easy-to-use tool. If the native nodes and extensions are not enough, community extensions, integrations with other scripting and programming languages, and integrations with reporting tools can make up for any missing domain-specific functionality.

Finally, KNIME Server, as the complement to KNIME Analytics Platform for the enterprise, allows for easy and secure deployment and seamless integration into the company’s IT environment.

With all of these arguments and their lengthy descriptions, I hope I have by now convinced you to download KNIME Analytics Platform and to give it a try!


Rosaria Silipo
Low Code for Data Science

Rosaria has been mining data since her master’s degree, through her doctorate and the job positions that followed. She is now a data scientist and KNIME evangelist.