GETTING STARTED | DATA ENGINEERING | KNIME ANALYTICS PLATFORM

Why every Data Engineer should learn a Visual Programming Language like KNIME

Tip and tricks for data professionals

Dennis Ganzaroli
Low Code for Data Science

--

Learn a visual programming language like KNIME if you want to be successful in data engineering or even as a business analyst instead of just focusing on Excel and Python.

Fig 1: Yodime teaching a Visual Programming Language for Data Engineers (image by author).

The Excel-filter

I call it the Excel filter: More and more companies looking for a data engineer are filtering out applicants if they list Excel as a skill on their CV.

Fig 2: The Excel filter (image by author).

Excel has always been the standard for business analysts and for controlling professionals. But in our complex data world, where everything has to be analyzed with everything else, that’s no longer enough.

And for data engineers, the work has become even more demanding: many different data sources from different systems have to be combined quickly and efficiently and evaluated automatically. In the past, management was satisfied with weekly and monthly reports. Today, everything has to be available at least daily, preferably in real time, because time is even more money today than it used to be.

Spreadsheet programs like Excel are no longer enough to keep track of everything. Most things can be done with Python. But when it comes to collaborating with different team members with different expertise and documenting the work, things get very difficult with Excel or even Python.
Therefore, a different approach is needed.

Learning a Visual Programming Language

Learning a Visual Programming language (VPL) instead of Python will help you solve 99% of your Data engineering tasks. My staff works exclusively with KNIME and some SQL, but I would argue that they are much more efficient and productive than the Python specialists in our company. And they managed to get that far very quickly. When I took over the team, everyone could only use Excel. They learned KNIME in a course that lasted just 3 days, and only a few weeks later they were already productive.

Fig 3: KNIME a VPL for Data Engineering and Data Science (image by author).

Okay, not everyone is going to make it that fast. You must have a certain disposition. I think people who think in pictures have an advantage here.
Yes, because not everyone thinks visually. Marshall McLuhan, a mass media theorist, and later his son Eric McLuhan a communications theorist discovered in their researches that people think in either in words or in pictures. You will notice this difference very quickly when you present someone with a mind map, like this one below.

Fig 4: Mind map of Time Management (image by Jean-Louis Zimmermann).

People who think in pictures will feel more comfortable reading this map than people who think in words. People who think in words prefer a list like the following.

Fig 5: Photo by Torbjørn Helgesen on Unsplash.

They prefer a sequence of tasks that need to be completed rather than holistic connections between different topics. They are more action oriented.

So you realize very quickly that we are dealing with different personality types here as well.

Why is “Visual Programming” a language?

Most of the time we think of a language when it is in text form. But not every language is based on an alphabetical system. Linguists distinguish the written languages in different writing systems. Besides the alphabetical system, a large part of humanity also uses the logographic system. In logographic writing systems such as Chinese and Japanese, glyphs represent words or morphemes (meaningful parts of words) rather than phonetic elements.

Fig 6: Photo by qi xna on Unsplash.

The ancient Egyptian writing system, hieroglyphics, even contained a combination of logographic, syllabic and alphabetic elements.

In the history books we read that the hieroglyphs were still a precursor to our alphabetic writing system. But I wouldn’t be so sure. After all, they built the pyramids (or maybe someone else did?). But the writing system of the time was hieroglyphics, and the pyramids still stand and can still be admired by all as a masterpiece of engineering.

Fig 7: Photo by Jeremy Bezanger on Unsplash.

But why did I end up with the hieroglyphs now? Maybe because when my wife saw me working with KNIME for the first time, she said to me:

“What are you doing with these hieroglyphs?”

She intuitively understood that I was dealing with a writing system. A writing system for the ETL-process.

ETL (extract, transform, load) is a type of data integration that refers to the three phases (extract, transform, and load) that are used to combine data from various sources. Data is extracted from one or different source systems, transformed into an analyzable format, and loaded into a data warehouse or other system throughout this process.

Fig 8: The ETL-Process (image by author).

The best representation of an ETL pipeline is a visual workflow.
Judge it for yourself: Which of the following two representations is closest to the process above?

Fig 9: Python Code vs KNIME Workflow on how to predict covid19-cases (image by author).

The visual environment provides just the right amount of abstraction to build and share your work. Yes, because you will always be working with others and therefore need to be able to document and share your work.

Nowadays no data scientist or data engineer works alone anymore. We are all part of teams and we all need to communicate together. Discussion of the tasks, best practices, documentation are all necessary tasks in the daily work.
Documentation is an important part of collaboration, and therefore a common language is necessary to enable communication and understanding to both, data engineers and business stakeholders.

After you have learned to work with a VPL, you will sometimes hear the following story from your teammates:

“Hey, do you remember? Back when we used to work with Excel?”
Which translated means: The time when we still lived in the Stone Age.

Fig 10: Business Analyst working with Excel during the “stone age”.

The words of a Visual Programming Language

The KNIME Analytics Platform implements visual programming. This means that each data transformation step is represented by means of an icon block, called a node, in a graphical editor. Each node can perform one specific task. For example, in the figure above the “File Reader” node reads a csv file and the “Row Filter” node allows to filter rows based on a filter criterion.
So in the following example, we load the sales records and filter out only those from Germany. Then we write our selection into an Excel file.

Fig 10: Different nodes for every data transformation step (image by author).

A sequence of connected nodes is called a workflow and is the corresponding concept of a programming code e.g. in Python or other programming languages.

As shown above, the ETL process consists of three main basic components:
Input -> Transform -> Output

Fig 11: The three main basic components of an ETL-process (image by author).

The task of the input is to load different data source types into the processing tool. In KNIME there are several nodes available that allow to load data, both from a csv or an Excel file to any table from any database.

Fig 11: Data-Access nodes in KNIME (image by author).

At the output, the processed data is exported back to other databases, systems, or files. In KNIME we have the choice between exporting to a csv file, to an Excel file, writing the data to a database, or even to BI visualization tools like Tableau or Power BI for further analysis.

Fig 12: Write-Data nodes in KNIME (image by author).

The transformation process, on the other hand, can be divided into further sub-processes:

1. Transforming

Functions (Mathematical and Text functions):

You can create new columns in a data table by applying a function to existing columns. This function can be a mathematical or a text function. In KNIME there are several dedicated nodes for this purpose (see figure below). The result is a new column created from one or more columns.

Fig 12: Create new columns with functions (image by KNIME cheat sheets).

Filtering (Rows and Columns)

In KNIME there is no difference between filtering and removing rows, as the original table is not deleted and is still available at the output of the previous node. And that is a very important key point for reusability. You just need to adjust the filter node to your needs and the workflow is ready to use again. In our previous example: if you want to select sales records only of France instead of Germany, you only need to change the filter node and the data will be available again since nothing has been deleted.

Fig 13: Filter data in KNIME (image by KNIME cheat sheets).

2. Aggregate and Reshape

GroupBy

This function is known from SQL and is used to group together any rows of a column with the same value stored in them, based on a function specified in the configuration of the node. Generally, these functions are one of aggregate functions such as maximum() and sum(). For example, you have daily weight values from different people and want to display the maximum and average values for every person in a table.

Fig 14: Aggregate data with the GroupBy function (image by KNIME cheat sheets).

In KNIME we can choose from over 30 aggregation functions. This helps to get things done faster and more efficiently. And just like in any other node, the parameters can be adjusted and reused without having to start all over again.

Fig 15: Aggregate functions in KNIME (image by author).

Pivoting

The Pivoting node is a further enhancement of the GroupBy node. It is a combination of transposing the data with grouping by function. It is equivalent to the Pivot Tables function in Excel. Let’s take the example above with the weights of the people. Now you also want to report summary statistics for heart rate and amount of fat mass for every person. You could now split the people into different columns and output the summary statistics in each row, or you could display the people in each row and output the different categories in the columns.

Fig 15: Aggregate data with the Pivoting function (image by KNIME cheat sheets).

Since many users have initial difficulties with understanding this function, I’m also including an explanatory video below.

Fig 16: How to create a pivot table in KNIME (video by KNIMETV).

3. Combine Data

Join

This is one of the most important concepts in data manipulation and data engineering. And one of the concepts that have not been implemented well in spreadsheet software like Excel, and can never be implemented properly. And the reason is simple:

A spreadsheet does not have the right language to explain the ETL-process

Joining tables in KNIME is self-explanatory. Simply insert the Joiner node and select the desired joiner mode with the joining columns. The output of the node will be either an inner join, full outer join, right or left join (see figure below).

Fig 17: Joining tables based on different joiner modes (image by KNIME cheat sheets).

Concatenate

Concatenating two tables could not be easier in KNIME. The Concatenate Node combines two or even more tables into one.

Fig 18: Concatenating tables in KNIME (image by KNIME cheat sheets).

Via the following link you will find cheat sheets and tips and tricks for KNIME to give you a better overview:

They also provide a good documentation with a Quick Start guide here:

Or if you prefer tutorials on YouTube, check out their channel here:

Conclusion

Learning a visual programming language like KNIME gives you the foundation you need to work efficiently with data.

Not only will you learn to solve your tasks efficiently, but you will also be able to apply the necessary mindset through the visual structure of such a programming language.

Finally this is the best way to enter the world of data, acquire knowledge quickly and efficiently, and succeed in your daily work.

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn or Twitter
and follow my Facebook Group “
Data Science with Yodime

Yodime is that sort of genetic cross between a Star Wars guru (Yoda) and a data science guru (KNIME). It was introduced as a mentor to young aspiring data scientists in the article “If you want to become a data scientist, change hobbies

--

--

Dennis Ganzaroli
Low Code for Data Science

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.