GETTING STARTED | VIOLIN PLOT | KNIME ANALYTICS PLATFORM

“The beautiful Violin Plot that has it all” — Create Graphics with R and KNIME Components

TL;DR: KNIME, R and ggplot2 come together to create a powerful graphic that you can quickly adapt with the help of KNIME components — just start with the violin plot workflow (now also for KNIME 5)

Markus Lauber
Low Code for Data Science
5 min read, Feb 21, 2023


KNIME has many powerful nodes for creating graphics, but sometimes that just isn’t enough. As a low-code platform, it also lets you use other tools, such as the popular R language.

To learn how to install R with KNIME, you can check out my Medium article “KNIME and R — installation across operating systems — some remarks”.

With the help of KNIME components and the R package ggplot2 you can create a violin plot with a lot of additional statistics in one chart. The KNIME component makes it easy to configure the options even if you are not familiar with R code.

Violin plots are very effective at showing the distribution of a numeric variable and comparing it across different groups. The width of a violin at a given height represents how many cases have the corresponding value on the y-axis. A similar kind of display is widely used, for example, as a population pyramid.

A violin plot with some statistics
“The beautiful (Violin-)Plot that has it all” (https://hub.knime.com/-/spaces/-/latest/~DZP837sduxJ8yC-t/).

In this case we see CPU usage in %, measured in time intervals over one month (on the y-axis) and compared between two servers (on the x-axis). The question is which server is more heavily used and how the usage is distributed across the usage levels. Server 1 on the left is more heavily used, and its usage has two concentrations, around the 40% mark and the 20+% mark, so a simple mean or even median value might be misleading.
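To see why a single summary value can mislead for such a two-peaked distribution, here is a minimal sketch in base R. The numbers are simulated for illustration only and are not the server data shown in the plot; the variable names are made up.

```r
# Simulated CPU usage in percent (NOT the data from the article):
# two clusters of readings, one around 20 % and one around 40 %.
set.seed(42)
usage <- c(rnorm(500, mean = 20, sd = 3), rnorm(500, mean = 40, sd = 3))

usage_mean   <- mean(usage)
usage_median <- median(usage)

# Share of readings that lie within +/- 3 percentage points of the mean
share_near_mean <- mean(abs(usage - usage_mean) <= 3)

cat("mean:", round(usage_mean, 1), "\n")
cat("median:", round(usage_median, 1), "\n")
cat("share of readings near the mean:", round(share_near_mean, 3), "\n")
```

The mean lands between the two clusters, in a range where hardly any readings actually occur. A violin plot makes this visible at a glance.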

The R package ggplot2 offers a lot of visualisations and configurations that you could combine into one chart.

I composed and collected the code; the basic version of the component was created by M. Schmid. I have expanded it to give you access to a lot of useful statistics that describe the two data sets.

But in the end, you might as well just compare the shape of the violins to get an idea about the differences and the structure of the numbers or as D. Paurat would put it: “the eyes have it”.

If you look at the stats you have the classic boxplot statistics like quartiles, mean, median; you have extremes and information about the deviation of the values.

Violin plot with statistics and some explanations of what they mean
“The beautiful (Violin-)Plot that has it all” (https://hub.knime.com/-/spaces/-/latest/~DZP837sduxJ8yC-t/).

Using the component is quite simple. You need a numeric column (y) that you want to explore and a column with categories (x) whose groups you want to compare (or you can just show a single group).

KNIME 4 Workflow Violinplot (https://hub.knime.com/-/spaces/-/latest/~DZP837sduxJ8yC-t/).

The items in the plot can be configured in the menu of the component (just right-click): the columns to compare, the titles (of the axes) and the text sizes. One basic setting is how the shape of the violins should reflect the number of items. My standard is to have the shapes proportional to the number of cases (“count”). But a legitimate case can be made for each of these settings (if you focus on the comparison of the shapes):

  • ‘count’ (default), areas are scaled proportionally to the number of observations
  • ‘area’, all violins have the same area
  • ‘width’, all violins have the same maximum width
violin plot settings
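These three options correspond to the values accepted by the `scale` argument of `geom_violin()` in ggplot2 (where, unlike in the component, the default is "area"). A minimal sketch, assuming made-up data with the hypothetical columns `server` and `usage`; it is not the component's actual code:

```r
library(ggplot2)

# Hypothetical example data: two servers with different numbers of readings
set.seed(1)
cpu <- data.frame(
  server = rep(c("Server 1", "Server 2"), times = c(600, 200)),
  usage  = c(rnorm(600, mean = 40, sd = 8), rnorm(200, mean = 25, sd = 5))
)

# Build the same violin plot once per scaling option
make_violin <- function(scale_option) {
  ggplot(cpu, aes(x = server, y = usage)) +
    geom_violin(scale = scale_option) +
    labs(title = paste("scale =", scale_option),
         x = "Server", y = "CPU usage in %")
}

scale_options <- c("count", "area", "width")
plots <- lapply(scale_options, make_violin)
names(plots) <- scale_options

# print(plots$count)  # uncomment to display one of the variants
```

With "count" the violin of the server with fewer readings comes out visibly slimmer, while "area" and "width" put the emphasis purely on the shape.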

The graphic is exported as a PNG file for further use.

If you are interested in the pure R code inside, here are a few hints:
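If you want to reproduce the PNG export outside of the component, one common way in ggplot2 is `ggsave()`. The following is only a sketch under assumptions: the built-in `iris` data stands in for your own table, and the file name and image dimensions are made up. How the component itself writes the file may differ.

```r
library(ggplot2)

# Built-in data set as a stand-in for the KNIME input table
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(scale = "count")

# Hypothetical output location; replace with a path of your choice
png_file <- file.path(tempdir(), "violin_plot.png")

# Width, height and dpi are assumed values, not the component's settings
ggsave(filename = png_file, plot = p,
       width = 8, height = 6, units = "in", dpi = 150)
```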

Most of the ‘magic’ lies in applying custom “fun.data” functions to the data behind the plot’s aesthetics (https://ggplot2.tidyverse.org/reference/stat_summary.html). Each function returns the value formatted with decimal and thousands separators, an indicator like “MAX=”, and a position where the label should be placed. The label is deliberately placed slightly off the value itself so as not to block any information.

If you decide to have the number of missing values displayed, the label will be placed just above the top of the box (the 75% quantile) and will be ‘styled’ with the help of prettyNum().
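As an illustration of what such a function can look like, here is a minimal base R sketch. The name `max_label`, its arguments and the 2 % offset are assumptions made for this example and are not taken from the component:

```r
# Hypothetical helper in the spirit of the component's label functions:
# returns the position (y) and the text (label) for a "MAX=" label.
max_label <- function(x, big_mark = ",", decimal_mark = ".") {
  value <- max(x, na.rm = TRUE)
  # Shift the label by 2 % of the value range so it does not cover the point
  offset <- diff(range(x, na.rm = TRUE)) * 0.02
  data.frame(
    y     = value + offset,
    label = paste0("MAX= ", format(round(value, 1), nsmall = 1,
                                   big.mark = big_mark,
                                   decimal.mark = decimal_mark))
  )
}

res <- max_label(c(1200.5, 980.25, 1534.6))
print(res)
```

A function of this shape can be handed to `stat_summary(fun.data = ...)` together with `geom = "text"`, which then uses `y` for the position and `label` for the text.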

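# Label with the number of missing values, placed just above the 75% quantile.
# num_percent(), v_big_mark and v_decimal_mark are assumed to be defined elsewhere in the script.
n_observations_missing <- function(x) {
  # na.rm = TRUE keeps quantile() from failing when x contains missing values
  y_pos <- quantile(x, 0.75, na.rm = TRUE) + (num_percent(x) * 3)
  if (knime.flow.in[["displayObservations"]] > 0) {
    n_missing <- prettyNum(sum(is.na(x)), big.mark = v_big_mark, decimal.mark = v_decimal_mark)
    return(data.frame(y = y_pos, label = paste0("nmiss= ", n_missing)))
  }
  # display switched off: keep the position but show an empty label
  data.frame(y = y_pos, label = "")
}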
R function using pretty numbers

Also, the result will be moved slightly to the left of the middle. All positions are generated dynamically and shift according to the incoming values. So, if everything works as intended, you do not have to worry about the position of your stats and labels. You can also nudge a label slightly so that texts and labels do not overlap:

stat_summary(fun.data = n_observations_missing, geom = "text", position=position_nudge(x=-0.2), size = text_size, show.legend = FALSE) +
sample R code
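To show how such a line fits into a complete plot, here is a small self-contained sketch. The built-in `iris` data, the function `median_label`, its 3 % offset and the value of `text_size` are assumptions for illustration and are not the component's code:

```r
library(ggplot2)

# Hypothetical label function: median value, placed slightly above the median
median_label <- function(x) {
  data.frame(
    y     = median(x) + diff(range(x)) * 0.03,
    label = paste0("MEDIAN= ", format(round(median(x), 1), nsmall = 1))
  )
}

text_size <- 3  # assumed value; the component exposes text sizes as settings

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(scale = "count") +
  geom_boxplot(width = 0.1) +
  # The label is computed per group and shifted 0.2 units to the left
  stat_summary(fun.data = median_label, geom = "text",
               position = position_nudge(x = -0.2),
               size = text_size, show.legend = FALSE)

# print(p)  # uncomment to display the plot
```

Because the label position is computed from the data of each group, the text follows the values automatically; `position_nudge()` only adds a fixed horizontal shift on top of that.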

The R code is also provided in the “/data/” subdirectory of the workflow (kn_example_r_violinplot_ggplot2.R).

Some additional links you might like:

If you enjoyed this article you can follow me on Medium (https://medium.com/@mlxl) and on the KNIME forum (https://forum.knime.com/u/mlauber71/summary) and hub (https://hub.knime.com/mlauber71).

Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry