GETTING STARTED | OUTLIER DETECTION | KNIME ANALYTICS PLATFORM

Outliers — How to identify and treat them using KNIME

There are several ways to identify and handle outliers, e.g. using the low-code KNIME software without writing a single line of code

Guilherme Marczewski
Low Code for Data Science

As first published on LinkedIn

A familiar scene for many analysts and data scientists: you connect to a database and start exploring it, and you notice that in a certain column (one containing the ages of several people, for example) some values look out of place. While most ages range from 5 to 80 years, a few people are listed as 112, 115, and 117 years old.

The first thing to do in this case is to make sure these entries are correct. Is there a typo? If you also have a column with the date of birth, does the difference between that date and the current day match the age column? Does the data make sense?
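Outside KNIME, the date-of-birth check described above can be sketched in a few lines of Python. This is only an illustration: the records, the fixed reference day, and the helper name age_on are hypothetical and not part of the original workflow.

```python
from datetime import date


def age_on(birth_date, today):
    """Return the age in whole years on the given day."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not happened yet this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical records: (date of birth, age stored in the age column).
records = [
    (date(1990, 5, 17), 34),
    (date(1909, 3, 2), 115),   # very old, but consistent with the birth date
    (date(1999, 8, 30), 112),  # inconsistent: probably a typo
]

# Fixed reference day so the example is reproducible (an assumption).
today = date(2024, 9, 1)

# Keep only the rows where the stored age disagrees with the computed one.
mismatches = [(dob, age) for dob, age in records if age_on(dob, today) != age]
print(mismatches)
```

Rows that end up in mismatches are candidates for correction rather than for outlier treatment, since the problem there is a data-entry error and not an extreme but valid value.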

The second thing is to visualize these values and check whether, relative to the rest of the dataset, they really are outliers. An excellent way to do this is with a box plot. The graph below displays the data distribution, its median, quartile values, and the lower and upper bounds.

Values that fall below the lower bound or above the upper bound can be considered outliers of our sample.
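To make the bounds concrete, here is a minimal Python sketch of one common convention (Tukey's rule): the lower bound is Q1 - k * IQR and the upper bound is Q3 + k * IQR, with k = 1.5. The multiplier, the quartile estimation method, and the sample ages are assumptions made for illustration; the settings of your KNIME nodes may differ.

```python
import statistics


def box_plot_bounds(values, k=1.5):
    """Return (lower, upper) bounds using the interquartile range (IQR)."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical age column: most values between 5 and 80, plus three extremes.
ages = [5, 18, 22, 25, 28, 30, 33, 35, 38, 41, 44,
        47, 50, 53, 56, 60, 64, 80, 112, 115, 117]

lower, upper = box_plot_bounds(ages)

# A value is flagged when it falls outside the two bounds.
outliers = [age for age in ages if age < lower or age > upper]
print(lower, upper, outliers)
```

With this sample, the 80-year-old stays inside the bounds while the three oldest entries are flagged, which mirrors the scenario described at the start of the article.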

There are several ways and tools to work with outliers. In this case we will use KNIME Analytics Platform, more precisely the following two nodes:

  • Box Plot node: This node displays the box plot with its bounds and quartiles. To use it, add it to the workflow and select in its settings the columns that you want to analyze. Then click “Execute and Open Views” to see the plot.
  • Numeric Outliers node: The previous node only displays the graph, whereas this node defines what to do with the outliers. Its options include removing the outliers, removing the non-outliers, and replacing the outlier values. We can also choose to treat only the outliers above the upper bound or only those below the lower bound.
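For readers who want to see the logic behind these options, the following Python sketch mimics them on a plain list. It is not the KNIME node's implementation: the function name treat_outliers, its parameters, the bounds, and the sample values are all hypothetical, and clamping to the closest bound is just one possible replacement strategy, chosen here for illustration.

```python
def treat_outliers(values, lower, upper, strategy="remove", side="both"):
    """Treat values outside [lower, upper].

    strategy: "remove" drops outliers, "keep_only" drops non-outliers,
              "replace" clamps outliers to the closest bound.
    side:     "both", "above", or "below" selects which outliers to treat.
    """
    def is_outlier(value):
        below = value < lower and side in ("both", "below")
        above = value > upper and side in ("both", "above")
        return below or above

    if strategy == "remove":
        return [v for v in values if not is_outlier(v)]
    if strategy == "keep_only":
        return [v for v in values if is_outlier(v)]
    if strategy == "replace":
        # Clamp each treated outlier to the bound it crossed.
        return [min(max(v, lower), upper) if is_outlier(v) else v
                for v in values]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical data and bounds; -40 stands in for a data-entry error.
ages = [-40, 5, 30, 60, 80, 112, 115, 117]
lower, upper = -15.0, 105.0

removed = treat_outliers(ages, lower, upper, strategy="remove")
only_outliers = treat_outliers(ages, lower, upper, strategy="keep_only")
replaced = treat_outliers(ages, lower, upper, strategy="replace")
removed_above = treat_outliers(ages, lower, upper, strategy="remove", side="above")
print(removed, only_outliers, replaced, removed_above)
```

Keeping only the outliers is useful when the goal is to inspect the suspicious rows, while removal and replacement are the options that change the dataset passed on to later steps.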

After this treatment, we will have a more homogeneous dataset. Bear in mind, though, that in some cases outliers can or must be kept because they are meaningful for the analysis.
