GETTING STARTED | VISUALIZATION | KNIME ANALYTICS PLATFORM

Data Visualization

Data Exploration vs Result Presentation

Rosaria Silipo
Low Code for Data Science
7 min read · Sep 28, 2021


Photo by Luke Chesser on Unsplash

Data visualization is an important part of data science, and it is used in two main phases of the data science cycle: at the beginning, during the initial data exploration, and at the end, during the presentation of results. Even though the visualization techniques are the same, these two phases have different goals. Data exploration starts from ignorance and tries to understand the data, to discover hidden facts, patterns, or outliers. Result presentation starts from knowledge and tries to communicate the message in the clearest and most effective way possible. Thus, even though they share the same techniques, the two phases differ in goal and starting point.

Data Exploration

When performing data exploration, we are on a journey.

If we do not have any previous knowledge, we might start with a histogram or a group of histograms to check the value ranges, discover the mean value and the variance, spot odd outliers, and check for skewness and normality. A classic example: an age distribution should span a [0, 100] range. The age histogram reported in Fig. 1 below, however, stretches over a [0, 200] range. When we check the data for ages exceeding 100 years, we find two data points in the dataset with age = 200, probably introduced by mistake or by some jokester. A common practice is to remove such points. Notice that such outliers, if numerous, can affect the statistics of the dataset, altering the mean value, the variance, and all other statistical parameters.

Fig. 1. Age histogram of a dataset. Notice the range [0, 200] years old (obviously incorrect).
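For readers who prefer code to nodes, this first step can be sketched in a few lines of Python with pandas and matplotlib. This is only a minimal sketch, not the workflow behind Fig. 1; the file name and the "age" column are hypothetical stand-ins for whatever dataset is being explored.

```python
# Minimal sketch of the first exploration step, assuming a CSV file
# with an "age" column (both the file name and the column are hypothetical).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("adult_data.csv")

# Histogram of age to check the value range (cf. Fig. 1).
df["age"].plot(kind="hist", bins=20, title="Age distribution")
plt.xlabel("age")
plt.show()

# Inspect the implausible records (age > 100) and remove them.
print(df[df["age"] > 100])
df_clean = df[df["age"] <= 100]
```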

A second exploration step consists in the calculation of descriptive statistics, such as the mean value, the variance, the number of missing values, the skewness, the kurtosis, and other measures for each feature of the dataset. This gives us an idea of the data contained in each feature. We might, for example, discover that the age feature, after removing the jokester data, has a minimum value of 17 and a maximum value of 90, which is now plausible. An average age of 38.582 speaks for a relatively young population; a standard deviation of 13.64 describes a relatively wide distribution around that average; and a skewness of 0.559 indicates a slight asymmetry toward higher ages. All of this is confirmed, of course, by the histogram drawn on the clean data.

Fig. 2. Descriptive statistics on the age feature in the dataset.
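The same measures can be collected in code as well. The sketch below continues from the cleaned data of the previous snippet, again with hypothetical file and column names.

```python
# Descriptive statistics on every numeric feature of the cleaned dataset.
import pandas as pd

df = pd.read_csv("adult_data.csv")            # hypothetical file name
df_clean = df[df["age"] <= 100]               # jokester records removed

stats = df_clean.describe().T                 # count, mean, std, min, max, quartiles
stats["missing"] = df_clean.isna().sum()
stats["skewness"] = df_clean.skew(numeric_only=True)
stats["kurtosis"] = df_clean.kurtosis(numeric_only=True)

# For the age feature this should report values like those in Fig. 2
# (min 17, max 90, mean ~38.6, std ~13.6, skewness ~0.56).
print(stats.loc["age"])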

So far, we have explored numerical features. The next step then could be to explore nominal features. For example, our dataset contains men and women. How are they distributed? How many men and how many women? That is where a pie chart or a bar chart may come in handy. Here we see that we have more men than women in the dataset. Let’s keep this in mind.

Fig. 3. Women and men distribution in the dataset.
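In code, counting and plotting a nominal feature takes only a couple of lines; the "sex" column name below is, again, a hypothetical stand-in.

```python
# Distribution of a nominal feature as a bar chart (use kind="pie" for a pie chart).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("adult_data.csv")            # hypothetical file name
counts = df["sex"].value_counts()
print(counts)                                 # how many men, how many women

counts.plot(kind="bar", title="Men and women in the dataset")
plt.ylabel("count")
plt.show()
```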

What about considering two features together? For example, age and number of working hours. Is it true that with age we tend to work more? Is this also reflected in this dataset? A scatter plot is the easiest way to visually explore the relationship between two variables. Notice that the influence of a third variable can be explored by adding a color map to the plot. In the plot in Fig. 4 below, we see that people between 25 and 50 years old tend to work more than very young people (maybe students) and older people (maybe retirees).

Fig. 4. Scatter plot describing the relationship between age and number of working hours.
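A scatter plot with a color-coded third variable might look as follows in Python. All column names here ("hours-per-week", and "education-num" as the color-mapped third feature) are hypothetical placeholders.

```python
# Scatter plot of age vs. weekly working hours, with a third numeric feature
# mapped to color. All column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("adult_data.csv")
df_clean = df[df["age"] <= 100]

df_clean.plot(kind="scatter", x="age", y="hours-per-week",
              c="education-num", colormap="viridis", alpha=0.5,
              title="Age vs. working hours per week")
plt.show()
```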

We could also explore the relationship between sex and number of working hours per week, again via a scatter plot, only to discover that in this dataset only men work more than 80 hours per week and that in general men work more hours per week than women.

We could then explore the relationship among multiple variables by visualizing the data, for example, via a parallel coordinates plot or a sunburst chart. We could explore the correlation between pairs of features, pair by pair, in a correlation map.
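Both of these views can also be sketched with pandas and matplotlib. The snippet below draws a simple correlation map over the numeric features and a parallel coordinates plot over a small, hypothetical subset of columns.

```python
# Correlation map over all numeric feature pairs, plus a parallel coordinates
# plot over a small, hypothetical subset of columns.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.read_csv("adult_data.csv")
df_clean = df[df["age"] <= 100]

corr = df_clean.corr(numeric_only=True)
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="Pearson correlation")
plt.show()

parallel_coordinates(df_clean[["age", "hours-per-week", "education-num", "sex"]],
                     class_column="sex", colormap="tab10")
plt.show()
```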

So far, our data exploration journey has included a table with the basic statistical measures, histograms, scatter plots, heatmaps, sunburst charts, correlation maps, and more. We have also discovered a few interesting facts: for example, that two jokesters inserted wrong data, that we have more men than women in our dataset, and that men work more hours per week than women do. Notice that this journey could have also included time plots (if time were a dimension in the data), text visualizations (if some features were pure text), graph visualizations, and more.

Key here is the interactivity of the dashboard, i.e., the possibility of experimenting with different features and of isolating groups of data points for closer observation across all visual items.

To summarize, data exploration means that we do not know what we are looking for; we simply probe the data each time for clues about its structure or about the relationships among its features. Classic, but not exhaustive, questions could be:

  • What are the distributions of the dataset features? Could they be considered Gaussian?
  • What are the ranges of the features? Are they already normalized?
  • How many missing values are there? Are there too many in one feature for it to carry useful information?
  • If we set one variable as the final target, let’s say “income”, can we define a class system on this target variable? Are the classes equally distributed?
  • What are the relationships between the dataset features and the classes? Is there any feature that can help identify one or more of the classes?
  • What is the role of time? Do data change over time?

Those are some of the most basic questions to ask the plots, charts, and tables during the data exploration phase. More questions will pop up throughout the journey depending on the problem, the data, the data types, and so on.
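As a small illustration, a couple of the questions above, the count of missing values per feature and the class balance of a target such as income, can be answered directly in code. The "income" column is a hypothetical target name.

```python
# Quick, code-level answers to two of the questions above.
import pandas as pd

df = pd.read_csv("adult_data.csv")             # hypothetical file name

# How many missing values, and are they concentrated in one feature?
print(df.isna().sum().sort_values(ascending=False))

# If "income" is the target, are its classes equally distributed?
print(df["income"].value_counts(normalize=True))
```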

Fig. 5. Interactive Data Exploration using a scatter plot and a data table.

Result Presentation

The phase of result presentation comes at the end of the project.

Here we already know the results and the message we want to get across. The goal is to convey that message in the clearest, and sometimes most impressive, way possible. Here graphics must be simple, to the point, and not deceptive. This is the realm of bar charts and pie charts and all those charts that compare revenues, KPIs, and other amounts between the previous (now discarded) solutions and the current winning solution.

In a sense, visualization here is easier, because we just want to get the message across clearly. We need to pay attention to the correctness of the visualization and to the psychological perception of the charts by the audience, but we do not need to focus on the search for hidden patterns and relationships.

An important part of this phase is the honesty of the presentation. Results must be conveyed in an objective way to allow the audience to understand the right message and form their own opinion unbiased by the presenter’s view.

Many errors in result presentation involve, for example, the range of the y-axis. Indeed, the y-range can be tweaked so as to show a more dramatic change than what the data actually support. Another classic error consists of placing charts with different y-ranges side by side, since the audience will automatically transfer the y-range of the chart on the left onto the chart on the right. This is of course deceptive, because it makes changes on one side appear much more dramatic than on the other. And so on.

Many books have already been written about displaying statistical findings in a meaningful and objective way (see, for example, the first book of this type: Darrell Huff, “How to Lie with Statistics”, 1954). So, I am not going to dwell much more on this particular aspect of data visualization.

Fig. 6. Revenues ($) for the years 2017 and 2018. Notice the different ranges of the y-axis.
Fig. 7. Revenues ($) for the years 2017 and 2018 side by side, this time with the same y-range, putting the comparison in the correct perspective.
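To make the point concrete, a chart like Fig. 7 simply forces both panels onto a common, zero-based y-axis. The sketch below does this in matplotlib; the revenue figures are made up for illustration.

```python
# Side-by-side revenue bars on a shared, zero-based y-axis (values are made up).
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.bar(["2017"], [910_000])
ax2.bar(["2018"], [945_000])
ax1.set_ylim(0, 1_000_000)        # same range for both panels, starting at zero
ax1.set_ylabel("Revenues ($)")
ax1.set_title("2017")
ax2.set_title("2018")
plt.show()
```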

In the project “Data Exploration in #66daysofdata with KNIME”, we focus more on the first part: data visualization techniques for the best possible data exploration. Of course, nobody will prevent you from using any of the same techniques for result presentation as well, whenever needed.


Rosaria Silipo
Low Code for Data Science

Rosaria has been mining data since her master's degree, through her doctorate and the job positions that followed. She is now a data scientist and KNIME evangelist.