GETTING STARTED | SCATTER PLOT VISUALIZATION | KNIME ANALYTICS PLATFORM

Data, It Makes Me Happy!

Looking At World Happiness Data Through Scatter Plots In KNIME Analytics Platform

John Denham
Low Code for Data Science

Photo by Denise Jones on Unsplash.

Introduction: What Makes You Happy

What makes you happy? Interesting data makes me happy!

Have you ever read a report online and thought it would be fun to snag the raw data and create some of the visualizations yourself? Data is everywhere, and it’s easier than ever to acquire, visualize, and analyze. Whether your data comes from places like kaggle.com or data.gov, is scraped from the web, or is cloned from a Git repository, the choices are almost endless.

Data visualization in KNIME Analytics Platform is varied and flexible. The platform offers numerous JavaScript-enabled visualizations, and its programming language integrations for Python and R enable even more customized data visualization with your favorite libraries and packages.

In this article, we are going to generate some scatter plots with data from the World Happiness Report 2021. We are not attempting to re-analyze the report data, but simply to demonstrate the configuration and use of the Scatter Plot node. I highly encourage you to read through the World Happiness Report 2021 and explore how they arrived at their insights. For additional questions about the data, please see this FAQ (Figure 1).

Figure 1: The World Happiness Report 2021.

The workflow includes two World Happiness Report 2021 datasets.

DataPanelWHR2021C2 is the first dataset we will interact with and includes data from 2005–2020. Its columns include ladder scores (responses to a question asking people to rate their life on the rungs of a ladder) and positive and negative emotions, labeled Positive affect and Negative affect. Additionally, Logged GDP per capita provides a metric, used in our example, for examining how variation in countries' income relates to measures of happiness and life expectancy.

The second dataset, DataForFigure2.1WHR2021C2, is slightly different (a combination of 2018–2020 data) but also includes Regional Indicators (essentially region names), which the first dataset does not include. These two datasets should give us ample opportunities to explore the Scatter Plot node.

The datasets come with the reference workflow for this blog post, named Scatter_Plot, which is available on the KNIME Hub and can be downloaded for free.

In this post, we will describe and apply this node to:

  • Create basic scatter plots from the World Happiness Report data.
  • Apply custom colors to nominal data to display in the plot.
  • Control aspects of the Scatter Plot node with variables.

The Scatter Plot Node

Scatter plots help us visually identify the relationship (or lack thereof) between numeric variables in our data. The legends on our scatter plots use color coding to help identify clusters, groups, or categories of nominal data (country, sentiment, etc.), which further highlights these relationships.

Generally, we plot what we are trying to predict (our dependent variable) on the y axis and our independent variable (what we are predicting by) on the x axis. We can then conduct additional analysis, such as linear regression, to assist us in making predictions and to better understand the significance behind these relationships (Figure 2).

Figure 2: The Scatter Plot node.

The Scatter Plot node is part of the KNIME JavaScript Views family of nodes, which have a JavaScript-based implementation.

Aside from basic x and y plotting, we can color the points on the plot by nominal values. For additional configuration control, there is currently one configurable CSS class associated with the Scatter Plot node.

The Scatter Plot node is organized under the Views category in the Node Repository (Figure 3).

Figure 3: Where To Find The Scatter Plot Node.

Finding Happiness

Basic scatter plot design requires that we identify data for the x and y axes and if so desired, add a legend. We can label the axes separately or just pull through the column names and add titles and subtitles.

The Scatter Plot node is a great addition to our workflows, whether we are using it for data exploration/discovery or integration into a dashboard. Combined with the Table View node we can select rows of interest and see them highlighted on the scatter plot itself.

Step 1

Figure 4: Workflow Step 1.

To start, we use the Rule-Based Row Filter node to filter the data down to just the year 2020 with this simple expression (Figure 4):

$year$ = 2020 => TRUE
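
For readers who think in code, the rule above behaves roughly like the following Python filter. This is only a sketch under stated assumptions: the rows are invented placeholder values (not figures from the report), the `country` and `ladder` keys are hypothetical column names, and `keep_year` is a hypothetical helper, not part of KNIME.

```python
# Placeholder rows standing in for the DataPanelWHR2021C2 table.
# All names and values below are invented for illustration only.
rows = [
    {"country": "Country A", "year": 2019, "ladder": 5.1},
    {"country": "Country A", "year": 2020, "ladder": 5.3},
    {"country": "Country B", "year": 2020, "ladder": 6.8},
    {"country": "Country B", "year": 2018, "ladder": 6.5},
]


def keep_year(table, year):
    """Keep only the rows whose 'year' equals the given year,
    mirroring a rule of the form `$year$ = 2020 => TRUE`."""
    return [row for row in table if row["year"] == year]


rows_2020 = keep_year(rows, 2020)
for row in rows_2020:
    print(row["country"], row["year"])
```

Every row that does not match the rule is dropped, which is why only 2020 observations reach the Scatter Plot node.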

Next, we just add the Scatter Plot node. This is going to be a very basic plot with x and y plotted but no legend.

In the Options tab of the configuration, we select our numeric columns. In this example, our two columns are of type Number (Double) (Figure 5).

Figure 5: Scatter Plot Node Configuration Options Tab.

Additional options here include the ability to output the generated scatterplot as an image. With the Image Writer (Port) node, we can write the image out to a local file or network location (Figure 6).

Figure 6: Image Writer (Port) node to Write The Scatter Plot Image.

We have the option to restrict the number of rows output and report on or suppress any missing values in the dataset.

If we Report on missing values in our dataset, the node executes with an exclamation point and a note warning about the presence of missing values and their omission from the plot (Figure 7).

Figure 7: Scatter Plot Node Note About Missing Values.
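
The effect of omitting missing values can be sketched in a few lines of Python. This is an illustration of the idea rather than the node's actual implementation: `None` stands in for a missing cell, the values are invented placeholders, and `split_missing` is a hypothetical helper.

```python
def split_missing(points):
    """Separate complete (x, y) pairs from pairs with a missing value.

    Returns the plottable pairs and the number of pairs left out.
    """
    complete = [(x, y) for x, y in points if x is not None and y is not None]
    omitted = len(points) - len(complete)
    return complete, omitted


# Invented placeholder values, not data from the report.
points = [(9.5, 65.0), (None, 70.1), (10.2, None), (8.7, 58.3)]

plotted, omitted = split_missing(points)
if omitted:
    # Comparable in spirit to the warning note shown on the node.
    print(f"Warning: {omitted} row(s) with missing values were left out of the plot.")
```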

The Axis Configuration tab holds the configuration options for our axes. If we so desire, we can relabel the x and y axes here. These titles can also be controlled via flow variables (Figure 8).

Figure 8: Axis Configuration Tab Of Scatter Plot Node.

A large section on date and time formatting is here as well with many locales and time zones to select from.

The Axes ranges section allows us to adjust the starting values and ranges for the x and y axes. With Auto range axes enabled, the ranges are set automatically. Always show origin includes points from 0 forward, and the Use domain information option sets the bounds from the domain.

The image below shows the differences in output among the three Axes ranges choices (Figure 9).

Figure 9: Left: Auto range axes. Center: Always show origin. Right: Use domain information.
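
One way to reason about the three choices is as three rules for picking the lower and upper bound of an axis. The Python sketch below reflects my reading of the options as described above, not KNIME's internal logic. The `axis_range` function, its `mode` names, and the sample values are all hypothetical.

```python
def axis_range(values, mode="auto", domain=None):
    """Return (lower, upper) bounds for one axis.

    mode="auto":   bounds follow the data actually plotted.
    mode="origin": as "auto", but stretched so that 0 is always visible.
    mode="domain": bounds come from separately stored domain information.
    """
    lower, upper = min(values), max(values)
    if mode == "auto":
        return lower, upper
    if mode == "origin":
        return min(lower, 0), max(upper, 0)
    if mode == "domain":
        if domain is None:
            raise ValueError("domain bounds are required for mode='domain'")
        return domain
    raise ValueError(f"unknown mode: {mode}")


# Invented placeholder values for a log-GDP-like column.
x_values = [7.5, 9.1, 10.4, 11.2]

print(axis_range(x_values, "auto"))
print(axis_range(x_values, "origin"))
print(axis_range(x_values, "domain", domain=(6.0, 12.0)))
```

Under this reading, the origin option mostly matters when all values sit far from zero, because it adds empty space between 0 and the first point.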

The General Plot Options tab in the configuration window allows us to set chart title and subtitle. If we want a legend that colors points based on their category or cluster and lists them beneath our plot, we can do that here by checking Show color legend. Legend data is passed in from a Color Manager node that needs to be connected earlier in the workflow. Step 2 has an example of this (Figure 10).

Figure 10: General Plot Options Tab In The Scatter Plot Node Configuration.

The Sizes selection here is the output size of the static image, if we choose to generate one.

Finally, we have a number of color options for background, data color area and grid. We can choose from pre-set swatches, HSV, HSL, RGB and CMYK colors.

The final tab in the configuration window is the View Controls tab which has options that impact the behavior of the plot when used in a composite or interactive view (Figure 11).

Figure 11: View Controls Tab In The Scatter Plot Node Configuration.

There are a lot of choices here that will greatly impact our experience with chart interactivity, point selection, and potential for editing.

Click OK and execute (F7) the Scatter Plot node. What we get is a nice, clean, easy-to-read and dynamically resizable (through the interactive view) plot of our Happiness data (Figure 12).

Figure 12: Scatter Plot Interactive View.

The plot suggests that healthy life expectancy at birth rises with the log GDP per capita of a country. Essentially, higher-income countries tend to see higher healthy life expectancy at birth. To explore this relationship further, we would conduct a linear regression.

Getting With The Trend

While the option to plot a trend line is not currently available in the Scatter Plot node, it is available elsewhere in KNIME.

If we are engaging in data exploration or analysis, we can use the Linear Regression Learner node to see our scatter plot with a trend line. Below is an example of how to view the output from the node (Figures 13 and 14).

Figure 13: Connecting Data To Linear Regression Learner Node.
Figure 14: How To View The Scatter Plot With Trend Line From Linear Regression Learner Node.
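
For intuition about what a trend line represents, here is a minimal Python sketch of a one-variable ordinary least squares fit. It is the textbook formula, not a description of how the Linear Regression Learner node is implemented. The `fit_line` helper is hypothetical, and the sample values are invented so that they lie exactly on a known line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented placeholder values lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

With real data the points scatter around the line, and the slope summarizes how much y changes on average for each unit of x.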

Additionally, if we wanted more scatter plot visualizations, we could easily use KNIME’s R View (Table) or Python View nodes with Plotly, ggplot2, seaborn and others to generate our plots. If you have Python or R configured in KNIME, feel free to explore the basic plots I’ve included (Figure 15).

Figure 15: Left: Python seaborn Regplot. Right: R ggplot2 geom_point.

Step 2

Figure 16: Workflow Step 2.

This step is quick and straightforward. We are simply taking the Color Manager node and using it to auto color the nominal values in the Regional indicator column.
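
To make the idea of auto-coloring nominal values concrete, the sketch below gives each distinct category its own color by spacing hues evenly around the color wheel. This is an assumption-laden illustration: KNIME's Color Manager uses its own palettes, and `auto_colors` is a hypothetical helper. The two region names are the ones discussed in this article.

```python
import colorsys


def auto_colors(categories):
    """Assign each distinct category a hex color by spacing hues evenly."""
    distinct = sorted(set(categories))
    palette = {}
    for i, name in enumerate(distinct):
        hue = i / len(distinct)
        # Fixed saturation and brightness keep the colors easy to tell apart.
        r, g, b = colorsys.hsv_to_rgb(hue, 0.8, 0.9)
        palette[name] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return palette


regions = ["Western Europe", "Sub-Saharan Africa", "Western Europe"]
palette = auto_colors(regions)
for region, color in palette.items():
    print(region, color)
```

Each point then takes the color of its category, which is what the legend lists beneath the plot.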

Next, we connect to our Scatter Plot node and ensure that we check the Show color legend under the General Plot Options of the node configuration. Once complete, our scatter plot now includes a legend and color coded points (Figure 17).

Figure 17: Scatter Plot With Legend And Colors.

The legend and the color-coded points show that Western Europe has a higher logged GDP per capita and, alongside it, a much higher healthy life expectancy than Sub-Saharan Africa. This is an insight that could drive root cause analysis in future projects.

Additionally, you could explore binning and clustering techniques to further visualize and understand the data, in addition to using different columns for the x and y axes.

We can also leverage the Generic JavaScript View node to code customized data visualizations with different JavaScript libraries. If you’re comfortable with Plotly.js, the Generic JavaScript View node includes a 2D Scatter Plot template that you can adjust to the needs of your project. Check out the starter included in Step 2.

KNIME also offers a host of visualizations in the JavaScript View (Labs) category that run on Plotly.js and require zero code, while allowing for quick and easy customization (Figure 18).

Figure 18: KNIME Plotly Family Of Visualization Nodes.

The Thin Red Line: Notes On Controlling Your Workflow With Variables

As with most KNIME nodes there are a number of flow variables we can control. Below we have a few that would be fairly common in our workflows (Figure 19).

Figure 19: Common Flow Variables From The Scatter Plot Node.

Of note here, unlike the Tile View node, the Scatter Plot node accepts colors in RGBA format. Currently, the transparency value must remain at 1.0, as transparency values less than 1.0 are not recognized upon output.
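
If you build such color values programmatically before passing them in as flow variables, a small helper can enforce the transparency constraint. The sketch below assumes a CSS-style `rgba(r,g,b,a)` string, which is my assumption about the expected layout, so check it against the values shown in your own node configuration. `rgba_string` is a hypothetical helper.

```python
def rgba_string(red, green, blue, alpha=1.0):
    """Format a color as a CSS-style rgba(...) string (assumed layout)."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("color channels must be between 0 and 255")
    if alpha != 1.0:
        # As noted above, alpha values below 1.0 are not recognized on output.
        raise ValueError("keep alpha at 1.0 for the Scatter Plot node")
    return f"rgba({red},{green},{blue},{alpha})"


background = rgba_string(255, 255, 255)
print(background)
```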

Conclusion

In this article, we generated scatter plots from the World Happiness Report 2021 datasets with the Scatter Plot node. We saw how to add a splash of color with the Color Manager node and looked at additional ways to understand our data. We also looked at some key flow variables we might use to dynamically control elements of the scatter plot.

Finally, we introduced the Plotly-powered JavaScript View nodes available in the KNIME Labs category.

Data visualization is key to understanding our data and unlocking insights. We are absolutely surrounded by data, and its ever-increasing democratization affords all of us incredible opportunities to leverage tools to explore it and, through it, understand our world a little bit better.

John Denham
Low Code for Data Science

I am a Data Scientist who is passionate about empowering people to make the most of their data. I run the website KNIME.tips.