DATA CHALLENGES | #66DAYSOFDATA | KNIME ANALYTICS PLATFORM

Data Exploration in #66daysofdata with KNIME — Updated

A roadmap to deepen your knowledge of data exploration techniques within the #66daysofdata challenge

Roberto Cadili
Low Code for Data Science
23 min read · Sep 20, 2021


Co-author: Rosaria Silipo

Have you heard about the #66daysofdata challenge?

It is an interesting self-motivational approach to learning or deepening any branch of data science in the limited free time every one of us has each day. The idea is to dedicate a bit of time every day to learning more about data science according to a set roadmap and with the goal of completing a set project. As with all self-help techniques, it is also good to share your progress with others … and where else if not on social media? Here is a video by Ken Jee that explains the whole idea behind the #66daysofdata challenge: “What is the #66daysofdata?”

The challenge can be summarized as follows:

1. Learn some data science every day for 66 days, for at least 5–15 minutes a day.

2. Share your daily progress on social media using #66daysofdata.

We are offering you a project idea to run within your #66daysofdata challenge to learn more about data visualization, data blending, data preparation, and other data science topics.

Find below a roadmap in the form of daily tasks to get started!

New to KNIME?

Our tool of choice is KNIME Analytics Platform, the open source software that leverages visual programming to make data science creation intuitive and accessible to everyone.

Project Resources

Roadmap

1. Import Data. First things first: Learn how to import data and then specifically investigate options for importing text data.

2. Descriptive Statistics. Delve into basic statistics measures and find out things like average popularity of songs, highest danceability, plus the ever-popular topic of “missing values”.

3. Histograms. Investigate how to build a histogram for features such as loudness, tempo, and energy. And learn about components and uploading workflows to the KNIME Hub.

4. Date Standardization. Dates always seem to come in very different formats. Learn how to standardize fields, filter rows, append, transform, and extract date&time data.

5. Ungrouping and Aggregations. Let’s extract the number of tracks for each artist and learn how to ungroup, aggregate, and join in the process.

6. Plots and Charts — Univariate Analysis. Who are the 20 most prolific artists of all time? Find out, and visualize it! Explore how to add color and build composite views, use pivoting, and add interactivity.

7. Plots and Charts — Multivariate Analysis. To investigate relationships among features we need multivariate types of visualization. Dive into sunburst charts and scatter plots, and learn about rule engines and loops.

8. Plots and Charts — Time Plots. It’s flow variable time! Try plotting annual track numbers in different types of plots, including stacked area charts and bar charts for evolution over time. And then add a bit of JavaScript!

9. Plots and Charts — Control. This is all about Guided Analytics and Widgets, and exploring how to build interactive dashboards.

10. Covariance and Correlation. The focus here is on adding a little statistical flavor to your analysis, to improve understanding of the data and enable sounder identification of relationships between input features.

11. Text Visualization. Data science isn’t all about numbers! It also involves texts, images, networks, and more. Explore here how to visualize your text analysis.

12. Graph Visualization (Bonus). The bonus topic! Moving into more advanced visualizations, here you can explore how to visualize interactions among users, in this case: tweets!

Datasets — Spotify Tracks and Artists

The core of this project idea relies on three Spotify datasets freely available on Kaggle (sign in to download them):

tracks.csv. This dataset contains data about ~600k music tracks for the years 1900–2021 and is described by 20 columns:

id: track unique ID.

name: track name.

popularity: the popularity of a track. Values range from 0 (least popular) to 100 (most popular).

duration_ms: duration of song in milliseconds.

explicit: describes the content type of a track. Explicit content is represented by 1 and non-explicit content by 0.

artists: artist name.

id_artists: artist unique ID (collection).

release_date: date when track was released.

danceability: describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable).

energy: represents a perceptual measure of intensity and activity. Values range 0.0 (least energetic) to 1.0 (most energetic).

key: the estimated overall key of the track. E.g. 0 = C, 1 = C♯/D♭, 2 = D, etc.

loudness: the overall loudness of a track in decibels.

mode: indicates the modality (major or minor) of a track. Major is represented by 1 and minor is 0.

speechiness: detects the presence of spoken words in a track. Values range from 0.0 (least speechy) to 1.0 (most speechy).

acousticness: a measure of whether the track is acoustic. Values range from 0.0 (least acoustic) to 1.0 (most acoustic).

instrumentalness: predicts whether a track contains no vocals. Values range from 0.0 (least instrumental) to 1.0 (most instrumental).

liveness: detects the presence of an audience in the recording. Values range from 0.0 (least live) to 1.0 (most live).

valence: describes the musical positiveness conveyed by a track. Values range from 0.0 (least positive) to 1.0 (most positive).

tempo: the overall estimated tempo of a track in beats per minute (BPM).

time_signature: tells how the music is to be counted.

artist-uris.csv. This dataset contains data about ~81k artists and is described by 2 columns (header names are not provided):

[id_artists]: artist unique ID.

[artists]: artist name.

artist.csv. This dataset is very similar to the tracks.csv dataset, but it also includes a popularity metric for artists:

popularity: the popularity of an artist. Values range from 0 (least popular) to 100 (most popular).

KNIME Forum for your questions along the way

In case you are lost or you are curious to know how we engineered the solution of the daily task, post your questions in the comments of our daily social media posts. For more technical help, join the KNIME Forum and check the thread #66daysofdata.

If you don’t already have a Forum account, here is a quick user guide for setting up your KNIME Forum account, along with an overview of the various Discourse features.

KNIME Hub and social media to store and share

Store your work on the KNIME Community Hub, and share your progress by posting an impression of your workflow or visualization on social media (e.g., Twitter, LinkedIn, etc.). Do not forget to add the hashtags #KNIME and #66daysofdata.

After completing the challenge, send the link to your work stored on the KNIME Hub to LowCodeJournal@knime.com, and earn a celebratory digital badge of the #66daysofdata challenge with KNIME that you will be able to share on social media.

Ready for the challenge?

1. Import Data

The first part of the project focuses on the first basic step: importing the data into KNIME Analytics Platform. A few documentation sources are provided to download and understand the dataset, to install and explore KNIME Analytics Platform, and then to investigate options for importing text data.

Day 1. Download the three Spotify datasets from Kaggle (sign in to download them). Open and investigate the content of each file. For this project, we will use the files tracks.csv, artist-uris.csv, and artist.csv.

Day 2. Learn more about KNIME Analytics Platform. Recommended readings and videos:

Day 3. Download KNIME Analytics Platform and install it on your machine in a folder where you have reading and writing permission. After installation, start KNIME Analytics Platform, set workspace (default workspace is OK) and get familiar with the workbench. Create an account on the KNIME Forum. The same account will also work for the KNIME Hub. Recommended videos:

Day 4. In the KNIME Explorer panel, under LOCAL, create a new folder (Workflow group) to host your work. Then, create an empty workflow to start the project. Create a “Data” folder under <workspace folder>/<workflow group folder>/Data and copy the Spotify datasets in it. After refreshing the KNIME Explorer (right-click on LOCAL and select Refresh), you should see the folder Data with all copied files. Recommended videos:

Day 5. Read file tracks.csv of the Spotify dataset. You can:

Drag&drop the file from KNIME Explorer onto the workflow editor and see what node is created. Tip: it might be necessary to significantly increase the value of “Limit data rows scanned”.

Once the data has been read, open the output table to investigate the results.

Notice that many song titles are not pure ASCII: make sure to choose the right character decoding option in the reader node. Compare the performance of the different reader nodes in terms of flexibility and speed. Recommended documents and videos:

Day 6. Investigate the concept of relative paths in reader nodes. Use a relative path instead of an absolute path in your file readers. Did you use a workspace or a workflow reference?

Day 7. Comment the nodes and write a general annotation with title and description of what the project is about. Learn more about data exploration.

2. Descriptive Statistics

Before diving into the visualization, let’s start with some basic descriptive statistics.

Day 8. Learn about basic statistics measures: mean, median, mode, variance, standard deviation, range, and quantiles. What is the difference between average and mean? And between variance and standard deviation?
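If you want to cross-check KNIME’s output outside the platform, here is a minimal pandas sketch of these measures (assuming tracks.csv sits in your working directory and has the popularity column described above):

import pandas as pd

tracks = pd.read_csv("tracks.csv")
pop = tracks["popularity"]

# Basic descriptive statistics for track popularity
print("mean:", pop.mean())            # arithmetic average
print("median:", pop.median())        # 50th percentile
print("mode:", pop.mode().iloc[0])    # most frequent value
print("variance:", pop.var())         # average squared deviation from the mean
print("std dev:", pop.std())          # square root of the variance, same unit as the data
print("range:", pop.max() - pop.min())
print("quartiles:", pop.quantile([0.25, 0.5, 0.75]).to_dict())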

Day 9. Learn about more complex statistics measures: skewness and kurtosis.

Day 10. Learn about the Data Explorer node to calculate the basic descriptive statistics measures of the dataset in tracks.csv.

Day 11. Explore the interactive view of the Data Explorer node. What’s the average popularity of the songs in the dataset? The highest danceability? How many missing values in the feature “key”? Investigate the effect on the output data of excluding a column in the interactive view.

Day 12. Investigate the difference between zeros, missing values, infinity, and NaN. In KNIME Analytics Platform, missing values are represented by a red question mark.
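To see these differences outside KNIME, here is a tiny, self-contained Python illustration:

import math

values = [0.0, None, float("inf"), float("nan")]
for v in values:
    print(v,
          "| missing:", v is None,
          "| NaN:", isinstance(v, float) and math.isnan(v))

# 0.0 is a legitimate value; None stands for truly absent data;
# inf typically results from operations like division by zero;
# NaN ("not a number") is, unusually, not even equal to itself.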

3. Histograms

Did you notice the histograms in the last column on the right of the interactive view of the Data Explorer node? In this part of the project we will deepen our knowledge of histograms.

Day 13. Learn more about Histograms.

Day 14. Introduce a Histogram node into the workflow. Build the histogram of the feature “loudness” with binning type ‘Equal frequency’ first, and then ‘Equal width’. Why doesn’t it make sense to build a histogram of occurrences on ‘Equal frequency’ bins?
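For intuition on the two binning strategies, here is a small pandas sketch (assuming the loudness column from tracks.csv); it also hints at the answer to the question above:

import pandas as pd

tracks = pd.read_csv("tracks.csv")

# Equal width: bins span equally sized intervals; counts per bin vary.
equal_width = pd.cut(tracks["loudness"], bins=10)
print(equal_width.value_counts().sort_index())

# Equal frequency: each bin holds (roughly) the same number of rows,
# so a histogram of occurrences over these bins is flat by construction.
equal_freq = pd.qcut(tracks["loudness"], q=10, duplicates="drop")
print(equal_freq.value_counts().sort_index())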

Day 15. Build Histograms with Histogram nodes for the other features as well, like liveness, valence, instrumentalness, danceability, tempo, energy, speechiness, acousticness, popularity, key, loudness, and duration.

Day 16. Learn what a metanode and a component are. Create a component with all Histogram nodes, execute it and open its composite view.

Day 17. Arrange the Histogram nodes in the component composite view using the Layout button in the toolbar. Add a title to the component view via a Text View node. Reshape the layout of the composite view.

Day 18. Upload your current workflow onto the KNIME Community Hub; that is, connect to your hub.knime.com mount point on the KNIME home page, and upload your workflow to your Public space. Then, open the workflow on the KNIME Hub, and explore the Hub for other contributions to #66daysofdata. Is yours there?

Figure 1. Spotify’s track feature distribution using Histograms.

4. Date Standardization

The task of this part is date standardization. Let’s fix the release_date column. Some dates have format yyyy-MM-dd (<year>-<month>-<day>), some just report the <year> or the <year>-<month> of the track release. Let’s standardize this field. We want to build a metanode that appends -01-01 or -01 wherever <month> and/or <day> are missing, so that all dates follow the format yyyy-MM-dd.
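For reference, the padding logic we are after can be expressed as a short Python sketch (a plain-string version, not the KNIME solution itself):

def standardize_date(date_str: str) -> str:
    """Pad a partial date to the full yyyy-MM-dd format."""
    parts = date_str.split("-")
    if len(parts) == 1:        # only <year>
        return date_str + "-01-01"
    if len(parts) == 2:        # <year>-<month>
        return date_str + "-01"
    return date_str            # already yyyy-MM-dd

print(standardize_date("1969"))        # 1969-01-01
print(standardize_date("1969-07"))     # 1969-07-01
print(standardize_date("1969-07-20"))  # 1969-07-20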

Day 19. Investigate the String Manipulation node and its functions, especially length(), replace(), and join().

Day 20. Investigate row filtering (Row Filter) and row splitting (Row Splitter) nodes. Investigate how to keep and exclude rows, and how to filter based on patterns, numerical ranges, or missing values. Split the original dataset so that all rows with only <year> in release_date end up on one side, and all other rows on the other. Repeat for rows with only <year>-<month> in release_date.

Day 21. Append “-01” or “-01-01” to release_date where needed to always have dates in format yyyy-MM-dd, and reassemble all pieces together via the Concatenate node.

Day 22. Transform all values in the release_date column into Date&Time objects. Check that no release_date values are missing.

Day 23. Extract the year from release_date. Wrap up all nodes from this section into a metanode.

5. Ungrouping and Aggregations

The task of this part is ungrouping & aggregation. We want to join the tracks.csv with the artist features from the files artist-uris.csv and artist.csv, and extract the number of tracks for each artist in the dataset.

Day 24. The id_artists column in the tracks dataset is a String that may include one or more artists. To assign each track to each artist (if more than one), we need to split them and create a new row for each artist. Investigate the Cell Splitter node, especially the output options (set, list, new columns). Investigate the Collection type for columns. Using the Ungroup node, disaggregate the dataset so that each row contains only one artist and the corresponding track.
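As a point of comparison, the split-and-ungroup step looks like this in pandas (assuming id_artists stores the IDs as a bracketed, comma-separated string, which is how the Kaggle file ships):

import pandas as pd

tracks = pd.read_csv("tracks.csv")

# Split the id_artists string into a list, then create one row per
# (track, artist) pair: the equivalent of Cell Splitter followed by Ungroup.
tracks["id_artists"] = tracks["id_artists"].str.strip("[]").str.split(", ")
tracks_exploded = tracks.explode("id_artists")
tracks_exploded["id_artists"] = tracks_exploded["id_artists"].str.strip("'")
print(tracks_exploded[["name", "id_artists"]].head())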

Day 25. Read the artist file artist-uris.csv and extract artist ids. Read the second artist file artist.csv, and use the Transformation tab to retain only the columns with artist name and artist popularity information. Convert the popularity column to integer.

Day 26. Join the data for each artist from the artist-uris.csv and the artist.csv. Join the resulting data table with the disaggregated tracks dataset (output table of the Ungroup node).

Day 27. Study the GroupBy node for all aggregations. Study how to build groups of data and what metrics can be calculated on them. In the joined table with artists and tracks, use the GroupBy node to count the number of tracks for each artist, along with the artist popularity and the period of activity from the first release date to the last release date. Using a second GroupBy node, count the number of tracks for each artist in each year.
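The two aggregations map nicely onto a pandas groupby. A minimal sketch on a toy stand-in table (column names are assumptions, not the exact KNIME output):

import pandas as pd

# Toy stand-in for the joined artist/track table built on Day 26.
joined = pd.DataFrame({
    "artists": ["A", "A", "B", "B", "B"],
    "id": ["t1", "t2", "t3", "t4", "t5"],
    "artist_popularity": [80, 80, 65, 65, 65],
    "release_year": [1970, 1975, 1980, 1980, 1991],
})

# First GroupBy: tracks per artist, popularity, and period of activity.
aggregated = joined.groupby("artists").agg(
    track_count=("id", "count"),
    popularity=("artist_popularity", "first"),
    active_from=("release_year", "min"),
    active_to=("release_year", "max"),
)
print(aggregated)

# Second GroupBy: number of tracks per artist per year.
per_year = joined.groupby(["artists", "release_year"])["id"].count()
print(per_year)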

6. Plots and Charts — Univariate Analysis

Some more visualizations! Let’s build a composite view over the top 20 most prolific artists of all time … We mean … of the whole dataset!

Day 28. Using the output table of the first GroupBy node (the aggregated artist, track and popularity data), select the top k (k = 20) most prolific artists ever, that is, the ones with the highest total track count. Sort by track count and rename columns with meaningful names.

Day 29. Learn more about univariate, bivariate, and multivariate visualizations.

Day 30. Put the top k most prolific artists in an interactive Table View. Color rows by artist.

Day 31. Build a composite view that for all selected rows in the table shows the artist detail (name, # of tracks, popularity, period of activity from: to:) in a Tile View on the right.

Day 32. Display the number of tracks by artist in a Pie Chart. Make sure that the same color code by artist is used as in the Table and Tile view.

Day 33. Display the number of tracks by artist in a monochromatic Bar Chart.

Day 34. Display the number of tracks by artist in a Bar Chart with the same color scheme used for the Table and Tile Views.

Day 35. Investigate interactivity in the composite view. Learn more about composite views becoming dashboards both locally and as data apps on the KNIME Business Hub.

Figure 2. Pie Chart and Bar Chart of the top 20 most prolific artists by number of tracks.

7. Plots and Charts — Multivariate Analysis

Pie charts and bar charts show and compare aggregated values. To investigate relationships among features, though, we need multivariate types of visualization. The most commonly used visualization of this kind is surely the scatter plot. Since these visualizations show potentially all data points in your dataset rather than aggregated values, sampling is often required.

Day 36. Using the joined tracks and artist features table, create a track popularity class (high, medium, low) with the Rule Engine node. Then perform different sampling strategies and display the percentages for each popularity class and strategy (original data, random sampling, and stratified sampling) in a table view.
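A hedged pandas sketch of the same idea (the class thresholds below are an assumption for illustration, not the official solution):

import pandas as pd

tracks = pd.read_csv("tracks.csv")

# Rule Engine equivalent: map popularity to a three-level class.
tracks["popularity_class"] = pd.cut(
    tracks["popularity"], bins=[-1, 33, 66, 100],
    labels=["low", "medium", "high"],
)

# Random sampling: every row has the same chance of being picked.
random_sample = tracks.sample(frac=0.1, random_state=42)

# Stratified sampling: sample 10% within each class, preserving proportions.
stratified = tracks.groupby(
    "popularity_class", group_keys=False, observed=True
).apply(lambda g: g.sample(frac=0.1, random_state=42))

for name, df in [("original", tracks), ("random", random_sample), ("stratified", stratified)]:
    print(name, df["popularity_class"].value_counts(normalize=True).round(3).to_dict())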

Day 37. Scatter Plot is probably the most common way to visualize and investigate relationships among pairs of features. Let’s learn more about scatter plots, how to implement them in KNIME, and how to carry out a visual exploration of the data. Check various pairs of features. Which feature creates a pattern with which feature? Try “loudness” vs “explicit” and see if there is some form of correlation. Add a Table View to visualize details of selected points only in the scatter plot.

Day 38. With the data table created using stratified sampling, let’s prepare the data to use the Sunburst Chart to visualize the feature proportions that lead to high popularity. The Sunburst Chart node requires nominal values, so the numerical columns must be binned or, if already binned but still numerical, converted to strings.

Day 39. Binned buckets might not have distinctive names. Let’s loop through the binned columns to change the bin names to <first three letters of column name> + <default bin name>.
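The renaming loop is a one-liner per column in Python; a toy sketch (bin labels are placeholders):

import pandas as pd

# Toy binned table: in the real workflow these are the binned audio features.
df = pd.DataFrame({
    "energy": ["Bin 1", "Bin 2"],
    "loudness": ["Bin 1", "Bin 3"],
})

# Prefix every bin label with the first three letters of its column name,
# so Sunburst segments from different features stay distinguishable.
for col in df.columns:
    df[col] = col[:3] + df[col]
print(df)  # e.g. "eneBin 1", "louBin 3"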

Day 40. We can now apply the Sunburst Chart to visualize the proportion of each feature to reach high popularity.

Day 41. We now want to apply a Heatmap to all numerical features. Since the Heatmap visualizes numerical values with a color gradient built on the [min, max] interval, for better visualization we need normalized features first.
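Min-max normalization, the transformation implied here, can be sketched as follows (assuming all numeric columns of tracks.csv are normalized at once):

import pandas as pd

tracks = pd.read_csv("tracks.csv")
numeric = tracks.select_dtypes("number")

# Rescale every feature to [0, 1] so a single color gradient
# is meaningful across all columns of the heatmap.
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(normalized.describe().loc[["min", "max"]])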

Day 42. Let’s apply a Heatmap to all numerical normalized features.

Day 43. Visualize all feature contributions to popularity classes via a Parallel Coordinates Plot.

Day 44. A full component should be dedicated to Box Plots. Learn more about Box Plots & Conditional Box Plots. With the data table created using stratified sampling, you can box plot single features one by one, or you can box plot multiple features all together. In the latter case, you must pay attention to the different ranges. You could normalize, of course, but then you lose the interpretability of the box plot.

8. Plots and Charts — Time Plots

Some more charts and plots. We use this section to also introduce the concept of Flow Variables.

Day 45. Learn more about Flow Variables.

Day 46. Return to the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. In a selected time window (<from year> <to year>), find out the maximum number of years of activity for an artist and extract those artists that have been active for that maximum number of years. Wrap it into a component. Flow Variables might be helpful here.

Day 47. Make the previous component parametric by adding a configuration window where you can select the time window (<from year> <to year>). You can further parameterize your configuration window by allowing the top k artists to be extracted (i.e., artists with the largest number of tracks in the selected time window).

Day 48. Plot yearly number of tracks by artist in a Line Plot. Add color for each line, i.e. for each artist. Make the plot title parametric for the selected time window with the help of the String Manipulation (Variable) node. Inspect lines one by one and all together. Find out the artists who have been most consistently active across the years of your time window.

Day 49. Plot yearly number of tracks by artist in a Stacked Area Chart. Add color for each area, i.e. for each artist. Make the plot title parametric for the selected time window. Explore interactivity, especially how to add and remove areas for artists.

Day 50. Create a Bar Chart to inspect evolution over time. Visualize the number of tracks by artist over years in a bar chart. Add color for each bar, i.e. for each artist. Make the plot title parametric for the selected time window. Inspect bars one by one, in small groups, and all together. Find out the most prolific artist in a specific year of your time window.

Day 51. Let’s conclude this part with some free JavaScript code. Let’s investigate the Generic JavaScript View node. Do not worry, you do not need to code. Just explore the KNIME Hub for workflows and components based on the Generic JavaScript View node. Drag&drop the Animated Bar Chart component from the KNIME Hub into your workflow and study, for example, the evolution of artists and track counts throughout the years. Wrap all the time plots in a component and investigate selections of year and artists in the composite view.

  • Example components and visualizations based on free JavaScript code:
  1. FIFA World Cup, KNIME Hub
  2. Animated Bar Chart, KNIME Hub
  3. Epidemiological data from Zika virus, KNIME Hub
  4. Relations in data with scatter and 3-D scatter, KNIME Hub

9. Plots and Charts — Control

Day 52. Learn about Guided Analytics and Widget nodes.

Day 53. Let’s continue using the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. Build a guided analytics sequence with a component including a Widget-based selection framework to select the <from year> <to year> time window and extract the top k artists. Next, build a second component to filter artists of choice and a third component to build time plot views (e.g., line plot, bar chart, stacked area chart, etc.) to visualize the number of tracks by artist over a selected time period (e.g., 1970–1980).

Day 54. Using the top k artists extracted from the time window selection used in the first component, investigate the usage of the Interactive Range Slider Filter Widget node in conjunction with the Scatter Plot or the Stacked Area Chart node. Build a component with a scatter plot or stacked area chart and control the number of points via the Interactive Range Slider Filter Widget node, for example by controlling the size of the year interval. Alternatively, use the data table created with stratified sampling in section 7 to build a component with an interactive multivariate scatter plot and table view in conjunction with the Interactive Range Slider Filter Widget node to control for the year interval.

Day 55. Investigate the usage of the Refresh Button Widget node. After the component that uses Widget nodes to select the time window (e.g., 1970–1980) and extracts the top k artists, build a dashboard with a Widget-based framework to select the top k artists, a stacked area chart, a line plot, a bar chart, and a Refresh button.

Day 56. Combine the Refresh Button Widget node with the Interactive Range Slider Filter Widget node and with a stacked area chart to visualize the extracted top k artists in a selected time window within a dashboard. Watch the Refresh Button Widget node in action:

Figure 3. Guided analytics of most active artists.

10. Covariance and Correlation

Creating bar charts, pie charts, or time plots is definitely a great way to visually explore a dataset and gain precious insights. However, sometimes adding a tiny statistical flavor to the analysis improves our understanding of the data and, more importantly, enables a sounder identification of relationships between the input features.

Day 57. Learn about covariance, linear correlation, and rank correlation. Focus on the similarities and differences between them, as well as on their strengths and weaknesses.

Day 58. Compute the covariance for all pairs of the following features: “energy”, “loudness”, “danceability”, “valence” and “popularity” in the tracks.csv dataset using the GroupBy node. Build two covariance matrices, one with unnormalized and one with normalized (z-score normalization) features.

Day 59. Compute the linear correlation for all pairs of meaningful numeric features in the track dataset using the Linear Correlation node. Inspect the correlation matrix view and the results of the three output ports. What are the top four positively correlated feature pairs? Compare the values in the linear correlation matrix with those of the normalized covariance matrix. What do you observe?

Day 60. Compute the rank correlation for all pairs of meaningful numeric features in the track dataset using the Rank Correlation node. Inspect the correlation matrix view, and the results of the four output ports. What are the top four positively correlated feature pairs? Do you observe similarities with the linear correlation matrix? If yes, why do you think that’s the case?
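If you want to double-check Days 58–60 outside KNIME, a compact pandas sketch covers all three (feature list as in Day 58):

import pandas as pd

tracks = pd.read_csv("tracks.csv")
cols = ["energy", "loudness", "danceability", "valence", "popularity"]

# Covariance matrix on the raw features...
print(tracks[cols].cov())

# ...and on z-score normalized features. After z-scoring, the covariance
# of each pair equals its Pearson correlation, which explains the match
# you should observe with the Linear Correlation node.
z = (tracks[cols] - tracks[cols].mean()) / tracks[cols].std()
print(z.cov())

# Pearson (linear) and Spearman (rank) correlation matrices.
print(tracks[cols].corr(method="pearson"))
print(tracks[cols].corr(method="spearman"))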

Day 61. Visualize in a component composite view the unnormalized and normalized covariance matrices side-by-side both using a Heatmap node and a Table View node. Inspect in a different component composite view the linear correlation for the top four positively correlated features using the Scatter Plot node for each pair.

11. Text Visualization

There are not only numbers in data science! Besides numerical data, we have to deal with texts, images, networks, and even more diverse data types. In this section, a few items on text visualization.

Day 62. Get familiar with the KNIME Text Processing Extension, the Document object, the Term object and all text processing operations.

Day 63. Read the IMDb-sample.csv file from the KNIME Hub. This dataset collects 2000 movie reviews written by users and contains sentiment annotation (i.e., positive or negative) for each review. Convert all texts to Documents and perform some basic text cleaning.

Day 64. Using the file MPQA-OpinionCorpus-PositiveList.csv for the list of positive words in the English language and MPQA-OpinionCorpus-NegativeList.csv for the list of negative words in the English language (from the KNIME Hub or the latest version from the MPQA site), tag words in the texts as positive or negative.

Day 65. Let’s transform each text into a Bag of Words and calculate all Term Frequencies.
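In plain Python, the bag of words and (relative) term frequencies boil down to counting, as in this toy sketch (the real texts come from IMDb-sample.csv after the Day 63 cleaning):

from collections import Counter

review = "great movie great cast but a slow slow plot"
bag_of_words = Counter(review.split())
total = sum(bag_of_words.values())

# Relative term frequency: occurrences of a term / number of terms in the document.
term_frequencies = {term: count / total for term, count in bag_of_words.items()}
print(term_frequencies)  # e.g. {'great': 0.222..., 'slow': 0.222..., ...}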

Day 66. We have the list of words, we have their frequencies: let’s visualize them in a word cloud!

CONGRATULATIONS, YOU REACHED THE END!

12. Graph Visualization (Bonus)

As a bonus, let’s investigate something more complicated: how to visualize interactions among users of a community, like the retweeting patterns around a hashtag on Twitter.

Bonus. First, let’s learn what a Network Graph (or sometimes just a graph) is.

Bonus. Download the TwitterData.table around #knime for the time window July 26–August 3, 2021. Shape the Twitter data as an adjacency matrix.
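Conceptually, the adjacency matrix is a cross-tabulation of who retweeted whom; here is a toy pandas sketch (user names and column names are made up for illustration):

import pandas as pd

# Toy retweet edges; the real ones come from TwitterData.table.
edges = pd.DataFrame({
    "retweeter": ["alice", "alice", "bob", "carol"],
    "original_author": ["bob", "carol", "carol", "alice"],
})

# Adjacency matrix: rows = retweeters, columns = original authors,
# cell = number of retweets between the two users.
adjacency = pd.crosstab(edges["retweeter"], edges["original_author"])
print(adjacency)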

Bonus. The KNIME Network Mining Extension deals with graphs and network objects. You need to create a network of interactions among Twitter users before visualizing it. Create the network diagram from the adjacency matrix of Twitter interactions built in the previous step.

Bonus. Visualize the network object of Twitter user interactions with a graph.

Bonus. Another way of visualizing interactions is the chord diagram. Let’s learn what a chord diagram is and how to build one in KNIME using the Generic JavaScript View node.
