DATA CHALLENGES | #66DAYSOFDATA | KNIME ANALYTICS PLATFORM

Data Exploration in #66daysofdata with KNIME — Updated

A roadmap to deepen your knowledge of data exploration techniques within the #66daysofdata challenge

Roberto Cadili
Low Code for Data Science
23 min read · Sep 20, 2021


Co-author: Rosaria Silipo

Have you heard about the #66daysofdata challenge?

It is an interesting self-motivational approach to learning or deepening any branch of data science in the limited free time every one of us has each day. The idea is to dedicate a bit of time every day to learning more about data science according to a set roadmap and with the goal of completing a set project. As with all self-help techniques, it is also good to share your progress with others … and where else if not on social media? Here is a video by Ken Jee that explains the whole idea behind the #66daysofdata challenge: “What is the #66daysofdata?”

The challenge can be summarized as follows:

1. Learn some data science every day for 66 days, for at least 5–15 minutes a day.

2. Share your daily progress on social media using #66daysofdata.

We are offering you a project idea to run within your #66daysofdata challenge to learn more about data visualization, data blending, data preparation, and other data science topics.

Find below a roadmap in the form of daily tasks to get started!

New to KNIME?

Our tool of choice is KNIME Analytics Platform, the open source software that leverages visual programming to make data science creation intuitive and accessible to everyone.

Project Resources

Roadmap

1. Import Data. First things first: Learn how to import data and then specifically investigate options for importing text data.

2. Descriptive Statistics. Delve into basic statistics measures and find out things like average popularity of songs, highest danceability, plus the ever-popular topic of “missing values”.

3. Histograms. Investigate how to build a histogram for features such as loudness, tempo, and energy. And learn about components and uploading workflows to the KNIME Hub.

4. Date Standardization. Dates always seem to come in very different formats. Learn how to standardize fields, filter rows, append, transform, and extract date&time data.

5. Ungrouping and Aggregations. Let’s extract the number of tracks for each artist and learn how to ungroup, aggregate, and join in the process.

6. Plots and Charts — Univariate Analysis. Who are the 20 most prolific artists of all time? Find out, and visualize it! Explore how to add color and build composite views, use pivoting, and add interactivity.

7. Plots and Charts — Multivariate Analysis. To investigate relationships among features we need multivariate types of visualization. Dive into sunburst charts and scatter plots, and learn about rule engines and loops.

8. Plots and Charts — Time Plots. It’s flow variable time! Try plotting annual track numbers in different types of plots, including stacked area charts and bar charts for evolution over time. And then add a bit of JavaScript!

9. Plots and Charts — Control. This is all about Guided Analytics and Widgets, and exploring how to build interactive dashboards.

10. Covariance and Correlation. The focus here is on adding a little statistical flavor to your analysis, to improve understanding of the data and enable sounder identification of relationships between input features.

11. Text Visualization. Data science isn’t all about numbers! It also involves texts, images, networks, and more. Explore here how to visualize your text analysis.

12. Graph Visualization (Bonus). The bonus topic! Moving into more advanced visualizations, here you can explore how to visualize interactions among users, in this case: tweets!

Datasets — Spotify Tracks and Artists

The core of this project idea relies on three Spotify datasets freely available on Kaggle (sign in to download them):

tracks.csv. This dataset contains data about ~600k music tracks for the years 1900–2021 and is described by 20 columns:

id: track unique ID.

name: track name.

popularity: the popularity of a track. Values range from 0 (least popular) to 100 (most popular).

duration_ms: duration of song in milliseconds.

explicit: describes the content type of a track. Explicit content is represented by 1 and non-explicit content by 0.

artists: artist name.

id_artists: artist unique ID (collection).

release_date: date when track was released.

danceability: describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable).

energy: represents a perceptual measure of intensity and activity. Values range 0.0 (least energetic) to 1.0 (most energetic).

key: the estimated overall key of the track. E.g. 0 = C, 1 = C♯/D♭, 2 = D, etc.

loudness: the overall loudness of a track in decibels.

mode: indicates the modality (major or minor) of a track. Major is represented by 1 and minor is 0.

speechiness: detects the presence of spoken words in a track. Values range from 0.0 (least speechy) to 1.0 (most speechy).

acousticness: a measure of whether the track is acoustic. Values range from 0.0 (least acoustic) to 1.0 (most acoustic).

instrumentalness: predicts whether a track contains no vocals. Values range from 0.0 (least instrumental) to 1.0 (most instrumental).

liveness: detects the presence of an audience in the recording. Values range from 0.0 (least live) to 1.0 (most live).

valence: describes the musical positiveness conveyed by a track. Values range from 0.0 (least positive) to 1.0 (most positive).

tempo: the overall estimated tempo of a track in beats per minute (BPM).

time_signature: tells how the music is to be counted.

artist-uris.csv. This dataset contains data about ~81k artists and is described by 2 columns (header names are not provided):

[id_artists]: artist unique ID.

[artists]: artist name.

artist.csv. This dataset is very similar to the tracks.csv dataset, but it also includes a popularity metric for artists:

popularity: the popularity of an artist. Values range from 0 (least popular) to 100 (most popular).

KNIME Forum for your questions along the way

In case you are lost or you are curious to know how we engineered the solution of the daily task, post your questions in the comments of our daily social media posts. For more technical help, join the KNIME Forum and check the thread #66daysofdata.

If you don’t already have a Forum account, here is a quick user guide for setting up your KNIME Forum account, along with an overview of the various Discourse features.

KNIME Hub and social media to store and share

Store your work on the KNIME Community Hub, and share your progress by posting an impression of your workflow or visualization on social media (e.g., Twitter, LinkedIn, etc.). Do not forget to add the hashtags #KNIME and #66daysofdata.

After completing the challenge, send the link to your work stored on the KNIME Hub to LowCodeJournal@knime.com, and earn a celebratory digital badge of the #66daysofdata challenge with KNIME that you will be able to share on social media.

Ready for the challenge?

1. Import Data

The first part of the project focuses on the first basic step: importing the data into KNIME Analytics Platform. A few documentation sources are provided to download and understand the dataset, to install and explore KNIME Analytics Platform, and then to investigate options for importing text data.

Day 1. Download the three Spotify datasets from Kaggle (sign in to download them). Open and investigate the content of each file. For this project, we will use the files tracks.csv, artist-uris.csv, and artist.csv.

Day 2. Learn more about KNIME Analytics Platform. Recommended readings and videos:

Day 3. Download KNIME Analytics Platform and install it on your machine in a folder where you have reading and writing permission. After installation, start KNIME Analytics Platform, set workspace (default workspace is OK) and get familiar with the workbench. Create an account on the KNIME Forum. The same account will also work for the KNIME Hub. Recommended videos:

Day 4. In the KNIME Explorer panel, under LOCAL, create a new folder (Workflow group) to host your work. Then, create an empty workflow to start the project. Create a “Data” folder under <workspace folder>/<workflow group folder>/Data and copy the Spotify datasets in it. After refreshing the KNIME Explorer (right-click on LOCAL and select Refresh), you should see the folder Data with all copied files. Recommended videos:

Day 5. Read file tracks.csv of the Spotify dataset. You can:

Drag&drop the file from KNIME Explorer onto the workflow editor and see what node is created. Tip: it might be necessary to significantly increase the value of “Limit data rows scanned”.

Once the data has been read, open the output table to investigate the results.

Notice that many song titles are not pure ASCII: make sure to choose the right character decoding option in the reader node. Compare the performance of the different reader nodes in terms of flexibility and speed. Recommended documents and videos:

Day 6. Investigate the concept of relative paths in reader nodes. Use a relative path instead of an absolute path in your file readers. Did you use a workspace or a workflow reference?

Day 7. Comment the nodes and write a general annotation with title and description of what the project is about. Learn more about data exploration.

2. Descriptive Statistics

Before diving into the visualization, let’s start with some basic descriptive statistics.

Day 8. Learn about basic statistics measures: mean, median, mode, variance, standard deviation, range, and quantiles. What is the difference between average and mean? And between variance and standard deviation?
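If you want to cross-check KNIME’s output outside the platform, here is a minimal pandas sketch of these measures (assuming tracks.csv sits in your working directory and has the popularity column described above):

import pandas as pd

tracks = pd.read_csv("tracks.csv")
pop = tracks["popularity"]

# Basic descriptive statistics for track popularity
print("mean:", pop.mean())            # arithmetic average
print("median:", pop.median())        # 50th percentile
print("mode:", pop.mode().iloc[0])    # most frequent value
print("variance:", pop.var())         # average squared deviation from the mean
print("std dev:", pop.std())          # square root of the variance, same unit as the data
print("range:", pop.max() - pop.min())
print("quartiles:", pop.quantile([0.25, 0.5, 0.75]).to_dict())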

Day 9. Learn about more complex statistics measures: skewness and kurtosis.

Day 10. Learn about the Data Explorer node to calculate the basic descriptive statistics measures of the dataset in tracks.csv.

Day 11. Explore the interactive view of the Data Explorer node. What’s the average popularity of the songs in the dataset? The highest danceability? How many missing values in the feature “key”? Investigate the effect on the output data of excluding a column in the interactive view.

Day 12. Investigate the difference between zeros, missing values, infinity, and NaN. In KNIME Analytics Platform, missing values are represented by a red question mark.
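To see these differences outside KNIME, here is a tiny, self-contained Python illustration:

import math

values = [0.0, None, float("inf"), float("nan")]
for v in values:
    print(v,
          "| missing:", v is None,
          "| NaN:", isinstance(v, float) and math.isnan(v))

# 0.0 is a legitimate value; None stands for truly absent data;
# inf typically results from operations like division by zero;
# NaN ("not a number") is, unusually, not even equal to itself.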

3. Histograms

Did you notice the histograms in the last column on the right of the interactive view of the Data Explorer node? In this part of the project we will deepen our knowledge of histograms.

Day 13. Learn more about Histograms.

Day 14. Introduce a Histogram node into the workflow. Build the histogram of the feature “loudness” with binning type ‘Equal frequency’ first, and then ‘Equal width’. Why doesn’t it make sense to build a histogram of occurrences on ‘Equal frequency’ bins?
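For intuition on the two binning strategies, here is a small pandas sketch (assuming the loudness column from tracks.csv); it also hints at the answer to the question above:

import pandas as pd

tracks = pd.read_csv("tracks.csv")

# Equal width: bins span equally sized intervals; counts per bin vary.
equal_width = pd.cut(tracks["loudness"], bins=10)
print(equal_width.value_counts().sort_index())

# Equal frequency: each bin holds (roughly) the same number of rows,
# so a histogram of occurrences over these bins is flat by construction.
equal_freq = pd.qcut(tracks["loudness"], q=10, duplicates="drop")
print(equal_freq.value_counts().sort_index())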

Day 15. Build Histograms with Histogram nodes for the other features as well, like liveness, valence, instrumentalness, danceability, tempo, energy, speechiness, acousticness, popularity, key, loudness, and duration.

Day 16. Learn what a metanode and a component are. Create a component with all Histogram nodes, execute it and open its composite view.

Day 17. Arrange the Histogram nodes in the component composite view using the Layout button in the toolbar. Add a title to the component view via a Text View node. Reshape the layout of the composite view.

Day 18. Upload your current workflow onto the KNIME Community Hub; that is, connect to your hub.knime.com mount point on the KNIME home page, and upload your workflow to your Public space. Then, open the workflow on the KNIME Hub, and explore the Hub for other contributions to #66daysofdata. Is yours there?

Figure 1. Spotify’s track feature distribution using Histograms.

4. Date Standardization

The task of this part is date standardization. Let’s fix the release_date column. Some dates have format yyyy-MM-dd (<year>-<month>-<day>), some just report the <year> or the <year>-<month> of the track release. Let’s standardize this field. We want to build a metanode that appends -01-01 or -01 wherever <month> and/or <day> are missing, so that all dates follow the format yyyy-MM-dd.
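For reference, the padding logic we are after can be expressed as a short Python sketch (a plain-string version, not the KNIME solution itself):

def standardize_date(date_str: str) -> str:
    """Pad a partial date to the full yyyy-MM-dd format."""
    parts = date_str.split("-")
    if len(parts) == 1:        # only <year>
        return date_str + "-01-01"
    if len(parts) == 2:        # <year>-<month>
        return date_str + "-01"
    return date_str            # already yyyy-MM-dd

print(standardize_date("1969"))        # 1969-01-01
print(standardize_date("1969-07"))     # 1969-07-01
print(standardize_date("1969-07-20"))  # 1969-07-20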

Day 19. Investigate the String Manipulation node and its functions, especially length(), replace(), and join().

Day 20. Investigate row filtering (Row Filter) and row splitting (Row Splitter) nodes. Investigate how to keep and exclude rows, and how to filter based on patterns, numerical ranges, or missing values. Split the original dataset so that all rows with only <year> in release_date end up on one side, and all other rows on the other. Repeat for rows with only <year>-<month> in release_date.

Day 21. Append “-01” or “-01-01” to release_date where needed to always have dates in format yyyy-MM-dd, and reassemble all pieces together via the Concatenate node.

Day 22. Transform all values in the release_date column into Date&Time objects. Check that no release_date values are missing.

Day 23. Extract the year from release_date. Wrap up all nodes from this section into a metanode.

5. Ungrouping and Aggregations

The task of this part is ungrouping & aggregation. We want to join the tracks.csv with the artist features from the files artist-uris.csv and artist.csv, and extract the number of tracks for each artist in the dataset.

Day 24. The id_artists column in the tracks dataset is a String that may include one or more artists. To assign each track to each artist (if more than one), we need to split them and create a new row for each artist. Investigate the Cell Splitter node, especially the output options (set, list, new columns). Investigate the Collection type for columns. Using the Ungroup node, disaggregate the dataset so that each row contains only one artist and the corresponding track.
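As a point of comparison, the split-and-ungroup step looks like this in pandas (assuming id_artists stores the IDs as a bracketed, comma-separated string, which is how the Kaggle file ships):

import pandas as pd

tracks = pd.read_csv("tracks.csv")

# Split the id_artists string into a list, then create one row per
# (track, artist) pair: the equivalent of Cell Splitter followed by Ungroup.
tracks["id_artists"] = tracks["id_artists"].str.strip("[]").str.split(", ")
tracks_exploded = tracks.explode("id_artists")
tracks_exploded["id_artists"] = tracks_exploded["id_artists"].str.strip("'")
print(tracks_exploded[["name", "id_artists"]].head())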

Day 25. Read the artist file artist-uris.csv and extract artist ids. Read the second artist file artist.csv, and use the Transformation tab to retain only the columns with artist name and artist popularity information. Convert the popularity column to integer.

Day 26. Join the data for each artist from the artist-uris.csv and the artist.csv. Join the resulting data table with the disaggregated tracks dataset (output table of the Ungroup node).

Day 27. Study the GroupBy node for all aggregations. Study how to build groups of data and what metrics can be calculated on them. In the joined table with artists and tracks, use the GroupBy node to count the number of tracks for each artist, along with the artist popularity and the period of activity from the first release date to the last release date. Using a second GroupBy node, count the number of tracks for each artist in each year.
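The two aggregations map nicely onto a pandas groupby. A minimal sketch on a toy stand-in table (column names are assumptions, not the exact KNIME output):

import pandas as pd

# Toy stand-in for the joined artist/track table built on Day 26.
joined = pd.DataFrame({
    "artists": ["A", "A", "B", "B", "B"],
    "id": ["t1", "t2", "t3", "t4", "t5"],
    "artist_popularity": [80, 80, 65, 65, 65],
    "release_year": [1970, 1975, 1980, 1980, 1991],
})

# First GroupBy: tracks per artist, popularity, and period of activity.
aggregated = joined.groupby("artists").agg(
    track_count=("id", "count"),
    popularity=("artist_popularity", "first"),
    active_from=("release_year", "min"),
    active_to=("release_year", "max"),
)
print(aggregated)

# Second GroupBy: number of tracks per artist per year.
per_year = joined.groupby(["artists", "release_year"])["id"].count()
print(per_year)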

6. Plots and Charts — Univariate Analysis

Some more visualizations! Let’s build a composite view over the top 20 most prolific artists of all time … We mean … of the whole dataset!

Day 28. Using the output table of the first GroupBy node (the aggregated artist, track and popularity data), select the top k (k = 20) most prolific artists ever, that is, the ones with the highest total track count. Sort by track count and rename columns with meaningful names.

Day 29. Learn more about univariate, bivariate, and multivariate visualizations.

Day 30. Put the top k most prolific artists in an interactive Table View. Color rows by artist.

Day 31. Build a composite view that for all selected rows in the table shows the artist detail (name, # of tracks, popularity, period of activity from: to:) in a Tile View on the right.

Day 32. Display the number of tracks by artist in a Pie Chart. Make sure that the same color code by artist is used as in the Table and Tile view.

Day 33. Display the number of tracks by artist in a monochromatic Bar Chart.

Day 34. Display the number of tracks by artist in a Bar Chart with the same color scheme used for the Table and Tile Views.

Day 35. Investigate interactivity in the composite view. Learn more about composite views becoming dashboards both locally and as data apps on the KNIME Business Hub.

Figure 2. Pie Chart and Bar Chart of the top 20 most prolific artists by number of tracks.

7. Plots and Charts — Multivariate Analysis

Pie charts and bar charts show and compare aggregated values. To investigate relationships among features, though, we need multivariate types of visualization. The most commonly used visualization of this kind is surely the scatter plot. Since these visualizations show potentially all data points in your dataset rather than aggregated values, sampling is often required.

Day 36. Using the joined tracks and artist features table, create a track popularity class (high, medium, low) with the Rule Engine node. Then perform different sampling strategies and display the percentages for each popularity class and strategy (original data, random sampling, and stratified sampling) in a table view.
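A hedged pandas sketch of the same idea (the class thresholds below are an assumption for illustration, not the official solution):

import pandas as pd

tracks = pd.read_csv("tracks.csv")

# Rule Engine equivalent: map popularity to a three-level class.
tracks["popularity_class"] = pd.cut(
    tracks["popularity"], bins=[-1, 33, 66, 100],
    labels=["low", "medium", "high"],
)

# Random sampling: every row has the same chance of being picked.
random_sample = tracks.sample(frac=0.1, random_state=42)

# Stratified sampling: sample 10% within each class, preserving proportions.
stratified = tracks.groupby(
    "popularity_class", group_keys=False, observed=True
).apply(lambda g: g.sample(frac=0.1, random_state=42))

for name, df in [("original", tracks), ("random", random_sample), ("stratified", stratified)]:
    print(name, df["popularity_class"].value_counts(normalize=True).round(3).to_dict())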

Day 37. Scatter Plot is probably the most common way to visualize and investigate relationships among pairs of features. Let’s learn more about scatter plots, how to implement them in KNIME, and how to carry out a visual exploration of the data. Check various pairs of features. Which feature creates a pattern with which feature? Try “loudness” vs “explicit” and see if there is some form of correlation. Add a Table View to visualize details of selected points only in the scatter plot.

Day 38. With the data table created using stratified sampling, let’s prepare the data to use the Sunburst Chart to visualize the feature proportions that lead to high popularity. The Sunburst Chart node requires nominal values, so the numerical columns must be binned or, if already binned but still numerical, converted to strings.

Day 39. Binned buckets might not have distinctive names. Let’s loop through the binned columns to change the bin names to <first three letters of column name> + <default bin name>.
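The renaming loop is a one-liner per column in Python; a toy sketch (bin labels are placeholders):

import pandas as pd

# Toy binned table: in the real workflow these are the binned audio features.
df = pd.DataFrame({
    "energy": ["Bin 1", "Bin 2"],
    "loudness": ["Bin 1", "Bin 3"],
})

# Prefix every bin label with the first three letters of its column name,
# so Sunburst segments from different features stay distinguishable.
for col in df.columns:
    df[col] = col[:3] + df[col]
print(df)  # e.g. "eneBin 1", "louBin 3"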

Day 40. We can now apply the Sunburst Chart to visualize the proportion of each feature to reach high popularity.

Day 41. We now want to apply a Heatmap to all numerical features. Since the Heatmap visualizes numerical values with a color gradient built on the [min, max] interval, for better visualization we need normalized features first.
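Min-max normalization, the transformation implied here, can be sketched as follows (assuming all numeric columns of tracks.csv are normalized at once):

import pandas as pd

tracks = pd.read_csv("tracks.csv")
numeric = tracks.select_dtypes("number")

# Rescale every feature to [0, 1] so a single color gradient
# is meaningful across all columns of the heatmap.
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(normalized.describe().loc[["min", "max"]])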

Day 42. Let’s apply a Heatmap to all numerical normalized features.

Day 43. Visualize all feature contributions to popularity classes via a Parallel Coordinates Plot.

Day 44. A full component should be dedicated to Box Plots. Learn more about Box Plots & Conditional Box Plots. With the data table created using stratified sampling, you can box plot single features one by one, or you can box plot multiple features all together. In the latter case, you must pay attention to the different ranges. You could normalize, of course, but then you lose the interpretability of the box plot.

8. Plots and Charts — Time Plots

Some more charts and plots. We use this section to also introduce the concept of Flow Variables.

Day 45. Learn more about Flow Variables.

Day 46. Return to the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. In a selected time window (<from year> <to year>), find out the maximum number of years of activity for an artist and extract those artists that have been active for that maximum number of years. Wrap it into a component. Flow Variables might be helpful here.

Day 47. Make the previous component parametric by adding a configuration window where you can select the time window (<from year> <to year>). You can further parameterize your configuration window by allowing the top k artists to be extracted (i.e., artists with the largest number of tracks in the selected time window).

Day 48. Plot yearly number of tracks by artist in a Line Plot. Add color for each line, i.e. for each artist. Make the plot title parametric for the selected time window with the help of the String Manipulation (Variable) node. Inspect lines one by one and all together. Find out the artists who have been most consistently active across the years of your time window.

Day 49. Plot yearly number of tracks by artist in a Stacked Area Chart. Add color for each area, i.e. for each artist. Make the plot title parametric for the selected time window. Explore interactivity, especially how to add and remove areas for artists.

Day 50. Create a Bar Chart to inspect evolution over time. Visualize the number of tracks by artist over years in a bar chart. Add color for each bar, i.e. for each artist. Make the plot title parametric for the selected time window. Inspect bars one by one, in small groups, and all together. Find out the most prolific artist in a specific year of your time window.

Day 51. Let’s conclude this part with some free JavaScript code. Let’s investigate the Generic JavaScript View node. Do not worry, you do not need to code. Just explore the KNIME Hub for workflows and components based on the Generic JavaScript View node. Drag&drop the Animated Bar Chart component from the KNIME Hub into your workflow and study, for example, the evolution of artists and track counts throughout the years. Wrap all the time plots in a component and investigate selections of year and artists in the composite view.

  • Example components and visualizations based on free JavaScript code:
  1. FIFA World Cup, KNIME Hub
  2. Animated Bar Chart, KNIME Hub
  3. Epidemiological data from Zika virus, KNIME Hub
  4. Relations in data with scatter and 3-D scatter, KNIME Hub

9. Plots and Charts — Control

Day 52. Learn about Guided Analytics and Widget nodes.

Day 53. Let’s continue using the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. Build a guided analytics sequence with a component including a Widget-based selection framework to select the <from year> <to year> time window and extract the top k artists. Next, build a second component to filter artists of choice and a third component to build time plot views (e.g., line plot, bar chart, stacked area chart, etc.) to visualize the number of tracks by artist over a selected time period (e.g., 1970–1980).

Day 54. Using the top k artists extracted from the time window selection used in the first component, investigate the usage of the Interactive Range Slider Filter Widget node in conjunction with the Scatter Plot or the Stacked Area Chart node. Build a component with a scatter plot or stacked area chart and control the number of points via the Interactive Range Slider Filter Widget node, for example by controlling the size of the year interval. Alternatively, use the data table created with stratified sampling in section 7 to build a component with an interactive multivariate scatter plot and table view in conjunction with the Interactive Range Slider Filter Widget node to control for the year interval.

Day 55. Investigate the usage of the Refresh Button Widget node. After the component that uses Widget nodes to select the time window (e.g., 1970–1980) and extracts the top k artists, build a dashboard with a Widget-based framework to select the top k artists, a stacked area chart, a line plot, a bar chart, and a Refresh button.

Day 56. Combine the Refresh Button Widget node with the Interactive Range Slider Filter Widget node and with a stacked area chart to visualize the extracted top k artists in a selected time window within a dashboard. Watch the Refresh Button Widget node in action:

Figure 3. Guided analytics of most active artists.

10. Covariance and Correlation

Creating bar charts, pie charts, or time plots is definitely a great way to visually explore a dataset and gain precious insights. However, sometimes adding a tiny statistical flavor to the analysis improves our understanding of the data and, more importantly, enables a sounder identification of relationships between the input features.

Day 57. Learn about covariance, linear correlation, and rank correlation. Focus on the similarities and differences between them, as well as on their strengths and weaknesses.

Day 58. Compute the covariance for all pairs of the following features: “energy”, “loudness”, “danceability”, “valence” and “popularity” in the tracks.csv dataset using the GroupBy node. Build two covariance matrices, one with unnormalized and one with normalized (z-score normalization) features.

Day 59. Compute the linear correlation for all pairs of meaningful numeric features in the track dataset using the Linear Correlation node. Inspect the correlation matrix view and the results of the three output ports. What are the top four positively correlated feature pairs? Compare the values in the linear correlation matrix with those of the normalized covariance matrix. What do you observe?

Day 60. Compute the rank correlation for all pairs of meaningful numeric features in the track dataset using the Rank Correlation node. Inspect the correlation matrix view, and the results of the four output ports. What are the top four positively correlated feature pairs? Do you observe similarities with the linear correlation matrix? If yes, why do you think that’s the case?
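If you want to double-check Days 58–60 outside KNIME, a compact pandas sketch covers all three (feature list as in Day 58):

import pandas as pd

tracks = pd.read_csv("tracks.csv")
cols = ["energy", "loudness", "danceability", "valence", "popularity"]

# Covariance matrix on the raw features...
print(tracks[cols].cov())

# ...and on z-score normalized features. After z-scoring, the covariance
# of each pair equals its Pearson correlation, which explains the match
# you should observe with the Linear Correlation node.
z = (tracks[cols] - tracks[cols].mean()) / tracks[cols].std()
print(z.cov())

# Pearson (linear) and Spearman (rank) correlation matrices.
print(tracks[cols].corr(method="pearson"))
print(tracks[cols].corr(method="spearman"))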

Day 61. Visualize in a component composite view the unnormalized and normalized covariance matrices side-by-side both using a Heatmap node and a Table View node. Inspect in a different component composite view the linear correlation for the top four positively correlated features using the Scatter Plot node for each pair.

11. Text Visualization

There are not only numbers in data science! Besides numerical data, we have to deal with texts, images, networks, and even more diverse data types. In this section, a few items on text visualization.

Day 62. Get familiar with the KNIME Text Processing Extension, the Document object, the Term object and all text processing operations.

Day 63. Read the IMDb-sample.csv file from the KNIME Hub. This dataset collects 2000 movie reviews written by users and contains sentiment annotation (i.e., positive or negative) for each review. Convert all texts to Documents and perform some basic text cleaning.

Day 64. Using the file MPQA-OpinionCorpus-PositiveList.csv for the list of positive words in the English language and MPQA-OpinionCorpus-NegativeList.csv for the list of negative words in the English language (from the KNIME Hub or the latest version from the MPQA site), tag words in the texts as positive or negative.

Day 65. Let’s transform each text into a Bag of Words and calculate all Term Frequencies.
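In plain Python, the bag of words and (relative) term frequencies boil down to counting, as in this toy sketch (the real texts come from IMDb-sample.csv after the Day 63 cleaning):

from collections import Counter

review = "great movie great cast but a slow slow plot"
bag_of_words = Counter(review.split())
total = sum(bag_of_words.values())

# Relative term frequency: occurrences of a term / number of terms in the document.
term_frequencies = {term: count / total for term, count in bag_of_words.items()}
print(term_frequencies)  # e.g. {'great': 0.222..., 'slow': 0.222..., ...}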

Day 66. We have the list of words, we have their frequencies: let’s visualize them in a word cloud!

CONGRATULATIONS, YOU REACHED THE END!

12. Graph Visualization (Bonus)

As a bonus, let’s investigate something more complicated: how to visualize interactions among users of a community, like the retweeting patterns around a hashtag on Twitter.

Bonus. First, let’s learn what a Network Graph (or sometimes just a graph) is.

Bonus. Download the TwitterData.table around #knime for the time window July 26–August 3, 2021. Shape the Twitter data as an adjacency matrix.
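Conceptually, the adjacency matrix is a cross-tabulation of who retweeted whom; here is a toy pandas sketch (user names and column names are made up for illustration):

import pandas as pd

# Toy retweet edges; the real ones come from TwitterData.table.
edges = pd.DataFrame({
    "retweeter": ["alice", "alice", "bob", "carol"],
    "original_author": ["bob", "carol", "carol", "alice"],
})

# Adjacency matrix: rows = retweeters, columns = original authors,
# cell = number of retweets between the two users.
adjacency = pd.crosstab(edges["retweeter"], edges["original_author"])
print(adjacency)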

Bonus. The KNIME Network Mining Extension deals with graphs and network objects. You need to create a network of interactions among Twitter users before visualizing it. Create the network diagram from the adjacency matrix of Twitter interactions built in the previous step.

Bonus. Visualize the network object of Twitter user interactions with a graph.

Bonus. Another way of visualizing interactions is the chord diagram. Let’s learn what a chord diagram is and how to build one in KNIME using the Generic JavaScript View node.
