DATA STORIES | SPORT VISUALIZATION | KNIME ANALYTICS PLATFORM

Knowledge Discovery likes Sports

Five interesting facts about the Olympic Games that you probably did not know

Roberto Cadili
Low Code for Data Science
15 min read · Oct 18, 2021

Photo by Bryan Turner on Unsplash.

Between July 23rd and August 8th, 2021, Tokyo hosted the Games of the XXXII Olympiad, more commonly known as the 2020 Summer Olympics. For sport lovers, the Olympic Games are the perfect occasion to watch a wide range of sport events and get inspired by the world’s best athletes. For data scientists, the Games represent a golden opportunity to obtain fresh new data and engage in a variety of data mining tasks: from the elaboration of summary statistics and the extraction of patterns to the creation of predictive analytics or machine learning-driven applications.

In this article, we will use the Olympics Athlete Events Analysis dataset to aggregate, visualize, and discover five interesting facts about the modern Olympic Games that you probably didn’t know. The tool of choice to extract insightful information from our dataset is KNIME Analytics Platform, the open source software that leverages visual programming to make data science creation intuitive and accessible to everyone.

Before diving into the data-driven knowledge discovery, a few words about the modern Olympic Games. Their creation was inspired by the ancient Olympic Games held in Olympia, in Ancient Greece. In 1896, the first modern Games were held in Athens and were motivated by Baron Pierre de Coubertin’s commitment to revamping the ancient sport competition. He founded the International Olympic Committee (IOC) from which many National Olympic Committees (NOCs) –representing each competing country– would later derive. Celebrated every 4 years, the modern Olympics have become a leading international sporting event over time –featuring summer and winter sport competitions– in which thousands of athletes from more than 200 nations participate.

The Dataset & the Pivoting node

The Olympics Athlete Events Analysis dataset is freely available on Kaggle, and it contains comprehensive records on the athletes, sports, and events of the modern Olympic Games from 1896 to 2016. The dataset contains 271,116 rows, and each entry refers to the participation of an athlete in one or more editions of the Olympic Games, or related sport events. Each athlete is identified by a unique ID and described by name, sex, age, height, weight, Olympic team, NOC, sport, event in which he/she competed, medal won (if any), Games edition and year, season, and host city of the Games. Except for “ID” and “Year”, which are numerical attributes, all other 13 attributes are nominal.

In the world of professional sports, such as the Olympics, extracting meaningful information on athletes’ performance, characteristics, or countries’ attendance often relies on some kind of aggregation measure. Moreover, data aggregation can be of key importance to transform, summarize and reshape the data for the creation of meaningful visualizations. To perform a wide variety of aggregation operations, KNIME Analytics Platform offers several native aggregation nodes, such as the GroupBy, Column Aggregator and Pivoting nodes.

Note. If you would like to read more on Olympic data aggregations –from simple methods to more complex measures– using the GroupBy node in KNIME Analytics Platform, have a look at Aggregations, Aggregations, Aggregations! — Part I and Aggregations, Aggregations, Aggregations! — Part II in the journal Low Code for Advanced Data Science.

In this article, we will largely use the Pivoting node to aggregate the data before visualizing them. This node requires a three-step configuration to generate a pivot table (Figure 1 and Figure 2).

  1. The Groups tab defines the group of data (customers, dates, pages, or whatever else). It requires us to select the column(s) whose values are used to build the groups.
  2. The Pivots tab requires us to select the column(s) whose values are used to build the column headers.
  3. The Manual Aggregation tab sets one or more aggregation methods for one or more selected columns. Notice that different aggregation methods are available for different attributes, depending on their type. This means that available aggregation methods vary if the input column is, for example, a string or an integer.
Figure 1. Data aggregation using the Pivoting node. On the right-hand side, the original dataset. On the left-hand side, the pivot table: group (red), pivot (green), aggregation counting unique IDs (blue).
Figure 2. The Pivoting node has a three-step configuration.
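To make the three configuration steps concrete outside of KNIME, here is a minimal sketch in plain Python (standard library only). It is not how the Pivoting node is implemented; it only mimics the same idea of group, pivot, and aggregation. The column names `ID`, `Sex`, and `Games` are assumed to match the dataset, the rows are invented for illustration, and `pivot_unique_count` is a hypothetical helper.

```python
from collections import defaultdict

# Toy rows in the shape of the dataset (values invented for illustration).
rows = [
    {"ID": 1, "Sex": "M", "Games": "1996 Summer"},
    {"ID": 1, "Sex": "M", "Games": "1996 Summer"},  # same athlete, second event
    {"ID": 2, "Sex": "F", "Games": "1996 Summer"},
    {"ID": 3, "Sex": "M", "Games": "2000 Summer"},
    {"ID": 4, "Sex": "F", "Games": "2000 Summer"},
    {"ID": 5, "Sex": "F", "Games": "2000 Summer"},
]


def pivot_unique_count(rows, group, pivot, value):
    """Return {group value: {pivot value: number of distinct `value` entries}}."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        # Step 1 (group) picks the row, step 2 (pivot) picks the column,
        # step 3 (aggregation) collects distinct values in that cell.
        seen[row[group]][row[pivot]].add(row[value])
    return {g: {p: len(ids) for p, ids in cols.items()} for g, cols in seen.items()}


table = pivot_unique_count(rows, group="Games", pivot="Sex", value="ID")
for games, counts in sorted(table.items()):
    print(games, counts)
```

Because the aggregation counts distinct IDs rather than rows, an athlete who entered two events at the same Games is counted only once.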

Note. If you would like to know more about building a pivot table in KNIME Analytics Platform, have a look at How to Build Pivot Tables on the KNIME Blog.

Let’s start with the five interesting facts emerging from the analysis of this dataset.

# 1 — Gender disparity and boycott

Gender disparity in sport participation was rampant for the better part of the last century and has been declining steadily only since the 1980s. The drops in athlete participation at the 1932, 1956, and 1980 Summer Olympics are attributed largely to the Great Depression (1932) and to country-promoted boycotts (1956 and 1980), and they affected men’s participation dramatically. The drop in women’s participation at those editions is less noticeable than that of their male counterparts.

Figure 3. Athlete participation by gender in the Winter and Summer Games. As a result of boycotts, three major downward spikes can be observed in male athlete participation in the Summer Games.

We start our analysis by inspecting athlete participation in the Summer and Winter Games over time. In particular, we are interested in the evolution of male and female participation to check whether women are equally represented with respect to their male counterparts.

We first build a pivot table with Games as groups and sex as pivots, and we aggregate by counting unique IDs. Next, we create two time plots using the Line Plot (Plotly) node to visualize the data. This node is part of the KNIME Plotly extension, which supports several Plotly-based visualizations in KNIME Analytics Platform. The nodes of this extension integrate features of the KNIME JavaScript Views with additional functionalities typical of the Plotly library, such as the dynamic toolbar to interact with the visualization. For example, it is possible to download the plot, zoom in, zoom out, select areas of it, or auto-scale it. In addition, by ticking the “Enable link to Plotly editor” box in the Control Options tab of the configuration window, the nodes of the KNIME Plotly extension display the command “Edit chart”, which allows us to edit and export the chart dynamically in the Plotly Chart Studio interface.

Note. The functionalities included in the free version of Plotly Chart Studio are limited and require authentication. You can check the list of free functionalities on the Plotly page.

Inspecting the time plots, we can observe that participation of male and female athletes over time is uneven (Figure 3). In the first editions of the modern Olympics, female athletes are almost absent from the competitions, making the Games a predominantly male event. From the 1920s onwards, the number of female athletes increased steadily with a surprisingly steep trend line after the 1980s. While male athletes continue to this day to represent the hegemonic gender in both the Winter and Summer games, the presence gap of their female counterparts is steadily closing. Indeed, by hovering the mouse cursor over the line plots, we can see that in the 2016 Summer Olympics the number of female athletes was 5034, and that of male athletes was 6145 –the smallest gender participation gap since the beginning of the modern Olympic Games.
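As a quick worked check of the 2016 figures quoted above, the gap and the female share can be computed directly. The two counts come from the text; the variable names are only illustrative.

```python
# Athlete counts for the 2016 Summer Olympics, as read from the line plot.
female_2016 = 5034
male_2016 = 6145

gap = male_2016 - female_2016
female_share = 100 * female_2016 / (female_2016 + male_2016)

print(f"gap: {gap} athletes, female share: {female_share:.1f}%")
```

In other words, women made up roughly 45% of the athletes at that edition.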

Moreover, a closer inspection of the Summer Game time plot reveals a few more interesting facts. In the 1932, 1956 and 1980 Summer Olympics, we can observe three abrupt large downward spikes, which are more prominent for male participants. Those spikes are easily explained if we acknowledge that the Games are very often a clear reflection of global economic and geo-political transformations and disputes. The 1932 Summer Games were held during the Great Depression, which eroded the worldwide financial stability. As a result, many countries that competed in the 1928 Games could not afford to sponsor athletes in the 1932 Olympics. On the other hand, the 1956 and 1980 Summer Games registered a decline in the participation rates because of major boycotts. The most important of those boycotts occurred in the 1980 Moscow Summer Games and was led by the USA in response to the Soviet-Afghan War. As a result of the protest, 66 nations did not participate in the Games.

# 2 — Winners will be winners

Historically, in the top ten of gold medal winning countries, the USA is the country with the highest count, followed by Russia (formerly URS, then RUS, and now ROC), and old and new Germanies (GER and GDR). In the 2020 Tokyo Olympic medal count, the top ten still features the usual suspects.

Figure 4. Animated bar chart of the 10 top gold medal winning countries of all time.
Figure 5. Top ten gold medal winning nations at the 2020 Tokyo Summer Olympics (Source: 2020 Tokyo Olympics) with independent colour coding (left) and same colour coding of the animated bar chart (right).

Inspired by a famous song by Queen — “Friends will be friends”, in this section we reinterpret the song title in the perspective of the Olympic Games. We inspect which are the top 10 Summer Games gold medal winning nations of all time, and whether their leadership has remained solid and undisputed also during the 2020 Tokyo Summer Olympics.

To extract this information from our dataset, we need to aggregate the data using the Pivoting node. We build a pivot table with countries and years as groups, medal types as column headers, and we aggregate by counting unique sport events.
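Why count unique sport events rather than rows? The sketch below illustrates one plausible reason, assuming the data is recorded per athlete: a team medal then appears once for every team member, and counting rows would inflate the tally. This is plain Python with invented rows and assumed column names (`NOC`, `Year`, `Event`, `Medal`), not the KNIME workflow itself.

```python
from collections import defaultdict

# Toy athlete-level rows (invented for illustration).
rows = [
    # A relay gold appears once per team member in athlete-level data.
    {"NOC": "USA", "Year": 2016, "Event": "4x100m Relay", "Medal": "Gold"},
    {"NOC": "USA", "Year": 2016, "Event": "4x100m Relay", "Medal": "Gold"},
    {"NOC": "USA", "Year": 2016, "Event": "100m", "Medal": "Gold"},
    {"NOC": "USA", "Year": 2016, "Event": "200m", "Medal": "Silver"},
    {"NOC": "GBR", "Year": 2016, "Event": "200m", "Medal": "Gold"},
    {"NOC": "GBR", "Year": 2016, "Event": "Marathon", "Medal": None},  # no medal
]

# Groups: (country, year). Pivot: medal type. Aggregation: distinct events.
events = defaultdict(lambda: defaultdict(set))
for row in rows:
    if row["Medal"] is not None:
        events[(row["NOC"], row["Year"])][row["Medal"]].add(row["Event"])

medal_table = {
    key: {medal: len(evts) for medal, evts in medals.items()}
    for key, medals in events.items()
}

for key, medals in sorted(medal_table.items()):
    print(key, medals)
```

Here the two relay rows collapse into a single gold medal for the country, which is the number a medal ranking expects.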

Next, we want to visualize the medal ranking. We use the Animated Bar Chart verified component to create a dynamic bar chart visualization of the top gold medal winning nations of all time. Additionally, we can visualize the top 10 gold medal winning countries of the 2020 Tokyo Olympics using a bar chart and assigning a different colour to each bar, i.e. each country.

To achieve this, we first need to manipulate our data and bring them in the required shape. We can then input the reshaped data into the Bar Chart node, and import the colour information. Figure 6 shows the content of the “Colour Bar Chart” metanode with the node sequence that is required to reshape the data and assign a different colour to each bar.

Figure 6. Node sequence to assign a different colour to each bar in a bar chart.

Let’s have a closer look at the involved steps and nodes:

  1. Color Manager. We use this node to assign a colour to each row, i.e. the country name, in our table.
  2. String Manipulation. We append a copy of type String of the column containing the country names.
  3. Pivoting. We reshape the data to create a pivot table. First, we group by country name, we use the row values of the column created in step 2 as column headers of the pivot table, and we aggregate by summing the gold medal values.
  4. Bar Chart. We input the reshaped data and the colour information. If we wish to obtain wide-looking bars, in the General Plot Options tab select “Stacked” from the “Chart type” section.
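The reshaping in steps 2 and 3 can look odd at first, so here is a small sketch of what the pivot produces. It is written in plain Python with hypothetical country codes and invented medal counts, and the interpretation (one column per country, so that each bar can be treated as its own series) is our reading of the steps above rather than a description of the node internals.

```python
# (country, string copy of country, gold medals); values invented for illustration.
rows = [
    ("AAA", "AAA", 12),
    ("BBB", "BBB", 7),
    ("CCC", "CCC", 3),
]

# Pivot: group by country, use the copied column as headers, sum the gold medals.
wide = {}
for country, header, gold in rows:
    cells = wide.setdefault(country, {})
    cells[header] = cells.get(header, 0) + gold

headers = sorted({header for _, header, _ in rows})
for country in sorted(wide):
    # Cells off the diagonal have no data; they are shown here as "-".
    print(country, [wide[country].get(h, "-") for h in headers])
```

The result is a diagonal table: every country has a value only in its own column, so each column holds exactly one bar and can receive its own colour.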

Note. If you wish to find out how to obtain coloured bar charts with a different data manipulation approach, have a look at How to Assign Colors to Bars in a Bar Chart — Three Shades of Green on the KNIME Blog.

In Figure 4 we can observe that the USA is by far the country that has won the most gold medals in the history of the modern Olympic Games, followed by the now dissolved Soviet Union, and Great Britain. It is interesting to see that the 9th and 10th places go to Russia and the now dissolved German Democratic Republic, respectively.

However, we should make a few considerations before drawing conclusions too quickly. Due to major historical and geopolitical transformations occurring over time, several countries changed their NOC codes, as the case of the Soviet Union and Russia shows. Similarly, China participated in the Olympic Games using different codes: Republic of China (ROC), People’s Republic of China (PRC), and China (CHN). Indeed, if we aggregated gold medals to account for countries’ historical transformations, Russia might be ranked second and Germany third. Likewise, China might outperform Russia, Great Britain, or France. Reconstructing the historical evolution of each competing nation, and aggregating or disaggregating data accordingly (e.g., former Soviet Republics that are counted as Soviet Union), is not a trivial task and goes beyond the purpose of this article.

Despite the historical complexity, we can still compare the bar chart in Figure 4 with the ranking of the top ten gold medal winning nations at the 2020 Tokyo Summer Olympics (Figure 5), and identify a few interesting facts. The USA continues to dominate the ranking as the country that won the most gold medals. Similarly, China, Russia –which participated as the Russian Olympic Committee (ROC)–, and several European nations –namely France, Great Britain, Germany, and Italy– continue to occupy the top positions in the ranking, as mirrored in the animated bar chart. All in all, it looks like winners will be winners.

# 3 — Many countries, same athlete

Over the numerous editions of the Games, several athletes represented more than one country. Among gold medal winning athletes, the largest group competed both for the German Democratic Republic and Germany with 62 athletes, followed by the Unified Team (EUN) and Russia with 45 athletes. The best performing athlete, Birgit Fischer-Schmidt, won 8 gold medals and represented both the German Democratic Republic and Germany.

Figure 7. Most frequent country combinations for gold medal winning athletes.

In the previous section, we discovered that the geopolitical transformations occurring over time are often mirrored in the country code with which each competing nation is identified. We now want to inspect which are the athletes that have represented more than one nation and have won at least one gold medal. In this case, we do not discriminate between the Winter and Summer Games.

We leverage the versatility of the GroupBy and Pivoting nodes to aggregate our dataset separately:

  1. Pivoting. We build a pivot table with athlete IDs and names as groups, medal types as pivots, and we aggregate by counting sport events.
  2. GroupBy. We group by athlete IDs and names, and we aggregate by concatenating unique countries.

Next, we filter out athletes who competed for only one country, retain those who won at least one gold medal, and join the results of the two aggregation nodes. To produce a meaningful visualization, we further keep only the country combinations whose frequency is larger than two.
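The aggregate, filter, and count logic described above can be sketched in a few lines of plain Python. The rows are invented, and the gold count below is simplified to counting rows rather than unique sport events, so treat it as an illustration of the idea, not a reproduction of the workflow.

```python
from collections import Counter, defaultdict

# (athlete ID, NOC, medal); toy rows invented for illustration.
rows = [
    (1, "GDR", "Gold"), (1, "GER", "Gold"),
    (2, "GDR", "Gold"), (2, "GER", None),
    (3, "GDR", None), (3, "GER", "Gold"),
    (4, "EUN", "Gold"), (4, "RUS", None),
    (5, "USA", "Gold"),                   # single country: filtered out
    (6, "GDR", None), (6, "GER", None),   # no gold medal: filtered out
]

countries = defaultdict(set)  # GroupBy: unique countries per athlete
golds = Counter()             # simplified stand-in for the Pivoting gold count
for athlete_id, noc, medal in rows:
    countries[athlete_id].add(noc)
    if medal == "Gold":
        golds[athlete_id] += 1

# Keep multi-country gold medalists and count each country combination.
combos = Counter(
    ", ".join(sorted(nocs))
    for athlete_id, nocs in countries.items()
    if len(nocs) > 1 and golds[athlete_id] >= 1
)

# Keep only combinations whose frequency is larger than two.
frequent = {combo: n for combo, n in combos.items() if n > 2}
print(frequent)
```

Sorting the country codes before joining them ensures that "GDR, GER" and "GER, GDR" are treated as the same combination.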

In the component “Athlete and country visualization”, we wrap the Color Manager, the Tag Cloud and the Table View nodes to create an interactive view and assign the same colour to country combinations and the corresponding athletes. We can now inspect our results dynamically and visualize selected rows only by clicking the “Show selected rows only” filter (Figure 7). We can observe that, among gold medal winning athletes, the most frequent country combination is German Democratic Republic-Germany with 62 athletes, followed by Unified Team (EUN)-Russia with 45 athletes, and West Germany (FRG)-Germany with 37 athletes. The best performing athlete, Birgit Fischer-Schmidt, won 8 gold medals and represented both the German Democratic Republic and Germany. Geopolitical transformations are reflected in the country combination frequencies, most notably the historical changes that occurred in Germany and Russia after World War II and the rise of the Communist Bloc.

The most peculiar country combinations, however, are those with the lowest frequency (freq = 3). Among those, we can find athletes that competed both for India and Pakistan, Yugoslavia and Austria, and Hungary and the USA.

# 4 — Winning isn’t everything

Winning isn’t always the most important aspect in sport competitions. Perseverance and determination are usually more important traits to pursue a career in sport. The sports with the highest number of tenacious athletes are Athletics (1098), Shooting (368) and Swimming (352). These athletes competed in three or more Olympic Games without winning a single medal.

Figure 8. Most tenacious, non-medal winning athletes per sport.

Athletes love to win medals, and so does the cheering audience as it watches its nation outperforming others. But is winning medals all that matters? What about the joy of competing or the tenacity to never give up despite failures? Aren’t those athletes and sports worth being remembered?

In this section, we aggregate our dataset to account for sports whose athletes participated in multiple editions of the Games but never won a medal. With the help of the Pivoting node, we group by ID, name, and sport, we pivot by medal types and aggregate by counting unique games. Next, we filter out missing values and medal winning athletes, and retain only those who have competed in three or more games without winning a single medal.
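A minimal sketch of this filter in plain Python follows. The rows are invented for illustration, and the tuple layout (athlete ID, sport, Games, medal) is an assumption about how the relevant columns could be arranged.

```python
from collections import Counter, defaultdict

# (athlete ID, sport, Games, medal); toy rows invented for illustration.
rows = [
    (1, "Athletics", "1992 Summer", None),
    (1, "Athletics", "1996 Summer", None),
    (1, "Athletics", "2000 Summer", None),
    (2, "Athletics", "1992 Summer", None),
    (2, "Athletics", "1996 Summer", "Bronze"),  # medal winner: excluded
    (2, "Athletics", "2000 Summer", None),
    (3, "Shooting", "1996 Summer", None),
    (3, "Shooting", "1996 Summer", None),       # two events, same Games
    (3, "Shooting", "2000 Summer", None),
    (4, "Shooting", "1992 Summer", None),
    (4, "Shooting", "1996 Summer", None),
    (4, "Shooting", "2000 Summer", None),
]

games_by_athlete = defaultdict(set)  # unique Games per (athlete, sport)
medalists = set()
for athlete_id, sport, games, medal in rows:
    games_by_athlete[(athlete_id, sport)].add(games)
    if medal is not None:
        medalists.add((athlete_id, sport))

# Keep athletes with three or more Games and no medal at all.
tenacious = {
    key: len(games)
    for key, games in games_by_athlete.items()
    if len(games) >= 3 and key not in medalists
}

per_sport = Counter(sport for _, sport in tenacious)
print(per_sport)
```

Counting unique Games instead of rows matters here: athlete 3 has three rows but attended only two editions, so that athlete does not pass the threshold.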

We visualize the aggregated data in the “Visualize most tenacious athletes/sports” component. For the interactive view, we employ the Tag Cloud and the Table View node. We control the colour assignment to sports and athletes with the Color Manager node, and we add the Interactive Range Slider Filter Widget node to filter athletes.

In Figure 8, we can see that the sport with the highest number of tenacious athletes is Athletics with 1098 entries, followed by Shooting (368) and Swimming (352). By sliding the interactive range filter, we can retain or exclude athletes by the count of non-won medals. We can check, for example, in which sports athletes unsuccessfully competed three to five times. In the upper bound of the filtered results, we find athletes in Shooting, Luge, Athletics and Water Polo, while on the lower bound we find athletes in Badminton, Alpine Skiing, and Canoeing. While these sportspeople will not be remembered for their victories, they do deserve our admiration for their determination and perseverance.

# 5 — The true story of Tarzan

The American competitive swimmer Johnny Weissmuller participated in the 1924 and 1928 Summer Olympics, where he won several medals. In the early 1930s, he was cast to play Tarzan in a series of films about the fictional character.

Figure 9. Johnny Weissmuller’s performance at the 1924 and 1928 Summer Olympic Games.

The last piece of information that we extract from our dataset has to do with Tarzan, the fictional character of a feral child raised in the African jungle by great apes. You might be surprised to know that Tarzan did participate in the Olympic Games, not once but twice.

Johnny Weissmuller was an American competitive swimmer who participated in the 1924 and 1928 Summer Olympics, and won several gold medals and one bronze medal as a Water Polo player. His sport career did not last too long since in 1932 he was cast to play Tarzan in a series of films. Perhaps, given these circumstances, we should rethink the way we tell the (true) story of Tarzan: from a feral child to an Olympic champion.

Reporting and displaying this simple piece of information using KNIME can be a lot of fun. Using the Refresh Button Widget node included in the latest software release, we can design a workflow that allows the interactive selection and visualization of the athlete’s competitions and victories. This node works by producing a series of reactivity events that trigger the re-execution of downstream nodes in a component: we connect its variable output port to the nodes that we wish to re-execute. This means that we can now interact more easily with the data input into our component without leaving the component interactive view, and create dynamic visualizations to make the UI even more insightful and enjoyable.

Further customization and beautification of the interactive view of our components is possible by adding images. In KNIME Analytics Platform, we can import images using, for example, the Image Reader (Table) node. We can then use the Renderer to Image node to convert the images to PNG format and import them into the Tile View or the Table View node. Another option to embed images in a component view is to use the Image Output Widget node, which we use to display a picture of Johnny Weissmuller (Figure 9).

Using the Tile View node, we select the tile with the corresponding Game edition. Next, we click on the “Competition details” button where the Table View node with the pertinent Game, sport, event, and sport image information is refreshed and displayed (Figure 9). As we already anticipated, Tarzan was a super Olympic athlete!

Summary

In this article, we have engaged in a data-driven knowledge discovery and visualization using the Olympic Games dataset. The goal was to extract interesting facts that you probably did not know.

As a preliminary step before visualization, we aggregated data using the Pivoting node. Therefore, we briefly introduced the functioning of the Pivoting node and its intuitive three-step configuration.

We then proceeded to identify, extract, and visualize interesting pieces of information about the Olympic Games, from athlete participation trends and boycotts to Tarzan’s participation in the competitions. We experimented with a few JavaScript View nodes, shared visualization tips and tricks, and enjoyed the interactivity offered by the components’ view and the new Refresh Button Widget node.

With KNIME, knowledge discovery is easy, interactive, and a lot of fun! What interesting insights can you discover in your dataset?

The workflow presented in this article can be downloaded for free from the KNIME Hub.

References

Paris [animation] — Photo by Anthony DELANOIX on Unsplash.

Amsterdam [animation] — Photo by Massimo Virgilio on Unsplash.

Water polo player [animation] — Photo by Weston Eichner on Unsplash.

All other photos are released under the Creative Commons license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

Roberto Cadili
Low Code for Data Science

Data scientist at KNIME, NLP enthusiast, and history lover. Editor for Low Code for Data Science.