Do we always need a coach to win a medal? Let Apache Spark find it for us!!!

Malgorzata Sebastiampillai
Published in DataPebbles
Oct 29, 2021 · 7 min read

Apache Spark as a Big Data tool

In search of something new in the data world, I decided to learn Apache Spark, an analytics engine for big data. Spark achieves high performance for both batch and streaming data. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also ships with a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Photo credit: quick clipart, the Olympic Games, Olympic medal

What about learning Spark and finding the answer to the above question along the way?

This summer I was rather intrigued by the Olympic Games in Tokyo. I found it very interesting to see how many talented athletes attended the Olympics and how many of them redefined the boundaries of human endurance. Watching how motivated they are, how hard they try to achieve so much, and how they never give up is inspiring. Seeing them do everything and more to be the best in their profession inspired me to learn new things… and so along came Apache Spark and more…

To get my hands dirty, I reached for a Kaggle dataset. Since the Tokyo Olympics finished not that long ago, it was a perfect occasion to use the publicly available dataset.

The dataset from kaggle.com contains details of over 11,000 athletes across 47 disciplines and 743 teams taking part in the 2020 Tokyo Olympics (held in 2021), along with the medals they achieved.

Looking at the data, I found it very interesting to explore the correlation between medals achieved and athletes with or without coaches. To find out, let's follow the next steps.

Processing Data

After downloading the available datasets of Athletes, Coaches and Medals, it is time to do some processing.
First, let's go over the methods we need to get our data clean. Because we have three datasets, we will need to combine them and modify each based on the contents of the others. To do that we will use JOINS, but before that the files have to be loaded into Spark dataframes, as sketched below.
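A minimal sketch of the loading step, assuming the Kaggle Excel files were first exported to CSV with lower-cased column names such as noc and discipline (the originals ship as .xlsx):

from pyspark.sql import SparkSession

# Start a Spark session; in the Docker setup described later, the master URL
# would point at the Spark master container instead of running locally.
spark = SparkSession.builder.appName("tokyo-olympics").getOrCreate()

# File names and lower-cased column names are assumptions for this sketch.
df_athletes = spark.read.csv("Athletes.csv", header=True, inferSchema=True)
df_coaches = (spark.read.csv("Coaches.csv", header=True, inferSchema=True)
              .withColumnRenamed("name", "coaches_name"))
df_medal = spark.read.csv("Medals.csv", header=True, inferSchema=True)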

JOINS in Apache Spark

A join is one of the most widely used operations in SQL, and in Apache Spark it serves the same purpose: combining and retrieving data from two or more tables. In a real-world relational database, data is structured across a large number of tables, which is why there is a constant need to join these tables based on the logical relationships between them. Spark supports the following join types:

  • Inner joins (keep rows with keys that exist in the left and right datasets)
  • Outer joins (keep rows with keys in either the left or right datasets)
  • Left outer joins (keep rows with keys in the left dataset)
  • Right outer joins (keep rows with keys in the right dataset)
  • Left semi joins (keep the rows in the left, and only the left, dataset where the key appears in the right dataset)
  • Left anti joins (keep the rows in the left, and only the left, dataset where they do not appear in the right dataset)
  • Natural joins (perform a join by implicitly matching the columns between the two datasets with the same names)
  • Cross (or Cartesian) joins (match every row in the left dataset with every row in the right dataset)

To find out more about joins, follow up on our DataPebbles Pebble stack article, Know your joins in Apache Spark.
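As a quick illustration, the join type in PySpark is selected with the how argument. The dataframes left and right below are hypothetical, both assumed to share a column named "key":

# Hypothetical dataframes left and right, both with a column named "key"
left.join(right, on="key", how="inner")       # keys present in both sides
left.join(right, on="key", how="left_outer")  # all rows from the left side
left.join(right, on="key", how="left_semi")   # left rows with a match, left columns only
left.join(right, on="key", how="left_anti")   # left rows without a match
left.crossJoin(right)                         # every left row paired with every right row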

To answer the topic question, all three datasets (athletes, coaches and medals) need to be joined.

First, we left outer join the athletes dataframe with the coaches dataframe.

# Dataframe - athletes with their coaches, matched on country (noc) and discipline
athletes_with_coaches = df_athletes.join(
    df_coaches,
    (df_athletes.noc == df_coaches.noc)
    & (df_athletes.discipline == df_coaches.discipline),
    how='leftouter'
).select(df_athletes["*"], df_coaches["coaches_name"])

The next step is to filter the data to get the number of athletes with and without coaches.

from pyspark.sql.functions import col

# Athletes whose join found no coach (coaches_name is null), counted per country
athletes_without_coaches_count = athletes_with_coaches \
    .filter(col('coaches_name').isNull()) \
    .groupBy('noc').count().alias('athletes_without_coaches_count')

# Athletes with at least one matching coach, counted per country
athletes_with_coaches_count = athletes_with_coaches \
    .filter(col('coaches_name').isNotNull()) \
    .groupBy('noc').count().alias('athletes_with_coaches_count')

After all these transformations, a broadcast join can be used with the small dataset, Medals.xlsx.

BROADCAST in Apache Spark

A BroadcastHashJoin is a very common way for Spark to join two tables, under the special condition that one of them is small enough to be sent to every executor.

Benefits of a broadcast join:
- Parallelism of the large DataFrame is maintained (the number of output partitions matches the large side)
- No shuffle of the large DataFrame is needed

Example of a broadcast join with the medals table:

# Final dataframe containing a table of athletes with and without
# coaches, countries represented (noc) and the total count of medals
final_table = athletes_with_coaches_count.join(
        df_medal.hint("broadcast"),
        athletes_with_coaches_count.noc == df_medal.team_noc) \
    .join(athletes_without_coaches_count, on='noc') \
    .select(
        col('athletes_without_coaches_count.count')
            .alias('athletes_without_coaches_count'),
        col('athletes_with_coaches_count.count')
            .alias('athletes_with_coaches_count'),
        'noc',
        df_medal["total"].cast("int"))
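Equivalently, instead of the hint, the broadcast function from pyspark.sql.functions marks the small side explicitly. A minimal sketch of the first join:

from pyspark.sql.functions import broadcast

# Same first join as above, with the small medals dataframe broadcast explicitly
athletes_and_medals = athletes_with_coaches_count.join(
    broadcast(df_medal),
    athletes_with_coaches_count.noc == df_medal.team_noc)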

Docker

Docker is a developer tool that helps to build, share and run apps on-premises or in the cloud. Developers can package application code together with its dependencies and libraries into a deployable unit called a container.
Containers offer cost-effective scalability, disposability, application isolation, and are lightweight throughout, including only the OS processes and dependencies necessary to execute the code.
When you use Docker, you are creating and using images, containers, networks, volumes, plugins, and other objects.

Docker architecture

Docker uses a client-server architecture. The Docker client talks to the Docker daemon, which does the heavy lifting of building, running, and distributing Docker containers. The Docker client and daemon can run on the same system and communicate using a REST API, over UNIX sockets or a network interface.

Docker Architecture

Docker Compose — another Docker client, used to define and run applications consisting of a set of containers.

Dockerfile — defines the steps needed to create the image and how to run it. Each instruction creates a layer in the image. When the image needs to be rebuilt, only the layers whose instructions in the Dockerfile have changed are rebuilt; thanks to that, images are lightweight, small, and fast compared to other virtualization technologies.

Docker image — a read-only template with customized instructions for creating a Docker container.

Docker container — a runnable instance of an image. You can create, start, stop, move, or delete a container using the Docker API or CLI. Containers can be connected to networks, have storage attached, and can also be used to create a new image. More information about Docker can be found here.

Docker containers for the Tokyo Olympic application:

Docker containers

To deploy and configure multiple Docker containers at the same time, as in the picture above, a docker-compose.yml file is used. The file contains all configurations for the Airflow, Spark and database containers.

Example setup of Postgres database for Tokyo Olympic data in docker-compose.yml
Spark Master with 3 workers
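Since the original snippets survive only as screenshots, here is a minimal docker-compose.yml sketch of the same idea; the service names, image tags and credentials are assumptions, not the exact configuration used:

version: "3"
services:
  postgres:
    image: postgres:13               # image tag is an assumption
    environment:
      POSTGRES_USER: olympics        # placeholder credentials
      POSTGRES_PASSWORD: olympics
      POSTGRES_DB: tokyo_olympics
    ports:
      - "5432:5432"
  spark-master:
    image: bitnami/spark:3           # Spark image is an assumption
    environment:
      - SPARK_MODE=master
  spark-worker:
    image: bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    deploy:
      replicas: 3                    # the "Spark Master with 3 workers" layout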

Postgres and pgAdmin

PostgreSQL is an advanced open-source relational database that emphasizes extensibility and SQL compliance.
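For the results to show up in Postgres at all, the Spark job has to write them there first. A minimal sketch using Spark's JDBC data source; the hostname, database, table name and credentials below are assumptions matching the compose sketch above, and the PostgreSQL JDBC driver must be on the Spark classpath:

# Write the final results into the Postgres container over JDBC.
# URL, table and credentials are placeholders, not the original configuration.
final_table.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://postgres:5432/tokyo_olympics") \
    .option("dbtable", "athletes_medals") \
    .option("user", "olympics") \
    .option("password", "olympics") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()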

pgAdmin is an open-source administration and development platform for PostgreSQL.

With pgAdmin it is possible to log in directly to the Postgres containers and have a quick look or run simple queries. The example below illustrates how to connect to the Postgres container where the Tokyo Olympic database can be found.

PgAdmin setup

Apache Airflow

Apache Airflow is an open-source platform to programmatically author, schedule and monitor workflows.

Apache Airflow is:

  • Scalable — thanks to its modular architecture and a message queue
  • Dynamic — pipelines are defined in Python and can be generated dynamically
  • Extensible — users can define their own operators and extend libraries
  • Elegant — pipelines are lean and explicit
Airflow Connection: spark_default
Graph view of DAG
Example of SparkSubmitOperator running the spark-athletes application.
The output of an executed spark-submit job
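As the operator itself appears only as a screenshot, here is a minimal sketch of such a DAG; the dag_id, schedule and application path are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="tokyo_olympics",          # DAG name is an assumption
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,           # triggered manually
    catchup=False,
) as dag:
    spark_athletes = SparkSubmitOperator(
        task_id="spark_athletes",
        conn_id="spark_default",      # the Airflow connection shown above
        application="/opt/airflow/dags/spark_athletes.py",  # assumed path
        name="spark-athletes",
    )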

Answer to the question: Do we always need a coach to win a medal?

Based on the data, the answer is that having a coach does indeed help to win a medal; nevertheless, it is not always a prerequisite. The table below illustrates that having a coach is not always necessary for winning. The table is in descending order by the total number of medals achieved per country. It is clear that in countries like the People’s Republic of China, ROC, Great Britain, Australia and more, there are more athletes without coaches than with coaches.

Comparison table with the count of athletes with and without a coach for the countries with the highest total medal count.

The graph below shows the medal distribution together with the count of athletes with and without coaches. It also shows that smaller countries have a smaller representation of athletes, which also affects their chances of winning.

Count of the winning athletes with/without coaches

To summarise, it is clear that having a coach helps in winning at the Olympics; nevertheless, that is not the case in every country. Determination to win, even without a coach, is the key.

The Tokyo Olympic dataset was perfect for practising new skills with Apache Spark, Airflow and Postgres in Docker. Practice makes perfect, and this applies everywhere: in sports as well as in science and engineering. Therefore my motivation to continue learning is growing even more.
