Exploratory Analysis: What are the best cardio workouts based on ratings?

Published in INST414: Data Science Techniques, Sep 15, 2024

Written by Gabrielle Waugh

Introduction

Fitness and exercise have become increasingly popular, making it important to identify which workouts best support individual goals in the gym. This analysis addresses the research question: “Which cardio workouts receive the highest user ratings?” Using data from the megaGymDataset available on Kaggle, it identifies the top-rated cardio workouts.

The primary stakeholders in this analysis are personal trainers, gym owners, athletes, and gym enthusiasts. Personal trainers can tailor their recommendations to their clients’ fitness goals, gym owners can design more appealing and beneficial workout programs, athletes can refine their training, and gym enthusiasts can improve their personal routines. All of these stakeholders benefit from knowing which cardio workouts users rate most highly, since that information can be fitted into different routines depending on individual fitness goals.

By providing insights into the average ratings of cardio workouts from people who regularly exercise, this analysis offers actionable guidance that can enhance fitness practices, inform class offerings, improve training schedules, and refine personal training programs. Thus, it supports stakeholders in making informed decisions about workout selection and program development, ultimately contributing to more effective and satisfying fitness experiences.

What data would be helpful in answering this question?

Several types of data are needed to determine which cardio workouts receive the highest user ratings. Workout names are necessary for recognizing and distinguishing between different cardio exercises. A field identifying the workout type makes it possible to filter out non-cardio exercises and keep the analysis relevant. User ratings provide numerical feedback on each workout, allowing workouts to be assessed and ranked based on user evaluations. Rating descriptions such as “Excellent” or “Average” add context to the numerical ratings and offer a deeper understanding of user perceptions. The number of ratings each workout has received indicates both its popularity and the reliability of the feedback, since a score based on more ratings is more trustworthy. Although supplementary, user demographics such as age and fitness level can show how different groups perceive various workouts. Finally, information on workout duration and intensity may reveal how these factors correlate with user ratings. Together, these data types enable a comprehensive evaluation of cardio workouts, highlighting which are most highly rated and why.

What data was used?

As previously stated, this analysis used data from the megaGymDataset on Kaggle, a popular platform for data science competitions that hosts a wide range of datasets for analysis and machine learning projects. The megaGymDataset includes several fields relevant to assessing cardio workouts. The “Title” field contains the common names of the workouts. The “Type” field specifies the workout type, which allows the analysis to focus exclusively on cardio workouts. The “Rating” field contains numerical values between zero and ten, representing user feedback on each workout. The “RatingDesc” field provides a qualitative description of that feedback (where available), adding context to the numerical rating. These four fields are the basis for identifying the best cardio workouts, as rated by people who exercise regularly.
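As a rough illustration of how these fields could be pulled out with pandas, here is a minimal sketch. The four column names come from the description above; the helper name `load_rating_fields` and the local filename `megaGymDataset.csv` are assumptions made for this example, not details confirmed by the analysis.

```python
import pandas as pd

# Fields described in the article.
FIELDS = ["Title", "Type", "Rating", "RatingDesc"]


def load_rating_fields(source):
    """Read a CSV (path or file-like object) and keep only the four fields above.

    `load_rating_fields` is a hypothetical helper written for this sketch.
    """
    df = pd.read_csv(source, usecols=FIELDS)
    # Reorder so the columns always appear in the order listed in FIELDS.
    return df[FIELDS]


# Example call, assuming the Kaggle file was saved locally under this name:
# workouts = load_rating_fields("megaGymDataset.csv")
# print(workouts.dtypes)
```

Reading only the needed columns keeps the working table small and makes it obvious which fields the rest of the analysis depends on.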

Data Cleaning

The data cleaning process involved several key steps to ensure accuracy and usability. First, the dataset was filtered to focus only on cardio workouts by examining the “Type” field. Missing values in the “Rating” field were addressed by imputing values where appropriate or excluding those entries to avoid skewing the results. Any extraneous or irrelevant columns, such as “Desc,” were removed to streamline the dataset. Additionally, duplicate records were identified and removed to ensure each workout was represented only once. These steps helped to create a clean, focused dataset, enabling a more accurate and reliable analysis of cardio workout ratings.
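The cleaning steps above could look roughly like the following in pandas. This is a sketch under stated assumptions, not the exact code used for the analysis: it assumes cardio rows are labeled exactly “Cardio” in the “Type” field, it takes the exclusion route for missing ratings rather than imputing them, and it treats rows sharing a “Title” as duplicates. The sample rows are invented purely for illustration and are not taken from the dataset.

```python
import pandas as pd


def clean_cardio(df):
    """Filter to cardio rows, drop the unused text column, and remove gaps and duplicates."""
    # Keep only cardio workouts (the exact label "Cardio" is an assumption).
    cardio = df[df["Type"] == "Cardio"].copy()
    # Remove the free-text description column if it is present.
    cardio = cardio.drop(columns=["Desc"], errors="ignore")
    # Exclude rows without a rating (imputation is the alternative mentioned above).
    cardio = cardio.dropna(subset=["Rating"])
    # Keep one row per workout title (assumed definition of a duplicate).
    cardio = cardio.drop_duplicates(subset=["Title"], keep="first")
    return cardio.reset_index(drop=True)


# Toy rows invented for illustration only.
sample = pd.DataFrame({
    "Title": ["Workout A", "Workout A", "Workout B", "Workout C"],
    "Desc": ["", "", "", ""],
    "Type": ["Cardio", "Cardio", "Cardio", "Strength"],
    "Rating": [8.0, 8.0, None, 7.0],
})
cleaned = clean_cardio(sample)
print(cleaned)
```

Dropping unrated rows keeps the averages tied to real feedback. Filling them with a mean would keep more workouts in the table but pull their scores toward the middle.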

Exploratory Data Analysis

The primary objective of the exploratory data analysis was to identify the top-rated cardio workouts based on user feedback. Working from the cleaned dataset, which contained only cardio workouts with a usable “Rating” value, the analysis computed the average rating for each workout and ranked the workouts from highest to lowest. A bar chart was then used to compare the average ratings across cardio workouts and to spot trends or standout exercises. Through this process, the analysis identified which cardio workouts users rated most highly.

Fig.1 Cardio workouts graphed by average rating
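A ranking like the one described above can be sketched with a pandas group-by. The function name `top_rated` and the output column names are hypothetical, and the sample rows are invented for illustration rather than taken from the dataset. The sketch also counts how many ratings feed each average, since an average over very few rows deserves less weight. If each title appears only once in the data, the mean is simply that workout's rating.

```python
import pandas as pd


def top_rated(cardio, n=5):
    """Return the n workouts with the highest average rating, with rating counts."""
    summary = (
        cardio.groupby("Title")["Rating"]
        .agg(avg_rating="mean", n_ratings="count")
        .sort_values("avg_rating", ascending=False)
    )
    return summary.head(n)


# Toy rows invented for illustration only.
sample = pd.DataFrame({
    "Title": ["Workout A", "Workout A", "Workout B", "Workout C"],
    "Rating": [9.0, 8.0, 7.0, 9.5],
})
top = top_rated(sample, n=2)
print(top)

# A horizontal bar chart could then be drawn (requires matplotlib):
# top["avg_rating"].sort_values().plot.barh()
```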

Findings

The exploratory analysis showed that users rated several cardio exercises exceptionally highly. Jumping rope was the top-rated exercise, with an average rating of 9.2. The stair climber received a rating of 9.1. Rowing and bicycling were both rated 8.9, and burpees scored 8.8. These results highlight the cardio workouts most favored by people who exercise frequently, which is useful for anyone looking to build a cardio routine around the highest-rated workouts. The high ratings suggest that users see these exercises as delivering significant fitness benefits and that they are particularly appreciated within the fitness community.

Common Bugs

Several common issues can arise during data cleaning and analysis. One frequent problem is missing values in the “Rating” field, which can affect the accuracy of the results. To address this, entries with missing ratings were either replaced with average values or excluded from the analysis, depending on the extent and distribution of the missing data. Another issue is duplicate records, which were identified and removed so that each workout was represented only once. Finally, incorrect or inconsistent entries in the “Type” or “RatingDesc” fields can lead to misleading conclusions, for example when a cardio workout is labeled inconsistently and silently drops out of the filter. Checking that these fields were categorized consistently was a critical step in mitigating this risk.
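One way to catch these issues early is a quick audit before any filtering. The sketch below is an assumption about how such a check might look, not the code used in the analysis: the helper name `audit` is hypothetical, the label normalization (trimming whitespace and title-casing) is one possible convention, and the sample rows are invented for illustration.

```python
import pandas as pd


def audit(df):
    """Summarize missing ratings, repeated titles, and the distinct Type labels."""
    # Normalize labels so " cardio" and "CARDIO " are counted as the same type.
    types = df["Type"].str.strip().str.title()
    return {
        "missing_ratings": int(df["Rating"].isna().sum()),
        "duplicate_titles": int(df["Title"].duplicated().sum()),
        "type_labels": sorted(set(types.dropna())),
    }


# Toy rows invented for illustration only.
sample = pd.DataFrame({
    "Title": ["Workout A", "Workout A", "Workout B", "Workout C"],
    "Type": ["Cardio", " cardio", "CARDIO ", "Strength"],
    "Rating": [9.0, None, 8.0, None],
})
report = audit(sample)
print(report)
```

Comparing the normalized label list against the raw values in “Type” shows whether any rows would have been missed by an exact-match filter.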

Limitations

The analysis also has several limitations. One major limitation is potential sampling bias: the dataset does not clearly state where its records came from, and because it appears to draw on multiple sources, there is no guarantee that the ratings are representative of any particular group of users. Additionally, the absence of detailed user demographics and contextual factors, such as workout duration or intensity, limits the ability to explain why some workouts are rated higher than others. Furthermore, the analysis does not account for how effective each workout is at achieving specific fitness goals, which would allow a more nuanced evaluation. These limitations mean that while the analysis offers useful insights, it should be interpreted cautiously and supplemented with additional data.

GitHub Link: https://github.com/gabwaugh/INST414_Module_1