Stories by Lily Gates on Medium

Pokémon: Power and Balance

Lily Gates — Fri, 16 May 2025 17:00:31 GMT

Pokémon™ (©1996–2025 Nintendo®, Creatures™, GAME FREAK™) is a global franchise spanning video games, trading card games, TV shows, movies, and more — capturing the imagination of fans for nearly three decades.

Scraping Pokédex data and visualizing the relationships between elemental types and Pokémon base stats.

I have been a fan of Pokémon for about 20 years. As a child, I eagerly watched the TV episodes and collected over 500 Pokémon trading cards. Although my fandom has evolved over time, I remain a dedicated fan who keeps up with the latest video game installments. With the newest Pokémon game launching this year, I felt motivated to revisit my current games and explore the series from a data perspective. One of the key goals in Pokémon games is summed up by the slogan “Gotta Catch ’Em All!”: completing the Pokédex, which catalogs all known Pokémon. Each game takes place in a specific region that features a subset of Pokémon, but the National Pokédex includes every Pokémon species across all regions. I wanted to scrape an online National Pokédex to analyze patterns and trends across Pokémon.

Success in the games requires building a balanced team with diverse elemental types and base stats (attack, special attack, defense, special defense, speed, and health points, or HP). Some Pokémon have notable strengths or weaknesses in certain stats, influencing how Trainers (players) strategize. For example, Slowpoke is notorious for its low speed, which means it usually moves after its opponent. To compensate, Trainers might equip Slowpoke with an item like the “Quick Claw,” which gives it a chance to move first in a turn. Understanding each Pokémon’s strengths and limitations is crucial for winning battles. Additionally, elemental type matchups are key, since moves are more or less effective depending on both the user’s and opponent’s types, adding a layer of strategic depth.

Over the years, as the games introduced more dual-type Pokémon and new elemental types, the complexity of team building has increased significantly. To keep track of this growing complexity, I wanted to identify general trends in base stats and elemental types across all Pokémon.

This dataset can be valuable not only for gamers but also for game developers. Pokémon games often include both classic Pokémon from previous generations and new species. Analyzing the distribution of types and base stats can help ensure a balanced and diverse game experience. For example, “Gym Leaders” and the “Elite Four” are progressively tougher bosses specializing in certain types. Game developers can use this data to design these bosses’ teams with increasing difficulty and varied type combinations from the early Gyms to the Champion challenge.

Overarching Question

How do primary and secondary elemental types influence the strength and diversity of Pokémon, and what insights can this reveal about game balance?

Stakeholders

Game designers and competitive players.

Game designers can use this data to better understand how certain typings dominate in terms of base stats, which could inform balancing in future generations or spinoff games.
Competitive players benefit from recognizing underrated, non-legendary Pokémon with strong base stats in type-restricted formats or themed tournaments.

Data

Data Collection

I collected Pokémon data by scraping the Pokémon Database National Pokédex (https://pokemondb.net/pokedex/all) using Python’s requests and BeautifulSoup libraries.

The primary goal was to extract:

The names of all Pokémon listed on the page
The links to their individual detail pages (using urllib.parse for URL handling)

Since some Pokémon have multiple forms (e.g., Mega Evolutions, regional variants), the list included duplicates. To maintain clarity and focus on base forms, I filtered the data to include only unique Pokémon entries.

Summary:

A total of 1,025 unique Pokémon were identified and extracted from the National Pokédex.
The resulting DataFrame, `link_frame`, includes each unique Pokémon’s name along with a link to its detail page — providing a foundation for further scraping or analysis.

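# --------------------------------------------
# STEP 1: FETCH MAIN POKÉMON LIST AND LINKS
# --------------------------------------------

import os
import time
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup
from requests import get

# National Pokédex page named above; base_url is assumed to be the site root
url = "https://pokemondb.net/pokedex/all"
base_url = "https://pokemondb.net"

# Request and parse the main page
site = get(url)
content = BeautifulSoup(site.content, "html.parser")

# Extract anchor elements containing Pokémon names
anchor_elem = content.select('a[class="ent-name"]')

# Get names and full URLs
pokemon_names = [x.string for x in anchor_elem]
completed_links = [urljoin(base_url, i.get('href')) for i in anchor_elem]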

# --------------------------------------------
# STEP 2: REMOVE DUPLICATES
# --------------------------------------------

# Keep only the first occurrence of each Pokémon name (and its link)
unique_pokemon = []
unique_links = []
for name, link in zip(pokemon_names, completed_links):
    if name not in unique_pokemon:
        unique_pokemon.append(name)
        unique_links.append(link)

# Create a DataFrame of names and links
link_frame = pd.DataFrame({
    'pokemon': unique_pokemon,
    'url': unique_links
})
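
The loop above tests each name against a list, which rescans the list on every iteration. As a minimal sketch of an alternative (not the approach used in this project), a dictionary keyed by name can keep the first link seen for each Pokémon while preserving order. The function name `dedupe_by_name` and the sample data are hypothetical.

```python
def dedupe_by_name(names, links):
    """Keep the first link seen for each name, preserving original order."""
    first_seen = {}
    for name, link in zip(names, links):
        # setdefault stores the link only if the name is not already a key
        first_seen.setdefault(name, link)
    return list(first_seen.keys()), list(first_seen.values())


# Hypothetical sample: Venusaur appears twice (base form row and alternate form row)
sample_names = ["Bulbasaur", "Venusaur", "Venusaur", "Charmander"]
sample_links = [
    "https://pokemondb.net/pokedex/bulbasaur",
    "https://pokemondb.net/pokedex/venusaur",
    "https://pokemondb.net/pokedex/venusaur",
    "https://pokemondb.net/pokedex/charmander",
]
unique_names, unique_urls = dedupe_by_name(sample_names, sample_links)
print(unique_names)
```

For roughly a thousand entries either version is fast enough; the dictionary version mainly matters if the list grows much larger.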

Scraping Detailed Data for Each Pokémon

After collecting the list of all Pokémon and their URLs, the next step is to scrape detailed information from each Pokémon’s individual page. The data extracted includes:

Basic info: National Pokédex number, species, height, weight
Types: Primary and secondary elements (e.g., Water, Fire). Note: some Pokémon have only a primary elemental type, so the secondary type may be empty
Gender distribution (percentage male/female or genderless)
Battle stats: HP (health points), Attack, Defense, Special Attack, Special Defense, Speed, and Total stats (sum of all stats)
Pokédex descriptions (concatenated text entries from the page)

Each Pokémon URL in the link_frame DataFrame is fetched and parsed with BeautifulSoup, and the extracted fields are appended to a Python dictionary of lists.

# --------------------------------------------
# STEP 3: INITIALIZE DATA STRUCTURE
# --------------------------------------------

pokedex_dict = {
    'poke_name_from_link': [],
    'pokedex_num': [],
    'elem_1': [],
    'elem_2': [],
    'species': [],
    'height_meters': [],
    'weight_kg': [],
    'male': [],
    'female': [],
    'hp': [],
    'attack': [],
    'defense': [],
    'sp_atk': [],
    'sp_def': [],
    'speed': [],
    'total': []
}

# --------------------------------------------
# STEP 4: DEFINE SCRAPER FUNCTION
# --------------------------------------------


def pokemon_scraper(poke_content, pokedex_dict):
    """
    Extract detailed Pokémon data from a BeautifulSoup-parsed page and 
    append the extracted information to the provided dictionary.

    Args:
        poke_content (bs4.BeautifulSoup): Parsed HTML content of a Pokémon's webpage.
        pokedex_dict (dict): Dictionary with lists that stores Pokémon data fields 
                             such as name, stats, types, and more.

    Returns:
        None: This function updates pokedex_dict in place by appending new data.
    """
    
    # Pokemon name
    poke_name_from_link = poke_content.select('h1')[0].string
    
    # Locate the table that contains the data
    pokedex_data = poke_content.select('table[class="vitals-table"]')
    
    # National Pokedex Number
    pokedex_num = pokedex_data[0].select('td')[0].contents[0].string
    
    # Element type(s)
    elem_type = list(pokedex_data[0].select('td')[1].children)
    elements = []
    for x in elem_type:
        if not str(x).isspace():
            elements.append(x.string)
    elem_1 = elements[0]
    elem_2 = None
    if len(elements) > 1:  # Sometimes there is more than one type
        elem_2 = elements[1]
    
    # Species
    species = pokedex_data[0].select('td')[2].string
    
    # Height
    height = pokedex_data[0].select('td')[3].string.replace(u'\xa0','')
    height_meters = float(height.split('m')[0])

    # Weight
    weight = pokedex_data[0].select('td')[4].string.replace(u'\xa0','')
    weight_kg = float(weight.split('kg')[0])

    # Gender
    gender = list(pokedex_data[2].select('td')[1].children)
    gender_stats = []
    for x in gender:
        if not str(x).isspace():
            gender_stats.append(x.string)

    if "Genderless" in gender_stats:
        male = float(0)
        female = float(0)
    else:
        male = float(gender_stats[0].split('%')[0])  # Extracts only the percent digits
        female = float(gender_stats[2].split('%')[0])  # Extracts only the percent digits

    # Fighting Stats
    hp = int(list(pokedex_data[3].select('tr')[0])[3].string)
    attack = int(list(pokedex_data[3].select('tr')[1])[3].string)
    defense = int(list(pokedex_data[3].select('tr')[2])[3].string)
    sp_atk = int(list(pokedex_data[3].select('tr')[3])[3].string)
    sp_def = int(list(pokedex_data[3].select('tr')[4])[3].string)
    speed = int(list(pokedex_data[3].select('tr')[5])[3].string)
    total = int(list(pokedex_data[3].select('tr')[6])[3].string)
    
    # Append elements in the list for each key
    pokedex_dict['poke_name_from_link'].append(poke_name_from_link)
    pokedex_dict['pokedex_num'].append(pokedex_num)
    pokedex_dict['elem_1'].append(elem_1)
    pokedex_dict['elem_2'].append(elem_2)
    pokedex_dict['species'].append(species)
    pokedex_dict['height_meters'].append(height_meters)
    pokedex_dict['weight_kg'].append(weight_kg)
    pokedex_dict['male'].append(male)
    pokedex_dict['female'].append(female)
    pokedex_dict['hp'].append(hp)
    pokedex_dict['attack'].append(attack)
    pokedex_dict['defense'].append(defense)
    pokedex_dict['sp_atk'].append(sp_atk)
    pokedex_dict['sp_def'].append(sp_def)
    pokedex_dict['speed'].append(speed)
    pokedex_dict['total'].append(total)

# --------------------------------------------
# STEP 5: LOOP THROUGH URLS AND SCRAPE DATA
# --------------------------------------------

print("\n=== FETCHING AND SAVING POKÉMON DATA... ===\n")

# Iterate over each Pokémon URL in the link_frame DataFrame
# 'enumerate' is used to keep track of progress (count) starting from 1
for count, poke_site in enumerate(link_frame['url'], start=1):
    # Send an HTTP GET request to the Pokémon’s individual page
    page = get(poke_site)
    
    # Parse the HTML content of the page using BeautifulSoup
    poke_content = BeautifulSoup(page.content, "html.parser")
    
    # Extract relevant Pokémon data from the page and append it to the pokedex_dict
    pokemon_scraper(poke_content, pokedex_dict)
    
    # Print progress update showing how many Pokémon have been processed out of total
    print(f"{count} of {len(link_frame['url'])} complete")
    
    # Pause for 1 second between requests to avoid overloading the server (politeness)
    time.sleep(1)
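
The scraper above pulls numbers out of display strings by splitting on the unit or the percent sign. As a hedged sketch of a slightly more defensive variant, the helpers below use regular expressions and return None or 0.0 instead of raising when a field is missing. The helper names and the sample strings are assumptions about the page text, not taken from the original script.

```python
import re


def parse_measure(text, unit):
    """Return the leading number before `unit`, or None if it is absent."""
    cleaned = text.replace("\xa0", " ")  # the page text may contain non-breaking spaces
    match = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*" + re.escape(unit), cleaned)
    return float(match.group(1)) if match else None


def parse_gender(text):
    """Return (male_pct, female_pct); Genderless maps to (0.0, 0.0) as in the scraper."""
    if "Genderless" in text:
        return 0.0, 0.0
    # "%\s*male" cannot match inside "% female" because "f" follows the space
    male = re.search(r"([0-9.]+)%\s*male", text)
    female = re.search(r"([0-9.]+)%\s*female", text)
    male_pct = float(male.group(1)) if male else 0.0
    female_pct = float(female.group(1)) if female else 0.0
    return male_pct, female_pct


# Assumed examples of how the fields might appear on a detail page
height_m = parse_measure("0.7\xa0m (2′04″)", "m")
weight_kg = parse_measure("6.9\xa0kg (15.2\xa0lbs)", "kg")
gender = parse_gender("87.5% male, 12.5% female")
print(height_m, weight_kg, gender)
```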

The dictionary is then converted to a pandas DataFrame, exported as a .csv file, and read back in for validation.
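
A minimal sketch of that save-and-validate step is shown below. The two-row dictionary is a hypothetical stand-in for the scraped data, and the filename pokedex.csv and the specific checks are assumptions rather than the original script.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the scraped pokedex_dict
pokedex_dict = {
    "poke_name_from_link": ["Bulbasaur", "Charmander"],
    "pokedex_num": ["0001", "0004"],
    "elem_1": ["Grass", "Fire"],
    "elem_2": ["Poison", None],
    "total": [318, 309],
}

pokedex_df = pd.DataFrame(pokedex_dict)

# Assumed filename; written to a temporary directory for this sketch
csv_path = os.path.join(tempfile.mkdtemp(), "pokedex.csv")
pokedex_df.to_csv(csv_path, index=False)

# Read the Pokédex number back as text so leading zeros survive the round trip
reloaded = pd.read_csv(csv_path, dtype={"pokedex_num": str})

# Basic validation: same shape, and single-type Pokémon reload with a missing elem_2
same_shape = reloaded.shape == pokedex_df.shape
missing_secondary = int(reloaded["elem_2"].isna().sum())
print(same_shape, missing_secondary)
```

Reading the number column as a string is a design choice: without it, pandas would parse values such as "0001" as the integer 1.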

Screenshot of exported .csv file after it has been read in for validation

Image Download Pipeline for Pokémon Dataset

To efficiently download and store images of Pokémon, I developed a script that automates the process while handling potential errors and avoiding redundant work.

Setup

First, I created a folder called pokemon_images to store all the downloaded image files. If the folder didn’t already exist, the script created it. I also set the number of retry attempts per image and the delay between requests.

# Create folder to store images
folder_name = "pokemon_images"
os.makedirs(folder_name, exist_ok=True) # Ensure folder exists, if not, make
# Number of times to retry downloading an image if it fails
retry_limit = 3
delay_between_requests = 1 # Seconds

Main Download Loop

For each Pokémon listed in my dataset (link_frame), the script extracted the name and corresponding URL. To make the image filenames consistent, each name was converted to lowercase and spaces were replaced with underscores (e.g., "Mr. Mime" becomes mr._mime_image.jpg). To prevent unnecessary downloads, the script checked if an image already existed in the pokemon_images folder before attempting to download it again.

To manage the flow and monitor progress, the script displayed which Pokémon had been skipped (if they were previously downloaded into the pokemon_images folder), the current Pokémon being processed, how many remained, and an estimate of the remaining time based on the average speed of downloads so far. I added a delay of one second between requests to avoid overwhelming the source website.

Each image download was attempted up to three times in case of errors. The script first looked for official artwork on the page with artwork_tag = soup.select_one('a[rel="lightbox"]'), which links to the high-quality image. If no artwork was found, it fell back to a sprite image with sprite_tag = soup.select_one('img[src*="/sprites/"]'). When an image was successfully fetched, it was saved to the designated folder. If a download still failed after three attempts, the Pokémon's name was logged for review and written to a failed_pokemon.txt file at the end of the run.

total = len(link_frame)
start_time = time.time()
failed_pokemon = []

for counter, (_, row) in enumerate(link_frame.iterrows(), start=1):
    # For each Pokémon in link_frame ('pokemon' and 'url' columns), download its
    # artwork image into folder_name as "<name>_image.jpg". Images that already
    # exist on disk are skipped, each download is retried up to retry_limit
    # times, and progress plus an estimated remaining time is printed.
    poke_name = row['pokemon'].lower().replace(' ', '_')
    poke_url = row['url']
    img_filename = f"{poke_name}_image.jpg"
    img_path = os.path.join(folder_name, img_filename)

    # Skip if image already exists
    if os.path.exists(img_path):
        print(f"Skipping {poke_name.title()} (image already exists)\n")
        continue
        
    remaining_total = total - counter
    elapsed_time = time.time() - start_time
    avg_time = elapsed_time / counter
    est_remaining = avg_time * remaining_total
    mins, secs = divmod(int(est_remaining), 60)

    print(f"Downloading image {counter} of {total}: {poke_name.title()}")
    print(f"Remaining: {remaining_total} | Estimated time left: {mins}m {secs}s")

    attempts = 0
    while attempts < retry_limit:
        try:
            page = get(poke_url)
            if page.status_code != 200:
                raise Exception(f"Failed to get page, status code {page.status_code}")

            soup = BeautifulSoup(page.content, 'html.parser')

            # Try official artwork
            image_url = None
            artwork_tag = soup.select_one('a[rel="lightbox"]')
            if artwork_tag and artwork_tag.has_attr('href'):
                image_url = artwork_tag['href']
            else:
                # Try fallback sprite
                sprite_tag = soup.select_one('img[src*="/sprites/"]')
                if sprite_tag and sprite_tag.has_attr('src'):
                    image_url = sprite_tag['src']
                    # Fix malformed or protocol-relative URL
                    if image_url.startswith('//'):
                        image_url = 'https:' + image_url
                    elif image_url.startswith('/'):
                        image_url = 'https://img.pokemondb.net' + image_url

            if image_url:
                img_resp = get(image_url)
                if img_resp.status_code == 200:
                    with open(img_path, 'wb') as f:
                        f.write(img_resp.content)
                    print(f"Saved image to {img_path}\n")
                    break
                else:
                    raise Exception(f"Failed to download image, status code {img_resp.status_code}")
            else:
                print(f"No artwork or sprite found for {poke_name.title()}\n")
                break

        except Exception as e:
            attempts += 1
            print(f"Attempt {attempts} failed for {poke_name.title()}: {e}")
            if attempts == retry_limit:
                print(f"Skipping {poke_name.title()} after {retry_limit} failed attempts\n")
                failed_pokemon.append(poke_name)
            else:
                print("Retrying...\n")
            time.sleep(2)  # wait before retry

    time.sleep(delay_between_requests)
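
The fallback branch above repairs protocol-relative and root-relative sprite URLs by hand. As a hedged alternative sketch, urllib.parse.urljoin from the standard library can handle both cases and leaves absolute URLs untouched. The helper name `normalize_image_url` and the example paths are hypothetical; the base host is the one used in the loop above.

```python
from urllib.parse import urljoin

# Image host taken from the fallback branch in the download loop
IMAGE_BASE = "https://img.pokemondb.net"


def normalize_image_url(src, base=IMAGE_BASE):
    """Resolve protocol-relative ('//host/x') and root-relative ('/x') URLs against base."""
    return urljoin(base, src)


# Hypothetical example paths
protocol_relative = normalize_image_url("//img.pokemondb.net/sprites/example.png")
root_relative = normalize_image_url("/sprites/example.png")
already_absolute = normalize_image_url("https://img.pokemondb.net/artwork/example.jpg")
print(protocol_relative, root_relative, already_absolute, sep="\n")
```

Applying the same helper to the artwork link as well would make both branches behave consistently if the site ever serves relative artwork URLs.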

Final Output

All successfully downloaded images were saved to the pokemon_images folder using a consistent lowercase, underscore-separated naming convention. Any Pokémon that could not be processed were listed at the end of the run and logged in a text file named failed_pokemon.txt.

if failed_pokemon:
    print("\nThe following Pokémon failed to download:")
    for name in failed_pokemon:
        print(f"- {name.title()}")

    with open("failed_pokemon.txt", "w") as f:
        for name in failed_pokemon:
            f.write(name + "\n")
else:
    print("\nAll Pokémon images downloaded successfully!")

Summary — This script made it possible to:

Automatically download and organize Pokémon images.
Handle missing or malformed data sources gracefully.
Provide transparency and feedback throughout the process.

Data Visualizations

Step 1: Loading the Dataset

First, I load the Pokémon dataset from a CSV file to work with the compiled data. This dataset contains all the relevant Pokémon information collected earlier. I also create a folder named pokemon_graphs where all the visualizations generated will be saved. If this folder doesn’t exist, it gets created automatically.
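
A minimal sketch of this loading step might look like the following. The function name `load_dataset`, the CSV filename, and the one-row demo file are assumptions made so the example runs on its own; only the pokemon_graphs folder name comes from the text.

```python
import os
import tempfile

import pandas as pd


def load_dataset(csv_path, graph_dir="pokemon_graphs"):
    """Read the compiled CSV and make sure the output folder for graphs exists."""
    df = pd.read_csv(csv_path)
    os.makedirs(graph_dir, exist_ok=True)  # no error if the folder is already there
    return df


# Demo with a throwaway CSV in a temporary directory (hypothetical data)
work_dir = tempfile.mkdtemp()
demo_csv = os.path.join(work_dir, "pokedex.csv")
pd.DataFrame({"poke_name_from_link": ["Pikachu"], "speed": [90]}).to_csv(demo_csv, index=False)

graph_dir = os.path.join(work_dir, "pokemon_graphs")
df = load_dataset(demo_csv, graph_dir)
print(df.shape, os.path.isdir(graph_dir))
```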

Step 2: Setting Up a Custom Theme for Graphs

To make the graphs visually consistent and polished, I define a custom style for all plots. This includes settings for fonts, title sizes, axis labels, tick marks, and background colors. Applying this theme globally ensures every graph follows the same look and feel, making the results easier to interpret and more professional.
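
One common way to apply such a theme globally is through matplotlib's rcParams. The sketch below assumes matplotlib is the plotting library, and every value in `custom_theme` is a hypothetical placeholder, since the actual settings are not listed here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical theme values covering fonts, titles, axis labels, ticks, and backgrounds
custom_theme = {
    "font.family": "sans-serif",
    "axes.titlesize": 16,
    "axes.titleweight": "bold",
    "axes.labelsize": 12,
    "xtick.labelsize": 10,
    "ytick.labelsize": 10,
    "axes.facecolor": "#f7f7f7",
    "figure.facecolor": "white",
}

# Apply globally: every figure created afterwards picks up these defaults
plt.rcParams.update(custom_theme)
print(plt.rcParams["axes.titlesize"], plt.rcParams["axes.titleweight"])
```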

Step 3: Creating Visualizations

With the dataset loaded and the custom style applied, I’m now ready to start generating visualizations to explore and present the data insights effectively.

Relationship Between Height and Weight in Pokémon

Scatterplot with Regression Line

The scatterplot reveals a strong positive correlation between Pokémon height and weight, with an r value of 0.762 and an R² of 0.581 after applying a square root transformation to the data. This transformation helped normalize the data, which originally clustered heavily around smaller Pokémon, allowing for a clearer linear relationship.
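
As a rough sketch of how these two figures relate, the snippet below square-root-transforms both variables and computes Pearson's r with NumPy. For a simple linear regression with one predictor, R² is the square of r, which is consistent with the reported values (0.762² ≈ 0.581). The function name and the sample heights and weights are hypothetical, not values from the dataset.

```python
import numpy as np


def sqrt_correlation(height_m, weight_kg):
    """Pearson r and R² between square-root-transformed height and weight."""
    x = np.sqrt(np.asarray(height_m, dtype=float))
    y = np.sqrt(np.asarray(weight_kg, dtype=float))
    r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    return r, r ** 2


# Hypothetical sample values (meters and kilograms)
heights = [0.4, 0.7, 1.0, 1.7, 2.0, 6.5]
weights = [6.0, 6.9, 13.0, 90.5, 100.0, 210.0]
r, r_squared = sqrt_correlation(heights, weights)
print(round(r, 3), round(r_squared, 3))
```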

For a game developer, this finding is highly relevant. As the Pokémon games evolve from traditional turn-based combat into more dynamic, 3D environments emphasizing physical movement — such as dodging and aiming — the physical attributes of height and weight directly influence how a Pokémon might move and interact with the environment. For example, larger and heavier Pokémon might move more slowly or have different hitboxes, while smaller Pokémon could be quicker and more agile.

The presence of outliers, even after transformation, suggests certain Pokémon don’t follow the typical height-weight pattern. This could reflect unique species traits or habitat adaptations, such as Pokémon that live in specific biomes or those that are adapted for swimming. Understanding these differences can be crucial for realistic animation, buoyancy effects in water, and gameplay mechanics that depend on Pokémon size and mass.

By incorporating these insights into character design and game physics, developers can create a more immersive and believable game world that respects Pokémon diversity while optimizing player experience.

Histogram Distribution of Pokémon Height and Weight (Square Root Transformed)

Both the height and weight histograms show that the majority of Pokémon cluster towards the lower end of the spectrum. This means most Pokémon tend to be relatively small and light. However, there are noticeable outliers in both distributions — some Pokémon are significantly taller or heavier than the rest.

It’s especially worth noting that the weight distribution exhibits a stronger pull toward the heavier end compared to height. This suggests that while extreme height outliers exist, there is a greater variety or range in heavier Pokémon. This could reflect diverse body types, such as bulky or dense Pokémon, which may impact game mechanics like movement or combat differently than height alone.

Applying the square root transformation helps to reduce skewness, making these patterns more apparent and easier to compare visually. This kind of insight could be valuable for game developers when designing balance, animations, or physics models that consider Pokémon size characteristics.
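
To illustrate the skewness claim, the sketch below computes a simple moment-based skewness before and after the transformation. The `skewness` helper and the sample weights are hypothetical; the sample is deliberately right-skewed, with most values small and one large outlier.

```python
import numpy as np


def skewness(values):
    """Population skewness: third central moment divided by the variance to the power 1.5."""
    arr = np.asarray(values, dtype=float)
    centered = arr - arr.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Hypothetical right-skewed weights: many light Pokémon, one very heavy one
weights = [1, 1, 2, 2, 3, 4, 5, 8, 15, 40]
raw_skew = skewness(weights)
sqrt_skew = skewness(np.sqrt(weights))
print(round(raw_skew, 2), round(sqrt_skew, 2))
```

A positive value indicates a long right tail; the transformed sample remains right-skewed here, but less so than the raw one.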

Distribution of Base Stats Across All Pokémon

HP: Most Pokémon cluster tightly between 40 and 70 HP, showing this stat has the least spread compared to others. This suggests a relatively narrow range of durability for most Pokémon.
Attack: Displays a roughly normal distribution, with most values falling between 50 and 75. This indicates a balanced spread of physical offensive power across Pokémon.
Defense: Skews more toward the lower end, with many Pokémon having Defense values between 50 and 75. This suggests that high Defense is less common, making it a potentially valuable trait.
Special Attack: Shows the greatest spread among all stats, ranging from the 40s up to some Pokémon with very high values around 120. This wider distribution reflects how Special Attack is a key differentiator for powerful moves.
Special Defense: Similar in shape to Defense but with the majority concentrated between 40 and 80, suggesting moderate variance in resistance to special moves.
Speed: Roughly normally distributed, with many Pokémon centered around the 70s.

In general, the mix of Pokémon in the dataset likely explains some of the observed skewness: the highest stat values typically belong to Legendary Pokémon or fully evolved starter Pokémon, reflecting their elite status in battle, whereas lower values are generally characteristic of the initial evolutionary stages.

Pokémon Types by Primary and Secondary Categories

Distribution of Pokémon Types by Primary and Secondary Categories

This stacked bar chart visualizes the distribution of elemental types among Pokémon, separated into primary and secondary types. The height of each bar reflects the total percentage of Pokémon with that type, with the darker section representing how many have it as their primary type, and the lighter section showing the secondary type proportion.
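
The percentages behind a chart like this can be sketched with pandas value_counts. The column names elem_1 and elem_2 match the scraper above; the function name `type_shares` and the four-row sample are hypothetical.

```python
import pandas as pd


def type_shares(df):
    """Percent of Pokémon carrying each type as primary (elem_1) or secondary (elem_2)."""
    n = len(df)
    primary = df["elem_1"].value_counts()
    secondary = df["elem_2"].value_counts()  # missing (single-type) rows are ignored
    shares = pd.DataFrame({"primary": primary, "secondary": secondary}).fillna(0)
    shares = shares / n * 100
    shares["total"] = shares["primary"] + shares["secondary"]
    return shares.sort_values("total", ascending=False)


# Hypothetical sample: three Flying secondaries, one single-type Water Pokémon
sample = pd.DataFrame({
    "elem_1": ["Normal", "Bug", "Water", "Fire"],
    "elem_2": ["Flying", "Flying", None, "Flying"],
})
shares = type_shares(sample)
print(shares)
```

Calling `shares[["primary", "secondary"]].plot(kind="bar", stacked=True)` on the result would draw the two segments on top of each other, one bar per type.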

It was surprising to find that the most common types overall were Flying, Water, Grass, Normal, and Psychic. One particularly striking insight was the imbalance between Flying as a primary versus a secondary type — very few Pokémon have Flying as their primary type, while a significant number have it as a secondary type.

This is in sharp contrast to types like Water, Normal, Bug, and Electric, which are more often primary types. The implications of this are strategically important: a high number of Flying-type secondaries means that many Pokémon share common strengths (e.g., strong against Fighting, Grass, Bug) and vulnerabilities (e.g., weak to Rock, Electric, Ice). From a gameplay standpoint, this kind of elemental distribution shapes competitive strategy. Knowing which types are most prevalent can help players build a balanced and versatile team — both offensively and defensively.

On the other hand, some types are notably underrepresented. For example, Fairy as a primary type is quite rare, as is Bug as a secondary type. More broadly, Pokémon that feature any Ice or Electric typing at all are relatively scarce. This insight could be valuable for game developers aiming to maintain balance and diversity in type representation across generations. Avoiding the overuse of common types and ensuring rarer types are meaningfully included can help keep gameplay fresh, competitive, and inclusive of a wider range of team compositions.

Frequency of Pokémon Elemental Types

Interpreting the Heatmaps of Dual-Type Combinations

To further explore type diversity, a pair of heatmaps was created to visualize the frequency of dual-type Pokémon: those with both a primary and secondary elemental classification.

Heatmap A (top) displays the type matrix sorted by overall frequency, making it easy to spot which type combinations are most common.
Heatmap B (bottom) presents the same data but sorted alphabetically, which helps locate specific combinations quickly and check if they exist.
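
The count matrix underlying both heatmaps can be sketched with pandas crosstab, with the two orderings differing only in how rows and columns are sorted. The function name `dual_type_matrix` and the sample rows are hypothetical; elem_1 and elem_2 are the scraper's column names.

```python
import pandas as pd


def dual_type_matrix(df, sort_by_frequency=True):
    """Count Pokémon per (primary, secondary) pair, skipping single-type rows."""
    dual = df.dropna(subset=["elem_2"])
    matrix = pd.crosstab(dual["elem_1"], dual["elem_2"])
    if sort_by_frequency:
        # Most common primary types first (rows), most common secondary types first (columns)
        row_order = matrix.sum(axis=1).sort_values(ascending=False).index
        col_order = matrix.sum(axis=0).sort_values(ascending=False).index
        return matrix.loc[row_order, col_order]
    # Alphabetical layout for quick lookup of a specific combination
    return matrix.sort_index(axis=0).sort_index(axis=1)


# Hypothetical sample; the Fire row has no secondary type and is dropped
sample = pd.DataFrame({
    "elem_1": ["Water", "Water", "Bug", "Normal", "Fire"],
    "elem_2": ["Ground", "Ground", "Flying", "Flying", None],
})
by_freq = dual_type_matrix(sample)
alphabetical = dual_type_matrix(sample, sort_by_frequency=False)
print(by_freq)
```

Cells with a count of zero correspond to the white cells described below: combinations that do not occur in the data.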

What stands out is that some combinations are very densely populated (e.g., Water/Ground, Flying/Normal, Bug/Flying), while many cells remain white, indicating combinations that either don’t exist or are extremely rare (e.g., Electric/Fighting, Fairy/Fire).

This visualization reinforces earlier findings. Flying types often appear as a secondary element, most frequently paired with Normal, Bug, and Dragon types. In addition, certain types (like Fairy and Ice) are underutilized in combinations, which may be due to balancing constraints or design choices.

Game developers can use this chart to assess whether certain type combinations are overused or missing entirely, which can help introduce greater diversity and gameplay novelty in future game releases.

Average Base Stats by Primary Elemental Type

From a game development standpoint, the radar plots are a powerful tool to visualize the stat identity of each elemental type. Types like Water, Poison, Ice, Fire, and Grass stand out as well-rounded classes, forming near-perfect hexagons on the radar. This suggests they’re designed with balanced versatility — strong candidates for starter Pokémon or generalist roles that can slot into a wide range of team compositions.

In contrast, types like Fighting, Rock, Steel, and Ground show sharply skewed profiles, with certain stats like HP, Defense, or Attack soaring into the 90s or 100s, while others like Speed or Sp. Atk dip into the 50s. These extremes represent specialist archetypes — powerhouses in one area but with exploitable weaknesses, which is a classic design strategy to encourage tactical depth, role diversity, and team synergy.

While the radial layout is excellent for identifying within-type balance, it can be less intuitive when comparing across types. This is where the pivot table becomes an essential companion. By presenting side-by-side mean values by type and stat, it supports quick visual scanning — whether you’re identifying types that dominate in Speed, those with defensive advantages, or benchmarking against the overall average. It’s a compact, data-driven supplement to the radar plot’s more visual, holistic perspective.
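
A pivot table of this kind can be sketched with a pandas groupby, adding an overall row as the benchmark. The function name `mean_stats_by_type` is hypothetical, the column names follow the scraper above, and the three sample rows are used only for illustration.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]


def mean_stats_by_type(df):
    """Mean of each base stat per primary type, plus an 'Overall' benchmark row."""
    table = df.groupby("elem_1")[STATS].mean()
    table.loc["Overall"] = df[STATS].mean()
    return table.round(1)


# Small illustrative sample (Bulbasaur, Ivysaur, Charmander)
sample = pd.DataFrame(
    [
        ("Grass", 45, 49, 49, 65, 65, 45),
        ("Grass", 60, 62, 63, 80, 80, 60),
        ("Fire", 39, 52, 43, 60, 50, 65),
    ],
    columns=["elem_1"] + STATS,
)
table = mean_stats_by_type(sample)
print(table)
```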

From a Player vs. Environment (PvE)/Player vs. Player (PvP) balancing lens, this combination of radar and tabular data is extremely useful. Balanced types may excel in PvE scenarios where reliability and flexibility are valued. Meanwhile, types with sharp stat spikes may define PvP meta niches — hitting hard but demanding skilled play to mitigate their vulnerabilities.

Importantly, this view also underscores the value of dual-typing as a stat-level design tool. By combining a low-Speed, high-Attack type with a type known for agility or Sp. Def, designers can smooth out extremes, creating more complex and interesting Pokémon that fill hybrid roles or bridge strategic gaps in teams.

Taken together, the radar plot and pivot table provide a layered analytical toolkit — ideal for identifying archetypes, informing balance patches, or planning new evolutions and forms that round out underrepresented stat distributions.

Deviations in Pokémon Base Stats by Primary Elemental Type

The deviation table reveals clear archetypes and design philosophies embedded within each primary elemental type.
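
One plausible way to build such a deviation table, assuming it is the per-type mean minus the overall mean for each stat, is sketched below. The function name `stat_deviations` and the sample values are hypothetical and chosen only to show one below-average and one above-average type.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]


def stat_deviations(df):
    """Per-type mean of each stat minus the overall mean of that stat."""
    type_means = df.groupby("elem_1")[STATS].mean()
    overall = df[STATS].mean()
    # Subtract the overall mean column by column
    return type_means.sub(overall, axis="columns").round(1)


# Hypothetical sample values, not taken from the dataset
sample = pd.DataFrame(
    [
        ("Bug", 40, 45, 50, 35, 45, 55),
        ("Bug", 50, 55, 50, 45, 55, 65),
        ("Dragon", 90, 110, 85, 100, 85, 80),
        ("Dragon", 100, 120, 95, 110, 95, 100),
    ],
    columns=["elem_1"] + STATS,
)
deviations = stat_deviations(sample)
print(deviations)
```

Negative cells mark stats where a type sits below the overall average, and positive cells mark stats where it sits above.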

Bug Types consistently fall below average across all base stats, reflecting their classic role as weaker, early-game or specialized Pokémon. This suggests they are intentionally designed to be less intimidating individually but may rely on numbers, status effects, or strategic utility rather than raw stats.

At the opposite end, Dragon Types stand out as powerful all-rounders, boasting above-average values in every stat category. This supports their reputation as iconic, late-game powerhouse types, often reserved for legendary or pseudo-legendary Pokémon. Their strong baseline stats reflect a design choice to make dragons inherently formidable.

Dark and Steel Types show strong overall stat profiles, mostly above average. Dark types are robust in all stats except for a slight dip in Defense, suggesting they rely more on offensive versatility and speed. Steel types are sturdy with high Defense but have a relative weakness in Special Attack, reinforcing their tanky, physically defensive archetype.

Surprisingly, Poison and Water Types hover very close to the overall average in all stats, with deviations rarely exceeding ±5.3 points. This balance suggests these types are designed as flexible generalists, able to fit many roles without dominating or lagging significantly in any area.

The remaining types (Fire, Grass, Fighting, Rock, Ground, Electric, Psychic, Ice, Flying, Ghost, Normal, Fairy) display mixed stat deviations, emphasizing diversity within those groups. Some may excel in one or two key stats while dipping in others, reflecting niche roles and encouraging varied team compositions.

Game Development Implications

The consistently low stats of Bug types reinforce their use as early-game challenges or utility-focused characters, avoiding overpowering new players.

Dragon types’ strong all-around stats justify their positioning as late-game rewards or boss-level opponents.

The variation in types like Steel and Dark showcases how nuanced stat spreads encourage tactical gameplay — e.g., Steel’s defensive strength but special attack weakness prompts creative use of moves and synergy.

The close-to-average stats of Poison and Water types suggest their design intent as versatile “jack-of-all-trades” types, which can blend into many play styles without overshadowing others.

Lastly, mixed distributions in other types highlight opportunities for balancing or evolving new subtypes to address gaps or reinforce certain niches.

Top 10 Pokémon by Base Stat Category

HP and Defense: Pokémon topping these stats tend to be the bulky tanks and walls that can absorb a lot of damage, making them strategic picks for PvE or stall tactics. For HP, Blissey and Chansey are often prized in support or healing roles, while Guzzlord, an Ultra Beast, combines bulk with a unique offensive presence. Among the top contenders for Defense, Shuckle is notable for its extreme defense but is often overlooked in competitive play due to its low offensive stats and speed. Stakataka and Steelix, meanwhile, embody the “fortress” archetype, excelling in physical defense and serving as wall-like protectors.

Special Defense: High values here indicate strong resistance to special attacks, favoring Pokémon that can withstand elemental or status-based moves. The top defenders include both unconventional and legendary Pokémon. Again, Shuckle surprises with its defensive prowess, while Regice and Lugia are known for their resilience against special attacks. Lugia, a Legendary, is famed for its balanced bulk and special defense capabilities, making it a versatile defensive pivot.

Speed: The fastest Pokémon dominate here. They are often used to strike first in battle, outpacing opponents to gain a tactical advantage. Regieleki tops the list with the highest base Speed in the game, facilitating rapid offensive pressure. Ninjask, despite its small size and somewhat frail nature, compensates with extreme speed, useful for hit-and-run tactics. Pheromosa combines speed with strong offense, making it a fearsome sweeper.

Attack and Special Attack: These categories showcase the primary offensive powerhouses, split between physical and special damage dealers, guiding choices for aggressive or sweeping strategies. The top physical attackers include Kartana, an Ultra Beast with a razor-sharp offensive edge; Rampardos, known for its raw power but lacking in speed; and Slaking, which has immense strength but is hampered by its Truant ability, which lets it act only every other turn. For Special Attack, Legendary Pokémon and Ultra Beasts dominate. Mewtwo is a classic powerhouse with high special attack and a versatile movepool. Xurkitree and Blacephalon, though less conventional, excel with unique offensive capabilities, highlighting the diversity in design of special attackers.

Notably, some Pokémon like Shuckle and Ninjask rank high in defensive and speed stats respectively but aren’t traditionally considered strong combatants overall due to weaknesses in other areas (e.g., low attack, special attack, or overall bulk). This shows how specialized stats contribute to niche roles or strategic options beyond just raw power.

This word cloud visualizes the prominence of Pokémon appearing in the top 10 rankings across key base stat categories — HP, Defense, Special Defense, Speed, Attack, and Special Attack. The size of each Pokémon’s name reflects how frequently it appears among the top performers: larger names indicate Pokémon that consistently excel across multiple stats, while smaller names appear less often. This gives an intuitive, at-a-glance understanding of which Pokémon dominate different aspects of base stats and highlights those that are versatile or specialized in certain areas.
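The counting behind a word cloud like this can be sketched in a few lines. This is a minimal illustration, not the code from the repository: the column names (`name`, `hp`, `defense`, `sp_defense`, `speed`), the five-row sample, and the helper `top_n_counts` are all assumptions made for the example.

```python
from collections import Counter

import pandas as pd

# Hypothetical mini-dataset; column names are assumptions, not the scraped schema.
stats = pd.DataFrame(
    {
        "name": ["Shuckle", "Blissey", "Regieleki", "Ninjask", "Lugia"],
        "hp": [20, 255, 80, 61, 106],
        "defense": [230, 10, 50, 45, 130],
        "sp_defense": [230, 135, 50, 50, 154],
        "speed": [5, 55, 200, 160, 110],
    }
)


def top_n_counts(df, stat_cols, n=10):
    """Count how often each Pokémon appears in the top-n list of each stat."""
    counts = Counter()
    for col in stat_cols:
        # nlargest returns the n rows with the highest values in this column
        counts.update(df.nlargest(n, col)["name"])
    return counts


# n=2 only because the sample is tiny; the article uses top-10 lists.
counts = top_n_counts(stats, ["hp", "defense", "sp_defense", "speed"], n=2)
print(counts.most_common())
```

A frequency mapping of this shape is what a word cloud library typically consumes (for example, `WordCloud.generate_from_frequencies` in the `wordcloud` package), so names that appear in more top lists are drawn larger.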

Top 10 Pokémon by Total Base Stat, Grouped by Elemental Type

Left: Grouped by primary or secondary element type (inclusive). Right: Grouped by only the primary element type.

When filtering by primary elemental type only, the top 10 lists tend to include a broader variety of Pokémon, including more “regular” or mid-evolution forms. This approach captures a more balanced snapshot of each type’s ecosystem, showcasing strength across a wider range of species — not just elite or rare entries.

In contrast, when allowing for both primary and secondary types, the top rankings are often dominated by high-evolution forms and legendary Pokémon. Many of these powerful Pokémon share overlapping secondary types, leading to frequent repeats across categories. This emphasizes raw power and versatility, but at the cost of diversity.

Together, the two charts highlight how including secondary typing can skew perception of a type’s strength by favoring already overpowered Pokémon, whereas filtering by primary type alone allows less dominant but thematically central Pokémon to shine.
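The difference between the two groupings can be sketched with pandas. The column names (`type_1`, `type_2`, `total`), the sample rows, and both helper functions are assumptions for illustration rather than the repository's actual code.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions, not the scraped schema.
pokemon = pd.DataFrame(
    {
        "name": ["Charizard", "Dragonite", "Gyarados", "Pidgeot", "Arcanine"],
        "type_1": ["Fire", "Dragon", "Water", "Normal", "Fire"],
        "type_2": ["Flying", "Flying", "Flying", "Flying", None],
        "total": [534, 600, 540, 479, 555],
    }
)


def top_by_primary(df, n=10):
    """Top-n Pokémon per type, counting only the primary type."""
    ranked = df.sort_values("total", ascending=False)
    return ranked.groupby("type_1").head(n)


def top_by_either(df, n=10):
    """Top-n Pokémon per type, counting primary OR secondary type (inclusive)."""
    # Reshape so each Pokémon contributes one row per type it has
    long = df.melt(
        id_vars=["name", "total"],
        value_vars=["type_1", "type_2"],
        value_name="type",
    ).dropna(subset=["type"])
    ranked = long.sort_values("total", ascending=False)
    return ranked.groupby("type").head(n)


primary = top_by_primary(pokemon, n=3)
inclusive = top_by_either(pokemon, n=3)
flying = inclusive[inclusive["type"] == "Flying"]["name"].tolist()
print(flying)
```

In this sample, no Pokémon has Flying as its primary type, so the Flying group exists only in the inclusive view, and it is filled by strong dual-types that also rank under their primary types. That is the repetition effect described above.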

This dual-view analysis of Top 10 Pokémon by Total Base Stats, grouped by elemental type, reveals important distinctions that can inform both game development and player strategy.

For Game Developers, this data can help with balancing type and power diversity of certain Pokémon. When rankings are limited to primary types, a more diverse set of Pokémon appears — including mid-tier evolutions and non-legendary entries. This suggests that type ecosystems are richer and more balanced than they might seem when secondary types are included. Developers can use this view to fine-tune game balance by identifying underrepresented types that deserve stat boosts, evolutions, or signature moves to enhance viability. Secondary typing allows powerful Pokémon to dominate multiple categories, inflating their perceived versatility. If left unchecked, this can create type power creep, where certain combinations become omnipresent. This insight can guide decisions to introduce counters, nerfs, or move limitations to prevent a few elite Pokémon from overshadowing others in both casual and competitive play. Seeing how mid-evolution Pokémon surface in primary-only rankings could inform evolution pacing and power curves — ensuring that mid-stage forms remain meaningful throughout gameplay progression.

For Players, this offers valuable insight into the hidden potential of overlooked Pokémon, greater strategic depth, and how to build strong teams with the best of each “type.” Many non-legendary Pokémon rank surprisingly high when evaluated solely by their primary type. This provides players with a new lens for team-building, encouraging experimentation beyond commonly used top-tier or legendary picks. By understanding how secondary typing inflates the value of certain Pokémon, competitive players can better predict popular meta choices and craft teams that are resilient to overused dual-type threats. Trainers aiming for theme-based or monotype runs can use primary-only rankings to identify top contenders within a single elemental type, without the noise of dual-type overlaps.

Word Cloud of All Top Pokémon

The word clouds make these patterns visually intuitive. Pokémon with wide reach across type categories appear in larger fonts, emphasizing their overall dominance. Meanwhile, the primary-only view gives stage time to less flashy but strategically important Pokémon, making it easier to spot underrated picks.

Conclusion

Summary of Process

In this project, I built a data pipeline to scrape and organize detailed Pokémon information from the Pokémon Database website, including base stats, elemental types, species details, and images. With this rich dataset, I explored how a Pokémon’s base stats relate to its elemental typing — both primary and secondary — and visualized these patterns using ranked lists and word clouds.

Key Insights

One of the key insights from my analysis is the strong relationship between base stats and elemental typing, especially when considering dual-types. Pokémon with secondary types often have higher total base stats, highlighting their versatility and strategic advantages in gameplay. However, this also results in a smaller, more elite group — often legendary or final-evolution Pokémon — dominating the top rankings. On the other hand, when focusing only on primary types, the top lists are more diverse, capturing mid-evolution and non-legendary Pokémon that typically don’t get as much attention. This contrast reveals how type-based grouping choices shape our perception of a type’s overall strength.

These findings have practical applications. For game developers, understanding how dual-typing correlates with higher base stats can inform more balanced game design — helping ensure type diversity and fairness in future generations. For players, this analysis uncovers powerful yet underrated Pokémon, which can lead to more creative and effective team-building strategies.

Future Steps

Looking ahead, I plan to enhance this dataset by incorporating alternate forms such as Mega Evolutions and regional variants, adding evolution chains, tagging legendary Pokémon, and extracting hidden abilities. These additions would make the data even more valuable for research, game balancing, and player strategy.

Overall, this project offers not only a robust and reusable Pokémon dataset, but also actionable insights into the interplay between typing and stats — serving both the game development community and dedicated fans alike.

GitHub Repository

You can access the code developed for this assignment in my GitHub repository here (https://github.com/lilyxgates/pokemon_db). The repository contains all the necessary scripts for data cleaning, analysis, and visualization, along with documentation explaining each step in the process.

Could Jack Have Survived? A Machine Learning Dive into the Titanic

Lily Gates — Mon, 12 May 2025 03:14:12 GMT

Analyzing the Titanic dataset with machine learning models (Logistic Regression, Decision Trees, and Random Forest) to predict survival rates based on demographics and passenger features.

Like many people, I remember watching Titanic and being flabbergasted that Jack couldn’t fit on the door with Rose. Maybe it was for dramatic effect — or maybe there was some data-driven truth to it? It got me wondering: Was Rose statistically more likely to survive than Jack because of her age, sex, and class? Or was the door just really not big enough for both of them?

To explore this (minus the door physics), I applied supervised learning to the Titanic dataset to understand which features were most predictive of survival. This exercise doesn’t just reveal historical insights; it also demonstrates how machine learning models can uncover patterns in data — even from a 100-year-old maritime disaster.

A question I wanted to explore with supervised learning was:
What factors were most predictive of survival during the Titanic disaster in 1912?

While we can’t change history (or save Jack), we can use data to analyze patterns in who survived and why. By training a machine learning model on real passenger data, we can get a sense of how features like sex, age, passenger class, and ticket fare may have influenced someone’s chances of making it off the ship.

Question

A question I can answer using supervised learning modeling is: What factors were most predictive of survival during the Titanic disaster in 1912?

Stakeholder

A potential stakeholder for this analysis is a maritime safety analyst — someone tasked with improving evacuation strategies and understanding risk in future maritime disasters. While the Titanic is an extreme historical case, the underlying question of who survives and why remains relevant for cruise lines, emergency planners, and safety policymakers.

Insights from this analysis could help inform future safety policies, such as whether to prioritize evacuation by demographic groups, passenger class, or cabin location — especially in fast-moving emergencies where time and space are limited.

Data Collection

To explore survival patterns, I used the classic Titanic dataset available from Kaggle’s Titanic Machine Learning competition. Specifically, I worked with the “train.csv” file, which includes information on 891 passengers — a mix of those who survived and those who didn’t. Each row represents a single passenger, and each column contains details that might influence their odds of survival.

The “Ground-Truth” Labels

The “ground-truth” labels are the passengers’ real-life survival outcomes. They were generated during the actual Titanic disaster, based on historical records of who survived and who didn’t; the data collection likely relied on ship manifests and survivor lists. In the dataset, the outcome is recorded in the “Survived” column as a binary value, where 1 indicates survival and 0 indicates death. “Survived” is the target variable that the machine learning models will predict.

Relevance to the Question

The Titanic dataset contains a mix of categorical, continuous, and ordinal features, which are important for analyzing and predicting passenger survival. Since socio-economic status, gender, age, family structure, and other factors likely played a role in survival chances, analyzing this data will help answer the question of what factors most strongly predicted survival during the Titanic disaster.

Key Variables

`Survived` — Survival outcome

Values: 0 = No, 1 = Yes
Note: This is the target label we’re trying to predict.

`Pclass` — Passenger’s ticket class (proxy for socio-economic status)

Values: 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)

`Sex` — Passenger’s biological sex

Values: male, female
Note: Categorical feature.

`SibSp` — Number of siblings or spouses aboard

Notes: Includes step-siblings; spouse refers to husband or wife (not fiancés or mistresses).

`Parch` — Number of parents or children aboard

Notes: Includes stepchildren; nannies/guardians not counted.

`Ticket` — Ticket number

Note: Categorical string — not directly meaningful without further processing.

`Fare` — Price of the ticket

Note: Continuous numerical feature.

`Cabin` — Cabin number

Note: Frequently missing; partial values may still provide deck information.

`Embarked` — Port of embarkment

Values: C = Cherbourg, Q = Queenstown, S = Southampton

Choosing a Machine Learning Model

Since the target variable, “Survived,” is binary (0 for death and 1 for survival), this is a binary classification problem rather than a regression problem: the goal is to predict a categorical outcome, not a continuous value. I will therefore use classification models to predict “Survived” from the passenger features.

Classification models that could work well include Logistic Regression, Decision Tree, and Random Forest.

Logistic Regression can provide a strong baseline model with clear, interpretable insights about the relationships between features and survival. However, it assumes linearity and may not capture complex interactions between features as well as other models.
Decision Tree can model non-linear relationships and is more flexible than logistic regression. It is also a relatively simple, interpretable model that can handle complex feature interactions. However, it may overfit, especially if the tree is too deep.
Random Forest is the most powerful option for this dataset when the goal is the best predictive performance. It can capture complex relationships, some implementations can handle missing values directly, and its ensemble nature helps reduce overfitting. However, it is harder to interpret and can require substantial computational resources, depending on the size of the dataset and the number of trees.

Data Pre-Processing

One-Hot Encoding Categorical Variables

“Pclass” — This is already a categorical variable with three distinct classes (1, 2, 3). One-hot encoding will create three columns: one for each class (1st class, 2nd class, and 3rd class).
“Sex” — This is a binary categorical variable, with values “male” and “female”. One-hot encoding will create two columns: one for “male” and one for “female”.
“Embarked” — This variable has three categories: “C” (Cherbourg), “Q” (Queenstown), and “S” (Southampton). One-hot encoding will create three columns: one for each port.

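import pandas as pd
train_data = pd.read_csv("train.csv")  # Kaggle training file

# Perform one-hot encoding on the categorical variables "Pclass", "Sex", "Embarked".
# Note: drop_first=True drops the first level of each variable (k-1 columns for k categories).
train_data_encoded = pd.get_dummies(train_data,
    columns=["Pclass", "Sex", "Embarked"], drop_first=True)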

Removing Unnecessary Columns

I’m removing “PassengerId” (identifier), “Name” (string), “Ticket” (string), and “Cabin” (string, with many missing values) because they introduce unnecessary, irrelevant, and non-meaningful data that could hurt my model’s performance. These columns don’t provide predictive value and could lead to overfitting or confusion in the model, making it harder for the algorithm to focus on the key features that matter for survival prediction.

# Drop the identifier and free-text columns from the DataFrame
train_data_cleaned = train_data_encoded.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

Addressing Missing Age Values

There are 171 missing age values among the 891 passengers, which is about 19.19% of the dataset. This proportion is large enough that it must be addressed before the classification models can be trained effectively. Possible approaches include adding an “Age Unknown” category, imputing missing ages with the mean, median, or mode, dropping rows with missing age, or converting ages into categorical age ranges (“binning ages”).

“Age Unknown” Category:
Preserve all data by labeling missing ages as “Unknown.” This treats missing-ness as a distinct category, which may capture patterns but mixes categorical and continuous data.
Imputation (Mean/Median/Mode):
Replace missing ages with a statistical measure (median is often preferred). This keeps all records and numerical consistency, though it can reduce variance.
Dropping Rows:
Remove the ~19% of rows with missing age. While it eliminates missing values without guessing, it risks significant data loss, bias, and reduced statistical power.
Binning Ages:
Convert age into categorical ranges. This can work well with Decision Trees but may reduce predictive power for Logistic Regression, since it loses the finer nuances of age. It also still leaves a large fraction of values in an “unknown” bin.

For the purposes of the research question, it is important to preserve the dataset size. Also, based on domain knowledge of the preference to prioritize women and children during evacuations, my hypothesis is that age is a significant factor in survival rate for the Titanic.

In order to determine which metric (mean, median, or mode) should be chosen, the distribution of “Age” can indicate how best to proceed.

Distribution of Age: Histogram showing the distribution of passenger age, with a slightly right-skewed distribution. The highest concentration of passengers is in the 25–30 age range, with a noticeable cluster of young children aged 0–5.

The distribution of passenger age is right-skewed, with the highest concentration of values in the 25–30 age range. There is also a noticeable cluster of very young passengers aged 0–5, and the full age range spans from 0 to around 80 years.

Due to the skewness of the data, the median is a more robust measure of central tendency than the mean, as it is less influenced by extreme values or outliers. Therefore, imputing missing age values with the median is a more appropriate choice to preserve the integrity of the distribution.

Imputing with the Median keeps age as a continuous variable, which is more effective for Logistic Regression and can still work fine for Decision Trees, especially if the tree model is designed to handle continuous splits.
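A minimal sketch of median imputation with pandas follows. The seven sample ages are made up for illustration; in the article's pipeline the same two lines would presumably be applied to the “Age” column of the cleaned DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing ages (illustrative values only)
ages = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 2.0, np.nan, 54.0]})

# Series.median() skips NaN by default, so it uses only the known ages
median_age = ages["Age"].median()

# Replace every missing age with the median; Age stays a continuous column
ages["Age"] = ages["Age"].fillna(median_age)
print(median_age, ages["Age"].tolist())
```

One design choice to be aware of: computing the median on the full dataset before splitting lets a little information from the test rows influence training. A stricter variant computes the median on the training split only and reuses it for the test split.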

Train-Test Split for Model Evaluation

Now that I’ve cleaned the DataFrame and converted all relevant features to numeric values, I can begin preparing the data for modeling. This involves splitting the dataset into input features (‘X’) and the target variable (‘y’), and then dividing it into training and testing sets using an 80/20 split to evaluate model performance on unseen data.

from sklearn.model_selection import train_test_split

# Split data into features and target
X = train_data_cleaned.drop("Survived", axis=1)
y = train_data_cleaned["Survived"]

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Training and Evaluating Models

With the data split complete, I can now train and evaluate three different classification models to predict survival: Logistic Regression, Decision Tree, and Random Forest. Each model offers a unique approach to classification, and comparing their performance will help identify the most effective one for this dataset.

1. Logistic Regression
I’ll start with logistic regression, a simple yet powerful linear model commonly used for binary classification tasks like this one.

2. Decision Tree Classifier
Next, I’ll use a decision tree model, which creates a flowchart-like structure that splits the data based on feature values to make predictions.

3. Random Forest Classifier
Finally, I’ll apply a Random Forest model, which builds multiple decision trees and combines their outputs to improve prediction accuracy and reduce overfitting.

Logistic Regression Model

I started by training a logistic regression model on the training data. Logistic regression estimates the probability that a given input belongs to a certain class — in this case, whether a passenger survived.

To interpret the model, I visualized the feature coefficients, which represent the impact of each feature on the predicted outcome. Positive coefficients (green bars) increase the likelihood of survival, while negative coefficients (red bars) decrease it. Sorting the coefficients helps highlight which features the model finds most influential.

Figure: Logistic Regression Feature Importance (Coefficients)
Each bar represents the influence of a feature on survival probability, sorted from most negative (red) to most positive (green). Features like being male, traveling in 3rd class, or embarking from Southampton reduce the likelihood of survival, while being female, traveling in 1st class, and paying a higher fare increase it.

The model’s coefficients reveal clear patterns: features such as being male, traveling in 3rd class, or embarking from Southampton are associated with a lower likelihood of survival. In contrast, being female, traveling in 1st class, and paying a higher fare significantly increase the chances of survival. These findings align closely with historical context and help validate the model’s interpretability.

The logistic regression model provides an interpretable baseline, clearly showing how different features impact survival odds. While its simplicity is a strength for understanding patterns in the data, it may not capture complex, nonlinear relationships as effectively as more advanced models like decision trees or random forests. Next, I’ll explore how a decision tree model performs on this same task.
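The coefficient ranking described above can be sketched as follows. Synthetic data from `make_classification` stands in for the Titanic features, and the feature names `f0` to `f3` and the `max_iter` setting are assumptions, so the coefficients here carry no historical meaning.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded Titanic features (names are placeholders)
X_arr, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; label and sort it
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()

# Negative coefficients lower the predicted odds (red), positive raise them (green)
colors = ["red" if c < 0 else "green" for c in coefs]
print(coefs)
```

With matplotlib installed, `coefs.plot.barh(color=colors)` would draw a bar chart in the same red-to-green ordering. Coefficient sizes are only comparable across features when the features are on similar scales, which is worth keeping in mind for a wide-ranging column like fare.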

Decision Tree Model

To refine the Decision Tree model, I first performed cross-validation to identify the optimal maximum depth. I tested depths from 1 to 20, and after calculating the cross-validation accuracy for each depth, I found that the highest accuracy occurred at a depth of 3, with an accuracy of 0.8202. This optimal depth was then used to train the final Decision Tree model, which was visualized to understand its structure.
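A depth search of this kind might look like the sketch below. It uses synthetic data, and the 5-fold setting and `random_state` values are assumptions, since the article does not state them, so the best depth found here will not necessarily be 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

cv_scores = {}
for depth in range(1, 21):  # depths 1 through 20, as in the article
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy across 5 folds (the fold count is an assumption)
    cv_scores[depth] = cross_val_score(tree, X, y, cv=5, scoring="accuracy").mean()

# Depth with the highest mean cross-validation accuracy
best_depth = max(cv_scores, key=cv_scores.get)
print(best_depth, round(cv_scores[best_depth], 4))
```

Running the search on the training split only keeps the held-out test set untouched for the final comparison between models.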

Decision Tree Visualization with Optimal Depth (max_depth=3): This tree structure shows the decision-making process for predicting survival, where the optimal depth of 3 provides a clear distinction between features that most influence survival chances, such as age, class, and fare.

In addition to visualizing the tree, I also extracted the decision rules from the model and saved them to a text file. These rules outline the conditions under which the model predicts survival or death, adding an extra layer of interpretability to the model. The decision tree visualization and the accompanying rules provide valuable insights into how the model makes predictions.
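One way to extract such rules is scikit-learn's `export_text`. The sketch below is an assumption about the approach, not the repository's code: it uses synthetic data, placeholder feature names, and a hypothetical output file in the system temp directory.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Render the fitted tree as indented if/else style text rules
rules = export_text(tree, feature_names=feature_names)

# Hypothetical output location; any writable path would do
rules_path = Path(tempfile.gettempdir()) / "decision_tree_rules.txt"
rules_path.write_text(rules)
print(rules)
```

Each line of the output is either a threshold test on a feature or a leaf with the predicted class, which makes it easy to read a path from the root down to a prediction.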

Decision Tree Model with Optimal Depth (max_depth=3) and Extracted Rules: The first image shows the decision tree at its optimal depth of 3, illustrating how different features like age, class, and fare influence the survival prediction. The second image presents the rules extracted from the decision tree, detailing the conditions under which the model predicts survival or death, offering a transparent view of the model’s decision-making process.

The decision tree model reveals several patterns in predicting survival on the Titanic based on key features such as gender, age, class, and fare.

For female passengers, survival largely depended on their class and age. Those in 3rd class with a fare less than or equal to 23.35 were more likely to survive, while those with a higher fare had a higher likelihood of not surviving. Among females in 1st or 2nd class, survival was associated with being older than 2.5 years.

On the other hand, male passengers were less likely to survive overall. For males, if they were younger than 6.5 years and had two or fewer siblings or spouses aboard, they had a higher chance of survival. However, older males or those with more than two siblings/spouses aboard were predicted to not survive. Furthermore, males in 1st class were unlikely to survive, regardless of their age.

These insights highlight the significant role of gender, class, age, and family aboard in the model’s predictions of survival.

Random Forest Model

For the Random Forest model, I trained the classifier on the training data and then made predictions on the test set. I then extracted the feature importances from the trained model and created a DataFrame to display each feature’s importance in descending order.

To visualize the results, I plotted a bar chart showing the relative importance of each feature, which highlights which variables most influence the survival prediction. This plot helps to understand the factors that are most influential in determining survival outcomes according to the Random Forest model.
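The steps above can be sketched as follows, again on synthetic data. The feature names, `n_estimators=100`, and the `random_state` values are assumptions for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic features (placeholder names)
X_arr, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Impurity-based importances: non-negative and summing to 1 across features
importance_df = (
    pd.DataFrame({"feature": X.columns, "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

These scores say how much each feature contributes to the forest's splits, but not whether higher values push a prediction toward survival or away from it. Direction has to come from another source, such as the logistic regression coefficients or the decision tree rules.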

Random Forest Feature Importance: This plot ranks the features based on their contribution to predicting survival, with age, fare, and gender being the most influential factors. Importance scores measure how much a feature contributes to the model’s splits, not the direction of its effect; read alongside the other two models, they are consistent with passengers who were younger, paid higher fares, or were female having a higher likelihood of survival, while 3rd-class passengers and males had lower survival chances.

The “Random Forest Feature Importance” plot visually ranks the features based on their importance in predicting survival. The most influential factors were age, fare, and gender, with age being the highest-ranked predictor, suggesting that a passenger’s age played a crucial role in determining survival likelihood. Fare followed closely behind, indicating that passengers who paid higher fares had a higher chance of survival. Gender also emerged as a key determinant, with being female greatly increasing the likelihood of survival, while being male reduced it.

Other important features included Pclass_3, which revealed that passengers traveling in 3rd class had a significantly lower chance of survival compared to those in higher classes. Family-related features, such as SibSp (siblings/spouses aboard) and Parch (parents/children aboard), also played a role but had a weaker impact compared to the other factors.

The features with the lowest importance were related to the point of embarkation, with Embarked S, Embarked C, and Embarked Q ranking as the least influential variables in predicting survival.

This ranking highlights the relative significance of each feature, with age, fare, and gender standing out as the most critical variables in determining survival.

Answering the Research Question Using the Random Forest Model

To answer the stakeholder’s question (What factors were most predictive of survival during the Titanic disaster in 1912?), I used the Random Forest model, which had the highest accuracy, recall, and F1 score of the three models; the Decision Tree was slightly ahead on precision.

I trained the Random Forest classifier on the training data and made predictions on the test set. Afterward, I extracted the feature importances from the trained model, which helped me understand the key factors influencing survival during the Titanic disaster.

Random Forest Feature Importance:

The “Random Forest Feature Importance” plot ranks the features based on their contribution to predicting survival. The most influential factors identified by the model were:

Age: Age was the highest-ranked predictor, suggesting that younger passengers were more likely to survive. This likely reflects the “women and children first” evacuation policy.
Fare: Passengers who paid higher fares had a significantly better chance of survival. This is likely because they were more likely to be traveling in higher-class cabins, with better access to lifeboats.
Gender: Being female was one of the most important predictors of survival. Women had a significantly higher likelihood of survival, aligning with the historical practice of prioritizing women during evacuation.

Other important features included:

Pclass_3: Passengers traveling in 3rd class had a much lower survival rate than those in 1st or 2nd class. This emphasizes the impact of class on survival, with 3rd-class passengers being farther from lifeboats and having lower priority in evacuations.
Family-related features (SibSp and Parch): Having family members aboard did contribute to survival, but to a lesser extent than other factors such as age, fare, and gender.

Features like Embarked (S, C, Q) had the least influence, indicating that the point of embarkation wasn’t a significant predictor of survival.

Conclusion

The Random Forest model, which had the best overall performance of the three models, showed that age, fare, and gender were the most critical factors in predicting survival on the Titanic. Passengers who were younger, paid higher fares, and were female had a higher likelihood of survival, while 3rd-class passengers and males were less likely to survive. Family-related features also had some impact but were less significant. These insights can inform future maritime safety policies, particularly about prioritizing evacuation based on demographic factors, class, and fare.

Evaluating All Three Classification Models

Comparing: Accuracy, Precision, Recall, F1 Score

Accuracy

Accuracy measures the overall percentage of correct predictions (both survived and not survived). All models perform similarly in terms of accuracy, with Random Forest being the highest, albeit by a small margin.

Precision

Precision tells us the proportion of positive predictions (survived) that were actually correct. In this case, Decision Tree has the highest precision, meaning it’s the best at correctly identifying survivors without mistakenly predicting non-survivors as survivors.

Recall

Recall measures the proportion of actual survivors that were correctly identified. Random Forest performs the best here, meaning it correctly identifies the most survivors among all models. Logistic Regression is the second-best, while Decision Tree is the least sensitive to actual survivors.

F1 Score

The F1 Score is the harmonic mean of precision and recall, providing a balance between the two metrics. Random Forest has the highest F1 score, suggesting it strikes the best balance between precision and recall. Logistic Regression follows closely, while Decision Tree slightly lags behind.
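All four metrics are available in `sklearn.metrics`. The sketch below applies them to ten made-up labels, not the article's test set, so the definitions can be checked by hand.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Ten made-up passengers: 1 = survived, 0 = did not survive
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
# By hand: TP = 3, FN = 1, FP = 1, TN = 5

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

In the real comparison, `y_true` would be `y_test` and `y_pred` the predictions of each fitted model on `X_test`.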

Conclusion

Best model: Random Forest is the best-performing model overall, having the highest accuracy, recall, and F1 score, though the precision is just slightly lower than Decision Tree.
Precision vs. Recall tradeoff: While Decision Tree excels in precision (fewer false positives), Random Forest outperforms all models in recall, indicating that it is better at capturing actual survivors.
Overall performance: All models perform similarly in terms of accuracy, but Random Forest offers a better balance of precision, recall, and F1 score.

Comparing: Confusion Matrix Values

Confusion Matrices for Logistic Regression, Decision Tree, and Random Forest Models (Left to Right)

Summary of Confusion Matrix Analysis:

True Positives (Survived): Random Forest identified the most survivors, followed by Logistic Regression and Decision Tree.
False Positives (Not Survived): The Decision Tree made the fewest false positive predictions (13), while Logistic Regression and Random Forest made 15 each.
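True Negatives (Not Survived): Logistic Regression and Random Forest each correctly predicted 90 non-survivors. Because every model is scored on the same test set, true negatives and false positives must sum to the same total (90 + 15 = 105 non-survivors), so the Decision Tree’s 13 false positives correspond to 92 true negatives.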
False Negatives (Survived): Random Forest missed the fewest survivors (19), followed by Logistic Regression (21) and Decision Tree (23), indicating that Random Forest is better at detecting actual survivors.

Conclusion

Overall, the Random Forest model performed slightly better in identifying survivors and non-survivors accurately, while Decision Tree had fewer false positives.

The Best Performing Model Overall

Random Forest Model

After evaluating the performance of Logistic Regression, Decision Tree, and Random Forest classifiers using four key metrics — accuracy, precision, recall, and F1 score — and analyzing their respective confusion matrices, the Random Forest model emerges as the best overall performer.

Accuracy: Random Forest achieved the highest accuracy (81.01%), slightly outperforming both Logistic Regression and Decision Tree.
Recall: It also had the highest recall (74.32%), meaning it identified the most true survivors — a crucial factor in applications where missing positive cases (e.g., survivors) carries higher cost.
F1 Score: With the best balance between precision and recall (F1 Score = 76.39%), Random Forest demonstrates strong overall reliability in classification.
Confusion Matrix: Random Forest correctly identified the most true positives (55) and had the fewest false negatives (19), indicating better sensitivity to survivors. While Decision Tree had the fewest false positives, Random Forest still maintained solid precision with stronger overall trade-offs.

Random Forest is the most effective and well-rounded model, offering the best combination of accuracy, recall, and F1 score, along with strong confusion matrix performance. It is best suited for this classification task.
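For readers who want to check these figures, here is a minimal sketch that recomputes the Random Forest metrics from the confusion matrix counts reported above (55 true positives, 19 false negatives, 15 false positives, 90 true negatives). The helper name "classification_metrics" is my own illustration and is not taken from the repository.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # share of predicted survivors who survived
    recall = tp / (tp + fn)             # share of actual survivors who were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Random Forest counts as reported in the confusion matrix summary
rf = classification_metrics(tp=55, fp=15, tn=90, fn=19)
for name, value in rf.items():
    print(f"{name}: {value:.2%}")
```

With these counts, accuracy comes out to 81.01%, recall to 74.32%, and F1 to 76.39%, matching the values quoted above; precision works out to 78.57%.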

Misclassification Analysis of the Random Forest Model

Misclassified Cases Identified in the Random Forest Model Output

Despite the Random Forest model outperforming the other two models in terms of accuracy, recall, and F1 score (its precision was slightly lower than the Decision Tree’s), it still made notable misclassifications. These errors occurred in the form of false positives (predicting survival when the individual did not survive) and false negatives (predicting non-survival when the individual did survive).

False Positives (Predicted Survived, Actually Did Not)
In many of these cases, the model likely identified strong survival indicators such as being female, not in third class, or traveling with family members. While these are generally associated with higher survival rates, they do not guarantee survival. These misclassifications highlight “edge cases” where the passengers matched common survival profiles but did not survive, possibly due to unmodeled factors like the timing of evacuation or cabin location.

False Negatives (Predicted Not Survived, Actually Did)
Conversely, some passengers were incorrectly predicted not to survive despite actually doing so. The model may have placed too much emphasis on being male or in third class, traits commonly linked to lower survival rates. However, some individuals, such as young females or wealthy first-class males, did survive — demonstrating that survival could defy general trends. These cases suggest the model lacks access to more granular or contextual features, such as access to lifeboats, crew assistance, or interpersonal dynamics during evacuation.

Technologies Used

Python: used for data analysis, model building, and evaluation
Pandas: reading, cleaning, and transforming the Titanic dataset
NumPy: used for handling arrays and performing mathematical operations during the analysis
Scikit-learn: provided tools for building and evaluating models (e.g., Random Forest, Logistic Regression, Decision Tree), as well as for calculating metrics like accuracy, precision, recall, and F1 score.
Matplotlib/Seaborn: creating plots, charts, and graphs to present the model’s performance and analysis results, such as confusion matrices and performance metrics

Challenges Others Might Encounter

One-Hot Encoding: You need to convert categorical variables like “Pclass,” “Sex,” and “Embarked” into numbers for the model to work properly. This can be tricky if you’re not familiar with how to encode these categories.
Removing Unnecessary Columns: Some columns like “PassengerId,” “Name,” “Ticket,” and “Cabin” don’t provide useful information for prediction. If these are left in, they might confuse the model or hurt its performance.
Handling Missing Data: Missing data, like missing “Age” values, is common. You can fill in these gaps with the average value or use other methods, but it’s important to decide on the best approach for your dataset.
Feature Scaling: Some algorithms need the features (like “Age” and “Fare”) to be on a similar scale. Without scaling, the model might focus more on features with larger values, like “Fare.” This can make the model less effective.

These are common hurdles in machine learning, and overcoming them helps improve the model’s accuracy, reliability, and fairness.
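The four steps above can be sketched with pandas as follows. This is only an illustration under assumptions: the tiny DataFrame is made up, the column names follow the usual Titanic dataset layout, and filling "Age" with the mean is just one of several reasonable choices.

```python
import pandas as pd

# Tiny made-up sample using the usual Titanic column names (values are illustrative)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [None, "C85", None, "E46"],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.92, 51.86],
    "Embarked": ["S", "C", "S", "S"],
})

# 1. Drop identifier-like columns that carry little predictive signal
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# 2. Fill missing ages with the mean age
df["Age"] = df["Age"].fillna(df["Age"].mean())

# 3. One-hot encode the categorical columns as 0/1 integers
df = pd.get_dummies(df, columns=["Pclass", "Sex", "Embarked"], dtype=int)

# 4. Standardize the continuous columns to mean 0 and standard deviation 1
for col in ["Age", "Fare"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)

print(df.round(2))
```

Tree-based models such as Random Forest are generally insensitive to feature scale, so step 4 matters most for models like Logistic Regression.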

Limitations of Analysis and Potential Bias

Despite the Random Forest model outperforming the other models in key metrics, there are several limitations to this analysis. First, the model was trained on a limited feature set, excluding potentially informative variables such as cabin location, family identifiers, or group ticket information. This simplification may overlook important survival patterns.

Moreover, while Random Forests are robust, they can overfit or be biased toward features with many distinct values. The model also lacked access to critical contextual factors — such as proximity to lifeboats, crew behavior, or personal connections — which were likely influential in real survival outcomes but are not reflected in the data. As a result, the model may over-rely on demographic proxies like class, fare, and gender, potentially reinforcing historical biases rather than uncovering deeper causal patterns.

Finally, because the model was validated only on a holdout portion of the same dataset, its generalizability to other scenarios or datasets remains untested.

GitHub Repository

You can access the code developed for this assignment in my GitHub repository here (https://github.com/lilyxgates/titanic_machine_learning). The repository contains all the necessary scripts for data cleaning, analysis, and visualization, along with documentation explaining each step in the process.

Could Jack Have Survived? A Machine Learning Dive into the Titanic was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

Finding Patterns Among Pop Royalty: A Clustering Analysis of Top Female Artists

Lily Gates — Fri, 09 May 2025 10:36:10 GMT

Some of the top-streamed female artists, pictured from left to right: Ariana Grande, Rihanna, Billie Eilish, Taylor Swift, Beyoncé, Dua Lipa, Nicki Minaj, Sia, Halsey, and Selena Gomez. (Image source: IndigoMusic.com)

Using K-Means clustering to uncover trends and similarities among top female pop artists, analyzing music features, popularity, and more.

Since the 2000s, female pop artists have been my guilty pleasure. As much of a tomboy as I was growing up, I couldn’t help but be drawn to the glamor, drama, and theatrical flair of a true pop star. There was something magnetic about them — whether it was the reinvention of Lady Gaga, the power vocals of Adele, or the bubblegum punch of Katy Perry. But as time has passed, only a few of these artists have truly stood the test of time. Taylor Swift and Beyoncé come to mind — not just surviving but thriving across decades and trends.

This got me thinking: what is it that defines the women who endure, who dominate the spotlight year after year, while others fade into nostalgia? In this project, I set out to explore whether today’s most successful female artists could be grouped into distinct “types” or archetypes, and whether these patterns could help us understand who rises, who stays, and what kind of pop star resonates across eras.

Question

A question I can answer using clustering data is:

What are the different clusters or groupings of popular female artists based on their genre, popularity, and social reach on Spotify?

Stakeholder

A stakeholder who might ask this question is a talent agent looking to book headliners for a major music festival like Coachella. Since Coachella features a wide range of artists across genres and popularity levels, this stakeholder would want to understand which artists appeal to similar audiences and how different “types” of pop stars are represented in today’s music landscape. The goal would be to identify top representatives from each cluster or archetype to help build a diverse and well-balanced lineup that appeals to a broad audience and ultimately drives ticket sales.

Data Collection

To answer this question, I used data from Billboard’s Top 100 Women Artists of the 21st Century Chart to create my initial list of artists to explore. This ranked list was created by Billboard based on their performance on the Billboard Hot 100 and Billboard 200 charts from January 1, 2000, through December 28, 2024.

To handle the sensitivity and complexity of artist names, I had to ensure that the names I passed in, based on Billboard’s spelling of its artist names (as of 2024), matched the results from Spotify’s search API. For instance, names like Beyoncé (with the accent) and Beyonce (without the accent) may appear as different entries to the API, even though they refer to the same artist. In addition, “Lil’ Kim” may appear as a different entry compared to “Lil Kim” (without the apostrophe). These discrepancies can lead to inaccurate search results, such as fetching the wrong artist’s data or failing to match an artist at all.

To resolve these issues, I implemented a normalization step where I:

Removed all accents from artist names.
Converted names to lowercase to ensure consistent matching.

I used Python’s “unicodedata” module to perform this normalization.
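A minimal sketch of such a normalization helper is shown below. The post does not include the original implementation, so this version (NFKD decomposition followed by dropping combining marks) is an assumption about one common way to do it with the standard library.

```python
import unicodedata

def normalize(name):
    """Strip accents and lowercase an artist name for comparison."""
    # NFKD splits accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks (the accents) and keep the base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()

print(normalize("Beyoncé"))  # beyonce
print(normalize("Beyoncé") == normalize("Beyonce"))  # True
```

Note that this only handles accents and case. Punctuation differences, such as the apostrophe in “Lil’ Kim”, pass through unchanged, which is why a fuzzy comparison is still useful afterward.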

Even after normalizing the names, there could still be small differences (like spacing, punctuation, etc.). So, I used Python’s “difflib” module to implement a close match check. Its “SequenceMatcher” class compares two strings and returns a similarity ratio between 0 and 1. If the ratio is at or above a defined threshold (e.g., 0.8 or 80%), the names are considered a close match. This step helps identify cases where minor spelling variations still need to be treated as equivalent (e.g., Beyoncé vs Beyonce, or Lil’ Kim vs Lil Kim).

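# Close match check using difflib
import difflib

def is_close_match(input_name, spotify_name, cutoff=0.8):
    # normalize() is the accent-stripping, lowercasing helper described above
    norm_input = normalize(input_name)
    norm_spotify = normalize(spotify_name)
    # ratio() returns a similarity score between 0 and 1
    return difflib.SequenceMatcher(None, norm_input, norm_spotify).ratio() >= cutoff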

After normalizing and matching the names, I searched Spotify for artist data. If a close match was found, I collected the artist’s ID and other metadata.

I used the Spotify Web API (accessed through the “spotipy” Python library) to collect metadata on each artist.

The data fields I collected include:

Artist Name: The name of the artist (e.g., Beyoncé, Taylor Swift).
Spotify Artist ID: Unique identifier used to query artist data from Spotify.
Genres: A list of genre tags that Spotify associates with the artist.
Popularity: A score between 0 and 100 based on the artist’s recent streaming performance and activity.
Followers: The total number of Spotify users who follow the artist.

These fields are relevant because they reflect both the musical identity of each artist (through genres) and their popularity and reach (through Spotify’s metrics).

Cleaning and Storing Data

I took the results from the Spotify artist search and organized them into a structured format using a “pandas” DataFrame. This DataFrame has three key columns: the original names of the artists as they were inputted, the names of the artists as matched by Spotify (which might include small adjustments like corrected spellings), and the unique Spotify IDs for each artist.

Once the DataFrame is created, I check for any duplicates in the artist IDs. This is an important step to ensure that there are no duplicate entries in the dataset that could skew the results. If any duplicates are found, the program will print a warning message, providing transparency in the data and alerting me to any potential issues with the search results.

After ensuring that the data is clean, the DataFrame “artist_df”, which contains all of the matched artist names and their corresponding IDs, is saved to a CSV file.

Finally, the code prints a success message along with the full DataFrame, so I can immediately review the results of the artist search and verify that everything has been processed correctly.

import pandas as pd

# Create DataFrame with results
# (original_names, matched_names, and artist_ids are the lists built during the search)
artist_df = pd.DataFrame({
    "original_search_name": original_names,
    "matched_spotify_name": matched_names,
    "id": artist_ids
})

# Show duplicates if any (rows without an ID are ignored)
duplicates = artist_df[artist_df.duplicated('id', keep=False) & artist_df['id'].notnull()]
if not duplicates.empty:
    print("\nWARNING: Duplicate artist IDs found:\n")
    print(duplicates)

# Save to CSV (optional)
artist_df.to_csv("spotify_artist_ids.csv", index=False)

# - - - - - - - - - - - - - - - -
# Display result
print("\nSUCCESS: Artist search complete!")
print(artist_df)

Note on “genre” information:

It’s important to note that some artists do not have genres listed in the Spotify API. Spotify’s genre assignment process is not entirely consistent. Genres are typically assigned based on user listening behavior, editorial tagging, and algorithmic clustering. However, for very big artists, Spotify sometimes relies more on user engagement and less on explicit genre classification, which ironically leaves the genre list blank.

For instance, high-profile artists like Taylor Swift, Rihanna, Beyoncé, and Ariana Grande are classified with “[]” in the genre category, despite being clearly recognized as pop artists. These artists often transcend genre boundaries or fit into multiple high-level genres, making them difficult to categorize within Spotify’s internal system.

Unfortunately, I cannot access genre data for these artists’ top tracks or albums via the API, as Spotify does not list this information. Additionally, with the November 2024 API policy update, access to endpoints like “/related-artists” and “/audio-features” has been further restricted, limiting my ability to infer genres based on related artists or additional track data.

The clustering process, particularly with KMeans, could be impacted by artists with missing genres in the following ways:

Loss of Genre Information: Artists without genres will not contribute to the one-hot encoded genre columns. This means that if an artist doesn't have a genre, those columns will be filled with all 0s for that artist. As a result, the clustering will have less genre-related differentiation for these artists compared to others, potentially reducing the meaningfulness of the clustering in relation to genre-based patterns.
Effect on Feature Scaling: Since the one-hot encoding for genres creates a binary feature for each genre, artists with missing genres end up with “NaN” or all-zero entries in those columns. Although the numerical features (ranking, popularity, and followers) are normalized, the missing genre data can still produce skewed or less informative feature vectors. Artists with no genre information will appear similar to each other but very different from artists with multiple genres, because they lack any genre-based features, while artists with genre information have several non-zero values in the genre columns. These differences may end up driving how clusters form.
Clustering Interpretation: When the clustering is performed, the resulting clusters will group artists based on the available data (numerical features and the genre information). However, artists with no genre information could be grouped based on only their ranking, popularity, and followers, which may make them stand out as distinct clusters, even if their musical style (genre) would normally place them with similar artists.

Cleaning and Reformatting the DataFrame

To prepare the dataset for clustering, I followed several key steps to clean and reformat the data into a more usable structure:

One-Hot Encoding for Genres: I started by transforming the genre information for each artist into a one-hot encoded format. The genres for each artist were initially stored as lists, so I used the ‘explode’ method to split these lists into separate rows. Then, I applied ‘str.get_dummies’ to create individual columns representing each genre. In these columns, a value of ‘1’ indicates that the artist belongs to that genre, while a ‘0’ indicates they do not.
Grouping by Artist: Since the ‘explode’ method created multiple rows for each artist (one for each genre), I needed to collapse these rows back into a single row per artist. To do this, I used ‘groupby(df_genres.index).max()’, which ensures that for each artist, if they belong to a genre, the corresponding column is set to ‘1’. This step allowed me to consolidate the genre data back into a format where each artist has one row with a ‘1’ for each genre they belong to.
Combining Features: Next, I combined the relevant features from the original ‘artist_df’ DataFrame, including the ranking, popularity, and followers, with the one-hot encoded genre columns. I used the ‘pd.concat()’ function to merge these features along the columns, creating a new DataFrame called ‘df_combined’. This DataFrame now contains both the numerical features and the one-hot encoded genres for each artist.
Normalization of Numerical Features: To ensure that the numerical features (ranking, popularity, and followers) are on the same scale for clustering, I normalized them using z-scores. The normalization process adjusted the features so that they have a mean of 0 and a standard deviation of 1. This step was important because clustering algorithms like KMeans can be sensitive to the scale of the data.
Preparing for Clustering: After cleaning and normalizing the data, ‘df_combined’ was now in the ideal format for clustering. The numerical features were normalized, and the categorical genre features were one-hot encoded, making the dataset ready for algorithms like KMeans.
Saving the Data: Finally, I saved the cleaned and reformatted DataFrame, ‘df_combined’, to a CSV file for future use. I stored the file in the current working directory under the name ‘spotify_artist_metadata.csv’. A confirmation message was printed, showing the exact location of the saved file.
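The steps above can be sketched roughly as follows. The small “artist_df” here is a made-up stand-in with hypothetical values, and the column names are assumptions for illustration; the repository code may differ in its details.

```python
import pandas as pd

# Hypothetical stand-in for artist_df; the values are made up for illustration
artist_df = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "ranking": [1, 2, 3],
    "popularity": [90, 70, 80],
    "followers": [5_000_000, 1_000_000, 3_000_000],
    "genres": [["pop", "r&b"], ["country"], []],   # artist C has no genres listed
}).set_index("artist")

# Explode the genre lists (one row per artist-genre pair), then one-hot encode
df_genres = artist_df["genres"].explode().str.get_dummies()

# Collapse back to one row per artist; max() keeps a 1 if any exploded row had it
df_genres = df_genres.groupby(df_genres.index).max()

# Combine the numerical features with the one-hot encoded genre columns
numeric_cols = ["ranking", "popularity", "followers"]
df_combined = pd.concat([artist_df[numeric_cols], df_genres], axis=1)

# Z-score normalization: mean 0 and standard deviation 1 for each numerical column
for col in numeric_cols:
    df_combined[col] = (
        df_combined[col] - df_combined[col].mean()
    ) / df_combined[col].std(ddof=0)

print(df_combined.round(2))
```

In this sketch, artist C (the one with an empty genre list) ends up with 0 in every genre column, which is the situation described in the note on missing genres.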

Measuring Similarity and Feature Selection

In this analysis, similarity between artists is measured using the features of Spotify popularity score, total number of followers, and genre affiliations.

Spotify Popularity Score: This feature ranges from 0 to 100, representing how popular an artist is on Spotify based on the number of streams, listens, and other metrics.
Total Number of Followers: Represents the number of followers an artist has on Spotify. This can be in the millions, so it has a larger scale compared to the popularity score.
Genre Affiliations: Artists can belong to multiple genres. To handle this, I used one-hot encoding to transform each genre into a binary column (e.g., Pop = 1, Hip-hop = 0).

Since these features are on different scales (popularity score on a 0–100 scale, followers in millions, and binary genres), I normalized the numerical features using StandardScaler from sklearn. StandardScaler standardizes the data to have a mean of 0 and a standard deviation of 1, ensuring each feature contributes equally to the distance calculation. This prevents any single feature (e.g., the number of followers) from dominating the clustering process and ensures that genre affiliations can influence clustering patterns.
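To illustrate why scaling matters, here is a small standard-library sketch with three made-up artists. It applies the same z-score transformation that StandardScaler performs (using the population standard deviation) and compares Euclidean distances before and after. The numbers are hypothetical and not drawn from my dataset.

```python
import math
import statistics

# Made-up (popularity, followers) pairs for three hypothetical artists
artists = {
    "A": (90, 5_000_000),
    "B": (40, 5_100_000),   # very different popularity, similar followers
    "C": (88, 9_000_000),   # similar popularity, very different followers
}

def zscores(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

names = list(artists)
pop_z = zscores([artists[n][0] for n in names])
fol_z = zscores([artists[n][1] for n in names])
scaled = {n: (p, f) for n, p, f in zip(names, pop_z, fol_z)}

# Raw distances: the follower counts dominate, so B looks far closer to A than C does
raw_ab = math.dist(artists["A"], artists["B"])
raw_ac = math.dist(artists["A"], artists["C"])

# Scaled distances: the popularity gap and the follower gap carry comparable weight
scaled_ab = math.dist(scaled["A"], scaled["B"])
scaled_ac = math.dist(scaled["A"], scaled["C"])

print(f"raw:    A-B = {raw_ab:,.0f}   A-C = {raw_ac:,.0f}")
print(f"scaled: A-B = {scaled_ab:.2f}   A-C = {scaled_ac:.2f}")
```

On the raw values, the 50-point popularity gap between A and B is practically invisible next to the follower counts. After standardizing, the two gaps contribute on a similar scale.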

Clustering Algorithm

I used the “KMeans” implementation from “sklearn” to partition the artists into clusters based on the similarity of the selected features. The KMeans algorithm measures the similarity between data points (artists) using Euclidean distance, which calculates the straight-line distance between points in a multi-dimensional feature space (where the dimensions are the normalized popularity, followers, and genres).

Selecting the Number of Clusters (K)

To select the optimal number of clusters (K), I used the elbow method. This method involves calculating the inertia, which is the sum of squared distances from each data point to its assigned cluster center. As K increases, inertia decreases, but beyond a certain point, the rate of decrease slows down. The optimal K is identified at the “elbow” point, where the inertia starts to decrease at a slower rate.

To visualize this, I plotted inertia for a range of K values and looked for the elbow point. In this analysis, the elbow appears at K = 10, where the rapid decrease in inertia begins to plateau, so I used 10 clusters.

The graph shows the elbow method for KMeans, with the optimal K value at 10 where inertia begins to plateau.
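To make the inertia calculation concrete, here is a small pure-Python sketch of the elbow method on made-up 2-D points. It is a simplified stand-in for sklearn's KMeans (which exposes the same quantity as its “inertia_” attribute), with a deterministic farthest-point start that I chose for reproducibility. The toy data has three obvious groups, so its elbow falls at K = 3 and has no bearing on the K = 10 found for the artist data.

```python
import math

def assign(points, centers):
    """Index of the nearest center for each point (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def farthest_point_init(points, k):
    """Deterministic start: repeatedly add the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, max_iter=50):
    """Run a basic Lloyd's k-means and return the final inertia."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:      # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: sum of squared distances from each point to its assigned center
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Made-up data: three tight groups of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(f"K={k}: inertia={value:.2f}")
```

Inertia drops sharply up to K = 3 and barely changes afterward, which is the “elbow” shape to look for in the plot.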

Results

Describing Clusters by their Features

Normalized data on average ranking, followers, and popularity, as well as top genres within each cluster

The clustering of artists reveals distinct groupings based on their average Billboard ranking, Spotify followers, popularity, and genre associations. Cluster 6 stands out the most in terms of popularity, with a high z-score of +4.74, despite having one of the lowest average Billboard rankings (–1.59), suggesting it includes newer or recently viral artists with strong current momentum. Similarly, Cluster 5 also shows high popularity (+2.31) and strong follower numbers (+1.43), but it too ranks low on Billboard historically (–1.60), reinforcing the trend of rising contemporary figures. In contrast, Cluster 1 contains the artists with the highest average Billboard rankings (+1.33) but the lowest Spotify presence, as reflected by the sharp deficits in both followers (–2.45) and popularity (–0.69). This group may represent legacy artists with lasting cultural prestige but reduced current engagement.

When considering follower counts, Clusters 6, 5, 3, and 7 lead the way, suggesting these artists have large or growing fan bases. Cluster 3 in particular balances low Billboard ranking (–0.81) with strong followers (+1.11) and above-average popularity (+0.75), indicating these are mid-level names with loyal listenership and active releases. Meanwhile, Clusters 8 and 2 both show low follower and popularity scores but have high Billboard rankings, again highlighting a potential divide between historical impact and present-day traction.

In terms of genres, “alternative metal” appears across nearly every cluster, indicating broad genre-tagging that likely overlaps with pop-rock hybrids. Genres like “pop”, “r&b”, and “adult standards” are also widely represented. A few clusters exhibit unique genre profiles that help differentiate them: Cluster 3 includes latin pop, setting it apart culturally; Cluster 8 uniquely features east coast hip hop, pointing to a more urban influence; and Cluster 4 is notable for its inclusion of emo, emo pop, and pop punk, indicating a tilt toward alternative subcultures. Cluster 1, with high Billboard scores but low streaming, features genres like christmas, adult standards, and celtic, suggesting seasonal or traditional music popularity.

Overall, the clustering highlights the contrast between historical prestige (ranking), current streaming relevance (popularity), and fan engagement (followers), as well as how certain genre combinations correspond with different types of artists’ career stages and audience demographics.

Artists in Each Cluster, Sorted by Billboard Top 100 Rank

Artists in each cluster, sorted by their Billboard Top 100 rank

Cluster 0 — Balanced Fame Across Pop, R&B, and Seasonal Genres

This cluster includes a mix of artists who have built their fame across various genres, notably pop, R&B, and seasonal (holiday) music. They are known for their broad appeal and consistent recognition over time. Artists like Alicia Keys and Norah Jones are known for their vocal prowess, while Colbie Caillat and Leona Lewis are recognized for their softer, more mainstream pop styles.

Cluster 1 — Legacy Vocalists with Niche or Holiday Audiences

This group consists of iconic, legacy vocalists whose popularity often peaks during certain times of the year (e.g., Christmas). These artists, like Mariah Carey and Celine Dion, are beloved for their powerful, emotive voices but have more niche audiences. Their music often spans genres like holiday music and adult contemporary, creating a lasting legacy but with less mainstream attention in the current era.

Cluster 2 — Country-Focused Artists with Modest Mainstream Impact

Artists in this cluster, like Carrie Underwood and Shania Twain, have made their mark in country music with a strong connection to their roots, but they haven’t fully crossed over to mainstream pop. Their success is primarily in country and its subgenres, with some crossover appeal to pop audiences. They are staples in the country music scene but not necessarily recognized as top-tier pop stars.

Cluster 3 — Highly Popular Pop and Crossover Stars

This cluster includes some of the biggest pop stars of the modern era. Katy Perry, Dua Lipa, and Selena Gomez are known for their global reach, chart-topping hits, and massive fan bases. These artists blend pop music with other genres, contributing to their mainstream popularity. They represent the modern pop sound that continues to shape mainstream music culture.

Cluster 4 — Rock, Emo, and Alt-Pop Icons

In this cluster, artists like Avril Lavigne and Evanescence stand out with their contributions to rock, emo, and alt-pop genres. These artists, often associated with alternative music scenes, have a dedicated fan base, especially among younger audiences who resonate with their edgy, emotional lyrics. They are known for their defiance of traditional pop conventions.

Cluster 5 — Iconic Vocalists with Massive Legacy Fame

This cluster features legendary figures like Beyoncé, Adele, Taylor Swift, and Whitney Houston, who have left an indelible mark on music. These artists are known not only for their vocal talent but also for their ability to dominate the charts and cultural conversation for years. Their music spans multiple genres, and they hold iconic status within the industry.

Cluster 6 — Ultra Mainstream Superstars with Peak Pop Stardom

Artists such as Rihanna, Ariana Grande, and Billie Eilish dominate the pop landscape and are widely recognized across the globe. They represent the pinnacle of pop stardom with mass appeal that crosses cultural and age boundaries. These artists are also highly influential in setting music trends and often have a large following on social media platforms.

Cluster 7 — Popular R&B/Pop Crossovers with Loyal Audiences

SZA, Halsey, and Camila Cabello are the key figures in this group. These artists blend R&B with pop, attracting a loyal and engaged fanbase. They are known for creating deeply personal and relatable music, often dealing with themes of love, heartbreak, and identity. Their crossover appeal allows them to maintain a strong presence both in pop and R&B.

Cluster 8 — Country-Pop and Crossover Artists with Steady Popularity

Artists in this cluster, such as Miranda Lambert and Martina McBride, maintain a strong following due to their country-pop crossover appeal. Their steady popularity in the country and country-pop genres ensures they remain relevant, but they do not necessarily achieve the same level of global pop stardom as artists in other clusters.

Cluster 9 — Mid-Tier Artists with Genre Diversity

This cluster includes Dido, Enya, and No Doubt — artists who have had successful careers but haven’t consistently stayed in the mainstream spotlight. Known for their genre diversity, they appeal to a more niche audience with their unique sound, whether it’s pop, alternative rock, or new age. Their music continues to resonate with listeners, though they are not at the forefront of current pop culture.

Technologies Used

Data Collection: Using the “spotipy” library to pull artist metadata from Spotify’s API.
Data Processing: Clean and preprocess data using “pandas” to format the artist data and apply any necessary transformations (e.g., normalization, handling missing values).
Fuzzy Matching: Use Python’s “difflib.SequenceMatcher” to calculate the similarity between artist names and identify likely duplicates. This method returns a ratio between 0 and 1, making it possible to catch small variations or typos (e.g., “Beyoncé” vs. “Beyonce”) by setting a threshold of 0.8.
Clustering: Apply KMeans clustering using “scikit-learn” to group artists based on their popularity, followers, and genres. The Elbow method is used to determine the optimal number of clusters.
Visualization: Use “matplotlib” to visualize the clustering results, including scatter plots and the distribution of artists by cluster.
Secure Handling: Store and load API keys safely using yaml for secure access to the Spotify API.

Challenges Others Might Encounter

API Access: The Spotify API can change policies or become rate-limited, requiring you to adjust how you fetch data (e.g., adding sleep buffers).
Artist Names: A key challenge was dealing with artist name inconsistencies. Sometimes, the same artist appeared under slightly different names due to variations in spelling or special characters (e.g., “Beyoncé” vs. “Beyonce”). Even though Spotify may correctly associate these with the same artist ID, my original dataset relied on name matching, which led to duplicate entries and potentially inaccurate calculations. To address this, I used fuzzy string matching with “difflib.SequenceMatcher” to identify and consolidate near-duplicate names based on a similarity threshold.
Clustering Sensitivity: KMeans results can vary significantly with different initializations, so selecting the right parameters for “n_init” and ensuring data normalization are crucial steps.
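For the API access point above, one way to add sleep buffers is a small retry helper with exponential backoff. This is a generic sketch rather than code from my repository: “fetch_with_backoff” and the “flaky” stand-in are hypothetical names, and “fetch” can be any zero-argument callable that wraps an API request.

```python
import time

def fetch_with_backoff(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise                      # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in request that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

waits = []                                 # record delays instead of really sleeping
result = fetch_with_backoff(flaky, retries=3, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

Passing “sleep” as a parameter keeps the helper easy to test; in real use the default “time.sleep” would pause between attempts.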

Limitations of Analysis

One key challenge in this analysis was dealing with the limitations imposed by the Spotify API. In November 2024, Spotify implemented a policy change that significantly restricted access to certain data features, particularly audio-related metrics like danceability, energy, and valence, which I initially planned to include in my clustering analysis. As a result, I had to pivot and focus on a more limited set of variables, such as popularity, followers, and genre affiliations.

This change may have impacted the depth and richness of the clustering results. With fewer variables to work with, the clusters might not fully reflect the broader spectrum of characteristics that contribute to an artist’s musical identity. Additionally, the absence of audio features, which could offer insight into the musical style and tone of each artist, introduces a potential bias by focusing the clusters on external metrics like popularity and genre.

Furthermore, by limiting the dataset to only a subset of features, I may have unintentionally overlooked patterns or relationships that could have been more prominent with a fuller set of data. This could also influence the robustness of the clusters formed and may skew the interpretation of the results.

GitHub Repository

You can access the code developed for this assignment in my GitHub repository here (https://github.com/lilyxgates/queen_of_pop). The repository contains all the necessary scripts for data cleaning, analysis, and visualization, along with documentation explaining each step in the process.

Finding Patterns Among Pop Royalty: A Clustering Analysis of Top Female Artists was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

Using Book Similarity to Inform Retail Merchandising Strategies

Lily Gates — Mon, 31 Mar 2025 03:58:24 GMT

Photo by Iñaki del Olmo on Unsplash

Leveraging Cosine Similarity to analyze book similarities and inform retail merchandising strategies, optimizing product placement and inventory decisions based on customer preferences and trends.

Do you also treat bookstores like your own personal library? I can easily spend hours in places like Barnes & Noble, perusing the shelves, picking up books from all corners of the store, and flipping through pages to see if one is a good match. Am I ready to commit hours of my time to this one book?

Sometimes, what I really need is a little nudge — a recommendation for a book that has the same vibe as one I’ve loved in the past. Whether I’m in the mood for an inspiring non-fiction read or a thrilling fiction adventure, knowing that a book shares similarities with another I’ve enjoyed can be just the motivation I need to make my choice. That’s why I’m diving into how measuring book similarity can help retail stores, like Barnes & Noble, craft the perfect “If you liked this book, you’ll love this!” display that keeps customers like me happily browsing and discovering their next great read.

Prompt to Analyze “Similarity” in Books

A relevant question that can be answered by measuring similarity between data points is, “What books are most similar to a given book in terms of genre, theme, and style?” The stakeholder asking this question could be a bookstore manager at a brick-and-mortar store like Barnes & Noble. They are looking to optimize their store’s layout and create engaging book displays by grouping similar books together, aiming to increase sales through strategic recommendations. For example, the display might feature a sign saying, “If you like this book, you’ll also love these!” By understanding which books are most similar to each other, the manager can make informed decisions about how to arrange books in the store, improve cross-selling opportunities, and enhance the shopping experience for customers by suggesting books they are likely to enjoy based on their interests.

Software and Libraries Used

To facilitate data retrieval, processing, and analysis, I used the Google Books API along with several Python libraries:

Google Books API: served as the primary data source for retrieving book metadata, including titles, authors, descriptions, and genres.
Pandas (import pandas as pd): for organizing and managing tabular data, particularly the metadata returned from the Google Books API.
JSON and YAML (import json, import yaml): to handle data formatting and securely store my API key.
Requests (import requests): to interact with the Google Books API and fetch relevant book information.
OS (import os): to manage file paths when reading in CSV data.
urllib.parse (import urllib.parse): to properly format URLs for book title queries (handling spaces and special characters).
Time (import time): to include pauses between API requests, avoiding rate limit issues.
Scikit-learn (from sklearn.feature_extraction.text import TfidfVectorizer): to convert book descriptions and genres into numerical representations using TF-IDF, enabling content-based similarity analysis (cosine similarity).
SciPy (import scipy.spatial.distance): used to compute cosine similarity between book vectors for recommendation generation.

Data Source: Google Books API

The Google Books API is a service provided by Google that allows developers to access data related to books available in the Google Books repository. This API provides access to information about books, such as titles, authors, publishers, descriptions, and reviews, and it can be used to search for books, retrieve detailed information about specific books, and even preview snippets from the books when available.

Key features of the Google Books API include metadata about books, such as the title, authors, publisher, published date, categories, description, text excerpts, and a preview link. The Google Books API also includes dynamic information, such as user-generated reviews and ratings for the books to help assess their popularity or relevance.

Three Query Examples

For this project, I chose three “query” books to search against using the Google Books API. These are popular, well-known titles that could anchor the “If you liked reading X, you’ll love reading Y!” display that the Barnes & Noble bookstore manager would make. Here are the three examples:

“Harry Potter and the Chamber of Secrets” by J.K. Rowling: This book could be used as a query entity for identifying similar books in the fantasy genre or books with themes of magic, adventure, and coming-of-age narratives. Using this as a query entity allows exploration of other books that share common elements, such as magical world-building or young protagonists.
“To Kill a Mockingbird” by Harper Lee: This book could be used as a query entity for identifying similar books in the historical fiction genre or books that address themes of racial injustice, morality, and family dynamics. This would help identify books with similar social and political themes or literary styles.
“The Great Gatsby” by F. Scott Fitzgerald: This book could be used as a query entity for identifying similar books in the classic literature or American literature genres. This type of search would focus on books that explore themes of wealth, class, the American Dream, and societal issues in the early 20th century.

Process

Similarity was compared using the books’ metadata, specifically the “Description” and “Genres” fields. Using TF-IDF vectorization, I converted this text into weighted term vectors, so that books sharing distinctive words score as more similar. I chose not to include the book title in the analysis because it could result in repeated books within the same series, and the title does not necessarily reflect the subject matter of the book. Additionally, I filtered out books by the same author to avoid including books from the same series and to broaden the search, as readers can easily find books by the same author on their own.

I had considered including the original publication year, but ultimately decided against it, as the publication year might narrow the similarities too much and exclude books with similar themes or plots. Similarly, I thought about including ratings but decided against it because ratings can be subjective. Using only Google ratings would be problematic, as it would not account for ratings across different platforms, which could skew the results.
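As a rough illustration of the process above, here is a minimal, from-scratch sketch of TF-IDF weighting, cosine similarity, and the same-author filter. It uses only the standard library rather than the scikit-learn and SciPy calls from my actual code, and the catalog entries, field names, and helper functions are hypothetical stand-ins for the real API metadata. The IDF formula is a smoothed variant chosen for the sketch.

```python
import math
from collections import Counter

# Hypothetical mini-catalog; real descriptions would come from the API metadata.
books = [
    {"title": "Query Book", "author": "Author A",
     "description": "A young wizard attends a school of magic", "genres": "Fantasy"},
    {"title": "Same Author Sequel", "author": "Author A",
     "description": "The young wizard returns to the school of magic", "genres": "Fantasy"},
    {"title": "Dragon Academy", "author": "Author B",
     "description": "A girl trains at a school for dragon riders and learns magic",
     "genres": "Fantasy"},
    {"title": "Tax Law Handbook", "author": "Author C",
     "description": None, "genres": "Law"},
]


def combined_text(book):
    """Join description and genres; missing fields become empty strings."""
    return f"{book.get('description') or ''} {book.get('genres') or ''}".lower()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_title, books, top_n=10):
    """Rank books by similarity to the query, skipping the query's author."""
    vectors = tfidf_vectors([combined_text(b) for b in books])
    query_idx = next(i for i, b in enumerate(books) if b["title"] == query_title)
    query_author = books[query_idx]["author"]
    scored = [
        (b["title"], cosine_similarity(vectors[query_idx], vectors[i]))
        for i, b in enumerate(books)
        if b["author"] != query_author  # this also drops the query book itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


recommendations = most_similar("Query Book", books)
for title, score in recommendations:
    print(f"{title}: {score:.3f}")
```

Only the query book's row of similarities is computed here, which also sidesteps building a full book-by-book matrix.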

Results

Using the description and genre available in the metadata, while excluding books by the same author, the top ten books with the highest cosine similarity for “Harry Potter and the Chamber of Secrets,” “To Kill a Mockingbird,” and “The Great Gatsby” include…

Challenges & Debugging

Data Quality and Missing Information:

Some books may have missing or incomplete metadata, such as missing descriptions, genres, or authors. I used methods to handle missing values, such as filling missing descriptions with an empty string (fillna('')) to avoid errors in the analysis. This helped ensure the algorithm ran smoothly, even when some metadata was incomplete.

Data Processing and Combining Features:

Combining the “Description” and “Genres” into a single text field for analysis presented the challenge of ensuring these different features would work well together when vectorized. By combining both fields and using TF-IDF vectorization, I could generate a more holistic representation of each book’s textual features. This approach avoided the need to weigh each feature separately and allowed for a more efficient similarity comparison.

Cosine Similarity and Matrix Size:

Computing cosine similarity on large datasets can result in a very large similarity matrix, especially when dealing with a large number of books, leading to memory or performance issues. To mitigate this, I focused on a manageable number of books and used the scipy.spatial.distance.cdist function to compute pairwise distances efficiently. I also implemented a smaller-scale approach for testing, expanding it to more books later.

Filtering Out Books by the Same Author:

Including books by the same author might skew the results, as books within the same series are likely to have high similarity scores. This required filtering out books from the same author. By identifying and excluding books written by the same author, I ensured that the comparison was more diverse and relevant for finding books with similar themes or genres.

API Reliability and Request Handling:

The Google Books API can occasionally fail due to connection issues or rate limiting, especially when making multiple requests for several books. To prevent the program from crashing, I added error handling to check the API response status code. If the request failed (status code not equal to 200), the program would print an error message and skip the problematic book. Additionally, I introduced a small delay between API requests (0.3 seconds) to avoid overloading the server and to prevent hitting the rate limits.
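The request-handling pattern described above might look roughly like the sketch below. To keep it runnable without a network connection, the HTTP call is passed in as a hypothetical fetch callable that returns a (status_code, payload) pair; in practice this could wrap requests.get. The endpoint URL and the intitle: query keyword are my assumptions about how the query is built, and the function names are illustrative.

```python
import time
from urllib.parse import quote

# Assumed endpoint for volume searches; check the API docs before relying on it.
BASE_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(title, api_key=None):
    """Percent-encode the title so spaces and special characters are URL-safe."""
    url = f"{BASE_URL}?q=intitle:{quote(title)}"
    if api_key:
        url += f"&key={api_key}"
    return url


def fetch_books(titles, fetch, delay=0.3, api_key=None):
    """Fetch metadata per title, skipping failures instead of crashing.

    `fetch` is a hypothetical callable: fetch(url) -> (status_code, payload).
    """
    results = {}
    for title in titles:
        try:
            status, payload = fetch(build_query_url(title, api_key))
            if status == 200:
                results[title] = payload
            else:
                print(f"Skipping {title!r}: status code {status}")
        except Exception as exc:  # e.g. a dropped connection
            print(f"Skipping {title!r}: request failed ({exc})")
        time.sleep(delay)  # short pause between requests to respect rate limits
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP call, used only to demonstrate the control flow."""
    if "Missing" in url:
        return 404, None
    return 200, {"requested": url}


books = fetch_books(["The Great Gatsby", "Missing Book"], fake_fetch, delay=0)
print(list(books))
```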

Debugging Missing or Incorrect Data:

I first ran checks to verify if data was missing or improperly formatted. For instance, I printed the DataFrame after loading the API data to ensure that book details (like title, author, and description) were properly captured. I also used the fillna() method to handle missing descriptions and genres.

Handling API Request Failures:

When the API request failed, I added a check to confirm the status code of the response. If the status code was not 200, I would log the failure and skip the book in question. I also ensured that the program continued to run smoothly by implementing try-except blocks to catch any exceptions during API requests.

GitHub Repository

You can access the code developed for this assignment in my GitHub repository here (https://github.com/lilyxgates/book_similarity). The repository contains all the necessary scripts for data cleaning, analysis, and visualization, along with documentation explaining each step in the process.

Using Book Similarity to Inform Retail Merchandising Strategies was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Top 10 Most “Important” Harry Potter Characters — according to Centrality

Lily Gates — Mon, 31 Mar 2025 03:46:47 GMT

Photo by Jules Marvin Eguilos on Unsplash

Using Eigenvector Centrality to identify the top 10 most ‘important’ Harry Potter characters, revealing key figures in the story based on their connections and influence within the magical world.

I’ve been a Harry Potter fan ever since I first watched the movie all those years ago, and it quickly became one of the main reasons I got into reading in the first place. The magical world J.K. Rowling created had me hooked from the start, and I’ve loved getting lost in the adventures of Harry, Hermione, and Ron ever since. Like many fans, I often find myself drawn to the characters’ relationships and the intricate web of connections that shape the story. So, when I was thinking about how to analyze the importance of characters in this iconic universe, I thought — why not use a network analysis to explore how these characters are connected and who stands out as the most influential? This project allows me to dive into that idea and uncover some interesting insights about the characters that have shaped the wizarding world.

Why Does It Matter Who Is “Important”?

Harry Potter is a multibillion-dollar franchise with a loyal fanbase. A stakeholder who could benefit from knowing the most “important” characters would be someone in marketing looking to capitalize on the fandom. Specifically, the stakeholder would be the head of the Wizarding World of Harry Potter theme park, who is seeking to diversify character merchandise to boost sales. One way to approach this challenge is by analyzing the relationships between characters in the Harry Potter universe to determine which characters are the most central or influential in the story.

By examining the network of character interactions, I can identify the most connected figures — those who frequently interact with others or hold significant narrative weight across the series. This analysis will help inform merchandising decisions by highlighting which characters resonate most within the Harry Potter universe, aligning offerings with fan interests and potentially driving higher sales.

The guiding research question is:
“Which characters in the Harry Potter universe are the most important and central, and how can this information help diversify character merchandise to increase sales?”

Data Source: Harry Potter Dataset

Ravi, N. (2021). Harry Potter character interactions [Data set]. GitHub. https://github.com/nikhil-ravi/harry-potter-interactions

The Harry Potter Character Interactions dataset, created by Nikhil Ravi, provides a structured representation of character relationships within the Harry Potter series. These networks have been meticulously constructed by establishing connections between two characters whenever their names or nicknames appear within a proximity of 14 words in any of the books. The weight of each connection represents the frequency of their interactions, offering a quantifiable measure of how often characters are mentioned together.

The .csv file contains three columns: source, target, and weight.

source: The first character in the interacting pair
target: The second character in the interacting pair
weight: The frequency or strength of the interaction between the two characters (the more frequent the interactions, the higher the weight)
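As a small sketch of how a file in this three-column layout could be read into a weighted graph, the example below uses only the standard library. The sample rows and the load_graph helper are hypothetical and are not values or code from the actual dataset. Because the connections come from names appearing near each other, the sketch treats every edge as undirected.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the source,target,weight layout; not real dataset values.
sample_csv = """source,target,weight
Harry,Ron,10
Harry,Hermione,8
Ron,Hermione,6
Hermione,Hagrid,1
"""


def load_graph(csv_file):
    """Build {character: {neighbor: weight}} from an edge-list CSV."""
    graph = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        weight = float(row["weight"])
        # Co-occurrence has no direction, so store the edge both ways.
        graph[row["source"]][row["target"]] = weight
        graph[row["target"]][row["source"]] = weight
    return graph


graph = load_graph(io.StringIO(sample_csv))
print(dict(graph["Hermione"]))
```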

This dataset enables network analysis techniques such as degree centrality, eigenvector centrality, and closeness centrality to determine the most influential characters. By leveraging these insights, stakeholders — such as the head of the Wizarding World of Harry Potter theme park — can make data-driven decisions about diversifying character merchandise to better align with fan engagement and potential sales opportunities.

Defining “Important” Nodes with Eigenvector Centrality

In the Harry Potter Character Interactions dataset, the network graph is composed of nodes and edges, each representing different aspects of character relationships:

Nodes (Vertices): Each node represents a character from the Harry Potter series. Every unique character in the dataset is assigned a node, allowing us to analyze their connections within the story.
Edges (Connections): An edge represents an interaction between two characters. If two characters appear together within a proximity of 14 words in the books, an edge is formed between them. The weight of the edge signifies how frequently these interactions occur, meaning stronger relationships (more interactions) will have higher-weight edges.

By analyzing the structure of these nodes and edges, we can determine which characters are most central to the story based on different network centrality measures.

In this analysis, importance is defined using Eigenvector Centrality, which not only considers how many direct connections a character has (like Degree Centrality) but also takes into account the importance of those connections. In other words, a character is considered highly central if they are connected to other well-connected characters. This allows us to capture influential figures in the network — characters who are not just well-connected but are embedded in influential subgroups.

Using Eigenvector Centrality helps identify key characters who are closely tied to other important characters, rather than those who simply have many connections. This information can provide valuable insights into merchandising strategies, as characters with high Eigenvector Centrality may be perceived as significant by fans, making them strong candidates for expanded merchandise offerings.
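To make the idea concrete, here is a minimal sketch of eigenvector centrality computed by power iteration on a tiny weighted graph. It is a from-scratch illustration under simplifying assumptions (a connected, undirected graph), not the library routine used in my analysis, and the graph values are hypothetical. Each character's score is repeatedly replaced by the weighted sum of its neighbors' scores until the scores stop changing.

```python
import math

# Hypothetical weighted, undirected graph stored as an adjacency dict.
graph = {
    "Harry": {"Ron": 10, "Hermione": 8, "Hagrid": 2},
    "Ron": {"Harry": 10, "Hermione": 6},
    "Hermione": {"Harry": 8, "Ron": 6},
    "Hagrid": {"Harry": 2},
}


def eigenvector_centrality(graph, max_iter=1000, tol=1e-9):
    """Approximate the dominant eigenvector of the weighted adjacency matrix."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        # Adding the node's previous score shifts the matrix by the identity,
        # which keeps the same eigenvectors but avoids oscillation.
        updated = {
            node: scores[node] + sum(w * scores[nb] for nb, w in neighbors.items())
            for node, neighbors in graph.items()
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        if max(abs(updated[node] - scores[node]) for node in graph) < tol:
            return updated
        scores = updated
    return scores


scores = eigenvector_centrality(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph Hagrid has only one connection, but that connection is to the best-connected character, which is exactly the kind of tie this measure rewards more than a plain connection count would.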

The Top 10 Most “Important” Harry Potter Characters

The moment we’ve been waiting for! After using Eigenvector Centrality on Nikhil Ravi’s Harry Potter Character Interactions dataset, the top ten most important characters are…

1. Harry Potter

2. Ronald Weasley

3. Hermione Granger

4. Albus Dumbledore

5. Tom Riddle

6. Severus Snape

7. Rubeus Hagrid

8. Ginevra Weasley

9. Godric Gryffindor

10. Draco Malfoy

Top 10 Most “Important” Characters in “Harry Potter” — According to Eigenvector Centrality

Expanding on Centrality

Degree Centrality, Closeness Centrality, and Eigenvector Centrality

Network graphs of Top 10 Harry Potter Characters — Using Degree Centrality, Closeness Centrality, and Eigenvector Centrality

For determining the top 10 characters, I primarily focused on Eigenvector Centrality as it accounts for not only the number of direct connections a character has but also the influence of their connections within the network. This measure provided the most robust insight into identifying characters that are likely to drive merchandise sales due to their high influence within the network.

However, I also found that the results from Degree Centrality and Closeness Centrality were very similar, with a few characters consistently appearing in the top rankings across all three metrics. This similarity highlights the importance of considering multiple centrality measures to get a more comprehensive view. Therefore, I chose to include Degree Centrality and Closeness Centrality to provide a fuller picture of the network and to ensure that all relevant characters were captured, even if they might not have ranked as highly based on Eigenvector Centrality alone.

Top 10 Characters in “Harry Potter” Using Degree Centrality, Eigenvector Centrality, and Closeness Centrality

Using all of the measures allowed me to identify the most influential characters within the network, which I treat as a proxy for their relevance and popularity.

Degree Centrality reflects the number of direct connections a character has, so characters with high degree centrality are highly connected, and therefore likely to be more visible or popular. Eigenvector Centrality measures the influence of a character’s connections, so a character connected to other well-connected characters is ranked higher. Lastly, Closeness Centrality shows how close a character is to all other characters in the network, which indicates how few steps it takes to reach the rest of the cast from that character.
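For comparison, degree and closeness centrality can be sketched in a few lines of plain Python. This is an illustrative version that ignores edge weights and assumes a connected graph. The toy graph below is hypothetical, and library implementations may normalize differently.

```python
from collections import deque

# Hypothetical unweighted, undirected graph (each edge listed in both directions).
graph = {
    "Harry": {"Ron", "Hermione", "Hagrid"},
    "Ron": {"Harry", "Hermione"},
    "Hermione": {"Harry", "Ron"},
    "Hagrid": {"Harry", "Fang"},
    "Fang": {"Hagrid"},
}


def degree_centrality(graph):
    """Share of the other characters that each character is directly tied to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path lengths to every other node."""
    n = len(graph)
    result = {}
    for start in graph:
        # Breadth-first search gives shortest hop counts from `start`.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total = sum(distance.values())
        result[start] = (n - 1) / total if total else 0.0
    return result


degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(degree["Harry"], closeness["Harry"])
```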

By focusing on the top 10 characters based on each of these centrality measures, I identified those who are most likely to drive merchandise sales. The top characters, such as Harry Potter, Ron Weasley, and Hermione Granger, emerged consistently across all three centrality measures. These findings suggest that the characters with the highest centrality should be prioritized for character merchandise.

Key Findings:

Top Characters by Degree Centrality: Harry Potter, Ron Weasley, and Hermione Granger were the top three, indicating they are the most connected characters in terms of network ties.
Top Characters by Eigenvector Centrality: Harry Potter, Ron Weasley, and Hermione Granger were again the top three, reflecting their high influence in the character network.
Top Characters by Closeness Centrality: Harry Potter, Ron Weasley, and Hermione Granger topped the list once more, showcasing their potential as focal points for merchandise promotion.

Data Cleaning and Limitations

Data Cleaning and Issues Encountered

When merging centrality measures into a single DataFrame, I encountered issues with the alignment of character names. This was resolved by using pd.merge() properly to join the data on the character name, ensuring that all centrality scores were correctly aligned.

Limitations of the Analysis

Limited Dataset: The analysis is based on a set of characters from the original Harry Potter book series. While these characters are key figures within the franchise, this selection does not include characters from spin-offs, such as the Fantastic Beasts series. Additionally, the analysis does not consider real-time trends or shifts in fan interest, which could change over time, particularly with new Harry Potter-related releases or events that might impact the relevance or popularity of certain characters.

Network Representation: The network I constructed may not fully capture the complexities of the relationships between characters. It relies solely on direct connections and does not account for deeper or emotional ties between characters. For example, familial relationships or evolving alliances are not considered. The edge weights are based on a surface-level factor, namely how often two characters’ names appear within 14 words of each other, which may not reflect the full depth of their interactions or their impact on the story.

GitHub Repository

You can access the code developed for this assignment in my GitHub repository here (https://github.com/lilyxgates/harrypotter_character_centrality). The repository contains all the necessary scripts for data cleaning, analysis, and visualization, along with documentation explaining each step in the process.

The Top 10 Most “Important” Harry Potter Characters — according to Centrality was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

Examining the Correlation Between Exercise and Mental Health

Lily Gates — Mon, 31 Mar 2025 03:04:03 GMT

Photo by Jamie Street on Unsplash

Exploring data trends and visualizations to understand the correlation between exercise and mental health, uncovering insights into wellness patterns

As a group fitness coach as well as a student of data science and psychology at the University of Maryland, I have a deep understanding of the positive impact physical activity can have on mental well-being. My background in fitness has shown me firsthand how exercise can help reduce stress, improve mood, and promote overall mental health. Coupled with my studies in psychology as part of the Social Data Science major track, I am particularly interested in how physical activity may serve as an effective tool for managing mental health conditions, such as depression. By combining my knowledge of fitness with psychological principles, I wanted to explore this relationship through data to understand how physical activity can be a key component of mental health treatment plans.

Data Source: Behavioral Risk Factor Surveillance System (BRFSS) — 2023 Annual Data

Centers for Disease Control and Prevention. (2025, February). Behavioral risk factor surveillance system (BRFSS) 2023 annual data and documentation. U.S. Department of Health & Human Services. https://www.cdc.gov/brfss/annual_data/annual_2023.html

The Behavioral Risk Factor Surveillance System (BRFSS) is a nationwide health survey conducted by the Centers for Disease Control and Prevention (CDC) in collaboration with U.S. states and territories. Since its inception in 1984, the BRFSS has gathered data on health behaviors, chronic conditions, health care access, and preventive services through telephone surveys. In 2023, 48 states, the District of Columbia, and U.S. territories including Guam, Puerto Rico, and the U.S. Virgin Islands participated.

The BRFSS collects self-reported data from adults aged 18 and older, using both landline and cellular phones, and employs a dual-frame design to ensure data representativeness. The questionnaire includes a core set of questions covering general health status, chronic conditions, and behaviors, with additional modules on topics such as diabetes, cancer screenings, and social determinants of health. States may also include state-specific questions.

The data are weighted and processed by the CDC to ensure accuracy and representativeness of the survey results. In 2023, states used the iterative proportional fitting (raking) method for weighting to account for various demographic factors. The BRFSS serves as a critical tool for public health agencies to monitor health trends, inform health policies, and guide interventions at the state and national levels.

Purpose of Exploratory Data Analysis

The purpose of this exploratory data analysis is to examine how physical activity habits relate to depression and poor mental health. The research question guiding this analysis is: How do differing physical activity habits correlate with poor mental health?

The primary stakeholders in this research are mental health professionals (e.g., psychology counselors) who are involved in treating individuals with depression. These professionals are often seeking additional therapeutic interventions to complement traditional treatments, and understanding the role of physical activity in this context could be crucial.

The insights gained from this analysis could inform several important decisions for mental health professionals. If a correlation is found between physical activity and improved mental health, these professionals could incorporate specific recommendations for physical activity (e.g., regular exercise, walking, or gym routines) into treatment plans for depression. Additionally, they may guide patients in adopting physical activity as a regular practice alongside traditional methods like therapy or medication, potentially improving overall treatment outcomes. Moreover, if physical activity is found to have a positive impact, mental health professionals could customize treatment approaches to better address the needs of individual patients, particularly those who might benefit from non-medication interventions.

Filtering Data

Relevant columns that can be used for answering the research question regarding correlations between physical and mental health include ones pertaining to exercise specifics and quantity, overall physical and mental health, and demographics.

Physical and mental health (GENHLTH, PHYSHLTH, MENTHLTH, POORHLTH) will help assess the general health status and specific issues related to physical and mental health.

Exercise-related variables (EXERANY2, EXRACT12, EXEROFT1, EXERHMM1, EXEROFT2, EXERHMM2, EXRACT22, STRENGTH) provide insight into the frequency and duration of physical activities, crucial for understanding exercise habits and their relationship with health outcomes.

The computed variables (_MENT14D, _PHYS14D) provide a simplified view of health status, categorizing people based on how frequently they report poor mental and physical health days.
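As a sketch of how a three-level variable like this could be derived from a raw day count, the example below recodes a response and then bins it. The special codes (88 for no days, 77 for don’t know, 99 for refused) and the category numbers (1 for zero days, 2 for 1 to 13 days, 3 for 14 or more days, 9 for unknown) are my assumptions about the BRFSS coding and should be checked against the official codebook. The function names are hypothetical.

```python
def clean_days(raw):
    """Recode a raw day-count response to 0-30, or None when unknown.

    Assumed coding: 88 = none, 77 = don't know, 99 = refused.
    """
    if raw is None or raw in (77, 99):
        return None
    if raw == 88:
        return 0
    if 1 <= raw <= 30:
        return int(raw)
    return None  # anything else is treated as missing


def fourteen_day_category(days):
    """Bin a cleaned day count into the assumed three-level scheme."""
    if days is None:
        return 9  # unknown or missing
    if days == 0:
        return 1  # zero poor-health days
    if days <= 13:
        return 2  # 1 to 13 poor-health days
    return 3      # 14 or more poor-health days


raw_responses = [88, 5, 14, 30, 77, 99, None]
categories = [fourteen_day_category(clean_days(value)) for value in raw_responses]
print(categories)
```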

Demographic variables (SEXVAR, _RACE) provide important context for analysis and ways to conduct subgroup analyses to determine if there are differences in physical activity levels and health outcomes based on gender or race/ethnicity.

Data Visualizations

Average Reported Poor Physical and Mental Health Days in the Past Month for Gender and Race/Ethnicity Groups

Understanding the average reported poor physical and mental health days in the past month for different gender and race/ethnicity groups is crucial for identifying health disparities within populations. Recognizing such disparities allows for more targeted public health interventions and policies aimed at improving health outcomes for underrepresented or disadvantaged groups. This information is also vital for healthcare providers to tailor treatment plans and support services that address the unique needs of specific populations, ultimately leading to more equitable healthcare and improved overall well-being.

Overall, it seems that both men and women who identify as American Indian or Alaskan Native (non-Hispanic) reported the most days with poor physical health. Across groups, the frequency of poor mental health days also appears greater on average than the frequency of poor physical health days.

In addition to analyzing demographic differences, and more closely related to the research question of how physical activity and mental health relate, a heatmap can visually display correlations.

The correlation heatmap visualizes the relationships between exercise frequency (EXEROFT1, EXEROFT2, STRENGTH) and mental health (MENTHLTH). It highlights strong, moderate, or weak correlations and provides insight into how these variables interact.

By examining the correlations between exercise and mental health, the heatmap helps uncover areas where physical activity might be particularly beneficial in improving mental health. This can inform interventions targeting individuals who are more likely to experience poor mental health, making the findings more actionable for promoting physical activity as a way to improve mental well-being.

A high value in the MENTHLTH category indicates a worse mental health status. There does not seem to be an obvious correlation between duration of exercise and mental health. However, there tend to be positive correlations between durations across different modes of exercise.
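The numbers behind a correlation heatmap like this are pairwise Pearson correlations. The sketch below computes them from scratch on made-up, already-cleaned values, with column names borrowed from the survey only as labels. It drops a pair whenever either value is missing, which is an assumption about handling gaps rather than a description of my exact code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation, skipping pairs where either value is missing."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    spread_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values for five respondents; None marks a missing answer.
columns = {
    "EXERHMM1": [30, 60, 45, None, 90],
    "EXERHMM2": [20, 50, 40, 10, 80],
    "MENTHLTH": [5, 0, 10, 30, 2],
}

# Nested dict playing the role of the correlation matrix behind a heatmap.
matrix = {
    a: {b: pearson_r(columns[a], columns[b]) for b in columns}
    for a in columns
}

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```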

Process for Data Analysis and Graphics

Data Cleaning

Handling Missing Values: Initially, the dataset contained missing values in several columns. I used the pandas library to identify and handle these missing values. For numerical columns (e.g., PHYSHLTH, MENTHLTH), I opted to either impute the missing values using the median (to avoid skewing the data) or drop rows with missing values if the proportion of missing data was too high. For categorical variables (e.g., SEXVAR, _RACE), I applied similar strategies, imputing missing values based on the mode or removing rows where too many categorical fields were missing.

Normalization: Since the scales for exercise data and health data differed significantly (e.g., exercise values ranged from 0 to 300, while health values were between 0 and 30), I normalized the exercise data by scaling it to a 0–1 range using MinMaxScaler from sklearn. This made the comparison between physical activity and mental health more meaningful.
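The two cleaning steps above, median imputation and min-max scaling, boil down to simple arithmetic. Here is a minimal standard-library sketch with hypothetical values. The scaling uses (x - min) / (max - min), the usual min-max formula for a 0–1 range, and the helper names are mine rather than anything from pandas or sklearn.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [value for value in values if value is not None]
    fill = median(observed)
    return [fill if value is None else value for value in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(value - low) / (high - low) for value in values]


# Hypothetical exercise values on a 0-300 scale with one missing response.
exercise = [0, 150, None, 300, 60]
filled = impute_median(exercise)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```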

Limitations of the Analysis

Data Representation: The dataset may not be fully representative of the general population, especially if certain racial or gender groups are overrepresented or underrepresented. This could introduce bias in the results. Specifically, a significant number of respondents selected “other” or “multiracial,” or did not answer the racial identification question.

Missing Variables: Some important variables, such as specific mental health conditions or detailed exercise routines, were not included in the dataset. This limits the ability to draw conclusions about the specific types of exercise or the detailed mechanisms behind mental health issues.

Potential Biases: The variables in the dataset (e.g., number of poor health days, exercise frequency) are self-reported, which introduces potential bias due to inaccuracies or social desirability bias.

GitHub Repository

You can access the code developed for this assignment in my GitHub repository here (https://github.com/lilyxgates/brfss_2023_exercise_mental_health). The repository contains all the necessary scripts for data cleaning, analysis, and visualization, along with documentation explaining each step in the process.

Examining the Correlation Between Exercise and Mental Health was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.