Measuring the Biggest Smiles in MLB Using Computer Vision


Introduction

While downloading official MLB player headshots for a graphic I was working on, an idea came to me that felt both fun and interesting:

“Which player has the biggest smile in MLB?”

I have had some experience with Computer Vision (CV) models before but have not applied them to baseball. This felt like the perfect project to try them out on, and the end goal was ridiculous enough that people should find it entertaining.

All code provided is Python. GitHub Copilot assisted with code writing.

Downloading all MLB Headshots

Thanks to MLBAM, accessing information about players is simple! The first step in gathering all MLB player headshots is to gather the IDs of all MLB players. This can be done with the MLB Stats API, which includes an endpoint for accessing all players in a given MLB-affiliated league.

The function get_players() returns a data frame containing all relevant information for MLB players during the current season.

import pandas as pd
import requests

def get_players(sport_id=1):
    # Request every player in the given league (sport_id=1 is MLB)
    player_data = requests.get(url=f'https://statsapi.mlb.com/api/v1/sports/{sport_id}/players').json()

    # Select relevant player data
    fullName_list = [x['fullName'] for x in player_data['people']]
    id_list = [x['id'] for x in player_data['people']]
    position_list = [x['primaryPosition']['abbreviation'] for x in player_data['people']]
    team_list = [x['currentTeam']['id'] for x in player_data['people']]
    age_list = [x['currentAge'] for x in player_data['people']]

    # Create Dataframe
    player_df = pd.DataFrame(data={'player_id':id_list,
                    'name':fullName_list,
                    'position':position_list,
                    'team':team_list,
                    'age':age_list})
    return player_df

df = get_players(sport_id=1)
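
The list comprehensions above index each field directly, so a single record without, say, a currentTeam entry would raise a KeyError. Whether the endpoint ever omits these fields is an assumption here, not something confirmed above. As a minimal defensive sketch, a hypothetical helper players_to_rows() could fall back to None for anything missing (the sample records are made up for illustration):

```python
def players_to_rows(people):
    """Flatten the 'people' list from the players endpoint into plain dicts.

    Hypothetical helper: any missing field becomes None instead of raising.
    """
    rows = []
    for person in people:
        position = person.get("primaryPosition") or {}
        team = person.get("currentTeam") or {}
        rows.append({
            "player_id": person.get("id"),
            "name": person.get("fullName"),
            "position": position.get("abbreviation"),
            "team": team.get("id"),
            "age": person.get("currentAge"),
        })
    return rows


# Made-up records: the second one has no 'currentTeam' entry
sample_people = [
    {"id": 1, "fullName": "Example Player", "currentAge": 27,
     "primaryPosition": {"abbreviation": "SS"}, "currentTeam": {"id": 100}},
    {"id": 2, "fullName": "Free Agent", "currentAge": 31,
     "primaryPosition": {"abbreviation": "P"}},
]
rows = players_to_rows(sample_people)
```

Passing the result to pd.DataFrame(rows) would give the same columns as get_players().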

The next step in the process is to download each player’s official MLB headshot. These can be accessed directly from MLB.com via each player’s page.

Here is an example of Andrew Abbott’s player page.

We are interested in his headshot, which is easily accessed on this page. Here is the link to his headshot.

All official MLB headshots are identified by the player’s MLBAM ID, also known as their Player ID. By changing the URL to include CJ Abrams’ Player ID instead of Abbott’s, we can get Abrams’ headshot.

Since we can simply change the ID in each URL to get the player’s headshot, we can create a column in pandas which contains all headshot URLs:

df['url'] = "https://img.mlbstatic.com/mlb-photos/image/upload/d_people:generic:headshot:67:current.png/w_213,q_auto:best/v1/people/"+df['player_id'].astype(str)+"/headshot/67/current"
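
The same URL pattern can be wrapped in a small function, which makes the two moving parts (the Player ID and the image width) explicit. This is only a sketch: headshot_url() and HEADSHOT_BASE are names made up here, and the width values 213 and 480 are simply the two that appear in this article.

```python
HEADSHOT_BASE = (
    "https://img.mlbstatic.com/mlb-photos/image/upload/"
    "d_people:generic:headshot:67:current.png"
)


def headshot_url(player_id, width=213):
    """Build the headshot URL for one player at the requested pixel width."""
    return (
        f"{HEADSHOT_BASE}/w_{width},q_auto:best"
        f"/v1/people/{player_id}/headshot/67/current"
    )


# Nick Allen's Player ID, used later in the article
example_url = headshot_url(669397, width=480)
```

With a helper like this, the column could also be built as df['player_id'].apply(headshot_url).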

Now that we have the URL for each headshot, we can download all the MLB headshots to a folder. To limit the rate of calls to MLB’s server, I suggest pausing between downloads using time.sleep().

This loop will download the image at each URL and store it in a folder called “headshots”.

from PIL import Image
import requests
from io import BytesIO
import time
import os

# Create folder
folder_name = 'headshots'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# Loop through dataframe and save all headshots to 'headshots' folder
for i in range(len(df)):
    time.sleep(5)  # Pause between requests to go easy on MLB's server
    response = requests.get(df['url'][i])
    image_data = response.content
    image = Image.open(BytesIO(image_data))
    image.save(f"{folder_name}/{str(i).zfill(5)}_{df['player_id'][i]}_{df['name'][i]}.png")
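
With a download this long, one failed request would stop the loop partway through. The sketch below shows one way to make it resumable: it skips files that already exist and records failures instead of raising. It is not the loop used above. It swaps requests for the standard library's urllib, writes the bytes as received rather than re-saving them through PIL, and download_headshots() and fetch_bytes() are hypothetical names.

```python
import os
import time
import urllib.request


def fetch_bytes(url, timeout=10):
    """Return the raw bytes at `url` (standard-library stand-in for requests.get)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_headshots(rows, folder="headshots", fetch=fetch_bytes, pause=5.0):
    """Save one file per (filename, url) pair and return the filenames that failed."""
    os.makedirs(folder, exist_ok=True)
    failed = []
    for filename, url in rows:
        path = os.path.join(folder, filename)
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        try:
            data = fetch(url)
        except OSError:
            # urllib's network and HTTP errors are OSError subclasses
            failed.append(filename)
        else:
            with open(path, "wb") as handle:
                handle.write(data)
        time.sleep(pause)  # pause between requests
    return failed
```

The fetch argument exists so the function can be tried without touching the network, for example by passing a function that returns fixed bytes.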

We now have all MLB headshots! Now on to the Computer Vision!

Using CV for Facial Recognition

Facial Recognition

Using CV, we can have a model look at each headshot and determine the location of each player’s facial features. Thankfully, there is a public model available on GitHub which has done the legwork for us!

Italo José created facial-landmarks-recognition which we can use for our task. Their model places landmarks on a person’s face to define facial features. For our project, we are looking just at the landmarks which make up the mouth.

This function takes an image as an input and outputs an array of the mouth coordinates which the model defines.

import cv2
import dlib
import numpy as np

def measure_smile(image_path):
    # Load the face detector and facial landmark predictor
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    # Read the image
    image = cv2.imread(image_path)
    # Convert from BGR to CIELAB, which handled facial hair best in testing
    lab = cv2.cvtColor(image, cv2.COLOR_BGR2Lab)

    # Detect faces in the image
    faces = detector(lab)

    if len(faces) == 0:
        return 0

    # We'll work with the first face detected
    face = faces[0]

    # Get facial landmarks
    landmarks = predictor(lab, face)

    # Extract mouth landmarks
    mouth_points = []
    for n in range(48, 68):
        x = landmarks.part(n).x
        y = landmarks.part(n).y
        mouth_points.append((x, y))

    mouth_points = np.array(mouth_points)

    return mouth_points

There was some trial and error involved in my process, mostly regarding the colour conversion used for the images. The conversion I ended up using was BGR to LAB (CIELAB), as it dealt best with players who have facial hair.

This is a for loop which goes through each headshot and calls the measure_smile() function. We append the “mouth_points” of each player to a list called “smile_list” and then create a column in the original data frame with the mouth points.

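## Store the Smile Points in a list
import os
smile_list = []
## Sort the files so they follow the data frame's row order
## (os.listdir order is arbitrary; filenames start with the zero-padded row index)
for file in sorted(os.listdir('headshots')):
    try:
        smile_list.append(measure_smile(f'headshots/{file}'))
    except cv2.error:
        ## The image could not be read or converted, so store 0
        smile_list.append(0)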

## Create a column in the original Data Frame
df['mouth_points'] = smile_list
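
Assigning the list straight to the data frame only works if the files are processed in exactly the same order as the rows, with no headshot missing. Because each filename saved earlier has the form index_playerid_name.png, the Player ID can be read back out of it and used to line results up explicitly. A minimal sketch, where player_id_from_filename() and align_by_player_id() are hypothetical helpers:

```python
def player_id_from_filename(filename):
    """Extract the Player ID from a name like '00012_669397_Nick Allen.png'."""
    parts = filename.split("_", 2)  # index, player ID, rest of the name
    return int(parts[1])


def align_by_player_id(filenames, results, player_ids, missing=0):
    """Return one result per player ID, in the order of `player_ids`.

    Players with no matching file get `missing` (0, matching the
    placeholder used when no face is found).
    """
    by_id = {
        player_id_from_filename(name): result
        for name, result in zip(filenames, results)
    }
    return [by_id.get(player_id, missing) for player_id in player_ids]


# Files listed out of order, and one player (id 3) with no file at all
files = ["00001_2_B Player.png", "00000_1_A Player.png"]
aligned = align_by_player_id(files, ["mouth_2", "mouth_1"], [1, 2, 3])
```

In the article's setting this would look like df['mouth_points'] = align_by_player_id(files, smile_list, df['player_id'].tolist()), where files is the same list of filenames the loop iterated over.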

What is a smile?

We are defining a smile as “the portion of a player’s mouth which is between their lips”. While that may be a mouthful, it is essentially the area of teeth which the model calculates. An example would better illustrate this.

For this example, we are using Oakland Athletics shortstop Nick Allen, who has a Player ID of 669397. This is the code which will plot Nick Allen’s smile. Points 12 and onward in the mouth array (landmarks 60 to 67 of the 68-point model) define the inner portion of a person’s mouth.

import matplotlib.pyplot as plt
import requests
from io import BytesIO
from PIL import Image

## Select Player ID
player_id_select = 669397

## Initialize the Plot
fig, ax = plt.subplots()

## Plot Headshot
response = requests.get(f'https://img.mlbstatic.com/mlb-photos/image/upload/d_people:generic:headshot:67:current.png/w_480,q_auto:best/v1/people/{player_id_select}/headshot/67/current')
img = Image.open(BytesIO(response.content))
ax.imshow(img,zorder=1)

## Select Player's Mouth Points
points = df[df['player_id']==player_id_select]['mouth_points'].values[0]

## Break up points into X and Y coordinates
x = [point[0] for point in points[12:]]
y = [point[1] for point in points[12:]]

## Fill in polygon
ax.fill(x, y, "y",alpha=0.5,linewidth=2,edgecolor='red',zorder=2)

## Show the plot
ax.axis('off')
plt.show()

Looking good! The model accurately highlighted Nick Allen’s smile. While the smile isn’t fully captured, it is a good approximation. Now, to determine which players have the largest smiles, we need a method to calculate the area of the polygons which make up their smiles.
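
The slice points[12:] in the plotting code works because of how the 68-point landmark scheme orders the mouth: landmarks 48 to 59 trace the outer lip and 60 to 67 trace the inner lip, so in the 20-point array returned by measure_smile() the inner lip starts at index 12. A small sketch that names the two groups explicitly (split_mouth, OUTER_LIP and INNER_LIP are names chosen here, not part of dlib):

```python
OUTER_LIP = slice(0, 12)   # landmarks 48-59 of the 68-point scheme
INNER_LIP = slice(12, 20)  # landmarks 60-67 of the 68-point scheme


def split_mouth(mouth_points):
    """Split the 20 mouth points into (outer lip, inner lip) lists."""
    points = list(mouth_points)
    if len(points) != 20:
        raise ValueError(f"expected 20 mouth points, got {len(points)}")
    return points[OUTER_LIP], points[INNER_LIP]


# Dummy coordinates, only to show the shapes involved
dummy_points = [(n, n) for n in range(20)]
outer, inner = split_mouth(dummy_points)
```

The inner list holds the 8 points used for the smile polygon, which is the same selection as points[12:].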

Calculating the Size of a Smile

The shoelace formula (also known as Gauss’s area formula) calculates the area of a polygon from the coordinates of its vertices, and it works even for irregular shapes.

We can calculate the area of each player’s “smile” using this formula and then determine which players have the biggest smile!
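
In words, the formula walks around the polygon, sums x_i * y_(i+1) - x_(i+1) * y_i over each pair of consecutive vertices (wrapping from the last vertex back to the first), and takes half the absolute value of the total. A plain-Python sketch, where shoelace_area() is a name chosen here for illustration:

```python
def shoelace_area(points):
    """Area of a simple polygon given its vertices in order (either direction)."""
    n = len(points)
    if n < 3:
        return 0.0  # fewer than three vertices enclose no area
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to the first vertex
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
right_triangle = [(0, 0), (4, 0), (0, 3)]
print(shoelace_area(unit_square))     # 1.0
print(shoelace_area(right_triangle))  # 6.0
```

The vertices must be listed in order around the boundary. The absolute value means clockwise and counter-clockwise orderings give the same area.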

This code creates a column which calculates the area of each smile.

import numpy as np
import pandas as pd

mouth_area_list = []
for i in range(len(df)):
    try:
        ## Shoelace Formula on the inner-mouth points (index 12 onward)
        mouth_area = 0.5 * np.abs(np.dot(df['mouth_points'][i][12:, 0], np.roll(df['mouth_points'][i][12:, 1], 1)) - 
                                  np.dot(df['mouth_points'][i][12:, 1], np.roll(df['mouth_points'][i][12:, 0], 1)))
        mouth_area_list.append(mouth_area)
    ## If the model returned no mouth points (stored as 0), return an area of 0
    except TypeError:
        mouth_area_list.append(0)

## Create a column with the areas
df['smile_area'] = mouth_area_list

With the areas calculated, we can now determine which players have the biggest smiles in MLB.

To easily compare smile sizes between players, we can create a metric which quantifies a smile in terms of the average MLB smile. “Plus” metrics are a popular subset of metrics used in baseball, which express a player’s stat as a percentage of the MLB average. For these metrics, 100 is MLB average, and each point above or below 100 corresponds to one percent above or below MLB average. For example, a player with a 150 OPS+ has an OPS which is 50% greater than league average (this is a simplified example which does not account for league and park factors).

Using this methodology, I have created SMILE+ which uses a player’s smile size as the metric of interest. The formula is:
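
The plus-metric convention described above implies a formula of the form SMILE+ = 100 × (player smile area / MLB average smile area). A minimal sketch under that assumption follows. The function name smile_plus() is made up here, and whether players with an area of 0 (no mouth detected) count towards the average is not stated, so this version simply includes every value.

```python
def smile_plus(areas):
    """Scale smile areas so that the average of `areas` maps to 100.

    Assumption: all values, including zeros, count towards the average.
    """
    if not areas:
        return []
    league_average = sum(areas) / len(areas)
    if league_average == 0:
        return [0.0 for _ in areas]  # nothing to scale against
    return [100.0 * area / league_average for area in areas]


# Three made-up smile areas whose average is 100
print(smile_plus([50, 100, 150]))  # [50.0, 100.0, 150.0]
```

On the data frame, the equivalent one-liner would be df['smile_plus'] = 100 * df['smile_area'] / df['smile_area'].mean(), where smile_plus is a hypothetical column name.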

Here are the Top 10 Players in SMILE+

You can find the full leaderboards here: SMILE+ Leaderboards

Limitations

While we achieved an encouraging outcome, there were still a few issues with how the model determined (or failed to determine) the landmarks on some players’ faces.

Facial Hair

The CV model was sometimes confused by beards, and the biggest offender was Sean Hjelle. With his mouth closed, the model treated the top of his goatee as the bottom of his mouth, which spiked his SMILE+ all the way to 298. For Hjelle specifically, I redrew his smile coordinates and his SMILE+ dropped to 3.

Skin Tones

The CV model sometimes had trouble with certain players’ skin tones. If a player’s skin tone and lip colour were similar, there was a chance the model would fail to locate a mouth and return no coordinates.

The most prominent example of this case was for Jhonkensy Noel, who has one of the best smiles in baseball! I redrew his smile shape and his SMILE+ ended up ranking 1st.

Jhonkensy Noel smile area (manually drawn)

Small Lips

When a player had small lips, the model struggled to assign accurate mouth coordinates. Colten Brewer has a great smile, but unfortunately, the model did not capture its size accurately.

These are only a handful of limitations and inaccuracies present in the model. While this was a fun project to undertake, I will leave any subsequent improvements to you!

Conclusion

I undertook this project because it felt like a fun introduction to using Computer Vision, and it sure was! While the results do not provide much practical value, I learned a lot about CV and had a blast doing so.

Anytime you want to compare the smiles of MLB players, know that there was a baseball nerd who created a metric to help you do that.

I hope you enjoyed this article as much as I enjoyed writing it!

Follow me on Twitter: https://twitter.com/TJStats