Who Will Win RuPaul’s Drag Race All Stars Season 5? (Week 4)*

Published in The Startup · 11 min read · Jun 30, 2020

The competition may be half-over, but according to the (very rough) first draft of my model, it may be all sewn up.

Note: This story was updated on 6/30 to add a methodology section to make clearer how each queen’s placement was determined. Each queen’s placement was also clarified accordingly.

As I mentioned in my last article, I have a not-so-secret love of all things RuPaul’s Drag Race. I love poring over the data, memorizing the statistics the way my peers memorize baseball players’ RBIs and ERAs, and taking every opportunity possible to understand the art of drag more. So, when I found out we’d be learning machine learning and predictive analysis as a part of my Flatiron School Data Science program, I couldn’t help but get excited at the prospect of taking my love for Drag Race to the next level. Thus: my quest to build a predictive model to determine the eventual winner of RuPaul’s Drag Race All Stars Season 5. I set to work right away, and I’m excited to share my initial findings with you.

Before I get into my analysis, let me start by saying two things:

*First of all: I am five weeks into my data science program. In fact, we haven’t even started the type of modeling I’ve attempted here. I’m certain this model could be better, and even more certain that I’ll continue to improve it as the season goes on and as my skill set improves. So, if your favorite has a 0.00% chance of winning right now — don’t fret, this is mostly just for fun… for now.

Next up: if you are a big data nerd like me, you can reference the data I’ve used for the initial draft of my model here. This will be a living document that I update as I iterate on the model moving forward.

Now that THAT is out of the way, let’s get down to business. I’ll first discuss a TL;DR version of what the model determined and why. Next up, I’ll outline WHAT the model found (since let’s be honest, that’s what you came here for). Third will be HOW I created the model, and finally, what changes I’m considering for the next draft (and there are many).

TL;DR: What was the methodology, and what factors did this version of the model deem as important?

The model I created leverages 37 unique variables for each queen who has competed on All Stars 2–5 and is not still actively competing (34 queens in all). These included race, gender identity, stated hometown for both the queen’s original season and All Stars season, various performance metrics for both seasons (such as wins, highs, and bottoms), as well as metrics for specific challenge wins.

From there, I used a method called Univariate Selection to determine the 10 most important factors that indicate overall placement on All Stars. For this version of the model, these were:

All Stars challenge wins
All Stars challenge “high” placements
All Stars challenge “safe” placements
All Stars lip sync wins (regardless of whether it was “for their life” or “for their legacy”)
The latitude of their original hometown
The number of years between their original season and All Stars
Original season challenge “high” placements
Whether someone is Latinx
Whether the queen won the All Stars “Ball” challenge
Whether the queen won the All Stars “Makeover” challenge

From there, I used 80% of the data for these 10 data points to train a model to predict the eventual All Stars outcome. I then used the remaining 20% of the data to test the model. It is important to note that the most accurate model option had an accuracy score of only .571 (out of 1.00), meaning that this model has significant room for improvement.

After training the model, I loaded only the remaining six queens and ran it again, pulling the probabilities that each queen would get each placement (meaning I had 6 x 10 probabilities, despite the fact that 7th through 10th place have already been determined). Finally, I summed the probabilities that each queen would get first, and divided each queen’s probability by that sum to get the likelihood that the queen would take first place. To put this a different way, where P is the probability:

P[Alexis][1st] + P[Blaire][1st] + P[Cracker][1st] + P[India][1st] + P[Jujubee][1st] + P[Shea][1st] = P[Total][1st]
P[Alexis][1st]/P[Total][1st] = Alexis’s chance to win All Stars 5 compared to the other five competitors.
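To make the arithmetic above concrete, here is a minimal sketch of the same normalization. The raw probabilities are invented purely for illustration (they are not the model’s output), and the variable names are my own.

```python
# Hypothetical raw first-place probabilities, one per remaining queen.
# These numbers are made up for illustration; they are not the model's output.
p_first = {
    "Alexis": 0.001,
    "Blair": 0.004,
    "Cracker": 0.600,
    "India": 0.003,
    "Jujubee": 0.002,
    "Shea": 0.190,
}

# P[Total][1st]: the sum of every queen's raw first-place probability.
p_total = sum(p_first.values())

# Each queen's share of that total is her relative chance of winning.
relative_chance = {queen: p / p_total for queen, p in p_first.items()}

for queen, share in sorted(relative_chance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{queen}: {share:.1%}")
```

Because every share is divided by the same total, the six relative chances always add up to 100%, no matter how small the raw probabilities are.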

SO, who is poised to snatch the crown according to version 1 of the model, four weeks into the competition?

Are you feeling parched? Because Miz Cracker is currently sitting at a 98.6% chance of taking this whole competition home. This is due largely to her Episode 4 main challenge win (and subsequent Lip Sync for her Legacy tie with Lip Sync Assassin du jour Morgan McMichaels). She has also had a consistent record of high performance without ever landing in the bottom. What’s more, she was a consistently high performer in her original season, who didn’t wait too long between her original outing and All Stars.

While her unrealistically-high chance of victory will certainly plummet as the model evolves and as the season wears on, right now, Madame Cracker is the one to beat.

How do the rest of the queens shake out?

The queen you should expect to see statistical improvement from as the season goes on is Shea Coulee. While she’s sitting at just a 1.39% chance of victory in this version of the model, any Drag Race fan knows that Shea’s record is nothing to scoff at. Unfortunately, Shea falling into the bottom in Episode 3 greatly hurt her chances in the model because it reduced her number of wins/highs/safes. That said, given the low accuracy of the initial model, we should expect to see this percentage climb above where it’s currently sitting. Whereas I believe we should expect Cracker’s stock to fall, look for Shea’s to rise as the season wears on; even just one more challenge win under her belt should be enough to see a massive statistical rebound.

Next up are Blair St. Claire (.020%) and India Farrah (.018%). Blair has had a middling track record that is keeping her from falling, but her lack of wins (or even high placements) is hindering her chances in the model. India, on the other hand, is benefiting from the fact that the number of bottom placements doesn’t have a strong correlation to eventual placement in the competition (you can thank queens like Roxxxy Andrews for that). That said, a lack of top placements will negatively impact your chances, which isn’t serving India.

Finally, there are Jujubee (.006%) and Alexis Mateo (.000%). Unfortunately, Juju is being dragged down by the fact she couldn’t pull off the W against Assassin/All Stars 4 co-winner Monet X Change, and by the fact that her first challenge win ever only came in Episode 3 of AS5. Alexis has yet to pull off a win all season and has had a few lows, which pull her stock down. What’s more, both queens have had large gaps between their original run and this season of All Stars. As mentioned previously, the larger the gap between the original season and All Stars, the more it hurts a queen’s overall chances, meaning the Season 2 and 3 queens carry that as a handicap. Making matters even worse for Alexis is, unfortunately (and definitely not justly), the fact she is Latinx; of previous contestants, Roxxxy Andrews placed the highest at 4th place, and otherwise, both other Latinx queens (Aja and Valentina) placed 7th.

In Greater Depth: How I Created the Model

Because this was my first attempt (ever) at a model, I wasn’t certain what Python libraries I might need. As a result, I started the process by pulling in a large cross-section to ensure that as I worked through the model I would have access to call different functions.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from scipy.io import arff
%matplotlib inline

Next up: I imported my data and created dummy variables for categorical data such as gender identity and ethnicity, and then dropped the “defaults,” which I set as male and white (as a majority of All Stars queens identify as such):

df = pd.read_csv('/Users/joesanders/Desktop/DragRaceModel/AllStars.csv')
df_gender = pd.get_dummies(df['gender_id'])
df_ethnicity = pd.get_dummies(df['ethnicity'])
df_concat = pd.concat([df, df_gender, df_ethnicity], axis=1)
# Drop the raw text columns and the baseline categories (white, male)
df_concat.drop(['gender_id', 'ethnicity', 'White', 'M'], inplace=True, axis=1)
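For anyone who hasn’t used dummy variables before, here is a small sketch of what that step does, run on a toy table rather than the real CSV. The four rows are invented; only the column names `gender_id` and `ethnicity` and the dropped baselines (`M`, `White`) come from the code above.

```python
import pandas as pd

# Toy stand-in for the real CSV; the rows are invented for illustration.
toy = pd.DataFrame({
    "queen": ["A", "B", "C", "D"],
    "gender_id": ["M", "F", "N", "M"],
    "ethnicity": ["White", "Latinx", "Black", "White"],
})

# One indicator column per category value.
gender = pd.get_dummies(toy["gender_id"])
ethnicity = pd.get_dummies(toy["ethnicity"])

encoded = pd.concat([toy, gender, ethnicity], axis=1)
# Drop the raw text columns plus the baseline categories ("M" and "White").
encoded = encoded.drop(columns=["gender_id", "ethnicity", "White", "M"])

print(encoded)
```

A queen in the baseline group (white and male) ends up with zeros in every remaining indicator column, which is how the model can still tell her apart without needing dedicated `M` or `White` columns.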

Next, I leveraged Univariate Selection to determine the best columns to use for the model.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
fit_X = df_concat[['original_season', 'os_ht_long', 'os_hit_lat', 'os_age', 'os_placement',
                   'os_wins', 'os_highs', 'os_safes', 'os_lows', 'os_btm2', 'os_lipsync_wins',
                   'as_ht_long', 'as_ht_lat', 'moved', 'as_age', 'os_as_age_gap',
                   'as_wins', 'as_highs', 'as_safes', 'as_lows', 'as_btms', 'as_lipsync_wins',
                   'as_varietyshow', 'as_snatchgame', 'as_rusical', 'as_parody', 'as_standup',
                   'as_business', 'as_makeover', 'as_supergroup', 'as_ball',
                   'F', 'N', 'Asian', 'Black', 'Latinx', 'Multiracial']]
fit_y = df_concat['as_placement']
# Score every column against placement and list the 10 highest-scoring ones
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(fit_X, fit_y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(fit_X.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(10, 'Score'))

Based on the resulting data set, it would seem that All Stars Wins, All Stars Lip Sync Wins, Original Season Hometown Latitude, Gap Between Original and All Stars season, All Stars and Original Season High placements, whether someone is Latinx, All Stars Safe placements, and performance in the All Stars Ball and Makeover challenges are the 10 primary factors to consider for the model.
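As a rough illustration of how Univariate Selection picks its winners, here is a self-contained sketch on synthetic data. The column names and values are invented, and the target is deliberately built from one column so there is something to find. One thing worth knowing: scikit-learn’s `chi2` scorer only accepts non-negative feature values, which matters for any column (a longitude, for example) that might be stored as a signed number.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic stand-in data: 34 rows of non-negative counts, invented column names.
features = pd.DataFrame({
    "wins": rng.integers(0, 5, size=34),
    "highs": rng.integers(0, 5, size=34),
    "safes": rng.integers(0, 5, size=34),
    "lows": rng.integers(0, 5, size=34),
})
# Make the target depend on "wins" so the selector has something to find.
target = (features["wins"] >= 2).astype(int)

# chi2 only accepts non-negative feature values.
selector = SelectKBest(score_func=chi2, k=2).fit(features, target)

# Pair each score with its column name and list the columns that were kept.
scores = pd.Series(selector.scores_, index=features.columns).sort_values(ascending=False)
selected = list(features.columns[selector.get_support()])

print(scores)
print("Selected:", selected)
```

Using `get_support()` returns the chosen column names directly, which saves copying them by hand from a printed score table.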

To be certain, I also ran a correlation Heat Map.

corrmat = df_concat.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(30,30))
# plot heat map
g = sns.heatmap(df_concat[top_corr_features].corr(), annot=True, cmap="RdYlGn")

The heat map makes clear there is a correlation between All Stars Wins, Safes, and Lip Sync Wins and ultimately winning All Stars, which seems appropriate from a common-sense perspective. Interestingly, All Stars Bottom placements are not a strong indicator of the eventual winner.
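A 30 x 30 heat map can be hard to read at a glance, so one possible complement is to rank each column by its correlation with placement alone. The sketch below does that on synthetic data; the column names echo the ones in this article, but every value (and the relationship between wins and placement) is invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the real data; names mirror the article's, values are invented.
toy = pd.DataFrame({
    "as_wins": rng.integers(0, 4, size=34),
    "as_btms": rng.integers(0, 4, size=34),
    "os_as_age_gap": rng.integers(1, 10, size=34),
})
# Invented target: more wins pushes placement toward 1 (lower is better).
toy["as_placement"] = (10 - 2 * toy["as_wins"] + rng.integers(0, 3, size=34)).clip(1, 10)

# Correlation of every feature with placement, strongest first by absolute value.
corr_with_placement = (
    toy.corr()["as_placement"]
    .drop("as_placement")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_placement)
```

Since first place is the smallest placement number, a strongly negative correlation here means the feature is associated with finishing better, not worse.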

Next, I assigned the X and y, where X is the columns we determined as the best fits for the model, and y is the desired result (as_placement). I then split it into training and test data and ran several potential models to determine best-fit.

X = df_concat[['as_wins', 'as_highs', 'as_safes', 'as_lipsync_wins', 'os_hit_lat', 'os_as_age_gap', 'os_highs', 'Latinx', 'as_ball', 'as_makeover']].values
y = df_concat['as_placement'].values
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
SVC_model = svm.SVC()
SVC_model.fit(X_train, y_train)
SVC_prediction = SVC_model.predict(X_test)
print("Support Vector Machine Accuracy:", accuracy_score(y_test, SVC_prediction))
GNB_model = GaussianNB()
GNB_model.fit(X_train, y_train)
GNB_prediction = GNB_model.predict(X_test)
print("Gaussian Naive Bayes Accuracy:", accuracy_score(y_test, GNB_prediction))
LR_model = LogisticRegression(max_iter=10000)
LR_model.fit(X_train, y_train)
LR_prediction = LR_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, LR_prediction))
RFC_model = RandomForestClassifier()
RFC_model.fit(X_train, y_train)
RFC_prediction = RFC_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, RFC_prediction))
KNN_model = KNeighborsClassifier(n_neighbors=6)
KNN_model.fit(X_train, y_train)
KNN_prediction = KNN_model.predict(X_test)
print("K Nearest Neighbor Accuracy:", accuracy_score(y_test, KNN_prediction), "(Note: Elbow Method for Optimal K was Used)")
LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train, y_train)
LDA_prediction = LDA_model.predict(X_test)
print("Linear Discriminant Analysis Accuracy:", accuracy_score(y_test, LDA_prediction))

Unfortunately, I knew at this point that my current data set wasn’t strongly indicative of eventual placement, as my most accurate model option had an accuracy score of only .571 (out of 1.00). However, because the goal of this exercise is to practice and reinforce my knowledge, I decided to continue forward knowing I would iterate on this model in the future. As a result, I committed to Linear Discriminant Analysis as my final model.
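One caveat on that score: with 34 rows, a 20% hold-out leaves roughly seven test queens, so .571 is consistent with four correct predictions out of seven, and a single queen either way would move it a lot. A possible way to get a steadier comparison is k-fold cross-validation, sketched below. This is not what I ran for this draft; the data is synthetic (three invented outcome classes instead of ten placements) and the fold count is an arbitrary choice.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 34 rows, 10 features, 3 invented outcome classes.
X_demo = rng.normal(size=(34, 10))
y_demo = rng.integers(0, 3, size=34)
X_demo[:, 0] += y_demo  # give one feature some signal

candidates = {
    "Gaussian Naive Bayes": GaussianNB(),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=6),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
}

# Plain (unstratified) 5-fold split, shuffled with a fixed seed for repeatability.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Every row gets used for testing exactly once this way, so the averaged score depends less on which handful of queens happened to land in a single test split.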

Final_model = LinearDiscriminantAnalysis()
Final_model.fit(X, y)

Then it was time to import model data for the queens still currently in the competition:

df_active = pd.read_csv('/Users/joesanders/Desktop/DragRaceModel/Active.csv')
df_gender_active = pd.get_dummies(df_active['gender_id'])
df_ethnicity_active = pd.get_dummies(df_active['ethnicity'])
df_concat_active = pd.concat([df_active, df_gender_active, df_ethnicity_active], axis=1)
df_concat_active.drop(['gender_id', 'ethnicity', 'White', 'M'], inplace=True, axis=1)
df_concat_active

And finally, time to make the probability analysis:

# make a prediction
Xnew = df_concat_active[['as_wins', 'as_highs', 'as_safes', 'as_lipsync_wins', 'os_hit_lat', 'os_as_age_gap', 'os_highs', 'Latinx', 'as_ball', 'as_makeover']].values
ynew = Final_model.predict_proba(Xnew)
# show the predicted probabilities
# column 0 of ynew corresponds to Final_model.classes_[0], treated here as 1st place
sum_of_probabilities = ynew[:, 0].sum()
for i in range(len(Xnew)):
    prediction = (100 * ynew[i][0] / sum_of_probabilities).round(3)
    print("Predicted=%s percent" % prediction)

As a reminder, the order of queens is India Farrah, Shea Coulee, Miz Cracker, Jujubee, Alexis Mateo, Blair St. Claire.
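Matching percentages to queens by row order is easy to get wrong, so here is a hedged sketch of one way to print names next to their numbers and to look up the first-place column explicitly through the model’s `classes_` attribute. Everything in it is synthetic: the training data, the four placement classes, and the three contestant names are all placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic training data: 40 rows, 3 features, placements 1-4 (invented).
y_train_demo = np.repeat([1, 2, 3, 4], 10)
X_train_demo = rng.normal(size=(40, 3))
X_train_demo[:, 0] -= y_train_demo  # lower placement number gets a larger feature 0

model = LinearDiscriminantAnalysis().fit(X_train_demo, y_train_demo)

# Hypothetical "active" contestants, listed in the same order as their feature rows.
names = ["Queen A", "Queen B", "Queen C"]
X_active_demo = rng.normal(size=(3, 3))

proba = model.predict_proba(X_active_demo)

# Look up the column for placement 1 instead of assuming it is column 0.
first_col = int(np.flatnonzero(model.classes_ == 1)[0])
p_first = proba[:, first_col]
win_share = p_first / p_first.sum()

for name, share in zip(names, win_share):
    print(f"{name}: {100 * share:.3f} percent")
```

The columns of `predict_proba` follow the order of `classes_`, so the lookup keeps working even if the training data’s smallest placement value is not 1.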

So, what changes are coming in the next draft?

As I mentioned at the start of the article, there are several things that I think could be good indicators to add. These include:

Number of “talking head” confessionals
What drag “camp” a queen falls into (example: comedy, fashion/look, camp, singing/dancing, realness, etc.)
Net Worth (did they parlay their original run on Drag Race into an empire a la AS3 winner Trixie Mattel)
YouTube views
Instagram followers
Total screen time (may be difficult to acquire)
Number of times a queen received judges’ critiques

What’s more, I want to take a closer look at a few of the key indicators that showed up in my original model to see whether encoding them differently would make them more meaningful. For example:

Does it make more sense to combine Black, Asian, Latinx, and Multiracial into BIPOC given the disproportionate number of white queens?
Does it make more sense to combine F (female) and N (nonbinary) gender identities into NM (non-male) given the disproportionate number of male-identifying queens?
Does it make sense to include specific challenge win types (such as Ball, Makeover) when not every season has every challenge type? Do these complicate the model in an unexpected way that could just as easily be captured by Win/High/Safe/Low/Bottom?
Do longitude and latitude make sense as key metrics? My original thought was to include them given the very high number of queens from New York, Chicago, Las Vegas, Tampa/Orlando, and Los Angeles. However, does it make more sense to lump the queens into regions instead?
Should I further reduce the number of data points included in the model? Presently, I include 10 key indicators; does that make sense, or would fewer make the model more reliable instead of less?
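To give a sense of what the first two ideas above might look like in practice, here is a minimal sketch that collapses the individual indicator columns into single BIPOC and NM flags. The toy rows are invented; only the column names come from the model, and whether this grouping actually helps is exactly the open question.

```python
import pandas as pd

# Toy frame using the article's indicator column names; the rows are invented.
toy = pd.DataFrame({
    "F": [0, 1, 0, 0],
    "N": [0, 0, 1, 0],
    "Asian": [0, 0, 1, 0],
    "Black": [1, 0, 0, 0],
    "Latinx": [0, 1, 0, 0],
    "Multiracial": [0, 0, 0, 0],
})

# A queen is flagged if any of the underlying indicators is set.
toy["BIPOC"] = toy[["Asian", "Black", "Latinx", "Multiracial"]].max(axis=1)
toy["NM"] = toy[["F", "N"]].max(axis=1)

collapsed = toy[["BIPOC", "NM"]]
print(collapsed)
```

Taking the row-wise maximum works because each indicator is 0 or 1, so the combined flag is 1 whenever at least one of the original columns is.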

If you made it this far, CONGRATS, you are a massive data nerd. Did I miss something in my coding? Is there a metric I should be considering that I haven’t thought of? Sound off in the comments and let me know! A new update will be coming after Episode 5 premieres later this week, and don’t forget I’m also working on a model for Drag Race Canada! Stay tuned for much, much more.