Who Will Win RuPaul’s Drag Race All Stars Season 5? (Week 5)

After “Snatch Game of Love” — and substantial updates to the model — who is poised to take the crown?

Joe Sanders
The Startup
8 min read · Jul 7, 2020


[Header image courtesy of Socialite Life]

Note: Each week, as a means of practicing the skills I’m learning at the Flatiron School, I make a prediction for who will win RuPaul’s Drag Race All Stars by using a model of my own creation.

Oh, what a difference a week makes. Not only in the world of RuPaul’s Drag Race All Stars, but also for my predictive model of who has the best chance to snatch the crown. I spent a lot of time last week discussing the shortcomings of my model, and, never one to give up, I have spent the days since figuring out how to boil it down to the most accurate predictions possible.

We’ll handle business the same way as last week:

  1. A TL;DR version of what the model determined and why
  2. An outline of WHAT the model found (since, let’s be honest, that’s what you came here for)
  3. How I tweaked the model
  4. What changes I’m considering for the next draft

Methodology & Key Factors

Much of my time in the last week went into massaging the data to see if I could extract more pertinent information. For example, I consolidated gender identity into male-identifying and not-strictly-male-identifying, and I split the country into four regions (the West, South, Northeast, and Midwest) to see if clustering by geography was more telling than longitude and latitude alone. I also found that splitting ethnicities into white, Black, and Asian/Latinx/Multiracial made the groupings more even, which made ethnicity a (potentially) more valuable data point.

All of that said, it didn’t really matter. At the end of the day, the more I iterate the model, the more it seems that winning All Stars boils down to just four simple things:

  • All Stars challenge wins
  • All Stars challenge “high” placements
  • All Stars lip sync wins (regardless of whether it was “for their life” or “for their legacy”)
  • The number of years between their original season and All Stars

Regardless of how I worked with the data, these variables rose to the top of the heap over and over again. So, for the purposes of this model, I’ve reduced the predictions to read off of this data alone.

Who is poised to snatch the crown according to version 2 of the model, five weeks into the competition?

Last week I mentioned that you shouldn’t sleep on Shea Coulee, and the improved model PLUS a decisive win in the Snatch Game of Love challenge catapulted her to the front of the line with a 52.95% chance of winning overall. She benefits from a strong track record and a relatively short gap between her original run and her trek through All Stars. With the field narrowing, and Shea’s stock rising, the other competitors will need to be aggressive if they’re going to stop the Chicago queen from taking home the check for “100,000 dow-lahs.”

How do the rest of the queens shake out?

Oh dear, sweet Miz Cracker. There could be 100 people in a room, and none of them would have laughed at her Snatch Game performance. But that’s okay, because the model still has her at a 32.08% chance of squeaking out the win for the season, which should improve if we see another Cracker win in the next two weeks.

Jujubee had another solid week, with a strong outing as Eartha Kitt and a fantastic ’80s-inspired runway. She is still being dragged down by that lip sync loss against Monet X Change, and the large gap between her seasons isn’t doing her any favors. Still, with a 7.02% chance of winning overall, she is definitely not down and out. Much like with Miz Cracker, another win under her belt, especially if paired with a lip sync win, could see her stock rise.

I don’t mean to minimize Alexis Mateo (5.62%) or Blair St. Clair (2.34%), but without a win at this point in the season, both will need nearly flawless final three episodes to have a shot at the crown. Even accounting for a margin of error, the bottom three are so far behind Shea and Cracker that the crown is a long shot, though at least Juju has one main challenge win to bolster her overall chances.

In Greater Depth: How I Created the Model

I started with the same foundation of libraries that I used last week:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import accuracy_score  # used for the model accuracy checks below
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
%matplotlib inline

Next, I made my attempt at splitting folks into regions. My goal was to get roughly equivalent regions based on longitude and latitude, and I came up with the code below. I did this for both a queen’s original season and their All Stars season, though only one is shown.

df = pd.read_csv('/Users/joesanders/flatiron-ds course/DragRaceModel/AllStars.csv')

# longitude
long_range = df['os_ht_long'].max() - df['os_ht_long'].min()
long_divider = df['os_ht_long'].max() - (long_range / 2)

# latitude
lat_range = df['os_ht_lat'].max() - df['os_ht_lat'].min()
lat_range_dividers = lat_range / 4
west_divider = df['os_ht_lat'].max() - lat_range_dividers
east_divider = df['os_ht_lat'].min() + lat_range_dividers

df['os_region'] = ""

# south
df['os_region'] = np.where((df['os_ht_long'] < long_divider)
                           & (df['os_ht_lat'] < west_divider),
                           'os_south',
                           df['os_region'])
# east coast
df['os_region'] = np.where((df['os_ht_long'] > long_divider)
                           & (df['os_ht_lat'] < east_divider),
                           'os_east',
                           df['os_region'])
# middle america
df['os_region'] = np.where((df['os_ht_long'] > long_divider)
                           & ((df['os_ht_lat'] < west_divider) & (df['os_ht_lat'] > east_divider)),
                           'os_mid',
                           df['os_region'])
# west coast
df['os_region'] = np.where((df['os_ht_lat'] > west_divider),
                           'os_west',
                           df['os_region'])

display(df['os_region'].value_counts())
df[['queen_name','os_ hometown','os_region']]

Then I consolidated gender identities and ethnicities, created my dummies and purged redundant data.

### consolidating gender identities
# 0 = male-identifying
# 1 = not strictly male-identifying
df = df.replace(to_replace="M", value="0")
df = df.replace(to_replace="F", value="1")
df = df.replace(to_replace="N", value="1")

### consolidating ethnicities
# groupings: White, Black, and "nwb" (Asian, Latinx, multiracial)
df = df.replace(to_replace=["Asian", 'Latinx', 'Multiracial'], value="nwb")

### adding dummies and removing redundancies
df_ethnicity = pd.get_dummies(df['ethnicity'])
df_os_region = pd.get_dummies(df['os_region'])
df_as_region = pd.get_dummies(df['as_region'])
df_concat = pd.concat([df, df_ethnicity, df_os_region, df_as_region], axis=1)
df_concat.drop(['ethnicity','White','os_west','as_west','os_ hometown','as_hometown','os_ht_long','as_ht_long','os_ht_lat','as_ht_lat'], inplace=True, axis=1)

Next, I leveraged univariate feature selection to determine the best columns to use for the model. I ran this a few times, removing different pieces of data (including geography) to see how it would impact the model scores.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

fit_X = df_concat[['gender_id', 'original_season', 'os_age', 'os_placement', 'os_wins', 'os_highs',
                   'os_safes', 'os_lows', 'os_btm2', 'os_lipsync_wins', 'moved', 'as_age',
                   'os_as_age_gap', 'as_wins', 'as_highs', 'as_safes', 'as_lows', 'as_btms',
                   'as_lipsync_wins', 'nwb', 'Black']]
fit_y = df_concat['as_placement']

bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(fit_X, fit_y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(fit_X.columns)
# concat the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']  # naming the dataframe columns
print(featureScores.nlargest(5, 'Score'))  # print the 5 best features

Next, I assigned X and y, where X holds the columns we determined to be the best fits for the model and y is the desired result (as_placement). I then split the data into training and test sets and ran several potential models to determine the best fit.

X = df_concat[['as_wins','as_lipsync_wins','os_as_age_gap','as_highs']].values
y = df_concat[['as_placement']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

SVC_model = svm.SVC(random_state=0)
SVC_model.fit(X_train, y_train.ravel())
SVC_prediction = SVC_model.predict(X_test)
print("Support Vector Machine Accuracy:", accuracy_score(SVC_prediction, y_test))

GNB_model = GaussianNB()
GNB_model.fit(X_train, y_train.ravel())
GNB_prediction = GNB_model.predict(X_test)
print("Gaussian Naive Bayes Accuracy:", accuracy_score(GNB_prediction, y_test))

LR_model = LogisticRegression(max_iter=10000, random_state=0)
LR_model.fit(X_train, y_train.ravel())
LR_prediction = LR_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(LR_prediction, y_test))

RFC_model = RandomForestClassifier(random_state=0)
RFC_model.fit(X_train, y_train.ravel())
RFC_prediction = RFC_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(RFC_prediction, y_test))

KNN_model = KNeighborsClassifier(n_neighbors=7)
KNN_model.fit(X_train, y_train.ravel())
KNN_prediction = KNN_model.predict(X_test)
print("K Nearest Neighbor Accuracy:", accuracy_score(KNN_prediction, y_test),
      "(Note: Elbow Method for Optimal K was Used)")

LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train, y_train.ravel())
LDA_prediction = LDA_model.predict(X_test)
print("Linear Discriminant Analysis Accuracy:", accuracy_score(LDA_prediction, y_test))

The accuracy is actually a bit lower in this version of the model than in the previous one, which is something I plan to dig into as I continue to refine it. Since I used Linear Discriminant Analysis as my final model last week, I decided to keep that methodology this week.

Final_model = LinearDiscriminantAnalysis()
Final_model.fit(X, y.ravel())
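To turn the fitted model into the win percentages quoted above, each remaining queen's current stats get scored by the model. The exact scoring code isn't reproduced here, but a minimal sketch looks something like the following; the file name ActiveQueens.csv is a stand-in for my weekly-updated list of remaining queens, and it assumes a placement of 1 marks the winner.

# Sketch only: score the remaining queens with the fitted LDA model.
# ActiveQueens.csv is a hypothetical stand-in; it needs one row per
# remaining queen with the same four feature columns used above.
current_queens = pd.read_csv('ActiveQueens.csv')
X_current = current_queens[['as_wins', 'as_lipsync_wins', 'os_as_age_gap', 'as_highs']].values

# predict_proba returns one column per placement class (ordered by
# Final_model.classes_); the column for placement 1 is the estimated
# chance of taking the crown.
probs = Final_model.predict_proba(X_current)
win_col = list(Final_model.classes_).index(1)
current_queens['probability'] = probs[:, win_col]
print(current_queens[['queen_name', 'probability']].sort_values('probability', ascending=False))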

Finally, I imported my active queen data to produce the graphics above. The trend graph was produced with the code below:

plt.figure(figsize=(20, 10))
sns.set_style("whitegrid")
sns.set_context("talk")
sns.set_palette("husl", 5)
ax = sns.lineplot(x='week', y='probability', hue='queen_name', data=df_probability, legend=False)
plt.legend(title='All Star Queen', loc='upper left', labels=['Shea Coulee', 'Miz Cracker', 'Jujubee', 'Alexis Mateo', 'Blair St. Clair'])
plt.title("Who is Winning RuPaul's Drag Race All Stars (Week 5)")
plt.xlabel("Week #")
plt.ylabel("Percent Likelihood of Winning")
plt.show()

So, what changes are coming in the next draft?

I didn’t incorporate any of the following in this draft of the model, but I am considering them for future versions.

  • Number of “talking head” confessionals
  • What drag “camp” a queen falls into (example: comedy, fashion/look, camp, singing/dancing, realness, etc.)
  • Net Worth (did they parlay their original run on Drag Race into an empire a la AS3 winner Trixie Mattel)
  • YouTube views
  • Instagram followers
  • Total screen time (may be difficult to acquire)
  • Number of times a queen received judges’ critiques

The other thing I would like to do is go back and modify my data so I can determine how the percentage likelihood has changed in each iteration of the model. I think it would be fascinating to see how each contestant’s elimination impacted the likelihood of other queens winning, which I can’t do with the way my data is currently formatted. Next time…

More next week — only three episodes left until we crown a winner!


Joe Sanders
The Startup

A contemporary Renaissance Man with passions for change leadership, project management, business operations, employee experience & development and data science.