Using Machine Learning to Predict Alabama’s Senate Race

John Bencina
Published in Data Insights · Dec 15, 2017 · 8 min read

On Tuesday, December 12th, the people of Alabama voted in a special election between Democratic candidate Doug Jones and Republican candidate Roy Moore. The election drew many parallels to the 2016 Presidential Election. As an interesting exercise, I wondered whether it was possible to use 2016 election data to predict the outcome of the 2017 Alabama election.

This guide walks through how to use 91 county-level Census variables, combined with the 2016 election results, to try to predict the outcome of the 2017 election. The overview uses Python and Jupyter Notebook to compare predictive models built with sklearn, XGBoost, and Keras.

You can follow along in the notebook from the GitHub repository.

Getting the Census data

The Census data is sourced from the 2016 ACS 5-year county-level survey. I had previously aggregated this data, which can be found on GitHub or in BigQuery. The dataset contains several dozen variables related to race, income, education, population, employment, language, and age, mostly expressed as a percentage of the population.

The pandas-gbq package makes this very easy to source from BigQuery.

import pandas as pd

def get_acs_data(bq_key_path, bq_project_id):
    # Pull every county-level variable from the ACS 2016 5-year table in BigQuery
    query = """
    #standardSQL
    SELECT
      *
    FROM `jbencina-144002.census.acs_2016_5yr_county`
    """

    data_bq = pd.read_gbq(query, private_key=bq_key_path, project_id=bq_project_id, dialect='standard')

    # Keep only the trailing 5-digit county FIPS code from the geographic Id
    data_bq.Id = data_bq.Id.str[-5:]
    data_bq = data_bq.drop(['HHWithComputer', 'HHWithBroadband'], axis=1)  # Not available in current 5YR release
    data_bq = data_bq.rename(columns={'Id': 'fips'})
    data_bq.to_csv(file_name, index=False, encoding='utf-8')  # file_name is a notebook-level constant
    return data_bq

Getting the election results

Next, I needed the 2016 election results by county. The datasets I found weren't quite what I was looking for, so I decided to scrape the web. One way of doing this would be to literally scrape an election results page with a tool like BeautifulSoup, but dealing with CSS and HTML is a pain, so I tried an easier approach.

Using Chrome's developer tools on the Network tab, I found that the Washington Post election results page had the data I needed. I noticed there was a JSON resource containing the information that populates the election map. Much easier than web scraping!

After some manual inspection of the JSON, I wrote a loop that iterated through the entries and restructured them into a Pandas data frame. I then grouped and pivoted the data frame so that I ended up with a % spread (which is just the % Republican vote minus the % Democratic vote) by county.
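The exact structure of the Washington Post feed isn't reproduced in the post, but a minimal sketch of the restructuring step might look like the following, assuming a hypothetical JSON layout where each entry carries a county FIPS code, a party label, and a vote count (the real field names and URL will differ).

def build_national_spread(entries):
    import pandas as pd

    # Flatten the raw JSON entries into one row per (county, party) pair
    rows = [{'fips': e['fips'], 'party': e['party'], 'votes': e['votes']}
            for e in entries]
    df = pd.DataFrame(rows)

    # Pivot so each county has one column per party, then convert counts to shares
    pivot = df.pivot_table(index='fips', columns='party', values='votes', aggfunc='sum')
    shares = pivot.div(pivot.sum(axis=1), axis=0)

    # Spread = % Republican vote minus % Democratic vote (column names here are assumptions)
    shares['spread_national_gop'] = shares['gop'] - shares['dem']
    return shares[['spread_national_gop']].reset_index()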

The data for the Alabama election was a bit easier. Similar to the Washington Post, the New York Times election page has a JSON feed populating its county-level results. Fortunately, I didn't need to write a bunch of crazy loops; this data just required converting the JSON to a Pandas data frame and creating a few calculated fields.

import pandas as pd
import requests

def get_nyt_alabama_results():
    # The NYT results page exposes county-level counts as a JSON feed
    nyt_data = requests.get('https://int.nyt.com/data/sendoff/eln_2017_prd_2017-12-12')
    votes = pd.DataFrame.from_dict(nyt_data.json()['contents']['races'][0]['counties'])

    # Pull each candidate's vote count out of the nested results dict
    votes['jones'] = votes.results.apply(lambda x: x['jonesd'])
    votes['moore'] = votes.results.apply(lambda x: x['moorer'])
    votes['jones_pct'] = votes.jones / votes.votes
    votes['moore_pct'] = votes.moore / votes.votes
    votes['spread_moore'] = votes.moore_pct - votes.jones_pct
    votes = votes.drop(['results'], axis=1)
    votes.to_csv(file_name, index=False, encoding='utf-8')  # file_name is a notebook-level constant
    votes = votes[['fips', 'spread_moore']]

    return votes

Finally, I merged the three data frames into one consolidated master. The final data frame contained the following bits of information:

  • 91 scaled Census variables (z-score)
  • 3,109 records with 2016 election spread values
  • 68 records with 2017 election spread values

from sklearn.preprocessing import StandardScaler

def get_merged(data_acs, data_alabama, data_election):
    # Join the Census features with the 2017 Alabama and 2016 national spreads on FIPS code
    merged = data_acs.merge(data_alabama, on=['fips'], how='left')
    merged = merged.merge(data_election, on=['fips'], how='left')

    # Remove Alaska and Hawaii due to some missing county-level data
    merged = merged[~merged.Label.apply(lambda x: any(y in x for y in ['Alaska', 'Hawaii']))]

    # z-score the Census variables; the last two columns are the spread labels
    merged = merged.set_index(['fips', 'Label'])
    merged.iloc[:, :-2] = StandardScaler().fit_transform(merged.iloc[:, :-2])

    return merged

Splitting the data

In machine learning, we typically split our data into training, validation, and testing sets. For this project, the training data consisted of a 70% split of the 2016 national election results, which were validated against the remaining 30%. Alabama was omitted from both the training and validation sets so as not to "cheat" in the training process.

The purpose of testing data in machine learning is to fairly measure final accuracy when comparing multiple models or a single model with multiple hyperparameters. So if I had 5 competing models, I would measure each against the validation data, but at the very end, I would report my chosen model's accuracy against the testing data set.

In this case, the testing data is the Alabama counties from the 2017 result. I actually reserve 50% of them for a final holdout test.

from sklearn.model_selection import train_test_split

feat_names = merged.columns[:-2]

x_national = merged[merged.spread_moore.isnull()].iloc[:, :-2].values
y_national = merged[merged.spread_moore.isnull()].loc[:, 'spread_national_gop'].values
x_alabama = merged[~merged.spread_moore.isnull()].iloc[:, :-2].values
y_alabama = merged[~merged.spread_moore.isnull()].loc[:, 'spread_moore'].values

# 70/30 train/validation split on the national data; 50/50 validation/test split on Alabama
x_national_train, x_national_val, y_national_train, y_national_val = train_test_split(x_national, y_national, test_size=0.3)
x_alabama_val, x_alabama_test, y_alabama_val, y_alabama_test = train_test_split(x_alabama, y_alabama, test_size=0.5)

Building the training framework

For this exercise, I run through a bunch of models in the sklearn framework to compare their performance on this task. More specifically, I'll be considering:

  • Linear Models: Ridge Regression, Lasso Regression, ElasticNet, and LassoLars Regression
  • Support Vector: LinearSVR
  • Gradient Boosting: XGBoost
  • Deep Learning: Keras (Tensorflow)

For the sklearn linear packages, I leverage the CV versions, which use cross-validation to self-tune some of the hyperparameters. There is certainly an opportunity to use deeper tuning strategies to extract better performance.
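As a rough sketch, the CV variants (and the LinearSVR baseline) might be instantiated like this; aside from the ElasticNetCV arguments shown later, the exact settings I used aren't reproduced here, so treat the values below as defaults.

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, LassoLarsCV
from sklearn.svm import LinearSVR

# The CV variants tune their regularization strength via internal cross-validation;
# LinearSVR has no CV wrapper in sklearn, so it is left at its defaults here.
candidate_models = {
    'Ridge Regression': RidgeCV(),
    'Lasso Regression': LassoCV(max_iter=5000),
    'ElasticNet Regression': ElasticNetCV(max_iter=5000),
    'LassoLars Regression': LassoLarsCV(),
    'LinearSVR': LinearSVR(),
}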

I added XGBoost and Tensorflow to the mix because of the non-linearity between variables. You can imagine there are natural interactions between race, income, education, etc. which affect the model's prediction, while the linear models incorporate these variables in isolation. XGBoost creates non-linearity by sub-setting data through stacked decision trees, and Tensorflow uses weighted inter-connections across multiple layers to do the same.

I wrote the following helper function, which takes an sklearn model and measures training, validation, and testing scores using 5-fold cross-validation. I plot the results with another helper function and return a dictionary which will later be consolidated into a Pandas data frame. The models are scored using MSE, which is appropriate for our regression problem.

from sklearn.model_selection import cross_val_score

def train_model(model, model_name, x_train, y_train, x_val, y_val, x_test, y_test, cv_iter=5):
    model.fit(x_train, y_train)

    # sklearn reports negative MSE, so flip the sign to get positive error values
    train_acc = cross_val_score(model, x_train, y_train, scoring='neg_mean_squared_error', cv=cv_iter).mean() * -1
    val_acc = cross_val_score(model, x_val, y_val, scoring='neg_mean_squared_error', cv=cv_iter).mean() * -1
    test_acc = cross_val_score(model, x_test, y_test, scoring='neg_mean_squared_error', cv=cv_iter).mean() * -1

    print('{}-fold CV MSE Value | Train: {:.4f} | Validation: {:.4f} | Test: {:.4f}'.format(cv_iter, train_acc, val_acc, test_acc))
    plot_results(model.predict(x_test), x_test, y_test, model_name)

    return {'name': model_name, 'train': train_acc, 'validation': val_acc, 'test': test_acc}

Linear Models

I began with the linear models, using the ones from sklearn that incorporate some kind of regularization. Since I used the CV implementations from sklearn, I skipped hyperparameter tuning and just ran them as is. Each of the sections runs through similar code blocks. As an example, I'm showing the ElasticNetCV model being trained on the national data and tested against the Alabama data.

from sklearn.linear_model import ElasticNetCV

model = ElasticNetCV(max_iter=5000)

# model_scores is a list defined earlier in the notebook that collects one score dict per model
result = train_model(model,
                     model_name='ElasticNet Regression - All Vars',
                     x_train=x_national_train,
                     y_train=y_national_train,
                     x_val=x_national_val,
                     y_val=y_national_val,
                     x_test=x_alabama_val,
                     y_test=y_alabama_val)

model_scores.append(result)

From the ElasticNet run we see that the model has a fairly low error rate against the Alabama data. The chart helps visualize what this looks like at a county level.

The gray circles represent the actual point spread between Moore and Jones, and the red X marks show the predicted values. Some predictions are spot on, others are a little further off, but generally the prediction and actual values land on the same side of the spread line.

There is one quirk you may notice: the red predictions generally skew more positive (more Republican); in fact, none skew more negative (Democratic). I'll come back to this later.

5-fold CV MSE Value | Train: 0.0180 | Validation: 0.0196 | Test: 0.0118
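The plot_results helper itself isn't reproduced in the post; a minimal sketch matching the description above (gray circles for actuals, red X marks for predictions) might look like this with matplotlib.

import matplotlib.pyplot as plt
import numpy as np

def plot_results(preds, x_test, y_test, title):
    # x_test is accepted to match the call signature in train_model but isn't used here
    # Sort counties by actual spread so the chart reads left-to-right from Democratic to Republican
    order = np.argsort(y_test)
    idx = np.arange(len(y_test))

    plt.figure(figsize=(10, 4))
    plt.scatter(idx, y_test[order], facecolors='none', edgecolors='gray', label='Actual spread')
    plt.scatter(idx, preds[order], color='red', marker='x', label='Predicted spread')
    plt.axhline(0, color='black', linewidth=0.5)  # above the line favors Moore, below favors Jones
    plt.title(title)
    plt.legend()
    plt.show()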

Non-linear models

I also tried out the non-linear approaches. The XGBoost library conveniently includes a version implemented in the sklearn framework, so I could plug it nicely into my helper function. Unfortunately, this algorithm actually performed worse, with an MSE of 0.052 compared to ElasticNet at 0.012.

import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=500, subsample=0.9, max_depth=4, min_child_weight=3,
                         gamma=0.1, colsample_bytree=0.9, reg_alpha=0.1, learning_rate=0.01,
                         eval_metric='rmse')

result = train_model(model,
                     model_name='XGBoost Regression - All Vars',
                     x_train=x_national_train,
                     y_train=y_national_train,
                     x_val=x_national_val,
                     y_val=y_national_val,
                     x_test=x_alabama_val,
                     y_test=y_alabama_val)

model_scores.append(result)

The same happens when trying a neural network on the problem. The Keras framework on top of Tensorflow makes it very easy to build a neural network in a few lines of code. The architecture below is the result of about 20 minutes of playing around with different learning rates, activation functions, and layer sizes. Its final MSE on the test set was 0.022, better than XGBoost but still beaten by the linear models.

from keras.callbacks import EarlyStopping
from keras.layers import Activation, BatchNormalization, Dense, Dropout, LeakyReLU
from keras.models import Sequential
from keras.optimizers import Adam

# Stop training once validation MSE stops improving for 10 epochs
es = EarlyStopping(monitor='val_mean_squared_error',
                   min_delta=0,
                   patience=10,
                   verbose=0, mode='auto')

alpha = 0.1

# Three LeakyReLU blocks with batch norm and light dropout, followed by a linear output
model = Sequential()
model.add(Dense(512, input_dim=91))
model.add(LeakyReLU(alpha=alpha))
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(256))
model.add(LeakyReLU(alpha=alpha))
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(64))
model.add(LeakyReLU(alpha=alpha))
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(1))
model.add(Activation('linear'))

model.compile(optimizer=Adam(lr=0.004, decay=0.0001),
              loss='mean_squared_error',
              metrics=['mean_squared_error'])

model.fit(x_national_train, y_national_train, epochs=50, batch_size=256, callbacks=[es],
          validation_data=(x_alabama_val, y_alabama_val))

Reducing Dimensions

One last quick hack I tried was reducing dimensions, since 91 variables is quite a lot. Using XGBoost's feature importance, I identified the top contributing variables. I kept the top 20 variables as inputs, but reduced the remaining 71 down to 10 using PCA, hoping to keep a summarized version of those truncated fields. I then reran the models using these 30 variables instead (a rough sketch of this step appears after the importance list below).

RaceWhite: 0.117
FamilyHH: 0.055
IndustryAgMining: 0.050
PopHispanic: 0.050
RaceBlack: 0.044
InsuranceNone: 0.041
LanguageEnglishOnly: 0.035
IncomePublicAssist: 0.033
EduGraduate: 0.031
TotalHH: 0.025
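The notebook code for this step isn't shown above, but a minimal sketch of how the reduction could be wired up follows, assuming a hypothetical top_20_cols list holding the 20 most important feature names taken from XGBoost's importances.

import numpy as np
from sklearn.decomposition import PCA

def reduce_features(x, feat_names, top_cols, n_components=10, pca=None):
    # Split the feature matrix into the top contributors and everything else
    top_idx = [i for i, name in enumerate(feat_names) if name in top_cols]
    rest_idx = [i for i in range(len(feat_names)) if i not in top_idx]

    # Keep the top features as-is, compress the remaining columns with PCA
    if pca is None:
        pca = PCA(n_components=n_components).fit(x[:, rest_idx])
    reduced = pca.transform(x[:, rest_idx])
    return np.hstack([x[:, top_idx], reduced]), pca

# Fit PCA on the training split only, then reuse it for the validation/test data
x_train_30, pca = reduce_features(x_national_train, feat_names, top_20_cols)
x_val_30, _ = reduce_features(x_national_val, feat_names, top_20_cols, pca=pca)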

Unfortunately, this reduction generally resulted in worse scores. I still think some reduction is needed, but more careful feature engineering needs to go into this step.

Wrap Up

In the end, I used this chart to compare the performance of each model against the 50% Alabama test set. You can clearly see the gap in performance between the different model types. In general, the linear models were better at this task, and having access to all of the variables improved performance.
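Since each call to train_model appends a score dict to model_scores, producing a comparison chart like this one can be as simple as the following sketch.

import pandas as pd

# Consolidate the per-model score dicts and plot the MSE values side by side
scores_df = pd.DataFrame(model_scores).set_index('name')
scores_df[['train', 'validation', 'test']].plot.bar(figsize=(10, 4), title='Model MSE comparison')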

I suspect that, since there were so few training instances relative to the number of variables, stochastic algorithms like XGBoost and Tensorflow would take more tuning to reach the same level of performance.

Also adding to this is that the label distribution varied from 2016 to 2017. Notice that during the 2016 national election (blue), most of the county data skewed positive (Republican). This moderated quite a bit for the Alabama campaign (orange).

It would be tempting to fix the data by adjusting the distribution mean, but in a real predictive scenario, we would not know about the existence of the skew until after the fact. So fixing this outright would be cheating.

I believe this difference is actually apparent in the model plots, since every model was biased towards Republican votes. This possibly had a greater negative influence on the non-linear models.

For next steps, I'd certainly like to dive more into the relationships between the features. It would also be interesting to reduce this problem from a regression to a binary classification (i.e., spread > 0) and compare performance. Given the prediction charts, I believe a binary model would perform much better.
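As a quick illustration of that binary framing, using the same splits, one possible sketch (not something I ran for this post) would be:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

# Convert spreads to a binary label: 1 if the county leans Republican (spread > 0)
y_train_bin = (y_national_train > 0).astype(int)
y_test_bin = (y_alabama_test > 0).astype(int)

clf = LogisticRegressionCV(cv=5, max_iter=5000)
clf.fit(x_national_train, y_train_bin)
print('Accuracy:', accuracy_score(y_test_bin, clf.predict(x_alabama_test)))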
