Democrat or Republican? Politics and Logistic Regression
What Demographics Can Tell Us About Voters’ Choice of Party in House Elections
2016 Election for National House of Representatives
Using data put together by Steve Riffe at data.world for the 2016 election for the House of Representatives, I built a simple logistic regression model for predicting whether a voting district will elect a Democratic or Republican candidate based on its ethno-racial demographics. Riffe’s demographic data were taken from the US Census Bureau’s 2013 estimates.
The Census Bureau’s estimates use the following designations for the various ethno-racial groups:
- Hispanic
- White
- Black
- Native American
- Asian
- Pacific Islander
- Other
- Multiple Races
Training the Model
After converting the data’s raw values to reflect each demographic as a percentage of the total estimated population of each district, I created a voter turnout feature. I should note that, because the dataset’s data dictionary is unclear, I cannot tell whether this feature represents actual voter turnout or simply the total number of votes the winning candidate received. The target category, the victorious candidate’s party affiliation, was then converted to binary values with Democrat = 0 and Republican = 1. In case the reader is wondering where all the independents went: surprisingly, there were none.
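The preprocessing described above can be sketched roughly as follows. The column names and the three example rows are hypothetical stand-ins of my own, not the actual fields or values in Riffe’s dataset.

```python
import pandas as pd

# Hypothetical raw counts for three districts; names and numbers are illustrative only.
raw = pd.DataFrame({
    "district": ["AL-1", "AL-2", "HI-1"],
    "total_population": [1000, 2000, 500],
    "white": [700, 1000, 100],
    "black": [250, 800, 10],
    "asian": [50, 200, 390],
    "party": ["R", "R", "D"],
})

group_cols = ["white", "black", "asian"]

# Express each group as a percentage of the district's total population.
pct = raw[group_cols].div(raw["total_population"], axis=0) * 100

# Encode the target: Democrat = 0, Republican = 1.
target = raw["party"].map({"D": 0, "R": 1})

print(pct.round(1))
print(target.tolist())
```

Dividing row-wise with `axis=0` keeps every district on the same 0 to 100 scale regardless of its population.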
I used the scikit-learn implementation of logistic regression with a train/validate/test split, in keeping with best practices. Logistic regression was a natural first choice because the question is a binary classification problem. I took the model’s accuracy as the primary evaluation metric rather than the receiver operating characteristic (ROC) score, and measured it against a mode baseline in which every district is predicted to elect a Republican representative. I made this choice because accuracy is fairly interpretable to the layperson, and because the mode class made up only roughly 55% of districts, so the ROC score was not needed to compensate for imbalanced classes.
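For readers unfamiliar with a mode baseline, here is a minimal sketch of the idea using only the standard library. The labels are made up (11 Republican wins out of 20) purely to reproduce a 55% majority; the helper name `mode_baseline_accuracy` is my own.

```python
from collections import Counter

def mode_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class in y_true."""
    counts = Counter(y_true)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(y_true)

# Hypothetical labels: 0 = Democrat, 1 = Republican.
y = [1] * 11 + [0] * 9
majority, baseline = mode_baseline_accuracy(y)
print(majority, baseline)  # 1 0.55
```

Any model worth keeping has to beat this number, since it is what you get without looking at a single feature.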
Since the data were originally sorted alphabetically by state name and then by district number, I shuffled them and performed a 70/15/15 train/validation/test split. The model itself was fit in a scikit-learn pipeline with StandardScaler, as shown below.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a logistic regression on the result.
lr = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
lr.fit(X_train, y_train)

# Accuracy on the held-out validation set.
lr_accuracy_ = lr.score(X_val, y_val)
This simple model netted 84.62% accuracy on the validation set, nearly 30 percentage points above the baseline.
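The shuffled 70/15/15 split mentioned earlier can be reproduced with two calls to `train_test_split`. This is a sketch on synthetic data: the 435 rows simply mirror the number of House seats, and the nine feature columns and the `random_state` values are assumptions of mine rather than details taken from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the district data (assumed shape: 435 rows, 9 features).
rng = np.random.default_rng(0)
X = rng.random((435, 9))
y = rng.integers(0, 2, size=435)

# First carve off 70% for training; shuffling removes the state/district ordering.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, shuffle=True, random_state=42
)

# Then split the remaining 30% in half: roughly 15% validation and 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, shuffle=True, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting in two stages keeps the test set untouched until the very end, while the validation set is free to guide modeling decisions.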
Analyzing the Data
The Logit Function
Given that the above model was sufficiently accurate to draw conclusions from, I used the intercept and coefficients of the logit function, the linear form of the sigmoid logistic function, to see how each group contributed to the outcome of its district’s choice of candidate. Each coefficient was then converted to a probability for ease of interpretation. The process and these data are shown below:
import math
import pandas as pd

# Pull the fitted estimator out of the pipeline and grab its coefficients.
lr_model = lr.named_steps['logisticregression']
lr_coef = list(lr_model.coef_[0])

# `features` is the list of feature names, in the same column order as X_train.
lr_coef_data = {'Feature': features, 'Coefficients': lr_coef}
lr_coefficients = pd.DataFrame(lr_coef_data)
lr_coefficients = lr_coefficients.sort_values(by='Coefficients',
                                              ascending=False)

def log_odds_to_prob(coefficient):
    # Inverse logit (sigmoid): e^x / (1 + e^x)
    numerator = math.e ** coefficient
    denominator = 1 + numerator
    return numerator / denominator

probabilities = []
for coefficient in lr_coefficients.Coefficients:
    probabilities.append(log_odds_to_prob(coefficient))

lr_coefficients['Probabilities'] = probabilities
lr_coefficients
The sign of each coefficient represents the direction in which that feature pushes the vote: positive values indicate a benefit for the Republicans and negative values a benefit for the Democrats. I read the probabilities as the likelihood that a district composed entirely of the selected demographic would have a Republican representative. Because the features were standardized before fitting, these values are best treated as rough indicators rather than exact likelihoods. In the case of the voter turnout category (assuming that is the correct interpretation of the feature), the probability indicates the likelihood of a randomly selected district anywhere in the nation electing a Republican candidate given 100% voter turnout.
Given that only data from a single year, 2016, are being examined here, it would be prudent not to jump to any hasty conclusions. Even with 100% voter turnout, a Republican candidate would likely still stand a strong chance of winning, given that the probability is close to 0.5. Had we analyzed data over a longer time span, I would expect the value to come even closer to 0.5. Since the intercept of the logit function was 0.11225385011865346, or 0.5280340308125522 when expressed as a probability, it is close enough to even odds that it can be ignored, and the above interpretation remains valid even when examining the year 2016 alone.
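To make the log-odds conversion concrete, the sketch below first checks the intercept arithmetic quoted above, then fits the same kind of pipeline on synthetic data to show what a coefficient means once StandardScaler is involved. The data, feature means, and noise levels are all assumptions of mine; only the intercept value comes from the model discussed here.

```python
import math

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def log_odds_to_prob(log_odds):
    """Inverse logit (sigmoid): maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# The intercept reported above, converted to a probability.
print(round(log_odds_to_prob(0.11225385011865346), 4))  # 0.528

# Synthetic data (assumption): two features, label driven mostly by the first.
rng = np.random.default_rng(0)
X = rng.normal(loc=[30.0, 10.0], scale=[12.0, 4.0], size=(500, 2))
y = (X[:, 0] + rng.normal(scale=10.0, size=500) > 30.0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
scaler = pipe.named_steps["standardscaler"]
model = pipe.named_steps["logisticregression"]
intercept, coef = model.intercept_[0], model.coef_[0]

# A row at the mean of every feature becomes all zeros after scaling,
# so its predicted probability is sigmoid(intercept).
mean_row = scaler.mean_.reshape(1, -1)
p_mean = pipe.predict_proba(mean_row)[0, 1]

# Raising feature 0 by one standard deviation adds coef[0] to the log-odds.
shifted = mean_row.copy()
shifted[0, 0] += scaler.scale_[0]
p_shifted = pipe.predict_proba(shifted)[0, 1]

print(np.isclose(p_mean, log_odds_to_prob(intercept)))
print(np.isclose(p_shifted, log_odds_to_prob(intercept + coef[0])))
```

In other words, with standardized inputs the intercept describes an "average" district, and each coefficient describes the shift in log-odds for a district one standard deviation above the mean on that feature.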
What the data do tell us, however, can be extremely insightful. Being mindful of the fact that the coefficients tell us the relationship between the percentage of each ethno-racial group and an elected official’s party affiliation, we find that states where the percentages of Whites and Native Americans are highest tend to elect Republican representatives. This does not indicate that the latter group, being a minority, actually votes Republican; rather, it tells us that the states with the highest per capita American Indian populations tend to be in the West. An example of this would be Arizona, a red state which also happens to be home to the Navajo Nation. Another example would be Alaska, as shown in the graph below. Nor does this finding indicate that the former group predominantly votes Republican; rather, those Whites who live in states in which a larger proportion of the population is White do.
Note also the magnitude to which the percentage of Pacific Islanders seems to influence the direction of the vote. This is likely due to the fact that Hawaii is the only state with a Pacific Islander population greater than 5%, and Hawaii just happens to be a Democratic-run state.
In my opinion, quite possibly the most fascinating finding here is the indication that states with higher percentages of people who choose to check the “Other” box on the Census tend to elect Democratic candidates. This may simply be because coastal areas and large cities tend to have more diverse populations, or it may be something more curious.
Permutation Importances
When we calculate the permutation importances, another statistical tool that examines the effect each feature has on the target, the magnitude of the Other category’s effect seems to dwarf the other categories’ influence.
The permutation importance algorithm calculates a weight for each feature’s contribution to the target variable. Unlike the logit function’s coefficients, whose sign indicates the direction of an effect, these weights indicate the magnitude of each contribution rather than its direction. Taking these weights into our analysis, the contributions of all categories except Other, Multiple Races, Pacific Islanders, and Native Americans can effectively be discounted, with Other being the only definite contributor to the net result. This lends support to the diversity theory described above.
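For anyone who wants to try this themselves, here is a minimal sketch of permutation importance using scikit-learn’s `permutation_importance` as a stand-in; it is not necessarily the implementation used for the weights discussed above. The data are synthetic, with only the first of three features carrying any signal, and the feature names are made up.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only the first of three features carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=600) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in accuracy.
result = permutation_importance(
    model, X_val, y_val, scoring="accuracy", n_repeats=20, random_state=0
)

names = ["signal", "noise_a", "noise_b"]  # hypothetical feature names
for name, mean, std in zip(names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A feature whose importance is indistinguishable from zero, given its spread across repeats, is one the model could lose without getting noticeably worse, which is the sense in which contributions are "discounted" above.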
GitHub Repo: Party-Affiliation-Model