The DAP Journey: Python analysis of gerrymandering

The effects of shifting election boundaries

Published in SMUBIA

May 10, 2019

In this Medium series, BIA shares the reflections of our Data Associates as they look back on their academic exploration. This post features an analytics project on gerrymandering, directed by Jun Xiang, Muskaan, Zexel and Sherman.

Introduction

We are Team B Humble, a four-member group from SMU’s Business Intelligence Analytics club. Throughout our second semester in SMU, the team tried to figure out how gerrymandering could be conducted in Singapore. We chose this topic because there has been speculation that the Singapore government carries out gerrymandering to influence election results, and we wanted to see for ourselves whether the data supports it.

Gerrymandering

For a quick definition, gerrymandering is the practice of manipulating district boundaries to give a particular party or group a political advantage. A simple diagram to the left explains the concept clearly. It is usually illegal, but loopholes allow governments to keep practicing it.
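To make the idea concrete, here is a small toy sketch. The voters, parties and district plans are entirely made up for illustration; the point is only that the same set of voters can produce different seat counts depending on how the boundaries are drawn.

```python
from collections import Counter

# 15 hypothetical voters: 6 support party A, 9 support party B.
voters = ["A"] * 6 + ["B"] * 9

# Two made-up ways of splitting the same voters into 3 districts of 5.
# Each district is a list of voter indices.
plan_even = [[0, 1, 6, 7, 8], [2, 3, 9, 10, 11], [4, 5, 12, 13, 14]]
plan_packed = [[0, 1, 2, 6, 7], [3, 4, 5, 8, 9], [10, 11, 12, 13, 14]]


def seats(voters, plan):
    """Count how many districts each party wins under a given plan."""
    winners = Counter()
    for district in plan:
        votes = Counter(voters[i] for i in district)
        # Districts have an odd size, so there is always a single winner.
        winners[votes.most_common(1)[0][0]] += 1
    return winners


even_result = seats(voters, plan_even)
packed_result = seats(voters, plan_packed)
print(even_result, packed_result)
```

Under the first plan party B wins every district, while under the second plan party A wins two of the three districts despite having fewer supporters overall.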

Analysis of Household Data

To start off this project, we used Singapore’s 2010 and 2015 household datasets to see what kind of information we could gather from them.

import pandas as pd

# Load the 2010 and 2015 household datasets
df_2010 = pd.read_csv('2010 household data.csv')
df_2015 = pd.read_csv('2015 household data.csv')

# Preview the first few rows of each dataset
df_2010.head()
df_2015.head()

We realized that the dataset does not group towns under their main towns, so we first cleaned the data to assign each town to its main town. After cleaning, we retrieved the list of main towns that exist in both 2010 and 2015 so that we could analyze the demographic changes between the two years.

towns_2010 = set(df_grpby_2010.reset_index()['main_town'].unique())
towns_2015 = set(df_grpby_2015.reset_index()['main_town'].unique())
common_towns = list(towns_2010.intersection(towns_2015))

We then filtered the two data frames down to only the common towns and began tracking the absolute change (the difference in headcount between the two years) as well as the percentage change for each race: Chinese, Malay, Indian and Others.
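The change calculation could be sketched roughly as follows. This is not our original code: the column names, the helper `demographic_changes` and the headcounts are all assumptions made up for illustration, and both frames are assumed to be indexed by main town with one headcount column per race.

```python
import pandas as pd

# Assumed column names, one headcount column per race.
RACES = ["chinese", "malay", "indian", "others"]


def demographic_changes(df_old, df_new, towns):
    """Absolute and percentage change per race, restricted to `towns`."""
    towns = sorted(towns)
    old = df_old.loc[towns, RACES]
    new = df_new.loc[towns, RACES]
    abs_change = new - old                # difference in headcount
    pct_change = abs_change / old * 100   # relative to the earlier year
    return abs_change.add_prefix("abs_").join(pct_change.add_prefix("pct_"))


# Made-up headcounts, indexed by (hypothetical) main town.
grp_2010 = pd.DataFrame(
    {"chinese": [1000, 2000], "malay": [200, 400],
     "indian": [100, 300], "others": [50, 80]},
    index=["Town A", "Town B"],
)
grp_2015 = pd.DataFrame(
    {"chinese": [1100, 1900, 500], "malay": [220, 400, 90],
     "indian": [90, 330, 40], "others": [55, 100, 10]},
    index=["Town A", "Town B", "Town C"],
)

# Only towns present in both years can be compared.
common = set(grp_2010.index) & set(grp_2015.index)
changes = demographic_changes(grp_2010, grp_2015, common)
print(changes)
```

Town C appears only in the 2015 frame, so it drops out of the comparison.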

We felt that these results alone weren’t providing any insights, so we decided to use previous years’ election results to complement these findings.

Web Scraping

What we used:

BeautifulSoup4
requests

To determine which constituency each main town belongs to, we scraped Wikipedia to build a dictionary keyed by GRC (Group Representation Constituency) or SMC (Single Member Constituency), with each value being a list of main towns.
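Once such a dictionary exists, looking up a town's constituency is easier with the mapping inverted. The snippet below is a minimal sketch of that step; the constituency and town names are placeholders rather than scraped data, and `invert_mapping` is a hypothetical helper name.

```python
# Placeholder excerpt of the scraped structure: constituency -> main towns.
constituency_towns = {
    "Example GRC": ["Town A", "Town B"],
    "Example SMC": ["Town C"],
    "Another GRC": ["Town B"],
}


def invert_mapping(constituency_towns):
    """Build town -> list of constituencies from constituency -> towns."""
    town_to_constituencies = {}
    for constituency, towns in constituency_towns.items():
        for town in towns:
            # A town may be split across more than one constituency.
            town_to_constituencies.setdefault(town, []).append(constituency)
    return town_to_constituencies


town_to_constituencies = invert_mapping(constituency_towns)
print(town_to_constituencies)
```

Keeping a list per town, rather than a single value, avoids silently dropping a constituency when a town straddles a boundary.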

From our election statistics file, we calculated the voting percentage change of the winning party, comparing results from 2015 to 2010, then merged it together with the dataframe of demographic changes of each race.
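A rough sketch of this step is shown below. The column names and figures are invented for illustration, and the change is assumed here to be a simple percentage-point difference (2015 share minus 2010 share), which may differ from how the original calculation was defined.

```python
import pandas as pd

# Made-up vote shares of the winning party per constituency.
results = pd.DataFrame({
    "constituency": ["Example GRC", "Example SMC"],
    "vote_pct_2010": [60.0, 55.0],
    "vote_pct_2015": [66.0, 52.0],
})
# Assumption: change measured in percentage points.
results["vote_pct_change"] = results["vote_pct_2015"] - results["vote_pct_2010"]

# Made-up demographic changes, already aggregated per constituency.
demo = pd.DataFrame({
    "constituency": ["Example GRC", "Example SMC"],
    "pct_change_chinese": [2.5, -1.0],
})

# Keep only constituencies present in both tables.
merged = demo.merge(
    results[["constituency", "vote_pct_change"]],
    on="constituency",
    how="inner",
)
print(merged)
```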

We then used statsmodels.api to calculate the Ordinary Least Squares results (OLS is a regression technique that estimates the unknown parameters of a linear model). The best r-squared value was 0.268, obtained when we fitted change_in_chinese and vote_count. The others were 0.199, 0.192 and 0.235 for Malay, Indian and Others respectively.
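For readers unfamiliar with r-squared, the sketch below computes it by hand for a one-predictor least-squares fit, using made-up numbers rather than our data. It is a dependency-free illustration, not the statsmodels call itself; in statsmodels the same quantity is available as the `rsquared` attribute of a fitted `sm.OLS` result.

```python
def r_squared(x, y):
    """R-squared of a simple least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares versus total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data: a perfect linear relationship and a noisier one.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
noisy = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
print(perfect, noisy)
```

A value near 1 means the fitted line accounts for most of the variation in the response, while a value near 0 means it accounts for very little.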

Although the r-squared values for all four races are small, we concluded that they still offer some insight: if the government were to change the election boundaries, then depending on which race is more supportive of the government, there could be an impact on the electoral votes and outcomes.

As this project explores the effects of gerrymandering using limited datasets, no citizen should feel discouraged from making his or her decision wisely; every vote still counts.