Exploring Toronto Voter Statistics using Golang

Published in Open Data Toronto, Feb 6, 2020 (9 min read)

Original story by Yizhao Tan, republished with permission.

October marks the official start of the Canadian federal election campaign. As such, I was interested in exploring the Toronto Voter Statistics dataset published by Open Data Toronto. Although the data come from Toronto’s municipal elections rather than federal ones, it seemed like an interesting dataset that might offer some insights into how people in Toronto vote.

Low voter turnout has been cited as one of the major reasons why Trump won the 2016 US presidential election. While Canada’s own 2015 federal election saw the highest turnout in decades, with an average turnout of 68.49% across the country, I was curious to see if this trend was also reflected at the local level in Canada’s largest municipality.

Unlike the Toronto Bikeshare Ridership dataset I previously explored, this dataset is well structured and easy to work with. Instead of using Python as usual, I decided to use this dataset as an opportunity to explore Go’s capabilities for data processing and visualization.

Go has been on my radar for some time now. Although Go is not known for its data-related functions, packages such as gophernotes and gonum are making data-related work more approachable. Also, posts like this one are making convincing arguments for why Go might be a better choice than Python for data-heavy applications.

About the Data

Data was provided for the past 5 elections, from 2003–2018. However, because the number of wards changed from 44 to 25 in 2018, I decided to keep things simple and focus only on the data from the 2018 election.

The data provided has been aggregated by the ward and voting station and provides information on:

Ward and voting station number
Voting station location name and address
Count of voters on the voting list, modifications to the voter list, and count of final eligible voters
Number of voters that voted
Count of voters by school support
Count of rejected and declined ballots

I was interested in geocoding the data based on voting station addresses and mapping historical voting data to the new map with 25 wards. This would enable me to track how voting trends have changed across the years. For now, I decided this would be out of scope for the purpose of exploring Go.

If you would like to follow along, the data and the notebooks I used can be found here on GitHub.

Preparations

Before diving into the analysis, I identified a number of research questions I was interested in answering:

1. What is the distribution of turnouts by ward? Are there any patterns when comparing wards with high vs. low turnout?
2. How has the ward change impacted the turnouts?
3. How have the number of registered voters and turnouts changed over the years?
4. Does the ease of access to a voting station have an impact on turnouts?
5. Are there any patterns in which voting stations have higher vs. lower turnouts?

Questions 2, 3, and 4 require comparison with previous election years or geocoding and therefore they are out of scope for this analysis.

I defined a number of functions that will be used later in the analysis:
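
The author’s actual helper functions are in the linked notebooks. As a minimal sketch of what helpers for this kind of analysis might look like, the two functions below compute a turnout percentage and an average. The names `turnoutPct` and `mean` and the sample counts are assumptions for illustration only.

```go
package main

import "fmt"

// turnoutPct returns the number of people who voted as a percentage of
// eligible electors. It returns 0 when there are no eligible electors,
// which avoids a division by zero.
func turnoutPct(voted, eligible int) float64 {
	if eligible <= 0 {
		return 0
	}
	return 100 * float64(voted) / float64(eligible)
}

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Made-up counts, purely for illustration.
	fmt.Printf("turnout: %.2f%%\n", turnoutPct(450, 1000))
	fmt.Printf("mean:    %.2f\n", mean([]float64{40, 50, 60}))
}
```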

The data itself is fairly well structured and data cleaning was minimal. Mixed in with the station-level rows (aggregated by ward and voting station) were rows that summed the data for each ward. I dropped these subtotal rows since they can easily be recalculated later.

The original data is provided as an Excel sheet which I manually converted to a CSV for easier import. Then I read the data into memory by:
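
A rough sketch of this loading step, using only the standard library, might look like the following. The column order (`ward, station, name, eligible, voted`), the `Station` struct, and the rule that a row with an empty station number is a ward subtotal are all assumptions made for illustration; the real file has more columns and may mark subtotal rows differently.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// Station holds one voting-station row. Field names are hypothetical.
type Station struct {
	Ward     int
	Number   string
	Name     string
	Eligible int
	Voted    int
}

// readStations parses CSV rows in the assumed column order
// ward, station, name, eligible, voted. The first row is treated as a
// header, and rows with an empty station number are assumed to be
// per-ward subtotals and are skipped.
func readStations(r io.Reader) ([]Station, error) {
	rows, err := csv.NewReader(r).ReadAll()
	if err != nil {
		return nil, err
	}
	var stations []Station
	for i, row := range rows {
		if i == 0 {
			continue // skip the header row
		}
		if len(row) < 5 {
			return nil, fmt.Errorf("row %d: expected 5 columns, got %d", i+1, len(row))
		}
		number := strings.TrimSpace(row[1])
		if number == "" {
			continue // assumed ward subtotal row
		}
		// Every numeric column has to be converted explicitly.
		ward, err := strconv.Atoi(strings.TrimSpace(row[0]))
		if err != nil {
			return nil, fmt.Errorf("row %d: ward: %w", i+1, err)
		}
		eligible, err := strconv.Atoi(strings.TrimSpace(row[3]))
		if err != nil {
			return nil, fmt.Errorf("row %d: eligible: %w", i+1, err)
		}
		voted, err := strconv.Atoi(strings.TrimSpace(row[4]))
		if err != nil {
			return nil, fmt.Errorf("row %d: voted: %w", i+1, err)
		}
		stations = append(stations, Station{
			Ward:     ward,
			Number:   number,
			Name:     strings.TrimSpace(row[2]),
			Eligible: eligible,
			Voted:    voted,
		})
	}
	return stations, nil
}

// readStationsFromString is a small convenience wrapper for in-memory data.
func readStationsFromString(s string) ([]Station, error) {
	return readStations(strings.NewReader(s))
}

// sampleCSV is made-up data in the assumed layout.
const sampleCSV = `ward,station,name,eligible,voted
1,001,Community Centre,1200,480
1,,Ward 1 Total,1200,480
2,001,Public School,900,450
`

func main() {
	stations, err := readStationsFromString(sampleCSV)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range stations {
		fmt.Printf("%+v\n", s)
	}
}
```

With a real file, `readStations` could be passed the handle returned by `os.Open` instead of an in-memory string.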

Turnouts Distribution

The gonum plot library makes visualization fairly straightforward. I visualized ward vs. turnouts by:
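
The plotting code itself is in the linked notebooks. Before any chart can be drawn, the station rows have to be rolled up into one turnout figure per ward, and that step can be sketched with the standard library alone. The `Station` and `WardTurnout` types and the sample numbers below are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Station is a hypothetical station-level record.
type Station struct {
	Ward     int
	Eligible int
	Voted    int
}

// WardTurnout is the turnout percentage for one ward.
type WardTurnout struct {
	Ward int
	Pct  float64
}

// turnoutByWard sums eligible electors and votes per ward, then returns
// the turnout percentage for each ward, sorted by ward number.
func turnoutByWard(stations []Station) []WardTurnout {
	eligible := map[int]int{}
	voted := map[int]int{}
	for _, s := range stations {
		eligible[s.Ward] += s.Eligible
		voted[s.Ward] += s.Voted
	}
	out := make([]WardTurnout, 0, len(eligible))
	for ward, e := range eligible {
		pct := 0.0
		if e > 0 {
			pct = 100 * float64(voted[ward]) / float64(e)
		}
		out = append(out, WardTurnout{Ward: ward, Pct: pct})
	}
	// Map iteration order is random in Go, so sort for a stable result.
	sort.Slice(out, func(i, j int) bool { return out[i].Ward < out[j].Ward })
	return out
}

func main() {
	// Made-up counts, purely for illustration.
	stations := []Station{
		{Ward: 1, Eligible: 1000, Voted: 400},
		{Ward: 1, Eligible: 1000, Voted: 500},
		{Ward: 2, Eligible: 500, Voted: 100},
	}
	for _, w := range turnoutByWard(stations) {
		fmt.Printf("Ward %d: %.2f%%\n", w.Ward, w.Pct)
	}
}
```

The resulting slice is in ward order, so its `Pct` values could be handed to a bar chart as-is.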

A few things stood out to me. First, the turnout for Toronto’s 2018 municipal election was much lower than for the 2015 federal election. The average turnout across all 25 wards was 40.67%, about 28 percentage points lower than the federal turnout in 2015.

There was also a big range between the wards with the highest and the lowest turnouts. Ward 14 had the highest turnout at 49.22%, while Ward 23 had the lowest turnout at 34.05%.

A quick visual comparison showed that the 5 wards with the highest turnouts (Wards 14, 15, 12, 19, and 4) are all located near the downtown core. The 5 wards with the lowest turnouts (Wards 23, 7, 10, 1, and 21) are mostly farther out in the suburbs, with Ward 10 as the exception.

https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/ward-profiles/

Visual comparison on a map felt unreliable, so I extracted a portion of the data from the Ward Profiles dataset for more systematic analysis and focused on the impact of education and income on voter turnout. The extracted data file I used is also available on GitHub.

The dataset presents the most recent census data by ward. Since the census data was collected in 2016, I assumed that no major demographic changes occurred between 2016 and the election. I specifically focused on two factors:

% of the population with post-secondary or higher education
% of the population living in a household with a total income above $80,000 (the 2016 median household income was $78,373)

The Pearson correlation coefficients were calculated to be 0.57 and 0.55 for education and income, respectively. Based on this, it seems fair to say that both education and income are positively associated with turnout. The correlations are only moderate, most likely because other factors also affect how likely voters are to vote.
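
For reference, the Pearson coefficient can be computed in a few lines of standard-library Go. This is a minimal sketch rather than the notebook’s code; the function name `pearson` and the sample values are assumptions, and the sample values are not the ward data.

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient
//
//	r = sum((x-mx)*(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
//
// It returns NaN when the slices differ in length, hold fewer than two
// points, or when either variable has zero variance.
func pearson(xs, ys []float64) float64 {
	n := len(xs)
	if n != len(ys) || n < 2 {
		return math.NaN()
	}
	var sumX, sumY float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
	}
	meanX, meanY := sumX/float64(n), sumY/float64(n)

	var cov, varX, varY float64
	for i := range xs {
		dx, dy := xs[i]-meanX, ys[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	if varX == 0 || varY == 0 {
		return math.NaN()
	}
	return cov / math.Sqrt(varX*varY)
}

func main() {
	// Made-up values: share of post-secondary education vs. turnout.
	education := []float64{55, 60, 70, 78}
	turnout := []float64{36, 41, 40, 47}
	fmt.Printf("r = %.2f\n", pearson(education, turnout))
}
```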

Ward 10, with a 34.85% turnout, stood out both during the initial review on the ward map and during the census data analysis. Not only is this ward located in the downtown core (where turnouts are usually high), but 78% of the people living in this ward also had a post-secondary education. I would be interested in including age as an additional factor in future analysis to see if it may have contributed to Ward 10’s unusually low turnout.

Voting Station Trends

Next, I broke down the turnouts further by visualizing the turnout at each station as a box and whisker plot.

The advance voting stations didn’t have a fixed eligible elector count, so I removed these stations from this visualization.

I was surprised to see there are stations with 100% and 0% turnout. I reviewed the data to try to understand where these outliers occurred.

Although there were no distinct trends among the stations with 0% turnout, I noticed that many of the stations with 100% turnout were long-term care facilities (i.e., retirement homes, veteran care locations, etc.). I identified a list of keywords for these facilities and counted the number of outlier stations that contained these keywords in their name.
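
The keyword count described above can be sketched as a case-insensitive substring match. The keyword list and station names below are hypothetical, since the author’s actual list is not given here.

```go
package main

import (
	"fmt"
	"strings"
)

// careKeywords is a hypothetical keyword list for long-term care facilities.
var careKeywords = []string{"retirement", "long term care", "nursing", "veterans", "seniors"}

// containsKeyword reports whether name contains any keyword, ignoring case.
func containsKeyword(name string, keywords []string) bool {
	lower := strings.ToLower(name)
	for _, k := range keywords {
		if strings.Contains(lower, strings.ToLower(k)) {
			return true
		}
	}
	return false
}

// countMatches returns how many names contain at least one keyword.
func countMatches(names, keywords []string) int {
	count := 0
	for _, n := range names {
		if containsKeyword(n, keywords) {
			count++
		}
	}
	return count
}

func main() {
	// Made-up station names, purely for illustration.
	names := []string{
		"Sunny Acres Retirement Home",
		"Maple Public School",
		"Veterans Care Centre",
	}
	fmt.Println(countMatches(names, careKeywords), "of", len(names), "stations match")
}
```

Plain substring matching is simple but can over-match or under-match, so the matched names are worth reviewing by eye.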

Instead of focusing only on the stations with 100% turnout, I examined all stations with turnouts greater than or equal to the 91st percentile for their ward. Out of these 164 voting stations, only 34 (about 21%) contained one of the keywords. These make up only a small share of all outlier stations.
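
One way to select stations at or above a per-ward percentile is sketched below. It assumes the nearest-rank definition of a percentile; the notebook may have used a different definition (for example, one with interpolation), which would shift the cutoffs slightly. The `Station` type and the sample turnouts are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Station is a hypothetical record holding a station's turnout percentage.
type Station struct {
	Ward    int
	Name    string
	Turnout float64
}

// percentile returns the p-th percentile of values using the nearest-rank
// method: the value at rank ceil(p/100 * n) in ascending order.
// It returns NaN for an empty slice.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), values...) // copy so the input is not reordered
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// highTurnout returns the stations whose turnout is greater than or equal
// to the p-th percentile of turnouts within their own ward.
func highTurnout(stations []Station, p float64) []Station {
	byWard := map[int][]float64{}
	for _, s := range stations {
		byWard[s.Ward] = append(byWard[s.Ward], s.Turnout)
	}
	cutoff := map[int]float64{}
	for ward, turnouts := range byWard {
		cutoff[ward] = percentile(turnouts, p)
	}
	var out []Station
	for _, s := range stations {
		if s.Turnout >= cutoff[s.Ward] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Made-up turnouts, purely for illustration.
	stations := []Station{
		{1, "A", 30}, {1, "B", 40}, {1, "C", 100},
		{2, "D", 20}, {2, "E", 60},
	}
	for _, s := range highTurnout(stations, 91) {
		fmt.Printf("Ward %d, station %s: %.0f%%\n", s.Ward, s.Name, s.Turnout)
	}
}
```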

Next, I calculated that each station was expecting about 1,106 voters on average, while the stations with high turnouts (at or above the 91st percentile) were expecting only about 94 voters. Since these voting stations were expecting only a handful of voters, it was easy to reach a high turnout at these locations. This is the likely explanation for why these stations reached an unusually high turnout, rather than differences in the voters’ demographics at these stations.

Thoughts on Go

Two major things I immediately noticed with Go:

It is verbose. I always needed to explicitly state what to expect, whether I was defining the expected inputs of the data from the CSV or how numeric strings are parsed into numeric types.
There are fewer pre-built functions for data. Pandas and numpy may have spoiled me with easy data manipulation. And although the gonum plot functions were straightforward and simple to use, creating visualizations was also much more limited in Go than in Python (e.g., for creating bar graphs, I needed to explicitly state the distance between each bar to avoid bars overlapping).

Both of these things contributed to a much longer time spent working on the code. I am used to a workflow where I can load the data directly into memory (as a pandas DataFrame), explore the data, and then decide on the next steps. However, with Go, the workflow reverses. I needed to know what the data contains, what I plan to do with the data (so I can define the right structure), and then finally load the data.

In general, exploring data with Go felt clunky and slow even though gophernotes made it possible to run snippets of code in Jupyter notebooks.

But that is not to say there is no place for Go in the world of data science. Libraries like pandas make assumptions. For example, pandas by default stores an integer column containing NaN as float rather than int. While these assumptions are almost always correct, they are not guaranteed to be. Since Go is verbose and developers often define the functions themselves, Go applications tend to run in a more controlled manner, which can make them more robust and reliable in production.
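
To make the contrast concrete, here is a small sketch of how a possibly missing integer field could be handled explicitly in Go, so that nothing is silently converted to a float. The `OptionalInt` type and `parseOptionalInt` function are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// OptionalInt is an integer that may be missing.
// Valid is false when the source field was empty.
type OptionalInt struct {
	Value int
	Valid bool
}

// parseOptionalInt converts a CSV field to an OptionalInt. An empty field is
// treated as missing (not as 0 and not as NaN), and anything that is not a
// whole number is reported as an error.
func parseOptionalInt(field string) (OptionalInt, error) {
	field = strings.TrimSpace(field)
	if field == "" {
		return OptionalInt{}, nil
	}
	n, err := strconv.Atoi(field)
	if err != nil {
		return OptionalInt{}, fmt.Errorf("parse %q: %w", field, err)
	}
	return OptionalInt{Value: n, Valid: true}, nil
}

func main() {
	for _, field := range []string{"42", "", "4.5"} {
		v, err := parseOptionalInt(field)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case !v.Valid:
			fmt.Println("missing value")
		default:
			fmt.Println("value:", v.Value)
		}
	}
}
```

The trade-off is the one described above: the caller has to decide up front what a missing or malformed value means, instead of letting the library pick a representation.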

Overall, I don’t think I would use Go for another project where the main purpose is to explore the data. But for any application where consistency is important (e.g., data pipelines, backends for dashboards, etc.), I would definitely consider using Go over Python.

Next Steps and Conclusion

I would be interested in revisiting this dataset in the future using Python with a more geospatial approach. A potential future analysis could include:

Geocode voting data from all available elections and map the historical data to current neighbourhood boundaries. Then track the change of turnout for each neighbourhood over time
Join the data with the neighbourhood profile dataset and analyze the relationship of demographics and turnouts on more dimensions than education and income

I hope that this data story outlines a few interesting insights into how people are voting in Toronto and points the way to more in-depth (and interesting) analysis that you can perform on your own. If you have any questions, feel free to contact the Open Data team at opendata@toronto.ca or me directly at Yizhao.Tan@toronto.ca.