How To Clean Brexit Election Data with Python

Tom Sullivan
Published in Código Ecuador · 4 min read · Dec 22, 2019

“Get Brexit Done” — Boris Johnson, British Prime Minister

Please note: This is the first in a series of articles where we will be cleaning and analyzing the 2019 U.K. election data published at the British Electoral Commission’s website.

[Photo: a wooden bench dedicated to Robert Slater (1929–2015), with a white “Polling Station” sign attached.]

Data scientists say that they spend most of their time cleaning data. In this lesson, we will take raw British election data and turn it into a series of cleaned files ready for analysis.

The votes are in and the British people have decided, rightly or wrongly, to (in the words of Prime Minister Boris Johnson) “get Brexit done.” As data scientists, we can objectively study how voters decided to get Brexit done by looking at U.K. general election data, regardless of our opinion on the issue.

We should note that senior advisor Dominic Cummings has made basic research a priority in the United Kingdom. Data scientists use Python for much of this research.

It takes about twenty days after a U.K. general election for the Electoral Commission to publish the data, so the 2019 results are not yet available. Let’s create an adaptable template using the 2017 data that we can then reuse to clean the 2019 election data.
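One way to keep the workflow reusable is to wrap the loading step in a small function parameterized by file name and encoding; the function name and defaults below are my own sketch, not part of the Electoral Commission’s data:

```python
import csv

def load_raw_results(path, encoding='cp1252'):
    """Read a raw Electoral Commission CSV and return its rows as lists of strings."""
    with open(path, encoding=encoding, newline='') as f:
        return [row for row in csv.reader(f)]

# The same loader works for the 2017 file now and the 2019 file later:
# rows = load_raw_results('2017_UKPGE_electoral_data.csv')
```

When the 2019 file arrives, only the path (and, if the Commission changes it, the encoding) needs to change.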

Step 1: Download the raw data, examine the first few lines

The election data is available from the U.K. Electoral Commission’s website. Download and open the CSV file containing the election results.

Important note: the CSV file is encoded as cp1252, not UTF-8. Pass the encoding argument to open() or Python may raise a UnicodeDecodeError when it hits a non-ASCII byte. Let’s go ahead and print the first few rows to see what we have.

import csv

with open('2017_UKPGE_electoral_data.csv', encoding='cp1252') as uk:
    uk_elect_reader = csv.reader(uk)
    result = [i for i in uk_elect_reader]

print(result[0:3])

Here are the first few rows, showing us what we have to do and the data’s format:

[['RESULTS', '', '', '', '', '', '', ''], 
['ONS Code', 'PANO', 'Constituency', 'Surname', 'First name', 'Party', 'Party Identifer', 'Valid votes'],
['E14000530', '7', 'Aldershot', 'WALLACE', 'Donna Maria', 'Green Party', 'Green Party', '1090']]

The first row contains only the word 'RESULTS' followed by empty strings, and we do not need it. The second contains the headers. The third is the first row of the data. Take note of the format: every value, including the vote counts, is a string.

Step 2: Remove unnecessary rows

Delete the first row with pop, because it is not useful. It contains just the word “RESULTS,” and the actual headers we want are on the second row. Note that pop returns the removed row, not the list, so do not reassign its result back to the list:

result.pop(0)

Step 3: Save the cleaned data

Create a new file. Write each row into the newly created file. The headers will now be the first line.

with open('clean_data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for line in result:
        writer.writerow(line)

Once you finish, check the clean_data file before moving forward.
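A quick way to check is to reopen the file and look at its first rows; the helper below is a sketch of my own, not part of the workflow so far:

```python
import csv

def preview_csv(path, n=2):
    """Print and return the first n rows of a CSV file."""
    with open(path, newline='') as f:
        rows = [row for row in csv.reader(f)][:n]
    for row in rows:
        print(row)
    return rows

# preview_csv('clean_data.csv')  # the first row should now start with 'ONS Code'
```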

Step 4: Load the cleaned CSV file with DictReader

Let’s load the data back up with the csv module’s DictReader class. DictReader loops over the rows in our election data file and turns each one into a dictionary, using the column names from the header row as the dictionary keys.

with open('clean_data.csv') as uk:
    uk_elect_reader = csv.DictReader(uk)
    election_results = [row for row in uk_elect_reader]

The list comprehension stores the rows in a variable named election_results. This is the most head-scratching part: election_results is a list of OrderedDicts rather than plain dictionaries. Up through Python 3.7, DictReader yields OrderedDicts, which behave like normal dictionaries but remember the order in which their keys were added (from Python 3.8 onward, it yields regular dicts, which also preserve insertion order).

>>> election_results[0]
OrderedDict([('ONS Code', 'E14000530'),
             ('PANO', '7'),
             ('Constituency', 'Aldershot'),
             ('Surname', 'WALLACE'),
             ('First name', 'Donna Maria'),
             ('Party', 'Green Party'),
             ('Party Identifer', 'Green Party'),
             ('Valid votes', '1090')])
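To see the behavior in isolation, here is a minimal comparison using made-up keys from our data: an OrderedDict compares equal to a plain dict with the same items, and supports the same lookups, but also guarantees its keys come back in insertion order.

```python
from collections import OrderedDict

od = OrderedDict([('Constituency', 'Aldershot'), ('Valid votes', '1090')])
plain = {'Constituency': 'Aldershot', 'Valid votes': '1090'}

print(od == plain)        # True: same keys and values
print(list(od.keys()))    # keys come back in insertion order
print(od['Valid votes'])  # lookups work exactly like a normal dict
```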

Step 5: Write the results to a file

Now, use the DictWriter to write the results into a file. After creating the file with the open command using the name election_clean.csv, we will create a fieldnames variable to store a list of headers. We can then store a DictWriter object under the name writer using the fieldnames variable as an argument in the DictWriter method.

To write the headers to the file, we call the writer object’s writeheader method. Lastly, to transfer the clean data from the election_results variable created above, we loop over the rows and call the writer object’s writerow method on each one.

with open('election_clean.csv', 'w', newline='') as csvfile:

    # Create column headers for the DictWriter
    fieldnames = ['ONS Code', 'PANO', 'Constituency', 'Surname',
                  'First name', 'Party', 'Party Identifer', 'Valid votes']

    # Create a DictWriter object called writer
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    # Write the header to the file
    writer.writeheader()

    for row in election_results:
        writer.writerow(row)

Step 6: Use dataclasses to merge redundant field names

Some of the fields are redundant. In this step, you will learn how to merge the first name and surname into a single name field using a dataclass, and keep only the ‘Party Identifer’ column (spelled as it appears in the raw data) as the party value.

from dataclasses import dataclass

@dataclass
class Candidate:
    ons_code: str      # U.K. Office for National Statistics code
    pano: int          # Press Association number
    constituency: str  # constituency name
    name: str          # candidate's full name
    party: str         # a unique value for each party
    votes: int         # number of valid votes

res = [Candidate(name='{0} {1}'.format(i['First name'], i['Surname']),
                 ons_code=i['ONS Code'], pano=int(i['PANO']),
                 constituency=i['Constituency'], party=i['Party Identifer'],
                 votes=int(i['Valid votes'])) for i in election_results]
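As a taste of the processing these objects make easy, here is a self-contained sketch that totals valid votes per party with collections.Counter. The sample rows below are made up for illustration; in practice you would iterate over the res list built above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Candidate:
    ons_code: str
    pano: int
    constituency: str
    name: str
    party: str
    votes: int

# Made-up sample rows; in practice, use the res list built above.
sample = [
    Candidate('E14000530', 7, 'Aldershot', 'Donna Maria Wallace', 'Green Party', 1090),
    Candidate('E14000530', 7, 'Aldershot', 'A. N. Other', 'Labour', 15000),
    Candidate('E14000531', 8, 'Aldridge-Brownhills', 'J. Smith', 'Labour', 12000),
]

votes_by_party = Counter()
for c in sample:
    votes_by_party[c.party] += c.votes

print(votes_by_party.most_common())
```

Because the votes field is now an int rather than a string, arithmetic like this works without further conversion.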

We now have a clean CSV file of election data and a custom data class with which to process it. We will continue processing this British election data in upcoming articles.
