Web Scraping Election Results of PRU-15 (GE-15) using Python

elvinado
4 min read, Dec 4, 2022

Disclaimer: This is not political. Just a data analysis for fun.

Introduction

In an ideal world, all important information would already be assembled and curated in a neat, orderly fashion. In reality it is not, so the skill of collecting and presenting data is useful. Here, I show how to answer some questions whose answers were not readily available at the time. For this article, I was interested in the results of the recent Malaysian General Election (PRU-15).

Main Objective: To illustrate the steps of web scraping and the cleaning of messy real-world data.

Question 1: Who are the elected Members of Parliament during PRU-15?

Question 2: How many votes do the winners get?

Strategy: 1. Get the data; 2. Data cleaning; 3. Visualisation

Tools: Python, Selenium, Pandas, Regex, & Matplotlib

Get the data

Berita Harian’s election result

I found that a Berita Harian article provides the most complete results on a single page. Because the page is rendered with JavaScript, a plain Python “requests” call does not return the content, so we need Selenium here.

Scraping Berita Harian Article

The content of the element with the class ‘dable-content-wrapper’ was extracted and stored in a variable. Pro tip: use the inspect feature of your web browser to find the class name or HTML section that holds the content you want.
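A minimal sketch of this step could look like the following. It is not the original notebook code: the function names, the headless Chrome setup, and the 20-second timeout are my assumptions, and the article URL is left as a parameter. It assumes Chrome and a recent Selenium 4 are installed.

```python
def fetch_rendered_text(url, class_name="dable-content-wrapper", timeout=20):
    """Open `url` in headless Chrome and return the visible text of the
    first element with the given class, once JavaScript has rendered it."""
    # Imported lazily so that to_lines() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element exists in the DOM instead of sleeping blindly.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, class_name))
        )
        return element.text
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def to_lines(text):
    """Split scraped text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `to_lines(fetch_rendered_text(article_url))` would then give a list of lines ready for cleaning, where `article_url` stands for the address of the results article.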

Data cleaning

First, the text was split into a list of lines for further processing. The variable ‘parliaments’ is a list of tuples, each holding a line number and the text that matches a parliament-number pattern. Similarly, the variables ‘special’ and ‘members’ hold the lines with the winning candidates’ names. Many pattern variations were needed because some winners were reported incorrectly or in a different text format.

Extracting parliaments and winners
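The idea can be sketched with a few regular expressions. The sample lines, the “MENANG” marker, and the exact patterns below are hypothetical, chosen only to show the shape of the approach. The real article text has more variants.

```python
import re

# Hypothetical sample lines; the real scraped text looks different.
lines = [
    "P001 Constituency One",
    "Total voters: 50,000",
    "MENANG: Candidate A (AAA) - 12,345",
    "P002 Constituency Two",
    "Candidate B (BBB) 23,456 MENANG",
]

# A parliament line starts with a code such as P001 or P.001 (assumed format).
PARLIAMENT_RE = re.compile(r"^P\.?\s?\d{3}\b")

# Winner lines come in several variants, so several patterns are tried.
WINNER_PATTERNS = [
    re.compile(r"^MENANG\b"),   # marker at the start of the line
    re.compile(r"\bMENANG$"),   # marker at the end of the line
]


def find_lines(lines, patterns):
    """Return (line_number, line) tuples for lines matching any pattern."""
    return [
        (i, line)
        for i, line in enumerate(lines)
        if any(p.search(line) for p in patterns)
    ]


parliaments = find_lines(lines, [PARLIAMENT_RE])
members = find_lines(lines, WINNER_PATTERNS)
```

Keeping the line number in each tuple makes it possible to pair every winner with the parliament line that precedes it later on.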

The results were then combined into a single sorted pandas data frame. That data frame was split into two, one for parliaments and one for members, and the two were joined again so that each becomes a column of the same data frame.

Combining results into a data frame
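One way to sketch this combine, split, and rejoin step is shown here. The input tuples are made up, the column names are my own, and sorting by line number is an assumption about how the original frame was ordered.

```python
import pandas as pd

# Hypothetical (line_number, text) tuples in the shape produced by the extraction step.
parliaments = [(0, "P001 Constituency One"), (3, "P002 Constituency Two")]
members = [(2, "Candidate A (AAA) 12,345"), (4, "Candidate B (BBB) 23,456")]

rows = [(n, t, "parliament") for n, t in parliaments]
rows += [(n, t, "member") for n, t in members]
combined = pd.DataFrame(rows, columns=["line", "text", "kind"]).sort_values("line")

# Sanity check: after sorting, parliament and member lines should alternate.
kinds = combined["kind"].tolist()
assert kinds[0::2] == ["parliament"] * len(parliaments)
assert kinds[1::2] == ["member"] * len(members)

# Split by kind, reset the index, then place the two series side by side.
parl = combined.loc[combined["kind"] == "parliament", "text"].reset_index(drop=True)
memb = combined.loc[combined["kind"] == "member", "text"].reset_index(drop=True)
paired = pd.DataFrame({"parliament": parl, "member": memb})
```

The alternation check is worth keeping: if a winner line was missed by the patterns, it fails loudly instead of silently pairing a member with the wrong parliament.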

The results were dirty. We want the parliament number, parliament name, member’s name, and votes in different columns.

Tactics used here were:

  1. Split by space
  2. Extract matching patterns
  3. Replace unwanted text with an empty string

Separating into respective columns

The newly extracted columns were combined into a single data frame with appropriate column names.

Combine data frame and rename columns
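The three tactics and the recombination can be sketched together with pandas string methods. The input strings, the trailing word “undi”, and the output column names are hypothetical. The real data needed more special cases than this.

```python
import pandas as pd

# Hypothetical dirty input, one row per parliament.
paired = pd.DataFrame({
    "parliament": ["P001 Constituency One", "P002 Constituency Two"],
    "member": ["Candidate A (AAA) 12,345", "Candidate B (BBB) 23,456 undi"],
})

# 1. Split by space: the first token is the number, the rest is the name.
parl_parts = paired["parliament"].str.split(" ", n=1, expand=True)

# 2. Extract matching patterns: party in parentheses, votes as digits with commas.
party = paired["member"].str.extract(r"\(([^)]+)\)", expand=False)
votes = (
    paired["member"]
    .str.extract(r"(\d[\d,]*)", expand=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# 3. Replace unwanted text with an empty string: what remains is the name.
name = (
    paired["member"]
    .str.replace(r"\([^)]+\)", "", regex=True)
    .str.replace(r"\d[\d,]*", "", regex=True)
    .str.replace("undi", "", regex=False)  # assumed stray word after the count
    .str.strip()
)

# Combine the pieces and give the columns proper names.
clean = pd.concat([parl_parts, name, party, votes], axis=1)
clean.columns = ["parliament_no", "parliament_name", "member", "party", "votes"]
```

Converting the votes to integers early means later sorting and summing work on numbers instead of strings.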

Next, the ‘coalition’ column was generated from each winner’s party, using knowledge from other sources.

Party’s coalition column
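A lookup dictionary is probably the simplest way to express this kind of outside knowledge. The mapping below is a partial, illustrative one and not the one used in the notebook. ‘XYZ’ is a made-up party code, and ‘OTHER’ is my own label for parties missing from the mapping.

```python
import pandas as pd

# Partial, illustrative mapping of party to coalition at the time of PRU-15.
PARTY_TO_COALITION = {
    "PKR": "PH", "DAP": "PH", "AMANAH": "PH",
    "UMNO": "BN", "MCA": "BN", "MIC": "BN",
    "PAS": "PN", "BERSATU": "PN",
}

df = pd.DataFrame({"party": ["PKR", "UMNO", "PAS", "XYZ"]})

# Parties missing from the mapping become NaN, then get a visible fallback label.
df["coalition"] = df["party"].map(PARTY_TO_COALITION).fillna("OTHER")
```

Using an explicit fallback label makes unmapped parties easy to spot and fix, which matters when the party names in the scraped text are spelled inconsistently.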

The final cleaned result can be seen below.

Final cleaned data set

Quick Visualisation

Caveat: these absolute numbers can be misleading.

The top 50 members of parliament with the most votes are shown below. Some MPs are simply very popular, or their constituencies are very populous.

Top 50 MPs with most votes

The distribution of winners’ votes is shown below. Most winners received between 0 and 60,000 votes. A small number received more than 80,000.

Distribution of winners’ votes count

The winners’ total votes by coalition are shown below. This can be misleading because it is NOT the total vote count: the data excludes the votes for losing candidates.

Total winners’ votes count based on the coalition (this is NOT based on total votes)
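The three charts above can be reproduced along these lines. This is a rough sketch on made-up numbers, not the original plotting code. The data, the column names, and the `plot_summary` helper are assumptions, and the toy example uses a top 2 instead of a top 50.

```python
import pandas as pd

# Made-up cleaned data; the real data set has one row per winner.
df = pd.DataFrame({
    "member": ["Candidate A", "Candidate B", "Candidate C"],
    "coalition": ["C1", "C2", "C1"],
    "votes": [30000, 50000, 10000],
})

# Winners with the most votes (the article uses the top 50).
top = df.nlargest(2, "votes")

# Winners' votes summed per coalition. This is NOT the total vote count.
totals = df.groupby("coalition")["votes"].sum().sort_values(ascending=False)


def plot_summary(df, top, totals):
    """Draw the three charts side by side and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].barh(top["member"], top["votes"])
    axes[0].set_title("MPs with most votes")
    axes[1].hist(df["votes"], bins=20)
    axes[1].set_title("Distribution of winners' votes")
    axes[2].bar(totals.index, totals.values)
    axes[2].set_title("Winners' votes by coalition")
    fig.tight_layout()
    return fig
```

Calling `plot_summary(df, top, totals)` returns a matplotlib figure that can be shown or saved with `fig.savefig(...)`.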

Conclusion

Question 1: Who are the elected Members of Parliament during PRU-15?

There are 221 winners and we got all their names, parties, and coalitions.

Question 2: How many votes do the winners get?

It varies a lot. Most winners received fewer than 60,000 votes, while a small number received more than 80,000.

Web scraping is a basic skill for a data professional. An understanding of the data and its context is essential for extracting “correct” information. Without domain knowledge, we may pass wrong or misleading information to the next analyst, who might use it to build a narrative that suits their agenda. Our task, however, is to collect the most accurate data possible.

Challenges

  1. You need to know the context of the data.
  2. Too many unique patterns.
  3. Probably not the most efficient way.

GitHub

https://github.com/elvinado/MP-PRU15

Maybe Next…

Extract other information from the same page, such as:

  1. Total voters
  2. Ethnic breakdown of voters (percentages)
  3. Spoilt votes
  4. Majority (winning margin in votes)

Then redo the visualisation with relative numbers (such as percentages).
