Tidying the Australian Same Sex Marriage Postal Survey Data with R

This week the Australian Bureau of Statistics delivered the result the majority of us had hoped for but at the same time kind of expected: Australians want to allow same sex couples to marry. The spectre of a nasty surprise in the vein of Brexit, Trump etc did not materialise.

However, it pays to stay vigilant. A demon of a different kind clawed its way up from the upsidedownunder:

It could be worse. At least there’s no date-time bullshittery to sob quietly into a pillow over.

So while Australia didn’t learn much new from its un-referrendum result, there is much to be learnt from the public datasets as teaching examples we can assail with the full power of the #rstats #tidyverse to do some cool analyses.

In this post I’m going to break down my approach to tidying the two Excel files released by the ABS. When dealing with this file format there’s often no way around looking at the data in Excel to get a feel for what you’re dealing with. You’ll hit too many errors trying to read the data into R without inspecting it. I use Libre Office Calc to do that.

The Participation Data

These data are the messiest and the subject of my annotated screenshot. We have the electorate-level participation data for males and females inexplicably split across two gendered sheets. I preferred to start with just one sheet and work out a way to abstract my approach to both sheets later. For the first sheet I broke the problem down into two parts:

Divide and conquer

Dealing with A) and B) separately hinges on being able to read in separate data frames from areas within a sheet of an .xlsx file. Luckily the readxl package confers this ability. I split the problem because the merged cells worried me, and I thought it might lead to simpler code. See how for B) I only really need to worry about dropping rows that are all NA. I’ll also gather the data into long format so that age range is one covariate. While for A) I have to filter out the subheadings and remove the footnotes embedded in the electorate name. Let’s consider A) first:

Tidying A) into a column of areas

In the code above the filter step removes all the rows that contain either “Division” or end in “Australia”, and the mutate step removes the footnotes by replacing any instances of a single letter inside round brackets with an empty string.

Now let’s get B), and in the process incorporate A):

Tidying B) into long format and including A)

Yes, I specified the column names. If I didn’t, there would still be two I had to manually rename: the blank first column and the final column not visible in the screenshot called ‘Total for Male’. I think it’s clearer what’s happening here by specifying them. Also, I didn’t type them out. I used my datapasta package to copy-paste the vertical vector into R directly and edit as necessary.

Inside the mutate, area = rep(areas$area, each=3) takes the clean column we created from A) and replicates every value 3 times, maintaining order. This is how I dealt with the merged cell for area which needed to appear once for each of the three participation measures. Gender is hard coded to ‘Male’ since I was working on the male sheet. I took this opportunity to ditch the state-wise summaries of the data by removing any areas that contained ‘Total’. Finally I gather — ed up the participation counts spread over age ranges into 2 columns: The age range and a corresponding count. The result looks like this:

# A tibble: 7,632 x 5
measure area gender age count
<chr> <chr> <chr> <chr> <dbl>
1 Total participants Banks Male 18-19 years 1102.0
2 Eligible participants Banks Male 18-19 years 1431.0
3 Participation rate (%) Banks Male 18-19 years 77.0
4 Total participants Barton Male 18-19 years 977.0
5 Eligible participants Barton Male 18-19 years 1278.0
6 Participation rate (%) Barton Male 18-19 years 76.4
7 Total participants Bennelong Male 18-19 years 1177.0
8 Eligible participants Bennelong Male 18-19 years 1488.0
9 Participation rate (%) Bennelong Male 18-19 years 79.1
10 Total participants Berowra Male 18-19 years 1523.0
# ... with 7,622 more rows

Abstracting to both gender tables

Having built a tidy dataset for males, I looked at how to parameterise the code so I could call it as a function on both gender tables. The only parameters needed were the table name and gender represented in the table. pmap invokes the function over two lists of parameters in parallel to give me a list of tidy datasets I could bind together into a final product. The code looks like this:

‘extract_participation_counts’ is a function that parameterises the code I showed you for the male table.

Finishing Touches

That worked well. But as the astute reader may have already noticed, I blew away the Australian State associated with each electorate (area) in the subheadings. I decided rather than extracting and incorporating it, I would put it back in by querying public data maintained by the Australian Electoral Commission. I also felt that someone trying to analyse this data would probably appreciate the participation measures as columns. Lastly, I set a couple of the columns to integer to work around a readr bug so things go smoothly on the read side. So here’s the final touches:

Appending physical Area and State from an online html table, and spreading the participation measures.
>ssm_participation_state %>%
+ spread(measure, count)
# A tibble: 4,800 x 8
area gender age State `Area (sq km)`
* <chr> <chr> <chr> <chr> <chr>
1 Adelaide Female 18-19 years SA 76
2 Adelaide Female 20-24 years SA 76
3 Adelaide Female 25-29 years SA 76
4 Adelaide Female 30-34 years SA 76
5 Adelaide Female 35-39 years SA 76
6 Adelaide Female 40-44 years SA 76
7 Adelaide Female 45-49 years SA 76
8 Adelaide Female 50-54 years SA 76
9 Adelaide Female 55-59 years SA 76
10 Adelaide Female 60-64 years SA 76
# ... with 4,790 more rows, and 3 more variables: `Eligible
# participants` <dbl>, `Participation rate (%)` <dbl>, `Total
# participants` <dbl>

The Response Data

The response data is a little easier since there is only one table we care about parsing to get the electorate-level response counts:

We see many of the same issues as before. There are just two new ones: a blank column right down the centre of the table, and merged column header cells. I got around the header cells by specifying the headings explicitly again. I then worked out that if I removed the blank column straight away, the rest of the issues could be resolved in an identical manner to the participation data. The lack of merged cells in columns meant splitting the problem into 2 reads was not necessary. I’ll just give all the code at once here, since most of it will be familiar:

I used the same strategies as with tidying the participation data
> response_data
# A tibble: 150 x 17
area Yes `Yes pct` No `No pct` `Response Total`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banks 37736 44.9 46343 55.1 84079
2 Barton 37153 43.6 47984 56.4 85137
3 Bennelong 42943 49.8 43215 50.2 86158
4 Berowra 48471 54.6 40369 45.4 88840
5 Blaxland 20406 26.1 57926 73.9 78332
6 Bradfield 53681 60.6 34927 39.4 88608
7 Calare 54091 60.2 35779 39.8 89870
8 Chifley 32871 41.3 46702 58.7 79573
9 Cook 47505 55.0 38804 45.0 86309
10 Cowper 57493 60.0 38317 40.0 95810
# ... with 140 more rows, and 11 more variables: `Response Total
# pct` <dbl>, `Response clear` <dbl>, `Response clear pct` <dbl>,
# `Response not clear(b)` <dbl>, `Response not clear(b) pct` <dbl>,
# `Non-responding` <dbl>, `Non-responding pct` <dbl>, `Eligible
# Total` <dbl>, `Eligible Total pct` <dbl>, State <chr>, `Area (sq
# km)` <chr>


To tidy these data we used a good portion of the gamut of tidyverse techniques. We:

  • Read Excel files using readxl
  • dropped rows and reshaped data with tidyr
  • filtered, mutated, joined, and bound data with dplyr
  • mapped a function to a list of parameters using purrr
  • scraped a table from the web using rvest
  • wrote a csv file using readr.

How great is that?! You can get my complete scripts and tidied csv files from this GitHub repository.

Watch this space for a sequel post where I’ll outline how to join data from the 2016 census to overlay demographic information onto the results.