Tidying the Australian Same Sex Marriage Postal Survey Data with R

6 min readNov 19, 2017

This week the Australian Bureau of Statistics delivered the result the majority of us had hoped for but at the same time kind of expected: Australians want to allow same sex couples to marry. The spectre of a nasty surprise in the vein of Brexit, Trump etc did not materialise.

However, it pays to stay vigilant. A demon of a different kind clawed its way up from the upsidedownunder:

It could be worse. At least there’s no date-time bullshittery to sob quietly into a pillow over.

So while Australia didn’t learn much new from its un-referrendum result, there is much to be learnt from the public datasets as teaching examples we can assail with the full power of the #rstats #tidyverse to do some cool analyses.

In this post I’m going to break down my approach to tidying the two Excel files released by the ABS. When dealing with this file format there’s often no way around looking at the data in Excel to get a feel for what you’re dealing with. You’ll hit too many errors trying to read the data into R without inspecting it. I use Libre Office Calc to do that.

The Participation Data

These data are the messiest and the subject of my annotated screenshot. We have the electorate-level participation data for males and females inexplicably split across two gendered sheets. I preferred to start with just one sheet and work out a way to abstract my approach to both sheets later. For the first sheet I broke the problem down into two parts:

Dealing with A) and B) separately hinges on being able to read in separate data frames from areas within a sheet of an .xlsx file. Luckily the readxl package confers this ability. I split the problem because the merged cells worried me, and I thought it might lead to simpler code. See how for B) I only really need to worry about dropping rows that are all NA. I’ll also gather the data into long format so that age range is one covariate. While for A) I have to filter out the subheadings and remove the footnotes embedded in the electorate name. Let’s consider A) first:

Tidying A) into a column of areas

In the code above the filter step removes all the rows that contain either “Division” or end in “Australia”, and the mutate step removes the footnotes by replacing any instances of a single letter inside round brackets with an empty string.

Now let’s get B), and in the process incorporate A):

Tidying B) into long format and including A)

Yes, I specified the column names. If I didn’t, there would still be two I had to manually rename: the blank first column and the final column not visible in the screenshot called ‘Total for Male’. I think it’s clearer what’s happening here by specifying them. Also, I didn’t type them out. I used my datapasta package to copy-paste the vertical vector into R directly and edit as necessary.

Inside the mutate, area = rep(areas$area, each=3) takes the clean column we created from A) and replicates every value 3 times, maintaining order. This is how I dealt with the merged cell for area which needed to appear once for each of the three participation measures. Gender is hard coded to ‘Male’ since I was working on the male sheet. I took this opportunity to ditch the state-wise summaries of the data by removing any areas that contained ‘Total’. Finally I gather — ed up the participation counts spread over age ranges into 2 columns: The age range and a corresponding count. The result looks like this:

>age_counts
# A tibble: 7,632 x 5
                  measure      area gender         age  count
                    <chr>     <chr>  <chr>       <chr>  <dbl>
 1     Total participants     Banks   Male 18-19 years 1102.0
 2  Eligible participants     Banks   Male 18-19 years 1431.0
 3 Participation rate (%)     Banks   Male 18-19 years   77.0
 4     Total participants    Barton   Male 18-19 years  977.0
 5  Eligible participants    Barton   Male 18-19 years 1278.0
 6 Participation rate (%)    Barton   Male 18-19 years   76.4
 7     Total participants Bennelong   Male 18-19 years 1177.0
 8  Eligible participants Bennelong   Male 18-19 years 1488.0
 9 Participation rate (%) Bennelong   Male 18-19 years   79.1
10     Total participants   Berowra   Male 18-19 years 1523.0
# ... with 7,622 more rows

Abstracting to both gender tables

Having built a tidy dataset for males, I looked at how to parameterise the code so I could call it as a function on both gender tables. The only parameters needed were the table name and gender represented in the table. pmap invokes the function over two lists of parameters in parallel to give me a list of tidy datasets I could bind together into a final product. The code looks like this:

‘extract_participation_counts’ is a function that parameterises the code I showed you for the male table.

Finishing Touches

That worked well. But as the astute reader may have already noticed, I blew away the Australian State associated with each electorate (area) in the subheadings. I decided rather than extracting and incorporating it, I would put it back in by querying public data maintained by the Australian Electoral Commission. I also felt that someone trying to analyse this data would probably appreciate the participation measures as columns. Lastly, I set a couple of the columns to integer to work around a readr bug so things go smoothly on the read side. So here’s the final touches:

Appending physical Area and State from an online html table, and spreading the participation measures.

>ssm_participation_state %>%
+   spread(measure, count)
# A tibble: 4,800 x 8
       area gender         age State `Area (sq km)`
 *    <chr>  <chr>       <chr> <chr>          <chr>
 1 Adelaide Female 18-19 years    SA             76
 2 Adelaide Female 20-24 years    SA             76
 3 Adelaide Female 25-29 years    SA             76
 4 Adelaide Female 30-34 years    SA             76
 5 Adelaide Female 35-39 years    SA             76
 6 Adelaide Female 40-44 years    SA             76
 7 Adelaide Female 45-49 years    SA             76
 8 Adelaide Female 50-54 years    SA             76
 9 Adelaide Female 55-59 years    SA             76
10 Adelaide Female 60-64 years    SA             76
# ... with 4,790 more rows, and 3 more variables: `Eligible
#   participants` <dbl>, `Participation rate (%)` <dbl>, `Total
#   participants` <dbl>

The Response Data

The response data is a little easier since there is only one table we care about parsing to get the electorate-level response counts:

We see many of the same issues as before. There are just two new ones: a blank column right down the centre of the table, and merged column header cells. I got around the header cells by specifying the headings explicitly again. I then worked out that if I removed the blank column straight away, the rest of the issues could be resolved in an identical manner to the participation data. The lack of merged cells in columns meant splitting the problem into 2 reads was not necessary. I’ll just give all the code at once here, since most of it will be familiar:

I used the same strategies as with tidying the participation data

> response_data
# A tibble: 150 x 17
        area   Yes `Yes pct`    No `No pct` `Response Total`
       <chr> <dbl>     <dbl> <dbl>    <dbl>            <dbl>
 1     Banks 37736      44.9 46343     55.1            84079
 2    Barton 37153      43.6 47984     56.4            85137
 3 Bennelong 42943      49.8 43215     50.2            86158
 4   Berowra 48471      54.6 40369     45.4            88840
 5  Blaxland 20406      26.1 57926     73.9            78332
 6 Bradfield 53681      60.6 34927     39.4            88608
 7    Calare 54091      60.2 35779     39.8            89870
 8   Chifley 32871      41.3 46702     58.7            79573
 9      Cook 47505      55.0 38804     45.0            86309
10    Cowper 57493      60.0 38317     40.0            95810
# ... with 140 more rows, and 11 more variables: `Response Total
#   pct` <dbl>, `Response clear` <dbl>, `Response clear pct` <dbl>,
#   `Response not clear(b)` <dbl>, `Response not clear(b) pct` <dbl>,
#   `Non-responding` <dbl>, `Non-responding pct` <dbl>, `Eligible
#   Total` <dbl>, `Eligible Total pct` <dbl>, State <chr>, `Area (sq
#   km)` <chr>

Discussion

To tidy these data we used a good portion of the gamut of tidyverse techniques. We:

Read Excel files using readxl
dropped rows and reshaped data with tidyr
filtered, mutated, joined, and bound data with dplyr
mapped a function to a list of parameters using purrr
scraped a table from the web using rvest
wrote a csv file using readr.

How great is that?! You can get my complete scripts and tidied csv files from this GitHub repository.

Watch this space for a sequel post where I’ll outline how to join data from the 2016 census to overlay demographic information onto the results.