Combining Australian Census data with the Same Sex Marriage Postal Survey in R

Last week I put out a post that showed you how to tidy the Same Sex Marriage Postal Survey Data in R. In this post we’ll visualise that data in combination with the 2016 Australian Census. Note to people just here for the R — the main challenge here is actually just navigating the ABS’s Census DataPack, but I’ve tried to include a few pearls of wisdom on joining datasets to keep things interesting for you.

Load Same Sex Marriage Survey data

We’ll start with the survey response dataset from last week. Grab it from my repo if you need to. Note the little fix in the mutate for some untidiness that slipped through. Both of those refer to Tasmania.

Get Census Data

The ABS has a few different census offerings. I looked at them all. You can trust me when I say to ignore all the reports and go straight for the ‘DataPacks’ at https://datapacks.censusdata.abs.gov.au/datapacks/.

DataPacks are geographically summarised views of the raw census data. To get to a download link, there are two choices to be made regarding the how the summaries are produced:
1. ‘Type’ which concerns questions like: do you want to summarise by people’s place of residence or by place where they filled out the census form? For a postal survey we want the the data summarised by place of residence which is the ‘General Community Profile’ datapack.

2. Geographical area used for summarise. To match the postal survey data I chose ‘Commonwealth Electoral Divisions’.

Once you have a link you’ll need to download and extract the zip file. If you’d prefer not to do this, you can use my fetch_data script to fetch and decompress the datapack to ./data.

Navigating the DataPack

The datapack consists of 59 encoded csv files and 3 metadata excel files that will help us decode their meaning. What? You didn’t think this was going to be straight forward did you?

When I say encoded, I mean the csv’s have inscrutable names like ‘2016Census_G09C.csv’ and contain column names like ‘Se_d_r_or_t_h_t_Tot_NofB_0_ib’ (H.T. @hughparsonage).

Two of the metadata files in /Metadata/ have useful applications for us. ‘2016Census_geog_desc_1st_and_2nd_release.xlsx’ will help us resolve encoded geographic areas to federal electorate names. ‘Metadata_2016_GCP_DataPack.xlsx’ lists the topics of each of the 59 tables and will allow us to replace a short and uninformative column name with a much longer, and slightly more informative name. These can be read without too much hassle:

Examining Religious Affiliation

Suppose we are interested in reproducing this Guardian article's analysis of the correlation between percentage of religious people and percentage of ‘No’ responses. The first step is to identify the DataPack table that contains religious affiliation data. If we look at sheet two of `Metadata_2016_GCP_DataPack.xlsx` we can see a likely choice is table ‘G14’:

I read in the table G14 and replaced the short column names with the long ones to help makes sense of it. This is indicative what we’re working with:

# A tibble: 168 x 103
CED_CODE_2016 Buddhism_Males Buddhism_Females Buddhism_Persons
<chr> <int> <int> <int>
1 CED101 2766 3921 6687
2 CED102 3764 5078 8849
3 CED103 3136 4200 7337
4 CED104 1718 2240 3953
5 CED105 6508 7815 14323
6 CED106 2043 3012 5061
7 CED107 399 573 971
8 CED108 1092 1222 2312
9 CED109 789 1224 2017
10 CED110 611 878 1491
# … with 158 more rows, and 99 more variables:
# Christianity_Anglican_Males <int>,
# Christianity_Anglican_Females <int>,
# Christianity_Anglican_Persons <int>,
# Christianity_Assyrian_Apostolic_Males <int>,
# Christianity_Assyrian_Apostolic_Females <int>,

# Christianity_Total_Persons <int>, Hinduism_Males <int>,
# Hinduism_Females <int>, Hinduism_Persons <int>, Islam_Males <int>,
# Islam_Females <int>, Islam_Persons <int>, Judaism_Males <int>,

We are going append these columns to our response data. Notice how the records are keyed by CED_CODE_2016 which relate to the Commonwealth Electoral Division names we have in our response data. Unfortunately we do not have a common key column to join these data, so there is an intermediate step to join the CED code from electoral_codes to the names in the response data before joining the religious affiliation data.

Take a look at the code before some more detailed commentary:

I wrote a function to replace the short with long names since it was a fiddly task and knew I would need to repeat it. Compare the code for that function with my original one liner:

names(religious_affiliation_by_person)[-1] <- column_codes$Long[match(names(religious_affiliation_by_person), column_codes$Short)[-1]] 

Yuck! Notice how the abstraction of the function allows me to use some more informative variable names.

After renaming we get into building up the dataset by layering census data onto our response data. We join on electorate_codes by electorate name to get the ‘CED Code’ that matches each electorate. A filter step needs to follow because there is a second type of code that matches our electorates introducing duplicates. We filter to just “CED”.

We join on the religious affiliation data using the CED code. We then calculate a new variable representing the proportion of religious people: prop_christian_and_islamic. I singled out these religions because they are are Australia’s top two and I am aware of anti-same sex marriage stances within both. Finally, we select just a subset of columns of interest to make our plots of the correlation.

An Aside on Joining Data

I’d like to spend a minute discussing joins, feel free to skip to the plots if this doesn’t sound exciting.

There is only one join you need for > 90% of data science cases and that is the left join. Inner joins will silently filter mismatches — which is just plain scary! While outer joins can result in a mix of NA’s: some that are to be expected, and some that are not, making it easier to miss or misdiagnose problems. Stick with left join and you’ll have a simple set of failure modes you can easily diagnose.

So how to arrange your datasets in a left join? Put the kernel of your analysis dataset on the left. Put data you are trying to append to that on the right. E.g.:

left_join(response_data, census_data)

You now have exactly one source of potential contamination for the kernel columns, and that is duplication introduced by multiple matching records in the right set. You can easily test for this is by doing duplicated on your key column (e.g. electorate) or by keeping an eye on the number of rows. This is how I spotted the duplicate electorate codes. You may have some instances of a mismatch that will lead to tell-tale NAs in all columns from the data frame on the right. Test for this with anyNA() or sum(is.na()) etc.

A Plot

Getting back to reproducing the investigation into religious affiliation and No responses: Using the ggplot2 code below, we observe the expected positive correlation between religious affiliation and ‘No’ response on same sex marriage:

My state of QLD cops a lot of flack for our rural conservatives, but it’s NSW that is looking red neck here.

Won’t Somebody Please Think of the Children

I wasn’t content to simply reproduce someone else’s analysis, so I thought I’d make some original contribution to the discourse by looking for a trend in responses from parents.

This was motivated by the No campaign going full Helen Lovejoy and trying to turn the public debate to the potential for gender confusion in school children when they are subjected to the idea that anyone can marry anyone. Or something. I can’t say I fully grasped their reasoning. But if their concern were validated by a majority of parents with young children, we might expect to see a similar trend to that of religion.

The code I used is very similar to above. The main trap is that you have to be careful not to mix statistics for families with statistics for persons, since the totals of these are different. Pay attention to the column prefixes and suffixes.

With this topic, I’m playing my Dad card for the first time. Am I doing it right?

A couple of these facets look like text book examples of correlation=0.0. The middle row of states actually look weakly negatively correlated(!). One issue is that we’ve got a lot less variation in the proportion of parents than we did with religious affiliation, but I reckon there’s enough for a trend to show if it existed. I’m even prepared to don my stats-cowboy hat and say based on what I see here, it’s almost as if the people concerned with the negative impact of same sex marriage on school children, tend not to be the people that you know… actually have them.

Discussion

Once again we made short work of a fairly hairy data problem with the R tidyverse. For those new to joining datasets with dplyr, I appreciate it can be a little dizzying at first. Running the code to view the results of each successive step will help. Doing many joins will help you get a feel for the different types of outcomes. Why not try to mix some different Census data with the participation data we tidied last week? Tweet your plots at me if you do!

If you were already familiar with the tidyverse techniques I displayed here, then I hope this at least serves as a useful template for using ABS Census data in R.

Thanks for reading!

Code in this article is available at this Github repository.