Census data released as Excel files?! Getting Australia’s 2021 dwelling and person counts into a usable plain text format for scripted analysis
As of 31 August 2022, the Australian Bureau of Statistics (ABS) has only released the 2021 Census Mesh Block dwelling counts as an Excel file, with data stratified across 12 worksheets. This is inconvenient for users who wish to link this data with digital boundaries, and not useful for data posterity.
I couldn’t find a CSV version of this very useful data elsewhere, so thought it would be useful to take the time to compile these files myself and make them and the code to do this public for others to also use until an official release is produced.
This article describes how I did this using Python 3 with Pandas, where you can get the code, and the resulting CSV files.
What are Mesh Blocks?
Mesh Blocks are described as the ‘building blocks’ of the Australian Statistical Geography Standard, which defines an authoritative series of geographic scales of administrative boundaries : from street blocks (which Mesh Blocks roughly correspond to, with 30 to 60 dwellings), neighbourhoods (Statistical Area 1, or SA1), broader local regions (SA2, SA3, SA4), and up to enter cities ( Significant Urban Areas and Greater Capital City Statistical Areas) and States or Territories. All these larger areas can be constructed using Mesh Blocks, or using correspondence files.
Importantly, Mesh Blocks are the scale at which the Census data records for counts of persons usually resident and dwellings are released. This is the data we use to understand the population of our cities, where people actually live or could live, and evaluate population density and dwelling density for local neighbourhoods and broader regions. This is important for planning and decision making, but also for researchers seeking to understand how urban environments impact our health and wellbeing.
Getting the data
The data was released at on 28 June 2022, and is downloadable from the ABS website, but only as an Excel .xlsx file download. The ABS has a good open data policy, with their published releases such as this data being © Commonwealth of Australia 2022 but usable under Creative Commons Attribution 4.0 International licence terms. Once the file is downloaded, in the explanatory notes worksheet it suggests,
Available Format Type:
Excel 2010 files (.xls)
CSV files (.csv)
… and that was the case for the 2016 release. But for 2021, they describe this as a ‘Completed dataset’ with ‘no updates planned’.
And so, it seemed like a good idea to write the code to retrieve and process the data to make it usable like the previous CSV release was, and publish both the code and data for others (and myself!) to use.
The data was saved using Python 3 using the code below:
While the Excel file URL could be loaded as a DataFrame directly using Pandas, I wanted to download the file itself so I could open it up and ensure I fully understood the data structure.
Understanding the data
By manually opening up the download, it was confirmed that the Mesh Block counts data for Australian States and Territories was stratified across 12 separate tables as follows:
The table header was on row 7 of each sheet, and contained the following fields:
While the MB_CODE_2021 variable is technically a really big integer (11 digits long; ie. longer than a 32-bit integer can be represented without an overflow error); therefore, the ABS represents the Mesh Block code as a string, and so this convention has been maintained in this dataset. Alternatively, the MB_CODE_2021 category could be represented using a 64-bit integer, and I expect there could be some optimisation benefits in doing that for indexed queries of large databases. But I won’t worry about that here.
Some records may have nulls, so Int64 data type will be used to represent integers, allowing for these occurrences to be correctly retained.
The last row for each table was checked, with any records going beyond row 60,000 cascading to a subsequent table for that State or Territory. It was also confirmed that the length of footer on each page was consistently 4 rows.
The data also includes the explanatory note, “Cells in this table have been randomly adjusted to avoid the release of confidential data.” and “© Commonwealth of Australia 2022”; users should note this, and that this data was made available under a Creative Commons Attribution 4.0 International licence as per https://www.abs.gov.au/website-privacy-copyright-and-disclaimer#copyright-and-creative-commons.
(any users of the data should fully read the explanatory notes)
So, we’ll create a data structure reflecting the above table, that we can iterate over and compile each State and Territories’ Mesh Block records into a single master table for export to CSV.
Loading the data
Finally! We can read in the Mesh Block data with some confidence we have interpreted it correctly. The follow code snippet
- initialises an empty dictionary
- loops over each State and Territory, loading each corresponding worksheet table into a temporary Pandas DataFrame. Then, either stores this under the State/Territory name in the dictionary, or appends it to an already stored DataFrame.
I verified that the data had loaded correctly as intended (with manual checks against the original data tables to ensure the first and last row of each table had been recorded and that number of rows corresponded).
Combining the data
So, the loaded data can now be combined. But once that’s done, before exporting it really needs to be cleaned up. In particular, the source data only came with a state code, not a state name variable. While its a moderately sized dataset (368,285 records, once combined into a single table), the small saving of file size for representing a State/Territory as an integer rather than a text string containing its human readable name isn’t justified.
So, most of the following snippet is ensuring that the State/Territory name, which was added as an index when concatenating all the respective DataFrames , was retained using a consistent variable name (STATE_NAME_2021) and the variables sensibly ordered.
Output CSV data files and display summary
Okay… now we’re ready to output this to CSV — and since we can, we’ll do this both as a single national file, as well as a single file for each State and Territory (in contrast to the source data which split any dataset exceeding 60000 records across multiple tables!).
I’ve uploaded these files to FigShare, and they can be accessed here: https://doi.org/10.25439/rmt.20746609
A PDF copy of the Jupyter Notebook used for analysis, which this article is based on, is also included at the above URL.
Finally, here’s a summary of the total area, dwelling count and population of persons usually resident as recorded at the 2021 Census:
The main thing though is, with the Mesh Block data now in a machine-readable format, these counts at the smallest geographic scale released by the ABS can be used to understand the structure of local neighbourhoods around Australia. That’s something for another article though.
Hopefully the above will be useful for others who wish to either replicate the process to get the ABS Mesh Block data in a format readily usable for analysis, or use the derived Mesh Block datasets directly.
If you use the data, or have any thoughts on the above, please leave a comment below!