Sociability index of mice with microbiomes donated by children with ASD (blue), Mild ASD (green), and neurotypical controls (yellow)

Data visualization

Bringing data to life with Flourish, DataWrapper, and ggplot2

Jon Brock

Published in Dr Jon Brock

7 min read · Jul 18, 2019

Australian Marriage Law Postal Survey

When Australians voted to legalise gay marriage in 2017, the lowest Yes vote was recorded in a cluster of suburbs in Western Sydney. Commentators quickly blamed the large immigrant populations in these areas.

I created a data “story” in Flourish, plotting the percentage of Yes votes in each constituency against various demographic indicators from the 2016 Australian census. It shows that, across the country, the Yes vote was indeed correlated with the proportion of the population not born in Australia. However, better predictors of the Yes vote were median income and the proportion of residents who described themselves as religious versus secular in their beliefs.

This illustrates the difficulty of drawing conclusions about individuals from data about the groups to which they belong — a problem known as the “ecological fallacy”.
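The fallacy can be made concrete with a toy example. The sketch below is in Python rather than the tools used for the original story, and every number in it is invented for illustration. It assumes three equally sized electorates in which overseas-born and Australian-born residents vote Yes at exactly the same rate.

```python
# Hypothetical electorates, invented for illustration (not census or survey data).
# Assumptions: within each electorate, overseas-born and Australian-born residents
# vote Yes at exactly the same rate, and all electorates are the same size.
electorates = [
    {"overseas_born_share": 0.10, "yes_rate": 0.70},
    {"overseas_born_share": 0.30, "yes_rate": 0.60},
    {"overseas_born_share": 0.50, "yes_rate": 0.50},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


shares = [e["overseas_born_share"] for e in electorates]
yes_rates = [e["yes_rate"] for e in electorates]

# Group-level view: correlation across electorates.
area_level_r = pearson(shares, yes_rates)

# Pooled view: Yes rate among all overseas-born vs all Australian-born residents.
overseas_yes = sum(s * y for s, y in zip(shares, yes_rates)) / sum(shares)
local_yes = sum((1 - s) * y for s, y in zip(shares, yes_rates)) / sum(1 - s for s in shares)

print(f"Correlation across electorates: {area_level_r:.2f}")
print(f"Pooled Yes rate, overseas-born: {overseas_yes:.1%}")
print(f"Pooled Yes rate, Australian-born: {local_yes:.1%}")
```

In this made-up data the electorate-level correlation is strongly negative, and the pooled Yes rate is lower for overseas-born residents, even though birthplace makes no difference to how anyone votes within any single electorate. The gap arises only because overseas-born residents are concentrated in electorates where everyone is less likely to vote Yes.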

Quest for autism biomarkers faces steep statistical challenges

As an autism researcher, one of my enduring frustrations was the constant flow of news articles about studies claiming to show a biomarker that could be used for autism diagnosis. So, with my brother Tim Brock, I wrote an article for the autism research magazine Spectrum News explaining the statistical barriers to translating an initial finding of a group difference into a clinically useful diagnostic tool.

We used GIFs to introduce concepts such as the Receiver Operating Characteristic curve, which plots the sensitivity of a test against its false positive rate for different cut-offs.
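As a rough illustration of how such a curve is built, here is a minimal Python sketch (not the code behind the GIFs). The function name `roc_points` and the scores and labels are hypothetical, chosen only to show the mechanics: each distinct score is tried as a cut-off, and the sensitivity and false positive rate are recorded at each one.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, sensitivity) pairs, one per cut-off.

    scores: test scores, higher meaning "more likely autistic".
    labels: 1 for autistic, 0 for non-autistic.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a cut-off above every score flags nobody
    # Try each distinct score as a cut-off, from strictest to most lenient.
    for cutoff in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Invented data for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, sensitivity in points:
    print(f"false positive rate = {fpr:.2f}, sensitivity = {sensitivity:.2f}")
```

Lowering the cut-off can only add positives, so the curve always runs from the bottom-left corner (nobody flagged) to the top-right corner (everybody flagged). A more discriminating test bows further towards the top-left.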

We then showed how the performance of a test changes with the base rate. The current estimate is that 1 in 68 people are autistic. This means that, even with a highly discriminating test, most people who test positive don’t actually have autism.
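The arithmetic behind that claim is Bayes' rule. The Python sketch below assumes a hypothetical test with 90% sensitivity and 90% specificity. Those two figures are illustrative assumptions, not values from the article; only the 1 in 68 base rate comes from the text.

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Probability that someone who tests positive actually has the condition."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 90% sensitivity and 90% specificity (assumed for illustration).
# Base rate of 1 in 68, as quoted in the text.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.90, base_rate=1 / 68)
print(f"Share of positive results that are correct: {ppv:.1%}")
```

Under these assumptions only about 12% of positive results are correct, because the small false positive rate is applied to the 67 in 68 people who are not autistic.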

Autism diagnosis by brain scan: It’s time for a reality check

In an article for The Guardian, I talked about these concepts in more concrete terms, responding to a much-hyped paper claiming to predict a baby’s future autism diagnosis based on MRI brain scans.

Taking the supplementary data from the paper, I used ggplot2 in R to plot the growth curves for the autistic and non-autistic children’s brains. This shows some differences on average between the two groups but a large degree of overlap. I used the magick package in R to create a GIF animation that more clearly shows the “normal” range of brain growth, with the autistic brain growth then superimposed.

Because of the overlap, the researchers were unable to reliably differentiate between groups on any of these measures. Instead, they developed a machine learning algorithm that they taught to distinguish between autistic and non-autistic babies based on multiple brain “features”.

I used the waffle package in R to create a waffle plot showing the performance of the machine learning algorithm, which correctly identified 30 out of 34 autistic babies and incorrectly flagged just 7 of 145 non-autistic babies as autistic.
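Those four counts are enough to recover the usual summary statistics. The Python sketch below is not the linked R code, and the variable names are my own. It derives sensitivity, specificity and the share of positives that were correct within the study sample.

```python
# Counts reported above: 30 of 34 autistic babies correctly identified,
# 7 of 145 non-autistic babies incorrectly flagged.
true_pos, autistic_total = 30, 34
false_pos, non_autistic_total = 7, 145

false_neg = autistic_total - true_pos
true_neg = non_autistic_total - false_pos

sensitivity = true_pos / autistic_total            # autistic babies detected
specificity = true_neg / non_autistic_total        # non-autistic babies cleared
sample_ppv = true_pos / (true_pos + false_pos)     # positives that were correct
sample_base_rate = autistic_total / (autistic_total + non_autistic_total)

print(f"Sensitivity: {sensitivity:.1%}")
print(f"Specificity: {specificity:.1%}")
print(f"Positives that were correct (in this sample): {sample_ppv:.1%}")
print(f"Autistic babies as a share of the sample: {sample_base_rate:.1%}")
```

The last figure matters: 34 of the 179 babies analysed were autistic, so the within-sample accuracy of a positive result reflects a base rate far higher than 1 in 68.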

However, this ignored the fact that many of the infants had to be excluded from the analysis because they had incomplete data. It also ignored the low base rate of autism in the general population. I updated the waffle plot, taking these two issues into account. This demonstrates the limited clinical value of brain scans. Identifying those 30 autistic babies in a real-world setting would have required scanning many thousands of babies, and the overwhelming majority of those who tested positive would not be autistic.
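To give a feel for the base-rate adjustment, here is a simplified Python sketch in natural frequencies. It is not the calculation behind the updated waffle plot. It assumes a base rate of 1 in 68, assumes the detection rate of 30/34 and false positive rate of 7/145 carry over unchanged to the general population, and ignores the excluded infants entirely. The function name `expected_outcomes` is hypothetical.

```python
def expected_outcomes(n_screened, base_rate, sensitivity, false_positive_rate):
    """Expected screening results for a population of n_screened babies."""
    autistic = n_screened * base_rate
    non_autistic = n_screened - autistic
    true_positives = autistic * sensitivity
    false_positives = non_autistic * false_positive_rate
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "share_of_positives_not_autistic": false_positives
        / (true_positives + false_positives),
    }


# Assumptions: 1 in 68 base rate; the study's rates apply to everyone screened.
outcome = expected_outcomes(
    n_screened=10_000,
    base_rate=1 / 68,
    sensitivity=30 / 34,
    false_positive_rate=7 / 145,
)
for name, value in outcome.items():
    print(f"{name}: {value:.3f}")
```

Even under these generous assumptions, roughly four in five positive results per 10,000 babies scanned would be false alarms. Accounting for the infants who could not be analysed makes the picture worse.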

R code underlying these plots can be found here.

Can gut bacteria cause autism (in mice)?

In a 2019 paper published in Cell, researchers reported that faecal transplants from autistic children caused mice to exhibit “autism-like” behaviours. This was presented as evidence that gut bacteria play a causal role in autism. But although a large number of mice were tested, the faecal samples came from a small number of autistic children and controls.

In a post on Medium, I used Flourish to replot the data for each test of “autism-like” behaviour, breaking it down by donor.

Eventually, it became clear that the authors had analysed the data incorrectly, treating each mouse as if it had received its transplant from a different child.
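One simple way to respect the structure of the data is to average the mice within each donor, so that the donor, not the mouse, becomes the unit of analysis. The Python sketch below illustrates that idea only. The donor IDs and sociability values are invented, and this is not necessarily the analysis used in the post or the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (donor_id, donor_group, sociability_index).
# All values are invented for illustration.
mice = [
    ("ASD-1", "ASD", 0.40), ("ASD-1", "ASD", 0.44), ("ASD-1", "ASD", 0.42),
    ("ASD-2", "ASD", 0.61), ("ASD-2", "ASD", 0.59), ("ASD-2", "ASD", 0.60),
    ("NT-1", "NT", 0.66), ("NT-1", "NT", 0.64), ("NT-1", "NT", 0.65),
    ("NT-2", "NT", 0.52), ("NT-2", "NT", 0.50), ("NT-2", "NT", 0.51),
]


def donor_level_means(records):
    """Collapse mouse-level scores to one mean per donor, grouped by donor group."""
    scores_by_donor = defaultdict(list)
    group_of_donor = {}
    for donor, group, score in records:
        scores_by_donor[donor].append(score)
        group_of_donor[donor] = group
    means_by_group = defaultdict(list)
    for donor, scores in scores_by_donor.items():
        means_by_group[group_of_donor[donor]].append(mean(scores))
    return dict(means_by_group)


donor_means = donor_level_means(mice)
for group, means in donor_means.items():
    n_mice = sum(1 for _, g, _ in mice if g == group)
    rounded = [round(m, 2) for m in means]
    print(f"{group}: {n_mice} mice but only {len(means)} independent donors, "
          f"donor means = {rounded}")
```

In this toy data the mice from the same donor cluster tightly, while donors within a group differ widely. Counting every mouse as independent would overstate how much evidence there is for a difference between groups.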

The interactive below shows the data for one test as it was presented in the paper and analysed by the authors and then how it should have been broken down.

“A love letter to your future self”: What scientists need to know about FAIR data.

FAIR is the principle that scientific artefacts (data, code, etc.) should be Findable, Accessible, Interoperable (readable automatically by machines) and Reusable. It’s a principle endorsed and promoted by leading scientific organisations. However, a survey conducted for the 2018 State of Open Data report found that the overwhelming majority of scientists were not familiar with the concept.

My article for Nature Index introducing FAIR data included this panel of pie charts created in Flourish. The sampling for the survey was uneven so I broke it down by discipline, scaling the pie charts depending on how many respondents there were. Pie charts are often frowned upon. But in this case, because there are only three categories, the format allows an intuitive comparison across the different research areas.
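For anyone scaling pie charts by hand, one common convention is to make the area of each pie, rather than its radius, proportional to the number of respondents, which means the radius grows with the square root of the count. The Python sketch below assumes that convention. The discipline names and counts are placeholders, and I am not claiming this is how Flourish sizes its charts.

```python
import math


def pie_radius(respondents, max_respondents, max_radius=100.0):
    """Radius that makes a pie's area proportional to its respondent count."""
    if respondents < 0 or max_respondents <= 0:
        raise ValueError("counts must be non-negative, with a positive maximum")
    return max_radius * math.sqrt(respondents / max_respondents)


# Placeholder counts, invented for illustration (not the survey's figures).
counts = {"Discipline A": 400, "Discipline B": 100, "Discipline C": 25}
largest = max(counts.values())
radii = {name: pie_radius(n, largest) for name, n in counts.items()}

for name, radius in radii.items():
    print(f"{name}: {counts[name]} respondents, radius = {radius:.1f}")
```

Scaling the radius directly would exaggerate the differences: a discipline with four times the respondents would get a pie with sixteen times the area.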

Bronze open access supersedes green and gold

This chart for an article in Nature Index shows the rise of open access (journal articles available without payment by the reader). Using a stacked area chart demonstrates the overall increase in open access articles. It also shows that the increase is driven by journals that publish all (gold) or some (hybrid) articles openly, usually if the authors pay a fee.

The focus of the article was the surprising number of so-called “bronze” open access articles. These are papers that can be read without payment but are published without a license for re-use and with no guarantee that they will remain open access.

Women edged out of last-named authorships in top journals

In many fields of science, the first and last author positions on a paper are the most prestigious. They typically go to the person credited with leading the actual study and to the leader of the research group, respectively. This means that scientists are often evaluated not just on how many papers they have published and what journals they have published in, but also on where in the list of authors they have appeared.

For Nature Index, I covered a study showing that women had fewer last authorships, even accounting for the fact there are fewer women in science and so fewer authorships overall. The chart shows a “prestige index” for a selection of highly regarded journals — the worse the under-representation of women, the lower the prestige index. Notably, this under-representation is particularly acute in journals with the highest impact factor.

Individual differences in autistic children’s homograph reading: Evidence from Hebrew

This plot comes from a study looking at the ability of Hebrew-speaking autistic children to use sentence context to determine the correct pronunciation of homographs (ambiguous written words). In previous studies conducted in English, the poor performance of autistic children and adults has been interpreted as evidence for impaired reading comprehension. We recreated the test in Hebrew, which has many more appropriate homographs.

The plot was created using ggplot2 in R. It shows the relationship between the homograph reading accuracy of autistic children and four other measures. I overlaid the observed data (black dots), the model fit (orange lines) and the adjusted values for each observation (orange dots) with confidence intervals. This demonstrates that Picture naming, a measure of language production, is the best predictor with the closest fit between the model and the observed data. The sample size is obviously very small, but this observation prompted us to propose the testable hypothesis that difficulties with homograph reading in autism may be related more to the production than the comprehension of language. Paper · Code

The magnetic acoustic change complex and mismatch field: A comparison of neurophysiological measures of auditory discrimination

The waveform figure was also created in ggplot2. It shows brain responses to six different sound stimuli measured using magnetoencephalography (MEG). Typically, researchers only report the response averaged across all the participants. However, we were interested in whether MEG could be used to assess an individual’s auditory perception. I therefore overlaid the MEG response of each individual (in grey), giving an indication of how consistently the brain responses were elicited. Paper · Code