How I find stories in data
Be open to the bonanza. Never trust data on its own.
All reporting involves some form of data. An interview or a tip-off is simply unstructured data. All journalists, in particular economic and business journalists, have always used numbers at the heart of their storytelling.
But in Australia, we’re now at a point where there is so much data available across so many areas of reporting that the core skills of data journalism — the ability to manipulate structured data — should be part of the toolkit of every reporter.
Data skills will help you take advantage of two technological shifts that have opened up that world of structured information for reporters.
The first change has been the ever-growing body of publicly available sets of structured data that journalists can explore, cross-reference, mash together and manipulate.
The second change has been a steady increase in sophistication and number of tools that non-experts can use to explore and visualise the datasets — I usually use Excel or Google Sheets to find the key features of the data, such as averages and totals, and for charting. OpenOffice Calc is another option that’s free. I’ll also use Excel or Tableau for a particularly big or complex pieces of data.
The everyday data journalism I’m talking about is quite separate from the other, more enterprising kind — like large-scale data extraction, or scraping data from websites or using data to create visualisations or interactive tools. These are techniques that require programming skills — think of the Guardian’s DisclosureBot or interactive projects from the ABC like Vote Compass and the Brisbane floods before-and-after visualisations.
At The Australian Financial Review, our technology reviewer and part-time database expert John Davidson has put together a flexible system that allows us to quickly upload a large set of data and present readers with an interactive database to search, sort and filter the information. We use the system for data including our high-profile annual Financial Review Rich List.
Here are some examples of the types of publicly accessible data that will hopefully whet your appetite and trigger some story ideas.
But you don’t need a fancy system like ours to get more data into your reporting. All you need are some of the tools above, and some (more than abundant) data.
For example, the Department of Immigration and Border Protection has spreadsheets of data about the usage of the now-abolished 457 visas. When PM Malcolm Turnbull announced in mid-April that the program would be replaced, the data provided a detailed way to parse who was using the visas.
This showed, for example, that many of the jobs dropped from the list of eligible occupations like archivist, antique dealer and blacksmith had not been in use recently. In contrast, many of the most popular occupations with thousands of holders each, like chefs, programmers and accountants, continued to be eligible for a four-year visa.
The Australian Institute of Health and Welfare has downloadable spreadsheets on subjects including adoptions by country and gender, rates of alcohol and drug use, and how Australians die.
A sobering find from this data is that suicide is now the leading killer of Australians aged 25–44 years due to big falls in the death rates from circulatory diseases.
There’s also non-government data available. For instance, pharmaceutical industry lobby group Medicines Australia has released data on payments made to medical practitioners in spreadsheets from each of its 35 member groups.
That data revealed pharmaceutical companies were spending up to $40,000 in a six-month period for doctors to, say, fly to international conferences to present research on a company’s product or on their own area of expertise.
When it comes to business and economics, there is almost too much data available.
One of my favourites is the Workplace Gender Equality Agency. Large companies have to submit an annual, publicly accessible report of their staff numbers, broken down by gender, seniority and type of role (full-time, part-time or casual).
Another favourite, the Australian Tax Office’s annual statistics, is a trove of information about how much individuals earn and pay in tax, broken out by a range of measures including postcode, age, gender, occupation and source of income.
This year’s data, for instance, showed that more emergency services workers than lawyers negatively geared an investment property — but police, firefighters and ambulance officers only got half the benefit (an average of $745 compared with $1,500 for lawyers) because they earn less.
There’s also the ATO’s Corporate Tax Transparency data on how much the biggest companies pay in tax, Treasury’s release of Budget data in spreadsheet form and government tender data, which can be accessed through through data.gov.au or the federal AusTender website. That’s not to mention the data available for publicly listed companies, certain private companies and the bonanza of the Australian Bureau of Statistics.
Once I have some promising data, here’s how I make use of it. In many cases, the trick is the analysis isn’t that complicated — a few basic calculations will suffice for most stories. (See my worked-through example at the bottom of this article.)
If you need just a few data points to put together a story, you may be able to get away with simply looking at the right part of the spreadsheet and noting down the relevant figures, or you may want to copy and paste a small section into a separate sheet to focus on a subset of the numbers on offer.
Formatting goes a long way. I find that before I even worry about formulas or other types of calculations, I need to organise new data into a form that makes sense to me.
For the accounting and consulting firms I cover, recently I downloaded WGEA data for the big four accounting and advisory firms and then recorded the number of male and female equity partners over time.
After I’d organised the data into a spreadsheet and calculated the percentages for each, I had both my story and the data for a graphic. I spent more time checking the data against the original documents, and contacting each firm for comment, than I did on spreadsheet work.
But the result was a story that cut through the spin and showed one of the big four had made no progress in the percentage of female partners over three years, while its rivals had all managed to increase their proportion of female owners in the same time.
In other cases, such as the Corporate Tax Transparency data, I sort and filter a large set of figures to find potential story ideas.
Then you’re getting into a common skills such as how to calculate aggregates (mean, median, mode) and percentage change.
Once you’re comfortable with that sort of work, for the ambitious there is the pivot table — used in Excel and Google Sheets — where you effectively turn a spreadsheet into a database you can query via drag-and-drop. It’s the ultimate quick way to make use of large sets.
I also use spreadsheet graphing features a lot, but mainly to work out my story. I leave the fancy stuff to our graphic artists.
These aspects of data journalism, basic number crunching in a spreadsheet, are usually done as pre-story work. That is, you get the data and then have a look through it for anything that could be newsworthy. The data provides the tip-offs and all roads lead back to traditional reporting.
This involves ringing up a range of experts and contacts to check, verify and expand a story based on your tip-offs.
In the female partner story case, the numbers were the backbone of the story, and then it became about exploring why the firms’ success rates varied so much, through interviews and other information.
This is all iterative and in any story I move between the stages — from exploring the data to interviews back to the data. A contact might point out that the initial data crunch isn’t that useful and there’s no story or another area of data should be investigated.
Bonus! A concrete example of working with data
1. Sort and filter data
In this case I’ve used Google Sheets to sort the Corporate Tax Transparency for public companies to give me the two companies with the highest revenue — Wesfarmers and Woollies — and then copied the data into a separate spreadsheet.
2. Basic calculations.
In the case above, we wanted to know the tax rate — namely, how much taxes were paid as a proportion of income. That’s tax payable divided by taxable income. Other important formulas are (the self-explanatory) average, median, sum, max and min.
3. Organise data in a structured way for reporting.
This is how I would organise the above data for reporting. The idea is you want the information in the same way you might explain the data in a story. You might write that Wesfarmers’ revenue increased by $1.1 billion, or 1.6 per cent, in 2014–15.
The next step would be to find out why these companies had the variations in their revenue and tax paid, which would involve the regular journalism process — phone calls, company reports and other research. But you’ve now got some solid numbers to work from.
This piece is part of Nerd Box, an occasional series on data journalism in the Walkley Magazine.