Preach the therapeutic value of data entry. Treat data like it’s human. Show that ignorance is an asset.
Every time I announce that we will use data-driven investigative methods I watch my students’ faces grow ashen. Then the hands start to rise.
“I’m terrible at maths,” one student will say. “Am I going to fail?”
“I don’t own a calculator,” panics another.
“What is data?” someone else will ask. “It sounds awful.”
Indeed, the first hurdle in teaching data journalism is emphasising its utility and even its glamour. Given the number of extraordinary examples to have emerged in recent years — from big-ticket investigations like the Panama Papers and the Nauru Files through to more playful applications like WNYC’s Ice Cream Radar — this is usually an easy sell. But inevitably there will be a point at which someone offers up some iteration of the idea: “That’s great, but it’s not something we can do.”
It is not just student reporters who baulk at the idea of data-driven investigation — the myth that journalists are poor at maths is pervasive. But mathematical ability can be developed, and it’s a great shame that so many students and professional journalists think data journalism is out of reach for all but a few specialist reporters. Data-driven investigations have a lot to offer newsrooms: they produce transparent, credible and exclusive narratives that can have enormous social and political impact. So even reporters who have never opened a spreadsheet should not throw such a powerful research method out of their journalistic toolset.
In fact, there has never been a better time to work with data. Open data sets have proliferated, the increasing number of online tools means the process is more and more accessible to beginners, and data journalists tend to be open about their research process, so there are frameworks and models available to guide amateurs.
Journalists don’t need multimedia experience, coding ability or statistical proficiency in order to take on a data-driven project. It’s really about designing a project that fits a reporter or team’s skillset and applying core journalistic research methods to a new kind of source.
Each year, my cohort of about a dozen data-terrified students collaborate on a large-scale public interest data journalism project. If they can succeed, there’s no reason any other data-newbie can’t. Many data-driven investigative methods and techniques can be picked up by a conscientious reporter with an interest in diversifying their skills. Here, I share the process behind one of our data-driven investigations in the hope that elements of it might be applied to other projects.
Step 1: Conceptualise the project
Data-driven reporting is often more open and exploratory than other investigative work and requires a degree of patience and trust (and sometimes, collaboration). Although it will begin with a topic and research question, the news lead and key insights might not emerge until late in the investigation. For this reason, it’s important to be selective in terms of topic choice. In general news reporting, a statement or an answer earns coverage because it is newsworthy; in data-driven reporting, the investigative question itself must be inherently newsworthy.
For example, in 2016, 15 students and I set out to investigate: “Who are Australia’s most and least active federal politicians?” We timed the investigation to coincide with a federal election and although we initially had no idea which MPs would score well on an index of political activity (and who would perform badly), we knew the results would be of broad interest.
The initial research question (in this case, “Which politicians are the most and least active?”) then needs to be refined into a research strategy.
Most people think of data as a spreadsheet, and it can be. There are some great data sets available on the federal and state governments’ open data websites and through the Australian Bureau of Statistics. But websites, reports, documents and even photographs and videos can all be data, too.
In our case, we used Hansard as our primary source. We immediately realised that it was too difficult to assess political activity within an MP’s electorate because it would manifest in too many, varied ways (from speaking to constituents to attending local events). What we could fairly assess was what each MP did within the structures of parliamentary procedure. So we restricted our investigation to the House of Representatives (150 federal MPs), and confined it to one political term (so that political newcomers could be fairly assessed against veteran MPs). We settled on four ways to quantify activity in the lower house:
- Attendance: How often MPs were present on parliamentary sitting days.
- Committee work: How many committees an MP served on and the length of service.
- Speeches: How many speeches the MP made, the category of speech (Hansard differentiates speeches on legislation, for example, from constituency issues) and the topic of each speech.
- Questions: Questions asked/answered, in writing or without notice, and the topic of each question.
We also collected some basic information about each politician: their electorate, state, party, gender, age, social media use and number of political terms served. This meant we could extend our analysis beyond “Which MPs were the standout performers?” to suggest overall trends: “What kind of politicians tend to be more active, in which ways?”
Step 2: Data collection
For reporters using an existing data set, this step is unnecessary. But for those developing a data set, the collection can be as simple or as complicated as they like. Because we had a team of 15 student reporters, we were able to manually collect some data.
Of course, most newsrooms don’t have the luxury of a team of data-collectors, so it’s worth investigating automatic ways to retrieve information. We used a crawler called import.io to do most of the heavy lifting. Don’t be put off by a term like “crawler” — this free, web-based program required no expertise or coding, and was extremely user-friendly. Basically, we fed it a series of web pages from Hansard. For example, we brought up all of Malcolm Turnbull’s speeches in the current term and import.io spat out a spreadsheet with each speech type, date and topic.
Almost instantly, we had access to 20,000 speeches and 3,500 questions.
The only data we had to collect manually was each politician’s political and demographic details, and their committee work. Because of the number of data “coders”, it was important to create systems to ensure each researcher collected information consistently (in the same format, making the same assumptions) and reliably.
Each student was assigned a list of MPs and then recorded the relevant information (for example, which committees they served on, their gender, how many terms they had served in government) into templated Excel spreadsheets, which had drop-down menus for factors like political party or gender so that they were consistently formatted. We put some cross-checking strategies in place, too, so that we didn’t ruin anyone’s political career with dodgy data collection: there were certain politicians whom everyone cross-coded (which meant we could identify incorrect data collection), and we made sure our results tallied against known figures (e.g. how many male and female MPs were in parliament).
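For readers who want to see what this kind of tallying looks like beyond a spreadsheet, here is a minimal sketch in Python. The records and expected totals are illustrative only, not the project’s real data; the idea is simply to compare coded tallies against independently known figures.

```python
# Sketch of a tally cross-check: do our coded records match known totals?
# All names and numbers below are hypothetical.
from collections import Counter

records = [
    {"mp": "MP A", "gender": "F", "party": "ALP"},
    {"mp": "MP B", "gender": "M", "party": "LNP"},
    {"mp": "MP C", "gender": "F", "party": "ALP"},
]

def tally_check(records, field, expected):
    """Return, for each expected value, whether the coded count matches."""
    counts = Counter(r[field] for r in records)
    return {value: counts.get(value, 0) == total
            for value, total in expected.items()}

# Suppose we know (hypothetically) the chamber has 2 female and 1 male MP:
print(tally_check(records, "gender", {"F": 2, "M": 1}))
```

A check like this won’t catch every coding error, but it flags systematic problems (a miscoded column, a duplicated MP) cheaply, before analysis begins.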
Theoretically it’s possible for a newsroom to outsource data collection to keen volunteers as long as the cross-checking systems are strong enough, but directly supervising the data collection made it easier to be confident about the results. Believe it or not, most of the students enjoyed this step because it’s procedural, simple and mindless, and there’s a sense of creating order from information chaos — some have even described it as “therapeutic”.
Step 3: Analysis
Although the analysis is where many reporters feel out of their depth, it’s actually where traditional journalistic skills come into play most directly. In many ways, the best approach to data is to treat it like any other source: interrogate it and ask it questions.
Some of our questions were easy to answer. For example, to find out which MPs were most active in question time, all we had to do was sort the Excel spreadsheet and a few seconds later we had our answer. We could also narrow down major questions such as “Who made the most speeches?” to specific questions such as “Who spoke most about local issues in parliament?” There were automated ways to find information like the MPs’ average number of speeches, questions or absences, and even to answer questions such as “What parliamentary sitting days had the lowest MP turnout?”
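We did this sorting and counting in Excel, but the same questions can be asked of the data in a few lines of Python. This sketch uses made-up speech records with illustrative field names; it ranks MPs by speech count overall, then within one Hansard category.

```python
# Sketch: "Who made the most speeches?" and "Who spoke most about local issues?"
# The records and category labels are hypothetical.
from collections import Counter

speeches = [
    {"mp": "MP A", "category": "constituency"},
    {"mp": "MP A", "category": "legislation"},
    {"mp": "MP B", "category": "constituency"},
    {"mp": "MP A", "category": "constituency"},
]

def top_speakers(speeches, category=None):
    """Rank MPs by number of speeches, optionally within one category."""
    rows = [s for s in speeches if category is None or s["category"] == category]
    return Counter(s["mp"] for s in rows).most_common()

print(top_speakers(speeches))                  # [('MP A', 3), ('MP B', 1)]
print(top_speakers(speeches, "constituency"))  # [('MP A', 2), ('MP B', 1)]
```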
For more complex analysis, such as correlations, it’s worth consulting an expert — statistical analysis is the only part of this process that a data-newbie really shouldn’t attempt. So, when it came to questions like “Do female politicians tend to be more or less active, or perform in different ways to male politicians?” we sought the help of an analyst — armed with the student journalists’ questions to drive this enquiry.
In cases where hiring an analyst is impossible and there isn’t a specialist journalist to collaborate with, university lecturers, PhD candidates/graduates (especially those in science and the social sciences) and marketers may also be able to perform statistical analysis, and the process needn’t be time-consuming. We were dealing with 150 MPs, 166 sitting days, 3,500 questions and 20,000 speeches. The analyst ran the correlations and answered our questions within two hours.
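For the curious, a correlation of the kind our analyst ran is not mysterious. The sketch below computes Pearson’s r from first principles on invented figures (terms served versus speeches made, for five hypothetical MPs). It illustrates the mechanics only; interpreting significance on real data is exactly where the expert earns their keep.

```python
# Sketch: Pearson correlation coefficient, computed from first principles.
# The data below are invented for illustration.
import math

def pearson(xs, ys):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

terms    = [1, 2, 3, 4, 5]       # hypothetical terms served
speeches = [40, 55, 60, 80, 95]  # hypothetical speeches made
print(round(pearson(terms, speeches), 3))  # 0.987 — a strong positive correlation
```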
Step 4: The story
This is the final and most rewarding step. In our case, we had a series of strong conclusions from the data and could start divvying them into stories. We knew the best and worst performing MPs in each category. We could also identify which issues had dominated the current term’s speeches and Question Time (economic issues and government affairs), and which had slipped under the radar (social issues like same-sex marriage and Indigenous affairs).
The statistical analysis also provided fodder for stories: female politicians were more likely to advocate for their local areas, made more speeches and asked fewer questions than their male counterparts. A third of Question Time was consumed with Dorothy Dixers. Twitter users were less likely to speak up about local issues in parliament.
Even without any graphic design experience, students were able to visualise their findings using free online tools with simple drag-and-drop interfaces. We used Silk to make sure that readers could search for their local MP and pull up a “report card” on their parliamentary activity. We used Piktochart, Info.gram and Easel.ly to visualise the stories. Tableau, Timeline and Chart.js are great tools, too.
Presenting the findings is the point at which some inexperience with data is actually an enormous asset. Most readers are intimidated by reporting that is too data-heavy, so the challenge is translating figures into a format readers will understand. Someone who doesn’t naturally gravitate towards numbers is probably going to work harder to make sure that they tell a clear and compelling story than a data wizard!
Caroline Graham is a journalism lecturer at Bond University and the co-author of Writing Feature Stories: How to Research and Write Articles — From Listicles to Longform.
This data-driven investigation was published by Guardian Australia and the full series of student-authored reports appears at: http://www.unipollwatch.org.au/house-divided.
Caroline is happy to share any of the frameworks, spreadsheets and data collection forms used in this project, and to collaborate on public interest data investigations — email email@example.com
This piece is from Issue 88 (March 2017) of the Walkley Magazine.