Discussing the ethics, challenges, and best practices of machine learning in journalism

Peter Aldhous of BuzzFeed News and Simon Rogers of the Google News Initiative discuss the power of machine learning in journalism, and tell us more about the groundbreaking work they’ve done in the field, dispensing some tips along the way.

Marianne Bouchart
Jun 25, 2018 · 9 min read
BuzzFeed’s ‘Hidden Spy Planes’

What is it about AI that gets journalists so interested? How can it be used in data journalism?

Peter Aldhous: I think the term AI is used way too widely, and is mostly used because it sounds very impressive. When you say ‘intelligence’, mostly people think of higher human cognitive functions like holding a conversation, and sci-fi style androids.

Why and when should journalists use machine learning?

P.A.: As a reporter, only when it’s the right tool for the job — which likely means not very often. Rachel Shorey of The New York Times was really good on this in our panel on machine learning at the NICAR conference in Chicago in March 2018.

What kind of ethical/security issues does the use of machine learning in journalism raise?

P.A.: I’m very wary of using machine learning for predictions of future events. I think data journalism got its fingers burned in the 2016 election, failing to stress the uncertainty around the predictions being made.

I’m also wary of the black box aspect of some machine learning approaches, especially neural nets. If you can’t explain how your algorithm works to an editor or to your audience, then I think there’s a fundamental problem with transparency.

‘This Shadowy Company Is Flying Spy Planes Over US Cities’ by BuzzFeed News

What tools out there would you recommend in order to run a machine learning project?

P.A.: I work in R. There are also good libraries in Python, if that’s your religion. But the more difficult part was processing the data, thinking about how to process the data to give the algorithm more to work with. This was key for my planes project. I calculated variables including turning rates and the area of the bounding box around flights, and then worked with the distribution of these for each plane, broken into bins. So I actually had 8 ‘steer’ variables.
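The kind of feature engineering described here can be sketched roughly as follows. This is a minimal Python illustration, not the code used in the planes project (which was written in R). The track format `(timestamp_s, heading_deg, lon, lat)`, the bin edges in `STEER_EDGES`, and all function names are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical bin edges in degrees per second; 7 edges give 8 bins,
# echoing the 8 'steer' variables mentioned in the interview.
STEER_EDGES = [-4.0, -2.0, -0.5, 0.0, 0.5, 2.0, 4.0]


def turning_rates(track):
    """Signed turning rate (deg/s) between consecutive track points."""
    rates = []
    for (t0, h0, _, _), (t1, h1, _, _) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        # Wrap the heading change into [-180, 180) so 350 -> 10 counts as +20.
        dh = (h1 - h0 + 180.0) % 360.0 - 180.0
        rates.append(dh / dt)
    return rates


def bounding_box_area(track):
    """Area of the lon/lat bounding box, in squared degrees (a crude proxy)."""
    lons = [p[2] for p in track]
    lats = [p[3] for p in track]
    return (max(lons) - min(lons)) * (max(lats) - min(lats))


def binned_shares(values, edges):
    """Share of values falling into each bin defined by the sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def plane_features(track):
    """One row of features for one plane: binned turning rates plus box area."""
    shares = binned_shares(turning_rates(track), STEER_EDGES)
    features = {f"steer{i + 1}": s for i, s in enumerate(shares)}
    features["bbox_area"] = bounding_box_area(track)
    return features
```

Each plane ends up as one row of numbers, which is the shape a classifier expects. Working with the share of observations in each bin, rather than raw counts, keeps planes with many and few recorded positions comparable.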

There is simply no reliable national data on hate crimes in the US. So ProPublica created the Documenting Hate project.

What advice do you have for people who’d like to use machine learning in their upcoming data journalism projects?

P.A.: Make sure that it is the right tool for the job. Put time into the feature engineering, and consult with experts.

Don’t do machine learning because it seems cool.

Use an algorithm that you understand, and that you can explain to your editors and audience.

  • Could an intern do it?
  • If you actually asked an intern to do it, would you feel an overwhelming sense of guilt and shame?
  • If so, you might have a classification problem. And many hard problems in data journalism are classification problems in disguise.

What would you say is the biggest challenge when working on a machine learning project: building the algorithm, checking the results to make sure they’re correct, the reporting around it, or something else?

P.A.: Definitely not building the algorithm. But all of the other stuff, plus feature engineering.

  • We still need to manually delete things that don’t fit.
  • Critical when thinking about projects like this — the map is not the territory! Easy to conflate amount of coverage with amount of hate crimes. Be careful.
  • Always important to have stop words. Entity extractors are like overeager A students and grab things like ‘person: Man’ and ‘thing: Hate Crime’ which might be true but aren’t useful for readers.
  • Positive thing: it doesn’t just pull in examples of hate crimes; it also pulls in news about groups that combat hate crimes and support vandalized mosques, etc.
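The point about stop words can be sketched as a filter applied after an entity extractor has run. This is only an illustration: the `(type, text)` pair format, the contents of `ENTITY_STOP_LIST`, and the function name are assumptions, not the pipeline behind the project.

```python
# Hypothetical stop list: (entity type, lowercased text) pairs that are
# technically correct extractions but tell readers nothing.
ENTITY_STOP_LIST = {
    ("person", "man"),
    ("person", "woman"),
    ("thing", "hate crime"),
}


def filter_entities(entities, stop_list=ENTITY_STOP_LIST):
    """Drop extracted (type, text) pairs that appear in the stop list."""
    kept = []
    for ent_type, text in entities:
        key = (ent_type.strip().lower(), text.strip().lower())
        if key not in stop_list:
            kept.append((ent_type, text))
    return kept


# Illustrative input, shaped like the examples quoted above.
extracted = [
    ("person", "Man"),
    ("thing", "Hate Crime"),
    ("organization", "ProPublica"),
]
useful = filter_entities(extracted)
```

Matching on the type as well as the text means a word is only dropped when it appears as the unhelpful kind of entity, so the list can stay short and easy to review by hand.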

I fear we may see media companies use it as a tool to cut costs by replacing reporters with computers that will do some, but not all, of what a good reporter can do, and to further enforce the filter bubbles in which consumers of news find themselves.

Hopes & wishes for the future of machine learning in news?

P.A.: I hope we’re going to see great examples of algorithmic accountability reporting, working out how big tech and government are using AI to influence us by reverse engineering what they’re doing.

Peter Aldhous tells us the story behind his project ‘Hidden Spy Planes’:

‘Back in 2016 we published a story documenting four months of flights by surveillance planes operated by the FBI and the Department of Homeland Security.’

Should all this data be made public?

Interestingly, the military were pretty responsive to us, and made no arguments that we should not publish. Certain parts of the Department of Justice were less pleased. But the information I used was all in the public domain, and could have been masked from the main flight tracking sites. (Actually DEA does this.)

About the random forest model used in BuzzFeed’s project:

Random forest is basically a consensus of decision tree statistical classifiers. The data journalism team was me, and all of the software was free and open source. So it was just my time.
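The "consensus" idea can be illustrated with a toy ensemble in plain Python. This is a deliberately simplified sketch, not the model used in the project: each "tree" here is a single-split decision stump trained on a bootstrap sample and one randomly chosen feature, whereas a real random forest grows deeper trees and samples features at every split. All names and the example data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Each stump only sees one randomly chosen feature.
        feature = rng.randrange(n_features)
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest


def predict(forest, row):
    """Majority vote across all stumps: the consensus."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Made-up feature rows, for illustration only.
rows = [(0.2, 1.0), (0.3, 1.5), (0.1, 0.8), (3.0, 40.0), (2.5, 35.0), (3.5, 50.0)]
labels = ["other"] * 3 + ["surveillance"] * 3
forest = train_forest(rows, labels)
```

Any single stump can be wrong because it saw an unlucky sample, but the vote across many of them tends to be steadier. In practice a tested library implementation would be the sensible choice over hand-rolled code like this.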

If you had had a team to help with this, what kinds of people would you have included?

Get someone with experience to advise. I had excellent advice from an academic data scientist who preferred not to be acknowledged. I did all the analysis, but his insights into how to go about feature engineering were crucial.

Thanks to Freia Nahser

Marianne Bouchart

Written by

Founder @HeiDaHQ + @Data_Blog. Manager of the @sigmaawards. Former Bloomberg @business #ddj. Data Journalism Lecturer

Data Journalism Awards

The Data Journalism Awards are the first international awards recognising outstanding work in the field of data journalism worldwide. The 2019 edition is launched and data journalism teams from around the world can now apply.
