Three examples of machine learning in the newsroom

What experts have to say about the use of machine learning in the newsroom, and what data journalists can learn from it — Notes from the 2018 NICAR conference

In 1959, Arthur Samuel, a pioneer in machine learning, defined it as the ‘field of study that gives computers the ability to learn without being explicitly programmed’. In practice, machine learning means using algorithms to parse data, recognise patterns, and then make predictions and assessments based on what the algorithms have learnt.

Machine learning can be used for fact checking and it can make archiving less of a tedious task for journalists. It can let voice assistants like Alexa or Google Assistant know you’re pissed off based on the tone of your voice on a Monday morning and then play a song to cheer you up. It can also be used to explore scenes in Wes Anderson films and help uncover hidden spy planes. In short, machine learning systems could very well become essential journalism tools in the coming years. And good news: according to Walter Frick, a senior associate editor at Harvard Business Review, you no longer even need a PhD to do it.

In a session entitled ‘Getting started with machine learning for reporting’ at this year’s NICAR conference in Chicago, Peter Aldhous from BuzzFeed, Rachel Shorey from the New York Times, Chase Davis from the Minneapolis Star Tribune, and Anthony Pesce from the Los Angeles Times discussed machine learning and what’s in it for reporters. What type of story can machine learning help with? When is it not the answer? And, on a more technical note, how can you structure your data in order to optimise the algorithm you’ve decided to work with? The speakers also gave examples of how newsrooms have worked with machine learning.

Los Angeles Times: Machine learning to uncover skewed crime stats

Number-based strategies have come to dominate policing in Los Angeles and other cities in the US, but unreliable figures undermine crime mapping efforts and make it difficult to determine where police officers need to be sent.

In an investigation powered by machine learning algorithms, the Los Angeles Times uncovered that the Los Angeles police department misclassified an estimated 14,000 serious assaults as minor offenses from 2005 to 2012, thereby artificially lowering the city’s crime levels.

In 2009, for example, a man was stabbed by his girlfriend with a 6-inch kitchen knife during a domestic dispute. The police arrested the attacker, who was found guilty of assault with a deadly weapon. In the Los Angeles police department’s crime database, the attack was listed as ‘simple assault’. Due to this misclassification, the serious incident was left out of the department’s recording of violence in the city.

The Los Angeles Times used an algorithm that parsed crime data from a previous Times investigation in order to learn the keywords that identify assaults as either serious or minor. The trained algorithm was then let loose on a random sample of almost 2,400 minor crimes that took place between 2005 and 2012 to find which of these assaults were misclassified.
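
The idea of learning which words signal a serious assault can be sketched in a few lines of Python. This is a minimal illustration, not the Times’s code: the narratives below are invented, the function names are hypothetical, and multinomial naive Bayes is swapped in as a simple stand-in because this article does not say which algorithm the Times used.

```python
import math
from collections import Counter

# Hypothetical hand-labelled crime narratives (invented for illustration).
TRAINING = [
    ("suspect stabbed victim with kitchen knife", "serious"),
    ("suspect struck victim with baseball bat causing fracture", "serious"),
    ("victim shot in leg with handgun", "serious"),
    ("suspect pushed victim during argument", "minor"),
    ("suspect slapped victim no injury reported", "minor"),
    ("suspect shoved victim and fled", "minor"),
]

def train(examples):
    """Count how often each word appears in each class of narrative."""
    word_counts = {}        # label -> Counter of words
    doc_counts = Counter()  # label -> number of narratives
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Pick the most likely label (naive Bayes with add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
# A record filed as minor whose narrative reads like a serious assault.
print(classify("victim stabbed with knife during dispute", model))  # prints serious
```

Any incident filed as minor that the model labels ‘serious’ becomes a candidate for manual review, not a confirmed error.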

The results were manually checked to see how many incidents had been correctly flagged as misclassified crimes. The algorithm’s work was not perfect: the manual review found that it had incorrectly identified classification errors in 24 percent of flagged incidents. The Times then adjusted the estimated tally of misclassified crimes based on the error rate.
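
The adjustment itself is simple arithmetic. The sketch below assumes, purely for illustration, that the algorithm flagged 1,000 incidents; only the 24 percent error rate comes from the Times’s review, and the function name is hypothetical.

```python
def adjusted_tally(flagged, wrong_flag_share):
    """Scale the flagged count down by the share of flags that review found wrong."""
    return flagged * (1 - wrong_flag_share)

flagged = 1000           # hypothetical number of incidents the algorithm flagged
wrong_flag_share = 0.24  # share of flags the manual review found to be wrong

estimate = adjusted_tally(flagged, wrong_flag_share)
print(round(estimate))  # prints 760
```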

The journalists’ analysis concluded that violent crime was in fact 7 percent higher and the number of serious assaults was 16 percent higher than the Los Angeles police department reported.

In response to the Times investigation, a series of changes has been launched, aimed at improving internal accountability and the training officers receive in classifying crimes.

Find the data and code of this machine learning investigation here.

New York Times: Shazam-ing members of Congress

Another project featured in the NICAR panel was ‘Who the hill?’, an app that has been referred to as ‘Shazam, but for House members’ faces’. It is an MMS-based facial recognition service that identifies members of Congress. Reporters can text pictures to a number The New York Times team has set up.

Getting started with machine learning: slides from NICAR

The face recognition app was built by two New York Times interactive interns, Gautam Hathi and Sherman Hewitt. ‘Reporters can use it to help figure out who is talking or presenting if they missed the intro or if they run into a member they don’t immediately recognise in the halls of the capitol’, wrote Shorey in our exchange of emails.

It was recently used in a different context: Shorey and her team were reporting on a Christmas party at the Trump International Hotel, hosted by the America First Super PAC. She used an Instagram image, posted by the company that provided decor for the party, to confirm that a congresswoman was in attendance.

‘We were interested in giving our readers some context about who attends this sort of event. Parties at Trump Hotel are of particular interest because of the financial connection to the president’, wrote Shorey in an email.

Read more on the story here.

Getting started with machine learning for reporting: slides from NICAR

BuzzFeed: In search of ‘spies in the skies’

BuzzFeed trained a computer system to recognise surveillance planes from the FBI and the Department of Homeland Security (DHS) in order to reveal secret aircraft activity. A great write-up of the project, which we have summarised below, can be found here.

First, the BuzzFeed team obtained flight-tracking data from Flightradar24 of 20,000 planes in a four-month period and used it in a series of calculations to describe aircraft characteristics and flight patterns, such as turning rates, speeds, and altitudes flown.
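
As a rough sketch of this kind of calculation, the Python below summarises one aircraft’s track. The input layout (seconds, heading, speed, altitude) and the function names are assumptions made for illustration; they are not Flightradar24’s data format or BuzzFeed’s code.

```python
from statistics import mean

def heading_change(h1, h2):
    """Smallest signed change in degrees from heading h1 to h2 (handles 350 -> 10)."""
    return (h2 - h1 + 180) % 360 - 180

def describe_track(points):
    """Summarise a list of (seconds, heading_deg, speed_knots, altitude_ft) points."""
    turn_rates = []
    for (t1, h1, _, _), (t2, h2, _, _) in zip(points, points[1:]):
        if t2 > t1:  # skip duplicate timestamps
            turn_rates.append(abs(heading_change(h1, h2)) / (t2 - t1))
    return {
        "mean_turn_rate": mean(turn_rates) if turn_rates else 0.0,  # degrees per second
        "mean_speed": mean(p[2] for p in points),
        "mean_altitude": mean(p[3] for p in points),
    }

# Invented tracks: one plane turning steadily, one flying almost straight.
circling = [(0, 350, 100, 5000), (10, 20, 100, 5000), (20, 50, 100, 5000)]
straight = [(0, 90, 450, 35000), (10, 90, 450, 35000), (20, 91, 450, 35000)]

print(describe_track(circling)["mean_turn_rate"])
print(describe_track(straight)["mean_turn_rate"])
```

The circling track produces a far higher mean turning rate than the straight one, which is the kind of contrast a classifier can pick up on.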

A machine learning ‘random forest’ algorithm was then trained on a sample of almost 100 previously identified FBI and DHS planes and 500 randomly selected aircraft, so that it could learn the characteristics that distinguish the surveillance planes. Aldhous points out that the random forest algorithm makes its own decisions about which aspects of the data are most important: given that spy planes tend to fly in tight circles, the algorithm put the most emphasis on the planes’ turning rates.

Once adequately trained, the algorithm was let loose on all 20,000 planes found on Flightradar24, calculating the probability of each aircraft being a match for those flown by the FBI and DHS.
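
To make the train-then-score idea concrete, here is a toy random forest in plain Python. It is a deliberately simplified sketch, not BuzzFeed’s analysis: every tree is a single split (real libraries grow much deeper trees), the aircraft data are invented, and all names are hypothetical. Each tree sees a bootstrap sample of the planes and a random subset of the features, and the forest averages the trees’ votes into a probability.

```python
import random
from collections import Counter

NAMES = ["turn_rate", "speed", "altitude"]
# Invented training data: (turn_rate, speed, altitude), 1 = known spy plane.
TRAIN = [
    ((3.1, 95, 5000), 1), ((2.8, 105, 5500), 1),
    ((3.4, 100, 4500), 1), ((2.6, 110, 6000), 1),
    ((0.2, 450, 35000), 0), ((0.1, 120, 5000), 0),
    ((0.4, 100, 4800), 0), ((0.3, 300, 20000), 0),
    ((0.5, 98, 6500), 0), ((0.2, 430, 33000), 0),
]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, features):
    """Find the single feature/threshold split with the lowest impurity."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[0]:
                # Store the share of spy planes on each side of the split.
                best = (impurity, f, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:] if best else None  # None when no split is possible

def fit_forest(rows, labels, n_trees=200, features_per_tree=2, seed=0):
    """Train each tree on a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        features = rng.sample(range(len(rows[0])), features_per_tree)
        stump = fit_stump([rows[i] for i in sample], [labels[i] for i in sample], features)
        if stump is not None:
            forest.append(stump)
    return forest

def spy_probability(forest, row):
    """Average the trees' votes for one aircraft."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return sum(votes) / len(votes)

def feature_use(forest, names):
    """Count how often each feature was chosen: a crude importance measure."""
    return Counter(names[f] for f, *_ in forest)

forest = fit_forest([row for row, _ in TRAIN], [label for _, label in TRAIN])
print(spy_probability(forest, (3.0, 100, 5000)))   # slow plane flying tight circles
print(spy_probability(forest, (0.2, 440, 34000)))  # fast, high, nearly straight
print(feature_use(forest, NAMES).most_common())
```

In this invented data the turning rate separates the two groups most cleanly, so most trees end up splitting on it. In practice you would reach for a tested library implementation rather than hand-rolled code like this.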

A striking discovery was that a military contractor normally tracking terrorists in African countries is also flying surveillance aircraft over US cities. The machine learning algorithm also found regular surveillance flights over the San Francisco Bay Area in 2015 that contractors claimed were involved in a project studying the world’s rarest mammal (the vaquita in case you were wondering). However, BuzzFeed journalists noted that the flights were mostly circling over land and it was later confirmed that the planes were actually supporting naval operations training.

Flights by US Air Force Special Operations Command over the Florida Panhandle, January 2015 to July 2017. Military bases are shown in pink. Peter Aldhous / BuzzFeed News

The algorithm, however, was not perfect. It flagged skydiving operations that circled in small areas, mimicking the behaviour of spy planes.

‘It’s only by understanding when and how these technologies are used from the air that we’ll be able to debate the balance between effective law enforcement, national security, and individual privacy’, said Aldhous in the BuzzFeed article. Aldhous and his team won the ‘Data visualisation of the year’ award at the Data Journalism Awards 2016 competition for this project.

Read more about their findings here.

But what actually happens when you use machine learning?

It can be scary to launch yourself into a machine learning project, especially if you’ve never done it before.

During the NICAR session, Aldhous demystified the process. He came up with the following list of steps to put a machine learning project together:

  • Find a good library in your favourite programming language;
  • Read the documentation;
  • Confirm this is actually a good approach for you and that you understand all the inputs and outputs (even if you don’t understand all the maths);
  • Spend days to weeks cleaning your data;
  • Write around ten lines of code.

How do you know if your data is a good candidate for machine learning?

Chase Davis put forward these questions:

  • Is it repetitive/boring?
  • Could an intern do it?
  • But would you feel an overwhelming sense of shame if you asked an intern to do it?

Aldhous also reminded reporters that they must always remember to verify machine learning conclusions. ‘Otherwise you’re basically letting an algorithm do your job!’

Is machine learning always the answer though?

‘Other methods can sometimes get you 90 percent of the way in 10 percent of the time’, said Shorey. She pointed out simpler, and definitely less exciting, ways of solving a problem than machine learning:

  • Make a collection of data easily searchable;
  • Ask a subject area expert what they care about and build a simple filter or keyword alert;
  • Use standard statistical sampling techniques.
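
Two of these simpler options fit in a handful of lines. The sketch below is a minimal Python illustration with hypothetical function names and made-up records: a case-insensitive keyword alert, and a reproducible simple random sample for manual review.

```python
import random

def keyword_alert(records, keywords):
    """Return the records whose text mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [r for r in records if any(k in r.lower() for k in lowered)]

def random_sample(records, size, seed=0):
    """Draw a reproducible simple random sample of records to check by hand."""
    return random.Random(seed).sample(records, min(size, len(records)))

# Made-up records for illustration.
records = [
    "Assault with a KNIFE reported downtown",
    "Noise complaint on Main Street",
    "Victim struck with a bat",
]

print(keyword_alert(records, ["knife", "bat"]))
print(random_sample(records, 2))
```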

Want to learn more about machine learning and data journalism? We will be discussing the topic on Slack with experts on 5 April 2018 at 9:30 am Pacific time. Sign up here.

Editor’s note: The article was amended on 16 March. The following sentence from this LA Times article was added for clarity: ‘The Times then adjusted the estimated tally of misclassified crimes based on the error rate.’