Exploring machine learning in newsrooms

Melissa DiPento
Journalism Innovation
Jan 9, 2018

By Sebastian Auyanet and Melissa DiPento

Design by Melissa DiPento. Photo by Ricardo Gomez Angel on Unsplash.

In her 2017 presentation at the Online News Association Conference, quantitative futurist Amy Webb asked journalists to consider AI as the next layer of technology that will be integrated into everything we do.

According to her Future Today Institute 2017 Tech Trends Report, AI is simply “a branch of computer science in which computers are programmed to do things that normally require human intelligence.”

Newsrooms have already begun to experiment with AI and machine learning. Buzzfeed, for instance, trained a computer to search for hidden spy planes. And by using Congressional press releases from 2015 to now, ProPublica trained a computer model to identify which phrases each member of Congress uses more than other members do.

Newsrooms like ProPublica have already experimented with machine learning.
ProPublica used machine learning to understand what members of Congress care about most.
By using Congressional press releases from 2015 to now, ProPublica trained a computer to extract data.
Why ProPublica did this: “To help researchers and journalists understand what drives particular members of Congress and to enable regular citizens to compare their representatives’ priorities to their own.”
Buzzfeed used machine learning to search for hidden spy planes.

Everything counts (in large amounts)

In a piece introducing the Represent app, ProPublica’s Merrill explained: “We had two goals in creating it: To help researchers and journalists understand what drives particular members of Congress and to enable regular citizens to compare their representatives’ priorities to their own and their communities.”

He also explained why the app focuses on one particular source: press releases.

“Voting and drafting legislation aren’t the only things members of Congress do with their time, but they’re often the main way we analyze congressional data, in part because they’re easily measured. But the job of a member of Congress goes well past voting. They go to committee meetings, discuss policy on the floor and in caucuses, raise funds and ― important for our purposes ― communicate with their constituents and journalists back home. They use press releases to talk about what they’ve accomplished and to demonstrate their commitment to their political ideals.”

What we learned about machine learning in newsrooms

Both the Buzzfeed and ProPublica pieces bring up interesting use cases for how machine learning could be used to do compelling investigative work.

In both cases, though, we wondered how and why staff at these publications decided to try machine learning for these particular stories. Were they already reporting on the topics members of Congress gravitate toward, or looking for spy planes, and coming up short with traditional reporting methods? Or did these news organizations make a smart move by experimenting with machine learning on practical, data-driven topics that they knew a machine could work with?

Either way, we know that machine learning is indeed powerful, and something we will see newsrooms utilizing more in the future.

“Finding Congress members who sound alike is as easy as finding each member’s ‘nearest neighbor’ in this imaginary 100-dimensional space,” writes ProPublica’s Merrill on the promise of this technology.
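To make the "nearest neighbor" idea concrete, here is a minimal sketch. It assumes each member is represented by a vector of numbers and that closeness is measured with cosine similarity; the article does not say which distance measure ProPublica used, so that choice is an assumption. The member names and the four-dimensional vectors are made up for illustration (the real space has 100 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbor(name, vectors):
    """Return the other member whose vector is most similar to `name`'s."""
    target = vectors[name]
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine_similarity(target, vectors[other]))

# Hypothetical vectors: each number stands for how strongly a member's
# press releases lean toward one dimension of the space.
member_vectors = {
    "Member A": [0.9, 0.1, 0.0, 0.2],
    "Member B": [0.8, 0.2, 0.1, 0.1],
    "Member C": [0.0, 0.9, 0.7, 0.1],
    "Member D": [0.1, 0.8, 0.9, 0.0],
}

for member in member_vectors:
    print(member, "sounds most like", nearest_neighbor(member, member_vectors))
```

In this toy data, Members A and B point in nearly the same direction, as do C and D, so each pair comes out as nearest neighbors.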

But machine learning doesn’t come without drawbacks. As it stands now, it is not a panacea for mundane tasks and number crunching.

“The algorithm was not infallible: Among other candidates, it flagged several skydiving operations that circled in a relatively small area, much like a typical surveillance aircraft,” Buzzfeed’s Peter Aldhous said. “But as an initial screen for candidate spy planes, it proved very effective.”

Community feedback

Both the ProPublica and Buzzfeed pieces were widely shared on social media. At a time when citizens are concerned about the decisions Congress makes and about threats of spying, both stories gained considerable traction.

A Facebook commenter explains how she now has a better understanding of her member of Congress.

The ethical perspective

To deliver reliably on its promise of showing which distinctive topics your members of Congress care about, the model behind Represent has to be robust and detailed.

As ProPublica’s Merrill explains, “…the model’s strength is not in making obvious observations, but spotting things others might not.”

But it’s not just about spotting those things. It’s also about making sure that Represent provides the best possible picture of Congress members’ priorities.

“Just because a topic appears on one member’s list but not another’s doesn’t mean the second Congress member doesn’t care about it. There may simply be more distinctive topics that they talk about. And for now, that means big topics that lots of representatives and senators talk about, such as education or crime, aren’t included in each member’s list. But we’re working on ways to reflect those, too,” Merrill explained.

Validation is also key in this case, and Merrill addresses it when discussing Policy Priorities, a feature recently added to Represent. The goal of this feature is to help researchers and journalists understand what drives particular members of Congress and to enable regular citizens to compare their representatives’ priorities to their own and those of their communities.

“… the Policy Priorities broadly match up with what a congressional expert might say each congressperson focuses on. We validated them by checking to see if congressional committee chairs have a Policy Priority score above the median for issues that their committee deals with ― and 78 percent of them do. And members of Congress from the West are those who focus most on regional topics like ‘public lands and natural resources,’” Merrill said.
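The validation Merrill describes can be sketched in a few lines. This is only an illustration under stated assumptions: the member names, issues and scores below are invented, the median is taken over all members' scores for each issue, and the function name is hypothetical. ProPublica's actual data and code are not shown in the article.

```python
from statistics import median

def share_of_chairs_above_median(scores, chairs):
    """Fraction of committee chairs scoring above the median on their issue.

    scores: {issue: {member: score}} for every member of Congress.
    chairs: list of (member, issue) pairs, one per committee chair.
    """
    above = 0
    for member, issue in chairs:
        # Median score for this issue across all members (an assumption
        # about how "the median" in the quote is defined).
        issue_median = median(scores[issue].values())
        if scores[issue][member] > issue_median:
            above += 1
    return above / len(chairs)

# Invented scores for two issues and five members.
scores = {
    "agriculture": {"Chair A": 0.9, "Member B": 0.2, "Member C": 0.4,
                    "Chair D": 0.1, "Member E": 0.3},
    "energy": {"Chair A": 0.1, "Member B": 0.5, "Member C": 0.3,
               "Chair D": 0.2, "Member E": 0.6},
}
chairs = [("Chair A", "agriculture"), ("Chair D", "energy")]

print(f"{share_of_chairs_above_median(scores, chairs):.0%} of chairs score above the median")
```

Here one of the two invented chairs clears the median on their committee's issue, so the check reports 50 percent; ProPublica reported 78 percent for the real data.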

Regarding Buzzfeed’s hidden spy planes project, an ethical question may eventually be raised: how damaging to public safety is it to disclose the secret surveillance that an institution like the FBI may be conducting across the country? That question can be answered the same way we answer many similar questions in journalism: it is the journalist’s duty to present this information to the public so that patterns of abuse of this technology, and its potential use against citizens, can be detected.

Then there’s the right-to-privacy aspect, which we think this piece particularly nails: no matter what politicians say, it is in the public interest for journalism to find a way to expose suspicious moves by public offices and contractors.

AI is not science fiction

As newsrooms begin to see the benefits of using machine learning, we’ll see more and more use cases like the work ProPublica and Buzzfeed are doing. And while some newsrooms may not yet be ready to tackle a new challenge, many are realizing that they should start soon.

According to a recent AP report — A guide for newsrooms in the age of smart machines — the International Consortium of Investigative Journalists (ICIJ) directed nearly 400 journalists to analyze 2.6 terabytes of leaked emails, documents and databases. The result — the Panama Papers.

The ICIJ didn’t employ any artificial intelligence at the start of its research, but Matthew Caruana Galizia, the organization’s web applications developer, wishes it had.

“We were dealing with a vast amount of documents, and ICIJ just didn’t have the resources to investigate them all,” Caruana Galizia said. “But by using artificial intelligence, we would have been able to make that process much faster for all the journalists involved and end up with the same result.”

Although AI in newsrooms may not be as sophisticated as it needs to be yet, AI does something that human journalists cannot do alone, says Jonathan Stray.

“Quietly and without the fanfare of their robot cousins, the cyborgs are coming to journalism. And they’re going to win, because they can do things that neither people nor programs can do alone,” Stray writes.

Stray adds that newsrooms cannot yet rely solely on AI to produce content.

“Automated systems can report a figure, but they can’t yet say what it means; on their own, computer-generated stories contain no context, no analysis of trends, anomalies, and deeper forces at work. Reuters’s newest technology goes deeper, but with human help: It still writes words, but isn’t meant to publish stories on its own.”

One additional use case comes from the AP itself. According to the same report, the AP is using machine learning to automate the production of corporate earnings stories.

“Historically, the financial news staff of AP was saddled every three months with the enormous human task of reporting earnings for as many public companies as possible. An automation program introduced three years ago enabled the agency to increase its output of corporate earnings stories by an order of magnitude each quarter, essentially providing coverage of the entire U.S. stock market.”

The possibility of using machine learning or artificial intelligence for journalism has mostly been framed in a dystopic way: bots will replace journalists and eliminate (even more) jobs inside the industry.

That premise is a bit far-fetched. The conversation is more likely to keep moving toward a different, more plausible frame: more and more, journalists will apply machine learning to their story ideas and to the questions they want answered.

Incorporating machine learning will open the door for more journalists with coding skills to enter the workforce. But such an ambitious endeavor will likely require newsrooms to hire dedicated coders and integrate them with journalists. To put it simply: journalists will have to think more like coders in order to partner with them and to understand how far this new reporting tool can take them.

Contact Sebastian Auyanet at sebastian.auyanet@journalism.cuny.edu. Contact Melissa DiPento at melissa.dipento@journalism.cuny.edu

Engagement Journalism at the Newmark J-School. Journalism must be engaged, innovative and equitable.