Beyond Journalism + Computation

The Hoover Tower, the icon of Stanford University, shines against a cloudless blue sky. It is surrounded by graceful school buildings that resemble beautiful European churches. Stanford’s campus is a popular sightseeing spot for tourists from around the world, and today people are taking photographs everywhere.

I had imagined Silicon Valley as rows of next-generation glass buildings, but the scenery here is far from that cutting-edge image. Beyond the tiled buildings lie green fields and eucalyptus groves. I almost asked whether this could really be the campus that serves as an engine of Silicon Valley, producing state-of-the-art technology.

But yes, this is indeed part of Silicon Valley. Lively lectures and events on new technologies and their uses are held here all day long.

Journalism is no exception. In October 2016, the “Computation + Journalism Symposium” was held at Stanford. At this two-day meeting, we discussed how to use and analyze big data to find news.

From the “Computation + Journalism Symposium” held at Stanford.

In the digital age, traditional media such as newspapers, magazines, and television have long been losing their presence. Print media in the US is in especially bad shape. According to Pew Research Center statistics, 2015 was the most difficult year yet for newspaper companies: weekday circulation fell 7% from the previous year, the largest decline since 2010, and the number of reporters and editors dropped 10%, the largest cut since 2009, just after the financial crisis.

How can newsrooms produce high-quality journalism and reduce costs with less income (subscription fees and advertising revenue) and fewer talented people? This serious, contradictory situation is why a computational approach is being aggressively pursued in US journalism.

IT × The Washington Post — Bezos brings clear changes

Various data-analysis methods and practices were reported at the conference. One of the most remarkable challenges was taken up by The Washington Post, the prestigious capital-city paper acquired in 2013 by Jeff Bezos, CEO of Amazon. Bezos is promoting reform in the newsroom, positioning digitalization as the source of growth. What challenges are they tackling in the newsroom?

Before Bezos became involved in management, The Post had laid off reporters one after another. Since then, the company has recruited more than 100 reporters and restored its editorial staff to 700.

In addition to the editorial staff, 35 engineers were hired, and various systems they built are now used in the newsroom. One of them is “Bandito,” launched in February 2016: a custom-built, real-time content-testing tool that lets editors test how articles are presented on the website and on mobile devices.

An editor feeds multiple variations of a piece of content, with different titles and different photos, into Bandito. The system then learns which version readers prefer and automatically displays the most popular one on The Post’s website and mobile apps.

Similar systems are used by other digital media, but what makes Bandito distinctive is that the display changes in real time in response to readers’ clicks.
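As the name hints, this kind of real-time testing is typically built on a multi-armed bandit algorithm, which gradually shifts traffic toward better-performing variants while the test is still running. The following is a minimal, hypothetical sketch of that idea in Python, not The Post’s actual implementation.

```python
import random

# Minimal sketch of the multi-armed bandit idea behind a tool like
# Bandito (hypothetical code, not The Post's actual implementation).
# Each "arm" is one variant of an article (headline + photo).
class HeadlineBandit:
    def __init__(self, variants, epsilon=0.1):
        self.variants = variants                     # e.g. ["A", "B", "C"]
        self.epsilon = epsilon                       # exploration rate
        self.impressions = {v: 0 for v in variants}
        self.clicks = {v: 0 for v in variants}

    def choose(self):
        """Usually serve the best variant so far; occasionally explore."""
        if random.random() < self.epsilon:
            return random.choice(self.variants)      # explore
        # exploit: highest observed click-through rate
        return max(self.variants,
                   key=lambda v: self.clicks[v] / max(self.impressions[v], 1))

    def record(self, variant, clicked):
        """Update statistics after each page view."""
        self.impressions[variant] += 1
        if clicked:
            self.clicks[variant] += 1

# Simulated readers whose true preferences match The Post's finding (C > B > A).
bandit = HeadlineBandit(["A", "B", "C"])
for _ in range(1000):
    v = bandit.choose()
    clicked = random.random() < {"A": 0.02, "B": 0.04, "C": 0.06}[v]
    bandit.record(v, clicked)
print(bandit.clicks)  # variant "C" should end up served, and clicked, most
```

With each click feeding back into the statistics, the losing variants are shown less and less, which is how the display can keep changing in real time without an editor intervening.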

Once editors post the variants, they do not have to take any further action and can efficiently move on to editing another article. Reporters and editors also get feedback on which versions are more likely to be read.

Here is an example. The following is a Post article about Marie Kondo, author of “The Life-Changing Magic of Tidying Up,” which became a major best seller all over the world. Which version displayed by Bandito is most likely to be read?

Variant 1
Variant 2
Variant 3 (cited from The Post’s developer site)

Compare these three samples. The first two differ in their titles; the third differs in its photo.

A: “Why Marie Kondo’s life-changing magic doesn’t work for parents”

B: “The real reasons Marie Kondo’s life-changing magic doesn’t work for parents”

Variant C has the same title as B, but its photo shows Kondo’s face.

According to The Post’s analysis, the most-read variant was C, followed by B, with A read the least. The Post also found that readers are more likely to click on photos of people’s faces, and that “the real reasons” was more appealing to readers than “why.”

The Post is also trying to introduce a system that automatically generates titles. The title is one of the most important factors in a reader’s decision to read an article, yet editors currently write different titles for each platform: the web, social networks such as Twitter and Facebook, mobile applications, and so on. It is a truly time-consuming process, and the motivation for developing such a system is to save editors’ labor and time.

From the “Computation + Journalism Symposium” at Stanford.

The Post is currently testing the system with machine learning, but so far only a few titles clearly appear to have been generated automatically.

If this system runs at full capacity, it may be able to attract foreign readers as well. A title built on an American joke, for instance, is barely understandable in direct translation, but a computer may be able to rewrite titles automatically into versions that appeal to readers in each country. Looking at these examples, The Post seems more like an IT company than a media company.

Since Bezos’ acquisition, many changes have certainly taken place. In October 2015, according to the research firm comScore, visitors to The Post’s website exceeded those of its rival The New York Times for the first time, which became big news.

Not only at The Post but at major US media companies in general, newsrooms are changing dynamically through collaboration with specialists in computer science, mathematics, programming, and statistics. Reporters are now expected at least to use and analyze data for their reporting, if not to write code. Perhaps the era of excuses like “I’m not good at math because I was an art major…” is over.

A story emerging from big data

At the Symposium, free tools that reporters can use in their work were also introduced.

One of these is “Campaign Finance,” launched by a New York University team that conducts investigative research using artificial intelligence.

The website is well organized — the flow of money in the 2016 election campaign can be seen at a glance — and it helps reporters find clues in the news quickly and efficiently.

The Campaign Finance website draws on six huge information sources, including Twitter and data released by the Federal Election Commission. You can see a kind of “household account book” for more than 4,000 candidates, with visual analyses of the political funding organizations that support each candidate, the personal donations they received, their expenditures, and so on.

Website of “Campaign Finance”: http://campaign-finance.org/

Looking at presidential candidate Donald Trump, the biggest spending categories are media, online advertising, and travel costs for campaign activities. This is no surprise, since those are necessary items in an election campaign.

But then we notice a large expenditure: the fourth-biggest category is “T-shirts, mugs, stickers.” Approximately 6% of total expenses fall into this category, a rather irregular figure.

What is going on here?

By visualizing data drawn from a huge mass of numbers, reporters can get a hint of the storyline and find ways to discover new news.
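The analysis underneath a view like this is straightforward aggregation: group a campaign’s disbursements by purpose, compute each category’s share of the total, and look for categories whose share seems out of place. Here is a minimal sketch in Python; the file name and column names are hypothetical placeholders, since real FEC bulk data uses its own schema.

```python
import csv
from collections import defaultdict

# Minimal sketch: aggregate a campaign's disbursements by purpose and
# compute each category's share of total spending. "disbursements.csv"
# and its columns ("purpose", "amount") are hypothetical stand-ins for
# real FEC bulk data.
totals = defaultdict(float)
with open("disbursements.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["purpose"]] += float(row["amount"])

grand_total = sum(totals.values())
for purpose, amount in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{purpose:35s} {amount / grand_total:6.1%}")
# A category like "T-shirts, mugs, stickers" sitting at roughly 6% near
# the top of this list is exactly the kind of anomaly that hints at a story.
```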

Indeed, the online media outlet Skift, which also participated in the symposium, used this site and later published an article titled “Clinton vs. Trump: Where Presidential Candidates Spend Their Travel Dollars.”

Just copy and paste …

Another tool introduced as one that anyone can easily use is “TimeLineCurator,” created by Tamara Munzner, a professor of computer science at the University of British Columbia.

TimeLineCurator makes it very easy to visualize history and show what happened in each year. For instance, you can copy the historical text you want from Wikipedia, paste it into TimeLineCurator’s text input box, and click “GO.” Years are then automatically extracted from the text, and a timeline combining dots and lines is created. It is neat and looks like a musical score.
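The core idea, pulling dates out of free text and placing them on an axis, can be sketched in a few lines. TimeLineCurator itself uses proper natural-language date tagging; the regular expression below is a simplified, hypothetical stand-in for illustration only.

```python
import re

# Simplified sketch of TimeLineCurator's core idea: extract four-digit
# years from pasted free text and render them as dots on a timeline.
# (The real tool uses natural-language date tagging; this regex is a toy.)
def extract_years(text):
    return sorted({int(y) for y in re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text)})

def ascii_timeline(years, width=60):
    """Render the extracted years as '*' marks on a single text line."""
    lo, hi = min(years), max(years)
    span = max(hi - lo, 1)
    line = [" "] * (width + 1)
    for y in years:
        line[round((y - lo) / span * width)] = "*"
    return f"{lo} |{''.join(line)}| {hi}"

sample = ("The Declaration of Independence was signed in 1776. "
          "The Constitution was ratified in 1788; the Civil War "
          "began in 1861 and ended in 1865.")
print(extract_years(sample))                  # [1776, 1788, 1861, 1865]
print(ascii_timeline(extract_years(sample)))
```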

This is what American history looks like: a timeline created from the first paragraph of the Wikipedia article “History of the United States.”

It is especially interesting that the system can combine multiple time-series datasets into a single graph.

If you want to see the history of Scandinavian music in 1975, you need only copy English-language texts on Danish, Icelandic, Finnish, Norwegian, and Swedish music history and paste them into the respective tabs. Then, with one click, you can visually grasp the periods of musical innovation in those countries’ histories. The more dots you see, the more likely you are to “find something” there.

It is true that digitalization has changed the entire media landscape. But it is also true that tools like these help us discover stories we might never have found before. Now is the time to use them at full capacity.

*This article was published in COURRiER JAPON, a Japanese magazine, on November 2, 2016.