A Distant Reading of Campaign Journalism, 1968–2012

Sam Mather

Quantitative analysis of texts has become more popular in the last decade, from the University of Nebraska’s graphs of novels’ plots by emotional intensity (operationalized by assigning values to certain words and then counting them up), to commercial presses printing Franco Moretti’s books, to Dominique Pestre’s history of the World Bank through quantitative study of its documents’ wording. Exactly how useful these techniques will turn out to be is still unknown, but as the above examples show, they have been applied with some success to a range of topics. It is worth noting that in practically every major case, the text mining is meant to complement deeper and broader study, not to offer much in the way of conclusions on its own. The Moretti/Pestre narrative of the World Bank, for example, draws on new data but generally echoes the conventional arc of the Bank’s development. No one would claim that a graph of a plot tells us what we need to know about a book. The premise, instead, is that there is something missing from traditional reading and that we can supplement it with the new tools of textual analysis. This is the premise of Moretti’s “distant reading”. Journalism seems especially suited to quantitative textual analysis, considering its high-volume production and its focus on picking each word carefully.

My plan is to study the New York Times’ presidential election coverage and the issues that have come up in presidential elections since 1968. I will operationalize this by text mining the New York Times articles published from June to the election date in each US presidential election year since 1968 that refer to at least two candidates and the election, then using Voyant to analyze each year as a corpus and find the words most distinctive to each year. This will give an indication not only of what issues were important (economics, race, foreign policy, war, crime, social programs) but of which changed and which stayed, as well as how press norms changed. Which years discussed character more? Which policy? In what ways were policies discussed? In what terms were wars and economics understood? The New York Times, as the gold standard of journalism, is instructive in all of these fields: its discourse is, in many ways, reflective of the national news discourse, which many scholars have asserted is vital to understanding a nation’s political life.

There are traditional narratives of the press’ development, which are not worth revisiting in detail because this project focuses on such a narrow and recent window of time. But the most important historical narrative about the press is Habermas’ history of the public sphere, in which he connects newspapers to the economic utility of information under capitalism, and then traces some of the ways they came to also reflect more complicated and sophisticated interests, from politics to literature. And finally, Habermas argues, modernity struck — corporations accumulated so much money that the press became a tool in the manipulation of public opinion. Newspapers became less a means of disinterested debate and more a way of pushing an agenda that benefitted the powers that be. Habermas has been challenged many times on his historical claims, but his hypothesis that contemporary news shapes the public discourse — in this he echoed his mentors Adorno and Horkheimer, of course — is a solid basis for studying the news. However, the more contentious part of Habermas’ thesis — that the topics covered by the news are driven by vaguely sinister capitalist interests — is outside the scope of textual mining.

An alternate approach would be to situate this study in the history of political journalism as a profession and to look at the way the craft has been plied over time. That such changes have taken place is easy to demonstrate by comparing a contemporary newspaper with one from the Woodward and Bernstein era: images have taken over, and the dense prose has been reduced to narrow columns. But I find this substantially less interesting as a historical concern. Habermas’ arguments speak to large social questions in a way that limiting the study to journalistic practices would not. This project’s contribution will be to show the news at work as Habermas understood it, in which reporting is intrinsically as much about producing a subject matter as it is about objectively covering an already-important subject. The different key words in different elections are suggestive of the different media concerns in different years. They are also suggestive of the differences between contemporaneous reporting and how elections are remembered: which key words that can be empirically linked to an election cycle surprise us? Which ones make intuitive sense? What stays, and what is forgotten, from the media coverage of an election?

Of course, the best scholarship on journalism is going to follow a traditional humanities methodology: finding the best work and studying it closely. Another quantitative researcher could even look at syntactical changes in political journalism. In keeping with Moretti’s limited ambitions, I intend only to complement the scholarship of people who know past journalism more seriously and who have greater expertise in political journalism more generally. For my own curiosity I will check which candidate appears in the most articles and see how often that is the same candidate who won. There is already a sophisticated literature on “media effects” in campaigns, so this will be basically worthless, considering my limited focus of just the New York Times, but I would love to know.
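The curiosity check described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function names, the dictionary layout (year mapped to a list of article texts), and the plain substring matching are all hypothetical, and nothing here comes from ProQuest or Voyant.

```python
from collections import Counter

def most_covered_candidate(articles, candidates):
    """Return the candidate named in the largest number of articles."""
    counts = Counter()
    for text in articles:
        lowered = text.lower()
        for name in candidates:
            # Count each article once per candidate, not every mention.
            # Plain substring matching is an assumption and will over-match
            # (a surname inside a longer word would count).
            if name.lower() in lowered:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None

def share_matching_winner(corpora, candidates_by_year, winners):
    """Fraction of election years in which the most-covered candidate won."""
    years = sorted(corpora)
    hits = sum(
        most_covered_candidate(corpora[y], candidates_by_year[y]) == winners[y]
        for y in years
    )
    return hits / len(years) if years else 0.0
```

With only the New York Times and a handful of elections, the resulting fraction is a curiosity, not evidence about media effects.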

There are, of course, concerns about, and limits to, the conclusions that can be drawn from the final project. Because only one newspaper is under consideration, we cannot be confident that the results are representative of media coverage. I think the project is still instructive, since if there is any paper that guides overall coverage, it is the New York Times. Another risk of considering only one paper is exogenous changes: did editors’ approach to campaigns change? Did Times policy change? This could lead to a sudden spike in coverage that only tells us about the internal history of the New York Times. So certainly the conclusions drawn from this study will be dependent on what other scholars find and already know about the history of political journalism in the United States.

MSU’s subscription to ProQuest makes it possible to search through all the relevant New York Times articles by searching for keywords within a date range. The articles to be text mined will be found by using ProQuest’s advanced search feature to find articles from June to November in each election year that mention at least two candidates. The assumption is that any coverage of one candidate with respect to the election will have at least one mention of another candidate. This will also filter out articles that are simply concerned with a sitting president’s actions while he is up for re-election. However, it will not be a perfect filter against this; for example, McGovern criticizing Nixon’s foreign policy would show up as “relevant” in my search, when in fact its status as “campaign” journalism is debatable; it could also just be political journalism. There is also, of course, the risk that my assumption is wrong, that campaign articles do not always mention at least two candidates, and that my search methodology will exclude a lot of relevant articles.
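The same inclusion rule could also be re-checked locally once the article text is in hand. The sketch below is my own illustration, not ProQuest’s search syntax: the candidate list, the election terms, and the function names are assumptions chosen for the example.

```python
import re

# Illustrative candidate list for one cycle (an assumption for this sketch).
CANDIDATES_1968 = ["Nixon", "Humphrey", "Wallace"]

def mentions_at_least_two(text, candidates):
    """Return True if the text names at least two distinct candidates."""
    found = {
        name
        for name in candidates
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE)
    }
    return len(found) >= 2

def is_campaign_article(text, candidates, election_terms=("election", "campaign")):
    """Apply the two-candidate rule plus a mention of the election itself.

    The election_terms tuple is a guess at what "refers to the election"
    might mean in practice; it is not part of the original search design.
    """
    lowered = text.lower()
    has_election_term = any(term in lowered for term in election_terms)
    return has_election_term and mentions_at_least_two(text, candidates)
```

A check like this has the same weakness as the search itself: an article in which one candidate criticizes a sitting president’s policy passes the filter whether or not it is really campaign coverage.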

Once the relevant articles from an election year are found, it’s a simple matter of downloading the PDF files. Unfortunately, most of the PDFs are scans, not text documents, which means they will have to be OCR’d as well. This will be slow going.

Once the PDFs from an election’s coverage are OCR’d, we have a corpus. There will be a separate corpus for each election year under consideration, so we will end up with twelve.

The online tool Voyant is well-equipped to process my corpus to get the information I want: the words most distinctive to each election cycle. Voyant has been down all weekend so I can’t speak to what exactly is involved in submitting a corpus to it. However, Ted Underwood is confident that Voyant is extremely good at picking out the words most specific to a corpus compared to another corpus. So my plan is to count, say, every election except 1968 as one corpus, and the 1968 articles as another corpus, and use Voyant to find the 1968-specific words. And so on through all the other elections.
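To make the one-year-versus-all-the-rest comparison concrete, here is a minimal sketch of the idea in Python. It is not Voyant’s method, which I have not been able to inspect; it swaps in a simple smoothed log ratio of relative word frequencies, and the function names, the tokenizer, and the `min_count` threshold are all my own assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def distinctive_words(target_texts, other_texts, top_n=10, min_count=2):
    """Rank words by how much more frequent they are in the target corpus.

    Score = log(smoothed relative frequency in the target corpus /
                smoothed relative frequency in the comparison corpus).
    Add-one smoothing keeps words absent from the comparison corpus finite.
    """
    target = Counter(w for t in target_texts for w in tokenize(t))
    other = Counter(w for t in other_texts for w in tokenize(t))
    vocab_size = len(set(target) | set(other))
    target_total = sum(target.values()) + vocab_size
    other_total = sum(other.values()) + vocab_size

    scores = {}
    for word, count in target.items():
        if count < min_count:  # skip rare words, which are mostly noise
            continue
        p_target = (count + 1) / target_total
        p_other = (other[word] + 1) / other_total
        scores[word] = math.log(p_target / p_other)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def distinctive_by_year(corpora, top_n=10):
    """corpora maps year -> list of article texts; compare each year to the rest."""
    result = {}
    for year, texts in corpora.items():
        rest = [t for y, ts in corpora.items() if y != year for t in ts]
        result[year] = distinctive_words(texts, rest, top_n=top_n)
    return result
```

Calling `distinctive_by_year` on a dictionary with one entry per election year would repeat the 1968-versus-everything-else comparison for every cycle. OCR errors would show up as spurious "distinctive" words, so the output would need a read-through by hand.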

To my surprise, ProQuest provides a graph of month-by-month distribution of the results when you search for articles with certain key words in a particular range of dates. It shows a bar graph indicating the number of relevant articles from each month in the range. This means I won’t have to do any work to put together graphs showing how the quantity of coverage changes over the course of each election season, and very little to show how the quantity changes year over year. It looks like it follows the pattern I expected: increasing each month. While this is logical, it runs a little bit counter to experience, which is that elections get a bit boring from July to October, when nominees have been decided but the general election is a while in the future.
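If I ever need the same counts outside of ProQuest, say from a list of publication dates exported with the articles, the tally is simple. The following is a sketch under assumptions: the function names are hypothetical, and the default month range of June through October is my choice, since November coverage stops at the election date and is therefore a partial month.

```python
from collections import Counter
from datetime import date

def monthly_counts(pub_dates):
    """Count articles per (year, month) from a list of publication dates."""
    return Counter((d.year, d.month) for d in pub_dates)

def increases_each_month(counts, year, months=range(6, 11)):
    """Check whether the count rises strictly month over month in one year.

    The default covers June through October (an assumption of this sketch).
    """
    series = [counts.get((year, m), 0) for m in months]
    return all(a < b for a, b in zip(series, series[1:]))
```

A `False` from `increases_each_month` for some year would flag exactly the kind of summer lull I expected from experience but did not see in the ProQuest graph.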


-List of words most specific to each election cycle in NYT coverage from June to the election

-Reflections on what issues these words were connected to

-Reflections on the significance of recurring words

-Counts of the articles published each month and year

-Counts of the articles published each month and year with regard to specific candidates

-Timeline presenting the results (words most specific to each election, articles published)

-Essay on methodology and above-mentioned reflections to complement timeline


I only intend to complete two election years for this semester, but to carry them through the whole set of deliverables: finding the words specific to both cycles, making a timeline, analyzing it, posting it online. The main time constraint is building a corpus; the rest of it is not so hard. Building a corpus involves downloading and OCR’ing all the relevant articles.

I do not plan to do consecutive elections; I think more contrasts will be visible with more of a gap. I think 1968 and 2000 will be good choices; both have serious third-party candidates, and national priorities changed substantially between them, so the issues under discussion should be different, and any linkages all the more telling. Similarly, it will be interesting to see the difference in the quantity and timing of articles across a 32-year gap.

Because the difficulty is in the first part of the project, I expect to have my text corpus built by April 17th. I expect the Voyant analysis done April 20th, and the write up April 21st. After that, I expect the Prezi done April 25th.

I don’t know what kind of launch there could plausibly be. I am happy to post the results of my study (a link to the timeline that I will use to present the results) on my Medium, and to write at length about my methodology. The main results, though, are fairly static, interactive only to the extent that the reader chooses what to look at, so it is not too different from publishing a PowerPoint. Of course, the reader would be free to follow any of several threads in the timeline on their own. The possible threads: how many articles per month were published in a given year? How many articles over the course of an election were published, and how does it vary per year? Does the distribution across months change from year to year? (I suspect yes: in 1972, for example, the vice presidential part of the race was more interesting, so there would be an uptick in coverage in the summer months, then a dropoff as it became clear McGovern would lose badly, a pattern not repeated in other years.) Which words were most common in each cycle? Which words were most distinctive to each cycle? What issues do these suggest were important in each cycle? To what extent do the important issues in elections change, and to what extent do they recur?

In addition to my planned presentation — a timeline with each election year available to be clicked on for more detail — I could provide a write up to show the reader what strikes me about the data, and some preliminary answers to the questions outlined above.

I don’t have big plans for this after the class. The data will have basically gone through Voyant and I don’t expect it to be stored anywhere. The timeline will stay hosted on my Medium; I will keep my Medium account, if only out of inertia and to read a few writers I like, so it will stay public, but I don’t really see any use for the information.
