Signal 4: How Big Data is used to identify Literature

Po-Yang Kang
Civic Analytics 2018
Oct 25, 2018

Photo credit: The History of Science Fiction by Ward Shelley

Recently, big data has been used to identify biases and topic trends in the coverage of news websites and channels. By flagging recurring keywords and sentence structures that are most common on far-left or far-right websites, big data can classify how clear or factual each article is, sparing readers the time of reading and fact-checking every article themselves. However, big data can also be used in literary criticism: to identify genres and trends in literature, and to identify anonymous authors.

According to The Economist, computational analysis was able to determine that several of the plays attributed to Shakespeare weren't written by him at all, but by a contemporary named Christopher Marlowe. The article states that this is especially helpful for deflating the cult of Shakespeare and pointing instead to other dramatists who might have influenced him, since it is quite hard for literary critics to make that distinction precisely themselves.

And because such distinctions are hard for literary critics to make, one New York Times article pointed to an emerging trend of analyzing literature not by reading it, but by processing it with algorithms and digitized databases. These computations surfaced some interesting facts that literary critics had not seen: Gothic novels, for example, are Gothic not just because of their themes of castles, darkness, and the supernatural, but also because of their word usage, specifically their choices of verb tenses and prepositions. With this awareness, and with big data as a tool, the author believes that scholars might be able to look back and rediscover overlooked original stories that could now be labelled classics.
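
To make the idea of a genre having a word-usage fingerprint more concrete, here is a minimal sketch of how a preposition profile could be computed and compared. This is not the method used in the articles; the preposition list, the distance measure, and the sample sentences are all my own assumptions, made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-picked preposition list (an assumption for illustration).
PREPOSITIONS = ("in", "on", "at", "by", "with", "from",
                "upon", "through", "beneath", "of")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def preposition_profile(text):
    """Return each preposition's share of all tokens in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {p: counts[p] / total for p in PREPOSITIONS}


def profile_distance(a, b):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(a[p] - b[p]) for p in PREPOSITIONS)


# Made-up sentences standing in for whole novels.
gothic_sample = "Beneath the castle she crept through the dark, upon the stair, with dread."
modern_sample = "She got on the bus at nine and was at the office by ten."
unknown_sample = "Through the vault he fled, beneath the arch, upon the cold stone."

unknown = preposition_profile(unknown_sample)
to_gothic = profile_distance(unknown, preposition_profile(gothic_sample))
to_modern = profile_distance(unknown, preposition_profile(modern_sample))
print("closer to gothic sample:", to_gothic < to_modern)
```

With only one sentence per sample this proves nothing about real genres, but it shows the kind of counting involved: the comparison never looks at castles or ghosts, only at which small words appear and how often.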

However, I believe there is a limitation in how big data is used here. The articles like to point out the words that authors and genre writers commonly use, but I wonder whether big data has also been used to analyze writing style, structure, and prose. Those analyses might be useful for literary critics too, not just counts of the words writers use most.

Nevertheless, I believe using big data to identify words might give us a picture, but whether that picture is accurate is questionable. For example, I am doing a project in a different class that analyzes rates of depression among Twitter users based on their word usage, with some words carrying more emotional weight than others. But context also matters: in literature, it is hard to tell from word counts alone whether the writing is actually good or just ironic.
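
The context problem is easy to see in a small sketch of word-weight scoring. This is not the code from my class project; the word weights and example sentences below are invented for illustration, and a real lexicon would be far larger and validated.

```python
import re

# Hypothetical word weights (an assumption, not a validated lexicon).
EMOTION_WEIGHTS = {
    "hopeless": -3.0,
    "alone": -2.0,
    "tired": -1.0,
    "great": 2.0,
    "love": 3.0,
}


def emotion_score(text):
    """Add up the weights of known words; unknown words count as zero."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(EMOTION_WEIGHTS.get(token, 0.0) for token in tokens)


sincere = "I love this, what a great day"
ironic = "Oh great, I just love being stuck here alone"

print(emotion_score(sincere))  # 5.0
print(emotion_score(ironic))   # 3.0
```

Both sentences come out positive, even though the second one is clearly a complaint. The score only sees "great" and "love"; it has no way of knowing they are meant sarcastically, which is exactly why word counts by themselves can mislead.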
