Big Data and Book Publishing: Data Mining of Literary Taste?

Oct 14, 2016

The recent increase in the availability and level of sophistication of information about consumer preferences and behaviour — generally gathered under the umbrella term of “big data” — has opened up new opportunities for traditional businesses, which are orienting their choices towards a higher level of customisation of their online ads, automated recommendation systems and even their products.

Publishers and Internet bookstores that own their proprietary e-reading platforms, like Amazon with its Kindle and Barnes & Noble with its Nook, can now, for example, identify which passages of digital books are popular with readers, how long it takes a reader to finish a book, and other more or less sophisticated information about the frequency and “intensity” of reading.

And it does not end here.

Jellybooks, a London-based reader analytics company that counts Penguin Random House, Faber & Faber and O’Reilly Media among its customers, has developed a software tool that tracks reading behaviour in third-party reading applications, offering publishers the appealing prospect of peering over readers’ shoulders.

Drawing upon (and trying to sum up) one chapter of my Master’s dissertation, this article will investigate the possible role of reader analytics in representing and measuring taste, a vital determinant of cultural consumption, framing the discourse around the traditional methodological approaches of cultural sociology.

The complex dynamics behind how and why cultural products are consumed represent dimensions of audience behaviour that dig deeper into the nature of cultural consumption than the basic question of whether a particular product was consumed or not. A recurring theme in the discussion around “big data” in publishing is the claim that it makes it possible, to some extent, to measure reading behaviour: readers’ “engagement”, “likes and dislikes” and “literary preferences” are some of the expressions used to convey what these new ranges of insight could reveal.

What does this terminology mean? Is there a link between reader analytics and the methodological tradition of measuring cultural taste?

Borrowed from the media and advertising industry, the term “engagement” is the rubric under which many of the efforts to understand audience behaviour take place. However, as with “big data”, there does not seem to be unanimous agreement as to what it means and how it is measured. Emotional connection, the measure of attention paid to a communication, and a customer’s relationship to media content are, among the definitions proposed, the ones that seem most applicable to the discourse around reader analytics. Appreciation and emotional response, in particular, are aspects that evoke one of the main objects of inquiry in the fields of sociology of culture, media and cultural studies: the matter of taste.

This issue is so relevant in the context of the creative industries because of the peculiar nature of cultural consumption. As with any other product, consumption depends on tastes, but the taste for creative products actually emerges from distinctive processes, including people’s investment in refining and developing those tastes, the social context in which consumption takes place, and, arguably, cultural consumers’ status and social class.

Taste has been variously defined, and its definitions usually vary according to the purpose and the approach of those defining it. Taste can be seen as a characteristic possessed by individuals, an inherent property of certain objects, a process or social practice, and even used as a synonym for consumer choice and preference. Empirically, taste has often been analysed in the form of patterns, in studies that seek to understand the relationship between various leisure activities and socio-demographic variables (e.g. Peterson and Simkus, 1992; Lopez-Sintas and Garcia-Alvarez, 2002).

The epistemological debates around taste have always been linked to the problem of measurement, as the techniques of social science and cultural inquiry adopted to measure taste necessarily rely on the claims made about what we mean by “taste”.

The question then arises as to what extent taste can be measured — and what “big data” can tell us about it. Could the traditional modes of measuring tastes be challenged by reader analytics?

One of the main theoretical and empirical perspectives on taste is the one elaborated by Bourdieu (1984) in Distinction: A Social Critique of the Judgement of Taste. The author’s work, a conceptualisation of the links between taste, status and social class, was based on a sophisticated questionnaire administered to people in Paris, Lille and an unnamed provincial town between 1963 and 1968, with a total of 1217 participants. The investigation’s results confirmed the view generally held in the first half of the 20th century: people make significant distinctions along a continuum that goes from “highbrow” to “lowbrow” tastes, and such distinctions reflect the homology between cultural hierarchies and social hierarchies. In other words, people’s tastes are seen as channelled by their social class position. The continuum or cultural hierarchy in question consists of the so-called “legitimate culture” (that of the most well-regarded cultural goods and genres), middle-brow culture, and popular taste.

Bourdieu’s empirical investigation, like those of other contributors to the sociological debate about taste such as Peterson (2005), was carried out with the conventional toolkit of the social sciences, which includes questionnaires and surveys, combined with (or, in some cases, replaced by) different forms of qualitative interviewing.

Whatever the methodological approach adopted and the choices made in designing the enquiry and analysing its results, it will be necessary to make assumptions about what tastes are (for example, does reader analytics measure tastes or activities?), how they can be rendered analysable (perhaps by translating them into metrics?) and to what ends.

Methodology Matters: What is New with Big Data?

Structuring a questionnaire that captures the different dimensions of taste is not an easy task. Bourdieu’s survey design consisted of 26 questions about different aspects of taste, from preferred styles of interior design and forms of cuisine to favourite leisure activities and book genres. Information about respondents’ age, marital status, place of residence, education, income and ownership of technologies was collected as well. In addition, participants were asked to indicate the opinion closest to their own in relation to classical music and paintings, and “tested” in relation to their participation in film, music and art.

Bourdieu himself does not hide the limitations of his research strategy: not only do his questions capture only a simplified view of the variety of possible dispositions towards the objects of taste, but they could also lead to somewhat misleading answers affected by the so-called “cultural goodwill”. Indeed, the research encounter, if the interviewer is a representative of “official” forms of culture, might lead the participants to give the answer they imagine to be the most “correct” in the eyes of the questioner, especially with regard to cultural goods that have an association with “legitimate” culture.

In the 1990s, other studies were carried out in an attempt to propose an alternative to the highbrow/lowbrow dichotomy. For example, in 1992, Peterson and Simkus investigated tastes in music using data comparable to Bourdieu’s. The researchers found that respondents in high-status occupations were more likely than others to report being involved in a wide range of low-status activities, while respondents in the lowest-status occupations were most limited in their range of cultural activities. Based on the results of their surveys, the authors suggest that high-status interviewees seemed to be “omnivorous” in their tastes (appreciating the aesthetics of a wide range of cultural forms, from fine arts to popular cultural expressions), while those at the other end of the status hierarchy were more “univorous”.

However, methodological concerns similar to those found in Bourdieu’s approach also arise in this case: as Peterson himself recognises, different problems emerge in his comparative research. For instance, if omnivorousness is a measure of the breadth of taste and cultural consumption, how do researchers operationally define it? How many of the choices available in the questionnaire should a respondent choose to be counted as an omnivore? In other words, is there an “amount” of culture liked and a composition of highbrow and popular preferences that make it possible to classify omnivorousness?
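To make the operational problem concrete, here is a minimal sketch in Python of one possible way to classify a respondent as an omnivore. Everything in it is an assumption made for illustration: the function name, the genre groupings and the `min_genres` threshold are hypothetical and are not taken from any of the studies cited. The point is only that the label depends on where the researcher sets the threshold.

```python
def is_omnivore(liked_genres, highbrow, popular, min_genres=4):
    """Classify a respondent as an omnivore under two assumed criteria:
    breadth (at least `min_genres` liked) and composition (at least one
    highbrow and one popular genre among them)."""
    liked = set(liked_genres)
    crosses_boundary = bool(liked & highbrow) and bool(liked & popular)
    return len(liked) >= min_genres and crosses_boundary


# Hypothetical genre groupings, chosen only for the sake of the example
HIGHBROW = {"classical", "opera", "jazz"}
POPULAR = {"pop", "country", "rock", "rap"}

# One hypothetical respondent who ticked four genres
respondent = ["classical", "jazz", "rock", "pop"]

strict = is_omnivore(respondent, HIGHBROW, POPULAR, min_genres=5)
lenient = is_omnivore(respondent, HIGHBROW, POPULAR, min_genres=4)
print(lenient, strict)  # True False
```

The same respondent counts as an omnivore with a threshold of four genres and not with a threshold of five, which is exactly the kind of arbitrary cut-off the questions above point to.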

Deciding whether to measure likes or dislikes then becomes another relevant question: if, in the words of Bourdieu, “in matters of taste… All determination is negation; and tastes are perhaps first and foremost distastes” (1984, p. 56), then asking respondents to indicate what they “like very much” and “dislike very much” might lead to a different way of operationalising the problem than the one focusing on likes. Such an analysis was carried out by Bryson (1996), who concluded that highbrow snobbery consists of disliking all forms of popular culture, while omnivorousness consists of having distaste for none.

Another limitation of the survey method that underlies the aforementioned studies of cultural taste is the reliance on people’s self-reported preferences, which are subject to no verification. Indeed, a survey artificially separates the expression and experience of taste, with the result of constructing tastes as things which research participants remember in relation to their representation of themselves, rather than as a thing which they experience. This might mean, for example, that respondents may not like Dostoevsky, but at the same time they may be aware that that is the kind of reading someone like them should appear to like: “[…] this remembering is both in relation to the stories that they tell themselves about the kind of person they are and the stories that they tell researchers about the kind of person they want themselves to be seen to be” (Wright, 2015).

Here is where reader analytics could be seen as a methodological shift: tastes and preferences of the audience are measured through actual evidence of reading, and through technologies able to grasp the immediate moment of tasting. “Data collection about uses and preferences is now embedded directly into the processes of presentation and consumption of cultural content” (Wade Morris, 2015). This peculiar character of reader analytics reconnects to the previous discussion about the greater reliability of measuring what people do (or report doing) rather than their stated preferences, and overcomes the artificial separation between expression and experience of taste.

With a technology infrastructure (the e-reading device) taking the place of the observer or interviewer, the practice of expressing aesthetic preferences that are seen as “legitimate” according to one’s vision of oneself or of the social context one belongs to seems to have no reason to take place. Escaping into a book and knowing (or maybe, given the context, feeling like) no one is watching is a completely different situation from answering a survey, where conformity phenomena induced by “cultural goodwill” may arise.

On another level, reader analytics might also challenge the very notion of consumption of creative goods. If a book’s “consumption” can be conceived more deeply than just identifying it with its final purchase, then interesting dynamics may emerge, potentially insightful both for sociologists of culture and for publishers.

E-reading data disclosed by Kobo, as reported by Alison Flood in The Guardian, might be a case in point: a discrepancy has been revealed between the titles included in the list of Kobo’s UK bestseller e-books and those in its “most completed” list. Among other findings, Kobo has discovered that fewer than half of those who downloaded the Pulitzer Prize-winning “The Goldfinch” actually finished it, and that the most completed book of 2014 was not an award or prize winner, but a self-published thriller.

At this point, a reflection arises: if some books are bought for their value as a “status symbol” (as a display of educational and societal status more than material wealth), they tend to be physical books, as e-books, being stored in a device instead of exhibited on a shelf, do not have the same signalling function (Rhomberg, 2016). This might mean that analysing data from e-books, together with the demographics of their readers, might help further investigate the question of cultural omnivorousness.

Although theoretically interesting, using reader analytics to overcome the limitations of the survey method in interrogating tastes is, at present, almost utopian: rather than a new methodology for the social sciences, the new information about reading behaviour is more likely to remain stored in the customer databases of Amazon and other big players in the publishing industry, who, in turn, will have to make assumptions, ask questions and attempt to make sense of the data.

Going back to the object of measurement (taste) and the way it is made measurable, Rhomberg has explained how Jellybooks’ software tracks, page by page, readers’ interactions with e-books, such as velocity of reading and completion rate (a proxy for how strongly a reader engages with a book). It is behaviour, then, that we are looking at: taste is something that is subsequently inferred from it. Together with these two observational metrics, there is also a more subjective one, the “recommendation factor”. When readers have finished a book downloaded from Jellybooks’ platform, they are asked to indicate, on a scale from 0 (not at all likely) to 10 (extremely likely), whether they would suggest that book to a friend. Respondents whose answers range from 0 to 6 are considered “detractors”, while those responding 9 or 10 are categorised as “promoters”; the recommendation factor (or net promoter score) for each book is then obtained by subtracting the percentage of detractors from the percentage of promoters, both calculated on the total number of readers.
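The arithmetic behind the recommendation factor can be sketched in a few lines of Python. This is only an illustration of the calculation as described above, not Jellybooks’ actual code: the function name and the sample ratings are hypothetical.

```python
def net_promoter_score(ratings):
    """Return the net promoter score, in percentage points, for a list
    of 0-10 answers: share of promoters (9-10) minus share of
    detractors (0-6), both computed over all respondents."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical answers from ten readers of the same book
ratings = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
score = net_promoter_score(ratings)
print(score)  # 10.0 (four promoters, three detractors, ten readers)
```

Readers who answer 7 or 8 count towards the total but fall into neither group, so a book with many lukewarm readers ends up with a score close to zero.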

Here, some of the issues of measurement addressed by Peterson seem to find a resolution: are we measuring preferences or activities? We are measuring the activity of reading — in the most literal sense that contemporary technology allows — and therefore deducing preferences. Are we measuring likes or dislikes? We are measuring both.

Ultimately, in the case of Jellybooks, what is extracted from the aggregated data is the profile of a book, rather than that of a reader or of a group of readers. Nonetheless, the capabilities of companies such as Amazon and Apple might go even further: the data mined from e-reading behaviour could be combined into user profiles that include readers’ browsing and purchase history, and potentially their comments and reviews on affiliated sites (such as Goodreads.com, which is owned by Amazon). In Apple’s case, information gathered from iBooks might be linked to the data stream about users’ interaction with their devices (Davis, 2015).

Moreover, as exemplified by Jellybooks, the metrics obtained can also be analysed in combination with more traditional audience insights, like demographic factors, to find correlations and patterns of behaviour. For instance, it is possible to measure the strength of the association between the completion rate and age or gender to understand whether a book appeals to a certain audience segment more than to a different one — and, in the light of the actual content of the book, even understand why.
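As a rough illustration of this kind of analysis, the following Python sketch compares completion rates across audience segments. The record layout, the field names and the sample data are all hypothetical, and a real analysis would also need to check whether the difference is statistically meaningful for the sample size at hand.

```python
from collections import defaultdict


def completion_rate_by_segment(readers, key):
    """Return the share of readers who finished the book within each
    segment, where `key` names the demographic field to group by."""
    started = defaultdict(int)
    finished = defaultdict(int)
    for reader in readers:
        segment = reader[key]
        started[segment] += 1
        if reader["finished"]:
            finished[segment] += 1
    return {segment: finished[segment] / started[segment] for segment in started}


# Hypothetical reading records for a single title
readers = [
    {"age_group": "18-34", "finished": True},
    {"age_group": "18-34", "finished": True},
    {"age_group": "18-34", "finished": False},
    {"age_group": "18-34", "finished": True},
    {"age_group": "35+", "finished": False},
    {"age_group": "35+", "finished": True},
    {"age_group": "35+", "finished": False},
    {"age_group": "35+", "finished": False},
]

rates = completion_rate_by_segment(readers, "age_group")
print(rates)  # {'18-34': 0.75, '35+': 0.25}
```

In this made-up sample the younger segment finishes the book three times as often as the older one; why that is the case remains a question the numbers alone cannot answer.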

Although Jellybooks’ example might not be comparable to the quantity and level of detail of data that companies like Amazon are capable of gathering and processing, it still reveals something important about the practices of knowing and measuring tastes.

The real “big data” challenge posed to publishers might then not be a major technological challenge, but rather a human one that still involves some relevant aspects of “traditional” research methodologies: making informed assumptions, asking the right questions, and deciding what purpose they want to achieve.

Main sources:

Alter, A. (2012) Your E-Book is Reading You. Available at: http://www.wsj.com/articles/SB10001424052702304870304577490950051438304

Bourdieu, P. (1984) Distinction: A Social Critique of the Judgement of Taste. London: Routledge

Peterson, R. (2005) Problems in comparative research: The example of omnivorousness. Poetics, 33(5–6), pp.257–282.

Rhomberg, A. (2016) https://medium.com/@arhomberg

Wright, D. (2015) Understanding Cultural Taste: Sensation, Skill and Sensibility. Palgrave Macmillan.

(If interested, feel free to ask for a more detailed bibliography, listing all of the authors and works mentioned in this article).

Written by Alice Speranza