Publishing Needs a Quantitative Framework for Assessing Manuscripts

This is why computers are scary for book people.

Any good acquiring editor has two recurring nightmares: (1) choosing to publish a really bad manuscript that loses a lot of money and (2) choosing to pass on a really good manuscript that then goes on to make someone else a lot of money. LitRejections, a website tracking bestselling and award-winning books that were first rejected tens or even hundreds of times, is the graveyard of publishing dreams. It’s probably haunted: in their heart of hearts, nobody thinks they’d have been smart enough to pick Harry Potter and the Philosopher’s Stone off the top of the stack.

Harry Potter was something new, something publishing hadn’t seen before — a new author, a new story. Just imagine trying to write a short and catchy sales handle to summarize everything that Harry Potter is: “Orphaned Harry Potter learns of latent magical talents and leaves behind his drab, second-hand life with his horrendous aunt and uncle to discover a hidden and wondrous world that lies behind brick walls.” — seriously, it’s always brick walls — “But it’s a darker and more dangerous world than he could ever have imagined.”

Not exactly short, that. Let’s try again: “‘You’re a wizard, Harry,’ says an extremely large man with a very bushy beard. Thus begins a delightful adventure full of locked doors, dark magic, and mystery. Enchanting.”

I could write another, but I think it’s clear by this point that no sane sales rep would read that and say, “Yeah, you know, that sounds really promising. I think I can sell that.” Which is no doubt why, as LitRejections reports, it took the eight-year-old daughter of a Bloomsbury editor to get that book published.

There was a way this might have worked out better for the 12 publishers who preceded Bloomsbury. Obviously, handing all manuscripts to your eight-year-old for assessment smacks of child labour and should be avoided, but if a single child’s verdict was enough to predict the massive success that followed, then, at least in the case of blockbusters, it should be possible to build a data model that would help editors assess manuscripts without resorting to the superior intelligence of eight-year-olds (in no way am I being sarcastic here).

There’s a lot we don’t know about what makes a book appealing. For years now, I’ve been conducting an informal study asking people what it is that they love about Harry Potter. Nobody has thus far been able to give me a concise answer, and a quick Google search confirms that, yes, there are whole lists of reasons but no single without-this-these-books-would-suck core piece. They’re just magical.

In many ways, Harry Potter is a textbook example of why publishing has always preferred the “I’ll know it when I see it” method. If, after decades of record-crushing sales, we still can’t say why exactly people love Harry Potter, the “I’ll know it when I see it” method is really the only one that has any promise at all.

But things are changing. For at least the past decade, publishers have been exploring the world of data-driven publishing. Nielsen BookScan, which tracks book sales on an unprecedented scale, was created in 2001, and its Canadian counterpart, BookNet Canada, appeared close on its heels in 2002. Both organizations have allowed publishers to review not only their own sales but also those of other publishers, aggregated from a huge number of retailers. Their creation thus brought about a marked improvement in the quality of information collected about book sales, even if the numbers aren’t perfect and never will be.

Publishers have also begun to look at other kinds of data, specifically social media trends and internet analytics. Some, like Cengage Learning, have launched long-term qualitative research programs to look for ways they can improve their offerings for students. Jellybooks is studying how readers engage with a text.

But, so far, all attempts to get a better grasp on the market have focused on readers: what readers want, how readers find books, and how readers consume what they do manage to find. This is, undoubtedly, an important piece of the puzzle. Jellybooks alone would likely have been sufficient to tell publishers what they had in the manuscript for the first Harry Potter book. But it’s still not the whole picture.

If we want to know what makes a book appealing, reader analytics are not the correct dataset. Readers are inherently variable, but there is one thing they all have in common, and that is the text itself. Which means that we need to study the text.

Researchers are beginning to develop tools that can tell us a great deal about a text. They’ve discovered that your high-school English teacher wasn’t completely off-base when it came to mapping out the plot of a story: there are, in fact, six basic emotional plot maps that show up again and again in stories throughout history. These arcs are even correlated to the success of a story.
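For a sense of how this arc-mapping works: the researchers scored the sentiment of successive windows of a text and clustered the resulting curves into recurring shapes. Here is a minimal sketch of the windowing step; the tiny hand-made valence lexicon is my own illustrative stand-in (the actual study used the large crowd-rated labMT lexicon and more sophisticated decomposition methods):

```python
# Illustrative only: a toy valence lexicon standing in for a
# crowd-rated one like labMT.
TOY_VALENCE = {
    "love": 2.0, "delight": 2.0, "magic": 1.5, "friend": 1.5,
    "dark": -1.5, "death": -2.0, "fear": -1.5, "alone": -1.0,
}

def emotional_arc(text, n_windows=10):
    """Score consecutive windows of the text by average word valence."""
    words = text.lower().split()
    size = max(1, len(words) // n_windows)
    arc = []
    for i in range(0, len(words), size):
        window = words[i:i + size]
        # Words missing from the lexicon count as neutral (0.0).
        score = sum(TOY_VALENCE.get(w, 0.0) for w in window) / len(window)
        arc.append(round(score, 3))
    return arc[:n_windows]

# A contrived "story": happy opening, dark middle, happy ending.
sample = ("love and delight with a friend " * 3 +
          "dark fear and death alone " * 3 +
          "magic and love return " * 3)
print(emotional_arc(sample, n_windows=5))
```

Plotted across the length of a novel, a curve like this is the raw material that gets clustered into the six recurring shapes.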

Andrew Piper, director at McGill University’s .txtLAB digital humanities laboratory, has used computational analysis — specifically, network science, machine learning, and image processing — to develop a number of interesting insights about books that get published. In his article “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading,” he and coauthor Eva Portelance explain how such an analysis turned up an interesting trend among prizewinning novels: they’re all heavily steeped in nostalgia. Bestsellers, by comparison, are oriented more towards the present moment and have more characters and more dialogue. Piper has also used similar techniques to correctly predict who would win the Giller Prize.

In contrast to the idea that “art is unknowable” — an objection Piper has heard often in response to his research — the findings show how useful these kinds of tools can be. What is supposedly an exclusively human phenomenon turns out to be replicable with a machine. Not perfectly, maybe, but neither is the human process. With this information, it’s possible to look at publishing decisions with a more critical eye.


Piper’s research found other trends as well, including a tendency for strong gender stereotyping in book reviews. When paired with the results of the annual Publishers Weekly Publishing Industry Salary Survey, it becomes clear exactly what function this kind of quantitative research can serve in the publishing world: it can alert us to the places where we’re indulging our own strong biases.

The publishing industry is beyond overwhelmingly white. From 2014 to 2015, the percentage of white respondents to the survey decreased by a single point, from 89 per cent to 88 per cent. Men in publishing are 94 per cent white, while women are 86 per cent. And overall, women made up 74 per cent of respondents but hold only 54 per cent of management positions.


Under these circumstances, it’s actually imperative that we have some kind of data check on the acquisitions process because instincts are informed by experience, and the experience of a white Canadian female editor is not likely to look much like the experience of a male Chinese immigrant writer. And if the editor can’t rely on experience and instinct, then she can only fall back on either data or existing stories in order to know what is true about that man’s text.

Chimamanda Ngozi Adichie, a Nigerian writer, speaks eloquently to a related problem in her TED talk, “The danger of a single story.” The single story, she says, is the one that shows “a people as one thing, as only one thing, over and over again, and that is what they become.” We need, she says, many different stories about many different people if we are to appreciate the full range of human experience.

Adichie offers many anecdotes about how people have responded to her based on the single story we tell about African cultures. But unless we (1) go to Africa to get some first-hand knowledge or (2) talk, preferably at length, with someone who has first-hand knowledge, being sure to ask the right questions, it can be difficult to know when we’re being fed a single story. There are whole academic disciplines dedicated to this task, and still we repeatedly fall into the same trap.

As further evidence, Piper finds that, among MFA graduates at least, nonwhite voices are homogenized to the point of being indistinguishable from white voices. And male characters outnumbered female characters in 99 per cent of books by non-MFA graduates and 96 per cent of books by MFA graduates.

I’ll pause a moment to let that sink in.

Seriously: 99 per cent?! At that rate, the gender bias should be visible from the moon. And this from an industry that is almost three-quarters women (more still in editorial departments).

The fact that we have never discovered the extent of any of these problems using our instincts is a clear indication that we should be taking a long, hard look at exactly what our instincts are telling us. We might have to confront the fact that our instincts haven’t evolved since, well, ever.

Data can help us answer the question, “Is this book bad, or is it just different?” And we should give it a chance to do just that.


So, accepting for the moment that we have a problem and data might help us fix it, what might this data analysis look like within a publishing context?

There are obviously some hurdles to overcome. The first is figuring out what tools, exactly, would be most useful and least misleading. These tools would, of course, need to evolve over time, but the emotional arcs research would be a good place to begin, given the existing correlation between these emotional arcs and the long-term success of titles. Likewise, Piper’s research involving bestsellers and prizewinners would almost certainly be useful in an acquisitions process. Both of these tools could tell us something about a story, specifically as it might compare to other stories, thus helping with positioning; they could also identify which areas require the most work.

For houses that accept unsolicited manuscripts, analyzing manuscripts automatically before they ever reach an editor would be a logical and low-risk starting point. Assessing published titles and comparing the results to actual sales would also be helpful.
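As a sketch of what such an automated first pass might compute, consider two of the surface features Piper’s work associates with bestsellers: dialogue and number of characters. Both heuristics below (quote-matching as a proxy for dialogue, a naive capitalized-word rule for names) are my own illustrative assumptions, not the lab’s actual method:

```python
import re

def manuscript_metrics(text):
    """Two crude surface metrics for a manuscript: the share of text
    inside double quotes (a stand-in for dialogue) and a rough set of
    character names (capitalized words that follow a lowercase word,
    so sentence-initial words are skipped)."""
    quoted = re.findall(r'"([^"]*)"', text)
    dialogue_share = sum(len(q) for q in quoted) / max(1, len(text))

    words = text.split()
    names = {w.strip('.,!?";:') for prev, w in zip(words, words[1:])
             if prev[:1].islower() and w[:1].isupper()
             and w.strip('.,!?";:').isalpha()}

    return {"dialogue_share": dialogue_share,
            "names": names, "name_count": len(names)}

sample = '"You\'re a wizard, Harry," said Hagrid. Harry stared at him.'
print(manuscript_metrics(sample))
```

Run over a slush pile, numbers like these would only mean anything in comparison with the same metrics computed on comparable published titles, which is exactly why assessing the backlist alongside sales data matters.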

Finally, the data should be worked into existing systems. Publishers are already tracking sales, social media, and marketing, and this kind of book-centric data could operate very successfully alongside the more conventional data, especially if trends start to develop over time. Analyzing books themselves is still about determining what makes them enjoyable for readers, after all.

Readers will read anything; there’s really nothing to worry about.

Addressing Some Likely Concerns

1. This is an oversimplification of a very complex question.

Not really — a sales handle is an oversimplification of a very complex question, and computers can handle a lot more complexity than people can, at least in some ways. In any case, it’s necessary to simplify things at some point. After all, a sales handle serves a purpose, even if it is a single 20-word sentence purporting to explain an entire novel.

2. Novelists will hate this.

So far, not so much. Kurt Vonnegut started the whole thing with his lectures on the emotional arcs of stories (and in his rejected master’s thesis), and writers in Quebec participated in a challenge to write like an American bestseller, with varied but highly creative results. Learning more about the art form could be inspiring, in some ways.

3. There’s no way a computer can know what art looks like; art isn’t created for computers.

Correction: there’s no way a computer can understand and experience art in the same way that a human can, but knowing what art looks like is, according to the evidence, entirely possible for a computer. Further, even if we are using computers to examine art, the insights generated are no more for their benefit than was the original piece of art.

4. The data you get is only as good as your inputs.

True, and for this reason — among many, many others — publishers should not immediately fire all their acquisitions editors and replace them with computer scientists; finding the best system for assessing manuscripts would likely involve some trial and error. But in the meantime, publishers could learn some things, both about the books they’re publishing and about the culture they’re publishing in. And you would have to devise a very poor system indeed not to pick up on some of the cultural biases that the research has already revealed; that information alone is extremely valuable.

5. This would kill publishing.

Opinions vary as to whether the advent of sophisticated sales tracking killed publishing. Some people certainly think so. However, as John Barber points out in The Globe and Mail, that data really cuts both ways: the big multinational houses dropped many of their midlist authors, but that left the indies free to pick them up, which could, in the long run, prove invigorating for the industry. Until we try something like this, there’s no way to know for sure what the effects would be. And in the meantime, Amazon is learning everything there is to know about readers and then using that information to hold publishers hostage (see George Packer’s “Cheap Words” in The New Yorker for a really thorough overview of the codependent relationship between publishers and Amazon). On some level, publishers can’t afford not to start thinking about some of these questions.

Above all, publishers should remember that it’s no coincidence that the books winning all the prizes and getting all the reviews in the New York Times all have the flavour of nostalgia. Those books satisfy the tastes of literary readers, and literary culture has more than a tinge of nostalgia itself. In his New Yorker piece, George Packer quotes one senior editor as saying, “Book publishing always has a rhetoric of the fallen age. It was always better before you got here.” And being nostalgic and afraid of what the world looks like now can be very dangerous, especially for cultural industries, which are responsible for engaging with the culture, as it is now.

Publishers should likewise remember the second part of the quote: “The tech guys — it’s always better if you just get out of my way and give me what I want. It’s always future-perfect.” Which, eventually, begins to look very much like the “Dwight nostalgia” clip from The Office.

Watch out.

Ultimately, though, computer-generated data is a lot like nostalgia — it’s all good if you’re learning from it.

*Written for the Simon Fraser University MPub.*


Adichie, Chimamanda Ngozi. “The Danger of a Single Story.” TED, October 2009. Accessed November 19, 2016.

Alter, Alexander and Karl Russell. “Moneyball for Publishers: A Detailed Look at How We Read.” New York Times, March 14, 2016. Accessed November 19, 2016.

Barber, John. “Why Book Buying Stats Might Stifle the Next Great Author.” The Globe and Mail, December 27, 2012. Accessed November 19, 2016.

BookNet Canada. “About Us.” Accessed November 19, 2016.

Charman-Anderson, Suw. “Can Nielsen BookScan Stay Relevant in the Digital Age?” Forbes, January 13, 2013. Accessed November 19, 2016.

Dunder Muffin. “Dwight nostalgia.” YouTube, November 19, 2013. Accessed November 19, 2016.

Harvey, Ellen. “The Book Industry’s Quest for Data Intelligence.” Book Business, February 1, 2015. Accessed November 19, 2016.

Lafrance, Adrienne. “The Six Main Arcs in Storytelling, as Identified by an A.I.” The Atlantic, July 12, 2016. Accessed November 19, 2016.

LitRejections. “Best-Sellers Initially Rejected.” Accessed November 19, 2016.

McGill University. .txtLAB Digital Humanities Lab. Accessed November 19, 2016.

Milliot, Jim. “The PW Publishing Industry Salary Survey, 2016.” Publishers Weekly, September 16, 2016. Accessed November 19, 2016.

Nielsen BookScan UK. Accessed November 19, 2016.

Packer, George. “Cheap Words.” The New Yorker, February 17 & 24, 2014. Accessed November 19, 2016.

Pereira, Gabriela. “Episode 107: Will an MFA Affect Your Writing? What the Data Really Tell Us — An Interview with Andrew Piper.” diyMFA, August 10, 2016. Accessed November 19, 2016.

Piper, Andrew and Eva Portelance. “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading.” Post45, May 10, 2016. Accessed November 19, 2016.

Piper, Andrew. “The Devoir Challenge. How to write like an American Bestseller.” .txtLAB@McGill, January 13, 2016. Accessed November 19, 2016.

Reagan, Andrew J., Lewis Mitchell, Dilan Kiley, Christopher M. Danforth, and Peter Sheridan Dodds. “The emotional arcs of stories are dominated by six basic shapes.” arXiv, September 26, 2016. Accessed November 19, 2016.

So, Richard Jean and Andrew Piper. “How Has the MFA Changed the Contemporary Novel?” The Atlantic, March 6, 2016. Accessed November 19, 2016.

So, Richard Jean and Andrew Piper. “Women Write About Family, Men Write About War.” The New Republic, April 8, 2016. Accessed November 19, 2016.