# Is it possible to predict a NYT bestseller?

The upcoming book “The Bestseller Code — Anatomy of the Blockbuster Novel” is getting a great deal of buzz. Can one genuinely predict what kind of book will become a NYT best seller?

The promise of a formula for predicting a bestseller is getting many in the publishing industry and those who write about books excited.

Several journalists contacted me for an opinion about the book because of my background in pub tech and reader analytics. That made me curious to read it, and the book’s publisher, St. Martin’s Press, was kind enough to provide me with an advance reader copy last week.

First of all, this is a delightful book to read. I would recommend it as both an entertaining and educational read for anybody interested in the business of books. This is not a magisterial work like “Merchants of Culture” by J. Thompson, but a book written for the mass market with lots of anecdotes and examples that readers and authors can relate to. It is a book for a general audience and avoids jargon and “academic” language as far as possible. The “code” is based on some of the latest advances in machine learning as applied to literature, but the authors keep the computer science behind the book to a minimum. There is no mention of “big data” or artificial intelligence, just plain and simple descriptions of what the “black box” does, with references for interested readers who want to find out more about the inner workings of that black box.

However, there is a statement in the book that was misunderstood by many of those who interviewed me about it: “the algorithm can predict if a book will be a best seller with accuracy 80%”.

I had a sense when being interviewed that most journalists thought this meant something along the following lines: “if there are something like 500 New York Times best sellers this year, then this algorithm can produce a list of 500 titles and 400 of those will indeed turn out to be bestsellers”. Well, that’s not actually what 80% accuracy means. The misunderstanding lies in the “can produce a list of 500”.

One needs a bit of statistics knowledge to understand this better. I will first lay out (with some statistical elaboration) how the authors describe the 80% accuracy:

If the algorithm is applied to 50 books that are genuinely best sellers then it will recognise that 40 of these (80%) are *indeed* best sellers, but will classify incorrectly (“falsely”) that 10 of the books (20%) are not best sellers (a “negative” result). Thus the 10 titles that are missed are what statisticians call the “false negatives”.

Now, if the algorithm is applied to 50 books that are known not to be best sellers, then it will recognise that 40 of these (80%) are indeed *not* best sellers, but will classify incorrectly (“falsely”) that 10 of the books (20%) are, in the opinion of the algorithm, best sellers (a “positive” result) when in fact they never were NYT bestsellers. Thus, these 10 titles that are incorrectly predicted to be bestsellers are what statisticians call the “false positives”.
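The two test runs above can be tallied in a few lines of Python. This is only a sketch of the arithmetic, not the authors' code: the function name `classify_counts` and its parameters are made up for illustration, and it assumes the same 80% rate applies to both groups of books.

```python
def classify_counts(n_bestsellers, n_others, hit_rate, reject_rate):
    """Return (true_pos, false_neg, true_neg, false_pos) as whole books.

    hit_rate:    share of real bestsellers the algorithm recognises
    reject_rate: share of non-bestsellers the algorithm correctly dismisses
    """
    true_pos = round(n_bestsellers * hit_rate)
    false_neg = n_bestsellers - true_pos   # real bestsellers that were missed
    true_neg = round(n_others * reject_rate)
    false_pos = n_others - true_neg        # ordinary books flagged by mistake
    return true_pos, false_neg, true_neg, false_pos


# 50 known bestsellers and 50 known non-bestsellers, 80% correct on each group.
tp, fn, tn, fp = classify_counts(50, 50, 0.8, 0.8)
overall_accuracy = (tp + tn) / (tp + fn + tn + fp)
print(tp, fn, tn, fp, overall_accuracy)  # 40 10 40 10 0.8
```

On a sample split evenly between the two groups, the overall share of correct answers is also 80%, which is why the headline number looks so reassuring.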

Let’s construct a different scenario. Imagine a Barnes & Noble megastore in the Midwest with 200,000 nicely ordered titles on its shelves including 1,000 titles in a section called “Past and Present New York Times Bestsellers”.

Now a mob of Trump supporters enters the store and throws all the books on the floor in protest at Trump’s “Art of the Deal” not being displayed in the bestseller section. They don’t actually take any of the books with them, because, well, they are not really interested in reading books, so there are now 200,000 books lying in a jumble on the floor.

A poor BN intern is now assigned to put the 1,000 best sellers back on the shelf, but, being an intern, the intern has no idea what makes a bestseller and thus the intern decides to make use of this magic new algorithm.

The poor intern now tests all 200,000 books against the algorithm.

When applied to the 1,000 bestsellers the algorithm identifies 800 of them correctly as bestsellers, but dismisses 200 as not being bestsellers.

Now it gets interesting though. When analysing the remaining 199,000 books, the algorithm identifies 80% (that is, 159,200 books) as not being bestsellers, but it believes (incorrectly) that the rest (20%) are in fact NYT bestsellers. That is a whopping 39,800 books. Our intern using the algorithm has identified a total of 40,600 (39,800 + 800) books as NYT bestsellers. He ends up not with the 1,000 NYT bestsellers he was looking for, but with 800 of them buried among 39,800 false “bestsellers”, while the other 200 real NYT bestsellers were incorrectly classified by the algorithm and are missing. That is what 80% accuracy means.
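For readers who want to check the intern's numbers, here is a minimal sketch of the same calculation. The variable names are my own, and the key assumption (taken from the scenario, not from the book) is that the 80% rate holds equally for bestsellers and for everything else.

```python
TOTAL_BOOKS = 200_000
BESTSELLERS = 1_000
RATE = 0.80  # assumed: 80% correct on bestsellers AND on non-bestsellers

others = TOTAL_BOOKS - BESTSELLERS        # 199,000 ordinary books

true_pos = round(BESTSELLERS * RATE)      # bestsellers correctly flagged
false_neg = BESTSELLERS - true_pos        # bestsellers dismissed by mistake
true_neg = round(others * RATE)           # ordinary books correctly dismissed
false_pos = others - true_neg             # ordinary books flagged by mistake

flagged = true_pos + false_pos            # everything the intern must shelve

print(f"flagged as bestsellers: {flagged:,}")
print(f"  real bestsellers:     {true_pos:,}")
print(f"  false positives:      {false_pos:,}")
print(f"missed bestsellers:     {false_neg:,}")
```

The false positives swamp the true positives simply because there are 199 ordinary books for every bestseller on the floor.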

We applied the algorithm to a large sample that had many books in it that were not best sellers, and as a result the algorithm produced many, many false positives.

It did do its job though. Whereas the original 200,000 books contained only 0.5% bestsellers (i.e. 1,000 books), the new smaller list of 40,600 books contains about 2% best sellers (800 books), a roughly fourfold “enrichment” that came at the cost of 200 best sellers going missing, because the algorithm is not 100% perfect.
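The “enrichment” can be made precise by comparing the share of bestsellers before and after filtering. The sketch below uses the counts from the bookstore scenario (1,000 bestsellers among 200,000 books, and 800 real bestsellers among the 40,600 books the algorithm flagged); the variable names are illustrative only.

```python
TOTAL_BOOKS = 200_000
BESTSELLERS = 1_000
FLAGGED = 40_600             # 800 true positives + 39,800 false positives
FLAGGED_BESTSELLERS = 800

# Share of bestsellers in the whole store, before any filtering.
base_rate = BESTSELLERS / TOTAL_BOOKS

# Share of real bestsellers among flagged books (statisticians call this precision).
precision = FLAGGED_BESTSELLERS / FLAGGED

# How much richer in bestsellers the flagged pile is than the store as a whole.
enrichment = precision / base_rate

print(f"base rate {base_rate:.2%}, precision {precision:.2%}, "
      f"enrichment {enrichment:.2f}x")
```

Rounded, that is 0.5% before and about 2% after, which is where the fourfold figure comes from.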

Now, we could play this game a bit differently. The intern is lazy and fills the shelf with the first 1,000 books that the algorithm identifies as being best sellers. Well, based on the above enrichment factor we know that among the first 1,000 books the intern selects, about 2% (i.e. roughly 20 books) will be best sellers. So the new “best seller” shelf will consist almost entirely of books that are not bestsellers. There is even a 1 in 200 chance that Trump’s book will end up on the shelf.
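Both figures for the lazy intern can be reproduced with a short calculation. This sketch rests on an assumption the scenario only implies: the flagged books reach the intern in no particular order, so every flagged book is equally likely to be among the first 1,000. The constant names are mine.

```python
# Counts carried over from the bookstore scenario in the text.
TRUE_POSITIVES = 800        # real bestsellers the algorithm flagged
FALSE_POSITIVES = 39_800    # ordinary books flagged by mistake
SHELF_SIZE = 1_000          # books the intern grabs before stopping
FALSE_POSITIVE_RATE = 0.20  # chance an ordinary book gets flagged

flagged = TRUE_POSITIVES + FALSE_POSITIVES

# Assumption: flagged books come out in random order (no ranking by score).
expected_real = SHELF_SIZE * TRUE_POSITIVES / flagged

# Chance that one specific non-bestseller is flagged AND lands in the first 1,000.
p_specific_title = FALSE_POSITIVE_RATE * SHELF_SIZE / flagged

print(f"expected real bestsellers on the shelf: {expected_real:.1f}")
print(f"chance for one specific non-bestseller: 1 in {1 / p_specific_title:.0f}")
```

If the algorithm could rank books by how confident it is, the intern would do better by shelving the highest-scoring titles first; the scenario here deliberately ignores that option.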

Now, this result doesn’t sound quite as impressive, does it? But this is what 80% accuracy means. It will not turn publishing on its head given one million new books or manuscripts are written every year. An algorithm with 80% accuracy will just not cut it, but don’t be deterred from reading the book. It still offers some genuine and novel insights as to what makes a best seller. However, it is not going to put acquisition editors out of a job.

However, machines are getting smarter, machine learning improves, and artificial intelligence is getting more intelligent. What if the algorithm were 99.9% accurate rather than just 80% accurate? In that case the intern would have correctly identified 999 of the 1,000 best sellers lying randomly on the floor as NYT bestsellers and missed only one. But the intern also had to test the 199,000 other books, and that would have produced 199 “false positives”, meaning he would have 1,198 books to put on the shelves, 198 more than he would expect if the algorithm were 100% accurate (like an inventory list with no mistakes or typos).

Now that would sound a hell of a lot more impressive, but an algorithm that is 99.9% accurate is still a long way off for the simple reason that human taste and fashion are so incredibly unpredictable. Book publishing will always be a bit of a lottery, but that does not mean the odds cannot be improved with good data and smart algorithms.

At my own company, Jellybooks, the emphasis is on generating good data. That means understanding how people read books and when they recommend them, not just judging success based on sales data or a book’s position on a particular best seller list. Code will appear more and more in publishing even if code can’t write novels yet or predict with 100% accuracy the next NYT bestseller.
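To see how quickly the picture changes between the 80% and the 99.9% scenarios discussed above, the same bookstore arithmetic can be run for several accuracy levels. This is a sketch under the scenario's own assumptions (1,000 bestsellers, 199,000 other books, one rate applied to both groups); the function `shelf_outcome` is hypothetical and not taken from the book.

```python
def shelf_outcome(rate, bestsellers=1_000, others=199_000):
    """Return (true_positives, false_positives, precision) for a given rate.

    Assumption: `rate` applies equally to bestsellers and non-bestsellers.
    """
    true_pos = round(bestsellers * rate)
    false_pos = others - round(others * rate)
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


for rate in (0.80, 0.90, 0.99, 0.999):
    tp, fp, precision = shelf_outcome(rate)
    print(f"{rate:.1%} accurate: {tp + fp:>6,} flagged, {precision:.1%} real")
```

Even at 99.9%, roughly one in six books on the shelf would not be a real bestseller, because the ordinary books outnumber the bestsellers so heavily.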