Request for product: audiobook crowdsourcing

I think we’re at the point where we could make pretty good audiobooks using text-to-speech and a mixture of better automation and crowdsourcing. I think the product might boil down to:

  • A lightweight ePub companion file format
  • A potentially decentralized marketplace for human annotation
  • Smarter processing of content, like URLs, that acknowledges that reinterpreting presentation is necessary if the original text wasn’t made for audio

If you want it badly enough, text-to-speech is actually passable as a means for consuming blog posts, books, and other digital reading material. I use audio a lot. I listen to podcasts all the time. I listen to audiobooks, both on their own and alongside reading the ebook on my kindle. It is frustrating to be hooked on audio as a format though. An ebook often costs a few dollars while the audiobook often costs $20. Sometimes, only an abridged audiobook exists.

While text-to-speech is passable if you really want it, I’d expect people that are fine with reading would find text-to-speech totally intolerable. I’m a bit different. I have a weird relationship with focus; I can both read and listen, but audio or audio-assisted reading feels effortless while reading is fairly strenuous.

Text-to-speech doesn’t have to be so awkward for automatic listening! I’m going to outline issues I’ve noticed and then walk through how a great product could address them and provide a nice experience.

Mispronunciation of common words

Random everyday words will sometimes be mispronounced, which you have to get used to and tolerate. For example, “live” as in “I live in the city” and “live” as in “a live concert” are often mixed up.

Awkward lack of pauses around formatting

Pauses are often screwed up around certain formatting. Worst offending cases:

  • Bullet point lists: Given a series of bullet points where each line doesn’t end with a period, the audio will run from the last word of one bullet point straight into the next without skipping a beat, making it confusing to digest things point by point
  • Paragraph separators (e.g. a dinkus): Text-to-speech apps will often either scrub the ‘* * *’ separating two paragraphs and then miss the speech delay, or they will explicitly dictate it as “asterisk asterisk asterisk”
  • Embedded images and captions: The audio jumps straight from the end of a normal paragraph to the caption for an image, confusing the listener by not making it clear there is an image they aren’t seeing

Mispronunciation of text-specific words

While text-to-speech systems would likely get the Harry Potter universe right today because it is so popular, less popular books may be automatically dictated with wrong pronunciations. Very frustrating to listen to an entire book and, for example, have it pronounce “Hermione” as “her-me-own.”

Failure to scrub certain formatting from dictation

Some texts are littered with footnotes. For example, the text “Some researchers believe this theory has been invalidated [11] [12].” might be dictated as “Some researchers believe this theory has been invalidated eleven twelve” when those footnote markers should just be skipped.

Annoyingly literal dictation

When I’m actually reading some text and it has a link in the middle like “Read more here:”, I obviously don’t actually process it in my head as “h-t-t-p-s-colon-slash-slash-w-w-w-dot…” but most dictation apps will read it to me like this.

Nice to have: Separate voices for the narrator and each character

When you listen to a really good audiobook, they’ll sometimes have a different sounding voice for each character that is appropriate to their age, background, and gender. This makes the text way more immersive and movie-like, which takes the audiobook form a bit closer to a little sound-only movie for you to enjoy. Obviously, you don’t get this out of the box with text-to-speech.


I think these issues are all very tractable. Here are some ideas:

Visual editor for fixing pronunciation

A nice visual editor that lets someone scrub through the text in a book and listen to the audio produced by the text-to-speech could allow users to propose fixes for mispronounced common words.

Extract text-specific terms

When someone is creating a new refined text-to-speech dictation of some text with the app, it could extract words that aren’t in the dictionary and seem like proper nouns of the book (“Hermione”, “Hogwarts”) and let the user specify a pronunciation.
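
Here is a minimal sketch of what that extraction could look like. It assumes a plain-text book and a set of known dictionary words supplied by the caller; the function name `candidate_terms` and the "capitalized, mid-sentence, not in the dictionary" heuristic are my own illustration, not something an existing app does.

```python
import re
from collections import Counter

def candidate_terms(text, known_words, min_count=2):
    """Return capitalized words missing from known_words, most frequent first."""
    counts = Counter()
    # Crude sentence split so that sentence-initial capitals can be ignored.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z']+", sentence)
        for word in words[1:]:  # skip the first word of each sentence
            if word[0].isupper() and word.lower() not in known_words:
                counts[word] += 1
    # Only keep terms that recur, since one-off capitals are often noise.
    return [word for word, n in counts.most_common() if n >= min_count]

sample = ("Harry saw Hermione at Hogwarts. "
          "Then Hermione left Hogwarts with Ron. The castle was quiet.")
known = {"then", "the", "castle", "was", "quiet", "saw", "at", "left", "with"}
print(candidate_terms(sample, known))
```

The user would then be shown each candidate and asked to confirm or type a pronunciation. A real version would need a proper dictionary and better sentence splitting.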

More automated edge case handling for formatting issues

A bit more sophisticated text handling could help address the issues I mentioned regarding missing pauses and failure to scrub formatting like footnotes.
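
As a rough sketch of the kind of preprocessing I mean, the snippet below cleans text before it is handed to a text-to-speech engine. The `<break/>` pause token is a placeholder I made up; a real engine would have its own way of marking pauses. The rules only cover the three cases from earlier in this post (bullets, dinkuses, numeric footnote markers).

```python
import re

PAUSE = "<break/>"  # hypothetical pause token, not any real engine's markup

def prepare_for_speech(text):
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # A dinkus such as "* * *" becomes a pause instead of spoken asterisks.
        if re.fullmatch(r"(\*\s*){3,}", stripped):
            lines.append(PAUSE)
            continue
        # Drop numeric footnote markers like [11].
        stripped = re.sub(r"\s*\[\d+\]", "", stripped)
        # Bullet items: drop the marker and end with a period so the voice pauses.
        match = re.match(r"[•\-*]\s+(.*)", stripped)
        if match:
            stripped = match.group(1)
            if stripped and stripped[-1] not in ".!?:;":
                stripped += "."
        lines.append(stripped)
    return "\n".join(lines)

print(prepare_for_speech("• First point\n• Second point\n* * *\nInvalidated [11] [12]."))
```

Image captions are harder because they need the book's markup rather than plain text, so they are left out here.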

Enhance the text using HTML <meta> tags

The same is probably true for overly literal dictation, like for inline URLs; a smarter app could check for the HTML meta tags that Facebook and Twitter use in order to extract a post title and then dictate something like: “Link to ‘The Microsoft Provocateur’ by The New Yorker” instead of reading the URL letter by letter.
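
A small sketch of the idea, using only Python's standard library. It assumes the linked page has already been downloaded and that it carries Open Graph (`og:title`, `og:site_name`) or Twitter Card (`twitter:title`) tags in its HTML head. The `spoken_link` helper and its wording are my own illustration.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects <meta property=...> and <meta name=...> values from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        if key and attr.get("content"):
            # Keep the first value seen for each key.
            self.meta.setdefault(key, attr["content"])

def spoken_link(html, fallback="Link"):
    parser = MetaTagParser()
    parser.feed(html)
    title = parser.meta.get("og:title") or parser.meta.get("twitter:title")
    site = parser.meta.get("og:site_name")
    if title and site:
        return f"Link to '{title}' by {site}"
    if title:
        return f"Link to '{title}'"
    return fallback  # nothing useful found, so say something short instead of the URL

page = ('<html><head>'
        '<meta property="og:title" content="The Microsoft Provocateur">'
        '<meta property="og:site_name" content="The New Yorker">'
        '</head></html>')
print(spoken_link(page))
```

Fetching the page is left out on purpose; pages without these tags simply fall back to a short spoken placeholder.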

Crowd-source contributions and make text-to-speech audiobooks great!

Crowdsourcing could bring all of these features together in a way that might allow for custom text-to-speech audiobooks that actually feel nice to listen to. A mixture of automation and crowdsourced work by users could help identify which quotes in different books are which characters. Getting this annotation right would make it possible to assign different kinds of voices to each character which in my opinion dramatically enhances the experience of listening.
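
For the automation half, even a naive pattern could produce a first guess that humans then correct. The sketch below only handles the “quote, speech verb, name” shape with curly quotes, and the verb list is a tiny sample I picked for illustration; anything it misses or gets wrong would go to the crowd.

```python
import re

# A curly-quoted span followed by a speech verb and a capitalized name.
# The verb list is a small hypothetical sample, not an exhaustive one.
ATTRIBUTION = re.compile(
    r"“([^”]+)”,?\s+(?:said|asked|replied|whispered)\s+([A-Z][a-z]+)"
)

def attribute_quotes(text):
    """Return (character, quote) pairs for quotes with an explicit speaker."""
    return [(m.group(2), m.group(1).rstrip(",")) for m in ATTRIBUTION.finditer(text)]

sample = "“We should go,” said Hermione. “Why now?” asked Ron."
print(attribute_quotes(sample))
```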

Marketplace for quality

You can text-to-speech an entire book today; it just requires some patience to deal with the awkward parts. Some things could be improved automatically I think, but I imagine a lot of the quality will have to come from humans. Part of my issue today is that ebooks cost a few dollars while audiobooks cost around $20. Being able to place a bounty of a few dollars on a high quality text-to-speech audiobook might be enough to get something much more pleasant to consume. Many of the human tasks mentioned in this post seem like viable micro tasks where others might be very willing to do 5 minutes of work here and there for a dollar in exchange.

A companion file format to ePub

All of the work mentioned here comes from enhancing annotation of text so that pauses are where they need to be, words are pronounced right, speech from characters is properly annotated, and formatting is stripped down properly to make text-to-speech sound right. While you may want to eventually compile down to a set of mp3s that can be played by Audible or your audiobook player of choice, we wouldn’t want that to be the format used for curating our enhancements. A file format could be similar to an ePub, plus:

  • A lookup table for custom pronunciations
  • A list of pronunciation corrections for words that would otherwise be pronounced wrong
  • Annotations of quotes to specify characters
  • Annotations of parts of the text to specify different kinds of pauses

This would make it easy to ship around: a small file, amenable to diffs for coordinating decentralized collaboration, and extensible if other applications want to take ideas further (interpreting the file on the fly vs. compiling to mp3s, for example).
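
To make that concrete, here is one possible shape for the companion data, sketched as Python dataclasses serialized to JSON. Every field name here (`book_id`, `pronunciations`, `quotes`, `pauses`) and the use of character offsets are assumptions for illustration, not a proposed standard. Sorted keys and indentation keep the output stable, which is what makes it friendly to diffs.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class QuoteAnnotation:
    start: int       # character offset where the quote begins (assumed addressing scheme)
    end: int         # character offset where the quote ends
    character: str   # who is speaking

@dataclass
class PauseAnnotation:
    offset: int      # character offset where the pause is inserted
    seconds: float   # how long to pause

@dataclass
class Companion:
    book_id: str
    pronunciations: dict = field(default_factory=dict)  # word -> phonetic spelling
    quotes: list = field(default_factory=list)
    pauses: list = field(default_factory=list)

    def to_json(self):
        # sort_keys and indent give deterministic, line-oriented output for diffs.
        return json.dumps(asdict(self), indent=2, sort_keys=True, ensure_ascii=False)

companion = Companion(book_id="example-book")
companion.pronunciations["Hermione"] = "her-MY-oh-nee"
companion.quotes.append(QuoteAnnotation(start=120, end=141, character="Hermione"))
companion.pauses.append(PauseAnnotation(offset=300, seconds=1.5))
print(companion.to_json())
```

A player could read this file next to the ePub and apply the annotations on the fly, or a build step could use it to compile mp3s.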

Decentralized curation?

I’m assuming nothing like this exists. I think it is likely a product someone needs to build, but it is possible there is a chilling effect due to concerns about copyright. I’m not sure how to interpret copyright here; it seems like I should be allowed to text-to-speech any text I own, so it seems defensible. Still, maybe this is the sort of product that may be worth leveraging some decentralized tech to build. Write our annotated ePub files to IPFS. Maybe bounties and payments for tasks are paid with cryptocurrencies. I’m just spitballing, but this seems applicable.

I want more audio. I think having high quality audio for as much information as possible would mean that more people could consume more content in general. That would be better for the world in the same way that open access to academic journals is likely good. I think this sort of product could also push a very comfortable market, one without much innovation, in the right direction. You can get audiobooks that sync with ebooks today from Amazon using their Whispersync product, but it costs a ton and isn’t available for a lot of books. I think a well-built, successful, and maybe decentralized text-to-speech audiobook crowdsourcing app could force prices to drop in the rest of the space and maybe inspire more progress on audio as an information format in general.

