Building The Phonic Web

Using Machine Learning to detect Audio-friendly Visual Content

The Digg experience comes in many flavors: the Homepage, curated by editors; the Channel pages (e.g., Entertainment, Election2016, Sports), driven by algorithms and editorial oversight; Digg Deeper, an algorithmic stream that plucks the best links from your Twitter feed; Digg Reader, an RSS reader aggregating more than 10 million feeds; and Digg Messaging, powered by trending news and keyword subscriptions.

Almost all of these products are designed primarily for a visual experience. Unsurprisingly, visually inspired design is the defining characteristic of web media: the headlines, descriptions and body content of news articles are largely produced for consumption through visual feeds.

With the advancement of Text to Speech (TTS) technologies and voice agents, consumers are open to choosing an alternative dimension for news: personalized audio feeds.

Alexa, the software technology that powers Amazon’s Echo devices, can convert text to speech with incredible nuance. This raises the question: wouldn’t it be interesting if we could listen to the Internet instead of just reading it?

Auditory feeds aim to provide users with short audio briefings on a stream of news stories. But TTS-generated audio briefings have certain constraints. First, audio has no audit trail: unlike a visual feed, where you can scroll up or linger on an article as long as you want, an audio feed is hard to move back and forth in interactively. Thus, audio briefings must be more focused in what they convey.

Visual feeds are dominant on the Web. As a result, news stories are written with a visual aspect in mind, not necessarily an auditory one. To convert visual feeds into auditory streams via TTS, we must find text that conveys sufficient information when heard and gives the user a genuine story. Shown here are some candidate stories from our sports channel.

In other words, a news story heard via audio cannot depend on the visual components of the article. Moreover, Alexa currently has a TTS word limit, so article headlines and descriptions alone must make the story sufficiently interpretable in an audio briefing. This sparks a new kind of problem in the domain of NLP and Speech Synthesis: the task of finding Auditory Features in written text.

Descriptors, i.e. the title and description of a story, can be written so as to transmit sufficient, critical information. Alternatively, they can be framed such that there is an overwhelming dependency on the body content of the article.

But not all descriptor texts provide genuine information to the user in audio format. Sometimes, descriptor texts are dependent on the full body-text of an article or on other visual aspects (pictures, videos) in the content. Descriptors can also be ‘leads’ to an article (curiosity gap) or a listicle. Such stories cannot be understood completely without visual aid or consuming the entire article content, and don’t provide the best experience when heard purely via audio.

Descriptors of some news stories. While one story contains critical information in the descriptor, the others are overwhelmingly dependent on the body (either visually or for the significant gist of the story) and could be unsuitable for audio feeds.

Auditory Features in Natural Language Text — Why and How

Auditory features represent natural language text that’s satisfying to hear. But why do we care about listening to news, especially when we can scroll infinite feeds visually? Two properties come to mind. First, while visual media consumes all our attention, audio is extremely suitable for multi-tasking. It’s why people listen to music when they run or play podcasts when they cook.

Second, the typical visual web experience is to read or look something up on the Internet but, halfway through, get inadvertently distracted by something else on the page that tugs at our attention. Audio, by contrast, has no such secondary distraction as a side effect. It would be overly optimistic to say the same property holds for the visual web.

In our experience, converting the entire article content into TTS does not respect these properties. In fact, the only attribute of a news article that can be effectively experienced in TTS-generated auditory feeds is the descriptor text (headline and description).

This brings a technical complication in mapping news stories to quality auditory feeds. Since not all news stories are suitable for consumption on audio-only interfaces like Amazon Echo, especially via TTS, we must algorithmically weed out stories that contain partial or dependent information in the descriptor text.

Here are a few examples of non-descriptor dependencies, i.e. descriptor texts that provide partial or dependent information and are unsuitable candidates for auditory feeds:

Examples of training samples where the descriptor text in news stories has dependent characteristics, which could make them unsuitable for auditory feeds.

This task of Auditory Feed Generation (AFG) from visually inspired content is challenging because we need to classify whether the semantics of the descriptor text possibly depend on other media (non-descriptor) components of the article.

Can the title and description of a story tell us if we can completely understand the story without any visual aid?

This is a new kind of NLP problem: detecting Auditory Features In Text (AFIT).

Machine Learning & AFIT

So how do we automatically select Digg news stories for quality audio briefings on Amazon Echo? Our approach is to train a supervised machine learning algorithm that examines the words in descriptors to predict whether the article is safe for auditory feeds. This means automatically flagging content where the descriptor text has dependent components (which could be either visual or require the entire body text of the article).

Feature Engineering: Much of applied machine learning is good feature engineering. So what features should AFIT use? We employ a bag of words model to see which words are more likely to be indicative of audio features. When building this word vocabulary, two things are important in feature selection: (a) the vocabulary size, and (b) the document-frequency threshold below which a word can be skipped (called min_df), traditionally also referred to as a cut-off. These factors not only reduce training time, but also curb overfitting.
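The two vocabulary controls described above can be sketched in plain Python. This is a minimal illustration, not the AFIT implementation: the tokenizer, the function names, the `max_features` parameter and the sample descriptors are all assumptions made for the example.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(descriptors, min_df=2, max_features=None):
    """Return a word -> column index mapping for a bag-of-words model."""
    doc_freq = Counter()
    for text in descriptors:
        doc_freq.update(set(tokenize(text)))  # count each word once per descriptor

    # Apply the cut-off: drop words seen in fewer than min_df descriptors.
    kept = [(word, df) for word, df in doc_freq.items() if df >= min_df]

    # Cap the vocabulary size, keeping the words found in the most descriptors.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if max_features is not None:
        kept = kept[:max_features]

    return {word: index for index, word in enumerate(sorted(w for w, _ in kept))}


def to_count_vector(text, vocabulary):
    """Turn one descriptor into a vector of word counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


# Hypothetical descriptors, for illustration only.
descriptors = [
    "Home team wins the final game of the season",
    "You won't believe these photos of the game",
    "Watch the video of the winning shot",
]
vocabulary = build_vocabulary(descriptors, min_df=2)
vector = to_count_vector("The game of the year", vocabulary)
```

Raising `min_df` or lowering `max_features` shrinks the vocabulary, which is the lever the paragraph above uses to trade model size against overfitting.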

Our training data consists of ~5K news stories. Analyzing the number of words in the descriptor text of articles, we observe that the average is about 62 words with a standard deviation of 16. This gives us a nice baseline for how we could limit the word vocabulary.

Distribution of the number of words in descriptor text for some samples. The mean is 62 with a standard deviation of 16.
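A length profile like the one described can be computed with a few lines of standard-library Python. This is a sketch under assumptions: whitespace splitting stands in for whatever tokenization was actually used, and the function name and `bin_width` parameter are hypothetical.

```python
import statistics
from collections import Counter


def length_profile(descriptors, bin_width=10):
    """Summarize how many words the descriptor texts contain."""
    counts = [len(text.split()) for text in descriptors]
    # Bucket each count by the lower edge of its bin, e.g. 62 -> 60.
    histogram = Counter((count // bin_width) * bin_width for count in counts)
    return {
        "mean": statistics.mean(counts),
        "std": statistics.pstdev(counts),  # population standard deviation
        "histogram": dict(sorted(histogram.items())),
    }
```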

Since this is a supervised method, each data instance (news story) is hand-labeled audio-friendly (1) or not (0). We do this by feeding the story descriptor text through Alexa’s TTS system and listening to judge whether the audio captures the gist of the story. One observation was that audio-friendly articles have a higher number of words on average in their descriptors, indicating that the length of text might matter as a feature.

Audio-friendly news stories have a higher average number of words in their descriptor text.
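The comparison behind that observation amounts to grouping descriptor lengths by label. A minimal sketch, assuming samples arrive as hypothetical (descriptor text, label) pairs with the 1/0 labels described above:

```python
from collections import defaultdict
from statistics import mean


def mean_length_by_label(samples):
    """Average descriptor word count for each label (1 = audio-friendly, 0 = not)."""
    lengths = defaultdict(list)
    for text, label in samples:
        lengths[label].append(len(text.split()))
    return {label: mean(values) for label, values in lengths.items()}
```

If the two averages differ clearly, the word count can be appended to the bag-of-words vector as one extra numeric feature.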

Another feature is the category of the article. Since every Digg news story is tagged with a channel, we possess categorical tags for the training articles. To reduce bias, we draw an equal number of stories from each channel category (tech, sports, entertainment, politics). We found that for the entertainment category, almost 50% of the stories were not audio-friendly, meaning their descriptors were acutely dependent on the body content. By comparison, technology and sports stories were less dependent on non-descriptor attributes.

Percentage of samples from different categories that fall into the auditory and non-auditory classes. Stories in the entertainment category are usually the least audio-friendly, which is expected given the visual nature of that category.
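Both steps, drawing the same number of stories from every category and measuring the share of non-audio-friendly stories per category, can be sketched as follows. The dictionary keys `category` and `label`, the function names and the fixed seed are assumptions for the example, not details taken from the AFIT code.

```python
import random
from collections import defaultdict


def balanced_sample(stories, per_category, seed=0):
    """Draw up to `per_category` stories from each category at random."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for story in stories:
        by_category[story["category"]].append(story)

    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        sample.extend(rng.sample(group, min(per_category, len(group))))
    return sample


def unfriendly_share(stories):
    """Fraction of stories labeled 0 (not audio-friendly) in each category."""
    totals = defaultdict(int)
    unfriendly = defaultdict(int)
    for story in stories:
        totals[story["category"]] += 1
        if story["label"] == 0:
            unfriendly[story["category"]] += 1
    return {category: unfriendly[category] / totals[category] for category in totals}
```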

Classifier: The simplest way to solve this is to build a Multinomial Naive Bayes (MNB) classifier. This type of classifier looks at the words in the descriptor text and attempts to predict whether there is an indication of dependent components (visual media or body content). Like any Naive Bayes model, it assumes that the features are conditionally independent of one another given the class. The “multinomial” part describes how the words are modeled: each class has its own multinomial distribution over the vocabulary, and the word counts of a descriptor are treated as draws from it.
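To make the classifier concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing. It is not the AFIT model: the class name, the smoothing value and the toy training data are assumptions. Each class is scored by its log prior plus the sum of the per-word log-probabilities for the words in the descriptor.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over tokenized documents (lists of words)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is add-one smoothing

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocabulary = {word for doc in documents for word in doc}

        word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(documents, labels):
            word_counts[label].update(doc)

        class_counts = Counter(labels)
        self.log_prior = {
            c: math.log(class_counts[c] / len(labels)) for c in self.classes
        }

        # Smoothed per-class word probabilities, stored as logs.
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(word_counts[c].values()) + self.alpha * len(self.vocabulary)
            self.log_likelihood[c] = {
                word: math.log((word_counts[c][word] + self.alpha) / total)
                for word in self.vocabulary
            }
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Words never seen in training are skipped.
            scores[c] = self.log_prior[c] + sum(
                self.log_likelihood[c][word] for word in doc if word in self.vocabulary
            )
        return max(scores, key=scores.get)


# Hypothetical training data: 1 = audio-friendly, 0 = depends on visuals or body.
train_docs = [
    ["team", "wins", "final", "score"],
    ["company", "reports", "quarterly", "earnings"],
    ["see", "these", "photos"],
    ["watch", "this", "video"],
]
train_labels = [1, 1, 0, 0]
model = TinyMultinomialNB().fit(train_docs, train_labels)
```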

Training the AFIT model involves first using a count vectorizer to turn each descriptor into a bag-of-words vector of word counts. Next, a TF-IDF transformer reweights those counts, and the MNB is fitted on the result. A critical part of machine learning is tuning the classifier. We do this by employing ensemble methods followed by boosting.
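The TF-IDF step can be sketched on its own, taking count vectors as input. This is an illustration under assumptions: it uses a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is one common variant and not necessarily the one AFIT uses. The step names in the text resemble scikit-learn's CountVectorizer, TfidfTransformer and MultinomialNB, but the source does not say which library was used.

```python
import math


def tfidf_transform(count_vectors):
    """Reweight rows of word counts by inverse document frequency."""
    if not count_vectors:
        return []
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])

    # Number of documents in which each term appears at least once.
    doc_freq = [
        sum(1 for row in count_vectors if row[j] > 0) for j in range(n_terms)
    ]
    # Smoothed IDF: rarer terms get larger weights; no division by zero.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq]

    weighted = []
    for row in count_vectors:
        scaled = [count * idf[j] for j, count in enumerate(row)]
        norm = math.sqrt(sum(value * value for value in scaled))
        weighted.append([value / norm if norm else 0.0 for value in scaled])
    return weighted
```

The effect is that a word appearing in nearly every descriptor contributes less to the feature vector than a word that is distinctive for a few descriptors.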

On average, 7% to 9% of all links that make it to Digg trending are flagged audio-unfriendly by AFIT daily. Tested on our data set, AFIT achieves an accuracy of ~92% in detecting articles unsuitable for auditory feeds. The plan is to overlay the current model with reinforcement learning once we see more samples of data from the incoming Digg news streams.
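The two figures above measure different things: accuracy compares predictions against hand labels, while the daily flag rate needs only predictions. A small sketch with hypothetical function names makes the distinction explicit:

```python
def accuracy(predictions, labels):
    """Share of stories whose predicted label matches the hand label."""
    correct = sum(1 for predicted, actual in zip(predictions, labels) if predicted == actual)
    return correct / len(labels)


def flagged_share(predictions, unfriendly_label=0):
    """Share of stories the model flags as not audio-friendly."""
    flagged = sum(1 for predicted in predictions if predicted == unfriendly_label)
    return flagged / len(predictions)
```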

We use Slack to monitor the daily performance of the AFIT model, logging details such as which domains are flagged most often and what percentage of links pass through the filter. This helps us tune the algorithm so it does not discount domains or word patterns unnecessarily.
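A daily summary of this kind could be assembled as plain text before being posted to a channel. The sketch below only builds the message; the input format (hypothetical (url, flagged) pairs), the function name and the wording of the report are assumptions, and no Slack API call is shown.

```python
from collections import Counter
from urllib.parse import urlparse


def daily_report(links, top_n=3):
    """Summarize one day of filter decisions from (url, flagged) pairs."""
    links = list(links)
    flagged_domains = Counter(
        urlparse(url).netloc for url, flagged in links if flagged
    )
    passed = sum(1 for _, flagged in links if not flagged)
    pass_rate = passed / len(links) if links else 0.0

    lines = [
        f"AFIT daily report: {pass_rate:.0%} of {len(links)} links passed the filter"
    ]
    # List the domains that were flagged most often.
    for domain, count in flagged_domains.most_common(top_n):
        lines.append(f"  {domain}: {count} flagged")
    return "\n".join(lines)
```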

When Voice Bots meet Visually-Inspired Content

News media on the internet is overwhelmingly created for visual feeds and webpages. Thus, the content of news articles is written, produced and edited with a visual aspect in mind. A headline or description is meant to pique the user’s interest, after which they can dive into the entire article.

Many news stories are headlined or described in this style. Based on the phrases and word patterns in the titles or descriptions of trending news stories, AFIT can flag stories that aren’t audio-friendly when channeled into a TTS-powered audio feed. Although such flagged stories might contain exciting visual content, their heavy visual dependency makes them sub-optimal candidates for audio briefings.
