Bag-of-words representation for video channels’ semantic structuring

Dailymotion is a video platform that hosts millions of videos owned by tens of thousands of channels. Each video consists of an audio-video stream together with text metadata: a title, a description of the video content, and keyword tags. To keep the platform running smoothly, our data team develops unsupervised algorithms that automatically structure videos and channels based on their text metadata.

In this article, we present an approach that uses the bag-of-words representation to discover the latent structure of dailymotion’s channels. We feed the channels’ bag-of-words representations into a clustering algorithm, k-means, and into a neural auto-encoder embedding, in order to reveal the channels’ proximity structure.

Bag-of-words Representation

Examples of the bag-of-words representations of a video gaming channel and a hip-hop music channel, displayed as word clouds. The more often a word appears in the metadata of a channel’s videos, the more it stands out.

The bag-of-words representation originates in natural language processing (NLP). For a given text document, it consists of ignoring the sentence structure and retaining only the word occurrences. The rationale behind this representation is that the context of words, as captured by their co-occurrences, carries enough information to allow for semantic content extraction.

In the case of dailymotion’s channels, the bag-of-words representation is obtained in the following way. We denote the set of dailymotion’s channels as 𝒟 = {𝒹m, m=1,…,M} and assume that we are given a vocabulary of N words, 𝒲 = {𝓌n, n=1,…,N}.

The bag-of-words representation consists of a matrix C = C(𝒹m,𝓌n) in which each (m,n) entry is the empirical probability of the n-th vocabulary word appearing in the metadata of the videos of the m-th channel. Row m of the matrix C is the bag-of-words of channel 𝒹m.
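To make the construction concrete, here is a minimal Python sketch of how such a matrix could be built. The channel names, the metadata strings, the whitespace tokenization, and the normalization of each row by the channel's total word count are all assumptions made for illustration; they are not taken from our production pipeline.

```python
from collections import Counter

import numpy as np

# Hypothetical toy data: each channel is the concatenated metadata of its videos.
channel_metadata = {
    "gaming_channel": "gameplay walkthrough boss fight gameplay trailer",
    "hiphop_channel": "rap freestyle beat rap album trailer",
}

# Vocabulary: the sorted set of all words seen in the metadata (an assumption;
# a real vocabulary would likely be filtered and capped).
vocabulary = sorted({w for text in channel_metadata.values() for w in text.split()})
word_index = {word: n for n, word in enumerate(vocabulary)}

# C[m, n] = share of the words in channel m's metadata that are vocabulary word n.
C = np.zeros((len(channel_metadata), len(vocabulary)))
for m, text in enumerate(channel_metadata.values()):
    for word, count in Counter(text.split()).items():
        C[m, word_index[word]] = count
C /= C.sum(axis=1, keepdims=True)  # each row becomes a probability distribution
```

With this normalization every row of `C` sums to one, which is what allows rows to be compared as probability distributions.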

Clustering Channels with K-means

K-means clustering is a classical machine learning algorithm used to cluster samples into K groups of similar elements based on a distance measure. The main steps of the k-means algorithm are as follows:

1. Initialization: among the samples, randomly select K centroids to represent the K clusters
2. Iterate until convergence: assign each sample to the cluster of its closest centroid, then update each centroid as the barycenter of the samples in its cluster
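The two steps above can be sketched in a few lines of Python. This is an illustrative sketch, not our production code: the function names, the `distance` parameter, the stopping rule (assignments no longer change), and the handling of empty clusters are assumptions. The Euclidean distance is used as the default only because it is the classical choice.

```python
import numpy as np


def euclidean(x, centroids):
    """Distance from one sample to every centroid."""
    return np.linalg.norm(centroids - x, axis=1)


def kmeans(samples, k, distance=euclidean, max_iter=100, seed=0):
    """Plain k-means; `distance(x, centroids)` returns one value per centroid."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct samples as the starting centroids.
    centroids = samples[rng.choice(len(samples), size=k, replace=False)].copy()
    labels = np.full(len(samples), -1)
    for _ in range(max_iter):
        # Assignment step: each sample joins the cluster of its closest centroid.
        new_labels = np.array([np.argmin(distance(x, centroids)) for x in samples])
        if np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the barycenter of its samples.
        for j in range(k):
            members = samples[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.4, 0.2],
                   [5.0, 5.0], [5.2, 5.1], [5.4, 5.2]])
labels, centroids = kmeans(points, k=2)
```

Because the distance is passed in as a function, the same loop can be reused with a dissimilarity measure better suited to the data than the Euclidean distance.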

Classically, the Euclidean distance is used as the dissimilarity measure. When channels are represented as bags-of-words, however, the Euclidean distance is not well suited to comparing them, because each bag-of-words is a distribution of word occurrence probabilities. A distance designed to compare probability distributions, such as the Bhattacharyya distance, is more appropriate.

Given two channels’ bag-of-words Ci = (Cin, n=1,…,N) and Cj = (Cjn, n=1,…,N), the Bhattacharyya distance between them is defined as:

$$D_B(C_i, C_j) = -\ln \left( \sum_{n=1}^{N} \sqrt{C_{in}\, C_{jn}} \right)$$
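As a rough illustration, the distance can be computed in a few lines of Python. This is a minimal sketch assuming the standard definition for discrete distributions (the negative logarithm of the sum, over the vocabulary, of the square roots of the products of the two probabilities). The small `eps` floor and the three example vectors are assumptions added for the example.

```python
import numpy as np


def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two discrete probability distributions."""
    # Bhattacharyya coefficient: overlap between the two distributions, in [0, 1].
    coefficient = np.sum(np.sqrt(p * q))
    # eps (an assumption) avoids log(0) when the distributions share no word.
    return -np.log(max(coefficient, eps))


# Hypothetical bag-of-words rows over a 4-word vocabulary.
gaming = np.array([0.5, 0.5, 0.0, 0.0])
esports = np.array([0.4, 0.4, 0.2, 0.0])
cooking = np.array([0.0, 0.0, 0.5, 0.5])

d_same = bhattacharyya_distance(gaming, gaming)    # zero for identical distributions
d_close = bhattacharyya_distance(gaming, esports)  # small: most words are shared
d_far = bhattacharyya_distance(gaming, cooking)    # large: no shared word
```

Two channels that use the same words in similar proportions get a distance close to zero, while channels with disjoint vocabularies get a very large one.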

Examples of cluster centroids displayed as word clouds. Represented clusters are mostly composed of channels about movies (top left), automotive (top right), beauty (bottom left), and cooking (bottom right).

The above figure presents four examples of cluster centroids obtained when clustering twenty thousand dailymotion channels into fifty clusters. The four displayed clusters contain mostly channels about movies, automotive, beauty, and cooking. Many other clusters, not displayed here, are about news, sports, music, gaming, kids, etc.

Although clustering channels using bag-of-words works well, it has an inherent drawback: it requires working with high-dimensional data, since the vocabulary size N can be very large. We therefore decided to map the bag-of-words into a lower-dimensional representation using neural auto-encoders.

Channels Embedding With Neural Auto-Encoders

As deep learning with neural networks becomes increasingly common, structuring data using neural embeddings has become very popular. The principle of neural embedding is to train a neural network to map data to another representation in such a way that, along the way, the network builds lower-dimensional intermediate representations that better expose the hidden underlying structure of the data.

In the case of channel structure analysis, we build a one-hidden-layer neural auto-encoder that takes a channel’s bag-of-words as input and reconstructs it through a bottleneck hidden layer. The bottleneck layer has a dimensionality L that is much lower than the input bag-of-words dimensionality N. This structure forces the network to learn a lower-dimensional embedding of the input data. As a result, the network produces similar lower-dimensional representations for channels that have similar bag-of-words representations.

The neural network mapping is as follows. First, the channels’ bag-of-words are mapped to the lower dimensional embedding space as:

Then the embedded representations are mapped back to the input channel bag-of-words as:

where relu is the rectified linear unit activation function, and the variables W and b are the neural network parameters, which are learned from training samples. The following figure shows a graphical view of the auto-encoder.
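To give a feel for how such an auto-encoder works, here is a small NumPy sketch. It is an illustration under stated assumptions, not our actual model: it assumes the embedding is computed as relu(C·W1 + b1), that the decoder applies a softmax so that the reconstruction is again a probability distribution, that the loss is the cross-entropy between input and reconstruction, and that training uses plain gradient descent. The toy data, the names `W1`, `b1`, `W2`, `b2`, and the hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 channels over a 6-word vocabulary, drawn from two
# "topics" so that there is some structure to compress. Rows sum to 1.
C = np.vstack([
    rng.dirichlet([5, 5, 5, 0.5, 0.5, 0.5], size=10),
    rng.dirichlet([0.5, 0.5, 0.5, 5, 5, 5], size=10),
])
M, N = C.shape
L = 2  # bottleneck size; much smaller than N in a real setting

# Parameters W and b of the encoder (1) and decoder (2), small random init.
W1, b1 = rng.normal(0, 0.5, (N, L)), np.zeros(L)
W2, b2 = rng.normal(0, 0.5, (L, N)), np.zeros(N)


def forward(C):
    """Encode with relu, decode with softmax (the softmax is an assumption)."""
    pre = C @ W1 + b1
    E = np.maximum(pre, 0.0)                      # embedding, shape (M, L)
    logits = E @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # reconstruction, rows sum to 1
    return pre, E, P


def cross_entropy(C, P):
    """Mean cross-entropy between the inputs and their reconstructions."""
    return -np.mean(np.sum(C * np.log(P + 1e-12), axis=1))


initial_loss = cross_entropy(C, forward(C)[2])
learning_rate = 0.2
for _ in range(500):
    pre, E, P = forward(C)
    # Gradients of the mean cross-entropy (valid because rows of C sum to 1).
    d_logits = (P - C) / M
    d_pre = (d_logits @ W2.T) * (pre > 0)
    W2 -= learning_rate * E.T @ d_logits
    b2 -= learning_rate * d_logits.sum(axis=0)
    W1 -= learning_rate * C.T @ d_pre
    b1 -= learning_rate * d_pre.sum(axis=0)

_, embedding, reconstruction = forward(C)
final_loss = cross_entropy(C, reconstruction)
```

After training, each row of `embedding` is the L-dimensional representation of one channel. In practice a deep learning framework would handle the gradients and the optimization; the manual updates are written out here only to keep the sketch self-contained.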

Using the procedure described above, we generated lower-dimensional embeddings for about twenty thousand dailymotion channels. The figure below shows a two-dimensional t-SNE (t-distributed stochastic neighbor embedding) visualization of the auto-encoder embeddings of a selection of channels, labeled by category (auto, animals, lifestyle, music, news, short-films, sport, travel, tv, video-games).

Two-dimensional t-SNE display of video channels, labeled by category, after neural auto-encoder embedding. Channels from the same category are mapped next to each other in the embedding space. Channels are colored according to their language: blue for English, green for French, and red for other languages.

This figure shows that, in the embedding space, channels with similar content are mapped close to each other. For example, although language was not explicitly provided to the model, French channels are mainly located in the bottom left. Music channels are mainly located in the middle left, and automotive channels in the middle right.


To conclude, our data team is working on machine learning algorithms to semantically structure the video channels we host. The resulting channel structures are currently used for automatic video topic detection and for video recommendation, which we will cover in future articles.