How to design deep learning models with sparse inputs in TensorFlow Keras
A few weeks ago, we revealed how Dailymotion’s data scientists and data engineers collaborate to release models into production efficiently. But how does this change affect our day-to-day life as data scientists? Does it leave enough flexibility to break new ground in artificial intelligence? The answer is yes! Here are some pointers on how to conduct a project that fits our machine learning automation pipeline while tackling a technical issue, namely ingesting sparse inputs in Keras.
One of the challenges we face as a multinational company is having complete knowledge of our video catalog while being efficient both in terms of computation time and costs. This means that important products developed by the Dailymotion Data & AI team are related to video characterization. We are constantly asking ourselves how we can optimize the process in order to reduce our costs in storage, training, etc. At Dailymotion, we came up with two main tracks:
- Stick to a machine learning blueprint designed through multiple collaborations between data scientists and data engineers – the main components being the extraction, preprocessing, training, prediction and serving.
- When dealing with massive text metadata, make use of sparse representation. A video only has a few distinct words describing it in its metadata, while the English dictionary contains nearly 200 000 different words. Hence, there is no need to use a full dense representation of our videos considering the few non-zero entries.
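To give a rough sense of scale for the second point, here is a back-of-the-envelope sketch. The vocabulary size is the order of magnitude quoted above; the figure of 20 distinct words per video is purely an illustrative assumption, not a measured value, and the byte counts assume float32 values with int64 (row, column) indices.

```python
# Rough memory comparison for one video's bag-of-words vector.
VOCABULARY_SIZE = 200_000  # order of magnitude quoted in the text
WORDS_PER_VIDEO = 20       # illustrative assumption, not a measured figure

# Dense: one float32 (4 bytes) per vocabulary entry, mostly zeros.
dense_bytes = VOCABULARY_SIZE * 4

# Sparse: per non-zero entry, two int64 indices (row, column) plus one float32 value.
sparse_bytes = WORDS_PER_VIDEO * (2 * 8 + 4)

ratio = dense_bytes // sparse_bytes
print(dense_bytes, sparse_bytes, ratio)
```

Under these assumptions the sparse form is several orders of magnitude smaller, which is the motivation for everything that follows.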
Tackling NLP subjects
The objective of developing a machine learning (ML) tool is to release it efficiently into production, so that it is completely agnostic to new data and to whether training and prediction run in batch or in real time. As a team of data scientists and data engineers, we gained maturity by collaborating to instantiate an ML blueprint for all future NLP subjects. Our intention now is to mimic each component of this ML automation pipeline and adapt to what has already been implemented.
Predicting the main categories of a video based on its text metadata
Videos are made of one or more audio and video streams as well as text metadata (title, description, and tags). Due to the diversity of Dailymotion’s video catalog, it is crucial to be able to describe the content of each video in order to link them together for recommendation purposes. At Dailymotion, we only address content characterization based on its text metadata, although we already have tracks for using the other signals.
A bit of context
The first step in categorizing Dailymotion’s contents was the process of labeling each video with very granular Wikidata entities, namely topics. However, to characterize our audience using their areas of interest, we needed to regroup topics into more generic ones. We made use of the relations within the Wikidata knowledge graph to provide a non-contextual automated content taxonomy for our topics.
To overcome this lack of contextualization, we initiated a project to predict the main contextualized categories of a video based on its text metadata.
Since a video can be associated with multiple categories, our approach consisted of building a deep neural network to achieve multi-label classification. Let’s take a deep dive into the details of this use case, focusing on each component of the ML blueprint and on how we tackled the challenge of sparse data representation.
1. Extract & Preprocess
The first step was to transform all the textual information into features to be fed to our network. We began by cleaning the text with classical approaches such as removing non-ASCII characters, HTML tags, digits, etc. Then, we applied a tokenizer from a Python package such as NLTK or Polyglot.
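As an illustration of this cleaning step, here is a minimal sketch using only the Python standard library. The function names and regular expressions are our own assumptions for the example, and the simple regex tokenizer stands in for a real tokenizer such as the NLTK or Polyglot ones mentioned above.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")              # HTML tags
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # anything outside ASCII
DIGIT_RE = re.compile(r"\d+")                # digits
TOKEN_RE = re.compile(r"[a-z]+")             # naive word tokenizer (stand-in)


def clean_text(text):
    """Decode HTML entities, then strip tags, non-ASCII characters and digits."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    return text.lower()


def tokenize(text):
    """Return the list of lowercase word tokens found in the cleaned text."""
    return TOKEN_RE.findall(clean_text(text))


tokens = tokenize("<b>Top 10</b> goals &amp; highlights ★ Ligue 1")
print(tokens)  # ['top', 'goals', 'highlights', 'ligue']
```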
What about feature extraction?
We decided to make use of a bag-of-words representation where, for a given text document, the sentence structure is ignored and only word occurrences matter.
Related article: Bag-of-words representation for video channels’ semantic structuring
Thanks to the hand-in-hand work with the data engineers, we distributed the computation using multiple workers in Dataflow and hence reduced the preprocessing step time by a factor of 20.
This being said, we can go a step further in optimizing the computation time by using a sparse representation of our videos. What’s more, this manipulation could be a key improvement in training computation.
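To make the idea concrete, here is a small sketch of how tokenized documents could be turned into a sparse bag-of-words, stored as the usual triplet of indices, values and dense shape. The helper names and the toy documents are hypothetical and only serve the illustration; they are not taken from our pipeline.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse_triplet(tokenized_docs, vocab):
    """Return (indices, values, dense_shape) describing the bag-of-words matrix."""
    indices, values = [], []
    for row, tokens in enumerate(tokenized_docs):
        # Count occurrences per column, ignoring out-of-vocabulary tokens.
        counts = Counter(vocab[t] for t in tokens if t in vocab)
        for col in sorted(counts):
            indices.append([row, col])
            values.append(float(counts[col]))
    dense_shape = [len(tokenized_docs), len(vocab)]
    return indices, values, dense_shape


docs = [["goal", "football", "goal"], ["recipe", "pasta"]]
vocab = build_vocabulary(docs)
indices, values, dense_shape = to_sparse_triplet(docs, vocab)
print(indices, values, dense_shape)
```

Only the non-zero counts are stored, so the size of the representation grows with the number of words actually present in each video rather than with the size of the vocabulary.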
We decided to implement a 2-layer neural network, which raised the question of how to ingest sparse tensors. Our deep learning framework is TensorFlow (version 1.13.1), and the layers of its Keras API cannot handle sparse tensors for the moment. Passing one raises the following error:
AttributeError: 'SparseTensor' object has no attribute 'shape'
The only way around this would be to convert back to a dense tensor, which would be inefficient. We thus decided to add a custom dense layer extending the tf.keras.layers.Layer class that handles both sparse and dense tensors.
The following three methods are necessary:
- build: creates the kernel and bias variables of the layer
- call: defines the forward computation
- compute_output_shape: specifies how to compute the output shape of the layer given the input shape
The implemented custom dense layer ingests sparse or dense inputs and outputs a dense underlying representation of the videos.
We then built a fully customizable model by subclassing tf.keras.Model and fed this representation to a regular dense layer to predict (class, probability) pairs for each class in the training set.
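The layer and the model described above could look roughly like the sketch below. It is not our production code: the class names SparseDense and CategoryClassifier, the layer sizes, the relu activation and the sigmoid output (a common choice for multi-label classification) are assumptions made for the example, and the snippet assumes TensorFlow 2.x with eager execution rather than the 1.13.1 release mentioned earlier.

```python
import tensorflow as tf


class SparseDense(tf.keras.layers.Layer):
    """Dense layer accepting either a tf.SparseTensor or a dense tensor."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # Create the kernel and bias once the input dimension is known.
        input_dim = int(input_shape[-1])
        self.kernel = self.add_weight(
            name="kernel", shape=(input_dim, self.units),
            initializer="glorot_uniform")
        self.bias = self.add_weight(
            name="bias", shape=(self.units,), initializer="zeros")

    def call(self, inputs):
        # Sparse inputs use a sparse-dense product; the output is always dense.
        if isinstance(inputs, tf.SparseTensor):
            outputs = tf.sparse.sparse_dense_matmul(inputs, self.kernel)
        else:
            outputs = tf.matmul(inputs, self.kernel)
        return self.activation(outputs + self.bias)

    def compute_output_shape(self, input_shape):
        return tuple(input_shape[:-1]) + (self.units,)


class CategoryClassifier(tf.keras.Model):
    """Two-layer network: sparse-aware hidden layer, then a regular Dense layer."""

    def __init__(self, hidden_units, num_classes):
        super().__init__()
        self.hidden = SparseDense(hidden_units, activation="relu")
        # One independent probability per class (multi-label assumption).
        self.classifier = tf.keras.layers.Dense(num_classes, activation="sigmoid")

    def call(self, inputs):
        return self.classifier(self.hidden(inputs))


# Toy bag-of-words batch: 2 videos, vocabulary of 4 words.
bow = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 2], [1, 3]],
                      values=[2.0, 1.0, 1.0, 1.0], dense_shape=[2, 4])
model = CategoryClassifier(hidden_units=8, num_classes=3)
probabilities = model(bow)
print(probabilities.shape)
```

The branch in call is the key design choice: the sparse path never materializes the full dense input, while dense tensors still go through the usual matrix product, so the same layer serves both cases.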
To distribute predictions efficiently, the data engineers instantiated a distributed predictor for which we only needed to implement two methods: load (called once) and predict (called across multiple workers). This module is vital as it allows predictions to happen in real time.
The predictions are computed using the tf.keras.Model.predict method. We encountered a real challenge when ingesting sparse tensors for predictions. The steps argument represents the total number of batches of samples to process before declaring the prediction round finished. The catch is that, once the round is finished, calling predict again with different data returns an empty array.
To build an input that can be predicted in streaming, the trick is to use:
- the indices of the elements in the sparse tensor that contain non-zero values
- the values for each element in indices
- the dense shape of the sparse tensor
This way, the steps argument can be set to None and the prediction can be called in parallel by multiple workers!
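As a hedged illustration of the three ingredients listed above, the sketch below rebuilds a sparse tensor from plain indices, values and dense shape for each incoming request. The helper names are hypothetical, the fixed kernel is only a stand-in for a trained model, and the model is called directly instead of through tf.keras.Model.predict, whose handling of sparse inputs depends on the TensorFlow version. TensorFlow 2.x with eager execution is assumed.

```python
import tensorflow as tf


def sparse_batch(indices, values, dense_shape):
    """Rebuild a tf.SparseTensor from its three plain components."""
    sparse = tf.SparseTensor(
        indices=tf.constant(indices, dtype=tf.int64),
        values=tf.constant(values, dtype=tf.float32),
        dense_shape=tf.constant(dense_shape, dtype=tf.int64))
    # Sparse ops expect indices in row-major order.
    return tf.sparse.reorder(sparse)


def predict_stream(model_fn, indices, values, dense_shape):
    """Run one prediction on a freshly rebuilt sparse input."""
    return model_fn(sparse_batch(indices, values, dense_shape)).numpy()


# Stand-in for a trained model: fixed weights, 4 input words, 3 classes.
kernel = tf.ones((4, 3))


def model_fn(sparse_inputs):
    return tf.sigmoid(tf.sparse.sparse_dense_matmul(sparse_inputs, kernel))


# One video whose metadata contains word 0 twice and word 1 once.
scores = predict_stream(model_fn, [[0, 1], [0, 0]], [1.0, 2.0], [1, 4])
print(scores.shape)
```

Because each call only receives plain arrays and rebuilds its own tensor, nothing is shared between requests, which is what makes it straightforward to run from several workers.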
In doing so, we are completely in line with Dailymotion’s vision of adapting and reusing existing components as well as merging our efforts between data scientists and data engineers.
The long term vision
The journey of content characterization at Dailymotion is not finished yet and we are already planning our next moves. We intend to make use of every bit of information related to the videos – that is to say its text metadata, its frames and its audio – as each and every source of data completes and enriches our knowledge. Once we know exactly what’s in our videos, we will be able to build models to select/aggregate the predicted classes characterizing them.