PRADO: NLP for Android!!

Vaibhav Tiwari
Published in Analytics Vidhya · 4 min read · Oct 24, 2020

If it is about ML model deployment, cloud servers become a must… right?

NLP-based applications are generally deployed on cloud servers for smooth processing and seamless performance. But shuttling data between devices and servers causes increased latency, higher internet costs, loss of offline functionality and, worst of all, exposure of so-called private user data.

Due to the limited computational power of mobile devices and the enormous size of modern NLP models, deploying a model solely on mobile is a topic of advanced research. In September 2019, Google introduced a projection-based model called PRADO for performing on-device text classification without any need for a cloud server. PRADO has turned out to be useful for IoT setups and mobile applications, since the quantized model is only ~200KB in size and has just a couple of hundred thousand trainable parameters.

Approach used in PRADO

Most state-of-the-art NLP models use pre-trained tokenizers to generate a word embedding for every word in the sentence, which makes the overall model complex and computationally expensive when the target is a comparatively easy task like text classification. For a task like machine translation, generating an embedding for every word (including articles, pronouns, and punctuation) would be worth it. But problems like topic modelling and sentiment classification depend on a few important, less frequent words of the sentence, so converting the whole sentence into vectors is not the best idea. Instead, we can focus on the important words that carry the semantics of the sentence.

Leveraging this idea, PRADO focuses on a subset of the sentence for vectorization and final classification. It combines trainable projections with attention and convolution layers to capture long-range dependencies. This change in strategy enables PRADO to deliver results comparable to BERT and complex LSTMs with only about 175K parameters to train.

Semantic fingerprinting is the process of representing each word of a text in binary format. A 2D matrix is created where one axis lists the parsed patches of text from the entire document and the other axis lists the words; a cell is marked ‘1’ if the word occurs in that patch and ‘0’ otherwise.
Here is a picture portraying the fingerprinting of “Dog” from a given piece of text.

Courtesy: Fullstackacademy
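The fingerprinting idea above can be sketched in a few lines of numpy. This is an illustrative toy (the patches, vocabulary, and `fingerprint` helper are my own examples, not from the paper): each column of the matrix is one word's binary fingerprint across the text patches.

```python
import numpy as np

def fingerprint(patches, vocabulary):
    """Binary 'semantic fingerprint' matrix: rows are text patches,
    columns are vocabulary words; a cell is 1 when the word occurs
    in that patch, else 0."""
    matrix = np.zeros((len(patches), len(vocabulary)), dtype=np.int8)
    for i, patch in enumerate(patches):
        tokens = set(patch.lower().split())
        for j, word in enumerate(vocabulary):
            if word in tokens:
                matrix[i, j] = 1
    return matrix

patches = ["the dog barks loudly", "a cat sleeps", "the dog chases the cat"]
vocab = ["dog", "cat", "barks"]
fp = fingerprint(patches, vocab)
# The column for "dog" (fp[:, 0]) is that word's fingerprint: [1, 0, 1]
```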

Architecture & Working

In the projection embedding layer, each word is semantically fingerprinted into a 2B-bit vector, and a projection operator then maps it to a ternary vector, i.e. one represented using {-1, 0, 1}. These ternary vectors are passed through a trainable neural network to produce the word embeddings. The trainable projections do not store a vocabulary; instead, they update the projection weights during training. This gives them an edge over existing tokenizers that must store the entire vocab!
With this projection-based method, the final embedding is much smaller than one built from a big pre-trained vocabulary.

PRADO architecture (Paper)
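The projection step can be sketched as follows. This is a simplified stand-in for the paper's projection operator (the real one hashes token features in a more elaborate way): here a stock hash maps each word to 2B bits, and consecutive bit pairs are decoded into B ternary values. Note there is no stored vocabulary — only the hash function.

```python
import hashlib
import numpy as np

def ternary_projection(word, B=64):
    """Map a word to a length-B ternary vector in {-1, 0, 1} by hashing
    it to 2*B bits and decoding the bits in pairs. Illustrative sketch
    of a vocabulary-free projection, not the exact PRADO operator."""
    digest = hashlib.sha256(word.encode()).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
    bits = np.resize(bits, 2 * B)          # keep exactly 2B bits
    pairs = bits.reshape(B, 2)
    # Each bit pair selects one of {-1, 0, 1} (two pairs map to 0).
    lookup = {(0, 0): 0, (0, 1): 1, (1, 0): -1, (1, 1): 0}
    return np.array([lookup[tuple(p)] for p in pairs], dtype=np.int8)

vec = ternary_projection("dog", B=8)
# Deterministic: the same word always yields the same ternary vector,
# which a small trainable layer then turns into a dense embedding.
```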

Convolution and Attention operation

  • A projected feature network ‘F’ captures the features, or words, of the sentence that are useful for the classification task. It performs a convolution over the entire sentence.
  • F = conv(e, n, N)
    ‘e’ is the sequence of projected word embeddings received from the previous layer
    ‘n’ is the kernel size
    ‘N’ is the output size
  • Next, an attention layer weights the importance of each embedding.
  • W = conv(e, n, N)
    A softmax of the output ‘W’ is taken to get a probability distribution over the sequence.
  • E = softmax(W) · F
    This operation outputs a fixed-length encoding of the sentence in which the important, deciding features carry higher weights.
  • The final encoding ‘E’ is passed through a trainable neural network to perform the classification.
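The steps above can be sketched in numpy. This is a minimal illustration with random weights (the shapes and the `conv1d` helper are my own choices, not from the paper): one convolution produces the features F, a second produces attention logits W, and a softmax over the sequence axis pools them into a fixed-length encoding E.

```python
import numpy as np

def conv1d(e, kernel, bias):
    """'Same'-padded 1D convolution over a sequence of embeddings.
    e: (T, d_in); kernel: (n, d_in, N); returns (T, N)."""
    n, d_in, N = kernel.shape
    pad = n // 2
    e_pad = np.pad(e, ((pad, pad), (0, 0)))
    out = np.zeros((e.shape[0], N))
    for t in range(e.shape[0]):
        window = e_pad[t:t + n]                               # (n, d_in)
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
T, d, n, N = 5, 4, 3, 6            # seq length, embed dim, kernel size, output size
e = rng.normal(size=(T, d))        # projected word embeddings from the last layer
F = conv1d(e, rng.normal(size=(n, d, N)), np.zeros(N))   # feature network
W = conv1d(e, rng.normal(size=(n, d, N)), np.zeros(N))   # attention logits
A = np.exp(W) / np.exp(W).sum(axis=0)   # softmax over the sequence axis
E = (A * F).sum(axis=0)                 # fixed-length sentence encoding, shape (N,)
```

The key point is that E has a fixed size N regardless of the sentence length T, so a small classifier head can be attached on top.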

End Notes

The 8-bit quantized version of PRADO performs extremely well and can be a go-to option for developing self-contained NLP-based mobile apps. The model also supports transfer learning and has delivered good performance with few trade-offs.

In September 2020, Google released a successor to PRADO called pQRNN, which keeps the projection-layer technique for embedding generation but takes a different, more advanced approach in the subsequent layers. It is well worth reading about too.

Thanks for reading!!

Resources and documents: https://www.aclweb.org/anthology/D19-1506.pdf, Fullstackacademy, Google AI
