Deploying a TensorFlow Model on Android
Hey Everyone! Welcome to a tutorial series on deploying TensorFlow models on Android. There are plenty of blogs and tutorials on deploying TensorFlow models to the cloud, but there aren't many on deploying models on Android phones, especially for text classification. So here I attempt to bridge that gap and show how to do question classification and deploy the model on Android to classify locally. Let's get into it!
This series will be divided into 4 parts.
- Part 1 will be preparing the data in Python.
- Part 2 will be writing our neural network model.
- Part 3 will be setting up the Android application with the necessary libraries.
- Part 4 will be implementing the complete Android application.
For the question classification data, I will be using the dataset from CogComp. The format of the questions is simple: the first word of each line is the label (also called the class), separated from the question itself by ":", so each line has the form `LABEL:question text`. Each question belongs to one of 6 categories (ABBR, ENTY, DESC, HUM, LOC and NUM). These abbreviations stand for:
- ABBR: Abbreviation
- LOC: Location
- DESC: Description
- ENTY: Entity
- NUM: Numeric
- HUM: Human
Next, open up a new file and call it "DataHelpers.py". This file will contain the functions that load, clean, and transform the data before feeding it into any machine-learning algorithm.
Go ahead and add these imports.
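The original import cell was an image, so here is a minimal substitute covering what the helpers in `DataHelpers.py` need. NLTK (used later for stop words and WordNet) is a third-party assumption and is noted in comments since it needs a separate install:

```python
# DataHelpers.py -- imports for the data-loading and cleaning helpers.
import re
from collections import Counter

# The post also relies on NLTK for its stop-word list and the WordNet
# lexicon. That is a separate install:
#   pip install nltk
# then, once, in a Python shell:
#   import nltk
#   nltk.download("stopwords")
#   nltk.download("wordnet")
```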
To start off with data manipulation, we first need to import the data from the text file "data.txt" you previously saved. The function "readData" reads the text file where the dataset is stored, splits each line into the label (Y), i.e. the class it belongs to, and the question itself, and stores them in two lists ("labels" and "questions").
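The "readData" snippet itself isn't reproduced here, so below is a sketch of how it might look, assuming each line of "data.txt" has the `LABEL:question` shape described above (the file name and encoding are assumptions):

```python
# A sketch of readData: the label comes first on each line and is
# separated from the question by ":".
def readData(path):
    labels, questions = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or ":" not in line:
                continue                      # skip blank/malformed lines
            label, question = line.split(":", 1)
            labels.append(label.strip())
            questions.append(question.strip())
    return labels, questions
```

Splitting only on the first ":" keeps questions that themselves contain colons intact.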
Our first step in data cleaning is to remove stop words (words such as "the", "if", "a", …) that do not contribute to the meaning of a sentence. The function "remove_stopwords" removes these unnecessary words from the dataset, and the function "cleanInput" strips unnecessary characters such as apostrophes, question marks, commas, slashes, and extra white space from a string.
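A sketch of the two cleaning helpers. The small STOP_WORDS set below is a stand-in for NLTK's full English list (`nltk.corpus.stopwords.words("english")`):

```python
import re

# Tiny stand-in stop-word set; the post uses NLTK's full English list.
STOP_WORDS = {"the", "a", "an", "if", "is", "of", "in", "to", "and"}

def cleanInput(text):
    # Strip punctuation/special characters, lowercase, collapse whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def remove_stopwords(text):
    # Drop words that carry little meaning for classification.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)
```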
But wait! Will the machine learning algorithm understand text data? Can it perform mathematical operations on strings? No. So how do we represent text in a numeric format the algorithm can work with? One way is to create a vocabulary. A vocabulary can be thought of as a set of all the unique words encountered in the dataset (here, the questions). We can build it with a Counter in Python, whose keys are guaranteed to be unique, so no duplicate words are stored. At the same time, we only want to add actual English words, not random characters or special symbols; NLTK provides the WordNet lexicon, which covers a large subset of the English dictionary, for exactly this check.
You might also have noticed that we are saving the words to a text file. Why? This is the tricky part of deploying the model on Android: we want the vocabulary to be identical in the Python and Android implementations, i.e. a consistent vocabulary. There has to be an exact Java counterpart of "DataHelpers.py" on Android, because our model is built in Python on the data format produced by "DataHelpers.py" and will require the same format on Android. We'll get into this more deeply in parts 3 and 4.
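Taken together, vocabulary construction and the save step might look like this minimal sketch. Here `isEnglishWord` is a simple stand-in for the WordNet membership test (`wordnet.synsets(word)`), and sorting the words is my own choice to keep the saved file deterministic, which helps keep Python and Android in sync:

```python
from collections import Counter

def isEnglishWord(word):
    # Stand-in for a WordNet lookup: accept purely alphabetic tokens.
    return word.isalpha()

def buildVocabulary(questions, path="vocabulary.txt"):
    counts = Counter()
    for q in questions:
        counts.update(w for w in q.lower().split() if isEnglishWord(w))
    vocabulary = sorted(counts)   # deterministic order for the Android side
    with open(path, "w") as f:
        f.write("\n".join(vocabulary))
    return vocabulary
```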
Okay, so now we have successfully created our vocabulary, but that alone does not give us a numeric representation of words. How do we link each word with an index of sorts? "word2index" is a dictionary where the key is a word and the value is an index (an arbitrarily assigned number). Each word is assigned an index ranging from 1 to the total number of words.
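Building "word2index" takes only a few lines; this sketch assigns IDs starting at 1, as described above (the tiny vocabulary is just for illustration):

```python
vocabulary = ["capital", "france", "what"]   # illustrative vocabulary

# Map each vocabulary word to an integer ID, starting at 1
# (0 can then be reserved for unknown or padding words).
word2index = {}
for i, word in enumerate(vocabulary):
    word2index[word] = i + 1

# word2index -> {"capital": 1, "france": 2, "what": 3}
```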
Another way of representing the text data is to count the number of occurrences of each word in the sentence. I'll show both methods of representing the text.
To make this clearer, consider the two representations side by side. In the ID-based one, each word in a sentence is replaced by its assigned ID, and the same word always gets the same index. In the count-based one, each position in the vector holds the number of times the corresponding vocabulary word occurs in the sentence.
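Since the original example images aren't reproduced here, this tiny illustration (my own made-up sentence and vocabulary, not from the post) shows both representations:

```python
word2index = {"far": 1, "how": 2, "is": 3, "it": 4}
sentence = "how far is it how".split()

# ID-based: one ID per word position; repeated words repeat their ID.
id_vector = [word2index[w] for w in sentence]

# Count-based: one slot per vocabulary index; index 0 is left unused
# because word IDs start at 1.
count_vector = [0] * (len(word2index) + 1)
for w in sentence:
    count_vector[word2index[w]] += 1

# id_vector    -> [2, 1, 3, 4, 2]
# count_vector -> [0, 1, 2, 1, 1]   ("how" occurs twice)
```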
We now have a mapping between each word in the vocabulary and an assigned integer. The next step is to map the text data to a numeric (integer) representation (called a word embedding here). The words may also be called features. Each question in the dataset is converted into a vector of these values. There are various representations of word vectors, as described above: IDs can be assigned to each word by referencing the "word2index" dictionary, while frequency-based vectors store the count of each word in the sentence. The length of all word vectors must be uniform. For this exercise, I will assume the length of the word vector equals the total number of words in our vocabulary set. That is,
length of WordVector = len(vocabularySet)
The function "featureExtractionFrequency" converts each sentence in the dataset into a word vector using the frequency-based (count) representation. Every sentence gets a vector of fixed length (the value of lengthOfVector). For every word in the sentence, the word2index dictionary is used to retrieve the word's index; since indices run from 1 up to the vocabulary size, the largest index in word2index equals lengthOfVector. This index is then used to access an element of the word vector (here, newFeatureVector) and add 1 to it (the count increases every time the word is encountered).
Think of the index as the identity of the word: if the word is present in a sentence, we use this identity to change its property (here, add 1). It is as simple as adding 1 at the index assigned to the word (via word2index) in "newFeatureVector".
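A sketch of "featureExtractionFrequency" as just described; the +1 on the vector length is my own adjustment to leave index 0 unused, since word indices start at 1:

```python
def featureExtractionFrequency(sentences, word2index):
    lengthOfVector = len(word2index) + 1      # +1: indices start at 1
    features = []
    for sentence in sentences:
        newFeatureVector = [0] * lengthOfVector
        for word in sentence.split():
            index = word2index.get(word)      # skip out-of-vocabulary words
            if index is not None:
                newFeatureVector[index] += 1  # count every occurrence
        features.append(newFeatureVector)
    return features
```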
For the index-based representation, the length of the word vector must also be uniform, as said earlier. We fix the length of each word vector to the length (in words) of the longest sentence in the dataset and use "word2index" to map each word in the sentence to its index.
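A sketch of the index-based variant ("featureExtractionIndex" is a hypothetical name; the post's function name isn't shown). Vectors are padded with 0 to the length of the longest sentence:

```python
def featureExtractionIndex(sentences, word2index):
    # Uniform length = word count of the longest sentence in the dataset.
    maxLength = max(len(s.split()) for s in sentences)
    features = []
    for sentence in sentences:
        # 0 doubles as the "unknown word" ID and the padding value.
        vector = [word2index.get(w, 0) for w in sentence.split()]
        vector += [0] * (maxLength - len(vector))   # pad to uniform length
        features.append(vector)
    return features
```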
You can use either of the two functions for feature extraction; in this example I will use the count-based representation. Now all we have to do is call these functions, so let's go ahead and do that!
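Putting it all together, here is a condensed, self-contained sketch of the whole pipeline on two inline sample questions (file reading and the NLTK checks from the real "DataHelpers.py" are simplified away):

```python
import re
from collections import Counter

# Two sample lines in the LABEL:question format described earlier.
raw = ["DESC:What is a sonnet ?", "LOC:What is the capital of France ?"]

# Split labels from questions and clean the text.
labels, questions = [], []
for line in raw:
    label, question = line.split(":", 1)
    labels.append(label)
    questions.append(re.sub(r"[^a-z0-9\s]", "", question.lower()).strip())

# Vocabulary and word -> index mapping (IDs start at 1).
vocabulary = sorted(Counter(w for q in questions for w in q.split()))
word2index = {word: i + 1 for i, word in enumerate(vocabulary)}

def featureExtractionFrequency(sentence):
    vector = [0] * (len(word2index) + 1)   # index 0 left unused
    for word in sentence.split():
        vector[word2index[word]] += 1
    return vector

# Each question is now a numeric vector ready for the model in part 2.
features = [featureExtractionFrequency(q) for q in questions]
```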
This concludes part 1 of the series. The other parts will be published soon, so stay tuned!
You can find the entire code on my GitHub. Any feedback to improve the content would be appreciated :)