LSTM on Amazon Food Reviews using Google Colaboratory

Theodox Bolt
4 min readJun 20, 2019


Place: Munnar, Kerala | Device: Redmi 1s | Taken on 28-Sep-2014

After trying out a range of machine learning algorithms, the one that excited me most was the Recurrent Neural Network, specifically Long Short-Term Memory (LSTM).

Though studying them was exciting, practising them needs resources like GPUs or a high-end system. That's when I explored Google Colaboratory, which gives you a cloud-hosted GPU for free.
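If you want to confirm that your Colab runtime actually has a GPU attached (Runtime → Change runtime type → Hardware accelerator → GPU), a quick sanity check like the sketch below works. It assumes TensorFlow, which Colab ships with by default.

import tensorflow as tf

# Prints something like '/device:GPU:0' when a GPU is attached, an empty string otherwise
print(tf.test.gpu_device_name())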

You may have a question like "Well, that's fine, but how can I train on my own dataset?" There's a solution for that too. Upload your .csv or database file to Google Drive and... let's write some code for authentication.

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Once the above dependencies are installed, we can move on to authentication.

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

For more on loading files, see the Colab documentation on working with external data.

Google Authentication

Once authenticated, you can load the uploaded dataset by first copying its shareable link from Google Drive.

link = "https://drive.google.com/open?id=1VwEFJH367Y0WCXpX0e"

Once you get the link, retrieve the ID and get the content of the file as shown below.

import pandas as pd

fluff, id = link.split('=')
print(id)  # verify that you have everything after '='
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('preprocessed.csv')  # saves a local copy in the Colab runtime
preprocessed_amazon_data = pd.read_csv('preprocessed.csv')

By now, you should be able to display the contents of the file.

Here, I have two columns: one is preprocessed reviews (I have already removed all stopwords and special characters) and the other is Values (whether the given review is positive or negative).
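For example, a quick peek like the one below works; the column names simply follow the description above.

# Quick look at the loaded data
print(preprocessed_amazon_data.shape)
print(preprocessed_amazon_data.columns)
preprocessed_amazon_data.head()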

If you suspect the dataset might have some NaNs, you can remove them using the command below.

preprocessed_amazon_data = preprocessed_amazon_data.dropna()

Now, this part is important. I mean really important. Yes, it's about data leakage. To get rid of it, divide the whole dataset into train and test sets right now, before any further processing, and ...

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(preprocessed_amazon_data['preprocessed_reviews'].values, preprocessed_amazon_data['Values'].values, test_size=0.30, shuffle=False)

Of the two splits, you fit the tokenizer only on the train dataset.

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=50000)
tokenizer.fit_on_texts(X_train)

Tokenizer creates a dictionary of the words in the reviews, indexed by frequency of appearance. 'num_words' caps the vocabulary at that many of the most frequent words.

Tokenized Words
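If you want to peek at what the tokenizer built, something like the sketch below works; word_index maps each word to an integer, with lower indices generally assigned to more frequent words.

# Inspect the tokenizer's vocabulary (built only from X_train)
print(len(tokenizer.word_index))               # number of distinct words seen
print(list(tokenizer.word_index.items())[:5])  # a few (word, index) pairs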

Once the tokenizer is fit and the word index is ready, we can convert the train and test text to sequences of numbers.

X_train_tok = tokenizer.texts_to_sequences(X_train)
X_test_tok = tokenizer.texts_to_sequences(X_test)
Sequenced Text

Now, the sequenced train data has rows of different lengths, because each review contains a different number of words. Hence we pad the sequences with zeros so that they all share the same dimension.

from keras.preprocessing import sequence

max_review_length = 600
X_train_pad = sequence.pad_sequences(X_train_tok, maxlen=max_review_length)
X_test_pad = sequence.pad_sequences(X_test_tok, maxlen=max_review_length)
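A quick, optional sanity check that padding worked: every review should now have exactly max_review_length entries.

print(X_train_pad.shape)  # (number of train reviews, 600)
print(X_test_pad.shape)   # (number of test reviews, 600)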

Here, we see an embedding layer before the LSTM. It's used to create word vectors for the incoming words. First, it builds a matrix of top_words (i.e., 50000 here) by embedding_vector_length (i.e., 32 here). Second, it picks the corresponding 1x32 vector for each word. Now if you consider a data point or a review (remember that it is now a sequence of numbers), the embedding layer repeats the second step for each word and gives a 600x32 matrix.

So if you consider a train dataset of 25,000 data points or reviews, it will be of shape 25000 x 600 x 32. This is what gets fed to the LSTM layer.
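If you want to see that shape for yourself, here is a small, purely illustrative sketch: an Embedding layer on its own, fed with a made-up batch of four padded "reviews". The layer sizes match the ones used in the model below.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Illustrative only: the embedding layer in isolation
emb = Sequential()
emb.add(Embedding(50000, 32, input_length=600))
dummy_batch = np.random.randint(0, 50000, size=(4, 600))  # 4 fake padded reviews
print(emb.predict(dummy_batch).shape)  # (4, 600, 32): batch x review length x vector length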

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

embedding_vector_length = 32
top_words = 50000
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Once the above model is instantiated, we can train it. And remember: if you want to plot validation loss and training loss, you need to capture the model's losses using 'history'.

history = model.fit(X_train_pad, Y_train, epochs=10, batch_size=64, validation_data=(X_test_pad, Y_test))
# Final evaluation of the model on test data
scores = model.evaluate(X_test_pad, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Training the model

Once the model is trained and the history of losses is captured, we can use it to plot and visualize the progress.

Plot the loss
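A minimal sketch of how such a plot can be produced from the captured history (assuming matplotlib, which Colab provides):

import matplotlib.pyplot as plt

# Plot training vs validation loss per epoch
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy loss')
plt.legend()
plt.show()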

Other Architectures to try:

The above example had 100 units in one LSTM layer. Below is the code snippet if we want to create the model with two LSTM layers, one with 32 units and the other with 16 units.

top_words=50000
embedding_vector_length = 40
model = Sequential()
model.add(Embedding(top_words+1, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(16))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Note the 'return_sequences=True' on the first LSTM layer: it makes that layer output its hidden state at every timestep (rather than only the last one), which is exactly the sequence input the second LSTM layer expects.

Similarly, you can try different architectures by varying the number of units, top words, embedding vector length, etc. and can observe their behaviour.

Thanks for your time! Happy Learning!
