Processing data for Machine Learning with TensorFlow
Turn your dataset into a TensorFlow input pipeline, step by step, for beginners
It can be confusing to train on your own data with TensorFlow, with errors complaining that a shape or dtype is wrong. This is my note on organizing a tf.data dataset in a simple way, for movie review classification.
In this article, I’m going to work with the Large Movie Review Dataset and train a keras.models.Sequential model, which is a plain stack of layers.
My steps:
1. Load Dataset
2. Create tf.data.Dataset for input
3. Create TextVectorization layer (including tokenization and padding)
4. Create Bag of Words
5. Create the model
6. Compile and fit the model
Load dataset
Start by checking which files are in the archive, using os.walk(filepath). We will see something like this:
/root/.keras/datasets/aclImdb ['imdb.vocab', 'imdbEr.txt', 'README']
/root/.keras/datasets/aclImdb/train ['labeledBow.feat', 'unsupBow.feat', 'urls_neg.txt', 'urls_unsup.txt', 'urls_pos.txt']
All files are listed under their folders. We will use the reviews under these 4 folders: the training and test sets, each with positive and negative sentiment.
/root/.keras/datasets/aclImdb/train/pos
/root/.keras/datasets/aclImdb/train/neg
/root/.keras/datasets/aclImdb/test/pos
/root/.keras/datasets/aclImdb/test/neg
Build the 4 path lists. Each folder contains 12500 reviews: (12500, 12500, 12500, 12500)
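As a sketch of this step (the helper name `review_paths` is my own; the URL is the dataset's official home):

```python
import os
from tensorflow import keras

def review_paths(dirpath):
    # Collect the full path of every review file in one folder.
    return [os.path.join(dirpath, name) for name in sorted(os.listdir(dirpath))]

# Download and extract the archive into ~/.keras/datasets/. With TF 2.x,
# get_file returns the path of the downloaded archive, and the extracted
# aclImdb/ folder sits next to it.
archive = keras.utils.get_file(
    "aclImdb_v1.tar.gz",
    "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    extract=True)
path = os.path.join(os.path.dirname(archive), "aclImdb")

train_pos = review_paths(os.path.join(path, "train", "pos"))
train_neg = review_paths(os.path.join(path, "train", "neg"))
test_pos = review_paths(os.path.join(path, "test", "pos"))
test_neg = review_paths(os.path.join(path, "test", "neg"))

print(len(train_pos), len(train_neg), len(test_pos), len(test_neg))
```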
Create a TensorFlow Dataset, using tf.data.TextLineDataset
For simplicity, I’m not writing a function here. We can pass the lists of paths directly into tf.data.TextLineDataset. Remember to convert the paths to strings. Here is what we do:
1. Pass the paths into tf.data.TextLineDataset() and generate 6 tf.data.Datasets (positive and negative for each of the training, validation, and test sets)
2. Add labels to the datasets, using tf.data.Dataset.map(): 0 for negative, 1 for positive
3. Combine neg and pos, using tf.data.Dataset.concatenate()
4. Shuffle the training set, then batch and prefetch all three sets (we can also skip prefetch for now)
<PrefetchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
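The four steps above can be sketched like this (`make_dataset` is a hypothetical helper name; the shuffle buffer and batch size are illustrative):

```python
import tensorflow as tf

def make_dataset(pos_paths, neg_paths, shuffle=False, batch_size=32):
    # Each review file holds a single line, so TextLineDataset yields
    # one string per file. Labels: 1 for positive, 0 for negative.
    pos = tf.data.TextLineDataset(pos_paths).map(lambda review: (review, 1))
    neg = tf.data.TextLineDataset(neg_paths).map(lambda review: (review, 0))
    dataset = pos.concatenate(neg)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=25000)
    # Batch, then prefetch so the next batch is prepared during training.
    return dataset.batch(batch_size).prefetch(1)

# train_set = make_dataset(train_pos, train_neg, shuffle=True)
```

The element spec matches the output above: string reviews and int32 labels, both with an unknown batch dimension.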
Processing the reviews
We build a small tf.constant first to make sure our steps are correct, and then turn them into a function.
After running preprocess_word(X_example) on the simple example, the result looks like this:
<tf.Tensor: shape=(3, 50), dtype=string, numpy=
array([[b"It's", b'a', b'great,', b'great', b'movie!', b'I', b'l', b'<pad>', ..., b'<pad>'],
       [b'It', b'was', b'terrible,', b'run', b'away!!!', b'<pad>', ..., b'<pad>'],
       [b'I', b"don't", b'get', b'it!!', b'<pad>', ..., b'<pad>']], dtype=object)>
(each row is padded with b'<pad>' up to 50 entries; the repeats are truncated here for readability)
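A minimal version of such a function might look like this (a sketch: it only splits on whitespace and pads to a fixed width, which matches the output above, but the article's actual preprocessing may do more):

```python
import tensorflow as tf

def preprocess_word(X_batch, n_words=50):
    # Splitting a batch of strings on whitespace gives a RaggedTensor,
    # since every review has a different number of words.
    Z = tf.strings.split(X_batch)
    # Pad each row with b"<pad>" up to exactly n_words columns
    # (longer reviews are truncated to n_words tokens).
    return Z.to_tensor(shape=[None, n_words], default_value=b"<pad>")
```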
Create TextVectorization layer
1. Get the full vocabulary, sorted by word frequency
2. Make tensors of words and indices to create the table initializer
3. Make the lookup table
Word frequency: use the top 1000 words as the vocabulary. Next, we output a list of the max_size most frequent words (setting 1000 for now), ensuring that <pad> is first.
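Step 1 can be sketched with a collections.Counter (the helper name, and forcing <pad> to index 0 explicitly rather than relying on it being the most frequent token, are my own choices):

```python
from collections import Counter

import tensorflow as tf

def build_vocabulary(dataset, max_size=1000, n_words=50):
    counter = Counter()
    # dataset yields (reviews, labels) batches; tokenize and pad each
    # batch the same way as in the preprocessing step above.
    for X_batch, _ in dataset:
        tokens = tf.strings.split(X_batch).to_tensor(
            shape=[None, n_words], default_value=b"<pad>")
        for review in tokens.numpy():
            counter.update(review)
    # <pad> is by far the most common token, but pin it to index 0
    # explicitly so the lookup table always maps it there.
    words = [w for w, _ in counter.most_common() if w != b"<pad>"]
    return [b"<pad>"] + words[:max_size - 1]
```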
Create the TextVectorization layer
This layer looks up the index of each word in the vocabulary. We have to create the class and adapt it before we use it in our model.
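A sketch of such a layer, following the three steps above (note this is a custom class, distinct from the built-in keras.layers.TextVectorization; the names and defaults are illustrative):

```python
import tensorflow as tf
from tensorflow import keras

class TextVectorization(keras.layers.Layer):
    def __init__(self, n_oov_buckets=100, n_words=50, **kwargs):
        super().__init__(dtype=tf.string, **kwargs)
        self.n_oov_buckets = n_oov_buckets
        self.n_words = n_words

    def adapt(self, vocabulary):
        # Build a static lookup table: word -> index in the vocabulary.
        words = tf.constant(vocabulary)
        word_ids = tf.range(len(vocabulary), dtype=tf.int64)
        init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
        # Words outside the vocabulary hash into n_oov_buckets extra ids.
        self.table = tf.lookup.StaticVocabularyTable(init, self.n_oov_buckets)

    def call(self, inputs):
        # Tokenize and pad, then replace each token with its index.
        tokens = tf.strings.split(inputs).to_tensor(
            shape=[None, self.n_words], default_value=b"<pad>")
        return self.table.lookup(tokens)
```

Calling adapt() with the frequency-sorted vocabulary (with <pad> first) guarantees that <pad> maps to index 0.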
Bag of Words
This is also a layer we need to add to our model. It gives a summary of word counts per review. For example, given
tf.constant([[1, 3, 1, 0, 0], [2, 2, 0, 0, 0]])
we count the occurrences of each token id per row and get this (not the actual output, just the idea):
[[0: 2, 1: 2, 2: 0, 3: 1], [0: 3, 1: 0, 2: 2, 3: 0]]
then drop id 0 (<pad>), which leaves [[2., 0., 1.], [0., 2., 0.]]
Create a class to add to the model later; the model will pass its inputs through it.
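A sketch of this layer using one-hot counting (counting via tf.one_hot plus a sum is one possible implementation; tf.math.bincount would work too):

```python
import tensorflow as tf
from tensorflow import keras

class BagOfWords(keras.layers.Layer):
    def __init__(self, n_tokens, **kwargs):
        super().__init__(**kwargs)
        self.n_tokens = n_tokens  # vocabulary size incl. OOV buckets

    def call(self, inputs):
        # One-hot each token id, then sum over the word axis to get
        # per-review counts of every token id.
        one_hot = tf.one_hot(inputs, self.n_tokens)
        counts = tf.reduce_sum(one_hot, axis=1)
        # Drop column 0, which counts the <pad> token.
        return counts[:, 1:]
```

On the small example above, this reproduces the counts with the <pad> column removed.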
Summarize and train the model
Without fine-tuning the hyperparameters, we get a validation accuracy of around 0.73.
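Putting it all together, a minimal sketch of the final model (the layer sizes and the optimizer are my guesses, not stated in the article; only the binary sentiment output is fixed by the task):

```python
import tensorflow as tf
from tensorflow import keras

def build_model(text_vectorization, bag_of_words):
    model = keras.models.Sequential([
        text_vectorization,  # raw strings -> padded token ids
        bag_of_words,        # token ids -> per-review word counts
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # P(positive)
    ])
    model.compile(loss="binary_crossentropy", optimizer="nadam",
                  metrics=["accuracy"])
    return model

# model = build_model(text_vectorization, bag_of_words)
# history = model.fit(train_set, epochs=5, validation_data=valid_set)
```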
Epoch 1/5
782/782 [==============================] - 12s 15ms/step - loss: 0.1737 - accuracy: 0.9498 - val_loss: 0.6392 - val_accuracy: 0.7236
Epoch 2/5
782/782 [==============================] - 12s 15ms/step - loss: 0.1060 - accuracy: 0.9794 - val_loss: 0.7092 - val_accuracy: 0.7214
Epoch 3/5
782/782 [==============================] - 12s 15ms/step - loss: 0.0605 - accuracy: 0.9944 - val_loss: 0.7724 - val_accuracy: 0.7258
Epoch 4/5
782/782 [==============================] - 12s 15ms/step - loss: 0.0327 - accuracy: 0.9989 - val_loss: 0.8467 - val_accuracy: 0.7179
Epoch 5/5
782/782 [==============================] - 12s 15ms/step - loss: 0.0177 - accuracy: 0.9998 - val_loss: 0.9208 - val_accuracy: 0.7253
<tensorflow.python.keras.callbacks.History at 0x7fb437eb2d68>
In the next article, I will summarize my notes on processing image data for training with TensorFlow and Keras.
Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems 2nd Edition