Processing data for Machine Learning with TensorFlow
Turn your dataset into a TensorFlow input pipeline, step by step, for beginners
It can be confusing to train on your own data with TensorFlow, with errors complaining that a shape or dtype is wrong. This is my note on organizing a tf.data dataset in a simple way, for movie review classification.
In this article, I’m going to work with the Large Movie Review Dataset and train a keras.models.Sequential model, which is a plain stack of layers.
My steps:
1. Load Dataset
2. Create tf.data.Dataset for input
3. Create TextVectorization layer (including tokenization and padding)
4. Create Bag of Words
5. Create the model
6. Compile and fit the model
Load dataset
Start by checking which files are in the archive, using os.walk(filepath). We will see something like this:
/root/.keras/datasets/aclImdb ['imdb.vocab', 'imdbEr.txt', 'README']
/root/.keras/datasets/aclImdb/train ['labeledBow.feat', 'unsupBow.feat', 'urls_neg.txt', 'urls_unsup.txt', 'urls_pos.txt']
All files are listed under their folders. We will use the reviews under these 4 folders: the training and test sets, each with positive and negative sentiment.
/root/.keras/datasets/aclImdb/train/pos
/root/.keras/datasets/aclImdb/train/neg
/root/.keras/datasets/aclImdb/test/pos
/root/.keras/datasets/aclImdb/test/neg
Build the 4 path lists. Each folder contains 12500 reviews: (12500, 12500, 12500, 12500)
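As a sketch of this step (the helper name `review_paths` is my own; the URL is the dataset's official home):

```python
import os
from tensorflow import keras

def review_paths(dirpath):
    # Collect the full path of every review file in one folder.
    return [os.path.join(dirpath, name) for name in sorted(os.listdir(dirpath))]

# Download and extract the archive into ~/.keras/datasets/. With TF 2.x,
# get_file returns the path of the downloaded archive, and the extracted
# aclImdb/ folder sits next to it.
archive = keras.utils.get_file(
    "aclImdb_v1.tar.gz",
    "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    extract=True)
path = os.path.join(os.path.dirname(archive), "aclImdb")

train_pos = review_paths(os.path.join(path, "train", "pos"))
train_neg = review_paths(os.path.join(path, "train", "neg"))
test_pos = review_paths(os.path.join(path, "test", "pos"))
test_neg = review_paths(os.path.join(path, "test", "neg"))

print(len(train_pos), len(train_neg), len(test_pos), len(test_neg))
```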
Create a TensorFlow Dataset, using tf.data.TextLineDataset
For simplicity, I’m not writing a function here. We can pass the lists of paths directly into tf.data.TextLineDataset. Remember to convert the paths to strings. Here is what we do:
1. Pass the paths into tf.data.TextLineDataset() and generate 6 tf.data.Datasets (positive and negative for each of the training, validation, and test sets)
2. Add labels to the datasets, using tf.data.Dataset.map(): 0 for negative, 1 for positive
3. Combine neg and pos, using tf.data.Dataset.concatenate()
4. Shuffle the training set, then batch and prefetch all three sets (we can also skip prefetch for now)
<PrefetchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
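The four steps above can be sketched like this (`make_dataset` is a hypothetical helper name; the shuffle buffer and batch size are illustrative):

```python
import tensorflow as tf

def make_dataset(pos_paths, neg_paths, shuffle=False, batch_size=32):
    # Each review file holds a single line, so TextLineDataset yields
    # one string per file. Labels: 1 for positive, 0 for negative.
    pos = tf.data.TextLineDataset(pos_paths).map(lambda review: (review, 1))
    neg = tf.data.TextLineDataset(neg_paths).map(lambda review: (review, 0))
    dataset = pos.concatenate(neg)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=25000)
    # Batch, then prefetch so the next batch is prepared during training.
    return dataset.batch(batch_size).prefetch(1)

# train_set = make_dataset(train_pos, train_neg, shuffle=True)
```

The element spec matches the output above: string reviews and int32 labels, both with an unknown batch dimension.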
Processing the reviews
We build a small tf.constant first to make sure our steps are correct, and then turn them into a function.
After running preprocess_word(X_example) on the simple example, the result looks like this:
<tf.Tensor: shape=(3, 50), dtype=string, numpy=
array([[b"It's", b'a', b'great,', b'great', b'movie!', b'I', b'l', b'<pad>', ..., b'<pad>'],
       [b'It', b'was', b'terrible,', b'run', b'away!!!', b'<pad>', ..., b'<pad>'],
       [b'I', b"don't", b'get', b'it!!', b'<pad>', ..., b'<pad>']], dtype=object)>
(each row is padded with b'<pad>' up to 50 entries; the repeats are truncated here for readability)
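A minimal version of such a function might look like this (a sketch: it only splits on whitespace and pads to a fixed width, which matches the output above, but the article's actual preprocessing may do more):

```python
import tensorflow as tf

def preprocess_word(X_batch, n_words=50):
    # Splitting a batch of strings on whitespace gives a RaggedTensor,
    # since every review has a different number of words.
    Z = tf.strings.split(X_batch)
    # Pad each row with b"<pad>" up to exactly n_words columns
    # (longer reviews are truncated to n_words tokens).
    return Z.to_tensor(shape=[None, n_words], default_value=b"<pad>")
```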
Create TextVectorization layer
1. Get the full vocabulary, sorted by word frequency
2. Make tensors of words and indices to create the table initializer
3. Make the lookup table
Word frequency: use the top 1000 words as the vocabulary. Next, we output a list of the max_size most frequent words (setting 1000 for now), ensuring that <pad> is first.
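Step 1 can be sketched with a collections.Counter (the helper name, and forcing <pad> to index 0 explicitly rather than relying on it being the most frequent token, are my own choices):

```python
from collections import Counter

import tensorflow as tf

def build_vocabulary(dataset, max_size=1000, n_words=50):
    counter = Counter()
    # dataset yields (reviews, labels) batches; tokenize and pad each
    # batch the same way as in the preprocessing step above.
    for X_batch, _ in dataset:
        tokens = tf.strings.split(X_batch).to_tensor(
            shape=[None, n_words], default_value=b"<pad>")
        for review in tokens.numpy():
            counter.update(review)
    # <pad> is by far the most common token, but pin it to index 0
    # explicitly so the lookup table always maps it there.
    words = [w for w, _ in counter.most_common() if w != b"<pad>"]
    return [b"<pad>"] + words[:max_size - 1]
```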
Create the TextVectorization layer
This layer looks up the index of each word in the vocabulary. We have to create the class and adapt it before we use it in our model.
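A sketch of such a layer, following the three steps above (note this is a custom class, distinct from the built-in keras.layers.TextVectorization; the names and defaults are illustrative):

```python
import tensorflow as tf
from tensorflow import keras

class TextVectorization(keras.layers.Layer):
    def __init__(self, n_oov_buckets=100, n_words=50, **kwargs):
        super().__init__(dtype=tf.string, **kwargs)
        self.n_oov_buckets = n_oov_buckets
        self.n_words = n_words

    def adapt(self, vocabulary):
        # Build a static lookup table: word -> index in the vocabulary.
        words = tf.constant(vocabulary)
        word_ids = tf.range(len(vocabulary), dtype=tf.int64)
        init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
        # Words outside the vocabulary hash into n_oov_buckets extra ids.
        self.table = tf.lookup.StaticVocabularyTable(init, self.n_oov_buckets)

    def call(self, inputs):
        # Tokenize and pad, then replace each token with its index.
        tokens = tf.strings.split(inputs).to_tensor(
            shape=[None, self.n_words], default_value=b"<pad>")
        return self.table.lookup(tokens)
```

Calling adapt() with the frequency-sorted vocabulary (with <pad> first) guarantees that <pad> maps to index 0.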
Bag of Words
This is also a layer we need to add to our model. It gives a summary of word counts per review. For example, given
tf.constant([[1, 3, 1, 0, 0], [2, 2, 0, 0, 0]])
we count the occurrences of each token id per row and get this (not the actual output, just the idea):
[[0: 2, 1: 2, 2: 0, 3: 1], [0: 3, 1: 0, 2: 2, 3: 0]]
then drop id 0 (<pad>), which leaves [[2., 0., 1.], [0., 2., 0.]]
Create a class to add to the model later; the model will pass its inputs through it.
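A sketch of this layer using one-hot counting (counting via tf.one_hot plus a sum is one possible implementation; tf.math.bincount would work too):

```python
import tensorflow as tf
from tensorflow import keras

class BagOfWords(keras.layers.Layer):
    def __init__(self, n_tokens, **kwargs):
        super().__init__(**kwargs)
        self.n_tokens = n_tokens  # vocabulary size incl. OOV buckets

    def call(self, inputs):
        # One-hot each token id, then sum over the word axis to get
        # per-review counts of every token id.
        one_hot = tf.one_hot(inputs, self.n_tokens)
        counts = tf.reduce_sum(one_hot, axis=1)
        # Drop column 0, which counts the <pad> token.
        return counts[:, 1:]
```

On the small example above, this reproduces the counts with the <pad> column removed.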
Summarize and train the model
Without fine-tuning the hyperparameters, we get a validation accuracy of around 0.73.
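Putting it all together, a minimal sketch of the final model (the layer sizes and the optimizer are my guesses, not stated in the article; only the binary sentiment output is fixed by the task):

```python
import tensorflow as tf
from tensorflow import keras

def build_model(text_vectorization, bag_of_words):
    model = keras.models.Sequential([
        text_vectorization,  # raw strings -> padded token ids
        bag_of_words,        # token ids -> per-review word counts
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # P(positive)
    ])
    model.compile(loss="binary_crossentropy", optimizer="nadam",
                  metrics=["accuracy"])
    return model

# model = build_model(text_vectorization, bag_of_words)
# history = model.fit(train_set, epochs=5, validation_data=valid_set)
```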
Epoch 1/5
782/782 [==============================] - 12s 15ms/step - loss: 0.1737 - accuracy: 0.9498 - val_loss: 0.6392 - val_accuracy: 0.7236
Epoch 2/5
782/782 [==============================] - 12s 15ms/step - loss: 0.1060 - accuracy: 0.9794 - val_loss: 0.7092 - val_accuracy: 0.7214
Epoch 3/5
782/782 [==============================] - 12s 15ms/step - loss: 0.0605 - accuracy: 0.9944 - val_loss: 0.7724 - val_accuracy: 0.7258
Epoch 4/5
782/782 [==============================] - 12s 15ms/step - loss: 0.0327 - accuracy: 0.9989 - val_loss: 0.8467 - val_accuracy: 0.7179
Epoch 5/5
782/782 [==============================] - 12s 15ms/step - loss: 0.0177 - accuracy: 0.9998 - val_loss: 0.9208 - val_accuracy: 0.7253
<tensorflow.python.keras.callbacks.History at 0x7fb437eb2d68>
In the next article, I will summarize my notes on processing image data for training with TensorFlow and Keras.
Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems 2nd Edition