NLP With TensorFlow/Keras: Explanation and Tutorial

Alexander Chow
Published in Geek Culture · 7 min read · Apr 1, 2021

Image source: https://www.blumeglobal.com/learning/natural-language-processing/

What is NLP?

Natural language processing (NLP) is a branch of machine learning and artificial intelligence that focuses on deriving meaning from human language and processing it automatically. It has many use cases in modern society, including chatbots, sentiment analysis of customer reviews, and identifying fake news. In this article, you’ll learn about the most important concepts behind NLP and how to implement emotion detection with TensorFlow and Keras.

The Main Concepts

Tokenization

Tokenization is the process of splitting text into individual “tokens”, usually words. The process also discards certain less important characters, such as punctuation. It is an essential first step in virtually every NLP model, since the model’s understanding of the text is built from these tokens.
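As a small illustration (the sentence here is made up and not part of the tutorial’s dataset), Keras ships a text_to_word_sequence helper that does exactly this kind of splitting and punctuation stripping:

from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Lowercase the sentence, strip punctuation, and split it into word tokens
print(text_to_word_sequence("The movie was great, but far too long!"))
# ['the', 'movie', 'was', 'great', 'but', 'far', 'too', 'long']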

Stop Words Removal

This is the removal of very common words such as “and”, “to”, and “the”. Such words provide little value to the NLP model and are filtered out before the text is passed to it. There is no single official list of stop words, and the list will change depending on the purpose of the model. Removing them is very useful for increasing the accuracy of the model during training.
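The tutorial below doesn’t remove stop words explicitly, but as a rough sketch, one common approach is to filter tokens against NLTK’s English stop-word list:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the stop-word lists on first use

tokens = ['the', 'movie', 'was', 'great', 'but', 'far', 'too', 'long']
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stop-word list
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['movie', 'great', 'far', 'long']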

Stemming

Stemming is another process that is useful in making text more understandable for the model. It involves shortening text by reducing words to their word stem. For example, “waiting” and “waited” are both shortened to “wait”.

Image source: https://www.c-sharpcorner.com/blogs/stemming-in-natural-language-processing
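A minimal sketch of this behaviour, using NLTK’s PorterStemmer rather than anything from the tutorial below:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Both inflected forms reduce to the same stem
print(stemmer.stem('waiting'))  # 'wait'
print(stemmer.stem('waited'))   # 'wait'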

Lemmatization

Lemmatization is the process of reducing a word to its base dictionary form, or lemma. For example, “went” is turned into “go” and “better” is turned into “good”. Although it is somewhat similar to stemming, it takes a different approach to simplifying text, relying on vocabulary and grammar rather than simply trimming word endings.
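A minimal sketch using NLTK’s WordNetLemmatizer (again, not part of the tutorial below); note that it needs a part-of-speech hint to lemmatize verbs correctly:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # WordNet dictionary used for lemma lookups
nltk.download('omw-1.4')   # extra WordNet data required by newer NLTK versions

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('went', pos='v'))    # 'go'
print(lemmatizer.lemmatize('better', pos='a'))  # 'good'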

Topic Modelling

Another important topic within NLP is topic modelling. It is essentially an unsupervised machine learning technique that is used to group different texts under certain subjects. For example, it is used in email systems such as Gmail, where emails concerning certain subjects are grouped together. All the techniques that I listed above can also be used to better train your model when working with topic modelling.
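Topic modelling isn’t used in the tutorial below, but as a rough sketch, scikit-learn’s LatentDirichletAllocation can group a handful of made-up documents into topics:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A tiny, made-up corpus purely for illustration
docs = [
    "the game ended with a late goal and a penalty",
    "the striker scored twice in the second half",
    "the new phone has a faster chip and a better camera",
    "the laptop update improves battery life and performance",
]

# Convert the documents to word counts, then fit a two-topic LDA model
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most heavily weighted words for each topic
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:", [words[i] for i in topic.argsort()[-4:]])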

TensorFlow/Keras Tutorial

Now that we know what NLP is and the various tools used to increase a model’s accuracy, we’ll tackle a classic NLP problem: detecting the emotion of a piece of text. For the dataset, we’ll be using a set of English Twitter messages classified into six basic emotions: anger, fear, joy, love, sadness, and surprise. You can view the dataset here: https://huggingface.co/datasets/emotion. Note that you don’t have to download it, as we will be using the “nlp” module to import it instead.

Imports

!pip install nlp

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import nlp
import random
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Importing and Preparing Data

dataset = nlp.load_dataset('emotion')
train = dataset['train']
val = dataset['validation']
test = dataset['test']

This imports the dataset and separates it into the training, validation, and testing sets.

def get_tweet(data):
    tweets = [x['text'] for x in data]
    labels = [x['label'] for x in data]
    return tweets, labels

tweets, labels = get_tweet(train)

This separates our training data into two lists: “tweets” and “labels”.

You can print a few examples to get a better idea of what your data looks like.
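For instance, picking a random index and printing the tweet alongside its label:

# Print a random training tweet and its label as a quick sanity check
i = random.randint(0, len(tweets) - 1)
print('Tweet:', tweets[i])
print('Label:', labels[i])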

Tokenization

tokenizer = Tokenizer(num_words=10000, oov_token='<UNK>')
tokenizer.fit_on_texts(tweets)

This code initializes a tokenizer and fits it on our training data. It assigns each word an index based on how frequently the word appears in the dataset.

As you can see below, when we run “texts_to_sequences” on the first tweet in our dataset, we get an array of four numbers. Each number corresponds to one of the words in the tweet and is determined by how common that word is. For example, the word “i” corresponds to the number “2”, as it is a very common word.
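You can reproduce this yourself; the exact numbers will depend on the word frequencies in your copy of the dataset:

# Inspect how the tokenizer encodes the first training tweet
print(tweets[0])
print(tokenizer.texts_to_sequences([tweets[0]]))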

Making all Sequences Same Shape

maxlen = 50

def get_sequences(tokenizer, tweets):
    sequences = tokenizer.texts_to_sequences(tweets)
    padded = pad_sequences(sequences, truncating='post', padding='post', maxlen=maxlen)
    return padded

padded_train_seq = get_sequences(tokenizer, tweets)

The code above pads or truncates every tweet to the same length of 50 tokens (this number might change based on the typical text length in other datasets). Zeros are added to the end of short tweets, and extra words are cut off at the end of long ones. This is a necessary step, as the model expects inputs of a fixed shape and length. The final line applies this to the training tweets, producing “padded_train_seq”, which we’ll use when fitting the model.

When running “get_sequences()” on our tweets and taking the first tweet from that set, you can see the same four numbers as above, except that the sequence has been padded out to a length of 50.
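A quick way to confirm this, assuming padded_train_seq was produced by the call above:

# The first padded sequence: the token indexes followed by zeros up to maxlen
print(padded_train_seq[0])
print(padded_train_seq.shape)  # (number of training tweets, 50)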

Preparing Data for Model

classes = set(labels)
class_to_index = dict((c, i) for i, c in enumerate(classes))
index_to_class = dict((v, k) for k, v in class_to_index.items())

names_to_ids = lambda labels: np.array([class_to_index.get(x) for x in labels])
train_labels = names_to_ids(labels)

This creates a set of all of our labels, plus a pair of dictionaries for converting class names to indexes and indexes back to class names. The latter is especially useful for turning the values the model returns into something we can understand more easily. Additionally, we create a lambda function named “names_to_ids” and use it to convert all labels in our training data into their respective indexes.

You can view what each of our variables looks like below to gain a better sense of their purposes. Note that your individual variables might differ slightly in their indexes, which is completely normal.

class_to_index maps each class name to an index, and index_to_class simply swaps the keys and values to map back the other way. In this run, the first tweet’s label maps to index “1”, which corresponds to “sadness”.
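A quick way to inspect these mappings in your own run:

print(class_to_index)   # e.g. {'sadness': 0, 'joy': 1, ...}; the order can differ between runs
print(index_to_class)   # the reverse mapping, index -> class name
print(labels[0], '->', train_labels[0])  # the first training label and its numeric index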

Creating Model

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(10000, 16, input_length=maxlen),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    tf.keras.layers.Dense(6, activation='softmax')
])

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

Here, we create a very simple model that consists of an embedding layer, two bidirectional LSTM layers, and a dense output layer. The embedding layer maps each of the 10,000 word indexes to a 16-dimensional vector. The bidirectional wrappers process the sequence in both the forward and backward directions, and the LSTM (long short-term memory) layers they wrap are a type of RNN architecture capable of learning long-term dependencies.

We also compile the model, using sparse categorical crossentropy as the loss function, Adam as the optimizer, and accuracy as the metric to track.
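If you want to check the layer shapes and parameter counts before training, you can print a summary:

# Prints each layer's output shape and number of parameters
model.summary()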

Training Model

val_tweets, val_labels = get_tweet(val)
val_seq = get_sequences(tokenizer, val_tweets)
val_labels = names_to_ids(val_labels)

h = model.fit(
    padded_train_seq, train_labels,
    validation_data=(val_seq, val_labels),
    epochs=20,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)]
)

Finally, we can start training the model. Note that we use an early-stopping callback to halt training when the accuracy on the validation set hasn’t improved for 2 consecutive epochs.

Keras prints the loss and accuracy on the training and validation sets after each epoch, and the early-stopping callback will usually end training before the full 20 epochs.
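Since matplotlib is already imported, you can also plot the accuracy curves from the returned history object (a small sketch, assuming training produced the “h” object above):

# Plot training vs. validation accuracy over the epochs that actually ran
plt.plot(h.history['accuracy'], label='train accuracy')
plt.plot(h.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()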

Evaluating and Testing Model

test_tweets, test_labels = get_tweet(test)
test_seq = get_sequences(tokenizer, test_tweets)
test_labels = names_to_ids(test_labels)

model.evaluate(test_seq, test_labels)

This converts the test tweets into padded sequences and their labels into indexes, then evaluates the model’s accuracy against the test data.

i = random.randint(0, len(test_labels) - 1)

print('Sentence:', test_tweets[i])
print('Emotion:', index_to_class[test_labels[i]])

p = model.predict(np.expand_dims(test_seq[i], axis=0))[0]
print(test_seq[i])

pred_class = index_to_class[np.argmax(p).astype('uint8')]
print('Predicted Emotion: ', pred_class)

This code picks a random tweet from the test set and has the model predict which class it belongs to. It also prints the tweet and its true label so you can compare the prediction with the correct answer.

sentence = 'i am not sure what to do'

sequence = tokenizer.texts_to_sequences([sentence])
paddedSequence = pad_sequences(sequence, truncating='post', padding='post', maxlen=maxlen)

p = model.predict(np.expand_dims(paddedSequence[0], axis=0))[0]
pred_class = index_to_class[np.argmax(p).astype('uint8')]

print('Sentence:', sentence)
print('Predicted Emotion: ', pred_class)

This code allows you to enter your own sentence and let the model predict its emotion.

Saving Model

from google.colab import drive
drive.mount('/content/drive')

model.save("/content/drive/My Drive/TweetEmotionRecognition/h5/tweet_model.h5")

This allows you to save your model to your Google Drive as a .h5 file. Note that I am using Google Colab, so the path will change if you are running your code locally.

Loading Model

load_model = tf.keras.models.load_model("/content/drive/My Drive/TweetEmotionRecognition/h5/tweet_model.h5")
print(load_model.summary())

If you want to load your model from a certain filepath, you can use the code above to do so. From there, you can run all the functions you want with your model, just replacing “model” with “load_model” (e.g. “load_model.predict()”).

Conclusion

With that being said, I hope you enjoyed my article and learned about natural language processing! Feel free to check out my other articles, with many more coming soon!

If you want to see the completed code, you can view it at my GitHub repository here.

If you have any questions or would like to connect, feel free to email me at: alexander.chow911@gmail.com

To learn more about me: LinkedIn
