Deep Learning — Natural Language Processing

Dejan Jovanovic
NewCryptoBlock
May 21, 2019

Part V-a

When it comes to natural language processing we should not forget one thing: none of the deep learning models truly understands text in a human sense. So how do they understand us? Deep learning models apply statistics to map the structure of written language. Deep learning for natural language processing is pattern recognition applied to words, sentences and paragraphs. Like any other deep learning model, models for natural language processing do not take raw text as input; they can only work with numeric tensors. The transformation of text data into numeric tensors is called text vectorization.

Let’s review how to prepare text with the Keras library. A good first step when working with text is to split it into words. These words are called tokens, and the process of splitting text into tokens is called tokenization.

from keras.preprocessing.text import text_to_word_sequence

# Our text document
text = 'This text is used here just for demonstration purposes.'

# Document tokenization
result = text_to_word_sequence(text)
print(result)

This code automatically does three things:

  1. Splits words by whitespace
  2. Filters out punctuation
  3. Converts text to lowercase

The outcome of this code is as follows:

['this', 'text', 'is', 'used', 'here', 'just', 'for', 'demonstration', 'purposes']

This is just our first step; further preprocessing is required before we can work with this text. One of the most common methods is to represent the document as a sequence of integer values, where each word in the document is replaced by an integer index. Keras provides this through its one_hot() function. Despite the name, one_hot() does not produce one-hot vectors and does not keep a word-to-integer mapping; it assigns indices by hashing, so two different words can end up with the same integer. Expanding on the previous example, this is how it looks:

from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import one_hot

# Our text document
text = 'This text is used here just for demonstration purposes.'

# Estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocabulary_size = len(words)

# Integer encode the document
result = one_hot(text, vocabulary_size)
print(result)

The outcome of this code is below:

[6, 8, 1, 4, 7, 7, 4, 6, 1]

The limitation of this approach is that a vocabulary of words, or at least an estimate of its size, has to be maintained. The alternative approach is to use a one-way hash function to convert words into integers. Keras provides the hashing_trick() function, which does exactly that; here is a sample:

from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import hashing_trick

# Our text document
text = 'This text is used here just for demonstration purposes.'

# Estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocabulary_size = len(words)

# Hash each word into an integer index
result = hashing_trick(text, vocabulary_size, hash_function='md5')
print(result)

And our outcome now is:

[3, 2, 2, 4, 5, 1, 4, 3, 4]
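The repeated indices above are hash collisions: with a hash space only as large as the vocabulary itself, different words can end up with the same integer. A common way to reduce collisions is to hash into a space larger than the estimated vocabulary; below is a minimal sketch of that idea (the factor of 1.3 is just an illustrative choice, not a Keras requirement):

from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import hashing_trick

# Our text document
text = 'This text is used here just for demonstration purposes.'

# Estimate the vocabulary size, then hash into a larger space to reduce collisions
words = set(text_to_word_sequence(text))
hash_space = round(len(words) * 1.3)

result = hashing_trick(text, hash_space, hash_function='md5')
print(result)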

Keras also provides a more sophisticated API for preparing multiple text documents at once, the Tokenizer API. Here is an example:

from keras.preprocessing.text import Tokenizer

# Our text document
documents = ['The cat sat on the mat.',
             'The dog sat on the log.',
             'Dogs and cats living together.']

# create Tokenizer instance
tokenizer = Tokenizer()

# fit the tokenizer on the documents
tokenizer.fit_on_texts(documents)

# print what was learned
print("Word count: ", tokenizer.word_counts)
print("\nDocument count: ", tokenizer.document_count)
print("\nWord index: ", tokenizer.word_index)
print("\nWord documents: ", tokenizer.word_docs)

# integer encode documents
encoded_documents = tokenizer.texts_to_matrix(documents, mode='count')
print("\nEncoded documents:")
print(encoded_documents)

And our outcome now is:

Word count: OrderedDict([('the', 4), ('cat', 1), ('sat', 2), ('on', 2), ('mat', 1), ('dog', 1), ('log', 1), ('dogs', 1), ('and', 1), ('cats', 1), ('living', 1), ('together', 1)])
Document count: 3
Word index: {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, 'mat': 5, 'dog': 6, 'log': 7, 'dogs': 8, 'and': 9, 'cats': 10, 'living': 11, 'together': 12}
Word documents: defaultdict(<class 'int'>, {'the': 2, 'sat': 2, 'cat': 1, 'mat': 1, 'on': 2, 'log': 1, 'dog': 1, 'and': 1, 'together': 1, 'cats': 1, 'dogs': 1, 'living': 1})
Encoded documents:
[[0. 2. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 2. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]]
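texts_to_matrix() produces one fixed-length vector per document and discards word order. If word order matters, for example when feeding an Embedding layer, the same fitted Tokenizer can also return integer sequences, which are then padded to a common length. Here is a minimal sketch of that step using the same three documents (maxlen=6 is an arbitrary illustrative choice):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# The same three documents as in the example above
documents = ['The cat sat on the mat.',
             'The dog sat on the log.',
             'Dogs and cats living together.']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(documents)

# Convert each document into a sequence of word indices, preserving word order
sequences = tokenizer.texts_to_sequences(documents)
print(sequences)

# Pad every sequence to the same length so they can be stacked into a tensor
padded_sequences = pad_sequences(sequences, maxlen=6, padding='post')
print(padded_sequences)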

However, if you are not planning to use Keras, there are additional options, such as the bag-of-words model and the n-gram model.

The bag-of-words model has seen great success in problems such as language modeling and document classification. It creates a representation of text that describes the occurrence of words within a document, and it involves two things: a vocabulary of known words and a measure of the presence of those words. What is interesting is that this model is only concerned with whether known words occur in the document, not where in the document they occur. The idea behind it is that documents are similar if they have similar content.

Using the documents from our previous example:

the cat sat on the mat
the dog sat on the log
dogs and cats living together

Our vocabulary, that is, the collection of unique words used in these documents, is then:

the 
cat
sat
on
mat
dog
log
dogs
and
cats
living
together

So, now we can vectorize the documents, using 1 when a vocabulary word appears in the document and 0 when it is absent. This is what we get:

"the cat sat on the mat" = [1,1,1,1,1,0,0,0,0,0,0,0]
"the dog sat on the log" = [1,0,1,1,0,1,1,0,0,0,0,0]
"dogs and cats living together" = [0,0,0,0,0,0,0,1,1,1,1,1]

The problem with the bag-of-words model is that the vector representation of a document can get very long, with lots of zeros. To keep the vocabulary, and therefore the vectors, smaller, the following techniques can be used (a short sketch follows the list):

  1. Ignoring case
  2. Fixing misspelled words (Python pyspellchecker library)
  3. Ignoring punctuation
  4. Reducing words to their stem (Python NLTK library)
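As a minimal sketch of points 1, 3 and 4, the example below combines Keras tokenization (which already lowercases the text and strips punctuation) with stemming from the NLTK library. The sentence is just an illustrative one, and pyspellchecker is left out for brevity:

from keras.preprocessing.text import text_to_word_sequence
from nltk.stem.porter import PorterStemmer

# An illustrative document with mixed case, punctuation and inflected words
text = 'The Dogs sat on the mats, dreaming of cats.'

# text_to_word_sequence lowercases the text and filters out punctuation
tokens = text_to_word_sequence(text)

# Reduce every token to its stem, e.g. 'dogs' -> 'dog', 'dreaming' -> 'dream'
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)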

Summary

This is a continuation of my AI exploration, and I hope you enjoyed reading it. In my next article I will continue with the topic of NLP by providing sample code for the material presented in this entry. The third part of this NLP series will then go over an example of a sentiment analysis AI model.


