Deep learning for text classification part 1.0

aj_khan
May 31 · 4 min read

Aah okay, this is yet another tutorial on text classification. It will be a series of tutorials: we will start with simple neural networks and, down the line, use state-of-the-art transfer learning for text classification such as ULMFiT, Transformers, and BERT.

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k positive and 25k negative).
Dataset link: https://www.kaggle.com/utathya/imdb-review-dataset
GloVe embeddings: https://nlp.stanford.edu/projects/glove/

Things we are going to do in this tutorial:
* A little bit of pre-processing and EDA
* Loading the GloVe embedding
* Building the model

 
import re
import numpy as np
from numpy import zeros
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
tqdm.pandas()

%matplotlib inline

Loading the GloVe file and creating an embedding index:

# Build a word -> vector lookup from the 50-dimensional GloVe file
embeddings_index = {}
f = open('../embeddings/glove.6B/glove.6B.50d.txt')
for line in f:
    values = line.split()
    word = values[0]                                  # first token is the word itself
    coefs = np.asarray(values[1:], dtype='float32')   # the rest are the 50 vector components
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.

Let’s check the embeddings of a few words…

embeddings_index["why"]
array([ 0.32386  ,  0.011154 ,  0.23443  , -0.18039  ,  0.6233   ,
       -0.059467 , -0.62369  ,  0.12782  , -0.40932  ,  0.083849 ,
       -0.19215  ,  0.57834  , -0.49637  , -0.048521 ,  1.099    ,
        0.6298   ,  0.26122  , -0.11049  ,  0.16728  , -0.71227  ,
       -0.371    ,  0.51635  ,  0.54567  ,  0.27623  ,  0.82096  ,
       -2.1861   , -1.0027   ,  0.11441  ,  0.53145  , -0.86653  ,
        2.5888   ,  0.37458  , -0.51935  , -0.68734  , -0.14537  ,
       -0.53177  , -0.065899 ,  0.0077695,  0.31162  , -0.17694  ,
       -0.36669  ,  0.17919  ,  0.21591  ,  0.61326  ,  0.41495  ,
        0.17295  , -0.19359  ,  0.26349  , -0.19398  ,  0.58678  ],
      dtype=float32)

embeddings_index["example"]
array([ 0.51564  ,  0.56912  , -0.19759  ,  0.0080456,  0.41697  ,
        0.59502  , -0.053312 , -0.83222  , -0.21715  ,  0.31045  ,
        0.09352  ,  0.35323  ,  0.28151  , -0.35308  ,  0.23496  ,
        0.04429  ,  0.017109 ,  0.0063749, -0.01662  , -0.69576  ,
        0.019819 , -0.52746  , -0.14011  ,  0.21962  ,  0.13692  ,
       -1.2683   , -0.89416  , -0.1831   ,  0.23343  , -0.058254 ,
        3.2481   , -0.48794  , -0.01207  , -0.81645  ,  0.21182  ,
       -0.17837  , -0.02874  ,  0.099358 , -0.14944  ,  0.2601   ,
        0.18919  ,  0.15022  ,  0.18278  ,  0.50052  , -0.025532 ,
        0.24671  ,  0.10596  ,  0.13612  ,  0.0090427,  0.39962  ],
      dtype=float32)

Now that we’ve loaded the embeddings, we can look up the vector for any word, and each embedding has a length of 50.

Now let’s read the data and stalk it :P
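Below is a minimal sketch of loading the Kaggle CSV with pandas. The file name (imdb_master.csv), the encoding, and the column names ('review', 'label') are assumptions based on the linked dataset, so adjust them to whatever your download actually contains.

# Assumed file name, encoding, and column names -- adjust to your copy of the Kaggle dataset
df = pd.read_csv('../input/imdb_master.csv', encoding='ISO-8859-1')
df = df[df['label'] != 'unsup']                        # drop unlabelled reviews, keep pos/neg
df['sentiment'] = (df['label'] == 'pos').astype(int)   # 1 = positive, 0 = negative
print(df.shape)
df.head()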

Some pre-processing:

# Cleaning and pre-processing text

def clean_numbers(text):
    # Replace runs of digits with '#' placeholders, longest patterns first
    text = re.sub('[0-9]{5,}', '#####', text)
    text = re.sub('[0-9]{4}', '####', text)
    text = re.sub('[0-9]{3}', '###', text)
    text = re.sub('[0-9]{2}', '##', text)
    return text

def clean_text(text):
    text = clean_numbers(text)
    text = str(text)

    for punct in "/-'":
        text = text.replace(punct, ' ')
    for punct in '&':
        text = text.replace(punct, f' {punct} ')   # surround '&' with spaces so it stays a separate token
    for punct in '?!.,"$%()*+-/:;<=>@[]^_`{|}~' + '“”’':
        text = text.replace(punct, '')

    text = text.lower()
    return text

# Test the above pre-process function
clean_text("Hi this Is the test For 894 y ~is okay")

output: 'hi this is the test for ### y is okay'

Check the length of sentences:

[Plot: length of sentences in the data]
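To reproduce that plot, a quick histogram of word counts per cleaned review is enough (a sketch using the docs list built above); it also motivates the padding length of 150 used later.

# Distribution of review lengths in words
sent_lengths = [len(doc.split()) for doc in docs]
plt.hist(sent_lengths, bins=50)
plt.xlabel('words per review')
plt.ylabel('number of reviews')
plt.title('length of sentences in the data')
plt.show()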

Next, prepare the data for training:
* Tokenize: “Tokens” are usually individual words (at least in languages like English), and “tokenization” is taking a text or set of texts and breaking it up into its individual words. More about tokenization: http://blog.kaggle.com/2017/08/25/data-science-101-getting-started-in-nlp-tokenization-tutorial/




# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1

# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

# pad documents to a max length of 150
max_length = 150
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

texts_to_sequences converts each sentence into a sequence of integer token IDs, and pad_sequences trims or zero-pads every sequence to the same length. Have a look:
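For instance, running the fitted tokenizer on one toy sentence (not from the dataset; the printed IDs depend on your fitted vocabulary) shows the word-to-ID mapping and the zero padding:

# Each word is replaced by its index in t.word_index; unseen words are simply dropped
sample = t.texts_to_sequences(["the movie was great"])
print(sample)                                             # e.g. [[1, 17, 13, 84]] -- IDs depend on the vocabulary
print(pad_sequences(sample, maxlen=10, padding='post'))   # zero-padded on the right to length 10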

# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 50))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:          # words missing from GloVe stay as all-zero rows
        embedding_matrix[i] = embedding_vector

Now let’s create a simple neural network

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 50, weights=[embedding_matrix],
                                 input_length=max_length, trainable=False))   # keep the pretrained GloVe weights frozen
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()

Compile the model and start training on the data…
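Here is a minimal compile-and-fit sketch that matches the single sigmoid output above; the optimizer, epoch count, batch size, and validation split are my assumptions, not necessarily the exact settings from the original notebook.

# Assumed training settings -- tune epochs / batch_size / validation_split as needed
model.compile(optimizer='adam',
              loss='binary_crossentropy',   # binary cross-entropy pairs with the sigmoid output
              metrics=['accuracy'])
history = model.fit(padded_docs, labels,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2,
                    verbose=1)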

Let’s have a look at train/test accuracy and loss
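A quick way to plot those curves from the fit history returned by the training sketch above (on older Keras versions the history keys are 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'):

# Train vs. validation accuracy and loss per epoch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['accuracy'], label='train acc')
ax1.plot(history.history['val_accuracy'], label='val acc')
ax1.set_xlabel('epoch')
ax1.legend()
ax2.plot(history.history['loss'], label='train loss')
ax2.plot(history.history['val_loss'], label='val loss')
ax2.set_xlabel('epoch')
ax2.legend()
plt.show()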

Finally, this part ends here… The code and notebook can be found here: https://github.com/aquibjaved/Deep-learning-for-text-classification

Any comments/suggestions, please use the space below :) I hope it helps someone somewhere. Stay tuned for the next one.
