Deep Learning Techniques for Text Classification

SWAYAM MITTAL
Aug 16, 2019 · 17 min read

The exponential growth in the number of complex datasets every year requires more powerful machine learning methods to provide robust and accurate data classification. Lately, deep learning approaches have been achieving better results than earlier machine learning algorithms on tasks like image classification, natural language processing, and face recognition. The success of these deep learning algorithms relies on their capacity to model complex and non-linear relationships within the data.

However, finding suitable structures for these models has been a challenge for researchers. One answer is transfer learning: instead of training a model from scratch, we reuse a pre-trained model and then fine-tune it for another related task. Approaches built on this idea, such as ULMFiT, the OpenAI Transformer, ELMo, and Google AI’s BERT, have produced a massive leap in state-of-the-art results for many NLP tasks.

Topics you’ll find in this post include the following:

  • Deep Neural Networks
  • Recurrent Neural Networks (RNN)
  • Gated Recurrent Unit (GRU)
  • Long Short-Term Memory (LSTM)
  • Convolutional Neural Networks (CNN)
  • Hierarchical Attention Networks
  • Recurrent Convolutional Neural Networks (RCNN)
  • Random Multimodel Deep Learning (RMDL)
  • Hierarchical Deep Learning for Text (HDLTex)

Deep Neural Networks

Deep Neural Network (DNN) architectures are designed to learn through multiple layers, where each layer receives connections only from the previous layer and provides connections only to the next layer in the hidden part. The input layer is the text feature vector (TF-IDF features in the example below). The output layer has as many neurons as there are classes for multi-class classification, and a single neuron for binary classification. Here we build a multi-class DNN in which the number of nodes in each layer, as well as the number of layers, can be chosen freely. The DNN is a discriminatively trained model that uses the standard back-propagation algorithm with sigmoid or ReLU activation functions; for multi-class classification, the output layer should use softmax.

import packages:

from sklearn.datasets import fetch_20newsgroups
from keras.layers import Dropout, Dense
from keras.models import Sequential
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import metrics

convert text to TF-IDF:

def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):
    vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)
    X_train = vectorizer_x.fit_transform(X_train).toarray()
    X_test = vectorizer_x.transform(X_test).toarray()
    print("tf-idf with", str(np.array(X_train).shape[1]), "features")
    return (X_train, X_test)

Build a DNN Model for Text:

def Build_Model_DNN_Text(shape, nClasses, dropout=0.5):
    """
    Build_Model_DNN_Text(shape, nClasses, dropout)
    Build a deep neural network model for text classification.
    shape is the size of the input feature space
    nClasses is the number of classes
    """
    model = Sequential()
    node = 512  # number of nodes in each hidden layer
    nLayers = 4  # number of additional hidden layers
    model.add(Dense(node, input_dim=shape, activation='relu'))
    model.add(Dropout(dropout))
    for i in range(0, nLayers):
        model.add(Dense(node, input_dim=node, activation='relu'))
        model.add(Dropout(dropout))
    model.add(Dense(nClasses, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

Load text dataset (20newsgroups):

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

run DNN and see result:

X_train_tfidf, X_test_tfidf = TFIDF(X_train, X_test)
model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 20)
model_DNN.fit(X_train_tfidf, y_train,
              validation_data=(X_test_tfidf, y_test),
              epochs=10,
              batch_size=128,
              verbose=2)
predicted = np.argmax(model_DNN.predict(X_test_tfidf), axis=1)
print(metrics.classification_report(y_test, predicted))

Model summary:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 38400512
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_2 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_3 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_3 (Dropout) (None, 512) 0
_________________________________________________________________
dense_4 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_4 (Dropout) (None, 512) 0
_________________________________________________________________
dense_5 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_5 (Dropout) (None, 512) 0
_________________________________________________________________
dense_6 (Dense) (None, 20) 10260
=================================================================
Total params: 39,461,396
Trainable params: 39,461,396
Non-trainable params: 0
_________________________________________________________________

Output:

Train on 11314 samples, validate on 7532 samples
Epoch 1/10
- 16s - loss: 2.7553 - acc: 0.1090 - val_loss: 1.9330 - val_acc: 0.3184
Epoch 2/10
- 15s - loss: 1.5330 - acc: 0.4222 - val_loss: 1.1546 - val_acc: 0.6204
Epoch 3/10
- 15s - loss: 0.7438 - acc: 0.7257 - val_loss: 0.8405 - val_acc: 0.7499
Epoch 4/10
- 15s - loss: 0.2967 - acc: 0.9020 - val_loss: 0.9214 - val_acc: 0.7767
Epoch 5/10
- 15s - loss: 0.1557 - acc: 0.9543 - val_loss: 0.8965 - val_acc: 0.7917
Epoch 6/10
- 15s - loss: 0.1015 - acc: 0.9705 - val_loss: 0.9427 - val_acc: 0.7949
Epoch 7/10
- 15s - loss: 0.0595 - acc: 0.9835 - val_loss: 0.9893 - val_acc: 0.7995
Epoch 8/10
- 15s - loss: 0.0495 - acc: 0.9866 - val_loss: 0.9512 - val_acc: 0.8079
Epoch 9/10
- 15s - loss: 0.0437 - acc: 0.9867 - val_loss: 0.9690 - val_acc: 0.8117
Epoch 10/10
- 15s - loss: 0.0443 - acc: 0.9880 - val_loss: 1.0004 - val_acc: 0.8070
              precision    recall  f1-score   support

           0       0.76      0.78      0.77       319
           1       0.67      0.80      0.73       389
           2       0.82      0.63      0.71       394
           3       0.76      0.69      0.72       392
           4       0.65      0.86      0.74       385
           5       0.84      0.75      0.79       395
           6       0.82      0.87      0.84       390
           7       0.86      0.90      0.88       396
           8       0.95      0.91      0.93       398
           9       0.91      0.92      0.92       397
          10       0.98      0.92      0.95       399
          11       0.96      0.85      0.90       396
          12       0.71      0.69      0.70       393
          13       0.95      0.70      0.81       396
          14       0.86      0.91      0.88       394
          15       0.85      0.90      0.87       398
          16       0.79      0.84      0.81       364
          17       0.99      0.77      0.87       376
          18       0.58      0.75      0.65       310
          19       0.52      0.60      0.55       251

 avg / total       0.82      0.81      0.81      7532

Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNN) are another neural network architecture used for text mining and classification. An RNN assigns more weight to the preceding data points of a sequence, which makes it a powerful method for classifying text, strings, and other sequential data. Because the network carries information forward from previous time steps, it allows for better semantic analysis of the structures in the dataset.
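
As a concrete illustration, a basic RNN cell simply mixes the current input with the previous hidden state. Below is a minimal NumPy sketch of a single recurrent step; the toy dimensions and randomly initialized weights are purely for demonstration:

import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    # the new hidden state mixes the current input with the previous state
    return np.tanh(W @ x_t + U @ h_prev + b)

# toy dimensions: 4-dimensional inputs, 3-dimensional hidden state
rng = np.random.RandomState(0)
W, U, b = rng.randn(3, 4), rng.randn(3, 3), np.zeros(3)
h = np.zeros(3)
for x_t in rng.randn(5, 4):  # a sequence of five input vectors
    h = rnn_step(x_t, h, W, U, b)
print(h)  # final hidden state summarizing the whole sequence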


Gated Recurrent Unit (GRU)

Gated Recurrent Unit (GRU) is a gating mechanism for RNNs introduced by K. Cho et al. and further studied by J. Chung et al. GRU is a simplified variant of the LSTM architecture, with the following differences: a GRU contains two gates rather than three, it does not possess an internal memory separate from the exposed hidden state, and no second non-linearity is applied when computing the output.
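
For illustration, the two gates can be written out directly. This is a minimal NumPy sketch of a single GRU step with bias terms omitted; the toy weight shapes are assumptions for demonstration only:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde               # blend old state and candidate

# toy call with 4-dimensional input and 3-dimensional hidden state
rng = np.random.RandomState(0)
W_z, W_r, W_h = (rng.randn(3, 4) for _ in range(3))
U_z, U_r, U_h = (rng.randn(3, 3) for _ in range(3))
print(gru_step(rng.randn(4), np.zeros(3), W_z, U_z, W_r, U_r, W_h, U_h))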


Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) was introduced by S. Hochreiter and J. Schmidhuber and has since been developed further by many research scientists.

LSTM is a special type of RNN that preserves long-term dependencies more effectively than basic RNNs. It is particularly useful for overcoming the vanishing gradient problem, since it uses multiple gates to carefully regulate the amount of information allowed into each cell state.
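
The RNN code below uses GRU layers, but swapping in LSTM is a one-line change per layer. Here is a minimal Keras sketch of a stacked LSTM classifier; the layer sizes are illustrative rather than tuned:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(64, return_sequences=True, recurrent_dropout=0.2,
               input_shape=(500, 50)))   # (sequence length, embedding dimension)
model.add(Dropout(0.5))
model.add(LSTM(64, recurrent_dropout=0.2))  # final layer returns a single vector
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))  # 20 classes, as in 20newsgroups
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])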


import packages:

from keras.layers import Dropout, Dense, GRU, Embedding
from keras.models import Sequential
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups

convert text to word embedding (Using GloVe):

def loadData_Tokenizer(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    f = open("C:\\Users\\swayam\\Documents\\Glove\\glove.6B.50d.txt", encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except:
            continue  # skip malformed lines rather than reusing stale values
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)

Build a RNN Model for Text:

def Build_Model_RNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):
    """
    Build_Model_RNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5)
    word_index is the word index,
    embeddings_index is the embeddings index (see loadData_Tokenizer),
    nclasses is the number of classes,
    MAX_SEQUENCE_LENGTH is the maximum length of the text sequences
    """
    model = Sequential()
    hidden_layer = 3
    gru_node = 32
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index keep their random initialization
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)), " Please make sure your"
                      " EMBEDDING_DIM matches the dimension of the embedding (GloVe) file.")
                exit(1)
            embedding_matrix[i] = embedding_vector
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embedding_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))
    for i in range(0, hidden_layer):
        model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
    model.add(GRU(gru_node, recurrent_dropout=0.2))
    model.add(Dropout(dropout))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(nclasses, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

run RNN and see result:

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)
model_RNN = Build_Model_RNN_Text(word_index, embeddings_index, 20)
model_RNN.fit(X_train_Glove, y_train,
              validation_data=(X_test_Glove, y_test),
              epochs=20,
              batch_size=128,
              verbose=2)
predicted = model_RNN.predict_classes(X_test_Glove)
print(metrics.classification_report(y_test, predicted))

Model summary:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 500, 50) 8960500
_________________________________________________________________
gru_1 (GRU) (None, 500, 256) 235776
_________________________________________________________________
dropout_1 (Dropout) (None, 500, 256) 0
_________________________________________________________________
gru_2 (GRU) (None, 500, 256) 393984
_________________________________________________________________
dropout_2 (Dropout) (None, 500, 256) 0
_________________________________________________________________
gru_3 (GRU) (None, 500, 256) 393984
_________________________________________________________________
dropout_3 (Dropout) (None, 500, 256) 0
_________________________________________________________________
gru_4 (GRU) (None, 256) 393984
_________________________________________________________________
dense_1 (Dense) (None, 20) 5140
=================================================================
Total params: 10,383,368
Trainable params: 10,383,368
Non-trainable params: 0
_________________________________________________________________

Output:

Train on 11314 samples, validate on 7532 samples
Epoch 1/20
- 268s - loss: 2.5347 - acc: 0.1792 - val_loss: 2.2857 - val_acc: 0.2460
Epoch 2/20
- 271s - loss: 1.6751 - acc: 0.3999 - val_loss: 1.4972 - val_acc: 0.4660
Epoch 3/20
- 270s - loss: 1.0945 - acc: 0.6072 - val_loss: 1.3232 - val_acc: 0.5483
Epoch 4/20
- 269s - loss: 0.7761 - acc: 0.7312 - val_loss: 1.1009 - val_acc: 0.6452
Epoch 5/20
- 269s - loss: 0.5513 - acc: 0.8112 - val_loss: 1.0395 - val_acc: 0.6832
Epoch 6/20
- 269s - loss: 0.3765 - acc: 0.8754 - val_loss: 0.9977 - val_acc: 0.7086
Epoch 7/20
- 270s - loss: 0.2481 - acc: 0.9202 - val_loss: 1.0485 - val_acc: 0.7270
Epoch 8/20
- 269s - loss: 0.1717 - acc: 0.9463 - val_loss: 1.0269 - val_acc: 0.7394
Epoch 9/20
- 269s - loss: 0.1130 - acc: 0.9644 - val_loss: 1.1498 - val_acc: 0.7369
Epoch 10/20
- 269s - loss: 0.0640 - acc: 0.9808 - val_loss: 1.1442 - val_acc: 0.7508
Epoch 11/20
- 269s - loss: 0.0567 - acc: 0.9828 - val_loss: 1.2318 - val_acc: 0.7414
Epoch 12/20
- 268s - loss: 0.0472 - acc: 0.9858 - val_loss: 1.2204 - val_acc: 0.7496
Epoch 13/20
- 269s - loss: 0.0319 - acc: 0.9910 - val_loss: 1.1895 - val_acc: 0.7657
Epoch 14/20
- 268s - loss: 0.0466 - acc: 0.9853 - val_loss: 1.2821 - val_acc: 0.7517
Epoch 15/20
- 271s - loss: 0.0269 - acc: 0.9917 - val_loss: 1.2869 - val_acc: 0.7557
Epoch 16/20
- 271s - loss: 0.0187 - acc: 0.9950 - val_loss: 1.3037 - val_acc: 0.7598
Epoch 17/20
- 268s - loss: 0.0157 - acc: 0.9959 - val_loss: 1.2974 - val_acc: 0.7638
Epoch 18/20
- 270s - loss: 0.0121 - acc: 0.9966 - val_loss: 1.3526 - val_acc: 0.7602
Epoch 19/20
- 269s - loss: 0.0262 - acc: 0.9926 - val_loss: 1.4182 - val_acc: 0.7517
Epoch 20/20
- 269s - loss: 0.0249 - acc: 0.9918 - val_loss: 1.3453 - val_acc: 0.7638
              precision    recall  f1-score   support

           0       0.71      0.71      0.71       319
           1       0.72      0.68      0.70       389
           2       0.76      0.62      0.69       394
           3       0.67      0.58      0.62       392
           4       0.68      0.67      0.68       385
           5       0.75      0.73      0.74       395
           6       0.82      0.74      0.78       390
           7       0.83      0.83      0.83       396
           8       0.81      0.90      0.86       398
           9       0.92      0.90      0.91       397
          10       0.91      0.94      0.93       399
          11       0.87      0.76      0.81       396
          12       0.57      0.70      0.63       393
          13       0.81      0.85      0.83       396
          14       0.74      0.93      0.82       394
          15       0.82      0.83      0.83       398
          16       0.74      0.78      0.76       364
          17       0.96      0.83      0.89       376
          18       0.64      0.60      0.62       310
          19       0.48      0.56      0.52       251

 avg / total       0.77      0.76      0.76      7532

Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNN) are another deep learning architecture that can be employed for hierarchical document classification. Although originally built for image processing, with an architecture similar to the visual cortex, CNNs have also been used effectively for text classification. In a basic CNN for image processing, an image tensor is convolved with a set of kernels of size d × d. These convolution layers, called feature maps, can be stacked to provide multiple filters on the input. To reduce computational complexity, CNNs use pooling, which reduces the size of the output from one layer to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features.

The most common pooling method is max pooling, where the maximum element is selected from each pooling window. To feed the pooled output from the stacked feature maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected dense layers. During the back-propagation step of a convolutional neural network, not only the weights of the dense layers but also the feature detector filters are adjusted. A potential problem with CNNs applied to text is the number of ‘channels’, Σ (the size of the feature space). For text this can be very large (e.g. 50K), whereas for images it is much smaller (e.g. only 3 RGB channels), so the dimensionality of a CNN for text is very high.
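
For instance, max pooling with a window of two keeps only the strongest activation in each window (toy values):

import numpy as np

feature_map = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2])
pooled = feature_map.reshape(-1, 2).max(axis=1)  # non-overlapping windows of two
print(pooled)  # [0.9 0.7 0.5]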


import packages:

from keras.layers import Dropout, Dense,Input,Embedding,Flatten, MaxPooling1D, Conv1D
from keras.models import Sequential,Model
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups
from keras.layers.merge import Concatenate

convert text to word embedding (Using GloVe):

def loadData_Tokenizer(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    f = open("C:\\Users\\swayam\\Documents\\Glove\\glove.6B.50d.txt", encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except:
            continue  # skip malformed lines rather than reusing stale values
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)

Build a CNN Model for Text:

def Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):
    """
    Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5)
    word_index is the word index,
    embeddings_index is the embeddings index (see loadData_Tokenizer),
    nclasses is the number of classes,
    MAX_SEQUENCE_LENGTH is the maximum length of the text sequences,
    EMBEDDING_DIM is the dimension of the word embeddings
    """
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index keep their random initialization
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)), " Please make sure your"
                      " EMBEDDING_DIM matches the dimension of the embedding (GloVe) file.")
                exit(1)
            embedding_matrix[i] = embedding_vector
    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    # applying a more complex convolutional approach with multiple filter sizes
    convs = []
    filter_sizes = []
    layer = 5
    print("Filter ", layer)
    for fl in range(0, layer):
        filter_sizes.append((fl + 2))
    node = 128
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    for fsz in filter_sizes:
        l_conv = Conv1D(node, kernel_size=fsz, activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(5)(l_conv)
        # l_pool = Dropout(0.25)(l_pool)
        convs.append(l_pool)
    l_merge = Concatenate(axis=1)(convs)
    l_cov1 = Conv1D(node, 5, activation='relu')(l_merge)
    l_cov1 = Dropout(dropout)(l_cov1)
    l_pool1 = MaxPooling1D(5)(l_cov1)
    l_cov2 = Conv1D(node, 5, activation='relu')(l_pool1)
    l_cov2 = Dropout(dropout)(l_cov2)
    l_pool2 = MaxPooling1D(30)(l_cov2)
    l_flat = Flatten()(l_pool2)
    l_dense = Dense(1024, activation='relu')(l_flat)
    l_dense = Dropout(dropout)(l_dense)
    l_dense = Dense(512, activation='relu')(l_dense)
    l_dense = Dropout(dropout)(l_dense)
    preds = Dense(nclasses, activation='softmax')(l_dense)
    model = Model(sequence_input, preds)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

run CNN and see result:

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)
model_CNN = Build_Model_CNN_Text(word_index,embeddings_index, 20)
model_CNN.summary()
model_CNN.fit(X_train_Glove, y_train,
              validation_data=(X_test_Glove, y_test),
              epochs=15,
              batch_size=128,
              verbose=2)
predicted = model_CNN.predict(X_test_Glove)
predicted = np.argmax(predicted, axis=1)
print(metrics.classification_report(y_test, predicted))

Model summary:

__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 500) 0
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, 500, 50) 8960500 input_1[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D) (None, 499, 128) 12928 embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D) (None, 498, 128) 19328 embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_3 (Conv1D) (None, 497, 128) 25728 embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_4 (Conv1D) (None, 496, 128) 32128 embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_5 (Conv1D) (None, 495, 128) 38528 embedding_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D) (None, 99, 128) 0 conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D) (None, 99, 128) 0 conv1d_2[0][0]
__________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D) (None, 99, 128) 0 conv1d_3[0][0]
__________________________________________________________________________________________________
max_pooling1d_4 (MaxPooling1D) (None, 99, 128) 0 conv1d_4[0][0]
__________________________________________________________________________________________________
max_pooling1d_5 (MaxPooling1D) (None, 99, 128) 0 conv1d_5[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 495, 128) 0 max_pooling1d_1[0][0]
max_pooling1d_2[0][0]
max_pooling1d_3[0][0]
max_pooling1d_4[0][0]
max_pooling1d_5[0][0]
__________________________________________________________________________________________________
conv1d_6 (Conv1D) (None, 491, 128) 82048 concatenate_1[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout) (None, 491, 128) 0 conv1d_6[0][0]
__________________________________________________________________________________________________
max_pooling1d_6 (MaxPooling1D) (None, 98, 128) 0 dropout_1[0][0]
__________________________________________________________________________________________________
conv1d_7 (Conv1D) (None, 94, 128) 82048 max_pooling1d_6[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout) (None, 94, 128) 0 conv1d_7[0][0]
__________________________________________________________________________________________________
max_pooling1d_7 (MaxPooling1D) (None, 3, 128) 0 dropout_2[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten) (None, 384) 0 max_pooling1d_7[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 1024) 394240 flatten_1[0][0]
__________________________________________________________________________________________________
dropout_3 (Dropout) (None, 1024) 0 dense_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 512) 524800 dropout_3[0][0]
__________________________________________________________________________________________________
dropout_4 (Dropout) (None, 512) 0 dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 20) 10260 dropout_4[0][0]
==================================================================================================
Total params: 10,182,536
Trainable params: 10,182,536
Non-trainable params: 0
__________________________________________________________________________________________________

Output:

Train on 11314 samples, validate on 7532 samples
Epoch 1/15
- 6s - loss: 2.9329 - acc: 0.0783 - val_loss: 2.7628 - val_acc: 0.1403
Epoch 2/15
- 4s - loss: 2.2534 - acc: 0.2249 - val_loss: 2.1715 - val_acc: 0.4007
Epoch 3/15
- 4s - loss: 1.5643 - acc: 0.4326 - val_loss: 1.7846 - val_acc: 0.5052
Epoch 4/15
- 4s - loss: 1.1771 - acc: 0.5662 - val_loss: 1.4949 - val_acc: 0.6131
Epoch 5/15
- 4s - loss: 0.8880 - acc: 0.6797 - val_loss: 1.3629 - val_acc: 0.6256
Epoch 6/15
- 4s - loss: 0.6990 - acc: 0.7569 - val_loss: 1.2013 - val_acc: 0.6624
Epoch 7/15
- 4s - loss: 0.5037 - acc: 0.8200 - val_loss: 1.0674 - val_acc: 0.6807
Epoch 8/15
- 4s - loss: 0.4050 - acc: 0.8626 - val_loss: 1.0223 - val_acc: 0.6863
Epoch 9/15
- 4s - loss: 0.2952 - acc: 0.8968 - val_loss: 0.9045 - val_acc: 0.7120
Epoch 10/15
- 4s - loss: 0.2314 - acc: 0.9217 - val_loss: 0.8574 - val_acc: 0.7326
Epoch 11/15
- 4s - loss: 0.1778 - acc: 0.9436 - val_loss: 0.8752 - val_acc: 0.7270
Epoch 12/15
- 4s - loss: 0.1475 - acc: 0.9524 - val_loss: 0.8299 - val_acc: 0.7355
Epoch 13/15
- 4s - loss: 0.1089 - acc: 0.9657 - val_loss: 0.8034 - val_acc: 0.7491
Epoch 14/15
- 4s - loss: 0.1047 - acc: 0.9666 - val_loss: 0.8172 - val_acc: 0.7463
Epoch 15/15
- 4s - loss: 0.0749 - acc: 0.9774 - val_loss: 0.8511 - val_acc: 0.7313
              precision    recall  f1-score   support

           0       0.75      0.61      0.67       319
           1       0.63      0.74      0.68       389
           2       0.74      0.54      0.62       394
           3       0.49      0.76      0.60       392
           4       0.60      0.70      0.64       385
           5       0.79      0.57      0.66       395
           6       0.73      0.76      0.74       390
           7       0.83      0.74      0.78       396
           8       0.86      0.88      0.87       398
           9       0.95      0.78      0.86       397
          10       0.93      0.93      0.93       399
          11       0.92      0.77      0.84       396
          12       0.55      0.72      0.62       393
          13       0.76      0.85      0.80       396
          14       0.86      0.83      0.84       394
          15       0.91      0.73      0.81       398
          16       0.75      0.65      0.70       364
          17       0.95      0.86      0.90       376
          18       0.60      0.49      0.54       310
          19       0.37      0.60      0.46       251

 avg / total       0.76      0.73      0.74      7532

Hierarchical Attention Networks

Hierarchical Attention Networks (HAN) mirror the hierarchical structure of documents: a word-level encoder with an attention layer builds sentence representations, and a sentence-level encoder with its own attention layer aggregates those into a document representation used for classification. The attention layers let the model focus on the words and sentences that matter most for the prediction.
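
A minimal sketch of the word-level half of such a network is shown below, using attention-weighted pooling over a bidirectional GRU. It follows the general recipe of this architecture; the vocabulary size, layer sizes, and class count are assumptions rather than tuned values:

from keras.models import Model
from keras.layers import Input, Embedding, GRU, Bidirectional, Dense, Flatten, Activation, Dot

MAX_SEQUENCE_LENGTH = 500  # assumed to match the tokenizer settings used above
VOCAB_SIZE = 75000         # assumed vocabulary size
EMBEDDING_DIM = 50

words = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(words)
hidden = Bidirectional(GRU(50, return_sequences=True))(embedded)  # (batch, T, 100)
# attention: score each time step, softmax over time, then weighted sum of states
u = Dense(100, activation='tanh')(hidden)      # (batch, T, 100)
scores = Flatten()(Dense(1)(u))                # (batch, T)
alphas = Activation('softmax')(scores)         # attention weights over time steps
context = Dot(axes=(1, 1))([alphas, hidden])   # (batch, 100) weighted sum of states
preds = Dense(20, activation='softmax')(context)
model = Model(words, preds)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])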

Recurrent Convolutional Neural Networks (RCNN)

Recurrent Convolutional Neural Networks (RCNN) are also used for text classification. The main idea of this technique is to capture contextual information with the recurrent structure while constructing the representation of the text with a convolutional neural network. The architecture combines RNN and CNN to exploit the advantages of both techniques in one model.

import packages:

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import GRU, LSTM
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Convert text to word embedding (Using GloVe):

def loadData_Tokenizer(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    f = open("C:\\Users\\swayam\\Documents\\Glove\\glove.6B.50d.txt", encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except:
            continue  # skip malformed lines rather than reusing stale values
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)
Build an RCNN Model for Text:

def Build_Model_RCNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50):
    kernel_size = 2
    filters = 256
    pool_size = 2
    gru_node = 256
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index keep their random initialization
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)), " Please make sure your"
                      " EMBEDDING_DIM matches the dimension of the embedding (GloVe) file.")
                exit(1)
            embedding_matrix[i] = embedding_vector
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embedding_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))
    model.add(Dropout(0.25))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(LSTM(gru_node, recurrent_dropout=0.2))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(nclasses))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test)

run RCNN and see result:

model_RCNN = Build_Model_RCNN_Text(word_index, embeddings_index, 20)
model_RCNN.summary()
model_RCNN.fit(X_train_Glove, y_train,
               validation_data=(X_test_Glove, y_test),
               epochs=15,
               batch_size=128,
               verbose=2)
predicted = model_RCNN.predict(X_test_Glove)
predicted = np.argmax(predicted, axis=1)
print(metrics.classification_report(y_test, predicted))

Model summary:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 500, 50) 8960500
_________________________________________________________________
dropout_1 (Dropout) (None, 500, 50) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 499, 256) 25856
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 249, 256) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 248, 256) 131328
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 124, 256) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 123, 256) 131328
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 61, 256) 0
_________________________________________________________________
conv1d_4 (Conv1D) (None, 60, 256) 131328
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 30, 256) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 30, 256) 525312
_________________________________________________________________
lstm_2 (LSTM) (None, 30, 256) 525312
_________________________________________________________________
lstm_3 (LSTM) (None, 30, 256) 525312
_________________________________________________________________
lstm_4 (LSTM) (None, 256) 525312
_________________________________________________________________
dense_1 (Dense) (None, 1024) 263168
_________________________________________________________________
dense_2 (Dense) (None, 20) 20500
_________________________________________________________________
activation_1 (Activation) (None, 20) 0
=================================================================
Total params: 11,765,256
Trainable params: 11,765,256
Non-trainable params: 0
_________________________________________________________________

Output:

Train on 11314 samples, validate on 7532 samples
Epoch 1/15
- 28s - loss: 2.6624 - acc: 0.1081 - val_loss: 2.3012 - val_acc: 0.1753
Epoch 2/15
- 22s - loss: 2.1142 - acc: 0.2224 - val_loss: 1.9168 - val_acc: 0.2669
Epoch 3/15
- 22s - loss: 1.7465 - acc: 0.3290 - val_loss: 1.8257 - val_acc: 0.3412
Epoch 4/15
- 22s - loss: 1.4730 - acc: 0.4356 - val_loss: 1.5433 - val_acc: 0.4436
Epoch 5/15
- 22s - loss: 1.1800 - acc: 0.5556 - val_loss: 1.2973 - val_acc: 0.5467
Epoch 6/15
- 22s - loss: 0.9910 - acc: 0.6281 - val_loss: 1.2530 - val_acc: 0.5797
Epoch 7/15
- 22s - loss: 0.8581 - acc: 0.6854 - val_loss: 1.1522 - val_acc: 0.6281
Epoch 8/15
- 22s - loss: 0.7058 - acc: 0.7428 - val_loss: 1.2385 - val_acc: 0.6033
Epoch 9/15
- 22s - loss: 0.6792 - acc: 0.7515 - val_loss: 1.0200 - val_acc: 0.6775
Epoch 10/15
- 22s - loss: 0.5782 - acc: 0.7948 - val_loss: 1.0961 - val_acc: 0.6577
Epoch 11/15
- 23s - loss: 0.4674 - acc: 0.8341 - val_loss: 1.0866 - val_acc: 0.6924
Epoch 12/15
- 23s - loss: 0.4284 - acc: 0.8512 - val_loss: 0.9880 - val_acc: 0.7096
Epoch 13/15
- 22s - loss: 0.3883 - acc: 0.8670 - val_loss: 1.0190 - val_acc: 0.7151
Epoch 14/15
- 22s - loss: 0.3334 - acc: 0.8874 - val_loss: 1.0025 - val_acc: 0.7232
Epoch 15/15
- 22s - loss: 0.2857 - acc: 0.9038 - val_loss: 1.0123 - val_acc: 0.7331
              precision    recall  f1-score   support

           0       0.64      0.73      0.68       319
           1       0.45      0.83      0.58       389
           2       0.81      0.64      0.71       394
           3       0.64      0.57      0.61       392
           4       0.55      0.78      0.64       385
           5       0.77      0.52      0.62       395
           6       0.84      0.77      0.80       390
           7       0.87      0.79      0.83       396
           8       0.85      0.90      0.87       398
           9       0.98      0.84      0.90       397
          10       0.93      0.96      0.95       399
          11       0.92      0.79      0.85       396
          12       0.59      0.53      0.56       393
          13       0.82      0.82      0.82       396
          14       0.84      0.84      0.84       394
          15       0.83      0.89      0.86       398
          16       0.68      0.86      0.76       364
          17       0.97      0.86      0.91       376
          18       0.66      0.50      0.57       310
          19       0.53      0.31      0.40       251

 avg / total       0.77      0.75      0.75      7532

Random Multimodel Deep Learning (RMDL)

Random Multimodel Deep Learning (RMDL) is a new ensemble deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of different deep learning architectures. RMDLs can accept a variety of data as input, including text, video, images, and symbols.


Random Multimodel Deep Learning (RMDL) architecture for classification: RMDL includes three random models, a DNN classifier at left, a deep CNN classifier in the middle, and a deep RNN classifier at right (each recurrent unit could be an LSTM or GRU).

Installation

RMDL can be installed either with pip or from the git repository:

Using pip

pip install RMDL

Using git

git clone --recursive https://github.com/kk7nc/RMDL.git

The primary requirements for this package are Python 3 with Tensorflow. The requirements.txt file contains a listing of the required Python packages; to install all requirements, run the following:

pip install -r requirements.txt

Or

pip3  install -r requirements.txt

Or:

conda install --file requirements.txt
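
Once installed, text classification runs through a single call. The following is a sketch based on the usage examples in the RMDL repository; exact keyword arguments and defaults may differ between versions:

from sklearn.datasets import fetch_20newsgroups
from RMDL import RMDL_Text

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# random_deep gives the number of random DNN, RNN, and CNN models in the
# ensemble; epochs sets the training epochs for each of the three families.
RMDL_Text.Text_Classification(newsgroups_train.data, newsgroups_train.target,
                              newsgroups_test.data, newsgroups_test.target,
                              batch_size=128,
                              sparse_categorical=True,
                              random_deep=[3, 3, 3],
                              epochs=[10, 10, 10])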

Hierarchical Deep Learning for Text (HDLTex)

Increasingly large document collections require improved information processing methods for searching, retrieving, and organizing text documents. Central to these methods is document classification, an important supervised learning task. The performance of traditional supervised classifiers has degraded as the number of documents has increased, and this exponential growth in document volume has also increased the number of categories. Hierarchical Deep Learning for Text classification (HDLTex) approaches this problem differently from document classification methods that treat it as flat multi-class classification: it performs hierarchical classification instead, employing stacks of deep learning architectures to provide a hierarchical understanding of the documents.
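
To make the idea concrete, here is a minimal sketch of two-level hierarchical classification that reuses Build_Model_DNN_Text from the DNN section above. The toy data, category counts, and epoch counts are all hypothetical, and the authors' actual HDLTex implementation is considerably richer:

import numpy as np

# Toy stand-in data: 200 documents with 1000 TF-IDF features, 3 parent
# categories, each with 2 child categories (all values are hypothetical).
n_parents, n_children = 3, 2
X = np.random.random((200, 1000))
y_parent = np.random.randint(0, n_parents, 200)
y_child = np.random.randint(0, n_children, 200)

# level 1: one classifier for the parent categories
parent_model = Build_Model_DNN_Text(X.shape[1], n_parents)
parent_model.fit(X, y_parent, epochs=2, batch_size=32, verbose=0)

# level 2: a separate classifier per parent, trained only on its documents
child_models = {}
for p in range(n_parents):
    rows = (y_parent == p)
    child_models[p] = Build_Model_DNN_Text(X.shape[1], n_children)
    child_models[p].fit(X[rows], y_child[rows], epochs=2, batch_size=32, verbose=0)

def predict_hierarchical(x):
    p = int(np.argmax(parent_model.predict(x), axis=1)[0])
    c = int(np.argmax(child_models[p].predict(x), axis=1)[0])
    return p, c

print(predict_hierarchical(X[:1]))  # (parent category, child category)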


Conclusion

This should give you a good idea of how you can leverage deep learning for text classification tasks! The list of techniques here is not exhaustive, but it covers some of the most popular and widely used methods for training neural network text classifiers. I recommend trying these out on your own data.
