Classification of Documents Using Convolutional Neural Networks (CNN)

Priyanka Goel · Published in The Startup · 7 min read · Sep 10, 2020

Dealing with text data in deep learning with the help of a CNN and word embeddings.

Table of contents:

  1. What is document classification
  2. Preprocessing
  3. Word Embedding
  4. Keras Embedding layer
  5. Tokenizer API
  6. GloVe: global vectors for word representation
  7. Model Creation
  8. Model Summary

Document classification

Document classification is an application of machine learning in which we classify text based on its content.

There are two broad categories of machine learning techniques that can be used for it.

Supervised learning — where we already know the category to which each document belongs. The model parses through the data during training and learns a mapping function from it.

Categories are predefined, and documents within the training dataset are manually tagged with one or more category labels. After training, the model is able to categorize new documents it is given.

Unsupervised learning — where we do not have class labels attached to the documents, and we use ML algorithms to cluster documents of the same type.

Refer to the diagram below for a better understanding:

[Diagram: document classification]

Preprocessing

Let's suppose we have millions of emails and we need to classify the class to which each of these emails belongs.

In the real world, the data we are given is never perfect. We need to preprocess it so as to extract maximum knowledge from it, without confusing our model with extra information.

  1. Take out the subject, remove extra details from it, and put it in a Data-frame.
  2. Extract all the email IDs mentioned in the mail and get them into a Data-frame.
  3. Extract the given text data, preprocess it, and put it in a Data-frame.
  4. Combine all these, and we are ready with the desired text to give to our model (see the sketch below).
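A rough sketch of these steps, assuming a hypothetical list of raw email strings (the regex patterns and column names are illustrative, not the exact code from the post):

```python
import re
import pandas as pd

# Hypothetical raw emails; real data would come from your mail dump.
raw_emails = [
    "Subject: Meeting tomorrow\nFrom: alice@example.com\nLet us sync at 10 am.",
    "Subject: 50% off today only\nFrom: deals@shopmail.com\nGrab the offer now!",
]

rows = []
for mail in raw_emails:
    subject = re.search(r"Subject: (.*)", mail)               # take out the subject
    email_ids = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", mail)  # all email ids
    body = re.sub(r"(Subject|From): .*", "", mail).strip()    # remaining text
    rows.append({
        "subject": subject.group(1) if subject else "",
        "email_ids": email_ids,
        "text": body,
    })

df = pd.DataFrame(rows)  # one combined Data-frame, ready for the model
```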

For more information and code, visit my GitHub; the link is at the end of this blog.

Word embedding

Word embedding is a representation of text where words that have the same meaning have a similar representation. In other words, it represents words in a coordinate system in which related words, based on a corpus of relationships, are placed closer together.

It is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to dense vectors of real numbers.

It is an improvement over traditional ways of encoding such as bag-of-words, where each word is represented by a large sparse vector whose size depends on the vocabulary being dealt with.

In contrast, an embedding represents a word by a dense vector that is the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the neighboring words.

Keras Embedding layer

Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API, also provided with Keras.

This layer can be defined as in the minimal sketch below, which uses the same example numbers as the argument descriptions that follow:
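```python
from tensorflow.keras.layers import Embedding

# A vocabulary of 50 unique words, 8-dimensional word vectors,
# and input sequences padded to a length of 100.
embedding_layer = Embedding(input_dim=50,
                            output_dim=8,
                            input_length=100,
                            trainable=True)
```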

A few of the important arguments are:

  1. input_dim — specifies the size of the vocabulary in the text data. For example, if the total number of unique words in your data is 50, then input_dim is 50.
  2. output_dim — specifies the size of the word vectors you get as the output of the embedding layer.
  3. input_length — the length of the input sequences. For example, if the maximum length of a sentence in a document is 100, then its input_length is 100.
  4. trainable — specifies whether we want to train the embedding layer or not.

The embedding layer can be used in different ways:

  1. We use it as part of a deep learning model, and this layer learns its weights along with the model itself. In such scenarios we set the parameter trainable to True.
  2. We use already pretrained vectors, trained on large datasets, to represent our words. In such scenarios we set the parameter trainable to False.

We will be focusing on how to use pretrained vectors to represent our words while training the model on our own dataset.

Let us take an example to understand it more deeply.

Suppose we have a dataset which contains a few remarks, and we need to classify the class to which each remark belongs.

1 signifies that the remark is good, whereas 0 signifies that it is bad.

Given set of data — a hypothetical stand-in is sketched below (the exact remarks are illustrative):
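```python
# Hypothetical example data: ten short remarks, labelled 1 (good) or 0 (bad).
docs = ['Well done!', 'Good work', 'Great effort', 'Nice work', 'Excellent!',
        'Weak', 'Poor effort', 'Not good', 'Poor work', 'Very bad']
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```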

The Keras deep learning library provides some basic tools to help us prepare our text data. Text data must be encoded as numbers before being used as input or output for machine learning and deep learning models.

For this purpose we use the Tokenizer API.

Tokenizer API

A minimal sketch of basic Tokenizer usage:
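```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)  # builds the word index from our remarks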

Sometimes we want some special punctuation to be part of our analysis; in that case we can specify only the filters we want removed.

Tokenizer with the filters argument — a sketch (the filter string shown is the default minus the exclamation mark):
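```python
# Keep '!' in the tokens by leaving it out of the filter string;
# everything else listed is still stripped, as in the default.
tokenizer_keep_bang = Tokenizer(filters='"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer_keep_bang.fit_on_texts(docs)
```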

Our vocabulary is the set of all unique words present in the dataset, and the Tokenizer represents each word with a unique integer.

The fitted tokenizer exposes this mapping as its word_index dictionary; for the hypothetical remarks above (with default filters) it would look roughly like:
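```python
print(tokenizer.word_index)
# e.g. {'work': 1, 'good': 2, 'effort': 3, 'poor': 4, 'well': 5, 'done': 6,
#       'great': 7, 'nice': 8, 'excellent': 9, 'weak': 10, 'not': 11,
#       'very': 12, 'bad': 13}
```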

So the total vocab size in this case will be 13. (When defining the Embedding layer, we typically use len(word_index) + 1 as the vocabulary size, since index 0 is reserved for padding.)

Now we need to encode our complete data in integer form, using texts_to_sequences, and then apply padding to make all sequences the same length.

We can pad our data in two ways: post-padding appends zeros at the end of each sequence, while pre-padding prepends them. A sketch of both (the maxlen value is illustrative):
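```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

encoded_docs = tokenizer.texts_to_sequences(docs)  # integer-encode the remarks

# Padding is post: zeros are added after the words.
padded_post = pad_sequences(encoded_docs, maxlen=4, padding='post')

# Padding is pre: zeros are added before the words.
padded_pre = pad_sequences(encoded_docs, maxlen=4, padding='pre')
```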

Now that we are ready with our words, let's discuss the pretrained vectors for word representation and where we can download them from.

GloVe: global vectors for word representation

GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm developed at Stanford for generating word embeddings by aggregating a global word-word co-occurrence matrix from a corpus.

We can download these embeddings and seed the Keras Embedding layer with weights from the pretrained vectors for the words in our training dataset.

GloVe: Global Vectors for Word Representation — https://nlp.stanford.edu/projects/glove/

We can download any of the zip files depending upon the requirement. After unzipping, the file I used is “glove.6B.100d.txt”.

If we look inside this file, each line contains a word followed by its vector: the word, then its 100 floating-point values, all separated by spaces.

Now we need to load this entire file and fetch each word along with its corresponding vector representation.
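A minimal sketch of this step, building a word-to-vector dictionary from the 100-dimensional file above:

```python
import numpy as np

embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                 # the word itself
        coefs = np.asarray(values[1:], dtype='float32')  # its 100-d vector
        embeddings_index[word] = coefs
```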

Once this is done, we need to create an embedding matrix for the vocabulary we built from the training dataset, taking each word's vector representation from the GloVe file.
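A sketch of building that matrix; words not found in GloVe keep a zero vector:

```python
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # row i holds the GloVe vector of word i
```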

Model Creation

Once we are all set with our embedding matrix, we use the predefined Keras Embedding layer and pass our embedding matrix through the weights parameter.

This embedding layer comes just after the input layer. We can add any number of layers as per our model’s requirement.

Final model — a minimal sketch, assuming a simple Conv1D architecture (the exact layers in the original may differ):
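```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

max_length = 4  # illustrative; use the padded sequence length of your data

model = Sequential()
model.add(Embedding(vocab_size, 100,
                    weights=[embedding_matrix],  # seed with GloVe vectors
                    input_length=max_length,
                    trainable=False))            # keep pretrained vectors fixed
model.add(Conv1D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))        # binary: good (1) vs bad (0)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```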
Model Summary
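We can inspect the architecture with model.summary(), which prints each layer along with its output shape and parameter count:

```python
model.summary()
```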

Now we can fit the model using the fit method.

Accordingly, we can evaluate it and validate it on our test data. A sketch of both steps follows.
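A sketch of fitting and evaluating, assuming the padded data and labels from above (the epoch count is illustrative, and we evaluate on the training data only for brevity; in practice use a held-out test set):

```python
import numpy as np

labels = np.array(labels)
model.fit(padded_post, labels, epochs=50, verbose=0)

loss, accuracy = model.evaluate(padded_post, labels, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))
```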

For more details, visit my GitHub.
