Neural Network for Sentiment Analysis [Part 3: Noise Reduction Hypothesis Implementations]

Vedant Dave
7 min read · May 8, 2020


[Hey there! Welcome back. If you have not read "Part-1: [Feature Extraction]" and "Part-2: [Neural_Net. Implementation # HP tuning]" (just an 8-min read), I advise you to go through them before Part-3; they will give you a better understanding of the neural network architecture and hyperparameter tuning.]

Use my Google Colab Notebook for interactive learning!

What is noise in terms of "text analysis", and why is the neural-network approach to it different from other ML methods?

  • In normal machine-learning pipelines where we use NLTK, we have the leverage to remove common words that carry no specific meaning for our classification objective, such as prepositions, auxiliary verbs, spaces, and special characters (a quick sketch of this appears right after this list).
  • But for neural networks, we do not have as much leverage. Yes, we can do it during preprocessing, but with a large amount of data is it really practical (in terms of computational and time complexity)?
  • Well, to resolve this complexity we have another way: I will first present each noise-reduction hypothesis and then apply it to the pre-trained model from Part-2.
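For reference, here is a minimal sketch (my own illustration, not this series' code) of the classic NLTK-style cleanup mentioned in the first point; it assumes the stopwords corpus has already been downloaded with nltk.download('stopwords'):

import string
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

def clean_review(review):
    # drop English stopwords and surrounding punctuation before vectorizing
    stop_words = set(stopwords.words("english"))
    tokens = review.lower().split()
    cleaned = [t.strip(string.punctuation) for t in tokens]
    return [t for t in cleaned if t and t not in stop_words]

print(clean_review("The movie was a great surprise, and the acting felt honest!"))
# -> ['movie', 'great', 'surprise', 'acting', 'felt', 'honest']

Doing this over millions of reviews is exactly the preprocessing cost the second bullet warns about, which is why the hypotheses below work inside the network instead.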

Let’s first observe common words used to train our network…

Most common words in review [as INPUT]

Our most frequent words carry no sentiment, but the neural network still gives them importance: their counts are multiplied by the weights (w) and produce the resultant output (y) passed on to the neighboring neurons of the next layer.

As a result, they make our model weaker during prediction because they pull focus (attention) away from the more meaningful positive/negative sentiment words.

So, now, what is the solution?

Hypothesis_1:

What if we just record which words from our dictionary appear in a particular review, and do not care about how many times each word appears in that review?

What will we do? Instead of counting words, we just record word existence.

For this, I am changing our update_input_layer:

def update_input_layer(self, reviews):
    # BEFORE: count how many times each known word appears and store it in layer_0 (the input layer)
    self.layer_0 *= 0
    for word in reviews.split(" "):
        if word in self.word2index.keys():
            self.layer_0[0][self.word2index[word]] += 1

=====================< AFTER UPDATE >============================

def update_input_layer(self, reviews):
    # AFTER: only record whether each known word exists (1) in the review, ignoring its count
    self.layer_0 *= 0
    for word in reviews.split(" "):
        if word in self.word2index.keys():
            self.layer_0[0][self.word2index[word]] = 1

Here, instead of counting words, we mark existence with 1; if a word is not in the review, its entry stays 0. So the input array now looks like this:

old_input = array([18, 0, 0, 27, 3, 8, 0, 0, 2, 0])
====================< AFTER UPDATE >==============================
new_input = array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0])

Let’s check the model's performance improvement…

Noise Reduction hypothesis_1 Training [lr = 0.001] >> Improvement = 19 % Accuracy Increase
Noise Reduction hypothesis_1 Testing [lr = 0.001] >> Improvement = 7% Accuracy Increase

Great! So far we have improved our neural network's accuracy, but as I mentioned before, our training and testing speed (reviews/sec) is still not very efficient.

To improve it we need to think at the operational level, and here we need real math skills and aptitude…

Hypothesis 2:

Our computation starts from layer_0, which is a long array of 0s and 1s with a length of 74,074 (one slot per vocabulary word). But in reality, most companies ask customers for reviews of only 200 to 500 words.

And if we look closely, a review contains even fewer unique words than its total word count. So the main question becomes: why do we need to work with 74,074 positions to process one review of 200 to 500 words? Since the concern is only to compare the review's words with the positive or negative part of the dictionary, we can restrict the computation to those words during training and prediction.

Solution:

During the creation of layer_1, the input layer_0 is multiplied by the weight matrix, but these calculations involve only 1s and 0s (1*w = w, 0*w = 0). So it is better to add the relevant weights directly into layer_1, using only the indices that hold 1s.

Why do we do so? The reason is computational power (the complexity of the calculation).

Normal calculation: layer_1 = np.dot(layer_0, weights_0_1), so the number of operations scales with the full input length of 74,074.

Now, the calculation after applying hypothesis 2: layer_1 is built directly from indices, the list of vocabulary positions where the 1s exist:

for index in indices:
    layer_1 += (1 * weights_0_1[index])

So the number of operations scales with only the 200-500 words actually present in the review.

Hypothesis Unit Testing [without hypo_2]
Hypothesis Unit Testing [applied hypo_2]

See, the output is the same in each case. Note: the layer_1 update here also multiplies by 1, so removing that multiplication will not change the output.
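As a sanity check, here is a minimal standalone sketch of the same comparison with made-up sizes (the real vocabulary in this series has 74,074 words): summing the selected weight rows gives exactly the same layer_1 as the full dot product over the mostly-zero input layer.

import numpy as np

np.random.seed(1)
vocab_size, hidden_size = 10, 4
weights_0_1 = np.random.randn(vocab_size, hidden_size)

indices = [0, 3, 7]                      # vocabulary positions of words in the review
layer_0 = np.zeros((1, vocab_size))
layer_0[0][indices] = 1                  # hypothesis 1 style: existence only

layer_1_full = layer_0.dot(weights_0_1)  # normal calculation: full dot product

layer_1_fast = np.zeros((1, hidden_size))
for index in indices:                    # hypothesis 2: touch only the rows that matter
    layer_1_fast += 1 * weights_0_1[index]

print(np.allclose(layer_1_full, layer_1_fast))   # True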

Observation: we save time and computation power, and the speed of the network improves.

Hypothesis_2 : Training Speed Improvement by +950 (reviews/sec)
Hypothesis_2 : Testing Speed Improvement by +1500 (reviews/sec)

In hypothesis_1 we just flattened every word's count to 1, but if we can remove the noisy words entirely, it will give us a further improvement in the network. So our next hypothesis_3 is an improved version of hypothesis_1. Let's first try to understand it.

During data processing, we computed pos_neg_ratio to identify positive and negative review words. As shown below, the words whose ratios are farthest from neutral proved to be the most important ones.

Important words under Positive labels
Important words under Negative labels
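As a refresher of the Part-1 idea, here is a minimal sketch of how a pos_neg_ratio of this kind can be computed; the toy reviews and the add-one smoothing are my own assumptions for illustration, not the series' exact code.

from collections import Counter
import numpy as np

# Hypothetical toy data; in the series these come from the full review dataset.
reviews = ["great movie superb acting", "terrible boring waste of time",
           "superb plot great cast",    "boring terrible acting"]
labels  = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
for review, label in zip(reviews, labels):
    for word in review.split(" "):
        total_counts[word] += 1
        if label == "POSITIVE":
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1

# Log ratio with add-one smoothing: > 0 leans positive, < 0 leans negative, ~0 is common/neutral.
pos_neg_ratios = Counter()
for word in total_counts:
    ratio = (positive_counts[word] + 1) / float(negative_counts[word] + 1)
    pos_neg_ratios[word] = np.log(ratio)

print(pos_neg_ratios.most_common(3))          # most positive-leaning words
print(pos_neg_ratios.most_common()[-3:])      # most negative-leaning words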

Let’s first visualize the distribution of such words …

Words pos/neg. distribution

Visualization Observation:

The X-axis shows the pos/neg ratio values, which means the strongly positive and strongly negative words sit near the two ends, and their counts are quite low compared to the common words.

The Y-axis shows counts, and the highest counts occur for ratios near 0, i.e. in the area between (-1, 1). The graph resembles a normal distribution.

Repetition: our common words have higher counts and appear under both negative and positive labels, so their predictive power is lower than that of the distinctive words that lean strongly toward one label.

Zipfian Distribution
  • This is a Zipfian distribution graph in which each line represents a corpus (word collection); it shows that a small number of words dominates the text.
  • Integrating both graphs, we can conclude that the words near the neutral point of the (roughly normal) ratio distribution dominate our text, so it is better to remove them by filtering pos_neg_ratio by range and by a minimum count value. This is the main idea behind our next hypothesis_3.

We need to change the preprocessing step to add two new parameters: polarity_cutoff and min_count.

Preprocessing_stage — update

Please refer to Part-1 to understand the preprocessing stage. Here, we just make two main changes.

  1. min_count >> we only keep words that are repeated more than a certain number of times. This gives us the advantage of neglecting rare words such as an industry name, type of product, company or brand name, distributor name, etc.
  2. polarity_cutoff >> this filter decides which range of word ratios from the distribution graph is kept. Widening the neglected range feeds less input to the neural network, which improves speed but also has a chance of losing accuracy (a minimal sketch of this filter follows this list).
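To make the idea concrete, here is a minimal sketch of such a filter. build_vocab is a hypothetical helper (the actual change lives inside the preprocessing code in the notebook), and it assumes pos_neg_ratios holds log ratios centered at 0, as in the earlier sketch.

from collections import Counter

def build_vocab(total_counts, pos_neg_ratios, min_count=10, polarity_cutoff=0.1):
    # Keep only words that appear often enough AND whose pos/neg ratio is clearly polarized.
    vocab = set()
    for word, count in total_counts.items():
        if count >= min_count and abs(pos_neg_ratios[word]) >= polarity_cutoff:
            vocab.add(word)
    return vocab

# Tiny made-up example: "movie" is frequent but neutral, so it is filtered out.
total_counts   = Counter({"great": 40, "terrible": 35, "movie": 300, "obscureword": 2})
pos_neg_ratios = Counter({"great": 1.2, "terrible": -1.1, "movie": 0.02, "obscureword": 0.9})
print(build_vocab(total_counts, pos_neg_ratios, min_count=10, polarity_cutoff=0.1))
# -> {'great', 'terrible'}

Raising polarity_cutoff (or min_count) shrinks the input layer further, trading a bit of accuracy for speed.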

The following graph will give you a better idea…

Polarity_cutoff range and data reduction

With Range-1, fewer words are neglected, so the speed gain is smaller than with the Range-2 polarity_cutoff; with the wider Range-2, speed improves further but accuracy can be affected more easily.

Let’s check model performance…

Training Performance speed improved by +2000 (reviews/sec)
Testing Performance speed improved by +2100 (reviews/sec)

Combined Evaluation:

Here, “speed increased 3 times at the cost of 4% accuracy”. In this case we can tolerate the accuracy loss: when we work with immense data (the big-data world), the main task is data optimization, and if we still get the minimum tolerable accuracy needed to meet the objective, then I think it is better to choose speed over accuracy. (Not always: think of high-precision domains such as healthcare, aviation tech, public security, etc.)

To improve this scenario, companies can use additional signals such as star-rating buttons and emojis alongside the text data. By integrating these with the text reviews, the DS team can improve its confidence level.

Thank you for reading. I tried my best; if you still have any suggestions, please let me know in a comment. If you like my work, please show your sentiments by giving me a “clap” and share it with your connections; it helps keep me motivated.

The motto of my life: “Keep Learning, Enjoy Empowering”
