Sentiment Analysis from a High-school Freshman’s Perspective

“My First Experiences with Machine Learning”

Tanish Baranwal
8 min read · Nov 3, 2018

Have you ever wondered how sites like Yelp and Google rate the reviews on their websites so accurately? Most major websites these days have some kind of sentiment analysis algorithm running in the background, rating and indexing the reviews they get from their users. Sentiment analysis is the process of extracting the emotion behind someone’s text. As humans, we do this naturally by looking at the meanings behind words and putting together a cohesive set of emotions behind the text. For a computer, however, this level of cognition was impossible to achieve just a decade ago. Companies can use sentiment analysis to gauge consumer support for themselves and their competitors, and to track their own reputation. Intrigued, I turned to the internet to learn about the technologies used for sentiment analysis. Armed with this information, I set out to experiment and apply my newly acquired knowledge. Let’s start with a brief introduction to Machine Learning.

What is ML?

Credit: Pinnacle Digest

ML (Machine Learning) is the study of systems that use data to learn patterns, produce meaningful results, and make predictions across many applications. DL (Deep Learning) and NNs (Neural Networks) are specialized techniques for achieving ML goals more efficiently.

The field of ML, and especially DL, has been growing rapidly over the past several years. Over that time, the typical amount of enterprise data has grown from roughly 10,000–100,000 examples to 1,000,000–1,000,000,000. This created a unique issue that researchers hadn’t encountered before: the limits of compute power for training models. To create deeper neural networks, they needed a level of computational power that the time just couldn’t provide, which is why the field of ML plateaued for years. Then the rise of computational power allowed researchers to implement deeper, higher-dimensional neural networks. New technological advances brought Graphics Processing Units (GPUs), some of the most efficient hardware for the high-dimensional matrix multiplication required for deep learning. Once again, the only limitation became the algorithms and architectures of the models, while more data could constantly be collected to improve accuracy.

Credit: NormShield

Now, the field of DL has split into several major groups: computer vision, audio understanding, and NLP, to name a few. My interest leaned toward NLP, so I looked further into it. There are a few major industries that NLP can completely redesign: customer service, virtual assistants, and information retrieval.

In customer service, chatbots can streamline support, taking care of simple tasks and questions while leaving complex queries to their human counterparts. In the future, DL models could analyze a call and rate customer satisfaction through sentiment analysis. Virtual assistants use natural language understanding techniques to extract commands from your speech. Information retrieval systems extract valuable information from unstructured text using sentiment analysis and abstractive summarization.

Credit: KissPNG

For my project, I chose sentiment analysis because it is very versatile and shows up in almost everything in NLP. Sentiment analysis is used commercially as a way to extract information from consumers. For example, it can be used to analyze customer reviews to find out the general opinion. It can also be applied to live conversations, hand in hand with speech-to-text systems. It is also very easy to evaluate for accuracy. As with any DL problem, I started by sourcing my data.

The Problem with Words:

Since words cannot be manipulated mathematically on their own, I used both GloVe word embeddings, which represent each word as a 50-dimensional vector, and FastText word embeddings, which represent each word as a 300-dimensional vector. These word vectors capture the semantic meaning behind the words as a series of numbers. For example, porpoise and dolphin are similar in meaning, so their cosine distance (the distance between two vectors, measured using the cosine of the angle between them) is much smaller than the distance between dolphin and Paris. See the figure below:

Credit: Kaggle
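To make the cosine-distance idea concrete, here is a minimal NumPy sketch. The three-dimensional vectors are made-up stand-ins purely for illustration; real GloVe vectors have 50 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) between two vectors; values near 1 mean similar meaning
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-d vectors purely for illustration (real GloVe vectors are 50-d).
porpoise = np.array([0.8, 0.1, 0.3])
dolphin  = np.array([0.7, 0.2, 0.4])
paris    = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(porpoise, dolphin))  # high similarity (small cosine distance)
print(cosine_similarity(dolphin, paris))     # low similarity  (large cosine distance)
```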

What I did:

For the sentiment analysis task, I built different neural-network-based models with different types of layers. I used a CNN, which applies multiple learnable filters; an RNN and a BRNN with Long Short-Term Memory (LSTM) units, which allow the network to “remember” information; and a vanilla neural network with fully connected layers of neurons. Conventionally, CNNs are used on images to map thousands of pixel values to a classifier. RNNs/BRNNs are used in sequence-to-sequence applications, where a sequence like a sentence or audio clip is the input and another sequence is the output. Fully connected layers are now used as a complement to CNNs and RNNs, but rarely on their own. By trying each type of model, the best architecture for this problem can be determined.

Each model was built differently, using one of the architectures described below. The GUI used was Jupyter Notebook, since it can save the notebook along with its outputs. The TensorFlow and NumPy packages were used along with Python built-ins.

First, the data, labels, and word vectors were loaded.
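As a rough illustration of that loading step (the file name glove.6B.50d.txt and the load_glove helper are my assumptions, not taken from the original code), the standard GloVe release is plain text with one word per line followed by its vector values and can be read into a dictionary like this:

```python
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Read a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

word_vectors = load_glove()
print(word_vectors["dolphin"].shape)  # (50,) for the 50-dimensional vectors
```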

Model Diagrams:

Model 1: 2 Layered Fully Connected
Model 2: 1 Layer Fully Connected

The above two models (model 1 and model 2) are a two-layer and a single-layer dense neural network, respectively. This is the traditional neural network, in which each unit applies an activation function to a weighted sum of its inputs.
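As a hedged sketch of what a fully connected model like these might look like (the layer sizes, sequence length, and use of the Keras API are my assumptions; the original was written against TensorFlow’s lower-level API):

```python
import tensorflow as tf

SEQ_LEN, EMB_DIM, NUM_CLASSES = 40, 50, 2  # assumed shapes, not the article's settings

# Model-1-style sketch: two dense layers over the flattened word vectors.
fc_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(SEQ_LEN, EMB_DIM)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```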

Model 3: Single Layer Conv Net
Model 4: Dual Layer Conv Net

The above two models (model 3 and model 4) are called CNNs, or Convolutional Neural Networks. Typically, this network architecture is used for images because of the large number of input features, which in that case are pixel values.
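A single-layer convolutional model along these lines might look like the following sketch; the filter count and kernel size are assumptions, not the article’s actual settings.

```python
import tensorflow as tf

SEQ_LEN, EMB_DIM, NUM_CLASSES = 40, 50, 2  # assumed shapes

# Model-3-style sketch: learnable filters slide over the word-vector sequence,
# then pooling collapses the result before the softmax classifier.
cnn_model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu",
                           input_shape=(SEQ_LEN, EMB_DIM)),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```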

Model 5: Single Layer RNN
Model 6: Dual Layer RNN

The above two models (model 5 and model 6) are RNNs, or Recurrent Neural Networks. I used LSTM cells to build the RNNs. This architecture is specially designed to remember and use the order of words in language. Each LSTM cell passes its output to the next, allowing information to be remembered by the later cells.
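A single-layer LSTM model in the same hedged Keras style (the 64-unit size is an assumption):

```python
import tensorflow as tf

SEQ_LEN, EMB_DIM, NUM_CLASSES = 40, 50, 2  # assumed shapes

# Model-5-style sketch: an LSTM reads the word vectors in order and its final
# state feeds the softmax classifier.
rnn_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(SEQ_LEN, EMB_DIM)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```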

Model 7: Single Layered Bidirectional RNN
Model 8: Dual Layer Bidirectional RNN

The above two models (model 7 and model 8) are the single-layer and dual-layer Bidirectional RNN, respectively. These models are inspired by the fact that language has both left-to-right and right-to-left dependencies (the order of words matters in both directions). The network processes the input both left to right and right to left, and the two outputs are concatenated together.
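And the bidirectional version, in the same sketch style: the same LSTM is wrapped so the sequence is read in both directions and the two outputs are concatenated.

```python
import tensorflow as tf

SEQ_LEN, EMB_DIM, NUM_CLASSES = 40, 50, 2  # assumed shapes

# Model-7-style sketch: the LSTM is wrapped in Bidirectional, so the sequence
# is read left-to-right and right-to-left and the two outputs are concatenated.
brnn_model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64),
                                  input_shape=(SEQ_LEN, EMB_DIM)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```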

Each model had dropout applied to each layer and a final softmax function at the end. Each model was then trained using the Adam optimizer (tf.train.AdamOptimizer) with a learning rate of 0.001 for 10,000 iterations.
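In rough Keras terms, the training setup might look like the sketch below. Here x_train and y_train are random stand-ins for the embedded tweets and their labels, and the epoch and batch-size values are placeholders; only the Adam optimizer and the 0.001 learning rate come from the article.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN, EMB_DIM, NUM_CLASSES = 40, 50, 2                           # assumed shapes
x_train = np.random.rand(256, SEQ_LEN, EMB_DIM).astype("float32")   # stand-in for embedded tweets
y_train = np.random.randint(0, NUM_CLASSES, size=256)               # stand-in labels

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(SEQ_LEN, EMB_DIM)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# The article trains with tf.train.AdamOptimizer at a 0.001 learning rate for
# 10,000 iterations; Adam with the same rate is the Keras equivalent.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)  # placeholder schedule
```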

Results:

Below are the results based on all eight models described above:

What This Means:

No models could be trained using the FastText word embeddings because they were far too large, and my machine did not have sufficient RAM :) . The accuracies matched what was expected: the FC models did not do well at all, while the RNN and BRNN reached reasonable accuracy. Because the RNN and BRNN were able to capture the long-term dependencies of English, they could extract the sentiment from the input text. The CNN took a very long time to train because it is not designed to be used on words; it shrinks and discards data between layers instead of keeping it, which ultimately made the difference between the CNNs and FCs, which did very poorly, and the RNNs and BRNNs, which did reasonably well on the test data. Based on accuracy alone, the 2-layer BRNN was the best, but after weighing in the runtimes, the 2-layer RNN was the most practical. The runtimes differ so much because the FC and CNN models did not have as much complexity as the RNN and BRNN models. Between the RNN and BRNN, the BRNN requires almost double the computation, because it makes both a forward and a backward pass over the sequence and therefore has twice the parameters. The 2-layer RNN is the most practical choice because it performs only 0.175% worse on the test data than the 2-layer BRNN while taking less than half the runtime, which means the user has to wait much less.

Below is the architecture of the chosen model:

Improvements:

The data, taken from Twitter, could have been preprocessed to remove repetitions, clean up stray characters, and expand acronyms. The models could also have been built on a more powerful machine (GPUs?) to allow loading the FastText word vectors and to speed up training. With a faster machine, more sophisticated models could have been built and tested; I was limited by my hardware. However, the results, no matter how improvable, are significant. They show that RNNs and BRNNs are the best of these architectures at learning the dependencies of the English language. CNNs and FCs cannot capture the workings of English, because they do not remember information across the input sequence. The question now is what other techniques can be added to the simple BRNN/RNN to improve accuracy further.

Final Remarks:

Highly interested in my findings, I researched deeper into the field of ML. I found and completed Andrew Ng’s DeepLearning.ai specialization and have just started Google’s specialization on deep learning and the Google Cloud Platform, both on coursera.com. When I presented my findings and research at my school’s science fair, I won first place. This project has opened up the world of ML for me, and I will now stay up to date in the subject and try more projects. I will be honest, though: I am not in any way an expert in this subject, so please take all the information in this blog with a grain of salt. I am only a freshman, after all, and there are many people far more informed on this subject. However, I plan to continue with Deep Learning. My next interest is building an Autoencoder-based fault detection system on a more powerful computer with a GPU.

I will try my best to post one entry every month.

Thanks for Reading!
