Stock Sentiment Analysis using News Headlines

Sai Gowtham Babu
Published in Fast-Feed.ai
4 min read · Jun 8, 2020

GitHub URL for full code:- click here

Media is one of the largest networks in the world, and it helps broadcast information to us from around the globe. In this article, we discuss how media companies can profit by choosing the right headlines, since many people decide which news to read based on the headlines alone.

Contents:-

Part 1:- About the data set.

Part 2:- Processing of data.

Part 3:- Feature Engineering Process.

Part 4:- Model Initialization.

Part 5:- Accuracy.

Part 6:- Conclusions.

The different parts are explained in the rest of this article.

Part 1:- About the Data set.

To predict the right headlines for a newspaper company, we first need a data set on which the model can be trained. Some features of the data set are taken from Kaggle and others from Yahoo Finance. The combined data set contains newspaper headlines from the years 2000–2016.

Part 2:- Processing of Data.

The data is loaded using pd.read_csv with the “ISO-8859-1” encoding, which is typically needed when a CSV file contains characters that cannot be decoded as UTF-8.
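A minimal sketch of this loading step follows. The in-memory CSV, its column names (Date, Label, Top1), and the sample headlines are assumptions made up for illustration; with the real data set you would pass the file path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# The "é" is stored as a single Latin-1 byte (0xE9), which is not valid UTF-8.
raw_bytes = (
    "Date,Label,Top1\n"
    "2000-01-03,0,Café chain reports record sales\n"
    "2000-01-04,1,Markets rally after rate decision\n"
).encode("ISO-8859-1")

# Passing the encoding tells pandas how to turn the raw bytes into text.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")

print(df.shape)
print(df.loc[0, "Top1"])
```

Reading the same bytes with the default UTF-8 encoding would raise a UnicodeDecodeError, which is the usual reason for naming the encoding explicitly.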

Next, the data is split into a training set and a test set, with the news headlines as the inputs and the labels as the targets. The labels are 0 and 1: 0 when the stock price of the particular newspaper decreases, and 1 when the stock price increases, based on the headlines.
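One common way to make such a split is by date, so that the model is tested on headlines newer than the ones it was trained on. The sketch below assumes this chronological split; the cutoff date, the column names, and the toy rows are all hypothetical and are not taken from the original code.

```python
import pandas as pd

# Toy frame with hypothetical column names: a date, a 0/1 label, one headline.
df = pd.DataFrame(
    {
        "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
        "Label": [0, 1, 1, 0],
        "Top1": ["headline a", "headline b", "headline c", "headline d"],
    }
)
df["Date"] = pd.to_datetime(df["Date"])

# Assumed cutoff: everything before 2015 is used for training.
cutoff = pd.Timestamp("2015-01-01")
train = df[df["Date"] < cutoff]
test = df[df["Date"] >= cutoff]

# Headlines are the inputs; the 0/1 label is the target.
X_train, y_train = train.drop(columns=["Date", "Label"]), train["Label"]
X_test, y_test = test.drop(columns=["Date", "Label"]), test["Label"]

print(len(train), len(test))
```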

Part 3:- Feature Engineering

This process contains all the necessary cleaning of the data. Any additions or modifications the data needs are made here.

Step 1:- The data contains punctuation marks. These might cause problems when fitting the model to the data, so we replace each punctuation mark with a space. This can be done using the replace method in Python (for example, DataFrame.replace in pandas).

Step 2:- The column names are “Top 1, Top 2, …”. These can be misleading, so it is better to rename the columns to plain numbers.

Step 3:- We can see that there are some capital letters present in the News Headlines. So, we use the “str.lower()” function to convert the capital letters to small letters.

Step 4:- Now, we combine all the headlines in a row into a single paragraph, since this can then be converted into a feature vector.
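The four steps above can be sketched as follows. This is an illustration on invented headlines, not the original code: the toy data, the variable names, and the exact regular expression used to match punctuation are assumptions.

```python
import pandas as pd

# Toy headlines; the "Top1"/"Top2" column names mimic the description above.
train = pd.DataFrame(
    {
        "Top1": ["Stocks SOAR, again!", "Oil prices fall: why?"],
        "Top2": ["Tech giant's profit up 10%", "Bank cuts 500 jobs."],
    }
)

# Step 1: replace every punctuation character with a space.
# [^\w\s] matches anything that is not a letter, digit, underscore or space.
data = train.replace(r"[^\w\s]", " ", regex=True)

# Step 2: rename the columns to plain numbers ("0", "1", ...).
data.columns = [str(i) for i in range(data.shape[1])]

# Step 3: convert every headline to lower case.
for col in data.columns:
    data[col] = data[col].str.lower()

# Step 4: join the headlines of each row into one paragraph.
paragraphs = data.apply(lambda row: " ".join(row), axis=1).tolist()

print(paragraphs[0])
```

Replacing punctuation with a space rather than deleting it keeps words such as “giant's” from being glued together, at the cost of some extra whitespace that the vectorizer later ignores.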

Part 4:- Model Initialization.

Now, after a paragraph of features is formed, we use “CountVectorizer” to convert the text paragraph into a feature vector for the model. Here we use the bag-of-words model from “Natural Language Processing” to guess whether the words used in the headlines increase the stock price or not. Next, we fit a “Random Forest Classifier” to the data, since it is an ensemble learning algorithm and ensembles often improve accuracy over a single decision tree. We used 200 estimators with the entropy criterion for training the model.
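A minimal sketch of this step with scikit-learn is shown below. The paragraphs and labels are invented for illustration, the vectorizer is left at its default settings (the settings used in the original code are not stated here), and random_state is an added assumption to make the run repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy paragraphs and labels, invented for illustration only.
train_paragraphs = [
    "profits rise as markets rally",
    "record growth lifts shares",
    "shares slump after weak earnings",
    "markets fall on recession fears",
]
train_labels = [1, 1, 0, 0]

# Bag of words: each column counts how often one vocabulary word appears.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_paragraphs)

# 200 trees with the entropy criterion, as described above.
# random_state is an added assumption so the run is repeatable.
model = RandomForestClassifier(
    n_estimators=200, criterion="entropy", random_state=0
)
model.fit(X_train, train_labels)

# New text must go through the same fitted vectorizer (transform, not fit).
X_new = vectorizer.transform(["shares rally on record profits"])
print(model.predict(X_new))
```

Words in new headlines that were not seen during fitting are simply ignored by the vectorizer, so the feature vector always has the same number of columns as the training matrix.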

Part 5:- Accuracy.

Now, after the training is done, we can measure the accuracy on the test data using “classification_report”, “confusion_matrix” and “accuracy_score” from the sklearn.metrics module.
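The three functions can be used as in the sketch below. The true labels and predictions are hypothetical values chosen to keep the example small; they are not results from the model in this article.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and model predictions for six test days.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)

print(matrix)
print(accuracy)
# Precision, recall and F1 score for each class.
print(classification_report(y_true, y_pred))
```

Here four of the six predictions are correct, so the accuracy is 4/6, and the confusion matrix shows one error for each class.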

Part 6:- Conclusions.

From the above accuracy, we can conclude that the model fits the data with good predictions and can be implemented in the real world.

Follow me for more such articles and implementations on different real-world case studies in data science! You can also connect with me through LinkedIn and GitHub.

I hope you enjoyed reading my article. Stay tuned for more such content. Until then, happy learning!

Sai Gowtham Babu
Fast-Feed.ai

Machine Learning, Deep Learning and Data Science enthusiast, looking forward to doing more research on these topics.