Simple Stock Sentiment Analysis with news data in Keras

Have you ever wondered what impact everyday news might have on the stock market? In this tutorial, we are going to build a model that reads the 25 top-voted world news headlines from Reddit users and predicts whether the Dow Jones will go up or down on a given day.

After reading this post, you will know:

  • How to pre-process text data for a deep learning sequence model.
  • How to use pre-trained GloVe embedding vectors to initialize a Keras Embedding layer.
  • How to build a GRU model that processes word sequences and takes word order into account.

Now let’s get started. Read to the end, since there will be a secret bonus.

Text data pre-processing

For the input text, we are going to concatenate all 25 news headlines into one long string for each day.

After that, we are going to convert all sentences to lower case and remove characters, such as numbers and punctuation, that cannot be represented by the GloVe embeddings later.
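As a minimal sketch of these two steps, the snippet below joins one day's headlines and strips everything except lower-case letters and spaces. The function names `clean_text` and `combine_headlines` and the sample headlines are made up for illustration; the original post's cleaning code may differ in its details.

```python
import re

def clean_text(text):
    """Lower-case the text and keep only the letters a-z and single spaces."""
    text = text.lower()
    # Replace digits, punctuation and any other non-letter character with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

def combine_headlines(headlines):
    """Join one day's headlines into a single cleaned string."""
    return clean_text(" ".join(headlines))

# Made-up headlines, only to show the effect of the cleaning.
day = ["U.S. stocks rally 2% after Fed decision!", "Oil prices fall, again."]
print(combine_headlines(day))
# u s stocks rally after fed decision oil prices fall again
```

Note that abbreviations such as "U.S." are split into single letters by this simple rule, which is acceptable for a first experiment.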

The next step is to convert all your training sentences into lists of indices, then zero-pad all those lists so that their length is the same.

It is helpful to visualize the length distribution across all input samples before deciding the maximum sequence length.

Keep in mind that the longer the maximum length we pick, the longer it will take to train the model. So instead of choosing the longest sequence length in our dataset, which is around 700, we are going to pick 500 as a tradeoff that covers the majority of the text across all samples while keeping training time relatively short.
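One way to make this tradeoff concrete is to measure how many samples a candidate maximum length fully covers. The helpers below are a small sketch; the names `coverage` and `shortest_length_covering` and the list of lengths are hypothetical, and in practice the lengths would be the word count of each day's combined text.

```python
import math

def coverage(lengths, max_len):
    """Fraction of samples whose word count is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

def shortest_length_covering(lengths, target):
    """Smallest length that fully covers at least `target` (0-1) of the samples."""
    ordered = sorted(lengths)
    k = max(1, math.ceil(target * len(ordered)))  # number of samples to cover
    return ordered[k - 1]

# Hypothetical word counts, one per day; not the real dataset.
lengths = [120, 300, 410, 450, 480, 495, 520, 610, 650, 700]
print(coverage(lengths, 500))
print(shortest_length_covering(lengths, 0.5))
```

Samples longer than the chosen length are not dropped. They are simply truncated when the sequences are padded.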

The embedding layer

In Keras, the embedding matrix is represented as a “layer” that maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained from scratch or initialized with a pre-trained embedding. In this part, you will learn how to create an Embedding layer in Keras and initialize it with GloVe 50-dimensional vectors. Because our training set is quite small, we will not update the word embeddings but will instead leave their values fixed. I will show you how Keras allows you to set whether the embedding is trainable or not.

The Embedding() layer takes an integer matrix of size (batch size, max input length) as input. This corresponds to sentences converted into lists of indices (integers), as shown in the figure below.

The following function handles the first step of converting sentence strings to an array of indices. The word-to-index mapping is taken from the GloVe embedding file, so we can seamlessly convert indices to word vectors later.
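A minimal version of such a function could look like the sketch below. The choices to reserve index 0 for padding, skip words that are missing from the mapping, pad at the end, and truncate long sentences are assumptions made for this example, and the toy `word_to_index` dictionary stands in for the one built from the GloVe file.

```python
def sentences_to_indices(sentences, word_to_index, max_len):
    """Convert sentences to fixed-length lists of word indices.

    Index 0 is reserved for padding. Words missing from the mapping are
    skipped, and sentences longer than max_len are truncated.
    """
    rows = []
    for sentence in sentences:
        indices = [word_to_index[w] for w in sentence.lower().split() if w in word_to_index]
        indices = indices[:max_len]
        # Zero-pad on the right so every row has exactly max_len entries.
        rows.append(indices + [0] * (max_len - len(indices)))
    return rows

# Toy mapping for illustration; the real one comes from the GloVe vocabulary.
word_to_index = {"stocks": 1, "rally": 2, "oil": 3, "falls": 4}
X = sentences_to_indices(["stocks rally", "oil falls today"], word_to_index, max_len=4)
print(X)  # [[1, 2, 0, 0], [3, 4, 0, 0]]
```

Wrapping the result in `numpy.array(X)` gives the (batch size, max input length) integer matrix that the Embedding layer expects.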

After that, we can implement the pre-trained embedding layer in four steps.

  • Initialize the embedding matrix as a numpy array of zeros with the correct shape: (vocab_len, dimension of word vectors).
  • Fill the embedding matrix with all the word embeddings.
  • Define the Keras embedding layer and make it non-trainable by setting trainable to False.
  • Set the weights of the embedding layer to the embedding matrix.
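The four steps above can be sketched as follows. The helper names, and the assumption that `word_to_index` starts at 1 with row 0 left as zeros for padding, are illustrative rather than taken from the original code. The Keras import sits inside the second function so that the matrix helper can be tried without TensorFlow installed.

```python
import numpy as np

def build_embedding_matrix(word_to_index, word_to_vec):
    """Stack word vectors into a (vocab_len, emb_dim) matrix; row 0 stays zero."""
    vocab_len = max(word_to_index.values()) + 1  # +1 for the padding index 0
    emb_dim = len(next(iter(word_to_vec.values())))
    matrix = np.zeros((vocab_len, emb_dim), dtype="float32")  # step 1
    for word, index in word_to_index.items():                 # step 2
        matrix[index] = word_to_vec[word]
    return matrix

def pretrained_embedding_layer(matrix):
    """Create a frozen Keras Embedding layer that holds the given matrix."""
    from tensorflow.keras.layers import Embedding
    vocab_len, emb_dim = matrix.shape
    layer = Embedding(vocab_len, emb_dim, trainable=False)    # step 3
    layer.build((None,))  # create the weight variable before assigning to it
    layer.set_weights([matrix])                               # step 4
    return layer

# Toy 3-dimensional vectors; the post uses 50-dimensional GloVe vectors.
word_to_index = {"cat": 1, "dog": 2}
word_to_vec = {"cat": [0.1, 0.2, 0.3], "dog": [0.2, 0.1, 0.3]}
matrix = build_embedding_matrix(word_to_index, word_to_vec)
print(matrix.shape)  # (3, 3)
```

Setting `trainable=True` instead would let the optimizer fine-tune the vectors, at the cost of many more parameters to fit on a small dataset.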

Let’s have a quick check of the embedding layer by asking for the vector representation of the word “cat”.

The result is a 50-dimensional array. You can further explore the word vectors by measuring similarity with cosine similarity, or by solving word analogy problems such as Man is to Woman as King is to __.
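For readers who want to try this, here is a small sketch of both ideas. The analogy is solved by picking the word whose offset from the third word points in the same direction as the offset between the first two, which is one common approach and not necessarily the only one. The 2-dimensional vectors are invented so that the example is easy to follow; real GloVe vectors would be loaded from the embedding file.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def complete_analogy(a, b, c, word_to_vec):
    """Return the word d that best satisfies 'a is to b as c is to d'."""
    target = [y - x for x, y in zip(word_to_vec[a], word_to_vec[b])]
    best_word, best_score = None, -2.0
    for word, vec in word_to_vec.items():
        if word in (a, b, c):
            continue  # the answer must be a new word
        offset = [y - x for x, y in zip(word_to_vec[c], vec)]
        score = cosine_similarity(target, offset)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Invented toy vectors, not real GloVe values.
toy = {"man": [1, 0], "woman": [1, 1], "king": [3, 0], "queen": [3, 1], "apple": [0, 3]}
print(complete_analogy("man", "woman", "king", toy))  # queen
```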

Build and evaluate the model

The task for the model is to take the news string sequence and make a binary classification of whether the Dow Jones close value will rise or fall compared to the previous close value. It outputs “1” if the value rose or stayed the same, and “0” if the value decreased.
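If you only have a list of daily closing values, the labels can be derived with a few lines like the sketch below. The function name `make_labels` and the closing values are made up for this example.

```python
def make_labels(closes):
    """Label each day 1 if its close is >= the previous close, else 0.

    The first day has no previous close, so the result has
    len(closes) - 1 entries.
    """
    return [1 if today >= yesterday else 0
            for yesterday, today in zip(closes, closes[1:])]

closes = [100.0, 101.5, 101.5, 99.0]  # made-up closing values
print(make_labels(closes))  # [1, 1, 0]
```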

We are building a simple model that contains two stacked GRU layers after the pre-trained embedding layer. A Dense layer generates the final output with softmax activation. GRU is a type of recurrent network that processes sequences while taking their order into account. It is similar to LSTM in functionality and performance but is less computationally expensive to train.
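A sketch of such a model is shown below. The number of GRU units, the two-unit softmax output paired with a sparse categorical cross-entropy loss, and the Adam optimizer are assumptions chosen for this example, and `build_model` is a hypothetical name. The function expects the embedding matrix described earlier as a numpy array of shape (vocab_len, embedding dimension).

```python
def build_model(embedding_matrix, max_len=500, gru_units=128):
    """Frozen pre-trained embedding -> GRU -> GRU -> Dense(softmax)."""
    from tensorflow.keras import layers, models

    vocab_len, emb_dim = embedding_matrix.shape
    inputs = layers.Input(shape=(max_len,), dtype="int32")

    embedding = layers.Embedding(vocab_len, emb_dim, trainable=False)
    x = embedding(inputs)                       # calling the layer builds it
    embedding.set_weights([embedding_matrix])   # load the pre-trained vectors

    # The first GRU returns the full sequence so the second GRU can read it.
    x = layers.GRU(gru_units, return_sequences=True)(x)
    x = layers.GRU(gru_units)(x)

    # Two output units: class 0 (down) and class 1 (up or unchanged).
    outputs = layers.Dense(2, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```

With this loss the labels can stay as plain integers (0 or 1). If you prefer one-hot encoded labels, switch the loss to `categorical_crossentropy`.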

Next, we can train and evaluate the model.

It is also helpful to generate the ROC curve for our binary classifier to assess its performance visually.
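The area under the ROC curve (AUC) summarizes that plot in one number, where 0.5 corresponds to random guessing. As a library-free sketch, AUC can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `auc_score` and the scores below are made up; for the actual plot you would typically use a library routine such as scikit-learn's `roc_curve`.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y_true, scores))  # 0.75
```

This pairwise version is quadratic in the number of samples, which is fine for a test set of a few hundred days.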

Our model is about 2.8% better than a random guess of the market trend.

For more information about ROC and AUC, you can read my other post, “Simple guide on how to generate ROC plot for Keras classifier”.

Conclusion and further thoughts

In this post, we introduced a quick and simple way to build a Keras model with an Embedding layer initialized with pre-trained GloVe embeddings. Here are some things you can try after reading this post:

  • Make the Embedding layer weights trainable, retrain the model from scratch, then compare the results.
  • Increase the maximum sequence length and see how that impacts model performance and training time.
  • Incorporate other inputs to form a multi-input Keras model, since other factors might correlate with stock index fluctuation. For example, you could consider MACD (Moving Average Convergence/Divergence oscillator) or a Momentum indicator. To build a multi-input model, you can use the Keras functional API.
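To give an idea of the last suggestion, here is a rough sketch of a two-input model built with the functional API. Everything in it is an assumption for illustration: the name `build_multi_input_model`, the single GRU layer, and the idea of feeding two numeric indicators per day (for example MACD and Momentum) as a plain vector that is concatenated with the text features.

```python
def build_multi_input_model(embedding_matrix, max_len=500,
                            n_indicators=2, gru_units=64):
    """Combine a news text branch with a small numeric-indicator branch."""
    from tensorflow.keras import layers, models

    vocab_len, emb_dim = embedding_matrix.shape

    # Branch 1: padded word indices for the day's news.
    news_in = layers.Input(shape=(max_len,), dtype="int32", name="news")
    embedding = layers.Embedding(vocab_len, emb_dim, trainable=False)
    x = embedding(news_in)
    embedding.set_weights([embedding_matrix])
    x = layers.GRU(gru_units)(x)

    # Branch 2: numeric indicators for the same day.
    indicators_in = layers.Input(shape=(n_indicators,), name="indicators")

    # Merge both branches and classify.
    merged = layers.Concatenate()([x, indicators_in])
    outputs = layers.Dense(2, activation="softmax")(merged)

    model = models.Model(inputs=[news_in, indicators_in], outputs=outputs)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```

When training, pass the two inputs as a list in the same order, for example `model.fit([X_news, X_indicators], y)`. It usually helps to scale the indicator values to a similar range first.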

Any ideas to improve the model? Comment and share your thoughts.

You can find the full source code and training data in my GitHub repo.

Bonus for investors

If you are new to the whole investment world, like I was years ago, you may wonder where to start, preferably investing for free with zero commissions. By learning how to trade stocks for free, you’ll not only save money, but your investments will potentially compound at a faster rate. Robinhood, one of the best investing apps, does just that: whether you are buying one share or 100, there are no commissions. It was built from the ground up to be as efficient as possible by cutting out the fat and passing the savings on to customers. Join Robinhood, and we’ll both get a stock like Apple, Ford, or Sprint for free. Make sure you use my shared link.


Originally published at www.dlology.com.