Predicting Stock Market Movements with the News Headlines and Deep Learning

Dave Currie
9 min read · May 1, 2017


We are going to use daily world news headlines from Reddit to predict the opening value of the Dow Jones Industrial Average. The data for this project comes from a dataset on Kaggle and covers nearly eight years (2008-08-08 to 2016-07-01).

For this project, we are going to use GloVe’s larger common crawl vectors to create our word embeddings and Keras to build our model. This model was inspired by the work described in this paper. Similar to the paper, we will use CNNs followed by RNNs, but our architecture will be a little different and we will use LSTMs instead of GRUs.

To help construct a better model, we will use a grid search to alter our hyperparameters’ values and the architecture of our model. To finish things off, I will show you how to make your own prediction of the Dow’s opening price in just a few steps.

Note: Like my other articles, I’m going to skip over a few parts of the project, but I’ll supply a link to the important information where needed. Plus, you can see the full version of this project on its GitHub page.

The data for this project is in two different files. Due to this, we need to ensure that we have the same dates in each of our dataframes. The function isin() will help us here.

news = news[news.Date.isin(dj.Date)]
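
Depending on how the two files overlap, you may also want to filter dj the other way. Here is a minimal sketch; the symmetric filter is my assumption rather than something shown in this article.

# Assumed symmetric filter so both dataframes cover exactly the same dates.
dj = dj[dj.Date.isin(news.Date)]

# Quick sanity check that the two dataframes now line up.
print(len(dj), len(set(news.Date)))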

To create our target values, we are going to take the difference in opening prices between the current and following day. Using this value, we will be able to see how well the news will be able to predict the change in opening price.

dj = dj.set_index('Date').diff(periods=1)
dj['Date'] = dj.index
dj = dj.reset_index(drop=True)

Now that we have our target values, we need to create a list for the headlines in our news and their corresponding price change.

price = []
headlines = []
for row in dj.iterrows():
    daily_headlines = []
    date = row[1]['Date']
    price.append(row[1]['Open'])
    for row_ in news[news.Date==date].iterrows():
        daily_headlines.append(row_[1]['News'])
    # Keep each day's headlines together, aligned with its price change
    headlines.append(daily_headlines)

Each day, for the most part, includes 25 headlines. This is what makes up our ‘news’ data. We need to clean this data to get the most signal out of it. To do this, we will convert it to lowercase, replace contractions with their longer forms, remove unwanted characters, reformat words to better match GloVe’s word vectors, and remove stop words. The list containing the contractions can be found in this project’s Jupyter Notebook.

def clean_text(text, remove_stopwords = True):

    text = text.lower()

    # Replace contractions with their longer forms
    text = text.split()
    new_text = []
    for word in text:
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)

    # Format words and remove unwanted characters
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'0,0', '00', text)
    text = re.sub(r'[_"\-;%()|.,+&=*%.,!?:#@\[\]]', ' ', text)
    text = re.sub(r'\'', ' ', text)
    text = re.sub(r'\$', ' $ ', text)
    text = re.sub(r'u s ', ' united states ', text)
    text = re.sub(r'u n ', ' united nations ', text)
    text = re.sub(r'u k ', ' united kingdom ', text)
    text = re.sub(r'j k ', ' jk ', text)
    text = re.sub(r' s ', ' ', text)
    text = re.sub(r' yr ', ' year ', text)
    text = re.sub(r' l g b t ', ' lgbt ', text)
    text = re.sub(r'0km ', '0 km ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)

    return text

I’m going to skip a few steps that would prepare our headlines for the model; a sketch of the idea is below. My method is pretty similar to the one found in my article “Tweet Like Trump with a One2Seq Model.” You can read about it there, or go to my GitHub page for this project.
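
To give a rough idea of that preparation, here is a minimal sketch of building a vocabulary and turning each day’s cleaned headlines into lists of integers. The names vocab_to_int and int_headlines match the ones used later in this article; clean_headlines, the <UNK> token, and the exact construction are my assumptions rather than the notebook’s actual code.

# Count every word in the cleaned headlines (clean_headlines is hypothetical:
# a list of days, each a list of cleaned headline strings).
word_counts = {}
for daily_headlines in clean_headlines:
    for headline in daily_headlines:
        for word in headline.split():
            word_counts[word] = word_counts.get(word, 0) + 1

# Map each word to an integer id; reserve special tokens (assumed setup).
vocab_to_int = {word: i for i, word in enumerate(word_counts, start=2)}
vocab_to_int["<PAD>"] = 0
vocab_to_int["<UNK>"] = 1

# Convert every headline to a list of integers.
int_headlines = []
for daily_headlines in clean_headlines:
    int_daily = []
    for headline in daily_headlines:
        int_daily.append([vocab_to_int.get(word, vocab_to_int["<UNK>"])
                          for word in headline.split()])
    int_headlines.append(int_daily)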

To create the weights that will be used for the model’s embeddings, we will create a matrix consisting of the embeddings for the words in our vocabulary. If a word is found in GloVe’s vocabulary, we will use its pre-trained vector. If a word is not found in GloVe’s vocabulary, we will create a random embedding for it. The embeddings will be updated as the model trains, so these new ‘random’ embeddings will be more accurate by the end of training.

# Need to use 300 for embedding dimensions to match GloVe's vectors.
embedding_dim = 300
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim))
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

The final step in preparing our headline data is to make each day’s news the same length. We are going to cap each headline at 16 words (the 75th percentile of headline lengths) and cap each day’s news at 200 words. These values were picked to strike a good balance between the number of words per headline and the number of headlines used. I expect that using more words for each day’s news (i.e. increasing the 200-word limit) would be beneficial, but I didn’t want my training time to become too long since I am just using my MacBook Pro.

max_headline_length = 16
max_daily_length = 200
pad_headlines = []

for date in int_headlines:
    pad_daily_headlines = []
    for headline in date:
        if len(headline) <= max_headline_length:
            for word in headline:
                pad_daily_headlines.append(word)
        else:
            headline = headline[:max_headline_length]
            for word in headline:
                pad_daily_headlines.append(word)

    # Pad daily_headlines if they are less than max length
    if len(pad_daily_headlines) < max_daily_length:
        for i in range(max_daily_length - len(pad_daily_headlines)):
            pad = vocab_to_int["<PAD>"]
            pad_daily_headlines.append(pad)
    else:
        pad_daily_headlines = pad_daily_headlines[:max_daily_length]
    pad_headlines.append(pad_daily_headlines)

When I first tried to train my model, it struggled to make any improvements. The solution that I found was to normalize my target data between the values of 0 and 1.

max_price = max(price)
min_price = min(price)
mean_price = np.mean(price)

def normalize(price):
    return (price - min_price) / (max_price - min_price)
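
The training code later in this article refers to x_train, y_train, x_test, and y_test. Here is a minimal sketch of how those could be produced from the padded headlines and normalized prices; the chronological 85/15 split is my assumption, not necessarily how the notebook splits the data.

# Normalize the targets and line them up with the padded daily news.
norm_price = np.array([normalize(p) for p in price])
daily_news = np.array(pad_headlines)

# Hypothetical 85/15 chronological split into training and testing sets.
split = int(len(daily_news) * 0.85)
x_train, x_test = daily_news[:split], daily_news[split:]
y_train, y_test = norm_price[:split], norm_price[split:]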

As I mentioned in the introduction of this article, we will be using a grid search to train our model. Below, you will see the variables ‘wider’ and ‘deeper’. These are two of the ways that I alter the model: ‘wider’ doubles the values of some of the hyperparameters, and ‘deeper’ adds an extra convolutional layer to each branch as well as an extra fully connected layer to the final part of the model.

I was surprised that this model goes against the conventional wisdom that more layers are better. Using just one convolutional layer and a smaller network provided the best results.

There are two ‘input’ branches for this model because we want to create CNNs with different filter lengths. The research paper showed that this can improve the results of a model, and this project agrees with those results.

filter_length1 = 3
filter_length2 = 5
dropout = 0.5
learning_rate = 0.001
weights = initializers.TruncatedNormal(mean=0.0, stddev=0.1, seed=2)
nb_filter = 16
rnn_output_size = 128
hidden_dims = 128
wider = True
deeper = True

if wider == True:
    nb_filter *= 2
    rnn_output_size *= 2
    hidden_dims *= 2

def build_model():

    model1 = Sequential()

    model1.add(Embedding(nb_words,
                         embedding_dim,
                         weights=[word_embedding_matrix],
                         input_length=max_daily_length))
    model1.add(Dropout(dropout))

    model1.add(Convolution1D(filters=nb_filter,
                             kernel_size=filter_length1,
                             padding='same',
                             activation='relu'))
    model1.add(Dropout(dropout))

    if deeper == True:
        model1.add(Convolution1D(filters=nb_filter,
                                 kernel_size=filter_length1,
                                 padding='same',
                                 activation='relu'))
        model1.add(Dropout(dropout))

    model1.add(LSTM(rnn_output_size,
                    activation=None,
                    kernel_initializer=weights,
                    dropout=dropout))

    ####
    model2 = Sequential()

    model2.add(Embedding(nb_words,
                         embedding_dim,
                         weights=[word_embedding_matrix],
                         input_length=max_daily_length))
    model2.add(Dropout(dropout))

    model2.add(Convolution1D(filters=nb_filter,
                             kernel_size=filter_length2,
                             padding='same',
                             activation='relu'))
    model2.add(Dropout(dropout))

    if deeper == True:
        model2.add(Convolution1D(filters=nb_filter,
                                 kernel_size=filter_length2,
                                 padding='same',
                                 activation='relu'))
        model2.add(Dropout(dropout))

    model2.add(LSTM(rnn_output_size,
                    activation=None,
                    kernel_initializer=weights,
                    dropout=dropout))

    ####
    model = Sequential()
    model.add(Merge([model1, model2], mode='concat'))

    model.add(Dense(hidden_dims, kernel_initializer=weights))
    model.add(Dropout(dropout))

    if deeper == True:
        model.add(Dense(hidden_dims//2, kernel_initializer=weights))
        model.add(Dropout(dropout))

    model.add(Dense(1,
                    kernel_initializer=weights,
                    name='output'))

    model.compile(loss='mean_squared_error',
                  optimizer=Adam(lr=learning_rate, clipvalue=1.0))
    return model

The method that I used to create the grid search is the same as the one in my article “Predicting Movie Review Sentiment with TensorFlow and TensorBoard”. However, we are using Keras here, so the rest of the code is quite different.

for deeper in [False]:
    for wider in [True, False]:
        for learning_rate in [0.001]:
            for dropout in [0.3, 0.5]:
                model = build_model()
                print("Current model: Deeper={}, Wider={}, LR={}, Dropout={}".format(
                    deeper, wider, learning_rate, dropout))
                save_best_weights = 'question_pairs_weights_deeper={}_wider={}_lr={}_dropout={}.h5'.format(
                    deeper, wider, learning_rate, dropout)
                callbacks = [ModelCheckpoint(save_best_weights,
                                             monitor='val_loss',
                                             save_best_only=True),
                             EarlyStopping(monitor='val_loss',
                                           patience=5,
                                           verbose=1,
                                           mode='auto'),
                             ReduceLROnPlateau(monitor='val_loss',
                                               factor=0.2,
                                               patience=3,
                                               verbose=1)]
                history = model.fit([x_train, x_train],
                                    y_train,
                                    batch_size=128,
                                    epochs=100,
                                    validation_split=0.15,
                                    verbose=True,
                                    shuffle=True,
                                    callbacks=callbacks)

Using this ‘for loop’ method, you should be able to tune just about any (if not all) of the model’s features. One important thing to remember is to save each iteration of the model under a different filename, otherwise they will overwrite each other.

Early stopping is really useful to avoid unnecessary training. Since each iteration will likely take a different number of epochs to fully train, this will give you the flexibility to properly train each iteration. Just make sure that you set the default number of epochs high enough, otherwise a training session could be stopped too soon.

ReduceLROnPlateau will reduce your learning rate when the validation loss (or whatever metric you’re monitoring) stops decreasing. This is really helpful because we want to start with a higher learning rate so the model trains quickly, but we want it to be smaller near the end of training to make the small adjustments that are necessary to find the optimal weights.

To make predictions with your testing data, you might need to rebuild the model. This needs to be done if the optimal parameters/architecture differ from those used during the final training iteration. You will also need to load your best weights.

deeper = False
wider = False
dropout = 0.3
learning_rate = 0.001

model = build_model()
model.load_weights('./question_pairs_weights_deeper={}_wider={}_lr={}_dropout={}.h5'.format(
    deeper, wider, learning_rate, dropout))
predictions = model.predict([x_test, x_test], verbose=True)

To evaluate the model, I used the median absolute error. I like this metric because it is easy to understand and it factors out any extreme errors that could give misleading results. Before using this metric, we will need to ‘unnormalize’ our data, i.e. revert it to its original range.

def unnormalize(price):
    price = price * (max_price - min_price) + min_price
    return price

mae(unnorm_y_test, unnorm_predictions)
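
The call above assumes unnorm_y_test, unnorm_predictions, and an mae function already exist. A minimal sketch of producing them, using scikit-learn’s median_absolute_error as the mae function (my assumption about what mae refers to), might look like this.

from sklearn.metrics import median_absolute_error as mae

# Revert predictions and targets back to dollar changes in the opening price.
unnorm_predictions = unnormalize(np.array(predictions)).flatten()
unnorm_y_test = unnormalize(np.array(y_test)).flatten()

print(mae(unnorm_y_test, unnorm_predictions))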

The median absolute error for this model is 74.15. Here is a comparison of the predicted values and actual values.
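
The chart itself doesn’t carry over into this post, but a quick matplotlib sketch of that comparison (assuming the unnormalized arrays from the previous step) would be:

import matplotlib.pyplot as plt

# Plot actual vs predicted daily changes in the opening price over the test set.
plt.figure(figsize=(12, 4))
plt.plot(unnorm_y_test, label='Actual change in opening price')
plt.plot(unnorm_predictions, label='Predicted change in opening price')
plt.xlabel('Test day')
plt.ylabel('Change from previous open')
plt.legend()
plt.show()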

I go into some more detail about why the results are as mediocre as they are, but to give you the short version: predicting the future of the stock market is a complicated and near-impossible task. A great deal of data, and even emotions, are factored into its value, and 25 daily headlines from Reddit cannot capture all of that complexity.

Despite the results, I still think this is an interesting and worthwhile task, which is why I wanted to share it with you, but if you were hoping to make some money from this article, then lol, and sorry.

Making your own predictions is a rather simple process. For this model, I found that it was best to fill all 200 words of the input data with news, rather than using any padding. In my Jupyter Notebook, I have 25 headlines’ worth of news from Reddit that you can use as your default news. Make whatever changes you want, then you can see the impact they will have!

create_news = ""

clean_news = clean_text(create_news)
int_news = news_to_int(clean_news)
pad_news = padding_news(int_news)
pad_news = np.array(pad_news).reshape((1,-1))
pred = model.predict([pad_news, pad_news])
price_change = unnormalize(pred)

print("The Dow should open: {} from the previous open.".format(np.round(price_change[0][0], 2)))
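
The helpers news_to_int and padding_news live in the notebook rather than in this article. Assuming they mirror the preparation steps shown earlier, rough sketches might look like this.

def news_to_int(news):
    # Map each word of the cleaned news string to its integer id
    # (sketch; the <UNK> fallback is my assumption).
    return [vocab_to_int.get(word, vocab_to_int["<UNK>"]) for word in news.split()]

def padding_news(int_news):
    # Trim to max_daily_length, then pad with <PAD> up to that length (sketch).
    padded = int_news[:max_daily_length]
    padded += [vocab_to_int["<PAD>"]] * (max_daily_length - len(padded))
    return padded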

That’s all for this project! I hope that you have found it to be rather interesting and informative. Keras is pretty sweet because you can build your models much more quickly than in TensorFlow, and they are easier to understand (architecturally, at least).

If you want to expand on this project and make it even better, I have a few ideas for you:

  • Use headlines from the 30 companies that make up the Dow Jones Industrial Average.
  • Include the previous day(s)’s headline(s).
  • Include the previous day(s)’s change(s) in value.

Thanks for reading, and if you have any ideas about how to improve this project, or want to share something interesting, then please make a comment about it below!
