Sentiment Classification based on Financial News data for Portfolio/Asset Managers & Credit Risk Officers
The job of Portfolio Managers & Credit Risk Officers is quite daunting. They need to stay on top of current market conditions at all times. These investment banking professionals are expected to be experts in technical & fundamental analysis of stocks/instruments/counterparties/issuers, or at least adept at handling various COTS (commercial off-the-shelf) tools to churn out numbers/metrics and recognize market patterns that guide their investment decisions.
Armed with this data, Portfolio Managers decide which instruments to add to or remove from their investment portfolios, whereas Credit Officers set or adjust trading limits and extend facilities to the counterparties their investment bank wants to trade with. These are among the most demanding roles in the financial job family: even a single incorrect assessment can have far-reaching implications and cost investors or the bank significant funds, a nightmare for any bank or financial institution.
Let’s add another tool to the investment banker’s toolbox: one that helps determine the market sentiment of the stock or counterparty they are currently researching.
In the rest of the article, we will focus on creating a deep neural network model that accepts a body of text (an extract from a financial news article, a Twitter discussion, or even an email with financial content!) and turns it into a market sentiment, without anyone having to read the text word by word.
Coding for this model will involve the following steps:
1. Download the financial news sentiment data; fortunately, someone has already collected this data for us.
2. Choose a word embedding for converting the textual data into embedding vectors; we could create this embedding from scratch using a word2vec model, but we will leave that for another article and instead reuse one of the embeddings freely available on the internet.
3. Pre-process the textual data; we will need to convert text to numbers before feeding the inputs into our model.
4. Design a bespoke neural network composed of input, hidden & output layers; we will use LSTM in the hidden layers because the text data is sequential and LSTM layers are well suited to sequential data. GRU layers are an alternative we could have chosen; you are free to experiment with them on your own.
5. Train the network based on the financial news sentiment data.
6. Predict the sentiment of a financial text of our choice.
In the following sections, I will go through the above steps in more detail and include code snippets inline where required. The code is also available at this GitHub link:
https://github.com/rohitar/myprojects/tree/master/counterparty-sentiment
Downloading the Financial News Sentiment Data
As with any machine learning model, we need data to train our model on. Since we want to gauge the polarity of financial text, we need data that contains both the financial text and its corresponding polarity. Luckily, this data already exists (published by Pekka Malo et al.) and you can download it for free from here:
https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10
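To give you an idea of how this dataset can be read into Python, here is a minimal sketch. It assumes the Sentences_50Agree.txt variant of the FinancialPhraseBank, where each line holds a sentence and its label separated by '@'; adjust the file name and parsing to whichever variant you download. The integer label mapping is our own choice and is reused later when decoding predictions.
# Sketch: load FinancialPhraseBank sentences and labels (file name and
# "sentence@label" line format are assumptions -- adapt to your download).
sentences, labels = [], []
label_map = {"negative": 0, "neutral": 1, "positive": 2}  # our own encoding

with open("Sentences_50Agree.txt", encoding="latin-1") as f:
    for line in f:
        text, _, label = line.strip().rpartition("@")
        if label in label_map:
            sentences.append(text)
            labels.append(label_map[label])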
Choice of Word Embedding
We could have opted to build the entire word embedding from scratch, but we don’t want to reinvent the wheel when someone has already run rigorous training on huge amounts of text to generate the embeddings.
For those unfamiliar with embeddings: an embedding is simply a mapping from every word in the vocabulary to a corresponding vector representation in an n-dimensional vector space (you get to define the number of dimensions when you create the embedding). Typically, word2vec models are used to create word embeddings, using either the Skip-gram approach or the Continuous Bag of Words (CBOW) approach. These word embeddings capture the semantic relationships between words, so similar words end up clustered together in the vector space.
One of the pre-trained word embeddings available for free download is based on Google News: it contains vectors for 3 million words and phrases and was trained on around 100 billion words from Google News archives (you can imagine what a mammoth task training these embeddings is). Another popular option is fastText, trained by Facebook.
We will download the Google News embeddings to disk and use the Gensim library to load them into memory (caution: this will consume around 5 GB of RAM, so make sure the machine you run this model on has enough!). The following lines of code download the embeddings and load them into memory (courtesy Douwe Osinga, Deep Learning Cookbook):
import os
import subprocess

from tensorflow.keras.utils import get_file
from gensim.models import KeyedVectors

MODEL = 'GoogleNews-vectors-negative300.bin'
path = get_file(MODEL + '.gz', 'https://s3.amazonaws.com/dl4j-distribution/%s.gz' % MODEL)

# Decompress the downloaded archive into a local 'generated' directory
if not os.path.isdir('generated'):
    os.mkdir('generated')
unzipped = os.path.join('generated', MODEL)
if not os.path.isfile(unzipped):
    with open(unzipped, 'wb') as fout:
        zcat = subprocess.Popen(['zcat'], stdin=open(path), stdout=fout)
        zcat.wait()

# Load the pre-trained vectors into memory (roughly 5 GB of RAM)
word2vec = KeyedVectors.load_word2vec_format(unzipped, binary=True)
If desired, you can cut down the number of word vectors loaded into RAM by using the ‘limit’ argument of the load_word2vec_format method, but this may affect the accuracy of the final model you build.
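For instance, here is an optional, illustrative snippet (assuming the word2vec object loaded above) showing the ‘limit’ argument along with a couple of quick sanity checks on the loaded vectors:
# Optional: load only the 500,000 most frequent vectors to save memory,
# at the cost of more out-of-vocabulary words later on.
word2vec_small = KeyedVectors.load_word2vec_format(unzipped, binary=True, limit=500000)

# Sanity checks: every word maps to a 300-dimensional vector, and
# semantically related words sit close together in the vector space.
print(word2vec["stock"].shape)                  # (300,)
print(word2vec.most_similar("profit", topn=3))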
Pre-process Textual Data
Note that no deep learning or machine learning model understands textual data; models understand only numbers. Hence, we need to tokenize the words in the input text into numeric form. We will use Keras to tokenize the strings in the input financial text: essentially, we map each word in the input data to a unique numeric token and then use this mapping to transform each input text into a sequence of numbers. The code for this is given below:
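One quick note before the code: the snippets in this and the later sections rely on a few constants that we have not defined yet. The values below are illustrative assumptions on my part, so feel free to tune them.
# Illustrative hyperparameter choices (assumptions, tune as you see fit)
VOCAB_SIZE = 10000      # keep the 10,000 most frequent tokens
EMBEDDING_DIM = 300     # dimensionality of the Google News vectors
MAX_LENGTH = 50         # pad/truncate every sentence to 50 tokens
TRUNC_TYPE = "post"     # truncate from the end of the sentence
OOV_TOK = "<OOV>"       # placeholder token for out-of-vocabulary words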
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Map each word to a numeric token; OOV_TOK stands in for out-of-vocabulary words
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token=OOV_TOK)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Convert sentences to fixed-length token sequences and one-hot encode the labels
sequences = tokenizer.texts_to_sequences(sentences)
X = pad_sequences(sequences, maxlen=MAX_LENGTH, truncating=TRUNC_TYPE)
y = to_categorical(labels)
Next, we need to transform the token assigned to each word into its embedding vector. For that, we will re-index the pre-trained Gensim embedding using the word index we created above while tokenizing the input data. The code for doing that is given below (courtesy Antonio Gulli, Deep Learning with Keras):
import numpy as np

# Build a weight matrix whose row i holds the pre-trained vector for token i
embedding_weights = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
for word, index in word_index.items():
    if index >= VOCAB_SIZE:
        continue  # word_index covers all words, not just the top VOCAB_SIZE
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass  # words missing from the embedding keep an all-zero vector
Design Neural Network
Now it’s finally time to define the neural network. We will use the following code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Bidirectional, LSTM

model = Sequential()
# Embedding layer initialised with the pre-trained Google News vectors
model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH, weights=[embedding_weights]))
# Two stacked bidirectional LSTM layers to capture the sequential structure
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
# Fully connected layer followed by a 3-class softmax output
model.add(Dense(64, activation="relu"))
model.add(Dense(3, activation="softmax"))
Please note that the embedding layer above uses the transformed embedding weights we created earlier. The vectorised input is then passed through a couple of LSTM layers and a fully connected dense layer.
At a high level, each LSTM cell maintains a long-term memory (the cell state) and a short-term memory (the hidden state), with gates controlling what is remembered and forgotten. This lets the network capture associations between words that are far apart in a body of text as well as words that appear close together. LSTM is an involved topic, so a detailed discussion is outside the scope of this article.
Finally, the output of the fully connected dense layer is passed to a ‘softmax’ output layer with 3 classes. These 3 classes designate the text polarity: positive, neutral or negative.
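One design choice worth mentioning: since the embedding layer is initialised with pre-trained vectors, you may prefer to freeze it so that training only updates the LSTM and dense layers. Keras supports this via the trainable flag; a sketch of that variant (replacing the first layer of the model above) is shown here.
# Frozen variant of the embedding layer: the pre-trained Google News vectors
# stay fixed and only the LSTM and Dense layers are trained.
frozen_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM,
                             input_length=MAX_LENGTH,
                             weights=[embedding_weights],
                             trainable=False)
Freezing tends to reduce over-fitting on a dataset of this size, while fine-tuning lets the vectors adapt to financial vocabulary; which works better here is something to verify experimentally.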
Great! Now that we have defined the Neural Network we will jump to the next step.
Train the Network
This step involves training the network on the financial news sentiment data we downloaded in the first step. However, we don’t want to train the model on all of the data; we want to keep some data aside, unseen by the model, so it can later be used to evaluate the trained model. For this, we split the data into train & test sets with a test size of 30% using the code below:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)
Next, we will compile the model and fit it on the training data, using the following code:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, batch_size=64, epochs=20, validation_data=(X_test, y_test))
Please note that we opted for the ‘adam’ optimizer, which is known to work well for sequence-based models such as LSTMs. The loss is categorical crossentropy, which is appropriate for multi-class classification problems like ours. I recommend running the training on a GPU machine, where it can be roughly ten times faster than on a CPU. If you don’t have a physical GPU at your disposal, consider using Google Colab (cloud-based GPU).
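If you are unsure whether TensorFlow can actually see a GPU on your machine (or on a Colab runtime), a quick check in TensorFlow 2.x is:
import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means training will run on the CPU.
print(tf.config.list_physical_devices("GPU"))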
After running the training for 20 epochs, you should see a training accuracy of around 99% and a validation accuracy of around 75%, which is a good start. You now have a model at your disposal that is capable of predicting market sentiment from the financial text fed into it.
I encourage you to experiment with different model parameters and architectures to improve the accuracy further. It also appears that the model is over-fitting, since the gap between training and validation accuracy is large; I recommend trying regularization techniques such as Dropout (see the sketch below).
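As a starting point, here is one sketch of the same architecture with Dropout layers interleaved after the recurrent and dense layers; whether this particular placement and rate closes the gap is something you would need to verify against the validation set.
from tensorflow.keras.layers import Dropout

# Same architecture as before, with Dropout added to combat over-fitting.
reg_model = Sequential()
reg_model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH,
                        weights=[embedding_weights]))
reg_model.add(Bidirectional(LSTM(64, return_sequences=True)))
reg_model.add(Dropout(0.3))
reg_model.add(Bidirectional(LSTM(32)))
reg_model.add(Dropout(0.3))
reg_model.add(Dense(64, activation="relu"))
reg_model.add(Dropout(0.3))
reg_model.add(Dense(3, activation="softmax"))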
Sentiment Prediction
Now it’s time to take your model for a spin. Use the code below to generate the market sentiment for a financial text of your choice.
pred_sentences = ["<input the financial text of your choice>"]
pred_sequences = tokenizer.texts_to_sequences(pred_sentences)
X_pred = pad_sequences(pred_sequences, maxlen=MAX_LENGTH, truncating=TRUNC_TYPE)
y_pred = model.predict(X_pred)
y_pred
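The raw output is a vector of three class probabilities. To turn it into a readable label, take the argmax and map it back to a sentiment name; the ordering below assumes the negative/neutral/positive encoding used when we loaded the labels, so adjust it if your encoding differs.
import numpy as np

# Map the highest-probability class back to a sentiment name.
# The ordering must match the integer encoding used for the training labels.
sentiment_names = ["negative", "neutral", "positive"]
print(sentiment_names[np.argmax(y_pred[0])])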
Are you convinced by the quality of the sentiment predictions? If yes, it is about time you designed a user interface around the model we just developed and shipped it to your Credit Officers / Portfolio Managers, enabling them to use the model to make better-informed investment decisions.
I do hope you enjoyed this article; I’ll be really grateful if you can leave a rating / feedback below.
Credits:
- Co-authored by Anuj Kumar (https://www.linkedin.com/in/anujchauhan/)
- Concept: Simarjit Singh Lamba & Rohit Arora
- Financial News Sentiment Dataset published by Pekka Malo et al
- Douwe Osinga — Deep Learning Cookbook
- Antonio Gulli — Deep Learning with Keras
- Hobson Lane et al — Natural Language Processing in Action
- Image - sourced from miro.medium.com (https://miro.medium.com/max/1196/1%2aLTpcCt2mYa2-edjdti8S4Q.jpeg)