Word Embedding on Stock Market News

Rahman Taufik
Published in Geek Culture
2 min read · Jan 8, 2021

Word embedding is one of the most popular language modeling techniques in Natural Language Processing (NLP), where words or phrases from the vocabulary are mapped to a multi-dimensional vector space. It can capture the context of a word, semantic and syntactic similarity, relations to other words, and more.

The most famous example of word embeddings, and of how word vectors can be added and subtracted, is the 'Queen' analogy: adding the vector for woman to the vector for king, while subtracting the vector for man, yields a vector close to the one associated with queen.

King - Man + Woman ≈ Queen
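To see this arithmetic in action, here is a minimal sketch using a pretrained embedding from gensim's downloader (the model name glove-wiki-gigaword-50 is an illustrative choice, not the model trained in this article):

import gensim.downloader as api

# load a small pretrained embedding (illustrative choice, not this article's model)
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))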

In this article, we will try word embedding on stock market news. We have a dummy dataset of Covid vaccine news, split into several paragraphs.
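As a minimal sketch, the dataset can be represented as a list of paragraph strings (the sentences below are hypothetical placeholders, not the actual data):

# each entry is one news paragraph; the strings are hypothetical placeholders
paragraphs = [
    "The EU approved a Covid vaccine and will provide supplies to member states.",
    "Officials hope vaccination will end the pandemic and make life normal again.",
]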

The dataset

Each paragraph in the dataset must be pre-processed first. Our pre-processing function includes:

  • cleaning: keeps alphabetic characters only
  • tokenizing: breaks sentences into word tokens
  • stopword removal: removes common words (e.g. i, have, they) that carry little meaning

We could also apply lemmatization and stemming during pre-processing, and we can extend the stopword list as needed. A sketch of the pipeline is shown below.
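Here is a minimal sketch of such a pre-processing pipeline, assuming NLTK for tokenization and the English stopword list:

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))

def preprocess(paragraph):
    # cleaning: keep alphabetic characters and spaces only
    cleaned = re.sub(r"[^a-z\s]", " ", paragraph.lower())
    # tokenizing: break the text into word tokens
    tokens = word_tokenize(cleaned)
    # stopword removal: drop common words that carry little meaning
    return [t for t in tokens if t not in STOPWORDS]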

The pre-processing functions

After the dataset has been pre-processed, we train a word embedding model on it. We use the Word2Vec implementation from the gensim Python library. Word2Vec is commonly used to build models that detect synonymous words or suggest words for partial sentences. It has two model architectures, Skip-gram and CBOW. The difference is that CBOW uses the surrounding context words to predict a target/current word, while Skip-gram uses the current word to predict several context words.
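A minimal training sketch with gensim (parameter names follow gensim 4.x; values such as vector_size and window are illustrative assumptions, not the article's settings):

from gensim.models import Word2Vec

# sentences: list of token lists produced by the pre-processing step
sentences = [preprocess(p) for p in paragraphs]

# sg=1 selects Skip-gram; sg=0 (the default) selects CBOW
model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep rare words, since the dummy dataset is small
    sg=1,
)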

The training model function

From our dataset, we train a word embedding model (containing words and their vectors) and use the model to retrieve similar words. For example, we try to find out which words are related to vaccine, and the results are as follows:
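The query itself, assuming the model trained above, is a single call:

# top-10 words most similar to "vaccine" by cosine similarity
print(model.wv.most_similar("vaccine", topn=10))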

# words related to vaccine
[('takes', 0.15888293087482452), ('vaccinate', 0.14267049729824066), ('access', 0.14085279405117035), ('end', 0.13962803781032562), ('eu', 0.12622541189193726), ('normal', 0.12466207891702652), ('making', 0.11808254569768906), ('provide', 0.10831684619188309), ('supplies', 0.10447120666503906), ('approved', 0.1038375273346901)]

In Natural Language Processing (NLP), word embedding is a widely used method across many domains, and it could be a key ingredient for NLP problems, including those in the stock market domain.
