Keyword Extraction on Stock Market News

Rahman Taufik
Published in The Startup
2 min read, Jan 8, 2021
Image source: https://rare-technologies.com/tag/keyword-extraction/

News portals and articles are available online in such massive numbers that it is almost impossible for a reader to process all of that information at once. However, this no longer stops us from getting the keywords out of many documents, because we can use Natural Language Processing (NLP) tools to extract them. Python libraries such as NLTK, Gensim, fastText, and scikit-learn (sklearn) all help with this.

In this article, we will extract keywords from stock market news using the TF-IDF method from the Python sklearn library. TF-IDF is a word-weighting scheme intended to reflect how important a word is to a document within a collection of documents. It combines two terms: Term Frequency (TF), the number of times a word appears in a document divided by the total number of words in that document, and Inverse Document Frequency (IDF), the log of the total number of documents divided by the number of documents that contain the word. The TF-IDF score is the product of the two, so a word scores highly when it is frequent in one document but rare across the rest.
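To make the definition concrete, here is a minimal pure-Python sketch of the two terms exactly as described above. The toy headlines and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. Note that sklearn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "stock prices rise as market rallies",
    "market falls as oil prices drop",
    "central bank holds interest rates",
]

# Naive whitespace tokenization; real text would need proper cleaning.
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / doc_freq)


def tf_idf(term, tokens, corpus):
    """Product of the two terms for one word in one document."""
    return tf(term, tokens) * idf(term, corpus)


stock_score = tf_idf("stock", tokenized[0], tokenized)
market_score = tf_idf("market", tokenized[0], tokenized)
```

In the first headline, "stock" and "market" have the same term frequency, but "stock" appears in only one document while "market" appears in two, so `stock_score` comes out higher than `market_score`.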

The more documents we have, the better the extraction results tend to be, because IDF is estimated from the whole collection. We can get datasets from various sources such as the Eikon Data API and Rapid API. However, we will not use those sources in this article because they need more explanation, so let’s use dummy data instead. The following is an example of the dataset and the keyword extraction code:

The Code

The code above extracts keywords in the following steps:

  • Use stopwords to filter out common words such as they, we, and have. We don’t need them because they carry little meaning as keywords.
  • Vectorize and transform the words using the TF-IDF function.
  • Get the keywords using the TF-IDF scores and save them.

Because we use the TF-IDF implementation from Python’s sklearn library, the code is quite simple. However, we can still develop the model to get better results, for example by adding extra stopwords for better filtering.

The Result

If you compare the dataset with the result, the extracted keywords look reasonable. However, the result is not perfect, because we only used dummy data and a simple method.
