Mining Financial Stock News Using spaCy Matcher
By the end of this article, you will be able to write an information extraction NLP pipeline using spaCy’s Matcher. It will extract dividend information from news headlines and articles.
For the scope of this article, we will only use rule-based techniques to see how much could be done with them.
In this tutorial, we will:
- Get introduced to the fundamentals of rule-based matching using spaCy’s Matcher
- Use our newfound knowledge to create an information extraction NLP pipeline to mine dividend information from news articles
spaCy Matcher vs Regular Expressions
Pattern matching forms the basis of many artificial intelligence solutions, especially for natural language processing (NLP) tasks. Regular expressions (RegEx) are a powerful construct for extracting information from text, but they are limited to string patterns.
spaCy’s rule-matching engine, the Matcher, extends RegEx and offers a flexible solution for pattern matching. Unlike regular expressions, the Matcher works with Doc and Token objects, not just strings. It’s also more flexible: we can search for textual patterns and lexical attributes, or use model outputs like part-of-speech (POS) tags or entity types.
Beginner’s Guide To spaCy Matcher
First, install spaCy with "pip install spacy" and download the small English pipeline with "python -m spacy download en_core_web_sm".
The first step is to initialize the Matcher with a vocabulary. The matcher object must always share the same vocabulary with the documents it will operate on.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
Once the matcher object has been initialized with a vocab, we can add patterns to it using the matcher.add() method. This method takes two arguments: a string ID and a list of patterns.
Let’s say we want our matcher object to find all variants of “Hello World!”:
We need to match three tokens:
- A token with the text “hello” in upper or lower case.
- A token with the text “world” in upper or lower case.
- Any punctuation like “.” or “!”.
The pattern is defined as a list of dictionaries, where each dictionary represents one token. For our “hello world” matcher, we define the pattern as follows:
pattern = [{'LOWER': 'hello'},
{'LOWER': 'world'},
{'IS_PUNCT': True, 'OP': '+'}]
Here the LOWER and IS_PUNCT keys correspond to token attributes. LOWER matches on the lowercase form of the token text, and IS_PUNCT is a flag for punctuation.
The OP key is a quantifier. Quantifiers let us define how many times a token should be matched: for example, '+' matches a token one or more times, and '?' makes a token optional. We use it here to match one or more punctuation marks.
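As a quick illustration of quantifiers, here is a minimal sketch where '?' makes the trailing punctuation optional, so the pattern matches with or without it. It uses a blank English pipeline (no model download needed), and the "HELLO_OPTIONAL" ID is just an illustrative name:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough for lexical attributes like LOWER and IS_PUNCT
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# '?' makes the punctuation token optional, so both
# "hello world!" and a bare "hello world" will match
pattern = [{"LOWER": "hello"},
           {"LOWER": "world"},
           {"IS_PUNCT": True, "OP": "?"}]
matcher.add("HELLO_OPTIONAL", [pattern])

doc = nlp("hello world! hello world")
for match_id, start, end in matcher(doc):
    print(start, end, doc[start:end].text)
```

Note that optional tokens make the matcher return both the shorter and the longer variant at the same start position.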
When a matcher object is called on a Doc, it returns a list of (match_id, start, end) tuples.
# add pattern to matcher
matcher.add("HELLO_WORLD", [pattern])

# create a doc of the string to be 'queried'
doc = nlp("hello world!\nHello World.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # the matched span
    print(match_id, start, end, span.text)
# Output
# 2008415248711360438 0 3 hello world!
# 2008415248711360438 4 7 Hello World.
Here, match_id is the hash value of the string ID and can be used to retrieve the string from the StringStore. The start and end values correspond to the span of the original document where the match was found.
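For example, we can look the hash back up with nlp.vocab.strings — a minimal sketch using a blank pipeline:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello World")
for match_id, start, end in matcher(doc):
    # look the hash back up in the shared StringStore
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```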
Besides the obvious text attributes, such as IS_ALPHA, IS_DIGIT, IS_UPPER, and IS_STOP, we can use the token’s part-of-speech tag, morphological analysis, dependency label, lemma, or shape. Find the full list here.
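As a quick illustration of the shape attribute, here is a minimal sketch that matches any three-digit token (the "THREE_DIGITS" ID is just an illustrative name):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# SHAPE maps digits to 'd' and letters to 'x'/'X',
# so 'ddd' matches any three-digit token like '911'
matcher.add("THREE_DIGITS", [[{"SHAPE": "ddd"}]])

doc = nlp("In an emergency, call 911 right away.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```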
spaCy Matcher With Lemma
For instance, if we wanted to match the different forms of a verb like “run”, we could do so using the LEMMA attribute.
run_matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "run"}]
run_matcher.add("RUN", [pattern])

doc = nlp("Only when it dawned on him that he had nowhere left to run to, he finally stopped running.")
matches = run_matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(start, end, span.text)

# Output
# 12 13 run
# 18 19 running
Instead of just mapping token patterns to a single value, we can also map them to a dictionary of properties. For example, we can match the value of a lemma to a list of values or match tokens of a specific length.
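Here is a minimal sketch of both ideas. It uses LOWER rather than LEMMA so a blank pipeline suffices, and the pattern IDs are just illustrative names:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# match any token whose lowercase form is in a list of values
buy_sell = [{"LOWER": {"IN": ["buy", "sell", "hold"]}}]
# match tokens that are at least 10 characters long
long_word = [{"LENGTH": {">=": 10}}]

matcher.add("ACTION", [buy_sell])
matcher.add("LONG_WORD", [long_word])

doc = nlp("Analysts say buy the shareholder favourites.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```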
spaCy Matcher With RegEx
Even with all these nifty attributes, working on the token level is sometimes not enough. Matching spelling variations is an obvious example. In these cases, the REGEX operator enables us to use regular expressions inside the Matcher.
regex_matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": {"REGEX": "colou?r"}}]
regex_matcher.add("COLOR", [pattern])

doc = nlp("Color is the spelling used in the United States. Colour is used in other English-speaking countries.")
matches = regex_matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(start, end, span.text)

# Output
# 0 1 Color
# 10 11 Colour
When To Use spaCy Matcher?
The Matcher works well when there’s a finite number of examples we want to find in the text, or when there’s an obvious structural pattern we can express with token rules or RegEx. It is a good choice for dates, URLs, phone numbers, or city names.
A statistical entity recognition model is the better choice for complicated tasks, but it has a caveat: a large amount of training data is required to train a robust model. Rule-based approaches let us prototype while gathering and annotating that training data. We can also combine the two approaches, improving a statistical model with rules via the EntityRuler. This helps handle very specific cases and boosts accuracy.
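As a rough sketch of this idea, the snippet below registers a hypothetical ticker pattern (an uppercase token in parentheses) with an EntityRuler on a blank pipeline:

```python
import spacy

# A minimal EntityRuler sketch: label ticker-like strings as ORG
# with a rule, before (or instead of) a statistical NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG",
     "pattern": [{"ORTH": "("}, {"IS_UPPER": True}, {"ORTH": ")"}]},
])

doc = nlp("BlackRock Innovation and Growth Trust (BIGZ) declared a dividend.")
print([(ent.text, ent.label_) for ent in doc.ents])
```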
Here’s a nifty tool for visualizing and playing with matcher patterns: Rule-based Matcher Explorer
Using spaCy Matcher To Extract Dividend Information From News Headlines
Loading packages and the dataset we’ll work with.
import json
import time
import spacy
from spacy.matcher import Matcher
We will use a JSON file of news articles from top publications fetched from the newscatcherapi using the query “dividend”.
You can get the data for this tutorial on GitHub.
with open('data.json', 'r') as f:
    news_articles = json.load(f)

print(news_articles[0])
There’s a lot of information here, but we’ll focus on the headlines and the article summaries. Let’s look for common patterns in the headlines of the articles.
print("\n".join(art['title'] for art in news_articles))
Many of the article headlines come in two formats: ex-dividend date announcements like “BlackRock Energy and Resources Trust (BGR) Ex-Dividend Date Scheduled for November 12, 2021”, and dividend amount headlines like “There’s A Lot To Like About ConnectOne Bancorp’s (NASDAQ:CNOB) Upcoming US$0.13 Dividend”.
The first thing we’ll make a pattern for is the organization name. String patterns alone won’t be enough, so we will need to incorporate part-of-speech (POS) tags as well.
doc = nlp("BlackRock Innovation and Growth Trust (BIGZ) will begin trading ex-dividend on November 12, 2021.")
for token in doc:
    print(f"{token.text:15}, {token.pos_:<10}")
The organization name consists of proper nouns with an optional conjunction like “and”. Another thing to note is that the organization name is often followed by “ ’s” or simply “ ’ ” in the second headline type. This means we don’t have one concrete way of finding the end of the organization name, so we will rely on the presence of the ticker after it.
Detect Organization
As before, our matcher object must be initialized with the vocabulary of the nlp object before we can add the organization name pattern to it.
def get_org(doc):
    org_matcher = Matcher(nlp.vocab)
    pattern = [{'POS': 'PROPN', 'OP': '+'},
               {'POS': 'CCONJ', 'OP': '?'},
               {'POS': 'PROPN', 'OP': '*'},
               {'ORTH': "'", 'OP': '?'},
               {'ORTH': "'s", 'OP': '?'},
               {'ORTH': '(', 'OP': '+'}]
    org_matcher.add("ORG", [pattern])
This matcher object finds all the organizations in the two headline types. But the problem is that it returns all possible sequences of tokens that fit the pattern. And since the pattern has a bunch of optional tokens, it will return some half-complete organization names.
There’s a simple way to fix this problem. We check if the Matcher object returns more than one match. If it does, we choose the longest match. And if it only returns one match, well then, that’s our organization name.
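Alternatively, spaCy ships a helper, spacy.util.filter_spans, that keeps the longest span when matches overlap. A minimal sketch on a toy pattern:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# '+' produces many overlapping matches: every run of lowercase tokens
matcher.add("WORDS", [[{"IS_LOWER": True, "OP": "+"}]])

doc = nlp("dividend payment date")
spans = [doc[start:end] for _, start, end in matcher(doc)]
# filter_spans keeps only the longest, non-overlapping spans
longest = filter_spans(spans)
print([s.text for s in longest])
```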
    matches = org_matcher(doc)
    if len(matches) == 0:
        return f"{doc.text} -> NO MATCH FOUND"
    elif len(matches) == 1:
        match_idx = matches[0]
    else:
        max_len = 0
        for m in matches:
            if m[2] - m[1] > max_len:
                max_len = m[2] - m[1]
                match_idx = m
Another thing to keep in mind is that our matches will include the opening parenthesis (“(”) of the ticker, so we’ll have to slice our matches accordingly.
    return doc[match_idx[1]:match_idx[2]-1]
Detect Ticker
The next easiest pattern to extract is the stock ticker. It is enclosed within a pair of parentheses and may or may not include the exchange it’s listed on, e.g. “(BIGZ)” or “(NASDAQ:CNOB)”.
def get_ticker(doc):
    ticker_matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('},
               {'IS_ALPHA': True},
               {'ORTH': ':', 'OP': '*'},
               {'IS_ALPHA': True, 'OP': '*'},
               {'ORTH': ')'}]
    ticker_matcher.add("TICKER", [pattern])

    matches = ticker_matcher(doc)
    if len(matches) == 0:
        return f"{doc.text} -> NO MATCH FOUND"
    else:
        return doc[matches[0][1]:matches[0][2]]
Let’s test these two pattern-matching methods on a headline:
doc = nlp("BlackRock Energy and Resources Trust (BGR) Ex-Dividend Date Scheduled for November 12, 2021")
print(get_org(doc))
print(get_ticker(doc))

# Output
# BlackRock Energy and Resources Trust
# (BGR)
Detect the Date and Amount
Now depending on the headline type, we can extract two more bits of information:
- the ex-dividend date
- the dividend amount
Let’s do the dividend amount first. We could use the dollar sign alone to fetch the amount, but there are often other monetary amounts, like the total dividend earned by a firm.
So we will use the presence of “US” before the dividend amount to work around this confusion.
def get_amount_headline(doc):
    dividend_matcher = Matcher(nlp.vocab)
    pattern = [{"ORTH": "US$"}, {"LIKE_NUM": True}]
    dividend_matcher.add("USD", [pattern])

    matches = dividend_matcher(doc)
    if len(matches) > 0:
        match = matches[0]
        return doc[match[1]:match[2]]
    else:
        return False
doc = nlp("There's A Lot To Like About ConnectOne Bancorp's (NASDAQ:CNOB) Upcoming US$0.13 Dividend")
print(get_amount_headline(doc))

# Output
# US$0.13
As we saw earlier, the ex-dividend date has a defined format, and the month is a proper noun.
Using this information, we can come up with a simple pattern.
def get_date(doc):
    date_matcher = Matcher(nlp.vocab)
    pattern = [{"POS": "PROPN"}, {"LIKE_NUM": True},
               {"ORTH": ","}, {"LIKE_NUM": True}]
    date_matcher.add("EX_DATE", [pattern])

    matches = date_matcher(doc)
    if len(matches) > 0:
        match = matches[0]
        return doc[match[1]:match[2]]
    else:
        return False
The articles with ex-dividend date headlines have the dividend amount and payment date in a consistent format in the summary. We can easily create two more Matcher objects to extract these bits of information.
The payment date has the same format as the ex-date. This is good and bad. Good because we already have a working pattern for it, and bad because our existing Matcher won’t be able to differentiate between the two dates.
One quick workaround is to use the neighboring text to differentiate the two, namely the presence of “paid on” before the payment date. Since we are matching extra text on either side of the relevant information, we will have to modify the final match index.
def get_pay_date(doc):
    pay_date_matcher = Matcher(nlp.vocab)
    pattern = [{"ORTH": "paid"}, {"ORTH": "on"},
               {"POS": "PROPN"}, {"LIKE_NUM": True},
               {"ORTH": ","}, {"LIKE_NUM": True},
               {"ORTH": "."}]
    pay_date_matcher.add("PAY_DATE", [pattern])

    match = pay_date_matcher(doc)[0]
    # skip the leading "paid on" and drop the trailing "."
    return doc[match[1] + 2:match[2] - 1]
Using the same logic, we can extract the dividend amount per share based on the presence of the text “per share” after the dollar amount. We are doing this to avoid mistakenly extracting the total amount of dividend a company is distributing.
Again, we are using extra neighboring text in our Matcher, so we will adjust the indices.
def get_amount_summary(doc):
    per_share_matcher = Matcher(nlp.vocab)
    pattern = [{"ORTH": "$"}, {"LIKE_NUM": True},
               {"LOWER": "per"}, {"LOWER": "share"}]
    per_share_matcher.add("AMOUNT", [pattern])

    match = per_share_matcher(doc)[0]
    # drop the trailing "per share" tokens
    return doc[match[1]:match[2] - 2]
Final Pipeline
Let’s test these patterns:
doc = nlp("BlackRock Innovation and Growth Trust (BIGZ) will begin trading ex-dividend on November 12, 2021. A cash dividend payment of $0.1 per share is scheduled to be paid on November 30, 2021. Shareholders who purchased BIGZ prior to the ex-dividend date are eligible for the cash dividend payment. This marks the 6th quarter that BIGZ has paid the same dividend. At the current stock price of $17.99, the dividend yield is 6.67%.")
print(get_amount_summary(doc))
print(get_pay_date(doc))

# Output
# $0.1
# November 30, 2021
We now have all the Matcher methods we need to extract dividend information. Let’s combine these into one function to test it on our news data easily.
def dividend_info(article):
    headline = nlp(article['title'])
    if 'date' in [token.text.lower() for token in headline]:
        date = get_date(headline)
        if date:
            org = get_org(headline)
            ticker = get_ticker(headline)
            amount = get_amount_summary(nlp(article['summary']))
            pay_date = get_pay_date(nlp(article['summary']))
            print("HEADLINE: " + article['title'])
            print(f"\nTICKER: {ticker}\nDATE: {date}\nAMOUNT: {amount} per share to be paid on {pay_date}\n")
    else:
        dividend = get_amount_headline(headline)
        if dividend:
            org = get_org(headline)
            ticker = get_ticker(headline)
            print("NEWS HEADLINE: " + article['title'])
            print(f"\nTICKER: {ticker}\nAMOUNT: {dividend}\n")
Conclusion
In this article, you learned the basics of spaCy’s Matcher class, then used that newfound knowledge to create a robust information extraction pipeline for news articles. You can now extend the patterns to extract dividend information from more types of articles, or change them to extract other types of information altogether.