Mining Financial Stock News Using spaCy Matcher
By the end of this article, you will be able to write an information extraction NLP pipeline using spaCy’s Matcher. It will extract dividend information from news headlines and articles.
For the scope of this article, we will only use rule-based techniques to see how much could be done with them.
In this tutorial, we will:
- Get introduced to the fundamentals of rule-based matching using spaCy’s Matcher
- Use our newfound knowledge to create an information extraction NLP pipeline to mine dividend information from news articles
spaCy Matcher vs Regular Expressions
Pattern matching forms the basis of many artificial intelligence solutions, especially for natural language processing (NLP) tasks. Regular expressions (RegEx) are a powerful construct for extracting information from text, but they are limited to string patterns.
spaCy’s rule-matching engine, the Matcher, extends RegEx and offers a flexible solution for pattern matching. Unlike regular expressions, the Matcher works with Doc and Token objects, not just strings. It’s also more flexible: we can search for textual patterns and lexical attributes, or use model outputs like part-of-speech (POS) tags or entity types.
Beginner’s Guide To spaCy Matcher
First, install spaCy with "pip install spacy" and download the small English pipeline with "python -m spacy download en_core_web_sm".
The first step is to initialize the Matcher with a vocabulary. The matcher object must always share the same vocabulary with the documents it will operate on.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
Once the matcher object has been initialized with a vocab, we can add patterns to it using the matcher.add() method. This method takes two arguments: a string ID and a list of patterns.
Let’s say we want our matcher object to find all variants of “Hello World!”:
We need to match three tokens:
- A token with the text “hello” in upper or lower case.
- A token with the text “world” in upper or lower case.
- Any punctuation like “.” or “!”.
The pattern is defined as a list of dictionaries, where each dictionary represents one token. For our “hello world” matcher, we define the pattern as follows:
pattern = [{'LOWER': 'hello'},
{'LOWER': 'world'},
{'IS_PUNCT': True, 'OP': '+'}]
Here the LOWER and IS_PUNCT keys correspond to token attributes. LOWER matches on the lowercase form of the token text, and IS_PUNCT is a flag for punctuation.
The OP key is a quantifier. Quantifiers let us define how many times a token should be matched: for example, '+' matches a token one or more times, and '?' makes a token optional. We use it here to match one or more punctuation marks.
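As a quick illustration of quantifiers, here is a minimal sketch where '?' makes the trailing punctuation optional, so the pattern matches with or without it. It uses a blank English pipeline (no model download needed), and the "HELLO_OPTIONAL" ID is just an illustrative name:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough for lexical attributes like LOWER and IS_PUNCT
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# '?' makes the punctuation token optional, so both
# "hello world!" and a bare "hello world" will match
pattern = [{"LOWER": "hello"},
           {"LOWER": "world"},
           {"IS_PUNCT": True, "OP": "?"}]
matcher.add("HELLO_OPTIONAL", [pattern])

doc = nlp("hello world! hello world")
for match_id, start, end in matcher(doc):
    print(start, end, doc[start:end].text)
```

Note that optional tokens make the matcher return both the shorter and the longer variant at the same start position.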
When a matcher object is called on a Doc, it returns a list of (match_id, start, end) tuples.
# add pattern to matcher
matcher.add("HELLO_WORLD", [pattern])

# create a doc of the string to be 'queried'
doc = nlp("hello world!\nHello World.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # the matched span
    print(match_id, start, end, span.text)
# Output
# 2008415248711360438 0 3 hello world!
# 2008415248711360438 4 7 Hello World.
Here, match_id is the hash value of the string ID and can be used to retrieve the string from the StringStore. The start and end values correspond to the span of the original document where the match was found.
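For example, we can look the hash back up with nlp.vocab.strings — a minimal sketch using a blank pipeline:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello World")
for match_id, start, end in matcher(doc):
    # look the hash back up in the shared StringStore
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```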
Besides the obvious text attributes, such as IS_ALPHA, IS_DIGIT, IS_UPPER, and IS_STOP, we can use the token’s part-of-speech tag, morphological analysis, dependency label, lemma, or shape. Find the full list here.
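As a quick illustration of the shape attribute, here is a minimal sketch that matches any three-digit token (the "THREE_DIGITS" ID is just an illustrative name):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# SHAPE maps digits to 'd' and letters to 'x'/'X',
# so 'ddd' matches any three-digit token like '911'
matcher.add("THREE_DIGITS", [[{"SHAPE": "ddd"}]])

doc = nlp("In an emergency, call 911 right away.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```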
spaCy Matcher With Lemma
For instance, if we wanted to match the different forms of a verb like “run”, we could do so using the LEMMA attribute.
run_matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "run"}]
run_matcher.add("RUN", [pattern])

doc = nlp("Only when it dawned on him that he had nowhere left to run to, he finally stopped running.")
matches = run_matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(start, end, span.text)

# Output
# 12 13 run
# 18 19 running
Instead of just mapping token patterns to a single value, we can also map them to a dictionary of properties. For example, we can match the value of a lemma to a list of values or match tokens of a specific length.
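Here is a minimal sketch of both ideas. It uses LOWER rather than LEMMA so a blank pipeline suffices, and the pattern IDs are just illustrative names:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# match any token whose lowercase form is in a list of values
buy_sell = [{"LOWER": {"IN": ["buy", "sell", "hold"]}}]
# match tokens that are at least 10 characters long
long_word = [{"LENGTH": {">=": 10}}]

matcher.add("ACTION", [buy_sell])
matcher.add("LONG_WORD", [long_word])

doc = nlp("Analysts say buy the shareholder favourites.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```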
spaCy Matcher With RegEx
Even with all these nifty attributes, working on the token level is sometimes not enough. Matching spelling variations is an obvious example. In these cases, the REGEX operator enables us to use regular expressions inside the Matcher.
regex_matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": {"REGEX": "colou?r"}}]
regex_matcher.add("COLOR", [pattern])

doc = nlp("Color is the spelling used in the United States. Colour is used in other English-speaking countries.")
matches = regex_matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(start, end, span.text)

# Output
# 0 1 Color
# 10 11 Colour
When To Use spaCy Matcher?
The Matcher works well when there’s a finite number of examples we want to find in the text, or when there’s an obvious structural pattern we can express with token rules or RegEx. It is a good choice for dates, URLs, phone numbers, or city names.
A statistical entity recognition model is the better choice for complicated tasks, but it has a caveat: a large amount of training data is required to train a robust model. Rule-based approaches let us prototype while gathering and annotating that training data. We can also combine the two approaches, improving a statistical model with rules via the EntityRuler. This helps handle very specific cases and boosts accuracy.
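As a rough sketch of this idea, the snippet below registers a hypothetical ticker pattern (an uppercase token in parentheses) with an EntityRuler on a blank pipeline:

```python
import spacy

# A minimal EntityRuler sketch: label ticker-like strings as ORG
# with a rule, before (or instead of) a statistical NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG",
     "pattern": [{"ORTH": "("}, {"IS_UPPER": True}, {"ORTH": ")"}]},
])

doc = nlp("BlackRock Innovation and Growth Trust (BIGZ) declared a dividend.")
print([(ent.text, ent.label_) for ent in doc.ents])
```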
Here’s a nifty tool for visualizing and playing with matcher patterns: Rule-based Matcher Explorer
Using spaCy Matcher To Extract Dividend Information From News Headlines
Loading packages and the dataset we’ll work with.
import json
import time
import spacy
from spacy.matcher import Matcher
We will use a JSON file of news articles from top publications fetched from the newscatcherapi using the query “dividend”.
You can get the data for this tutorial on GitHub.
with open('data.json', 'r') as f:
    news_articles = json.load(f)

print(news_articles[0])
There’s a lot of information here, but we’ll focus on the headlines and the article summaries. Let’s look for common patterns in the headlines of the articles.
print("\n".join(art['title'] for art in news_articles))
Many of the article headlines come in two formats: ex-dividend date announcements like “BlackRock Energy and Resources Trust (BGR) Ex-Dividend Date Scheduled for November 12, 2021”, and dividend amount headlines like “There’s A Lot To Like About ConnectOne Bancorp’s (NASDAQ:CNOB) Upcoming US$0.13 Dividend”.
The first thing we’ll make a pattern for is the organization name. String patterns alone won’t be enough, so we will need to incorporate part-of-speech (POS) tags as well.
doc = nlp("BlackRock Innovation and Growth Trust (BIGZ) will begin trading ex-dividend on November 12, 2021.")
for token in doc:
    print(f"{token.text:15}, {token.pos_:<10}")
The organization name consists of proper nouns with an optional conjunction like “and”. Another thing to note is that the organization name is often followed by “ ’s” or simply “ ’ ” in the second headline type. This means we don’t have one concrete way of finding the end of the organization name, so we will rely on the presence of the ticker after it.
Detect Organization
As before, our matcher object must be initialized with the vocabulary of the nlp object before we can add the organization name pattern to it.
def get_org(doc):
    org_matcher = Matcher(nlp.vocab)
    pattern = [{'POS': 'PROPN', 'OP': '+'},
               {'POS': 'CCONJ', 'OP': '?'},
               {'POS': 'PROPN', 'OP': '*'},
               {'ORTH': "'", 'OP': '?'},
               {'ORTH': "'s", 'OP': '?'},
               {'ORTH': '(', 'OP': '+'}]
    org_matcher.add("ORG", [pattern])
This matcher object finds all the organizations in the two headline types. But the problem is that it returns all possible sequences of tokens that fit the pattern. And since the pattern has a bunch of optional tokens, it will return some half-complete organization names.
There’s a simple way to fix this problem. We check if the Matcher object returns more than one match. If it does, we choose the longest match. And if it only returns one match, well then, that’s our organization name.
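Alternatively, spaCy ships a helper, spacy.util.filter_spans, that keeps the longest span when matches overlap. A minimal sketch on a toy pattern:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# '+' produces many overlapping matches: every run of lowercase tokens
matcher.add("WORDS", [[{"IS_LOWER": True, "OP": "+"}]])

doc = nlp("dividend payment date")
spans = [doc[start:end] for _, start, end in matcher(doc)]
# filter_spans keeps only the longest, non-overlapping spans
longest = filter_spans(spans)
print([s.text for s in longest])
```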
    matches = org_matcher(doc)
    if len(matches) == 0:
        return f"{doc.text} -> NO MATCH FOUND"
    elif len(matches) == 1:
        match_idx = matches[0]
    else:
        max_len = 0
        for m in matches:
            if m[2] - m[1] > max_len:
                max_len = m[2] - m[1]
                match_idx = m
Another thing to keep in mind is that our matches will include the opening parenthesis (“(”) of the ticker, so we’ll have to slice our matches accordingly.
    return doc[match_idx[1]:match_idx[2]-1]
Detect Ticker
The next easiest pattern to extract is the stock ticker. It is enclosed within a pair of parentheses and may or may not include the exchange it’s listed on, e.g. “(BIGZ)” or “(NASDAQ:CNOB)”.
def get_ticker(doc):
    ticker_matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('},
               {'IS_ALPHA': True},
               {'ORTH': ':', 'OP': '*'},
               {'IS_ALPHA': True, 'OP': '*'},
               {'ORTH': ')'}]
    ticker_matcher.add("TICKER", [pattern])

    matches = ticker_matcher(doc)
    if len(matches) == 0:
        return f"{doc.text} -> NO MATCH FOUND"
    else:
        return doc[matches[0][1]:matches[0][2]]
Let’s test these two pattern-matching methods on a headline:
doc = nlp("BlackRock Energy and Resources Trust (BGR) Ex-Dividend Date Scheduled for November 12, 2021")
print(get_org(doc))
print(get_ticker(doc))

# Output
# BlackRock Energy and Resources Trust
# (BGR)
Detect the Date and Amount
Now depending on the headline type, we can extract two more bits of information:
- the ex-dividend date
- the dividend amount
Let’s do the dividend amount first. We could use the dollar sign alone to fetch the amount, but there are often other monetary amounts, like the total dividend earned by a firm.
So we will use the presence of “US” before the dividend amount to work around this confusion.
def get_amount_headline(doc):
    dividend_matcher = Matcher(nlp.vocab)
    pattern = [{"ORTH": "US$"}, {"LIKE_NUM": True}]
    dividend_matcher.add("USD", [pattern])

    matches = dividend_matcher(doc)
    if len(matches) > 0:
        match = matches[0]
        return doc[match[1]:match[2]]
    else:
        return False
doc = nlp("There's A Lot To Like About ConnectOne Bancorp's (NASDAQ:CNOB) Upcoming US$0.13 Dividend")
print(get_amount_headline(doc))

# Output
# US$0.13
As we saw earlier, the ex-dividend date has a defined format, and the month is a proper noun.
Using this information, we can come up with a simple pattern.
def get_date(doc):
    date_matcher = Matcher(nlp.vocab)
    pattern = [{"POS": "PROPN"}, {"LIKE_NUM": True},
               {"ORTH": ","}, {"LIKE_NUM": True}]
    date_matcher.add("EX_DATE", [pattern])

    matches = date_matcher(doc)
    if len(matches) > 0:
        match = matches[0]
        return doc[match[1]:match[2]]
    else:
        return False
The articles with ex-dividend date headlines have the dividend amount and payment date in a consistent format in the summary. We can easily create two more Matcher objects to extract these bits of information.
The payment date has the same format as the ex-date. This is good and bad. Good because we already have a working pattern for it, and bad because our existing Matcher won’t be able to differentiate between the two dates.
One quick workaround is to use the neighboring text to differentiate the two, namely the presence of “paid on” before the payment date. Since we are matching extra text on either side of the relevant information, we will have to modify the final match index.
def get_pay_date(doc):
    pay_date_matcher = Matcher(nlp.vocab)
    pattern = [{"ORTH": "paid"}, {"ORTH": "on"},
               {"POS": "PROPN"}, {"LIKE_NUM": True},
               {"ORTH": ","}, {"LIKE_NUM": True},
               {"ORTH": "."}]
    pay_date_matcher.add("PAY_DATE", [pattern])

    match = pay_date_matcher(doc)[0]
    # skip the leading "paid on" and drop the trailing "."
    return doc[match[1] + 2:match[2] - 1]
Using the same logic, we can extract the dividend amount per share based on the presence of the text “per share” after the dollar amount. We are doing this to avoid mistakenly extracting the total amount of dividend a company is distributing.
Again, we are using extra neighboring text in our Matcher, so we will adjust the indices.
def get_amount_summary(doc):
    per_share_matcher = Matcher(nlp.vocab)
    pattern = [{"ORTH": "$"}, {"LIKE_NUM": True},
               {"LOWER": "per"}, {"LOWER": "share"}]
    per_share_matcher.add("AMOUNT", [pattern])

    match = per_share_matcher(doc)[0]
    # drop the trailing "per share" tokens
    return doc[match[1]:match[2] - 2]
Final Pipeline
Let’s test these patterns:
doc = nlp("BlackRock Innovation and Growth Trust (BIGZ) will begin trading ex-dividend on November 12, 2021. A cash dividend payment of $0.1 per share is scheduled to be paid on November 30, 2021. Shareholders who purchased BIGZ prior to the ex-dividend date are eligible for the cash dividend payment. This marks the 6th quarter that BIGZ has paid the same dividend. At the current stock price of $17.99, the dividend yield is 6.67%.")
print(get_amount_summary(doc))
print(get_pay_date(doc))

# Output
# $0.1
# November 30, 2021
We now have all the Matcher methods we need to extract dividend information. Let’s combine these into one function to test it on our news data easily.
def dividend_info(article):
    headline = nlp(article['title'])
    if 'date' in [token.text.lower() for token in headline]:
        date = get_date(headline)
        if date:
            org = get_org(headline)
            ticker = get_ticker(headline)
            amount = get_amount_summary(nlp(article['summary']))
            pay_date = get_pay_date(nlp(article['summary']))
            print("HEADLINE: " + article['title'])
            print(f"\nTICKER: {ticker}\nDATE: {date}\nAMOUNT: {amount} per share to be paid on {pay_date}\n")
    else:
        dividend = get_amount_headline(headline)
        if dividend:
            org = get_org(headline)
            ticker = get_ticker(headline)
            print("NEWS HEADLINE: " + article['title'])
            print(f"\nTICKER: {ticker}\nAMOUNT: {dividend}\n")
Conclusion
In this article, you learned the basics of spaCy’s Matcher class, then used that newfound knowledge to create a robust information extraction pipeline for news articles. You can now extend the patterns to extract dividend information from more types of articles, or change them to extract other types of information altogether.