NLP: Entity Grounding
When you extract entities (names, dates, places, etc.) or keywords from a document, each one carries surrounding context that gives us an awareness a machine often lacks: for example, knowing that 25th December is Christmas, that a Saturday falls on the weekend, or that the actor Keanu Reeves is male. Making this implicit knowledge explicit to a machine can significantly improve its ability to learn and identify patterns in the data.
Pandas
Since dates and times can be phrased in many different ways, let's start by normalising them into a standard format such as a datetime stamp.
from dateutil import parser

example_date = "dec 25th"
datetime = parser.parse(example_date)
datetime
Then we can use Pandas to expand the implicit features hidden within the datetime stamp (e.g. day, month, weekend, etc.)
import pandas as pd

dt = pd.to_datetime(datetime)

# Ordinal suffixes for days 1-31: 1st, 2nd, 3rd, 4th-20th, 21st, 22nd, 23rd, 24th-30th, 31st
suffixes = ["st", "nd", "rd"] + ["th"] * 17 + ["st", "nd", "rd"] + ["th"] * 7 + ["st"]

date_time = {
    "phrase": example_date,
    "date": f"{dt.day}/{dt.month}/{dt.year}",
    "time": f"{dt.hour}:{dt.minute}:{dt.second}",
    "month": ("January","February","March","April","May","June","July","August","September","October","November","December")[dt.month - 1],
    "day": ("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")[dt.dayofweek],
    "day of year": dt.dayofyear,
    "suffix": f"{dt.day}{suffixes[dt.day - 1]}",  # index by day of month, not month
    "is month end": dt.is_month_end,
    "is month start": dt.is_month_start,
    "is quarter end": dt.is_quarter_end,
    "is quarter start": dt.is_quarter_start,
    "is year start": dt.is_year_start,
    "is year end": dt.is_year_end,
    "is weekend": dt.dayofweek in (5, 6),  # Monday=0, so Saturday=5 and Sunday=6
}
date_time
This is also an important step to perform before embedding dates or times!
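As an aside, one common way to embed a cyclical feature such as the month or day of week (a sketch, not something shown above) is to map it onto a circle with sine and cosine, so that December ends up adjacent to January rather than numerically far away:

```python
import math

def cyclical_encode(value, period):
    """Map a cyclical feature (e.g. month 1-12) onto a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

dec = cyclical_encode(12, 12)  # December
jan = cyclical_encode(1, 12)   # January
# December and January are close together on the circle,
# which a plain integer encoding (12 vs 1) would not capture.
```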
Google Correlate
There are also ways to extract events commonly associated with times and dates (e.g. holidays). Google Correlate used an algorithm to find search terms whose search volumes followed a correlated pattern over the year (e.g. people usually search for "Christmas" and related events around 25th December).
Unfortunately it never provided an API (so the correlated results had to be scraped), and the service itself was discontinued by Google in December 2019. Let's see which search terms correlated with the term "25th December".
WordNet
WordNet is a curated graph of words and the relations between them.
It's free and easily accessible via Python's NLTK library, and it's an easy and fairly comprehensive way to find a word's synonyms automatically.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def synonyms(word):
    synonyms = set()
    for synset in wn.synsets('_'.join(word.lower().split())):
        synonyms |= set(synset.lemma_names())
        for related_synset in (synset.hypernyms()
                               + synset.hyponyms()
                               + synset.part_meronyms()
                               + synset.substance_meronyms()
                               + synset.member_meronyms()
                               + synset.part_holonyms()
                               + synset.substance_holonyms()
                               + synset.member_holonyms()
                               + synset.topic_domains()
                               + synset.region_domains()
                               + synset.usage_domains()
                               + synset.entailments()
                               + synset.causes()
                               + synset.also_sees()
                               + synset.verb_groups()
                               + synset.similar_tos()):
            synonyms |= set(related_synset.lemma_names())
    return synonyms
synonyms("psoriasis")
{'acanthosis', 'disease_of_the_skin', 'psoriasis', 'skin_disease', 'skin_disorder'}
synonyms("robotics")
{'AI', 'animatronics', 'artificial_intelligence', 'robotics', 'telerobotics'}
synonyms("thanks")
{'acknowledge', 'acknowledgement', 'acknowledgment', 'aid', 'appreciation', 'assist', 'assistance', 'bow', 'convey', 'curtain_call', 'give_thanks', 'help', 'recognise', 'recognize', 'thank', 'thank_you', 'thanks'}
DBPedia
Using DBPedia, we can find facts related to names and places.
Just like WordNet, there are various types of relations (e.g. gender, deathPlace, children, etc)
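As a minimal sketch (not shown in the original), DBPedia can be queried through its public SPARQL endpoint. The entity `Keanu_Reeves` below is just an illustrative example; the helper builds a standard SPARQL query listing every (relation, value) pair attached to a resource:

```python
def dbpedia_facts_query(entity: str) -> str:
    """Build a SPARQL query listing (relation, value) pairs for a DBPedia resource."""
    return f"""
    SELECT ?relation ?value WHERE {{
        <http://dbpedia.org/resource/{entity}> ?relation ?value .
    }} LIMIT 50
    """

query = dbpedia_facts_query("Keanu_Reeves")  # illustrative example entity

# Sending the query (requires network access and the `requests` package):
# import requests
# response = requests.get(
#     "https://dbpedia.org/sparql",
#     params={"query": query, "format": "application/sparql-results+json"},
# )
# for row in response.json()["results"]["bindings"]:
#     print(row["relation"]["value"], "->", row["value"]["value"])
```

Among the returned relations you would expect properties like `gender`, `birthPlace`, and so on.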
Wikipedia
We could even use the links that direct to a Wikipedia page as implicit relations for an entity. The disadvantage of this approach is that the type of relation will not be named.
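To sketch this idea (the page title is an assumed example; the parameters are standard MediaWiki API fields), a page's outgoing links can be fetched and each linked title treated as an unnamed relation:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def wikipedia_links_url(title: str) -> str:
    """Build the MediaWiki API URL that lists a page's outgoing links."""
    params = urlencode({
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": "50",
        "format": "json",
    })
    return f"https://en.wikipedia.org/w/api.php?{params}"

def wikipedia_links(title: str):
    """Return linked page titles, each treated as an unnamed relation of the entity."""
    with urlopen(wikipedia_links_url(title)) as response:
        data = json.load(response)
    pages = data["query"]["pages"]
    return [link["title"] for page in pages.values() for link in page.get("links", [])]

# wikipedia_links("Keanu Reeves")  # requires network access
```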