Extract Subject Matter of Documents Using NLP

Alexander Crosson
4 min read · Jun 8, 2016

Understanding large corpora is an increasingly popular problem. Modern startups and established companies are working diligently to produce models that can extract meaningful data from a body of text.

In this post, I will explain some Natural Language Processing (NLP) techniques that can be used to extract the main subject of a particular document. In addition to identifying the main subject, I will explain a technique for extracting Subject, Verb, and Object (SVO) sets everywhere the subject is mentioned.

To further explain what I’m talking about, take a look at this TechCrunch article. Using the techniques explained below, we will be able to extract that the main subject matter is Snapchat, and more specifically that “Snapchat is raising money”.

I will be expanding upon some of the NLP techniques I outlined in my previous post. Refer to that article for more in-depth explanations of concepts such as tokenizing, part-of-speech (POS) tagging, and chunking.

For this example I will be using NLTK, BeautifulSoup, and the requests library.

To get started, let’s download an article from the web using a simple GET request.

import requests

url = 'http://techcrunch.com/2016/05/26/snapchat-series-f/'
r = requests.get(url)

Once downloaded we can pass the document to BeautifulSoup to parse out the body and title. Note: I’m only using the text that is included in <p> tags. Any text that falls outside of that will be excluded.

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
title = soup.find('title').get_text()
document = ' '.join([p.get_text() for p in soup.find_all('p')])

After downloading the document, it’s important to do some cleaning (pre-processing). I removed any characters that weren’t letters, keeping only spaces and a couple of punctuation marks, and then stripped out stop words.

import re
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
document = re.sub('[^A-Za-z .-]+', ' ', document)
document = ' '.join(document.split())
document = ' '.join([i for i in document.split() if i not in stop])

The next step is extracting the subject matter, or the main word or phrase, from the document. To accomplish this I first calculated a word frequency distribution (bag of words). The most frequent nouns were set aside to be used later. To determine the POS tag, I used NLTK’s built-in method, which requires that we first tokenize the document.

import nltk

NOUNS = ['NN', 'NNS', 'NNP', 'NNPS']  # POS tags that mark nouns
words = nltk.tokenize.word_tokenize(document)
words = [word.lower() for word in words if word not in stop]
fdist = nltk.FreqDist(words)
most_freq_nouns = [w for w, c in fdist.most_common(10)
                   if nltk.pos_tag([w])[0][1] in NOUNS]

Now that we have the most frequently used nouns, we can look for named entities. Named entity extraction was covered in my last post. For this project we’ll be using the same chunking method.
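Since that chunking code lives in the earlier post, below is a minimal stand-in sketch using NLTK’s built-in ne_chunk. It is not necessarily the method used in the original project; it simply produces the top_10_entities list that the next snippet expects.

# Stand-in sketch (not the original chunker): collect named entities with
# NLTK's ne_chunk and keep the ten most frequent ones as top_10_entities.
sentences = nltk.sent_tokenize(document)
entities = []
for sent in sentences:
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
    for subtree in tree.subtrees(lambda t: t.label() in ('PERSON', 'ORGANIZATION', 'GPE')):
        entities.append(' '.join(token for token, tag in subtree.leaves()))
entity_fdist = nltk.FreqDist(entities)
top_10_entities = [e for e, c in entity_fdist.most_common(10)]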

Using the named entities extracted with the method above, we can now choose the most relevant ones by taking the intersection of the named entities and the most frequently mentioned nouns.

subject_nouns = [entity for entity in top_10_entities
                 if entity.split()[0] in most_freq_nouns]

This will leave us with phrases like “Snapchat”.

Given that we’ve found the key subject noun (“Snapchat”), we can now extract the Subject-Verb-Object (SVO) sets for all phrases where Snapchat was mentioned. SVO is a common sentence structure used in many languages. In this structure you’ll find the Subject comes first, followed by the Verb (or action), then finally the Object. An example is listed below. You can read more about SVO here.

(Image: Subject-Verb-Object example — e.g., in “Snapchat is raising money”, “Snapchat” is the subject, “is raising” the verb, and “money” the object.)

SVO can help us understand what a particular sentence is talking about, and through this, make inferences about the whole body of text. To get this information we need to take our tokenized sentences and run them through an n-gram tagging model. In order to get a more accurate result I chose to use NLTK’s TrigramTagger, with a Bigram, a Unigram and a Default backoff tagger. A backoff tagger will attempt to tag any words that the previous tagger was unable to tag. To train this model, I chose to use the Brown, CoNLL2000 and Treebank corpora, all of which are included with NLTK.
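The SubjectTrigramTagger class instantiated below isn’t shown in the post itself. Here is a minimal sketch of what it might look like, assuming it simply chains NLTK’s TrigramTagger, BigramTagger, UnigramTagger and DefaultTagger via backoff as described above; the actual implementation is in the repository linked at the end.

class SubjectTrigramTagger(object):
    """Trigram tagger that backs off to bigram, unigram, then a default tag."""
    def __init__(self, train_sents):
        t0 = nltk.DefaultTagger('NN')
        t1 = nltk.UnigramTagger(train_sents, backoff=t0)
        t2 = nltk.BigramTagger(train_sents, backoff=t1)
        self.tagger = nltk.TrigramTagger(train_sents, backoff=t2)

    def tag(self, tokens):
        return self.tagger.tag(tokens)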

train_sents = nltk.corpus.brown.tagged_sents()
train_sents += nltk.corpus.conll2000.tagged_sents()
train_sents += nltk.corpus.treebank.tagged_sents()
# Create instance of SubjectTrigramTagger
trigram_tagger = SubjectTrigramTagger(train_sents)

Once the trigram tagger has been trained and applied to the tokenized sentences, we can iterate through all the sentences and pull out the SVOs wherever the subject is mentioned.
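The extraction loop itself isn’t included in the post, so below is a rough sketch of one way it could work, assuming a simple heuristic: take the first verb after the subject and the first noun after that verb. The extract_svo helper is hypothetical; the full extraction logic is in the linked repository.

# Hypothetical helper (not the original code): find the subject in a tagged
# sentence, then take the first verb after it and the first noun after that.
def extract_svo(tagged_sent, subject):
    words = [w.lower() for w, t in tagged_sent]
    if subject.lower() not in words:
        return None
    subj_idx = words.index(subject.lower())
    verb = next(((w, i) for i, (w, t) in enumerate(tagged_sent)
                 if i > subj_idx and t and t.startswith('VB')), None)
    if verb is None:
        return None
    obj = next((w for i, (w, t) in enumerate(tagged_sent)
                if i > verb[1] and t and t.startswith('NN')), None)
    return {'subject': subject, 'verb': verb[0], 'object': obj}

svos = []
for sent in sentences:
    tagged = trigram_tagger.tag(nltk.word_tokenize(sent))
    svo = extract_svo(tagged, 'Snapchat')  # the subject noun found earlier
    if svo:
        svos.append(svo)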

Finally, we can use the code above and return the Subject, Verb and Object for each relevant sentence as well as the overall phrase.

The full code used in this project can be found here. In my next post, I will talk about ways to summarize a body of text.
