Collocations — identifying phrases that act like single words in Natural Language Processing

Nicha Ruchirawat
6 min read · Mar 16, 2018


What is a collocation? It is a phrase consisting of more than one word, where the words co-occur in a given context more often than their individual frequencies would suggest. For example, in a set of hospital-related documents, the phrase ‘CT scan’ occurs far more often than we would expect from the separate frequencies of ‘CT’ and ‘scan’. ‘CT scan’ is also a meaningful phrase.

The two most common types of collocation are bigrams and trigrams. Bigrams are two adjacent words, such as ‘CT scan’, ‘machine learning’, or ‘social media’. Trigrams are three adjacent words, such as ‘out of business’ or ‘Procter and Gamble’.

If we take just any adjacent words as our bigrams or trigrams, we will not get meaningful phrases. For example, the sentence ‘He uses social media’ contains the bigrams ‘He uses’, ‘uses social’, and ‘social media’. ‘He uses’ and ‘uses social’ do not mean anything, while ‘social media’ is a meaningful bigram. How do we make good selections for collocations? Raw co-occurrence counts may not be sufficient, since phrases such as ‘of the’ co-occur frequently but are not meaningful. We will explore several methods to filter out the most meaningful collocations: frequency counting, Pointwise Mutual Information (PMI), and hypothesis testing (t-test and chi-square).
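As a quick illustration, here is a minimal sketch of pulling the raw adjacent pairs out of that sentence with NLTK:

import nltk
# nltk.download('punkt')  #tokenizer model, if not already installed

tokens = nltk.word_tokenize('He uses social media')
print(list(nltk.bigrams(tokens)))
# [('He', 'uses'), ('uses', 'social'), ('social', 'media')]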

Some uses for collocation identification are:
a) Keyword extraction: identifying the most relevant keywords in documents to assess what aspects are most talked about
b) Feature engineering: bigrams/trigrams can be concatenated (e.g. social media -> social_media) and counted as a single word to improve analysis and topic modeling, and to create more meaningful features for predictive models in NLP problems

We will use hotel reviews data that can be downloaded here.

Before applying the different methods to choose the best bigrams/trigrams, we first need to preprocess the review text. Get the code to clean the text here.
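The full cleaning code is behind that link; as a stand-in, here is a minimal sketch of the kind of preprocessing assumed in what follows (the file and column names are hypothetical placeholders):

import nltk
import pandas as pd

#hypothetical file/column names; adjust to the actual dataset
reviews = pd.read_csv('hotel_reviews.csv')['review_text'].astype(str)

#lowercase and tokenize every review, then flatten into one token stream
tokens = [word for review in reviews for word in nltk.word_tokenize(review.lower())]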

We will then use NLTK’s tools to generate all possible bigrams and trigrams:

import nltk

bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)

Methods to Rank Collocations

1. Counting frequencies of adjacent words, with part-of-speech filters

The simplest method is to rank the most frequent bigrams or trigrams:

import pandas as pd

#bigrams
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)

#trigrams
trigram_freq = trigramFinder.ngram_fd.items()
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)
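The top of each ranked table can then be inspected directly, for example:

bigramFreqTable.head(10) #top-10 most frequent adjacent word pairs
trigramFreqTable.head(10) #top-10 most frequent adjacent word triples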

However, a common issue with this approach is that stray whitespace tokens, stop words, articles, prepositions, and pronouns dominate the top of the list, because they are frequent but not meaningful:

To fix this, we filter out collocations containing stop words and keep only the following part-of-speech structures:

  1. Bigrams: (Noun, Noun), (Adjective, Noun)
  2. Trigrams: (Adjective/Noun, Anything, Adjective/Noun)

These are common structures used in the literature, and they generally work well.
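To see what these part-of-speech tags look like, here is a quick check with NLTK's tagger (the output shown is what the default tagger typically returns; it may vary with the tagger model):

# nltk.download('averaged_perceptron_tagger')  #tagger model, if needed
print(nltk.pos_tag(['social', 'media']))
# e.g. [('social', 'JJ'), ('media', 'NNS')] -> an (Adjective, Noun) pair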

from nltk.corpus import stopwords

#get english stopwords
en_stopwords = set(stopwords.words('english'))

#function to filter for ADJ/NN bigrams
def rightTypes(ngram):
    #drop cleaning artifacts: spaCy's '-PRON-' lemma placeholder and stray 't' fragments from contractions
    if '-pron-' in ngram or 't' in ngram:
        return False
    for word in ngram:
        if word in en_stopwords or word.isspace():
            return False
    acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    second_type = ('NN', 'NNS', 'NNP', 'NNPS')
    #note: the pair is tagged on its own, without sentence context
    tags = nltk.pos_tag(ngram)
    if tags[0][1] in acceptable_types and tags[1][1] in second_type:
        return True
    else:
        return False

#filter bigrams
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(lambda x: rightTypes(x))]

#function to filter for (ADJ/NN, anything, ADJ/NN) trigrams
def rightTypesTri(ngram):
    if '-pron-' in ngram or 't' in ngram:
        return False
    for word in ngram:
        if word in en_stopwords or word.isspace():
            return False
    first_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    third_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    tags = nltk.pos_tag(ngram)
    if tags[0][1] in first_type and tags[2][1] in third_type:
        return True
    else:
        return False

#filter trigrams
filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(lambda x: rightTypesTri(x))]

Results after filtering:

Much better!

2. Pointwise Mutual Information

The Pointwise Mutual Information (PMI) score for a bigram (w1, w2) is:

PMI(w1, w2) = log2 [ P(w1, w2) / ( P(w1) × P(w2) ) ]

For a trigram (w1, w2, w3):

PMI(w1, w2, w3) = log2 [ P(w1, w2, w3) / ( P(w1) × P(w2) × P(w3) ) ]

The main intuition is that PMI measures how much more likely the words are to co-occur than they would be if they were independent. However, it is very sensitive to rare combinations of words. For example, if a random bigram ‘abc xyz’ appears, and neither ‘abc’ nor ‘xyz’ appears anywhere else in the text, ‘abc xyz’ will be identified as a highly significant bigram, when it could just be a random misspelling or a phrase too rare to generalize as a bigram. Therefore, this method is often used with a frequency filter.

#filter for only those with more than 20 occurrences
bigramFinder.apply_freq_filter(20)
trigramFinder.apply_freq_filter(20)
bigramPMITable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)
trigramPMITable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.pmi)), columns=['trigram','PMI']).sort_values(by='PMI', ascending=False)
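To make the PMI formula concrete, here is a hand-rolled version for a single bigram, computed from raw counts (a sketch only; like NLTK's pmi measure, it uses log base 2):

import math

N = len(tokens) #total number of tokens
unigram_fd = nltk.FreqDist(tokens) #unigram counts
bigram_fd = nltk.FreqDist(nltk.bigrams(tokens)) #bigram counts

def pmi_by_hand(w1, w2):
    p_joint = bigram_fd[(w1, w2)] / N #P(w1, w2)
    p_w1 = unigram_fd[w1] / N #P(w1)
    p_w2 = unigram_fd[w2] / N #P(w2)
    return math.log2(p_joint / (p_w1 * p_w2))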

We can see that PMI picks up bigrams and trigrams whose words genuinely belong together.

3. Hypothesis Testing

a. t-test:

Consider a corpus of N words in which ‘social’ and ‘media’ have word counts C(social) and C(media) respectively. Under the null hypothesis that ‘social’ and ‘media’ occur independently, the probability of the bigram is:

H0: P(social media) = P(social) × P(media) = (C(social)/N) × (C(media)/N)

If we treat the corpus as N trials, each of which succeeds when the bigram ‘social media’ starts at that position, the sample mean is x̄ = C(social media)/N, and the test statistic is:

t = (x̄ - μ) / sqrt(s² / N)

where μ is the bigram probability expected under H0 and s² is the sample variance; for such a binary variable, s² = x̄(1 - x̄) ≈ x̄, since x̄ is small.

However, the same problem occurs here: pairs with prepositions, pronouns, articles, etc. come up as most significant. Therefore, we need to apply the same part-of-speech filters from method 1.

bigramTtable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.student_t)), columns=['bigram','t']).sort_values(by='t', ascending=False)
trigramTtable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.student_t)), columns=['trigram','t']).sort_values(by='t', ascending=False)
#filters
filteredT_bi = bigramTtable[bigramTtable.bigram.map(lambda x: rightTypes(x))]
filteredT_tri = trigramTtable[trigramTtable.trigram.map(lambda x: rightTypesTri(x))]
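For intuition, the same statistic can be computed by hand for one bigram (a sketch reusing N, unigram_fd, and bigram_fd from the PMI sketch above; it mirrors NLTK's student_t measure up to a small smoothing term):

import math

def t_by_hand(w1, w2):
    x_bar = bigram_fd[(w1, w2)] / N #observed bigram probability
    mu = (unigram_fd[w1] / N) * (unigram_fd[w2] / N) #expected under independence
    s2 = x_bar #Bernoulli variance x_bar*(1 - x_bar), approximated by x_bar
    return (x_bar - mu) / math.sqrt(s2 / N)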

The results are similar to those of the frequency counting technique from method 1:

The t-test has been criticized because it assumes a normal distribution of word occurrence probabilities, which generally does not hold. Therefore, we will also look at the chi-square test.

b. chi-square test:

First, for each word pair we compute a 2×2 contingency table of observed counts, like the one below for ‘social media’:

                     w2 = media            w2 ≠ media
w1 = social       C(social media)       C(social, ¬media)
w1 ≠ social       C(¬social, media)     C(¬social, ¬media)

The chi-square test, like the t-test, takes the independence of the two words as its null hypothesis. Its test statistic compares the observed count O_ij in each cell of the table with the count E_ij expected under independence, summed over the four cells (i, j):

χ² = Σ (O_ij - E_ij)² / E_ij
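Here is a hand-rolled version of this computation for one bigram (a sketch reusing N, unigram_fd, and bigram_fd from the PMI sketch above):

def chi_square_by_hand(w1, w2):
    o11 = bigram_fd[(w1, w2)] #w1 followed by w2
    o12 = unigram_fd[w1] - o11 #w1 followed by anything else
    o21 = unigram_fd[w2] - o11 #anything else followed by w2
    o22 = N - o11 - o12 - o21 #neither word in either slot
    observed = [o11, o12, o21, o22]
    rows = (o11 + o12, o21 + o22) #row totals
    cols = (o11 + o21, o12 + o22) #column totals
    expected = [rows[i] * cols[j] / N for i in (0, 1) for j in (0, 1)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))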

Results are as follows:

Comparing top 20 results of all methods:

Bigrams:

Trigrams:

We can see that the PMI and chi-square methods give pretty good results even without applying filters, and their results are quite similar to each other. The frequency and t-test methods are likewise similar to each other. In real applications, we can eyeball the list and set a threshold at the point where the list stops making sense. We can also run the different tests to see which list seems to make the most sense for a given dataset. Alternatively, we can combine results from multiple lists. Personally, I find it effective to multiply PMI and frequency, to take into account both probability lift and frequency of occurrence.
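As a sketch of that last idea, the two ranked tables built earlier can be merged on the bigram column and re-ranked by the product of the two scores:

combinedTable = bigramPMITable.merge(bigramFreqTable, on='bigram')
combinedTable['pmi_x_freq'] = combinedTable['PMI'] * combinedTable['freq']
combinedTable = combinedTable.sort_values(by='pmi_x_freq', ascending=False)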

For all the code used to generate the above results, click here.
