The Library of Congress, Washington D.C.

Natural Language Processing with Python

Part 1: Tokens

Matt Kirby
Published in Analytics Vidhya · 6 min read · Aug 21, 2019


Natural Language Processing (NLP) is a ubiquitous application of modern machine learning techniques. You’d be hard pressed to find a professional data science team not using NLP in one form or another. It is for this reason I have set out to produce a gentle introduction to the subject, both its concepts and application.

This post covers some basic methods and tools for a barebones beginner. If you are a more advanced practitioner you might want to move on to Part 2: Vector Representations, or Part 3: Pipelines and Topic Modeling of this series.

Like any area of study, NLP has its own jargon, inherited not only from the fields of machine learning and data analysis but from linguistics as well. This mashup of technical terms can be intimidating at first, but I will do my best to keep the technobabble to a minimum.

I’m assuming at least a basic understanding of Python on the reader’s part, and some experience with machine learning would be helpful, though not necessary. As always, I’m providing copious links to documentation to help explain areas where my techniques or explanations are short on detail.

1. Tokenization

A token is a sequence of characters in a document that is useful for an analytical purpose. Tokens are often, but not always, individual words. A document in an NLP context simply means a collection of text: it could be a tweet, a book, or anything in between.

Attributes of good tokens

  • Tokens should always be stored in an iterable data structure (list, generator, etc.) to allow for easy future analysis
  • Tokens should all be the same case to reduce complexity
  • Tokens should be free of non-alphanumeric characters
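
The examples that follow reference a sample_text variable. Judging by the comment in Example 1, the original notebook uses the text of this section; any block of English text will work if you want to follow along. A minimal stand-in (just the opening sentences, so the first tokens match the output shown below) might look like this:

# Stand-in for the notebook's sample_text; any block of English text works here.
sample_text = (
    "A token is a sequence of characters in a document that is useful "
    "for an analytical purpose. Often, but not always, individual words."
)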

Example 1: Tokenization

import re

def tokenize(text):
    text = text.lower()
    # Keep only lowercase letters, digits, and spaces; the original pattern
    # ([^a-zA-Z ^0-9]) accidentally preserved literal '^' characters.
    text = re.sub(r'[^a-z0-9 ]', '', str(text))
    return text.split()

# We are using the text from above as our sample_text
tokens = tokenize(sample_text)
tokens[:10]
____________________________________________________________________
['a', 'token', 'is', 'a', 'sequence', 'of', 'characters', 'in', 'a', 'document']

2. Word Count

Once we have tokens we can perform some basic analysis on our document. Let’s start with getting a simple word count using a Python Counter object.

Example 2: Word Count

from collections import Counter

def word_counter(tokens):
    word_counts = Counter()
    word_counts.update(tokens)
    return word_counts

word_count = word_counter(tokens)
word_count.most_common(5)
____________________________________________________________________
[('a', 7), ('of', 4), ('in', 4), ('be', 4), ('an', 3)]

We can visualize these word counts with a simple bar chart.

Example 3: Word Count Visualization

import matplotlib.pyplot as plt

x = list(word_count.keys())[:10]
y = list(word_count.values())[:10]

plt.bar(x, y)
plt.show()
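
Note that slicing word_count.keys() takes the first ten tokens in insertion order, not the ten most frequent. If you want the top tokens instead, Counter.most_common is a convenient alternative (a quick variant, not part of the original example):

# Plot the ten most frequent tokens instead of the first ten encountered
top_words, top_counts = zip(*word_count.most_common(10))
plt.bar(top_words, top_counts)
plt.show()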

3. Document Makeup

Word counts can be useful on occasion, but often more advanced analytics are needed to answer business questions, especially when our data consists of more than a single document. We can set ourselves up for success by creating a Pandas DataFrame with a row per token and features for: the number of documents the token appears in, a count of the token’s appearances, its rank relative to other tokens, its percentage of the total document makeup, a running sum of those percentages, and the percentage of documents the token appears in.

Example 4: Document Makeup DataFrame

import pandas as pd

def count(docs):
    word_counts = Counter()
    appears_in = Counter()
    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    temp = list(zip(word_counts.keys(), word_counts.values()))

    # Word and count columns
    wc = pd.DataFrame(temp, columns=['word', 'count'])

    # Rank column
    wc['rank'] = wc['count'].rank(method='first', ascending=False)

    # Percent total column
    total = wc['count'].sum()
    wc['pct_total'] = wc['count'].apply(lambda x: x / total)

    # Cumulative percent total column
    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    # Appears in column
    t2 = list(zip(appears_in.keys(), appears_in.values()))
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    # Appears in percent column
    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')

wc = count([tokens])
wc.head()

Using this DataFrame we can generate a cumulative distribution plot showing how a token’s rank relates to the cumulative makeup of our documents.

Example 5: Cumulative Distribution Plot

import seaborn as sns

sns.lineplot(x='rank', y='cul_pct_total', data=wc)
plt.show()

From the cumulative distribution plot we can see that the 13 most common words make up about 45% of the document; after that, the cumulative total grows roughly linearly. We can see the relative percentage of document makeup of these top 13 tokens in a treemap.

Example 6: Treemap

import squarify

wc_top13 = wc[wc['rank'] <= 13]

squarify.plot(sizes=wc_top13['pct_total'], label=wc_top13['word'], alpha=.8)
plt.axis('off')
plt.show()

These examples demonstrate some of the initial principles required for understanding the deeper world of NLP. Tokenization, word counts, and document makeup analysis are foundational concepts, and while we could write an entire NLP library from scratch, that’s often not the most efficient way to solve problems. There is a zoo of open-source NLP libraries we could use to enhance our powers, but one in particular stands out as the leader.

4. Introducing SpaCy

spaCy is “a free, open-source library for advanced Natural Language Processing (NLP) in Python”, developed by explosion_ai. spaCy’s data model for documents is unique among NLP libraries: instead of storing a document’s components repeatedly in various data structures, spaCy indexes components and simply stores the lookup information. This is often why spaCy is considered more fit for production-grade projects than other libraries like NLTK.
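
You can see this hash-and-lookup design in spaCy’s StringStore: each string is stored once and referenced everywhere else by a 64-bit hash. A minimal sketch (using a blank English pipeline rather than a downloaded model, just to keep it self-contained):

import spacy

# A blank English pipeline is enough to demonstrate the string store
nlp = spacy.blank("en")
doc = nlp("A token is a sequence of characters")

# Each token holds a hash ID; the text itself lives once in the vocab
print(doc[1].orth)                       # 64-bit hash for "token"
print(nlp.vocab.strings[doc[1].orth])    # look the hash back up -> "token"
print(nlp.vocab.strings["token"] == doc[1].orth)  # True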

spaCy has a wide array of tools that can be utilized for NLP purposes.

Example 7: Tokens with SpaCy

import spacy

nlp = spacy.load("en_core_web_lg")

def spacy_tokenize(text):
    doc = nlp.tokenizer(text)
    return [token.text for token in doc]

spacy_tokens = spacy_tokenize(sample_text)
spacy_tokens[:10]
____________________________________________________________________
['A', 'token', 'is', 'a', 'sequence', 'of', 'characters', 'in', 'a', 'document']

5. Stopwords

Words such as “I”, “and”, “of”, etc. have almost no semantic meaning to us. We call these useless words “stopwords,” because we should ‘stop’ ourselves from including them in our analysis.

Most NLP libraries have built-in lists of stopwords covering common English words: conjunctions, articles, adverbs, pronouns, and common verbs. The best practice, however, is to extend or customize these standard English stopwords for your problem’s domain, as sketched below.
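
For instance, if every document in your corpus is about coffee, the word “coffee” carries little signal and could be treated as a stopword. A minimal sketch of extending spaCy’s default list (the extra terms here are made up for illustration):

from spacy.lang.en.stop_words import STOP_WORDS

# Hypothetical domain-specific terms to treat as stopwords; swap in your own
custom_stopwords = STOP_WORDS.union({"coffee", "espresso", "roast"})
print(len(STOP_WORDS), len(custom_stopwords))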

Example 8: Stopword Removal

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

def remove_stopwords(tokens):
    cleaned_tokens = []

    for token in tokens:
        if token not in spacy_stopwords:
            cleaned_tokens.append(token)

    return cleaned_tokens

cleaned_tokens = remove_stopwords(spacy_tokens)
cleaned_tokens[:10]
____________________________________________________________________
['A', 'token', 'sequence', 'characters', 'document', 'useful', 'analytical', 'purpose', '.', 'Often']
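
Notice that the capitalized “A” survives the filter: spaCy’s stopword list is lowercase and the membership check above is case-sensitive. A small variant (my own tweak, not from the original notebook) lowercases each token before checking:

def remove_stopwords_ci(tokens):
    # Case-insensitive variant: lowercase each token before the membership check
    return [token for token in tokens if token.lower() not in spacy_stopwords]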

6. Lemmas

Lemmatization looks up the lemma, or root form, of a word. A lemma, put simply, is the version of a word you would find in a dictionary. For example: run, runs, running, and ran are all forms of the same lexeme, with run as the lemma.

Example 9: Lemmatization

def spacy_lemmatize(text):
    doc = nlp.tokenizer(text)
    return [token.lemma_ for token in doc]

spacy_lemmas = spacy_lemmatize(sample_text)
spacy_lemmas[:10]
____________________________________________________________________
['A', 'token', 'be', 'a', 'sequence', 'of', 'character', 'in', 'a', 'document']
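
A caveat: pulling lemma_ straight off the tokenizer relies on spaCy’s lookup lemmas, which worked in the spaCy 2.x releases this post was written against. In spaCy 3.x the lemmatizer is a pipeline component, so you would run the full pipeline instead; a minimal sketch of that variant:

def spacy_lemmatize_full(text):
    # Run the full pipeline (tagger, lemmatizer, etc.) instead of only the tokenizer
    doc = nlp(text)
    return [token.lemma_ for token in doc]

spacy_lemmas = spacy_lemmatize_full(sample_text)
spacy_lemmas[:10]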

Final Words

These are just some of the basic methods and tools used in NLP. Tokens, document makeup, stopwords, and lemmas will all play a part in the wars to come, but are rather elementary on their own. My next post, Natural Language Processing with Python Part 2: Vector Representations, will dive deeper into NLP.

Big shoutouts to Lambda School and Jon-Cody Sokoll for compiling a curriculum that includes material like this and presenting it in such an approachable way.

Thanks for reading!

Find me on Twitter, GitHub, and LinkedIn

P.S. Here is the link to the Notebook I used for this post.
