NLP Glossary for Beginners

Demystify Natural Language Processing with these terms.

Mandy Gu

Natural language processing sounds a lot more daunting than it actually is. In reality, the basic requirements for dabbling in this area are rudimentary programming skills, familiarity with data science principles and a conceptual understanding of how machines process languages.

This glossary focuses on natural language processing for written language. It is structured similarly to my other glossary article, which covers general data science terms.


General Concepts

Natural Language Processing (NLP)

Vocabulary

Out of Vocabulary

Corpus (Plural: Corpora)

Documents

Preprocessing

Tokenization

(Word) Embeddings

n-grams

Transformers

Techniques

Parts of Speech (POS)

Parts of Speech Tagging

Normalization

Stop Words

Lemmatization

Stemming

Note on lemmatization vs. stemming: you might wonder why you would ever use stemming. Wouldn't this reduction step be more accurate if we knew the parts of speech? One clear advantage of stemming is that it is much faster. Another is that it sidesteps the errors introduced by automatic part-of-speech taggers.
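As a minimal sketch of the difference, here is a comparison using nltk's Porter stemmer and WordNet lemmatizer (assuming the wordnet data has been downloaded):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads needed by the WordNet lemmatizer
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "was", "been"]:
    # Stemming: fast, rule-based suffix stripping; no POS information needed
    stem = stemmer.stem(word)
    # Lemmatization: dictionary lookup; more accurate when given the POS
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word:10s} stem: {stem:8s} lemma: {lemma}")
```

Notice that the stemmer mangles irregular forms ("was" becomes "wa"), while the lemmatizer recovers the true base form ("be"), but only because we told it these were verbs.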

Common NLP Tasks

Sentiment Analysis

Machine Translation

Machine (Reading) Comprehension/Question Answering

Named Entity Recognition (NER)

Information Retrieval/Latent Semantic Indexing

Embeddings

Bag of Words

Consider these documents:

Document 1: high five

Document 2: I am old.

Document 3: She is five.

This gives us the vocabulary: high, five, I, am, old, she, is. For simplicity, we will ignore punctuation and normalize by converting to lower case. We can construct a matrix which represents the number of times each vocabulary term occurs in each document:

       Doc 1  Doc 2  Doc 3
high     1      0      0
five     1      0      1
I        0      1      0
am       0      1      0
old      0      1      0
she      0      0      1
is       0      0      1

This gives us the Bag of Words representation of each word and each document. Reading across a row gives the word representation: high is [1, 0, 0]. Reading down a column gives the document representation: Document 1 is [1, 1, 0, 0, 0, 0, 0].
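As a minimal sketch, the same matrix can be reproduced with scikit-learn's CountVectorizer (assuming scikit-learn >= 1.0; a custom token_pattern is passed because the default tokenizer silently drops single-character tokens such as "I"):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["high five", "I am old.", "She is five."]

# Lower-casing handles the normalization; the token pattern keeps
# single-character tokens like "I", which sklearn drops by default
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['am' 'five' 'high' 'i' 'is' 'old' 'she']  (alphabetical, not corpus order)
print(X.toarray())  # one row per document, one column per vocabulary term
```

Note that sklearn sorts the vocabulary alphabetically, so the columns are ordered differently than in the table above, but each document row carries the same counts.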

TF-IDF (term frequency — inverse document frequency)

The TF-IDF statistic for term i in document j is typically calculated as:

tfidf(i, j) = tf(i, j) × log(N / df(i))

where tf(i, j) is the number of times term i appears in document j, df(i) is the number of documents containing term i, and N is the total number of documents in the corpus. The log term down-weights terms that appear in many documents, since they carry little discriminative information.

The document vectors can be used as features for a variety of machine learning models (SVM, Naive Bayes, logistic regression, etc.).
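As a rough sketch with scikit-learn (whose TfidfVectorizer uses a smoothed variant of the formula above), this is how the documents from the Bag of Words example might be turned into TF-IDF features and fed to one of these models; the labels here are hypothetical, purely to illustrate the pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["high five", "I am old.", "She is five."]
labels = [0, 1, 1]  # hypothetical labels, only to illustrate the pipeline

# Build the documents x vocabulary TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Any standard classifier can consume these document vectors as features
clf = LogisticRegression()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["five high"])))
```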

word2vec

word2vec creates a high-dimensional feature space. More complicated embeddings such as these are well suited for recurrent neural networks, where word ordering is also taken into consideration.
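As a minimal sketch, here is how embeddings like these are commonly trained with the gensim library (the parameter names assume gensim 4.x, where the dimensionality argument is vector_size; older versions used size). With a toy corpus this small the resulting vectors are of course meaningless:

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of preprocessed tokens
sentences = [
    ["high", "five"],
    ["i", "am", "old"],
    ["she", "is", "five"],
]

# vector_size sets the dimensionality of the embedding space;
# sg=1 selects the skip-gram training algorithm (sg=0 is CBOW)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["five"].shape)         # (100,) - one dense vector per word
print(model.wv.most_similar("five"))  # nearest neighbours in the vector space
```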

Context Dependent Embeddings

Notable Python Libraries

sklearn

keras/tensorflow/pytorch

nltk

spaCy


Here are some terms that were not fully covered by this glossary. I hope to cover them in more detail in subsequent articles:

  • Contextual Language Embeddings

That’s it for now — stay tuned for next time!
