Introduction to Natural Language Processing

Srishti Sawla
May 29, 2018

What is Natural Language Processing?

Natural language processing (NLP) helps computers understand human language as it is spoken. Natural languages such as English, Hindi, German, and French do not follow a fixed, formal structure in real-world use, and they keep evolving. NLP is an ongoing attempt to capture all the details of natural languages.

It sits at the intersection of computer science, artificial intelligence, and computational linguistics (Wikipedia).

Did you know that you use natural language processing every day?

1. Autocomplete suggests the rest of the word as you type.

2. Google Search's predictive typing suggests the next word.

3. The spell checker in your email application saves you from careless typing errors.

4. Spam detection in your mailbox separates spam from important mail.

Over the years, there have been many advancements in natural language processing.

NLP Terminology:

Let's go through the basic terms used in NLP:

Tokenization, Corpus or Corpora, Stemming, Bag of Words, Stop Words, Tf-idf, Disambiguation, Topic Models, Word Boundaries

Tokenization:

Tokenization is the process of splitting longer strings into smaller pieces. Large documents can be tokenized into paragraphs, paragraphs into sentences, and sentences into phrases, words, or letters.
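As a rough illustration, the sketch below tokenizes a short text into sentences and then into words using only Python's standard `re` module. The function names `split_sentences` and `split_words` and the splitting rules are assumptions made for this example; real tokenizers handle abbreviations, numbers, and many other cases that this sketch ignores.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "NLP is fun. It helps computers read text!"
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

print(sentences)
print(words)
```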

Corpus or Corpora:

A large, structured collection of texts is known as a corpus (plural: corpora).

Stemming:

Stemming is the process of removing affixes (prefixes, suffixes, infixes, circumfixes) from a word in order to obtain its stem or root.

going -> go, played -> play. Mapping irregular forms such as am/are/is -> be goes beyond stripping affixes and is the job of lemmatization.

A term closely associated with stemming is lemmatization. There is a slight difference between the two.

Stemming cuts off the end or the beginning of a word, using a list of common prefixes and suffixes.

Form: Studies, Suffix: es, Stem: Studi

Form: Studying, Suffix: ing, Stem: Study

Lemmatization takes the morphological analysis of the word into account.

Form: Studies, Lemma: Study

Form: Studying, Lemma: Study

Lemmatization generally gives more accurate results than stemming, but a stemmer is far easier to build, because a lemmatizer needs deep linguistic knowledge to find the proper form of a word.
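The difference can be sketched in a few lines of Python. This is a toy illustration only: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical and written by hand for this example, whereas real stemmers and lemmatizers rely on much larger rule sets and dictionaries.

```python
# Hypothetical, tiny suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary standing in for real morphological analysis.
LEMMAS = {"studies": "study", "studying": "study",
          "am": "be", "are": "be", "is": "be"}

def naive_lemma(word):
    word = word.lower()
    # Fall back to the word itself when it is not in the dictionary.
    return LEMMAS.get(word, word)

for form in ("Studies", "Studying", "is"):
    print(form, "stem:", naive_stem(form), "lemma:", naive_lemma(form))
```

The stemmer needs only a list of suffixes, while the lemmatizer needs an entry for every word form it should recognize, which is why a stemmer is the easier of the two to build.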

Bag of Words:

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur.
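A minimal sketch of this idea, assuming whitespace tokenization and word counts as the measure of presence (the function name `bag_of_words` and the sample documents are made up for this example):

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    # Count each token, then report the count of every vocabulary word.
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary of known words: every distinct word in the documents, sorted.
vocabulary = sorted({w for d in docs for w in d.lower().split()})
vectors = [bag_of_words(d, vocabulary) for d in docs]

print(vocabulary)
print(vectors)

# Shuffling the words gives the same vector: order is discarded.
shuffled = bag_of_words("mat the on sat cat the", vocabulary)
print(shuffled == vectors[0])
```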

Stop Words:

Consider words like a, an, the, and be. These words add little information to a sentence and can often create noise during modeling. Such words are known as stop words.
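Removing stop words can be sketched as a simple filter. The `STOP_WORDS` set below is a small, hand-picked list used only for illustration; practical stop-word lists are much longer and depend on the language and the task.

```python
# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "be", "is", "are", "of", "on"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens)
print(filtered)
```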

Tf-idf:

Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that indicates how important a word is to a document in a collection of documents.

Term frequency measures how often a term occurs in a document.

TF(t) = (Number of times term t occurs in a document) / (Total number of terms in the document)

Inverse document frequency measures how important a term is. When calculating TF, all terms are considered equally important. However, certain terms like “and”, “is”, and “are” appear very often but carry little importance. Thus rare terms should be scaled up and frequent terms should be weighed down.

IDF(t) = log(Total number of documents / Number of documents containing term t)
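The two formulas can be tried out with a short sketch. It assumes the natural logarithm (the formula above does not fix a base), whitespace tokenization, and the common convention of multiplying TF by IDF to get the final score; the function names and the three-document corpus are made up for this example.

```python
import math

def tf(term, doc_tokens):
    # Share of the document's tokens that are this term.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Assumes the term occurs in at least one document of the corpus.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]

for term in ("the", "sat", "cat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In this corpus "the" occurs in every document, so its IDF is log(1) = 0 and its score drops to zero even though it is the most frequent term in the first document. The rarer "cat" ends up with the highest score of the three.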

Disambiguation:

One of the major challenges in NLP is disambiguation. A single word can have multiple meanings, which can make it hard for machines to interpret. For example, "lead" can be used in two different contexts:

A pencil is made of lead.

The Prime Minister would lead the rally on Sunday.
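One simple way to approach this is to compare the words around the ambiguous word with a short description of each sense and pick the sense with the largest overlap, a simplified, Lesk-style heuristic. The sketch below is only an illustration: the sense labels, the descriptions in `SENSES`, and the stop-word list are hypothetical and hand-written for these two sentences.

```python
# Hypothetical, hand-written descriptions of two senses of "lead".
SENSES = {
    "material": "soft heavy metal or the writing core of a pencil",
    "guide": "to go in front of a group rally or march and show the way",
}

# Small illustrative stop-word list so common words do not count as overlap.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "to", "or", "and"}

def content_words(text):
    # Lowercase, drop full stops, split on whitespace, remove stop words.
    return {w for w in text.lower().replace(".", "").split()
            if w not in STOP_WORDS}

def disambiguate(sentence, senses):
    context = content_words(sentence)
    # Score each sense by the number of words shared with the sentence.
    scores = {name: len(context & content_words(gloss))
              for name, gloss in senses.items()}
    return max(scores, key=scores.get)

print(disambiguate("A pencil is made of lead.", SENSES))
print(disambiguate("The Prime Minister would lead the rally on Sunday.", SENSES))
```

Here "pencil" pulls the first sentence toward the material sense and "rally" pulls the second toward the guiding sense. With no shared words at all, the heuristic has nothing to go on, which is one reason disambiguation remains hard.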

Topic Models:

In machine learning and NLP, topic models are a type of statistical model that helps discover the abstract topics that occur in a collection of documents.

Word Boundaries:

In written language, punctuation marks help us determine the end of a sentence or paragraph. In verbal communication, however, word boundary detection plays an important role. Since a spoken utterance in any natural language carries no explicit sign of where a word starts or ends, or of how many words it contains, one must study the intonation patterns of the particular language. This remains an open area of research.

NLP using Python:

There are a few open-source packages available for NLP.

All of these packages have their own pros and cons, which are out of scope for this blog post.

I hope you enjoyed reading this blog post.


