NLP Pipeline 101 With Basic Code Example — Text Processing

Haitian Wei
4 min read · Mar 19, 2019


Introduction

People spend more and more time talking to each other on the internet these days, and extracting people's intentions from those messages has become important. As a result, NLP techniques are quite popular now, so I decided to write about the basic NLP pipeline and the most basic code for each step.

NLP PIPELINE

An NLP pipeline has three main stages: Text Processing, Feature Extraction, and Modeling. The major steps of each stage are shown in the picture below:

source: Udacity

In practice we constantly move back and forth between the three stages; the process rarely runs linearly. In the sections below I will go over each of these stages, starting with text processing.
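As a rough sketch, the three stages chain together like this (the function names below are only illustrative, not from any particular library):

def text_processing(raw_text):
    # clean, normalize, tokenize, remove stop words, stem/lemmatize
    ...

def feature_extraction(tokens):
    # turn tokens into numeric features, e.g. bag-of-words or TF-IDF
    ...

def modeling(features):
    # fit or apply a model on the extracted features
    ...

# In practice we often loop back to an earlier stage and refine it
features = feature_extraction(text_processing("some raw input text"))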

1 Text Processing

Text processing means taking raw input text, cleaning it, normalizing it, and converting it into a form that is suitable for feature extraction.

1.1 Cleaning

The cleaning procedure aims to remove irrelevant items, such as HTML tags. Powerful tools for this step include regular expressions (I have a regular expression 101 article here) and Beautiful Soup.

Beautiful Soup is a Python library used to extract data from HTML and XML documents. We initialize a soup object by giving it the HTML document and the parser to use:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://www.jianshu.com/').content
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
result = soup('div')  # shortcut for soup.find_all('div')

We can navigate, search, and modify the parse tree using Beautiful Soup. The most basic methods are find_all(), select(), select_one(), prettify(), and get_text(). The difference between find_all() and select() is that select() takes a CSS selector, so it can search through nested layers of tags; a few selector examples follow, with a short sketch of the other methods after them.

soup.select("html head title")  

soup.select('td div a') ##tag route td --> div --> a

soup.select('td > div > a')
soup.find_all("div", {"class":"course-summary-card"})
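The other methods mentioned above work on the same soup object. A minimal sketch, reusing the soup from the snippet above:

print(soup.get_text()[:200])   # get_text() strips all tags and returns only the text content
print(soup.prettify()[:200])   # prettify() returns the parse tree as an indented string

# select_one() returns the first match of a CSS selector, or None if nothing matches
first_link = soup.select_one('a')
if first_link is not None:
    print(first_link.get('href'))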

re is a Python module that provides regular expression matching operations. The most basic methods are findall(), match(), search(), sub(), and split(); a quick sketch of each is below.
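A minimal sketch of these methods on a short example string:

import re

text = "NLP 101: clean, normalize, tokenize."

re.findall(r"\w+", text)            # every word token: ['NLP', '101', 'clean', ...]
re.match(r"NLP", text)              # matches only at the beginning of the string
re.search(r"tokenize", text)        # matches anywhere in the string
re.sub(r"[^a-zA-Z0-9]", " ", text)  # replaces each non-alphanumeric character with a space
re.split(r"\s+", text)              # splits on whitespace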

1.2 Normalization

Normalization converts text to all lowercase and removes punctuation. The most basic methods to achieve that are lower() and re.sub(). Below are examples.

text = '''Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported.'''
text = text.lower()
print(text)
>>> edit the expression & text to see matches. roll over matches or the expression for details. pcre & javascript flavors of regex are supported.

And removing punctuation can be done like this:

import re
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)
>>> edit the expression text to see matches roll over matches or the expression for details pcre javascript flavors of regex are supported

1.3 Tokenize

Tokenizing means splitting text into words or tokens, which is usually done with nltk's word_tokenize. Below is example code.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
text = '''Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported.'''
words = word_tokenize(text)
print(words)
>>> ['Edit', 'the', 'Expression', '&', 'Text', 'to', 'see', 'matches', '.', 'Roll', 'over', 'matches', 'or', 'the', 'expression', 'for', 'details', '.', 'PCRE', '&', 'Javascript', 'flavors', 'of', 'RegEx', 'are', 'supported', '.']

And we can split text into sentences with sent_tokenize:

sentences = sent_tokenize(text)
print(sentences)
>>> ['Edit the Expression & Text to see matches.', 'Roll over matches or the expression for details.', 'PCRE & Javascript flavors of RegEx are supported.']

1.4 Stop word removal

This procedure removes words that are too common to carry much meaning (often called stop words). We will rely on nltk again, this time on its stopwords corpus.

import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = '''Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported.'''
# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
# Tokenize text
words = word_tokenize(text)
# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
print(words)
>>> ['edit', 'expression', 'text', 'see', 'matches', 'roll', 'matches', 'expression', 'details', 'pcre', 'javascript', 'flavors', 'regex', 'supported']

1.5 POS and NER

This procedure identifies the different parts of speech and named entities in the text. We still use nltk, this time pos_tag and ne_chunk.

import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
text = "I always lie down to tell a lie."
# tokenize text
sentence = word_tokenize(text)
# tag each word with part of speech
pos_tag(sentence)
>>> [('I', 'PRP'),
('always', 'RB'),
('lie', 'VBP'),
('down', 'RP'),
('to', 'TO'),
('tell', 'VB'),
('a', 'DT'),
('lie', 'NN'),
('.', '.')]

We use ne_chunk to find named entities:

text = "Jim will go to Beijing to study in Peking University"
# tokenize, pos tag, then recognize named entities in text
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)
>>>(S
(PERSON Jim/NNP)
will/MD
go/VB
to/TO
(GPE Beijing/NNP)
to/TO
study/VB
in/IN
(GPE Peking/NNP University/NNP))
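The result is an nltk Tree, so we can walk its subtrees to collect the recognized entities. A minimal sketch (the entity labels listed here are just the common ones, not an exhaustive set):

# Collect (label, entity text) pairs from the chunked tree
entities = []
for subtree in tree.subtrees():
    if subtree.label() in ('PERSON', 'GPE', 'ORGANIZATION'):
        entity = ' '.join(word for word, tag in subtree.leaves())
        entities.append((subtree.label(), entity))
print(entities)
>>> [('PERSON', 'Jim'), ('GPE', 'Beijing'), ('GPE', 'Peking University')]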

1.6 Stemming and Lemmatization

Stemming reduces words to their root forms, while lemmatization converts them to their dictionary forms. And still nltk! This time it's PorterStemmer and WordNetLemmatizer.

from nltk.stem.porter import PorterStemmer
words = ['renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice']
# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
print(stemmed)
>>> ['renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice']

And lemmatizing the same word list:

import nltk
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their dictionary forms, treating them as verbs
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
print(lemmed)
>>> ['renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice']
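Note that lemmatize() treats words as nouns by default, so passing the right part of speech matters. A quick comparison:

print(WordNetLemmatizer().lemmatize('boring'))           # 'boring' (treated as a noun by default)
print(WordNetLemmatizer().lemmatize('boring', pos='v'))  # 'bore' (treated as a verb)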

And that is what the text processing part of the NLP pipeline does. In the next article, I will go through the feature extraction part.
