Analyzing The New York Times Articles with Python: Part II — Text Data Preprocessing: A Step-by-Step Guide

Sanela I. · Published in CodeX · 9 min read · Jan 22, 2023

In the first part of this series, we delved into collecting data from The New York Times API. Now, in part two, we’re going to tackle the next important step in data science: preprocessing.

Preprocessing is a crucial step in data science, as it ensures that the data is clean and ready for analysis. But preprocessing can be a tedious and time-consuming task, especially when working with unstructured data. Imagine sifting through a pile of text, trying to extract useful information. It can be a daunting task, but natural language processing (NLP) is here to make it a breeze.

Photo by Andrea De Santis on Unsplash

NLP is a branch of artificial intelligence that deals with the interaction between computers and human (natural) languages. In this article, we’ll explore one of the most popular NLP libraries, NLTK, and learn how to use it to easily preprocess and clean our text data for analysis. From tokenization to collocations, we’ll cover all the necessary steps to get our data in tip-top shape. Say goodbye to messy and unstructured text, hello to clean and organized data with the power of NLTK!

Tokenization

Tokenization is the first step in preprocessing text data. It is the process of breaking up a string of text into smaller pieces, called tokens. NLTK provides several functions and classes for tokenizing text, such as word_tokenize, sent_tokenize, and others. For example, if we tokenize the sentence "Data science is the future of technology" using the word_tokenize function, we get the following output: ['Data', 'science', 'is', 'the', 'future', 'of', 'technology']. This can be useful for tasks like parsing and sentiment analysis, because it lets us analyze the individual words in a sentence and understand the overall meaning. The sent_tokenize function, on the other hand, splits a body of text into individual sentences, which is useful for tasks like text summarization. The regexp_tokenize function lets us tokenize text based on a specific regular expression, and the TweetTokenizer class is designed specifically for tokenizing tweets. With NLTK, there is a tokenizer for almost every situation, making it a powerful tool for preprocessing text data.
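
As a quick standalone illustration (assuming the punkt tokenizer models have been downloaded via nltk.download('punkt')):

# Quick look at word- and sentence-level tokenization
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Data science is the future of technology. NLTK makes it easy to work with."
print(word_tokenize(text))
# ['Data', 'science', 'is', 'the', 'future', 'of', 'technology', '.', 'NLTK', 'makes', 'it', 'easy', 'to', 'work', 'with', '.']
print(sent_tokenize(text))
# ['Data science is the future of technology.', 'NLTK makes it easy to work with.']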

Removing punctuations

Removing punctuation is another step in the preprocessing process. Punctuation marks such as full stops, commas, question marks, and exclamation marks are used to separate or mark different parts of a sentence, and they matter a great deal to human readers: a comma can change the meaning of a sentence entirely, as in “Let’s eat, Grandpa!”, and an exclamation mark can add emphasis, excitement, or surprise. For the word-level analysis we are doing here, however, punctuation tokens mostly add noise, so we remove them from the text data before starting any analysis with the NLTK library.
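
As a minimal sketch on an already-tokenized sentence (using Python's built-in string.punctuation here; the walkthrough below uses a custom punctuation list instead):

# Filtering single-character punctuation tokens out of a token list
import string
tokens = ['let', "'s", 'eat', ',', 'grandpa', '!']
print([t for t in tokens if t not in string.punctuation])
# ['let', "'s", 'eat', 'grandpa']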

Removing stop words

Removing stop words is another important step in text data preprocessing. Stop words are commonly used words in a language that do not carry much meaning on their own. These words include articles, prepositions, conjunctions, and others. For example, in the sentence “The cat in the hat”, the stop words “the” and “in” do not add much meaning to the sentence and can be removed without changing the overall meaning of the sentence, which would leave us with “cat hat”. Removing stop words helps to clean the data, reduce the dimensionality, and improve the efficiency of text analysis algorithms. By removing unnecessary words, we can focus on the important words that carry more meaning in the text, leading to a better understanding of the text and the ability to extract valuable insights from it.

We will also remove any tokens of four characters or fewer (the code below keeps only tokens longer than four characters). These short tokens are typically not useful for our purposes and mostly add noise to the analysis.
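
A minimal sketch of both filters together (assuming the stopwords corpus has been downloaded via nltk.download('stopwords')):

# Dropping stop words and very short tokens
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = ['the', 'cat', 'in', 'the', 'hat', 'sat', 'quietly']
print([t for t in tokens if t not in stop_words and len(t) > 4])
# ['quietly']  ('cat' and 'hat' pass the stop word filter but fall to the length cutoff)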

Lemmatization

After removing punctuation, stop words, and short tokens, we will apply lemmatization to our text data. Lemmatization is the process of reducing a word to its base form (its lemma): for example, “voices” becomes “voice”, and, with a verb part-of-speech tag, “running” becomes “run”. This helps to reduce the number of unique words in our text data and can improve the accuracy of our analysis. We will use NLTK’s WordNetLemmatizer to apply lemmatization to our text data. Its lemmatize method takes a token as input and returns the base form of the word, treating the token as a noun unless another part of speech is specified.
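
A quick sketch of this behaviour (assuming the wordnet data has been downloaded via nltk.download('wordnet')):

# Lemmatizing a few tokens with WordNet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('voices'))            # 'voice' (nouns are the default)
print(lemmatizer.lemmatize('running'))           # 'running' (unchanged without a POS hint)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'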

Collocations

Finally, we will use collocations to find frequently occurring groups of words in our text data. Collocations are not just random combinations of words; they have a specific meaning and context. For example, the phrase “strong coffee” is a collocation because it consists of two words that commonly occur together and carry a specific meaning. NLTK provides built-in collocation finders for bigrams, trigrams, and higher-order n-grams. Bigrams are pairs of words that commonly occur together, such as “new york”, while trigrams are three-word sequences, such as “new york times”. By identifying collocations, we can understand the language better and improve the accuracy of text analysis and NLP models.
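
As a small standalone sketch on a toy corpus (the walkthrough below runs the same kind of finder over the article tokens):

# Counting bigrams on a tiny toy corpus
from nltk.collocations import BigramCollocationFinder
docs = [['new', 'york', 'times'], ['new', 'york', 'city'], ['new', 'york', 'times']]
finder = BigramCollocationFinder.from_documents(docs)
print(finder.ngram_fd.most_common(2))
# [(('new', 'york'), 3), (('york', 'times'), 2)]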

Are you ready to clean, organize, and prepare your text data for further analysis? Let’s dive into preprocessing and collocation analysis with Python.

1. Initial Setup:

In this section, we start by importing the necessary libraries (install any that are missing with pip first).

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder, TrigramCollocationFinder
from nltk.tokenize import word_tokenize, MWETokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import warnings
warnings.simplefilter('ignore', category=DeprecationWarning)
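
If this is your first time using NLTK, the steps below also rely on a few downloadable resources; a one-time setup along these lines should cover them (resource names assume a reasonably recent NLTK release):

# One-time download of the NLTK resources used below
nltk.download('punkt')      # tokenizer models for word_tokenize (newer NLTK releases may also ask for 'punkt_tab')
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # WordNet data for the lemmatizer
nltk.download('omw-1.4')    # extra WordNet data required by newer NLTK versions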

2. Importing Data:

Time to bring in the data! We’ll be using articles from The New York Times for this guide.

# Importing the dataset
df = pd.read_csv('https://raw.githubusercontent.com/sivosevic/NYTimesNLP/main/TechArticles.csv')
df
 Unnamed: 0 title abstract
0 0 Ty Haney Is Doing Things Differently This Time The Outdoor Voices founder has a new venture t...
1 1 Washington State Advances Landmark Deal on Gig... Lawmakers have passed legislation granting ben...
2 2 Google Suspends Advertising in Russia The move came after a Russian regulator demand...
3 3 When Electric Cars Rule the Road, They’ll Need... A wireless infrastructure company is betting i...
4 4 A coalition of state attorneys general opens a... The group is looking into the Chinese-owned vi...
... ... ... ...
1195 1195 Technology Briefing PEOPLEAT&T EXECUTIVE JOINS PALM Palm Inc. has ...
1196 1196 Technology Briefing INTERNET CNET WARNS OF LOWER SALES The online ...
1197 1197 Technology Briefing INTERNET SEARCH PROVIDER FOR NBCI AND EXPLORER...
1198 1198 Technology Briefing TELECOMMUNICATIONSNOKIA TO BUY AMBER NETWORKS ...
1199 1199 Technology Briefing INTERNET RETRENCHMENT AT CMGI CMGI , the once-...

3. Preprocessing:

a) This is where the magic happens! We’ll begin by combining each article’s title and abstract into a single body of lowercase text.

# Combining each article's title and abstract into a single body of lowercase text
titles = df['title'].to_numpy()
abstracts = df['abstract'].to_numpy()
articles = [((str(titles[i]) + ' ' + str(abstracts[i])).lower()) for i in range(len(titles))]
articles[:10]
['ty haney is doing things differently this time the outdoor voices founder has a new venture that aims to reward customers with blockchain-based assets. but do brand loyalists really want nfts?',
'washington state advances landmark deal on gig drivers’ job status lawmakers have passed legislation granting benefits and protections, but allowing lyft and uber to continue to treat drivers as contractors.',
'google suspends advertising in russia the move came after a russian regulator demanded that the company stop showing ads with what the regulator claimed was false information about the invasion of ukraine.',
'when electric cars rule the road, they’ll need spots to power up a wireless infrastructure company is betting it can figure out how to locate and install charging stations for a growing wave of new vehicles.',
'a coalition of state attorneys general opens an investigation into tiktok. the group is looking into the chinese-owned video site for the harms it may pose to younger users.',
'millions for crypto start-ups, no real names necessary investors give money to pseudonymous developers. venture capitalists back founders without learning their real names. what happens when they need to know?',
'russia intensifies censorship campaign, pressuring tech giants google, apple and others were warned that they must comply with a new law, which would make them more vulnerable to the kremlin’s censorship demands.',
'justice dept. sues to block $13 billion deal by unitedhealth group the agency’s lawsuit against the deal for a health technology company is the latest move by the biden administration to quash corporate consolidation.',
'russia could use cryptocurrency to blunt the force of u.s. sanctions russian companies have many cryptocurrency tools at their disposal to evade sanctions, including a so-called digital ruble and ransomware.',
'china, not spacex, may be source of rocket part crashing into moon the developer of astronomy software who said that elon musk’s company would cause a new crater on the moon says that he “had really gotten it wrong.”']

b) Tokenizing Text:

Let’s break down the text into individual words, or tokens. This step converts each article’s text into a list of tokens using the word_tokenize function.

# Tokenizing text
articles = [word_tokenize(article) for article in articles]
articles[:2]
[['ty',
'haney',
'is',
'doing',
'things',
'differently',
'this',
'time',
'the',
'outdoor',
'voices',
'founder',
'has',
'a',
'new',
'venture',
'that',
'aims',
'to',
'reward',
'customers',
'with',
'blockchain-based',
'assets',
'.',
'but',
'do',
'brand',
'loyalists',
'really',
'want',
'nfts',
'?'],

c) Removing Punctuation, Stop Words, and Small Tokens:

We’ll remove anything that doesn’t add value to our analysis: punctuation marks, stop words (extended with a few words such as “technology” and “company” that appear in almost every article in this dataset), hyphenated tokens, and tokens of four characters or fewer.

# Removing punctuation, stop words, short tokens, and hyphenated tokens
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union({"technology", "company", "percent", "briefing", "million", "service", "internet"})
punctuations = set(".,\"-\\/#!?$%^&*;:{}=_'~()")
articles = [[token for token in article
             if token not in punctuations and token not in stop_words
             and len(token) > 4 and '-' not in token] for article in articles]
articles[:3]
[['haney',
'things',
'differently',
'outdoor',
'voices',
'founder',
'venture',
'reward',
'customers',
'assets',
'brand',
'loyalists',
'really'],
['washington',
'state',
'advances',
'landmark',
'drivers',
'status',
'lawmakers',
'passed',
'legislation',
'granting',
'benefits',
'protections',
'allowing',
'continue',
'treat',
'drivers',
'contractors'],

d) Lemmatization:

We’ll standardize the words and make them more meaningful by applying lemmatization.

# Lemmatizing words
lemmatizer = WordNetLemmatizer()
articles = [[lemmatizer.lemmatize(token) for token in article] for article in articles]
articles[:3]
[['haney',
'thing',
'differently',
'outdoor',
'voice',
'founder',
'venture',
'reward',
'customer',
'asset',
'brand',
'loyalist',
'really'],
['washington',
'state',
'advance',
'landmark',
'driver',
'status',
'lawmaker',
'passed',
'legislation',
'granting',
'benefit',
'protection',
'allowing',
'continue',
'treat',
'driver',
'contractor'],

e) Finding Collocations:

Here, we’ll identify common bigrams and trigrams that occur in the text.

# Finding most common bigrams
bigram_finder = BigramCollocationFinder.from_documents(articles)
bigram_finder.apply_freq_filter(min_freq=3)
bigrams = list(bigram_finder.ngram_fd.items())
# Finding most common trigrams
trigram_finder = TrigramCollocationFinder.from_documents(articles)
trigram_finder.apply_freq_filter(min_freq=3)
trigrams = list(trigram_finder.ngram_fd.items())
print(bigrams[:3])
print(trigrams[:3])
[(('state', 'attorney'), 3), (('attorney', 'general'), 8), (('giant', 'google'), 3)]
[(('state', 'attorney', 'general'), 3), (('social', 'medium', 'platform'), 5), (('federal', 'trade', 'commission'), 3)]

f) Replacing Collocations in Text:

Finally, we’ll replace those identified collocations in the original text.

# Replacing collocations in text
bigrams = [bigram for bigram, freq in bigram_finder.ngram_fd.items()]
trigrams = [trigram for trigram, freq in trigram_finder.ngram_fd.items()]
mwe_tokenizer = MWETokenizer(bigrams + trigrams, separator='_') # here we are using _ as separator
articles = [mwe_tokenizer.tokenize(article) for article in articles]
articles[:15]
['coalition',
'state_attorney_general', #example of trigram
'open',
'investigation',
'tiktok',
'group',
'looking',
'video',
'harm',
'younger',
'user'],
['million',
'crypto',
'name',
'necessary',
'investor',
'money',
'pseudonymous',
'developer',
'venture',
'capitalist',
'founder',
'without',
'learning',
'name',
'happens'],
['russia',
'intensifies',
'censorship',
'campaign',
'pressuring',
'giant_google',#example of bigram
'apple',
'others',
'warned',
'comply',
'would',
'vulnerable',
'kremlin',
'censorship',
'demand'],
['justice',
'block',
'billion',
'unitedhealth',
'group',
'agency',
'lawsuit',
'health',
'latest',
'biden_administration',#example of bigram
'quash',
'corporate',
'consolidation'],

And there you have it, a step-by-step guide to preprocessing and collocation analysis.

In conclusion, preprocessing text data is an essential step in the data science pipeline, and NLTK is a powerful tool that can help make this process a breeze. From tokenization to collocations, we’ve covered the steps needed to get our text data in tip-top shape. With NLTK, we can easily clean and organize our text data, making it ready for analysis, and take the first step towards understanding the meaning behind it. In the next part of this series, we will dive deeper into the world of NLP and explore how to perform topic discovery and sentiment analysis on our collected data. So stay tuned and get ready to unlock the power of natural language processing with NLTK.

Jupyter Notebook with detailed code can be found here:

You can check my other writings at: https://medium.com/@eellaaivo

Thanks for reading, and keep an eye out for the next part of this series where we will delve deeper into topic discovery and sentiment analysis!
