Text Processing with Python for NLP Beginners: Basics (part 1)

Alexei Stepanov
4 min read · Jan 28, 2024


In a world where terms like NLP, GPTs, and text analytics buzz around us, text manipulation has become an indispensable skill in a data scientist’s toolkit. Recognizing this, I’ve set out to learn more about NLP and, along the way, decided to share what I learn.

This article presents the essential concepts of document, corpus, and vocabulary, introduces the NLTK library for natural language processing, and shows how to read text with Python’s open function and NLTK's corpus readers. It also offers practical methods for analyzing and exploring corpora to glean meaningful information from textual data.

A Dictionary Story print, Sam Winston, 2020. https://www.samwinston.com/projects/a-dictionary-story

1. Text Mining

Modern text mining

Text mining is a technique for extracting meaningful information from unstructured textual data. It uses machine learning and statistical methods to discover patterns and correlations, turning text into insights for informed decision-making. This process is applied to sources ranging from transcripts and speeches to social media posts and academic journals, addressing the increasing demand for data-driven strategies across sectors [1].

Document, corpus, and vocabulary

These fundamental concepts are central to NLP and are used extensively in libraries like Gensim, which is known for topic modeling, document indexing, and similarity retrieval. In NLP, a document is the basic unit of text, so it can refer to different objects: an entire file, a paragraph, or a single sentence. Suppose you have three rows in your dataset:

+-------+----------------------------------------------+
| index | text                                         |
+-------+----------------------------------------------+
| 1     | I like learning NLP                          |
| 2     | Now I can GPT best sausage stew              |
| 3     | Or write completely incomprehensible English |
+-------+----------------------------------------------+
In the example, “I like learning NLP” is a document. A corpus is a collection of documents; here, it is the collection of the three sentences above. The vocabulary is the list of all the words in the corpus, that is, all the words across all the documents.

The vocabulary list will be:

['I', 'like', 'learning', 'NLP', 'Now', 'can', 'GPT', 'best', 'sausage', 'stew', 'Or', 'write', 'completely', 'incomprehensible', 'English']
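To make this concrete, here is a minimal sketch of building such a vocabulary in plain Python (the variable names are mine, not from any library):

documents = [
    "I like learning NLP",
    "Now I can GPT best sausage stew",
    "Or write completely incomprehensible English",
]

# Collect every unique word, preserving first-seen order
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)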

NLTK library basics

The Natural Language Toolkit (NLTK) is an essential library in Python for processing and analyzing human language data. It comes packed with easy-to-use interfaces to over 50 corpora and lexical resources, as well as a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
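If you want to follow along, NLTK can be installed from PyPI with pip install nltk; individual resources are then downloaded on demand. For example:

import nltk

# Fetch the Punkt sentence tokenizer models, used by several NLTK functions
nltk.download('punkt')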

Several of NLTK’s functions will be demonstrated in the next sections and in the second part of this article.

2. Reading Text

Reading raw files

I am sure that reading raw text files might seem too trivial to those with Python experience, yet it remains the bedrock of text analysis: the starting line from which all subsequent analysis unfolds.

Python’s open function, combined with proper encoding, enables the handling of various text file formats. Real-world applications include data extraction from logs or transcripts, which can be processed using Python’s built-in functionalities or libraries such as pandas for structured text data.

# Read the whole file into a single string, decoding it as UTF-8
with open('example.txt', 'r', encoding='utf-8') as file:
    text = file.read()
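If the extracted data is tabular, it can be loaded with pandas instead; here is a minimal sketch, assuming a hypothetical logs.csv file with a message column:

import pandas as pd

# 'logs.csv' and its 'message' column are placeholder names for this example
df = pd.read_csv('logs.csv', encoding='utf-8')
messages = df['message'].tolist()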

Reading files with a corpus reader

NLTK’s corpus readers specialize in structured data retrieval from collections of text files (corpora). They abstract away the complexity of manual data loading, offering methods for direct text extraction, which is essential for building scalable NLP models that require standardized data input.

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Load every file in the 'corpus/' directory ('.*' is a regex matching all file names)
corpus = PlaintextCorpusReader('corpus/', '.*')
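Once loaded, the reader exposes the corpus through convenient accessors, assuming the 'corpus/' directory above contains plain-text files:

print(corpus.fileids())       # files matched by the '.*' pattern
print(corpus.words()[:10])    # first ten word tokens across the corpus
print(corpus.sents()[:2])     # first two sentences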

Explore corpus

Exploring a corpus involves not just reading the text but also understanding its composition — what languages are present, the variety of topics covered, and the styles of writing. This step often utilizes descriptive statistics and data visualization techniques to summarize the corpus characteristics.
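A minimal sketch of such descriptive statistics, reusing the corpus reader from the previous section:

words = [w.lower() for w in corpus.words()]

print('documents: ', len(corpus.fileids()))
print('tokens:    ', len(words))
print('vocabulary:', len(set(words)))
# Lexical diversity: the share of unique words among all tokens
print('lexical diversity:', len(set(words)) / len(words))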

Corpus analysis applies statistical models or NLP techniques to derive insights. It might include identifying term frequencies, discovering topic distributions with models like LDA, or understanding word co-occurrences. Such analyses are crucial for tasks like document classification or trend analysis in social media feeds.

from nltk.probability import FreqDist

# 'tokens' is a list of words, e.g. from the corpus reader above
tokens = [w.lower() for w in corpus.words()]
fdist = FreqDist(tokens)
print(fdist.most_common(10))   # ten most frequent tokens with their counts
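Gensim is not covered in detail here, but as a taste of topic modeling, here is a minimal sketch using its LdaModel on the toy documents from section 1 (real corpora need the preprocessing covered in part two):

from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents; replace with tokens from your own corpus
texts = [
    ['i', 'like', 'learning', 'nlp'],
    ['now', 'i', 'can', 'gpt', 'best', 'sausage', 'stew'],
    ['or', 'write', 'completely', 'incomprehensible', 'english'],
]

dictionary = corpora.Dictionary(texts)              # word <-> id mapping
bow = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

for topic_id, topic in lda.print_topics():
    print(topic_id, topic)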

Conclusion

We’ve covered the basics by introducing text mining, exploring the NLTK library, and illustrating how to read and process text data.

In the next episode, I will cover tokenization, stop words, stemming, and lemmatization. Explore the second part here.

References

  1. Hassani, H., Beneki, C., Unger, S., Mazinani, M. T., & Yeganegi, M. R. (2020). Text Mining in Big Data Analytics. Big Data and Cognitive Computing, 4(1), 1. https://doi.org/10.3390/bdcc4010001
  2. Yogish, D., Manjunath, T. N., & Hegadi, R. S. (2019). Review on Natural Language Processing Trends and Techniques Using NLTK. Communications in Computer and Information Science, 589–606. https://doi.org/10.1007/978-981-13-9187-3_53



Alexei Stepanov

Hi! I'm a data scientist, and this is my blog, where I sometimes express genuine curiosity about data science.