Modern NLP: Tokenization, Embedding, and Text Classification
Learning modern Natural Language Processing with Python code.
Introduction
Natural Language Processing (NLP) has evolved enormously, and Large Language Models (LLMs) are the clearest proof of that. Thanks to modern techniques, it is now possible to chat with a computer much like chatting with another human being.
I remember the old-school NLP packages, such as nltk and spacy, and I miss their simplicity. Packages like those helped me with many projects where I was able to find patterns in texts, discover keywords, and even analyze some context with n-grams (sequences of n words that repeat within a text). Not to mention the beautiful word clouds that were so popular for a while.
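To make the n-gram idea concrete, here is a minimal pure-Python sketch of extracting and counting bigrams (two-word n-grams); the sample sentence and function name are illustrative, and nltk offers an equivalent helper in `nltk.util.ngrams`:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-token windows (n-grams) from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

# Illustrative sample text, split on whitespace as a naive tokenizer
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

# Count every bigram (n=2) to find repeating two-word patterns
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))  # ('the', 'cat') appears twice
```

Counting which n-grams recur is exactly the kind of lightweight pattern discovery the old-school packages made easy.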
But those days are gone. Old-school packages still have their place in text analysis, but when it comes to LLMs, there are modern NLP techniques at work, and those are what we will go over in this article.
Let’s get to them.
Modern NLP Concepts
In this article, here is what we will cover:
- How LLMs process text using tokenization
- What embeddings are and why they matter
- How the attention mechanism works in Transformers