Data Science Collective

Advice, insights, and ideas from the Medium data science community


Modern NLP: Tokenization, Embedding, and Text Classification


Learning modern Natural Language Processing with Python code.

Text to Numbers | Image generated by AI. Gemini 3, 2025. https://gemini.google.com

Introduction

Natural Language Processing (NLP) has evolved dramatically, and Large Language Models (LLMs) are the best proof of it. Thanks to modern techniques, it is now possible to chat with a computer much like chatting with another human being.

I remember the old-school NLP packages, such as NLTK and spaCy, and I miss their simplicity. Packages like those helped me with many projects where I was able to find patterns in texts, discover keywords, and even analyze some context with n-grams (sequences of n consecutive words that repeat within a text). Not to mention the beautiful word clouds that were so popular for a while.
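As a quick illustration of the n-gram idea, here is a toy sketch in plain Python (no NLP library required; the `ngrams` helper below is just an illustrative function, not part of any package):

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

words = "the cat sat on the mat".split()
print(ngrams(words, 2))
# Bigrams: [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Counting how often each n-gram appears is the classic way to spot recurring word patterns in a corpus.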

But those days are gone. Old-school packages still have their place in text analysis, but when we are talking about LLMs, there are modern NLP techniques at work, and those are what we will go over in this article.

Let’s get to them.

Modern NLP Concepts

In this article, here is what we will cover:

  • How LLMs process text using tokenization
  • What embeddings are
  • How the Attention Mechanism works in Transformers
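As a preview of the first two topics, here is a deliberately simplified sketch: a toy word-level vocabulary and random embedding vectors. This is only a conceptual illustration; real LLMs use learned subword tokenizers (such as BPE or WordPiece) and embedding tables trained along with the model.

```python
import random

random.seed(0)

# Toy tokenizer: a fixed word-level vocabulary mapping strings to integer ids.
# (Real LLMs learn a subword vocabulary instead of using whole words.)
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(text):
    """Map each word to its id, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Toy embedding table: one random 4-dimensional vector per token id.
# In a trained model these vectors are learned so that related words
# end up close together in the vector space.
dim = 4
embeddings = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

ids = tokenize("The cat sat on the mat")
vectors = [embeddings[i] for i in ids]

print(ids)                            # [1, 2, 3, 4, 1, 5]
print(len(vectors), len(vectors[0]))  # 6 tokens, each a 4-dim vector
```

The two steps shown here, text to token ids, then token ids to vectors, are exactly the pipeline the sections below unpack in detail.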


Published in Data Science Collective


Written by Gustavo R Santos

Data Scientist | I solve business challenges through the power of data. | Visit my site: https://gustavorsantos.me
