Document Classification Part 2: Text Processing (N-Gram Model & TF-IDF Model)


In this article I will explain some core concepts of text processing used when applying machine learning to classify documents into categories. This is part 2 of a series outlined below:

Part 1: Intuition & How Do We Work With Documents?

Part 2: Text Processing (N-Gram Model & TF-IDF Model)

Part 3: Detection Algorithm (Support Vector Machines & Gradient Descent)

Part 4: Variations of This Approach (Malware Detection Using Document Classification)

In part 1, we went over representing documents as numerical vectors. But this simplistic approach has many problems. Here is where we ended up in the last article:

Sentence 1 (S1): "vectorize this text to have numbers!"
Sentence 2 (S2): "what does it mean to vectorize?"
Sentence 3 (S3): "document classification is cool"

Then we created the vocabulary set:

V = ['classification', 'cool', 'document', 'does', 'have', 'is', 'it', 'mean', 'numbers', 'text', 'this', 'to', 'vectorize', 'what']

Then we represented each sentence as a vector: each index corresponds to a word in the vocabulary set, and the value at that index is the number of times that word occurs in the sentence.
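To make this concrete, here is a minimal sketch in Python of the two steps above: building the vocabulary from the three sentences, then turning each sentence into a count vector. The helper names (`tokenize`, `count_vector`) are my own, and the tokenizer simply lowercases the text and keeps alphabetic tokens; a real pipeline would make more careful tokenization choices.

```python
import re

# The three example sentences from the article
sentences = [
    "vectorize this text to have numbers!",
    "what does it mean to vectorize?",
    "document classification is cool",
]

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Build the sorted vocabulary of unique words across all sentences
vocab = sorted({word for s in sentences for word in tokenize(s)})

def count_vector(text, vocab):
    """Map a sentence to a vector of word counts aligned with vocab."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocab]

vectors = [count_vector(s, vocab) for s in sentences]
print(vocab)
print(vectors[0])  # count vector for S1
```

Running this reproduces the vocabulary V listed above, and S1 becomes a 14-dimensional vector with 1s at the positions of 'have', 'numbers', 'text', 'this', 'to', and 'vectorize', and 0s everywhere else.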
