Implementing Various NLP Text Representation in Python
One-Hot Encoding, Bag-of-Words, N-Grams, and TF-IDF
Natural language processing (NLP) is a subset of machine learning that deals with language and semantics. A machine learns the semantics of words through training, just as in any other machine learning task. The problem is that almost all commonly used machine learning models accept only numeric inputs, so to train a model on text data we must first find a way to represent the text as a numeric vector. This article demonstrates some simple numeric text representations and how to implement them in Python.
For this tutorial, we are going to use the following data: reviews of a university subject. The texts are all fictitious. I have already preprocessed the data, namely stop-word removal, punctuation removal, and lemmatisation.
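If you want to apply the same preprocessing to your own text, here is a minimal sketch. It uses only the standard library, with a tiny hand-picked stop-word list for illustration; a real pipeline would use the full stop-word lists and lemmatisers from a library such as NLTK or spaCy (assumption: that is how the data here was prepared).

```python
import string

# Tiny illustrative stop-word list; real pipelines use a full list
# (e.g. NLTK's stopwords corpus).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "i"}

def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("The lectures were great, and I loved the subject!"))
# Lemmatisation (e.g. "lectures" -> "lecture") would additionally need
# something like nltk.stem.WordNetLemmatizer; it is omitted here.
```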
Let’s begin with our most straightforward representation. First, we load the data above with pandas:

import pandas as pd

df = pd.read_csv("data.csv")
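Before turning to the dataset itself, here is a minimal sketch of the one-hot idea on a toy corpus (the two example sentences below are my own, not the article's data): every unique token gets a vocabulary index, and each token is represented as a vector with a 1 at its index and 0 everywhere else.

```python
# Toy corpus of already-preprocessed documents (illustrative only).
docs = ["great subject love lecture", "boring lecture hard assignment"]

# Build a sorted vocabulary of all unique tokens across the corpus.
vocab = sorted({token for doc in docs for token in doc.split()})

def one_hot(token, vocab):
    """Return a vector with 1 at the token's vocabulary index, 0 elsewhere."""
    return [1 if token == v else 0 for v in vocab]

# Each document becomes a sequence of one-hot vectors, one per token.
encoded = [[one_hot(tok, vocab) for tok in doc.split()] for doc in docs]
```

Note that each vector has exactly one nonzero entry and its length equals the vocabulary size, which is why one-hot vectors grow quickly with larger corpora.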