NLP is ubiquitous in technology we use daily, such as email filters, search results, predictive text, and smart assistants. There are also a few NLP data science project ideas floating around Ro, and I have a few personal projects in mind where NLP/data mining would be very handy, so I figured now would be a great time to learn about basic techniques through this course!
Overall, I think that the original lessons (Getting an Idea of NLP and its Applications) in this course are comprehensive and well structured, although they focus more on data mining (superficial text analysis). The instructor added a lot more NLP content later on (which, unlike data mining, aims to preserve words’ contexts and sentence structures), and while I think that the information is useful, I often found myself Googling definitions or concepts for clarification right after the instructor had covered them.
Below are some of the higher-level concepts that I familiarized myself with:
Methods to Preprocess Text
Text is messy, and as with any dataset, we need to make sure the data is clean, structured, and easy to work with.
Tokenizing — Breaking up text data into its individual parts (tokens): words, punctuation marks, numbers
Example: “Hello, it’s me.” → “Hello”, “,”, “it”, “’s”, “me”, “.”
Notice that the contraction is split up, and each punctuation mark is its own token.
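Here is a minimal sketch of a tokenizer that reproduces the example above using only Python's built-in `re` module. The pattern and the `tokenize` name are my own illustration, not what the course uses, and library tokenizers handle many more cases (for instance, this one splits “don’t” into “don” and “’t”).

```python
import re

# One alternative per token type, tried left to right:
#   \w+        a run of letters/digits (a word or a number)
#   ['’]\w+    an apostrophe plus the letters after it (the "'s" of a contraction)
#   [^\w\s]    any other single non-space character (punctuation)
TOKEN_PATTERN = re.compile(r"\w+|['’]\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, contraction-suffix, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Hello, it's me."))
# ['Hello', ',', 'it', "'s", 'me', '.']
```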
Stemming — A word can take on multiple forms that still mean, or refer to, the same thing. For text analysis, it’s useful to consolidate these forms into a single stem, which stemming does by stripping suffixes
Example: “compute” → “comput”
“computing” → “comput”
“computer” → “comput”
“writing” → “write”
“halves” → “halv”
A downside of stemming is that it may result in a partial word, like “comput” or “halv”
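To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy I wrote for illustration (the `SUFFIXES` list and `naive_stem` are hypothetical), not the stemmer used in the course. Real stemmers apply many more rules, so their output can differ from this one; for example, this sketch turns “writing” into “writ” rather than “write”.

```python
# Suffixes to strip, longest first so "ing" is tried before "s" or "e".
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched, or the stem would be too short

for w in ["compute", "computing", "computer", "halves"]:
    print(w, "->", naive_stem(w))
# compute -> comput
# computing -> comput
# computer -> comput
# halves -> halv
```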
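Lemmatization — Similar to stemming, except that it returns a semantically complete word (the lemma, or dictionary form), whereas stemming can return a partial word. Because fewer forms get merged, there will also be more distinct words left after lemmatization than after stemming. The result can depend on each word’s part of speech, which is why “computing” (treated as a noun) is kept below while “computed” (treated as a verb) becomes “compute”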
Example: “compute” → “compute”
“computer” → “computer”
“computed” → “compute”
“computing” → “computing”
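Lemmatization is essentially a dictionary lookup, which the sketch below imitates with a tiny hand-written table. The `LEMMAS` table and its entries are hypothetical stand-ins for the full lexicon a real lemmatizer would consult; the point is only to show how the part of speech changes the answer.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
# A real lemmatizer consults a full lexicon instead of a hand-written table.
LEMMAS = {
    ("computed", "verb"): "compute",
    ("computing", "verb"): "compute",
    ("computing", "noun"): "computing",
    ("halves", "noun"): "half",
}

def lemmatize(word, pos):
    """Return the dictionary form of word for the given part of speech,
    or the word unchanged when the table has no entry for it."""
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize("computed", "verb"))   # compute
print(lemmatize("computing", "noun"))  # computing
print(lemmatize("halves", "noun"))     # half
print(lemmatize("computer", "noun"))   # computer (no entry, returned as-is)
```

Note that “halves” comes back as the complete word “half” here, whereas stemming left the partial word “halv”.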
Removing stop words — A stop word is a word that carries little meaning but is used often, such as “a”, “the”, “at”. We often remove stop words so that the dataset is smaller and of higher quality
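A minimal sketch of stop word removal, assuming the text has already been tokenized. The `STOP_WORDS` set is a short list I made up for the example; the lists shipped with NLP libraries are much longer.

```python
# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "at", "is", "on", "of"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "at", "the", "door"]))
# ['cat', 'door']
```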
Basic NLP Techniques
Bag of Words model (BoW) — Represents each document by the frequency of its terms. This is useful because ML algorithms need numeric data to work with, and BoW is one of the simplest ways to provide it. Each document is turned into a “bag of words”, but keep in mind that this erases information about grammar and word order
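The loss of word order is easy to see with a quick sketch built on `collections.Counter`. The `bag_of_words` helper is my own and splits on whitespace only, which is a simplification.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a["dog"], a["bites"], a["man"])  # 1 1 1
print(a == b)  # True: same words and counts, even though the meaning differs
```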
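Term Frequency-Inverse Document Frequency (tf-idf) — A method of scoring how important a word is to a given document within a collection of documents. The score rises with how often the term appears in that document and falls with how many documents contain it, so a high score means the term is frequent in this document but rare elsewhere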
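Here is a small worked example using one common textbook form of the formula, tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many of them contain the term. The three toy documents are made up, and libraries often use smoothed variants of this formula, so their numbers may differ from these.

```python
import math

# Three tiny, already-tokenized documents (made up for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df): zero when every document contains the term.
    Assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Term frequency in `doc` weighted by rarity across `docs`."""
    return tf(term, doc) * idf(term, docs)

for term in ["the", "cat", "purred"]:
    print(term, round(tf_idf(term, docs[2], docs), 3))
# the 0.0
# cat 0.135
# purred 0.366
```

“the” appears in every document, so it scores zero, while “purred” appears only in the last document and scores highest there.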
Topic modeling — A way to extract the main topics, or key concepts, from a given text or collection of texts
One hot encoding — A method of encoding categorical data as binary vectors, with one position per category, so that the model does not interpret one category as “better” than another
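A minimal sketch of one hot encoding in plain Python. The `one_hot_encode` helper and the color categories are just an illustration; sorting the categories is my own choice to keep the column order reproducible.

```python
def one_hot_encode(values):
    """Map each value to a 0/1 vector with a single 1 at its category's index."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        vec = [0] * len(categories)
        vec[index[value]] = 1
        vectors.append(vec)
    return categories, vectors

categories, vectors = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Every vector has exactly one 1, so no category looks numerically larger than another, which would not be true if we encoded them as 0, 1, 2.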
Count vectorization — Converts text into vectors of word counts: for each word in the vocabulary, count how many times it appears in the text. Note that with this method you lose the context of the words, although counting n-grams instead of single words recovers some of it. For example, “cat”, “angry cat”, and “not angry cat” carry different meanings and implications as you look at a larger grouping of words
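As a rough sketch, count vectorization builds one shared vocabulary and then one row of counts per document. The `count_vectorize` function below is hypothetical and splits on whitespace only; library vectorizers also handle tokenization, casing options, and n-grams.

```python
from collections import Counter

def count_vectorize(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, matrix = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```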
N-grams — Contiguous sequences of n words; grouping a word with its neighbors helps capture its context. Example: unigram: “cat”, bigram: “angry cat”, trigram: “not angry cat”. Although it may seem better to use a larger n to ensure the context is captured, the tradeoff is that there will be many more distinct groupings, many of which will not provide more value
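Generating n-grams is just sliding a window of size n over the tokens. A short sketch (the `ngrams` helper is my own):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not angry cat".split()
print(ngrams(tokens, 1))  # ['not', 'angry', 'cat']
print(ngrams(tokens, 2))  # ['not angry', 'angry cat']
print(ngrams(tokens, 3))  # ['not angry cat']
```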
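Hash vectorization — Similar to count vectorization, except that a hash function maps each token to a column index, so the unique tokens are never stored
Pro: Faster computationally and lighter on memory, since no vocabulary has to be built or stored
Con: You cannot map a feature back to the actual token, and different tokens can collide in the same column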
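The sketch below shows the idea with a fixed-length vector and a checksum as the hash. Using `zlib.crc32` and 8 columns are arbitrary choices of mine for illustration; libraries use their own hash functions and far more columns.

```python
import zlib

def hash_vectorize(tokens, n_features=8):
    """Count tokens into a fixed-length vector, using a hash as the column index."""
    vector = [0] * n_features
    for token in tokens:
        # crc32 gives the same value on every run, unlike Python's built-in
        # hash() for strings, which is randomized per process by default.
        column = zlib.crc32(token.encode("utf-8")) % n_features
        vector[column] += 1
    return vector

vec = hash_vectorize("the cat saw the dog".split())
print(len(vec), sum(vec))  # 8 5
```

The vector length is fixed up front no matter how many distinct words show up, and nothing in `vec` tells you which token landed in which column.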
Skip-Gram — Used to predict the surrounding words for a given target word
Continuous Bag of Words (CBoW) — The reverse of Skip-Gram; used to predict the target word given the surrounding words.
Both Skip-Gram and CBoW consider a window of neighboring words, and the size of that window is configurable
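To see how the two differ, here is a sketch that only builds the training examples each approach would learn from; it does not train a model. The helper names are hypothetical, and `window=1` means one word on each side of the target.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: predict each context word from the target."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

def cbow_examples(tokens, window=1):
    """(context words, target) examples: predict the target from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

tokens = "the angry cat".split()
print(skipgram_pairs(tokens))
# [('the', 'angry'), ('angry', 'the'), ('angry', 'cat'), ('cat', 'angry')]
print(cbow_examples(tokens))
# [(['angry'], 'the'), (['the', 'cat'], 'angry'), (['angry'], 'cat')]
```

Widening the window adds more distant neighbors to each example, which is the tradeoff behind the configurable window size.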