Word Similarity Extraction with Machine Learning in Python

Theethat Anuraksoontorn
Published in Geek Culture
Sep 14, 2021 · 7 min read

Machine learning, deep learning, and AI have been around for decades, but they are now more visible than ever: everyone uses AI-powered products in everyday life, such as Instagram filters, Netflix movie recommendations, junk-mail detection, and many more.

Recently, this technology has been used mostly in marketing to acquire, retain, and grow customers. It used to replace tedious and repetitive tasks, but it now extends to analytical and creative work. One surprisingly fast-growing area among AI-powered start-ups is text extraction.

Text is one of the main ways users express their thoughts on the internet or read the documents inside your product. Extracting useful information from text documents takes things a step further: it gives us insight into how words are actually used by users and consumers.

I do not want to fix the data for you, so I will use a web scraping technique; that way you can apply this word similarity extraction to any website you want.

Getting Data with Beautiful Soup

BeautifulSoup is one of the strongest Python libraries for web scraping; it is the perfect tool for extracting data from HTML, XML, and other markup-language websites. To get BeautifulSoup, run this line on your command line. For those who would like to explore the library further, check its documentation site.

pip install beautifulsoup4

You also need to install a parser to interpret the HTML elements inside the website. You can run pip to install the lxml parser.

pip install lxml

First, import the libraries required for getting the data from the website. One is BeautifulSoup, imported from bs4; the other is the urllib module, the standard library for fetching URLs. We will use only its request submodule to open and read URLs.

import bs4 as bs
import urllib.request

Getting the data from the Wikipedia page takes three steps (you can choose any other website if you want):

  1. Send the URL request with urllib to open and read the website. Here I use the Wikipedia article on machine learning; just change the parameter to any website you like.
  2. Parse the article with bs4. This is when bs4 loads the HTML of the website; we then select the part that contains the text information to work on later. The text is contained inside the paragraph (<p>) tags of the HTML.
  3. After we get the paragraph tags, we need to strip the HTML markup and keep only the text.
# send the url request to open and read the website
scrapped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Machine_learning')
# read the article with beautiful soup lxml parser
article = scrapped_data.read()
parsed_article = bs.BeautifulSoup(article,'lxml')
# parse only the paragraph tag inside the website HTML
paragraphs = parsed_article.find_all('p')
# get the string of the website from the paragraph tag
article_text = ""
for p in paragraphs:
    article_text += p.text
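One practical caveat: some servers reject requests that use urllib's default user agent. Here is a hedged variant of the request above that sets the header explicitly (the header value is just an example, not something Wikipedia specifically requires):

```python
import urllib.request

# build the request with an explicit User-Agent header
req = urllib.request.Request(
    'https://en.wikipedia.org/wiki/Machine_learning',
    headers={'User-Agent': 'Mozilla/5.0 (compatible; demo-scraper)'}
)
# urllib.request.urlopen(req).read() would then fetch the page
print(req.get_header('User-agent'))
```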

If you use the same URL as me, what you see will look like this.

print(article_text)

Text Preprocessing

For those not familiar with NLP or ML on text: before you jump into computing word similarity, the first thing we have to do is turn the unstructured text into a structured format. I will use the Natural Language Toolkit (NLTK) for this task.

In text preprocessing for Natural Language Processing, before we can use text in a machine learning task, we need to do one thing first: tokenization.

Computers do not actually understand text or strings the way humans do. Tokenization separates the words out of a sentence so that the computer can handle each word individually. Each tokenized word is then mapped to an id, a number used for computation in the code.
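Before reaching for NLTK below, the core idea of tokenization can be sketched with a bare regular expression that separates words from punctuation (a simplified stand-in for illustration, not what NLTK actually does internally):

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs single
    # punctuation marks so they become their own tokens
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Computers read tokens, not sentences."))
# ['Computers', 'read', 'tokens', ',', 'not', 'sentences', '.']
```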

import nltk
nltk.download('punkt')
import string

When using nltk, some parts of the module need to be downloaded manually, outside of pip install. Here it is 'punkt', the pre-trained Punkt tokenizer model that sent_tokenize and word_tokenize depend on; it does not purely split words on whitespace but takes punctuation into consideration.

import re
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
# lowercase the text
processed_article = article_text.lower()
# collapse runs of whitespace into single spaces
processed_article = re.sub(r'\s+', ' ', processed_article)
# strip punctuation characters
processed_article = re.sub(r'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]','', processed_article)
# split the text into sentences
all_sentences = sent_tokenize(processed_article)
# split each sentence into words
all_words = [word_tokenize(sent) for sent in all_sentences]
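To see what the two cleaning steps do, here is a quick check on a made-up sample string (the text is illustrative only):

```python
import re

# a made-up sample string to illustrate the cleaning steps above
sample = "Machine  learning\nis   FUN!".lower()
# collapse runs of whitespace into a single space
sample = re.sub(r'\s+', ' ', sample)
# strip punctuation characters
sample = re.sub(r'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]', '', sample)
print(sample)  # machine learning is fun
```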

Another thing we need to do is remove stopwords: words that serve a purely grammatical purpose in the language and do not by themselves carry meaningful relationships or insight for us.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Removing Stop Words
stops = set(stopwords.words('english'))
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stops]
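As a tiny illustration of the stop-word filter, here is the same list comprehension with a hard-coded mini stop-word set instead of NLTK's full English list (the sample sentence and stop-word set are assumptions for the demo):

```python
# a tiny hand-picked stop-word set; NLTK's English list has ~180 words
stops = {"the", "is", "a", "of", "and"}
sentence = ["machine", "learning", "is", "a", "subfield", "of", "ai"]

# keep only the words that are not stopwords
filtered = [w for w in sentence if w not in stops]
print(filtered)  # ['machine', 'learning', 'subfield', 'ai']
```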

After we remove the stopwords, we can use the list of words to compute word similarity using the Word2Vec model.

Word2Vec

Word2Vec is an algorithm that turns words into vectors. Turning words into vectors makes them computable, quantifiable, and measurable. It lets a word's relationship with another word be expressed as how far apart their vectors are. We can use this to see which words in our text are semantically related to which other words.
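Under the hood, "how far apart" two words are is usually measured by cosine similarity between their vectors. A minimal sketch with made-up three-dimensional vectors (real Word2Vec vectors typically have 100 or more dimensions):

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: dot product
    # divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# made-up vectors purely for illustration
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # much lower: unrelated words
```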

Word similarity is insightful because when we use words in real life, we tend to use many different words that mean the same thing. It is the same when you sell your product online: you call it "SuperProduct", from your company "SuperSeller". Consumers like your product, but they will often change the name when talking with their friends, so it becomes something like "I bought an SP from the green store.".

In this case, "SP" is an abbreviation of "SuperProduct"; this is common in human communication, which tries to minimize language use to the level where it still works for communication. The "green store" is what your customers call your company or store. In marketing terms, this information is insightful, but most marketers already know the words customers use for their product or company.

However, these kinds of renamings happen every time the company exposes new information to customers, and having the marketing team follow and track every feedback comment of a campaign is time-consuming and ineffective. The real application is to automate the process of finding these new synonyms created by customers, so that marketers can spend their time improving customer satisfaction, acquisition, or target repositioning instead.

Word similarity

Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. To use the word2vec algorithm for our data, we need to train the word2vec model with our data.

from gensim.models import Word2Vec
word2vec = Word2Vec(all_words, min_count=2)

We import the Word2Vec class and pass our list of tokenized sentences as the training data, along with the minimum count (frequency) a word needs to be kept in the model. Training time depends on your data, but if you use the same Wikipedia text as me, it should take less than a few minutes.

vocabulary = word2vec.wv.key_to_index  # in gensim < 4.0 this was word2vec.wv.vocab

After we create the model, we can list all the queryable words with wv.key_to_index (wv.vocab in gensim versions before 4.0). When we want to see word similarity, we use wv.most_similar(str). Here I pass the string "data" to see which words have a semantic relationship with "data".

word2vec.wv.most_similar('data')

The result shows the words most related to "data", with a similarity score between 0 and 1; the higher the score, the more similar the word. Wikipedia has low topical variance when it is specific to one context, so the similarity here does not fully reflect real-world usage. But when we move to the wider and more diverse language of consumer data, this becomes a great way to mine keywords.
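Since most_similar returns a list of (word, score) pairs, a marketer-style workflow can simply keep candidates above a score threshold. A sketch with made-up placeholder scores, not real model output:

```python
# made-up (word, score) pairs standing in for most_similar('data') output
results = [("information", 0.93), ("dataset", 0.88),
           ("training", 0.61), ("algorithm", 0.42)]

# keep only candidate synonyms above a chosen similarity threshold
threshold = 0.8
synonyms = [word for word, score in results if score >= threshold]
print(synonyms)  # ['information', 'dataset']
```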

Thank-you note and further work

Thank you all for reading. If you enjoy my work, follow me for more articles about applied ML. I am currently working on many projects, but if you want to read about anything in particular, leave a comment and I will write it for you.
