Business2Vec: Identifying similar businesses using embeddings

Understanding Paragraph Vector representation and applying Doc2Vec to company recommendations

Eniola Alese
6 min read · Jun 24, 2018

Motivation

I once worked at a small B2B company, and one of the challenges facing our marketing team was generating new leads, i.e. discovering new prospective customers we could reach out to and possibly sell our products to. Then an idea hit me: why not build a business-to-business recommendation tool where any small business could look up the names of companies they are already familiar with and get back a list of very similar companies they may not have heard of, thereby increasing their chances of generating new leads, finding competitors and much more.

And thus began the journey….

The idea is to extract embeddings from the descriptions of various Canadian companies (the piece of text describing each business’s product offering) and use the resulting model to generate relevant similar-company recommendations.

In this post we will walk through how Paragraph Vector embeddings and the PV-DBOW model work, build a dataset of company descriptions, train the model, visualize the embeddings and finally deploy the model in a web app.

So what exactly are embeddings & how does PV-DBOW work??

Embeddings are simply a way of converting text into numbers. The idea is to find an array of numbers that can capture the semantic similarity between words. There are several ways to go about extracting these embeddings, but for this application we will be using the PV-DBOW method.

The PV-DBOW model is an extension of the popular Skip-gram model, but unlike Skip-gram it learns the embeddings of a paragraph as a whole rather than the embeddings of individual words.

Paragraph Vector — Distributed Bag of Words (PV-DBOW) architecture

The model is a shallow feed-forward neural network with an input vector, a single hidden layer and an output layer, and it is trained to carry out a classification task using words in the vocabulary as its output y. The really neat thing about this model is that rather than using the network for the classification task it is being trained on, our goal is actually to train the hidden layer weight matrix W_h and use it as our embeddings.

Input Vector

Here, we treat every company description as a paragraph and use these as our dataset. For the input, we map every company to a unique position id and represent each id as a one-hot vector V. Our dataset has 46873 companies, so, for example, company “XYZ” with position id 2 would be the one-hot vector V[‘XYZ’] = [ 0 0 1 0 0 … 0 ] of shape (1 x 46873), with a 1 in position 2 and 0’s in the other 46872 positions. To get the one-hot vector for another sample m in the dataset, we just shift the 1 to its respective position.
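
As a quick illustrative sketch in NumPy (the variable names here are my own, not the project’s code), the one-hot input for a company at position id 2 looks like this:

```python
import numpy as np

n_companies = 46873      # number of companies in the dataset
position_id = 2          # position id assigned to company "XYZ"

V = np.zeros((1, n_companies))   # shape (1 x 46873)
V[0, position_id] = 1            # 1 at the company's position, 0s everywhere else
```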

Hidden Layer

The shape of the weight matrix W_h in the hidden layer is determined by the dimension size d. d is a hyper-parameter and should be tuned to see which value yields the best results. For our project, after experimenting with various values, we chose d = 200, so W_h becomes a matrix of shape (46873 x 200). Also, unlike regular feed-forward networks, PV-DBOW does not have an activation function in its hidden layer; instead, the hidden layer is represented by the linear function h = V.W_h + b_h.
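
A matching sketch of the hidden-layer computation, again with illustrative names and a made-up random initialisation:

```python
import numpy as np

n_companies, d = 46873, 200                   # number of companies, embedding dimension

V = np.zeros((1, n_companies))                # one-hot input from the previous step
V[0, 2] = 1

W_h = np.random.randn(n_companies, d) * 0.01  # hidden weight matrix, shape (46873 x 200)
b_h = np.zeros((1, d))                        # hidden bias

h = V @ W_h + b_h                             # linear hidden layer, no activation; shape (1 x 200)
```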

Output Layer

For the output layer we use the negative sampling technique as the optimization objective. Recall from earlier in this post that PV-DBOW’s training task is to carry out classification using the words in the vocabulary as the output. Rather than using all of the words in the vocabulary for this task, negative sampling randomly selects one ‘positive’ word from the paragraph and then selects n other words as ‘negative samples’ (n is a hyper-parameter; in our project we use 5).

For example, if the text description for company “XYZ” is: “We do deliveries for e-commerce products ………”, negative sampling would select at random the word “deliveries” from the text and then go on to select 5 other negative samples such as “the”, “we”, “are”, “of”, “to”. It uses a unigram distribution (which gives the probability of selecting each word) to select the negative samples:
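
For reference, the standard unigram distribution used for negative sampling in word2vec-style models (and gensim’s default) raises each word’s corpus frequency to the 3/4 power:

P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)

where f(w_i) is the number of times word w_i appears in the corpus. The 3/4 exponent damps the dominance of very frequent words such as “the” and “we”.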

So the positive class “deliveries” and the negative classes “the”, “we”, “are”, “of”, “to” are passed on to a sigmoid function for binary classification. The output layer now has 6 output nodes, and this negative sampling process is repeated with other positive and negative pairs until training is complete.

Training

So after setting up the input vector, hidden layer and output layer, it’s time to train. Using our project as an example, let’s look at the forward pass step:

  • we have the input V, a [1 x 46873] vector, where we just shift the position of the 1 depending on the position of training sample m.
  • Next, we initialize the hidden weight matrix W_h as a [46873 x 200] matrix and calculate the hidden state. After the matrix multiplication, h becomes a [1 x 200] vector and is passed on to the output layer.
  • h is then multiplied by the output layer weight matrix W_y of shape [200 x 6] to produce a [1 x 6] vector, which is then passed through the sigmoid function to produce ŷ.

For the backward pass, the back-propagation and gradient descent steps run the same as in a standard feed-forward network. Finally, after training, we just grab the W_h [46873 x 200] matrix, and its rows become the paragraph vectors for all the company descriptions.
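
Putting the pieces together, here is a toy NumPy sketch of one forward pass with the shapes above; it is purely illustrative (the actual training is delegated to gensim below):

```python
import numpy as np

n_companies, d, n_outputs = 46873, 200, 6       # 6 = 1 positive word + 5 negative samples

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative random initialisation of the parameters.
W_h = np.random.randn(n_companies, d) * 0.01    # hidden weights, (46873 x 200)
b_h = np.zeros((1, d))
W_y = np.random.randn(d, n_outputs) * 0.01      # output weights, (200 x 6)

# One-hot input for training sample m.
m = 2
V = np.zeros((1, n_companies))
V[0, m] = 1

h = V @ W_h + b_h        # hidden state, shape (1 x 200), no activation
z = h @ W_y              # scores for the 1 positive + 5 negative words, shape (1 x 6)
y_hat = sigmoid(z)       # predicted probabilities, shape (1 x 6)
```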

Dataset

To create our dataset, data was scraped from the Government of Canada’s Indigenous Business Directory website.

Over 40,000 pages were crawled and scraped for comprehensive information on business description, contacts, products etc.

As with any data project, we take a look at the data and clean it up as much as we can. In this project, all of the data was in text format, so most of the cleaning done involved removing duplicate entries, data wrangling and filling in missing values.

Training the Model

For training we use the gensim library to:

  • pre-process the text (tokenize the text into individual words, remove punctuation and set the text to lower case),
  • associate a tag with each company and create an iterable of words and tags,
  • initialize the PV-DBOW model with the model hyper-parameters (dimension size, min count, number of epochs etc.),
  • build a vocabulary (a dictionary of all unique words extracted from the dataset, along with their frequencies),
  • and finally train. A rough sketch of these steps is shown below.
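
A minimal sketch of these steps with gensim might look like the following; apart from the dimension size (200) and the number of negative samples (5), the hyper-parameter values are assumptions, and company_names / descriptions are hypothetical variables holding the scraped data:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# company_names / descriptions: hypothetical lists of company names and their raw text descriptions.
documents = [
    TaggedDocument(words=simple_preprocess(text), tags=[name])   # tokenize, lowercase, strip punctuation
    for name, text in zip(company_names, descriptions)
]

model = Doc2Vec(
    dm=0,             # dm=0 selects the PV-DBOW architecture
    vector_size=200,  # embedding dimension d
    negative=5,       # number of negative samples n
    min_count=2,      # ignore very rare words (assumed value)
    epochs=40,        # number of training passes (assumed value)
    workers=4,
)

model.build_vocab(documents)   # build the vocabulary of unique words and their frequencies
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
```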

Visualizing Embeddings

We visualize the trained embeddings using t-SNE in the TensorBoard Embedding Projector. The visualization helps us see that businesses with the same or similar industry categories are clustered together.

Visualization of business embeddings using t-SNE
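
One way to get the vectors into the Embedding Projector is to export them as tab-separated files; a sketch assuming the gensim 4.x API, where document vectors live under model.dv (older versions use model.docvecs):

```python
# Continuing from the trained model above: write one vector per line plus a matching
# metadata file of company names, then load both files into the TensorBoard projector.
with open("vectors.tsv", "w") as vec_file, open("metadata.tsv", "w") as meta_file:
    for tag in model.dv.index_to_key:
        vec_file.write("\t".join(str(x) for x in model.dv[tag]) + "\n")
        meta_file.write(tag + "\n")
```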

Also, we inspect nearest neighbour subsets along with their respective cosine distances. For example, the nearest neighbour businesses to ‘Black Forest Bakery’ are shown below:

Nearest Neighbour businesses to “Black Forest Bakery”
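
Querying the trained model for nearest neighbours is a one-liner in gensim (again assuming the 4.x API and that each company name was used as its document tag):

```python
# Top 5 most similar companies by cosine similarity of their paragraph vectors.
similar = model.dv.most_similar("Black Forest Bakery", topn=5)
for name, score in similar:
    print(f"{name}\t{score:.3f}")
```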

SimilarBusiness Web App

Finally, we turn this idea into an actual product and deploy the model in a web application called SimilarBusiness. A demo of the app is shown below:

Demo showing how SimilarBusiness works

Conclusion

I really hope this was helpful in understanding how Paragraph Vector representation works and how we can use these vectors to find similar text descriptions.

Code for the web application can be found here.

Notebook for scraping, cleaning and training can be found here.
