Applying Word2Vec to our catalog data

We are living through a revolution driven by recent deep learning research and easy access to large amounts of computing power.

At Arvind Internet, we use machine learning on a variety of data to build great products. In this post I want to share some interesting results we found when we applied word2vec to our product description data. The goal was to find relationships between words and to find similar words. Before I describe what I did, let me briefly explain what word2vec is and describe the data.

What is Word2vec?

The word2vec algorithm was created at Google in 2013. It takes a corpus of text and outputs a vector space in which each vector represents a word in the corpus. Hence the name word2vec(tor).

Words that appear in the “same context” are neighbors in the vector space. The key idea here is “same context”. The position of a word relative to another word captures the relationship between them. Take this famous example, often given when explaining word2vec:

king - man + woman = queen

The relationships between words are derived from the distances between them. If you start from the word “king” in the vector space and move by the same distance and direction that separates “man” from “woman”, you end up in the area where “queen” appears. So if you apply addition and subtraction to the word vectors of king, man, woman and queen, equations like the one above become possible: king - man + woman = queen

We did the same on our data and found some interesting relationships between words.
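The arithmetic described above can be sketched with a toy example. The two-dimensional vectors below are invented by hand purely for illustration (real word2vec vectors are learned from text and have many more dimensions), and `analogy` is a hypothetical helper, not a gensim function.

```python
import math

# Hand-made 2-D vectors, purely illustrative (an assumption, not model output):
# axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, and return
    the closest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves, otherwise they tend to win.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy(["king", "woman"], ["man"], toy_vectors)
print(result)  # queen
```

With these toy numbers, king - man + woman lands exactly on the vector for "queen"; with learned vectors the result is only approximately there, which is why the nearest neighbor is used.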

About the data

On our website, we have a short description for each product.

For example, the description for one pair of jeans is:

Upgrade your denim collection with these impeccably washed jeans by U.S. Polo Assn. Cut from breathable cotton with a hint of stretch, they sit at mid waist and have five handy pockets. Wear yours cuffed with a plaid shirt and sneakers.

I took the descriptions of 15,000 products and fed them into the word2vec algorithm, using gensim’s word2vec implementation.

Here is the code:
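A minimal sketch of what such a training script could look like, assuming gensim 4.x. The tokenizer, the hyperparameter values and the function names `tokenize` and `train_and_save` are illustrative assumptions, not the settings used for the results in this post.

```python
import re


def tokenize(description):
    """Split one product description into word tokens.

    Case is kept (so brand names stay capitalized) and inner hyphens or
    apostrophes are kept, while surrounding punctuation is dropped.
    This tokenizer is an assumption made for the sketch.
    """
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", description)


def train_and_save(descriptions, model_path):
    """Train word2vec on an iterable of description strings and save it."""
    # Imported here so that tokenize() can be tried without gensim installed.
    from gensim.models import Word2Vec

    sentences = [tokenize(text) for text in descriptions]
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,  # gensim 4.x name; older releases call this `size`
        window=5,         # context words considered on each side (assumed)
        min_count=5,      # ignore words seen fewer than 5 times (assumed)
        workers=4,
    )
    model.save(model_path)
    return model


sample = "Wear yours cuffed with a plaid shirt and sneakers."
print(tokenize(sample))
```

Calling `train_and_save(descriptions, path)` with a list of description strings and a file path of your choice would then produce a model that can be loaded again with `Word2Vec.load`.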

Once the model is saved, load it and evaluate it:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("")  # path to the saved model goes here
>>> # Find words similar to "jeans"
>>> model.wv.most_similar(['jeans'], topn=1)
[('pants', 0.8557192087173462)]
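The number returned next to each word is the cosine similarity between two word vectors: values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated. A minimal pure-Python sketch of that measure follows; the input vectors are made up for illustration and `cosine_similarity` is a hypothetical helper, not part of gensim.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors chosen to show the three reference cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
# 1.0 0.0 -1.0
```

Because only the direction matters, a vector and a scaled copy of it score 1.0, which is why the measure works well for comparing words that occur with very different frequencies.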


The results were pretty interesting and some were astonishing. Take a look:

1. Find how brands are similar to each other:

Given the product descriptions, which other brands is “Nautica” most similar to?

We found that Nautica is most similar to Gant (0.93), Izod (0.92), Tommy (0.89), Elle (0.86) and Aeropostale (0.86); the number in parentheses is the cosine similarity.

>>> model.wv.most_similar(['Nautica'], topn=5)
[('Gant', 0.9252288937568665), ('Izod', 0.916165828704834), ('Tommy', 0.885221540927887), ('Elle', 0.8621432185173035), ('Aeropostale', 0.8563322424888611)]

2. Find similar article types:

Items most similar to “jeans”, with their similarity scores: pants (0.86), denims (0.80), sweatpants (0.79), jeggings (0.79), joggers (0.79).

>>> model.wv.most_similar(['jeans'], topn=5)
[('pants', 0.8557192087173462), ('denims', 0.7956579327583313), ('sweatpants', 0.7948166728019714), ('jeggings', 0.7919270992279053), ('joggers', 0.7907577157020569)]

3. Add and subtract words:

We can add and subtract word vectors, which lets us combine contexts or remove one context from another.

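For example, if I subtract “shirt”, “bow” and “waistcoat” from “suit”, the model answers “jeans”!

suit - shirt - bow - waistcoat = jeans

>>> model.wv.most_similar(positive=['suit'], negative=['shirt', 'bow', 'waistcoat'], topn=1)
[('jeans.', 0.4214898943901062)]

Note that the returned token is “jeans.” with a trailing period, which suggests punctuation was not stripped before training, and that the score of 0.42 is much lower than in the other examples.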

Some more interesting results:

party + weekend + clothing = holiday

>>> model.wv.most_similar(positive=['party', 'weekend', 'clothing'], topn=1)
[('holiday', 0.8865032196044922)]

party + weekend + clothing - enjoy = work-out

>>> model.wv.most_similar(positive=['party', 'weekend', 'clothing'], negative=['enjoy'], topn=1)
[('work-out', 0.8274893760681152)]

party + weekend + Polo = seasonal

>>> model.wv.most_similar(positive=['party', 'weekend', 'Polo'], topn=1)
[('seasonal', 0.8255293965339661)]

shirt - buttons = sweater

>>> model.wv.most_similar(positive=['shirt'], negative=['buttons'], topn=1)
[('sweater', 0.7139133810997009)]

jeans + office = chinos

>>> model.wv.most_similar(positive=['jeans', 'office'], topn=1)
[('chinos', 0.8614673614501953)]

shirt + jeans + tie + work = suit

>>> model.wv.most_similar(positive=['shirt', 'jeans', 'tie', 'work'], topn=1)
[('suit', 0.8383824229240417)]


The results show that machine learning can uncover meaningful and sometimes surprising relationships in our data, and that it is fun!

Thanks for reading.


Subhash Medatwal