Applying Word2Vec on our catalog data

Subhash Medatwal
Jan 5, 2018

We are living in an era of revolution brought about by recent research in deep learning and the easy availability of large-scale computing power.

At Arvind Internet, we use machine learning on a variety of data to build great products. In this post I want to share some of the interesting results we found when we applied word2vec to our product description data. The goal was to find relationships between words and to find similar words. Before I tell you more about what I did, let me briefly describe what word2vec is and give some details about the data.

What is Word2vec?

The word2vec algorithm was created at Google in 2013. It takes a corpus of text and outputs a vector space in which each vector represents a word in the corpus. Hence the name word2vec(tor).

Words that have the “same context” are neighbors in the vector space. The catch here is “same context”: two words share a context when they tend to appear near the same surrounding words. The location of one word relative to another describes the relationship between them. Take this famous example, often given while explaining word2vec.

king - man + woman = queen

The relationships between words are derived from the distances between their vectors. If you start from the word “king” in the vector space and move by the same distance and direction that separates “man” from “woman”, you end up in the area where “queen” appears. So if you apply addition and subtraction to the word vectors of king, man, woman and queen, equations like the one above become possible: king - man + woman = queen
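The arithmetic behind this example can be sketched with a few hand-made vectors. This is only a toy illustration: the 2-D vectors, the word list and the `analogy` helper are made up for this post, while real word2vec vectors are learned from text and typically have a hundred or more dimensions.

```python
import math

# Hypothetical 2-D vectors chosen by hand for illustration only.
# Dimension 0 loosely stands for "royalty", dimension 1 for "gender".
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "jeans": [-0.7, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, and return
    the closest remaining word by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words themselves so they cannot be the answer.
    exclude = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)


result = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(result)
```

With these toy vectors the nearest remaining word is “queen”. A trained model works the same way in spirit, only over thousands of words and many more dimensions.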

We did the same on our data and found some interesting relationships between words.

About the data

On our website, we have a small description for each product.

For example, the description for one pair of jeans is:

Upgrade your denim collection with these impeccably washed jeans by U.S. Polo Assn. Cut from breathable cotton with a hint of stretch, they sit at mid waist and have five handy pockets. Wear yours cuffed with a plaid shirt and sneakers.

I took the descriptions of 15,000 products and fed them into the word2vec algorithm. I used gensim’s word2vec implementation.

Here is the code:
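What follows is a minimal sketch of what such a training script could look like, not the exact code used for the results in this post. The tokenizer, the hyperparameter values, the function names and the `word2vec.model` path are all assumptions made for illustration; only `Word2Vec` and `save` come from gensim.

```python
import string


def tokenize(description):
    """Split a description on whitespace and strip surrounding punctuation.

    Case is preserved, so brand names such as 'Nautica' stay capitalised.
    """
    tokens = (word.strip(string.punctuation) for word in description.split())
    return [token for token in tokens if token]


def train_word2vec(descriptions, model_path="word2vec.model"):
    """Train a word2vec model on product descriptions and save it to disk.

    The hyperparameters below are illustrative assumptions, not tuned values.
    """
    # Imported here so the tokenizer can be used without gensim installed.
    from gensim.models import Word2Vec

    sentences = [tokenize(description) for description in descriptions]
    # vector_size is the gensim 4.x name; gensim 3.x called it size.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
    model.save(model_path)
    return model
```

Calling `train_word2vec(descriptions)` with a list of description strings would train the model and write it to disk. How you tokenize matters: whether punctuation is stripped and whether text is lowercased decides which tokens end up in the vocabulary.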

Once the model is saved, evaluate it:
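As a rough sketch of the evaluation step, the saved model can be loaded back and queried through `model.wv.most_similar`. The helper names and the `word2vec.model` path are hypothetical; `as_percentages` simply truncates each similarity score to a whole percentage for easier reading.

```python
def load_model(model_path="word2vec.model"):
    """Load a previously saved gensim word2vec model (path is an assumption)."""
    # Imported here so the formatting helper works without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec.load(model_path)


def as_percentages(results):
    """Turn [(word, score), ...] into [(word, percent), ...], truncating."""
    return [(word, int(score * 100)) for word, score in results]


def similar_words(model, word, topn=5):
    """Return the topn nearest neighbours of a word as (word, percent) pairs."""
    return as_percentages(model.wv.most_similar(positive=[word], topn=topn))
```

For instance, `similar_words(load_model(), 'Nautica')` would return the five nearest neighbours of that token, provided the token is in the model’s vocabulary.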


The results were pretty interesting and some were astonishing. Take a look:

So, going by the product descriptions, which other brands is “Nautica” similar to?

We found that Nautica is similar to Gant (92%), Izod (91%), Tommy (88%), Elle (86%) and Aeropostale (85%).

>>> model.wv.most_similar(['Nautica'], topn=5)
[('Gant', 0.9252288937568665), ('Izod', 0.916165828704834), ('Tommy', 0.885221540927887), ('Elle', 0.8621432185173035), ('Aeropostale', 0.8563322424888611)]

Items similar to jeans: pants (85%), denims (79%), sweatpants (79%), jeggings (79%), joggers (79%)

>>> model.wv.most_similar(['jeans'], topn=5)
[('pants', 0.8557192087173462), ('denims', 0.7956579327583313), ('sweatpants', 0.7948166728019714), ('jeggings', 0.7919270992279053), ('joggers', 0.7907577157020569)]

We can also add and subtract word vectors, which opens up enormous possibilities for combining and removing contexts.

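For example, if I subtract “shirt”, “bow” and “waistcoat” from “suit”, it replies “jeans”! (The token comes back as “jeans.” with a trailing period, apparently because punctuation stayed attached to the word in the training text.)

suit - shirt - bow - waistcoat = jeans

>>> model.wv.most_similar(positive=['suit'], negative=['shirt', 'bow', 'waistcoat'], topn=1)
[('jeans.', 0.4214898943901062)]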

Some more interesting results:

party + weekend + clothing = holiday

>>> model.wv.most_similar(positive=['party', 'weekend', 'clothing'], topn=1)
[('holiday', 0.8865032196044922)]

party + weekend + clothing - enjoy = work-out

>>> model.wv.most_similar(positive=['party', 'weekend', 'clothing'], negative=['enjoy'], topn=1)
[('work-out', 0.8274893760681152)]

party + weekend + Polo = seasonal

>>> model.wv.most_similar(positive=['party', 'weekend', 'Polo'], topn=1)
[('seasonal', 0.8255293965339661)]

shirt - buttons = sweater

>>> model.wv.most_similar(positive=['shirt'], negative=['buttons'], topn=1)
[('sweater', 0.7139133810997009)]

jeans + office = chinos

>>> model.wv.most_similar(positive=['jeans', 'office'], topn=1)
[('chinos', 0.8614673614501953)]

shirt + jeans + tie + work = suit

>>> model.wv.most_similar(positive=['shirt', 'jeans', 'tie', 'work'], topn=1)
[('suit', 0.8383824229240417)]


The results show that machine learning can surface meaningful and sometimes surprising relationships in our catalog data. And machine learning is fun!

Thanks for reading.


Arvind Internet

Fashion, Retail, Tech blog of Arvind Internet.
