
When Magento meets Python (episode: Text Analysis)

Working every day with Magento, one of the most widely used eCommerce platforms, I see a lot of stored data that is often used only for presentation, or not used at all. So why not unleash the mighty Python :) and see what can be accomplished with it?

My goal is to pick examples where, with a few lines of code and a bit of technical context, it is possible to extract valuable information, ranging from exploratory data analysis to advanced reporting and prediction.

So let’s start, for this first post, with something related to text, for example the product descriptions.

When an online store contains a lot of products, it can be interesting to perform a text analysis, for example to:

  • discover how often specific words are used
  • get a metric confirming (or not) that specific brands or product categories have consistent descriptions
  • try to find a correlation between sales and the effectiveness of descriptions (think of brands with similar targets, popularity and prices but different sales results)

But product descriptions are just an example: think about customers' product reviews or a blog containing editorial content.

The possibilities are endless, but let’s keep it basic…ready to write some code? Let’s start!

Magento stores product descriptions in its database, so it's very easy to retrieve the data with a query (targeting specific data related to stores, categories, brands, price ranges, etc.).
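For reference, a query along these lines can pull the descriptions straight from the database. This is a minimal sketch assuming a Magento 1 style EAV schema (descriptions living in catalog_product_entity_text); the connection string is hypothetical and table or attribute names may differ in your installation:

# Sketch (not from the original post): reading product descriptions from a
# Magento 1 style database. Connection string and EAV table names are
# assumptions; adjust them to your own setup.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/magento')  # hypothetical credentials

query = """
SELECT t.entity_id, t.value
FROM catalog_product_entity_text AS t
JOIN eav_attribute AS a ON a.attribute_id = t.attribute_id
WHERE a.attribute_code = 'description'
"""

df_from_db = pd.read_sql(query, engine, index_col='entity_id')
print(df_from_db.head())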

But, in general, several options are possible, even scraping the info directly from the store (meaning collecting the text by browsing the site with a kind of robot, like search engines do).

We’ll use Madison Island demo store sample data.

After extracting the info (using a CSV to keep the code simple), removing duplicates, renaming a column and dealing with null values…

import numpy as np
import pandas as pd

# Load the exported descriptions, drop duplicates, rename the column
# and replace null values with empty strings
df = pd.read_csv(index_col='entity_id', delimiter='^',
                 filepath_or_buffer='<.....>/madison.csv',
                 engine='python')
df.drop_duplicates(keep='first', subset='value', inplace=True)
df = df.rename(columns={'value': 'descr'})
df.replace(np.nan, '', regex=True, inplace=True)
print(df)

this is what we have:

Basically, an index, the product entity_id and the product description.

Let's start with CountVectorizer, a scikit-learn class that counts word occurrences:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

We assign the descriptions to a 'corpus' variable and use a simple function to sort the results, given a corpus and a Vectorizer (in this case the CountVectorizer we just created):

def words_frequency(vectorizer, corpus):
    # Fit the vectorizer on the corpus and sum the occurrences of each word
    vect = vectorizer.fit_transform(corpus)
    sum_words = vect.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
    # Sort by frequency (or weight), highest first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq

Let’s use it and print some results!

corpus = df['descr']
wf_descr = words_frequency(vectorizer, corpus)
print(wf_descr[0:100])

[(‘and’, 58), (‘the’, 50), (‘with’, 44), (‘to’, 35), (‘for’, 32), (‘in’, 29), (‘our’, 23), (‘of’, 21), (‘this’, 19), (‘from’, 18), (‘classic’, 17), (‘is’, 14), (‘your’, 14), (‘modern’, 12), (‘style’, 11), (‘wear’, 10), (‘perfect’, 9), (‘or’, 9), (‘on’, 8), (‘travel’, 8), (‘cotton’, 7), (‘dress’, 7), (‘shirt’, 7), (‘stylish’, 7), (‘wool’, 7), (‘jean’, 7), (‘world’, 7), (‘look’, 7), (‘all’, 7), (‘up’, 7), (‘refined’, 6), (‘fit’, 6), (‘season’, 6), (‘an’, 6), (‘camera’, 6), (‘made’, 5), (‘jeans’, 5), (‘that’, 5), (‘flattering’, 5), (‘any’, 5), (‘comfort’, 5), (‘it’, 5), (‘these’, 5), (‘are’, 5), (‘feminine’, 5), (‘day’, 5), (‘pair’, 5), (‘leather’, 5), (‘wrinkle’, 4), (‘crafted’, 4), (‘suit’, 4), (‘lightweight’, 4), (‘relaxed’, 4), (‘timeless’, 4), (‘maximum’, 4), (‘clean’, 4), (‘compact’, 4), (‘perfectly’, 4), (‘comfortable’, 4), (‘after’, 4), (‘denim’, 4), (‘blend’, 4), (‘top’, 4), (‘silhouette’, 4), (‘chic’, 4), (‘by’, 4), (‘statement’, 4), (‘madison’, 4), (‘at’, 4), (‘long’, 4), (‘add’, 4), (‘resistant’, 3), (‘twill’, 3), (‘french’, 3), (‘business’, 3), (‘bold’, 3), (‘everyday’, 3), (‘trousers’, 3), (‘slim’, 3), (‘jacket’, 3), (‘form’, 3), (‘you’, 3), (‘tailoring’, 3), (‘minimalist’, 3), (‘layering’, 3), (‘appeal’, 3), (‘straight’, 3), (‘crisp’, 3), (‘polished’, 3), (‘silk’, 3), (‘skinny’, 3), (‘simple’, 3), (‘no’, 3), (‘demure’, 3), (‘nude’, 3), (‘casual’, 3), (‘necklace’, 3), (‘trend’, 3), (‘pumps’, 3), (‘island’, 3)]

Interesting, but the first positions are occupied by common words that are useless for our analysis. So, is there a way to skip them?

Yes, using stop words: the Vectorizer will skip common words based on the language (English in this case).

Super easy with nltk!

from nltk.corpus import stopwords

# Requires the stopwords corpus: run nltk.download('stopwords') once
stop_words = set(stopwords.words('english'))

Now let's just create a new CountVectorizer, passing the stop_words, and run it again:

vectorizer = CountVectorizer(stop_words=stop_words)
wf_descr = words_frequency(vectorizer, corpus)
print(wf_descr[0:100])

[(‘classic’, 17), (‘modern’, 12), (‘style’, 11), (‘wear’, 10), (‘perfect’, 9), (‘travel’, 8), (‘cotton’, 7), (‘dress’, 7), (‘shirt’, 7), (‘stylish’, 7), (‘wool’, 7), (‘jean’, 7), (‘world’, 7), (‘look’, 7), (‘refined’, 6), (‘fit’, 6), (‘season’, 6), (‘camera’, 6), (‘made’, 5), (‘jeans’, 5), (‘flattering’, 5), (‘comfort’, 5), (‘feminine’, 5), (‘day’, 5), (‘pair’, 5), (‘leather’, 5), (‘wrinkle’, 4), (‘crafted’, 4), (‘suit’, 4), (‘lightweight’, 4), (‘relaxed’, 4), (‘timeless’, 4), (‘maximum’, 4), (‘clean’, 4), (‘compact’, 4), (‘perfectly’, 4), (‘comfortable’, 4), (‘denim’, 4), (‘blend’, 4), (‘top’, 4), (‘silhouette’, 4), (‘chic’, 4), (‘statement’, 4), (‘madison’, 4), (‘long’, 4), (‘add’, 4), (‘resistant’, 3), (‘twill’, 3), (‘french’, 3), (‘business’, 3), (‘bold’, 3), (‘everyday’, 3), (‘trousers’, 3), (‘slim’, 3), (‘jacket’, 3), (‘form’, 3), (‘tailoring’, 3), (‘minimalist’, 3), (‘layering’, 3), (‘appeal’, 3), (‘straight’, 3), (‘crisp’, 3), (‘polished’, 3), (‘silk’, 3), (‘skinny’, 3), (‘simple’, 3), (‘demure’, 3), (‘nude’, 3), (‘casual’, 3), (‘necklace’, 3), (‘trend’, 3), (‘pumps’, 3), (‘island’, 3), (‘fresh’, 3), (‘design’, 3), (‘make’, 3), (‘even’, 3), (‘carry’, 3), (‘room’, 3), (‘home’, 3), (‘pillow’, 3), (‘shaped’, 3), (‘edgy’, 3), (‘play’, 3), (‘sound’, 3), (‘elegance’, 3), (‘pearl’, 3), (‘cuffed’, 2), (‘luxury’, 2), (‘hue’, 2), (‘tailored’, 2), (‘two’, 2), (‘spring’, 2), (‘airy’, 2), (‘blazer’, 2), (‘warm’, 2), (‘yet’, 2), (‘tee’, 2), (‘smooth’, 2), (‘essential’, 2)]

That’s better…it seems Madison Island is selling both classic and modern items!

There is another approach, called Term Frequency-Inverse Document Frequency (TF-IDF), basically a weighting scheme that mixes how often a word appears (TF) with how rare it is across documents (IDF). The idea is that words appearing in only a few descriptions can be important, so they are boosted, while words appearing everywhere are downweighted.
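To get a feel for how this works, here is a toy sketch (an assumed example, not Madison Island data; the toy_corpus and variable names are mine) using scikit-learn's TfidfVectorizer on three tiny documents. Within the last document, 'rare' ends up weighing more than 'common', because it appears in fewer documents overall:

# Toy illustration of TF-IDF weighting (not part of the original post)
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "common words everywhere",
    "common words again",
    "common but rare here",
]

toy_vect = TfidfVectorizer()
toy_matrix = toy_vect.fit_transform(toy_corpus)

# Weights of each term inside the third document:
# 'rare' (present in one document) outweighs 'common' (present in all three)
row = toy_matrix[2].toarray()[0]
for word, idx in sorted(toy_vect.vocabulary_.items()):
    print(word, round(row[idx], 3))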

Let's try this too…it's just a matter of changing the Vectorizer (to TfidfVectorizer) and we're all set:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stop_words)
wf_descr_tf = words_frequency(vectorizer, corpus)
print(wf_descr_tf[0:100])

[(‘classic’, 3.6802363129400337), (‘style’, 2.861430738381172), (‘modern’, 2.818376752827764), (‘travel’, 2.447621972057555), (‘perfect’, 2.316218667986909), (‘shirt’, 2.2172807543912127), (‘stylish’, 2.1817915898565214), (‘world’, 2.1575552441568355), (‘camera’, 2.057805418983033), (‘cotton’, 1.9948929676453409), (‘wear’, 1.9895380586904707), (‘dress’, 1.8313625743322972), (‘season’, 1.6709130707852804), (‘leather’, 1.6148399048723963), (‘jean’, 1.5778232750090644), (‘comfort’, 1.566326033880678), (‘refined’, 1.5500468394774207), (‘wool’, 1.5485095747666622), (‘look’, 1.532970849847239), (‘compact’, 1.436071168247996), (‘jeans’, 1.4212824582035242), (‘lightweight’, 1.419809364532926), (‘wrinkle’, 1.412283102212597), (‘madison’, 1.3997975034473864), (‘maximum’, 1.386238929214222), (‘trip’, 1.3845288711733956), (‘add’, 1.3794980480802113), (‘relaxed’, 1.345218768591602), (‘fit’, 1.3218980156256626), (‘made’, 1.2828626384028334), (‘island’, 1.2319361777211486), (‘day’, 1.216415536058379), (‘minimalist’, 1.1748203908982608), (‘everyday’, 1.1706771791220492), (‘resistant’, 1.1591951424177578), (‘elegance’, 1.1441994752666833), (‘edgy’, 1.0974980631744995), (‘denim’, 1.0972760765058922), (‘feminine’, 1.090952147884136), (‘business’, 1.0829024461757262), (‘chic’, 1.080055394047742), (‘suit’, 1.0794760199492832), (‘demure’, 1.0590047985210609), (‘crafted’, 1.0535896069746806), (‘blend’, 1.0515122202421439), (‘polished’, 1.0501117554385162), (‘clean’, 1.045714991634424), (‘flattering’, 1.0422020729750932), (‘pair’, 1.0380637587499775), (‘top’, 1.0377328953038358), (‘fresh’, 1.0371606987174713), (‘simple’, 1.0322693005495547), (‘album’, 1.030309354038438), (‘pearl’, 1.0220775473050177), (‘room’, 1.0182288323951616), (‘carry’, 0.996883319848031), (‘timeless’, 0.9956850486377031), (‘silhouette’, 0.9922518707135732), (‘layering’, 0.9763933479178903), (‘make’, 0.9687209534108849), (‘appeal’, 0.9628168791253547), (‘year’, 0.9457015126038185), (‘warrenty’, 0.9457015126038185), (‘songs’, 0.9386806403807683), (‘keeping’, 0.9362533989095028), (‘memories’, 0.9362533989095028), (‘comfortable’, 0.9336664273323154), (‘tailoring’, 0.9326669382691108), (‘skinny’, 0.9321877420695608), (‘casual’, 0.9260740723943781), (‘design’, 0.9255354972656706), (‘statement’, 0.9088695803088533), (‘sound’, 0.9003636547739875), (‘long’, 0.89560816385577), (‘trousers’, 0.8945374205486629), (‘straight’, 0.8930395419127627), (‘subtle’, 0.8817729300838522), (‘perfectly’, 0.8806518868398824), (‘lbd’, 0.8796675801888253), (‘pillow’, 0.8744973615542261), (‘companion’, 0.8718719752862607), (‘shoe’, 0.8715022808180606), (‘adventures’, 0.8620738932312797), (‘premium’, 0.8572889749874235), (‘retro’, 0.8518014111156604), (‘bold’, 0.8487478357717784), (‘shaped’, 0.8413453846897486), (‘tee’, 0.8248856222357431), (‘crisp’, 0.8166972200664052), (‘vase’, 0.8128459231155899), (‘form’, 0.8083378399407374), (‘french’, 0.8068246728470747), (‘best’, 0.7900766691075813), (‘even’, 0.788887926914714), (‘color’, 0.788489663030129), (‘play’, 0.7825630732180437), (‘effect’, 0.7737637267608779), (‘tablet’, 0.7728878199239066), (‘distinct’, 0.7723259177407038), (‘luggage’, 0.7682658042733832)]

The results are different…

But what about analyzing words grouped together? Proximity can give more insight into context. To do this, we'll use 'n-grams' (from Wikipedia: an n-gram is a contiguous sequence of n items from a given sample of text or speech).
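As a quick illustration (a toy sketch using a phrase that appears in the demo descriptions; bigram_vect is just a name I made up), the ngram_range parameter controls how many words are grouped together:

# Toy illustration: extracting bigrams from a single phrase
from sklearn.feature_extraction.text import CountVectorizer

bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(["minimalist style maximum comfort"])
print(sorted(bigram_vect.vocabulary_.keys()))
# ['maximum comfort', 'minimalist style', 'style maximum']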

Let’s build another Vectorizer, defining a range of 2 words

vectorizer = TfidfVectorizer(ngram_range=(2, 2), stop_words=stop_words)
wf_body_tf = words_frequency(vectorizer, corpus)
print(wf_body_tf[0:100])

[(‘madison island’, 1.2861718706230298), (‘minimalist style’, 1.1441663290058552), (‘style maximum’, 1.1441663290058552), (‘maximum comfort’, 1.1441663290058552), (‘wrinkle resistant’, 1.064796178023078), (‘island year’, 1.0151033883750111), (‘year camera’, 1.0151033883750111), (‘camera warrenty’, 1.0151033883750111), (‘songs trip’, 1.0), (‘tunes trip’, 1.0), (‘keeping travel’, 0.9783348792613211), (‘travel memories’, 0.9783348792613211), (‘memories compact’, 0.9783348792613211), (‘dress shirt’, 0.9141662840527143), (‘skinny jeans’, 0.9051077243508228), (‘relaxed wrinkle’, 0.8803298905984991), (‘comfort lightweight’, 0.8162374756946803), (‘camera case’, 0.761629664632101), (‘great layering’, 0.7469792688204052), (‘refined wool’, 0.7343349849303593), (‘suit coat’, 0.7149551517434061), (‘budding photo’, 0.7071067811865475), (‘photo connoisseurs’, 0.7071067811865475), (‘best breitband’, 0.7071067811865475), (‘breitband album’, 0.7071067811865475), (‘demure selection’, 0.7071067811865475), (‘selection occasion’, 0.7071067811865475), (‘travel friendly’, 0.7070404046688445), (‘spring season’, 0.677495104064123), (‘shirt perfect’, 0.6482440081830207), (‘modern silhouette’, 0.6070834670898614), (‘day dress’, 0.6051748843523153), (‘bold hue’, 0.5907444171703867), (‘insider access’, 0.5832439161860423), (‘carry spacious’, 0.5773502691896257), (‘spacious stylishflorentine’, 0.5773502691896257), (‘stylishflorentine satchel’, 0.5773502691896257), (‘murray modern’, 0.5773502691896257), (‘modern vase’, 0.5773502691896257), (‘vase set’, 0.5773502691896257), (‘traditional timepiece’, 0.5773502691896257), (‘timepiece edgy’, 0.5773502691896257), (‘edgy detailing’, 0.5773502691896257), (‘fresh water’, 0.5773502691896257), (‘water pearl’, 0.5773502691896257), (‘pearl necklaces’, 0.5773502691896257), (‘single album’, 0.5773502691896257), (‘album yes’, 0.5773502691896257), (‘yes means’, 0.5773502691896257), (‘music accompany’, 0.5773502691896257), (‘accompany world’, 0.5773502691896257), (‘world traveller’, 0.5773502691896257), (‘polished femininity’, 0.5773502691896257), (‘femininity cover’, 0.5773502691896257), (‘cover choice’, 0.5773502691896257), (‘cool casual’, 0.5773502691896257), (‘casual retro’, 0.5773502691896257), (‘retro urban’, 0.5773502691896257), (‘proper elegance’, 0.5773502691896257), (‘elegance modern’, 0.5773502691896257), (‘modern approach’, 0.5773502691896257), (‘semi sheer’, 0.5600598960103056), (‘denim achieve’, 0.5324504832224248), (‘compact 16gb’, 0.5311738397319206), (‘compact 8gb’, 0.5311738397319206), (‘resistant travel’, 0.5273406015829712), (‘travel shirt’, 0.5273406015829712), (‘stylish companion’, 0.5165429138620579), (‘companion lbd’, 0.5165429138620579), (‘lbd skinny’, 0.5165429138620579), (‘jean jacket’, 0.5120269667740323), (‘compact travel’, 0.5097736853234929), (‘friendly solution’, 0.5097736853234929), (‘solution sightseers’, 0.5097736853234929), (‘sharply tailored’, 0.5097736853234929), (‘tailored perfect’, 0.5097736853234929), (‘perfect spring’, 0.5097736853234929), (‘wool blend’, 0.5043940945007781), (‘slip style’, 0.5), (‘style ideal’, 0.5), (‘ideal everyday’, 0.5), (‘everyday use’, 0.5), (‘protect tablet’, 0.5), (‘tablet minimal’, 0.5), (‘minimal tablet’, 0.5), (‘tablet sleeve’, 0.5), (‘make impression’, 0.5), (‘impression overseas’, 0.5), (‘overseas business’, 0.5), (‘business meetings’, 0.5), (‘subtle nod’, 0.5), (‘nod old’, 0.5), (‘old world’, 0.5), (‘world antiquity’, 0.5), (‘heavy duty’, 0.5), (‘duty hard’, 0.5), (‘hard shell’, 0.5), (‘shell luggage’, 0.5), 
(‘conveniently packaged’, 0.5), (‘packaged pairing’, 0.5)]

‘madison island’ is the top result. It makes sense for a store named…Madison Island :)

Ok, this is just a glimpse of what is possible using these libraries…there is a ton of documentation and examples available, so it's possible to crunch data very easily.

Hope you enjoyed this ride and see you for the next episode…still don’t know when and about what, but stay tuned!

Antonello Calamea, CTO and certified ML adopter
