Engineering Your Webstore with NMF

Dom Vanden Dries
Published in Future Vision
5 min read · May 9, 2019

Get insights on your customer-product relationships with Linear Algebra

Non-negative matrix factorization (NMF) is a powerful linear-algebra tool often used in recommender systems because it copes well with sparse data. NMF can extract latent “topics” from a matrix: think movie genres, article subjects, etc. It works by decomposing a non-negative m x n matrix into an m x p matrix, W, and a p x n matrix, H, whose product approximates the original, with p being the number of latent topics.

The W matrix represents how much each movie-watcher/customer/author belongs to each topic. The H matrix represents how much each movie/product/article belongs to each topic.
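To make the shapes concrete, here is a minimal sketch on a made-up 4 x 3 purchase matrix (the numbers and the choice of p = 2 are assumptions for illustration only, not RockTape data). It shows W coming out as customers x topics and H as topics x products, with W @ H roughly rebuilding the input.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy purchase matrix: 4 customers (rows) x 3 products (columns).
V = np.array([
    [5, 0, 1],
    [4, 0, 0],
    [0, 3, 2],
    [0, 4, 3],
], dtype=float)

p = 2  # number of latent topics (a guess for this toy example)
model = NMF(n_components=p, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)  # shape (4, 2): customer-to-topic weights
H = model.components_       # shape (2, 3): topic-to-product weights

print(W.shape, H.shape)
print(np.round(W @ H, 1))   # approximate reconstruction of V
```

Every entry of W and H is zero or positive, which is what makes the topic weights easy to read as "how much".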

A visual representation of the decomposition of my webstore

This tutorial will walk you through using scikit-learn’s NMF implementation on your eCom data, and discuss a few cool things we can do with the results!

Some notes before we get started

Big thanks to RockTape for letting me use their data for this project. RockTape is a sports-medicine business that works mainly with Chiropractors, Physical Therapists and other medical professionals. RockTape makes fitness products for the general public as well, and does a lot of work specifically with the CrossFit industry.

Away we go!

The first thing we’re going to do is create an itemized historical purchase matrix: an m x n matrix where m is the number of customers in your database and n is the number of products your store offers. Initialize this matrix with zeros, since most customers haven’t bought most things.

import pandas as pd

# One row per customer, one column per product ID, every cell starting at zero
historical_purchase_df = pd.DataFrame(0, index=custy_df.index, columns=product_df['Product ID'])

I set the index as my customer dataframe’s index, with each column corresponding to a unique product ID. We’re going to loop through each customer, loop through each order they placed, loop through each product they purchased, and increment the corresponding cell. I know, it’s a loooot of loops.

The way BigCommerce writes order data makes this……challenging:

Product ID: 111, Product Qty: 3, Product SKU: 2012, Product Name: 2" 2012 Stars and Bars, Product Weight: 0.5000, Product Variation Details: , Product Unit Price: 15.00, Product Total Price: 45.00
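One way to pull the useful bits out of a string like that is a regular expression that grabs only the product ID and quantity. This is a rough sketch, not the wrangling code from my repo: `parse_products` and `PRODUCT_PATTERN` are hypothetical names, and it assumes every product entry starts with "Product ID: ..., Product Qty: ..." exactly as in the sample above.

```python
import re

# Matches the ID/quantity pair that opens each product entry.
PRODUCT_PATTERN = re.compile(r"Product ID: (\d+), Product Qty: (\d+)")

def parse_products(details):
    """Return (product_id, quantity) tuples found in an order's product details."""
    return [(int(pid), int(qty)) for pid, qty in PRODUCT_PATTERN.findall(details)]

example = ('Product ID: 111, Product Qty: 3, Product SKU: 2012, '
           'Product Name: 2" 2012 Stars and Bars, Product Weight: 0.5000, '
           'Product Variation Details: , Product Unit Price: 15.00, '
           'Product Total Price: 45.00')

print(parse_products(example))  # [(111, 3)]
```

Matching on the two fields we need avoids splitting on commas, which would break as soon as a product name contains one.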

If you’re working with a different eCom platform, hopefully the order details are a bit easier on the eyes. Check make_historical_purchase_matrix on my GitHub to see how I wrangled mine! You want something similar to this after all the trimming and splitting and somersaults:

historical_purchase_df.loc[customer, prod_id] += prod_qty

I have about 70,000 customers and 4,000 products, and mine takes a solid 15–20 minutes to finish.
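If the nested loops feel too slow, one alternative is to get your order lines into a long table first and let pandas do the counting. The sketch below assumes a hypothetical `order_lines` frame with `customer_id`, `product_id` and `qty` columns (made-up names and data); it is a different route from the loop described above, not a drop-in copy of it.

```python
import pandas as pd

# Hypothetical long-format order lines: one row per product per order.
order_lines = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "product_id":  [111, 222, 111, 333, 333],
    "qty":         [3, 1, 2, 1, 4],
})
all_customers = [1, 2, 3, 4]         # customer 4 has never ordered
all_products = [111, 222, 333, 444]  # product 444 has never sold

historical_purchase_df = (
    order_lines
    # Sum quantities per (customer, product) pair.
    .pivot_table(index="customer_id", columns="product_id",
                 values="qty", aggfunc="sum", fill_value=0)
    # Add all-zero rows/columns for customers and products with no orders.
    .reindex(index=all_customers, columns=all_products, fill_value=0)
)

print(historical_purchase_df)
```

The `reindex` step matters: without it, customers who never ordered and products that never sold would silently drop out of the matrix.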

This is what your historical purchase matrix should resemble! The index matches your customer index and the columns correspond to your products.

NMF Time!

Specify the number of latent topics you’d like to factor out in n_components. It can be a guess at first; we will dial in that hyperparameter in the next step.

from sklearn.decomposition import NMF as NMF_sklearn  # assumed import behind the NMF_sklearn alias

nmf = NMF_sklearn(n_components=5, max_iter=450)
W = nmf.fit_transform(historical_purchase_df)  # how much each customer belongs to each 'topic'
H = nmf.components_  # how much each item belongs to each 'topic'

NOTE: Depending on its init setting, NMF may start from random values, and scikit-learn refines W and H iteratively (coordinate descent by default, with a multiplicative-update solver also available). Results can therefore differ from run to run. Consider setting a random_state, or pickling the resulting matrices if you’re doing further analysis.
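Here is a small sketch of both suggestions on a random stand-in matrix (the data, the seed and the `factorize` helper are assumptions for illustration). Fitting twice with the same random_state gives matching matrices, and pickle round-trips them unchanged.

```python
import pickle
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.integers(0, 4, size=(20, 8)).astype(float)  # stand-in purchase matrix

def factorize(V, seed):
    """Fit NMF from a random start controlled by `seed`; return (W, H)."""
    model = NMF(n_components=3, init="random", random_state=seed, max_iter=500)
    return model.fit_transform(V), model.components_

W1, H1 = factorize(V, seed=42)
W2, H2 = factorize(V, seed=42)
print(np.allclose(W1, W2) and np.allclose(H1, H2))  # same seed, same result

# Serialize the matrices so later analysis reuses exactly these values.
# (Use pickle.dump with an open file to write them to disk instead.)
blob = pickle.dumps({"W": W1, "H": H1})
restored = pickle.loads(blob)
```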

This is a nifty function I wrote to display the top products for each topic using the H matrix and your product data. With a little domain knowledge of your customer base and product catalog, you can determine the right number of topics and label each one!

print('Here are the top products for %s topics' % n_topics)
for topic in range(n_topics):
    # positions of the 25 highest-weighted products, in ascending order of weight
    indices = H[topic].argsort()[-25:]
    print('\n')
    print(product_df['Name'].iloc[indices])

My product categories became cleanly separable with n_topics = 5. I hand-labeled them and ended up with the following ‘topics’.

1. Consumer Rehab/Single Rolls

Top products: Single rolls in popular colors, PreCut RockTape, RockSauce

People recovering from surgery, rehabbing an injury, etc. These folks tend to churn quickly.

2. Consumer Fitness

Top products: Foam Rollers, RockTape, RockBand, RockBalls Infinity, Gloves

Weekend warriors and gym rats. Sports Med for the general public. Most addresses are homes.

3. CrossFit

Top products: CrossFit Knee Sleeves, H2O Tape, Rogue and Camouflage tape, Shin Protection

CrossFit gyms/boxes and individual CrossFitters. Some overlap with the Consumer Fitness category.

4. Marketing & Promo

Top products: Sample strips, tee shirts, RockSauce samples, posters

RockTape influencers, doctors who want to test our products, trade show orders, internal RockTape orders. Very low or zero dollar average order values.

5. Movement Professionals & Education

Top products: Courses, Bulk rolls of RockTape, 6-packs of RockTape, IASTM Tools, Bulk RockSauce

RockTape’s bread and butter — our Chiropractors, Physical Therapists, LMTs. Most of these are clinical addresses (they have a suite number listed). Higher average order values.

Considering our W matrix, every single customer has a weight in each topic. For example, a Physical Therapist who does CrossFit recreationally could be expressed as [0.02, 0.0, 0.3, 0.0, 1.2]. Most of the purchasing on his account is for clinical use, but he bought all his buddies the sweet new CrossFit knee sleeves for Christmas!
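To turn rows like that into something readable, you can normalize each row of W and look up its largest entry. A minimal sketch, using the example row from above plus one made-up customer; the shortened `topic_labels` and the normalization step are my assumptions, not part of the original pipeline.

```python
import numpy as np

topic_labels = ["Consumer Rehab", "Consumer Fitness", "CrossFit",
                "Marketing & Promo", "Movement Professionals"]

W = np.array([
    [0.02, 0.0, 0.3, 0.0, 1.2],  # the Physical Therapist from the example
    [0.9, 0.1, 0.0, 0.0, 0.0],   # hypothetical second customer
])

# Convert raw weights to per-customer shares; all-zero rows stay at zero.
row_sums = W.sum(axis=1, keepdims=True)
shares = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Label each customer with the topic holding the largest share.
dominant = [topic_labels[i] for i in shares.argmax(axis=1)]
print(dominant)  # ['Movement Professionals', 'Consumer Rehab']
```

The shares also show the overlap: the Physical Therapist still carries a visible CrossFit slice alongside the clinical one.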

Next Steps and Insights

  • For my project, I simply concatenated my W matrix with my original feature matrix. This was a succinct yet effective way to digest and quantify the purchase history of my customers. If you’re going to perform more analysis (Random Forest, Regression Modeling, etc) I’d recommend this route!
Original features (red), engineered features via NMF (yellow)
  • Consider leveraging your W and H matrices to display recommended products in a “Customers like you also bought...” fashion. NMF is used for recommender systems all the time!
  • RockTape’s customer base is distinctly stratified, so each topic matched an existing webstore category pretty well. If your customer base is more nebulous, consider grouping items in categories that correspond to your H matrix.
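The first two ideas above can be sketched in a few lines. Everything here is toy data with hypothetical names (`features`, `top_pick`, the product labels): part 1 appends the topic weights to a feature table, and part 2 scores products with W @ H, hides what each customer already bought, and keeps the best remaining item.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the matrices produced earlier.
customers = ["c1", "c2", "c3"]
products = ["tape", "roller", "sleeve", "course"]
purchases = pd.DataFrame([[2, 0, 0, 0],
                          [0, 1, 3, 0],
                          [4, 0, 0, 1]], index=customers, columns=products)
W = np.array([[1.0, 0.0], [0.0, 1.5], [2.0, 0.1]])
H = np.array([[2.0, 0.0, 0.0, 0.6], [0.2, 0.7, 2.0, 0.0]])

# 1) Append topic weights to an existing feature table.
features = pd.DataFrame({"order_count": [1, 2, 5]}, index=customers)
topic_cols = pd.DataFrame(W, index=customers,
                          columns=[f"topic_{k}" for k in range(W.shape[1])])
features_with_topics = pd.concat([features, topic_cols], axis=1)

# 2) Score every product per customer, blank out items already bought,
#    and keep the highest-scoring remaining product.
scores = pd.DataFrame(W @ H, index=customers, columns=products)
scores = scores.mask(purchases > 0)  # already-purchased items become NaN
top_pick = scores.idxmax(axis=1)

print(features_with_topics)
print(top_pick)
```

This is the simplest possible scoring rule; a production recommender would likely want ranking beyond a single top pick and some handling for brand-new customers.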

Bringing it Back to Churn

My favorite application of NMF relates back to churn. After creating a binary feature for churn and masking your customer dataframe to just your inactives, you can re-run NMF and get a list of products that are most associated with churn.
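A rough sketch of that idea on toy data follows. The `churned` flag, the product names and the single-component fit are all assumptions chosen to keep the example readable; this is not the get_items_associated function from my repo.

```python
import pandas as pd
from sklearn.decomposition import NMF

products = ["tape", "roller", "sleeve", "old_tape*"]
purchases = pd.DataFrame(
    [[3, 0, 0, 0],
     [0, 0, 0, 2],
     [0, 1, 0, 3],
     [1, 0, 2, 0],
     [0, 0, 0, 1]],
    index=["c1", "c2", "c3", "c4", "c5"], columns=products)
churned = pd.Series([0, 1, 1, 0, 1], index=purchases.index)  # hypothetical flag

# Mask the purchase matrix down to inactive (churned) customers only.
inactive = purchases[churned == 1]

# Re-run NMF on the subset; one component keeps the toy example simple.
model = NMF(n_components=1, init="nndsvda", random_state=0, max_iter=500)
model.fit(inactive.to_numpy(dtype=float))

# Rank products by their weight in the churn-only factorization.
weights = pd.Series(model.components_[0], index=products)
churn_products = weights.sort_values(ascending=False)
print(churn_products.head(3))
```

With real data you would keep several components and read the top products per topic, just like in the full-catalog run.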

My list consisted of a lot of discontinued products (often denoted by an asterisk), things we frequently had to warranty, and generally less popular items.

You can do this with more than just churn, too! Take a look at the get_items_associated function in my GitHub Repo. If you wanted a list of products most associated with Consumer Fitness customers in the state of Indiana, a few tweaks of your NMF can provide this (or maybe you could just use SQL, lol!).
