At Creative Market we have millions of products in seven broad categories and a marketplace with constantly evolving design trends. To address the unique dynamics of our marketplace and produce exceptional recommendations, we built an engine that relies on the marriage of traditional matrix factorization and word2vec. We adapted the approach from Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence¹. Their technique, dubbed CoFactor, demonstrates superior performance compared to traditional matrix factorization on several real-world datasets. It accomplishes this by skewing (regularizing) results toward item similarity derived from user browsing behavior.
Our recommendation engine was validated through a series of A/B tests on a weekly product email, where it was compared against products ranked by popularity. We found that tuning certain parameters even slightly sent us on a rollercoaster ride, and certain values had strong negative effects. The final pipeline, however, boosted our click-through rate by 65% and purchase rate by 39%. We achieved this using only 6 weeks of prior data, to account for fast-moving trends and changing buyer interest. Our pipeline currently operates in production and generates recommendations for 187k users each day.
But before we dive into the details, feast your eyes on some samples:
After hearing about the level of optimization and testing we did for our pipeline an acquaintance of ours was taken aback. His immediate reaction was: “why did you spend so much time working on it? Aren’t recommendations a solved problem?” Far from it! While there are plenty of mature recommendation algorithms, every platform has unique characteristics that must be carefully considered. These include:
- User-item sparsity
- Available engagement data (clicks, likes, purchases, etc.)
- Seasonality trends
- Item categorization/diversity
- Variety of user behavior
With so many factors to consider it’s important to make informed decisions based on the platform, tune parameters using internal metrics, and finally select hyperparameters based on statistically significant A/B test results. In our experience you can’t just throw some click stream data into a matrix factorization library and expect to do better than items ranked by popularity.
Behold the Pipeline
Our pipeline can be broken down into three main phases: pre-processing, recommendation, and post-processing.
Pre-process view data
- Remove views outside of our time horizon
- Remove views that correspond to free goods*
- Remove products from specific categories
(less popular ones that introduce noise into the system)
- Construct a list of seasonal items
(e.g. products exclusively for Christmas, Thanksgiving, Valentine’s Day)
- Group owned products by user ID
*Creative Market has 6–9 new products each week that are available for free. This is an example of a unique platform characteristic that must be accounted for in our pipeline. We initially neglected this, which caused free goods to dominate our recommendations and hurt our metrics.
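The pre-processing steps above can be sketched in pandas; this is an illustrative toy example, and the column names, category names, and data are all hypothetical:

```python
import pandas as pd

# Hypothetical raw view events: one row per product view.
views = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "product_id": [10, 11, 10, 12],
    "category":   ["fonts", "photos", "fonts", "3d"],
    "is_free":    [False, False, False, True],
    "viewed_at":  pd.to_datetime(
        ["2017-05-01", "2017-05-02", "2017-05-03", "2017-01-01"]),
})

# Time horizon: keep only the most recent 6 weeks of views.
horizon = pd.Timestamp("2017-05-10") - pd.Timedelta(weeks=6)
noisy_categories = {"3d"}  # stand-in for the less popular categories

filtered = views[
    (views["viewed_at"] >= horizon)              # drop views outside the horizon
    & ~views["is_free"]                          # drop views of free goods
    & ~views["category"].isin(noisy_categories)  # drop noisy categories
]

# Group viewed products by user ID.
products_by_user = filtered.groupby("user_id")["product_id"].apply(set)
```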
Build sparse matrices
- Filter out users and products below minimum view thresholds
- Construct sparse user-item matrix
- Construct item co-occurrence matrix
- Create the SPPMI matrix⁶ from the co-occurrence matrix
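As a sketch, the sparse-matrix construction can be done with SciPy; the data is a toy example and the minimum-view thresholds are hypothetical values:

```python
import numpy as np
from scipy import sparse

# Hypothetical (user, item) click pairs after pre-processing.
user_ids = np.array([0, 0, 1, 1, 2, 2])
item_ids = np.array([0, 1, 0, 2, 1, 2])

n_users, n_items = 3, 3
ones = np.ones(len(user_ids))

# Binary user-item matrix: X[u, i] = 1 if user u viewed item i.
X = sparse.csr_matrix((ones, (user_ids, item_ids)), shape=(n_users, n_items))

# Enforce minimum-activity thresholds (values hypothetical).
min_views_per_user, min_views_per_item = 2, 2
active_users = np.asarray(X.sum(axis=1)).ravel() >= min_views_per_user
popular_items = np.asarray(X.sum(axis=0)).ravel() >= min_views_per_item
X = X[active_users][:, popular_items]

# Item co-occurrence: C[i, j] = number of users who viewed both i and j.
C = (X.T @ X).tolil()
C.setdiag(0)          # an item trivially co-occurs with itself
C = C.tocsr()
C.eliminate_zeros()
```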
Generate recommendations
- Perform CoFactor coordinate updates
- Latent factor matrix multiplication and top k sorting
(k candidate items per user)
Post-process recommendations
- Remove any seasonal products (if the season has already passed)
- Remove items the user already owns
- Remove manually blacklisted items
(e.g. content we’re not comfortable sending)
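A minimal sketch of the post-processing filters, with all IDs, names, and lists hypothetical:

```python
# Hypothetical top-k candidates per user, scored best-first.
candidates = {
    "user_a": [101, 102, 103, 104],
    "user_b": [102, 105, 106],
}
owned = {"user_a": {102}, "user_b": set()}
seasonal_expired = {104}   # e.g. Christmas products after the season
blacklisted = {106}        # content we're not comfortable sending

def post_process(items, owned_items):
    """Drop owned, out-of-season, and blacklisted items, keeping rank order."""
    excluded = owned_items | seasonal_expired | blacklisted
    return [i for i in items if i not in excluded]

recs = {u: post_process(items, owned[u]) for u, items in candidates.items()}
```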
It’s worth noting that Creative Market’s recommendation engine encompasses two platforms: the Marketplace (creativemarket.com) and the recently launched Pro platform (pro.creativemarket.com). There is a large set of products listed on both platforms, and we combine the product views from both sites to train the model. At the last step we split out recommendations for each platform to include only the products available on that platform. Note that the A/B tests were carried out solely on the Marketplace platform.
Collaborative filtering basics
Our recommendation engine uses collaborative filtering at its core. Collaborative filtering is like using the wisdom of the crowd to determine if a particular user will like an item. It essentially reduces to constructing a table of known ratings for each user (the user-item matrix) and filling in (predicting) the unknown entries using the available information.
A common family of algorithms for predicting these ratings is matrix factorization, in which the user-item matrix is decomposed into user and item latent factors⁴ ⁵. The factorization process ultimately compresses the preferences of users and items into two matrices containing the latent factors. You can think of each latent factor as a high-level characteristic of an item (or user). To use a Creative Market specific example, the nth value of an item’s latent vector could measure the flowery-ness of that product, and the nth value of a user’s latent vector would measure their preference for flowery designs. In practice the meaning of each factor is unlikely to be so distinct, since the factors are determined implicitly by the factorization process, but the example helps build intuition.
Taking the dot product of a user vector and an item vector gives the predicted rating for that user-item pair (equivalently, when the vectors are normalized, their cosine similarity). Multiplying the two latent factor matrices together produces a dense matrix of predicted ratings for all user-item pairs, and a top-k sort over each user’s scores yields the k best recommendations for that user.
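With toy latent factors (random stand-ins for real factorization output), the prediction and top-k steps look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 3  # toy sizes; k is the latent dimension

# Latent factors as produced by matrix factorization (random stand-ins here).
user_factors = rng.normal(size=(n_users, k))   # one row per user
item_factors = rng.normal(size=(n_items, k))   # one row per item

# Predicted rating for one user-item pair: dot product of the two vectors.
pred_0_0 = user_factors[0] @ item_factors[0]

# Dense prediction matrix for all pairs at once.
scores = user_factors @ item_factors.T         # shape (n_users, n_items)

# Top-k recommendations per user (argpartition avoids a full sort).
top_k = 2
top_items = np.argpartition(-scores, top_k, axis=1)[:, :top_k]
```

At production scale a full dense multiply can be expensive, so the real pipeline would chunk this computation, but the linear algebra is the same.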
Co-occurrence Matrix Basics
Factorization of matrices that contain item co-occurrence information has a long history in natural language processing (NLP). The central theory around word co-occurrence dates back to the 1950s and states that words in similar contexts have similar meanings⁷. Turning words into vector representations this way has been a staple of NLP for various semantic comparisons and machine learning tasks.
The technique has also been extended beyond words where there is some implicit similarity between items appearing in the same context. In our case the items will represent products and the context will be derived from user views.
An intuitive example: the user may be looking for a product for a particular social media project — maybe she is looking for minimalist photos and fonts that look nice together. The items she views will share this similarity even though she has not explicitly given us the criteria. Given enough data the model will generalize this type of behavior and it will predict high similarity for these types of products.
All of this matrix factorization is great, but how does co-occurrence information relate to our recommendation engine, and how does it tie back to word2vec? While word2vec is based on training a shallow neural network, a fascinating recent paper found a direct relationship between skip-gram word2vec trained with negative sampling and matrix factorization⁶. The authors introduce a novel matrix that stores shifted positive pointwise mutual information (SPPMI) and show that factorizing it accomplishes the same objective.
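Following Levy and Goldberg⁶, the SPPMI matrix can be derived from raw co-occurrence counts. Here is an illustrative sketch with toy counts, where k = 2 plays the role of word2vec’s negative-sample count:

```python
import numpy as np

# Toy symmetric co-occurrence counts: counts[i, j] = number of times
# items i and j were viewed in the same context.
counts = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 9],
    [0, 0, 9, 0],
], dtype=float)

D = counts.sum()             # total number of co-occurring pairs
row = counts.sum(axis=1)     # marginal count for each item
k = 2                        # number of negative samples, in word2vec terms

with np.errstate(divide="ignore"):
    # Pointwise mutual information: log(P(i, j) / (P(i) * P(j))).
    pmi = np.log(counts * D / np.outer(row, row))

# Shifted positive PMI: subtract log k, then clip negatives (and -inf) to zero.
sppmi = np.maximum(pmi - np.log(k), 0.0)
```

Note how the rare pair (items 0 and 1) ends up with a higher SPPMI value than the frequent pair (items 2 and 3): PMI rewards co-occurrence beyond what item popularity alone would predict.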
Tying it all together: CoFactor
CoFactor is a hybrid of these two factorization approaches, exploiting the direct relationship between word2vec and matrix factorization. In the authors’ words, CoFactor “simultaneously factorizes both the click matrix and the item co-occurrence matrix”¹. The model does this by minimizing the following objective:
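In the paper¹ the objective (reproduced here in LaTeX; consult the paper for the exact notation) combines a weighted factorization of the click matrix with a factorization of the SPPMI matrix, sharing the item factors between the two terms:

$$
\min_{\theta, \beta, \gamma, w, c} \;
\sum_{u,i} c_{ui}\,\bigl(y_{ui} - \theta_u^\top \beta_i\bigr)^2
\;+\; \sum_{m_{ij} \neq 0} \bigl(m_{ij} - \beta_i^\top \gamma_j - w_i - c_j\bigr)^2
\;+\; \lambda_\theta \sum_u \lVert \theta_u \rVert^2
\;+\; \lambda_\beta \sum_i \lVert \beta_i \rVert^2
\;+\; \lambda_\gamma \sum_j \lVert \gamma_j \rVert^2
$$

Here y_ui is the click matrix, m_ij is the SPPMI matrix, θ_u are user factors, β_i are item factors (shared by both terms, which is what ties the two factorizations together), γ_j are context factors, and w_i, c_j are item and context biases.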
Just like traditional matrix factorization, CoFactor produces user and item matrices containing latent factors, and K controls the dimension of each user and item vector. Taking the dot product of these vectors likewise generates predictions for each user-item pair. CoFactor also exposes a set of hyperparameters c_ui that let you control the amount of regularization coming from item similarity. See the original paper¹ and the conference presentation slides³ for a more comprehensive overview. The paper includes closed-form coordinate updates, and the authors’ GitHub repository² contains a full open-source Python implementation.
We tested our recommendations via multiple email A/B tests. First we split users randomly into groups A and B, then separated each group into users who had enough data to receive recommendations (“active users”) and users who did not (“inactive users”).
As an added benefit, this split allowed us to perform sanity checks by comparing the engagement of the two inactive groups. During each test we made sure there were no statistically significant differences between the inactive group A and inactive group B.
Data exploration revealed that a 6 week training window was the sweet spot for maintaining relevant browsing history, so we simply discard interactions older than 6 weeks to avoid using outdated information about user preference. The intuition is that our users tend to work on projects with a fixed time duration; feeding older data to the recommender adds noise to the system and produces recommendations that may have nothing to do with the current project. The window approach is still not a perfect solution, because we may include a portion of product views from older projects (we do not know when one project ends and another begins). Nonetheless, it allowed us to achieve positive results.
Here are the characteristics of the user-item data using our final settings (values vary slightly from day to day):
- Unique users: 187,528
- Unique items: 43,032
- Ratings: 2,075,399 (clicks)
- % interaction: 0.026%
This is significantly more sparse than the datasets used in the CoFactor paper: MovieLens 20M (0.63%), Taste Profile (0.29%), and ArXiv (0.12%). Our motivation was to push the bounds of the algorithm: by increasing sparsity we were able to generate recommendations for more users and achieve greater item diversity, but as expected, increasing sparsity too much led to poor results.
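As a quick sanity check, the density figure follows directly from the counts above:

```python
# Density (the "% interaction" figure) is ratings over all user-item pairs.
users, items, ratings = 187_528, 43_032, 2_075_399
density = ratings / (users * items)
print(f"{density:.3%}")  # → 0.026%
```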
Key Results & Analysis
Listed below are the results and interpretations of several A/B tests. For brevity, only a subset of results is included.
Purchase rate is defined as purchases over unique email opens. Click rate in this context is defined as email clicks over unique email opens. Revenue is defined as the dollar amount attributed to the email (the user must have clicked on the email). All analysis mentioned is between the active A and active B groups. For each experiment we used 4 days of data after the initial send to perform our analysis.
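The quoted significance levels can be computed with a standard two-proportion z-test. Here is a sketch, with hypothetical open and click counts:

```python
from math import erf, sqrt

def two_proportion_significance(conv_a, opens_a, conv_b, opens_b):
    """Two-sided two-proportion z-test: returns (z, confidence level)."""
    p_a, p_b = conv_a / opens_a, conv_b / opens_b
    p = (conv_a + conv_b) / (opens_a + opens_b)      # pooled rate
    se = sqrt(p * (1 - p) * (1 / opens_a + 1 / opens_b))
    z = (p_b - p_a) / se
    # Confidence level = 1 - p_value for a two-sided test.
    return z, erf(abs(z) / sqrt(2))

# Hypothetical counts: 5.0% click rate in A vs 5.9% in B.
z, confidence = two_proportion_significance(500, 10_000, 590, 10_000)
```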
A combination of pre-processing steps resulted in high sparsity and the inclusion of data from up to 1 year ago (an initial ensemble technique that was later abandoned).
- Purchase rate: –14.3% (88% significance)
- Click rate: +42% (99% significance)
- Revenue: –8.1%
- Outcome: Loss
Additional hyperparameter tuning and reliance on older data.
- Purchase rate: +12% (82% significance)
- Click rate: +82% (99% significance)
- Revenue: +6.0%
- Outcome: Inconclusive
After increasing the user-item density, removing views older than 6 weeks, and trying different item regularization parameters we settled on parameters that produced the following results.
- Purchase rate: +39% (99% significance)
- Click rate: +65% (99% significance)
- Revenue: +51%
- Outcome: Win
Throughout testing we had a natural tendency to keep increasing the number of users who receive recommendations. Recommend all the things, right!? But we quickly learned that popular products are popular for a reason. In order to generate recommendations for more users we had to decrease either the minimum number of products viewed by a user or the minimum number of views for a given product. In both cases our user-item matrix became even more sparse, and eventually this led to worse results. It’s interesting to note that for all of our tests the recommendations always had an increased click rate, but the increased engagement was often not enough to increase sales.
We hope that this work underscores the importance of validating recommendation technologies through A/B tests and of taking into account the unique characteristics of your platform. A lot can go wrong even if the results look visually appealing. For example, the results from test 15 looked beautiful, but they would have had a negative impact on our business.
In this work we were able to take a cutting edge matrix factorization approach, CoFactor, and build a recommendation pipeline suitable for production. We pushed the bounds of the algorithm by training on only the most recent data which greatly increased the sparsity, and we successfully validated its effectiveness through a series of A/B tests.
This work would not have been possible without Insight Softmax Consulting who built the prototype pipeline and tirelessly experimented with us to produce these exceptional results.
We’re always looking for amazing people to join the Creative Market team. We value our culture as much as we value our mission, so if helping creators turn passion into opportunity sounds like something you’d love to do with a group of folks who feel the same way, then check out our job openings and apply today!
1. D. Liang, J. Altosaar, L. Charlin, D. M. Blei. Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence. RecSys ’16, September 15–19, 2016, Boston, MA, USA. http://dawenl.github.io/publications/LiangACB16-cofactor.pdf
2. CoFactor GitHub repository.
3. CoFactor presentation at RecSys 2016. https://www.slideshare.net/cheerz/factorization-meets-the-item-embedding-regularizing-matrix-factorization-with-item-cooccurrence
4. B. Chen, D. Agarwal, P. Elango, R. Ramakrishnan. Latent Factor Models for Web Recommender Systems. http://www.ideal.ece.utexas.edu/seminar/LatentFactorModels.pdf
5. Y. Koren, R. Bell, and C. Volinsky. Matrix Factorization Techniques for Recommender Systems. Computer, 42(8):30–37, 2009. https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf
6. O. Levy and Y. Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
7. Z. Harris. Distributional Structure. Word, 10(2–3):146–162, 1954.
8. D. Jurafsky and J. H. Martin. “Vector Semantics,” Speech and Language Processing.
9. R. Řehůřek. Making Sense of Word2vec.