Users’ Context of Use through Clustering Algorithms

Giuseppe Sorrentino
Mar 25, 2018

--

This post shares the results of a small experiment I ran. The basic idea was to see whether a clustering algorithm such as k-means can separate the contexts of use of the visitors of an e-commerce website (e.g. direct search, exploratory search, etc.).

I collected a set of custom clickstream data from www.prezzifarmaco.it and analyzed it as follows:

  • eliminated outliers;
  • grouped by session and ordered by timestamp;
  • counted the main events in the clickstreams (pages visited, main CTAs, etc.);
  • ran a kPCA (kernel PCA) to extract five principal components;
  • ran a k-means clustering algorithm with 6 clusters (I will discuss later why 6);
  • analyzed results.
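
The grouping and counting steps above can be sketched as follows. This is only a minimal illustration, not the code used in the experiment: the record layout `(session_id, timestamp, event)`, the function name `session_features` and the sample records are assumptions, while the event names are the ones used in this post. Outlier removal, kPCA and k-means are not shown.

```python
from collections import Counter, defaultdict

# Event names as used in this post; the record layout is an assumption.
EVENTS = ["HOME_PAGE", "PHARMACY_PAGE", "SEARCH_RESULTS",
          "PRODUCT_PAGE", "MORE_RESULTS_CTA", "ADDCART_CTA"]

def session_features(records):
    """Group (session_id, timestamp, event) records by session, order each
    session by timestamp, and count how often every event occurs."""
    sessions = defaultdict(list)
    for session_id, timestamp, event in records:
        sessions[session_id].append((timestamp, event))

    features = {}
    for session_id, hits in sessions.items():
        hits.sort(key=lambda hit: hit[0])      # order by timestamp
        counts = Counter(event for _, event in hits)
        features[session_id] = {
            "start_page": hits[0][1],          # first event of the session
            "counts": [counts[name] for name in EVENTS],
        }
    return features

# Made-up records, only to show the shape of the output.
records = [
    ("s1", 3, "ADDCART_CTA"),
    ("s1", 1, "PRODUCT_PAGE"),
    ("s1", 2, "PRODUCT_PAGE"),
    ("s2", 1, "HOME_PAGE"),
    ("s2", 2, "SEARCH_RESULTS"),
]
features = session_features(records)
print(features["s1"])
```

Each session becomes one fixed-length vector of counts, which is the kind of input a dimensionality reduction and a clustering step can work on.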

The main results are summarized in the following radar graph, where each colored line represents the “label” of one cluster and the axes are:

  • HOME_PAGE: the number of views of the home page;
  • PHARMACY_PAGE: the number of views of the vendor detail page;
  • SEARCH_RESULTS: the number of views of the search results page;
  • PRODUCT_PAGE: the number of views of the product detail page;
  • MORE_RESULTS_CTA: the number of times the visitor used the “more results” CTA;
  • ADDCART_CTA: the number of times the visitor added a product to the cart.
Radar graph representing the mean values of each event in the last five clusters.
Table representing the number of sessions in each cluster and the most frequent starting page.
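
For readers who want to build the same kind of summary, here is a rough sketch of how the per-cluster mean values (the radar graph) and the session count with the most frequent starting page (the table) could be computed once every session has a cluster label. The function name `summarize_clusters` and the toy inputs are hypothetical, and the numbers are not the ones from the experiment.

```python
from collections import Counter, defaultdict
from statistics import mean

def summarize_clusters(counts, labels, start_pages):
    """For each cluster label: number of sessions, mean of every event
    count, and the most frequent starting page."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    n_events = len(counts[0])
    summary = {}
    for label, indexes in sorted(members.items()):
        summary[label] = {
            "sessions": len(indexes),
            "mean_counts": [mean(counts[i][j] for i in indexes)
                            for j in range(n_events)],
            "top_start_page": Counter(
                start_pages[i] for i in indexes).most_common(1)[0][0],
        }
    return summary

# Toy input: two event columns, four sessions, two clusters (made up).
counts = [[0, 2], [0, 2], [1, 0], [1, 1]]
labels = [1, 1, 0, 0]
start_pages = ["PRODUCT_PAGE", "PRODUCT_PAGE", "HOME_PAGE", "HOME_PAGE"]
summary = summarize_clusters(counts, labels, start_pages)
for label, row in summary.items():
    print(label, row)
```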

As the graph and the table show, clusters 1 to 5 give us interesting insights:

  • Cluster 5 AKA “Explorative search”: this cluster of visitors has its peak values in PRODUCT_PAGE (more than ~1 event) and ADDCART_CTA (more than ~1 event), and they did little exploration with the internal search engine or the “more results” CTA; they landed on a specific page from an external search engine and then tried to buy the product;
  • Cluster 4 AKA “Search based navigation starting from home”: the visitors in this cluster landed on the home page (HOME_PAGE ~1 event) and then navigated the site mainly using the search engine (SEARCH_RESULTS ~1 event);
  • Cluster 3 AKA “Direct navigation”: this group is very interesting; they landed on a SEARCH_RESULTS page (~2, probably they are returning users with a saved URL) and then navigated the site mainly through the internal search engine (SEARCH_RESULTS ~1 event); a small part of the group added a product to the cart (ADDCART_CTA ~0.0625 events); this group is the closest to the “direct” search model;
  • Cluster 2 AKA “Product page visit”: this cluster is similar to cluster 5, but its visitors did not add the product to the cart and then mostly left the site; they have their peak value in PRODUCT_PAGE (~2 events);
  • Cluster 1 AKA “Pharmacy page visit”: this cluster is similar to cluster 2, but its visitors looked at the vendor page; they did not add the product to the cart and then mostly left the site; they have their peak value in PHARMACY_PAGE (~2 events).

Cluster 0 is not homogeneous and it is very hard to trace a scenario for it. The choice of 6 clusters comes from an elbow analysis of the data.
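
As a reminder of what an elbow analysis looks at, here is a small sketch: run k-means for several values of k, record the within-cluster sum of squares (inertia) for each, and look for the k after which the curve stops dropping sharply. The plain k-means below, the function names and the toy points are all assumptions made for illustration. The experiment ran on the kPCA components of the real sessions, and in practice a library implementation would normally be used.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_inertia(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # k distinct points as start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(point, centers[c]))
            clusters[nearest].append(point)
        # Move each center to the mean of its members; an empty cluster
        # keeps its previous center.
        centers = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centers[c]
            for c, members in enumerate(clusters)
        ]
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Toy data with two obvious groups (made up, not the real sessions).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

The "elbow" is read off the printed curve by eye: the inertia always shrinks as k grows, so the useful k is the one where adding another cluster brings only a small further reduction.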

We can conclude that it is possible to separate different contexts of use with a simple cluster analysis of clickstreams. In the future I hope to have the opportunity to extend this scenario by including in the clustering process the number of “transfers” between one type of page and another.
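
To make the idea of “transfers” concrete, here is one possible way to count them for a single session: every pair of consecutive page events is one transfer, and the counts are laid out as a fixed-length vector that could be appended to the event counts. This is only a sketch of the future extension; the names `transition_counts` and `transition_vector` and the sample session are hypothetical.

```python
from collections import Counter

# Page types from this post; CTAs are left out of this sketch.
PAGES = ["HOME_PAGE", "PHARMACY_PAGE", "SEARCH_RESULTS", "PRODUCT_PAGE"]

def transition_counts(ordered_events):
    """Count consecutive (from_page, to_page) pairs in one session."""
    return Counter(zip(ordered_events, ordered_events[1:]))

def transition_vector(ordered_events):
    """Flatten the counts into one value per (from, to) pair of PAGES."""
    counts = transition_counts(ordered_events)
    return [counts[(src, dst)] for src in PAGES for dst in PAGES]

# A made-up session, already ordered by timestamp.
session = ["HOME_PAGE", "SEARCH_RESULTS", "PRODUCT_PAGE",
           "SEARCH_RESULTS", "PRODUCT_PAGE"]
print(transition_counts(session))
print(transition_vector(session))
```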
