The objective of this blog post is to share the results of a small experiment I ran. The basic idea was to see whether a clustering algorithm such as k-means could separate the contexts of use of the visitors of an e-commerce website (e.g. direct search, exploratory search, etc.).
I collected a set of custom clickstream data from www.prezzifarmaco.it and analyzed it as follows:
- eliminated outliers;
- grouped by session and ordered by timestamp;
- counted main events in the clickstreams (page visited, main CTAs, etc.);
- ran a kernel PCA (kPCA) to extract five principal components;
- ran a k-means clustering algorithm with 6 clusters (I will discuss later why the number 6);
- analyzed results.
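The steps above can be sketched roughly as follows. This is a minimal illustration and not the code used for the experiment: the toy `clicks` records, the function names, the RBF kernel and the scaling step are all assumptions, outlier removal is left out, and the component and cluster counts are reduced (2 and 3 instead of 5 and 6) so that the example runs on a handful of sessions.

```python
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# Event names mirror the counters described in this post.
EVENTS = ["HOME_PAGE", "PHARMACY_PAGE", "SEARCH_RESULTS",
          "PRODUCT_PAGE", "MORE_RESULTS_CTA", "ADDCART_CTA"]

# Toy clickstream (hypothetical): (session_id, timestamp, event).
clicks = [
    ("s1", 3, "ADDCART_CTA"), ("s1", 1, "PRODUCT_PAGE"), ("s1", 2, "PRODUCT_PAGE"),
    ("s2", 1, "HOME_PAGE"), ("s2", 2, "SEARCH_RESULTS"), ("s2", 3, "PRODUCT_PAGE"),
    ("s3", 1, "SEARCH_RESULTS"), ("s3", 2, "SEARCH_RESULTS"), ("s3", 3, "MORE_RESULTS_CTA"),
    ("s4", 1, "PRODUCT_PAGE"), ("s4", 2, "PRODUCT_PAGE"),
    ("s5", 1, "PHARMACY_PAGE"), ("s5", 2, "PHARMACY_PAGE"),
    ("s6", 1, "HOME_PAGE"), ("s6", 2, "SEARCH_RESULTS"),
]


def session_counts(clicks):
    """Group events by session, order them by timestamp, count each event type."""
    sessions = defaultdict(list)
    for session_id, timestamp, event in clicks:
        sessions[session_id].append((timestamp, event))
    ids = sorted(sessions)
    rows = []
    for session_id in ids:
        # Ordering does not change the counts; it mirrors the step in the list.
        ordered = [event for _, event in sorted(sessions[session_id])]
        counts = Counter(ordered)
        rows.append([counts[name] for name in EVENTS])
    return ids, np.array(rows, dtype=float)


def cluster_sessions(X, n_components, n_clusters, seed=0):
    """Scale the counts, project them with kernel PCA, then run k-means."""
    X_scaled = StandardScaler().fit_transform(X)  # scaling is an assumption
    Z = KernelPCA(n_components=n_components, kernel="rbf").fit_transform(X_scaled)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(Z)


ids, X = session_counts(clicks)
labels = cluster_sessions(X, n_components=2, n_clusters=3)
for session_id, label in zip(ids, labels):
    print(session_id, label)
```

With real data, `n_components=5` and `n_clusters=6` would match the setup described in the list.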
The main results are summarized in the following radar graph, where each colored line represents the “label” of one cluster and the axes are:
- HOME_PAGE: the number of views of the home page;
- PHARMACY_PAGE: the number of views of the vendor detail page;
- SEARCH_RESULTS: the number of views of the search results page;
- PRODUCT_PAGE: the number of views of the product detail page;
- MORE_RESULTS_CTA: the number of times the visitor used the “more results” CTA;
- ADDCART_CTA: the number of times the visitor added a product to the cart.
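One plausible way to obtain the values plotted on each axis is to average every counter over the sessions assigned to a cluster. The sketch below assumes that the radar values are per-cluster means (the post does not state this explicitly); the `rows` and `labels` data and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

AXES = ["HOME_PAGE", "PHARMACY_PAGE", "SEARCH_RESULTS",
        "PRODUCT_PAGE", "MORE_RESULTS_CTA", "ADDCART_CTA"]


def cluster_profiles(rows, labels):
    """Average each counter over the sessions assigned to each cluster."""
    by_cluster = defaultdict(list)
    for row, label in zip(rows, labels):
        by_cluster[label].append(row)
    return {
        label: {axis: mean(r[i] for r in members) for i, axis in enumerate(AXES)}
        for label, members in sorted(by_cluster.items())
    }


# Hypothetical per-session counts (one row per session, columns follow AXES)
# and hypothetical cluster labels for those sessions.
rows = [
    [0, 0, 0, 2, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 2, 0, 1, 0],
]
labels = [5, 5, 4, 4]

profiles = cluster_profiles(rows, labels)
for label, profile in profiles.items():
    print(label, profile)
```

Each resulting dictionary corresponds to one colored line of the radar graph, with one value per axis.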
As you can see from the graph and the table, the data for five of the clusters (1 to 5) give us interesting insights:
- Cluster 5 AKA “Explorative search”: this cluster of visitors has its peak values in PRODUCT_PAGE (more than ~1 event) and ADDCART_CTA (more than ~1 event), and they did little exploration using the internal search engine or the “more results” CTA; they landed on the exact page from an external search engine and then tried to buy the product;
- Cluster 4 AKA “Search based navigation starting from home”: the visitors in this cluster landed on the home page (HOME_PAGE ~1 event) and then navigated the site mainly using the search engine (SEARCH_RESULTS ~1 event);
- Cluster 3 AKA “Direct navigation”: this group is very interesting; they landed on a SEARCH_RESULTS page (~2, probably they are returning users with a saved URL) and then navigated the site mainly through the internal search engine (SEARCH_RESULTS ~1 event); a small part of the group added a product to the cart (ADDCART_CTA ~0.0625 events); this group is the closest to the “direct” search model;
- Cluster 2 AKA “Product page visit”: this cluster is similar to cluster 5, but these visitors did not add the product to the cart and mostly left the site; they have their peak value in PRODUCT_PAGE (~2 events);
- Cluster 1 AKA “Pharmacy page visit”: this cluster is similar to cluster 1, but these visitors did not add the product to the cart and mostly left the site; they have their peak value in PRODUCT_PAGE (~2 events).
Cluster 0 is not homogeneous and it is very hard to derive a scenario for it. The choice of 6 clusters comes from an elbow analysis of the data.
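For readers unfamiliar with the elbow analysis, the idea is to fit k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below is only an illustration on synthetic data (three well-separated blobs), not the analysis run on the clickstreams; the function name and the data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(X, k_values, seed=0):
    """Fit k-means for each k and return the within-cluster sum of squares."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


# Synthetic stand-in data: three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

inertias = elbow_inertias(X, range(1, 7))
for k, value in inertias.items():
    print(k, round(value, 1))
```

On this toy data the inertia falls steeply up to k = 3 and flattens afterwards, so the "elbow" sits at 3; on the real sessions the same reading pointed to 6.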
We can conclude that it is possible to separate different contexts of use with a simple cluster analysis of clickstreams. In the future I hope to have the opportunity to extend this work by including in the clustering process the number of “transfers” between one type of page and another.
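As a rough idea of what such “transfer” features could look like, the sketch below counts the moves between consecutive page types within one session. The session data and the function name are hypothetical, and this is only one possible way to encode transfers.

```python
from collections import Counter


def transition_counts(pages):
    """Count moves from one page type to the next within a single session."""
    return Counter(zip(pages, pages[1:]))


# Hypothetical session, already ordered by timestamp.
session = ["HOME_PAGE", "SEARCH_RESULTS", "PRODUCT_PAGE",
           "SEARCH_RESULTS", "PRODUCT_PAGE"]

transfers = transition_counts(session)
for (source, target), count in transfers.items():
    print(source, "->", target, count)
```

Each (source, target) pair could then become an extra column next to the event counters before clustering.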