Clickstream mining and browsing optimization

Nicolas Li
Published in OVRSEA
5 min read · Nov 20, 2018

To drastically improve the freight-forwarding process, Ovrsea, the leading French digital freight forwarder, has developed a web app for its operational staff and customers.

Ovrsea’s customer platform

This platform aims to gather all the information needed to manage a shipment and to make that shipment easier to handle. To that end, Ovrsea’s tech team continuously researches new technologies to improve the platform. Most recently, the team has been looking at optimizing browsing through pattern mining.

What the heck is pattern mining??

Strange combination huh? And yet, sales of both have skyrocketed thanks to pattern mining! (Right picture credit: Rob Hayman)

Well, it’s a broad field in data science, with applications in many sectors. The main goal is to spot recurrent patterns in any kind of data: images, transactions, sequences of numbers. It can detect specific medical issues, customer habits (see the classic beer-diaper example), or characteristic user actions.

Okay, how do you use it in your case then?

Ovrsea has recorded its staff’s browsing activity on its platform. The goal is to perform pattern mining on users’ clickstreams in order to spot the bottlenecks, that is, the places where users spend the most time. The findings will guide the next optimizations of the platform.

User clickstream mining

Many pattern mining algorithms exist, such as the well-known Apriori algorithm, PrefixSpan, BIDE… [2][3][4] However, none of them scans continuous clickstreams in their original form. Hence, a simple custom mining algorithm was implemented, using sliding windows of different sizes to scan the action logs and count all possible patterns.
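As a rough illustration of the sliding-window idea, here is a minimal Python sketch. The function name `count_patterns`, the window bounds `min_len` and `max_len`, and the toy action names are all hypothetical, and overlapping occurrences are counted separately, which may differ from what was actually done.

```python
from collections import Counter


def count_patterns(actions, min_len=2, max_len=5):
    """Count every contiguous run of actions whose length is in [min_len, max_len]."""
    counts = Counter()
    for size in range(min_len, max_len + 1):
        # Slide a window of the current size over the whole log.
        for start in range(len(actions) - size + 1):
            counts[tuple(actions[start:start + size])] += 1
    return counts


# Toy clickstream (hypothetical action names).
log = ["open", "edit", "save", "open", "edit", "save", "close"]
counts = count_patterns(log, min_len=2, max_len=3)
```

On this toy log, the window ("open", "edit", "save") is counted twice, while ("save", "close") is counted once.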

Next, we discarded all the non-closed patterns (a closed pattern is a pattern that has no super-pattern with the same number of occurrences). The method is simple rather than optimal: group the patterns by number of occurrences, then check whether each pattern is a sub-pattern of another one in the same group. After this first cleaning, more than 1200 patterns were left for each user over two weeks. Among them, many were part of the same family, meaning redundancy was still high…
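The closed-pattern filter can be sketched as follows. This is only an illustration under assumptions: patterns are tuples of actions, a sub-pattern is taken to be a contiguous run inside a longer pattern, and the helper names `is_subpattern` and `closed_patterns` are made up for the example.

```python
from collections import defaultdict


def is_subpattern(small, big):
    """True if `small` appears as a contiguous run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def closed_patterns(counts):
    """Keep patterns that have no longer super-pattern with the same count."""
    # Group patterns by their number of occurrences.
    by_count = defaultdict(list)
    for pattern, n in counts.items():
        by_count[n].append(pattern)

    closed = {}
    for n, group in by_count.items():
        for p in group:
            # Quadratic check inside each group: simple, not optimal.
            has_super = any(len(q) > len(p) and is_subpattern(p, q) for q in group)
            if not has_super:
                closed[p] = n
    return closed


# Toy counts (hypothetical values).
counts = {
    ("open", "edit"): 2,
    ("edit", "save"): 2,
    ("open", "edit", "save"): 2,
    ("save", "open"): 1,
}
closed = closed_patterns(counts)
```

Here the two length-2 patterns seen twice are absorbed by ("open", "edit", "save"), which occurs just as often, so only that pattern and ("save", "open") survive.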

3 closed patterns belonging to the same family

So what did you do?

Things get interesting here. On the previous picture, the three patterns have the same color, because they have been clustered together.

Hierarchical clustering of patterns using the SciPy linkage function

To cluster any kind of data, you need to make two choices: a distance and a clustering algorithm. Since the goal was to cluster patterns belonging to the same family, the distance was chosen to be inversely proportional to the length of the longest common sub-pattern (finding it is a well-known computer science problem). As for the algorithm, hierarchical clustering was chosen (implemented by the linkage function of the SciPy library). It lets the user pick clusters based on their distance to each other, which is very useful when the ideal distance is not known in advance (unsupervised learning). The downside is that redundancy remains high, since a pattern can belong to several clusters.
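To make these two choices concrete, here is a small sketch. Several things in it are assumptions rather than the actual implementation: the "longest common sub-pattern" is read as the longest common contiguous run, the distance is taken as 1 / length with an arbitrary cap (`no_overlap`) when two patterns share nothing, and average linkage is picked arbitrarily. Only `linkage` and `fcluster` are real SciPy functions; the other names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def longest_common_run(a, b):
    """Length of the longest contiguous run shared by sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run ending at the previous pair of positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def pattern_distance(a, b, no_overlap=2.0):
    """Assumed distance: inverse of the common run length, capped when empty."""
    common = longest_common_run(a, b)
    return 1.0 / common if common else no_overlap


# Toy patterns (hypothetical action names).
patterns = [
    ("open", "edit", "save"),
    ("open", "edit", "save", "close"),
    ("edit", "save", "close"),
    ("search", "filter"),
    ("search", "filter", "export"),
]

# combinations() yields pairs in the condensed order that linkage() expects.
condensed = np.array([pattern_distance(a, b) for a, b in combinations(patterns, 2)])
Z = linkage(condensed, method="average")

# Flat clusters obtained by cutting the dendrogram at distance 1.0.
labels = fcluster(Z, t=1.0, criterion="distance")
```

With this toy data, the three "edit/save" patterns end up in one cluster and the two "search/filter" patterns in another. Note that `fcluster` returns a flat partition, so each pattern gets exactly one label here.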

Hierarchical clusters (dendrogram) computed using the SciPy linkage function. Each pattern belongs to several clusters.

In our case, keeping the clusters formed at a distance between 1.1 and 2 gives about 400 clusters per user.
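One possible reading of this selection is to keep every node of the dendrogram whose merge height falls within the chosen range, which is also why a pattern can appear in several nested clusters. The sketch below follows that reading; the helper `nested_clusters` and the toy distances are hypothetical, while `linkage` and `to_tree` are SciPy functions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def nested_clusters(Z, low, high):
    """Leaf ids of every dendrogram node merged at a height in [low, high]."""
    _, nodes = to_tree(Z, rd=True)
    return [
        sorted(node.pre_order())
        for node in nodes
        if not node.is_leaf() and low <= node.dist <= high
    ]


# Condensed distances between 4 toy patterns, in the order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3). Values are made up.
condensed = np.array([1.2, 1.5, 3.0, 1.5, 3.0, 3.0])
Z = linkage(condensed, method="average")

clusters = nested_clusters(Z, low=1.1, high=2.0)
```

In this example pattern 0 belongs both to the cluster {0, 1} (merged at 1.2) and to the larger cluster {0, 1, 2} (merged at 1.5), while the root, merged at 3.0, is left out.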

Hmm… That’s still too many isn’t it?

Yup, here comes the last processing step: pattern scoring.

How to score pattern importance?

Intuitively, two features come to mind: the number of occurrences and the length of the pattern. Indeed, if a pattern is frequent and long, it is probably worth a closer look. We added another important feature: periodicity. It matters a lot because it targets exactly the kind of actions we want to spot. A pattern is said to be periodic if it contains a sub-pattern that is repeated several times in a row. This means the user has repeated the same sequence of actions back to back, which hints at a UI flaw.

And how do you compute the periodicity of a pattern?

Here comes a little trick. Each action type was mapped to a number, which transforms a sequence of actions into a numerical sequence.

(actionA, actionB, ...) -> (0, 4, ...)

Such a transformation allows us to use signal processing algorithms, especially autocorrelation. Autocorrelation is the cross-correlation of a signal with delayed copies of itself, computed for different delays (lags). Long story short, if the signal contains a periodic pattern, the autocorrelation function will peak at the lags that are multiples of the period.

Here, the autocorrelation reaches 1 twice, which indicates 2 periods of (1, 2, 2)
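The periodicity check can be sketched in Python as below. The normalisation is an assumption: the article does not say how the autocorrelation was normalised, so this sketch uses the Pearson correlation between the overlapping parts of the sequence and its shifted copy. Because a correlation of 1 only proves a linear relationship, the candidate lag is confirmed with an exact comparison. The function names and the encoded sequence are hypothetical.

```python
import numpy as np


def lagged_correlation(seq, lag):
    """Pearson correlation between the sequence and itself shifted by `lag`."""
    x = np.asarray(seq, dtype=float)
    head, tail = x[:-lag], x[lag:]
    if head.std() == 0 or tail.std() == 0:
        # Constant overlap: correlation is undefined, fall back to equality.
        return 1.0 if np.array_equal(head, tail) else 0.0
    return float(np.corrcoef(head, tail)[0, 1])


def detect_period(seq, tol=1e-9):
    """Smallest lag with correlation ~1 and an exact match, or None."""
    x = np.asarray(seq)
    for lag in range(1, len(seq) // 2 + 1):
        if lagged_correlation(seq, lag) >= 1 - tol:
            # Correlation 1 is necessary but not sufficient: confirm exactly.
            if np.array_equal(x[:-lag], x[lag:]):
                return lag
    return None


# Encoded pattern: (1, 2, 2) repeated three times (made-up encoding).
encoded = [1, 2, 2, 1, 2, 2, 1, 2, 2]
period = detect_period(encoded)
```

For this sequence the first lag that passes both checks is 3, the length of the repeated block (1, 2, 2). A strictly increasing sequence such as 1, 2, 3, 4, 5, 6 also has a lagged correlation of 1, which is why the exact comparison is kept as a safeguard.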

Based on these 3 features, a custom score was generated, allowing us to rank clusters and patterns and display them.

Clusters found. The radius (resp. color) of each point corresponds to each cluster’s size (resp. periodicity)

What’s next? Well, we need to analyze the most problematic clusters to determine whether they are relevant, and correct the scoring method accordingly. Some interesting patterns have already been spotted: for instance, a user was found to perform the same sequence of actions several times just to delete a few tasks from a list, one by one. However, the biggest difficulty in such a project is the lack of labels, which prevents us from using supervised algorithms.

If you are interested in analyzing and optimizing users’ browsing activity, feel free to share your insights, add comments or reach me at nicolas@ovrsea.com!

Thanks for reading!

[1]: You, the Web and Your Device: Longitudinal Characterization of Browsing Habits, Luca Vassio, Idilio Drago, Marco Mellia, Zied Ben Houidi, Mohamed Lamine Lamali, 2018

[2]: Fast Algorithms for Mining Association Rules, Rakesh Agrawal, Ramakrishnan Srikant, 1994

[3]: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, 2001

[4]: BIDE: Efficient Mining of Frequent Closed Sequences, Jianyong Wang, Jiawei Han, 2004
