Customer Analytics: Pattern Mining on Clickstream Data in Python

Laurin Brechter
11 min read · Aug 21, 2023


tl;dr In this post I show how we can use raw clickstream data and data mining to find patterns in the online user behavior of customers of an ecommerce site. We will cover the full workflow, going from raw API data to results.

The full code can be found on GitHub.

Introduction

In this post, I will look at techniques that can be used for pattern mining on clickstream data. Our general goal is to find patterns in the ways that customers interact with a website.

Clickstream Data

A clickstream refers to the sequence of interactions that a user has with a website, application, or digital platform. It includes every action taken by the user, such as clicking on links, buttons, or images, as well as any other interactions like submitting forms, scrolling, and navigating between pages. Essentially, a clickstream is a record of the user’s digital journey, capturing their behavior and engagement with the digital content.

Clickstream data is valuable for understanding user behavior, preferences, and patterns. By analyzing clickstreams, businesses and website owners can gain insights into how users navigate their platforms, which pages are most popular, where users tend to drop off, and what actions lead to conversions or desired outcomes. This data can be used for improving user experience, optimizing content placement, making data-driven decisions about design and functionality, and ultimately enhancing the overall performance of a digital platform.

Pattern Mining

Pattern mining is a data mining technique used to discover meaningful and interesting patterns, relationships, or associations within a dataset. These patterns can provide valuable insights into the underlying structure and characteristics of the data, which can be used for decision-making, prediction, and optimization in various domains. Pattern mining involves finding patterns that occur frequently, have significant co-occurrences, or follow certain sequences in the data.

There are several types of pattern mining techniques, each tailored to uncover different types of patterns within the data:

  1. Frequent Itemset Mining: This technique focuses on finding sets of items that frequently appear together in a dataset. For example, in a retail setting, frequent itemset mining can identify which products are often bought together by customers. This information can be used for strategies like product placement and cross-selling (a toy example follows this list).
  2. Association Rule Mining: Building on frequent itemset mining, association rule mining uncovers relationships between different items in the dataset. These relationships are expressed as rules such as “If a customer buys item A, they are likely to buy item B.” Association rules are often used in market basket analysis and recommendation systems.
  3. Sequential Pattern Mining: In datasets where the order of events matters, sequential pattern mining discovers patterns that occur in a specific order. This is especially relevant for understanding user behavior and sequences of events.
  4. Spatial Pattern Mining: Spatial data often has geographic attributes. Spatial pattern mining aims to find relationships and patterns in data with spatial coordinates, like identifying clusters of similar locations or understanding the distribution of events across a geographic area.
  5. Graph Pattern Mining: In datasets represented as graphs (nodes connected by edges), graph pattern mining seeks to discover patterns within the relationships and connections between nodes. This is useful for understanding social networks, web structures, and other interconnected data.
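
To make the first two techniques concrete, here is a minimal toy sketch of support counting for item pairs; the baskets and the threshold are invented for illustration.

from itertools import combinations
from collections import Counter

# Toy market-basket data (invented for illustration).
baskets = [{"mask", "fins"}, {"mask", "fins", "snorkel"}, {"mask", "tank"}]

# Count in how many baskets each pair of items co-occurs.
pair_counts = Counter(
    frozenset(pair)
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)

# Pairs appearing in at least 2 of the 3 baskets count as frequent here.
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent_pairs)  # {frozenset({'fins', 'mask'}): 2}

Any pair that clears the threshold is a frequent itemset; an association rule like "mask → fins" would then be derived from such pairs.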

Mathematics of Clickstreams

Let us now try to define the clickstream of a user more formally. One way to do this is with a sequence-of-events notation. For now, we will consider a single journey. We can write the clickstream as follows.

S = ⟨(e_1, t_1), (e_2, t_2), …, (e_n, t_n)⟩,  with  t_1 ≤ t_2 ≤ … ≤ t_n

Where t_i is the time at which the user had the interaction and e_i contains information about the interaction.
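
In code, one lightweight way to mirror this notation is a time-ordered list of (t_i, e_i) pairs. The Interaction class and the sample values below are illustrative, not part of the Matomo payload.

from dataclasses import dataclass

@dataclass
class Interaction:
    t: int   # Unix timestamp t_i of the interaction
    e: dict  # event payload e_i (type, url, ...)

# A single journey as an ordered sequence of interactions.
clickstream = [
    Interaction(t=1692445367, e={"type": "action", "url": "https://dive-shop.net/"}),
    Interaction(t=1692445976, e={"type": "form", "formName": "Checkout Form"}),
]
assert all(a.t <= b.t for a, b in zip(clickstream, clickstream[1:]))  # ordered by time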

Besides the pure interactions, we also have some context data, such as the user's device, browser, and location. Let us simply call this context A. In our case, some of the attributes we have for one visit include the following.

A = {'serverDate': '2023-08-19',
     'visitorType': 'returningCustomer',
     'visitCount': '135',
     'deviceModel': 'Generic Desktop',
     'operatingSystem': 'Windows 7'}

We can see that all of these attributes might influence the user's behavior. We can then form different hypotheses about that behavior. We could, for example, posit that user behavior changes over time, or that it differs depending on the device the user is on.

If we want to get a high-level overview of the data, we can look at the distribution of interaction lengths, i.e. the number of events per visit.

Distribution of interaction lengths

As we can see, most interactions consisted of only 1–2 events, but some were quite lengthy.
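
One way to compute this overview, assuming d is the parsed API response (a list of visits, each of which is a list of event dicts; the same d used in the preprocessing later on):

from collections import Counter

# Count how many visits consist of 1 event, 2 events, and so on.
length_counts = Counter(len(visit) for visit in d)

for n_events, n_visits in sorted(length_counts.items()):
    print(f"{n_events:>3} events: {n_visits} visits")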

There is also lots of metadata about each event. Two example events (with some fields omitted) can be seen below.

{'type': 'action',
 'url': 'https://dive-shop.net/products/diving-tank/',
 'pageTitle': 'Divezone Brand Diving Tank - Divezone Store',
 'pageIdAction': '19',
 'idpageview': 'Hp1IIb',
 'serverTimePretty': 'Aug 19, 2023 11:42:47',
 'pageId': '25678766',
 'timeSpent': 610,
 'pageviewPosition': '1',
 'timestamp': 1692445367}

{'type': 'form',
 'icon': 'plugins/FormAnalytics/images/form.png',
 'idpageview': 'Hp1IIb',
 'title': 'interacted with form Checkout Form',
 'formName': 'Checkout Form',
 'formId': '42',
 'formStatus': 'running',
 'converted': '0',
 'submitted': 0,
 'serverTimePretty': 'Aug 19, 2023 11:52:56',
 'timeToFirstSubmission': '1',
 'timeSpent': '752994',
 'timeHesitation': '0',
 'leftBlank': 2,
 'fields': [{'fieldName': 'billing_address_1',
             'timeSpent': '4043',
             'timeHesitation': '1318',
             'leftBlank': '0',
             'submitted': '0'},
            ...
            {'fieldName': 'billing_city',
             'timeSpent': '2075',
             'timeHesitation': '990',
             'leftBlank': '0',
             'submitted': '0'}]}

The first event shows the customer clicking on a link to a URL. We additionally get the time at which she clicked the link and how long she then stayed on the page. In the second event, the user filled out a form. We can see exactly what form it was and on which page it is shown. We can even see how long the user spent filling out each of the form fields.

These two example events already show how rich and complex the data at hand is, and that there are many ways to analyze it. For now, we will simplify the problem a bit and look at how we can discover some first patterns in the data.

The Data

The data at hand is taken from Matomo. Matomo is an open source analytics tool, similar to Google Analytics. The platform lets you track everything that a user does on your website. Luckily for us, they also have a demo website with some data. By default, Matomo already provides us with many analytics capabilities. But if we want to dig a little deeper, we can also retrieve the log data from the system via an API. In this case, we have data from an ecommerce site that sells diving gear.
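
To sketch that retrieval step: Matomo's Reporting API exposes raw visit logs through the Live.getLastVisitsDetails method. The base URL, site id, and date below are placeholders to adapt to your own instance; this is an illustrative snippet, not the exact call used for this dataset.

import requests

# Fetch raw visit logs from a Matomo instance (parameters per the
# Matomo Reporting API docs; URL and idSite are placeholders).
BASE_URL = "https://demo.matomo.cloud/index.php"
params = {
    "module": "API",
    "method": "Live.getLastVisitsDetails",
    "idSite": 1,
    "period": "day",
    "date": "2023-08-19",
    "format": "JSON",
    "filter_limit": 500,
}
visits = requests.get(BASE_URL, params=params, timeout=30).json()
print(f"fetched {len(visits)} visits")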

Sequential Pattern Mining

Although we have rich metadata about the events and journeys, we will for now simply look at which events there were and in which order they happened. We will also disregard the exact timing of the events and simply look at their order.

We first have to think about what we define as an event. Here, we can choose between different levels of granularity. For example, if the user clicked on the link https://dive-shop.net/products/scuba-fins/, we can say that the event is 'user clicked on the product scuba fins', or simply that 'the user clicked on a product'. Although the latter is less precise, it also makes the feature space less sparse. A similar decision has to be made for 'search' type events: we can simply record that 'the user searched', or we can also include what they searched for. Again, in the latter case, we could end up with too many combinations.

Aside; Combinatorics

We will quickly run into the problem that the number of possible ways in which a customer can interact with our website grows exponentially with the length of the visit. To show this, let K be the number of interactions a customer has during a visit. Further, let N be the number of interaction types (e.g. 'click', 'add to cart', …). Then the number of possible clickstreams X is the following.

X = N^K

We are essentially asking how many ways there are to draw K elements, with repetition, from the set of N interaction types when the order matters.
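
To get a feel for the growth, a quick back-of-the-envelope computation (N = 10 is an arbitrary illustrative value):

N = 10  # assumed number of interaction types, for illustration
for K in (2, 5, 8):
    # Ordered sequences with repetition: X = N**K
    print(f"K={K}: {N**K:,} possible clickstreams")
# K=2: 100, K=5: 100,000, K=8: 100,000,000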

Back to Pattern Mining: Preprocessing

We will therefore have to reduce the number of event types as much as possible while still retaining most of their meaning.

As an example, we can classify the actions by their URLs. That way, we just see which section of the site the customer headed to, but not exactly what they interacted with.

def classify_action(url):
    """Map a raw URL to a coarse-grained 'action' event."""
    types = [
        "jobs", "products", "cart", "checkout", "faq", "diving", "best-dive-sites",
        "best-of-the-best", "my-account", "liveaboard", "divesite", "blog", "resumes",
        "forum", "travel", "guides", "buying-guide"]

    # The two homepages first, since no section keyword appears in their URLs.
    if url == "https://dive-shop.net/":
        return {"type": "action", "details": "shop-home", "e": "action__shop-home"}
    if url == "https://divezone.net/":
        return {"type": "action", "details": "zone-home", "e": "action__zone-home"}

    for type_ in types:
        if type_ in url:
            return {"type": "action", "details": type_, "e": "action__" + type_}

    # Fall back to a catch-all category for anything we do not recognize.
    return {"type": "action", "details": "other", "e": "action__other"}

We do the same for the other event types, after which we have an identifier for each event that we can use as a feature. As an example, let's say we have a list of interactions such as the one below.

example_interactions = ['https://divezone.net/diving/maldives',
                        'https://divezone.net/diving/florida',
                        'https://divezone.net/diving/red-sea',
                        'https://dive-shop.net/products/diving-accessory-starter-kit/',
                        'https://dive-shop.net/products/distance-line-reel/']

Then, they will appear as the following events.

example_transf = [classify_action(x) for x in example_interactions]

print(example_transf)

>>> [{'type': 'action', 'details': 'diving', 'e': 'action__diving'},
     {'type': 'action', 'details': 'diving', 'e': 'action__diving'},
     {'type': 'action', 'details': 'diving', 'e': 'action__diving'},
     {'type': 'action', 'details': 'products', 'e': 'action__products'},
     {'type': 'action', 'details': 'products', 'e': 'action__products'}]

A sample interaction with the website now looks like this.

[{'type': 'action', 'details': 'products', 'e': 'action__products'},
 {'type': 'event', 'eventAction': 'Cart change', 'e': 'event__Cart change'},
 {'type': 'action', 'details': 'cart', 'e': 'action__cart'},
 {'type': 'action', 'details': 'checkout', 'e': 'action__checkout'},
 {'type': 'ecommerceOrder', 'e': 'ecommerceOrder'},
 {'type': 'action', 'details': 'checkout', 'e': 'action__checkout'},
 {'type': 'action', 'details': 'my-account', 'e': 'action__my-account'},
 {'type': 'action', 'details': 'products', 'e': 'action__products'},
 {'type': 'action', 'details': 'my-account', 'e': 'action__my-account'},
 {'type': 'outlink',
  'url': 'https://www.instagram.com/Divezone.net/',
  'e': 'outlink'},
 {'type': 'action', 'details': 'shop-home', 'e': 'action__shop-home'},
 {'type': 'outlink',
  'url': 'https://www.instagram.com/Divezone.net/',
  'e': 'outlink'}]

In the above log, we can see that the customer first looked at a product, then added it to the cart, after which he looked at his cart, proceeded to checkout and then did some other stuff. Interestingly, the customer did not actually buy the product (one interesting use case might be to try and predict whether a customer will buy something based on clickstream data).

Note that there is some more preprocessing to do; I have only explained the process for the 'action' type interactions. There are of course other types of interactions, such as 'ecommerceOrder', 'form', etc., that need to be preprocessed similarly. I have uploaded all the code needed to reproduce the experiment in this notebook on GitHub.

We can now do some further preprocessing, namely converting our event types to ids. I also decided to filter out very short sequences (those with 3 interactions or fewer). We also build two mappings, one from ids to types and one from types to ids.

sequences = []
type_to_id = {}
id_to_type = {}
i = 0

for visit in d:
    events = [j["e"] for j in visit]

    # Assign a fresh id to every event type we have not seen before.
    for event in events:
        if event not in type_to_id:
            type_to_id[event] = i
            id_to_type[i] = event
            i += 1

    # Wrap each id in a list: an itemset of size one (see below).
    events_id = [[type_to_id[event]] for event in events]
    if len(events) > 3:
        sequences.append(events_id)

The result looks like the following. Note that with some libraries one might have to pad the sequences to the maximum sequence length; in our case, this is not necessary.

print(sequences)

>>> [[[8], [4], [8], [4], [9], [8], [8]],
     [[5], [8], [8], [8], [8], [8], [8], [8], [5]],
     [[12], [3]],
     [[12],
      [0],
      [3],
      ...
      [9],
      [10],
      [4]],
     [[8], [10], [10], [10], [10], [10]],
     ...]

Each event id sits in an extra pair of brackets because, in the general setting, this inner list can be used to signify that two events happen at the same time. So, for example, suppose the first sequence were the following.

first_seq = [[8, 1], [4], [8], [4], [9], [8], [8]]

This would mean that, in the first interaction, events 8 and 1 happen at the same time.
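
For completeness, here is a minimal sketch of the padding mentioned above; the sentinel value and the right-padding scheme are assumptions, as libraries differ in what they expect.

PAD = -1  # assumed sentinel value; check what your library expects
max_len = max(len(seq) for seq in sequences)

# Right-pad every sequence with singleton PAD itemsets up to max_len.
padded = [seq + [[PAD]] * (max_len - len(seq)) for seq in sequences]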

Doing the Actual ‘Mining’

After all that preprocessing work, we can now use our pattern mining algorithm. In this case, we are using the SPAM algorithm from the 'sequence-mining' Python library. I will not go into the details of the algorithm here, but you can check out the original paper if you are interested in learning more about it [4]. For us, it suffices to say that it finds the most frequent sequences in our data in an efficient way. We can also define a minimal threshold, called the support, that a sequence must surpass to be considered frequent.

In the code below, we import the algorithm, initialize it with a support of 0.1 (i.e. a sequence must appear in at least 10% of visits) and then run it on the data. As you can see, compared with the preprocessing, this step only takes four lines of code!

from sequence_mining.spam import SpamAlgo

algo = SpamAlgo(0.1)   # set up the algorithm with minimum support 0.1
algo.spam(sequences)   # run the mining on our id sequences

frequent_items = algo.frequent_items  # retrieve the frequent sequences

We can map the ids back to their string representations to restore some semantics.

frequent_items_types = []
for seq in frequent_items:
    s = []
    for item in seq:
        # Items may be plain ints or single-element lists, depending on nesting.
        id_ = item if isinstance(item, int) else item[0]
        s.append(id_to_type[id_])
    frequent_items_types.append(s)

Looking at the Results

The results are all those sequences that turned out to be frequent. This is exactly where a subject matter expert (SME) would have to look over the results for anything unexpected. We might see something that should not be there (e.g. users interacting with the site in an unexpected way), or we might not find something that should be there (i.e. behavior we would expect but that is simply absent).

frequent_sequences = [['action__diving'],
                      ['outlink'],
                      ['action__jobs'],
                      ['action__liveaboard'],
                      ['action__diving', 'action__diving'],
                      ['action__diving', 'action__diving', 'action__diving'],
                      ['action__diving', 'action__diving', 'action__diving', 'action__diving'],
                      ['outlink', 'outlink'],
                      ['outlink', 'action__products'],
                      ['outlink', 'action__cart'],
                      ['action__jobs', 'action__jobs'],
                      ['action__jobs', 'action__jobs', 'action__jobs'],
                      ['action__jobs', 'action__jobs', 'action__jobs', 'action__jobs'],
                      ['action__jobs',
                       'action__jobs',
                       ...
                      ['action__shop-home', 'action__cart'],
                      ['action__shop-home', 'action__my-account'],
                      ['action__shop-home', 'action__my-account', 'action__my-account'],
                      ['action__shop-home', 'action__shop-home'],
                      ['action__liveaboard', 'action__liveaboard']]

Note that there is also some redundancy here. If the sequence a-a-a is frequent, then the sequence a-a must also be frequent. We can see this in the results: if ['action__jobs', 'action__jobs', 'action__jobs', 'action__jobs'] is frequent, a sequence like ['action__jobs', 'action__jobs'] must also be frequent, as the latter is a subsequence of the former. This anti-monotonicity of support is exactly what algorithms like SPAM exploit for pruning.
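
If we only care about the longest patterns, a common post-processing step is to keep just the maximal frequent sequences, i.e. those not contained in any longer frequent sequence. Here is a minimal sketch; is_subsequence is an illustrative helper, not part of the sequence-mining library.

def is_subsequence(short, long):
    # True if 'short' occurs in 'long' in the same order,
    # not necessarily contiguously.
    it = iter(long)
    return all(item in it for item in short)

# Keep only patterns that are not contained in any longer frequent pattern.
maximal = [p for p in frequent_sequences
           if not any(p != q and is_subsequence(p, q) for q in frequent_sequences)]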

Conclusion

In this post, we have seen a full pattern mining workflow. We started with getting raw data from an API, after which came the lengthy step of preprocessing and deciding how best to represent an event. Once the preprocessing is done, performing the actual pattern mining takes only a few lines of code. This again shows that, very often in ML, the actual algorithm is only a small part of the work, whereas everything else (e.g. preprocessing) takes much more time. Additionally, in order to gain actionable insights, an SME needs to look over the results and interpret them.

Resources & Further Reading

[1] Shopper intent prediction from clickstream e-commerce data with minimal browsing information

[2] Using Clickstream Data Mining Techniques to Understand and Support First-Generation College Students in an Online Chemistry Course

[3] You are How You Click: Clickstream Analysis for Sybil Detection

[4] Sequential PAttern Mining using A Bitmap Representation
