CluStream — A Framework for Clustering Evolving Data Streams

Niruhan Viswarupan
Jul 25, 2017 · 8 min read

Most stream clustering algorithms are constrained by the one-pass condition: since the data arrives as a stream, each data point can be examined only once. Storing all the data points would lead to memory requirements that are infeasible for large datasets. Existing stream clustering algorithms are also generally blind to the evolution of the data, which results in poor clustering when data streams evolve over time. Streaming K-means, for example, is sensitive to the order in which data arrives.

The authors of the paper on CluStream propose a fundamentally different approach to the problem: dividing the clustering process into two components.

  1. Online micro-clustering
  2. Offline macro-clustering

This two-phased approach offers users the flexibility to explore how the clusters evolve over different time periods.

During the online phase, summary statistics of the data points that have arrived so far are stored efficiently, so that they can later be manipulated to answer different kinds of queries. The online phase satisfies the one-pass constraint, so big data can be handled. The offline phase uses only the summary statistics stored during the online phase, not the actual dataset itself, so it is no longer constrained by the one-pass condition. This allows us to apply efficient offline algorithms to provide various insights into the clusters. To proceed further, the reader should be familiar with two main concepts from the paper:

  • Micro-clusters: Used to store summary statistics of data points in terms of spatial locality and temporal components.
  • Pyramidal Time Frame: In order to facilitate the analysis of cluster evolution, the information about the micro-clusters is periodically stored away on a permanent storage medium, with a storage period that varies in a pyramidal fashion. Each such stored record is called a “snapshot” in the paper, so we will continue to use that term.

The Stream Clustering Framework

With the concepts and terms introduced, we will dive into the formal definitions.

Data Stream: The data stream is assumed to consist of multi-dimensional data points X1, X2, …, Xk, … arriving at timestamps T1, T2, …, Tk, …. Each data point Xi is a vector of d dimensions, Xi = {xi1, xi2, …, xid}.

Micro-Cluster: A micro-cluster for a set of n d-dimensional data points X1, X2, …, Xn with timestamps T1, T2, …, Tn is defined as the (2d+3)-dimensional tuple (CF2X, CF1X, CF2T, CF1T, n), where:

  • CF2X is a d-dimensional vector holding the sum of squares of all n data points along each dimension. For example, if the micro-cluster has points (a,b) and (c,d) that arrived at times t1 and t2, then CF2X = {a²+c², b²+d²}
  • CF1X is another d-dimensional vector holding the linear sum of the data points along each dimension, i.e. CF1X = {a+c, b+d}
  • CF2T = t1²+t2²
  • CF1T = t1+t2
  • n = number of data points belonging to the micro-cluster
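As a concrete illustration, this tuple can be maintained incrementally as points arrive. Below is a minimal Python sketch (the class and method names are my own, not the paper's or MOA's implementation):

```python
class MicroCluster:
    """Cluster feature vector (CF2X, CF1X, CF2T, CF1T, n) for d-dimensional
    points. A minimal illustrative sketch, not the paper's implementation."""

    def __init__(self, d):
        self.cf2x = [0.0] * d  # sum of squares along each dimension
        self.cf1x = [0.0] * d  # linear sum along each dimension
        self.cf2t = 0.0        # sum of squared timestamps
        self.cf1t = 0.0        # sum of timestamps
        self.n = 0             # number of absorbed points

    def add(self, x, t):
        """Absorb point x arriving at time t; every statistic updates additively."""
        for i, v in enumerate(x):
            self.cf2x[i] += v * v
            self.cf1x[i] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def centroid(self):
        return [s / self.n for s in self.cf1x]
```

With points (1, 2) at t1 = 1 and (3, 4) at t2 = 2, this yields CF2X = {10, 20}, CF1X = {4, 6}, CF2T = 5, CF1T = 3 and n = 2, matching the definitions above.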

Pyramidal Time Frame: The snapshots are stored at different levels of granularity depending on their recency. Snapshots are classified into different orders, which can vary from 1 to log_α(T), where T is the clock time elapsed since the beginning of the stream. A snapshot of the i-th order is taken every α^i time units, where α is an integer and α > 1.

For example, when T=8 and α=2, the stored snapshot orders and times are as follows:

Order 0 snapshot (2⁰ x time units)- 0, 1, 2, 3, 4, 5, 6, 7, 8

Order 1 snapshot (2¹ x time units)- 0, 2, 4, 6, 8

Order 2 snapshot (2² x time units)- 0, 4, 8

Order 3 snapshot (2³ x time units)- 0, 8

If we choose to store only the last α+1 (=3) snapshots of every order, then the snapshots are at:

Order 0 snapshot (2⁰ x time units)- 6, 7, 8

Order 1 snapshot (2¹ x time units)- 4, 6, 8

Order 2 snapshot (2² x time units)- 0, 4, 8

Order 3 snapshot (2³ x time units)- 0, 8

One can verify the following results,

  • the maximum order of any snapshot stored at time T is log_α(T)
  • the maximum number of snapshots stored at time T is (α+1)·log_α(T)
  • for any user-specified time window h, at least one stored snapshot can be found within 2h units of the current time (the time horizon approximation; a proof of this result can be found on page 3 of reference [1]).

As an example, suppose we want to find a snapshot within a factor of 2 of any user-specified time window for a data stream running for 100 years with a time granularity of one second. With α=2, the total number of snapshots to be maintained is (2+1)·log2(100×365×24×60×60) ≈ 95, which is a modest storage requirement.
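The arithmetic can be checked in one line, using the bound of (α+1)·log_α(T) snapshots:

```python
import math

alpha = 2
T = 100 * 365 * 24 * 60 * 60  # 100 years at 1-second granularity
snapshots = math.ceil((alpha + 1) * math.log(T, alpha))
print(snapshots)  # 95
```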

We can improve this time horizon approximation by storing more than α+1 snapshots of every order. It can be proved that if we store α^l + 1 snapshots of each order, the time horizon can be approximated to within a factor of 1 + 1/α^(l-1).

From the previous example, one can see that many snapshots are stored repeatedly under different orders. This is because a snapshot at T=8 (=2³) simultaneously corresponds to snapshots of orders 0, 1, 2 and 3. We can eliminate this storage redundancy by deleting every snapshot of order i whose timestamp is divisible by α^(i+1) (2^(i+1) in our example), since it is already stored at order i+1. Also, whenever a new snapshot is stored, the oldest snapshot of that order is deleted.
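The retention scheme can be sketched as follows; this reproduces the α = 2, T = 8 table above (the function name and structure are illustrative, not from the paper):

```python
import math

def stored_snapshots(T, alpha=2, keep=None):
    """Return, per order, the snapshot times still stored at clock time T,
    keeping only the most recent `keep` snapshots of each order
    (alpha + 1 in the paper). Redundant copies across orders are NOT
    de-duplicated here, to match the table in the text."""
    if keep is None:
        keep = alpha + 1
    max_order = int(math.log(T, alpha) + 1e-9)
    stored = {}
    for i in range(max_order + 1):
        step = alpha ** i                 # order-i snapshots occur every alpha^i units
        times = list(range(0, T + 1, step))
        stored[i] = times[-keep:]         # retain the newest `keep` of them
    return stored
```

For example, stored_snapshots(8) gives {0: [6, 7, 8], 1: [4, 6, 8], 2: [0, 4, 8], 3: [0, 8]}, matching the table above.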


Online Micro-Cluster Maintenance

At any moment a total of q micro-clusters are maintained. The value of q is significantly larger than the natural number of clusters in the data set and also significantly smaller than the number of data points in the stream. This allows us to maintain a high level of temporal and spatial granularity while reducing the storage requirement. We will denote the q micro-clusters by M1, M2, …, Mq. Each micro-cluster will have an associated unique ID that was assigned at its creation.

These micro-clusters are stored away as a snapshot whenever the clock time is divisible by α^i, and snapshots of order i older than α^(l+i) time units are deleted.

The initial q micro-clusters are created by an offline process: the standard k-means algorithm is applied to an initial portion of the data stream to create q clusters. Whenever a new data point arrives after this initialization, we have two options:

  1. Absorb the new point into an existing cluster
  2. Create a new cluster with the point as its first member

The decision is based on the distance of the new point to the existing clusters. First we define a maximum boundary for each cluster: the root-mean-square deviation of the cluster's data points from its centroid, multiplied by a factor t. If the new point does not lie within the boundary of its nearest cluster, we create a new cluster. This can happen because the new point is an outlier, or because a new natural cluster is emerging in the data stream. After creating a new cluster we need to reduce the number of clusters by one to keep the total at q. This can be done in two ways:

  1. Delete an old cluster
  2. Merge two clusters that are close to each other

First we check whether an old cluster can be deleted. For this we need a user-defined threshold δ: any cluster that has received no significant updates since T-δ, where T is the current time, can be deleted. To decide this, we compute a relevance stamp for each cluster, defined as the average arrival time of the last m points of the cluster. Since the individual timestamps are not stored, this is estimated as the m/(2n)-th percentile of the timestamps in cluster M, assuming the timestamps are normally distributed. If the relevance stamp of a cluster is older than T-δ, we can proceed to delete it. If none of the micro-clusters is irrelevant, we instead merge the two clusters that are closest to each other, creating an ID list containing the unique IDs of the merged clusters. This check is performed on the arrival of every data point.

So in the online component we handle every new data point in one of the two ways mentioned above, and store snapshots at every α^i time units.
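The per-point decision can be sketched as follows. This is a simplified illustration: deletion and merging are left to the caller, timestamps are omitted, and the special case of a singleton cluster (whose deviation is zero; the paper falls back to the distance to the nearest other cluster) is ignored. All names are my own:

```python
import math

# A micro-cluster is represented here as a dict holding the linear sum `ls`,
# squared sum `ss`, and point count `n` (timestamps omitted for brevity).

def make_cluster(x):
    return {"ls": list(x), "ss": [v * v for v in x], "n": 1}

def absorb(c, x):
    for i, v in enumerate(x):
        c["ls"][i] += v
        c["ss"][i] += v * v
    c["n"] += 1

def centroid(c):
    return [s / c["n"] for s in c["ls"]]

def rmsd(c):
    # root-mean-square deviation of the cluster's points from its centroid
    var = sum(ss / c["n"] - (ls / c["n"]) ** 2
              for ls, ss in zip(c["ls"], c["ss"]))
    return math.sqrt(max(var, 0.0))

def process_point(clusters, x, t_factor=2.0):
    """Absorb x into its nearest micro-cluster if it falls within the maximum
    boundary (t_factor * RMSD); otherwise start a new micro-cluster.
    The caller must then delete or merge clusters to keep the count at q."""
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(centroid(c), x)))
    i = min(range(len(clusters)), key=lambda j: dist(clusters[j]))
    if dist(clusters[i]) <= t_factor * rmsd(clusters[i]):
        absorb(clusters[i], x)
    else:
        clusters.append(make_cluster(x))
```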


Macro-Cluster Creation

This is an offline phase and is not constrained by the one-pass requirement as discussed previously. For this phase,

  • Inputs: h (time horizon), k (number of macro-clusters)
  • Data: the summary statistics in the stored snapshots

The reader must understand the concept of a time horizon before proceeding. For example, if the current time is T and the time horizon is h, macro-cluster creation should consider only the data that arrived within the period (T-h, T). Note that each stored snapshot contains the entire history of the data stream up to that point. To understand how micro-clusters covering only the data that arrived in (T-h, T) are constructed, the reader should be familiar with the following two properties. We will denote the micro-cluster (cluster feature vector) for a set of points C by CFT(C).

  1. Let C1 and C2 be two sets of points. Then the cluster feature vector CFT(C1 ∪ C2) = CFT(C1) + CFT(C2).
  2. Let C1 and C2 be two sets of points such that C2 is a subset of C1. Then the cluster feature vector CFT(C1-C2) = CFT(C1) - CFT(C2).

Consider the situation at clock time tc with a time horizon h. We can find a stored snapshot at time tc-h1, where h1 is the time horizon approximation of h. If we denote the micro-clusters in that snapshot by S(tc-h1) and the current micro-clusters by S(tc), then by the subtractive property the micro-clusters covering the data that arrived in (tc-h1, tc) are given by N(tc,h1) = S(tc) - S(tc-h1). We can then subject N(tc,h1) to the macro-clustering process.
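Both properties amount to element-wise arithmetic on the (CF2X, CF1X, CF2T, CF1T, n) tuple, which is exactly how N(tc,h1) = S(tc) - S(tc-h1) is computed. A sketch (function names are my own):

```python
def cft_add(a, b):
    """Additive property: CFT(C1 ∪ C2) = CFT(C1) + CFT(C2).
    Tuples have the form (cf2x, cf1x, cf2t, cf1t, n)."""
    return ([p + q for p, q in zip(a[0], b[0])],
            [p + q for p, q in zip(a[1], b[1])],
            a[2] + b[2], a[3] + b[3], a[4] + b[4])

def cft_subtract(a, b):
    """Subtractive property: CFT(C1 - C2) = CFT(C1) - CFT(C2), for C2 ⊆ C1."""
    return ([p - q for p, q in zip(a[0], b[0])],
            [p - q for p, q in zip(a[1], b[1])],
            a[2] - b[2], a[3] - b[3], a[4] - b[4])
```

Subtracting the snapshot S(tc-h1) from S(tc) in this way leaves the feature vector of exactly the points that arrived in between.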

We modify the k-means algorithm in the following ways to use it for macro-clustering process.

  • Initial seeds are not random, they are sampled with a probability proportional to the number of points in each micro-cluster.
  • Distance from the seed to any micro-cluster is equal to the distance between the seed and the centroid of the micro-cluster.
  • New seed for a given partition is defined as the weighted centroid of the micro-clusters in that partition.

One can note that the macro-clustering process requires only two carefully selected snapshots.
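Under those three modifications, the offline step reduces to a weighted k-means over the micro-cluster centroids. A minimal sketch (the function name and structure are my own, not the paper's):

```python
import random

def macro_cluster(centroids, weights, k, iters=20, seed=0):
    """Weighted k-means over micro-cluster centroids, following the three
    modifications above: seeds sampled proportionally to cluster weight,
    seed-to-centroid distances, and weighted-centroid seed updates."""
    rng = random.Random(seed)
    d = len(centroids[0])
    # 1. sample initial seeds with probability proportional to weight
    seeds = [list(s) for s in rng.choices(centroids, weights=weights, k=k)]
    for _ in range(iters):
        # 2. assign each micro-cluster to its nearest seed
        assign = []
        for c in centroids:
            dists = [sum((a - b) ** 2 for a, b in zip(c, s)) for s in seeds]
            assign.append(dists.index(min(dists)))
        # 3. replace each seed with the weighted centroid of its partition
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            if not members:
                continue  # empty partition: keep the old seed
            w = sum(weights[i] for i in members)
            seeds[j] = [sum(weights[i] * centroids[i][dim] for i in members) / w
                        for dim in range(d)]
    return seeds, assign
```

Here each micro-cluster is reduced to its centroid plus a weight (its point count n), so the offline pass never touches raw stream data.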


Evolution Analysis of Clusters

If we want to compare the clusters at two different times t1 and t2 (t2 > t1), the steps are as follows:

  • Find N(t1,h) and N(t2,h) using the process explained in the previous section.
  • Divide the micro-clusters in N(t1,h) ∪ N(t2,h) into three categories:
  1. M-added(t1,t2) = clusters in N(t2,h) for which no corresponding IDs are present in N(t1,h)
  2. M-deleted(t1,t2) = clusters in N(t1,h) for which no corresponding IDs are present in N(t2,h)
  3. M-retained(t1,t2) = clusters for which some or all IDs are common to both N(t1,h) and N(t2,h)
  • The macro-clustering algorithm is then applied separately to each of these three categories.
  • If the data stream has not evolved much in (t1,t2), M-retained will contain a large fraction of the data.
  • If the data stream has evolved in (t1,t2), M-deleted and M-added will contain a large fraction of the data.
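The ID-based categorization above can be sketched as follows, representing each micro-cluster by its set of IDs (a merged micro-cluster carries the IDs of all its constituents; the function name is my own):

```python
def partition_evolution(clusters_t1, clusters_t2):
    """Split micro-clusters into the paper's three categories.
    clusters_t1, clusters_t2: lists of ID sets, one set per micro-cluster
    in N(t1,h) and N(t2,h) respectively.
    Returns (m_added, m_deleted, m_retained)."""
    ids1 = set().union(*clusters_t1) if clusters_t1 else set()
    ids2 = set().union(*clusters_t2) if clusters_t2 else set()
    m_added = [c for c in clusters_t2 if not (c & ids1)]    # no ID seen at t1
    m_deleted = [c for c in clusters_t1 if not (c & ids2)]  # no ID survives to t2
    m_retained = [c for c in clusters_t2 if c & ids1]       # some or all IDs shared
    return m_added, m_deleted, m_retained
```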

One can verify that CluStream derives higher-quality clusters than traditional stream clustering algorithms, especially when the data evolves. The pyramidal time frame of snapshots enables users to answer many kinds of queries, and CluStream also scales well.

Reference:

[1] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A Framework for Clustering Evolving Data Streams. Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.
