Optimizing 1D Clustering Through Customization

Ido Hirsh
AppsFlyer Engineering
May 3, 2023 · 14 min read

Off-the-shelf clustering algorithms do not always provide a satisfactory fit for certain problems. We encounter such a case, where standard clustering algorithms fail to account for the constraints imposed by the characteristics of our products, customers, and partners. In order to obtain the optimal solution for our case, we customize a one-dimensional clustering algorithm to suit our needs.

In this blog post, we detail the process by which we modify an algorithm for a one-dimensional clustering problem to achieve an optimal solution for our specific use case. Additionally, we discuss the challenges we face while customizing off-the-shelf algorithms and the strategies employed to overcome them.

Problem context

Awareness of privacy continues to grow steadily. Apple is reinforcing this trend with the release of iOS 14 and subsequent versions, implementing strict privacy policies that limit user data tracking. These policies reflect the industry-wide focus on protecting user privacy. Our products at AppsFlyer help clients’ apps attract more users and increase revenue while ensuring user privacy and information security.

In order to uphold our dedication to safeguarding user data privacy, we use data aggregation methods such as clustering. This enables us to share only aggregated data with our customers, while still growing our clients’ app user base and revenue.

Classic clustering adaptation

In this optimization task, we use a dataset of revenue generated by paying mobile app users. Our objective is to group these users into k clusters, each user’s revenue being as close as possible to the center of their respective cluster. This task is a classic example of a one-dimensional clustering problem. With typical clustering algorithms, the classic center of a group is defined as the average revenue of all the users in the group.

We demonstrate a use case in which we modify the classic center to align with our product’s unique needs, improving clustering accuracy. It is worth stressing that, due to these unique product requirements, we need to adapt some well-known off-the-shelf algorithms; more on that in the following section.

Putting business logic into the model

Our product’s business logic and customer limitations require us to deviate from the traditional definition of a group’s center, which adds complexity to the problem. To solve this one-dimensional clustering problem of grouping users into k clusters C₁, … , Cₖ based on revenue, we aim to minimize the following objective function:

Σⱼ₌₁ᵏ Σ (xᵢ − ηⱼ)²,

where the inner sum runs over all data points xᵢ in cluster Cⱼ.

The centroid of cluster Cⱼ, denoted by ηⱼ, is not defined as the mean of all its members. Instead, it is defined as the midpoint between its infimum and supremum (its mid-range):

ηⱼ = ( min(Cⱼ) + max(Cⱼ) ) / 2

Centroid Calculation’s Effects on Clustering

In the case at hand, where we have a one-dimensional series of user revenue, we can use an exact polynomial-time algorithm that leverages dynamic programming.

In this blog post, we provide a brief overview of dynamic programming and how it can be used to solve this clustering problem.

What is Dynamic Programming?

Dynamic programming (DP) is a programming technique in which a problem is first broken down into subproblems. The results of the subproblems are then saved and reused to build an overall optimal solution to the original problem.
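As a minimal illustration of the idea (unrelated to the clustering problem itself), here is the classic memoized Fibonacci computation in Python: each subproblem is solved once, cached, and reused.

from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Each subproblem fib(n) is computed once, cached, and then reused
    # to build the solutions of larger problems.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(50))  # 12586269025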

Dynamic Programming Algorithm for 1D Clustering

Let x₁, … , xₙ be the input data sorted in ascending order, the revenue of the users in our case. Let CC(j, m) be the loss of a single cluster containing the subsequence xⱼ, … , xₘ, where 1 ≤ j ≤ m ≤ n [1]:

CC(j, m) = Σᵢ₌ⱼᵐ (xᵢ − ηⱼ,ₘ)²,

where ηⱼ,ₘ is the centroid of the cluster which contains those data points.
Let us define a k×n matrix D such that D[i][m] is the loss when optimally clustering x₁, … , xₘ into i clusters.
For i = 1,

D[1][m] = CC(1, m).

For i > 1,

D[i][m] = min over 1 ≤ j ≤ m of ( D[i−1][j−1] + CC(j, m) ).

Using argmin instead of min gives us the index of the leftmost element in the last cluster, and the complete optimal clustering can then be recovered by standard backtracking.
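As a toy illustration with made-up numbers, take the sorted revenues x = (1, 2, 10) and k = 2 (for clusters of at most two points the mean and the mid-range centroid coincide, so the arithmetic below holds for either definition). Since every cluster must be non-empty, the meaningful choices for the start index of the last cluster are j = 2 and j = 3:

D[2][3] = min( D[1][1] + CC(2, 3), D[1][2] + CC(3, 3) )
        = min( 0 + (2 − 6)² + (10 − 6)², (1 − 1.5)² + (2 − 1.5)² + 0 )
        = min(32, 0.5) = 0.5,

and the argmin, j = 3, tells us that the last cluster of the optimal 2-clustering is {10}, with {1, 2} as the first cluster.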

Furthermore, we replace the original centroid formula,

ηⱼ,ₘ = ( xⱼ + xⱼ₊₁ + … + xₘ ) / (m − j + 1)  (the mean of the cluster members),

with the customized one,

ηⱼ,ₘ = ( xⱼ + xₘ ) / 2  (the mid-range of the cluster members, since the data is sorted),

which complies with our problem requirements, and plug it into CC(j, m).
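To make the difference concrete, here is a minimal sketch of the two centroid definitions, using the example values that also appear in the docstring further below; with sorted data, the smallest and largest members of a cluster xⱼ, …, xₘ are simply xⱼ and xₘ.

import numpy as np

cluster = np.array([30, 40, 60, 90, 110, 150])  # one candidate cluster, already sorted

classic_centroid = cluster.mean()                     # 80.0, the mean of the members
midrange_centroid = 0.5 * (cluster[0] + cluster[-1])  # 90.0, the customized mid-range
print(classic_centroid, midrange_centroid)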

Python Code for Dynamic Programming 1D Clustering

In order to implement the dynamic programming algorithm, we first define a cost function, Cᵢ[m][j], which calculates the cost of clustering x₁, …, xₘ into i clusters where the last cluster contains the data points xⱼ, …, xₘ.

For j ≤ m,
Cᵢ[m][j] = D[i-1][j-1] + CC(j, m).

For j > m,
Cᵢ[m][j] = D[i-1][m].

These two formulas can be reduced into one,
Cᵢ[m][j] = D[i-1][min{j-1, m}] + CC(j, m),

where CC(j, m) = 0 when j > m.

From the definition of D[i][m] it follows that

D[i][m] = min over j of Cᵢ[m][j],

where Cᵢ is an n×n matrix with row index m and column index j, 1 ≤ j, m ≤ n, and D is a k×n matrix with row index i and column index m, 1 ≤ i ≤ k and 1 ≤ m ≤ n.

Now, let us examine the complete process using Python:

Step 1 — Convert the input array of revenue into a table of values and frequencies.

import numpy as np
import pandas as pd


def get_value_count_df(arr):
    '''Convert an array to a pandas DataFrame of frequencies.'''

    # Get unique values and their counts
    unique, counts = np.unique(arr, return_counts=True)

    # Create a pandas dataframe with the unique values and their counts
    df = pd.DataFrame({'count': counts, 'revenue': unique})

    # Sort by value
    df.sort_values(by='revenue', ascending=True, ignore_index=True, inplace=True)

    return df
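For example (illustrative values), repeated revenues collapse into one row per unique value:

arr = np.array([40, 40, 60, 30, 60, 60])
print(get_value_count_df(arr))
#    count  revenue
# 0      1       30
# 1      2       40
# 2      3       60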

Step 2 — Calculate the cluster cost function CC(i, j) for all possible i and j.

def CC(df_uniq: pd.DataFrame):
    '''This function calculates the cost of each cluster of 1D data.
    The input is a pandas dataframe that stores 1D data points as a table of
    frequencies with "count" and "revenue" columns, where "revenue" is the value
    and "count" is its frequency.
    The output is an n x n matrix named CC where rows refer to the right boundary
    of the cluster, columns refer to the left boundary of the cluster, and each
    value is the distance of the data points from the centroid of the cluster.

    E.g., assuming the data points are [30, 40, 60, 90, 110, 150] and that they
    are sorted in ascending order, the cost of the cluster with the
    elements in the second up to the fifth place [40, 60, 90, 110] is
    stored in the matrix CC in the 5th row and 2nd column.

    Sum of Weighted Square Error:
    CC(p,q) = sum_i_equal_p_to_q(w_i * (x_i - u_pq)^2) =
            = u_pq ^ 2 * sum(w_i) - 2 * u_pq * sum(w_i * x_i) + sum(w_i * x_i ^ 2)
    where,
        p and q are integers where p <= q,
        sum_i_equal_p_to_q is a summation that runs over i equal p to i equal q,
        x_i is a 1D data point,
        u_pq is the centroid of the cluster containing
            x_p, x_p_plus_1, x_p_plus_2,..., x_q.

    :param df_uniq: Pandas dataframe. A frequencies table stored in the columns:
        "count" and "revenue"
    :return: CC_values: A matrix of cluster costs, where the (i, j) entry in the matrix
        represents the value of the cluster cost function at the input (i, j),
        i.e., CC(i, j).
    '''

    # Adding cumulative weights and summations columns
    df_uniq['count_cumsum'] = (df_uniq['count']).cumsum()
    df_uniq['weighted_revenue_cumsum'] = (
        df_uniq['count'] * df_uniq['revenue']).cumsum()
    df_uniq['weighted_square_revenue_cumsum'] = (
        df_uniq['count'] * df_uniq['revenue'] ** 2).cumsum()

    # Generate the cluster costs matrix
    len_uniq = len(df_uniq)
    CC_values: np.ndarray = np.zeros((len_uniq, len_uniq))

    revenue = df_uniq['revenue'].values
    # Add leading zero
    w_cumsum = np.concatenate((np.array([0]), df_uniq['count_cumsum'].values))
    w_revenue_cumsum = np.concatenate(
        (np.array([0]), df_uniq['weighted_revenue_cumsum'].values))
    w_square_revenue_cumsum = np.concatenate(
        (np.array([0]), df_uniq['weighted_square_revenue_cumsum'].values))

    # Calculate CC matrix
    for m in range(len_uniq):
        centroid = 0.5 * (revenue[m] + revenue)  # Mid-range (customized)
        # CC[p, q] = u_pq ^ 2 * sum(w_i) - 2 * u_pq * sum(w_i * x_i)
        #            + sum(w_i * x_i ^ 2) for p=1 to n
        CC_row = (w_cumsum[m + 1] - w_cumsum[:-1]) * centroid ** 2 \
            - 2 * centroid * (w_revenue_cumsum[m + 1] - w_revenue_cumsum[:-1]) \
            + (w_square_revenue_cumsum[m + 1] - w_square_revenue_cumsum[:-1])
        CC_values[m, :m + 1] = CC_row[:m + 1]  # Set elements where j <= m

    # Set inf entries when left boundary > right boundary
    CC_values[np.triu_indices(n=len_uniq, m=len_uniq, k=1)] = np.inf

    return CC_values
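As a quick sanity check (not part of the original post), the vectorized computation above can be compared against a direct double loop over cluster boundaries:

def CC_naive(df_uniq: pd.DataFrame) -> np.ndarray:
    '''Brute-force reference implementation of CC, for validating the vectorized version.'''
    values = df_uniq['revenue'].values
    weights = df_uniq['count'].values
    n = len(values)
    out = np.full((n, n), np.inf)
    for m in range(n):          # right boundary of the cluster
        for j in range(m + 1):  # left boundary of the cluster
            centroid = 0.5 * (values[j] + values[m])  # customized mid-range centroid
            out[m, j] = np.sum(weights[j:m + 1] * (values[j:m + 1] - centroid) ** 2)
    return out


df_check = get_value_count_df(np.array([30, 40, 60, 90, 110, 150]))
assert np.allclose(CC(df_check), CC_naive(df_check))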

Step 3 — Calculate optimal clustering costs for all possible combinations of consecutive data points and number of clusters from 1 to k.

def C(CC_values, num_clusters):
    '''Implementation of a dynamic programming algorithm for clustering 1D data.
    The recursive rule:
        D[i][m] = min_j_equal_1_to_m (D[i-1][j-1] + CC(j, m)).

    C_i[m][j] is defined as the cost of optimally clustering m data points into
    i clusters where the last cluster starts at the jth data point,
        C_i[m][j] = D[i-1][min(j-1, m)] + CC(j, m).

    D relates to C_i as follows:
        D[i][m] = min_j_equal_1_to_m C_i[m][j]

    :param CC_values: A matrix of cluster costs, where the (i, j) entry in the matrix
        represents the value of the cluster cost function at the input (i, j),
        i.e., CC(i, j).
    :param num_clusters: The number of clusters, k.
    :return: C_values: A 3D numpy array, where the (i, m, j) entry in the 3D array
        represents the cost of optimal clustering of m data points into i clusters,
        where the smallest data point in the last cluster is the jth data point
        (if j > m or j < i, the cost is set to infinity).
    '''
    # Initialize all entries to infinity
    C_values: np.ndarray = np.ones(
        (num_clusters, CC_values.shape[0], CC_values.shape[1])) + np.inf
    # i = 1: a single cluster covering x_1, ..., x_m costs CC(1, m).
    # Only the row minimum of C_values[0] is used later, so the column in which
    # this value is stored does not matter.
    np.fill_diagonal(C_values[0], CC_values[:, 0])
    for k in range(1, num_clusters):
        # D[k-1][m]: optimally cluster the first m points into k-1 clusters, m = 1 to n
        D_kminus1 = C_values[k - 1].min(axis=1)
        # Shift so that index j holds D[k-1][j-1]
        D_kminus1 = np.concatenate([[np.inf], D_kminus1[:-1]])
        # C_k[m][j] = D[k-1][j-1] + CC(j, m), via broadcasting over the rows m
        C_values[k] = D_kminus1 + CC_values

    return C_values
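Given C_values = C(CC_values, k), the DP table D from the formulas above can be recovered directly; a short sketch:

D = C_values.min(axis=2)        # D[i][m]: minimize C_i[m][j] over the left-boundary index j
optimal_total_cost = D[-1, -1]  # optimal cost of clustering all n points into k clusters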

Step 4 — Assign each data point to the nearest cluster centroid.

def assign_to_nearest_centroid(df_uniq: pd.DataFrame, C_values: np.ndarray):
    '''
    :param df_uniq: Pandas dataframe. A frequencies table stored in the columns:
        "count" and "revenue"
    :param C_values: A 3D numpy array, where the (i, m, j) entry in the 3D array
        represents the cost of optimal clustering of m data points into i clusters,
        where the smallest data point in the last cluster is the jth data point
        (if j > m or j < i, the cost is set to infinity).
    :return: df_uniq: The input dataframe with added columns for data point cluster
        and centroid.
    '''
    # Extract clusters by backtracking from the last data point
    suffix = '_dp_mid-range'
    cluster_name = 'cluster' + suffix
    centroid_name = 'centroid' + suffix
    df_uniq[cluster_name] = None
    idx_right_boundary = -1
    len_C = len(C_values)
    # Set cluster and centroid for each data point
    for k in range(len_C - 1, 0, -1):
        # Leftmost data point of the last cluster in the optimal (k+1)-clustering
        idx_left_boundary = C_values[k][idx_right_boundary].argmin()
        left_boundary = df_uniq['revenue'].iloc[idx_left_boundary]
        right_boundary = df_uniq['revenue'].iloc[idx_right_boundary]
        in_cluster_cond = (df_uniq['revenue'] >= left_boundary) & \
                          (df_uniq['revenue'] <= right_boundary)
        df_uniq.loc[in_cluster_cond, cluster_name] = len_C - k
        df_uniq.loc[in_cluster_cond, centroid_name] = 0.5 * (
            df_uniq.loc[in_cluster_cond, 'revenue'].min() +
            df_uniq.loc[in_cluster_cond, 'revenue'].max())
        idx_right_boundary = idx_left_boundary - 1

    # Set cluster and centroid for the rest of the data points (the leftmost cluster)
    in_cluster_cond = df_uniq[cluster_name].isna()
    df_uniq.loc[in_cluster_cond, centroid_name] = 0.5 * (
        df_uniq.loc[in_cluster_cond, 'revenue'].min() +
        df_uniq.loc[in_cluster_cond, 'revenue'].max())
    df_uniq.loc[in_cluster_cond, cluster_name] = len_C  # clusters are numbered 1..k

    return df_uniq

Optimal Clustering — Combining Steps 1 to 4.

def one_d_dp_clustering_with_custom_centroids(arr, k):
    '''
    Run the dynamic programming algorithm to cluster the input array into k
    clusters.
    :param arr: Numpy array with the 1D data points to be clustered
    :param k: Number of clusters
    :return: df_uniq: A frequencies dataframe of the input with added columns for
        each data point's cluster and centroid.
    '''
    # Generate dataframe of counts and unique values
    df_uniq = get_value_count_df(arr)
    # Calculate cluster cost for all possible clusters
    CC_values = CC(df_uniq)
    # Calculate the optimal clustering costs
    C_values = C(CC_values, k)
    # Assign data points to clusters
    df_uniq = assign_to_nearest_centroid(df_uniq, C_values)

    return df_uniq
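A minimal usage sketch with made-up data (the revenue values and the choice of k = 3 are illustrative only):

rng = np.random.default_rng(seed=0)
revenues = np.round(rng.exponential(scale=20.0, size=200), 2)  # illustrative revenue values

result = one_d_dp_clustering_with_custom_centroids(revenues, k=3)
print(result[['revenue', 'count', 'cluster_dp_mid-range', 'centroid_dp_mid-range']])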

This DP algorithm gives the optimal solution with a runtime of O(kn²), where k is the number of clusters and n is the number of data points, and typically k ≪ n.
An algorithm called SMAWK accelerates the DP algorithm to a runtime of O(kn). However, due to the customization of the centroid function, it cannot be applied in our situation. More information about the SMAWK algorithm and its inapplicability in this particular case can be found in the Appendix at the end of this blog post.

Experiment

Although in practice we work with actual user data, for the purposes of this blog post we showcase our algorithm using artificially generated data that is distributed similarly to real-world data. We conduct experiments with synthetic data drawn from five beta distributions with distinct parameters and a uniform distribution, for a total of 1100 data points.
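One possible way to generate data of this shape is sketched below; the exact beta parameters and per-component sample sizes are not published here, so the values in this snippet are assumptions that only match the overall description (five beta components plus a uniform one, 1100 points in total):

import numpy as np

rng = np.random.default_rng(seed=42)
beta_params = [(2, 8), (5, 5), (8, 2), (2, 2), (10, 30)]    # assumed parameters
parts = [rng.beta(a, b, size=200) for a, b in beta_params]  # five beta components
parts.append(rng.uniform(0.0, 1.0, size=100))               # one uniform component
revenues = np.concatenate(parts)                            # 5 * 200 + 100 = 1100 data points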

Example of synthetic data which simulates real data:

Histogram of the synthetic data:

Fig.1 — Histogram of revenue, where the X-axis is the value and the Y-axis is the frequency.

The experiment is run twice: first, using the DP algorithm with an objective function based on the classic centroid, calculated as the mean of all data points in the cluster, and second, with an objective function based on the custom centroid, calculated as the mid-range of all the data points in the cluster. In both experiments, the input parameter for the number of clusters is set to 5.

The two histograms below show the clustering results, with each color indicating a different cluster and the vertical line denoting its centroid.

The two histograms show differences in the clustering of outlier data points and the location of centroids. For instance, in the first histogram, the data points on the right side of the leftmost cluster belong to the second cluster in the second histogram.

Let us compare the RMSE (root mean squared error):

RMSE = √( (1/n) Σᵢ₌₁ⁿ (xᵢ − cᵢ)² ),

where xᵢ is a data point and cᵢ is the mid-range of the cluster it belongs to. The first histogram has an RMSE of 0.22, while the second histogram has an RMSE of 0.17.
We expect the RMSE of the second model, where the centroid is calculated as the mid-range of all data points in the cluster, to be less than or equal to the RMSE of the first model, where the centroid is calculated as the average of all the data points in the cluster. This is because the RMSE and the objective function of the second model, which uses the custom centroid, are minimized by the same clustering (they share the same argmin).
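For reference, a small sketch of how this RMSE can be computed from the frequency table returned in Step 4 (using the column names defined there, and weighting each unique revenue by its count):

def weighted_rmse(df_uniq: pd.DataFrame, centroid_col: str = 'centroid_dp_mid-range') -> float:
    '''RMSE between each revenue value and its cluster centroid, weighted by frequency.'''
    sq_err = (df_uniq['revenue'] - df_uniq[centroid_col]) ** 2
    return float(np.sqrt((df_uniq['count'] * sq_err).sum() / df_uniq['count'].sum()))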

Fig.2 — Histogram of the data from Figure 1, with colored bins indicating clusters and gray lines representing their centroids, which are defined as the mean of the data points within each cluster.
Fig.3 — Histogram of the data from Figure 1, with colored bins indicating clusters and gray lines representing their centroids, which are defined as the mid-range of the data points within each cluster.

Lessons learned

Occasionally, product requirements lead us to search for handcrafted solutions when off-the-shelf solutions fall short. In such cases, we have to be creative and customize existing solutions to meet the requirements defined for the product. However, modifying proven algorithms requires verifying that their assumptions and prerequisites are still met, and making the necessary adjustments.
In our example, changing the centroid function prevents us from using the SMAWK algorithm [2] to speed up the dynamic programming solution; generally speaking, we have seen that small changes can make a big difference.

The change we made in calculating the centers of the clusters does not prevent us from making additional changes if necessary; for example, the objective function itself could also be modified.

Furthermore, if runtime is an issue, it is worth optimizing it by, for instance, reducing the data size through sampling or choosing faster algorithms, even if that means compromising on accuracy.

Final takeaway: “Feel comfortable customizing an algorithm but respect all its requirements carefully.”

References

[1] Grønlund, A., Larsen, K. G., Mathiasen, A., Nielsen, J. S., Schneider, S., & Song, M. (2017). Fast exact k-Means, k-Medians and Bregman divergence clustering in 1D.

[2] Alok Aggarwal, Maria M. Klawe, Shlomo Moran, Peter W. Shor, and Robert E. Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2:195–208, 1987.

[3] Lawrence L. Larmore (UNLV). The SMAWK Algorithm.

[4] Python implementation of the SMAWK algorithm, available on GitHub.

Appendix

Speeding up the dynamic programming with the SMAWK algorithm

(See a detailed example in [3].)

We begin with the following definitions:

Let A be a matrix of real numbers, and let argmin Aᵢₖ (argmin over the index k) denote the index of the leftmost column containing the minimum value in row i of A.
A is said to be monotone if r < s implies that argmin Aᵣₖ ≤ argmin Aₛₖ (argmin over the index k).
A is totally monotone if all of its submatrices are monotone.
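Total monotonicity is equivalent to every 2×2 submatrix being monotone, so it can be checked directly; below is a small illustrative helper (not from the original post):

import itertools

import numpy as np


def is_totally_monotone(A: np.ndarray) -> bool:
    '''Check total monotonicity by testing every 2x2 submatrix of A.'''
    n_rows, n_cols = A.shape
    for r, s in itertools.combinations(range(n_rows), 2):      # rows r < s
        for p, q in itertools.combinations(range(n_cols), 2):  # columns p < q
            sub = A[np.ix_([r, s], [p, q])]
            # np.argmin returns the leftmost minimum of each row
            if np.argmin(sub[0]) > np.argmin(sub[1]):
                return False
    return True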

Example: any submatrix of a totally monotone matrix, such as the one induced by its 3rd and 6th rows and its 7th and 9th columns, is itself monotone.

The SMAWK algorithm [2] finds all row minima of an n×n totally monotone matrix in O(n) time. A Python implementation of the algorithm is also available [4].

In the standard case,

CC(j, m) = Σᵢ₌ⱼᵐ (xᵢ − μⱼ,ₘ)²,

where

μⱼ,ₘ = ( xⱼ + xⱼ₊₁ + … + xₘ ) / (m − j + 1)

is the classic mean centroid.
Let us look at the following two statements, which we will use next.

The Monge Concave property (Monge Concave for 1D k-Means cost):

For any a < b and u < v, it holds that,
CC(v, b) + CC(u, a) ≤ CC(u, b) + CC(v, a).

The second statement:

The matrix Cᵢ is totally monotone if the cluster cost CC(j, m) is Monge concave.

The classic CC(j, m) is Monge concave (the proof is given in Section 4 of [1]), which implies that the matrix Cᵢ is totally monotone, and therefore the SMAWK algorithm can be used to find all row minima of Cᵢ in O(n) time. By the formula that relates D[i][m] to Cᵢ, finding all row minima of Cᵢ is equivalent to calculating the ith row of D, which gives a dynamic programming algorithm with a running time of O(kn).

Now we want to plug the customized centroid function,

ηⱼ,ₘ = ( xⱼ + xₘ ) / 2,

into CC(j, m).

Having defined the SMAWK algorithm, a question arises: does it still work with the customized centroid? Let us find out.

Two approaches suggest themselves: first, proving that the customized cluster cost function CC(j, m) is Monge concave would yield a ‘Yes’ answer; second, finding an example that contradicts the definition of a totally monotone matrix for the matrix Cᵢ would yield a ‘No’ answer. One way to demonstrate that a matrix is not totally monotone is to identify a 2×2 submatrix whose second row increases from left to right while its first row decreases, so that the leftmost minimum of the second row lies to the left of that of the first row.

In the example below, we examine a submatrix of the matrix C₅, which is calculated from the synthetic data with the number of clusters set to k=5.
Let us look at the submatrix induced by the entries C₅[1, 1], C₅[1, 2], C₅[3, 1], C₅[3, 2].
It can be observed that the leftmost index of the minimum value of the second row (index = 1) is less than that of the first row (index = 2). Therefore, the submatrix is not monotone, which implies that the matrix C₅ is not totally monotone, and we cannot use the SMAWK algorithm to speed up the dynamic programming algorithm in our use case.

Fig.4 — The submatrix induced by the red squares is not monotone.
