Applying Machine Learning Clustering in DeFi

Karol Kalejta · Published in Coinmonks · 7 min read · Mar 28, 2024


Machine Learning Algorithms in Action using On-Chain Data

Executive Summary:

Navigating the DeFi world can be overwhelming given the ever-changing landscape of new protocols, chains and use cases. At SoDeFi we aim to simplify the user journey into DeFi by scoring pools for quality and sustainable yield. The scoring is done using statistical measures through the SoDeFi score, as well as novel machine learning methods. In the following article, we will take a deep dive into how an unsupervised machine learning algorithm, k-means clustering, can be utilized to better understand the yield characteristics of DeFi pools.

What is cluster analysis?

Before jumping into the analysis and methodology, I wanted to begin with what exactly cluster analysis is and why blockchain data is a great use case for applying this unsupervised learning algorithm.

K-means clustering categorizes items into ‘k’ groups based on certain statistical features. The algorithm works by calculating the distance of an item’s features from a randomly initialized ‘centroid’ for each group. It then assigns the item to the group with the closest centroid. The centroids are then recalculated based on the current members of each group, and the process is repeated until the group memberships stabilize.
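
To make the loop concrete, here is a minimal NumPy sketch of the algorithm just described. The function and toy data are purely illustrative; in practice you would reach for a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Randomly initialize 'k' centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance of every point to every centroid, then nearest-centroid labels
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its current members
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # memberships have stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Toy run: 200 random 2-D points grouped into 3 clusters
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(X, k=3)
```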

K-means is considered an unsupervised learning algorithm because it doesn’t require labeled data for training. In supervised learning, algorithms learn from labeled data, where the correct output (label) is known for each instance in the training dataset. The algorithm uses this labeled data to learn a function that can predict the output for new, unseen instances.

In contrast, unsupervised learning algorithms like K-means work with unlabeled data. They aim to discover the inherent structure or patterns in the data. In the case of K-means, the goal is to group similar data points together based on their feature values, forming clusters.

Why cluster analysis with on-chain data?

K-means is widely used in business settings where you would like to learn about inherent patterns without first imposing preconceived notions about behaviours. Customer segmentation is one great example: businesses can use K-means to segment their customers based on features like purchasing behavior, demographics, or product usage. This can help in targeted marketing, customer retention strategies, or improving customer service.

Just as retailers learn about their customers through segmentation, k-means clustering can help us understand the behaviour of DeFi pools. Specifically, clustering can be used in the following ways:

  1. Discovering Patterns: K-means clustering can help identify patterns or trends in the data that might not be immediately apparent. For example, it could reveal that certain DeFi pools with similar APYs also have similar TVLs, suggesting a relationship between these two variables.
  2. Segmentation: K-means can be used to segment DeFi pools into different groups based on their APY and TVL. This can provide valuable insights into the characteristics of different types of pools. For instance, high APY pools might have a different risk profile compared to low APY pools.
  3. Anomaly Detection: Clustering can also be used to detect outliers or anomalies. If a DeFi pool does not fit well into any of the established clusters, it could indicate that there’s something unusual about its APY or TVL (a short sketch of this idea follows the list).
  4. Simplifying Complex Data: By grouping similar DeFi pools together, k-means clustering can make it easier to understand and interpret complex data. Instead of having to analyze each pool individually, you can analyze the characteristics of each cluster.
  5. Predictive Modeling: The clusters created by k-means can also be used as input for predictive models. For example, you could use the cluster labels as a feature to predict future APY or TVL.
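
As an illustration of the anomaly detection use case above, the sketch below flags pools that sit unusually far from their assigned cluster centre. The random data, the cluster count and the 99th percentile cutoff are all placeholder assumptions, not values from the actual analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder features: two columns standing in for log(APY) and log(TVL)
X = np.random.default_rng(0).normal(size=(500, 2))

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Distance of each pool to its assigned cluster centre
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag the pools in the top 1% of distances as potential anomalies
outliers = np.where(dists > np.quantile(dists, 0.99))[0]
print(f"{len(outliers)} pools flagged for a closer look")
```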

What does DeFi cluster analysis tell us?

Below is an output of the cluster analysis. The analysis created 15 clusters using both quantitative and qualitative data. On the quantitative side, the clustering algorithm took in the current TVL, the seven day change in TVL, the seven day rolling mean APY and its standard deviation. On the qualitative side, each pool is described by how many tokens it holds, whether it is a stablecoin pool and whether it belongs to a top 25 project.

Some of the observations that jump out are as follows:

  • k-means isolated pools that pay minimal APY and have large TVL into clusters 12 and 13. These are ultra-safe collateralized lending pools holding bluechip tokens, which offer little to no yield but are borrowed against in order to access liquidity. An example of a pool here would be the Compound WBTC (Wrapped Bitcoin) pool.
  • High risk and volatility pools are grouped into clusters 6 and 14. Both clusters have the highest weekly APY and standard deviation. There are no stablecoins within these clusters, and less than half of the pools belong to the top 25 largest projects. These pools would be classified as high risk, high reward pools.
  • The clusters that contain only stablecoins are clusters 1 and 9. The key differentiator between the two is that cluster 1 contains pools whose APY increased over the past seven days, while the opposite is true of cluster 9.
  • From an investing perspective, the clusters can be judged by a rule of thumb of how much return they generate per unit of standard deviation. Using such a heuristic, clusters 4 and 5 are particularly interesting as they have the highest ratio.

These are just some examples of the insights one can draw from clustering and segmentation. Next, we will walk through the more technical part of how the clustering was accomplished.

Technical Guide to K-means Cluster Analysis

The data for the k-means analysis consists of a seven day period at the end of 2023 (ending December 31). The data was gathered from DeFiLlama on an individual pool basis. The features included quantitative information such as TVL and APY, as well as qualitative data such as project name, pool tokens and chain details.
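
As a sketch of this step, the snippet below pulls a current snapshot of pool data from DeFiLlama's public yields API. The endpoint and field names reflect the API as documented at the time of writing and may change; the actual analysis also used seven days of history per pool rather than a single snapshot.

```python
import pandas as pd
import requests

# Snapshot of all pools from DeFiLlama's public yields API
resp = requests.get("https://yields.llama.fi/pools", timeout=30)
pools = pd.DataFrame(resp.json()["data"])

# Keep the fields relevant to the analysis (names as returned by the API)
cols = ["pool", "project", "chain", "symbol", "stablecoin", "tvlUsd", "apy"]
pools = pools[cols]
print(pools.head())
```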

The APY data was transformed into a weekly average and its standard deviation to better gauge the overall pattern over time. As pool data tends to be quite noisy, the first transformation is visualized below: pool observations tend to hug the X and Y axes. To correct this and better visualize the distribution of APY and TVL, several transformations were carried out.

Firstly, the weekly APY was included as a logarithmic value. Secondly, TVL was converted to a logarithmic value and a new variable was created to capture its change over time. The transformations are visualized below.

Taking the logarithm helps handle the underlying skewness of the APY and TVL variables and reduces their variance. The chart below shows how the two variables are distributed after the transformation. Previously the variables closely hugged the axes; now they are more uniformly distributed, making clustering easier.
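
Putting the transformations together, here is a pandas sketch. The input file and column names are assumptions standing in for the DeFiLlama extract, and log1p is used for APY simply to keep zero-yield pools defined.

```python
import numpy as np
import pandas as pd

# Assumed input: one row per pool per day over the seven-day window,
# with columns pool, date, tvlUsd and apy
history = pd.read_csv("pool_history.csv", parse_dates=["date"])
history = history.sort_values(["pool", "date"])
g = history.groupby("pool")

features = pd.DataFrame({
    "apy_mean_7d": g["apy"].mean(),   # seven-day mean APY
    "apy_std_7d": g["apy"].std(),     # and its standard deviation
    "tvl": g["tvlUsd"].last(),        # current TVL
})

# Log transforms tame the skew; log1p keeps zero-yield pools defined
features["log_apy_7d"] = np.log1p(features["apy_mean_7d"])
features["log_tvl"] = np.log(features["tvl"])

# New variable capturing the change in TVL over the window
features["tvl_change_7d"] = g["tvlUsd"].last() / g["tvlUsd"].first() - 1
```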

Selecting Number of Clusters

Now that the data has been prepped for clustering we need to select the optimal number of clusters to use. The clustering algorithm works by first randomly initializing ‘k’ centroids. Each data point is then assigned to the nearest centroid, forming ‘k’ clusters. The centroids are recalculated as the mean of all data points in the cluster, and the data points are reassigned again. This process is repeated until the centroids no longer change significantly, indicating that the algorithm has converged.

The ‘k’ in K-means stands for the number of clusters, which is a hyperparameter that needs to be set before running the algorithm. Determining the optimal number of clusters is a challenging task. One common method is the Elbow method.

The Elbow method involves running the K-means algorithm in a loop with an increasing number of clusters, then plotting a clustering score as a function of the number of clusters.

The score could be the inertia (sum of squared distances of samples to their closest cluster center) or any other metric that measures the quality of clustering. The idea of the Elbow method is to choose the number of clusters where the rate of decrease sharply shifts, which is the “elbow” of the plot. This point represents a satisfactory trade-off between increasing complexity (more clusters) and diminishing returns from explaining the variance in the data.
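
Here is a sketch of that loop using scikit-learn, continuing from the feature table built earlier. The standardization step is my own assumption, as the article does not spell out the exact preprocessing.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize so no single feature dominates the distance calculation
X = StandardScaler().fit_transform(
    features[["log_apy_7d", "apy_std_7d", "log_tvl", "tvl_change_7d"]].dropna()
)

# Fit k-means for a range of k and record the inertia of each fit
ks = range(1, 21)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()
```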

Based on the visualization above, clusters beyond roughly 8 add diminishing explanatory power. Since I wanted to dig a little deeper into how the data are clustered and examine their differences, I chose 15 as the number of clusters.
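
Continuing from the elbow sketch above, here is how the final fit and the kind of colour-coded scatter behind the figure could be produced; the palette and axis choices are mine, not necessarily those of the original chart.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Final fit with the chosen k = 15
km = KMeans(n_clusters=15, n_init=10, random_state=0).fit(X)

# Colour-code pools by cluster on the log(TVL) vs log(APY) plane
plt.scatter(X[:, 2], X[:, 0], c=km.labels_, cmap="tab20", s=10)
plt.xlabel("log TVL (standardized)")
plt.ylabel("log 7-day APY (standardized)")
plt.title("DeFi pools in 15 k-means clusters")
plt.show()
```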

The final visual result of the clustering is demonstrated in the figure above. The colour variation marks each of the 15 clusters selected. Although the data is quite crowded together, we can still observe some of the distinct clusters discussed previously.

Thank you for reading this rather lengthy deep dive. For more information or to stay updated on the latest DeFi trends, subscribe to the newsletter. If you feel others in your network might find this article interesting, please share it.


Karol Kalejta · Coinmonks

Where Finance meets Technology. Day job in the TradFi space working in Strategy; nights spent learning and writing about the developing crypto and DeFi world.