Unsupervised Learning as Signals for Pairs Trading and StatArb

Adam
Call For Atlas
Mar 6, 2024
Dall-E (2024) Imagining clusters in Capital Markets

In this article, we will leverage clustering algorithms to detect pairs in a universe of tradable securities. We will train three unsupervised learning models: K-Means, DBSCAN, and agglomerative clustering, to find stocks that share similar characteristics and movements. When one of these stocks deviates from its cluster, that will be our signal to trade.

The code and models in this article are large and computationally expensive; if you wish to run this code, expect training times of hours to days. The code caches at every step to prevent retraining and re-processing.

The models were inspired by the paper Han et al. (2023), which explores the use of unsupervised learning in statistical arbitrage, specifically pairs trading.

Firm Characteristics Data

All papers referenced in this article leverage the Center for Research in Security Prices (CRSP) research data on securities. Specifically, they look at the company characteristics presented in the table below:

Parquet Dataset Creation

The dataset is large: around 3GB of company characteristics from 1980 to 2021. It was curated for the papers "Empirical Asset Pricing via Machine Learning" (2018) and "Autoencoder Asset Pricing Models" (2019) by Shihao Gu, Bryan Kelly, and Dacheng Xiu. The raw format is available for download from the authors' personal websites (or reach out to me for a curated dataset).

The dataset has 94 one-month-lagged firm characteristics (as CRSP releases these with a one-month delay).

Using pyarrow, we will convert everything to parquet for speedy processing.

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

if not os.path.exists(FILEPATH_PARQ):
    chars_df = pd.read_csv(FILEPATH)[FEATURES]
    chars_df['DATE'] = pd.to_datetime(chars_df['DATE'], format='%Y%m%d')
    chars_df = chars_df[(chars_df['DATE'].dt.year >= MIN_YEAR) & (chars_df['DATE'].dt.year <= MAX_YEAR)]
    chars_df = chars_df.sort_values("DATE")
    chars_df.to_parquet(FILEPATH_PARQ, index=False, compression="snappy")
else:
    chars_df = pd.read_parquet(FILEPATH_PARQ)

chars_df.head(5)

Sanitization and Feature Engineering

To sanitize our dataset, we drop any company with insufficient data to fill a window of 48 months and drop characteristics with more than 15% NaNs. We then impute any missing characteristic with the rolling median of the current time window.

The data has a one-month momentum (MOM1M) characteristic. MOM is the stock's momentum measure, and is calculated as follows:
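Assuming the standard definition used in the Gu, Kelly, and Xiu dataset (an assumption, since we never reconstruct prices from PERMNO), the one-month momentum is simply the stock's total return over the prior month:

\mathrm{mom1m}_t = \frac{P_{t-1}}{P_{t-2}} - 1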

Although we don't have a method to convert PERMNO to the actual stock and calculate its price, MOM is a good enough proxy, and provides the change in price for that month.

Using rolling windows of 2 to 64 months, we calculate additional MOM characteristics:
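Mirroring the rolling computation in the code below, each additional characteristic compounds the one-month momentum over a window of k months:

\mathrm{mom}k\mathrm{m}_t = \prod_{j=0}^{k-1}\left(1 + \mathrm{mom1m}_{t-j}\right) - 1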

These moving averages help us capture trends over time.

Window sizes from 1 to 64 months allow the models to capture both short-term and long-term trends. While clustering is distance- and hierarchy-based, we do want to add some temporal dynamics so the clustering algorithms can consider the time-series behavior.

NAN_THRESHOLD = 0.15

def interpolate_with_median(group):
    # Impute missing values with the rolling median of the window, then back-fill any leading NaNs.
    rolling_median = group.rolling(window=WINDOW, min_periods=1).median()
    group = group.fillna(rolling_median).bfill()
    return group

if not os.path.exists(FILEPATH_PRE_PARQ):
    # Keep only companies with enough history for a full window and at most 2 missing one-month momentum values.
    valid_groups = chars_df.groupby('permno').filter(lambda x: len(x) >= WINDOW and x[MOM_FEATURES[0]].isna().sum() <= 2)
    for i in tqdm(range(2, WINDOW + 1), desc="moms"):
        rolling_func = lambda x: (x + 1).rolling(window=i).apply(np.prod, raw=True) - 1
        valid_groups[f'mom{i}m'] = valid_groups.groupby('permno')[MOM_FEATURES[0]].transform(rolling_func)
    numerical_columns = valid_groups.select_dtypes(include=['float64', 'int64']).columns
    tqdm.pandas(desc="interpolate_with_median")
    valid_groups[numerical_columns] = valid_groups.groupby('permno')[numerical_columns].progress_transform(lambda x: interpolate_with_median(x))
    # Drop characteristics that are still mostly missing, then forward/back-fill the rest.
    nan_percentages = valid_groups[FEATURES].isna().mean()
    valid_groups = valid_groups.drop(columns=nan_percentages[nan_percentages >= NAN_THRESHOLD].index)
    valid_groups = valid_groups.ffill().bfill()
    valid_groups.to_parquet(FILEPATH_PRE_PARQ, index=False, compression="snappy")
    chars_df = valid_groups
else:
    chars_df = pd.read_parquet(FILEPATH_PRE_PARQ)

chars_df.tail(5)

Dim Reduction with PCA

Standardization and PCA retaining 99% of the variance are performed to center the data for the clustering algorithms and to slightly reduce its dimensionality. In the paper, PCA is used primarily for scaling and capturing magnitudes, as they utilize all the components that describe 100% of the variance.
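A minimal sketch of this step, assuming the preprocessed characteristics are in chars_df and that the feature, pca_result_df, and column names below match the notebook's (the names are assumptions, not the exact code):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Assumed feature set: every characteristic except the identifiers and the date column.
feature_cols = [c for c in chars_df.columns if c not in ('permno', 'DATE')]

# Standardize so every characteristic contributes equally to the distances.
scaled = StandardScaler().fit_transform(chars_df[feature_cols])

# Keep enough components to explain 99% of the variance.
pca = PCA(n_components=0.99, random_state=1)
components = pca.fit_transform(scaled)

pca_result_df = pd.DataFrame(components, index=chars_df.index)
pca_result_df.columns = pca_result_df.columns.astype(str)
pca_result_df[['DATE', 'permno']] = chars_df[['DATE', 'permno']].values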

Clustering Algos

The helper function below identifies the epsilon, the L2 distance between two neighbors, a hyperparameter common to each model:

from sklearn.neighbors import NearestNeighbors

def distance_to_nearest_neighbors(df, k=2, alpha=0.3):
    """
    Calculates the distance to the nearest neighbors of each point in a DataFrame and determines a distance threshold based on a percentile.
    Computes the L2 nearest neighbors for each point in the dataset, then averages the distances to the nearest neighbors,
    excluding the closest one (the point itself for k=2). The alpha percentile of these averages is used as the
    cut-off distance (epsilon) for clustering or outlier detection.
    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the data points.
    - k (int, optional): The number of nearest neighbors to consider. The default is 2 (itself and another).
    - alpha (float, optional): The percentile, from 0.0 to 1.0, to use when determining the distance threshold.
    Returns:
    - A tuple containing:
      - NumPy array of distances.
      - Epsilon threshold.
    """
    neigh = NearestNeighbors(n_neighbors=k, n_jobs=-1, metric='l2')
    nbrs = neigh.fit(df)
    distances, _ = nbrs.kneighbors(df)
    distances = np.sort(distances, axis=0)
    distances = distances[:, 1:].mean(axis=1)
    epsilon = np.percentile(distances, alpha * 100)
    return distances, epsilon

pca_result_df.columns = pca_result_df.columns.astype(str)
pca_components_cols = pca_result_df.drop('DATE', axis=1).columns
pca_components_cols

K-Means

K-Means aims to minimize the sum of squared (Euclidean) distances between points and their centroids in a two-step procedure: it assigns each point (xi) to a cluster (uj), then recalculates each cluster's centroid (uj) as the mean of all the assigned data, iterating until convergence or the maximum number of iterations is reached.

This is represented as:
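In the usual notation, the objective being minimized is:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\lVert x_i^{(j)} - u_j \right\rVert^2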

Where:

  • n is the total data points.
  • k is the total clusters
  • xi^j is the ith point assigned to the jth cluster
  • uj is the centroid of the jth cluster.

The problem with K-Means is that it cannot identify outliers, so the centroids get skewed.

In this case, we will need to find each point's distance to its centroid and drop anything beyond the alpha-percentile threshold, which gives the epsilon distance calculated by the helper function above.

A visualization of the algo is in the flowchart below:

The paper stops here with K-Means. Later, we will improve on this by detecting the optimal number of clusters with the elbow method or the silhouette score and, after removing the outliers, refitting the K-Means model so the clusters are correctly sized without them.

The K-Means training code:

import warnings
warnings.filterwarnings('ignore', category=UserWarning)
from sklearn.cluster import KMeans

def distance_to_centroid(km_model, df, alpha=30):
    """
    Calculate the distance from each data point in the provided DataFrame to the centroid of its assigned cluster.
    Parameters:
    - km_model: Fitted KMeans model containing cluster centroids.
    - df: DataFrame containing the data points.
    - alpha: Optional, max distance percentile used to derive the epsilon threshold.
    Returns:
    - A tuple containing:
      - NumPy array of distances from each data point to its centroid.
      - Epsilon threshold, calculated as the alpha percentile of the L2 distances.
    """
    labels = km_model.labels_
    dist_centroid = km_model.transform(df)
    dist_centroid_member = np.array([dist_centroid[i, labels[i]] for i in range(len(labels))])
    epsilon = np.percentile(dist_centroid_member, alpha)
    return dist_centroid_member, epsilon

def train_km_clusters(pca_result_df, optimal_clusters=None):
    models_dfs = []
    cluster_membership = []
    for month, data in tqdm(pca_result_df.groupby("DATE"), desc="train_km_clusters"):
        pca_data = data[pca_components_cols]
        if len(pca_data) < 2:
            print(f"Skipping {month} due to insufficient data.")
            continue
        km_model = KMeans(n_clusters=optimal_clusters,
                          init='k-means++',
                          n_init=10,
                          max_iter=1000,
                          random_state=1)
        km_model.fit(pca_data)
        # Outliers sit further than epsilon from their centroid; the paper stops here without refitting.
        dist, eps = distance_to_centroid(km_model, pca_data)
        clean_data = pca_data[~(dist > eps)]
        cluster_df = pd.DataFrame(data['permno'].copy(), index=data.index)
        cluster_df['km_cluster'] = km_model.labels_
        cluster_df.loc[(dist > eps), "km_cluster"] = -1  # Outliers are set to -1 to mirror DBSCAN's noise label
        cluster_df[MOM_FEATURES[0]] = data[MOM_FEATURES[0]]
        cluster_df['DATE'] = month
        cluster_membership.append(cluster_df)
        models_dfs.append({'DATE': month, 'n_clusters': km_model.n_clusters})
    models_df = pd.DataFrame(models_dfs)
    cluster_membership_df = pd.concat(cluster_membership, ignore_index=False)
    return models_df, cluster_membership_df

if os.path.exists(KM_CLUSTERS_FILEPATH) and os.path.exists(KM_CLUSTERS_MEM_FILEPATH):
    km_models_df = pd.read_pickle(KM_CLUSTERS_FILEPATH)
    km_cluster_membership_df = pd.read_pickle(KM_CLUSTERS_MEM_FILEPATH)
else:
    # For k-means clustering, K = 5, 10, 50, 100, 500, 1000, and 1500 are tested in the paper,
    # and K = 500 is chosen for the main results as it gives the highest Sharpe ratio.
    km_models_df, km_cluster_membership_df = train_km_clusters(pca_result_df, optimal_clusters=500)
    km_models_df.to_pickle(KM_CLUSTERS_FILEPATH)
    km_cluster_membership_df.to_pickle(KM_CLUSTERS_MEM_FILEPATH)

pca_result_df['km_cluster'] = km_cluster_membership_df['km_cluster']
pca_result_df[['DATE', 'km_cluster']].tail(5)

Visualization of the clusters’ composition:

import matplotlib.pyplot as plt
import pandas as pd

def plot_cluster_distributions(cluster_membership_df, cluster_col):
    """
    Plots the distribution of clusters over time, including a comparison of outliers versus clustered stocks.
    Parameters:
    - cluster_membership_df: DataFrame containing clustering information.
    - cluster_col: String, the name of the column in DataFrame that contains cluster identifiers.
    """
    clusters = cluster_membership_df[cluster_membership_df[cluster_col] > -1].groupby('DATE')[cluster_col].nunique().reset_index()
    clusters['DATE'] = pd.to_datetime(clusters['DATE'])
    cluster_counts = cluster_membership_df.groupby(['DATE', cluster_col])["permno"].count().reset_index()
    cluster_counts['Rank'] = cluster_counts[cluster_counts[cluster_col] > -1].groupby('DATE')["permno"].rank("dense", ascending=False)
    cluster_counts['Cluster_Group'] = cluster_counts.apply(
        lambda row: 'First' if row['Rank'] == 1 else
                    'Second' if row['Rank'] == 2 else
                    'Third' if row['Rank'] == 3 else
                    'Rest', axis=1)
    cluster_counts['Cluster_Group'] = cluster_counts.apply(
        lambda row: 'Outliers' if row[cluster_col] == -1 else row['Cluster_Group'], axis=1)
    cluster_summary = cluster_counts.groupby(['DATE', 'Cluster_Group'])["permno"].sum().unstack(fill_value=0).reset_index()
    cluster_summary['DATE'] = pd.to_datetime(cluster_summary['DATE'])
    fig, axs = plt.subplots(1, 3 if 'Outliers' in cluster_summary.columns else 2, figsize=(18, 4))
    axs[0].fill_between(clusters['DATE'], clusters[cluster_col], step="pre", alpha=0.4)
    axs[0].set_title('Number of Clusters')
    axs[0].set_xlabel('Date')
    axs[0].set_ylabel('Number of Clusters')
    axs[0].tick_params(axis='x', rotation=45)
    axs[1].stackplot(cluster_summary['DATE'],
                     cluster_summary.get('First', pd.Series()),
                     cluster_summary.get('Second', pd.Series()),
                     cluster_summary.get('Third', pd.Series()),
                     cluster_summary.get('Rest', pd.Series()),
                     labels=['First', 'Second', 'Third', 'Rest'], alpha=0.4)
    axs[1].set_title('Stocks in 3 Largest Clusters')
    axs[1].set_xlabel('Date')
    axs[1].set_ylabel('Number of Stocks')
    axs[1].legend(loc='upper left')
    axs[1].tick_params(axis='x', rotation=45)
    if 'Outliers' in cluster_summary.columns:
        clustered_stocks = cluster_summary[['First', 'Second', 'Third', 'Rest']].sum(axis=1)
        outliers = cluster_summary['Outliers']
        axs[2].stackplot(cluster_summary['DATE'], outliers, clustered_stocks, labels=['Outliers', 'Clustered'], alpha=0.4)
        axs[2].set_title('Outliers vs Clustered')
        axs[2].set_xlabel('Date')
        axs[2].set_ylabel('Number of Stocks')
        axs[2].tick_params(axis='x', rotation=45)
        axs[2].legend()
    return fig, axs

plot_cluster_distributions(km_cluster_membership_df, 'km_cluster')
plt.tight_layout()
plt.show()

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise, a mouthful of a model that identifies dense regions in high-dimensional data, which we have. It can handle clusters of arbitrary shapes and can deal with outliers.

The algorithm groups together points that are closely packed within an ε (epsilon) distance of each other, and forms a cluster when the number of points is greater than MinPts. Points with enough neighbors are core points; points within ε of a core point but without enough neighbors of their own are border points; points that do not belong to any cluster (further than ε from any core point) are noise.

It then expands the radius of a cluster, adding as many neighbors as it can, or creates new clusters from core points, and keeps iterating until it has visited every point.

The flowchart below gives a good visual representation of this:

In the helper function discussed above, we calculate the epsilon distance by finding the distances between neighbors, ordering them, and taking the alpha percentile. The paper identified that an alpha of 10% gives the optimal epsilon.

Code to train DBScan below:

from sklearn.cluster import DBSCAN

def train_db_clusters(pca_result_df, alpha=0.1):
    models_dfs = []
    cluster_membership = []
    for month, data in tqdm(pca_result_df.groupby("DATE"), desc="train_db_clusters"):
        pca_data = data[pca_components_cols]
        if len(pca_data) < 2:
            print(f"Skipping {month} due to insufficient data.")
            continue
        # MinPts is set to the natural logarithm of the total number of data points N.
        min_samples = int(round(np.log(len(data))))
        _, eps = distance_to_nearest_neighbors(pca_data, k=min_samples + 1, alpha=alpha)
        db_model = DBSCAN(eps=eps, metric='l2', min_samples=min_samples)
        db_model.fit(pca_data)
        cluster_df = pd.DataFrame(data['permno'].copy(), index=data.index)
        cluster_df['db_cluster'] = db_model.labels_
        cluster_df['DATE'] = month
        cluster_df[MOM_FEATURES[0]] = data[MOM_FEATURES[0]]
        cluster_membership.append(cluster_df)
        # -1 labels are the noise points, which we exclude from the cluster count.
        num_clusters = len(set(db_model.labels_)) - (1 if -1 in db_model.labels_ else 0)
        models_dfs.append({'DATE': month, 'n_clusters': num_clusters})
    models_df = pd.DataFrame(models_dfs)
    cluster_membership_df = pd.concat(cluster_membership, ignore_index=False)
    return models_df, cluster_membership_df

if os.path.exists(DBS_CLUSTERS_FILEPATH) and os.path.exists(DBS_CLUSTERS_MEM_FILEPATH):
    db_models_df = pd.read_pickle(DBS_CLUSTERS_FILEPATH)
    db_cluster_membership_df = pd.read_pickle(DBS_CLUSTERS_MEM_FILEPATH)
else:
    # For DBSCAN, alpha = 0.1, ..., 0.9 are tested in the paper, and alpha = 0.1 is chosen.
    db_models_df, db_cluster_membership_df = train_db_clusters(pca_result_df, alpha=0.1)
    db_models_df.to_pickle(DBS_CLUSTERS_FILEPATH)
    db_cluster_membership_df.to_pickle(DBS_CLUSTERS_MEM_FILEPATH)

pca_result_df['db_cluster'] = db_cluster_membership_df['db_cluster']
pca_result_df[['DATE', 'db_cluster']].tail(5)

Agglomerative

Agglomerative clustering is a hierarchical algorithm that works iteratively bottom-up, merging the closest pairs of clusters until everything converges into one cluster or a stopping criterion is met.

In our case, the criterion is the ε (epsilon) distance, which means the algorithm will iterate until the distance between neighboring clusters is more than ε.
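The agglomerative training loop is not listed in this article, so here is a minimal sketch following the same per-month pattern as train_db_clusters above, using scikit-learn's AgglomerativeClustering with the epsilon from the nearest-neighbor helper as the distance threshold. The agg_cluster column name matches the one expected by the trade-signal code later; the caching and the assignment back into pca_result_df would follow the same pattern as the other models.

from sklearn.cluster import AgglomerativeClustering

def train_agg_clusters(pca_result_df, alpha=0.1):
    models_dfs = []
    cluster_membership = []
    for month, data in tqdm(pca_result_df.groupby("DATE"), desc="train_agg_clusters"):
        pca_data = data[pca_components_cols]
        if len(pca_data) < 2:
            continue
        # Same epsilon heuristic as before: the alpha percentile of nearest-neighbor distances.
        _, eps = distance_to_nearest_neighbors(pca_data, k=2, alpha=alpha)
        # Merge clusters bottom-up until the linkage distance exceeds epsilon.
        agg_model = AgglomerativeClustering(n_clusters=None, distance_threshold=eps, linkage='ward')
        labels = agg_model.fit_predict(pca_data)
        cluster_df = pd.DataFrame(data['permno'].copy(), index=data.index)
        cluster_df['agg_cluster'] = labels
        cluster_df[MOM_FEATURES[0]] = data[MOM_FEATURES[0]]
        cluster_df['DATE'] = month
        cluster_membership.append(cluster_df)
        models_dfs.append({'DATE': month, 'n_clusters': agg_model.n_clusters_})
    models_df = pd.DataFrame(models_dfs)
    cluster_membership_df = pd.concat(cluster_membership, ignore_index=False)
    return models_df, cluster_membership_df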

Trade Simulation

Our simulated trades will take the following steps:

  1. Get all securities in a cluster.
  2. Split into deciles based on their returns (MOM1).
  3. Get the returns delta of the top decile (overvalued) and bottom decile (undervalued).
  4. Calculate the cross-sectional standard deviation of these deltas.
  5. If the delta > the cluster’s 1 std * factor, there is a StatArb opportunity. The factor is a parameter, which in our case is 1.5.
  6. Select securities for the Long-Short portfolio to be traded the next month.
  7. Close Long-Shorts from previous month (we assume these have reversed back to the cluster’s mean).

All portfolios will be equally weighted, with their returns unaltered (w_i will be 1). We will also assume 15 bps trading fees, which will be deducted from the returns:
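Matching the fee logic in the get_pnl code further below, each leg's equally weighted mean return r_t is only reduced by the fee when it is positive:

r_t^{\text{net}} = \begin{cases} r_t - 0.0015 & \text{if } r_t > 0 \\ r_t & \text{otherwise} \end{cases}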

The code below contains the logic to discover divergence in pairs:

from pandas import Timestamp
from datetime import datetime, timedelta

RETS = 'rets'
OVER_RETS = 'over_rets'
UNDER_RETS = 'under_rets'
RETS_1 = MOM_FEATURES[0]

def statarb_signals(group, std_dev_factor=1.5):
    """
    Finds signals for a given cluster of stocks by identifying overvalued ("overs") and undervalued ("unders") stocks
    in the top and bottom halves of the cluster (ranked by momentum), where the momentum difference exceeds a specified
    standard deviation factor.
    Parameters:
    - group (pd.DataFrame): Stock data for a specific group or cluster.
    - std_dev_factor (float, optional): The factor by which the standard deviation is multiplied to determine a signal. Defaults to 1.5.
    Returns:
    - list of dicts:
      - 'DATE': The trade date for the identified signals.
      - 'Cluster': The identifier for the cluster from which the signals were generated.
      - 'overs': A set of stock identifiers ('permno') considered overvalued based on the strategy.
      - 'unders': A set of stock identifiers ('permno') considered undervalued based on the strategy.
    """
    overs = unders = []
    overs_rets = unders_rets = 0
    if len(group[MOM_FEATURES[0]]) > 1:
        group_sorted = group.sort_values(by=RETS_1, ascending=False).reset_index(drop=True)
        mid_idx = len(group_sorted) // 2
        top_half = group_sorted.iloc[:mid_idx]
        bottom_half = group_sorted.iloc[-mid_idx:]
        assert len(top_half) == len(bottom_half), f"len mismatch: {len(top_half)} != {len(bottom_half)}"
        if not bottom_half.empty and not top_half.empty:
            # Pairwise momentum spread between the ranked top and bottom halves.
            rets_diffs = top_half[RETS_1].values - bottom_half[RETS_1].values
            mom1_std_dev = rets_diffs.std()
            valid_pairs_mask = rets_diffs > (mom1_std_dev * std_dev_factor)
            overs = top_half[valid_pairs_mask]['permno']
            unders = bottom_half[valid_pairs_mask]['permno']
            # Short the overs (hence the -1) and long the unders; returns are the next month's.
            overs_rets = top_half[valid_pairs_mask][RETS].mul(-1).mean()
            unders_rets = bottom_half[valid_pairs_mask][RETS].mean()
    return [{
        'DATE': group['DATE_TRADE'].iloc[0],
        'Cluster': group['cluster'].iloc[0],
        'overs': set(overs),
        'unders': set(unders),
        OVER_RETS: overs_rets,
        UNDER_RETS: unders_rets,
        RETS: (overs_rets + unders_rets),
    }]

def process_trade_opportunities(df, cluster_label, filepath):
    df['cluster'] = df[cluster_label]
    tqdm.pandas(desc=f"StatArb opportunities with {cluster_label}")
    trade_opportunities = df.groupby(['cluster', 'DATE']).progress_apply(statarb_signals)
    trade_opportunities = [item for sublist in trade_opportunities for item in sublist]
    if len(trade_opportunities) == 0:
        return pd.DataFrame()
    trade_opportunities_df = pd.DataFrame(trade_opportunities)
    trade_opportunities_df.to_pickle(filepath)
    return trade_opportunities_df

# NB: The model will trade the next month on the selected clusters. The paper rebalances portfolios monthly.
pca_result_df['DATE_TRADE'] = pca_result_df.groupby(['permno'])['DATE'].shift(-1).ffill()
pca_result_df[RETS] = pca_result_df.groupby(['permno'])[RETS_1].shift(-1).ffill()

trade_opportunities_km_df = trade_opportunities_db_df = trade_opportunities_agg_df = None
if not os.path.exists(SIGNALS_KM_FILEPATH):
    trade_opportunities_km_df = process_trade_opportunities(pca_result_df, 'km_cluster', SIGNALS_KM_FILEPATH)
else:
    trade_opportunities_km_df = pd.read_pickle(SIGNALS_KM_FILEPATH)
if not os.path.exists(SIGNALS_DB_FILEPATH):
    trade_opportunities_db_df = process_trade_opportunities(pca_result_df, 'db_cluster', SIGNALS_DB_FILEPATH)
else:
    trade_opportunities_db_df = pd.read_pickle(SIGNALS_DB_FILEPATH)
if not os.path.exists(SIGNALS_AGG_FILEPATH):
    trade_opportunities_agg_df = process_trade_opportunities(pca_result_df, 'agg_cluster', SIGNALS_AGG_FILEPATH)
else:
    trade_opportunities_agg_df = pd.read_pickle(SIGNALS_AGG_FILEPATH)

trade_opportunities_km_df.tail(5)

K-Means Returns

The code below helps calculate returns and performance metrics, including their visualizations:

def get_pnl(trade_opportunities_df, fees=0.0015):
    """
    Calculates the profit and loss (PnL) for the given trade opportunities, adjusting for fees, and aggregates the results monthly.
    Parameters:
    - trade_opportunities_df (pd.DataFrame): A DataFrame containing trade opportunities with columns for date, overs, unders, and returns.
    - fees (float, optional): The trading fee rate to apply to the returns. Defaults to 0.0015.
    Returns:
    - overs_stock_df (pd.DataFrame): Over-performing trades merged with asset information, adjusted for fees.
    - unders_stock_df (pd.DataFrame): Under-performing trades merged with asset information, adjusted for fees.
    - overs_df (pd.DataFrame): The original over-performing trades adjusted for fees and indexed by date.
    - unders_df (pd.DataFrame): The original under-performing trades adjusted for fees and indexed by date.
    - re_unders_df (pd.Series): Monthly aggregated mean returns for under-performing trades.
    - re_overs_df (pd.Series): Monthly aggregated mean returns for over-performing trades.
    """
    trade_opportunities_df['DATE'] = pd.to_datetime(trade_opportunities_df['DATE'])
    asset_df = pca_result_df.reset_index()[['DATE', 'permno', MOM_FEATURES[0]]].copy()
    overs_df = trade_opportunities_df[["DATE", 'overs', OVER_RETS]].rename(columns={OVER_RETS: f'raw_{OVER_RETS}'})
    unders_df = trade_opportunities_df[["DATE", 'unders', UNDER_RETS]].rename(columns={UNDER_RETS: f'raw_{UNDER_RETS}'})
    overs_df[OVER_RETS] = overs_df[f'raw_{OVER_RETS}']
    unders_df[UNDER_RETS] = unders_df[f'raw_{UNDER_RETS}']
    overs_stock_df = pd.merge(overs_df.explode('overs').rename(columns={'overs': 'permno'}), asset_df, on=['DATE', 'permno'])
    unders_stock_df = pd.merge(unders_df.explode('unders').rename(columns={'unders': 'permno'}), asset_df, on=['DATE', 'permno'])
    # Deduct the trading fees from positive returns.
    overs_df[OVER_RETS] = overs_df[OVER_RETS].apply(lambda row: row - fees if row > 0 else row)
    unders_df[UNDER_RETS] = unders_df[UNDER_RETS].apply(lambda row: row - fees if row > 0 else row)
    unders_df = unders_df.set_index("DATE", drop=False).fillna(0)
    overs_df = overs_df.set_index("DATE", drop=False).fillna(0)
    # Resampling in case of multiple trades in one month.
    re_unders_df = unders_df.resample('ME', on='DATE')[UNDER_RETS].mean()
    re_overs_df = overs_df.resample('ME', on='DATE')[OVER_RETS].mean()
    return overs_stock_df, unders_stock_df, overs_df, unders_df, re_unders_df, re_overs_df

overs_stock_df, unders_stock_df, overs_df, unders_df, re_unders_df, re_overs_df = get_pnl(trade_opportunities_km_df)
# overs_df
                  DATE                  overs  raw_over_rets  over_rets
DATE
2021-09-30  2021-09-30                     {}       0.000000   0.000000
2021-10-29  2021-10-29                     {}       0.000000   0.000000
2021-11-30  2021-11-30                     {}       0.000000   0.000000
2021-12-31  2021-12-31                     {}       0.000000   0.000000
2021-12-31  2021-12-31  {89960, 89946, 89942}      -0.002305  -0.002305

# unders_df
                  DATE                 unders  raw_under_rets  under_rets
DATE
2021-09-30  2021-09-30                     {}        0.000000    0.000000
2021-10-29  2021-10-29                     {}        0.000000    0.000000
2021-11-30  2021-11-30                     {}        0.000000    0.000000
2021-12-31  2021-12-31                     {}        0.000000    0.000000
2021-12-31  2021-12-31  {89952, 89949, 89965}        0.002305    0.000805

# overs_stock_df
              DATE  permno  raw_over_rets  over_rets     mom1m
1096785 2021-05-28   15343      -0.087538  -0.087538  0.104076
1096786 2021-07-30   13685      -0.015306  -0.015306  0.015306
1096787 2021-12-31   89960      -0.002305  -0.002305  0.074444
1096788 2021-12-31   89946      -0.002305  -0.002305  0.065893
1096789 2021-12-31   89942      -0.002305  -0.002305  0.029703

# unders_stock_df
              DATE  permno  raw_under_rets  under_rets     mom1m
1096520 2021-05-28   15340        0.104096    0.104096  0.177752
1096521 2021-07-30   13701       -0.136062   -0.136062 -0.136062
1096522 2021-12-31   89952        0.002305    0.002305 -0.258384
1096523 2021-12-31   89949        0.002305    0.002305 -0.053364
1096524 2021-12-31   89965        0.002305    0.002305 -0.076607
import matplotlib.pyplot as plt

def plot_cummulative_rets(re_unders_df, re_overs_df):
    # Calculating cumulative returns
    re_unders_cum_df = re_unders_df.add(1).cumprod().sub(1).mul(100)
    re_overs_cum_df = re_overs_df.add(1).cumprod().sub(1).mul(100)
    total = (re_unders_cum_df + re_overs_cum_df) / 2
    peaks = total.cummax()
    # Creating a figure with two side-by-side subplots
    fig, axs = plt.subplots(1, 2, figsize=(18, 6))
    # Plotting the first subplot
    axs[0].bar(re_unders_cum_df.index, re_unders_cum_df.values, label='Unders', color='b', alpha=0.4, width=365//4)
    axs[0].bar(re_overs_cum_df.index, re_overs_cum_df.values, label='Overs', color='r', alpha=0.4, width=365//4)
    axs[0].plot(re_overs_cum_df.index, total, label="Total", color='g')
    axs[0].set_xlabel('Date')
    axs[0].set_ylabel('Returns %')
    axs[0].legend()
    axs[0].set_title('Cumulative Returns Over Time')
    # Plotting the second subplot
    axs[1].plot(total.index, total.values, label="Total", color='g')
    axs[1].plot(peaks.index, peaks.values, label="Peak", color='k', linestyle='--')
    axs[1].set_xlabel('Date')
    axs[1].set_ylabel('Returns %')
    axs[1].legend()
    axs[1].set_title('Total Returns and Peaks Over Time')
    plt.tight_layout()  # Adjusts subplot params so that subplots fit into the figure area
    return fig

plot_cummulative_rets(re_unders_df, re_overs_df)
plt.show()
def get_log_rets(re_overs_df, re_unders_df):
    """
    Calculates the logarithmic returns of over-performing and under-performing trades, along with their aggregate totals and trade counts.
    Parameters:
    - re_overs_df (pd.Series): A Series of returns for over-performing trades.
    - re_unders_df (pd.Series): A Series of returns for under-performing trades.
    Returns:
    - log_rets (pd.DataFrame): A DataFrame containing the logarithmic returns for overs, unders, their totals, and the number of trades done.
    """
    o = re_overs_df.add(1)
    o = np.log(o / o.shift(1).fillna(0))
    u = re_unders_df.add(1)
    u = np.log(u / u.shift(1).fillna(0))

    log_rets = pd.concat([o, u, o + u], axis=1)
    log_rets.columns = ['Overs', 'Unders', 'Totals']
    overs_count = overs_stock_df.groupby("DATE").size().resample('ME').sum()
    unders_count = unders_stock_df.groupby("DATE").size().resample('ME').sum()
    log_rets['Trade Count'] = overs_count.values + unders_count.values
    return log_rets

log_rets = get_log_rets(re_overs_df, re_unders_df)
log_rets['Totals'].tail(5)
DATE
2021-08-31 -0.002195
2021-09-30 0.006663
2021-10-31 -0.009249
2021-11-30 -0.001036
2021-12-31 -0.004071
Freq: ME, Name: Totals, dtype: float64
def plot_log_rets(rets):
    fig, ax1 = plt.subplots(figsize=(18, 6))

    ax1.set_xlabel('Date')
    ax1.set_ylabel('Returns %', color='tab:red')
    ax1.plot(rets.index, rets['Overs'], label='Over Returns', color='orange')
    ax1.plot(rets.index, rets['Unders'], label='Under Returns', color='r')
    ax1.plot(rets.index, rets['Totals'], label='Total Returns', color='b')
    ax1.tick_params(axis='y')
    ax1.legend(loc='upper left')
    # Secondary axis for the number of trades per month.
    ax2 = ax1.twinx()
    ax2.bar(rets.index, rets['Trade Count'], label='Trade Count', color='k', alpha=0.3, width=365//30)
    ax2.set_ylabel('Trades', color='k')
    ax2.tick_params(axis='y')
    ax2.legend(loc='upper right')
    return fig, ax1, ax2

plot_log_rets(log_rets)
plt.title('Cumulative Returns and Trades')
plt.show()

Below are the metrics to compare the algorithms with each other and with the papers. Note how we annualize the returns to get comparable figures:

from scipy.stats import skew, kurtosis

def get_trade_stats(rets, riskfree_rate=0.035):
    monthly_riskfree_rate = (1 + riskfree_rate)**(1/12) - 1
    annualized_return = rets.mean() * 12
    annualized_vol = rets.std() * np.sqrt(12)
    sharpe_ratio = (rets.mean() - monthly_riskfree_rate) / rets.std() * np.sqrt(12)
    downside_deviation = rets[rets < 0].std() * np.sqrt(12)
    cumulative_returns = (1 + rets).cumprod()
    peak = cumulative_returns.cummax()
    max_drawdown = ((peak - cumulative_returns) / peak).max()
    sortino_ratio = (rets.mean() - monthly_riskfree_rate) / downside_deviation
    return {
        "Annualized Return": annualized_return,
        "Annualized Vol": annualized_vol,
        "Sharpe Ratio": sharpe_ratio,
        "Downside Deviation": downside_deviation,
        "Sortino Ratio": sortino_ratio,
        "Max Drawdown": max_drawdown,
        "Skewness": skew(rets.values),
        "Kurtosis": kurtosis(rets.values)
    }

km_l = get_trade_stats(unders_df[UNDER_RETS].groupby(level=0).mean())
km_o = get_trade_stats(overs_df[OVER_RETS].groupby(level=0).mean())
km_t = get_trade_stats((unders_df[UNDER_RETS] + overs_df[OVER_RETS]).groupby(level=0).mean())
km_stats_df = pd.DataFrame({
    'km_long': pd.Series(km_l),
    'km_short': pd.Series(km_o),
    'km_total': pd.Series(km_t)
})
km_stats_df

DBScan Returns

We repeat the same visualizations for the DBSCAN algorithm.
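A sketch of that repetition, reusing the helpers defined above on the DBSCAN trade opportunities; the agglomerative version is identical, just swapping in trade_opportunities_agg_df and agg_stats_df:

# Reuse the PnL, plotting, and stats helpers on the DBSCAN signals.
overs_stock_df, unders_stock_df, overs_df, unders_df, re_unders_df, re_overs_df = get_pnl(trade_opportunities_db_df)
plot_cummulative_rets(re_unders_df, re_overs_df)
plt.show()
plot_log_rets(get_log_rets(re_overs_df, re_unders_df))
plt.show()
db_stats_df = pd.DataFrame({
    'db_long': pd.Series(get_trade_stats(unders_df[UNDER_RETS].groupby(level=0).mean())),
    'db_short': pd.Series(get_trade_stats(overs_df[OVER_RETS].groupby(level=0).mean())),
    'db_total': pd.Series(get_trade_stats((unders_df[UNDER_RETS] + overs_df[OVER_RETS]).groupby(level=0).mean())),
})
db_stats_df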

Agglomerative Returns

And for the agglomerative:

Side-by-side Algo Comparison

Unfortunately, we did not replicate the results of the paper, though we did not use their exact data, nor did we have the full hyperparameters they used for their clustering models.

Our best model was K-Means, with an annualized return of 3.5%, a max drawdown of 9%, and a Sharpe of 0.01, with its long-only leg performing about five times better.

Here are all our models compared:

all_stats_df = pd.concat([km_stats_df, db_stats_df, agg_stats_df], axis=1)
all_stats_df

Fine Tuning K-Means

We’ll use Silhouette scores to dynamically find the right number of K-Means clusters, and resize the clusters after the outliers are removed.

The Silhouette Score is used to evaluate the effectiveness of clustering techniques. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

This metric ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. The score can be represented by this equation:
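Written in terms of the quantities defined below, the per-point score is:

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}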

Where:

  • s(i) is the silhouette score for a single data point i
  • a(i) is the average distance from the ith data point to the other points in the same cluster,
  • b(i) is the smallest average distance from the ith data point to points in a different cluster, minimized over clusters.

Since calculating the score across all clusters is computationally expensive (it took us more than 5 hours!), we reduce the workload by processing small random subsets (mini-batches) of the dataset in each iteration, instead of the entire dataset at once as the traditional K-Means algorithm does.

To help it converge faster (we allow up to 500 clusters for each month across 40 years), our search function uses a binary-search loop to converge on the best silhouette score quickly.

The search optimization code and the altered KMeans training code are below:

import multiprocessing
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

NUM_CORES = multiprocessing.cpu_count()

def find_optimal_clusters(data, max_clusters=150, min_clusters=2, max_no_improvement=5, sample_size=1000):
    """
    Identifies the optimal number of clusters using a binary search approach with the Silhouette Coefficient.
    Parameters:
    - data: DataFrame containing the data points.
    - max_clusters: Maximum number of clusters to consider.
    - min_clusters: Minimum number of clusters to consider.
    - max_no_improvement: How many iterations without improvements to consider before stopping.
    - sample_size: Sample size for silhouette score calculation to speed up the process.
    Returns:
    - Optimal number of clusters based on the silhouette score.
    """
    best_score = -1
    best_k = None

    def evaluate_clusters(n_clusters):
        km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0, batch_size=(256 * NUM_CORES), max_no_improvement=max_no_improvement)
        labels = km.fit_predict(data)
        if len(np.unique(labels)) < 2:
            return -1
        return silhouette_score(data, labels, sample_size=sample_size)

    low = min_clusters
    high = max_clusters
    while low <= high:
        mid = (low + high) // 2
        score = evaluate_clusters(mid)
        # Check the neighbors to decide which half of the range to keep searching.
        low_score = evaluate_clusters(mid - 1) if mid - 1 >= low else -1
        high_score = evaluate_clusters(mid + 1) if mid + 1 <= high else -1
        if score > best_score:
            best_score = score
            best_k = mid
        if low_score > score:
            high = mid - 1
        elif high_score > score:
            low = mid + 1
        else:
            break  # local maximum
    return best_k

def train_km_clusters_enhanced(pca_result_df, max_clusters=150, cols=pca_components_cols):
    models_dfs = []
    cluster_membership = []
    for month, data in tqdm(pca_result_df.groupby("DATE"), desc="Training KMeans and Optimizing Clusters"):
        pca_data = data[cols]
        if len(pca_data) < 2:
            print(f"Skipping {month} due to insufficient data.")
            continue
        optimal_clusters = find_optimal_clusters(pca_data, max_clusters=max_clusters)
        km_model = MiniBatchKMeans(n_clusters=optimal_clusters, batch_size=(256 * NUM_CORES))
        km_model.fit(pca_data)
        dist, eps = distance_to_centroid(km_model, pca_data)
        clean_data = pca_data[dist <= eps]
        cluster_df = pd.DataFrame(index=data.index)
        cluster_df['permno'] = -1
        cluster_df['km_cluster_opt'] = -1
        cluster_df[MOM_FEATURES[0]] = data[MOM_FEATURES[0]]
        cluster_df['DATE'] = month
        if not clean_data.empty:
            # Refit on the data without outliers, then merge the labels back to keep the original size.
            km_model.fit(clean_data)
            clean_labels = km_model.labels_
            clean_label_df = pd.DataFrame(clean_labels, index=clean_data.index, columns=['km_cluster_opt'])
            cluster_df['permno'] = data['permno']
            cluster_df.update(clean_label_df)
        cluster_membership.append(cluster_df)
        models_dfs.append({'DATE': month, 'n_clusters': km_model.n_clusters})
    models_df = pd.DataFrame(models_dfs)
    cluster_membership_df = pd.concat(cluster_membership, ignore_index=True)
    return models_df, cluster_membership_df

if os.path.exists(KM_CLUSTERS_OPT_FILEPATH) and os.path.exists(KM_CLUSTERS_OPT_MEM_FILEPATH):
    km_opt_models_df = pd.read_pickle(KM_CLUSTERS_OPT_FILEPATH)
    km_opt_cluster_membership_df = pd.read_pickle(KM_CLUSTERS_OPT_MEM_FILEPATH)
else:
    km_opt_models_df, km_opt_cluster_membership_df = train_km_clusters_enhanced(pca_result_df, max_clusters=500)
    km_opt_models_df.to_pickle(KM_CLUSTERS_OPT_FILEPATH)
    km_opt_cluster_membership_df.to_pickle(KM_CLUSTERS_OPT_MEM_FILEPATH)

pca_result_df['km_cluster_opt'] = km_opt_cluster_membership_df['km_cluster_opt'].values
pca_result_df[['DATE', 'km_cluster_opt']].tail(5)

We visualize the optimal K-Means setup, noting the difference from the previous:

plot_cluster_distributions(km_opt_cluster_membership_df, 'km_cluster_opt')
plt.tight_layout()
plt.show()

Let’s simulate the trades again and see if we have an improvement.

Our raw K-Means provided 3.5% returns, a max drawdown of 9%, and a Sharpe of 0.01. With our optimizations, we achieved 3.8% returns and a Sharpe of 0.4, but a max drawdown of 27%!

Optimized K-Means with Dynamic Time Warping

Dynamic Time Warping (DTW) is a similarity measure that originated in speech recognition. It aligns two signals (time series) by warping the time axis iteratively until they match. It is an O(n²) operation though, so we expect longer processing times than with the previous silhouette-score optimization.

For two sequences A = a_1, a_2, ..., a_n and B = b_1, b_2, ..., b_m of lengths n and m respectively, their DTW distance is calculated by constructing an n×m matrix in which element (i, j) represents the distance between a_i and b_j combined with the minimum cumulative distance required to reach that point from the start of the two sequences.

This cumulative distance D(i,j) is defined recursively as:
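With the distance d and the boundary conditions given below, the standard recursion is:

D(i,j) = d(a_i, b_j) + \min\{\, D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \,\}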

where:

  • d(a_i, b_j) is the distance between elements a_i and b_j, typically the squared Euclidean distance: d(a_i, b_j) = (a_i - b_j)²

The DTW distance between the two sequences A and B is then given by the value of D(n,m), which represents the cumulative distance between the two sequences considering the optimal alignment between them.

To initiate the recursion, boundary conditions are typically set as follows:

  • D(0,0) = d(a_1, b_1)
  • For i > 1, D(i,0) = inf, to ensure the path starts from (1,1)
  • Similarly, for j > 1, D(0,j) = inf

The image below from Wikipedia helps visualize this process. Euclidean matching breaks down the moment the x-axis is warped in any way, while DTW detects similarity regardless of the temporal distortion.

Given the computational demands of the DTW algorithm, we expand our time windows from 1 month to 1 year and only target the most capitalized companies. The algorithm still ran for 5 hours with this limited sample.

chars_df['decile'] = pd.qcut(chars_df['mve_ia'], 10, labels=False) + 1
top_permnos_df = chars_df[chars_df['decile'] == 10]
top_pca_result_df = pca_result_df[pca_result_df['permno'].isin(top_permnos_df['permno'])]
print(len(top_pca_result_df))
top_pca_result_df.head()

The code to run K-Means with DTW is below:

from tslearn.clustering import TimeSeriesKMeans

def reshape_data(df, cols):
    """
    Transforms a DataFrame into a 3D numpy array for tslearn clustering,
    padding every series to the length of the longest one.
    Parameters:
    - df: DataFrame containing the time series data (must include 'permno').
    - cols: Columns in the DataFrame to be used for creating the time series data.
    Returns:
    - A 3D numpy array where each time series is padded to match the length of the longest series.
    """
    grouped = df.groupby('permno')
    valid_series_list = []
    for _, group in grouped:
        series = group[cols].values
        valid_series_list.append(series)
    longest_series = max(len(s) for s in valid_series_list)
    padded_series = np.array([
        np.pad(s, ((0, longest_series - s.shape[0]), (0, 0)), mode='constant', constant_values=np.nan)
        for s in valid_series_list
    ])
    return padded_series

def train_km_clusters_enhanced_dtw(df, cols=pca_components_cols, max_clusters=150):
    models_dfs = []
    cluster_membership = []
    # Group by year rather than month: each series is a year of monthly observations per permno.
    for year, data in tqdm(df.groupby(df['DATE'].dt.year), desc="Training TimeSeries KMeans"):
        if len(data) <= 2:
            continue
        # Keep only securities with at least 6 monthly observations in the year.
        permnos = data.groupby('permno')['DATE'].nunique()[lambda x: x >= 6].index
        data = data[data['permno'].isin(permnos)]
        optimal_clusters = find_optimal_clusters(data[pca_components_cols], max_clusters=max_clusters)
        time_series_data = reshape_data(data, cols)
        model = TimeSeriesKMeans(n_clusters=optimal_clusters, metric="dtw", verbose=False, n_jobs=-1,
                                 max_iter=5, max_iter_barycenter=5, random_state=0)
        labels = model.fit_predict(time_series_data)
        cluster_df = pd.DataFrame({'permno': data['permno'].unique(), 'km_cluster_opt': labels})
        cluster_df['DATE'] = year
        cluster_membership.append(cluster_df)
        models_dfs.append({'DATE': year, 'n_clusters': optimal_clusters})
    models_df = pd.DataFrame(models_dfs)
    cluster_membership_df = pd.concat(cluster_membership, ignore_index=True)
    return models_df, cluster_membership_df

if os.path.exists(KM_CLUSTERS_DTW_FILEPATH) and os.path.exists(KM_CLUSTERS_DTW_MEM_FILEPATH) and not IS_KAGGLE:
    km_optdtw_models_df = pd.read_pickle(KM_CLUSTERS_DTW_FILEPATH)
    km_optdtw_cluster_membership_df = pd.read_pickle(KM_CLUSTERS_DTW_MEM_FILEPATH)
else:
    km_optdtw_models_df, km_optdtw_cluster_membership_df = train_km_clusters_enhanced_dtw(top_pca_result_df, max_clusters=50, cols=['0', '1', '2', '3'])
    if not IS_KAGGLE:
        km_optdtw_models_df.to_pickle(KM_CLUSTERS_DTW_FILEPATH)
        km_optdtw_cluster_membership_df.to_pickle(KM_CLUSTERS_DTW_MEM_FILEPATH)

top_pca_result_df['km_cluster_opt'] = km_optdtw_cluster_membership_df['km_cluster_opt']
top_pca_result_df['km_cluster_opt'] = top_pca_result_df['km_cluster_opt'].fillna(-1)
top_pca_result_df[['DATE', 'km_cluster_opt']].tail(5)

The clusters and trading are visualized here:

Given that we altered the data for this experiment, we cannot draw conclusions on whether there was an improvement, though it is a good setup to revisit when better hardware or algorithms are available to us.

Fine-Tuning the Principal Components

In Han et al. (2023), the authors regress Fama-French factor models against the long-short portfolio returns to find the risk factor loadings and get economic explanations of the clusters' selections.

From their empirical analysis, the highest factor loadings were for market risk (MRKT), size risk (SMB), and value (HML). We will drop the first principal component, assuming it is the market factor, and cluster on the residuals only, helping the K-Means focus on anomalies that hint at sector or company trends outside of market movements.
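For context, here is a minimal sketch of that kind of factor regression using statsmodels. The ff_df DataFrame of monthly Fama-French factors (columns 'Mkt-RF', 'SMB', 'HML', 'RF') is a hypothetical input that is not part of this dataset, and this is illustrative only, not the paper's exact procedure:

import statsmodels.api as sm

# Hypothetical: monthly Fama-French factors indexed by month-end DATE (not in the CRSP characteristics file).
port_rets = log_rets['Totals'].rename('port')            # monthly long-short returns from earlier
reg_df = pd.concat([port_rets, ff_df], axis=1).dropna()

X = sm.add_constant(reg_df[['Mkt-RF', 'SMB', 'HML']])     # factor loadings to estimate
y = reg_df['port'] - reg_df['RF']                         # excess returns
print(sm.OLS(y, X).fit().summary())                       # alpha = const, betas = factor loadings

Back in our setup, we drop the first component and re-run the enhanced K-Means: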

pcaopt_result_df = pca_result_df.drop(columns=['0'])

if os.path.exists(KM_CLUSTERS_OPTPCA_FILEPATH) and os.path.exists(KM_CLUSTERS_OPTPCA_MEM_FILEPATH):
    km_optpca_models_df = pd.read_pickle(KM_CLUSTERS_OPTPCA_FILEPATH)
    km_optpca_cluster_membership_df = pd.read_pickle(KM_CLUSTERS_OPTPCA_MEM_FILEPATH)
else:
    km_optpca_models_df, km_optpca_cluster_membership_df = train_km_clusters_enhanced(
        pcaopt_result_df, max_clusters=500, cols=[col for col in pca_components_cols if col != '0'])
    km_optpca_models_df.to_pickle(KM_CLUSTERS_OPTPCA_FILEPATH)
    km_optpca_cluster_membership_df.to_pickle(KM_CLUSTERS_OPTPCA_MEM_FILEPATH)

pcaopt_result_df['km_cluster_opt'] = km_optpca_cluster_membership_df['km_cluster_opt'].values
pcaopt_result_df[['DATE', 'km_cluster_opt']].tail(5)

Our metrics were not better: we returned 3.58% (vs 3.8%), a Sharpe of 0.015 (vs 0.4), and still had a max drawdown of 20%!

Conclusion

In this article, we explored three clustering methods: K-Means, DBSCAN, and agglomerative clustering. Using the CRSP data from Gu et al. (2020), we attempted to replicate and improve on the findings of Han et al. (2023) on unsupervised learning for pairs trading.

In Han et al. (2023), the agglomerative clustering-based strategy was the best performer, with an excess annualized return of 23.8% (vs ours of 3.8%), an annualized Sharpe ratio of 2.69 (vs ours of 0.4), and a maximum drawdown of 12.3% (vs ours of 20%).

In our version, K-Means was the optimal model and not the Agglomerative as in their paper. The edge that they probably had was higher quality data.

If we had to further enhance this setup, we would run an ensemble of models, and generalize the model on markets outside the US — good material for the next article.

References

Github

The article is also available on GitHub.

The Kaggle notebook is available here.

Media

All media used (in the form of code or images) are either solely owned by me, acquired through licensing, or part of the Public Domain and granted use through Creative Commons License.

CC Licensing and Use

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Adam
Call For Atlas

People, tech, and product focused senior technology leader with a penchant for AI in quant finance: https://www.linkedin.com/in/adam-darmanin/ #CallForAtlas