AlgoCraft

Algorithmic Trading Ideas

Featured

Getting Started with Statistical Arbitrage: A Comprehensive Guide to Pairs Trading in Python

Alwan Alkautsar
AlgoCraft
Published in
16 min readJan 12, 2025

--

There are tons of algorithmic trading strategies out there. Statistical arbitrage is one of the most popular, which is done by utilizing statistical methods to identify significant relationships of the assets.

A robot trading the stock market, generated by OpenAI Sora.

At its core, a statistical arbitrage trading strategy relies on a mean reversion principle, which assumes that when there are two highly correlated assets, and one of the assets is moving disproportionately compared to the other, a position would be opened by the trader to exploit the short-term fluctuations under the assumption that the prices of both stocks would revert to their historical correlation.

Although perfect correlation is extremely rare in the market, there are statistical methods that we can employ to find highly correlated stocks. Traders are doing this by examining stocks that exhibit a high degree of comovement (which can be done by correlation analysis or cointegration testing) and would exploit its short-term price fluctuations, which usually are driven by market sentiment or sudden news announcements.

Figure 1. Illustration of temporary mispricing in statistical arbitrage

illustrates the historical price movements of (blue line) and (red line). Notice how the two stocks have moved closely together over time, demonstrating a strong positive correlation. However, in the most recent period, ’s price spikes upward, deviating significantly from its usual relationship with . Traders would address this as a temporary mispricing and would take advantage by executing the following trades:

Going Long on → Assuming Stock A will rise.

Going Short on → Assuming Stock B will fall back.

Both actions are taken under the assumption that either stock will revert toward its historical relationship with each other. Traders can either only take Option 1, only take Option 2, or take both options. When they both take both options, it’s what we refer to as pairs trading.

Understanding Pairs Trading

Pairs trading is a form of statistical arbitrage. Pairs trading is done by simultaneously taking two positions on a short-term mispricing of highly correlated assets. If we take on previous examples, it means going long on stock A and going short on stock B.

In pairs trading, trading signals are generated by the price difference (spread) of the two assets that we are pairing. Traders would find assets with high correlation, identify unusually large spreads in comparison to their historical trend, and open simultaneous positions on both assets with the anticipation that this divergence will correct itself.

Step-by-Step Guide to Pairs Trading with Python

Importing the Packages

import yfinance as yf
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from itertools import combinations
from statsmodels.tsa.stattools import coint

The first step to any Python project is to import all the necessary packages that we will use on this strategy. Here’s a breakdown of the packages:

  • Fetches historical stock or forex data directly from Yahoo Finance.
  • Handles data manipulation and cleaning.
  • Provides mathematical operations and support for arrays.
  • Performs statistical analysis for regression analysis and cointegration testing.
  • Creates visualizations for data and analysis.
  • Finds all possible pairs from a list of stocks or forex pairs.
  • Tests for cointegration between two time series, essential for pairs trading.

Stock Selection and Identifying Cointegrated Pairs

After we import all the libraries, we need to find highly correlated stock pairs to perform our trading strategy on. Let’s list the stocks to analyze. For this example,we will focus on the media and entertainment industry.

stocks = [
"DIS", # The Walt Disney Company
"NFLX", # Netflix, Inc.
"PARA", # Paramount Global
"CMCSA", # Comcast Corporation
"FOX", # Fox Corporation (Class B)
"FOXA", # Fox Corporation (Class A)
"WBD", # Warner Bros. Discovery, Inc.
"ROKU", # Roku, Inc.
"SPOT", # Spotify Technology S.A.
"LYV" # Live Nation Entertainment, Inc.
]

After storing the list of stocks to observe under we will utilize yfinance to fetch the historical data of each stock under the list.

# Create a dictionary to hold data for each pair
data_dict = {}

# Loop over each FX symbol in the list
for ticker in stocks:
print(f"Downloading data for: {ticker}")
data = yf.download(
tickers=ticker,
period="1y",
interval="1d"
)

# Store the DataFrame in our dictionary
data_dict[ticker] = data

We start by creating an empty dictionary named to hold the data for each stock ticker. Then, we create a loop to iterate over a list of stocks we previously created. For each ticker in the stocks list, we use the functionfrom the yfinance library to fetch one year's worth of daily historical data. Once downloaded, we then store the resulting data in .

The first step in identifying cointegrated pairs of stocks in our search space is to generate a list of unique pairs of stocks from the list, this can be done by performing the function from the package we imported earlier.

stock_pairs = list(combinations(stocks, 2))
stock_pairs

When run, it will generate the following listings:

Figure 2. Output from the combination() function.

By printing we found that there are 45 unique pairs of stocks, which we will test for cointegration to see their eligibility to be pair traded.

To identify cointegrated pairs of stocks, we will perform the Engle-Granger tests from our unique stock list. The Engle-Granger test is a statistical method used to determine if a pair of time series data share a long-term relationship despite being non-stationary individually. It does this by testing if the residuals from a linear regression between the series are stationary.

To run the Engle-Granger test, we run the following code:

results = []  

for pair in stock_pairs:
sym1, sym2 = pair

# Ensure both DataFrames exist in our dictionary
if sym1 not in data_dict or sym2 not in data_dict:
print(f"Data not found for one of these symbols: {sym1}, {sym2}")
continue

# Extract 'Adj Close' price series for each symbol
df1 = data_dict[sym1]["Adj Close"]
df2 = data_dict[sym2]["Adj Close"]

# Combine into one DataFrame on the same dates (inner join), drop missing values
combined = pd.concat([df1, df2], axis=1, join="inner").dropna()
combined.columns = ["Price1", "Price2"]

# If there's not enough data after alignment, skip
if len(combined) < 10:
continue

# Run Engle-Granger cointegration test
coint_t, p_value, critical_values = coint(combined["Price1"], combined["Price2"])

# Check if p-value < 0.05 for significance
is_significant = (p_value < 0.05)

# Store results
results.append({
"Symbol1": sym1,
"Symbol2": sym2,
"Test Statistic": coint_t,
"p-value": p_value,
"5% Critical Value": critical_values[0], # 1%, 5%, 10% in array
"Is_Cointegrated_5pct": is_significant
})

for res in results:
status = "Cointegrated" if res["Is_Cointegrated_5pct"] else "Not Cointegrated"
print(
f"{res['Symbol1']} & {res['Symbol2']} | "
f"Test Statistic: {res['Test Statistic']:.3f} | "
f"p-value: {res['p-value']:.3f} | "
f"5% Crit. Value: {res['5% Critical Value']:.3f} | "
f"Result: {status}"
)

Breakdown of above code:


  1. The list is prepared to store test outcomes for each pair.

  2. For each pair in , we unpack the pair into and .
    It checks whether both symbols have corresponding data in . If not, it logs a message and skips to the next pair.

  3. The adjusted close prices () for and are extracted from the . The two time series are merged into a single DataFrame using an to ensure both have data for the same dates, we also use to drop any missing values.

  4. Using the function from the test for cointegration for each pair in our . It returns:
    The test statistic ().
    The p-value (), indicating the likelihood that the two series are not cointegrated.
    Critical values at different confidence levels, which we store as .

  5. A p-value below 0.05 indicates significant cointegration at the 5% level. We use 5% as it is a widely used threshold, you are welcome to adjust this parameter as you see fit, like increasing the threshold to 10% if you can’t find any cointegrated. The result for the significance check is stored as .

  6. A dictionary containing the test outcomes for each pair is appended to the results list.

  7. The results are printed for each pair, showing the test statistic, p-value, critical value, and whether the pair is cointegrated.

If there are no errors found upon running the code, the following output will be generated:

Figure 3. Finding Cointegrated Stock Pairs

If you prefer to see the output in a dataframe, you can run the following code to convert the test result into a dataframe:

#Convert Cointegration Test Result to Dataframe
results_df = pd.DataFrame(results)
results_df

Next, let’s shortlist to just show stock pairs that are cointegrated and store it in :

# Filter for rows where Is_Cointegrated_5pct is True
filtered_df = results_df[results_df['Is_Cointegrated_5pct'] == True]

# Create a new list of only the cointegrated pairs
cointegrated_pairs = [
(row['Symbol1'], row['Symbol2'])
for _, row in filtered_df.iterrows()
]

print("Cointegrated Pairs (5% level):")
for cp in cointegrated_pairs:
print(cp)
Figure 4. Shortlisted cointegrated stock pairs

We found that there are 5 stock pairs that For easier analysis, let’s just focus on one pair to analyze. For this example, let’s go with and

Finding The Spread

As I have previously mentioned, signals in pairs trading are generated by the spread of our pair of assets. Therefore, after we found a highly correlated pair, the next step to find its spread and convert it to Z Score.

We will create a new function, which we’re going to call In this function, we will use the OLS regression model to examine how the price of predicts the price of , calculate the spread between the actual and predicted prices, and visualize the results.

def analyze_pair(data_dict, sym1, sym2):

# Extract & align data
priceX = data_dict[sym1]["Adj Close"].rename("X")
priceY = data_dict[sym2]["Adj Close"].rename("Y")
df = pd.concat([priceX, priceY], axis=1, join="inner").dropna()

# If insufficient data remains, exit
if len(df) < 10:
print(f"Not enough overlapping data for {sym1} & {sym2}. Exiting.")
return None

# Prepare & fit OLS model
X = sm.add_constant(df["X"])
Y = df["Y"]
model = sm.OLS(Y, X).fit()

# Compute predictions & spread
df["Y_pred"] = model.predict(X)
df["Spread"] = df["Y"] - df["Y_pred"]

# Plot results
plt.figure(figsize=(10, 5))
plt.plot(df.index, df["Y"], label=f"{sym2} (Actual)")
plt.plot(df.index, df["Y_pred"], label=f"{sym2} (Predicted from {sym1})")
plt.title(f"Pair: {sym1} (X) → {sym2} (Y)")
plt.legend()
plt.show()

plt.figure(figsize=(10, 4))
plt.plot(df.index, df["Spread"], label="Spread (Y - Y_pred)")
plt.axhline(df["Spread"].mean(), color='red', linestyle='--', label="Spread Mean")
plt.title(f"Spread for {sym1} & {sym2}")
plt.legend()
plt.show()

results_dict = {
"model_params": model.params, # alpha (const) and beta
"df": df, # the aligned dataframe with Spread
"summary": model.summary() # statsmodels summary object
}
return results_dict

Breakdown of above code:


  1. The function will take a dictionary of historical price data, where each key is a financial asset’s symbol and its value is a DataFrame containing price information.

  2. Extracts the adjusted closing prices for the two specified symbols.
    And then we will combine price series into a single DataFrame with matching dates and remove any missing data.

  3. Fits a linear regression model using as the independent variable and as the dependent variable.
    This determines the relationship (slope and intercept) between the two assets’ prices.
  4. We then will use a regression model to predict the prices of based on and calculate the spread between the two input variables.
  5. We’re going to visualize two plots with this function:
    a. A comparison of the actual prices of sym2 against the predicted prices based on the relationship with sym1.
    b. A plot of the spread over time, highlighting its mean for reference.

  6. The function will return a dictionary with the following:
    a. Regression parameters
    b. The aligned DataFrame containing the actual prices, predicted prices, and spread.
    c. A statistical summary of the regression model, including metrics like R-squared and standard errors.

We can run the function by executing the following lines of code:

sym1 = "PARA"
sym2 = "SPOT"
pair_results = analyze_pair(data_dict, sym1, sym2)
Figure 5: Displaying the spread as the linear regression model's residuals

As we can see on the second chart shown in Figure 5, we have successfully visualized the spread as the residuals of the linear regression model. Since spreads may vary among asset classes, it would be ideal to standardize them into a single scalar to facilitate comparison and strategy formulation. We can do this by converting the spread into Z-scores.

We use Z-score to measure how many standard deviations the daily spread deviates from its mean. The magnitude of the Z-score indicates the distance from the mean, positive for above the mean and negative for below.

For stock prices, which are inherently volatile, we use a moving average approach to calculate the running mean and standard deviation. Where each daily spread would have corresponding running mean and standard deviation based on the collection of the spreads in the rolling window.

In order to convert the spread to z-score, we will create a new function called

def convert_zscore(df, sym1, sym2, window_size=10):
#Compute rolling mean, rolling std, and Z-score
df["Spread_MA"] = df["Spread"].rolling(window_size).mean()
df["Spread_STD"] = df["Spread"].rolling(window_size).std()
df["Zscore"] = (df["Spread"] - df["Spread_MA"]) / df["Spread_STD"]

# Visualize the Z-score
plt.figure(figsize=(10, 4))
plt.plot(df.index, df["Zscore"], label="Z-Score of Spread")
plt.axhline(0, color="black", linestyle="--", lw=1)
plt.axhline(2.0, color="green", linestyle="--", lw=1, label="+2 Z")
plt.axhline(1.0, color="green", linestyle="--", lw=1, label="+1 Z")
plt.axhline(-1.0, color="red", linestyle="--", lw=1, label="-1 Z")
plt.axhline(-2.0, color="red", linestyle="--", lw=1, label="-2 Z")
plt.title(f"Z-Score of Spread (Window={window_size}): {sym1}, {sym2}")
plt.legend()
plt.show()

return df

Breakdown of above code:

  1. The input takes a dataFrame containing at least a column named, which represents the difference between the actual and predicted values of two assets of our input variable which will be represented by and Additionally, the last input variable will be , which will be the size of the rolling window used to compute the mean and standard deviation.

  2. Computes the moving average of the “Spread” column over the specified rolling window size.
    Computes the moving standard deviation of the “Spread” column over the rolling window size.
    Calculates the Z-score for each point in the “Spread” column. The Z-score measures how many standard deviations the current spread deviates from the rolling mean.

  3. For each row in the DataFrame, we will apply the following formula:
    (spread -spread mean) / spread standard deviation

  4. We further will plot the Z-score over time. Additionally, for easy viewing, we also add horizontal reference lines at key Z-score levels to easily identify the the deviation from its means.

  5. The function returns the updated , which now contains three new columns:
    Rolling mean of the spread.
    Rolling standard deviation of the spread.
    Z-score of the spread.

We can run the function by executing the following lines of code:

df_zscore = convert_zscore(pair_results_df, sym1, sym2, window_size=10)
Figure 6. Visualizing the z-scores after standardizing the spreads using the running mean and standard deviation

Trading Strategy Formulation and Backtesting

After successfully converting the spread into Z-scores, we can use them to generate trading signals:

  • : Enter a long position in the first stock when the Z-score falls below a preset negative threshold (e.g., -2).
  • : Exit the long position in the first stock when the Z-score crosses above a less extreme negative threshold (e.g., -1).
  • : Enter a short position in the second stock when the Z-score exceeds a preset positive threshold (e.g., 2).
  • : Exit the short position in the second stock when the Z-score crosses below a less extreme positive threshold (e.g., 1).
Figure 7. Formulation of trading signals based on the threshold of z-scores

Now let’s get into coding the strategy by creating a new function called :

def run_pair_trading(sym1, sym2, data_dict, df_zscore, window_size=10, initial_equity=100_000.0):
# 1) Align close prices
df1 = data_dict[sym1]["Adj Close"].rename("X")
df2 = data_dict[sym2]["Adj Close"].rename("Y")
df = pd.concat([df1, df2], axis=1, join="inner").dropna().sort_index()

if len(df) < window_size:
return {
"df": None,
"metrics": f"Not enough data for {sym1}-{sym2}",
"trades_df": pd.DataFrame()
}

df = df.join(df_zscore, how="inner", rsuffix="_zscore")

if "Zscore" not in df or df["Zscore"].isna().all():
return {
"df": None,
"metrics": f"Missing or invalid Zscore data for {sym1}-{sym2}",
"trades_df": pd.DataFrame()
}

# 2) Determine positions on each pair
df["x_position"] = np.nan
df["y_position"] = np.nan

# zscore > 2 => Short X, Long Y
df.loc[df["Zscore"] > 2, ["x_position", "y_position"]] = [-1, 1]

# -1 < zscore < 1 => Exit
df.loc[(df["Zscore"] > -1) & (df["Zscore"] < 1), ["x_position", "y_position"]] = [0, 0]

# zscore < -2 => Long X, Short Y
df.loc[df["Zscore"] < -2, ["x_position", "y_position"]] = [1, -1]

# Forward-fill positions in between signals
df["x_position"] = df["x_position"].ffill().fillna(0)
df["y_position"] = df["y_position"].ffill().fillna(0)

# 3) Calculate daily returns from each pair
df["x_return"] = df["X"].pct_change().fillna(0.0)
df["y_return"] = df["Y"].pct_change().fillna(0.0)

# Equity Allocation
df["x_notional"] = 0.02 * initial_equity
df["y_notional"] = 0.02 * initial_equity

# Daily PnL for each pair
df["daily_pnl_x"] = df["x_position"].shift(1) * df["x_notional"] * df["x_return"]
df["daily_pnl_y"] = df["y_position"].shift(1) * df["y_notional"] * df["y_return"]
df[["daily_pnl_x", "daily_pnl_y"]] = df[["daily_pnl_x", "daily_pnl_y"]].fillna(0.0)

df["daily_pnl"] = df["daily_pnl_x"] + df["daily_pnl_y"]
df["equity"] = initial_equity + df["daily_pnl"].cumsum()

# 4) Performance metrics
final_equity = df["equity"].iloc[-1]
total_return_pct = (final_equity - initial_equity) / initial_equity

df["equity_return"] = df["equity"].pct_change().fillna(0.0)
ann_factor = 252
mean_daily_ret = df["equity_return"].mean()
std_daily_ret = df["equity_return"].std()

if std_daily_ret != 0:
sharpe_ratio = (mean_daily_ret / std_daily_ret) * np.sqrt(ann_factor)
else:
sharpe_ratio = np.nan

neg_returns = df.loc[df["equity_return"] < 0, "equity_return"]
std_downside = neg_returns.std() if not neg_returns.empty else np.nan
if std_downside and std_downside != 0:
sortino_ratio = (mean_daily_ret / std_downside) * np.sqrt(ann_factor)
else:
sortino_ratio = np.nan

df["running_max"] = df["equity"].cummax()
df["drawdown"] = (df["equity"] / df["running_max"]) - 1
max_drawdown = df["drawdown"].min()

# 5) Trade-by-trade details
df["trade_signal"] = (df["x_position"].diff().abs() > 0) | (df["y_position"].diff().abs() > 0)
trades = df[df["trade_signal"]].copy()
trades["entry_date"] = trades.index
trades["exit_date"] = trades["entry_date"].shift(-1)
trades["pnl"] = trades["daily_pnl"]
trades["x_position"] = trades["x_position"]
trades["y_position"] = trades["y_position"]

trades_df = trades[["entry_date", "exit_date", "x_position", "y_position", "pnl"]]
num_trades = len(trades_df)
win_rate = (trades_df[trades_df["pnl"] > 0].shape[0] / num_trades) if num_trades > 0 else np.nan

metrics = {
"sym1": sym1,
"sym2": sym2,
"final_equity": final_equity,
"total_return_pct": total_return_pct,
"sharpe_ratio": sharpe_ratio,
"sortino_ratio": sortino_ratio,
"max_drawdown_pct": max_drawdown,
"num_trades": num_trades,
"win_rate_pct": 100.0 * win_rate if not np.isnan(win_rate) else np.nan
}

return {
"df": df,
"metrics": metrics,
"trades_df": trades_df
}

  1. First we need to align the adjusted close prices of the two assets and ensure data consistency. First is to check if there’s enough overlapping data to run the analysis, and then join the aligned price data with the Z-score data to include the spread’s standardized values.

  2. Long/Short Signals:
    Go short on the first asset and long on the second .
    Go long on the first asset and short on the second .
    Exit all positions.
    Positions are forward-filled to maintain positions until a new signal is triggered.

  3. Computes percentage changes in prices for both assets.
    Allocates of the initial equity to each asset for trading.

    — Multiplies positions by notional value and daily returns to compute profit/loss for each asset.
    — Sums the PnL of both assets to get the total daily PnL.
    Tracks the cumulative equity over time by adding daily PnL to the initial equity.

  4. The total value of the portfolio at the end of the simulation.
    Percentage return from initial equity to final equity.
    Measures risk-adjusted returns based on mean and standard deviation of daily returns.
    Similar to the Sharpe Ratio but only considers downside volatility (negative returns).
    The largest peak-to-trough decline in equity measured as a percentage.
    Tracks the highest equity value over time to compute drawdowns.

  5. When either (position of Asset 1) or (position of Asset 2)changes, a new trade is initiated/closed.

    The date on which the position changes.
    The next signal date (shifted by -1).
    The daily_pnl on that date (though for a more detailed trade PnL, you’d sum over the holding period).
    The actual positions in Asset 1 and Asset 2 at entry.
    : Count of trading signal generated.
    Percentage of trades with pnl > 0.


  6. The DataFrame with detailed results, including prices, positions, returns, PnL, and equity over time.
    A dictionary of key performance metrics for the trading strategy.
    A DataFrame summarizing individual trade details.

Now we have formulated our trading strategy, it’s time for us to run it. We can do this by:

pair_trading = run_pair_trading(sym1,sym2,data_dict,df_zscore)
pair_trading['metrics']

We run the so that we can focus on the performance of our strategy, you can also run if you want to see the full and detailed dataframe, or to only see the trades made during our backtesting period.

Figure 8. Performance result of our trading pair trading strategy

The relatively high Sharpe ratio of about and the low max drawdown of suggest that despite the small total return, the strategy maintained a favorable balance between returns and risk.

def plot_return(df):
# Calculate cumulative return in percentage
initial_equity = df["equity"].iloc[0]
df["cumulative_return_pct"] = ((df["equity"] - initial_equity) / initial_equity) * 100

plt.figure(figsize=(14, 7))
plt.plot(df.index, df["cumulative_return_pct"], label="Cumulative Return (%)", color="green", alpha=0.8)
plt.title("Cumulative Return in Percentage")
plt.xlabel("Date")
plt.ylabel("Cumulative Return (%)")
plt.legend()
plt.grid()
plt.show()

Now we can run the followingfunction to generate the

plot_return(pair_trading['df'])
Figure 9. Cumulative return graph of our pair trading strategy

Summary

This strategy provides a solid foundation for a pairs trading, but there is definitely a lot of room for refinement. Potential improvements include adjusting position sizing based on volatility, adjust trading signals parameters (you can try the bayesian optimization technique), and taking into accounts transaction costs and slippage to ensure more realistic backtesting. However, this example serves as a great starting point for further development and experimentation in pair trading.

https://algocraft.gumroad.com/l/pairtrading01

--

--

Alwan Alkautsar
Alwan Alkautsar

Written by Alwan Alkautsar

🚀 Product Manager | 📈 Stock & Forex Trader with a Dash of Algo | 🎾 Weekend Warrior on the Tennis Court

No responses yet