Getting Started with Statistical Arbitrage: A Comprehensive Guide to Pairs Trading in Python
There are countless algorithmic trading strategies out there, and statistical arbitrage is among the most popular: it uses statistical methods to identify significant relationships between assets.
At its core, a statistical arbitrage strategy relies on the principle of mean reversion. When two assets are highly correlated and one of them moves disproportionately relative to the other, a trader opens a position to exploit the short-term fluctuation, on the assumption that the prices of both assets will revert to their historical relationship.
Although perfect correlation is extremely rare in the market, there are statistical methods we can employ to find highly correlated stocks. Traders do this by screening for stocks that exhibit a high degree of comovement (via correlation analysis or cointegration testing) and then exploiting short-term price fluctuations, which are usually driven by market sentiment or sudden news announcements.
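As a quick illustration of what testing for comovement can look like in code, here is a minimal sketch using synthetic prices (the series and numbers are made up purely for demonstration) that measures both correlation and cointegration for a pair of series:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import coint

# Synthetic example: series_b tracks series_a plus stationary noise,
# so the two are both highly correlated and cointegrated
rng = np.random.default_rng(42)
series_a = pd.Series(100 + np.cumsum(rng.normal(0, 1, 500)))
series_b = 0.8 * series_a + rng.normal(0, 2, 500)

correlation = series_a.corr(series_b)           # Pearson correlation
t_stat, p_value, _ = coint(series_a, series_b)  # Engle-Granger cointegration test
print(f"Correlation: {correlation:.3f}")
print(f"Cointegration p-value: {p_value:.3f}")  # small p-value suggests cointegration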
Figure 1 illustrates the historical price movements of Stock A (blue line) and Stock B (red line). Notice how the two stocks have moved closely together over time, demonstrating a strong positive correlation. In the most recent period, however, Stock B’s price spikes upward, deviating significantly from its usual relationship with Stock A. Traders would interpret this as a temporary mispricing and take advantage of it by executing the following trades:
Option 1. Going Long on Stock A → Assuming Stock A will rise.
Option 2. Going Short on Stock B → Assuming Stock B will fall back.
Both actions rest on the assumption that the two stocks will revert toward their historical relationship. Traders can take only Option 1, only Option 2, or both. When they take both options simultaneously, that is what we refer to as pairs trading.
Understanding Pairs Trading
Pairs trading is a form of statistical arbitrage in which the trader simultaneously takes two positions on a short-term mispricing of highly correlated assets. In the previous example, that means going long on Stock A and going short on Stock B.
In pairs trading, trading signals are generated by the price difference (spread) of the two assets that we are pairing. Traders would find assets with high correlation, identify unusually large spreads in comparison to their historical trend, and open simultaneous positions on both assets with the anticipation that this divergence will correct itself.
Step-by-Step Guide to Pairs Trading with Python
Importing the Packages
import yfinance as yf
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from itertools import combinations
from statsmodels.tsa.stattools import coint
The first step of any Python project is to import all the packages we will use in this strategy. Here’s a breakdown of the packages:
- yfinance: Fetches historical stock or forex data directly from Yahoo Finance.
- pandas: Handles data manipulation and cleaning.
- numpy: Provides mathematical operations and support for arrays.
- statsmodels.api: Performs statistical analysis for regression analysis and cointegration testing.
- matplotlib.pyplot: Creates visualizations for data and analysis.
- itertools.combinations: Finds all possible pairs from a list of stocks or forex pairs.
- statsmodels.tsa.stattools.coint: Tests for cointegration between two time series, essential for pairs trading.
Stock Selection and Identifying Cointegrated Pairs
After importing the libraries, we need to find highly correlated stock pairs to trade. Let’s list the stocks to analyze. For this example, we will focus on the media and entertainment industry.
stocks = [
"DIS", # The Walt Disney Company
"NFLX", # Netflix, Inc.
"PARA", # Paramount Global
"CMCSA", # Comcast Corporation
"FOX", # Fox Corporation (Class B)
"FOXA", # Fox Corporation (Class A)
"WBD", # Warner Bros. Discovery, Inc.
"ROKU", # Roku, Inc.
"SPOT", # Spotify Technology S.A.
"LYV" # Live Nation Entertainment, Inc.
]
After storing the list of stocks to observe under stocks, we will utilize yfinance to fetch the historical data of each stock under the list.
# Create a dictionary to hold data for each stock
data_dict = {}

# Loop over each stock symbol in the list
for ticker in stocks:
    print(f"Downloading data for: {ticker}")
    data = yf.download(
        tickers=ticker,
        period="1y",
        interval="1d",
        auto_adjust=False  # keep the "Adj Close" column we use below
    )
    # Store the DataFrame in our dictionary
    data_dict[ticker] = data
We start by creating an empty dictionary named data_dict to hold the data for each stock ticker. We then loop over the stocks list and, for each ticker, use the yf.download function from the yfinance library to fetch one year’s worth of daily historical data. Once downloaded, the resulting DataFrame is stored in data_dict.
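Before moving on, it’s worth sanity-checking the download. A quick look at one ticker (any symbol from the list works here) confirms the data arrived as expected:
# Peek at the first rows of one downloaded DataFrame
print(data_dict["DIS"].head())

# Confirm every ticker actually returned data
for ticker, data in data_dict.items():
    print(ticker, len(data), "rows")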
Finding All Unique Stock Pair Combinations
The first step in identifying cointegrated pairs in our search space is to generate a list of unique stock pairs from the stocks list. This can be done with the combinations() function from the itertools package we imported earlier.
stock_pairs = list(combinations(stocks, 2))
stock_pairs
When run, this prints every pair combination. There are 45 unique pairs in total, and we will test each one for cointegration to see whether it is eligible for pairs trading.
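The count of 45 follows from the number of ways to choose 2 out of 10 stocks, C(10, 2) = (10 × 9) / 2 = 45, which we can verify directly:
from math import comb
print(comb(len(stocks), 2))  # 45 unique pairs from 10 tickers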
Identifying Cointegrated Pairs of Stocks
To identify cointegrated pairs of stocks, we will run the Engle-Granger test on each pair in our list. The Engle-Granger test is a statistical method used to determine whether two time series share a long-term relationship despite being non-stationary individually. It does this by testing whether the residuals from a linear regression between the series are stationary.
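Conceptually, Engle-Granger is a two-step procedure: regress one series on the other, then run a unit-root (ADF) test on the regression residuals. The coint() helper does all of this in one call and uses critical values adjusted for the estimated regression, so it is the one to use in practice; still, a minimal hand-rolled sketch of the underlying idea (on synthetic data, for illustration only) looks like this:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# Synthetic cointegrated pair, for illustration only
rng = np.random.default_rng(0)
series_x = pd.Series(100 + np.cumsum(rng.normal(0, 1, 500)))
series_y = 0.8 * series_x + rng.normal(0, 2, 500)

# Step 1: OLS regression of one series on the other
residuals = sm.OLS(series_y, sm.add_constant(series_x)).fit().resid

# Step 2: ADF unit-root test on the residuals;
# stationary residuals suggest the pair is cointegrated
adf_stat, adf_p, *_ = adfuller(residuals)
print(f"ADF p-value on residuals: {adf_p:.3f}")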
To run the Engle-Granger test, we run the following code:
results = []

for pair in stock_pairs:
    sym1, sym2 = pair

    # Ensure both DataFrames exist in our dictionary
    if sym1 not in data_dict or sym2 not in data_dict:
        print(f"Data not found for one of these symbols: {sym1}, {sym2}")
        continue

    # Extract 'Adj Close' price series for each symbol
    df1 = data_dict[sym1]["Adj Close"]
    df2 = data_dict[sym2]["Adj Close"]

    # Combine into one DataFrame on the same dates (inner join), drop missing values
    combined = pd.concat([df1, df2], axis=1, join="inner").dropna()
    combined.columns = ["Price1", "Price2"]

    # If there's not enough data after alignment, skip
    if len(combined) < 10:
        continue

    # Run Engle-Granger cointegration test
    coint_t, p_value, critical_values = coint(combined["Price1"], combined["Price2"])

    # Check if p-value < 0.05 for significance
    is_significant = (p_value < 0.05)

    # Store results
    results.append({
        "Symbol1": sym1,
        "Symbol2": sym2,
        "Test Statistic": coint_t,
        "p-value": p_value,
        "5% Critical Value": critical_values[1],  # array holds the 1%, 5%, 10% values
        "Is_Cointegrated_5pct": is_significant
    })

for res in results:
    status = "Cointegrated" if res["Is_Cointegrated_5pct"] else "Not Cointegrated"
    print(
        f"{res['Symbol1']} & {res['Symbol2']} | "
        f"Test Statistic: {res['Test Statistic']:.3f} | "
        f"p-value: {res['p-value']:.3f} | "
        f"5% Crit. Value: {res['5% Critical Value']:.3f} | "
        f"Result: {status}"
    )
Breakdown of the above code:
- Initialize Results Storage
The results list is prepared to store the test outcome for each pair.
- Loop Through Pairs
For each pair in stock_pairs, we unpack the pair into sym1 and sym2, then check whether both symbols have data in data_dict. If not, we log a message and skip to the next pair.
- Data Alignment
The adjusted close prices (Adj Close) for sym1 and sym2 are extracted from data_dict. The two time series are merged into a single DataFrame using an inner join so both have data for the same dates, and .dropna() removes any missing values.
- Engle-Granger Test
Using the coint() function from statsmodels.tsa.stattools, we test each pair for cointegration. It returns:
The test statistic (coint_t).
The p-value (p_value), the probability of observing such a test statistic if the two series were not cointegrated.
The critical values at the 1%, 5%, and 10% levels, which we store as critical_values.
- Significance Check
A p-value below 0.05 indicates significant cointegration at the 5% level. We use 5% because it is a widely used threshold; feel free to adjust this parameter, for example loosening it to 10% if you cannot find any cointegrated pairs. The result of the significance check is stored as Is_Cointegrated_5pct.
- Results Storage
A dictionary containing the test outcome for each pair is appended to the results list.
- Return Value
The results are printed for each pair, showing the test statistic, p-value, 5% critical value, and whether the pair is cointegrated.
If the code runs without errors, it prints the test result for each of the 45 pairs.
If you prefer to see the output as a DataFrame, you can convert the test results with the following code:
# Convert cointegration test results to a DataFrame
results_df = pd.DataFrame(results)
results_df
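One optional convenience: sorting the DataFrame by p-value puts the strongest cointegration candidates at the top.
# Strongest candidates (lowest p-values) first
results_df.sort_values("p-value").head()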
Next, let’s shortlist to just show stock pairs that are cointegrated and store it in cointegrated_pairs:
# Filter for rows where Is_Cointegrated_5pct is True
filtered_df = results_df[results_df['Is_Cointegrated_5pct'] == True]

# Create a new list of only the cointegrated pairs
cointegrated_pairs = [
    (row['Symbol1'], row['Symbol2'])
    for _, row in filtered_df.iterrows()
]

print("Cointegrated Pairs (5% level):")
for cp in cointegrated_pairs:
    print(cp)
We found that there are 5 cointegrated stock pairs. For easier analysis, let’s focus on just one pair. For this example, let’s go with PARA and SPOT.
Finding The Spread
As previously mentioned, signals in pairs trading are generated from the spread of the paired assets. Therefore, after finding a highly correlated pair, the next step is to find its spread and convert it to a Z-score.
We will create a new function, which we’re going to call analyze_pair. In this function, we will use the OLS regression model to examine how the price of PARA predicts the price of SPOT, calculate the spread between the actual and predicted prices, and visualize the results.
def analyze_pair(data_dict, sym1, sym2):
    # Extract & align data
    priceX = data_dict[sym1]["Adj Close"].rename("X")
    priceY = data_dict[sym2]["Adj Close"].rename("Y")
    df = pd.concat([priceX, priceY], axis=1, join="inner").dropna()

    # If insufficient data remains, exit
    if len(df) < 10:
        print(f"Not enough overlapping data for {sym1} & {sym2}. Exiting.")
        return None

    # Prepare & fit OLS model
    X = sm.add_constant(df["X"])
    Y = df["Y"]
    model = sm.OLS(Y, X).fit()

    # Compute predictions & spread
    df["Y_pred"] = model.predict(X)
    df["Spread"] = df["Y"] - df["Y_pred"]

    # Plot results
    plt.figure(figsize=(10, 5))
    plt.plot(df.index, df["Y"], label=f"{sym2} (Actual)")
    plt.plot(df.index, df["Y_pred"], label=f"{sym2} (Predicted from {sym1})")
    plt.title(f"Pair: {sym1} (X) → {sym2} (Y)")
    plt.legend()
    plt.show()

    plt.figure(figsize=(10, 4))
    plt.plot(df.index, df["Spread"], label="Spread (Y - Y_pred)")
    plt.axhline(df["Spread"].mean(), color='red', linestyle='--', label="Spread Mean")
    plt.title(f"Spread for {sym1} & {sym2}")
    plt.legend()
    plt.show()

    results_dict = {
        "model_params": model.params,  # alpha (const) and beta
        "df": df,                      # the aligned DataFrame with Spread
        "summary": model.summary()     # statsmodels summary object
    }
    return results_dict
Breakdown of the above code:
- Function Inputs
The function takes a dictionary of historical price data, where each key is an asset’s symbol and each value is a DataFrame containing its price information, along with the two symbols to analyze.
- Data Extraction and Alignment
Extracts the adjusted closing prices for the two specified symbols, combines the price series into a single DataFrame with matching dates, and removes any missing data.
- Regression Model Fitting
Fits a linear regression model using sym1 as the independent variable and sym2 as the dependent variable. This determines the relationship (slope and intercept) between the two assets’ prices.
- Prediction and Spread Calculation
Uses the regression model to predict the prices of sym2 based on sym1 and calculates the spread as the difference between the actual and predicted prices.
- Visualization
The function produces two plots:
a. A comparison of the actual prices of sym2 against the prices predicted from its relationship with sym1.
b. A plot of the spread over time, highlighting its mean for reference.
- Return Value
The function returns a dictionary containing:
a. The regression parameters.
b. The aligned DataFrame containing the actual prices, predicted prices, and spread.
c. A statistical summary of the regression model, including metrics like R-squared and standard errors.
We can run the analyze_pair function by executing the following lines of code:
sym1 = "PARA"
sym2 = "SPOT"
pair_results = analyze_pair(data_dict, sym1, sym2)
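Besides the two charts, the returned dictionary lets us inspect the fitted relationship directly; for example:
# Regression intercept (alpha) and slope (beta) of the PARA → SPOT fit
print(pair_results["model_params"])

# Full statsmodels regression summary (R-squared, standard errors, etc.)
print(pair_results["summary"])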
As we can see in the second chart shown in Figure 5, we have successfully visualized the spread as the residuals of the linear regression model. Since spreads vary across asset pairs, it is useful to standardize them onto a common scale to facilitate comparison and strategy formulation. We can do this by converting the spread into Z-scores.
Converting to Z-score
We use the Z-score to measure how many standard deviations the daily spread deviates from its mean. The magnitude of the Z-score indicates the distance from the mean: positive for above the mean, negative for below.
For stock prices, which are inherently volatile, we use a moving-average approach: each daily spread gets a running mean and standard deviation computed over a rolling window of recent spreads.
To convert the spread to a Z-score, we will create a new function called convert_zscore:
def convert_zscore(df, sym1, sym2, window_size=10):
    # Compute rolling mean, rolling std, and Z-score
    df["Spread_MA"] = df["Spread"].rolling(window_size).mean()
    df["Spread_STD"] = df["Spread"].rolling(window_size).std()
    df["Zscore"] = (df["Spread"] - df["Spread_MA"]) / df["Spread_STD"]

    # Visualize the Z-score
    plt.figure(figsize=(10, 4))
    plt.plot(df.index, df["Zscore"], label="Z-Score of Spread")
    plt.axhline(0, color="black", linestyle="--", lw=1)
    plt.axhline(2.0, color="green", linestyle="--", lw=1, label="+2 Z")
    plt.axhline(1.0, color="green", linestyle="--", lw=1, label="+1 Z")
    plt.axhline(-1.0, color="red", linestyle="--", lw=1, label="-1 Z")
    plt.axhline(-2.0, color="red", linestyle="--", lw=1, label="-2 Z")
    plt.title(f"Z-Score of Spread (Window={window_size}): {sym1}, {sym2}")
    plt.legend()
    plt.show()

    return df
Breakdown of the above code:
- Inputs
The function takes a DataFrame containing at least a column named "Spread" (the difference between the actual and predicted prices of the two assets, represented by sym1 and sym2), plus window_size, the size of the rolling window used to compute the mean and standard deviation.
- Rolling Calculations
Rolling Mean (Spread_MA): the moving average of the "Spread" column over the specified rolling window.
Rolling Standard Deviation (Spread_STD): the moving standard deviation of the "Spread" column over the same window.
Z-Score (Zscore): how many standard deviations the current spread deviates from the rolling mean.
- Z-Score Calculation Formula
For each row in the DataFrame, we apply:
Zscore = (Spread - Spread_MA) / Spread_STD
- Visualization
We plot the Z-score over time and, for easy viewing, add horizontal reference lines at key Z-score levels so deviations from the mean are easy to spot.
- Return Value
The function returns the updated df, which now contains three new columns:
Spread_MA: rolling mean of the spread.
Spread_STD: rolling standard deviation of the spread.
Zscore: Z-score of the spread.
We can run the convert_zscore function on the DataFrame returned by analyze_pair:
df_zscore = convert_zscore(pair_results["df"], sym1, sym2, window_size=10)
Trading Strategy Formulation and Backtesting
After successfully converting the spread into Z-scores, we can use them to generate trading signals (a minimal standalone sketch of this mapping follows the list):
- Entry on a high spread: When the Z-score exceeds a preset positive threshold (e.g., 2), go short on the first stock and long on the second.
- Entry on a low spread: When the Z-score falls below a preset negative threshold (e.g., -2), go long on the first stock and short on the second.
- Exit: When the Z-score moves back inside a less extreme band (e.g., between -1 and 1), close both positions.
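Here is a minimal standalone sketch of that threshold logic on a toy Z-score series (the values are made up purely to show the mapping; the full backtest below applies the same rules to real data):
import numpy as np
import pandas as pd

zscore = pd.Series([0.5, 2.3, 1.5, 0.4, -2.2, -1.5, -0.3])

# Position of the first leg (the second leg is simply its opposite)
position = pd.Series(np.nan, index=zscore.index)
position[zscore > 2] = -1        # short first leg, long second leg
position[zscore < -2] = 1        # long first leg, short second leg
position[zscore.abs() < 1] = 0   # exit when the spread normalizes
position = position.ffill().fillna(0)

print(position.tolist())  # [0.0, -1.0, -1.0, 0.0, 1.0, 1.0, 0.0]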
Now let’s get into coding the strategy by creating a new function called run_pair_trading:
def run_pair_trading(sym1, sym2, data_dict, df_zscore, window_size=10, initial_equity=100_000.0):
    # 1) Align close prices
    df1 = data_dict[sym1]["Adj Close"].rename("X")
    df2 = data_dict[sym2]["Adj Close"].rename("Y")
    df = pd.concat([df1, df2], axis=1, join="inner").dropna().sort_index()

    if len(df) < window_size:
        return {
            "df": None,
            "metrics": f"Not enough data for {sym1}-{sym2}",
            "trades_df": pd.DataFrame()
        }

    df = df.join(df_zscore, how="inner", rsuffix="_zscore")
    if "Zscore" not in df or df["Zscore"].isna().all():
        return {
            "df": None,
            "metrics": f"Missing or invalid Zscore data for {sym1}-{sym2}",
            "trades_df": pd.DataFrame()
        }

    # 2) Determine positions on each pair
    df["x_position"] = np.nan
    df["y_position"] = np.nan

    # zscore > 2 => Short X, Long Y
    df.loc[df["Zscore"] > 2, ["x_position", "y_position"]] = [-1, 1]
    # -1 < zscore < 1 => Exit
    df.loc[(df["Zscore"] > -1) & (df["Zscore"] < 1), ["x_position", "y_position"]] = [0, 0]
    # zscore < -2 => Long X, Short Y
    df.loc[df["Zscore"] < -2, ["x_position", "y_position"]] = [1, -1]

    # Forward-fill positions in between signals
    df["x_position"] = df["x_position"].ffill().fillna(0)
    df["y_position"] = df["y_position"].ffill().fillna(0)

    # 3) Calculate daily returns from each pair
    df["x_return"] = df["X"].pct_change().fillna(0.0)
    df["y_return"] = df["Y"].pct_change().fillna(0.0)

    # Equity allocation: 2% of initial equity per leg
    df["x_notional"] = 0.02 * initial_equity
    df["y_notional"] = 0.02 * initial_equity

    # Daily PnL for each leg (positions lagged one day to avoid look-ahead)
    df["daily_pnl_x"] = df["x_position"].shift(1) * df["x_notional"] * df["x_return"]
    df["daily_pnl_y"] = df["y_position"].shift(1) * df["y_notional"] * df["y_return"]
    df[["daily_pnl_x", "daily_pnl_y"]] = df[["daily_pnl_x", "daily_pnl_y"]].fillna(0.0)
    df["daily_pnl"] = df["daily_pnl_x"] + df["daily_pnl_y"]
    df["equity"] = initial_equity + df["daily_pnl"].cumsum()

    # 4) Performance metrics
    final_equity = df["equity"].iloc[-1]
    total_return_pct = (final_equity - initial_equity) / initial_equity

    df["equity_return"] = df["equity"].pct_change().fillna(0.0)
    ann_factor = 252
    mean_daily_ret = df["equity_return"].mean()
    std_daily_ret = df["equity_return"].std()
    if std_daily_ret != 0:
        sharpe_ratio = (mean_daily_ret / std_daily_ret) * np.sqrt(ann_factor)
    else:
        sharpe_ratio = np.nan

    neg_returns = df.loc[df["equity_return"] < 0, "equity_return"]
    std_downside = neg_returns.std() if not neg_returns.empty else np.nan
    if std_downside and std_downside != 0:
        sortino_ratio = (mean_daily_ret / std_downside) * np.sqrt(ann_factor)
    else:
        sortino_ratio = np.nan

    df["running_max"] = df["equity"].cummax()
    df["drawdown"] = (df["equity"] / df["running_max"]) - 1
    max_drawdown = df["drawdown"].min()

    # 5) Trade-by-trade details
    df["trade_signal"] = (df["x_position"].diff().abs() > 0) | (df["y_position"].diff().abs() > 0)
    trades = df[df["trade_signal"]].copy()
    trades["entry_date"] = trades.index
    trades["exit_date"] = trades["entry_date"].shift(-1)
    trades["pnl"] = trades["daily_pnl"]
    trades_df = trades[["entry_date", "exit_date", "x_position", "y_position", "pnl"]]

    num_trades = len(trades_df)
    win_rate = (trades_df[trades_df["pnl"] > 0].shape[0] / num_trades) if num_trades > 0 else np.nan

    metrics = {
        "sym1": sym1,
        "sym2": sym2,
        "final_equity": final_equity,
        "total_return_pct": total_return_pct,
        "sharpe_ratio": sharpe_ratio,
        "sortino_ratio": sortino_ratio,
        "max_drawdown_pct": max_drawdown,
        "num_trades": num_trades,
        "win_rate_pct": 100.0 * win_rate if not np.isnan(win_rate) else np.nan
    }

    return {
        "df": df,
        "metrics": metrics,
        "trades_df": trades_df
    }
Breakdown of the above code:
- Align Data
First, we align the adjusted close prices of the two assets and ensure data consistency: we check that there is enough overlapping data to run the analysis, then join the aligned price data with the Z-score data (df_zscore) to include the spread’s standardized values.
- Generate Trading Signals
Z-score > 2: go short on the first asset (sym1) and long on the second (sym2).
Z-score < -2: go long on the first asset (sym1) and short on the second (sym2).
-1 < Z-score < 1: exit all positions.
Positions are forward-filled so they are held until a new signal is triggered.
- Calculate Daily Returns and Equity Changes
Daily Returns: percentage changes in prices for both assets.
Equity Allocation: 2% of the initial equity is allocated to each leg.
Daily PnL: the previous day’s position is multiplied by the notional value and the daily return to compute profit/loss for each asset; the two legs are then summed for the total daily PnL.
Equity Update: cumulative equity is tracked by adding the cumulative daily PnL to the initial equity.
- Performance Metrics
Final Equity: the total value of the portfolio at the end of the simulation.
Total Return: the percentage return from initial equity to final equity.
Sharpe Ratio: risk-adjusted return based on the mean and standard deviation of daily returns.
Sortino Ratio: similar to the Sharpe ratio, but considering only downside volatility (negative returns).
Max Drawdown: the largest peak-to-trough decline in equity, measured as a percentage.
Running Max: the highest equity value seen so far, used to compute drawdowns.
- Trade Details
Identify trade signals: whenever either x_position (position in asset 1) or y_position (position in asset 2) changes, a trade is initiated or closed.
Each trade is recorded in trades_df with:
entry_date: the date on which the position changes.
exit_date: the next signal date (shifted by -1).
pnl: the daily_pnl on that date (for a more precise trade PnL, you would sum over the holding period).
Positions: the actual positions in asset 1 and asset 2 at entry.
Number of trades: the count of trading signals generated.
Win rate: the percentage of trades with pnl > 0.
- Output
Returns a dictionary containing:
df: the DataFrame with detailed results, including prices, positions, returns, PnL, and equity over time.
metrics: a dictionary of key performance metrics for the trading strategy.
trades_df: a DataFrame summarizing individual trade details.
Now that we have formulated our trading strategy, it’s time to run it:
pair_trading = run_pair_trading(sym1, sym2, data_dict, df_zscore)
pair_trading['metrics']
We run pair_trading['metrics'] to focus on the performance of the strategy. You can also run pair_trading['df'] if you want to see the full, detailed DataFrame, or pair_trading['trades_df'] to see only the trades made during the backtesting period.
The relatively high Sharpe ratio of about 2.0 and the low max drawdown of 0.22% suggest that despite the small total return, the strategy maintained a favorable balance between returns and risk.
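If you want to recompute these headline numbers yourself from the returned DataFrame, it only takes a few lines:
# Re-derive the annualized Sharpe ratio and max drawdown from the backtest output
df_bt = pair_trading["df"]
daily = df_bt["equity_return"]
print("Sharpe ratio:", (daily.mean() / daily.std()) * np.sqrt(252))
print("Max drawdown:", df_bt["drawdown"].min())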
Let’s plot the cumulative returns:
def plot_return(df):
    # Calculate cumulative return in percentage
    initial_equity = df["equity"].iloc[0]
    df["cumulative_return_pct"] = ((df["equity"] - initial_equity) / initial_equity) * 100

    plt.figure(figsize=(14, 7))
    plt.plot(df.index, df["cumulative_return_pct"], label="Cumulative Return (%)", color="green", alpha=0.8)
    plt.title("Cumulative Return in Percentage")
    plt.xlabel("Date")
    plt.ylabel("Cumulative Return (%)")
    plt.legend()
    plt.grid()
    plt.show()
Now we can run the function to generate the cumulative return plot:
plot_return(pair_trading['df'])
Summary
This strategy provides a solid foundation for pairs trading, but there is definitely room for refinement. Potential improvements include adjusting position sizing based on volatility, tuning the trading signal parameters (Bayesian optimization is one technique to try), and accounting for transaction costs and slippage to ensure more realistic backtesting. Still, this example serves as a great starting point for further development and experimentation in pairs trading.
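As a concrete sketch of one of those refinements, transaction costs can be approximated by charging a fee on a leg’s notional every time its position changes. The helper below is a rough illustration built on top of the backtest output; the apply_transaction_costs name and the 0.1% cost rate are placeholders, not part of the original strategy:
def apply_transaction_costs(df, cost_rate=0.001, initial_equity=100_000.0):
    # Charge a fee on each leg's notional whenever its position changes
    # (cost_rate is an arbitrary placeholder, not a real broker fee)
    x_changes = df["x_position"].diff().abs().fillna(0)
    y_changes = df["y_position"].diff().abs().fillna(0)
    costs = cost_rate * (x_changes * df["x_notional"] + y_changes * df["y_notional"])
    df["daily_pnl_net"] = df["daily_pnl"] - costs
    df["equity_net"] = initial_equity + df["daily_pnl_net"].cumsum()
    return df

# Example usage on the backtest result
df_net = apply_transaction_costs(pair_trading["df"])
print("Net final equity:", df_net["equity_net"].iloc[-1])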
If you wish to download the Jupyter Notebook file for this strategy, you can find it here: