Statistical Arbitrage with Pairs Trading and Backtesting

Sabir Jana, CFA
Analytics Vidhya
Published in
11 min readJul 31, 2020


Statistical Arbitrage (Stat Arb) are trading strategies that typically take advantage of either mean reversion in share prices or opportunities created by market microstructure anomalies. It is a highly quantitative analytical approach involving a well-diversified portfolio of securities. However, to illustrate the concept I will take the example of pairs trading involving only two cointegrated securities.

Pairs trading is a market-neutral strategy where we use statistical techniques to identify two stocks that are historically highly correlated with each other. When there is a deviation in the price relationship of these stocks, we expect this to be mean reverting and buy the underperforming stock and simultaneously sell the outperforming one. If our mean-reversion assumption is valid then prices should converge to long term average and trade should benefit. However, if the price divergence is not temporary and it is due to structural reasons then there is a high risk of losing the money. You can find the Python Notebook and data used in this article on my Github page. The overall approach followed in this article is mentioned below:

  1. Stock Universe and Identification of Cointegrated Pairs of Stocks.
  2. Perform Stationary test for the Selected Pair.
  3. Generate Trading Signals using z-score.
  4. Portfolio Profit and Loss Calculation.

Stock Universe and Identification of Cointegrated Pairs of Stocks

Our first step is to decide the stock universe and identify the pairs with high correlation. It is very important that this is based on economic relationships such as companies with similar businesses, else it might be spurious. I have taken all constituents of NSE-100 which are categorized as ‘FINANCIAL SERVICES’ companies. This gives us a list of 25 companies to start with. However, we filter out companies with less than 10 years of daily pricing data and are left with only the final 15 stocks. We take the daily closing price for these 15 stocks and split the dataframe into test and training sets. This is to ensure that our decision to select a cointegrated pair is based on the training dataset and backtesting is done using out of sample test dataset. As a first step, we will use the Pearson correlation coefficient to get the basic idea about the relationship between these stocks and then work to identify cointegrated stocks using the function coint form statsmodels.tsa.stattools . The function coint, will return p-values of the cointegration test for each pair. We will store these p-values in an array and visualize it as a heatmap. If a p-value is lower than 0.05, this means that we can reject the null hypothesis and the two time series of different symbols can be cointegrated.

Let’s jump right to the code:

# make the necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
idx = pd.IndexSlice
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import coint
from sklearn.model_selection import train_test_split
%matplotlib inline
%config InlineBackend.figure_format = ‘retina’
# read the matadata csv
nifty_meta = pd.read_csv('data/nifty_meta.csv')
# get the ticker list with industry is equal to FINANCIAL SERVICES
tickers = list(nifty_meta[nifty_meta.Industry=='FINANCIAL SERVICES'].Symbol)
# start and end dates for backtesting
fromdate = datetime.datetime(2010, 1, 1)
todate = datetime.datetime(2020, 6, 15)
# read back the pricing data
prices = pd.read_csv('data/prices.csv', index_col=['ticker','date'], parse_dates=True)
# remove tickers where we have less than 10 years of data.
min_obs = 2520
nobs = prices.groupby(level='ticker').size()
keep = nobs[nobs>min_obs].index
prices = prices.loc[idx[keep,:], :]
# final tickers list
TICKERS = list(prices.index.get_level_values('ticker').unique())
# unstack and take close price
close = prices.unstack('ticker')['close'].sort_index()
close = close.dropna()
# train test split
train_close, test_close = train_test_split(close, test_size=0.5, shuffle=False)
# quick view of head and tail of train set
# Pearson correlation to get the basic idea about the relationship
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(train_close.pct_change().corr(method ='pearson'), ax=ax, cmap='coolwarm', annot=True, fmt=".2f") #spearman
ax.set_title('Assets Correlation Matrix')
plt.savefig('images/chart1', dpi=300)
# function to find cointegrated pairs
def find_cointegrated_pairs(data):
n = data.shape[1]
pvalue_matrix = np.ones((n, n))
keys = data.keys()
pairs = []
for i in range(n):
for j in range(i+1, n):
result = coint(data[keys[i]], data[keys[j]])
pvalue_matrix[i, j] = result[1]
if result[1] < 0.05:
pairs.append((keys[i], keys[j]))
return pvalue_matrix, pairs
# calculate p-values and plot as a heatmap
pvalues, pairs = find_cointegrated_pairs(train_close)
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(pvalues, xticklabels = train_close.columns,
yticklabels = train_close.columns, cmap = 'RdYlGn_r', annot = True, fmt=".2f",
mask = (pvalues >= 0.99))
ax.set_title('Assets Cointregration Matrix p-values Between Pairs')
plt.savefig('images/chart2', dpi=300)

Code commentary:

  1. We make the necessary imports of python libraries.
  2. Next, we read nifty_meta.csv and create a list of tickers where the industry is ‘FINANCIAL SERVICES’ and set start and end dates.
  3. Read daily pricing data from prices.csv and remove tickers where we have less than 10 years of data.
  4. Unstack the pricing dataframe to have only daily closing data and split into 50% each into the train and test datasets.
  5. Plot Pearson correlation of daily returns to get the basic idea about the relationship.
  6. Define a function find_cointegrated_pairs to find cointegrated pairs and corresponding p-values.
  7. Plot the heatmap of p-values.
Pearson Correlation of Daily Returns

Pearson correlation coefficient varies between +1 to -1 and is a linear measure of the relationship between two variables. The value +1 indicates a strong positive correlation, zero indicates no relationship, and -1 indicates a strong negative relationship. We can see in the above heatmap that there are multiple pairs with a strong positive correlation.

Assets Cointegration Matrix p-values Between Pairs

Let’s also analyze the result of the cointegration test. We can see in the above heatmap that there are many pairs with a p-value of less than 0.05. This means that for these pairs we can reject the null hypothesis and they can be cointegrated.

Perform Stationary test for the Selected Pair

Now, we have many candidates of pairs for the strategy where the p-value is less than 0.05. Selecting the right pair is of the utmost importance as the strategy will not work well if the prices are moving exactly together. They need to be diverging and mean-reverting for our strategy to be profitable.

Let’s go with tickers BANKBARODA and SBIN and further test the stationarity of spread using the Augmented Dickey-Fuller test. It is important that the spread is stationary. A time series is considered stationary if parameters such as mean and variance do not change over time and there is no unit root. We will first calculate the hedge ratio between these two tickers using OLS regression. Then, using the hedge ratio, we will calculate the spread and run the Augmented Dickey-Fuller test.

Let’s go through the code:

# final pair to test strategy
asset1 = ‘BANKBARODA’
asset2 = ‘SBIN’
# create a train dataframe of 2 assets
train = pd.DataFrame()
train['asset1'] = train_close[asset1]
train['asset2'] = train_close[asset2]
# visualize closing prices
ax = train[['asset1','asset2']].plot(figsize=(12, 6), title = 'Daily Closing Prices for {} and {}'.format(asset1,asset2))
ax.set_ylabel("Closing Price")
plt.savefig('images/chart3', dpi=300)
# run OLS regression
model=sm.OLS(train.asset2, train.asset1).fit()
# print regression summary results
plt.rc('figure', figsize=(12, 7))
plt.text(0.01, 0.05, str(model.summary()), {'fontsize': 16}, fontproperties = 'monospace')
plt.subplots_adjust(left=0.2, right=0.8, top=0.7, bottom=0.1)
plt.savefig('images/chart4', dpi=300);
print('Hedge Ratio = ', model.params[0])# calculate spread
spread = train.asset2 - model.params[0] * train.asset1
# Plot the spread
ax = spread.plot(figsize=(12, 6), title = "Pair's Spread")
# conduct Augmented Dickey-Fuller test
adf = adfuller(spread, maxlag = 1)
print('Critical Value = ', adf[0])
# probablity critical values

Code commentary:

  1. Select asset1 as BANKBARODA and asset2 as SBIN.
  2. Create a dataframe of the above two stock’s closing prices using the training dataset and visualize it.
  3. Run OLS regression and get the slope coefficient which is also our hedge ratio.
  4. Calculate the spread and plot it for visualization.
  5. Run the Augmented Dickey-Fuller test to check the stationarity of the spread and presence of unit root.
Daily Closing Prices for the Pair

We can see from the above plot that closing prices between these two stocks move quite together.

OLS Regression Results
Pair’s Spread

A high value of R-square and near-zero p-value from OLS regression suggest a very high correlation between these two stocks. The spread looks stationary and the critical value from the Augmented Dickey-Fuller test is -3.459 which is less than the value at 1% (-3.435) significance level. Hence, we are able to reject the null hypothesis that spread has a unit root and can conclude that it is stationarity in nature.

Generate Trading Signals using z-score

We have used the training dataset until this point to finalize the stock pair for our strategy. Now onward we will be using the test dataset to ensure trading signal generation and backtesting is using out of sample dataset. We will use the z-score of the ratio between the two stock prices to generate trading signals and set the upper and lower thresholds. This will tell us how far a price is from the population mean value. If it is positive and the value is above the upper thresholds then the stock price is higher than the average price value. Therefore, its price is expected to go down hence we want to short (sell) this stock and long (buy) the other one.

Let’s go through the code:

# calculate z-score
def zscore(series):
return (series — series.mean()) / np.std(series)
# create a dataframe for trading signals
signals = pd.DataFrame()
signals['asset1'] = test_close[asset1]
signals['asset2'] = test_close[asset2]
ratios = signals.asset1 / signals.asset2
# calculate z-score and define upper and lower thresholds
signals['z'] = zscore(ratios)
signals['z upper limit'] = np.mean(signals['z']) + np.std(signals['z'])
signals['z lower limit'] = np.mean(signals['z']) - np.std(signals['z'])
# create signal - short if z-score is greater than upper limit else long
signals['signals1'] = 0
signals['signals1'] =[signals['z'] > \
signals['z upper limit'], signals['z'] < signals['z lower limit']], [-1, 1], default=0)
# we take the first order difference to obtain portfolio position in that stock
signals['positions1'] = signals['signals1'].diff()
signals['signals2'] = -signals['signals1']
signals['positions2'] = signals['signals2'].diff()
# verify datafame head and tail
# visualize trading signals and position
bx = fig.add_subplot(111)
bx2 = bx.twinx()
#plot two different assets
l1, = bx.plot(signals['asset1'], c='#4abdac')
l2, = bx2.plot(signals['asset2'], c='#907163')
u1, = bx.plot(signals['asset1'][signals['positions1'] == 1], lw=0, marker='^', markersize=8, c='g',alpha=0.7)

Code commentary:

  1. Define the function zscore to calculate z-score.
  2. Create a signals dataframe of our two stocks with the closing price from the testing dataset and calculate their price ratio.
  3. Calculate z-score for the ratio and define upper and lower thresholds with plus and minus one standard deviation.
  4. Create a signal column with the following logic — If z-score is greater than the upper threshold than we will have -1 (short signal) however if z-score is less than the lower threshold than +1 (long signal) and the default is zero for no signal.
  5. Take the first-order difference of the signal column to obtain the stock position. If it is +1 then we are long, -1 then short and 0 if no position.
  6. The second signal will be just opposite to first which means we go long on one stock and simultaneously short on the other one. Similarly, take the first-order difference for the second signal and calculate the second position column.
  7. Next, we visualize both the stock prices along with its long and short positions in the portfolio.
Pair Trading — Trading Signals and Position

Portfolio Profit and Loss Calculation

We will start with an initial capital of 100,000 and calculate the maximum number of shares position for each stock using the initial capital. On any given day, total profit and loss from the first stock will be total holding in that stock and cash position for that stock. Similarly, profit and loss for the second stock will be total holding in the stock and cash for that stock. You need to keep in mind that we have a market neutral position which means we go long and short simultaneously with approximately the same capital. Finally, to get the total profit and loss we have to aggregate these two. Based on the position for the stock 1 and 2, we calculate their respective daily returns. We will also add a z-score column with upper and lower threshold columns to the final portfolio dataframe for visualization purposes.

Let’s jump to the code:

# initial capital to calculate the actual pnl
initial_capital = 100000

# shares to buy for each position
positions1 = initial_capital// max(signals['asset1'])
positions2 = initial_capital// max(signals['asset2'])

# since there are two assets, we calculate each asset Pnl
# separately and in the end we aggregate them into one portfolio
portfolio = pd.DataFrame()
portfolio['asset1'] = signals['asset1']
portfolio['holdings1'] = signals['positions1'].cumsum() * signals['asset1'] * positions1
portfolio['cash1'] = initial_capital - (signals['positions1'] * signals['asset1'] * positions1).cumsum()
portfolio['total asset1'] = portfolio['holdings1'] + portfolio['cash1']
portfolio['return1'] = portfolio['total asset1'].pct_change()
portfolio['positions1'] = signals['positions1']
# pnl for the 2nd asset
portfolio['asset2'] = signals['asset2']
portfolio['holdings2'] = signals['positions2'].cumsum() * signals['asset2'] * positions2
portfolio['cash2'] = initial_capital - (signals['positions2'] * signals['asset2'] * positions2).cumsum()
portfolio['total asset2'] = portfolio['holdings2'] + portfolio['cash2']
portfolio['return2'] = portfolio['total asset2'].pct_change()
portfolio['positions2'] = signals['positions2']

# total pnl and z-score
portfolio['z'] = signals['z']
portfolio['total asset'] = portfolio['total asset1'] + portfolio['total asset2']
portfolio['z upper limit'] = signals['z upper limit']
portfolio['z lower limit'] = signals['z lower limit']
portfolio = portfolio.dropna()

# plot the asset value change of the portfolio and pnl along with z-score
fig = plt.figure(figsize=(14,6),)
ax = fig.add_subplot(111)
ax2 = ax.twinx()
l1, = ax.plot(portfolio['total asset'], c='g')
l2, = ax2.plot(portfolio['z'], c='black', alpha=0.3)
b = ax2.fill_between(portfolio.index,portfolio['z upper limit'],\
portfolio['z lower limit'], \
ax.set_ylabel('Asset Value')
ax2.set_ylabel('Z Statistics',rotation=270)
plt.title('Portfolio Performance with Profit and Loss')
plt.legend([l2,b,l1],['Z Statistics',
'Z Statistics +-1 Sigma',
'Total Portfolio Value'],loc='upper left');
plt.savefig('images/chart8', dpi=300);

# calculate CAGR
final_portfolio = portfolio['total asset'].iloc[-1]
delta = (portfolio.index[-1] - portfolio.index[0]).days
print('Number of days = ', delta)
returns = (final_portfolio/initial_capital) ** (YEAR_DAYS/delta) - 1
print('CAGR = {:.3f}%' .format(returns * 100))

Code commentary:

  1. We start with the initial capital of 100,000 and calculate the number of shares to buy for each stock.
  2. Next, we calculate holding in the first stock by taking the cumulative sum of its position multiplied by stock price and the total number of shares.
  3. We then calculate the cash position by subtracting holding from the initial cash position. The total position of stock in the portfolio is the sum of cash plus holding.
  4. Calculate daily returns using total stock position.
  5. Next, we follow the steps 1 to 4 for the second stock and sum up two asset’s positions for the total portfolio value.
  6. Add the z-score with upper and lower thresholds for visualization.
  7. Visualize the portfolio performance along with z-score, upper, and lower thresholds.
  8. Calculate portfolio CAGR.
Portfolio Performance with Profit and Loss

The Compounded Annual Growth Rate (CAGR) for the strategy is 16.5% which looks promising however there are many things to consider before we draw any conclusion. Few important factors to account for are as follows:

  1. As this is a market-neutral approach a lot depends on our ability to short sell which may be limited due to various reasons including regulations.
  2. We have not accounted for costs related to trading, market slippage, and security borrowing. Normally, a market-neutral strategy results in a high number of trades.
  3. There is always a limitation of using historical data to forecast the future.

Let’s keep in mind that any decision to implement a strategy should be based only after considering all the critical performance parameters including its feasibility and returns net of fee and charges.

Happy investing and do leave your comments on the article!

Please Note: This analysis is only for educational purposes and the author is not liable for any of your investment decisions.


  1. Learn Algorithmic Trading: Build and deploy algorithmic trading systems and strategies using Python and advanced data analysis by Sebastien Donadio and Sourav Ghosh.
  3. Please check out my other articles/ posts on quantitative finance at my Linkedin page or on Medium.