# Statistical Arbitrage with Pairs Trading and Backtesting

Published in

--

Statistical Arbitrage (Stat Arb) are trading strategies that typically take advantage of either mean reversion in share prices or opportunities created by market microstructure anomalies. It is a highly quantitative analytical approach involving a well-diversified portfolio of securities. However, to illustrate the concept I will take the example of pairs trading involving only two cointegrated securities.

Pairs trading is a market-neutral strategy where we use statistical techniques to identify two stocks that are historically highly correlated with each other. When there is a deviation in the price relationship of these stocks, we expect this to be mean reverting and buy the underperforming stock and simultaneously sell the outperforming one. If our mean-reversion assumption is valid then prices should converge to long term average and trade should benefit. However, if the price divergence is not temporary and it is due to structural reasons then there is a high risk of losing the money. You can find the Python Notebook and data used in this article on my Github page. The overall approach followed in this article is mentioned below:

1. Stock Universe and Identification of Cointegrated Pairs of Stocks.
2. Perform Stationary test for the Selected Pair.
3. Generate Trading Signals using z-score.
4. Portfolio Profit and Loss Calculation.

Stock Universe and Identification of Cointegrated Pairs of Stocks

Our first step is to decide the stock universe and identify the pairs with high correlation. It is very important that this is based on economic relationships such as companies with similar businesses, else it might be spurious. I have taken all constituents of NSE-100 which are categorized as ‘FINANCIAL SERVICES’ companies. This gives us a list of 25 companies to start with. However, we filter out companies with less than 10 years of daily pricing data and are left with only the final 15 stocks. We take the daily closing price for these 15 stocks and split the dataframe into test and training sets. This is to ensure that our decision to select a cointegrated pair is based on the training dataset and backtesting is done using out of sample test dataset. As a first step, we will use the Pearson correlation coefficient to get the basic idea about the relationship between these stocks and then work to identify cointegrated stocks using the function `coint` form `statsmodels.tsa.stattools `. The function `coint,` will return p-values of the cointegration test for each pair. We will store these p-values in an array and visualize it as a heatmap. If a p-value is lower than 0.05, this means that we can reject the null hypothesis and the two time series of different symbols can be cointegrated.

Let’s jump right to the code:

`# make the necessary importsimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport datetimeidx = pd.IndexSliceimport statsmodels.api as smfrom statsmodels.regression.linear_model import OLSfrom statsmodels.tsa.stattools import adfullerfrom statsmodels.tsa.stattools import cointfrom sklearn.model_selection import train_test_split%matplotlib inline%config InlineBackend.figure_format = ‘retina’# read the matadata csvnifty_meta = pd.read_csv('data/nifty_meta.csv')nifty_meta.head(2)# get the ticker list with industry is equal to FINANCIAL SERVICEStickers = list(nifty_meta[nifty_meta.Industry=='FINANCIAL SERVICES'].Symbol)print(tickers)print(len(tickers))# start and end dates for backtestingfromdate = datetime.datetime(2010, 1, 1)todate = datetime.datetime(2020, 6, 15)# read back the pricing dataprices = pd.read_csv('data/prices.csv', index_col=['ticker','date'], parse_dates=True)prices.head(2)# remove tickers where we have less than 10 years of data.min_obs = 2520nobs = prices.groupby(level='ticker').size()keep = nobs[nobs>min_obs].indexprices = prices.loc[idx[keep,:], :]prices.info()# final tickers listTICKERS = list(prices.index.get_level_values('ticker').unique())print(len(TICKERS))print(TICKERS)# unstack and take close priceclose = prices.unstack('ticker')['close'].sort_index()close = close.dropna()close.head(2)# train test split train_close, test_close = train_test_split(close, test_size=0.5, shuffle=False)# quick view of head and tail of train settrain_close.head(2).append(train_close.tail(2))# Pearson correlation to get the basic idea about the relationshipfig, ax = plt.subplots(figsize=(10,7))sns.heatmap(train_close.pct_change().corr(method ='pearson'), ax=ax, cmap='coolwarm', annot=True, fmt=".2f") #spearmanax.set_title('Assets Correlation Matrix')plt.savefig('images/chart1', dpi=300)# function to find cointegrated pairsdef find_cointegrated_pairs(data):    n = data.shape[1]    pvalue_matrix = np.ones((n, n))    keys = data.keys()    pairs = []    for i in range(n):        for j in range(i+1, n):            result = coint(data[keys[i]], data[keys[j]])            pvalue_matrix[i, j] = result[1]            if result[1] < 0.05:                pairs.append((keys[i], keys[j]))    return pvalue_matrix, pairs# calculate p-values and plot as a heatmappvalues, pairs = find_cointegrated_pairs(train_close)print(pairs)fig, ax = plt.subplots(figsize=(10,7))sns.heatmap(pvalues, xticklabels = train_close.columns,                yticklabels = train_close.columns, cmap = 'RdYlGn_r', annot = True, fmt=".2f",                mask = (pvalues >= 0.99))ax.set_title('Assets Cointregration Matrix p-values Between Pairs')plt.tight_layout()plt.savefig('images/chart2', dpi=300)`

Code commentary:

1. We make the necessary imports of python libraries.
2. Next, we read `nifty_meta.csv` and create a list of tickers where the industry is ‘FINANCIAL SERVICES’ and set start and end dates.
3. Read daily pricing data from `prices.csv` and remove tickers where we have less than 10 years of data.
4. Unstack the pricing dataframe to have only daily closing data and split into 50% each into the train and test datasets.
5. Plot Pearson correlation of daily returns to get the basic idea about the relationship.
6. Define a function `find_cointegrated_pairs` to find cointegrated pairs and corresponding p-values.
7. Plot the heatmap of p-values.

Pearson correlation coefficient varies between +1 to -1 and is a linear measure of the relationship between two variables. The value +1 indicates a strong positive correlation, zero indicates no relationship, and -1 indicates a strong negative relationship. We can see in the above heatmap that there are multiple pairs with a strong positive correlation.

Let’s also analyze the result of the cointegration test. We can see in the above heatmap that there are many pairs with a p-value of less than 0.05. This means that for these pairs we can reject the null hypothesis and they can be cointegrated.

Perform Stationary test for the Selected Pair

Now, we have many candidates of pairs for the strategy where the p-value is less than 0.05. Selecting the right pair is of the utmost importance as the strategy will not work well if the prices are moving exactly together. They need to be diverging and mean-reverting for our strategy to be profitable.

Let’s go with tickers BANKBARODA and SBIN and further test the stationarity of spread using the Augmented Dickey-Fuller test. It is important that the spread is stationary. A time series is considered stationary if parameters such as mean and variance do not change over time and there is no unit root. We will first calculate the hedge ratio between these two tickers using OLS regression. Then, using the hedge ratio, we will calculate the spread and run the Augmented Dickey-Fuller test.

Let’s go through the code:

`# final pair to test strategyasset1 = ‘BANKBARODA’asset2 = ‘SBIN’# create a train dataframe of 2 assetstrain = pd.DataFrame()train['asset1'] = train_close[asset1]train['asset2'] = train_close[asset2]# visualize closing pricesax = train[['asset1','asset2']].plot(figsize=(12, 6), title = 'Daily Closing Prices for {} and {}'.format(asset1,asset2))ax.set_ylabel("Closing Price")ax.grid(True);plt.savefig('images/chart3', dpi=300)# run OLS regressionmodel=sm.OLS(train.asset2, train.asset1).fit()# print regression summary resultsplt.rc('figure', figsize=(12, 7))plt.text(0.01, 0.05, str(model.summary()), {'fontsize': 16}, fontproperties = 'monospace')plt.axis('off')plt.tight_layout()plt.subplots_adjust(left=0.2, right=0.8, top=0.7, bottom=0.1)plt.savefig('images/chart4', dpi=300);print('Hedge Ratio = ', model.params[0])# calculate spreadspread = train.asset2 - model.params[0] * train.asset1# Plot the spreadax = spread.plot(figsize=(12, 6), title = "Pair's Spread")ax.set_ylabel("Spread")ax.grid(True);# conduct Augmented Dickey-Fuller testadf = adfuller(spread, maxlag = 1)print('Critical Value = ', adf[0])# probablity critical valuesprint(adf[4])`

Code commentary:

1. Select asset1 as BANKBARODA and asset2 as SBIN.
2. Create a dataframe of the above two stock’s closing prices using the training dataset and visualize it.
3. Run OLS regression and get the slope coefficient which is also our hedge ratio.
4. Calculate the spread and plot it for visualization.
5. Run the Augmented Dickey-Fuller test to check the stationarity of the spread and presence of unit root.

We can see from the above plot that closing prices between these two stocks move quite together.

A high value of R-square and near-zero p-value from OLS regression suggest a very high correlation between these two stocks. The spread looks stationary and the critical value from the Augmented Dickey-Fuller test is -3.459 which is less than the value at 1% (-3.435) significance level. Hence, we are able to reject the null hypothesis that spread has a unit root and can conclude that it is stationarity in nature.

We have used the training dataset until this point to finalize the stock pair for our strategy. Now onward we will be using the test dataset to ensure trading signal generation and backtesting is using out of sample dataset. We will use the z-score of the ratio between the two stock prices to generate trading signals and set the upper and lower thresholds. This will tell us how far a price is from the population mean value. If it is positive and the value is above the upper thresholds then the stock price is higher than the average price value. Therefore, its price is expected to go down hence we want to short (sell) this stock and long (buy) the other one.

Let’s go through the code:

`# calculate z-scoredef zscore(series): return (series — series.mean()) / np.std(series)# create a dataframe for trading signalssignals = pd.DataFrame()signals['asset1'] = test_close[asset1] signals['asset2'] = test_close[asset2]ratios = signals.asset1 / signals.asset2# calculate z-score and define upper and lower thresholdssignals['z'] = zscore(ratios)signals['z upper limit'] = np.mean(signals['z']) + np.std(signals['z'])signals['z lower limit'] = np.mean(signals['z']) - np.std(signals['z'])# create signal - short if z-score is greater than upper limit else longsignals['signals1'] = 0signals['signals1'] = np.select([signals['z'] > \                                 signals['z upper limit'], signals['z'] < signals['z lower limit']], [-1, 1], default=0)# we take the first order difference to obtain portfolio position in that stocksignals['positions1'] = signals['signals1'].diff()signals['signals2'] = -signals['signals1']signals['positions2'] = signals['signals2'].diff()# verify datafame head and tailsignals.head(3).append(signals.tail(3))# visualize trading signals and positionfig=plt.figure(figsize=(14,6))bx = fig.add_subplot(111)   bx2 = bx.twinx()#plot two different assetsl1, = bx.plot(signals['asset1'], c='#4abdac')l2, = bx2.plot(signals['asset2'], c='#907163')u1, = bx.plot(signals['asset1'][signals['positions1'] == 1], lw=0, marker='^', markersize=8, c='g',alpha=0.7)`

Code commentary:

1. Define the function `zscore `to calculate z-score.
2. Create a signals dataframe of our two stocks with the closing price from the testing dataset and calculate their price ratio.
3. Calculate z-score for the ratio and define upper and lower thresholds with plus and minus one standard deviation.
4. Create a signal column with the following logic — If z-score is greater than the upper threshold than we will have -1 (short signal) however if z-score is less than the lower threshold than +1 (long signal) and the default is zero for no signal.
5. Take the first-order difference of the signal column to obtain the stock position. If it is +1 then we are long, -1 then short and 0 if no position.
6. The second signal will be just opposite to first which means we go long on one stock and simultaneously short on the other one. Similarly, take the first-order difference for the second signal and calculate the second position column.
7. Next, we visualize both the stock prices along with its long and short positions in the portfolio.

Portfolio Profit and Loss Calculation

We will start with an initial capital of 100,000 and calculate the maximum number of shares position for each stock using the initial capital. On any given day, total profit and loss from the first stock will be total holding in that stock and cash position for that stock. Similarly, profit and loss for the second stock will be total holding in the stock and cash for that stock. You need to keep in mind that we have a market neutral position which means we go long and short simultaneously with approximately the same capital. Finally, to get the total profit and loss we have to aggregate these two. Based on the position for the stock 1 and 2, we calculate their respective daily returns. We will also add a z-score column with upper and lower threshold columns to the final portfolio dataframe for visualization purposes.

`# initial capital to calculate the actual pnlinitial_capital = 100000# shares to buy for each positionpositions1 = initial_capital// max(signals['asset1'])positions2 = initial_capital// max(signals['asset2'])# since there are two assets, we calculate each asset Pnl # separately and in the end we aggregate them into one portfolioportfolio = pd.DataFrame()portfolio['asset1'] = signals['asset1']portfolio['holdings1'] = signals['positions1'].cumsum() * signals['asset1'] * positions1portfolio['cash1'] = initial_capital - (signals['positions1'] * signals['asset1'] * positions1).cumsum()portfolio['total asset1'] = portfolio['holdings1'] + portfolio['cash1']portfolio['return1'] = portfolio['total asset1'].pct_change()portfolio['positions1'] = signals['positions1']# pnl for the 2nd assetportfolio['asset2'] = signals['asset2']portfolio['holdings2'] = signals['positions2'].cumsum() * signals['asset2'] * positions2portfolio['cash2'] = initial_capital - (signals['positions2'] * signals['asset2'] * positions2).cumsum()portfolio['total asset2'] = portfolio['holdings2'] + portfolio['cash2']portfolio['return2'] = portfolio['total asset2'].pct_change()portfolio['positions2'] = signals['positions2']# total pnl and z-scoreportfolio['z'] = signals['z']portfolio['total asset'] = portfolio['total asset1'] + portfolio['total asset2']portfolio['z upper limit'] = signals['z upper limit']portfolio['z lower limit'] = signals['z lower limit']portfolio = portfolio.dropna()# plot the asset value change of the portfolio and pnl along with z-scorefig = plt.figure(figsize=(14,6),)ax = fig.add_subplot(111)ax2 = ax.twinx()l1, = ax.plot(portfolio['total asset'], c='g')l2, = ax2.plot(portfolio['z'], c='black', alpha=0.3)b = ax2.fill_between(portfolio.index,portfolio['z upper limit'],\                portfolio['z lower limit'], \                alpha=0.2,color='#ffb48f')ax.set_ylabel('Asset Value')ax2.set_ylabel('Z Statistics',rotation=270)ax.yaxis.labelpad=15ax2.yaxis.labelpad=15ax.set_xlabel('Date')ax.xaxis.labelpad=15plt.title('Portfolio Performance with Profit and Loss')plt.legend([l2,b,l1],['Z Statistics',                      'Z Statistics +-1 Sigma',                      'Total Portfolio Value'],loc='upper left');plt.savefig('images/chart8', dpi=300);# calculate CAGRfinal_portfolio = portfolio['total asset'].iloc[-1]delta = (portfolio.index[-1] - portfolio.index[0]).daysprint('Number of days = ', delta)YEAR_DAYS = 365returns = (final_portfolio/initial_capital) ** (YEAR_DAYS/delta) - 1print('CAGR = {:.3f}%' .format(returns * 100))`

Code commentary:

1. We start with the initial capital of 100,000 and calculate the number of shares to buy for each stock.
2. Next, we calculate holding in the first stock by taking the cumulative sum of its position multiplied by stock price and the total number of shares.
3. We then calculate the cash position by subtracting holding from the initial cash position. The total position of stock in the portfolio is the sum of cash plus holding.
4. Calculate daily returns using total stock position.
5. Next, we follow the steps 1 to 4 for the second stock and sum up two asset’s positions for the total portfolio value.
6. Add the z-score with upper and lower thresholds for visualization.
7. Visualize the portfolio performance along with z-score, upper, and lower thresholds.
8. Calculate portfolio CAGR.

The Compounded Annual Growth Rate (CAGR) for the strategy is 16.5% which looks promising however there are many things to consider before we draw any conclusion. Few important factors to account for are as follows:

1. As this is a market-neutral approach a lot depends on our ability to short sell which may be limited due to various reasons including regulations.
2. We have not accounted for costs related to trading, market slippage, and security borrowing. Normally, a market-neutral strategy results in a high number of trades.
3. There is always a limitation of using historical data to forecast the future.

Let’s keep in mind that any decision to implement a strategy should be based only after considering all the critical performance parameters including its feasibility and returns net of fee and charges.