How to Import Indian Equities Data into zipline on Your Local Machine

Sabir Jana, CFA
Published in Analytics Vidhya · Jun 7, 2020 · 10 min read

zipline is probably the most mature and powerful open-source backtesting and live trading engine. It is developed, maintained, and used in production by Quantopian Inc., a Boston-based crowd-sourced technology and asset management firm. Anyone can create a free account on their website to develop and backtest strategies with rich fundamental and minute-level data.

zipline is the backtesting engine behind their website, but it can also be downloaded and installed locally. Local installation of zipline is a little complex, so refer to the zipline documentation on how to perform the setup. Using their website, with its freely available, rich historical data, is much easier than running strategies locally; however, everything done on the website has to stay on the website.

With a local zipline setup, you need to supply your own data; this is called ingesting a custom bundle. There is some help available on setting up data for a single asset, but very little on multiple stocks. Andreas F. Clenow explains the process of setting up a custom bundle very well in his exceptional book ‘Trading Evolved: Anyone can Build Killer Trading Strategies in Python’. However, there are many challenges in doing this for Indian equities, especially with free Quandl data.

In this article, I will go through the step-by-step process of creating your own custom bundle for the S&P BSE 500 constituents, assuming you have already created a virtual environment named env_zipline and installed zipline in it on your local machine.

The following steps are involved in ingesting and testing your own custom bundle:

  1. Download data for the 500 stocks from Quandl and store it as individual .csv files.
  2. Clean up the data to match the ‘XBOM’ trading calendar and create a custom bundle.
  3. Ingest the last 15 years of daily OHLCV data for the S&P BSE 500 stocks.
  4. Test the newly created custom bundle by running a moving average crossover strategy on Infosys Ltd. stock.

Download data for the 500 stocks from Quandl and store it as individual .csv files

Quandl provides a free Bombay Stock Exchange (BSE) data feed containing end-of-day stock prices and indices from the BSE. The feed contains 4,566 time series, each corresponding to a stock or index. You need a free Quandl account to access it. You will also need the ticker symbols of the 500 constituents of the S&P BSE 500: go to https://www.bseindia.com/sensex/code/17 and download the ‘BSE 500index_Constituents.csv’ file. I assume you already have an Anaconda setup on your local machine; if not, please install it first. I have used a Jupyter notebook for this part, but you can use any IDE you are familiar with.

Install Quandl if you do not have it.

$pip install Quandl

The code in this section will perform the following tasks:

  1. Read the ‘BSE 500index_Constituents.csv’ file into a pandas DataFrame.
  2. Create a list of all 501 tickers of the S&P BSE 500 constituents.
  3. Download the daily pricing data for these tickers from Quandl.
  4. Filter the columns required for the custom bundle and rename them to ['open','high','low','close','volume'].
# Necessary imports
import os
import numpy as np
import pandas as pd
from pathlib import Path
import quandl

# Read the BSE 500 constituents csv file
bse500_metadata = pd.read_csv('BSE 500index_Constituents.csv')
bse500_metadata.head(2)

# Get all 500+ tickers
tickers = list(bse500_metadata['Scrip Code'])

# Input your Quandl key
quandl.ApiConfig.api_key = 'your_quandl_key_goes_here'

# Bulk-download each ticker in a loop and concatenate into one DataFrame
def get(tickers):
    def data(ticker):
        print('Processing...', f'BSE/BOM{ticker}')
        return quandl.get(f'BSE/BOM{ticker}')
    datas = map(data, tickers)
    return pd.concat(datas, keys=tickers, names=['ticker', 'date'])

df = get(tickers)

This will take some time, as it downloads data for all 501 tickers. The progress will be displayed as:

Processing... BSE/BOM523395
Processing... BSE/BOM541988

# Verify the head of the DataFrame
df.head(2)

# Filter the required columns from the DataFrame
columns = ['Open', 'High', 'Low', 'Close', 'No. of Trades']
prices = df[columns]

# Rename the columns as zipline expects
prices.columns = ['open', 'high', 'low', 'close', 'volume']
prices.head(2)

# Write a csv file for each ticker
os.makedirs('bse500', exist_ok=True)  # make sure the output folder exists
for ticker in tickers:
    ticker_df = prices.loc[ticker]
    ticker_df.to_csv(f'bse500/{ticker}.csv', index=True)

This will generate 501 .csv files in the bse500 folder, such as 500410.csv, 500488.csv, etc. Each file will look something like this:

date,open,high,low,close,volume
1991-01-02,1500.0,1500.0,1475.0,1475.0,0.0
1991-01-03,1475.0,1500.0,1350.0,1400.0,0.0
1991-01-07,1450.0,1475.0,1400.0,1450.0,0.0
1991-01-09,1475.0,1500.0,1450.0,1475.0,0.0
1991-01-11,1425.0,1500.0,1425.0,1500.0,0.0
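
Before building the bundle, it is worth spot-checking one of these files against the ‘XBOM’ calendar, since mismatched dates are exactly what the next step has to clean up. Here is a minimal sketch, assuming 500209.csv (the Infosys Ltd. scrip) is among the files you just wrote:

# Spot-check one csv file against the XBOM trading calendar
import pandas as pd
from trading_calendars import get_calendar

cal = get_calendar('XBOM')
df = pd.read_csv('bse500/500209.csv', index_col=0, parse_dates=True)

# Calendar sessions are tz-aware (UTC); drop the tz to compare
# against the naive dates parsed from the csv
sessions = cal.all_sessions.tz_localize(None)
off_calendar = df.index.difference(sessions)
print(f'{len(off_calendar)} rows fall outside the XBOM calendar')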

Clean up the data to match the ‘XBOM’ trading calendar and create a custom bundle

We need to write a Python script for this. I am going to use Anaconda Spyder, but you are free to choose any Python editor. This is a two-step process: first we write the bundle, then we register it.

Open a new Python script in the editor of your choice. We do the necessary imports and set the path where our .csv files reside.

# Necessary imports
import pandas as pd
from os import listdir
from trading_calendars import get_calendar

# The path to where you have your data (a raw string, so the
# backslashes are not treated as escape sequences)
path = r'C:\Users\sabirj\Desktop\P4Finance\bse500 bundle\bse500'

Next, we need to write the ingest function; in our case it is bse_data. This function is responsible for loading the data into memory and passing it to a set of writer objects provided by zipline, which convert the data to zipline’s internal format. The function performs the following tasks:

  1. Read all .csv files from the specified location.
  2. Prepare structures for the dividends, splits, and metadata DataFrames. Our data has no dividend or split information, but the structure is in place in case yours does.
  3. Check the valid trading dates for our calendar between start_session and end_session. These dates come from the second part, the bundle registration, where we specify ‘XBOM’, the trading calendar valid for both the BSE and NSE.
# The ingest function needs to have this exact signature
def bse_data(environ,
             asset_db_writer,
             minute_bar_writer,
             daily_bar_writer,
             adjustment_writer,
             calendar,
             start_session,
             end_session,
             cache,
             show_progress,
             output_dir):

    # Get the list of files from the path,
    # slicing off the last part:
    # 'example.csv'[:-4] = 'example'
    symbols = [f[:-4] for f in listdir(path)]

    if not symbols:
        raise ValueError("No symbols found in folder.")

    # Prepare an empty DataFrame for dividends
    divs = pd.DataFrame(columns=['sid',
                                 'amount',
                                 'ex_date',
                                 'record_date',
                                 'declared_date',
                                 'pay_date'])

    # Prepare an empty DataFrame for splits
    splits = pd.DataFrame(columns=['sid',
                                   'ratio',
                                   'effective_date'])

    # Prepare an empty DataFrame for metadata
    metadata = pd.DataFrame(columns=('start_date',
                                     'end_date',
                                     'auto_close_date',
                                     'symbol',
                                     'exchange'))

    # Check valid trading dates, according to the selected exchange calendar
    sessions = calendar.sessions_in_range(start_session, end_session)

    # Get data for all stocks and write it to zipline
    daily_bar_writer.write(
        process_stocks(symbols, sessions, metadata, divs)
    )

    # Write the metadata
    asset_db_writer.write(equities=metadata)

    # Write splits and dividends
    adjustment_writer.write(splits=splits,
                            dividends=divs)

Now we will write the generator function process_stocks, which performs the following tasks:

  1. Iterate over the .csv files, load the historical data, and populate a pandas DataFrame.
  2. For each ticker, keep only the rows that fall on valid trading calendar dates. This is a very important step, as holidays, extra trading days, and Muhurat trading sessions are not supported by the trading calendar. The filtering happens in one line: df = df[df.index.isin(sessions)].
  3. Yield the historical data needed by the daily bar writer, and populate the metadata and dividend DataFrames (if dividend data is supplied).
  4. Set ‘XBOM’ as the exchange field in the metadata.
# Generator function to iterate stocks, build historical data,
# metadata, and dividend data
def process_stocks(symbols, sessions, metadata, divs):
    # Loop the stocks, setting a unique Security ID (SID)
    for sid, symbol in enumerate(symbols):
        print('Loading {}...'.format(symbol))

        # Read the stock data from the csv file
        df = pd.read_csv('{}/{}.csv'.format(path, symbol),
                         index_col=[0], parse_dates=[0])

        # Filter data for valid trading calendar dates
        df = df[df.index.isin(sessions)]

        start_date = df.index[0]
        end_date = df.index[-1]

        # Synch to the official exchange calendar
        df = df.reindex(sessions.tz_localize(None))[start_date:end_date]

        # Forward fill missing data
        df.fillna(method='ffill', inplace=True)

        # Drop remaining NaN
        df.dropna(inplace=True)

        # The auto_close date is the day after the last trade
        ac_date = end_date + pd.Timedelta(days=1)

        # Add a row to the metadata DataFrame. Don't forget the exchange field.
        metadata.loc[sid] = start_date, end_date, ac_date, symbol, 'XBOM'

        # If there's dividend data, add it to the dividend DataFrame
        if 'dividend' in df.columns:
            # Slice off the days with dividends
            tmp = df[df['dividend'] != 0.0]['dividend']
            div = pd.DataFrame(data=tmp.index.tolist(), columns=['ex_date'])

            # Provide empty columns as we don't have this data for now
            div['record_date'] = pd.NaT
            div['declared_date'] = pd.NaT
            div['pay_date'] = pd.NaT

            # Store the dividends and set the Security ID
            div['amount'] = tmp.tolist()
            div['sid'] = sid

            # Start numbering where we left off last time
            ind = pd.Index(range(divs.shape[0], divs.shape[0] + div.shape[0]))
            div.set_index(ind, inplace=True)

            # Append this stock's dividends to the list of all dividends
            divs = divs.append(div)

        yield sid, df

Save the file as india_stock_data.py and move it to the ..\ProgramData\Anaconda3\envs\env_zipline\Lib\site-packages\zipline\data\bundles folder.
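
If you are unsure where that bundles folder lives on your machine, you can ask Python itself from within the activated environment. A quick sketch:

# Print the on-disk location of zipline's bundles package
# (run this inside the activated env_zipline environment)
import zipline.data.bundles as bundles
print(bundles.__path__[0])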

Next, we move on to the second part, the bundle registration. This is fairly simple and quick. Look for extension.py under the Users\<user>\.zipline folder, open it in the Python editor of your choice, and add the following code, which performs these tasks:

  1. Register bse_data as a new custom bundle.
  2. Specify the valid trading calendar dates between start_session and end_session, which the first part uses to filter our data.
# Necessary imports
import pandas as pd
from zipline.data.bundles import register, india_stock_data

start_session = pd.Timestamp('2005-01-03', tz='utc')
end_session = pd.Timestamp('2020-05-20', tz='utc')

register(
    'bse_data',
    india_stock_data.bse_data,
    calendar_name='XBOM',
    start_session=start_session,
    end_session=end_session
)

Ingest the last 15 years of daily OHLCV data for the S&P BSE 500 stocks

We are done with all the hard work, and now it is time to load the data! Go to the command prompt and run the commands below.

$conda activate env_zipline

$zipline ingest -b bse_data

This should start loading your data into zipline, and you should see something like this:

(env_zipline) PS C:\Users\sabirj> zipline ingest -b bse_data
Loading 500003...
Loading 500008...
Loading 500010...
Loading 500020...
Loading 500027...
Loading 500031...
Loading 500033...

Once data loading is complete, run the command below to check the bundles; bse_data should be on the list.

(env_zipline) PS C:\Users\sabirj> zipline bundles
bse_data 2020-06-06 07:16:23.009573
bse_data 2020-06-03 08:02:21.869497
bse_data 2020-06-03 08:01:47.696447
bse_data 2020-06-03 07:57:22.557573
bse_data 2020-06-03 07:50:10.828655
bse_data 2020-06-03 07:44:21.626445
bse_data 2020-06-03 07:35:54.149334
bse_data 2020-06-03 07:21:41.360368
bse_data 2020-06-03 07:15:00.297208
bse_data 2020-06-03 07:12:33.342635
bse_data 2020-06-01 05:27:41.900309
bse_data 2020-06-01 04:25:14.641486
csvdir 2020-05-30 15:48:52.453535
csvdir 2020-05-30 15:47:38.997242
csvdir 2020-05-30 15:44:16.304702
quandl 2020-05-01 09:02:42.886594
quandl 2020-05-01 07:56:14.848811
quandl 2020-05-01 06:39:10.659670
quandl 2020-05-01 06:38:00.607966
quandl 2020-05-01 06:32:41.798931
quandl 2020-05-01 06:29:42.382986
quandl 2020-05-01 06:26:05.043059
quantopian-quandl 2020-05-06 16:02:58.453567
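
As the listing shows, every ingestion run is kept on disk. If you want to prune older runs, zipline ships a clean subcommand; a sketch of its usage is below (check zipline clean --help for the exact options in your version):

$zipline clean -b bse_data --keep-last 1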

Test the newly created custom bundle by running a moving average crossover strategy on Infosys Ltd. stock

We are almost done! However, it is very important to test the newly created bundle. It happens often that a bundle is created successfully but throws an unexpected error when you try to use it in a backtest, and I want to make sure that is not the case with our custom bundle.
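
Before running the full backtest, a quick way to confirm the bundle loads and a symbol resolves is to read it back directly. Below is a minimal sketch; 500209 is the BSE scrip code of Infosys Ltd., and load_extensions makes sure the register() call in extension.py has been executed:

# Minimal sanity check: load the ingested bundle and look up one symbol
import os
from zipline.data.bundles import load
from zipline.utils.run_algo import load_extensions

# Execute ~/.zipline/extension.py so that bse_data is registered
load_extensions(default=True, extensions=[], strict=True, environ=os.environ)

# Load the most recent ingestion of the bundle
bundle_data = load('bse_data')

# Resolve the Infosys scrip code to a zipline Equity object
print(bundle_data.asset_finder.lookup_symbol('500209', as_of_date=None))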

We will test a very simple MA crossover strategy on Infosys Ltd. stock, using a 50-day short moving average and a 100-day long moving average. If the short moving average is above the long one, we buy 10 Infosys shares; if it falls below, we liquidate the position. We will use 2015-01-01 to 2020-05-20 as the start and end dates. This is just to test the newly ingested bundle, so don't judge the strategy by it. I will not be explaining the code for this part, so you may need to work through it yourself if anything is unfamiliar.

Here is the code.

# Necessary imports
%matplotlib inline
from zipline import run_algorithm
from zipline.api import order_target_percent, order_target, symbol
from datetime import datetime
import pytz
import matplotlib.pyplot as plt

# Set the start and end dates
start_date = datetime(2015, 1, 1, tzinfo=pytz.UTC)
end_date = datetime(2020, 5, 20, tzinfo=pytz.UTC)

def initialize(context):
    # Infosys Ltd. as the stock
    context.stock = symbol('500209')
    context.i = 0

def handle_data(context, data):
    # Skip the first 100 days to get full windows
    context.i += 1
    if context.i < 100:
        return

    # Compute the moving averages; data.history() returns a pandas Series
    # of the trailing 'price' values for the given bar count
    short_mavg = data.history(context.stock, 'price', bar_count=50, frequency="1d").mean()
    long_mavg = data.history(context.stock, 'price', bar_count=100, frequency="1d").mean()

    # Trading logic
    if short_mavg > long_mavg:
        # order_target orders as many shares as needed to
        # achieve the desired number of shares
        order_target(context.stock, 10)
    elif short_mavg < long_mavg:
        order_target(context.stock, 0)

def analyze(context, perf):
    fig = plt.figure(figsize=(12, 8))

    # First chart: equity curve
    ax = fig.add_subplot(311)
    ax.set_title('Strategy Results')
    ax.plot(perf['portfolio_value'], linestyle='-',
            label='Equity Curve', linewidth=1.0)
    ax.legend()
    ax.grid(False)

    # Second chart: exposure
    ax = fig.add_subplot(312)
    ax.plot(perf['gross_leverage'],
            label='Exposure', linestyle='-', linewidth=1.0)
    ax.legend()
    ax.grid(True)

    # Third chart: daily returns
    ax = fig.add_subplot(313)
    ax.plot(perf['returns'], label='Returns', linestyle='-.', linewidth=1.0)
    ax.legend()
    ax.grid(True)
    plt.savefig('strategy', dpi=400)

# Fire off the backtest
results = run_algorithm(
    start=start_date,
    end=end_date,
    initialize=initialize,
    analyze=analyze,
    handle_data=handle_data,
    capital_base=10000,
    data_frequency='daily',
    bundle='bse_data'
)
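
run_algorithm returns the daily performance as a pandas DataFrame, so a couple of quick checks on it complement the chart nicely:

# Inspect the tail of the performance DataFrame and the final equity
print(results[['portfolio_value', 'returns']].tail())
print('Final portfolio value:', results['portfolio_value'].iloc[-1])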

If the newly ingested bundle is working properly, you should get the three-panel chart produced by analyze: the equity curve, exposure, and returns.

That is it for now! I have shared all the required code; simply copy and paste it from this article and it should work. However, if required, please feel free to contact me. Enjoy your backtesting with zipline!

Please find the code related to this article on my GitHub page.

References:

  1. Andreas F. Clenow, ‘Trading Evolved: Anyone can Build Killer Trading Strategies in Python’.
  2. zipline documentation.
