Daniel Aisen
May 29

My very first project at IEX, back in 2012, was to build a transaction cost analysis (TCA) tool, an application that would consume order and trade data from an investment firm, line it up with corresponding market data, and produce metrics evaluating how that firm’s brokers performed on those trades. We had a client at the time who was interested in our help on this front and passed along a substantial set of data for us to examine. This type of analysis is important for two reasons:

  1. An investor needs to understand the quality of execution it receives through its brokers. For example, once an investor makes the decision to buy a security, can it actually make that purchase at current market levels, or does its demand push the price up? Even an outstanding stock picker will struggle to make money if its trades leave money on the table by incurring heavy slippage.
  2. Both the SEC and FINRA have best execution guidelines mandating that brokers and investment advisers take steps to ensure they do not incur undue costs when trading on behalf of their clients. The most common way for an investor or broker to defend its behavior on this front is to point to TCA reports that demonstrate reasonable trading outcomes versus expectations.

When I was assigned this project, I knew practically nothing. I was less than 3 years out of school, I had zero database experience, I had never performed data analysis, and I had never worked with web application frameworks. All I had was a limited amount of general programming experience and a bit of knowledge about the stock market. I started off by just trying to learn as much as I could on a variety of topics that seemed relevant. I took online courses on databases and web programming. I decided to blog about the experience — my first blog post ever — and wound up writing three in-depth articles about the project (1, 2, 3). It’s fun going back and reading about my thought process at the time.

I wound up spending about six months working on this project until we ultimately nixed it so that I could work on the core trading system. It wasn’t a total loss, however; I learned skills that proved very useful throughout my time at IEX, especially after I shifted roles from programmer to data scientist. Additionally, this project was the inspiration for a popular internal tool our UX team built about a year later.

Déjà vu at Proof

Fast forward 7 years: we were in a meeting with an investment firm a couple months ago, and they offered to share some of their raw execution data with us. They figured it might be useful for us to have real-world data as we design and build out our trading product, and if we were able to provide any interesting insights back to them, it could be a win-win. We gladly accepted the task. Unlike at IEX, where offering TCA was actually part of the original business plan, we don’t intend to go down this path at Proof. For a broker to provide TCA to its clients is a conflict of interest; what broker is ever going to highlight when they do a bad job? Nevertheless, we absolutely need to build out a robust research environment, and this seemed like a perfect use case to start with.

Selecting the right tools for the job

Development environment: Python (Jupyter + pandas + plotly)

The first step was to choose a toolkit in which to build this functionality. kdb+/q was the tick database and querying language of choice at IEX, and it is certainly the environment in which I’m most comfortable, but unfortunately kdb+ is prohibitively expensive for a firm at our stage.

I did have a limited amount of prior exposure to Python data science libraries (pandas, numpy), as well as Jupyter as an interactive development environment. Even before starting Proof, we were thinking this combination could be a good approach for our research endeavors, and it’s certainly a popular one; Python has an extremely robust open source data science ecosystem, including a great interactive graphing library called plotly. Given our positive prior impressions with the Python stack, this path seemed like the clear choice.

Historical market data: Polygon.io

The most important (and difficult) decision was landing on a good historical tick data provider; you can’t perform TCA without clean, accurate market data. Considering that delayed SIP data costs $21/month to redistribute, one would think that there would be a number of robust and reasonably priced options out there for historical tick data, but sadly this is not the case. It appears that most tick data providers have both high monthly fees and long minimum contracts. Here are the ones we considered:

  • NYSE Daily TAQ — NYSE’s Daily TAQ product is currently the gold standard in historical U.S. equities data. It is simply a normalized capture of the CTA SIP (which NYSE operates) and the UTP SIP operated by Nasdaq. NYSE charges $3000 a month for this data and they mandate a 12-month minimum contract, plus they charge an additional $500 per month of backfilled data. NYSE is also notorious for imposing heavy-handed restrictions on how data purchased through them may be used (again, the SIP is supposed to be public data!).
  • OneTick — OneTick is one of the largest players in this space and one of the first we considered. We initially thought of it as simply a market data vendor, but during our introductory call, they positioned themselves as a full data analysis environment — basically like kdb+/q with market data included out of the box. We considered their cloud product, which seemed like a solid fit for our needs. The pricing was much more reasonable than NYSE, but still with a year-minimum contract and a quickly scaling price point depending on how much data we wanted to access.
  • QuantGo/AlgoSeek — we also considered QuantGo and AlgoSeek, which are two different companies under the same umbrella. The Virtual Quant Lab product in particular looked interesting. It seems to be a hosted cloud instance where you access and pay for data on demand, but when we asked about it their sales rep actually pushed us away from using it. He recommended we “lease” historical flat files and pay a monthly subscription fee for continued usage rights (again, with a minimum contract length).
  • Nanex — I had once purchased a data set from Nanex for a one-off project, and I remembered it being reasonably priced and satisfying our needs, but it seems they may have raised their prices in recent years to be more in line with the competition.
  • Polygon.io — our dream was to find a much more economical option than the ones above, with excellent quality data and no long term contract, that was accessible through a modern API. When we stumbled upon Polygon, we thought it was too good to be true, and in some ways it was. Among other things, Polygon provides exactly what we were looking for: full historical US equities trade and quote data via a great API for $399/mo with no commitment. Amazing! We signed up right away, plugged in, and the data started flowing. Unfortunately however, Polygon was smack in the middle of a massive migration of its raw data source, and the data we needed was not yet available. That said, Polygon was great to work with to come to a solution — they prioritized the migration for part of the date range we needed and even added new functions to the API at our request. At the time of this writing, the full data we need still isn’t all there, but we had a great experience working with their partial data, and we’re optimistic that this will be a fantastic option going forward once the migration is complete!

Hosting / setting it all up: Google Cloud Platform

Once the tool set was selected, the next step was to decide whether to host the research environment locally or in the cloud. Even though it was only me working on this first project, our production research environment will be used by the whole team, so I decided to set it up as a JupyterHub instance in the cloud. I chose Google for this initial experiment, but the setup is very much cloud-agnostic — I think it would be straightforward to migrate the whole environment to another cloud provider in just an hour or two.

There’s a wonderful guide for setting up a JupyterHub instance in the cloud, with detailed specifics for each of the major providers, called Zero to JupyterHub with Kubernetes. It worked beautifully. I set up our environment with the datascience-notebook docker image, JupyterLab, and Google authentication.
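For reference, the GCP path through that guide boils down to just a few commands. This is only a sketch from memory: the cluster name, size, and zone below are made-up placeholders, and the chart's exact configuration values change across versions, so the guide itself is the authority.

```shell
# Create a small Kubernetes cluster on GKE (name/size/zone are placeholders)
gcloud container clusters create jhub-cluster \
    --num-nodes=2 --machine-type=n1-standard-2 --zone=us-east1-b

# Install JupyterHub via its Helm chart; config.yaml is where you select the
# single-user image (e.g. jupyter/datascience-notebook) and the authenticator
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm upgrade --install jhub jupyterhub/jupyterhub \
    --namespace jhub --create-namespace --values config.yaml
```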

Security: Google OAuth2 with 2FA / Google-managed encryption

For this project, I locked our JupyterHub instance behind a Google authentication wall with 2FA, with the encryption of our data managed by Google. This seems like a reasonable approach to me, but I suppose if we didn’t trust Google with this responsibility, a possible next step would be to manage the encryption ourselves.

Performing the data analysis

Enriching the execution data

Once the environment was set up, there were several steps to get the raw data analyzed and into report form. The data set was a flat file of over a million executions with the following key fields: date, timestamp, symbol, side, price, quantity, broker. In an ideal world, the raw data would also be populated with venue information and other useful fields like an add/remove flag, but at least the timestamps were to the millisecond. The first step of the analysis was to supplement this raw data with relevant market data from around the time of each trade. I chose two times to focus on: 100 milliseconds before the trade, to measure the stable state of the market prior to the trade, and 1 second afterward, to capture what happened next (i.e. the short-term markout).

I wrote code to iterate through the raw data file, pull out one day at a time, and then go fill-by-fill, querying Polygon for the bid and offer at each of those two times. This enrichment process took roughly 45 minutes per day of data, which is quite slow (Python is not known for its speed), but I was able to parallelize it to make it workable. Enriching the executions took by far the longest of any step of the analysis.

Here’s the code:

import pandas as pd
import requests
import datetime
import numpy as np
import multiprocessing

POLYGON_KEY = '[PRIVATE_KEY]'

#symbology adjustment for stocks like BRK/B
def normalizeSymbol(sym):
    return sym.replace('/', '.')

#pull one data point (e.g. nbbo) as of a specific time from Polygon.io
def getAsof(sym, ts, querytype='nbbo', limit='1', reverse='true'):
    sym = normalizeSymbol(sym)
    date = ts.strftime("%Y-%m-%d")
    timestamp = int(1000*ts.timestamp())
    return requests.get('https://api.polygon.io/v2/ticks/stocks/{}/{}/{}?timestamp={}&limit={}&reverse={}&apikey={}'.format(querytype, sym, date, timestamp, limit, reverse, POLYGON_KEY))

#populate one row of a dataframe with the bid/mid/ask from a specific time (given an offset in milliseconds)
def AddNbboTs(row, ts, offset=0):
    x = getAsof(row['symbol'], ts + (offset * datetime.timedelta(milliseconds=1)))
    if ('results_count' in x.json() and x.json()['results_count'] > 0):
        row['bid' + str(offset) + 'ms'] = x.json()['results'][0]['p']
        row['ask' + str(offset) + 'ms'] = x.json()['results'][0]['P']
        row['mid' + str(offset) + 'ms'] = 0.5*(row['bid' + str(offset) + 'ms'] + row['ask' + str(offset) + 'ms'])
    return row

#wrapper around AddNbboTs that specifically uses the timestamp field from that row (_ts)
def AddNbbo(row, offset=0):
    #note I added 5 hours to manually convert from EST -> UTC, but better would be to use pytz
    return AddNbboTs(row, datetime.datetime.strptime(row['_ts'], '%Y-%m-%d %H:%M:%S.%f') + datetime.timedelta(hours=5), offset)

#function to enrich one day of trades and write it to a file
def enrichDay2(x, date_string):
    x = x.apply(AddNbbo, offset=1000, axis=1)
    x = x.apply(AddNbbo, offset=-100, axis=1)
    x.to_csv(date_string[0:10] + '.csv', index=False)

#read in your raw data
df = pd.read_csv('raw_execution_data.csv')

#separate the raw data into selected individual days and start a separate process for each of them
processes = []
for d in df.order_date.unique()[120:125]: #select which days you want to enrich
    x = df[df['order_date'].str.startswith(d)]
    t = multiprocessing.Process(target=enrichDay2, args=(x, d))
    t.start()
    processes.append(t)
for p in processes:
    p.join()

Aggregating executions to perform order-level metrics

Once the enrichment was complete, everything else was surprisingly fast and easy. Next step: aggregate all those fills into coherent “parent orders.” For this data set, I made the assumption that all executions over the course of a single day, on the same side, executed by a single broker were part of the same parent order. I took the enriched daily execution files, combined them by month, and then analyzed one month at a time.

Unfortunately, because historical trade data wasn’t yet available in Polygon over my date range, I had to leave out a few standard metrics like participation rate and slippage vs. VWAP, but those would be straightforward to layer on in the future. Additionally, my initial data set was missing LastMkt information (i.e. on which venue each execution occurred). With historical trade data, I could have attempted to align the executions with the tape to guess the venues, but since neither was available I did not perform venue analysis as part of this project.

All of the order-level metrics I produced could be calculated from data already procured at this point, except one: 3-minute post-order reversion. Fortunately, that only required a single additional Polygon query per order.

Here are the key order level metrics I calculated:

  • Slippage vs. arrival: what was the overall price achieved on the order relative to the market at the time the order began trading? Did the order cause market impact that adversely affected its outcome? I think this is the single most important TCA metric, but unfortunately it’s super noisy because the market moves around for all sorts of reasons, not just as a result of your order.
  • 1-second trade-to-mid markouts: what happened during the 1-second immediately following the execution? Did you cross the spread and immediately push the stock away from you? Were you able to add liquidity during a stable market, or did you always get run over (indicating that you were always sitting at the back of the queue)? This is my favorite microstructure metric. It’s much cleaner and more precise than high-level metrics like slippage.
  • 3-minute reversion: what happened in the market during the next few minutes after the entire order was complete? For example, did a buy order unnecessarily prop up the market by putting a floor on the stock when it probably would have been better off executing with a lighter touch and allowing the market to first come in?
  • Passive/midpoint/aggressive fill breakdown: did the order typically pick up fills on the bid, inside the spread, or at the offer? Alternatively, did the order typically add liquidity or remove?
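To make these definitions concrete, here is a tiny synthetic example (all numbers invented) computing the markout and slippage metrics for a single two-fill buy order, using the same arithmetic as the analysis code in this section:

```python
import pandas as pd

# Hypothetical fills for one parent buy order (all values invented)
fills = pd.DataFrame({
    'LastQty':   [200, 300],          # shares per fill
    'LastPx':    [10.00, 10.01],      # execution prices
    'mid1000ms': [10.005, 10.005],    # midpoint 1 second after each fill
    'sideInt':   [1, 1],              # +1 = buy, -1 = sell
})

filled = fills['LastQty'].sum()                               # 500 shares
avg_px = (fills['LastQty'] * fills['LastPx']).sum() / filled  # 10.006

# 1-second markout in mils ($0.0001/share), share-weighted across fills;
# negative means the mid was, on average, worse 1 second after trading
markout = 10000 * (fills['LastQty'] * (fills['mid1000ms'] - fills['LastPx'])
                   * fills['sideInt']).sum() / filled         # -10 mils

# Slippage vs. arrival: average price relative to the arrival mid, signed by
# side so that a positive number is always a cost
arrival_mid = 9.995
slippage = 1 * (avg_px - arrival_mid)                         # 0.011 = 1.1 cents/share
```

Here the order paid 1.1 cents per share versus arrival, and its fills marked out at -10 mils on average, a hint of impact despite the first fill looking favorable on its own.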

Here’s the code:

def processFills(x):
    x['order_date'] = x['order_date'].str.replace(' 00:00:00.000', '')
    x['ts'] = pd.to_datetime(x['_ts'])
    x['ts2'] = x['ts'] + datetime.timedelta(hours=5) #again, using pytz would be better
    x['sideInt'] = x['side'].map({'buy':1, 'sell':-1, 'sellshort':-1})
    lp = x['LastPx']
    #aggressiveness: [-3,3] negative is passive, positive is aggressive; for a buy order,
    #-3: <bid, -2: bid, -1: between bid/mid, 0: mid, 1: between mid/ask, 2: ask, 3: >ask
    x['agg'] = (-3 + (lp >= x['bid-100ms']) + (lp > x['bid-100ms'])
                   + (lp >= x['mid-100ms']) + (lp > x['mid-100ms'])
                   + (lp >= x['ask-100ms']) + (lp > x['ask-100ms'])) * x['sideInt']
    return x

def getMid(symbol, ts, offset=0):
    x = getAsof(symbol, ts + (offset * datetime.timedelta(milliseconds=1)))
    if ('results_count' in x.json() and x.json()['results_count'] > 0):
        return 0.5*(x.json()['results'][0]['p'] + x.json()['results'][0]['P'])
    return -1

def add3mReversion(row):
    row['Reversion3m'] = round((1 if (row['side'] == 'buy') else -1) * (row['DepartureMid'] - getMid(row['symbol'], row['EndTime'] + datetime.timedelta(hours=5), 180000)), 4)
    return row

def processOrders(x):
    d = {}
    d['FilledShares'] = x['LastQty'].sum()
    d['AvgPrice'] = ((x['LastQty'] * x['LastPx']).sum() / d['FilledShares']).round(4)
    d['NumTrades'] = len(x)
    d['AvgTradeSize'] = (d['FilledShares'] / d['NumTrades']).round(1)
    d['StartTime'] = x['ts'].min()
    d['EndTime'] = x['ts'].max()
    #1-second markouts
    d['Markout1sMils'] = (10000*(x['LastQty'] * (x['mid1000ms'] - x['LastPx'])*x['sideInt']).sum() / d['FilledShares']).round(1)
    d['PctPassive'] = ((x['LastQty']*(x['agg']<0)).sum() / d['FilledShares']).round(3)
    d['PctMid'] = ((x['LastQty']*(x['agg']==0)).sum() / d['FilledShares']).round(3)
    d['PctAggressive'] = ((x['LastQty']*(x['agg']>0)).sum() / d['FilledShares']).round(3)
    d['ArrivalMid'] = (x['mid-100ms'].iloc[0]).round(4)
    d['DepartureMid'] = (x['mid-100ms'].iloc[-1]).round(4)
    sideInt = x['sideInt'].iloc[0]
    d['SlippageVsArrival'] = sideInt * (d['AvgPrice'] - d['ArrivalMid'])
    return pd.Series(d, index=['FilledShares', 'AvgPrice', 'NumTrades', 'AvgTradeSize', 'StartTime', 'EndTime', 'Markout1sMils', 'PctPassive', 'PctMid', 'PctAggressive', 'ArrivalMid', 'DepartureMid', 'SlippageVsArrival'])

def enrichAndSaveOrders(month):
    o = pd.read_csv(month + '.csv')
    o = processFills(o)
    o2 = o.groupby(['order_date','symbol','broker','side'], as_index=False).apply(processOrders)
    o2 = o2.reset_index()
    o2 = o2.apply(add3mReversion, axis=1)
    o2.to_csv(month + '-orders.csv', index=False)

month = '2018-06'
enrichAndSaveOrders(month)

Aggregating orders to perform broker-level metrics

The final data analysis step was to aggregate the order-level stats by broker, which pandas makes very straightforward.

def getOrderStatsByBroker(orders):
    orders['Notional'] = orders['FilledShares'] * orders['AvgPrice']
    #weighted averages that skip NaN values
    wavg_notional = lambda x: np.average(x[np.logical_not(np.isnan(x))], weights=orders.loc[x.index, 'Notional'][np.logical_not(np.isnan(x))])
    wavg_shares = lambda x: np.average(x[np.logical_not(np.isnan(x))], weights=orders.loc[x.index, 'FilledShares'][np.logical_not(np.isnan(x))])
    stats = orders.groupby(['broker', 'month']).agg({
        'FilledShares': np.sum, 'NumTrades': np.sum, 'Notional': np.sum,
        'SlippageVsArrival': wavg_shares, 'Markout1sMils': wavg_shares, 'Reversion3m': wavg_shares,
        'PctPassive': wavg_shares, 'PctMid': wavg_shares, 'PctAggressive': wavg_shares}).reset_index()
    stats['SlippageVsArrival'] = stats['SlippageVsArrival'].round(4)
    stats['Markout1sMils'] = stats['Markout1sMils'].round(1)
    stats['Reversion3m'] = stats['Reversion3m'].round(4)
    stats['PctPassive'] = stats['PctPassive'].round(3)
    stats['PctMid'] = stats['PctMid'].round(3)
    stats['PctAggressive'] = stats['PctAggressive'].round(3)
    stats['Notional'] = stats['Notional'].round(0)
    return stats

month = '2018-06'
fills = pd.read_csv(month + '.csv')
fills = processFills(fills)
orders = pd.read_csv(month + '-orders.csv')
stats = getOrderStatsByBroker(orders)

Presenting the results

Once the data was processed, all that was left was to visualize it. I used the graphing library plotly to build two reports: one for individual order exploration and the other as a high level broker comparison. Please note that all data presented here is randomized and purely for illustration purposes.

Individual order report

This single order report allows the user to pick out an order, pull the relevant market data, and graph the order’s fills alongside quotes in the market. It’s a great way to manually examine what happened while the order was trading. In cases where an order was “gamed” (e.g. a large midpoint peg order in a dark pool got detected by a predatory strategy that pushed the market away and then picked it off), a tool like this can illustrate what happened.

#takes ~2 minutes to install plotly; probably a good idea to just extend the docker image to install it automatically
!pip install plotly
!jupyter labextension install @jupyterlab/plotly-extension

import plotly.graph_objs as go
import plotly
import plotly.figure_factory as ff

o = orders.iloc[2221] #enter the order index here
DATE = o['order_date']
SYMBOL = o['symbol']
y = fills[(fills['order_date'].str.startswith(DATE)) & (fills['symbol'] == SYMBOL)]
firstTs = y['ts2'].min()
lastTs = y['ts2'].max()

#pull quotes from 5 minutes before the first fill through 5 minutes after the last
resp = getAsof(SYMBOL, firstTs - datetime.timedelta(minutes=5), reverse='false', limit='50000')
resp2 = [r for r in resp.json()['results'] if r['t'] < (lastTs + datetime.timedelta(minutes=5)).timestamp()*1000000000]
ts2 = [(datetime.datetime.fromtimestamp(r['t']/1e9) - datetime.timedelta(hours=5)) for r in resp2]
b2 = [r['p'] for r in resp2]
a2 = [r['P'] for r in resp2]

trace = go.Scatter(x=y['ts'], y=y['LastPx'],
    text=[(str(lq) + ' shares') for lq in y['LastQty']],
    mode='markers', name='fills', hoverinfo='text+y')
bid = go.Scatter(x=ts2, y=b2, mode='lines', name='bid', hoverinfo='y', marker=dict(color='#00ff99'))
ask = go.Scatter(x=ts2, y=a2, mode='lines', name='ask', hoverinfo='y', marker=dict(color='#ffaaaa'))
table_trace = go.Table(
    header=dict(values=['Field', 'Value'], line=dict(color='#000'), fill=dict(color='#111'),
                font=dict(color='#ffffff', size=11)),
    cells=dict(values=[o.keys().values, o.values], line=dict(color='#000'), fill=dict(color='#333'),
               font=dict(size=9)),
    domain=dict(x=[0, 0.25], y=[0, 1.0]))
layout = dict(width=1000, height=600,
    title=dict(text=(DATE[0:10] + ': ' + y['side'].iloc[0] + ' ' + str(y['LastQty'].sum()) + ' ' + SYMBOL),
               font=dict(color='#ffffff', size=16)),
    xaxis=dict(domain=[0.35, 1]), paper_bgcolor='#111', plot_bgcolor='#111',
    font=dict(color='#ffffff', family='Franklin Gothic'))

#assemble the figure and render it inline
fig = go.Figure(data=[trace, bid, ask, table_trace], layout=layout)
plotly.offline.iplot(fig)

Broker comparison report

Most TCA reports primarily compare high level broker performance. This simple report highlights a few of the most important metrics that I like to look at.

colors = ['rgba(0,255,125,1)', 'rgba(0,125,255,1)', 'rgba(255,0,125,1)', 'rgba(125,0,255,1)']
shares_by_broker = go.Pie(labels=stats['broker'], values=stats['FilledShares'], marker=dict(colors=colors),
    domain=dict(x=[0, 0.25], y=[0.7, 1]), hoverinfo='label+value+percent', textinfo='label+percent')
slippage = go.Bar(x=stats['broker'], y=stats['SlippageVsArrival'], xaxis='x1', marker=dict(color=colors), hoverinfo='x+y')
markouts = go.Bar(x=stats['broker'], y=stats['Markout1sMils'], xaxis='x2', yaxis='y2', marker=dict(color=colors), hoverinfo='x+y')
reversion = go.Bar(x=stats['broker'], y=stats['Reversion3m'], xaxis='x3', yaxis='y3', marker=dict(color=colors), hoverinfo='x+y')

#one aggressiveness pie per broker
aggCharts = [None] * len(colors)
for i in range(len(colors)):
    col = colors[i]
    aggCharts[i] = go.Pie(labels=['Passive', 'Mid', 'Aggressive'],
        values=[stats.iloc[i]['PctPassive'], stats.iloc[i]['PctMid'], stats.iloc[i]['PctAggressive']],
        marker=dict(colors=[str.replace(col, '1)', '0.7)'), str.replace(col, '1)', '0.85)'), col]),
        textfont=dict(color='#000000' if i == 0 else '#ffffff'),
        domain=dict(x=[0.25*i, 0.25*(i+1)], y=[0.27, 0.57]),
        hoverinfo='label+percent', textinfo='label+percent')

stats_table = go.Table(
    header=dict(values=list(stats.columns), line=dict(color='#000'), fill=dict(color='#111'),
                font=dict(color='#ffffff', size=11)),
    cells=dict(values=[stats.broker, stats.month, stats.FilledShares, stats.NumTrades, stats.Notional,
                       stats.SlippageVsArrival, stats.Markout1sMils, stats.Reversion3m,
                       stats.PctPassive, stats.PctMid, stats.PctAggressive],
               line=dict(color='#000'), fill=dict(color='#333'), font=dict(size=9)),
    domain=dict(x=[0, 1.0], y=[0, 0.2]))

layout = go.Layout(width=1000, height=800, title=dict(text='Broker Stats: ' + stats.iloc[0]['month']),
    yaxis=dict(domain=[0.7, 1]), yaxis2=dict(domain=[0.7, 1], anchor='x2'), yaxis3=dict(domain=[0.7, 1], anchor='x3'),
    xaxis4=dict(domain=[0, 0.25], title='Total Volume'),
    xaxis1=dict(domain=[0.3, 0.5], title='Slippage vs. Arrival'),
    xaxis2=dict(domain=[0.55, 0.75], title='1-s Markouts (mils)'),
    xaxis3=dict(domain=[0.8, 1], title='3-m Reversion'),
    paper_bgcolor='#111', plot_bgcolor='#111',
    font=dict(color='#ffffff', family='Franklin Gothic', size=10),
    annotations=[dict(x=0.08, y=0.637, text='Total Volume', xref='paper', yref='paper', showarrow=False, font=dict(size=12)),
                 dict(x=0.5, y=0.57, text='Trade Aggressiveness', xref='paper', yref='paper', showarrow=False, font=dict(size=12))])

data = [shares_by_broker, slippage, markouts, reversion] + aggCharts + [stats_table]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Conclusion / further extensions

This project was a proof of concept to test out pandas+JupyterHub+GCP as an equity trading research stack, plotly as a graphing library, and Polygon as a market data source, all of which worked very well. There are many possible extensions to layer on additional useful information such as:

  • Incorporate historical trade data to calculate stats like participation rate and slippage vs. vwap.
  • Perform venue analysis either by using venue information from the executing brokers or by matching the trades to the tape.
  • Filter the executions more granularly to better isolate specific market dynamics, for example by calculating trade markouts broken out by fill aggressiveness or add/remove flag.
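As a sketch of the first extension: participation rate and slippage vs. VWAP only require the consolidated trade tape over the order's lifetime. The function below is illustrative only; `vwap_metrics` and its column names ('price', 'size') are hypothetical, not part of the code above.

```python
import pandas as pd

def vwap_metrics(order_shares, order_avg_px, side_int, tape):
    """Illustrative sketch: 'tape' is assumed to hold the consolidated trades
    printed during the order's lifetime, with 'price' and 'size' columns."""
    tape_volume = tape['size'].sum()
    vwap = (tape['price'] * tape['size']).sum() / tape_volume

    # Participation rate: the order's share of total market volume
    participation = order_shares / tape_volume

    # Slippage vs. VWAP, signed by side so positive is always a cost
    slippage_vs_vwap = side_int * (order_avg_px - vwap)
    return participation, vwap, slippage_vs_vwap

# Invented example: a 500-share buy at avg price 10.02 while 2000 shares traded
tape = pd.DataFrame({'price': [10.00, 10.02], 'size': [1000, 1000]})
part, vwap, slip = vwap_metrics(500, 10.02, 1, tape)  # 25% participation, 1 cent worse than VWAP
```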

All in all, this project took a few weeks of work, with the main time sinks being market data provider selection and raw data enrichment. Not bad considering the six months I invested during my first foray into TCA in 2012. Plus, the resulting proof of concept this time around is far more coherent and useful (and polished!). This exercise has been a testament to the great strides taken in the Python data science ecosystem over the past several years. I’m not sure if we’ll eventually set up our production research environment exactly this way at Proof, but it certainly seems to be a viable approach!


Proof is a new institutional equities broker. Launching in 2020.
