Market data pricing: part 1 of many

Allison Bishop
Allison Bishop
Published in Proof Reading
Apr 4, 2019

Everyone these days has an opinion about the price of US equities market data. Your broker thinks it’s too high. Your stock exchange thinks it’s fair and balanced. Your regulator thinks it’s unconvincingly justified. Your grandmother thinks it’s probably a scam.

Before we try to sort out who is right and who is wrong, let’s answer the basic question that underlies it all: What is the price of market data?

Should be simple, right?

(Imagine evil laugh here.) It turns out this is the kind of question you pose to an intern you don’t like.

But since Proof Trading doesn’t have any interns, I went down this rabbit hole myself. In what is likely to be a substantial series of posts, I’d like to share with you what I found.

First, let’s break down this question of market data pricing into the dimensions we must specify in order to arrive at answers. All of these emerge from the many pages of policy documents that the exchanges produce to define their market data prices, e.g.:

https://nasdaqtrader.com/Trader.aspx?id=DPUSdata, https://www.nyse.com/publicdocs/nyse/data/NYSE_Market_Data_Pricing.pdf,

https://markets.cboe.com/us/equities/market_data_services/

(IEX is not included here because they do not charge market data fees.)

Which data product?

If you want to get your hands on US equities market data from the 13 exchanges or the SIPs (Securities Information Processors), the first thing to decide is which data products you want from which exchanges. Each exchange has a varied menu of options: do you just want data about trades? or just about quotes? or trades + quotes? If you want quotes, do you just want the top of book, or do you want full depth of book? Do you want data about imbalances leading into auctions? Do you want to know what the market makers are tweeting about? That last one is not a real product offered by exchanges — yet.

Who and what is the data for?

The data fees also vary by the use case of the data. Are you redistributing the data? Who are you redistributing it to? Are you displaying it on screens? How many eyeballs will look at those screens? How many of those eyeballs are professional and how many are non-professional? Are you using it for computations? How many servers are doing the computations? Are the computations for your own trading or for clients? Are you running a dark pool or pools? If so, how many? How many user accounts can access the data? How many of the user account passwords are dumb? (Again, that last one is not real, but there should be a fee for that.)

How are you receiving the data and who is charging the fee?

To use a data product that comes from an exchange that charges data fees, you will have to pay fees to the exchange and to whoever is providing you the data. Sometimes the entity providing you the data is also the exchange. But don’t think that gets you out of paying separate fees! For example, if you consume Nasdaq data directly from Nasdaq, or consume Nasdaq data through someone else but still in the raw data format that is output by Nasdaq, you will pay the “direct access” fees in addition to usage and other fees. If you get Nasdaq data through another data provider in a different, normalized format, you will pay the data provider some fee for that service, and also pay Nasdaq for data usage and other fees, but you won’t pay the direct access fee to Nasdaq. And the data provider will pay Nasdaq for their own consumption + being a redistributor. It is important to note that none of this is yet accounting for the physical connection, logical connection, and port fees that will be paid by whoever is directly consuming the data from the exchange.
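To make the access-path logic above concrete, here is a minimal sketch of how the choice of delivery path changes which fee categories a consumer owes an exchange. The fee names and the decision logic are illustrative placeholders drawn from the Nasdaq example, not a reproduction of any actual exchange's fee schedule.

```python
# Hypothetical sketch: which fee categories apply, given how the
# data is received. Fee names are illustrative, not real line items.

def applicable_fees(direct_from_exchange: bool, raw_format: bool) -> set[str]:
    """Return the fee categories a consumer owes the exchange."""
    fees = {"usage"}  # usage-style fees apply regardless of path
    if direct_from_exchange or raw_format:
        # Consuming directly, or via a third party but still in the
        # exchange's raw output format, triggers direct access fees.
        fees.add("direct_access")
    if direct_from_exchange:
        # Only the party physically connected to the exchange pays
        # connectivity and port fees.
        fees.update({"physical_connection", "port"})
    return fees

# Normalized data through a vendor: only usage-style fees are owed to
# the exchange (the vendor separately pays its own consumption and
# redistribution fees).
print(applicable_fees(direct_from_exchange=False, raw_format=False))
```

Note that the vendor's own bill to the exchange sits outside this function: whoever is directly connected picks up the connection and port fees, and typically passes that cost along in what it charges you.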

Putting it all together

So the high-level question of “what is the price of market data?” is actually a very long list of low-level questions like:

What is the market data bill that a company C owes to Exchange X for consuming data product D from Exchange X, obtained through data provider Y in format R, where the data touches Z servers belonging to company C, is accessible to W user accounts within company C, is used internally by company C for purposes A and B, is displayed for view by V human employees of company C, is re-distributed externally to E parties, with company C controlling the entitlements for F external professional users and G external non-professional users?

And that, my friends, is why your middle school math teachers built up your tolerance for complex word problems!

There are currently 13 possible values for X, and conditioned on each choice for X, there are typically > 4 possible values for D. The variables Y and R typically combine to have a binary effect on the bill due to exchange X: there are line item “access” or “direct access” fees that apply in full or not at all, depending on the variables Y and R.

The variables Z, W, and V are positive integers, while the variables A and B vary over values like “agency trading,” “proprietary trading,” “running a dark pool,” or “running two dark pools,” etc. The variables E, F, and G refer to entities that may be humans or companies.

All together, this gives us not one answer to the price of market data question, but a multi-dimensional shape of answers, parameterized by all of these variables.

As a former math teacher, I can feel you groaning.

And there are several more complicating factors that fall beyond the scope of these core variables. One of these factors is bundling of multiple data products within an exchange or exchange family. Sometimes this leads to savings compared to purchasing each product individually. Another important factor is time: all of these prices can and do change over time, as exchanges tweak prices of existing products, introduce new products, phase out old ones, or otherwise change their market data policies.

So really, we aren’t just looking at a multi-dimensional shape of answers, we are looking at a combinatorial explosion of multi-dimensional shapes of answers, further evolving along an axis of time.

Not to mention the potential confusion and messiness that arises from trying to map a real-world corporate operational structure and infrastructure onto somewhat ambiguous units. Many of the market data policies refer to “devices,” “servers,” “users,” etc. in contexts that aren’t entirely clear. For display fees, for example, if a single human user has a single entitled account on a single dedicated device, then words like “user,” “device,” and “account” can be used and counted interchangeably. But as technology evolves, this is becoming less common, and market data policies and pricing structures that were very clear for Bloomberg terminals become a lot less clear when a human user has, say, two devices, one dedicated, one shared with other users, and many accounts or entitlements that cumulatively enable some number of simultaneous accesses. Even a unit as seemingly clear as “servers” becomes a lot murkier in the context of a cloud infrastructure.

So what do people do with this?

When faced with this kind of multi-faceted complexity, a natural response is to tackle only the messiness of mapping your own company’s operations onto the variables above, plug in the specific values of those variables that are relevant to you, and extract the narrower answer that you need for the moment from the many pages of market data pricing documents. This is what I started doing: I began my market data research by tallying up only the fees for use cases that I expected to be relevant to Proof’s day one trading plans. Presumably this is what many participants in the financial ecosystem are doing: they keep track of exactly the market data policies that affect their bill, go back to the broader policy documents only when something changes, either in their usage or in the policies themselves, and compute only the delta that allows them to understand their new bill.

Across the industry, this means a lot of duplication of work. A lot of individual people and departments within individual firms are spending human energy deciphering the same complex policies in order to tease out the implications that directly apply to them. And the result they get is framed in their specific business case, so they are not in a good position to share what they’ve learned with others without revealing information about their business that they’d prefer not to reveal.

A notable exception is the recent market data cost report written by Adrian Facini et al. from the perspective of IEX:

https://iextrading.com/docs/The Cost of Exchange Services.pdf

This report details the costs that IEX pays to the other exchanges for market data use, and compares them to the costs that IEX incurs to provide market data from its own exchange. The report represents a rare and valuable glimpse into how these fees accumulate for the products that IEX consumes and the use cases that it represents.

A broader approach

In this series of posts, we will aim to develop a broader view of how the variables interact to produce final market data costs. This will take us longer than the narrower computation of projecting what Proof’s market data costs are likely to be, but we feel it is worth it in order to make our work reusable, both to us as our choices and use cases evolve, and to the industry as a whole.

We are also working on a data visualization tool that will allow substitution of arbitrary values for the variables in the market data template question above, across various combinations of market data products. This will allow us (and you!) to explore the larger space of answers to market data pricing questions, and to see how the different variables interact to produce the final results. Our hope is that this can provide a more comprehensive and effective understanding of the market data landscape, and reduce the amount of duplicated work in tracking market data fees for specific business cases. In the long term, we’d like to extend this tool to have a dimension of time as well, allowing us to see how changes in market data fees and policies have affected costs across the differing use cases over time.
