Setting up a Tick Data Research Environment

Prerak Sanghvi
Published in Proof Reading · May 11, 2020

In a previous post, we discussed the characteristics desirable in the database serving our Algorithmic Trading System (currently for US Equities). We hinted then that while we don’t intend to use this trading system database for tick data analysis, some of the considerations do overlap, especially the scalability and performance of the tools in question. One key difference is that a tick data research environment typically only needs to house read-only non-proprietary data such as trades and quotes, as opposed to transactional proprietary data such as customer orders.

In this technical post, we expand upon these thoughts and explore what it takes to set up a productive tick data research environment.

TL;DR

A tick data research environment consists of much more than the database and analysis tools. More specifically, it involves the acquisition/capture and timely processing of tick and reference data into a cleansed, normalized, and complete data set. Given the cost and complexity of setting up these processes, and the significant performance and scalability requirements for such an environment, most small and even many large teams will benefit from using a hosted tick data service, such as OneTick Cloud.

Although multiple vendors offer hosted tick data, we believe OneTick has the right execution model, where computation is performed directly on the stored data. We think this model of computation has advantages over services that offer API-based access to raw data (or simple aggregations), which require the extraction of a large amount of underlying data to perform even the simplest analytics.

Background

So, what exactly is tick data? "Tick data" is financial jargon that simply refers to any type of published price data for securities. This can include trades in a security, published quotes, or other asset-class-specific data such as LULD bands. Within Equities, quote records may include:

  • an NBBO quote: the national best bid and offer
  • a Top-of-Book quote (aka Level 1): the best bid and offer displayed by a particular exchange
  • a Depth-of-Book price quote (aka Level 2): typically an individual order at a given price level at a particular exchange (either on the bid side or the offer side); sometimes all orders at a price level are aggregated into a single price-level record

Additionally, reference data related to the subject securities often accompanies the tick data. Some examples of reference data elements for Equity securities include Security Type, Primary Exchange, Lot Size, Industry Sector, Index memberships, or more obscure attributes such as the LULD Tier.
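To make the data elements above concrete, here is a sketch of what a normalized top-of-book quote record might look like once cleansed and loaded. The field names are our own illustration, not any vendor's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopOfBookQuote:
    """A normalized Level 1 quote record (illustrative field names only)."""
    symbol: str        # e.g., "AAPL"
    timestamp_ns: int  # exchange timestamp, nanoseconds since epoch
    exchange: str      # publishing exchange code
    bid_price: float
    bid_size: int      # shares (or round lots, depending on the feed)
    ask_price: float
    ask_size: int
    condition: str     # quote condition code

q = TopOfBookQuote("AAPL", 1589200000_000_000_000, "Q",
                   310.10, 200, 310.12, 100, "R")
spread = q.ask_price - q.bid_price
```

Reference data (security type, lot size, LULD tier, and so on) would typically live in a separate table keyed by symbol and date, and be joined to records like this at query time.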

Common Use Cases

A well-designed tick data research environment can serve a diverse set of use cases:

  • Market microstructure research
  • Algorithmic trading model development
  • Strategy back-testing
  • Trading signal development
  • Calculation of security stats often used in algorithmic trading strategies: Average Daily Trading Volume (ADTV), Volume Curve(s), Average Spread, Average Displayed Quote Size, Historical Volatility
  • Venue analysis, Sector analysis, Factor analysis
  • Regulatory and customer reporting, most commonly around Transaction Cost Analysis: This use case requires integration with transaction data sources such as customer orders and executions.

Why is a separate environment needed?

So why does this environment need to be distinct from the usual application database where you store your transactions? There are a few reasons:

  • The tick data research environment primarily houses read-only historical tick data (and its derived data) typically not generated by the trading or back-office systems.
  • This strictly historical data often goes back multiple years or even decades. Efficient storage and fast processing of on-disk data is the name of the game. There is very little need for the transactional capabilities of a database, and the hardware and software infrastructure is geared towards scaling storage and computations rather than response times.
  • The analysis use-cases are strictly batch-mode. The queries are generally long-running, with run times ranging from minutes in some cases to days in others.
  • The data loaded into this environment has generally been cleansed and normalized, and any gaps have either been explicitly identified or have been filled using alternate data sources. Even if the trading system database captures market data to some extent, it is typically on a best-efforts basis, where data may have been conflated or elided.
  • Given the accuracy, usability, and completeness requirements, the data is often sourced from vendors who specialize in the curation of such data. As an example, even if a firm subscribes to SIP market data for US Equities and records such market data itself, it may choose to populate its research environment using NYSE TAQ on a nightly basis.
  • This environment caters to a different set of users — computer scientists, data scientists, and data analysts. Considering the research slant of this environment, often, the use-cases are ad-hoc explorations rather than production systems or reports. This means that a rich and interactive user-interface is essential for this environment.

Requirements for our Environment

Below are some of the technical as well as non-technical requirements we have. We are primarily a US Equities shop, so this list is highly geared towards that:

  • The data should include historical trades, quotes, and reference data for NMS stocks (aka exchange-listed equity securities) going back at least two years, but ideally as much as fifteen years. We would be fine with just top-of-book quotes for now but would like to have the ability to include depth-of-book data at a later date.
  • The data should be substantially complete and include all of the usual data elements such as the condition codes on trades and quotes. Since we intend to use this environment to generate customer-facing reports and public-facing market structure commentary, the bar for the accuracy and reliability of this data is that much higher.
  • Automated daily loads of new data should not require any human effort on our part on an ongoing basis. Additionally, the data for the previous day should ideally be available by the market opening on T+1.
  • The analysis tools should be able to process enormous amounts of stored data. For example, just the last two years’ worth of quotes data would consist of about 400 billion records. We want the ability to perform aggregations, as-of joins, and custom statistical analysis.
  • As mentioned in a previous post (see Comparison to a full tick database), we believe that nearly all tick data analysis follows a fundamental pattern: the query engine needs to scan, filter, and aggregate vast amounts of data, only to produce a result set that is much smaller in relation to the amount of data scanned. For this reason, we prefer tools that bring the analytics code to the stored data, rather than wastefully transporting large amounts of raw data to an analytics server. While bringing compute to the stored data is more performant, and our preferred approach, a potential downside is the additional constraints this may impose on the infrastructure. If the computation is performed on multiple nodes, the data would need to be made accessible on each node, using a shared file system or by making copies of the data.
  • Since scientists are the primary intended users of this environment for us, we want to have rich tooling around the analysis process (e.g., Jupyter) so that the data can be accessed interactively, as well as in an automated fashion.
  • While this is not the primary use case, the tools within this environment should be able to tap into the transactional data hosted within our systems, such as orders and executions.
  • We do not want to spend a ton of money on this. We think that a tick database should be a commodity item in the industry by now, and there are few compelling reasons for a team to build one from scratch. While the analytics produced can undoubtedly be a differentiator for the team, the toolset itself is likely not, and only needs to be adequately sized and well-featured.

Possible Solutions

Traditional Tick Database

This involves setting up the appropriate software and hardware infrastructure, combined with a tick database tool such as the OneTick time-series database, kdb+, or DolphinDB. Let’s try to tally up what it would take to set this up.

  • Infrastructure: Depending on the tool’s performance and capabilities, we would need to set up either one or more large servers, along with ample disk storage. We would probably go with two m5.8xlarge instances on AWS (16 physical cores, 128GB RAM) along with 2 x 10TB of Throughput Optimized HDD volumes (or perhaps SSDs). These seem like large servers, but note that these are still only modestly sized servers for this use case (at least in terms of memory). We would be looking at a cost of ~$3000/mo. Of course, we can reduce costs by shutting off the machines when not in use, or losing the redundancy, or allocating disk space only as needed (each of these has trade-offs).
  • Data: Multiple vendors sell top-of-book tick data for US Equities, and while there are differences in the price and content of many offerings, two years of top-of-book data seems to cost a few hundred to a few thousand dollars a month. If we want a more extended history, we would be required to pay extra to back-populate the database.
  • Maintenance overhead: The daily loads would be automated, except when things fail. Generally speaking, the way this works is that data files from vendors become available overnight (either on an SFTP site or via AWS), and then those files get loaded into our database. The files are either in a human-readable file format (such as CSV) or a supported proprietary format (e.g., the binary format supported by kdb+ or OneTick). Many teams that follow this model have half to a full headcount dedicated to keeping this setup running.
  • Database software licensing costs: These alone are likely to be higher than what we want to pay for the total solution.
  • Depending on how well-versed the end-users on your team are with the specific technologies in question, they may need additional support when using this environment, which should be factored into the cost of this setup.

The biggest benefit of this approach is the flexibility it affords. If a team has specialized needs for how data should be stored or indexed, or if they need to store proprietary data that doesn’t typically go with tick data or need to integrate with proprietary systems, this approach could make that possible.

All three of the suggested databases above have Python APIs or adapters that can be invoked from a Jupyter instance relatively easily. However, in many instances, you’re not working with real Pandas DataFrames, and the experience can end up being sub-optimal. One advantage is that since the computations are happening within the database (as opposed to Python), the scalability of an individual Jupyter instance is not relevant.

REST API based Services

Another way of achieving these goals could be to use a service that serves up historical tick data, and basic aggregations thereof, using REST APIs. One such service that we have used with good success in the recent past is polygon.io. We are aware of other vendors providing a similar or a more advanced service.

We could use a JupyterHub cluster to run a series of Jupyter notebooks, and write analytics in Python using Pandas. We’ve talked about a similar setup in a previous post.

We think there are a few drawbacks to this approach:

  • As we’ve alluded to already, we do not like the idea of having to query or transport large volumes of raw data to perform simple analytics. A caching layer would not help, and we would spend an inordinate amount of time performing I/O, possibly over the internet, rather than crunching the numbers. The performance for serious analytics is likely to be an order (or orders) of magnitude slower.
  • The capacity of a single Jupyter instance may limit the type of analytics that can be performed. One of our former colleagues likes to say, “if your data fits in the memory of a single machine, it is not big data,” and we are most certainly working with big data. Pandas is known to be somewhat inefficient with memory utilization, sometimes reserving memory as much as 5x the size of the actual data set. Also, while Pandas is quite fast, Python itself is not, so we would have to be very conscious about how we write the analytics code to achieve any semblance of acceptable performance.
  • There are ways to solve Jupyter’s scalability: (1) Process the data in chunks if there is a natural way to split the data. Within equities, this is indeed possible where the data for a symbol+date can be reasonably considered to be an independent chunk. This workaround doesn’t work for all types of analytics though, and its mere presence makes this option unappealing. (2) Jupyter Enterprise Gateway allows Jupyter notebooks to share resources across a cluster by using some combination of Apache Spark, Hadoop, YARN, Kubernetes, Docker, and Dask. Each of those things scares us, and the combination is outright terrifying. We’ll keep an eye on this one though, and maybe someday we’ll scrounge up enough courage to try it out.

There are a few advantages to this approach as well:

  • Depending on the vendor and contract, we may not need to pay for data separately. We would expect that the data costs would be distributed across the multiple customers of the vendor.
  • We would not need to manage the nightly loads of data.
  • The infrastructure footprint required is possibly smaller, at least in terms of disk space.
  • Given the use of open-source analytics components, the overall cost is likely to be lower than the previous option.
  • The maintenance effort required in keeping a JupyterHub cluster running is generally not too onerous.

This option tends to have a lower total cost of ownership than the traditional tick database, and one comparable to the managed solution discussed next. However, the performance characteristics of this solution leave a lot to be desired.

A Managed Tick Data Service — OneTick Cloud

OneTick Cloud is the solution we have finally settled on. OneTick is a seasoned player in this space, and we have been familiar with its proprietary tick database for over a decade now. OneTick Cloud is a managed service that combines the proprietary tick database with cleansed and normalized data for Equities and Equity Derivatives, all neatly packaged and hosted in the AWS cloud. OneTick supports environments running on either shared infrastructure or dedicated infrastructure, depending on the subscriber’s choice, requirements, and usage patterns. It also supports the ingestion of real-time data and on-the-fly analytics, although we have not explored this aspect of the offering.

We have started with a subscription for Top-of-Book (Level 1) Tick Data and Reference Data for US Equities. We can easily and seamlessly switch to a more extended tick history or Depth-of-Book data with a simple request (and a significantly higher monthly bill).

The following are some of the advantages we have observed:

  • We pay a bundled fee that includes the database license and the data. The total cost of ownership is lower or comparable to any other option we have evaluated that has equivalent capabilities.
  • We do not need to worry about nightly loads of data.
  • The execution performance is quite impressive — we can run reasonably complex analytics through a day’s worth of trade and quote data for all symbols in a matter of minutes. Running through an entire year’s worth of data is possible in an hour or a few hours (depending on what you’re doing).
  • OneTick comes with a rich, interactive user-interface for designing queries (more on that below).
  • We are not required to run any servers or manage any infrastructure at all. The query designer can be started and functions perfectly adequately on our laptop computers.
  • The tick database can be accessed from Python using a proprietary API. The Python layer is only used to compose and submit the queries to the server, while the actual execution happens on OneTick servers. Hence, there are no concerns about the performance or scalability of Python or Jupyter.
  • A knowledgeable support team is available to help users with questions related to the environment, the database, the query language, or even individual data elements.
  • When using the dedicated infrastructure offering (which we do not), OneTick can integrate with the subscriber’s existing data sources. Even on the shared infrastructure, there are ways to temporarily upload order data, join it with tick data, and achieve similar outcomes.

Having mentioned all of these benefits, we must acknowledge that a significant disadvantage is lock-in, since OneTick uses a proprietary query language and API. If we ever canceled our subscription, we would not only lose access to the data and the ability to run any analytics, but we would need to rewrite much of our code in another language. To the extent that our analytics code uses Python, we think this issue can be mitigated by writing a Python library layer to encapsulate the proprietary API. We hope that if we ever wanted to switch, we could rewrite that layer to target another platform, and this would spare much of our business logic from a rewrite.
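The mitigation described above amounts to a thin façade: our analytics call our own stable interface, and exactly one module touches the vendor SDK. A skeletal sketch (the class and method names here are our own; the real vendor API is not shown):

```python
import pandas as pd

class TickStore:
    """Facade over a tick data backend; callers never see the vendor API."""

    def __init__(self, backend):
        # backend is any object exposing run_query(); the vendor SDK would
        # be wrapped by exactly one such backend class, and nowhere else
        self._backend = backend

    def trades(self, symbol: str, date: str) -> pd.DataFrame:
        # translate our stable interface into whatever the backend needs
        return self._backend.run_query(kind="trades", symbol=symbol, date=date)

class InMemoryBackend:
    """Toy backend for tests; a real one would call the vendor SDK."""
    def __init__(self, frames):
        self._frames = frames

    def run_query(self, kind, symbol, date):
        return self._frames[(kind, symbol, date)]

frame = pd.DataFrame({"ts": [1], "price": [310.11], "size": [100]})
store = TickStore(InMemoryBackend({("trades", "AAPL", "2020-05-08"): frame}))
trades = store.trades("AAPL", "2020-05-08")
```

Switching platforms then means writing a new backend class, not rewriting the business logic that consumes `TickStore`.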

In the next section, contributed by Allison (President of Proof and the lead scientist on our team), we take a look at how OneTick Cloud is working out in practice.

Evaluating OneTick for our Research

With all this in place, there is one important question left to answer: how well does the design of the proprietary OneTick query language match up with the kind of analyses we want to do, and what is the resulting performance like? To determine this, we need a basic understanding of the design principles of the OneTick query language as well as a more specific characterization of the kind of analyses we do frequently.

Currently, our research on historical market data has several purposes:

  1. Implement our initial framework for historical simulation as outlined in our whitepaper
  2. Develop new metrics for effective transaction cost analysis (TCA)
  3. Contribute to the design of our trading algorithm(s)

These kinds of efforts require rather large samples of data, and so, our ability to quickly perform relevant computations over many stocks and many days is crucial to our research success. To assess how well OneTick can perform for such tasks, we started by understanding what OneTick is good at and why, so that we can better translate our computational requirements into things that OneTick is designed to do well.

The core structure of the OneTick query language is a directed acyclic graph. Time series data, like a series of trades or quotes in a particular stock for a particular day, flows from source nodes in the graph through paths of intermediary nodes that perform processing tasks (Event Processors), to sink nodes that spit out the results. The fundamental unit of data is a tick, which is similar to a single row of a typical database, except that its timestamp field plays a special and critical role. The timestamp is the organizational lynchpin of OneTick, and cannot be treated interchangeably with other fields.

Event Processors operate most naturally in a streaming fashion, looking at ticks one at a time in the proper order. State variables can be declared and used to store dynamically updating information as the ticks stream by. We can visualize event processors as individual workers on an assembly line: ticks are sent through one by one on a conveyor belt, and each event processor performs its task and sends the result on to the next processor. This physical analogy breaks down a bit when an event processor node has multiple successor nodes, but we can think of this split in the processing path as a copy operation on the result, with each copy being sent onto a different processing path downstream. Conversely, there are also joins and merges, which can combine multiple incoming tick streams into a single outgoing stream.

Aggregations are perhaps the clunkiest of the typical computations to impose on this streaming paradigm, and these are indeed where we found OneTick to be less intuitive and less optimized in performance. Aggregations that divide ticks into discrete buckets of time are perhaps still somewhat natural, as they simply change the underlying unit from a single tick to a bucket of ticks. But running aggregations are a bigger conceptual leap. A grouping of ticks determined by a criterion like “this tick and all preceding ticks that occurred within the last 15 minutes” is unnatural in the streaming paradigm. It feels like asking for a vegan meal in a steakhouse. OneTick certainly provides this capability, but we have found it best to avoid invoking it if possible.

For example, our historical simulation approach is based on the idea that current market conditions, combined with fresh actions, drive the probability distributions of future market conditions. There are lots of different ways we might define and measure “current market conditions.” One reasonable way involves trailing windows: e.g., what has trading behavior looked like over the last 15 minutes? But if we used this definition, we’d be invoking the unnatural running aggregations in OneTick a lot, and taking an unnecessary performance hit in our computations.

What we might do instead is to keep a “running average” of sorts, where newer data has a meaningful impact on the computation, while the older data does not. This running aggregation can then be used as a proxy for “recent market conditions” while allowing us to keep a constant amount of state in memory, and without storing individual event data. Of course, the exponential moving average function fits the bill. More concretely, each time you receive a tick with a price P after time Δt, you might compute a new “recent average” value A by taking:

A ← e^(−Δt/τ) · A + (1 − e^(−Δt/τ)) · P

where τ is a decay constant that plays a role analogous to the window length.
This kind of “trailing average” is not exactly the same as the average over a strict trailing window, but it is morally similar. Crucially, defining the “current market conditions” this way allows us to perform this computation in a one-pass streaming fashion. There is one slight catch — because we are not using exact trailing windows, older outlier values can continue to impact the “current conditions” for longer than we might like. This is something we’ll watch out for, but it seems like an acceptable trade-off for now.
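A minimal sketch of this one-pass update in Python (the parameter names and the choice of decay constant are our own):

```python
import math

def ema_update(avg: float, price: float, dt: float, tau: float) -> float:
    """One streaming EMA step: decay the old average toward the new price.

    dt is the time since the previous tick; tau controls how quickly
    older ticks stop mattering (both in the same time units).
    """
    w = math.exp(-dt / tau)  # weight retained by the old average
    return w * avg + (1.0 - w) * price

# stream of (dt_seconds, price) pairs; tau = 900s approximates a
# "last 15 minutes" notion of recent market conditions
avg = 100.0
for dt, price in [(1.0, 101.0), (2.0, 102.0), (600.0, 100.0)]:
    avg = ema_update(avg, price, dt, tau=900.0)
```

The entire state is one float per statistic per symbol, which is what makes the computation a single streaming pass with constant memory.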

As soon as we changed our computations to fit this streaming paradigm, all of our historical research computations sped up. As it stands now, we’re able to churn through trades and quotes for all symbols for a given trading day in a matter of minutes, computing our definitions of prevailing market conditions at each trading event.

Interestingly, since our real-time trading algorithms need to compute the current market conditions for similar reasons and with similar constraints, it makes sense for us to keep the state variables lightweight and easy to compute. In this way, our goals very much align with the OneTick design, and so far, we are finding it quite capable of meeting our needs.

Closing Thoughts

We find that services like OneTick Cloud are not only convenient and cost-effective, but they have the potential to democratize research on historical market data by allowing small teams, like ourselves, to focus on their business, rather than the infrastructure and engineering aspects of this environment.

If you have questions or comments, please feel free to reach me on Twitter: @preraksanghvi
