Data Landing and Data Storage for Quantitative Analysis

Nick Wienholt (quantivfy.eth)
6 min read · Jun 14, 2022


Photo by Claudio Schwarz on Unsplash

Data is the oxygen of the quant process, and the quality of the historical data used in the model development and training phase is one of the crucial aspects that determines whether a model will make money once it is released into the wild.

In Chapter 2 of Advances in Financial Machine Learning, de Prado states that:

In general, you do not want to consume someone else’s processed dataset, as the likely outcome will be that you discover what someone else already knows or will figure out soon. Ideally your starting point is a collection of unstructured, raw data that you are going to process in a way that will lead to informative features.

This comment is deeply insightful, and provides the ‘why’ context for most of this post, which will focus predominantly on the ‘how’. To briefly set the scene, there are at least four distinct ways of getting the data to the stage where it can be made available to an ML model for training:

  1. The data can be imported from the raw, external source in the same process as the training steps. This is wrong from multiple angles — sharing the data with others within an organisation is hard, the original source may disappear suddenly (à la Quantopian’s intensely rapid disappearance), versioning the data is not possible, it is the slowest possible option, data scientists rarely write decent data engineering code, and sampling the data from the original source can be difficult, often requiring the entire dataset to be loaded before it is filtered.
  2. The data can be extracted from the raw source into a relational database using custom code written in a general-purpose programming language. This is an abysmal approach — the end result is a poorly documented, poorly considered, poorly tested, poorly performing, feature-poor in-house ETL ‘framework’. It looks so simple at first to just bang out a bit of C# code to pick up a file and shred it into a table in a database; then some code is written to handle retries, then some more to handle duplicate data, and then some more to process common file formats. This is the typical way organisations end up with a poor man’s bespoke ETL, and that’s a horrible place to end up for all concerned. There is absolutely nothing in any data source encountered that cannot be handled by the two superior approaches described below.
  3. The data can be extracted by an off-the-shelf ETL tool into a relational database. This was generally the least-worst option until around five years ago, and ETL tools have many nice features for handling file formats, retries, conditional logic paths and scaling. There are a number of serious problems though: the ETL tools are their own speciality, requiring some degree of familiarity and training, and they impart a degree of opacity on the data transformation. Their graphical nature hides a lot of the detail of the transformations being performed (see the image below of a moderately complex Azure Data Factory pipeline), with the actual transform logic hidden three clicks deep in some dialog box. By nature, ETL tools are batch processes, and while operational trading decisions are unlikely to be running off the data being processed, near-real-time reporting capabilities can be compromised by batch processing. The critical downfall of most ETL tools is that they lead to destructive changes being performed on the data, which inhibits bug fixes and the extraction of new features from this data in the future (unless the data is re-processed from the source). While this isn’t an inherent design flaw of these tools (you can land data in permanent raw schemas on the receiving end, and then implement the transformation on top of that), the default choice of these tools and their practitioners is to perform the transform on data in flight and not save the rawest form of the data.

  4. Advances in compute availability and new database architectures have made an entirely new data loading and transformation process possible. In contrast to the ETL (extract-transform-load) process described in option 3, an ELT (extract-load-transform) option is now feasible. In ELT, a separate tool is used to land the data in a raw format (or at least a logically raw format where no grouping, cleaning, joining or extraction has been performed), and transformation of the data is performed on demand with database views that have built-in capabilities for easily handling formats like JSON, XML and CSV. With built-in caching, the transformed data is available as rapidly as it would be in an ETL process, and the views are obviously non-destructive on the raw data, which can be reinterpreted and reprocessed at any time. This approach has the same desirable quality as the Lambda architecture in preserving raw facts, and also eliminates the latency associated with batch processes. The Snowflake database platform is the absolute leader in enabling this approach through Snowpipe and its rich, performant JSON/XML handling. Snowflake’s use of cheap blob storage from the cloud providers allows the raw data to be stored indefinitely at very low cost. Snowflake’s ability to support UDFs in both SQL and Java means that complex transformations can be handled with relative ease, and for complex calculations external functions are available. External functions have some of the logical pitfalls of ETL tools however, in that the processing is moved to a separate ‘thing’ and transparency is decreased. Any good architecture strives for both completeness and minimalism, and each moving part increases TCO and provides another opportunity for operational breakages. A minimal sketch of the landing (‘EL’) step follows this list.
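The sketch below assumes a Snowflake account and the snowflake-connector-python package; the RAW schema, TRADES table, LANDING stage, warehouse, database and file name are all hypothetical placeholders. The only change made to the data on the way in is parsing each record into a VARIANT column, untouched.

```python
import os

import snowflake.connector

# Hypothetical connection details, pulled from the environment.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="LOAD_WH",
    database="MARKET_DATA",
)
cur = conn.cursor()

# Land the data as-is: one VARIANT column holding the untouched JSON record,
# plus a little lineage metadata. No grouping, cleaning or joining happens here.
cur.execute("CREATE SCHEMA IF NOT EXISTS RAW")
cur.execute("""
    CREATE TABLE IF NOT EXISTS RAW.TRADES (
        LOADED_AT   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
        SOURCE_FILE STRING,
        PAYLOAD     VARIANT
    )
""")
cur.execute("CREATE STAGE IF NOT EXISTS RAW.LANDING")

# In production a Snowpipe on the stage would ingest files as they arrive;
# a manual PUT + COPY INTO shows the same idea for one (hypothetical) file.
cur.execute("PUT file:///tmp/trades_2022-06-14.json @RAW.LANDING")
cur.execute("""
    COPY INTO RAW.TRADES (SOURCE_FILE, PAYLOAD)
    FROM (SELECT METADATA$FILENAME, $1 FROM @RAW.LANDING)
    FILE_FORMAT = (TYPE = 'JSON')
""")
```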

Using the ELT approach outlined in point 4, data from a particular source will exist in two or three separate schemas. The first schema contains the data in a format as close to raw as possible — the only desirable or required transformation is from a proprietary format to JSON, and this should involve no data loss. The second schema contains views which deliver the ‘T’ of ELT, and might normalise and clean the data into a format where it is ready to co-mingle with other data sources, as sketched below.
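Continuing the hypothetical RAW.TRADES example from the landing sketch above, the ‘T’ can be as simple as a view in a second schema that casts the VARIANT payload into typed columns. The column paths and the cleaning predicate are illustrative assumptions; the raw rows are never modified.

```python
import os

import snowflake.connector

# Same hypothetical connection as in the landing sketch above.
cur = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="LOAD_WH",
    database="MARKET_DATA",
).cursor()

# The transform is a non-destructive view over the raw VARIANT payload: fix a
# bug or add a field and simply recreate the view; the raw rows stay untouched.
cur.execute("CREATE SCHEMA IF NOT EXISTS CLEAN")
cur.execute("""
    CREATE OR REPLACE VIEW CLEAN.TRADES AS
    SELECT
        PAYLOAD:symbol::STRING        AS SYMBOL,
        PAYLOAD:ts::TIMESTAMP_NTZ     AS TRADE_TIME,
        PAYLOAD:price::NUMBER(18, 6)  AS PRICE,
        PAYLOAD:size::NUMBER(18, 2)   AS QUANTITY,
        SOURCE_FILE,
        LOADED_AT
    FROM RAW.TRADES
    WHERE PAYLOAD:price IS NOT NULL   -- cleaning lives in the view, not the load
""")
```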

The use of a third schema for extracting features from the second, normalised schema is a debatable option. Creating ML features at the database level is entirely reasonable from a technical standpoint, increases feature sharing and can be used to implement Chinese Walls between members of the data science team. It is not a ‘wrong’ approach. The intrinsic issue is that the analysis is split between two technologies (typically a SQL-like approach in the database, and a Python- or Scala-based approach in a notebook), and by pre-cooking part of the menu, the data scientist chef responsible for delivering alpha is at least partially constrained. In addition, libraries like NumPy and pandas are typically better suited to feature construction, as the sketch below illustrates.
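The windowed, vectorised feature work sketched here is a few lines of pandas in a notebook, but would be tedious to express and maintain as database views. The bar schema, column names and window lengths are purely hypothetical.

```python
import numpy as np
import pandas as pd


def make_features(bars: pd.DataFrame) -> pd.DataFrame:
    """Illustrative notebook-level features from daily bars that are assumed
    to carry 'close' and 'volume' columns."""
    f = pd.DataFrame(index=bars.index)
    log_close = np.log(bars["close"])
    f["ret_1d"] = log_close.diff()                      # one-day log return
    f["vol_20d"] = f["ret_1d"].rolling(20).std()        # 20-day realised volatility
    f["volu_z_60d"] = (
        bars["volume"] - bars["volume"].rolling(60).mean()
    ) / bars["volume"].rolling(60).std()                # volume z-score
    f["fwd_ret_5d"] = log_close.shift(-5) - log_close   # label, not a feature
    return f.dropna()


# Hypothetical usage after pulling the normalised schema into pandas:
# bars = pd.read_parquet("daily_bars.parquet")
# features = make_features(bars)
```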

The main decision point between shared ML features at the database level versus the notebook level will be organisational maturity. If the database schema is source-controlled and readily updatable via a DevOps process, database-level features are fine. In a less mature or less well-resourced organisation, striving for wide-scale code or logic re-use is harder. A quant will want to inspect and modify the features being used in their model, and will often want to tweak or extend them. If modifying a feature with optional parameters “shouldn’t” happen “because it might break my model which uses the same features”, the organisation isn’t ready for feature sharing. Instead, consider shared Python/Scala helpers that can be more easily source-controlled and included as part of a notebook, along the lines of the sketch below.
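A sketch of what such a shared helper might look like (the module, function and parameter names are all hypothetical): optional parameters with stable defaults let one quant tweak a feature for their own model without silently changing behaviour for colleagues who import it with the defaults.

```python
# features/volatility.py -- a hypothetical shared, source-controlled helper
# that quants import into their notebooks.
import numpy as np
import pandas as pd


def rolling_volatility(
    close: pd.Series,
    window: int = 20,              # existing models keep the default behaviour
    annualisation: float = 252.0,
) -> pd.Series:
    """Annualised rolling volatility of log returns."""
    log_ret = np.log(close).diff()
    return log_ret.rolling(window).std() * np.sqrt(annualisation)


# One quant can tweak the feature for their own model via parameters without
# touching anyone else's:
# from features.volatility import rolling_volatility
# vol_fast = rolling_volatility(bars["close"], window=10)
```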

NoSQL As An Option?

For the sake of completeness, it is worth briefly mentioning NoSQL approaches for data landing. To be as brief as possible, they generally suck, and options like MongoDB suck really hard. The only useful feature that I have found in MongoDB is the uninstaller.

Cassandra, as well as specialised time-series databases, might be a decent alternative in some scenarios. Towards the larger end of town, products like kdb+ will definitely have some well-deserved fans, but in general, sane practitioners should stay away from NoSQL offerings. Like their long-forgotten logical predecessor, the object database, NoSQL has sparked some great innovations in the relational world, but the lack of so many critical features renders most NoSQL offerings gimmicky ‘cool-kid ware’.
