MiFID II & Research — non-traditional data & mining for alpha.

MiFID II arrived on Jan 3rd. Sort of. No-action reliefs and legal entity identifier (LEI) related delays mean investment firms get some regulatory clarifications to digest, and a little implementation breathing room. But, not a lot.

One of MiFID II’s major disruptions occurs in sell-side research with the rule’s unbundling mandate. This requires EU asset managers to pay separately for research and execution services, and to pay for research in “hard dollars”: either from the asset manager’s own P&L, or from a client-funded research payment account. This puts an end to the practice of asset managers paying for research in “soft dollars”, i.e. payment via trade flow and execution commissions.

Unbundling is expected to lead to a lower and more discriminating spend on research. EU asset managers are also required by MiFID II to justify any research costs passed on to their clients, and not absorbed into their P&L (read: high tracking and attribution costs). Since research is purchased and deployed globally, the impact reaches beyond Europe.

A recent Oliver Wyman study estimates that research spend on average will see a reduction of 10–30%; with many asset managers expecting to reduce the number of research providers they use to a handful (4–6 global suppliers) and a long tail of select specialists. Optimizing the return capability of the research that they do buy (or generate themselves) is also critical. This sounds a death knell to middling research, and a bugle call to providers of high quality, market outperforming research.

This disruption is a clear risk to traditional sell-side research. But it also presents an opportunity to re-tool and re-engage research offerings. There is also an entry opportunity for new data and research providers.

And so the hunt for research alpha.

Alpha - the excess risk-adjusted return over market benchmark returns - is the asset manager’s holy grail, and is critical in research optimization. Also key is finding new sources of alpha that are orthogonal & uncorrelated to existing strategies. For this, asset managers are increasingly turning to non-traditional datasets that have not been otherwise picked over by the market.
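As a rough illustration of the two ideas above, the sketch below estimates alpha as the intercept of a regression of a strategy's excess returns on the benchmark's excess returns, and checks how correlated a candidate signal is with an existing strategy. This is one common way to measure alpha, not the only one. The return series and the names `ols` and `correlation` are hypothetical and exist only for this example.

```python
from statistics import mean


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def correlation(x, y):
    """Pearson correlation between two equally long return series."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical per-period excess returns (over the risk-free rate).
benchmark = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
# Toy strategy built as 0.2% alpha per period plus 1.2x benchmark exposure.
strategy = [0.002 + 1.2 * b for b in benchmark]
# Toy returns of a candidate signal sourced from a non-traditional dataset.
candidate = [0.004, 0.003, -0.002, 0.001, 0.005, -0.001]

alpha, beta = ols(benchmark, strategy)
overlap = correlation(candidate, strategy)

print(f"alpha per period: {alpha:.4f}, beta: {beta:.2f}")
print(f"correlation of candidate with existing strategy: {overlap:.2f}")
```

A candidate whose correlation with existing strategies sits close to zero is the "orthogonal" case the text describes. A value near 1 suggests the new dataset mostly repackages exposure the manager already has.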

These include:

Corporate exhaust datasets i.e. data thrown up from a company’s normal business activities that may be insightful and market predictive (depending on who is looking at it and how they’re looking at it).

Public US companies are required by the SEC to file 8-Ks as “timely” public disclosure of the occurrence of “unscheduled material events”, e.g. the resignation of key personnel or geographic expansion of a division. But while there are robust legal definitions of materiality, many of these have to depend on an assumption of the reasonable market-moving expectations of an investing public. The predictive capability of a business event is usually not one of those assumed reasonable expectations, e.g. 8-K disclosures are usually not required for a public company’s monthly moving & relocation costs.

Insight materiality is very much in the eye, imagination and acquired knowledge base of the beholder.

In addition to Corporate exhaust datasets, there are also Government Agency exhaust datasets with similar (and non-similar idiosyncratic) considerations. And then there are other non-traditional datasets.

Alternative datasets — which really can be any dataset with predictive capability — or at least predictive narratives from which falsifiable hypotheses can be drawn. These are sourced from a wide variety of providers.

Below is a categorization of alternative dataset collections, as available from Eagle Alpha — one of the leading alternative data aggregators and providers.

Raw non-traditional datasets can be:

  • Unstructured;
  • Internally inconsistent;
  • Noisy, with ill-defined & uncertain signals.

It does not mean that they are empty of insight. It does mean though that insight or predictive value must be mined. And it can be tricky to mine (read: costly). Mining for insight typically employs big data pipelines, machine learning (ML) and (sometimes) artificial intelligence (AI) workflows. The complexity of these depends on: the nature of the datasets and their dependencies; the hypotheses being formed from them; the features these hypotheses need; and what can be extracted from the data and modeled.

The number and variety of these datasets can only be expected to grow and evolve.

Growth, as more of a company’s business transactions become visible via public and private ledgers; as supply chains capture even more sensor-visible data; and as our own individual exhaust data grows, through our engagement with a company’s products, services, physical and virtual sites, and jobs.

Evolution, as some datasets fail to deliver on their touted predictive capabilities; and as others succeed and diffuse through the market, losing predictive strength and becoming traditional along the way (diffusion is inevitable, “pesky” employees just won’t stay put). Either way, the value of datasets will decay — to be replaced by other datasets and predictors.

However this does not mean that acquired insight necessarily grows or evolves as well.

This proliferation and churn of non-traditional datasets will only further complicate the asset manager and research provider’s tasks i.e.:

  • Of assessing the predictive value of a dataset, and capitalizing on the predictive opportunity it presents before that opportunity fades, and
  • Of building a business & operational model, a pipeline and workflows that can consistently do this.

There are 2 summary dimensions to this problem statement.

  1. That mining for insight — and so alpha — can be complex and costly.
  2. That value is in alpha. And not only is it scarce, it decays — with a half-life to its revenue capability.

A non-traditional dataset’s attribute taxonomies

Source: Acuity Derivatives — Non-Traditional Dataset, Attribute Taxonomies x Processing - a sample dataset map

The graphic above illustrates an approach to looking at non-traditional datasets, with these two summary dimensions in mind.

Visualizing cost & complexity on one hand, and alpha decay on the other. And then capturing and mapping a rich set of dataset attribute taxonomies across both dimensions.

Cost & complexity drivers include:

  • Data volume and velocity types — this drives the storage and network infrastructure and the complexity of data engineering required.
  • Sensitive data and Personally Identifiable Information (PII) types — scrubbing and masking these drives legal and regulatory compliance, and the complexity of the legal & reputation risk mitigation required.
  • Provider/Source ratings, data completeness and methodology types — this drives data veracity; and data cleaning/backfill, reverse engineering and similar pre-processing transformation efforts to improve accuracy. This also affects the likelihood of GIGO (Garbage In…).
  • Data format and structure types — this drives pre-processing and the complexity of the ingestion processing required. One rule of thumb is pre-processing can constitute 80% of total effort.
  • Data density and historical breadth types — this drives the degree and complexity of interpolation and data augmentation required. Sample size, and the dataset’s intersection with labelled data (i.e. data with known outcomes), also drive the modeling and training possibilities, and the modeling complexity and computation required.
  • Feature density and data depth types — this drives the richness of hypotheses, features and predictors that can be derived from the dataset. Density and depth however also drive dimensionality (a dataset with n data points and p features is high-dimensional if p > n) and higher-order structure, where the features to be extracted are themselves composed from extracted sub-features. This in turn drives feature extraction complexity, modeling complexity, and computation requirements.
  • Control metadata types e.g. identifiers, lineage, ownership/roles, versioning, audit trail — this drives data dependency, in-process data accuracy, data conflict resolution, and other governance and coordination complexities that can prove costly in the breach (again GIGO).
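A few of the cost & complexity drivers above can be roughed out per dataset with very little code. The sketch below is only an illustration: it computes data density (the share of non-missing cells), flags the p > n high-dimensional case, and flags columns that appear to hold email addresses as a stand-in for a PII scan. The record layout, the `profile` and `mask_emails` helpers, and the deliberately simple email pattern are all assumptions made for this example. A real PII review would cover far more than emails.

```python
import re

# Deliberately simple email pattern, used here only as a stand-in PII check.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def profile(rows):
    """Profile a dataset given as a list of dicts; None marks a missing value."""
    features = sorted({key for row in rows for key in row})
    n, p = len(rows), len(features)
    cells = n * p
    filled = sum(1 for row in rows for f in features if row.get(f) is not None)
    pii_columns = sorted(
        {f for row in rows for f in features
         if isinstance(row.get(f), str) and EMAIL.search(row[f])}
    )
    return {
        "n": n,                       # number of data points
        "p": p,                       # number of features
        "high_dimensional": p > n,    # the p > n case from the text
        "density": filled / cells if cells else 0.0,
        "pii_columns": pii_columns,   # columns needing scrubbing or masking
    }


def mask_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[EMAIL]", text)


# Hypothetical raw records from a retail footfall dataset.
rows = [
    {"store_id": 1, "footfall": 120, "basket": 31.5, "contact": "ann@example.com"},
    {"store_id": 2, "footfall": None, "basket": 28.0, "contact": None},
]

report = profile(rows)
print(report)
print(mask_emails("escalate to ann@example.com today"))
```

Profiles like this give a first, comparable read on how much cleaning, masking and backfill a dataset will need before any modeling starts.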

Alpha and alpha decay drivers include:

  • Asset classes — this drives alpha capability. Some asset classes are currently more opaque, with less available data density, than others.
  • Uniqueness and seasoning/aging metrics — this drives residual alpha. The more proprietary and unique, and the newer to market the dataset is, the more residual alpha there should be — all else being equal.
  • Fund and strategy types — this drives client range. Access to a wide range of fund and strategy types may help optimize the continued monetization of residual alpha at several points along the decay curve.
  • Distribution model & API types — similarly this drives client range, by expanding the set of clients that can be serviced.
  • Feature density and data depth types — the complexity of processing datasets that are feature dense, with high dimensionality and higher order, will be prohibitive to a majority of market participants e.g. predictors relying on feature recognition of objects or people in images or video. This entry barrier potentially lengthens the half-life of alpha decay — all else being equal.
  • Predictor performance metrics — this drives the estimation of alpha capability and of alpha decay.
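To make the last driver concrete, here is a minimal sketch of estimating an alpha half-life from a series of predictor performance readings. It assumes the metric decays exponentially and stays positive, so that a straight-line fit to its logarithm gives the decay rate. Both the exponential form and the `fit_half_life` helper are assumptions made for illustration, and the monthly scores are synthetic.

```python
import math


def fit_half_life(times, scores):
    """Fit log(score) = a + b * t by least squares; return ln(2) / -b.

    Assumes exponential decay and strictly positive scores. Returns
    math.inf when the fitted slope shows no decay at all.
    """
    logs = [math.log(s) for s in scores]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    slope = sxy / sxx
    if slope >= 0:
        return math.inf
    return -math.log(2) / slope


# Synthetic monthly readings of a predictor's performance metric,
# generated with a six-month half-life so the fit can be checked.
months = list(range(12))
scores = [0.08 * 0.5 ** (m / 6) for m in months]

half_life = fit_half_life(months, scores)
print(f"estimated half-life: {half_life:.1f} months")
```

Real performance series are noisy and can turn negative, so in practice the fit would need more care than this. The point is only that a decay estimate can be tracked as one more attribute of the dataset.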

Modeling a rich set of attribute and metadata taxonomies allows:

  • A common basis of comparison of cost, complexity and potential alpha capability and decay across datasets.
  • The development of heuristics and a metadata model with which to better understand predictors of cost, alpha and alpha decay across dataset types.
  • The development of a coherent pricing framework for both input raw datasets, and output sets of predictors and insights.

An asymmetry exists here.

The cost & complexity profile that a specific dataset presents is a function of, not only the dataset profile, but also the skill/remuneration & infrastructure profile of the specific operational model it is being processed through — i.e. its processing pipeline profile.

The alpha decay of the dataset’s predictors on the other hand is a function of the dataset profile, of the processing pipeline profiles of ALL consumers of the data and predictors represented by the dataset, and of the capabilities of their operational models.

Given a specific processing pipeline profile, this asymmetry affects the P&L viability of datasets processed through the pipeline. Two operational models processing the same dataset and extracting the same features and predictors may have two very different cost & complexity profiles. But those predictors will be subject to similar alpha decay.
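The asymmetry can be sketched numerically. In the toy model below, two hypothetical pipelines extract the same predictor, so both face the same revenue decay, but each carries its own setup and running costs. The exponential decay form, the `Pipeline` fields and every number are assumptions chosen purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Pipeline:
    """Hypothetical cost profile of one operational model."""
    name: str
    setup_cost: float
    monthly_cost: float


def cumulative_revenue(initial_monthly_revenue, half_life_months, horizon_months):
    """Revenue earned up to the horizon if monthly revenue decays exponentially."""
    decay = math.log(2) / half_life_months
    return initial_monthly_revenue * (1 - math.exp(-decay * horizon_months)) / decay


def net_pnl(pipeline, initial_monthly_revenue, half_life_months, horizon_months):
    """Shared decaying revenue minus this pipeline's own costs."""
    revenue = cumulative_revenue(initial_monthly_revenue, half_life_months, horizon_months)
    costs = pipeline.setup_cost + pipeline.monthly_cost * horizon_months
    return revenue - costs


# Same dataset and predictors, so the same decay for both pipelines.
INITIAL_REVENUE, HALF_LIFE, HORIZON = 100.0, 6.0, 12.0

lean = Pipeline("lean", setup_cost=100.0, monthly_cost=20.0)
heavy = Pipeline("heavy", setup_cost=400.0, monthly_cost=40.0)

for pipeline in (lean, heavy):
    pnl = net_pnl(pipeline, INITIAL_REVENUE, HALF_LIFE, HORIZON)
    print(f"{pipeline.name}: net P&L over {HORIZON:.0f} months = {pnl:.1f}")
```

With these made-up numbers the lean pipeline ends the year in profit while the heavy one does not, even though both sell an identical and identically decaying predictor.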

Being able to generate attribute taxonomy maps like these can be of immense benefit in tuning the deployed business strategy. And in correctly sizing the operational model to ensure that a capability mismatch does not fall out of this asymmetry.

Approaches to strategy vary and can include:

A. Focus on raw data and light pre-processing — where very little or no insight or predictor mining is done. Value is added in generating or curating potentially interesting datasets. This might be valuable when a provider’s industry expertise is uniquely broad and deep; and especially in an industry that either has a glut of data, or a scarcity of available data (where much of that data is private and locked up inside a handful of dominant companies). Value can also be added in improving the datasets’ pre-ingestion quality.

Many Quant funds (and some so-called Quantamental funds — hybrid Quant/Fundamental) that have invested significantly in developing their own very sophisticated processing pipelines are natural consumers of raw data — the bloodier and more jagged, the better.

The processing profile associated with this raw-data only approach is light on ML and AI workflows and resources. But still depends on access to big data capabilities. This can make it operationally lean — relatively; with an emphasis on the sourcing, procurement, curating and sale of raw data. Many traditional research providers may find this an easier transition path.

However, without access to ML and AI workflows and the ability to directly analyze datasets for predictors and alpha capability; the pricing of these raw datasets may lack coherence, and may be detached from their true market value. Additionally this approach leverages existing insight. It does little to grow, deepen or evolve insight through the capabilities available to ML & AI.

Value may also tail off very quickly with raw data consumers. As you move down the alpha decay curve, residual alpha becomes less profitable for them to exploit and the opportunity cost becomes prohibitive. A narrow client range profile like this truncates the revenue capability of a single dataset or class of datasets to its provider, and may require a dense and varied pipeline of datasets to smooth out the lumpy nature of this revenue stream.

B. An end-to-end pipeline but with bounded feature complexity — where an end-to-end pipeline is developed with a complexity constraint on its processing capabilities. This may reflect an operational model matched to a specific budget, legacy processing infrastructure, or talent pool cap. Bounding feature complexity may involve either, or a combination of:

  • Bounding the pre-processing complexity of the pipeline.
  • Bounding the ingestion, modeling and model training complexity of the pipeline.

When paired with, or expanded to from, option A — this approach may help address the key weaknesses of A.

By developing an end-to-end capability for generating predictors, the consumer base of a dataset can be broadened to cover:

  • Consumers of raw data and pre-processed data,
  • Consumers of extracted features and related descriptive statistics, and
  • Consumers of trade-able predictors and insights.

An expanded client range can help extend the revenue capability of a dataset, even as the decay curves of its predictors flatten. It can also help develop a more market responsive pricing framework for the data, features, predictors and insight.

Additionally, with this approach the provider develops a big data/ML/AI pipeline and the necessary talent capability — particularly pertinent for sell-side research. As ML & AI continue to make inroads into many sell-side business functions, Research (especially Quantitative Research) is a well suited arrow-head from which this capability can be built out. It touches all of the other points of entry: Trading, Sales, Risk management, and Model Validation.

Source: Acuity Derivatives — Pre-Processing vs Ingestion & Modeling Complexity Quadrant

C. An end-to-end pipeline targeting the upper bound of feature complexity — this removes the complexity constraints of approach B and targets the hardest to reach predictors. This approach targets consumers who are willing to consume the most valuable predictors and insights, but are constrained in their ability to build a matching processing pipeline and talent capability. Essentially the provider becomes an outsourced pre-frontal cortex — but one that only concerns itself with high minded things.

Given the market’s need for exclusivity & proprietary access — this approach probably succeeds only on a very careful segmentation, selection and management of client funds and strategies.

One can conceivably see an evolving and mixed approach that combines two or more of these. With a rich attribute taxonomy and metadata model, and richly annotated datasets to provide a road map.


  • MiFID II, specifically unbundling, is disrupting the provision of research to the buy-side; and is risky to the traditional sell-side Research model.
  • Mining for insight and alpha is the key differentiator, and asset managers are turning to alternative and corporate exhaust datasets for this.
  • Mining for alpha from these datasets can be uniquely complex and costly (for you); while alpha decays (for everyone) — could be a P&L problem.
  • Build a rich attribute and metadata taxonomy of your datasets; analyze attribute profiles across 2 dimensions — cost & complexity vs alpha decay.
  • Use this to fine-tune & right-size your business strategy and operational model.
