Building The Next Billion Dollar Dataset

Mark Donovick
Big Data Protocol

--

Evolution of Alternative Data, Financial Tools, & KPIs

When I first got my beak wet in the capital markets arena, machine-readable news was all the rage (I remember GovBrain). It was an era when the potential of alternative data was just beginning to be realized: the ability to analyze text, run sentiment analysis, and extract essential data using natural language processing. But while the concept was groundbreaking, its execution was far from smooth.

Back in those days, it was a Herculean task to develop these products: training them specifically for the financial sector and ensuring they were tailored to our clients’ needs. Though we knew the demand for such tools existed, understanding how they were applied in real time and the alpha they actually drove remained enigmatic.

To demystify this, we went all out, roping in quant analysts, consultants, interns, and even PhD candidates. This collaborative approach shed light on the nuanced ways these tools were being employed in the industry.

Today, technology has expedited these processes. With the right tools, one can swiftly evaluate if a particular data set potentially holds alpha. Once identified, this data is then passed down to the experts: data engineers, data scientists, and portfolio teams who further assess its relevance to specific investment strategies.

Recent years have witnessed a significant shift: expertise in data application has become increasingly commoditized. This commoditization has empowered vendors to understand potential data use cases, often employing platforms like FactSet or Visible Alpha to gauge critical Key Performance Indicators (KPIs) for various companies. By juxtaposing historical data with their signals, vendors can potentially forecast these KPIs, using consensus as a benchmark.

Many vendors now have the capability to conduct systematic back-tests on these vital KPIs. With such advanced tools, sell-side vendors can rapidly demonstrate their value proposition to buy-side firms.
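To make that concrete, here is a minimal sketch of the kind of KPI back-test a vendor might run, assuming the vendor signal, the street’s consensus estimate, and the reported figure are already lined up by ticker and quarter. The tickers, numbers, and the naive linear mapping are all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one row per company-quarter, with the vendor signal,
# the consensus KPI estimate, and the KPI that was actually reported.
df = pd.DataFrame({
    "ticker":    ["ACME", "ACME", "ACME", "WIDG", "WIDG", "WIDG"],
    "quarter":   ["2023Q1", "2023Q2", "2023Q3"] * 2,
    "signal":    [0.8, 1.1, 0.9, 0.4, 0.6, 0.7],    # e.g. an indexed card-panel spend read
    "consensus": [100, 105, 110, 50, 52, 55],        # street estimate of sales ($M)
    "actual":    [103, 104, 114, 49, 54, 57],        # reported sales ($M)
})

# Fit a naive signal-to-KPI mapping on the trailing quarters for each ticker,
# then forecast the held-out quarter and compare the error against consensus.
def backtest(group: pd.DataFrame) -> pd.Series:
    train, test = group.iloc[:-1], group.iloc[-1]
    slope, intercept = np.polyfit(train["signal"], train["actual"], deg=1)
    signal_forecast = slope * test["signal"] + intercept
    return pd.Series({
        "signal_error":    abs(signal_forecast - test["actual"]),
        "consensus_error": abs(test["consensus"] - test["actual"]),
    })

results = df.groupby("ticker").apply(backtest)
print(results)
print("Signal beats consensus:", (results["signal_error"] < results["consensus_error"]).mean())
```

In practice this runs point-in-time across hundreds of names and many quarters, but the question is the same: does the signal’s forecast land closer to the reported KPI than consensus does?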

Looking back, many initially aimed to leap from raw data directly to stock price predictions, often overlooking the myriad of factors in between. This approach had its limitations. Now, there’s a more granular focus on predicting individual KPIs, such as sales or costs, using specific datasets. By amalgamating multiple datasets, including several alternative ones, one can construct a more sophisticated and robust investment model.
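As a sketch of what that amalgamation might look like in code, here is a toy example that merges two hypothetical alternative datasets (a card-spend index and web traffic) with reported results and fits a simple model for a single KPI; the data and the linear model are purely illustrative:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical alternative datasets, each keyed on (ticker, quarter).
quarters    = ["2022Q4", "2023Q1", "2023Q2", "2023Q3"]
card_spend  = pd.DataFrame({"ticker": ["ACME"] * 4, "quarter": quarters,
                            "card_spend_idx": [0.9, 1.0, 1.2, 1.1]})
web_traffic = pd.DataFrame({"ticker": ["ACME"] * 4, "quarter": quarters,
                            "visits_mm": [12.0, 13.5, 15.1, 14.8]})
reported    = pd.DataFrame({"ticker": ["ACME"] * 4, "quarter": quarters,
                            "sales_mm": [98, 103, 112, 109]})

# Amalgamate the datasets into one feature matrix per company-quarter.
panel = card_spend.merge(web_traffic, on=["ticker", "quarter"]).merge(reported, on=["ticker", "quarter"])

features = ["card_spend_idx", "visits_mm"]
model = LinearRegression().fit(panel[features], panel["sales_mm"])

# Forecast next quarter's sales from the latest alternative-data reads.
next_q = pd.DataFrame({"card_spend_idx": [1.3], "visits_mm": [16.0]})
print("Predicted sales ($M):", model.predict(next_q)[0])
```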

Many buy-side firms are now leveraging automation from the outset. When a vendor receives approval to trial a data set, these firms efficiently ingest the data into systems like S3 buckets, apply tagging, normalize based on demographics, and then back-test against key KPIs, assessing its performance alone or in combination with existing data sets. Validating a vendor and getting approval for a trial period, however, still requires a great deal of human-to-human interaction. True automation could eliminate that phase entirely with the use of blockchain.
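A stripped-down version of that trial pipeline might look like the following; the bucket, key, column names, and the downstream run_kpi_backtest call are placeholders rather than any particular firm’s stack:

```python
import boto3
import pandas as pd

# Hypothetical trial pipeline: pull the vendor's trial file from S3, tag it,
# normalize it, and hand it to the back-testing step.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="altdata-trials", Key="vendor_x/2024/panel.csv")  # placeholder bucket/key
raw = pd.read_csv(obj["Body"])

# Tag the dataset so downstream teams know its provenance and trial window.
raw["vendor"] = "vendor_x"
raw["trial_end"] = "2024-06-30"

# Normalize the panel metric against a demographic baseline so shifts in the
# vendor's panel composition don't masquerade as real demand changes.
raw["metric_norm"] = raw["metric"] / raw["panel_population"]  # placeholder columns

# Hand off to the same KPI back-test used for existing data sets.
# run_kpi_backtest(raw, kpi="sales_mm")   # hypothetical downstream function
```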

The Intricacies of Licensing in the Age of AI

As someone who’s been on both sides of the spectrum — selling and buying data — I can attest to the ever-evolving challenges in navigating this territory. Like they say in Caddyshack — it’s a doozy, Judge…

Licensing & The Sell Side vs. Buy Side Dilemma

Data vendors often face a crucial decision: should they cater to the sell side or the buy side? The conundrum arises from the potential of sell side firms broadcasting the data to all of their clients, reducing its unique utility for buy side firms. Cannibalization for one, please. Many vendors, therefore, choose to target the buy side directly after an initial promotional phase with the sell side, seeing more potential in this space — then pick up their prospect pipelines on the backend (not a bad strategy). An example would be selling a one-year redistribution licensing agreement to a bank, tracking who the end recipients are, then absorbing the direct list in year 2, 3, 4…

Language Models & Licensing Ambiguities

How could we have a blog post without talking about AI and ChatGPT? The reality is that with these large language models, licensing becomes even murkier. These models, trained on a plethora of data from open web sources to proprietary datasets, pose the question: what’s the legality of using them if the training data wasn’t explicitly permitted for such use? Especially in finance, where data is often behind impenetrable walls and wrapped in complex contracts, it’s crucial to ensure we’re on the right side of the law.

Derivative Works: It Gets a Little Grey

Having been involved in creating and buying data sets, I’ve seen how intricate the concept of derivative work is. If a model produces an aggregated metric, the way it’s used and distributed is paramount. Whether it’s incorporated into a quant model or displayed on the screens of multiple portfolio managers, licensing needs to cover every scenario.

Furthermore, the nature of derivative work has evolved. It’s not just about numbers on a spreadsheet anymore. It could be a thought or insight articulated in a sentence or paragraph. This makes tracing back to the original data more challenging but emphasizes the importance of proper licensing even more.

Derivative Products & Rights

A burning question nowadays revolves around the rights of derivative products. If you train a model and create weights in a large language model, that model is technically yours. But what if the data it was trained on wasn’t licensed for such derivative creation? This is a question echoing in many sectors, not just finance.

When engaging with data vendors, the licensing parameters are usually well-defined. Sophisticated vendors often categorize licenses based on use cases. But with the rise of tools like ChatGPT, the lines become blurrier. Once a model is trained, determining its usage and ensuring appropriate licensing for derivative content, especially if shared widely, becomes critical.

The Blockchain Solution

How the hell do we grapple with the challenges of tracing and crediting derivative data products to their original source, both on the contracting side and the data itself? Could blockchain technology, with its immutable, transparent, and decentralized nature, offer a unique solution to this problem? Perhaps. Let’s break it down, Charlie Brown:

Timestamping: Every transaction or data entry on the blockchain is time-stamped. This can prove the exact moment a piece of data was created or modified, ensuring that the original source or author can always prove their primacy.

Smart Contracts: These are self-executing contracts with the terms of the agreement directly written into code. In the context of data licensing, smart contracts could automatically execute terms of licensing agreements when certain conditions are met. For example, if a derivative product is created, the smart contract could automatically ensure that appropriate royalties or acknowledgments are given to the original data source.

Transparent Tracking: As derivative products are developed from original data, each transformation or derivation can be logged on the blockchain. This creates a transparent and traceable lineage of data, making it clear how original data has been used or transformed over time (see the toy sketch just after this list).

Digital Identity: Blockchain can help in establishing a robust digital identity for authors or original data sources. This identity can be referenced each time the data is used or a derivative product is created, ensuring credit is always given where it’s due.

Direct Transactions: With blockchain, it’s possible for content creators and those wishing to use their data to transact directly without intermediaries. This can simplify the process of obtaining permissions and ensure that the original authors are compensated fairly.

Licensing Templates: Blockchain platforms could offer standardized licensing templates that define how data can be used, transformed, or shared. These templates can then be easily referenced or included in smart contracts, streamlining the licensing process.
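To make the bookkeeping above more tangible, here is a toy, off-chain Python sketch: datasets and derivatives are registered by hash with a timestamp and an owner identity, and registering a derivative triggers a placeholder royalty/acknowledgment step, the way a smart contract might. None of this is a real chain integration; it only shows the shape of the records:

```python
import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class Record:
    data_hash: str           # fingerprint of the dataset or derivative
    owner: str               # digital identity of the author / original source
    parent_hash: str | None  # None for original data, set for derivatives
    timestamp: float = field(default_factory=time.time)  # timestamping

class DataLineageRegistry:
    """Toy, in-memory stand-in for an on-chain registry of data lineage."""

    def __init__(self):
        self.records: dict[str, Record] = {}

    def register(self, payload: bytes, owner: str, parent_hash: str | None = None) -> str:
        data_hash = hashlib.sha256(payload).hexdigest()
        self.records[data_hash] = Record(data_hash, owner, parent_hash)
        # "Smart contract" hook: when a derivative is registered, credit the
        # original source automatically (royalty logic is just a placeholder).
        if parent_hash is not None:
            original = self.records[parent_hash]
            print(f"Royalty/acknowledgment owed to {original.owner} for {parent_hash[:8]}")
        return data_hash

    def lineage(self, data_hash: str) -> list[str]:
        # Transparent tracking: walk the chain of parents back to the original.
        chain = []
        while data_hash is not None:
            record = self.records[data_hash]
            chain.append(f"{record.owner}@{record.timestamp:.0f}")
            data_hash = record.parent_hash
        return chain

registry = DataLineageRegistry()
original = registry.register(b"raw ticket-level panel", owner="VendorA")
derived  = registry.register(b"aggregated weekly KPI feed", owner="QuantDeskB", parent_hash=original)
print(registry.lineage(derived))   # the derivative traces back to VendorA, with timestamps
```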

Compute-to-Data: The Blockchain’s Answer to Data Privacy and Monetization (hey Ocean Protocol)

One of the most groundbreaking benefits of blockchain in the realm of data management and artificial intelligence is the concept of Compute-to-Data. Core to our mission, this principle acknowledges a simple yet profound truth: the most valuable data is often private (which is why BDP focuses on Private Data). And while leveraging this private data can lead to enhanced research and business outcomes, the ever-present concerns over privacy and control make it challenging to harness its full potential.

Compute-to-Data is designed to address this conundrum head-on. Instead of sharing private data directly, this mechanism allows data owners to grant specific access to it. Here’s a breakdown of its key benefits:

Controlled Access: Data owners have the autonomy to approve which AI algorithms can run on their data. This ensures that only trusted algorithms have access, maintaining the sanctity and privacy of the data.

Remote Computation: Compute-to-Data orchestrates the execution of computations remotely on the data, allowing AI models to be trained without ever compromising the data’s confidentiality (a toy sketch of this flow follows the list).

Monetization Potential (also available in the current BDP version): One of the most enticing aspects of Compute-to-Data is its monetization capabilities. Data owners can publish their datasets and designate Compute-to-Data as the sole access mechanism. By doing so, they can set a price for access, unlocking a revenue stream from previously untapped data.

Enhanced Security: While metadata remains publicly accessible, references to the actual data files are encrypted, ensuring the data’s security.

Limited User Access: End-users, such as businesses or researchers leveraging the data, will only receive the status of their computation and the final results. Direct access to the asset remains restricted.

Trust and Data Protection: One pivotal concern when dealing with private data is the risk of personally identifiable information (PII) leakage. Compute-to-Data addresses this by emphasizing functions that aggregate data, like averaging, which inherently protect PII. Many AI algorithms also function by aggregating information, further minimizing risks.

Autonomy in Trust Decisions: The decision of which algorithms to trust rests with the marketplace or the data owner, the entity that bears the risk of potential data exposure. This ensures that those with the most at stake have the final say, allowing them to make choices aligned with their risk-reward preferences.
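Here is a toy sketch of that flow: the data owner approves an algorithm, the computation runs next to the private data, and the consumer only ever receives the aggregated result. The class names, the approval rule, and the payment check are illustrative and are not Ocean Protocol’s actual API:

```python
import statistics
from typing import Callable

class DataOwner:
    """Toy model of a Compute-to-Data host: the algorithm travels to the data."""

    def __init__(self, private_rows: list[dict], price: float):
        self._private_rows = private_rows      # never leaves the owner's environment
        self.price = price                     # monetization: price per compute job
        self._approved: set[str] = set()

    def approve_algorithm(self, algo_name: str) -> None:
        # Controlled access: the owner decides which algorithms may run.
        self._approved.add(algo_name)

    def run(self, algo_name: str, algo: Callable[[list[dict]], float], payment: float) -> float:
        if algo_name not in self._approved:
            raise PermissionError(f"Algorithm {algo_name!r} not approved by data owner")
        if payment < self.price:
            raise PermissionError("Insufficient payment for compute access")
        # Remote computation: runs here, next to the data; only the aggregate leaves.
        return algo(self._private_rows)

# An aggregating algorithm: averaging inherently blurs individual (PII-bearing) rows.
def average_spend(rows: list[dict]) -> float:
    return statistics.mean(row["spend"] for row in rows)

owner = DataOwner(private_rows=[{"user": "a", "spend": 120.0}, {"user": "b", "spend": 80.0}], price=10.0)
owner.approve_algorithm("average_spend")
result = owner.run("average_spend", average_spend, payment=10.0)
print(result)   # the consumer sees 100.0, never the underlying rows
```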

To wrap up, there is a revolution currently underway in the data, AI, and Web3 industries. We at BDP are excited to be at the heart of this transformation.
