Innate Data and the Easy AI Advantage

Sam Bobo
Speaking Artificially
5 min read · Apr 21, 2023
Imagined by Bing Image Creation and DALL-E

In a world of Artificial Intelligence, data continuously proves to be a source of competitive advantage for organizations. In a previous article, I outlined the three layers of training required to build a cognitive system (based on natural language) but have yet to delve into the competitive advantage an organization gains by fully applying data to its unique situation.

The 3 Layers of Conversational AI Ground Truth | by Sam Bobo | Speaking Artificially | Medium

While that article identified three layers, (1) Foundational, (2) Industry, and (3) Use Case specific, this article expands on them based on how organizations obtain said data. More specifically, I explore the competitive advantages gained by building data-powered platforms, the virtuous cycle those advantages create (assuming, for now, no regulation), and the immense value organizations can build for customers simply by expanding into AI, or deeper into AI, amidst this proliferation. Let's begin!

Foundational Data

Foundational Data, as the name suggests, is base-level training data that provides a wide corpus or knowledge base for an AI model to glean insights from. The internet, a vast landscape of freely available user-generated content, provides a massive opportunity to rapidly build the base for AI models. For example, those building knowledge and natural language models might scrape the English Wikipedia or Reddit, build a code model by scraping open source projects on GitHub, or build image models by ingesting images from Instagram or Getty Images. Many of the most popular large language models and cognitive engines have used this methodology to build the foundational data layer of their AI.
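
To make the idea concrete, here is a toy sketch of the first step of building a foundational layer: turning a pile of scraped documents into a token-frequency vocabulary. The corpus snippets and the `build_vocabulary` helper are hypothetical stand-ins; real pipelines deduplicate and tokenize terabytes of web text.

```python
from collections import Counter

def build_vocabulary(documents, min_count=2):
    """Build a token-frequency vocabulary from raw documents.

    A toy stand-in for the foundational-data stage: real pipelines
    scrape, filter, and deduplicate web-scale text before tokenizing.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    # Keep only tokens frequent enough to be worth modeling
    return {tok: n for tok, n in counts.items() if n >= min_count}

# Hypothetical scraped snippets standing in for web-scale data
corpus = [
    "Artificial intelligence models learn from data",
    "Foundational data powers artificial intelligence",
]
vocab = build_vocabulary(corpus)
print(vocab["artificial"])  # appears in both documents, so count is 2
```

The `min_count` threshold mirrors, in miniature, the filtering every foundational pipeline applies so that rare noise does not dominate the corpus.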

Most recently in the news, however, legal and business pushback has arisen: Getty Images has sued over the use of its imagery, and platforms like Reddit, understanding the broader trend, specifically within the context of training Large Language Models (“LLMs”), are raising the barrier to adoption by monetizing API access and adding financial pressure on anyone training these systems.

While there is no true source of competitive advantage within Foundational Data, it's helpful to understand the role it plays in overall AI model training and why models can seemingly “appear” overnight using freely available data. One might argue that the internet is the breeding ground for Artificial Intelligence development and has helped catalyze its impact and success within society.

Interaction Data

Searching Google, making an Amazon.com purchase, browsing a show on Netflix, liking or commenting on a Facebook post, “liking” or “following” someone on Twitter: all of these interactions are data points feeding large platforms. These data points, which I am deeming “interaction data,” capture customer preferences from observed behavior on a platform or service, and feed a recommendation algorithm that infers the user's next action on the platform. Such recommendation algorithms are typically tied to an engagement mechanism meant to keep the user on the platform in a perpetual, virtuous (or vicious, depending on your perspective) cycle. These platforms tout value propositions such as: “the platform learns from your interactions and is tailored to you.” This value statement is valid and creates a level of lock-in for users who have spent a significant amount of time tuning the platform to their liking. One such example is the Twitter community and their feeds, only to be squandered by Elon Musk's algorithm changes.
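
The inference loop described above can be sketched with a deliberately simple recommender: score unseen items by how much each user's history overlaps with others'. The `logs` data and scoring rule are illustrative assumptions; production systems use learned embeddings, not raw overlap.

```python
from collections import defaultdict

def recommend(interactions, user, top_k=3):
    """Recommend unseen items weighted by taste overlap with other users.

    interactions: dict mapping user -> set of items they engaged with.
    A minimal sketch of inference over interaction data, not a real
    recommendation algorithm.
    """
    seen = interactions[user]
    scores = defaultdict(int)
    for other, items in interactions.items():
        if other == user:
            continue
        overlap = len(seen & items)  # shared-taste signal
        if not overlap:
            continue
        for item in items - seen:
            scores[item] += overlap
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical interaction logs
logs = {
    "alice": {"show_a", "show_b"},
    "bob":   {"show_a", "show_b", "show_c"},
    "carol": {"show_d"},
}
print(recommend(logs, "alice"))  # bob's overlap surfaces show_c
```

Every new interaction enlarges the sets above, which is exactly the lock-in dynamic: the longer a user feeds the platform, the better its next suggestion, and the costlier it feels to leave.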

Interaction data can give rise to a number of competitive advantages within the market, albeit not without regulatory scrutiny, as Artificial Intelligence continues to evolve at today's ever-accelerating pace:

  1. Two-sided Market Expansion: analyst Ben Thompson coined the term Aggregator for platforms that come to dominate the industries in which they compete in a systematic and predictable way. Aggregators have all three of the following characteristics: (1) a direct relationship with users; (2) zero marginal cost of serving users (via the Internet); and (3) demand-driven multi-sided networks with decreasing acquisition costs. Personalization on one platform massively increases the perceived switching cost of moving to another, and thus lock-in.
  2. First-Party Advantage: while under scrutiny by regulators for the promotional placement of the resulting content, one example is Amazon, which used customer purchase history to launch Amazon-branded products and to feed a roadmap of launches based on popular customer purchases, without having to perform external market research. Platforms can use interaction data to drive product insights that continue to provide immense value to customers as they subscribe and keep using the platform.

Knowledge / Expertise Data

Narrowing our perspective to industry-specific organizations: in certain instances, organizations have built a wealth of expertise in a particular domain from solutions in market, collective organizational knowledge (e.g., from Research and Development), or data simply “sitting around” underutilized, given no particular inroads into AI. In a previous post, I explored the concept of industry-specific AI engines:

Open Source AI and the Counsil of Experts | by Sam Bobo | Apr, 2023 | Medium

This section seeks to build upon that framework. Take, for example, a medical imaging company screening for cancer; let's narrow down and say breast cancer. Over years of operation, the company has amassed hundreds of thousands of scans, labeled as malignant or non-malignant. Barring HIPAA for a moment, should the organization have permission to use those images to train an AI, it would hold an advantage inherent simply to being in market. Evolving into an AI-powered system would be a natural progression and add only marginal cost to bring live (all else equal), since the training data, unique to its domain and specific situation, would already be in hand.
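
The training step this archive enables can be sketched with a nearest-centroid classifier over image feature vectors. Everything here is a hypothetical stand-in: the 2-D features, the labels, and the method itself; a real screening model would be a deep network trained on the full archive of de-identified scans.

```python
def train_centroids(features, labels):
    """Average feature vectors per class: a nearest-centroid sketch.

    A toy stand-in for the real training step on an imaging archive.
    """
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(centroids, vec):
    # Predict the class whose centroid is nearest in squared distance
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Hypothetical 2-D texture features extracted from archived scans
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = ["malignant", "malignant", "benign", "benign"]
model = train_centroids(X, y)
print(classify(model, [0.85, 0.85]))  # lands near the malignant cluster
```

The point of the sketch is the asymmetry: the labeled archive (`X`, `y`) is the scarce asset, while the modeling step is increasingly commodity.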

Bloomberg, known industry-wide for financial analysis and reporting, as well as for its famous terminals, did just that. Bloomberg ingested proprietary financial information (analysis, financial benchmarks, financial reports, charts, etc.) and built BloombergGPT. Specifically (source linked previously):

For more than a decade, Bloomberg has been a trailblazer in its application of AI, Machine Learning, and NLP in finance. Today, Bloomberg supports a very large and diverse set of NLP tasks that will benefit from a new finance-aware language model. Bloomberg researchers pioneered a mixed approach that combines both finance data with general-purpose datasets to train a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.
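
The mixed approach quoted above amounts to sampling training examples from a domain corpus and a general-purpose corpus at some ratio. The sketch below is a loose illustration under assumed names (`mix_corpora`, the example documents, the 0.6 weight); BloombergGPT's actual recipe mixes tokenized corpora at training time and its ratios differ.

```python
import random

def mix_corpora(domain_docs, general_docs, domain_weight=0.5, n=10, seed=7):
    """Interleave domain and general documents by a sampling weight.

    A toy sketch of mixed-corpus training data assembly.
    """
    rng = random.Random(seed)  # seeded so the batch is reproducible
    batch = []
    for _ in range(n):
        pool = domain_docs if rng.random() < domain_weight else general_docs
        batch.append(rng.choice(pool))
    return batch

# Hypothetical stand-ins for proprietary and public corpora
finance = ["10-K filing", "earnings call transcript"]
general = ["wikipedia article", "news story"]
batch = mix_corpora(finance, general, domain_weight=0.6)
share = sum(doc in finance for doc in batch) / len(batch)
print(f"finance share of batch: {share:.0%}")
```

Tuning `domain_weight` is the lever the quote describes: enough domain data to win on financial benchmarks, enough general data to stay competitive on general-purpose ones.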

This is exactly how to extend competitive advantage in a world of artificial intelligence while also contributing to the broader AI ecosystem (such as via ChatGPT plugins). No other company can match the unique combination of brand and data required to replicate the corpus powering this AI, and so the data itself becomes a source of competitive advantage.

As demonstrated, data generated in-house, whether intentionally through a data-powered platform or collected simply in the daily course of business, can be employed to build collective intelligence for the broader populace and monetized in a manner that extends a brand and builds competitive advantage in the market. I anticipate we will see more and more monetization and paywalls created across the spectrum (foundational to institutional), and it will be interesting to analyze the ramifications, but that is for another post!



Product Manager of Artificial Intelligence, Conversational AI, and Enterprise Transformation | Former IBM Watson | https://www.linkedin.com/in/sambobo/