The Missing Piece of Infrastructure to Solve Generative AI’s “Hell Scenario”

Shawn Flaherty
Published in Tranquil Data
Apr 28, 2023

[May 1st, 2023 update: OpenAI did enough for the Italian regulators to lift the ban on the service in Italy. They picked off the low-hanging fruit (e.g. age confirmation, privacy policy modifications). The largest risk discussed in this article, the question of what legal basis OpenAI had to process people's information in the first place, remains.]

Standard planning procedures for risk assume that decision makers have an understanding of the probability and impact of risks. Based on this understanding, a plan can be developed by assigning probability and impact scores to each risk, ranking them, and allocating resources accordingly. The problem for Generative AI is that the probabilities and impacts of the risks are not well-understood because it is a net new technology that seems to be affecting everything, everywhere, all at once.

Given the uncertainty surrounding risk probabilities and impact, how can Generative AI companies determine which risks to prioritize? A good starting point is acknowledging the following categories of risks:

  • Unknown-unknowns: the risks that are not yet known or considered
  • Known-unknowns: the risks that are recognized, but lack a clear understanding of probability and impact
  • Known-knowns: the risks that exist today, whose impacts are best understood

The unknown-unknowns may be the most dangerous, but given that they are not knowable, the only action is to make them known. The known-unknowns have received widespread attention from the media and critics because ChatGPT has produced examples of them, and many align with virtues and popular causes: hallucinations, racism and bias, automation bias, bad actors. We know these risks exist, but we don't understand how often they occur or what their impact will be. The OpenAI team has done a good job of communicating their approach to the known-unknowns, most of which can be framed as the alignment problem: making artificial intelligence align with human values and follow human intent.

The known-knowns are most actionable because they exist today, and their impact can be best understood. Two known-knowns that are actionable today are the risk of (1) blanket regulatory bans (the “Hell Scenario”) and (2) blockers to widespread enterprise adoption. Both of these risks are related to building trust and transparency with customers and regulators that data is collected, used, and shared properly (our software at Tranquil Data solves this problem, which will be discussed later).

Risk of Regulatory Bans: The Italy Problem

OpenAI geoblocked access to ChatGPT in Italy on April 1st after an order by Italy's Data Protection Authority (DPA). The DPA issued the order over concerns that OpenAI violates the GDPR by:

(1) unlawfully processing personal data;

(2) not allowing users to rectify erroneous information;

(3) suffering data breaches; and

(4) not preventing minors from accessing the platform.

Known-known problems should be prioritized based on their impact. Why, then, heavily resource the Italy problem if Italy represents only 0.8% of the world's population? Despite Sam Altman's maybe-cheeky tweet, there is good reason to believe Italy is the first domino to fall in Europe, due to a legal technicality called "main establishment." OpenAI does not have a main establishment in the EU, and is therefore exposed to regulation by each of the 27 member states, rather than the single lead regulator it would face if it had one. The prospect of facing complex regulation across Europe (10% of the world's population), with regulators around the world waiting in the wings, should make Italy a top priority.

The regulatory hurdle that will be hardest for OpenAI to overcome is the charge that they illegally process personal information. Under the GDPR, processing includes collection, structuring, and storage (among other actions). This means that OpenAI has only two paths to satisfy regulators: prove they do not collect or store personal information, or provide a legal basis for processing. There is no debate that OpenAI currently processes personal information, like names; ask ChatGPT who the richest person in Italy is, and it will answer with one.

The only realistic legal bases for OpenAI to process personal information are to get the named individual's consent or to have a legitimate interest. In the case of web crawling, obtaining consent is a non-starter, as it would require getting the consent of roughly everyone who uses the internet (including Leonardo).

That leaves the “legitimate interest” three-part test as the only legal basis for processing personal data:

1. Purpose test — is there a legitimate interest for processing?

According to the ICO, “because the term ‘legitimate interest’ is broad, the interests do not have to be very compelling.” In Leonardo’s case, answering my question is an example of a legitimate interest. In Google’s case, organizing the internet was.

2. Necessity test — is the processing necessary for that purpose?

This question asks whether there would be a less intrusive way of answering my question. It doesn’t seem there is in the case of my question about Italy’s richest person.

3. Balancing test — is the legitimate interest overridden by the individual’s interests, rights or freedoms?

The balancing test requires weighing "the interests or fundamental rights and freedoms of the data subject," including physical or financial harm and the loss of control over their data, against OpenAI's legitimate and necessary interests. In particular, the GDPR is clear that the interests of the individual could override a company's legitimate interests if personal data is processed in ways individuals do not reasonably expect. For example, would the late Leonardo expect ChatGPT to be able to answer the question of who is the richest man in Italy? Probably.

Leonardo might feel differently about his address, even though the fact that he lived at Palazzo Vidia in Milan is widely available on the internet and on Google Maps. In the case of his address, OpenAI either built logic to not scrape personal addresses, or understood that this information needed to be anonymized after scraping.

Experts believe that OpenAI's entire argument for lawful processing will hinge on this nuanced detail: how much transparency is provided into what data is collected, where it comes from, and what is done with it, and whether that practice outweighs individuals' expectations of privacy.
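
To make the three-part test concrete, here is a minimal sketch, in Python, of how the answers to the purpose, necessity, and balancing questions could be captured as structured data rather than left implicit. The class and field names are illustrative assumptions, not any regulator's checklist or OpenAI's actual tooling.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class LegitimateInterestAssessment:
    """Illustrative record of the GDPR legitimate-interest three-part test."""
    processing_purpose: str                    # purpose test: the interest being claimed
    purpose_is_legitimate: bool
    less_intrusive_alternative: Optional[str]  # necessity test: None if no alternative exists
    data_subject_expectation: str              # balancing test: what the individual would expect
    overriding_rights: list[str] = field(default_factory=list)  # e.g. ["loss of control over data"]
    assessed_on: date = field(default_factory=date.today)

    def passes(self) -> bool:
        """Processing has a legal basis only if all three prongs hold."""
        return (
            self.purpose_is_legitimate
            and self.less_intrusive_alternative is None
            and not self.overriding_rights
        )

# Hypothetical assessment mirroring the "richest person in Italy" example above.
assessment = LegitimateInterestAssessment(
    processing_purpose="answer a user's factual question about a public figure",
    purpose_is_legitimate=True,
    less_intrusive_alternative=None,
    data_subject_expectation="a public figure expects widely reported facts to be repeated",
)
print(assessment.passes())  # True under these assumed answers
```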

ChatGPT, and all LLMs, will never be able to explain to regulators how their models go from inputs to outputs. Thus, OpenAI and all other generative AI companies must build a system of record that captures the context of data before it is aggregated, and is capable of answering questions like the following (a rough sketch of such a record appears after the lists):

For user-inputted data:

- Where is a user and what is their age?

- Has a user affirmatively consented to their data being used for training?

- When was the consent?

- Was consent ever withdrawn?

- Can I show what data was shared and what it was used for at any given time under any given consent choice?

For web-scraped data:

- When was data collected?

- Where was data collected (e.g. what website was it, how publicly was it shared)?

- What policies were in place to ensure sensitive information (e.g. bank accounts, emails, addresses, social security numbers) was not scraped?

- What policies were in place at any given time to ensure that if sensitive information was scraped it was anonymized or deleted?
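
As a rough illustration of the kind of context record that could answer these questions, here is a minimal sketch under assumed field names. It is not OpenAI's or Tranquil Data's actual schema; the point is that each piece of data carries its provenance and consent history with it before aggregation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class ConsentEvent:
    granted: bool                 # True = consent given, False = consent withdrawn
    timestamp: datetime
    scope: str                    # e.g. "use for model training"

@dataclass
class DataContext:
    """Illustrative provenance record attached to data before it is aggregated."""
    source: str                            # user ID, or the URL the data was scraped from
    collected_at: datetime
    user_location: Optional[str] = None
    user_age: Optional[int] = None
    consent_history: list[ConsentEvent] = field(default_factory=list)
    publicity: Optional[str] = None                # for scraped data: how publicly it was shared
    scraping_policy_version: Optional[str] = None  # policy in force when the data was collected
    sensitive_fields_removed: list[str] = field(default_factory=list)  # e.g. ["home address"]

    def consented_for_training(self, as_of: datetime) -> bool:
        """Was affirmative, un-withdrawn consent for training in place at a given time?"""
        events = sorted(
            (e for e in self.consent_history
             if e.scope == "use for model training" and e.timestamp <= as_of),
            key=lambda e: e.timestamp,
        )
        return bool(events) and events[-1].granted
```

A record like this makes the consent and provenance questions above answerable per datum and per point in time, which is what both the balancing test and regulators' transparency demands require.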

OpenAI needs to be able to answer these questions to prove that their practices are reasonable. If they can, they should be able to demonstrate that they meet the balancing test (and as a bonus will be prepared for the host of lawsuits that will decide whether their scraping is fair use or IP theft).

The need to answer these types of questions is not unique to Generative AI or LLMs. All modern platform businesses take on data from disparate sources that have unique rules about how that data can be used. These rules can be driven by user consents, state, federal, and international regulatory schemes, and business contracts. The manual process of documenting where data came from and the rules around its use, and then training engineers and relying on them to ensure it is all done properly, always breaks at scale (see the recent Twitter, Facebook, GoodRx, and UK TikTok examples).

This complexity is what led our team at Tranquil Data to build the first system of record for data context. It captures all versions of business policies and regulations, connects those to metadata about data across services, and relates this to knowledge about an individual's attributes and relationships over time. The result is a graph dataset that speaks with integrity to the context of where data came from, why you have it, and what you may do with it. This knowledge is input to real-time, policy-driven enforcement within the data platform, automating always-on correct use and providing a transparent audit trail showing why any given use was allowed, denied, or filtered.
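
Conceptually, that enforcement step reduces to evaluating each proposed use of data against the policies attached to its context and recording why the decision was made. The sketch below is a simplified illustration with made-up names (minors_policy, enforce), not Tranquil Data's actual API:

```python
from datetime import datetime, timezone
from typing import Callable

# A policy is a named rule that, given a data-context record and a proposed use,
# returns a verdict ("allow", "deny", or "filter") plus a human-readable reason.
Policy = Callable[[dict, str], tuple[str, str]]

def minors_policy(context: dict, proposed_use: str) -> tuple[str, str]:
    # Hypothetical rule: never use a minor's data for model training.
    if proposed_use == "training" and (context.get("user_age") or 99) < 18:
        return "deny", "data subject is a minor"
    return "allow", "no age restriction triggered"

def enforce(context: dict, proposed_use: str, policies: list[Policy]) -> dict:
    """Evaluate every policy and emit an audit record explaining the outcome."""
    decisions = [(p.__name__, *p(context, proposed_use)) for p in policies]
    verdicts = [d[1] for d in decisions]
    outcome = "deny" if "deny" in verdicts else "filter" if "filter" in verdicts else "allow"
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "proposed_use": proposed_use,
        "outcome": outcome,
        "decisions": decisions,  # the transparent audit trail: which rule decided what, and why
    }

audit = enforce({"user_age": 15}, "training", [minors_policy])
print(audit["outcome"])  # "deny", with the reason preserved for regulators to inspect
```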

The key for OpenAI to solve the Italy problem will be the ability to transparently show regulators what data was collected, from where, and what was done with it.

The Enterprise Problem

OpenAI may choose to go to market with ChatGPT using a bottom-up strategy that focuses on individual users and SMBs. To become the most valuable company ever, they will need to close enterprise accounts, like JPMorgan, Amazon, Verizon, and Accenture, all of which have already restricted employee use. These companies do not trust ChatGPT with their sensitive data, and they should not, given that ChatGPT tells users "not to share any sensitive information in your conversations." Presumably this is because text entered into the chat will be used for training, and could later be included in ChatGPT outputs shared with third parties. Furthermore, employees can't be trusted to understand what is and isn't sensitive, and are incentivized to use the tool because of its utility. See this example of employees who shared sensitive information with ChatGPT shortly after an internal ban was lifted at Samsung.

To adopt technologies like ChatGPT, these enterprise prospects need to be able to share sensitive information, and OpenAI will need to provide transparency that sensitive information shared in conversations is either (1) not used for training, or (2) used for training only in the context of a model that is segmented to that specific organization. OpenAI will need to build a system of record for segmentation (including the ability to answer the same questions outlined above, which our software handles) to show enterprise customers that their data will not be shared or used other than for their purposes. Once this capability is in place, enterprise customers can share sensitive information and enrich the models with their own proprietary data sets to improve the utility of the service.
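
One way to picture the segmentation requirement is a gate that only admits data into a tenant-specific training set when the data's context record names that tenant and the tenant has approved training use. The sketch below is hypothetical, not a description of OpenAI's systems:

```python
from dataclasses import dataclass

@dataclass
class EnterpriseRecord:
    tenant_id: str                 # the organization the data belongs to
    content: str
    approved_for_training: bool    # per the tenant's contract or consent choices

def training_corpus_for(tenant_id: str, records: list[EnterpriseRecord]) -> list[str]:
    """Only this tenant's approved data ever reaches this tenant's segmented model."""
    return [r.content for r in records
            if r.tenant_id == tenant_id and r.approved_for_training]

records = [
    EnterpriseRecord("acme", "internal product roadmap", approved_for_training=True),
    EnterpriseRecord("globex", "draft earnings release", approved_for_training=True),
]
# Acme's segmented model never sees Globex's data, and vice versa.
assert training_corpus_for("acme", records) == ["internal product roadmap"]
```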

The key for OpenAI (and all Generative AI companies) will be to ensure that as data is taken on, there’s a way to prove to regulators and enterprise customers alike that data is being used properly. This calls for a net new piece of infrastructure that captures the context of data before it is aggregated, and creates transparency to show external stakeholders that their requirements are met.

The missing platform piece is the software we’ve built at Tranquil Data, which is becoming a core component of next-generation AI Infrastructure.
