Why OpenAI’s GPT-3 is a good product, not just a groundbreaking model

Sam Stone
Structured Ramblings
Feb 22, 2021 · 7 min read

After months on the waiting list, I recently received access to GPT-3, OpenAI’s powerful multipurpose language model. GPT-3 was released in June 2020, and there’s been a wealth of writing since then about its technical breakthroughs, shortcomings, and potential societal impact. (Here are two pieces I found particularly interesting and insightful.) Since it’s well-trod territory, I won’t comment on the technical aspects of GPT-3 beyond noting that the creativity and cogency of its responses FAR outperformed my expectations.

Instead, this post discusses the GPT-3 beta from a product management perspective. Not only is GPT-3 a groundbreaking model, it’s also a very good product. Those are two different things, and the reasons it’s a very good product are broadly applicable to other algorithm-based products, including those outside the language space.

What makes GPT-3 a good product is that it embraces constraints, specifically constraints around users, constraints around interface, and constraints around outputs.

Constraint 1: A Narrow (but Non-Obvious) User Type

The first constraint was made obvious to me months before I got access to GPT-3, by virtue of the fact that I had to wait months to get access: this beta is not for everyone. I assumed it was just for developers, but two surprises awaited me:

The first surprise was the ease of use. Using the beta required writing no code, as everything could be done via the UI. And the readable, extensive non-technical examples meant I barely even had to reference the documentation.

The second surprise was the intentional barriers to using GPT-3 for production use cases. For example, you have to start jumping through hoops if you want to serve outputs to more than 5 people, even if they’re your colleagues. Here’s the warning:

This made me pause and wonder, “Exactly what job to be done is OpenAI trying to satisfy with this beta?” Clearly, it’s not testing how the model responds to high volumes of requests or other challenges of scale. Nor is it testing out monetization strategies or willingness to pay (the beta is free, and while users have to pay beyond a usage threshold, that threshold is not prominent).

The reliance on a clean, simple UI and on examples over documentation indicates that a non-technical user is top of mind. I think OpenAI’s goals are to (1) enable users to apply GPT-3 to a wider variety of scenarios and (2) measure users’ engagement across those scenarios. I’d call these users “idea generators”, because what OpenAI is really trying to understand is against what problems (ideas) users most want to apply this versatile new tool (the language model). It’s the reverse of the normal product management paradigm, where we identify a problem first and only afterwards a solution. Having clear non-goals (e.g. scalability, monetization) allowed OpenAI to focus on and better serve this “idea generator” user type, for example by adding no-code features.

This makes sense as a long-term strategy for OpenAI, given their funding affords them a longer horizon. Make it easy for idea generators to show, via their actions, what they want GPT-3 to do — and then scale and monetize those applications later.

Constraint 2: A Single Interface

GPT-3 can do A LOT of different things. Here’s just a sampling of use cases, with users creating more each day.

However, there is just one main interface for GPT-3. If you’re calling via code, there’s one main API endpoint [1], and if you’re using the GUI, it looks like this:

The same UI applies across all GPT-3 use cases.

This is a new take on the paradigm for delivering complex AI services. Today, most multipurpose AI providers serve each use case through its own interface: even if the same model is powering multiple or all of the use cases, developers need to learn different interfaces for different use cases (or call different services).

GPT-3 adheres to a different structure, in which all use cases flow through the same interface — and thus all use cases are subject to the same parameters. This doesn’t mean the parameters should be set to the same values for different applications — but the menu of options always looks the same.

Here’s what the menu of options looks like. While the parameter names are not all intuitive (e.g. “Top P”), none are technical jargon. And note the nice tooltip that explains what “engine” means in non-technical terms. Tooltips with simple language are available for every parameter.

The same menu of parameters applies across GPT-3 use cases.

With any AI product, testing new use cases requires parameter tuning. In the traditional paradigm, there’s limited (or zero) cross-application learning about parameter tuning, since the parameters differ, or don’t overlap at all, between use cases. But with the GPT-3 paradigm, the more use cases I tried, the better I got at tuning the model — because I was always adjusting the same set of parameters, regardless of use case.
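To make that concrete, here’s a minimal sketch of what the single-interface paradigm looks like from code, assuming the completions endpoint described in footnote [1] and request-body fields that mirror the GUI parameters (prompt, response length, temperature, Top P). The engine id and parameter values here are purely illustrative, not recommendations.

```python
import os
import requests

# Sketch only: endpoint taken from footnote [1]; field names assumed to
# mirror the GUI parameters. Engine id and values are illustrative.
API_KEY = os.environ["OPENAI_API_KEY"]
ENGINE_ID = "davinci"  # assumed engine id, for illustration

shared_params = {
    "max_tokens": 64,   # "response length" in the GUI
    "temperature": 0.7,
    "top_p": 1.0,       # "Top P" in the GUI
}

prompts = {
    "summarization": (
        "Summarize for a second-grader:\n"
        "Jupiter is the fifth planet from the Sun and the largest planet "
        "in the Solar System.\nSummary:"
    ),
    "q&a": "Q: Who wrote Pride and Prejudice?\nA:",
}

# Two very different use cases, one endpoint, one menu of parameters.
for use_case, prompt in prompts.items():
    resp = requests.post(
        f"https://api.openai.com/v1/engines/{ENGINE_ID}/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, **shared_params},
    )
    print(use_case, "->", resp.json()["choices"][0]["text"].strip())
```

Whatever the use case, the tuning loop is the same: change the prompt, nudge the same handful of parameters, and compare outputs.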

This paradigm probably breaks down for squeezing out the last few drops of performance; to do that, you probably DO need to adjust different parameters under the hood for language translation vs. Q&A vs. summarization, etc. But this gets back to constraint #1 and the focus on the idea generator (not the optimizer, a different user type).

Constraint 3: Public Declarations of Failure

Language models have had some spectacular, and terrifying, failures. Microsoft launched a high-profile Twitter-bot that started spewing hate speech and was taken down within days of launch. And GPT-3 has also attracted attention for its potential to generate offensive text. So when I finally got beta access, I was particularly interested to explore how it dealt with sensitive topics.

I was pleasantly surprised to find obvious and explicit warnings when the prompt or the model’s response strayed into sensitive topics. It is not hard to make GPT-3 generate racist or sexist text. But, in my experience, it is hard to make it do so without eliciting a blindingly-obvious half-page yellow or red warning shown below.

Two aspects of this warning system are particularly notable. First, it quite obviously errs on the side of caution — it returns many more false positives (responses that are not problematic, but get flagged as such) than false negatives (I couldn’t generate a problematic response that didn’t get flagged). Second, the warning system’s output is very simple: text is labeled safe, sensitive, or unsafe. There are no probability estimates, dimensions of safeness or sensitivity, or identification of the particular trigger for the warning.

This means that if the user wants a fine-grained understanding of how sensitive text is, why it’s sensitive, or how to deal with it, the onus is on the user to try to figure that out. That requires extra work — and I’d bet most beta users (who are idea generators, not content moderation specialists) won’t want to do that extra work. That leaves these idea generators with two options: defer to OpenAI’s “err on the side of caution” approach, or ignore the ternary rating system and potentially release a headline-grabbing racist-bot.
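To spell out that choice, here’s a hypothetical sketch of the two paths an application could take once it has one of the three labels. The classify_content helper is invented for illustration; it is not an actual OpenAI API call, just a stand-in for whatever produces the safe/sensitive/unsafe label.

```python
from typing import Literal, Optional

Label = Literal["safe", "sensitive", "unsafe"]

def classify_content(text: str) -> Label:
    """Hypothetical stand-in for whatever returns the ternary label."""
    raise NotImplementedError

def deliver_cautiously(completion: str) -> Optional[str]:
    # Option 1: defer to OpenAI's "err on the side of caution" approach.
    # Anything flagged sensitive or unsafe is withheld, accepting the
    # false positives described above as the price of caution.
    return completion if classify_content(completion) == "safe" else None

def deliver_regardless(completion: str) -> str:
    # Option 2: ignore the ternary rating entirely; this is the path that
    # risks a headline-grabbing racist-bot.
    return completion
```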

By keeping content warnings simple — and forcing users to make this rather stark choice — I’d like to think OpenAI coaxes more users into the “err on the side of caution” approach. I could, of course, be wrong about this; I’m making assumptions about the morality of idea generators. Perhaps this choice leads more idea generators to just pass along all content, regardless of warning, to end users. If this is the case, I hope that GPT-3 is instrumented well enough that OpenAI can recognize such applications and intervene.

Language models appear poised to get more powerful, and perhaps significantly more powerful in a very short timeframe. GPT-3 has 175 billion learned parameters; in January 2021, Google announced they’d trained a language model with over a trillion learned parameters. Google hasn’t released this model to the general public, and it’s unclear that just adding more parameters makes for a better model — but given the wide variety of ways that researchers are pushing the frontier, it seems a safe bet that the state of the art will keep advancing.

But better models don’t necessarily translate into better products. The teams working on these models would do well to pay attention to the design principles embedded in the full product experience that surrounds GPT-3, and apply and evolve those principles as new and different models emerge.

[1] The primary endpoint is a POST call to https://api.openai.com/v1/engines/{engine_id}/completions. There’s also a GET endpoint and a few other endpoints that serve metadata, e.g. which model engines are available.
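For what it’s worth, a rough sketch of that metadata call might look like the following; I’m assuming the engine list lives at /v1/engines and comes back under a "data" key, so treat the exact response shape as an assumption rather than documentation.

```python
import os
import requests

# Sketch: list the available engines (metadata), as opposed to the POST
# completions call above. The response shape here is assumed.
resp = requests.get(
    "https://api.openai.com/v1/engines",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
)
print([engine["id"] for engine in resp.json().get("data", [])])
```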
