OpenAI GPT-3 and Prompt Engineering
OpenAI released their GPT-3 language model in June 2020. It has 175 billion parameters, over 100x more than their previous iteration, GPT-2. That in itself is a huge feat of engineering, but this post is dedicated to digging deep into "prompts": a way to get the model to produce text by specifying an instruction in natural language and showing some demonstrations of how to follow the instruction well.
What is GPT-3?
GPT-3 is the latest language model created by OpenAI. OpenAI is a for-profit company based in San Francisco doing some amazing work in the field of Artificial Intelligence (AI). Their goal is to solve intelligence, which means creating an artificial entity that is as smart and capable as the smartest human who ever lived. But the intelligence of one human is not comparable with that of all the others, so in order to be the best, it has to have what is called "general intelligence": a combination of all the kinds of demonstrable intelligence humans have. Intelligence comes in many forms, and it could be possible for an AI system to demonstrate all of them.
We are not there yet, but GPT-3 has shown some promise in this direction by demonstrating what is called "meta-learning". Meta-learning here means learning many different concepts and real-world ideas and representing them inside the language model in its own internal form. This internal form is different from the English language, but the model understands English very well (as well as other languages), and it can use these representations to perform diverse tasks when asked to.

A language model does this by representing concepts in a compressed, many-dimensional space inside the neural network. Representing language is complex: the model has to embed not only language features like grammar, syntax, and rules, but also concepts about the real world (which we humans have learned to do involuntarily through evolution). The concepts have almost stochastic connections with each other, which are hard to visualize in just 3 dimensions, so the network learns to embed them in a many-dimensional space that is hard for our brains to make sense of. These concepts are also heavily compressed and understood in a coarse, abstract form, stripping away the details and retaining only the most important features, so that they can be easily clustered and combined with other concepts. The network compresses high-entropy data from the real world through some kind of bottleneck, which stores the information using as little space as possible, and at inference time this encoded data is decoded to recover real-world data from it. It's encoding information in, and decoding information out.
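To make that encode/decode idea concrete, here is a minimal toy sketch of a bottleneck in PyTorch. This only illustrates the compression intuition, not how GPT-3 is actually built; all the dimensions are made up.

```python
import torch
import torch.nn as nn

# Toy illustration of the encode/compress/decode idea described above.
# The sizes are arbitrary; a real language model is vastly larger and
# does not literally use this architecture.
class ToyBottleneck(nn.Module):
    def __init__(self, input_dim=1024, bottleneck_dim=32):
        super().__init__()
        # Encoder: squeeze high-entropy input into a small code
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck_dim))
        # Decoder: reconstruct the original from the compressed code
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        code = self.encoder(x)     # information in
        return self.decoder(code)  # information out

x = torch.randn(8, 1024)           # a batch of fake "real world" data
model = ToyBottleneck()
print(model(x).shape)              # torch.Size([8, 1024])
```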
GPT-3 uses an architecture called the "Transformer". The original Transformer consists of several neural network layers stacked into separate encoder and decoder blocks; GPT-3 uses the decoder-only variant. The Transformer takes in an input sequence of sentences and processes each input word to predict the next words in the sequence. Each layer has a "self-attention" block with several heads doing the same task: process the current word in the input, compare it with all the previous input words, and produce some kind of contextual representation for the sequence up to that point. In other words, the attention mechanism tells the Transformer which parts of the input sequence it should focus on so that it can correctly predict the next words. And since there are multiple heads, at the end of the layer we get several different contextual representations of the input sequence, which are combined (concatenated and projected) into a final representation used by the next layer. These multiple representations help in combining several different concepts together and figuring out which ones the model should pay more attention to, from the list of all possible concepts it has already learned during training. And as you can imagine, a 175-billion-parameter model can store a hell of a lot of concepts learned from the real world, condensed in a form only the model understands.
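Here is a minimal sketch of what a single self-attention head computes, with a causal mask so each word only looks at itself and earlier words. The toy sizes are my own; a real layer runs many heads in parallel and wraps this in projections, residual connections, and normalization.

```python
import torch
import torch.nn.functional as F

# One self-attention head (scaled dot-product attention) with a causal mask.
def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    # Causal mask: each position may only attend to itself and earlier
    # words, which is what lets the model predict the *next* word.
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # "where to focus"
    return weights @ v                               # contextual representation

seq_len, d_model = 5, 16                             # toy sizes
x = torch.randn(seq_len, d_model)                    # embedded input words
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # torch.Size([5, 16])
```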
But having such a big model has its own disadvantages. For one, you cannot fit it in your pocket, so it has to reside in a secure server farm. It has to be secure, as these models could be dangerous in the wrong hands (think disinformation campaigns). Second, it's nearly impossible for someone with finite computing resources to get as good as OpenAI at training these huge models; you need cash and time. So OpenAI has come up with a plan: make the huge model available to interested people through an API that communicates directly with the big boy. This access will eventually be a paid service, and I am hearing rumors the paid service will not be that expensive. Some people whom OpenAI trusts already have beta access, and they have been doing some pretty amazing and fascinating things with it. The way to communicate with the big boy is through "prompts".
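I haven't tried the API myself, but based on what beta testers have shared, a request looks roughly like this; the engine name and parameters here are my assumptions, not confirmed details:

```python
import openai  # the beta client library; access requires an API key

openai.api_key = "YOUR_API_KEY"  # placeholder

# Roughly what a completion request might look like. The engine name
# ("davinci") and the sampling parameters are assumptions.
response = openai.Completion.create(
    engine="davinci",
    prompt="Translate this English text to French:\n\nHello, world.",
    max_tokens=60,
    temperature=0.7,
)
print(response["choices"][0]["text"])
```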
A simple prompt contains a description of the task the model should perform, e.g. translate text to French, write a story, write a news article. You can get really descriptive here, e.g.:
“Parodies of the fantasy novel series Harry Potter in the style of famous authors:”
And the text that follows is an example of what the model generates based on this input. Sometimes just this single instruction works, which in the research field is called "zero-shot learning".
Most times you have to follow up the instruction with a demonstration that shows how the model should go about performing the task. This is called "few-shot learning", and it's the exciting part of the whole model; that's why the OpenAI paper is called "Language Models are Few-Shot Learners".
Here the instruction is to convert plain language to legal language, and @f_j_j_ follows it with some examples of plain-to-legal speak. From a few examples, GPT-3 learns what it needs to do, and you can use this "primed" model from then on to convert any plain text it gets into legal text.
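The screenshot isn't reproduced here, but a few-shot prompt of this shape might look something like the following; these examples are my own invention, not @f_j_j_'s:

```
Convert plain language to legal language:

Plain: I'll pay you $100 when the work is done.
Legal: Payment in the amount of one hundred dollars ($100) shall be rendered upon completion of the aforementioned services.

Plain: You can't tell anyone about this deal.
Legal: The parties agree to hold the terms of this agreement in strict confidence.

Plain: If you break something, you buy it.
Legal:
```

The model then completes the last line, and every further plain sentence you append gets converted the same way.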
(I am interested in knowing how the API works on the backend. Does it keep a different instance of these "primed" models for every new API connection? Is it possible to save these primed models for future use? Maybe for an app called "LegalSpeak" which has only one purpose.)
But it's not necessary to follow this structure. You can be creative and try prompts in different ways, and the model still manages to surprise.
@quasimondo tried a different way to give prompts:
Here he gives an example sentence, and then conditions the primed model to generate new versions of it under different framings like "poetic way" and "funny way". You can keep adding more attributes in this way to have a completely dynamic interaction with GPT-3.
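Again, the screenshot isn't shown here, but the general shape of such a prompt might be something like this (a toy example of mine, not @quasimondo's actual prompt):

```
Sentence: The server crashed at midnight and nobody noticed until morning.

Rewritten in a poetic way: At the stroke of midnight the server fell silent, and the world slept on, unknowing.

Rewritten in a funny way: The server rage-quit at midnight, and the on-call engineer's snoring drowned out the alarms.

Rewritten in a dramatic way:
```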
This technique has been taken furthest by @nickwalton00, who created AI Dungeon.
Here the prompts just flow into each other like the progression of a story. You direct what happens in the story, and the next generation depends on all of the previous context you conditioned it on through your choices. And the choices are not dictated by the game; you can perform basically any action you want. So instead of punching the director, I can choose to kiss him, and the story will progress (in a rather awkward way). Play the game; it's amazing how well it works.
All of this leads to the final and ultimate way of constructing prompts: you just converse with the model. Yes, I am talking about chatbots, but with the power of GPT-3 these chatbots can be anything you want.
Learnfromanyone is an app being developed on top of GPT-3, which was let loose for a short while, and I got to try it. Here I am conversing with Elon and debating him about the AI apocalypse. The idea of the site is that you type in any person (someone who is reasonably popular on the internet) and start learning anything you want about the world and about them. GPT-3 just takes on the role of the person and imagines a conversation. For the most part it plays the role and personality very well, and doesn't break character too often unless you push it to its limits.
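Under the hood, an app like this presumably primes the model with a persona description and a running transcript, roughly in this shape (my own sketch, not the app's actual prompt):

```
The following is a conversation with Elon Musk. Elon is direct,
technically minded, and worried about the risks of artificial intelligence.

Human: Do you think an AI apocalypse is likely?
Elon: ...
Human: But surely we can just unplug it?
Elon:
```

Each new user message gets appended to the transcript, and the model's completion becomes the persona's next reply.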
Replika is another chatbot that has been getting really popular lately because of how real the conversations are; several people have posted reviews which are fascinating to read. They are falling in love with their chatbots. Now, Replika is not running on GPT-3, at least as far as I know, and from the little I tried, it isn't as impressive as any conversation I had using GPT-3, or the examples I have read online, but it's still interesting to many people. Now imagine the Replikas, which have pretty boring personalities, replaced by any personality you can think of: you can date them, talk philosophy with these chatbots, discuss your problems, even ask for solutions, and they will give you very plausible answers which might work for you. It could be a therapist or a teacher like you never had before. And I say plausible, because not all of what it says will be true; after all, it's learned everything from the internet. But it tries its best.
All of this is coming sooner than you think. And it may look like it's limited to text and conversations, but here's the crazy part: it's not just good at generating plausible and meaningful sentences, it can do all sorts of other stuff. It can do basic maths when prompted to perform certain operations, it can write SQL queries from a statement in natural language, it can create React UI components from prompts like "Give me a red button of rectangle shape", and it can give instructions on how to draw basic shapes.
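For instance, a natural-language-to-SQL prompt might be shaped like this; the table and column names here are my own invention:

```
Translate natural language to SQL:

Question: How many users signed up last week?
SQL: SELECT COUNT(*) FROM users WHERE signup_date >= DATE('now', '-7 days');

Question: What are the ten most expensive products?
SQL: SELECT name, price FROM products ORDER BY price DESC LIMIT 10;

Question: Which customers placed more than five orders?
SQL:
```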
So what is prompt engineering? It doesn't seem like engineering to me. Engineering needs a certain fixed pipeline with robust quality control; we have to build workflows and have design meetings for the architecture. Prompting might not be suited for most engineering tasks. I do imagine we will discover several best practices and kinks that are well suited for certain tasks, and certain tasks it will always be bad at. But it's the fastest path from idea to conception, across a huge variety of ideas, that I have ever seen happen using computer software. In the hands of creatives, it will be a meandering forest to get lost in; in the hands of knowledge workers, it will speed up a lot of knowledge discovery; in the hands of programmers, it will be a much faster search engine than Google; and in the hands of creators, this thing is going to be one of the Infinity Stones.
This generated conversation was the most fascinating to me. Why? Because someone posted a failure mode of GPT-3 where it was generating nonsense answers when asked nonsense questions. And then Nick just kindly asked the AI to be a brilliant AI and not answer anything nonsensical. And it abides: it only replies if you ask genuine questions, and for nonsense questions it just says "yo be real". This is fascinating, because even if it's not learning on the fly, if it has a certain understanding of a brilliant AI encoded inside it, it can just pretend to be a brilliant AI, and we will be happy with the result. Similarly, I am thinking: if we just ask the AI to pass the Turing test, it already has most of the information needed to pass the Turing test encoded within it, and it will cheat. So GPT-3 can just flow and morph into whatever you need it to be at that particular moment. Which is a fascinating superpower, and one really important feature of a general intelligence.
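If I remember the tweet correctly, the priming looked roughly like this (a reconstruction from memory, not the exact prompt):

```
I am a highly intelligent question answering AI. If you ask me a question
that is rooted in truth, I will give you the answer. If you ask me a
question that is nonsense or trickery, I will respond with "yo be real".

Q: What is the capital of France?
A: Paris.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: yo be real

Q:
```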
How is it different from earlier methods?
Until now, getting a language model to do any "downstream" task, such as summarization or language translation, was done through a process of finetuning the original model. What that means is: you think about which tasks you need the transformer to accomplish, then prepare a dataset which has several diverse examples of your task with ground-truth expectations defined for them. The language model will then modify its parameters (learn and update via gradients) to perform these downstream tasks well. The process is very similar to training any neural network: you define a loss function, optimize using some optimizer, and backpropagate the loss through the several layers of the transformer.
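For the curious, here is a minimal sketch of that finetuning loop using the Hugging Face transformers library; the task (sentiment classification), the data, and the hyperparameters are all placeholders:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Minimal sketch of the finetuning loop described above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Placeholder dataset: a real one needs many diverse labeled examples.
texts = ["I loved this movie!", "Terrible, would not recommend."]
labels = torch.tensor([1, 0])  # ground truth gathered by hand

model.train()
for epoch in range(3):
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)  # forward pass computes the loss
    outputs.loss.backward()                  # backpropagate through all layers
    optimizer.step()                         # update the model's parameters
    optimizer.zero_grad()
```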
Google's BERT is one such open-source language model, which has been applied to many downstream tasks and performs very well on lots of them. But performing any kind of downstream task needs an engineer or a team to do the following things, which takes up resources:
- Gathering examples of data with ground truth (takes up the majority of the time)
- Setting up the model, which includes:
  - installing neural network libraries on your computers and servers
  - learning how neural networks work
  - learning best practices for choosing the dataset and for sizing and arranging the transformer layers; basically, learning the best way to construct/architect your transformer so that it gives good results
- Running the finetuning process. This again takes a lot of experimentation. You can easily get the process to run if you follow what AI researchers are doing, but it often fails for real-world tasks, as the same hyperparameters don't work so well, or the architecture needs to be changed. So there's a lot of uncertainty and trial and error.
The era of prompts
I find it funny to call the previous era of BERT finetuning an "era", because it lasted only a few years (one year?), but that's how fast the AI field progresses. GPT-3 offers a better way to get language models to perform downstream tasks, one that eliminates most of the list above. What makes GPT-3 special is that it is a very efficient "few-shot" learner: it learns to perform new tasks after we demonstrate a few examples of what we want from it. No changes to the weights are needed for these new tasks; the model has already encoded so much world knowledge that it doesn't need to update its parameters, it just needs to find a quick path through the layers, activating the right neurons, to produce the predictions the prompt asks for.
So how does it know how to perform new tasks (tasks it wasn't explicitly trained on)? The paper calls this "in-context learning": the model learns to behave differently conditioned only on unsupervised textual input. Very rudimentary language models started by predicting the next letter, the next word, the next sentence, and then the next few sentences. They got better at remembering context from a few sentences before, and at following instructions (see finetuning above). But until now, a model only got good at one instruction, and any other instruction needed its own finetuning step. With GPT-3 you just need prompts, and no other finetuning step is necessary.
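The GPT-3 paper illustrates this in-context learning with prompts shaped roughly like:

```
Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>
```

The demonstrations in the prompt are the only "training" the new task gets; the weights never change.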
But how does it understand instructions for tasks it was not trained to perform? The answer is scale and huge amounts of data. This is where the 175 billion parameters come in handy. The model is trained on a huge amount of data, and it learns to compress this data into a lower-dimensional form (unintelligible to the human brain), clustering similar meaningful patterns, and now even entire tasks or concepts. And when we demonstrate a few examples, the model has also learned to take the shortest path through the billions of parameters to arrive at a prediction closest to the patterns it has already seen; it does pattern matching. So it learns meta-concepts from the prompts, and these concepts help it find a representation that closely matches the output expected of that prompt.
Future implications
Now, this is not general intelligence yet, and the model is definitely not sentient. But it's a good snapshot of what a sentient artificial intelligence could look like. For it to be sentient, it needs to show some intelligence on the fly, and not just match the patterns it has already learnt during training. It needs to generate new information from the compressed encoded data. That means generating new patterns when faced with never-before-seen situations, showing some form of strategy to predict the future, the ability to learn from mistakes, the ability to know when it made mistakes, and the ability to explain the decisions and predictions it made. DeepMind showed a glimpse of this earlier in the "narrow" intellectual pursuit of Go, where the human champion who competed against it said that it sometimes shows such brilliance and beauty in its moves, it's almost god-like. It has performed moves a human could never think of, and created strategies that completely take you by surprise. The ability to surprise us will be important. For general intelligence, the next big puzzle is reinforcement learning, and applying it successfully to the compressed language model representation that has managed to encode much of the world's knowledge. It just needs to learn how to use it properly.
Please email me at swapp19902@gmail.com if you have any questions, or just comment here. I am not affiliated with OpenAI, and most of what I have written comes from the Twitter posts I have seen and from interacting with developers. I haven't even had the opportunity to try out the API yet; if anyone is able to accelerate the process of getting me access, I would really appreciate it. Meanwhile I will wait…
Find me on Twitter at @swapp19902. Let's connect and make this world a fun place to live in.