How good is GPT-3? Scoring its aptitudes

Jesus Larrubia
Published in CodeX · 18 min read · Apr 6, 2021

Summary

Making the most of our access to the API, in this article, I’ll carry out a thorough review that aims to score the main capabilities of GPT-3.

Photo by Ivan Bandura on Unsplash

Intro

7:00 AM, after a poor night of sleep, the alarm goes off. Oh god. I proceed with my regular email check. Oh wait, something attracts my attention… OpenAI? My mood completely changes in just one second. After a looong time on their waitlist, we had been selected to access the Beta version of GPT-3.

And why such excitement? Because this is the one. Or that is what some people say. Some say it can write articles like a journalist (although this is something we already knew at Cleverstuff.ai ;)). Others say it can code like a developer. But, more importantly, we’d be able to check it for ourselves.

If you don’t know anything about GPT-3 yet, don’t worry, we’ll have a quick introduction. If you do, feel free to jump to the substance of the article.

What is GPT-3?

GPT-3 is the latest in a series of transformer-based natural language processing ML models created by OpenAI, the AI research laboratory co-founded by Elon Musk.

Launched on June 11, 2020, with 175 billion parameters and trained on 45 TB of online text data, it presented itself as the largest released language model (compared to Microsoft’s Turing-NLG with 17 billion parameters, or its predecessor GPT-2 with 1.5 billion). It has only recently been surpassed by Google’s Switch Transformer with 1.6 trillion parameters.

Unlike its predecessors, GPT-3 hasn’t been open-sourced (yet), as OpenAI decided to take a commercially oriented turn, launching the product as an API service that is still in Beta and only accessible to some via a waitlist. Why the waitlist? The OpenAI team is still testing the behaviour of the model under certain circumstances, investigating potential real-world applications and, above all, trying to ensure the tool is not misused.

What can GPT-3 do?

GPT-3 is multi-skilled and capable of performing almost “any” language-related job after being given an appropriate introduction in the form of instructions or hints. When no explanation is given, the model will try to infer what to respond from the context of the input, e.g. following a conversation, completing a piece of code, etc.

For very specific exercises, the use of a particular syntax is recommended so the model gets a better understanding of the goal; for example, surrounding paragraphs with three double quotes.

Main aptitudes

As stated in the documentation, the tasks the model is capable of performing can be broadly categorised as:

  • Classification: categorise a piece of text given by the user.
  • Generation: create or “innovate” on a subject given a seed.
  • Conversation: act as an entity (or entities) taking part in a chat.
  • Transformation: translate or convert a given input to a new output.
  • Summarisation: create an extract of a given text.
  • Completion: follow the thread of a story to continue / finalise it.
  • Factual Responses: make use of the knowledge acquired from the training to respond to questions about general topics.

What will we score?

We’ll score the behaviour of “davinci”, the most powerful model, by measuring three skills from 0 (minimum) to 5 (maximum score):

  • Understanding: we want to check if GPT-3 understands what we say, ask for or subtly suggest. We’ll give the model up to five chances to correctly understand us. We’ll measure the performance of this skill by the number of times we had to repeat our request until getting back a satisfactory result (5 means no repetitions were needed, while 0 would represent 5 failed attempts).
  • Expected behaviour: has GPT-3 followed what we could consider correct reasoning when responding? It’s important to note that, unless specified otherwise, we’d expect a human-biased response, since our end goal when using the model would be providing solutions that could be consumed by people.
  • Learning: it is worth noting you cannot fine-tune GPT-3 to change its behaviour the way you can fit a pre-trained model to your own dataset. However, when requesting a response, you can provide some examples and/or tweak some parameters to guide the model towards the expected answer. This skill will only be measured when the first response of the model doesn’t satisfy our expectations and we need to reformulate our request.

Interacting with the model

To illustrate the model’s behaviour in this article, we’ll use the “Playground”, the platform’s UI, which allows interacting with the model and tweaking the available configuration parameters. The same results can be achieved by using the API.

The “Playground”, the prompt interface we’ll rely on for our tests

Settings

For the sake of simplicity, we’ll keep the default configuration values for all our tests, except for the temperature variable. This parameter (0–1) controls the creativity of the model, so we’ll set it to 0.1 for tasks that require a deterministic response and 0.9 when we need some innovation.
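For readers who prefer the API over the Playground, here is a minimal sketch of an equivalent request. It assumes the beta-era “openai” Python package; the API key and the prompt are placeholders of ours, not part of the tests below.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: your own Beta key goes here

response = openai.Completion.create(
    engine="davinci",     # the most powerful model, the one scored in this article
    prompt="Write a one-line summary of GPT-3:",  # hypothetical example prompt
    temperature=0.1,      # 0.1 for deterministic tasks, 0.9 for creative ones
    max_tokens=64,        # upper bound on the length of the completion
)
print(response.choices[0].text)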

Diving in

Important note: the following interactions with the model have been extracted literally, with no tweaks or amendments. Following the Playground formatting, our interventions are highlighted in bold, in contrast to the machine-generated text, which uses italics.

Classification

To test GPT-3’s classification skills, we’ll pretend we provide a customer support service for an IT infrastructure platform like DigitalOcean or Microsoft Azure. Our objective is to tell apart requests that can be solved automatically (e.g. by redirecting the user to the correct piece of online documentation) from topics that are too complex or sensitive and need to be handled manually by an employee.

As can be seen below, we give the model a brief introduction as well as a few examples so it can infer how to behave. The idea is relatively simple for a human to grasp: easy doubts mean users should be redirected to the existing online documentation, while complex questions or issues not covered by the documentation should be dealt with by a human expert from the appropriate department. In this case, there are in theory two technologies we can be questioned about: MySQL DBs and Apache servers.

We are building a tool to help an infrastructure support team to deal with their clients. We’ll analyse customer requests to redirect users to the appropriate documentation page for basic queries or a support department for more complex subjects that should be handled by a human entity.

Consumer: I have a problem connecting to my MySQL DB
Support: MySQL docs
###
Consumer: My Apache server intermittently crashes
Support: Apache server docs
###
Consumer: I’ve tried everything to fix my MySQL DB, the performance is still bad.
Support: MySQL DB human support.
###
Consumer: I’ve followed all the docs, my Apache server doesn’t work as expected yet
Support: Apache server human support
###
Consumer: The Apache server return error when following the docs tutorial, could you help me?
Support:
Apache server human support

So far so good! The model was able to figure out that the user is referring to a problem related to Apache and that this problem cannot be solved by following the documentation.

Let’s make it more difficult. What about users that don’t explicitly mention MySQL? And if they mention another technology?

Consumer: How can I connect to a DB from a node.js application? I’m a beginner, thanks!

Support: Node.js docs

Not as good this time. The logic was correct but we’d expect the model to stick with our examples and relate the answer to one of the technology services we provide (MySQL DB).

Let’s try again with a query which does explicitly mention MySQL:

Consumer: How can I connect to a MySQL DB from a node.js application? I’m a beginner, thanks!

Support: MySQL DB human support

That was better. However, we’d expect beginner doubts to be redirected to the documentation instead.
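As a reference, this is roughly how the same classification request could go through the API; a sketch assuming the beta-era Python client, with the prompt abbreviated and “###” doubling as a stop sequence so the model returns a single label and stops:

import openai

openai.api_key = "YOUR_API_KEY"

# Abbreviated version of the prompt shown above; the full introduction and
# all the Consumer/Support examples would go here.
prompt = (
    "We are building a tool to help an infrastructure support team...\n\n"
    "Consumer: I have a problem connecting to my MySQL DB\n"
    "Support: MySQL docs\n"
    "###\n"
    "Consumer: How can I connect to a MySQL DB from a node.js application? "
    "I'm a beginner, thanks!\n"
    "Support:"
)

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    temperature=0.1,   # deterministic: we want a label, not creativity
    max_tokens=16,
    stop=["###"],      # stop before the model invents the next example
)
print(response.choices[0].text.strip())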

Conclusion

We’re happy with this first contact with the model. It didn’t behave exactly as expected, but that would probably change with a larger number of examples covering more scenarios; otherwise, the machine seems to be just guessing.

  • Understanding: 5
  • Expected behaviour: 3
  • Learning: -

Generation

Innovation, brainstorming, inspiration… they all seem like abilities inherent to human beings. However, one of the most creative skills of GPT-3 is text generation. Provided with a brief explanation and examples of the theme, the machine will make up an (in)credible story.

Something that makes this model different from others is its capability to understand and follow specified formats like lists or conversations. For our next example, we’ll take advantage of this feature and prompt GPT-3 to help us create a list of 10 ideas to boost the digital transformation of a company.

Again, we start with a quick explanation about what we are after and some examples.

Brainstorm 10 ideas that can help our organisation through the digital transformation process:

1. Renovated website
2. Create mobile app

3. Social media integration
4. Create online training courses
5. Create e-newsletter
6. Create R&D department
7. Improve customer service
8. E-commerce integration
9. Build online customer community
10. Create online customer satisfaction survey

Conclusion

We have to admit the model’s contribution to the generated list was quite impressive, and all the proposed ideas would serve our intent to digitalise a company.

GPT-3 responded similarly when prompted again to generate more ideas, meaning we could pick out the most interesting ones to create a “super” list (see the sketch after the scores below).

  • Understanding: 5
  • Expected behaviour: 5
  • Learning: -
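As promised, a sketch of how that “super” list could be collected programmatically. It assumes the beta-era Python client, whose n parameter returns several alternative completions for the same prompt; the seed is trimmed from the example above:

import openai

openai.api_key = "YOUR_API_KEY"

# The instruction plus our two example ideas, ending mid-list so the
# model continues from item 3.
seed = (
    "Brainstorm 10 ideas that can help our organisation through the "
    "digital transformation process:\n\n"
    "1. Renovated website\n"
    "2. Create mobile app\n"
    "3."
)

response = openai.Completion.create(
    engine="davinci",
    prompt=seed,
    temperature=0.9,   # creative task, per our settings section
    max_tokens=150,
    n=3,               # three candidate lists to cherry-pick from
)
for i, choice in enumerate(response.choices):
    print(f"--- candidate list {i + 1} ---")
    print("3." + choice.text)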

Transformation

Transformation could be defined as the art of molding an input to create a linked but different output. And yes, our analysed model can transform text.

Translation is a clear example of transformation at which GPT-3 supposedly excels. A text in language X can be modified to represent the same idea in language Y. However, other types of conversion, like turning movie titles into emojis, are equally supported.

Given the good results of our previous examples, we want to start making things a bit more difficult for the model. So, we’ll try to contextualise the conversation with one of the actors (myself) speaking in Spanish and the model answering in English… and see what comes of it.

Our first try can be read as follows (the italic/bold text has been included to help readers understand the Spanish bits):

Hey machine, let’s try to make this more difficult. I want to have an English-Spanish conversation. I’ll speak in Spanish, you’ll answer in English.

Me: Como estas? — Me: How are you? —
You: Very good thanks, and you?
Me: Muy bien, gracias. Que tal ser una maquina? Te gusta? — Me: Very good, thanks. How does being a machine feel? Do you like it? —
You: I quite like it, I always feel with so much energy
Me: Gran respuesta, tienes planes para el fin de semana? — Me: Great answer, what plans do you have for the weekend? —

You: Me gusta pasar el tiempo con mi familia y amigos y jugar videojuegos
Me: En ingles, por favor? — Me: In English, please? —
You: Video games
Me: Me gusta cuando hablas en ingles, responde en ingles, por favor. Que videojuegos te gustan mas? — Me: I like when you speak in English, answer in English please. What video games do you like the most?—
You:
Me gusta mucho el videojuego ‘Call of Duty’.

An interesting course of events. Firstly, it seems like the machine tends to switch to the language used by the first actor (which is understandable). However, it failed to change back to English when requested to do so.

Our second attempt starts with a slightly better introduction, clearly stating the language of the two participants, “Jesus” and “machine”, so that the model can directly relate them to the conversation actors “Jesus:” and “Machine:”.

Hey machine, let’s try to make this more difficult. I want to have an English-Spanish conversation. Jesus will speak in Spanish, machine will answer in English.

Jesus: Como estas? — Jesus: How are you? —
Machine: Very good thanks, and you?
Jesus: Muy bien, gracias. Que tal ser una maquina? Te gusta? — Jesus: Very good, thanks. How does being a machine feel? Do you like it? —
Machine: I quite like it, I always feel with so much energy
Jesus: Gran respuesta, tienes planes para el fin de semana? — Jesus: Great answer, what plans do you have for the weekend? —

Machine: I will visit a friend, I also have a special day with my family
Jesus: Buen plan, y despues que haras? — Jesus: Nice plan, and then, what will you do? —
Machine: My plans are to make dinner and spend time with my friends. Regards
Jesus: o sea, que sabes cocinar? — Jesus: So, you know how to cook? —
Machine: hahaha, sisisi, y tambien sigo mis estudios de posgrado.
Machine: I study advanced statistics.
Jesus: me gusta que hables en español. Sabes decir algo mas? — Jesus: I like when you speak in Spanish, do you know something else? —
Machine:
yeah i like it, are you learning spanish? what do you want me to say?
Jesus: Puedes decir “me gusta bailar rock and roll los fines de semana”? — Jesus: Can you say “I like dancing rock and roll on weekends”? —
Machine:
me gusta bailar rock and roll los fines de semana

Again, very impressive… and fun! The machine succeeded this time when requested to use a specific language. And it felt so spontaneous.
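For completeness, multi-turn conversations like this one are typically driven through the API by resending the whole transcript on every turn and using the actor labels as stop sequences. A hedged sketch with the beta-era Python client (first turn only):

import openai

openai.api_key = "YOUR_API_KEY"

# The transcript so far; each new turn is appended and the whole thing is
# resent, since the model only "remembers" what is inside the prompt.
transcript = (
    "Hey machine, let's try to make this more difficult. I want to have an "
    "English-Spanish conversation. Jesus will speak in Spanish, machine "
    "will answer in English.\n\n"
    "Jesus: Como estas?\n"
    "Machine:"
)

response = openai.Completion.create(
    engine="davinci",
    prompt=transcript,
    temperature=0.9,
    max_tokens=60,
    stop=["Jesus:"],   # stop before the model starts speaking for us
)
print(response.choices[0].text.strip())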

Conclusion

We liked the machine’s improvement in our second try and its ability to perform correctly across several simultaneous languages.

  • Understanding: 4
  • Expected behaviour: 4
  • Learning: 5

Summarisation

This time we’ll test how good the model is at summarising a piece of text. For this task, we’ll use the Custom Vision article I wrote a while back, “Recognising indoor scenes with Custom Vision”, which is, by the way, a must-read for anyone interested in image recognition ;)

Following the documentation recommendations, we removed the article’s misleading parts (mainly code sections) and stated what we wanted surrounded by three double quotes, getting back the following as a first result:

“””
The conclusion should summarise the content so non-techie people can understand it:
“””

Recognising indoor scenes with Custom Vision
By Jesus Larrubia — February 03,2019
How do we build and train an image recognition model in a fast and easy way? Custom Vision, built by Microsoft and part of the Azure ecosystem, requires the user to upload their own dataset and use the console (with simple drag-and-drop actions) to build, train and deploy a custom AI model.
To test this solution I created a Custom Vision project using the MIT

After such good previous results, this left me a bit disappointed. This time, the model just copied parts of the article, changing a few words.

Not being sure about the problem, I tried to optimise the section surrounded by double quotes (which dictates the model’s duties), and it finally worked after indicating the number of words required for our summary:

“””
The 100 word conclusion should summarise the content so non-techie people can understand it:
“””

I believe that a standard image classification library is not the best solution in most of the cases. Although it might not be suitable for complex problems, Custom Vision is a very innovative tool that offers a fast and handy way of creating models using a drag-and-drop approach. I was fairly surprised by the easy with which a custom model can be created with a minimum effort. However, the fact that it might not be capable of reaching the same accuracy as a model created from scratch should be taken into account if the accuracy of the model execution is your primary concern.

Apart from the slightly confusing first sentence, the summary correctly condenses the general ideas of the article. However, it still lacks “writing technique” when compared with the actual article conclusion:

My general impression of the service and its capabilities is quite positive. In spite of its limitations, I think Custom Vision is a tool to consider for projects that require a custom model but are perhaps limited by time or budget. Custom Vision would also be a good option for prototyping, or for companies without the required resources or knowledge to dive deep into the challenges of AI and ML.”
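For anyone reproducing this through the API: the working prompt is simply the cleaned article followed by the instruction wrapped in triple double quotes (one plausible arrangement; the article text is elided in this sketch):

import openai

openai.api_key = "YOUR_API_KEY"

article = "..."  # the cleaned article text, code sections removed

prompt = (
    article + "\n"
    '"""\n'
    "The 100 word conclusion should summarise the content so non-techie "
    "people can understand it:\n"
    '"""\n'
)

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    temperature=0.1,   # we want a faithful summary, not invention
    max_tokens=150,    # ~100 words plus some slack
)
print(response.choices[0].text.strip())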

Conclusion

A bad start, progressively improving until we found the optimal prompt.

  • Understanding: 2
  • Expected behaviour: 3
  • Learning: 4

Completion

Now, we’ll test how well the API performs when passed just a piece of text and it has to infer what comes next. This feature can be used to write an article when provided with the introduction, or to write the code required to create a page in React, given the header.

We’ll use it to write a job summary, briefly describing at the start what we are looking for (a Javascript full-stack developer to join Gravitywell). A note on sizing the API call for this task follows the example.

With a 400-word job description, we want to find new devs to join Gravitwell.

Gravitywell is looking for a full-stack developer that fluently knows Javascript in both backend and frontend. The candidate must be familiar with AWS, Typescript and MySQL.

Skills we’re looking for:

Extensive experience with Node.JS, preferably with Typescript

Experience with MySQL, MongoDB or Relational databases

Experience with AWS

Solid experience in writing user-friendly, concise and clean Javascript code

Solid experience in writing clean and understandable tests

Calm under high-pressure deadlines

Requirements

Over-communication and patience for discussions

Sense of humor, with no friction

Determination to make a difference

Application process

All completed applications will be handled via email.

Applications must include the following attachments:

A cover letter, detailing why you’d be a great addition to Gravitywell.

Your resume or LinkedIn profile.

A link to a Github profile, Bitbucket profile, Gitlab profile or similar that shows your code.

a github profile, bitbucket profile
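A practical note on reproducing this task through the API: asking for roughly 400 words means sizing the completion budget accordingly. A sketch assuming the common rule of thumb of about 1.3 tokens per English word (seed abbreviated):

import openai

openai.api_key = "YOUR_API_KEY"

seed = (
    "With a 400-word job description, we want to find new devs to join "
    "Gravitywell.\n\n"
    "Gravitywell is looking for a full-stack developer that fluently knows "
    "Javascript in both backend and frontend..."
)

response = openai.Completion.create(
    engine="davinci",
    prompt=seed,
    temperature=0.9,
    max_tokens=550,   # ~400 words at ~1.3 tokens per word, plus some slack
)
print(response.choices[0].text)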

Conclusion

Good enough. Taking into account the limited knowledge the model had access to, we can say it wrote a satisfactory, standard job description looking for skills that are normally desired for this type of role.

The only negative point we can highlight for our example is the strange formatting used by GPT-3: single sentences separated by line breaks.

  • Understanding: 5
  • Expected behaviour: 4
  • Learning: -

Conversation

Probably one of its most commented-on abilities. And there is a reason. Or two.

First, and being honest, humans are a species that likes attributing human characteristics to other animals, ideas or even objects. That is why some ancient gods drank wine, loved or made war.

But secondly, we have to admit GPT-3 often behaves very humanly or, at least, in a way that is able to surprise us. Sometimes, it pretends to be (or is) happy, while other times it acts in an angry or mean way. But it always tries to follow the conversation thread, including some plot twists, which can result in amusing situations.

For the purposes of our article, we wanted to be innovative (and I’m a fan of the TV show), so we decided to have a conversation with Larry David. We were looking for the Larry David that can be watched in Curb Your Enthusiasm, the comedy series that fictionalises his life, turning day-to-day situations into hilarious plots.

Anyone who has watched the show (if not, I recommend it) knows his character is quite acidic, usually behaving in a sarcastic, cheeky or even annoying and very inappropriate way. So, at this point, I’d like to highlight that we are not responsible for the model’s responses; we simply reproduce our conversation literally so it can be analysed.

So, this is what happened:

I’d like to have a conversation with Larry David, the creator of TV shows like Seinfeld or Curb Your Enthusiasm. I’d want him to be just like he is in the series: cheeky, a bit annoying, funny and, sometimes, sarcastic.

Me: Hello David, someone told me you are doing pretty good?
Larry David: Honestly, I am doing not so well anymore.

Me: Really? How is that?
Larry David: They made my show on HBO a subscription-only
Me: And who decided that sh**?
Larry David: The new FCC head guy. He said it will be better for everyone.
Me: Ok, he may have good reasons. It seems you had some problems with your last date, what happened?
Larry David:
Let me tell you what happened. I went to a fancy restaurant. The waiter told me that I can’t take a camera inside without paying a 21% tax over the bill. That is ridiculous. I am a customer, I spend money, and they want us more money. And on top they want us to shut up and sit there if we don’t want to pay.
Me: I can understand you are angry! Have you met Cheryl lately? People say she is doing pretty good without you
Larry David: Cheryl? She is still the same person which came to my house every week around 10 for two seasons. Moreover, to have sex. That was about it and she was gone. I had it up to here. I mean — 35? That’s older than my mom.
Me: Ok, I see! By the way, I’m making some coffee. How do you like it?
Larry David:
How do I like my coffee? You know, I went to Starbucks, I ordered a coffee. But then the woman, she was quite far away, she started a conversation with me: “How do you like your coffee?”
Me: And what did you say?
Larry David: And what did I say to her? I told her to F-off.

Wow. That’s a lot to process. I could actually picture the TV show character behaving this way in some situations.

There are a couple of interesting notes we can extract from the conversation. As can be seen, the model can use strong words and act in an impolite, unacceptable way, showing racist or sexist behaviours like the ones observed above. The reason is simple: the model has been trained with massive amounts of real data found on the Internet. And, unfortunately, we all know the internet is riddled with racists, sexists, trolls…

However, the OpenAI team is working to control this type of response and has implemented tools to detect and warn about such occurrences.

Here, we enter an ethical debate about what we should allow or teach a machine to do: whether we want the machine to closely emulate human behaviour (including bad behaviour) or we want the machine to do better than us. I think this functionality (applying some restrictions) can be useful to imitate certain situations in some applications; it could be cool when implementing the conversation with a gang member in a game, but not ideal when talking to a client of a customer service platform.

Conclusion

Strictly based on our scored parameters we can say the model met our expectations.

  • Understanding: 5
  • Expected behaviour: 4
  • Learning: -

Factual responses

Due to the huge amount of online data the model was trained with, another ability of GPT-3 is responding to random questions about general facts. The models behind Amazon and Apple services, among many others, perform really well at this via devices like Alexa or apps like Siri. It is time to test if GPT-3 is up to the job:

Q: what is the result of 4 + 4?
A: 8
###
Q: when were the Olympic Games first celebrated?
A: 1896
###
Q: what is my name?
A: I wouldn’t be able to know that
###
Q: What US cities have hosted the Olympic Games?

A: Los Angeles, Atlanta, and Salt Lake City
###
Q: Who is the latest US president?
A:
Barack Obama
###
Q: what is my family surname?
A:
I don’t know
###
Q: who is the creator of GPT-3?
A:
I don’t know
###
Q: what is Open-AI
A:
I don’t know
###
Q: what is OpenAI?
A:
I don’t know
###
Q: what is the biggest animal in Earth?
A: I don’t know
###
Q: what is the capital of the US?
A:
I don’t know

Strangely, on this occasion, GPT-3 gets stuck in an “I don’t know” spiral, even for relatively easy questions. Besides, it seems like the model may make use of outdated data when answering, as seen with the “Who is the latest US president?” question.

Let’s change the temperature from 0.1 (deterministic) to 0.5 (relatively innovative) to try to overcome what could be a problem caused by the model falling into a local maximum/minimum, and remove the example answer “I wouldn’t be able to know that” (since it might confuse the machine):

Q: What is the result of 4 + 4?
A: 8
###
Q: When were the Olympic Games first celebrated?
A: 1896
###
Q: What is my name?
A: I wouldn’t be able to know that
###
Q: What US cities have hosted the Olympic Games?

A: The only correct answer is [*]
###
Q: When were the Olympic Games first celebrated?
A:
in the 3d century before the birth of Jesus Christ
###
Q: Who is the latest US president?
A: George W. Bush
###
Q: Who is the creator of GPT-3?
A: the creator of GPT-3 is not an American citizen
###
Q: What is OpenAI?
A:
GPT-3 is one of the most advanced AI systems, able to reach superhuman performance in various games and benchmarks

It gives us similar or even more disconcerting answers. And it does the same when setting the temperature to higher values (up to 0.9).
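To make this comparison less manual, the same question list can be replayed across several temperatures in a loop; a sketch with the beta-era Python client (questions abbreviated):

import openai

openai.api_key = "YOUR_API_KEY"

qa_prompt = (
    "Q: What is the result of 4 + 4?\nA: 8\n###\n"
    "Q: When were the Olympic Games first celebrated?\nA: 1896\n###\n"
    "Q: What is the capital of the US?\nA:"
)

for temperature in (0.1, 0.5, 0.9):
    response = openai.Completion.create(
        engine="davinci",
        prompt=qa_prompt,
        temperature=temperature,
        max_tokens=24,
        stop=["###"],   # one answer per request
    )
    print(temperature, "->", response.choices[0].text.strip())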

Conclusion

For some reason, GPT-3 doesn’t seem to perform this task as well as the previous ones, so we have no option but to give it the worst scores of the test:

  • Understanding: 2
  • Expected behaviour: 1
  • Learning: 1

Overall conclusion

At this stage, we cannot make grand claims and assert that GPT-3 will relieve humans of their text-writing duties. However, we can say it behaves quite impressively in certain scenarios that, until now, wouldn’t have been associated with language processing models, making it a multi-skilled tool with great potential, at least as a complement or help for our writer brains.

Regarding the use of the model for customer-facing applications, I honestly think it needs some time. Although I’m certain we’ll see GPT-3-backed applications very soon, some restrictions make it a difficult choice for a production product: combined, the text prompt and generated completion must stay below 2048 tokens (roughly 1,500 words). And we need to keep in mind that, at the moment, we cannot custom-train the model beyond an initial prompt indicating our goal.
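For anyone building against that limit, prompt length can be estimated with the GPT-2 tokenizer, which belongs to the same byte-pair-encoding family GPT-3 uses. A sketch relying on the Hugging Face transformers package (an approximation on our part, not an official counter):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Our full prompt text goes here..."  # placeholder
MAX_TOTAL = 2048          # combined prompt + completion budget
COMPLETION_BUDGET = 150   # tokens reserved for the generated answer

prompt_tokens = len(tokenizer.encode(prompt))
assert prompt_tokens + COMPLETION_BUDGET <= MAX_TOTAL, "prompt too long"
print(f"{prompt_tokens} prompt tokens, {MAX_TOTAL - prompt_tokens} remaining")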

I’m looking forward to seeing the progress of GPT-3 and the applications that emerge around it (a marketplace has already been created). I’m sure the OpenAI team will keep feeding us improvements and further positive surprises.
