How Does ChatGPT Work? A Didactic Overview of this Reinforcement Learning-Powered Revolutionary Tool

Christian Zambra
productmanagerslife
3 min read · Apr 3, 2023

Before we talk about the machine, let’s talk about inspiration. So, how do we humans learn?

Well, the first thing that comes to mind is trial and error. To learn new things, we try, and the result, positive or negative, is the feedback that guides how we evolve. We learn to walk by falling. We learn not to put our hands in the fire by trying? Not exactly.

In the beginning, we are supervised by our parents, grandparents, or other adults who help us and teach us not to put our hands in the fire. They share with us the culture and principles of our family and are around to save us if we fall. In my opinion, this is where ChatGPT's biggest revolution lies, and probably one of the features that gave it so much value and made it grow at an unprecedented pace: the interaction and balance between humans and machines, between supervision and autonomy.

ChatGPT is built on reinforcement learning, so it learns through trial and error: the model tries an answer, people give feedback on it, and that feedback steers its later behavior. In the chat itself, users can also leave signals such as a thumbs up or thumbs down. However, indiscriminate learning from anyone can corrupt even a pure machine’s soul, as we learned from past experience (do you remember Tay, the chatbot that became offensive and was shut down less than 24 hours after launch?). Therefore, ChatGPT started with supervision.
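To make the trial-and-error idea concrete, here is a minimal sketch in Python. It is a toy, not how ChatGPT is actually built: the three reply labels, the `simulated_feedback` function, and the numbers are all assumptions I made up for illustration. The learner tries replies, receives a thumbs up (+1) or thumbs down (-1), and keeps a running average of the feedback for each reply.

```python
import random


def simulated_feedback(reply: str) -> float:
    """Hypothetical user: thumbs up (+1) for a helpful reply, thumbs down (-1) otherwise."""
    return 1.0 if reply == "helpful" else -1.0


def learn_by_trial_and_error(replies, feedback, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy learner: mostly repeat what worked, sometimes try something else."""
    rng = random.Random(seed)
    values = {reply: 0.0 for reply in replies}  # average feedback seen per reply
    counts = {reply: 0 for reply in replies}    # how often each reply was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            choice = rng.choice(replies)                          # explore
        else:
            choice = max(replies, key=lambda r: values[r])        # exploit
        reward = feedback(choice)
        counts[choice] += 1
        # Incremental update of the running average for the chosen reply.
        values[choice] += (reward - values[choice]) / counts[choice]
    return values


values = learn_by_trial_and_error(["rude", "vague", "helpful"], simulated_feedback)
best = max(values, key=values.get)
print(best, values)
```

Notice that the learner only becomes as good as the feedback it receives. If `simulated_feedback` rewarded rude replies, the same loop would happily learn to be rude, which is the Tay problem in miniature.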

As OpenAI’s blog describes, ChatGPT started its life with a supervised policy, just like most humans. Selected people (human AI trainers) wrote sample conversations, questions and answers, for the model to imitate, much as babies learn from the behavior of their parents, protectors, or supervisors. It is believed that ChatGPT also has ground-truth rules, comparable to human core values or family values. Those rules are probably similar to the ones used by other dialogue systems, such as DeepMind’s Sparrow.

After the supervised step, the same kind of specialists compare alternative answers from the model and rank them; this comparison data is used to train a reward model. From there, learning happens through reinforcement learning: for each action (an answer), the reward model returns a reward that says whether it was good or bad, and the policy is optimized against that reward. Once the system is out in the open world, feedback from the general public can be collected as well, but as I see it, the specialists’ judgment stored in the reward model carries more weight than the general public’s, just as the opinion of our mother, father, uncle, or family does. That barrier probably prevents the system from going crazy like some of its predecessors. By the way, that dataset was one of the company’s most valuable assets before the launch (by now, brand value and the number of clients probably rival it).
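For readers who like to see the mechanics, here is a small sketch of how comparison data can be turned into a reward model. Everything in it is an assumption for illustration: the two features (`polite`, `accurate`), the tiny dataset, and the linear model are mine, and I use a simple pairwise logistic (Bradley-Terry style) objective, not OpenAI's actual training code. Each comparison is a pair where a specialist preferred the first answer over the second.

```python
import math


def reward(weights, features):
    """Linear reward: a weighted sum of an answer's (hypothetical) features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred answers score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of log(sigmoid(margin)) with respect to margin.
            grad_scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(n_features):
                weights[i] += lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per answer: [polite, accurate].
# Each tuple is (answer the specialist preferred, answer the specialist rejected).
comparisons = [
    ([1.0, 1.0], [0.0, 1.0]),  # polite + accurate beats rude + accurate
    ([1.0, 1.0], [1.0, 0.0]),  # polite + accurate beats polite + wrong
    ([0.0, 1.0], [0.0, 0.0]),  # rude + accurate beats rude + wrong
    ([1.0, 0.0], [0.0, 0.0]),  # polite + wrong beats rude + wrong
]

weights = train_reward_model(comparisons, n_features=2)
print(weights)
print(reward(weights, [1.0, 1.0]), reward(weights, [0.0, 0.0]))
```

Once trained, `reward` can score any new answer without a specialist in the room, and that score is what a reinforcement learning step would then try to maximize.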

To any data scientist or data science enthusiast who wants to go deeper into ChatGPT and generative AI, I suggest taking a look at OpenAI’s blog and DeepMind’s blog.

To all my readers who want to take advantage of this powerful tool, I suggest going to ChatGPT’s website and interacting with it. In my first experience, one great thing was viewing ChatGPT’s reactions in the light of reinforcement learning theory. Therefore, I also suggest reading my article about reinforcement learning.

Thank you!!!

Chris

Christian Zambra
productmanagerslife

Passionate about learning; believes that new products are made to change people’s lives for the better; Fuzzy AND Techie :) B. Engineering & Advertising. Alma Mater: USP