The Birth of OpenAI’s ChatGPT and Dota 2

kevin zhu
Data Science Student Society @ UC San Diego
12 min read · Jul 2, 2024
ChatGPT is the poster boy for AI and ML. But how did it come to be?

Abstract:

It’s August 2018 in Vancouver, and the largest e-sports tournament in the world has just ended. Known as the International 2018, the championship for the video game Dota 2, it featured a prize pool of over 25 million dollars, up until that point the largest prize pool for any e-sport. The winners were team OG, and they had just made more money than they had ever seen in their lifetimes. As Gabe Newell, founder of Valve, walked onto the stage to present their trophy and, more importantly, the multi-million-dollar check, he told the audience that they were not quite done: the winners still had to play one more game, against a strange new team, not fully human, known as Team OpenAI. Newell explained that OpenAI was a small startup based in San Francisco, just beginning its development of artificial intelligence algorithms, and that it had collaborated with Valve to test its product, to laughter from the crowd and the commentators. After 25 minutes, the crowd was in shock. The world champions, the best at the game, had just been stomped by OpenAI’s bot team. Getting destroyed by “Terry Bot” and “John Bot” was something no one had expected.

This was the first time OpenAI was introduced to the world, and it achieved two things. One, it was great marketing for OpenAI and sparked investor interest, eventually leading to Microsoft’s 1 billion dollar investment in 2019. Two, it proved that artificial intelligence wasn’t something that belonged only in science fiction. In 2018, OpenAI was relatively unknown and completely novel; now, in 2024, it has become a powerhouse of commercial generative AI. How did this all come to be?

Part 1: Logistics

OpenAI began as a research group, not a commercial business, and its goal was to develop and publish open-source machine learning algorithms. First consisting of only 9 members, all distinguished figures in the machine learning academic community, the group began development and testing in 2015 while making its research open to the public. Corporate titans were also interested in the group and its cutting-edge research: NVIDIA gifted a massive supercomputer to increase computational efficiency, Elon Musk was the leading provider of funds, and later on, Microsoft made a billion-dollar investment after OpenAI’s first showcase at the International 2018.

Part 2: Testing

Testing and training an AI system pose significant challenges, balancing sophistication and scale. Sophistication pertains to the intricacy and accuracy of the model, while scale refers to the breadth of acceptable input data. OpenAI faced the dilemma of finding a testing environment sufficiently complex yet rich in training data.

Known as one of the hardest games of the modern PC era, the 5-on-5 MOBA (Multiplayer Online Battle Arena) game Dota 2 emerged as an ideal testing ground because of its abundance of data and its incredible complexity compared to other games. Unlike chess, which lacks the dynamic nature of a real-time strategy game, Dota 2 encompasses many factors that influence the outcome, including but not limited to the drafting phase, the mechanical skill of the players, the randomness of certain interactions, the overall pacing and strategy of the two teams, and the individual decisions of each of the five players. These factors matter because they mean the game mirrors the complexity of equally convoluted real-world systems such as political systems or financial markets. Furthermore, because it is a video game, Dota 2 provided access to extensive archival data, on the scale of billions of bytes, which facilitates large-scale training. This is something more traditional fields of data exploration, like medicine or sports, cannot provide, since not every detail of a patient’s body or every single play in a game is recorded. Dota 2 was also a neutral testing ground for AI: if the environment had instead been, for example, finance or sports betting, it might have raised legal or ethical concerns. Since OpenAI was originally founded as a non-profit, using its models in an environment that could generate money would have violated its mission statement, so it had to choose something unrelated to those fields. Dota was the perfect choice.

Part 3: Magic

OpenAI’s journey from theoretical exploration to commercial AI prominence is underpinned by groundbreaking advancements in technical methodologies and tools. Key among these are Rapid, Proximal Policy Optimization (PPO), and hierarchical reinforcement learning, each contributing uniquely to OpenAI’s success in mastering complex tasks like Dota 2 gameplay.

Hierarchical Reinforcement Learning

Hierarchical learning, a fundamental concept in OpenAI’s testing process, remains a cornerstone of AI research. This approach entails decomposing complex tasks into hierarchies of simpler subtasks, enabling more efficient learning and decision-making. An extremely simplified example: given a problem with 3 features, instead of building one model relating all 3 features, you can build 3 separate models that each relate 2 of the features. Essentially, instead of one model connecting A, B, and C, you create three different models: AB, BC, and AC. The “hierarchy” comes from differences in importance; perhaps the AB relationship is far more impactful than BC, thus establishing levels of hierarchy. In practice, hierarchies are much more convoluted than this simple ABC example, with hundreds or thousands of features, and the breakdowns are not limited to 2-feature models but can use any number of features smaller than the original N. In the context of Dota 2, this looks like the following scenario:

Figure 1, subtask decomposition: instead of creating one large model that relates x, y, and z, we create separate, simpler models relating x to y, y to z, x to z, and so on, to decompose the complexity of the problem.

In Dota 2, players can spend gold, earned by farming resources and killing enemies, on items. These items can either help accelerate farming or help players beat enemies in combat. One of the most important items in the game is Black King Bar (BKB), an item which grants spell immunity. It’s a situational item: it’s very good against enemies that like to cast a lot of spells, but against enemies that don’t, it would be a waste of gold. The decision of when to buy this item, or whether to buy it at all, is highly complex, one of the turning points in a Dota match, and depends on a multitude of factors:

  • How strong the enemy is
  • What your biggest weaknesses are (do you have trouble staying alive, or killing others)
  • How much gold you currently have
  • How much gold you are projected to earn in the future
  • Whether there are better items to buy instead
  • What area of the map you are playing around
  • What role you are playing
  • Your teammates’ synergies

The list of features goes on and on for one of the most important decisions in a Dota match. Rather than tackle the problem of “BKB or not?” with a single model, hierarchical learning breaks it down into many smaller, lower-dimensionality problems, reducing complexity and increasing efficiency by answering one of those aspects at a time. Enemy team is all spellcasters? Training data says you should buy BKB. You don’t have enough gold for more important expenses like buyback? Training data says you shouldn’t buy BKB. At the end, the results of all these decisions are combined, with some weighted more heavily than others (hence the hierarchy), and a final decision is made.

Figure 2, real-time decision-making process: the algorithm considers each factor one by one and reaches a final verdict, in this case “BUY BKB”. Each factor “pushes” the algorithm toward one of the two decisions by a certain amount. This one-by-one approach follows from the decomposition of subtasks explained in Figure 1.
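
To make this concrete, here is a tiny, hypothetical Python sketch of the weighted sub-decision pattern in Figure 2. The factors, weights, and numbers are invented for illustration only; OpenAI’s actual system is a learned neural network, not hand-written rules like these.

```python
# Toy illustration of combining weighted sub-decisions into one verdict.
# Every factor, weight, and number here is made up for the example.

def enemy_spell_threat(enemy_casters: int) -> float:
    """Sub-model 1: more enemy spellcasters pushes toward buying BKB."""
    return min(enemy_casters / 5.0, 1.0)            # 0.0 .. 1.0

def gold_pressure(current_gold: int, bkb_cost: int = 4050) -> float:
    """Sub-model 2: not having the gold pushes away from buying BKB."""
    return -1.0 if current_gold < bkb_cost else 0.5

def survival_need(deaths: int) -> float:
    """Sub-model 3: dying a lot pushes toward buying BKB."""
    return min(deaths / 10.0, 1.0)

def decide_bkb(enemy_casters: int, current_gold: int, deaths: int) -> str:
    # The hierarchy shows up as different weights: spell threat matters most here.
    score = (0.5 * enemy_spell_threat(enemy_casters)
             + 0.3 * gold_pressure(current_gold)
             + 0.2 * survival_need(deaths))
    return "BUY BKB" if score > 0.3 else "SKIP BKB"

print(decide_bkb(enemy_casters=4, current_gold=4200, deaths=6))  # -> BUY BKB
```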

Rapid

Rapid is a technique developed by OpenAI that focuses on more efficient computation at the hardware level, making full use of the available computational power.

Essentially, Rapid lets researchers address bottlenecks in computationally intensive tasks by optimizing how work is spread across hardware, for example by using parallel processing within a CPU or by distributing computation across many CPUs and GPUs. This allows for faster convergence and faster iteration when training an AI. While standard parallel and distributed processing systems are designed for a broad range of applications, Rapid is optimized specifically for AI and machine learning tasks with extremely large amounts of training data and calculations. In this context, large means workloads on the order of billions of data points and operations, something that traditional distributed computing systems just aren’t prepared for, hence the necessity for a new system of coordination. Rapid is essentially the software that allows many thousands of machines to talk to each other and coordinate this process of spreading out the workload. Implementing Rapid meant OpenAI needed a lot of CPUs and GPUs in the first place, which it fulfilled by renting about 128,000 CPU cores from Google’s cloud. This is a lot of money: in 2017, about a quarter of OpenAI’s budget went toward renting cloud computing resources.
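
As a rough, hypothetical sketch of the pattern (not OpenAI’s actual Rapid code), the idea looks something like this in Python: many workers generate rollout results in parallel, and a central step averages them into a single update.

```python
# Minimal sketch of the Rapid-style pattern: parallel rollout workers
# feed a central optimizer that averages their results into one update.
import random
from multiprocessing import Pool

def rollout_worker(worker_id: int) -> list:
    """Stand-in for one worker playing self-play games and returning a gradient estimate."""
    random.seed(worker_id)
    return [random.gauss(0.0, 1.0) for _ in range(4)]

def average_gradients(gradients: list) -> list:
    """Central optimizer step: average the per-worker gradient estimates."""
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

if __name__ == "__main__":
    with Pool(processes=8) as pool:            # stand-in for thousands of CPU cores
        grads = pool.map(rollout_worker, range(8))
    update = average_gradients(grads)          # one synchronized update step
    print("averaged gradient estimate:", update)
```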

Proximal Policy Optimization (PPO)

PPO is a training process built on trial and error: the system samples from the results of that trial and error to simulate decision making. After a decision is made on training data, the machine is “rewarded” with either positive or negative reinforcement depending on whether its decision was correct. The ingenuity of PPO lies in its objective function, which determines which trajectory (or series of decisions) is best.

To understand the objective function, it’s important to first understand Trust Region Policy Optimization (TRPO). TRPO was essentially the predecessor method in this line of reinforcement learning: it determines the best policy update subject to a constraint expressed in terms of KL divergence. PPO took the equation used in TRPO and modified it, mainly replacing the constraint with a ratio-based “clipping” function.

The main difference between PPO and TRPO is that the PPO objective is far easier to optimize. Because the clipped function can be optimized with gradient descent, PPO turns out to be the faster and less complex formulation, whereas TRPO is not as clean mathematically and requires a lot more massaging. According to OpenAI, solving the TRPO optimization problem involves first approximating it with a Taylor series and then using methods from Lagrangian duality to approximate the extrema, a process far more complicated than simple gradient descent (or as simple as gradient descent gets).

Essentially, this change in the math behind reinforcement learning is what truly allowed OpenAI to improve its AI’s decision making, not just in accuracy but in speed and feasibility. With TRPO, correct decisions could be reached given enough time and computing resources, but it simply took too much of both to be practical. PPO allowed AI decision making to happen within a realistic interval of time, and even though it still required a huge amount of computing power (hence the roughly 8 million dollars OpenAI spent renting Google’s computers alone), it produced a machine learning model that could be trained fast enough, meaning within a few months instead of decades.

Both PPO and TRPO are policy gradient methods: they adjust the policy parameters θ to improve the expected reward. The idea is to change θ in the direction that makes the policy better at choosing actions that lead to higher rewards, and this adjustment is guided by the gradient of the expected reward with respect to the policy parameters.
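
In symbols, the textbook form of the policy gradient that both methods build on (written here with the advantage Aₜ, introduced below, as the reward signal) is:

```latex
\nabla_\theta J(\theta) \;=\; \mathbb{E}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \right]
```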

The optimization problem TRPO tries to solve looks like this:
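
In the standard notation used in the TRPO and PPO papers (with the probability ratio, the advantage Aₜ, and the KL limit δ all explained below), the problem is:

```latex
\max_{\theta}\; \mathbb{E}_t\!\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, A_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ \mathrm{KL}\!\left[ \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\; \pi_{\theta}(\cdot \mid s_t) \right] \right] \le \delta
```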

Math Breakdown

The equation seems daunting, but let’s break it down step by step so we understand:

Advantage Function: Aₜ

The advantage function Aₜ tells us how much better (or worse) an action aₜ taken in state sₜ is compared to the average action in that state. If Aₜ is positive, the action was better than average; if it’s negative, the action was worse.
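
A common way to write this (a standard definition, stated here for reference) is the value of the chosen action minus the average value of the state:

```latex
A_t \;=\; Q(s_t, a_t) \;-\; V(s_t)
```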

Probability Ratio:

This ratio, written rₜ(θ), compares the probability of taking action aₜ under the new policy πθ to the probability of taking the same action under the old policy πθ_old. It helps us understand how the likelihood of taking that action has changed due to the policy update.
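
Written out in the standard notation:

```latex
r_t(\theta) \;=\; \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```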

Here’s an intuitive example:

Imagine you’re trying to adjust a cake recipe based on feedback, where the advantage is the feedback score (positive if good, negative if bad), and the probability ratio is how much more or less you’re adding of an ingredient compared to last time. If you receive positive feedback (high advantage) for adding more sugar (higher probability ratio), the product of the two encourages you to add even more sugar. If you receive negative feedback (negative advantage), it discourages you from adding more.

Then, the question becomes, what is the effect of multiplying the advantage function and the probability ratio?

When we multiply the probability ratio rₜ(θ) by the advantage Aₜ, we effectively scale the advantage by how much the policy has changed. This has the following effects:

If the action aₜ has become more probable under the new policy (rₜ(θ) > 1) and the advantage Aₜ is positive, the product rₜ(θ)Aₜ is even more positive, encouraging the update in that direction. If the advantage Aₜ is negative, the product rₜ(θ)Aₜ will be negative, discouraging the update in that direction. If rₜ(θ) = 1 (meaning the new policy hasn’t changed the probability of taking action aₜ), the product is just Aₜ, preserving the original advantage signal.

Once we understand the probability ratio and the advantage function, the overall equation begins to make sense. Taking the expectation of the whole expression gives us an average over timesteps, and KL[ ] is simply a measure of the difference between the old policy and the new one. We require that difference to stay under a certain limit δ, and in doing so we achieve the “stability” we want by limiting how far the policy can diverge in a single update.

On the other hand, PPO is designed to simplify TRPO while maintaining its benefits. Like TRPO, PPO aims to maximize the expected reward. However, instead of using a trust region defined by KL divergence, PPO uses a simpler clipping mechanism to ensure the policy update isn’t too large.

It’s clear that TRPO and PPO are extremely similar. The main difference is that the measure of difference is no longer KL divergence, but a “clipping” expression that restricts the ratio to the interval (1−ϵ, 1+ϵ): min(rₜ(θ)Aₜ, clip(rₜ(θ), 1−ϵ, 1+ϵ) Aₜ).

The PPO objective includes this clipping mechanism to ensure stable updates; clipping is simply a different method of controlling for stability.
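
Put together, the clipped surrogate objective from the PPO paper reads:

```latex
L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, A_t,\; \operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) A_t \right) \right]
```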

Clipping function

This restricts the ratio rₜ(θ) to be within a certain range [1−ϵ,1+ϵ]. This prevents the policy from changing too drastically.

Then, by taking the minimum of the unclipped and clipped terms, PPO ensures that the update is conservative if the change is too large. This prevents large, destabilizing updates while still allowing beneficial updates within the safe range.

This is a simpler and more direct way of restricting the difference between policies, and it is the main reason PPO is much faster than TRPO. The KL[ ] term in TRPO means the TRPO objective is not easily differentiable and requires Taylor approximation, whereas after replacing KL[ ] with the bounds [1−ϵ, 1+ϵ], PPO can be optimized with plain gradient descent, since the clipped objective is just a simple piecewise function of the ratio.
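
To show how little machinery this takes, here is a minimal NumPy sketch of the clipped objective. The inputs are made-up placeholders; a real training loop would compute log-probabilities from the current and old policy networks, estimate advantages from rollouts, and maximize this quantity with a gradient-based optimizer.

```python
# Minimal sketch of PPO's clipped surrogate objective.
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average of min(r*A, clip(r, 1-eps, 1+eps)*A) over a batch."""
    ratio = np.exp(logp_new - logp_old)               # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)    # keep the policy change small
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy usage with made-up numbers:
logp_new = np.array([-0.9, -1.1, -0.7])
logp_old = np.array([-1.0, -1.0, -1.0])
adv      = np.array([ 1.0, -0.5,  2.0])
print(ppo_clip_objective(logp_new, logp_old, adv))
```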

Conclusion:
Initially conceived as a research initiative, OpenAI embarked on its quest to develop and publish open-source machine learning algorithms. Collaborative efforts from distinguished members of the machine learning academia community, coupled with support from corporate giants like NVIDIA and Elon Musk, laid the foundation and the budget for OpenAI’s research endeavors.

The testing phase posed significant challenges, requiring a balance between sophistication and scale. Dota 2 emerged as the ideal testing ground, offering a rich environment replete with complex dynamics mirroring real-world scenarios.

The culmination of OpenAI’s efforts manifested in the creation of innovative tools like Rapid, Proximal Policy Optimization (PPO), and hierarchical reinforcement learning (RL). These advancements revolutionized AI training, mainly by increasing efficiency to the point where training a model became plausible in terms of time and resources (albeit still inefficient compared to 2024 algorithms). Under previous algorithms and hardware constraints, training such a model would have taken decades, but by decreasing computational complexity through hierarchical breakdown and a mathematically simpler PPO objective, and by increasing computing power by linking many machines together through Rapid, training a model became a feasible feat that could be done within months.

In essence, OpenAI’s unlikely origins in a video game exemplify the fusion of technological innovation with human ingenuity. The first iterations of OpenAI through Dota 2 are proof that innovation comes from unlikely places and does not always spring up in the laboratories of academic institutions, and that we as humanity should always keep an open mind and harbor wonder for anything and everything.
