Using Unsupervised Machine Learning to Assume Positions in League of Legends

This post is the second in a series on data analysis in League of Legends; the first one is Understanding League of Legends Data Analytics.


Data analysis is the process of fiddling with recorded observations, in the form of numbers and properties, in order to discover meaningful insights. Since it is heavily based on figures, using unreliable figures may lead to unreliable conclusions. In other words, if some of the features in your dataset are flawed, you’re going to have a problem.

Working with data that somebody else has recorded may mean having to compromise on standards. You can’t always get the data you want. Because our data comes from a variety of different sources, this is a challenge we face often at Snipe.

In my last post, Understanding League of Legends Data Analytics, we saw how analytics sites for League of Legends amass hundreds of millions of page views each month, and learned that a team in a game of League is composed of 5 players, each tasked with a distinct role (AKA position): Top, Middle, Jungle, Support, and Carry.

The choice of position is the most fundamental choice a gamer makes in a game of League, as it directly influences all subsequent in-game choices. Therefore, it should come as no surprise that every analysis of in-game player actions relies on knowing what position that player has played.

This post discusses how and why we trained a machine learning model to assume players’ positions based on post-game data, including code snippets in Python.


When we first began writing data analytics tasks for League game data, we noticed that some of the data gathered from Riot Games’ official League of Legends API seemed off. Specifically, the ‘position’ (in terms of role) values assigned to players by the API often appeared to be wrong: a player who played position X (as was made obvious by analyzing their stats) would get assigned position Y.

We have confirmed that the positions recognized by the API are unreliable by reading posts on the (now defunct) Riot Developer Discussions and by reading about the experiences of fellow developers.

In order to estimate the scope of the problem, we devised a quick way to assess whether a team composition was likely* wrong: every team should have exactly one of each position, so if a team is missing a position, we deem its assumed composition incorrect. Running this test on a sample of 100k ranked games played by experts (rank Platinum and above), we found that 35% of teams suffered from faulty position recognition. For future reference, we will call this test the “team composition test”.

*These two points clarify why the team composition test is not absolute in determining whether an assumed team composition is incorrect:
1. It’s possible that the position recognition algorithm swapped two positions (for instance, labeling somebody who actually played X as Y, and somebody who actually played Y as X); such a team still has one of each position, so it passes the test despite being wrong.
2. Sometimes, players deliberately play out of meta (as explained in the previous post), in which case it is possible for there to have been two players playing the same role.
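The test itself is tiny. A minimal sketch, with hypothetical role labels and a hypothetical helper name (not taken from our actual codebase):

```python
from collections import Counter

ROLES = {"Top", "Middle", "Jungle", "Support", "Carry"}

def passes_team_composition_test(positions):
    """Return True if an assumed team composition has exactly one of
    each of the 5 roles; a missing (and therefore duplicated) role
    fails the test."""
    return Counter(positions) == Counter(ROLES)
```

Note how a swapped pair (point 1 above) still passes this test, and a legitimate out-of-meta composition (point 2) can fail it.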

Knowing that the positions recognized by the API are highly inaccurate, we decided to build our own player position recognition engine based on different, more reliable data provided by the API.


Position Recognition Algorithm

Our initial idea was to write an algorithm that assumes positions in the context of a team, with a divide-and-conquer strategy. Within a team, the easiest role to identify is the Jungler (for starters, the Jungler will almost certainly pick the summoner spell ‘Smite’, and will be the only team member to do so). Once we single out the Jungler, we can quite easily single out Top and Middle (as they played in the Top and Middle lanes of the map respectively, and we have access to player coordinates). That leaves 2 players, and we need to figure out which of them played Support and which played Carry (both play in the Bottom lane, so coordinates alone don’t help us there).

Thinking this through, we realized that writing such an algorithm would take a lot of effort. Firstly, the flow is quite convoluted, making it difficult to code and maintain. Secondly, we would need to figure out a bunch of numbers and thresholds to put in our code (for instance, the coordinates of the polygons that correspond to each lane). Contemplating the last point, it dawned on us: why write an algorithm when we can have a machine learn the data and come up with a model that solves our problem?

Cluster Analysis, a subset of unsupervised learning, is the field of machine learning that breaks down a set of observations (a dataset) into groups (clusters) based on commonalities between the observations. We hoped we could use cluster analysis to break down a dataset of player data (where each observation is a single player’s stats in a single game) into 5 clusters, one per position.

The algorithm we used is KMeans Clustering. KMeans is arguably the simplest, most popular unsupervised learning algorithm out there. While our initial intention was to use it as a stepping stone to building a position recognition engine, it proved to be an adequate solution by itself.

Choosing the features

A good dataset starts by choosing good features. That is, the properties we want to have recorded in the dataset, per observation. Selecting the right features is fundamental to training any machine learning algorithm optimally.

Feature selection is subject to expertise in the forms of knowledge (knowing which kinds of variables work optimally with which algorithms), intuition (“I believe this feature will help the algorithm create a good model because…”), and trial-and-error. You can read more about it here.

The data returned by the League API (the /match/timeline endpoint, which gives detailed post-game data) can be split into two categories:

  • Events: per player, a list of events they took part in, alongside their timestamps (player killed, took objective, purchased an item, etc.).
  • Frames: per player, a by-minute snapshot of metrics (amount of gold earned, number of creeps killed, coordinates at the time of the snapshot).

For our model, we chose 4 features from the Frames category. Let’s go over our intuition for choosing them:

Jungle Creeps killed at minute 14 — the overall number of jungle (neutral) creeps killed by the player by minute 14 of the game. During the first 10–15 minutes of the game, the Jungler should have killed significantly more jungle creeps than anyone else. We are expecting this feature to help the algorithm cluster Junglers together.

Median of X and median of Y coordinates at minutes 2–12 — the League API gives us the coordinates (on the 2D game map) of each player at a by-minute sample rate. During minutes 2–12 of the game, laners should be in their respective lanes almost exclusively. We are expecting these features to help the algorithm cluster by lane. We’re taking the median rather than the mean so that outlier samples don’t skew the value (for instance, when somebody was caught out of lane at the time of a snapshot).

Enemy Creeps Killed at minute 14 — the overall number of enemy creeps killed by the player by minute 14 of the game. During the first 10–15 minutes of the game, Support players should have considerably fewer enemy creep kills than any other player on the team. We are expecting this feature to help the algorithm cluster Supports together.
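Putting the four features together, extraction from a player’s per-minute frames might look like the sketch below. The field names (`jungleMinionsKilled`, `minionsKilled`, `position`) follow the timeline endpoint’s participant frames, but treat them and the helper itself as illustrative assumptions rather than our production code:

```python
import statistics

def extract_features(frames):
    """Build the 4 features for one player from their per-minute frames.

    `frames` is a list of per-minute snapshots (index 0 = minute 0),
    each a dict with cumulative 'jungleMinionsKilled' and
    'minionsKilled' counts and a 'position' of {'x': ..., 'y': ...}.
    """
    xs = [f["position"]["x"] for f in frames[2:13]]  # minutes 2-12
    ys = [f["position"]["y"] for f in frames[2:13]]
    return [
        frames[14]["jungleMinionsKilled"],  # jungle creeps killed by minute 14
        statistics.median(xs),              # median X coordinate, minutes 2-12
        statistics.median(ys),              # median Y coordinate, minutes 2-12
        frames[14]["minionsKilled"],        # enemy creeps killed by minute 14
    ]
```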

Preprocessing

We constructed a dataset using data gathered from the League API. We took a sample of 500k games played by high-ranking players, leaving us with 5M observations: one per player, with 10 players in each game.

We are going to use Python with scikit-learn to create our model.

The first step is to load our dataset from disk to memory. After doing so, we scale the data by applying the MinMaxScaler to it. Feature scaling is important when dealing with quantitative features, as failure to scale them may result in the algorithm undesirably putting more weight on features with greater variance.
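A minimal sketch of that step; the tiny in-line array stands in for the real dataset loaded from disk:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the real dataset: one row per player-game, 4 columns
# (jungle creeps @14, median X, median Y, enemy creeps @14).
data = np.array([
    [60.0,  2500.0, 11000.0,   4.0],   # looks like a Jungler
    [ 2.0,  7400.0,  7400.0, 115.0],   # looks like a Mid laner
    [ 0.0, 11500.0,  2500.0,  18.0],   # looks like a Support
])

# Rescale every feature to [0, 1]; without this, the map coordinates
# (thousands of units) would dominate KMeans' Euclidean distances.
scaled = MinMaxScaler().fit_transform(data)
```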

Validation

Next, we should verify that the features we chose work optimally with the KMeans algorithm (when k=5); that is, that the algorithm can create a model that optimally distributes the data between 5 clusters. We will do so using the elbow method.

The elbow method provides an effective way to find the optimal K for a dataset. Without getting into the ‘why’ (you can read about it here), we’re looking for the “elbow” in the curve of inertia plotted against K. In our case, the elbow is at 5, meaning the optimal K for our dataset is 5. This is good news. Now we can fit the model to our dataset.
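Sketched on synthetic stand-in data (5 tight, well-separated blobs playing the part of our scaled features), the elbow computation looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled dataset: 5 well-separated blobs.
centers = np.eye(5, 4)  # 5 distant points in the 4D feature space
X = np.vstack([c + rng.normal(0.0, 0.02, size=(200, 4)) for c in centers])

# Fit KMeans for a range of k and record the inertia (within-cluster
# sum of squared distances); the k where the curve "bends" is the elbow.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 10)
]
```

Plotting `inertias` against k reproduces the elbow curve; on data with 5 real groups the drop flattens sharply after k=5.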

Training

In order to assess the effectiveness of our model, we’re going to look at the distribution of observations across clusters. Since every team has one of each role (and our dataset is comprised of full teams), we are expecting the clusters to have a more or less equal number of observations — around 1M each (5M observations divided by 5 positions).

Indeed, we can see that each cluster makes up almost exactly 20% of the dataset. Running the team composition test on our assumed team compositions, we measured that less than 0.5% of teams were badly composed. To check for overfitting, we let our model cluster a test set (same features, different observations). The distribution across clusters in the test set showed similar numbers (we measured a deviation of up to 1% per cluster).
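On a synthetic stand-in dataset (5 equal-size blobs, mirroring the 5 positions), the fit and the distribution check can be sketched as:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# 5 equal-size blobs standing in for the 5 positions in scaled form.
centers = np.eye(5, 4)
X = np.vstack([c + rng.normal(0.0, 0.02, size=(1000, 4)) for c in centers])

model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Each cluster's share of the observations; a balanced result
# (roughly 20% each) is what we expect from full teams.
_, counts = np.unique(model.labels_, return_counts=True)
shares = counts / len(X)
```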

You may notice the clusters aren’t labeled appropriately but are rather arbitrarily numbered 0–4. Labeling each cluster (understanding which cluster corresponds to which position) is quite simple in our case: we can take a single game that we played ourselves (so we know for sure who played in what position), run the relevant stats through the model for each player, and see which cluster the model assigns to each player.
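A sketch of that labeling step; here the blobs and the `known_game` features are synthetic stand-ins for the real dataset and for a real game we played ourselves:

```python
import numpy as np
from sklearn.cluster import KMeans

positions = ["Top", "Middle", "Jungle", "Support", "Carry"]

rng = np.random.default_rng(2)
# One synthetic blob per position, standing in for the real dataset.
centers = dict(zip(positions, np.eye(5, 4)))
X = np.vstack([centers[p] + rng.normal(0.0, 0.02, size=(200, 4))
               for p in positions])
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# A game where we know each player's true position: predicting their
# cluster tells us which arbitrary cluster id means which position.
known_game = {p: centers[p] for p in positions}
cluster_to_position = {
    int(model.predict(feats.reshape(1, -1))[0]): pos
    for pos, feats in known_game.items()
}
```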


A scatter plot visualization of our clustered dataset (reduced to 2D using the PCA algorithm). Every dot is an observation and every color is the cluster the model assigned it to. You can read more about this here.
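The plot can be reproduced along these lines (again on synthetic stand-in blobs; the output filename is arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = np.vstack([c + rng.normal(0.0, 0.02, size=(200, 4))
               for c in np.eye(5, 4)])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Project the 4D feature space down to 2 dimensions for plotting.
reduced = PCA(n_components=2).fit_transform(X)
plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, s=5)
plt.savefig("clusters.png")
```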

Thanks to the magic of machine learning (and some data) we were able to turn a virtually unusable metric into a fairly reliable one. Our goal here was not to build a perfectly accurate solution, but rather a fast one that yields errors at a negligible rate.

The full code is available here.


Conclusions

  • You can’t always get the data you want, but if you try sometimes, you may find… you can mend your dataset by estimating what the flawed values should have been like. Machine learning is a great tool for that, as it can predict variables accurately based on other data, with little human interaction (saving time).
  • Unsupervised machine learning is usually thought of as a toolset for exploring data, but it can go beyond this use case. In our case, we have essentially used it to build a classifier.
  • Simple machine learning models can replace hundreds of lines of code. Knowledge of machine learning is invaluable for developers to have, and should not be reserved only for data scientists.

So, we know how to assume a player’s position in a past game, but can we predict a player’s position in a game that just started? Follow us to find out how we tackle more data-related challenges!