League of Legends Data Modeling

Andrew A Suter
Feb 17, 2022

Notebooks for this article

Using the Riot API and the Cassiopeia Python library, I collected around 17,000 match entries from the KR, EUW, and NA servers. All of these matches were from the Challenger division.

Data collection is one of the most important steps for any later analysis. To keep my entries independent, I sampled randomly and kept only one of the 10 participants from each match.

The process went as follows: I gathered every Challenger player in each region, pulled each player’s 20 most recent games, and randomly selected one participant from each match (not necessarily the player whose history the match came from) to add to my data. I also made sure no match_id was used twice, since the same match can appear in several players’ match histories.
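The sampling steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code from my notebooks: the function name, the dictionary layout, and the `seed` parameter are hypothetical, and the Riot API and Cassiopeia calls that would fetch real match histories are left out.

```python
import random


def sample_one_participant_per_match(match_histories, games_per_player=20, seed=0):
    """Pick one random participant per unique match.

    match_histories: hypothetical dict mapping a player name to a list of
    matches (assumed newest first). Each match is a dict with a 'match_id'
    and a list of 'participants' records.
    """
    rng = random.Random(seed)
    seen_match_ids = set()
    rows = []
    for matches in match_histories.values():
        for match in matches[:games_per_player]:
            match_id = match["match_id"]
            if match_id in seen_match_ids:
                continue  # the match already came up in another player's history
            seen_match_ids.add(match_id)
            participant = rng.choice(match["participants"])
            rows.append({"match_id": match_id, **participant})
    return rows


def make_match(match_id):
    # Toy match with 10 participants; the first five are on the winning team.
    participants = [{"champion": f"champ_{i}", "win": i < 5} for i in range(10)]
    return {"match_id": match_id, "participants": participants}


shared_match = make_match("KR_1")
histories = {
    "player_a": [shared_match, make_match("KR_2")],
    "player_b": [shared_match],  # same game seen from another player's history
}
sample = sample_one_participant_per_match(histories)
print(len(sample), "rows from", len(histories), "players")
```

Because each match contributes exactly one row, no two rows in the sample describe the same game.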

My Results:

Initial visualizations showed me the most played champions in my dataset for each role. This sample was collected in January and does not reflect the entire population.

Korea:

Note: Unfortunately, the x-axis in these charts labels only every other champion. If you would like to see the full list, please visit my GitHub repository here.

North America:


Europe West:


Notes for the Roles:

  • Support: Lulu, Rakan, and Thresh dominate across all 3 regions.
  • ADC: Jhin is the most popular across all 3 regions.
  • Mid: LeBlanc is the most common in both EUW and NA, while Viktor dominates in KR.
  • Jungle: Lee Sin is very popular in EUW and KR, while Graves is the most popular in NA.
  • Top: Camille is the most popular in EUW and NA, while Jayce is the most common in KR.

Modeling:

To start my modeling process, I label-encoded my object-type (categorical) variables, turning each category into an integer code.

I also dropped the ‘Champion’, ‘Role’, ‘f_spell’, and ‘d_spell’ columns, as these variables had too many levels and I was already splitting my models by role.
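As a rough sketch of this preprocessing step, assuming a pandas DataFrame: the four dropped column names come from the text above, while `side`, `kda`, `win`, and all of the values are made up for illustration. Encoding through pandas category codes is one common way to label-encode and may differ from what the notebooks use.

```python
import pandas as pd

# Toy frame; only the four dropped column names come from the article.
df = pd.DataFrame({
    "Champion": ["Jhin", "Lulu", "Viktor", "Jhin"],
    "Role": ["ADC", "Support", "Mid", "ADC"],
    "f_spell": ["Flash", "Flash", "Flash", "Flash"],
    "d_spell": ["Heal", "Ignite", "Teleport", "Heal"],
    "side": ["blue", "red", "red", "blue"],  # hypothetical categorical feature
    "kda": [3.5, 4.0, 2.1, 5.2],             # hypothetical numeric feature
    "win": [True, False, False, True],       # hypothetical target
})

# Drop the high-cardinality columns named in the article.
model_df = df.drop(columns=["Champion", "Role", "f_spell", "d_spell"])

# Label-encode the remaining categorical columns as integer codes.
# Categories are ordered alphabetically, so "blue" -> 0 and "red" -> 1.
categorical_cols = ["side"]
for col in categorical_cols:
    model_df[col] = model_df[col].astype("category").cat.codes

print(model_df)
```

Integer codes imply an ordering between categories. Tree-based models tolerate that well, but a linear or logistic model may be better served by one-hot encoding.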

I did not have to deal with any missing values in my data sets.

Next, I split my data into training and testing sets (an 80/20 split) to properly test my models’ accuracy. I also used repeated k-fold cross-validation on the training set to evaluate my models before touching the testing set.

My Random Forest model performed the best across all roles and regions, judged by F1 score and ROC AUC score.

Then I evaluated my models on the testing set and arrived at the following prediction scores:
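The split-then-cross-validate workflow can be sketched with scikit-learn as below. Everything specific here is an assumption for illustration: the data is synthetic, and the stratified variant of repeated k-fold, the 5 folds, the 3 repeats, and the forest size are choices I made for the example rather than the settings from the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_validate,
    train_test_split,
)

# Synthetic stand-in for one role/region data set (win/loss as the target).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# 80/20 split; the test set stays untouched until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Repeated k-fold cross-validation on the training set only.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_validate(
    model, X_train, y_train, cv=cv, scoring=["f1", "roc_auc"]
)

mean_f1 = cv_scores["test_f1"].mean()
mean_auc = cv_scores["test_roc_auc"].mean()
print(f"CV F1: {mean_f1:.3f}  CV ROC AUC: {mean_auc:.3f}")
```

Comparing candidate models on these cross-validated scores, and only afterwards scoring the chosen model on `X_test`, keeps the test set as an honest final check.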

Korea:

North America:

Europe West:

However, the main point of this modeling was to find out which features mattered most for prediction. Since the models reached relatively good prediction scores, drawing conclusions about which features most influence games would not be overstepping.

Feature Importance Summary:

EUW Feature Importance
KR Feature Importance
NA Feature Importance

As we can see, KDA is a dominant feature in all of our models. Certain roles have their own dominant features as well: for ADC, turret kills and damage to objectives, and for support, assists. Ultimately, KDA plays the largest role across all regions and roles in these models.
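For readers who want to produce a ranking like the ones above, here is a minimal sketch using the impurity-based `feature_importances_` attribute of a scikit-learn random forest. That this is the importance measure behind my charts is an assumption of the sketch, and the data is synthetic: the win label is deliberately built from `kda`, while `assists` and `turret_kills` are pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_games = 1000

# Synthetic features; only kda is tied to the outcome by construction.
kda = rng.gamma(shape=2.0, scale=1.5, size=n_games)
assists = rng.poisson(lam=8, size=n_games)
turret_kills = rng.poisson(lam=2, size=n_games)
win = (kda + rng.normal(0.0, 0.5, size=n_games) > np.median(kda)).astype(int)

feature_names = ["kda", "assists", "turret_kills"]
X = np.column_stack([kda, assists, turret_kills])

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, win)

# Rank features from most to least important; the importances sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>12}: {importance:.3f}")
```

Impurity-based importances can be skewed by correlated features, so permutation importance (`sklearn.inspection.permutation_importance`) is a useful cross-check.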

Notes:

  • I did not include a combined analysis of these data sets, which I hope to do in the future.
  • I did not plot the distributions of all of my variables to weed out outliers.
  • Checking the variance inflation factor (VIF) would be useful for my logistic model, as certain features are clearly highly correlated with one another.
  • The damage to objectives, turrets, and buildings features, as well as turret kills, all seem to play a large role in my models. However, any winning team will most likely take the most turrets and deal the most damage to objectives. In the future, it would be best to find a better way to include these variables, or to find replacements, when considering individual performance as a predictor.
  • If you have any other comments or considerations, please visit my Kaggle!

Thanks for your time, and if you enjoy this content, follow my page to see biweekly updates and changes to my analysis!

Links:
