Regression models & prediction of PUBG in R

5 min read · Dec 14, 2018

In PUBG, players can choose to play solo or queue with friends, so I split the data into two sections: solo mode and multi mode.

In the previous article, Data Analysis of PlayerUnknown’s Battlegrounds (PUBG) — Introduction & Data Preparation, we prepared two data sets: solo and multi.

Now let’s analyze the correlation coefficient between each variable and the winning placement for these two game modes. The variables that correlate most strongly with the winning placement then serve as predictors in a regression model, which gives us a winning formula.

In this article, I will use the PUBG dataset to demonstrate the regression analysis and will include the following topics:

Analyze the correlation coefficient
Build Data-Driven Winning Formula
Predict the winning percentage

Analyze the correlation coefficient

R code:

# See the correlations among all variables in solo
# install.packages("corrplot")
library(corrplot)
res_solo <- cor(solo[5:27])
corrplot(res_solo, type = "upper", tl.col = "black", tl.srt = 45)

# See the correlations among all variables in multi
res_multiple <- cor(multiple[5:29])
corrplot(res_multiple, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

Load the corrplot package, use the cor function to compute the correlation coefficients for each data set, then pass the result to corrplot to draw the correlation map. Positive correlations are shown in blue and negative correlations in red.

Correlation Map of Solo Mode (left) and Multi Mode (right)

winPlacePerc is the target of prediction. It is a percentile winning placement, where 1 corresponds to 1st place and 0 corresponds to last place in the match. So here we focus only on the correlation coefficients between winPlacePerc and the other variables.

A positive correlation coefficient means that, for any two variables X and Y, an increase in X is associated with an increase in Y.

A negative correlation coefficient means that, for any two variables X and Y, an increase in X is associated with a decrease in Y.
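
As a quick illustration of the sign of a correlation coefficient, here is a minimal sketch with made-up numbers (not the PUBG data). The vectors x, y_up, and y_down are hypothetical and exist only for this example.

```r
# Toy vectors, invented for illustration only (not PUBG data)
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 5, 8, 10)   # tends to rise as x rises
y_down <- c(9, 7, 6, 3, 1)    # tends to fall as x rises

r_pos <- cor(x, y_up)    # Pearson correlation, positive here
r_neg <- cor(x, y_down)  # Pearson correlation, negative here

print(c(positive = r_pos, negative = r_neg))
```

The closer a coefficient is to 1 or -1, the stronger the linear association; values near 0 indicate little linear association.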

Let’s zoom into the five variables most highly correlated with winPlacePerc.

walkDistance — Total distance traveled on foot measured in meters.
killPlace — Ranking in a match by number of enemy players killed. For example, if a player killed 99 people in a single game, their ranking is 1; if a player killed only 1, the ranking may be 80. (killPlace is negatively correlated with winPlacePerc.)
boosts — Number of boost items used.
weaponsAcquired — Number of weapons picked up.
heals — Number of healing items used. (in multi-player mode)
damageDealt — Total damage dealt. Note: Self-inflicted damage is subtracted. (in solo mode)

Top 5 correlated variables of Solo Mode (left) and Multi Mode (right)

Players with a high winning placement percentile show the following patterns, ranked from the strongest correlation to the weakest:
1. They walk a longer distance in the game.
2. They rank higher in the number of enemy players killed.
3. They use more boost items, which can make walking faster.
4. They pick up more weapons.
5. In multiplayer mode, they use more healing items; in solo mode, they deal more damage to other players.

You can see that the top four correlated variables are the same in both modes. The fifth differs: it is heals in multi mode and damageDealt in solo mode. This suggests that players who want a higher ranking in multiplayer mode should collect more healing items and try to heal injured teammates.
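
The ranking above can also be read off the correlation matrix programmatically. The following is a minimal sketch on simulated data: top_correlated is a hypothetical helper (not part of any package), and the toy data frame only borrows a few PUBG column names; its values are made up.

```r
# Rank the columns of a data frame by |correlation| with a target column.
# `top_correlated` is a hypothetical helper written for this example.
top_correlated <- function(df, target, n = 5) {
  r <- cor(df)[, target]        # correlation of every column with the target
  r <- r[names(r) != target]    # drop the target's correlation with itself
  head(r[order(abs(r), decreasing = TRUE)], n)
}

# Synthetic stand-in for the solo data set (all values are simulated)
set.seed(1)
n <- 200
toy <- data.frame(walkDistance = runif(n, 0, 4000),
                  boosts       = rpois(n, 2),
                  noise        = rnorm(n))
toy$killPlace    <- round(100 - toy$walkDistance / 50 + rnorm(n, sd = 10))
toy$winPlacePerc <- toy$walkDistance / 4000 + rnorm(n, sd = 0.1)

top <- top_correlated(toy, "winPlacePerc", n = 3)
print(round(top, 2))
```

Sorting by absolute value matters because a strongly negative variable such as killPlace would otherwise fall to the bottom of the list. On the real data, a call such as top_correlated(solo[5:27], "winPlacePerc") should work in the same way, assuming winPlacePerc is among the selected columns.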

Build Data-Driven Winning Formula

In this section, I will build two regression models, one for each game mode.

#------------- Determine the regression model ----------------#
# Regression model for solo mode
fit1 <- lm(winPlacePerc ~ walkDistance + killPlace + boosts + weaponsAcquired + damageDealt + kills, data = solo)
summary(fit1)

# Regression model for multi mode
fit2 <- lm(winPlacePerc ~ walkDistance + killPlace + boosts + weaponsAcquired + heals + damageDealt, data = multiple)
summary(fit2)

Use the summary function to print a summary of each fitted model.

The summary results of solo (left) and multi (right) mode

The dependent variable is winPlacePerc, and the independent variables are essentially the highly correlated variables I just mentioned (the solo model also includes kills). The p-values show that all variables are significant, and the adjusted R-squared values are 0.7986 and 0.7682, which suggests that the models fit the data reasonably well.
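
The numbers quoted above can be pulled out of the summary object directly instead of being read from the printed output. Here is a minimal sketch on simulated data; the toy data frame and its coefficients are made up, and only the column names are borrowed from the PUBG data.

```r
# Simulated data standing in for the PUBG tables (all values are made up)
set.seed(42)
n <- 500
toy <- data.frame(walkDistance = runif(n, 0, 4000),
                  boosts       = rpois(n, 2))
toy$winPlacePerc <- 0.1 + 0.0002 * toy$walkDistance + 0.02 * toy$boosts +
  rnorm(n, sd = 0.05)

fit <- lm(winPlacePerc ~ walkDistance + boosts, data = toy)
s   <- summary(fit)

adj_r2 <- s$adj.r.squared                 # adjusted R-squared
p_vals <- s$coefficients[, "Pr(>|t|)"]    # p-value of each coefficient

print(adj_r2)
print(p_vals < 0.05)   # TRUE where a term is significant at the 5% level
```

The same two fields are available on summary(fit1) and summary(fit2), which makes it easy to compare the solo and multi models side by side.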

Predict the winning percentage

In this section, I will use the predict function to predict winPlacePerc under the regression model we built earlier.

The first step is to prepare two data sets: remove the winPlacePerc field from solo.xlsx and save the result as solo_test.xlsx, then import both files into RStudio.

#---------- Prediction of winPlacePerc in Solo Mode -------------#
library(readxl)
train_solo <- read_excel("file path/solo.xlsx")
test_solo<- read_excel("file path/solo_test.xlsx")

Then use lm to build the regression model r2, and use predict to get r2.predict.

r2.predict is the prediction value generated by the regression model r2.

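# Name the columns in the formula and pass data =, so predict() can look them up in new data
r2 <- lm(winPlacePerc ~ walkDistance + killPlace + boosts + weaponsAcquired + damageDealt + kills, data = train_solo)
# predict() for lm takes new observations through newdata; an argument named data is silently ignored
r2.predict <- predict(r2, newdata = test_solo)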

Plot the prediction result.

plot(r2.predict)
points(train_solo$winPlacePerc, col = 2)

Solo Mode Prediction Result: black dots are the predicted values generated by the regression model; red dots are the actual values of *winPlacePerc*.
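
A single error number can complement the visual comparison in the plot. The sketch below uses simulated data, fits the model on part of the rows, predicts the remaining rows through the newdata argument of predict, and computes the root mean squared error (RMSE). The 80/20 split and the names toy, train, and test are assumptions made for this example, not the procedure used in this article.

```r
# Simulated data standing in for the solo table (all values are made up)
set.seed(7)
n <- 500
toy <- data.frame(walkDistance = runif(n, 0, 4000),
                  killPlace    = sample(1:100, n, replace = TRUE))
toy$winPlacePerc <- 0.5 + 0.0001 * toy$walkDistance - 0.004 * toy$killPlace +
  rnorm(n, sd = 0.05)

# Hold out 20% of the rows (an illustrative choice)
test_idx <- sample(n, size = n / 5)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit  <- lm(winPlacePerc ~ walkDistance + killPlace, data = train)
pred <- predict(fit, newdata = test)   # one prediction per row of `test`

# Root mean squared error between actual and predicted values
rmse <- sqrt(mean((test$winPlacePerc - pred)^2))
print(rmse)
```

Because winPlacePerc lies between 0 and 1, the RMSE can be read on the same scale: a smaller value means the predicted placement percentile is closer to the actual one on average.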

Save the prediction result as a CSV file.

write.csv(r2.predict,file="C:/Users/Alicia/Downloads/all/solo_predict.csv",row.names = FALSE)

The same process applies to Multi Mode.

#---------- Prediction of winPlacePerc in Multi Mode -------------#
train_multi <- read_excel("C:/Users/Alicia/Downloads/all/multi.xlsx")
test_multi <- read_excel("C:/Users/Alicia/Downloads/all/multi_test.xlsx")
r1 <- lm(winPlacePerc ~ walkDistance + killPlace + boosts + weaponsAcquired + heals + damageDealt, data = train_multi)
r1.predict <- predict(r1, newdata = test_multi)
plot(r1.predict)
points(train_multi$winPlacePerc, col = 2)
# Save the multi-mode predictions
write.csv(r1.predict, file = "C:/Users/Alicia/Downloads/all/multi_predict.csv", row.names = FALSE)

Multi Mode Prediction Result: black dots are the predicted values generated by the regression model; red dots are the actual values of *winPlacePerc*.


Written by Alicia Li