Data Analysis of PlayerUnknown’s Battlegrounds (PUBG): Introduction & Data Preparation

Alicia Li
Dec 14, 2018


In this article, I’ll cover the following topics:

  1. Introduction of PUBG and Objectives
  2. Data Source & Description
  3. Data preparation & Guidance

In the next few articles, I’ll show how I use R and Tableau for regression modeling and data visualization. You can follow the links below to see the analysis results.

Exploratory Data Analysis of PUBG

Identifying Cheaters (coming soon)

Using R to build regression models and predict the winning placement percentage of players

Introduction of PUBG and Objectives

PUBG PlayerUnknown’s Battlegrounds

PUBG is a complex and competitive game, and global tournaments with cash prizes draw many players. As of June 2018, an estimated 400 million players around the globe were playing this battle royale-style game.

Because each game starts with 100 players and the last survivor is the winner, the chance of winning any single game is slim. Those chances do improve as a player gains proficiency and experience: better accuracy, faster reaction times, and similar skills all contribute to winning.

The problem many players face, then, is that their chance of winning is very low, especially when they are new. It is no surprise that thousands of people search online every day for winning tactics and tips. So we set out to gather data on winning players and understand which in-game actions lead to more kills and longer survival, and therefore to a better chance of winning.

Data Source & Description

The first dataset I used is from Kaggle: PUBG Finish Placement Prediction.

This dataset provides a large number of anonymized PUBG game stats, with each row holding one player’s post-game stats. The data comes from matches of all types: solos, duos, squads, and custom. There is no guarantee of 100 players per match, nor of at most 4 players per group.
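One way to check the point about match and group sizes is to count rows per match and per group. The following is a minimal sketch on a small made-up data frame (the values are illustrative, not real PUBG data), and it assumes the identifier columns are named matchId and groupId:

```r
# Toy stand-in for the real data: one row per player (values are made up).
toy <- data.frame(
  Id      = paste0("p", 1:7),
  groupId = c("g1", "g1", "g2", "g3", "g3", "g3", "g4"),
  matchId = c("m1", "m1", "m1", "m2", "m2", "m2", "m2"),
  stringsAsFactors = FALSE
)

# Players per match: count rows for each matchId.
players_per_match <- table(toy$matchId)

# Players per group: sum a column of ones for each (matchId, groupId) pair.
group_sizes <- aggregate(
  data.frame(players = rep(1L, nrow(toy))),
  by  = list(matchId = toy$matchId, groupId = toy$groupId),
  FUN = sum
)

print(players_per_match)
print(group_sizes)
```

On the real data, matches with a count other than 100, or groups larger than 4, would show up directly in these two summaries.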

Here I list only some of the important data fields. For the full list, please refer to the link.

  1. assists — Number of enemy players this player damaged that were killed by teammates.
  2. boosts — Number of boost items used.
  3. heals — Number of healing items used.
  4. damageDealt — Total damage dealt. Note: Self-inflicted damage is subtracted.
  5. killPlace — Ranking in a match of the number of enemy players killed.
  6. kills — Number of enemy players killed.
  7. rideDistance — Total distance traveled in vehicles measured in meters.
  8. swimDistance — Total distance traveled by swimming measured in meters.
  9. walkDistance — Total distance traveled on foot measured in meters.
  10. weaponsAcquired — Number of weapons picked up.
  11. winPlacePerc — The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to the last place in the match.
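To make the 0-to-1 scale of winPlacePerc concrete, here is a small sketch of one linear mapping from finishing place to percentile. The helper place_to_perc and its formula are assumptions chosen only to match the description above (1 for first place, 0 for last); the exact computation used in the dataset may differ.

```r
# Hypothetical helper: map a finishing place to a 0..1 percentile so that
# 1st place -> 1 and last place -> 0. This linear mapping is an assumption
# that fits the field description, not the dataset's documented formula.
place_to_perc <- function(place, n_places) {
  stopifnot(n_places > 1, all(place >= 1), all(place <= n_places))
  (n_places - place) / (n_places - 1)
}

# First, middle, and last place in a 100-place match.
print(place_to_perc(c(1, 50, 100), n_places = 100))
```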

With this dataset, we can do some exploratory data analysis (EDA) and work toward a data-driven winning formula.

The second dataset is also from Kaggle: PUBG Match Deaths and Statistics.

This dataset provides two zip files: aggregate and deaths.

In deaths.zip, the files record every death that occurred within the 720k matches. That is, each row documents an event where a player has died in the match.

With this dataset, we can generate a death heat map in Tableau.
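The idea behind a death heat map is to bin each death's map coordinates into a grid and count the events per cell. Below is a rough R sketch of that binning step on made-up coordinates. The column names victim_position_x and victim_position_y and the 0 to 800 coordinate range are assumptions for illustration, so check them against the actual files.

```r
# Made-up death events; column names and coordinate range are assumptions.
deaths <- data.frame(
  victim_position_x = c(100, 120, 130, 700, 720, 400),
  victim_position_y = c(100, 110, 150, 650, 690, 420)
)

# Split each axis into equal-width bins over a hypothetical 0..800 map.
breaks <- seq(0, 800, by = 200)
x_bin <- cut(deaths$victim_position_x, breaks = breaks, include.lowest = TRUE)
y_bin <- cut(deaths$victim_position_y, breaks = breaks, include.lowest = TRUE)

# Count deaths per grid cell; larger counts are the "hotter" cells.
death_grid <- table(x_bin, y_bin)
print(death_grid)
```

A visualization tool then only has to color each cell by its count.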

Data preparation & Guidance

First, import the first dataset, train_V2.xlsx, into RStudio.

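# ---------- Import data file ---------- #
library(readxl)
# Replace "file path" with the folder that holds your copy of the file.
pubg_100000 <- read_excel("file path/train_V2.xlsx")
# attach() lets later code refer to columns by their bare names.
attach(pubg_100000)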
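If your copy of the data is a CSV file instead of an Excel workbook, a small wrapper can pick the reader from the file extension. This is only a sketch: load_pubg is a hypothetical helper name, and the demonstration uses a tiny made-up CSV written to a temporary file, not the real dataset.

```r
# Hypothetical helper: choose a reader based on the file extension.
load_pubg <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "csv") {
    read.csv(path, stringsAsFactors = FALSE)
  } else if (ext %in% c("xls", "xlsx")) {
    # Requires the readxl package to be installed.
    as.data.frame(readxl::read_excel(path))
  } else {
    stop("Unsupported file type: ", ext)
  }
}

# Demonstration on a tiny made-up CSV written to a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Id = c("a", "b"), kills = c(0, 3)), tmp, row.names = FALSE)
demo <- load_pubg(tmp)
str(demo)
```

Calling str() or dim() right after the import is a cheap way to confirm that the column names and row count look as expected.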

In PUBG, players can play solo or queue with friends, so I divide the data into two sets: solo mode and multiplayer mode.

# Columns kept for solo players: everything except columns 8 and 20.
column <- c(1:7, 9:19, 21:29)
solo_types <- c("solo-fpp", "solo", "normal-solo", "normal-solo-fpp")
solo <- subset(pubg_100000[column], matchType %in% solo_types)
# Duo and squad matches keep all columns.
multi_types <- c("duo-fpp", "duo", "squad-fpp", "squad",
                 "normal-squad-fpp", "normal-squad",
                 "normal-duo-fpp", "normal-duo")
multiple <- subset(pubg_100000, matchType %in% multi_types)
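After a split like this, it is worth checking how many rows land in each set and whether any match types fall into neither. The sketch below runs that check on a made-up data frame; the label "some-custom-mode" is a hypothetical stand-in for a match type outside both lists.

```r
# Made-up rows standing in for the imported data.
toy <- data.frame(
  Id = 1:6,
  matchType = c("solo", "solo-fpp", "duo", "squad-fpp",
                "normal-duo", "some-custom-mode"),
  stringsAsFactors = FALSE
)

solo_types  <- c("solo-fpp", "solo", "normal-solo", "normal-solo-fpp")
multi_types <- c("duo-fpp", "duo", "squad-fpp", "squad",
                 "normal-squad-fpp", "normal-squad",
                 "normal-duo-fpp", "normal-duo")

solo_rows  <- subset(toy, matchType %in% solo_types)
multi_rows <- subset(toy, matchType %in% multi_types)

# Rows whose match type is in neither list (for example, custom modes).
unassigned <- subset(toy, !(matchType %in% c(solo_types, multi_types)))

cat("solo:", nrow(solo_rows),
    " multiple:", nrow(multi_rows),
    " unassigned:", nrow(unassigned), "\n")
```

On the real data, table(pubg_100000$matchType) gives the same overview per match type.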

Now you have two datasets: solo and multiple. We will use them later for correlation analysis and for building regression models. Let’s dive into the data in the next article: Exploratory Data Analysis of PUBG.

Want to know more about data analysis of PUBG? Check these articles!

Exploratory Data Analysis

Identifying Cheaters (coming soon)

Regression models & prediction
