“DamageTrack”: Using Computer Vision to Extract Unique Insights from Gameplay


Introduction

Competitive multiplayer video games have become a dominant part of the video game industry. For casual players and especially professional players, quantitative analysis of matches is crucial for receiving feedback and improving performance. This would be easy if the game opened up a data API through which users could access the data, but such an API is not always available. A third-party platform that can quantify gameplay therefore comes in handy for both casual and professional gamers.

During the spring of 2018, I participated in the Insight Data Science Program as a data science fellow. I had the chance to consult with Mayhem, a company that live-streams video game replay videos. They possess a large repository of gameplay videos uploaded by users and are building a quantitative analysis platform that analyzes player performance throughout a game. The game they are currently working on is Overwatch, a first-person shooter by Blizzard. Overwatch has different objectives depending on the game mode, but the overall theme remains the same across modes: shoot at your enemies and deal as much damage as possible. The damage a player deals to the enemy is therefore a key performance indicator (KPI) for that player. This essential information, however, cannot be parsed from the video because it does not show up on the screen. My task was to extract this damage information from the unstructured, unlabeled video data so that the client could integrate it into their product. (Note that there are other mechanisms in the game that can generate damage, but we focus only on the damage dealt by shooting, the primary source of damage.)

Approach

To estimate the damage, we first need to understand what happens when a shot is fired. There are two objects on the screen that are directly related to damage. The first is the meter indicator at the bottom of the screen. This is the energy meter for the ultimate skill, a powerful ability that becomes available whenever the meter is filled up. What is essential for my task is that the meter reading increases whenever damage is dealt. This number has already been parsed from the video and is readily available. In principle, I could use this ultimate reading as a good proxy for the damage dealt. Unfortunately, this is not always the case. Depending on the character used in the game, other factors such as healing and natural generation can also contribute to the increase of the meter. What's more, the meter reading is not available when the bar is fully charged or the ultimate skill is in use; during those times, the meter gives no information about how much damage has been dealt. We need something more universal and reliable to infer the damage information throughout the gameplay.

Screenshot of Overwatch gameplay. The objects related to damage are the energy meter indicator at the bottom of the screen and the cross at the center of the screen.

This issue leads to the second object, the cross in the middle of the screen. This is a mechanism designed to give players feedback: it appears only when a shot lands on target. Furthermore, it has different lengths and colors depending on the actual damage. This makes it a universal and reliable signal for predicting damage. The challenge is to build a quantitative representation of the cross and map that representation to the damage dealt.

The strategy is as follows. I use the cross as the features. To label them, I use the periods when the ultimate reading is a good proxy for the damage, meaning the reading is available and proportional to the damage dealt. By training on this labeled data, a model can be built that establishes a direct relationship between the cross features and the damage. With this model, we can predict the damage directly from the cross features, whether or not the meter indicator is available.

Data preparation

Feature engineering

To capture the features of the cross, I extracted the RGB intensity of different points in or around the cross as a function of time. The features I engineered for each color channel include:

1. the intensity of all the points

2. the change in intensity of all the points

3. the difference in intensity between the inner points and the background points

4. the difference in intensity change between the inner points and the background points, etc.

The RGB intensity of different points on or around the cross.

The first thing that comes to mind is to use these features for a single frame. However, looking at the time-series data, I noticed that there can be a small time drift between the cross and the increase of the meter reading (the values I use to label the data in the next part). So instead of using the intensity of a single frame, I used statistics (mean and maximum) over a time window spanning from two frames before to two frames after, as sketched below.
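
Below is a minimal sketch of this windowed feature extraction, assuming frames are decoded as NumPy RGB arrays; the point coordinates and the `frame_features`/`windowed_features` helpers are hypothetical illustrations rather than Mayhem's actual code.

```python
import numpy as np

# Hypothetical pixel coordinates; the real points come from locating the cross.
INNER_POINTS = [(540, 960), (535, 960), (545, 960)]
BACKGROUND_POINTS = [(500, 900), (580, 1020)]

def frame_features(frame):
    """Per-frame features: RGB intensities plus inner-vs-background differences."""
    inner = np.array([frame[y, x] for y, x in INNER_POINTS], dtype=float)
    bg = np.array([frame[y, x] for y, x in BACKGROUND_POINTS], dtype=float)
    # "Intensity change" features can be added analogously, e.g.
    # frame_features(frames[t]) - frame_features(frames[t - 1]).
    return np.concatenate([inner.ravel(), bg.ravel(),
                           inner.mean(axis=0) - bg.mean(axis=0)])

def windowed_features(frames, t, half_window=2):
    """Mean and max of the per-frame features over [t-2, t+2] to absorb drift."""
    lo, hi = max(0, t - half_window), min(len(frames), t + half_window + 1)
    window = np.stack([frame_features(frames[i]) for i in range(lo, hi)])
    return np.concatenate([window.mean(axis=0), window.max(axis=0)])
```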

Data Labeling

Now that I have the features, the next step is to label them so that I can perform supervised learning. Manually labeling the data is not feasible given the number of video frames, and an automatic way to label the data is essential because it lets the system improve as more video comes in. The only information I could potentially use is the ultimate meter reading. As mentioned, this reading is a good proxy for damage most of the time, and it is already parsed by Mayhem's computer vision engine. There are times when the ultimate meter is not available, and I exclude those from the training set. Since other factors such as skill damage, healing, and natural generation also contribute to the increase of the ultimate meter, I chose a special hero as my training material. The hero Widowmaker is a sniper without a healing skill, and she deals a considerable amount of damage whenever she lands a shot, so the increase in ultimate reading from a shot is significantly larger than the increase from natural generation or skill damage. This means that whenever there is a jump in the ultimate reading larger than 2, the jump corresponds to a shooting event with the cross feature. By extracting all these jumps along with the synchronized cross features, I obtain excellent labeled training data. Even though the model is trained on a specific hero, it can readily generalize to other heroes by adjusting the damage based on their settings, since the cross features are universal across all heroes. Because the ultimate meter reading is directly proportional to the damage in the training set, from now on I will call the labeled value damage for simplicity.
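
As a sketch, the automatic labeling reduces to detecting jumps in the parsed ultimate reading; the function below is a hypothetical illustration, with `None` standing for frames where the meter could not be parsed.

```python
DAMAGE_JUMP_THRESHOLD = 2  # jumps larger than this are treated as shots

def label_shots(ult_readings):
    """Label frames using jumps in the parsed ultimate-meter reading.

    Returns (frame_index, jump_size) pairs; frames where the reading is
    unavailable (meter full or ultimate in use) are excluded.
    """
    labels = []
    for t in range(1, len(ult_readings)):
        prev, curr = ult_readings[t - 1], ult_readings[t]
        if prev is None or curr is None:
            continue  # no usable meter reading around this frame
        jump = curr - prev
        if jump > DAMAGE_JUMP_THRESHOLD:
            labels.append((t, jump))
    return labels
```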

Data pipeline

Before a detailed discussion of the model, let's talk a bit about the data pipeline. The system consists of an online part and an offline part. I implemented the online component on top of Mayhem's codebase. As new videos come in, the CV layer parses them into structured data, which is fed into a pre-trained model to generate the damage information.

Data pipeline.

The offline part is standalone, separate from Mayhem's system. The structured data is also stored locally for training. Since multiple heroes can be used by the players in a single video, the pipeline stores them in separate per-hero folders. Even though the model uses only the Widowmaker data, the other heroes' data can be used to calculate their damage statistics and adjust the damage settings. As more data comes in, the model trained offline replaces the old online model, at whatever frequency the client sees fit. Note that the training process adopted so far is non-incremental; if frequent model updates are needed, incremental learning models should be used instead.
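
As an illustration of the offline storage step, the sketch below appends labeled instances to per-hero folders; the directory layout and record schema are assumptions, not Mayhem's actual format.

```python
import json
from pathlib import Path

DATA_ROOT = Path("training_data")  # assumed local storage location

def store_instance(hero, video_id, frame_idx, features, label):
    """Append one labeled instance to the hero's folder (hypothetical schema)."""
    hero_dir = DATA_ROOT / hero
    hero_dir.mkdir(parents=True, exist_ok=True)
    record = {"video": video_id, "frame": frame_idx,
              "features": list(features), "label": label}
    with open(hero_dir / f"{video_id}.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```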

Model building

To build the model with the labeled data, I need to define the problem further. What model should I use? Since I have the luxury of labeled data, supervised learning is my choice: it is more reliable and easier to validate. (A side note: a possible alternative is an anomaly detection model. Since I did not have enough time to test this approach, I am not sure how good its performance would be, but such an unsupervised approach would not improve with more data.) The first obvious choice is to perform regression on the damage. I decided this was not a good idea for two reasons. Firstly, the labels are highly imbalanced: the majority of the frames have 0 damage, so the regression result would be heavily biased towards 0. Secondly, regression results trained on a specific hero would not be easy to generalize to other heroes. So I resorted to the second method, classification. By binning the damage values, we can define different types of damage. For the development of the model I had only a limited amount of data, so I binned the damage into two types: normal damage and headshot damage. Any damage increase larger than a threshold is labeled as headshot damage, while the rest is labeled as normal damage. The model can easily be generalized to multiple bins with more data, giving finer granularity.
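
A minimal sketch of the binning, where the headshot cutoff is an assumed value rather than the threshold actually tuned for Widowmaker:

```python
HEADSHOT_THRESHOLD = 5  # assumed cutoff in ultimate-meter units

def damage_class(ult_jump):
    """Bin a labeled ultimate-meter jump into a damage type."""
    if ult_jump is None:
        return "none"
    return "headshot" if ult_jump >= HEADSHOT_THRESHOLD else "normal"
```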

One thing about the data is that it is highly imbalanced. Out of the 20,000 instances in the dataset I extracted, most belong to the none class with no damage event; only 133 are normal damage and 44 are headshot damage. Since I am mostly interested in the minority classes and want to increase the recall of the model, I used an upsampling method, the Synthetic Minority Over-sampling Technique (SMOTE), so that the class counts become equal. That is, of course, after splitting the data into a training set and a validation set. (A common mistake is to split the data after upsampling: since this is a synthetic upsampling method, some of the validation data's information would leak into the training set.) One reason I use windowed statistics instead of single-frame values also has to do with this upsampling method. SMOTE interpolates between an instance and its k nearest neighbors, so if there is a time drift in the single-frame features of the original instances, the synthetic process would generate new instances with values averaged across misaligned time points, which are not representative of real damage events. In contrast, statistics over a time window do not have this issue.
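
Here is a sketch of the correct ordering, splitting first and then applying imbalanced-learn's SMOTE to the training split only (`X` and `y` are assumed to hold the windowed features and damage-class labels):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split before any resampling.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Upsample the training split only, so no synthetic points derived from
# validation instances leak into training.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```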

After getting the balanced dataset, I trained a random forest classifier to predict the damage type and tested the model on the reserved validation set. Two issues emerged: a significant percentage of the headshot damage was classified as normal damage, and the precision was low for both classes.
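
A minimal sketch of this training and validation step, continuing from the snippet above (the hyperparameters are illustrative, not the ones actually used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_bal, y_train_bal)

# Evaluate on the untouched validation split.
y_pred = clf.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
```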

Confusion matrix on the validation set and precision-recall curves as a function of threshold.

After some exploration, I figured out that these two problems come from two other game mechanics. Firstly, there are scenarios where a headshot kills an enemy who has very little health left. The increase in ultimate reading from damage can only be as large as the remaining health, which means the cross features look like a headshot while the event is labeled as normal damage in the training and validation sets, because the increase does not reach the threshold. Fortunately, this problem can be solved in the future by taking advantage of the kill-feed information on the screen. The second problem comes from a mechanic called blocked damage. When a shot lands on the shield of certain heroes, the damage does not go through, and the ultimate reading does not increase as a result. The cross features, however, still show up as normal damage. This generates a small amount of mislabeled data: instances with the same features as the normal damage class but labeled as the none class. This label noise is minimal within the none class, so it does not cause a problem during training as long as we do not use a boosted tree method. It does, however, significantly decrease the precision during validation, because the blocked damage events contribute to the false positives. To confirm this, I went back to the video, and indeed most of the false positives were blocked damage events. The model classifies the features correctly; we just need to validate it differently.

Outcome

Precision-recall curves as a function of threshold, tested on the hand-labeled dataset after accounting for the kill events and blocked damage events.

To validate the model, I manually labeled the damage type for each frame in a new video. Using the trained model, I calculated the precision and recall for the two damage types at different thresholds. Since the output is a probability, the client can adjust the classification threshold according to their needs. The thresholds I set give an average recall of 75% and an average precision of 80%.
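
Since a random forest exposes class probabilities, the thresholding could look like the sketch below; the 0.6 value is purely illustrative, and `X_new` stands for features parsed from a new video:

```python
import numpy as np

# Column order of predict_proba follows clf.classes_.
proba = clf.predict_proba(X_new)

def classify_with_threshold(p, classes, threshold=0.6):
    """Return a damage class only when the model is confident enough."""
    best = int(np.argmax(p))
    return classes[best] if p[best] >= threshold else "none"

predictions = [classify_with_threshold(p, clf.classes_) for p in proba]
```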

The prediction on a new video. The first graph shows the ultimate reading (black line, right y-axis) along with the probability of the different classes (left y-axis; blue stands for none, orange for normal damage, and green for headshot damage). The second graph is the final output after post-processing.

In fact, the model can also be used to extract blocked damage during the periods when the ultimate information is available: if the model predicts with a certain level of confidence that the cross features are there, but there is no nearby increase in the ultimate reading, we can classify the event as blocked damage. This turns out to be another critical metric for evaluating player performance.
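
A sketch of this blocked-damage heuristic, with the window size and minimum jump as assumed parameters:

```python
def is_blocked_damage(t, predictions, ult_readings, window=3, min_jump=2):
    """Flag a predicted damage event as blocked when the ultimate meter
    shows no corresponding increase in nearby frames."""
    if predictions[t] == "none":
        return False
    lo, hi = max(0, t - window), min(len(ult_readings) - 1, t + window)
    readings = [r for r in ult_readings[lo:hi + 1] if r is not None]
    if len(readings) < 2:
        return False  # meter unavailable; cannot decide
    return (max(readings) - min(readings)) < min_jump
```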

Future improvements

There are several things I think will improve the model performance and help it generalize better to other heroes:

1. Labeling the kill events differently using the kill-feed data, as mentioned earlier.

2. Since Widowmaker's cross is always fairly large, there is not yet enough training data for small cross features. Training data extracted from other heroes would help improve the model's ability to recognize small cross features.

3. Performing more post-processing of the output, since there are cases where multiple frames near the window get recognized as damage frames when they should belong to a single shooting event (see the sketch after this list).

4. Once there are enough damage data points, undersampling the none class could be used to rebalance the sample.
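
For point 3, a sketch of such post-processing could merge runs of nearby damage frames into single events (the gap tolerance and same-class merging rule are assumptions):

```python
def merge_events(frame_labels, max_gap=2):
    """Collapse runs of nearby same-class damage frames into single events.

    Returns a list of (start_frame, end_frame, damage_class) tuples.
    """
    events, current = [], None
    for t, label in enumerate(frame_labels):
        if label == "none":
            continue
        if current and t - current[1] <= max_gap and label == current[2]:
            current = (current[0], t, label)  # extend the current event
        else:
            if current:
                events.append(current)
            current = (t, t, label)
    if current:
        events.append(current)
    return events
```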

Final remarks

To sum up, I managed to build a data pipeline that automatically transforms unstructured video data into labeled data, which is used to train a machine learning model that predicts an otherwise hidden metric. The prediction is universal and reliable compared with the proxy information parsed from the video. This critical metric will be implemented as a new feature in the client's gameplay analysis platform.
In the end, I would like to take this opportunity to thank everyone at Mayhem. It was a great honor working with these talented and passionate engineers. I especially want to thank my contacts Ivan Zhou and Anhang Zhu. I also want to thank Eric Yuan for helping me with the AWS setup and teaching me professional development practices.