PogChampNet: how we used Twitch chat + deep learning to create automatic game highlights with only video as input at Visor.gg.
Given a lot of Overwatch gameplay video, can we create a model that automatically finds the important, highlight-worthy moments?
My name is Farza and I used to be a computer vision engineer at Visor.gg! After leaving Visor, I asked for permission to talk about one of the crazy projects I worked on there alongside our intern William: PogChampNet, a neural net powered by memes (yea, srsly). So here it is!
In the above video are highlights from popular Overwatch streamers on Twitch that were found after looking through many hours of footage. These highlights weren’t chosen by a human, though. They were picked by a deep convolutional neural net which “watched” the video and automatically generated the clips it thought were highlight-worthy with only a video as input, nothing else. Just pixels. All in real time.
Let’s talk about it. Twitch chat is this wild beast that’s on the right side of every single Twitch stream.
It works like any chat system: users who watch the stream are placed in a chatroom where they can chat with other viewers. This is usually a pretty calm experience when the stream has a couple hundred viewers. But for big streamers with thousands of concurrent viewers, Twitch chat gets, well, a little crazy.
Some people love Twitch chat because it’s like this live commentary of the stream powered by the users of Twitch. But, there are many who also hate Twitch chat and the mindless garbage that is spewed by the community.
Personally, I find Twitch chat extremely interesting because the content of the chat is usually correlated pretty heavily with what is actually happening on the stream.
Twitch chat is essentially labeling the stream for us!
If something funny happens on the stream, chat will fill up with words like LOL, LMAO, and LUL.
If something really impressive happens on stream, like the streamer getting a crazy triple kill and barely getting out alive, then Twitch chat gets excited with words like Pog, POGGERS, and PogChamp:
So then I just thought to myself, “okay we have thousands of hours of stream footage and thousands of hours of chat logs, is this data useful?” At Visor.gg we were doing a lot of work with computer vision and deep learning to aggregate data from video games. One problem that I was working on while there was automatic game highlights. These highlight systems already existed, but all they did was this:
The major flaw with this system is that it uses hand coded rules to create highlights. In this case, if the user gets a bunch of kills then it makes a highlight. But, what if I’m playing Overwatch and get just one kill but that one kill was a crazy snipe? Or what if I’m playing Reinhardt and I get no kills but I make a big play to win the game? These would be amazing highlights but the current system wouldn’t get them. There would need to be more hand coded rules.
I wanted to create a smarter highlight system that would be able to just look at video game footage like a human would and tell if something was exciting, impressive, or highlight worthy using deep learning.
Twitch Chat to the Rescue
If I wanted to create a neural net capable of creating automatic highlights, I would need a lot of labeled data! I didn’t want to spend hours upon hours labeling video game footage, though. That’s where Twitch chat comes in. If you remember, Twitch chat explodes with words like Pog, POGGERS, and PogChamp whenever something highlight-worthy happens in game.
The idea was simple: create labeled video data from Twitch chat that we can feed to a neural net trained to score how highlight worthy a clip is. For example, if Twitch chat explodes with excitement then we know something interesting happened during that moment of the stream.
I told this idea to a bunch of my friends and they would always erupt in laughter followed by a “...wait, you’re serious?”.
So, my co-worker William and I went to work to prove the naysayers wrong!
Choosing the Streamers
We first had to figure out what streamers we would pull video and chat data from to power our neural net. It had to be streamers who:
- Were really good at Overwatch so that they would have lots of highlight-worthy plays.
- Had very active chats full of people who would be ready to spam PogChamp upon the streamer doing something crazy in-game.
We ended up going with: xQc, Seagull, Mendokusaii, and PvPTwitch.
Getting the Data
Step 1: Get the VODs from Twitch. This was made easy by a tool named TwitchLeecher. We went over to each streamer’s past broadcasts and simply downloaded a bunch of them.
Step 2: Get the chat associated with each stream VOD. This was also made easy by a tool named TwitchChatDownloader.
Step 3: Find the important moments. We wrote a script that would parse Twitch chat and find where it exploded with key words like “PogChamp”, “Poggers”, “wow”, “holy shit”, and many many others. That’s why it’s called PogChampNet :). Based on the timestamp where Twitch chat exploded with excitement, we’d make a 30 second clip from the downloaded VOD. The idea here was that if Twitch chat was hype, then something interesting must have happened! We also knew how exciting a clip was based on how many of these key phrases we detected.
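A minimal sketch of what a chat-parsing script like this could look like. The chat log format (pairs of seconds-into-the-VOD and message text), the 10-second window, the threshold of 5 hits, and the choice to start the clip 20 seconds before the spike are all assumptions for illustration, not details from our actual script. The keywords are only the handful named above.

```python
import re
from collections import Counter

# Only the phrases named in the post; the real list had many more.
# "pog\w*" covers Pog, POGGERS, and PogChamp.
HYPE_PATTERN = re.compile(r"\b(pog\w*|wow|holy shit)\b", re.IGNORECASE)


def find_hype_clips(messages, window=10, min_hits=5, clip_len=30, lead=20):
    """Return (clip_start, clip_end, hype_count) for each chat spike.

    messages: iterable of (seconds_into_vod, text) pairs (assumed format).
    window, min_hits, and lead are hypothetical tuning knobs.
    """
    hits = Counter()
    for ts, text in messages:
        # Count at most one hit per message that contains a hype phrase.
        if HYPE_PATTERN.search(text):
            hits[int(ts // window)] += 1

    clips = []
    for bucket in sorted(hits):
        count = hits[bucket]
        if count >= min_hits:
            # Chat reacts after the play, so start the clip a bit earlier.
            start = max(0, bucket * window - lead)
            clips.append((start, start + clip_len, count))
    return clips
```

The returned count doubles as a rough "how exciting was this" signal. Neighbouring windows that both cross the threshold would produce overlapping clips here; merging them is left out to keep the sketch short.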
Step 4: Create smaller clips. At this point, we had hundreds of these thirty-second clips. Many of them were exactly what we wanted them to be: really exciting, highlight-worthy clips of the streamer making some really good plays in-game. This was awesome. Instead of watching hundreds of hours of gameplay, we were able to aggregate data in hours instead of weeks. But there was an issue. 99% of the time the entire 30-second clip wasn’t highlight-worthy. It was perhaps just 5–10 seconds of the clip. We needed a more fine-grained way to label a clip. So we took each 30-second clip and broke it into thirty 1-second clips.
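One way the splitting could be done is with ffmpeg. The sketch below only builds the command lines, one per second of the clip. Using ffmpeg at all, the function name, and the output naming scheme are assumptions for illustration; the post doesn't say which tool we used for this step.

```python
import os


def one_second_clip_commands(src, clip_len=30):
    """Build one ffmpeg command per 1-second slice of a clip.

    src: path to a clip_len-second video. Output names are hypothetical.
    """
    stem = os.path.splitext(src)[0]
    commands = []
    for second in range(clip_len):
        out = f"{stem}_{second:02d}.mp4"
        # -ss seeks to the start second, -t keeps one second of video,
        # -y overwrites an existing output file.
        commands.append(
            ["ffmpeg", "-y", "-ss", str(second), "-i", src, "-t", "1", out]
        )
    return commands
```

Each command can then be handed to `subprocess.run(cmd, check=True)`. The slices are re-encoded rather than stream-copied, since stream copying can only cut on keyframes and 1-second slices need to be cut more precisely than that.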
Step 5: Create fine-grained labels. Okay, so now we had thousands of these smaller 1-second clips and wanted to essentially score each 1-second clip based on how highlight worthy the clip was. William created a simple tool that allowed us to score these one second clips quickly and easily based on how exciting we thought the 1-second clip was on a scale from 0–9. We gave clips very low scores when the player was just walking around or was involved in very little action. We gave clips high scores where they were in the middle of a ton of action and trading blows with enemies.
Note: Not every clip received a label between 0–9. Some clips had the label “s” or “i”. A clip that had nothing to do with Overwatch, like a clip of the streamer doing pushups, would be labeled “s” as in “skip” for non-gameplay. Video where the player was in game but was interrupted by something like a pause screen would be labeled “i” for “interrupted gameplay”. By doing this, our neural net would be trained to understand what non-gameplay looks like and would have a lower chance of spitting out false positives.
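To make the labeling scheme concrete, here is a small sketch of how the ten scores plus “s” and “i” could be turned into class indices and one-hot targets for a classifier. The ordering of the classes and the helper names are assumptions, not something taken from our labeling tool.

```python
# Ten numeric scores plus "s" (skip, non-gameplay) and "i" (interrupted gameplay).
# The class order is an assumption made for this sketch.
CLASS_NAMES = [str(score) for score in range(10)] + ["s", "i"]
CLASS_INDEX = {name: idx for idx, name in enumerate(CLASS_NAMES)}


def one_hot(label):
    """Turn a label such as 6, "6", "s", or "i" into a one-hot list."""
    vec = [0.0] * len(CLASS_NAMES)
    vec[CLASS_INDEX[str(label)]] = 1.0
    return vec
```

Treating the score as 12 separate classes ignores the fact that a 6 is closer to a 7 than to a 0, but it keeps the problem a plain classification task.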
NICE. Now we had thousands of labeled, 1-second clips.
Deep Learning Time
I didn’t want to make things too complex here and just wanted to be as simple as possible for version one of this idea. I think people often go crazy when choosing neural net architectures and pick designs that are overly complex and difficult to train. This is why we played it safe and just chose to use InceptionResNetV2. It’s an architecture by Google that’s well documented and used a ton by researchers.
The next thing we had to figure out was how to actually feed these clips to our neural net as video. My first thought was that we’d need some sort of recurrent net involved, like an LSTM, to help our neural net understand the time quality of our clips. For example, it probably helps our neural net to “watch” a full 1-second clip to better decide if it was highlight-worthy.
But, for version one you always want to be as simple/stupid as possible to establish some sort of baseline. I ditched the idea of feeding a video to the neural net and instead came up with the idea of making the problem of highlight creation an image classification problem instead of a video classification problem. Image classification means “given this image, tell me what it’s an image of”. So if we give a neural net a picture of a car, it should output “car”. Easy as that. What if we gave our neural net a single image from one of our 1-second clips? Would it be able to predict how highlight-worthy that image is on a scale from 0–9?
I know that sounds weird. How can a neural net tell you how highlight-worthy something is from just an image? Well, imagine I gave you a bunch of the images below. Could you generally tell me which of the images came from highlight-worthy clips?
Definitely! The two images at the top seem to come from pretty hype clips. In both images, the player is getting a bunch of kills, explosions are happening, and abilities are being used. The images on the bottom obviously are not hype. One is where McCree is just walking around and the other is just a player on the menu screen about to leave a game. So it looks like we can actually turn this into an image classification problem where we simply take an image as input and output a score from 0–9 deciding how highlight-worthy the clip is!
We proceeded by taking each labeled 1-second clip in the dataset and broke it up into 30 frames. This was easy since the original video was 30 FPS. We then took the 10th, 20th, and 30th frame, saved them individually, and assigned them all the original label we gave the 1-second clip. So for example, if our clip had a score of 6, we’d take the 10th, 20th, and 30th frame and all three of these frames would receive a label of 6.
Our dataset ends up looking something like this:
"image_1.png" : 4,
"image_2.png" : 4,
"image_3.png" : 4,
"image_4.png" : 6,
"image_5.png" : 6,
"image_6.png" : 6,
"image_7.png" : 2,
"image_8.png" : 2,
"image_9.png" : 2, ...
We had about 40,000 labeled images in our dataset.
Note: We chose to only take three frames from each 1-second clip (vs taking all 30 frames) to avoid adding a lot of repetitive data to our dataset. In a game, what happens on frame #2 and what happens on frame #3 usually looks extremely similar. This can hurt our neural net’s ability to generalize!
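The frame sampling and label copying described above can be sketched in a few lines. This only works out which frames to keep and which label each one inherits; actually decoding the frames from video is left out. The function names and the frame file naming are hypothetical.

```python
def sample_frame_numbers(fps=30, every=10):
    """1-based frame numbers to keep from a 1-second clip: 10, 20, 30."""
    return list(range(every, fps + 1, every))


def build_dataset(clip_labels, fps=30, every=10):
    """Map each sampled frame to the label of the clip it came from.

    clip_labels: list of (clip_name, label) pairs, where label is a
    score from 0-9 or "s" / "i". File names here are made up.
    """
    dataset = {}
    for clip_name, label in clip_labels:
        for frame_no in sample_frame_numbers(fps, every):
            dataset[f"{clip_name}_frame{frame_no:02d}.png"] = label
    return dataset
```

For example, `build_dataset([("clip_a", 6)])` gives three entries, one each for frames 10, 20, and 30 of `clip_a`, all labeled 6.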
At this point the hardest part, which is putting together the dataset, was done. At train time, we used Keras + Python which is extremely simple to use.
get_model is all our actual neural net code. Literally less than five lines. Notice how there are 12 classes: 10 classes for a score between 0–9 and 2 for “s” and “i”.
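A minimal sketch of what a get_model like this could look like with Keras, assuming the stock InceptionResNetV2 from `tensorflow.keras.applications` with a 12-way softmax on top. Training from scratch (`weights=None`), the Adam optimizer, and the categorical cross-entropy loss are assumptions made for this sketch, not confirmed details of our setup.

```python
NUM_CLASSES = 12  # scores 0-9 plus "s" and "i"


def get_model(num_classes=NUM_CLASSES):
    """Build an InceptionResNetV2 classifier with num_classes outputs."""
    # Imported here so this file can be loaded without TensorFlow installed.
    from tensorflow.keras.applications import InceptionResNetV2

    # weights=None trains from scratch (an assumption); with include_top=True
    # the network ends in a softmax layer over `classes` outputs and expects
    # 299x299 RGB images by default.
    model = InceptionResNetV2(include_top=True, weights=None, classes=num_classes)
    # Optimizer and loss are assumptions for this sketch.
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Starting from ImageNet weights instead would mean building the network with `include_top=False` and adding a new 12-way dense layer, since the pretrained classification head is fixed at 1000 classes.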
All we did after this was write a script that trained this neural net (above). We threw this onto a GCP instance with 2 Tesla K80s and it took about 12 hours for it to train for 20 epochs. That’s it!
Test Time / Interesting Behavior
At test time, how do we go from classifying individual frames to creating clips? Let’s say we want to find the exciting moments within a 60-second clip. We’d go through this video and grab every 10th frame, give it to PogChampNet, and get a score. If we get many high scores in a row, our test script simply takes the video and creates a 10-second clip around the timestamp where we got a bunch of high scores. If the scores are low, we simply keep going because low scores mean nothing really exciting was happening within those frames.
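Here is a rough sketch of that test-time logic, working on a list of per-frame scores that have already been predicted. What counts as a "high" score (7 and up), how many in a row are needed (6), and centering the 10-second clip on the run are all assumptions for illustration. Frames predicted as “s” or “i” are assumed to have been mapped to a score of 0 beforehand.

```python
def find_highlights(scores, fps=30, stride=10, min_score=7, min_run=6, clip_len=10.0):
    """Return (start_sec, end_sec) clips around runs of high scores.

    scores: predicted score (0-9) for every `stride`-th frame, in order.
    min_score and min_run are hypothetical thresholds.
    """
    highlights = []
    run_start = None
    # The trailing -1 is a sentinel that closes a run reaching the end.
    for i, score in enumerate(list(scores) + [-1]):
        if score >= min_score:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_run:
                # Time (in seconds) of the middle of the high-scoring run.
                mid = (run_start + i - 1) / 2 * stride / fps
                start = max(0.0, mid - clip_len / 2)
                highlights.append((start, start + clip_len))
            run_start = None
    return highlights
```

With a 30 FPS video and every 10th frame scored, three scores cover one second, so a run of 6 high scores is about two seconds of sustained action.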
The video at the beginning of this post already shows off how well PogChampNet does. It’s actually scary good at choosing clips and does the job well. But, when working with neural nets there are almost always places where it messes up.
We want to show off some places where PogChampNet messed up and classified a clip as highlight worthy, when really, it wasn’t highlight worthy at all!
So, it’s definitely not perfect! It has some hiccups, most likely caused by the fact that our dataset was only 40,000 images, which perhaps doesn’t allow our model to properly understand every possible clip it sees. I know that saying “it breaks because it needs more data” is kind of a cop-out answer, but I think it rings true here!
Overall, PogChampNet did way better than we thought it would. Given ten hours of Twitch VODs, it could easily find exciting moments within the game with very few false positives. Future improvements would definitely include using some sort of recurrent net or 3D convolutions to take advantage of the temporal quality of the data.
So that’s it! A neural net powered by memes. Please do DM me with any questions you may have (farzatv) and thanks for reading!