Using AI to extract data from e-sports live streams

Robert Hunt
11 min read · Sep 8, 2017


AI for live video is an emerging area that interests me. (I spent five years working with live video streams at my prior startup.) The challenge is to take in live video and figure out what is going on using only the video itself. Commercial uses abound. Pinterest uses its Lens feature to suggest “pins” from other users or companies. Facebook uses AI to scan videos for emotions and Google now offers video AI as a service.

E-sports streams are a particularly attractive area for AI work since it is easy and cheap to generate a vast body of training data. AI models learn from existing data and then make predictions about new data. So if you wanted to train an AI to identify dogs in videos, you’d need lots of videos with dogs. It’s actually hard to find thousands of dog videos, especially videos that you can use in your project without breaking the YouTube terms of service.

E-sports, aka video games, don’t have this problem: you can generate hours and hours of video at effectively no cost, and games can be automated to produce an effectively unlimited amount of footage. This has made e-sports videos a big area of AI research recently. See, for instance, Facebook’s AI team releasing 1.5 billion e-sports images for use in AI training.

Part I: Overview of problem

We want to take in a live video stream and figure out what’s going on in each frame. (Side note: video is a continuous stream of single frames. Our eyes see it as motion, but it’s really 30 or 60 separate images per second. Those images are called frames.) I started this project using streams from one of the world’s most popular games, with more than 60 million players: League of Legends.

In this game, there are both friends and enemies. Each player controls a different in-game character, and each character has a position (where it is), a strength (its health and items), and attacks it aims at other characters. For example, consider the raw frame below.

We want to extract information like the highlighted areas:

Part II: Overview of the technology

We’re going to be taking a live video stream in, pulling out the frames, and then analyzing them with AI.

The player sends an RTMP stream from their capture software, usually OBS, to our servers.

An Nginx server running the RTMP module ingests the stream.

The AI is fed a series of decoded frames and identifies “what and where” is in each frame. The AI runs on its own GPU server.
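To make the frame-pulling step concrete, here’s a minimal sketch in Python using OpenCV. The RTMP URL, and the assumption that the Nginx RTMP module republishes the ingested stream locally, are illustrative rather than the exact setup.

import cv2

# Minimal sketch of the frame-pulling step. The URL assumes the Nginx RTMP
# module republishes the ingested stream at rtmp://localhost/live/stream
# (an illustrative address, not the real one).
def pull_frames(stream_url="rtmp://localhost/live/stream"):
    capture = cv2.VideoCapture(stream_url)  # OpenCV decodes the stream via FFmpeg
    if not capture.isOpened():
        raise RuntimeError("Could not open stream: " + stream_url)
    while True:
        ok, frame = capture.read()  # one decoded BGR frame per call
        if not ok:
            break  # stream ended or was dropped
        yield frame
    capture.release()

Each decoded frame then gets handed to the AI running on the GPU server.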

Part III: What the AI needs to do

The AI needs to answer the “what and where” questions. In this game, there are 120 different types of enemies and the AI needs to be able to tell each apart. Here are examples of four of the types.

Moreover, each enemy is a 3D model with a full range of movement, so sometimes we see the front of an enemy and sometimes the back or side. Furthermore, each enemy has “moves” that make it look different from one moment to the next, and enemies often stand on top of cluttered backgrounds. The AI needs to be able to pull out the relevant details and ignore everything else.

Lastly, the AI needs to be able to tell the location of the enemies. Even without knowing the game, you can guess why: players that end up right next to each other are going to fight.

Since we’re working on live video streams, we need the AI to be very fast. By fast, I mean capable of processing 60 frames per second. When you do the math, 60 frames per second is roughly one frame every 16 milliseconds. That’s fast… There are tricks we can use to make this less time-sensitive (you can imagine, at the extreme, getting 60 different servers and having each server process a single frame per second) but it would be preferable to find a more straightforward approach.

Fortunately, AI networks for use on live video have emerged in recent years. We’ll use one of the most advanced.

Part IV: A YOLO Network

A YOLO network is a You Only Look Once network. What that means is that it can figure out what’s in a frame (what is called “classification” in the AI world) and where it is (“localization”) by only looking at the frame a single time. Prior networks would divide this into two separate steps: one part of the AI would do classification and another would do localization. All things being equal, doing things in a single step should be faster than doing things in two steps.

An author of the YOLO algorithm created a video showing the network in action on a James Bond film. We’ll be applying similar logic for our e-sports streams. As you watch the video, note both the boxes on the screen, i.e. localization, and the names in the upper left corner, i.e. classification.

Aside: How a YOLO network works

If you’ve never seen how a neural network works, you may want to skip this section.

A YOLO network is effectively a traditional Convolutional Neural Network with a quite different final layer and loss function. Whereas many networks feed a fully connected layer into a Cross Entropy Loss function, a YOLO network needs to include both classification and localization information in that final layer. To do this, it combines the two into a single output.

Next, the YOLO network further stresses the localization aspect by dividing the input into an n x n grid and computing that combined output for each cell.

A “trick” of the YOLO network that improves the localization accuracy is that the X,Y prediction applies only to objects that have their center inside that cell. The inclusion of the Width and Height parameters allows objects to span multiple cells (since the center will be inside a single cell but the edges of the object could be outside the cell). The downside of this approach is that a YOLO network needs a workaround to handle multiple objects centered in the same grid cell. That workaround is to duplicate the entire output layer in each cell for every additional object that could be centered in the same grid cell. That can lead to a very large output layer, where the number of predicted values equals:

n × n × m × (C + 5)

where n × n is the number of cells, m is the max number of objects per cell, and C is the number of classes (the extra 5 values per prediction cover the x, y, width, height, and objectness terms).
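To make the size concrete, here’s a quick worked example in Python. The grid size n and per-cell object count m are illustrative picks; C = 120 matches the number of enemy types mentioned earlier.

# Worked example of the output-layer size. n and m are illustrative;
# C = 120 matches the number of enemy types in this game.
n = 13   # the frame is divided into an n x n grid
m = 5    # duplicated predictions per cell for overlapping objects
C = 120  # number of enemy classes

values_per_prediction = C + 5      # class scores plus x, y, width, height, objectness
total_output_values = n * n * m * values_per_prediction
print(total_output_values)         # 13 * 13 * 5 * 125 = 105,625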

Note on the math:

The YOLO loss function is rather involved since it includes both classification and localization. For instance, the loss function has two sub-functions that are invoked depending on whether the cell contains an object. When calculating the gradient, things are further complicated by the fact that the loss function includes an Intersection over Union, and that value is very often zero (i.e. when there is no intersection between the predicted location and the actual location, which happens more often than not).
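Since the Intersection over Union term does a lot of work here, a small sketch of how it is computed may help. Boxes are assumed to be in (x1, y1, x2, y2) corner coordinates; this is not the Darknet code, just an illustration.

# Sketch of Intersection over Union (IoU) between a predicted and an actual
# box, each given as (x1, y1, x2, y2) corner coordinates.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Zero whenever the boxes don't overlap at all -- the common case noted above.
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    if intersection == 0:
        return 0.0

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, roughly 0.14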

Part V: Training the AI

Since the AI will output “what and where” data, we need to train it with that same data. The James Bond video from above was trained on a dataset of people and things in the real world. To have the AI understand e-sports streams, we will need to train it with lots of frames from e-sports streams.

An interesting aspect of the YOLO network is that, since each cell predicts independently, we can train the network on a single enemy and it will perform similarly when there are multiple enemies in the same frame, so long as they’re in different cells. That greatly simplifies the training problem, since we can record scenes from the game that only have a single, known enemy on screen at any given time. The details here aren’t important, but the game supports a training mode where we can specify which enemies we want to appear. (Side note: we could get this same data by inspecting the internals of the game as it runs. I wanted to avoid that, though, since doing so violates the terms of service.)

We can record a video of this and then extract the frames. That will tell us that there is a single, known enemy but we won’t know where that enemy appears in the frame without more work.

Here’s what an input frame looks like.

To get the enemy’s location, we can leverage the fact that the red health bar above its head has a fixed shape and never rotates. Unlike the enemy, which can move freely in 3D space, the health bar always sits in a fixed position relative to the enemy.

There is a quirk here, though: the health bar looks slightly different from frame to frame. It can be full or empty, and there are other visual differences, like the number that appears on the bar.

So when we’re looking for the health bar, we need something that will be able to match regardless of whether the bar is full or empty and regardless of what number is displayed on it.

Luckily, we can use the area surrounding the bar combined with a mask to do this. The mask knocks out areas of difference while letting areas of sameness show through.

In OpenCV code, this becomes

cv2.matchTemplate(frame, template, cv2.TM_CCORR_NORMED, mask=mask)

The match will never be perfect due to artifacts from the video compression but if we threshold the match at something like 90%, we can get a reliable location from each frame.

Once we find the health bar, we can assert that the enemy is below the health bar. The particular box size doesn’t matter. So running the image analysis on a raw frame, we finally get the “where” that we need to train the AI.
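Putting the pieces together, here’s a sketch of the matcher as a helper function. The template and mask file names, the 0.90 threshold, and the box offsets below the bar are all illustrative assumptions rather than the exact values used.

import cv2

# Sketch of the health-bar matcher. File names, threshold, and box offsets
# are placeholders for illustration.
TEMPLATE = cv2.imread("healthbar_template.png")  # the area surrounding the bar
MASK = cv2.imread("healthbar_mask.png")          # white = compare, black = ignore

def find_enemy_box(frame):
    # Returns an (x1, y1, x2, y2) box for the enemy, or None if no bar is found.
    result = cv2.matchTemplate(frame, TEMPLATE, cv2.TM_CCORR_NORMED, mask=MASK)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < 0.90:  # compression artifacts mean the match is never perfect
        return None
    bar_x, bar_y = max_loc
    # The enemy sits in a fixed-size box below the bar; the exact offsets
    # don't matter much (these are placeholders).
    return (bar_x - 20, bar_y + 10, bar_x + 120, bar_y + 160)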

When we run the image matcher in a little program, we can take an input video, extract its frames, and tag the “what and where” — 60 times per second! This generates a large amount of training data very quickly. So quickly, in fact, that we choose to generate training data at a slower pace (i.e. we skip a fraction of the frames) so that there’s more variation between training images.
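Here’s a rough sketch of that tagging loop, reusing the find_enemy_box helper from the sketch above. The class index, output paths, and 1-in-10 sampling rate are assumptions; the labels are written as plain-text lines with the class index plus the box center and size normalized to the frame, which is the format Darknet expects.

import cv2

# Rough sketch of the tagging loop. Assumes find_enemy_box from the sketch
# above; class_id, the paths, and the sampling rate are illustrative.
def tag_video(video_path, class_id, sample_every=10):
    capture = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_idx % sample_every == 0:   # skip frames so samples vary more
            box = find_enemy_box(frame)     # (x1, y1, x2, y2) or None
            if box is not None:
                h, w = frame.shape[:2]
                x1, y1, x2, y2 = box
                # One label line per object: class, center x/y, width, height,
                # all normalized to the frame size.
                label = "%d %.6f %.6f %.6f %.6f" % (
                    class_id,
                    (x1 + x2) / 2 / w, (y1 + y2) / 2 / h,
                    (x2 - x1) / w, (y2 - y1) / h,
                )
                cv2.imwrite("frames/%06d.png" % frame_idx, frame)
                with open("frames/%06d.txt" % frame_idx, "w") as f:
                    f.write(label + "\n")
        frame_idx += 1
    capture.release()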

Lastly, we need to train the AI itself on the input images along with the “what and where” labels that we generated from our masking system. As a first effort, I used transfer learning on an existing Inception v3 base for training the YOLO network. (Technically, this is a YOLOv2 network using the YOLO creator’s Darknet implementation.) The network trained for 48 hours on an AWS p2.xlarge machine using a training data set of 1,000 images per enemy type.

On a side note, these AI boxes are relatively expensive at 90 cents per hour, so 48 hours of training the network costs about 43 dollars. For comparison, this is nearly 10 times what standard servers cost per hour.

Part VI: How well did it work?

We’ll run some recorded videos through the system and see how well it works. First off, we’ll use a single enemy.

The YOLO network did well in this video. When the enemy is in the frame, it identifies it correctly and gets the location right. The network also correctly recognizes when there is no enemy in the frame. That’s good!

When we move to multiple enemies, things get less clear. For instance, check out the clip below. There are two enemies on screen. In the beginning of the clip, the enemies are standing close to each other. They then move apart.

When the enemies stand on top of each other, the AI sometimes identifies one enemy instead of two. That’s not surprising, given that it can be hard even for human eyes to figure out what’s going on. For instance, in the frame below there should be two enemies, Ali and Garen, but the AI only sees one of them correctly.

Half right: Found one enemy but not the other

Another variant, though, causes the overlapping enemies to be identified as a completely different enemy! This is a bad outcome. For instance, the frame below comes from the same video, just a second later. In this frame, instead of Ali or Garen, the AI has identified an enemy that doesn’t appear at all.

Worst outcome: two enemies identified as one unrelated enemy!

When the enemies diverge, the AI resumes identifying the enemies correctly, as below.

Correct identification of both Ali and Garen

In a future version, we can train the AI on input images that show enemies standing on top of each other. This is the most direct approach and mimics what we expect the AI to do: predict things based on training data that it has seen before.

Part VII: Takeaways and Next Steps

Positives:

  • Tracking enemies worked well. The AI got the “what and where” right in most cases
  • The AI was pretty fast. It took only 50 milliseconds per frame
  • The AI can handle multiple enemies in a single frame even though it was only trained on single enemies.
  • We could do all this on standard cloud hardware.

Negatives:

  • We’ll need to get smarter to handle enemies standing on top of each other
  • While 50 milliseconds per frame is fast, that’s only 20 frames per second (1000 milliseconds per second divided by 50 milliseconds per frame). To handle 60 frames per second live video, we would need to use multiple GPUs and interleave the output, as sketched below
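A back-of-the-envelope sketch of that interleaving idea (the worker count here just follows from the arithmetic; it is not a tested setup):

import math

# At ~50 ms per frame, one GPU handles ~20 fps, so three GPUs taking every
# third frame in round-robin fashion could keep up with 60 fps input.
frames_per_second = 60
ms_per_frame_on_one_gpu = 50
gpus_needed = math.ceil(frames_per_second * ms_per_frame_on_one_gpu / 1000)  # 3

for frame_index in range(6):
    print("frame", frame_index, "-> GPU", frame_index % gpus_needed)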

Areas for future investigation:

  • Experiment with different networks. I’m interested in the performance of an approach that first finds health bars (to get location) and then sends only the surrounding area to a traditional convolutional network to do classification.
  • Try e-sports that don’t have a fixed camera angle, e.g. Blizzard’s Overwatch. Having a more fluid camera angle will result in images with different perspectives as enemies are either close or far from the camera
  • Find a real world (non e-sports) use case and build a product around live video AI!

If AI for video is of interest to you, please get in touch. You can reach me on LinkedIn.
