Creating The Dataset — Part 1: The Mysterious Web Socket
This is probably something many people are wondering about because of how huge this dataset is. Let me spoil it now: I did not manually label 100,000 images of the mini map. That would be insane. Every single image has 10 bounding boxes (assuming all 10 champs are alive) denoting where each champion is and which champion it is. Even at just 5 seconds per image, manually labeling would take me over 8,000 minutes, nearly six days of nonstop work!
Let me take you through how I created the dataset. It involved some clever tricks and the help of another developer, remixz on GitHub, who discovered the Mysterious Web Socket (as I like to call it), which I’ll be talking about below.
When you watch a game of League live on lolesports.com, there is actually a secret web socket that is constantly tossing out data about that live game. I call it the Mysterious Web Socket because not many people know about it and it seems semi-hidden. The data produced by the socket includes everything from the names of the players and their champions to each champion’s position and health at every second. It exists because it powers the live player stats functionality on the website.
You might start to see how I used this data!
I created my own node script (similar to one remixz made here) that basically detected whenever The Mysterious Web Socket was open, listened to incoming data, and saved that data onto a JSON file. I hosted this script on an AWS EC2 machine and bam, I was now saving data from NA and EU LCS games automatically!
If you’re still curious about the data, here is a little snippet from an LCS game to give you a better idea of how it looked.
The JSON data is cool on its own, but remember, the whole point of doing this is to create a labeled dataset: actual images of the mini map with labels corresponding to where the champions actually were on the map. I don’t care about the JSON data by itself.
I should add that DeepLeague only recognizes about 55 champs from the game because I only use data from the LCS. Players from the LCS usually only play a certain set of champions. For example, mid laners usually play a lot of Ahri, but almost no one ever plays Teemo! This means I wouldn’t be able to train a model to recognize Teemo. It also means I would have too much data for Ahri. I needed to have a balance of champions in my dataset. You can check out the code for how I balance my dataset here under the function check_champs.
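To give a concrete idea of what that balancing check involves, here is a minimal sketch of counting champion appearances across games. The function names and the data layout are my own illustration, not the actual `check_champs` code from the repo:

```python
from collections import Counter

def count_champ_frames(games):
    """Count how many frames each champion appears in across all games.

    `games` is assumed to be a list of games, where each game is a list
    of frames and each frame is a list of (champ_name, x, y) tuples.
    """
    counts = Counter()
    for game in games:
        for frame in game:
            for champ, _x, _y in frame:
                counts[champ] += 1
    return counts

def playable_champs(counts, min_frames=1000):
    """Champs below a minimum frame count can't be learned reliably,
    so they get dropped from the label set."""
    return {champ for champ, n in counts.items() if n >= min_frames}
```

A check like this is what tells you that Ahri is drowning the dataset while Teemo barely exists.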
At this point, all I had was hundreds of these JSON files, each corresponding to an individual game and telling me where every single champion was at each second of the in game timer. So all I had to do was download the VOD for each LCS game the JSON was associated with and match the data with the VOD. Initially I thought this would be easy. I would just go to YouTube, find the LCS game, download the VOD, and have a script automatically extract frames from the video and match them to the JSON data.
N O P E.
Creating The Dataset — Part 2: Understanding The Problems
I made a massive mistake. Let me explain.
I can’t treat VODs of LCS streams like a normal League of Legends VOD I record at home. For example, if I record myself playing League of Legends at home from the beginning of the game all the way to the end, then I can simply run this code:
Note: When I say “frame”, assume that I mean every single second in the game is associated with an actual image “frame”. So a 60 second VOD would have a total of 60 frames where 1 frame is associated with every second. So, 1 FPS!
# first go through every single frame in the VOD.
in_game_timestamp = 0
for frame in vod:
    # go into the vod's json data and find the entry associated with
    # that specific timestamp.
    frame.json_data = vod_json_data[in_game_timestamp]
    in_game_timestamp += 1
This code would work perfectly on my home VOD. Let’s say I start recording my home VOD at the in game timestamp of 0:00 and stop at the in game timestamp of 22:34, and I want one frame of data for every single second of the actual in game timer. This is trivial because:
The timer of my Home VOD directly aligns with the timestamp of the in game timer.
Ahahaha, friends, I wish it was so easy with LCS VODs.
The only way to get VODs of pro LCS games is by taking whatever is live streamed on Twitch. These Twitch live streams for LCS games have a lot of interruptions during the game for things like instant replays, player interviews, and pause breaks. The VOD JSON data corresponds to the in-game timer. Do you see why this is an issue? My VOD timer doesn’t align with the in game timer.
Let’s say this happens:
- LCS Game in game timer is currently at timestamp 12:21.
- LCS Game in game timer is currently at timestamp 12:22
- LCS Game in game timer is currently at timestamp 12:23
- The stream transitions to an instant replay that shows off the last team fight. This lasts 17 seconds.
- The stream transitions back to the game. LCS Game in game timer is currently at timestamp 12:24
Oh no!! This is horrible. 😢 😢 😢. I’ve completely lost track of the in game timer because the in game time and the VOD time don’t align after the interruption! How is my program supposed to know how to extract data from the VOD and associate individual frames with the JSON data I got from the web socket?
Creating The Dataset — Part 3: Ripping Off Google Cloud Services
The problem is pretty clear. My script to extract data from the VODs must know what the actual in game time stamp is. Only then can I be truly sure that the actual game is being shown and not some sort of instant replay or other interruption. Also, knowing the in game timestamp is extremely important because remember, The Mysterious Web Socket gives us one frame of data for every second of the actual game. Even if something like an instant replay comes up, The Mysterious Web Socket is still tossing data at us. So, we need to know this in game timestamp in order to match the frame to the JSON data.
The first thing I did was try basic OCR on the timestamp. I tried every popular library, and all of them gave me terrible results. My guess is that the weird font and the constantly changing background made things very difficult.
Finally, I came across the Google Cloud Vision API, which also does OCR. This API did amazingly well and barely made any mistakes. But there was an issue.
It costs $1.50 for every 1,000 images you process with the API. My first thought was to stitch a bunch of timestamps together into one image and process them all as a single request, but for some reason this gave terrible results; the API kept returning incorrect answers. That left me one option: sending each tiny timestamp image to the API one at a time. I had over 100,000 frames, which means it would cost me around $150. That’s not too bad, but I don’t have that kind of money… I’m just a college kid :(.
But. I was blessed to find this:
You get $300 for free just for making an account. Now, I definitely did not make three accounts just so that I could use the $900 of free credit to process my VODs and do random testing and scripting on GCP. That would be against the Terms of Service and disrespectful to the company. Kappa.
Anyways, with the power of this free money at my disposal, I wrote a script that processed the VODs with the Google Vision API one by one. This script output a JSON called “time_stamp_data_clean.json” that took the individual frames from the game and labeled them based on what the in game timer read for that frame.
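Once the API returns raw text for each cropped timer image, it still has to be converted into something matchable against the web socket data. A minimal sketch of that conversion; the function name and the replay-skipping behavior are my own assumptions, not the actual script:

```python
import re

def timestamp_to_seconds(ocr_text):
    """Convert an OCR'd in-game timer like '12:24' into total seconds.

    Returns None when the text doesn't look like a timer, which is one
    way to skip frames showing replays or interviews (no timer on screen).
    """
    match = re.fullmatch(r"(\d{1,2}):(\d{2})", ocr_text.strip())
    if match is None:
        return None
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds
```

Frames where the OCR output doesn’t parse as a timer simply get dropped, which conveniently filters out the interruptions.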
Holy moly boys, it’s coming together! Stuff is working!
At this point, everything was nearly in place and the dataset was almost ready to go. Now it was time for the last step: matching the data from this JSON with the JSON from The Mysterious Web Socket. For that, I created this script.
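The matching itself boils down to a join on the in-game second. Here’s a stripped-down sketch of the idea; the real script works with files on disk, and the dict shapes here are just illustrative:

```python
def match_frames(frame_to_seconds, seconds_to_positions):
    """Pair each VOD frame with the web socket data for its in-game second.

    frame_to_seconds: {frame filename: in-game second read via OCR}
    seconds_to_positions: {in-game second: champ position data}
    Frames whose second has no socket data (e.g. OCR misreads) are dropped.
    """
    matched = {}
    for frame_name, secs in frame_to_seconds.items():
        if secs in seconds_to_positions:
            matched[frame_name] = seconds_to_positions[secs]
    return matched
```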
It’s a big pain to work with a giant dataset if it isn’t organized properly. I needed a good way to say “this frame has these bounding boxes + labels”. I could have just had a bunch of .jpg files and a .csv file with all the label and coordinate information. It would look something like this:
frame_1.jpg, Ahri [120, 145], Shen [11, 678], ...
frame_2.jpg, Ahri [122, 147], Shen [15, 650], ...
frame_3.jpg, Ahri [115, 133], Shen [10, 700], ...
This was bad though, because CSV files are annoying and JPG files are even more annoying. Plus it means I would have had to rename all my image files so they correspond to that CSV. Hell no. There had to be a better way. And there was.
Instead of JPGs and CSVs, I saved all the data into a .npz file, which stores raw numpy arrays. Numpy is the language of machine learning, so this was perfect. Each image was saved within the numpy array along with its labels. It looks something like this:
# labels for frame 1
[[Ahri, 120, 145, 130, 150],
 [Shen, 122, 147, 170, 160],
 ...]
# labels for frame 2
[[Ahri, 125, 175, 180, 190],
 [Shen, 172, 177, 190, 180],
 ...]
Now we don’t need to deal with pesky file names or annoying ass CSVs. Everything is saved in one massive array, written to a single file and easily accessible by index.
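As a tiny sketch of what saving and loading that .npz looks like (the array sizes and the integer champ-id encoding are my assumptions for the example, not the project’s exact format):

```python
import numpy as np

# Each frame is a raw pixel array; each label row is
# [champ_id, x_min, y_min, x_max, y_max], with champion names mapped to
# integer ids so everything fits in numeric arrays.
frames = np.zeros((2, 295, 295, 3), dtype=np.uint8)  # 2 dummy RGB mini map crops
labels = np.array([
    [[0, 120, 145, 130, 150], [1, 122, 147, 170, 160]],
    [[0, 125, 175, 180, 190], [1, 172, 177, 190, 180]],
])

np.savez_compressed("dataset.npz", images=frames, boxes=labels)

# everything comes back indexable, no filenames involved
data = np.load("dataset.npz")
first_frame, first_boxes = data["images"][0], data["boxes"][0]
```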
Finally the hard part of deep learning, getting the data + organizing it, is done!
Choosing a Neural Net Architecture
What good is all this data without a model to train on it?!
I knew from the beginning that I wanted to use an existing architecture that specialized in object detection because this is all just a proof of concept. I didn’t want to spend weeks trying to create an architecture perfect for video games. I’ll leave that for future Ph.D. students to solve :). As you read way above, I decided to use YOLO because it’s fast and was state of the art for a while. Plus, the creator of YOLO is amazing and has open-sourced all his code, which really opens up doors for developers. But YOLO is written in C, a language I didn’t want to use simply because most of the code for the data was already written in Python and some Node.js. Luckily, a group of lovely people decided to create YAD2K, which allows people to use YOLO with Python + Keras.
And, honestly, another huge reason I chose YOLO is that I actually understood the paper behind it. I knew I would need to mess around with the core code behind the architecture, which meant I’d need to truly understand it. After reading the papers behind other popular architectures, things didn’t click for me as clearly. YOLO’s ability to look at an image just once and come to a conclusion felt the most human compared to, for example, the thousands of region proposals used by R-CNNs. On top of that, the code behind it was easy enough to follow alongside the paper.
I won’t be explaining at length how YOLO works within this post simply because there are many other resources, such as this one, out there that would explain it better than I!
Warning: I’m getting way more technical in this section. If you get confused, you can always ask me questions on Twitter!
YOLO is a very deep neural net. Or maybe it’s not and I’m just easily impressed. In any case, here’s the architecture:
I use a 2012 Macbook Pro. There was no way I was about to train this massive model on my machine. It would actually take years to finish so I decided to pay for an AWS EC2 GPU instance because I wanted to finish training the model within the century.
Here’s how the re-train script runs:
- I’m not training YOLO completely from scratch. YAD2K first takes pre-trained weights, freezes all the layers of the body, and runs for 5 epochs.
- Then, it runs with the entire model unfrozen for 30 epochs.
- Then, it runs with early stopping and exits when the validation loss begins to go up so that we don’t overtrain our model.
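That last stage, early stopping, reduces to a simple rule. Here’s a simplified stand-in for what Keras’s EarlyStopping callback does under the hood (not the actual YAD2K code):

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch at which training should stop.

    Stops once validation loss has failed to improve for `patience`
    consecutive epochs; a simplified model of Keras's EarlyStopping.
    `val_losses` is the per-epoch validation loss sequence.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1
```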
So, in the beginning I naively took data from about 5 LCS games, which is about 7,500 frames, ran it through YOLO, and it overfit to oblivion within 2 epochs. This makes sense, though. The model has a lot of parameters, and since I didn’t use any form of data augmentation, I was asking for it to fail.
Speaking of data augmentation, I actually didn’t use it for this project at all. Usually data augmentation helps models a TON when training them on objects in the real world. For example, a cup could appear within an image at thousands of different sizes. There’s no way we could have a dataset that covers every cup size, so we use data augmentation. But in this case with the mini map, everything was constant except the positions of the champ icons and other things, like wards. The size of the mini map was also the same at all times because I only used VODs that were 1080p. Data augmentation might have been useful in situations where I wanted more data for a specific champion: I could have just flipped the mini map frame and gotten two frames from a single frame. But that would have flipped the champ icons as well, which may have just ended up confusing the model. It’s something I haven’t tested. Might be good though!
After my first failure I then thought “okay whatever I’ll throw my entire dataset at it”. But my training script kept crashing within 20 minutes or so because my system was “out of memory” (as in RAM).
This makes sense because my dataset is huge and the training script was loading the entire dataset into RAM. That’s the equivalent of opening 100,000 images on your computer at once. I could have just rented an AWS instance with more memory and there would have been no issues, but I’m cheap and poor. So, I added new functionality to the training script to let it train in batches. This meant it only loaded part of the dataset at a time instead of pulling the whole thing into memory at once. Saved!
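Conceptually, the train-by-batch change is just slicing: feed the model a chunk at a time instead of the whole array. A minimal sketch (in practice you’d also load each chunk lazily from disk, e.g. np.load with mmap_mode or per-chunk .npz files, so the full dataset never hits RAM):

```python
import numpy as np

def batch_generator(images, boxes, batch_size=32):
    """Yield (image_batch, box_batch) slices instead of the whole arrays.

    Only `batch_size` frames need to be materialized per training step;
    the last batch may be smaller than batch_size.
    """
    for start in range(0, len(images), batch_size):
        yield images[start:start + batch_size], boxes[start:start + batch_size]
```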
I ran YOLO with my ‘train by batch’ modification and things finally started rolling. It took me about 10 tries where I’d run the model for a few hours, realize the code had a massive bug, stop the training, start again, and repeat the process. Eventually, I fixed the bugs and finally saw the loss going down, and the model wasn’t overfitting this time! I ran the model for the full training time, which took about two days. Sadly, my wallet took a hit, but at the end of it all I had my final weights. The loss converged nicely, and the validation loss also converged while hovering right above the training loss.
I wish I could show you guys my fancy TensorFlow graphs but I was an idiot and accidentally deleted everything on my instance after saving the trained weights onto my laptop. KMS :(. I could train the model for 2 days again to analyze the training, just drop me the cash for a GPU instance :).
So that’s it! After all that work I was finally able to get some really solid results for my task. Now, I could end here and talk about how perfect of a tool this is and how amazing I am, but that’d be lying. To clarify, this is far from a perfect tool. But I am indeed amazing.
While DeepLeague does really well most of the time, let’s observe one of its main issues.
In the output above, DeepLeague incorrectly labels Cho’Gath as Karma and gives it a confidence score of 1.0. This is terrible. The neural net seems to be 100% sure that Cho’Gath is Karma. I mentioned in another section that I had to balance my dataset because I would have a lot of data for one champ but very little for another. For example, for Karma I had a lot of data because she’s broken and everyone used to play her in the LCS. But not many people play Cho’Gath! This meant I had way less data for Cho’Gath than Karma. There’s actually a much deeper issue here that made balancing the dataset so hard.
Let’s say I have 50,000 frames where Karma is in the game and my entire dataset is 100,000 frames. That’s a lot of data for a single champ, and training my net on that much Karma may make it much more difficult for the net to learn about other champs. It may also cause major localization issues.
I know what you’re thinking: “just throw away some of that Karma data!”. But I can’t just start throwing away frames with Karma in them, because each frame that includes Karma also has data for 9 other champions. That means if I got rid of Karma frames I’d be reducing the data for 9 other champs as well! I tried to balance the dataset the best I could, but it’s very possible that Cho’Gath is being classified as Karma simply because the net has seen very little Cho’Gath in very few areas of the map while it has seen a lot of Karma all over the map. The easy answer to this problem, like many other problems in deep learning, is more data. And that’s very possible because we can keep scraping data from the web socket! We could also use something called the focal loss, which down-weights easy, over-represented examples so a lopsided dataset hurts less. I have yet to mess with it, though.
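For reference, the focal loss from Lin et al. scales the standard cross entropy term by how confident the model already is, so easy, over-represented examples contribute less to training. A quick numpy sketch of the formula FL(p_t) = -(1 - p_t)^γ · log(p_t):

```python
import numpy as np

def focal_loss(p_correct, gamma=2.0):
    """Focal loss for a single prediction.

    p_correct is the predicted probability of the true class. Confident,
    easy examples (p_correct near 1) have their loss scaled toward zero,
    so rare champs like Cho'Gath contribute relatively more to the
    gradient than the sea of easy Karma examples. gamma=0 recovers
    plain cross entropy.
    """
    return -((1.0 - p_correct) ** gamma) * np.log(p_correct)
```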
Despite this issue of misclassification for certain champions, DeepLeague is still surprisingly good. I’m actually very excited to see if this project inspires some other ideas as well. For example, could we perhaps perform action detection within a video game? That way if certain champs use certain spells at certain times, we can recognize that! When you see Ahri throw her Q out she makes certain movements, movements that a neural net can analyze. Imagine if you could analyze how Faker times his abilities, manages his mana, and roams around the map. That can all be possible via computer vision :).
Thank you so much for reading this entire post, you’re my favorite. Feel free to call me out on Twitter if you have any questions. Goodbye for now!