AI-based Video Game Commentary With End-To-End Transformers?

Overview of the paper “End-to-End Object Detection with Transformers” by Nicolas Carion et al.

Chintan Trivedi
deepgamingai

--

Last year I shared a project prototype of an AI-based commentary system for the game of football using the GPT-2 language model. Check it out in the video embedded below.

Based on this project, I concluded that there was tremendous potential to generate non-repetitive commentary with unique phrases using GPT-like AI. But the big limitation was that it wasn’t entirely clear to me how to feed game information into the GPT-2 model so that it would generate relevant commentary. While we could manually craft an input representation of game events like free-kick, offside, goal, etc. and use that to generate relevant lines, it wasn’t clear how we could obtain a training dataset pairing such representations with commentary.
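For illustration, here is roughly what such hand-crafted conditioning could look like with the Hugging Face transformers library. The event-string prompt format below is an invented example, not something from my actual prototype:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical hand-crafted game-event representation used as a text prompt
prompt = "Event: free-kick | Team: home | Minute: 78 | Commentary:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation so repeated events don't yield identical lines
output = model.generate(**inputs, max_length=60, do_sample=True, top_k=50,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Even with a prompt format like this, the open question remains where the paired (event, commentary) training data would come from.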

So, when I came across this recent paper from Facebook AI on applying GPT-like Transformer models to images, it got me really excited. Before I explain how this work could potentially be used for video game commentary, let’s take a look at the paper. It is titled “End-to-End Object Detection with Transformers”, or DETR for short.

Until now, transformers were mainly used for language modeling tasks, but this is the first time I am seeing them applied to a computer vision task. The basic idea is to simplify the object detection pipeline of methods like Faster R-CNN by using transformers. The input image is converted into feature embeddings using a CNN, much like Word2Vec converts words into embeddings in NLP. Also borrowed from NLP is the concept of positional encodings, used here to retain spatial information. These are then fed to a transformer encoder, followed by a decoder that produces the desired output. In the case of this paper, the DETR model outputs a set of bounding boxes and class labels, showcasing the usefulness of transformers in computer vision tasks as well.
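To make this concrete, here is a minimal PyTorch sketch of that pipeline: CNN backbone, positional encodings, transformer encoder-decoder, and box/class prediction heads. The paper itself includes a similarly compact demo listing; this version is my own simplification, leaves out everything needed for training (in particular DETR’s set-based Hungarian matching loss), and uses placeholder hyperparameters:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class DETRSketch(nn.Module):
    """Simplified DETR-style model: CNN backbone -> transformer -> heads."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # CNN backbone produces a spatial feature map (the "image embedding")
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv reduces the 2048 backbone channels to the transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # Learned positional encodings to retain spatial information
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # Learned "object queries" fed to the decoder, one per predicted object
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # Prediction heads: class logits (+1 for "no object") and box coords
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))            # (B, hidden_dim, H, W)
        H, W = h.shape[-2:]
        # 2D positional encoding: concatenate column and row embeddings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)       # (H*W, 1, hidden_dim)
        src = pos + h.flatten(2).permute(2, 0, 1)   # (H*W, B, hidden_dim)
        tgt = self.query_pos.unsqueeze(1).repeat(1, x.shape[0], 1)
        out = self.transformer(src, tgt)            # (num_queries, B, hidden_dim)
        return self.linear_class(out), self.linear_bbox(out).sigmoid()
```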

Now, with regards to the game commentary task, this architecture is useful because we can combine the encoder of the image model with the decoder of the language model, something that wasn’t previously possible within a single model architecture. We could then learn commentary directly from the video of real-life football matches, connecting computer vision and natural language processing tasks together with a ready-made training dataset: broadcast footage and its accompanying commentary.
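To sketch the hypothesis, such a model might pair a DETR-style visual encoder with a GPT-style autoregressive decoder that cross-attends to the encoded frame. Everything below (class name, shapes, hyperparameters) is my own illustrative guesswork, not anything proposed in the paper:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class CommentarySketch(nn.Module):
    """Hypothetical frames-to-commentary model: a DETR-style visual encoder
    feeding a GPT-style text decoder. Purely illustrative."""

    def __init__(self, vocab_size, hidden_dim=256, nheads=8,
                 num_layers=6, max_len=512):
        super().__init__()
        # Visual encoder: CNN backbone + transformer encoder, as in DETR
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nheads), num_layers)
        # Text decoder: embeds commentary tokens, cross-attends to the frame
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, nheads), num_layers)
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.pos_embed = nn.Embedding(max_len, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame, tokens):
        # frame: (B, 3, H, W) video frame; tokens: (T, B) commentary so far
        feat = self.conv(self.backbone(frame))      # (B, hidden_dim, h, w)
        memory = self.encoder(feat.flatten(2).permute(2, 0, 1))
        T = tokens.size(0)
        positions = torch.arange(T, device=tokens.device).unsqueeze(1)
        tgt = self.token_embed(tokens) + self.pos_embed(positions)
        # Causal mask: each token may only attend to earlier ones (GPT-style)
        causal = torch.triu(torch.full((T, T), float('-inf'),
                                       device=tokens.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                    # (T, B, vocab_size) logits
```

Training would then amount to next-token prediction on (frame, commentary-token) pairs extracted from broadcast footage.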

This is just a hypothesis; I have no idea whether it will actually work. I have never worked with transformers before, but if you have, please leave a comment down below with your thoughts. I would love to discuss this with someone more experienced in the domain.

Thank you for reading. If you liked this article, you can follow more of my work on Medium or GitHub, or subscribe to my YouTube channel.
