I have often been referred to as an uber-nerd. In fact, some have gone so far as to refer to me as such in a job recommendation. That is not a complaint, as it got me the job. It is a mantle I gladly embrace. As an uber-nerd I like to dabble in all things nerdy, from getting a Ph.D. in mathematics, to becoming a 2 dan Go player, to running time series analysis on data from a text-based game. Recently my data dabbling has developed into a full-blown passion, and I am in the process of turning it into a career through the Insight data science program.

However, not all my hobbies can be as mentally draining. Now and then I need the nerd equivalent of “grabbin’ a beer, kicking back, and watching the game”. Unfortunately, traditional sports are not my cup of tea. I have been blessed by my fortuitous time of birth. E-sports is rapidly on the rise, allowing me to spectate humans competing in feats of strength of mind and dexterity of fingers that are more relevant to my interests. And finally this meandering path brings us to my point: Twitch.

There are many things I love about Twitch. However, no one is perfect (as hard as I may try), and Twitch has an infamous dark side: the chat.

I aspire to be the chosen one and bring balance to the chat. Part of the Insight data science program is a quick three-week project. Just as every Jedi needs a Sith to do battle against, every project needs a problem to solve. It is thus that Twitch chat and I must inevitably cross lightsabers. (Despite the metaphor, I feel inclined to mention that I am in fact more of a Trekkie.)

The Project

In essence, I would like to be able to interact with streamers through the chat. However, in large streams Twitch chat quickly develops a horde mentality hell-bent on nothing less than the complete and total destruction of logical conversation flow. Which is great, if you are into that sort of thing, and many of Twitch’s users are. In what follows I will discuss my data product, which allows streamers to extract relevant messages from the chat without interfering with the chat’s right to continue its crusade against sanity. This will allow Twitch chat to become a more interactive platform for all its users, not just for The Horde.

Every hero must know their limits. Taking on all of Twitch chat at once would be too formidable a task even for a paragon of data science. I decided to focus on one game, Hearthstone. I chose Hearthstone for two reasons: it is a game I know well, so I feel qualified to discuss what is relevant to it, and it commonly draws a lot of viewers, so it is where my data product will make the most impact.

All well-choreographed sword (or equally, lightsaber) fights start with a prolonged period of the fencers circling each other, throwing testing blows, and trying to take the measure of their opponent. In my clash with Twitch chat, this is represented by the data gathering stage. To the surprise of nobody, it turns out that no one before has taken the time and effort to make a publicly available database of Twitch messages labeled by relevance.

Thus I sent forth my minions (bots) to sit in popular channels on Twitch and to quietly listen. Once they returned with their spoils I began the most tedious part of my project, dramatic pause for emphasis, manually labeling the data! Three very dull hours later I was at the limit of my tolerance for monotony.

Some quick analysis provided useful insights on how to proceed. Tournament broadcasts had around 90% spam, while individual streamers had a rate closer to 75%. Once I discovered this trend I focused my attention on individual streamers. In retrospect it is clear that individual streamers are the correct market for my product, as they are more likely to be reading and interacting with chat.

In labeling the data I came across many messages containing no words. This led me to one of the defining questions of the project: “In my context, what exactly is a word?” I wanted to start by grouping words into three categories: Twitch words (those native to all of Twitch), Hearthstone words (words native to Hearthstone), and English words (I think this one is self-explanatory).

Fortunately, a minion’s day’s work is never done. While I was focusing on analysis, my minions diligently stayed out collecting more data.

Using this new data I defined a Hearthstone word to be a word that appeared in the Hearthstone data above a base frequency, but did not occur in other games above that frequency. I defined a Twitch word to be a non-English word that appeared above the same frequency in all the data.
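The two definitions above can be sketched in a few lines of Python. This is only an illustration: the threshold `base_freq`, the English word list `english_vocab`, the whitespace tokenization, and the reading of "all the data" as the pooled corpus are all assumptions, since the exact choices are not stated here.

```python
from collections import Counter

def relative_freq(messages):
    """Map each lowercase token to its share of all tokens in `messages`."""
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}

def categorize(hearthstone_msgs, other_msgs, english_vocab, base_freq):
    """Split tokens into Hearthstone words and Twitch words.

    base_freq and english_vocab are placeholders (assumptions), not the
    values used in the project.
    """
    hs = relative_freq(hearthstone_msgs)
    other = relative_freq(other_msgs)
    pooled = relative_freq(list(hearthstone_msgs) + list(other_msgs))

    # Frequent in Hearthstone chat, but not frequent in other games' chat.
    hearthstone_words = {
        w for w, f in hs.items()
        if f > base_freq and other.get(w, 0.0) <= base_freq
    }
    # Non-English and frequent across the pooled data.
    twitch_words = {
        w for w, f in pooled.items()
        if f > base_freq and w not in english_vocab
    }
    return hearthstone_words, twitch_words
```

Note that nothing in this sketch stops a word from landing in both sets; how to break such ties is a design choice left open.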

At this point I ran some simple models on a collection of six descriptive features, and the results were promising. All models beat random chance in terms of accuracy, but random forest performed the best. The large difference between the training accuracy and the validation accuracy led me back to the data mines.

After another stint of manual labeling I ended up with 10,000 labeled messages and 200,000 unlabeled ones. In my first models I was unimpressed with the feature importance of the word frequency features. Now that I had more unlabeled data I used word2vec to vectorize the words and group them into clusters based on the cosine distance. Some of the clusters with highest importance to the final model are shown below.

In red we can see that something funny happened. The word chair was grouped with the pronouns and names. So what happened? Well it is a common joke on some Twitch streams that when a streamer leaves their stream the webcam is pointed at their chair. So the chat will start addressing the chair as they would the streamer. So it seemed that a K-means clustering algorithm does in fact have a sense of humor.
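For the curious, here is a minimal sketch of clustering word vectors by cosine similarity with a spherical variant of K-means. It is not the project's code: the toy two-dimensional vectors stand in for real word2vec output, and the fixed starting centroids and iteration count are assumptions made to keep the example deterministic.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine_similarity(a, b):
    # Both inputs are assumed to be unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def cluster_words(vectors, initial_centroids, iterations=10):
    """Assign each word to the centroid it is most cosine-similar to."""
    words = list(vectors)
    unit = {w: normalize(vectors[w]) for w in words}
    centroids = [normalize(c) for c in initial_centroids]
    assignment = {}
    for _ in range(iterations):
        assignment = {
            w: max(range(len(centroids)),
                   key=lambda k: cosine_similarity(unit[w], centroids[k]))
            for w in words
        }
        for k in range(len(centroids)):
            members = [unit[w] for w in words if assignment[w] == k]
            if members:  # keep the old centroid if a cluster empties out
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)
    return assignment

# Toy vectors (made up for illustration): "chair" sits near the pronouns.
toy_vectors = {
    "you": [1.0, 0.1], "he": [0.9, 0.2], "chair": [0.8, 0.1],
    "mana": [0.1, 1.0], "damage": [0.2, 0.9],
}
labels = cluster_words(toy_vectors, initial_centroids=[[1.0, 0.0], [0.0, 1.0]])
```

With vectors like these, "chair" ends up in the same cluster as "you" and "he" simply because it appears in the same contexts, which is the effect described above.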

After tuning parameters to maximize recall, I settled on a random forest with 300 trees that takes 300 features, each a word frequency corresponding to a different word cluster.
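To make the feature construction concrete, here is a rough sketch of turning one message into per-cluster word frequencies. The mapping `word_to_cluster` is hypothetical, and dividing by the message length is an assumption; raw counts would be an equally plausible reading.

```python
def cluster_frequencies(message, word_to_cluster, n_clusters):
    """Return, for each cluster, the share of the message's tokens that fall in it."""
    tokens = message.lower().split()
    features = [0.0] * n_clusters
    if not tokens:
        return features
    for tok in tokens:
        cluster = word_to_cluster.get(tok)
        if cluster is not None:  # words never seen during clustering are skipped
            features[cluster] += 1
    return [count / len(tokens) for count in features]
```

For example, with `word_to_cluster = {"you": 0, "chair": 0, "mana": 1}` and three clusters, the message "You chair mana xyz" becomes `[0.5, 0.25, 0.0]`. One such vector per message is what a random forest would then be trained on.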

Now I had to choose the best way to implement my model. I decided against a traditional spam filter, since the cost of displaying spam when there are no relevant messages is very low. Instead I used my model to output the probability that a message is relevant, and I let the user choose how often they want to see new messages. The web page then ranks the most recent messages by relevance and displays the top four in order.
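The ranking step is simple enough to sketch. In the snippet below, `relevance` stands for whatever function returns the model's probability for a message, and the `(timestamp, text)` message format and the `window_seconds` parameter are assumptions made for illustration.

```python
import heapq

def top_messages(messages, relevance, now, window_seconds, k=4):
    """Rank messages from the last `window_seconds` by relevance, best first.

    messages: iterable of (timestamp, text) pairs.
    relevance: callable returning the estimated probability that a text is relevant.
    """
    recent = [text for ts, text in messages if now - window_seconds <= ts <= now]
    return heapq.nlargest(k, recent, key=relevance)
```

Because nothing is ever filtered out, a quiet window with no relevant messages simply shows the least spammy spam, which matches the low cost of that mistake.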

To see for yourself you can visit

Since my product was more about ranking messages than classifying them, I decided a standard accuracy-type metric would be the wrong approach to validation. Instead, since Twitch chat is 80% spam, I decided to look at the rate at which a random relevant message from the test set was ranked at the top of a group that also contained four random spam messages from the test set. The relevant message rose to the top 80% of the time.
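This metric can be estimated by simple resampling. The sketch below assumes the model's scores for the relevant and spam test messages are already in two lists; the number of trials, the seed, sampling spam without replacement, and counting ties as misses are all assumptions. For reference, a purely random ordering would put the relevant message on top one time in five.

```python
import random

def top_rank_rate(relevant_scores, spam_scores, n_spam=4, trials=10000, seed=0):
    """Estimate how often a random relevant message outranks n_spam random spam messages."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        relevant = rng.choice(relevant_scores)
        spam = rng.sample(spam_scores, n_spam)  # needs at least n_spam spam scores
        if relevant > max(spam):  # ties count as a miss
            wins += 1
    return wins / trials
```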

Where To Go From Here

Thus I struck a strong blow against Twitch chat, but not a fatal one. And as we all know, even if you cut off three of your opponent’s limbs and throw them into a pit of lava, they still might return.

If given more time there are still many ways in which I could improve my model and the overall data product.

First, I could use a lot more data. It would be nice to crowdsource the labeling. However, this calls for caution, as some domain knowledge of the game being streamed is required.

With more data I would feel more confident increasing the number of features, in particular by increasing the number of clusters to get finer distinctions between words. (I tried this with my current data set, but it led to a slight reduction in validation score.) As was seen earlier, many game words got grouped with numbers, since they are things that would often be quantified. While this makes sense, grouping like this is not ideal. Additionally, I would like to add manual features, such as whether an @ symbol is followed by the streamer’s name and whether a question mark is at the end or in the middle of a statement.
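The manual features mentioned above might look something like the following. This is a sketch of features I have not built yet, so the function name, the case-insensitive matching, and the exact definitions are assumptions.

```python
def manual_features(message, streamer_name):
    """Hand-written features: a direct @mention and where question marks appear."""
    text = message.strip()
    return {
        # "@name" anywhere in the message, ignoring case.
        "mentions_streamer": ("@" + streamer_name.lower()) in text.lower(),
        "ends_with_question": text.endswith("?"),
        # A question mark somewhere before the trailing run of question marks.
        "question_in_middle": "?" in text.rstrip("?"),
    }
```

For instance, `manual_features("@ExampleStreamer why not attack?", "examplestreamer")` flags the mention and the trailing question mark, while "what? lol" only sets `question_in_middle`.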

Due to the frequency of misspellings and alternate spellings on Twitch (for example awesome vs 4wesome) I performed no type of spell check. It would be helpful to develop a custom spell check for common words on Twitch and their misspellings. This would allow me to add a feature that easily removes duplicate messages.
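As a very rough illustration of the idea, the sketch below undoes a few digit-for-letter swaps before comparing messages. The substitution table is a made-up starting point, not a real Twitch spelling dictionary, and it would mangle tokens where the digit is intentional, which is exactly why a custom spell check would be needed.

```python
# Hypothetical digit-for-letter substitutions; a real table would need curating.
SUBSTITUTIONS = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o"})

def normalize_token(token):
    """Lowercase a token and undo digit swaps, leaving pure numbers alone."""
    token = token.lower()
    if any(ch.isalpha() for ch in token):
        token = token.translate(SUBSTITUTIONS)
    return token

def normalize(message):
    return " ".join(normalize_token(tok) for tok in message.split())

def drop_duplicates(messages):
    """Keep the first message of each group that normalizes to the same text."""
    seen = set()
    kept = []
    for msg in messages:
        key = normalize(msg)
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept
```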

Whether a message is spam or relevant can be subjective. More often than not it is clear. For example “squid1 squid the Illuminati” is definitely spam. However there are still edge cases that are determined by a person’s preference. I would like to add in a feedback mechanism that would allow users to rate the messages they are shown and put those ratings back into the pipeline.

It would be beneficial to use speech-to-text to capture what the streamer is saying, so that how close a message is to what the streamer is discussing could be taken into account.

Such text ranking techniques are far from limited to Hearthstone chat. Similar processes could be used to provide insight into a variety of data products.

With great data comes great responsibility. While the intentions of many data products are innocent, they can often have unforeseen adverse effects. It is possible that a censorship product such as this could silence the voice of a minority. In particular, I am worried about non-native English speakers. I would like to encourage others to test this product for such effects before any widespread adoption.