Methodology for Creative-IRL deep dive

Tchaikovsky
10 min read · Jul 16, 2018


This post goes into detail about the statistics and methodology behind the Creative-IRL work I’ve done. My goal is to show how I obtained my results so people can judge how valid they are, and to explain my approach clearly enough that any interested person can understand what’s going on. I tried to link to things I didn’t feel I explained very well, since my explanations probably aren’t enough on their own.

Data Collection

From data I collected through Twitch’s API, I randomly sampled streamers who streamed IRL or Creative in February and March. I required a minimum of 5 Creative/IRL streams in either month, or 10 streams over the course of both months.

I sorted the streamers roughly into 4 groups: Creative streamer within the Creative directory, Creative streamer outside the directory, streamer who split time between IRL and Creative, and primarily IRL streamer. These groups were based on streaming behavior and a review of past VODs. I didn’t come across many pure IRL streamers, but the ones I included streamed a good portion of their time to the IRL directory. After selection and after removing streamers that did not fall into one of the above groups, I was left with about 1000 streamers.

Ethics of data collection

Data scraping of social media is a controversial topic in ethics. Informed consent is a common standard in research, but the scope and scale that would be needed in social media projects is prohibitive. Some researchers consider social media to be public data, while others argue there is a reasonable expectation of privacy for the average person on social media.

Ultimately, I decided that the data I was collecting should be considered public, meaning I would not have to contact every streamer I wanted to track to make sure they were okay with it first. Most streamers are trying to grow their streams, which means they are trying to attract people who are new to them. The only place I could think of where there are regular interactions between people who have never met before is out in public.

I set up an IRC client to generate chat logs for these channels. I collected chat data as well as data from the Twitch API throughout April. I got banned in 8 channels where I had never chatted, so I excluded these channels from any results.

Methodology

Only live streams were included, so no premieres or re-runs. Streams must have been live for a minimum of 30 minutes, and each category in a stream must have been streamed for a minimum of 30 minutes or that section was excluded.
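
As a sketch of those filtering rules, here is roughly what they could look like in R with dplyr, assuming a hypothetical data frame with one row per category section of a stream (the column names are placeholders, not my actual data):

library(dplyr)

# sections: hypothetical data frame with columns stream_id, stream_type,
# game_category, and section_minutes (one row per category section)
sections %>%
  filter(stream_type == "live") %>%            # drop premieres and re-runs
  group_by(stream_id) %>%
  filter(sum(section_minutes) >= 30) %>%       # keep streams that were live for at least 30 minutes
  ungroup() %>%
  filter(section_minutes >= 30)                # drop category sections under 30 minutes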

I separated out the mixed streamers (streamers who had both Creative and IRL streams in good balance) for most of these analyses because they offer a direct comparison of Creative and IRL for the same stream/streamer. So for most of the analyses I conducted, I ran each one twice: once for the streamers who mixed their time between IRL and Creative, and once for all the other streamers I tracked.

The way I decided to look at bans, subs, and bits was on an hourly basis. Longer streams obviously allow more opportunity for any of the above events to occur, but I didn’t think any of them were particularly tied to certain times during stream. In other words, a streamer is just as likely to ban chatters in the first hour of stream as they are the last hour of stream. This is a major assumption that underlies my results.
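
As a trivial illustration of that normalization (made-up numbers, just to show the hourly rate calculation):

# hypothetical per-stream counts and durations
streams <- data.frame(
  ban_count        = c(0, 4, 12),
  duration_minutes = c(95, 180, 260)
)

# express each count as an hourly rate so long and short streams are comparable
streams$hourly_bans <- streams$ban_count / (streams$duration_minutes / 60)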

I used generalized linear mixed-effects models for the majority of my statistical testing. I’ll try to explain what that is:

A linear model (regression) is essentially a best straight-line fit, where a change in an independent variable is associated with a constant change in the dependent variable. Think y = mx + b.

The mixed-effects part means the model is better able to account for the variation between subjects, in this case streamers. Since I collected data on multiple streams from each streamer, I can add a variable that captures the variance between streams of a single streamer. The goal is to get a clearer picture of the effects of the other variables.

A generalized linear model relaxes the assumption that the response is normally distributed, allowing a better fit to a larger variety of data. In my case, I used a Tweedie distribution. The Tweedie distribution has a useful property: it allows a spike of exact zeros along with a continuous, skewed distribution of values above zero. Since it’s common to have streams where there are no bans/subs/bits, this allows a more accurate model.
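
To make the spike-at-zero property concrete, here is a small simulation sketch using the tweedie package (purely illustrative, not part of my analysis); a power parameter between 1 and 2 gives a compound Poisson-gamma distribution with an exact point mass at zero:

library(tweedie)

set.seed(1)
# many exact zeros plus a skewed positive tail, much like hourly ban/sub/bit rates
sim <- rtweedie(10000, mu = 2, phi = 2, power = 1.5)

mean(sim == 0)          # proportion of exact zeros (the "spike")
hist(sim, breaks = 50)  # the rest of the mass sits above zero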

For the bans, subs, and bits the models followed a common pattern (R notation):

hourly_measure ~ log(median_viewer_count + 1) + game_category + (1 | streamer)

The hourly measure is the dependent variable, with game category (Creative, IRL, Other) being one independent variable and the log of a stream’s median viewer count being the other. The (1 | streamer) term captures the variance of the individual streamer. The median viewer count is log transformed because viewer counts are highly right skewed, and this skew could have an outsized impact on the estimated coefficients. Log transforming makes the variable much better behaved.

The coefficients for the independent variables are interpreted as the effect on the dependent variable when all other independent variables are held constant. Including the (transformed) median viewer count accounts for the effect of viewership, so the game category coefficient can be interpreted as the effect of the category when viewership is held constant.
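
To make that concrete, here is a sketch of fitting a model of this form in R. One package that supports Tweedie mixed models is glmmTMB (this is not necessarily the exact code behind my results, and the data frame and column names are placeholders):

library(glmmTMB)

# streams: hypothetical data frame with one row per stream (or category section),
# containing hourly_bans, median_viewer_count, game_category, and streamer
fit <- glmmTMB(
  hourly_bans ~ log(median_viewer_count + 1) + game_category + (1 | streamer),
  family = tweedie(link = "log"),
  data   = streams
)

summary(fit)           # game_category coefficients, holding viewership constant
exp(fixef(fit)$cond)   # with the log link, these read as multiplicative effects on the hourly rate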

Timeout/Bans

In looking at bans, I included timeouts of 5 minutes or longer as well as permanent bans. There is no way to determine unbans, so all bans were included regardless of whether a chatter was unbanned at a later time. Another thing to note: I counted mod actions against unique chatters per stream. For instance, 3 mods all timing out a user at the same time would count as one action. This has the unfortunate side effect that if a chatter waited through their timeout and was timed out or banned again at a later point in the same stream, the later action would not be counted. An individual chatter receiving a timeout in two separate streams would still count as two actions.
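
As a sketch of that counting rule, assuming a hypothetical log with one row per observed timeout or ban (made-up column names, using dplyr):

library(dplyr)

# mod_actions: hypothetical log with columns stream_id, chatter, action, duration_seconds
mod_actions %>%
  filter(action == "ban" | duration_seconds >= 300) %>%   # permanent bans, or timeouts of 5+ minutes
  distinct(stream_id, chatter) %>%                         # one action per unique chatter per stream
  count(stream_id, name = "mod_actions")                   # per-stream count, later put on an hourly basis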

As mentioned above, I fit two models: one for the mixed streamers, and one for all the other streamers. For mod actions, both had similar results. The mixed streamers showed a mod action frequency in IRL 2.99x higher than in Creative, while the other tracked streamers showed an IRL mod action frequency 3.22x higher than in Creative. I will come back to p-values and significance later, but both results were highly significant.

Subs

Next was subs. Non-affiliated streamers can’t receive subs, so they were excluded. I ultimately created 4 models: two sub measures (all new subs of any type, including gifted, and new subs excluding gifted subs), each run for the mixed streamers and for all the other streamers. Prime subs were not treated differently than regular paid subs. I would have very much liked to examine sub retention, but that data is not available to me, as I can only see the sub renewals that get shared in chat.

There was no statistical difference in either group of streamers for either cut of new subs. As I call out in my other post, this is after controlling for median viewer counts. A difference in viewership and growth levels may still have an effect on new subs.

Bits

The analysis for bits was very similar to subs, with the only change being the hourly measure. Non-affiliated streamers were again excluded. As with subs, I did not find a statistical difference between bits cheered in Creative and IRL for either the mixed streamers or all the other streamers tracked.

Hosts

The statistical model I used to test end-of-stream hosts is a little different. Instead of using game category, I used the streamer type groups I had created at the beginning. I thought streamers were less likely to end a stream on an IRL segment, and I wanted to avoid situations where the majority of a stream is spent in one area like Creative and the last part is spent gaming. My sorting of the streamer groups isn’t perfect, but I posited it was likely accurate enough to draw conclusions in the aggregate. I did not run a model for the mixed streamer group, because end-of-stream hosting depends much more heavily on the individual streamer’s behavior; the main comparison is between the groups of streamers, and comparing streamers to themselves didn’t make sense.

To be specific, I used logistic regression for this analysis, which is commonly used for modeling binary outcomes. For this, I labeled streams as either ending in a host or not ending in a host. I did not differentiate between an autohost, a streamer-initiated host, or a host resulting from a raid.
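
A mixed-effects logistic model of this form can be fit with, for example, lme4 (placeholder column names, not necessarily the exact code behind the results below); I’m assuming the same per-streamer random intercept and viewer count control as in the other models:

library(lme4)

# streams: hypothetical data frame with one row per stream; ended_in_host is TRUE/FALSE,
# streamer_group is the streamer type group from the beginning of the post
host_fit <- glmer(
  ended_in_host ~ log(median_viewer_count + 1) + streamer_group + (1 | streamer),
  family = binomial,
  data   = streams
)

exp(fixef(host_fit))   # exponentiated coefficients are odds ratios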

I found the results between streamer type groups to be significant. Streamers I had grouped as primarily IRL were much less likely to host others at the end of their stream compared to the group I labeled as primarily streaming within Creative. Specifically, the odds of hosting are 60% lower when comparing an IRL streamer to a Creative one. Odds ratios aren’t intuitive to convert back to probabilities, so as an example: if you take two streamers with the same number of viewers, but one tends to stream IRL and the other Creative, the IRL streamer might be 35% likely to host another stream while the Creative streamer would be 58% likely to host another stream.
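
The conversion in that example works like this (a quick sanity check rather than model output):

or <- 0.40                                       # odds of hosting are 60% lower for IRL
p_creative <- 0.58                               # example probability for the Creative streamer
odds_creative <- p_creative / (1 - p_creative)   # about 1.38

odds_irl <- odds_creative * or                   # about 0.55
p_irl <- odds_irl / (1 + odds_irl)               # about 0.36, i.e. roughly 35%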

If you want to read further about log odds, there is an FAQ from UCLA that I think does a decent job.

Sentiment

I had a huge volume of chat messages, so to cut it down to an amount I could handle, I only used messages from partners or affiliates who set English as their channel language. This ended up being ~4.7 million messages.

I used two different methods for sentiment analysis: one from the R package meanr, the other from the package sentimentr. I again ran each of these twice, once for the mixed streamers and once for all the other tracked streamers. Linear mixed-effects models were used again, but the generalized versions were not needed in this case. Sentiment was determined at the message level, then averaged up to a per-stream sentiment score.
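
As a sketch of that message-then-stream aggregation, using sentimentr’s sentiment_by() and meanr’s score() (the specific functions and column names here are illustrative, and the real ~4.7 million messages would likely need to be processed in batches rather than in one call like this):

library(sentimentr)
library(meanr)
library(dplyr)

# chat: hypothetical data frame of messages with stream_id and message columns
chat$sent_sentimentr <- sentiment_by(chat$message)$ave_sentiment   # lexicon with valence shifters
chat$sent_meanr      <- score(chat$message)$score                  # simple positive/negative word counts

# average the message-level scores up to one sentiment score per stream
stream_sentiment <- chat %>%
  group_by(stream_id) %>%
  summarise(across(c(sent_sentimentr, sent_meanr), mean))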

The mixed streamers did not show a difference for either method. The second set of all the other streamers showed a significant difference for both methods, with IRL being correlated with more negative chats.

Mixed viewer differences

One question that eluded me and continues to elude me is the difference in growth potential between Creative and IRL. I was hesitant to try it on the data I have, since I already know that viewer counts are correlated with the streamer type groups I made: the IRL streamers that were selected tended to have higher viewer counts than the Creative ones. One thing I did try is comparing median viewer levels within the mixed streamer group to see if there was a difference in viewer levels between the Creative and IRL categories for those streamers. There wasn’t any difference between the two, which makes sense: growth tends to affect the stream as a whole and wouldn’t be limited to a single category for streamers who regularly switch categories. It doesn’t answer the question of how much growth is impacted by different game categories, though.
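
For illustration, one way to run a comparison like that, staying consistent with the mixed-effects approach used elsewhere, is a linear mixed model on log-transformed viewer counts (a sketch with placeholder names, not necessarily the exact test I ran):

library(lme4)

# mixed_streams: hypothetical data frame of streams from the mixed streamer group only
viewer_fit <- lmer(
  log(median_viewer_count + 1) ~ game_category + (1 | streamer),
  data = mixed_streams
)

summary(viewer_fit)   # the game_category coefficient is the Creative-vs-IRL difference in log viewer levels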

Statistical Significance

I mentioned I would come back to p-values and significance. I ran quite a few tests on the same data sets, and if a bunch of tests are run at the same critical p-value, the likelihood that some of your results are spurious goes up. The reason is that the commonly used .05 threshold means there is less than a 5% chance of seeing a result that extreme purely by chance when there is no real underlying effect. If, for example, 20 tests are run at the .05 level with no adjustments, you would expect to see about one “statistically significant” result that occurred by pure chance and wasn’t because of any underlying relationship. To counteract this, I used the Holm-Bonferroni method to mitigate the issues of multiple tests. I found the example in the Wikipedia article to be a clear explanation of how it works. There were 7 tests conducted on each of my two data sets: the mixed streamers and all other streamers tracked. As there was no overlap between these data sets, I felt comfortable treating them separately.
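
The adjustment itself is a single call in R; here is a sketch with made-up p-values for the seven tests in one set:

# hypothetical raw p-values for the 7 tests run on one data set
p_raw <- c(bans = 0.001, hosts = 0.004, sent_sentimentr = 0.02,
           sent_meanr = 0.03, bits = 0.04, subs_nongift = 0.30, subs_all = 0.45)

p_holm <- p.adjust(p_raw, method = "holm")
p_holm < 0.05   # which results survive the multiple-testing adjustment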

The test sets, with significant results marked:

All other streamers:

  • bans (significant)
  • hosts (significant)
  • sentiment(sentimentr) (significant)
  • sentiment(meanr) (significant)
  • bits
  • subs(non-gift)
  • subs(all new)

Mixed streamers:

  • bans (significant)
  • sentiment(sentimentr)
  • sentiment(meanr)
  • viewer levels
  • bits
  • subs(non-gift)
  • subs(all new)

The Holm-Bonferroni method excluded a few results that could otherwise have been considered significant. Bits cheered for all other streamers nominally showed IRL streams receiving more cheered bits. The mixed streamer set had 3 results that were not significant after adjustment: both sentiment analyses nominally showed slightly more negative chats in IRL versus Creative, similar to the results from the other streamers, and the viewer levels test might also have been significant if not for the multiple tests adjustment; I nominally found median viewer levels to be slightly lower in IRL than in Creative for the mixed streamers. None of the results outlined in this paragraph should be considered significant, but I outline them because I found them interesting.

Flaws/Shortcomings

This is a correlational study, so I can’t say that any of the results are “because of” something, only that Creative is associated with certain things and IRL is associated with others. I can hypothesize why the correlations are showing the way they are, but the reasons I come up with don’t have the weight of statistical analysis behind them.

The models I used were pretty sparse; for the most part they only involved a term to control for median viewer count and the independent variable of interest: game category or streamer type group. It’s possible that there are other factors that, when included, would change the results. This is called omitted variable bias.

I didn’t put much weight on the timing within a stream. I assumed that an hour of stream for a given streamer had the same chances to get a new sub, ban a viewer, etc. If these events tend to be tied to certain times like the beginning of a stream, that could change my results.

My results are time-boxed to a certain extent. Communities can change over time, and I don’t think there is anything inherent in Creative/IRL that forces it to be a certain way. As time goes on, things might change that make these findings less relevant or even invalid.
