Player Counts: False Friends in Games User Research?

Ever since reading this excellent piece by Steve Bromley a while back, I’ve been thinking about the points Steve raises about counts — that is, the “how many times did that happen?” question — in the playtests and usability research that we run on games. Should we rely on the number of players we observe having a problem in our game to help us understand how important that problem is? I’m coming at this from the games user research angle, but some aspects could also apply in wider user research settings.

This post might interest developers, user researchers, UX specialists, research managers — anyone whose role involves, or could involve, running playtests, usability analyses, or otherwise looking at how players experience your games. There’s also some discussion that applies to general UX research, not just games; or perhaps you just love getting in-depth on the minutiae of running research.

I’m a researcher at Player Research: we’re a small-ish team of playtesting and user research specialists based in Brighton, UK, and Montreal, Canada. We’re always trying to find better ways to do better research for the developers we work with, and reporting player counts in our playtests is one of those questions that comes up again and again: should we or shouldn’t we? This article reflects some of those meditations, and hopefully also gives a window into some of the things we think about as games user researchers.

Counts & Confidence

Steve’s very reasonable conclusion (and you should really read his full article) is that we usually shouldn’t report the number of players — or users — who encounter an issue to developers or clients. This is partly because the confidence intervals are so wide: the counts from a small usability test (of, say, five players) translate very loosely to the proportion of people who will go on to experience the issue in the real world. Steve cites Wald completion rate statistics, which suggest that if only one user encounters an issue in an N=5 test, anywhere from 0–65% of real-world players might actually go on to show that issue.
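To make that concrete, here is a minimal sketch of an adjusted Wald (Agresti-Coull) interval, one of the methods commonly used for small-sample completion rates, applied to one affected player out of five. The exact bounds depend on the method chosen, so treat the figures as illustrative rather than a reproduction of Steve’s numbers.

```python
# A minimal sketch of the adjusted Wald (Agresti-Coull) interval for a
# rate observed in a small usability test. The 0-65% figure quoted above
# will vary slightly depending on the exact method; this is an
# illustration, not a reproduction of the cited calculation.
import math

def adjusted_wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return a (lower, upper) confidence interval for a proportion."""
    n_adj = n + z ** 2
    p_adj = (successes + (z ** 2) / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# One player out of five shows the issue:
low, high = adjusted_wald_interval(1, 5)
print(f"{low:.0%} to {high:.0%}")  # roughly 2% to 64% of real-world players
```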

However, Steve also talks about how, as researchers, we need to use multiple criteria to prioritise which issues are the most important, and the number of users encountering an issue can be one of these factors. Steve points to this matrix, created by userfocus, which is an interesting and potentially useful tool for researchers. Using this matrix has a couple of implications: for example, an issue can only be defined as critical if it’s also defined as persistent.

(Note: Steve Bromley pointed out something useful to me here: persistence in the userfocus matrix doesn’t necessarily refer to the problem occurring across users/players; a problem can be persistent within a single user. This makes good sense, and makes the matrix a lot more attractive to me as a tool.)
Userfocus’s matrix gives a handy rationale for prioritising issues as critical, serious, medium or low.
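As a purely hypothetical illustration of that last point, a matrix-style rule might be encoded something like the sketch below. This is not the userfocus matrix itself; it simply shows how priority can be derived from more than one factor, so that an issue only reaches ‘critical’ when it is both high-impact and persistent.

```python
# A hypothetical, simplified encoding of a matrix-style prioritisation rule.
# This is NOT the userfocus matrix itself, just an illustration of the idea
# that priority comes from more than one factor, so an issue can only be
# rated "critical" if it is both high-impact and persistent.

def prioritise(impact: str, persistent: bool) -> str:
    """impact is 'high', 'medium' or 'low'; persistent means the problem recurs."""
    if impact == "high":
        return "critical" if persistent else "serious"
    if impact == "medium":
        return "serious" if persistent else "medium"
    return "medium" if persistent else "low"

print(prioritise("high", persistent=False))  # 'serious': high impact alone is not enough
print(prioritise("high", persistent=True))   # 'critical'
```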

One Rule for Us...

For me, there’s a bit of tension here. Player counts are a factor we can’t be confident about, in terms of the number of users who will be affected in the real world; but they’re also one of three key determinants of issue priority. We are not comfortable giving these numbers to clients or developers, yet we use them ourselves when reporting and prioritising issues. In terms of robustness and transparency, is it okay that we use player counts but aren’t comfortable communicating them?

As a user researcher — in fact, in any research, especially client-focused work — this is an issue that’s worth thinking about, because it’s a situation that could well crop up sooner or later. A similar situation came up for me in previous work in the charity sector, where a client was interested in trying to predict numbers on a national level from a sample. Now, as researchers, we know that we can do those calculations for a client, but we also know the problems involved in doing so and the limitations of how those figures should be interpreted; in basic terms, we are generating numbers with a huge confidence interval. In that case, we reached a compromise by asking what level of quality we, as researchers, would be happy putting our name to: our approach was to provide the calculations, but ensure that they were always provided alongside provisos. We even gave the client some basic stats training, so that they could talk to others about how confident they — and we — were about the statistics, and interpretations of the data that should be avoided.

A similar issue applies here with player counts. Essentially, we’re being a bit paternalistic: as researchers, we claim to be able to interpret and use these figures, especially when considered in light of other factors. The userfocus matrix is one shortcut to this. However, we also recognise how compelling these counts are: the reason we’re less than keen to provide them to others is that they could easily become the overriding focus of the findings we generate.

Counts Can Be Useful

Still, is it really for us to refuse to provide these counts? Not necessarily. Steve also covers this in his article by talking about ‘usual’ practice: Steve and his team just don’t provide them by default. Here at Player Research we do occasionally provide player counts: we and the developers we work with seem to find them especially useful in task-based research designs. Here, even with small numbers of players, patterns of performance often emerge: for example, two similar tasks may be problematic across all players, or individual players may have problems with certain types of task. And so the data presented immediately becomes a little richer, and a little harder to interpret in isolation.
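As a hypothetical example (with entirely made-up data), the same small task grid can be read per task and per player, which is part of what makes these counts richer than a single “X of 5 players” figure:

```python
# A hypothetical example of why per-task counts become richer than a single
# number: the same raw data can be read per task (which tasks trip up most
# players?) and per player (is one player struggling across the board?).
from collections import defaultdict

# 1 = task completed without issue, 0 = player had a problem (made-up data)
results = {
    "P1": {"build road": 1, "build rail": 0, "zone housing": 0},
    "P2": {"build road": 1, "build rail": 0, "zone housing": 1},
    "P3": {"build road": 0, "build rail": 0, "zone housing": 1},
}

problems_per_task = defaultdict(int)
problems_per_player = defaultdict(int)
for player, tasks in results.items():
    for task, completed in tasks.items():
        if not completed:
            problems_per_task[task] += 1
            problems_per_player[player] += 1

print(dict(problems_per_task))    # e.g. 'build rail' is a problem for all three players
print(dict(problems_per_player))  # e.g. P1 struggled with two of the three tasks
```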

In other cases, if asked to specifically provide player counts, we would always try to ensure that they’re not interpreted on their own. This could mean anything from a little explanatory text in a report, to a chat with the client, to some informal training.

Qualitative or Quantitative?

This led me to start thinking about a more fundamental question: is usability testing qualitative or quantitative? It’s very tempting to say it’s a bit of both, or that it’s a mixed method. However, in other disciplines, research does not comfortably sit across both qualitative and quantitative lines: in fact, compounding — or, worse, confusing — qual and quant methods is a sure-fire way to get rejected by a peer review process.

(Mixing methods in this context does not mean reporting results from both qualitative and quantitative methods in a single research project, which is uncontroversial; instead, where academic researchers run into trouble is in implementing counts and weighting of responses in qualitative research.)

And yet implementing a form of issue weighting — via player counts — is often exactly what we do in usability research. Is usability research unique in permitting this? If so, can we articulate why it is unique?

Outliers

I’d like to think that this isn’t just me waffling on about an abstract question. Steve talks about this issue in relation to outliers, where a ‘non-typical’ user encounters a problem that’s unrealistic for the game or software’s typical user demographic. An example would be recruiting four experienced strategy game players, aged 30–40, for a usability test of a new PC city builder game, along with one five-year-old who has never played a PC game before. That five-year-old is going to generate a lot of issues that don’t apply to the game’s target demographic! (There’s a whole can of worms to open here, in that those issues perhaps shouldn’t be ignored; but that’s an argument for another day.)

We usually rely on screening and targeted recruitment to ensure that outlier situations don’t occur. But if we’re confident that we’ve done that screening and recruitment well, shouldn’t we also be very wary of disregarding outliers? And if that’s the case — where all our players are appropriate to the game or app’s target audience — then perhaps we should take a fully qualitative view of usability testing, where every player or user’s experience is equally important as a piece of evidence, even when set against a problem that emerges in multiple or all members of the group. This could mean removing player counts from the prioritisation process altogether, or just being very, very reluctant to paint any user or player’s experience as ‘just’ an outlier.

Applied Issues and Prioritisation

However, there’s a counter-argument for keeping counts in as part of our research process, and — for me — this argument also helps to explain why usability testing is a little different from academic research.

Put simply, it’s the prioritisation aspect. Qualitative research is not interested in prioritisation: in general, qualitative studies are run with the intention of uncovering issues, but are not designed to look at wider generalisability of findings. That’s part of why qualitative studies are such incredible tools to run with minority groups and voices: they represent both minority and majority perspectives, and potentially in great depth. But if you want to examine generalisability, you would take the findings from a qualitative study and run a quantitative design of some kind.

Prioritisation is key in usability research because of the applied nature of the work: of the issues we identify during a piece of usability research, clients and developers benefit from some measure of which are the most pressing or important. Player counts give us another rationale for prioritisation, often alongside the severity of the issue, and/or whether the issue is avoidable in the game/app flow. There’s an element of satisficing here: it’s essential that usability testing achieves its goals with pretty minimal time and resources, because it has to be iterative. We don’t have the luxury of running multiple rounds of lengthy qualitative and quantitative testing for every round of research, before the next build is produced.
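To illustrate (purely hypothetically; this is a sketch, not Player Research’s actual process), a prioritisation heuristic might let counts nudge the ranking without ever letting them dominate severity or avoidability:

```python
# A purely hypothetical prioritisation heuristic, sketched to show how a
# player count might sit alongside other factors (severity, avoidability)
# rather than driving priority on its own.
from dataclasses import dataclass

@dataclass
class Issue:
    description: str
    severity: int      # 1 (cosmetic) to 3 (blocks progress)
    avoidable: bool    # can the player route around it in the game flow?
    player_count: int  # how many of the tested players hit it

def priority_score(issue: Issue, n_players: int) -> float:
    score = issue.severity * 2.0                    # severity dominates
    score += 0.0 if issue.avoidable else 1.5        # unavoidable issues rank higher
    score += issue.player_count / n_players         # counts nudge, but never dominate
    return score

issues = [
    Issue("can't find the save button", severity=3, avoidable=False, player_count=1),
    Issue("tooltip typo", severity=1, avoidable=True, player_count=5),
]
for issue in sorted(issues, key=lambda i: priority_score(i, n_players=5), reverse=True):
    print(f"{priority_score(issue, 5):.1f}  {issue.description}")
```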

Finally, there are times when we have to compromise on the quantitative work that we can carry out on games. Arguably, comprehensive, large-scale quantitative work is best carried out using analytics: but analytics require a functioning, released title with an active player base. Player counts are a viable alternative when analytics are not an option.

A Spanner in the Works

This article has covered a fair few issues. It’s gone from the benefits and drawbacks of reporting player numbers, to looking at qualitative vs quantitative issues, to looking at how the applied aspects of user research and games user research may make it a somewhat special case.

I just mentioned that player counts are just one of many tools we use to prioritise issues. To throw a final spanner in the works, I wonder whether, if we had better qualitative tools available, we would need to rely on this rather unusual quantitative measure in usability studies at all. To go back to an earlier statistic: if one player out of five could represent a real-world confidence interval of 0 to 65%, perhaps that is just too much uncertainty to work with. Could we develop better and more structured qualitative tools?

In our work with the Google Play team, we’ve been looking at other ways to frame issues in games user research specifically, and highlight those which we think have the potential to block progress or impede players’ learning about game systems. A single player experiencing a blocking issue — one that fosters a strong negative reaction to the game, perhaps leading to a quick uninstall — could be much more meaningful than five players showing an issue that leads to a minor annoyance.

I would argue that being able to identify issues in line with the psychology and abilities of players is potentially more powerful and reliable than relying on player counts. A structured framework for carrying out usability studies would be one way to achieve this; either as a replacement for player counts, or simply another tool to use as needed.

Take Home Summary

Counting the players who encountered issues in usability studies does not provide an adequate indication of how many players will encounter them in real-world settings
But player counts in these studies are attractive and straightforward, and perhaps misleading for that reason; sometimes they are used as the primary determinant of issue priority, which is not advised
Player counts can be factored into prioritisation, but always alongside other factors determining how severe and important the issue is
If you are confident in your player screening and recruitment process, be wary of treating any issues or players as outliers
The better you understand your target players — what blocks them from enjoying your game and what supports them — the less you should need to rely on player counts in small studies