During Clojure eXchange last year, Skills Matter premiered a new mobile app created by our team of developers. This created an easier way for attendees to leave feedback, and allowed us to collect and engage with this feedback on an unprecedented scale.
As shown in the screenshot below, this is a relatively simple feedback form with two required score fields: one rating how enjoyable the talk was, and one rating how much the respondent learned. There is also the option of leaving written feedback about what they liked or disliked about the talk, as well as any other suggestions they might have.
Some ten months after the app's launch, we've collected over 3,700 individual session reviews (a session is a specific talk or workshop an attendee might join), so it feels like a good time to dive into the data!
- 3,785 reviews
- Reviews from 442 unique attendees
- Reviews for 335 unique sessions
- 912 reviews with written feedback
- All figures current through 1st October 2018
As the resident data scientist, I wanted to see what interesting results we could find from the data and if this data gave us any indication of how attendees interact with the reviews app. I was particularly interested in the following:
- Do reviews tend to come from the same attendees?
- Do sessions get the same number of reviews?
- Do positive and negative reviews use the same words?
- Do attendees rate enjoyment and learning scores similarly?
Do reviews tend to come from the same attendees?
We’ve collected over 3,700 individual session reviews, but these reviews come from only 442 attendees, out of a total of 3,123 attendees for all conferences considered. In other words, only 14% of attendees leave a review. Among the attendees who do, there is some variation in how many they leave.
Most attendees left multiple reviews, although the single most common count was one. As shown by the histogram of total reviews per attendee above, the distribution has a significant positive skew with a gradual drop-off. The largest number of reviews left by any single attendee was 26, a feat managed by two attendees.
Do sessions get the same number of reviews?
How does this compare to the number of reviews by session? Again, we get a similar positive skew in the number of reviews by session, but with a more significant drop off.
Can we infer anything else about sessions with more reviews than most? Perhaps these sessions elicit more of a response than others, e.g. they are more favourably rated than average? Looking into this a bit more, there doesn’t appear to be any relationship between the average score a session receives and the number of reviews it receives.
So what might be a factor in some sessions getting more reviews than others? Looking at the top 10 sessions by the total number of reviews, all of which had over 30 reviews, nine of those were keynotes so we could anticipate a large attendance. Perhaps the number of reviews for a given session could be taken as a proxy for the total attendance for that session. This is all the more useful as we currently can’t track exact attendance per session.
Sessions with most reviews
- Keynote: The Maths Behind Types
- Keynote: Architectural patterns in Building Modular Domain Models
- Keynote: The Magic Behind Spark
- Keynote: The Survival Kit of the Web and PWAs
- Keynote: Own the Future
- Keynote: Serverless Functions and Vue.js
- Free Monad or Tagless Final? How Not to Commit to a Monad Too Early
- Keynote: Choose Your Animation Adventure
Do positive and negative reviews use the same words?
So far we’ve only considered numerical scores, but how about written feedback? Firstly, some context.
Of the 3,700+ reviews collected, only 1,079 (just over a quarter) have any written response, whether feedback or suggestions, and only 512 have both fields completed. Given these small numbers, it’s debatable how useful a session-by-session comparison of written responses would be. In aggregate, however, it could still be interesting to see what common features positive and negative written reviews share.
First things first, how do we tell whether a written response is negative or positive? This is a question of determining the sentiment of each review. To do this, I fed the feedback and suggestion text fields into the AWS Comprehend API, which returns a probability for each of four sentiment classes: positive, negative, neutral, and mixed. The sentiment of a given response is then the class with the highest probability. This approach produces the occasional edge case, i.e. Comprehend may judge a response’s sentiment differently from how I would, but in general it is quite robust.
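As a rough sketch of this step, the helper below picks the winning class from a Comprehend-style score dictionary. The `boto3` call shown in the comment is how Comprehend is typically invoked; the example scores are made up for illustration.

```python
def dominant_sentiment(scores):
    """Pick the sentiment class with the highest probability."""
    return max(scores, key=scores.get)

# In production, the scores dict would come from AWS Comprehend, e.g.:
#   import boto3
#   comprehend = boto3.client("comprehend")
#   resp = comprehend.detect_sentiment(Text=review_text, LanguageCode="en")
#   scores = resp["SentimentScore"]
# Illustrative (made-up) probability vector for one review:
scores = {"Positive": 0.91, "Negative": 0.02, "Neutral": 0.05, "Mixed": 0.02}
print(dominant_sentiment(scores))  # → Positive
```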
It’s interesting to consider the reviews split into these categories as it already gives some idea of how attendees interact with the feedback app. If we look at the number of written responses by sentiment, we already see some important patterns in behaviour:
- Written feedback responses are predominantly used to elaborate on what the attendee liked about the session.
- Written suggestion responses are well divided between three of the four sentiment classes that Comprehend uses. Reviewers use this field to detail what they want to see improved.
Overall, I’d take this as some indication that reviewers tend to provide as much constructive and appropriate feedback as possible, and are generally well-meaning with their comments. In the case of the ‘feedback’ field, the overwhelmingly positive feedback may largely be due to the responses being primed by the text immediately above the text field: ‘One thing you liked about this talk’.
Taking this a bit further, I wanted to see what kind of language was used in responses of either positive or negative sentiment. Firstly, I identified keywords in written reviews. To do this I used the keywords module of the Gensim library, which does a good job of ranking words in a source text by how important they are to the meaning of the text (not just how frequently they occur).
There is a very clear split between the kinds of keywords found in reviews with positive sentiment compared to reviews with negative sentiment. The top-rated keywords for positive reviews were mostly adjectives such as ‘great’, ‘good’, ‘nice’, ‘interested’, and ‘useful’, whereas the top keywords for negative reviews were all nouns such as ‘talk’, ‘time’, ‘code’, and ‘examples’. A possible reason is that negative reviews tend to be more direct and less elaborate; as we saw above, the feedback field is predominantly used for comments with a positive sentiment.
Positive Feedback Top Keywords
Negative Feedback Top Keywords
The nouns and noun phrases used in written reviews tended to be the same regardless of sentiment: words such as ‘talk’, ‘presenter’, ‘speaker’, and ‘time’, i.e. general words related to presentations.
A different way of looking at the written feedback is to present the responses as a word cloud. This is not directly comparable with the keywords discussed above, as it only considers overall word frequency, not keyword importance as ranked by the Gensim keywords module. Nonetheless, it’s a good overview of the kind of language we can expect from different reviews.
Word Clouds by Sentiment
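The frequency counts behind a word cloud can be sketched with the standard library alone. The stopword list and review snippets below are illustrative placeholders, not the real review data or the exact preprocessing used.

```python
from collections import Counter
import re

# Minimal illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "was", "to", "of", "it", "i"}

def word_frequencies(reviews):
    """Count word occurrences across reviews, as fed to a word cloud generator."""
    words = []
    for text in reviews:
        words += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(words)

# Hypothetical review snippets, not real data:
reviews = ["Great talk, the examples were great", "The talk ran over time"]
print(word_frequencies(reviews).most_common(3))
```

The resulting counter maps straight onto word-cloud sizing: the more frequent the word, the larger it is drawn.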
Do attendees rate enjoyment and learning scores similarly?
Coming into this question, I assumed there would be a strong positive correlation between enjoyment and learning scores. After all, I think it’s safe to say people come to our conferences with a mind to learn things, and that is a big part of their enjoyment. Attendees rate sessions on both enjoyment of the talk and how much they learned using a series of emojis. For the purpose of the discussion to follow, the emoji correspond to a numerical score as indicated in the key below, where the higher the score the better.
To illustrate the relationship between the two scores, the following heatmap shows the co-occurrence of learning and enjoyment scores. This follows the key above, where ‘1’ corresponds to the worst score a session can receive and ‘4’ to the best. The number in each square is the count of reviews with that score combination, e.g. 8 reviews gave a session a learning score of 1 and an enjoyment score of 4.
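A co-occurrence table like the one behind this heatmap can be built with a pandas crosstab. The score pairs below are made up for illustration; the real data has 3,700+ rows.

```python
import pandas as pd

# Hypothetical (learning, enjoyment) score pairs, each 1-4 as in the emoji key:
df = pd.DataFrame({"learning": [4, 3, 4, 2, 1],
                   "enjoyment": [4, 3, 4, 3, 2]})

# Co-occurrence counts: rows = learning score, columns = enjoyment score.
counts = pd.crosstab(df["learning"], df["enjoyment"])
print(counts)
```

Plotting this table with a heatmap function (e.g. seaborn's) reproduces the chart above.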
The immediate takeaways from this chart are that sessions are often favourably rated for both learning and enjoyment, and there is some indication of correlation between the two scores, as the highest numbers are along the middle diagonal.
To put more numbers to this last point, I calculated the Spearman rank-order correlation coefficient between the two scores using the Scipy library. This returned a correlation coefficient of 0.69 with a p-value of approximately 0, indicating a statistically significant, moderate positive correlation between the two scores. In other words, the higher the learning score, the higher the enjoyment score, and vice versa, which meshes well with my initial assumption. It is also interesting to note that sessions have generally received slightly higher enjoyment scores than learning scores, whatever the reason might be.
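The Spearman calculation itself is a one-liner with `scipy.stats.spearmanr`, which returns the coefficient and its p-value. The toy score pairs below stand in for the real review data.

```python
from scipy.stats import spearmanr

# Hypothetical (learning, enjoyment) score pairs, each 1-4 as in the emoji key:
learning = [4, 3, 4, 2, 1, 3, 4, 2, 3, 4]
enjoyment = [4, 3, 4, 3, 2, 3, 4, 2, 4, 4]

# Spearman correlates the *ranks* of the two variables, which suits
# ordinal data like these four-point emoji scores.
rho, p_value = spearmanr(learning, enjoyment)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```

Spearman is a sensible choice here over Pearson because the scores are ordinal (ranked emoji), not measurements on a continuous scale.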
Overall, I’m really pleased with the results we’ve obtained so far through the reviews app. The figures seem to suggest that we’ve done a good job of capturing how attendees feel about sessions and already allow us to track some important patterns in attendee behaviour. This will only get more important, as we come to increasingly rely on data-driven approaches to improve the attendee experience.
It is now critical to learn from this data as we iterate on this initial feedback app. For instance, we’ve seen that a big challenge for us is the relatively small proportion of attendees leaving feedback, which probably means we’re missing out on some important feedback.
Looking forward, I’m most excited to see how we can begin to compare the same conference series across multiple years and see if we are making the changes our attendees want to see. Very soon we will be able to compare between Scala eXchange and Clojure eXchange.