A/B Testing Visualizations

Karl Sluis
Making Next Big Sound
5 min read · Mar 2, 2015

or The Best Five Dollars I Ever Spent

Here at Next Big Sound, we’ve been experimenting with sharing short, data-driven stories we like to call “Byte-Sized Content.”

A few weeks ago, I found myself in the middle of an impromptu, six-way critique of the above visualization, right before we posted it on Twitter. Opinions, hunches, and assumptions flew through the air. It was chaos, really, until a pause provided me a chance to wonder, out loud,

“If only there were a way to test this… ”

… the “this” being the fitness of two alternate visualizations of Facebook Page Likes for Missy Elliott and Katy Perry before and after Super Bowl XLIX. I wanted to show the relatively larger and sustained social boost Missy Elliott received from her appearance during the halftime show. I’d heard of usability testing using Amazon’s Mechanical Turk service, so I figured I’d give it a shot at A/B testing two alternate visualizations:

Ten users, five questions, and five dollars later, here’s what I discovered:

Testing revealed a clear favorite.

The results of the test couldn’t have been clearer. 70% of the testers read the intended message in Graph B, whereas none read the intended message in Graph A. I’m floored that I received such a strong signal from testers subjectively evaluating the message of graphs.

Testers demonstrated more comfort with transformed data than I expected.

Call me a cynic. I thought that percentages — and small ones at that — would confuse most of the testers. I expected that most testers would believe that “Both musicians saw very small performance increases” was the intended message of Graph B. Instead, when I asked users to express the message of the graph in their own words, I received sharp, insightful, short-form written jewels like:

In terms of percentage increase of total Facebook Likes, Missy Elliott saw a much larger increase than Katy Perry after the Super Bowl. Because of Katy Perry’s already extremely large following, it is likely more difficult for her to receive new Likes, but Missy Elliott reaped the benefits of the wide exposure provided by the Super Bowl. — Anonymous user

Mind, blown.
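
The transform itself is a single line of arithmetic. Here’s a minimal sketch in Python, using placeholder Like counts rather than the real Next Big Sound figures, that shows why raw counts and percent change tell such different stories:

```python
# Sketch of the percent-change transform behind Graph B.
# The Like counts below are hypothetical placeholders, not real data.
def percent_increase(before: int, after: int) -> float:
    """Percentage change in Page Likes across the Super Bowl."""
    return (after - before) / before * 100

likes = {
    "Missy Elliott": (1_000_000, 1_100_000),  # small base, big relative gain
    "Katy Perry": (70_000_000, 70_350_000),   # huge base, small relative gain
}

for artist, (before, after) in likes.items():
    print(f"{artist}: {percent_increase(before, after):+.1f}%")
```

On these made-up numbers, Katy Perry gains more Likes in absolute terms, yet Missy Elliott’s percent increase is twenty times larger, which is exactly the story Graph B was built to tell.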

The “dumber” graph was just confusing.

Graph A was actually my first attempt at visualizing the data. I know, mixing different y-axis scales is nearly a capital crime in the visualization world. It shouldn’t be — we’ve had success with blended y-axes with clearly labeled maximums for simple visualizations. Illustrating separate y-axes to achieve banking to 45° would be chart-junk overkill for a small, Tweet-friendly graph, wouldn’t it?

Testing said, emphatically, “No.” Four testers didn’t feel that any of the five multiple-choice statements fit Graph A, whereas the other six selected messages that deemed Katy Perry the victor — an entirely justifiable interpretation, but not the intended message.

If we understand communication as the process of transferring information from one human mind to another, then we can absolutely evaluate the effectiveness of communication by comparing the fidelity of the understood meaning to the intended meaning.

My experiment validated that A/B testing visualizations with Mechanical Turk is a viable method for this evaluation, and I certainly plan on creating, and writing about, more tests in the future.

Appendix

Methodology

  • I A/B tested two alternate visualizations of the same data to determine which data transforms and presentations were more effective at communicating an intended message. Put another way, “Did the tester receive the message in my mind?”
  • Three questions were multiple choice. The first simply asked “Which of these two graphs appears easier to read and understand?” and I received very little signal from it. The other two asked the tester to select, from six statements, the one that best described the graph, including “None of the above.” I wrote statements that captured the intended message (“Missy Elliott performed better than Katy Perry”) and the messages I feared the graphs might convey, whether fair or simply incorrect.

Multiple choice was easily the most successful and useful component of the tests.

  • Two questions were open-ended. I explicitly asked testers to describe the message of both graphs, in their own words, in 2–4 sentences. Although these were useful for validating the intelligence of the mTurk testers (again, I was impressed!), I see little benefit in using them again. In future tests, I do plan to ask explicitly whether anything is confusing.
  • I solicited ten testers, and I got a clear signal from them (see the quick significance check after this list). Although for usability testing you only need five users, I’m curious to see what impact 25 Turk testers would have.
  • I paid testers 50 cents each. To mTurk’s credit, they’re incredibly explicit about the effective hourly rate you’re paying. 50 cents seemed to be a fair rate — my tests were finished within an hour (!).
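
If you’re wondering whether ten testers is enough to trust a 7-to-0 split, a quick Fisher’s exact test on the counts reported above makes a decent sanity check. This wasn’t part of the original methodology, just a back-of-the-envelope confirmation, sketched with SciPy:

```python
# Sanity check: 7 of 10 testers read the intended message in Graph B,
# 0 of 10 in Graph A. Fisher's exact test handles small samples well.
# Caveat: the same ten people rated both graphs, so treat this as a
# rough check rather than a rigorous test of independent samples.
from scipy.stats import fisher_exact

table = [
    [7, 3],   # Graph B: intended message read / not read
    [0, 10],  # Graph A: intended message read / not read
]
_, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.4f}")  # roughly 0.003, far below the usual 0.05 cutoff
```

Even as a rough check, it suggests a split this lopsided is very unlikely to be noise, so ten testers really did suffice.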

Why didn’t I use my product for testing?

  • First of all, this was an experiment. Integrating an A/B test into our product — or, even more difficult, our Tweets — would have represented a significant technical barrier. MVP, FTW.
  • Any conversion or action that I could imagine tracking would only be a secondary proxy for what I actually wanted to measure, the visualization’s effectiveness.
  • I can’t imagine a better way to get clear, targeted, direct feedback on the effectiveness of the visualizations than directly asking testers how they interpret graphs. People may be infamously bad at reporting their own behavior, but I believe they’re well-suited to report how they understand a graph, in the moment.

Using Mechanical Turk

  • Unfortunately, Mechanical Turk is not for the faint of heart. I had to jump into the HTML editor to write the custom, image-focused questionnaire that I wanted to share with testers. Mechanical Turk is a UX hot mess, no doubt about it.
  • Case in point: I screwed it up! I actually ran the test twice. For the first go-around, while copy-pasting questions, I forgot to update an HTML attribute, which ruined the test results — I only received answers for the last question.
  • Finally, the test results were a little difficult to tabulate. mTurk delivers a CSV of questions and responses, which is fine, but I’d really prefer a test solution that offers structured and visualized results.
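
Until a friendlier tool comes along, a few lines of Python will do the tallying. The column name below follows mTurk’s “Answer.<field>” convention for batch results, but the exact field name (“q2” here) and the filename are hypothetical; they depend on how the questionnaire’s inputs and the batch were named:

```python
# Tally responses to one multiple-choice question from mTurk's results CSV.
# "Batch_results.csv" and the "Answer.q2" column name are illustrative.
import csv
from collections import Counter

tally = Counter()
with open("Batch_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        tally[row["Answer.q2"]] += 1

for answer, count in tally.most_common():
    print(f"{count:2d}  {answer}")
```

One Counter per question is hardly the dashboard I’d prefer, but it’s enough to see a 7-to-0 split fall right out of the raw file.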
