Playing 20 Questions with Sketches

Alyssa Li Dayan
5 min read · Dec 5, 2019

Can you guess someone’s country of origin from how they draw? I investigated predicting from single drawings, boosting accuracy by combining evidence from multiple drawings, and finally developed a 20 Questions-esque game that at each step asks the player for the most informative drawing. Here’s a video of me playing a very rough prototype of the game (I am indeed from Britain!).

The rest of the article describes how I made this.

Guessing Your Country from One Sketch

I took drawings from 64 different object categories (e.g. owl, train, nose) and 25 different countries. For each category, a separate classifier was trained using 320 drawings per country. (See my other article for more details about the data and models used.) Given a drawing, the classifier returns a probability distribution over countries, and the prediction is the country with the highest probability.

Results

Surprisingly, the average accuracy is much higher than chance (which would be 4%) for all countries. I plotted a prediction confusion matrix showing how frequently drawings from country i were predicted to be from country j at (row i, column j). Observe how there are only a few bright cells in each row: each country was usually confused with only a few other countries. Also, note that the plot is roughly symmetric about the diagonal: i is mistaken for j about as often as j is mistaken for i.
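To make the (row i, column j) convention concrete, here is a minimal sketch of building such a matrix. It is not the code behind the plots: `confusion_matrix` is a hypothetical helper and the labels are invented, with only 3 "countries" for brevity.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Row-normalised confusion matrix: entry [i][j] is the fraction of
    drawings from country i that were predicted to be from country j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row if a country has no test drawings.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Invented labels for 3 countries (indices 0, 1, 2), 4 drawings each.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 1, 0, 1, 0, 1, 1, 2, 2, 2, 1]
cm = confusion_matrix(true_labels, predicted, num_classes=3)

# With 25 equally represented countries, random guessing is right 1/25 of the time.
chance = 1 / 25
```

In this toy matrix, countries 0 and 1 are confused with each other at the same rate (`cm[0][1] == cm[1][0]`), which is the kind of symmetry described above.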

Left: Accuracy (F1 score) plotted by country, averaged over all drawing types. Right: prediction confusion matrix.

Guessing From 64 Drawings

To improve the accuracy, you could ask for a drawing from all 64 categories and combine the classifier outputs. The naive way would be to directly multiply the probabilities together:
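As a minimal sketch of that naive combination: the combined score for country c is proportional to the product over drawings k of P_k(c | drawing k). This treats the drawings as independent given the country and assumes a uniform prior over countries. The function name `combine_naive`, the probability floor, and the example numbers are my own illustrative choices.

```python
import math


def combine_naive(distributions):
    """Multiply per-drawing probability distributions elementwise and renormalise.
    Works in log space so that a product of 64 small probabilities does not underflow."""
    num_countries = len(distributions[0])
    log_scores = [0.0] * num_countries
    for dist in distributions:
        for c, p in enumerate(dist):
            # A small floor avoids log(0); 1e-12 is an arbitrary choice.
            log_scores[c] += math.log(max(p, 1e-12))
    # Subtract the max before exponentiating for numerical stability.
    top = max(log_scores)
    unnormalised = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Three invented classifier outputs over 3 countries.
outputs = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
]
combined = combine_naive(outputs)
guess = max(range(len(combined)), key=combined.__getitem__)
```

Each individual output only mildly favours country 0, but the combined distribution is noticeably more confident, which is the effect that makes extra drawings useful.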

This improves the accuracy¹ compared to guessing from just a single drawing:

Average accuracy using one drawing vs 64 drawings (one from each category).

Up-weighting More Informative Drawing Types

However, some drawing categories are much more informative than others: the average test accuracy when guessing from fan drawings was only 27%, compared with 49% for trumpet drawings:

This suggests we could do better by up-weighting the classifiers of more informative drawing categories. To inspect further, I plotted a probability confusion matrix, where the entry at (row i, column j) is the average probability assigned to country j when the drawing was from country i. The probability mass in the confusion matrix for the trumpet classifier is much more focused on the diagonal, showing it usually assigns higher probability to the correct country.

Center: accuracy by drawing category averaged over all countries, plotting just the highest and lowest accuracy countries. Left and right: confusion matrices for the fan and trumpet classifiers.
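A probability confusion matrix of this kind can be computed along the following lines. This is a sketch with invented numbers; `probability_confusion_matrix` and `diagonal_mass` are hypothetical helper names, and `diagonal_mass` is just one simple way to summarise how focused the matrix is on the diagonal.

```python
def probability_confusion_matrix(true_labels, predicted_dists, num_classes):
    """Entry [i][j] is the average probability the classifier assigned to
    country j over all test drawings that really came from country i."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for t, dist in zip(true_labels, predicted_dists):
        counts[t] += 1
        for j, p in enumerate(dist):
            sums[t][j] += p
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_classes)
    ]


def diagonal_mass(matrix):
    """Average probability assigned to the correct country (mean of the diagonal)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Invented outputs for 2 countries, 2 test drawings each.
true_labels = [0, 0, 1, 1]
dists = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
pcm = probability_confusion_matrix(true_labels, dists, num_classes=2)
focus = diagonal_mass(pcm)
```

Unlike the prediction confusion matrix, this version keeps the classifier's confidence: a classifier that is right but only barely so gets a weaker diagonal.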

However, the fan drawing classifier is just as good (or even slightly better) for France and Sweden, so we don’t want to discount its output probabilities for all countries. I formulated a way to individually adjust the classifier outputs for each country using the confusion matrices:
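One plausible way to do a per-country adjustment of this kind is sketched below. It is an assumption for illustration, not necessarily the formula used for the results in this article. It treats the classifier's hard prediction as the observation and uses the prediction confusion matrix as a likelihood, so `confusion[i][j]` is read as an estimate of P(classifier predicts j | true country i). The name `adjusted_posterior` and all numbers are hypothetical.

```python
def adjusted_posterior(prior, confusion, predicted):
    """Bayes update that treats the classifier's hard prediction as the observation.
    confusion[i][j] is assumed to estimate P(classifier predicts j | true country i)."""
    unnormalised = [prior[i] * confusion[i][predicted] for i in range(len(prior))]
    total = sum(unnormalised)
    if total == 0:
        # The prediction was never observed for any plausible country: keep the prior.
        return list(prior)
    return [u / total for u in unnormalised]


# Invented 2-country classifier: reliable for country 0, a coin flip for country 1.
confusion = [
    [0.9, 0.1],
    [0.5, 0.5],
]
prior = [0.5, 0.5]

after_predicting_0 = adjusted_posterior(prior, confusion, predicted=0)
after_predicting_1 = adjusted_posterior(prior, confusion, predicted=1)
```

Here a prediction of country 1 is strong evidence (country 0 drawings are rarely labelled 1), while a prediction of country 0 is weaker evidence, because country 1 drawings are labelled 0 half of the time. The same classifier is therefore trusted differently depending on the country in question.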

Using adjusted probability improved the mean accuracy from 79% to 86%!

Guessing from Just 10 Drawings

Asking someone to produce 64 drawings is a lot, but if we only use drawings from the 10 categories with the highest test accuracy, the final accuracy decreases to 75%.

Greedily Asking Sequentially

Since people can only draw one thing at a time, after they finish each drawing we could be smart about which category to ask for next. Maybe after the first drawing we have narrowed down the set of possible countries to Poland and Romania: now we just need a classifier that is good at distinguishing these two countries, and we don’t care about its performance on drawings from Brazil.

I used an idea from Bayesian Experimental Design: ask for the category (experiment) that maximizes the expected utility of the outcome. The utility is the increase in knowledge about the artist’s country, i.e. the decrease in entropy of the guessed probability distribution over countries. This is greedy because I only look one step into the future rather than optimizing for my knowledge after all 10 drawings.
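To make "decrease in entropy" concrete, here is a small sketch using Shannon entropy in bits. The two distributions are invented examples: a uniform guess over 25 countries, and a guess narrowed down to two equally likely countries.

```python
import math


def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


uniform_prior = [1 / 25] * 25         # no knowledge: all 25 countries equally likely
narrowed = [0.5, 0.5] + [0.0] * 23    # e.g. only Poland and Romania remain
gain = entropy(uniform_prior) - entropy(narrowed)
```

The uniform distribution has entropy log2(25), about 4.64 bits, and the narrowed one has exactly 1 bit, so drawings that took us from the first state to the second would have been worth about 3.64 bits.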

In the last step I chose the optimal category t* to minimize the entropy of the expected posterior, because it is very difficult to calculate the expected entropy of the posterior². The expected posterior after asking for drawing type t can be calculated as follows, where N is the normalizing factor required to make the probabilities sum to 1:

Now we can simply calculate this for every remaining type and ask for a drawing of the type with the lowest value. This led to a modest increase in accuracy from 75% to 78%.

Average accuracy when using the greedy sequential algorithm (solid blue) vs always asking for the top 10 highest test accuracy categories (stripes).
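The greedy step can be sketched as follows. Note that this sketch swaps in a simpler criterion than the one described above. It treats the classifier's hard prediction as the outcome, so the expectation runs over a handful of labels rather than over all possible drawings, and the expected posterior entropy can be computed directly. All names (`expected_posterior_entropy`, `choose_next_category`, the category keys) and numbers are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_posterior_entropy(prior, confusion):
    """Expected entropy of the posterior after seeing the classifier's hard prediction.
    confusion[i][j] is assumed to estimate P(classifier predicts j | true country i)."""
    n = len(prior)
    expected = 0.0
    for j in range(n):
        joint = [prior[i] * confusion[i][j] for i in range(n)]
        p_j = sum(joint)  # probability that the classifier predicts j
        if p_j > 0:
            posterior = [x / p_j for x in joint]
            expected += p_j * entropy(posterior)
    return expected


def choose_next_category(prior, confusion_by_category, remaining):
    """Greedy one-step lookahead: pick the category with the lowest expected entropy."""
    return min(
        remaining,
        key=lambda t: expected_posterior_entropy(prior, confusion_by_category[t]),
    )


# Invented example with 3 countries; the current guess has ruled out country 2.
prior = [0.5, 0.5, 0.0]
confusions = {
    # Perfect on country 2, but cannot separate countries 0 and 1.
    "owl": [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    # Poor on country 2, but separates countries 0 and 1 well.
    "train": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.4, 0.4, 0.2]],
}
next_category = choose_next_category(prior, confusions, ["owl", "train"])
```

In this toy setup the "train" classifier is chosen even though it is poor on country 2, because country 2 has already been ruled out. This mirrors the Poland and Romania example: only performance on the countries still in play matters.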

Finally, here is a rough design for a more polished version that I would make if I had more time!

[1]: I didn’t have labels of which drawings were done by the same person, so to test my idea I bundled together drawings from the same country to create fake people. This might artificially inflate the accuracy, since someone who draws strangely will be hard to classify, so multiple drawings from them will still be hard to classify. But when we bundle multiple drawings from random different people it’s very unlikely they will all be drawn strangely, or the different ways in which they are strange will cancel out.

[2]: This is because it is hard to integrate over all possible drawings: this is an infinite space and we don’t know what the probability of a given drawing is. Because entropy is a concave function, Jensen’s inequality tells us that E[H(posterior)] ≤ H(E[posterior]). By minimizing H(E[posterior]) instead, we are minimizing an upper bound on E[H(posterior)].
