Analyzing Sketches Around the World with sketch-rnn
Quick, draw a sandwich! What does it look like? If you were from South Korea, you might have added pickled cucumber, and if you were from Sweden you might have drawn a smörgås, i.e. an open-faced sandwich. The way we draw reflects our environment and culture. I explored using a neural representation to analyze thousands of sketches from across the world, and found that it revealed fascinating differences between countries. I further investigated predicting the country from the drawing, finding which countries have the most distinctive drawings and which drawing categories show the most geographical variation.
Background
The Data
Ha and Eck constructed the QuickDraw dataset from Quick, Draw!, a game where players draw a prompt in less than 20 seconds, stopping as soon as Google’s AI guesses the prompt. QuickDraw consists of over 50 million sketches, each complete with an object class label (dog, tree, piano etc.), the country of origin, and even the exact sequence and direction in which the lines were drawn.
Since the data were released to the public for free, many interesting studies have been done on it, on topics ranging from how long it takes to draw dogs versus cats to automatically identifying graffiti.
The Problem: How to Compare Thousands of Drawings?
Analyses of differences between countries in the QuickDraw dataset have been limited until now. How do you condense thousands of different sketches from each country into something that can be easily compared? Ha, T. and Sonnad calculated the distribution of people who draw simple shapes clockwise vs anticlockwise, but it’s much harder to summarize more complex aspects of these drawings, like structure or style. One method is overlaying drawings: Jana and Lovejoy found from overlays that in different countries people tend to draw chairs facing in different directions, while Martino, Strobelt et al in Forma Fluens found differences between countries in drawing watermelons and snowmen.
This works well for some simple object classes, but for most the result of overlaying more than a dozen sketches is an unintelligible blur (a phenomenon described in Forma Fluens as “divergence”). This is because there is too much variability in where people draw each line (for instance, the splayed legs of an insect, or the angle of the bend in an arm), even if they’re drawing the same overall thing.
How can large, “divergent” collections of drawings be summarized to capture their qualitative characteristics?
My Approach: Analyze Features instead of Pixels
I decided to use sketch-rnn to encode the sketches’ qualitative features into vectors. Analyzing these vectors could allow us to summarize the sketches without being “distracted” by the high variability in the low-level details. After manipulating the vectors we can even decode them back into sketches for easy interpretation, so to average the sketches we could average their vector encodings and then decode the result:
Intro to sketch-rnn
The variational auto-encoder sketch-rnn can encode a drawing (represented as a sequence of pen movements) into a sequence of floating point numbers called a latent vector. This vector captures only the sketch’s qualitative characteristics, so decoding it gives a new sketch that has a similar structure (e.g. presence of legs or whiskers) but isn’t identical to the original (e.g. has a slightly longer tail). The decoder generates a short series of pen movements, ensuring a crisp image.
Adding and subtracting the vectors further suggests they can capture high level concepts in the drawings. In each of the below images, the black drawings were encoded into vectors and after arithmetic the resulting vector was decoded into the blue drawing (source):
This property was previously demonstrated for vector encodings of words, for instance in Mikolov et al’s observation that King - Man + Woman = Queen, which showed that the learned representations capture “meaningful syntactic and semantic regularities”. Thus, encoding all the drawings into vectors seemed a promising approach for finding shared features from each country.
Results
For each of 64 object classes (moustache, hat, penguin etc) I trained a fresh sketch-rnn model on 10,000 drawings (400 from each of the 25 countries with sufficient data) which I then used to encode all the drawings into latent vectors. The results reported here are all from analysis of these 640,000 vectors.
Generating Average Sketches
For my initial analysis I ignored the data about which country each drawing was from and just investigated using the encoded vectors to generate average drawings.
The “Universal” Drawing?
For each object class I averaged all 10,000 latent vectors, then used that class’s sketch-rnn model to decode the average vector back into a drawing. With an equal number of drawings from each country, a good average could reveal a universal depiction of an object free from cultural idiosyncrasies. I compared the result with the images in Forma Fluens produced by overlaying pixels (1000 drawings from each of 34 countries).
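The averaging step can be sketched in a few lines. This is only a minimal illustration, assuming the latent vectors for one object class are already stacked in a NumPy array of shape (n_drawings, latent_dim); the function name `average_latent` and the optional per-country grouping are my own choices for illustration, and no real sketch-rnn API is used here.

```python
import numpy as np

def average_latent(latents, countries=None):
    """Return the mean latent vector, or one mean per country.

    latents:   array-like of shape (n_drawings, latent_dim)
    countries: optional sequence of country labels, one per drawing
    """
    latents = np.asarray(latents, dtype=float)
    if countries is None:
        # Global average over every drawing in the class.
        return latents.mean(axis=0)
    countries = np.asarray(countries)
    # One average vector per distinct country label.
    return {
        c.item(): latents[countries == c].mean(axis=0)
        for c in np.unique(countries)
    }
```

The resulting vector would then be passed to the trained decoder for that object class to obtain the average drawing; that decoding call is model-specific and is not shown.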
For many object classes, the decoded average vector was a clean image that clearly and simply represented its class. When compared with convergent overlapped images, it appears that the vector method can preserve details that may be common to most drawings but get blurred out in the overlay due to differences such as the positioning of the octopus’ tentacles.
Another thing to note is that details that appear infrequently in the drawings tend not to show up at all in the sketch-rnn average, for instance the texture on the ice cream cone, the face on the octopus or the legs on the piano. It appears sketch-rnn averaging keeps the highest common denominator in the averaged drawings. One exception is the ice cream average, which appears to have two scoops. But we could interpret the smaller curve that appears slightly above and to the right as a generalized “topping” on the ice cream rather than specifically a scoop, in which case it is not uncommon. See the supplemental notes at the end for further investigation into the sketch-rnn average.
“Failure” Cases
However, for some object classes, sketch-rnn’s “universal average” did not make intuitive sense.
This seemed to happen when there are multiple very structurally different ways of drawing something which are about equally common. For instance, there was an even mixture of people drawing just the head of a bear versus the whole body, or drawing a telephone as a handset, a rotary dial phone or a mobile phone.
Averaging between “head” and “whole body” lacks a meaningful solution, as David Ha demonstrated by interpolating between the vectors for drawings of the head and the whole body of a pig and decoding the intermediate vectors.
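Interpolating between two latent vectors can be sketched as below. This is a hedged illustration using plain linear interpolation, which is an assumption on my part; other schemes, such as spherical interpolation, are also common, and the original demonstration may have used one of those. The function name `interpolate_latents` is hypothetical.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps=5):
    """Linearly interpolate between two latent vectors, endpoints included.

    Returns an array of shape (steps, latent_dim); each row could be
    decoded separately to see how the drawing changes along the path.
    """
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    weights = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * z_start + t * z_end for t in weights])
```

The midpoint of this path is the same as the average of the two vectors, which is why a poor midpoint drawing hints that averaging “head” and “whole body” sketches will also give a poor result.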
This type of divergence is on another level from that found by Forma Fluens, where the drawings simply don’t align well in pixel overlays: here there is no consistent way of representing the subject of the object class. For the rest of this article I’ll call it representation divergence, and I’ll call the overlay divergence pixel divergence. See the supplemental notes for further analysis of these different types of divergence.
Country Analysis
Can You Guess the Country from the Drawing?
If the sketches vary significantly by country, and the latent vector can capture these differences, then it should be possible to guess the country of origin from the latent vector. I tried training simple classifiers to do this, training from scratch for each object class (details in the supplemental notes).
Calculating the accuracy (the F1 score on held-out test data) for each object class and taking the average, I found that for all countries it was greater than 20%, much higher than random chance (1 in 25 = 4%). This suggests that the drawings really did show differences between countries.
The final accuracy could be made much higher by examining drawings from multiple categories, so I developed a 20-questions style drawing game that guesses your country (described in this article).
Why are French and Swedish Drawings so Recognizable?
Why were drawings from France so recognizable, with an accuracy of 96%? Inspecting a few of their object classes shows that in most cases their drawings (top) are much simpler than the global average (bottom), with a tendency to draw much of the picture in a single stroke (see jacket, school bus and UFO).
As soon as the object class is correctly guessed by Google’s AI, the player cannot continue drawing — it is possible participants in France would have added more details given the chance, but they were really good at capturing the essence of their prompt with just the first few strokes.
Meanwhile the country with the second most recognizable (90% accuracy) drawings, Sweden, appears to be distinctive for the opposite reason:
I did not have time to analyze the stroke order but it is possible that participants in Sweden drew pictures in an unusual order, adding details before overall outlines, such that by the time their drawings were recognizable they were already quite complicated.
Which Countries’ Drawings are Most Similar?
To get an idea of similarity between drawings in different countries, we can use how much the classifier confused them with each other.
I looked at how often A was predicted as B as a fraction of how often A was correctly predicted as itself (A->B/A->A) and then averaged it with the reverse (B->A/B->B) to get a form of “mutual confusion”. Hierarchical clustering based on mutual confusion shows whole groups of countries which are often confused (see the supplemental notes for the full confusion matrix).
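The mutual confusion measure can be written down directly from its definition. The sketch below assumes a square matrix of counts in which entry (A, B) is the number of drawings from country A predicted as country B, and that every country was predicted correctly at least once (otherwise the division is undefined). The helper names are hypothetical, and the hierarchical clustering step is not shown.

```python
import numpy as np

def mutual_confusion(conf):
    """Average of (A->B / A->A) and (B->A / B->B) for every pair of countries."""
    conf = np.asarray(conf, dtype=float)
    correct = np.diag(conf)              # A->A counts
    ratio = conf / correct[:, None]      # entry (A, B) is A->B / A->A
    mutual = (ratio + ratio.T) / 2.0     # average with B->A / B->B
    np.fill_diagonal(mutual, 0.0)        # ignore a country paired with itself
    return mutual

def most_confused_pair(mutual, names):
    """Return the names of the pair with the highest mutual confusion."""
    i, j = np.unravel_index(np.argmax(mutual), mutual.shape)
    return names[i], names[j]
```

Because the two ratios are averaged, the result is symmetric by construction, which makes it convenient as a similarity score between countries.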
The three most frequently confused pairs of countries were Poland/Romania, Finland/Russia and Philippines/Thailand. Along with obvious geographical ties, parts of both Poland and Romania once belonged to the Austro-Hungarian Empire, and Finland was part of the Russian Empire for a century until WW1. It could be speculated that similar drawing styles reflect shared culture: the way drawing is taught in schools, or even more generally a way of seeing the world, such as which features of an object are most important or what first comes to mind when multiple versions or interpretations of the drawing prompt are possible. But beyond the top three there are pairings and groups of countries with no obvious connections. Interpretations are left to the reader!
Object Class Analysis
Which Types of Drawings Reveal the Most?
I calculated the accuracy averaged over all countries for each object class, and found that all the classes which could on average reveal the most about the country of origin were complex man-made objects: trumpet (49%), piano (45%), telephone (42%), train (42%).
This could be because drawings of simple objects contain too few lines to predict with. Natural things also tend to look the same anywhere in the world, whereas man-made objects can have different designs which may be more or less popular in different countries. Jana and Lovejoy also observed that doodles of naturally occurring objects tended to look more alike across cultures.
An alternative explanation is that Google’s Quick, Draw! neural network was better at recognizing natural objects, so it guesses the class and saves the drawing before the player can add more details that may give clues about their country.
Why Are Trumpets so Revealing?
Looking at raw trumpet sketches, a wide variety of shapes can be seen, with differences in the diameter and roundness of the bell, thickness of the pipe, presence of a mouthpiece or tuning slide, number of finger buttons etc. In fact, there are many different types of trumpet that all look somewhat different, which probably contributed to the variety present in the drawings.
The country averages show a similar amount of diversity, suggesting that different types of trumpet are more popular in different countries. Despite the many variations between countries, the global average is simple, showing what the vast majority of the countries’ drawings have in common: a long tube with buttons on top.
Why are Fans the Most Generic?
The object class which the classifier struggled with the most was the fan, with an average accuracy of only 27%. Fans are complex man-made objects, so why are they so unhelpful for predicting the country? Inspecting the raw sketches, there seems to be plenty of variation in how people draw fans.
However, the country averages all show the same set of simple components: blades around concentric circles, and maybe a stand. This suggests that although there are many ways of drawing a fan, there is not much difference in the popularity of each method between different countries.
Just for Fun
Moustaches: Thick vs Curly?
It appeared that moustaches could either be two-dimensional or curly, but not both. Maybe non-curly moustaches need that extra dimension to differentiate themselves from mountains or waves?
A Smörgåsbord of Sandwiches
Hungary had the flattest sandwiches, while the USA had the thickest, most layered sandwich, and Russia had the most triangular sandwich. The countries with the most interestingly shaped sandwiches were South Korea and Sweden. This could be explained by the popularity of the smörgås, i.e. open sandwich, in Sweden, while sandwiches in South Korea commonly contain pickled cucumber.
Miscellaneous
Discussion
Four hundred randomly chosen sketches per object class per country may not be enough to draw any definite conclusions. However, I think this project demonstrates an interesting new way to visualize and analyze large collections of drawings.
Potential future work could be to look deeper into the latent vectors. Is it possible to identify all the modes of drawing an object class by clustering the vectors (analogous to Doodle Maps clustering the pixel images)? What does each latent space dimension represent for each object class? How many dimensions represent single “intuitive” features and are there commonalities between the types of features represented across different object classes?
Acknowledgements
Thank you to the IBM Visual AI Lab for hosting and mentoring me for this project, in particular Hendrik Strobelt, Daniel Weidele, Evan Phibbs and Mauro Martino.
Supplemental Notes
Further Analysis of Representation and Pixel Divergence
I found that sketch-rnn couldn’t produce a meaningful average for some object classes, and called these representation divergent classes. Meanwhile I called the object classes where pixel overlays failed to give a clear image pixel divergent.
I noticed that the bear class was representation divergent, which may be because just a bear face is drawn about as often as the entire body. Animals for which generally the whole body must be drawn in order to be recognizable don’t seem to have this problem, for instance the tiger which needs stripes on its back, the squirrel which needs a bushy tail, and the octopus which needs tentacles.
Interestingly, object classes that are representation-divergent did not always appear to be pixel divergent. For the representation-divergent telephone class there was a mix of touchscreen smartphones, cellphones with keypads, wireless or corded handsets, and even the old-fashioned rotary dial variants. However the pixel overlay appears to be simply a touch-screen smartphone, which could be misleading. This may be because the touchscreen smartphone is the only convergent representation (as it is a simple rectangle shape always drawn facing the viewer) and the other representations are blurred out by averaging.
Looking at just telephones from India, the results differ even more. From a sample of sketches, the smartphone is much less common, but there is still an even mix of other representations. In this case the pixel overlay looks like the old-fashioned docked handset but the sketch-rnn average is a free-floating handset.
The overlay of telephones from India shows that the docked handset representation is also convergent. I speculate that it was not visible in the global overlay because it was less convergent than the touchscreen smartphone, which was also common in the global collection of drawings. More generally, from these observations I would speculate that when multiple common depictions coexist:
- Overlays show the most pixel-convergent depiction. This would explain why when all 3 representations are present just the most convergent touch-screen is visible, and the docked handset depiction only emerges in the absence of the touch-screen (e.g. drawings from India). The docked representation could be less convergent than the touch-screen because there is little consensus as to what is on its front face (rotary dial or keypad) and the surrounding shape (round, square, trapezoid). But it is generally drawn facing the viewer, so it would be more pixel-convergent than the free handset which could be floating at any angle (similar to the pixel-divergent arm). If there is no pixel-convergent common depiction at all (i.e. in a pixel-divergent class) the image is a blur.
- Sketch-rnn generally keeps the highest common denominator (plus or minus a few minor details). The traditional docked telephone and free handset depictions both contain a handset, so the sketch-rnn average of the Indian telephones is the handset. But when the touch screen is added into the mix for the global average, the highest common denominator is just circles and long curved lines. Similarly, although people drew piano bodies differently, all the drawings included a keyboard, so the sketch-rnn average of a piano was the keyboard part. Note that in the case of head and whole body averaging, the head isn’t preserved by averaging because when the whole animal is drawn the head is often represented in a different, much simpler way (look at the bear drawings, for example).
Country Classifier Details
For each object class, I used 320 drawings per country for training and 80 for testing. The classifiers used were multinomial logistic regression and Gaussian Naive Bayes, using scikit-learn’s implementations with default hyper-parameter values. For each class, the higher of the two classifiers’ test scores was reported. The accuracy reported is the F1 score, the harmonic mean of precision and recall, which penalizes both false positives and false negatives.
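A minimal sketch of this setup with scikit-learn is shown below. It assumes the latent vectors and country labels for one object class have already been split into training and test arrays. The function name `best_country_f1` is hypothetical, and `average="macro"` is an assumption, since the way per-country F1 scores were combined is not stated above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB

def best_country_f1(X_train, y_train, X_test, y_test):
    """Fit both classifiers with default settings; return the better test F1."""
    candidates = [("logistic_regression", LogisticRegression()),
                  ("gaussian_nb", GaussianNB())]
    scores = {}
    for name, model in candidates:
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        # Assumption: unweighted mean of the per-country F1 scores.
        scores[name] = f1_score(y_test, predictions, average="macro")
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]
```

Calling this once per object class, with 320 training and 80 test vectors per country, would give one reported score per class.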
Country Prediction Confusion Matrix
I plotted how frequently drawings from country A were predicted to be from country B (denoted A->B) at (row A, column B). Observe how there are only a few bright cells in each row; this shows that each country was usually confused with only a few other countries. Also, note that the plot is roughly symmetric about the diagonal, which shows that A->B and B->A occurred with about the same frequency.
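A matrix of this kind can be built from the true and predicted labels as sketched below. Normalizing each row so that it sums to 1 is an assumption made for illustration, as is the `asymmetry` helper, which gives one rough way to quantify how far the matrix is from being symmetric.

```python
import numpy as np

def confusion_matrix_rows(true_labels, predicted_labels, names):
    """Row A, column B holds the fraction of A's drawings predicted as B."""
    index = {name: i for i, name in enumerate(names)}
    counts = np.zeros((len(names), len(names)))
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t], index[p]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no drawings stay all-zero instead of dividing by zero.
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

def asymmetry(matrix):
    """Largest absolute difference between A->B and B->A over all pairs."""
    matrix = np.asarray(matrix, dtype=float)
    return float(np.abs(matrix - matrix.T).max())
```

A value of `asymmetry` close to zero corresponds to a plot that looks mirrored about its diagonal.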