Visualizing the XKCD comics network using Google Vision, spaCy and d3
I love XKCD. According to its website, the webcomic is about romance, sarcasm, math, and language, but after so many years, Randall Munroe has explored many other topics as well. Some of them more than once.
I wanted to know the structure of this fantastic stick-figure world he created, so in my spare time I scraped all his webcomics from the interblag (or blagosphere), then extracted all the relevant words from each, and finally plotted the result below. In the following graph, each node represents a comic, and two comics share an edge if they contain words in common. The graph is interactive: you can drag it around the screen (which is always a fun thing to do), click on an edge to see the words the two connected comics have in common (click again to hide them), and click on a node to make the corresponding comic appear below the graph.
Moreover, I’m only showing about 20% of the full graph, for readability. It’s a random 20%, so if you refresh the page you’ll see another subset!
If you scroll down a bit, I will explain how I did all this and share code snippets.
I first used bs4 and some regular expressions to download all the comics from here: https://www.explainxkcd.com/wiki/index.php/List_of_all_comics_(full). I could have used xkcd.com, and in retrospect that would have been simpler, but working with a table seemed easier in the beginning, and then I just stuck with that.
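Here is a minimal sketch of what this scraping step can look like. It is not the original script. It assumes that each row of the wiki table contains a link to xkcd.com ending in the comic number, which may not match the real page layout, and the names `comic_number`, `parse_comic_table` and `download_comic_list` are made up for illustration. It only collects the table rows; downloading the images themselves is left out.

```python
import re
from urllib.request import Request, urlopen

LIST_URL = "https://www.explainxkcd.com/wiki/index.php/List_of_all_comics_(full)"

# Assumption: comic links look like "https://xkcd.com/353" or "https://xkcd.com/353/".
XKCD_LINK = re.compile(r"xkcd\.com/(\d+)/?$")


def comic_number(href):
    """Return the comic number encoded in an xkcd.com link, or None."""
    match = XKCD_LINK.search(href)
    return int(match.group(1)) if match else None


def parse_comic_table(html):
    """Map comic number -> row text for every table row that links to a comic."""
    # Imported here so the regex helper above works without bs4 installed.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    comics = {}
    for row in soup.find_all("tr"):
        for link in row.find_all("a", href=True):
            number = comic_number(link["href"])
            if number is not None:
                comics[number] = row.get_text(" ", strip=True)
                break  # one comic per row is enough
    return comics


def download_comic_list(url=LIST_URL):
    """Fetch the wiki page and parse it (this performs a network request)."""
    request = Request(url, headers={"User-Agent": "xkcd-graph-sketch"})
    with urlopen(request) as response:
        return parse_comic_table(response.read().decode("utf-8"))
```

Calling `download_comic_list()` would return a dictionary keyed by comic number, which is a convenient starting point for fetching each image afterwards.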
Extracting texts with Google Vision
Then I used Google Vision to get the text from each image. Google Vision is free for the first 1000 images (at least as of 2020). The script took about 1 hour on my notebook; parallelising it would have made it faster.
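A minimal sketch of the text extraction, including the parallelisation mentioned above, could look like this. It is not the original script. It assumes the `google-cloud-vision` client library (version 2 or later) and configured Google Cloud credentials; the function names and the thread pool size are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1)
def vision_client():
    """Create the Vision client once and reuse it (needs Google Cloud credentials)."""
    from google.cloud import vision  # third-party: pip install google-cloud-vision

    return vision.ImageAnnotatorClient()


def ocr_with_vision(image_path):
    """Return the text Google Vision detects in one image file."""
    from google.cloud import vision

    with open(image_path, "rb") as handle:
        image = vision.Image(content=handle.read())
    response = vision_client().text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    annotations = response.text_annotations
    # The first annotation holds the full detected text; the rest are single words.
    return annotations[0].description if annotations else ""


def extract_texts(image_paths, ocr=ocr_with_vision, workers=8):
    """Run OCR on many images concurrently and map each path to its text."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr, image_paths))
    return dict(zip(image_paths, texts))
```

Threads fit here because each call spends most of its time waiting on the network. The `ocr` parameter also makes it easy to try the pipeline with a stand-in function before spending any of the free quota.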
Extracting important words with spaCy
I fell in love with spaCy when I compared it against NLTK, so I used the former to remove stopwords from each text. I then calculated the edges between comics, now starting to think about the whole thing as a graph. Two comics share an edge if they have words in common. But no stopwords! It’s not interesting to know that comics 345 and 876 both use the words “and”, “the” and “I”.
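The two steps above can be sketched as follows. This is an illustration, not the original code. It assumes a blank English spaCy pipeline (tokenizer and stopword list only, so no model download is needed) and keeps only alphabetic tokens, which is an extra filter not mentioned in the post. The helper names are hypothetical.

```python
from itertools import combinations


def important_words(text, nlp=None):
    """Lower-cased alphabetic tokens of `text` that are not spaCy stopwords."""
    if nlp is None:
        import spacy  # third-party: pip install spacy

        nlp = spacy.blank("en")  # tokenizer only; pass your own pipeline to reuse it
    return {tok.lower_ for tok in nlp(text) if tok.is_alpha and not tok.is_stop}


def build_edges(words_by_comic):
    """Return {(a, b): shared words} for every pair of comics with words in common."""
    edges = {}
    for (a, words_a), (b, words_b) in combinations(sorted(words_by_comic.items()), 2):
        shared = words_a & words_b
        if shared:
            edges[(a, b)] = shared
    return edges
```

For example, with `words = {n: important_words(text) for n, text in texts.items()}`, calling `build_edges(words)` compares every pair of comics once. That is quadratic in the number of comics, which is still manageable for a few thousand of them.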
Plotting with d3
For plotting, I considered Python’s NetworkX. But I wanted something interactive, and NetworkX made it hard for me to work with callbacks. Someone in my network then recommended d3, which looked promising! So I decided to learn a bit of d3. Note: I really like Gephi, but unfortunately it doesn’t have an easy way of making pictures appear on screen!
The code that generates the visualization you see above is self-contained, meaning that if you download it you will be able to see the graph locally in your browser as well.
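On the Python side, the graph has to reach d3 in some form. One common shape for d3 force layouts is a JSON object with a `nodes` list and a `links` list, where each link names its `source` and `target` node. The sketch below assumes the edges are stored as a dictionary from comic pairs to shared words; the field name `words` and the function name are hypothetical, not taken from the original code.

```python
import json


def to_d3_graph(edges):
    """Convert {(a, b): shared words} into a nodes/links dict for d3."""
    nodes = sorted({comic for pair in edges for comic in pair})
    return {
        "nodes": [{"id": comic} for comic in nodes],
        "links": [
            {"source": a, "target": b, "words": sorted(shared)}
            for (a, b), shared in sorted(edges.items())
        ],
    }


def save_d3_graph(edges, path):
    """Write the graph as JSON so a d3 page can load it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(to_d3_graph(edges), handle)
```

In d3, `forceLink().id(d => d.id)` resolves the `source` and `target` values against the node ids. Comics without any edge are left out of the node list in this sketch.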
Once more, for those who just scrolled to the bottom to see the graph:
Originally published at https://www.fluentdata.tech on November 30, 2020.