What to watch tonight: scraping IMDb and visualizing its data as an interactive website

Sviatoslav Kovalev
7 min read · Feb 21, 2018


Inspired by the last.fm interactive map, I decided to make a graph visualization of movie recommendations. Here is the story of how I made it, from scraping the data to building an interactive web demo.

IMDb recommendations graph (173K+ movies)

tl;dr:

1. My IMDb spider, written with Scrapy + rotating proxies + sqlite3: https://github.com/iggisv9t/imdb-spider
2. Visualization made with Gephi: OpenOrd + ForceAtlas
3. Interactive IMDb graph made with this JS lib: https://www.ayalpinkus.nl/shinglejs/
4. All intermediate data conversions done with disposable Python scripts.
5. The result is here: https://iggisv9t.github.io/imdb/index.html

Result demo in one gif

Getting the data

Getting data from IMDb is not a problem in itself: IMDb shares an official dataset. But it contains no recommendations, so if you want to build a graph like mine, or plan to build your own recommendation system, you have to collect the data yourself. There are not many options for doing this in Python:
1. Requests + some lib for Html parsing: BeautifulSoup or lxml
2. Scrapy

The first option is much more work than the second. Almost everything you need to care about when scraping is already built into Scrapy. So if you just want to get data from the web, rather than develop your own little framework, use Scrapy.

Writing the spider

  • Install scrapy and create a new project as described in docs.
  • First, modify items.py: you need to create an item class for the scraped objects.
  • In the project, create your spider class in a separate file in the `spiders` dir. You need to find an XPath for every element you want to extract from the page. Right-click the element and choose Inspect, then in the Elements tab right-click the highlighted node and choose Copy → Copy XPath.
  • Debug your request using scrapy shell:
$ scrapy shell http://www.imdb.com/title/tt0086190/ 
>>> director_xpath = '//span[@itemprop="director"]/a/span/text()'
>>> response.xpath(director_xpath).extract_first()
'Richard Marquand'
  • Create your pipeline in pipelines.py. I chose to save my data to an SQLite DB because it’s easy (a minimal sketch of both files follows this list).
  • Check that everything works: run your spider with the setting CLOSESPIDER_PAGECOUNT=5 to limit the number of requests while debugging.
  • Create a directory, for example crawls1, and run your spider like this: scrapy crawl myspider -s JOBDIR=crawls1. Now you can resume the job after a pause or a failure. Read more in the docs.
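
Here is a minimal sketch of the item and pipeline pieces. The field and table names are illustrative assumptions, not the real ones from my spider (see the repo linked in the tl;dr for the actual code):

# items.py: a hypothetical item class for scraped movies
import scrapy

class MovieItem(scrapy.Item):
    movie_id = scrapy.Field()
    title = scrapy.Field()
    director = scrapy.Field()
    recommended_ids = scrapy.Field()  # ids from the "More like this" block


# pipelines.py: store scraped items in an SQLite DB
import sqlite3

class SqlitePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('imdb.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS movies '
                          '(movie_id TEXT PRIMARY KEY, title TEXT, director TEXT)')
        self.conn.execute('CREATE TABLE IF NOT EXISTS recommendations '
                          '(source TEXT, target TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT OR REPLACE INTO movies VALUES (?, ?, ?)',
                          (item['movie_id'], item['title'], item['director']))
        self.conn.executemany('INSERT INTO recommendations VALUES (?, ?)',
                              [(item['movie_id'], rec) for rec in item['recommended_ids']])
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Remember to enable the pipeline through ITEM_PIPELINES in settings.py.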

Avoiding a ban

OK, the spider is ready, but that is not all. Most websites detect robot requests and ban them, and IMDb is no exception. There are a lot of ways to avoid a ban with a Scrapy spider: pass requests through Tor and change nodes every few seconds, pick randomly from a list of free proxies, or use a rotating proxy service. All of these options are well described on the web, and each has its advantages and disadvantages; choose according to your needs. I chose a rotating proxy service: it is easy to use and requires minimal additional code, but it is paid. How it works: enable the proxy middleware in settings.py and pass the service’s ip:port with each request. The service then changes the outgoing IP for every request.

In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
}

In each request:

scrapy.Request(url=url, callback=self.parse,
               meta={'proxy': 'http://YOUR_PROXY_IP:PORT'})
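
Put together, the request side of the spider could look roughly like this, reusing the hypothetical MovieItem from the sketch above. The recommendation-link XPath and the id parsing are illustrative guesses, not taken from the real spider:

# inside your spider class (with scrapy and MovieItem imported in the module):
# parse a title page and follow its recommendation links through the proxy
def parse(self, response):
    item = MovieItem()
    item['movie_id'] = response.url.split('/')[4]  # e.g. 'tt0086190'
    item['title'] = response.xpath('//h1/text()').extract_first()
    item['director'] = response.xpath(
        '//span[@itemprop="director"]/a/span/text()').extract_first()
    # hypothetical XPath for the "More like this" links
    rec_links = response.xpath('//div[@class="rec_item"]/a/@href').extract()
    item['recommended_ids'] = [link.split('/')[2] for link in rec_links]
    yield item
    for link in rec_links:
        yield scrapy.Request(url=response.urljoin(link), callback=self.parse,
                             meta={'proxy': 'http://YOUR_PROXY_IP:PORT'})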

Transforming the data

Once we have the data, we need to transform it into graph form. The most universal way to describe a graph, I think, is the DOT language. It is a very simple, human-readable format, and I did the whole transformation with a simple Python script (a sketch follows the example below). On the first line you write digraph imdb {\n: “digraph” means the graph is directed, “imdb” is its name, and the bracket opens the graph description. At a minimum, you then write source_id -> target_id;\n on each line to describe the graph’s edges. And don’t forget the closing bracket. There is more about the DOT language in the wiki and docs. It was originally the language of the Graphviz tool, but many other tools can handle it. As a last resort, you can import the DOT file into Gephi and then export it to another format.

Example:

digraph sample {
1 -> 2;
1 -> 3;
5 -> 4 [weight="5"];
4 [shape="circle"];
}

digraph means that the graph is directed; use graph if yours is not. Edges are defined with -> (directed) or -- (undirected). 1 -> 2; means an edge from node 1 to node 2. Attributes can be declared inside [] brackets.
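
As an illustration, here is a minimal sketch of such a conversion script, assuming the edges sit in the recommendations table from the pipeline sketch above (table and file names are mine):

# export_dot.py: dump the recommendation edges as a DOT file
import sqlite3

conn = sqlite3.connect('imdb.db')
with open('imdb.dot', 'w') as out:
    out.write('digraph imdb {\n')
    for source, target in conn.execute('SELECT source, target FROM recommendations'):
        out.write('"%s" -> "%s";\n' % (source, target))
    out.write('}\n')
conn.close()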

Graph tools and layouts

There are a lot of graph visualization tools, but when it comes to large graphs the situation gets harder. I scraped data for approximately 173K movies, and the full graph has approximately 1.5M edges. I know only three tools that can handle that many nodes without pain: Graphviz, Gephi and Pajek. The sfdp layout from Graphviz can easily handle much heavier data, but it is a sophisticated CLI tool, so it is hard to find the best parameters for a case like ours. Pajek is an even more powerful tool, and even more sophisticated.

For cases like this, when you have a graph of roughly 10K to 1M nodes and just need a pretty visualization, Gephi is the best choice. Gephi has several layouts, but only two are suitable for large graphs: OpenOrd and ForceAtlas (I used Multigravity ForceAtlas 2).

OpenOrd is a fast approximate algorithm. ForceAtlas is more precise, but a bit slower. For some graphs OpenOrd is the only option: when connectivity is strong, ForceAtlas sometimes squeezes dense graphs into a dark hairy cloud. I used OpenOrd to get an initial approximation and then ForceAtlas to expand the intermediate result a little. As a final step I tried to eliminate overlapping nodes with the Noverlap layout, but it took too long.

Grid-graph, ForceAtlas. Pretty but slow.
Grid-graph, OpenOrd. Very fast hairball.

Here is a comparison of OpenOrd and ForceAtlas on the same “grid-graph”.

Exporting results to an interactive graph

You can always export an image as SVG or PNG from Gephi, but in our case that is not enough. The IMDb graph has so many nodes that the labels overlap each other, making it impossible to find any particular movie and its connections in a static image. We need an interactive web page. Here are several options:

sigma.js

It’s the simplest option: there is a Gephi plugin that exports your project straight to a sigma.js template. But it has one disadvantage: it freezes when the graph is large.

gexf-js

The next option is very similar to the previous one. To use gexf-js, you just export the graph from Gephi as a .gexf file and put it into the template folder. Done. But it also freezes, and at even smaller graph sizes.

openseadragon

If you only need to show a large image, you can use openseadragon. It works the same way geographic maps are rendered: the image is sliced into tiles, and only the tiles in the current view area are shown. It can handle any number of nodes and edges. Disadvantages: no interactivity, and it is impossible to look behind overlapping nodes.

shingle.js

What if we could combine the scalability of openseadragon with the interactivity of the JS libs? Here is the solution: shingle.js! I found it by accident, and it was just what I needed.
Advantages: it handles very large graphs while staying interactive.
Disadvantages: not as pretty as sigma.js, and data preparation is a little involved.

Exporting data to shinglejs

Here is the explanation of how to export your graph from Gephi to shinglejs template:

  • Export your graph as a .gdf file. It seems to be the only export option that preserves the nodes’ coordinates.
  • Read the file and extract the vertex and edge descriptions. I did this with pandas (see the sketch after this list). A GDF file consists of two blocks: a vertices table followed by an edges table. Pandas reads only the first header and expects the same number of columns for the whole file, so you can split the dataframe at the row where the edges block begins and the trailing columns become NaN.
# gdf nodes header looks like this:
nodedef>name VARCHAR,label VARCHAR,width DOUBLE,height DOUBLE,x DOUBLE,y DOUBLE,rating VARCHAR,weight FLOAT,PageRank DOUBLE default 0.0
# gdf edges header:
edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE
  • Rename the columns according to the shinglejs docs and export to JSON. Shinglejs does not support colors as node attributes, but you can assign a color per “community” attribute; I used the movie rating as the community id.
  • Define the list of community colors in the main page source code. The color index in the list is calculated as community_id % colors_number.
  • Combine the JSON files for nodes and edges into one final JSON. I did this with bash: cat start imdbnodes.json middle imdbedges.json end > imdbdata.json, where “start”, “middle” and “end” are files containing {"nodes" :, "relations" : and } respectively.
  • Next, follow the instructions from the official page.
  • Do not forget to create a bitmap using the binary executable from the template folder. By default, the files image_2400.jpg and image_1200.jpg are expected to be in the same folder as your tile JSONs.
  • That’s all, folks! You have seen the resulting demo already in the tl;dr section.
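
For reference, here is a rough sketch of the GDF-splitting step described in the list above. The cleaned and renamed column names are assumptions; check your own GDF header and the shinglejs docs for the exact fields:

# split_gdf.py: split a Gephi .gdf export into node and edge JSON files
import pandas as pd

df = pd.read_csv('imdb.gdf')  # pandas reads only the first (nodes) header

# GDF headers look like 'nodedef>name VARCHAR,label VARCHAR,...';
# strip the type declarations to get clean column names
df.columns = [c.replace('nodedef>', '').strip().split()[0] for c in df.columns]

# the edges block starts at the row whose first field begins with 'edgedef>'
split_at = df.index[df['name'].astype(str).str.startswith('edgedef')][0]

nodes = df.iloc[:split_at]
# edge rows leave the node-only columns empty, so drop the all-NaN columns;
# the remaining edge columns still carry names from the nodes header
edges = df.iloc[split_at + 1:].dropna(axis=1, how='all')

# illustrative renames: the real target names come from the shinglejs docs
nodes = nodes.rename(columns={'name': 'id', 'label': 'name'})
nodes.to_json('imdbnodes.json', orient='records')
edges.to_json('imdbedges.json', orient='records')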

Interesting observations

On the last.fm map you can find clustering by the band’s country: Japanese pop or rock, Greek metal, and so on. The same holds for movies: Turkish, Indian, Brazilian and Korean movies sit in separate clusters. There is also a large cluster of cartoons far away from the other movies, a very dense cluster of superhero movies, and separate areas of music videos, YouTube blogs and amateur movies about the Harry Potter universe. You will make many discoveries of your own. Just try it: https://iggisv9t.github.io/imdb/index.html

P. S.

I hope there will be more projects like this one, or with other kinds of interactive dataviz. I hope some readers of this text will be inspired to support the shinglejs project by contributing new features or by developing a Gephi export plugin. Many thanks to all my ML friends who were always ready to answer my annoying questions.
