Thematic treemap analysis

Published in DensityDesign Research Lab, Feb 4, 2016

Using keywords to discover what a certain topic is related to and how it is regarded

1. What did we want to discover?

We created this protocol to investigate how the migration theme relates to other topics in movies about migration: which topics are most correlated with migration in the 120 movies on our list?

A film is a complex artwork, not only because it has several levels of interpretation, but also because the story itself connects different topics through new relationships every time.

We also wanted to see how these topics are distributed along the list when it is ordered by popularity. For example: are there similar patterns in the topics of the most popular movies? Which topics recur most often?

2. Which sources did we choose? Why are these useful to analyze our research question?

To discover the relations between “migration” and the topics connected to it, we decided to use as our source the keywords found on the Internet Movie Database (imdb.com).

This kind of data is useful for our research question for two main reasons. The first is that, although these keywords are written by IMDb users, they mirror both what people perceive when watching a movie and the point of view of the production. Keywords tell us what strikes people most about a film. This can be noticed, for example, in the descriptions of violent or sexual scenes: these are depicted in an incredibly precise and meticulous way, which clearly shows a particular attitude towards those topics.

The second is that keywords written by IMDb users give us a textual handle on content that is not in written form. If our corpus had consisted of articles, we would not have needed a “textual handhold”: our “keywords” would simply have been the most recurrent words in the articles.

3. How does our method work? What kind of operations/transformations did we do and in what way?

First of all, we looked for movies related to immigration by entering words such as immigrant, immigration, emigrant, emigration, migrant and migration on several platforms. These were movie-specific platforms (imdb.com, allmovie.com and themoviedb.org) and more general ones, such as the Wikipedia page that lists immigration movies and google.com, which returns a slider with the covers of the movies ranked by a “most frequently asked” criterion.

In this way we created 15 lists. We then selected the most relevant movies using a recurrence criterion: a movie was considered relevant if it appeared in at least three lists. Once the selection was done, our list was composed of 120 movies.
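The recurrence criterion can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure originally used: the function name select_recurring, the min_lists parameter and the placeholder titles are all hypothetical.

```python
from collections import Counter

def select_recurring(lists, min_lists=3):
    """Return the titles that appear in at least `min_lists` of the given lists."""
    counts = Counter()
    for titles in lists:
        # Use a set so that a title repeated inside one list is counted once.
        counts.update(set(titles))
    return sorted(title for title, n in counts.items() if n >= min_lists)

# Hypothetical placeholder lists; the study described here used 15 of them.
example_lists = [
    ["Movie A", "Movie B", "Movie C"],
    ["Movie A", "Movie C"],
    ["Movie A", "Movie B", "Movie D"],
    ["Movie C", "Movie D"],
]
selected = select_recurring(example_lists)
print(selected)
```

Counting each title once per list keeps a single source from pushing a movie over the threshold on its own.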

Each of these movies has a dedicated page on the IMDb website. Using Kimono, we scraped these 120 pages to obtain a dataset containing their 120 URLs.

Each page contains the movie data together with a list of keywords that describe the content of the movie. We created a second Kimono API to obtain a dataset containing all the keywords IMDb users wrote about each of our movies.

Once this dataset was created, we used Excel to understand which keywords recurred most often and then to create a list of 5216 unique keywords.
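The same counting and de-duplication step could also be done in code rather than in Excel. The following Python sketch assumes the scraped data is available as (movie, keyword) pairs; the sample rows are invented for illustration.

```python
from collections import Counter

# Hypothetical (movie, keyword) rows standing in for the scraped dataset.
keyword_rows = [
    ("Movie A", "border"),
    ("Movie A", "murder"),
    ("Movie B", "border"),
    ("Movie C", "border"),
    ("Movie C", "wedding"),
]

# How often each keyword recurs across the whole corpus.
keyword_counts = Counter(keyword for _movie, keyword in keyword_rows)

# The list of unique keywords, each appearing exactly once.
unique_keywords = sorted(keyword_counts)

print(keyword_counts.most_common(1))
print(unique_keywords)
```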

With this list of unique keywords in hand, we noticed that the keywords were often too specific and precise, so we decided to manually cluster our 5216 keywords into more general topics. Thirty-nine topics, plus a residual “other” category, were identified by reading the keywords. These are: violence, travel, time, technology, sport, social issues, sex, security forces, religion, relationships, reference, politics, people, nationalities, media, love, language, justice, job, immigration, history, health, places, gender issues, food, film, family, emotions, education, economy, death, cultural differences, criminality, car, arts, animals, ages, addiction, sexual abuse and other.

By doing this, we obtained a three-column dataset: the name of the movie, the keyword and the macro-topic related to that keyword.
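As an illustration, the three-column dataset can be thought of as the result of looking up each keyword in a hand-made table. In the Python sketch below the mapping entries, the function name add_topics and the sample rows are hypothetical; in the study the clustering was done manually by reading the keywords.

```python
# Hypothetical keyword -> topic assignments (the real mapping was built by hand).
KEYWORD_TO_TOPIC = {
    "murder": "violence",
    "border crossing": "immigration",
    "wedding": "family",
}

def add_topics(rows, mapping, default="other"):
    """Turn (movie, keyword) pairs into (movie, keyword, topic) triples."""
    return [(movie, keyword, mapping.get(keyword, default)) for movie, keyword in rows]

pairs = [
    ("Movie A", "murder"),
    ("Movie A", "wedding"),
    ("Movie B", "border crossing"),
    ("Movie B", "umbrella"),  # not in the mapping, so it falls into "other"
]
three_columns = add_topics(pairs, KEYWORD_TO_TOPIC)
for row in three_columns:
    print(row)
```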

After that, we decided which topics were the most interesting in relation to our subject and chose to visualize the following ones: violence, job, arts, immigration, family, criminality, cultural differences, sex and security forces.

Therefore, a two-column dataset was created for each movie, containing only the title of the movie and the topics we selected.

By loading each of our 120 movies one by one into Raw, we created 120 treemap visualizations. The “Topic” value was assigned to both the “Hierarchy” and the “Color” dimensions. The size of each area of the treemap is given by the number of times a certain tag (e.g. “criminality”) is repeated in the movie.
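The area of each treemap cell therefore corresponds to a simple count. A minimal Python sketch of that count follows, assuming the data is available as (movie, keyword, topic) triples; the function name treemap_sizes and the sample rows are hypothetical, and Raw computes the sizes itself once the data is loaded.

```python
from collections import Counter, defaultdict

# The nine topics chosen for the visualization, as listed in the text.
SELECTED_TOPICS = {
    "violence", "job", "arts", "immigration", "family",
    "criminality", "cultural differences", "sex", "security forces",
}

def treemap_sizes(rows):
    """Count, for each movie, how many keywords fall under each selected topic."""
    sizes = defaultdict(Counter)
    for movie, _keyword, topic in rows:
        if topic in SELECTED_TOPICS:
            sizes[movie][topic] += 1
    return sizes

# Hypothetical sample rows.
topic_rows = [
    ("Movie A", "murder", "violence"),
    ("Movie A", "shootout", "violence"),
    ("Movie A", "wedding", "family"),
    ("Movie A", "train", "travel"),  # not a selected topic, so it is ignored
    ("Movie B", "border crossing", "immigration"),
]
sizes = treemap_sizes(topic_rows)
print(dict(sizes["Movie A"]))
```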

Once the 120 treemaps were obtained, they were assembled on a grid in Adobe Illustrator and each area of the treemaps was colored according to its topic. The 120 graphs were arranged in boustrophedon reading order (that is, from the top left corner to the bottom left, with the first row running left to right, the next right to left, and so on to the bottom), following the IMDb popularity ranking.
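The boustrophedon arrangement can be expressed as a mapping from a rank in the popularity list to a (row, column) cell of the grid. The Python sketch below is an assumption-laden illustration: the function name is hypothetical and the number of columns is a free parameter, since the width of the original grid is not stated here.

```python
def boustrophedon_positions(n_items, n_columns):
    """Return (row, column) cells for items laid out in alternating row directions."""
    positions = []
    for index in range(n_items):
        row, offset = divmod(index, n_columns)
        # Even rows run left to right, odd rows run right to left.
        column = offset if row % 2 == 0 else n_columns - 1 - offset
        positions.append((row, column))
    return positions

# Seven items on a hypothetical three-column grid.
print(boustrophedon_positions(7, 3))
```

With this layout, two movies that are adjacent in the ranking are always adjacent on the grid, even where one row ends and the next begins.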

The colors of our visualization should be read as follows: areas whose hue tends towards red represent topics with a more negative connotation, areas that tend towards blue represent the more positive categories, and white in the middle stands for neutral topics.

Of course, each topic includes keywords with positive, negative or neutral connotations. After counting them, however, we assigned each topic a color based on its distribution of positive, negative and neutral words.

In this way, topics like violence, criminality, security forces and sex show a more negative tendency; family, arts and job are more positive, while cultural differences and immigration sit in the middle.
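One possible way to turn such counts into a color is sketched below in Python. The exact rule used for the visualization is not given in the text, so both the score formula and the linear red-white-blue interpolation are assumptions made for illustration, as are the function names.

```python
def connotation_score(positive, negative, neutral):
    """Return a score in [-1, 1]: -1 is fully negative, +1 is fully positive."""
    total = positive + negative + neutral
    if total == 0:
        return 0.0
    return (positive - negative) / total

def score_to_rgb(score):
    """Map -1 to red, 0 to white and +1 to blue by linear interpolation."""
    score = max(-1.0, min(1.0, score))
    fade = round(255 * (1 - abs(score)))
    if score < 0:
        return (255, fade, fade)
    return (fade, fade, 255)

# Hypothetical counts for one topic: 2 positive, 6 negative, 2 neutral keywords.
score = connotation_score(2, 6, 2)
print(score, score_to_rgb(score))
```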

4. What results have we obtained?

By switching on one topic at a time, different patterns can be visualized. These patterns show how themes are distributed across our movies, which are arranged along the graph according to the IMDb popularity ranking.

Two main patterns emerge from this visualization.

On the one hand, topics with an evidently negative connotation (criminality and violence) appear in the upper part of the visualization, together with other topics (like sex and security forces) that are not clearly negative but whose connotation leans slightly towards the negative side of the color palette.

Family is strongly present and evenly spread across our films.

On the other hand, it is interesting to discover that two topics such as “cultural differences” and “immigration” are more widespread in the lower half of the graph than in the upper one. Although these two topics are made up of keywords strongly related to our main theme, they are not the most popular, while topics like violence and sex seem more appealing.

5. How can this method be applied to other cases?

First of all, this method can be applied to lists of movies on any other topic to be investigated. For instance, a list of movies about World War II could be analyzed in the same way.

Furthermore, this method can be applied to any web platform that uses a system of keywords, hashtags or tags to manage its content. These words must be visible to users. For example, tags on YouTube are not visible to viewers: they can only be written by whoever uploads a video.

Every platform with a search system uses keywords, but not all platforms allow users to see them, and here visibility is a must in order to be able to harvest the keywords.

This method could be particularly useful for platforms whose main content is not textual. We think that for written content it’s the text itself that can be analyzed, but this does not exclude using the method, for example, on news websites that use tags to categorize their articles (e.g. Al Jazeera or The Guardian).

In any case, we believe it would be more useful when the main content of the platform being analyzed is video, audio or visual in general.

This method makes it possible to discover which topics relate to a selected theme: it can visualize patterns between topics and pinpoint how they relate to one another.

Furthermore, it is possible to discover the most and the least related topics from different points of view, which are given by the ranking criteria provided by the platform being analyzed. Common ranking criteria are, for example, from the most to the least popular or from the most to the least recent.

Another interesting possibility offered by this method is a temporal analysis of the context around a certain topic, obtained by repeating the operation a number of times over a given period. For instance, IMDb keywords are written by users over time rather than in a single instant, so with this method it could be possible to follow the “formation” of keywords around a movie from the moment its IMDb page is created until today.

The scraping phase can be automated by deselecting “manual crawl” and selecting a crawling frequency (every 15 minutes, daily, etc.) while creating an API in Kimono. Producing one visualization per crawl could show how a certain topic is associated with others over time.
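Besides producing one visualization per crawl, two successive snapshots of the keywords of a movie could be compared directly. The Python sketch below is a hypothetical illustration of that comparison; the function name and the sample snapshots are invented, and no actual crawl data is shown.

```python
def keyword_changes(earlier, later):
    """Compare two keyword snapshots of the same movie page."""
    earlier, later = set(earlier), set(later)
    return {
        "added": sorted(later - earlier),
        "removed": sorted(earlier - later),
    }

# Hypothetical snapshots taken by two successive crawls.
first_crawl = ["border", "family", "murder"]
second_crawl = ["border", "family", "deportation", "wedding"]
changes = keyword_changes(first_crawl, second_crawl)
print(changes)
```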

Therefore, as said before, this method can be applied to all platforms that use a hashtag, tag or keyword system visible to users while they navigate the website. Platforms where an analysis like this could be done are, for example, Flickr.com, Behance.net or Soundcloud.com.

Taking Behance as a possible case study, this analysis could reveal which design tendencies, practices or trends are associated with a specific sector of design and how these change over time. In fact, users are asked to enter keywords when they upload a project to the platform, and these keywords are visible to any other user.

For instance, this method could help us discover which trends and tastes are most related to a query like “data visualization”. The analysis of the related keywords could reveal the most connected fields (graphic design, motion graphics or web design?), the most frequent topics (environmental issues or portfolios?), the most used styles and techniques (flat design or 3D?) and so on.

We can also see how these topics relate to data visualization through different lenses, namely the ranking criteria offered by the platform (most appreciated, most discussed, most viewed, most recent, etc.). Moreover, we can refine the corpus of projects from which we scrape the keywords with Kimono by using Behance filters such as the country of origin of the project or the creative field.

In conclusion, this method can be useful for mapping information contained in websites whose content is not textual and in which a keyword or tag system is present and visible to users. It can be used to take a single snapshot of a situation at a certain moment, or to analyze trends over time by generating a series of visualizations.