DATA STORIES | GAMIFICATION OF LEARNING | KNIME ANALYTICS PLATFORM

Unlocking the Potential of Movie Posters: The Winning Team’s Innovative Analysis of Box Office Revenues

My Data Guest — An Interview with the Winners of the “Analytics in Creative Industries” Challenge with KNIME

Roberto Cadili
Low Code for Data Science
9 min read · Feb 27, 2023



This is an especially young episode of the My Data Guest series. For this episode, we talked to the winning team of the 2021 Business & Marketing Analytics Challenge with KNIME, “Analytics in Creative Industries”. The challenge was held in November 2021 at LUISS Guido Carli University in Rome (Italy) under the supervision of Prof. Francisco Villarroel and KNIME, and the teams were given two weeks to complete their projects.

Read about Prof. Villarroel’s take on the experience — the educator!

Eleven teams competed to analyze data about products from the creative industries, such as movies, music, memes, social media, and other similar content. Some of the fascinating topics explored included the correlation between nudity and revenues, the identification of users’ preferred music genre on TikTok, and the exploration and prediction of viral songs in the music industry. The team whose workflow project ranked #1 masterfully combined image and text analysis with the goal of finding the relationship (if any) between movie posters and box office revenues. In our conversation, we will take a closer look at their findings and the implementation of their solution.

Our Guests

The team members whose project ranked #1 in the challenge are Alida Brizzante, Francesco Buzzi, Anthony Ballerino, Simone Di Gregorio, Giulia Iadisernia, and Francesco Cardarelli. Congratulations on your win, guys!

At the time of the challenge, our guests were final-year students in the BSc. Management and Computer Science at LUISS University. The study curriculum encompasses courses that integrate business studies with computer science and data mining. The students were introduced to KNIME Analytics Platform by Prof. Villarroel within the scope of the course in “Business And Marketing Analytics” and view it as a useful tool for performing a variety of complex tasks without needing to master a scripting language. An important part of the experience was having external evaluators assess their work, which is a good simulation of a real-life scenario.

Roberto: Let’s walk through the “Analytics in Creative Industries” Challenge and your project. Were there any specific requirements in terms of industry or analytics area?

Alida: The creative industry is an interesting but quite challenging industry for its variety and data type complexity. For the challenge, we were required to take the role of a junior marketing consultant and develop a project that focused on at least two of the following areas:

1. Exploratory analysis to identify and create insightful visualizations of characteristics and changes over time.

2. Explanatory analysis between new constructs from (un)structured data and marketing constructs.

3. Predictive analysis of product/service characteristics and success via the development of solutions powered by AI.

Roberto: You impressed the technical jury with a project on “The impact of Movie Data and Posters on Box Office Revenues”. Why was the movie industry an interesting case for you?

Simone: Members in our team have a strong personal interest in the movie industry. Additionally, data about movies and box office revenues is easily accessible as box offices publish data for every movie that comes out. For other industries, it would have been much harder to have access to the data.

Roberto: On top of the business relevance, what management challenges does this industry pose to marketers?

Anthony: It’s widely known in the movie industry that movie posters must meet viewers’ expectations, which differ from genre to genre. This means that marketers need to design movie posters in a way that triggers viewers’ interest and increases their willingness to actually watch the movie. Depending on the genre, some specific poster features may be more likely to boost a movie’s revenue.

Roberto: That is quite a challenge. To provide a compelling answer, your analytical process must have started with the formulation of a research question. Can you tell us in a nutshell what your research question was?

Anthony: In short, our research question was: “Do visual and text features of movie posters have an impact on box office revenues?”, and further: “How does the impact differ across different movie genres?”.

Roberto: Before we reveal the result of your project, let’s focus on how you got there: On which analytics areas did you focus to answer those questions?

Giulia: Our focus was on exploring and visualizing our data to gain a better understanding of what we were dealing with. We then pursued exploring the relationships between our variables in more detail and trying to find what movie poster features lead to higher box office revenue.

Roberto: The follow-up question would be then what data you used. Did you work with text, image, structured data?

Francesco C.: We used image data, i.e., the posters from which we extracted the relevant features, and textual data, i.e. movie poster headlines.

Roberto: How did you retrieve the data you needed?

Simone: We started with an IMDb dataset that was available on Kaggle, including many movie features (e.g., genre, title, minutes, etc.). For those movies, we collected poster images relying on a website (similar to IMDb) that provided an API. This API allowed us to collect the image URLs for all the movies in our IMDb dataset. Finally, we were able to scrape all images from the collected image URLs.
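For readers who want to see the retrieval steps Simone describes outside of KNIME, here is a minimal Python sketch. The team did this with KNIME nodes, and the interview does not name the poster website, so everything specific here is an assumption for illustration: the endpoint `https://api.example.com/movie`, the `imdb_id` and `api_key` query parameters, and the `poster_url` response field are all hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: the interview does not name the real poster API.
API_BASE = "https://api.example.com/movie"


def build_lookup_url(imdb_id, api_key):
    """Build the lookup URL for one movie (parameter names are assumed)."""
    return f"{API_BASE}?{urlencode({'imdb_id': imdb_id, 'api_key': api_key})}"


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def collect_poster_urls(imdb_ids, api_key, get_json=http_get_json):
    """Map each movie ID to its poster image URL, skipping movies without one.

    `get_json` is injectable so the logic can be exercised without a network.
    """
    posters = {}
    for imdb_id in imdb_ids:
        payload = get_json(build_lookup_url(imdb_id, api_key))
        poster = payload.get("poster_url")  # hypothetical field name
        if poster:
            posters[imdb_id] = poster
    return posters
```

The collected URLs can then be downloaded one by one in a second pass, which mirrors the two-step process in the interview: first gather the image URLs, then fetch the images.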

Roberto: The next step was data exploration. Which techniques did you use for data exploration? Did you create dashboards? What did you find out in this phase of the project?

Alida: As we had more than one dataset to explore, we ended up having almost 30 different plots. We used a dashboard for data exploration which let us organize all the plots in a productive and interactive way. The dashboard helped a lot in understanding which features are important and how to interpret them.

Roberto: So you retrieved data and explored it. Were the available features sufficient to determine what movie poster features increase box office revenues? Or did you also apply techniques for feature engineering?

Francesco C.: We applied feature engineering to create new features from the initial data, for example, “colorfulness” or “rgb” (the coloration of the movie poster). Combining these new features with existing ones, we found out that features that are connected to multiple genres have different impacts on the box office revenues within each genre. For example, the feature “dominance” has a great impact on box office revenues within the action genre, but less influence on other movie genres.
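To give a feel for what features like “colorfulness” or “rgb” can look like, here is a small Python sketch. The interview does not say how the team computed them, so this is an assumption: it uses the Hasler and Süsstrunk colorfulness metric, one common choice, and a plain per-channel average for the mean color. The function names and the pixel format (a list of `(r, g, b)` tuples) are made up for this example.

```python
from math import hypot
from statistics import fmean, pstdev


def mean_rgb(pixels):
    """Average red, green, and blue values over all pixels."""
    return tuple(fmean(channel) for channel in zip(*pixels))


def colorfulness(pixels):
    """Hasler-Suesstrunk colorfulness of a list of (r, g, b) pixels.

    Assumed metric: the interview does not state which one the team used.
    """
    # Opponent color channels: red-green and yellow-blue.
    rg = [r - g for r, g, b in pixels]
    yb = [0.5 * (r + g) - b for r, g, b in pixels]
    # Combine the spread and the mean offset of both channels.
    spread = hypot(pstdev(rg), pstdev(yb))
    offset = hypot(fmean(rg), fmean(yb))
    return spread + 0.3 * offset


gray_poster = [(10, 10, 10), (200, 200, 200)]
vivid_poster = [(255, 0, 0), (0, 0, 255)]
```

With this metric, a poster made only of gray tones scores zero, while `vivid_poster` scores well above it, so the number can be used as a single regression input per poster.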

Roberto: Was the entire feature engineering process completely codeless or low code?

Alida: In the beginning, we thought it would be easier and quicker to rely on our existing knowledge of programming languages when performing more complex operations. However, we ended up conducting the entire feature engineering completely codeless. It was great to explore and learn about the potential of KNIME Analytics Platform, which allowed us to perform everything we needed for the project.

Roberto: Did you rely on any specific KNIME extension? Which one do you wish you had more time to explore?

Francesco C.: Yes, several. We used, for example, the KNIME REST Client Extension to send API calls, the Python Integration, and the JavaScript Views Extension for the plots. We wish we had more time to explore all of them, really.

Roberto: Integrating the capabilities of KNIME Analytics Platform with those available in Python is definitely a great idea, especially because you were able to do all that in one, uniform environment. Did you use any other programming language integration for your project?

Alida: We used the R Integration and the R Snippet node to build multiple linear regression models or to perform tests concerning, for example, overdispersion.

Roberto: With multiple linear regression, which relationships were you trying to model?

Giulia: With the regression models, we wanted to observe the relationship between box office revenues, movie poster and text features. We created different models, one including all of the variables, others including different interactions between box office revenues and movie genres.
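The idea of a genre interaction can be illustrated with a short sketch. The team fitted their models in R through the R Snippet node; the Python below is only an illustration of the technique, not their code. The data is synthetic and noise-free, the coefficients in `TRUE_COEFS` are invented, and the single `is_action` indicator stands in for a full set of genre variables.

```python
def design_row(colorfulness, is_action):
    """Intercept, two main effects, and the colorfulness x genre interaction."""
    return [1.0, colorfulness, float(is_action), colorfulness * is_action]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic toy data generated from invented coefficients (not the team's results).
TRUE_COEFS = [1.0, 2.0, 0.5, -3.0]
samples = [(c, a) for a in (0, 1) for c in (0.1, 0.4, 0.7, 0.9)]
rows = [design_row(c, a) for c, a in samples]
revenue = [sum(w * v for w, v in zip(TRUE_COEFS, row)) for row in rows]

coefs = fit_ols(rows, revenue)
slope_other = coefs[1]              # effect of colorfulness outside the action genre
slope_action = coefs[1] + coefs[3]  # effect of colorfulness within the action genre
```

The interaction coefficient is what lets the effect of a poster feature differ by genre: the slope for action movies is the main effect plus the interaction term, so it can even change sign relative to other genres.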

Roberto: And finally, it’s time to reveal your results. What strategic marketing suggestions would you give to marketers working on movie posters if they want to secure large revenues?

Giulia: Our results indicate that positive word sentiment is negatively correlated with box office revenues. Hence, we suggest including words in movie titles that are more negative and less pleasant because they tend to be better remembered and more impactful. In addition, longer titles and more colorful movie posters tend to go along with higher box office revenues. However, the results greatly depend on the genre. For example, for action movies it’s best to have monochromatic posters instead of colorful ones.

Roberto: By the scale and complexity of the project (i.e., from data retrieval and exploration, to complex feature engineering and modeling), I am wondering how large your workflow was. Did you encapsulate some of the complexity in dedicated components?

Alida: Our workflow is quite big and we employed around 60 different types of nodes. Hence, we made extensive use of metanodes and components which made our workflow more understandable and explorable.

Roberto: And what about other workflow strategies to help you declutter your work and gain execution efficiency?

Simone: We used some of the orchestration nodes in order to split up our large workflow. In the end, we had one parent workflow that called all the other sub-workflows of our pipeline.

Roberto: How did you become such proficient users of KNIME?

Simone: In my case, and I believe this applies to the others too, it was all about playing with it. By playing and ‘breaking’ things, you learn. When you are in that mindset, everything becomes much simpler.

Roberto: Which KNIME capability would you have liked to know from day one?

Alida: How to make dashboards dynamic and interactive.

Anthony: The Conda Environment Propagation node. I find it very useful for easily recreating the right Conda environment on other machines.

Francesco B.: …It also makes the workflow very portable when integrating R or Python.

Roberto: What are some of your other favorite nodes?

Simone: The Image Reader node that allowed me to import images very easily.

Alida: Oh, also some of the more basic nodes, such as the Pivoting node and the Reference Column Filter node that I discovered on the KNIME Community Hub.

Anthony: For me, the Twitter API Connector node, which we did not need for this project in particular but for other projects in the course. It is a simple and brilliant way to connect to Twitter and scrape tweets.

Rosaria: How did you manage all this work in two weeks? Did you divide the task so that each person had specialized roles?

Simone: Alida and I took care of most of the data blending, workflow orchestration and dashboards parts as we were more experienced in those areas. Giulia, Francesco B., Anthony, and Francesco C. designed our models and our tasks. They defined the research questions, what we wanted to implement specifically and what marketing tasks we wanted to address.

Rosaria: Do you have any advice for people who are starting a new data science project?

Anthony: One fundamental part is to spend enough time on data retrieval and preparation. This makes everything much easier afterwards.

Alida: I would recommend always reading the documentation associated with the nodes. It’s not as boring as it seems and it is definitely very useful.

Roberto: What are some sources that you use to stay up-to-date in the field?

Simone: The first source that any data scientist or developer uses is StackOverflow. Whenever you type a question on Google, this site pops up. Besides that, playing with the tools. When I am curious about something, I search the internet, read books/articles, watch video tutorials, and do my best to keep up with the most recent developments.

Roberto: Do you use the KNIME Forum as well to find answers?

Simone: KNIME Forum is super important for KNIME-related problems. You can find some very specific answers there, for instance when a node crashes or you are stuck with your workflow.

Roberto: We are reaching the end of our conversation. Before we say goodbye, where can people follow your work?

Simone: Our workflow is freely available on the KNIME Community Hub. You can just download the whole folder and the workflow will work pretty much by itself.

Roberto: Thank you very much for being our guests and best of luck on your path to graduation!

Watch the original interview with the Winners of the “Analytics in Creative Industries” Challenge on YouTube.


Roberto Cadili
Low Code for Data Science

Data scientist at KNIME, NLP enthusiast, and history lover. Editor for Low Code for Data Science.