What makes a photo engaging in the context of hospitality?

Oprean Cristina
Accor - Tech & Digital
12 min read · Jul 24, 2024

1. CONTEXT & OBJECTIVES

1.1 Context

In today’s highly competitive hospitality industry, effective visual marketing strategies play a pivotal role in attracting, converting, and enrolling customers. As hotels strive to stand out amidst a sea of options, the engagement level of their photos has become a critical factor in capturing the attention and interest of potential guests. However, identifying the right visuals that resonate with the target audience and drive desired actions such as bookings and loyalty program enrollments remains a complex challenge. In light of this, there is a pressing need for advanced predictive models capable of accurately estimating the engagement level of hotel photos.

1.2 Objectives

The problem at hand is accurately predicting the engagement level of hotel photos. In the era of social media dominance and digital marketing, visuals have become a vital tool for hotels to showcase their offerings and entice potential guests. However, determining which photos will effectively capture the attention and interest of the target audience is a complex task. Without a clear understanding of the factors that drive engagement, hotels may struggle to create visual content that resonates with their customers, resulting in missed opportunities to attract, convert, and enroll customers. Robust predictive models that can evaluate the engagement potential of hotel photos have therefore become increasingly valuable: they provide hoteliers with guidance on selecting visuals that align with their branding, drive bookings, and highlight the benefits of loyalty programs. By addressing this problem, this work aims to give the hospitality industry data-driven strategies to optimize visual marketing efforts and achieve higher customer engagement.

2. DATA

2.1 Data Collection

To train our model, we gathered Instagram images from various hotel accounts and relevant hashtags, both within and outside Accor’s diverse brand portfolio. Spanning January 1, 2020 to December 31, 2022, our training data consisted of approximately 18,000 images sourced from different Instagram accounts and hashtags. To ensure a diverse representation of images, we carefully selected a range of accounts. These included accounts showcasing beautiful landscapes, Vacation Vibes (featuring people visiting scenic destinations), luxury world traveller, beautiful destinations, booking.com, all.com, and tripadvisor. We also incorporated relevant hashtags such as #hotellife and #prettyhotels.

To adequately represent all hospitality segments (luxury, midscale, and economy), specific accounts and hashtags that represented these segments were selected. For instance, the Novotel and Hilton brands were included to represent the midscale segment, Marriott, Hyatt, Sofitel, Fairmont, and Pullman hotel accounts were included for the luxury segment, and ibis for the economy segment.

2.2 Engagement

Engagement, in the context of social media, has no formal definition. Here, we have decided to quantify it into a score by using a formula. Our approach focuses on measuring engagement as a composite metric derived from the number of likes and comments a photo receives, divided by the number of followers. By incorporating both quantitative indicators (likes) and qualitative indicators (comments), we aim to capture a more comprehensive measure of user engagement that considers both the reach and interaction level of the content.
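As a minimal sketch, the score described above could be computed as follows (the exact weighting between likes and comments is not specified here; an unweighted sum is assumed):

```python
def engagement_score(likes: int, comments: int, followers: int) -> float:
    """Composite engagement: interactions normalized by audience size.

    Note: an unweighted sum of likes and comments is an assumption;
    the exact formula used in the study may weight them differently.
    """
    if followers <= 0:
        return 0.0
    return (likes + comments) / followers

# e.g. a post with 1,200 likes and 45 comments on a 60,000-follower account
print(engagement_score(1200, 45, 60000))  # 0.02075
```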

Figure 1. Social Media Metrics by account/hashtags for the data collected

3. ANALYTICS

3.1 Model Framework

Given the multi-faceted nature of engagement, any attempt at predicting how engaging a picture is requires a holistic approach that takes visual appeal, emotional impact, and content into account. These major categories were approximated by using sub-models to predict their most important components: visual appeal through aesthetics and memorability, emotional impact through latent interactions and sentiments, and content through captions and object detection. These six components are then fed into the larger model to predict the final engagement score.

The purpose of each part can be summarized as follows:

  • Visual Appeal and Retention

Aesthetic and memorability features increase viewer attention and engagement with a photo. Quantifying them provides a comprehensive picture of a photo’s impact and helps identify the key elements to target for improvement.

  • Emotional Impact

Latent interaction and sentiment features approximate the emotional reactions a photo is likely to evoke in viewers.

  • Content

Enables a better understanding of which elements and contexts are most likely to engage and resonate with the target audience.

Figure 2. Framework

3.2 Sub Models

Visual Appeal

We incorporated the model proposed by Malu et al. in “Learning Photography Aesthetics with Deep CNNs” as a subcomponent of our global model. This model generates eight aesthetic attributes and a global score, which are used at a later stage. By incorporating these aesthetic attributes, our global model gained valuable insights into the appeal of photos, contributing to a more comprehensive understanding of engagement in social media photos.
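To illustrate how this sub-model plugs into the pipeline, here is a hypothetical wrapper; the attribute names below follow the AADB-style attributes used in Malu et al.’s paper, and the `model` object and its `predict` interface are assumptions:

```python
import numpy as np

# Hypothetical wrapper around the aesthetics sub-model: eight AADB-style
# attribute scores plus one global score, as in Malu et al. (2017).
AESTHETIC_ATTRIBUTES = [
    "balancing_elements", "color_harmony", "content", "depth_of_field",
    "light", "object_emphasis", "rule_of_thirds", "vivid_color",
]

def aesthetic_features(image: np.ndarray, model) -> dict:
    """Run the aesthetics sub-model and return named attribute scores.

    `model` is assumed to expose a Keras-style predict() returning a
    (1, 9) array: eight attribute scores followed by the global score.
    """
    scores = model.predict(image[None, ...])[0]
    features = dict(zip(AESTHETIC_ATTRIBUTES, scores[:8]))
    features["global_score"] = float(scores[8])
    return features
```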

Memorability

To incorporate the memorability of images, we integrated the model proposed by Fajtl et al. in “AMNet: Memorability Estimation with Attention” as a subcomponent of our global model. This model generates a memorability score ranging from 0 to 1, which is used at a later stage.

Visual Sentiment Analysis

To extract sentiments from raw images, we incorporated the model proposed by Borth et al. in “Large-Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs” (SentiBank) as a subcomponent of our global model. This model classifies images into Adjective Noun Pair (ANP) categories, each of which expresses a particular sentiment. The top six ANP sentiment attributes associated with the image are tokenized together with the caption model output before being passed to the textual branch of the global model. By incorporating the sentiment attributes generated by SentiBank, our global model gained valuable insights into the sentiments evoked by photos, contributing to a more nuanced understanding of engagement.

Latent Interactions (VisE)

To predict how people would react to a photo, the model proposed by Jia et al. in “Exploring Visual Engagement Signals for Representation Learning” (VisE) was also incorporated. This model maps images to labels derived from real engagement signals. The output labels are stored as vectors, which form an abstract representation of the reactions people would have to the photo on a social media platform. These vectors are then fed to the global model as features to account for emotional reactions.
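Conceptually, the output can be pictured as a distribution over reaction types. The label set and values below are purely illustrative; VisE derives its label space from real engagement signals rather than a fixed hand-made list:

```python
import numpy as np

# Illustrative reaction labels only; VisE learns its own label space
# from engagement signals rather than this fixed list.
REACTIONS = ["love", "haha", "wow", "sad", "angry"]

vise_vector = np.array([0.62, 0.08, 0.21, 0.05, 0.04])  # hypothetical output
print(REACTIONS[int(np.argmax(vise_vector))])  # -> "love"
```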

Object Detection

The object detection sub-model identifies the objects present in a photo. This information is encoded as vectors, which are then passed on to the global model, giving each photo important contextual information.
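A minimal sketch of this encoding step, assuming a detector that outputs class labels (the vocabulary below is hypothetical; in practice it would be the detector’s full label set, e.g. COCO’s 80 classes):

```python
import numpy as np

# Hypothetical object vocabulary; in practice this would be the
# detector's full label set.
VOCABULARY = ["person", "chair", "dining table", "bed", "wine glass", "potted plant"]

def objects_to_vector(detected_labels: list) -> np.ndarray:
    """Encode detected objects as a multi-hot vector for the global model."""
    vec = np.zeros(len(VOCABULARY), dtype=np.float32)
    for label in detected_labels:
        if label in VOCABULARY:
            vec[VOCABULARY.index(label)] = 1.0
    return vec

# e.g. a terrace photo with people dining
print(objects_to_vector(["person", "person", "wine glass", "chair"]))
```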

Image Captioning

Rather than a list of objects present in an image, image content can instead be expressed through a natural language description. This description is called a caption, and it can contain physical descriptions of the objects, such as color and relative position, among other properties. The sub-model ExpansionNetV2 can generate captions with a high level of fidelity and is incorporated to provide the global model with this information. The caption it outputs for each image is tokenized together with the SentiBank sentiment attributes before being passed to the textual branch of the global model.
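A minimal sketch of how the two textual outputs might be merged for the text branch; the tokenizer choice and the simple concatenation scheme are assumptions, as they are not specified in this post:

```python
from transformers import AutoTokenizer

# Assumed tokenizer; the actual text encoder used is not specified here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

caption = "a rooftop restaurant with people dining at sunset"   # captioning output
anp_attributes = ["cozy room", "beautiful view", "delicious food"]  # top ANPs

# Concatenate the caption with the top ANP attributes into one input.
text_input = caption + " " + " ".join(anp_attributes)
tokens = tokenizer(text_input, truncation=True, padding="max_length",
                   max_length=64, return_tensors="pt")
```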

3.3 Global Model Architecture

Figure 3. Global Architecture

The architecture presented is a multimodal model that incorporates information from multiple modalities, including images, text, and numeric inputs. It takes advantage of the rich visual information conveyed by images, the textual information expressed through sentiment analysis and captions, and the numerical values of aesthetic scores and memorability. By combining these different modalities, the model is able to capture a more comprehensive understanding of the input data, enabling a holistic analysis of the factors influencing the engagement created by a photo.
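As an illustration only, a fusion model of this kind could look like the sketch below; the layer sizes, input dimensions, and concatenation-based fusion are all assumptions, and Figure 3 shows the actual architecture:

```python
import torch
import torch.nn as nn

class EngagementModel(nn.Module):
    """Sketch of a multimodal regressor: image + text + numeric features."""

    def __init__(self, img_dim=2048, txt_dim=768, num_dim=10):
        # img_dim/txt_dim assume ResNet-style and BERT-style encoders;
        # num_dim could hold e.g. 8 aesthetic attributes + global score
        # + memorability. All dimensions are illustrative.
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU())
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, 256), nn.ReLU())
        self.num_branch = nn.Sequential(nn.Linear(num_dim, 32), nn.ReLU())
        # Fused head regresses a single engagement score.
        self.head = nn.Sequential(
            nn.Linear(256 + 256 + 32, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, img_feat, txt_feat, num_feat):
        fused = torch.cat([
            self.img_branch(img_feat),
            self.txt_branch(txt_feat),
            self.num_branch(num_feat),
        ], dim=-1)
        return self.head(fused)
```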

3.4 Evaluation

Global Model Performance

Table 1. Global model metrics

Based on these results, the model performs well at predicting the engagement of images. The average error between the actual and predicted engagement is very low, indicating that the model is accurate. Both the Spearman rank correlation and the Kendall correlation are high and positive, indicating that the model is consistent and reliable in ranking images by their engagement. Together, these metrics suggest that the model captures the relationship between image features and the engagement score well.
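For reference, these rank metrics can be computed with SciPy as follows (the engagement values below are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

y_true = np.array([0.031, 0.012, 0.048, 0.007, 0.025])  # actual engagement
y_pred = np.array([0.028, 0.015, 0.044, 0.009, 0.022])  # predicted engagement

mae = np.mean(np.abs(y_true - y_pred))       # average error
rho, _ = spearmanr(y_true, y_pred)           # Spearman rank correlation
tau, _ = kendalltau(y_true, y_pred)          # Kendall rank correlation
print(f"MAE={mae:.4f}  Spearman={rho:.3f}  Kendall={tau:.3f}")
```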

Figure 4. Actual Rank vs predicted rank for Global Model

Image Model Performance

The image model performs moderately well in predicting image engagement. The heat map shows that the model is much better at predicting the extremes of the ranking than the middle.

Figure 5. Actual Rank vs predicted rank for Image model

Without Text Performance

When the textual components are removed from the model, accuracy is worse than in the two previous setups. This suggests that the textual branch adds more accuracy to the global model than the image model alone provides.

3.5 Analysis

Memorability

Images with low memorability displayed sparse and scattered attention maps with few minor peaks, whereas images with high memorability exhibited dense and concentrated attention maps with larger and more distinct peaks.

Analysis also shows that the main sources of memorability in images are regions containing human faces and figures. Furthermore, images with unusual or unexpected content had higher memorability than images with common or expected content.

On the other hand, images that do not contain the elements mentioned above, such as landscapes or natural scenes, tend to be less memorable, as reflected by their less intense attention maps.

Figure 6. Memorability heatmap

Aesthetics

The model showed a preference for pictures with high color saturation and diverse content, where the objects were sharp and well-defined, and where there was a clear focal point in the center of the image. The model also favored pictures with human figures (especially when the skin was visible).

The images with the lowest scores were typically dark, with dull colors such as beige, black and white. They also included elements of rooms without any prominent subject (such as pictures of the background of the room, the bathroom or the facilities).

Figure 7. Aesthetic heatmaps

Interestingly, the predicted engagement rate did not appear to have a strong correlation with the aesthetic attributes.

Content Analysis

BERTopic modeling was used to conduct the text analysis and identify the most prevalent topics, based on the frequency of word usage in the generated captions. Applying this caption analysis to the top 20% of the sample yielded the following results:

Table 2. Topics related to the top scored images

The highest engagement scores were associated with images pertaining to the theme of restaurants, food and people.

On the other hand, within the same sample, the topics related to the images with the lowest predicted engagement are zoomed-in images and images of rooms and backgrounds.
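A minimal sketch of the topic-modeling step described above; only the use of BERTopic on the generated captions is stated in this post, so the parameters and sample captions are illustrative:

```python
from bertopic import BERTopic

# captions: generated captions for the top-scoring images (illustrative;
# BERTopic needs a reasonably large corpus to find stable topics)
captions = [
    "people dining at a rooftop restaurant",
    "a plate of food on a wooden table",
    "a hotel room with a large bed",
    # ...
]

topic_model = BERTopic(min_topic_size=5)   # parameter is illustrative
topics, probs = topic_model.fit_transform(captions)
print(topic_model.get_topic_info())        # most prevalent topics by frequency
```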

Sentiment

The following sentiments were obtained for the best-ranked images (top 20%). Feelings of comfort and tranquility within familiar spaces such as rooms, houses, and boats are the most represented sentiments.

Table 3. Sentiments related to the top scored images

3.6 Limitations

The limitations of the model can come from a variety of sources, but the main issues arise from the scraped Instagram images that the model is trained on:

● Irrelevant images on Instagram occasionally drive extremely high engagement rates. Some attempts were made to filter these out, but perfect removal is not realistic.

● The scraped data is missing highly relevant information, such as the number of followers an account had at the time a post was made. Engagement metrics had to be derived from the current number of followers, which leads to inaccurate engagement rates for older photos, as an account’s follower count may have changed drastically over time.

Outside of inherent issues with the dataset, some other limitations are also present for various reasons:

● Other potentially influential details about the posts, such as the hashtags used (and whether those hashtags were trending) and the time of day a post was published, were also not considered.

● Engagement from comments was also treated as a flat positive, without looking deeper at the sentiment of those comments. As a result, controversy-driven engagement is accepted by the model as a valid source of engagement. This, however, may be a type of engagement that hospitality companies would want to avoid.

● Trends evolve, which may require that the model be retrained periodically on the latest data. If not, the model will be unable to adapt to emerging trends, which may have a huge impact on engagement rate.

Due to the limitations stated above, predicting exact engagement rates is difficult. Fortunately, in the context of the hospitality industry, predicting the ranking of the images is where the true value of the model lies, since what we are interested in is choosing the best photos to use to drive engagement. In this aspect the model is performing extremely well.

4. BUSINESS RECOMMENDATIONS & CONCLUSION

Overall, we have observed the following key points that suggest that an image may be more likely to generate engagement:

● Pictures containing people, with faces and skin visible, are preferable

● Well-centered images with the main subject visible, in focus, and with all elements and objects well defined

● Images with intense colors (avoid dull colors such as black & white or beige combinations)

● Pictures mostly related to food and restaurants

● Images that evoke feelings such as calmness or quietness

It is worth mentioning that these recommendations are the result of the analysis of the images, their theme and the text attached to them. For this analysis we have not taken into account the time of day, nor the traffic of users within the social media platforms, which is well known to influence the number of interactions that a post receives.

Likewise, although the present study has focused on the tourism sector, the model’s range of applications can be extended to various industries. We emphasize, however, that the results obtained here are limited to the hotel industry.

5. ACKNOWLEDGEMENTS

We extend our heartfelt gratitude to the ESSEC x Centrale Supélec students and the Data Science & AI team at Accor who contributed to this work: Antoine Cloute, Annabelle Luo, Aiza Avila Canibe, Deepesh Dwivedi, Darius Chua, Siwar Abbes, Assitan Diarra. Their dedication, creativity, and teamwork were crucial to the success of this project. Special thanks are also due to the business team, represented by Christophe Pélé & Asma Baillet, whose sponsorship and vision were the driving force behind this initiative. All the combined efforts have not only advanced our research but also set a high standard for future endeavors. Thank you for your invaluable contributions and hard work.

6. REFERENCES

[1] Aloufi, S., Zhu, S., & El Saddik, A. “On the Prediction of Flickr Image Popularity by Analyzing Heterogeneous Social Sensory Data.” Sensors 17, no. 3 (2017): 631. https://doi.org/10.3390/s17030631.

[2] Borth, D., Ji, R., Chen, T., Breuel, T., & Chang, S. “Large-Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs.” In Proceedings of the 21st ACM International Conference on Multimedia, 223–232. ACM, 2013.

[3] Bylinskii, Z., Isola, P., Bainbridge, C., Torralba, A., & Oliva, A. “Intrinsic and Extrinsic Effects on Image Memorability.” Vision Research 116 (2015): 165–178. Elsevier.

[4] Cantero Priego, J. “Predicting the Number of Likes on Instagram with TensorFlow.” UPCommons, February 17, 2021. https://upcommons.upc.edu/handle/2117/339937.

[5] Chen, T., Borth, D., Darrell, T., & Chang, S. “DeepSentiBank: Visual Sentiment Concept Classification with Deep Convolutional Neural Networks.” arXiv preprint arXiv:1410.8586, 2014.

[6] Fajtl, J., Argyriou, V., Monekosso, D., & Remagnino, P. “AMNet: Memorability Estimation with Attention.” arXiv preprint arXiv:1804.03115, 2018.

[7] Jia, M., Wu, Z., Reiter, A., Cardie, C., Belongie, S., & Lim, S. “Exploring Visual Engagement Signals for Representation Learning.” arXiv preprint arXiv:2104.07767, 2021.

[8] Khosla, A., Raju, A. S., Torralba, A., & Oliva, A. “Understanding and Predicting Image Memorability at a Large Scale.” In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[9] Lei Hou, Xue Pan. “Aesthetics of hotel photos and its impact on consumer engagement: A computer vision approach.” Tourism Management 94 (2023): 104653. https://doi.org/10.1016/j.tourman.2022.104653.

[10] Lennan, C., & Tran, D. “Deep Learning for Classifying Hotel Aesthetics Photos.” NVIDIA Developer Blog. October 30, 2018. Retrieved June 5, 2023, from https://developer.nvidia.com/blog/deep-learning-hotel-aesthetics-photos/.

[11] Malu, G., Bapi, R. S., & Indurkhya, B. “Learning Photography Aesthetics with Deep CNNs.” arXiv preprint arXiv:1707.03981, 2017.

[12] Murray, N., Marchesotti, L., & Perronnin, F. “AVA: A large-scale database for aesthetic visual analysis.” In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2408–2415. Providence, RI, USA, 2012. doi: 10.1109/CVPR.2012.6247954.

[13] Praveen, A., Noorwali, A., Samiayya, D., Zubair Khan, M., Vincent, P. M. D. R., Bashir, A. K., & Alagupandi, V. “ResMem-Net: memory-based deep CNN for image memorability estimation.” PeerJ Comput Sci 7 (2021): e767. doi: 10.7717/peerj-cs.767.

[14] Talebi, H., & Milanfar, P. “NIMA: Neural Image Assessment.” arXiv preprint arXiv:1709.05424, 2018.
