DisplaceNet: Recognising displaced people from images by exploiting their dominance level

Grigorios Kalliatakis
Jul 6 · 4 min read

Every year, millions of men, women and children are forced to leave their homes and seek refuge from wars, human rights violations, persecution, and natural disasters. People were forcibly displaced at a record rate of 44,400 every day throughout 2017, raising the cumulative total to 68.5 million by the year's end and overtaking the total population of the United Kingdom.


Objective

Currently, information extraction from human-rights-related imagery requires manual labour by human rights analysts and advocates. Such analysis is time-consuming, expensive, and emotionally traumatic for the analysts who must focus on images of horrific events.

In this article, we strive to bridge this gap by automating parts of the process: given a single image, we try to label it as either displaced people or non-displaced people.

Problem formulation

Can you label the images above as either displaced people or non-displaced people? Try to label them from the inference results of object detection and/or scene recognition.

Main idea

The degree of control a person has over a situation can be a distinguishing difference between the encoded visual content of an image that depicts a non-violent situation and that of an image displaying displaced people.

Our hypothesis is that the person's level of control over the situation, ranging from submissive / non-control to dominant / in-control, is a powerful cue that can help our network distinguish displaced people from non-violent instances. First, we develop an end-to-end model for recognising rich information about people's emotional states by jointly analysing the person and the whole scene. We use the continuous dimensions of the VAD Emotional State Model [1], which describes emotions using three numerical dimensions: Valence (V), Arousal (A), and Dominance (D). Second, following the estimation of emotional states, we introduce a new method for interpreting the overall dominance level of an entire image based on the emotional states of all individuals in the scene. As a final step, we propose to assign weights to image samples according to the image-to-overall-dominance relevance to guide the prediction of the image classifier.
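To make the weighting step concrete, here is a minimal sketch of how per-person dominance estimates could be aggregated into an image-level dominance value and then used to adjust a classifier's displaced-people probability. The mean aggregation, the 1-10 dominance range, and the blending rule are illustrative assumptions, not the exact formulation from the paper.

import numpy as np

def overall_dominance(vad_scores):
    # vad_scores: (num_people, 3) array of (valence, arousal, dominance) per detected person.
    # A simple mean over the dominance dimension is assumed here.
    vad_scores = np.asarray(vad_scores, dtype=float)
    return float(vad_scores[:, 2].mean())

def reweight_prediction(p_displaced, dominance, low=1.0, high=10.0):
    # Low dominance (submissive / not in control) nudges the score towards the
    # displaced-people class; high dominance nudges it away.
    # The linear mapping and the [low, high] range are illustrative assumptions.
    weight = np.clip((high - dominance) / (high - low), 0.0, 1.0)
    return 0.5 * p_displaced + 0.5 * weight

# Example: three detected people with mostly low dominance scores.
vad = [[4.2, 6.1, 2.5], [3.8, 5.9, 3.0], [5.0, 4.0, 3.5]]
d = overall_dominance(vad)            # = 3.0 on the assumed 1-10 scale
print(reweight_prediction(0.55, d))   # raw 0.55 pushed upwards by the low dominance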

Estimating continuous emotions in VAD space vs overall dominance from the combined body and image features

Model architecture

Model Architecture. Our model consists of an object detection branch, a human-centric branch, and a displaced people branch.
  • Object Detection Branch: localises the boxes containing a human and the object of interaction using RetinaNet [2].
  • Human-centric Branch: estimates a VAD score for each human box and an overall dominance score that characterises the entire image.
  • Displaced People Branch: produces a classification score for the input image and re-adjusts it based on the overall dominance score (see the sketch after this list).
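Putting the three branches together, the inference flow can be summarised with the sketch below. The branch functions are placeholders standing in for the actual networks; only the overall wiring mirrors the description above, and the aggregation and weighting rules are assumptions.

import numpy as np

def displacenet_inference(image, detect_people, estimate_vad, classify_image):
    # detect_people(image)     -> list of person bounding boxes    (object detection branch)
    # estimate_vad(image, box) -> (valence, arousal, dominance)    (human-centric branch)
    # classify_image(image)    -> raw displaced-people probability (displaced people branch)
    boxes = detect_people(image)
    if not boxes:
        return classify_image(image)  # no people detected: fall back to the raw score

    # Image-level dominance: mean of the per-person dominance estimates (assumed rule).
    dominance = float(np.mean([estimate_vad(image, box)[2] for box in boxes]))

    # Re-adjust the raw classification score with the dominance cue,
    # assuming dominance values lie on a 1-10 scale.
    weight = np.clip((10.0 - dominance) / 9.0, 0.0, 1.0)
    return 0.5 * classify_image(image) + 0.5 * weight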

Getting the Data

The Human Rights Archive (HRA) is the core dataset used to train DisplaceNet.

The constructed dataset contains 609 images of displaced people and the same number of non-displaced-people counterparts for training, as well as 100 images collected from the web for testing and validation.

Setting up the System

The following dependencies are required to run this project:

  • Python 2.7+
  • Keras 2.1.5+
  • TensorFlow 1.6.0+
  • HDF5 and h5py (required if you plan on saving/loading Keras models to disk)

Before installing DisplaceNet, please install one of the Keras backend engines: TensorFlow, Theano, or CNTK. We recommend the TensorFlow backend; DisplaceNet has not been tested on the Theano or CNTK backend engines.

Then, you can install DisplaceNet itself.

Install DisplaceNet from the GitHub source (recommended):

$ git clone https://github.com/GKalliatakis/DisplaceNet.git
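After cloning, it can be useful to confirm that the installed versions match the dependency list above. The small check script below is not part of the repository; it only prints the versions of the packages the project needs.

# check_env.py - prints the versions of the dependencies listed above (not part of the repo).
from __future__ import print_function
import sys
import keras
import tensorflow as tf
import h5py

print("Python     :", sys.version.split()[0])  # expect 2.7+
print("Keras      :", keras.__version__)       # expect 2.1.5+
print("TensorFlow :", tf.__version__)          # expect 1.6.0+
print("h5py       :", h5py.__version__)        # needed for saving/loading Keras models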

Inference on new data with pretrained models

To run inference on a single image with DisplaceNet, use the command below.

$ python run_DisplaceNet.py --img_path test_image.jpg \
--hra_model_backend_name VGG16 \
--emotic_model_backend_name VGG16 \
--nb_of_conv_layers_to_fine_tune 1
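To label a whole folder of images rather than a single file, the same script can simply be called in a loop. The wrapper below is a convenience sketch, not part of the repository; the test_images/ folder name is an assumption.

# batch_inference.py - illustrative wrapper that reuses the command shown above.
import glob
import subprocess

for img_path in sorted(glob.glob("test_images/*.jpg")):
    subprocess.check_call([
        "python", "run_DisplaceNet.py",
        "--img_path", img_path,
        "--hra_model_backend_name", "VGG16",
        "--emotic_model_backend_name", "VGG16",
        "--nb_of_conv_layers_to_fine_tune", "1",
    ])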

DisplaceNet vs fine-tuned CNNs

The entire code is available on our GitHub repo, and the full paper is available HERE.

  1. Albert Mehrabian. Framework for a comprehensive description and measurement of emotional states. Genetic, Social, and General Psychology Monographs, 1995.
  2. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
