Who-Where-WhomWith (WWWW): A Facial Recognition Tool for Image-based Data Gathering and Graph Analysis

Lorenzo Romani
Published in Analytics Vidhya
May 13, 2020

In a previous post I showed how easy it is to set up an Elasticsearch instance to store facial encodings and build your own facial recognition database.

In this post I will add a very interesting feature based on the same architecture. So, if you don't know how to set up Elasticsearch and store face encodings for further analysis, go back and set everything up first!

The previous post showed how to search the database for a plausible match starting from a known picture. That is essentially a supervised task and, conceptually, the simplest one. But there is much more that can be done with a facial recognition database, and I believe it is something that could turn out to be very useful in OSINT analysis: we're talking about face clustering.

With face clustering, we are not searching for a match in our database starting from an input image: we are asking our computer to find unique faces (face encodings belonging to the same person) and do something with those faces. This is an unsupervised learning task.
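To make the idea concrete, here is a minimal sketch of this kind of clustering using DBSCAN from scikit-learn, the density-based algorithm whose eps and min_samples parameters are discussed later in this post. The synthetic vectors below are stand-ins for real 128-dimensional face encodings:

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in data: in practice these would be the 128-d encodings
# produced by the face_recognition library.
rng = np.random.default_rng(0)
person_a = rng.normal(0.0, 0.01, size=(10, 128))  # ten sightings of one face
person_b = rng.normal(1.0, 0.01, size=(6, 128))   # six sightings of another
encodings = np.vstack([person_a, person_b])

clt = DBSCAN(metric="euclidean", eps=0.35, min_samples=2).fit(encodings)
labels = clt.labels_                   # one numeric ID per face; -1 means noise
print("unique faces:", len(set(labels) - {-1}))   # prints: unique faces: 2

Each distinct label is a "unique face": the algorithm never learns who the person is, only that several encodings belong together.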

Imagine that you have gathered thousands of images from social media, pertaining to one or more public events attended by many different individuals. Imagine that you want to investigate, for each attendee, which images they appear in, whether the same person attended multiple events, and with whom.

You will need a solution that analyzes every single face shown in the pictures, clusters the faces together, and produces a graph where the relations are unique_face > image.jpg.
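Conceptually, the graph step boils down to a small networkx script like the following sketch; the face IDs and filenames are made up for illustration:

import networkx as nx

# Hypothetical output of the clustering step: face ID -> images it appears in.
faces_in_images = {
    17: ["event1_001.jpg", "event2_045.jpg"],
    42: ["event1_001.jpg"],
}

G = nx.Graph()
for face_id, images in faces_in_images.items():
    for image in images:
        G.add_edge(f"face_{face_id}", image)   # relation: unique_face > image.jpg

nx.write_graphml(G, "face_graph.graphml")      # readable by Gephi

Two faces appearing in the same picture end up two hops apart in the graph, which is exactly what makes the "whom with" question answerable.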

Keep reading and you will be able to do all of this right away!

The first step is finding a bunch of images with many people in them. If you don't have a collection already, you can easily build one using instalooter, an awesome tool that enables you to download data from Instagram. You can install it from the command line:

>> pip install instalooter

Now go to this GitHub repository and download these files:

cluster_faces.py
encode_faces.py


Save the files into a directory and create the following sub-directories inside it:

/output
/temp
/dataset

Now install all the required Python libraries via pip:

pip install opencv-python
pip install scikit-learn
pip install elasticsearch
pip install numpy
pip install cmake
pip install dlib
pip install face_recognition
pip install imutils
pip install networkx

Note that shutil ships with the Python standard library, so there is no need to install it separately; cmake and dlib are listed before face_recognition because the latter depends on them.

You should have some of these libraries already installed if you followed the previous tutorial.

Now, you will need to download some pictures using instalooter. This tool enables you to download pictures not only from users' timelines, but also from hashtags. I have found that users engaged in promoting Ponzi-like or similar schemes are keen to share many pictures of events, so I chose a well-known company (Global Intergold) that is promoted with the #globalintergold hashtag.

From the command line, download all the pictures tagged #globalintergold and save them within the “dataset” directory that you should have previously created. Use this command:

>> instalooter hashtag globalintergold /path-to-dataset-directory

Once you have downloaded a fair amount of pictures (I managed to collect more than 40,000 for that hashtag), you will need to encode the facial vector of each face and store it in your Elasticsearch instance. Use the encode_faces.py script (code here) to do so:

>> python encode_faces.py -d path-to-dataset-directory

For 40,000 images, it will probably take something like 10 to 16 hours to encode every face in your collection.
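For reference, the core of the encoding step looks roughly like the sketch below. It assumes a local Elasticsearch instance and an index called "faces", as in the previous tutorial; the actual index and field names used by encode_faces.py may differ:

import face_recognition
from elasticsearch import Elasticsearch
from imutils import paths

es = Elasticsearch("http://localhost:9200")

for image_path in paths.list_images("dataset"):
    image = face_recognition.load_image_file(image_path)
    boxes = face_recognition.face_locations(image, model="hog")  # CPU-friendly
    for encoding in face_recognition.face_encodings(image, boxes):
        es.index(index="faces", body={
            "image": image_path,
            "encoding": encoding.tolist(),   # 128 floats per detected face
        })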

As soon as the encoding job is completed, you will be ready to roll!

So, launch the cluster_faces.py script (source code here):

>> python cluster_faces.py

At this point, the program will start the face clustering job. Again, some time will be required, since this is a demanding task for your CPU: your PC will need 40 to 60 minutes to cluster 50,000-60,000 facial vectors into unique faces. While doing so, it will assign each unique face a numeric ID.

Note that since we are doing an unsupervised learning job, numeric IDs will be associated with unique faces instead of real names. But this is all we need for further graph analysis.

Once the program has clustered all the faces and assigned a numeric ID to each unique face, it will start editing the images in your /dataset directory, drawing a box around each face with its unique ID printed below the bounding box.

At the end of the job, all the edited images will be stored in the /output directory, which you should have previously created.
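The annotation itself is plain OpenCV work. Here is a minimal sketch drawing one labeled bounding box; a blank canvas stands in for a real photo, and the coordinates and ID are illustrative:

import cv2
import numpy as np

image = np.zeros((300, 300, 3), dtype=np.uint8)  # stand-in for a dataset photo
top, right, bottom, left = 50, 220, 200, 80      # illustrative face box
face_id = 17                                     # cluster label from the clustering job

cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
cv2.putText(image, f"ID {face_id}", (left, bottom + 20),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.imwrite("output/example.jpg", image)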

Before you start clustering the faces, it is a very good idea to make a backup of your original dataset!

The script will also create a graph file named face_graph.graphml in your root directory. This is the most important output. You can open it with Gephi to inspect and investigate all the correlations between unique faces and the pictures in which they appear.

In the case of #globalintergold, I was able to detect more than 6,000 unique faces. Inspecting the graph, you will be able to see the filenames and the unique IDs of the persons that appear in them. To verify the graph, you can open the corresponding files in your /output directory.
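If you prefer to verify relationships programmatically before opening Gephi, networkx can read the file back. The node naming below follows the convention of the earlier sketch and is an assumption about the actual output format:

import networkx as nx

G = nx.read_graphml("face_graph.graphml")

# All pictures in which unique face 17 appears:
print(list(G.neighbors("face_17")))

# Faces that co-occur with face 17 in at least one picture:
for image in G.neighbors("face_17"):
    print(image, "->", [n for n in G.neighbors(image) if n != "face_17"])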

Considerations

No machine learning or facial recognition algorithm is perfect: you will surely find both false positives and false negatives. However, I found this methodology to be very reliable. I have not measured the error rate, but I believe it to be no more than 5-10%.

It is important to stress that the reliability of the clustering job will be affected by the quality and quantity of the data you feed in. The more pictures available for each person, and the higher their quality, the better. But this is something you cannot control when gathering data from social media or the open internet, so you will need to take potential errors into account and verify the relationships found through graph inspection.

Furthermore, in the cluster_faces.py file there are two parameters of utmost importance that you can set manually: eps and min_samples.

eps defines the maximum Euclidean distance between two encodings for them to be considered neighbors. The higher the value, the fewer unique faces you will detect, and faces belonging to different individuals may be clustered under the same ID, producing misleading results. I have found a reliable value for facial encodings to be between 0.35 and 0.37. You can use the epsilon.py script (source code here) to estimate the average distance in your dataset:

>> python epsilon.py

However, I found that the value returned by the script gives good results only if you have many (thousands of) facial vectors stored in the database. I recommend trying a value between 0.35 and 0.40 regardless of what epsilon.py says.
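I do not reproduce epsilon.py here, but a common way to estimate eps, and presumably what the script approximates, is to average the distance from each vector to its nearest neighbour:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in data; in practice, load the encodings from Elasticsearch.
encodings = np.random.default_rng(0).normal(size=(1000, 128))

nn = NearestNeighbors(n_neighbors=2).fit(encodings)
distances, _ = nn.kneighbors(encodings)   # column 0 is each point itself
print("suggested eps:", distances[:, 1].mean())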

The min_samples parameter defines whether a face should be considered a unique ID (a unique cluster) or not. For example, if you set it to 5, any facial vector that does not occur at least 5 times in the dataset will be skipped. This means min_samples can be used to reduce the total number of unique faces shown in the graph and in the /output directory, a useful strategy if your aim is only to find the most important players in your network. The comparison below illustrates the effect.
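Here is what that looks like in practice on synthetic encodings: a face seen only three times survives with min_samples=2 but is discarded as noise with min_samples=5:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
frequent = rng.normal(0.0, 0.01, size=(8, 128))  # a face seen eight times
rare = rng.normal(1.0, 0.01, size=(3, 128))      # a face seen only three times
encodings = np.vstack([frequent, rare])

for m in (2, 5):
    labels = DBSCAN(eps=0.35, min_samples=m).fit(encodings).labels_
    print(f"min_samples={m} -> clusters: {len(set(labels) - {-1})}")
# min_samples=2 -> clusters: 2
# min_samples=5 -> clusters: 1 (the rare face is labeled -1, i.e. noise)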

UPDATE

Some people have asked whether it is possible to use the techniques described herein with videos instead of pictures. Indeed it is. I have updated the GitHub repository with two new Python scripts that allow video analysis and face clustering. The whole process, however, will still be based on images as input for the face encoding and clustering job. In fact, before encoding anything, you will need to run the segment_frames.py script within the same directory as your videos. This script will split each video into frames and save selected frames as .jpg images in the same folder. For example, given a video.mp4 input, you will get .jpg files named like:

/directory/video.mp4___500.jpg
/directory/video.mp4___1000.jpg

where the "___500" string identifies the frame number within the video. The script is currently configured to save every 500th frame, but you can change the parameter to suit your needs. Obviously, the lower the parameter, the more frames you will output and the longer it will take to encode and cluster the faces.
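For reference, the segmentation logic amounts to something like the OpenCV sketch below; this is my own reconstruction of the idea, not the actual segment_frames.py source:

import os
import cv2

EVERY_N = 500                         # save every 500th frame; tune as needed

for name in os.listdir("."):
    if not name.endswith(".mp4"):
        continue
    cap = cv2.VideoCapture(name)
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_no += 1
        if frame_no % EVERY_N == 0:
            # the "___<frame>" suffix keeps the source video recoverable
            cv2.imwrite(f"{name}___{frame_no}.jpg", frame)
    cap.release()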

Once you have segmented all the videos into frames, run the encode_faces.py script just as in the standard procedure. But you will need to use the cluster_faces_video.py script instead of cluster_faces.py in order to generate an output where the edges in the graph are:

unique-face > video.mp4

instead of:

unique-face > frame.jpg

To achieve this, the script simply splits each .jpg frame name into two parts at the "___" separator, where the first part is the name of the source video and the second part is the frame number.

Technically, it creates a Python list for each frame, such as:

['video.mp4', '500.jpg']

and uses only the element at index [0], namely the video name.
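In code, the whole trick is a one-line split; the filename is taken from the example above:

video, frame = "video.mp4___500.jpg".split("___")
print(video)   # "video.mp4" becomes the target node of the graph edge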

That's all. I hope you enjoy the code and, of course, make further improvements!
