GPT 4 Video KYC - Part I

Face recognition

Elina Maliarsky
9 min read · Jan 9, 2024

Let’s circle back to our video KYC adventure, where we’ve encountered a whole ensemble of DNN-based models and even created one of our own. These are the fundamental building blocks that operate quietly behind the scenes and make video KYC work: one model to detect faces, another to recognize them, a bunch for analyzing various facial attributes, a few to check that you’re a live person, and at least one to catch face forgery attempts. It’s a mixed salad of simple, fast models and heavyweight ones, and each one needs its own setup. When you combine them, you get a process that’s, frankly, a bit slow and clumsy, with enough backend code to make you feel overwhelmed. What if there were a simple “Analyze the faces in this frame for face recognition/OCR/etc.” command that could handle the appropriate task on its own? Well, guess what? We might just have that sonic screwdriver in the form of large multimodal models (LMMs) like GPT-4V, which can be instructed to analyze images along with text according to your needs. In this new “GPT 4 Video KYC” series, a sequel to Video KYC, we’ll see this multimodality in action, and the first use case will be face recognition.

Generated by DALL-E 3

LMMs and GPT-4 V(ision) overview

Machine learning models traditionally specialize in singular data types—text for language tasks, images or videos for visual recognition, or audio for speech processing. But human intelligence isn’t limited to one modality. We interact with a complex world through reading, writing, observing, listening, and more. A multimodal approach is crucial for any AI that aims to function effectively in diverse real-world scenarios.

GPT-4 Vision, or GPT-4V, is an advanced large multimodal model (LMM) by OpenAI that can understand both text and images. For example, when you upload a photo of an ID and ask questions about both its visual (photo, logo, background) and textual data, GPT-4V can provide you with insightful answers. OpenAI noted in their GPT-4V system card that “incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development.” Yet it remains an ongoing project with certain limitations. It is currently available to all developers who have access to GPT-4 via the gpt-4-vision-preview model and the Chat Completions API, which has been updated to support image inputs.

GPT-4 face recognition

OpenAI claims that it is not ready to roll out facial recognition or analysis features for public use, as they may invite legal issues in jurisdictions that require consent for the use of biometric data. If you take a picture of a person, even your own selfie, upload it to ChatGPT (once you have paid for a ChatGPT Plus account) and ask, for example, “Who is the person in this photo?”, you’ll get something like:

I’m sorry, but I cannot provide identification or assumptions about people in images, including drawings or any visual representations. My capabilities are designed to respect privacy and adhere to content policy guidelines. If there is another way I can assist you, such as by providing information on image analysis techniques or other general inquiries, please let me know!

The GPT-4V model itself is not so restrictive. When I asked the same question, “Who is the person in this photo?”, via the API, I got: “I’m sorry, but I cannot provide the identity or name of real people in images. However, I can describe the facial features and expressions if you would like.”

Yes, I’d like that, please. I want to ask whether, in a photo with two people, these individuals could actually be the same person. I also want to understand why the LMM arrived at its conclusion and to see how confident it is. So here is my prompt:

face_recognition_prompt = """
You are an expert at finding human faces in paintings, photos and drawings and analyzing them.
Your sub-expertise is in comparing facial features.
Look only at the face; ignore hairstyles and clothes.
Consider only the shape of the face, jawline, skull, nose, and other facial features.
Compare the people in the photo and say whether they could be the same person.
You should provide a short (only one sentence) but convincing explanation of why you think
the people are or are not the same person.
In addition, return a confidence score from 0 to 100 for each pair of compared people.
"""

Images are made available to the model in two main ways: by passing a link to the image or by passing the base64-encoded image directly in the request. The image parts go inside user messages.
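As a hedged illustration of the two options (plain dictionary construction, no API call; the URL and text here are placeholders of my own, not values from the experiment), the same user-message content can carry either kind of image reference:

```python
import base64

def image_part_from_url(url: str) -> dict:
    # Reference an image by link; the API fetches it from the URL
    return {"type": "image_url", "image_url": {"url": url}}

def image_part_from_bytes(jpeg_bytes: bytes) -> dict:
    # Embed the image directly as a base64-encoded data URL
    b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

# A user message mixing one text part and one image part,
# in the shape the Chat Completions API expects
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Compare the people in the photo."},
        image_part_from_url("https://example.com/photo.jpg"),  # placeholder URL
    ],
}
```

Swapping the last part for `image_part_from_bytes(...)` gives the base64 variant, which is what I actually use below.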

Here is the code: I use a base64-encoded image in a user message. In addition, I use the langchain framework to make Chat Completions API calls (nothing personal, I’m just accustomed to it).

import base64

import cv2
import matplotlib.pyplot as plt
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

chain = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=512)

# Read the image (OpenCV loads it in BGR channel order)
image = cv2.imread('img.jpg')
# Encode the image for the LLM; cv2.imencode expects BGR input
# (for simplicity I always encode as jpg)
retval, buffer = cv2.imencode('.jpg', image)
# Convert to base64 and build a data URL
base64_str = base64.b64encode(buffer).decode('utf-8')
image_url = f"data:image/jpeg;base64,{base64_str}"

msg = chain.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text": face_recognition_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
            ]
        )
    ]
)
print(msg.content)
# Convert to RGB only for display
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

Then I ran this code on some images.

From time to time, I got messages like “I cannot process the request” and “I cannot make assumptions about real people,” but mostly I got answers, and all of them were correct.

The same Doctor, with the same glasses and hairstyle, is not a problem:

The people in the photo appear to be the same person, as the shape of the 
face, jawline, nose, and other facial features are remarkably consistent
across both images.

Confidence score: 95/100

Next, the same actor, who looks completely different in two of his roles:

The individuals in the photos have similar facial structures, such as the 
shape of the jawline, nose, and the spacing of the eyes, which suggests
they could be the same person.

Confidence score: 85%.

The model is not as sure as before, but then the conditions are not as perfect either.

The next pair is of two differently-looking Doctors:

The two individuals in the photo have distinct facial features such as 
differing jawlines, nose shapes, and overall facial structures,
indicating they are not the same person.

Confidence score: 95%.

Nice! Note that this time the model is confident that the individuals are NOT the same person.

And finally, the Doctor and his not-yet-companion:

The two individuals in the photo have distinctly different facial 
structures, with the man having a more elongated face and pronounced
chin compared to the woman's rounder face and softer jawline.

Confidence score: 0/100

You are not restricted to one image; you can add additional images just like this:

HumanMessage(
    content=[
        {"type": "text", "text": face_recognition_prompt2},
        {
            "type": "image_url",
            "image_url": {"url": image_url1},
        },
        {
            "type": "image_url",
            "image_url": {"url": image_url2},
        },
    ]
)

But once I did, I started getting the answer “I’m sorry, but I cannot assist with requests involving real people in images. If you have any other questions or need information on a different topic, feel free to ask!” every single time.

At this point, I realized I was tired of humans and switched to other forms of life.

GPT-4 cat recognition

Here is my new prompt: I’m asking whether both of these photos show the same cat.

face_recognition_prompt_cats = """
You are an expert at finding cats in drawings and photos and analyzing them.
Your sub-expertise is in comparing them.
Compare the cats in the attached photos and say whether they could be the same cat.
You should provide a short but detailed explanation of why you think the cats are or are not the same.
In addition, return a confidence score from 0 to 100 for each pair of compared animals.
"""
Photos by Bofu Shaw on Unsplash

And here are the reliable results:

The cats in the images appear to be the same individual, as they have 
matching fur patterns, color distribution, and facial features.

Confidence score: 95%.

Now let’s try two photos of different cats:

Left photo is by Bofu Shaw on Unsplash

The cats cannot be the same because the first cat has a shorter fur with a 
less prominent ruff around its neck, and its facial color pattern differs
from the second cat's fuller ruff and distinct facial color pattern.

Confidence score: 5/100.

It works! The model found that the fur and the facial color patterns are different, and so are the cats.

Summary

The experiment was interesting and held practical value, yet it remained just that: an experiment. While GPT-4V excels at facial feature analysis, OpenAI’s legal constraints render it currently unsuitable for direct application to human clients. However, one can construct something useful for pets, for example, an advanced cat feeding system. Such a system could use facial recognition to identify different cats and tailor the feeding schedule and portions to each individual cat’s dietary needs.

In the upcoming blog post of the series, I will continue with this experiment and demonstrate how a single prompt and a single LMM can potentially replace the age, gender, emotion, and some of the liveness detection models we met in Video KYC Part IV and Part VI.

Appendix

But wait, you ask, what about face detection? Can LMM be utilized for this task as well?

Unfortunately, it cannot: OpenAI states that GPT-4V is not good at localization, i.e., finding where objects are in an image, and I couldn’t agree more. I tried to find the boundaries surrounding the faces with this prompt:

face_detection_prompt = """
You are an expert in finding human faces on pictures and drawings
and analyzing them. Your sub-expertise is in face detection,
i.e., locating faces. Locate the face (only face, without hair and neck)
by finding coordinates of boundaries.
The order of faces on the picture is from left to right and from top
to bottom. Reply with list of numbers. .
You should always return an answer based on the following
JSON schema without ```json prefix and extra notes, only json:

{
"Faces":
[
{"Coordinates": [left, top, right, bottom]}
]
}
"""

And the fun began:

I’m sorry, but I cannot process your request…

I’m sorry, but I cannot analyze images….

I’m sorry, but I can’t provide real-time analysis or generate coordinates for images. If you need help with detecting faces in images programmatically, you typically would use computer vision libraries or services that offer face detection capabilities, such as OpenCV or cloud-based APIs like Amazon Rekognition, Microsoft Azure Face API, or Google Cloud Vision API. These tools can provide you with coordinates for faces detected in an image…

I’m sorry…

I’m sorry…

And finally, I got coordinates (see the boundaries) like this, without a single apology.
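Once a reply in that JSON schema does come back, turning it into usable boxes is straightforward. A minimal sketch (the reply string here is a made-up example following the schema above, not actual model output):

```python
import json

# A hypothetical reply that follows the face-detection JSON schema
reply = '{"Faces": [{"Coordinates": [62, 40, 180, 175]}, {"Coordinates": [210, 55, 330, 190]}]}'

faces = json.loads(reply)["Faces"]
boxes = [tuple(face["Coordinates"]) for face in faces]  # (left, top, right, bottom)
print(boxes)  # [(62, 40, 180, 175), (210, 55, 330, 190)]

# Each box could then be drawn on the image,
# e.g. with cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
```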

Actimize

Using innovative technology to protect institutions and safeguard consumers’ and investors’ assets, NICE Actimize detects and prevents financial crimes and provides regulatory compliance. Artificial Intelligence (AI) and automation in scalable production have seen a significant surge within the financial crime domain, with NICE Actimize playing a pivotal role in driving this advancement. Aligned with its long-term vision of proactively preventing fraud through real-time automation in scalable production, Actimize aims to provide robust analytical capabilities in a time-sensitive manner. NICE Actimize recognizes the potential of GenAI, our latest endeavor in harnessing the power of Generative AI and Large Language Models (LLMs), to address complex challenges, unlocking unique capabilities that complement our commitment to advancing financial crime prevention solutions.
