MiVOLO: a new State-of-the-Art open-source neural network for gender and age estimation from a photo
I want to tell you our story about how an initially routine work task ended up resulting in the creation of an open source, state-of-the-art neural network, scientific research, and a new dataset.
How it all began
Our team, Layer from SaluteDevices, is engaged in a diverse spectrum of computer vision tasks: from everything that can be done with images and videos to complex multi-modal systems. However, one of the most important directions has always been the search for similar products: clothing items, accessories, furniture, food, and more.
The critical stage of the pipeline involves determining the gender* and age of a person in the photo. However, there is no guarantee that a person is present at all, or that they are depicted in full. For instance, we collaborate with the MegaMarket marketplace, where we search for visually similar clothing items and provide recommendations to the user. We also work with other marketplaces, and in their pictures not all clothing items are worn by models.
Anyway, in the majority of projects, a person is present in the photo, at least partially:
And it’s a critical issue if a child is in the photo and we suggest buying adult clothing, or if we accidentally misidentify the gender.
At the same time, it can be quite challenging to determine these attributes from clothing alone (although we do make efforts in that direction, albeit with limited accuracy). Therefore, the most effective approach is to use the information from the person in the photo. Additionally, age information could potentially provide extra insights to enhance the quality of recommendations.
In general, there are many open-source solutions for this task. Before creating our own, we utilized the open model FairFace, choosing it due to its simplicity and relevance to our task. However, the error rate in production was far from perfect: it ranged from 1 to 10% or more across different projects. Consequently, there were potential scenarios where the recommendation wouldn’t match the user at all, leading to missed profits for the business. On the other hand, we had never intended to deal with faces professionally, so we aimed to act wisely and rely on others’ solutions.
But over time, it became clear that tolerating this situation was no longer viable, and most importantly, there wouldn’t likely be anything significantly better suited for our task. The challenge lies in the fact that our visual domain is extremely diverse: from blurry selfies to professional studio shots. The conditions, filters, quality, colors, sizes — everything is completely arbitrary without any assumptions. Facial recognition systems aren’t adapted to such conditions. Moreover, very few in the market address this task. Thus, 3 months ago, we started developing our own approach.
Initially, we simply wanted to somehow mitigate the existing errors. We didn’t set ambitious goals for ourselves in this task, considering it auxiliary.
Baseline
The solution from FairFace has several significant drawbacks, and its classification approach seemed to be the biggest one: firstly, the age ranges were poorly divided for our needs; secondly, problems with borderline cases constantly arose; thirdly, during training, the algorithm treats an error between any pair of classes the same way, whether they are as close as 0–5 years or as distant as 60–70 years, which can hardly benefit generalization ability. There is also no training code available for the model.
The task for the baseline was as follows: a small, efficient model working with faces and predicting both gender and age in a single pass.
Therefore, we started with timm, a vast and famous repository of classification models from Hugging Face, pretrained on large open datasets.
We replaced classification with regression, making the necessary code adjustments. This isn’t covered in the scientific article, as it focuses on the final transformer-based solution, but during this stage, our main experimental model was a pure convolutional neural network, resnext50_32x4d. It is fast and performs well.
Additionally, we used two techniques from the Deep Imbalance Regression paper: Label Distribution Smoothing (LDS) and Feature Distribution Smoothing (FDS). Without going into detail, these approaches significantly compensate for the natural age imbalance in the data. The first technique, in particular, proved to be very useful, allowing us to calculate weights for examples based on their distribution, which are then applied in the MSE (Mean Squared Error) loss function.
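To illustrate, here is a minimal sketch of how LDS-style weights can be computed and plugged into a weighted MSE loss. The function names and the Gaussian-kernel smoothing below are my own simplified reconstruction, not the paper's exact implementation:

```python
import numpy as np

def lds_weights(ages, sigma=2.0, max_age=100):
    """Smooth the empirical age histogram with a Gaussian kernel and
    weight each sample inversely to the smoothed density at its label."""
    hist = np.bincount(ages, minlength=max_age + 1).astype(float)
    grid = np.arange(max_age + 1)
    kernel = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / sigma) ** 2)
    smoothed = kernel @ hist                       # effective label density
    w = 1.0 / np.clip(smoothed[ages], 1e-6, None)  # rare ages -> large weight
    return w / w.mean()                            # keep the loss scale stable

def weighted_mse(pred, target, w):
    return float(np.mean(w * (pred - target) ** 2))

ages = np.array([25, 25, 26, 27, 25, 70])  # 70 is a rare label here
w = lds_weights(ages, sigma=1.0)
```

Under this scheme, under-represented ages receive larger weights, so the model pays a higher price for errors on them.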
The initial experiments demonstrated the promise of the approach. For instance, with minimal tuning, we achieved an MAE (Mean Absolute Error) of around 5.0 on the IMDB-clean dataset for age prediction, where the state-of-the-art was 4.68.
However, we needed more than just age prediction. And when we added the second output, gender, to the model, the performance of resnext began to degrade instead of benefiting from the additional data.
If a multi-task network is properly designed, it typically achieves an increase in accuracy compared to single-task variants: https://arxiv.org/abs/1705.07115
We shifted our focus towards transformers since these models are more versatile and promising. We experimented with various architectures. Many models, such as CaiT and XCiT, worked well. However, a new problem emerged with them — they lacked speed. The first historically successful, fast, and popular visual transformer, ViT, didn’t converge well in our case, most probably due to its data hunger.
After numerous experiments, we finally found the ideal solution: a hybrid model called VOLO.
It combines the strengths of both convolutional and transformer neural networks. In this model, instead of simply dividing the image into patches, a series of convolutional layers is applied first. Additionally, it employs a unique attention block called Outlook Attention, which helps address some of the challenges transformer architectures face when adapting to images.
If we delve into the details, there’s a somewhat strange story here. In the original VOLO description, it is presented as a purely transformer-based model, and there’s no mention of convolutions in the paper. However, in the implementation, convolutional stems are used to prepare patches in the PatchEmbed module. It’s likely that the initial idea was taken from: https://arxiv.org/abs/2104.0113, but I cannot be sure.
VOLO is one of the fastest visual transformer models, both in terms of processing speed and training convergence. Eventually, when we added gender as a second output, we captured the desired effect and achieved an increase in age prediction accuracy too!
Data
It was just the beginning, but back then we thought we would simply pour data into all of this and wrap up the work.
Data became a challenge, especially for age. It’s very difficult to find accurate age labels for photographs, and manual annotation is even more challenging. For understandable reasons: humans themselves don’t estimate ages very precisely, but more on that below. Typically, prior to our efforts, datasets in this task were collected in two main ways: either by using photos of celebrities, for whom it’s relatively easy to estimate age based on photos from various events, or by gathering photos taken in a studio or at a police station. The latter type of data is always very limited, and the former introduces certain biases and variations.
That’s why we decided to start from scratch and annotate the data using the crowdsourcing platform Yandex Toloka. Readers might wonder how we planned to do this, given that we just mentioned the relatively low accuracy of human annotation. Fair point.
Our main bet was on the “wisdom of the crowd”: based on our experience, we believed that by setting the right conditions, we could ensure high-quality answers that, when properly aggregated, would lead to much higher accuracy than that achieved by individual annotators.
Since any crowdsourcing platform is regularly targeted by bots and cheaters, and in order to assess human accuracy later and find the best vote aggregation method, we included a 7th control example in each set of 6. We sourced these control tasks from the IMDB-clean dataset and had ground-truth answers for them. However, the annotators were unaware of this.
If an annotator’s accuracy dropped below a certain threshold, we permanently banned them. Battling dishonest annotators was an ongoing challenge even with a strict quality-control system. After analyzing the results, we had to rework the task from scratch a few times and encountered many interesting nuances, but there isn’t enough space here to delve into those details. In short, thanks to our extensive experience, we eventually managed to get everything on track.
In total, we collected around 500 000 images from our production service and the Open Images Dataset. We decided to release a portion of the data from the latter to finally provide the research community with a truly balanced regression benchmark (see Materials below). We balanced it not only overall but also by gender within 5-year age ranges:
As for vote aggregation, eventually, we tried almost all existing methods:
We didn’t mention it in the article, but we conducted far more experiments than listed in the table, including machine-learning methods. In the end, weighted averaging emerged as the clear winner by a significant margin. The weights in this method were determined by individual annotator errors:
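As an illustration (with hypothetical helper names, not our production code), inverse-error weighted averaging can be sketched like this:

```python
def aggregate_age(votes, annotator_mae):
    """votes: {annotator_id: age guess}; annotator_mae: each annotator's
    mean absolute error measured on the hidden control tasks.
    Each vote is weighted inversely to its author's error."""
    weights = {a: 1.0 / max(annotator_mae[a], 1e-6) for a in votes}
    total = sum(weights.values())
    return sum(votes[a] * weights[a] for a in votes) / total

# an accurate annotator (MAE 2) pulls the estimate towards their guess
print(aggregate_age({"a": 30, "b": 40}, {"a": 2.0, "b": 8.0}))  # 32.0
```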
MiVOLO: Success Breaks Free from Control
Having obtained high-quality data, we realized that we wanted to solve all the issues at once, which meant learning to predict the desired attributes even from photos without faces.
This raised the question of how to train a network to handle both inputs (while keeping the multi-task output at the same time) without losing overall accuracy and, if possible, even enhancing it.
The input size for the network is quite small — 224x224. If we simply feed the entire human image, body with face, the network’s performance on the face — the most reliable signal — is guaranteed to degrade significantly. Increasing the resolution? That’s very costly and would drastically reduce processing speed.
Therefore, it would be ideal to treat these images independently, as two separate inputs through two convolutional stems, and then merge them.
The significant question is how and when to perform this fusion. If we use a late-fusion technique and gather features towards the end of the network, we might sacrifice either speed or accuracy (by resizing the parameters of the branches). Moreover, we wouldn’t be able to leverage transfer learning and pretrained weights due to changes in dimensions. Therefore, early fusion is the best way, with dimensionality reduced back to the original.
We began experimenting with a classic technique, commonly used in convolutional networks — the 1x1 convolutional squeeze layer. It takes 2N channels as input and outputs N.
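For reference, a 1x1 convolution is just a per-pixel linear map over channels; a minimal NumPy sketch of the squeeze (illustrative, not our training code):

```python
import numpy as np

def squeeze_1x1(x, w):
    """x: (2N, H, W) concatenated face+body features; w: (N, 2N) kernel.
    A 1x1 convolution mixes channels independently at every pixel."""
    assert x.shape[0] == w.shape[1]
    return np.einsum("oc,chw->ohw", w, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # 2N = 8 input channels
w = rng.standard_normal((4, 8))      # squeeze down to N = 4
y = squeeze_1x1(x, w)                # -> shape (4, 4, 4)
```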
While this approach worked, it performed less accurately on examples containing faces. It seemed that there was room for improvement. Hence, we tried numerous approaches, including the promising BottleneckAttention. Ultimately, we designed our own module.
The image depicts the core of our solution. We take compressed representations of faces and bodies (essentially patches), perform cross-attention first in one direction and then in the other, then combine the features and reduce the number of channels using a multilayer perceptron (MLP):
Thus, the module addresses three problems at once:
- Using the attention mechanism, features are enriched and become of higher quality.
- The target feature dimensionality is achieved.
- Examples where useful information occupies only a part of the image are effectively processed (see below). Pure transformers struggle with such examples.
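To make the idea concrete, here is a heavily simplified single-head sketch of bidirectional cross-attention followed by an MLP channel reduction. The real MiVOLO module includes learned query/key/value projections and other details omitted here; all names are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens):
    # queries come from one branch, keys/values from the other
    d = q_tokens.shape[-1]
    attn = softmax(q_tokens @ kv_tokens.T / np.sqrt(d))
    return attn @ kv_tokens

def fuse(face, body, w_mlp):
    # enrich each branch with information from the other,
    # concatenate, then reduce channels back to the original D
    face_enriched = face + cross_attention(face, body)
    body_enriched = body + cross_attention(body, face)
    concat = np.concatenate([face_enriched, body_enriched], axis=-1)  # (T, 2D)
    return np.maximum(concat @ w_mlp, 0.0)                            # (T, D)

rng = np.random.default_rng(1)
face = rng.standard_normal((3, 16))  # 3 face patches, D = 16
body = rng.standard_normal((3, 16))  # 3 body patches
fused = fuse(face, body, rng.standard_normal((32, 16)))  # -> shape (3, 16)
```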
The solution quickly demonstrated its effectiveness, finally yielding the desired outcome — the network produced better results in all cases compared to VOLO with a single input.
One final issue remained — training our model took a considerable amount of time. While VOLO required around 300 epochs to converge, MiVOLO needed about 700. This posed a significant time challenge, even on a powerful NVIDIA DGX server, for the full dataset of half a million images. Therefore, we simplified the task for the network.
Instead of starting from an Imagenet checkpoint, we began training directly from our own baseline VOLO checkpoint. As we had weights only for one of the stems, we duplicated them. We also froze the stem responsible for faces, as it was already performing well and there was no need for further adjustments. This variant converged within an additional 250–300 epochs with a reduced learning rate. The final computational costs were manageable, and some experiments were conducted on a basic server equipped with only two NVIDIA A4000 GPUs.
To ensure the network learned to operate in all scenarios and developed a deeper understanding of the task, we applied simple training tricks. One of the most significant was FaceDropout, which randomly replaces the face with an empty image, forcing the network to learn from only body images. The reverse was also applied, but with a lower probability, as not all body images had corresponding face images.
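A minimal sketch of such an augmentation (my reconstruction; the probabilities and names are illustrative, not the actual training hyperparameters):

```python
import numpy as np

def face_dropout(face_img, body_img, p_face=0.1, p_body=0.02, rng=None):
    """With probability p_face, blank out the face crop so the network
    must rely on the body alone; the reverse is applied with a lower
    probability, since not every body crop has a paired face."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_face:
        face_img = np.zeros_like(face_img)
    elif rng.random() < p_body:
        body_img = np.zeros_like(body_img)
    return face_img, body_img
```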
By combining all these factors and utilizing all the collected data, the results were outstanding. We not only surpassed the SOTA results on IMDB but also secured first place on UTKFace without additional data and with minimal changes to the training process.
At this point, the article was almost complete. However, we decided to take a bold step and attempt classification benchmarks. This was daring: it meant we couldn’t use any training examples from these datasets, while at the same time the domains and statistics differ. Moreover, classification tasks often involve unique class ranges, making it uncertain whether simply remapping regression output to classification would work.
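For context, “remapping” here simply means assigning the regressed age to whichever class range contains it. A toy sketch (the ranges approximate the Adience groups; the gap-handling heuristic is my assumption):

```python
ADIENCE_BINS = [(0, 2), (4, 6), (8, 13), (15, 20),
                (25, 32), (38, 43), (48, 53), (60, 100)]

def remap_to_class(age, bins=ADIENCE_BINS):
    """Map a regressed age to the range containing it, falling back to
    the range with the nearest midpoint when the age lands in a gap."""
    for i, (lo, hi) in enumerate(bins):
        if lo <= age <= hi:
            return i
    return min(range(len(bins)),
               key=lambda i: abs(age - (bins[i][0] + bins[i][1]) / 2))
```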
To our surprise, without making any changes, we immediately took first place in one of the most popular and longstanding datasets, Adience, as well as in the recent FairFace dataset. Further victories were challenging to achieve, as other datasets either required complex acquisition processes (such as filling out forms, with authors responding infrequently) or were too small and uninteresting.
The second surprise was the system’s accuracy on examples without faces. For assessment, we masked the face in test examples with a black square and were astounded by the outcome: an MAE of 6.66, which outperformed human accuracy with visible faces (see below)!
Moreover, the network began performing well even on entirely new examples it had never encountered during training. For instance, it could predict accurately when a person’s back was turned. Initially, we reasoned that a face was the most crucial aspect, and without it, the chances of correct prediction were minimal. Therefore, in such cases, it was necessary to attempt a prediction in some way, as this would be better for business than having no prediction at all.
However, the final accuracy did not merely meet the “somehow” criterion, it was more in line with “impressive”:
This indicates that the model has truly learned to generalize: its understanding of the task extends beyond the training data.
All the final results and benchmarks can be found on Papers With Code.
By the way, I haven’t discussed gender accuracy much here, as it’s slightly less complex and slightly less interesting than the age prediction task. Nevertheless, we also took first place on the Adience dataset for gender prediction. Interestingly, the margin of improvement was quite significant:
The human-level accuracy in the age prediction task
Since we had control tasks, measuring human accuracy was not a problem. Without further ado, here’s the histogram:
So, the average error is 7.22 years, and the median doesn’t differ significantly, since the distribution is close to symmetric. This is a very large error.
By the way, the best individual result in the left tail is 4.54.
Interestingly, how well do people estimate their own accuracy? I conducted a survey on my social networks and collected 105 votes, asking respondents to estimate the average error they would expect from people in such a task. Here’s the result:
So our human self-calibration is quite good.
The most important question: how much more accurate is the model compared to humans? It’s hard even to compare:
And the most popular question I’m asked about human accuracy is whether there’s a correlation with the age of the annotator. This is an interesting point. However, it turned out that our crowdsourcing service doesn’t provide this information through the API, though it is available in the web version. Well, I had to use Python Selenium :)
Result:
As seen from the graphs, there is almost no correlation. I did not create a graph for annotators older than 60–65 years due to limited data and fluctuating results. The only notable observation is the reduced accuracy within the 20-year age range. Is this due to youthful impatience during annotation or the perception of individuals older than themselves at that age? I leave it up to the reader to speculate, especially considering that users of the service self-report their ages and no one verifies them.
Another interesting question might be whether the accuracy of annotators improved during the annotation process. Since there was no feedback mechanism other than a permanent ban, this is highly unlikely. Still, I checked just in case, as perhaps some form of unsupervised learning mechanism could be at work in our brains.
However, no miracle occurred. Accuracy even decreased over time (and annotators eventually got banned) in the absence of consequences for careless annotation. If there’s any lesson to be drawn from this, it’s that providing a source of feedback is highly desirable. On the other hand, implementing such feedback is quite labor-intensive and always a balance between desired investment and return. In our case, the game wasn’t worth the candle.
Epilogue
There are many factors that could significantly improve the model. The main one I would emphasize is working on body cropping. The method we used is rather rudimentary, relying on simple image processing. Integrating the recently released Fast SAM, which enables accurate object masking, into the pipeline could bring substantial improvements. Additionally, challenges persist in predicting ages beyond around 70, but this can be addressed through data.
Lastly, I’d like to highlight the current limitations of the model. The model was created to solve the described business tasks. It was not designed to address ethical, philosophical, and other questions, and in many cases, its behavior may be unpredictable. We are in no way responsible for the outcomes generated by the model.
A similar situation applies to various filters, such as effects seen on TikTok or even simple smoothing techniques, which can significantly alter the results. Then again, TikTok itself breaks even more on its own filters :)
Could this model theoretically handle such tasks? Most likely yes, within reasonable physical limits. It is indeed a very powerful model. However, achieving this would require a specific goal and the appropriate data.
Materials
The resources related to this text:
- The article on Arxiv.
- The code repository for validation and inference.
- Demo on Hugging Face.
- Landing page for the new dataset and additional data.
- My LinkedIn and my co-author’s LinkedIn
*In this article and in our paper, “gender recognition” refers to a well-established computer vision problem: the estimation of biological sex from a photo using binary classification. We acknowledge the complexity of gender identification and related issues, which cannot be resolved through analysis of a single photo. We do not want to cause any harm to anyone or offend in any way.