Why is it just as important for machine learning models to be fair as it is for them to be accurate?

Suhas Maddali · Published in Nerd For Tech · Apr 4, 2021 · 5 min read
Machine Learning Models need to be fair

Machine learning has become a buzzword nowadays, and it is gaining rapid traction in industries such as finance, trading, and social media.

New algorithms are constantly being created, and one has to tune different hyperparameters for machine learning algorithms to perform well. Industries are moving into the new AI age, and the business value of many of them is increasing with the aid of machine learning and deep learning.

Machine learning, in general, involves dividing the overall data into two parts: the training data and the test data. We train the machine learning algorithms with the training data and, after training, test them on the test data to measure the models' accuracy on unseen data.
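As a quick illustration, here is a minimal sketch of this train/test workflow using scikit-learn. The dataset is synthetic (generated with make_classification) and the logistic regression model is just a stand-in, purely for illustration:

```python
# Minimal sketch of the train/test workflow; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% of the data as the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # learn only from the training split
print("test accuracy:", model.score(X_test, y_test))  # accuracy on unseen data
```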

When we look at the performance of these machine learning models, we have certain metrics that tell us how the models perform on the test set, such as accuracy, precision, and recall. These give us a good intuition of different machine learning models and their performance. Based on these results, we decide whether or not to deploy them in production, and we take measures to improve the accuracy or recall of the models until we get an output that is considered good in the context of the problem.
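Continuing the sketch above, these metrics can be computed on the test set with scikit-learn's metrics module (again, just an illustrative sketch that reuses the model and test split from before):

```python
# Compute the standard test-set metrics mentioned above.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = model.predict(X_test)

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
```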

However, one key thing people often forget is that machine learning algorithms can also be unfair. In other words, certain groups might be overrepresented while other groups are underrepresented.

Different ways to test fairness

Different scenarios where a machine learning model may be unfair

Consider the problem of predicting whether a person has heart disease. To predict the chances of heart disease, one would consider important features such as blood pressure, heart rate, and glucose levels. Sometimes location also plays a role and can be a deciding factor in whether a person develops heart disease. Therefore, we would use all of these features in our data and feed those values to the machine learning models for prediction. Once we train the machine learning models with the right hyperparameters, we test them on our test set to see how they perform on new, unseen data.
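To make this concrete, here is a small sketch of that workflow. The DataFrame below is fabricated, and the column names are my own stand-ins for the features mentioned above, not a real heart-disease dataset:

```python
# Fabricated heart-disease-style data, purely for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "blood_pressure": [120, 140, 130, 150, 110, 160, 125, 145, 118, 152],
    "heart_rate":     [70, 85, 78, 90, 65, 95, 72, 88, 68, 92],
    "glucose":        [90, 130, 100, 150, 85, 160, 95, 140, 88, 155],
    "location":       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # encoded region
    "heart_disease":  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="heart_disease"), df["heart_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hyperparameters (n_estimators, max_depth) chosen up front for simplicity.
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```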

Once we know the performance of the machine learning algorithm on the test set, we use that metric to check whether other models perform better, and we keep swapping models until we find the best one in terms of our chosen metrics.

One thing to consider is whether the models are fair: do they represent the entire population, or do they represent certain groups while neglecting others? Consider a scenario where a machine learning model performs really well on the test set, with an accuracy of 95%. We still have to check whether that 95% accuracy holds for all groups of people. Suppose, for example, the accuracy is about 98% for people who live in New York but only 90% for people in California. Although the model did well overall at 95%, California residents do not get the right predictions as often and are misdiagnosed by the model more frequently. The strong overall figure is driven by the high accuracy for New York residents, which masks the lower 90% accuracy for California residents. In other words, the model is not fair when predicting heart disease for residents of California.
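A simple way to catch this is to report accuracy per group rather than only overall. The sketch below simulates predictions with a higher error rate for one region; the regions and error rates are made up to mirror the example above, not real measurements:

```python
# Per-group (disaggregated) accuracy check on simulated predictions.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
results = pd.DataFrame({
    "region": rng.choice(["New York", "California"], size=n),
    "y_true": rng.integers(0, 2, size=n),
})

# Simulate a model that errs more often for one region (illustrative only).
error_rate = results["region"].map({"New York": 0.02, "California": 0.10})
flip = rng.random(n) < error_rate
results["y_pred"] = np.where(flip, 1 - results["y_true"], results["y_true"])

print("overall accuracy:", accuracy_score(results["y_true"], results["y_pred"]))
for region, grp in results.groupby("region"):
    print(region, "accuracy:", accuracy_score(grp["y_true"], grp["y_pred"]))
```

The overall number looks fine on its own; only the per-region breakdown reveals the gap.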

This is just one scenario; there are many ways in which machine learning models can be unfair. Consider, for example, face recognition technology that, when used in airports, correctly identifies certain groups of people while misidentifying others. This could severely impact the people subjected to these models. Face recognition technology, when tested with standard metrics, might have achieved a good accuracy (say 95%) on the test set. When we look at particular groups, however, we can find very high accuracy for some groups and low accuracy for others that still average out to a good accuracy on the overall test data.

One more example is when machine learning models are used to filter out candidates based on their job applications and resumés. During the training phase, the models might have achieved good accuracy, recall, or F1 scores. Suppose one model, M1, achieved an accuracy of about 90% while another model, M2, achieved 85%. We might be inclined to deploy M1 in the production environment because it has the higher accuracy. However, we also have to examine how fair the first model is compared to the second. Suppose, for example, that M1 achieved about 95% accuracy for one group but only 80% for the remaining groups, while M2 achieved 85% accuracy for all groups. If we consider not just accuracy but also fairness, the second model represents the different groups of people equally, with similar accuracy for each. If we deployed M1 in a real-world setting, it would perform really well for one group of people but noticeably worse for the others. As a result, it would be best to deploy M2, since it represents all groups equally despite having a lower overall accuracy than M1. Therefore, we have to treat fairness as a metric in its own right and ensure that machine learning models remain fair in their predictions.
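One sketch of this selection logic: compare candidate models by their worst-group accuracy rather than overall accuracy alone. The per-group numbers below are the illustrative ones from this example, not real measurements:

```python
# Choose a model by worst-group accuracy instead of overall accuracy alone.
per_group_accuracy = {
    "M1": {"group_a": 0.95, "group_b": 0.80},  # 90% overall in the example
    "M2": {"group_a": 0.85, "group_b": 0.85},  # 85% overall in the example
}

for name, groups in per_group_accuracy.items():
    print(f"{name}: worst-group accuracy = {min(groups.values()):.2f}")

# M2 wins on the worst-group criterion even though M1 has higher overall
# accuracy, so M2 is the fairer choice to deploy.
best = max(per_group_accuracy, key=lambda m: min(per_group_accuracy[m].values()))
print("deploy:", best)
```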

Conclusion

Considering all the above cases, we see how important fairness is when deploying machine learning models, alongside other metrics such as accuracy, precision, and recall. We have seen scenarios where machine learning models can be accurate yet unfair at the same time: some high-performing models fail to represent certain groups of people while overrepresenting others. Therefore, it is really important to consider fairness and the ethics of AI, in addition to standard metrics such as accuracy, precision, and recall, when deciding whether to deploy machine learning models to production. I hope this article gives good insight into fairness in AI and why it is an important factor when considering whether or not to deploy a model. If you found this article helpful, feel free to comment and clap. Thanks!
