How the choice of training data excludes or includes certain demographics
‘Fairness and Bias in Machine Learning’ is a growing conversation as more people become aware of the impact machine learning has on our daily lives. Machine learning models are increasingly used in industries like finance, employment, education, and even healthcare. One of the biggest challenges in building fair models is obtaining datasets that contain relevant, up-to-date information and are representative of the target population. Acquiring these training datasets is challenging because it is a time-consuming and expensive process that requires a lot of technical knowledge as well as knowledge of key ethical factors. While this may be a difficult task for smaller companies that need to be stricter about costs and labor, it is a non-negotiable part of developing efficient machine learning models. In this article I’ll discuss how missing data impacts the development of fair machine learning models in the finance sector.
Within the finance industry, machine learning models are used to estimate the risk involved in certain investments or in extending credit to borrowers. In fact, the clearest display of ML in finance is the rise of FinTech (Financial Technology) companies. FinTech companies get consent from consumers to access their credit history, which creates a wider pool of available data. They then use this information to produce comprehensive credit scores for borrowers faster and more efficiently. Because FinTechs are able to process borrower information a lot faster than traditional banks, they are more appealing lenders and therefore gain more traction with consumers. Machine learning not only improves borrower evaluations for FinTechs; when used effectively, it also improves more general aspects of the business. FinTech companies that deploy AI across numerous areas of their business reap more benefits, more quickly, than companies that cherry-pick where to deploy models (e.g. chatbots or underwriting). Having interconnected models that reduce expenses in various parts of the business helps to increase profit, which could then be passed on to consumers in the form of lower lending rates.
Machine learning clearly has the potential to revamp the finance industry. The challenge is that there is a lot of room for unfairly biased algorithms, whether because of the implicit bias of the developers or of the training data. Even the most savvy and progressive FinTech borrower evaluation models rely on information that the traditional banking system used to determine credit scores. I recently read a 2015 study on the number of ‘credit invisible’ and ‘credit unscorable’ people in the US. ‘Credit invisible’ people are those who do not have any credit record with the credit bureaus, i.e. the nationwide consumer reporting agencies (NCRAs). ‘Credit unscorable’ people are those whose records with the NCRAs are not sufficient or recent enough for them to be evaluated for borrowing purposes. The study found that 11% of American adults were credit invisible and that 8% were unscorable. Although this seems like a small percentage of the adult population, it is concerning when these metrics are evaluated along racial and ethnic lines. The findings show that Black and Hispanic consumers are more likely to be unscorable from early in their adulthood and to maintain that trajectory for the rest of their adult lives.
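To make the point about racial and ethnic lines concrete, here is a minimal Python sketch of how a team might check who is missing from a credit dataset before training on it. The records, group labels, and function names (`status_shares`, `excluded_share`) are all hypothetical and are not taken from the 2015 study; the idea is simply to compare the share of invisible and unscorable people in each group.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (demographic group, credit status).
# The values are invented for illustration and are NOT from the 2015 study.
records = (
    [("group_a", "scorable")] * 8
    + [("group_a", "invisible")] * 1
    + [("group_a", "unscorable")] * 1
    + [("group_b", "scorable")] * 6
    + [("group_b", "invisible")] * 2
    + [("group_b", "unscorable")] * 2
)


def status_shares(records):
    """Return {group: {status: share of that group's records}}."""
    counts = defaultdict(Counter)
    for group, status in records:
        counts[group][status] += 1
    shares = {}
    for group, status_counts in counts.items():
        total = sum(status_counts.values())
        shares[group] = {s: n / total for s, n in status_counts.items()}
    return shares


def excluded_share(shares, group):
    """Share of a group that is either credit invisible or unscorable."""
    group_shares = shares[group]
    return group_shares.get("invisible", 0.0) + group_shares.get("unscorable", 0.0)


if __name__ == "__main__":
    shares = status_shares(records)
    for group in sorted(shares):
        print(group, round(excluded_share(shares, group), 2))
```

If one group’s excluded share is much larger than another’s, a model trained only on the scorable records will see far less of that group, which is the kind of gap the study describes.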
It may be debatable whether an individual needs to maintain a traceable credit record. However, when we consider the effect this has on one’s access to credit for personal investments (e.g. a mortgage or a business loan), it becomes imperative that credit documentation exists. Traditional banks are often hesitant to issue loans without a credit score because they’re unable to evaluate an individual’s risk. Oftentimes this results in either denial of the loan application or a loan granted at a higher-than-average premium. This practice limits the number of people who have access to credit, the very credit that is necessary to improve their living standards and fuel wealth generation. There is an undeniable correlation between the large number of ‘credit invisible’ and ‘unscorable’ people from minority groups and their limited access to credit from financial institutions. The question is: ‘How will machine learning and big data change that?’
At the moment, FinTech companies have a lot of access to consumer information through consumers’ digital footprints. They can use this information to supplement or substitute for the information they have from traditional sources like credit bureaus, and create more comprehensive models that are able to score previously unscorable borrowers, thereby increasing access to credit. This is a great next step in terms of extending credit to previously underserved communities. Machine learning has the ability to learn correlations between modern digital information, traditional credit scores, and default rates.
Digital information seems to be the answer, right? I mean, everyone is represented on the internet somehow, through social media, subscription services, or even web searches. However, the nature of your online representation also says a lot about features like your education, socio-economic status, spending habits, etc. Therefore there is a possibility of invisibility in the digital space as well, depending on the features the developers choose to use to assess creditworthiness. Does the level of education listed on your LinkedIn profile make you more creditworthy? Does your Facebook post on your obsession with buying new shoes raise a red flag? Is the absence of a subscription service a show of frugality or low income? There’s a lot that the information online could say about an individual. There’s a lot that the lack of information online also says about an individual. Just as the traditional methods excluded groups of individuals with undocumented credit histories, the newer methods of evaluation could create new socio-economic rifts among the next wave of borrowers. There is a possibility of a new kind of invisible and/or unscorable borrower. It is imperative that machine learning developers in the financial sector, as well as policy makers interested in the ethics and possible outcomes of these models, come together and discuss how we can get representative datasets that don’t exclude minority groups from credit.
Note: Minority groups may not always be divided by protected attributes such as race, gender, sexuality or even class. However, features selected for model development may serve as proxies for these protected attributes.
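The proxy concern in the note above can be checked, very roughly, by comparing how often a candidate feature appears in each group. The Python sketch below is only an illustration under stated assumptions: the rows, the `has_subscription` feature, the group labels, and the helper names (`feature_rate_by_group`, `max_rate_gap`) are all hypothetical, and a large gap is a prompt for closer review rather than proof that a feature is a proxy.

```python
# Hypothetical applicant rows, invented for illustration only.
rows = [
    {"group": "group_a", "has_subscription": True},
    {"group": "group_a", "has_subscription": True},
    {"group": "group_a", "has_subscription": True},
    {"group": "group_a", "has_subscription": False},
    {"group": "group_b", "has_subscription": True},
    {"group": "group_b", "has_subscription": False},
    {"group": "group_b", "has_subscription": False},
    {"group": "group_b", "has_subscription": False},
]


def feature_rate_by_group(rows, feature, group_key="group"):
    """Return {group: fraction of that group's rows where the feature is truthy}."""
    totals, hits = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (1 if row[feature] else 0)
    return {group: hits[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in feature rate between any two groups."""
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    rates = feature_rate_by_group(rows, "has_subscription")
    print(rates, max_rate_gap(rates))
```

A feature whose rate differs sharply between groups can carry information about group membership into a model even when the protected attribute itself is never used as an input.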