
Machine Learning Models for Income Prediction on US Census Data

Explainability and Performance Improvement through Intel Optimized Software

Oct 16, 2023

Nithiwat Siripatrachai and Niroopshankar Ammbashankar, Intel Corporation

Machine learning (ML) models are often used as black boxes: input goes in, a prediction comes out, with limited insight into how the model arrived at it. This is where Explainable AI (XAI) comes in. XAI aims to explain how models make predictions and to evaluate their fairness, and it is becoming increasingly important in ML model development.

In this article, we present a high-level overview of our ML pipeline, focusing primarily on the XAI results and on the performance improvements possible with Intel optimized software. To begin, we trained XGBoost regressors to predict income from US Census data. We then discuss how XAI can be used to analyze model predictions and detect bias. Finally, we demonstrate the performance improvement from running the workload on 4th Generation Intel Xeon Scalable Processors.

We use the term “protected class” to describe features that the model should handle without exhibiting bias. Depending on the context, protected classes could include age, sex, gender, race, ethnicity, sexual orientation, religion, and marital status. For this project, the US Census data contains the sex (i.e., male or female) and age of respondents. While age could be considered a protected class, in this work we use it as a proxy for work experience: older respondents generally have more experience, which could correlate with higher incomes. Ideally, the ML model should be unbiased with respect to the protected class and provide consistent predictions for males and females, all other factors being equal. We will use an XAI tool to analyze our model’s prediction process.

Modeling

This example uses the Intel Distribution of Modin for data import and manipulation, Intel Extension for Scikit-learn for data processing, XGBoost 1.6.2, and TruEra XAI toolkit 9.3.1. The pipeline is shown in Figure 1.

Figure 1. Summary of our ML pipeline and dependencies in each step

The US Census data is 7.3 GB on disk and can be downloaded from IPUMS USA (https://usa.ipums.org/usa/). The census is conducted every ten years; we used data from 1970 to 2010. Each record includes the respondent’s age, sex, and education, plus information about other household members, for 45 features in total. We trained XGBoost regressors to predict individual income (the target value is converted to constant 1999 dollars).
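As a rough sketch of this step (the file name, the IPUMS income column `INCTOT`, and the hyperparameters are our assumptions for illustration, not the project’s exact code):

```python
import modin.pandas as pd  # drop-in replacement for pandas
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical per-year extract of the IPUMS USA data
df = pd.read_csv("census_1970.csv")

X = df.drop(columns=["INCTOT"])  # the remaining input features
y = df["INCTOT"]                 # individual income in constant 1999 dollars

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.1)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```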

Before we discuss explainability, let’s define feature importance and fairness. The former is the average influence of a feature on the model’s predictions; the latter is the model’s ability to make fair predictions across protected features. Explainability can be achieved by computing Shapley values, a concept first introduced in cooperative game theory. A feature’s Shapley value represents its expected contribution to the prediction, which helps increase the transparency, fairness, and interpretability of ML models. Because computing Shapley values is expensive, a subset of the dataset is used to make baseline predictions. With the computed Shapley values, we can assess how each feature contributes to the final prediction for each sample, and the aggregated values show how fairly the model predicts income for the protected class.
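TruEra’s toolkit is proprietary, so as a stand-in illustration, here is how the same idea looks with the open-source `shap` library, continuing from the training sketch above (the background-sample size and the `SEX` column name are our assumptions):

```python
import shap

# Use a modest background sample as the baseline for expected predictions;
# computing exact Shapley values against the full dataset is too expensive.
background = X_train.sample(n=1000, random_state=0).to_numpy()

explainer = shap.TreeExplainer(model, data=background)
shap_values = explainer.shap_values(X_test.to_numpy())  # (n_samples, n_features)

# Per-sample contribution of the protected feature "SEX" to each prediction
sex_idx = list(X_test.columns).index("SEX")
print(shap_values[:5, sex_idx])
```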

Explainability

To capture the characteristics of the data and analyze the predictions, we trained a distinct model for each census year from 1970 to 2010, then used the TruEra XAI tool to analyze model predictions, feature importance, and fairness through the years. Let’s examine the aggregated feature contributions and predictions for female respondents (Figure 2). The importance of the protected feature “sex” decreased over the years, dropping from 28.72% in 1970 to 14.85% in 2010, suggesting that the model’s dependence on this feature has decreased (Figure 2a). On average, females are predicted to have an income disadvantage compared to their male counterparts (Figure 2b). However, the disadvantage decreases over time, indicating that the models become “fairer” and rely less on the protected feature to fit the data.

Figure 2. (a) Percent importance of sex on model prediction (b) female disadvantage in income
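Concretely, the percentage importance in Figure 2a can be computed as a feature’s share of the mean absolute Shapley values, and the disadvantage in Figure 2b as the mean contribution of “sex” among female respondents. A sketch, continuing from the code above (the IPUMS convention of coding female as 2 should be verified against the actual extract):

```python
import numpy as np

# Percent importance of each feature: its share of the total mean |SHAP| value
mean_abs = np.abs(shap_values).mean(axis=0)
pct_importance = 100 * mean_abs / mean_abs.sum()
print(f"SEX importance: {pct_importance[sex_idx]:.2f}%")

# Average income (dis)advantage attributed to sex for female respondents
is_female = (X_test["SEX"] == 2).to_numpy()  # IPUMS codes female as 2
print("Mean SHAP contribution of SEX for females:",
      shap_values[is_female, sex_idx].mean())
```

Repeating this for each census-year model yields the trends plotted in Figure 2.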

We can say that the trained models capture the inherent historical and societal income disadvantage among females (Figure 2). Possible explanations may stem from societal changes from 1970 to 2010 in one or more of the following aspects:

  • more women entering the workforce,
  • an increase in the number of hours women work,
  • better education for women, which improves competitiveness, and
  • more opportunities across various fields.

It should be noted that the model isn’t necessarily unfair or biased just because female respondents are at a disadvantage (i.e., females tend to have lower total incomes, all else being equal). What we see from the TruEra XAI analysis is the model picking up the inherent characteristics and structure of the data. A model trained on data lacking granularity and depth can perpetuate or even exacerbate bias, because such data restricts the model’s ability to learn and make fair, informed predictions. For example, the incomes reported in the US Census data do not account for the number of hours worked, so someone who works part-time would likely report a lower income. Additional factors that may affect income but are not recorded in the data include job title, type of work, and field or industry. Arguably, these features are more important than sex for making accurate and “fair” income predictions.

Furthermore, it is important to acknowledge that the dataset used to train these models is representative only of the years in question; relying solely on it for income prediction in ML applications may lead to inaccurate or otherwise problematic results. Finally, to address potential biases, developers should conduct additional investigations and consider excluding protected features from training, as sketched below.
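Excluding the protected feature is a small change at training time. A sketch continuing from the earlier code:

```python
# Retrain without the protected feature and compare held-out accuracy
X_train_ns = X_train.drop(columns=["SEX"])
X_test_ns = X_test.drop(columns=["SEX"])

model_ns = xgb.XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.1)
model_ns.fit(X_train_ns, y_train)

print("R^2 with SEX:   ", model.score(X_test, y_test))
print("R^2 without SEX:", model_ns.score(X_test_ns, y_test))
```

Note that dropping the column does not remove features correlated with sex, so this is a first step rather than a complete fix.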

Performance

We optimized the pipeline with Intel software stacks and repeated the analysis for each model, one trained per census year from 1970 to 2010. Moving from 3rd Generation to 4th Generation Intel Xeon Scalable Processors gives a 1.45x speedup for the entire pipeline (Table 1). The Intel Distribution of Modin significantly improved data ingestion and handling throughout the pipeline, while the Intel Extension for Scikit-learn accelerated data processing. Intel optimized software does not require major changes to an existing ML pipeline; it’s usually enough to change only a few lines of code, as sketched below.

Table 1. Generation-over-generation speedup for each step of the pipeline
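For example, enabling both optimizations typically amounts to swapping an import and adding a patch call, a minimal sketch:

```python
# Before: import pandas as pd
import modin.pandas as pd  # drop-in replacement that parallelizes pandas operations

# Patch scikit-learn with Intel-optimized implementations; call this
# before importing the scikit-learn modules you want accelerated.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.model_selection import train_test_split  # optimized where available
```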

Conclusion

We developed XGBoost regressors, one trained on US Census data for each census year from 1970 to 2010, then used TruEra XAI to analyze the feature importance and fairness of the models. Based on the results, the models appear to have captured real differences between male and female respondents: they display income prediction disparities favoring males, with the disparity gradually diminishing over the years. This trend could reflect evolving societal structures as well as the limited granularity and detail of the dataset. Excluding protected features may address possible biases in the model. Finally, we showed a 1.45x performance improvement on 4th Generation Intel Xeon Scalable Processors by using Intel optimized software to run the pipeline.
