Reverse Engineering FICO 8

Michael Fowlie
13 min read · Dec 29, 2022

The FICO algorithm is proprietary and closely guarded, leading many to speculate about exactly how it works. We explore the use of machine learning techniques to reverse engineer the FICO algorithm and better understand how it operates. By analyzing a large dataset of credit data and applying machine learning algorithms, we gain insight into the factors that influence an individual’s FICO score. By shedding light on the inner workings of the FICO algorithm, we hope to provide a more complete understanding of the credit scoring process and its impact on consumers.

About FICO 8

FICO scores are used in over 90% of credit decisions in the USA. Most lenders use them to assess the risk associated with extending credit to an individual borrower. FICO is a family of models from the Fair Isaac Corporation (FICO) for scoring data from the different credit bureaus. The most common variant is FICO 8, although newer versions exist.

FICO 8 scores range from 300 to 850, with higher scores indicating a lower default risk. The model takes into account these factors:

  1. Payment history: Whether the borrower has made their payments on time, as well as any payments 30+ days past due, or collections.
  2. Amounts owed: Total amount of debt the borrower has, as well as the amount of available credit utilized.
  3. Length of credit history: How long the borrower has had credit accounts and how long it has been since their accounts were used.
  4. New credit: Any new credit accounts the borrower has opened, as well as any recent credit inquiries.
  5. Credit mix: The types of credit accounts the borrower has, such as credit cards, mortgages, and auto loans.

Each FICO version is designed to be more predictive of credit risk than prior versions. Newer versions are also more transparent, with more information about how they work provided to both consumers and lenders.

There are newer versions of FICO, such as FICO 9 and FICO 10, but these aren’t yet commonly used in the industry.

About the dataset

The dataset, along with its data dictionary, is widely available online, but LendingClub has since taken it down from the official download site.

It contains scores from 612 to 847, with a large number of scores towards the lower end of that range. There is a range of subprime scores that is not represented in this dataset.

OLS Model

Plotted against score, the major factors mostly agree with the common understanding of how these factors affect scores. Note that because all else is not equal, these charts do not isolate the effect of each variable on the score.

We first analyze FICO 8 using an ordinary least squares (OLS) linear regression model. After feature engineering, we dropped the columns that had insignificant p-values. The resulting model explains 64% of the variance, and its error has a standard deviation of 19 points. 95% of the predictions are within 40 points of the actual score.
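As a quick consistency check on the two error figures above, the sketch below asks what a 19-point standard deviation would imply if the errors were normally distributed. That normality is an assumption made only for this check; the kurtosis of 7.275 in the regression output further down suggests heavier tails than a normal distribution, so the real band is somewhat wider than this estimate.

```python
from statistics import NormalDist

RESIDUAL_STD = 19.0  # points, as reported for the OLS model

# Half-width of a two-sided 95% band under a normal assumption: z * sigma.
z_95 = NormalDist().inv_cdf(0.975)
half_width = z_95 * RESIDUAL_STD

# Share of predictions expected within 40 points under the same assumption.
share_within_40 = 2 * NormalDist(0, RESIDUAL_STD).cdf(40) - 1

print(f"95% half-width: {half_width:.1f} points")
print(f"share within 40 points: {share_within_40:.3f}")
```

Under this assumption the 95% band comes out a little under 40 points, which is in line with the figure quoted above.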

Age of credit history is a major factor for FICO 8.

The linear regression indicated that FICO score goes down with a longer credit history, which doesn’t make sense. However, a longer history of either installment accounts or revolving accounts is associated with a higher score. The OLS model doesn’t properly explain the age factor.

The linear regression indicated that FICO score goes down by 1 point for roughly every 3 percentage points of revolving credit utilization.

As the number of tradelines (such as credit cards) that are past due increases, score tends to decrease. The dataset had values larger than 2 as well, and for large values the trend becomes non-linear.

  • Each trade line 30 days past due decreases the score by 8 points.
  • Each trade line 120 days past due decreases the score by 13 points.
  • On the first delinquency, an additional 22 points is lost.
  • If any delinquency is within the last 12 months, an additional 8 points is lost.
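The four rules above can be combined into a small helper. This is only a sketch of the linear OLS reading, using the rounded point values from the bullets; the function and parameter names are hypothetical, and as noted above the trend stops being linear for larger counts.

```python
def delinquency_penalty(n_30dpd=0, n_120dpd=0, ever_delinquent=False,
                        delinquent_last_12m=False):
    """Approximate points lost to delinquencies under the OLS reading.

    n_30dpd             -- tradelines currently 30 days past due
    n_120dpd            -- tradelines currently 120 days past due
    ever_delinquent     -- True if the file has any delinquency at all
    delinquent_last_12m -- True if any delinquency is within the last 12 months
    """
    penalty = 8 * n_30dpd + 13 * n_120dpd      # per-tradeline penalties
    if ever_delinquent or n_30dpd or n_120dpd:
        penalty += 22                          # one-off loss on the first delinquency
    if delinquent_last_12m:
        penalty += 8                           # recency penalty
    return penalty


# One card currently 30 days late, and that delinquency is recent.
print(delinquency_penalty(n_30dpd=1, delinquent_last_12m=True))
```

For that example the three terms are 8 + 22 + 8, so a single recent late payment on an otherwise clean file costs far more than the per-tradeline figure alone suggests.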

The more credit cards that are over 75% utilized, the lower your score. This again is mostly linear.

The effect is negligible: each additional 27% of credit cards over 75% utilization decreases the score by 1%. Note that this is in addition to the utilization factor.

Here’s an interesting one. People with more installment loans look like they have a lower credit score. But all else equal, under the OLS model they have a score that goes up by a small fraction with each additional installment loan.

The more inquiries you have, the lower your score, and the more recent they are, the more damage they do. Each inquiry in the last 6 months dings you by about 1.1 points, each one within 6–12 months by 0.8 points, and each one within 12–24 months by 0.13 points.

mths_since_recent_inq has a positive coefficient (about 0.21 points per month), which means the longer it has been since the most recent inquiry, the higher the score. That is consistent with recent inquiries doing the most damage.

OLS Regression Model

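The OLS model has some obvious problems, such as the signs of some coefficients. pub_rec, for example, has a positive coefficient, which doesn’t make sense: more public records should not raise a score. The large condition number flagged in the notes under the table points to multicollinearity, which can produce sign flips like this.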

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.640
Model:                            OLS   Adj. R-squared:                  0.640
Method:                 Least Squares   F-statistic:                 1.370e+04
Date:                Wed, 28 Dec 2022   Prob (F-statistic):               0.00
Time:                        11:50:35   Log-Likelihood:            -2.1834e+06
No. Observations:              499990   AIC:                         4.367e+06
Df Residuals:                  499924   BIC:                         4.368e+06
Df Model:                          65
Covariance Type:            nonrobust
==================================================================================================
                                       coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------------------
const                              631.7101      0.562   1123.574      0.000    630.608    632.812
delinq_2yrs                         -0.8341      0.052    -15.962      0.000     -0.937     -0.732
inq_last_6mths                      -0.3191      0.039     -8.095      0.000     -0.396     -0.242
mths_since_last_delinq               0.0765      0.001    149.673      0.000      0.075      0.077
mths_since_last_record               0.2267      0.002    111.385      0.000      0.223      0.231
open_acc                            -1.9251      0.106    -18.090      0.000     -2.134     -1.716
pub_rec                              2.8057      0.146     19.210      0.000      2.519      3.092
revol_bal                            0.0002   3.62e-06     42.274      0.000      0.000      0.000
revol_util                          -0.3568      0.002   -150.608      0.000     -0.361     -0.352
total_acc                            0.1451      0.010     13.973      0.000      0.125      0.165
collections_12_mths_ex_med          -6.1545      0.177    -34.704      0.000     -6.502     -5.807
mths_since_last_major_derog          0.0268      0.000     54.407      0.000      0.026      0.028
tot_coll_amt                     -2.535e-05   2.31e-06    -10.991      0.000  -2.99e-05  -2.08e-05
tot_cur_bal                      -5.664e-05   1.07e-06    -52.724      0.000  -5.87e-05  -5.45e-05
open_acc_6m                         -0.1998      0.076     -2.619      0.009     -0.349     -0.050
open_il_12m                          1.1109      0.085     13.019      0.000      0.944      1.278
mths_since_rcnt_il                  -0.0182      0.000    -40.456      0.000     -0.019     -0.017
il_util                              0.0202      0.002      8.276      0.000      0.015      0.025
open_rv_12m                          0.3279      0.078      4.191      0.000      0.175      0.481
open_rv_24m                         -0.5537      0.071     -7.817      0.000     -0.692     -0.415
max_bal_bc                        2.179e-05   1.23e-05      1.770      0.077  -2.34e-06   4.59e-05
all_util                            -0.1343      0.003    -40.047      0.000     -0.141     -0.128
total_rev_hi_lim                  9.087e-06    2.4e-06      3.784      0.000   4.38e-06   1.38e-05
inq_fi                              -0.1383      0.050     -2.772      0.006     -0.236     -0.041
total_cu_tl                         -0.2509      0.053     -4.759      0.000     -0.354     -0.148
inq_last_12m                        -0.6809      0.034    -20.270      0.000     -0.747     -0.615
acc_open_past_24mths                -0.6526      0.016    -41.485      0.000     -0.683     -0.622
avg_cur_bal                          0.0001   3.93e-06     36.866      0.000      0.000      0.000
bc_open_to_buy                       0.0005   5.38e-06     99.118      0.000      0.001      0.001
bc_util                             -0.0525      0.002    -21.189      0.000     -0.057     -0.048
chargeoff_within_12_mths            -3.2343      0.246    -13.148      0.000     -3.716     -2.752
mo_sin_old_il_acct                   0.0273      0.000     61.115      0.000      0.026      0.028
mo_sin_old_rev_tl_op                 0.0689      0.001    136.105      0.000      0.068      0.070
mo_sin_rcnt_rev_tl_op                0.0962      0.002     44.271      0.000      0.092      0.100
mo_sin_rcnt_tl                       0.0500      0.004     12.291      0.000      0.042      0.058
mort_acc                             0.4472      0.021     21.086      0.000      0.406      0.489
mths_since_recent_bc                -0.0163      0.000    -33.729      0.000     -0.017     -0.015
mths_since_recent_inq                0.2098      0.004     48.544      0.000      0.201      0.218
mths_since_recent_revol_delinq       0.0096      0.001     15.198      0.000      0.008      0.011
num_accts_ever_120_pd               -0.6902      0.028    -24.250      0.000     -0.746     -0.634
num_actv_bc_tl                      -0.9165      0.034    -26.945      0.000     -0.983     -0.850
num_actv_rev_tl                      0.8061      0.048     16.710      0.000      0.712      0.901
num_bc_sats                         -0.1056      0.026     -4.068      0.000     -0.156     -0.055
num_bc_tl                           -0.2554      0.016    -15.667      0.000     -0.287     -0.223
num_il_tl                           -0.2500      0.012    -20.962      0.000     -0.273     -0.227
num_op_rev_tl                       -0.5263      0.022    -23.908      0.000     -0.569     -0.483
num_rev_tl_bal_gt_0                 -1.6938      0.051    -33.527      0.000     -1.793     -1.595
num_sats                             2.6087      0.107     24.461      0.000      2.400      2.818
num_tl_120dpd_2m                    -5.3243      1.004     -5.303      0.000     -7.292     -3.356
num_tl_30dpd                        -7.7708      0.448    -17.352      0.000     -8.649     -6.893
num_tl_90g_dpd_24m                   1.3891      0.080     17.324      0.000      1.232      1.546
num_tl_op_past_12m                  -0.7651      0.026    -29.348      0.000     -0.816     -0.714
pct_tl_nvr_dlq                       0.3480      0.005     70.469      0.000      0.338      0.358
percent_bc_gt_75                    -0.0830      0.001    -58.711      0.000     -0.086     -0.080
pub_rec_bankruptcies               -12.4333      0.146    -85.429      0.000    -12.719    -12.148
tax_liens                           -0.9490      0.147     -6.449      0.000     -1.237     -0.661
tot_hi_cred_lim                   5.618e-05   9.18e-07     61.183      0.000   5.44e-05    5.8e-05
total_bal_ex_mort                   -0.0002   2.16e-06    -93.167      0.000     -0.000     -0.000
total_bc_limit                      -0.0001   3.71e-06    -37.516      0.000     -0.000     -0.000
total_il_high_credit_limit           0.0002   2.11e-06     99.116      0.000      0.000      0.000
age_earliest_cr_line                -0.0368      0.001    -63.867      0.000     -0.038     -0.036
is_clean                            22.1202      0.126    175.156      0.000     21.873     22.368
num_tradelines                       0.4314      0.069      6.262      0.000      0.296      0.566
is_thick                            -1.8869      0.190     -9.930      0.000     -2.259     -1.515
has_recent_pr                        3.6187      0.548      6.606      0.000      2.545      4.692
has_recent_delinq                   -8.3508      0.120    -69.329      0.000     -8.587     -8.115
==============================================================================
Omnibus:                    48888.324   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           381413.138
Skew:                           0.086   Prob(JB):                         0.00
Kurtosis:                       7.275   Cond. No.                     1.26e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.26e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
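To make the table easier to read, the sketch below multiplies a few coefficients from it by the values of a hypothetical borrower, giving each feature's contribution in points. The profile values are invented for illustration, and only four of the 65 features are included, so the numbers show how to read individual coefficients rather than a full prediction. Given the multicollinearity warning above, individual coefficients should be treated with caution in any case.

```python
# Coefficients copied from the OLS table above (points per unit of the feature).
COEF = {
    "revol_util": -0.3568,           # revolving utilization, in percent
    "inq_last_6mths": -0.3191,       # inquiries in the last 6 months
    "inq_last_12m": -0.6809,         # inquiries in the last 12 months
    "mo_sin_old_rev_tl_op": 0.0689,  # months since oldest revolving account opened
}

# Hypothetical borrower, for illustration only.
profile = {
    "revol_util": 30,
    "inq_last_6mths": 2,
    "inq_last_12m": 3,
    "mo_sin_old_rev_tl_op": 120,
}

# Each feature's contribution to the prediction is coefficient * value.
contributions = {name: COEF[name] * value for name, value in profile.items()}
for name, points in contributions.items():
    print(f"{name:22s} {points:+8.2f}")

# Change in utilization needed to move the prediction by one point.
pct_per_point = 1 / abs(COEF["revol_util"])
print(f"{pct_per_point:.1f} percentage points of utilization per point")
```

The last line is where the "1 point for roughly every 3% of utilization" rule of thumb comes from.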

Machine Learning Model

Predicting the FICO score is done using a multi-task learning model. The model is trained to perform multiple tasks at once, where each task is to predict a subscore for one of the five categories: payment history, amounts owed, history length, new credit, and credit mix. Then a final model predicts the estimated FICO score given the 5 subscores.

The model has a mean absolute error of 14 points; in other words, the average absolute difference between the model’s prediction and the actual FICO score is 14 points.
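For reference, mean absolute error is computed as below. The scores in the example are made up to show the calculation and are not taken from the dataset.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted scores.
actual = [700, 650, 800]
predicted = [710, 640, 786]
print(mean_absolute_error(actual, predicted))
```

Unlike the standard deviation quoted for the OLS model, this metric does not square the errors, so a few large misses weigh on it less heavily.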

Subscores

Subscores range from 0.0 to 1.0

  • Increasing the payment history subscore by 0.001 raises the predicted FICO score by 0.198 points on average (standard deviation 0.0745).
  • Increasing the amounts owed subscore by 0.001 raises the predicted FICO score by 0.029 points on average (standard deviation 0.017).
  • Increasing the history length subscore by 0.001 raises the predicted FICO score by 0.040 points on average (standard deviation 0.016).
  • Increasing the new credit subscore by 0.001 raises the predicted FICO score by 0.059 points on average (standard deviation 0.020).
  • Increasing the credit mix subscore by 0.001 raises the predicted FICO score by 0.041 points on average (standard deviation 0.024).
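Figures like these can be produced with a finite-difference check: bump one subscore by 0.001 for every observation, re-run the final model, and summarize the change in prediction. The sketch below shows that procedure. `toy_predict` and the three rows are hypothetical stand-ins, not the article's model or data, and the helper name is an assumption.

```python
from statistics import mean, stdev


def bump_sensitivity(predict, rows, index, delta=0.001):
    """Mean and standard deviation of the change in predict() when the
    subscore at position `index` is raised by `delta` in every row."""
    changes = []
    for row in rows:
        bumped = list(row)
        bumped[index] += delta
        changes.append(predict(bumped) - predict(row))
    return mean(changes), stdev(changes)


def toy_predict(s):
    # Hypothetical stand-in for the final aggregation model: non-linear in
    # payment history (s[0]) and linear in the other four subscores.
    return 300 + 550 * (0.5 * s[0] ** 2 + 0.125 * (s[1] + s[2] + s[3] + s[4]))


# Hypothetical subscores: payment history, amounts owed, history length,
# new credit, credit mix.
rows = [
    [0.2, 0.5, 0.3, 0.6, 0.4],
    [0.5, 0.4, 0.6, 0.5, 0.5],
    [0.9, 0.7, 0.8, 0.9, 0.6],
]

for i, name in enumerate(["payment history", "amounts owed"]):
    avg, spread = bump_sensitivity(toy_predict, rows, i)
    print(f"{name}: mean {avg:.4f}, std {spread:.4f}")
```

A non-zero standard deviation indicates the effect of a subscore depends on where the borrower starts, which is what a non-linear aggregation would produce.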

Tree Model

We simplify the final aggregation part of the model and switch it out with a decision tree to demonstrate the different credit scorecards used by FICO.

Feature 0 (payment history) is clearly the most important feature, followed by feature 2 (history length), and then feature 1 (amounts owed).

This tree model has an R² of 32% with a depth of only 3.

|--- feature_0 <= 0.18
|   |--- feature_0 <= 0.08
|   |   |--- feature_0 <= 0.04
|   |   |   |--- value: [673.93]
|   |   |--- feature_0 > 0.04
|   |   |   |--- value: [682.85]
|   |--- feature_0 > 0.08
|   |   |--- feature_2 <= 0.33
|   |   |   |--- value: [690.20]
|   |   |--- feature_2 > 0.33
|   |   |   |--- value: [703.36]
|--- feature_0 > 0.18
|   |--- feature_0 <= 0.34
|   |   |--- feature_2 <= 0.34
|   |   |   |--- value: [711.82]
|   |   |--- feature_2 > 0.34
|   |   |   |--- value: [731.15]
|   |--- feature_0 > 0.34
|   |   |--- feature_1 <= 0.56
|   |   |   |--- value: [750.73]
|   |   |--- feature_1 > 0.56
|   |   |   |--- value: [777.38]
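Read as code, the printed tree is a handful of nested comparisons. The sketch below transcribes it directly, using the mapping given above (feature 0 is payment history, feature 1 is amounts owed, feature 2 is history length). The function and argument names are illustrative, and the thresholds are the rounded values from the printout, so results right at a boundary may differ from the fitted tree.

```python
def tree_score(payment_history, amounts_owed, history_length):
    """Depth-3 decision tree from the printout; subscores are in [0, 1]."""
    if payment_history <= 0.18:
        if payment_history <= 0.08:
            return 673.93 if payment_history <= 0.04 else 682.85
        return 690.20 if history_length <= 0.33 else 703.36
    if payment_history <= 0.34:
        return 711.82 if history_length <= 0.34 else 731.15
    return 750.73 if amounts_owed <= 0.56 else 777.38


# Strong payment history and a high amounts owed subscore.
print(tree_score(payment_history=0.5, amounts_owed=0.6, history_length=0.1))
```

Note that every path starts with payment history, and history length only matters in the middle of the payment history range, while amounts owed only matters at the top.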

Effect of Subscores

For each subscore, we adjust it up/down across all observations and take the mean predicted FICO score. We repeat this process with multiple adjustments and record the adjusted subscore vs. mean predicted FICO score.

The charts indicate the non-linearity of the effects of each of these subscores, as well as the magnitude of their effects. Note how the y-axis for payment history is much larger than the other charts.

Subscore sensitivity analysis

In this sensitivity analysis, a single input variable (such as maximum bankcard balance) is varied over the range of values from the original dataset, while the other variables are held constant. The model is then run for each value of the input variable, and the output is recorded. By plotting the output values against the input values, it is possible to see how the output of the model changes as the input variable is varied. This differs from plotting the variable in question against the subscore alone since sensitivity analysis shows the changes in subscore that can be attributed to the feature and aren’t just correlated.
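A minimal sketch of this one-variable-at-a-time sweep follows. It assumes "held constant" means every observation keeps its own values for the other features and the outputs are averaged; the article does not spell this out. `toy_subscore`, the rows, and the function names are hypothetical stand-ins for the real subscore model and dataset.

```python
def sweep_feature(model, rows, feature, values):
    """For each candidate value, overwrite `feature` in every row, leave the
    other features untouched, and record the mean model output."""
    curve = []
    for value in values:
        outputs = [model({**row, feature: value}) for row in rows]
        curve.append((value, sum(outputs) / len(outputs)))
    return curve


def toy_subscore(row):
    # Hypothetical subscore model with a step at a $5,000 maximum balance.
    step = 0.2 if row["max_bal_bc"] >= 5000 else 0.0
    return 0.8 - step - 0.1 * row["revol_util"] / 100


# Hypothetical observations.
rows = [
    {"max_bal_bc": 1000, "revol_util": 10},
    {"max_bal_bc": 4000, "revol_util": 50},
    {"max_bal_bc": 9000, "revol_util": 90},
]

# Sweep over the values that actually occur in the data.
grid = sorted({row["max_bal_bc"] for row in rows})
for value, avg in sweep_feature(toy_subscore, rows, "max_bal_bc", grid):
    print(f"max_bal_bc={value:>5}: mean subscore {avg:.3f}")
```

Because utilization is left as it is in each row, the step in the curve can be attributed to the maximum balance alone, even though in the raw rows a high balance and high utilization occur together.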

Payment History

  • Collection amounts under $100 don’t have much of an effect.

Amounts Owed

  • There’s a cutoff at $10,000 average current balance of all accounts at which the score drops substantially.
  • There’s a small cutoff at $35,000 of total current balance of all accounts.
  • There’s a boost when total revolving credit limits hit $15,000 and then again at $50,000.
  • There’s a boost when total open to buy on revolving bankcards hits $7,000 and then again at $25,000.
  • There’s a boost when total credit limits across all accounts hit $75,000 and then again at $150,000.
  • There’s a drop when total balance on all accounts except mortgages hits $31,000 and then again at $50,000 and $100,000.
  • There’s a boost when total bank card limit hits $39,000.
  • There’s a drop when total installment loan limits exceed $5,000, and again at $50,000.
  • There’s a drop when maximum current balance owed on all revolving accounts hits $5,000.

History Length

Some of these charts are concerning, as a longer history should always result in the same or a higher score. Since all else is never equal, this may mean the model is correcting for information in other features.

New Credit

Credit Mix

Perfect profile

We analyzed what near-perfect profiles (FICO ≥ 840) look like in terms of the subscores, after converting the subscores to percentiles.

The payment history subscore was clearly the most important; for the other subscores, lower values were acceptable.

  • Payment history: the 1st percentile of perfect profiles sits at the 91st percentile of all payment history subscores, and the 99th percentile at the 99.9th percentile.
  • Amounts owed: the 1st percentile of perfect profiles sits at the 47th percentile of all amounts owed subscores, and the 99th percentile at the 100th percentile.
  • History length: the 1st percentile of perfect profiles sits at the 6th percentile of all history length subscores, and the 99th percentile at the 100th percentile.
  • New credit: the 1st percentile of perfect profiles sits at the 24th percentile of all new credit subscores, and the 99th percentile at the 100th percentile.
  • Credit mix: the 1st percentile of perfect profiles sits at the 12th percentile of all credit mix subscores, and the 99th percentile at the 100th percentile.
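One plausible reading of these ranges is a two-step calculation: rank each subscore as a percentile over the whole population, then look at the 1st and 99th percentiles of those ranks among profiles scoring 840 or more. The sketch below follows that reading. The function names and the synthetic data are assumptions for illustration; the article does not publish its exact procedure.

```python
from bisect import bisect_right
from statistics import quantiles


def percentile_ranks(values):
    """Percentile rank (0-100) of each value within `values`."""
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, v) / n for v in values]


def perfect_profile_range(subscores, fico_scores, cutoff=840):
    """1st and 99th percentile of population-wide subscore ranks,
    restricted to profiles with a FICO score at or above `cutoff`."""
    ranks = percentile_ranks(subscores)
    perfect = [r for r, f in zip(ranks, fico_scores) if f >= cutoff]
    cuts = quantiles(perfect, n=100, method="inclusive")  # 99 cut points
    return cuts[0], cuts[-1]


# Synthetic population in which the score rises steadily with the subscore.
subscores = [i / 1000 for i in range(1000)]
fico = [600 + i // 4 for i in range(1000)]

low, high = perfect_profile_range(subscores, fico)
print(f"perfect profiles span percentiles {low:.1f} to {high:.1f}")
```

A narrow span near the top, as reported for payment history, means a high FICO score is hard to reach without a high subscore; a wide span, as for history length, means the subscore is less of a constraint.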

Perfect profiles look like this:

  • Maximum of 2 inquiries in the last 6 months. Average of 0.25 in the last 12 months. Average of 16 months since the most recent inquiry.
  • 18+ years since the last delinquency, or no delinquency.
  • No public records.
  • Between 2 and 33 open accounts, averaging 10.
  • Revolving balance can vary.
  • Revolving utilization between 0–35% but averaging 4.4%.
  • Total accounts between 4–63, averaging 24.
  • No collections excluding medical in the last 12 months.
  • Maximum 2 new accounts in the last 6 months.
  • Most have no open installment loans.
  • Most haven’t opened a revolving account in the last 12 months.
  • All account utilization averages 5.8%.
  • Total revolving high credit/credit limit averages $86,000.
  • Average of 2 new accounts in the last 24 months.
  • Bank card utilization 4.6%.
  • No charge offs in the last 12 months.
  • No delinquent accounts.
  • 13 years since the oldest installment loan.
  • 25 years since the oldest revolving account.
  • 2.8 mortgage accounts, but it’s possible to have this score with zero mortgages.
  • 5 years since the most recent bank card.
  • 14 years since the most recent revolving delinquency, if any.
  • No accounts ever 120 days past due.
  • 2 currently active bank card accounts.
  • 3 currently active revolving trades, and 3 revolving trades with balances greater than zero.
  • 5 satisfactory bank card accounts.
  • 5 installment loans.
  • 15 revolving accounts, 8 open.
  • 11 satisfactory accounts.
  • No bank cards with > 75% utilization.
  • No bankruptcies.
  • No tax liens.
  • High credit limit of $35,000.
  • Age of earliest credit line at least 9 years but averaging 28 years.

Invoking the AI model

The AI model is available via a REST endpoint at https://mfowcreditscoring.azurewebsites.net/api/Fico8, and the source code is available on GitHub at mfow/fico.

Many fields are optional and default to good behavior or no behavior (e.g. no inquiries).

Example POST request:

{
  "paymentHistory": {
    "open_acc": 300,
    "num_sats": 10,
    "pct_tl_nvr_dlq": 100,
    "percent_bc_gt_75": 0
  },
  "amountsOwed": {
    "revol_bal": 1000,
    "revol_util": 50,
    "all_util": 10,
    "total_bal_ex_mort": 10000,
    "avg_cur_bal": 5000,
    "tot_cur_bal": 1000
  },
  "historyLength": {
    "age_earliest_cr_line": 120
  },
  "newCredit": {},
  "creditMix": {
    "num_bc_sats": 3
  }
}
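A request like the one above could be sent from Python using only the standard library, as sketched below. The endpoint URL is the one given in this article; the `Content-Type` header, the helper names, and the shortened payload are assumptions, and the response format is not documented here, so the sketch simply returns the raw response body.

```python
import json
import urllib.request

ENDPOINT = "https://mfowcreditscoring.azurewebsites.net/api/Fico8"


def build_request(payload: dict) -> urllib.request.Request:
    """Wrap the payload in a JSON POST request (header is an assumption)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_score(payload: dict, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Shortened version of the example payload; omitted fields use the defaults.
payload = {
    "paymentHistory": {"num_sats": 10, "pct_tl_nvr_dlq": 100},
    "amountsOwed": {"revol_util": 50, "all_util": 10},
    "historyLength": {"age_earliest_cr_line": 120},
    "newCredit": {},
    "creditMix": {"num_bc_sats": 3},
}

request = build_request(payload)
# print(fetch_score(payload))  # uncomment to call the live endpoint
```

Building the request separately from sending it makes it possible to inspect the exact body before anything goes over the network.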

Michael Fowlie

Software Engineer in HFT, Statistician & Data Scientist, Master of Finance student. Opinion only. Not professional advice. https://www.michaelfowlie.com/