Building Better Credit Scores

Introduction

Traditional Credit Scoring Model (FICO Score)

Credit scores are pivotal in today’s financial landscape, influencing everything from rental eligibility to access to health insurance, yet the formula for calculating creditworthiness has long been opaque and often overlooks important nuances. Typically, a credit score is determined by five factors: payment history, amounts owed, new credit, length of credit history, and credit mix. This structure can place individuals with limited credit history, especially young adults who are just starting to build credit, at a compounded disadvantage, restricting their access to loans, credit cards, employment opportunities, and insurance. Our proposed Cash Score model aims to address these limitations by providing a more comprehensive measure of creditworthiness. The Cash Score model leverages detailed account transaction data to predict the probability of defaulting on a loan (also known as loan delinquency). This approach highlights the potential for transaction-based credit evaluation to assess financial risk more accurately and improve access to credit, offering a fairer alternative to traditional credit scoring methods.

We adopted an iterative approach to model development, refining features alongside model selection and performance evaluation. We began with logistic regression to establish a baseline and identify key features. As the process evolved, we integrated more advanced algorithms, namely HistGradientBoosting (HistGB), CatBoost, LightGBM, and XGBoost, chosen for their ability to handle complex data patterns. Throughout the iterations, we refined feature generation and selected the most relevant features to improve the model’s predictive power.

Methods

Data Description

Our analysis leverages four key datasets that provide insights into consumer accounts, transaction histories, and credit scores. Because the datasets were prepared and preprocessed by Prism Data, little additional cleaning was needed. Our data cleaning focused on reviewing the data for consistency, addressing any remaining missing values, standardizing categorical variables, and structuring time-series data for modeling.

Below are the first five rows of each dataset used in this analysis:

prism_consumer_id	prism_account_id	account_type	balance_date	balance
3,023	0	SAVINGS	2021-08-31	90.57
3,023	1	CHECKING	2021-08-31	225.95
4,416	2	SAVINGS	2022-03-31	15,157.17
4,416	3	CHECKING	2022-03-31	66.42
4,227	4	CHECKING	2021-07-31	7,042.90

The acctDF.csv dataset provides detailed information about consumer financial accounts, such as account types, balances, and balance dates.

prism_consumer_id	evaluation_date	credit_score
0	2021-09-01	726
1	2021-07-01	626
2	2021-05-01	680
3	2021-03-01	734
4	2021-10-01	676

The consDF.csv dataset provides credit scores, evaluation dates, and delinquency targets for each consumer. This dataset is essential for building a model of credit risk, as it contains direct indicators of a consumer’s creditworthiness. The delinquency target serves as the dependent variable, enabling us to assess our model’s performance in predicting credit risk.

prism_consumer_id	prism_transaction_id	category	amount	credit_or_debit	posted_date
3,023	0	4	0.05	CREDIT	2021-04-16
3,023	1	12	481.56	CREDIT	2021-04-30
3,023	2	4	0.05	CREDIT	2021-05-16
3,023	3	4	0.07	CREDIT	2021-06-16
3,023	4	4	0.06	CREDIT	2021-07-16

The trxnDF.csv dataset records individual transactions, including transaction category IDs, amounts, and whether the transaction was a credit or debit. These transactional data are vital for modeling consumer behavior, such as income sources, spending habits, and cash flow.

category_id	category
0	SELF_TRANSFER
1	EXTERNAL_TRANSFER
2	DEPOSIT
3	PAYCHECK
4	MISCELLANEOUS

The cat_map.csv dataset maps transaction categories to their corresponding category IDs, allowing us to classify and interpret the transactions effectively.

It is important to note that, in compliance with the Equal Credit Opportunity Act (ECOA), we excluded specific transaction categories that could introduce bias in credit decision-making. Categories related to child dependents, healthcare and medical expenses, unemployment benefits, education, and pensions have been removed to ensure that our model does not unintentionally discriminate based on protected attributes.
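
As a rough illustration of the two steps described above (attaching category names and dropping excluded categories), here is a minimal sketch. The lookup reuses the cat_map.csv rows shown earlier. The category ID 99, the excluded category labels, the `label_and_filter` helper, and the sample rows are hypothetical and only stand in for the real values, which are not listed in this write-up.

```python
# Category lookup copied from the cat_map.csv rows shown above. ID 99 and its
# name are hypothetical, added only so the example has something to exclude.
CATEGORY_NAMES = {
    0: "SELF_TRANSFER",
    1: "EXTERNAL_TRANSFER",
    2: "DEPOSIT",
    3: "PAYCHECK",
    4: "MISCELLANEOUS",
    99: "HEALTHCARE_MEDICAL",
}

# Hypothetical labels for excluded categories; the real names may differ.
EXCLUDED_CATEGORIES = {"HEALTHCARE_MEDICAL", "EDUCATION", "PENSION"}


def label_and_filter(transactions):
    """Attach a category name to each transaction and drop excluded ones."""
    kept = []
    for txn in transactions:
        name = CATEGORY_NAMES.get(txn["category"], "UNKNOWN")
        if name in EXCLUDED_CATEGORIES:
            continue
        kept.append({**txn, "category_name": name})
    return kept


# Made-up sample rows in the shape of trxnDF.csv.
sample_transactions = [
    {"prism_transaction_id": 0, "category": 4, "amount": 0.05},
    {"prism_transaction_id": 1, "category": 3, "amount": 1200.00},
    {"prism_transaction_id": 2, "category": 99, "amount": 40.00},
]

filtered = label_and_filter(sample_transactions)
for txn in filtered:
    print(txn["prism_transaction_id"], txn["category_name"], txn["amount"])
```

Filtering before any feature is computed means the excluded categories cannot leak into aggregated features later in the pipeline.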

Exploratory Data Analysis

Through exploratory data analysis (EDA), we examined consumer transaction trends and spending patterns to uncover insights that aid in identifying key factors for predicting credit risk. Below are a few examples of EDA conducted to look at temporal trends, transaction frequency, spending categories, and the impact of specific financial behaviors.

Balance Over Time of Delinquent vs. Non-Delinquent Consumer

Comparing bank balances over time between a randomly selected delinquent and non-delinquent consumer reveals distinct financial patterns. The delinquent consumer’s balance remains mostly stagnant, with a single large spike that quickly drops. In contrast, the non-delinquent consumer maintains a steady, positive balance with gradual growth, indicating stable income, controlled spending, and savings. Although this is a single pair of consumers, it suggests that bank balance trends may be a useful predictor of creditworthiness.

Distribution of Credit Score by Delinquency Status

The roughly normal distribution of credit scores among delinquent consumers, compared to the left-skewed distribution among non-delinquent consumers, shows that non-delinquent individuals typically have higher credit scores, while most delinquent individuals fall within the lower middle of the credit score range. This reinforces that credit scores are already a strong indicator of delinquency and provides a foundation for our model, allowing us to build upon the credit score feature to potentially outperform traditional models at predicting delinquency.

Identifying "Buy Now, Pay Later" (BNPL) as a risky category, we analyzed this category further. The figure reveals that a significantly higher proportion of non-delinquent consumers fall into the lowest bin for mean BNPL transactions. However, delinquent consumers tend to have higher proportions in the upper bins, indicating that they engage in larger BNPL transactions compared to non-delinquent consumers.

The plot reveals a wider range of tax transactions over the last two weeks for non-delinquent consumers, while delinquent consumers show little to no variation in their tax transaction frequency. This suggests that non-delinquent consumers are more active and consistent in handling their tax-related transactions, which could indicate better financial management and stability compared to delinquent consumers.

Feature Generation

We engineered features to capture financial behavior through transaction history, balance trends, spending patterns, and risk indicators. Our feature generation process included:

Time Window Analysis: Transactions were analyzed across multiple time windows—14 days, 30 days, 3 months, 6 months, and 1 year—to capture short- and long-term trends.
Aggregated Statistics: Summary statistics (minimum, maximum, mean, median, standard deviation, sum, count, percent of transactions) were calculated for transaction categories and balance trends.
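
The two steps above can be sketched as follows. This is a minimal illustration, not the project's pipeline. The `window_features` helper, the sample transactions, the approximation of months as 90, 180, and 365 days, and the choice to count only transactions posted before the evaluation date are all assumptions. The column naming follows the pattern of the FOOD_BEVERAGES_last_14_days_mean feature mentioned in this section.

```python
from datetime import date, timedelta
from statistics import mean

# Window lengths in days; treating 3 months, 6 months, and 1 year as
# 90, 180, and 365 days is an assumption made for this sketch.
WINDOWS_DAYS = {
    "last_14_days": 14,
    "last_30_days": 30,
    "last_3_months": 90,
    "last_6_months": 180,
    "last_1_year": 365,
}


def window_features(transactions, evaluation_date, category):
    """Return {feature_name: value} for one consumer and one category."""
    features = {}
    for label, days in WINDOWS_DAYS.items():
        start = evaluation_date - timedelta(days=days)
        amounts = [
            t["amount"]
            for t in transactions
            if t["category"] == category
            and start <= t["posted_date"] < evaluation_date
        ]
        prefix = f"{category}_{label}"
        features[f"{prefix}_count"] = len(amounts)
        features[f"{prefix}_sum"] = sum(amounts)
        features[f"{prefix}_mean"] = mean(amounts) if amounts else 0.0
        features[f"{prefix}_max"] = max(amounts) if amounts else 0.0
    return features


# Made-up transactions for a single consumer.
transactions = [
    {"category": "FOOD_BEVERAGES", "amount": 12.5, "posted_date": date(2021, 8, 25)},
    {"category": "FOOD_BEVERAGES", "amount": 7.5, "posted_date": date(2021, 8, 20)},
    {"category": "FOOD_BEVERAGES", "amount": 30.0, "posted_date": date(2021, 7, 1)},
    {"category": "PAYCHECK", "amount": 1500.0, "posted_date": date(2021, 8, 30)},
]

feats = window_features(transactions, date(2021, 9, 1), "FOOD_BEVERAGES")
print(feats["FOOD_BEVERAGES_last_14_days_mean"])
print(feats["FOOD_BEVERAGES_last_3_months_count"])
```

Repeating this for every category and statistic is what makes the feature count grow so quickly, which is why a feature selection step is needed afterwards.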

Category-Based Feature Generation Process

This diagram showcases our process for generating category-based features. For example, one of the features created through this process is FOOD_BEVERAGES_last_14_days_mean, which represents the average transaction amount within the “Food & Beverages” category over the past 14 days. By analyzing these features, we aim to capture spending habits, identify fluctuations in financial stability, and differentiate between various financial behaviors.
Risk Indicators: High-risk behaviors were identified through flagged transactions, such as gambling, using threshold-based indicators.
Balance Features: Features that reflect balance fluctuations such as balance deltas, rolling averages, and recent trends were created.
Income Features: Income-based features such as the number of income sources and income standard deviation were calculated to assess the diversity and variability of a consumer’s income.
Standardization: Non-categorical features were standardized to ensure consistent scaling.
Resampling: Our dataset was imbalanced, meaning one class had far more examples than the other. To address this, we used the Synthetic Minority Over-sampling Technique (SMOTE) to generate new samples for the smaller class and undersampling to reduce the larger class. This produced a more balanced dataset, helping the model learn patterns in both classes rather than favoring the majority class.
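
To make the resampling step more concrete, the sketch below hand-rolls the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbours) together with random undersampling. It is a simplified illustration under assumed inputs, not the library implementation used in the project. The function names, the value of `k`, and the toy points are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def undersample(majority, n_keep, rng=None):
    """Keep a random subset of the majority class."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)


# Toy data: 4 delinquent (minority) points and 20 non-delinquent points.
rng = random.Random(42)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), float(i % 5)) for i in range(20)]

synthetic = smote(minority, n_new=6, k=3, rng=rng)
minority_balanced = minority + synthetic
majority_balanced = undersample(majority, n_keep=10, rng=rng)
print(len(minority_balanced), len(majority_balanced))
```

Because each synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies.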

Feature Selection

The final dataset contained more than 2,000 features, with the dataframe shape being 15000 rows × 2430 columns. To refine model input, we performed feature selection using the following techniques:

Correlation Analysis: Selected top features most correlated with delinquency using Lasso (L1) Regularization.
Mutual Information: Identified features with the highest mutual information score for predictive power.
Embedded Method: Utilized Random Forest to rank and select the most relevant features.
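
As an example of one of these techniques, the sketch below computes the mutual information between a discrete feature and the delinquency label, then ranks features by that score. It assumes features have already been discretized (continuous features would need binning first). The `mutual_information` helper and the three toy feature columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in nats) between a discrete feature and a label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    px = Counter(feature_values)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy labels (1 = delinquent) and hypothetical binary features.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "perfect_signal": [1, 1, 1, 1, 0, 0, 0, 0],  # identical to the label
    "partial_signal": [1, 1, 1, 0, 1, 0, 0, 0],  # agrees on 6 of 8 rows
    "pure_noise": [1, 0, 1, 0, 1, 0, 1, 0],      # independent of the label
}

scores = {name: mutual_information(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.4f}")
```

A feature that is independent of the label scores zero, while a feature that fully determines a balanced binary label scores ln 2, so the ranking directly reflects how much each feature tells us about delinquency.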

Models

We evaluated multiple machine learning models to predict credit risk. Below is a brief description of each model used in our analysis.

Modeling Approaches

Logistic Regression (Baseline Model): A simple yet effective linear model that serves as the starting point for comparison with more advanced models.
Histogram-based Gradient Boosting (HistGB): Speeds up training by grouping data into bins, working well for large datasets and making the model faster and more memory-efficient.
Categorical Boosting (CatBoost): A gradient boosting method designed to handle categorical features natively, using ordered boosting to reduce overfitting and symmetric trees for fast, stable predictions.
Light Gradient-Boosting Machine (LightGBM): Uses "leaf-wise" decision trees for faster learning and reduced memory usage, making it particularly effective for large datasets.
Extreme Gradient Boosting (XGBoost): A popular gradient boosting method known for its speed and accuracy. It reduces errors through regularization and handles large datasets efficiently by running in parallel on multiple processors.

Each of these models (other than Logistic Regression) improves upon regular decision trees by using "boosting" to combine multiple trees, which enhances the model's accuracy.

Model Evaluation

We used the following metrics to assess model performance:

ROC AUC: Shows how well the model can tell the difference between positive and negative outcomes. Higher values mean the model is better at making this distinction.
Accuracy: Tells us the percentage of times the model made a correct prediction.
Precision: Measures how many of the model's positive predictions were actually correct.
Recall: Shows how many of the actual positive cases were correctly identified by the model.
Confusion Matrix: A table that helps us see how many predictions were correct and how many were wrong, broken down by type of error (false positive, false negative).
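
The threshold-based metrics above, along with the F1-score reported in the results table, can all be derived from the four confusion matrix counts. The sketch below shows the arithmetic. The `classification_metrics` helper and the counts are hypothetical, chosen to resemble an imbalanced dataset in which delinquent cases are rare.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 100 delinquent consumers out of 1,000, with the model
# flagging only 25 consumers as delinquent.
metrics = classification_metrics(tp=10, fp=15, fn=90, tn=885)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is close to 0.9 even though only one in ten delinquent consumers is caught, which is why accuracy alone is not enough to judge a model on imbalanced data.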

Results

Feature Importance

Figure: Top SHAP Values

SHAP (SHapley Additive exPlanations) is a method used to explain model predictions by attributing each feature's contribution to the final prediction.

In our model, SHAP identified the following features to be important in predicting credit delinquency:

sum_acct_balances: Higher account balances suggest lower delinquency risk.
HAS_SAVINGS_ACCT: Having a savings account reduces delinquency risk.
DEPOSIT_last_14_days_count: Recent deposits indicate financial stability.
OVERDRAFT: Frequent overdrafts increase delinquency risk.
LOAN_last_14_days_count: Recent loans may signal financial stress.
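
To show what "attributing each feature's contribution" means, the sketch below computes exact Shapley values by brute force for a tiny toy model. It does not use the SHAP library and is not our trained model. The `toy_risk` function, its coefficients, and the instance and baseline values are hypothetical. Only the feature names and the direction of each effect are taken from the list above.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(instance)
    n = len(names)
    values = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = {
                    f: (instance[f] if f in coalition else baseline[f])
                    for f in names
                }
                with_feature = {**without, name: instance[name]}
                total += weight * (predict(with_feature) - predict(without))
        values[name] = total
    return values


def toy_risk(f):
    """Hypothetical linear risk score; the coefficients are made up."""
    return (
        0.30
        - 0.10 * f["HAS_SAVINGS_ACCT"]
        + 0.05 * f["OVERDRAFT"]
        - 0.02 * f["DEPOSIT_last_14_days_count"]
    )


instance = {"HAS_SAVINGS_ACCT": 1, "OVERDRAFT": 3, "DEPOSIT_last_14_days_count": 2}
baseline = {"HAS_SAVINGS_ACCT": 0, "OVERDRAFT": 1, "DEPOSIT_last_14_days_count": 2}

phi = shapley_values(toy_risk, instance, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.4f}")
```

The contributions sum to the difference between the prediction for this consumer and the prediction for the baseline, which is the additive property that gives SHAP its name. Enumerating every coalition is only feasible for a handful of features.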

Model Performance

The ROC curves below illustrate the trade-off between the true positive rate and the false positive rate for each model. The AUC scores indicate overall model performance, with higher values reflecting better predictive power.
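
One way to read an AUC score is as the probability that a randomly chosen delinquent consumer receives a higher predicted risk than a randomly chosen non-delinquent consumer. The sketch below computes AUC directly from that definition. The `roc_auc` helper, the labels, and the scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = delinquent, scores are predicted delinquency probabilities.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

Because AUC depends only on how consumers are ranked, not on a decision threshold, a model can have a high AUC and still show low recall at its default threshold.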

Model Performance Metrics Comparison
Model	ROC-AUC	Accuracy	Precision	Recall	F1-Score	Training Time	Prediction Time
Logistic Regression (w/o Credit Score)	0.7079	0.8445	0.2383	0.2785	0.2568	1.3368	0.4016
Logistic Regression (w/ Credit Score)	0.7241	0.8571	0.2674	0.3548	0.3050	1.7175	0.3315
LightGBM (w/o Credit Score)	0.7796	0.8991	0.3878	0.0802	0.1329	4.1249	0.0931
LightGBM (w/ Credit Score)	0.8162	0.9068	0.4167	0.1382	0.2076	3.9720	0.0859
CatBoost (w/o Credit Score)	0.7704	0.9019	0.4474	0.0717	0.1236	38.6703	0.0788
CatBoost (w/ Credit Score)	0.8260	0.9170	0.4681	0.1095	0.1774	40.9512	0.0960

Key Insights:

Adding credit scores improves model performance, helping predict delinquency more accurately for all models.
CatBoost (with credit score as a feature) is the strongest model overall, with the highest ROC-AUC (0.8260) and accuracy (0.9170), but it struggles to detect delinquent cases (recall of 0.1095) and takes the longest to train.
Without credit scores, LightGBM has the highest ROC-AUC (0.7796), slightly ahead of CatBoost (0.7704), and its training time is roughly nine times shorter (4.1249 vs. 38.6703).

Further Analysis: Confusion Matrices

Figure: CatBoost Confusion Matrix (w/o Credit Score)

Figure: CatBoost Confusion Matrix (w/ Credit Score)

Figure: LightGBM Confusion Matrix (w/o Credit Score)

Figure: LightGBM Confusion Matrix (w/ Credit Score)

From the confusion matrices, we observe that both CatBoost and LightGBM improve slightly with credit score inclusion. However, they remain highly conservative, predicting very few positive cases. This results in low recall (0.07 to 0.14) alongside moderate precision (0.39 to 0.47), as seen in the model performance table.

Note: In these models, "positive" refers to delinquent cases, while "negative" represents non-delinquent cases.

Cash Score vs. Credit Score

Delinquency Rate Heatmap (Cash Score vs. Credit Score)

The heatmap visually represents delinquency rates using color intensity and numerical values, where darker regions indicate higher delinquency. The bottom-left region, where scores are lowest, shows delinquency reaching 100%, while the top-right region, representing higher scores, exhibits near-zero delinquency. This highlights both cash and credit scores as strong indicators of financial risk, with higher scores consistently associated with lower delinquency rates.
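
The values behind a heatmap like this can be produced by binning consumers on both scores and computing the share of delinquent consumers in each cell. The sketch below shows the idea. The `delinquency_rate_grid` helper, the bin edges, the assumption that the Cash Score shares the credit score's scale, and the sample records are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def delinquency_rate_grid(records, cash_edges, credit_edges):
    """Map (cash_bin, credit_bin) -> share of delinquent consumers in that cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [delinquent, total]
    for cash_score, credit_score, delinquent in records:
        cell = (
            bisect_right(cash_edges, cash_score),
            bisect_right(credit_edges, credit_score),
        )
        counts[cell][0] += int(delinquent)
        counts[cell][1] += 1
    return {cell: d / total for cell, (d, total) in counts.items()}


# Hypothetical edges: bin 0 is below 600, bin 1 is 600-699, bin 2 is 700+.
edges = [600, 700]

# Made-up records: (cash_score, credit_score, delinquent).
records = [
    (550, 560, 1),
    (580, 590, 1),
    (650, 640, 1),
    (660, 680, 0),
    (750, 760, 0),
    (720, 710, 0),
]

grid = delinquency_rate_grid(records, edges, edges)
for cell in sorted(grid):
    print(cell, f"{grid[cell]:.0%}")
```

Cells with very few consumers can show extreme rates such as 0% or 100%, so the count in each cell is worth checking alongside the rate.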

Conclusion

Our research highlights that incorporating detailed bank transaction data into credit scoring models results in performance that is comparable to traditional models, all without the necessity of credit history. This approach allows for a more comprehensive and nuanced assessment of an individual’s creditworthiness, providing a more holistic view of their financial behavior. By utilizing transactional data, we aim to improve the accuracy of credit scoring, offering a more transparent and equitable evaluation process. This model addresses existing biases and limitations in traditional credit scoring, especially for individuals with limited or no credit history, such as young adults or those from underrepresented groups. Ultimately, this approach seeks to enhance fairness and inclusivity within the financial system, increasing access to credit opportunities for those who have historically been overlooked or excluded from traditional lending practices.

Next Steps

Feature Engineering: We aim to optimize aggregated feature metrics based on transaction categories and time windows. Additionally, we plan to implement clustering algorithms to identify and select the most relevant features for improved model performance.
Model Refinement: We intend to explore deep learning models, incorporating extended hyperparameter tuning sessions to uncover more complex patterns in the data and improve predictive accuracy.
Bias & Fairness: To ensure equitable credit assessments, we will evaluate the potential for biases in predictions across different demographic groups and implement fairness constraints to mitigate any identified disparities.

Our Team

Aman Kar

akar@ucsd.edu

Daniel Mathew

drmathew@ucsd.edu

Tracy Pham

tnp003@ucsd.edu