A comparative study of classical, bagging, and hybrid methods for optimizing loan default prediction

Ismail Idowu Akuji; Ahmed Babajide Olanrewaju; Taofik Abiodun Ahmed; Ayodeji Jubril Alabi; Idris Babatunde Adeyemi

doi:10.11591/csit.v7i2.p179-195

Abstract

This study optimized loan default prediction by comparing k-nearest neighbor (KNN), random forest (RF), and hybrid methods. The dataset was preprocessed using a simple imputer, a label encoder, the synthetic minority oversampling technique (SMOTE), and correlation-based feature selection that retained the top seven features, while grid search cross-validation (GSCV) and random search cross-validation (RSCV) were employed to tune the models. Before tuning, RF achieved near-perfect performance (100% accuracy, 99.8% precision, 100% recall, 99.9% F1, 1.000 area under the curve (AUC)), outperforming untuned KNN (99.2% accuracy, 96.2% precision, 99.8% recall, 98.0% F1, 0.997 AUC) and the untuned hybrid (99.8% accuracy, 99.1% precision, 99.9% recall, 99.5% F1). After tuning, RF maintained the same results, which were confirmed by stability under 10× nested CV (F1=0.9997±0.0002) and by McNemar tests showing equivalence to RF_RSCV (p=1.0000). KNN improved in precision (96.2%→99.8%) but declined in recall, while the hybrid dropped slightly across metrics. Partial dependence plots indicate that RF's dominance stems from three key features (lump_sum_payment, property_value, co-applicant_credit_type), a finding supported by a business impact analysis showing that RF makes minimal errors compared with KNN and the hybrid. The near-perfect scores of RF_GSCV reflect true generalization rather than overfitting, establishing it as the production-ready gold standard. Future work can address the limitation of a static dataset by incorporating dynamic time-series data with online learning, concept drift detection, and real-time macroeconomic features to enhance real-world generalizability.
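To illustrate how a McNemar comparison between two tuned classifiers can yield p=1.0000, the following is a minimal sketch of the exact (binomial) form of the test using only the Python standard library. The abstract does not state which variant of the test the authors used, so the exact form is an assumption, and the function name `mcnemar_exact` and the toy labels are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions (illustrative sketch).

    Returns (b, c, p_value), where b counts samples only model A classified
    correctly and c counts samples only model B classified correctly.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")

    # Only discordant pairs (one model right, the other wrong) carry information.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c

    # With no discordant pairs, the two models are indistinguishable on this data.
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis, b ~ Binomial(n, 0.5); double the smaller tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): both models make identical predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 1, 1, 0, 0, 1, 1]
model_b = [1, 0, 1, 1, 0, 0, 1, 1]
print(mcnemar_exact(y_true, model_a, model_b))  # (0, 0, 1.0)
```

When two models agree on every sample, or disagree equally often in both directions, the test cannot separate them and the p-value is 1, which is the pattern the abstract reports for the two tuned RF variants.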

Keywords

Correlation based feature selection; Grid search cross-validation; K-nearest neighbor; Loan default; Random forest; Random search cross-validation

DOI: https://doi.org/10.11591/csit.v7i2.p179-195

Computer Science and Information Technologies
p-ISSN: 2722-323X, e-ISSN: 2722-3221
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Universitas Ahmad Dahlan (UAD).

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
