This study optimizes loan default prediction by comparing k-nearest neighbor (KNN), random forest (RF), and hybrid models. The dataset was preprocessed with a simple imputer, a label encoder, the synthetic minority oversampling technique (SMOTE), and correlation-based feature selection retaining the top seven features; grid search cross-validation (GSCV) and random search cross-validation (RSCV) were used to tune the models. Before tuning, RF achieved near-perfect performance (100% accuracy, 99.8% precision, 100% recall, 99.9% F1, 1.000 area under the curve (AUC)), outperforming untuned KNN (99.2% accuracy, 96.2% precision, 99.8% recall, 98.0% F1, 0.997 AUC) and the untuned hybrid model (99.8% accuracy, 99.1% precision, 99.9% recall, 99.5% F1). After tuning, RF retained the same results, which remained stable under 10× nested cross-validation (F1 = 0.9997 ± 0.0002), and McNemar tests showed no significant difference from RF_RSCV (p = 1.0000). KNN improved in precision (from 96.2% to 99.8%) but declined in recall, while the hybrid model dropped slightly across metrics. Partial dependence plots indicate that RF's dominance stems from three key features (lump_sum_payment, property_value, co-applicant_credit_type), and a business impact analysis shows minimal errors for RF relative to KNN and the hybrid model. These results indicate that the near-perfect performance of RF_GSCV reflects genuine generalization rather than overfitting, supporting it as the production-ready model of choice. Future work can address the limitation of a static dataset by incorporating dynamic time-series data with online learning, concept drift detection, and real-time macroeconomic features to improve real-world generalizability.
Keywords
Correlation-based feature selection; Grid search cross-validation; K-nearest neighbor; Loan default; Random forest; Random search cross-validation
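
The McNemar comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes the exact (binomial) form of the test applied to paired predictions on one shared test set; the function name `mcnemar_exact` and the toy labels are hypothetical and are not taken from the study. A p-value of 1.0 arises when the two models disagree with the ground truth equally often (or never disagree at all), which means no difference was detected rather than that equivalence was proven.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions (illustrative helper).

    Returns (b, c, p_value), where
      b = cases model A classifies correctly and model B misclassifies,
      c = cases model A misclassifies and model B classifies correctly.
    Only these discordant pairs carry information about a difference.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")

    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c

    # No discordant pairs: the data give no evidence of any difference.
    if n == 0:
        return b, c, 1.0

    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2.0 * tail)
    return b, c, p_value


# Toy labels (hypothetical): each model makes exactly one error, on different rows.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
pred_a = [1, 0, 1, 1, 0, 1, 0, 1]
pred_b = [1, 0, 1, 0, 0, 1, 0, 0]

b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(f"b={b}, c={c}, p={p:.4f}")
```

When the discordant counts are small, as they would be for two classifiers that are both close to 100% accurate, the exact binomial form is generally preferred over the chi-square approximation.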