An ensemble learning approach for diabetes prediction using the stacking method
Elliot Kojo Attipoe, Alimatu Saadia Yussiff, Maame Gyamfua Asante-Mensah, Emmanuel Dortey Tetteh, Regina Esi Turkson
Abstract
Diabetes is a severe illness characterized by high blood glucose levels. Machine learning algorithms, which can detect and predict diabetes in its early stages, offer a promising avenue for research. This study sought to improve the accuracy of diabetes mellitus prediction by employing the stacking method. Stacking was chosen because it combines the predictions of several base models through a meta-model, drawing on their complementary strengths to improve accuracy and generalization. The study used the Pima Indians diabetes dataset, a widely used benchmark. Six machine learning models were evaluated: logistic regression (LR), naïve Bayes (NB), extreme gradient boosting (XGBoost), K-nearest neighbor (KNN), decision tree (DT), and support vector machine (SVM). LR, KNN, and SVM performed best in terms of accuracy, F1-score, precision, and area under the curve (AUC), and were therefore used as the base models for the stacking method, with LR serving as the meta-model. The proposed stacking ensemble achieved an accuracy of 82.4%, higher than that of the individual models and of other ensemble techniques such as bagging and boosting. This study advances diabetes prediction by providing a more accurate model for early-stage detection, which may in turn support better clinical management of the disease.
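The stacking arrangement described above (LR, KNN, and SVM as base models, with LR as the meta-model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses a synthetic dataset from `make_classification` as a stand-in for the Pima Indians data (sized at 768 rows and 8 features to match the commonly distributed version), and the feature scaling, hyperparameters, 80/20 split, and 5-fold setting are all assumptions that do not come from the study. The accuracy it prints therefore says nothing about the reported 82.4%.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Pima Indians data (assumption: 768 rows, 8 features).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base models named in the abstract; scaling and hyperparameters are assumptions.
base_models = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
]

# The LR meta-model is trained on out-of-fold outputs of the base models (cv=5),
# so it never sees predictions made on data a base model was fitted on.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.3f}")
```

Training the meta-model on out-of-fold outputs rather than on in-sample predictions is the usual way to keep a stacked ensemble from simply learning to trust whichever base model overfits the training data most.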