Bibliographic details
Main author: Khan Gulrez Shagufa Fazal Ahmed
Format: Digital resource
Language: English
Published: Zenodo 2026
Subjects:
Online access: https://doi.org/10.5281/zenodo.20336854
Table of contents:
  • Diabetes mellitus is a chronic metabolic disorder affecting over 537 million adults worldwide. This study presents a complete end-to-end machine learning pipeline for binary classification of diabetes status using the Pima Indians Diabetes Dataset (n=768). The pipeline integrates systematic data cleaning, group-median imputation, IQR-based outlier clipping, and six engineered interaction features. An XGBoost classifier was trained with 300 estimators, class-weighted loss, and L1/L2 regularization. Cross-validation was performed inside a scikit-learn Pipeline to prevent data leakage. The model achieved an accuracy of 87.0%, recall of 85.2%, precision of 79.3%, F1 score of 82.1%, and ROC-AUC of 94.7%. The five-fold cross-validated AUC was 0.944 (SD=0.014). SHAP analysis identified Glucose, the Glucose × BMI interaction, and BMI as the three most impactful predictors. Source code, trained model artifacts, and figures are publicly available on GitHub (https://github.com/randomthingsonlineatsk-cloud/diabetes-xgboost-prediction) and archived on Zenodo (DOI: 10.5281/zenodo.20332710).
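The preprocessing steps named in the abstract (group-median imputation, IQR-based clipping, and an interaction feature) can be illustrated with a minimal standard-library sketch. This is not the author's code. Several details are assumptions made for illustration: that zeros encode missing measurements, that the grouping key is a column named `Outcome`, that the clipping multiplier is 1.5, and the function names themselves. Only the Glucose × BMI feature is shown, since the abstract does not list the other five.

```python
from statistics import median, quantiles

MISSING = 0  # assumption: a zero stands in for a missing measurement


def group_median_impute(rows, column, group_key):
    """Fill MISSING entries of `column` with the median of their group."""
    observed = [r[column] for r in rows if r[column] != MISSING]
    overall = median(observed)  # fallback for groups with no observed values
    by_group = {}
    for r in rows:
        if r[column] != MISSING:
            by_group.setdefault(r[group_key], []).append(r[column])
    medians = {g: median(vals) for g, vals in by_group.items()}
    # Build new dicts so the input rows are left unmodified.
    return [
        {**r, column: medians.get(r[group_key], overall)}
        if r[column] == MISSING else dict(r)
        for r in rows
    ]


def iqr_clip(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [min(max(v, low), high) for v in values]


def add_glucose_bmi(row):
    """Add the Glucose x BMI interaction named in the abstract."""
    return {**row, "Glucose_x_BMI": row["Glucose"] * row["BMI"]}


# Tiny made-up rows, for illustration only (not taken from the dataset).
rows = [
    {"Glucose": 0, "BMI": 30.0, "Outcome": 1},
    {"Glucose": 150, "BMI": 35.0, "Outcome": 1},
    {"Glucose": 170, "BMI": 40.0, "Outcome": 1},
    {"Glucose": 90, "BMI": 25.0, "Outcome": 0},
]
imputed = group_median_impute(rows, "Glucose", "Outcome")
features = [add_glucose_bmi(r) for r in imputed]
clipped = iqr_clip([1, 2, 3, 4, 100])
```

In a cross-validated setting such as the one the abstract describes, the medians and quartiles would presumably be computed on each training fold only and then applied to the held-out fold, which is what wrapping the steps in a scikit-learn Pipeline is meant to enforce.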