| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | English |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.20336854 |
Table of contents:
- Diabetes mellitus is a chronic metabolic disorder affecting over 537 million adults worldwide. This study presents a complete end-to-end machine learning pipeline for binary classification of diabetes status using the Pima Indians Diabetes Dataset (n=768). The pipeline integrates systematic data cleaning, group-median imputation, IQR-based outlier clipping, and six engineered interaction features. An XGBoost classifier was trained with 300 estimators, class-weighted loss, and L1/L2 regularization. Cross-validation was performed using a scikit-learn Pipeline to prevent data leakage. The model achieved accuracy of 87.0%, recall of 85.2%, precision of 79.3%, F1 score of 82.1%, and ROC-AUC of 94.7%. Five-fold CV AUC was 0.944 (SD=0.014). SHAP analysis identified Glucose, Glucose x BMI interaction, and BMI as the three most impactful predictors. Source code, trained model artifacts, and figures are publicly available on GitHub (https://github.com/randomthingsonlineatsk-cloud/diabetes-xgboost-prediction) and archived on Zenodo (DOI: 10.5281/zenodo.20332710).
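
The abstract combines IQR-based outlier clipping with cross-validation arranged so that preprocessing cannot leak information from held-out data. The sketch below illustrates that idea using only the Python standard library: the clipping bounds are learned from each training fold and then applied to the matching held-out fold. It is not the authors' code. The record states that a scikit-learn Pipeline was used, so the helper names (`fit_iqr_bounds`, `clip`, `k_fold_indices`), the 1.5 multiplier, the contiguous fold layout, and the glucose values are all assumptions made for illustration.

```python
from statistics import quantiles


def fit_iqr_bounds(values, k=1.5):
    """Learn clipping bounds [Q1 - k*IQR, Q3 + k*IQR] from the given values.

    The multiplier k=1.5 is the conventional choice and is an assumption here;
    the catalog record does not state which value the study used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip(values, bounds):
    """Clamp every value into the closed interval given by bounds."""
    lo, hi = bounds
    return [min(max(v, lo), hi) for v in values]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# Made-up glucose readings with one extreme value (illustrative only).
glucose = [85, 90, 95, 100, 105, 110, 115, 120, 125, 400]

clipped_folds = []
for train_idx, test_idx in k_fold_indices(len(glucose), k=5):
    # Bounds come from the training fold only, so the held-out values
    # never influence the preprocessing that is applied to them.
    bounds = fit_iqr_bounds([glucose[i] for i in train_idx])
    clipped_folds.append(clip([glucose[i] for i in test_idx], bounds))
```

In the last fold the extreme reading is held out, so it is clipped to a bound computed without it. Fitting the bounds on the full dataset first would let that reading widen the bounds used during evaluation, which is the kind of leakage a fold-wise pipeline is meant to avoid.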