Bibliographic Details
Main Author: Fabian, George Hezron
Format: Digital resource
Language: English
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.20029342
_version_ 1866902051915563008
author Fabian, George Hezron
author_facet Fabian, George Hezron
contents This study presents a comparative analysis of machine learning models for early diabetes risk prediction using the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The dataset includes key health indicators such as plasma glucose concentration, body mass index (BMI), blood pressure, insulin levels, age, diabetes pedigree function, and related physiological variables.

A structured machine learning pipeline was developed using six supervised classification algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes. During preprocessing, biologically implausible zero values were treated as missing data and replaced by median imputation. Features were then standardized to put them on a common scale and stabilize model performance.

Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC to give a comprehensive and balanced assessment. Cross-validation results show that all models achieve satisfactory predictive performance, with ensemble-based methods, particularly Random Forest, showing the highest and most consistent classification ability. Logistic Regression and Support Vector Machine also perform competitively, suggesting that the dataset contains both linear and nonlinear relationships.

Feature importance analysis identifies glucose level and body mass index as the most significant predictors of diabetes risk, followed by genetic and demographic factors. Statistical testing confirms significant differences among models, with Random Forest, Support Vector Machine, and Logistic Regression forming a statistically comparable top-performing group.

Overall, the findings demonstrate that machine learning methods can effectively support the early detection of diabetes using routine health data. These models offer strong potential for integration into clinical decision-support systems to enhance early diagnosis, risk stratification, and preventive healthcare strategies.
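The preprocessing and evaluation steps named in the abstract (zeros treated as missing, median imputation, standardization, and the threshold-based metrics) can be illustrated with a minimal standard-library sketch. This is not the author's code: the toy values, the function names, and the use of the population standard deviation are all assumptions made for illustration, and the record does not say which software the study used.

```python
from statistics import mean, median, pstdev

# Hypothetical toy column of glucose readings; 0 marks a biologically
# implausible value that is treated as missing (values are illustrative only).
glucose_raw = [148.0, 0.0, 183.0, 89.0, 137.0]


def impute_zeros_with_median(column):
    """Replace zeros with the median of the non-zero (observed) values."""
    observed = [v for v in column if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in column]


def standardize(column):
    """Rescale a column to zero mean and unit population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(v - mu) / sigma for v in column]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# In a cross-validated pipeline the median, mean and standard deviation
# should be computed on the training fold only and then applied to the
# held-out fold, to avoid leaking information from the test data.
glucose_imputed = impute_zeros_with_median(glucose_raw)
glucose_scaled = standardize(glucose_imputed)
print(glucose_imputed)
print(classification_metrics([1, 0, 1, 0, 1], [1, 0, 0, 0, 1]))
```

ROC-AUC is left out of the sketch because it needs predicted scores rather than hard labels. A library implementation would normally replace these helper functions in practice.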
format Recurso digital
id zenodo_https___doi_org_10_5281_zenodo_20029342
institution Zenodo
language eng
publishDate 2026
publisher Zenodo
record_format zenodo
title Comparative Analysis of Machine Learning Models for Diabetes Risk Prediction Using Clinical Health Indicators
url https://doi.org/10.5281/zenodo.20029342