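Staff View: Comparative Analysis of Machine Learning Models for Diabetes Risk Prediction Using Clinical Health Indicators :: Library Catalog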

Bibliographic Details
Main Author:	Fabian, George Hezron
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Online Access:	https://doi.org/10.5281/zenodo.20029342

_version_	1866902051915563008
author	Fabian, George Hezron
author_facet	Fabian, George Hezron
contents	This study presents a comparative analysis of machine learning models for early diabetes risk prediction using the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The dataset includes key health indicators such as plasma glucose concentration, body mass index (BMI), blood pressure, insulin levels, age, diabetes pedigree function, and related physiological variables.
A structured machine learning pipeline was developed using six supervised classification algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Naïve Bayes. During preprocessing, biologically implausible zero values were treated as missing data and handled using median imputation. Feature standardization was applied to ensure uniform scaling and improve model performance stability.
Model evaluation was conducted using Accuracy, Precision, Recall, F1-score, and ROC-AUC to ensure a comprehensive and balanced assessment. Cross-validation results show that all models achieve satisfactory predictive performance, with ensemble-based methods, particularly Random Forest, demonstrating the highest and most consistent classification ability. Logistic Regression and Support Vector Machine also perform competitively, indicating the presence of both linear and nonlinear relationships in the dataset.
Feature importance analysis identifies glucose level and body mass index as the most significant predictors of diabetes risk, followed by genetic and demographic factors. Statistical testing confirms significant differences among models, with Random Forest, SVM, and Logistic Regression forming a statistically comparable top-performing group.
Overall, the findings demonstrate that machine learning methods can effectively support the early detection of diabetes using routine health data. These models offer strong potential for integration into clinical decision-support systems to enhance early diagnosis, risk stratification, and preventive healthcare strategies.
format	Digital resource
id	zenodo_https___doi_org_10_5281_zenodo_20029342
institution	Zenodo
language	eng
publishDate	2026
publisher	Zenodo
record_format	zenodo
title	Comparative Analysis of Machine Learning Models for Diabetes Risk Prediction Using Clinical Health Indicators
url	https://doi.org/10.5281/zenodo.20029342
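
The preprocessing steps named in the contents field (treating biologically implausible zeros as missing, median imputation, then feature standardization) can be illustrated with a minimal sketch. This is not the author's code: the function names `impute_zeros_with_median` and `standardize` are hypothetical, the glucose values are made-up toy numbers rather than rows from the dataset, and the use of the population standard deviation is an assumption about how the scaling was done.

```python
from statistics import mean, median, pstdev


def impute_zeros_with_median(values):
    """Treat zeros as missing and fill them with the median of the non-zero values."""
    observed = [v for v in values if v != 0]
    if not observed:
        raise ValueError("column has no non-zero values to take a median from")
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation, an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy plasma glucose readings (illustrative only); 0 marks a missing measurement.
glucose = [148, 85, 0, 89, 137, 0, 116]

imputed = impute_zeros_with_median(glucose)
scaled = standardize(imputed)

print(imputed)
print([round(z, 3) for z in scaled])
```

In a cross-validated comparison such as the one the record describes, the median and the scaling parameters would normally be computed on the training folds only and then applied to the held-out fold, so that no information leaks from the evaluation data.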

Similar Items