Statistical Modelling of Diabetes Risk Report and Workflow
Problem Statement
Sourced from UBC STAT 310 website:
“In this course you will work in assigned teams of three or four to answer a predictive question using a publicly available data set that will allow you to answer that question. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook, to a fully reproducible and robust data data analysis project, comprised of:
- A well documented and modularized software package and scripts written in R and/or Python,
- A data analysis pipeline automated with GNU Make,
- A reproducible report powered by either Jupyter book or R Markdown,
- A containerized computational environment created and made shareable by Docker, and
- A remote version control repository on GitHub for project collaboration and sharing, as well as automation of test suite execution and documentation and software deployment.”
Project Goals and Objectives
We attempted to develop a logistic regression model to predict whether a patient has diabetes or not using the Diabetes Health Indicators dataset sourced from the UCI machine learning repository (Centers for Disease Control and Prevention 2023). We employed Random Over-Sampling Examples (ROSE) to balance the data and a tuned Least Absolute Shrinkage and Selection Operator (LASSO) regression model to classify patients who are at high risk of developing diabetes using a subset of the 21 risk factors provided in the dataset.
Feature selection was performed via Cramér’s V statistical tests and Information Gain analysis. Recall was used to measure the classifier’s performance, as the consequences of false negatives would be more severe than false positives for this task. The area under the receiver operating characteristic (ROC) curve (AUC) was also chosen to evaluate the model’s effectiveness in distinguishing between the two classes compared to random guessing.
Process and Approach
- Drafted a teamwork contract on how the group would conduct and organise ourselves throughout the course of the project.
- Set-up and organised Github repository
- Outlined code of conduct, contribution and license files based on existing documents for similar purposes.
- Github Issues and Milestones were used to keep track of tasks and persistent issues with the system.
- Selected a data analysis project and corresponding dataset to create a narrated analysis that asks and answers a predictive question using a classification method.
- Dataset used: Diabetes Health Indicators
- Data was initially extracted by running a Python file for
ucimlrepo
- Constructed report for predicting whether a patient has diabetes or not
- Loading extracted data and perform exploratory data analysis by obtaining counts of distinct and
NA
values for each column, the initial data types of each column, and the class imbalance in the target variableDiabetes_binary
. - Applied Random Over-Sampling Examples (ROSE) technique to undersample the majority class
Diabetes_binary == 0
and oversample the minority classDiabetes_binary == 1
. - Binning numerical BMI into discrete categories
- Performed feature selection by finding the distributions of
Diabetes_binary
by various variables, and employing the Cramér’s V and Information Gain metrics alongside chi-squared tests to narrow down relevant variables. - Developed LASSO regression workflow model and determined how its features contribute to the final prediction.
- Loading extracted data and perform exploratory data analysis by obtaining counts of distinct and
- Created
Dockerfile
to provide specifications for a Docker image that provides a computational environment for the analysis.- Used
jupyter/r-notebook:x86_64-ubuntu-22.04
as base Docker image in release0.0.1
. - Changed base Docker image to
rocker/verse:4.2.1
in later releases.
- Used
- Abstracted code within the report as individual scripts
01-load_clean.R
: Loading and cleaning data02-eda.R
: Exploratory data analysis03-model.R
: Constructing LASSO classification analysis model04-analysis.R
: Applying model onto test set and producing analysis
- Applied Quarto formatting to report:
- Table of contents
- Bibtex references
- Labelling, cross-referencing and automatic numbering for figures and tables
- Removal of code chunks for easier viewing
- Created
Makefile
to proving functions to automate the process of running all required scripts in the correct order, and for scripts to acquire consistent inputs and provide consistent outputs. - Abstracted code within the scripts as individual functions and relocated them into a new package
predictdiabetes
. - Reviewed data validation checks from the provided checklist into the analysis pipeline as
01-validate_rawdf.R
using thepointblank
library.
Results
Our model achieved a recall score of 0.7652607 (7 s.f.) on the test set, implying that about 77% of all positive instances of diabetes were correctly classified by the LASSO regression model. This suggests that the model is relatively effective at identifying individuals who are at risk of developing diabetes. Additionally, the model achieved an area under the ROC curve of 0.80139 (7 s.f.) on the test set. Since the AUC is above 0.5, this indicates that the model can discriminate between diabetic and non-diabetic cases better than random guessing.
We expected the model to perform better than random guessing, which it did achieve. Additionally, we aimed to minimize false negatives which is particularly important in healthcare diagnoses where a false negative case can have serious consequences. For example, a false negative would indicate that the model predicts the patient to not develop diabetes even though they do. This may lead to the patient not getting the treatment or care they need, potentially resulting in health complications and even death. The model had a false negative rate around 23% which is concerning as it indicates a significant risk for missing positive cases leading to unfavourable patient outcomes.
Impact and Reflection
Our model was optimized for recall through hyperparameter tuning, with cross-validation used to evaluate model performance during the process. However, despite our results, the model falls short of what is expected in clinical applications. For example, more complex models in the literature which incorporate advanced feature selection techniques (Alhussan et al. 2023) or Generative Adversarial Networks (GANs) (Feng, Cai, and Xin 2023) can achieve recall and AUC scores upwards of 97%. Our findings serve as a proof of concept for the feasibility of classification models to predict the risk of diabetes based on publicly available health data. Since our model did relatively well compared to random guessing, this indicates potential correlations between health indicators and the likelihood of developing diabetes. With this information, people might be more aware of their health and lifestyle choices. They may be inclined to work harder to reduce cholesterol levels, manage high blood pressure, keep alcohol consumption under control and maintain a healthy lifestyle. Thus, this may help decrease the global mortality rate from diabetes through early interventions and lifestyle changes.
Future directions could include how we can improve our classification method to more accurately predict the risk of diabetes in patients. We can explore more rigorous feature selection techniques or implement other machine learning models such as boosted trees to improve our classification performance. In the context of the healthcare system, we can look to integrate classification models into the diagnosis process to help detect diabetes early to improve patient outcomes.