Heart Disease Data Analysis
Medical Data Insights · Statistical Modeling · Risk Factor Exploration
Overview
A data analysis project exploring how socioeconomic and health indicators relate to heart disease mortality across U.S. counties, using real-world health survey data.
Heart disease is one of the leading causes of death worldwide. This project investigates which factors are statistically significant predictors of county-level heart disease mortality.
Technology Stack
- Programming Language: R
- Data Processing & Visualization: tidyverse, ggplot2, lattice, corrplot, psych
- Statistical Modeling:
- Ordinary Least Squares (OLS)
- LASSO Regression (glmnet)
- Principal Component Analysis (PCA)
- Factor Analysis (psych::principal)
- Linear Discriminant Analysis (LDA)
Dataset Description
- This dataset includes 1,069 U.S. counties and 23 variables, spanning health behaviors (e.g., obesity, smoking, diabetes), economic indicators (e.g., unemployment, insurance coverage), demographic rates (e.g., birth and death rates), and regional classifications. It was curated by merging and cleaning data from multiple public sources, including the USDA Economic Research Service, CDC, U.S. Census Population Estimates, and the Bureau of Labor Statistics.
Methodology and Approach
- This project followed a structured, multi-stage analytical process to explore how socioeconomic and health-related indicators influence heart disease mortality across U.S. counties:
1. Data Preparation
- Merged two public datasets and standardized 23 variables
- Removed the ~66% of records that had randomly missing values in any of 8 variables
- Renamed variables for coding efficiency (e.g., V1, V2, …)
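The preparation steps above can be sketched in base R. This is a minimal illustration, not the project's actual script: the two toy tables (`health`, `econ`), the `fips` join key, and every column name are assumptions made for the example.

```r
# Hypothetical stand-ins for the two public source tables.
# All table and column names here are assumptions for illustration.
health <- data.frame(
  fips        = c("01001", "01003", "01005", "01007"),
  obesity_pct = c(33.1, 29.8, NA, 36.4),
  smoking_pct = c(19.2, 17.5, 22.0, 21.3),
  stringsAsFactors = FALSE
)
econ <- data.frame(
  fips             = c("01001", "01003", "01005", "01009"),
  unemployment_pct = c(3.6, 3.9, 5.8, 4.1),
  stringsAsFactors = FALSE
)

# Inner join on the county identifier: keeps counties present in both tables.
merged <- merge(health, econ, by = "fips")

# Drop every county that has a missing value in any analysis variable.
complete <- merged[complete.cases(merged), ]

# Rename analysis variables to short codes (V1, V2, ...) and keep a lookup
# table so the codes can be mapped back to readable names later.
vars   <- setdiff(names(complete), "fips")
lookup <- data.frame(code = paste0("V", seq_along(vars)),
                     original = vars,
                     stringsAsFactors = FALSE)
names(complete)[match(vars, names(complete))] <- lookup$code

print(lookup)
print(complete)
```

Keeping the `lookup` table alongside the renamed data is one way to avoid losing track of what each short code means when reading model output.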
2. Feature Transformation
- Applied log and square root transformations to reduce skew in heavily skewed variables
- Evaluated distributions via histograms and correlation heatmaps
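As a rough sketch of this step, the snippet below applies a log and a square root transformation to synthetic skewed variables and compares skewness before and after. The data, the variable names, and the small `skewness` helper are all assumptions for illustration; the project's own variables and diagnostics may differ.

```r
set.seed(1)

# Synthetic skewed variables; names and distributions are hypothetical.
n  <- 500
df <- data.frame(
  V1 = rlnorm(n, meanlog = 3, sdlog = 0.8),  # strictly positive, heavy right tail
  V2 = rpois(n, lambda = 4)                  # count-like, may contain zeros
)

# Simple moment-based skewness estimate (helper defined here, not from a package).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

df$V1_log  <- log(df$V1)   # log suits strictly positive values
df$V2_sqrt <- sqrt(df$V2)  # square root tolerates zeros

print(round(sapply(df, skewness), 2))

# Histograms of the raw and transformed variables, side by side.
op <- par(mfrow = c(2, 2))
for (v in names(df)) hist(df[[v]], main = v, xlab = v)
par(op)

# Correlation matrix; drawn as a heatmap when corrplot is installed.
corr_mat <- cor(df)
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_mat, method = "color")
} else {
  print(round(corr_mat, 2))
}
```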
3. Modeling & Analysis
- LASSO regression, with the penalty strength chosen by cross-validation, lowered test RMSE to 25.2, outperforming the OLS baseline
- PCA retained 4 components capturing ~70.7% of the variance
- Factor Analysis revealed latent themes (e.g., poor health, access to care)
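The LASSO versus OLS comparison can be illustrated as follows. This is a hedged sketch on synthetic data, assuming a 70/30 train/test split and 10-fold cross-validation; neither setting is stated in the project, and the numbers it prints are unrelated to the reported RMSE of 25.2.

```r
library(glmnet)

set.seed(42)

# Synthetic predictors; only V1, V2, V3 carry signal in this toy example.
n <- 300
p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- 50 + 8 * x[, 1] - 5 * x[, 2] + 3 * x[, 3] + rnorm(n, sd = 10)

# Assumed 70/30 train/test split.
train <- sample(n, size = round(0.7 * n))
rmse  <- function(obs, pred) sqrt(mean((obs - pred)^2))

# OLS baseline on all predictors.
train_df <- data.frame(y = y[train], x[train, ])
ols      <- lm(y ~ ., data = train_df)
ols_rmse <- rmse(y[-train], predict(ols, newdata = data.frame(x[-train, ])))

# LASSO (alpha = 1); lambda is picked by 10-fold CV on the training set only.
cvfit      <- cv.glmnet(x[train, ], y[train], alpha = 1, nfolds = 10)
lasso_pred <- predict(cvfit, newx = x[-train, ], s = "lambda.min")
lasso_rmse <- rmse(y[-train], as.numeric(lasso_pred))

# Predictors with nonzero coefficients at the selected lambda.
b        <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")

print(c(OLS = ols_rmse, LASSO = lasso_rmse))
print(selected)
```

Using `s = "lambda.1se"` instead of `"lambda.min"` would typically give a sparser model at a small cost in fit; which one the project used is not stated.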
4. Classification via LDA
- Predicted:
- Area_Rucc (urban-rural classification) – 73.2% accuracy after class merging
- Economic Typology – 63.7% accuracy
- Urban Influence Level – 60.2% accuracy
- Evaluation based on confusion matrices and class-group patterns
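A minimal sketch of the LDA workflow with class merging is shown below, using `MASS::lda` on synthetic data. The nine-code variable, the three merged groups, the predictors, and the 70/30 split are all assumptions for illustration; the project's actual merging scheme for Area_Rucc is not documented here.

```r
library(MASS)

set.seed(123)

# Synthetic counties with a nine-level code (hypothetical stand-in for Area_Rucc).
n    <- 900
rucc <- sample(1:9, n, replace = TRUE)

# Toy signal: the predictor means depend on the broad group a code belongs to.
grp_mean <- ifelse(rucc <= 3, 0, ifelse(rucc %% 2 == 0, 1.5, 3))
df <- data.frame(
  rucc = factor(rucc),
  V1   = rnorm(n, mean = grp_mean),
  V2   = rnorm(n, mean = 0.5 * grp_mean)
)

# Class merging: collapse the nine codes into three broader groups (assumed scheme).
df$area <- df$rucc
levels(df$area) <- list(
  metro       = c("1", "2", "3"),
  adjacent    = c("4", "6", "8"),
  nonadjacent = c("5", "7", "9")
)

# Assumed 70/30 train/test split.
test_idx <- sample(nrow(df), size = round(0.3 * nrow(df)))
fit  <- lda(area ~ V1 + V2, data = df[-test_idx, ])
pred <- predict(fit, newdata = df[test_idx, ])$class

# Confusion matrix and overall accuracy on the held-out counties.
conf_mat <- table(actual = df$area[test_idx], predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(round(accuracy, 3))
```

Reading the off-diagonal cells of `conf_mat` shows which groups are confused with each other, which is usually more informative than the single accuracy figure.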
Results and Impact
🧩 Key Results
- LASSO Regression identified key predictors of heart disease mortality:
- Obesity, diabetes, smoking, low birthweight, and physical inactivity
- LDA Classification achieved:
- 73.2% accuracy in predicting urban-rural area classification (Area_Rucc)
- 63.7% accuracy for local economic typology
- 60.2% accuracy for urban influence level
- PCA & Factor Analysis revealed interpretable latent factors:
- Poor health conditions and healthcare access clustered together
- Economic stressors (e.g., unemployment, lack of insurance) formed distinct components
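To illustrate how components can be retained by cumulative variance and then rotated for interpretation, here is a hedged sketch on synthetic data. The two latent drivers, the variable names, and the 70% retention threshold are assumptions; the rule the project used to arrive at 4 components is not stated.

```r
set.seed(7)

# Two hypothetical latent drivers generate five observed indicators.
n           <- 400
poor_health <- rnorm(n)
econ_stress <- rnorm(n)
df <- data.frame(
  obesity      = poor_health + rnorm(n, sd = 0.5),
  diabetes     = poor_health + rnorm(n, sd = 0.5),
  inactivity   = poor_health + rnorm(n, sd = 0.5),
  unemployment = econ_stress + rnorm(n, sd = 0.5),
  uninsured    = econ_stress + rnorm(n, sd = 0.5)
)

# PCA on standardized variables.
pca           <- prcomp(df, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)
print(round(cum_var, 3))

# Smallest number of components reaching the assumed 70% threshold.
k <- which(cum_var >= 0.70)[1]

# Varimax-rotated components via psych::principal, if psych is installed.
if (requireNamespace("psych", quietly = TRUE)) {
  rotated <- psych::principal(df, nfactors = k, rotate = "varimax")
  print(rotated$loadings, cutoff = 0.4)
}
```

In this toy setup the health indicators load on one rotated component and the economic indicators on another, which mirrors the kind of grouping described above.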
💡 Key Insights
- Health behavior patterns, particularly obesity, smoking, and diabetes, show a strong, consistent association with heart disease mortality.
- Geographic context matters: urban vs. rural location, proximity to metro areas, and economic typology all influenced model predictions.
- Predictive modeling can help inform public health resource allocation by identifying counties with overlapping socioeconomic risk factors.
Access Full Details and Files
For full project details, source files, and additional insights, visit the GitHub repository.