Abstract 18809: Development of an Electronic Health Record-Based Algorithm for Smoking Status Using the Million Veteran Program (MVP) Cohort Survey Response
Introduction: The Veterans Health Administration (VHA) is the largest integrated health care system with over 16 years of electronic health record (EHR) data for more than 12 million veterans. One challenge with EHR data is accurately defining risk factors, such as smoking status, that may not be recorded simply as structured data elements. Smoking remains a major risk factor for many chronic diseases such as cardiovascular disease and lung cancer, and smokers have a higher mortality risk than non-smokers. As such, it is important to accurately define smoking status to quantify its effect on disease and minimize its effect as a potential confounder.
Methods: We developed a probabilistic model to predict smoking status using the Million Veteran Program (MVP) self-reported smoking data as the gold standard (N=79,440). MVP is an ongoing cohort study and mega-biobank that collects genetic samples and self-reported data on lifestyle factors. Participants were categorized as a never or ever smoker based on their Baseline and Lifestyle survey responses. LASSO [Least Absolute Shrinkage and Selection Operator] regression with 10-fold cross-validation was used to select the most important predictors of smoking status from the EHR structured data and to apply a penalty to prevent overfitting. The beta coefficients were used to calculate the predicted probability of being a never or ever smoker.
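The last step of the Methods, turning beta coefficients into a predicted probability, can be illustrated with a minimal Python sketch. It assumes a logistic link, which the abstract does not state explicitly. The intercept, coefficient values, and feature names below are hypothetical placeholders, since the fitted coefficients are not reported here.

```python
import math

# Hypothetical values for illustration only; the abstract does not report
# the fitted intercept, coefficients, or exact predictor names.
INTERCEPT = -1.0
BETAS = {
    "health_factor_current_smoker": 2.5,
    "health_factor_former_smoker": 1.8,
    "icd9_tobacco_dependence": 2.0,
}


def ever_smoker_probability(features):
    """Return P(ever smoker) for one subject, assuming a logistic model.

    `features` maps predictor names to 0/1 indicators; missing names count as 0.
    """
    linear = INTERCEPT + sum(
        beta * features.get(name, 0) for name, beta in BETAS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


def most_probable_category(features):
    """Assign the more probable of the two categories (cut point of 0.5)."""
    return "ever" if ever_smoker_probability(features) >= 0.5 else "never"
```

Under these assumed coefficients, a subject with no smoking-related records gets a probability below 0.5 and is labeled a never smoker, while a single tobacco dependence code is enough to move the label to ever smoker.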
Results: There were 73% ever and 27% never smokers in the MVP cohort. Using the most probable smoking category for a subject, the algorithm sensitivity for an ever and never smoker was 86% and 87%, respectively. The specificity for an ever and never smoker was 87% and 87%, respectively. The algorithm’s positive predictive value was 95%. The final predictors in the model were 7 of the 11 smoking-related adjudicated health factors, and ICD-9 codes for tobacco dependence.
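As a rough consistency check on the reported figures, positive predictive value can be derived from sensitivity, specificity, and prevalence using Bayes' rule. The sketch below plugs in the values reported for ever smokers. The function name is illustrative, and reading the reported 95% as the PPV for the ever-smoker category is an assumption, since the abstract does not say which category it refers to.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / (true positives + false positives)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Values reported in the Results for ever smokers:
# sensitivity 86%, specificity 87%, prevalence 73%.
ppv_ever = positive_predictive_value(0.86, 0.87, 0.73)
print(f"PPV for ever smokers: {ppv_ever:.3f}")
```

With these inputs the calculation gives roughly 0.947, which rounds to the reported 95%.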
Conclusion: A probabilistic model using all smoking-related structured data from a large EHR database can produce a highly sensitive and specific model for identifying ever vs. never smokers. Furthermore, a probabilistic approach offers greater utility for research projects that may have different needs for a smoking status variable; for example, some studies may want to include only subjects with a very high probability of being a smoker.
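The flexibility described in the Conclusion can be sketched as a study-specific cut point applied to the predicted probability. The function name and the threshold values of 0.9 and 0.1 below are hypothetical and are not taken from the study.

```python
def classify_with_threshold(prob_ever, high=0.9, low=0.1):
    """Label a subject only when the predicted probability is decisive.

    `high` and `low` are hypothetical study-specific cut points.
    """
    if prob_ever >= high:
        return "ever"
    if prob_ever <= low:
        return "never"
    return "uncertain"


# Hypothetical predicted probabilities of being an ever smoker.
predicted = [0.97, 0.55, 0.04]
labels = [classify_with_threshold(p) for p in predicted]
```

A study that needs a high-confidence smoker group would keep only subjects labeled "ever", while a study willing to accept more misclassification could lower `high` toward 0.5.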
Author Disclosures: R.J. Song: None. Y. Ho: None. X.T. Nguyen: None. J. Honerlaw: None. R. Quaden: None. J. Gaziano: None. J. Concato: None. K. Cho: None. D.R. Gagnon: None.
© 2016 by American Heart Association, Inc.