
Abstract: Enhanced Classification of Diabetes Status and Type Using Structured Data from Nationwide U.S. Electronic Health Records: A High-Throughput Approach

Erin M. Tallon, PhD, RN1,2; Grant Scott, PhD2,3; Cintya Schweisberger, DO1; Joseph T. Cernich, MD1; Ryan McDonough, DO1; Camila Manrique-Acevedo, MD4; Mark A. Clements, MD, PhD1; Chi-Ren Shyu, PhD2,3,5 

1Pediatric Endocrinology, Children’s Mercy Kansas City, Kansas City, Missouri, USA, 2Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA, 3Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA, 4Department of Medicine - Division of Endocrinology, Diabetes and Metabolism, University of Missouri, Columbia, Missouri, USA, 5School of Medicine, University of Missouri, Columbia, Missouri, USA

Background and Aims: Accurate identification of individuals’ diabetes status and type in electronic health records (EHRs) is critical for conducting diabetes health outcomes and disease monitoring studies at scale. Existing case identification approaches use rule-based methods (RBM; e.g., presence of diabetes medications) to identify diabetes status, but these conservative approaches miss many true type 1 (T1D) and type 2 diabetes (T2D) cases. We therefore developed a machine learning method to classify cases “missed” by RBM. 

Methods: We standardized and refined published rule-based criteria across numerous coding systems to identify individuals with T1D (n=37,999), T2D (n=1,507,923), and no diabetes (NoDM; n=11,589,848) in the nationwide, U.S.-based Oracle EHR Real-World Data™ database. After curating 1,980 features from structured EHR data, we used stratified five-fold cross-validation and regularized gradient boosting (XGBoost) to classify individuals’ diabetes status/type using training data that first included (Model 1), and then excluded (Model 2), diabetes-related features. Performance was validated in two cohorts “missed” via RBM, whose diabetes status/type was determined via expert physician review.
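The stratified five-fold splitting mentioned above can be illustrated with a minimal, standard-library Python sketch. It is not the study's pipeline: the function name `stratified_folds`, the toy label counts, and the round-robin assignment are all assumptions chosen for illustration. The point is only that each fold preserves the class proportions, which matters when one class (here T1D) is far rarer than the others.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps the class mix.

    Hypothetical helper for illustration; not taken from the study.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy, made-up cohort with a strong class imbalance (not the study's counts).
labels = ["T1D"] * 5 + ["T2D"] * 20 + ["NoDM"] * 75
folds = stratified_folds(labels, k=5)

# Per-fold class counts; each fold serves once as the held-out set.
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
```

In each round of cross-validation, one fold is held out for evaluation and the remaining four are used for training, so every individual is scored exactly once by a model that did not see them.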

Results: Mean weighted F1 scores for models trained via cross-validation were 0.998 (Model 1) and 0.960 (Model 2). Model 1 achieved a weighted F1 score of 0.978 and T1D/T2D recall of 1.000/0.947 in a validation cohort with insufficient diabetes criteria for identification via RBM. Model 2 achieved a weighted F1 score of 0.876 and T1D/T2D recall of 0.857/0.821 in individuals with mixed T1D/T2D diagnosis codes.
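For readers unfamiliar with the metrics reported above, the following Python sketch shows how a weighted F1 score and per-class recall are commonly computed for a three-class problem. It assumes the usual definition of weighted F1 (per-class F1 averaged with weights proportional to each class's true count); the abstract does not spell out the exact formula used. The function names and the toy labels are hypothetical and do not reproduce any study data.

```python
from collections import Counter


def class_recall(y_true, y_pred, label):
    """Fraction of true `label` cases that were predicted as `label`."""
    positives = [p for t, p in zip(y_true, y_pred) if t == label]
    if not positives:
        return 0.0
    return sum(1 for p in positives if p == label) / len(positives)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy, made-up labels: one T2D case is misclassified as NoDM.
y_true = ["T1D"] * 2 + ["T2D"] * 3 + ["NoDM"] * 5
y_pred = ["T1D"] * 2 + ["T2D"] * 2 + ["NoDM"] * 6

toy_f1 = weighted_f1(y_true, y_pred)
toy_t2d_recall = class_recall(y_true, y_pred, "T2D")
```

Because the weights follow class frequency, a weighted F1 is dominated by the largest class, which is why per-class recall for T1D and T2D is reported alongside it.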

Conclusions: These findings demonstrate the utility of this high-throughput computational phenotyping method for automated, comprehensive classification of diabetes status/type in EHRs.