Background

Cigarette smoking has been reported to increase the risk of several diseases, including chronic obstructive pulmonary disease (COPD) and cardiovascular disease (CVD) [1]. These diseases have been suggested to be caused by the harmful constituents of cigarette smoke [2, 3]. Tobacco companies have been developing products that may reduce disease risks because they emit fewer harmful and potentially harmful constituents [4]. Cigarette smokers switching to these products showed reductions in their biomarkers of exposure (BoE), which are derived from constituents in tobacco smoke [5,6,7,8,9,10,11,12,13]. Biomarkers of potential harm (BoPH) (e.g., oxidative stress, inflammation, lipid metabolism, and platelet activation/coagulation) have also been reported to be closer to those of non-smokers or former smokers in those who switch from conventional cigarettes to tobacco heating systems [14, 15], and nonconventional vapor products [16]. However, further studies are needed to verify the risk reduction achieved by switching to these products, including with a sufficiently large sample and good background information.

The Population Assessment of Tobacco and Health (PATH) study is a joint project of the Food and Drug Administration (FDA) and the National Institutes of Health (NIH), and is one of the largest studies that track tobacco product use and health effects over time [17]. In the PATH study, questionnaire data on smoking status and health status, biomarkers of exposure and potential harm, and other background information were obtained from participants and registered with the Inter-University Consortium for Political and Social Research (ICPSR). The PATH study includes dual and exclusive users of potentially lower-risk products (e.g., electronic cigarettes and smokeless tobacco) as well as current and former smokers and non-smokers. The study data will therefore allow for a more extensive analysis of tobacco smoke exposure and the biological effects of using potentially lower-risk products. Several reports have compared biomarkers among smokers and users of electronic cigarettes and smokeless tobacco products using the PATH study data [18,19,20], making statistical group comparisons for each biomarker. These statistical methods may yield different comparisons and different results for some biomarkers within the same report: for example, the concentrations of nicotine metabolites and tobacco-specific nitrosamines were higher in users of smokeless products than in current smokers, while those of polycyclic aromatic hydrocarbons and volatile organic compounds were lower [18].

Machine learning optimizes technologically advanced computational power and statistical tools for the analysis of big data, with advanced algorithms capable of handling multicollinearity, nonlinearity, and higher-order interactions among variables [21]. In general, supervised learning techniques are used in classification models. The model is trained with labeled data to properly classify non-labeled data. Where binary classification is used, the model shows whether the new observations are closer to one or another option based on the features of the training dataset. Previous studies reported that users of electronic cigarettes and smokeless tobacco could have levels of several biomarkers that are closer to those found in non-smokers or former smokers than current smokers. However, these studies evaluated the biomarkers individually, and not comprehensively. Machine learning can evaluate the difference between participants by integrating all tested biomarkers. This study aimed to develop two classification models to discriminate between current and former smokers based on their biomarkers of exposure and potential harm. The classification models that had learned the features of current and former smokers were then used to evaluate markers from users of electronic cigarettes and smokeless tobacco products to see if they were similar to either current or former smokers. It is also possible to evaluate the differences in disease prevalence in individuals, including users of electronic cigarettes and smokeless tobacco products, between those classified as current and former smokers. This may suggest whether electronic cigarettes and smokeless tobacco products genuinely pose a potentially lower risk to health.

Methods

Study design

We used data for adults (18 to 90 years old) from the public-use files (ICPSR (a), ICPSR 36,498) and restricted-use files (ICPSR (b), ICPSR 36,231) from Wave 1 (September 12, 2013 to December 14, 2014) of the PATH Study. To protect the rights, welfare, and well-being of all human participants in this study, the Westat Institutional Review Board approved the study design and protocol, and the Office of Management and Budget approved data collection. The detailed study design and data collection of the PATH study have previously been published [17].

Participants

Groups were defined using self-reports in the PATH study questionnaire. Participants who reported smoking cigarettes daily but not using any other tobacco products were defined as exclusive cigarette smokers (CS group). Exclusive users of electronic cigarettes and smokeless tobacco products were defined similarly (EPRODS and SMKLS groups). Participants who reported using both conventional and electronic cigarettes, or conventional cigarettes and smokeless tobacco products, were defined as dual users (dual-EPRODS and dual-SMKLS groups). People who reported that they had quit smoking and did not use any tobacco products were defined as former smokers (ExSM group). We also extracted information on age, gender, whether they drank alcohol, lived in urban or rural areas, and health status (cardiovascular disease; respiratory disease; high blood pressure, high cholesterol, and diabetes) from the questionnaire data. For health status, those who answered “Yes” to the question “Has a doctor, nurse or other health professional told you that you have COPD?” were considered to have COPD. People who answered “Yes” to questions on any of COPD, chronic bronchitis, emphysema, or other pulmonary or respiratory diseases were considered to have a respiratory disease. Anyone who answered “Yes” to questions about congestive heart failure, stroke, heart attack, or some other heart condition were considered to have CVD. A total of 2707 participants were included in the analysis for biomarkers of exposure, and 3466 in the analysis for biomarkers of potential harm.

Biomarkers

For the biomarkers of exposure included in the restricted data collected by the PATH study, we selected those that corresponded to harmful and potentially harmful constituents listed by the FDA [4] and evaluated in previous tobacco evaluation studies [6, 10]. These included 4-(methylnitrosamino)-4-(3-pyridyl)-1-butanol (NNAL), N’-nitrosonornicotine (NNNT), total nicotine equivalents (TNE7), N-acetyl-S-(2-carboxyethyl)-L-cysteine (CEMA), N-acetyl-S-(2-hydroxyethyl)-L-cysteine (HEMA), N-acetyl-S-(3-hydroxypropyl)-L-cysteine (HPMA), N-acetyl-S-(phenyl)-L-cysteine (PMA), 1-naphthol or 1-hydroxynaphthalene (P01), 2-naphthol or 2-hydroxynaphthalene (P02), and 1-hydroxypyrene (P10). For the biomarkers of potential harm, we used 8-isoprostane (8PGFT), high-sensitivity C-reactive protein (hsCRP), interleukin 6 (IL6), soluble intercellular adhesion module (sICAM), and fibrinogen (FIB). The creatinine correction was used for each participant for biomarkers in urine samples to adjust for daily excretion.

Machine learning

To predict the exposed constituent and biological effects of tobacco smoke from the use of each tobacco product, we created models with features consisting of biomarkers of either exposure (for exposure [BoE] classification model) or potential harm (for potential harm [BoPH] classification model). We divided data from current and former smokers into training datasets (80%) and test datasets (20%). These test datasets were evaluated as references for users of electronic cigarettes and smokeless tobacco products. The total number of participants and the number used for machine training are shown in Tables 1 and 2. Using 80% of the training data, the two biomarker classification models were created by machine learning, and used to classify participants into current or former smokers. A random forest model (five-fold, 100-repeated cross-validation) was used for the algorithm. The cross-validation accuracy of the models obtained in the machine learning process was evaluated by the receiver operating characteristic curve-area under curve (ROC-AUC). We input data from the two test datasets on current and former smokers, and also on dual and single users of electronic cigarettes and smokeless tobacco products. The percentage of data classified as current and former smokers was tabulated for each group. There were more current than former smokers and the current smokers were therefore randomly sampled by random number generation, to align the number of participants in each group, because using imbalanced data leads to poorer model performance [22]. The “feature importance” was calculated as the percentage of each feature that influenced the classification into current or former smokers in the cross-validation.

Table 1 Demographic information and geometric mean of biomarkers for the exposure biomarkers classification model
Table 2 Demographic information and geometric mean of biomarkers for the potential harm classification model

To assess the prevalence of disease in participants classified as current or former smokers, we calculated the percentage of those labelled with CVD or respiratory diseases among those classified as current or former smokers by the classification model, using data from the questionnaire.

Data analysis, including machine learning, was performed in R software version 4.2.1 using the caret package [23].

Results

Demographic information and biomarkers

The demographic information and biomarker characteristics of all the participants in this study and those randomly selected for machine learning for the two classification models are shown in Tables 1 and 2. In accordance with ICPSR data release rules, tables with cell sizes smaller than the threshold for the specific dataset will not be released.

Machine-learning models

The percentage of users of each tobacco product classified as current or former smokers by (a) the biomarkers of exposure model and (b) the biomarkers of potential harm model, and the order of importance of the features are shown in Figs. 1 and 2. The ROC-AUC, a cross-validation model performance score, was 95.0% for the exposure biomarkers classification model and 82.0% for potential harm biomarkers model. For both the classification models, the scores for both test datasets were more than 75%. A higher percentage of users of either electronic cigarettes or smokeless tobacco products were classified as former than current smokers. Both the dual user groups (dual-EPRODS and dual-SMKLS) had a higher percentage classified as current smokers than exclusive users of a single product. In the exposure biomarkers classification model, the most important feature was TNE7, followed by HPMA. In the potential harm biomarkers classification model, 8PGFT and sICAM showed the highest feature importance, followed by IL6.

Fig. 1
figure 1

Percentage of users of each tobacco and nicotine product classified as current or former smokers by the two classification models. Abbreviations: CS current cigarette smoker, dual-EPRODS user of conventional cigarettes and electronic cigarettes, dual-SMKLS user of conventional cigarettes and smokeless tobacco products, ExSM former smoker, EPRODS user of electronic cigarettes, SMKLS user of smokeless tobacco, test-CS models using test data for current smokers, test-ExSM models using test data for ex-smokers

Fig. 2
figure 2

The features contributing to classification as current or former smokers in cross-validation of the exposure and potential harm classification models (feature importance %). Abbreviations: 8PGFT 8-isoprostane, CEMA N-acetyl-S-(2-carboxyethyl)-L-cysteine, FIB fibrinogen, HEMA N-acetyl-S-(2-hydroxyethyl)-L-cysteine, HPMA N-acetyl-S-(3-hydroxypropyl)-L-cysteine, hsCRP high-sensitivity C-reactive protein, IL6 interleukin 6, NNAL 4-(methylnitrosamino)-4-(3-pyridyl)-1-butanol, NNNT N’-nitrosonornicotine, sICAM soluble intercellular adhesion module, TNE7 total nicotine equivalents, PMA N-acetyl-S-(phenyl)-L-cysteine, P01 1-naphthol or 1-hydroxynaphthalene, P02 2-naphthol or 2-hydroxynaphthalene, P10 1-hydroxypyrene

Disease prevalence

Among those indicated by either classification model as a current or former smoker, we calculated the percentage with CVD or respiratory diseases for each group. The percentage with CVD or respiratory diseases was consistently higher among those classified as current than former smokers in both models (Table 3).

Table 3 The percentage of participants classified as current or former smokers by exposure or potential harm classification models who had cardiovascular or respiratory diseases

Discussion

We first developed the biomarkers of exposure-based classification model to classify participants as either current or former smokers. The result should reflect the overall exposure level to constituents of cigarette smoke. This model predicted that users of electronic cigarettes and smokeless tobacco products would be more like former than current smokers, suggesting that the use of these products reduced exposure. Consistent with previous reports [18, 20], TNE7, the total amount of nicotine metabolites [24], was identified as the most important variable in distinguishing between current and former smokers. TNE7 was much lower in former than current smokers. However, the level in users of either electronic cigarettes or smokeless tobacco products was higher than in current smokers, suggesting that the classification of these two groups did not depend on the TNE7 level but on other biomarkers of exposure. High TNE7 levels in users of potentially lower-risk tobacco products are expected because they provide access to the parent chemical (i.e., nicotine). It is therefore likely that other biomarkers of exposure contribute more to the classification of users of electronic cigarettes and smokeless tobacco products.

Similarly, the classification model based on biomarkers of potential harm also classified users of electronic cigarettes and smokeless tobacco products as more like former than current smokers, but to a lesser extent than the exposure model. The difference in the results between users of electronic cigarettes and smokeless products from the two different classification models may be due to the difference in characteristics as biomarkers. The exposure biomarkers are for cigarette smoke-specific constituents, but the biomarkers of potential harm could be influenced by a variety of factors other than smoking. The most important features of the potential harm model were 8PGFT and sICAM. Both biomarkers are known to be associated with oxidative stress and/or inflammation [25, 26] and also recognized as predisposing factors for CVD [27, 28]. IL-6 was also important in the potential harm biomarkers model. IL-6 has a short half-life and inter-individual variability, and its association with risk for CVD and respiratory diseases is still controversial. However, it is among the most frequently studied biomarkers and several studies have identified its importance for disease risk [29, 30]. The biomarkers of potential harm are important, but no single one can explain all disease risk. Our model therefore includes several biomarkers of potential harm, and could approximate the relative disease risk between cigarette smoking and smoking cessation. Our potential harm classification model is based on the hypothesis that smoking cessation could minimize the risk of smoking-associated diseases. If a user of potentially lower-risk tobacco products is classified as a former smoker based on their biomarkers of potential harm, this suggests that using these products does indeed reduce the disease risk compared with smoking conventional cigarettes. Our results therefore suggest that both electronic cigarettes and smokeless tobacco products could contribute to reduce smoking-associated disease risk. Disease prevalence was slightly lower in participants classified as former smokers than those classified as current smokers. Our results show that participants classified as current smokers in either model had relatively higher rates of CVD or respiratory diseases than those classified as former smokers. There may therefore be a reduced risk of disease among users of potentially lower-risk products. The results of the feature importance, and of previous reports of biomarkers, suggest an association with CVD.

We also evaluated dual users of both conventional cigarettes and either electronic cigarettes or smokeless tobacco products. The exposure model recognized these dual users as current smokers. If participants use both cigarettes and electronic cigarettes or smokeless tobacco products on a daily basis, the biomarkers of exposure could be closer to those of cigarette smokers than those who use only potentially lower-risk products. The result of the exposure model reflects this hypothesis. However, the model based on biomarkers of potential harm showed an intermediate result for dual users, between smokers and users of only electronic cigarettes or smokeless tobacco products, suggesting that dual use reduces the risk of harm. However, longitudinal observation is necessary to detect significant changes in biomarkers of potential harm, and to establish whether robust homeostasis in the human body suppresses rapid changes. Multiple exposures and cumulative effects possibly contribute to the changes in the biomarkers of potential harm as well as the manifestation of health impacts. The extent of risk reduction in dual users would therefore depend on the proportion of cigarette smoking and use of electronic cigarettes and smokeless tobacco products.

This study had several limitations. In this study, we used strict definitions for each group of tobacco product users. Exclusive users had to use only one product on a daily basis. Users of electronic cigarettes and smokeless tobacco products who also smoked were defined as dual-users. Some similarities were seen between users of electronic cigarettes and smokeless tobacco products and former smokers in the population used for modeling. However, this should be further investigated with more study participants to reflect the real-world population. The health impact of e-cigarettes is still controversial [31, 32] and the subject of ongoing study, and we cannot conclude that the risk of disease is reduced even if e-cigarette users were more likely to be classified as former smokers in this study. The variables used in this study are absent from the succeeding waves, and we therefore cannot evaluate the time-related changes in the classification results. However, longitudinal analyses could provide more insights into the health impact of these new tobacco products because the pathogenesis of smoking-related chronic diseases takes time to manifest. Expansion of variables to include items such as “Frequency of product use” and “Start of product use” could also provide additional insights, although we could not obtain sufficient data for this analysis from Wave 1 alone. These limitations could be addressed if the variables used in this study are available in future waves of the PATH study.

Conclusion

The results of our classification models based on biomarkers of exposure and potential harm showed that the biomarker profiles of people who used electronic cigarettes or smokeless tobacco products were more like those of former than current smokers. This suggests that exposure to constituents in cigarette smoke and the resulting biological effects have potential to be reduced with the use of these products, at least among those involved in the PATH study.