1 Introduction

Stroke is the second leading cause of death worldwide and the leading cause of death in China. Ischemic stroke is etiologically diverse, and its etiology is an important area of stroke research; about 30% of ischemic strokes are cryptogenic [1]. Patent foramen ovale (PFO) is present in approximately 25% of the general population, and approximately 40–50% of patients with cryptogenic stroke (CS) present with PFO [2]. However, the potential association between PFO and CS has been controversial for several decades. Transesophageal echocardiography (TEE) is the gold standard and the first-choice method for diagnosing PFO. The tip of the TEE probe carries the Doppler sensor; the most widely used design is a multiplane probe with a rotating sensor, which can "cut" the heart in a series of horizontal and vertical sectional images at any angle from 0 to 180° with little bending or stretching of the probe, making it easier to obtain detailed cardiac imaging data [3]. Therefore, remote sensing technologies provide valuable auxiliary information that can be used to enhance the precision and timeliness of PFO parameter acquisition. Machine learning algorithms are important for predicting the association between PFO and CS [4]. In this article, we propose a random forest model based on TEE imaging data and compare it with other machine learning algorithms.

2 Literature Review

Previous studies have stated that large-size PFO, long-tunnel PFO, the presence of atrial septal aneurysm (ASA), and severe right-to-left (RL) shunt are high-risk factors related to CS [5]; conversely, other studies have shown that there are no morphological differences in PFO between those with and without CS, and that only interatrial shunt is significantly related to possible embolism [6]. However, both sets of studies have ignored the fact that PFO shape, interatrial hemodynamics, and atrial pressures change with the cardiac cycle. Regardless of physiological condition, each normal cardiac cycle presents a transient spontaneous reversal of left and right atrial pressures during right ventricular early diastole and isovolumetric contraction [7], and this reversal gradient may increase substantially under certain physiological conditions, such as coughing, inspiration, and the Valsalva maneuver. Transcatheter PFO closure has been suggested as one therapeutic option for CS; while no strong data from randomized clinical trials have been provided to support primary preventive PFO closure, many results have pointed to the benefits of preventive PFO closure in certain high-risk patients [8]. Moreover, evidence from recent trials has demonstrated that PFO device closure is more effective than medical therapy at preventing the recurrence of stroke, especially in patients with high-risk PFO [9, 10].

Other studies have used logistic regression analysis to propose a predictive model of high-risk PFO related to CS [11]. Machine learning, as a branch of artificial intelligence, has been widely used in the medical and biological sectors. However, PFO-related CS data collected from ultrasound images have a complex distribution, with many features, high noise, and many discrete attribute variables. Although machine learning algorithms can achieve better prediction results than empirical statistical models, traditional intelligent prediction models, including support vector machine (SVM) and artificial neural network (ANN) approaches, have inherent defects: they are highly dependent on the accuracy of the database, inefficient, time-consuming, and prone to falling into local optima [12]. The random forest approach, on the other hand, has good generalization ability and is insensitive to noise, so it is suitable for establishing a prediction model of high-risk features of PFO. Random forest is a supervised ensemble learning classification technique that can solve complex nonparametric and nonlinear classification problems [13]. Moreover, it can reduce computational complexity while improving accuracy, with the advantages of few parameters, strong generalization ability, and strong resistance to overfitting [14]. For high-dimensional data, the comprehensive performance indices of the random forest approach, such as classification accuracy and algorithm efficiency, are clearly superior to those of other single classifiers and ensemble classifiers.

Therefore, random forest technology has been applied widely in the fields of biological information, text mining, and image classification in recent years, and has become a frequent research topic in the fields of data mining, machine learning, and pattern recognition [15]. Based on the aforementioned discussion, our study investigates whether PFO morphological features in different phases of the cardiac cycle can predict CS, and evaluates which of these features show the highest predictive value for high-risk PFO related to CS by establishing a random forest model. All data were taken from Zhongnan Hospital of Wuhan University, and the results were obtained using MATLAB.

3 Materials and Methods

3.1 Study Population

From November 2018 to December 2020, we retrospectively enrolled 151 consecutive patients with PFO detected at our institution. All patients underwent brain and carotid imaging, 12-lead electrocardiography, echocardiography, and a hypercoagulability panel. The TEE and contrast transesophageal echocardiography (c-TEE) findings were reviewed in patients with and without CS. CS was defined as a transient or permanent neurological deficit with an ischemic lesion on magnetic resonance imaging, after all other identifiable causes of the lesion had been excluded. Patients without CS presented with migraine and dizziness, without a cerebral infarction lesion (confirmed by magnetic resonance imaging). CS was evaluated by an experienced neurologist. Patients were excluded if they had poor visualization of PFO, an atrial septal defect, atrial fibrillation, valvular heart disease, congestive heart failure with ejection fraction < 50%, an inability to perform the Valsalva maneuver because of impaired cognition or coordination, or another identifiable cause of stroke. Patients with PFO observed on TEE who underwent a saline contrast study were ultimately included. This study was approved by the ethics committee of Zhongnan Hospital of Wuhan University (No. 2020060 K), and all enrolled patients provided written informed consent.

3.2 Assessment of Characteristics of PFO by TEE and c-TEE

Echocardiographic studies were performed using the GE Vivid E95 cardiovascular ultrasound system equipped with an M5S probe and a 6VT-D probe (GE Vingmed Ultrasound AS, Horten, Norway). The transesophageal probe supports 2-D, M-mode, color, and spectral Doppler imaging. Views can be acquired at the upper esophageal (20–25 cm), mid-esophageal (35–40 cm), and gastric (40–45 cm) levels, at angles of 0–180°, in long- and short-axis and in multiple sections, to observe the dynamic structure and function of the heart and blood vessels; the caliper and tracing tools are used to calculate the area, volume, diameter, and length of each anatomical part of the heart and blood vessels. These views make it possible to assess myocardial systolic and diastolic function, the degree of valve disease, and the course of the blood vessels, and to monitor cardiac function so that the circulation can be regulated in real time and kept stable. A saline-contrast study was performed both during normal respiration and during the Valsalva maneuver. The presence of PFO was confirmed based on either (1) direct visualization of microbubbles passing through the atrial septum to the left atrium within three consecutive cardiac cycles after opacification of the entire right atrium, or (2) visualization of color Doppler flow through the atrial septum (Fig. 1) [16].

Fig. 1 PFO diagnosis. (A) Visualization of color Doppler flow through the atrial septum. (B) Microbubbles passing through the atrial septum; the number of microbubbles in the left atrium was counted

The following parameters on the anatomical and functional characteristics of PFO were studied: PFO size in ventricular systole and diastole, length of PFO tunnel in ventricular systole and diastole, presence of ASA, maximum mobility, presence of prominent Eustachian valve or Chiari’s network, and degree of RL shunt at rest and during the Valsalva maneuver. PFO size was defined as the maximum separation between the septum primum and septum secundum at the point of entry to the left atrium (Fig. 2), and a size of ≥ 2 mm was defined as large-size PFO [17]. The length of the PFO tunnel was measured as the maximum overlap between the septum primum and septum secundum (Fig. 2), and a length of ≥ 10 mm was defined as long-tunnel PFO [18]. Prominent Eustachian valve was defined as a ≥ 10 mm protrusion within the right atrium. Chiari’s network was defined as a network of threads in the right atrium with attachments to the upper wall of the right atrium or the interatrial septum (Fig. 3A) [19]. ASA was defined as ≥ 10 mm of septal excursion from the midline into the right or left atrium, or ≥ 15 mm of total excursion between the right and left atrium (Fig. 3B) [20]. Maximum mobility was equal to the sum of the excursions (the greatest leftward and rightward deflections of the septum with respect to a line perpendicular to the fossa ovalis plane; Fig. 3C). The degree of RL shunt was assessed both at rest and during the Valsalva maneuver using agitated saline contrast. According to the number of microbubbles appearing in the left atrium, the degree of shunt was defined as mild (3–9), moderate (10–30), or severe (> 30). Each transesophageal echocardiographic study was reviewed and analyzed on the EchoPAC system (GE Vingmed Ultrasound AS, Horten, Norway) by two independent cardiologists who were blinded to the CS status of the patients.

Fig. 2 PFO size and tunnel length. The PFO size (white arrow) was measured as the maximum separation in ventricular systole and diastole; the length of the PFO tunnel (yellow arrow) was measured as the maximum overlap in ventricular systole and diastole

Fig. 3 PFO characteristics. (A) A network of threads in the right atrium with attachments to the upper wall of the right atrium was defined as Chiari’s network (arrow). (B) ASA was defined as ≥ 10 mm of septal excursion from the midline into the right or left atrium. (C) A moving and floppy septum defined the maximum mobility of the interatrial septum
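The categorical definitions in this subsection amount to simple threshold rules. The following Python sketch encodes only the cut-offs stated above; the function names and the label `"ungraded"` for microbubble counts below 3 (which the text does not classify) are assumptions made for illustration and are not taken from the study's software.

```python
def is_large_pfo(size_mm: float) -> bool:
    """Large-size PFO: maximum separation of at least 2 mm."""
    return size_mm >= 2.0


def is_long_tunnel(length_mm: float) -> bool:
    """Long-tunnel PFO: maximum overlap of at least 10 mm."""
    return length_mm >= 10.0


def has_asa(right_excursion_mm: float, left_excursion_mm: float) -> bool:
    """ASA: >= 10 mm excursion to either side, or >= 15 mm in total."""
    one_sided = max(right_excursion_mm, left_excursion_mm) >= 10.0
    total = (right_excursion_mm + left_excursion_mm) >= 15.0
    return one_sided or total


def grade_rl_shunt(microbubbles: int) -> str:
    """Grade the RL shunt from the microbubble count in the left atrium."""
    if microbubbles > 30:
        return "severe"
    if microbubbles >= 10:
        return "moderate"
    if microbubbles >= 3:
        return "mild"
    return "ungraded"  # assumption: the text does not grade counts below 3


print(grade_rl_shunt(12), is_large_pfo(2.4), has_asa(6.0, 9.5))
```

Encoding the rules once in this way keeps the binary features fed to a model consistent with the continuous measurements they are derived from.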

3.3 Random Forest

Random forest is a statistical learning method based on classification trees. It uses bootstrap resampling to draw multiple sample sets with replacement from the original sample set and fits a decision tree to each sample set. During modeling, each decision tree randomly selects a subset of features from which to choose the split attribute at each internal node; the final prediction is obtained by aggregating the votes of all decision trees in the forest [21] (Fig. 4).

Fig. 4 Flow diagram of the random forest regression

3.3.1 Generalization Error of Random Forest

When a bootstrap sample set is drawn, about 36.8% of the samples in the original sample set are not selected. These samples are called out-of-bag (OOB) data and can be used to estimate the generalization error of the model [22]. The generalization error of a random forest can be expressed as follows:

$$E^{*} = P_{X,Y} (M(X,Y) < 0),$$
(1)

where \(M(X,Y)\) is the margin function of the forest and the subscripts X and Y indicate that the probability P is taken over the X and Y spaces. In a random forest, when the number of decision trees is large enough, \(E^{*}\) converges to

$$P_{X,Y} (P_{\theta } (h(X,\theta ) = Y) - \mathop {\max }\limits_{j \ne Y} P_{\theta } (h(X,\theta ) = j) < 0).$$
(2)

This shows that a random forest does not overfit as the number of decision trees increases; instead, its generalization error approaches a finite limiting value.
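The figure of about 36.8% quoted at the start of this subsection follows from the bootstrap itself. As a brief worked derivation, assuming each of the \(n\) draws is uniform and independent (the standard bootstrap setting; the symbol \(n\) for the size of the original sample set is introduced here for illustration), the probability that a given sample is never drawn is

$$\left( {1 - \frac{1}{n}} \right)^{n} \to e^{ - 1} \approx 0.368\quad (n \to \infty ).$$

For a training set of roughly 120 samples, the finite-sample value is already within about 0.002 of this limit, so approximately one third of the training data is available to each tree as OOB data.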

3.3.2 Evaluation of the Significance of Features

There are two primary methods by which to judge the importance of random forest features: one is to rank each feature according to its decrease in Gini impurity, and the other is to calculate the influence of each feature on the accuracy of the model. This paper chooses the latter method to evaluate the importance of feature variables.

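When determining the importance of a feature, the OOB error \(r_{1}\) of each tree was calculated using the corresponding OOB data; the values of that feature in the OOB data were then randomly permuted, and the OOB error \(r_{2}\) was calculated again. Assuming that there are N decision trees in a random forest, the importance I of the feature is

$$I = \frac{{\sum\limits_{i = 1}^{N} {\left( {r_{2}^{(i)} - r_{1}^{(i)} } \right)} }}{N},$$
(3)

where the superscript \((i)\) denotes the errors of the i-th tree, so that a feature whose permutation increases the error receives a positive importance.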

After obtaining the importance degree of each feature, recursive feature elimination was used to sequentially reject the features with the least importance until the optimal number of features was reached, thus enabling feature selection.
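The permutation procedure described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: each "tree" is any callable classifier paired with its own OOB rows, the helper names are hypothetical, and importance is taken as the error after permutation minus the error before it (the sign convention of Eq. (4)).

```python
import random
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]
Row = Tuple[Features, int]          # (feature vector, class label)
Tree = Callable[[Features], int]    # stand-in for one fitted decision tree


def error_rate(tree: Tree, rows: List[Row]) -> float:
    """Fraction of rows that the tree misclassifies."""
    return sum(tree(x) != y for x, y in rows) / len(rows)


def permutation_importance(
    trees_with_oob: List[Tuple[Tree, List[Row]]],
    feature: int,
    rng: random.Random,
) -> float:
    """Mean increase in OOB error after shuffling one feature column."""
    total = 0.0
    for tree, oob_rows in trees_with_oob:
        r1 = error_rate(tree, oob_rows)              # baseline OOB error
        column = [x[feature] for x, _ in oob_rows]
        rng.shuffle(column)                          # break the feature-label link
        permuted = [
            (list(x[:feature]) + [v] + list(x[feature + 1:]), y)
            for (x, y), v in zip(oob_rows, column)
        ]
        r2 = error_rate(tree, permuted)              # OOB error after permutation
        total += r2 - r1
    return total / len(trees_with_oob)


# Toy usage with one hypothetical stump that only looks at feature 0.
stump: Tree = lambda x: int(x[0] > 0.5)
rows: List[Row] = [((i % 2, 0.3 * i), i % 2) for i in range(20)]
rng = random.Random(0)
print(permutation_importance([(stump, rows)], feature=0, rng=rng))
print(permutation_importance([(stump, rows)], feature=1, rng=rng))  # 0.0: the stump ignores feature 1
```

A feature the model never uses scores exactly zero, which is what makes the measure usable as a ranking criterion for recursive feature elimination.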

3.4 Assessment of CS Risk Based on the Random Forest Approach

A technical flowchart of the PFO-related CS risk prediction model based on the random forest approach is shown in Fig. 5. First, the CS risk-assessment system was constructed, and the training set was established by random sampling. Next, the parameters were optimized and a random forest training model was established. The importance of the evaluation indicators was then determined, the test set data were input into the training model, and each regression tree in the model obtained a set of predicted values based on the test set data. The mean value was the final prediction result, and error analysis and variable sensitivity analysis were performed thereon.

Fig. 5 Flow diagram of the random forest regression model on cryptogenic stroke risk

3.5 Construction of PFO-Related CS Risk Assessment System and Establishment of Training Set

Establish a risk-assessment system. The mechanism of CS induced by PFO was analyzed, and relevant influencing factors were obtained. According to a large amount of practical experience and relevant references, the PFO-related CS risk evaluation index system was constructed and the risk grade was determined.

Set up the original training set sample. Each index of the index system was taken as a random forest variable, and the index-related data were taken as the original data set, of which 80% was used for training.

Generate random bootstrap sample sets. The original training set was recorded as \(T = \left\{ {\left( {x_{1} ,y_{1} } \right),\left( {x_{2} ,y_{2} } \right), \cdots ,\left( {x_{n} ,y_{n} } \right)} \right\}\). The bootstrap sampling method was used to draw \(k\) samples of size \(n\) with replacement from \(T\), so as to form \(k\) mutually independent training sets \(\left\{ {T_{i} ,i = 1,2, \cdots ,k} \right\}\).
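The bootstrap step above can be sketched as follows. This is a minimal sketch using only the Python standard library; the function name `bootstrap_sets` is hypothetical, and returning the out-of-bag rows alongside each \(T_{i}\) is an added convenience rather than something the text specifies.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def bootstrap_sets(
    T: Sequence[S], k: int, rng: random.Random
) -> List[Tuple[List[S], List[S]]]:
    """Return k (in-bag, out-of-bag) pairs drawn from T with replacement."""
    n = len(T)
    result = []
    for _ in range(k):
        drawn = [rng.randrange(n) for _ in range(n)]   # n draws with replacement
        chosen = set(drawn)
        in_bag = [T[i] for i in drawn]                 # the training set T_i
        out_of_bag = [T[i] for i in range(n) if i not in chosen]
        result.append((in_bag, out_of_bag))
    return result


# Toy usage: 5 bootstrap sets from a training set of 120 placeholder samples.
sets = bootstrap_sets(list(range(120)), k=5, rng=random.Random(1))
for in_bag, oob in sets:
    print(len(in_bag), len(oob))
```

Each \(T_{i}\) has the same size as \(T\) but contains repeated samples, and the samples it misses form the OOB data for the corresponding tree.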

3.6 Determination of Optimal Parameters and Establishment of the Training Model

Choose the best branch via k-fold cross-validation. The k-fold cross-validation method was adopted to divide the initial sample into k subsamples; a single subsample was retained as the validation data, and the other k − 1 subsamples were used for training. This was repeated so that each subsample served once as the validation data. Finally, the average prediction accuracy of the k models was used as the final estimate of the prediction accuracy of the model, and the splitting mode with the highest prediction accuracy was selected as the optimal branch. Tenfold cross-validation was adopted in this study [23].

Select the optimal parameters. During tree generation, M features (the mtry parameter) were randomly selected from the full feature set at each node, and the feature that maximized the information gain ratio was chosen as the split variable. A random forest model was established, the mean square error was observed as a function of ntree, and the number of trees corresponding to the minimum mean square error was selected as the best ntree value, that is, the number of regression trees.

Establish a training model. Taking the optimal branch as the random forest input, each node was divided into two branches according to the selected feature, and the best feature was then determined from the remaining features. In this way, the branches of each tree were constructed recursively, and the tree was grown to its maximum size without pruning, generating a decision tree. The process was repeated to establish the random forest training model.
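The cross-validation step in this subsection can be sketched as follows. This is an illustrative sketch, not the authors' code: `k_fold_splits` and `cross_validated_accuracy` are hypothetical names, `fit` is assumed to be any function that takes training rows and returns a `predict(x)` callable, and the sketch assumes k does not exceed the number of samples.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]   # (feature vector, class label)


def k_fold_splits(
    n: int, k: int, rng: random.Random
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of the k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]    # k nearly equal subsamples
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation


def cross_validated_accuracy(
    fit: Callable[[List[Row]], Callable[[Sequence[float]], int]],
    rows: List[Row],
    k: int,
    rng: random.Random,
) -> float:
    """Average validation accuracy over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(rows), k, rng):
        predict = fit([rows[i] for i in train_idx])
        correct = sum(predict(rows[i][0]) == rows[i][1] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / k


# Toy usage: tenfold splits of 151 samples give folds of 15 or 16 samples.
sizes = [len(val) for _, val in k_fold_splits(151, 10, random.Random(0))]
print(sorted(sizes))
```

Candidate settings (for example, different values of mtry) would each be scored with `cross_validated_accuracy`, and the setting with the highest average accuracy retained.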

3.7 Evaluation of Variable Importance and Model Fit Prediction

Evaluate variable importance. For each tree in the random forest, the corresponding OOB data were used to calculate its OOB error, which was recorded as errOOB1. Noise interference was then randomly added to feature X in all samples of the OOB data, and the OOB error was calculated again and recorded as errOOB2. Thus, the importance of feature X is

$${\text{Importance}} = \sum {\left( {\text{errOOB2}} -{\text{errOOB1}} \right)} /{\text{Ntree}}{.}$$
(4)

Evaluate model fit prediction. The remaining 20% of the data formed the test set and was input into the trained model, which was used to predict the test set data. The average of the output values of all decision trees was taken as the prediction value of the random forest. The fit of the random forest model on the training set and its predictions on the test set were plotted to obtain the model fit diagram and the prediction diagram. The prediction result of the random forest regression model is

$$f_{r} \left( x \right) = \frac{1}{k}\sum\limits_{i = 1}^{k} {h_{i} } \left( x \right).$$
(5)

In Eq. (5), \(f_{r} \left( x \right)\) represents the predicted value of the random forest regression model, and \(h_{i} \left( x \right)\) represents the predicted value of a single regression tree model.
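Equation (5) is a plain average over the trees. As a minimal sketch, assuming each regression tree \(h_{i}\) is available as a Python callable (the constant-valued trees below are hypothetical placeholders, not fitted models):

```python
from typing import Callable, Sequence

Tree = Callable[[Sequence[float]], float]   # stand-in for one regression tree h_i


def forest_predict(trees: Sequence[Tree], x: Sequence[float]) -> float:
    """Eq. (5): the forest prediction is the mean of the k tree predictions."""
    return sum(h(x) for h in trees) / len(trees)


# Toy usage with three hypothetical trees that return fixed risk scores.
trees = [lambda x: 0.2, lambda x: 0.4, lambda x: 0.9]
print(forest_predict(trees, [2.1, 5.9]))
```

Averaging is what reduces the variance of the individual, fully grown trees.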

3.8 Evaluation of Model

Once the model had been established, it had to be evaluated to determine whether it was suitable for disease prediction. In this study, the true-positive rate and false-positive rate on the test set were calculated using the R language, and the receiver operating characteristic (ROC) curve was drawn and the area under the curve (AUC) was calculated using the pROC package in R to evaluate the random forest model.
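For readers who want to see what this evaluation computes, the following standard-library Python sketch builds the curve from true-positive and false-positive rates and integrates it with the trapezoidal rule. It is not the pROC implementation used in the study; the function names and the toy scores are hypothetical, and the sketch assumes the labels contain at least one positive (1) and one negative (0) case.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false-positive rate, true-positive rate) pairs over all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process tied scores together so they produce a single curve point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy usage with hypothetical predicted risks and true CS labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

An AUC of 0.5 corresponds to the 45° chance line, and values closer to 1 indicate better separation of patients with and without CS.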

3.9 Statistical Analysis

All statistical analyses were performed using SPSS 19.0 (IBM Corporation, Armonk, NY, USA). Data were described as mean ± SD for continuous variables and as number (percentage) for categorical variables. Student’s t test, the Mann–Whitney U test, the χ2 test, and Fisher’s exact test were used, where appropriate, to compare baseline characteristics and echocardiographic PFO characteristics between patients with and without CS. Interobserver agreement was analyzed using the intraclass correlation coefficient or Cohen’s κ statistic, based on data from 15 randomly selected patients recorded by observers A and B. P values of < 0.05 were considered significant.
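As an illustration of the agreement statistic named above, Cohen’s κ compares the observed agreement between two raters with the agreement expected by chance. The following is a minimal standard-library sketch, not the SPSS procedure used in the study; the function name and the toy ratings are hypothetical.

```python
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same cases."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    a, b = list(rater_a), list(rater_b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Toy usage: hypothetical ASA present/absent calls by two observers.
observer_a = [1, 1, 0, 0]
observer_b = [1, 0, 0, 0]
print(cohens_kappa(observer_a, observer_b))
```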

4 Results

4.1 Patient Characteristics

Of the 151 patients recruited, the mean age was 50.9 ± 13.9 years and 80 (53%) were male. CS was found in 66 (43.7%) patients, hypertension in 90 (59.6%) patients, diabetes in 45 (29.8%) patients, and hyperlipidemia in 58 (38.4%). In addition, 46 (30.4%) patients reported being a current or prior smoker and 47 (31.1%) had a body mass index ≥ 25. The baseline characteristics did not differ significantly between patients with and without CS, as shown in Table 1.

Table 1 Comparison of basic information between CS and non-CS groups

4.2 Echocardiographic Characteristics of PFOs

As shown in Table 1, the size of the PFOs was significantly greater in patients with CS than in those without CS in both systole and diastole [systole 2.0 (1.5, 2.9) mm versus 1.6 (1.1, 2.0) mm, p < 0.001; diastole 1.7 (1.4, 2.2) mm versus 1.3 (1.1, 1.8) mm, p < 0.001]. Large PFOs in systole and diastole were more common in patients with CS (systole 51.5% versus 30.6%, p = 0.009; diastole 34.8% versus 18.8%, p < 0.001), while long-tunnel PFO and length of tunnel showed no significant difference between the two groups in either systole or diastole. Patients with CS had a greater maximum mobility [5.9 (3.3, 7.6) mm versus 3.2 (2.3, 4.6) mm, p < 0.001]. ASA was present in 27.3% of patients with CS, compared with 1.2% of patients without CS (p < 0.001).

4.3 Random Forest Model Results

The test results of the random forest model are summarized in Fig. 6. Among the 30 test samples, 21 were correctly predicted, giving an accuracy of 70%. Moreover, the receiver operating characteristic (ROC) curve diverged from the 45° line joining the coordinates (0,0) and (1,1) and yielded an area under the curve of 0.816, which indicates that the model has an acceptable ability to discriminate between patients at low and high risk.

Fig. 6 Classification results visualized by the confusion matrix and receiver operating characteristic curve

4.4 Discovery of High-Risk Factors

The order of importance of variables in the random forest model is shown in Fig. 7, in which maximum mobility, large RL shunt during Valsalva maneuver, size of PFO in diastole and systole, and diastolic length of the tunnel are the top five most important variables of the random forest.

Fig. 7 Variable importance in the random forest model

4.5 Reproducibility

Data from 15 randomly selected patients were used to assess interobserver agreement. The interobserver intraclass correlation coefficient between two reviewers for size of PFO in diastole was 0.91 (0.76–0.97), and was 0.85 (0.62–0.95) for maximum mobility. There was 100% agreement between reviewer 1 and reviewer 2 for the presence of ASA and the classification of the severe RL shunt at rest and Valsalva maneuver.

4.6 Prediction Model Accuracy Comparison

Using the same dataset, we chose ANN [24, 25] and SVM [26] models to predict CS based on PFO features from TEE imaging data. The prediction results were compared with those of the RF model, and the root mean square error (RMSE) and goodness of fit were used to measure the prediction performance of each model. The error comparison of the prediction results of the different models is shown in Table 2.

Table 2 Error comparison

(1) The RF model has the highest goodness of fit. The goodness of fit of the RF prediction results in the training set and test set is 0.968 and 0.951, respectively, and its coefficient of determination is the closest to 1 among the three models. (2) The prediction error of the RF model is the smallest. The RMSEs of the RF prediction results in the training set and test set are 0.036 and 0.095, respectively, which are close to 0 and lower than those of the other two prediction models. In summary, the random forest prediction model shows strong adaptability and superiority in this prediction task and can obtain prediction results with high accuracy and reliability [27, 28].
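The two measures used in this comparison can be computed as follows. This is a minimal sketch of the standard definitions of RMSE and the coefficient of determination, not the code used to produce Table 2; the function names and toy values are hypothetical, and `r_squared` assumes the observed values are not all identical.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy usage with hypothetical observed outcomes and predicted risks.
observed = [0.0, 1.0, 1.0, 0.0]
predicted = [0.1, 0.8, 0.9, 0.2]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

A lower RMSE and a coefficient of determination closer to 1 both indicate predictions that track the observed values more closely.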

5 Discussion

This study developed a random forest model for high-risk PFO associated with CS, in which 21 variables were included. The model’s predictive ability was found to be acceptable. The random forest approach has been shown to be highly efficient in processing medical data and has been widely used in research on genes, proteins, drugs, and diseases. However, random forest models relating the morphologic characteristics of PFO on TEE to ischemic lesions have not been investigated to date. The accuracy on the final test set in this study was 70%; in addition, the area under the curve of the model was 0.816, with a sensitivity of 73% and a specificity of 65%, indicating that the established random forest model performed well in identifying the risk factors for CS in patients with PFO.

In this study, random forest was implemented to quantify feature importance; it was found that maximum mobility, large RL shunt during the Valsalva maneuver, size of PFO in diastole and systole, and diastolic length of the tunnel are closely related to CS. In approximately 40% of patients with ischemic stroke, the origin of cerebral ischemic events remains unknown [29, 30]. Multiple studies have shown that PFO can be implicated in the pathogenesis of CS [31, 32]. Other studies have determined that PFO size is closely related to CS. Nevertheless, some studies have shown that there is no significant association between the anatomy of PFO and paradoxical cerebral embolism [33]. Schuchlenz et al. [34] concluded that PFO size measured at the exit location (left atrial side) is an independent risk factor for ischemic events, and that patients with a PFO size > 4 mm have a substantial risk of recurrent strokes. In contrast, Nakayama et al. [10] observed that PFO size measured in the end-systolic frame is not related to CS. However, PFO size changes during the cardiac cycle and differs depending on the location in the tunnel. PFO size in systole has been found to be greater than that in diastole at the entrance, middle, and exit locations, and PFO size at the entrance (right atrial side) has been shown to be greater than that at the exit (left atrial side). The current study investigated the relationship between CS and PFO size at the exit location (left atrial side) in both systole and diastole; our findings reveal that the sizes of the PFO in diastole and systole are both related to CS.

PFO is generally considered to be the anatomic means by which paradoxical cerebral embolism develops [35]. The saline contrast TEE test is a widely accepted noninvasive standard for diagnosing PFO, enabling the RL flow to be noted along with semi-quantification of RL shunt size according to the bubble count in the left atrium. In the present study, we observed that patients with CS had a higher frequency of severe RL shunt compared to those without CS (p < 0.001). This finding is not unexpected, because a larger RL shunt leads to greater potential for thrombus to pass directly from the venous to the arterial circulation when the pressure in the right cardiac chamber exceeds that in the left, which increases the likelihood of paradoxical embolic stroke. The finding is also consistent with previous reports.

In the present study, maximum mobility was found to be associated with CS in patients with PFO. In fact, De Castro et al. investigated the morphological and functional characteristics of PFO and their embolic implications in patients with a median follow-up period of 31 months [6]. They found that greater interatrial septum mobility was more common in patients with CS, and RL shunt at rest with a hypermobile interatrial septum seemed to identify PFO patients who were at high risk of paradoxical cerebral embolism recurrence. Such findings have also been supported in other research [36]. It is believed that increased interatrial septum mobility may be an indicator of a larger PFO and is able to strengthen the preferential orientation of blood flow from the inferior vena cava via the PFO into the left atrium, leading to an increase in the potential thrombus passage and the occurrence of paradoxical embolism [37]. Nevertheless, in this study, a moderate correlation was found between the presence of ASA and CS; this differs from prior PFO–ASA studies [37], where ASA accompanied by PFO was demonstrated to be more frequent in patients with CS and recurrent stroke. Indeed, this discrepancy could partly be explained by the different definitions of ASA used in these studies. In the study by [18], an ASA was diagnosed when the atrial septum extended ≥ 11 mm into the right or left atrium or if the sum of the excursion into the left and right atria was ≥ 11 mm. In [38], ASA was defined as a septum primum excursion ≥ 10 mm from the atrial septum into the left or right atrium. However, in our study, we considered septal excursion from the midline into the right or left atrium ≥ 10 mm, or total excursion between the right and left atrium ≥ 15 mm, as diagnostic criteria. Therefore, noninvasive “gold standard” criteria are needed to normalize the identification of atrial septal aneurysm.

Similar to a previous study [11], we found that the diastolic length of the tunnel was highly associated with CS. This indicates that the long tunnel may tend to serve as a conduit for paradoxical emboli, or produce stroke via in situ thrombus formation [39].

6 Conclusions

Large PFO in diastole, the presence of a hypermobile interatrial septum, severe right-to-left shunt, and Eustachian valve or Chiari’s network were independently associated with CS, suggesting that TEE-detected morphologic and functional characteristics of PFO may play important roles in the occurrence of CS.

Our model’s credibility is supported by the fact that the importance of the influencing factors used can be traced, meaning that it can be used to evaluate patients with high-risk PFO in clinical practice. Nevertheless, the study has some limitations. First, the model involves many variables; although this number was identified as optimal using the caret package in the R language, so many variables may not be practical in clinical settings. The number of variables may need to be optimized in future studies using other methods. Second, this was a single-center study with a small sample, and some variables had to be deleted due to missing values. Future research should increase the sample size and the number of variables to provide robust data. In addition, this study was exploratory, and its findings need to be verified in samples that include more populations. Finally, the study only considered the presence or absence of a cerebral infarction lesion, not the severity of neurological events or the distribution, location, and number of cerebral infarcts.