Introduction

Diabetes mellitus (DM), especially type II, remains a public health challenge as its incidence continues to rise1,2, and with it the burden of associated comorbidities such as diabetic eye disease3. Among the latter, diabetic macular edema (DME) is the most prevalent sight-threatening (ST) condition, although it is treatable while macular changes are not yet permanent4. Several countries have implemented successful diabetic retinopathy (DR) screening programs, mainly based on fundus retinography (FR), that have lowered the overall incidence of blindness5. Nevertheless, recent reports have advocated the potential benefit of adding optical coherence tomography (OCT) to FR, as it may detect early indirect DME signs missed by FR in a non-negligible proportion of cases6, improving the cost-effectiveness of screening models7. On the other hand, the deployment of such a program faces new resourcing challenges that could jeopardize the sustainability of any public health system: adding OCT to FR raises instrumentation costs, and highly skilled specialists are needed to read the OCT cubes to achieve the best results6,8.

To address some of these challenges, artificial intelligence models, especially deep convolutional neural networks (CNNs), have been widely tested for the automated classification of macular pathologies in FR and/or OCT scans, as well as to assist retina specialists in speeding up their decision-making9. In this sense, several published studies using CNNs for the classification of DME reported specialist-level performance, or even outperformed retina specialists, although their generalizability raises important concerns10,11. In some studies, the authors used 2D-OCT scans (B-scans), which introduced a selection bias because the training images had to be pre-selected from the OCT cube12,13. Moreover, most used central foveal B-scans, which could miss both anatomical signs and DMEs that do not involve the fovea10,11,12,13. Some authors relied on FR alone to detect DME, that is, on indirect signs that likely indicate progression and may therefore be easier for the models to detect10,14. Furthermore, most authors did not state the criteria followed to assess DME10. Consequently, most of the datasets used in these studies may not represent the actual distribution of the disease in the general population, which may lead to overestimated performance and, in turn, poor generalizability.

Recurrent neural networks (RNNs) are deep learning models suited to analyzing time-series data and can be stacked at the output of several CNNs to predict time-dependent outcomes from images15,16. Hence, such a hybrid model (CNN-RNN) applied to OCT B-scans would allow the whole OCT cube to be analyzed as a single ordered sequence.

Thus, the aim of the present study was to build hybrid deep learning models (CNN-RNN) with different settings, trained on an ad hoc ground truth of DME, and then to test them for the detection of DME in an unselected dataset from a real-world DR screening program.

Methods

Model development and training with a ground truth of DME

A detailed description of all model settings, training and testing, as well as image processing, is provided in Supplementary Material 1. Briefly, we pre-trained a custom CNN backbone using a publicly available dataset comprised of B-scans of DME, drusen, choroidal neovascularization, and normal macula (Supplementary Material 1.1–1.2)12. Then, twenty-two (22) of these pre-trained CNNs were stacked in parallel, initialized with the pre-training weights, and fed with 22 unselected B-scans extracted from the OCT cube. Researchers did not intervene in the selection of images for training or testing, thereby avoiding image selection bias. To this end, we divided each 128-slice cube into the slices that mainly comprised the foveal zone and those that mainly comprised the parafoveal zone. The foveal zone spanned slices 60 to 85, from which we automatically extracted every second slice (12 B-scans); the remaining slices formed the parafoveal zone, from which we automatically extracted every tenth slice (10 B-scans; Supplementary Material 1.3). Each pre-trained CNN outputs an embedding of image features from its flatten layer. All embeddings were then concatenated into a sequence, which was forwarded to the bidirectional recurrent layer. The bidirectional wrapper runs one cell forward and another backward along the sequence to learn dependencies between time-dependent features. Finally, the output of the RNN layer was fully connected to a sigmoid layer to predict the probability of DME for the 22-slice OCT cube (Supplementary Fig. 1).
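For illustration, the following is a minimal TensorFlow/Keras sketch of the slice sampling and CNN-RNN assembly just described. It is not the study's implementation: the backbone is a toy stand-in for the pre-trained custom CNN of Supplementary Material 1.1–1.2, and the exact slice indices, embedding size, and recurrent cell type (an LSTM here) are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_SLICES, H, W = 22, 224, 224  # image geometry is an assumption


def select_slices(cube):
    """Pick 22 of the 128 B-scans (12 foveal + 10 parafoveal) from a
    (128, H, W) numpy array. The exact indices are one plausible reading
    of the sampling scheme, not the study's published indices."""
    foveal = list(range(60, 84, 2))                        # 12 slices within 60-85
    remaining = [i for i in range(128) if not 60 <= i <= 85]
    parafoveal = remaining[5::10]                          # 10 parafoveal slices
    return cube[foveal + parafoveal]                       # shape (22, H, W)


def build_backbone():
    """Toy stand-in for the pre-trained CNN backbone; only shapes matter here."""
    inp = layers.Input(shape=(H, W, 1))
    x = layers.Conv2D(16, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)   # stands in for the flatten layer
    return models.Model(inp, x)


def build_cnn_rnn():
    cube_in = layers.Input(shape=(N_SLICES, H, W, 1))
    # One backbone per slice, stacked in parallel; in the study each instance
    # was initialized from the same pre-training weights.
    embeddings = [build_backbone()(layers.Lambda(lambda t, i=i: t[:, i])(cube_in))
                  for i in range(N_SLICES)]
    # Stack the 22 per-slice embeddings into a (batch, 22, features) sequence.
    seq = layers.Lambda(lambda ts: tf.stack(ts, axis=1))(embeddings)
    x = layers.Bidirectional(layers.LSTM(64))(seq)         # cell type/size assumed
    out = layers.Dense(1, activation="sigmoid")(x)         # P(DME) for the cube
    model = models.Model(cube_in, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.AUC(name="auroc"),
                           tf.keras.metrics.AUC(curve="PR", name="auprc")])
    return model
```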

We used a ground truth of DME and normal macula, described elsewhere6 but enriched with additional OCT cubes from a second DR screening program (Hospital Clínic of Barcelona, Spain). The additional samples were graded following the same criteria6. All images were acquired using the Topcon 3D OCT-Maestro 1. The ground truth dataset was split into training, validation, and test sets, ensuring a similar proportion of DME between folds and avoiding data leakage by creating splits of unique subjects. Binary cross-entropy was used as the loss function, although accuracy, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the precision-recall curve (AUPRC) were also computed and compared. From the pool of trained models, we selected those with the best generalizable metrics in the test set.
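As a sketch of the leakage-free splitting step, grouped splitting by subject identifier can be done with scikit-learn as below. The fold proportions are illustrative, and GroupShuffleSplit does not by itself stratify by label, so the similar DME proportion between folds would still need to be verified or enforced separately; that is an assumption of this sketch.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def subject_level_split(n_samples, subject_ids, test_size=0.18,
                        val_size=0.15, seed=0):
    """Assign each unique subject's cubes to exactly one of train/val/test,
    preventing subject-level leakage across folds (proportions illustrative)."""
    subject_ids = np.asarray(subject_ids)
    idx = np.arange(n_samples)
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val, test = next(outer.split(idx, groups=subject_ids))
    inner = GroupShuffleSplit(n_splits=1,
                              test_size=val_size / (1 - test_size),
                              random_state=seed)
    tr, va = next(inner.split(train_val, groups=subject_ids[train_val]))
    return train_val[tr], train_val[va], test
```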

Study cohort and data collection

A retrospective cohort study nested in a real-world teleophthalmology DR screening program was conducted. From November 2015 to March 2019, we included all diabetic patients (of either type), of any gender, aged ≥ 18 years. We included one eye per subject: the affected eye in case of unilateral DME, or a randomly sampled eye if both eyes had the same diagnosis (DME or non-DME).

The characteristics of the screening program were described elsewhere6. In short, screening visits were conducted by a technician in an outpatient center, who collected health data of interest, measured the best-corrected visual acuity (BCVA), and acquired a 3-field FR (first field centered on the macula, second on the disc, and third supero-temporal)17 and a 6 × 6 mm OCT macular cube scan. Patients were re-imaged under pupil dilation in case of low image quality. Then, the technician forwarded all the abovementioned data, including the ETDRS average thicknesses, the ETDRS topographic map, and the macular volume, to a retina specialist in the hospital who acts as the gatekeeper to specialized care. There, the retina specialist makes the initial diagnosis and decides whether to refer the patient. The retina specialist assessed DME based on the presence of macular thickening ≥ 300 µm together with anatomical signs of DME (cysts, microaneurysms, exudates, neurosensory detachment, and hyperreflective dots), without signs of another macular disease in the FR18,19. The severity of DME was also graded as mild, moderate, or severe depending on the distance to the central fovea, as proposed by the International Clinical Diabetic Retinopathy and Diabetic Macular Edema Disease Severity Scales (ICDRSS)20. Subjects whose OCT and FR were missing or ungradable were excluded from the study cohort.
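Purely as an illustrative encoding of this screening rule (not code used in the study; the thickness measure and sign list are lifted from the criteria above), the decision can be written as:

```python
DME_SIGNS = {"cysts", "microaneurysms", "exudates",
             "neurosensory detachment", "hyperreflective dots"}


def assess_dme(macular_thickness_um, observed_signs, other_macular_disease):
    """Sketch of the screening rule: thickening >= 300 um plus at least one
    anatomical sign of DME, with no signs of another macular disease."""
    return (macular_thickness_um >= 300
            and bool(DME_SIGNS & set(observed_signs))
            and not other_macular_disease)
```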

Model evaluation in the study cohort

All 22-slice OCT cubes from the study cohort were extracted, pre-processed, and then fed to the CNN-RNN models to predict the probability of DME. Model predictions were evaluated against the diagnosis made by the retina specialist in the screening program, as described above. ROC curves, AUROC, and the partial AUROC (pAUROC) over the regions with a false positive rate (FPR) < 0.05 and < 0.10, along with their 95% confidence intervals (95%CI), were computed. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the cumulative incidence of DME were calculated over a grid of 1000 prospectively tuned thresholds, set to yield the best parameters for screening: (1) the Youden index (threshold 1), (2) the highest sensitivity (threshold 2), and (3) the highest specificity (threshold 3). For the latter two, we required a baseline specificity or sensitivity, respectively, of at least 0.8. Only the models that yielded the best AUROC with a pAUROC above 0.80 were considered onwards. The selected models were ensembled into a voting classifier based on the mode of the class predictions obtained at each model's threshold.
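The following is a minimal sketch of this threshold search and the voting ensemble, assuming a uniform grid over [0, 1] and first-optimum tie-breaking; the study's procedure is described only at the level of detail given above.

```python
import numpy as np


def tune_thresholds(y_true, y_prob, n_grid=1000, floor=0.8):
    """Grid-search the three operating points: (1) Youden index,
    (2) highest sensitivity with specificity >= floor, and
    (3) highest specificity with sensitivity >= floor."""
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    grid = np.linspace(0.0, 1.0, n_grid)
    sens = np.array([((y_prob >= t) & y_true).sum() / y_true.sum()
                     for t in grid])
    spec = np.array([((y_prob < t) & ~y_true).sum() / (~y_true).sum()
                     for t in grid])
    t1 = grid[np.argmax(sens + spec - 1)]                       # Youden index
    t2 = grid[spec >= floor][np.argmax(sens[spec >= floor])]    # max sensitivity
    t3 = grid[sens >= floor][np.argmax(spec[sens >= floor])]    # max specificity
    return t1, t2, t3


def vote(probs_per_model, thresholds):
    """Ensemble by the mode of per-model class predictions, each model
    binarized at its own tuned threshold (a majority vote for 3 models)."""
    preds = np.stack([p >= t for p, t in zip(probs_per_model, thresholds)])
    return preds.sum(axis=0) > len(probs_per_model) / 2
```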

Additionally, the characteristics of false positives and false negatives obtained were reviewed by a third retina specialist.

Finally, as an exploratory analysis, we tested the potential generalizability of the models for the classification of referable DR. In this case, we graded DR as mild non-proliferative, moderate non-proliferative, and referable DR, the latter including severe non-proliferative and proliferative DR plus moderate and severe DME20.

Statistical analysis

Baseline characteristics of the population were expressed as median and interquartile range (IQR) for quantitative variables, and as frequencies and percentages for qualitative variables. Differences between two medians were tested with the Mann–Whitney U test, and the test on the equality of proportions was used to compare two qualitative variables. A p-value < 0.05 was set as statistically significant.

ROC curves and AUROC (95%CI) were computed to test model performance, as they are independent of the prevalence of the disease. Partial AUROC (pAUROC; 95%CI) was also computed to obtain a more nuanced comparison of the models21. The Youden (J) index was computed to represent the maximum combined sensitivity and specificity attained at a single point on the ROC curve. Diagnostic accuracy for binary outcomes was assessed using the sensitivity, specificity, PPV, and NPV after probability thresholding. Diagnostic accuracy was also stratified by gender, age, laterality, BCVA, grade of DR, OCT image quality, and grade of DME; for the latter, due to low numbers, we collapsed moderate and severe into one category. The incidence of DME predicted by the models was calculated as the number of true positives divided by the total number of subjects. Confidence intervals at the 95% level (95%CI) were calculated using the standard normal distribution, or the binomial distribution for proportions22.
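These quantities can be sketched with standard tooling as below. Note the assumptions: scikit-learn's roc_auc_score with max_fpr returns the standardized partial AUC (McClish correction), and statsmodels' proportion_confint with method="beta" gives the exact (Clopper-Pearson) binomial interval; whether these match the study's exact conventions is not stated, and the data and counts below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                                    # toy labels
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, 500), 0, 1)   # toy scores

auroc = roc_auc_score(y_true, y_prob)
pauroc_005 = roc_auc_score(y_true, y_prob, max_fpr=0.05)  # standardized pAUC
pauroc_010 = roc_auc_score(y_true, y_prob, max_fpr=0.10)

# Exact binomial 95%CI for a proportion such as sensitivity:
# tp successes out of tp + fn positives (counts illustrative).
tp, fn = 112, 14
sens_lo, sens_hi = proportion_confint(tp, tp + fn, alpha=0.05, method="beta")
```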

To test the assumption of no missed cases due to the lower resolution of the 22-slice cubes, we carried out a sensitivity analysis by severity of DME, as well as by other covariates such as grade of DR, BCVA, and image quality, among others.

Models were developed in GPU-enabled TensorFlow v.2.4, and diagnostic accuracy was computed with scikit-learn v.1.2.2, for Python. The remaining analyses were run with STATA/MP v.17 (Stata Corp LLC, College Station, TX, USA).

Ethics approval and consent to participate

The study protocol was approved by the Ethics Committee of the University Hospital “Príncipe de Asturias” on March 2, 2020. The need for informed consent was waived due to the retrospective nature of the study. This study complied with the provisions of Spanish and European laws on personal data, as well as with the Declaration of Helsinki (Fortaleza 2013).

Results

Baseline and clinical characteristics of the study cohort

The characteristics of the study population are detailed in Table 1. We included 5314 screens from 4408 subjects. Of them, 126 (2.37%) were DMEs, and 5188 (97.6%) had no record of DME throughout the study period. The majority of DMEs were mild (77; 61.1%), while moderate and severe contributed equally (23; 18.3% each). Compared to subjects without DME, those with DME had fewer screening visits, were older, were predominantly male, had a longer duration of DM, and had a greater prevalence of acute myocardial infarction (Table 1). Regarding vision, subjects with DME had lower BCVA, a greater prevalence of moderate and severe non-proliferative DR as well as proliferative DR, and a greater proportion of epiretinal membranes, pupil dilation, or previous macular scars. On OCT, subjects with DME showed greater average macular thicknesses and volume across the ETDRS grid (Table 1).

Table 1 Baseline characteristics of the study cohort.

Training of hybrid models with a ground truth dataset of DME

The ground truth dataset comprised 650 OCT cubes from unique subjects: 433 (66.6%) in the training set (127 DMEs, 29.3%; 306 non-DMEs, 70.7%), 100 (15.4%) in the validation set (20 DMEs, 20%; 80 non-DMEs, 80%), and 117 (18%) in the test set (40 DMEs, 34.2%; 77 non-DMEs, 65.8%). Seven (7) different hybrid models yielded the best predictive and generalizable results in the test set (Supplementary Material 1.4–1.6).

Model selection and diagnostic accuracy in the study cohort

The abovementioned models were then tested in the study cohort, and three showed the best performance. Their architectures and results are detailed in Supplementary Material 1.5–1.6. AUROC (95%CI) was 0.94 (0.92–0.97) for model 1, 0.94 (0.91–0.97) for model 2, and 0.94 (0.91–0.97) for model 3. For the region at FPR < 0.05, pAUROC (95%CI) was 0.83 (0.79–0.87), 0.84 (0.80–0.89), and 0.81 (0.76–0.85) for models 1, 2, and 3, respectively, while at FPR < 0.10, pAUROC (95%CI) was 0.87 (0.83–0.91), 0.88 (0.84–0.92), and 0.85 (0.81–0.89) for models 1, 2, and 3, respectively (Fig. 1).

Figure 1

AUROC and pAUROC (95%CI) for the classification of DME with the three best models. ROC: receiver operating characteristics; AUROC: area under the ROC curve; DME: diabetic macular edema. Vertical dotted colored lines mark the pAUROC regions at FPR (1 − specificity) < 0.05 and < 0.10.

Sensitivity and specificity were calculated at the three thresholds. Model 2 showed the most balanced results at threshold 1: sensitivity (95%CI) was 90.5 (84.0–95.0), and specificity (95%CI) was 90.8 (90.0–91.6). Similar results, but with higher specificity, were reached by the ensemble model: sensitivity (95%CI) of 88.9 (82.1–93.8) and specificity (95%CI) of 93.3 (92.6–94.0). The ensemble model also showed the best sensitivity and specificity at the remaining thresholds: sensitivity (95%CI) was 92.1 (82.1–93.8) and specificity (95%CI) was 83.1 (82.1–84.1) at threshold 2, while sensitivity (95%CI) was 81.0 (73.0–87.4) and specificity (95%CI) was 97.0 (96.5–97.4) at threshold 3 (Table 2).

Table 2 Diagnostic accuracy of the three best models after threshold tuning.

Negative predictive values (NPV; 95%CI) were high and similar in all scenarios (above 99.5). As expected, positive predictive values (PPV; 95%CI) were low, as PPV strongly depends on the prevalence of the disease, although it increases with higher specificity; with the ensemble model at threshold 3, PPV was 39.5 (33.5–45.8). The cumulative incidence of DME over the study period was similar across all thresholds and models (Table 2).

Diagnostic accuracy of the ensemble model by different covariates

The stratified analysis showed similar diagnostic accuracy across all strata, except for moderate and severe DME, for which sensitivity and specificity were higher than for mild DME and overall; the highest sensitivity (95%CI) was 100 (92.9–100) at threshold 2, and the highest specificity was reached at threshold 3 (97.0; 96.5–97.4). In contrast, the specificity observed among subjects with ST-DR (including severe non-proliferative and proliferative DR) was lower than overall: at threshold 3 it was 62.9 (44.9–78.5), and even lower at thresholds 1 and 2 (Table 3).

Table 3 Sensitivity and specificity of the ensemble model at all thresholds, by different covariates.

Characteristics of false positives and false negatives

The characteristics of the false positives and false negatives obtained with the ensemble model at threshold 3 were explored. In this configuration, we obtained 24 false negatives and 156 false positives. According to the review by the third retina specialist, 87.8% of the false negatives predicted by the model were in fact correctly classified: they mainly corresponded to epiretinal membranes, macular thickening without signs of DME, and anatomical signs of DME without macular thickening, and just 4 DMEs (12.1%) were actually missed by the model. On the other hand, anatomical signs of DME without macular thickening, images with artifacts (eye blinking, cropped images) or skewed images, and age-related macular degeneration accounted for 54% of the false positives, while 17 (10%) DMEs had been misdiagnosed by the retina specialist and were thus correctly classified by the model (Fig. 2).

Figure 2

Characteristics of false positives and negatives obtained with the ensemble model at threshold 3. ERM: epiretinal membranes; DME: diabetic macular edema; AMD: age-related macular degeneration; PDR: proliferative diabetic retinopathy. *Includes ischemia and thinning, asteroid hyalosis, central serous choroidopathy, lamellar macular hole, vascular tortuosity, and epithelial pigment detachment.

Generalizability of models for classification of referable DR

The three selected models yielded almost chance-level prediction in the classification of mild non-proliferative DR. Conversely, performance increased with severity. Overall, the AUROC for referable DR was 0.89 or above (Fig. 3).

Figure 3

ROC and AUROC (95%CI) for different grades of diabetic retinopathy with the three best models. ROC: receiver operating characteristics; AUROC: area under the ROC curve; DR: diabetic retinopathy; DME: diabetic macular edema. *Referable DR included severe non-proliferative DR, proliferative DR, moderate DME, and severe DME.

Discussion

Results from the present study showed that our hybrid (CNN-RNN) deep learning models achieved high diagnostic accuracy in the identification of DME, overall and by severity, in a dataset of OCT cubes generated in a real-world DR screening program as a product of routine clinical care.

Computer vision models based on CNNs have been widely tested for the detection of DR or DME23. Nevertheless, very few have been designed for direct clinical application in real life, with prospectively collected datasets and well-defined diseases10,13,14,24. CNNs require large amounts of manually expert-labeled data for training and validation. At present, some publicly available datasets comprised of OCT B-scans or FR have been used in numerous studies; however, using such flat images implies selective sampling10,12. Screening for DR with FR alone leads to low sensitivity in the diagnosis of DME, as the criteria to assess DME rely on the presence of indirect signs that are in fact markers of progression, and to a non-negligible proportion of false positives and negatives (almost 50% of DMEs without indirect signs are missed)6. Likewise, selective sampling also occurs with B-scans, as they must be pre-selected or, in other cases, only central foveal B-scans are extracted from the OCT cube10,12. For these reasons, DMEs with clear signs of the disease are overrepresented in existing datasets, which may introduce a significant bias with misleadingly outstanding results10. When transported to real-world datasets, much more realistic results have been reported13,14,24,25,26.

To our knowledge, this is the first study applying hybrid (CNN-RNN) models to analyze OCT cubes with the aim of detecting DME of any grade. Hybrid models have been proposed to predict time-dependent outcomes such as treatment response in colorectal and lung cancer15,16; here, we took advantage of this design to analyze OCT cubes with just one expert annotation per volume, and also to avoid the misalignments that could arise when re-assembling the cube for a 3D-CNN11. Furthermore, another potential benefit of our approach is its handling of time, making our models transferable to the prediction of time-to-event outcomes such as treatment response (e.g., to vascular endothelial growth factor inhibitors) or risk of progression. Other authors have developed a model for the segmentation and identification of several referable retinal diseases using OCT cubes, which precluded a direct comparison with our results27.

Regarding the screening of DME, Liu et al.13 carried out a study nested in a real-world DR screening program and reported a sensitivity, specificity, and AUROC of 91.3, 97.5, and 0.944, respectively, using a combination of FR and OCT. Taking the Youden index as reference, our ensemble model yielded a sensitivity (95%CI) of 88.9 (82.1–93.8) and a specificity (95%CI) of 93.3 (92.6–94.0), while the AUROC for our best three models was 0.94. Our results were slightly lower, but some considerations should be made to compare them adequately: Liu et al. used B-scans and FR but without assessment by grade of DME13. Of note, in a modest population of 600 diabetics, they reported a PPV of 75.4%, strongly influenced by a prevalence of DME almost fourfold that observed in our population. In practical terms, since PPV depends on prevalence, with the sensitivity and specificity they reported we would barely reach a PPV of 50% in our population, suggesting that PPV alone is an inadequate metric for screening in imbalanced datasets. Moreover, the baseline and clinical characteristics of their study population were not reported, which prevented a more extensive interpretation of the results13.
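To make the prevalence dependence concrete, PPV follows from Bayes' rule as sensitivity × prevalence / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)); a quick check with Liu et al.'s reported operating point at our 2.37% prevalence illustrates the point:

```python
def ppv(sens, spec, prev):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))


# Liu et al.'s sensitivity/specificity transported to our DME prevalence:
print(round(ppv(0.913, 0.975, 0.0237), 2))  # -> 0.47, i.e. barely ~50% PPV
```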

This pioneering study evaluated the ability of the models to detect different grades of DME, and showed that performance for mild DMEs was similar to the overall results, while performance for moderate and severe DMEs exceeded them; this trend may support, in part, our concerns about the misleading results observed in studies using selected datasets10,28. It is important to stress that, in our population, the majority of DMEs were mild (61.1%), as a consequence of a well-established screening program in which patients are detected earlier6. Nevertheless, detecting early DMEs is also relevant to personalize screening intervals, adjust antidiabetic treatments, and manage other cardiovascular risk factors, thereby improving visual prognosis29,30.

The design of the best strategy for DME screening in the community must consider multiple dimensions, such as the impact of potential misdiagnoses, the number of false positives that specialized care could absorb, and cost-effectiveness, among others. Population-based screening of a disease with such a low prevalence would benefit most from higher specificity, which increases the PPV to a greater extent than higher sensitivity31 and would rule out most screened subjects in a first step, drastically reducing the specialists' workload. Based on our results, that would cost 3–6% false positives (156–311 subjects) and almost 20% false negatives (24–25 DMEs missed) throughout the study period. Conversely, screening based on sensitivity would be more efficient at case finding, although highly challenging in turn, since advanced stages of DR share indirect markers with DME and the two entities often concur. In this scenario, we would miss 7.9% (10) of DMEs at the cost of 16.9% (877) false positives, at best. Alternatively, the Youden index would return more balanced metrics overall: 7–9% false positives and 9–11% false negatives. Of note, after exploring the characteristics of the false negatives, only 4 DMEs were incorrectly predicted by the model, so the re-calculated sensitivity would rise to 94% with specificity unchanged. On the other hand, the characteristics of the false positives revealed that the models may have learned anatomical signs of DME that are shared with ST-DR, explaining their low specificity among these subjects. Regarding the latter, our models showed potential transferability to referable DR, although specific training would be needed, especially for mild and moderate stages32.

Our study has strengths and limitations that need to be discussed. We trained our models on a ground truth of DME constructed by retina experts with high expertise and low inter-individual variability, and then evaluated the models in a well-defined population drawn from the target population in which these models could be further applied. In addition, our models were able to analyze OCT cubes, avoiding selective sampling and allowing the detection of extra-foveal DMEs, which were the majority in our population and are often missed due to subclinical presentation. Moreover, the ensemble model improved the results, but its complexity must be kept limited for it to be deployed effectively within a teleophthalmology screening program. By contrast, an external validation of the models using datasets from other imaging sources and populations remains necessary. Although we achieved good results, we did not analyze all 128 B-scans of the OCT cube, so further approaches to build more complete samples would be needed. In this line, combining image-related data such as the thicknesses from the ETDRS grid may reduce the number of cases in which anatomical signs were present without thickening reaching 300 µm. Finally, artificial intelligence models are usually considered black boxes due to their complexity and lack of interpretability, which reduces specialists' trust in their predictions33. For this reason, medical researchers in this field must work towards explainable artificial intelligence that supports retina specialists in making meaningful clinical decisions.

Conclusions

Our models were able to detect DME in complete OCT cubes from a real-world screening program, making it unnecessary to pre-select training images and thereby avoiding selection bias. This approach showed potential applicability for ruling out, in a first step, the majority of non-DMEs that would otherwise be referred to specialized care for re-evaluation, thus reducing the workload and staffing needs of retina specialists and, as a consequence, speeding up access to the teleophthalmology screening program.