Introduction

Microbial keratitis (MK) is one of the most common corneal diseases and a major cause of visual impairment1,2,3. The distribution of MK varies from country to country owing to climate, contact lens use, socioeconomic status and accessibility of health services2,4,5. With the widespread use of corneal contact lenses, the incidence of bacterial keratitis (BK) and fungal keratitis (FK) is increasing6. The management of FK and BK is challenging: surgical intervention is usually required at a late stage, and poor visual outcomes are common2,7,8. Hence, early diagnosis is essential to avoid devastating, vision-threatening consequences.

However, it is not easy to diagnose FK and BK at an early stage. It has been reported that correctly differentiating between BK and FK is challenging even for trained corneal experts, and more than 30% of cases are misdiagnosed9. When ophthalmologists cannot identify the causative pathogen of keratitis, they usually resort to empirical therapy until culture results become available10. The rationale for empirical treatment is the assumption that most cases of bacterial keratitis will respond to modern broad-spectrum antibiotics11,12. Some ophthalmologists even treat corneal infections empirically with the newer fluoroquinolone antibiotics without performing Gram staining and culture13. Yet treatment failure may increase the likelihood of advancing corneal infiltration and a poor therapeutic outcome14, and the time lag between empirical treatment and the availability of culture results may cause patients to miss the optimal window to initiate appropriate treatment.

In computer-aided diagnosis, deep learning algorithms based on artificial intelligence (AI) are now widely used for medical image recognition and have made great progress in ophthalmology, for example in diabetic retinopathy15, age-related macular degeneration16, glaucoma17, and corneal topography for keratoconus18. To date, few studies have applied deep learning to infectious keratitis (IK) using slit-lamp microscopic images, and there is still considerable room for improvement in classifying BK and FK19,20,21,22. No reported model has used multimodal information to improve diagnostic accuracy for keratitis. In the real world, however, images are noisy, and doctors make judgements based on multidimensional information such as pathological images, medical history and laboratory results. Because BK and FK progress rapidly, the first few visits largely determine the treatment plan and the patient's prognosis.

On this basis, we aimed to develop a knowledge-enhanced transform-based multimodal classifier (KTBMC) that employs treatment text in addition to images to improve prediction and to aid ophthalmologists in diagnosing BK and FK.

Materials and methods

Image datasets

The image dataset for this study comprised 158,931 clinical digital images taken by slit-lamp microscopy from 15,687 patients with 89 categories of corneal disease between October 2004 and 2020 in the Department of Ophthalmology, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University. The study was approved by the Ethics Committee of Sir Run Run Shaw Hospital, Zhejiang University School of Medicine (ethical approval code: 20210318-32) and adhered to the ARVO statement on human subjects and the Declaration of Helsinki. The Ethics Committee waived the need for informed consent because of the retrospective design and because personal identification was delinked from images and data before analysis.

From the dataset, images of patients whose initial treatment was anti-microbial therapy, including BK and FK, were selected for the training or testing set for algorithmic classification into each infectious category. For each patient, only two images were selected: the initial presentation and the first follow-up. All images from patients with corneal infections were annotated with a definite clinical diagnosis corroborated by at least one of the following: ① the progression of the corneal infection was halted by pertinent single-drug or combined-drug therapy leading to its ultimate cure; ② the pathogen was identified from a sample of the infection site, either by smear examination under the microscope or by organism culture.

Patients were excluded if they had mixed bacterial and fungal infections; corneal perforation; no documented slit-lamp images; poor-quality or fluorescein-staining images; or the presence of other corneal diseases, such as viral keratitis, Acanthamoeba keratitis, marginal keratitis, corneal dystrophy or degeneration, chemical burn, mucous membrane cicatricial pemphigoid, or bullous keratopathy.

The final dataset contained 704 images from 352 patients. The training set consisted of 262 randomly selected images of BK and 296 images of FK from 279 patients, and it was further randomly divided into training and validation subsets in a 4:1 ratio. The testing set consisted of 72 randomly selected images of BK and 74 images of FK from 73 patients.
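A minimal sketch of such a 4:1 split is given below, assuming the image records are held in a pandas DataFrame with hypothetical patient_id and label columns; grouping by patient so that both images of a patient fall into the same subset is one reasonable way to implement the split, not necessarily the exact procedure used in the study.

```python
# Sketch of a patient-level 4:1 training/validation split (illustrative assumptions:
# a DataFrame with "patient_id" and "label" columns, grouping by patient).
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_train_val(df: pd.DataFrame, val_fraction: float = 0.2, seed: int = 42):
    """Split image records 4:1, keeping both images of a patient in the same subset."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[train_idx], df.iloc[val_idx]

# Usage: train_df, val_df = split_train_val(records_df)
```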

Treatment text datasets

Information on each patient's course of illness and medication history from the initial visit was collected from paper and electronic medical records. We first transcribed all patients' medication records for the initial diagnosis into electronic form by hand and then, under a clinician's guidance, excluded medications unrelated to the treatment of infectious keratitis, such as those for dry eye or glaucoma. Because the treatment text is relatively simple and short, little preprocessing was required: all medication names were converted to lowercase proper names, and doses were standardized to a uniform prescription format. The space-separated words were then fed directly into the pre-trained BERT to extract embeddings. The final top ten word-frequency statistics by space segmentation are shown in Fig. 2.
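As an illustration of this step, the sketch below lowercases a medication record and extracts token embeddings with a pre-trained BERT through the Hugging Face transformers library; the function name, tokenizer settings and example prescription are assumptions for illustration, not the study's exact pipeline.

```python
# Sketch of treatment-text preprocessing and BERT embedding extraction.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed_treatment_text(text: str) -> torch.Tensor:
    """Lowercase the medication record, tokenize it, and return BERT token embeddings."""
    text = text.lower()                      # uniform lowercase drug names
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state         # shape (1, seq_len, 768)

# Example (hypothetical prescription text):
# embed_treatment_text("levofloxacin 0.5% eye drops qid; natamycin 5% eye drops q2h")
```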

Common anti-bacterial drugs included Levofloxacin, Ofloxacin, Cefuroxime Sodium and Amikacin; depending on the dose, they can be used to prevent or control bacterial infection. Common anti-fungal drugs included Itraconazole, Natamycin, Voriconazole and Amphotericin B. No private information was collected or compromised.

Knowledge-enhanced transform-based multimodal classifier

The knowledge-enhanced transform-based multimodal classifier was based on a convolutional neural network (CNN) and BERT23,24. The algorithm architecture is illustrated in Fig. 1 (using ResNet50 as an example).

Figure 1

The whole deep learning framework. (A) The abstract flow chart for developing the deep learning model. (B) The architecture of KTBMC, which applies ResNet50 to extract image features and uses BERT embeddings to concatenate all features for classification. I_F1, first image feature; I_F2, second image feature.

It is common to transfer the final fully connected layer of a pre-trained convolutional neural network, whose output is typically the result of pooling over feature maps. Since the transformer can handle an arbitrary number of dense inputs, we instead produce not a single output vector but N separate image embeddings, unlike in a regular convolutional neural network23. Specifically, we used a pre-trained ResNet with average pooling (for DenseNet, the norm5 layer) over the K × M grid in the image, yielding N = KM output vectors per image. As we input two images at a time, the features of the two images were extracted separately, and the first was fed into the embedding layer together with the difference between the first and the second. Before being input into the image encoder, all images were resized to a resolution of 256 × 256 × 3, randomly cropped to 224 × 224 × 3, and normalized into (0, 1), which enabled the model to converge more quickly and steadily.
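The sketch below illustrates this idea under assumed settings: a torchvision ResNet50 with ImageNet weights (whereas the study pre-trained its encoders on a keratitis dataset), a hypothetical 3 × 3 pooling grid, and the first-image features concatenated with the feature difference between the two visits.

```python
# Sketch: turn a pre-trained ResNet50 into N = K*M grid embeddings per image and
# prepare the two-image input (first image plus the feature difference to the follow-up).
import torch
import torch.nn as nn
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),            # scales pixel values into (0, 1)
])

class GridImageEncoder(nn.Module):
    """ResNet50 backbone that returns K*M spatial embeddings instead of one pooled vector."""
    def __init__(self, grid=(3, 3)):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.pool = nn.AdaptiveAvgPool2d(grid)                          # K x M grid pooling

    def forward(self, x):                          # x: (B, 3, 224, 224)
        fmap = self.pool(self.features(x))         # (B, 2048, K, M)
        return fmap.flatten(2).transpose(1, 2)     # (B, K*M, 2048) -> N image embeddings

encoder = GridImageEncoder()
img1, img2 = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)   # dummy preprocessed images
feats = torch.cat([encoder(img1), encoder(img1) - encoder(img2)], dim=1)  # first + difference
```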

We used four CNNs (ResNet50, ResNet15225, DenseNet121 and DenseNet16926) as our image encoders. These models were pre-trained on a four-category classification dataset containing 24,818 images of amoeba keratitis, BK, FK, and herpes simplex keratitis. For BERT, we used the pre-trained 3-layer, 768-dimensional base-uncased model trained on English Wikipedia23.

The architecture takes embeddings as input, so both image embeddings and text embeddings can be supplied. Since BERT is an extremely large model and our dataset is too small to train it, we trained only the final classification layer and froze the embedding parameters. The experimental hyperparameter configuration is shown in the supplementary file (Table S1).
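A minimal sketch of this fusion step is given below: the N image embeddings are projected to BERT's hidden size, prepended to the text token embeddings, and passed through a frozen BERT encoder, with only a small classification head (and, in this illustration, the projection layer) left trainable. The class and layer names are hypothetical, and the mean-pooling choice is an illustrative simplification rather than the study's exact design.

```python
# Sketch of the multimodal fusion: image embeddings + text tokens through a frozen BERT.
import torch
import torch.nn as nn
from transformers import BertModel

class KTBMCSketch(nn.Module):
    """Fuse CNN grid embeddings with treatment-text tokens via a frozen BERT encoder."""
    def __init__(self, image_dim=2048, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():                  # freeze the large pre-trained encoder
            p.requires_grad = False
        hidden = self.bert.config.hidden_size             # 768
        self.img_proj = nn.Linear(image_dim, hidden)      # map image features into BERT's space
        self.classifier = nn.Linear(hidden, num_classes)  # trainable head (BK vs FK)

    def forward(self, image_feats, input_ids, attention_mask):
        # image_feats: (B, N, image_dim) grid embeddings from the CNN encoder
        img_tokens = self.img_proj(image_feats)                         # (B, N, 768)
        txt_tokens = self.bert.embeddings.word_embeddings(input_ids)    # (B, L, 768)
        tokens = torch.cat([img_tokens, txt_tokens], dim=1)             # joint token sequence
        img_mask = torch.ones(img_tokens.shape[:2], dtype=attention_mask.dtype,
                              device=attention_mask.device)
        mask = torch.cat([img_mask, attention_mask], dim=1)
        out = self.bert(inputs_embeds=tokens, attention_mask=mask)
        pooled = out.last_hidden_state.mean(dim=1)        # mean-pool the joint sequence
        return self.classifier(pooled)                    # logits for BK vs FK
```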

To compare the performance of our models, we applied the four CNNs to the same dataset with a single image as input.

Performance interpretation and statistics

For visualizing heat maps, the gradient-weighted class activation mapping (Grad-CAM) technique27 was used, in which the model's attention scores are computed from the gradients before the embedding layer. Receiver operating characteristic (ROC) curves were plotted for the discrimination between BK and FK, and the area under the curve (AUC) was measured. Youden's index was applied to the ROC curve to obtain sensitivity and specificity, and the accuracy of the model was further calculated. Statistical analysis was performed with R (R Core Team, 2022) and figures were produced with the ggplot2 package.
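As a small illustration of this evaluation, the sketch below computes the ROC curve and AUC with scikit-learn and selects the operating point that maximizes Youden's J statistic (sensitivity + specificity − 1); y_true and y_score are placeholders for the test labels and the model's predicted probabilities.

```python
# Sketch of ROC/AUC evaluation with a Youden-index operating point.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_with_youden(y_true, y_score):
    """Return AUC plus the threshold, sensitivity and specificity maximizing Youden's J."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)
    j = tpr - fpr                               # Youden's J at each threshold
    best = int(np.argmax(j))
    sensitivity, specificity = tpr[best], 1.0 - fpr[best]
    return auc, thresholds[best], sensitivity, specificity
```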

Results

Patient distribution and characteristics

A total of 352 patients (216 males and 136 females) with 704 images were included. The average patient age was 53.6 ± 11.5 years. The distribution and characteristics of the patients are shown in Table 1. Because we chose only two images per patient and input both into the model at the same time, the number of days between the initial presentation and the first follow-up was also an important variable. We therefore plotted a boxplot with jittered points to visually compare the distribution of this interval between the two disease types, together with the results of the text word-frequency statistics (Fig. 2).

Table 1 Distribution of patients in the train and test datasets.
Figure 2

(A) Boxplot of the distribution of the interval in days. (B) The final top ten results of the text word-frequency statistics.

Performance of backbone

We chose four CNNs with a single image as input to serve as the benchmark for our experiments; the results are presented in Table 2. ResNet50 performed best, with an average accuracy of 0.86.

Table 2 Performance of the benchmark models.

Performance of KTBMC

Owing to the flexibility of transformers, we could vary the inputs to test the performance of our model. Details regarding the accuracy, sensitivity and specificity of all models are presented in Table 3. The average accuracy with two images as input ranged from 88 to 91%. When the treatment text was added, the best average accuracy and the accuracies for BK and FK increased to 97%, 92% and 95%, respectively, with DenseNet121, and the remaining models all improved in accuracy. This indicates that the additional treatment information did help with classification. Because our dataset is small, overfitting is a real concern; the loss curves for KTBMC with different inputs and CNNs are shown in Supplementary Fig. S1.

Table 3 Performance with different inputs and CNNs.

ResNet152 was the best model, achieving an AUC of the ROC curve of 0.94 (95% CI [0.92, 0.96]) for both BK and FK. ResNet152 was also best on the precision-recall curve, with an average precision of 0.95 (Fig. 3).

Figure 3

Receiver operating characteristic (ROC) curves and precision/recall (PR) curves of KTBMC for the four image encoders. (A) ROC without treatment text. (B) ROC with treatment text. (C) PR without treatment text. (D) PR with treatment text.

Instance analysis

We examined all prediction scores after SoftMax and made several observations. As can be seen from Fig. 4, BK was harder to classify than FK on all CNNs (P < 0.05) when only images were used as input. After adding treatment texts, the prediction scores for BK markedly improved on DenseNet121 (P < 0.001), whereas the other prediction scores showed no significant difference.

Figure 4

Model calibration and Brier score of KTBMC (A) without treatment texts and (B) with treatment texts. (C) Boxplot of prediction scores of the KTBMC output. (D) Heat maps generated by the models for cases that were hard to classify correctly. Column (a): original images. Column (b): heat maps generated by KTBMC without treatment texts. Column (c): heat maps generated by KTBMC with treatment texts. Column (d): heat maps generated by ResNet50. * No-txt: input without treatment texts. With-txt: input with treatment texts.

Model calibration was used to assess whether the model output reflected the true probabilities. ResNet152 performed best, with the minimum Brier score of 0.12 (Fig. 4).
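For reference, the Brier score is the mean squared difference between the predicted probability and the observed binary outcome; a minimal computation is sketched below, with y_true and y_prob as placeholders for the test labels and predicted probabilities.

```python
# Sketch of the Brier score used to assess calibration.
from sklearn.metrics import brier_score_loss

def brier(y_true, y_prob):
    # Equivalent to the mean of (y_prob - y_true) ** 2 over all test cases.
    return brier_score_loss(y_true, y_prob)
```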

We selected some samples at the classification boundary, and the heat maps generated with Grad-CAM for model visualization are presented in Fig. 4.
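A minimal Grad-CAM sketch is shown below for a single-image CNN such as the ResNet50 comparison model; it weights the activations of an assumed target convolutional layer by the gradients of the predicted class score, following the same principle used for KTBMC, where the gradients are taken before the embedding layer. Layer and variable names in the usage line are hypothetical.

```python
# Sketch of Grad-CAM for a single-image CNN classifier.
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Return an (H, W) heat map in [0, 1] for `class_idx` on one preprocessed image."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image.unsqueeze(0))                     # (1, num_classes)
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)    # average gradients spatially
        cam = F.relu((weights * acts["a"]).sum(dim=1))         # (1, h, w) class activation map
        cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam.detach()
    finally:
        h1.remove(); h2.remove()

# Usage (hypothetical): grad_cam(resnet50_model, resnet50_model.layer4, img_tensor, class_idx=1)
```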

Discussion

In this study, we developed a new deep learning model that combines a CNN with BERT to improve the diagnostic accuracy for BK and FK. The model, using slit-lamp images and treatment texts, achieved an average accuracy of 97%, with diagnostic accuracies of about 92% and 95% for BK and FK, respectively (Table 3), far exceeding the performance of corneal specialists, whose accuracy was up to 76% for FK28, and that of senior attending ophthalmologists, whose maximum accuracy was 88%19. The sensitivity for detecting keratitis was 95% (95% CI [80%, 99%]) and the specificity was 92% (95% CI [78%, 98%]), which demonstrated the broad generalizability of our model.

Additionally, we selected four CNNs as benchmarks against which to compare KTBMC, and our model far exceeded them (Table 2). Our model was also evaluated with different CNNs as image encoders, and all showed similar performance (Table 3). This is probably because of the powerful performance of BERT, so KTBMC did not over-rely on the CNNs.

To make the output of our model interpretable, heat maps were generated to visualize where the system attended when making its final decisions (Fig. 4D). We also chose ResNet50, the best-performing benchmark, to produce heat maps for comparison. It has been pointed out that a CNN with a single image as input tends to focus on regions outside the cornea, such as the eyelid or conjunctiva, if the image is not cropped20. As shown in Fig. 4D, ResNet50 did focus on areas outside the lesion, whereas our model distinctly focused on and learned features from dominant lesions such as the epithelial defect, oedema and deep stromal infiltration. Furthermore, with the treatment information, the attended regions of corneal lesions became more precise and comprehensive. This interpretability can further facilitate real-world application, as ophthalmologists can understand how the model reaches its final output.

To date, there have been few studies applying deep learning algorithms to infectious keratitis using slit-lamp images, let alone in combination with treatment text, and because of the similarity of BK and FK, no study has achieved a fully satisfactory result in this regard. Xu et al. reported an average accuracy of 79% on IK using a deep sequential-level learning model with slit-lamp images, but their model performed poorly in identifying BK, with an accuracy of only 65%19. Hung et al. applied segmented images and reached an average diagnostic accuracy of 80% for BK and FK; they used U2-Net to crop the corneal region because they found that inappropriate focusing on areas without clinically relevant features decreased model performance20. Ghosh et al.22 applied ensemble learning with three CNNs (VGG19, ResNet50 and DenseNet121) pre-trained on the ImageNet dataset and achieved a best average accuracy of 83% between BK and FK. These studies used only single slit-lamp images, their models performed only moderately in identifying BK and FK, and the performance of all models was closely related to the distribution of the dataset. All of this indicates that there are limitations to using only images as input. In real-world applications, more disease-relevant information is available, such as medical history, laboratory findings and past history. Hence, we combined image and medication information to improve the model's ability to distinguish BK from FK.

Our model could learn from the changes in images between the initial and subsequent visits as well as from the medication intervention. When doctors cannot determine the cause, they resort to empirical therapy, which, if inappropriate, can obscure the identifying features11,29; this in turn increases patients' financial burden and may result in a worse prognosis. From Table 1, we concluded that the interval in days was shorter for patients diagnosed with BK than with FK, and the difference was statistically significant (P < 0.001). This is likely because BK progresses more quickly and doctors tend to monitor the effect of treatment before culture results are known, whereas FK involves a longer drug history before culture results become available or symptoms worsen. Thus, reducing the time lag in patient diagnosis not only lightens the burden on the patient but also decreases the difficulty of managing microbial keratitis. In clinical practice, when doctors are unable to determine whether a case is BK or FK, our model provides an accurate reference to support a more convincing judgement. Moreover, our model has confirmed the potential of multimodality in keratitis.

However, our model has a few limitations. First, we excluded complicated cases, such as patients with mixed infections and other corneal diseases, which would influence the performance of the model. Second, because collecting patient records and cleaning images is laborious with only a few workers, our dataset was still too small for deeper-level experiments; having validated feasibility on a small dataset, the approach can be extended to a larger one according to users' needs. Third, as we could not match the general statistical characteristics of patients (age, gender, etc.) between the training and test groups, differences in these characteristics may affect the model's performance. Finally, the model's function lies in assisting in the differentiation of FK from BK; we did not subclassify the dataset into different pathogens, which may have different clinical characteristics, and viral and amoeba keratitis were not included in this study. In clinical practice, cultures remain the gold standard for final species identification.

In conclusion, we developed a new deep learning model that combines a CNN with BERT to improve the differentiation between BK and FK, and ours is the first study to focus on the impact of image changes and medication interventions in infectious keratitis. Moreover, the method is scalable and can be applied to any clinical problem in which the disease is difficult to distinguish from images alone but other clinical data are available. We believe that the model's strong performance demonstrates the great potential of multimodal information for clinical applications and will inspire further work.