Introduction

Fungal keratitis (FK) is a serious infection of the cornea and one of the leading causes of visual impairment1,2. It is gaining increasing attention around the world, especially in developing countries where its incidence is higher3,4,5,6,7,8. FK often follows corneal trauma or contact lens wear9,10,11 and can cause serious complications such as corneal perforation and endophthalmitis. However, its clinical features are not distinctive enough, so FK can be misdiagnosed as bacterial or parasitic keratitis1. Early diagnosis is therefore critical for instituting timely and proper treatment, improving outcomes and prognosis and reducing the risk of irreversible vision loss.

Traditional diagnostic methods for keratitis include corneal scraping and fungal culture. Corneal scraping is painful for patients and increases the risk of secondary corneal injury. Fungal culture takes a long time and has relatively low sensitivity, especially for infections in the deep corneal stroma3,12. Shotgun metagenomics is a newer DNA sequencing approach that can identify the complete taxonomic and functional profile of an organism from a small sample volume, but its sensitivity, the lack of reference standards for downstream analysis, its cost, and its turnaround time limit routine clinical use13,14,15. In contrast, in vivo confocal microscopy (IVCM) enables non-invasive and prompt eye examinations16. Ophthalmologists can inspect the cornea at almost any depth with IVCM and diagnose fungal keratitis based on fungal hyphae observed in the images. However, nerve fibers, vessels, and dendritic cells can confuse ophthalmologists, since some filiform textures appear similar to fungal hyphae, and extensive clinical experience is needed to reliably distinguish fungi from these confounding structures. Given the high prevalence of fungal keratitis in many countries, the number of qualified ophthalmologists is not sufficient to serve the large patient population, which delays treatment and management for some patients, risking irreversible corneal damage and posing a public health burden. In this work, we aim to provide an automated FK detection system working on IVCM images to strengthen the capability of ophthalmologists in diagnosing fungal keratitis.

As one of the major breakthroughs of recent years, deep neural networks have greatly benefited medical image analysis and have been applied to a variety of imaging modalities, including X-ray, retinal imaging, magnetic resonance imaging (MRI), and computed tomography (CT)17,18,19,20. They have shown strong performance in automatic disease detection and lesion segmentation21,22 owing to their inherent capability of learning complex features directly from raw image data. In the last decade, convolutional neural networks, including ResNet-like frameworks18,23, have shown great power in extracting spatial features from medical images and have yielded impressive results. With the advance of attention mechanisms, transformer-based networks have become another popular choice for image analysis tasks such as classification, segmentation, and object detection24. Prior work on fungal keratitis diagnosis from IVCM images has employed traditional image recognition methods25 and deep convolutional neural networks26,27 to detect fungi from visual features. However, those methods are often hampered by the lack of large-scale FK IVCM datasets and the limited capability of their learning models. Although they have shown promising performance in their own FK diagnosis experiments, their generalizability remains to be validated.

Our research is motivated by the real clinical process. The main aim of IVCM-based FK diagnosis is to identify fungal hyphae structures and to distinguish them from other structures in the cornea, such as nerve fibers and vessels. We observe that FK diagnosis in clinical practice does not rely on a single IVCM image: experienced ophthalmologists carefully inspect a set of IVCM images from the same patient and reach the final decision based on the observed spatial structure of hyphae, combining the visual observations across the whole group of images. In this work, we propose to exploit the relationship among multiple IVCM images of the same patient captured in sequence for automated FK diagnosis. Such images tend to be spatially neighboring and cover related regions, so we develop a new deep architecture based on transformer modules with a stronger capability of extracting spatially correlated features.

In this study, we present and validate our two-stage deep learning framework for automated fungal keratitis diagnosis. In stage 1, we train a deep neural network that takes a single IVCM image as input and detects fungal keratitis at the image level, using recent transformer-based modules28 to effectively extract filiform texture features and identify images with hyphae structures. In stage 2, we train a multi-instance deep network that takes a set of neighboring IVCM images from the same patient as input and predicts a diagnostic conclusion for the image set. Since the datasets used in previous work are either unavailable or too small, we built a new large-scale dataset suitable for our two-stage training, and collected images from separate patients for validation and testing to allow evaluation at the image, sequence, and patient levels.

Results

Performance of the first stage network

We evaluate the image-level diagnostic performance of the first stage network on the stage 1 test set from FK-IMG (Fungal Keratitis Image Dataset, see Datasets preparation in Methods), which contains 8,568 images, including 3,815 positive and 5,383 negative images. To find the best backbone for extracting image features, we compared several image classification networks: ResNet18, ResNet34, PoolFormer, and SwinTransformer. Their classification performance is reported in Table 1, where we compare specificity, sensitivity, accuracy, and AUC (area under the curve) with 95% confidence intervals. SwinTransformer achieves the best overall performance, with the highest sensitivity, accuracy, and AUC score and the second-best specificity (just after PoolFormer), so we chose SwinTransformer as our backbone model.

Table 1 Performance of different backbones in diagnosing fungal keratitis at the image level with 95% confidence intervals.

Performance of the second stage network

To evaluate the second stage network on image sequences, we first compare it against a naive baseline built from the single-image predictions of the stage 1 network, in which a sequence is labeled as positive if at least one of its images is identified as positive. The stage 2 test set contains images of 20 positive patients and 17 negative patients from FK-SEQ (Fungal Keratitis Image-Sequence Dataset, see Datasets preparation in Methods). We use the index-based strategy described in the Methods to select neighboring images and build the sequence dataset, and denote the dataset with an image sequence length of k as Seq.k in the following evaluation. We compared performance under different sequence lengths: the Seq.5 test set contains 2,411 negative and 4,508 positive groups, the Seq.7 test set contains 2,330 negative and 4,981 positive groups, and the Seq.9 test set contains 2,257 negative and 5,361 positive groups; statistics for further sequence lengths are given in Table 2.

Table 2 Test dataset statistics of different image sequence lengths in the evaluation of the stage 2 network.

Take image sequences of length 7 (Seq.7) as an example; we report each metric with its 95% confidence interval in parentheses. As shown in Table 3, the baseline using SwinTransformer as the first stage backbone achieves the best overall performance among the baselines, with a sensitivity of 95.34% (94.72–95.91%), accuracy of 94.42% (93.87–94.93%), and AUC score of 0.9864 (0.9845–0.9883), while the baseline with a PoolFormer backbone achieves the highest specificity of 92.45% (91.30–93.35%). Our stage 2 network exploits sequence information through multi-instance learning29 and clearly outperforms all baselines, with a specificity of 96.65% (95.84–97.35%), sensitivity of 97.57% (97.10–97.98%), accuracy of 97.28% (96.88–97.64%), and AUC score of 0.9950 (0.9938–0.9962). Results for other sequence lengths are also shown in Table 3.

Table 3 Image sequence-level accuracy comparison between our method and baselines.

Performance of patient level diagnosis

As explained above, we further extend the prediction from the sequence level to the patient level based on the stage 2 results and evaluate the diagnostic performance of our method at the patient level. Since some of the positive patients in FK-SEQ underwent IVCM imaging more than once and their condition may change over time, we group the images by patient and examination date. The patient level test set therefore contains 36 entities from 20 positive patients and 17 entities from 17 negative patients, where each entity comprises the IVCM images taken from a single patient in one examination. The results of patient-level diagnosis are shown in Table 4, where we list the results of the naive solution using the stage 1 network and of the stage 2 network. For the stage 2 network, a patient is labeled as positive only if the number of predicted positive image sequences reaches a threshold \(\sigma\); we show results under different values of \(\sigma\). As \(\sigma\) increases, the specificity increases and the sensitivity decreases slightly. Our system can also present all suspicious images to ophthalmologists for further examination, to avoid missing positive patients as much as possible.

Table 4 Patient level accuracy comparison between stage-1 baselines and our stage-2 network with different settings.

Comparison with human experts

We further conducted an experiment to validate the effectiveness of our method by comparing its diagnostic performance with that of experienced ophthalmologists. We randomly selected a subset of the Seq.7 test dataset and invited four ophthalmologists with different levels of experience to diagnose FK from the image sequences. For each patient in the Seq.7 test set, we randomly selected at most five image sequences, yielding a subset of 249 image sequences (179 positive and 70 negative). The performance of the two junior ophthalmologists, the two senior ophthalmologists, and our deep network is shown in Table 5.

A binary classification task usually uses a probability of 50% as the threshold separating negative and positive cases, which tends to balance specificity and sensitivity. Under this setting, our network achieves higher sensitivity and slightly lower specificity than the ophthalmologists. The precision-recall curve in Fig. 1 shows that when we raise the probability threshold until the specificity reaches 100%, the sensitivity of our network remains at 96.65%. These results indicate that ophthalmologists rarely diagnose normal corneas or other corneal infections as fungal keratitis, but even the senior ophthalmologists miss some atypical fungal keratitis cases. Our network achieves higher sensitivity than the human experts and can reach higher specificity while preserving sensitivity by tuning a higher threshold, showing great promise for assisting ophthalmologists in FK diagnosis.

Table 5 Image sequence-level accuracy comparison of our method and human experts with different levels of experience.
Figure 1
figure 1

Receiver operating characteristic curve and Precision-Recall curve of our network compared with human experts.

Discussion

The proposed two-stage deep learning framework achieved high sensitivity and specificity in FK diagnosis. Although the first stage network already performs well in identifying FK-related visual features and labeling single IVCM images, its relatively high false positive rate on single images leads to more misdiagnoses at the patient level. Instead of formulating the diagnosis as a single-image binary classification task, we employ the stage 2 network to combine the information from a group of images of the same patient, further improving both sensitivity and specificity. Our experiments show that the proposed framework generates promising results in assisting ophthalmologists with timely and effective fungal keratitis diagnosis.

All related prior works treated fungal keratitis diagnosis from IVCM images as a binary classification task on single images. However, our experiments show that false positive predictions from a single-image classification network are common for negative patients without fungal keratitis, demonstrating that some filiform textures such as nerve fibers or vessels still cannot be distinguished from fungal hyphae. This relatively high false positive rate leads to low specificity if the single-image results are applied directly to patient-level diagnosis. We tried several state-of-the-art binary classification architectures, and empirically no existing network architecture appears able to resolve the low specificity issue on its own. Therefore, mirroring the real clinical diagnosis process, decisions based on the relationships among a group of images are needed for better diagnostic performance.

In a clinical diagnosis process, an ophthalmologist usually screens a patient's cornea with IVCM at different locations and draws a diagnostic conclusion from all the IVCM images. An experienced ophthalmologist typically surveys the cornea, locates a suspected region, and takes more images there to find the lesions and fungal hyphae caused by fungal keratitis. Besides the hyphae features observed in single images, they also use spatial information from nearby images to better distinguish hyphae from other filiform textures. Once the acquired information is adequate to conclude whether fungi infect the region, the ophthalmologist moves to the next suspected region to further assess the patient's level of infection. We therefore consider that our stage 2 network, based on multi-instance learning, better simulates the real clinical diagnosis process. The network takes a sequence of neighboring images as input and explores the relationships between them with an attention mechanism, combining the complementary information provided by different images of the same patient when learning to make the final prediction. To the best of our knowledge, ours is the first two-stage deep architecture to use image sequence information in automated fungal keratitis diagnosis. The results show that our second stage network increases both specificity and sensitivity compared to the naive method based on image-level results, demonstrating great potential to assist ophthalmologists in real-world clinical practice.

In our two-stage framework, the second stage network can correct false positive instances predicted by the first stage network, yielding higher specificity. Two examples of false positive images corrected by the second stage network are shown in Fig. 2. Figure 2a shows incorrectly predicted positive images containing filiform textures and messy regions. Figure 2b shows the generated image sequences containing these false positive images and their neighboring images, which are fed into the second stage network. Although the false positive images contain some suspicious filiform textures, their neighboring images are normal and show no fungal hyphae features. The second stage network can therefore aggregate the information of all images in the sequence, label the whole sequence as negative, and correct the first stage prediction. Table 3 reports the performance for different image sequence lengths. With increasing sequence length, the sensitivity of our stage 2 network increases slightly, but the specificity declines. Our experiments show that longer sequences do not provide a significant improvement in diagnostic performance, which matches the real clinical process in which the ophthalmologist usually takes a few images in one region and then moves the microscope to another region.

Figure 2
figure 2

Images and generated image sequences corrected by the second stage network.

We inspect our prediction results in comparison with the human experts at the image sequence level. When using the more balanced threshold (\(P=0.5\)), out of all 249 image sequences our deep network misdiagnoses only five positive and four negative cases. The five positive cases misdiagnosed by our method are also misdiagnosed by the four ophthalmologists, which may be because their confusing visual features are hard to distinguish for both human experts and our deep network. Overall, the human experts tend to be more conservative and missed more positive cases, leading to a lower sensitivity than our method. One of the four negative cases incorrectly predicted by our deep network is also misdiagnosed by a junior ophthalmologist. The precision-recall curve in Fig. 1 shows that the predicted probabilities of these misdiagnosed negative cases are still lower than those of most positive cases, so a specificity of 100% can be reached with a sensitivity of 96.65% at a higher probability threshold (\(P=0.63\)). Notably, our experiment provided only image sequences to the human experts, whereas in clinical settings experienced ophthalmologists gather more information (e.g., corneal images taken with a slit lamp, patients' symptoms, and patient histories) to make the final diagnosis. The limited information from image sequences may explain why the human experts performed relatively worse in our experiments, but the results still demonstrate that our network can help ophthalmologists avoid missing suspicious positive cases.

During the real clinical process, it is important to ensure that negative patients are not misdiagnosed, since anti-fungal medication is expensive and toxic and would place an extra financial and health burden on the patients. Compared with human experts, our network can achieve higher specificity while maintaining higher sensitivity when a higher probability threshold is set. Therefore, our network shows great promise in assisting ophthalmologists.

In the diagnostic process of fungal keratitis, an ophthalmologist normally makes an overall decision after inspecting all the captured IVCM images. In the inference phase, our deep learning framework can take all of a patient's IVCM images, separate them into sequences, and provide an overall probability of fungal keratitis together with a list of the patient's most suspected images. Therefore, besides automatically producing a diagnostic decision, our method can also play an assistive role by validating an ophthalmologist's diagnostic conclusion and generating a confidence value for a suspected case. Our experiments show that the ophthalmologists usually achieve higher specificity while our network achieves higher sensitivity. Ophthalmologists assisted by our network could pay closer attention to the listed suspected cases and avoid missing atypical fungal keratitis as much as possible.

There are several limitations in our work. First, the second stage network builds its image sequences from the positive images predicted in the inference phase, so false negatives from the first stage cannot be corrected in the second stage; future work should focus on recovering such false negative instances. Second, due to the relatively small number of patients in our dataset, it is hard to fully validate the robustness of the framework for patient level diagnosis, and more external clinical data are needed for further study. Third, our method is trained and evaluated only on images captured by the HRT III/RCM (Heidelberg Engineering, Germany); the quality and format of images captured by other devices may affect the performance of the current method. Finally, ophthalmologists know the depth of each image when examining the cornea, but that information is not available in our dataset. Since our system is not trained with this prior knowledge, some misdiagnosed cases could potentially be fixed by incorporating the depth information of IVCM images.

In conclusion, we proposed a deep learning framework for diagnosing fungal keratitis from IVCM images that not only analyzes the features of single images but also learns to combine the visual features of a group of images to make better diagnostic decisions. Leveraging a sequence of images for automatic fungal keratitis diagnosis is a more reasonable solution that mirrors the real clinical process of reaching a diagnostic conclusion for a patient. Our experiments also show the promising potential of our method for assisting ophthalmologists in diagnosing fungal keratitis and for assessing the confidence of a diagnostic conclusion.

Methods

In this study, we aim to provide a deep learning framework that conducts fungal keratitis diagnosis in the way human experts do. Therefore, our framework is designed not only to detect FK infections in a single image, but also to make diagnostic decisions by combining the features of multiple images from a patient.

Datasets preparation

The IVCM image dataset used for training and validating our two-stage deep networks was collected from 2013 to 2021 and contains 96,632 IVCM images from 377 patients. Examples of positive and negative IVCM images are shown in Fig. 3. All IVCM images in the FK-IMG and FK-SEQ datasets were captured with an HRT III/RCM confocal microscope (Heidelberg Engineering, Germany) at Wuhan Aier Hankou Eye Hospital, Beijing Aier Intech Eye Hospital, and Chengdu Aier East Eye Hospital, and were stored in JPEG or BMP format at a resolution of \(384 \times 384\) pixels. The positive patients were diagnosed with fungal keratitis on the basis of positive corneal scraping microscopy results or positive fungal cultures. Each image was identified and labeled independently by two experienced ophthalmologists; if their diagnoses were inconsistent, the image was submitted to a third experienced ophthalmologist for a final decision.

Because the networks in the two stages require single-image data and continuous image-sequence data respectively, we separated the collected images into two datasets, FK-IMG and FK-SEQ, to support training and evaluation at both the image and sequence levels. FK-IMG is built for the stage 1 network and contains 12,228 positively labeled images from 163 patients, where each positive image shows fungal hyphae, the main feature and diagnostic criterion of fungal keratitis. As the stage 1 task is performed at the image level, individual images must carry correct labels; since some IVCM images of positive patients can still be negative, such images were excluded to ensure image-level correctness. FK-IMG also includes 16,417 negatively labeled IVCM images from 88 patients with no signs of fungal infection. FK-SEQ contains continuous image sequences taken by IVCM: 57,020 original IVCM images from 68 positive patients and 10,967 IVCM images from 58 negative patients. All original images captured for each patient are included in FK-SEQ without dropping negative images. The positive images come from patients diagnosed with fungal keratitis and were taken during clinical examinations on different dates; we group the images by acquisition date, so a patient in FK-SEQ may have more than one group of images. FK-SEQ contains a large number of negative images from positive patients, as fungal hyphae usually exist only in some areas of the cornea. All images were collected from the real clinical diagnostic process.

To properly train and evaluate the deep models, we split the IVCM images of FK-IMG and FK-SEQ into training, validation, and test sets at the patient level. We use the FK-IMG dataset for training and evaluation in stage 1: we randomly selected 151 patients (60%) to build the training set, including 7,946 positive images from 98 patients and 9,573 negative images from 53 patients; images from another 26 patients (10%) were randomly selected as the validation set (2,558 images); and the remaining 74 patients (30%) form the stage 1 test set (8,568 images). In stage 2, we use the FK-SEQ dataset for training, validation, and testing: 35 negative and 41 positive patients were randomly selected for training, seven positive and six negative patients for validation, and the remaining 20 positive and 17 negative patients for the test set. Further details of the datasets and the distribution of positive/negative samples are reported in Table 6.
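Because the split is performed at the patient level (all images of a patient must fall into exactly one subset), a grouped splitter such as scikit-learn's GroupShuffleSplit can be used; the sketch below illustrates the idea on a toy list of patients and is not our exact splitting code.

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy example: 30 images from 10 patients (parallel lists, one entry per image).
image_paths = [f"p{p}/img_{i}.jpg" for p in range(10) for i in range(3)]
patient_ids = [p for p in range(10) for _ in range(3)]

# Hold out roughly 40% of the *patients*; all images of a patient stay on one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, rest_idx = next(splitter.split(image_paths, groups=patient_ids))

# Split the held-out patients further into validation and test groups.
rest_paths = [image_paths[i] for i in rest_idx]
rest_ids = [patient_ids[i] for i in rest_idx]
val_splitter = GroupShuffleSplit(n_splits=1, test_size=0.75, random_state=0)
val_idx, test_idx = next(val_splitter.split(rest_paths, groups=rest_ids))
```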

Table 6 Summary of the IVCM image dataset and data split.
Figure 3
figure 3

Examples of positive (first row) and negative (second row) IVCM images.

Network architecture

Our framework contains two stages that learn to extract features and predict diagnostic decisions. In the first stage, we train an image-level deep neural network that extracts features from a single IVCM image and detects whether fungal keratitis can be observed in that image. The second stage makes a comprehensive decision by combining the learned features of a set of IVCM images from the same patient: we train a multi-instance network that takes a sequence of neighboring images as input and learns the relationships between them. The patient-level diagnosis pipeline aggregates the results of the two stage networks, combining the image-sequence level results into the final patient-level diagnostic result. Figure 4 shows the architecture of the two-stage deep networks and illustrates the diagnostic process at the image, sequence, and patient levels.

Figure 4
figure 4

Our two-stage deep learning framework.

Stage 1: image level diagnosis network

We leverage the recently developed SwinTransformer28 as the backbone of our image-level deep neural network and train it for the binary classification task. We use transfer learning in stage 1: the SwinTransformer weights pretrained on ImageNet-22k30 are transferred to our backbone network as the initialization of its trainable parameters. The training dataset is denoted by \(\{{\mathscr {X}}_i, y_i\}(i \in \{1,2,\ldots ,N\})\), where \({\mathscr {X}}_i \in {\mathbb {R}}^{H \times W}\) is the grayscale image captured by the confocal microscope and \(y_i \in \{0, 1\}\) indicates whether the i-th image belongs to the positive or negative group for fungal keratitis. The pipeline of the image-level diagnosis network is shown at the top of Fig. 4: the input image \({\mathscr {X}}_i\) is processed by the pretrained SwinTransformer to extract the image feature \(v_i\), which is then fed into a linear classifier that outputs the diagnostic result.
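For illustration, a minimal sketch of such an image-level classifier, built with the pytorch-image-models (timm) library, is given below. The specific SwinTransformer variant, the use of timm's in_chans option, and the single-logit head are assumptions for this sketch rather than the exact configuration of our implementation.

```python
# Minimal sketch of the stage 1 image-level classifier (illustrative only).
import timm
import torch
import torch.nn as nn

class Stage1Classifier(nn.Module):
    def __init__(self, backbone_name: str = "swin_base_patch4_window7_224"):
        super().__init__()
        # `in_chans=1` lets timm adapt the pretrained RGB patch-embedding weights
        # to a single grayscale input channel; `num_classes=0` makes the backbone
        # return the pooled feature vector v_i instead of classification logits.
        self.backbone = timm.create_model(
            backbone_name, pretrained=True, num_classes=0, in_chans=1
        )
        self.classifier = nn.Linear(self.backbone.num_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)        # (B, C) image feature
        return self.classifier(feat)   # (B, 1) raw logit; sigmoid gives the FK probability

model = Stage1Classifier()
logits = model(torch.randn(4, 1, 224, 224))  # four grayscale 224x224 images
```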

Stage 2: image sequence level diagnosis network

Considering that ophthalmologists often take several images around a suspicious region of the cornea during a real examination, neighboring images captured in sequence often contain additional fungal hyphae features. We therefore group images captured at similar times and in similar regions during the corneal examination. Since the captured images are recorded sequentially, such neighbors can easily be located by taking the nearest images in the captured sequence, e.g. based on image indices. In the training stage, we build input sequences by taking the nearest images for each image of a patient (see the sketch below). For negative training samples, the image sequences are all selected from negative patients; for positive samples, the images are all selected from patients with fungal keratitis and each image sequence contains at least one positive image.
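As a concrete illustration of this index-based neighbor selection, the following sketch builds a length-S window around a chosen image; the centering and boundary-clamping rule shown here is an assumption, not necessarily the exact rule used in our implementation.

```python
# Illustrative index-based sequence construction (assumed centering/clamping rule).
from typing import List

def build_sequence(image_paths: List[str], center_idx: int, seq_len: int = 7) -> List[str]:
    """Return `seq_len` images around `center_idx`, taken in capture order.

    `image_paths` is assumed to be sorted by capture index for one patient/date.
    The window is clamped at the boundaries so it always stays inside the list.
    """
    half = seq_len // 2
    start = max(0, min(center_idx - half, len(image_paths) - seq_len))
    return image_paths[start:start + seq_len]

# Example: a Seq.7 window around image index 12 of a patient with 30 images.
paths = [f"patient_001/img_{i:03d}.jpg" for i in range(30)]
seq7 = build_sequence(paths, center_idx=12, seq_len=7)
```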

As shown in the middle of Fig. 4, the second stage network uses the trained stage 1 backbone to extract the features of the IVCM images, followed by a transformer-based network29,31,32 that learns the relationships among the image features. The aggregated sequence feature vector is then processed by a linear classifier that predicts the positive/negative label. The implementation of the stage 2 transformer-based network, designed to process image sequences, is shown in Fig. 5. We denote the image sequence dataset as \(\{({\mathscr {X}}_i^1, {\mathscr {X}}_i^2, \ldots , {\mathscr {X}}_i^S; y_i)\}\), where S is the sequence length and \(y_i \in \{0, 1\}\) indicates whether the i-th sequence contains fungal hyphae. The feature matrix \({\mathscr {V}}_i = (v_i^1, v_i^2, \ldots , v_i^S)\), extracted by the stage 1 feature backbone, is then processed by the transformer-based network. We remove the position embedding module of the original transformer architecture in the stage 2 network, since the images in a sequence should be treated as an unordered set rather than an ordered sequence of elements. The relationship features between neighboring images are extracted by the four-layer transformer block described by the following equations:

$$\begin{aligned} \begin{aligned} \hat{{\mathscr {V}}}_i^{(l+1)} = MSA(LN({\mathscr {V}}_i^{(l)})) + {\mathscr {V}}_i^{(l)} \\ {\mathscr {V}}_i^{(l+1)} = FF(LN(\hat{{\mathscr {V}}}_i^{(l+1)})) + \hat{{\mathscr {V}}}_i^{(l+1)} \end{aligned} \end{aligned}$$
(1)

where \({\mathscr {V}}_i^{(l)}\) represents the output feature matrix of the l-th layer, \(MSA(\cdot )\) represents the multi-head self-attention module, \(FF(\cdot )\) represents the feed-forward module, and \(LN(\cdot )\) represents the layer normalization module. The output feature matrix \({\mathscr {V}}^{out}\) is a sequence of feature vectors with a length of S. In order to obtain the final sequence feature that represents the relationships between neighboring images, we apply a max-pooling layer to aggregate \({\mathscr {V}}^{out}\).
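A minimal PyTorch sketch of this stage 2 design is shown below. It uses the standard nn.TransformerEncoder (pre-norm, without positional embedding) followed by max pooling and a linear classifier; the feature dimension, number of heads, and other hyperparameters are illustrative placeholders rather than our exact settings.

```python
import torch
import torch.nn as nn

class Stage2SequenceNet(nn.Module):
    """Sketch of the sequence-level network: a transformer over per-image features,
    no positional embedding, max-pooled into one sequence feature (widths illustrative)."""

    def __init__(self, feat_dim: int = 1024, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads,
            dim_feedforward=4 * feat_dim,
            norm_first=True,            # LN applied before MSA/FF, as in Eq. (1)
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, S, feat_dim) image features produced by the stage 1 backbone.
        out = self.encoder(feats)            # (B, S, feat_dim), i.e. V^out
        pooled = out.max(dim=1).values       # max-pool over the S images
        return self.classifier(pooled)       # (B, 1) sequence-level logit

logits = Stage2SequenceNet()(torch.randn(2, 7, 1024))  # two Seq.7 inputs
```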

Figure 5
figure 5

Details of our second stage Transformer-based network. MSA refers to the multi-head self-attention module.

The training of the two-stage feature extraction and diagnostic networks is regarded as a binary classification problem, and the networks are optimized using the cross-entropy loss function. Specifically, the loss function is defined as:

$$\begin{aligned} {\mathscr {L}}_{cross\_entropy}(y_i, {\hat{y}}_i) = -y_i \log ({\hat{y}}_i) - (1-y_i)\log (1-{\hat{y}}_i) \end{aligned}$$
(2)

where \(y_i\) represents the label of the image or image sequence, and \({\hat{y}}_i\) represents the predicted probability of the network classifying it as fungal-positive.
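In PyTorch, this loss corresponds to applying binary cross-entropy to the sigmoid of the network's raw logit, for example via nn.BCEWithLogitsLoss; the snippet below is a minimal illustration with made-up logits and labels.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()   # numerically stable sigmoid + Eq. (2)
logits = torch.tensor([1.2, -0.4])   # raw network outputs for two samples
labels = torch.tensor([1.0, 0.0])    # y_i in {0, 1}
loss = criterion(logits, labels)
```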

Patient level diagnosis pipeline

Our networks are trained at the image level (stage 1) and the image sequence level (stage 2). In practice, we can further use the model for patient level diagnosis. As shown at the bottom of Fig. 4, the images of each patient are first processed by the first stage network to obtain image-level predictions. The predicted positive images are then combined with their neighboring images (defined by image indices) to generate a set of image sequences, and the stage 2 network processes these sequences to obtain sequence-level diagnostic predictions. We set a threshold \(\sigma\) for automatic diagnosis: a patient is diagnosed with fungal keratitis if at least \(\sigma\) image sequences are predicted as positive by the second stage network. Under this scheme, increasing the threshold yields higher specificity, while decreasing it yields higher sensitivity.
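The decision rule can be summarized by a short sketch such as the one below; the function and variable names are illustrative, and the probability threshold for calling a sequence positive is an assumed default.

```python
from typing import List

def diagnose_patient(sequence_probs: List[float],
                     prob_threshold: float = 0.5,
                     sigma: int = 1) -> bool:
    """Label a patient positive if at least `sigma` image sequences are predicted positive.

    `sequence_probs` are the stage 2 probabilities of all sequences built around
    the stage 1 positive images of this patient (one examination).
    """
    n_positive_sequences = sum(p >= prob_threshold for p in sequence_probs)
    return n_positive_sequences >= sigma

# Raising sigma trades sensitivity for specificity, as in Table 4.
is_positive = diagnose_patient([0.97, 0.12, 0.88], sigma=2)
```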

Preparation for training the networks

The original IVCM images are grayscale images at a resolution of \(384 \times 384\). We first normalize the input images with the mean and standard deviation calculated from the training data. Because the first stage backbone is initialized with a SwinTransformer model pretrained on ImageNet-22k from pytorch-image-models33, whose inputs are RGB images at a resolution of \(224 \times 224\), we resize the IVCM images to \(224 \times 224\) and average the weights of the first convolutional layer over the RGB channels to accept a single input channel. We also apply data augmentation by randomly flipping the images and perturbing their brightness, contrast, and saturation.
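The preprocessing and augmentation described above can be expressed with standard torchvision transforms, as sketched below; the normalization statistics and augmentation magnitudes are placeholders and should be computed or tuned on one's own training split.

```python
from torchvision import transforms

# Placeholder statistics: compute the real mean/std from the training split.
TRAIN_MEAN, TRAIN_STD = (0.5,), (0.25,)

train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),   # ensure a single-channel image
    transforms.Resize((224, 224)),                 # match the pretrained input size
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    # Brightness/contrast jitter; saturation jitter only matters for RGB-stored files.
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),                         # PIL image -> (1, H, W) tensor
    transforms.Normalize(TRAIN_MEAN, TRAIN_STD),
])

eval_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(TRAIN_MEAN, TRAIN_STD),
])
```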

The training datasets of both stages are imbalanced between the two categories. To balance them, we resample the images with a predefined weight equal to the reciprocal of the total number of images of the corresponding category in the training set. To alleviate possible overfitting to the training data, we select the model from the epoch that achieves the best performance on the validation set.
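One common way to implement this reciprocal-frequency resampling in PyTorch is a WeightedRandomSampler; the sketch below uses a toy dataset with an artificial 80/20 class imbalance purely for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for the real training set: 100 random "images" with an 80/20 label split.
images = torch.randn(100, 1, 224, 224)
labels = torch.cat([torch.zeros(80, dtype=torch.long), torch.ones(20, dtype=torch.long)])
train_dataset = TensorDataset(images, labels)

class_counts = torch.bincount(labels)        # number of images per category
class_weights = 1.0 / class_counts.float()   # reciprocal of the per-class image count
sample_weights = class_weights[labels]       # one weight per training image

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights),
                                replacement=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```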

Statistical analysis

Fungal keratitis diagnosis is a binary classification task, so we evaluate the performance of the proposed deep learning framework by sensitivity, specificity, and AUC score. We compute 95% confidence intervals for sensitivity and specificity with Clopper-Pearson intervals34, and for the AUC score, the area under the receiver operating characteristic curve, by bootstrapping35. The deep learning framework and statistical analysis are implemented in Python (version 3.6.9). The network architecture, training, and testing are built on PyTorch (version 1.9.0), PyTorch-lightning (version 1.5.10), and Jittor36 (version 1.3.4.1). The accuracy, sensitivity, specificity, and AUC score are calculated with scikit-learn (version 0.24.2) and torchmetrics (version 0.7.2).
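The confidence-interval computation can be reproduced with standard libraries, for example the Clopper-Pearson ("beta") interval from statsmodels and a simple percentile bootstrap over scikit-learn's roc_auc_score; the snippet below is a sketch of this procedure on synthetic predictions, not our exact analysis script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.proportion import proportion_confint

# Synthetic predictions standing in for real model outputs.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=500), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

# Clopper-Pearson 95% CI for sensitivity (true positives out of all positives).
n_tp = int(np.sum((y_pred == 1) & (y_true == 1)))
n_pos = int(np.sum(y_true == 1))
sens_lo, sens_hi = proportion_confint(n_tp, n_pos, alpha=0.05, method="beta")

# Bootstrap 95% CI for the AUC score.
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
    if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes present
        continue
    boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
auc_lo, auc_hi = np.percentile(boot_aucs, [2.5, 97.5])
```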

Ethics declarations

This study was conducted in compliance with the Declaration of Helsinki and approved by the ethics committee of Wuhan Aier Hankou Eye Hospital, Beijing Aier Intech Eye Hospital and Chengdu Aier East Eye Hospital. Informed consent was waived by the ethics committee of Wuhan Aier Hankou Eye Hospital, Beijing Aier Intech Eye Hospital and Chengdu Aier East Eye Hospital because of the retrospective nature of the study and anonymized usage of images.