Introduction

A substantial amount of chemical data is published in the scientific literature each year, including a wealth of information about chemical structures. However, these data are often presented as images, which makes them inaccessible for direct use and poses challenges for education, research, and development in chemistry. Manually organizing and converting such image data into machine-readable formats is time-consuming and error-prone. To improve research efficiency and reduce human effort, optical chemical structure recognition (OCSR) has emerged as a viable solution: the task of automatically recognizing chemical structure images and converting them into machine-readable formats such as SMILES [1] and SELFIES [2]. By turning depicted structures into structured data, OCSR facilitates broader sharing and reuse of chemical information.

Early OCSR methods, such as Kekule [3], OROCS [4], CLiDE [5], OSRA [6], and Imago [7], predominantly employ rule-based approaches. Specifically, they detect characters, bonds, rings, and other constituent elements in chemical structure images individually and assemble these elements into machine-readable chemical identifiers according to a carefully designed set of rules. However, these methods are cumbersome and depend on manually crafted rules: if an image contains patterns that the rules do not cover, performance can degrade significantly. Research has also shown that slight deformations of the images can substantially impact the performance of these methods [8].

With the rapid advancement of computational resources and the continuous improvement of deep learning methods, deep learning has made significant progress in the field of OCSR. These methods accomplish the task by automatically learning patterns and features from data. Current deep learning-based OCSR methods fall into two main categories: image-to-sequence approaches and image-to-graph approaches.

The image-to-sequence approach was first proposed by Staker et al. [9], who used a custom convolutional neural network to extract image features and GridLSTM [10] to generate SMILES sequences. However, this method is trained only on low-resolution images, which limits its applicability to other inputs. Similarly, Img2Mol [8] uses a custom convolutional neural network to extract image features and an RNN to generate SMILES sequences; the key difference is that Img2Mol uses the decoder of a pre-trained autoencoder to decode the image features into SMILES. Image2SMILES [11] is another image-to-sequence OCSR method, which applies ResNet-50 [12] to encode image features and a Transformer decoder to produce FG-SMILES. Inspired by the Show-and-Tell network [13], DECIMER [14], DECIMER 1.0 [15], and DECIMER-V2 [16] employ encoder-decoder architectures for OCSR. DECIMER uses InceptionNet-V3 [17] as the encoder and GRU [18] as the decoder to generate SMILES, but this configuration did not achieve results comparable to other methods. DECIMER 1.0 replaces InceptionNet-V3 with EfficientNet-B3 [19] and uses a Transformer decoder to generate SELFIES, and DECIMER-V2 substitutes EfficientNet-V2 [20] for EfficientNet-B3 and generates SMILES sequences. All of the above methods rely on convolutional neural networks to extract image features. Recognizing that convolutional neural networks primarily learn local features and are less effective at exploiting global information, Xu et al. introduced SwinOCSR [21], which employs Swin Transformer [22] to encode image features and a Transformer decoder to generate DeepSMILES [23].

MolScribe [24] was presented as a deep learning method that converts images into graphs, with a model architecture similar to SwinOCSR, which utilizes Swin Transformer to encode images and Transformer for decoding. The key innovation in this approach is that it directly constructs molecular graph structures based on the geometric layout of atoms and bonds in the image.

Currently, there is a growing trend in deep learning-based OCSR methods to rely on extensive computational resources to optimize performance. However, the performance of deep learning models often depends more on the model architecture than on computational power alone, and poorly designed models trained on large-scale datasets remain at risk of overfitting. Therefore, this paper focuses on improving the OCSR model structure, specifically the image feature extractor and the loss function.

Effective image feature extractors are crucial for obtaining accurate representations of chemical structures. Existing deep learning-based OCSR methods predominantly use CNNs to extract image features, but CNNs fall short in global feature modeling. Our prior method, SwinOCSR, employs the Swin Transformer to obtain a global representation of the image, but may overlook some local information; both limitations reduce accuracy. To take both global and local information into account, this paper adopts the multi-path Vision Transformer (MPViT) [25] as the image feature encoder and proposes a new OCSR method, multi-path optical chemical structure recognition (MPOCSR). MPViT uses multi-scale patch embedding and a multi-path structure to merge local features from convolution with global features from the Transformer, providing multi-scale and diversified image information for subsequent decoding. In the experiments reported below, when trained on the same dataset of 2 million molecules, MPOCSR achieves a higher accuracy than the other methods.

Additionally, this paper addresses another issue in OCSR: the class imbalance of elements in chemical structure representations. For instance, elements such as C, H, and O appear frequently, while Br and Cl appear rarely. Because the cross-entropy loss is sensitive to imbalanced element frequencies, models tend to predict common categories well but rare ones poorly. In this paper, the class-balanced (CB) loss [26] is introduced to mitigate the impact of element frequency imbalance on the model.

According to Rajan et al. [27], SMILES can yield more accurate results; therefore, in this paper, it is used as the output for the model.

The contributions of this paper are as follows:

  • A novel end-to-end OCSR method based on deep learning is introduced, which utilizes MPViT as the backbone. Compared to SwinOCSR, MPOCSR incorporates a multi-path structure, combining multi-scale global features and local features, resulting in a more comprehensive image feature representation. It achieves an accuracy of 90.95% on the test set, surpassing other existing methods.

  • The issue of imbalanced molecular element frequency is further explored by introducing CB loss as the loss function of the proposed model. Compared with the model trained using Focal loss, the model trained with CB loss achieves better performance.

  • A dataset containing two categories (Markush and non-Markush) is constructed, which comprises a total of 2 million molecules. The test set is obtained using the MaxMin algorithm. Compared with random selection, the test set obtained by the MaxMin algorithm is more representative of the sample space.

The subsequent sections of this paper are organized as follows: Sect. “Materials and methods” provides a detailed introduction to the model architecture and the dataset. Section “Experimental method and results” presents the experiments and results. Section “Discussion” discusses this research with current large-scale models. Section “Conclusion” offers a conclusion of the entire paper.

Materials and methods

Overall architecture of MPOCSR

The overall architecture of MPOCSR is shown in Fig. 1; it consists of a backbone, a Transformer encoder, and a Transformer decoder. The process begins by extracting high-level feature representations from the image with the backbone. Next, the image features, together with positional encodings, are fed into the Transformer encoder, which produces a sequential representation of the image. Finally, the Transformer decoder uses this sequential representation to generate the SMILES sequence corresponding to the given image.

Fig. 1 Overall architecture of MPOCSR

Backbone

We use MPViT as the backbone; its general structure is illustrated in Fig. 2. The image is first fed into the stem block to obtain low-level features. The stem block consists of two convolutional layers with a \(3\times 3\) kernel, a stride of 2, and padding of 1. The resulting features then pass through four stages to produce feature maps at different scales, where each stage is composed of a multi-scale patch embedding module and a multi-path Transformer block.
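To make the stem concrete, the following is a minimal PyTorch sketch of a stem block with two \(3\times 3\) convolutions of stride 2 and padding 1 as described above; the channel widths and the BatchNorm/Hardswish choices are illustrative assumptions rather than the exact MPViT configuration.

```python
import torch
import torch.nn as nn

class StemBlock(nn.Module):
    """Two 3x3 convolutions with stride 2 and padding 1, as described above.
    Channel widths and BatchNorm/Hardswish choices are illustrative."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch // 2),
            nn.Hardswish(),
            nn.Conv2d(out_ch // 2, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.Hardswish(),
        )

    def forward(self, x):      # x: (B, 3, H, W) chemical structure image
        return self.stem(x)    # -> (B, out_ch, H/4, W/4) low-level features
```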

Fig. 2 MPViT structure

Multi-scale patch embedding provides visual tokens with both coarse-grained and fine-grained information for the subsequent multi-path Transformer block. As shown in Fig. 3, to obtain visual tokens of different granularities, this module applies convolutions with kernel sizes of \(3\times 3\), \(5\times 5\), and \(7\times 7\) to the input image features. Because two consecutive \(3\times 3\) convolutions have the same receptive field as one \(5\times 5\) convolution, three consecutive \(3\times 3\) convolutions have the same receptive field as one \(7\times 7\) convolution, and stacked \(3\times 3\) convolutions require fewer parameters, the module realizes the three granularities with three consecutive \(3\times 3\) convolutions. To further reduce the number of parameters, \(3\times 3\) depthwise separable convolutions are used instead of regular \(3\times 3\) convolutions. The visual tokens of different granularities are then processed by the multi-path Transformer block to obtain visual features of different scales.
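The following PyTorch sketch illustrates this idea: three stacked \(3\times 3\) depthwise separable convolutions whose intermediate outputs correspond to effective \(3\times 3\), \(5\times 5\), and \(7\times 7\) receptive fields. The channel handling and strides are simplified assumptions, not the exact MPViT implementation.

```python
import torch.nn as nn

class DWSepConv(nn.Module):
    """3x3 depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, kernel_size=3, stride=stride, padding=1, groups=ch)
        self.pw = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, x):
        return self.pw(self.dw(x))

class MultiScalePatchEmbed(nn.Module):
    """Three stacked 3x3 depthwise separable convolutions; their outputs have
    effective 3x3, 5x5 and 7x7 receptive fields respectively."""
    def __init__(self, ch, stride=1):
        super().__init__()
        self.conv1 = DWSepConv(ch, stride)   # ~3x3 receptive field
        self.conv2 = DWSepConv(ch)           # ~5x5 receptive field
        self.conv3 = DWSepConv(ch)           # ~7x7 receptive field

    def forward(self, x):
        t1 = self.conv1(x)
        t2 = self.conv2(t1)
        t3 = self.conv3(t2)
        return t1, t2, t3                    # visual tokens of three granularities
```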

Fig. 3 Multi-scale patch embedding and multi-path Transformer block

As illustrated in Fig. 3, the multi-path Transformer block is composed of a convolutional layer, Transformer encoders, and a feature fusion module. The convolutional layer, also depicted in Fig. 3, consists of a \(1\times 1\) convolution and a \(3\times 3\) depthwise separable convolution. Three Transformer encoders encode the visual tokens of different granularities from the multi-scale patch embedding to obtain visual features of different scales. As shown in Fig. 3, the structure of each Transformer encoder is consistent with the original Transformer encoder. To reduce computational complexity, attention is computed with the factorized self-attention proposed in co-scale conv-attentional image Transformers (CoaT) [28]:

$$\begin{aligned} FactorAttn(Q,K,V) = \frac{Q}{\sqrt{C_{K}}}(softmax(K)^{T}V). \end{aligned}$$
(1)

where Q, K, and V denote the query, key, and value matrices, respectively, \(C_{K}\) is the channel dimension of K, T denotes the transpose operation, and \(softmax(\cdot )\) is the softmax operation.
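A minimal single-head PyTorch sketch of Eq. (1) is given below; taking the softmax over the token dimension of K follows the CoaT formulation, and multi-head splitting is omitted for brevity.

```python
import torch

def factorized_attention(q, k, v):
    """Single-head factorized self-attention of Eq. (1): Q / sqrt(C_K) @ (softmax(K)^T @ V).
    q, k, v: (B, N, C); the softmax is taken over the N tokens of K, as in CoaT."""
    c_k = k.size(-1)
    context = k.softmax(dim=1).transpose(-2, -1) @ v   # (B, C, C)
    return (q / c_k ** 0.5) @ context                  # (B, N, C)
```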

Finally, the feature fusion module combines the features from the convolutional layer and the three Transformer encoders to produce the output of the stage. As depicted in Fig. 3, this module concatenates the features from the preceding modules along the channel dimension and then applies a \(1\times 1\) convolution for feature fusion and dimension reduction, preparing the fused visual features for further processing.
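The fusion step can be sketched in PyTorch as follows; the assumption that all four paths share the same channel width is made only to keep the example short.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate the convolutional-path feature map and the three Transformer-path
    feature maps along the channel axis, then fuse and reduce with a 1x1 convolution."""
    def __init__(self, ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(4 * ch, out_ch, kernel_size=1)

    def forward(self, conv_feat, t1, t2, t3):            # each: (B, ch, H, W)
        x = torch.cat([conv_feat, t1, t2, t3], dim=1)    # (B, 4*ch, H, W)
        return self.fuse(x)                              # (B, out_ch, H, W)
```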

Transformer

The features output by the backbone are linearly mapped to the Transformer dimension and then fed into the Transformer encoder, which is composed of 6 identical encoder layers, as illustrated in Fig. 4. Each encoder layer consists of two sub-modules: multi-head self-attention and a feed-forward network; after each sub-module, layer normalization and a residual connection are applied. The features are first multiplied by three weight matrices to obtain the query matrix Q, key matrix K, and value matrix V. The output of the self-attention layer is then computed according to Formula 2, where \(C_{K}\) is the channel dimension of K, T denotes the transpose operation, and \(softmax(\cdot )\) is the softmax operation. Finally, the result is passed through the feed-forward network to obtain the output of the encoder layer. After passing through 6 encoder layers, the features are sent to the Transformer decoder.

$$\begin{aligned} Attn(Q,K,V) = softmax (\frac{QK^{T}}{\sqrt{C_{K}}})V \end{aligned}$$
(2)
Fig. 4 Transformer encoder
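A minimal sketch of this sequence encoder using PyTorch's built-in Transformer modules is shown below; the backbone feature dimension, the learned positional embedding, and the maximum sequence length are illustrative assumptions, while d_model = 256, 8 heads, and 6 layers follow the training setup reported later.

```python
import torch
import torch.nn as nn

# Backbone features are projected to the model dimension, summed with positional
# encodings, and passed through a stack of 6 standard encoder layers.
feature_dim, d_model, max_len = 512, 256, 1024   # feature_dim and max_len are assumptions

proj = nn.Linear(feature_dim, d_model)
pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, dropout=0.1, batch_first=True),
    num_layers=6,
)

feats = torch.randn(2, 196, feature_dim)            # (B, N, feature_dim) from the backbone
memory = encoder(proj(feats) + pos_embed[:, :196])  # (B, N, d_model) sequential representation
```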

Similar to the encoder, the decoder is constructed by stacking 6 identical decoder layers. As shown in Fig. 5, each decoder layer consists of three sub-modules: masked multi-head self-attention, multi-head cross-attention, and a feed-forward network. The masked multi-head self-attention layer takes the sequence generated up to the previous time step as input and uses masking to prevent the model from seeing information beyond the current time step. The multi-head cross-attention layer uses the output of the previous sub-module as the query for attention over the encoder output. The features are then passed through the feed-forward network to obtain the output of the decoder layer. After 6 decoder layers, the output is sent to a linear layer and a softmax layer to obtain the output token for the current time step; the resulting sequence forms the model's prediction for the OCSR task.
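The following greedy-decoding sketch illustrates how such a decoder can generate a SMILES sequence token by token; the special token ids, vocabulary size, and maximum length are placeholder assumptions, and beam search or sampling variants are not shown.

```python
import torch
import torch.nn as nn

d_model, vocab_size, bos_id, eos_id, max_len = 256, 70, 1, 2, 100   # placeholders

embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
to_vocab = nn.Linear(d_model, vocab_size)

@torch.no_grad()
def greedy_decode(memory):                 # memory: (1, N, d_model) from the encoder
    ys = torch.tensor([[bos_id]])          # start with the begin-of-sequence token
    for _ in range(max_len):
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1))
        out = decoder(embed(ys), memory, tgt_mask=tgt_mask)
        next_tok = to_vocab(out[:, -1]).argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if next_tok.item() == eos_id:      # stop at the end-of-sequence token
            break
    return ys                              # predicted token ids, mapped back to SMILES afterwards
```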

Fig. 5 Transformer decoder

Dataset

Because directly extracting chemical structure images from the literature is challenging, and manual annotation is cumbersome and difficult to perform correctly at scale, we did not extract chemical structure images from the literature; instead, we used CDK [29] to generate the data. The molecular structures were downloaded from PubChem [30], and a dataset containing 2 million images was constructed from them. This dataset includes two types of data: one with Markush structures and the other with non-Markush structures.

The construction process was as follows. First, molecular structures with serial numbers 1–2.5 million, along with their SMILES representations, were downloaded from PubChem's FTP site, and CDK was used to obtain canonical SMILES. From the processed data, 1 million molecules were randomly selected and converted into Markush structures using RanDepict [31]; these form the first category of the dataset. Another 1 million of the remaining molecules were used to create the second category, containing non-Markush structures. Finally, CDK was used to generate the images corresponding to these molecules. During image generation, certain CDK parameters, including the font, superscript and subscript spacing, and subscript size, were modified so that the generated images more closely resemble those found in the literature. Figure 6 shows examples of both categories of molecules.

Fig. 6 Examples of each category

The process of generating Markush structures with RanDepict is as follows: read the input SMILES representation of a molecule and add explicit hydrogen atoms. Then, randomly replace between 1 and 4 carbon (C) or hydrogen (H) atoms with the characters R, X, Y, or Z, appending a random numeric index between 1 and 20 and one of the characters "a", "b", "c", "d", "e", or "f". After the random replacements, remove the explicit hydrogen atoms to obtain the final Markush structure. For example, consider the input SMILES "CC#CC(=C=C(C)C)C". Adding explicit hydrogen atoms gives "C([H])([H])([H])C#CC(=C=C(C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]". After random replacement, we obtain "C([H])([H])([H])C#CC(=C=[Y18d](C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]". Finally, removing the explicit hydrogens yields "CC#CC(=C=[Y18d](C)C)C".
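For illustration, the sketch below re-implements this substitution procedure with RDKit; the paper itself uses RanDepict, whose internal implementation may differ, and the helper name and the label-writing step via atom-map numbers are our own assumptions.

```python
import random
from rdkit import Chem

def to_markush(smiles, n_min=1, n_max=4):
    """Replace 1-4 C or H atoms with an R/X/Y/Z label carrying a random index (1-20)
    and a letter (a-f). Placeholder atoms are written back into the SMILES via
    atom-map numbers; Hs attached to a placeholder may remain explicit in this sketch."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    candidates = [a.GetIdx() for a in mol.GetAtoms() if a.GetSymbol() in ("C", "H")]
    n = random.randint(n_min, min(n_max, len(candidates)))
    rw = Chem.RWMol(mol)
    labels = {}
    for map_num, idx in enumerate(random.sample(candidates, n), start=1):
        dummy = Chem.Atom(0)               # "*" placeholder atom
        dummy.SetAtomMapNum(map_num)
        rw.ReplaceAtom(idx, dummy)
        labels[map_num] = f"{random.choice('RXYZ')}{random.randint(1, 20)}{random.choice('abcdef')}"
    smi = Chem.MolToSmiles(Chem.RemoveHs(rw.GetMol()))
    for map_num, label in labels.items():  # write the R-group labels into the string
        smi = smi.replace(f"[*:{map_num}]", f"[{label}]")
    return smi

print(to_markush("CC#CC(=C=C(C)C)C"))      # e.g. "CC#CC(=C=[Y18d](C)C)C"
```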

The training, validation, and test sets were obtained by partitioning the two categories of data as follows. For the test set, 100,000 molecules were selected from each category using the MaxMin algorithm, giving 200,000 molecules in total. The validation set was constructed by randomly selecting 100,000 molecules from the remaining molecules of each category, again yielding 200,000 molecules. The remaining 1.6 million molecules were used as the training set. The number of samples in each split is given in Table 1, and the distribution of SMILES lengths in the dataset is shown in Fig. 7.
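A hedged sketch of such diversity-based selection with RDKit's MaxMin picker is shown below; the fingerprint type and radius are illustrative assumptions rather than the exact settings used in this work.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

def maxmin_test_split(smiles_list, n_test):
    """Pick a maximally diverse test subset with the MaxMin algorithm; Morgan
    fingerprints (radius 2, 2048 bits) are an illustrative choice of descriptor."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    picker = MaxMinPicker()
    test_idx = list(picker.LazyBitVectorPick(fps, len(fps), n_test))
    rest_idx = [i for i in range(len(fps)) if i not in set(test_idx)]
    return test_idx, rest_idx   # remaining indices are then split into validation/training
```

In practice, the picker would be run on each category's full candidate pool to select its 100,000 test molecules.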

Table 1 Description of training, validation and test set
Fig. 7 Distribution of the lengths of SMILES

The SMILES representations of the dataset consist of a total of 67 different characters, which are treated as tokens. These characters are as follows:

C, 1, =, (, 2, ), O, N, 3, S, 4, [, R, ], H, Y, 7, Z, e, @, X, 8, 6, d, l, b, 0, a, 9, B, r, +, F, 5, ., -, /, f, #, P, \(\backslash \), c, M, n, I, I, T, o, g, t, s, K, A, V, u, G, h, W, L, U, m, E, D, y, %, p, k.
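A minimal sketch of character-level SMILES tokenization is given below; the special tokens and their ids are illustrative assumptions, not the exact vocabulary layout used by MPOCSR.

```python
# Character-level SMILES tokenization sketch; <pad>, <sos> and <eos> are assumed
# special tokens whose ids are not specified in the paper.
def build_vocab(smiles_list):
    specials = ["<pad>", "<sos>", "<eos>"]
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate(specials + chars)}

def encode(smiles, stoi):
    return [stoi["<sos>"]] + [stoi[ch] for ch in smiles] + [stoi["<eos>"]]

stoi = build_vocab(["CC#CC(=C=C(C)C)C", "CC#CC(=C=[Y18d](C)C)C"])
print(encode("CC#CC(=C=C(C)C)C", stoi))
```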

Experimental method and results

Evaluation metric

In this paper, the following evaluation metrics are used:

Accuracy: this metric is calculated by determining the proportion of SMILES predictions made by the model that are identical to the true SMILES representations. Accuracy is computed as follows:

$$\begin{aligned} Accuracy = \frac{S}{N} \end{aligned}$$
(3)

where S represents the number of SMILES that are predicted correctly and N represents the total number of SMILES.

Tanimoto similarity: it is a commonly used metric in the chemical domain to assess molecular similarity. This paper uses CDK for calculation.

Validity: The analysis of validity is performed using the SMILES parser of CDK. SMILES representations that can be successfully parsed by the SMILES parser are considered valid, while those that cannot be parsed are classified as invalid. Validity is calculated by determining the proportion of valid SMILES representations among all SMILES in the dataset.

BLEU and ROUGE: these are common evaluation metrics in the field of image description and are used to assess the similarity between the sequences predicted by the model and the true sequences.
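The sketch below illustrates how accuracy, validity, and Tanimoto similarity can be computed for a batch of predictions; note that this work computes validity and Tanimoto similarity with CDK, whereas RDKit Morgan fingerprints are used here purely for illustration.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def evaluate(pred_smiles, true_smiles):
    """Exact-match accuracy, validity and mean Tanimoto similarity for a list of
    predictions. Morgan fingerprints stand in for the CDK fingerprints used in the paper."""
    exact, valid, tanimotos = 0, 0, []
    for p, t in zip(pred_smiles, true_smiles):
        if p == t:
            exact += 1
        mp, mt = Chem.MolFromSmiles(p), Chem.MolFromSmiles(t)
        if mp is not None:                  # prediction parses, i.e. it is valid
            valid += 1
            if mt is not None:
                fp_p = AllChem.GetMorganFingerprintAsBitVect(mp, 2, 2048)
                fp_t = AllChem.GetMorganFingerprintAsBitVect(mt, 2, 2048)
                tanimotos.append(DataStructs.TanimotoSimilarity(fp_p, fp_t))
    n = len(true_smiles)
    return {"accuracy": exact / n, "validity": valid / n,
            "tanimoto": sum(tanimotos) / max(len(tanimotos), 1)}
```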

Table 2 Performance comparisons of methods in the test set

Training setup

The batch size is set to 128 during training. The Transformer encoder and decoder each have 6 stacked layers, the token embedding dimension is 256, and the number of attention heads is 8. We use the AdamW [32] optimizer with an initial learning rate of 5e−4 and a cosine learning rate scheduler. The dropout rate is set to 0.1. Training is performed on Ubuntu 20.04 with an A100 GPU for a total of 30 epochs.
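For reference, a minimal sketch of this optimization setup in PyTorch is shown below; the model placeholder, the absence of warm-up, and the per-epoch scheduler stepping are assumptions, not the exact training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(256, 70)                    # stand-in for the full MPOCSR model
optimizer = AdamW(model.parameters(), lr=5e-4)      # initial learning rate 5e-4
scheduler = CosineAnnealingLR(optimizer, T_max=30)  # cosine decay over 30 epochs

for epoch in range(30):
    # ... one training epoch with batch size 128; dropout of 0.1 is applied inside the model ...
    scheduler.step()
```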

Results

Method comparison experiment

On the test set, we compared the proposed MPOCSR model with other deep learning-based models, including Image2SMILES, DECIMER 1.0, DECIMER-V2, and SwinOCSR. All methods were trained on the dataset proposed in this paper, and the results are shown in Table 2. MolScribe was not included in the comparison: its training requires a dataset with atomic coordinate information, and its image data are generated dynamically during training, so we could not obtain the specific dataset images needed to train MPOCSR under the same conditions.

MPOCSR achieved an accuracy of 90.59% on the test set. Compared to Image2SMILES, DECIMER1.0, DECIMER-V2, and SwinOCSR, the MPOCSR model outperformed them in accuracy by 4.43%, 2.62%, 1.01%, and 0.19%, respectively. Furthermore, MPOCSR also demonstrated the best performance in the BLEU and ROUGE similarity evaluation metrics. Although MPOCSR scored slightly lower than SwinOCSR in the Tanimoto and Validity metrics, it excelled with a higher accuracy score. The primary focus of our evaluation is accuracy, which demands the predicted SMILES to be identical to the ground truth, as it better reflects the core performance of the model in chemical structure recognition. Considering all the evaluation metrics, MPOCSR was the model that achieved the best performance on the test set.

In terms of model architecture, Image2SMILES, DECIMER 1.0, and DECIMER-V2 use convolutional neural networks to extract image features and then exploit Transformers for sequence generation, while SwinOCSR uses the Swin Transformer for image feature extraction and likewise employs a Transformer for decoding. In contrast, MPOCSR employs a multi-path structure that combines features from different scales before Transformer decoding. The experimental results indicate that, by capturing features at various scales, MPOCSR provides richer visual information for the subsequent decoding process, enabling the model to generate more accurate SMILES sequences.

Loss function comparison experiment

The distribution of tokens in the dataset is shown in Fig. 8. The tokens follow a long-tail distribution: a minority of tokens, such as "C", "=", "(", and ")", constitutes the head of the distribution, while the majority of tokens make up the tail. This distribution leads the model to predominantly predict the few head tokens. To address this issue, we employed the CB loss for model training. CB loss weights the loss according to the effective number of samples of each category, resulting in a class-balanced loss function. CB loss is computed as follows:

$$\begin{aligned} CB\ loss = -\frac{1-\beta }{1-\beta ^{n_{y}}}\sum _{i=1}^{C}(1-p_{i})^{\gamma }\log {p_{i}} \end{aligned}$$
(4)
Table 3 Results on the test set of MPOCSR trained using different loss functions

where C represents the total number of categories, \(\beta \) is a hyperparameter for balancing the categories, \(n_y\) is the number of ground-truth samples of class y, and \(\gamma \) is a hyperparameter that adjusts the weighting factor. \(p_i\) denotes the model's predicted probability for the i-th category, calculated as follows:

$$\begin{aligned} p_{i} = {\left\{ \begin{array}{ll} \sigma (o_{i}), &{} y_{i}=1 \\ 1-\sigma (o_{i}), &{} otherwise \end{array}\right. } \end{aligned}$$
(5)

where \(y_i\) represents the ground truth label for the i-th category, \(o_i\) represents the model’s output for the i-th category, and \(\sigma (\cdot )\) is the sigmoid function.
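A minimal PyTorch sketch of Eqs. (4)–(5) is given below; the per-class weight normalization, the default values of \(\beta \) and \(\gamma \), and the treatment of labels as class indices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cb_focal_loss(logits, labels, samples_per_class, beta=0.9999, gamma=2.0):
    """Class-balanced focal loss following Eqs. (4)-(5).
    logits: (B, C) model outputs; labels: (B,) class indices;
    samples_per_class: (C,) tensor of token counts n_y."""
    num_classes = logits.size(1)
    one_hot = F.one_hot(labels, num_classes).float()
    # effective-number weight (1 - beta) / (1 - beta^{n_y}) of each class
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, samples_per_class.float()))
    weights = weights / weights.sum() * num_classes          # normalisation (a common convention)
    weights = (one_hot * weights).sum(dim=1, keepdim=True)   # (B, 1): weight of each sample's class
    # focal modulation on sigmoid probabilities p_i, Eq. (5)
    p = torch.sigmoid(logits)
    p_t = one_hot * p + (1.0 - one_hot) * (1.0 - p)
    bce = F.binary_cross_entropy_with_logits(logits, one_hot, reduction="none")  # = -log p_t
    loss = weights * (1.0 - p_t) ** gamma * bce
    return loss.sum(dim=1).mean()
```

Here samples_per_class would be obtained by counting the frequency of each token in the training SMILES.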

Fig. 8 Frequency distribution of tokens in the dataset

To analyze the impact of different loss functions on model performance, we trained the model using BCE loss, Focal loss, and CB loss, and evaluated each on the test set. As shown in Table 3, compared to the model trained with BCE loss, the models trained with Focal loss and CB loss improved accuracy by 0.10% and 0.36%, respectively, and also improved the similarity metrics Tanimoto, BLEU, and ROUGE. These results indicate that the long-tail token distribution in the dataset affected model performance during training, and that Focal loss and CB loss mitigated this effect by assigning different weights to different tokens. Since the model trained with CB loss performed best across all metrics, all subsequent experiments in this study use MPOCSR trained with CB loss.

Table 4 Results of MPOCSR on different categories of data

Influence of molecular category

In order to analyze the performance of MPOCSR on different categories of molecules, we conducted separate evaluations on the two types of data present in the test set. As shown in Table 4, for non-Markush structure molecules, MPOCSR achieved an accuracy of 93.05%, while the accuracy was slightly lower at 88.84% for Markush structure molecules. Markush structures lack standardization, and the functional groups within them are relatively harder to recognize, which contributed to the decreased performance on Markush structure molecules. However, in terms of Tanimoto, BLEU, and ROUGE evaluation metrics, MPOCSR exhibited excellent performance on both types of data. This suggests that while the model may not always predict all the details of Markush structures correctly, it still successfully generates molecules that resemble the original structures very closely. These findings demonstrate that MPOCSR maintains strong performance on both classes of data.

Influence of SMILES length

In order to analyze the performance of MPOCSR on SMILES sequences of different lengths, the test set was divided into segments based on SMILES lengths, specifically in the ranges of 1–25, 26–50, 51–75, and 76–100. Table 5 displays the results of MPOCSR on SMILES sequences of different lengths. The experimental results show that predictions by MPOCSR for sequences with lengths of 1–25 were slightly less accurate, primarily due to the limited representation of such lengths in the dataset, making it challenging for the model to learn their features effectively. However, from the results on sequences of other lengths, it is evident that as the sequence length increases, the accuracy of the model in predictions gradually decreases. This phenomenon is attributed to the increased variability and complexity of longer SMILES sequences. Nevertheless, metrics like Tanimoto, BLEU and ROUGE indicate that, even with longer sequences, the predicted SMILES sequences maintain a high degree of similarity to the true SMILES sequences, suggesting that MPOCSR can successfully capture the key features of chemical structures across different length ranges, resulting in a high degree of structural preservation and similarity.

Table 5 Results of MPOCSR with different SMILES lengths
Fig. 9 Results generated by GPT-4

Fig. 10 Results generated by LLaMA

Discussion

GPT-4 [33], LLaMA [34], PaLM [35], and similar large-scale models have made significant advances in various domains, including text generation, machine translation, and image captioning. On a growing number of tasks, direct testing or fine-tuning of these large models has been shown to achieve promising results.

However, many of these large models are accessible only through APIs, without openly shared code or training details. Consequently, users may encounter challenges related to model fidelity and uncertainty of results.

For instance, Figs. 9 and 10 show the results of using GPT-4 and LLaMA to generate SMILES for given images. The generated results are often inconsistent with the contents of the images, reflecting a misalignment between the model's predictions and the input data.

Given the universality, adaptability and convenience of large models, a promising future avenue involves integrating these models with specialized models that possess domain-specific expertise in the field of OCSR. This collaborative approach aims to combine the strengths of both large models and specialized models to address the challenges in OCSR effectively.

Conclusion

This paper introduces a multi-scale, multi-level feature-based OCSR method, which leverages MPViT as the backbone, uses a Transformer to predict SMILES sequences, and employs CB loss to mitigate the impact of class imbalance. Experiments show that MPOCSR effectively captures critical features of chemical structures, maintains good recognition performance on Markush structures, and delivers strong performance in predicting long sequences.

Although our method has achieved promising results, some limitations remain for future improvement. Theoretically, the main limitation lies in the "black box" nature of deep learning models, which restricts a comprehensive understanding of how the model learns and processes chemical structures. To enhance interpretability, we plan to explore visualization of the attention mechanisms, which can highlight the specific regions the model focuses on and thereby provide deeper insight into its decision-making process. Practically, the performance of MPOCSR depends heavily on the quality and diversity of the training data, and its adaptability to unseen data needs further improvement. To address this limitation, we intend to incorporate a wider range of chemical structure images and to investigate image augmentation techniques, thereby enhancing the model's generalization capability. Furthermore, in the field of OCSR, model performance evaluation relies primarily on accuracy, and statistical tests demonstrating the significance of results are lacking; to establish the significance of our findings, we will consider statistical tests such as the Friedman test. Finally, given the rapid development of large pre-trained models such as GPT-4, our future research will explore opportunities to integrate these models, offering a more comprehensive platform that includes chemical structure conversion, analysis, and other functions.