Introduction

A substantial amount of chemical data is published in the scientific literature each year, including a wealth of information about chemical structures. However, these data are often presented as images, which makes them inaccessible for direct use and poses challenges for education, research, and development in chemistry. Manually organizing and converting such image data into machine-readable formats is time-consuming and error-prone. To improve research efficiency and reduce human effort, optical chemical structure recognition (OCSR) has emerged as a viable solution: the task of automatically recognizing chemical structure images and converting them into machine-readable formats such as SMILES [1] and SELFIES [2]. By turning depicted structures into structured data, OCSR facilitates broader sharing and reuse of chemical information.

Early OCSR methods, such as Kekule [3], OROCS [4], CLiDE [5], OSRA [6], and Imago [7], predominantly employ rule-based approaches. Specifically, they detect characters, bonds, rings, and other constituent elements in chemical structure images individually and assemble these elements into machine-readable chemical identifiers according to a carefully designed set of rules. However, these methods are cumbersome and depend on manually crafted rules: if an image contains patterns that the rules do not cover, performance can degrade significantly. Research has also shown that slight deformations of the images can substantially impact the performance of these methods [8].

With the rapid advancement of computational resources and the continuous improvement of deep learning methods, deep learning has made significant progress in the field of OCSR. These methods accomplish the task by automatically learning patterns and features from data. Current deep learning-based OCSR methods fall into two main categories: image-to-sequence approaches and image-to-graph approaches.

The image-to-sequence approach was first proposed by Staker et al. [9], who used a custom convolutional neural network to extract image features and GridLSTM [10] to generate SMILES sequences. However, this method is trained only on low-resolution images, which limits its applicability to other inputs. Similarly, Img2Mol [8] uses a custom convolutional neural network to extract image features and an RNN to generate SMILES sequences; the key difference is that Img2Mol uses the decoder of a pre-trained autoencoder to decode the image features into SMILES. Image2SMILES [11] is another image-to-sequence OCSR method, which applies ResNet-50 [12] to encode image features and a Transformer decoder to produce FG-SMILES. Inspired by the Show-and-Tell network [13], DECIMER [14], DECIMER 1.0 [15], and DECIMER-V2 [16] employ encoder-decoder architectures for OCSR. DECIMER uses InceptionNet-V3 [17] as the encoder and GRU [18] as the decoder to generate SMILES, but this configuration did not achieve results comparable to other methods. DECIMER 1.0 replaces InceptionNet-V3 with EfficientNet-B3 [19] and uses a Transformer decoder to generate SELFIES, and DECIMER-V2 substitutes EfficientNet-V2 [20] for EfficientNet-B3 and generates SMILES sequences. All of the above methods rely on convolutional neural networks to extract image features. Recognizing that convolutional neural networks primarily learn local features and are less effective at exploiting global information, Xu et al. introduced SwinOCSR [21], which employs Swin Transformer [22] to encode image features and a Transformer decoder to generate DeepSMILES [23].

MolScribe [24] was presented as a deep learning method that converts images into graphs, with a model architecture similar to SwinOCSR, which utilizes Swin Transformer to encode images and Transformer for decoding. The key innovation in this approach is that it directly constructs molecular graph structures based on the geometric layout of atoms and bonds in the image.

Currently, there is a growing trend in deep learning-based OCSR methods to rely on extensive computational resources to optimize performance. However, the performance of deep learning models often depends more on the model architecture than on computational power alone, and poorly designed models trained on large-scale datasets remain at risk of overfitting. Therefore, this paper focuses on improving the OCSR model structure, specifically the image feature extractor and the loss function.

Effective image feature extractors are crucial for obtaining accurate representations of chemical structures. Existing deep learning-based OCSR methods predominantly use CNNs to extract image features, but CNNs fall short in global feature modeling. Our prior method, SwinOCSR, employs the Swin Transformer to obtain a global representation of the image, but may overlook some local information; both limitations reduce accuracy. To take both global and local information into account, this paper adopts the multi-path Vision Transformer (MPViT) [25] as the image feature encoder and proposes a new OCSR method, multi-path optical chemical structure recognition (MPOCSR). MPViT uses multi-scale patch embedding and a multi-path structure to merge local features from convolution with global features from the Transformer, providing multi-scale and diversified image information for subsequent decoding. In the experiments reported below, when trained on the same dataset of 2 million molecules, MPOCSR achieves a higher accuracy than the other methods.

Additionally, this paper addresses another issue in OCSR: the class imbalance of elements in chemical structure representations. For instance, elements such as C, H, and O appear frequently, while Br and Cl appear rarely. Because the cross-entropy loss is sensitive to imbalanced element frequencies, models tend to predict common categories well but rare ones poorly. In this paper, the class-balanced (CB) loss [26] is introduced to mitigate the impact of element frequency imbalance on the model.

According to Rajan et al. [27], SMILES can yield more accurate results; therefore, in this paper, it is used as the output for the model.

The contributions of this paper are as follows:

  • A novel end-to-end OCSR method based on deep learning is introduced, which utilizes MPViT as the backbone. Compared to SwinOCSR, MPOCSR incorporates a multi-path structure, combining multi-scale global features and local features, resulting in a more comprehensive image feature representation. It achieves an accuracy of 90.95% on the test set, surpassing other existing methods.

  • The issue of imbalanced molecular element frequency is further explored by introducing CB loss as the loss function of the proposed model. Compared with the model trained using Focal loss, the model trained with CB loss achieves better performance.

  • A dataset containing two categories (Markush and non-Markush) is constructed, which comprises a total of 2 million molecules. The test set is obtained using the MaxMin algorithm. Compared with random selection, the test set obtained by the MaxMin algorithm is more representative of the sample space.

The subsequent sections of this paper are organized as follows: Sect. “Materials and methods” provides a detailed introduction to the model architecture and the dataset. Section “Experimental method and results” presents the experiments and results. Section “Discussion” discusses this research with current large-scale models. Section “Conclusion” offers a conclusion of the entire paper.

Materials and methods

Overall architecture of MPOCSR

The overall architecture of MPOCSR is shown in Fig. 1; it consists of a backbone, a Transformer encoder, and a Transformer decoder. The process begins by extracting high-level feature representations from the image with the backbone. Next, the image features, together with positional encodings, are fed into the Transformer encoder, which produces a sequential representation of the image. Finally, the Transformer decoder uses this sequential representation to generate the SMILES sequence corresponding to the given image.

Fig. 1 Overall architecture of MPOCSR

Backbone

We use MPViT as the backbone; its general structure is illustrated in Fig. 2. The image is first fed into the stem block to obtain low-level features. The stem block consists of two convolutional layers with a \(3\times 3\) kernel, a stride of 2, and padding of 1. The resulting features then pass through four stages to produce feature maps at different scales, where each stage is composed of a multi-scale patch embedding module and a multi-path Transformer block.
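To make the stem concrete, the following is a minimal PyTorch sketch of a stem block with two \(3\times 3\) convolutions of stride 2 and padding 1 as described above; the channel widths and the BatchNorm/Hardswish choices are illustrative assumptions rather than the exact MPViT configuration.

```python
import torch
import torch.nn as nn

class StemBlock(nn.Module):
    """Two 3x3 convolutions with stride 2 and padding 1, as described above.
    Channel widths and BatchNorm/Hardswish choices are illustrative."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch // 2),
            nn.Hardswish(),
            nn.Conv2d(out_ch // 2, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.Hardswish(),
        )

    def forward(self, x):      # x: (B, 3, H, W) chemical structure image
        return self.stem(x)    # -> (B, out_ch, H/4, W/4) low-level features
```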

Fig. 2 MPViT structure

Multi-scale patch embedding provides visual tokens with both coarse-grained and fine-grained information for the subsequent multi-path Transformer block. As shown in Fig. 3, to obtain visual tokens of different granularities, this module applies convolutions with kernel sizes of \(3\times 3\), \(5\times 5\), and \(7\times 7\) to the input image features. Because two consecutive \(3\times 3\) convolutions have the same receptive field as one \(5\times 5\) convolution, three consecutive \(3\times 3\) convolutions have the same receptive field as one \(7\times 7\) convolution, and stacked \(3\times 3\) convolutions require fewer parameters, the module realizes the three granularities with three consecutive \(3\times 3\) convolutions. To further reduce the number of parameters, \(3\times 3\) depthwise separable convolutions are used instead of regular \(3\times 3\) convolutions. The visual tokens of different granularities are then processed by the multi-path Transformer block to obtain visual features of different scales.
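The following PyTorch sketch illustrates this idea: three stacked \(3\times 3\) depthwise separable convolutions whose intermediate outputs correspond to effective \(3\times 3\), \(5\times 5\), and \(7\times 7\) receptive fields. The channel handling and strides are simplified assumptions, not the exact MPViT implementation.

```python
import torch.nn as nn

class DWSepConv(nn.Module):
    """3x3 depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, kernel_size=3, stride=stride, padding=1, groups=ch)
        self.pw = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, x):
        return self.pw(self.dw(x))

class MultiScalePatchEmbed(nn.Module):
    """Three stacked 3x3 depthwise separable convolutions; their outputs have
    effective 3x3, 5x5 and 7x7 receptive fields respectively."""
    def __init__(self, ch, stride=1):
        super().__init__()
        self.conv1 = DWSepConv(ch, stride)   # ~3x3 receptive field
        self.conv2 = DWSepConv(ch)           # ~5x5 receptive field
        self.conv3 = DWSepConv(ch)           # ~7x7 receptive field

    def forward(self, x):
        t1 = self.conv1(x)
        t2 = self.conv2(t1)
        t3 = self.conv3(t2)
        return t1, t2, t3                    # visual tokens of three granularities
```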

Fig. 3 Multi-scale patch embedding and multi-path Transformer block

As illustrated in Fig. 3, the multi-path Transformer block is composed of a convolutional layer, Transformer encoders, and a feature fusion module. The convolutional layer, also depicted in Fig. 3, consists of a \(1\times 1\) convolution and a \(3\times 3\) depthwise separable convolution. Three Transformer encoders encode the visual tokens of different granularities from the multi-scale patch embedding to obtain visual features of different scales. As shown in Fig. 3, the structure of each Transformer encoder is consistent with the original Transformer encoder. To reduce computational complexity, attention is computed with the factorized self-attention proposed in co-scale conv-attentional image Transformers (CoaT) [28]:

$$\begin{aligned} FactorAttn(Q,K,V) = \frac{Q}{\sqrt{C_{K}}}(softmax(K)^{T}V). \end{aligned}$$
(1)

where Q, K, and V denote the query, key, and value matrices, respectively, \(C_{K}\) is the channel dimension of K, T denotes the transpose operation, and \(softmax(\cdot )\) is the softmax operation.
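A minimal single-head PyTorch sketch of Eq. (1) is given below; taking the softmax over the token dimension of K follows the CoaT formulation, and multi-head splitting is omitted for brevity.

```python
import torch

def factorized_attention(q, k, v):
    """Single-head factorized self-attention of Eq. (1): Q / sqrt(C_K) @ (softmax(K)^T @ V).
    q, k, v: (B, N, C); the softmax is taken over the N tokens of K, as in CoaT."""
    c_k = k.size(-1)
    context = k.softmax(dim=1).transpose(-2, -1) @ v   # (B, C, C)
    return (q / c_k ** 0.5) @ context                  # (B, N, C)
```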

Finally, the feature fusion module combines the features from the convolutional layer and the three Transformer encoders to produce the output of the stage. As depicted in Fig. 3, this module concatenates the features from the preceding modules along the channel dimension and then applies a \(1\times 1\) convolution for feature fusion and dimension reduction, preparing the fused visual features for further processing.
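The fusion step can be sketched in PyTorch as follows; the assumption that all four paths share the same channel width is made only to keep the example short.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate the convolutional-path feature map and the three Transformer-path
    feature maps along the channel axis, then fuse and reduce with a 1x1 convolution."""
    def __init__(self, ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(4 * ch, out_ch, kernel_size=1)

    def forward(self, conv_feat, t1, t2, t3):            # each: (B, ch, H, W)
        x = torch.cat([conv_feat, t1, t2, t3], dim=1)    # (B, 4*ch, H, W)
        return self.fuse(x)                              # (B, out_ch, H, W)
```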

Transformer

The features output by the backbone are linearly mapped to the Transformer dimension and then fed into the Transformer encoder, which is composed of 6 identical encoder layers, as illustrated in Fig. 4. Each encoder layer consists of two sub-modules: multi-head self-attention and a feed-forward network; after each sub-module, layer normalization and a residual connection are applied. The features are first multiplied by three weight matrices to obtain the query matrix Q, key matrix K, and value matrix V. The output of the self-attention layer is then computed according to Formula 2, where \(C_{K}\) is the channel dimension of K, T denotes the transpose operation, and \(softmax(\cdot )\) is the softmax operation. Finally, the result is passed through the feed-forward network to obtain the output of the encoder layer. After passing through 6 encoder layers, the features are sent to the Transformer decoder.

$$\begin{aligned} Attn(Q,K,V) = softmax (\frac{QK^{T}}{\sqrt{C_{K}}})V \end{aligned}$$
(2)
Fig. 4 Transformer encoder
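A minimal sketch of this sequence encoder using PyTorch's built-in Transformer modules is shown below; the backbone feature dimension, the learned positional embedding, and the maximum sequence length are illustrative assumptions, while d_model = 256, 8 heads, and 6 layers follow the training setup reported later.

```python
import torch
import torch.nn as nn

# Backbone features are projected to the model dimension, summed with positional
# encodings, and passed through a stack of 6 standard encoder layers.
feature_dim, d_model, max_len = 512, 256, 1024   # feature_dim and max_len are assumptions

proj = nn.Linear(feature_dim, d_model)
pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, dropout=0.1, batch_first=True),
    num_layers=6,
)

feats = torch.randn(2, 196, feature_dim)            # (B, N, feature_dim) from the backbone
memory = encoder(proj(feats) + pos_embed[:, :196])  # (B, N, d_model) sequential representation
```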

Similar to the encoder, the decoder is constructed by stacking 6 identical decoder layers. As shown in Fig. 5, each decoder layer consists of three sub-modules: masked multi-head self-attention, multi-head cross-attention, and a feed-forward network. The masked multi-head self-attention layer takes the sequence generated up to the previous time step as input and uses masking to prevent the model from seeing information beyond the current time step. The multi-head cross-attention layer uses the output of the previous sub-module as the query for attention over the encoder output. The features are then passed through the feed-forward network to obtain the output of the decoder layer. After 6 decoder layers, the output is sent to a linear layer and a softmax layer to obtain the output token for the current time step; the resulting sequence forms the model's prediction for the OCSR task.
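The following greedy-decoding sketch illustrates how such a decoder can generate a SMILES sequence token by token; the special token ids, vocabulary size, and maximum length are placeholder assumptions, and beam search or sampling variants are not shown.

```python
import torch
import torch.nn as nn

d_model, vocab_size, bos_id, eos_id, max_len = 256, 70, 1, 2, 100   # placeholders

embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
to_vocab = nn.Linear(d_model, vocab_size)

@torch.no_grad()
def greedy_decode(memory):                 # memory: (1, N, d_model) from the encoder
    ys = torch.tensor([[bos_id]])          # start with the begin-of-sequence token
    for _ in range(max_len):
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1))
        out = decoder(embed(ys), memory, tgt_mask=tgt_mask)
        next_tok = to_vocab(out[:, -1]).argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if next_tok.item() == eos_id:      # stop at the end-of-sequence token
            break
    return ys                              # predicted token ids, mapped back to SMILES afterwards
```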

Fig. 5 Transformer decoder

Dataset

Because directly extracting chemical structure images from the literature is challenging, and manual annotation is cumbersome and difficult to perform correctly at scale, we did not extract chemical structure images from the literature; instead, we used CDK [29] to generate the data. The molecular structures were downloaded from PubChem [30], and a dataset containing 2 million images was constructed from them. This dataset includes two types of data: one with Markush structures and the other with non-Markush structures.

The construction process was as follows. First, molecular structures with serial numbers 1–2.5 million, along with their SMILES representations, were downloaded from PubChem's FTP site, and CDK was used to obtain canonical SMILES. From the processed data, 1 million molecules were randomly selected and converted into Markush structures using RanDepict [31]; these form the first category of the dataset. Another 1 million of the remaining molecules were used to create the second category, containing non-Markush structures. Finally, CDK was used to generate the images corresponding to these molecules. During image generation, certain CDK parameters, including the font, superscript and subscript spacing, and subscript size, were modified so that the generated images more closely resemble those found in the literature. Figure 6 shows examples of both categories of molecules.

Fig. 6 Examples of each category

The process of generating Markush structures with RanDepict is as follows: read the input SMILES representation of a molecule and add explicit hydrogen atoms. Then, randomly replace between 1 and 4 carbon (C) or hydrogen (H) atoms with the characters R, X, Y, or Z, appending a random numeric index between 1 and 20 and one of the characters "a", "b", "c", "d", "e", or "f". After the random replacements, remove the explicit hydrogen atoms to obtain the final Markush structure. For example, consider the input SMILES "CC#CC(=C=C(C)C)C". Adding explicit hydrogen atoms gives "C([H])([H])([H])C#CC(=C=C(C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]". After random replacement, we obtain "C([H])([H])([H])C#CC(=C=[Y18d](C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]". Finally, removing the explicit hydrogens yields "CC#CC(=C=[Y18d](C)C)C".
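For illustration, the sketch below re-implements this substitution procedure with RDKit; the paper itself uses RanDepict, whose internal implementation may differ, and the helper name and the label-writing step via atom-map numbers are our own assumptions.

```python
import random
from rdkit import Chem

def to_markush(smiles, n_min=1, n_max=4):
    """Replace 1-4 C or H atoms with an R/X/Y/Z label carrying a random index (1-20)
    and a letter (a-f). Placeholder atoms are written back into the SMILES via
    atom-map numbers; Hs attached to a placeholder may remain explicit in this sketch."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    candidates = [a.GetIdx() for a in mol.GetAtoms() if a.GetSymbol() in ("C", "H")]
    n = random.randint(n_min, min(n_max, len(candidates)))
    rw = Chem.RWMol(mol)
    labels = {}
    for map_num, idx in enumerate(random.sample(candidates, n), start=1):
        dummy = Chem.Atom(0)               # "*" placeholder atom
        dummy.SetAtomMapNum(map_num)
        rw.ReplaceAtom(idx, dummy)
        labels[map_num] = f"{random.choice('RXYZ')}{random.randint(1, 20)}{random.choice('abcdef')}"
    smi = Chem.MolToSmiles(Chem.RemoveHs(rw.GetMol()))
    for map_num, label in labels.items():  # write the R-group labels into the string
        smi = smi.replace(f"[*:{map_num}]", f"[{label}]")
    return smi

print(to_markush("CC#CC(=C=C(C)C)C"))      # e.g. "CC#CC(=C=[Y18d](C)C)C"
```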

The training, validation, and test sets were obtained by partitioning the two categories of data as follows. For the test set, 100,000 molecules were selected from each category using the MaxMin algorithm, giving 200,000 molecules in total. The validation set was constructed by randomly selecting 100,000 molecules from the remaining molecules of each category, again yielding 200,000 molecules. The remaining 1.6 million molecules were used as the training set. The number of samples in each split is given in Table 1, and the distribution of SMILES lengths in the dataset is shown in Fig. 7.
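A hedged sketch of such diversity-based selection with RDKit's MaxMin picker is shown below; the fingerprint type and radius are illustrative assumptions rather than the exact settings used in this work.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

def maxmin_test_split(smiles_list, n_test):
    """Pick a maximally diverse test subset with the MaxMin algorithm; Morgan
    fingerprints (radius 2, 2048 bits) are an illustrative choice of descriptor."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    picker = MaxMinPicker()
    test_idx = list(picker.LazyBitVectorPick(fps, len(fps), n_test))
    rest_idx = [i for i in range(len(fps)) if i not in set(test_idx)]
    return test_idx, rest_idx   # remaining indices are then split into validation/training
```

In practice, the picker would be run on each category's full candidate pool to select its 100,000 test molecules.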

Table 1 Description of training, validation and test set
Fig. 7 Distribution of the lengths of SMILES

The SMILES representations of the dataset consist of a total of 67 different characters, which are treated as tokens. These characters are as follows:

C, 1, =, (, 2, ), O, N, 3, S, 4, [, R, ], H, Y, 7, Z, e, @, X, 8, 6, d, l, b, 0, a, 9, B, r, +, F, 5, ., -, /, f, #, P, \(\backslash \), c, M, n, I, I, T, o, g, t, s, K, A, V, u, G, h, W, L, U, m, E, D, y, %, p, k.
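A minimal sketch of character-level SMILES tokenization is given below; the special tokens and their ids are illustrative assumptions, not the exact vocabulary layout used by MPOCSR.

```python
# Character-level SMILES tokenization sketch; <pad>, <sos> and <eos> are assumed
# special tokens whose ids are not specified in the paper.
def build_vocab(smiles_list):
    specials = ["<pad>", "<sos>", "<eos>"]
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate(specials + chars)}

def encode(smiles, stoi):
    return [stoi["<sos>"]] + [stoi[ch] for ch in smiles] + [stoi["<eos>"]]

stoi = build_vocab(["CC#CC(=C=C(C)C)C", "CC#CC(=C=[Y18d](C)C)C"])
print(encode("CC#CC(=C=C(C)C)C", stoi))
```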

Experimental method and results

Evaluation metric

In this paper, the following evaluation metrics are used:

Accuracy: this metric is calculated by determining the proportion of SMILES predictions made by the model that are identical to the true SMILES representations. Accuracy is computed as follows:

$$\begin{aligned} Accuracy = \frac{S}{N} \end{aligned}$$
(3)

where S represents the number of SMILES that are predicted correctly and N represents the total number of SMILES.

Tanimoto similarity: it is a commonly used metric in the chemical domain to assess molecular similarity. This paper uses CDK for calculation.

Validity: The analysis of validity is performed using the SMILES parser of CDK. SMILES representations that can be successfully parsed by the SMILES parser are considered valid, while those that cannot be parsed are classified as invalid. Validity is calculated by determining the proportion of valid SMILES representations among all SMILES in the dataset.

BLEU and ROUGE: these are common evaluation metrics in the field of image description and are used to assess the similarity between the sequences predicted by the model and the true sequences.
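The sketch below illustrates how accuracy, validity, and Tanimoto similarity can be computed for a batch of predictions; note that this work computes validity and Tanimoto similarity with CDK, whereas RDKit Morgan fingerprints are used here purely for illustration.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def evaluate(pred_smiles, true_smiles):
    """Exact-match accuracy, validity and mean Tanimoto similarity for a list of
    predictions. Morgan fingerprints stand in for the CDK fingerprints used in the paper."""
    exact, valid, tanimotos = 0, 0, []
    for p, t in zip(pred_smiles, true_smiles):
        if p == t:
            exact += 1
        mp, mt = Chem.MolFromSmiles(p), Chem.MolFromSmiles(t)
        if mp is not None:                  # prediction parses, i.e. it is valid
            valid += 1
            if mt is not None:
                fp_p = AllChem.GetMorganFingerprintAsBitVect(mp, 2, 2048)
                fp_t = AllChem.GetMorganFingerprintAsBitVect(mt, 2, 2048)
                tanimotos.append(DataStructs.TanimotoSimilarity(fp_p, fp_t))
    n = len(true_smiles)
    return {"accuracy": exact / n, "validity": valid / n,
            "tanimoto": sum(tanimotos) / max(len(tanimotos), 1)}
```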

Table 2 Performance comparisons of methods in the test set

Training setup

The batch size is set to 128 during training. The Transformer encoder and decoder each have 6 stacked layers, the token embedding dimension is 256, and the number of attention heads is 8. We use the AdamW [32] optimizer with an initial learning rate of 5e−4 and a cosine learning rate scheduler. The dropout rate is set to 0.1. Training is performed on Ubuntu 20.04 with an A100 GPU for a total of 30 epochs.
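For reference, a minimal sketch of this optimization setup in PyTorch is shown below; the model placeholder, the absence of warm-up, and the per-epoch scheduler stepping are assumptions, not the exact training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(256, 70)                    # stand-in for the full MPOCSR model
optimizer = AdamW(model.parameters(), lr=5e-4)      # initial learning rate 5e-4
scheduler = CosineAnnealingLR(optimizer, T_max=30)  # cosine decay over 30 epochs

for epoch in range(30):
    # ... one training epoch with batch size 128; dropout of 0.1 is applied inside the model ...
    scheduler.step()
```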

Results

Method comparison experiment

On the test set, we compared the proposed MPOCSR model with other deep learning-based models, including Image2SMILES, DECIMER 1.0, DECIMER-V2, and SwinOCSR. All methods were trained on the dataset proposed in this paper, and the results are shown in Table 2. MolScribe was not included in the comparison: its training requires a dataset with atomic coordinate information, and its image data are generated dynamically during training, so we could not obtain the specific dataset images needed to train MPOCSR under the same conditions.

MPOCSR achieved an accuracy of 90.59% on the test set. Compared to Image2SMILES, DECIMER1.0, DECIMER-V2, and SwinOCSR, the MPOCSR model outperformed them in accuracy by 4.43%, 2.62%, 1.01%, and 0.19%, respectively. Furthermore, MPOCSR also demonstrated the best performance in the BLEU and ROUGE similarity evaluation metrics. Although MPOCSR scored slightly lower than SwinOCSR in the Tanimoto and Validity metrics, it excelled with a higher accuracy score. The primary focus of our evaluation is accuracy, which demands the predicted SMILES to be identical to the ground truth, as it better reflects the core performance of the model in chemical structure recognition. Considering all the evaluation metrics, MPOCSR was the model that achieved the best performance on the test set.

In terms of model architecture, Image2SMILES, DECIMER 1.0, and DECIMER-V2 use convolutional neural networks to extract image features and then exploit Transformers for sequence generation, while SwinOCSR uses the Swin Transformer for image feature extraction and likewise employs a Transformer for decoding. In contrast, MPOCSR employs a multi-path structure that combines features from different scales before Transformer decoding. The experimental results indicate that, by capturing features at various scales, MPOCSR provides richer visual information for the subsequent decoding process, enabling the model to generate more accurate SMILES sequences.

Loss function comparison experiment

The distribution of tokens in the dataset is shown in Fig. 8. The tokens follow a long-tail distribution: a minority of tokens, such as "C", "=", "(", and ")", constitutes the head of the distribution, while the majority of tokens make up the tail. This distribution leads the model to predominantly predict the few head tokens. To address this issue, we employed the CB loss for model training. CB loss weights the loss according to the effective number of samples of each category, resulting in a class-balanced loss function. CB loss is computed as follows:

$$\begin{aligned} CB\ loss = -\frac{1-\beta }{1-\beta ^{n_{y}}}\sum _{i=1}^{C}(1-p_{i})^{\gamma }\log {p_{i}} \end{aligned}$$
(4)
Table 3 Results on the test set of MPOCSR trained using different loss functions

where C represents the total number of categories, \(\beta \) is a hyperparameter for balancing the categories, \(n_y\) is the number of ground-truth samples of class y, and \(\gamma \) is a hyperparameter that adjusts the weighting factor. \(p_i\) denotes the model's predicted probability for the i-th category, calculated as follows:

$$\begin{aligned} p_{i} = {\left\{ \begin{array}{ll} \sigma (o_{i}), &{} y_{i}=1 \\ 1-\sigma (o_{i}), &{} otherwise \end{array}\right. } \end{aligned}$$
(5)

where \(y_i\) represents the ground truth label for the i-th category, \(o_i\) represents the model’s output for the i-th category, and \(\sigma (\cdot )\) is the sigmoid function.
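A minimal PyTorch sketch of Eqs. (4)–(5) is given below; the per-class weight normalization, the default values of \(\beta \) and \(\gamma \), and the treatment of labels as class indices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cb_focal_loss(logits, labels, samples_per_class, beta=0.9999, gamma=2.0):
    """Class-balanced focal loss following Eqs. (4)-(5).
    logits: (B, C) model outputs; labels: (B,) class indices;
    samples_per_class: (C,) tensor of token counts n_y."""
    num_classes = logits.size(1)
    one_hot = F.one_hot(labels, num_classes).float()
    # effective-number weight (1 - beta) / (1 - beta^{n_y}) of each class
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, samples_per_class.float()))
    weights = weights / weights.sum() * num_classes          # normalisation (a common convention)
    weights = (one_hot * weights).sum(dim=1, keepdim=True)   # (B, 1): weight of each sample's class
    # focal modulation on sigmoid probabilities p_i, Eq. (5)
    p = torch.sigmoid(logits)
    p_t = one_hot * p + (1.0 - one_hot) * (1.0 - p)
    bce = F.binary_cross_entropy_with_logits(logits, one_hot, reduction="none")  # = -log p_t
    loss = weights * (1.0 - p_t) ** gamma * bce
    return loss.sum(dim=1).mean()
```

Here samples_per_class would be obtained by counting the frequency of each token in the training SMILES.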

Fig. 8 Frequency distribution of tokens in the dataset

To analyze the impact of different loss functions on model performance, we trained the model using BCE loss, Focal loss, and CB loss, and evaluated each on the test set. As shown in Table 3, compared to the model trained with BCE loss, the models trained with Focal loss and CB loss improved accuracy by 0.10% and 0.36%, respectively, and also improved the similarity metrics Tanimoto, BLEU, and ROUGE. These results indicate that the long-tail token distribution in the dataset affected model performance during training, and that Focal loss and CB loss mitigated this effect by assigning different weights to different tokens. Since the model trained with CB loss performed best across all metrics, all subsequent experiments in this study use MPOCSR trained with CB loss.

Table 4 Results of MPOCSR on different categories of data

Influence of molecular category

In order to analyze the performance of MPOCSR on different categories of molecules, we conducted separate evaluations on the two types of data present in the test set. As shown in Table 4, for non-Markush structure molecules, MPOCSR achieved an accuracy of 93.05%, while the accuracy was slightly lower at 88.84% for Markush structure molecules. Markush structures lack standardization, and the functional groups within them are relatively harder to recognize, which contributed to the decreased performance on Markush structure molecules. However, in terms of Tanimoto, BLEU, and ROUGE evaluation metrics, MPOCSR exhibited excellent performance on both types of data. This suggests that while the model may not always predict all the details of Markush structures correctly, it still successfully generates molecules that resemble the original structures very closely. These findings demonstrate that MPOCSR maintains strong performance on both classes of data.

Influence of SMILES length

In order to analyze the performance of MPOCSR on SMILES sequences of different lengths, the test set was divided into segments based on SMILES lengths, specifically in the ranges of 1–25, 26–50, 51–75, and 76–100. Table 5 displays the results of MPOCSR on SMILES sequences of different lengths. The experimental results show that predictions by MPOCSR for sequences with lengths of 1–25 were slightly less accurate, primarily due to the limited representation of such lengths in the dataset, making it challenging for the model to learn their features effectively. However, from the results on sequences of other lengths, it is evident that as the sequence length increases, the accuracy of the model in predictions gradually decreases. This phenomenon is attributed to the increased variability and complexity of longer SMILES sequences. Nevertheless, metrics like Tanimoto, BLEU and ROUGE indicate that, even with longer sequences, the predicted SMILES sequences maintain a high degree of similarity to the true SMILES sequences, suggesting that MPOCSR can successfully capture the key features of chemical structures across different length ranges, resulting in a high degree of structural preservation and similarity.

Table 5 Results of MPOCSR with different SMILES lengths
Fig. 9 Results generated by GPT-4

Fig. 10 Results generated by LLaMA

Discussion

GPT-4 [33], LLaMA [34], PaLM [35], and similar large-scale models have made significant advances in various domains, including text generation, machine translation, and image captioning. On a growing number of tasks, direct testing or fine-tuning of these large models has been shown to achieve promising results.

However, many of these large models are accessible only through APIs, without openly shared code or training details. Consequently, users may encounter challenges related to model fidelity and uncertainty of results.

For instance, Figs. 9 and 10 show the results of using GPT-4 and LLaMA to generate SMILES for given images. The generated results are often inconsistent with the contents of the images, reflecting a misalignment between the model's predictions and the input data.

Given the universality, adaptability and convenience of large models, a promising future avenue involves integrating these models with specialized models that possess domain-specific expertise in the field of OCSR. This collaborative approach aims to combine the strengths of both large models and specialized models to address the challenges in OCSR effectively.

Conclusion

This paper introduces a multi-scale, multi-level feature-based OCSR method, which leverages MPViT as the backbone, uses a Transformer to predict SMILES sequences, and employs CB loss to mitigate the impact of class imbalance. Experiments show that MPOCSR effectively captures critical features of chemical structures, maintains good recognition performance on Markush structures, and delivers strong performance in predicting long sequences.

Although our method has achieved promising results, some limitations remain for future improvement. Theoretically, the main limitation lies in the "black box" nature of deep learning models, which restricts a comprehensive understanding of how the model learns and processes chemical structures. To enhance interpretability, we plan to explore visualization of the attention mechanisms, which can highlight the specific regions the model focuses on and thereby provide deeper insight into its decision-making process. Practically, the performance of MPOCSR depends heavily on the quality and diversity of the training data, and its adaptability to unseen data needs further improvement. To address this limitation, we intend to incorporate a wider range of chemical structure images and to investigate image augmentation techniques, thereby enhancing the model's generalization capability. Furthermore, in the field of OCSR, model performance evaluation relies primarily on accuracy, and statistical tests demonstrating the significance of results are lacking; to establish the significance of our findings, we will consider statistical tests such as the Friedman test. Finally, given the rapid development of large pre-trained models such as GPT-4, our future research will explore opportunities to integrate these models, offering a more comprehensive platform that includes chemical structure conversion, analysis, and other functions.