1 Introduction

Gastric cancer poses a significant health concern, with approximately 990,000 individuals diagnosed worldwide each year and approximately 738,000 deaths [1]. It ranks as the fifth most prevalent form of cancer and the fourth leading cause of cancer-related mortality [2]. The etiology of gastric cancer encompasses a range of risk factors, including both environmental and genetic influences, with Helicobacter pylori infection accounting for approximately 90% of distal gastric cancer cases [3,4,5]. Additionally, excessive body fat, tobacco use, and the consumption of salted and processed meats have been identified as other notable risk factors [6].

The identification of gastric cancer encompasses various diagnostic methods, including endoscopy, blood tests, and imaging techniques [7]. While endoscopy is widely regarded as the preferred diagnostic tool [8], it is also the most invasive and carries the risk of complications such as bleeding or perforation. Blood tests are employed to detect specific proteins generated by cells within the body, including cancer cells. These proteins, known as tumor markers, are typically utilized in conjunction with other diagnostic approaches. The commonly employed tumor markers for gastric cancer include carcinoembryonic antigen (CEA) and carbohydrate antigens such as CA19-9 and CA72-4 [9, 10]. Nevertheless, these markers exhibit limitations, notably a lack of specificity in identifying precancerous lesions associated with gastric cancer, such as atrophic gastritis, intestinal metaplasia, and dysplasia. In addition to tumor markers, imaging modalities such as X-ray imaging, computed tomography (CT), and magnetic resonance imaging (MRI) can also be utilized for the detection of gastric cancer [11, 12]. These technologies can offer insights into tumor size and location, yet their accuracy falls short of that of endoscopy. Consequently, international research teams are actively seeking novel diagnostic approaches, among which attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy emerges as a potentially valuable option due to its high sensitivity, non-destructive nature, and relatively affordable cost [13, 14].

ATR-FTIR spectroscopy has emerged as a promising technique for cancer diagnostic testing, enabling the measurement of the fingerprint spectra of compounds at the molecular level [15]. This technique capitalizes on the significant changes or conformational alterations in the bonding and vibrational patterns of functional groups within biomolecules, such as nucleic acids, proteins, lipids, and carbohydrates, that occur during cancer development. Consequently, ATR-FTIR spectroscopy can effectively differentiate between normal and tumor tissues [16]. Gonul et al. identified significant disparities between tissue samples from cancer patients and those from healthy individuals using ATR-FTIR spectroscopy. They further demonstrated that spectral parameters, including lipid/protein concentration and nucleic acid/protein concentration, could serve as biomarkers for the early diagnosis of cancer [17]. Additionally, Maiti et al. employed ultra-wideband mid-infrared Fourier absorption spectroscopy to monitor the respiration of volunteers, achieving an accuracy exceeding 95% across four spectral ranges in differentiating healthy individuals from prostate cancer patients [18]. However, the samples utilized in these studies exhibit certain limitations. Gastric tissue samples are challenging to procure and pose potential harm to patients. Exhalation samples are intricate to analyze and vulnerable to environmental interferences. In contrast, serum samples offer several advantages: they are straightforward to collect and store, and they exhibit reduced susceptibility to environmental disturbances. Furthermore, serum comprises water, organic compounds, and inorganic salts, making it a reliable medium for reflecting the physiological and pathological states of the human body. Sheng et al. demonstrated that the RNA/DNA ratio in the serum of gastric cancer patients (n = 27) was significantly lower compared to that of healthy individuals (n = 19). 
Furthermore, the H₂₉₅₉/H₂₉₃₁ ratio was found to be an effective biomarker for distinguishing between the serum of gastric cancer patients and that of healthy controls [19]. Recently, Guo et al. utilized ATR-FTIR spectroscopy to analyze blood samples from 68 gastric cancer patients, 73 colorectal cancer patients, 25 liver cancer patients, and 44 healthy individuals. They employed machine learning classification techniques and, by analyzing two-dimensional second-derivative infrared spectra, found that the back-propagation algorithm had the highest accuracy, exceeding 95% [20]. These findings indicate that ATR-FTIR spectroscopy holds substantial promise for the diagnosis of gastric cancer. Future research should focus on increasing sample sizes and incorporating a broader range of machine learning methodologies to enhance diagnostic accuracy.

The present study demonstrates the feasibility of utilizing serum-based ATR-FTIR spectroscopy for the detection of gastric cancer. Employing principal component analysis (PCA) and machine learning algorithms, we analyzed 192 human serum samples to differentiate between gastric cancer patients and healthy controls. This approach offers a viable and minimally invasive method for the clinical screening and diagnosis of gastric cancer.

2 Materials and methods

2.1 Ethics statement

This study was reviewed and approved by the ethics committee of the Chongming Hospital Affiliated to Shanghai University of Medicine and Health Sciences (Approval number: CMEC-2024-KT-01). Written informed consent was obtained from each patient in accordance with the Declaration of Helsinki.

2.2 Sample and sample preparation

The serum samples from 96 patients diagnosed with gastric cancer and 96 healthy blood donors were provided by the Laboratory Department of Chongming Hospital Affiliated to Shanghai University of Medicine and Health Sciences. A 1 mL sample of fasting peripheral venous blood was collected from each gastric cancer patient and each healthy donor (serving as the control group) into centrifuge tubes using non-anticoagulant blood collection tubes. The blood samples were allowed to stand at room temperature for 30 min to facilitate natural coagulation via endogenous coagulation factors. Subsequently, the samples were centrifuged at 3000 rpm for 10 min to separate the serum. The isolated serum was then transferred to cryotubes and stored at −80 °C. The serum samples were thawed immediately prior to analysis. The mean age of the gastric cancer patients was 68.83 years, ranging from 36 to 90 years. Among them, there were 56 males and 40 females. Inclusion criteria were as follows: (1) All patients had received initial treatment; (2) Complete clinical data were available; (3) Gastric cancer was confirmed in all patients through surgical intervention or pathological biopsy; (4) Absence of other tumors within the digestive system. Exclusion criteria included: (1) A history of radiotherapy or chemotherapy; (2) Presence of other primary malignant tumors. The control group comprised 96 healthy individuals who underwent physical examinations at our hospital during the same period and exhibited normal values in all blood tests. The average age of the healthy blood donors was 53.54 years, with the youngest participant being 27 years old and the oldest being 90 years old. Among them, there were 54 males and 42 females. All participants in this study were given comprehensive information and signed informed consent.

2.3 Spectral irradiation analysis

The absorbance mode of the ATR-FTIR spectrometer (Bruker IFS 66v), located at the National Synchrotron Radiation Laboratory in Hefei, China, was utilized to investigate the discernible spectral differences in serum samples collected from individuals suffering from gastric cancer compared to those who were not affected by the disease. A volume of 10 μL of serum samples was evenly spread onto a barium fluoride crystal (the ATR crystal) and subsequently air-dried for a duration of 10 min at room temperature. The spectral recording resolution for each sample was configured to be 4 cm−1, with an imaging spot size of 100 μm × 100 μm. A total of 128 repeated accumulation scans were conducted for spectral acquisition. The OPUS software was employed to process both the spectral measurements and imaging data.

2.4 Spectral preprocessing and data analysis

The OPUS software was employed to conduct fundamental calibration and normalization procedures on all spectra. The rubber band method was utilized for baseline correction, while the vector method was employed for normalization. Data analysis was carried out using RStudio 4.4.0 software. Average spectra and standard deviations were computed separately for the cancer samples and the control samples. The absorbance of each band was quantified as the area under the peak curve: the infrared spectral curve was converted into a discrete set of data points, these points were segmented according to specified wavenumber ranges, and the area of each segment was calculated. To emphasize the spectral bands, mitigate background noise, and uncover the biochemical attributes of individual serum samples from gastric cancer patients and healthy donors, the second derivative of each spectrum was computed. The second derivative, a common mathematical operation applied to infrared spectra, is the derivative of the first derivative; it represents the rate of change of the slope and reflects the concavity and convexity of the curve. It helps resolve overlapping components within broad absorption bands, because each absorption band appears as a distinct and more pronounced minimum in the second derivative, which makes changes in the spectrum easier to detect. Following the pre-processing of the spectra and the computation of their second derivatives, the spectral data were subjected to PCA. 
PCA, an unsupervised learning algorithm, was employed to discern patterns and relationships within the data without direct human intervention, and to determine the directions exhibiting the highest variance [21]. The score plot of the principal components can be utilized to identify patterns or clusters present in the dataset.

2.5 Classification models and ROC analysis

The k-Nearest Neighbors (kNN), Gaussian Process Regression (GPR), and Support Vector Machine (SVM) with a linear kernel were utilized as classification methods to distinguish the serum of gastric cancer patients from that of healthy donors. Additionally, Receiver Operating Characteristic (ROC) analysis was conducted to evaluate the diagnostic performance across various probability/diagnostic thresholds. For each classification technique, the initial dataset was randomly partitioned into a training set comprising 70 cases of gastric cancer and 70 healthy donors, and a held-out test set consisting of 26 cases of gastric cancer and 26 healthy donors. Five-fold cross-validation was applied to the training set to optimize each model. Subsequently, the optimized model was retrained using the entire training set and employed to predict outcomes on the test set, facilitating the extraction of sensitivity, specificity, accuracy, and area under the curve (AUC) metrics.
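The partitioning and cross-validation workflow described above can be illustrated with a small kNN sketch. Everything here is an assumption made for illustration: the data are synthetic three-feature vectors rather than serum spectra, the candidate values of k and the random seed are arbitrary, and the helper names are hypothetical. Only the 70/26 per-class split and the five folds follow the text.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(row, x)), label)
                   for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

def cross_val_accuracy(X, y, k, n_folds=5):
    """Mean accuracy over n_folds interleaved folds of the training set."""
    scores = []
    for fold in range(n_folds):
        held = set(range(fold, len(X), n_folds))
        tr = [i for i in range(len(X)) if i not in held]
        te = sorted(held)
        scores.append(accuracy([X[i] for i in tr], [y[i] for i in tr],
                               [X[i] for i in te], [y[i] for i in te], k))
    return sum(scores) / n_folds

rng = random.Random(42)

def make_group(shift, n=96):
    """Synthetic stand-in for one group: n samples with 3 Gaussian features."""
    return [[rng.gauss(shift, 1.0) for _ in range(3)] for _ in range(n)]

cancer, healthy = make_group(+1.0), make_group(-1.0)
rng.shuffle(cancer)
rng.shuffle(healthy)

# 70 per class for training, 26 per class held out (label 1 = cancer, 0 = healthy).
train = [(x, 1) for x in cancer[:70]] + [(x, 0) for x in healthy[:70]]
test = [(x, 1) for x in cancer[70:]] + [(x, 0) for x in healthy[70:]]
rng.shuffle(train)
train_X, train_y = [x for x, _ in train], [y for _, y in train]
test_X, test_y = [x for x, _ in test], [y for _, y in test]

# Choose k by five-fold cross-validation on the training set only.
candidates = [1, 3, 5, 7, 9]
best_k = max(candidates, key=lambda k: cross_val_accuracy(train_X, train_y, k))

# kNN has no fitting step, so "retraining" means using all 140 training samples.
test_accuracy = accuracy(train_X, train_y, test_X, test_y, best_k)
print(best_k, round(test_accuracy, 3))
```

Keeping the 52 held-out samples out of the cross-validation loop is the point of the design: the choice of k never sees the data used for the final estimate.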

2.6 Statistical analysis

Data were expressed as mean ± standard deviation. Differences between the values of two groups were assessed using the unpaired, two-tailed Student’s t-test. P < 0.05 was considered significant.
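For reference, the standard pooled-variance form of the unpaired Student's t statistic is given below. This is the textbook definition, shown on the assumption that the equal-variance form was used; the symbols (group means, group standard deviations s_1 and s_2, group sizes n_1 and n_2, and degrees of freedom ν) are introduced here for illustration and do not appear in the original text.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},
\qquad
\nu = n_1 + n_2 - 2
```

If a comparison includes all 96 samples in each group, this gives ν = 190 degrees of freedom.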

3 Results

3.1 Mean spectra and band distribution

ATR-FTIR spectroscopy was used to examine spectral differences between serum samples from patients with gastric cancer and those from healthy individuals. The functional group vibration modes responsible for the infrared absorption peaks were identified and are presented in Table 1. Figure 1A illustrates the mean spectral profiles of 96 gastric cancer patients and 96 healthy individuals, along with the mean differential spectrum between the two cohorts. Figure 1B depicts the average spectra and corresponding standard deviations for each group. The regions of 1700 ~ 1500 cm−1 and 3500 ~ 2910 cm−1 exhibit the most notable disparities between the gastric cancer group and the healthy group, as illustrated in Fig. 1A. Applying the second derivative to the expanded spectral regions of the original serum absorption spectra reveals further differences between the two cohorts (Fig. 2). In contrast to healthy individuals, patients with gastric cancer demonstrate a displacement of the peak position towards shorter wavelengths (higher wavenumbers) and a reduction in absorbance within the 3500 ~ 3000 cm−1 range (Tables 2, 3), which corresponds to the stretching and bending vibrations of O–H and N–H bonds, predominantly representing proteins or water [22]. Moreover, a displacement towards higher frequencies, known as a blue shift, is detected in the amide I region (1700 ~ 1600 cm−1) among individuals with gastric cancer (Table 2), specifically in the vicinity of 1655 cm−1. This band arises predominantly from the peptide backbone of proteins and is sensitive to alterations in protein secondary structure. The underlying vibrations encompass the stretching of C=O and C–N bonds as well as the bending of N–H bonds [23, 24]. 
The results of this study demonstrate significant changes in protein structure and relative concentrations in the serum of individuals with gastric cancer. In addition, a decrease in absorbance is observed in the range of 2990 ~ 2910 cm−1 (Table 3), indicating alterations in the antisymmetric stretching vibrations of CH2 and CH3, which are representative of lipids [25]. This suggests a relative decrease in lipid concentrations in the serum of gastric cancer patients. Furthermore, an increase in absorbance and a blue shift in peak position are observed in the range of 1090 ~ 1070 cm−1 (Tables 2, 3), corresponding to the absorption bands of C=O and P=O, which are associated with nucleic acids [14, 26]. This finding suggests a notable rise in nucleic acid concentrations and potential modifications in nucleic acid structure within the serum of individuals diagnosed with gastric cancer. In summary, patients with gastric cancer demonstrate significant variations in the concentrations of proteins, lipids, and nucleic acids in their serum, alongside structural modifications in the functional groups of proteins and nucleic acids, when compared to healthy individuals.

Table 1 Band frequencies assigned in the IR spectra
Fig. 1

A. Average serum spectra of the healthy donor group (red) and the gastric cancer patient group (blue), together with the differential spectrum between the two groups (black). B. Average spectra and corresponding standard deviations for the healthy donor group (red) and the gastric cancer group (blue)

Fig. 2

Second derivative of the average serum FTIR spectra of the healthy group (red) and the gastric cancer group (blue), together with the difference between the two group averages (black)

Table 2 Spectral peak position analysis of gastric cancer
Table 3 Spectral peak absorbance analysis of gastric cancer

3.2 PCA results

To examine the contributions of proteins and nucleic acids to differentiating individuals with gastric cancer from healthy individuals, a PCA model was constructed to analyze the infrared spectra of serum within specific wavenumber ranges. These ranges, 1090 ~ 1070 cm−1, 1700 ~ 1600 cm−1, and 3500 ~ 3000 cm−1, were chosen because they showed significant changes in both absorbance and peak position. PCA is a widely employed multivariate statistical technique for reducing the dimensionality of data and examining the underlying structure of a dataset. It aids in identifying the primary sources of variation within the data and extracting the most representative characteristics. The PCA score plot in the PC1-PC2-PC3 space differentiated the two serum groups. The PC2 and PC3 axes showed a discernible level of segregation into two groups: healthy donors clustered in the region of negative PC2 and PC3 values, whereas patients clustered in the region of positive PC2 and PC3 values. However, PC1 did not make a significant contribution to the K-means clustering analysis (Fig. 3). These findings suggest that the primary spectral disparities between the gastric cancer group and the healthy group are associated with differences in protein conformation and nucleic acids within serum samples.
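A component other than PC1 can carry the group separation because PCA ranks directions by total variance, not by how well they discriminate between classes. The sketch below illustrates this with a small standard-library PCA (power iteration with deflation). It is an illustration under stated assumptions and not the analysis performed in this study: the four-feature data set is synthetic and deliberately built so that the largest-variance feature is unrelated to the group label, and all function names are hypothetical.

```python
import math
import random

def center(X):
    """Subtract the per-feature mean from every row."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(C, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def dominant_eigenpair(C, iters=500):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    v = [1.0 / math.sqrt(len(C))] * len(C)
    for _ in range(iters):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, sum(a * b for a, b in zip(v, matvec(C, v)))

def pca(X, n_components):
    """Return (scores, components, variances) using power iteration with deflation."""
    Xc = center(X)
    C = covariance(Xc)
    components, variances = [], []
    for _ in range(n_components):
        v, lam = dominant_eigenpair(C)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the direction just found before searching for the next.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    scores = [[sum(a * b for a, b in zip(row, v)) for v in components] for row in Xc]
    return scores, components, variances

# Synthetic data: feature 0 has the largest variance but no group information;
# features 1 and 2 are shifted in opposite directions for the two groups.
rng = random.Random(0)
labels = [1] * 96 + [0] * 96
X = []
for label in labels:
    s = 1.0 if label == 1 else -1.0
    X.append([rng.gauss(0.0, 3.0), rng.gauss(1.5 * s, 1.0),
              rng.gauss(1.0 * s, 0.7), rng.gauss(0.0, 0.3)])

scores, components, variances = pca(X, n_components=3)

def group_gap(pc):
    """Absolute difference between the two group means along one component."""
    pos = [s[pc] for s, l in zip(scores, labels) if l == 1]
    neg = [s[pc] for s, l in zip(scores, labels) if l == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

print([round(v, 2) for v in variances], round(group_gap(0), 2), round(group_gap(1), 2))
```

In this constructed example the gap between group means is larger along the second component than along the first, even though the first explains more variance.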

Fig. 3

Scatter plot of the PCA scores for the spectral regions of 3500 ~ 3000 cm−1, 1700 ~ 1600 cm−1, and 1090 ~ 1070 cm−1

3.3 Classification models and ROC results

Figure 4 presents the ROC curves for the SVM, GPR, and kNN models. Among these, the kNN model exhibited superior performance, accurately distinguishing 96 cases of gastric cancer from 96 healthy control samples. The balanced sensitivity and specificity for kNN were 89.7% and 87.2%, respectively, with an overall accuracy of 86.5% and an AUC of 0.901, indicating excellent diagnostic capability. In comparison, the SVM model achieved a sensitivity of 84.5%, a specificity of 84.7%, and an accuracy of 84.7%, while the GPR model demonstrated a sensitivity of 82.9%, a specificity of 83.2%, and an accuracy of 83.2% (Table 4). The kNN classification method demonstrated a tendency towards marginally higher sensitivity compared to specificity, whereas SVM and GPR exhibited a slight predominance of specificity over sensitivity.
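The metrics reported above follow from standard definitions, sketched here for reference. The labels and scores in the example are invented toy values, not data from this study, and the function names and the 0.5 decision threshold are illustrative assumptions. The AUC is computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with label 1 = cancer and label 0 = healthy."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def diagnostic_metrics(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/N."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four positives followed by four negatives, with classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold of 0.5

sensitivity, specificity, acc = diagnostic_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(sensitivity, specificity, acc, auc)  # 0.75 0.75 0.75 0.8125
```

When the two classes are the same size, accuracy equals the mean of sensitivity and specificity, as in this toy example.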

Fig. 4

Receiver operating characteristic (ROC) analysis of the SVM, kNN, and GPR models (AUC, area under the ROC curve)

Table 4 ROC analysis of SVM, kNN and GPR model

4 Discussion

Gastric cancer, being a prevalent malignant neoplasm, presents a significant peril to human well-being, resulting in approximately 800,000 fatalities each year [27]. The incidence rate in males surpasses that in females by a factor of two, predominantly manifesting in individuals aged 60 and above [28]. Consequently, enhancing the prompt identification of gastric cancer stands as a pivotal objective within the realm of clinical medicine. In recent times, the utilization of serum for ATR-FTIR spectroscopy analysis has become a novel, rapid, and convenient cancer detection technology, showing great potential [29,30,31,32].

Serum is abundant in lipids, proteins, carbohydrates, and other biomolecules, which can effectively reflect physiological and pathological changes in the human body [33]. ATR-FTIR spectroscopy can characterize all classes of biomolecular species associated with changes in metabolic reaction products and enzyme activities in serum. This study demonstrates that there are significant differences in the biochemical composition of serum between gastric cancer patients and healthy donors. The spectral ranges of 3500 ~ 3000 cm−1, 1700 ~ 1600 cm−1, and 1090 ~ 1070 cm−1 are particularly significant for the diagnosis of gastric cancer. A reduction in absorbance within the 3500 ~ 3000 cm−1 range suggests alterations associated with the stretching and bending of O–H and N–H bonds, primarily indicating a decrease in the concentration of proteins or carbohydrates. Furthermore, a blue shift observed in the amide I region (1700 ~ 1600 cm−1) among gastric cancer patients signifies modifications in the structural configuration of protein functional groups. Conversely, the observed increase in absorbance and the blue shift of the peak position within the 1090 ~ 1070 cm−1 range correspond to the absorption bands of C=O and P=O associated with nucleic acids. This suggests a significant elevation in the concentration of nucleic acids in the serum of these patients, as well as potential alterations in the nucleic acid structure. Specifically, within the 1700 ~ 1600 cm−1 range, the findings of Guo et al. [20] also demonstrated that this spectral region can effectively differentiate the serum of gastric cancer patients from that of liver cancer patients, colorectal cancer patients, and healthy individuals.

Furthermore, PCA effectively differentiated serum samples from healthy individuals and gastric cancer patients. Additionally, three distinct machine learning classification algorithms were employed to discriminate between 96 cancer cases and 96 healthy control samples, yielding promising statistical metrics and underscoring the significant potential of this technology. The kNN model emerged as the most effective, demonstrating a sensitivity of 89.7%, a specificity of 87.2%, and an accuracy of 86.5% (Table 4). Moreover, the ROC analysis yielded an AUC value of 0.901 (Fig. 4), further validating the model's excellent diagnostic capability.

Although this proof-of-concept study demonstrates promising results, several research questions remain to be addressed. Firstly, the sample size in this study is relatively small, necessitating further investigations with larger cohorts to validate our findings. Additionally, future research should aim to explore the impact of spectral changes on the pathological staging of gastric cancer and pathogenic factors, such as Helicobacter pylori. Moreover, efforts should be directed towards developing more accurate and reliable diagnostic models based on ATR-FTIR spectroscopy.

Overall, our study introduces a novel approach for the detection of gastric cancer through the identification of biomarkers in serum samples utilizing ATR-FTIR spectroscopy. Further research is warranted to explore the potential applicability of this method to other types of cancer and to broaden its utility in oncological diagnostics.

5 Conclusion

The primary objective of basic clinical research on ATR-FTIR spectroscopy is to develop novel and enhanced techniques for diagnosing diseases that are presently challenging or costly to identify using established methodologies. The findings of this study demonstrate that combining ATR-FTIR spectroscopy with principal component analysis enables the differentiation of biomarkers (such as proteins, lipids, and nucleic acids) between healthy and gastric cancer samples. The spectra and spectral parameters of serum samples from individuals with gastric cancer differ significantly from those of healthy individuals. Specifically, the ranges of 3500 ~ 3000 cm−1, 1700 ~ 1600 cm−1, and 1090 ~ 1070 cm−1 were identified as the most pertinent for distinguishing between gastric cancer serum samples and healthy serum samples. These distinctions can be ascribed to alterations of nucleic acids and substantial alterations in protein conformation within the serum during carcinogenesis. The infrared spectroscopy-based diagnostic method using the kNN model yields a sensitivity of 89.7% and a specificity of 87.2%, providing a theoretical basis for subsequent comprehensive research and practical application.