Introduction

Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive interstitial lung disease. The etiology of IPF is unknown, and its high-resolution computed tomography (HRCT) or pathological manifestation is usual interstitial pneumonia (UIP) [1, 2]. In Europe and North America, the incidence of IPF is between 2.8 and 9.3 per 100,000 people, making it a rare disease. The epidemiological data about IPF are scarce in China, but its incidence has significantly increased in recent years [3, 4]. IPF progresses slowly at the early stage, and it will gradually cause diffuse fibrosis of the lungs, eventually leading to respiratory failure and death [5]. IPF has developed into a severe, potentially fatal condition as a result of a lack of early management and comprehensive understanding of the disease’s pathophysiology [6]. Patients with IPF continue to have a dismal prognosis, with a median survival of about three years [7]. It is critical to identify novel targets for the diagnosis and treatment of IPF to enhance the prognosis of affected patients.

IPF is an intricate, multifactorial disease that arises from the interplay between genetic and environmental factors. Genetic factors have been demonstrated to be crucial in the pathogenesis of IPF [8, 9]. An array of characteristic genes that serve as references for the clinical diagnosis of IPF have been linked to its occurrence and progression [10,11,12]. However, these genes remain inadequate for the early detection of IPF. At present, the diagnosis of IPF still depends on whether the HRCT or histological pattern of the lung is UIP, although genomics has provided some help in the diagnosis of IPF [1, 2]. Thus, further investigation is required to find novel approaches that can identify feature genes and establish diagnostic models.

Inflammation and fibrosis are both involved in the pathogenesis of IPF, a chronic lung disease. IPF arises mainly from an aberrant wound healing response following repetitive epithelial cell injury. Inflammatory cytokines released by immune cells may activate fibroblasts and connective tissue cell proliferation [13]. Immune dysregulation is involved in the occurrence and development of IPF [14]. Evidence from animal models and human studies indicates that innate and adaptive immune mechanisms can orchestrate existing fibrotic responses [15].

Artificial intelligence and artificial neural networks (ANNs) have been progressively introduced into the medical field to assist physicians in managing vast volumes of data and implementing precision medicine more easily. An ANN is a type of computing model inspired by the human brain [16]. Learning and trial-and-error methods form the foundation of the ANN algorithm. The prognosis and prediction of tumors were the primary focus of earlier ANN research [17, 18]. Recently, one study constructed an ANN model that demonstrated robust performance across multiple cohorts, but the model was not analyzed from the perspective of immune infiltration [19].

Thus, our work aimed to develop an ANN model for IPF using candidate gene weights and to compare immune cell types between IPF and control groups. As a first step in this investigation, we gathered IPF microarray datasets from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) between tissues of patients with IPF and tissues of controls were screened and used for enrichment analyses and the construction of a protein-protein interaction (PPI) network. Afterwards, we identified the important feature genes associated with IPF using random forest (RF) analysis, and then constructed and validated a predictive ANN model. The predictive power of these crucial feature genes was assessed using receiver operating characteristic (ROC) curves. Furthermore, based on the gene expression profiles of the microarray datasets, cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT) analysis was used to quantify the proportions of immune cells.

Methods

Data acquisition

The series matrix files of GSE110147, GSE21369, and GSE24206 were acquired from the GEO database of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/geo/). GSE110147 is based on the GPL6244 platform (Affymetrix Human Gene 1.0 ST Array) [20]. Both GSE21369 and GSE24206 are based on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) [21, 22]. The GSE110147 dataset contained 11 samples of normal lung tissue obtained from tissue flanking lung cancer resections and 22 samples collected from the lungs of patients with IPF (Supplementary File 1A). The GSE21369 dataset comprised 11 samples from patients who had been diagnosed with IPF and six normal samples serving as controls (Supplementary File 1B). The GSE24206 dataset comprised six control specimens retrieved from healthy donor lungs and 17 samples from patients with IPF (Supplementary File 1C).

Probe annotation files were used to convert the probes in each dataset into gene symbols. Where multiple probes mapped to the same gene symbol, the probe with the highest expression level was used as the gene expression value.
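The probe-to-gene step can be illustrated with a short Python sketch. It is not the authors' code (their analysis was run in R); the probe IDs, the toy values, and the reading of "highest expression level" as the highest mean across samples are assumptions made for illustration.

```python
from statistics import mean

def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene symbol, the probe with the highest mean expression.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol (hypothetical annotation).
    """
    best = {}  # gene symbol -> (mean expression, probe ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:  # probes without a gene symbol are dropped
            continue
        score = mean(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}

# Toy data: two probes map to COMP, one to CA4, and one is unannotated.
expr = {
    "p1": [5.0, 6.0, 7.0],
    "p2": [8.0, 9.0, 10.0],
    "p3": [3.0, 3.5, 4.0],
    "p4": [1.0, 1.0, 1.0],
}
probe_to_gene = {"p1": "COMP", "p2": "COMP", "p3": "CA4"}
gene_expr = collapse_probes(expr, probe_to_gene)
```

In this toy example, COMP is represented by probe p2 because its mean expression is higher than that of p1.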

For integrated analysis, the matrix files of the three datasets were merged into a single merged dataset cohort, owing to their shared platform and the importance of incorporating a large sample size from various datasets. After merging, the ComBat function of the “sva” package was used to remove batch effects (Supplementary File 1D).

The GSE32537 dataset, which is based on the GPL6244 platform (Affymetrix Human Gene 1.0 ST Array), was used as the testing cohort (Supplementary File 1E) [23]. It included lung tissue samples from 50 healthy controls and 119 patients with IPF.

Screening DEGs in dataset between IPF and control samples

The “linear models for microarray data (limma)” package was used to normalize the expression data and identify DEGs [24]. The DEG thresholds were |log2 fold change (FC)| > 2 between the IPF and control samples and an adjusted (adj) P value < 0.05. Volcano plots and heatmaps were drawn with the “ggplot2” and “pheatmap” packages in R.
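The thresholding rule can be written as a small classification function. The Python sketch below is illustrative only: the gene names and values are hypothetical, and it assumes that log2FC and adjusted P values have already been computed (by limma in the original analysis).

```python
def classify_deg(log2fc, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Label a gene as 'Up', 'Down', or 'Not' using the stated thresholds."""
    if adj_p < p_cutoff and log2fc > fc_cutoff:
        return "Up"
    if adj_p < p_cutoff and log2fc < -fc_cutoff:
        return "Down"
    return "Not"

# Hypothetical (log2FC, adjusted P) pairs, for illustration only.
results = {
    "GENE_A": (2.8, 1e-6),   # large increase, significant
    "GENE_B": (-2.3, 3e-4),  # large decrease, significant
    "GENE_C": (2.5, 0.20),   # large increase, not significant
    "GENE_D": (1.1, 1e-8),   # significant, but below the fold-change cutoff
}
labels = {gene: classify_deg(fc, p) for gene, (fc, p) in results.items()}
```

Both conditions must hold at once, so a gene with a very small adjusted P value but a modest fold change (GENE_D) is not counted as a DEG.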

Enrichment analyses of DEGs

Using Metascape (http://metascape.org/), we performed various bioinformatics analyses to gain further biological insight into the DEGs [25]. Gene list enrichment was assessed in the ontology categories DisGeNET, Pattern Gene Database (PaGenBase), and Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining (TRRUST). DisGeNET (https://www.disgenet.org/) is a discovery platform that houses one of the largest publicly available collections of genes and variants linked to human diseases [26]. PaGenBase (https://bioinf.xmu.edu.cn/PaGenBase/) is a free database containing information on the pattern genes of eleven model organisms, identified from serial gene expression profiles under various physiological conditions [27]. TRRUST (https://www.grnpedia.org/trrust/) is a manually curated database of transcriptional regulatory networks in humans and mice [28]. All genes in the genome were used as the enrichment background. Terms with a P value < 0.01, a minimum count of 3, and an enrichment factor (the ratio between the observed counts and the counts expected by chance) > 1.5 were gathered and clustered based on their membership similarities.
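The enrichment factor and the three filtering criteria can be made concrete with a short Python sketch. The counts below are hypothetical, and the use of a hypergeometric tail probability for the P value is an assumption made for illustration; the text above does not state which test was applied.

```python
from math import comb

def enrichment_factor(k, n, K, N):
    """Observed hits (k) divided by the hits expected by chance (n * K / N).

    k: DEGs annotated to the term, n: DEGs submitted,
    K: background genes annotated to the term, N: background size.
    """
    return k / (n * K / N)

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def keep_term(k, n, K, N, p_cutoff=0.01, min_count=3, min_ef=1.5):
    """Apply the three criteria: P value, minimum count, enrichment factor."""
    return (k >= min_count
            and hypergeom_tail(k, n, K, N) < p_cutoff
            and enrichment_factor(k, n, K, N) > min_ef)

# Hypothetical term: 200 of 20,000 background genes; 5 of 47 DEGs hit it.
ef = enrichment_factor(5, 47, 200, 20000)
kept = keep_term(5, 47, 200, 20000)
too_few = keep_term(2, 47, 200, 20000)  # fails the minimum count of 3
```

With these numbers, 0.47 hits are expected by chance, so five observed hits give an enrichment factor above 10.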

The “org.Hs.eg.db” and “clusterProfiler” packages in R were used to perform the gene ontology (GO) functional enrichment analyses and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis for the DEGs [29, 30]. GO functional enrichments comprised molecular function (MF), cellular component (CC), and biological process (BP). Enrichment was statistically significant at a q value < 0.05. The outcomes of these enrichment analyses were visualized using R’s ggplot2 package.

Establishment of a PPI network

To develop a PPI network, the DEGs were submitted to the STRING database (https://string-db.org/). STRING contains known and predicted PPIs. The interactions include direct and indirect associations derived from computational prediction, knowledge transfer between organisms, and interactions aggregated from other databases [31]. The PPI network was constructed with “Homo sapiens” as the study species and a minimum interaction score of 0.4.
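Applying the score cutoff and counting connections can be sketched in Python as follows. The interaction scores listed here are invented for illustration and are not STRING output; only the 0.4 cutoff comes from the text.

```python
from collections import Counter

def build_network(interactions, min_score=0.4):
    """Keep interactions at or above min_score and count each node's degree."""
    edges = {
        tuple(sorted((a, b)))
        for a, b, score in interactions
        if score >= min_score and a != b
    }
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return edges, degree

# Hypothetical combined scores, for illustration only.
interactions = [
    ("COL3A1", "POSTN", 0.92),
    ("COL3A1", "COMP", 0.85),
    ("POSTN", "COMP", 0.61),
    ("ASPN", "POSTN", 0.45),
    ("CA4", "PEBP4", 0.21),  # below the cutoff, so it is dropped
]
edges, degree = build_network(interactions)
```

Nodes whose interactions all fall below the cutoff do not appear in the network, which is one way a network built from 47 DEGs can end up with fewer than 47 nodes.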

Identification of important feature genes and construction of an ANN model

The “randomForest” package was then used to perform an RF analysis of the DEGs, with the number of decision trees initially set to 500. The number of trees with the lowest cross-validation error was then selected as the parameter for the final model. Each gene was assigned an importance score, and genes with importance scores > 1.0 were considered key IPF feature genes. The “pheatmap” package was used to visualize the key feature genes and group the data based on their expression levels.
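The importance score reported by an RF is commonly the mean decrease in Gini impurity, which accumulates the impurity reduction of every split made on a gene across the trees. The Python sketch below computes that reduction for a single split only; it is a building block shown under simplifying assumptions (binary labels, one threshold, toy data), not a random forest and not the authors' procedure.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = control, 1 = IPF)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)  # equals 1 - p^2 - (1 - p)^2 for two classes

def gini_decrease(values, labels, threshold):
    """Impurity reduction obtained by splitting samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

labels = [0, 0, 0, 1, 1, 1]
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # separates the classes perfectly
gene_b = [1.0, 5.0, 3.0, 2.0, 6.0, 4.0]  # separates them poorly
dec_a = gini_decrease(gene_a, labels, threshold=3.0)
dec_b = gini_decrease(gene_b, labels, threshold=3.0)
```

A gene whose splits separate IPF from control samples cleanly (gene_a) yields a larger decrease and would therefore rank higher in importance.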

To remove batch effects between cohorts, we scored the feature genes according to their expression relative to the median value. For an upregulated gene, a sample received a score of 1 if its expression was higher than the median and a score of 0 otherwise. For a downregulated gene, the scoring was reversed. Using these gene scores, we developed an ANN model to diagnose IPF. The ANN comprised three layers: an input layer, a hidden layer, and an output layer. The R packages “neuralnet” and “NeuralNetTools” were used in this stage [32, 33].
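The median-based scoring can be expressed in a few lines of Python. This is a sketch with toy values; the function name and the "up"/"down" flag are illustrative choices, not part of the original analysis.

```python
from statistics import median

def score_gene(values, direction):
    """Binary score per sample for one gene, relative to the gene's median.

    direction: 'up' if the gene is upregulated in IPF, 'down' if downregulated.
    """
    m = median(values)
    if direction == "up":
        return [1 if v > m else 0 for v in values]
    # Downregulated genes use the reversed rule.
    return [0 if v > m else 1 for v in values]

expression = [2.0, 4.0, 6.0, 8.0]  # toy expression values; median = 5.0
up_scores = score_gene(expression, "up")
down_scores = score_gene(expression, "down")
```

Because each gene is compared only with its own median within a cohort, the resulting scores depend on the ranking of samples and not on the absolute scale of each platform.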

Evaluation of the ANN model

The testing cohort was processed using the same methodology and was used to assess the diagnostic accuracy of the IPF model. Using the “pROC” package, we created ROC curves for each of the two cohorts to assess the effectiveness of the ANN model. In the ROC curve, the vertical axis represents the true positive rate (“Sensitivity”), whereas the horizontal axis represents the false positive rate (“1-Specificity”). The area under the curve (AUC) indicated how accurate the model was.
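The relationship between the ROC curve and the AUC can be illustrated with a small Python sketch. The scores and labels are toy values, and the AUC is computed through its rank interpretation (the probability that a randomly chosen IPF sample scores higher than a randomly chosen control); this is not the pROC implementation.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, from strictest to loosest cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy model outputs: 1 = IPF, 0 = control.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = roc_auc(scores, labels)
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

An AUC of 1.0 means every IPF sample received a higher score than every control, whereas 0.5 corresponds to random ranking.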

Discovery of immune cell infiltration characteristics

To quantify the relative proportions of infiltrating immune cells from the gene expression profiles in IPF, the CIBERSORT algorithm (https://cibersortx.stanford.edu/) was used to calculate immune cell infiltration characteristics. CIBERSORTx is an analytical tool from the Alizadeh Lab and Newman Lab that imputes gene expression profiles and estimates the abundances of member cell types in a mixed cell population using gene expression data [34, 35]. Immune cell abundance was calculated with 1,000 permutations, based on a reference set of 22 immune cell subtypes (the LM22 signature matrix file, downloaded from CIBERSORTx).
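The conversion of deconvolution output into relative proportions can be sketched in Python. The sketch covers only a final normalization step and assumes that per-cell-type regression coefficients are already available, that negative coefficients are set to zero, and that the remainder is rescaled to sum to 1; the regression itself and the permutation test are not reproduced, and the coefficient values are invented.

```python
def to_fractions(coefficients):
    """Turn per-cell-type coefficients into relative proportions.

    Assumption for this sketch: negative coefficients are set to zero and
    the remaining values are rescaled so that they sum to 1.
    """
    clipped = {cell: max(0.0, w) for cell, w in coefficients.items()}
    total = sum(clipped.values())
    if total == 0:
        raise ValueError("all coefficients are zero or negative")
    return {cell: w / total for cell, w in clipped.items()}

# Hypothetical coefficients for four of the 22 LM22 cell types.
coefficients = {
    "Macrophages M0": 3.0,
    "Mast cells resting": 1.5,
    "NK cells resting": -0.5,
    "Neutrophils": 0.5,
}
fractions = to_fractions(coefficients)
```

Because the proportions are relative, an increase in one cell type necessarily lowers the proportions of the others within the same sample.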

Distribution and correlation analyses of the 22 types of infiltrating immune cells were performed using the R “corrplot” package. To illustrate how immune cell infiltration differed between the IPF and control samples, plots were generated in R.

Statistical analysis

We used RGui 4.2.3 for all statistical analyses. DEGs between IPF and control samples were identified using an adj P value < 0.05 and |log2FC| > 2. We collected terms having a P value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 from the DisGeNET, PaGenBase, and TRRUST ontologies. For GO functional enrichment and KEGG pathway enrichment, a q value < 0.05 indicated statistical significance. The minimum interaction score in the PPI network was set to 0.4. The diagnostic efficacy of the feature genes was assessed using ROC curve analysis and the AUC value. For group comparisons of continuous variables, Student’s t-test was used for normally distributed data and the Mann-Whitney U test for non-normally distributed data. All statistical tests were two-sided, and P < 0.05 was considered significant.

Results

Identification of DEGs in merged dataset cohort

After the three datasets (GSE110147, GSE21369, and GSE24206) were merged, batch effects were removed using the ComBat function of the “sva” package to produce a merged dataset cohort. The DEGs of the merged dataset were then identified using the “limma” package. Using the thresholds of adj P value < 0.05 and |log2FC| > 2.0, 47 DEGs were identified, of which 11 were downregulated and 36 were upregulated (Table 1, Supplementary File 2). Figure 1A shows the heatmap of the expression levels of the 11 downregulated and 36 upregulated DEGs, and Fig. 1B shows the volcano plot of these DEGs.

Table 1 47 DEGs in merged dataset cohort
Fig. 1

DEGs in merged dataset. (A) The expression levels of the 11 downregulated DEGs and 36 upregulated DEGs in the merged dataset. Control samples (Con) and IPF samples (IPF) showed varied expression levels. Blue denotes low expression, whereas red denotes high expression. (B) The volcano plot presents 11 downregulated DEGs and 36 upregulated DEGs in the merged dataset. The thresholds were established at |log2FC| > 2.0 and adj P < 0.05; the genes upregulated and downregulated in the IPF samples are shown by the red (Up) and green (Down) dots respectively; genes that do not exhibit a difference in expression between the IPF and normal samples are represented by the black dots (Not)

Prediction of the disease spectrum and function of DEGs

The DisGeNET enrichment analysis summary showed that the DEGs were linked to lung diseases (interstitial), lung diseases, and connective tissue diseases (Fig. 2A). The PaGenBase enrichment analysis summary showed that the DEGs were related to tissues and cells such as lung, bronchial epithelial cells, and trachea (Fig. 2B). The TRRUST enrichment analysis summary identified IPF-related transcription factors, including SP1, STAT3, TFAP2A, BRCA1, RELA, NFKB1, and JUN (Fig. 2C).

Fig. 2

Enrichment analyses using Metascape. (A) Summary of enrichment analysis in DisGeNET. (B) Summary of enrichment analysis in PaGenBase. (C) Summary of enrichment analysis in TRRUST. Terms that met the following criteria were gathered and clustered: membership similarities, P value < 0.01, minimum count of 3, and enrichment factor > 1.5

GO functional and KEGG pathway enrichment analyses

The GO BP enrichment analysis revealed that the DEGs were significantly enriched in various biological processes, including extracellular matrix (ECM) organization, extracellular structure organization, external encapsulating structure organization, collagen fibril organization, response to nutrient, antimicrobial humoral immune response mediated by antimicrobial peptide, humoral immune response, collagen metabolic process, organ or tissue specific immune response, and blood coagulation. According to the GO CC enrichment analysis, the DEGs were significantly enriched in collagen-containing ECM, endoplasmic reticulum lumen, fibrillar collagen trimer, banded collagen fibril, collagen trimer, and complex of collagen trimers. The GO MF enrichment analysis demonstrated that the DEGs were significantly enriched in the following molecular functions: ECM structural constituent, platelet-derived growth factor binding, integrin binding, heparin binding, calcium-dependent protein binding, metallopeptidase activity, metalloendopeptidase activity, cytokine activity, glycosaminoglycan binding, growth factor binding, and other functions (Supplementary File 3A). The top 10 GO functional enrichments ranked by q value are shown in Fig. 3A.

The KEGG pathway enrichment analysis revealed that the DEGs were highly enriched in the advanced glycation end products (AGE)-receptor for AGE (RAGE) signaling pathway in diabetic complications, ECM-receptor interaction, the interleukin 17 (IL-17) signaling pathway, viral protein interaction with cytokine and cytokine receptor, pancreatic secretion, amoebiasis, and protein digestion and absorption (Supplementary File 3B). The seven KEGG pathway enrichments ranked by q value are shown in Fig. 3C.

Fig. 3

GO functional and KEGG pathway enrichment analyses. (A) Top 10 GO functional enrichments ranked by q value. BP: biological process, CC: cellular component, MF: molecular function. (B) Chord plot of GO BP. The top eight GO BP functional enrichments are represented by the GO terms, and the enriched genes are indicated by the gene names with the relationship. (C) The seven KEGG pathway enrichments ranked by q value. (D) Chord plot of KEGG. The top eight KEGG pathway enrichments are shown by the KEGG terms, and the enriched genes are indicated by the gene names with the connection

PPI network construction

Using the STRING database, we built a PPI network to examine the interactions between the 47 DEGs in more detail. At a minimum interaction score of 0.40, the network had 46 nodes (target proteins) and 83 edges (protein interactions) (Supplementary file 4, Fig. 4).

Fig. 4

PPI network

The network comprised 46 targets and 83 edges at a minimum interaction score of 0.40. A higher degree value indicates a greater number of connections.

Selection of important genes using RF analysis

To identify key feature genes among the 47 DEGs, RF analysis was performed. The number of decision trees was determined using the cross-validation error, which was minimized at 39 decision trees. Accordingly, 39 decision trees were selected as the final model parameter (Fig. 5A). Importance scores were then calculated for the genes; the 30 most important genes, arranged in ascending order of importance scores, are displayed in Fig. 5B. Among them, leucine-rich repeat containing 17 (LRRC17), cartilage oligomeric matrix protein (COMP), asporin (ASPN), cartilage acidic protein 1 (CRTAC1), collagen type III alpha 1 chain (COL3A1), periostin (POSTN), phosphatidylethanolamine binding protein 4 (PEBP4), interleukin 13 receptor subunit alpha 2 (IL13RA2), and carbonic anhydrase 4 (CA4), which had importance scores > 1.0, were identified as feature genes for subsequent analysis. The heatmap of the nine important feature genes is shown in Figure S1.

Fig. 5

Identification of candidate important genes by RF analysis. (A) Effect of the number of decision trees on the error rate. The number of decision trees (trees) is shown on the x-axis, whereas the error rate (Error) is shown on the y-axis. The black lines indicate the error values for all samples. (B) The 30 most significant genes as determined using RF analysis. Critical feature genes were identified in compliance with the specifications of the RF algorithm. MeanDecreaseGini represents the mean decrease in the Gini index. A larger value indicates a more important variable

Construction of an ANN model for IPF

Each of the nine feature genes was scored according to its expression relative to the median. Using the nine feature gene scores, an ANN was used to develop a diagnostic prediction model with three layers: input, hidden, and output (Supplementary file 5A). The ANN model was trained with a machine-learning algorithm using the feature gene weights. The ANN model output showed that training was repeated 114 times (the number of iterations), a value selected automatically by the ANN algorithm (Figure S2). The ANN model based on gene scores is shown in Fig. 6A, in which the input layer containing the gene scores is connected to the hidden layer according to the weights that were obtained. The hidden layer contained five nodes. Based on these five nodes and their respective weights, we obtained the output layer, which gives the class of the sample.
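The layer structure described here (nine inputs, five hidden nodes, two outputs) can be illustrated by a forward pass in Python. This is a sketch only: the weights are random placeholders and not the trained weights of the model, the logistic activation on both layers is an assumption, and the mapping of output nodes to "Con" and "IPF" is chosen arbitrarily for illustration.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through an input -> hidden -> output network."""
    hidden = [
        logistic(b + sum(w * x for w, x in zip(ws, inputs)))
        for ws, b in zip(w_hidden, b_hidden)
    ]
    output = [
        logistic(b + sum(w * h for w, h in zip(ws, hidden)))
        for ws, b in zip(w_out, b_out)
    ]
    return hidden, output

random.seed(0)  # placeholder weights, reproducible but not trained
n_in, n_hidden, n_out = 9, 5, 2
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [random.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [random.uniform(-1, 1) for _ in range(n_out)]

# One sample described by nine binary gene scores (hypothetical values).
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1]
hidden, output = forward(sample, w_hidden, b_hidden, w_out, b_out)
# Assumed convention: output[0] is the control node, output[1] the IPF node.
predicted = "IPF" if output[1] > output[0] else "Con"
```

Training adjusts the weights and biases so that the output nodes match the known sample classes; only the structure of the computation is shown here.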

The accuracy of the ANN model in predicting IPF is detailed in Tables 2 and 3 for the training and testing sets, respectively. Figure 6B shows that the AUC of the predictive model in the training set was 1.000 [95% confidence interval (CI) 1.000–1.000]. This value indicates that the model had a remarkable ability to predict IPF. The ANN model was then applied to the testing set using the same feature genes as in the training set (Supplementary file 5B). The testing set AUC was 0.936 (95% CI 0.894–0.971), showing the ANN model’s reliability and stability (Fig. 6C). The heatmap of the nine important feature genes in the testing set is shown in Fig. 7A, and the expression of the nine important feature genes in IPF tissues and normal control tissues in the testing set is shown in Fig. 7B. These results were consistent with those of the differential expression analysis in the merged dataset cohort.

Table 2 IPF prediction accuracy of the ANN model in the training set
Table 3 IPF prediction accuracy of the ANN model in the testing set
Fig. 6

The ANN model of the nine important genes for IPF. (A) Gene score-based ANN model generation. Three layers make up the ANN: an output (O1,O2), a hidden (H1-H5), and an input (I1-I9) layer. (B) The predictive model (Train group) AUC was 1.000 (95% CI 1.000–1.000). (C) Testing set (Test group) AUC was 0.936 (95% CI 0.894–0.971)

Fig. 7

Validation of the expression of the nine important genes in the GSE32537 dataset. (A) The heatmap of the nine important feature genes in the testing set. Control samples (Con) and IPF samples (IPF) showed varied expression levels. Blue denotes low expression, whereas red denotes high expression. (B) The expression of the nine important feature genes in IPF tissues and normal control tissues in the testing set. Control (Con) and IPF samples (IPF) are represented by blue and yellow, respectively. *** P < 0.05

Immune cell infiltration

The CIBERSORT algorithm was used to assess immune cell abundance with the downloaded LM22 signature matrix file and 1,000 permutations (Supplementary File 6A). The results of CIBERSORT are presented in Supplementary File 6B.

Figure 8A shows the findings of the distribution analysis of 22 immune cell types in the IPF and control groups. Figure S3 shows immune cell correlation. Next, we investigated the immune cells that differed between IPF tissues and normal control tissues. IPF tissues had significantly decreased levels of T cells CD8, monocytes (P = 0.009), natural killer (NK) cells resting (P < 0.001), macrophages M1 (P = 0.010), and neutrophils (P = 0.028) compared to normal tissues. However, IPF tissues had significantly greater proportions of T cells CD4 memory resting (P = 0.020), macrophages M0 (P < 0.001), and mast cells resting (P = 0.028) compared to normal tissues (Fig. 8B).

Fig. 8

Distribution and difference of immune cell infiltration. (A) The distribution analysis of 22 immune cell types in IPF samples (IPF) and control samples (Con). (B) The immune cells that differed between IPF tissues and normal control tissues. Control (Con) and IPF samples (IPF) are represented by blue and red, respectively

Discussion

IPF is an interstitial lung disease whose primary pathological manifestation is UIP. IPF remains incurable and has a dismal prognosis. The precise mechanism by which IPF occurs and progresses remains poorly understood, despite the publication of numerous studies in the field [36]. The onset and progression of IPF may be influenced by epithelial-mesenchymal transition, ECM deposition, and pulmonary remodeling [37,38,39].

Because there are no early diagnostic markers for IPF, patients frequently miss their best chance for treatment, which causes the disease to progress more quickly. It is essential to explore the molecular mechanisms of IPF onset and progression and to pinpoint treatment targets for the disease. Recent studies suggest that infiltrating immune cells may play a major role in the development and progression of IPF and have the ability to eradicate aged alveolar epithelial cells [40, 41].

However, studies of the immune infiltration and abnormally expressed genes that distinguish IPF from normal tissues are limited. We first gathered three comparable microarray cohorts from the GEO database. We then constructed a merged dataset cohort comprising 23 control samples and 50 IPF samples. In total, 47 DEGs were found, 11 downregulated and 36 upregulated, which was consistent with previous differential gene analyses [12]. The enrichment analyses showed that the DEGs were linked to IPF-related transcription factors, cells and tissues, and diseases. The PPI network showed the interactions between these DEGs. The primary GO functional enrichments were associated with ECM, suggesting that these DEGs contribute to the formation of IPF and are intimately related to ECM [36,37,38]. Significant KEGG pathway enrichments were observed in the IL-17 signaling pathway, AGE-RAGE signaling pathway, ECM-receptor interaction, pancreatic secretion, amoebiasis, viral protein interaction with cytokine and cytokine receptor, and protein digestion and absorption. These major pathways were also related to ECM and immune response, including the most important pathways that are highly relevant and enriched in IPF such as transforming growth factor β (TGF-β), mitogen-activated protein kinase (MAPK), phosphatidylinositol 3 kinase (PI3K)-protein kinase B (Akt), and nuclear factor κB (NF-κB) signaling pathways.

RF analysis and an ANN model were then used to identify important feature genes and establish a diagnostic model. The CIBERSORT tool was used to investigate the immune cell infiltration features of IPF.

Using RF analysis, nine important feature genes were identified. The six upregulated genes were LRRC17, COMP, ASPN, POSTN, COL3A1, and IL13RA2, and the three downregulated genes were CRTAC1, PEBP4, and CA4. A predictive ANN model based on these nine genes was then constructed and validated. The ROC curve and AUC analyses suggested that all nine genes have significant potential for disease diagnosis.

It is anticipated that LRRC17 contributes to the development of bone marrow, negatively regulates osteoclast differentiation, and is active in ECM and extracellular space [42, 43]. COMP encodes a noncollagenous ECM protein [44]. The most intriguing clinical application of COMP is its utilization as a biomarker for IPF. COMP is a large pentameric glycoprotein that interacts with numerous ECM proteins in cartilage and other tissues [45, 46]. ASPN encodes a small leucine-rich proteoglycan cartilage extracellular protein [47]. Tissue regeneration and development are facilitated by a secreted ECM protein encoded by POSTN [48]. ASPN and POSTN may act as hub genes regulating pulmonary fibrosis [49]. ASPN promotes the differentiation of lung myofibroblasts induced by TGF-β by facilitating the recycling of TβRI, which is dependent on Rab11 [50]. Periostin is a useful biomarker for type 2 inflammation and pulmonary fibrosis [51]. In extensible connective tissues, COL3A1 encodes type III collagen pro-alpha1 chains [52]. Dysregulated expression of COL3A1 might impact the development of IPF through modulating IPF-related biological processes and the expression level of COL3A1 is correlated with IPF prognosis [53]. COL3A1 could serve as a biomarker for IPF and non-small cell lung cancer progression [54]. The protein encoded by IL13RA2, which is closely linked to IL13RA1, binds IL13 with high affinity and helps internalize it [55]. The induction of fibrotic markers by IL-13 in vitro is impeded by the overexpression of IL-13Ralpha2, which also prevents bleomycin-induced pulmonary fibrosis [56]. CRTAC1 is responsible for producing a glycosylated ECM protein located in the interterritorial matrix of articular deep zone cartilage [57]. CRTAC1 serves as a biomarker for the health status of alveolar type-2 epithelial cells in lavage fluid and plasma [58]. 
Phosphatidylethanolamine-binding proteins, which include PEBP4, are a family of evolutionarily conserved proteins. These proteins play critical biological roles, including lipid binding and serine protease inhibition [59]. CA4 encodes a glycosylphosphatidyl-inositol-anchored membrane isozyme. This isozyme is expressed on the proximal renal tubules and the luminal surfaces of pulmonary capillaries [60]. Although some of these genes have not yet been studied in IPF, they are linked to the disease and should be thoroughly investigated.

After the nine feature genes were included in the ANN, a diagnostic prediction model was developed, which exhibited outstanding IPF prediction performance. It has the potential to accurately differentiate IPF samples from normal samples, which will be valuable for IPF diagnosis.

We used CIBERSORT to analyze immune cell infiltration in normal and IPF samples. Certain immune cell subtypes were found to be closely connected to significant BPs of IPF. Compared with normal tissues, IPF tissues showed an increase in mast cells, macrophages M0, and T cells CD4 memory resting, and a decrease in the infiltration of monocytes, neutrophils, NK cells resting, and T cells CD8. These changes may be linked to the onset and progression of IPF. Because similar differences are seen in other chronic lung diseases, our next step is to further analyze the feature genes in order to find immune cell gene targets specific to IPF.

Indeed, it has been demonstrated previously that immune and inflammatory cells are crucial to the development of IPF. Some of our findings are consistent with earlier research. IPF is the pathological result of suboptimal wound healing after lung injury. M1 macrophages repair wounds after alveolar epithelial injury, while M2 macrophages resolve lung inflammation [61]. NF-κB exacerbates M1 macrophage polarization by promoting the release of proinflammatory cytokines [62]. According to research, polarized M1 macrophages cultured in a distinct polarizing medium can redifferentiate into a different cell phenotype or revert to M0 macrophages after 12 days in a cytokine-deficient medium [63]. The proportion of resting NK cells was lower in IPF tissue samples than in controls [64]. Interest in immunological dysregulation in IPF has been rekindled by recent publications emphasizing the prognostic and mechanistic roles of monocytes and monocyte-derived alveolar macrophages [65]. BLT1 mediates bleomycin-induced lung fibrosis independently of neutrophils and CD4+ T cells [66]. It may be possible to use these differential immune cells as targets for immunotherapy in patients with IPF.

A genomic classifier was developed with machine learning and whole transcriptome RNA sequencing using lung tissue obtained by biopsy. It was introduced and validated for lung tissue obtained by transbronchial forceps biopsy. Genomic testing of lung tissue can increase the confidence of multidisciplinary discussion in distinguishing IPF from non-IPF diagnoses. However, there are few studies on genomic testing of lung tissue biopsies, the sensitivity of such testing is low, and it is prone to false negatives. More clinical studies are therefore needed to further evaluate its sensitivity and specificity [1, 2].

Given the above results, detection of the nine feature genes may increase confidence in the early diagnosis of IPF. Detecting the nine feature genes before and after treatment in patients with a definite diagnosis of IPF could further validate our model. Treatment efficacy and post-treatment expression changes in the nine feature genes, combined with immune cell infiltration, may provide a basis for further investigation of treatment-related mechanisms.

Despite our best efforts, this study has limitations that should be noted. First, even though we merged the three datasets to acquire as many samples as feasible, the merged dataset cohort requires more samples. Second, the sample size of the validation cohort should be increased. Finally, the roles of immune cell infiltration and the nine feature genes in IPF were inferred from bioinformatics analysis, and additional experimental studies are required to validate these findings.

Conclusion

In conclusion, LRRC17, COMP, ASPN, CRTAC1, POSTN, COL3A1, PEBP4, IL13RA2, and CA4 were identified as key IPF feature genes. The ANN model built on these nine feature genes accurately distinguished IPF samples from normal samples, which will be valuable for the diagnosis of IPF. Immune cells that differ between IPF and normal samples may have a role in the onset of the disease and may one day be the focus of immunotherapy for patients with IPF.