Introduction

It is estimated that 451 million people had diabetes in 2017, and this number is projected to reach 693 million by 2045 (ref. 1). As one of the most serious microvascular complications, diabetic nephropathy (DN) is a major cause of end-stage renal disease (ESRD) in many countries. The accumulation of advanced glycation end-products, oxidative stress and activation of protein kinase C are the major pathogenic mechanisms of DN. A more recent view holds that tubular injury plays an important and even initiating role2. Current treatment strategies for DN aim to control blood glucose and blood pressure and to inhibit the renin–angiotensin system (RAS) to reduce albuminuria and delay the progression of DN3. However, considering the high incidence of DN-related ESRD, the effect of these strategies is not entirely satisfactory. Therefore, there is a critical need to identify new therapeutic targets and improve clinical management.

High-throughput sequencing technology offers an effective method to study disease-related genes and has yielded promising drug targets in many fields4. To date, several studies have screened genes or miRNAs involved in DN5,6,7,8,9. Integrating these data could overcome the heterogeneity between studies and provide more accurate information. This study identified target genes that may improve the understanding of the molecular mechanisms of DN and provide a resource for building new hypotheses in follow-up studies. We suggest that the complement system may serve as a therapeutic target in DN.

Results

Differential expression analysis of genes in the GSE30529 dataset

Differential expression analysis of genes in the GSE30529 dataset5 was performed to obtain differentially expressed genes (DEGs) that may be involved in DN. First, the GSE30529 dataset was subjected to quality examination to detect batch effects and to determine the principal components that contributed the most to the variance. The boxplot showed that the overall gene expression levels of the samples in the GSE30529 dataset were approximately the same (Fig. 1a), suggesting that there was no batch effect. In addition, the first two principal components explained 25.7% and 25.4% of the variance in principal component analysis (PCA) (Fig. 1b), suggesting that the DN group and the control group differ clearly along these components. These differences may reflect biologically significant DEGs. After the quality inspection, differential expression analysis was performed with the limma package10 to acquire DEGs with the criteria of |log2-fold change (FC)| greater than 1 and adjusted p value less than 0.05. As a result, 386 upregulated DEGs and 71 downregulated DEGs were identified between the DN group and the control group. The volcano map in Fig. 1c displays the general distribution of these genes, and the top 25 DEGs (PART1, IGJ, IGLC1, IGLV1-44, FCER1A, HDAC9, VCAN, TNC, PDLIM1, PXDN, C3, LTF, CXCL6, MMP7, LYZ, MID1, TRIM22, PTPRE, MARCKSL1, QPCT, TNFAIP8, SPARC, NMI, PLK2 and KDELC1) are shown in the hierarchical clustering heatmap in Fig. 1d.
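The thresholding step described above can be illustrated with a minimal Python sketch. It is not the limma code used in this study; the function name `classify_degs`, the input layout and the toy values are assumptions introduced only for illustration.

```python
from typing import Dict, List, Tuple


def classify_degs(
    results: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 1.0,
    padj_cutoff: float = 0.05,
) -> Dict[str, List[str]]:
    """Split genes into up- and downregulated DEGs.

    `results` maps a gene symbol to (log2 fold change, adjusted p value).
    """
    up, down = [], []
    for gene, (log2_fc, padj) in results.items():
        if padj >= padj_cutoff:
            continue  # not significant after multiple-testing correction
        if log2_fc > lfc_cutoff:
            up.append(gene)
        elif log2_fc < -lfc_cutoff:
            down.append(gene)
    return {"up": sorted(up), "down": sorted(down)}


# Hypothetical values for illustration only; they are not taken from GSE30529.
example = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (-1.4, 0.020),
    "GENE_C": (0.6, 0.001),  # significant, but below the fold-change cutoff
    "GENE_D": (3.0, 0.200),  # large change, but not significant
}
degs = classify_degs(example)
```

Both conditions must hold at the same time, so a gene with a large fold change but a nonsignificant adjusted p value is not counted as a DEG.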

Figure 1

Differential expression analysis of GSE30529. (a) Boxplot of GSE30529. (b) PCA of GSE30529. The two main components contributed 25.7% and 25.4%. (c) Volcano map of DEGs. A total of 386 upregulated DEGs and 71 downregulated DEGs were identified between the DN group and the control group with the criteria of |log2 FC| greater than 1 and adjusted P value less than 0.05. (d) Heatmap of the top 25 DEGs.

Weighted gene coexpression network analysis of the GSE30529 dataset

Compared with differential analysis, which focuses on the differential expression of individual genes, the advantage of weighted gene coexpression network analysis (WGCNA) is that it uses expression correlation information between multiple genes to identify genes of interest. Therefore, we applied both analytical methods to screen the target genes. As before, we first performed sample cluster analysis to assess sample similarity. The results showed that there were 3 outliers (Fig. 2a); therefore, these three samples (GSM757025, GSM757027 and GSM757034) were removed. To construct a scale-free network, the soft threshold power was set to 10, at which the scale-free topology fitting index reached 0.85 and the mean connectivity reached 100 (Fig. 2b). Based on the weighted gene coexpression correlation, hierarchical clustering analysis was carried out to obtain different gene modules, which are represented by branches of the clustering tree and different colours. A total of 22 modules were found in the network, with module sizes ranging from 30 to 10,000 and a merge cut height of 0.25 (Fig. 2c). The 22 modules were broadly divided into two clusters according to the relationships between the modules (Fig. 2d). In addition, the weighted coexpression correlations of all genes were displayed in a heatmap plot (Fig. 2e). Finally, 3,538 highly related genes were selected from the topological overlap matrix (TOM) with a threshold greater than 0.1. Combining the results of the two analyses can yield more accurate targets. Therefore, the two gene lists were intersected, and a list of 345 target genes was obtained; these genes may play a regulatory role in DN (Fig. 3a).
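The combination of the two gene lists can be sketched as follows. This is an illustrative Python sketch rather than the code used in this study. It assumes that a gene counts as highly related when it takes part in at least one gene pair whose TOM value exceeds the threshold; the function names and toy values are hypothetical.

```python
from typing import Dict, Set, Tuple


def highly_related_genes(
    tom: Dict[Tuple[str, str], float], threshold: float = 0.1
) -> Set[str]:
    """Return genes found in at least one pair whose TOM value exceeds the threshold."""
    genes: Set[str] = set()
    for (gene_a, gene_b), overlap in tom.items():
        if gene_a != gene_b and overlap > threshold:
            genes.update((gene_a, gene_b))
    return genes


def target_genes(degs: Set[str], related: Set[str]) -> Set[str]:
    """Intersect the DEG list with the highly related gene list (the Venn overlap)."""
    return degs & related


# Hypothetical pairwise TOM values and DEG list, for illustration only.
toy_tom = {
    ("G1", "G2"): 0.35,
    ("G1", "G3"): 0.05,
    ("G3", "G4"): 0.12,
    ("G5", "G6"): 0.02,
}
toy_degs = {"G1", "G3", "G5"}

related = highly_related_genes(toy_tom)
targets = target_genes(toy_degs, related)
```

In this toy example G5 is a DEG but has no pair above the threshold, so it is excluded from the target list.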

Figure 2

WGCNA of GSE30529. (a) Sample clustering of GSE30529. (b) Analysis of soft-thresholding powers to fit the scale-free topology model and the mean connectivity of the soft-thresholding powers; 10 was chosen as the value to construct a scale-free network. (c) Dendrogram of the gene modules. The branches represent different gene modules, and each leaf represents a gene in the cluster dendrogram. (d) Clustering and heatmap of 22 gene modules. (e) Heatmap of the weighted gene coexpression correlations of all genes.

Figure 3

Enrichment analysis. (a) Venn diagram of the DEG list and highly related gene list. A total of 345 target genes were obtained. (b, c) GO annotation and KEGG pathway enrichment analysis. GO annotations mainly included neutrophil activation, regulation of immune effector process, positive regulation of cytokine production and neutrophil-mediated immunity. KEGG pathways mostly included phagosome, complement and coagulation cascades, cell adhesion molecules and ECM-receptor interaction and focal adhesion.

Functional enrichment analysis of the target genes

The pathogenesis of diabetic nephropathy is very complex, and understanding the functions of the target genes could guide the direction of new research. Functional enrichment analysis of the target genes was performed with the clusterProfiler package11 to explore the Gene Ontology (GO) annotations and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways in which target genes are involved. The top 12 GO terms were identified and mainly included neutrophil activation, regulation of immune effector process, positive regulation of cytokine production and neutrophil-mediated immunity (Fig. 3b). The KEGG pathways mostly included phagosome, complement and coagulation cascades, cell adhesion molecules (CAMs), ECM-receptor interaction and focal adhesion (Fig. 3c). The AGE-RAGE signalling pathway in diabetic complications was also found. It is interesting that the immune system seems to play an important role.

Potential epigenetic regulatory mechanism

It has now been recognized that the occurrence and development of DN are the result of complex interactions between genetic and environmental factors. Environmental signals could change intracellular pathways through chromatin modifiers and regulate gene expression patterns leading to diabetes and its complications12. After determining the target genes, we studied more datasets to understand the potential mechanisms of the differential expression of the target genes, including the GSE51674 dataset9, which contains miRNA profiles, and the GSE121820 dataset, which contains DNA methylation profiles.

Generally, gene expression can be inhibited by miRNAs via base pairing with mRNA. Differential expression analysis was performed on the miRNA profiles of the GSE51674 dataset9. As before, quality examinations of GSE51674 were performed first. There were no markedly heterogeneous samples in the sample cluster dendrogram (Fig. 4a). PCA showed that the first two principal components explained 62.71% and 15.68% of the variance, respectively (Fig. 4b). Next, 16 downregulated miRNAs and 67 upregulated miRNAs were found with the criteria of |log2 FC| greater than 3 and adjusted p value less than 0.01 (Fig. 4c). The 16 downregulated miRNAs are shown in the hierarchical clustering heatmap in Fig. 4d. To construct a downregulated miRNA-mRNA network, the TargetScan, miRWalk, miRBase and miRTarBase databases13,14,15,16 were used for target gene prediction of the miRNAs. Eighty-eight downregulated miRNA-mRNA pairs were obtained from these miRNA target web tools (Fig. 5a). Among the predicted targets, TGFBI, SH2B3 and ZNF652 were upregulated in the GSE30529 dataset (Fig. 5b). Therefore, the miR-1237-3p/SH2B3, miR-1238-5p/ZNF652 and miR-766-3p/TGFBI axes may be involved in diabetic nephropathy. Similar work was carried out on the upregulated miRNAs, but their predicted genes did not overlap with the target genes from GSE30529.
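The construction of the miRNA-mRNA network can be sketched in Python as below. The sketch assumes that a pair is retained only when all four prediction resources report it, which is one reading of the Venn plot in Fig. 5a and is not confirmed by the text. The miRNA names, gene names and function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (miRNA, predicted target gene)


def consensus_pairs(predictions: Iterable[Set[Pair]]) -> Set[Pair]:
    """Keep only the miRNA-target pairs reported by every prediction resource."""
    prediction_sets = list(predictions)
    if not prediction_sets:
        return set()
    return set.intersection(*prediction_sets)


def candidate_axes(pairs: Set[Pair], upregulated_targets: Set[str]) -> List[Pair]:
    """Keep pairs whose target is an upregulated target gene.

    A downregulated miRNA paired with an upregulated target is consistent
    with miRNA-mediated repression being relieved.
    """
    return sorted(pair for pair in pairs if pair[1] in upregulated_targets)


# Hypothetical predictions from four resources, for illustration only.
db1 = {("miR-x", "GENE_1"), ("miR-x", "GENE_2"), ("miR-y", "GENE_3")}
db2 = {("miR-x", "GENE_1"), ("miR-y", "GENE_3"), ("miR-y", "GENE_4")}
db3 = {("miR-x", "GENE_1"), ("miR-y", "GENE_3")}
db4 = {("miR-x", "GENE_1"), ("miR-x", "GENE_2"), ("miR-y", "GENE_3")}

shared = consensus_pairs([db1, db2, db3, db4])
axes = candidate_axes(shared, upregulated_targets={"GENE_3", "GENE_9"})
```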

Figure 4

Differential expression analysis of GSE51674. (a) Cluster dendrogram of GSE51674. (b) Principal component analysis of GSE51674. The two main components contributed 62.71% and 15.68%. (c) Volcano map of differentially expressed miRNAs. Sixty-seven upregulated miRNAs and 16 downregulated miRNAs were identified between the DN group and the control group with the criteria of |log2 FC| greater than 3 and adjusted p value less than 0.01. (d) Heatmap of the downregulated miRNAs.

Figure 5

miRNA-mRNA network. (a) Venn plot of four prediction results. (b) miRNA-mRNA network. In this network, TGFBI, SH2B3 and ZNF652 in red were upregulated in GSE30529.

DNA methylation is a major epigenetic mechanism of gene expression regulation. To understand the methylation level changes of the target genes, the GSE121820 dataset was downloaded as a validation dataset. Among the 345 target genes, 227 genes had methylation differences between the DN group and the control group (Supplemental Table 1).

Table 1 Small molecule compounds identified by Connectivity Map.

PPI network and identification of hub genes

First, the list of target genes was exported to the STRING database. By setting the interaction confidence score to the highest confidence level (0.9), a protein–protein interaction (PPI) network was constructed, which contained 190 nodes and 680 edges (Fig. 6a). Each node represents a protein, and each edge represents an interaction between proteins. The size and gradient colour of the nodes are scaled by the degree, while the thickness and gradient colour of the edges are scaled by the interaction score. To search for important nodes in the network, all nodes were ranked by the 12 topological analysis methods provided by CytoHubba. Each algorithm computed scores for all nodes, and then 1–50 points were assigned based on the rank. According to the total points, the top 20 nodes (KNG1, C3, FN1, SYK, HLA-E, EGF, ITGB2, CXCL1, CXCL8, ITGAV, LYN, VWF, RHOA, HLA-DQA1, ITGAM, SERPING1, P2RY13, ANXA1, P2RY14 and FCER1G) were identified (Fig. 6b). Because the products of these genes were at the core of the PPI network, these hub genes were considered potential therapeutic targets.
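The rank-based point scheme can be sketched in Python as follows. The exact mapping from rank to points is not specified in the text, so this sketch assumes that the top-ranked node under each method receives 50 points, the second 49, and so on down to 1 point for the 50th node, with lower-ranked nodes receiving none. It also assumes that a higher score means a more important node under every method and that ties are broken alphabetically. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def rank_points(scores: Dict[str, float], max_points: int = 50) -> Dict[str, int]:
    """Assign max_points to the best-scoring node, then one point less per rank."""
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return {node: max_points - i for i, node in enumerate(ranked[:max_points])}


def top_hub_nodes(
    methods: Dict[str, Dict[str, float]], max_points: int = 50, top_n: int = 20
) -> List[str]:
    """Sum the points over all methods and return the top_n nodes."""
    totals: Dict[str, int] = defaultdict(int)
    for scores in methods.values():
        for node, points in rank_points(scores, max_points).items():
            totals[node] += points
    return sorted(totals, key=lambda node: (-totals[node], node))[:top_n]


# Hypothetical scores from two methods on a three-node network.
toy_methods = {
    "Degree": {"A": 5, "B": 3, "C": 1},
    "Betweenness": {"A": 0.2, "B": 0.9, "C": 0.1},
}
hubs = top_hub_nodes(toy_methods, max_points=3, top_n=2)
```

Summing ranks rather than raw scores keeps methods with very different score scales, such as Degree and Stress, from dominating the total.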

Figure 6

PPI network. (a) PPI network of combined genes. There are 190 nodes and 680 edges. The size and gradient colour of nodes are adjusted by degree. The thickness and gradient colour of the edge are adjusted by the interaction score. (b) Heatmap of the CytoHubba analysis score.

Clinical data validation and drug prediction

To verify the potential roles of the hub genes in DN, clinical data including two datasets (Woroniecka and Schmid) from Nephroseq were obtained, and Pearson correlation analysis was performed between the hub genes and clinical data. The gene expression of SYK, CXCL1, LYN, VWF, ANXA1, C3, HLA-E, RHOA and SERPING1 in DN tubule samples was negatively related to GFR, suggesting a pathogenic role of the upregulated genes (Fig. 7a, c, e). Conversely, the gene expression of EGF and KNG1 in DN tubule samples was positively related to GFR, suggesting a protective role of the downregulated genes (Fig. 7b, d, f).

Figure 7

Pearson correlation analyses of GFR and target genes. (a) The gene expression of SYK (p = 0.0022, r = −0.8437), CXCL1 (p = 0.0016, r = −0.8554), LYN (p = 0.0269, r = −0.6911), VWF (p = 0.0452, r = −0.6423) and ANXA1 (p = 0.0211, r = −0.7111) was negatively related to GFR. (b) The gene expression of EGF (p = 0.0027, r = 0.8349) and KNG1 (p = 0.0073, r = 0.7838) was positively correlated with GFR. (c) The gene expression of C3 (p = 0.0459, r = −0.6109) and CXCL1 (p = 0.0061, r = −0.7645) was negatively correlated with GFR. (d) The gene expression of EGF (p = 0.0037, r = 0.7919) was positively related to GFR. (e) The gene expression of C3 (p = 0.0171, r = −0.6970), HLA-E (p = 0.0132, r = −0.7161), RHOA (p = 0.0439, r = −0.6154) and SERPING1 (p = 0.0091, r = −0.7409) was negatively correlated with GFR. (f) EGF (p = 0.0121, r = 0.7221) and KNG1 (p = 0.0153, r = 0.7053) were positively related to GFR.

Given that the effectiveness of existing treatment strategies is not entirely satisfactory, it is necessary to propose new strategies and develop new therapeutic methods. Connectivity Map17 was used to compare the DEG list with the reference dataset of the database, and a correlation score (−100 to 100) was obtained. A negative score indicates that the DEG list and the reference gene expression profile may be opposite; that is, the expression profile induced by the drug is negatively correlated with the expression profile induced by the disease. Twenty-three upregulated DEGs (logFC greater than 2.5) and 13 downregulated DEGs (logFC less than −1.5) were exported to Connectivity Map to search for potential drugs. Small molecule compounds with a score of less than −90 were sorted according to their correlation score with the reference gene expression profile. As a result, 8 small molecule compounds were identified as potential therapeutic drugs (Table 1).
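The selection logic of this step can be sketched in Python. The sketch does not query Connectivity Map; it only illustrates how a query signature and a candidate list could be derived from tabulated values. The downregulated cutoff is taken as a logFC below −1.5, and the function names, gene names, compound names and scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_signature(
    log_fc: Dict[str, float], up_cutoff: float = 2.5, down_cutoff: float = -1.5
) -> Tuple[List[str], List[str]]:
    """Select strongly up- and downregulated genes as the query signature."""
    up = sorted(gene for gene, value in log_fc.items() if value > up_cutoff)
    down = sorted(gene for gene, value in log_fc.items() if value < down_cutoff)
    return up, down


def candidate_compounds(
    scores: Dict[str, float], cutoff: float = -90.0
) -> List[Tuple[str, float]]:
    """Keep compounds scoring below the cutoff, most negative first."""
    hits = [(name, score) for name, score in scores.items() if score < cutoff]
    return sorted(hits, key=lambda item: item[1])


# Hypothetical logFC values and connectivity scores, for illustration only.
toy_log_fc = {"G1": 3.1, "G2": 2.0, "G3": -1.8, "G4": -1.2}
toy_scores = {
    "compound_a": -95.2,
    "compound_b": -60.0,
    "compound_c": -99.1,
    "compound_d": 88.0,
}

up_genes, down_genes = build_signature(toy_log_fc)
hits = candidate_compounds(toy_scores)
```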

Discussion

As one of the microvascular complications of diabetes, DN is the main cause of ESRD. Existing treatments are not sufficient to control the development of the disease, and new treatment strategies are needed. High-throughput omics data have been widely used to study the mechanisms of disease and predict possible therapeutic targets. We performed differential expression analysis and WGCNA of GSE30529 and obtained 345 target genes. GO annotations mainly included neutrophil activation, regulation of immune effector process, positive regulation of cytokine production and neutrophil-mediated immunity. KEGG pathways mostly included phagosome, complement and coagulation cascades, cell adhesion molecules (CAMs), ECM-receptor interaction, focal adhesion and AGE-RAGE signalling pathway in diabetic complications. These results suggest that the immune response may be involved in DN. Cytokine release and extracellular matrix deposition may be subsequent events that continue as the disease develops. We also studied additional datasets to understand the potential mechanisms of the differential expression of the target genes. The miRNA-mRNA network suggested that the miR-766-3p/TGFBI, miR-1238-5p/ZNF652 and miR-1237-3p/SH2B3 axes may be involved in diabetic nephropathy, and the methylation analysis showed that most target genes differ in DNA methylation levels between the DN group and the control group. Next, a PPI network was established, and 20 hub genes were identified. Furthermore, correlation analysis with clinical data suggested a disease-promoting effect of SYK, CXCL1, LYN, VWF, ANXA1, C3, HLA-E, RHOA and SERPING1, which were upregulated in DN tubule samples. In contrast, EGF and KNG1, which were downregulated in DN tubule samples, were suggested to have protective effects in DN.

To date, there have been some reports linking the hub genes to DN. Spleen tyrosine kinase (SYK) was reported to mediate high glucose-induced TGF-β1 and IL-1β secretion18,19. In a diabetic animal model, C-X-C motif chemokine ligand 1 (CXCL1) was found to possibly serve as a proinflammatory mediator20,21. In addition, VWF was reported to be involved in intrarenal thrombosis leading to the deterioration of renal function22. Purvis et al. observed higher circulating plasma levels of ANXA1 in T1D and T2D patients, whereas the exogenous supplementation of ANXA1 improved insulin resistance and prevented the progression of subsequent microvascular complications in mice23,24. Previous studies have demonstrated that statins prevent DN by reducing the activation of Ras homolog family member A (RhoA) protein25,26,27,28. Another study reported that the activation of RhoA/ROCK may regulate the NF-κB signalling pathway29. In addition, sinomenine, kaempferol, catalpol and rutin have been shown to have protective effects through the RhoA/ROCK signalling pathway30,31,32,33. EGF was considered a urine biomarker in two studies34,35. A recent report about cytosine methylation differences in kidney tubule samples supported this viewpoint36. In addition, one large-scale linkage study revealed polymorphisms in kininogen 1 (KNG1) associated with DN in European populations37.

C3 was identified as a gene of interest by both differential expression analysis and WGCNA. The KEGG pathways of the target genes also included the complement and coagulation cascades. In addition, the selection of the hub genes in the PPI network indicated that C3 was centrally located. These results suggest that complement C3 may serve as a therapeutic target in diabetic nephropathy. They are consistent with existing knowledge that the complement system participates in DN. The development of diabetes is intimately linked to low-grade inflammation38. High levels of inflammatory markers such as C-reactive protein and adiponectin support this view39,40. Inflammation might promote the occurrence and development of diabetic complications such as DN. However, the underlying mechanisms of the initiation of low-grade inflammation are still poorly understood. Increasing evidence indicates that the innate immune system is closely involved in diabetes41. The roles of pattern recognition receptors (PRRs) associated with DN have also been discussed42,43. The complement system is not only involved in innate immune defence through PRRs (mannose-binding lectin and ficolin) but is also considered an important proinflammatory factor. Several studies have pointed out that the complement system is involved in the pathogenesis of DN and might be a therapeutic target44,45,46. Significant differences in complement system component levels in both plasma and urine were found between DN patients and diabetic patients. In addition, Li et al. highlighted the relatively more important impact of C3a, C5a and sC5b-9 in the development of DN47. Sun et al. demonstrated that more severe kidney damage was associated with the deposition of C1q and C3c in renal histopathology assessment48. Furthermore, a large-scale cohort study substantiated that diabetic patients with high plasma levels of C3 are more prone to kidney damage than the general population49. Another study indicated that the serum levels of C3 may help to differentiate DN patients from diabetic patients without kidney damage50. Blockade of C3a and C5a receptors in a T1DM model indicated a potential protective effect on renal fibrosis by improving endothelial-to-myofibroblast transition through the Wnt/β-catenin signalling pathway51. Similarly, blockade of C3a receptors in rats with T2DM improved renal morphology and function by inhibiting cytokine release and TGFβ/Smad3 signalling52. However, the best approach for targeting the complement system to prevent the development of DN still needs to be explored. Therefore, 8 potential small molecule compounds were identified by the Connectivity Map database in our study.

In summary, our study contributes to the understanding of the underlying mechanisms of DN and may help in developing new treatment strategies for DN. However, further molecular biological experiments are needed to verify the association between the identified genes and DN.

Materials and methods

Data download

The GSE30529 (expression profiling by array)5 and GSE51674 (non-coding RNA profiling by array)9 datasets were downloaded with the GEOquery package53 in R (version 3.6.2). GSE121820_T2DN-CTL (methylation profiling by genome tiling array, unpublished) was downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). The GSE30529 dataset, based on the GPL571 platform, includes 10 DN tubule samples and 12 control samples. The GSE51674 dataset, based on the GPL10656 platform, includes 6 DN tissue samples and 4 control samples. The GSE121820 dataset, based on the GPL5082 platform, contains 10 type 2 DN (T2DN) blood samples and 10 control samples.

Data processing

All differential analyses were performed by the limma package10. Adjusted p values less than 0.05 and |log2-fold change (FC)| greater than 1 were considered statistically significant in the differential analysis of GSE30529. Adjusted p values less than 0.01 and |log2 FC| greater than 3 were considered statistically significant in the differential analysis of GSE51674. In addition, the TargetScan, miRWalk, miRBase and miRTarBase databases13,14,15,16 were used for the target gene prediction of the differentially expressed miRNAs.
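The adjustment method is not stated in the text. As a hedged illustration, the following Python sketch implements the Benjamini-Hochberg procedure, which is the default adjustment in limma's `topTable`; whether this default was used here is an assumption. The function name and the toy p values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p values, for illustration only.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```

For these four values the adjusted p values are approximately 0.02, 0.04, 0.04 and 0.02, so all four would pass a cutoff of 0.05 but none would pass the cutoff of 0.01 applied to GSE51674.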

Weighted gene coexpression network analysis (WGCNA), performed with the WGCNA package54, allows biologically meaningful module information to be mined from pairwise correlations between genes in high-throughput data. The WGCNA workflow consists of gene coexpression network construction, module identification, module relationship analysis and the identification of highly related genes. The gene coexpression network was constructed by choosing the soft threshold power that made the network most consistent with a scale-free topology. The modules were identified with a module size of 30–10,000 and a merge cut height of 0.25 (the verbose option, which only controls the amount of progress output, was set to 3). Highly related genes were obtained with thresholds greater than 0.1 in the topological overlap matrix (TOM).
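To make the network construction concrete, the following Python sketch computes a soft-thresholded adjacency, a_ij = |cor(x_i, x_j)|^β, and the corresponding topological overlap, TOM_ij = (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), where k_i is the connectivity of gene i. It assumes an unsigned network, which the text does not state, and it is a plain re-implementation for illustration rather than the WGCNA package code. The function names and expression values are hypothetical, and every gene is assumed to have non-zero variance.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def adjacency(expr: Matrix, power: int = 10) -> Matrix:
    """Unsigned soft-threshold adjacency; rows are genes, columns are samples."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]  # the diagonal is kept at 0
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** power
    return adj


def topological_overlap(adj: Matrix) -> Matrix:
    """Topological overlap matrix computed from an adjacency with zero diagonal."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            denominator = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denominator
    return tom


# Hypothetical expression profiles of three genes in four samples.
toy_expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 1.0, 3.0, 2.0],
]
toy_adj = adjacency(toy_expr, power=2)  # a small power keeps the toy values readable
toy_tom = topological_overlap(toy_adj)
```

Raising the correlation to a power suppresses weak correlations much more strongly than strong ones, which is what pushes the network towards a scale-free topology as the power increases.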

Functional enrichment analysis and hub gene screening

Gene Ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed with the clusterProfiler package11. The STRING database55 (version 11.0, https://string-db.org/) was used to search for interactions between the candidate proteins based on laboratory data, other databases, text mining and predictive bioinformatics data. Cytoscape software was used to visualize the protein–protein interaction (PPI) network and perform network analysis. CytoHubba, a Cytoscape plug-in, uses 12 methods to explore important nodes in biological networks, namely the Degree method (Deg), Maximum Neighborhood Component (MNC), Density of Maximum Neighborhood Component (DMNC), Maximal Clique Centrality (MCC), Closeness, EcCentricity, Radiality, BottleNeck, Stress, Betweenness, Edge Percolated Component (EPC) and ClusteringCoefficient56.

Clinical data analysis and drug analysis

The Nephroseq v5 analysis engine (https://v5.nephroseq.org) provides access to gene expression signatures and clinical features. Pearson correlation analysis was performed between genes and GFR5,57. Unpaired Student’s t test was used to compare two groups. P values less than 0.05 were considered statistically significant. Nonsignificant results are not displayed.
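The correlation step can be illustrated with the following Python sketch, which computes the Pearson coefficient r and the associated t statistic, t = r·sqrt((n − 2)/(1 − r²)), with n − 2 degrees of freedom. The p value would then be read from the t distribution, which is omitted here to keep the sketch dependency-free. The function names and the GFR and expression values are hypothetical and are not taken from Nephroseq.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two paired samples."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two paired samples with at least three observations")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    denominator = 1.0 - r * r
    if denominator <= 0.0:
        return math.copysign(math.inf, r)  # perfect correlation
    return r * math.sqrt((n - 2) / denominator)


# Hypothetical paired measurements, for illustration only.
toy_gfr = [30.0, 45.0, 60.0, 75.0, 90.0]
toy_expression = [9.0, 8.1, 7.4, 6.0, 5.5]

r = pearson_r(toy_gfr, toy_expression)
t = t_statistic(r, len(toy_gfr))
```

A negative r, as in this toy example, corresponds to the pattern reported for the upregulated hub genes, whose expression rises as GFR falls.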

Connectivity Map17, an online database that relates disease, genes, and drugs based on similar or opposite gene expression signatures, was used for potential drug prediction.