Background

In recent years, genome-wide association studies (GWAS) have become a common tool to discover the genetic basis of complex diseases and have led to many scientific discoveries [1]. Many single nucleotide polymorphisms (SNPs) have been identified in a variety of diseases. The single marker analysis tests genetic association of individual SNPs and identifies only the most significant SNPs below a stringent significance level, for example, p < 5 × 10−8, which is necessary to control the false positive rate on a genome-wide level. However, the identified SNPs only represent a small fraction of the genetic variants to contribute to complex diseases, due to small individual effect sizes. Markers that are truly but weakly associated with disease often fail to be detected [2].

It is well understood that the stability of biological systems is governed by many biomolecular interactions and multi-gene effects should be taken into consideration while mapping from genotypes to phenotypes. Consider a crucial biological mechanism, where failure of a small portion of the important genes can lead to dys-function of the whole biological mechanism. This is very likely to happen in case of complex disease, such as Crohn's Disease, and therefore multi-locus analysis show increase of power when analyzing such data.

Pathway-based or gene set enrichment analysis has become a potentially powerful approach in search of disease associated genes (for a recent review of pathway analysis, see [3]). One of the most popular methods is GSEA [4]. Using a modified Kolmogorov-Smirnov test, it compares the p value distribution of genes in a pathway with the rest of the genes. GSEA has successfully identified the IL-12/IL-23 pathway that is significantly associated to Crohn's disease [5]. However, there are some disadvantages to the common pathway based analysis. First, most of the studies choose the most significant SNP from each gene as a representative, and therefore systematic but small changes in a gene-set will be missed if individual genes do not have any SNPs with strong marginal association. Second, how robust the methods are with regard to factors such as pathway annotations and pathway size is not clear. Third, many methods treat all genes equally despite that some genes (e.g., housekeeping genes) appear in many pathways [6]. The additional information of genes in overlapping pathways should not be ignored, as shown in a study where weighting genes based on their appearance in the gene sets can improve gene set ranking and boost sensitivity of the analysis [7].

Similar to pathway-based analysis, where biologically relevant connections from public databases are utilised in GWAS, network-based analysis has also become a popular tool for the study of complex disease. In context of molecular interaction networks, it has been found that about one third of known disorders with multiple genes show physical clustering of genes with the same phenotype and these clusters are likely to represent disorder-specific functional modules [8]. A concept of disease modules was emerged as more studies show that proteins that are involved in the same disease show a high propensity to interact with each other [9, 10]. If a few disease components are identified, other disease-related components are likely to be found in their network-based vicinity. For example, various module search methods have been developed in search of disease associated modules [11, 12]. However, a module consists of an arbitrary number of genes, it often requires intensive simulations for multiple-testing corrections.

Network Guilt by Association (GBA) is an approach for identifying disease genes based on the observation that similar phenotypes arise from functionally related genes. Algorithms related to Google's PageRank, such as Iterative Ranking and Gaussian smoothing, are applied in prioritizing candidate disease genes using network information [13]. A typical workflow looks like this: given a query disease, known causal genes of diseases that are phenotypically similar to the query disease are given a prior score in human PPI network, then the prior scores are propagated and smoothed over network such that each protein gets an association score. Genes with high association scores are considered as candidate associations.

In this study, we analyzed a GWAS in a GBA framework. The network is overlaid with GWAS information, that is, each gene is assigned a prior score based on the gene level p value from GWAS, which represents our prior knowledge of its association to the disease. After propagation, the prior score has been smoothed over the whole network and each gene gets a new association score, denoted as the posterior score, with higher posterior score representing stronger evidence for association. If a gene has many neighbours that are associated with the disease, it is very likely that itself is also associated. A gene with high posterior score can be called as candidate genes, even though it has a low prior score and fails to be called in standard GWAS due to stringent p value cut-off, chip coverage or sampling bias.

A recent study applied similar ideas to prioritize candidate disease genes and demonstrated a boost in the power to detect associated genes in GWAS [14]. Using a naive Bayes framework on datasets of Crohn's disease and Type 1 diabetes, the posterior score of each gene was obtained by adding its own log odds from GWAS (as prior) and a soft GBA score from neighbors. The study showed that some genes with high posterior scores are actually validated as true associations in later studies, although they do not have highest prior scores (e.g., lowest p values in GWAS studies) and would possibly be ignored in two-stage studies. However, there are some open questions in this study. First, the posterior score of a gene depends not only on its neighbors association of disease, but also on how many neighbors it has, i.e, network topology. It is appealing to mark a gene with high posterior score as associated, neglecting the fact that a high posterior score is merely due to a high degree (receiving information from more neighbors). Second, there is no statistic control (e.g., false discovery rate) for the findings. If there is no signal in the network, i.e, a completely random prior for all the genes, this method still outputs the genes with highest posterior scores.

Methods

In this section, we first describe the data we used and the network propagation framework, then we build a prediction model based on the GBA genes, and evaluate the performance in cross-validation (CV).

Data Set

Prior information from GWAS. We analyzed the raw anonymous genotype data of the Wellcome Trust Case Control Consortium (WTCCC) study. The original cohort includes 2005 Caucasian UK patients of Crohn's disease and 3004 controls genotyped on the Affymetrix 500K mapping array. The details are described in [15]. Genotypes with posterior probability (or CHIAMO score) lower than 0.98 are considered as missing data. Markers are removed if the percentage of missing data was larger than 5% or if they are not in Hardy-Weinberg equilibrium (p > 0.0001 for control group). We further remove some individuals with missing allele larger than 3% or of non-European ancestry or with duplicated samples, as suggested by [15], and are left with 1748 cases and 2938 controls. Finally we map a SNP to a gene if it was located within the gene or 10kb immediately upstream or downstream.

Interaction network. The PPI network is built based on the STRING database version 9.0 [16]. Only interactions with a score larger than 700 are included, and it results in 229599 interactions involving 15010 proteins. We use proteins and genes interchangeably in the following, because SNPs were first mapped to genes then mapped to corresponding proteins. The GWAS dataset only covers part of the genes in the network, for example, 11363 genes for Crohn's disease. We discard all the vertices that are not covered by GWAS, keeping only edges between covered vertices. Isolated nodes are also removed, in the end, we are left with a large connected network N .

Propagation of evidence

Consider a PPI network as an undirected graph G = (V, E, w), where nodes V are a set of proteins, edges E are links between proteins if interaction exists, w is the weight of an edge. For a node vV, denote its total number of neighbors by degree(v) and its direct neighbors in G by N (v). Let Y:V 0 represent a function of prior evidence, i.e., assign high score to a node v if we a priori (from GWAS) believe that it is associated to the disease. F:V 0 denotes a function of posterior evidence, i.e., F(v) represents the posterior evidence of association after propagating the information of its neighbor nodes. The main three steps of network propagating are, (1) obtaining prior information from GWAS, (2) calculating and normalising posterior scores by network propagation (with choice of tuning parameters), (3) selecting genes with highest posterior score as candidates.

Gene-level prior scores. The p value of a gene is defined as the minimum single marker test p value of its SNPs, as widely used in pathway analysis. The prior score of gene i defined as y i = Φ−1 (1 − p i / 2) and Φ is the Cumulative Distribution Function of normal distribution. Therefore, under the null hypothesis of no association, y i ~ N + 0 , 1 . According to [17], minimum p values performs best in most scenarios, we also tried Fisher's combined probability test, however, it gives lower internal consistency of gene ranks for random subset of the data.

Calculating posterior scores by propagation. The posterior score F is computed as

F ( v ) = α [ u N ( v ) F ( u ) w v , u ] + ( 1 - α ) Y ( v )

where the parameter α∈ (0, 1) weights the relative importance of information received from neighbors, and w v , u = w v , u / d ( v ) × d ( u ) denotes the weight of edges, with d(v) the degree of node v. The above formula can be expressed in linear form F = αW'F + (1 − α)Y, which is equivalent to

F = ( I - α W ) - 1 ( 1 - α ) Y
(1)

It can be proved that W' is similar to a stochastic matrix, which has eigenvalues in [−1, 1] (according to the PerronFrobenius theorem). Since α∈ (0, 1), the eigenvalues of (IαW')−1 exists. Though the linear equation can be solved analytically, it is difficult to compute the inverse of a large matrix with |V| × |V| dimension, and we choose a iterative propagation method to solve the system. At iteration t, we compute

F t = α W F t - 1 + ( 1 - α ) Y
(2)

Tuning parameters. There are two tuning parameters in the model, α and T. T denotes the number of iterations of propagation. We study two extreme scenarios, T = 1 and T = ∞. When T = 1, each node only receives information from its direct neighbors. Posterior score F is calculated from equation (2). When T = ∞, equation (2) reaches equilibrium after many iterations and the information is smoothed over the network, therefore each node also gains information from its indirect neighbors through iterations. F scores will converge as shown in equation (1). In practice, equilibrium is often achieved within 20 iterations.

α, also known as the damping parameter in the literature, denotes how much information a node receives from neighbors. Higher α indicates less weight on its own prior information. Previous applications of similar algorithms of ranking SNPs or genes recommend α∈ [0.5, 0, 95] [13, 18]. We explored α∈ (0.2, 0.4, 0.6, 0.8, 0.9) in the experiments. A good choice of α should give better internal consistency of gene rank by posterior score, which can be measured by Kendall Tau rank correlation [19].

We randomly choose half of the case and control samples and rank genes based on posterior scores. Low consistency is obtained for α ∈ (0.2, 0.4) and it agrees with a previous study that α should be more than 0.5 [18]. Highest consistency is obtained when α∈ (0.8, 0.9), thus we choose the mean α= 0.85 in the main analysis, unless otherwise specified.

Identifying associated genes

As shown in Equation (2), a gene can have a high posterior score under two conditions: (1) it has a high degree in the network and receives more information from its neighbors than the other low degree genes, (2) most of its neighbors are associated with the disease and itself is GBA. Therefore ranking a gene based on posterior score has some issues when the first condition is dominant, we may include too many false positives in the candidate gene set and have a potential power loss for genes with lower degrees in the network.

Here we suggest a framework, that uses permutations to identify GBA genes and eliminates potential false positives. The pseudocode is shown in the following. Input parameters include: significant threshold and number of permutations, K.

example input: Sig. Thresh.=0.01, K=10000;

// GWAS prior

prior scores based on p values of Cochran-Armitage trend test in GWAS data;

propagate prior scores by equation (2);

normalize posterior scores for all genes so that the sum is 1;

record posterior score for gene i as S 0 i ;

// permutation

for k in 1 to K; do

permute case and control labels, calculate prior scores from p values;

propagate prior scores by equation (2);

normalize posterior scores for all genes so that the sum is 1;

record posterior score for gene i as S k i ;

done

// find candidate GBA genes

for gene i in network; do

the p value of the posterior score of gene i is k = 1 K I ( S k i > S 0 i ) K

gene i is candidate association if the p value of its posterior score is smaller than Sig. Thresh.

done

Prediction model and ROC

The GBA candidate genes are important nodes in the network, because many of their neighboring genes are associated with disease. To measure how the GBA genes collectively contribute to the disease, we used them to build prediction models and evaluate the performance in 10-fold cross-validation (CV). The prediction models based on 90% of the cases and controls were tested on the remaining 10% data, and it was repeated 10 times with different 90% and 10% of the cohorts.

A logistic regression model with all the SNPs as covariates is fitted by the R package glmnet [20]. Though the GBA framework only chose the most significant SNP to represent the gene, we used all the SNPs located within 10Kb boundary of the candidate genes. glmnet applies cyclical coordinate descent to solve elastic-net penalized regression models, which are mixtures of two penalties: l1 (the lasso) and l2 (ridge regression), and it generates models with relatively few predictors. To evaluate the performance of the predictive model, we calculated the average Area Under the receiver operating characteristic Curves (AUC) [21] for all 10 trials.

Results and discussion

Problems with association by rank

Network propagation for prioritizing associated genes has been applied in several studies when there is functional similarity between a given gene and the known disease gene. The selected few known disease associated genes give prior information for the network, after propagation, each gene gets a posterior score, which represents its association to the disease.

Implementation of similar ideas in a GWAS showed boosting in identification of disease-associated genes [14]. However, in such an application, genes of high degree often have high posterior scores due to propagation. If we simply take the genes with highest posterior score as candidates, we may end up including too many false positives in the candidate gene list. In Additional file 1: Table S1, where N0 refers to network with the GWAS prior and Nk≠0to networks with randomized prior, one can see that the top ranked genes in N0 are also often ranked top in Nk≠0. Although the detailed implementation of our study and the one by [14] is different, it reveals the necessity of utilizing such network methods in a more cautious way. Moreover, most of the top ranked genes in N0 have high network degree, with average degree of 195.4, while the PPI used in our studies has an average degree of 24.5, it again confirms our concern that genes of high degree tend to be ranked on top in the network.

Candidate genes

Identified candidate genes

The study by [14] listed the top-ranked 150 genes, to make the results comparable, we also made a list of 150 top-ranked genes with T = 1, they are ranked by the p value of their posterior score obtained from permutation. As shown in Additional file 1: Table S2, one can see that many genes with significant p value from GWAS are also called in our study. This is not surprising since the prior scores from GWAS contribute to the posterior scores. Moreover, our method also identified some genes that are not called by standard methods in this data set, but reported in other independent GWAS. The number of genes that are validated by other studies and reported in GWAS catalog [22] is shown in Figure 1. There are 7 genes identified as association candidates in our study, which failed to be called by GWAS p value ranking, they are PTPN22, IRF1, PTGER4, IL12B, IL18R1, FASLG and JAK2. Except IL12B and JAK2, the other 5 genes also failed to be called by [14]. These candidate genes all have a higher rank of posterior scores in the network of GWAS prior, compared to network with random prior, as shown in Additional file 1: Table S2. For example, gene JAK2 (MIM 147796) has a gene level p value of 0.0523, but its rank in the network increases from average of 104 in Nk≠0to 54 in N0. IL18R1 (MIM 604494) has a GWAS p value of 0.003, and its rank of posterior score is 3342 in Nk≠0and 1109 in N0. This gene would be missed by both single marker test as well as ranking genes by posterior score.

Figure 1
figure 1

Number of reported genes. Overlap of validated genes among top 150 genes for each method, all of them map SNPs to genes within 10kb boundary. Validated genes denote genes that are reported as candidate in later GWAS studies. Method GWAS means ranking genes by minimum p value.

There are 4 genes, C13orf31, MST1, RTEL1 and HLA-DQA2, that were listed on top 150 by [14] and failed to be called in our study. The reason we failed to call them is because these genes are not present in the STRING database we used. Therefore a more complete PPI database integrated from diverse sources is needed, as was done by [14].

Based on the number of significant genes that are reported by other independent GWAS or meta-analyses, there seems to be no big advantages of our results compared with the one by [14]. However, these studies are designed to discover different signals in the network. Our method focus more on genes with abundant guilty neighbors, thus it has higher power to discover local signals even with low degree genes but less power to call high degree genes. However, there are also two other factors in the approaches that may contribute to the difference in performance. First, the detailed implementation is different, [14] rank genes by posterior log odds, and choose a best prior log odds such that the number of validated genes is maximized. Second, the reported genes in GWAS catalog might have a bias to genes that have significant p value in certain type of study (eg., certain type of chips). Nevertheless, our study showed potential power of GBA framework to boost GWAS signals.

Genomic prediction model

We further investigated whether the candidate genes collectively contribute to the disease and evaluated the extent to which predictions were driven by these candidate genes. A previous study conducted pathway analysis on the same data set [23] and built a logistic regression model of 277 genes with a variable selection algorithm. The average AUC was 0.6 in 10-fold CV with all the SNPs within 10kb boundary of the selected genes, and it dropped to 0.56 after excluding all SNPs with p value < 5 × 10−7. Using GBA candidate genes, we had higher average AUC in 10-fold CV, the average is 0.705 (T = 1) and 0.730 (T=) in models including all the SNPs that are mapped to candidate genes, and 0.687 (T = 1) and 0.715 (T=) after removing SNPs with p value < 5 × 10−7. The numbers are shown in Table 1, one can see that the increased AUC is not due to the number of genes we used to build the model, on the contrary, we used fewer genes for the prediction model, with 130 (T = 1) and 184 (T=) genes respectively.

Table 1 Prediction models based on candidate genes.

We also built prediction models with the same number of genes that are ranked on top (based on posterior scores), which would be called candidate genes in the study of [14]. As shown in Table 1, there is a AUC drop with models of top ranked genes. The reason might be that the candidate genes in our study are GBA genes and collectively contribute to disease, while top ranked genes can be special in the network topology but with no association to the disease.

Boosting signal from IL12 pathway genes

Many studies of pathway analysis have uncovered significant associations between Crohn's disease and the IL12/23 signaling pathway [5, 24]. 19 of 20 genes in IL12 pathway are included in our network analysis, most of them have a posterior score (rank) increase in N0, shown in Additional file 1: Table S3. As propagation redistributes the information of the network, standard pathway enrichment analysis based on posterior scores might have an advantage over the one based on direct GWAS results.

Conclusions

Combining GWAS data with function databases is very appealing as it provides more explanatory power for the list of candidate genes. While pathway methods have shown success in many applications, they also have limitations. For example, genes involved in multiple pathways might introduce bias in different pathways, different definitions of the same pathway in different knowledge bases can affect performance assessment in terms of power and true positive/negative rate. [3].

The methods of combining GWAS and PPI networks mainly fall into two categories, (1) dense module search algorithms in search of significantly enriched subnetworks [11, 12, 25], (2) propagation algorithms related to Google's PageRank [13, 14] in search of genes that have top ranks in the network. However, while methods in group (1) require intensive randomization of network topology for accessing module significance, and often encounter multiple testing problem in searching of modules of various sizes in high dimension space, methods in group (2) fail to distinguish signals from GWAS and signals from network, and therefore tend to have a high false positive rate, especially in case of biased PPI database.

Our study extends the idea of network propagation with GWAS information, such that information from various resources can be utilized. The performance of this method can be improved in various ways. Integration of diverse data sources, as suggested in [14], will improve the ability to prioritize disease genes. Mapping multiple SNPs to a single gene is the simplest way of obtaining genelevel statistics, yet some collapsing-based and kernelbased methods are worth trying for gene-level statistics [26]. There are also potential extensions of this study. For example, most of the 20 major genes in IL12 pathway, identified as associated in Crohn's Disease, have increased posterior score in the network of GWAS prior, it implies that network propagation method redistributes the information in the network where the true associations get enriched information. Therefore pathway analysis based on posterior scores may have more power than the standard pathway analysis. Many methods that detect interaction and epistasis, such as Support Vector Machine [27] and Logic Regression [28] are not applicable in genome-wide scale due to high dimensions, a reduced search space such as interactions among GBA genes might yield some results.