Reconstructing SNP allele and genotype frequencies from GWAS summary statistics

Yang, Zhiyu; Paschou, Peristera; Drineas, Petros

doi:10.1038/s41598-022-12185-6

Published: 17 May 2022

Volume 12, article number 8242, (2022)

Scientific Reports

Zhiyu Yang¹,
Peristera Paschou¹ &
Petros Drineas²

Abstract

The emergence of genome-wide association studies (GWAS) has led to the creation of large repositories of human genetic variation, creating enormous opportunities for genetic research and worldwide collaboration. Methods that are based on GWAS summary statistics seek to leverage such records, overcoming barriers that often exist in individual-level data access while also offering significant computational savings. Such summary-statistics-based applications include GWAS meta-analysis, with and without sample overlap, and case-case GWAS. We compare the performance of leading methods for summary-statistics-based genomic analysis and also introduce a novel framework that can unify usual summary-statistics-based implementations via the reconstruction of allelic and genotypic frequencies and counts (ReACt). First, we evaluate ASSET, METAL, and ReACt using both synthetic and real data for GWAS meta-analysis (with and without sample overlap) and find that, while all three methods are comparable in terms of power and error control, ReACt and METAL are faster than ASSET by a factor of at least a hundred. We then proceed to evaluate the performance of ReACt vs an existing method for case-case GWAS and show comparable performance, with ReACt requiring minimal underlying assumptions and being more user-friendly. Finally, ReACt allows us to evaluate, for the first time, an implementation for calculating polygenic risk score (PRS) for groups of cases and controls based on summary statistics. Our work demonstrates the power of GWAS summary-statistics-based methodologies, and the proposed novel method provides a unifying framework and allows further extension of possibilities for researchers seeking to understand the genetics of complex disease.

Introduction

Genome-wide association studies (GWAS) have emerged as a powerful tool, leading to the identification of thousands of common genetic variants that underlie human complex disorders and traits. They also led to the creation of large repositories of human genetic variation creating enormous opportunities for further analysis. However, sharing and transferring of individual-level genotype data is often restricted due to privacy concerns as well as logistical issues. On the other hand, GWAS summary statistics, typically including information such as odds ratio (OR)/effect size (beta), standard error (SE), p-values, and case/control sample sizes for each SNP being analyzed, are often readily accessible¹. The availability of such alternative sources of information has spurred intense interest into the development of methodologies seeking to leverage such records effectively in order to retrieve as much information as possible. Besides overcoming barriers in individual-level data access, summary-statistics-based methods also offer advantages in computational costs, which do not scale as a function of the number of individuals in the study².

Summary statistics methodologies have been developed to allow a wide array of statistical analyses, including effect size distribution estimation^3,4; GWAS meta-analysis and fine mapping^5,6,7,8,9; allele frequency and association statistic imputation^10,11; heritability and genetic correlation estimation^12,13,14,15; case-case GWAS¹⁶; and polygenic prediction^17,18,19. Many of these methods have to incorporate additional information from publicly available sources, such as linkage disequilibrium (LD) statistics from a reference population^10,12,20. Most of the existing methodologies analyzing GWAS summary statistics use the summary statistics (OR, SE, p-value) from the input “as is”, often via relatively complicated estimation and modeling.

The objective of our work is three-fold. First, we seek to unify GWAS summary-statistics-based analyses (meta-analyses and cc-GWAS) under a common framework, as well as determine whether such frameworks can lead to novel analyses using only summary statistics. Second, we seek to compare existing summary-statistics-based analyses to each other and also our newly introduced method. Third, we present a novel approach to compute statistics that can be used to evaluate the performance of polygenic risk scores without accessing individual level genotype data. Our framework leverages a very straightforward observation: summary statistics information can be expressed as a function of case/control allele frequencies for each SNP. This allows us to recover case/control allele frequencies from summary statistics by solving a non-linear system of equations. Additionally, if one assumes that the SNPs satisfy Hardy-Weinberg Equilibrium (HWE) (a common and minimal assumption in all standard GWAS), the allele frequencies can be used to infer genotype counts. These simple observations allow us to use information from case-control GWAS summary statistics to develop a simple, user-friendly alternative to summary-statistics-based methods for fixed effect meta-analysis and cc-GWAS. Furthermore, we are able to compute group-wise polygenic risk score (PRS) from summary statistics of both a base and a target population. We note that even though there have been summary statistics based methods estimating the variance explained by SNPs using results from existing PRS associations^21,22, to the best of our knowledge, no existing method could return reliable estimates of PRS without any access to individual-level data in the validation cohort prior to our work.

Here, we describe the mathematical foundations of our framework and its applications to fixed effect meta-analysis, cc-GWAS, and group-wise PRS estimation. We compare the performance of existing methods as well as our novel method using both simulated and real data. Our methods are implemented in the software package Reconstructing Allelic Count (ReACt).

Results

Mathematical foundations

Our framework is motivated by the fact that the summary test statistics from publicly available GWAS can be expressed as a function of the counts of the effective and the non-effective allele in cases and controls; as a result, the allele counts can be exactly recovered by solving a system of non-linear equations. Interestingly, this rather straightforward observation has not received much attention in prior work. Additionally, assuming that SNPs included in GWAS are in Hardy-Weinberg Equilibrium (HWE), we can also reconstruct the structure of the genotype vectors for publicly available GWAS from just summary statistics. We can leverage this information in multiple applications, including: (i) computing the joint effect of a SNP in a meta-analysis involving multiple studies; (ii) obtaining the mean polygenic risk score of cases and controls in a population; and (iii) investigating the genetic differences between traits using a case-case GWAS. All of these can be done using only summary statistics, which circumvents the hassle of individual level data sharing and, as an added bonus, considerably reduces the necessary computational time.

We start by introducing some notation that will be useful in this section. Let a and u represent effective and non-effective allele counts respectively; let superscripts $^{\text {cse}}$ and $^{\text {cnt}}$ represent cases and controls respectively; and let OR, SE, and N be the odds ratio, standard error (of log(OR), as presented in most GWAS summary statistics), and sample sizes obtained from the summary statistics. Thus, for SNP i, $u_{i}^{\text {cnt}}$ represents the count of the non-effective allele in controls; similarly, $a_{i}^{\text {cse}}$ represents the count of the effective allele in cases; $N^{\text {cse}}$ represents the number of cases, etc. We now note that the allelic effect of SNP i in case-control GWAS summary statistics can be expressed as follows:

$$\begin{aligned} OR_{i}= & {} \frac{a_{i}^{\text {cse}} \cdot u_{i}^{\text {cnt}}}{a_{i}^{\text {cnt}} \cdot u_{i}^{\text {cse}}},\\ SE_{i}= & {} \sqrt{\frac{1}{a_{i}^{\text {cse}}} + \frac{1}{u_{i}^{\text {cse}}} + \frac{1}{a_{i}^{\text {cnt}}} + \frac{1}{u_{i}^{\text {cnt}}}} \end{aligned}$$

Additionally, sample sizes can be expressed as:

$$\begin{aligned} 2N^{\text {cse}}= & {} a_{i}^{\text {cse}} + u_{i}^{\text {cse}},\quad \text {and} \\ 2N^{\text {cnt}}= & {} a_{i}^{\text {cnt}} + u_{i}^{\text {cnt}}. \end{aligned}$$

Therefore, solving the system of the above four non-linear equations allows us to recover the allelic counts of SNP i for effective and non-effective alleles in cases and controls, by solving for the four unknowns $a_{i}^{\text {cse}}$, $a_{i}^{\text {cnt}}$, $u_{i}^{\text {cse}}$, and $u_{i}^{\text {cnt}}$. Using these counts, we can trivially obtain allele frequencies in case and control groups and, importantly, by further assuming that the SNPs strictly follow HWE, we can even compute the genotypic counts for each genotype from these frequencies. Note that this approach applies to GWAS reporting OR and SE statistics for each SNP, or perhaps other statistics that can be used to compute OR and SE; it may not be applicable to GWAS reporting other types of summary statistics. Furthermore, these frequencies will be different from those observed from individual level data due to model covariates; the recovered frequencies correspond to the allele counts after corrections have been applied. See Section 4.1 and 5.2 in supplementary text for details.
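
As an illustration of the system above, the following Python sketch inverts the four equations for a single SNP. It is not taken from the ReACt source code: the function names and the closed-form route are assumptions made for this example. Writing $w = a_{i}^{\text {cnt}}/u_{i}^{\text {cnt}}$ for the control odds of the effective allele, the case odds are $OR_i \cdot w$, and substituting into the SE equation gives a quadratic in $w$. Both positive roots reproduce the input OR and SE, so the sketch returns every candidate and leaves the choice between them (for example, by comparison with a reported allele frequency) to the caller; how ReACt resolves this is not described in this section.

```python
import math


def reconstruct_allele_counts(odds_ratio, se, n_cases, n_controls):
    """Solve the four equations above for the allele counts of one SNP.

    With w = a_cnt / u_cnt (control odds of the effective allele), the SE
    equation becomes A*w^2 + B*w + C = 0 with the coefficients below.
    Returns a list of candidate solutions, one per positive root.
    """
    n1 = 2.0 * n_cases     # total number of alleles in cases
    n0 = 2.0 * n_controls  # total number of alleles in controls
    a = 1.0 / n0 + odds_ratio / n1
    b = 2.0 / n0 + 2.0 / n1 - se ** 2
    c = 1.0 / n0 + 1.0 / (n1 * odds_ratio)
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("OR, SE and sample sizes are mutually inconsistent")
    solutions = []
    for sign in (-1.0, 1.0):
        w = (-b + sign * math.sqrt(disc)) / (2.0 * a)
        if w <= 0.0:
            continue
        v = odds_ratio * w  # odds of the effective allele in cases
        solutions.append({
            "a_cse": n1 * v / (1.0 + v), "u_cse": n1 / (1.0 + v),
            "a_cnt": n0 * w / (1.0 + w), "u_cnt": n0 / (1.0 + w),
        })
    return solutions


def hwe_genotype_counts(effect_allele_count, n_individuals):
    """Expected counts of genotypes with 0, 1 and 2 effective alleles under HWE."""
    p = effect_allele_count / (2.0 * n_individuals)
    q = 1.0 - p
    return [n_individuals * q * q,
            n_individuals * 2.0 * p * q,
            n_individuals * p * p]
```

A round trip (computing OR and SE from known counts, then inverting) is a convenient sanity check: one of the returned candidates should match the original counts up to floating-point error.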

Fixed effect meta-analysis

Our approach

Armed with allelic and genotypic counts, we can provide a new perspective on fixed-effect GWAS meta-analysis. Instead of the conventional inverse-variance weighted meta-analysis, we can now compute the joint effect of a SNP in a meta-analysis using multiple studies by combining the reconstructed allele and genotype counts from each study and running a complete logistic regression on each SNP. Thus, we can essentially proceed with the analysis in exactly the same way as in a standard GWAS (see “Fixed-effect meta-analysis” section for details).

As mentioned in “Mathematical foundations” section we can obtain genotypic counts for any SNP over cases and controls from GWAS summary statistics. Then, combining these counts for all available input studies, along with the trait status, we can carry out a logistic regression for this SNP as follows²³:

$$\begin{aligned} {{\mathsf {Pr}}}({\mathbf {y}}_j = 1 | {\mathbf {g}}_j,{\mathbf {s}}_j) = S(\beta _0 + \beta _1 {\mathbf {g}}_j + \beta _2 {\mathbf {s}}_j). \end{aligned}$$

In the above ${\mathbf {y}}_j$ denotes the binary trait for the jth individual, ${\mathbf {g}}_j$ denotes the respective genotype, and $S(\cdot )$ stands for the standard sigmoid function used in logistic regression. Solving for the coefficients $\beta _0$, $\beta _1$, and $\beta _2$ we get the overall SNP effect from the meta-analysis. In order to take into account between-study stratification, we introduce an additional variable ${\mathbf {s}}_j$ as a covariate, using the overall allele frequencies of each study to estimate it (see “Fixed-effect meta-analysis” section for details).
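
The regression above can be fitted directly on grouped genotype counts, without expanding them into individual records. The sketch below does this with Newton-Raphson (iteratively reweighted least squares) in NumPy. It is a simplified illustration rather than the ReACt implementation: the function names are hypothetical, genotype counts are generated from allele frequencies under HWE, and the covariate $s_j$ is replaced by a plain study index (0 or 1), which is an assumption that only makes sense for two studies; the paper instead estimates $s_j$ from the overall allele frequencies of each study.

```python
import numpy as np


def expected_genotype_counts(p, n):
    """Expected counts of genotypes with 0, 1, 2 effective alleles under HWE."""
    q = 1.0 - p
    return np.array([q * q, 2.0 * p * q, p * p]) * n


def fit_pooled_logistic(studies, n_iter=50, tol=1e-10):
    """Fit Pr(y=1 | g, s) = S(b0 + b1*g + b2*s) on grouped counts.

    studies: list of dicts with keys p_cse, p_cnt (effective allele
    frequencies) and n_cse, n_cnt (numbers of individuals).
    Returns (beta, se) for [intercept, genotype, study index].
    """
    rows, labels, weights = [], [], []
    for k, st in enumerate(studies):
        groups = ((1.0, st["p_cse"], st["n_cse"]),
                  (0.0, st["p_cnt"], st["n_cnt"]))
        for label, p, n in groups:
            for g, count in enumerate(expected_genotype_counts(p, n)):
                rows.append([1.0, float(g), float(k)])  # study index stands in for s_j
                labels.append(label)
                weights.append(count)
    X, y, w = np.array(rows), np.array(labels), np.array(weights)

    def fitted_and_hessian(beta):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        return mu, X.T @ (X * (w * mu * (1.0 - mu))[:, None])

    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu, hess = fitted_and_hessian(beta)
        step = np.linalg.solve(hess, X.T @ (w * (y - mu)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    _, hess = fitted_and_hessian(beta)
    se = np.sqrt(np.diag(np.linalg.inv(hess)))
    return beta, se
```

When both studies share the same allelic odds ratio and genotypes follow HWE, the log-odds of being a case are exactly linear in the genotype, so the fitted genotype coefficient in this sketch equals log(OR).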

Fixed effect meta-analysis: performance evaluation

First, we tested the performance of two leading methods used for fixed-effect meta-analysis (namely METAL²⁴ and ASSET²⁵) as well as ReACt on synthetic data under various conditions. The simulation was carried out using the Balding-Nichols model²⁶, assuming a minor allele frequency of 0.3. For each setting, we predefined the risk for effective alleles of the causal SNPs by setting $r = 1.15/1.2/1.3$ as well as the level of population stratification between cohorts included in the meta-analysis setting $F_{st} = 0.01/0.05/0.1$. Apart from meta-analyzing mutually exclusive datasets, we also tested the performance of all three methods under different extents of sample overlap between the input studies: When generating input summary statistics, we evaluated scenarios where the input studies shared $N_{\texttt {shr}}$ cases and $N_{\texttt {shr}}$ controls, with the value of $N_{\texttt {shr}}$ set to zero, 100, and 500 (see “Data” section for details). ASSET corrects for known sample overlap through introducing correlations between summary statistics derived from overlapping and input sample sizes^25,27. Since the latest stable release of METAL does not include an implementation for sample overlap correction, we used the GitHub version of METAL from²⁸. ReACt allows the user to provide the overlapping sample sizes as an input parameter (ReACt(Exact) in Figs. 1, 2). Furthermore, same as METAL, it allows the estimation of unknown sample overlap via Z-scores in input GWAS summary statistics from²⁸ (ReACt(Est.) in Figs. 1, 2). We compared power and type I error rates of all three tested approaches.
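
For readers unfamiliar with the Balding-Nichols model, the following sketch shows one common way to draw subpopulation allele frequencies and genotypes from it. It is an illustrative assumption, not the simulation code used in this study: the function names are hypothetical, and the way the risk $r$ of causal SNPs and the shared samples enter the simulation is described in the “Data” section and is not reproduced here. Under the model, the frequency of an allele in a subpopulation follows a Beta distribution with mean equal to the ancestral frequency and variance $F_{st} \cdot p(1-p)$.

```python
import numpy as np


def balding_nichols_frequencies(ancestral_freq, fst, n_snps, rng):
    """Draw per-SNP allele frequencies for one subpopulation."""
    a = ancestral_freq * (1.0 - fst) / fst
    b = (1.0 - ancestral_freq) * (1.0 - fst) / fst
    return rng.beta(a, b, size=n_snps)


def draw_genotypes(freqs, n_individuals, rng):
    """Draw 0/1/2 genotypes under HWE; rows are individuals, columns are SNPs."""
    return rng.binomial(2, freqs, size=(n_individuals, len(freqs)))


# Example: two cohorts diverging from an ancestral frequency of 0.3
rng = np.random.default_rng(0)
freqs_a = balding_nichols_frequencies(0.3, 0.05, 1000, rng)
freqs_b = balding_nichols_frequencies(0.3, 0.05, 1000, rng)
genotypes_a = draw_genotypes(freqs_a, 200, rng)
```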

The performance comparison on the meta-analysis of two studies with even case/control sample sizes is plotted in Figs. 1, 2 and Table S3. Performance on meta-analyzing two studies with uneven sample sizes (Table S2 in supplementary text) as well as meta-analyzing multiple studies (Table S4 in supplementary text) was also tested. Results on synthetic data indicated that all three methods have comparable performance, namely

$$\begin{aligned} \left| {{\mathrm {Power}}}_\mathbf{ReACt } - {{\mathrm {Power}}}_\mathbf{ASSET/METAL }\right| \le 0.012, \end{aligned}$$

when there is no sample overlap. In scenarios where there were samples shared across input studies, METAL and ReACt (regardless of whether the exact size of the sample overlap is known or is estimated) always showed higher power compared to ASSET

$$\begin{aligned}&0.014 \le {{\mathrm {Power}}}_\mathbf{ReACt/METAL } - {{\mathrm {Power}}}_\mathbf{ASSET } \le 0.219\\&\left| {{\mathrm {Power}}}_\mathbf{ReACt } - {{\mathrm {Power}}}_\mathbf{METAL }\right| \le 0.005. \end{aligned}$$

The advantage in power for our method and METAL compared to ASSET was more visible under higher $F_{st}$ values and larger sample overlaps. In terms of type I error rates, we observed that all methods showed good control on the error rates, while ASSET tended to produce more conservative results. Similar observations can also be made when we meta-analyzed multiple studies; see Table S4 in supplementary text for details.

Beyond power and type I error, we also analyzed the running time of the different methods (see Table S1 in supplementary text). METAL and ReACt far outperform ASSET in this regard. It should be noted that the C/C++ implementation of our method in the ReACt software package is comparable (in terms of running time) to METAL and much faster than ASSET, despite the fact that it has not been highly optimized for performance.

To demonstrate the scalability of ReACt beyond the Balding-Nichols model, we also looked at the performance of ReACt on phenotypes simulated using the UK biobank genotypes and the gcta tool²⁹ (see “Data” section for details). In our simulation, we set the heritability parameter to 0.4 and the disease prevalence to 0.2. We do note that, theoretically, the performance of ReACt should be invariant to disease heritability or prevalence. For this experiment, all causal SNPs were defined to have effect sizes $OR = 1.2$ and we evaluated the performance of all methods by comparing them with results from GWAS on all samples (see “Evaluation metrics” section for details). See Table 1 for our experimental results. We found that all methods had comparable power and type I error. More specifically, the performance of ReACt(Exact) and ASSET showed high similarity and so did the performance of ReACt(Est) and METAL.

Table 1 Performance of fixed-effect meta-analysis on simulated data using the gcta model.

We further tested the performance of all three methods on real genotype and phenotype data using the UK biobank dataset³⁰, analyzing the depressive episode trait. The dataset included a total of 18,368 cases and 312,849 controls, with 640,756 SNPs after quality control (see “Data” section for details). In this experiment, we treated the top 7 SNPs with p-value strictly less than $10^{-6}$ from the overall GWAS as “ground truth” and assessed whether each meta-analysis method could pick up these 7 SNPs. Each experiment was carried out over ten iterations: in each iteration, we split the dataset into two equal-sized subsets, generated GWAS summary statistics from each of the subsets, and meta-analyzed the resulting summary statistics. We report the average counts of true positive and false positive SNPs captured by each method over the ten iterations. Table 2 reports our findings and we note that, perhaps due to the lack of stratification, the differences in performance were not as visible in experiments using the UK biobank samples as in the Balding-Nichols simulation. A consistent outcome of both experiments was that ReACt(Exact) showed essentially identical performance to ASSET, whereas ReACt(Est) was more comparable with METAL. This should be expected given that both ReACt(Exact) and ASSET require the size of the sample overlap as input, unlike ReACt(Est) and METAL.

Table 2 Performance of fixed-effect meta-analysis on real genotype data.

cc-GWAS

Case-case GWAS (cc-GWAS) based on summary statistics has only very recently been described by Peyrot et al.³¹. No other methods have been proposed so far. cc-GWAS can be used to investigate the genetic differences between the patients of two diseases. With some assumptions on SNP effect distributions, Peyrot et al. described the case-case effect as a weighted sum of SNP effects from each input GWAS, where the weights could be derived from SNP-based heritabilities, prevalence, number of independent causal variants for each disease, and their genetic correlation. We observed that the framework of analysis that we introduced above, although requiring only minimal assumptions and nothing apart from the basic information that comes with GWAS summary statistics (SE, OR/Beta, and case/control sample sizes), could also be used to implement cc-GWAS under the same umbrella. We proceed here to describe this implementation and a comparison of the two methods.

Our approach

Similar to our proposed approach for meta-analysis of multiple GWAS datasets using summary statistics, we can also carry out cc-GWAS using regression by simply swapping the labels of the phenotypes. Perhaps the biggest challenge in cc-GWAS is the separation of the differential genetic effects from between-study stratification. To circumvent this issue, we leverage the difference of SNP effects in control groups to estimate the extent of stratification (see “cc-GWAS using summary statistics” section for details). Therefore, with a slight modification of the pipeline for meta-analysis of “Fixed-effect meta-analysis” section, we introduce an alternate approach for cc-GWAS using our framework.

The underlying theory is quite straightforward and allows us to estimate the genetic differences between two traits of interest using their GWAS summary statistics. Using the genotypic counts we can proceed with logistic regression using only the cases from the two studies:

$$\begin{aligned} {{\mathsf {Pr}}}({\mathbf {y}}_j^{{\texttt {cse}}} = 1 | {\mathbf {g}}^{{\texttt {cse}}}_j) = S(\beta _0^{{\texttt {cse}}} + \beta _1^{{\texttt {cse}}} {\mathbf {g}}^{{\texttt {cse}}}_j) \end{aligned}$$

In the above, ${\mathbf {y}}_j^{{\texttt {cse}}}$ is the binary indicator variable denoting which trait case j carries and ${\mathbf {g}}^{{\texttt {cse}}}_j$ is the genotype of this case. We note that in an additive model, the coefficient $\beta _1^{{\texttt {cse}}}$ that is part of the output of this regression is a combination of both genetic effects and stratification:

$$\begin{aligned} \beta _1^{{\texttt {cse}}} = \beta _g + \beta _s, \end{aligned}$$

where $\beta _g$ and $\beta _s$ are the genetic effect and stratification coefficients. We are only interested in the genetic effect $\beta _g$ and therefore we need to remove $\beta _s$. Towards that end, we estimate $\beta _s$ using the control samples from the input studies; see “cc-GWAS using summary statistics” section for details.
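
To make the decomposition $\beta _1^{{\texttt {cse}}} = \beta _g + \beta _s$ concrete, the sketch below uses one simple estimator built from reconstructed allele counts: the allelic log odds ratio between the two case groups as $\beta _1^{{\texttt {cse}}}$, and the allelic log odds ratio between the two control groups as $\beta _s$. This is an assumption made for illustration and is not necessarily the estimator used in ReACt, which is described in the “cc-GWAS using summary statistics” section. The function names are hypothetical, the standard errors are combined as if the two estimates were independent, and the default cutoff of 0.05 on ${{\mathrm {SE}}}_s$ mirrors the filter applied to the real data analysis in this paper.

```python
import math


def log_odds_ratio(a1, u1, a2, u2):
    """Allelic log odds ratio of group 1 vs group 2 and its standard error."""
    beta = math.log((a1 * u2) / (a2 * u1))
    se = math.sqrt(1.0 / a1 + 1.0 / u1 + 1.0 / a2 + 1.0 / u2)
    return beta, se


def case_case_effect(cases_1, cases_2, controls_1, controls_2, max_se_s=0.05):
    """Estimate beta_g = beta_cse - beta_s for one SNP.

    Each argument is an (effective_count, non_effective_count) pair.
    Returns (beta_g, se_g), or None when the stratification estimate
    is too noisy (SE_s above max_se_s).
    """
    beta_cse, se_cse = log_odds_ratio(*cases_1, *cases_2)
    beta_s, se_s = log_odds_ratio(*controls_1, *controls_2)
    if se_s > max_se_s:
        return None
    beta_g = beta_cse - beta_s
    # Assumes the case-based and control-based estimates are independent
    se_g = math.sqrt(se_cse ** 2 + se_s ** 2)
    return beta_g, se_g
```

With this particular estimator, a SNP whose allele frequencies differ between the two studies only because of stratification (that is, with the same case-control odds ratio in both studies) yields $\beta _g = 0$.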

CC-GWAS: performance evaluation

We first tested the performance of our methods on synthetic data. Simulated data were again generated under the Balding-Nichols model, with predefined risks for the effective allele of the causal SNPs and a predefined extent of stratification. Inspired by Peyrot et al.¹⁶ we simulated three types of SNPs: (i) trait differential SNPs; (ii) null SNPs; and (iii) stress SNPs (see “Data” section for details). We expect our method to pick up type (i) SNPs and leave the other two. Therefore, in our performance evaluation, we report the power for detecting the type (i) SNPs and type I error rates for picking up type (ii) and (iii) SNPs. Moreover, since we also expect the performance of our method, especially in terms of error control, to vary with sample size, the evaluation was done under different sample sizes in each input study (2000 cases and 2000 controls as well as 5000 cases and 5000 controls). Power and type I error rates for each type of SNP from the simulation model under each setting are shown in Table 3. The method’s performance was evaluated for p-values strictly less than $5 \cdot 10^{-5}$. For this threshold, our method showed high power and well-controlled type I errors, especially for lower values of $F_{st}$. On the other hand, as expected, as stratification increased between the two input studies, the power of our method dropped and the type I error rates increased for null SNPs. However, as a general trend, we also saw a decrease in such error rates when we increased the control sample size. Meanwhile, slightly higher type I error rates were observed for the stress SNPs.

Table 3 Performance of cc-GWAS as implemented in ReACt with different sample sizes.

Next, we evaluated the performance of our method on real GWAS summary statistics and compared our method with the recently released method of¹⁶. We analyzed BIP³² and SCZ³³ datasets, for which case-case GWAS with individual level data was available³⁴. We filtered out SNPs that showed untrustworthy estimates of the stratification effect (${{\mathrm {SE}}}_s > 0.05$, see “cc-GWAS using summary statistics” section for details). This reduced our output size from 8,983,436 SNPs being analyzed to 7,110,776 SNPs. Out of those, our analysis revealed a total of 18 genome-wide significant risk loci, including the two regions identified by³⁴ (namely regions 1q25.1 and 20q13.12). We compared our statistics for SNPs that were also analyzed in¹⁶ and results for this comparison are shown in Table 4. The two cc-GWAS methods are mostly comparable. By design, both our method and that of Peyrot et al.¹⁶ use only summary statistics as input and therefore could not apply the individual level quality control steps of³⁴. As a result, both methods identified additional significant loci showing divergent genetic effects between BD and SCZ compared to³⁴, mainly due to a much larger effective sample size. Results for all genome-wide significant risk loci are shown in Table S6.

Table 4 Comparison of genomic regions showing significant divergent genetic effects between BD and SCZ as detected by ReACt and ccGWAS by Peyrot et al.¹⁶.

Group PRS

Our approach

We realized that our new method opens up a new opportunity for summary-statistics-based analysis which was not possible before: even though we still cannot compute individual level PRS without access to raw genotypes, we observe that, under the additive model, the mean and standard deviation of PRS for a population are just functions of SNP allele frequencies in the target group (see “cc-GWAS and group PRS” section for details). Therefore, the novel summary-statistics-based framework for analysis, which returns estimates of allele frequencies for cases and controls using GWAS summary statistics, also allows us to estimate means and standard deviations of PRS for case and control groups using the GWAS summary statistics of the target study. With such information (and a fair assumption of normality in the underlying PRS distribution), we can further run a t-test in order to get a p-value comparing the difference of PRS between cases and controls.

More specifically, in the additive model, the mean and variance of PRS for a population can be expressed as follows:

$$\begin{aligned}&{{\mathrm {mean}}}({\text {PRS}}) = \frac{\sum _{i=1}^M S_i p_{i}}{M},\text { and }\\&{{\mathrm {Var}}}(\text {PRS}) = \frac{\sum _{i=1}^M S_i^2 p_iq_i}{2M^2}. \end{aligned}$$

In the above $S_i$ is the weight of SNP i inferred from the base summary statistics (typically $S_i = \frac{\log (OR_i)}{SE_i}$), M is the total number of SNPs used in the PRS computation, and $p_{i}$ and $q_{i} = 1-p_{i}$ are allele frequencies of the effective allele and the non-effective allele for SNP i. Therefore, we can simply use the allele frequencies of cases and controls that were computed in “Mathematical foundations” section in order to get the mean and variance of PRS in cases and controls. See “cc-GWAS and group PRS” section for details.
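
The two formulas above translate directly into code. The sketch below computes the group mean and variance of PRS from SNP weights and reconstructed allele frequencies, and then compares cases and controls. The function names are hypothetical, and the test statistic is an assumption chosen for this example: an unequal-variance (Welch-type) t statistic whose two-sided p-value is taken from the normal approximation, which is reasonable only for large groups. The variance formula treats SNPs as independent.

```python
import math


def group_prs_moments(weights, freqs):
    """Mean and variance of PRS in a group, from SNP weights S_i and
    effective allele frequencies p_i, following the formulas above."""
    m = len(weights)
    mean = sum(s * p for s, p in zip(weights, freqs)) / m
    var = sum(s * s * p * (1.0 - p) for s, p in zip(weights, freqs)) / (2.0 * m * m)
    return mean, var


def prs_group_test(weights, case_freqs, control_freqs, n_cases, n_controls):
    """Compare mean PRS between cases and controls.

    Returns (t, p_value); the p-value uses a normal approximation to the
    t distribution, assumed adequate for large sample sizes.
    """
    mean_cse, var_cse = group_prs_moments(weights, case_freqs)
    mean_cnt, var_cnt = group_prs_moments(weights, control_freqs)
    t = (mean_cse - mean_cnt) / math.sqrt(var_cse / n_cases + var_cnt / n_controls)
    p_value = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided
    return t, p_value
```

For instance, with weights `[0.1, -0.2]` and frequencies `[0.3, 0.5]`, `group_prs_moments` gives a mean of $(0.03 - 0.10)/2 = -0.035$ and a variance of $(0.0021 + 0.01)/8 = 0.0015125$.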

Group PRS: performance evaluation

We first tested our methods on synthetic data without any confounding factors (i.e., no stratification). After generating GWAS summary statistics for synthetic base and target datasets, we compared the group means and standard deviations estimated by our method (which operates on summary statistics) with the real group means and standard deviations of PRS computed from the individual level genotypes using PRSice2³⁵. The results showed that, in this scenario, our method is highly accurate. See Table 5, which shows representative results from our experimental evaluations; essentially identical results were observed in all our experiments on synthetic data.

Table 5 Estimated and real group mean and standard deviation of PRS for a synthetic target population.

We further tested our method on real GWAS data, using GWAS summary statistics for MDD³⁶ as the base study and assessing its predictive power on 18,368 independent depressive episode cases and 312,849 ancestry-matched controls in UK biobank. We did not choose the latest MDD GWAS as the base study because it includes samples from UK biobank. To run ReACt, we generated GWAS summary statistics for the target dataset as described. We compared the PRS statistics estimated by our methods with the real PRS statistics computed using PRSice2. The results are shown in Table 6; note that since real GWAS datasets are subject to within-study population stratification, we did not expect our method to be as accurate as it was on synthetic data without such stratification. There was, however, very high concordance between the results returned by our methods and the ground truth. Finally, we applied our methods to summary statistics of eight psychiatric disorders. We evaluated their pairwise PRS predictive power by estimating t-test p-values. For this experiment, we took into account potential sample overlap between all pairs of base and target studies; see Section 5.3 in supplementary text for details of our sample overlap correction procedure. Results are shown in Table 7 and we observe that, in general, our results coincide with the pairwise genetic correlation between disorders as discussed in⁷.

Table 6 Estimated and real group mean and standard deviation of PRS for depressive episode cases and controls in UK biobank population.

Table 7 Using our method to perform PRS comparisons across eight neuropsychiatric disorders.

Discussion

Extracting as much information as possible from easily accessible GWAS summary statistics can help accelerate research that aims to elucidate the genetic background of complex disease, allowing fast sharing of results and datasets while alleviating privacy concerns. In prior work, GWAS meta-analyses and cc-GWAS were treated as separate tasks with different theoretical foundations. In our work, we compare and evaluate leading methods and present a novel framework that unifies analyses under the same methodological umbrella, while expanding capabilities of summary-statistics-based analysis even further allowing, for the first time, group PRS estimation. Our methods do not affect the differential privacy established by sharing GWAS summary statistics³⁷. Moreover, as our allelic frequency reconstruction framework does not make any assumptions, our approach is unlikely to introduce additional bias into the results. However, just like any other summary-statistics-based method, it is still possible that the performance of ReACt might get affected by preexisting ascertainment bias from the input GWAS.

In terms of GWAS meta-analysis, we found that all three methods we tested are comparable in terms of power and type I error rates. However, both METAL and ReACt greatly outperform ASSET in terms of running time. The reconstruction of the allelic counts for each SNP in ReACt allows us to run a full logistic regression model instead of doing the conventional inverse-variance weighted fixed-effect meta-analysis, under the assumption of HWE. Our results on real GWAS data showed that just the standard HWE filtering threshold of $10^{-6}$ is needed, which is a typical quality control step in any GWAS. Note that this assumption is only used when we calculate genotype frequencies from the allelic frequencies, which is the case only in the fixed effect meta-analysis part of our work. Future work could explore whether we can further relax this threshold, or even remove this assumption. Our approach shows increased power in experiments on synthetic data, especially in cases where there is larger $F_{st}$ difference between the input studies, and provides robust results in real GWAS settings. One of the biggest concerns in GWAS meta-analysis is sample overlap between different studies. ASSET only allows correction for known sample overlap, whereas METAL’s development version is able to correct for unknown sample overlap. Our work here presents, for the first time, a thorough evaluation of correction for known and unknown sample overlap; our sample overlap correction is theoretically founded and more intuitive compared to previous methods²⁷.

We further propose a novel perspective on case-case association studies (cc-GWAS), allowing analysis without the need for complicated assumptions or side information apart from sample sizes. To the best of our knowledge, the only prior work on summary statistics based case-case GWAS is¹⁶. In our work, we achieve this objective in a straightforward manner: we directly compare the reconstructed allele frequencies of each SNP in two groups of cases, without the need to estimate heritabilities or prevalence of disorders as in¹⁶. The fact that case-case GWAS using ReACt analyzes each SNP independently also allows the user to run the analysis even when only a subset of the GWAS results are made available, which is a common scenario in practice and could not be addressed by prior work. Further, we do not need any extra assumptions on the distribution of SNP effects.

ReACt showed good control of type I errors in null SNPs (type II SNPs) given sufficiently large control sample sizes for both input studies. In practice, our experiments demonstrated that we can get accurate results with 2000 controls from each input GWAS, which is a reasonable sample size in modern GWAS. It also shows slightly higher, but under-controlled, type I errors in the stress test SNPs (type III SNPs), which is also observed by the method of¹⁶. As also pointed out by¹⁶, we do not expect the existence of stress SNPs to be particularly common in practice.

A notable difference between our method and the work of¹⁶ is that we do not filter for SNPs showing association due to differential tagging effects. While analyzing such SNPs, our method behaves more like a direct case-case GWAS using individual level data.

Our framework also introduces a novel perspective on case-control PRS. Conventionally, PRS for a target study is only accessible from individual level genotype data. However, we notice that the group means and standard errors of PRS can in fact be estimated using only summary statistics of both the base and target studies. With such statistics available, a t-test can be carried out in place of logistic regression, which is commonly used for predictability evaluation when the individual level PRS are available. It is worth noting that, for case-control studies, t-tests and logistic regression are testing the same hypothesis: whether scores generated from the SNP effect of a base study can differentiate individuals in the target study, or, equivalently, whether the base study can predict the case/control status of samples in the target study. We applied our method to summary statistics of eight psychiatric disorders from PGC for predicting group PRS and found the results in general concordance with the genetic correlation obtained by the work of Lee et al.⁷.

In our work, PRS evaluations use the p-value based clumping and thresholding (PC+T) approach. However, the methodology underlying the ReACt group PRS can be easily adapted to any other PRS computation model³⁸ (e.g., SBLUP³⁹, LDpred⁴⁰, PRS-CS⁴¹, SBayesR⁴², and other Bayesian methods⁴³).

It is worth noting that given SNP effect sizes and weights as input, ReACt does not require the base summary statistics to be from a case-control GWAS because, in PRS computations, the base summary statistics provide the predictor weights and we do not need to convert them back into allele frequencies. This fact makes ReACt easily applicable on any of the aforementioned SNP re-weighting schemes. To date, most PRS improvements target the selection and prioritization of SNPs or the adjusting of the weights to build a better prediction model using the base study. Our work contributes from a different perspective: it allows the user to evaluate the performance of models without access to individual level genotype data. Moreover, results from group PRS using our approach can be further connected with²² to quantify the predisposition to a particular disorder that is explained by a certain SNP set. Finally, a notable feature of ReACt is that, theoretically, it can handle known and unknown sample overlap between base and target populations; to the best of our knowledge, this is done for the first time for PRS computations. Sample overlap has long been known as a problem in PRS and our approach provides a good starting point for future work. We do note that, recently, a different method has also been proposed to specifically correct the inflation due to known sample overlap between base and target studies in PRS evaluations with individual level data⁴⁴. An interesting future research direction would be to combine the results of⁴⁴ with summary statistic methods such as ReACt.

Our framework is robust against within-study stratification effects. However, users should keep in mind that general rules of thumb for conventional PRS also apply to our method. For instance, the SNPs used for PRS computations are expected to be independent to a certain extent (clump/prune/LASSO shrink the summary statistics)¹⁹ and the predictive power of output PRS will be subject to the power of the base study²¹ and the p-value threshold chosen by the user.

Our work opens many future research directions. First, the reconstruction scheme that our framework is built upon is based on input summary statistics that are generated using a logistic regression or a $\chi ^2$-test. We have not yet explored how to adapt our framework to operate on summary statistics from other models. Theoretically, all we need is GWAS summary statistics that can be converted into OR and SE for each SNP. There exist summary statistics-based methods transforming GWAS results obtained from linear mixed model association to odds ratio⁴⁵, and it will be interesting to further explore how such methods could interface with our approach. Also, our meta-analysis module only investigated the most straightforward application of ReACt in a fixed-effect model. It would be interesting to explore methodologies that carry out random-effect meta-analyses using the same framework. Another interesting topic for future work would be to incorporate information beyond GWAS summary statistics. For example, one could consider incorporating external information such as LD structure using LD reference maps, or eQTL and SNP to gene annotations. Such information could be used to improve the accuracy of sample overlap estimation and to extend the group-PRS applications. Furthermore, although outside the scope of our analysis here, one could investigate expanding towards methods that perform haplotype (instead of genotype) reconstruction. Overall, our work here highlights the power of summary-statistics-based methodology and opens up additional avenues for research.

Methods

Our framework

Notation

Prior to introducing our methods, we discuss notational conventions. We will reserve the subscript i to denote SNP number: given, say, M SNPs, i will range between one and M. Similarly, we will reserve the subscript $\ell $ to denote the study number: given L studies from which summary statistics will be meta-analyzed, $\ell $ will range between one and L. We assume that all L studies released summary statistics on a common set of M SNPs. For simplicity, we will first describe our methods for the case $L=2$ (i.e., when exactly two studies are jointly meta-analyzed) and we will generalize our approach in “Meta-analyzing multiple datasets” section for $L > 2$.

We will use the three-letter shorthand cse for cases and the three-letter shorthand cnt for controls. We reserve the variable a to represent counts of the effective allele and the variable u to represent counts of the non-effective allele. We also reserve the variable N to represent counts for the number of cases or controls. Given the above conventions, we now present the following table of allele counts (effective and non-effective allele) for SNP i ($i=1\ldots M$) in study $\ell $ ($\ell =1\ldots L$) (Table 8).

Table 8 Table of allele counts for SNP i ($i=1\ldots M$) in the $\ell $ -th GWAS ($\ell =1\ldots L$).

Using the above table, we can also compute the frequencies of the effective or non-effective allele in cases and controls. Table 9 summarizes frequency notation for SNP i ($i=1\ldots M$) in study $\ell $ ($\ell =1\ldots L$).

Table 9 Notations and definitions of (effective or non-effective) allele frequencies in cases and controls.

Obviously,

$$\begin{aligned} p_{i \ell }^{{\texttt {cse}}}+q_{i \ell }^{{\texttt {cse}}}&=1\\ p_{i \ell }^{{\texttt {cnt}}}+q_{i \ell }^{{\texttt {cnt}}}&=1. \end{aligned}$$

Reconstructing allele counts

Using Table 8, notice that the odds ratio (OR) and its corresponding standard error (SE) for SNP i in study $\ell $ are given by the following formulas:

$$\begin{aligned} OR_{i\ell }&= \frac{a_{i\ell }^{\text {cse}} \cdot u_{i\ell }^{\text {cnt}}}{a_{i\ell }^{\text {cnt}} \cdot u_{i\ell }^{\text {cse}}}, \end{aligned}$$

(1)

$$\begin{aligned} SE_{i \ell }&= \sqrt{\frac{1}{a_{i\ell }^{\text {cse}}} + \frac{1}{u_{i\ell }^{\text {cse}}} + \frac{1}{a_{i\ell }^{\text {cnt}}} + \frac{1}{u_{i\ell }^{\text {cnt}}}}. \end{aligned}$$

(2)

Additionally,

$$\begin{aligned} 2N^{\text {cse}}_{\ell }&= a_{i\ell }^{\text {cse}} + u_{i\ell }^{\text {cse}},\quad \text {and} \end{aligned}$$

(3)

$$\begin{aligned} 2N^{\text {cnt}}_{\ell }&= a_{i\ell }^{\text {cnt}} + u_{i\ell }^{\text {cnt}}. \end{aligned}$$

(4)

By solving the system of non-linear Eqs. (1), (2), (3), and (4), we can recover $a_{i\ell }^{\text {cse}}$, $u_{i\ell }^{\text {cse}}$, $a_{i\ell }^{\text {cnt}}$, and $u_{i\ell }^{\text {cnt}}$ for SNP i in study $\ell $. Notice that $OR_{i\ell }$, $SE_{i\ell }$, $N^{\text {cse}}_{\ell }$, and $N^{\text {cnt}}_{\ell }$ are available from summary statistics. See Appendix 5.2 for details on solving the aforementioned system of non-linear equations.
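As a hedged illustration of how such a system can be solved (not necessarily the procedure described in the Appendix), one can write $w = a_{i\ell }^{\text {cnt}} / u_{i\ell }^{\text {cnt}}$ for the odds of the effective allele in controls. Substituting Eqs. (1), (3), and (4) into the square of Eq. (2) and multiplying by $w$ gives a quadratic $A w^2 + B w + C = 0$ with $A = \frac{1}{2N^{\text {cnt}}} + \frac{OR}{2N^{\text {cse}}}$, $B = \frac{1}{N^{\text {cnt}}} + \frac{1}{N^{\text {cse}}} - SE^2$, and $C = \frac{1}{2N^{\text {cnt}}} + \frac{1}{2N^{\text {cse}} \cdot OR}$. The function name below is hypothetical, and returning both positive roots is a choice of this sketch: each root satisfies Eqs. (1)-(4), and how to select between them is not addressed here.

```python
import math


def reconstruct_allele_counts(odds_ratio, se, n_cse, n_cnt):
    """Return candidate (a_cse, u_cse, a_cnt, u_cnt) tuples satisfying Eqs. (1)-(4).

    Illustrative sketch: solves a quadratic in w = a_cnt / u_cnt.
    """
    a = 1.0 / (2 * n_cnt) + odds_ratio / (2 * n_cse)
    b = 1.0 / n_cnt + 1.0 / n_cse - se ** 2
    c = 1.0 / (2 * n_cnt) + 1.0 / (2 * n_cse * odds_ratio)
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real solution for these summary statistics")

    solutions = []
    root = math.sqrt(disc)
    for w in ((-b - root) / (2 * a), (-b + root) / (2 * a)):
        if w <= 0:
            continue  # odds must be positive
        # Control odds are w; case odds are OR * w by Eq. (1).
        a_cnt = 2 * n_cnt * w / (1 + w)
        a_cse = 2 * n_cse * odds_ratio * w / (1 + odds_ratio * w)
        solutions.append((a_cse, 2 * n_cse - a_cse, a_cnt, 2 * n_cnt - a_cnt))
    return solutions
```

For example, summary statistics computed from the counts (300, 700, 400, 1600) with 500 cases and 1000 controls yield two candidate tables, one of which is the original.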

Reconstructing genotype counts

Given the reconstructed allele counts of “Reconstructing allele counts” section, we can now reconstruct genotype counts for SNP i in the $\ell $-th study. In order to do this, we need to assume that SNP i is in HWE in both case and control groups of study $\ell $. Note that a well-performed GWAS should have SNPs drastically violating HWE filtered out. As demonstrated in our results, SNPs with HWE p-value larger than $10^{-6}$ (a common threshold applied in most GWAS) do not affect the performance of ReACt in practice. More precisely, assume that for SNP i in study $\ell $ we have reconstructed its allele table count (Table 8). Then, by assuming that this SNP is in HWE in study $\ell $, we can compute the number of cases and controls that exhibit a particular genotype. Recall that there are three possible genotypes: $A_1A_1$, $A_1A_2$, and $A_2A_2$. We will represent each genotype by counting the number of copies of the effective allele in each genotype. Thus, $A_1A_1$ will correspond to two, $A_1A_2$ will correspond to one, and $A_2A_2$ will correspond to zero.

Following our notational conventions from “Notation” section, we can now compute the entries in Table 10 of genotype counts for SNP i in study $\ell $.

Table 10 Genotype counts for cases and controls for SNP i in study $\ell $.

It is worth noting that

$$\begin{aligned} N_{\ell }^{\text {cse}}&= N_{i\ell }^{\text {cse}}(0) + N_{i\ell }^{\text {cse}}(1) + N_{i\ell }^{\text {cse}}(2), \end{aligned}$$

(5)

$$\begin{aligned} N_{\ell }^{\text {cnt}}&= N_{i\ell }^{\text {cnt}}(0) + N_{i\ell }^{\text {cnt}}(1) + N_{i\ell }^{\text {cnt}}(2). \end{aligned}$$

(6)

Next, we reconstruct the genotype vector for SNP i in study $\ell $ as follows:

$$\begin{aligned} {\mathbf {g}}_{i\ell } = \left[ \begin{array}{cccccc} \underbrace{0\ldots 0}_{N_{i\ell }^{\text {cse}}(0)}&\underbrace{1\ldots 1}_{N_{i\ell }^{\text {cse}}(1)}&\underbrace{2\ldots 2}_{N_{i\ell }^{\text {cse}}(2)}&\underbrace{0\ldots 0}_{N_{i\ell }^{\text {cnt}}(0)}&\underbrace{1\ldots 1}_{N_{i\ell }^{\text {cnt}}(1)}&\underbrace{2\ldots 2}_{N_{i\ell }^{\text {cnt}}(2)} \end{array}\right] . \end{aligned}$$

Using Eqs. (5) and (6), it is easy to conclude that the vector ${\mathbf {g}}_{i\ell }$ has a total of

$$\begin{aligned} N_{\ell }^{\text {cse}}+N_{\ell }^{\text {cnt}} \end{aligned}$$

entries, which is equal to the number of samples (cases plus controls) included in the $\ell $-th study. We can also form the response vector ${\mathbf {y}}_{\ell }$ for the $\ell $-th study, indicating whether a sample is a case (i.e., one) or a control (i.e., zero) as follows:

$$\begin{aligned} {\mathbf {y}}_{\ell } = \left[ \begin{array}{cc} \underbrace{1\ldots 1}_{N_{\ell }^{\text {cse}}}&\underbrace{0\ldots 0}_{N_{\ell }^{\text {cnt}}} \end{array}\right] . \end{aligned}$$

(7)

Note that the vectors ${\mathbf {y}}_{\ell }$ and ${\mathbf {g}}_{i\ell }$ have the same dimensions (same number of entries). It should be clear that the vector ${\mathbf {y}}_{\ell }$ is the same for all SNPs in the $\ell $-th study and hence does not depend on the SNP number i.

We conclude the section by discussing the construction of an indicator vector ${\mathbf {s}}$ that will denote the study from which a particular sample in our meta-analysis originated. For the sake of simplicity, assume that we meta-analyze summary statistics from two studies ($L=2$). Then, following the above discussion, we can construct the genotype vectors ${\mathbf {g}}_{i1}$ and ${\mathbf {g}}_{i2}$ and concatenate them to construct the overall genotype vector for the i-th SNP in both studies:

$$\begin{aligned} {\mathbf {g}}_i = \left[ {\mathbf {g}}_{i1} \ {\mathbf {g}}_{i2}\right] . \end{aligned}$$

Similarly, we can construct the overall response vector ${\mathbf {y}}$ for both studies:

$$\begin{aligned} {\mathbf {y}} = \left[ {\mathbf {y}}_{1} \ {\mathbf {y}}_{2}\right] . \end{aligned}$$

Notice that the vectors ${\mathbf {g}}_i$ and ${\mathbf {y}}$ have the same dimensions (number of entries), equal to the number of samples (cases plus controls) in both studies, i.e., equal to

$$\begin{aligned} N = N_{1}^{\text {cse}}+N_{1}^{\text {cnt}}+N_{2}^{\text {cse}}+N_{2}^{\text {cnt}}. \end{aligned}$$

We can now construct the indicator vector ${\mathbf {s}}$ as follows:

$$\begin{aligned} {\mathbf {s}} = \left[ \begin{array}{cc} \underbrace{0\ldots 0}_{N_{1}^{\text {cse}}+N_{1}^{\text {cnt}}}&\underbrace{1\ldots 1}_{N_{2}^{\text {cse}}+N_{2}^{\text {cnt}}} \end{array}\right] . \end{aligned}$$

Note that a value of zero in ${\mathbf {s}}$ indicates that the corresponding sample belongs to the first study while a value of one in ${\mathbf {s}}$ indicates that the corresponding sample belongs to the second study.
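The construction of ${\mathbf {g}}$, ${\mathbf {y}}$, and ${\mathbf {s}}$ for two studies can be sketched as follows. This is a minimal illustration under stated assumptions: the genotype counts are taken to be the usual HWE expectations ($Np^2$, $2Npq$, $Nq^2$), they are rounded to integers so that each group still sums to $N$, and all function names are hypothetical.

```python
import numpy as np


def hwe_genotype_counts(a, u):
    """Genotype counts (N(0), N(1), N(2)) under HWE from allele counts a and u."""
    n = int(round((a + u) / 2))           # number of individuals in the group
    p = a / (a + u)                       # effective allele frequency
    n2 = int(round(n * p * p))
    n1 = min(int(round(2 * n * p * (1 - p))), n - n2)
    n0 = n - n1 - n2                      # remainder keeps the total equal to n
    return n0, n1, n2


def study_vectors(a_cse, u_cse, a_cnt, u_cnt):
    """Genotype vector g and response vector y for one study (cases first)."""
    cse = hwe_genotype_counts(a_cse, u_cse)
    cnt = hwe_genotype_counts(a_cnt, u_cnt)
    g = np.repeat([0, 1, 2, 0, 1, 2], cse + cnt)
    y = np.repeat([1, 0], [sum(cse), sum(cnt)])
    return g, y


def two_study_vectors(study1, study2):
    """Concatenate two studies and add the binary study indicator s."""
    g1, y1 = study_vectors(*study1)
    g2, y2 = study_vectors(*study2)
    g = np.concatenate([g1, g2])
    y = np.concatenate([y1, y2])
    s = np.repeat([0, 1], [len(g1), len(g2)])
    return g, y, s
```

Each study is passed as a tuple `(a_cse, u_cse, a_cnt, u_cnt)`, i.e., one reconstructed allele count table per study.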

Fixed-effect meta-analysis

Logistic regression

We run logistic regression for each SNP separately; recall that we number SNPs in our meta-analysis from one up to M. For notational convenience and since we run logistic regression in an identical manner for each SNP, without loss of generality we focus on a single SNP. Let the genotype vector for the selected SNP be denoted by ${\mathbf {g}}$; let ${\mathbf {s}}$ be the study indicator vector; and let ${\mathbf {y}}$ be the response vector, as discussed in the previous section. Recall that all three vectors have the same dimensions (same number of entries), equal to N, namely the total number of cases and controls in both studies. Notice that we dropped the subscript i from the vector ${\mathbf {g}}$ for notational convenience, since our discussion in this section will focus on a fixed SNP i, without loss of generality.

Using notation from the previous section, while dropping the subscript i from the genotype vector ${\mathbf {g}}$, allows us to formulate logistic regression as follows:

$$\begin{aligned} {{\mathsf {Pr}}}({\mathbf {y}}_j = 1 | {\mathbf {g}}_j,{\mathbf {s}}_j) = S(\beta _0 + \beta _1 {\mathbf {g}}_j + \beta _2 {\mathbf {s}}_j), \end{aligned}$$

(8)

where $S(x) = (1+e^{-x})^{-1}$ is the sigmoid function; ${\mathbf {y}}_j$ denotes the jth entry of the vector ${\mathbf {y}}$; ${\mathbf {s}}_j$ denotes the jth entry of the vector ${\mathbf {s}}$; and $\beta _0$, $\beta _1$, and $\beta _2$ are the unknown coefficients of the logistic regression formulation. Here $\beta _0$ corresponds to the constant offset, $\beta _1$ corresponds to the genotype, and $\beta _2$ corresponds to the study-of-origin. We also highlight that ${\mathbf {g}}_j$ denotes the jth entry of the vector ${\mathbf {g}}$; recall once again that we dropped the subscript i from the genotype vector in this section. The range for all subscripts j for the above vectors is between one and N.

In order to further describe how logistic regression was implemented in our experiments, it will be convenient to introduce additional notation. Let $\mathbf {\beta }$ be the vector

$$\begin{aligned} \mathbf {\beta }^T&= [\beta _0\ \beta _1\ \beta _2], \end{aligned}$$

and let ${\mathbf {x}}$ be the vector

$$\begin{aligned} {\mathbf {x}}_j^T&= [1\ {\mathbf {g}}_j\ {\mathbf {s}}_j]. \end{aligned}$$

Thus, $\mathbf {\beta }$ is the vector of the (unknown) logistic regression coefficients, while ${\mathbf {x}}_j^T$ for all $j=1\ldots N$ is the vector representing the constant offset, the genotype, and the study origin for the jth sample in our meta-analysis. This allows us to rewrite Eq. (8) as follows:

$$\begin{aligned} {{\mathsf {Pr}}}({\mathbf {y}}_j = 1 | {\mathbf {g}}_j,{\mathbf {s}}_j) = S(\mathbf {\beta }^T \cdot {\mathbf {x}}_j). \end{aligned}$$

We can now compute the negative log-likelihood (NLL) function for $\mathbf {\beta }$ as follows:

$$\begin{aligned} NLL(\mathbf {\beta })&= -\sum _{j=1}^N \log {{\mathsf {Pr}}}({\mathbf {y}}_j | {\mathbf {x}}_j) \\&= -\sum _{j=1}^{N}\left[ {\mathbf {y}}_j \log S(\mathbf {\beta }^T \cdot {\mathbf {x}}_j)+(1-{\mathbf {y}}_j) \log (1-S(\mathbf {\beta }^T \cdot {\mathbf {x}}_j))\right] . \end{aligned}$$

Thus, $\mathbf {\beta }$ can be estimated using the Iterative Re-weighted Least Squares (IRLS) algorithm⁴⁶ as follows:

In the IRLS algorithm, we let ${\mathbf {D}}$ denote the diagonal $N \times N$ matrix whose diagonal entries are $d_1,d_2,\ldots ,d_N$; we let ${\mathbf {X}}$ denote the $N \times 3$ matrix whose rows are the vectors ${\mathbf {x}}_j^T$ for $j=1\ldots N$; and we let ${\mathbf {z}}$ denote the vector whose entries are the $z_j$ for $j=1\ldots N$. Using this notation, the matrix ${\mathbf {H}} = {\mathbf {X}}^T{\mathbf {D}}{\mathbf {X}}$ is the $3 \times 3$ Hessian matrix of this logistic regression problem. The algorithm iterates over $t=0,1,2,\ldots $ and terminates when our convergence criterion, namely the difference $\Vert \mathbf {\beta }^{t+1}-\mathbf {\beta }^t\Vert $ (here the sum of the absolute values of the three entries of the vector $\mathbf {\beta }^{t+1}-\mathbf {\beta }^t$), drops below the threshold $10^{-4}$, which is the same threshold as the one used by PLINK⁴⁷ for logistic regression.
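A minimal sketch of the IRLS iteration in this notation is given below. It assumes the standard textbook definitions $d_j = \mu _j (1-\mu _j)$ and $z_j = \mathbf {\beta }^T {\mathbf {x}}_j + ({\mathbf {y}}_j - \mu _j)/d_j$ with $\mu _j = S(\mathbf {\beta }^T {\mathbf {x}}_j)$, and the update $\mathbf {\beta }^{t+1} = ({\mathbf {X}}^T{\mathbf {D}}{\mathbf {X}})^{-1}{\mathbf {X}}^T{\mathbf {D}}{\mathbf {z}}$; these definitions, the zero starting point, and the function name are assumptions of the sketch rather than details taken from the text.

```python
import numpy as np


def irls_logistic(X, y, tol=1e-4, max_iter=100):
    """Fit logistic regression by IRLS; returns (beta, H) with H = X^T D X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])           # assumed starting point
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # sigmoid S(beta^T x_j)
        d = np.clip(mu * (1.0 - mu), 1e-12, None)  # diagonal of D
        z = eta + (y - mu) / d            # working response
        H = X.T @ (d[:, None] * X)
        beta_new = np.linalg.solve(H, X.T @ (d * z))
        # Convergence criterion: sum of absolute coefficient changes.
        converged = np.abs(beta_new - beta).sum() < tol
        beta = beta_new
        if converged:
            break
    # Recompute the Hessian at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    H = X.T @ ((mu * (1.0 - mu))[:, None] * X)
    return beta, H
```

Here `X` would hold the rows $[1\ {\mathbf {g}}_j\ {\mathbf {s}}_j]$ and `y` the case/control labels; the returned `H` can then be inverted to obtain standard errors.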

Note that a drawback of logistic regression is that it can produce anti-conservative results under imbalance, which, in our case, includes unbalanced sample sizes in cases and controls, as well as unbalanced sample sizes among input studies. We apply the Firth bias-corrected logistic regression test⁴⁸,⁴⁹ to correct the estimate under input imbalance (triggered by default when either the total case/control ratio or the maximum/minimum input sample size ratio is greater than or equal to 5). This approach has been reported to perform stably in both balanced and unbalanced studies, as well as with rare SNPs⁵⁰.

We conclude this section by discussing how to compute a p-value for the logistic regression formulation of Eq. (8). First, it is well-known that the standard error for the three coefficients of the logistic regression formulation can be computed by using the inverse of the Hessian matrix ${\mathbf {H}}$. In particular, the standard error for $\mathbf {\beta }_0$ is equal to $SE_0 = \sqrt{(\mathbf {H^{-1}})_{11}}$; the standard error for $\mathbf {\beta }_1$ is equal to $SE_1 = \sqrt{(\mathbf {H^{-1}})_{22}}$; and the standard error for $\mathbf {\beta }_2$ is equal to $SE_2 = \sqrt{(\mathbf {H^{-1}})_{33}}$. As is typical in association studies, we focus on $SE_1$, the standard error for the vector of genotypes, and compute the respective p-value for the SNP-under-study using the Wald test. More specifically, we find the corresponding p-value of a Z-distribution for the parameter $\left| \frac{\beta _1}{SE_1}\right| $.
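The Wald test described above can be sketched in a few lines. The function name is hypothetical, and the two-sided p-value is computed from the standard normal distribution via the complementary error function, which is one common way to evaluate it.

```python
import math

import numpy as np


def wald_test(beta, H, index=1):
    """Standard error, Z statistic, and two-sided p-value for beta[index].

    H is the Hessian X^T D X; index=1 selects the genotype coefficient
    (zero-based), i.e., the (2,2) entry of the inverse Hessian.
    """
    cov = np.linalg.inv(np.asarray(H, dtype=float))
    se = math.sqrt(cov[index, index])
    z = beta[index] / se
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_value
```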

Correcting for sample overlap (two studies)

Sample overlap between studies can lead to an under-estimation of the variance of the test statistic, which in turn inflates the test statistic and yields overly significant p-values. To prevent this from happening, we will use an “effective sample size” correction as follows. Assume that we are given Table 11, which details the number of overlapping samples between the two studies.

Table 11 Number of overlapping cases and controls between the two studies.

Using the counts in Table 11, the number of shared cases between the two studies is equal to:

$$\begin{aligned} N_{{\texttt {shr}}}^{{\texttt {cse}}} = N_{{\texttt {shr}}}^{{\texttt {cse-cse}}} + \frac{N_{{\texttt {shr}}}^{{\texttt {cse-cnt}}}+N_{{\texttt {shr}}}^{{\texttt {cnt-cse}}}}{2}. \end{aligned}$$

(9)

Notice that if the off-diagonal entries in Table 11 are equal to zero then the above number reduces, obviously, to $N_{{\texttt {shr}}}^{{\texttt {cse-cse}}}$. Similarly, we have the number of shared controls equal to:

$$\begin{aligned} N_{{\texttt {shr}}}^{{\texttt {cnt}}} = N_{{\texttt {shr}}}^{{\texttt {cnt-cnt}}} + \frac{N_{{\texttt {shr}}}^{{\texttt {cnt-cse}}}+N_{{\texttt {shr}}}^{{\texttt {cse-cnt}}}}{2}. \end{aligned}$$

(10)

Then, the correction is simply carried out by multiplying the case/control sample size of each input study by a “deflation factor” defined as follows:

$$\begin{aligned} \lambda _\ell ^{{\texttt {cse}}}&= \frac{N_\ell ^{{\texttt {cse}}}}{N_\ell ^{{\texttt {cse}}}+N_{{\texttt {shr}}}^{{\texttt {cse}}}}\\ \lambda _\ell ^{{\texttt {cnt}}}&= \frac{N_\ell ^{{\texttt {cnt}}}}{N_\ell ^{{\texttt {cnt}}}+N_{{\texttt {shr}}}^{{\texttt {cnt}}}}. \end{aligned}$$

We multiply the sample size for cases (respectively, controls) in each study $\ell $ by $\lambda _\ell ^{{\texttt {cse}}}$ (respectively, $\lambda _\ell ^{{\texttt {cnt}}}$) before proceeding with the logistic regression described in “Logistic regression” section. See⁵¹ for a similar correction strategy. We finally note that in practice the exact number of overlapping samples between two studies is usually not known. In this case, we followed the approach proposed in²⁸ to estimate the overlapping sample size.
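The arithmetic of Eqs. (9) and (10) and of the deflation factors can be sketched as follows, assuming the four overlap counts of Table 11 are known. The function and argument names are hypothetical.

```python
def shared_counts(cse_cse, cse_cnt, cnt_cse, cnt_cnt):
    """Numbers of shared cases and shared controls, following Eqs. (9) and (10)."""
    n_shr_cse = cse_cse + (cse_cnt + cnt_cse) / 2
    n_shr_cnt = cnt_cnt + (cnt_cse + cse_cnt) / 2
    return n_shr_cse, n_shr_cnt


def effective_sizes(n_cse, n_cnt, n_shr_cse, n_shr_cnt):
    """Deflated case and control sample sizes for one study."""
    lam_cse = n_cse / (n_cse + n_shr_cse)   # deflation factor for cases
    lam_cnt = n_cnt / (n_cnt + n_shr_cnt)   # deflation factor for controls
    return lam_cse * n_cse, lam_cnt * n_cnt
```

For instance, with 100 shared case-case samples, 20 case-control samples, and 300 shared control-control samples, a study with 1000 cases and 2000 controls is analyzed as if it had roughly 901 cases and 1732 controls.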

Meta-analyzing multiple datasets

We now extend our approach to meta-analyze more than two datasets. The main difference with our previously described approach is the handling of the indicator variable for multiple datasets. We can still reconstruct the genotype count for each input study in exactly the same way as in Table 10 as well as the response vector following Eq. (7). Therefore, when multiple studies are meta-analyzed, ${\mathbf {g}}_i$ and ${\mathbf {y}}$ become

$$\begin{aligned} {\mathbf {g}}_i&= \left[ {\mathbf {g}}_{i1} \ldots {\mathbf {g}}_{iL}\right] ,\\ {\mathbf {y}}&= \left[ {\mathbf {y}}_{1} \ldots {\mathbf {y}}_{L}\right] . \end{aligned}$$

The indicator vector ${\mathbf {s}}$ cannot be binary anymore. Intuitively, one may consider using L binary vectors, one for each input study. However, this approach would necessitate up to ${L(L-1)}/{2}$ vectors to encode pairwise sample overlap, increasing the computational complexity by $O(L^2)$. A simpler alternative is to use a categorical variable as the source study indicator. Note that in this case, different rankings of the studies can lead to completely different results. A straightforward idea is to encode the studies using their population allele frequencies, which can be computed via Table 8 as follows:

$$\begin{aligned} I_{i\ell } = \frac{a_{i\ell }^{\text {cse}} + a_{i\ell }^{\text {cnt}}}{a_{i\ell }^{\text {cse}}+a_{i\ell }^{\text {cnt}}+u_{i\ell }^{\text {cse}}+u_{i\ell }^{\text {cnt}}} \end{aligned}$$

Note that this encoding also controls for population stratification across multiple sample sources. Then, when analyzing L studies, the indicator vector ${\mathbf {s}}$ becomes:

$$\begin{aligned} {\mathbf {s}} = \left[ \begin{array}{cc} \underbrace{I_1\ldots I_1}_{N_{1}^{\text {cse}}+N_{1}^{\text {cnt}}} \ldots \underbrace{I_L\ldots I_L}_{N_{L}^{\text {cse}}+N_{L}^{\text {cnt}}} \end{array}\right] . \end{aligned}$$
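As a small illustration of this encoding, the sketch below computes the pooled effective allele frequency $I_{i\ell }$ of each study for one SNP and repeats it once per sample of that study. The function name and the input layout (one `(a_cse, u_cse, a_cnt, u_cnt)` tuple per study) are assumptions made for the example.

```python
import numpy as np


def study_indicator(allele_tables, sample_sizes):
    """Indicator vector s for one SNP across L studies.

    allele_tables: list of (a_cse, u_cse, a_cnt, u_cnt), one per study.
    sample_sizes:  list of N_cse + N_cnt, one per study.
    """
    values = [
        (a_cse + a_cnt) / (a_cse + a_cnt + u_cse + u_cnt)
        for a_cse, u_cse, a_cnt, u_cnt in allele_tables
    ]
    # Every sample of study l receives that study's pooled allele frequency.
    return np.repeat(values, sample_sizes)
```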

We can now proceed with the logistic regression as in “Logistic regression” section. In order to handle sample overlap across multiple studies, we use the subscript $(\cdot )_{\ell _1 \ell _2}$ to denote properties of shared samples between two studies $\ell _1$ and $\ell _2$. Then, generalizing Eqs. (9) and (10), we get, for each pair of input studies $\ell _1$ and $\ell _2$,

$$\begin{aligned} N_{\ell _1\ell _2}^{{\texttt {cse}}}&= N_{\ell _1\ell _2}^{{\texttt {cse-cse}}} + \frac{N_{\ell _1\ell _2}^{{\texttt {cse-cnt}}}+N_{\ell _1\ell _2}^{{\texttt {cnt-cse}}}}{2},\\ N_{\ell _1\ell _2}^{{\texttt {cnt}}}&= N_{\ell _1\ell _2}^{{\texttt {cnt-cnt}}} + \frac{N_{\ell _1\ell _2}^{{\texttt {cnt-cse}}}+N_{\ell _1\ell _2}^{{\texttt {cse-cnt}}}}{2}. \end{aligned}$$

Finally, for any study $\ell _1 =1 \ldots L$, the sample size correction is

$$\begin{aligned} \lambda _{\ell _1}^{{\texttt {cse}}}&= \frac{N_{\ell _1}^{{\texttt {cse}}}}{N_{\ell _1}^{{\texttt {cse}}}+ \sum _{\ell _2 \ne \ell _1}^L N_{\ell _1\ell _2}^{{\texttt {cse}}}},\\ \lambda _{\ell _1}^{{\texttt {cnt}}}&= \frac{N_{\ell _1}^{{\texttt {cnt}}}}{N_{\ell _1}^{{\texttt {cnt}}}+\sum _{\ell _2 \ne \ell _1}^L N_{\ell _1\ell _2}^{{\texttt {cnt}}}}. \end{aligned}$$

We can now apply $\lambda _{\ell _1}^{{\texttt {cse}}}$ to correct the sample size for cases in study $\ell _1$ and we can apply $\lambda _{\ell _1}^{{\texttt {cnt}}}$ to correct the sample size for controls and proceed with logistic regression.
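The multi-study correction can be sketched with a symmetric $L \times L$ matrix of pairwise shared counts, one matrix for cases and one for controls. The matrix layout and the function name are assumptions of this example; entry $(\ell _1, \ell _2)$ is taken to hold $N_{\ell _1\ell _2}^{{\texttt {cse}}}$ (or $N_{\ell _1\ell _2}^{{\texttt {cnt}}}$), and diagonal entries are ignored.

```python
import numpy as np


def deflation_factors(sizes, shared):
    """Per-study deflation factors lambda for cases (or for controls).

    sizes:  length-L sequence of N_l (cases or controls) per study.
    shared: L x L matrix of pairwise shared counts N_{l1 l2}.
    """
    sizes = np.asarray(sizes, dtype=float)
    shared = np.array(shared, dtype=float)      # copy so the input is untouched
    np.fill_diagonal(shared, 0.0)               # a study does not overlap itself
    return sizes / (sizes + shared.sum(axis=1))
```

Multiplying `sizes` by the returned factors gives the corrected sample sizes used in the logistic regression.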

cc-GWAS and group PRS

cc-GWAS using summary statistics

cc-GWAS is a straightforward approach to investigate the genetic differences between two traits. However, in practice, it is usually challenging and time-consuming, due to restrictions in individual level data sharing. Recently, a method for cc-GWAS that relies only on summary statistics has been proposed in¹⁶. We propose an alternative perspective on the summary-statistics-based cc-GWAS framework, using the foundations of “Reconstructing allele counts” section.

One of the biggest challenges of cc-GWAS is the differentiation of the genetic effects from trait-trait difference and population stratification. Assume that for a fixed SNP, we run logistic regression focusing only on the cases of the two studies. Let ${\mathbf {y}}^{{\texttt {cse}}}_j = 1$ denote that sample j is a case from the first study and let ${\mathbf {y}}^{{\texttt {cse}}}_j = 0$ denote that j is a case from the second study. Let ${\mathbf {g}}^{{\texttt {cse}}}_j$ be the genotype of the j-th case. Then,

$$\begin{aligned} {{\mathsf {Pr}}}({\mathbf {y}}_j^{{\texttt {cse}}} = 1 | {\mathbf {g}}^{{\texttt {cse}}}_j) = S(\beta _0^{{\texttt {cse}}} + \beta _1^{{\texttt {cse}}} {\mathbf {g}}^{{\texttt {cse}}}_j). \end{aligned}$$

(11)

The effect size $\beta _1^{{\texttt {cse}}}$ that is the output of logistic regression will include effects from the real genetic differences between trait 1 and trait 2 ($\beta _g$) as well as from population stratification ($\beta _s$). We can assume that these two effects are independent of each other:

$$\begin{aligned} \beta _1^{{\texttt {cse}}} = \beta _g + \beta _s. \end{aligned}$$

Assume that the control samples from studies one and two do not carry the traits of interest. Then, we can estimate the effect of population stratification by running another logistic regression, focusing only on controls from the two studies, as follows:

$$\begin{aligned} {{\mathsf {Pr}}}({\mathbf {y}}_j^{{\texttt {cnt}}} = 1 | {\mathbf {g}}^{{\texttt {cnt}}}_j) = S(\beta _0^{{\texttt {cnt}}} + \beta _s {\mathbf {g}}^{{\texttt {cnt}}}_j). \end{aligned}$$

(12)

In the above, ${\mathbf {y}}^{{\texttt {cnt}}}_j = 1$ denotes that sample j is a control from study one, ${\mathbf {y}}^{{\texttt {cnt}}}_j = 0$ denotes that j is a control from study two, and ${\mathbf {g}}^{{\texttt {cnt}}}_j$ denotes the genotype for the j-th control sample. From this logistic regression, we can get an estimate of the stratification effect $\beta _s$. Note that along with $\beta _s$, we will also get a standard error for the estimate of stratification ${{\mathrm {SE}}}_s$, which essentially corresponds to the sample size of controls in the two input studies. If we do not have a sufficient number of controls, ${{\mathrm {SE}}}_s$ will turn out to be large, indicating that the estimate for the stratification effect is not reliable and the results from the cc-GWAS should be interpreted carefully.

If ${{\mathrm {SE}}}_s$ is small enough, then it is reasonable to assume that the estimate of the stratification effect is credible and we can subsequently treat $\beta _s$ as a fixed value. Then, the genetic effect from the trait-trait difference that we are interested in is

$$\begin{aligned} \beta _g = \beta _1^{{\texttt {cse}}} - \beta _s. \end{aligned}$$

(13)

It now follows that the standard error of $\beta _g$ is

$$\begin{aligned} {{\mathrm {Var}}}(\beta _g) = {{\mathrm {Var}}}(\beta _1^{{\texttt {cse}}}) \implies {{\mathrm {SE}}}_{g} = {{\mathrm {SE}}}_1^{{\texttt {cse}}}, \end{aligned}$$

(14)

using the derivations of “Reconstructing genotype counts” section. Logistic regressions on cases (Eq. (11)) and controls (Eq. (12)) can be carried out as discussed in “Logistic regression” section, with minor changes (include only the designated samples; relabel the dependent variable; and remove the indicator variable). By running these two logistic regressions, we can compute $\beta _1^{{\texttt {cse}}}, \beta _s, SE_1^{{\texttt {cse}}}$, and $SE_s$. Then, using Eqs. (13) and (14), we can compute $\beta _g$ and $SE_g$ for each SNP. Similarly, we can also compute the corresponding p-value using a Z-distribution for $\left| \frac{\beta _g}{SE_g}\right| $.
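The per-SNP computation of Eqs. (11)-(14) can be sketched as below. To keep the example short, the two logistic regressions are fitted by Newton iterations directly on genotype counts $(N(0), N(1), N(2))$ rather than on expanded sample vectors, which yields the same likelihood; this shortcut, the tolerance, and the function names are assumptions of the sketch, not the implementation used in ReACt.

```python
import math

import numpy as np

GENOTYPES = np.array([0.0, 1.0, 2.0])


def fit_genotype_slope(counts_y1, counts_y0, tol=1e-10, max_iter=100):
    """Slope and SE of logit Pr(y=1 | g) = b0 + b1 * g from genotype counts.

    counts_y1[g] and counts_y0[g] are the numbers of samples with label 1
    and label 0 carrying g copies of the effective allele.
    """
    n1 = np.asarray(counts_y1, dtype=float)
    n0 = np.asarray(counts_y0, dtype=float)
    X = np.column_stack([np.ones(3), GENOTYPES])
    beta = np.zeros(2)
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = (n1 + n0) * mu * (1.0 - mu)
        H = X.T @ (weights[:, None] * X)               # Hessian
        step = np.linalg.solve(H, X.T @ (n1 - (n1 + n0) * mu))
        beta = beta + step
        if np.abs(step).sum() < tol:
            break
    return beta[1], math.sqrt(np.linalg.inv(H)[1, 1])


def cc_gwas_snp(cases1, cases2, controls1, controls2):
    """Return (beta_g, se_g, se_s, p_value) for one SNP."""
    beta_cse, se_cse = fit_genotype_slope(cases1, cases2)     # Eq. (11)
    beta_s, se_s = fit_genotype_slope(controls1, controls2)   # Eq. (12)
    beta_g = beta_cse - beta_s                                # Eq. (13)
    se_g = se_cse                                             # Eq. (14)
    p_value = math.erfc(abs(beta_g / se_g) / math.sqrt(2.0))
    return beta_g, se_g, se_s, p_value
```

The returned `se_s` is reported separately so that a large value can flag SNPs whose stratification estimate is unreliable.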

Mean PRS for cases and controls

Recall that the PRS for the t-th individual in the study is computed as:

$$\begin{aligned} \text {PRS}_{t} = \sum _{i=1}^M \frac{S_i \cdot g_{it}}{2M}, \end{aligned}$$

(15)

where $g_{it}$ is the genotype of the i-th SNP for the t-th individual and $S_i$ is the weight for SNP i, which is usually defined as

$$\begin{aligned} S_i = \log (\text {OR}_i^{{\texttt {base}}}), \end{aligned}$$

where $\text {OR}_i^{{\texttt {base}}}$ is the odds ratio of SNP i in the base summary statistics. Recall from “Notation” section that M is the total number of SNPs. Then, in order to compute the average PRS for, say, cases, we simply need to sum up the individual PRS and average over the number of cases. More precisely,

$$\begin{aligned} \text {PRS}^{{\texttt {cse}}} = \frac{1}{2 M N^{{\texttt {cse}}}}\sum _{t \in {\texttt {cse}}} \sum _{i=1}^M S_i \cdot g_{it}. \end{aligned}$$

where $N^{{\texttt {cse}}}$ is the number of cases in the target study. The above equation can be rewritten as

$$\begin{aligned} \text {PRS}^{{\texttt {cse}}} = \frac{1}{2 M N^{{\texttt {cse}}}}\sum _{i=1}^M S_i \sum _{t \in {\texttt {cse}}} g_{it}. \end{aligned}$$

Notice that in an additive model, ${\sum _{t \in {\texttt {cse}}} g_{it}}/{2N^{\texttt {cse}}}$ is the allele frequency of SNP i over all cases in the target study, which can be computed using only the summary statistics as shown in “Reconstructing genotype counts” section and Table 9. Thus, the mean PRS under an additive model for cases and controls can be computed as follows:

$$\begin{aligned} \text {PRS}^{{\texttt {cse}}}&= \frac{\sum _{i=1}^M S_i p_{i}^{{\texttt {cse}}}}{M},\\ \text {PRS}^{{\texttt {cnt}}}&= \frac{\sum _{i=1}^M S_i p_{i}^{{\texttt {cnt}}}}{M}. \end{aligned}$$

All relevant information for this computation can be easily obtained from the summary statistics of the base and/or target study.
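The identity between the individual-level mean PRS and the allele-frequency form can be checked numerically with a short sketch. The function name is hypothetical; `weights` stands for the $S_i$ and `freqs` for the effective allele frequencies $p_i$ of one group.

```python
import numpy as np


def mean_prs(weights, freqs):
    """Mean PRS of a group: (1/M) * sum_i S_i * p_i."""
    weights = np.asarray(weights, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    return float(np.mean(weights * freqs))
```

Given a toy genotype matrix `G` (individuals by SNPs), averaging the individual scores `G @ S / (2 * M)` from Eq. (15) gives the same value as `mean_prs(S, G.sum(axis=0) / (2 * N))`.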

Estimating the standard deviation of the PRS for cases and controls

Interestingly, we can also estimate the standard deviation of the PRS for cases and controls, even without individual level genotype information, under mild assumptions. First, from Eq. (15), we compute the variance of an individual’s PRS as follows:

$$\begin{aligned} {{\mathrm {Var}}}(\text {PRS}_{t})&= {{\mathrm {Var}}}(\sum _{i=1}^M \frac{S_i \cdot g_{it}}{2M}) \nonumber \\&= \frac{1}{4M^2}{{\mathrm {Var}}}(\sum _{i=1}^M S_i \cdot g_{it}). \end{aligned}$$

(16)

Recall that as a general step prior to the computation of PRS, it is recommended to prune or clump the SNPs used for the PRS computation. Therefore, our first assumption is that the $g_{it}$’s are pairwise independent. Then, Eq. (16) can be simplified as follows:

$$\begin{aligned} {{\mathrm {Var}}}(\text {PRS}_{t})&= \frac{\sum _{i=1}^M {{\mathrm {Var}}}(S_i \cdot g_{it})}{4M^2} \nonumber \\&= \frac{\sum _{i=1}^M S_i^2 {{\mathrm {Var}}}(g_{it})}{4M^2}. \end{aligned}$$

(17)

Notice that under an additive model, $g_{it}$ is a discrete random variable that only takes the values zero, one, and two. Consider all cases and, as in “Reconstructing genotype counts” section, assume that the SNPs are in HWE. Then, the distribution of $g_{it}$ in the cases is presented in Table 12.

Table 12 The probability distribution of $g_{it}$ for SNP i.


We can now compute the variance of $g_{it}$ in cases as follows:

$$\begin{aligned} {{\mathrm {Var}}}(g_{it})&= {{\mathrm {E}}}(g_{it}^2) - ({{\mathrm {E}}}g_{it})^2 \\&= (2p^{{\texttt {cse}}}_iq^{{\texttt {cse}}}_i+4(p^{{\texttt {cse}}}_i)^2) - (2p^{{\texttt {cse}}}_iq^{{\texttt {cse}}}_i+2(p^{{\texttt {cse}}}_i)^2)^2 \\&= (2p^{{\texttt {cse}}}_iq^{{\texttt {cse}}}_i+4(p^{{\texttt {cse}}}_i)^2) - (2p^{{\texttt {cse}}}_i(p^{{\texttt {cse}}}_i+q^{{\texttt {cse}}}_i))^2 \\&= 2p^{{\texttt {cse}}}_iq^{{\texttt {cse}}}_i+4(p^{{\texttt {cse}}}_i)^2 - 4(p^{{\texttt {cse}}}_i)^2 = 2p^{{\texttt {cse}}}_iq^{{\texttt {cse}}}_i. \end{aligned}$$
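The identity ${{\mathrm {Var}}}(g_{it}) = 2pq$ can be checked numerically. The sketch below assumes that the genotype distribution is the standard HWE one (probabilities $q^2$, $2pq$, and $p^2$ for zero, one, and two copies of the effect allele), which is consistent with the expectation terms used in the derivation above; the function names are hypothetical.

```python
def hwe_genotype_distribution(p):
    """Genotype probabilities under HWE for effect-allele frequency p."""
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}

def genotype_variance(p):
    """Var(g) = E(g^2) - (E g)^2, computed directly from the distribution."""
    dist = hwe_genotype_distribution(p)
    mean = sum(g * prob for g, prob in dist.items())
    second_moment = sum(g * g * prob for g, prob in dist.items())
    return second_moment - mean ** 2

# Compare the direct computation with the closed form 2pq.
for p in (0.05, 0.3, 0.5):
    print(p, genotype_variance(p), 2.0 * p * (1.0 - p))
```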

Substituting into Eq. (17), we get

$$\begin{aligned} {{\mathrm {Var}}}(\text {PRS}^{{\texttt {cse}}})&= \frac{\sum _{i=1}^M S_i^2( 2p^{{\texttt {cse}}}_iq^{{\texttt {cse}}}_i)}{4M^2}. \end{aligned}$$

Similarly, we can compute the estimated variance of $\text {PRS}^{{\texttt {cnt}}}$ for controls and of $\text {PRS}$ for the overall population of the target study. To summarize, our estimates are

$$\begin{aligned} {{\mathrm {Var}}}(\text {PRS}^{{\texttt {cse}}})&= \frac{\sum _{i=1}^M S_i^2 p_i^{{\texttt {cse}}}q_i^{{\texttt {cse}}}}{2M^2},\\ {{\mathrm {Var}}}(\text {PRS}^{{\texttt {cnt}}})&= \frac{\sum _{i=1}^M S_i^2 p_i^{{\texttt {cnt}}}q_i^{{\texttt {cnt}}}}{2M^2},\\ {{\mathrm {Var}}}(\text {PRS})&= \frac{\sum _{i=1}^M S_i^2 p_iq_i}{2M^2}. \end{aligned}$$

Here $p_i$ is the frequency of allele $A_1$ for SNP i in all samples of the target study, and can be computed as:

$$\begin{aligned} p_i&= \frac{N^{{\texttt {cse}}} p_i^{{\texttt {cse}}} + N^{{\texttt {cnt}}}p_i^{{\texttt {cnt}}}}{N^{{\texttt {cse}}}+N^{{\texttt {cnt}}}}, \\ q_i&= 1-p_i. \end{aligned}$$
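The variance estimates and the pooled allele frequency can be sketched in a few lines. The helper names `pooled_frequency` and `prs_variance`, the weights, and the frequencies below are hypothetical illustration values; the code only assumes pairwise independent SNPs in HWE, as stated above.

```python
def pooled_frequency(p_cse, p_cnt, n_cse, n_cnt):
    """Sample-size-weighted allele frequency over all target samples."""
    return (n_cse * p_cse + n_cnt * p_cnt) / (n_cse + n_cnt)

def prs_variance(weights, allele_freqs):
    """Estimated PRS variance: sum_i S_i^2 * p_i * q_i / (2 * M^2)."""
    m = len(weights)
    total = sum(s * s * p * (1.0 - p) for s, p in zip(weights, allele_freqs))
    return total / (2.0 * m * m)

# Hypothetical toy inputs: SNP weights S_i and reconstructed allele frequencies.
weights = [0.18, -0.11, 0.05]
p_cse = [0.32, 0.18, 0.41]
p_cnt = [0.30, 0.20, 0.40]
n_cse, n_cnt = 1000, 1000

p_all = [pooled_frequency(a, b, n_cse, n_cnt) for a, b in zip(p_cse, p_cnt)]
var_cse = prs_variance(weights, p_cse)
var_cnt = prs_variance(weights, p_cnt)
var_all = prs_variance(weights, p_all)
print(var_cse, var_cnt, var_all)
```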

We can now apply a t-test in order to obtain a p-value for the difference between the PRS distributions in cases and controls. Given the estimated group means and standard deviations for cases and controls, we can further assume that the individual level PRS follow a normal distribution in each group and use the t-test statistic as follows:

$$\begin{aligned} t = \frac{\text {PRS}^{{\texttt {cse}}}-\text {PRS}^{{\texttt {cnt}}}}{\sqrt{{{\mathrm {Var}}}(\text {PRS})} \cdot \sqrt{\frac{1}{N^{{\texttt {cse}}}} + \frac{1}{N^{{\texttt {cnt}}}}}}. \end{aligned}$$

Finally, the degrees of freedom are given by $df = N^{{\texttt {cse}}} + N^{{\texttt {cnt}}} - 2$.
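The test statistic and degrees of freedom can be computed directly from the group means, the overall variance estimate, and the group sizes. The sketch below is illustrative only: the function name and the toy numbers are hypothetical, and the two-sided p-value uses a standard normal approximation to the t distribution (an assumption that is reasonable only when the degrees of freedom are large), not an exact t-distribution tail.

```python
import math

def group_prs_t_test(mean_cse, mean_cnt, var_all, n_cse, n_cnt):
    """Return (t, df, approximate two-sided p-value) for the group PRS difference."""
    se = math.sqrt(var_all) * math.sqrt(1.0 / n_cse + 1.0 / n_cnt)
    t = (mean_cse - mean_cnt) / se
    df = n_cse + n_cnt - 2
    # Assumption: df is large, so the t distribution is approximated
    # by a standard normal when converting t to a p-value.
    p_approx = math.erfc(abs(t) / math.sqrt(2.0))
    return t, df, p_approx

# Hypothetical toy inputs.
t, df, p_approx = group_prs_t_test(0.02, 0.01, 0.01, 200, 200)
print(t, df, p_approx)
```

For small studies, an exact t-distribution tail probability from a statistics library would be the safer choice.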

Experiments

Data

Synthetic data. We used the Balding-Nichols model²⁶,⁵² for synthetic genotype generation, assuming a minor allele frequency (MAF) of 0.3 for each SNP and a relative risk r (r = 1.15/1.2/1.3) for the effective alleles of the causal SNPs in each population. The simulation was carried out under a range of $F_{st}$ values ($F_{st} = 0.01/0.05/0.1$). For the fixed-effect meta-analysis, we simulated 1000 cases and 1000 controls for each input study. A total of 100,000 SNPs were generated, out of which 1000 are causal SNPs with the predefined risk for the effective alleles. Moreover, on top of the independent populations, we also evaluated the performance of ReACt in the presence of sample overlap by introducing a predefined number of samples shared between each pair of input studies (100 cases, 100 controls overlap; or 500 cases, 500 controls overlap).
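For readers unfamiliar with the Balding-Nichols model, the following sketch draws a population-specific allele frequency from a Beta distribution parameterized by the ancestral frequency and $F_{st}$, and then samples genotypes. It is not the authors' simulator. The case allele frequency $rp/(1-p+rp)$ is one common convention for a multiplicative relative risk r and is an assumption here, as are all function names and the seed.

```python
import random

def balding_nichols_frequency(ancestral_freq, fst, rng):
    """Draw a population allele frequency from Beta(p(1-F)/F, (1-p)(1-F)/F)."""
    a = ancestral_freq * (1.0 - fst) / fst
    b = (1.0 - ancestral_freq) * (1.0 - fst) / fst
    return rng.betavariate(a, b)

def case_allele_frequency(p, relative_risk):
    """Assumed risk model: effect-allele frequency among cases is r*p / (1 - p + r*p)."""
    return relative_risk * p / (1.0 - p + relative_risk * p)

def draw_genotype(p, rng):
    """Sample an additive genotype (0, 1, or 2) as two independent allele draws."""
    return sum(rng.random() < p for _ in range(2))

rng = random.Random(2022)  # fixed seed so the illustration is reproducible
p_pop = balding_nichols_frequency(0.3, 0.05, rng)   # one causal SNP, F_st = 0.05
p_case = case_allele_frequency(p_pop, 1.2)          # relative risk r = 1.2
controls = [draw_genotype(p_pop, rng) for _ in range(1000)]
cases = [draw_genotype(p_case, rng) for _ in range(1000)]
print(p_pop, sum(controls) / 2000.0, sum(cases) / 2000.0)
```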

To further demonstrate the scalability of ReACt, we evaluated its performance on UK biobank samples with phenotypes simulated using the gcta tool²⁹. The simulation was carried out using quality controlled genotypes (removing SNPs and individuals showing a missing rate larger than 0.02 and SNPs strongly violating the Hardy-Weinberg equilibrium with a p-value smaller than $10^{-6}$), using a predefined trait heritability equal to 0.4 and prevalence equal to 0.2. We simulated 50,000 cases and 250,000 controls, each genotyped on 634,758 SNPs, out of which 1000 SNPs were randomly selected to be causal with effect size OR equal to 1.2. In each iteration, we split the samples into two equal sized subsets, each with 25,000 cases and 125,000 controls. Similarly to our experiments on the Balding-Nichols model, we tested the performance under various degrees of sample overlap.

For the cc-GWAS, inspired by¹⁶, we used the same simulation model but introduced three types of SNPs for a thorough evaluation of the method’s robustness: (i) SNPs with non-zero effect in only one of the studies and zero effect in the other; (ii) SNPs with zero effect in both input studies; and (iii) SNPs with the same non-zero effect size (predefined r) in both input studies. All of the three types of SNPs would suffer from population stratification at a predefined value of $F_{st}$. In total, 100,000 SNPs were generated, with 1000 (for each input study) from type (i), 49,000 from type (ii), and 49,000 from type (iii). To investigate the effect of study sizes, we evaluated the method performance on input studies with 2000 cases and 2000 controls each, as well as on studies with 5000 cases and 5000 controls each.

Individual level genotype data. We tested the performance of our fixed-effect meta-analysis method and group PRS method on the depressive episode trait in the UK biobank dataset³⁰. Only independent European ancestry samples identified through PCA and IBD checks were included in the analysis. We applied basic quality control filters to those samples, removing SNPs and samples with a missing rate exceeding $2\%$ as well as SNPs violating the Hardy-Weinberg equilibrium ($p_{HWE} < 10^{-6}$). As a result, 640,756 SNPs and 331,217 samples (18,368 cases and 312,849 controls) survived and were used for the experiment. For the evaluation of the fixed-effect meta-analysis method, we ran a standard GWAS with all samples and treated SNPs with $p < 10^{-6}$ from the results as the “true signals” to be captured. For all GWAS on UK biobank samples, we corrected for age, gender, sample collection batch, and the top 10 PCs obtained using the software TeraPCA⁵³.

Generating summary statistics. For synthetic data and individual level genotypes, summary statistics were generated using PLink⁴⁷, correcting for the top ten principal components (PCs) in the case of admixed datasets. For real individual level genotype data, we divided the samples randomly into two equal sized subsets and ran a GWAS on each subset separately to obtain summary statistics for each subset. We performed ten such random iterations in our experimental evaluations. For the fixed-effect meta-analysis, on top of two independent subsets, we also introduced 100/500 sample overlap for synthetic data under the Balding-Nichols model; 5000/10,000 sample overlap for synthetic data under the gcta model; and 500/1000 sample overlap for the real GWAS data on depression.

Publicly available summary statistics. As part of the performance evaluation for our group PRS method, we used summary statistics from an MDD GWAS published in 2013³⁶ as the base study. Most recent large-scale GWAS include UK biobank as part of the samples. We chose to use an earlier GWAS published before the release of UK biobank data in order to minimize sample overlap between the base and target populations as much as possible. The summary statistics contain a total of 1,235,109 SNPs on genome build hg18. After liftover⁵⁴ to hg19, 1,234,855 SNPs remained for the analysis.

For group PRS and cc-GWAS, we demonstrated the applicability of our methods using publicly available summary statistics. We chose the summary statistics of eight neuropsychiatric disorders made available by the Psychiatric Genomics Consortium (PGC), since the underlying relationships among this set of disorders have been relatively well studied. Information on the eight summary statistics can be found in Table 13.

Table 13 Information on summary statistics for the eight psychiatric disorders used in the experiments.


Evaluation metrics

Fixed-effect meta-analysis. For synthetic experiments using the Balding-Nichols model where all SNPs were simulated independently, results after performing the meta-analysis were compared with the predefined causal variants. Under each experimental condition, we reported the average true positive rate (i.e., the percentage of predefined causal SNPs identified under the designated significance threshold), as well as the false positive rate (type I error, i.e., the percentage of non-causal SNPs falsely identified as causal under the same significance threshold) over ten independent iterations.
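The two metrics described above can be sketched as follows for a single iteration. The function name, the toy p-values, and the set of causal SNP indices are hypothetical; in practice the rates would be averaged over the ten iterations.

```python
def power_and_type1_error(p_values, causal, threshold):
    """Return (true positive rate, false positive rate) at a significance threshold.

    p_values: list of meta-analysis p-values, indexed by SNP
    causal:   set of indices of the predefined causal SNPs
    """
    non_causal = [i for i in range(len(p_values)) if i not in causal]
    if not causal or not non_causal:
        raise ValueError("both causal and non-causal SNPs are required")
    significant = {i for i, p in enumerate(p_values) if p < threshold}
    true_pos = len(significant & causal)
    false_pos = len(significant - causal)
    return true_pos / len(causal), false_pos / len(non_causal)

# Hypothetical toy example: five SNPs, three of them causal.
p_values = [1e-9, 0.2, 3e-8, 0.5, 1e-10]
causal = {0, 2, 3}
tpr, fpr = power_and_type1_error(p_values, causal, 5e-8)
print(tpr, fpr)
```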

For experiments under the gcta simulator it was unreasonable to report power and type I error rates by comparing with the predefined causal SNPs, since the SNPs were not independent in the input genotypes. Therefore, for this experiment, the performance of ReACt and the other tools was evaluated by comparing results to the outcome of a GWAS on all 50,000 cases and 250,000 controls, where 1,886 SNPs were identified as genome-wide significant (GWAS p-value $< 5\times 10^{-8}$). We considered those 1,886 SNPs as true signals (“causal SNPs”) and reported average power and type I error rates over ten iterations for all methods.

For real genotype data, in each iteration, we meta-analyzed summary statistics of two subsets using the proposed methods and standard approaches and compared results with the GWAS results on the complete dataset. Following the lines of the experiments using the gcta simulation model, we again reported results averaged over ten iterations (random splits) showing, on average, how many times a SNP reported as a “true signal” in the overall GWAS got picked up by each meta-analysis method (true positive) as well as how many extra SNPs each method identified (false positive). The performance on real genotype data was also evaluated under 0/100/500 sample overlap. Sample size for each subset under different conditions was 482 cases, 993 controls with no sample overlap; 532 cases, 1043 controls with 100 cases and 100 controls overlap; and 732 cases, 1243 controls with 500 cases and 500 controls overlap.

We compared the performance of ReACt in terms of accuracy as well as running time with METAL²⁴ and ASSET²⁵, which are both widely used tools for fixed-effect meta-analysis. Note that the latest stable release of METAL does not have the sample overlap correction functionality implemented. Therefore, for performance comparison, we used the development version available on GitHub²⁸.

cc-GWAS. Out of the three types of SNPs generated for the cc-GWAS evaluation (see “Data” section), we expect ReACt to pick up only type (i) SNPs, as they have been designed to be the trait-differential SNPs. Therefore, we reported the power of ReACt (i.e., the percentage of type (i) SNPs identified under the significance threshold) as well as the type I error rates (i.e., the percentage of type (ii) or type (iii) SNPs falsely picked up under the same significance threshold), separately for type (ii) and type (iii) SNPs. Since the randomness introduced by the simulation could lead to false positives that were not due to the method itself, we filtered out type (iii) SNPs showing extreme differences in effect size between studies, by removing type (iii) SNPs with $|OR_{i1} - OR_{i2}| \ge 0.1$ from performance evaluation. Here $OR_{i1}$ corresponds to the odds ratio for the ith SNP in the first study and $OR_{i2}$ corresponds to the odds ratio for the ith SNP in the other study. Since all three types of SNPs suffered from population stratification, we evaluated the performance of ReACt under a challenging scenario.

Besides simulation, experiments using summary statistics for schizophrenia (SCZ)³³ and bipolar disorder (BIP)³² were also carried out. These two disorders were chosen due to the existence of a case-case association study using the individual level genotypes³⁴. We tested ReACt using the summary statistics and compared the results with the existing case-case association study between SCZ and BIP to see whether it could detect possible genetic differences between the two disorders. Since no individual level quality control could be carried out, we expected our results to correspond to a case-case GWAS including 36,989 cases from SCZ and 20,352 cases from all three sub-types of BIP (type 1, type 2, and schizoaffective bipolar disorder). SNPs on the X-chromosome were excluded from this analysis. Further, to make our protocol comparable to the one used in³¹, we also removed variants in the MHC region (chr6: 25,000,000–35,000,000 bp). From a theoretical perspective, our approach analyzes each SNP independently. Therefore, removing the MHC region is not mandatory to run cc-GWAS using ReACt, unless the study design requires it. As a result, a total of 9,018,199 SNPs shared between both summary statistics were used for the analysis. The results were compared in detail with the results reported by the cc-GWAS in¹⁶.
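The filter applied to type (iii) SNPs in the simulation can be illustrated with a short sketch. The function name and the odds ratios below are hypothetical; the only rule taken from the text is that SNPs with an absolute odds ratio difference of at least 0.1 between the two studies are excluded from the evaluation.

```python
def keep_type_iii_snps(or_study1, or_study2, max_diff=0.1):
    """Return indices of SNPs whose odds ratios differ by less than max_diff."""
    return [
        i
        for i, (a, b) in enumerate(zip(or_study1, or_study2))
        if abs(a - b) < max_diff
    ]

# Hypothetical estimated odds ratios for three type (iii) SNPs in two studies.
or_study1 = [1.20, 1.20, 1.10]
or_study2 = [1.25, 1.35, 1.05]
kept = keep_type_iii_snps(or_study1, or_study2)
print(kept)
```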

Group PRS. In order to show that our method outputs reliable estimates of the group-wise statistics for PRS without accessing individual level genotypes, we compared the output of our method to the true group mean and standard deviation computed from the individual level PRS on synthetic data, as described in “Data” section. Performance was evaluated under a fixed $F_{st}$ of 0.05 between the base and target studies. For a pair of base and target studies, we estimated the mean PRS for case/control groups as well as their standard deviation using SNPs with p-values strictly less than $5 \cdot 10^{-5}$ in the summary statistics. We also computed the individual level PRS using PRSice2 to obtain the true group mean and standard deviation. Our experiments show that our estimates are numerically close to the real values.

Next, we evaluated the performance of ReACt on real GWAS datasets, where the individual level genotypes of the target study were available. For this experiment, we used summary statistics from an earlier GWAS of MDD³⁶ as the base study (see “Data” section for details) and cases and matching controls of the depressive episode trait in UK biobank as the target population³⁰. We clumped the base summary statistics using the European samples from the 1000 Genomes Project as reference, under parameters --clump-p1 1 --clump-kb 250 --clump-r2 0.1. We tested the method and reported results under a range of p-value thresholds ($0.1, 0.01, 0.001, 10^{-4}$). For each threshold, we used only independent SNPs with a p-value smaller than the respective threshold from the base summary statistics for PRS calculation, using both ReACt and PRSice2³⁵. We reported the mean PRS of cases and controls, as well as the resulting p-value from the t-test. In the case of PRSice2, we also reported the regression $r^2$ value and p-value for the PRS predictor with and without correcting for covariates (i.e., the top five principal components).

Finally, we applied ReACt to summary statistics of eight neuropsychiatric disorders (OCD, TS, ED, ADHD, ASD, BIP, SCZ and MDD, see “Data” section for details) and reported the pairwise PRS prediction power in terms of t-test p-values for the difference between case/control group PRS means. Prior to the group PRS computation, each set of base summary statistics was clumped with PLink⁴⁷ using parameters --clump-p1 1 --clump-kb 250 --clump-r2 0.1, with the European samples from the 1000 Genomes Project as a reference. All PRS values were estimated using independent SNPs with p-values strictly less than $10^{-5}$ from the base summary statistics.

Data availability

Summary statistics for the eight disorders used in this study can be downloaded from the Psychiatric Genomics Consortium (PGC): https://www.med.unc.edu/pgc/download-results/. The in-house script used for synthetic data generation can be found on our GitHub page: https://github.com/Paschou-Lab/ReAct/tree/main/Simulator. Some data that support the findings of this study are available from the UK biobank, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of UK biobank. This research has been conducted using the UK Biobank Resource under Application Number 61553. An implementation of ReACt can be found on our GitHub page: https://github.com/Paschou-Lab/ReACt.

References

1. Craig, D. W. et al. Assessing and managing risk when sharing aggregate genetic variant data. Nat. Rev. Genet. 12, 730–736 (2011).
2. Pasaniuc, B. & Price, A. L. Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet. 18, 117 (2017).
3. Park, J.-H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat. Genet. 42, 570–575 (2010).
4. Zhang, Y., Qi, G., Park, J.-H. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318–1326 (2018).
5. Yang, Z. et al. Investigating shared genetic basis across Tourette syndrome and comorbid neurodevelopmental disorders along the impulsivity-compulsivity spectrum. Biol. Psychiatry 90, 317–327 (2021).
6. Tsetsos, F. et al. Meta-analysis of Tourette syndrome and attention deficit hyperactivity disorder provides support for a shared genetic basis. Front. Neurosci. 10, 340 (2016).
7. Lee, P. H. et al. Genomic relationships, novel loci, and pleiotropic mechanisms across eight psychiatric disorders. Cell 179, 1469–1482 (2019).
8. Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
9. Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
10. Pasaniuc, B. et al. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics 30, 2906–2914 (2014).
11. Rüeger, S., McDaid, A. & Kutalik, Z. Evaluation and application of summary statistic imputation to discover new height-associated loci. PLoS Genet. 14, e1007371 (2018).
12. Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
13. Brown, B. C. et al. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88 (2016).
14. Zheng, J. et al. LD hub: A centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2017).
15. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228 (2015).
16. Peyrot, W. J. & Price, A. L. Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat. Genet. 53, 445–454 (2021).
17. Power, R. A. et al. Polygenic risk scores for schizophrenia and bipolar disorder predict creativity. Nat. Neurosci. 18, 953–955 (2015).
18. Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 19, 581–590 (2018).
19. Choi, S. W., Mak, T.S.-H. & O’Reilly, P. F. Tutorial: A guide to performing polygenic risk score analyses. Nat. Protocols 15, 2759–2772 (2020).
20. Grotzinger, A. D. et al. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat. Hum. Behav. 3, 513–525 (2019).
21. Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).
22. Palla, L. & Dudbridge, F. A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. Am. J. Hum. Genet. 97, 250–259 (2015).
23. LaValley, M. P. Logistic regression. Circulation 117, 2395–2399 (2008).
24. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: Fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
25. Bhattacharjee, S. et al. A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits. Am. J. Hum. Genet. 90, 821–835 (2012).
26. Balding, D. J. & Nichols, R. A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1995).
27. Lin, D.-Y. & Sullivan, P. F. Meta-analysis of genome-wide association studies with overlapping subjects. Am. J. Hum. Genet. 85, 862–872 (2009).
28. Sengupta, S. METAL, unpublished material and methods (2017). https://genome.sph.umich.edu/w/images/7/7b/METAL_sample_overlap_method_2017-11-15.pdf.
29. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
30. Sudlow, C. et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
31. Peyrot, W. J. & Price, A. L. Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat. Genet. 53, 445–454 (2021).
32. Stahl, E. A. et al. Genome-wide association study identifies 30 loci associated with bipolar disorder. Nat. Genet. 51, 793–803 (2019).
33. Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421 (2014).
34. Ruderfer, D. M. et al. Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes. Cell 173, 1705–1715 (2018).
35. Choi, S. W. & O’Reilly, P. F. PRSice-2: Polygenic risk score software for biobank-scale data. Gigascience 8, giz082 (2019).
36. Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium et al. A mega-analysis of genome-wide association studies for major depressive disorder. Mol. Psychiatry 18 (2013).
37. Visscher, P. M. & Hill, W. G. The limits of individual identification from sample allele frequencies: Theory and statistical analysis. PLoS Genet. 5, e1000628 (2009).
38. Ni, G. et al. A comparison of ten polygenic score methods for psychiatric disorders applied across multiple cohorts. Biol. Psychiatry 90, 611–620 (2021).
39. Maier, R. M. et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat. Commun. 9, 1–17 (2018).
40. Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. BioRxiv (2020).
41. Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C.A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1–10 (2019).
42. Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 1–11 (2019).
43. So, H.-C. & Sham, P. C. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Sci. Rep. 7, 1–11 (2017).
44. Choi, S. W., Mak, T. S. H., Hoggart, C. J. & O’Reilly, P. F. EraSOR: Erase sample overlap in polygenic score analyses. bioRxiv (2021).
45. Lloyd-Jones, L. R., Robinson, M. R., Yang, J. & Visscher, P. M. Transformation of summary statistics from linear mixed model association on all-or-none traits to odds ratio. Genetics 208, 1397–1408 (2018).
46. Freda, L. Logistic regression. http://nlp.chonbuk.ac.kr/BML/slides_freda/lec7.pdf (2016). (Accessed 13 Apr 2020).
47. Chang, C. C. et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience 4, s13742-015 (2015).
48. Firth, D. Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38 (1993).
49. Heinze, G. & Schemper, M. A solution to the problem of separation in logistic regression. Stat. Med. 21, 2409–2419 (2002).
50. Ma, C., Blackwell, T., Boehnke, M., Scott, L. J. & Investigators, G. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet. Epidemiol. 37, 539–550 (2013).
51. Bom, P. R. & Rachinger, H. A generalized-weights solution to sample overlap in meta-analysis. Res. Synth. Methods 11, 812–832 (2020).
52. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
53. Bose, A. et al. TeraPCA: A fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 35, 3679–3683 (2019).
54. Haeussler, M. et al. The UCSC genome browser database: 2019 update. Nucl. Acids Res. 47, D853–D858 (2019).
55. Arnold, P. D. et al. Revealing the complex genetic architecture of obsessive-compulsive disorder using meta-analysis. Mol. Psychiatry 23, 1181–1181 (2018).
56. Yu, D. et al. Interrogating the genetic determinants of Tourette’s syndrome and other tic disorders through genome-wide association studies. Am. J. Psychiatry 176, 217–227 (2019).
57. Duncan, L. et al. Significant locus and metabolic genetic correlations revealed in genome-wide association study of anorexia nervosa. Am. J. Psychiatry 174, 850–858 (2017).
58. Grove, J. et al. Identification of common genetic risk variants for autism spectrum disorder. Nat. Genet. 51, 431–444 (2019).
59. Demontis, D. et al. Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat. Genet. 51, 63–75 (2019).
60. Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).

Acknowledgements

We would also like to thank our colleague Pritesh Jain for contributing the name ReACt for the tool being developed. This work was supported by NSF IIS 1715202 and NSF IIS 2006929.

Author information

Authors and Affiliations

Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
Zhiyu Yang & Peristera Paschou
Department of Computer Science, Purdue University, West Lafayette, IN, USA
Petros Drineas

Contributions

Z.Y. designed the study and carried out the experiments, drafted and revised the manuscript; P.D. designed the study and revised the manuscript; P.P. designed the study and revised the manuscript.

Corresponding authors

Correspondence to Peristera Paschou or Petros Drineas.

Ethics declarations

Competing interest

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, Z., Paschou, P. & Drineas, P. Reconstructing SNP allele and genotype frequencies from GWAS summary statistics. Sci Rep 12, 8242 (2022). https://doi.org/10.1038/s41598-022-12185-6

Received: 28 December 2021
Accepted: 27 April 2022
Published: 17 May 2022
DOI: https://doi.org/10.1038/s41598-022-12185-6
Springer Nature Limited
