Background

In human genetics, heterozygote advantage (heterosis) has been detected in studies that focused on specific genes [1, 2], but not in genome-wide association studies (GWAS). For example, heterosis is believed to confer resistance to certain strains of malaria in individuals heterozygous for the sickle-cell haemoglobin S (HbS) gene. Yet heterosis can substantially diminish the power of allelic tests [3]. Since GWAS (and haplotype association analyses) also rely on allelic tests [4, 5], it is unclear to what degree GWAS underperform when heterosis is ignored. A GWAS tests many genetic variants for a statistical association with a disease or a particular trait. The steps in conducting a GWAS include data collection for the selected study population, genotyping, data processing, and testing for association [6].

Simulation studies by Omolo and colleagues [3] showed that allelic tests underperform in the presence of heterosis, a condition found in some diseases such as malaria and sickle cell anaemia [1]. It is unclear how the allelic tests conducted at millions of single nucleotide polymorphisms (SNPs) would perform under heterotic conditions.

Existing tests for association studies include Pearson’s chi-square test, the allelic test, the Cochran-Armitage trend test (CATT), and the MAX test, among others. Pearson’s chi-square test and the CATT [7, 8] are well established for genetic association using case–control samples. Trend tests are available for the three commonly used genetic models: dominant, recessive, and additive [7, 8]. The MAX test was proposed by Loley et al. [9], Gonzalez et al. [10], Zang et al. [11], and Horthon et al. [12]. The test allows the underlying genetic model to be selected rather than assumed [7]. Zang et al. [11] developed an algorithm to calculate empirical and asymptotic p-values for the MAX and allelic tests, which reduced the computational burden of association tests. Zintzaras et al. [13] studied the degree of dominance, a measure that attempts to capture the heterotic situation on a continuous scale; their simulation study showed that the method was promising for model selection. Gonzalez et al. [10] derived the asymptotic form of the MAX test and estimated its significance level based on the three genetic models. Like the tests developed by Zang et al. [11], this test showed a reduced computational burden. However, an extension to the heterotic situation would be important for some traits. Horthon et al. [12] used a conditional reference distribution for the MAX test in three dimensions and showed that it is asymptotically normally distributed with estimated parameters [14]. As in Horthon et al. [12], the main interest here lies in genome-wide association testing, with heterosis being one of the genetic models. The existing tools for analysis have been extended in GWAS to include the heterotic model. See [15,16,17] for a detailed review of robust tests and their applications to genetic association studies.

In this study, a two-step approach to genetic association testing in malaria studies in a GWAS setting was proposed. The approach may enhance the power of the tests by identifying the underlying genetic model before applying the association tests. First, generalized linear models for the dominant, recessive, additive, and heterotic effects were fitted using case–control genotype data. Model selection was then performed using the MAX test procedure [12]. Here, the distribution of the MAX test was extended to accommodate the heterotic effect, giving a four-dimensional test statistic, the MAX4 test. For each marker, the model with the smallest p-value was selected. The p-values were adjusted for multiple comparisons using the Bonferroni method for SNPs with an allelic odds ratio greater than or equal to 1.5. The most significant SNPs were selected based on a threshold of \(5 \times 10^{-8}\). Using the MAX4 and the allelic tests, statistics and p-values were estimated to determine SNP significance across all the genetic models and to select the correct model. The p-values of the MAX4 test were estimated using the parametric bootstrap (boot), bivariate normal (bvn), and asymptotic (asy) methods [11]. Genotype datasets were simulated under Hardy-Weinberg equilibrium (HWE), assuming a multinomial distribution for cases and controls. The MAX4 and the allelic tests were performed on the simulated datasets to achieve model selection and to test for significance. An example dataset with 17 SNPs [11] and malaria genotype data from unrelated individuals in the Kenyan and Gambian populations were used for validation (https://www.ebi.ac.uk/ega/).

Methods

Genetic model

Consider a genetic marker with alleles A and S and genotypes AA, AS, and SS, as shown in Table 1. The distribution of the genotypes from alleles A and S is found in Sasieni [8]. Assume A is the risk allele, which confers a high risk of malaria. The three genotypes are denoted by \(g_0=SS\), \(g_1=AS\), and \(g_2=AA\), and the genotype frequencies are \(g_i=P(G_i)\) for \(i=0,1,2\). The allele frequencies are \(P(A)=p\) and \(P(S)=1-p=q\). HWE is assumed to hold, i.e. \(g_0=q^2\), \(g_1=2pq\), and \(g_2=p^2\). The probability of being diseased given a particular genotype (penetrance) is given by \(f_i=P(case|g_i)\) and the disease prevalence by \(k=P(case)=\sum f_ig_i\), for \(i=0,1,2\). Let the genotype counts of \(g_0\), \(g_1\) and \(g_2\) in r cases and s controls be represented by \((r_0, r_1, r_2)\) and \((s_0,s_1,s_2)\) respectively, with \(n_i=r_i+s_i\) where \(i=0,1,2\) and \(n=r+s\). Consider the penetrance relation among the different modes of inheritance. For the additive model, the penetrance relation is \(f_0<f_1<f_2\), so the disease risk rises with the number of A alleles. For the dominant model, one A allele in the heterozygous genotype is sufficient to cause disease risk similar to that of two copies of the A allele, i.e. the AA genotype, and the penetrance relation is \(f_0 <f_1\simeq f_2\). For the recessive model, the penetrance relation is \(f_0\simeq f_1< f_2\). For the overdominant model (positive heterosis), the heterozygous genotype AS has the largest effect on disease risk, i.e. \(f_1> f_0,f_2\). Using the penetrance relation, we represent the overdominant situation for the MAX4 test using the score vector (0,1,0). The score vectors for the dominant, recessive, and additive models are (0,1,1), (0,0,1), and (0,1,2) respectively. Table 2 shows the counts of cases and controls under heterosis (overdominance). Define the genotype relative risk (GRR) as \(f_i/f_0=\lambda _i\). Under the different genetic models, a test of the null hypothesis \(H_0:\lambda _i=1\) against the alternative \(H_A:\lambda _i>1\), for \(i=1,2\), is performed.
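The quantities defined above can be tied together in a short sketch. It computes the HWE genotype frequencies, the prevalence \(k=\sum f_ig_i\), and, by Bayes' rule, the genotype distributions among cases and controls. The function names and the penetrance values are hypothetical, chosen only to illustrate an overdominant pattern \(f_1>f_0,f_2\); they are not taken from the study.

```python
def hwe_genotype_freqs(p):
    """Return (g0, g1, g2) = P(SS), P(AS), P(AA) under HWE, with P(A) = p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)


def case_control_genotype_probs(p, penetrances):
    """Prevalence and genotype distributions in cases and controls.

    penetrances = (f0, f1, f2), where f_i = P(case | g_i).
    P(g_i | case)    = f_i * g_i / k
    P(g_i | control) = (1 - f_i) * g_i / (1 - k)
    """
    g = hwe_genotype_freqs(p)
    k = sum(f * gi for f, gi in zip(penetrances, g))  # disease prevalence
    cases = tuple(f * gi / k for f, gi in zip(penetrances, g))
    controls = tuple((1.0 - f) * gi / (1.0 - k) for f, gi in zip(penetrances, g))
    return k, cases, controls


# Hypothetical overdominant penetrances (f1 > f0, f2); illustration only.
k, cases, controls = case_control_genotype_probs(0.1, (0.05, 0.10, 0.05))
```

With these illustrative inputs the heterozygote AS is over-represented among cases relative to controls, which is the signal the score vector (0,1,0) is designed to pick up.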

Simulated genotype data

Genotype datasets from a case–control study design were simulated. Genotype counts for both cases and controls were drawn from a multinomial distribution with genotype frequencies satisfying HWE, i.e. \(p^2\), \(2pq\), and \(q^2\) for the AA, AS, and SS genotypes, respectively. Data were also simulated to violate this HWE assumption. Datasets containing 500 to 5,000 SNPs were simulated using varying allele frequencies. The initial probability of allele A was set at 0.1 and was used to determine the genotype distributions under HWE [9]. The allelic and the MAX4 tests were performed on the different sample sizes, and their results were compared on the selected genetic models.
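As a rough illustration of the simulation step, the sketch below draws genotype counts for one SNP from a multinomial distribution with HWE frequencies. It assumes the null scenario in which cases and controls share the same genotype frequencies; the function names, sample sizes, and seed are hypothetical and only the standard library is used.

```python
import random
from collections import Counter


def simulate_genotype_counts(n, probs, rng):
    """Draw n genotypes (0 = SS, 1 = AS, 2 = AA) from a multinomial distribution."""
    tally = Counter(rng.choices((0, 1, 2), weights=probs, k=n))
    return [tally[0], tally[1], tally[2]]


def simulate_snp(r, s, p, rng):
    """Simulate genotype counts for r cases and s controls under HWE.

    Assumption: no association, so both groups use the same frequencies
    (q^2, 2pq, p^2) with P(A) = p.
    """
    q = 1.0 - p
    hwe = (q * q, 2.0 * p * q, p * p)
    return simulate_genotype_counts(r, hwe, rng), simulate_genotype_counts(s, hwe, rng)


rng = random.Random(2023)  # fixed seed so the sketch is reproducible
cases, controls = simulate_snp(r=1000, s=1000, p=0.1, rng=rng)
```

Repeating `simulate_snp` once per marker, with the allele frequency varied between calls, would give a dataset of the kind described above. Simulating under an alternative model would only require different genotype probabilities for the cases.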

Example dataset

The MAX4 test was applied to an example dataset (Additional file 1: Table S1) containing 17 common SNPs from age-related macular degeneration (AMD) [18], prostate cancer (PC) [19], breast cancer (BC) [20], and hypertension (HP) [21] studies, for which significant results had been obtained [11]. The Rassoc [11] package in R was used to generate the statistics and the p-values of the tests. This R package provides Monte Carlo and asymptotic algorithms for the MAX3, CATT, allelic, and other commonly used tests in case–control studies. The algorithms calculate the p-values using the parametric bootstrap method, the bivariate normal distribution, and the asymptotic null distribution method. The algorithms were extended to incorporate the heterotic effect using the overdominance-related penetrance function.

Malaria datasets

Malaria datasets with genotype data for cases and controls from two populations, the Gambia and Kenya, were used (https://www.ebi.ac.uk/ega/). The datasets contained 3340 samples from Kenya and 2780 samples from the Gambia. Each sample had 23 chromosomes, including the sex chromosome. The number of markers differed across chromosomes. All cases were diagnosed in a hospital, where blood samples from children diagnosed with severe malaria were collected. The controls were unrelated individuals from the general population and from new births, and their blood samples were collected in the same geographic areas as the cases. Deoxyribonucleic acid (DNA) was extracted from the blood samples and genotyped on Illumina SNP arrays [22]. Various sets of genotype calls were used to process the array data. SNP allele names (A, C, T, G), identification numbers (ID), chromosomal positions, and SNP names were retrieved from the input files. Other variables included sex, ethnicity, and country of origin.

The malaria datasets for the study are available under EGA study EGAS00001000807 from Kenya and Gambia: dataset ID EGAD00010000570 (1544 controls and 1711 cases) for the Kenyan population and dataset ID EGAD00010000572 (1533 controls and 1247 cases) for the Gambian population. Samples were drawn from different geographical locations across the two countries to capture genetic diversity in African countries. The initial study and a description of the datasets are available in Band et al. [22]. SNPTEST v2.4.1 software was used to pre-process the data to obtain case–control summary statistics on genotype counts, chromosome positions, allele frequencies, and odds ratios (https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest_v2.4.1.html). The MAX4 and the allelic tests were performed in the presence of an overdominant model. All statistical analyses were conducted in R version 4.2.0 using RStudio [23].

Cochran-Armitage trend test

The Cochran-Armitage trend test (CATT) and the chi-square test have been well studied for single variants [8]. The CATT is defined as

$$\begin{aligned} CATT=\frac{U}{(Var(U))^{1/2}} \end{aligned}$$
(1)

where

$$\begin{aligned} U=\frac{1}{n}\sum _{i=0}^{2}x_i(sr_i-rs_i) \end{aligned}$$
(2)

and

$$\begin{aligned} Var(U)=\frac{rs}{n^3}\left[ n\sum _{i=0}^{2}x_i^2n_i-\left( \sum _{i=0}^{2} x_in_i\right) ^2\right] , \end{aligned}$$
(3)

where r is the number of cases, s is the number of controls, n is the total number of cases and controls, \(n_i=r_i+s_i\) for \(i=0,1,2\), and \((x_0,x_1,x_2)\) is the genotype score vector for the respective genetic model [24]. Substituting Eqs. 2 and 3 into Eq. 1 gives the CATT in the form

$$\begin{aligned} Z_x=\frac{n^{1/2}\sum _{i=0}^2 x_i (sr_i-rs_i)}{\left\{ rs\left[ n\sum _{i=0}^2 x_{i}^2 n_i-\left( \sum _{i=0}^2 x_in_i\right) ^2\right] \right\} ^{1/2}} \end{aligned}$$
(4)

Under the overdominant model, with score vector (0,1,0), Eq. 4 becomes

$$\begin{aligned} CATT_{HET}= \frac{n^{1/2} (sr_1-rs_1)}{\left[ rs\,n_1(n-n_1)\right] ^{1/2}}. \end{aligned}$$
(5)
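The trend statistic can be sketched in a few lines. The function below follows the standard Cochran-Armitage form, \(U=\frac{1}{n}\sum x_i(sr_i-rs_i)\) divided by the square root of its null variance \(\frac{rs}{n^3}[n\sum x_i^2n_i-(\sum x_in_i)^2]\), and evaluates it for the four score vectors given in the text. The genotype counts are hypothetical, and returning 0 when the variance is zero (for example, at a monomorphic marker) is an assumption of this sketch rather than part of the published method.

```python
import math

# Score vectors from the text: recessive, additive, dominant, overdominant.
SCORES = {"REC": (0, 0, 1), "ADD": (0, 1, 2), "DOM": (0, 1, 1), "HET": (0, 1, 0)}


def catt(cases, controls, scores):
    """Cochran-Armitage trend statistic U / sqrt(Var(U)).

    cases = (r0, r1, r2) and controls = (s0, s1, s2) are genotype counts
    ordered (SS, AS, AA); scores = (x0, x1, x2).
    """
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [ri + si for ri, si in zip(cases, controls)]
    u = sum(x * (s * ri - r * si) for x, ri, si in zip(scores, cases, controls)) / n
    bracket = (n * sum(x * x * ni for x, ni in zip(scores, totals))
               - sum(x * ni for x, ni in zip(scores, totals)) ** 2)
    var_u = r * s * bracket / n ** 3
    if var_u <= 0:
        # Assumption: an undefined statistic is treated as no evidence.
        return 0.0
    return u / math.sqrt(var_u)


# Hypothetical counts, ordered (SS, AS, AA); illustration only.
cases, controls = (30, 50, 20), (40, 30, 30)
z = {model: catt(cases, controls, x) for model, x in SCORES.items()}
```

For these illustrative counts the additive statistic is exactly zero while the overdominant statistic is the largest in absolute value, which shows how a heterotic signal can be invisible to a trend test with additive scores.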

Table 1 Count of cases and controls in the genotype model
Table 2 Count of cases and controls in the heterosis model

The MAX test

The MAX test statistic is defined as \(Z_{max}=max(|Z_0|,|Z_{1/2}|,|Z_1|)\) [24], where the subscripts 0, 1/2, and 1 denote the CATT under the recessive, additive, and dominant models, respectively. It therefore considers the three common genetic models. An extension of the test statistic to include an overdominant genetic model with score vector (0,1,0) was proposed and denoted as the MAX4 test. The MAX4 statistic, \(Z_{max4}\), is the maximum of the absolute CATT over the four genetic models and is defined as

$$\begin{aligned} Z_{max4}=max(|CATT_{ADD}|,|CATT_{DOM}|,|CATT_{REC}|,|CATT_{HET}|), \end{aligned}$$
(6)

where \(CATT_{DOM}\), \(CATT_{REC}\), \(CATT_{ADD}\), and \(CATT_{HET}\) are the CATTs under the dominant, recessive, additive, and heterotic models, respectively. Each of the four test statistics asymptotically follows the standard normal distribution N(0, 1), and jointly they can be described by a density function \(f(z_1, z_2, z_3, z_4;\Sigma )\), where \(\Sigma\) is the 4 by 4 variance-covariance matrix. Using the integrate function in R, one can estimate the probability under this density for given data. For an observed value m of \(Z_{max4}\), the p-value is one minus the probability

$$\begin{aligned} Pr(|Z_{max4}|<m)=\int _{-m}^m\int _{-m}^m\int _{-m}^m\int _{-m}^m f(z_1,z_2,z_3,z_4;\Sigma )dz \end{aligned}$$
(7)

Consider a case–control situation with proportions \(p_0\), \(p_1\), and \(p_2\) for genotypes \(g_0\), \(g_1\) and \(g_2\), respectively. The asymptotic means and variances of the multivariate normal distribution are used [25]: the genotype proportions have asymptotic variances proportional to \(p_i(1-p_i)\) and covariances proportional to \(-p_ip_j\), so the four statistics entering \(Z_{max4}\) jointly follow a four-variate normal distribution. Under no association, the test statistics have a mean vector of zero. The derivation of the correlation coefficients over the three genetic models is discussed in [10, 11]. A parametric bootstrap with m replicates was used to approximate the null distribution of the MAX4 statistic. The p-values were estimated from this empirical null distribution [11].
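A minimal sketch of the MAX4 statistic with a parametric-bootstrap p-value is given below. It is not the Rassoc implementation: it assumes that null replicates are drawn from a multinomial distribution with the pooled genotype proportions, that an undefined CATT counts as 0, and that the p-value uses the common (exceedances + 1)/(replicates + 1) convention. All function names and the example counts are hypothetical.

```python
import math
import random
from collections import Counter

SCORES = {"REC": (0, 0, 1), "ADD": (0, 1, 2), "DOM": (0, 1, 1), "HET": (0, 1, 0)}


def catt(cases, controls, scores):
    """Cochran-Armitage trend statistic for genotype counts ordered (SS, AS, AA)."""
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [ri + si for ri, si in zip(cases, controls)]
    u = sum(x * (s * ri - r * si) for x, ri, si in zip(scores, cases, controls)) / n
    bracket = (n * sum(x * x * ni for x, ni in zip(scores, totals))
               - sum(x * ni for x, ni in zip(scores, totals)) ** 2)
    var_u = r * s * bracket / n ** 3
    return u / math.sqrt(var_u) if var_u > 0 else 0.0  # assumption: undefined -> 0


def max4(cases, controls):
    """Return (Z_max4, model attaining the maximum absolute CATT)."""
    stats = {m: abs(catt(cases, controls, x)) for m, x in SCORES.items()}
    best = max(stats, key=stats.get)
    return stats[best], best


def draw_counts(n, probs, rng):
    """Multinomial draw of n genotypes with the given probabilities."""
    tally = Counter(rng.choices((0, 1, 2), weights=probs, k=n))
    return [tally[0], tally[1], tally[2]]


def max4_bootstrap_pvalue(cases, controls, replicates=1000, seed=1):
    """Approximate the null distribution of Z_max4 by parametric bootstrap."""
    rng = random.Random(seed)
    r, s = sum(cases), sum(controls)
    n = r + s
    # Null model: cases and controls share the pooled genotype proportions.
    pooled = [(ri + si) / n for ri, si in zip(cases, controls)]
    observed = max4(cases, controls)[0]
    exceed = 0
    for _ in range(replicates):
        null_stat = max4(draw_counts(r, pooled, rng), draw_counts(s, pooled, rng))[0]
        if null_stat >= observed:
            exceed += 1
    return (exceed + 1) / (replicates + 1)


# Hypothetical counts, ordered (SS, AS, AA); illustration only.
cases, controls = (30, 50, 20), (40, 30, 30)
stat, model = max4(cases, controls)
p_value = max4_bootstrap_pvalue(cases, controls)
```

The model returned by `max4` plays the role of the selected genetic model, and the bootstrap p-value accounts for having taken a maximum over four correlated statistics.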

Results

Simulation study and example datasets

Table 3 The test statistics and the p-values of MAX4 and the allelic test procedures for the 17 SNPs reported in Additional file 2: Table S1 using the three approaches: the parametric bootstrap (boot), the bivariate normal approach (bvn) and the asymptotic approach (asy) for the case of the MAX4 procedure

A simulation study was performed to compare the significance of the MAX4 test with that of the allelic test. A multinomial distribution was assumed for both cases and controls in violation of HWE, with model selection performed to investigate the underlying genetic models. Additional file 6: Table S5 shows a selection of the most significant SNPs when the MAX4 test, using the asymptotic method, and the allelic test were performed under the selected genetic models, at a Bonferroni threshold of \(10^{-5}\) with 5000 SNPs. Of the 5000 SNPs, model selection assigned 2009 to the additive model, 2086 to the dominant model, 522 to the recessive model, and 383 to the heterotic model. There were 570 significant SNPs.
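The Bonferroni threshold of \(10^{-5}\) for 5000 SNPs is consistent with dividing a family-wise error rate of 0.05 by the number of tests. The 0.05 level is an assumption here, since the text does not state it, and the function names and p-values below are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def significant_indices(p_values, alpha=0.05):
    """Indices of tests whose p-value falls below the Bonferroni threshold."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Assumed family-wise level of 0.05 across 5000 simulated SNPs.
threshold = bonferroni_threshold(0.05, 5000)

# Hypothetical p-values for three markers; illustration only.
hits = significant_indices([0.2, 0.001, 0.04])
```

With three tests the cutoff is 0.05/3, so only the second of the illustrative p-values would be declared significant.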

Table 3 shows the results of the MAX4 and allelic tests for the 17-SNP dataset (Additional file 2: Table S1). Model selection chose the additive model most often, at 9 of the 17 SNPs. The dominant and recessive models were selected at 1 and 4 of the 17 SNPs, respectively. The heterotic model was selected at SNPs rs17157903, rs7696175, and rs2820037. Many SNPs returned significant results for the dominant, recessive, additive, and heterotic chi-square tests, with more significance under the additive model than under the other genetic models (Additional file 3: Figure S2). The p-values of the MAX4 test estimated using the asymptotic method were similar to those from the parametric bootstrap and bivariate normal procedures, as shown in Table 3. For some SNPs, such as rs12505080 and rs7696175, the p-values of the MAX4 and the allelic tests disagreed.

Table 4 Frequency of heterotic models selected and the SNPs showing discordant results between the MAX4 and allelic test for Kenyan malaria datasets

Real data

In both the Kenyan and Gambian datasets, genome-wide significance was assessed using the conservative Bonferroni method, for SNPs with an allelic odds ratio greater than or equal to 1.5. Tables 4 and 5 summarize the frequency of heterotic models selected and the disparities between the allelic test and the MAX4 test for the Kenyan and Gambian populations, respectively. Discordance occurs when the results of the MAX4 test, taken as the standard, are not consistent with the allelic test results. For the dominant, recessive, and additive models, there were no disparities between the two tests, i.e. both the MAX4 and allelic tests reported similar significant results (Additional file 4: Table S3 and Additional file 5: Table S4). Figure 1 shows heterotic frequencies and disparities between the allelic and the MAX4 tests for the Kenyan and Gambian datasets. At allelic odds ratios greater than 1.5 (\(95\%\) confidence interval), heterotic models had the highest frequency. Figures 2 and 3 show the frequencies of the four genetic modes of inheritance selected using the MAX4 test procedure for the Kenyan and Gambian datasets, respectively. Manhattan plots and quantile-quantile (QQ) plots for selected chromosomes of the Kenyan datasets are provided as additional information (Additional file 10: Fig. S1, Additional file 11: Fig. S2) and were generated using the qqman package in R [26].

Table 5 Frequency of heterotic models selected and the SNPs showing discordant results between the MAX4 and allelic test for Gambian malaria datasets
Fig. 1

Results of disparity for the allelic and the MAX4 tests for the estimated heterotic models for Kenyan and Gambian malaria datasets

Fig. 2

Frequency of the different genetic modes of inheritance selected by the MAX4 test at allelic odds ratio greater than 1.5 for Kenyan malaria datasets

Fig. 3

Frequency of the different genetic modes of inheritance selected by the MAX4 test at allelic odds ratio greater than 1.5 for Gambian malaria datasets

Discussion

The study assessed the performance of the MAX4 and the allelic tests in malaria studies. The underlying MAX test has previously been used in genetic association testing [9, 12, 27]. It allows for model selection as well as testing of statistical significance. The MAX4 test was treated as the standard test procedure, so that allelic test results deviating from its conclusions were deemed false negatives. The test is a robust procedure that allows for genetic and other covariates in the analysis, since it incorporates the generalized linear model, and it has good power and model selection properties [9].

One of the significant findings from the GWAS analyses was the uneven distribution across chromosomes of the disparities in association test results between the MAX4 test and the allelic test (Tables 4 and 5 and Fig. 1). The highest disparities occurred in chromosomes X and Y in the Kenyan dataset. Disparities were also observed in chromosomes 1, 2, 13, and 15 (Kenyan dataset) and chromosome 14 (Gambian dataset). The 17-SNP dataset in Table 3 also showed disparities for SNPs rs12505080 and rs7696175.

Figures 2 and 3 show the highest frequencies at chromosomes 1 and 6 in both the Kenyan and Gambian datasets. The two chromosomes also show the most heterotic pattern of inheritance. Loci on chromosomes 1 and 6 have previously been investigated and reported to be protective against severe malaria [28,29,30].

All SNPs were tested for compliance with HWE before genetic association testing. The prevalence of heterotic associations was higher in the Kenyan dataset than in the Gambian dataset, further highlighting the genetic diversity between the two populations from the Eastern and Western regions of Africa, respectively. Recent GWAS have implicated chromosome 6 SNPs in drug resistance in severe malaria [31]. Given the presence of significant heterotic effects in the chromosomes above, further studies are recommended to assess their association with malaria protection. These results support the findings of the simulation studies by Omolo et al. [3], which found that allelic tests lose power in the presence of heterosis, resulting in false-negative results.

Existing research in single-SNP and genome-wide studies tends to overlook overdominance and underdominance, even though these conditions reduce the power of allelic tests [3]. The research findings are consistent with the simulation study results, which recommended performing the allelic test with care for single SNPs in the presence of heterosis because of power loss.

Conclusion

Based on simulation studies, Omolo et al. [3] cautioned against overlooking heterotic conditions when performing allelic tests, because doing so results in power loss. The present findings indicate that this caution holds in both single-SNP analysis and genome-wide association studies. Statistical methods in previous studies examined popular genetic models but ignored heterosis, even though the power of allelic tests is reduced in the presence of heterosis.