Background

A large body of studies has demonstrated that genetic variations have a direct or indirect impact on the development of phenotypic variation [1,2,3,4,5]. Such studies advanced our understanding of the genetic architecture of complex traits. More recently, the integration of large-scale genetic studies with transcriptome data has also identified genetic variants that explain variance in transcript abundance of specific genes (reviewed in [6]). The integration of multiple omics datasets, including genotypes, is an important step toward closing the biological gap that exists between genotypes and phenotypes [7].

Recent publications from the human Genotype-Tissue Expression (GTEx) [8, 9] and the cattle GTEx [10] projects have shed light on the genetic control of gene expression in large mammals. The recent findings indicate that genomic variants have a greater impact on gene expression than previously anticipated [11]. These studies have provided valuable information which will help close the critical gap between genomic variants and phenotypic variation [12, 13], especially those associated with health in humans and livestock.

Given the importance of identifying expression quantitative trait loci (eQTL) [14] to understand cell or tissue biology, several statistical approaches have emerged to allow the coordinated analysis of genomic variants and transcript abundance (reviewed by Nica and Dermitzakis [14]). While the first eQTL studies used microarray data [15], most of the analyses carried out in recent years use RNA-sequencing data. One emerging concern is the normalization of the data across samples. To that end, several methods have been used for data normalization across samples, such as the trimmed mean of M-values (TMM) [16], fragments per kilobase per million reads (FPKM) [17], and transcripts per million (TPM) [18]. These and other methods have been evaluated, and TMM might have an advantage over other methods [19]. Another concern related to eQTL analysis is that RNA-sequencing data do not follow a normal distribution. However, all statistical approaches currently employed assume that the input data follow a normal distribution. Researchers have addressed this by transforming the data using variance stabilization [20,21,22], log2 transformation [23, 24], or the inverse normal transformation [8, 10, 25, 26].

Because the principle of eQTL analysis is to identify differences in transcript abundance between genotypes [15], we reasoned that the analysis of eQTLs using transcript abundance estimated from RNA-sequencing could be carried out using the same framework used for differential gene expression (DGE). A major benefit of using such a framework is that differences in transcript abundance are tested and estimated using a negative binomial model [20, 27, 28], which is suitable for sequence count data [29, 30]. Thus, we hypothesized that biologically meaningful eQTLs would be identified without transforming RNA-sequencing data to fit a normal distribution. Here, our objective was to identify eQTLs in cattle peripheral white blood cells (PWBCs) using RNA-sequencing data and the Bioconductor [31] package “edgeR” [27, 32], which was designed for DGE analysis using the generalized linear model framework.

Methods

All bioinformatics and analytical procedures are presented in Additional file 1.

Data processing for variant detection, and variant filtering

We analyzed RNA-sequencing data from 42 heifers (Bos taurus, Angus × Simmental) publicly available in the GEO database: GSE103628 [33, 34] and GSE146041 [35]. First, we trimmed sequencing adapters and retained reads with an average quality score equal to or greater than 30 using Trimmomatic (v. 0.39) [36]. Then, we used Hisat2 (v. 2.2.0) [37] to align the paired-end short reads to the cattle genome [38, 39] (Bos_taurus, ARS-UCD1.2.99), obtained from the Ensembl database [40]. Next, we used Samtools (v. 1.10) [41] to remove unmapped reads, secondary alignments, alignments from reads that failed platform/vendor quality checks, and PCR or optical duplicates. Duplicates were removed using the function “bammarkduplicates” from biobambam2 (v. 2.0.95) [42]. The function “SplitNCigarReads” from GATK (v. 4.2.2.0) [43] was then used to split reads containing N operations in their CIGAR strings, which result from reads spanning exon-exon boundaries. Variants were then called in our data by using the functions “bcftools mpileup” and “bcftools call” from BCFtools [41].

We filtered the variants with the function “bcftools view” from BCFtools to select sites where 20 or more reads were used to identify a variant. Next, in R (v. 4.0.3) [44], we retained variant sites that were identified as single nucleotide polymorphisms (SNPs) and had genotypes called in at least 20 samples (Fig. 1A).
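The filtering rules described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which used BCFtools and R, see Additional file 1); the record layout, the field names, and the toy sites are assumptions made only for illustration, and the depth threshold is applied here per site.

```python
# Illustrative only: the dictionary fields below are hypothetical,
# not the VCF specification and not the authors' code.
MIN_DEPTH = 20    # reads used to identify the variant (threshold from the text)
MIN_SAMPLES = 20  # samples with a called genotype (threshold from the text)

def is_snp(ref, alt):
    """True when both alleles are single nucleotides (excludes indels)."""
    return len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"

def keep_site(site):
    """Apply the depth, SNP-type, and call-rate filters to one site."""
    called = sum(1 for g in site["genotypes"] if g is not None)
    return (site["depth"] >= MIN_DEPTH
            and is_snp(site["ref"], site["alt"])
            and called >= MIN_SAMPLES)

# Toy sites: s2 fails depth, s3 is an indel, s4 has too few called genotypes.
sites = [
    {"id": "s1", "ref": "A", "alt": "G", "depth": 35,
     "genotypes": ["AA"] * 15 + ["AG"] * 10},
    {"id": "s2", "ref": "A", "alt": "G", "depth": 12,
     "genotypes": ["AA"] * 25},
    {"id": "s3", "ref": "AT", "alt": "A", "depth": 40,
     "genotypes": ["AA"] * 25},
    {"id": "s4", "ref": "C", "alt": "T", "depth": 50,
     "genotypes": ["CC"] * 10 + [None] * 15},
]

kept = [s["id"] for s in sites if keep_site(s)]
print(kept)
```

Only the first toy site passes all three filters.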

Fig. 1

Overview of genotyping and variant discovery using RNA-sequencing data from PWBCs. (A) Schematics of bioinformatics procedures. (B) Distribution of allelic frequency of all variants genotyped in at least 26 samples. (C) Distribution of allelic frequency of all variants genotyped in at least 26 samples followed by filtering to retain 6,207 SNPs. (HW: Hardy–Weinberg; MAF: minor allele frequency)

Variant annotation

After the list of significant SNP-gene pairs was generated from the eQTL analysis, gene attributes were retrieved from the Ensembl genome database. The attribute list was merged with the output from the eQTL analysis as well as the nucleotide genotypic data for all samples. Ensembl Variant Effect Predictor [45] was used to compare our data to the cattle genome (Bos taurus, ARS-UCD1.2) to identify the functional consequences of the SNPs.

Quantification of transcript abundance

For the expression dataset, we obtained the raw read counts from our previous work [35]. First, we eliminated one sample that had fewer than one million reads mapped to the annotation. Second, we calculated counts per million reads (CPM) [27]. Third, we retained protein-coding genes that had CPM greater than two in five or more samples. Next, we calculated TPM [46], which was used in all plots with transcript abundance.
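As a minimal sketch of the quantities named above, the following Python code computes CPM and TPM for one sample and applies the expression filter (CPM greater than two in five or more samples). The gene names, counts, and gene lengths are invented for illustration; the authors' own calculations were carried out in R and are given in Additional file 1.

```python
def cpm(counts):
    """Counts per million for one sample: count / library size * 1e6."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

def tpm(counts, lengths_kb):
    """Transcripts per million: divide by gene length first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}

def expressed_genes(cpm_by_sample, min_cpm=2.0, min_samples=5):
    """Genes with CPM > min_cpm in at least min_samples samples."""
    genes = next(iter(cpm_by_sample.values())).keys()
    return [g for g in genes
            if sum(s[g] > min_cpm for s in cpm_by_sample.values()) >= min_samples]

# Hypothetical data: six samples with identical counts for three genes.
counts = {"geneA": 500000, "geneB": 499999, "geneC": 1}
lengths_kb = {"geneA": 2.0, "geneB": 1.0, "geneC": 1.0}
samples = {f"sample{i}": cpm(counts) for i in range(1, 7)}

kept_genes = expressed_genes(samples)
tpm_values = tpm(counts, lengths_kb)
print(kept_genes)
print(round(sum(tpm_values.values())))
```

In this toy example geneC has a CPM of 1 in every sample and is removed by the filter, and the TPM values of a sample sum to one million by construction.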

eQTL analysis

First, we tested whether the samples presented a genetic stratification using plink [47] to calculate the eigenvectors [48]. Given the sample elimination due to low mapping to the annotation, we carried out the eQTL analysis with 41 samples. To prevent inflation of effects when working with variants with low allelic frequencies [49] and to conduct a robust analysis with enough samples in each genotype group, we further retained only those SNPs that had at least five animals in each of the three genotype groups (both homozygotes and the heterozygote), had a minor allele frequency > 0.15, and followed Hardy–Weinberg equilibrium (false discovery rate = 0.05), which was tested with the R package “HardyWeinberg” [50]. In both approaches described below, eQTLs that overlapped between the ANOVA and additive model are only reported in the ANOVA model.
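The three SNP-level criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: a one-degree-of-freedom chi-square test stands in for whichever test was run in the “HardyWeinberg” package, and a nominal per-SNP threshold (alpha) replaces the false discovery rate control applied across all SNPs in the study.

```python
import math

def snp_summary(n_aa, n_ab, n_bb):
    """Return (MAF, chi-square statistic, p-value) for one SNP's genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    maf = min(p, 1 - p)
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_hwe = math.erfc(math.sqrt(chi2 / 2))
    return maf, chi2, p_hwe

def keep_snp(n_aa, n_ab, n_bb, min_per_genotype=5, min_maf=0.15, alpha=0.05):
    """Assumed decision rule: enough animals per genotype, MAF, and no HWE deviation."""
    if min(n_aa, n_ab, n_bb) < min_per_genotype:
        return False
    maf, _, p_hwe = snp_summary(n_aa, n_ab, n_bb)
    return maf > min_maf and p_hwe >= alpha

# Toy genotype counts for 41 animals (invented for illustration).
print(keep_snp(10, 21, 10))  # balanced genotypes, close to HWE proportions
print(keep_snp(30, 8, 3))    # fewer than five animals in one genotype group
print(keep_snp(18, 5, 18))   # strong heterozygote deficit
```

With the thresholds from the text, only the first toy SNP would be retained.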

Approach 1: TMM normalized and normal-transformed RNA-seq data

In line with standard procedures adopted for eQTL analysis [8, 25, 26], we normalized expression abundance for 10,332 genes using the TMM method [16]. First, we used the function “calcNormFactors” from the R package “edgeR” [27, 32] to calculate the normalization factors, and then we multiplied the normalization factors by the respective library sizes. Next, we used the function “cpm” with the normalized library sizes to obtain TMM-normalized counts per million. We then carried out an inverse normal transformation [8, 25, 26] using the “RankNorm” function from the R package “RNOmni”. Additive and ANOVA analyses were carried out independently for eQTL analysis with the R package “MatrixEQTL” [51] using 6216 SNPs. In both models, we used genotypes as a fixed effect. We inferred a significant eQTL when the nominal P-value was less than 5 × 10⁻⁸, which is a threshold commonly applied to genome-wide association studies [52,53,54,55,56], and corresponded to a false discovery rate [57] of 4% and 12% for the ANOVA and additive model, respectively.
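The inverse normal transformation step can be illustrated with a short Python sketch of a rank-based transformation applied to one gene across samples. It assumes the Blom offset (k = 0.375) and average ranks for ties, which is our understanding of the “RankNorm” defaults and should be checked against the package documentation; the expression values are invented.

```python
from statistics import NormalDist

def rank_inverse_normal(values, k=0.375):
    """Map values to normal quantiles via (rank - k) / (n - 2k + 1).

    Ties receive the average of the ranks they span (an assumption
    mirroring a common default, not taken from the paper).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - k) / (n - 2 * k + 1)) for r in ranks]

# Hypothetical TMM-normalized CPM values for one gene in five samples.
tmm_cpm = [3.1, 250.0, 12.4, 7.7, 12.4]
z = rank_inverse_normal(tmm_cpm)
print([round(v, 3) for v in z])
```

Note that the extreme value (250.0) is mapped to the same distance from zero as the smallest value, which shows how the transformation discards the magnitude of differences and keeps only the rank order.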

Approach 2: using a differential gene expression framework

We analyzed the RNA-sequencing data with a generalized linear model in “edgeR” and tested for differential gene expression using the quasi-likelihood F-test [58, 59]. We note that the normalization adopted by default in “edgeR” adjusts for library sequencing depth, but we added the TMM normalization factors calculated by the function “calcNormFactors” to the procedure for identification of eQTLs.

As part of our proposed approach, we also eliminated genes that had outlier values of transcript abundance, which reduced the transcriptome data to 4,149 genes. For these analyses, gene expression data were used as the dependent variable. Genotypes and collection sites were included in the model as independent variables (fixed effects). For additive analysis, the genotypes were input as numerical variables. For ANOVA-like analysis, we carried out a two-tier analysis. First, we tested the association between SNP and gene transcript abundance using all three genotypes as a factor variable. Next, we subset SNPs that were significantly associated with gene transcript abundance and pseudo-coded the genotypes to establish two contrasts [60]. The first contrast compared the homozygote genotype from the reference allele versus the heterozygote and the homozygote genotype from the alternate allele (i.e., AA versus AB, BB). The second contrast compared the homozygote genotype from the alternate allele versus the heterozygote and the homozygote genotype from the reference allele (i.e., AA, AB versus BB). We also inferred a significant eQTL when the nominal P-value was less than 5 × 10⁻⁸ [52,53,54,55,56].
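The genotype codings described above can be sketched in Python. The function names are hypothetical and the snippet only shows how a genotype could be turned into a numeric additive covariate and into the two grouped contrasts; the actual model fitting was done with “edgeR” in R (Additional file 1), and genotype strings are assumed to be already normalized to "AA", "AB", or "BB", with A as the reference allele.

```python
def additive_code(genotype, alt_allele="B"):
    """Additive coding: number of alternate alleles (0, 1, or 2)."""
    return genotype.count(alt_allele)

def ref_homozygote_contrast(genotype):
    """First contrast: AA (0) versus AB and BB pooled (1)."""
    return 0 if genotype == "AA" else 1

def alt_homozygote_contrast(genotype):
    """Second contrast: AA and AB pooled (0) versus BB (1)."""
    return 1 if genotype == "BB" else 0

# Toy genotypes for five animals at one SNP (invented for illustration).
genotypes = ["AA", "AB", "BB", "AB", "AA"]
design = [
    (g, additive_code(g), ref_homozygote_contrast(g), alt_homozygote_contrast(g))
    for g in genotypes
]
for row in design:
    print(row)
```

Under this coding, a gene whose heterozygotes group with one of the homozygotes is separated by one of the two pooled contrasts, whereas the additive code expects heterozygotes to sit midway between the homozygotes.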

Visualization of the results

We used the R packages “ggplot2”, “cowplot” [61], or “plotly” [62] for plotting [63] and used Cytoscape [64] to visualize eQTLs in network style.

Analysis of gene ontology enrichment

We tested several lists of genes for the enrichment of gene ontology using the R package “GOseq” [65]. To account for multiple hypothesis testing, P-values were adjusted by family-wise error rate (FWER) [66]. Results were retained if they had FWER < 0.05.
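For readers unfamiliar with FWER control, the following Python sketch shows one standard FWER procedure, the Holm step-down adjustment. The text does not state which FWER method was applied, so Holm is used here purely as an assumed example, and the P-values are invented.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (controls the family-wise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical enrichment P-values for four gene ontology terms.
raw_p = [0.001, 0.04, 0.03, 0.5]
fwer = holm_adjust(raw_p)
significant = [i for i, p in enumerate(fwer) if p < 0.05]
print([round(p, 3) for p in fwer])
print(significant)
```

In this toy example only the first term remains below the 0.05 threshold after adjustment.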

Results

Overview of SNP identification

We compiled genotype data at 23,506,613 nucleotide positions. Not surprisingly, 99.6% of the genomic positions were homozygous for the reference allele and 2,167 positions were homozygous for the alternate allele. Our pipeline identified 91,006 nucleotide positions showing polymorphisms in our samples. After testing for deviation from Hardy–Weinberg equilibrium (Fig. 1B), we retained 6,207 SNPs for further analysis (Fig. 1C).

Notably, 96% (n = 5964) of the SNPs have been previously identified and are recorded in the Ensembl variant database [45, 67], which includes the dbSNP ([68] version 150), while 243 SNPs were not found in the Ensembl variant database (Additional file 2). Most of the SNPs are in 3′ UTRs (n = 1553), and a smaller proportion (n = 483) were annotated as missense variants (Additional file 2). We observed no genetic substructure of the individuals based on the SNPs analyzed here (Additional file 3: Fig. S1).

eQTL analyses

For eQTL analysis, we obtained the matrix with raw counts from a previous study [35] from our group. After filtering for lowly expressed genes, we quantified the transcript abundance for 10,332 protein-coding genes. We then analyzed the transcriptome and the SNP data following the two frameworks.

Approach 1: TMM normalized and normal-transformed RNA-seq data

The inverse normal transformation within a gene and across samples [26] indeed normalized the RNA-sequencing data (Additional file 3: Fig S2). Using the R package “MatrixEQTL” [69], the ANOVA and additive analyses were completed in 4.699 and 2.473 s, respectively, using one processor core (2.60 GHz).

We identified 35 significant eQTLs (P < 5 × 10⁻⁸) following the ANOVA model (Fig. 2). Annotated SNPs mapped to the genes ASCC1, BOLA-DQB, FAF2, IARS2, MGST2, MRPS9, NECAP2, and TRIP11 (Additional file 4). We also identified 39 significant eQTLs (P < 5 × 10⁻⁸) following the additive model (Fig. 3). Annotated SNPs mapped to the genes AHNAK, GLB1, and TRIP11 (Additional file 5), and SNPs on the gene TRIP11 accounted for the majority of the eQTLs.

Fig. 2

eQTLs identified using ANOVA model on TMM normalized counts per million and normal-transformed RNA-seq data. Y axis for all graphs is TMM normalized transcripts per million

Fig. 3

eQTLs identified using additive model on TMM normalized counts per million and normal-transformed RNA-seq data. Y axis for all graphs is TMM normalized transcripts per million

Approach 2: using a differential gene expression framework

Using the R package “edgeR” [27], all tests to determine dominance and additive models were completed in 36 and 9 min, respectively, using 34 processor cores (2.60 GHz). We identified 936 significant eQTLs (P < 5 × 10⁻⁸). These eQTLs were formed by 16 SNPs present in the dbSNP and one SNP that is a putatively new variant (Additional file 2), influencing the transcript abundance of 445 genes. The majority (98.6%) of the eQTLs were formed by SNPs on the gene TATA-Box binding protein associated factor 15 (TAF15), followed by 6 eQTLs formed by SNPs on the gene SMG6 nonsense-mediated mRNA decay factor (SMG6). The other annotated genes with SNPs forming significant eQTLs were TRIP11, PI4KA, LMBR1L, and ZNF175. There was no overlap of significant eQTLs between the two approaches (Additional file 6, Additional file 3: Fig. S3).

It was also possible to separate the eQTLs into dominance or additive allelic interaction. We determined that six of the eQTLs followed the pattern of an additive allelic relationship (Fig. 4A, Additional file 7). Two SNPs (rs41892216 and rs135008768) impacting the expression of the gene sialic acid-binding Ig-like lectin 14 are also present in the region containing the sialic acid-binding Ig-like lectin gene family on chromosome 18. One SNP is a missense mutation (18:57,565,792, Fig. 4B) on the gene SIGLEC5, and the SNP on nucleotide 18:57,498,163 is a variant downstream of SIGLEC6. Other SNPs were annotated to the genes PI4KA (17:72,208,968, rs133672368), TRIP11 (21:56,676,553, rs479089277), and ZNF175 (18:57,538,713, rs109161398).

Fig. 4

Significant eQTLs were identified using the differential gene expression framework. A Network depicting the connectivity between SNPs and the genes whose genotypes are influencing their transcript abundance. B Bar plot of the frequency of genes containing SNPs forming eQTLs. Only SNPs that were annotated to genes with a symbol (within a gene model, or within 1,000 nucleotides on each side) are depicted in this figure

We also identified 930 significant eQTLs following a dominance allelic relationship (Additional file 8). Eight annotated SNPs mapped to the genes LMBR1L, SMG6, TAF15, and TRIP11. Of note, four intronic variants on the gene TAF15 (19:14,551,828, 19:14,554,927, 19:14,554,403, and 19:14,553,701, Fig. 5A) were collectively associated with the expression of 427 genes, with some examples depicted in Fig. 5B.

Fig. 5

Significant eQTLs were inferred using the differential gene expression framework following the additive relationship between alleles. A Eight eQTLs following the additive model determined by edgeR. Y axis for all graphs is TMM normalized transcripts per million. B Ensembl genome browser indicating the SNP position and examples of raw data used for the SNP’s identification

Given the number of genes expressed in PWBCs that were influenced by SNPs, we asked if there would be an enrichment of gene ontology [70] biological processes among these 427 genes. By setting a more stringent threshold of significance for the eQTLs (P < 5 × 10⁻¹⁰), we obtained a subset of 196 genes, which were enriched for two biological processes (FWER < 0.05): regulation of catalytic activity (fold-enrichment: 3.54; genes: APBA3, ARHGDIA, ARHGEF1, CAPN1, DENND1C, EEF1D, EIF2B3, RAB3IP, RALGDS, RING1; Additional file 9) and endocytic recycling (fold-enrichment: 7.81; genes: CCDC22, DENND1C, PTPN23, SNX12).

Discussion

The major goal of our work was to identify genes expressed in PWBCs of crossbred beef heifers whose transcript abundance is impacted by genetic variants. We used a gold standard approach presented by the GTEx consortium, but also analyzed the RNA-sequencing data without a transformation to force a Gaussian distribution of the counts. The framework for eQTL analysis presented here is motivated by the following rationale: (i) the vast majority of eQTL analyses carried out currently use RNA-sequencing data; (ii) by the nature of the procedures, RNA-sequencing data is count data, which is not normally distributed [71, 72]; and (iii) in principle, an eQTL analysis is an expansion of a differential gene expression (DGE) analysis, where samples are grouped by their genotypes, which is analogous to groups or treatments typically used in DGE analysis. Compared to the framework used in the human or farm GTEx consortia, our analysis of RNA-sequencing data from cattle PWBCs using the DGE framework identified more eQTLs under the dominance model and an equivalent number of eQTLs under the additive model of allele interaction.

Our study has a few limitations, but they do not hinder the validity of our findings. First, we identified SNPs using the RNA-sequencing data, thus we are not accounting for genomic variants in promoters or distal cis-regulatory elements. This is likely to have contributed to the limited number of cis-eQTLs reported here. Second, our transcriptome data represents a mixture of white cells identified in the blood. The proportion of different cells that compose the mixture of white cells was not accounted for in our model. A genetic factor contributing to a potential greater abundance of one specific cell type [73] is thus a confounding factor in our study. However, these two limitations do not directly impact our main take-home message that there is no need for researchers to transform RNA-sequencing data to fit a normal distribution in eQTL studies.

Variant genotyping using RNA-sequencing data

RNA-sequencing data can be used to identify genomic variants in a wide range of organisms, including livestock [74,75,76,77], and multiple pipelines have been developed for variant discovery and genotype calling [74,75,76,77]. Here we opted for a hybrid approach, which utilized the “SplitNCigarReads” function of GATK followed by the functions “mpileup” and “call” from BCFtools. The reason for using BCFtools was that it calls genotypes at every nucleotide position by default, so that individuals were genotyped regardless of the homozygote or heterozygote makeup.

Prior research showed that the efficacy of genotype calling using RNA-sequencing data is high [78]. Although we did not assess the specificity of genotype calling with an orthogonal method, we employed a stringent requirement for coverage equal to or greater than 20×, which is higher than the previously suggested 10× [75, 78] for high confidence genotype calling. In addition, 96% of the variants identified in our pipeline are present in the dbSNP ([68] version 150), and the variants have the same allelic composition reported in the dbSNP. Our hybrid pipeline efficiently genotyped individuals at homozygote and heterozygote genomic positions, although further confirmation is required for the variants called in our work that are not reported in the dbSNP.

eQTL analysis using RNA-sequencing with and without forcing the data into a Gaussian distribution

Current statistical approaches employed for eQTL analysis [79] assume that the data is normally distributed, and the transformation of RNA-sequencing data to enforce a normal distribution is employed in nearly all major eQTL studies. Our comparison of the RNA-sequencing data prior to and after transforming the data (Additional file 3: Fig. S1) does confirm that the inverse normal transformation [26] is highly effective in reducing skewness and shrinking the variance to reduce the impact of extreme values in the analysis [72], thus making the data suitable for statistical tests requiring normally distributed data.

We first analyzed our data following the GTEx framework [8], transforming the data to achieve a normal distribution. Our analysis yielded fewer significant associations between genotype and gene transcript abundance relative to previously published studies that worked with genes expressed in blood samples [80,81,82,83] and the recent results from the cattle GTEx consortium [10]. This large difference was expected because we only utilized 6207 SNPs in our analysis, which yields less genotypic data as compared to high-throughput genotyping platforms or imputation of SNPs from reference populations. Another difference between our procedure and other reports was the stringent threshold to infer significance (P = 5 × 10⁻⁸, −log10(5 × 10⁻⁸) = 7.3).

We noted, however, that visual inspection of the data with significant eQTLs identified with the ANOVA model (see examples in Fig. 2C) does not clearly indicate patterns of data distribution that resemble the definition of allelic interaction characterized as complete dominance [84, 85]. The dispersion of the data with significant eQTLs identified with the additive model (see examples in Fig. 3C) does indicate patterns of data distribution that resemble alleles interacting in additive mode [84, 85]. However, the distribution of heterozygotes showed two groups of samples with distinct profiles.

The graph profiles obtained from significant eQTLs using the GTEx framework prompted us to analyze the data using a DGE framework. To that end, we carried out an analysis using one of the commonly used statistical algorithms coded in the R package “edgeR” [27, 32, 86]. The comparison of our eQTL analysis using “edgeR” showed a striking contrast with the analysis using the GTEx framework and “MatrixEQTL” in many important aspects. First, there was no overlap of significant eQTLs obtained between the two approaches within this study. Here, we point out that identifying which eQTL is true is virtually impossible without further mechanistic experiments that confirm the influence of allelic variants on gene expression [87, 88]. Our findings add to previous observations that the type of statistical analysis carried out is a critical contributor to the lack of replicability observed across eQTL studies [89, 90]. Second, working with specific contrasts, we were able to identify trans eQTLs that more closely resemble complete dominance, which were not identified by the standard framework. Our results are evidence that genes whose expression is under genetic control and follows patterns of complete dominance [91, 92] are probably more common than previously expected [8]. The identification of groups of genes enriched for specific biological processes supports the idea that this genetic control under the dominance model may have a biological role in the function of PWBCs.

We identified two important aspects that show a contrast between the ANOVA framework and the DGE framework we propose here. First, the functions in “MatrixEQTL” require fewer computational resources and less time to conclude the analysis relative to the calculations carried out using the DGE framework in “edgeR”. Our proposed approach is inherently more complex, as we carried out multiple tests to provide robust and valuable information about dominance interaction between alleles. It is also very important to note that our study is not about the tools (“MatrixEQTL” or “edgeR”), because researchers can use other tools, such as “FastQTL” [93] for the standard analysis of eQTL or DESeq2 [20] for the DGE framework. Second, the transformation of the data to fit a normal distribution clearly shrinks the variance (Additional file 3, Fig. S4), reducing the differences in transcript abundance among genotypes and thus reducing the likelihood that these eQTLs are inferred as significant. In the end, the most critical choice researchers need to make is between (i) forcing data that is not normally distributed and has many outlier data points [71, 72] into normality or (ii) utilizing a framework that employs a statistical test appropriate for count data.

Conclusions

In summary, different types of data normalization and analytical procedures lead to a variety of combinations that can be used for eQTL analysis using RNA-sequencing. Most of these approaches also transform the data to fit a normal distribution. Our analysis showed that it is possible to carry out eQTL studies using the concepts and analytical framework developed for differential gene expression, which does not require data transformation to fit a normal distribution and is thus likely more suitable for RNA-sequencing. The approach proposed here can uncover genetic control of gene expression that is biologically relevant for the tissue studied and that otherwise may not be detected through data transformation and linear models.