Background

The Sorbs living in the Upper Lusatia region of Eastern Saxony are one of the few historic ethnic minorities in Germany. They are of Slavonic origin and speak a West Slavic language (Sorbian), and it is assumed that they have lived in ethnic isolation among the German majority during the past 1100 years [1]. Therefore, this population may be of special interest for genetic studies of complex traits.

The value of isolated populations for the discovery of genetic modifiers of diseases or quantitative traits is a matter of controversy [2-6]. On the one hand, reduced genetic and environmental variability of isolated populations could increase genotypic relative risks [7, 8]. In combination with the generally higher degree of linkage disequilibrium (LD) in isolated populations, this could improve the power of genetic association studies [5, 6, 9-11]. On the other hand, studies in isolated populations are often limited in size and, therefore, cannot match modern genome-wide association studies and meta-analyses comprising several tens of thousands of individuals.

Nowadays, it is common practice to combine all available genotyped and phenotyped populations in large-scale, whole genome meta-analyses or pooled analyses in order to identify even very small genetic effects as commonly observed for complex traits. Spurious associations caused by the genetic sub-structures of combined populations are the most serious concern of this approach [12-15], implying the need for appropriate adjustment strategies [16, 17]. This is especially true if evidence from isolated and outbred populations is combined, as this approach necessitates a thorough comparison of populations by population genetic means in order to determine their "degree of isolation" [6]. For this purpose, different methods have been proposed in the literature. For example, length and number of runs of homozygosity (ROHs) are discussed as an appropriate measure of isolation since they measure the degree of parental consanguinity [18]. LD is expected to be higher in isolated populations because of the lower number of generations, which results in fewer recombination events [5, 6]. Due to the smaller size of the founder population, it can also be expected that there is a lower number of polymorphisms in isolated populations [6, 19, 20]. Other markers of population structure such as F-statistics [21] are related to the measures mentioned above. Furthermore, genetic distances between populations can be determined by principal components analysis (PCA), which allows one to quantify how closely populations are related [22]. With this technique, genetic information can be mapped to topographic maps [14], allowing the assessment of a new indicator of isolation in the sense that an isolated population could be genetically far away from its geographic location. So far, there appears to be no single measure sufficient to characterize the isolation of a population.

Another characteristic feature of isolated populations is the putatively higher degree of cryptic relatedness in randomly drawn samples. This is a serious concern in genetic association analysis and needs to be addressed with appropriate statistical methods [17, 23-25]. Relatedness of individuals could also interact with the above mentioned measures of isolation of populations. Thus, when comparing two populations with different degrees of cryptic relatedness, it is not easy to decide whether differences in these measures can be traced back to different degrees of isolation or simply to over-sampling of related subjects.

The degree of isolation of the Sorbs has been studied in the past by the analysis of Y-chromosomal markers [26]. Recently, we compared a subset of about 200 Sorbs with other European isolates using 30,000 SNPs measured by microarrays [1]. In this analysis, the Sorbs showed only moderate signs of isolation. Here, we analyse a sample of N = 977 Sorbs, which is currently included in several genome-wide association studies (e.g. [27, 28]), and compare the Sorbs with the German outbred population of the KORA study [29]. Using the KORA study (N = 1644) and a larger sample of Sorbs (N = 977) provides more power than previous studies for comparing population genetic patterns between Sorbs and their neighbours. For this purpose, we assess the above mentioned population genetic characteristics: PCA, number of rare SNPs, F-statistics, ROHs, and LD. All analyses are based on genome-wide SNP array data. We also aim to separate effects of cryptic relatedness from effects of genetic isolation.

Furthermore, we analyse how differences between populations can be translated to differences in power of genetic association studies within these samples. We analyse the influence of genetic effect size, LD structure, heritability, and relatedness on power.

Methods

Study Populations

Sorbs

The Sorbs are of Slavonic origin, and lived in ethnic isolation among the Germanic majority during the past 1100 years [1]. Today, the Sorbian-speaking, Catholic minority comprises 15,000 full-blooded Sorbs resident in about 10 villages in rural Upper Lusatia (Oberlausitz), Eastern Saxony. A convenience sample of this population was collected including unrelated subjects as well as families. Details of the study population can be found elsewhere [28, 30]. Genotyping and metabolic phenotyping of this sample was approved by the ethics committee of the University of Leipzig and is in accordance with the declaration of Helsinki. All subjects gave written informed consent before taking part in the study. A subset of individuals were genotyped with either Affymetrix Human Mapping 500 K Array Set (N = 483) or Affymetrix Genome-Wide Human SNP Array 6.0 (N = 494). Details on genotyping are described in [28]. A total of 977 subjects were available after quality control.

KORA

The study population was recruited from the KORA/MONICA S3 survey, a population-based sample from the general population living in the region of Augsburg, Southern Germany, which was carried out in 1994/95. In a follow-up examination of S3 in 2004/05 (KORA F3), 3006 subjects participated. Recruitment and study procedures of KORA have been described elsewhere [29, 31]. For KORA F3 500 K we selected 1644 subjects of these participants then aged 35 to 79 years. Informed consent has been given, and the study has been approved by the local ethics committee. All KORA participants have a German passport. Genotyping of these individuals was performed with the Affymetrix Gene Chip Human Mapping 500 K Array Set as described in [32].

HapMap

174 CEU (CEPH (Centre d'Etude du Polymorphisme Humain) from Utah) and 88 TSI (Toscans in Italy) samples were taken from a recent HapMap Collection (Public Release 27, NCBI build 36, The International HapMap Project). From the CEU sample, we removed 58 children, five individuals with call rate < 90% and one individual because of cryptic relatedness (NA07045 because of lower call-rate compared to NA12813 [33]). In summary, we analysed 110 CEU and 88 TSI samples.

Data Analysis

Genotype Imputation and Quality Control

Missing genotypes of the KORA and Sorb samples were imputed separately using MACH Imputation Software with standard settings [34].

After Imputation, we checked 471,012 autosomal SNPs in the overlap of the Affymetrix Human Mapping 500 K Array Set and Affymetrix Genome-Wide Human SNP Array 6.0 for quality.

SNPs with a call rate of less than 95% in all four study populations combined, prior to imputation, were filtered out (34,711 SNPs). Hardy-Weinberg equilibrium (HWE) was tested across populations using a stratified test proposed by [35]; 10,712 SNPs with p-values less than 10⁻⁶ were eliminated. Finally, 14,508 SNPs showing unexpectedly high differences of allelic frequencies between genotyping platforms in the Sorbs sample were eliminated (p-value < 10⁻⁷, see [1] for further details).

Since several SNPs violated more than one of our criteria, we discarded a total of 46,536 SNPs and analysed 424,476 remaining SNPs.

For estimation of ROHs (see below) the number of analysed SNPs is reduced to 306,081 by matching SNPs on Affymetrix chips with available SNPs in the HapMap CEU and TSI samples. Due to the high sensitivity of the PCA (see below) we decided to tighten our quality criteria for this kind of analysis. Only SNPs with a call rate of at least 99% were included for PCA, which reduced the number of SNPs to 199,702.

An overview of the data pre-processing workflow can be found in Additional file 1.

Estimation of Relatedness

Pair-wise relatedness between all individuals of KORA and Sorbs was estimated by the method described in [36]. For first degree relatives one would expect a value of r = 0.5, for second degree relatives a value of r = 0.25, and so on. Two individuals were considered as unrelated if the pair-wise relatedness estimate was not greater than 0.2, which approximately corresponds to the exclusion of first and second degree relatives.

For analyses of dependence of measures of population genetic comparison on relatedness, we define two subsamples used for all subsequent analyses: For the first subsample, the complete Sorbs sample (Sorbs977, N = 977) was matched with a randomly selected subset of N = 977 unrelated KORA subjects born in Germany (KORA977). For the second subsample, a subset of N = 532 unrelated Sorbs (Sorbs532) was matched with a subset of N = 532 KORA subjects (KORA532) randomly selected from KORA977.

Unrelated subjects were selected by an algorithm which implements a step-by-step removal of individuals showing the highest number of relationships to other members of the population until no pair of individuals with relatedness > 0.2 remained.
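The step-by-step removal described above can be illustrated with a minimal sketch. It is not the code used in the study: the function name `select_unrelated`, the dictionary input format, and the tie-breaking rule (smallest identifier first) are assumptions made purely for illustration.

```python
def select_unrelated(relatedness, threshold=0.2):
    """Greedy pruning of related individuals (illustrative sketch).

    `relatedness` maps a pair (id_a, id_b) to its pair-wise relatedness
    estimate. Individuals are removed one at a time, always taking the one
    with the most remaining relationships above `threshold`, until no such
    relationship is left. Returns the set of retained ids.
    """
    # For every individual, collect the partners related above the threshold.
    partners = {}
    for (a, b), r in relatedness.items():
        partners.setdefault(a, set())
        partners.setdefault(b, set())
        if r > threshold:
            partners[a].add(b)
            partners[b].add(a)

    while partners:
        # Individual with the highest number of remaining relationships;
        # ties are broken by sorted id (an assumption, for reproducibility).
        worst = max(sorted(partners), key=lambda i: len(partners[i]))
        if not partners[worst]:
            break  # no pair above the threshold remains
        for other in partners.pop(worst):
            partners[other].discard(worst)
    return set(partners)


# Hypothetical example: H is related to X, Y and Z, who are unrelated
# to each other, so removing H alone resolves all relationships.
example = {
    ("H", "X"): 0.5, ("H", "Y"): 0.5, ("H", "Z"): 0.25,
    ("X", "Y"): 0.01, ("X", "Z"): 0.0, ("Y", "Z"): 0.02,
}
retained = select_unrelated(example)
```

Removing the most connected individual first tends to keep the retained sample large, which matters here because the Sorbs sample contains many families.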

Principal components analysis

PCA is suitable to map genetic variance to a few dimensions expressing the highest degree of variance [16, 22]. It has been shown recently that the application of this technique to genome-wide genetic data is powerful enough to mirror even small geographic distances in Europe [14, 37].

Since PCA results are biased in case of unequal population sizes [38], it was necessary to analyse subsamples of our populations. We performed PCA of 350 individuals from 7 subsamples of size N = 50, generated from the most unrelated individuals of our four study populations. The subsamples were defined as follows. Three subsamples were created from N = 1336, N = 140, and N = 80 individuals from KORA, who were born in Germany, in the Czech Republic, and in Poland, respectively. Two subsamples were generated from the Sorbs grouped by their degree of Sorbian ancestry. We identified 786 "Full"-Sorbs who stated that all four grandparents are Sorbs and 160 "Half"-Sorbs where at least one grandparent was not Sorbian. Another two subsamples were built from 110 CEU and 88 TSI samples.

PCA was done with iterative removal of outliers (default 5 iterations) and LD correction in consecutive SNPs (involving two previous SNPs as recommended in the manual of the EIGENSOFT package).

Rare SNPs

Isolated populations are supposed to have reduced genetic variability resulting in a higher number of rare SNPs. By definition, a SNP has a minor allelic frequency (MAF) of at least 1%. To account for sampling variance, we calculated the exact 95% confidence interval of the MAF and considered a SNP as rare if the entire interval was below one percent. This is equivalent to fewer than 11 observed minor alleles in Sorbs977 or KORA977 and fewer than five observed minor alleles in Sorbs532 or KORA532, respectively. The odds of finding rare SNPs were compared between KORA and Sorbs using Fisher's exact test.
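The allele-count thresholds quoted above can be checked with a short sketch. It assumes that "exact" refers to the two-sided Clopper-Pearson interval and that each individual contributes two alleles; the function names are hypothetical, and the upper limit is found by simple bisection on the binomial distribution function.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for small k only."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, conf=0.95):
    """Upper limit of the exact two-sided confidence interval for k/n."""
    if k >= n:
        return 1.0
    alpha = 1.0 - conf
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection: binom_cdf is decreasing in p
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > alpha / 2.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def max_rare_allele_count(n_individuals, maf_limit=0.01):
    """Largest minor-allele count whose whole interval lies below maf_limit."""
    n_alleles = 2 * n_individuals  # diploid genotypes
    k = 0
    while clopper_pearson_upper(k, n_alleles) < maf_limit:
        k += 1
    return k - 1


limit_977 = max_rare_allele_count(977)  # samples of 977 individuals
limit_532 = max_rare_allele_count(532)  # samples of 532 individuals
```

Under these assumptions the largest counts still classified as rare are 10 and 4, in line with the thresholds of "fewer than 11" and "fewer than five" alleles stated in the text.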

F-statistics

To characterize the variance of allelic frequencies within and between populations, we calculated F-statistics.

The inbreeding coefficient F_IS measures the correlation of alleles within an individual relative to the corresponding population. It is calculated by estimating the deviation of the observed number of heterozygous genotypes from the number expected under HWE. For every SNP, we calculated unbiased estimates as presented in [21], assessed the weighted average and determined the standard error of estimates by jack-knifing over individuals.

Correlation of alleles of individuals in the same population was estimated by the co-ancestry coefficient F_ST. Since F_ST quantifies the amount of genetic variation between populations, it is used to define genetic distances between populations. We assessed F_ST for pairs of populations using a combined estimate of all SNPs [21] and calculated the standard error of estimates again by jack-knifing over individuals.
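The idea behind F_IS and the jack-knife over individuals can be sketched as follows. This is a simplified moment estimator (one minus the ratio of observed to expected heterozygosity, pooled over SNPs), not the unbiased estimator of [21]; the function names and the genotype coding (0, 1, 2 copies of one allele, no missing values) are assumptions for illustration.

```python
from math import sqrt


def f_is(genotypes):
    """Simplified F_IS: 1 - (sum of observed het.) / (sum of expected het.).

    `genotypes` is a list of individuals, each a list of allele counts
    (0, 1 or 2) per SNP. At least one SNP must be polymorphic.
    """
    n = len(genotypes)
    h_obs_sum = h_exp_sum = 0.0
    for j in range(len(genotypes[0])):
        column = [ind[j] for ind in genotypes]
        p = sum(column) / (2 * n)                     # allele frequency
        h_obs_sum += sum(1 for g in column if g == 1) / n
        # Expected heterozygosity under HWE with small-sample correction.
        h_exp_sum += 2 * p * (1 - p) * (2 * n) / (2 * n - 1)
    return 1.0 - h_obs_sum / h_exp_sum


def jackknife_se(genotypes):
    """Standard error of f_is by leaving out one individual at a time."""
    n = len(genotypes)
    loo = [f_is(genotypes[:i] + genotypes[i + 1:]) for i in range(n)]
    mean = sum(loo) / n
    return sqrt((n - 1) / n * sum((x - mean) ** 2 for x in loo))


# Toy data: four individuals, one SNP.
estimate = f_is([[0], [1], [1], [2]])
```

Positive values indicate a deficit of heterozygotes relative to HWE and negative values an excess, which is how the signs reported for KORA and Sorbs are to be read.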

Runs of homozygosity

Counting ROHs is useful to detect inbreeding [18]. ROHs were determined in all individuals from KORA, Sorbs, CEU, and TSI using the PLINK Package (Version 1.07) with standard settings except for two parameters as noted below. PLINK estimates ROHs by searching for contiguous runs of homozygous genotypes. For this purpose, a window (default length 5000 kb, minimum 50 SNPs) is moved along the genome. To account for possible genotyping errors, at each SNP the homozygosity of the window is assessed allowing one (default) heterozygous genotype and five (default) missing calls. For each SNP, the proportion of overlapping homozygous windows is calculated. If this proportion is high enough (default 5%), the SNP is considered to be part of a homozygous segment. Only homozygous segments longer than a given threshold (500 kb, default 1000 kb), consisting of a minimum number of 100 SNPs (default) and comprising a minimum SNP density of one SNP per 50 kb (default), were denoted as ROH. A homozygous segment can be split in two if two SNPs are at least 100 kb apart (default 1000 kb). Details on the algorithm can be found on the PLINK Homepage (see URLs).
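The segment-level criteria above (minimum length, minimum number of SNPs, minimum density, splitting at large gaps) can be illustrated with a deliberately simplified scan. It is not PLINK's sliding-window algorithm: here any heterozygous call ends a run and missing calls (`None`) are tolerated, and the function and parameter names are hypothetical.

```python
def find_roh(positions_bp, genotypes, min_kb=500, min_snps=100,
             max_gap_kb=100, max_kb_per_snp=50):
    """Return runs of homozygosity as (start_bp, end_bp, n_snps) tuples.

    `positions_bp` are sorted SNP positions on one chromosome and
    `genotypes` the matching allele counts (0, 1, 2 or None for missing).
    """
    runs = []

    def close(first, last):
        # Keep the run only if it satisfies all segment-level criteria.
        n_snps = last - first + 1
        length_kb = (positions_bp[last] - positions_bp[first]) / 1000.0
        if (n_snps >= min_snps and length_kb >= min_kb
                and length_kb / n_snps <= max_kb_per_snp):
            runs.append((positions_bp[first], positions_bp[last], n_snps))

    start = None
    for i, g in enumerate(genotypes):
        if g == 1:  # heterozygous call ends the current run
            if start is not None:
                close(start, i - 1)
            start = None
        elif start is None:
            start = i
        elif (positions_bp[i] - positions_bp[i - 1]) / 1000.0 > max_gap_kb:
            close(start, i - 1)  # split the run at a large gap
            start = i
    if start is not None:
        close(start, len(genotypes) - 1)
    return runs


# Toy chromosome: one SNP every 10 kb, a single heterozygous call at index 60.
positions = [i * 10_000 for i in range(120)]
calls = [0] * 60 + [1] + [2] * 59
runs_default = find_roh(positions, calls)
runs_relaxed = find_roh(positions, calls, min_snps=50)
```

With the SNP minimum of 100 neither half of the toy chromosome qualifies, whereas relaxing it to 50 yields two runs. This shows how strongly the reported ROHs depend on parameter settings.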

Linkage disequilibrium

In the Sorbs and KORA samples, we calculated pair-wise LD for all SNPs on Chromosome 22 (5382 markers) using robust estimators [39]. We used the widely accepted measures r [40] and |D'| [41] to quantify LD. Since both measures depend on allelic frequencies, we also used the newly proposed measure |η1|, which is independent of allelic frequencies and is therefore especially useful when comparing populations [42]. The measure η1 is a monotone function of the odds ratio λ [43] and ranges between -1 and 1; its definition is given in [42].

Its absolute value is the percentage of SNP pairs under the non-informative uniform distribution with less extreme LD than the one observed (see [42] for details). Measures of LD were averaged using bins of 5 kb length as proposed by Olshen et al. [44]. Resulting means were smoothed by a LOWESS estimator [45].
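For orientation, the classical quantities entering these measures can be computed from the four haplotype frequencies of a SNP pair, as in the sketch below. It assumes known haplotype frequencies and two polymorphic SNPs, whereas the analysis itself estimates LD from genotype data with the robust estimators of [39]; η1 is not implemented because its formula is not reproduced here. All names are hypothetical.

```python
from math import sqrt, inf


def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Return (r, |D'|, odds ratio) for haplotype frequencies summing to 1."""
    p_A = f_AB + f_Ab          # frequency of allele A at the first SNP
    p_B = f_AB + f_aB          # frequency of allele B at the second SNP
    d = f_AB - p_A * p_B       # disequilibrium coefficient D
    r = d / sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
    # D is scaled by its maximum attainable absolute value given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    denom = f_Ab * f_aB
    odds_ratio = (f_AB * f_ab) / denom if denom > 0 else inf
    return r, d_prime, odds_ratio


# Toy example with moderate positive LD.
r, d_prime, odds_ratio = ld_measures(0.4, 0.1, 0.1, 0.4)
```

Both r and |D'| are scaled by terms involving the allele frequencies, which is the dependence that motivates an odds-ratio-based measure when populations with different allele frequencies are compared.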

Comparison of power assuming uncorrelated phenotypes

We analysed how the observed differences in LD structure between KORA and Sorbs can be translated into differences in power of genetic association studies. For this purpose, we assumed a linear regression model y = β1s1 + ε1 for a random phenotype y, which is influenced by the genotype s1 of a causative SNP; ε1 is the residual Gaussian error of the model.

The SNP is assumed to explain a pre-specified proportion of the total variance of the phenotype. In consequence, we can assume β1 = 1 without loss of generality. Within a distance of ± 2 Mb we then analysed the model y = β2s2 + ε2 for a second SNP, namely the one in maximum LD (measured by r) with the causative SNP. That is, we analysed the best proxy of the causative SNP rather than the causative SNP itself, modelling the marker principle of genetic association studies. The estimator of β2 is normally distributed, and its distribution depends on s1, s2, and the explained variance.

The corresponding formula, which is derived in Additional file 2, involves the number of individuals n, the genotype s2i of the i-th individual at the second SNP, and the average genotype s̄2. We calculated the power of the regression analysis, i.e. the probability that the observed p-value is smaller than a given significance level (p-value threshold) when testing against the null hypothesis β2 = 0, using this formula. This was done for all SNPs on Chromosome 22 in KORA977, KORA532, Sorbs977, and Sorbs532. The distribution of power was derived using the results of all SNPs of Chromosome 22. Results were compared between the KORA and Sorbs samples of equal size.
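As a hedged sketch of why the estimator is normally distributed, the standard least-squares argument can be written out as follows. It treats the genotypes as fixed and assumes that the proxy model includes an intercept; the symbols R² (proportion of variance explained by the causative SNP) and σ² (variance of ε1) are notation introduced here for illustration, and the result is not necessarily in the same form as the formula in Additional file 2.

```latex
% Assumed notation (not from the original text):
%   R^2      proportion of phenotypic variance explained by the causative SNP
%   \sigma^2 variance of the residual \varepsilon_1
\begin{align*}
y_i &= s_{1i} + \varepsilon_{1i}, \qquad
\varepsilon_{1i} \sim N(0, \sigma^2), \qquad
\sigma^2 = \frac{1 - R^2}{R^2}\,\operatorname{Var}(s_1), \\[4pt]
\hat\beta_2 &= \frac{\sum_{i=1}^{n} (s_{2i} - \bar s_2)\, y_i}
                    {\sum_{i=1}^{n} (s_{2i} - \bar s_2)^2}, \\[4pt]
\hat\beta_2 \mid s_1, s_2 &\sim N\!\left(
  \frac{\sum_{i=1}^{n} (s_{2i} - \bar s_2)\, s_{1i}}
       {\sum_{i=1}^{n} (s_{2i} - \bar s_2)^2},\;
  \frac{\sigma^2}{\sum_{i=1}^{n} (s_{2i} - \bar s_2)^2}
\right).
\end{align*}
```

The expression for σ² follows from R² = Var(s1) / (Var(s1) + σ²) with β1 = 1. The mean of the estimator shrinks with decreasing correlation between s1 and s2, so weaker LD between the causative SNP and its best proxy directly reduces the detectable effect.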

Comparison of power assuming correlated phenotypes

In the previous section, we derived formulae for the estimation of power under the assumption of uncorrelated phenotypes. This approach applies for either a negligible relatedness structure of the individuals or a weak correlation of phenotypes of related individuals. Applying a GRAMMAR approach [17], deviations from this situation can be corrected resulting again in the situation considered in the previous section.

However, to our knowledge, it is still not common practice in genome-wide association studies to use this approach to correct for relatedness. Therefore, we aim to study the situation in which the phenotypes are correlated but in which the corresponding individuals were analysed as independent even though they are not.

Following Amin et al. [17], we simulated phenotypes y on the basis of the mixed model y = β1s1 + g + ε1, comprising a fixed effect of genotypes s1, a random effect g ~ N(0, σg²G) representing the residual polygenic effects, and non-genetic residuals ε1 ~ N(0, σε²I). Here, G represents the pair-wise relatedness matrix. The model results in non-trivial covariance of phenotypes of different individuals. For each SNP we drew 1000 samples from the model and analysed the linear model y = β2s2 + ε2 for a second SNP which is in maximum LD with the first SNP, in complete analogy to the procedure developed for uncorrelated phenotypes (see previous section). Different degrees of heritability were simulated, where heritability comprises the variance explained by the genotypes s1 and the variance explained by the polygenic effects g. Providing values for these two explained variances results in the variance components σg² and σε², which follow after some calculations.
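A minimal sketch of one draw from such a mixed model is given below. It assumes β1 = 1, a positive definite relatedness matrix with unit diagonal, and that the two explained variances are given as fractions of the total phenotypic variance; the function names and the plain-Python Cholesky factorisation are illustrative choices, not the implementation used for the simulations.

```python
import random
from math import sqrt


def variance_components(var_s1, r2_snp, r2_poly):
    """Polygenic and residual variance for given explained fractions.

    With beta_1 = 1 the total phenotypic variance is var_s1 / r2_snp.
    """
    total = var_s1 / r2_snp
    sigma_g2 = r2_poly * total
    sigma_e2 = max(0.0, (1.0 - r2_snp - r2_poly) * total)  # guard rounding
    return sigma_g2, sigma_e2


def cholesky(matrix):
    """Lower-triangular factor L with L L^T = matrix (positive definite)."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = sqrt(matrix[i][i] - s)
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def simulate_phenotypes(s1, relatedness, r2_snp, r2_poly, rng):
    """One draw of y = s1 + g + e with Cov(g) = sigma_g2 * relatedness."""
    n = len(s1)
    mean = sum(s1) / n
    var_s1 = sum((x - mean) ** 2 for x in s1) / n
    sigma_g2, sigma_e2 = variance_components(var_s1, r2_snp, r2_poly)
    low = cholesky(relatedness)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Correlated polygenic effects: g = sqrt(sigma_g2) * L z.
    g = [sqrt(sigma_g2) * sum(low[i][k] * z[k] for k in range(i + 1))
         for i in range(n)]
    return [s1[i] + g[i] + rng.gauss(0.0, sqrt(sigma_e2)) for i in range(n)]


# Toy example: four individuals, the first two are first-degree relatives.
G = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
y = simulate_phenotypes([0, 1, 2, 1], G, r2_snp=0.05, r2_poly=0.5,
                        rng=random.Random(1))
```

Setting `r2_poly` to zero removes the covariance between individuals and recovers the uncorrelated case, whereas `r2_snp + r2_poly = 1` corresponds to a heritability of 100%. Relatedness matrices estimated from data are not guaranteed to be positive definite, in which case the factorisation above would need a regularised alternative.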

Statistical Software and Web-Resources

HapMap data were downloaded from [46]. Estimation of Eigenvectors for comparison of all subsamples was done with the EIGENSOFT package (Version 3.0, [47]). ROHs were determined by the PLINK Package (Version 1.07, [48]) [49].

All other calculations were performed using the Statistical Software package R (Version 2.8.0, [50]) [51].

Results

For population genetic comparison of the Sorbian minority in Germany with the German KORA population, several measures of genetic isolation were applied to genome-wide SNP array data.

Relatedness

We analysed the relatedness of all 476,776 pairs of individuals in the Sorbs and all 1,350,546 pairs in the KORA samples. Results are shown in Figure 1. Frequencies of relationships differ remarkably between the two samples. Emphasized by the different scales of the histograms, it can be clearly recognized that the numbers of first and second degree relationships are higher in the Sorbs compared to KORA. Numbers of pairs with estimates over a given threshold are shown in Table 1 for both populations. We also provide odds-ratios for the encounter of a related pair.

Figure 1

Distribution of degrees of relatedness in KORA and Sorbs. Distribution of degrees of relatedness in the KORA and Sorbs samples. For readability, the distribution of the 0.01% highest relatedness estimates of the KORA samples and the highest 0.5% estimates of the Sorbs samples are shown.

Table 1 Distribution of pair-wise relatedness estimates

To achieve samples without pairs of individuals with relatedness-estimates greater than 0.2, it was necessary to exclude 445 Sorbs and 33 KORA individuals, resulting in subsamples of 532 Sorbs and 1,611 KORA individuals.

Principal components analysis

Results of PCA after removal of outliers and LD correction are shown in Figure 2. The figure comprises all 150 individuals from KORA, 97 Sorbs, 49 HapMap CEU and 48 HapMap TSI after outlier removal.

Figure 2

Principal components analysis of study populations. First two principal components of individuals from KORA born in Czech Republic (N = 50), Germany (N = 50), Poland (N = 50) and Full-Sorbs (N = 49), Half-Sorbs (N = 48), CEU (CEPH (Centre d'Etude du Polymorphisme Humain) from Utah, N = 49) and TSI (Toscans in Italy, N = 48).

A plot of the genetic variance represented by the first two principal components clearly reflects the geographic origin of these populations. TSI samples are relatively far away from the other clusters, giving the orientation of a north to south axis. The KORA population is very close to the CEU HapMap population. In contrast, the Sorbian population clusters markedly further east. There is a clear trend of west to east clustering of KORA individuals born in Germany, KORA individuals born in Poland or Czech Republic, Half-Sorbs, and finally, Full-Sorbs. The Sorbs clusters are nearest to the cluster of KORA individuals born in Poland.

Rare SNPs

When analysing 424,476 quality-controlled SNPs in 977 Sorbs (Sorbs977) and the random sample of 977 individuals from KORA (KORA977), we counted 51,204 rare SNPs in Sorbs977 and 49,721 rare SNPs in KORA977 (p-value 6.7 × 10⁻⁷). In the subset of 532 unrelated Sorbs (Sorbs532) and the random sample of 532 unrelated individuals from KORA (KORA532), we again counted more rare SNPs in Sorbs532 than in KORA532, i.e. 49,257 and 47,913 (p-value 4.7 × 10⁻⁶), respectively.

F-Statistics

Estimating F_IS in the samples KORA977 and KORA532 resulted in slightly positive values, with the smaller value in KORA977. In contrast, in the samples Sorbs977 and Sorbs532, we find slightly negative values, with the smaller value in the sample Sorbs977.

F_ST estimates are somewhat higher between KORA977 and Sorbs977 than between KORA532 and Sorbs532. F_ST estimates are higher than corresponding F_IS estimates, indicating a clear genetic distance between the two populations. All statistics can be found in Table 2.

Table 2 Inbreeding and co-ancestry coefficients

Runs of Homozygosity

ROHs were determined for the populations KORA, Sorbs977, Sorbs532, CEU, and TSI. Percentages of individuals in these populations containing at least one ROH in a specified length interval were calculated (Figure 3). Compared to the other populations, Sorbs show a higher proportion of individuals with ROHs between 2.5 Mb and 5 Mb.

Figure 3

Proportion of individuals with certain ROH length. Proportion of individuals from KORA (N = 1644), Sorbs977, Sorbs532, CEU (CEPH (Centre d'Etude du Polymorphisme Humain) from Utah, N = 110) and TSI (Toscans in Italy, N = 88) with at least one ROH in the given length interval.

In a second step, mean total length of ROHs with a given minimum length was estimated averaged over the individuals of each population (Figure 4). Again, Sorbs differ from the other populations and are characterized by higher mean total length of ROHs. However, the effect is less pronounced if only long ROHs are considered. The mean total length of ROHs is shorter for Sorbs532 than for Sorbs977 but the difference is small.

Figure 4

Average total length of ROHs. Average total length of ROHs for KORA (N = 1644), Sorbs977, Sorbs532, CEU (CEPH (Centre d'Etude du Polymorphisme Humain) from Utah, N = 110) and TSI (Toscans in Italy, N = 88) in dependence on minimal length of a single run.

Linkage Disequilibrium

Three measures of LD were calculated for KORA977, KORA532, Sorbs977, and Sorbs532. Results of η1 are shown in Figure 5. Other measures such as r and D' behave similarly (data not shown). LD in the KORA sample is markedly lower at long ranges compared to Sorbs. This result is robust against dropping related individuals in the Sorb sample.

Figure 5

LD structure in KORA and Sorbs. LD structure in the KORA977, KORA532, Sorbs977 and Sorbs532 samples. η1 was estimated for all SNP pairs of chromosome 22. Results are averaged over distance using bins of 5 kb length and smoothed by a LOWESS estimator.

As expected for KORA977 and KORA532 a small sample size bias can be observed. In contrast the estimators for Sorbs977 and Sorbs532 are virtually identical.

Comparison of power assuming uncorrelated phenotypes

The power to detect causal SNPs was calculated for KORA977, KORA532, Sorbs977, and Sorbs532. Results for SNP effects with explained variances of 2% or 5% can be found in Figure 6. Since the results are virtually identical for KORA and Sorbs, we present the quartiles of the power distribution in Table 3 for p-value thresholds of 1 × 10⁻⁵ and 1 × 10⁻⁷.

Figure 6

Median power distribution in KORA and Sorbs. Median power to detect SNP effects explaining 2% (left) or 5% (right) of variance, respectively. Power is plotted versus the p-value threshold. The grey lines are virtually covered by the black lines. The dotted lines correspond to p-value thresholds of 1 × 10⁻⁵ and 1 × 10⁻⁷, respectively.

Table 3 Quartiles of power distribution assuming uncorrelated phenotypes

Comparison of power assuming correlated phenotypes

In Table 4 we present the power estimates assuming a heritability of 100%, which results in the greatest differences compared to Table 3. However, except for Sorbs977, there are only very small differences between Tables 3 and 4, and even for Sorbs977 the differences do not appear to be substantial. For an explained variance of 2%, the power in Sorbs977 increases, but it decreases for an explained variance of 5%. This is due to dependence on the significance threshold. Independent of the explained variance of the SNPs, the power under maximum heritability (100%) is greater than under minimal heritability for small p-value thresholds. But for large p-value thresholds, the opposite is true (see Additional file 3).

Table 4 Quartiles of power distribution assuming correlated phenotypes

The explanation for this behaviour is the inflation of the variance of the β-estimator caused by high levels of relatedness in the Sorbs977 sample (see Additional file 4).

Results for other degrees of heritability are presented in Additional file 5. As expected, in the case of minimal heritability the results of our simulations under the mixed model and the results obtained with our analytical formula used in the previous section are coincident.

Discussion

The Sorbs, resident in Lusatia, Germany, are an ethnic minority of Slavonic origin. Using genome-wide SNP array techniques, we aimed to compare this putatively isolated population with a German mixed population (KORA study) by various population genetic means. The Sorbs were compared recently with other European populations or isolates on the basis of a limited set of genetic markers and a limited set of unrelated individuals [1, 52]. In the present analysis, we studied the Sorbs from the perspective of ongoing genome-wide association studies. That is, we compared the population with a German mixed population on the basis of complete sets of genotyped individuals, and a large number of genotyped SNPs. We also aimed to separate the effect of isolation from potential effects caused by over-sampling of relatives in the Sorbs. Finally, we studied the implications of observed differences between KORA and Sorbs for the analysis, and especially, the power of genome-wide association studies.

Genotype data from a sample of 977 Sorbs were available from genotyping with 500 k and 1000 k Affymetrix SNP chips. While SNP markers come with certain drawbacks (ascertainment bias, need for careful QC), they have proven useful for detecting subtle population structures.

For comparison with a German mixed population, we used the KORA F3 sample (N = 1644) and corresponding genotypes from 500 k Affymetrix SNP chips. Observed differences between regions of Germany are typically an order of magnitude lower than differences observed between Sorbs and KORA [53]. Publicly available European-American HapMap samples were also included in the analysis.

A major goal of our study was to distinguish effects of genetic isolation from simple over-sampling of families in the Sorbs. Since most of the population genetic measures used to compare populations assume independence of individuals, over-sampling of families in certain samples may introduce a source of bias which is difficult to control. Indeed, we discovered a large number of closely related individuals within the Sorbs sample. Therefore, we repeated all analyses for a sub-group of Sorbs for which all relationships with relatedness estimates greater than 0.2 were removed. This does not completely resolve the problem of increased relatedness within the Sorbs sample but provides a trend for potential biases introduced by over-sampling of families. Indeed, such biases could be detected in our data, but they are not substantial, at least for the population genetic measures studied.

Since relatedness cannot be completely removed from the samples, a cut-off of 0.2 for the relatedness estimate seems to be feasible to study the effect of relatedness and to keep the sample size at an acceptable level. We also studied a cut-off of 0.1 reducing the sample size to N = 414. Results can be found in Additional file 6. Although tending slightly towards zero, results are essentially the same as those obtained for the cut-off of 0.2.

For some analyses such as determination of rare SNPs and LD it is known that sample size can introduce bias [39, 44, 54]. Therefore, for most comparisons we used randomly drawn subsamples of KORA which are of the same size as the Sorbs samples.

PCA is a proven means to detect even very small genetic differences between populations with high power. For European populations, it was demonstrated that the first two appropriately scaled principal components can map individuals to their geographic origin on the European continent with high precision, when all four grandparents are from the same location [14]. Our PCA results showed clear distances between KORA, Sorbs, and individuals from Tuscany. Using individuals from KORA and Tuscany to roughly orient the PCA graph on a map of Europe, Sorbs are positioned towards the East. KORA individuals are very close to the CEU HapMap population, while the distance to Tuscan/TSI individuals is much larger.

We conclude that the Slavonic origin of the Sorbs is still clearly genetically detectable. The analysis revealed that there is a west to east sequence of the clusters of KORA individuals born in Germany, KORA individuals born in Poland or Czech Republic, Half-Sorbs, and finally, Full-Sorbs. Although birthplace is not a stringent indicator of ethnicity, it is a commonly used surrogate in genetic epidemiologic studies if more detailed information cannot be ascertained. On the other hand, most of the KORA individuals born in Poland or Czech Republic are descendants of German minorities of these countries. Hence, on the basis of our data we cannot conclude that the Sorbs are genetically more distant from Germany than a random sample from Poland or Czech Republic. Half-Sorbs can be assumed to be closer to the German population than Full-Sorbs due to mating with German neighbours. This is clearly reflected by the localization of Half-Sorbs between KORA individuals and Full-Sorbs. There is a trend that the Sorbs are closer to the KORA individuals born in Poland than to the KORA individuals born in Czech Republic, which is in agreement with a recently stated hypothesis that the Sorbs are genetically closer to Poles than to Czechs [1].

Since it has been suggested that genetic diversity is lower in isolated populations [6], we analysed the number of rare SNPs. Indeed, we found a higher number of rare SNPs in the Sorbs sample compared to the KORA sample. Although significant, the difference is small in size.

The F_ST statistics between KORA and Sorbs were an order of magnitude higher than usually observed between different regions of Germany [53]. Thus, variance between KORA and Sorbs is much higher than expected for different regions in Germany. Surprisingly, the F_IS statistic was positive for KORA but negative for Sorbs. Such a phenomenon has also been observed for other isolated populations, suggesting that there may be signs of recent isolation breaking in the Sorbs [44]. Another indicator of isolation breaking is the relatively high number of Half-Sorbs (N = 160) in the present sample, i.e. subjects who claim to have less than four Sorbian grandparents. It should be remarked that the F_IS statistic is a population based measure rather than an individual based measure of inbreeding as studied in [1].

ROH analysis was proposed to detect signs of isolation by estimation of inbreeding [18]. Despite the simplicity of this concept, the calculation of ROHs depends on many parameter settings, such as SNP density or the allowed numbers of missing genotypes or heterozygous markers, which heavily influence the results. Parameter settings are extensively discussed in McQuillan et al. [18]. For our analysis, we used the default settings of PLINK except for two parameters: the minimum length of a homozygous segment was 500 kb (PLINK default is 1000 kb), and a homozygous segment was split if two neighbouring SNPs were more than 100 kb apart (PLINK default is 1000 kb). Hence, we used the same settings as McQuillan et al. [18] except for the minimum number of contiguous homozygous SNPs constituting a ROH, for which we kept the PLINK default (N = 100). The results of ROH analysis also depend on the allelic frequencies of the populations and on the SNP selections used by different genotyping technologies. Since McQuillan et al. [18] used a different genotyping platform (Illumina Infinium HumanHap300v2), the latter modification was necessary to obtain similar results.
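The settings described above can be sketched as a PLINK 1.x call, assembled here as a Python argument list. The mapping of the described parameters to the flags `--homozyg-kb`, `--homozyg-gap` and `--homozyg-snp` is our reading of the PLINK documentation and should be checked against the PLINK version in use. The fileset prefix `sorbs_kora` and the output prefix `roh_500kb` are hypothetical names.

```python
def plink_roh_args(bfile, out):
    """Build a PLINK 1.x argument list for ROH detection (illustrative sketch)."""
    return [
        "plink",
        "--bfile", bfile,        # hypothetical binary fileset prefix
        "--homozyg",             # enable runs-of-homozygosity detection
        "--homozyg-kb", "500",   # minimum ROH length in kb (PLINK default 1000)
        "--homozyg-gap", "100",  # split a segment at SNP gaps > 100 kb (default 1000)
        "--homozyg-snp", "100",  # minimum SNPs per ROH (PLINK default, kept)
        "--out", out,            # hypothetical output prefix
    ]


cmd = plink_roh_args("sorbs_kora", "roh_500kb")
print(" ".join(cmd))
```

All remaining ROH parameters (window size, allowed heterozygous or missing calls per window, SNP density) are left at their PLINK defaults in this sketch.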

We found that Sorbs show an enrichment of ROHs of intermediate length (between 2.5 Mb and 5 Mb) compared to KORA, CEU, and TSI. This effect is much less pronounced for longer ROHs. Accordingly, the coverage of the genome by ROHs is higher in the Sorbian population. Following the reasoning of McQuillan et al. [18], we conclude that there is a lack of recent parental relatedness in the Sorbs (no differences for long-range ROHs) but that there are signs of ancient parental relatedness or the existence of autozygous segments of older pedigree structures (differences for ROHs of intermediate range). The lack of direct parental relatedness is in accordance with our estimates of FIS.

Furthermore, we compared the LD structure of chromosome 22 between the KORA and the Sorbs populations using the newly proposed LD measure η1. In contrast to the more popular measures r and D', the measure η1 is independent of allelic frequencies [42]. In our opinion, this property is desirable when comparing LD structure between populations of potentially differing allelic frequencies. However, the results obtained by the three measures are very similar (data not shown).
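The dependence of the classical measures on allelic frequencies can be illustrated with a minimal sketch that computes D, D' and r from haplotype and allele frequencies. The definition of η1 is given in [42] and is not reproduced here. The frequencies in the example are invented for illustration.

```python
import math


def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b
    # The largest attainable |D| depends on the sign of D and on the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r = d / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r


# Two hypothetical SNP pairs, both with D' = 1 but different allele frequencies.
_, d_prime_equal, r_equal = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)
_, d_prime_unequal, r_unequal = ld_measures(p_ab=0.1, p_a=0.1, p_b=0.5)
```

In both hypothetical pairs D' equals 1, whereas r drops from 1 to about 0.33 when the allele frequencies of the two SNPs differ. A shift in allelic frequencies between populations can thus change r without any change in the underlying haplotype structure.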

The expected small upward bias caused by the smaller sample size of KORA532 compared to KORA977 could be clearly detected. In contrast, the results for Sorbs977 and Sorbs532 are virtually identical. We conclude that the expected upward bias of the reduced Sorbs532 sample is nullified by the elimination of relationships. This interpretation is supported by the fact that a random sample of N = 532 individuals from Sorbs977 resulted in the same sample size bias as observed for KORA (data not shown). That is, LD is upwardly biased by the relatedness structure in the Sorbs. Nevertheless, even if relationships are eliminated to a reasonable degree (first and second degree relationships), Sorbs show generally higher LD at longer distances than is observed in KORA. It has already been shown in the literature that LD excess at longer ranges is a characteristic of isolated populations [5, 9–11]. However, the effect is moderate in size, which is also in agreement with findings in several other populations considered as isolated [44, 55–57].

Since LD structure directly influences the coverage of a SNP technology, and with it the power of genome-wide association studies, we performed power analyses in the Sorbs and KORA samples. For this purpose, we defined a fixed genetic effect of an arbitrary SNP on chromosome 22. Explained variance was used as the measure of effect in order to adjust for differences in allelic frequencies. For this SNP, we analysed the best proxy SNP available on chromosome 22 in order to mimic a situation in which an unobserved causative variant is detected via a marker in LD. We derived an analytical formula for our model for the case of negligible heritability, in which individuals can be considered as independent. This formula also applies to situations where correction for relatedness effects has been performed, for instance with a GRAMMAR approach [17]. Power was calculated for all SNPs on chromosome 22, and the resulting distribution was compared between the Sorbs and KORA samples with and without relatives. No differences regarding power were detected. We conclude that there is no gain in power due to higher LD in the Sorbs.
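The logic of such a proxy-based power calculation can be sketched as follows. This is a standard normal approximation for a quantitative trait in independent individuals, not the analytical formula derived in this study. It assumes that the variance explained at the proxy equals the variance explained by the causal SNP multiplied by the squared correlation between the two SNPs. The function name, the explained variance of 1% and the significance threshold are assumptions chosen for illustration.

```python
from statistics import NormalDist


def proxy_power(n, explained_var, r2_proxy, alpha):
    """Approximate power of a two-sided association test at a proxy SNP.

    Illustrative assumptions: independent individuals, quantitative trait,
    normal approximation to the test statistic.
    """
    std_normal = NormalDist()
    # Variance explained at the proxy is attenuated by r^2 with the causal SNP.
    v = explained_var * r2_proxy
    # Expected z-score of the regression slope under the alternative.
    z_mean = (n * v / (1 - v)) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(z_mean - z_crit) + std_normal.cdf(-z_mean - z_crit)


# Sample sizes as in the text; effect size, r^2 and alpha are hypothetical.
power_977 = proxy_power(n=977, explained_var=0.01, r2_proxy=0.8, alpha=1e-4)
power_532 = proxy_power(n=532, explained_var=0.01, r2_proxy=0.8, alpha=1e-4)
```

Under these assumptions, power depends on LD only through the squared correlation between the causal SNP and its best proxy, so a population gains power only if its best proxies are systematically better.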

Since relatedness structure is often neglected in genetic association studies, we also analysed the influence of the relatedness structure present in our sample on the power of an uncorrected analysis. This analysis was done via simulations of a linear mixed model comprising a fixed effect of a SNP and random polygenic and non-genetic effects. We showed that the variance of the β-estimator is inflated under relatedness and high heritability. This results in a gain in power for higher p-value thresholds and a loss of power for lower p-value thresholds in the Sorbs977, irrespective of the size of the genetic effect considered. The explanation is that normal distributions with different variances are overlapping.
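The variance inflation can be illustrated analytically with a minimal sketch. It is a simplified counterpart of the simulation described above, not the simulation itself. It assumes a phenotypic covariance of the form V = h²K + (1 − h²)I with total variance 1, where K holds the relatedness coefficients, and a centred genotype vector x. The ratio of the true variance of the ordinary least squares estimator to its naive variance is then x'Vx / x'x. The two sib pairs in the example are hypothetical.

```python
def ols_variance_inflation(x, kinship_pairs, h2):
    """Ratio of true to naive variance of the uncorrected OLS slope.

    x: centred genotype scores, one per individual
    kinship_pairs: dict {(i, j): relatedness coefficient} for related pairs, i < j
    h2: heritability of the phenotype (total phenotypic variance set to 1)
    """
    xx = sum(v * v for v in x)
    xvx = xx  # the diagonal of V is 1, so it contributes x'x
    for (i, j), k in kinship_pairs.items():
        # Each related pair contributes twice (V is symmetric).
        xvx += 2 * h2 * k * x[i] * x[j]
    return xvx / xx


# Hypothetical sample: two sib pairs (relatedness 0.5) with identical genotypes.
x = [1.0, 1.0, -1.0, -1.0]
sib_pairs = {(0, 1): 0.5, (2, 3): 0.5}

inflation_h2_0 = ols_variance_inflation(x, sib_pairs, h2=0.0)
inflation_h2_half = ols_variance_inflation(x, sib_pairs, h2=0.5)
inflation_h2_1 = ols_variance_inflation(x, sib_pairs, h2=1.0)
```

In this toy example the inflation grows linearly with heritability and vanishes for h² = 0, consistent with treating individuals as independent when heritability is negligible.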

We conclude that relatedness in the Sorbs977 sample influences the power of uncorrected genetic association studies. Influence of relatedness on power is highest under maximum heritability of the phenotype. However, directions of power differences depend on the size of the genetic effect in combination with the significance threshold chosen.

In our simulations we did not observe a scenario resulting in a clear power benefit in the Sorbs977 sample. However, this does not rule out that there might be a higher power in the Sorbs due to increased effect sizes caused, e.g., by higher environmental homogeneity or lower number of causative variants [7, 8].

Conclusions

We showed that there are signs of genetic isolation within the Sorbs which cannot be explained by over-sampling of relatives. The effects are moderate in size. The Slavonic origin of the Sorbs is still genetically detectable. Although there is higher LD in the Sorbs, the difference from KORA is small. Power analysis showed that a clear advantage of the Sorbs for genome-wide association studies with respect to coverage cannot be expected.

The significant amount of cryptic relatedness in the Sorbs sample results in inflated variances of β-estimators which should be considered in genetic association analyses.