Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat

Oget-Ebrad, Claire; Heumez, Emmanuel; Duchalais, Laure; Goudemand-Dugué, Ellen; Oury, François-Xavier; Elsen, Jean-Michel; Bouchet, Sophie

doi:10.1007/s00122-024-04718-6

Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat

Original Article
Open access
Published: 18 September 2024

Volume 137, article number 226, (2024)
Cite this article

Download PDF

You have full access to this open access article

Theoretical and Applied Genetics Aims and scope Submit manuscript

Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat

Download PDF

221 Accesses
Explore all metrics

Abstract

Key message

From simulations and experimental data, the quality of cross progeny variance genomic predictions may be high, but depends on trait architecture and necessitates sufficient number of progenies.

Abstract

Genomic predictions are used to select genitors and crosses in plant breeding. The usefulness criterion (UC) is a cross-selection criterion that necessitates the estimation of parental mean (PM) and progeny standard deviation (SD). This study evaluates the parameters that affect the predictive ability of UC and its two components using simulations. Predictive ability increased with heritability and progeny size and decreased with QTL number, most notably for SD. Comparing scenarios where marker effects were known or estimated using prediction models, SD was strongly impacted by the quality of marker effect estimates. We proposed a new algebraic formula for SD estimation that takes into account the uncertainty of the estimation of marker effects. It improved predictions when the number of QTL was superior to 300, especially when heritability was low. We also compared estimated and observed UC using experimental data for heading date, plant height, grain protein content and yield. PM and UC estimates were significantly correlated for all traits (PM: 0.38, 0.63, 0.51 and 0.91; UC: 0.45, 0.52, 0.54 and 0.74; for yield, grain protein content, plant height and heading date, respectively), while SD was correlated only for heading date and plant height (0.64 and 0.49, respectively). According to simulations, SD estimations in the field would necessitate large progenies. This pioneering study experimentally validates genomic prediction of UC but the predictive ability depends on trait architecture and precision of marker effect estimates. We advise the breeders to adjust progeny size to realize the SD potential of a cross.

Accuracy of genomic selection for grain yield and agronomic traits in soft red winter wheat

Article Open access 01 November 2019

Genomic assisted selection for enhancing line breeding: merging genomic and phenotypic selection in winter wheat breeding programs with preliminary yield trials

Article Open access 08 November 2016

Genomic selection in a commercial winter wheat population

Article 08 January 2016

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

How humans choose candidates for selection has evolved over time. At first, selection based solely on phenotypic observation of progeny was next assisted by pedigree-based predictions, more particularly in animals (Henderson 1975; Falconer and Mackay 1996). Nowadays, genomic predictions of individual genetic values (Whittaker et al. 2000; Meuwissen et al. 2001) and cross values (Schnell and Utz 1975; Zhong and Jannink 2007; Lehermeier et al. 2017; Danguy des Déserts et al. 2023) are promising and open up new prospects for improving mating design and optimizing long-term genetic gain (Allier et al. 2019a, 2019b).

In plants, crosses are essentially selected to secure a high mean yield performance of the progeny. Breeders seek to identify which crosses will produce the best superior progeny in order to ensure genetic gain in the next selection cycle, and to maintain long-term genetic gain. It is thus important that the selected crosses also generate a high progeny genetic variance. A strategy to find these crosses is to select on the usefulness criterion ($\text{UC}$) (Schnell and Utz 1975; Zhong and Jannink 2007) defined as: $\text{UC}=\text{PM}+i\times \text{SD}$, where $\text{PM}$ is the parental mean, $i$ the selection intensity corresponding to the fraction $q$ of selected progenies, and $\text{SD}$ the progeny standard deviation (square root of progeny variance). PM can be estimated from the mean of additive parental genetic values using their phenotypes or genomic estimated breeding value (GEBV). Several attempts have been made to predict progeny variance using phenotypic distances (Utz et al. 2001) and genetic distances (Bohn et al. 1999; Hung et al. 2012), but without success. According to the recent methodological studies, genomic predictions appear to be a promising tool for optimizing complementarity between parents, as well as short- and long-term genetic gains in breeding schemes.

Many factors can influence genomic prediction accuracy (Liu et al. 2018). This was extensively demonstrated for GEBV (see Elsen 2022 for a review), and to a lesser extent for progeny standard deviation ($\text{SD}$), and for usefulness criterion ($\text{UC}$) which combines PM and SD. The accuracy of these predictions influences the relative efficiency of selection using $\text{UC}$. A first factor influencing genomic prediction accuracy is the genetic architecture of traits, which includes the number of QTL and the distribution of linkage disequilibrium over the genome. Zhong and Jannink (2007) demonstrated algebraically that genetic gain using $\text{UC}$ criteria decreases when the number of QTL increases in plants. In dairy cattle, Santos et al. (2019) demonstrated a higher $\text{UC}$ efficiency relatively to classical selection based on GEBV when the trait is controlled by 200 QTL rather than 20. A second factor influencing genomic prediction ability is the number of progenies. The progeny standard deviation ($\text{SD}$) estimated algebraically, or by simulation, may be correct for an ideal offspring population of infinite size, but in practice the number of progenies is limited. A third factor is the precision of the marker effects estimation which depends (1) on the size of the reference population (the set of individuals both phenotyped and genotyped) and (2) on the statistical method used when estimating these marker effects. The effects of the size and characteristics of the reference population have been extensively studied for the estimation of GEBV (Hayes et al. 2009; Calus 2010; Goddard et al. 2011; Erbe et al. 2012; Lee et al. 2017), showing that the precision increases when the reference population is bigger, more diverse and genetically closer to the set of selection candidate individuals. A huge literature deals with comparisons of statistical prediction methods of marker effects and GEBV precision. The list includes genomic best linear unbiased prediction (GBLUP) (Habier et al. 2007), RR-BLUP, Bayesian methods such as Bayes A, B, C, etc. (Meuwissen et al. 2001; Gianola et al. 2009), partial least square and least absolute shrinkage and selection operator (Usai et al. 2009). If accuracies of the different genomic prediction methods are similar when the trait depends on a large number of genes with small effect (infinitesimal model), Bayesian approaches perform better when the trait is controlled by QTL with very different effects.

The precision of the progeny standard deviation ($\text{SD}$) estimations not only depends on the marker effects estimation method, but also on the way they are processed. An algebraic formula to predict inbred progeny variance derived from a cross between two inbred lines was provided by Lehermeier et al. (2017) based on marker effect estimates and their co-segregation in progeny. In this paper, to this prediction described as the “Variance of posterior mean” (VPM) was compared a “Posterior mean variance” (PMV) estimator which, using an MCMC method, takes into account the error in marker effect estimates and gives higher genetic gain. In the present paper, we use an algebraic interpretation of their PMV and we propose a more advanced estimator which considers the fact that the uncertainty of the estimation of marker effects is modulated by the genomic composition of each parent as well.

Those elements influencing estimation precisions were considered in the simulations we performed to better understand the parameters impacting our predictive ability of $\text{PM}$, $\text{SD}$, and $\text{UC}$.

Several other studies showed that selecting on criteria based on progeny variance such as UC could actually increase the genetic gain in plants and in animals (Tiede et al. 2015; Allier et al. 2019a, 2019b; Bijma et al. 2020). But, to our knowledge, although the ability to predict the value of a cross has only been validated by simulations, just a few publications have validated the ability to predict UC compared to real data (Lian et al. 2015; Neyhart et al. 2019; Osthushenrich et al. 2017), none on wheat, and no publications tested the ability of prediction of SD.

So this study was carried out to (i) estimate the ability of genomic prediction of three cross value components (PM, SD, and UC) in experimental data, (ii) to identify the factors influencing this predictive ability based on simulations, (iii) to test if we could improve SD estimation by taking into account the error of the estimation of marker effects and the genomic composition of each parent. We were able to compare cross value components predictions and experimental observations for 101 crosses with an average of 55 unselected progenies each.

Materials and methods

The objective of the study was to compare genomic predicted values of parental/progeny mean (PM), parental gametic variance/standard deviation (SD) and mean of top progenies (UC: usefulness criterion) to experimental observations of winter bread wheat. We evaluated four traits (yield, grain protein content, plant height, and heading date) for 101 crosses with 55 progenies on average for 3 years.

In order to compare our results to a reference and to better understand the parameters influencing our predictive ability, we conducted simulations based on the same crosses, using the same parental genotypes.

Materials

The training population (TP)

TP composition

The TP was composed of 2,146 INRAE-AO (the French research institute Institut National de la Recherche pour l’Agriculture, l’Alimentation et l’Environnement and its subsidiary company Agri-Obtentions) breeding lines and 650 French registered varieties evaluated by GEVES (Groupe d'Étude et de contrôle des Variétés Et des Semences). Data were collected between 2000 and 2022 in two French breeding networks (North and South). Each year, the lines were phenotyped in up to 11 locations for INRAE-AO and 15 for GEVES in split-plot trials with two repetitions. In each location, four registered varieties were used as controls for several consecutive years. Those controls evolved gradually during 20 years (only one control can be changed on one specific year). In total, up to 169 (42) lines were observed each year for the INRAE-AO (GEVES) dataset. For two consecutive years, up to 76 (40) lines were in common for the INRAE-AO (GEVES) dataset. We report here data for yield, grain protein content, plant height and heading date under high-yield crop management (optimized pesticide, fungicide and nitrogen amount) (Supplementary Tables S1–S9).

TP adjusted phenotypes

Trial experts removed blocks presenting obvious experimental problems and outliers. In each trial, we modeled spatial heterogeneity when field coordinates where available. Rows and columns were modelized as random effects, and field trends were modelized as a smooth surface as implemented in the 2-dimensional P-spline mixed model of the SpATS v1.0.16 R package (Rodríguez-Álvarez et al. 2018) (Supplementary Table S1). Genotype was set as a fixed term to estimate genotypic BLUEs. Note that it has been proven that when there was no spatial variation, this model did not perform worse than not including a spatial effect (Selle et al. 2019). In the other trials, we applied arithmetic means to replicates to get genotypic values. We then calculated best linear unbiased estimator (BLUE) values for each line of the TP using the following model:

$$y_{{ij}} ~ = \mu + g_{i} + e_{j} + \varepsilon _{{ij}}$$

where $y_{{ij}}$ is the adjusted mean for line $i$ in environment (year and location intersection)$j$, $\mu$ the grand mean, ${g}_{i}$ the fixed effect of line $i$, ${e}_{j}$ the random effect of environment $j$, and ${\varepsilon }_{ij}$ the random effect of residual. The model was implemented using the lme4 R package. We considered these BLUE values as phenotypes in further analyses (Supplementary Table S1).

TP genotypes

The 2,146 lines were genotyped using a 35 K SNP (single nucleotide polymorphisms) chip representative of the TaBW280K array (Rimbert et al. 2018; Ben-Sadoun et al. 2020). After quality filtering for minor allele frequency below 0.01 and for missing call rate superior to 0.9, we used 19,065 markers. We imputed missing genotypes with the algorithm implemented in BEAGLE v5.3 (Browning et al. 2018) using the genetic positions previously estimated for a West-European bread wheat population (Danguy des Déserts et al. 2021). The total number of lines with genotypes and phenotypes was 2,146, 2,062, 2,126 and 2,145 for yield, grain protein content, plant height and heading date, respectively (Supplementary Table S1).

TP variance components

First, we estimated repeatability of each trait, using the following model:

$$y_{{ijk}} = \mu + g_{i} + e_{j} + \left( {g \times e} \right)_{{ij}} + \varepsilon _{{ijk}}$$

where $y_{{ijk}}$ is the raw observations for line $i$ in environment (year and location intersection) $j$, repetition $k$, $\mu$ the grand mean, ${g}_{i}$ the effect of line $i$, ${e}_{j}$ the effect of environment $j$, ${\left(g\times e\right)}_{ij}$ the effect of the interaction between line $i$ and environment $j$, and $\varepsilon$ the effect of residual. All effects were considered as random. The model was implemented using the lme4 R package. We then computed repeatability (Piepho and Möhring 2007) at the plot and design levels as:

$$\widehat{{{\text{rep}}}}_{{{\text{plot}}}} = \frac{{\hat{\sigma }_{g}^{2} }}{{\hat{\sigma }_{g}^{2} + \hat{\sigma }_{{\left( {g \times e} \right)}}^{2} + \hat{\sigma }_{\varepsilon }^{2} }}$$

$$\widehat{{{\text{rep}}}}_{{{\text{design}}}} = \frac{{\hat{\sigma }_{g}^{2} }}{{\hat{\sigma }_{g}^{2} + \frac{{\hat{\sigma }_{{\left( {g \times e} \right)}}^{2} }}{{{\text{nb}}\_{\text{env}}}} + \frac{{\hat{\sigma }_{\varepsilon }^{2} }}{{{\text{nb}}\_{\text{rep}}}}}}$$

where ${\widehat{\sigma }}_{g}^{2}$, ${\widehat{\sigma }}_{\left(g\times e\right)}^{2}$ and ${\widehat{\sigma }}_{\varepsilon }^{2}$ are the line, the interaction between line and environment and the residual estimated variances, respectively, and $\text{nb}\_\text{env}$ and $\text{nb}\_\text{rep}$ are the average number of environments and replications per line, respectively.

Next, we estimated genomic heritability of each trait with a RR-BLUP (Ridge Regression Best Linear Unbiased Predictor) approach, using the following model, implemented in the rrBLUP v4.6.1 R package (Endelman 2011):

$$\varvec{y} = \varvec{\mu } + \varvec{X\beta } + \varvec{r}$$

where $\varvec{y}$ is the vector of BLUE values, ${\varvec{\mu}}$ the grand mean, ${\varvec{\beta}}$ a vector of random SNP effects assumed to be normally distributed $N\left(0,{{\varvec{I}}\sigma }_{\beta }^{2}\right)$, with its matrix of incidence ${\varvec{X}}$, and ${\varvec{r}}$ the vector of random residual effects assumed to be normally distributed $N\left(0,{{\varvec{I}}\sigma }_{r}^{2}\right)$. We then computed genomic heritability (Nyquist and Baker 1991; Falconer and Mackay 1996) as:

$$\hat{h}_{{{\text{genomic}}}}^{2} = \frac{{\hat{\sigma }_{g}^{2} }}{{\hat{\sigma }_{g}^{2} + \hat{\sigma }_{r}^{2} }}$$

where $\hat{\sigma }_{g}^{2} = \hat{\sigma }_{\beta }^{2} \times 2\mathop \sum \limits_{{i = 1}}^{n} p_{i} q_{i}$, with ${p}_{i}$ being the allele frequency of SNP$i$,${ q}_{i}=1-{p}_{i}$, and $n$ the total number of SNP.

Predictive ability of genome estimated breeding values (GEBV) in TP

We assessed the predictive ability in the TP using a tenfold cross-validation procedure iterated 10 times. In each iteration, the TP was split into 10 equal sets, each containing 1/10-th of the TP, which was predicted by the remaining 9/10-ths of the TP. The predictive abilities of individual models are expressed as the average between the GEBV and observed phenotypic values in the validation set across all tenfold. The process was repeated 10 times with 10 new independent sets being sampled each replication, in which case the reported predictive abilities are averages across all folds and replications. This cross-validation procedure is implemented in the PopVar v1.3.0 R package (Mohammadi et al. 2015; Neyhart et al. 2019).

The experimental population (EP)

In this study, we generated 101 crosses evaluated in six environments for three years. The objective was to compare cross value components estimated using genomic predictions and the same components observed in the field. Details about the experimental design are provided in Supplementary Table S10.

EP description

Within the FSOV PrediCropt project, a total number of 101 crosses using 106 different parents included in the TP were generated between 2020 and 2022. These crosses were generated by a private company (Florimond-Desprez) (16), and INRAE-AO (85). The number of progenies per cross ranged from 8 to 122 (mean: 54.8 progenies per cross, Supplementary Table S11). The 16 crosses generated by Florimond-Desprez were evaluated in two locations (Cappelle and Houville, France), and the 85 remaining were evaluated in 4 other locations in France (Auzeville: INRAE UE GC, Clermont-Ferrand: INRAE UE PHACC, Lusignan: INRAE UE FERLUS, and Mons: INRAE UE GCIE). Crop management methods corresponded to high-yield objectives (optimized pesticide, fungicide and nitrogen amount). Two registered varieties and the parents of the generated crosses were used as controls and repeated across trials (year x location). At least 10% of the progenies were replicated in each trial. For each partner, all crosses were observed in all locations.

Lines were evaluated for yield, grain protein content, plant height and heading date. For grain protein content, only crosses generated during the two first years (73) were analyzed in this study. We discarded from further analyses plots that were affected by lodging or that were still segregating (presenting different types of plants). We kept all crosses for the study after checking congruency between parents and progeny (by genotyping 2 progenies per cross and applying quick visual observations such as presence/absence of awns, common stems and leaf color). After cleaning raw observations, the total number of phenotyped progenies were 5565, 3522, 5591 and 5590 for yield, grain protein content, plant height and heading date, respectively (Supplementary Table S10).

Adjusted EP phenotypes

For each trial, we adjusted phenotypes for spatial effects using the SpATS R package. We checked for usual Pearson’s correlations between sites and years for each trait (Supplementary Tables S12 and S13).

We then calculated BLUE values for each line using the following model:

$$y_{{ij}} ~ = \mu + g_{i} + e_{j} + \varepsilon _{{ij}}$$

where $y_{{ij}}$ is the adjusted phenotype for line $i$ in environment (year and location intersection)$j$, $\mu$ the grand mean, ${g}_{i}$ the fixed effect of line $i$, ${e}_{j}$ the random effect of environment $j$, and ${\varepsilon }_{ij}$ the random effect of residual. The total number of environments was 11 for grain protein content and 15 for the three other traits. The model was implemented using the lme4 R package. We considered these BLUE values as phenotypes in further analyses. The total number of BLUE values was 5658, 3596, 5684 and 5683 for yield, grain protein content, plant height and heading date, respectively (Supplementary Table S10).

Methods

Cross value components predictive ability

In this study, we considered three components for the value of a cross: the parental mean (PM), the progeny standard deviation (SD) which is the square root of gametic variance and the UC, corresponding in our study to the expected mean of the 7% best progeny of a cross (Schnell and Utz 1975; Zhong and Jannink 2007; Lehermeier et al. 2017).

Models and formulae to predict cross value components

The three cross value components can be predicted by analytic formulae using information obtained from the TP.

Marker effects estimation The estimation of marker effects was performed with the PopVar R package using the following model (Meuwissen et al. 2001):

$$\varvec{y} = \varvec{\mu } + \varvec{X\beta } + \varvec{r}$$

where $\varvec{y}$ is the vector of TP BLUEs described above, ${\varvec{\mu}}$ the grand mean, ${\varvec{\beta}}$ a vector of random SNP effects assumed to be normally distributed $N\left(0,{\boldsymbol{ }{\varvec{I}}\sigma }_{\beta }^{2}\right)$, with its matrix of incidence ${\varvec{X}}$ for SNPs (TP), and ${\varvec{r}}$ the vector of random residual effects assumed to be normally distributed $N\left(0,{ I\sigma }_{r}^{2}\right)$. We tested genomic prediction models using different hypotheses of marker effect variance distribution (Meuwissen et al. 2001; Habier et al. 2007, 2011; VanRaden 2008; Park and Casella 2008): “BRR” (Bayesian ridge regression, Bayesian approach with a Gaussian prior for SNP effects), “BayesA” (Bayesian approach with a scaled-t prior), “BayesB” (Bayesian approach with a two-component mixture prior with a point of mass at zero and a scaled-t slab), “BayesC” (Bayesian approach with a two-component mixture prior with a point of mass at zero and a Gaussian slab), and “BL” (Bayesian Lasso, Bayesian approach with a double-exponential prior). These models are implemented in the R package BGLR v1.0.9 (Pérez and de los Campos 2014) available in PopVar.

PM estimation We predicted PM as the average parental GEBV computed as the matrix product between parental genotypes and estimated marker effects:

$$\hat{\mu }_{{P_{1} \times P_{2} }} = \frac{{\varvec{\hat{\beta }}^{\varvec{'}} \left( {\varvec{X}_{{\varvec{P}_{1} }} + \varvec{X}_{{\varvec{P}_{2} }} } \right)}}{2}$$

where $\mu _{{P_{1} \times P_{2} }}$ is the PM for cross ${P}_{1}\times {P}_{2}$, $\widehat{{\varvec{\beta}}}$ the vector of estimated marker effects, and ${{\varvec{X}}}_{{{\varvec{P}}}_{1}}$ and ${{\varvec{X}}}_{{{\varvec{P}}}_{2}}$ are the vectors of genotypes for parent ${P}_{1}$ and parent ${P}_{2}$, respectively.

Gametic variance estimation The formula takes into account marker effects and phase in parents as well as recombination rates. We used the formula proposed by Lehermeier et al. 2017:

$$\hat{\sigma }_{{P_{1} \times P_{2} }}^{2} = \varvec{\hat{\beta }}^{\varvec{'}} \varvec{V}_{{\varvec{P}_{1} \times \varvec{P}_{2} }} \varvec{\hat{\beta }}$$

where $\hat{\sigma }_{{P_{1} \times P_{2} }}^{2}$ is the gametic variance for cross ${P}_{1}\times {P}_{2}$, and ${{\varvec{V}}}_{{{\varvec{P}}}_{1}\times {{\varvec{P}}}_{2}}$ the genotypic variance–covariance matrix for biparental RIL progeny calculated as follows:

$$\left( {V_{{P_{1} \times P_{2} }} } \right)_{{jl}}^{{{\text{RIL}}\left( k \right)}} = 4D_{{jl}}^{*} \left( {\mathop \sum \limits_{{{\text{gen}} = 1}}^{k} \left( {0.5\left( {1 - 2c_{{jl}} } \right)} \right)^{{{\text{gen}}}} } \right)$$

where ${D}_{jl}^{*}$ is the linkage disequilibrium (LD) between alleles at loci $j$ and $l$ among both parental lines (either 0 if parents carry the same alleles at loci $j$ and/or $l$, or 0.25 if alleles are different at both locus between parents and in coupling phase, that is, one parent carries the two beneficial alleles, while the other carries deleterious alleles, or -0.25 if the alleles are in repulsion phase), $k$ the number of generations, and ${c}_{jl}$ the recombination rate between loci $j$ and $l$. The recombination rates were computed from the West-European bread wheat population genetic map (Danguy des Déserts et al. 2021).

An alternative approach was used to compute the gametic variance. We called Vg1 (standing for variance of gametes) the formulae proposed by Lehermeier et al. 2017 using RR-BLUP model (Meuwissen et al. 2001) to estimate marker effects:

$$\begin{array}{*{20}c} {\left( {\hat{\sigma }_{{P_{1} \times P_{2} }}^{2} } \right)_{{{\text{Vg}}_{1} }} = \varvec{\hat{\beta }}^{\varvec{'}} \varvec{V}_{{\varvec{P}_{1} \times \varvec{P}_{2} }} \varvec{\hat{\beta }}} \\ \end{array}$$

(1)

For the model called Vg2, we added a first term taking into account the error in marker effect estimation given by Lehermeier et al. 2017 (Eq. 10) in their algebraic version of the posterior mean variance (PMV):

$$\left( {\hat{\sigma }_{{P_{1} \times P_{2} }}^{2} } \right)_{{Vg_{2} }} = \hat{\beta }^\prime V_{{P_{1} \times P_{2} }} \hat{\beta } + {\text{trace}}\left\{ {V_{{P_{1} \times P_{2} }} \text{var} \left( {\beta |X,y} \right)} \right\}$$

(2)

with $\left( {\beta |X,y} \right) = \hat{\sigma }_{\beta }^{2} \left( {I - X^{\prime} \left( {XX^{\prime} + I\frac{{\hat{\sigma }_{r}^{2} }}{{\hat{\sigma }_{\beta }^{2} }}} \right)^{{ - 1}} X} \right)$, where ${\widehat{\sigma }}_{\beta }^{2}$ and ${\widehat{\sigma }}_{r}^{2}$ are the markers and residual estimated variances, and ${\varvec{X}}$ is the matrix of TP’s genotypes.

For the model called Vg3, we added a second term that aims to consider the fact that the uncertainty of the estimation of marker effects is modulated by the genomic composition of each parent:

$$\left( {\hat{\sigma }_{{P_{1} \times P_{2} }}^{2} } \right)_{{Vg_{3} }} = \hat{\beta }^{\prime} V_{{P_{1} \times P_{2} }} \hat{\beta } + {\text{trace}}\left\{ {V_{{P_{1} \times P_{2} }} \text{var} \left( {\beta |X,y} \right)} \right\} + 0.25X_{{P_{1} \times P_{2} }}^{\prime } \text{var} \left( {\beta |X,y} \right)X_{{P_{1} \times P_{2} }}$$

(3)

where ${{\varvec{X}}}_{{{\varvec{P}}}_{1}\times {{\varvec{P}}}_{2}}$ is the vector of genotypes for the F1 of cross ${P}_{1}\times {P}_{2}$. Details are provided in Supplementary Information S1.

UC estimation The UC was calculated as follows (Schnell and Utz 1975; Zhong and Jannink 2007):

$$\widehat{{UC}}_{{P_{1} \times P_{2} }} = \hat{\mu }_{{P_{1} \times P_{2} }} + i\hat{\sigma }_{{P_{1} \times P_{2} }}$$

where $i$ is the selection intensity corresponding in our study to a 7% selection rate ($i \sim 1.91$, computed as the inverse Mills ratio), and ${\widehat{\sigma }}_{{P}_{1}\times {P}_{2}}$ the estimated progeny SD for cross ${P}_{1}\times {P}_{2}$.

EP cross value components

We calculated the experimental cross value components for each cross from the distribution of the progeny BLUE values: mean (PM), standard deviation (SD) and mean of the 7% best progeny (observed UC) (or the value of the best progeny when progeny size was inferior to 15).

Cross value components’ predictive ability

We compared the three cross value components (PM, SD and UC) genomic estimates with the corresponding experimental values using a weighted Pearson’s correlation (weights v1.0.4 R package (Pasek and Schwemmle 2021)) to adjust for the number of progenies per cross. These correlations were considered as predictive ability.

Factors influencing predictive ability of the cross value components (simulations)

In order to evaluate the different parameters impacting the predictive ability of the three cross value components (PM, SD and UC) and to have a reference for the values we can get in optimal/suboptimal conditions (perfect/unperfect prediction of marker effects, limited/infinite progeny size), we developed a simulation study.

Simulation design description

The general scheme of the simulation study is provided in Fig. 1. We estimated cross value components using analytic formulae 1 or by simulating progenies 2 for the same 73 crosses used in the experiment.

We ran different scenarios, varying the number of QTL (from 10 to 1,000), heritability (from 0.2 to 0.8) and progeny size (from the size in experiment to 2,000). We simulated 20 traits for each number of QTL. We positioned the QTL randomly along the genetic map (Danguy des Déserts et al. 2021). We assigned them an effect drawn from a normal distribution $N\left(\text{0,1}\right).$ These true marker effects 1.2 were then adjusted to provide a variance of true breeding values (TBV) similar to the GEBV variance observed for yield in the TP.

Analytic formulae 1. To estimate cross value components, we either used the true marker effects 1.2 (TRUE scenario) to mimic a scenario where the error in marker effect prediction is negligible (TP size is infinite and the prediction model perfect) or estimated marker effects 1.5 (ESTIMATED scenario) when the size of the TP is limited. The TBV of the TP were calculated as the cross-product between true marker effects and genotypes 1.3. We simulated phenotypes 1.4 for the TP by adding a noise to their TBV. This noise was normally distributed with a variance corresponding to the residual variance from a specified heritability. For instance, if the TBV variance is 80 and we specify a heritability of 0.7, we would draw the noise for phenotypes computation from $N\left(\text{0,34}\right)$ according to the following equation: ${h}^{2}=\frac{80}{80+34}=0.7$. We used the genotypes (excluding QTL) and simulated phenotypes of the TP to estimate marker effects 1.5 using different genomic prediction models. The pred_true (pred_estimated) cross value components were calculated using true (estimated) marker effects using the formula described in 2.2.1.1. Recombination rates were estimated in a Western Europe diversity panel (Danguy des Déserts et al. 2021).

Simulated progenies 2. We simulated progeny genotypes 2.2 for the 73 experimental crosses 2.1 phenotyped for all four traits, using the genetic map from (Danguy des Déserts et al. 2021) and the MoBPS v1.6.64 R package (Pook et al. 2020). The number of progenies per cross varied between the one observed in the experiment (Supplementary Table S11) and 2,000. We produced 25 times these number of progenies per cross to account for variations due to Mendelian gamete sampling. (The results presented in this study are the average of these 25 simulations.) Progeny TBV 2.3 were calculated as the cross-product between true simulated marker effects and the simulated progeny genotypes. Progeny phenotypes were simulated 2.4 by adding a noise to their TBV. This noise was normally distributed with a variance corresponding to the residual variance observed in the TP (34 if we keep the example of the previous section with a heritability of 0.7).

Finally, we calculated the cross value components for each cross from the distribution of the progeny phenotypes (progeny_phenotypes in Fig. 1): mean, standard deviation and mean of the 7% best progeny for the UC (or the value of the best progeny when progeny size was inferior to 15).

Predictive ability For each scenario, we compared the genomic predictions of the three cross value components obtained with the analytic formulae using true or estimated marker effects, with the ones obtained in simulated progenies using a weighted Pearson’s correlation to adjust for the number of progenies per cross. These correlations were considered as predictive ability.

Parameters influencing the predictive ability

For each scenario, 20 trait architectures were simulated. To compare scenarios, we considered for each cross value component the median values of correlations across the 20 runs.

We explored different parameters that could impact the predictive ability of the three cross value components: genetic architecture (number of QTL and heritability of the trait), number of progenies per cross, genomic prediction models, three different formulae to estimate the gametic variance (Vg1, Vg2, Vg3) and mating design (similar to the experiment or maximizing diversity).

Genetic architecture We tested different numbers of QTL for the simulation of the genetic determinism of agronomic traits: 10-30-300-1,000. The scenarios with 10 or 30 QTL represent polygenic traits (with major genes), whereas 300 and 1,000 QTL scenarios correspond to quantitative traits. Heritability ranged from 0.2 to 0.8 (with a 0.1 step). Note that this heritability is the true heritability. In our simulations, the corresponding genomic heritability estimated using a RR-BLUP approach was lower. For instance, a true heritability of 0.7 corresponded to a genomic heritability of ~ 0.5 (Supplementary Figure S1), congruent with the genomic heritability observed for yield in experimental data.

Progeny size To evaluate if the observed variance in the field using a limited progeny size is reliable, we tested different numbers of progenies per cross, starting with the real numbers observed in our experiment (ranging from 5 to 122 progenies; mean = 47.6 progenies per cross, Supplementary Table S11) to a maximum of 2000 progenies for all crosses (50, 100, 200, 1,000 and 2,000).

Precision of marker effect estimation To evaluate the importance of the quality of marker effects estimation in genomic predictions, we implemented a true and an estimated scenario. For the later, several genomic selection methods were compared: different Bayesian approaches and a RR-BLUP approach.

Precision of SD estimation We implemented two algebraic variants of the PMV approach proposed by Lehermeier et al. 2017 with a RR-BLUP approach described earlier (“Vg1,” “Vg2,” “Vg3”).

Genetic composition of the mating plan The experimental mating plan of the PrediCropt FSOV project was chosen by breeders to maximize genetic gain. To evaluate the bias possibly generated by expert choice (some parents were used in several crosses and some were genetically similar, as described in Supplementary Table S11 and Supplementary Figure S2, respectively), we tested a different set of parents. This new set was chosen to be representative of the total genetic diversity present in the TP and the least possible genetically similar. We chose 146 different parents to generate 73 crosses. To do so, we first extracted the 10 principal components (PC analysis) from the variance-standardized genomic relationship matrix (GRM) of the TP lines, using PLINK v2.00a2LM toolset. We computed Euclidean distances between lines using those 10 PC and performed a hierarchical cluster analysis (146 groups) with the Ward’s minimum variance method (Murtagh and Legendre 2014) (dist and hclust functions, respectively, from the stats v4.1.1 R package (R Core Team 2021)). In each cluster, we picked the most distant line. Finally, we generated 73 crosses by randomly mating this new set of 146 parents 10 times. We considered the median values of correlations across the 10 random mating plans to compare with the crosses from the experiment. The plots of the 4 first PC for the two sets of parents (“PrediCropt” crosses and this new set of “Unrelated” parents) show that “Unrelated” parents cover a larger diversity (Supplementary Figure S2).

Results

Quality of the training population (TP)

We used as TP the historical French registration data (GEVES) and the INRAE-AO breeding program, including 2000–2022 evaluation trials, both Southern and Northern France networks. The average Pearson’s correlations between years in each dataset were 0.69 for yield, 0.78 for grain protein content, 0.87 for plant height and 0.91 for heading date (Supplementary Tables S2 to S9).

The repeatabilities at the plot level were 0.32 for yield, 0.52 for grain protein content, 0.73 for plant height and 0.85 for heading date. Repeatabilities at the design level were very high for all traits (ranging from 0.91 to 0.99). Those values are in congruence with other trials and reflect higher GxE interactions for grain protein content and yield compared to plant height and heading date. Genomic heritabilities were 0.53 for yield, 0.51 for grain protein content, 0.56 for plant height and 0.73 for heading date. If we compare repeatability’s rankings to genomic heritability’s rankings, we notice a high missing heritability for plant height, maybe due to epistasis. Finally, the predictive ability using an additive model ranged from 0.56 for plant height to 0.70 for heading date (Table 1).

Table 1 Variance components, repeatabilities and predictive ability in the TP

Full size table

Simulation study

In our simulation study, we explored different ranges of parameters that could impact the predictive ability of the three cross value components (PM, SD and UC): the number of QTL (10–1000), the heritability (0.2–0.8), the progeny size (experimental progeny size ~ 48–2000), the precision of the estimation of marker effect (true or estimated) and the genetic composition of the mating plan. For all scenarios, the predictive ability was higher for PM and UC compared to SD.

Genetic architecture

The predictive ability of the three cross value components increased with heritability (Fig. 2). For 300 QTL, for instance, predictive abilities increased from 0.64 to 0.93 for PM, from 0.48 to 0.86 for UC and from 0.04 to 0.36 for SD between heritability 0.2 and heritability 0.8.

The predictive ability of the three cross value components decreased when QTL number increased (Fig. 2). While PM predictions were slightly impacted (0.91 to 0.84 between 10 and 1,000 QTL with h2 = 0.6, for instance), SD decreased significantly (from 0.69 to 0.18). UC that is calculated with both PM and SD ($\text{UC}=\text{PM}+i\times \text{SD}$) was still correctly estimated (dropped from 0.82 to 0.77). This suggests that UC is more driven by PM than SD in this elite material.

Progeny size

To evaluate if the observed variance in the field using a limited progeny size (< 100) is reliable, we tested different numbers of progenies per cross (Fig. 3).

The predictive ability of the three cross value components increased when the number of progenies per cross increased from the number of progenies in the experiment (~ 48) to 2,000 (Fig. 3). PM and UC were slightly impacted: The predictive abilities increased from 0.90 to 0.91 for PM and from 0.82 to 0.90 for UC when h2 = 0.7 and QTL number = 300, for instance. The interest to produce large progeny was much higher for SD estimation: Predictive abilities increased from 0.29 to 0.57. We observed a first slope decrease when progeny size is superior to 250 and a second when progeny size is superior to 1000, close to a plateau.

The relative increase in predictive ability for SD was remarkably stronger in scenarios with high number of QTL (Fig. 3 and Supplementary Figure S3), with a plateau when the number of QTL is superior to 300.

Marker effect estimation precision

We compared two scenarios to test the impact of marker effect estimation precision on cross value components prediction. We considered the TRUE scenario as a reference (pred_true in Fig. 1). In this optimal situation, TP size is infinite, and its genetic composition ideal and marker effects are perfectly estimated: We use true simulated marker effects to estimate PM, SD and UC. In the more realistic ESTIMATED scenario (pred_estimated in Fig. 1), marker effects are estimated using a genomic prediction model from our TP that is limited.

In the TRUE scenario (pred_true), the predictive abilities were very high for PM and UC whatever the number of QTL or heritability. But SD predictive ability was high only when the number of QTL was inferior to 300. For instance, the predictive ability was 0.97, 0.89 and 0.81 for PM, UC and SD, respectively, when QTL number was 10 and when the number of progenies was the same as in the experimental data (Fig. 3 and Supplementary Figure S3), and 0.99, 0.95 and 0.95 when the number of progenies was 2,000 (Fig. 4B). We observed a strong decrease in predictive ability for SD when the number of QTL increased (0.37 for QTL number = 1000 versus 0.81 for QTL number = 10 and when progeny size = the number of progenies in experimental data, Fig. 4A). This decrease was much smaller when the number of progenies per cross was 2000 (0.86 for QTL number = 1000 versus 0.95 for QTL number = 10, Fig. 4B). All these results show that if we would perfectly estimate marker effects and progeny size was sufficient, we would be able to estimate SD as well very well, even when the number of QTL is high.

As we observed an important gap of predictive abilities between the TRUE and the ESTIMATED scenarios for SD (Fig. 4), we tested different genomic prediction models with different hypotheses on trait architecture (BayesA, BayesB, BayesC, BL, BRR and RR-BLUP (Vg1)). We also tested 2 different SD estimators deriving from the algebraic formulae of SD: Vg2 = Vg1 + term that takes into account the error of estimation of marker effects, and Vg3 = Vg2 + term that takes into account the variance in parents (Supplementary Information S1) (Fig. 5).

When progeny size was equal to the experimental design (~ 48) and QTL number was inferior or equal to 300, Bayesian models performed better than the RR-BLUP approach, and more particularly the Bayes A and B models. For instance, predictive abilities were 0.76 for both Bayes A and B models versus 0.62 for Vg1 when QTL number = 10 and heritability = 0.7 (Fig. 5A and B). Taking into account the error in marker effect estimates slightly increased the predictive abilities in scenarios with 1,000 QTL and heritability = 0.7 (0.26 for Vg2 and Vg3 compared to 0.23 for Vg1) (Fig. 5B).

When progeny size was 2,000 and QTL number was inferior to 300, Bayes B still gave better predictions of SD (for instance, 0.91 for Bayes B versus 0.73 for Vg1 when QTL number = 10 and heritability = 0.7; Fig. 5C). When number of QTL ≥ 300, Vg3 increased predictive abilities (Fig. 5C), especially when heritability was low (0.25 for Vg1 versus 0.33 for Vg3 when QTL number = 300 and heritability = 0.2; Fig. 5D). Vg3 is equal to Bayes B when heritability = 0.8 and number of QTL = 300 (0.64) (Fig. 5D).

All these results show that the quality of marker effect estimates has a strong impact on the predictive ability of SD. For traits with small number of QTL, Bayesian models (Bayes A or B) should be preferred. When the number of QTL is high, Vg3 should be used, especially if heritability is low. (It is equivalent to Bayes B when heritability is very high.)

Genetic composition of the mating plan

To evaluate the bias possibly generated by the choice of crosses in our experimental design, we compared the results obtained with the experimental crosses called “PrediCropt” with a set of crosses called “Unrelated.” We selected the “Unrelated” parents to be representative of the total genetic diversity present in the TP and the crosses by mating parents with low genetic similarity (see Materials and methods). The 4 first PC for the two sets of parents are provided in Supplementary Figure S2 and show that the “Unrelated” parents are more scattered throughout the TP than “PrediCropt” parents (for instance, no “PrediCropt” parent in the bottom left side in graphic A—PC 1 versus PC 2).

Ten random associations of the “Unrelated” set of parents gave the predictive abilities’ results provided in Supplementary Figure S4 for scenario with 300 QTL and heritability of 0.7. Little differences were observed between the two sets of parents. Indeed, the predictive abilities were 0.90 (“PrediCropt”) versus 0.88 (“Unrelated”) for PM, 0.82 for both sets for UC, and 0.29 (“PrediCropt”) versus 0.24 (“Unrelated”) for SD. According to these results, we considered as negligible the bias generated by the choice of our experimental crosses.

Experimental data application

In our experimental data, we phenotyped yield, plant height and heading date for 101 progenies and grain protein content for 73 progenies (Supplementary Table S11). On average correlations between trials (year x location) were 0.48, 0.43, 0.79 and 0.88 for yield, grain protein content, plant height and heading date, which is congruent with other trials (Supplementary Tables S12 and S13).

We compared experimental cross value components to genomic predictions described above. Note that both are predictors of the true values and we do not know a priori if one is better than the other. We just look here at their correlations (Table 2). Observed and predicted PM was significantly correlated for the 4 phenotypes. Median correlation values across all prediction models were 0.38, 0.63, 0.51 and 0.91, for yield, grain protein content, plant height and heading date, respectively. SD was significantly correlated for plant height and heading date (0.59 and 0.38, respectively) but not for yield and grain protein content (0.01 and 0.13, respectively). UC was, however, significantly correlated for all 4 traits (0.45, 0.52, 0.54 and 0.74 for yield, grain protein content, plant height and heading date, respectively).

Table 2 Correlations between predictions and experimental values of the three cross value components

Full size table

Note that for plant height and heading date, SD was better predicted with non-Gaussian Bayesian approaches (0.63–0.64 and 0.46–0.49, respectively) than with Gaussian approaches (0.53–0.54 and 0.33–0.34, respectively), suggesting that those traits are polygenic with some major QTL (Ellis et al. 2002).

Discussion

The present study aimed at investigating the genomic predictive ability of the cross value called UC (usefulness criterion; $\text{UC}=\text{PM}+i\times \text{SD}$) and its two components, the mean (PM; parental mean) and the variance (SD; standard deviation) of the progeny. We used simulations to test the impact of the precision of marker effect estimation in predictions. We compared two scenarios, one using perfectly estimated marker effects (true marker effects) and one using more realistically estimated marker effects. We tested two new estimators of SD that take into account the error in marker effect estimation. We also tested the impact of trait heritability (varying from 0.2 to 0.8), trait architecture (number of QTL varying between 10 and 1000) and progeny size (varying from the number of progenies per cross observed in the field to 2000). We used those figures as a reference grid for the comparison of predictions with experimental data composed of yield, grain protein content, plant height and heading date evaluated for 101 crosses with 55 progenies on average. Our results shed light on the factors influencing cross value predictive ability and provide insights into the prerequisites for using such methods in precision breeding programs.

A high-quality training population (TP)

In this study, we took advantage of a historical TP of more than 20 years of data collection to predict three cross value components (PM, SD and UC) of 73 or 101 crosses depending on the trait using genomic prediction models. We compared these predictions to experimental observations of progenies. All the results concerning the TP (correlations between years, repeatabilities at plot and design levels) suggest a high-quality dataset. The predictive abilities obtained by cross-validation ranging from 0.56 (plant height) to 0.70 (heading date) were higher than previously obtained with a smaller TP (G. Charmet, personal communication). We checked if we could optimize the training population (data not shown) by an iterative approach that removes the environments (year x location) one by one and that computes predictive ability at each iteration (Heslot et al. 2013). We finally kept all the environments of the TP because we did not detect any environment deteriorating the quality of the prediction. Even trials from the South network were improving predictions of North trials. We also checked (data not shown) the predictive ability within each sub-dataset of the TP (GEVES, INRAE-AO, North and South) and the predictive abilities of the 73 crosses obtained for the three cross-value components using these sub-datasets as TP. We observed better predictive abilities using all datasets combined as TP. All these preliminary results led us to retain the entire dataset as the TP in this study.

Due to the size and complexity of the TP design (not uniform spatial information), for computational reasons, and as done in previous studies (Tiede et al. 2015; Danguy des Déserts et al. 2023), we separated the steps of correction for the fixed effects such as spatial coordinates and environmental effects (location and year) from the estimation of marker effects. It would be interesting to evaluate the impact of this pre-correction on the estimation of marker effects step.

Parameters impacting predictive abilities based on simulations

To evaluate different parameters impacting predictive abilities, we simulated progenies using the same crosses than in our experiment, varying heritability, QTL number, progeny size, prediction models, and genetic composition of the mating plan.

The first comment on this simulation study concerns the difference observed between true heritability and estimated heritability, with a missing heritability ranging from 0.1 to 0.2 in estimated heritabilities, whatever the number of QTL simulated. This could be explained by a lack of genetic information, such as the absence of rare variants or structural variants (Manolio et al. 2009), epistasis and GxE. According to this observation, we focused our results on scenarios where heritability = 0.7, which correspond to an estimated heritability of 0.5, congruent with yield observations.

First, the predictive ability of the three cross value components increased when heritability increased. The impact of QTL number impacted SD predictions: the predictive ability decreased when QTL number increased. These two results were in concordance with the results obtained in previous studies (Wimmer et al. 2013; Tiede et al. 2015; Yao et al. 2018). The first conclusion about this simulation study is that trait architecture is the most important factor impacting predictive abilities and more especially for the prediction of SD.

Second, the predictive ability of the three cross value components increased when progeny size increased, the SD being strongly affected. To our knowledge, no study has yet investigated this parameter. Our results suggest that more than 1000 progenies per cross should be sufficient to maximize SD predictive ability as little differences were observed between the 1000 and 2000 progenies per cross-scenarios.

Third, we observed that if we perfectly knew marker effects (TRUE scenario), we were able to estimate the three cross value components when QTL number is low and progeny size is large almost perfectly. But SD predictions decrease strongly with the number of QTL, especially when marker effects are estimated. This could be due to two factors: The TP is not informative enough compared to our validation population, or/and the prediction model is not optimal. We could not explore the first factor because no other TP was available for this study. However, we investigated the second parameter by testing different prediction models. We observed small differences between prediction models, and mostly in the small progeny size scenarios. This result was in concordance with several previous studies (Yao et al. 2018; Neyhart et al. 2019; Santos et al. 2019). RR-BLUP is known to have a similar or slightly better performance for GEBV prediction than models with differential shrinkage when the number of QTL is large (Daetwyler et al. 2010). We actually observed that Bayesian models, Bayes A and B in particular, outperformed models using Gaussian marker effect variance distribution when the number of QTL was low, in accordance with several previous studies (Meuwissen et al. 2001; Shepherd et al. 2010; Legarra et al. 2011).

In this study, we computed SD using the algebraic formula proposed by Lehermeier et al. 2017 for gametic variance. As the precision of marker effect estimates strongly impacted predictive abilities, we proposed two variants for the computation of SD that account for the error in marker effect estimation. The first variant (Vg2) was the algebraic version of the posterior mean variance (PMV) (Lehermeier et al. 2017). The second variant (Vg3) aimed at considering the fact that the uncertainty of the estimation of marker effects is modulated by the genomic composition of each parent. These two variants were computed only with marker effects estimated by a RR-BLUP approach in our study. They increased SD predictive abilities when progeny size was large and the number of QTL was high, and to a greater extent when heritability was low. This result is in concordance with the 300 QTL simulation scenario described in Lehermeier et al. 2017.

Finally, SD predictive ability was high only in the TRUE scenario, when the number of QTL was 10. We did not take into account the interaction between genotype and environment or epistasis in our prediction models. This could improve predictive ability of traits such as yield, grain protein content and plant height in particular. We could also test if our predictive ability is sufficient to improve our mating design and short/long-term genetic gain in elite and more diverse material.

Validation in experimental data

The predictive ability of cross value components based on experimental progenies’ phenotypes was validated for maize yield, moisture and test weight SD (0.18 for yield, 0.49 for moisture and 0.52 for test weight) (Lian et al. 2015). For flowering time (Osthushenrich et al. 2017), the correlations were around 0.90 for PM and of 0.46–0.65 for SD. For barley (Tiede et al. 2015; Neyhart et al. 2019), the predictive abilities for PM were moderate to high (0.46–0.69), whereas those for SD were lower (0.01–0.48). They were higher for heading date (the most heritable trait) and lower for FHB severity (the least heritable trait).

In our study, observed and predicted PM and UC were significantly correlated for all traits (PM: 0.38, 0.63, 0.51 and 0.91; UC: 0.45, 0.52, 0.54 and 0.74; for yield, grain protein content, plant height and heading date, respectively), while SD was correlated only for heading date and plant height (0.64 and 0.49, respectively). We were not able to validate our ability of prediction of SD for yield because progeny sizes were too small according to simulations. As quoted by Santos et al. (2019), inferences using SD « should be regarded as a bet»: The value of the best individual among 40 may differ from its parental UC calculated with a of 10 or 2%. For grain protein content, plant height and heading date, non-Gaussian Bayesian approaches outperformed Gaussian approaches for SD prediction, indicating that considering more flexible distributions for marker effects can improve SD estimation accuracy.

It would now be interesting to validate the value of UC in terms of genetic gain using different levels of diversity in the starting germplasm (elite breeding program vs pre-breeding program) under a long-term perspective, using different traits of interest. The progeny size is very important to calibrate to assure the realization of the SD potential of crosses. It would also be interesting to test if including interaction effect of markers with environments improves SD predictive ability.

Recommendations for breeding schemes

Our results contribute to practical guidance for precision breeding programs. The study highlights the importance of accurate marker effect estimates, thanks to large and optimized TP datasets, as well as the optimization of prediction models to estimate the value of a cross. As progeny size is determinant to realize the potential of a cross, we can advise the breeders to produce larger populations for crosses with outstanding SD. Optimizing mating design based on UC might be even more interesting with more diverse material where the variance of SD between crosses is larger, in pre-breeding programs, for instance.

Finally, breeders are interested in predicting several traits of agronomic interest simultaneously in progenies, the more challenging case being those with negative genetic correlations, yield and grain protein content, for instance, for bread wheat (Thorwarth et al. 2018). One interesting perspective is now to optimize the predictive ability of this correlation (Neyhart et al. 2019) in order to predict crosses with outstanding progenies for the main trait and satisfying values for the secondary trait.

Data availability

The data that support the findings of this study are openly available in repository FSOV PrediCropt at https://doi.org/10.57745/F6230V, reference number F6230V.

References

Allier A, Lehermeier C, Charcosset A, Moreau L, Teyssèdre S (2019a) Improving short- and long-term genetic gain by accounting for within-family variance in optimal cross-selection. Front Genet 10:1006
Article PubMed PubMed Central Google Scholar
Allier A, Moreau L, Charcosset A, Teyssèdre S, Lehermeier C (2019b) Usefulness criterion and post-selection parental contributions in multi-parental crosses: application to polygenic trait introgression. G3 Genes Genomes Genetics 9(5):1469–1479. https://doi.org/10.1534/g3.119.400129
Article PubMed PubMed Central Google Scholar
Ben-Sadoun S, Rincent R, Auzanneau J, Oury FX, Rolland B, Heumez E, Ravel C, Charmet G, Bouchet S (2020) Economical optimization of a breeding scheme by selective phenotyping of the calibration set in a multi-trait context: application to bread making quality. Theor Appl Genet 133:2197–2212
Article PubMed CAS Google Scholar
Bijma P, Wientjes YCJ, Calus MPL (2020) Breeding top genotypes and accelerating response to recurrent selection by selecting parents with greater gametic variance. Genetics 214:91–107
Article PubMed Google Scholar
Bohn M, Utz HF, Melchinger AE (1999) Genetic similarities among winter wheat cultivars determined on the basis of RFLPs, AFLPs, and SSRs and their use for predicting progeny variance. Crop Sci 39:228–237
Article CAS Google Scholar
Browning BL, Zhou Y, Browning SR (2018) A one-penny imputed genome from next-generation reference panels. Am J Human Genet 103:338–348
Article CAS Google Scholar
Calus MPL (2010) Genomic breeding value prediction: methods and procedures. Animal 4:157–164
Article PubMed CAS Google Scholar
Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA (2010) The impact of genetic architecture on genome-wide evaluation methods. Genetics 185:1021–1031
Article PubMed PubMed Central CAS Google Scholar
Danguy des Déserts A, Bouchet S, Sourdille P, Servin B (2021) Evolution of recombination landscapes in diverging populations of bread wheat.Genom Biol Evolut 13. https://doi.org/10.1093/gbe/evab152
Danguy des Déserts A, Durand N, Servin B, Goudemand-Dugué E, Alliot J-M, Ruiz D, Charmet G, Elsen J-M, Bouchet S ( 2023) Comparison of genomic-enabled cross selection criteria for the improvement of inbred line breeding populations. G3 Genes|Genomes|Genetics jkad195
Ellis M, Spielmeyer W, Gale K, Rebetzke G, Richards R (2002) “Perfect” markers for the Rht-B1b and Rht-D1b dwarfing genes in wheat. Theor Appl Genet 105:1038–1042
Article PubMed CAS Google Scholar
Elsen J-M (2022) Genomic prediction of complex traits, principles, overview of factors affecting the reliability of genomic prediction, and algebra of the reliability. In: Ahmadi N, Bartholomé J (eds) Genomic prediction of complex traits: methods and protocols. Springer US, New York, pp 45–76. https://doi.org/10.1007/978-1-0716-2205-6_2
Chapter Google Scholar
Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. https://doi.org/10.3835/plantgenome2011.08.0024
Article Google Scholar
Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME (2012) Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95:4114–4129
Article PubMed CAS Google Scholar
Falconer DS, Mackay TFC (1996) Introduction to quantitative genetics. Prentice Hall, Harlow, England
Google Scholar
Gianola D, de Los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the bayesian alphabet. Genetics 183(1):347–363
Article PubMed PubMed Central Google Scholar
Goddard ME, Hayes BJ, Meuwissen THE (2011) Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet 128:409–421
Article PubMed CAS Google Scholar
Habier D, Fernando RL, Dekkers JCM (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:2389–2397
Article PubMed PubMed Central CAS Google Scholar
Habier D, Fernando RL, Kizilkaya K, Garrick DJ (2011) Extension of the bayesian alphabet for genomic selection. BMC Bioinf 12:186
Article Google Scholar
Hayes BJ, Visscher PM, Goddard ME (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res (camb) 91:47–60
Article PubMed CAS Google Scholar
Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423–447
Article PubMed CAS Google Scholar
Heslot N, Jannink J-L, Sorrells ME (2013) Using genomic prediction to characterize environments and optimize prediction accuracy in applied breeding data. Crop Sci 53:921–933
Article Google Scholar
Hung H-Y, Browne C, Guill K, Coles N, Eller M, Garcia A, Lepak N, Melia-Hancock S, Oropeza-Rosas M, Salvo S et al (2012) The relationship between parental genetic or phenotypic divergence and progeny variation in the maize nested association mapping population. Heredity (edinb) 108:490–499
Article PubMed CAS Google Scholar
Lee SH, Clark S, van der Werf JHJ (2017) Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship. PLoS ONE 12:e0189775
Article PubMed PubMed Central Google Scholar
Legarra A, Robert-Granié C, Croiseau P, Guillaume F, Fritz S (2011) Improved Lasso for genomic selection. Genet Res 93:77–87
Article CAS Google Scholar
Lehermeier C, Teyssèdre S, Schön C-C (2017) Genetic gain increases by applying the usefulness criterion with improved variance prediction in selection of crosses. Genetics 207:1651–1661
Article PubMed PubMed Central Google Scholar
Lian L, Jacobson A, Zhong S, Bernardo R (2015) Prediction of genetic variance in biparental maize populations: Genomewide marker effects versus mean genetic variance in prior populations. Crop Sci 55:1181–1188
Article Google Scholar
Liu X, Wang H, Wang H, Guo Z, Xu X, Liu J, Wang S, Li W-X, Zou C, Prasanna BM et al (2018) Factors affecting genomic selection revealed by empirical evidence in maize. Crop J 6:341–352
Article Google Scholar
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
Article PubMed PubMed Central CAS Google Scholar
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
Article PubMed PubMed Central CAS Google Scholar
Mohammadi M, Tiede T, Smith KP (2015) PopVar: a genome-wide procedure for predicting genetic variance and correlated response in biparental breeding populations. Crop Sci 55:2068–2077
Article CAS Google Scholar
Murtagh F, Legendre P (2014) Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? J Classif 31:274–295
Article Google Scholar
Neyhart JL, Lorenz AJ, Smith KP (2019) Multi-trait improvement by predicting genetic correlations in breeding crosses. G3 Genes Genom Genet 9(10):3153–3165. https://doi.org/10.1534/g3.119.400406
Article Google Scholar
Nyquist WE, Baker RJ (1991) Estimation of heritability and prediction of selection response in plant populations. Crit Rev Plant Sci. https://doi.org/10.1080/07352689109382313
Article Google Scholar
Osthushenrich T, Frisch M, Herzog E (2017) Genomic selection of crossing partners on basis of the expected mean and variance of their derived lines. PLoS ONE 12:e0188839
Article PubMed PubMed Central Google Scholar
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103:681–686
Article CAS Google Scholar
Pasek J (2021) Schwemmle with some assistance from AT and some code modified from R -core; A contributions by GC and M. weights: Weighting and Weighted Statistics. https://cran.r-project.org/web/packages/weights/index.html
Pérez P, de Los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198(2):483–495
Article PubMed PubMed Central Google Scholar
Piepho H-P, Möhring J (2007) Computing heritability and selection response from unbalanced plant breeding trials. Genetics 177:1881–1888
Article PubMed PubMed Central Google Scholar
Pook T, Schlather M, Simianer H (2020) MoBPS—modular breeding program simulator. G3 Genes Genom Genet 10(6):1915–1918. https://doi.org/10.1534/g3.120.401193
Article Google Scholar
R Core Team (2021) R: A language and environment for statistical computing. https://www.r-project.org/
Rimbert H, Darrier B, Navarro J, Kitt J, Choulet F, Leveugle M, Duarte J, Rivière N, Eversole K (2018) High throughput SNP discovery and genotyping in hexaploid wheat. PLoS ONE 13(1):e0186329. https://doi.org/10.1371/journal.pone.0186329
Article PubMed PubMed Central CAS Google Scholar
Rodríguez-Álvarez MX, Boer MP, van Eeuwijk FA, Eilers PHC (2018) Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spat Stat 23:52–71
Article Google Scholar
Santos DJA, Cole JB, Lawlor TJ, VanRaden PM, Tonhati H, Ma L (2019) Variance of gametic diversity and its application in selection programs. J Dairy Sci 102:5279–5294
Article PubMed CAS Google Scholar
Schnell FW, Utz HF (1975) F1-leistung und elternwahl euphyder züchtung von selbstbefruchte. In Bericht über die Arbeitstagung der Vereinigung Österreichischer Pflanzenzüchter, pp. 243–248, Gumpenstein, Austria
Selle ML, Steinsland I, Hickey JM, Gorjanc G (2019) Flexible modelling of spatial variation in agricultural field trials with the R package INLA. Theor Appl Genet 132:3277–3293
Article PubMed PubMed Central Google Scholar
Shepherd RK, Meuwissen TH, Woolliams JA (2010) Genomic selection and complex trait prediction using a fast EM algorithm applied to genome-wide markers. BMC Bioinf 11:529
Article Google Scholar
Thorwarth P, Piepho HP, Zhao Y, Ebmeyer E, Schacht J, Schachschneider R, Kazman E, Reif JC, Würschum T, Longin CFH (2018) Higher grain yield and higher grain protein deviation underline the potential of hybrid wheat for a sustainable agriculture. Plant Breed 137:326–337
Article CAS Google Scholar
Tiede T, Kumar L, Mohammadi M, Smith KP (2015) Predicting genetic variance in bi-parental breeding populations is more accurate when explicitly modeling the segregation of informative genomewide markers. Mol Breed 35:199
Article Google Scholar
Usai MG, Goddard ME, Hayes BJ (2009) LASSO with cross-validation for genomic selection. Genet Res (camb) 91:427–436
Article PubMed CAS Google Scholar
Utz HF, Bohn M, Melchinger AE (2001) Predicting progeny means and variances of winter wheat crosses from phenotypic values of their parents. Crop Sci 41:1470–1478
Article Google Scholar
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423
Article PubMed CAS Google Scholar
Whittaker JC, Thompson R, Denham MC (2000) Marker-assisted selection using ridge regression. Genet Res 75:249–252
Article PubMed CAS Google Scholar
Wimmer V, Lehermeier C, Albrecht T, Auinger H-J, Wang Y, Schön C-C (2013) Genome-wide prediction of traits with different genetic architecture through efficient variable selection. Genetics 195:573–587
Article PubMed PubMed Central Google Scholar
Yao J, Zhao D, Chen X, Zhang Y, Wang J (2018) Use of genomic selection and breeding simulation in cross prediction for improvement of yield and quality in wheat (Triticum aestivum L.). The Crop Journal 6:353–365
Article Google Scholar
Zhong S, Jannink J-L (2007) Using quantitative trait loci results to discriminate among crosses on the basis of their progeny mean and variance. Genetics 177:567–576
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors thank the Fonds de Soutien à l’Obtention Végétale (FSOV) for financing the FSOV 2020 I—PrediCropt project. The authors acknowledge the partners of the project (Agri-Obtentions and Florimond-Desprez Veuve & Fils) and the INRAE personal responsible for experimental evaluation (Laurent Falchetto, Sandrine Berges -INRAE UE PHACC-, Clement Debiton -INRAE UMR GDEC, CRB group-, Kevin Bargoin -INRAE UMR GDEC, DIGEN group-, Patrice Walczak -INRAE UE Ferlus-, Paul Bataillon -INRAE UE Auzeville-, and Emmanuel Heumez -INRAE UE GCIE-). They also thank Marie-Hélène Bernicot and Solène Barrais for providing the GEVES training population dataset (corresponding to the experimental evaluation data for French national registration) and Justin Blancon for helping in adjusting for spatial heterogeneity in experimental measures.

Funding

This work was supported by the Fonds de Soutien à l’Obtention Végétale (FSOV 2020 I—PrediCropt project). Genotyping was supported by the Breedwheat grant (ANR-10-BTBR-0003) and INRAE IVD program.

Author information

Authors and Affiliations

UMR1095, GDEC, INRAE-Université Clermont-Auvergne, Clermont-Ferrand, France
Claire Oget-Ebrad, François-Xavier Oury & Sophie Bouchet
INRAE-UE Lille, 2 Chaussée Brunehaut, Estrées Mons, BP50136, 80203, Peronne Cedex, France
Emmanuel Heumez
Agri-Obtentions, Ferme de Gauvilliers, 78660, Orsonville, France
Laure Duchalais
Florimond-Desprez Veuve & Fils SAS, Cappelle-en-Pévèle, France
Ellen Goudemand-Dugué
UMR1388, GenPhySE, INRAE-Université de Toulouse, Castanet-Tolosan, France
Jean-Michel Elsen

Authors

Claire Oget-Ebrad
View author publications
You can also search for this author in PubMed Google Scholar
Emmanuel Heumez
View author publications
You can also search for this author in PubMed Google Scholar
Laure Duchalais
View author publications
You can also search for this author in PubMed Google Scholar
Ellen Goudemand-Dugué
View author publications
You can also search for this author in PubMed Google Scholar
François-Xavier Oury
View author publications
You can also search for this author in PubMed Google Scholar
Jean-Michel Elsen
View author publications
You can also search for this author in PubMed Google Scholar
Sophie Bouchet
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

COE collected and formatted the data, performed the analyses, interpreted the results and wrote the manuscript. EH, LD, EGD and FXO provided crosses’ progenies for the FSOV PrediCropt project. SB conceived and coordinated the FSOV project. JME and SB supervised COE, interpreted the results and helped writing the manuscript. All authors reviewed and approved the final version of the manuscript.

Corresponding author

Correspondence to Sophie Bouchet.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval

This is an observational study. No ethical approval is required.

Additional information

Communicated by Huihui Li.

Jean-Michel Elsen and Sophie Bouchet have contributed equally to this work.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 2034 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Oget-Ebrad, C., Heumez, E., Duchalais, L. et al. Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat. Theor Appl Genet 137, 226 (2024). https://doi.org/10.1007/s00122-024-04718-6

Download citation

Received: 26 September 2023
Accepted: 16 August 2024
Published: 18 September 2024
DOI: https://doi.org/10.1007/s00122-024-04718-6

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat

Abstract

Key message

Abstract

Similar content being viewed by others

Accuracy of genomic selection for grain yield and agronomic traits in soft red winter wheat

Genomic assisted selection for enhancing line breeding: merging genomic and phenotypic selection in winter wheat breeding programs with preliminary yield trials

Genomic selection in a commercial winter wheat population

Introduction

Materials and methods

Materials

The training population (TP)

TP composition

TP adjusted phenotypes

TP genotypes

TP variance components

Predictive ability of genome estimated breeding values (GEBV) in TP

The experimental population (EP)

EP description

Adjusted EP phenotypes

Methods

Cross value components predictive ability

Models and formulae to predict cross value components

EP cross value components

Cross value components’ predictive ability

Factors influencing predictive ability of the cross value components (simulations)

Simulation design description

Parameters influencing the predictive ability

Results

Quality of the training population (TP)

Simulation study

Genetic architecture

Progeny size

Marker effect estimation precision

Genetic composition of the mating plan

Experimental data application

Discussion

A high-quality training population (TP)

Parameters impacting predictive abilities based on simulations

Validation in experimental data

Recommendations for breeding schemes

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Additional information

Publisher's Note

Supplementary Information

Supplementary file1 (DOCX 2034 KB)

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation