1 Introduction

Mountain areas are highly sensitive to climate change (Hock et al 2019). They respond faster and also more intensely to increasing greenhouse gas concentrations in comparison to the global mean and to other regions (Rangwala and Miller 2012; Pepin et al 2015) due to an elevation-dependency of various physical mechanisms (Palazzi et al 2019; Pepin et al 2022). Existing studies mostly focus on changes in temperature and precipitation patterns with elevation (Kuhn and Olefs 2020; Tudoroiu et al 2016; Pepin et al 2022) and reductions of snow cover (Kotlarski et al 2022; Matiu et al 2021). However, many other processes also depend on elevation, such as atmospheric and soil moisture, winds, biodiversity, aerosol species and concentration, surface albedo, shortwave and longwave radiation components, and surface heat fluxes (Napoli et al 2023). Indeed, climate change has significant impacts on ecosystem health and biodiversity, economy, and society in mountain regions (Adler et al 2022). Additionally, since mountains regulate streamflow through orographic lifting of moist airflows, natural storage in snowpacks, glaciers, and groundwater, they have significant impacts on manifold larger areas downstream (Immerzeel et al 2020; Viviroli et al 2007).

The climate in mountain regions is driven mainly by the interaction between large-scale atmospheric flows and local topography (Sandu et al 2019). At the larger scale, elevation exerts the main influence on climate, followed by exposition and slope. Elevation modulates many meteorological variables such as near-surface temperature (expressed as temperature lapse rates) and orographic precipitation (Daly et al 2008), while exposition and slope affect mostly incoming radiation and wind flows, and hence orographic enhancement of precipitation upstream of mountain ranges and shadowing downstream.

The quantitative assessment of future climate scenarios relies on numerical climate models (Smiatek et al 2016; Coppola et al 2021). In mountain regions, characterized by complex topography, a high spatial resolution is necessary to adequately represent the relevant physical processes (Ban et al 2021; Pieri et al 2015; Berthou et al 2020). Consequently, Regional Climate Models (RCMs) are driven by output from General Circulation Models (GCMs) in order to obtain a spatial resolution suitable for describing the atmospheric dynamics affected by complex terrain. In the European Alps, recent assessments relied on the large EURO-CORDEX (the European branch of the World Climate Research Programme’s Coordinated Regional Climate Downscaling Experiment) ensemble (Kotlarski et al 2022; Jacob et al 2014, 2020), which is available over the whole of Europe. Other single model runs are also available at higher resolutions (Warscher et al 2019), some of them at convection-permitting scale (Ban et al 2014).

Although RCMs reproduce large-scale climate features well, they can be affected by significant bias at the local scale. For instance, in Europe average temperature biases are generally below 1.5 \(^{\circ }\)C and precipitation biases below 40% for reanalysis-driven RCMs in the period 1989–2008 (Kotlarski et al 2014). Vautard et al (2021) evaluated GCM-driven RCMs for the period 1981–2010 over Europe using E-OBS as reference, and additional high-resolution precipitation datasets for subregions, including APGD for the Alps. They found that RCMs in Europe were generally too cold and too wet, that model performance depended on the adopted GCM-RCM combination, and that none was optimal in all aspects. For the Alps, Smiatek et al (2016) performed an evaluation of a subset of the currently available EURO-CORDEX ensemble at 0.11\(^{\circ }\) resolution and found an overall cold bias in the range of \(-0.8\) to \(-1.9\,^{\circ }\)C and a wet bias between 14.8 and 41.6%. Kilometre-scale simulations have been found to generally reduce biases in all variables because the topography is better resolved than in larger scale models (Lucas-Picher et al 2021). Also, enabling convection almost eliminates heavy precipitation biases in summer (Ban et al 2021). Yet, kilometre-scale simulations are only available for short time ranges, usually a maximum of 10 years, and do not cover all emission scenarios that are, for instance, available in EURO-CORDEX. Moreover, besides biases in climatological means, RCMs also show some indication of diverging climate trends with respect to their driving GCMs for the Alpine region (Boé et al 2020; Schumacher et al 2023; Sørland et al 2018; Schwingshackl et al 2019).

Moreover, limitations in the availability and accuracy of observational data hamper bias evaluation. This is often a serious drawback, especially for precipitation, where differences in the observational datasets can be in the range of the bias spread in an RCM ensemble (Prein and Gobiet 2017; Kotlarski et al 2019). In mountainous terrain, an additional challenge is the operation and maintenance of meteorological stations because of harsh environmental conditions and lack of permanent access, which hinders the setup of dense observational networks suitable to capture the high spatial variability of mountain climates. Furthermore, precipitation measurements in mountains suffer from high uncertainties because of wind-driven undercatch of precipitation (Førland and Hanssen-Bauer 2000), or distorted observations of solid precipitation in the absence of heated rain gauges.

Biases negatively impact estimates of future changes in heat indices (Iturbide et al 2022), which are usually defined on the basis of temperature thresholds (e.g., 20 \(^{\circ }\)C minimum temperature for tropical nights), and, consequently, biases can significantly alter the estimated numbers. Furthermore, biases also affect impact models used in hydrology and energy simulations. These require unbiased data, often at high spatial and temporal resolutions (Maraun et al 2010). Bias adjustment relies on some form of observational reference, for example, from in-situ stations, spatial analyses or reanalysis. Simple and frequently used methods consist of the application of parametric and non-parametric adjustments to match modeled and observed variables (Gudmundsson et al 2012; Gutiérrez et al 2019).

Previous assessments of biases in RCMs at the European scale focused on spatial patterns of several different indices, with elevation dependence rarely taken into consideration (Vautard et al 2021). Dedicated studies for the European Alps looked explicitly into the elevation dependency, but rarely considered other indices beyond the mean (Kotlarski et al 2022; Gobiet et al 2014; Monteiro and Morin 2023). Furthermore, uncertainty in the observational datasets was seldom considered in the analysis (Herrera et al 2020; Prein and Gobiet 2017; Kotlarski et al 2019).

In this study we address these gaps by analysing biases in the full EURO-CORDEX ensemble of RCMs as available in 2023 over the European Alps, based on a variety of observational reference data products, and focusing on elevation dependence and indices that cover mean and extreme conditions. In addition, we evaluate the impact of different bias adjustment methods and datasets available under CORDEX-Adjust, which is the bias-adjusted EURO-CORDEX ensemble available via the Earth System Grid Federation (ESGF) archive and described in more detail in Sect. 2.3. The specific goals of this study are:

  • assessing observational uncertainty in the European Alps by inter-comparing a range of observational datasets;

  • understanding how climate model biases depend on elevation, season, climate indices, and climate model chain (GCM and RCM);

  • identifying how well RCMs capture temporal and spatial (horizontal and vertical) variability of climatic indices;

  • determining the impact of parametric and non-parametric bias adjustment methods and of different bias adjustment reference datasets (as available in CORDEX-Adjust) on the elevation-dependent bias structure.

The paper is organized as follows. Data and methods are presented in Sect. 2. Results are presented and discussed in Sect. 3: A comparison of different observational datasets for the Alpine region is presented in Sect. 3.1; this is followed by the evaluation of RCM biases (Sect. 3.2), focusing on elevation patterns and different indices; then, the impact of bias adjustment methods and datasets is presented (Sect. 3.3). Finally, conclusions are drawn in Sect. 4. In order to reduce the number of figures in the main manuscript, preference was given to the summer and winter seasons. Results for spring and fall are available in the supplementary material.

2 Data and methods

In the following, we use the term bias to denote the difference between model and observation, but we acknowledge that the term may be inaccurate, particularly in mountain environments, given that it implicitly assumes that observations are error-free.

2.1 Study area

The study area is the Greater Alpine Region (GAR, Fig. 1) stretching approximately 1200 km from west to east across several countries in central-western Europe. The dominant feature of this area is the presence of deeply incised valleys within high mountain chains with the highest peak above 4800 m a.s.l. The major climatic influences are the westerly flows of moist air from the Atlantic, the European continental flows with cold air from northern Europe and the continental air masses from the eastern side, and the Mediterranean Sea, usually associated with warmer air advection northward (Auer et al 2007).

Fig. 1

Elevation map of the study area. The source is the European Digital Elevation Model (EU-DEM), version 1.1, which has been aggregated to 1 km. The black rectangle is the area for which the regional climate model data was cropped

2.2 Raw model data

In the present study we used the ensemble of GCM-driven EURO-CORDEX RCMs downloaded from ESGF servers in January 2023. Since high resolutions are required for mountain terrain, we used only the EUR-11 simulations, which have a horizontal spacing of 0.11\(^{\circ }\), equivalent to approximately 12 km grid cell width. The main analysis of the temperature and precipitation indices was performed for the 30-year period 1971–2000.

Additionally, we tested the time-dependence of biases by analysing 20-year moving window averages between 1971 and 2008 for precipitation and between 1971 and 2022 for temperature. Note that we did not perform formal tests of stationarity, but rather performed a descriptive analysis of the time-dependence of biases. For this, the historical run of the RCMs, which ends in 2005, was merged with the RCP8.5 (representative concentration pathway) scenario, which starts in 2006. Consequently, we used the years 1971–2005 from the historical run and 2006–2022 from RCP8.5. Besides raw biases, we also calculated detrended biases by fitting a linear model to the full series by means of ordinary least squares and afterwards calculating the 20-year moving window averages (see Sect. 2.6).
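As an illustration of this descriptive analysis, the following minimal sketch computes 20-year moving window averages of an annual bias series. The function name `moving_window_mean` and the synthetic bias values are hypothetical and serve only as an example; they are not taken from the study's code or results.

```python
import numpy as np

def moving_window_mean(values, window=20):
    """Mean of every complete `window`-year block, shifted by one year at a time."""
    values = np.asarray(values, dtype=float)
    if values.size < window:
        raise ValueError("series is shorter than the window")
    kernel = np.ones(window) / window
    # mode="valid" keeps only windows that lie fully inside the series
    return np.convolve(values, kernel, mode="valid")

# Synthetic annual bias (model minus observation) for 1971-2008, illustration only
years = np.arange(1971, 2009)
bias = -1.0 + 0.01 * (years - years[0])

window_means = moving_window_mean(bias, window=20)
window_start = years[: window_means.size]  # first year of each 20-year window
```

With 38 years of data this yields 19 overlapping windows, the first covering 1971–1990 and the last 1989–2008.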

The list of RCMs used in the present study is shown in Table 1. We used daily data for precipitation (pr), mean temperature (tas), minimum temperature (tasmin), and maximum temperature (tasmax). The ensemble size was 52/52/46/46 GCM-RCM combinations for pr/tas/tasmin/tasmax, respectively.

Table 1 Overview of regional climate models used in this study

2.3 Bias adjusted model data

In addition to the raw simulations, we analysed the impact of bias adjustment as available in CORDEX-Adjust (Dosio 2016; Bartók et al 2019), which includes a subset of the models (Table 2). CORDEX-Adjust is mainly a search category (facet) on the ESGF servers. Since many users require or want bias-adjusted climate model output for impact studies, the CORDEX community decided to provide unified access to such data using CORDEX data standards (see https://is-enes-data.github.io/, last accessed 16 May 2024). As for CORDEX, CORDEX-Adjust was the result of a community effort, supported by various international projects (e.g., CLIPC, IS-ENES2). It built upon the long-term bias adjustment expertise of various climate modeling centers, which implemented multiple bias adjustment routines using different observational datasets as reference (gridded observations or reanalysis).

Within CORDEX-Adjust, the following datasets were used as reference for the bias adjustment:

  • WFDEI (WATCH Forcing Data methodology applied to ERA-Interim) is the post-processed and bias-corrected ERA-Interim dataset produced using the WATCH (Water and Global Change) methodology (see Weedon et al (2014) for the WFDEI and Weedon et al (2011) for the original WATCH). The dataset was produced at a horizontal spacing of 0.44\(^{\circ }\) over the entire globe.

  • E-OBS is a spatial analysis of in-situ observations over the European land domain (Haylock et al 2008) with a horizontal spacing of 0.1\(^{\circ }\), managed and regularly updated by the Royal Netherlands Meteorological Institute (KNMI). In CORDEX-Adjust, E-OBS version 10 was used, but here we used version 26 as reference (see below), available since January 2023. Note that version 10 is not publicly available anymore. Moreover, version updates not only extend into more recent periods, but the full period is re-calculated as the number of included stations has grown over time. Consequently, there can be considerable differences between versions (personal communication, KNMI).

  • MESAN (Landelius et al 2016) is a high-resolution downscaling of the HIgh-Resolution Limited-Area Model (HIRLAM) with the MESoscale ANalysis system (MESAN) with horizontal spacing of 5 km.

As bias adjustment methods, the climate modelling group at METNO employed quantile mapping (QMAP) (Gudmundsson et al 2012), SMHI used distribution-based scaling (DBS) (Yang et al 2010), and LSCE-IPSL applied the cumulative distribution transform (CDFt) (Vrac et al 2016). The variants of CDFt (CDFt, CDFT22, CDFT22s) of CORDEX-Adjust apply the same bias adjustment method, even if they have a different name. They are based on different versions of the code implementing CDFt (Thomas Noël and Harilaos Loukus, personal communication). The influence of three bias adjustment reference datasets (EOBS10, WFDEI, and MESAN) was tested with CDFt, and the influence of three bias adjustment methods (CDFt, DBS45, and QMAP) with MESAN as bias adjustment reference dataset.
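To make the idea of a non-parametric adjustment concrete, the sketch below shows a generic empirical quantile mapping for a continuous variable such as temperature. It is an assumption-laden illustration and not the QMAP, DBS, or CDFt implementation used by METNO, SMHI, or LSCE-IPSL; the function name, the number of quantiles, and the synthetic data are hypothetical.

```python
import numpy as np

def empirical_quantile_mapping(model_cal, obs_cal, model_target, n_quantiles=99):
    """Map model values onto the observed distribution via empirical quantiles."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    model_q = np.quantile(model_cal, probs)  # model quantiles, calibration period
    obs_q = np.quantile(obs_cal, probs)      # observed quantiles, calibration period
    # Piecewise-linear transfer function between matching quantiles.
    # np.interp holds values outside the calibrated range at the end quantiles.
    return np.interp(model_target, model_q, obs_q)

# Synthetic example: a model that is uniformly 2 degC too cold
rng = np.random.default_rng(0)
obs = rng.normal(10.0, 3.0, 5000)
model = obs - 2.0
adjusted = empirical_quantile_mapping(model, obs, model)
```

Holding values constant beyond the outermost quantiles is only one possible choice for the tails. Variables with many tied values, such as daily precipitation with frequent dry days, need additional handling that this sketch omits.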

Table 2 Subset of regional climate models from Table 1 that are available with bias adjustment from CORDEX-Adjust

2.4 Observational data

In addition to the datasets used in the bias adjustment, we also considered two other gridded observational reference datasets covering the whole study region, namely E-OBS v26.0 and APGD. The former provides daily temperature (min, mean, max) and precipitation at 0.1\(^{\circ }\) spatial resolution obtained by spatial analysis of in-situ observations (Cornes et al 2018), whereas a better spatial analysis of precipitation is provided by the latter, which is characterized by a grid spacing of 5 km and was obtained from a much denser network of meteorological stations than E-OBS (Isotta and Frei 2013). For the observation intercomparison we used precipitation from APGD, E-OBS, MESAN, and WFDEI; for temperature E-OBS, MESAN, and WFDEI. Based on that analysis (see results), we subsequently chose APGD for precipitation and E-OBS for temperature as main reference to evaluate RCM biases. Monteiro and Morin (2023) also used these two as main observational reference. Note that evaluation of bias adjusted RCMs cannot be considered fully independent, since many station records entered both the evaluation datasets (APGD, E-OBS version 26) and the calibration datasets (E-OBS version 10, WFDEI, MESAN).

To compare all Alpine-wide observational and reanalysis datasets including the ones used in CORDEX-Adjust (E-OBS, APGD, MESAN, WFDEI), we further collected gridded national datasets. These datasets cannot be considered completely independent from the Alpine-wide ones, since many stations are included in all products. However, we assume that these national datasets are less affected by errors, since they are created with the highest possible number of available observations and some of them also use spatial techniques specifically tailored for mountain terrain. Nonetheless, these high-resolution national datasets also suffer from observational and interpolation uncertainties, with added complexities in mountain terrain due to, for example, temperature inversion and wind-driven precipitation undercatch.

The following national datasets have been used:

  • SPARTACUS (Austria): 1 km scale resolution for daily minima (tmin) and maxima (tmax) of air temperatures and daily precipitation (Hiebl and Frei 2015, 2018)

  • Switzerland: 1 km daily (MeteoSwiss product names: TminD, TabsD, TmaxD, and RhiresD)

  • SLOCLIM (1 km daily tmin, tmax, precipitation) for Slovenia (Škrk et al 2021)

  • HYRAS (5 km daily tmin, tmean, tmax, precipitation) for Germany (Razafimaharo et al 2020)

  • CRESPI (250 m daily tmean and precipitation) for a sub-region in northern Italy (Crespi et al 2021)

  • France: 1 km daily tmin, tmean, tmax (Besson et al 2019), and precipitation (Lassegues 2018; Soubeyroux et al 2019)

We note that other datasets exist, which could have been used instead or in addition, for example SPAZM (SPAtialisation en Zones de Montagne) for precipitation in France, which combines weather types with interpolation of in-situ stations (Gottardi et al 2012), or a 2 km reanalysis for Italy (Adinolfi et al 2023). However, we adopted the above ones to stay as close as possible to observations, rather than relying on reanalysis.

For the observational intercomparison the reference period was 1989 to 2008, which is the common period of all datasets.

2.5 Aggregation in time and space, climatic indices

All observational data products were remapped to the rotated pole EUR-11 grid from EURO-CORDEX to allow comparisons by grid cell. We used first-order conservative remapping in CDO (climate data operators) (Schulzweida 2022), since the datasets have spatial resolutions varying from 1 to 12 km. Grid cells with partial coverage were only remapped if the target area was at least 95% covered by the source grid. The only dataset with coarser resolution than the RCMs was WFDEI, for which we employed a bilinear interpolation, since the conservative method introduces unrealistic adjustments at cell boundaries to ensure areal mass conservation. Once remapped, the national datasets were spatially united into a single extended dataset, denoted NAT (for NATional datasets) in the remainder of this work.
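The effect of the 95% coverage criterion can be sketched with a strongly simplified stand-in for conservative remapping. The example assumes equal-area fine cells that align exactly with the coarse cells, so it is not equivalent to the first-order conservative remapping in CDO, which weights source cells by their fractional overlap area. The function name `block_average` and the toy grid are hypothetical.

```python
import numpy as np

def block_average(fine, factor, min_coverage=0.95):
    """Average factor x factor blocks of a fine grid onto a coarser grid.

    NaN marks fine cells outside the source dataset. A coarse cell is kept
    only if at least `min_coverage` of its fine cells hold valid data.
    """
    fine = np.asarray(fine, dtype=float)
    ny, nx = fine.shape
    if ny % factor or nx % factor:
        raise ValueError("grid shape must be a multiple of factor")
    blocks = fine.reshape(ny // factor, factor, nx // factor, factor)
    valid = np.isfinite(blocks)
    coverage = valid.mean(axis=(1, 3))            # covered fraction per coarse cell
    total = np.where(valid, blocks, 0.0).sum(axis=(1, 3))
    count = valid.sum(axis=(1, 3))
    mean = np.divide(total, count, out=np.full(total.shape, np.nan), where=count > 0)
    return np.where(coverage >= min_coverage, mean, np.nan)

# Toy 4 x 4 field with one missing cell in the upper-left block
field = np.arange(16.0).reshape(4, 4)
field[0, 0] = np.nan
coarse = block_average(field, factor=2)
```

In this toy case the upper-left coarse cell is only 75% covered and is therefore set to missing, while the other three cells are plain block means.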

The elevation analysis was also performed on the same EUR-11 grid, by taking the subset of grid cells that were common to all datasets for each variable. In the evaluation sections this was often approximately APGD, E-OBS, or NAT, which are the datasets with the smallest coverage. In the sections on the impact of bias adjustment, this corresponds approximately to E-OBS (i.e., only land surface). Here, the term approximately refers to slight differences at the study area borders and land-sea boundaries.
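A minimal sketch of such an elevation-band aggregation is given below. The 500 m band edges, the function name `mean_by_elevation_band`, and the sample values are assumptions chosen for illustration; grid cells with missing values are skipped so that only cells with valid data contribute.

```python
import numpy as np

def mean_by_elevation_band(values, elevation, edges):
    """Average `values` over grid cells in each band [edges[i], edges[i+1])."""
    values = np.asarray(values, dtype=float).ravel()
    elevation = np.asarray(elevation, dtype=float).ravel()
    band = np.digitize(elevation, edges) - 1   # band index per grid cell
    out = np.full(len(edges) - 1, np.nan)      # NaN where a band has no data
    for i in range(len(edges) - 1):
        sel = (band == i) & np.isfinite(values)
        if sel.any():
            out[i] = values[sel].mean()
    return out

# Hypothetical 500 m bands with an open-ended top band above 2500 m
edges = np.array([0, 500, 1000, 1500, 2000, 2500, np.inf])
elevation = np.array([100.0, 400.0, 700.0, 2600.0, 3500.0])
bias = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
band_means = mean_by_elevation_band(bias, elevation, edges)
```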

We computed annual and seasonal averages, denoted with ’mean’, and percentiles, which are denoted with Xpct, where X is the percentile value between 0 and 100. Specifically, we calculated the 5th, 50th, and 95th percentile for temperature and the 95th percentile for precipitation. Additionally we calculated a subset of the ETCCDI climatic indices (Zhang et al 2011), see Table 3. This selection of indicators includes the ones with the main weather-climate impacts in the natural and socio-economic environment (Crespi et al 2020).
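For orientation, the sketch below computes a few of the indices mentioned in this paper from daily series, using the standard ETCCDI thresholds (wet day: daily precipitation of at least 1 mm; tropical night: daily minimum temperature above 20 °C). The function names and sample values are hypothetical, and computing the 95th percentile over all days rather than over wet days only is an assumption of this example.

```python
import numpy as np

WET_DAY_MM = 1.0          # ETCCDI wet-day threshold in mm/d
TROPICAL_NIGHT_C = 20.0   # ETCCDI tropical-night threshold in degC

def rr1(pr_daily):
    """Number of wet days (daily precipitation >= 1 mm)."""
    return int(np.sum(np.asarray(pr_daily, dtype=float) >= WET_DAY_MM))

def sdii(pr_daily):
    """Simple daily intensity index: mean precipitation on wet days."""
    pr = np.asarray(pr_daily, dtype=float)
    wet = pr[pr >= WET_DAY_MM]
    return float(wet.mean()) if wet.size else float("nan")

def tropical_nights(tasmin_daily):
    """Number of days with minimum temperature above 20 degC."""
    return int(np.sum(np.asarray(tasmin_daily, dtype=float) > TROPICAL_NIGHT_C))

def pr_95pct(pr_daily):
    """95th percentile of daily precipitation (all days; an assumption here)."""
    return float(np.percentile(np.asarray(pr_daily, dtype=float), 95))

# Hypothetical daily samples
pr = [0.0, 0.5, 1.0, 4.0, 10.0]
tasmin = [18.0, 20.0, 21.5, 25.0]
n_wet, intensity, n_tropical = rr1(pr), sdii(pr), tropical_nights(tasmin)
```

Because these indices count threshold exceedances, a small shift in the underlying series can change the result by whole days, which is why they are sensitive to model bias.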

Table 3 Overview of the calculated climatic indices (ETCCDI) at the annual scale

2.6 Statistical analysis

In Sect. 3.2.1, the bi-variate relationship between temperature and precipitation biases is assessed. For this we calculated average winter and summer temperature and precipitation biases by elevation band for each GCM-RCM combination. We then grouped the resulting averages by RCMs, such that each RCM had multiple runs by different GCMs. Note that in each group, depending on RCM, the number of GCMs was different, ranging between three and six. For all these groups (one RCM, multiple GCMs, one elevation band, one season), we calculated linear regression slopes (y: precipitation bias, x: temperature bias) to ease the comparison. However, these are considered as approximate estimates given the low number of GCMs (from three to six) used to compute the group statistics. For REMO2009, which was driven by only one GCM, no regression slope was estimated.
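The slope estimation for one such group (one RCM, several driving GCMs, one elevation band, one season) could look like the sketch below. The function name `bias_slope` and the bias values are made up for illustration. Groups with fewer than two members return no slope, in line with the treatment of REMO2009.

```python
import numpy as np

def bias_slope(temp_bias, precip_bias):
    """OLS slope of precipitation bias (pp) against temperature bias (degC)."""
    x = np.asarray(temp_bias, dtype=float)
    y = np.asarray(precip_bias, dtype=float)
    if x.size < 2 or np.ptp(x) == 0:
        return float("nan")  # slope undefined for a single run or identical x
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)

# Hypothetical group: one RCM driven by four GCMs, one elevation band, winter
temp_bias = [-2.0, -1.5, -1.0, -0.5]    # degC
precip_bias = [10.0, 17.5, 25.0, 32.5]  # percent
slope = bias_slope(temp_bias, precip_bias)

# A single GCM-RCM run gives no slope
single = bias_slope([-1.0], [20.0])
```

With only three to six points per group, such slopes are sensitive to individual runs, so they are best read as rough indicators rather than robust estimates.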

In Sect. 3.2.2, the evolution of biases over time is analysed. For that, we employed a detrending of time series using ordinary least squares. This was done individually for each grid cell.
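A per-grid-cell detrending of this kind can be sketched as follows. The function name `detrend_ols`, the choice to retain the series mean after removing the trend, and the synthetic data cube are assumptions of this example; missing values are not handled.

```python
import numpy as np

def detrend_ols(series, time):
    """Remove the OLS linear trend along axis 0, keeping each cell's mean.

    series: array of shape (n_time, ...), e.g. (n_time, n_lat, n_lon)
    time:   array of shape (n_time,)
    """
    series = np.asarray(series, dtype=float)
    t = np.asarray(time, dtype=float)
    t_c = t - t.mean()                             # centred time axis
    flat = series.reshape(series.shape[0], -1)     # one column per grid cell
    # OLS slope per grid cell: cov(t, y) / var(t)
    slope = (t_c @ (flat - flat.mean(axis=0))) / (t_c @ t_c)
    detrended = flat - np.outer(t_c, slope)
    return detrended.reshape(series.shape), slope.reshape(series.shape[1:])

# Synthetic cube with a trend of 0.03 units per year in every grid cell
years = np.arange(1971, 2023)
cube = 0.03 * (years - years[0])[:, None, None] + np.zeros((years.size, 2, 3))
detrended, slope = detrend_ols(cube, years)
```

Since the time axis is centred before the trend is subtracted, the detrended series keeps the same temporal mean as the input in every grid cell.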

3 Results and discussion

3.1 Intercomparison of observational datasets

The general patterns of the Alpine climate over 1989–2008 were similarly described in all the studied observational datasets for both precipitation (Fig. 2) and temperature (Figure S1). As expected, temperatures decreased with increasing elevation and increasing latitude (Figure S1). Mean winter precipitation was higher north of the main ridge, and precipitation intensity (SDII) was higher south of the main ridge, with precipitation peaks around the Ligurian and Adriatic Sea (Fig. 2).

Fig. 2

Seasonal (winter and summer) mean (pr_mean) and extreme (95th percentile; pr_95pct) precipitation and annual precipitation indices (SDII: simple daily intensity index, RR1: number of wet days) from different data sources. NAT is the union of national datasets, for the other abbreviations refer to Sect. 2.4 and for definition of climatic indices to Sect. 2.5. For temperature, see Figure S1. Values are based on climatological averages over 1989–2008

But while the general pattern agreed across datasets, considerable differences were found in local precipitation values. For instance, a common characteristic of all datasets was the increase in precipitation until 1500 m, followed by a decline at higher elevations (Fig. 3a). But on closer inspection, differences emerged: in E-OBS the mean summer precipitation at 1500 m was 4.0 mm/d, close to the value of 4.1 mm/d observed in WFDEI, but different from 4.6 mm/d in APGD, and 4.9 mm/d in MESAN. Differences between datasets were stronger for extremes, as the 95th percentile of summer precipitation at 1500 m varied from 16.4 mm/d (WFDEI) to 21.8 mm/d (APGD). Precipitation intensity (SDII) peaked in all datasets at about 1000 m and was lower above and below. But again, in absolute terms, maximum SDII differed considerably among datasets: the maximum elevation-averaged SDII of WFDEI was 8.1 mm/d, and increased to 9.4 mm/d for MESAN, 10.1 mm/d for E-OBS, and 10.9 mm/d for APGD. The elevation-averaged number of wet days (RR1) increased up to 500 m and then remained approximately constant, ranging between 122 and 166 d, depending on the dataset.

Differences between datasets over the 1989–2008 climatology were less pronounced for temperature (Fig. 3c). All datasets showed the expected negative relationship between temperature and elevation. Below 2000 m differences between datasets were negligible and the mean difference of the elevation-averaged temperature climatology among the datasets amounted to 0.7 \(^{\circ }\)C. Above 2000 m, MESAN was colder than E-OBS, and WFDEI was warmer than E-OBS. Unexpectedly, above 2000 m, temperatures of WFDEI did not decrease with elevation.

Assuming the national datasets as reference, APGD was most accurate for precipitation (Fig. 3b) and E-OBS for temperature (Fig. 3d). Differences between APGD and national datasets were mainly unbiased with values between \(-11.4\%\) and 3.5% across all precipitation indices and elevation bands, while the other datasets had differences between \(-38.5\) and 56.8% depending on index, elevation, and dataset (Fig. 3b): E-OBS, MESAN, and WFDEI showed over- and underestimation of winter precipitation depending on elevation with respect to NAT, consistent underestimation of mean summer precipitation, increasing underestimation of extreme summer precipitation with elevation, underestimation of intensity (SDII), and overestimation of wet-day frequency (RR1). For temperature, E-OBS was unbiased below 2000 m across all indices, while at higher elevations it showed a slight warm bias (Fig. 3d). MESAN showed a warm bias in summer at elevations between 1000 and approximately 2500 m, while in winter it exhibited a cold bias above 2000 m. WFDEI showed little bias for average winter temperatures (50pct) below 2000 m, but strong warm biases above 1500–2000 m and mixed biases at lower elevations.

Fig. 3

Elevation dependency of precipitation and temperature climatology (1989–2008) across observational datasets. Grey background in each panel indicates the relative distribution of elevation in the analyzed domain, which is different between (a\(+\)c) and (b\(+\)d) because the national datasets (NAT) have only a partial coverage (no scale provided for the grey background, as this is only meant to contextualize data availability by elevation). a Precipitation indices (seasonal mean and 95th percentile, SDII simple daily intensity index, RR1 number of wet days). b Difference in precipitation indices with respect to NAT. c Seasonal percentiles of temperature. d Difference in seasonal temperatures with respect to NAT

This intercomparison showed that the choice of spatial reference dataset is crucial for precipitation, particularly when the area of interest spans a wide range of elevations and a strong orographic diversity. The differences are lower for temperature. This confirms previous findings that uncertainty between precipitation datasets can be of the same order as climate model biases (Prein and Gobiet 2017) and that observational uncertainty is less an issue for temperature (Herrera et al 2020). Concerning the precipitation, the higher station density in APGD makes it superior to E-OBS, when compared to national datasets. The impact of station density on accuracy has been already noted (Isotta et al 2014). In the Alps, the differences in station density between E-OBS and APGD are negligible for Germany and Slovenia, small for Austria, Croatia, France, and Switzerland, but strong for Italy (Bandhauer et al 2022).

A sensitivity analysis performed on different subregions confirms the above findings. All datasets capture the main climatological features regarding north–south and east–west gradients of temperature and precipitation (Figure S2). But again, from a quantitative point of view differences are observed with over- and underestimations of temperature and precipitation indices between datasets, similar to the previous findings for the whole study area. Alternatively, splitting the analysis by national dataset does not contradict the previous findings, except for SLOCLIM (the Slovenian dataset), which did not use a climatology-based interpolation scheme for precipitation compared to the other datasets (Figure S4).

Still, results at higher elevations should be interpreted more cautiously than those at lower elevations. The assumption of the national datasets as best estimate is reasonable, but still suffers from inaccuracies, such as undercatch of winter precipitation (Kochendorfer et al 2022). Additionally, the accuracy of each spatial analysis depends on the station density, especially in complex terrain such as the Alps. In our case, the grid spacing is often smaller than the inter-station distance. Moreover, observations at high elevations are scarcer than in lowlands.

Finally, we did not assess the impact of horizontal scales, which goes beyond the scope of this study. The comparison was made at 0.11\(^{\circ }\) and the higher-resolution datasets were upscaled. WFDEI is the only one with a coarser resolution, which has been interpolated to the higher one of 0.11\(^{\circ }\). As such, WFDEI is expected to underestimate localized large values because an areal integral of precipitation will always smooth out local precipitation peaks.

3.2 Temperature and precipitation biases in raw RCMs

Based on results from the previous section, we used APGD as reference for precipitation and E-OBS for temperature for evaluating raw RCMs over the entire Alpine domain. Overall, for the period 1971–2000, RCMs showed a wet (Figs. 4, 5b) and cold (Figures S5 and 5c) bias. The driving GCM is the dominant factor for the large-scale bias, while local and elevational patterns are more influenced by RCMs (Figs. 4 and S5). In addition, biases showed a strong dependence on elevation (Fig. 5).

Fig. 4

Differences in summer mean precipitation for 1971–2000 between a GCM-RCM combination and APGD. Values above 200% are capped at 200% for display

The elevation dependence of biases in precipitation indices is shown in Fig. 5b. Precipitation intensity on wet days (SDII) was captured well by the models, albeit with increasing uncertainty at higher elevation. The average bias in SDII was \(-4\%\). On the other hand, the wet day frequency (RR1) was consistently overestimated by all models and the bias increased with elevation, from 15% at 100 m to 51% at 3000 m (model-ensemble-means). Summer precipitation means and extremes were also well captured across elevation, again with increasing model uncertainty at higher elevations. Model-ensemble-means of biases were between \(-7\) and 27% for means and between \(-21\) and 14% for extremes (95th percentile), respectively. Winter precipitation was consistently higher in RCMs than in APGD, and differences in mean winter precipitation increased with elevation from \(+8\%\) to over \(+100\%\) (model-ensemble-means).

Biases in temperature indices with respect to elevation are shown in Fig. 5c. The cold bias intensified with decreasing temperatures, comparing summer versus winter and across elevation. Negligible biases were found below 1000 m for average summer and winter temperatures: model-ensemble-means of elevation-averaged biases were between 0 and \(-1^{\circ }\)C. However, cold biases intensified with elevation up to \(-3.8\) and \(-5.4\,^{\circ }\)C in summer and winter, respectively. Cold winter extremes (5th percentile) were most prone to biases and reached up to \(-20\,^{\circ }\)C.

As a supplemental analysis, precipitation and temperature biases for spring and fall are shown in Figure S6. The patterns with elevation and magnitude of precipitation biases in spring are more similar to winter, and in fall more similar to summer. On the other hand, temperature biases in spring and fall reflect more a transition period and lie between winter and summer biases. Finally, temperature indices based on minimum and maximum temperatures are shown in Figure S7. Indices based on maximum temperatures (frost days and summer days) are less biased than those based on minimum temperatures (ice days and tropical nights). Regarding frost days (FD), the model ensemble mean is largely unbiased below 1000 m, but shows an overestimation (cold bias) above 1000 m. On the other hand, biases in ice days (ID) increase consistently with elevation above 500 m.

The above results are consistent with previous assessments of RCMs in Europe and in the European Alps by Smiatek et al (2016), Vautard et al (2021), and Kotlarski et al (2014), which also found a predominant cold and wet bias. In terms of bias magnitude (Tables S1 and S2), we find biases for domain averages similar to Kotlarski et al (2014), in the range of 1 \(^{\circ }\)C for seasonal temperature and between 0 and 20% for precipitation. Only winter precipitation biases are higher, with a model ensemble average of 35% compared to approximately 20% in Kotlarski et al (2014). Compared to Smiatek et al (2016), we find consistently lower biases in all seasonal temperature and precipitation values, with differences in the order of 0.5 \(^{\circ }\)C and 5–15%, respectively. When interpreting differences between this and previous studies, it should be taken into account that the model ensemble and evaluation reference datasets are different. In particular, Kotlarski et al (2014) evaluated the reanalysis-driven EURO-CORDEX while in the present work we employ the GCM-driven ensemble.

Furthermore, since biases often increased with elevation, domain-averaged biases can be misleading. Compared to spatial averages, local biases are often smaller at low elevations and larger at high elevations (Tables S1 and S2). Finally, wintertime precipitation biases should be treated with caution, since gauge undercatch can be significant: correction factors, such as those used in northern Europe, can be up to 85% (Førland and Hanssen-Bauer 2000).
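The contrast between domain-averaged and elevation-resolved biases can be illustrated with a minimal Python sketch. It is not the processing chain used in this study: the function name, the synthetic grid cells, the assumed bias gradient of \(-1\,^{\circ }\)C per 1000 m, and the bin edges are all hypothetical and serve only to show how a single domain average hides an elevation gradient.

```python
import numpy as np

def bias_by_elevation(model, obs, elevation, edges):
    """Mean bias (model - obs) per elevation bin, plus the domain average.

    model, obs, elevation: 1-D arrays with one value per grid cell.
    edges: increasing bin edges in metres.
    """
    bias = np.asarray(model, dtype=float) - np.asarray(obs, dtype=float)
    # np.digitize returns i such that edges[i-1] <= x < edges[i]
    idx = np.digitize(elevation, edges) - 1
    per_bin = np.full(len(edges) - 1, np.nan)
    for i in range(len(edges) - 1):
        in_bin = idx == i
        if in_bin.any():
            per_bin[i] = bias[in_bin].mean()
    return per_bin, bias.mean()

# Synthetic grid cells (hypothetical values, not data from this study)
elevation = np.array([200.0, 400.0, 800.0, 1200.0, 1800.0, 2200.0, 2800.0, 3000.0])
obs = 15.0 - 0.0065 * elevation           # idealized temperature lapse rate
model = obs - 0.001 * elevation           # assumed cold bias growing with elevation
edges = np.array([0, 500, 1000, 1500, 2000, 2500, 4000])

per_bin, domain_mean = bias_by_elevation(model, obs, elevation, edges)
print(per_bin, domain_mean)
```

In this synthetic case the domain average lies between the bias of the lowest and the highest bin, so it overstates the bias at low elevations and understates it at high elevations.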

Fig. 5

a Precipitation indices in APGD and GCM-RCMs. b Differences between each GCM-RCM and APGD. c Differences in temperature indices between each GCM-RCM combination and E-OBS. Each line is one model combination. Values refer to the 1971–2000 climatology

3.2.1 Bi-variate dependency

Next, we assessed the relationship between temperature and precipitation biases over 1971–2000 by studying the dependency between seasonal mean precipitation and temperature biases across elevation for all RCMs selected in the present study (Fig. 6). The RCM ensemble exhibited a dependency between temperature and precipitation biases. The relationship was mostly positive in winter and negative in summer. For all RCMs that had multiple runs driven by different GCMs, we evaluated the slope of the linear relationship, expressed in percentage points (pp) of precipitation bias per degree Celsius of temperature bias. The median slope over all RCMs in winter was 15 pp/\(^{\circ }\)C (inter-quartile range, IQR, over all models and elevation bins: 7, 22). In summer, the median slope was \(-10 (-17, -4)\) pp/\(^{\circ }\)C. The only exception was WRF381P, for which the relationship in winter was negative.
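As an illustration of how such slopes can be summarized, the following Python sketch fits an ordinary least-squares line per RCM and reports the median and inter-quartile range of the slopes. The RCM names and bias values are invented for the example, and the use of `np.polyfit` is an assumption about a reasonable implementation rather than a description of the code used in this study.

```python
import numpy as np

def bias_slope(temp_bias, precip_bias):
    """Least-squares slope of precipitation bias (pp) on temperature bias (degC)."""
    slope, _intercept = np.polyfit(temp_bias, precip_bias, deg=1)
    return slope

# Hypothetical biases for one season and elevation bin:
# one entry per driving GCM, keyed by a made-up RCM name
runs = {
    "RCM_A": ([-2.0, -1.0, 0.0], [10.0, 25.0, 40.0]),
    "RCM_B": ([-3.0, -2.0, -1.0], [20.0, 27.0, 34.0]),
    "RCM_C": ([-1.5, -0.5, 0.5], [5.0, 27.0, 49.0]),
}

slopes = np.array([bias_slope(t, p) for t, p in runs.values()])
median = np.median(slopes)
q25, q75 = np.percentile(slopes, [25, 75])
print(f"median slope: {median:.1f} pp/degC (IQR: {q25:.1f}, {q75:.1f})")
```

A slope can only be estimated for RCMs with at least two GCM-driven runs, which is why the evaluation is restricted to RCMs with multiple runs.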

The strength of the relationship varied by RCM and intensified with elevation in nearly all cases, especially in winter (blue lines in Fig. 6): the median slope increased from 7 (3, 11) pp/\(^{\circ }\)C for the lowest elevation bin (0–500 m) to 21 (14, 28) pp/\(^{\circ }\)C for the highest elevation bin (\(> 2500\) m). The positive relationship found in winter implies that warmer and wetter conditions are associated with each other, but since most RCMs simultaneously have a cold and a wet bias, this relationship reduces one bias at the cost of the other (e.g., a smaller cold bias implies a larger positive precipitation bias). For summer, the opposite is true: the relationship is negative (warmer associated with drier), and thus cold and wet biases are reduced simultaneously.

Fig. 6

Dependency between seasonal mean precipitation and temperature biases across elevation (columns) and season (rows, DJF: winter, JJA: summer) over the period 1971–2000. Colours denote different RCMs, where multiple points imply different GCMs (GCMs are not further distinguished). Lines are linear regression estimates and only intended to guide the visualization of correlation and strength of relationship

3.2.2 Time evolution of biases

Besides the bi-variate dependency, we further assessed the temporal evolution of biases for temperature over 1971–2022 (Fig. 7) and precipitation over 1971–2008 (Figure S8). Biases increased with time for summer temperatures and winter precipitation because of a mismatch in trends between observations and models. In the case of summer temperature, observations showed a stronger increase than RCMs over the past 50 years, which led to a widening discrepancy with time (Fig. 7a, b). This warming discrepancy has been attributed to stationary aerosol forcing in the RCMs (Schumacher et al 2023). After linear trends were removed for each grid cell, the remaining biases were constant in time (Fig. 7c). For winter precipitation, observations showed a decrease in precipitation, while RCMs showed no change, again leading to a widening discrepancy with time. Similarly, after removing the linear trends, precipitation biases were also constant (Figure S8).
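The effect of trend removal on 20-year biases can be reproduced with a short Python sketch on synthetic series. It operates on a single time series rather than on each grid cell, keeps the series mean when detrending (an assumption of this sketch), and uses made-up trend values; it illustrates the principle and is not the implementation used for Fig. 7.

```python
import numpy as np

def detrend_linear(years, values):
    """Remove the least-squares linear trend while keeping the series mean."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    slope, _intercept = np.polyfit(years, values, deg=1)
    return values - slope * (years - years.mean())

def rolling_bias(model, obs, window=20):
    """Bias (model - obs) of window-length means, one value per start year."""
    kernel = np.ones(window) / window
    model_mean = np.convolve(model, kernel, mode="valid")
    obs_mean = np.convolve(obs, kernel, mode="valid")
    return model_mean - obs_mean

years = np.arange(1971, 2023)
obs = 15.0 + 0.05 * (years - 1971)     # hypothetical: observations warm faster
model = 14.0 + 0.02 * (years - 1971)   # hypothetical: model warms more slowly

raw = rolling_bias(model, obs)
detrended = rolling_bias(detrend_linear(years, model), detrend_linear(years, obs))
print(raw[0], raw[-1])              # bias becomes more negative with time
print(detrended[0], detrended[-1])  # bias is constant after trend removal
```

With mismatched trends the raw 20-year bias drifts over time, whereas the detrended series differ only by a constant offset.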

The resulting temporal stability of biases after trend removal is an important property of the modified time series, given that most bias adjustment and downscaling algorithms assume stationarity. For a period of 50 years in the Alps, we can confirm this stability for temperature, and for a period of 38 years also for precipitation. However, the temporal stability of biases is obtained only after removing trends from both observations and models. Consequently, trend removal is a crucial step in statistical bias adjustment techniques, since otherwise, temporal variability might be wrongly matched between observations and models (Lange 2019).

This confirmation of bias stationarity is nonetheless only a rough estimation, since the spatial observation datasets are not fully suited for trend analysis due to their changing network over time. Further confirmation could be obtained by using homogenized spatial datasets (Isotta et al 2019; Spinoni et al 2015). A more detailed assessment of bias stationarity is beyond the scope of this study, but worthwhile for future analyses.

Fig. 7

a Domain average trends of seasonal temperatures in models and observations over 1971–2022. Lines are linear regression fits (grey: single GCM-RCM, black: model ensemble, red: E-OBS) b Time evolution of 20-year biases. c Time evolution of 20-year biases, but time series were linearly detrended prior to calculating 20-year biases. See Figure S8 for precipitation

3.3 Impact of bias adjustment

This section deals with the impact of different bias adjustment methods and datasets as available from CORDEX-Adjust. In general, bias adjustment reduced biases over 1971–2000 as expected, but in our case the impact of the reference dataset used for bias adjustment (Fig. 9) was stronger than that of the method used (Fig. 8).

The three analyzed bias adjustment methods (QMAP, DBS, and CDFt) produced similar results for both temperature and precipitation indices across all elevations (Fig. 8). DBS and CDFt produced nearly identical results in all cases, while after QMAP some biases remained, especially for seasonal mean and extreme precipitation (indices: pr_mean and pr_95pct) and for warm temperature extremes (tas_95pct) in winter. For precipitation intensity (SDII) and wet-day frequency (RR1) and the other temperature indices (tas_5pct, tas_50pct), QMAP was nearly identical to DBS and CDFt. Note that the remaining bias for QMAP was still much below initial model biases.

The reference dataset used in the bias adjustment had a stronger impact, and the differences and uncertainty increased with elevation (Fig. 9). The remaining biases (compared to references APGD and E-OBSv26) diverged above 1500 m for winter precipitation and above 1000 m for temperature indices. This divergence was caused by the differences between the observational datasets used in the bias adjustment (see also Sect. 3.1). For precipitation, in particular, the difference between the RCMs bias adjusted with MESAN and APGD is basically identical to the difference between the two observational datasets (i.e., APGD and MESAN, Figure S9), and thus it does not depend on the particular RCM. We hypothesize that the same would apply to models adjusted with E-OBS10 and WFDEI; however, we cannot test this comprehensively, because E-OBS10 is no longer available, and for WFDEI the bias adjustment procedure involved spatial and temporal moving windows, which cannot be reproduced without knowledge of the exact spatial interpolation schemes and grid specifications.
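The inheritance of reference differences can be illustrated with a generic empirical quantile mapping in Python. This is a textbook-style sketch, not the QMAP, DBS, or CDFt implementations used in CORDEX-Adjust; the function name, the gamma-distributed synthetic data, and the second reference being 20% wetter than the first are all assumptions made for the example.

```python
import numpy as np

def quantile_map(model, reference, n_quantiles=99):
    """Empirical quantile mapping of `model` onto the distribution of `reference`.

    Values outside the 1st to 99th percentile range of `model` are mapped
    to the corresponding end quantile of `reference` (no extrapolation).
    """
    probs = np.linspace(0.01, 0.99, n_quantiles)
    model_q = np.quantile(model, probs)
    ref_q = np.quantile(reference, probs)
    # Map each model value through the quantile-quantile relation
    return np.interp(model, model_q, ref_q)

rng = np.random.default_rng(0)
model = rng.gamma(2.0, 3.0, 5000)   # hypothetical raw model precipitation
ref_a = rng.gamma(2.0, 2.0, 5000)   # hypothetical reference dataset A
ref_b = 1.2 * ref_a                 # hypothetical reference B, 20% wetter than A

adj_a = quantile_map(model, ref_a)
adj_b = quantile_map(model, ref_b)

# The ratio between the adjusted series equals the ratio between the references
print(adj_b.mean() / adj_a.mean())
```

Because the mapping targets the reference distribution, any systematic difference between two references reappears in the adjusted output regardless of the model being adjusted.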

These results highlight the importance of the reference dataset used for bias adjustment. Inaccuracies in the reference datasets will be inherited by the bias adjustment methods, and in the case of the European Alps, we found large discrepancies between observational datasets (Sect. 3.1) and, consequently, also large differences after bias adjustment (Figs. 8, 9). This inheritance of inaccuracies can in some cases lead to larger biases after applying the bias adjustment procedure, especially if initial model biases were already small (Figures S10 to S13).

The presented analysis did not assess the impact of bias adjustment on future trends. Some methods, like QMAP, have been shown to modify trends (Maraun 2013). However, this is especially true when a simultaneous downscaling is performed by using QMAP. Within the CORDEX-Adjust framework, the spatial scale is the same, and thus QMAP has only been used for bias adjustment and not for downscaling. In this respect, we envision a future study that may assess possible impacts of bias adjustment on future trends.

Fig. 8

Impact of different bias adjustment methods on a precipitation indices and b temperature indices over 1971–2000. Each line is a GCM-RCM combination. Grey dashed lines are raw (uncorrected) models, while solid colored lines are bias adjusted using different methods (see legend)

Fig. 9

Impact of different bias adjustment datasets on a precipitation indices and b temperature indices over 1971–2000. Each line is a GCM-RCM combination. Grey dashed lines are raw (uncorrected) models, while solid colored lines are bias adjusted using different observational datasets (see legend)

4 Conclusions

Biases in the large GCM-driven EURO-CORDEX ensemble have been assessed for the European Alps as a function of elevation. This evaluation adds novel results to previous studies, which focused on spatial averages. We found that biases of GCM-driven RCMs over 1971–2000 in temperature, seasonal precipitation, and wet-day frequency generally increased with elevation. Biases in annual precipitation intensity were constant across elevation. Consequently, spatial averages of biases in the Alps are overestimated for lower elevations and underestimated for higher elevations (Tables S1 and S2). It remains to be seen whether the same applies to other mountain regions in Europe and beyond.

Besides the elevation dependency, the temperature biases were also more negative in winter than in summer (Table S1), and spring and fall biases lay between winter and summer (Figure S6). Altogether, this implies a form of intensity-dependent bias, where negative biases increase both in colder seasons and with elevation. One consequence is that climate change signals could benefit from the trend modification for intensity-dependent errors performed by bias adjustment techniques such as quantile mapping, as shown in Gobiet et al (2015). Even so, whether statistical bias adjustment methods should modify trends in climate change signals remains controversial.

Furthermore, we found a dependency between temperature and precipitation biases that differed by season and a temporal dependency of biases, as short-term trends (38 years for precipitation and 50 years for temperature) diverged between observations and models. This stresses the importance of detrending the time series prior to any bias adjustment in order to avoid a mismatch in temporal variability, which is driven by temporal trends.

Impact models often require unbiased data, which leads to applying some form of bias adjustment of climate models. A fundamental choice for bias adjustment is the selection of a suitable reference observational dataset. Results from the observation intercomparison (Sect. 3.1) and CORDEX-Adjust evaluation (Sect. 3.3) highlighted that this is a crucial step and more important than the choice of bias adjustment method.