1 Introduction

Near-term climate predictions are vital to both land (Ely et al. 2013; Palin et al. 2016; Clark et al. 2017; Turco et al. 2017; Bell et al. 2017; Ceglar et al. 2018) and marine sector applications (Stock et al. 2015; Hobday et al. 2016; Mills et al. 2017; Payne et al. 2017; Scott et al. 2021). Traditionally, however, there has been less work on understanding and predicting seasonal variability of shelf sea environments, where shallow bathymetry creates a dynamic physical environment subject to high variability via energetic mixing processes as well as the influence of atmospheric variability (Otto et al. 1990; Huthnance 1991; Becker and Pauly 1996; Dippner 1997; Pätsch et al. 2017). This knowledge gap is especially pertinent given the disproportionately high importance of shelf seas to humans and ecosystems. For example, despite occupying just 7% of global ocean area they support over 90% of the world’s fisheries (Pauly et al. 2002; Kröger et al. 2018).

We identify the European North-West shelf seas (hereafter NWS; area within the 200 m isobath marked in Fig. 1) as a prime region of interest. The operations of many NWS industries are sensitive to variability of physical fields, especially temperatures, and could benefit from accurate forecasts months ahead. Such industries include fisheries, renewable and non-renewable energy, transport and recreation (Payne et al. 2017, 2019). Interannual sea surface temperature (SST) variability can be on the order of degrees Celsius in large areas of the NWS (Fig. 1; Tinker et al. 2018) and seasonal climatologies will therefore not suffice for making predictions of upcoming conditions. Whilst 6-day operational forecasts (Tonani et al. 2019) and uninitialized end-of-century climate projections (Tinker et al. 2016) exist for the physical marine environment of the NWS, no operational NWS seasonal product currently exists. Tinker et al. (2018) and Tinker and Hermanson (2021) began to explore the prospect for NWS seasonal forecasting and suggest there may be potential predictability of temperatures on the NWS in boreal winter (DJF). There is likely also scope for predictability in other seasons which have not yet been explored, for example boreal summer (JJA), albeit limited by internal predictability limits as well as ocean or atmospheric modelling deficiencies.

Fig. 1

Climatologies and variability of the European North-West shelf seas. CMEMS-v5 SST climatologies (a, b) and standard deviations (c, d) across the period 1993–2021 in winter (DJF; panel column a, c) and summer (JJA; column b, d). Black line marks the 200 m isobath. Mixing fronts are shown by white dashed contours in b) and d), defined as where SST minus near-bottom temperature is equal to 0.5 °C. Key NWS regions are marked by text annotations in a)

The shallow NWS environment has lower thermal inertia than the deep open ocean and is primarily driven by variability in the atmosphere (Becker and Pauly 1996; Dippner 1997; Sharples et al. 2006). For example, from a simple box model, Holt et al. (2010) suggested that surface fluxes could be expected to be twice as important as lateral heat transport for determining the heat content. However, advection is likely to play a more substantial role around the shelf break. Here, whilst sub-mesoscale processes may occur, particularly sub-surface, on-shelf transport in the surface layer is primarily influenced by wind-driven Ekman transport (e.g. Huthnance et al. 2009; Graham et al. 2018).

North Atlantic – European domain-scale atmospheric circulation patterns, most notably the North Atlantic Oscillation (NAO), are well-known determinants of seasonal surface climate over Europe (Hurrell 1995; Hurrell and Deser 2009). Capturing changes in the dominant circulation patterns in operational seasonal forecasting systems remains a challenge (e.g. Kim et al. 2012), with the recent exception of winter (DJF) surface NAO (Riddle et al. 2013; Scaife et al. 2014; Palin et al. 2016; Clark et al. 2017; Athanasiadis et al. 2017; Baker et al. 2018; Thornton et al. 2023). The summer NAO is also known to impact European surface climate (Folland et al. 2009; Bladé et al. 2012) but is found to have a much weaker predictable signal in dynamical forecasting systems (Patterson et al. 2022; Dunstone et al. 2023). Additional circulation patterns include the East Atlantic pattern (EA), East Atlantic – Western Russia pattern (EAWR) and Scandinavian pattern (SCAND), which are also known to impact European surface climate (Barnston and Livezey 1987; Bueh and Nakamura 2007; Woollings et al. 2010; Lim 2015; Wang and Tan 2020). Recent studies suggest that some of these lower-order modes may be predictable at certain times of year, for example in summer (Lledó et al. 2020) and specifically EA in late autumn – early winter (Thornton et al. 2023), but the literature has thus far centred predominantly on NAO as the leading predictable mode of variability in winter.

Here, we quantify the skill of a large ensemble global ocean-atmosphere coupled dynamical seasonal forecasting system, namely the UK Met Office Global Seasonal Forecasting System (GloSea), in making predictions of the key end-user relevant NWS variable, sea surface temperature (SST), using retrospective forecasts (hereafter hindcasts) across 1993–2016. We extend existing NWS seasonal forecasting work (Tinker et al. 2018; Tinker and Hermanson 2021) to consider both winter (DJF) and summer (JJA) seasons, when the dynamics of the marine environment are markedly different, and focus on the impact of atmospheric variability and the inherent predictability offered by the dynamics of the shelf sea waters themselves.

In this paper we directly assess the seasonal predictability of NWS SST. To do so, we first characterise the influence of atmospheric circulation patterns on NWS SST (Sect. 3.1), and their predictability (Sect. 3.2), before presenting the current skill levels of GloSea and persistence forecasts of NWS SST (Sect. 3.3) and tests of upper limits to NWS SST skill with the use of idealised atmospheric fields (Sect. 3.4 and 3.5). In Sect. 4, we discuss the implications of our findings and consider the potential for improved predictability on the NWS. We describe our data and methodology in Sect. 2.

2 Data and methodology

2.1 Persistence hindcasts

Persistence analysis is employed here for two purposes. First, it serves as a skill benchmark which the more sophisticated GloSea system should aim to beat. Second, persistence will be a scientific tool for quantifying the inherent memory of waters (thermal inertia) as a potential source of predictability on the NWS.

Here, persistence hindcasts are empirical models built using reanalysis data of the physical marine environment of the NWS. The data is sourced from version 5 of the regional physical reanalysis of the NWS produced by the UK Met Office and distributed by the Copernicus Marine Environment Monitoring Service (hereafter CMEMS-v5; Renshaw et al. 2021). The reanalysis is based on the Forecasting Assimilation Model 7 km Atlantic Margin model (FOAM AMM7) using the NEMO version 3.6 ocean model (Madec and the NEMO team 2016). The model is run for the domain 20°W – 13°E, 40°N – 65°N at ~ 7 km horizontal resolution, across the period 1993 – near-present. The original model simulation has 51 terrain-following vertical levels and data is distributed on 24 vertical geopotential levels. CMEMS-v5 has been extensively validated (Renshaw et al. 2021) and forms a ‘best estimate’ of the observational truth on the shelf. At ~ 7 km horizontal resolution, CMEMS-v5 includes dynamic tides and resolves key shelf sea processes such as seasonal stratification.

The process for building a timeseries of persistence hindcasts at 2–4 months lead time from CMEMS-v5 data per target season is as follows: SST anomaly data are averaged over a 15-day period centred on the first day of the month preceding the target season, per hindcast year. For example, a winter (DJF) predictor field is formulated from the 15-day average centred on the November 1st immediately preceding that winter. The hindcast prediction is then taken as persistence of this fixed field through the winter. The 15-day averaging period is selected to reflect the span of start dates and information available for a typical GloSea hindcast set-up (as discussed in Sect. 2.2). The hindcast data are constrained to winter and summer of 1993–2016 to match the length of the GloSea hindcast period.
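The predictor construction described above can be sketched as follows. This is a minimal illustration rather than the processing code used for CMEMS-v5; the function names, the dictionary of daily anomalies and the convention that `target_year` is the year in which the season starts are all assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import mean


def predictor_window(target_year, first_month, half_width_days=7):
    """Return the 15 dates centred on the 1st of the month before the season.

    target_year is the year in which the season starts (assumed convention);
    first_month is 12 for DJF and 6 for JJA.
    """
    if first_month > 1:
        centre = date(target_year, first_month - 1, 1)
    else:
        centre = date(target_year - 1, 12, 1)
    offsets = range(-half_width_days, half_width_days + 1)
    return [centre + timedelta(days=d) for d in offsets]


def persistence_hindcast(daily_anomaly, target_year, first_month):
    """Average the SST anomaly over the window; this fixed value is then
    persisted unchanged through the whole target season."""
    window = predictor_window(target_year, first_month)
    return mean(daily_anomaly[d] for d in window)
```

For a winter starting in December 1993, the window runs from 25 October to 8 November 1993, centred on 1 November. In practice the same operation would be applied at every grid point of the anomaly field.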

2.2 GloSea large ensemble hindcasts

We use operational hindcast data from versions 5 and 6 of the UK Met Office Global Seasonal Forecasting System (GloSea; MacLachlan et al. 2015). Both versions use the HadGEM3 ocean-atmosphere-land-sea-ice coupled climate model. GloSea6 uses HadGEM3-GC3 (Williams et al. 2018) whereas GloSea5 uses HadGEM3-GC2 (Williams et al. 2015) but both versions show similar skill. The model is initialised with observational analyses of the atmosphere, land, ocean and sea ice. The global ocean component of the model (NEMO; Megann et al. 2014) is run on the ORCA025 grid, i.e. a 0.25° horizontal resolution tri-polar grid with 75 vertical levels, 18 (24) of which are in the top 50 m (100 m) of the water column. The model’s atmospheric component is run at N216 resolution (approximately 0.83° in longitude and 0.56° in latitude) with 85 vertical levels. Operationally, GloSea produces real-time forecasts as well as hindcasts for the period 1993–2016, which we use here for skill assessment. GloSea employs a lagged ensemble technique; the hindcasts are generated on four calendar dates per month (the 1st, 9th, 17th and 25th) and are integrated forward for the next six simulated months. For each start date, seven individual members (differing only by stochastic perturbation of atmospheric initial conditions; see Bowler et al. 2009) are initialised. All initialisations use only conditions which would have been available at the time.

To obtain an ensemble of hindcast simulations at 2–4 months lead time per target season, we combine the 21 members from start dates centred on the 1st of the month preceding the target season. For example, a winter (DJF) hindcast ensemble combines members initialised on October 25th, November 1st, November 9th (3 start dates × 7 members) per hindcast year. Multiple hindcasts have been run and we are able to combine all available data from five hindcasts to form a larger ensemble of 105 members (5 × 21-member ensembles) per year. We take this step to provide a hindcast ensemble for skill assessment which best represents the real-time forecasting set up, which is soon due to increase from a 42- to 100-member ensemble. Using such a large ensemble has the notable advantage of being able to better extract predictable signals in the extra-tropics, which are thought to be underrepresented in models (Eade et al. 2014; Scaife and Smith 2018).
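The ensemble arithmetic above (5 hindcast sets × 3 start dates × 7 members) can be made explicit with a short sketch. The labelling scheme and the function name are hypothetical and serve only to show how the 105 members per year are counted.

```python
from itertools import product


def lagged_ensemble(start_dates, members_per_start=7, n_hindcast_sets=5):
    """Label every member as (hindcast_set, start_date, member_index)."""
    return list(product(range(n_hindcast_sets), start_dates,
                        range(members_per_start)))


# Winter (DJF) example from the text: three start dates around 1 November.
djf_members = lagged_ensemble(["10-25", "11-01", "11-09"])
print(len(djf_members))  # 105
```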

The linear trend across the hindcast period is preserved in all analyses (except when defining atmospheric circulation patterns, see Sect. 2.3) to match real-time forecasting procedures, which are capable of capturing skill originating from long-term trends.

2.3 Atmospheric circulation identification

Atmospheric circulations over the North Atlantic – European domain likely impact NWS SST variability due to the fast response of shallow seas to atmospheric forcing (Becker and Pauly 1996; Dippner 1997; Sharples et al. 2006). Here, we investigate dominant circulation patterns in winter and summer as identified by the three leading modes of atmospheric variability. As the circulations will have different centres of action which change across seasons, a fixed point (or box) difference definition for each mode is not suitable for our investigation. We employ an empirical orthogonal function (EOF; Hannachi et al. 2007; Wilks 2011) analysis of ERA5 (Hersbach et al. 2020) geopotential height anomalies at 500 hPa (hereafter Z500), temporally detrended and weighted by the square root of the cosine of the latitude (North et al. 1982) across the domain 90°W–40°E and 20°–80°N (as in Hall and Hanna 2018), to define the three leading atmospheric circulation patterns per season. ERA5 Z500 fields are projected onto the EOFs to generate observation-based principal component timeseries per circulation pattern. EOFs are calculated using the xeofs Python package (Rieger and Levang 2024) across the period 1979–2021. Z500 EOFs are denoted by EOFZ500 subscript notation.
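The latitude weighting and the projection step can be illustrated with a small sketch. It does not reproduce the xeofs workflow used here; it assumes that the loading pattern is expressed in the weighted space and that the principal component is obtained by a normalised dot product, and the function names are hypothetical.

```python
import math


def sqrt_cos_lat_weights(lats_deg):
    """Area weights: square root of the cosine of latitude, one per row."""
    return [math.sqrt(math.cos(math.radians(lat))) for lat in lats_deg]


def project_onto_eof(anomaly, eof, lats_deg):
    """Project a (lat x lon) Z500 anomaly field onto a loading pattern.

    The anomaly is weighted row by row, then projected onto the EOF
    (assumed to be defined in the weighted space) and normalised by the
    squared norm of the EOF. Returns one principal component value.
    """
    weights = sqrt_cos_lat_weights(lats_deg)
    numerator = 0.0
    norm = 0.0
    for w, anomaly_row, eof_row in zip(weights, anomaly, eof):
        for x, e in zip(anomaly_row, eof_row):
            numerator += (w * x) * e
            norm += e * e
    return numerator / norm
```

Applying `project_onto_eof` to each seasonal-mean field in turn yields a principal component timeseries for that pattern, whether the fields come from the reanalysis or from an individual model member.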

The loading pattern of each ERA5-derived atmospheric circulation is presented in Fig. 2. In both seasons, EOF1Z500 closely resembles the North Atlantic Oscillation (NAO) and EOF2Z500 resembles the East Atlantic pattern (EA). EOF3Z500 most closely resembles the East Atlantic – Western Russia pattern in the winter (EAWR; though given the domain constraints it may be missing the eastern low pressure node) and the Scandinavian pattern (SCAND) in the summer. GloSea full ensemble Z500 fields from the same domain are projected onto the ERA5-derived loading patterns to get GloSea principal component predictions for the three atmospheric circulation patterns per season and for each individual member.

Fig. 2

Leading modes of North Atlantic – European atmospheric variability. The three leading modes of atmospheric variability (EOFZ500 1–3; panel columns) derived from ERA5 Z500 anomalies, across the 1979–2021 reanalysis period, in winter (DJF; row a–c) and summer (JJA; row d–f). Percentage of total variance explained by each mode is marked in square brackets. Loading patterns are expressed as linear regressions on the standardised principal components

2.4 Hindcast skill measures

Hindcast skill is assessed by validation against available reanalyses (hereafter observation-based products). We focus predominantly on the Pearson Anomaly Correlation Coefficient skill measure (ACC; Jolliffe and Stephenson 2011), i.e.

$$ACC = \frac{\sum_{i=1}^{N}(h_{i}-\bar{h})(v_{i}-\bar{v})}{\sqrt{\sum_{i=1}^{N}(h_{i}-\bar{h})^{2}\,\sum_{i=1}^{N}(v_{i}-\bar{v})^{2}}}$$
(1)

where N is the number of years in the hindcast period, \(h_{i}\) and \(v_{i}\) are the hindcast and verification fields, respectively, and overbars denote timeseries means. For validation of winter and summer NWS SST skill, \(v\) is taken as CMEMS-v5 SST seasonal mean fields. Note that for GloSea validation, NWS SST data on the ORCA025 tri-polar grid are re-gridded to CMEMS-v5 specification by nearest-neighbour interpolation. For atmospheric circulation validation, \(v\) is taken as ERA5-derived winter and summer principal components. ACC is a useful measure of the extent to which the simulated phase of variability matches observations, where a score of 1 implies perfect association between simulation and observation-based estimate. Significance (p value) is calculated using a positive one-tailed t test as we are testing for positive correlation only.
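As an illustration, the ACC can be computed in its standard Pearson form together with the test statistic on which a one-tailed test would be based. The sketch assumes the usual t statistic for a correlation coefficient with N − 2 degrees of freedom; the function names are hypothetical and the p value itself is not evaluated here.

```python
import math


def acc(hindcast, verification):
    """Pearson anomaly correlation between two equal-length timeseries."""
    n = len(hindcast)
    h_bar = sum(hindcast) / n
    v_bar = sum(verification) / n
    cov = sum((h - h_bar) * (v - v_bar)
              for h, v in zip(hindcast, verification))
    var_h = sum((h - h_bar) ** 2 for h in hindcast)
    var_v = sum((v - v_bar) ** 2 for v in verification)
    return cov / math.sqrt(var_h * var_v)


def t_statistic(r, n):
    """t statistic for a correlation r from n years (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

For r = 0.57 and N = 24, `t_statistic` gives t ≈ 3.25; the one-tailed p value would then follow from the upper tail of a Student's t distribution with 22 degrees of freedom.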

We assess the significance of the difference between GloSea and persistence hindcast ACC skill by bootstrapping. That is, we randomly select 105 members (with replacement) per year, calculate the ACC score of the ensemble mean and build a distribution of ACC scores by repeating this process 1000 times. GloSea ensemble mean ACC is deemed significantly different from persistence where persistence lies outside the 97.5th and 2.5th percentiles of the randomly selected GloSea ensembles distribution.
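The bootstrap described above can be sketched for a single grid point as follows. The data layout (one list of member values per year), the nearest-rank percentile and the fixed random seed are assumptions made for the example, and the function names are hypothetical.

```python
import math
import random


def acc(hindcast, verification):
    """Pearson anomaly correlation between two equal-length timeseries."""
    n = len(hindcast)
    h_bar = sum(hindcast) / n
    v_bar = sum(verification) / n
    cov = sum((h - h_bar) * (v - v_bar)
              for h, v in zip(hindcast, verification))
    var_h = sum((h - h_bar) ** 2 for h in hindcast)
    var_v = sum((v - v_bar) ** 2 for v in verification)
    return cov / math.sqrt(var_h * var_v)


def bootstrap_acc(members_by_year, obs, n_draws=1000, seed=0):
    """Resample members with replacement per year and score the ensemble mean.

    members_by_year[y] holds all member values for year y (105 here).
    Returns the sorted list of n_draws ACC scores.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        means = []
        for year_members in members_by_year:
            sample = rng.choices(year_members, k=len(year_members))
            means.append(sum(sample) / len(sample))
        scores.append(acc(means, obs))
    return sorted(scores)


def percentile_bounds(sorted_scores, lower=2.5, upper=97.5):
    """Nearest-rank percentiles of an already sorted list of scores."""
    last = len(sorted_scores) - 1
    return (sorted_scores[round(lower / 100 * last)],
            sorted_scores[round(upper / 100 * last)])


def differs_from_persistence(persistence_acc, sorted_scores):
    """True where persistence ACC lies outside the 2.5th-97.5th percentile range."""
    low, high = percentile_bounds(sorted_scores)
    return persistence_acc < low or persistence_acc > high
```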

A limitation of the ACC statistic is its insensitivity to errors in magnitude, which can mask poor skill associated with the strength of signals arising from unrealistic variability in the model ensemble. In some cases, particularly in the northern extra-tropics in winter, it is possible for the ACC skill of the signal extracted from the mean of a large ensemble of model simulations to be high but for the ensemble mean variance to be too small (Eade et al. 2014; Scaife et al. 2014; Scaife and Smith 2018; Smith et al. 2020). In such cases, there is a relatively high proportion of noise in the ensemble and therefore unrealistic disagreement between individual members. This suggests that each member cannot truly be interpreted as a plausible alternative realisation of the real world (Scaife and Smith 2018). This behaviour is described as the “signal-to-noise paradox” (Scaife and Smith 2018) as it results in the ensemble mean being better able to predict the real world than its own ensemble members. Here we use a measure of the ratio of predictable components (RPC) to quantify the error in the signal-to-noise ratios in the hindcasts. We use the definition by Scaife and Smith (2018), which is itself an iteration of the definition by Eade et al. (2014): RPC is calculated as the ratio of the predictable component of observed variability (\(PC_{obs}\)) and the predictable component of model variability (\(PC_{model}\)). \(PC_{obs}\) is estimated by ACC (Eq. 1) between model ensemble mean and the observations (\(ACC_{mo}\)), whereas \(PC_{model}\) is taken as the average ACC between each model member and the remaining ensemble mean (\(ACC_{mm}\)). Therefore,

$$RPC = \frac{PC_{obs}}{PC_{model}} = \frac{ACC_{mo}}{ACC_{mm}}$$
(2)

A perfect forecast system, with infinite samples and ensemble members, exhibits an RPC of 1, indicating that the predictable fraction of variance in the observations matches the predictable fraction of the model itself. When the model ensemble mean is more skilful in predicting the observed signal than individual members the RPC is greater than 1. With an RPC greater than 1, the signal-to-noise ratio is erroneously low and the magnitude of variability in the ensemble mean will be suppressed.
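A minimal sketch of the RPC calculation for a single timeseries is given below. It follows the description of Eq. 2, with the data layout (one list of yearly values per member) and the function names chosen for illustration only.

```python
import math


def acc(x, y):
    """Pearson anomaly correlation between two equal-length timeseries."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rpc(members, obs):
    """Ratio of predictable components, ACC_mo / ACC_mm.

    members[m][y] is the value of member m in year y; obs[y] is the
    observation-based value for the same year.
    """
    n_members = len(members)
    n_years = len(obs)
    ens_mean = [sum(m[y] for m in members) / n_members
                for y in range(n_years)]
    acc_mo = acc(ens_mean, obs)

    # ACC between each member and the mean of all remaining members.
    acc_mm_values = []
    for k, member in enumerate(members):
        others = [m for j, m in enumerate(members) if j != k]
        rest_mean = [sum(m[y] for m in others) / len(others)
                     for y in range(n_years)]
        acc_mm_values.append(acc(member, rest_mean))
    acc_mm = sum(acc_mm_values) / n_members
    return acc_mo / acc_mm
```

If every member carries only a weak copy of the observed signal buried in noise, the ensemble mean still correlates well with the observations while members correlate poorly with one another, and the function returns a value above 1.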

2.5 Ensemble sub-selection

In Sect. 3.4, we employ an observation-matching approach to sub-select the full 105-member GloSea ensemble, based on comparisons of the modelled to observation-based state of the three atmospheric circulations investigated (EOFZ500 1–3, as defined in Sect. 2.3). That is, for each year and for each atmospheric mode, the 20 members which have predicted principal components closest (least absolute difference) to ERA5-derived principal components are selected. The NWS SST fields from this 20-member sub-sample are averaged to create a new sub-selected ensemble mean. This technique relies on having observation-based estimates available against which to compare the predicted principal components, and as such it does not have relevance for real-time forecasting application. Instead this procedure represents a set of “atmospheric mode matched” experiments which enable us to ask how GloSea would perform if it could accurately predict the evolution of each atmospheric circulation pattern, to provide an idealised upper limit associated with atmospheric variability forecasts. Of course, the true limit of predictability will be lower than this.
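For a single year and a single atmospheric mode, the selection step can be sketched as follows. The function names and the flat lists of per-member values are hypothetical; in practice the averaging would be applied to full SST fields.

```python
def select_closest_members(member_pcs, obs_pc, k=20):
    """Indices of the k members whose predicted principal component has the
    least absolute difference from the observation-based value."""
    order = sorted(range(len(member_pcs)),
                   key=lambda m: abs(member_pcs[m] - obs_pc))
    return order[:k]


def sub_selected_mean(member_sst, member_pcs, obs_pc, k=20):
    """Mean SST over the members selected for one year and one mode."""
    chosen = select_closest_members(member_pcs, obs_pc, k)
    return sum(member_sst[m] for m in chosen) / len(chosen)
```

Repeating this for every hindcast year gives the "atmospheric mode matched" ensemble mean timeseries, which can then be scored against observations in the same way as the full ensemble mean.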

The statistical significance of the sub-selected ensemble mean results is assessed by bootstrapping. That is, 20 members are chosen at random (with replacement) per year, and this process is repeated 1000 times to build a distribution of sub-selected ensemble mean ACC scores. The mode-matched sub-ensemble mean is deemed significantly different from random chance where it falls outside the 97.5th and 2.5th percentiles of the random-selection distribution.

3 Results

3.1 Influence of atmospheric circulations on NWS SST

We first investigate the impact of atmospheric circulations on NWS SST in observation-based and modelled (GloSea) fields. In both observation-based and GloSea versions, EOFZ500 1–3 show strong correlation against NWS SST (Fig. 3). In winter, EOF1Z500 (NAO) is positively correlated with large areas of English Channel and North Sea SST, and the spatial impact patterns match closely between observation-based (Fig. 3a[i]) and GloSea fields (Fig. 3b[i]). Winter EOF2Z500 (EA) is strongly positively correlated with SST around the Southern Bight (see Fig. 1a for NWS region locations) in both observation-based (Fig. 3a[ii]) and GloSea versions (Fig. 3b[ii]). Winter EOF3Z500 (EAWR) demonstrates strong positive correlation in the Celtic Sea, outer shelf and northern North Sea regions, again in both observation-based (Fig. 3a[iii]) and GloSea fields (Fig. 3b[iii]). The broad similarity between observation-based and modelled atmospheric circulation impact on NWS SST in the winter indicates that GloSea tends to correctly simulate the spatial impact of winter atmospheric circulation variability on NWS SST. However, the correlations tend to be weaker in GloSea compared to observation-based fields and there are some cases where GloSea displays the wrong sign of correlation between winter atmospheric circulations and NWS SST. For example, the model- and observation-based correlations are of opposite sign for EOF1Z500 (NAO) in the Irish Sea (Fig. 3a[i] vs. Fig. 3b[i]) and EOF2Z500 (EA) in the north and west portions of the NWS (Fig. 3a[ii] vs. Fig. 3b[ii]).

Fig. 3

Observation-based and modelled influence of atmospheric variability on NWS SST. Panels a[i]–a[vi]: correlation (Pearson r) between ERA5-derived EOFZ500 1–3 principal components (panel columns) against CMEMS-v5 SST, in winter (DJF; row i–iii) and summer (JJA; row iv–vi), across 1993–2021 (common period between datasets; n = 28). Panels b[i]–b[vi]: correlation (Pearson r) between GloSea predicted EOFZ500 1–3 principal components (panel columns) against GloSea predicted SST, in winter (DJF; row i–iii) and summer (JJA; row iv–vi). Note, in panels b[i]–b[vi] (GloSea) correlations are calculated for all individual ensemble members and hindcast years (n = 2520) to avoid artificially smoothing variability within the ensemble. Black line marks the 200 m isobath

According to summer observation-based estimates, EOF1Z500 (NAO) correlates strongly with SST in the Celtic Sea (Fig. 3a[iv]), but EOF2Z500 (EA) demonstrates little to no substantial on-shelf correlation (Fig. 3a[v]). Summer EOF3Z500 (SCAND) is positively correlated with central and northern North Sea and northern outer shelf region SST (Fig. 3a[vi]). Despite being only third in terms of contribution to total atmospheric variance, EOF3Z500 (SCAND) has a substantial influence on NWS summer SST. The spatial patterns of modelled (GloSea) atmospheric circulation impact on NWS SST are approximately correct in the summer, by comparison to observation-based estimates, but do exhibit some notable inconsistencies (Fig. 3 summer observation-based vs. GloSea panels), suggesting that GloSea may produce errors in the spatial impacts of summer atmospheric circulations on NWS SST in some cases. For example, EOF1Z500 (NAO) is negatively correlated with SST in the English Channel, southern North Sea and Irish Sea regions in observation-based estimates but the correlations are positive in GloSea (Fig. 3a[iv] vs. Fig. 3b[iv]).

3.2 Prediction skill of atmospheric circulation patterns

We now consider how skilfully GloSea predicts each atmospheric circulation pattern per season, which will likely impact the performance of GloSea NWS SST given the clear influence of atmospheric variability on the NWS. We do this to begin to attribute the sources of NWS SST prediction skill, and subsequently understand potential limits to predictability. In the winter, only predictions of EOF1Z500 (NAO) are skilful and significant at the 95% confidence level in GloSea (ACC = 0.57; p = 0.002; Fig. 4a). In addition, the signal-to-noise ratio is anomalously low and hence the RPC is above 1 for winter EOF1Z500 (NAO). Winter EOF2Z500 (EA) and EOF3Z500 (EAWR) demonstrate no significant predictability (Fig. 4c and e). In the summer, signals are generally weaker and only EOF2Z500 (EA) is significantly predictable at the 95% confidence level (ACC = 0.36, p = 0.043; Fig. 4d). Predictions of summer EOF3Z500 (SCAND) are significant at 90% confidence (ACC = 0.3, p = 0.078; Fig. 4f), whilst summer EOF1Z500 (NAO) skill is negative and insignificant (ACC = -0.29; p = 0.918; Fig. 4b). Summer EOF2Z500 (EA) significance is degraded when hindcasts are temporally detrended (not shown), suggesting that a portion of the skill is attributable to low-frequency decadal signals or the trend. The opposite is true for summer EOF3Z500 (SCAND); predictions are significant at the 95% confidence level when detrended.

Fig. 4

Timeseries of predicted and observation-based atmospheric circulation patterns. GloSea full ensemble mean principal components (red lines) for EOF1Z500 (a, b), EOF2Z500 (c, d) and EOF3Z500 (e, f) in winter (DJF; panel column a, c, e) and summer (JJA; column b, d, f), at 2–4 months lead time across 1993–2016 hindcast period. ERA5 principal components are shown by the black line. All principal component values are normalised by their respective timeseries standard deviations. Background shading represents the density (count) of GloSea individual member principal components per year. ACC scores (with p value in brackets) are marked in the white boxes. Where significant at the 95% confidence level, p values are emboldened and RPC values are also displayed

3.3 GloSea and persistence NWS SST skill

We now assess the skill of GloSea in predicting NWS SST in winter and summer. Both GloSea and persistence demonstrate high skill across much of the NWS in winter (ACC > 0.6; Fig. 5a and b) and show similar spatial patterns, suggesting that persistence contributes to GloSea prediction skill. High persistence in winter is likely attributable to the thermal inertia of the waters, which are generally fully mixed in winter and thus suppress the influence of atmospheric variability on SST. Persistence and GloSea winter skill is low in the English Channel and southern North Sea regions (ACC < 0.4), which are some of the shallowest portions of the NWS. Lower thermal inertia associated with shallow bathymetry is likely to tie SST closer to atmospheric variability in these parts and therefore lead to accelerated decoupling from initial conditions. GloSea shows higher skill than persistence overall during winter (Fig. 5c). The NWS-wide improvements are modest in GloSea (area mean ACC = 0.60 for GloSea vs. ACC = 0.55 for persistence) but spatially coherent and significantly different from persistence across large areas, including the North Sea. The majority of improvement comes in regions which are shown to be impacted by winter EOF1Z500 (NAO), in both observation-based (Fig. 3a[i]) and modelled fields (Fig. 3b[i]), noting that EOF1Z500 (NAO) is skilfully predicted in GloSea (Fig. 4a). That is, the skilful simulation of winter EOF1Z500 (NAO) in GloSea likely leads to improved NWS SST predictions over persistence, which lacks information on the evolution of the atmosphere over the predicted period. SST in the Southern Bight region, where EOF2Z500 (EA) is most influential (Fig. 3a[ii] and 3b[ii]), remains poorly simulated in GloSea, likely because EOF2Z500 (EA) is not skilfully predicted (Fig. 4c). The areas of impact associated with winter EOF3Z500 (EAWR), namely the Celtic Sea, outer shelf and northern North Sea regions (Fig. 3c), show statistically significant but relatively small changes in GloSea as persistence is already high in these areas (Fig. 5b).

Fig. 5

Dynamical (GloSea) and persistence NWS SST prediction skill. GloSea full ensemble mean (n = 105) SST ACC (panel column a, d), persistence ACC (column b, e), and GloSea minus persistence (column c, f), for winter (DJF; top panel row) and summer (JJA; bottom row), at 2–4 months lead time across the hindcast period 1993–2016. The p = 0.05 statistical significance level is marked by grey dashed contours, where the side of the contour with higher ACC scores marks p < 0.05. p values are calculated using a one-tailed test. Mean ACC for the NWS domain is marked in the bottom right. Black line marks the 200 m isobath. Stippling marks locations where GloSea is significantly different from persistence, as assessed by bootstrapping method described in Sect. 2.4

Summer NWS SST skill in both GloSea and persistence hindcasts is low in large areas (ACC < 0.3; Fig. 5d and e), particularly in the Celtic Sea and North Sea regions where ACC is almost entirely statistically insignificant in both systems. Moreover, there is no statistically significant difference between GloSea and persistence across most of the NWS. Persistence in the Celtic Sea and North Sea regions is low in summer likely due to seasonal stratification in the warmer summer months (shown by mixing fronts in Fig. 1b and d, and van Leeuwen et al. 2015; Pingree and Griffiths 1978) which results in a thin mixed layer (reduced thermal inertia), thus amplifying the impact of atmospheric variability on SST. Though there is moderate improvement in GloSea skill relative to persistence on-shelf (within the 200 m isobath where this study is focussed) large areas remain statistically insignificant (Fig. 5f). Summer EOF1Z500 (NAO) is shown to impact the Celtic Sea in observation-based (Fig. 3a[iv]) and modelled fields (Fig. 3b[iv]) but is not skilfully predicted in GloSea (Fig. 4b), meaning there is no associated meaningful improvement in SST skill in GloSea (Fig. 5d). Summer EOF3Z500 (SCAND) is shown to be both influential on North Sea SST (Fig. 3) and somewhat skilfully predicted by GloSea (Fig. 4f) yet there is little meaningful improvement in GloSea SST skill in these parts relative to persistence. Potential explanations may include one or a combination of (i) the skill of summer EOF3Z500 (SCAND) being still too low (ACC = 0.3, Fig. 4f) to adequately constrain SST evolution in the particularly sensitive stratifying regions, (ii) the weak and spatially erroneous response of SST to summer EOF3Z500 (SCAND) in GloSea (Fig. 3a[vi] vs. Fig. 3b[vi]), potentially due to (iii) deficiencies in the GloSea global ocean component associated with locally erroneous representation of key NWS physics such as seasonal stratification (discussed in greater detail in Sect. 3.5). Summer EOF2Z500 (EA) is skilfully predicted by GloSea (Fig. 4d) but is shown to have weaker impact on on-shelf SST (Fig. 3a[v] and 3b[v]) meaning GloSea NWS SST prediction skill is unaffected.

Exceptions to low summer skill include the English Channel, southern North Sea, Irish Sea regions and around the Fair Isle current (see Fig. 1a for NWS region locations) where, unlike winter for the most part, skill is high in both GloSea and persistence systems (ACC > 0.6; Fig. 5d and e). These regions tend to remain fully mixed throughout the year due to strong tidal mixing (see mixing fronts in Fig. 1b and d, as well as van Leeuwen et al. 2015), which typically favours higher thermal inertia. However, the high summer skill seemingly contradicts the reasoning for low skill in these regions in winter when the waters are also fully mixed but nonetheless shallow and therefore likely to have low thermal inertia.

We propose that the high persistence in waters with low thermal inertia, seen in the shallow English Channel and southern North Sea regions in summer (Fig. 5e), occurs because these regions are only weakly influenced by the dominant summer atmospheric circulations according to observation-based estimates (Fig. 3a summer panels). This state contrasts with the situation in the winter (Fig. 3a winter vs. summer panels). However, this cannot explain the English Channel, southern North Sea and Irish Sea skill in GloSea, which falsely simulates summer EOF1Z500 (NAO) as being impactful on SST in these regions (Fig. 3a[iv] vs. Fig. 3b[iv]). The ORCA025 ocean grid used by GloSea is relatively coarse and parameterises tidal mixing (i.e. Simmons et al. 2004) rather than including dynamic tides of varying amplitude. The parameterisation of tides is expected to be an underestimation for the NWS, which can lead to artificial stratification in the English Channel, southern North Sea and Irish Sea regions (Tinker et al. 2018), therefore artificially amplifying air-sea exchange. Intuitively, this would suggest that the true SST persistence in the English Channel, southern North Sea and Irish Sea regions should be degraded in GloSea due to the artificial influence of EOF1Z500 (NAO), yet summer SST skill is maintained in these regions in GloSea (Fig. 5d). We explain this as a cancellation of errors between the GloSea EOF1Z500 (NAO) prediction and SST response. That is, the GloSea summer SST response to EOF1Z500 (NAO) has the incorrect sign in these regions (Fig. 3a[iv] vs. Fig. 3b[iv]), in parallel to summer EOF1Z500 (NAO) prediction skill displaying the wrong sign (ACC = -0.29; Fig. 4b).

GloSea summer SST skill in the outer shelf region is high (Fig. 5d) and sees improvements when compared with persistence (Fig. 5f), albeit the difference is not statistically significant. We note that the same region displays substantially lower skill when hindcast data are temporally detrended (not shown). Skill in the outer shelf region may be associated with low-frequency decadal signals in the North Atlantic. The advection of these signals onto the NWS should be captured by the GloSea hindcasts.
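The effect of temporal detrending on skill can be checked with a short calculation of the following kind. This is a minimal sketch on synthetic data: it assumes a linear least-squares trend is removed from both forecast and observed series before the ACC is computed, which may differ from the detrending actually applied, and the trend and noise amplitudes are hypothetical.

```python
import numpy as np

def acc(forecast, observed):
    """Anomaly correlation coefficient (Pearson correlation over years)."""
    f = forecast - forecast.mean()
    o = observed - observed.mean()
    return float((f * o).sum() / np.sqrt((f ** 2).sum() * (o ** 2).sum()))

def detrend_linear(series, time):
    """Remove a least-squares linear trend from a 1-D time series."""
    slope, intercept = np.polyfit(time, series, deg=1)
    return series - (slope * time + intercept)

years = np.arange(1993, 2017)
time = years - years.mean()  # centred time axis for a well-conditioned fit

# Synthetic series sharing a warming trend but with independent noise
# (amplitudes are illustrative assumptions, not values from the hindcasts).
rng = np.random.default_rng(1)
trend = 0.03 * time
obs = trend + 0.2 * rng.standard_normal(years.size)
fcst = trend + 0.2 * rng.standard_normal(years.size)

acc_raw = acc(fcst, obs)
acc_detrended = acc(detrend_linear(fcst, time), detrend_linear(obs, time))
print(f"ACC with trend: {acc_raw:.2f}, ACC detrended: {acc_detrended:.2f}")
```

When most of the shared variance comes from a common trend, the detrended ACC would be expected to fall relative to the raw ACC, which is the behaviour noted for the outer shelf region.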

3.4 Observation-matched GloSea atmospheric circulation patterns

We have shown that GloSea can be skilful in predicting NWS SST, particularly in winter. However, large amounts of variance remain unexplained at current levels of skill so we now ask what the prediction skill of NWS SST would be with improved forecasts of atmospheric circulation patterns. The “atmospheric mode matching” procedure gives an upper estimate of the skill that could be achieved if we had improved atmospheric forecasts with respect to each individual circulation pattern. To do so, we sub-select members from the full GloSea ensemble which best simulate observation-based estimates of each atmospheric circulation pattern’s principal components per hindcast year (see Sect. 2.5). In both winter and summer, ACC skill for the “atmospheric mode matched” sub-selected ensemble mean for each circulation pattern (blue lines in Fig. 6) is close to 1 and is significantly improved over the GloSea full ensemble mean (red lines). This idealised method achieves near perfect skill for each mode in terms of capturing the amplitude, sign and phase of variability (ACC > 0.9 across all seasons and circulation patterns) and it corrects any anomalous signal-to-noise ratios (RPC ≈ 1). Note, however, we find that there is little overlap between members in sub-ensembles selected for each atmospheric circulation pattern (on average across the hindcast period, 5.62% and 3.33% of members overlap in winter and summer, respectively), indicating that individual members tend not to skilfully predict multiple circulation patterns simultaneously.
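The sub-selection step can be sketched as follows. This is an illustrative outline only: the selection criterion (smallest absolute difference between member and observed principal component in each year), the synthetic data, the ensemble size of 60 and the function names are assumptions, and the actual procedure is the one described in Sect. 2.5. The RPC shown is one common form (ACC divided by the ratio of ensemble-mean to total standard deviation), with total variance pooled across members and years for simplicity.

```python
import numpy as np

def acc(forecast, observed):
    """Anomaly correlation coefficient (Pearson correlation over years)."""
    f = forecast - forecast.mean()
    o = observed - observed.mean()
    return float((f * o).sum() / np.sqrt((f ** 2).sum() * (o ** 2).sum()))

def select_members(pc_members, pc_obs, n_select=20):
    """Per year, return indices of the n_select members closest to pc_obs.

    pc_members: (n_years, n_members), pc_obs: (n_years,)
    """
    abs_err = np.abs(pc_members - pc_obs[:, None])
    return np.argsort(abs_err, axis=1)[:, :n_select]

def rpc(pc_members, pc_obs):
    """Ratio of predictable components (simplified, pooled total variance)."""
    ens_mean = pc_members.mean(axis=1)
    return acc(ens_mean, pc_obs) / (ens_mean.std() / pc_members.std())

def mean_overlap_percent(idx_a, idx_b):
    """Average share (%) of members common to two selections, per year."""
    shared = [np.intersect1d(a, b).size for a, b in zip(idx_a, idx_b)]
    return 100.0 * float(np.mean(shared)) / idx_a.shape[1]

# Synthetic stand-ins for two circulation patterns (hypothetical sizes).
rng = np.random.default_rng(42)
n_years, n_members = 24, 60
pc_obs = rng.standard_normal(n_years)
pc_members = 0.1 * pc_obs[:, None] + rng.standard_normal((n_years, n_members))
pc_obs_2 = rng.standard_normal(n_years)
pc_members_2 = rng.standard_normal((n_years, n_members))

idx = select_members(pc_members, pc_obs)
idx_2 = select_members(pc_members_2, pc_obs_2)
sub_members = np.take_along_axis(pc_members, idx, axis=1)

acc_full = acc(pc_members.mean(axis=1), pc_obs)
acc_sub = acc(sub_members.mean(axis=1), pc_obs)
rpc_sub = rpc(sub_members, pc_obs)
overlap = mean_overlap_percent(idx, idx_2)
print(f"ACC full: {acc_full:.2f}, ACC sub-selected: {acc_sub:.2f}")
print(f"RPC sub-selected: {rpc_sub:.2f}, overlap: {overlap:.1f}%")
```

Because members are chosen separately for each year and each pattern, the sub-ensemble for one mode need not share members with that for another, which is what the overlap statistic measures.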

Fig. 6

GloSea “atmospheric mode matched” tests. Sub-selected ensemble mean principal component predictions (blue lines) with respect to EOF1Z500 (a, b), EOF2Z500 (c, d) and EOF3Z500 (e, f) in winter (DJF; panel column a, c, e) and summer (JJA; column b, d, f), at 2–4 months lead time across 1993–2016 hindcast period. Red line shows the original full ensemble mean (as in Fig. 4). ERA5 principal components are shown by the black scatter markers. All principal component values are normalised by their respective timeseries standard deviations. ACC (with p value in brackets, calculated using a one-tailed test) and RPC scores marked in the white boxes are for GloSea “atmospheric mode matched” sub-selected ensembles

3.5 Sub-selected ensemble GloSea NWS SST skill

Using only the members whose predictions of each atmospheric circulation pattern track observation-based estimates (n = 20), a new sub-selected NWS SST ensemble mean field is generated per mode from the members’ corresponding ocean fields. In the winter and for all three atmospheric circulation patterns, the sub-selected ensemble mean shows improvement in SST skill across large areas of the NWS, with improvements on the order of ACC > 0.25 relative to the GloSea full ensemble mean (Fig. 7). Concerning winter EOF1Z500 (NAO) and EOF2Z500 (EA), when matched to observation-based estimates, the areas which perform worst in the full ensemble mean, namely the English Channel and southern North Sea (Fig. 5a), show the largest improvements (Fig. 7d and e). This is consistent with the influence of EOF1Z500 (NAO) and EOF2Z500 (EA) in these regions in winter (Fig. 3). Regarding winter EOF3Z500 (EAWR), there is widespread improvement in skill across most of the NWS (Fig. 7f). Skill in the small area in the southeast Celtic Sea/west of France is boosted but remains low (Fig. 7c).
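Building the sub-selected SST field and its skill map can be sketched as below. This is a hedged illustration on synthetic arrays: the array layout (year, member, latitude, longitude), the grid size, the ensemble size and the randomly generated member indices are all hypothetical stand-ins for the GloSea fields and the mode-matched selections.

```python
import numpy as np

def subensemble_mean(sst_members, member_idx):
    """Average the selected members' SST fields for every year.

    sst_members: (n_years, n_members, n_lat, n_lon)
    member_idx:  (n_years, n_select) indices chosen per year
    """
    picked = np.take_along_axis(
        sst_members, member_idx[:, :, None, None], axis=1
    )
    return picked.mean(axis=1)

def acc_map(forecast, observed):
    """Gridpoint anomaly correlation over the year axis (axis 0)."""
    f = forecast - forecast.mean(axis=0)
    o = observed - observed.mean(axis=0)
    return (f * o).sum(axis=0) / np.sqrt(
        (f ** 2).sum(axis=0) * (o ** 2).sum(axis=0)
    )

# Synthetic data with hypothetical dimensions.
rng = np.random.default_rng(7)
n_years, n_members, n_lat, n_lon, n_select = 24, 60, 5, 6, 20
sst_obs = rng.standard_normal((n_years, n_lat, n_lon))
sst_members = rng.standard_normal((n_years, n_members, n_lat, n_lon))

# Placeholder selection: in practice these indices would come from matching
# each member's atmospheric principal component to observations.
member_idx = np.argsort(rng.random((n_years, n_members)), axis=1)[:, :n_select]

acc_sub = acc_map(subensemble_mean(sst_members, member_idx), sst_obs)
acc_full = acc_map(sst_members.mean(axis=1), sst_obs)
skill_change = acc_sub - acc_full  # analogous to the bottom row of Fig. 7
```

The difference map `skill_change` corresponds conceptually to the sub-selected minus full ensemble mean panels, although here it carries no physical meaning because the inputs are random.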

Fig. 7

GloSea NWS winter SST prediction skill if improved atmospheric predictions were possible. Top panel row: Winter (DJF) only EOF1Z500 (panel column a, d), EOF2Z500 (column b, e) and EOF3Z500 (column c, f) GloSea sub-selected ensemble mean SST ACC (n members = 20), at 2–4 months lead time across the hindcast period 1993–2016. Bottom panel row: sub-selected ensemble mean (i.e. top row) minus full ensemble mean (i.e. Figure 5a). The p = 0.05 statistical significance level is marked by grey dashed contours, where the side of the contour with higher ACC scores marks p < 0.05. p values are calculated using a one-tailed test. Mean ACC for the NWS domain is marked in the bottom right for the top row. Black line marks the 200 m isobath. Stippling shows locations where the sub-selected ensemble mean is significantly different from random member selection, as assessed by bootstrapping method described in Sect. 2.5

In the summer, the sub-selected ensembles matched to observation-based estimates of EOF1Z500 (NAO) show considerable improvement in skill (Fig. 8a and d) in areas of the NWS, namely the Celtic Sea and North Sea, where the GloSea full ensemble mean is both very low in skill (Fig. 5d) and where summer EOF1Z500 (NAO) is shown to be impactful in the summer (Fig. 3). Therefore, summer EOF1Z500 (NAO) forecasts offer the greatest opportunity for improvement of NWS summer SST predictions. However, large areas in the Celtic Sea and parts of the eastern North Sea remain low in skill and statistically insignificant. This is likely due to one or a combination of: poor model representation of NWS summer stratification (i.e. Tinker et al. 2018) in the seasonally stratifying regions of the Celtic Sea and North Sea, deficiencies in simulating the impacts of meteorological events such as storms/rainfall/freshwater inputs on the water column (i.e. Jardine et al. 2023), or errors in exchange between the shelf and open ocean in the GloSea global ocean model component, which is not best suited to the fine-scale processes of the shelf-break (Graham et al. 2018). Also of note is the fact that improved summer EOF1Z500 (NAO) predictions lead to degraded SST performance in regions such as the English Channel (Fig. 8d). This is consistent with the interpretation of results presented in Sect. 3.3 where full ensemble GloSea SST skill is suggested to be high in such regions due to a cancellation of errors between summer EOF1Z500 (NAO) predictions and SST response. That is, improved EOF1Z500 (NAO) predictions will degrade SST predictions in areas where the spatial impacts of atmospheric circulations are erroneous, likely due to deficiencies in the ocean model component such as in the case of artificial stratification in the English Channel. 
The sub-selected ensembles matched to observation-based estimates of summer EOF3Z500 (SCAND) also show meaningful improvement in skill across parts of the NWS (Fig. 8c and f) but again, ocean model deficiencies may limit improvement in summer. In the case of matching to observation-based estimates of summer EOF2Z500 (EA), there is little improvement and skill is often significantly degraded relative to the full ensemble mean (Fig. 8b and e). This is explained by the weak impact by EOF2Z500 on NWS summer SST (Fig. 3) meaning that the sub-selection procedure serves only to reduce the ensemble size without the benefit of better representing impactful atmospheric circulations.
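Whether a sub-selected ensemble differs from one that is merely smaller can be tested against random member selection. The sketch below shows one way such a bootstrap could look at a single grid point; it is not the method of Sect. 2.5. The number of resamples, the 2.5–97.5 percentile range, the independent draw of members for each year and the synthetic data are all assumptions.

```python
import numpy as np

def acc(forecast, observed):
    """Anomaly correlation coefficient (Pearson correlation over years)."""
    f = forecast - forecast.mean()
    o = observed - observed.mean()
    return float((f * o).sum() / np.sqrt((f ** 2).sum() * (o ** 2).sum()))

def random_subset_acc(members, obs, n_select=20, n_boot=1000, seed=0):
    """ACC of ensemble means built from randomly chosen members.

    members: (n_years, n_members) SST at one grid point, obs: (n_years,)
    """
    rng = np.random.default_rng(seed)
    n_years, n_members = members.shape
    scores = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n_select distinct members independently for every year.
        idx = np.argsort(rng.random((n_years, n_members)), axis=1)[:, :n_select]
        sub_mean = np.take_along_axis(members, idx, axis=1).mean(axis=1)
        scores[b] = acc(sub_mean, obs)
    return scores

# Synthetic example: SST partly driven by an atmospheric index (hypothetical).
rng = np.random.default_rng(11)
n_years, n_members = 24, 60
pc_obs = rng.standard_normal(n_years)
pc_members = rng.standard_normal((n_years, n_members))
sst_obs = 0.5 * pc_obs + 0.5 * rng.standard_normal(n_years)
sst_members = 0.5 * pc_members + 0.5 * rng.standard_normal((n_years, n_members))

# Mode-matched selection: members whose index is closest to the observed one.
matched = np.argsort(np.abs(pc_members - pc_obs[:, None]), axis=1)[:, :20]
acc_matched = acc(np.take_along_axis(sst_members, matched, axis=1).mean(axis=1), sst_obs)

scores = random_subset_acc(sst_members, sst_obs)
lo, hi = np.percentile(scores, [2.5, 97.5])
significant = bool(acc_matched < lo or acc_matched > hi)
```

If the circulation pattern has little influence on SST, as suggested for summer EOF2Z500, the matched score would be expected to fall within the random-selection range, so that sub-selection only reduces the ensemble size.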

Fig. 8

GloSea NWS summer SST prediction skill if improved atmospheric predictions were possible. Top panel row: Summer (JJA) only EOF1Z500 (panel column a, d), EOF2Z500 (column b, e) and EOF3Z500 (column c, f) GloSea sub-selected ensemble mean SST ACC (n members = 20), at 2–4 months lead time across the hindcast period 1993–2016. Bottom panel row: sub-selected ensemble mean (i.e. top row) minus full ensemble mean (i.e. Figure 5d). The p = 0.05 statistical significance level is marked by grey dashed contours, where the side of the contour with higher ACC scores marks p < 0.05. p values are calculated using a one-tailed test. Mean ACC for the NWS domain is marked in the bottom right for the top row. Black line marks the 200 m isobath. Stippling shows locations where the sub-selected ensemble mean is significantly different from random member selection, as assessed by bootstrapping method described in Sect. 2.5

4 Summary and discussion

We have quantified the current levels of skill in predicting boreal winter and summer European North-West shelf seas sea surface temperature using a dynamical forecasting system and have identified potential sources of skill improvement. Approximate locations where we find high and low current skill, as well as locations where there may be improvements in skill with improved atmospheric forecasts, are summarised in the schematic in Fig. 9. Whilst it is unrealistic to expect GloSea to ever produce near perfect predictions of atmospheric circulation patterns, the “atmospheric mode matched” analysis presented in this study is a useful way of identifying locations where more NWS SST skill may be attainable with some future increases in atmospheric skill, as well as what skill remains unexplained in GloSea under improved atmospheric circulation prediction. In this study, we have evaluated NWS SST predictability from the perspective of atmospheric circulation drivers. Alternative approaches could involve methods to explicitly link predictable modes of the variable of interest (i.e. NWS SST) to potential sources of predictability (e.g. DelSole and Chang 2003; Fan et al. 2020; Zhang et al. 2023; Chen et al. 2024).

Fig. 9

Current skill and potential skill around the NWS. Schematic to summarise the approximate locations of high skill (yellow), low skill (purple) and potential for skill indicated by “atmospheric mode matched” tests (orange) in GloSea predictions of NWS SST for winter (DJF) and summer (JJA) seasons. Annotations provide more detail on the sources and limits to skill in each category, based on the analysis presented on GloSea and persistence predictability in Sect. 3.3 and the results of “atmospheric mode matched” experiments in Sect. 3.5

GloSea winter NWS SST prediction skill is generally high and beats persistence, particularly in shallow regions where EOF1Z500 (NAO) exerts a strong control. This builds on the well-documented understanding of GloSea winter NAO prediction skill (e.g. Scaife et al. 2014) and echoes the NWS winter predictability results of Tinker and Hermanson (2021). In the summer, GloSea NWS SST skill is generally lower when the stratified surface mixed layer is particularly sensitive to atmospheric variability. For both seasons, prospects for future seasonal forecast skill in dynamical forecasting systems are identified through GloSea “atmospheric mode matched” simulations. Potential increases in NWS SST skill stem primarily from improved EOF1Z500 (NAO) predictions in both winter and summer, in line with well understood surface climate impacts of winter and summer NAO (Hurrell 1995; Folland et al. 2009; Hurrell and Deser 2009; Bladé et al. 2012). Winter NAO has previously been shown to impact SST across large areas of the NWS (Becker and Pauly 1996; Dippner 1997; Tinker et al. 2018) meaning its role in producing skilful simulations of NWS SST is expected. However, we also demonstrate for the first time the key contribution of summer NAO to NWS SST variability and its potential for improving summer SST simulations if predicted well. EOF2Z500 (EA) in winter and EOF3Z500 (SCAND) in summer also contribute to NWS SST predictability. Winter EOF2Z500 (EA) is linked to precipitation over North-West continental Europe and Britain (Casanueva et al. 2014; Hall and Hanna 2018; West et al. 2021), indicating a potential mechanism for influencing the NWS via variability in coastal runoff. With increased freshwater runoff we might expect increased stratification in the otherwise well-mixed Southern Bight area (Simpson et al. 1993; van Leeuwen et al. 2015) with consequences for SST variability. 
Despite being the lowest order EOF in summer, EOF3Z500 (SCAND) exerts strong control over summer NWS SST, reflective of its broad links to European surface climate (Bueh and Nakamura 2007; Wang and Tan 2020). If predicted well, it offers potential for skill improvement across the NWS in summer.

Even in “atmospheric mode matched” tests, only ~ 50% of the variance in NWS SST on seasonal timescales can be explained. Whilst unpredictable noise associated with atmospheric variability (e.g. short-term weather) contributes an inherent limit to total NWS SST predictability, the impact of fine-scale ocean processes on the NWS remains a potentially significant source of skill and has not been explored in detail in this study. The GloSea ocean model component, which is relatively coarse resolution for this regional application and lacks dynamic tides, may fail to properly resolve NWS processes, such as stratification, shelf-break transport and shelf-edge exchange (Holt et al. 2017). Therefore, alongside continuing development aimed at improving atmospheric variability simulation in dynamical forecast systems, there is significant scope to derive further NWS SST skill through ocean component development with greater attention given to resolving shelf seas processes. Moreover, in certain cases, improvements to the ocean component of GloSea may be required to unlock skill offered by any improvements to atmospheric forecasts. For example, we have shown that errors in the spatial impact of summer EOF1Z500 (NAO) on SST in the English Channel within GloSea, potentially due to artificial stratification, result in degraded SST skill in this region when predictions of summer EOF1Z500 (NAO) are improved. The prospect of dynamically downscaling the ocean component of a GloSea large ensemble is appealing (i.e. Tinker and Hermanson 2021) but would likely be prohibitively expensive. Future work could instead aim to force a simple shelf seas model (e.g. “S2P3”; Halloran et al. 2021) to explore the benefits of improved shelf dynamics at lower expense, or to improve representation of tides in GloSea either by using a shelf-enabled global NEMO ocean component (in development) or by improving their parameterisation (e.g. Tinker et al. 2022).
Whilst we have focussed here on seasonal SST predictability, it will be important to consider the predictability of other variables, averaging periods and lead times. For example, predictability of sub-surface temperature fields could be critical for ecosystem applications (e.g. Smyth et al. 2010; Marsh et al. 2015), and exploring prospects for early-warning systems of NWS extreme events, including marine heatwaves (e.g. Berthou et al. 2024), could be important for climate adaptation.

We have demonstrated that considerable NWS SST seasonal forecasting skill already exists through both the persistence of anomalies in the initial conditions and the predictability of atmospheric circulation. Further skill can likely be derived from improvements to both atmospheric and oceanic components of dynamical forecasting systems. However, we encourage the exploitation of the existing skill for managing the demands on the NWS in response to human pressures.