Revisiting the synoptic-scale predictability of severe European winter storms using ECMWF ensemble reforecasts

Abstract. New insights into the synoptic-scale predictability of 25 severe European winter storms of the 1995–2015 period are obtained using the homogeneous ensemble reforecast dataset from the European Centre for Medium-Range Weather Forecasts. The predictability of the storms is assessed with different metrics including (a) the track and intensity to investigate the storms' dynamics and (b) the Storm Severity Index to estimate the impact of the associated wind gusts. The storms are well predicted by the whole ensemble up to 2–4 days ahead. At longer lead times, the number of members predicting the observed storms decreases and the ensemble average is not clearly defined for the track and intensity. The Extreme Forecast Index and Shift of Tails are therefore computed from the deviation of the ensemble from the model climate. Based on these indices, the model has some skill in forecasting the area covered by extreme wind gusts up to 10 days, which indicates a clear potential for early warnings. However, large variability is found between the individual storms. The poor predictability of outliers appears related to their physical characteristics such as explosive intensification or small size. Longer datasets with more cases would be needed to further substantiate these points.


Storm tracking
The 25 selected storms are tracked both in ERA-Interim and in the members of the ensemble reforecast, using the algorithm described by Pinto et al. (2005) and originally developed by Murray and Simmonds (1991). In a first step, maxima are identified in the Laplacian of MSLP interpolated on a polar stereographic grid, then minima in MSLP are searched for in their vicinity. The Laplacian of MSLP is closely related to the quasi-geostrophic vorticity; the algorithm is thus similar to tracking maxima in low-level vorticity. In a second step, the minima of MSLP are connected between subsequent model outputs every 6 h to form tracks, provided their displacement velocity remains consistent in time. As the focus here is on severe storms, the obtained tracks are filtered to exclude storms with a weak Laplacian of MSLP or with a duration of less than 24 h. However, the algorithm is applied hemisphere-wide and thus yields a large number of tracks, among which the storms of interest need to be identified.
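The first step above can be sketched in Python as the detection of local maxima of the discrete Laplacian of an MSLP field. This is only a minimal illustration under simplifying assumptions: the actual algorithm of Murray and Simmonds (1991) operates on a polar stereographic grid and subsequently searches for MSLP minima in the vicinity, which is omitted here, and the function name and threshold are illustrative.

```python
import numpy as np

def candidate_centres(mslp, threshold=0.0):
    """Return (i, j) grid indices of local maxima of the discrete
    Laplacian of MSLP, as candidate cyclone centres (interior points only)."""
    lap = np.zeros_like(mslp, dtype=float)
    # five-point discrete Laplacian in the grid interior
    lap[1:-1, 1:-1] = (mslp[:-2, 1:-1] + mslp[2:, 1:-1]
                       + mslp[1:-1, :-2] + mslp[1:-1, 2:]
                       - 4.0 * mslp[1:-1, 1:-1])
    centres = []
    for i in range(1, lap.shape[0] - 1):
        for j in range(1, lap.shape[1] - 1):
            # keep points that exceed the threshold and dominate their 3x3 window
            if lap[i, j] > threshold and lap[i, j] == lap[i-1:i+2, j-1:j+2].max():
                centres.append((i, j))
    return centres

# Synthetic example: an idealized depression centred at grid point (10, 10)
y, x = np.mgrid[0:21, 0:21]
mslp = 1000.0 - 20.0 * np.exp(-((x - 10) ** 2 + (y - 10) ** 2) / 18.0)
```

On this synthetic Gaussian low, the Laplacian of MSLP peaks at the centre of the depression, which the function recovers as the single candidate.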
Identifying the storms in ERA-Interim is straightforward, because the selection of severe storms is based on the same dataset. For each of the 25 storms, the reference time and position of minimum MSLP given by Roberts et al. (2014) are searched for in the tracks obtained from the algorithm. The closest track is unambiguously identified this way and matches the reference track, although differences may arise, particularly at the beginning and end. As shown by Neu et al. (2013), such differences are a common issue when comparing storm tracking algorithms, which usually agree well for the mature phase of deep cyclones but differ during the phases of cyclogenesis and cyclolysis. In particular, the algorithm of Pinto et al. (2005) tends to identify cyclones earlier than others. Neu et al. (2013) emphasize that there is no best way of tracking storms, because there is no single definition of extratropical cyclones. As the same algorithm is applied here to both ERA-Interim and the reforecasts, potential biases due to the tracking method should largely cancel out.
In the reforecast, identifying the storms is less straightforward even at short lead times and quickly becomes ambiguous, because the tracks diverge from ERA-Interim when the lead time increases. In earlier studies, Froude et al. (2007a, b) applied strict criteria in the location, timing and duration of tracks to identify storms in forecasts. While such criteria may be required for statistical studies, they would reject too many ensemble members for the sample of storms considered here, in particular at long lead times, and thus would bias the results towards "good" members only. Instead, the track closest to ERA-Interim is identified in each ensemble member without arbitrary criteria, based on the great-circle distance averaged over a 24-h period.
Two methods are compared for the definition of the 24-h period. In the first method, the period is defined as the first 24-h overlap between the track in the ensemble member and in ERA-Interim. If the track is not present at the time of initialization, it is further constrained to start in the ensemble member within 48 h of its first occurrence in ERA-Interim. In the second method, the period is simply defined as the day of maximum intensity.
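The matching of tracks by mean great-circle distance can be sketched as follows. The track data structure (a mapping from 6-hourly time steps to latitude-longitude positions) and the function names are assumptions for illustration; the haversine formula itself is standard.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance (km) between points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dlat, dlon = p2 - p1, np.radians(lon2 - lon1)
    a = np.sin(dlat / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlon / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def closest_track(reference, candidates):
    """Select the candidate track with the smallest great-circle distance
    to the reference track, averaged over their first 24-h overlap
    (five 6-hourly steps).  Tracks map time step -> (lat, lon)."""
    best, best_dist = None, np.inf
    for track in candidates:
        common = sorted(set(reference) & set(track))[:5]
        if len(common) < 5:  # require a full 24-h overlap
            continue
        dist = np.mean([great_circle_km(*reference[t], *track[t])
                        for t in common])
        if dist < best_dist:
            best, best_dist = track, dist
    return best, best_dist
```

Applied to a reference track and the set of tracks found in one ensemble member, the function returns the track used in the first identification method.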
The two methods are illustrated for the 7-day reforecast of the storm that hit the British Isles on 28 October 1996 ("u19961028", Table 1). The storm took its origin in Hurricane Lili, which reached Europe after crossing the North Atlantic and undergoing extratropical transition (Browning et al., 1998). With the first method, the identified tracks start from the same location, because the storm is present in the reforecast at the time of initialization (Figure 2a). They later diverge and only two of them reach Europe, whereas the others remain over the central North Atlantic. With the second method, in contrast, the identified tracks all reach Europe, as expected from the identification on the day of maximum intensity (Figure 2b). However, they start from different regions spreading from the western to the eastern North Atlantic. In particular, no single track takes its origin in Hurricane Lili, i.e. the two methods do not show any common track. Although this case of extratropical transition is unique among the selected storms, it illustrates the difficulty of identifying storms in the reforecast. The most relevant method depends on the aims of the analysis: the first method focuses on the dynamics of the storm and the second on its impact. Both methods are therefore used here.

Storm Severity Index
While the intensity of a storm is commonly measured with its minimum MSLP, its severity mostly depends on the strength of the wind gusts, which is also controlled by the pressure gradient at the synoptic scale and by additional factors at the mesoscale and turbulent scale. In particular, insured losses have been shown to scale with the third power of the strongest wind gusts.
Following Klawa and Ulbrich (2003) and Leckebusch et al. (2008), a Storm Severity Index (SSI) is therefore defined as

SSI = (v_max / v_98 - 1)^3 if v_max > v_98 and SSI = 0 otherwise, (1)

with v_max the daily maximum wind gust and v_98 its local 98th climatological percentile.
The scaling with v_98 accounts for the local adaptation to wind gusts, whose impact on infrastructure is weaker in exposed areas such as coasts and mountains than in the continental flatlands for the same absolute wind speed (Klawa and Ulbrich, 2003).
The climatology of wind gusts is computed separately for ERA-Interim and the reforecast but for the same period of interest, mid-October to mid-March 1995/96-2014/15. The resulting values of v_98 are higher in the reforecast, likely due to the higher model resolution. In particular, wind gusts are abnormally high over topography in the first 6-h output of the reforecast, which suggests a problem with the spin-up of the model. The first 6 h are thus omitted when computing both v_max and v_98. Wind gusts should also be treated with caution in ERA-Interim but are still preferred to the wind speed (used by Leckebusch et al., 2008), because they represent maximum values over a certain time period rather than instantaneous values and thus better sample storms with a large displacement velocity.
The daily maximum gusts in ERA-Interim are shown in Figure 3a and the resulting SSI in Figure 3b for storm Lothar on 26 December 1999. The strongest gusts are found over the Bay of Biscay but the highest SSI is found over southern Germany due to the lower values of the local model climatology. The SSI is then averaged over central Europe (defined as 40°N-60°N and 10°W-30°E, corresponding to the map shown in Figure 3) to give a single value for the total severity of the storm, which can then be compared with the reforecast. This method is equivalent to the area SSI defined by Leckebusch et al. (2008). It is preferred to including the SSI along the track of the storm only (event SSI in Leckebusch et al., 2008), because of the ambiguous identification of the tracks in the reforecast. Among the 25 investigated storms, Lothar exhibits the highest averaged SSI in ERA-Interim, followed by Klaus, Martin and Kyrill (Table 1). These four storms are responsible for the four highest insurance losses during the period of interest (Roberts et al., 2014), which suggests that the averaged SSI in ERA-Interim is a relevant measure of the severity of storms. Inaccuracies are still expected and attributed to mesoscale features that are not resolved by ERA-Interim and to non-meteorological factors such as the density of population and the insured capital.
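The gridpoint SSI and its average over the central European box can be sketched as follows. The cosine-latitude weighting in the average is an assumption for the sketch, as the paper does not specify the averaging weights.

```python
import numpy as np

def storm_severity_index(vmax, v98):
    """Gridpoint SSI following Klawa and Ulbrich (2003):
    SSI = (vmax / v98 - 1)**3 where the daily maximum gust vmax
    exceeds its local 98th climatological percentile v98, else 0."""
    vmax, v98 = np.asarray(vmax, float), np.asarray(v98, float)
    return np.where(vmax > v98, (vmax / v98 - 1.0) ** 3, 0.0)

def area_averaged_ssi(ssi, lat, lon, lat_box=(40.0, 60.0), lon_box=(-10.0, 30.0)):
    """Average the SSI field over central Europe (40N-60N, 10W-30E).
    Cosine-latitude weighting is assumed (not stated in the paper)."""
    lat, lon = np.asarray(lat), np.asarray(lon)
    ii = (lat >= lat_box[0]) & (lat <= lat_box[1])
    jj = (lon >= lon_box[0]) & (lon <= lon_box[1])
    sub = np.asarray(ssi)[np.ix_(ii, jj)]
    w = np.cos(np.radians(lat[ii]))[:, None] * np.ones(jj.sum())
    return float(np.sum(sub * w) / np.sum(w))
```

For instance, a gust of 30 m/s where the local 98th percentile is 20 m/s gives an SSI of (1.5 - 1)^3 = 0.125 at that grid point.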

Extreme Forecast Index and Shift of Tails
Forecasting extreme events is a challenge in numerical weather prediction, because predicted extremes tend to underestimate the magnitude of actual events. Lalaurette (2003) therefore introduced the Extreme Forecast Index (EFI), which measures the extremeness of an ensemble forecast as compared to the model climate rather than to the observed climate. The original formulation of the EFI was revised by Zsótér (2006), who included a weighting function to emphasize the tails of the distribution and obtained

EFI = (2/π) ∫₀¹ [p - F_f(p)] / √(p (1 - p)) dp,

with F_f(p) the proportion of ensemble members lying below the p quantile of the model climate. The EFI quantifies the deviation of an ensemble forecast from its climatological distribution with a unitless number between -1 (all members reach record-breaking low values) and +1 (record-breaking high values). Zsótér (2006) also introduced the Shift of Tails (SOT) as an additional index that focuses even more on the tail of the distribution,

SOT = [Q_f(p) - Q_c(p_0)] / [Q_c(p_0) - Q_c(p)],

with Q_f(p) and Q_c(p) the p quantiles of the ensemble forecast and of the model climate, respectively. The SOT indicates if a fraction of the ensemble members predicts an extreme event, even if the rest of the members do not. Following Zsótér (2006), p is taken as the 90th percentile, i.e. the top two members of the 11-member ensemble reforecast. As in the operational ECMWF configuration, p_0 is taken as the 99th percentile of the model climate, which is smoother than the 100th percentile (maximum) used by Zsótér (2006). A positive value of SOT thus means that at least two members predict an extreme event that belongs to the top percent of the model climate.
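Both indices can be sketched numerically for one grid point, given the ensemble values and a sample of the model climate. The interior quadrature grid for the EFI integral and the function names are implementation choices for this sketch, not taken from the paper.

```python
import numpy as np

def efi(members, climate):
    """Extreme Forecast Index (Zsótér, 2006): quadrature of the weighted
    gap between the climate quantile level p and the fraction F_f(p) of
    ensemble members below the model-climate p-quantile."""
    p = np.linspace(0.01, 0.99, 99)          # interior grid avoids endpoints
    dp = p[1] - p[0]
    q_clim = np.quantile(np.asarray(climate, float), p)
    members = np.asarray(members, float)
    f_f = np.array([np.mean(members < q) for q in q_clim])
    integrand = (p - f_f) / np.sqrt(p * (1.0 - p))
    return float(2.0 / np.pi * np.sum(integrand) * dp)

def sot(members, climate, p=0.90, p0=0.99):
    """Shift of Tails: positive when the forecast p-quantile exceeds the
    model-climate p0-quantile (the 99th percentile, as in the paper)."""
    qf = np.quantile(np.asarray(members, float), p)
    qc0, qc = np.quantile(np.asarray(climate, float), [p0, p])
    return float((qf - qc0) / (qc0 - qc))
```

An ensemble lying entirely above the model climate yields an EFI close to +1 and a positive SOT, while an ensemble centred on the climate median yields an EFI near 0 and a negative SOT.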
Both EFI and SOT are computed here for daily maximum wind gusts. For consistency with the SSI, the model climate is defined from the period mid-October to mid-March 1995/96-2014/15. This contrasts with the operational ECMWF configuration, where the model climate is defined for each forecast within a one-month window centred around the initialization time.
As the focus here is on winter storms, a seasonal model climate is preferred, to avoid storms being considered as more or less extreme depending on when they occur during the season. A longer period is also preferred to improve the representation of the 99th percentile of the model climate, as the length of the operational configuration has been validated for precipitation and temperature but not for wind gusts (Zsoter et al., 2015). Finally, as in the operational configuration, the model climate is computed separately at each lead time to compensate for any drift of the reforecast. However, high EFI or SOT values also coincide with weaker gusts over other regions. This suggests a potential for warnings but with possible false alarms, as already noted by Lalaurette (2003). The use of EFI and SOT thus requires an appropriate balance between hit rate and false alarm rate (Petroliagis and Pinson, 2014; Boisserie et al., 2016).

On average over all storms, the predicted MSLP remains close to ERA-Interim until day 4, but exhibits a clear positive bias, i.e. it underestimates the intensity of storms, from day 5 onwards (black curve in Figure 5a). The predicted MSLP also exhibits a large dispersion between the storms, which increases with increasing lead time (symbols in Figure 5a). The most striking outlier is storm Gero (red triangle), which shows the strongest positive biases with more than 60 and 40 hPa on days (Table 1). This suggests an impact of the storm intensity on its predictability, although no systematic link is found in the sample of storms. For instance, the second and third deepest storms Oratia and Stephen, which also experienced an explosive cyclogenesis, show contrasting positive and negative biases in MSLP depending on the lead time (green triangle and blue circle in Figure 5a).
The predicted MSLP of Gero also exhibits a negative bias on day 1, although this may be due to ERA-Interim underestimating the actual intensity because of its coarse horizontal resolution.
Concerning the position, the predicted longitude exhibits a negative bias on average, i.e. the storms are too slow in the reforecast from day 4 onwards (black curve in Figure 5b). A weak positive bias is present in the reforecast of the latitude but it does not appear to be significant (not shown). Similar to the predicted MSLP, the predicted longitude also exhibits a large dispersion between the storms, which increases with increasing lead time (symbols in Figure 5b). Storm Gero is again an outlier with strong negative biases at days 5 and 8 but the strongest biases are shown by ex-Lili at day 7 (blue square) and Dagmar at day 10 (blue cross). These two storms formed remotely from Europe, the former in the tropics (Browning et al., 1998, see also Figure 2) and the latter over the southeastern United States. This suggests a link between the poor predictability of the position and the difficulty of representing convective dynamics, especially during extratropical transition (e.g. Pantillon et al., 2013). However, storm "u19960207" shows a strong negative bias in longitude at day 7 (green square) though it developed over the eastern North Atlantic. This emphasizes that single factors can influence the predictability of specific storms but do not necessarily have a systematic impact.

As expected, the spread between the ensemble members increases regularly with lead time on average, both for the intensity (solid black curve in Figure 5c) and the position (solid black curve in Figure 5d). The spread is consistent with the median absolute error (dashed curve), which suggests that the ensemble reforecast is properly calibrated. However, a large dispersion is again found between the storms and the spread does not match the error for individual storms. The storms with a strong bias mentioned above tend to exhibit a small spread, i.e. their reforecast is overconfident.
Conversely, other storms with weaker biases tend to exhibit a larger spread. Although not tested here, this result raises the question of the meaningfulness of the ensemble mean at lead times beyond a few days, when the identification of storms becomes ambiguous. In particular, the number of members still containing the storm decreases when the lead time increases, which biases the ensemble mean. In the extreme case of ex-Lili for instance, all members of the 10-day reforecast valid on the day of maximum intensity have lost track of the storm on the day it reaches Europe, making this metric meaningless.
Using the alternative identification method focusing on the day of maximum intensity ensures that a storm is identified in each member of the ensemble. The predictability can then be measured by the number of members that match the actual storm.

Storm impact
The predictability of the selected storms is further evaluated for the impact of the wind gusts estimated from the SSI. Only the daily, spatially averaged SSI is evaluated here, without considering geographical information on where exactly the storm occurred. The reforecast is therefore evaluated for its ability to predict a severe storm on a specific day but anywhere over central Europe. It is compared to ERA-Interim as a logarithmic difference, because the SSI is highly nonlinear (Equation 1) and spans several orders of magnitude between the least and the most severe storms of the selection (Table 1). Finally, although the SSI is scaled locally with separate model climates for the reforecast and ERA-Interim, the predicted distribution of SSI is overestimated overall. The overestimation is strongest for the low quantiles of the distribution and decreases to a factor of about 2 in the higher quantiles. The predicted SSI is thus divided by a factor of 2 for ease of comparison unless stated otherwise.
On average over all storms, the reforecast is close to ERA-Interim until day 3 but then drops by one order of magnitude and thus strongly underestimates the SSI at longer lead times (solid curve in Figure 7a). This drop is specific to the sample of severe storms and is not due to a systematic drift in the reforecast, as illustrated by the 99th percentile of predicted SSI remaining almost constant with lead time (dashed curve). In addition, the average spread in SSI between ensemble members increases until day 3 only, before it decreases again when the average SSI drops (not shown). The reforecast is thus underdispersive at longer lead times. As for the MSLP, however, the predicted SSI shows a large dispersion between the storms (symbols). For instance, the deep storms Gero and Oratia are again outliers with strong negative biases at days 5, 8 and 9, whereas a few other storms even exhibit a positive bias.
These results are confirmed by measuring the number of members that predict at least the SSI of ERA-Interim, which also drops at day 4 (Figure 7b). Note that this is a rather pessimistic estimate, as the predicted SSI is divided by a factor of 2.
Before the drop at day 4, the number of members is further separated into two groups, with either a large majority or a small minority capturing the storms. This suggests that the reforecast systematically over- or underestimates the severity of individual storms. ERA-Interim may also contribute to the cases of overestimation by underestimating the actual SSI due to its limitation in representing the mesoscale structure of some storms. Beyond day 3, the reforecasts show a systematic underestimation of the SSI for almost all storms. However, at least one ensemble member on average still predicts the SSI of the storms until day 7, which suggests a potential for early warning based on individual members.

The skill of the reforecast is further evaluated over the whole dataset, including days with and without storms. It is computed with the Brier Score (Brier, 1950), which measures the ability of the reforecast to predict if an event will occur or not.
The Brier Score can be split into reliability, resolution and uncertainty components (Murphy, 1973). The reliability component measures the ability of the forecast to predict the observed frequency of events. A perfect reliability can be achieved with a climatological forecast and is thus not sufficient to be useful. In contrast, the resolution component measures the ability of the forecast to distinguish between events and non-events, which cannot be achieved with a climatological forecast. The uncertainty component finally measures the sampling uncertainty inherent to the events. The Brier Score is further compared to a climatological forecast to obtain the Brier Skill Score (BSS), i.e. the actual skill of the reforecast, which is in turn split into reliability and resolution components B_rel and B_res (e.g. Jolliffe and Stephenson, 2012). For extreme events, however, the components suffer from a large sampling uncertainty. This emphasizes that the dataset is too limited to investigate extreme events, which on average represent 8.2 events per lead time only. As a result, the Brier Skill Score suggests that the reforecast exhibits some skill at predicting extreme events until day 6 but it suffers from the same irregular evolution with lead time.
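The decomposition can be sketched for an 11-member ensemble, where forecast probabilities take the discrete values k/11. The binning by member count follows Murphy (1973); the function names are illustrative.

```python
import numpy as np

def brier_decomposition(prob, obs, n_members=11):
    """Brier score BS and its Murphy (1973) decomposition
    BS = reliability - resolution + uncertainty, for forecast
    probabilities taking the discrete values k / n_members."""
    prob, obs = np.asarray(prob, float), np.asarray(obs, float)
    bs = np.mean((prob - obs) ** 2)
    base = np.mean(obs)                      # climatological event frequency
    rel = res = 0.0
    for k in range(n_members + 1):
        sel = np.isclose(prob, k / n_members)
        if sel.any():
            w, o_k = sel.mean(), obs[sel].mean()
            rel += w * (k / n_members - o_k) ** 2   # reliability term
            res += w * (o_k - base) ** 2            # resolution term
    unc = base * (1.0 - base)
    return bs, rel, res, unc

def brier_skill_score(prob, obs, n_members=11):
    """BSS relative to climatology: (resolution - reliability) / uncertainty."""
    bs, rel, res, unc = brier_decomposition(prob, obs, n_members)
    return (res - rel) / unc
```

A perfect deterministic forecast (probability 0 or 1, always correct) has zero Brier Score, resolution equal to the uncertainty, and a BSS of 1.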

Area covered by damaging gusts
The potential for early warnings of strong gusts is further investigated with the EFI and SOT, which are both designed for this purpose by highlighting the behaviour of the most extreme ensemble members. As already noted by Lalaurette (2003), the EFI gives useful warnings of extreme events but also frequent false alarms. Petroliagis and Pinson (2014) therefore suggested the use of an optimal threshold to balance between hits and false alarms, a lower (higher) threshold increasing (decreasing) both the hits and the false alarms. Boisserie et al. (2016) further suggested maximizing the Heidke Skill Score (Heidke, 1926) as a trade-off between hit rate and false alarm rate. Following these authors, an optimal threshold is looked for to predict gusts that exceed the local 98th climatological percentile in ERA-Interim. This value is taken for consistency with the SSI. In contrast with these previous studies, however, which focused on specific storms or storm intensities, an optimal threshold is first computed for the whole dataset and only then applied to the selected storms. This ensures that the result is not biased by verifying the forecast with extreme events only.

Nat. Hazards Earth Syst. Sci. Discuss., doi:10.5194/nhess-2017-122, 2017. Manuscript under review for journal Nat. Hazards Earth Syst. Sci. Discussion started: 31 March 2017. © Author(s) 2017. CC-BY 3.0 License.
As shown in Figure 9a, the optimal threshold in EFI decreases with lead time, because both hit rate and false alarm rate decrease with lead time for a given threshold. In contrast, the optimal threshold in SOT is stable until day 6 and decreases at longer lead times only (Figure 9b). This is because, in this case, the increase in false alarm rate with lead time for a given threshold compensates for the decrease in hit rate (not shown). A constant threshold is thus suitable for the SOT only, and only in the early range. Otherwise, the dependency of the optimal thresholds on lead time should be taken into account for warnings.
The optimal thresholds further show seasonal and regional variability (not shown), which could also be included to improve warnings. For the sake of simplicity, however, this variability is not considered here.
Although the optimal threshold evolves differently with lead time for the EFI and the SOT, the corresponding Heidke Skill Score is very similar, with a slightly higher value for the EFI. It decreases regularly with increasing lead time but remains above zero (no skill) until day 10, the longest lead time investigated here. The decrease tightly follows the hit rate, while the false alarm rate slowly increases but remains small due to the rarity of events, by definition of the local 98th climatological percentile. Note that the false alarm rate, which is conditioned on the events that are not observed, should not be confused with the false alarm ratio, which is conditioned on the forecast events. These results demonstrate the actual potential of both EFI and SOT for the early warning of strong gusts. If the local 99th climatological percentile is preferred to define extreme events, as in earlier studies, the optimal thresholds need to be raised and the resulting skill becomes lower, but it also remains positive until day 10 (not shown).
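The threshold optimization can be sketched as follows, assuming binary warning and observation series per case; the names are illustrative and the HSS formula is the standard one for a 2x2 contingency table.

```python
import numpy as np

def heidke_skill_score(warn, obs):
    """Heidke Skill Score from the 2x2 contingency table:
    hits a, false alarms b, misses c, correct negatives d."""
    warn, obs = np.asarray(warn, bool), np.asarray(obs, bool)
    a = np.sum(warn & obs)
    b = np.sum(warn & ~obs)
    c = np.sum(~warn & obs)
    d = np.sum(~warn & ~obs)
    denom = float((a + c) * (c + d) + (a + b) * (b + d))
    return (2.0 * (a * d - b * c) / denom) if denom > 0 else 0.0

def optimal_threshold(index, obs, thresholds):
    """Threshold on the EFI (or SOT) maximizing the HSS, in the
    spirit of Boisserie et al. (2016)."""
    index = np.asarray(index, float)
    scores = [heidke_skill_score(index >= t, obs) for t in thresholds]
    k = int(np.argmax(scores))
    return thresholds[k], scores[k]
```

A threshold that perfectly separates events from non-events yields an HSS of 1; random warnings yield an HSS near 0.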

Application to the selected storms
The optimal thresholds described above are applied to the EFI and SOT for the selected severe storms in the reforecast. The Heidke Skill Score is again used as a trade-off between hit rate and false alarm rate. It is computed for the prediction of gusts over the central European domain on the day of maximum intensity of each storm. As for the whole dataset, the EFI (Figure 10a) and the SOT (Figure 10b) exhibit a similar Heidke Skill Score on average, which lies around 0.8 during the first two days (high skill) and then decreases with increasing lead time until vanishing at day 10 (no skill). Before day 10, however, the Heidke Skill Score is higher for the storms (solid curves) than for the whole dataset (dashed curves). This is related to higher hit rates for the storms, which enhance the skill despite higher false alarm rates (not shown). This does not necessarily mean that the reforecast is more skillful at predicting the presence than the absence of storms but rather emphasizes how focusing on observed events can bias the verification.

Beyond these average properties, the reforecasts of the storms exhibit contrasting skill from case to case. The dispersion between the storms quickly increases with increasing lead time and the Heidke Skill Score of some storms approaches zero or becomes negative from day 6 onwards (symbols in Figure 10). A poor skill is found in both EFI and SOT for storms Lili at day 7 and Gero at day 8, in association with a low hit rate, as well as for storm Joachim at day 7, in association with a high false alarm rate. This is consistent with the large biases in MSLP and longitude and the large spread in MSLP, respectively, found for these storms. Other storms contrast between poor skill in EFI and good skill in SOT, such as Yuma at day 4, whose forecast was already noted as difficult at the time (Young and Grahame, 1999), and Xynthia at day 6. The higher skill could be due to the higher hit rate of the SOT compared to the EFI, as suggested by Boisserie et al. (2016). However, no such difference is found here on average over the whole sample.

Storm Yuma has the lowest area of strong gusts of the whole dataset, followed by Lili, Gero and Xynthia (Table 1), which suggests a link between storm size and predictability. However, such a link is not systematic, as shown by storm Xaver, which exhibits almost no skill at day 6 in both EFI and SOT though it covers one of the largest areas of the dataset. Finally, storm Xynthia exhibits a surprisingly high skill at day 10 in both EFI and SOT thanks to a high hit rate. This constitutes an outlier compared to all other storms, which show no skill at that lead time. However, none of the ensemble members predicts the actual development of Xynthia over the subtropical North Atlantic (Ludwig et al., 2014). Instead, several members predict a storm forming over the central North Atlantic but reaching the Iberian Peninsula on the same day as Xynthia. Although this successful reforecast could be due to chance rather than to the actual skill of the model, it illustrates how predicting individual storms becomes ambiguous at long range but suggests a potential for predicting an environment favourable to storm development.

Conclusions

The synoptic-scale predictability of 25 severe historical winter storms over central Europe is revisited by taking advantage of the ECMWF ensemble retrospective forecast (reforecast), which offers a homogeneous dataset over 20 years with a state-of-the-art model. The ensemble average is unbiased until day 3 in predicting the position and minimum MSLP of the storms on the day of maximum intensity. At longer lead times, however, it systematically underestimates the speed of motion and the depth of the storms. This bias is accompanied by an increase in ensemble spread of a similar magnitude, which suggests that the ensemble is calibrated, but only a minority of ensemble members still captures the actual storm at lead times beyond 3-5 days. This questions the relevance of using the ensemble average at longer lead times. The situation differs from the classical averaging of ensemble members to smooth the unresolved scales, as the variables of interest are objects here rather than continuous fields. The ensemble average further underestimates the SSI of the storms at lead times longer than 3 days and the relative error reaches several orders of magnitude. The ensemble spread drops by orders of magnitude, which shows that the SSI of the storms is systematically underestimated. In contrast, there is no general drift in the reforecast, where severe events are present up to day 10. Similarly, the biases in position and intensity may not be systematic but rather result from the focus on intense storms that reach Europe. These results suggest that relevant predictions of storm properties are restricted to the first 2-4 days of lead time. This suggestion is supported by the ambiguity in identifying the storms at longer lead times in the reforecast.

A different methodology is therefore required at lead times longer than this 2-4-day horizon. Although they are missed by the ensemble average, the position, intensity and severity of the storms are captured by some members up to one week in advance or even beyond. As suggested by earlier studies, the whole distribution of the ensemble should thus be used, by shifting the focus from the average and spread to individual members for the prediction of extreme events. The danger with this approach, however, is to verify the predictions with regard to observed events only, i.e. by concentrating on the hit rate without accounting for the false alarms. The predictability is therefore investigated here in the whole dataset of 20 winter seasons, including both stormy and non-stormy days. Tracking is not used, because it can be applied to cyclones only and becomes ambiguous at long lead times. For intense events defined as the top 5% of the SSI, which span on average 7-8 days per winter, the reforecast exhibits a positive Brier Skill Score that regularly decreases until vanishing at day 9. For extreme events defined as the top 1% of the SSI, which approximately correspond to the 25 historical storms, the reforecast appears to exhibit a similar skill but suffers from a large sampling uncertainty at longer lead times. The EFI and SOT indices confirm the skill of the reforecast at predicting the area covered by strong wind gusts until day 10, for the storms as for the whole dataset. These results highlight the potential for early warnings of storms but also the difficulty of verifying forecasts of extreme events, even with the extended dataset used here.
While the metrics agree on average, they exhibit a high case-to-case variability. The predictability is particularly low for a few storms involving an explosive cyclogenesis, a tropical origin or a small area. However, no systematic pattern is found among the sample of storms and their predictability partly lacks consistency between lead times. A possible explanation lies in the paucity of data at single lead times and for each storm, as the reforecast is computed twice a week and contains 11 members only. A more frequent initialisation and a larger number of members may thus prove more capable of identifying systematic links between the dynamics and predictability of storms. The NOAA ensemble reforecast offers a daily initialization (Hamill et al., 2013) but appears not to perform as well as its ECMWF counterpart for predicting wind over central Europe (Dabernig et al., 2015). Furthermore, even in the operational ECMWF ensemble forecast, initialised every day and containing 50 members, Pirret et al. (2016) struggled to find a relation between the predictability and the intensity, track or physical processes of storms.
The predictability of the severe storms investigated here may not be linked to common factors but rather be due to characteristics of the individual storms. This suggests a fundamental limitation due to the nature of severe storms, which are extreme events and often do not follow standard patterns. More case studies are thus needed to better understand the predictability of specific storm features at different scales. They should be eased by the new generation of global and regional reanalyses that become available with a horizontal resolution high enough to better represent the storms. Alternatively, the focus of the predictability could be shifted from the storms to the large-scale conditions that favour their development (e.g. Pinto et al., 2014), in particular at longer lead times, when the identification of storms is ambiguous among the ensemble members.

Figure caption (fragment): ... for ease of comparison. The symbols represent the storms as given in Table 1 and the solid black curve shows the median of the storms per lead time, while the dashed black curve in (a) further shows the 99th percentile of SSI compared between reforecast and ERA-Interim.

Figure caption (fragment): The symbols represent the storms as given in Table 1 and the black curve shows the median of the storms per lead time, while the dashed curves illustrate the whole dataset for reference as in Figure 9.