Is considering runs (in)consistency so useless for weather forecasting?
Abstract. This paper addresses the issue of forecasting the weather using consecutive runs of a given numerical weather prediction (NWP) system. In the literature, considering how forecasts evolve from one run to another has never been proved relevant to predicting the upcoming weather. The usual approach is therefore to blend the successive runs together, which leads to the well-known "lagged" ensemble. However, some aspects of this approach are questionable, and even if the relationship between changes in forecasts and predictability has so far been considered weak, this does not mean that the door is closed. In this article, we further explore this relationship by focusing on a particular aspect of ensemble prediction systems: the persistence of a given weather scenario over consecutive runs. The idea is that the more a scenario persists over successive runs, the more likely it is to occur, but its likelihood is not necessarily estimated as it should be by the latest run alone. Using the regional ensemble of Météo-France, AROME-EPS, to forecast the probability that certain (warning-level) precipitation amounts are exceeded in 24 hours, it has been found that reliability, an important aspect of probabilistic forecasts, is highly sensitive to that persistence. The present study also shows that this dependency can be exploited to improve reliability, for both lagged ensembles and individual runs. From these results, some recommendations for forecasters are made, and the use of new predictors based on consecutive runs is encouraged for statistical post-processing. The reason for such sensitivity is also discussed, leading to new insight into weather forecasting using consecutive ensemble runs.
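For readers skimming the discussion below, here is a minimal Python sketch of the two quantities the reviewers refer to: the probability of exceedance given by the latest run and a "risk persistence" count over the preceding runs. The ensemble values, number of runs, threshold, and function names are invented for illustration; this is not the paper's code or its exact Eq. 1.

```python
import numpy as np

def exceedance_probability(members, threshold):
    """Fraction of ensemble members forecasting a 24 h accumulation >= threshold."""
    members = np.asarray(members, dtype=float)
    return float(np.mean(members >= threshold))

def risk_persistence(previous_runs, threshold):
    """Number of earlier runs in which at least one member exceeds the threshold.

    Note that the count is insensitive to the chronological order of the runs:
    shuffling `previous_runs` gives the same value.
    """
    return sum(
        int(np.any(np.asarray(run, dtype=float) >= threshold))
        for run in previous_runs
    )

# Hypothetical 24 h precipitation forecasts (mm) at one point, from four
# consecutive runs of a small ensemble; the last run plays the role of the
# latest run of day D (referred to as Z21 in the discussion below).
runs = [
    [12.0, 35.0, 18.0, 22.0],   # oldest run
    [30.0, 41.0, 27.0, 15.0],
    [38.0, 44.0, 33.0, 29.0],
    [42.0, 50.0, 36.0, 31.0],   # latest run
]
threshold_mm = 40.0             # warning-level 24 h accumulation (illustrative)

p_latest = exceedance_probability(runs[-1], threshold_mm)
persistence = risk_persistence(runs[:-1], threshold_mm)

print(f"Latest-run probability of exceeding {threshold_mm} mm: {p_latest:.2f}")
print(f"Risk persistence over the {len(runs) - 1} previous runs: {persistence}")
```

Under these assumptions, the same latest-run probability can come with very different persistence counts, which is the dependency the paper exploits to improve reliability.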
Status: final response (author comments only)
RC1: 'Comment on nhess-2024-208', Anonymous Referee #1, 20 Dec 2024
This paper investigates how forecast performance varies depending on the persistence of a given weather scenario between successive ensemble forecasts. The authors introduce a novel “risk persistence” measure and show that using this information in addition to the latest ensemble probability forecast can significantly improve forecast reliability.
There has been relatively little investigation of the consistency between successive forecasts in the literature, and previous studies have not found significant links between consistency and forecast skill. This paper therefore provides a welcome addition to this literature and presents some intriguing results that form the basis for further research.
The paper is clearly written, and I particularly appreciated the discussion section. Almost all the questions and comments I had while reading the paper were carefully addressed in this discussion section.
I have only minor additional comments, mainly for clarification.
Specific comments
- Section 4.1, L206-207. “at least one member of Z21 is predicting the exceedance”. Does this just apply to results in this section (figs 3-6)? Then in 4.2 (regression) you use all cases including where no members of Z21 predict exceedance of a given threshold?
- L65-66. “scale mismatch”. What is the spatial scale of the ANTILOPE observations (are these gridded?)? How different is this from the AROME-EPS grid scale? NB. the “upscaling” used here will not address any scale mismatch – you are still comparing a model grid box value against an observed (gridded or point) value when you take the maximum over a neighbourhood so any difference due to different scales in model and observations will still apply. But I agree this is a good neighbourhood procedure to address the double penalty issue.
- “overestimation” should be “underestimation”
- Fig 7. It is interesting that the regression with just the raw probabilities significantly degrades the performance. Any idea why that is?
- L340-341. “any non-zero probability is already a strong signal in itself”. Fig 8 (red curve for risk persistence =0) seems to contradict this (also figs 4,5), suggesting need to be cautious if non-zero probability in just latest run?
- L355-365. Very interesting that the resolution is not affected by the regression, and worth further research as you suggest (future work, not for this paper). While a monotonic transformation of probabilities will not affect the ROC, note that the logistic regression can in principle improve the resolution: you have shown in Fig. 8 that the raw probabilities are affected differently by the different risk-persistence values, so this is not a simple monotonic transformation (a small synthetic illustration of this point is sketched after these comments).
- L389-396. This is a very interesting aspect of the discussion and definitely worth further investigation. I would expect ensemble size to have a significant impact on the results, but then I would have expected that the regression would have a smaller impact on the lagged ensemble, which did not seem to be the case.
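To make the point about the logistic regression concrete, here is a minimal synthetic sketch (not the authors' setup: the data, coefficients, and predictor ranges are invented). Because the calibrated probability depends on both the raw probability and the persistence count, the same raw probability is mapped to different values, so the correction is not one monotonic curve applied to the raw probabilities alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Purely synthetic training set: a raw latest-run probability in [0, 1] and a
# risk-persistence count in {0, 1, 2, 3}, with the observed event made more
# likely when both predictors are large.
n = 5000
p_raw = rng.uniform(0.0, 1.0, n)
persistence = rng.integers(0, 4, n)
logit_true = -3.0 + 4.0 * p_raw + 0.8 * persistence
event = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit_true))

# Calibration model with the two predictors.
model = LogisticRegression()
model.fit(np.column_stack([p_raw, persistence]), event)

# For the same raw probability, the calibrated probability differs with the
# persistence count, i.e. this is not a single monotonic transformation of
# the raw probability (which would leave the ROC unchanged).
for k in range(4):
    p_cal = model.predict_proba([[0.3, k]])[0, 1]
    print(f"raw p = 0.30, persistence = {k} -> calibrated p = {p_cal:.2f}")
```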
Citation: https://doi.org/10.5194/nhess-2024-208-RC1
RC2: 'Comment on nhess-2024-208', Anonymous Referee #2, 21 Mar 2025
In the context of weather forecasting using lagged ensembles of model runs, the article proposes and tests the so-called risk-persistence metric. Risk persistence quantifies how consistently a precipitation scenario is forecasted for day D+1 across successive model runs on day D.
The idea is interesting and potentially useful. However, I have a few comments to improve the overall clarity of the work and better validate the proposed metric.
1. When identifying the research gap, the authors often refer to the chronological order (or evolution) of runs (e.g., lines 69, 87, 88). However, while novel, the proposed methodology does not really consider the chronological order of runs, but only how persistent the lagged forecasts are, irrespective of their order (see Eq. 1). This is also acknowledged in the discussion (lines 316-338). I would therefore suggest stating upfront in the introduction that the proposed methodology aims at improving the way we leverage lagged model forecasts but, at this stage, does not yet address how to explicitly exploit any added information from the chronological order of those runs.
2. Fig. 3, 4, and 6: in these three figures, the authors use continuous (solid) lines to indicate what the caption calls the "probability of exceedance predicted by Z21" (for the two cases risk_persistence=0 and risk_persistence=3). However, from what I understand, what is actually shown is the frequency of model forecasts exceeding the considered threshold. Using the word "probability" might therefore cause confusion; for example, it might lead readers to think the authors are referring to probabilities calculated with the logistic model of Eq. 2, although these are different concepts, as clearly stated in line 260 ("more skillful forecasts" vs. "raw ensembles").
3. Fig. 3: in my understanding, another conclusion that can be derived from this plot is that smaller precipitation events tend to be predicted more consistently by all model runs. If this is correct, I would state this in the article too.
4. Fig. 3 and 4: what is the difference between the blue and red solid curves shown in Fig. 3 and those shown in Fig. 4? Perhaps, using more informative axis labels for the y-axis may help improve clarity.
5. Lines 172-190 & Fig. 9 and 10: when studying the effects of station sampling, the authors considered the sensitivity to neighborhood size. I suggest also trying alternative samples of stations for the same, fixed neighborhood size. To validate the proposed methodology, I believe it is more important to show that results are consistent across different station-network realizations with the same 25 km neighborhood size, because of the considerations outlined at lines 181-184.
6. The proposed risk-persistence metric is validated by comparing forecasted probabilities with the observed frequencies, for fixed precipitation threshold exceedance (e.g., Fig. 5 and 7), as well as the forecasted and observed frequencies for varying thresholds (e.g., Fig 4). To obtain a more intuitive assessment of the importance of considering persistent forecasts in lagged model runs, I suggest also considering an event-by-event analysis, counting the number of times persistent (or brand new) forecasts made on day D are consistent with the amount of precipitation that is actually observed on day D+1 (maybe considering some tolerance), as well as the number of times those forecasts over- or under-estimated the actual precipitation amount.
7. Fig. 4: considering different benchmarks (dashed lines) depending on the value of risk persistence is a bit counterintuitive, since the final goal of the work is to determine whether using the proposed metric improves our forecasting capabilities. A better way (see also my previous comment) would probably be to show how often Z21 with risk_persistence=0 and risk_persistence=3, respectively, correctly forecasts next-day precipitation amounts (or threshold exceedances, to be consistent with the definition of risk persistence in Eq. 1), as well as how often those forecasts are incorrect, on an event-by-event basis; a minimal sketch of such a tally is given after these comments. This way, the benchmark is the same for both scenarios (i.e., risk_persistence=0 and risk_persistence=3), and forecasters can better understand whether considering the persistence of forecasted threshold exceedances in previous runs helps achieve better performance.
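One possible reading of this event-by-event suggestion, as a hedged sketch with made-up case data (not the authors' verification code): tally hits, false alarms, misses, and correct rejections separately for "persistent" and "brand new" exceedance forecasts, so that both groups are measured against the same observed events.

```python
import numpy as np

def contingency_by_persistence(forecast_exceeds, persistence, observed_exceeds):
    """Tally a 2x2 contingency table separately for 'persistent' (count > 0)
    and 'brand new' (count == 0) forecasts of a threshold exceedance."""
    tables = {}
    for label, mask in [("persistent", persistence > 0), ("brand new", persistence == 0)]:
        f = forecast_exceeds[mask]
        o = observed_exceeds[mask]
        tables[label] = {
            "hits": int(np.sum(f & o)),
            "false alarms": int(np.sum(f & ~o)),
            "misses": int(np.sum(~f & o)),
            "correct rejections": int(np.sum(~f & ~o)),
        }
    return tables

# Hypothetical day-ahead cases: whether the latest run forecast the exceedance,
# how many previous runs also did, and whether the exceedance was observed.
forecast_exceeds = np.array([True, True, True, False, True, False, True, True])
persistence      = np.array([3,    0,    2,    0,     0,    1,     3,    2])
observed_exceeds = np.array([True, False, True, False, True, False, True, False])

for label, table in contingency_by_persistence(forecast_exceeds,
                                               persistence,
                                               observed_exceeds).items():
    print(label, table)
```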
Minor comments:
1. Line 172: the double-penalty effect is mentioned but not explained; while a literature reference is provided, I would suggest also including a brief description in the text.
2. Line 113: “… as already experienced by many forecasters”. Please include one or two references about the observed large variability, both in space and time, of accumulated precipitation, at the end of this sentence.
3. Some parts of the manuscript need rewording to enhance clarity (e.g., lines 151-156).
4. Fig 3: for the largest RR24 thresholds, confidence intervals for the two curves overlap and cannot be seen.
5. Fig. 7, 9, and 10 are a bit too "crowded". Can the authors devise better visualization strategies?
Citation: https://doi.org/10.5194/nhess-2024-208-RC2