Forecasting avalanche danger: human-made forecasts vs. fully automated model-driven predictions
Abstract. In recent years, the integration of physical snowpack models coupled with machine-learning techniques has become more prevalent in public avalanche forecasting. When combined with spatial interpolation methods, these approaches enable fully data- and model-driven predictions of snowpack stability or avalanche danger at any given location. This prompts the question: Are such detailed spatial model predictions sufficiently accurate for use in operational avalanche forecasting? We evaluated the performance of three spatially interpolated, model-driven forecasts of snowpack stability and avalanche danger by comparing them with human-generated public avalanche forecasts in Switzerland over two seasons as a benchmark. Specifically, we compared the predictive performance of model predictions versus human forecasts using observed avalanche events (natural or human-triggered) and non-events. To do so, we calculated event ratios as proxies for the probability of avalanche release due to natural causes or due to human load, given either the interpolated model output or the human-generated avalanche forecast. Our findings revealed that the event ratio increased strongly with rising predicted probability of avalanche occurrence, decreasing snowpack stability, or increasing avalanche danger. Notably, model predictions and human forecasts showed similar predictive performance. In summary, our results indicate that the investigated models captured regional patterns of snowpack stability or avalanche danger as effectively as human forecasts, though we did not investigate forecast quality for specific events. We conclude that these model chains are ready for systematic integration in the forecasting process. Further research is needed to explore how this can be effectively achieved and how to communicate model-generated forecasts to forecast users.
RC1: 'Comment on nhess-2024-158', Florian Herla, 18 Oct 2024
## Summary
The manuscript titled "Forecasting avalanche danger: human-made forecasts vs. fully automated model-driven predictions" presents a novel approach for evaluating the performance of avalanche hazard assessment models, a highly relevant topic within the scope of NHESS. By leveraging data sets of natural and human-triggered avalanches as well as GPS tracks of backcountry users, the authors pursue a statistical exercise to address two main questions: (1) Can spatially interpolated model predictions of avalanche danger and snowpack stability reflect observed avalanche activity, and (2) how well do the automated predictions perform relative to human-made avalanche bulletins? The authors conclude that the model-predicted probabilities correlated strongly with their proxy variable for the probability of avalanche release, and that the model predictions discriminate between different avalanche hazard situations as well as the human-made bulletins do. These are substantial findings that the authors introduce and discuss well in the context of underlying assumptions, existing literature, and the future of avalanche forecasting.

I have one main comment that could help make the manuscript even stronger. The comparison between human and model performance in discriminating between different conditions (Fig. 5) is not completely independent. In L254ff the authors explain how the model predictions are tied to the human predictions, and in L381--383 they discuss that this could reduce the estimated model performance. To make the present approach more transferable to other countries and the results more illustrative in general, I would greatly appreciate two numerical experiments that simulate (a) lower-quality bulletin data and (b) worse model predictions. In the first step, all data could be held equal except for the reported danger rating, which could be perturbed with a given standard deviation. In the second step, only the model predictions would be perturbed. This experiment would add another figure similar to Fig. 5, which tells us (a) whether this approach will always cap model performance at the level of human performance, and (b) how a significantly worse model prediction would line up on this rather abstract visualization. This new figure could help the reader appreciate the strong results even more, and help other warning agencies assess whether this evaluation strategy is suited for their contexts (e.g., less consistent and accurate danger rating data).
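To make this suggestion more concrete, here is a minimal sketch in Python of how the two perturbations could be implemented; the variable names (`danger_rating`, `model_prob`), the sub-level encoding, and the noise magnitudes are my own assumptions, not the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(values, sd, lo, hi):
    """Add Gaussian noise with standard deviation `sd` and clip to [lo, hi]."""
    return np.clip(values + rng.normal(0.0, sd, size=values.shape), lo, hi)

# Hypothetical inputs: forecast danger sub-levels and model probabilities.
danger_rating = rng.choice([1.0, 1.7, 2.0, 2.3, 2.7, 3.0], size=1000)
model_prob = rng.uniform(0.0, 1.0, size=1000)

# (a) degrade only the bulletin data, keeping the model predictions fixed
danger_noisy = perturb(danger_rating, sd=0.5, lo=1.0, hi=5.0)

# (b) degrade only the model predictions, keeping the bulletin data fixed
model_noisy = perturb(model_prob, sd=0.2, lo=0.0, hi=1.0)

# Re-running the event-ratio analysis (Fig. 5) on `danger_noisy` and
# `model_noisy` would show whether the binning caps model performance
# at the level of the (now degraded) human forecast.
```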
Another comment along similar lines, but outside the scope of this manuscript unless the authors have actually investigated the following thoughts already. To avoid the risk of capping model performance, one could make the bins entirely independent of the human distribution of danger ratings. The results may be less suited for comparing the model and human predictions, but we may learn better in which interval ranges the probabilities are most capable of discriminating conditions. And lastly, I fully buy into the point made by the authors that there is a limit to the value of comparing model to human data sets when it is unclear which data set is closer to reality when they disagree. However, I personally would be more than curious what an actual day-to-day comparison over an entire season in a prominent region would look like in the Swiss data set (for example, similar to Fig. 5 in Herla et al., 2024).
The storyline is sound, focused on the research objectives, and communicated well. Congratulations to the authors for this contribution!
## Detailed comments
### Abstract
L14: "We conclude that these model chains are ready for systematic integration in the forecasting process." Consider adding a statement like "in Switzerland" or giving other warning agencies in other snow climates a heads-up that other modeling pipelines might not be on par.### Introduction
L47: Consider using a clearer wording, for example, *...when they independently forecast avalanche danger with a similar skill as expert forecasters*.

L53: I find the statement of the first objective, "(1) Is the expected increase in the number of natural avalanches or in locations susceptible to human-triggering of avalanches predicted by spatially interpolated model predictions?", more complicated than it needs to be. Consider rephrasing it, e.g., "Can spatially interpolated models predict the observed increase in the number of natural avalanches or ...".
### Models
L80: Please add that the instability model is suited for dry slab avalanches only.

L86: Please add that the natural-avalanche model is suited for dry slab avalanches only.
L102: Downscaling weather model output to point scales is a complex endeavour. Can you please describe the key modifications to the raw NWP output before you refer the reader to Mott et al. (2023)?
### Data
L111: "We analysed ..." (past tense)Paragraph 3.3: In principle, the analysis would be complete with comparisons of natural and human-triggered avalanches to a reference distribution. By including Non-events (approximated by GPS tracks), you offer another perspective on evaluating performance for human triggering, a bonus so to say. Given that readers likely have a strong opinion about using GPS tracks to approximate Non-events (see next comment), I suggest you make this point ("it's a bonus") more clear to the reader. For me, it was helpful to understand that in Figure 2 the box "Events/Human-triggered avalanches" caused two arrows, one to the data set that links human-triggered avalanches to the reference distribution and one to the data set that links human-triggered avalanches to the GPS tracks. This nicely visualizes that you examined human-triggering of avalanches from two complementary perspectives: one more theoretical, and the other purely data-driven, though relying on assumptions that are not easily quantified.
L142: GPS tracks as non-events: This approach assumes that an avalanche would have occurred if a skier had loaded the snowpack and it was unstable. That assumption is more valid for surface problems than for persistent problems buried more deeply. In the latter case we know from avalanche accidents that it is not always the first skier who triggers the avalanche, particularly since the characteristics of the slab and the depth of the weak layer vary within a slope. Moreover, I assume that the snowpack at popular ski tours or just outside of ski resorts is heavily modified by skier traffic throughout the season. Within the typical skier corridors, weak layers will likely be destroyed and the primary avalanche problems will likely be new snow and wind slab problems. Can you please discuss these thoughts and their potential effect on the results in the Discussion and refer to that discussion from Sect. 3.3? It would nicely add to the paragraph in L384ff.
L149: Consider adding "... due to forecast, encountered avalanche conditions, or previous terrain use" (or similar).
L175: I suggest changing "avoid" to "minimize".
### Methods
L195: "the random subset of grid points used as reference distribution". This is the first location that mentions that the reference distribution is based on a randomly selected subset of grid points. Please add a statement that tells the reader that this concept will be explained in detail below in Sect. 4.3.Footnote 1: I think you can simply omit the footnote, particularly since you cite the same publication at the end of the sentence anyways.
L203: Consider rephrasing the sentence to e.g., "For locations and elevations with discontinuous or non-existent snow cover, we set Pr = 0."
L207: Thanks for providing the code to this analysis. Could you still summarize the high-level tuning (i.e., the hyperparameter settings) in the text of the manuscript please?
Footnote 2: Please mention the software package used to implement the regression kriging.
L221: How sensitive are the results to the choice of 2.5% of all grid points, and how did you decide on that number? I assume the analysis is computationally fairly inexpensive. In that case, could you easily re-run the analysis for other values and report on the main differences? Also, I think the elevation filter should ideally be applied before the random sampling.
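For illustration, such a re-run could look like the following minimal sketch; the function `reference_subset`, the synthetic grid, and the 1800 m elevation threshold are hypothetical stand-ins for the authors' actual pipeline, and the sketch also applies the elevation filter before the random sampling, as suggested above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid: x, y, elevation for 100 000 points.
grid = np.column_stack([rng.uniform(0, 1, 100_000),
                        rng.uniform(0, 1, 100_000),
                        rng.uniform(500, 4000, 100_000)])

def reference_subset(grid_points, fraction, elevation_min=1800.0):
    """Apply the elevation filter first, then draw the random reference subset.
    The 1800 m threshold is a made-up value for illustration."""
    above = grid_points[grid_points[:, 2] >= elevation_min]
    idx = rng.choice(len(above), size=int(fraction * len(above)), replace=False)
    return above[idx]

for fraction in (0.01, 0.025, 0.05, 0.10):
    ref = reference_subset(grid, fraction)
    # ... re-run the event-ratio analysis on `ref` and compare the curves ...
    print(f"fraction={fraction:.3f}: {len(ref)} reference grid points")
```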
L234: Consider rephrasing to e.g., "systematic biases exist between the *forecast* and *nowcast* predictions"
L236: "whether the models reflect the expected increase in avalanche occurrence probability with increasing model-predicted probability." I found this sentence somewhat confusing and suggest to either delete 'with increasing...' or to rewrite that last part like 'by predicting higher probabilities themselves'.
L237f: Can you add a brief statement why you chose different bin widths?
L241 & L243: Instead of "for cases when we relied on the reference distribution:" and "when using non-events:", please call it the same way as in Fig. 2 and 4, i.e., 'for natural and human-triggered avalanches' and 'for backcountry data'. The equations tell the reader already when the reference distributions and non-events are used.
L254: "To obtain bins containing an equal number of data points for human forecasts and for model predictions, ...". I am not sure whether this is the correct justification. I do buy into that binning approach in order to compare human and model data, but I assume this is rather necessary because the danger rating reflects a non-linear increase in hazard (e.g., Schweizer et al, 2020; Techel et al 2022), whereas the model predictions reflect non-linear increases of other functional shapes (maybe sigmoidal?), e.g. Figure 8 in Mayer, Techel, et al (2023). In other words, there needs to be a mapping of some sort, which you implement through the binning. Do I understand that correctly?
L257: Please add ", etc." after 17.8%. That is, only if I interpret the statement "For higher sub-levels, we proceeded in the same way" correctly.
L266 & L268: Same comment as for L241 & L243 above.
### Results
Figure 4: "... middle row: events with (d) natural avalanches, (e) human-triggered avalanches and (f) human-triggered avalanches during backcountry touring;". I don't fully understand the difference between the data used for panel (e) and (f). Can you please make that more clear.Sections 5.1: Very explainable and encouraging results! Great to see it all come together after an intense workout of data acrobatics beforehand ;-)
Figures 5, B1--3: I am confused why the human forecast is further stratified into the models. More specifically, why are there three lines in Fig. B1 b, d, f that are colored according to different models? As far as I understand, each panel corresponds to one specific data set, e.g. Fig. B1d contains all natural avalanches and there should be one curve that displays the proportion of issued danger ratings. Please make sure that a correct explanation is in the text and that the reader will find that explanation from the figure captions.
Additional table: In Figures 5 and B1--3, the x-axis allows for translating between the Bin and D_s*. For example, Danger level 3- corresponds to bin 5. Please add a table to the Results section or Appendix, that shows the thresholds for each of the 10 bins and for each of the 3 model types, similarly to L259.
### Discussion
L429: I suggest changing recreationists to backcountry users.

Paragraph 6.6: I suggest you re-iterate somewhere in the paragraph (e.g., L461) that the conclusions are valid for danger level, probability of avalanche release, and instability, but not for avalanche problems or other specific characteristics.
Citation: https://doi.org/10.5194/nhess-2024-158-RC1