Forecasting avalanche danger: human-made forecasts vs. fully automated model-driven predictions
Abstract. In recent years, the integration of physical snowpack models coupled with machine-learning techniques has become more prevalent in public avalanche forecasting. When combined with spatial interpolation methods, these approaches enable fully data- and model-driven predictions of snowpack stability or avalanche danger at any given location. This prompts the question: Are such detailed spatial model predictions sufficiently accurate for use in operational avalanche forecasting? We evaluated the performance of three spatially interpolated, model-driven forecasts of snowpack stability and avalanche danger by comparing them with human-generated public avalanche forecasts in Switzerland over two seasons as a benchmark. Specifically, we compared the predictive performance of model predictions versus human forecasts using observed avalanche events (natural or human-triggered) and non-events. To do so, we calculated event ratios as proxies for the probability of avalanche release due to natural causes or due to human load, given either interpolated model output or the human-generated avalanche forecast. Our findings revealed that the event ratio increased strongly with rising predicted probability of avalanche occurrence, decreasing snowpack stability, or increasing avalanche danger. Notably, model predictions and human forecasts showed similar predictive performance. In summary, our results indicate that the investigated models captured regional patterns of snowpack stability or avalanche danger as effectively as human forecasts, though we did not investigate forecast quality for specific events. We conclude that these model chains are ready for systematic integration in the forecasting process. Further research is needed to explore how this can be effectively achieved and how to communicate model-generated forecasts to forecast users.
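To illustrate the event-ratio idea described in the abstract, here is a minimal sketch with synthetic data (all variable names and distributions are hypothetical assumptions, not the authors' actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic model-predicted probabilities: events (avalanches) tend to
# receive higher predictions than non-events / reference points.
p_events = rng.beta(5, 2, 500)
p_non_events = rng.beta(2, 5, 5000)

bins = np.linspace(0, 1, 11)  # ten equal-width probability bins
n_ev, _ = np.histogram(p_events, bins=bins)
n_ne, _ = np.histogram(p_non_events, bins=bins)

# Event ratio per bin: share of events among all points falling in the bin,
# used as a proxy for the probability of avalanche release.
total = n_ev + n_ne
event_ratio = np.divide(n_ev, total, out=np.full(len(n_ev), np.nan),
                        where=total > 0)
for lo, hi, r in zip(bins[:-1], bins[1:], event_ratio):
    print(f"p in [{lo:.1f}, {hi:.1f}): event ratio = {r:.2f}")
```

A well-calibrated model should show the event ratio rising monotonically across the bins, which is the pattern the paper reports for both model predictions and human forecasts.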
Status: final response (author comments only)
RC1: 'Comment on nhess-2024-158', Florian Herla, 18 Oct 2024
## Summary
The manuscript titled "Forecasting avalanche danger: human-made forecasts vs. fully automated model-driven predictions" presents a novel approach for evaluating the performance of avalanche hazard assessment models, a highly relevant topic within the scope of NHESS. By leveraging data sets of natural and human-triggered avalanches as well as GPS tracks of backcountry users, the authors pursue a statistical exercise to address two main questions: (1) Can spatially interpolated model predictions of avalanche danger and snowpack stability reflect observed avalanche activity, and (2) how well do the automated predictions perform relative to human-made avalanche bulletins? The authors conclude that the model-predicted probabilities correlated strongly with their proxy variable for the probability of avalanche release, and that the model predictions discriminate between different avalanche hazard situations as well as the human-made bulletins do. These are substantial findings that the authors introduce and discuss well in the context of underlying assumptions, existing literature, and the future of avalanche forecasting.

I have one main comment that could help make the manuscript even stronger. The comparison between human and model performance in discriminating between different conditions (Fig. 5) is not completely independent. In L254ff the authors explain how the model predictions are tied to the human predictions, and in L381--383 they discuss that this could reduce the estimated model performance. To make the present approach more transferable to other countries and the results more illustrative in general, I would greatly appreciate two numerical experiments that simulate (a) lower-quality bulletin data and (b) worse model predictions. In the first step, all data could be held equal except for the reported danger rating, which could be perturbed with a given standard deviation. In the second step, only the model predictions would be perturbed. This experiment would add another figure similar to Fig. 5, which tells us (a) whether this approach will always cap model performance at the level of human performance, and (b) how a significantly worse model prediction would line up on this rather abstract visualization. This new figure could help the reader appreciate the strong results even more, and help other warning agencies assess whether this evaluation strategy is suited for their contexts (e.g., less consistent and accurate danger rating data).
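A minimal sketch of the two perturbation experiments proposed above (synthetic data; all names and the choice of noise scale `sd` are hypothetical, and the actual re-evaluation would reuse the authors' pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_ratings(ratings, sd):
    """Experiment (a): degrade bulletin danger ratings with Gaussian noise,
    then round and clip back to the valid levels 1-5."""
    noisy = ratings + rng.normal(0.0, sd, size=len(ratings))
    return np.clip(np.round(noisy), 1, 5).astype(int)

def perturb_probabilities(p, sd):
    """Experiment (b): degrade model probabilities on the logit scale so the
    perturbed values remain in (0, 1)."""
    logit = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-(logit + rng.normal(0.0, sd, size=len(p)))))

# Re-running the event-ratio evaluation (as in Fig. 5) for several sd values
# would show (a) whether the binning caps model performance at human
# performance and (b) how a clearly worse model lines up in that figure.
ratings = np.array([2, 3, 3, 4, 2, 3])
print(perturb_ratings(ratings, sd=0.5))
print(perturb_probabilities(np.array([0.1, 0.4, 0.8]), sd=0.3))
```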
Another comment along similar lines, but outside the scope of this manuscript unless the authors actually investigated the following thoughts already. To not run the risk of capping model performance, one could make the bins entirely independent of the human distribution of danger ratings. The results may be less suited for comparing the model and human predictions, but we may learn better in which interval ranges the probabilities are most capable of discriminating conditions. And lastly, I fully buy into the point made by the authors that there is a limit to the value of comparing model to human data sets when it is unclear which data set is closer to reality when they disagree. However, I personally would be more than curious what an actual day-to-day comparison over an entire season in a prominent region looks like in the Swiss data set (for example, similar to Fig 5 in Herla et al, 2024).
The storyline is sound, focused on the research objectives, and communicated well. Congratulations to the authors for this contribution!
## Detailed comments
### Abstract
L14: "We conclude that these model chains are ready for systematic integration in the forecasting process." Consider adding a statement like "in Switzerland" or giving other warning agencies in other snow climates a heads-up that other modeling pipelines might not be on par.### Introduction
L47: Consider using a clearer wording, for example, *...when they independently forecast avalanche danger with a similar skill as expert forecasters*.

L53: I find the statement of the first objective, "(1) Is the expected increase in the number of natural avalanches or in locations susceptible to human-triggering of avalanches predicted by spatially interpolated model predictions?", more complicated than it needs to be. Consider rephrasing it, e.g., "Can spatially interpolated models predict the observed increase in the number of natural avalanches or ...".
### Models
L80: Please add that the instability model is suited for dry slab avalanches only.

L86: Please add that the natural-avalanche model is suited for dry slab avalanches only.
L102: Downscaling weather model output to point scales is a complex endeavour. Can you please describe the key modifications to the raw NWP output before you refer the reader to Mott et al. (2023)?
### Data
L111: "We analysed ..." (past tense)Paragraph 3.3: In principle, the analysis would be complete with comparisons of natural and human-triggered avalanches to a reference distribution. By including Non-events (approximated by GPS tracks), you offer another perspective on evaluating performance for human triggering, a bonus so to say. Given that readers likely have a strong opinion about using GPS tracks to approximate Non-events (see next comment), I suggest you make this point ("it's a bonus") more clear to the reader. For me, it was helpful to understand that in Figure 2 the box "Events/Human-triggered avalanches" caused two arrows, one to the data set that links human-triggered avalanches to the reference distribution and one to the data set that links human-triggered avalanches to the GPS tracks. This nicely visualizes that you examined human-triggering of avalanches from two complementary perspectives: one more theoretical, and the other purely data-driven, though relying on assumptions that are not easily quantified.
L142: GPS tracks as non-events: This approach assumes that an avalanche would have occurred if a skier loaded the snowpack and it was unstable. That assumption is more valid for surface problems than for persistent problems buried more deeply. In the latter case we know from avalanche accidents that it's not always the first skier who triggers the avalanche, particularly since the characteristics of the slab and depth of the weak layer vary within a slope. Moreover, I assume that the snowpack at popular ski tours or just outside of ski resorts is heavily modified by skier traffic throughout the season. Within the typical skier corridors, weak layers will likely be destroyed and the primary avalanche problems will likely be new snow and wind slab problems. Can you please discuss these thoughts and their potential effect on the results in the Discussion and refer to that discussion from Sect. 3.3? It would nicely add to the paragraph in L384ff.
L149: Consider adding "... due to forecast, encountered avalanche conditions, or previous terrain use" (or similar).
L175: I suggest changing "avoid" to "minimize".
### Methods
L195: "the random subset of grid points used as reference distribution". This is the first location that mentions that the reference distribution is based on a randomly selected subset of grid points. Please add a statement that tells the reader that this concept will be explained in detail below in Sect. 4.3.Footnote 1: I think you can simply omit the footnote, particularly since you cite the same publication at the end of the sentence anyways.
L203: Consider rephrasing the sentence to e.g., "For locations and elevations with discontinuous or non-existent snow cover, we set Pr = 0."
L207: Thanks for providing the code to this analysis. Could you still summarize the high-level tuning (i.e., the hyperparameter settings) in the text of the manuscript please?
Footnote 2: Please mention the software package used to implement the regression kriging.
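For readers unfamiliar with regression kriging, a generic sketch using Python's pykrige is shown below (one possible implementation only; the manuscript does not name the package and the authors may well have used different software, and all data here are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from pykrige.rk import RegressionKriging

rng = np.random.default_rng(1)
n = 200
coords = rng.uniform(0, 100, size=(n, 2))         # station coordinates (km)
elevation = rng.uniform(1000, 3000, size=(n, 1))  # auxiliary predictor (m)
# Synthetic target: an elevation-driven trend plus spatially correlated
# variation (purely illustrative, not an avalanche model output).
target = (0.0002 * elevation[:, 0]
          + 0.1 * np.sin(coords[:, 0] / 10)
          + rng.normal(0, 0.05, n))

rk = RegressionKriging(
    regression_model=LinearRegression(),  # trend from auxiliary predictors
    variogram_model="spherical",          # residuals interpolated by kriging
    n_closest_points=10,
)
rk.fit(elevation, coords, target)

# Predict at new grid points (regression trend + kriged residual).
grid_coords = rng.uniform(0, 100, size=(5, 2))
grid_elev = rng.uniform(1000, 3000, size=(5, 1))
print(rk.predict(grid_elev, grid_coords))
```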
L221: How sensitive are the results to the choice of 2.5% of all grid points, and how did you decide on that number? I assume the analysis is computationally fairly inexpensive. In that case, could you easily re-run the analysis for other values and report on the main differences? Also, I think the elevation filter should ideally be applied before the random sampling.
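The suggested sensitivity check could be scripted roughly as follows (`run_evaluation` is a hypothetical stand-in for the full event-ratio pipeline; grid size and fractions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
grid_ids = np.arange(1_000_000)  # ids of all grid points (illustrative size)

def run_evaluation(sample_ids):
    """Hypothetical stand-in for re-computing event ratios on a sample."""
    return len(sample_ids)  # placeholder metric

for fraction in (0.01, 0.025, 0.05, 0.10):
    # Ideally, apply the elevation filter before drawing the random sample.
    sample = rng.choice(grid_ids, size=int(fraction * grid_ids.size),
                        replace=False)
    print(f"fraction={fraction}: metric={run_evaluation(sample)}")
```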
L234: Consider rephrasing to e.g., "systematic biases exist between the *forecast* and *nowcast* predictions"
L236: "whether the models reflect the expected increase in avalanche occurrence probability with increasing model-predicted probability." I found this sentence somewhat confusing and suggest to either delete 'with increasing...' or to rewrite that last part like 'by predicting higher probabilities themselves'.
L237f: Can you add a brief statement why you chose different bin widths?
L241 & L243: Instead of "for cases when we relied on the reference distribution:" and "when using non-events:", please call it the same way as in Fig. 2 and 4, i.e., 'for natural and human-triggered avalanches' and 'for backcountry data'. The equations tell the reader already when the reference distributions and non-events are used.
L254: "To obtain bins containing an equal number of data points for human forecasts and for model predictions, ...". I am not sure whether this is the correct justification. I do buy into that binning approach in order to compare human and model data, but I assume this is rather necessary because the danger rating reflects a non-linear increase in hazard (e.g., Schweizer et al, 2020; Techel et al 2022), whereas the model predictions reflect non-linear increases of other functional shapes (maybe sigmoidal?), e.g. Figure 8 in Mayer, Techel, et al (2023). In other words, there needs to be a mapping of some sort, which you implement through the binning. Do I understand that correctly?
L257: Please add ", etc." after 17.8%. That is, only if I interpret the statement "For higher sub-levels, we proceeded in the same way" correctly.
L266 & L268: Same comment as for L241 & L243 above.
### Results
Figure 4: "... middle row: events with (d) natural avalanches, (e) human-triggered avalanches and (f) human-triggered avalanches during backcountry touring;". I don't fully understand the difference between the data used for panel (e) and (f). Can you please make that more clear.Sections 5.1: Very explainable and encouraging results! Great to see it all come together after an intense workout of data acrobatics beforehand ;-)
Figures 5, B1--3: I am confused why the human forecast is further stratified into the models. More specifically, why are there three lines in Fig. B1 b, d, f that are colored according to different models? As far as I understand, each panel corresponds to one specific data set, e.g. Fig. B1d contains all natural avalanches and there should be one curve that displays the proportion of issued danger ratings. Please make sure that a correct explanation is in the text and that the reader will find that explanation from the figure captions.
Additional table: In Figures 5 and B1--3, the x-axis allows for translating between the Bin and D_s*. For example, Danger level 3- corresponds to bin 5. Please add a table to the Results section or Appendix, that shows the thresholds for each of the 10 bins and for each of the 3 model types, similarly to L259.
### Discussion
L429: I suggest changing recreationists to backcountry users.

Paragraph 6.6: I suggest you re-iterate somewhere in the paragraph (e.g., L461) that the conclusions are valid for danger level, probability of avalanche release, and instability, but not for avalanche problems or other specific characteristics.
Citation: https://doi.org/10.5194/nhess-2024-158-RC1
RC2: 'Comment on nhess-2024-158', Christoph Mitterer, 08 Dec 2024
Review on
Forecasting avalanche danger: human-made forecasts vs. fully automated model-driven predictions
by
Frank Techel, Stephanie Mayer, Ross S. Purves, Günter Schmudlach, and Kurt Winkler
Summary
The study explores the effectiveness and consistency of human-generated compared to fully automated, model-driven avalanche danger forecasts by addressing two main questions:
- Does the spatial interpolation of model predictions indicate an anticipated rise in natural avalanche occurrences or an increase in areas prone to human-triggered avalanches?
- Can fully automated, data- and model-driven avalanche forecasts deliver performance levels comparable to those created by human forecasters?
In order to answer these two questions, the authors compare data sets of natural and human-triggered avalanches and GPS tracks of backcountry activity to human-made and fully automated model-driven forecasts. The human-made forecasts rely on the daily published bulletin data of Switzerland for two consecutive winter seasons (2022/2023 and 2023/2024). The model-based forecasts are explored in different modes (forecast and nowcast) for three different models (danger-level model, instability model, natural-avalanche model).
Using event ratios as a proxy, the authors examine the relative accuracy and consistency of human- and machine-made forecasts by comparing the spatial interpolation of the various forecasts to (1) a reference distribution containing events/non-events, and (2) recorded events of natural and human-triggered avalanches and non-events.
The results reveal that human-made and machine-made forecasts show similar relative predictive behaviour, i.e., an increase in all model probabilities correlates with an increase in avalanche release probability. The authors did not investigate specific or absolute behaviour. This leads the authors to the final conclusion that it is timely to introduce model-based forecasting into the operational settings of avalanche warning services.
Evaluation
The presented manuscript has a clear storyline and applies, in large parts transparently and comprehensibly, a sound set of methods to obtain innovative results in the field of model-based forecasts assessing avalanche conditions at a regional scale. The data set is innovative, and the methods have already been established by the authors in other contributions (Degraeuwe et al., 2024; Techel et al., 2022; Winkler et al., 2021). The approach and results are scientifically relevant and will have a major impact on this specific topic for the community.
Most parts of the manuscript are concise, well-structured and nicely written. Some parts, though, of the Data (3.1 Model predictions) and Methods (4.4 Analysis) sections were, at least for me, not easy to follow. Also, within the Results section there are parts, especially the ones pointing to Table 1, that are hard to follow and grasp. In addition, the Discussion section is very broad and remains too vague in some parts for my taste. I would advise the authors to have the courage to draw more direct conclusions so that the manuscript can have even more impact on the community's operational approaches and future directions.
Also, the figures and tables, and especially their captions, will need a bit of retouching to make them even clearer (see comments directly in the manuscript). I am convinced that this excellent work should be published in NHESS once my general and specific suggestions for improvement have been addressed.
General comments
My general comments touch on two different aspects of the manuscript: one is of a more technical nature, the other concerns the comprehensibility of the manuscript and thus sometimes drifts into questions of taste. In this respect, I am fully aware that parts of my second comment cannot, of course, be decided objectively.
Technical comments:
- The reasoning for why the authors adjusted the danger-level model by opting for Pr(D≥3) rather than D (Lines 110-117), including Figure A1, is unclear. The authors refer to an "expected danger value", which might or might not be connected to Pérez-Guillén et al. (2024) or to the concept of an expected danger level presented in Maissen et al. (2024). It remains unclear, however, since no citation is given. Since this altering of the danger-level model might affect central results, the reader needs more details on why the authors decided to do so. In addition, I would love to see what the results would look like if the authors did not introduce this classification criterion but instead demonstrated the outcome of predicting the danger level D (see the sketch after this list).
- My main comment here: the authors have now nicely analysed the median relative performance; is it now possible to get to the more complex cases, i.e., extremes or misses by humans? The authors did tremendous work creating an objective test data set. I highly acknowledge that, so why not use this approach to tackle this further question.
With this study we have learned that, in relative terms, the machine thinks almost identically to a human, which in turn is only impressive to a certain extent, since at least for the danger-level model the analysis operates in a somewhat closed circuit: the machine learned very well to pick up the forecasting culture of a well-trained and consistent forecasting team (Switzerland) and now mimics their behaviour well in a relative sense. The most exciting question, though, is: when do they differ? Therefore, I would like to see at least one or two examples where either a specific region over the entire period (e.g., the warning region of Davos) or a specific period (e.g., December 2023) over the entire forecasting domain is compared in detail.
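Regarding the Pr(D≥3) comment in the first bullet above: for illustration, both quantities can be derived from per-level class probabilities roughly as follows (a sketch under the assumption that the danger-level model outputs one probability per danger level; see Pérez-Guillén et al., 2024, and Maissen et al., 2024 for the actual definitions):

```python
import numpy as np

# Hypothetical class probabilities for danger levels 1-5 at one grid point.
probs = np.array([0.05, 0.30, 0.45, 0.15, 0.05])
levels = np.arange(1, 6)

expected_danger = np.sum(levels * probs)  # probability-weighted expected danger value
pr_d_ge_3 = probs[2:].sum()               # Pr(D >= 3): mass on levels 3, 4, 5

print(f"expected danger value: {expected_danger:.2f}")
print(f"Pr(D >= 3): {pr_d_ge_3:.2f}")
```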
Question of taste comments:
- Please help the reader by making the two very important Sections 3.1 (Model predictions) and 4.4 (Analysis) more comprehensible. Currently, sentences are very long and full of different terms referring to different "probabilities". Maybe you could add some graphs to explain the reversed binning approach a bit better. In addition, the explanation of using Pr(D≥3) needs more textual support.
- During the entire reading it was not clear to me that you were working with two different human-triggered data sets (cf. Fig. 4e, f). Table 1 is hard to understand: e.g., what refers to ref, and what refers to nEv?
- I enjoyed large parts of the Discussion but have the feeling that it is too lengthy and not always to the point regarding the results that were shown. I am not sure whether Section 6.4 could be shortened and incorporated into the limitations section. I would, however, love to see more work combining Sections 6.3 and 6.5 by addressing the questions I posed before: when do machine and humans think differently, and could this thinking differently help us improve the quality of our product?
- While reading, the feeling arises here and there that the team of authors is subtly trying to polarize (e.g., the choice of title). Since they have done a wonderful job either way, they have no need to do so.
Specific comments
See my mark-ups and comments within the attached supplement.
Literature
Degraeuwe, B., Schmudlach, G., Winkler, K., and Köhler, J.: SLABS: An improved probabilistic method to assess the avalanche risk on backcountry ski tours, Cold Reg. Sci. Technol., 221, 104169, https://doi.org/10.1016/j.coldregions.2024.104169, 2024.
Maissen, A., Techel, F., and Volpi, M.: A three-stage model pipeline predicting regional avalanche danger in Switzerland (RAvaFcast v1.0.0): a decision-support tool for operational avalanche forecasting, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-2948, 2024.
Pérez-Guillén, C., Techel, F., Volpi, M., and van Herwijnen, A.: Assessing the performance and explainability of an avalanche danger forecast model, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2024-2374, 2024.
Techel, F., Mayer, S., Pérez-Guillén, C., Schmudlach, G., and Winkler, K.: On the correlation between a sub-level qualifier refining the danger level with observations and models relating to the contributing factors of avalanche danger, Nat. Hazards Earth Syst. Sci., 22, 1911–1930, https://doi.org/10.5194/nhess-22-1911-2022, 2022.
Winkler, K., Schmudlach, G., Degraeuwe, B., and Techel, F.: On the correlation between the forecast avalanche danger and avalanche risk taken by backcountry skiers in Switzerland, Cold Reg. Sci. Technol., 188, 103299, https://doi.org/10.1016/j.coldregions.2021.103299, 2021.