Evaluating methods for debris-flow prediction based on rainfall in an Alpine catchment

Hirschberg, Jacob; Badoux, Alexandre; McArdell, Brian W.; Leonarduzzi, Elena; Molnar, Peter

doi:https://doi.org/10.5194/nhess-21-2773-2021

Articles | Volume 21, issue 9

© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article

10 Sep 2021

Final revised paper (published on 10 Sep 2021)
Preprint (discussion started on 06 May 2021)

Interactive discussion

Status: closed

RC1:
'Comment on nhess-2021-135', Ben Mirus, 11 Jun 2021

This paper presents a very important and thorough investigation of the sources and impacts of uncertainty in different methods for determining rainfall intensity-duration (ID) thresholds for debris-flow forecasting. It is a particularly useful and interesting contribution that can have immediate implications for researchers interested in developing landslide warning thresholds with the ID approach. The paper compares different optimizations of ID thresholds, including the true skill statistic, a regression analysis, and a random forest (RF) machine learning model. These methods are rigorous and rely on a solid local-scale dataset from the well-characterized Illgraben catchment in Switzerland, as well as the Swiss national landslide inventory database and a daily rainfall product. Using bootstrap resampling of their dataset, the authors calculate robust uncertainty estimates and evaluate the value of information and observational record duration for defining robust ID thresholds. Their RF model does not introduce compelling improvements in threshold performance, but the framework provides useful insights into the value of information for improving landslide forecasting capacity.

Overall, the paper is very well written. There are quite a few minor details that are missing in the abstract, most of which could be gleaned eventually from reading the entire paper, but should be included in the abstract for completeness. However, I could not find information on how storms and ID are defined for the regional datasets, nor did I find a clear explanation that multiple durations for antecedent rainfall conditions were explored in the RF model. Beyond these missing details, some further details on the rationale behind the scenarios selected for different optimization strategies would be helpful. I have outlined a number of specific suggestions below related to these and other minor concerns, so these comments should be addressed prior to final publication in NHESS.

Specific Comments:

I found the title quite critical, when actually the paper is not only about limitations, but rather a comprehensive evaluation of multiple approaches to debris-flow forecasting. Perhaps the title could be revised to more fairly represent the important contributions of this work.

L3. I’m not sure I agree that there are no standardized procedures. I think it’s more reasonable to state that there are multiple competing methods that have not been objectively and thoroughly compared at multiple scales.

L6. Consider stating “record duration” since you are talking about time, not a distance (length).

L12. Regional landslide dataset with local rainfall input or with a regional rainfall database? This is a critical detail that needs to be clarified in the abstract even if it can be determined later in the paper.

L13. If these implications are important, is it important enough to list them in the abstract?

Also, state here whether the RF model was tested for just local or also for regional?

L15. I found this “30-min maximum accumulated rainfall” a bit confusing as it isn’t really standard terminology. Is this the greatest accumulated depth of rainfall observed within a given 30-minute period of a storm? If so, wouldn’t that be basically equivalent to the peak 30-minute rainfall intensity (I-30)?
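The equivalence raised in this comment can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming rainfall is recorded as depths per fixed time step; the function names and the 10-minute sample storm are hypothetical and are not taken from the manuscript.

```python
def max_accumulation(depths_mm, step_min, window_min=30):
    """Largest rainfall depth (mm) falling within any window of window_min minutes."""
    if window_min % step_min != 0:
        raise ValueError("window_min must be a whole multiple of step_min")
    n = window_min // step_min  # number of time steps per window
    if len(depths_mm) <= n:
        return float(sum(depths_mm))
    current = sum(depths_mm[:n])
    best = current
    # Slide the window one step at a time, updating the running sum.
    for i in range(n, len(depths_mm)):
        current += depths_mm[i] - depths_mm[i - n]
        best = max(best, current)
    return float(best)


def peak_intensity(depths_mm, step_min, window_min=30):
    """Peak mean intensity (mm/h) over the same window: depth rescaled to one hour."""
    return max_accumulation(depths_mm, step_min, window_min) * 60.0 / window_min


# Hypothetical storm sampled every 10 minutes (depths in mm per step).
storm = [0.2, 0.5, 1.4, 1.2, 1.0, 0.3, 0.1]
acc30 = max_accumulation(storm, step_min=10)  # about 3.6 mm
i30 = peak_intensity(storm, step_min=10)      # about 7.2 mm/h
```

Under these assumptions the two quantities differ only by the constant factor 60/30 = 2, so a threshold expressed in one can be converted directly into the other.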

L17. Increase in predictive performance over which other threshold optimization approach/approaches?

L41. Again, it’s not that there are none, but that a few established procedures are in use and that those approaches have not been compared objectively and thoroughly.

L82. Could also mention that you evaluate these differences for both a local vs. regional landslide inventory.

Figure 1. The legend should explain what the blue shaded channel is and also that the X marks the Illhorn peak. It wouldn’t hurt to add the elevation of the Illhorn and of the force plate or catchment outlet to provide an easy reference for the steepness of the basin.

L143-146. Consider briefly explaining the gridded daily rainfall product, including the spatial resolution and how it is collected/calculated, as well as what rainfall value was used for the threshold evaluations (i.e., did you use rainfall values from the nearest grid cell, or some grid-cell averaging, or …?). This is important context for evaluating the ID thresholds at the regional scale vs. local scale.

L173-202. I guess you didn’t explain the regional data here in Table 1. Perhaps that’s not necessary, but you do need to define your MIT for the daily/regional data analysis. How are multi-day storms determined?
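To make the question about storm definition concrete, here is a minimal sketch of event separation with a minimum inter-event time (MIT), assuming a regular time series of rainfall depths. The function name, the sample series, and the rule that a dry gap equal to the MIT already splits two storms are illustrative assumptions, not the authors' definition. For daily data, `step_min` would be 1440.

```python
def split_storms(rain_mm, step_min, mit_min):
    """Group wet time steps into storms separated by at least mit_min dry minutes.

    Returns one dict per storm with duration (h), total depth (mm),
    and mean intensity (mm/h).
    """
    bounds = []
    start = last_wet = None
    for i, depth in enumerate(rain_mm):
        if depth <= 0:
            continue  # dry step: nothing to do until the next wet step
        if start is None:
            start = last_wet = i
        elif (i - last_wet - 1) * step_min >= mit_min:
            # Dry gap is long enough: close the previous storm, open a new one.
            bounds.append((start, last_wet))
            start = last_wet = i
        else:
            last_wet = i
    if start is not None:
        bounds.append((start, last_wet))

    storms = []
    for s, e in bounds:
        duration_h = (e - s + 1) * step_min / 60.0
        total_mm = sum(rain_mm[s:e + 1])
        storms.append({
            "duration_h": duration_h,
            "total_mm": total_mm,
            "intensity_mm_h": total_mm / duration_h,
        })
    return storms


# Hypothetical 10-minute series (mm per step).
series = [1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 1]
storms_mit60 = split_storms(series, step_min=10, mit_min=60)
storms_mit20 = split_storms(series, step_min=10, mit_min=20)
```

With a 60-minute MIT the sample series yields two storms, while a 20-minute MIT splits it into three, which illustrates why the chosen MIT changes the intensity and duration of every point used to fit a threshold.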

L192. Initially, I assumed that this 3-90d antecedent conditions meant the cumulative rainfall total measured between 3 and 90 days prior to the storm event. While there needs to be some explanation of why 3 days was selected as a cutoff (why not 2 or 1 day?), there also needs to be a clearer explanation that multiple potential durations of antecedent rainfall were considered. This only becomes apparent in Figure 7 and the associated analysis of the RF results and variable importance.

L208. Yes, and thus does not consider the rate of false alarms.

L204-216. These paragraphs could benefit from an explanation of the shape parameters in terms of how they influence ID threshold shape/position, and then subsequently the rationale for why the two contrasting optimization approaches (LR-TSS vs. TSS-TSS) were selected. The significance of these choices might not be clear to all readers.
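As an illustration of how the two parameters and the true skill statistic interact, here is a minimal sketch assuming the common power-law form I = αD^β, with I in mm/h and D in h. The event list and parameter values are hypothetical, and the sketch does not reproduce the optimization procedures of the manuscript.

```python
def exceeds(intensity, duration, alpha, beta):
    """True if a rainfall event lies on or above the threshold I = alpha * D**beta."""
    return intensity >= alpha * duration ** beta


def true_skill_statistic(events, alpha, beta):
    """TSS = sensitivity + specificity - 1 for a given threshold.

    events: iterable of (duration_h, intensity_mm_h, triggered) tuples.
    """
    tp = fn = fp = tn = 0
    for duration, intensity, triggered in events:
        predicted = exceeds(intensity, duration, alpha, beta)
        if triggered and predicted:
            tp += 1
        elif triggered:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)  # hit rate
    specificity = tn / (tn + fp)  # 1 - false alarm rate
    return sensitivity + specificity - 1.0


# Hypothetical events: (duration in h, mean intensity in mm/h, debris flow observed?)
events = [
    (1.0, 12.0, True), (4.0, 5.0, True), (6.0, 2.0, True),
    (2.0, 3.0, False), (10.0, 0.8, False), (0.5, 8.0, False),
]
tss_high = true_skill_statistic(events, alpha=8.0, beta=-0.6)
tss_low = true_skill_statistic(events, alpha=4.0, beta=-0.6)
```

In this toy case, lowering α from 8 to 4 captures the one missed debris flow but adds two false alarms, so the TSS drops. In log-log space α shifts the threshold up or down and β sets its slope, which is why fixing β by regression and fitting it by TSS can yield different thresholds.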

L223. Consider clarifying the “… original (or ) record…”

L248. By “classical” ID thresholds, you mean those optimized with ROC statistics (LR-TSS, TSS-TSS)?

Figure 3. It is difficult to see what the minimum number of debris flows is in each month, but it looks like they’re all zero. If so, consider just stating this in the figure caption. (b) Also, see the previous comment about the maximum 30-min accumulation. Isn’t this just more or less equivalent to the peak I-30 (i.e. 7.2 mm/h)?

L256-257. If seasonal snowmelt is a relevant control on rainfall triggering, then the antecedent precipitation variable ought to somehow account for this, but I suspect it cannot.

L260-261. Again, these non-conforming observations might also be related to the fairly coarse consideration of antecedent rainfall.

L266-268. As a discussion point, it could be interesting to compare this range in parameter variation to the ranges of typical ID thresholds reported in the literature, say for example the difference between values for Caine vs. Guzzetti et al. ID thresholds. I have not done this comparison myself, but it could be worth looking at.

L372. Even though 20% seems low, this is actually pretty good performance overall for an ID threshold relative to others developed worldwide, so that just further highlights the multitude of complex interactions that lead to debris-flow triggering and justify the need to explore more data-rich approaches like the RF you propose.

Citation: https://doi.org/10.5194/nhess-2021-135-RC1
- AC1: 'Reply on RC1', Jacob Hirschberg, 02 Jul 2021
  
  We thank Ben Mirus for providing a constructive review. Please find our replies in the supplement.
  
  Citation: https://doi.org/10.5194/nhess-2021-135-AC1
RC2:
'Comment on nhess-2021-135', Clàudia Abancó, 17 Jun 2021
General Comments:

The main question addressed by this paper is the uncertainty in rainfall thresholds for debris-flow prediction. It presents a study at the local scale, but also analyses the implications that using a regional landslide dataset would have had for the final intensity-duration (ID) threshold. It deals with several aspects that are well known in the literature to cause uncertainty in the definition of thresholds, such as the statistical techniques used, the size of the dataset, or the variables included in the ID threshold. The topic is interesting and relevant since, as pointed out by the authors, no standardized procedures exist yet to define thresholds and many uncertainties remain regarding the data and techniques to be used.

Although the uncertainty in the definition of rainfall thresholds for landslides and debris flows is a common research topic, this paper clearly shows originality. It deals with some classic topics such as the effects of the database length or the method used to estimate the threshold, but both the methods and the database used are of good quality and original. I find especially interesting the multivariate approach including further rainfall properties (beyond the standard ones) and the seasonal proxies.

The paper is very well written, clear and easy to read.

The conclusions are consistent with the evidence and arguments presented. They address the main questions proposed.

The Figures are in general clear, and helpful to follow the paper. I have added some comments in the specific comments for one particular figure, which I find quite difficult to follow.

In summary, I really enjoyed reading the paper and I think the authors did very nice work. However, it needs some revisions before it is published in NHESS.

Specific comments:

Title: It may be a bit misleading. It does not deal with the limitations of the thresholds but more with the uncertainties on their definition? I would suggest reconsidering it...

L65: I would suggest adding a few references of studies using different MIT, as it is said they range from 10 min to 6 h but no references are given (although they appear later, in 3.3., but I would add them here too)

L82: Although the two methods that are going to be compared are mentioned in the abstract, I would list them here too

L80-85: What I miss here is a statement, as an objective (maybe a secondary one), of the analysis of performance with the local vs. regional dataset, which is stated in the abstract.

L89-98: Cite Figure 1 in this paragraph

L106: Rain gaugeS in plural? If there’s more than one, why in the Figure only 1 is shown? Why only data from 1 is used?

L111-113: If the geophones and depth sensors have been removed, which sensors (less maintenance) have been installed?

L115: Since Badoux is already cited in line 113, I think you could delete the citation in line 115; it is clear to the reader that details can be found there.

L122: Any reference where snowmelt has been observed (even if not as sole trigger)?

Figure 1: Why only force plate is indicated? I would suggest adding the other sensors (the new ones replacing the geophones)?

L135: 5 mm? This sounds like a very low number...

L136: Does the local rain gauge not have a thermometer? Also, I would suggest moving temperature, lightning strikes and other parameters to another paragraph, as I understand these are all variables for the machine learning, but not actually for the main ID threshold comparison. I was a bit confused by the abrupt change from rainfall to temperature, as it is the first time you mention the temperature variable.

L160: Reference of TSS?

L169: Include Area Under Curve as clarification of what AUC stands for

L173: I understand that you used the same criteria for both triggering and non-triggering rainfall events?

L175: Delete ? before Deganutti

L180 and Figure 2: I think this is more results than methods?

Figure 2 (b): Sensitivity? This word may be confused by SE? Could it be called “Analysis of ID-threshold parameters with changing MIT”?

L184: I would not say that β stabilizes, but rather that its increasing trend is reduced at an MIT of 3 h... (in Fig. 2b)

L197: Actually, if it was snowing the data from the rain gauge would not be valid, right? As it is not heated... Have you considered this? If so, maybe you could mention here.

L200: Very interesting selection of parameters!

L208: This last sentence of the paragraph (“Lately, confusion matrix...”) is actually a bit confusing to me. You have not used the frequentist method, right? You used linear squares and the LR&TSS and TSS&TSS methods, if I have understood correctly. Therefore, the sentence is confusing, as it seems that you have calculated the confusion matrix and ROC for the frequentist method...

L220: This is also confusing. A record of length 5 years includes 5 annual samples that can include repetition of the same year? Why is the procedure repeated 100 times for each record length? Please clarify.
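One possible reading of the resampling step in question is sketched below, assuming that whole years are drawn with replacement and that the draw is repeated to obtain a distribution rather than a single value. The function name, the seed, and the per-year debris-flow counts are hypothetical and are not taken from the manuscript.

```python
import random


def bootstrap_records(years, record_length, n_repeats=100, seed=42):
    """Draw n_repeats synthetic records, each made of record_length years
    sampled with replacement (the same year may appear more than once)."""
    rng = random.Random(seed)
    return [rng.choices(years, k=record_length) for _ in range(n_repeats)]


# Hypothetical number of debris flows observed in each year of record.
flows_per_year = {2001: 3, 2002: 5, 2003: 0, 2004: 4, 2005: 2, 2006: 6}

records = bootstrap_records(list(flows_per_year), record_length=5)

# Any statistic can be recomputed on each synthetic record; here, simply the
# number of triggering events it contains. The spread over the repeats
# indicates how uncertain that statistic is for a 5-year record.
counts = [sum(flows_per_year[year] for year in record) for record in records]
spread = (min(counts), max(counts))
```

Under this reading, a single draw gives only one possible 5-year record, whereas the repeats show how much a fitted threshold parameter could vary between equally plausible records of that length.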

L280: I think it would be good to see the total rainfall amounts at some point in a figure, as it is stated here that long durations need more rainfall (logical, but still nice to see).

L297: A higher antecedent rainfall amount may lead to a higher degree of pore saturation along the entire channel bed, but in some cases the antecedent rainfall would mostly contribute to the generation of lateral flow and a rise in the water table (e.g.: M.N. Papa, V. Medina, F. Ciervo, A. Bateman 2013, Derivation of critical rainfall thresholds for shallow landslides as a tool for debris flow early warning systems). This could also correlate with the fact that magnitudes are bigger, but I would say that the correlation between antecedent rainfall and magnitude is a tricky point and needs careful evaluation...

L305: “TSS&TSS thresholds are lower for short durations (<4.5 h) and higher for long durations”: after this I would add “(Figure 5e)”.

L311: “However, the biases decrease to ~30% already after 6 years or ~25 triggering events”: for both? Or only for β? I can’t see it that clearly in alpha.

L335: Also, the source of rainfall data is different, right? If I am not wrong, the work of Leonarduzzi et al. was not based only on rain gauge data. Therefore, apart from climatic, topographic and lithologic uncertainties, the difference may also come from the type of rainfall data?

Figure 7:

The grey dots are very difficult to see, especially over the blue, red and green bars. Change the colour of the bars or make the dots bigger.

I find this figure particularly dense and a bit difficult to follow. Some ideas on how it could be made a bit easier to read:

I understand that RF_ID+1 is based on one single predictor (the one with the best performance). Why not indicate which one instead of leaving the reader to interpret it?

Same with RF_ID+var and 4 predictors

Maybe then it would not be necessary to include all the single predictors in the same figure. They could be included either in a separate figure or as supplementary material.

If you think it is relevant to keep the same format, I would suggest indicating the selected predictors for each RF model in some way...
Citation: https://doi.org/10.5194/nhess-2021-135-RC2
- AC2: 'Reply on RC2', Jacob Hirschberg, 02 Jul 2021
  
  We thank Clàudia Abancó for providing a constructive review. Please find our replies in the supplement.
  
  Citation: https://doi.org/10.5194/nhess-2021-135-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Publish subject to minor revisions (review by editor) (09 Jul 2021) by Thomas Glade

AR by Jacob Hirschberg on behalf of the Authors (19 Jul 2021)

ED: Publish as is (16 Aug 2021) by Thomas Glade

AR by Jacob Hirschberg on behalf of the Authors (17 Aug 2021)

Short summary

Debris-flow prediction is often based on rainfall thresholds, but uncertainty assessments are rare. We established rainfall thresholds using two approaches and found that 25 debris flows are needed for uncertainties to converge in an Alpine basin and that the suitable method differs between regional and local thresholds. Finally, we demonstrate the potential of a statistical learning algorithm to improve threshold performance. These findings are helpful for early warning system development.