Limitations of rainfall thresholds for debris-flow prediction in an Alpine catchment

The prediction of debris flows is relevant because this type of natural hazard can pose a threat to humans and infrastructure. Debris-flow (and landslide) early warning systems often rely on rainfall intensity-duration (ID) thresholds. Unfortunately, no standardized procedures exist for the determination of such ID thresholds, and a validation and uncertainty assessment is often missing in their formulation. As a consequence, updating, interpreting, generalizing and comparing rainfall thresholds is challenging. Using a 17-year record of rainfall and 67 debris flows in a Swiss Alpine catchment (Illgraben), we determined ID thresholds and associated uncertainties as a function of record length. Furthermore, we compared two methods for rainfall-threshold definition which consider both triggering and non-triggering events, based on linear regression and/or True Skill Statistic maximization. The main difference between these approaches and the well-known frequentist method is that non-triggering rainfall events were also considered here for obtaining ID-threshold parameters. Depending on the method applied, the ID-threshold parameters and their uncertainties differed significantly. We found that 25 debris flows are sufficient to constrain uncertainties in ID-threshold parameters to ±30% for our study site. We further demonstrated the change in predictive performance of the two methods when a regional landslide data set was used instead of a local one, with important implications for ID-threshold determination. Furthermore, we tested whether the ID-threshold performance can be increased by considering other rainfall properties (e.g. antecedent rainfall, maximum intensity) in a multivariate statistical learning algorithm based on decision trees (random forest).
The highest predictive power was reached when the 30-min maximum accumulated rainfall was added to the ID variables, while no improvement was achieved by considering antecedent rainfall for debris-flow predictions in Illgraben. Although the increase in predictive performance with the random forest model was small, such a framework could be valuable for future studies if more predictors are available from measured or modelled data.

https://doi.org/10.5194/nhess-2021-135 Preprint. Discussion started: 6 May 2021. © Author(s) 2021. CC BY 4.0 License.

1 Introduction

Debris flows are a common geomorphic process and hazardous phenomenon in mountain regions. They move rapidly as a surging flow of saturated debris. In contrast to other mass movements, such as shallow landslides, debris flows follow an established flow path, along which they often entrain substantial amounts of sediment and water stored in the channel (Hungr et al., 2014; de Haas et al., 2020). In Switzerland, where this study is conducted, landslides and debris flows caused 74 fatal accidents between 1946 and 2015 (Badoux et al., 2016). Globally, debris flows cause about 165 fatalities per year on average, with most of them occurring in mountainous regions of developing countries (Dowling and Santi, 2014). Furthermore, debris flows have the potential to damage property, infrastructure, managed forests and agricultural land (Hilker et al., 2009). Therefore, the development of early warning systems (EWS) for debris flows and other rapid gravitational mass movements, involving novel measurement techniques and models, is a priority in many countries (Stähli et al., 2015). EWS often rely on rainfall thresholds (Guzzetti et al., 2020). The most common rainfall thresholds are drawn in the rainfall duration (D) and mean rainfall intensity (I, or cumulative rainfall) space, taking the form $I = \alpha D^{-\beta}$, which is a straight line in log-log space (Caine, 1980).
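To make the power-law form concrete, the following minimal sketch evaluates an ID threshold and checks whether an event plots on or above it. The function names are hypothetical, and the units (D in hours, I in mm per hour) are an assumption, not something fixed by the text above.

```python
import math


def id_threshold(duration_h, alpha, beta):
    """Mean intensity on the threshold curve I = alpha * D**(-beta)."""
    return alpha * duration_h ** (-beta)


def exceeds_threshold(duration_h, mean_intensity, alpha, beta):
    """True if an event plots on or above the ID threshold."""
    return mean_intensity >= id_threshold(duration_h, alpha, beta)


def log_linear_form(alpha, beta):
    """Intercept and slope of log10(I) = log10(alpha) - beta * log10(D)."""
    return math.log10(alpha), -beta
```

In this form, α is the threshold intensity at unit duration and β controls how quickly the required mean intensity drops for longer events.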
In alpine settings, debris flows mostly develop following a shallow hillslope landslide caused by increased pore water pressure (e.g. Iverson, 1997). Another cause can be runoff, where sediment deposits in the channel are mobilized as a mass movement (Takahashi, 1978, 1981) or sediments are progressively bulked up (Fryxell and Horberg, 1943; Johnson and Rodine, 1984; Tognacca, 1999; Gregoretti, 2000). Physically-based models considering these mechanisms can be used to infer ID thresholds leading to debris-flow initiation (e.g. Berti and Simoni, 2005; Berti et al., 2020; Tang et al., 2019). However, because such models require a great deal of high-quality input data, empirically determined ID thresholds are still more common and are determined at the local, regional or global scale (e.g. Caine, 1980; Guzzetti et al., 2007; Coe et al., 2008; Badoux et al., 2009; Staley et al., 2013; Abancó et al., 2016; Bel et al., 2017).

The problem with the use of ID thresholds is that there is no standardized procedure for their determination, and they are rarely validated. Consequently, generalizing and updating ID thresholds is challenging, as is comparing them to hypothesize about the possible site-related differences in geomorphology, lithology, terrain and soil properties that lead to a different response to rainfall forcing (Segoni et al., 2018). The sensitivities of ID thresholds have to be better understood and studied for such comparisons to be meaningful. Major uncertainties arise from various issues related to the quality of the rainfall record used. ID thresholds often rely on rainfall data from rain gauges, which can be located on the valley floor or in a neighbouring valley rather than in the immediate vicinity of the debris-flow initiation area.
Studies in the Italian Alps have shown that in orographically complex areas, especially for short convective rainstorms, precipitation intensities can decay significantly (30-60%) within short distances (5-10 km) from the centre of the rainfall cell (Marra et al., 2016). This may lead to underestimations of α by up to 70%, and is one reason for the high false alarm rate of ID thresholds (Nikolopoulos et al., 2014). Another factor causing ID-threshold underestimation is the coarse temporal resolution of the rainfall data. Landslide and debris-flow data sets going far back in time, or relying on satellite-based estimates of rainfall, can be strongly affected by a coarse temporal resolution. Especially if events are triggered by short-duration storms (minutes to hours), the event mean rainfall intensity is considerably underestimated when daily rainfall records are used. In a study with synthetic data, Marra (2018) showed that using daily data results in significant threshold underestimation. Gariano et al. (2020) confirmed this effect for a real case study in Italy. However, the accuracy of ID thresholds based on rainfall data with a sub-daily resolution can also be limited, for example if the exact timing of the debris flows or landslides is unknown (Leonarduzzi and Molnar, 2020). This is often the case, unless the area is closely monitored or damage was caused and immediately recognized. Therefore, in studies where the landslide timing is only imprecisely known and sub-daily rainfall data are used, the entire rainfall event or the rainfall until the highest intensity was reached is considered to be the triggering rainfall. This uncertainty in event timing has been shown to lead to inflated triggering rainfall amounts and subsequently to overestimated ID thresholds (Staley et al., 2013; Leonarduzzi and Molnar, 2020; Bel et al., 2017). Additional uncertainties stem from the discretization of the rainfall time series into rainfall events.
Rainfall events are usually separated by a minimum inter-event time (MIT), a period that is intended to mark separate, independent rainfall periods. For studies in debris-flow torrents, the MIT can range from 10 min to 6 h and is often chosen subjectively without a sensitivity analysis. Bel et al. (2017) combined different MIT values with uncertainty in the timing of debris-flow detection in a French torrent. The obtained uncertainty bounds in ID thresholds encompassed almost all ID thresholds previously published from other torrents. Despite their importance, uncertainties in ID curves are seldom used in prediction.
The inaccuracies in rainfall data used to establish ID thresholds are one source of uncertainty leading to the high false alarm rate. ID thresholds are also criticized for ignoring other information contained in the rainfall time series, such as peak intensities and antecedent rainfall. For debris flows, peak intensities at high temporal resolution (≤10 min) have been shown to have an especially high predictive power (e.g. Abancó et al., 2016; Bel et al., 2017). Multivariate statistical methods, such as logistic regression, have been tested and applied to improve the prediction of post-wildfire debris flows in the US (Cannon et al., 2010; Staley et al., 2017), and for a French Alpine torrent (Bel et al., 2017). More advanced machine learning techniques are also becoming an attractive tool in the geosciences as the availability of both measured and modelled data increases, and the careful investigation of all possible physical interactions between the variables exceeds our capacities (Reichstein et al., 2019). For post-wildfire debris-flow prediction, machine learning algorithms have in fact been shown to outperform logistic regression models and ID thresholds in predictions (Kern et al., 2017; Nikolopoulos et al., 2018).
Here, we address two research questions. First, what is the uncertainty in ID-threshold parameters associated with the estimation method and with the debris-flow record length? By resampling the Illgraben debris-flow record using different time windows, we estimate the confidence bounds of the ID-threshold parameters. Furthermore, we compare the uncertainties of two methods that have been used recently for determining ID-threshold parameters (e.g. Leonarduzzi et al., 2017; Leonarduzzi and Molnar, 2020; Nikolopoulos et al., 2018). Second, how do traditional ID-threshold-curve methods compare with machine learning algorithms? We extend the analysis of debris-flow prediction with additional rainfall event attributes in a random forest algorithm (Breiman, 2001), test the predictive skill, and discuss the pros and cons of the multivariate approach for local debris-flow detection based on different rainfall event properties (e.g. peak rainfall intensity, number of lightning events) and other seasonal proxies, for example those related to sediment recharge.
Illgraben is a north-facing catchment located in the Rhône valley in the Swiss canton of Valais. It consists of two subcatchments: the eastern Illbach (4.15 km²) is hydrologically and geomorphologically disconnected, while the western Illgraben (4.83 km²) produces on average ∼5 debris flows a year. The Illgraben sub-catchment has a maximum elevation of 2645 m a.s.l. at the summit of the Illhorn mountain. In this region, the main Rhône-Simplon fault line changes its orientation, resulting in numerous smaller faults in highly fractured bedrock and affecting the Illgraben catchment (McArdell and Sartori, 2021). The main sediment source area is a highly active hillslope underlain by quartzite bedrock, ranging from 1250 to 2370 m a.s.l. and with slopes of up to 80°, where frequent landslides occur and deposit sediments in the trunk channel (Bennett et al., 2012; Berger et al., 2011). The main debris-flow channel starts just below this hillslope and is 5.2 km long. The first half is characterized by a mean slope of 16% until the fan apex at 886 m a.s.l. The second half is flatter (10%) and confined by check dams, before it joins the Rhône river at 605 m a.s.l.
In 1961 a large rock avalanche on the northern slope provided abundant sediment and increased the debris-flow frequency in the following years (Hürlimann et al., 2003). However, since then this part of the catchment has produced sediment at much lower rates than the main source area (Schlunegger et al., 2009). The rock avalanche prompted the construction of 30 concrete check dams to stabilize the channel, with the most upstream one being 48 m tall (Lichtenhahn, 1971). This upstream check dam was effective in stabilizing the toe of the rock avalanche deposit and reducing the number of debris-flow events in subsequent years (Hürlimann et al., 2003).
Since 2000 the Swiss Federal Research Institute WSL has been operating an observation network in the Illgraben catchment, including rain gauges (added in 2001), geophones, depth sensors and a force plate (Rickenmann et al., 2001; Hürlimann et al., 2003; McArdell, 2016). In this study, we use the rain gauge data and a debris-flow inventory including events up to the year 2017 (McArdell and Hirschberg, 2020). Illgraben has an alarm system that operates independently from the debris-flow observation station. It serves the villages of Susten and the Pletschen-subdivision on the eastern side of the fan and protects the people visiting the Pfynwald nature park on the western side. Furthermore, hiking trails and sports fields close to the riparian zone make the fan a vulnerable area. Geophones mounted on check dams detect debris flows and activate alert lights and acoustic signals downstream in the riparian zone. After an initial phase of testing and optimization by WSL, the station was operated for about 10 years before the geophones and depth sensors described in Badoux et al. (2009) were replaced with sensors requiring less maintenance. The alarm system is now operated by the local municipality in cooperation with an engineering company. For a detailed description of the original alarm system, the reader is referred to Badoux et al. (2009).
The mean annual precipitation at mean catchment elevation (1600 m a.s.l.) computed for the time period 1981-2010 was 900 mm y⁻¹ (HAD, 2015) and the mean annual temperature for the same period was 5.9°C (Hirschberg et al., 2021). Debris flows generally occur from May to October. Although climate change scenarios project longer debris-flow seasons in the future (Hirschberg et al., 2021), and recently debris flows have also been recorded in April (2020) and December (2018), most debris flows occur in response to convective rainstorms between June and August. Snowmelt likely plays a role in spring but has never [...] were interpolated using local lapse rates to account for the difference in elevation of about 200 m, as described in Hirschberg et al. (2021). The 10 min total of recorded lightning strikes at a distance of 3-30 km was also derived from the Montana station and used as a secondary variable for the convective character of storms (Gaál et al., 2014) in the machine learning algorithm.

The local predictive performance for debris flows in Illgraben was also compared with a regional prediction of slope failure using a regional data set on shallow landslides in Switzerland including associated rainfall events (Leonarduzzi et al., 2017). It is based on a gridded daily rainfall product (RhiresD) and the Swiss flood and landslide damage database (WSL). This regional data set consists of 2137 landslides which occurred in Switzerland between 1972 and 2018 and for which damage was reported. Only the data from 2001 to 2017 were used, to be consistent with the local data set.
True skill statistic (Peirce skill score or Hanssen and Kuipers discriminant): [...]

The benefit of using the specificity over the false positive rate (FPR = FP/(FP + TN)) is that in a perfect model TSS, sensitivity and specificity all equal 1. As noted by others (e.g. Leonarduzzi et al., 2017; Postance et al., 2018; Mirus et al., 2018), optimizing TS leads to more conservative (higher) thresholds, while optimizing TSS yields more balanced rainfall thresholds. The choice of score used in practice is therefore the user's decision. In this study, TSS was optimized to calibrate thresholds and to compare classifiers, mainly because it is less sensitive to data sets with unbalanced class prevalence. In particular, TSS was used in the following analyses:

1. In the determination of thresholds for single predictors (section 3.3)
2. In the determination of the ID-threshold parameters (section 3.4)
3. In the determination of probability thresholds from the random forest classifier (section 3.5)
4. In the comparison of these predictive models

Another metric for predictive model comparison is the area under the ROC curve (AUC). To estimate the AUC, sensitivity and 1-specificity are calculated and plotted for all possible threshold values. AUC equals 1 if there is a threshold that can perfectly separate triggering and non-triggering events. A model with an AUC of 0.5 has no predictive power.
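The scores discussed above can be sketched in a few lines. The definitions used here are the standard confusion-matrix ones (sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), TSS = sensitivity + specificity − 1), which are assumed to correspond to the numbered equations of this section. The function names are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation rather than by integrating a plotted ROC curve.

```python
def confusion_counts(triggered, predicted):
    """Count TP, FN, FP and TN for paired observed and predicted labels."""
    tp = fn = fp = tn = 0
    for obs, pred in zip(triggered, predicted):
        if obs and pred:
            tp += 1
        elif obs:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def skill_scores(tp, fn, fp, tn):
    """Sensitivity, specificity and true skill statistic (both classes must be present)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {"SE": se, "SP": sp, "TSS": se + sp - 1.0}


def best_threshold(triggered, values):
    """Scan observed values of a single predictor and keep the threshold with the highest TSS."""
    best_value, best_score = None, float("-inf")
    for candidate in sorted(set(values)):
        predicted = [v >= candidate for v in values]
        score = skill_scores(*confusion_counts(triggered, predicted))["TSS"]
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value, best_score


def auc(triggered, scores):
    """Probability that a triggering event scores higher than a non-triggering one (ties count half)."""
    pos = [s for t, s in zip(triggered, scores) if t]
    neg = [s for t, s in zip(triggered, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because sensitivity and specificity are each normalized by their own class size, TSS does not change when the share of non-triggering events grows, which is the prevalence independence referred to above.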

Rainfall event definition and other properties
The precipitation time series was discretized into rainfall events, which were separated by a minimum inter-event time (MIT).
Rainfall events were considered independent if no rainfall was recorded during this time. MIT is often chosen subjectively and varies from 10 min at Chalk Cliffs (USA) to 6 h at Moscardo (Italy) for local debris-flow analyses (? Deganutti et al., 2000).

In Switzerland, an MIT of 2-3 h has been shown to be appropriate for the separation of thunderstorms (Gaál et al., 2014) and storms initiating bedload transport in an Alpine watershed (Badoux et al., 2012). However, as the subjectivity in the rainfall event definition complicates the comparison of rainfall thresholds, we followed the suggestion of Bel et al. (2017) to choose MIT as the duration where the number of rainfall events stabilizes, which indicates its independence from MIT duration. In this process, the sensitivity of rainfall threshold performance scores to the choice of MIT was also evaluated (Fig. 2a). The measures SE, SP and TSS from fitting ID curves were found to decrease with increasing MIT as the number of events drops.
However, in absolute terms, the number of false alarms is higher at a short MIT because of the large total number of rainfall events, and this is also reflected in the TS statistic. At MIT = 3 h, the number of rainfall events stabilizes. This is seen in the stable TS, meaning that there are no additional false alarms. The shape parameter of the ID threshold (β) also stabilizes at this MIT (Fig. 2b). The scale parameter (α) increases because the rainfall events become longer with increasing MIT and therefore the rainfall amounts increase. Consequently, MIT = 3 h was used throughout this study and it was confirmed that in the Alps an MIT of 2-3 h is an appropriate time period for separating rainfall into independent events for various applications, as found in independent studies by Badoux et al. (2012), Gaál et al. (2014) and Bel et al. (2017). Nevertheless, a suitable MIT should be objectively tested for each study site if possible.
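The event discretization can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes a regularly spaced series of rainfall depths (10-min steps by default), treats any value above zero as rain, and closes an event once the dry gap is at least as long as the MIT. The function names are hypothetical.

```python
def split_events(rain_mm, step_min=10, mit_min=180):
    """Return (start_index, end_index) pairs of rainfall events, end index inclusive."""
    gap_steps = mit_min // step_min  # number of dry steps that separates two events
    events = []
    start = last_wet = None
    for i, depth in enumerate(rain_mm):
        if depth > 0:
            if start is None:
                start = i
            elif i - last_wet - 1 >= gap_steps:
                # The dry spell since the last wet step reached the MIT: close the event.
                events.append((start, last_wet))
                start = i
            last_wet = i
    if start is not None:
        events.append((start, last_wet))
    return events


def duration_and_intensity(rain_mm, event, step_min=10):
    """Event duration in hours and mean intensity in mm per hour."""
    start, end = event
    duration_h = (end - start + 1) * step_min / 60.0
    total_mm = sum(rain_mm[start:end + 1])
    return duration_h, total_mm / duration_h
```

Calling `len(split_events(rain_mm, mit_min=m))` for a range of `m` values gives the event count as a function of MIT, which is the curve whose flattening is used above to select MIT = 3 h.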
Once MIT was defined, other rainfall-event properties were extracted in addition to the duration and mean intensity (Table 1, single predictors). Maximum cumulative rainfall was computed for accumulation periods of 10 min to 2 h. Antecedent rainfall was defined as cumulative rainfall from 3 to 90 days prior to the rainfall event. These event properties can be computed from [...]

Furthermore, event properties related to air temperature were added. The daily mean, minimum and maximum temperatures were computed for the day of each rainfall event. If an event spanned several days, the day the event started was considered.
In the case of low temperatures, it could be snowing in the higher parts of the catchment, reducing the amount of liquid water contributing to subsequent runoff. As an indicator of convective rainfall, the daily temperature span and the number of lightning strikes were added. To account for seasonality effects, the day-of-year and the month of each event were included. As proxies for sediment availability, the time elapsed since the last debris flow was added and the number of freezing days of the current hydrological year (November-October) was computed from the temperature time series. The aim of the latter was to account for sediment recharge related to frost-weathering processes (Hirschberg et al., 2021).
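Two of the rainfall-based event properties, the maximum cumulative rainfall over a fixed accumulation period and the antecedent rainfall, can be sketched as below. The sketch assumes a regularly spaced rainfall series and an event given as a pair of inclusive indices into that series; both the representation and the function names are hypothetical.

```python
def max_accumulation(rain_mm, event, window_steps):
    """Largest rainfall sum over any run of `window_steps` consecutive steps within the event."""
    start, end = event
    values = rain_mm[start:end + 1]
    if len(values) <= window_steps:
        return sum(values)  # event shorter than the window: use the event total
    current = best = sum(values[:window_steps])
    for i in range(window_steps, len(values)):
        # Slide the window by one step: add the new value, drop the oldest one.
        current += values[i] - values[i - window_steps]
        best = max(best, current)
    return best


def antecedent_rainfall(rain_mm, event_start, days, step_min=10):
    """Cumulative rainfall over the `days` preceding the first step of the event."""
    steps = days * 24 * 60 // step_min
    first = max(0, event_start - steps)
    return sum(rain_mm[first:event_start])
```

With 10-min data, `window_steps=3` gives the 30-min maximum and `window_steps=12` the 2-h maximum; `days` would range from 3 to 90 for the antecedent rainfall defined above.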

Rainfall ID thresholds
The best way to determine the scale (α) and shape (β) parameters of rainfall ID curves and their uncertainties is an ongoing discussion. Brunetti et al. (2010) presented a statistical (frequentist) approach involving estimating β with a linear regression (in logarithmic space) fitted to all triggering rainfall events and decreasing α by an amount which is equal to the distance of the median residual to a chosen lower percentile. While this method is objective and, when applied as an EWS, makes it possible to control the hit rate, it neglects the information from the non-triggering rainfall events. Lately, confusion matrix and ROC methods (see section 3.2) have been used as objective measures to determine threshold parameters and compare the predictive performance of different models. For the ID thresholds computed here, sensitivity (Eq. 1) and specificity (Eq. 2) were calculated. The threshold performance was then evaluated in terms of TSS (Eq. 3). Two approaches were applied to optimize the ID-threshold parameters. In the first approach, as in the frequentist method, the shape parameter β is determined in log-log space with a linear least-squares approximation of the debris-flow triggering ID pairs. In a second step, the scale parameter α is tuned to maximize TSS. This method is called LR&TSS hereafter. In the second approach, the scale parameter α and the shape parameter β are simultaneously tuned to maximize TSS. This approach is hereafter referred to as TSS&TSS (Table 1, models).
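The two fitting approaches can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the optimizer is not described above, so α is tuned by scanning the curves that pass through the observed events and, for TSS&TSS, β is searched over a user-supplied grid. All function names are hypothetical.

```python
import math


def tss(triggered, predicted):
    """True skill statistic: sensitivity + specificity - 1."""
    tp = sum(1 for t, p in zip(triggered, predicted) if t and p)
    fn = sum(1 for t, p in zip(triggered, predicted) if t and not p)
    fp = sum(1 for t, p in zip(triggered, predicted) if not t and p)
    tn = sum(1 for t, p in zip(triggered, predicted) if not t and not p)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def fit_beta_lr(durations, intensities, triggered):
    """Least-squares slope of log10(I) on log10(D) for triggering events; beta = -slope."""
    x = [math.log10(d) for d, t in zip(durations, triggered) if t]
    y = [math.log10(i) for i, t in zip(intensities, triggered) if t]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return -sxy / sxx


def tune_alpha(durations, intensities, triggered, beta):
    """For a fixed beta, return the alpha with the highest TSS and that TSS."""
    # An event exceeds I = alpha * D**(-beta) exactly when I * D**beta >= alpha.
    z = [i * d ** beta for d, i in zip(durations, intensities)]
    best_alpha, best_score = None, float("-inf")
    for alpha in sorted(set(z)):
        score = tss(triggered, [v >= alpha for v in z])
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


def fit_lr_tss(durations, intensities, triggered):
    """LR&TSS: beta from linear regression, alpha tuned to maximize TSS."""
    beta = fit_beta_lr(durations, intensities, triggered)
    alpha, score = tune_alpha(durations, intensities, triggered, beta)
    return alpha, beta, score


def fit_tss_tss(durations, intensities, triggered, betas):
    """TSS&TSS: grid search over beta, with alpha tuned for each candidate."""
    best = None
    for beta in betas:
        alpha, score = tune_alpha(durations, intensities, triggered, beta)
        if best is None or score > best[2]:
            best = (alpha, beta, score)
    return best
```

Note that the grid search keeps the first parameter pair that reaches the maximum TSS. When several pairs score equally or nearly equally, the returned parameters depend on the search order and on small changes in the data.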
To test the sensitivity of these two methods to the record length, resampled (bootstrapped) time series of rainfall and debris-flow events from 1 to 30 years were produced. Only entire years were resampled, to avoid breaking up any natural intra-annual patterns. One sampled year consisted of all debris-flow triggering and non-triggering rainfall events. For example, for a record length of 5 years, 5 annual samples were drawn with replacement from the 17-year observation period. This means that a specific year could occur multiple times in one 5-year sample. This procedure was repeated 100 times for each record length. Finally, the bias in ID-threshold parameters was estimated for each sample. The bias was defined as the relative deviation of estimates of α and β from the corresponding reference values, i.e. the ones calculated from the original record using LR&TSS and TSS&TSS.
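The resampling scheme can be sketched as below. The sketch assumes events are stored per calendar year in a dictionary, and it takes the bias to be (estimate − reference) / reference. The sign convention, the data layout and the function names are assumptions made for illustration.

```python
import random


def bootstrap_years(years, record_length, n_samples=100, seed=0):
    """Draw `n_samples` synthetic records, each made of whole years sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(years) for _ in range(record_length)]
            for _ in range(n_samples)]


def pool_events(events_by_year, sampled_years):
    """Concatenate all triggering and non-triggering events of the sampled years."""
    pooled = []
    for year in sampled_years:
        pooled.extend(events_by_year[year])
    return pooled


def relative_bias(estimate, reference):
    """Relative deviation of a parameter estimate from its reference value."""
    return (estimate - reference) / reference
```

For each record length from 1 to 30 years, the threshold parameters would be re-estimated on every pooled sample, and the spread of the resulting biases gives the uncertainty bounds.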

Random forests for debris-flow prediction
Much of the information contained in the rainfall time series, such as antecedent rainfall and peak intensity, is lost when discretized into events and characterized only by mean rainfall intensity and duration. As an alternative, random forests (RF, Breiman, 2001) were used to include more rainfall event properties (Table 1) for the classification of rainfall events into debris-flow triggering and non-triggering. Random forests are based on a statistical learning algorithm that uses multiple decision trees. Each of these trees is trained with a subset of the predictor variables in the training data set. This procedure (also called bagging) is fundamental to the algorithm because it decreases the correlation among the trees and makes random forests suitable for capturing complex interactions and structures in the data. For detailed information, the reader is referred to Breiman (2001) and Hastie et al. (2009). The Scikit-learn module in Python was used to develop a random forest classifier (Pedregosa et al., 2011).
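A minimal sketch of such a classifier with scikit-learn is given below. It is not the model configuration of this study: the data are synthetic, the three predictor columns are hypothetical stand-ins for event properties, and the hyperparameters are illustrative. In scikit-learn's `RandomForestClassifier`, each tree is by default fitted to a bootstrap sample of the events and considers a random subset of the predictors at each split (`max_features`).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for an event table. Columns (all hypothetical):
# duration in h, mean intensity in mm/h, 30-min maximum rainfall in mm.
n_events = 400
X = np.column_stack([
    rng.uniform(0.2, 24.0, n_events),
    rng.gamma(2.0, 1.5, n_events),
    rng.gamma(2.0, 2.0, n_events),
])

# Synthetic labels: events with a large 30-min maximum are more likely to trigger.
p_trigger = 1.0 / (1.0 + np.exp(-(X[:, 2] - 8.0)))
y = rng.random(n_events) < p_trigger

model = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
model.fit(X, y)

# Probability of the triggering class for every event, and per-predictor importances.
prob = model.predict_proba(X)[:, 1]
importances = model.feature_importances_
```

The class probabilities in `prob` can be converted to a binary prediction with a cut-off other than 0.5, for example the one that maximizes TSS. Probabilities computed on the training events, as here, are optimistic, so any skill score should be evaluated on events that were held out from training.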
For the prediction of debris flows, logistic regression models have been used extensively in regional post-fire debris-flow studies to account for rainfall threshold variability due to spatial differences in slope and burned area (e.g. Cannon et al., 2010; Staley et al., 2017). Moreover, Bel et al. (2017) showed, for a French debris-flow torrent, that when ID thresholds were used in conjunction with a logistic regression model including variables for peak rainfall intensity, antecedent rainfall conditions and the number of days since winter, the number of false alarms could be reduced. RF was used instead in the present study, because of its ability to consider multiple predictor variables with non-linear relationships and correlating predictors. To our [...]

Here, four RF models, with the number of predictor variables ranging from 2 to 24, were tested, with the first being the equivalent of the traditional ID threshold (RF_ID, Table 1). The model output included the probability of being debris-flow triggering for every rainfall event. The probability threshold for classification had to be tuned because the threshold that optimizes TSS is likely not 50% but rather somewhat smaller (Nikolopoulos et al., 2018). The predictive performance was then compared with the classical ID threshold, and with all single predictors (Table 1).

In the study period from 2001 to 2017, 21 debris flows were triggered in spring and early summer (May and June, with the local influence of snowmelt), 38 in summer (July and August) and 8 in autumn (September and October) (Fig. 3). The monthly inter-annual variability was especially high in July, when between 0 and 6 debris flows occurred. This was also the month with the most extreme 30-min rainfall due to convective storms. The debris-flow activity dropped considerably in autumn, when the monthly rainfall also decreased by about 50%. Such seasonality is typical for Alpine debris-flow torrents (e.g. Schneuwly-Bollschweiler and Stoffel, 2012; Bel et al., 2017). In spring, snowmelt generates additional runoff and saturates the debris, which may lower the rainfall threshold for debris-flow initiation. This likely played a role in the events which were triggered at low rainfall amounts (30-min duration) before mid-June. There were also some lower-intensity events in July and August which still triggered debris flows. However, at this time of the year, inaccurate rainfall measurements and high spatial variability during convective storms are a more likely explanation. The most intense 30-min rainfall event that did not trigger a debris flow was in October, possibly indicating sediment supply-limited conditions.
Debris-flow triggering threshold curves in Illgraben showed the typical negative power-law relationship between mean intensity and duration (Fig. 4). Debris flows occurred mostly during high rainfall intensities. However, triggering and non-triggering events could not be separated perfectly. There were a few outliers at very short (<1 h) and very long rainfall durations (>16 h), which were triggered by comparably little rainfall. ID thresholds were computed with two methods (TSS&TSS and LR&TSS) for the entire data set and for individual seasons (Fig. 4). The scale parameter α had values between 2.6 and 7.3, with lower values for TSS&TSS (2.6-5.4) and higher values for LR&TSS (5.2-7.3). The shape parameter β was consistently smaller for TSS&TSS (0.26-0.93) than for LR&TSS (0.52-0.94) and varied considerably between the seasons. Only in spring were the thresholds practically identical (Fig. 4b). For the entire data set, this resulted in the median TSS&TSS threshold being lower for short durations (≤4.5 h) and higher for long durations. The LR&TSS threshold was very similar to a curve defined earlier for Illgraben, with α = 5.4 and β = 0.79 (McArdell and Badoux, 2007), although there α was set to detect all triggering events. If the same procedure had been used for the data set used here, the ID threshold would also be lower.
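Why one threshold can be lower at short durations and higher at long durations follows directly from the power-law form. As a short worked derivation (the subscripts 1 and 2 are introduced here only to label two generic thresholds, and D is assumed to be expressed in the same unit for both), the two curves intersect at a single duration D*:

```latex
\alpha_1 D^{-\beta_1} = \alpha_2 D^{-\beta_2}
\quad\Longrightarrow\quad
D^{\,\beta_1 - \beta_2} = \frac{\alpha_1}{\alpha_2}
\quad\Longrightarrow\quad
D^{*} = \left(\frac{\alpha_1}{\alpha_2}\right)^{\frac{1}{\beta_1 - \beta_2}},
\qquad \beta_1 \neq \beta_2 .
```

If threshold 1 has both the smaller scale parameter (α₁ < α₂) and the smaller shape parameter (β₁ < β₂), it lies below threshold 2 for D < D* and above it for D > D*, because it starts lower at unit duration but decays more slowly with duration.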
The seasonal ID thresholds differed, but direct comparison is difficult because of differences in the number of events (Fig. 4b-c). In autumn, it is clearer than in other seasons that only high-intensity rainfall events triggered debris flows. This may be due to (a) rainfall measurements being more representative and accurate for the initiation area than in summer when more convective events take place; (b) sediment availability being exhausted at the end of the debris-flow season (Berger et al., 2011; Bennett et al., 2014); or (c) grain size coarsening throughout the wet season, increasing the hydraulic conductivity in the channel bed and therefore also the rainfall threshold that must be exceeded to generate runoff (Domènech et al., 2019). As a consequence, the false alarm rate was lower in autumn than in other seasons.
For longer durations, larger rainfall amounts are required for debris-flow triggering, and this reflects the balance of infiltration, storage and drainage of water. However, for short and long rainfall durations ID pairs fail to plausibly describe the hydrological processes leading to landslide initiation (Bogaard and Greco, 2018). Here, it was not clear whether this was also the reason for the outliers triggered by lower mean rainfall intensities. There were two debris flows in summer at rainfall durations of 10 and 30 min which were triggered by significantly lower mean rainfall intensities than the other debris-flow events associated with these durations. One possible reason is that rain gauges, although close to the initiation area (∼1 km), are prone to not capturing peak intensities, especially of convective storms, even at short distances (Nikolopoulos et al., 2014; Marra et al., 2016). These two events were also characterized by high antecedent rainfall, however, which could have lowered the triggering threshold. In spring, three events were triggered at low mean rainfall intensities (<1 mm h⁻¹) and after more than 16 h. It could be that the MIT parameter does not separate rainfall events accurately in these cases, that the rainfall threshold was lower due to snowmelt, that there were errors in the rainfall data, or simply that these debris flows were triggered by mechanisms other than rainfall excess, such as the breaching of a small landslide dam.
Debris flows occur at a wide range of 14-day antecedent rainfall conditions, with many events occurring with very low values for this variable (Fig. 4a). Antecedent rainfall does not appear to be a significant precondition for debris-flow triggering in the Illgraben catchment. This has also been observed at other alpine locations (e.g. Abancó et al., 2016). Here, the debris-flow magnitudes were not affected by the intensity of the triggering rainfall, as observed in other studies (Hirschberg et al., 2019; Pastorello et al., 2018). However, the magnitudes were affected by the amount of antecedent rainfall. Higher antecedent rainfall amounts lead to a higher degree of pore saturation along the entire channel bed. Sediment entrainment then experiences a positive feedback from increased pore water pressure as the debris-flow surge passes by, increasing the debris-flow volume (Iverson et al., 2011; McCoy et al., 2012; Hirschberg et al., 2019).

Sensitivity of ID thresholds to record length and identification method

[...] to optimize TSS (LR&TSS) (Fig. 5b-d). TSS&TSS thresholds are lower for short durations (<4.5 h) and higher for long durations. TSS&TSS parameter estimates are overestimated by >100% even after 30 years of observations and the medians do not converge (Fig. 5b,c). The fluctuations in the median suggest that there are 2-3 parameter sets with TSS values which are all close to the optimum. Therefore, there are no unique ID-threshold parameters for Illgraben when estimated with TSS&TSS.

The uncertainty range is not located around zero but biased towards positive values because the reference values (see Fig. 4a, solid threshold line) are at the lower end of possible solutions. The medians and the uncertainty bounds from parameters estimated with LR&TSS still converge after the reference record length of 17 years, with biases of ±20% for both α and β (Fig. 5b,c). However, the biases decrease to ±30% already after 6 years or ∼25 triggering events. Furthermore, the TSS score, which [...]

Important advantages of using TSS for ID-threshold parameter estimation are that information from both triggering and non-triggering events is considered with equal weight and that the measure is prevalence independent. However, although the narrow uncertainty range for long record lengths (Fig. 5d) additionally indicates the robustness of the TSS score, it does not imply robustness in the parameter estimates that are based on TSS (Fig. 5b,c). In our case, with 67 debris flows, the ID-threshold parameters computed with TSS&TSS seem to be highly sensitive to a few triggering events, which may be outliers but exist in any data set and are difficult to single out with certainty. Consequently, for local ID thresholds we advise against simultaneously optimizing α and β against the TSS score (TSS&TSS). Local data sets are often comparatively small (<100 triggering events), and therefore this method can be sensitive to outliers.
Conducting the same sensitivity analysis on the regional data set, characterized by many more triggering events (∼800), which are mostly shallow landslides in this case, low prevalence (0.05%) and different triggering locations, showed the opposite result to that for the local data set. With the regional data set, TSS values can be slightly enhanced when using the TSS&TSS instead of the LR&TSS method (Fig. 6d), as in the local data set. In contrast to the local data set, this enhancement in TSS is not accompanied by larger uncertainties in ID-threshold parameters (Fig. 6b,c). LR&TSS thresholds are practically flat (Fig. 6e) and parameter ranges are still converging after the reference record length of 17 years. Using LR&TSS even makes the duration redundant because β estimated with LR&TSS converges to 0. Note that the large range in β bias for LR&TSS (Fig. 6c) is also because β is close to 0, and absolute values are small (see section 3.4). TSS&TSS threshold parameters converge after ∼8 years, corresponding to ∼400 landslides. This is more events than reported for data sets in Italy, where ∼200 were sufficient both on the regional and on the national scale (Peruccacci et al., 2012, 2017).
The regional data set is inherently subject to much larger uncertainties, which can be disregarded in the local data set. These uncertainties are mainly related to climatic, topographic and lithologic differences among the landslide locations. These differences may also lead to the slope of the ID curve, when fitted with linear regression, losing the typical power-law relationship of extreme rainfall. Furthermore, rainfall uncertainties are higher because regional analyses rely on interpolated precipitation (Frei and Schär, 1998). TSS&TSS instead profits more from the information in the non-triggering rainfall, by setting a threshold high enough to be above the many non-triggering rainfall events with low intensity and short duration, and steep enough to detect triggering events as a response to long-lasting rainfall (Fig. 6e).

The main differences between the methods used here and the well-known frequentist method (Brunetti et al., 2010) are that non-triggering rainfall events are considered either in the determination of the scale parameter α (LR&TSS) or in both α and the shape parameter β (TSS&TSS). Parameters estimated by LR&TSS for the local data set have lower uncertainties than when estimated by TSS&TSS. For the regional data set the TSS&TSS method yields both better predictions and lower parameter uncertainty. Hence, this comparison of a local and a regional data set suggests that the range of climatic, topographic, lithologic or land-use differences within a data set should be considered when deciding which method to apply for ID-curve estimation.
[Figure caption fragment: ... (reference) refer to the values obtained when the full regional data set is used.]

Predictive power of uni-and multivariate models
We compared debris-flow triggering proxies related to maximum precipitation intensity, antecedent rainfall and temperature, among others (Table 1, lower part). Each proxy was evaluated as a single predictor in terms of TSS. These were also compared with five multivariate models: the LR&TSS-ID threshold and four RF classifiers with different single predictors as input (Table 1, upper part). The TSS&TSS-ID thresholds were excluded here due to the large uncertainties in estimated parameters. The models were validated with five-fold cross-validation (CV).
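As an illustration of how a single predictor can be scored with TSS under five-fold cross-validation, the following sketch fits a threshold on four folds and evaluates it on the fifth. The alarm rule (value at or above the threshold), the stratified fold assignment and all function names are assumptions made for this example, not details taken from the study. It requires at least k triggering and k non-triggering events.

```python
import random


def tss_at(threshold, values, triggered):
    """TSS of the rule 'issue an alarm if value >= threshold'."""
    hits = sum(1 for v, t in zip(values, triggered) if t and v >= threshold)
    false_alarms = sum(1 for v, t in zip(values, triggered) if not t and v >= threshold)
    n_pos = sum(1 for t in triggered if t)
    n_neg = len(triggered) - n_pos
    return hits / n_pos + (n_neg - false_alarms) / n_neg - 1.0


def best_threshold(values, triggered):
    """Observed value that maximizes TSS when used as a threshold."""
    return max(sorted(set(values)), key=lambda c: tss_at(c, values, triggered))


def stratified_folds(triggered, k, seed):
    """Split event indices into k folds with similar numbers of triggering events."""
    rng = random.Random(seed)
    positives = [j for j, t in enumerate(triggered) if t]
    negatives = [j for j, t in enumerate(triggered) if not t]
    rng.shuffle(positives)
    rng.shuffle(negatives)
    return [positives[f::k] + negatives[f::k] for f in range(k)]


def cross_validated_tss(values, triggered, k=5, seed=0):
    """Fit the threshold on k-1 folds and score it on the held-out fold."""
    scores = []
    for held_out in stratified_folds(triggered, k, seed):
        held = set(held_out)
        train = [j for j in range(len(values)) if j not in held]
        threshold = best_threshold([values[j] for j in train],
                                   [triggered[j] for j in train])
        scores.append(tss_at(threshold,
                             [values[j] for j in held_out],
                             [triggered[j] for j in held_out]))
    return scores
```

The spread of the k returned scores plays the role of the cross-validation points reported alongside the full-data results. A large spread suggests that the fitted threshold depends strongly on which events happen to be in the training folds.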
We found that the RF classifier's TSS values (0.77-0.81) are only slightly higher than the classical ID threshold's TSS (0.76) (Fig. 7a). This improvement is due to a generally higher sensitivity score (Fig. 7b). However, as seen in the distribution of the points from the CV (small grey points in Fig. 7), RF can reduce uncertainties. The best improvement in TSS is seen when only one additional predictor (max 30-min rainfall) is added to the intensity and duration of rainfall events. If all available data are used by an RF classifier (RF_all), the performance even decreases compared with RF_ID+1. This is likely due to overfitting and can happen when many of the input variables have poor predictive power. In this case, the RF classifier is fitted to the noise from the poor predictors while the predictive information from other variables gets lost. Hence, CV helps to avoid the overfitting of a model set-up, and further improvements may be achieved by testing different combinations of variables.
[Figure 7 caption fragment: ... Table 1. The bars refer to models using the full data set for training. The grey dots represent results from the five-fold cross-validation.]

This analysis demonstrated that the best single predictor in terms of TSS is the 30-min max rainfall (TSS=0.76), and its specificity is also the highest overall (0.92) (Fig. 7c). In general, the predictors relating to max rainfall event intensities of different time scales have relevant predictive power, while predictors related to antecedent rainfall do not. Of all debris flows, 96% can be identified with a threshold (5 mm) for total event rainfall (Fig. 7b). Furthermore, 30-min maximum rainfall and mean rainfall intensity on their own perform similarly to the ID threshold. These rainfall properties are simple to calculate with a moving window from any rainfall time series, even without adding uncertainty by rainfall event separation. Surprisingly, lightning strikes also have some predictive power, even though they were recorded within an area which is almost 600 times [...]
Although ID thresholds are widely used in applications, a common problem is the number of false alarms they cause. For Illgraben, although 92% of the debris flows are detected with the ID threshold estimated with LR&TSS, only 20% of the rainfall events exceeding the threshold are expected to produce a debris flow. To increase this accuracy mainly based on different rainfall properties, we systematically analyzed the predictive power of single predictors and multivariate models based on the random forest algorithm. Although the RF classifier only marginally improved the TSS score, the potential of overcoming some of the well-known limitations of ID thresholds is evident. For example, rainfall properties could be combined with measured, remotely sensed or modelled variables, such as discharge or soil moisture products (Wicki et al., 2020). Recently, random forests have been used to detect mass movements from seismic signals (Wenner et al., 2021; Chmiel et al., 2021) and could be coupled with rainfall measurements or forecasts to potentially increase the accuracy. It would also be interesting to study the spatio-temporal rainfall structure from radar-based rainfall estimates and their influence on debris-flow triggering (Marra et al., 2016). Of course, random forests are only one alternative to ID thresholds, and there are other algorithms to be tested (see Kern et al., 2017, for a review on post-wildfire debris flows). A drawback of such empirical thresholds is that long-term observations are required to establish them. Where such data are available, additional predictors can easily be implemented in an RF classifier, as presented here, and tested regarding their predictive power.
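The moving-window calculation of maximum rainfall over a fixed time scale, mentioned above as a simple alternative to event-based ID variables, can be sketched as follows. The 10-min sampling interval (so that 30 min corresponds to three time steps) and the function names are assumptions made for this example.

```python
def max_window_sum(rain, window):
    """Largest rainfall sum over any `window` consecutive time steps."""
    if window < 1 or window > len(rain):
        raise ValueError("window must be between 1 and len(rain)")
    current = sum(rain[:window])
    best = current
    for k in range(window, len(rain)):
        # Slide the window by one step: add the newest value, drop the oldest.
        current += rain[k] - rain[k - window]
        best = max(best, current)
    return best


def max_30min_rainfall(rain_10min):
    """Maximum 30-min rainfall from a series with an assumed 10-min time step."""
    return max_window_sum(rain_10min, 3)
```

Because the window slides over the raw series, no separation into rainfall events is needed before the predictor is computed.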
In this study we used a 17-year record of precipitation and debris-flow timing and magnitude to complete a systematic analysis of rainfall conditions leading to debris-flow triggering in a Swiss catchment, Illgraben. Based on 67 debris-flow triggering and 1657 non-triggering rainfall events (prevalence of 3.8%) we defined a rainfall intensity-duration threshold $I = 5.2\,D^{-0.70}$ with the most suitable fitting method applied in this work. Given the high debris-flow frequency in Illgraben, it can be considered as a lower threshold for rainfall-induced debris flows in the Swiss Rhône valley.

Debris-flow activity is greatest in summer, coinciding with peaks in total monthly rainfall accumulations and peak 30-min rainfall. Although we find differences in seasonal ID thresholds, they are partly based on the occurrence of only a few triggering events (e.g. in autumn). It remains challenging to determine if the reason for this seasonality is seasonal differences in rainfall or in sediment availability or processes such as snowmelt or grain size coarsening.
Our systematic analysis of the uncertainties associated with ID thresholds shows that, for a catchment with rainfall-induced debris flows, 25 debris-flow observations are sufficient to constrain the ID-threshold parameters α and β with uncertainties of ≤30%. However, our findings demonstrate that this uncertainty strongly depends on the data set and the method used to determine ID-threshold parameters. When comparing the Illgraben (local) data set with a Swiss landslide (regional) data set, more triggering events (400) were needed for threshold parameters to converge in the regional data set, due to the higher spatial variability in the data set. More importantly, the best method to minimize uncertainties changed from LR&TSS for the local to TSS&TSS for the regional data set. This underlines the need for standardized methodologies for rainfall threshold identification and validation, and proper reporting of the methods used (see Segoni et al., 2018).
Finally, we aimed to lower the false alarm rate often associated with ID thresholds. Using a random forest model including the predictors rainfall event duration, mean rainfall intensity and the 30-min maximum rainfall amount increased the TSS (true skill statistic) by 0.04 (i.e. ∼3 more hits or ∼70 fewer false alarms). Adding more input variables to the random forest model, including antecedent rainfall, did not improve the performance. Although the expectation of significantly decreasing the false alarms was not fulfilled, we present a flexible framework where additional input variables can easily be tested. The aim of future work should be to include variables such as modelled or measured soil moisture or information on spatiotemporal rainfall structure from radar-based rainfall estimates. Machine learning algorithms can be helpful for maximizing information exploitation from available data and for increasing the accuracy of (debris-flow) early warning systems, and we have highlighted this potential.
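The correspondence between a TSS gain of about 0.04 and the stated numbers of hits or false alarms can be checked with a short calculation. It uses the event counts reported above (67 triggering and 1657 non-triggering rainfall events) and assumes TSS = sensitivity + specificity − 1, with one of the two rates held fixed while the other changes. The function name is illustrative.

```python
N_TRIGGERING = 67        # debris-flow triggering rainfall events
N_NON_TRIGGERING = 1657  # non-triggering rainfall events


def tss_change(extra_hits=0, fewer_false_alarms=0):
    """Change in TSS from additional hits and/or avoided false alarms."""
    return extra_hits / N_TRIGGERING + fewer_false_alarms / N_NON_TRIGGERING


print(round(tss_change(extra_hits=3), 3))           # 0.045
print(round(tss_change(fewer_false_alarms=70), 3))  # 0.042
```

Both routes give a change close to 0.04. Because non-triggering events are far more numerous, one additional hit moves the TSS roughly as much as 25 avoided false alarms.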