Prediction of natural dry-snow avalanche activity using physics-based snowpack simulations

. [.. 1 ] Predicting the timing and size of natural snow avalanches is crucial for local and regional decision-makers, but remains one of the major challenges in avalanche forecasting. So far, forecasts are generally made by human experts, interpreting a variety of data, and drawing on their knowledge and experience. Using avalanche data from the Swiss Alps and one-dimensional physics-based snowpack simulations for virtual slopes , we developed a model predicting the probability of dry-snow avalanches occurring in the [.. 2 ] region surrounding automated weather stations based on the output of a recently 5 developed instability model. This new avalanche day predictor was compared to benchmark models related to the amount of new snow. Evaluation on an independent data set demonstrated the importance of snow stratigraphy for natural avalanche release, as the avalanche day predictor outperformed the benchmark model based on the three-day sum of new snow (F1 scores: 0.71 and 0.65, respectively). The averaged predictions of both models resulted in the best performance (F1 score: 0.75). In a second step, we derived functions describing the probability for certain avalanche size classes. Using the 24-hour new snow 10 height as proxy of avalanche failure depth yielded the best estimator of typical (median) observed avalanche size, while the depth of the deepest weak layer, detected using the instability model, provided the better indicator regarding the largest observed avalanche size. Validation of the avalanche size estimator on an independent data set of avalanche observations conﬁrmed these ﬁndings. Furthermore, comparing the predictions of the avalanche day predictors and avalanche size estimators with a 21-year data set of re-analysed regional avalanche danger levels showed increasing probabilities for natural avalanches and increasing 15 avalanche size with increasing danger level. We conclude that these models may be valuable tools to support forecasting the occurrence of natural dry-snow avalanches.


Introduction
Forecasting natural snow avalanches is highly relevant in areas where avalanches may threaten people or infrastructure.Erroneous forecasts may cause costs as [.. 3 ]missed alarms may result in damage to people or infrastructure, and [.. 4 ]as false alarms may lead to economic loss due to unnecessary closures or evacuations.Therefore, accurately predicting the occurrence 1 removed: Accurately predicting the location, 2 removed: vicinity of 3 removed: (1) 4 removed: (2) of natural avalanches [.. 5 ]is crucial, though still a major challenge in avalanche forecasting.Currently, forecasts are made by human experts, drawing on their knowledge and experience.To forecast natural dry-snow avalanches, the (expected) amount of new snow is one of the main parameters.Accumulated sums of precipitation were found to be among the most important explanatory variables in several studies relating observed avalanche activity to meteorological drivers and observed snowpack parameters [.. 6 ](e.g.Ancey et al., 2004;Kronholm et al., 2006;Hendrikx et al., 2014).However, new snow depth alone is not sufficient for forecasting, but other contributing factors, in particular the presence of potential weak layers in the snowpack, have to be taken into account [.. 7 ](e.g.Stoffel et al., 1998;Schirmer et al., 2009;Schweizer et al., 2009).
While physical snowpack models, such as CROCUS (Brun et al., 1989(Brun et al., , 1992;;Vionnet et al., 2012) or SNOWPACK [.. 8   ] (Lehning et al., 1999;Bartelt and Lehning, 2002;Lehning et al., 2002a, b), are commonly used to model new snow amounts for operational avalanche forecasting, they have so far only rarely been used to assess snowpack instability based on simulated snow stratigraphy in an operational context (Morin et al., 2020).Some recent studies included information on simulated snow stratigraphy as explanatory variables to predict natural avalanche activity with statistical or machine learning models (Viallon-Galinier et al., 2023;Reuter et al., 2022).Viallon-Galinier et al. (2023) found a Random Forest (RF) classifier that included mechanically-based stability indices to outperform a classifier that only relied on meteorological and bulk snow parameters simulated with CROCUS.However, the precision of the improved classifier was low (3.4%), which was attributed to the scarcity of avalanche events and the potential misclassification of non-avalanche days in the observations.The uncertainty inherent in avalanche observation data generally poses a major challenge when developing avalanche prediction models.Errors in visual observations arise from the difficulty of retrospectively determining the exact date of an avalanche release and from missed avalanche events due to limited visibility during periods of heavy snowfall, when the probability of natural avalanche events is particularly high.Avalanche activity data recorded by [.. 9 ]detection systems (e.g.Heck et al., 2019;Mayer et al., 2020) is a promising alternative, but commonly covers only very limited areas (a few km 2 ), much smaller than typical forecasting regions (order of 100 km 2 ).Moreover, due to the relatively new technologies of [.. 10 ]automated avalanche detection, avalanche catalogues only cover a few winter seasons (van Herwijnen et al., 2016).For instance, Reuter et al. (2022) trained and tested a model using automatically detected avalanches using only 31 non-avalanche days and 15 avalanche days.
An alternative approach to develop snow instability models is to use a target variable based on surrogate data that implicitly contain information on avalanche activity, e.g.avalanche danger levels or stability test results from field observations.According to the definitions of the European avalanche danger levels (EAWS, 2023), natural avalanches are expected at level 4 (high) and 5 (very high), but unlikely at the two lowest levels (1 (low), 2 (moderate)).In addition, avalanche size increases with increasing danger level (e.g.EAWS, 2022;Schweizer et al., 2020a;Techel et al., 2020).Pérez-Guillén et al. (2022) recently developed a RF classifier that uses meteorological parameters and snow-cover properties simulated with SNOWPACK 5 removed: in space and time 6 removed: (e.g.Ancey et al., 2004;Kronholm et al., 2006; ?) 7 removed: (e.g.Stoffel et al., 1998;?;Schweizer et al., 2009) 8 removed: (?? Lehning et al., 2002a; ?) 9 removed: automatic detection systems (e.g.al., 2020) 10 removed: automatic to predict danger levels.Another recent RF classifier was trained on stability tests related to human-triggered avalanches [.. 11   ] (Mayer et al., 2022).This model, which we refer to as the instability model in the following, assesses the probability that a simulated SNOWPACK profile is potentially unstable considering human triggering.As the instability model was trained using stability tests related to human-triggered avalanches, its applicability to predict natural avalanches is not self-evident.However, its input features describing the potential weak layer (e.g. grain size) and the overlying slab (e.g. the ratio of the mean slab density and the mean slab grain size) are important variables not only with respect to human triggering but also regarding natural release.Comparing the classification of SNOWPACK profiles simulated using measurements from more than 100 automated weather stations (AWS) in Switzerland with a large number of avalanche forecasts, showed plausible results: the instability model yielded low probabilities of instability [.. 12 ]at the lower danger levels (i.e. level 1 (low) or 2 (moderate)) or in aspects and at elevations not indicated as critical in the forecast; whereas high probabilities were predicted for the upper danger levels (i.e. level 3 (considerable) or 4 (high)) (Techel et al., 2022).The instability model was tested in an operational setting by the national avalanche warning service in Switzerland during the 2021/2022 winter season, with promising results.
The objective of this study is to investigate whether the instability model developed by [.. 13 ]Mayer et al. ( 2022) applied to one-dimensional SNOWPACK simulations can be used to predict natural dry-snow avalanches at the regional scale.More specifically, we aim to derive a transformation of the current model output (probability of instability) to an index describing the probability of observing natural dry-snow avalanches in the surrounding of an AWS.For this purpose, we use avalanche observations recorded for avalanche forecasting in Switzerland during three winter seasons and SNOWPACK simulations from automated weather stations located at the elevations of potential avalanche starting zones.To reduce the uncertainty associated with visual avalanche observations, we apply a filter using observations from the wider surroundings.Furthermore, as a secondary objective, we explore whether we can estimate avalanche size based on one-dimensional SNOWPACK simulations.
The avalanche day predictor and the avalanche size estimator are both validated using 21 years of re-analysed regional danger level data and an independent data set of avalanche observations (5 years) from the region of Davos in the eastern Swiss Alps.
With these validation data, we also demonstrate the usefulness of predictions based on the instability model compared to the use of simple indicators of snow instability as the amount of new snow during the previous 24 or 72 hours.

Model
Validation data

Avalanche size estimator
Probability of natural avalanche occurrence in surrounding of AWS P(AvD) of avalanche observations (data set AV3; 2.1.3),as well as a data set of quality-checked regional avalanche danger levels (DL; Sect. 2.3).
2. To develop the avalanche day predictor, and test the avalanche size estimator, we used avalanche observations collected for the purpose of avalanche forecasting in Switzerland.During the winter season, generally from early December until late April, about 80 observers report avalanches in their region on a daily basis.These observations are highly relevant for the day-today verification of the avalanche forecast, particularly at the higher danger levels.Reported avalanche properties include the approximate location and the date of the avalanche release, the elevation and the slope aspect of the release area, the release type (i.e.natural or human-triggered), whether it was a dry-or a wet-snow avalanche (SLF, 2020), and a size estimate according 90 to the European avalanche size classification ranging from 1 (small) to 5 (extremely large) (EAWS, 2021, see Table 1).In many cases, the release date and time but also other parameters are estimated, as the actual avalanche release was not observed and The data set described in Sect.2.1.1only rarely contained an estimate of avalanche failure depth, which is equal to slab thickness.To derive a relationship between the failure depth of avalanches and avalanche size, we therefore extracted all drysnow avalanches that contained an estimation of avalanche size and the (mean estimated) failure depth from the operational database.Between November 1992 and June 2022 (30 years), this resulted in 5912 dry-snow avalanches.
2.  et al., 2021).These data were used in several studies [.. 20 ](e.g.Schweizer et al., 2020a;Mayer et al., 2022), and are publicly available (Schweizer et al., 2020b).From an updated version of this data set, we extracted all natural dry-snow avalanches that released in the 5 winters 2014/2015 to 2018/[.. 21 ]2019, which resulted in 1995 avalanches.In addition to the simulations on flat terrain, forced with measured snow depths, simulations were also performed for four 'virtual' slope orientations (N, E, S, W) with a slope angle of 38°, including snow redistribution from windward to leeward 120 slopes as described in Lehning et al. (2000) and Lehning and Fierz (2008).Model output was available for up to 124 AWSs.
We used SNOWPACK simulations for the four 'virtual' slope orientations from the 21 winters 2001/2002 until 2021/2022.
To assess snow instability from simulated snow stratigraphy, we applied the instability model to the simulated snow profile at 12:00 LT on the day of interest, as described in Techel et al. (2022).The instability model requires six input features describing the simulated snow layer of interest and the overlying slab.The output probability P unstable that a snow layer is unstable is 125 determined by the fraction of trees in the ensemble of 400 classification trees that classify the layer as potentially unstable.
Applying the instability model to every snow layer of a given snow profile allows computing the following properties (see also [.. 29 ]Fig.3): 29 removed: Figure -Critical weak layer properties: The critical weak layer relevant for natural avalanche release is defined as the layer with the highest probability of instability, i.e. the layer where P unstable = max(P unstable ).In case of ties, we selected the layer deepest in the snowpack.For each snow profile, we then extracted the following three layer properties: max(P unstable ), which we refer to as P crit , the depth z below the snow surface in cm (z crit ), and the grain type (gt crit ).We grouped grain types into three classes considering the primary grain type: (i) persistent grain types (pg), including depth The rate of snowfall and the amount of new snow are known to be important indicators of natural dry-snow avalanche activity, also called direct[.. 33 ]-action avalanches (e.g.Conway and Wilbour, 1999), but also for the potential size of avalanches (e.g.Schweizer et al., 2009).Therefore, we also calculated: -Height of new snow in 24 hours ([.. 34 ]HN1d).
- Lastly, we also extracted the minimum of the natural stability index sn38, which is implemented in SNOWPACK.Sn38 describes for each snow layer the ratio of the shear strength of the layer to the shear stress exerted by the overlying slab (Jamieson and Johnston, 1998;Lehning et al., 2004).

Re-analysed regional avalanche danger level (data set DL)
To validate model predictions, we used a data set of re-analysed regional danger levels.This data set is a subset of the forecast regional avalanche danger levels published by the national avalanche warning service in Switzerland.The data set only contains cases for which the forecast danger level was either validated or corrected (about 5% of the cases) after considering multiple pieces of evidence, as described by Pérez-Guillén et al. ( 2022).An updated version of this data set is publicly available [.. 47 ] (Techel, 2023).The data set consists of 36'582 re-analysed regional danger levels for specified warning regions, the smallest spatial units used in the Swiss avalanche forecast, for the forecast seasons from winter 2001/2002 to 2021/2022.In addition, the critical aspects and elevation range where the danger level applies, and the validity date of the forecast are indicated.The danger level is assigned according to the five-level European Avalanche Danger Scale (EAWS, 2022).The frequency of the danger levels in this data set is: 1 (low) 35%, 2 (moderate) 29%, 3 (considerable) 29%, 4 (high) 7%, 5 (very high) 0.3%.In this re-analysed subset, the proportions for 4 (high) and 5 (very high) are slightly larger than in the original forecasts.

Methods
In a first step, we developed an avalanche day predictor (Sect.3.1) addressing the question: for a given P crit -value, what is the probability of natural avalanches occurring in a specific aspect and elevation band in the surroundings of an AWS.We compared this approach with benchmark models based on conventional indicators related to the amount of new snow.Second, we built an avalanche size estimator (Sect.3.2) with the objective to extract information on the expected typical or largest avalanche size based on (simulated) weak layer depth or the height of new snow.
42 removed: Of these parameters, 43 removed: two parameters describing the 44 removed: new snow were independent of aspect, while all other parameters were extracted from the simulated profile, and thus vary between the four virtual slope aspects.For 45 removed: due to radiation varies 46 removed: In addition, the operational SNOWPACKversion includes snow redistribution as described in Lehning et al. (2000) and Lehning and Fierz (2008) 47 removed: (?) 3.1 Avalanche day predictor

Definition of avalanche days and non-avalanche days
To discriminate days with natural dry-snow avalanche activity (avalanche days, AvDs) from days without any avalanche activity (non-avalanche days, nAvDs), we relied on data set AV1 (Sect.2.1.1).The two main challenges in using these data relate to reliably labeling days with no avalanches and the correct estimation of the release date.For instance, even in areas that are regularly observed, the absence of reported avalanches may be due to poor visibility (i.e.continuous snowfall) rather than a true absence of recent avalanches, making it challenging to accurately determine situations without natural avalanches.Moreover, the accuracy of the release date depends on observation frequency in an area, on visibility conditions and the overall observation quality.To enhance the reliability of the avalanche day labels, we therefore applied the approach developed by [.. 48 ]Hendrick et al. ( 2023) to extract AvDs and nAvDs from the avalanche observations, with a specific focus on dry-snow avalanches.
We define the aspect-specific avalanche day index (Y ) in the [.. 49 ]surroundings of an automated weather station [.. 50 ]and within an elevation band ±250 m of the station elevation for four slope orientations (aspect [.. 51 ]=: N, E, S, W) as: = 0 for all elevations and aspects with NaN not a number.The avalanche activity index AAI refers to the weighted sum of the reported avalanches within the respective elevation band, aspect and area (Schweizer et al., 2003).The size-dependent weights w are defined as in Table 1.The gap check requirement is AAI(5000 km 2 ) > AAI(1000 km 2 ) & AAI(1000 km 2 ) > AAI(250 km 2 ), ensuring avalanche activity increases for larger areas and is not only local.As described in Hendrick et al. (2023), considering observation areas of different size allows to cross-check the absence or occurrence of avalanches.
This definition separates days with widespread avalanche activity (AvD; Y = 1) of a certain magnitude from days with absolutely no avalanches (nAvD; Y = 0), and excludes days with either only local avalanche activity (close to the station) or widespread activity but without any avalanches in the vicinity of the station.Regarding model development, it should be noted that we thus trained and tested our model using rather extreme cases, which are, however, comparably reliable in terms of the quality of the label.
[.. 52 ]By applying Eq. ( 1) to the training data set AV1[.. 53 ], we obtained aspect-specific time series containing AvDs and nAvDs for three winter seasons for each station.As it was still difficult to distinguish between cases classified as

Model development and evaluation
To develop the avalanche day model, we tested a set of predictor variables including [.. 57 ]HN1d, HN3d, z pp , P crit and sn38 in two different modelling approaches, namely a threshold-based binary classification model and continuous regression functions describing the probability for an AvD.
In a first step, we investigated the performance of each predictor variable in discriminating between AvDs and nAvDs from the training data set (i.e.data set AV1 and the corresponding SNOWPACK simulations) using a simple threshold-based binary classification model.To find the best threshold thr for each classification model, we optimized the F1 score, defined as the harmonic mean of the precision, also termed positive predictive value (PPV), and the true-positive rate (TPR; see Table 2 for the definitions of these performance measures).This approach favors a balanced trade-off between the TPR, which is the probability of detecting an AvD, and the PPV, the rate of correct positive predictions.
To examine the robustness of the threshold values and the resulting classification performance, we split the [.. 58 ]training data into several subsets, each of which was tested with the complementary data not used for deriving thr.We split by: 54 removed: which in the following we will refer to as the training data set, 55 removed: .The 56 removed: and the largest 57 removed: hn1d, hn3d, zpp and 58 removed: data AV1 -Hydrological year: We split the data by hydrological year, each with its own pattern of snowpack evolution and avalanche hazard characteristics.
-Grain type characteristics of critical weak layer: We distinguished between layers composed of persistent grain types and precipitation particles.There were only a few AvD cases for other grain types, therefore we did not train on this subset.
-Region: The AvDs are not equally distributed over the Swiss Alps (see Fig. 2).Ten of the 11 stations with the most AvDs are all located in the eastern Swiss Alps, in an area we refer to as Davos-Zuoz.This region is characterized by an inner-alpine climate.To ascertain that the threshold was independent of this spatial bias in the data, we compared a subset Davos-Zuoz (black dots in [.. 59 ]Fig.2) to elsewhere.
In a second step, we derived avalanche day predictors P (AvD) describing the probability for an AvD as continuous functions of a single input feature, i.e.
To estimate the relationship between the binary avalanche index data Y ∈ {0, 1} and the predictor variables we applied regression analysis with four-parameter sigmoidal (S-shaped) functions (see Table A1 in the appendix).The functions were fit on the complete training data set using non-linear least squares with parameter constraints to ensure that modeled probabilities did not exceed 1.For each input feature, we defined the best fitting function by minimizing the Brier score (BS; Wilks, 2011, p. 331), which is the mean squared prediction error: where N is the number of the prediction-observation samples denoted with index i, p i the predicted probability -here f (x i ), and Y i the observed outcome (1 for AvD and 0 for nAvD).A perfect model would thus result in a Brier score equal to zero.
As the data set AV1 contained about 10 times more nAvDs than AvDs, a model with a strong tendency to predict nAvDs

Avalanche size estimator
To estimate avalanche size for a given failure layer depth, we used data set AV2 (Sect.2.1.2) to relate avalanche size to observed failure depth (z obs ) using logistic regression functions of the form where P (S ≥ s) is the probability that avalanches greater or equal than size s (s ∈ [2, 3, 4, 5]; Table 1) were observed given the observed failure depth (z obs ).
The P (S ≥ s)-functions were derived using observed data only, as SNOWPACK simulations were not available for the locations of the avalanche release areas.were then compared to the observed median and maximum avalanche sizes on the respective AvD using the Brier score (eq. 3) with the probabilities p i = P (S ≥ s)(z i ), and the observed outcome Y i equal to 1 if an avalanche of size ≥ s was observed and 0 otherwise.[.. 67 ]To evaluate how well the avalanche size estimator captures rare events, we also calculated the Brier score BS + on the subset of positive observed outcomes[.. 68 ], i.e. the data points that had an observed avalanche of size ≥ s.

Validation and application
To evaluate the performance of the avalanche day predictors (P (AvD)) and the avalanche size estimators (P (S ≥ s)), we used two independent data sets: Weissfluhjoch (2536 m a.s.l.), and if at least one natural dry-snow avalanche was observed within each of the two 2. We compared the re-analysed forecast regional avalanche danger levels (DL, Sect.2.3) to values of P (AvD) [.. 75 ]and P (S ≥ s) [.. 76 ]computed for the stations and virtual slopes that matched the elevation and the critical aspects of the respective danger level data point.As for the other analyses, we used the snowpack simulations at 12:00 LT on the day of interest.For the winter seasons 2019-2020 to 2021-2022, we removed all data points used to develop the P (AvD)-model, and which had a simulated snow depth < 30 cm.

Avalanche days vs. non-avalanche days
Avalanche days were generally associated with new snow [.. 77 ](HN1d = 25 cm, [.. 78 ]HN3d = 59 cm, p < 0.001, row = all in Tab. 3).In contrast, nAvDs were typically characterized by no new snow ([.. 79 ]HN1d = 0, HN3d = 0, median values, p < 0.001).Consequently, the median [.. 80 ]thickness of the layers including precipitation particles varied in a similar way (AvD : z pp = 73 cm, nAvD : z pp = 0 cm).The simulated critical weak layer was at a median depth of 75 cm on AvDs and 22 cm on nAvDs.The simulated critical weak layer had a significantly higher probability of instability on AvDs compared to nAvDs (P crit = 0.92 vs. P crit = 0.33, respectively, p ≤ 0.001) and it was more often composed of persistent grain types (77% vs. 47% of cases, respectively, p ≤ 0.001).As indicated in the median depth of the most critical weak layer was 44 cm on AvDs in 2020 and 91 cm in 2021, while on nAvDs the values were 4 cm and 65 cm, respectively.Similarly, on AvDs, the depth of the weak layer was 88 cm when the critical weak layer consisted of persistent grains (pg) compared to 47 cm for precipitation particles (pp).
At least one potentially unstable layer was detected in 84% of the AvDs, and in only 2% of the nAvDs.Moreover, in 7% of the profiles, there was at least one other potentially unstable layer below the critical weak layer.These cases were rare on nAvDs (1% of the profiles), but quite frequent on AvDs (66%).The median difference in the depth between the critical and the deepest potentially unstable layer (z deep − z crit ) was 14 cm (IQR: 4-44 cm).On AvDs, these layers were 15 cm deeper (IQR: 5-49 cm) compared to only 4 cm (IQR: 2-26 cm) on nAvDs.If such a deeper weak layer existed, it primarily consisted of persistent grains (90%).

Predicting avalanche days and non-avalanche days
All explored variables ([.. 81 ]HN1d, HN3d, z pp and P crit ) showed highly significant differences between avalanche days and non-avalanche days as demonstrated in the previous section.In the following, we will first explore their potential for a binary classification of AvDs and nAvDs, and then derive continuous functions describing the probability for an AvD.
The optimal thresholds (thr) to distinguish between nAvD and AvD for the seven subsets varied when cross-validating the model.For instance, threshold values ranged from 9 to 17 cm for the 24-hour amount of new snow [.. 82 ]HN1d (median 12 cm) or from 22 to 47 cm for [.. 83 ]z pp (median 32 cm) (Appendix, Table A2).Applying these thresholds to the test sets, i.e.
the data not used for training, showed that all four variables performed similarly well in correctly predicting nAvDs (TNR ∈ [0.96, 1]; Fig. 4).In contrast, larger variations were observed in the true positive rate TPR, that is the proportion of correctly predicted AvDs.TPR was highest for [.. 84 ]HN3d (TPR = 0.81) and P crit (TPR = 0.79).The precision, i.e. the proportion of predicted AvDs that also were observed as AvDs, was highest for the two new snow parameters ([.. 85 ]PPV(HN1d) = 0.84, PPV(HN3d) = 0.83).However, these two parameters also showed a greater variation in PPV between subsets compared to P crit , which had a more consistent performance though a slightly lower PPV of 0.80.Overall, in terms of a balanced performance maximizing the F1 score, both [.. 86 ]HN3d and P crit had similar values (median F1 score of 0.80 in crossvalidation).All approaches by far outperformed the natural stability index sn38 (median cross-validated F1 score of 0.24).Due to the limited discriminatory power of sn38, this variable was not considered further in the subsequent development of continuous models.
Analyzing differences between the subsets in more detail also provided interesting insights.For instance, the optimal balanced [.. 87 ]z pp -threshold to differentiate AvD from nAvD was 40 cm when the critical weak layer consisted of precipitation particles (pp) compared to 22 cm for persistent grains pg; it was 47 cm in the region elsewhere and 22 cm in the inner-alpine 81 removed: hn1d, hn3d 82 removed: hn1d (median 13 83 removed: zpp 84 removed: hn3d 85 removed: PPV(hn1d) = 0.84, PPV(hn3d) = 0.83 86 removed: hn3d 87 removed: zpp region of Davos-Zuoz, where persistent weak layers are more frequently observed (e.g.Schweizer et al., 2021, see also Table 3).Similar results were also obtained for the two new snow variables, thus confirming what is known from a process-based point of view: when persistent weak layers are present, less new snow is needed to trigger natural avalanches (Stoffel et al., 1998;Schweizer et al., 2009).
The values of P (AvD) predicted with these two variables correlated strongly (Pearson correlation coefficient r = 0.82).The thresholds where the functions reached P (AvD) = 0.5 (Table A3) were slightly higher compared to the thresholds of the binary classification models described above (Table A2).The F1 scores resulting from these thresholds deviated from the optimal F1 scores obtained with the simple classifiers by less than 1%.Therefore, we only evaluated the performance of the continuous avalanche day predictor functions in the validation (Sect.4.4).
Finally, we explored the performance when averaging the P (AvD)-predictions based on [.. 92 ]HN3d and P crit .Taking the mean of both models resulted in slightly better performance compared to the best performing approach [.. 93 ]P (AvD)(HN 3d): the Brier Score BS decreased from 0.021 to 0.019, while the Brier Score [.. 94 ]on the subset of AvDs, BS + , decreased from 0.156 to 0.144.Translating the mean probability into a binary classification resulted in a TPR of 0.81, a TNR of 0.99, and a high PPV of 0.95.Thus, the combined model detected more than 80% of the avalanche days correctly, and had the overall highest F1 score of 0.87.for the seven subsets numbered in Table 3.The whiskers mark the respective minimum and maximum values, the larger circles display the median values of these performance measures on the seven subsets.In addition, for each parameter, the median threshold, TNR, and F1 score are indicated in the legend.

Estimating avalanche size
In data set AV3, containing 5912 observed avalanches (Sect.2.1.2),the recorded failure depth z obs correlated with avalanche size (r s = 0.45, p < 0.001; Fig. 6a).The median failure depth increased from 30 cm for size 1 avalanches to 100 cm for size 5 avalanches.While there is considerable overlap, the distributions of z obs were significantly different between pairs of consecutive avalanche size classes (Wilcoxon rank-sum test p < 0.001).Based on this data set, we derived logistic functions P (S ≥ s)(z obs ) to estimate avalanche size from z obs ([.. 99 ]Fig.6b; the respective coefficients are provided in the Appendix in Table A4).
Comparing P (S ≥ 3) and P (S ≥ 4) with the observations from the data set AV1 on AvDs with at least two recorded avalanches, we obtained the lowest Brier score BS if the median avalanche size was estimated with [.. 102 ]HN1d as a proxy for the failure depth.For the largest recorded avalanche, on the other hand, [.. 103 ]z deep was the best predictor (Table 4).
Considering only [.. 104 ]the subsets of data points where the avalanche size of interest was indeed observed (BS + in Table 99 removed: Figure 102 removed: hn1d 103 removed: z deep 104 removed: subsets when

Predicting natural avalanche activity in the region of Davos
While the predictive power of the continuous models P (AvD)(P crit ) and [.. 120 ]P (AvD)(HN3d) was similar when applied to the training data set AV1 (see Table A3), there were substantial differences in the performance of these models on the validation data set AV3 of observed avalanches from the region of Davos (Sect.2.1.3),as seen in Table 5 and Fig. 7.For both models, the predicted AvD-probability was low for nAvDs (median values 0.03 and 0.01, respectively), yet for AvDs, the P (AvD)- predicted (TPR = 0.55) and 77% of the 193 predicted AvDs corresponded to actual observed AvDs (PPV = 0.77, Table 5).
The P (AvD)(P crit )-model, on the other hand, had a higher probability of detecting AvDs (TPR = 0.90), while the proportion of predicted AvDs that matched an observed AvD was lower (PPV = 0.59).In terms of F1 score, the P (AvD)(P crit )-model    . 166]z deep , the most frequently predicted avalanche size was size 2 (proportions 0.45-0.48)at 1 (low) and 2 (moderate), and size 3 (proportions 0.43-0.48)at the three highest danger levels.While the proportions of size 3 were approximately similar at the three highest danger levels, the combined proportions of size 4 and size 5 avalanches increased considerably with increasing danger level from 0.13 at 3 (considerable) to 0.41 at 5 (very high) ([.. 167 ]Fig.9b).
Considering z deep regardless whether the respective layer was classified as potentially unstable or not, resulted in the following median values for 1 (low) to 5 (very high): 31 cm, 31 cm, 53 cm, 90 cm, and 154 cm, and, hence, in approximately similar size distributions as when considering only z deep for layers classified as potentially unstable.
Lastly, we explore predictions expected to describe the frequency distribution of snowpack stability.Applying the instability model and the avalanche day predictor to spatially distributed snowpack simulations may yield frequency distributions of snowpack stability with respect to human triggering and natural release, respectively.Spatially distributed simulations of snow stratigraphy can be obtained either with high-resolution output of numerical weather prediction models (e.g.Vionnet et al., 2012;Bellaire and Jamieson, 2013b;Horton et al., 2015) or precipitation input scaled according to terrain properties (e.g.Reuter et al., 2016;Richter et al., 2021).While the demonstration of such an approach was out of scope for this study, we here compare the frequency of locations indicating natural avalanche activity based on the mean of P (AvD)(P crit ) and P (AvD)(HN3d) using posterior knowledge to aggregate AWSs from regions with the same danger level to estimate a frequency distribution of snowpack stability by the proportion of AWSs indicating natural avalanche occurrence (P (AvD)(combi) ≥ 0.5).Similarly, we calculated the mean approximated failure depth z deep per day and region with the same danger level.Results suggest that avalanche probability and z deep , the estimator best correlating with the largest avalanche size, increased non-linearly with danger level.As can already be expected from Figures 8b and c, the largest spread in conditions can be noted at danger level 3 (considerable), where the frequency of locations for which natural avalanches are predicted spanned almost the entire range of possible values (see shape of orange density contours in Fig. 10).In contrast, at 2 (moderate), frequency values were either low (median P (AvD)(combi) = 0.05) or z deep was comparably small (median z deep = 30 cm), while at 4 (high), both the frequency of locations with natural avalanches predicted and z deep were comparably high (median P (AvD)(combi) = 0.88, z deep = 75 cm).

Discussion
We developed an avalanche day predictor P (AvD)(P crit ) describing the probability for natural dry-snow avalanches in the surrounding of an AWS for a given slope aspect based on simulated snow stratigraphy.We compared the performance of this index with benchmark models relying on the amount of new snow.The combination of P (AvD)(P crit ) with a model 163 removed: , i. e. P deep >= 0.77, the median depth of the weak layer, z deep , led to similar median values of z deep ranging from 29 164 removed: at 1 (low) to 157 165 removed: at 5 (very high) (Figure 9e). 166removed: z deep 167 removed: Figure In the following, we will discuss the performance and limitations of the avalanche day predictors (P (AvD)) (5.2) and avalanche size estimators (P (S ≥ s)).

Data reliability
To develop the avalanche day predictor, we created a robust binary target variable (AvD versus nAvD) imposing restrictions on the target variable included rather extreme cases of widespread activity versus no activity at all, which should be taken into account in model interpretation.Due to the necessary adaptation of the AvD/nAvD definition in the Davos validation data set, described in Section 3.3, the reliability of the avalanche day labels is lower as the definition is relaxed.This also shows in the ratio AvD/nAvD, which ranged from 6 to 15% in the training data set, while for the Davos validation set it was 27%.In conclusion, the adapted definition leads to a higher proportion of AvDs, which seems responsible for the lower performance.
As the exact timing of avalanche release was not included in the data sets of observed avalanches, the explanatory variables were extracted from the snowpack and instability model simulations at fixed time steps (12:00 LT).This introduced uncertainty in the explanatory variables of both the training and validation data sets.With avalanche data sets from remote detection systems, providing the exact release time, this uncertainty would be removed.However, so far such data sets only cover short time periods and are very local in scope [.. 171 ](e.g van Herwijnen et al., 2016;Heck et al., 2019;Mayer et al., 2020;Reuter et al., 2022), whereas the training data set used in this study included avalanches from the entire Swiss Alps observed over 3 winter seasons.

Predicting avalanche days
In a first step, we analyzed the predictive power of the explanatory variables to distinguish between AvDs and nAvDs using different subsets of the training data set (AV1).An optimized threshold-based classification resulted in a reasonably high performance (cross-validated F1 score: 0.80) of P crit , and clearly outperformed the conventional natural stability index sn38 (cross-validated F1 score: 0.24).[.. 172 ]While sn38 seems well-suited to model natural avalanche activity from a physical point of view, the parametrization of this simple criterion within SNOWPACK has some weaknesses: Indeed the shear stress can be simply calculated from the load and thus only inherits the errors from estimating precipitation mass based on measured snow depths, yet shear strength is a rather complex microstructural parameter.The current SNOWPACK parametrization of shear strength is based on density and grain type (Jamieson and Johnston, 2001), which may not be sufficient to capture the influence of microstructure as also pointed out by Richter et al. (2020).In particular, the evolution of the SNOWPACK shear strength over time only depends on density if grain type does not change.The poor performance of sn38 is in line with other studies [.. 173 ] (Jamieson et al., 2007;Reuter et al., 2022).For instance, Jamieson et al. (2007) analyzed sn38 based on field measurements and concluded that critical values of stability indices are less useful than their trends, a result confirmed by Reuter et al. (2022) who showed that using time derivatives of sn38 has a higher predictive power.In contrast, the [.. 174 ]3d-sum of new snow (HN3d), recognized as an important indicator of avalanche activity in past studies (Ancey et al., 2004;Schweizer et al., 2009) the complete snowfall event, in contrast to [.. 176 ]HN3d.Potentially, using the mass of recent slab layers, which is more directly related to the load on the weak layer, may lead to better results than the depth of these layers.
Evaluating continuous one-dimensional sigmoidal P (AvD)-functions for the four considered input variables ([.. 177 ]HN1d, HN3d, z pp , P crit ) on the training data (AV1) resulted in negligible differences in F1 scores (≤ 1%) compared to the F1optimized threshold-based classification.The best performance in terms of F1 and Brier scores was obtained by taking the average probability from P (AvD)(P crit ) and [.. 178 ]P (AvD)(HN3d), which was also confirmed by the validation on the independent [.. 179 ]data set from the region of Davos (data set AV3, F1 = 0.75).On this data set (AV3), the performance of the [.. 180 ]P (AvD)(HN3d)-model in terms of predicting AvDs was rather low (TPR = 0.55).A possible explanation is the more frequent formation of persistent weak layers in the region of Davos due to its relatively dry, inner-alpine snow climate, compared to the mean snow climate in the Swiss Alps.If weak layers are present within or at the snow surface, avalanches can release with smaller amounts of new snow (e.g.Schweizer et al., 2009;Schneebeli et al., 1998), which was also illustrated by the differences in optimal thresholds for the subsets from the training data (Sect.4.1).The combination of [.. 181   ]P (AvD)(HN3d) with P (AvD)(P crit ) presents an alternative to using snow-climate-specific thresholds, as the P crit -variable captures the presence of weak layers.
Most of the recently developed snow instability models [.. 182 ]( Viallon-Galinier et al., 2023;Pérez-Guillén et al., 2022;Hendrick et al., 2023;Sielenou et al., 2021) are based on statistical methods which account for non-linear, complex relationships between target and explanatory variables.Here, we chose a rather simple approach based on one-dimensional sigmoidal functions which cannot account for interactions between explanatory variables, but allow for a simple interpretation of model output.Nevertheless, it should be noted that P crit itself is based on the output of a random forest model, which renders the interpretation of P (AvD)(P crit ) with respect to the original input parameters of the instability model difficult.For a discussion of the influence of these input parameters on the direct output of the instability model, the layer-specific probability of instability, P unstable , see [.. 183 ] Mayer et al. (2022).
For a model to be considered useful, it has to provide more information than can be obtained from basic prior information (Honts and Schweinle, 2009), for instance, when simply assuming the base rate of avalanche days as the constant probability for an avalanche day.Thus, the potential benefits of a threshold-based classification can also be explored using the concept of information gain (Honts and Schweinle, 2009).Applied to our context, information gain is defined as the difference between the base rate probability of avalanche days and the posterior probability (or the positive predictive value PPV or precision; Honts and Schweinle, 2009).As shown in Table 5 for the avalanche observations in the region of Davos (data set AV3), particularly 176 removed: hn3d 177 removed: hn1d, hn3d, zpp 178 removed: P (AvD)(hn3d) 179 removed: dataset 180 removed: P (AvD)(hn3d) 181 removed: P (AvD)(hn3d) 182 removed: (Viallon-Galinier et al., 2023;Pérez-Guillén et al., 2022;?;Sielenou et al., 2021) 183 removed: ?
the training data set of observed avalanches from all over Switzerland (AV1).This suggests that for the occurrence of natural dry-snow avalanches, snow stratigraphy seems to be of secondary importance compared to the amount of new snow.However, model evaluation on an independent data set from the region of Davos (AV3) (Sect.4.2) and the analysis of specific subsets of the training data showed that accounting for snow stratigraphy is important when prominent persistent weak layers are present in the snowpack as less new snow is required to cause a decrease in stability.In the classification of avalanche days from the region of Davos, P (AvD)(P crit ) outperformed [.. 208 ]P (AvD)(HN3d) (F1 = 0.71 and 0.64, respectively), and the averaged predictions of both models yielded the overall best performance (F1 = 0.75).The performance of this combined model should be evaluated on further independent data sets to investigate its applicability to snow climates that were not represented by the data used in this study.
We also explored whether indicators of avalanche size can be obtained from one-dimensional SNOWPACK simulations.
Our avalanche size estimator, developed using observations of avalanche size and failure depth, produced the best results in predicting the largest avalanche size when the depth of the deepest simulated weak layer ([.. 209 ]z deep ) was used as a proxy for failure depth.This demonstrates that including information on snow stratigraphy is critical for estimating avalanche size, compared to relying exclusively on parameters based on the amount of new snow.
[.. 210 ]And lastly, as part of the model validation, we showed that model predictions (avalanche day and size) were related with the danger levels.The results were in line with current definitions of the avalanche danger levels and with previous data-driven studies, highlighting the models' potential to support decision-making in regional avalanche forecasting.
The models developed in this study allow for the estimation of two determinants of regional avalanche danger, snow instability and avalanche size.Applied to one-dimensional snowpack simulations driven with data from AWSs or numerical weather prediction models, these models can thus provide valuable support in operational avalanche forecasting.

Figure 1 .
Figure 1.Several data sets were used to develop and validate the functions describing the probability of natural avalanche occurrence and avalanche size.The data are described in the sections indicated.

Figure 2 .
Figure 2. Map of Switzerland showing the location of the automated weather stations (dots).The coloring indicates the number of avalanche days (AvD) per station[.. 22 ], summed up over all aspects and over the three winter seasons 2019/2020 -2021/2022.Stations in the Davos-Zuoz area, which had N ≥ 13 AvDs per station were combined in a subset «Davos-Zuoz» (marked with white circles), all other stations as «elsewhere».For illustration purposes, the major rivers and lakes are shown in blue, and the elevation in grey.(Digital elevation model -source: Federal Office of Topography swisstopo)

Figure 3 .
Figure 3. Example of a simulated snow profile showing the hand hardness profile, the grain type of the simulated layers (coloring of the layers), and the probability of instability Punstable (black line).The critical weak layer is defined as the layer where Punstable is maximal.Hand hardness (F -fist, 4F -four fingers, 1F -one finger, P -pencil, and K -knife) and grain type (PP -precipitation particles, DF -decomposing and fragmented precipitation particles, RG -rounded grains, FC -faceted crystals, DH -depth hoar, SH -surface hoar, MF -melt forms) were coded after [.. 27 ]Fierz et al. (2009).The dashed vertical line displays the threshold of Punstable = 0.77 discriminating between stable and potentially unstable layers derived by [.. 28 ]Mayer et al. (2022).The depth of precipitation particles and the deepest weak layer, i.e. the deepest layer exceeding the instability threshold, are indicated.
were solely due to missing observations rather than actual nAvD, we only retained the winter seasons with at least one AvD for a given station-aspect combination.In the end, [.. 54 ]the data set contained about ten times more nAvDs (N = 8511) than AvDs (N = 872).Overall, AvDs had a median of two avalanches (interquartile range IQR: 2-7) in the aspect and elevation of the snowpack simulation within an area of 250 km 2 surrounding the station.Two or more avalanches were recorded on 559 of the 872 AvDs.The median AAI on AvD was 1 (IQR 0.3-3.6)[.. 55 ], the typical avalanche (median avalanche size) was of size 2 (IQR 2-3)[.. 56 ], and the typical largest avalanche (median) was of size 3 (IQR 2-3).
could result in low Brier scores simply because the evaluation on the minority subset of AvDs has lower weight compared to the subset of nAvDs.To indicate how well the minority class of [.. 62 ]AvDs was captured by the model, we therefore additionally calculated the Brier score on the subset of AvDs only (BS + ).

[
.. 72 ]surrounding regions (1000 km 2 and 5000 km 2 ) [.. 73 ]regardless of aspect and elevation.The definition of an AvD was thus slightly adapted compared to eq. (1) due to the lack of consistent information on aspect and elevation of the observed avalanches within the two larger surrounding regions.Thus, AvD labels in the validation data set are somewhat less reliable compared to the original definition.The resulting data set consisted of 273 avalanche days and 984 non-avalanche days during the five winter seasons [.. 74 ]2014/2015 until 2018/2019.For each of these 1257 days, we calculated aspect-specific values of P (AvD) and P (S ≥ s) using SNOWPACK virtual slope simulations driven with quality-checked data from the AWS Weissfluhjoch (see Sect. 2.2).With the adapted definition of AvD (no consideration of aspect for the two surrounding areas not covered by the avalanche observations in the region of Davos), we obtain some more AvDs than in the other data sets.This follows from the fact that an east-facing avalanche in the region of Davos will count towards an AvD if there are other avalanches in the surrounding areas with unknown aspect.

Figure 4 .
Fig. 2), and as function of the grain type of the critical weak layer (pg = persistent grain types, pp = precipitation particles).The seven numbered subsets were used for cross-validation.Median values are shown for [.. a ]HN1d, [.. b ]HN3d, Pcrit, zcrit and [.. c ]zpp, and the proportion of critical weak layers consisting of persistent

Figure 5 .
Figure 5. Probability that a day is an avalanche day (P (AvD)) as a function of (a) [.. 95 ]Pcrit and (b) [.. 96 ]HN3d for the data subsets shown in Table 3.The subsets are binned with bin-size being [.. 97 ]0.1 in (a) and [.. 98 ]10 cm in (b).The best-fitting function describing all data is shown in black.

Figure 7 .
Figure 7.Estimated probabilities P (AvD) for avalanche days (AvD) and non-avalanche days (nAvD) based on the data set of observed avalanches from the region of Davos using the models based on a) Pcrit, b) [.. 128 ]HN3d and c) the averaged predictions P (AvD)(combi) of the models based on Pcrit and [.. 129 ]HN3d.

Figure 10 .
Figure 10.Proportion of predictions with P (AvD)(combi) ≥ 0.5 and mean depth of deepest weak layer z deep per day and danger level.Shown are the respective median values (labels).Contour lines indicate the two-dimensional density distributions for 2 (moderate), 3(considerable) and 4 (high).The respective outermost contour line represents the (0, 0.1] percentile interval, and the innermost contour line the (0.9, 1.0] interval.Not shown are density estimates for 1 (low), for a few cases the proportion was ≥ 0.01 existed, and for 5 (very high), as only the four data points shown existed.

Table 1 .
Avalanche size classification (according to SLF, 2020; EAWS, 2021)and the corresponding weight, w, used [.. 17 ]to calculate the starting zone of an avalanche is generally not possible.Other avalanche characteristics like the type of avalanche (i.e.slab or loose snow avalanche), the length and width, or the failure depth are also reported sometimes.
For this study, we only considered natural dry-snow avalanches that were recorded between 1 December and 30 April in the three winter seasons [.. 15 ]2019/2020, 2020/2021, and 2021/2022 in the Swiss Alps.In total, 12'940 avalanches were reported[.. 16].Even though the operational avalanche database also contains avalanche observations prior to 2019, the recording standards were different and did not allow us to unambiguously identify natural dry-snow avalanches.2.1.2SwissAlps, observed avalanches (data setAV2, 1992AV2,  /1993AV2,   -2021AV2,  /2022, 30 years)   , 30 years) Mayer et al. (2022)critical weak layer, we searched for potential weak layers deeper in the snowpack.We selected the deepest weak layer as the deepest layer fulfilling P unstable ≥ 0.77 -the best-splitting threshold suggested by [.. 31 ]Mayer et al. (2022)to distinguish between stable and potentially unstable layers.If no such layer existed, the deepest weak layer was the critical weak layer.For each profile, we then extracted the probability of instability for the deepest weak layer P deep , the depth below the snow surface ([.. 32 ]z deep ), and the grain type (gt deep ).
.. 42 ]Conventionally, the [.. 43 ]height of[.. 44]new snow is measured in the flat field.Consistent with this definition, the new snow amounts provided by SNOWPACK are therefore for the flat field as well, regardless of whether it is a simulation in the flat or on a virtual slope.However, we also considered the thickness of precipitation particle layers z pp , which should capture the amount of recently-fallen snow including snow transport by wind, since we used the SNOWPACK version including snow redistribution by wind.All other parameter depend on aspect since they were selected from the virtual slope simulations where, for instance, energy input [.. 45 ]and snow accumulation vary dependent on aspect.[.. 46 ] [

Table 2 .
Confusion matrix defining the possible combinations of observed and predicted labels (upper part) and definition of resulting performance measures true-positive rate (TPR), positive predictive value (PPV), true-negative rate (TNR) and F1 score (F1; lower part).
To analyze the performance of the size indicators combined with the depth parameters extracted from SNOWPACK, we estimated probabilities for different avalanche sizes on the AvDs from the training data set ([.. 63 ]Sect.3.1.1)using the simulated depth parameters z ([.. 64 ]HN1d, HN3d, z pp , z crit , [.. 65 ]z deep , described in [.. 66 ]Sect. 2.2) as proxies for the potential failure depth.The resulting estimated probabilities for different avalanche sizes (s ∈ [3, 4]) Table 3, these values varied between subsets.For instance,

Table 5 .
Performance statistics of different avalanche day predictors P (AvD) on the independent validation data set (AV3) with observed avalanches from the region of Davos including 273 AvDs and 984 nAvDs.

Table 6 .
Brier scores BS for predicting the median or the largest avalanche size for all avalanche days (AvD) with ≥ 2 avalanches (N = 185) from the validation data set AV3 of the region of Davos using the predictor variables [.. 134 ]HN1d and [..Brier scores shown in Table6are in line with the performance for data set AV1 (Table4): The lowest Brier score BS for the estimation of median avalanche size was obtained when using [.. 131 ]HN1d, while for the largest observed avalanche [.. 132 ]z deep was again the better predictor.Considering only events when median avalanches sizes greater or equal than size 3 were observed (N= 33), again using [.. 133 ]z deep resulted in the lower error rate BS + .Pcrit, (b) the probability of an AvD provided by the avalanche day predictor P (AvD)(Pcrit), and (c) the probability of an AvD based on [.. 145 ]P (AvD)(HN3d), the benchmark model.Model predictions were computed for the stations and virtual slopes that matched the elevation and the critical aspects of the respective danger level data point.The dashed horizontal line represents the best-splitting threshold to distinguish between (a) stable and potentially unstable profiles [.. 146 ](0.77;Mayer et al., 2022), and (b, c)between AvDs and nAvDs.The respective proportions above and below this threshold are indicated for each danger level.
135]z deep as input for the P (S ≥ 3)function.BS + evaluates only a subset of the data when the condition is fulfilled (median / maximum S ≥ 3: N = 33 / N = 126).The best-performing approach is highlighted bold.