On the stability interpretation of Extended Column Test results

Snow instability tests provide valuable information regarding the stability of the snowpack. Test results are key data used to prepare public avalanche forecasts. However, to include them in operational procedures, a quantitative interpretation scheme is needed. Whereas the interpretation of the Rutschblock test is well established, a similarly detailed classification for the Extended Column Test (ECT) is lacking. Therefore, we develop a 4-class stability interpretation scheme. Exploring a large data set of 1719 ECTs observed at 1226 sites, often performed together with a Rutschblock (RB) in the same snow pit, and corresponding slope stability information, we revisit the existing stability interpretations, explore the potential of a more detailed classification, and specifically consider the interpretation of cases when two ECTs were performed in the same snow pit. Our findings confirm previous research, namely that the crack propagation propensity is the most relevant result and that the loading step required to initiate a crack is of secondary importance for stability assessment. The comparison with the RB showed that the ECT classifies slope stability less reliably than the RB. In some situations, performing a second ECT may be helpful, namely when the first test indicated neither rather unstable nor stable conditions. Finally, the data clearly show that false-unstable predictions of stability tests outnumber the correct-unstable predictions in an environment where overall unstable locations are rare.


Stability classification of ECT and RB
To facilitate the distinction between the result of an instability test and the stability of a slope, we refer to test stability using four classes 1 to 4, with class 1 being the lowest stability (poor or less) and class 4 the highest (good or better). In contrast, for slope stability, we use the terms unstable and stable. We chose four classes as a similar number of classes has been used for RB stability interpretation, as outlined below.
Rutschblock test: We classified the RB in four classes (classes 1 to 4; Fig. 2). We largely followed the RB stability classification by Techel and Pielmeier (2014), who used a simplified version of the classification used operationally by the Swiss avalanche warning service (Schweizer and Wiesinger, 2001; Schweizer, 2007). Schweizer (2007) defined five stability classes for the RB, based on the score and the release type in combination with snowpack structure, while Techel and Pielmeier (2014) relied exclusively on RB score and release type. In contrast to both these approaches, we combined the two highest classes (good or very good) into one class (class 4).
Shallow weak layers (≤ 15 cm) are rarely associated with skier-triggered avalanches (Schweizer and Lütschg, 2001; van Herwijnen and Jamieson, 2007), which is, for instance, reflected in the threshold sum approach (Schweizer and Jamieson, 2007), a method to detect structural weaknesses in the snowpack. Schweizer and Jamieson (2007) reported the critical range for weak layers particularly susceptible to human triggering as 18-94 cm below the snow surface. Minimal depth criteria were also taken into account by Winkler and Schweizer (2009) in their comparison of different instability tests and by Techel and Pielmeier (2014) when classifying snow profiles according to snowpack structure. We addressed this by assigning the next higher stability class if the weak layer was between 6 and 10 cm below the surface, and class 4 if the failure layer was less than 5 cm below the snow surface. If there were several failure planes in the same test, we searched for the ECT and RB failure plane with the lowest stability class.
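The depth adjustment described above can be sketched as a small helper. The function name and class encoding are illustrative, not from the study; the gap between 5 and 6 cm follows the thresholds exactly as stated in the text.

```python
def adjust_for_depth(stability_class, depth_cm):
    """Adjust a preliminary ECT/RB stability class (1 = lowest, 4 = highest)
    for shallow weak layers, following the rule described in the text:
    failure layers < 5 cm below the surface -> class 4;
    layers 6-10 cm below the surface -> next higher class."""
    if depth_cm < 5:
        return 4
    if 6 <= depth_cm <= 10:
        return min(stability_class + 1, 4)
    return stability_class
```

Deeper weak layers (here, more than 10 cm) keep the class derived from the test result itself.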

Slope stability classification
We classified stability tests according to observations relating to snow instability in slopes similar to the test site on the day of observation, such as recent avalanche activity or signs of instability (whumpfs or shooting cracks). This information was manually extracted from the text accompanying a snow profile and/or stability test. This text contains - among other information - details regarding recent avalanche activity or signs of instability.
Figure 1. ECT and RB according to observational guidelines. At the back, the block of snow is isolated by either cutting with a cord or a snow saw. The light-blue area indicates the approximate area where the skis or the shovel blade is placed. This area corresponds to the area loaded for the ECT, while the main load under the skis is exerted over a length of about 1 m (Schweizer and Camponovo, 2001). Loading is from above (arrows).
A slope was considered unstable if any signs of instability or recent avalanche activity - natural or skier-triggered avalanches from the day of observation or the previous day - were noted on the slope where the test was carried out or on neighbouring slopes (Simenhois and Birkeland, 2006, 2009; Moner et al., 2008; Winkler and Schweizer, 2009; Techel et al., 2016). We considered a slope as stable only if it was clearly stated that on the day of observation none of the before-mentioned signs were observed in the surroundings. In most cases, «surroundings» relates to observations made in the terrain covered or observed during a day of back-country touring (estimated to be approximately 10 to 25 km²; Meister, 1995; Jamieson et al., 2008).
In the following, we denote slope stability simply as stable or unstable, although this strict binary classification is not entirely correct. For instance, many tests were performed on slopes that were rated as unstable even though these slopes did not fail.

If it was not clearly indicated when and where signs of instability or fresh avalanches were observed, or if this information was lacking entirely, these data were not included in our data set.

Forecast avalanche danger level
For each day and location of the snow instability test, we extracted the forecast avalanche danger level related to dry-snow conditions from the public bulletin issued at 17:00 CET and valid for the following 24 hours.

Criteria to define ECT stability classes
We consider the following criteria as relevant when testing existing or defining new ECT stability classes:
- (i) Stability classes should be distinctly different from each other. The criterion we rely on is the proportion of unstable slopes. Therefore, a higher stability class should have a significantly lower proportion of unstable slopes than the neighboring lower stability class.
- (ii) The lowest and highest stability classes should be defined such that the rates of correctly detecting unstable and stable conditions, respectively, are high; hence, the rates of false-stable and false-unstable predictions, respectively, should be low.
Stability classes in between these two may represent intermediate conditions, or lean towards more frequently unstable or stable conditions, permitting higher false-stable and false-unstable rates than those of the two extreme stability classes.
-(iii) The extreme classes should occur as often as possible, as the test should discriminate well between stable and unstable conditions in most cases.
To define classes based on crack propagation propensity and crack initiation (number of taps), we proceeded as follows:
1. We calculated the mean proportion of unstable slopes for moving windows of 3, 5 and 7 consecutive numbers of taps, for ECTP and ECTN separately. ECTX was included in ECTN, treating ECTX as ECTN31.
2. We obtained thresholds for class intervals by applying unsupervised k-means clustering (R function kmeans with settings max.iter = 100, nstart = 100; R Core Team, 2017; Hastie et al., 2009) to the proportion of unstable slopes of the three running means (step 1). The numbers of clusters k tested were 3, 4 and 5.

3. We repeated the clustering 100 times using 90% of the data, randomly selected without replacement. For each repetition, the cluster boundaries were noted. Based on the 100 repetitions, we report the respective most frequently observed k-1 boundaries, together with the second most frequent boundary.
4. To verify whether the classes found by the clustering algorithm were distinctly different (criterion i), we compared the proportion of unstable slopes between clusters using a two-proportions z-test (prop.test; R Core Team, 2017).
5. For clusters not leading to a significant reduction in the proportion of unstable slopes, we tested a range of thresholds (±3 taps around the threshold indicated by the clustering algorithm) to find a threshold maximizing the difference between classes (Schweizer and Jamieson, 2010). We refer to these two outcomes as low or high stability.
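Steps 2 and 4 might be sketched as follows. This is an illustrative Python re-implementation (the study used R's kmeans and prop.test), with a deterministic quantile initialization instead of random starts:

```python
import math
import statistics

def kmeans_1d(values, k, iters=100):
    """Minimal 1-D k-means with quantile-based initialization,
    standing in for R's kmeans() on the running-mean proportions."""
    vals = sorted(values)
    centers = [vals[(len(vals) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        new = [statistics.mean(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:  # converged
            break
        centers = sorted(new)
    return centers

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test (no continuity correction),
    a stand-in for R's prop.test when comparing neighboring clusters."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value
```

Class thresholds can then be read off where the cluster membership of the tap counts changes; a significantly lower proportion of unstable slopes in the next higher cluster satisfies criterion (i).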

A test's adequacy can be considered in two different contexts: the first explores whether the foundations of a test are satisfactory (i), the second its practical usefulness (ii) (Trevethan, 2017). (i) Most often the performance of a snow stability test is assessed from the perspective of the reference group (Schweizer and Jamieson, 2010), i.e. what proportion of unstable slopes is detected by the stability test. The two relevant measures addressing this context are the sensitivity and the specificity, which are considered the benchmark for performance:

- The sensitivity of a test is the probability of correctly identifying an unstable slope among the slopes that are known to be unstable. Considering a frequency table (Tab. 2), the sensitivity, or probability of detection (POD), is calculated as (Trevethan, 2017): POD = TP / (TP + FN), where TP is the number of unstable slopes correctly identified and FN the number of unstable slopes missed.
- The specificity of a test is the probability of correctly identifying a stable slope among the slopes that are known to be stable. It is also referred to as the probability of non-events detection (PON).
Ideally, both sensitivity and specificity are high, which means that most unstable and most stable slopes are detected. However, missing unstable situations can have more severe consequences, and therefore it is assumed that, first of all, the sensitivity should be high. Nonetheless, a comparably low specificity will decrease a test's credibility. Sensitivity and specificity are generally considered to be insensitive to the distribution of the reference standard - in our case the respective proportions of unstable and stable slopes. However, this is only true when the distribution of the reference classes is approximately balanced and misclassifications in the estimated reference classes are rare (Brenner and Gefeller, 1997).
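From a 2x2 frequency table, the two measures can be computed directly. The cell names and counts below are generic illustrations, not those of Tab. 2:

```python
def pod_pon(tp, fn, fp, tn):
    """Sensitivity (POD) and specificity (PON) from a 2x2 frequency table:
    tp = unstable slopes with a positive (low-stability) test result,
    fn = unstable slopes missed, fp = stable slopes flagged unstable,
    tn = stable slopes correctly identified."""
    pod = tp / (tp + fn)  # probability of detection
    pon = tn / (tn + fp)  # probability of non-events detection
    return pod, pon
```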

(ii) The second context focuses on the ability of a test to correctly indicate slope stability, i.e. if the test result indicates low stability, how often is the slope in fact unstable. This aspect has only rarely been explored for snow instability tests (e.g. by Ebert (2018) from a Bayesian viewpoint), and is generally assessed using two metrics:
- The positive predictive value (PPV) is the proportion of unstable slopes, given that a test result indicates instability (a low stability class). This is the statistic we refer to most in this manuscript, generally termed the proportion of unstable slopes.
- The negative predictive value (NPV) is the proportion of stable slopes, given that a test result indicates stability (a high stability class).
However, to demonstrate the effect of a varying base rate, we highlight differences in PPV and NPV by considering the proportion of unstable slopes stratified by the forecast danger level for 1-Low to 3-Considerable. Finally, a test result should provide interpretable evidence in favour of instability or stability. To address this point, we use the likelihood ratio as a measure of the strength of evidence for one hypothesis or the other. According to Brenner and Gefeller (1997), and applied to our study, the positive likelihood ratio LR+ is the ratio of the probability of a positive test (low stability) in an unstable slope to the probability of a positive test in a stable slope: LR+ = sensitivity / (1 - specificity). The likelihood ratio is the factor that describes the shift from the prior probabilities to the posterior probabilities, and is therefore an indicator of the strength of evidence the observed data provide (Blume, 2002).
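The relations between these quantities follow from Bayes' theorem. A sketch, assuming known sensitivity, specificity and base rate (the values in the test are illustrative, not the study's):

```python
def predictive_values(sens, spec, base_rate):
    """PPV, NPV and positive likelihood ratio from sensitivity, specificity
    and the base rate (prior probability of an unstable slope),
    via Bayes' theorem."""
    p = base_rate
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    lr_pos = sens / (1 - spec)  # LR+ = P(pos | unstable) / P(pos | stable)
    return ppv, npv, lr_pos
```

Equivalently, posterior odds = prior odds × LR+, which is the shift from prior to posterior probabilities described by Blume (2002).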

Base rate of unstable and stable slopes
As outlined before, the proportion of unstable slopes varied within our data set: we noted a bias towards more frequently observing two ECTs when slope stability was considered unstable (30%), compared to single ECTs with only 15% of the tests observed on unstable slopes (Table 1). To balance out this mismatch when comparing two ECT results to a single ECT or RB (20% unstable), we created equivalent data sets for single ECT and RB containing the same proportion of tests collected on unstable and stable slopes as the data set of two ECTs. For this, we randomly sampled an appropriate number of single ECT and RB results. The base rate of 30% tests on unstable and 70% on stable slopes was used throughout this manuscript, except in Sect. 4.5, where we evaluate the effect of different base rates.
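The resampling used to equalize the base rates can be sketched as follows; the function name and record layout are illustrative, not the study's data format:

```python
import random

def balance_base_rate(records, n_unstable, target_rate, seed=1):
    """Subsample `records` (tuples of (test_result, is_unstable)) so that
    the fraction of unstable slopes equals `target_rate`, keeping
    `n_unstable` unstable records - a sketch of the equivalent-data-set
    construction described in the text."""
    rng = random.Random(seed)
    unstable = [r for r in records if r[1]]
    stable = [r for r in records if not r[1]]
    n_stable = round(n_unstable * (1 - target_rate) / target_rate)
    return rng.sample(unstable, n_unstable) + rng.sample(stable, n_stable)
```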

Selecting ECT from snow pits with two ECT
For snow pits with two adjacent ECTs, we randomly selected one ECT when exploring single ECT data or the relationship between the number of taps and slope stability (Sect. 4.2). As before, this procedure was repeated 100 times. The respective statistic, generally the mean proportion of unstable slopes, was calculated based on the 100 repetitions.

Comparing existing stability classifications
We first consider the results for a single ECT.
The original stability classification ECT orig led to significantly different proportions of unstable slopes for the two stability classes. Although ECT w09 -class 1 had a larger proportion of unstable slopes than ECT orig -class 1, the difference was not significant (p > 0.05).
Considering the results obtained from two adjacent ECTs resulting in the same stability class 1, between 0.52 (ECT orig) and 0.61 (ECT w09) of the slopes were unstable. Although the proportion of unstable slopes was higher by 0.05 to 0.08 than for a single ECT, this difference was not significant (p > 0.05). When both ECTs indicated the highest stability class, the proportion of unstable slopes was 0.15, not significantly different from that for a single ECT resulting in this stability class (0.18, p > 0.05).
When one test resulted in the lowest and the other in the intermediate ECT w09 -class, 0.25 of the slopes were unstable. While this was clearly less than when both resulted in ECT w09 -class 1 (p < 0.05), it was not significantly different from two ECTs with ECT w09 -class 4 (0.15, p > 0.05). Regardless of whether a single ECT or two ECTs were considered, the ECT w09 -classification had a 0.06-0.09 larger proportion of unstable slopes for stability class 1 than the ECT orig -classification. For stability class 4 there was no difference, as the definition for this class was identical.
https://doi.org/10.5194/nhess-2020-50 Preprint. Discussion started: 31 March 2020. © Author(s) 2020. CC BY 4.0 License.
The sensitivity was higher for ECT orig (0.64) than for ECT w09 (class 1: 0.57; Fig. 4a and b). However, this comes at the cost of a high false alarm rate (1 - specificity) for ECT orig (0.31), considerably higher than for ECT w09 (0.21).
The optimal balance between achieving a high sensitivity and a low false alarm rate was found at ECTP≤21 (R library pROC; Robin et al., 2011), exactly the threshold suggested by Winkler and Schweizer (2009).
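The cutoff search can be sketched with Youden's J statistic, one of the criteria pROC offers for picking an optimal ROC point. The data in the test are synthetic, for illustration only:

```python
def best_tap_threshold(taps, unstable, thresholds):
    """Return the ECTP tap threshold maximizing Youden's J
    (sensitivity + specificity - 1); a result <= threshold counts as
    indicating instability."""
    best_t, best_j = None, -1.0
    for t in thresholds:
        tp = sum(1 for x, u in zip(taps, unstable) if u and x <= t)
        fn = sum(1 for x, u in zip(taps, unstable) if u and x > t)
        fp = sum(1 for x, u in zip(taps, unstable) if not u and x <= t)
        tn = sum(1 for x, u in zip(taps, unstable) if not u and x > t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```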

Clustering ECT results by accounting for failure initiation and crack propagation
So far, we explored existing classifications. Now, we focus on the respective lowest number of taps stratified by propagating (ECTP) and non-propagating (ECTN) results. If ECTN and ECTP results were observed for different weak layers in the same test, only the ECTP with the lowest number of taps was considered.
As can be seen in Fig. 3b, the proportion of unstable slopes was higher for ECTP than for ECTN, regardless of the number of taps and in line with the original stability classification ECT orig. However, a notable drop in the proportion of unstable slopes between about 10 and 25 taps is apparent (ECTP, from about 0.6 to almost 0.25).

Clustering the ECT results shown in Figure 3b with the number of clusters k set to 3, 4 and 5, and repeating the clustering 100 times, we report the two most frequent cluster thresholds obtained for k = 4. The frequency with which the respective cluster threshold was selected in the 100 repetitions is shown in brackets. Setting k to 3 resulted in clusters being divided at ECTP≤14 and at ECTP≤21; k = 5 resulted in cluster thresholds ECTP≤9, ECTP≤14, ECTP≤20 and ECTN≤10. The second most frequent threshold was almost always within ±1 tap of those indicated before.

To maximize the difference in the proportion of unstable slopes between classes (Fig. 3c) (about half the base rate).
In the following, we apply these thresholds in combination with the depth of the failure plane.

Stability classification for single ECT
The new classification with four stability classes (ECT new) showed continuously and significantly decreasing proportions of unstable slopes with increasing stability class (0.57, 0.39, 0.25, 0.16 for classes 1 to 4, respectively; p ≤ 0.01; Fig. 3c).
The lowest ECT new class had a larger proportion of unstable slopes (0.57) than the lowest classes of ECT w09 (0.53) or ECT orig (0.47), though this difference was only significant compared to ECT orig (p ≤ 0.05). In contrast, only marginal differences were noted when comparing stability classes 4 (ECT new 0.16, ECT orig 0.18).
Considering class 1 as an indicator of instability, the sensitivity was 0.44 with ECT new (0.58 when considering classes 1 and 2 together; Fig. 4c).

Stability classification for two adjacent ECTs
In 70% of the cases, two ECTs indicated the same ECT new class; in 19% they differed by one class and in 11% by two (or more) classes.
Two ECTs resulting in the same ECT new class showed pronounced differences in the proportion of unstable slopes for classes 1 to 4 (0.61, 0.48, 0.20 and 0.13, respectively; Fig. 3c).
Randomly picking one of the two ECTs as the first ECT yielded the proportions of unstable slopes shown in Table 3. Additionally considering the outcome of a second ECT could increase or decrease the proportion of unstable slopes for some combinations.
For instance, if a first ECT resulted in either ECT new class 1 or 4, the second test would often indicate a similar result: class ≤ 2 in 85% of the cases when the first ECT was class 1, and class ≥ 3 in 93% of the cases when the first ECT was class 4. However, if the first ECT was either ECT new class 2 or 3, a large range of proportions of unstable slopes could result, depending on the outcome of the second ECT (Table 3).
Table 3. Proportion of unstable slopes when randomly selecting one of two ECTs as the first test (ECTnew(1st)) (prop unstable 1st) and the number of cases (N), and the respective proportion of unstable slopes (2nd) following the outcome of the second ECT (ECTnew(2nd)).

Comparison to Rutschblock test results
The proportion of unstable slopes decreased significantly with each increase in RB stability class (0.76, 0.53, 0.25 and 0.11 for classes 1 to 4, respectively; p < 0.01; Fig. 3c). If a binary classification were desired, classes 1 and 2 would be considered indicators of instability, classes 3 and 4 as relating to stable conditions. Employing this threshold, the sensitivity was 0.54 and the specificity 0.87 (Fig. 4d). Considering RB class 3, also termed «fair» stability (Schweizer, 2007), as an indicator of stability is, however, not truly supported by the data: this class has a proportion of unstable slopes of 0.25, only marginally lower than the base rate.

Comparing the RB with the ECT showed that the proportion of unstable slopes for RB stability class 1 was significantly higher (p < 0.01), and for class 4 about 0.05 lower (p > 0.05), than for any of the ECT classifications (Fig. 3a, c). This indicates that the RB stability classes at either end of the scale captured slope stability better than the ECT results, regardless of which ECT classification was applied and whether a second test was performed. Fig. 3a and c also highlight that RB class 2 and ECT class 1 (ECT w09, ECT new) had similar proportions of unstable slopes. ECT new stability class 2 had a lower proportion of unstable slopes than RB class 2 (p < 0.05), but a higher proportion than RB class 3 (p < 0.05). The proportions of unstable slopes for the two highest ECT new classes were not significantly different from those for the two highest RB classes (p > 0.05).
The false alarm rate of the RB (classes 1 and 2) was lower than for any of the ECT classifications (Fig. 4). However, in our data set a comparably large proportion of RB tests (0.34) indicated stability class 3 on slopes rated as unstable. This ratio is higher than for single ECT new class 3. However, the frequency with which stability class 4 (false-stable) was observed on unstable slopes was lower than for ECT new class 4 (0.13 vs. 0.23, respectively).
The ECT new stability class correlated significantly with the RB stability class (Spearman rank-order correlation ρ = 0.43, p < 0.001), a correlation which was stronger for ECT pairs resulting twice in the same ECT stability class (ρ = 0.64, p < 0.001).

The predictive value of stability tests
Now, we explore the predictive value of a stability test result as a function of the base rate, the proportion of unstable slopes.
Considering single ECT new class 1 and RB class 1 showed that PPV was always higher than the base rate (Fig. 5), indicating that the stability test predicted a higher probability of the slope being unstable than just assuming the base rate. This shift was more pronounced for the Rutschblock than for the ECT, particularly at 1-Low and 2-Moderate. While PPV for stability class 1 (single or two ECTs) remained low at 1-Low and 2-Moderate (PPV ≤ 0.3; Tab. 4), indicating that it was still more likely that the slope was stable rather than unstable, the likelihood ratio indicated weak evidence in favor of instability (Tab. 4). At 4-High, the number of tests performed was very low (N = 16); therefore, results are indicative at best.
Figure 5 also shows the shift in PPV when considering ECT new or RB stability class 4 (high stability). For these slopes, PPV was lower than the base rate, indicating that the probability of the specific slope tested being unstable was less than the base rate.
However, the resulting posterior probability was still higher compared to the base rate of the neighboring next lower danger level.
Analysing the entire data set together, regardless of the forecast danger level, the proportion of unstable slopes was 0.21, and thus somewhere between the values for 2-Moderate and 3-Considerable. Again, the informative value of the test can be noted (Fig. 5). However, ignoring the specific base rate related to a certain danger level leads - for instance - to an underestimation of the likelihood that the slope is unstable at 3-Considerable (RB or ECT new class 1), or an overestimation of the presence of instability at 1-Low (RB or ECT new class 4).
As shown in Figure 3c, the two extreme RB stability classes correlated better with slope stability than the respective two extreme ECT new classes. This is also reflected in Fig. 5 by the stronger shift from base rate to PPV, but can also be noted when calculating LR+ using a binary classification: LR+ for RB classes ≤ 2 was 25, 4.2 and 3 for 1-Low, 2-Moderate and 3-Considerable, respectively, compared to 5.2, 2.6 and 2.9 for single ECT new classes ≤ 2.

Performance of ECT classifications
We compared ECT results, applying existing classifications and testing a new one, with concurrent slope stability information.

Quite clearly, whether a crack propagates across the entire column or not is the key discriminator between unstable and stable slopes (Fig. 3b). This is in line with previous studies (e.g. Simenhois and Birkeland, 2006; Moner et al., 2008; Winkler and Schweizer, 2009; Techel et al., 2016) and with our current understanding of avalanche formation (Schweizer et al., 2008b). Moreover, our results confirm the proposition by Winkler and Schweizer (2009) that the number of taps provides additional information allowing a better distinction between results related to stable and unstable conditions. The optimal threshold to achieve a balanced performance, i.e. high sensitivity as well as high specificity, was found to be between ECTP20 and ECTP22, depending on the method (k-means clustering, pROC cutoff point). This finding agrees well with the threshold proposed by Winkler and Schweizer (2009), who suggested ECTP21. Using the binary classification, as originally proposed by Simenhois and Birkeland (2009), increased the sensitivity but led to a rather high false alarm rate. Moving away from a binary classification increased PPV and NPV for the lowest and highest stability classes, respectively, but came at the cost (or benefit) of introducing intermediate stability classes.
Figure 5 (caption fragment): highest (blue, labels below base rate line) stability classes. The arrows indicate the shift from the prior probability - the base rate at a given danger level - to the positive predictive value (posterior probability) for the specific slope tested being unstable.
Only in some situations did pairs of ECTs performed in the same snow pit show an improved correlation with slope stability: when both tests were ECT new stability class 1 or 2, when both tests were class 4, or when one was class 3 and one class 4.

Comparing ECT and Rutschblock
To our knowledge, and based on the review by Schweizer and Jamieson (2010), there have only been three previous studies which compared ECT and RB in the same data set. Moner et al. (2008), in the Spanish Pyrenees, relying on a comparably small data set of 63 RB (base rate 0.44) and 47 single ECT (base rate 0.38), observed a higher unweighted average accuracy for the ECT (0.93) than for the RB (0.88). In contrast, Winkler and Schweizer (2009; N = 146, base rate 0.25) presented very similar values for the RB (0.84) and the ECT (0.81).
However, Winkler and Schweizer (2009) partially relied on a slope stability classification which is based strongly on the RB result. Ebert (2018) formulated several statements on the interpretation of stability tests; we compare them to our results: (1) «A localised diagnostic test will be more informative the higher the general avalanche warning.» (Ebert, 2018, p. 4). With general «avalanche warning», Ebert (2018) refers to the forecast danger level as a proxy to estimate the base rate. As shown in Fig. 5, PPV increased for both ECT and RB with increasing base rate / danger level, supporting this statement. From a more theoretical perspective, it can be shown that PPV can be derived from Bayes' theorem (e.g. Blume, 2002; Ebert, 2018), therefore linking both approaches.
(2) «. . . Do not 'blame' the stability tests for false positive results: they are to be expected when the avalanche danger is low. In fact, their existence is a consequence of the basic fact that low-probability events are difficult to detect reliably» (Ebert, 2018, p. 4). Fig. 5 supports this statement: at 1-Low and 2-Moderate, an ECT indicating instability was much more often observed on a stable slope than on an unstable one. Only once the base rate was sufficiently high, in our case at 3-Considerable, were tests indicating instability observed more often on unstable than on stable slopes.
(3) «In avalanche decision-making, there is no certainty, all we can do is to apply tests to reduce the risk of a bad outcome, yet there will always be a residual risk» (Ebert, 2018, p. 5). The likelihood ratio was greater than 1 for tests indicating instability, regardless of whether we considered an ECT or a RB result and regardless of the danger level, and less than 1 for tests indicating stability. This is statistical evidence for a higher probability that a slope is unstable compared to the base rate. From a Bayesian perspective, we would say that a positive test (a low stability class) always increases our belief that the slope is unstable, and vice versa when a test is negative (a high stability class).
In summary, and regardless of the strength of evidence, instability tests are useful despite the uncertainty which remains.

Sources of error and uncertainties
Besides potential misclassifications in slope stability, which we address more specifically in the following section (Sect. 5.5), Schweizer and Jamieson (2010) pointed out two other sources of error. The first is linked to the test method: the tests are relatively crude, and, for instance, the loading may vary depending on the observer. The second error source is linked to the spatial variability of the snowpack: the combination of slab and underlying weak layer varies in the terrain and may consequently affect the test result. Furthermore, this data set did not permit checking whether the failure plane of avalanches or whumpfs was linked to the failure plane observed in test results. Such information about the «critical weak layer» was, for instance, incorporated by Simenhois and Birkeland (2009) and Birkeland and Chabot (2006) in their analyses.
However, from a stability perspective, considering the actual test result is the more relevant information.

The influence of the reference class definitions and the base rate
So far we have explored ECT and RB assuming that there are no misclassifications of slope stability. However, as the true slope stability is often not known (particularly in stable cases), errors in slope stability classification will occur. Such errors, however, may potentially influence all the statistics derived to describe the performance of tests (Brenner and Gefeller, 1997).
For instance, if at least some slopes are misclassified, classification performance will drop. Moreover, in such cases, POD and PON will additionally be influenced by the true (though unknown) base rate (Brenner and Gefeller, 1997).
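This coupling of POD to the base rate under reference misclassification can be illustrated with a small analytic sketch, assuming a simplified model in which a fraction `err` of reference labels is flipped in both directions:

```python
def apparent_pod(true_pod, true_pon, base_rate, err):
    """Apparent POD when a fraction `err` of slope-stability reference
    labels is flipped. The labeled-unstable group then mixes truly
    unstable slopes (fraction base_rate) with mislabeled stable ones,
    so the measured POD depends on the base rate, as noted by
    Brenner and Gefeller (1997)."""
    p = base_rate
    labeled_unstable = (1 - err) * p + err * (1 - p)
    positives = (1 - err) * p * true_pod + err * (1 - p) * (1 - true_pon)
    return positives / labeled_unstable
```

With err = 0 the apparent POD equals the true POD; with any misclassification it is dragged toward the test's positive rate on stable slopes, and the size of the distortion varies with the base rate.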
In previous studies exploring ECT (Moner et al., 2008;Simenhois and Birkeland, 2009;Winkler and Schweizer, 2009), slope stability classifications were generally well described and the base rate for the applied slope stability classification given.

However, slope stability classification approaches differed somewhat. For instance, a stability criterion used by Moner et al. (2008) was the occurrence of an avalanche on the test slope, while Simenhois and Birkeland (2009) additionally considered explosives testing of the slope as relevant information. Winkler and Schweizer (2009), on the other hand, additionally considered the manual profile classification used operationally in the Swiss avalanche warning service (Schweizer and Wiesinger, 2001; Schweizer, 2007) and regarded profiles rated «very poor» or «poor» as a sufficient criterion for instability.

As this classification relies rather strongly on the RB result, the RB would be favored in such an analysis (Winkler and Schweizer, 2009).
We have no knowledge about the uncertainty linked to our classification. However, we can demonstrate the impact of variations in the definition of the reference class on summary statistics like POD and PON, using different data subsets for the analysis. Let us assume we are not interested in comparing ECT and RB, but want to explore only the performance of a binary ECT classification with ECTP22 as the threshold between the two classes. We will, however, use the RB together with the criteria introduced in Section 2.3 to define slope stability:
- Without using the RB as an additional criterion, POD and PON for the ECT were 0.58 and 0.77, respectively (Fig. 4c).
- If slopes are considered unstable only when the RB stability class was ≤ 2, and stable only when the RB stability class was ≥ 3, the resulting POD is 0.70 and PON is 0.84. The base rate in this data set is 0.14 and N = 591.

- Being even more restrictive, and considering slopes unstable only when the RB stability class was 1, and stable only when the RB stability class was 4, the resulting POD is 0.75 and PON is 0.89. The base rate in this data set is 0.14 and N = 294.
Of course, one could also be interested in exploring the performance of the RB, and define slope stability by using ECT results as additional criteria to those in Section 2.3. Without relying on ECT results, POD and PON for the RB were 0.54 and 0.87, respectively (Fig. 4d). Considering ECT new stability class ≤ 2 as unstable, else as stable, POD and PON would both increase.
The combination of various error sources (Sect. 5.4), together with varying definitions of slope stability and differences in the base rate, makes it almost impossible to directly compare results obtained in different studies. Therefore, performance values presented in this study, but also in other studies regarding snow instability tests, must always be seen in light of the specific data set used and allow primarily a comparison within the study.

Conclusions
We explored a large data set of concurrent RB and ECT results, and related these to slope stability information. Our findings confirmed the well-known fact that crack propagation propensity, as observed with the ECT, is a key indicator of snow instability.
In addition, the number of taps required to initiate a crack also provides information concerning snow instability. Combining crack propagation propensity and the number of taps required to initiate a failure allows refining the original binary classification. We propose an ECT stability interpretation with four distinctly different stability classes. Furthermore, if an ECT result falls in one of the two intermediate classes, a second ECT performed in the same snow pit may be the decisive factor towards either the highest or lowest stability class, which are best related to rather stable or unstable conditions, respectively. In our data set, the proportion of unstable slopes was higher in the lowest and lower in the highest stability class for the RB than for the ECT. Hence, the RB correlated better with slope stability than the ECT.
We further discussed that changing the definition of the reference standard - the slope stability classification - has a large impact on summary statistics like POD or PON. This hinders comparisons between studies, as differences in study design, data selection and classification must be considered.
Finally, we investigated the predictive value of stability test results from a data-driven perspective. We conclude by rephrasing Blume (2002): when a stability test indicates instability, this is always statistical evidence for instability, as it increases the likelihood of instability compared to the base rate. However, in the case of a low base rate, false-unstable predictions are likely.
Author contributions. FT designed the study, extracted and analyzed the data, and wrote the manuscript. MW extracted and classified a large part of the text from the snow profiles. KW, JS and AvH provided in-depth feedback on the study design, the interpretation of the results and the manuscript.
Competing interests. No competing interests.