Snow instability tests provide valuable information regarding the stability of the snowpack. Test results are key data used to prepare public avalanche forecasts. However, to include them in operational procedures, a quantitative interpretation scheme is needed. Whereas the interpretation of the rutschblock test (RB) is well established, a similarly detailed classification for the extended column test (ECT) is lacking. Therefore, we develop a four-class stability interpretation scheme. Exploring a large data set of 1719 ECTs observed at 1226 sites, often performed together with a RB in the same snow pit, and corresponding slope stability information, we revisit the existing stability interpretations and suggest a more detailed classification. In addition, we consider the interpretation of cases when two ECTs were performed in the same snow pit. Our findings confirm previous research, namely that the crack propagation propensity is the most relevant ECT result and that the loading step required to initiate a crack is of secondary importance for stability assessment. The comparison with the RB showed that the ECT classifies slope stability less reliably than the RB. In some situations, performing a second ECT may be helpful when the first test indicated neither rather unstable nor rather stable conditions. Finally, the data clearly show that false-unstable predictions of stability tests outnumber the correct-unstable predictions in an environment where unstable locations are rare overall.

Gathering information about current snow instability is crucial when evaluating the avalanche situation.
However, direct evidence of instability, such as recent avalanches, shooting cracks or whumpf sounds, is often lacking.
When such clear indications of instability are absent, snow instability tests are widely used to obtain information on the stability of the snowpack.
Such tests provide information on failure initiation and subsequent crack propagation – essential components for slab avalanche release

Two commonly used tests to assess snow instability are the rutschblock test

As the properties of the slab as well as the weak layer may vary on a slope

Both ECT and RB provide information relating to slab avalanche release.
While the rutschblock test provides reliable results, the ECT is quicker to perform in the field, which probably explains why it has quickly become the most widely used instability test in North America

Data were collected in 13 winters from 2006–2007 to 2018–2019 in the Swiss Alps. We explored a data set of stability test results in combination with information on slope stability and avalanche hazard.

At 1226 sites, where slope stability information was available, 1719 ECTs were performed (Table

Data overview with the number (

At sites where ECT and RB were realized in the same snow pit, one or two ECTs were generally performed directly downslope from the RB (e.g. as described in detail in

The test procedure followed observational guidelines

ECT and RB according to observational guidelines. At the back, the block of snow is isolated by cutting with either a cord or a snow saw. The light-blue shading indicates the approximate area where the skis or the shovel blade are placed. This area corresponds to the area loaded for the ECT, while the main load under the skis is exerted over a length of about 1 m

To facilitate the distinction between the result of an instability test and the stability of a slope, we refer to test stability using four classes, 1 to 4, with class 1 being the lowest stability (poor or less) and class 4 the highest stability (good or better).
In contrast, for slope stability, we use the terms

The stability classification originally introduced by

The classification suggested by

ECTP

ECTP

ECTN or ECTX – high stability (class 4).

We classified the RB into four classes (classes 1 to 4; Fig.

Classification of RB into four stability classes.

Shallow weak layers (

We classified stability tests according to observations relating to snow instability in slopes similar to the test on the day of observation, such as recent avalanche activity or signs of instability (whumpfs or shooting cracks). This information was manually extracted from the text accompanying a snow profile and/or stability test, which contains, among other information, details regarding recent avalanche activity or signs of instability.

A slope was called unstable if any signs of instability or recent avalanche activity – natural or skier-triggered avalanches from the day of observation or the previous day – were noted on the slope where the test was carried out or on neighbouring slopes

We only called a slope stable if it was clearly stated that none of the aforementioned signs were observed in the surroundings on the day of observation.
In most cases, “surroundings” relates to observations made in the terrain covered or observed during a day of back-country touring

In the following, we denote slope stability simply as stable or unstable, although this strict binary classification is not adequate. For instance, many tests were performed on slopes that were actually rated as unstable but did not fail. In other words, unstable has to be understood as a slope where the triggering probability is relatively high compared to stable where it is low.

If it was not clearly indicated when and where signs of instabilities or fresh avalanches were observed, or if this information was lacking entirely, these data were not included in our data set.

For each day and location of the snow instability test, we extracted the forecast avalanche danger level related to dry-snow conditions from the public bulletin issued at 17:00 CET and valid for the following 24 h.

We consider the following criteria as relevant when testing existing or defining new ECT stability classes:

(i) Stability classes should be distinctly different from each other. The criterion we rely on is the proportion of unstable slopes. Therefore, a higher stability class should have a significantly lower proportion of unstable slopes than the neighbouring lower stability class.

(ii) The lowest and highest stability classes should be defined such that the rates of correctly detecting unstable and stable conditions, respectively, are high; hence, the rates of false-stable and false-unstable predictions, respectively, should be low. Stability classes between these two extremes may represent intermediate conditions or lean towards either more frequently unstable or more frequently stable conditions, permitting higher false-stable and false-unstable rates than those of the two extreme stability classes.

(iii) The extreme classes should occur as often as possible, as the test should discriminate well between stable and unstable conditions in most cases.

We calculated the mean proportion of unstable slopes for moving windows of three, five and seven consecutive numbers of taps for ECTP and ECTN separately. ECTX was included in ECTN, treating ECTX as ECTN31.
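The moving-window calculation can be illustrated with a minimal Python sketch. It assumes that the proportion is pooled over all tests falling inside a window and that only complete windows are evaluated; the function name and the sample data are hypothetical and not taken from the study.

```python
from collections import defaultdict


def moving_window_proportion(taps, unstable, width):
    """Pooled proportion of unstable slopes per window of `width` consecutive tap counts.

    taps     -- number of taps for each test (ECTX would be entered as 31)
    unstable -- 1/True if the slope was unstable, 0/False otherwise
    width    -- odd window width, e.g. 3, 5 or 7
    Returns a dict mapping the window centre to the proportion.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    n_tests = defaultdict(int)
    n_unstable = defaultdict(int)
    for t, u in zip(taps, unstable):
        n_tests[t] += 1
        n_unstable[t] += int(u)
    half = width // 2
    result = {}
    # Only windows that lie fully inside the observed tap range are evaluated.
    for centre in range(min(taps) + half, max(taps) - half + 1):
        window = range(centre - half, centre + half + 1)
        total = sum(n_tests[t] for t in window)
        if total:
            result[centre] = sum(n_unstable[t] for t in window) / total
    return result


# Hypothetical ECTP observations: tap count and slope stability (1 = unstable).
example = moving_window_proportion([11, 12, 12, 13, 14, 15], [1, 1, 0, 0, 0, 1], 3)
print(example)
```

Running the function separately on ECTP and ECTN records, and for widths 3, 5 and 7, would mirror the stratification described above.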

We obtained thresholds for class intervals by applying unsupervised

We repeated clustering 100 times using 90 % of the data, which were randomly selected without replacement. For each of these repetitions, the cluster boundaries were noted. Based on the 100 repetitions, we report the respective most frequently observed
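The resampling procedure can be sketched in Python as follows. The clustering step here is a deliberately simple stand-in (a two-group split of one-dimensional values that minimises the within-group sum of squares) and is not claimed to be the algorithm used in this study; the function names, the seed and the sample values are assumptions for illustration only.

```python
import random
from collections import Counter


def best_split(values):
    """Stand-in clustering: split 1-D values into two groups with minimal
    within-group sum of squares; return the largest value of the lower group."""
    xs = sorted(values)

    def sse(group):
        mean = sum(group) / len(group)
        return sum((x - mean) ** 2 for x in group)

    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # a boundary cannot separate identical values
        cost = sse(xs[:i]) + sse(xs[i:])
        if best is None or cost < best[0]:
            best = (cost, xs[i - 1])
    return None if best is None else best[1]


def most_frequent_boundary(values, repetitions=100, fraction=0.9, seed=0):
    """Repeat the split on random subsets (drawn without replacement) and
    return the boundary that was found most often."""
    rng = random.Random(seed)
    k = round(fraction * len(values))
    counts = Counter(best_split(rng.sample(values, k)) for _ in range(repetitions))
    return counts.most_common(1)[0][0]


# Hypothetical, well-separated values.
values = [0.80, 0.75, 0.70, 0.72, 0.78, 0.20, 0.25, 0.15, 0.22, 0.18]
print(most_frequent_boundary(values))
```

Reporting the most frequent boundary rather than the boundary from a single run reduces the dependence of the thresholds on individual observations.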

To verify whether the classes found by the clustering algorithm were distinctly different (criterion i), we compared the proportion of unstable slopes between clusters using a two-proportion

In almost all cases, we used a one-sided test with the null hypothesis
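A one-sided comparison of two proportions can be sketched as follows. This is a minimal example assuming a pooled two-proportion z-test with a normal approximation, and assuming the null hypothesis that the higher stability class does not have a lower proportion of unstable slopes; the exact test variant used in the study may differ, and the counts are hypothetical.

```python
import math


def one_sided_two_proportion_z(x_low, n_low, x_high, n_high):
    """Test H0: p_high >= p_low against H1: p_high < p_low.

    x_low, n_low   -- unstable slopes and total tests in the lower stability class
    x_high, n_high -- unstable slopes and total tests in the higher stability class
    Returns (z, p_value) based on the pooled standard error.
    """
    p_low = x_low / n_low
    p_high = x_high / n_high
    pooled = (x_low + x_high) / (n_low + n_high)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_low + 1 / n_high))
    z = (p_low - p_high) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 of 50 unstable in the lower class, 10 of 50 in the higher class.
z, p = one_sided_two_proportion_z(30, 50, 10, 50)
print(round(z, 2), p)
```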

For clusters not leading to a significant reduction in the proportion of unstable slopes, we tested a range of thresholds (

When the predictive power or predictive validity of a test is assessed, it is compared to a reference standard, here the slope stability classified as either unstable or stable.
The usefulness of instability test results is generally assessed by considering only two categories related to unstable and stable conditions

There are two different contexts in which a test's adequacy is looked at.
The first (a) explores whether the foundations of a test are satisfactory and the second (b) explores whether the test is useful

Most often the performance of a snow stability test is assessed from the perspective of the reference group

The sensitivity of a test is the probability of correctly identifying an unstable slope from the slopes that are known to be unstable.
Considering a frequency table (Table

The specificity of a test is the probability of correctly identifying a stable slope from the slopes that are known to be stable. It is also referred to as the probability of non-detection (PON).
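As a minimal sketch, sensitivity and specificity can be computed from the four cells of a 2 × 2 frequency table as shown below. Sensitivity is returned as POD (probability of detection) and specificity as PON; the false-alarm rate is taken as 1 − PON. The function name and the counts are hypothetical.

```python
def pod_pon(correct_unstable, false_stable, false_unstable, correct_stable):
    """Sensitivity (POD) and specificity (PON) from a 2 x 2 frequency table.

    correct_unstable -- unstable slopes with a low-stability test result
    false_stable     -- unstable slopes with a high-stability test result
    false_unstable   -- stable slopes with a low-stability test result
    correct_stable   -- stable slopes with a high-stability test result
    """
    pod = correct_unstable / (correct_unstable + false_stable)
    pon = correct_stable / (correct_stable + false_unstable)
    return pod, pon


# Hypothetical counts for 100 unstable and 100 stable slopes.
pod, pon = pod_pon(correct_unstable=56, false_stable=44,
                   false_unstable=21, correct_stable=79)
false_alarm_rate = 1 - pon
print(pod, pon, round(false_alarm_rate, 2))
```

Both measures are conditioned on the known slope stability, so they do not change with the proportion of unstable slopes in the data set.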

The second context focuses on the ability of a test to correctly indicate slope stability; i.e. if the test result indicates low stability, how often is the slope in fact unstable?
This aspect has only rarely been explored for snow instability tests (e.g. by

The positive predictive value (PPV) is the proportion of unstable slopes, given that a test result indicates instability (a low-stability class).

The negative predictive value (NPV) is the proportion of stable slopes, given that a test result indicates stability (a high-stability class).
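Unlike sensitivity and specificity, PPV and NPV depend on how common unstable slopes are. The sketch below illustrates this with Bayes' rule. It assumes, for illustration only, that POD and PON stay fixed while the base rate changes; the values POD = 0.56, PON = 0.79 and the base rates 0.02, 0.1 and 0.38 are quoted elsewhere in this paper, whereas the function name is hypothetical.

```python
def predictive_values(pod, pon, base_rate):
    """PPV and NPV for a given sensitivity (pod), specificity (pon) and
    base rate (proportion of unstable slopes)."""
    correct_unstable = pod * base_rate
    false_unstable = (1 - pon) * (1 - base_rate)
    correct_stable = pon * (1 - base_rate)
    false_stable = (1 - pod) * base_rate
    ppv = correct_unstable / (correct_unstable + false_unstable)
    npv = correct_stable / (correct_stable + false_stable)
    return ppv, npv


# Assumption: POD and PON are held constant across base rates.
for base_rate in (0.02, 0.10, 0.38):
    ppv, npv = predictive_values(pod=0.56, pon=0.79, base_rate=base_rate)
    print(f"base rate {base_rate:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

Under these assumptions, a PPV below 0.5 means that false-unstable results outnumber correct-unstable results, which happens whenever unstable slopes are rare.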

To demonstrate the effect variations in the frequency of unstable and stable slopes have on predictive values like PPV or

A

As outlined before, the proportion of unstable slopes varied within our data set: we noted a bias towards more frequently observing two ECTs when slope stability was considered unstable (30 %).
For a single ECT, only 15 % of the tests were observed in unstable slopes (Table

The base rate proportion with 30 % tests on unstable and 70 % on stable slopes was used throughout this paper, except in Sect.

For snow pits with two adjacent ECTs, we randomly selected one ECT when exploring single ECT data or the relationship between the number of taps and slope stability. As before, this procedure was repeated 100 times. The respective statistic, generally the mean proportion of unstable slopes, was calculated based on the 100 repetitions.
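The repeated random selection can be sketched as follows. This is a simplified illustration, assuming each snow pit is stored as a tuple of ECT stability classes together with a slope stability flag; the data layout, the function name and the example pits are hypothetical.

```python
import random


def mean_unstable_proportion(pits, target_class, repetitions=100, seed=0):
    """Mean proportion of unstable slopes for tests in `target_class`.

    pits -- list of (ect_classes, slope_unstable) pairs, where ect_classes is a
            tuple holding the stability class of one or two ECTs in the pit.
    In every repetition one ECT is drawn at random per pit; repetitions in which
    no drawn ECT falls into the target class are skipped.
    """
    rng = random.Random(seed)
    proportions = []
    for _ in range(repetitions):
        flags = [unstable for classes, unstable in pits
                 if rng.choice(classes) == target_class]
        if flags:
            proportions.append(sum(flags) / len(flags))
    return sum(proportions) / len(proportions) if proportions else None


# Hypothetical pits: (classes of the ECTs in the pit, slope unstable?).
pits = [((1, 2), True), ((1, 1), False), ((4,), False), ((1,), True)]
print(mean_unstable_proportion(pits, target_class=1))
```

Averaging over repetitions avoids the result depending on which of the two adjacent tests happened to be picked.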

We first consider the results for a single ECT.
The original stability classification ECT

Proportion of unstable slopes (

Considering the results obtained from two adjacent ECTs resulting in the same stability class 1, between 0.54 (ECT

Regardless of whether a single ECT or two ECTs were considered, the ECT

The sensitivity was higher for ECT

Distribution of stability classes by slope stability for the different stability test and classification approaches

The optimal balance between achieving a high sensitivity and a low false-alarm rate was found to be at ECTP

So far, we explored existing classifications. Now, we focus on the respective lowest number of taps stratified by propagating (ECTP) and non-propagating (ECTN) results. If in the same test for different weak layers ECTN and ECTP were observed, only ECTP with the lowest number of taps was considered.
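The selection rule for tests with several weak-layer results can be sketched in Python. The tuple layout and the function name are hypothetical; treating ECTX as ECTN31 follows the convention stated for the moving-window analysis.

```python
def relevant_result(results):
    """Pick the result that represents one ECT.

    results -- list of (code, taps) pairs, code being "ECTP", "ECTN" or "ECTX"
               (taps is ignored for ECTX).
    If any ECTP was observed, the ECTP with the fewest taps is returned;
    otherwise the ECTN with the fewest taps, with ECTX treated as ECTN31.
    """
    normalised = [("ECTN", 31) if code == "ECTX" else (code, taps)
                  for code, taps in results]
    propagating = [r for r in normalised if r[0] == "ECTP"]
    pool = propagating if propagating else normalised
    return min(pool, key=lambda r: r[1])


# Hypothetical test with three weak layers.
print(relevant_result([("ECTN", 8), ("ECTP", 21), ("ECTP", 14)]))
```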

As can be seen in Fig.

Clustering the ECT results shown in Fig.

ECTP

ECTP

ECTN

ECTP

ECTP

ECTP

ECTN

The ECT

In 70 % of cases, two ECTs indicated the same ECT

Randomly picking one of the two ECTs as the first ECT yielded the proportion of unstable slopes as shown in Table

Proportion of unstable slopes when randomly selecting one of two ECTs as the first test (ECT

The proportion of unstable slopes decreased significantly with each increase in RB stability class (0.76, 0.53, 0.25 and 0.11 for classes 1 to 4, respectively;

Comparing RB with the ECT showed that the proportion of unstable slopes for RB stability class 1 was significantly higher (

The false-alarm rate of the RB (classes 1 and 2) was lower than for any of the ECT classifications (Fig.

The ECT

Now, we explore the predictive value of a stability test result as a function of the base rate proportion of unstable slopes.
In our data set the base rate proportion of unstable slopes increased strongly, and in a non-linear way, with forecast danger level:
for the 1108 snow pits with at least one ECT it was 0.02 for level 1 (low), 0.1 for level 2 (moderate), and 0.38 for level 3 (considerable) (Table

Proportion of unstable slopes for ECT

Considering a single ECT

Proportion of unstable slopes (position of labels, RB – rutschblock test, ECT – single ECT

Figure

Analysing the entire data set together, regardless of the forecast danger level, the proportion of unstable slopes was 0.21 and thus lay between the values for level 2 (moderate) and level 3 (considerable).
Again, the informative value of the test can be noted (Fig.

At level 1 (low), observations of RB stability class 1 were much less common (3 %, or 2 out of 78 tests, Table

As shown in Fig.

We compared ECT results with concurrent slope stability information, applying existing classifications and testing a new one.

Quite clearly, whether a crack propagates across the entire column or not is the key discriminator between unstable and stable slopes (Fig.

Only in some situations did pairs of ECTs performed in the same snow pit show an improved correlation with slope stability: when two tests were either ECT

To our knowledge, and based on the review by

In that respect, this study presents the first comparison incorporating a comparably large number of ECTs and RBs conducted in the same snow pit, where slope stability was defined independently of test results.
Seen from the perspective of the proportion of unstable slopes, the lowest and highest RB classes correlated better with slope stability than the respective ECT classes.
Incorporating the sensitivity, i.e. the proportion of unstable slopes detected by a test, a mixed picture emerged:
the single ECT and RB (classes 1 and 2) detected a comparable proportion of unstable slopes (0.56 vs. 0.53, respectively, Fig.

We recall the three lessons drawn by

“A localised diagnostic test will be more informative the higher the general avalanche warning”

“Do not ‘blame’ the stability tests for false positive results: they are to be expected when the avalanche danger is low. In fact, their existence is a consequence of the basic fact that low-probability events are difficult to detect reliably”

“In avalanche decision-making, there is no certainty, all we can do is to apply tests to reduce the risk of a bad outcome, yet there will always be a residual risk”

Besides potential misclassifications in slope stability, which we address more specifically in the following section (Sect.

So far we have explored ECT and RB assuming that there are no misclassifications of slope stability.
However, as the true slope stability is often not known (particularly in stable cases), errors in slope stability classification will occur.
Such errors, however, may potentially influence all the statistics derived to describe the performance of tests

In previous studies exploring ECT

We have no knowledge about the uncertainty linked to our classification.
However, we can demonstrate the impact of variations in the definition of the reference class on summary statistics like POD and PON, as well as using different data subsets for analysis:
let us assume we are not interested in comparing ECT and RB but want to explore only the performance of a binary ECT classification with ECTP22 as the threshold between two classes.
We will, however, use the RB together with the criteria introduced in Sect.

Without using the RB as an additional criterion, POD and PON for the ECT were 0.56 and 0.79, respectively (Fig.

If slopes were only considered to be unstable when the RB stability class was

Being even more restrictive, and considering slopes to be unstable only when the RB stability class was 1 and stable only when the RB stability class was 4, the resulting POD was 0.74 and PON was 0.91. The base rate in this data set was 0.2 and

The combination of various error sources (Sect.

For the purposes of this paper, we introduced class numbers to assign a clear order to the classes rather than assign class labels. However, the introduction of class labels rather than class numbers may ease the communication of results.

Proposed class labels for

We believe suitable terms should follow the established labelling for snow stability, which includes the main classes: poor, fair and good

poor – ECTP

poor to fair – ECTP

fair – ECTP

good – ECTN

We explored a large data set of concurrent RB and ECT and related these to slope stability information. Our findings confirmed the well-known fact that crack propagation propensity, as observed with the ECT, is a key indicator relating to snow instability. The number of taps required to initiate a crack provides additional information concerning snow instability. Combining crack propagation propensity and the number of taps required to initiate a failure allows refining the original binary stability classification. Based on these findings, we propose an ECT stability interpretation with four distinctly different stability classes. This classification increased the agreement between slope stability and test result for the lowest (poor) and highest (good) stability classes compared to previous classification approaches. However, in our data set, the proportion of unstable slopes was higher and lower in the lowest and highest stability class, respectively, for the RB than for the ECT, regardless of whether one or two tests were performed. Hence, the RB correlated better with slope stability than the ECT. Performing a second ECT in the same snow pit increased the classification accuracy of the ECT only slightly. A second ECT in the same snow pit may be decisive only when the first ECT result was in one of the two intermediate classes; in that case, it may shift the result to the lowest or highest class, which are best related to rather unstable or rather stable conditions, respectively.

We discussed further that changing the definition of the reference standard, the slope stability classification, has a large impact on summary statistics like POD or PON. This hinders comparison between studies, as differences in study designs, data selection and classification must be considered.

Finally, we investigated the predictive value of stability test results using a data-driven perspective.
We conclude by rephrasing

The data are available at

The supplement related to this article is available online at:

FT designed the study, extracted and analysed the data, and wrote the manuscript. MW extracted and classified a large part of the text from the snow profiles. KW, JS and AvH provided in-depth feedback on study design, interpretation of the results and the manuscript.

The authors declare that they have no conflict of interest.

We greatly appreciate the helpful feedback provided by the two referees Bret Shandro and Markus Landrø, as well as the questions raised by Eric Knoff and Philip Ebert, which all helped to improve this paper.

This paper was edited by Thom Bogaard and reviewed by Markus Landrø and Bret Shandro.