These authors contributed equally to this work.

There are distinctive methodological and conceptual challenges in rare and severe event (RSE) forecast verification, that is, in the assessment of the

In this paper, we draw on insights from the rich history of tornado forecast verification to locate important theoretical debates that arise within the context of binary rare and severe event (RSE) forecast verification. Since the inception of this discipline, many different measures have been used to assess the quality of an RSE forecast. These measures disagree in their respective evaluations of a given sequence of forecasts; moreover, there is no consensus about which one is the best or the most relevant measure for RSE forecast verification. The diversity of existing measures not only creates uncertainty when performing RSE forecast verification but, worse, can lead to the adoption of qualitatively inferior forecasts with major practical consequences.

This article offers a comprehensive and critical overview of the different measures used to assess the quality of an RSE forecast and argues that there really is only one skill score adequate for binary RSE forecast verification. Using these insights, we then show how our theoretical investigation has important consequences for practice, such as in the case of nearest-neighbour avalanche forecasting, in the assessment of more localized slope stability tests and in other forms of avalanche management.

We proceed as follows: first, we show that RSE forecasting faces the so-called

The discipline of

Finley's consolidated tornado predictions March–May 1884, after

From the totals in the bottom row, we find that the base rate – the climatological probability, as it is sometimes called – of tornado occurrence is a little under 2 %, i.e. 51 observations of tornadoes against 2752 observations of non-tornadoes, well below the 5 % base rate used to classify
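The arithmetic is trivial but worth making explicit; a minimal sketch in Python using the column totals just cited:

```python
# Base rate (climatological probability) of tornado occurrence, from the
# column totals of Finley's contingency table.
tornado_obs = 51          # observed tornado cases
no_tornado_obs = 2752     # observed non-tornado cases

n = tornado_obs + no_tornado_obs        # 2803 forecast-observation pairs
base_rate = tornado_obs / n

print(f"{base_rate:.4f}")  # 0.0182, i.e. a little under 2 %
```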

Standard characterization of a

The questions we focus on are (1) what this type of accuracy tells us about forecast performance or forecasting skill in the case of rare event forecasting and (2) how best to measure and compare different RSE forecast performances.

A feature of tornadoes and also of snow avalanches is that they are rare events, that is

Indeed, as Finley makes more incorrect predictions of tornadoes (72) than correct ones (28) (

First, focusing on accuracy in rare event forecasting often rewards skill-less performances and incentivizes “no-occurrence” predictions. Second, where the prediction of severe events is concerned, such an incentive is hugely troubling as a failure to predict occurrence is usually far more serious than an unfulfilled prediction of occurrence. As Allan Murphy observes,

Since it is widely perceived that type 2 errors [failures to predict occurrences,

But if not by accuracy, how should we assess the quality of a set of RSE forecasts? Immediately after the publication of Finley's article, a number of US government employees rose to the challenge. In the next section, we outline the skill measures they introduced, and in doing so we motivate three adequacy constraints that skill measures in RSE forecasting ought to meet.

It is also the case that

Given these considerations, we can now substantiate our earlier claim that Finley exhibited genuine skill, in contrast to the ignoramus, in issuing his predictions. While Finley's 28 correct predictions out of the 100 predictions of occurrence he made may not seem impressive, his tally is a fraction over 15 times what he could have expected to get right by chance, given Table
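The chance benchmark behind this figure can be computed directly from the marginal totals: if the 100 predictions of occurrence were distributed at random over the 2803 cases, the expected number of hits is the product of the two marginals divided by the grand total. A quick check with the counts reported above:

```python
predictions_of_occurrence = 100   # Finley's row total of "tornado" forecasts
occurrences = 51                  # column total of observed tornadoes
n = 2803                          # grand total
actual_hits = 28                  # Finley's correct predictions of occurrence

# Expected hits for a random prognosticator with the same marginal totals.
expected_hits_by_chance = predictions_of_occurrence * occurrences / n
skill_ratio = actual_hits / expected_hits_by_chance

print(round(expected_hits_by_chance, 2))  # 1.82
print(round(skill_ratio, 2))              # 15.39 -- "a fraction over 15 times"
```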

That was Gilbert's first contribution. Although the next step we take is not exactly Gilbert's, the idea behind it is his second lasting contribution. Our forecaster makes

We can rewrite the Heidke skill score as

Now, if we take it that the best a forecaster can do is to have all their predictions, of both occurrence and non-occurrence, fulfilled, then, following

Best possible score relative to Table

Focusing on the notion of a

We can think of the Peirce skill score as being this ratio:

To summarize, one of the earliest responses to the challenge to identify the skill involved in RSE forecasting was to highlight the need to take into account, in some way, the possibility of getting predictions right “by chance” and thus present the skill exhibited in a sequence of forecasts as relativized to what a “random prognosticator” could get right by chance. As we have just seen, this can be done in different ways which motivate different measures of skill. At this point, we do not have much to say on whether Gilbert's and Brier and Allen's reading or our rewrite is preferable, i.e. whether the Heidke or Peirce skill score is preferable. However, we can note that this first adequacy constraint rules out simple scores such as accuracy (proportion correct) as capturing anything worth calling skill in forecasting.
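Both chance-corrected scores are straightforward to compute from a 2 × 2 contingency table. A sketch in the usual notation, where a counts hits, b false alarms, c misses and d correct rejections; the counts are Finley's, as reported above:

```python
def peirce_skill_score(a, b, c, d):
    """Peirce skill score: hit rate minus false-alarm rate.

    a = hits, b = false alarms (Type I errors),
    c = misses (Type II errors), d = correct rejections.
    """
    return a / (a + c) - b / (b + d)

def heidke_skill_score(a, b, c, d):
    """Heidke skill score: proportion correct, corrected for what a random
    forecaster with the same marginal totals would get right by chance."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

# Finley's consolidated tornado predictions.
a, b, c, d = 28, 72, 23, 2680
print(f"PSS = {peirce_skill_score(a, b, c, d):.3f}")  # PSS = 0.523
print(f"HSS = {heidke_skill_score(a, b, c, d):.3f}")  # HSS = 0.355
```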

We should note that some measures in the literature, in particular those that are functions of

Satisfaction, or not, of the criterion that

Prof. C. S. Peirce (in

Farquhar allows that “either of these differences [i.e. the Peirce skill score and the Clayton skill score] may be taken alone, with perfect propriety”. By multiplying the Peirce skill score and the Clayton skill score, one is multiplying a measure that tests occurrences for successful prediction by a measure that tests predictions for fulfilment. The resulting quantity is neither of these things – but that, in itself, does not formally prevent it being, as Doolittle took it to be, a measure of the skill exhibited in prediction. Why, then, should one not multiply them? Or, put differently, what is

The answer, we suggest, lies in a notion known to philosophers as

Let us consider a man going round a town with a shopping list in his hand. Now it is clear that the relation of this list to the things he actually buys is one and the same whether his wife gave him the list or it is his own list; and that there is a different relation where a list is made by a detective following him about. If he made the list itself, it was an expression of intention; if his wife gave it him, it has the role of an order. What then is the identical relation to what happens, in the order and the intention, which is not shared by the record? It is precisely this: if the list and the things that the man actually buys do not agree, and if this and this alone constitutes a mistake, then the mistake is not in the list but in the man's performance (if his wife were to say “Look, it says butter and you have bought margarine”, he would hardly reply “What a mistake! we must put that right” and alter the word on the list to “margarine”); whereas if the detective's record and what the man actually buys do not agree, then the mistake is in the record.

Prof. C. S. Peirce (in

In considering improvements on the forecasting performance recorded in Table

Let us go back to this form for a skill score:

Clayton's conception of the best possible performance, presented
in Table

Omnipotent forecaster's score relative to Table

What of the Heidke skill score? How does it fare with respect to direction of fit? What conception of best performance does it employ? In its denominator, the Heidke score takes the best possible performance to be one in which all

Highest number of correct predictions relative to Table

Using the marginal totals in Table

The RIOC measure has the following feature: when there are successes in predicting occurrences and non-occurrence, i.e.

That is one problem with the RIOC measure. The other is this. Like Anscombe's detective, the scientific forecaster's aim is to match their predictions to what actually happens. This is why we keep the column totals fixed when considering the best possible performance. Why on earth should we also keep the row totals, i.e. the numbers of predictions of occurrence and non-occurrence, fixed? There is, we submit, no good reason to do so. The Heidke skill score embodies no coherent conception of best possible performance. The RIOC of Loeber et al. does at least embody a coherent notion of best possible performance, but it is a needlessly hamstrung one, restricting the range of possible performances to those that make the same number of predictions of occurrence and of non-occurrence as the actual performance. On the one hand, this makes a best possible performance too easy to achieve and, on the other, sets our sights so low as to only compare a forecaster with others who make the same

Finally, for completeness, let us consider the measure we started out with, proportion correct,

The notion of direction of fit has much wider application than just forecasting: it applies in any setting in which we can see “the world” or a “gold standard test” or, more prosaically, some aspect of the set-up in question as setting the standard against which a “performance” is judged. Diagnostic testing is one obvious case – and there we have the Peirce skill score but under the name of the Youden index (Youden's

So, in summary, we can say that

We think there is a third feature of skill that is specific to severe event forecasting that a skill measure ought to take into account. Broadly speaking, it consists in being sensitive, in the right kind of way, to one's own fallibility. While the omniscient forecaster need not worry about mistakes, actual forecasters need to be aware of the different kinds of consequences of an imperfect forecast. To motivate our third constraint, consider the two forecasts in Table

Example of two forecasts (A, B) that agree on the correct predictions and the total number of false predictions but differ in the

While forecasts A and B issue the same total number of forecasts and both score an excellent 98.8 % accuracy, they differ in the kinds of errors they make. Forecast A makes fewer Type II errors (1) than Type I errors (5), while in forecast B this error distribution is reversed. Is there a reason to think that one forecast is

Given the context of our discussion, i.e. rare and severe event forecasting, we believe there is. We saw Allan Murphy saying that “it is widely perceived that type 2 errors [erroneous predictions of non-occurrence] are more serious than type 1 errors [unfulfilled predictions of occurrence]”. A skilful RSE forecaster should take this observation into account and consider, as it were, the

Importantly, the Peirce skill score does just that. We can rewrite it as

Now, when

Formally, the Heidke skill score treats Type I and Type II errors equally in that interchanging
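The formal point is easy to verify numerically: interchanging the false-alarm and miss counts leaves the Heidke score untouched, while the Peirce score registers the difference. A quick check using Finley's counts:

```python
# HSS is symmetric in the two error counts; PSS is not. Counts are
# Finley's: a = hits, b = false alarms (Type I), c = misses (Type II),
# d = correct rejections.
def pss(a, b, c, d):
    return a / (a + c) - b / (b + d)

def hss(a, b, c, d):
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

a, b, c, d = 28, 72, 23, 2680
print(hss(a, b, c, d) == hss(a, c, b, d))  # True: Heidke ignores which
                                           # kind of error is made
print(pss(a, b, c, d) > pss(a, c, b, d))   # True: for rare events, extra
                                           # misses lower PSS more than
                                           # extra false alarms do
```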

In fact

Table

Summary comparison of skill measures in relation to the three adequacy constraints for RSE forecasting.

Finally, should we consider these three constraints as jointly sufficient? Of course, further debate may generate other adequacy constraints on skill measures, and we are open to such a development at this stage of the discussion. However, we take ourselves to have shown that there really is only one skill measure in the forecasting literature that meets the three constraints, and so there is only one genuine

In this section, we will show how our theoretical discussion concerning skill measures has consequences for the practice of avalanche forecast verification. We focus on the use of the nearest-neighbour (NN) method of avalanche forecasting as discussed in

The basic assumption of the NN forecasting approach is that similar conditions, such as the snowpack, temperature, weather, etc., will likely lead to similar outcomes, and so historical data – weighted by relevance and ordered by similarity – are used to inform forecasting. More specifically, NN forecasting is a non-parametric pattern classification technique where data are arranged in a multi-dimensional space and a distance measure (usually a weighted Euclidean distance) is used to identify the most similar neighbours. NN forecasting can be used for categorical or probabilistic forecasts. In the case of the former, which is relevant to our current discussion, a decision boundary
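A minimal sketch of a categorical NN forecast may help fix ideas. It assumes, purely for illustration, a plain Euclidean distance on already-scaled feature vectors and generic parameter names (`k`, `boundary`); it is not the weighting scheme of any particular operational system:

```python
import math

def nn_forecast(today, history, k, boundary):
    """Categorical nearest-neighbour forecast.

    today    -- feature vector for the forecast day (snowpack, weather, ...)
    history  -- list of (feature_vector, avalanche_occurred) pairs
    k        -- number of nearest neighbours consulted
    boundary -- decision boundary: predict an avalanche day if at least
                this many of the k neighbours were avalanche days
    """
    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    nearest = sorted(history, key=lambda rec: dist(today, rec[0]))[:k]
    positives = sum(1 for _, occurred in nearest if occurred)
    return positives >= boundary
```

Lowering `boundary` makes positive predictions easier to trigger; the effect of that choice on accuracy and skill is exactly what is examined next.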

The study of

Dependence of accuracy and skill measures on the choice of decision boundary (number of positive neighbours of the forecast day).

Let us start our discussion by noting two immediate consequences of NN avalanche forecasting. Remember that a positive prediction is issued when the number of positive neighbours is equal to or greater than

the number of positive predictions (

the number of

Our earlier discussion about the disadvantages of using accuracy as a measure is nicely borne out in the study of

Accuracy or proportion correct,

Given (ii), we know that

In short, as

Moreover, and as to be anticipated given our discussion in Sect.

To be clear, these considerations do not imply that there is

The Peirce skill score is also known as the Kuipers skill score, KSS, the name used by

Let us investigate a little further the behaviour of KSS. As said,

We previously noted our reservations concerning the Heidke score; it is, however, an often used skill score in forecast verification (compare

In both graphs,

The amount of overforecasting associated with forecasts of some RSEs is quite substantial, and efforts to reduce this overforecasting – as well as attempts to prescribe an appropriate or acceptable amount of overforecasting – have received considerable attention.

Let us first look at its behaviour. In the case of the Scottish dataset, SR more or less plateaus from

So in both datasets, SR might initially seem quite low. But, as we know, forecasting rare events is difficult, and we should not be too surprised that the success rate of predicting rare events is less than 50 %. In fact, given that rare event forecasting involves, by definition, low base rates of occurrence, and given our limited abilities in forecasting natural disasters such as avalanches, we should expect a low success rate
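This expectation can be given a precise benchmark: a random prognosticator with the same marginal totals has an expected success ratio equal to the base rate, so a low-looking SR can still lie far above chance. A sketch with Finley's counts:

```python
def success_ratio(a, b):
    """SR: fraction of occurrence predictions that are fulfilled, a/(a+b)."""
    return a / (a + b)

# Finley's counts: 28 hits against 72 false alarms, base rate ~1.8 %.
sr = success_ratio(28, 72)
base_rate = 51 / 2803

print(round(sr, 2))              # 0.28 -- low in absolute terms ...
print(round(sr / base_rate, 1))  # 15.4 -- ... but far above chance
```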

So, then what are the main lessons from this practical interlude? Simply put, having the appropriate skill measure really does matter and has consequences for high-stakes practical decisions. Forecasters have to make an informed choice in the context of NN forecasting about which decision boundary to adopt. That choice has to be informed by an assessment of which decision boundary issues the

In this last section, we discuss a conceptual challenge for the viability of RSE forecasting (for a general overview of the other conceptual, physical and human challenges in avalanche forecasting specifically, see

So almost infinitesimal is the area covered by a line of tornado in comparison with the area of the state in which it occurs, that even could the Indications Officer say with absolute certainty that a tornado would occur in any particular state or even county, it is believed that the harm done by such a prediction would eventually be greater than that which results from the tornado itself.

The other issue is the “almost infinitesimal” track of a tornado compared to the area for which warning of a tornado is given. A broadly similar issue faces regional avalanche forecasting: currently such forecasts are given for a wide region of at least 100 km

This type of trade-off applies equally to probabilistic and binary categorical forecasts. One consequence of the scope challenge, alluded to in the above quote, is that once the region is sufficiently large, forecasters may rightly be highly confident that at least one such event will occur. This means that on a large-scale level we are not – technically speaking – dealing with rare event forecasts anymore, while on a more local level the risk of such an event is still very low.

Now,

The probability of an avalanche on a single slope of 0.01 could be considered likely, while the probability of an avalanche across an entire region of 0.1 could be considered unlikely. This dichotomy, combined with a lack of valid data and the impracticality of calculating probabilities during real-time operations, is among the main reasons forecasters do not usually work with probabilities but instead rely on inference and judgement to estimate likelihood. Numeric probabilities can be assigned when the spatial and temporal scales are fixed and the data are available, but given the time constraints and variable scales of avalanche forecasting, probability values are not commonly used.
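The scale dependence at work here can be illustrated with a toy aggregation model. The snippet below assumes, for illustration only, n statistically independent slopes with identical per-slope probability p; real slopes are spatially correlated, so the numbers are indicative rather than operational:

```python
# Probability of at least one avalanche among n slopes, each with
# per-slope probability p, under an (idealized) independence assumption.
def prob_at_least_one(p, n):
    return 1 - (1 - p) ** n

p_slope = 0.01  # the slope-scale probability quoted above
print(round(prob_at_least_one(p_slope, 10), 3))   # 0.096 -- roughly the
                                                  # 0.1 regional figure
print(round(prob_at_least_one(p_slope, 500), 3))  # 0.993 -- near-certainty
                                                  # for a large region
```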

Given this development, verification of such avalanche forecasts has become more challenging. What makes it even more difficult is that each individual danger level involves varied and complex descriptors that are commonly used to communicate and interpret the danger levels. For example, the danger level

Triggering is

There are numerous reasons why we think our discussion is still important, with potentially significant practical implications. First, while regional forecasts are usually multi-categorical, there are many avalanche forecasting services that, in effect, have to provide localized binary RSE forecasts. Consider, for example, avalanche forecasting to protect large-scale infrastructure such as the Trans Canada Highway along Rogers Pass, where more than 130 avalanche paths threaten a 40 km stretch of highway. Ultimately, a binary decision has to be made on whether to open or to close the pass, and a wrong decision has a huge economic impact in the case of both Type I and Type II errors; in the case of Type II errors, there is in addition the potential loss of life. Similarly, on a smaller scale: while regional multi-categorical forecasts usually inform and influence local decision-making, operational decisions in ski resorts or other ski operations are ultimately binary decisions – whether to open or to close a slope – that are structurally similar to binary RSE forecasts. These kinds of binary forecasting decisions will benefit from using forecast verification methods that adopt the right skill measure.

Second, our discussion is relevant to the assessment of different localized slope-specific stability tests widely used by professional forecasters, mountain guides, operational avalanche risk managers and recreational skiers, mountaineers and snowmobilers. A recent large-scale study by

Lastly, there are, as we noted above, numerous research projects to design manageable forecast verification procedures for multi-categorical regional forecasts. Assuming that the methodological and conceptual challenges we raised earlier can be overcome, we still require the right kind of measures to assess the goodness of multi-categorical forecasts. The Heidke, Peirce and other measures we discussed can be adapted for these kinds of forecasts. Moreover, given that the danger ratings of high and very high are rarely used and involve high stakes with often major economic consequences, our discussion may once again help to inform future discussions about how best to verify regional multi-categorical forecasts. However, an in-depth discussion of multi-categorical skill measures for regional avalanche forecasts has to wait for another occasion, as it will crucially depend on the details of the verification procedure.
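For reference, the Heidke score has a standard multi-categorical form: proportion correct, chance-corrected against a random forecaster with the same marginal totals. A minimal sketch, which reduces to the binary Heidke score on a 2 × 2 table:

```python
def heidke_multicat(table):
    """Heidke skill score for a K x K contingency table
    (rows: forecast category, columns: observed category)."""
    n = sum(sum(row) for row in table)
    pc = sum(table[i][i] for i in range(len(table))) / n    # proportion correct
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2  # chance
    return (pc - pe) / (1 - pe)

# Sanity check: on a 2 x 2 table it agrees with the binary Heidke score.
print(round(heidke_multicat([[28, 72], [23, 2680]]), 3))  # 0.355 (Finley)
```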

In his classic article “What is a good forecast?”

We model the actual presences and absences (e.g. occurrence and non-occurrence of avalanches) as constituting the sequence of outcomes produced by

The first step in putting the “random” into random prognostications is the following.
We assume the actual forecasting performance to be produced by

The second step in putting the “random” into random prognostications is the following.
The probability of successful prediction of presence on the

Let

In similar fashion, we obtain

Keeping the marginal totals

An increase in

A decrease in

With an additional

With an additional

For

Let

Let us consider next the score after a reduction of

By hypothesis, we have that

With a reduction of

With a reduction of

In the notation introduced above, the score for a reduction of

It is clear that these results reverse when the forecasted events are common, i.e. when

We assume that

All scores are to be understood relative to Table

Skill scores for binary categorical forecasting.

No data sets were used in this article.

PM and PE formulated research goals and aims. PE and PM wrote, reviewed and edited the manuscript. PE acquired financial support.

The contact author has declared that neither they nor their co-author has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Philip Ebert's research was supported by the Arts and Humanities Research Council AH/T002638/1 “Varieties of Risk”. We are grateful to the International Glaciological Society and Joachim Heierli for the permission to reuse Fig.

This research has been supported by the Arts and Humanities Research Council (grant no. AH/T002638/1) and the University of Stirling (APC fund).

This paper was edited by Pascal Haegeli and reviewed by R. S. Purves and Krister Kristensen.