Methodological and conceptual challenges in rare and severe event forecast-verification

There are distinctive methodological and conceptual challenges in rare and severe event (RSE) forecast-verification, that is, in the assessment of the quality of forecasts involving natural hazards such as avalanches or tornadoes. While some of these challenges have been discussed since the inception of the discipline in the 1880s, there is no consensus about how to assess RSE forecasts. This article offers a comprehensive and critical overview of the many different measures used to capture the quality of an RSE forecast and argues that there is only one proper skill score for RSE forecast-verification. We do so by first focusing on the relationship between accuracy and skill and showing why skill is more important than accuracy in the case of RSE forecast-verification. Subsequently, we motivate three adequacy constraints for a proper measure of skill in RSE forecasting. We argue that the Peirce Skill Score is the only score that meets all three adequacy constraints. We then show how our theoretical investigation has important practical implications for avalanche forecasting by discussing a recent study in avalanche forecast-verification using the nearest neighbour method. Lastly, we raise what we call the "scope challenge" that affects all forms of RSE forecasting and highlight how and why the proper skill measure is important not only for local binary RSE forecasts but also for the assessment of different diagnostic tests widely used in avalanche risk management and related operations. Finally, our discussion is also of relevance to the thriving research project of designing methods to assess the quality of regional multi-categorical avalanche forecasts.


In this paper, we draw on insights from the rich history of tornado forecast-verification to locate important theoretical debates that arise within the context of binary rare and severe event (RSE) forecast-verification. Since the inception of this discipline, many different measures have been used to assess the quality of an RSE forecast. However, not only do these measures disagree in their respective evaluation of a given sequence of forecasts, but there is also no consensus about which one is the best or the most relevant measure for RSE forecast-verification in particular. The diversity of existing measures not only creates uncertainty when performing RSE forecast-verification but, worse, can lead to the adoption of qualitatively inferior forecasts with major practical consequences.
This article offers a comprehensive and critical overview of the different measures used to assess the quality of an RSE forecast and argues that there really is only one proper skill score for binary RSE forecast-verification. Using these insights, we then show how our theoretical investigation has important consequences for practice, such as in the case of nearest neighbour avalanche forecasting, in the assessment of more localised slope stability tests, and other forms of avalanche management.
We proceed as follows: first, we show that RSE forecasting faces, in contrast to other forms of forecasting, the so-called accuracy paradox which, although only recently so-named, was pointed out at least as far back as 1884. In the next section, we present this 'paradox', explain why it is specific to RSE forecasting, and argue that its basic lesson-to clearly separate merely successful forecasts from genuinely skillful forecasts-raises the challenge of identifying adequacy constraints on a proper skill measure.
In the third section, we motivate three adequacy constraints for a proper measure of skill in rare and severe event forecasting and assess a variety of widely used skill measures in forecast-verification in relation to these three constraints. Ultimately, we argue that the Peirce Skill Score is the only score that meets all three adequacy constraints and it should thus be considered the skill measure for rare and severe event forecasting (with an important proviso).

To highlight the practical implications of our theoretical investigation, we discuss, in the fourth section, a recent study in nearest neighbour avalanche forecast-verification and explain how our theoretical discussion has important practical consequences in choosing the best avalanche forecast model.
In the final section, we highlight a wider conceptual challenge for binary rare and severe forecast-verification by considering what we call the "scope-problem". We apply this problem to the special case of avalanche forecasting and conclude by highlighting how our results are of relevance to different aspects of avalanche operations and management.

From the totals in the bottom row, we find that the base rate of tornado occurrence is just under 2%, i.e. 51 observations of tornadoes and 2752 observations of non-tornadoes, and thus well below the 5% base rate used to classify rare and severe event forecasting (Murphy, 1991, p. 303). Further, combining the figures of the top left and bottom right entries, we obtain the total number of verified predictions (of both occurrence and of non-occurrence) in the three-month period. Out of a total of 2,803 predictions, 2,708 were correct, which is an impressive success rate of 96.61%. This figure goes by many names: among the more common, it is known as the percentage-correct (when multiplied by 100), the proportion-correct, the hit rate, or simply accuracy. Table 2 gives the standard characterisation of such a 2x2 contingency table, in the layout used by Gilbert (1884, p. 167).

Table 2. Standard characterisation of a 2x2 contingency table.
The questions we will focus on are (1) what does this type of accuracy tell us about forecast performance or forecasting skill in the specific case of rare event forecasting, and (2) how best to measure and compare different RSE forecast performances.

The accuracy paradox: accuracy vs skill
A feature of tornadoes and also, as we will see later, of avalanches, is that they are rare events, that is, a + c ≪ b + d. As a result, one will do well, i.e. one will exhibit high accuracy or attain a high proportion correct, if one simply predicts 'No tornado' or 'No avalanche' all the time. This trivial (but often overlooked) observation is nowadays blessed with the name the accuracy paradox (e.g., Bruckhaus, 2007; Thomas and Balakrishnan, 2008; Fernandes, 2010; Valverde, 2014; Akosa, 2017). The issue was neatly summed up in a letter to the editor in the 15th August 1884 issue of Science, a correspondent named only as 'G.' writing, "An ignoramus in tornado studies can predict no tornadoes for a whole season, and obtain an average of fully ninety-five per cent" (G, 1884).
Indeed, since Finley makes more incorrect predictions of tornadoes (72) than correct ones (28), i.e., b > a in Table 2, it was quickly pointed out that he would have done better by his own lights if he had uniformly predicted 'No tornado' (Gilbert, 1884)-he would then have caught up with the skill-less ignoramus whose accuracy, all else equal, would have been an even more impressive 98.2%, i.e. (b + d)/(a + b + c + d) in Table 2. Where the prediction of rare events is concerned, what this suggests is that accuracy or the proportion-correct measure is not an appropriate measure of the skill involved. After all, as we will see below, Finley was far from lacking in skill in the prediction of tornadoes. Now, while we will argue for this in more detail in the next subsections, two concerns counting against the proportion-correct measure can be noted here already.
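For readers who wish to check these figures, here is a minimal Python sketch. The individual cell counts (a = 28, b = 72, c = 23, d = 2680) are those implied by the totals quoted above and are used purely for illustration.

```python
# Finley's tornado forecasts: cell counts implied by the totals quoted above.
# a = verified tornado predictions, b = unfulfilled tornado predictions,
# c = missed tornadoes, d = verified predictions of 'No tornado'.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d                      # 2803 forecasting occasions

finley_accuracy = (a + d) / n          # proportion correct
ignoramus_accuracy = (b + d) / n       # always forecasting 'No tornado'

print(f"Finley:    {finley_accuracy:.4f}")     # ~0.9661
print(f"Ignoramus: {ignoramus_accuracy:.4f}")  # ~0.9818
```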
First, focusing on accuracy in rare event forecasting often rewards skill-less performances and incentivizes "no-occurrence" predictions. Second, where the prediction of severe events is concerned such an incentive is hugely troubling, since a failure to predict occurrence is usually far more serious than an unfulfilled prediction of occurrence. As Allan Murphy observes,

Since it is widely perceived that type 2 errors [failures to predict occurrences, c in Table 2] are more serious than type 1 errors [unfulfilled predictions of occurrence, b in Table 2], forecasts of RSEs generally are characterised by overforecasting. That is, over a set of forecasting occasions, more RSEs are usually forecast to occur than are subsequently observed to occur [i.e., in terms of Table 2, a + b > a + c]. (Murphy, 1991, pp. 303-4)

As a result, we believe that the proportion-correct measure is doubly unsuitable when it comes to assessing the skill involved in rare and severe event forecasting.
However, if not by accuracy, how then should we assess the quality of a rare and severe event forecast? Immediately after the publication of Finley's article, a number of U.S. government employees rose to the challenge and introduced different so-called "skill" measures. Of most interest here are G. K. Gilbert of the U.S. Geological Survey and C. S. Peirce of the U.S. Coast and Geodetic Survey, the latter better remembered nowadays, at least amongst philosophers, for his contributions to logic and the school of thought called pragmatism. In the following section, we trace the history of some of these skill measures, and in doing so we motivate three adequacy constraints that have to be met for a measure to be considered a proper skill measure in RSE forecasting.
3 What is skill? Three adequacy constraints on skill measures for RSE forecasting

3.1 First adequacy constraint: Better than chance

Gilbert (1884) responded immediately to Finley's article and in doing so made two lasting contributions to forecast-verification.
His thought was straightforward. Anybody making a sequence of forecasts, whether skilled or unskilled, is likely to get some right by chance. How many? In Table 2, there are a + c occurrences of tornadoes in the sequence of a + b + c + d forecasting occasions. The forecaster makes a + b forecasts of occurrence. If these a + b forecasts were made "randomly", we should expect a fraction (a + c)/(a + b + c + d) of them to be correct. So the number, a_r, of predictions of occurrence that we might expect the skill-less forecaster to get right by luck or chance is (a + c)/(a + b + c + d) × (a + b), i.e., the number in proportion to the base rate.
Likewise, in parallel fashion, we work out the number, d_r, of predictions of non-occurrence we might expect the skill-less forecaster to get right by chance, the number, b_r, of predictions of occurrence we might expect the skill-less forecaster to get wrong by chance, and the number, c_r, of predictions of non-occurrence we might expect the skill-less forecaster to get wrong by chance, keeping fixed the marginal totals a + b, c + d, a + c and b + d. We find:

a_r = (a + c)(a + b)/(a + b + c + d), b_r = (b + d)(a + b)/(a + b + c + d),
c_r = (a + c)(c + d)/(a + b + c + d), d_r = (b + d)(c + d)/(a + b + c + d).

a − a_r is then the number of successful predictions of occurrence that we credit to the forecaster's skill, d − d_r the number of successful predictions of non-occurrence. As Gilbert noted, the forecaster does better than chance if a > a_r, equivalently, if d > d_r, i.e., if ad > bc. (If one finds the reasoning in arriving at a_r, b_r, c_r and d_r intuitively appealing but not sufficiently rigorous, see Appendix A.) It is also the case that a − a_r = d − d_r = b_r − b = c_r − c. What do b_r − b and c_r − c represent? When the forecaster does better than chance, they are the improvements over chance, thus decreases, in, respectively, the making of Type I and the making of Type II errors.
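The following minimal sketch, using the same illustrative counts as above, computes the chance-expected entries and confirms that the four differences coincide:

```python
# Chance-expected cell counts, obtained by distributing the a + b positive
# predictions (and the c + d negative ones) in proportion to the base rate.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

a_r = (a + c) * (a + b) / n   # ~1.82 hits expected by chance
b_r = (b + d) * (a + b) / n
c_r = (a + c) * (c + d) / n
d_r = (b + d) * (c + d) / n

# The four differences coincide, each equal to (ad - bc)/n:
print(a - a_r, d - d_r, b_r - b, c_r - c)   # all ~26.18
print(a / a_r)                              # Finley's hits are ~15.4 times chance
```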
Given these considerations, we can now substantiate our earlier claim that Finley exhibited genuine skill, in contrast to the ignoramus, in issuing his predictions. While Finley's 28 correct out of 100 predictions of occurrence made may not seem impressive, his score is a fraction over fifteen times more than he could have expected to get right by chance, by "random prognostication" as Gilbert called it, given Table 1's numbers. That was Gilbert's first contribution. Although the next step we take is not exactly Gilbert's, the idea behind it is his second lasting contribution. Our forecaster makes a + b + c + d predictions. How many do we credit to her skill? Gilbert's suggestion is (a − a_r) + b + c + (d − d_r), a suggestion in effect taken up by Glenn Brier and R. A. Allen (1951) when they give this general form for a skill score:

(actual score − score attainable by chance) / (total number of forecasts − score attainable by chance).
Here, in both numerator and denominator, the score attainable by chance is a_r + d_r. So, instead of accuracy's successes/predictions as a measure that doesn't take into account skill, we instead take successes owed to skill/predictions credited to skill, i.e.,

((a − a_r) + (d − d_r)) / ((a − a_r) + b + c + (d − d_r)).

This is a skill score that, in contrast to accuracy, meets our first adequacy constraint as it controls for chance and aims at genuinely skillful predictions. In the forecasting literature, this measure is the special (2x2) case of the Heidke Skill Score (Heidke, 1926). We can rewrite the Heidke Skill Score as:

((b_r − b) + (c_r − c)) / (b_r + c_r).

The score is, then, the proportional improvement (decrease) over chance in the making of errors (of both Type I and Type II). Now, if we take it that the best a forecaster can do is have all her predictions, both of occurrence and non-occurrence, fulfilled then, following Woodcock (1976), we can present the Heidke Skill Score in an interestingly different way:

(actual score − score attainable by chance) / (best possible score − score attainable by chance).
Here, as said, we equate the "best possible score" with correctly predicting all occurrences and non-occurrences-a + c correct predictions of occurrence, b + d correct predictions of non-occurrence (Table 3). Substituting into the above equation, we obtain the Heidke Skill Score: for the actual score, the actual number of successful predictions is a + d, the number attributable to chance is a_r + d_r, and the best possible score is a + b + c + d.

Table 3. Best possible score relative to Table 2's data on occurrence.
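As a sketch of how the Brier-Allen form is evaluated in practice, the Heidke Skill Score for the illustrative Finley counts can be computed as follows; the closed form is the standard 2x2 expression:

```python
# Heidke Skill Score via the Brier-Allen form:
# (actual score - chance score) / (total forecasts - chance score).
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d
a_r = (a + c) * (a + b) / n
d_r = (b + d) * (c + d) / n

chance = a_r + d_r
hss = ((a + d) - chance) / (n - chance)

# Equivalent closed form for the 2x2 case.
hss_closed = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

print(hss, hss_closed)   # both ~0.355
```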
Focusing on the notion of a best possible score, though, gives us a different way to think about skill. On the model of what we did above, someone randomly making a + c predictions of occurrence could expect to get (a + c)(a + c)/(a + b + c + d) of them right by chance and, likewise, someone randomly making b + d forecasts of non-occurrence could expect to get (b + d)(b + d)/(a + b + c + d) right by chance. So in the case of perfect prediction, in which there are no Type I or Type II errors, the number of successes we credit to the forecaster's skill is 2(a + c)(b + d)/(a + b + c + d). Now, putting a different reading on our rewriting of Brier and Allen's conception of a skill score,

(actual score − score attainable by chance (relative to actual performance)) / (best possible score − score attainable by chance (relative to best possible performance)),

we get

(ad − bc) / ((a + c)(b + d)),

which is known as the Peirce Skill Score (Peirce, 1884), the Kuipers Skill Score (KSS; Hanssen and Kuipers, 1965) and the True Skill Statistic (TSS; Flueck, 1987). Note that Peirce's own way of arriving at the Peirce Skill Score is somewhat different to our presentation and is examined in detail in Milne (submitted).
We can also think of the Peirce Skill Score as being this ratio:

(successes due to skill in actual performance) / (successes due to skill in perfect performance),

and we will discuss this way of thinking of the Peirce Skill Score further in what follows.
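A minimal sketch of this ratio reading, again using the illustrative Finley counts, shows that it agrees with the usual closed form:

```python
# Peirce Skill Score (KSS/TSS): successes due to skill in the actual
# performance divided by successes due to skill in a perfect performance.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

skill_actual = (a + d) - ((a + c) * (a + b) + (b + d) * (c + d)) / n
skill_perfect = (a + c) + (b + d) - ((a + c) ** 2 + (b + d) ** 2) / n

pss = skill_actual / skill_perfect
pss_closed = a / (a + c) - b / (b + d)   # the usual closed form

print(pss, pss_closed)   # both ~0.523
```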
To summarise, one of the earliest responses to the challenge to identify the skill involved in RSE forecasting was to highlight the need to take into account-in some way or another-the possibility of getting predictions right "by chance" and thus present the skill exhibited in a sequence of forecasts as relativized to what a "chancy" forecaster would predict. As we have just seen, this can be done in different ways which motivate different measures of skill. At this stage, we don't have much to say on whether Gilbert's and Brier and Allen's reading or our rewrite is preferable, i.e. whether the Heidke or Peirce Skill Score is preferable. However, we can note that this first requirement rules out simple scores such as accuracy (proportion-correct) as capturing anything worth calling skill in forecasting.

3.2 Second adequacy constraint: Direction of fit
M. H. Doolittle (1885a, b) introduced a measure of "that part of the success in prediction which is due to skill and not to chance" that is the product of two measures now each better known in the forecasting literature than Doolittle's own: the Peirce Skill Score, which we have just introduced, expressed in terms of Table 2 as

a/(a + c) − b/(b + d),

and the Clayton Skill Score (Clayton, 1927, 1934, 1941), expressed in the same terms as

a/(a + b) − c/(c + d).

…by a method which refers principally to the proportion of occurrences predicted, and attaches very little importance to the proportion of predictions fulfilled. (Doolittle, 1885b, p. 328, with a change of notation)

Farquhar allows that 'either of these differences [i.e., the Peirce Skill Score and the Clayton Skill Score] may be taken alone, with perfect propriety.' By multiplying the Peirce Skill Score and the Clayton Skill Score, one is multiplying a measure that tests occurrences for successful prediction by a measure that tests predictions for fulfilment. The resulting quantity is neither of these things-but that, in itself, does not formally prevent it being, as Doolittle took it to be, a measure of the skill exhibited in prediction. Why, then, should one not multiply them, or put differently: what is wrong with Doolittle's measure?
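Before turning to the answer, a brief numerical sketch of the three quantities may help; the counts are again the illustrative Finley figures used earlier:

```python
# Doolittle's proposal is the product of the Peirce and Clayton Skill Scores.
a, b, c, d = 28, 72, 23, 2680

peirce = a / (a + c) - b / (b + d)    # tests occurrences for successful prediction
clayton = a / (a + b) - c / (c + d)   # tests predictions for fulfilment
doolittle = peirce * clayton

print(peirce, clayton, doolittle)     # ~0.523, ~0.271, ~0.142
```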
The answer, we suggest, lies in a notion that philosophers are familiar with in a very different setting but whose first appearances are very much to the point here-direction of fit. The idea, but not the term, is usually credited to Elizabeth Anscombe who introduced it thus:

Let us consider a man going round a town with a shopping list in his hand. Now it is clear that the relation of this list to the things he actually buys is one and the same whether his wife gave him the list or it is his own list; and that there is a different relation where a list is made by a detective following him about. If he made the list itself, it was an expression of intention; if his wife gave it him, it has the role of an order. What then is the identical relation to what happens, in the order and the intention, which is not shared by the record? It is precisely this: if the list and the things that the man actually buys do not agree, and if this and this alone constitutes a mistake, then the mistake is not in the list but in the man's performance (if his wife were to say: "Look, it says butter and you have bought margarine", he would hardly reply: "What a mistake! we must put that right" and alter the word on the list to "margarine"); whereas if the detective's record and what the man actually buys do not agree, then the mistake is in the record. (Anscombe, 1963, §32)

As Anscombe's observation regarding butter and margarine makes clear, the ideal performance for the husband is to have the contents of his shopping basket match his shopping list; the ideal performance for the detective is for his list to match the contents of the shopping basket. The difference lies in whether list or basket sets the standard against which the other is evaluated-this is the difference in direction of fit. Put crudely, then, Peirce has the sequence of weather events set the standard and evaluates sequences of predictions against that standard; Clayton has the sequence of actual predictions set the standard and evaluates the weather events against it. Thus when measuring occurrences for successful prediction, the aim is to match predictions to the world, something which an omniscient being succeeds in doing; in measuring predictions for fulfilment, the ideal is to have the world match the predictions made, something which an omnipotent being can arrange to be the case.
In considering improvements on the forecasting performance recorded in Table 2, what are kept fixed are the numbers of actual occurrences and non-occurrences, the marginal totals a + c and b + d, not the numbers of actual predictions of occurrence and non-occurrence. It would be beside the point for a forecaster to protest, 'I would have obtained a higher skill score if more tornadoes had occurred,' even though it may well be true. Thus, as Doolittle has, despite himself, made clear for us, forecasters are like Anscombe's detective and not like the husband with the shopping list. Forecasters try to fit their predictions to the world, not the world to their predictions.
Let's go back to this form for a skill score:

(actual score − score attainable by chance (relative to actual performance)) / (best possible score − score attainable by chance (relative to best possible performance)).
Peirce's conception of the best possible performance, presented above in Table 3, keeps the marginal totals for actual observed occurrences and non-occurrences from Table 2, a + c and b + d, respectively. The actual numbers of occurrence and non-occurrence provide the standard against which performances are measured; so constrained, the best possible performance is that of the as-it-were omniscient being who correctly predicts all occurrences and all non-occurrences (Table 3). Clayton's conception of the best possible performance, presented below in Table 4, keeps the marginal totals for actual predictions of occurrence and predictions of non-occurrence from Table 2, a + b and c + d, respectively. The actual numbers of predictions of occurrence and predictions of non-occurrence provide the standard against which performances are measured; so constrained, the best possible performance is that of the omnipotent being who fashions occurrences and non-occurrences to fit her predictions (Table 4). This, as we have argued, embodies the wrong direction of fit. And so, returning to our original question, it should now be clear what is wrong with Doolittle's measure: it incorporates Clayton's measure, which has the wrong direction of fit for a measure of skill in prediction.
Table 4. Omnipotent forecaster's score relative to Table 2's data on prediction.
What of the Heidke Skill Score? How does it fare with respect to direction of fit? What conception of best performance does it employ? In its denominator, the Heidke score takes the best possible performance to be one in which all a + b + c + d predictions are correct but corrects that number for chance using Table 2's marginal totals for both predictions and occurrences.

This is, quite simply, incoherent-unless, fortuitously, we are in the special case when the numbers of Type I and Type II errors are equal. Keeping Table 2's marginal totals, the highest attainable number of correct predictions is a + d + 2 × min{b, c} (Table 5).
Using the marginal totals in Table 5, which are, by design, those of Table 2, to correct a + d + 2 × min{b, c} for chance, we obtain this skill score:

(ad − bc) / ((a + min{b, c})(min{b, c} + d)).

Table 5. Highest number of correct predictions relative to Table 2's marginal totals.

It has been used to assess forecasting performances not in tornado forecasting nor in avalanche forecasting but in assessing predictions of juvenile delinquency and the like in criminology, where it is known as RIOC, Relative Improvement Over Chance (Loeber and Dishion, 1983; Loeber and Stouthamer-Loeber, 1986; Farrington, 1987; Farrington and Loeber, 1989; Copas and Loeber, 1990). Now, this measure has the following feature: when there are successes in predicting occurrences and non-occurrences, i.e., a > 0 and d > 0, it awards a maximum score of 1 to any forecasting performance in which there are either no Type I errors (b = 0) or no Type II errors (c = 0) or both. This is a feature it shares with Stephenson (2000)'s Odds Ratio Skill Score (ORSS) (for which see Appendix D). In agreement with Woodcock (1976), we hold that a maximal score should be attained when, and only when, b and c are both zero.
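A small sketch makes this degenerate behaviour of RIOC explicit; the second call uses made-up counts with no Type I errors at all:

```python
# RIOC: the chance-corrected score with the best possible performance
# constrained by BOTH sets of marginal totals (a + d + 2*min{b, c} correct).
def rioc(a, b, c, d):
    return (a * d - b * c) / ((a + min(b, c)) * (min(b, c) + d))

print(rioc(28, 72, 23, 2680))   # Finley's counts: ~0.53
print(rioc(10, 0, 40, 950))     # b = 0: maximal score 1 despite 40 missed events
```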

That's one problem with the RIOC measure. The other is this. Like Anscombe's detective, the scientific forecaster's aim is to match her predictions to what actually happens. That is why we keep the column totals fixed when considering the best possible performance. Why on earth should we also keep the row totals, the numbers of predictions of occurrence and non-occurrence, fixed? There is, we submit, no good reason to do so. The Heidke Skill Score embodies no coherent conception of best possible performance. Loeber et al.'s RIOC does at least embody a coherent notion of best possible performance but it is a needlessly hamstrung one, restricting the range of possible performances to those that make the same number of predictions of occurrence and of non-occurrence as the actual performance. On the one hand, this makes a "best possible performance" too easy to achieve and, on the other, sets our sights so low as to only compare a forecaster with others who make the same number of forecasts of occurrence and of non-occurrence-but forecasting is a scientific activity, not a handicap sport.
Finally, for completeness, let's briefly consider the measure we started out with, proportion correct. How does it fare with respect to the second adequacy constraint? While it may be true to say that it doesn't evaluate a performance in relation to the wrong direction of fit, this is the case only because the measure doesn't properly engage with the issue of fit.
Here, the evaluation is in relation to a + b + c + d and so the performance is not evaluated in relation to any relevant proportion (neither of occurrences nor of predictions). So, in summary, we can say that the Peirce score evaluates performances in relation to the correct proportions (occurrences, i.e., features of the world), the Clayton score evaluates them in relation to the wrong proportions (predictions fulfilled), and the Heidke score-badly-and RIOC-properly-evaluate them in relation to both proportions (occurrences and predictions). The accuracy score doesn't evaluate the performance in relation to either of these proportions and, just like the latter three scores, it fails to meet the second adequacy constraint.

3.3 Third adequacy constraint: Weighting errors
We think there is a third feature of skill, specific to rare and severe event forecasting, that a proper skill measure has to account for. Broadly speaking, it consists in being sensitive, in the right kind of way, to one's own fallibility. While the omniscient forecaster need not worry about mistakes, actual forecasters need to be aware of the different kinds of consequences of an imperfect forecast. To motivate our third constraint, consider the two forecasts in Table 6. While forecasts A and B issue the same total number of forecasts and both score an excellent 98.8% on a proportion correct measure, they disagree on the kinds of errors they make. Forecast A makes fewer Type II errors (1) than Type I errors (5), while in forecast B this error distribution is reversed. However, is there a reason to think that one forecast is more skillful than the other?

Table 6. Example of two forecasts (A, B) that agree on the correct predictions and the total number of false predictions, but differ in the kinds of false predictions (Type I vs Type II).
Given the context of our discussion, i.e. rare and severe event forecasting, we believe there is. We saw Allan Murphy saying that "it is widely perceived that type 2 errors [erroneous predictions of non-occurrence] are more serious than type 1 errors [unfulfilled predictions of occurrence]". A skillful forecaster of rare and severe events should take this observation into account and consider, as it were, the effects of their mistakes. As a result, a skill measure should incorporate-in a principled way-the different effects of Type I and Type II errors and judge forecast A as more skillful than forecast B, at least when the forecast is evaluated in the context of rare and severe event forecasting. Importantly, the Peirce Skill Score does just that. We can re-write it as

1 − c/(a + c) − b/(b + d)

and read it as making a deduction from 1, the score for a perfect omniscient performance, for each Type II and each Type I error, respectively. Now, when we are concerned with rare events, i.e., when a + c ≪ b + d, the "deduction per unit" is greater for Type II errors than for Type I errors. As a result, it is built into the Peirce Skill Score, in a principled way, that Type II errors count for more than Type I errors when we are dealing with rare events. This is borne out in the Peirce Skill Score for our two forecasts above: forecast A receives a score of .823 while forecast B receives a score of .498. Note that this feature of the Peirce score would turn into a liability if we were to consider very common but nevertheless severe events.

Now, when d is large, as it often is in the case of rare event forecasts, it is likely to be the case that a + b ≪ c + d. When this is the case the Clayton Skill Score, which we may write as

1 − b/(a + b) − c/(c + d),

turns the good behaviour of the Peirce Skill Score on its head, giving a greater "deduction per unit" for Type I errors than for Type II errors. According to the Clayton Skill Score, we should regard forecast B (.823) as more skillful than forecast A (.498). So, not only does the Clayton Skill Score fail to meet the direction of fit requirement, it also fails-in quite a spectacular way-our third requirement of weighting errors. Disregarding, if one can, its failure with respect to direction of fit, the Clayton Skill Score might be an appropriate score for common and severe events. In this case c + d ≪ a + b and the above reasoning is turned the right way up.
Formally, the Heidke Skill Score treats Type I and Type II errors equally in that interchanging b and c, i.e., Type I and Type II errors, in its formula leaves the score unchanged; in particular, forecasts A and B receive exactly the same Heidke score. Nonetheless, we can say in favour of the Heidke score that it provides the right incentive: in an application in which b + d > a + c, as it is in the case of rare event forecasting, an increase in Type II errors would lower the actually attained score by more than the same number of additional Type I errors would, and a decrease in Type II errors would increase the actually attained score by more than a decrease of the same number of Type I errors would (see Appendix B for the formal details).
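To make the comparison concrete, here is a sketch using hypothetical cell counts chosen to reproduce the figures reported for Table 6 (98.8% proportion correct, Peirce scores of .823 and .498); they are a reconstruction for illustration, not the published table:

```python
# Hypothetical counts consistent with Table 6 as described: both forecasts share
# a = 5, d = 489 and make six errors, but A makes 5 Type I / 1 Type II errors
# while B makes 1 Type I / 5 Type II errors.
def scores(a, b, c, d):
    n = a + b + c + d
    accuracy = (a + d) / n
    peirce = a / (a + c) - b / (b + d)
    clayton = a / (a + b) - c / (c + d)
    heidke = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
    return accuracy, peirce, clayton, heidke

print(scores(5, 5, 1, 489))   # A: 0.988, 0.823, 0.498, 0.619
print(scores(5, 1, 5, 489))   # B: 0.988, 0.498, 0.823, 0.619
```

Note that the Heidke score is identical for A and B, as the interchange argument above predicts, while the Peirce and Clayton scores simply swap values.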
To summarise, we argued that a proper skill score for RSE forecasting should penalize Type II errors more than Type I errors. Amongst measures in the forecasting literature, only the Peirce Skill Score can truly capture this aspect of skill in RSE forecasting. The Heidke score fails to weigh errors differentially in a static comparison between two forecasts as in our example above. As we observed, however, both Forecaster A and Forecaster B would be incentivized to reduce Type II errors in preference to Type I errors, which is good news for the Heidke score. The Clayton "Skill" Score proved to not merely fail to meet the requirement but actually to turn it on its head, ascribing more skill to Forecast B over Forecast A-clearly an undesirable result!

By way of summary, consider Table 7, which collates our main claims made so far, and consider the status of each adequacy constraint.

Table 7. Summary comparison of skill measures in relation to the three adequacy constraints for rare and severe event forecasting.

The first constraint was motivated by the early insight, due to Gilbert, that predictions can be got right by chance: there is a strong case that the skill involved in a sequence of predictions can only be captured by a measure which takes account of chance. It thus identifies skill as that aspect that renders a forecast better than a random one. While this requirement applies to any form of forecasting, including rare and severe event forecasting, it renders more simplistic measures, such as proportion correct, inappropriate.

The second constraint focuses on the direction of fit and requires of a skill measure that it measure the correct aspect of a skillful forecasting performance, i.e. it has to focus on occurrences successfully predicted and not on predictions successfully fulfilled. Given its generality, it is also a requirement that applies to all forms of forecasting, including rare and severe event forecasting. Interestingly, some of the most widely used skill measures do not meet this constraint.
Lastly, our third constraint is of a different kind. It is directly motivated by the specific challenge of rare and severe event forecasting which, we argued, requires weighing the different types of errors differently. Overforecasting rare and severe events is to be expected and should be penalised less than underforecasting when it comes to assessing the skill of an RSE forecaster (all else equal). While the Peirce measure is the only one that directly meets the constraint, we are open to the idea of accounting for this adequacy constraint by introducing additional weights on the different errors. So, e.g., one may be able to use the Heidke Score and add appropriate weights on b and/or c to reflect the seriousness of these errors so as to get the right result in a static comparison in our toy example of Table 6. We leave it to the proponent of the Heidke Score (or any other score) to develop these details further.
Finally, should we consider these constraints as jointly sufficient? Of course, further debate may generate other constraints on a proper skill measure, and we are open to such a development at this stage of the discussion. However, we take ourselves to have shown that there really is only one skill measure that meets the three constraints and so there is only one true candidate for a measure of skill in rare and severe event forecasting.

4 Application: the relevance of skill scores in avalanche forecast-verification
In this section, we will show how our theoretical discussion about proper skill measures has consequences for the practice of avalanche forecast-verification. We focus on the use of the "nearest neighbour" (NN) method of avalanche forecasting as discussed in Heierli et al. (2004). The idea of NN forecasting for avalanches dates back to the 1980s (Buser, 1983; Buser et al., 1987; Buser, 1989) and has been widely used for avalanche forecasting in Canada, Switzerland, Scotland, India, and the US (e.g. Brabec and Meister, 2001; Gassner et al., 2001; Gassner and Brabec, 2002; Purves et al., 2003; Heierli et al., 2004; Roeger et al., 2004; Singh and Ganju, 2004; Singh et al., 2005; Purves and Heierli, 2006; Singh et al., 2015). In order to evaluate the quality of this forecasting technique, forecast-verification is an indispensable tool. However, there is currently no consensus in the literature about which measure to use in the verification process for NN forecasts. Most studies simply present a list of different measures without providing principled reasons as to which measure is the most relevant one (an exception is Singh et al. (2015), who opt for the Heidke score). This section offers a discussion as to how the many different measures should be used and ranked in their relevance for avalanche forecast-verification in the context of NN forecasting. It's worth noting, however, that broadly similar considerations will be applicable to the verification used in other avalanche forecasting techniques, or indeed to other kinds of binary RSE forecasts and their verification.

The basic assumption of the NN forecasting approach is that similar initial conditions with respect to external conditions, such as the snow-pack, temperature, weather, etc., will likely lead to similar outcomes, and so historical data-weighted by relevance and ordered by similarity-is used to inform forecasting. More specifically, NN forecasting is a non-parametric pattern classification technique where data is arranged in a multi-dimensional space and a distance measure (usually the Brier score) is used to identify the most similar neighbours. NN forecasting can be used for categorical or probabilistic forecasts. In the case of the former, which is relevant to our current discussion, a decision boundary k is set and an avalanche is forecast, i.e. a positive prediction is issued, when the number of positive neighbours (i.e. nearest neighbours on which an avalanche was recorded) is greater than or equal to that decision boundary k.
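The following is a minimal sketch of the categorical decision rule just described. The feature names, weights, and the simple weighted Euclidean distance are illustrative assumptions, not the configuration of any particular operational model:

```python
# Sketch of categorical nearest-neighbour forecasting: retrieve the k_nn most
# similar historical days and forecast an avalanche when the number of
# "positive" neighbours reaches the decision boundary k.
import math

def nn_forecast(today, history, k_nn=10, decision_boundary=3, weights=None):
    """history: list of (feature_vector, avalanche_observed) pairs."""
    weights = weights or [1.0] * len(today)
    def dist(x, y):
        return math.sqrt(sum(w * (xi - yi) ** 2 for w, xi, yi in zip(weights, x, y)))
    neighbours = sorted(history, key=lambda rec: dist(today, rec[0]))[:k_nn]
    positives = sum(1 for _, avalanche in neighbours if avalanche)
    return positives >= decision_boundary   # True = forecast 'avalanche'

# Toy usage: features could be (new snow [cm], air temperature [C], wind [m/s]).
history = [((30, -8, 12), True), ((2, -1, 3), False), ((25, -6, 10), True),
           ((0, 2, 5), False), ((15, -4, 8), False), ((40, -10, 15), True)]
print(nn_forecast((28, -7, 11), history, k_nn=5, decision_boundary=2))   # True
```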

Heierli et al.'s study on avalanche forecast-verification uses two data sets, one focused on Switzerland and the other on
Scotland. Figure 1 summarises their results and shows how changes in the decision boundary k affect a variety of measures, such as accuracy and other skill measures. In what follows, we will investigate their findings through a more "methodological" lens. Using an actual study will help us explain differences in behaviour of the skill measures given variations in the decision boundary, and highlight how our discussion has practical consequences. One core issue for NN forecasting is which decision boundary k should be chosen, i.e. for which k do we get the "best" forecast. Naturally, this choice should depend, crucially, on how we assess the goodness of the different forecasts given variations in k. Our proposal is that the choice of k should be settled by establishing which value of k issues in the most skillful forecast.
Let's start our discussion by noting two immediate consequences of NN avalanche forecasting. Remember that a positive prediction is issued when the number of positive neighbours is equal to or greater than k. From this it follows that: (i) the number of positive predictions (a + b) is greater the lower k.
(ii) the number of correct positive predictions, a, is greater the lower k.

Given that a + c and b + d are fixed, and given (i) and (ii), we can also note that a/(a + c), the probability of detection (POD), is greater the lower k. Turning to accuracy in Heierli et al. (2004)'s graphs, we see that it increases as k increases in both datasets but also tends to level off: at 90% and over for k ≥ 4 in the Swiss dataset, at about 80% for k ≥ 5 in the Scottish dataset.
Given (ii), we know that a decreases as k increases, so this improvement in accuracy is entirely due to an increase in the number of correct negative predictions, d. But that is accompanied by an increase in Type II errors, mistaken negative predictions, i.e. avalanches that were not predicted.
In short, as k increases, Type I errors are traded for Type II errors. However, this "trading off" of errors is, as we discussed in section 3.3, a seriously bad trade in the context of RSE forecasting. Now, maybe to some extent the absolute numbers should matter here, but generally in the context of RSE forecasting, we do want to minimise Type II errors and have Type II errors weigh more than Type I errors. As we showed earlier, the accuracy measure fails to do that.
Moreover, and as is to be anticipated given our discussion in section 2.2, if really all we want to achieve is to improve accuracy then we also have to consider the "ignoramus in avalanche studies" who uniformly makes negative predictions, i.e., uniformly forecasts non-occurrence. They have an accuracy score of (b + d)/(a + b + c + d). This is exceeded by the accuracy score of the skilled employer of the nearest-neighbour method only when a > b, i.e., just when the success rate (SR) a/(a + b) > 0.5. But as we can see, in the Swiss dataset SR never gets above 0.3 and in the Scottish dataset it rises to about 0.5 and more or less plateaus.
Hence, if all that mattered was accuracy-Heierli et al.'s hit rate-then the lesson from this study for forecasting in Switzerland would be to set the decision boundary k to ∞, making it impossible to issue any positive predictions and in doing so increasing accuracy.

Hence accuracy really isn't a good measure for assessing a professional avalanche forecaster's performance. We hope forecasters agree, and not merely due to concerns about job security.
To be clear, these considerations do not imply that there's no role for accuracy. Accuracy is not an end in itself, that much we take as established. Nevertheless, we think accuracy may well play a secondary role in "forecast-choice": if two sets of predictions are graded equally with respect to genuine skill, we should prefer or rate more highly the one which has the greater accuracy. After all, it is making a greater proportion of correct predictions. So a view we are inclined to adopt is one where, all things considered, accuracy can be a tie-breaker between sets of predictions that exhibit the same degree of skill according to the Peirce skill measure. Technically, our view amounts to a lexicographic all-things-considered ordering for forecast-verification: first rank by skill using the Peirce score, next rank performances that match in skill by accuracy.

Let's next have a look at the behaviour of our favourite skill score, KSS, and investigate it a little further. As said, a + c and b + d are fixed, hence the base rate BR is fixed.
As k increases, a and b both decrease (or, strictly speaking, at least fail to increase, but in practice decrease). Obviously, as a decreases, a/(a + c) decreases; but as b decreases, so too does b/(b + d), which, taken on its own, would raise the score. The ratio b/(b + d) is sometimes called the false alarm rate and sometimes the probability of false detection, i.e. PFD. Now, why does KSS so dramatically decrease? The answer should be clear given our discussion of how Type I and II errors are weighted: as k increases, a and b both decrease and c and d both increase. Given that a + c and b + d are fixed, the number of Type II errors increases when k increases. As discussed in section 3.3, the KSS score penalises Type II errors more heavily than Type I errors when a + c < b + d. Hence the decrease in Type I errors is unable to outweigh the increase in Type II errors. In addition, given that the KSS measure penalises Type II errors more heavily the rarer the to-be-forecasted event, the lower base rate in the Swiss data set-7% compared to 20% in the Scottish data set-explains the more dramatic fall in the KSS value in the Swiss data set compared to the Scottish one.
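The qualitative pattern can be illustrated with purely invented contingency tables (they are not the Heierli et al. data): the observed totals a + c = 40 and b + d = 460 are held fixed while a and b shrink as k grows.

```python
# Invented tables for increasing decision boundaries k (illustration only).
tables = {1: (38, 150, 2, 310), 2: (34, 80, 6, 380), 3: (28, 40, 12, 420),
          4: (20, 20, 20, 440), 5: (12, 8, 28, 452), 6: (6, 1, 34, 459)}

for k, (a, b, c, d) in tables.items():
    n = a + b + c + d
    accuracy = (a + d) / n
    kss = a / (a + c) - b / (b + d)
    hss = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
    print(f"k={k}: accuracy={accuracy:.3f}  KSS={kss:.3f}  HSS={hss:.3f}")

# Accuracy climbs from ~0.70 to ~0.93 and levels off; KSS peaks at k = 2 and then
# collapses; HSS peaks later (k = 3) and overtakes KSS once Type II errors (c)
# exceed Type I errors (b).
```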

The Heidke Skill Score and NN forecasting
We previously noted our reservations about the Heidke Score; it is, however, an often used skill score in forecast-verification (compare Singh et al. (2015), who use it in their evaluation of nearest neighbour models for operational avalanche forecasts in India). Interestingly, the Heidke score arrives at a different choice of k from the Peirce score for the two data sets, even though the behaviour of the Heidke Skill Score, HSS (Heidke, 1926; Doolittle, 1888), is broadly similar to that of KSS in that it initially rises and then falls off. For the Swiss data set, HSS provides the highest skill rating for a decision boundary k = 3 and for the Scottish data k = 5. So, it really does matter which skill measure we choose when making NN forecast evaluations, with important practical consequences. Why do we get such different assessments of the forecast performances?
In both graphs, KSS > HSS for low values of k but not for larger values of k. This is intriguing. When the forecasting performance is better than chance, i.e., when ad > bc in Table 2, and occurrence of the positive event is rarer than its non-occurrence, i.e., a + c < b + d, KSS exceeds HSS exactly when b > c, that is, when there is "over-forecasting", which, as noted earlier, is penalised less heavily in the case of KSS than "under-forecasting". Now, we quoted Murphy earlier noting that, given the seriousness of Type II errors, overforecasting is, as it were, a general feature of RSE forecasting. However, Murphy goes on to say,

The amount of overforecasting associated with forecasts of some RSEs is quite substantial, and efforts to reduce this overforecasting-as well as attempts to prescribe an appropriate or acceptable amount of overforecasting-have received considerable attention. (Murphy, 1991, p. 304)

Now, how "bad" too much overforecasting is and when it is too much is a separate issue that depends on the kind of event that is to be forecast and may also depend on the behavioural effects overforecasting has on individual decisions and the public's trust in forecasting agencies. But this much is clear: we have to acknowledge that KSS encourages more overforecasting when compared to HSS. Naturally, this phenomenon just is the other side of the coin to penalising Type II errors more heavily, which we argued previously is a feature and not a defect of KSS. This is also something we identify in the graphs: with larger values of k, HSS starts to exceed KSS, as Type II errors begin to exceed Type I errors.
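One can check algebraically that, when ad > bc, the sign of KSS − HSS is the sign of (b − c)((b + d) − (a + c)), so for rare events KSS exceeds HSS exactly when b > c. A quick numerical sanity check of this claim, brute-forcing small tables:

```python
# Check: for better-than-chance tables, sign(KSS - HSS) = sign((b - c)((b + d) - (a + c))).
from itertools import product

def kss(a, b, c, d):
    return a / (a + c) - b / (b + d)

def hss(a, b, c, d):
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

ok = True
for a, b, c, d in product(range(1, 12), repeat=4):
    if a * d > b * c and b != c and (b + d) != (a + c):
        diff = kss(a, b, c, d) - hss(a, b, c, d)
        ok = ok and ((diff > 0) == ((b - c) * ((b + d) - (a + c)) > 0))
print(ok)   # True
```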

The (ir)relevance of the Success Rate for NN forecasting
Heierli et al. also provide what they call the success rate, SR, a/(a + b) in Table 2, which is also known as the positive predictive value. What, however, is its relevance for RSE forecasting and should it have any influence on our choice of k?
Let's first look at its behaviour. In the case of the Scottish dataset, SR more or less plateaus from k = 5 onwards. As a, hence the POD, is decreasing, b must be decreasing too and "in step". In the Swiss dataset, something else is going on. After k = 4, SR falls dramatically, indicating that while the number of verified positive predictions drops, the number of mistaken positive predictions does not drop in step. Moreover, in neither dataset does SR tend to 1 as k increases, meaning that a sizeable proportion of positive predictions are mistaken even when a comparatively high decision boundary is employed. In the Scottish case, SR plateaus at 0.5, meaning that while the number of positive predictions decreases as k increases, the proportion of such predictions that are mistaken falls to 50% and stays there. In the Swiss case, after improving up to k = 5, the SR drops dramatically, meaning that while the number of positive predictions has decreased between k = 5 and k = 6, the proportion of predictions that are mistaken has increased. Notice too that in the Swiss case, the SR never gets above 0.3, so a full 70% of positive predictions are mistaken, no matter the value of k-at least two out of three predictions of avalanches are mistaken.
So in both data sets SR might seem initially quite low. But as we know, forecasting rare events is difficult, and we should not be too surprised that the success rate of predicting rare events is less than 50%. In fact, given that rare event forecasting involves, by definition, low base rates of occurrence, and given our limited abilities in forecasting natural disasters such as avalanches, we should expect a low success rate (see also Ebert, 2019; Techel et al., 2020). But there are stronger reasons not to consider SR when assessing the "goodness" of an RSE forecast. SR fails all three adequacy constraints: it does not correct for chance, it has the wrong direction of fit since it is a ratio with denominator a + b, and it in effect only takes into account Type I errors. Given this comprehensive failure to meet our criteria of adequacy, we think that, in contrast to accuracy, SR is not even a suitable candidate to break a tie between two equally skillful forecasts.
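The point about base rates can be made vivid with a little Bayes-style arithmetic; the POD and false alarm rate below are invented round numbers, and the two base rates merely echo the roughly 20% and 7% figures mentioned above:

```python
# SR (positive predictive value) as a function of POD, false alarm rate (PFD)
# and base rate (BR): SR = POD*BR / (POD*BR + PFD*(1 - BR)).
def success_rate(pod, pfd, base_rate):
    return pod * base_rate / (pod * base_rate + pfd * (1 - base_rate))

print(success_rate(0.8, 0.1, 0.20))   # ~0.67 at a 20% base rate
print(success_rate(0.8, 0.1, 0.07))   # ~0.38 at a 7% base rate
```

Even with a high POD and a modest false alarm rate, a low base rate alone pushes the success rate down.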

So, then, what are the main lessons from this practical interlude? Simply put: having the appropriate skill measure really does matter and has consequences for high-stakes practical decisions. Forecasters have to make an informed choice in the context of NN forecasting about which decision boundary to adopt. That choice has to be informed by an assessment of which decision boundary issues in the best forecast. Our discussion highlighted that the best forecast cannot simply be the most accurate one; rather, it has to be the most skillful one. The Peirce skill measure (KSS) is, as we argued earlier, the only commonly used measure that captures the skill involved in rare and severe event forecasting. Finally, if different k's are scored equally on the Peirce score, then we think that accuracy considerations should be used to break the tie: amongst the most skillful we may well use the most accurate forecast.
5 Conceptual challenges for RSE forecasting: the scope problem.
In this last section, we discuss a conceptual challenge for the viability of RSE forecasting (for a general overview of the other conceptual, physical, and human challenges in avalanche forecasting specifically, see McClung, 2002a). Once again, we can draw on insights from the early pioneers of RSE forecast-verification to guide our discussion. In his annual report for 1887, the Chief Signal Officer, Brigadier General Adolphus Greely, noted a practical difficulty facing the forecasting of tornadoes; more specifically:

So almost infinitesimal is the area covered by a line of tornado in comparison with the area of the state in which it occurs, that even could the Indications Officer say with absolute certainty that a tornado would occur in any particular state or even county, it is believed that the harm done by such a prediction would eventually be greater than that which results from the tornado itself. (Greely, 1887, pp. 21-2)

Now, there are two issues to be distinguished. First, there is the behavioural issue of how the public reacts to forecasts of tornadoes or other rare and severe events. In particular, there is a potential for overreaction which, in turn, led for many years in the United States to the word 'tornado' not being used when issuing forecasts (cf. Abbe, 1899; Bradford, 1999)! This policy option, to decide not to forecast rare events, is quite radical and no longer reflects current practice.
The other issue is the "almost infinitesimal" track of a tornado compared to the area for which warning of a tornado is given.
A broadly similar issue faces avalanche forecasting: currently such forecasts are given for a wide region of at least 100 km², yet avalanches usually occur on fairly localised slopes of which there are many in each region. And, while avalanches are different to many other natural disasters in that they are usually triggered by humans (Schweizer and Lütschg (2001) suggest that roughly 9 out of 10 avalanche fatalities involve a human trigger), RSE forecasts quite generally face what we call the scope challenge: the greater the area covered by the binary RSE forecast, the less informative it is. Conversely, the smaller the size of the forecast region, the rarer the associated event and the more over-forecasting we can expect.
This type of trade-off applies equally to probabilistic and binary categorical forecasts. One consequence of the scope challenge, alluded to in the above quote, is that once the region is sufficiently large, forecasters may rightly be highly confident that one such event will occur. This means that on a large-scale level, we are not-technically speaking-dealing with rare event forecasts anymore, while on a more local level, the risk of such an event is still very low. Now, in a recent discussion, Statham et al. (2018) in effect appeal to a version of the scope problem-with an added twist of how to interpret verbal probabilities given variations in scope-as one reason why probabilistic (or indeed binary) forecasts are rarely used in avalanche forecasting. They write:

The probability of an avalanche on a single slope of 0.01 could be considered likely, while the probability of an avalanche across an entire region of 0.1 could be considered unlikely. This dichotomy, combined with a lack of valid data and the impracticality of calculating probabilities during real-time operations, is the main reasons forecasters do not usually work with probabilities, but instead rely on inference and judgment to estimate likelihood.

Numeric probabilities can be assigned when the spatial and temporal scales are fixed and the data are available, but given the time constraints and variable scales of avalanche forecasting, probability values are not commonly used. (Statham et al., 2018, p. 682)

It might well be these kinds of problems that led in 1993 to the introduction of the European Avalanche Danger Scale, which involves a multi-categorical five point danger rating: low, moderate, considerable, high, very high. The danger scale itself is a function of snow-pack stability, its spatial distribution, and potential avalanche size, and it applies to a region of at least 100 km². The danger scale, at least on the face of it, focuses more on the conditions (snow pack and spatial variation) that render avalanches more or less likely than on issuing specific probabilistic forecasts or predicting actual occurrences.
Given this development, verification of avalanche forecasts has become more challenging. What makes it even more difficult is that each individual danger level involves varied and complex descriptors that are commonly used to communicate and interpret the danger levels. For example, the danger level high is defined as:

Triggering is likely, even from low additional loads [i.e. a single skier, in contrast to high additional load, i.e. group of skiers], on many steep slopes. In some cases, numerous large and often very large natural avalanches can be expected. (EAWS, 2018)

This definition involves a likelihood statement with nested modal claims [given a low load trigger, it's likely there will be an avalanche on many slopes]. And finally, it involves a hedged expectation statement about natural avalanches (i.e. those that are not human triggered) and their predicted size-in some cases, numerous large or very large natural avalanches can be expected. Noteworthy here is that while the forecasts are intended for large forecast areas only, the actual descriptors aim to make the regional rating relevant to local decisions. The side effect of making regional forecasts more locally relevant is that it makes verifying them a hugely complex, if not impossible, task.

Naturally, the verification of avalanche forecasts using the five point danger scale is an important and thriving research field, and numerous inventive ways to verify multi-categorical avalanche forecasts have since been proposed (Föhn and Schweizer, 1995; Cagnati et al., 1998; McClung, 2000; Schweizer et al., 2003; Jamieson et al., 2008; Sharp, 2014; Techel and Schweizer, 2017; Techel et al., 2018; Statham et al., 2018; Schweizer et al., 2020; Techel et al., 2020; Techel, 2020). Here, we have to leave a more detailed discussion of which measure to use for multi-categorical forecasts for another occasion. Nonetheless, the now widespread use of multi-categorical forecasts may raise the question whether, and if so how, our assessment of the proper skill scores for binary RSE forecast-verification is of more than just historical interest.
There are numerous reasons why we think our discussion is still important, with potentially significant practical implications. One concerns the assessment of localised slope stability tests-such as the Column Test and the so-called Rutschblock Test-whose accuracy and success rate have been assessed in the literature. Our discussion suggests that when assessing the "goodness" of what are in effect local diagnostic stability tests, or indeed when assessing the performance of individuals who use such tests, we should treat them as binary rare and severe event forecasts. Using the correct skill score will be crucial to settle which type of stability test is the better test from a forecasting perspective.
Lastly, there are, as we noted above, numerous research projects to design manageable forecast-verification procedures for multi-categorical regional forecasts. Assuming that the methodological and conceptual challenges we raised earlier can be overcome, we still require the right kind of measures to assess the "goodness" of multi-categorical forecasts. The Heidke, Peirce, and the other measures we discussed can be adapted for these kinds of forecasts. Moreover, given that the danger ratings of high and very high are rarely used, and involve high stakes with often major economic consequences, our discussion of how to weight different errors is likely to remain relevant there too. An in-depth discussion of multi-categorical skill measures for regional avalanche forecasts, however, has to wait for another occasion, as it will crucially depend on the details of the verification procedure.

6 Conclusions
In his classic 1993 article "What is a good forecast?" Murphy distinguished three types of goodness in relation to weather forecasts generally; all three apply to evaluations of RSE forecasts.

Type 1 goodness (consistency): a good fit between the forecast and the forecaster's best judgement given their evidence.
Type 2 goodness (quality): a good fit between the forecast and the matching observations.
Type 3 goodness (value): the relative benefits for the end-user's decision-making.
Our discussion has focused exclusively on what Murphy labelled the issue of quality and how to identify a good fit between binary forecasts and observations, though the quality of a forecast has-obviously-knock-on effects on the value of a forecast (Murphy, 1993, p. 289). Historically, a number of different measures have been used to assess the quality-the goodness of fit-of individual RSE forecasts and to justify comparative judgements about different RSE forecasts (such as in the case of NN-forecasting); however, there has not been any consensus about which measure is the most relevant in the context of binary RSE forecasts. In this article, we motivated three adequacy constraints that any measure has to meet to properly be used in an assessment of the quality of a binary RSE forecast. We offered a comprehensive survey of the most widely used measures and argued that there is really only one skill measure that meets all three constraints. Our main conclusion is that goodness (i.e. quality) of a binary RSE forecast should be assessed using the Peirce skill measure, possibly augmented with consideration of accuracy. Moreover, we argued that the same considerations apply to the assessment of slope specific stability tests and other forecasting tools used in avalanche management. Finally, our discussion raises important theoretical questions for the thriving research project of verifying regional multi-categorical avalanche forecasts that we plan to tackle in future work.

Appendix A: Numbers of predictions correct and incorrect "by chance"

We model the actual presences and absences (e.g., occurrence and non-occurrence of avalanches) as constituting the sequence of outcomes produced by n = a + b + c + d independent, identically-distributed random variables X_1, X_2, ..., X_n; each X_i takes two possible values, 1 (= presence) and 0 (= absence); each random variable takes value 1 with (unknown) probability p.
The probability of producing the actual sequence of a + c presences and b + d absences is p^(a+c) (1 − p)^(b+d). The value of p which maximises this is (a + c)/(a + b + c + d). We take this as the probability of presence on any forecasting occasion.
Call it p̂. p̂ is the maximum likelihood estimate of the (unknown) probability of presence.
Putting the "random" into random prognostication, Step 1 We assume the actual forecasting performance to be produced by n = a + b + c + d independent, identically-distributed random variables Y 1 , Y 2 . . . , Y n ; each Y i takes two possible values, 1 595 (= prediction of presence) and 0 (= prediction of absence); each random variable takes value 1 with (unknown) probability q.
The probability of the actual sequence of a + b predictions of presence and c + d predictions of absence is q^(a+b) (1 − q)^(c+d). q̂, the maximum likelihood estimate of the unknown value q, is (a + b)/(a + b + c + d). We take this as the probability of prediction of presence on any forecasting occasion.

Putting the "random" into random prognostication, Step 2 The probability of successful prediction of presence on the i th trial is prob(X i = 1 and Y i = 1). We suppose that X i and Y j are independent, 1 ≤ i, j ≤ a + b + c + d. In particular, then, the probability of successful prediction of presence on the i th trial ispq.
Let n = a + b + c + d. Let Z_i = 1 if X_i = 1 and Y_i = 1, Z_i = 0 otherwise. The expected value of Z_1 + Z_2 + ... + Z_n is np̂q̂, i.e., (a + c)(a + b)/(a + b + c + d). This is a_r, the "number" of successful predictions of presence we attribute to chance. (We put 'number' in scare quotes because (a + c)(a + b)/(a + b + c + d) need not take a whole number value. It's a familiar fact that an expected value need not be a realisable value: the expected value of the number of spots showing on the uppermost face when a fair die is rolled is 3.5 but no (undamaged) face has three and a half spots on it.) In similar fashion, we obtain b_r, c_r and d_r.
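As a sanity check on this reasoning, a short simulation of "random prognostication" (a sketch; the counts are again the illustrative Finley figures) reproduces a_r on average:

```python
# Simulate random prognostication: occurrences and predictions are drawn
# independently with probabilities p_hat and q_hat; the mean number of chance
# hits approaches a_r = (a + c)(a + b)/n.
import random

a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d
p_hat, q_hat = (a + c) / n, (a + b) / n

random.seed(0)
trials = 1000
hits = 0
for _ in range(trials):
    hits += sum(1 for _ in range(n)
                if random.random() < p_hat and random.random() < q_hat)

print(hits / trials, (a + c) * (a + b) / n)   # both roughly 1.8
```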

Appendix B: The effect of increases and decreases in errors on the Heidke Skill Score
Keeping the marginal totals a + c and b + d fixed, let us consider the score with an additional k Type I errors and again with an additional k Type II errors, 0 < k ≤ min{a, d} (Table B1). We have, by hypothesis, that 0 < a + c < b + d.

With an additional k Type I errors, the Doolittle-Heidke Skill Score is:

2(ad − bc − k(a + c)) / ((a + b + k)(b + d) + (a + c)(c + d − k)).

With an additional k Type II errors, it is:

2(ad − bc − k(b + d)) / ((a + b − k)(b + d) + (a + c)(c + d + k)).

Since a + c < b + d, the second numerator is the smaller of the two and, cross-multiplying, one finds that the second score is indeed smaller: the same number of additional Type II errors lowers the score by more than additional Type I errors do.

630
Let us consider next the score after a reduction of k Type I errors and after a reduction of k Type II errors, 0 < k ≤ min{b, c} (Table B2). By hypothesis, we have that 0 < a + c < b + d.
With a reduction of k Type I errors, the Doolittle-Heidke Skill Score is:

2(ad − bc + k(a + c)) / ((a + b − k)(b + d) + (a + c)(c + d + k)).
With a reduction of k Type II errors, the Doolittle-Heidke Skill Score is:

2(ad − bc + k(b + d)) / ((a + b + k)(b + d) + (a + c)(c + d − k)).

For k in the range 0 to min{b, c}, the denominators are positive. Since a + c < b + d, cross-multiplying shows that the latter score is the larger of the two: a reduction of k Type II errors raises the score by more than a reduction of k Type I errors does.
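These comparisons can be checked numerically; the sketch below reuses the illustrative forecast A counts from section 3.3 (a reconstruction, not the published Table 6) with k = 1:

```python
# Heidke scores after perturbing the errors while holding a + c and b + d fixed.
def hss(a, b, c, d):
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

a, b, c, d = 5, 5, 1, 489   # illustrative counts; hss(a, b, c, d) ~ 0.619
k = 1

print(hss(a, b + k, c, d - k), hss(a - k, b, c + k, d))   # extra Type I ~0.582, extra Type II ~0.527
print(hss(a, b - k, c, d + k), hss(a + k, b, c - k, d))   # fewer Type I ~0.662, fewer Type II ~0.701
```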
It's clear that these results reverse when the forecasted events are common, i.e., when a + c > b + d.

… Peirce Skill Scores. Its square is the measure proposed by Doolittle that attracted Farquhar's censure as discussed in section 3.2. See also Wilks (2011) for a general overview of skill scores for binary forecast verification. Note that we disagree with some aspects of his assessment.