Invited perspectives: safeguarding the usability and  credibility of flood hazard and risk assessments

Merz, Bruno; Blöschl, Günter; Jüpner, Robert; Kreibich, Heidi; Schröter, Kai; Vorogushyn, Sergiy

doi:https://doi.org/10.5194/nhess-24-4015-2024

Articles | Volume 24, issue 11

https://doi.org/10.5194/nhess-24-4015-2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/nhess-24-4015-2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 24, issue 11

Invited perspectives

| Highlight paper

|

26 Nov 2024

Invited perspectives | Highlight paper |

| 26 Nov 2024

Invited perspectives: safeguarding the usability and credibility of flood hazard and risk assessments

Bruno Merz, Günter Blöschl, Robert Jüpner, Heidi Kreibich, Kai Schröter, and Sergiy Vorogushyn

Download

Final revised paper (published on 26 Nov 2024)
Preprint (discussion started on 26 Mar 2024)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-856', Thorsten Wagener, 24 May 2024

Merz and colleagues provide an interesting, well supported and relevant discussion of the issue of validation in the context of flood hazard/risk assessments. My comments below hopefully help to clarify some points and to push their discussion just a little bit further. My points are not listed in order of importance.
[1] One issue that could be better explained is the meaning of key terminology used. The authors state (line 149…): “Firstly, validation can establish legitimacy but not truth.” The relevance of this statement is difficult to understand unless you tell the audience that the actual meaning of the term “validation” is. The main argument of Oreskes (which the authors cite) is that validation translates into something like “establishing the truth”. It would be helpful to include this here. The same terminology clarity holds for the term verification that the authors also use.
[2] A wider point. I find this discussion of “true” in the context of models really unhelpful. A model cannot be true (in my opinion). A model is by definition a simplification of reality. At least for environmental systems, I do not see how there could be a single way to simplify the system reach a “true” model. This is why (if I remember correctly) Oreskes and colleagues suggest using the term evaluation instead of validation. Why did the authors not include this element in their discussion, but chose to assume that validation as a term is what we should continue to use?
[3] The authors state (Line 152…): “A model that does not reproduce observed data indicates a flaw in the modelling, but the reverse is not true” Well, no model perfectly reproduces observations (at least not in our field). So, we are generally talking about how well or how poorly as model reproduces observations.
[4] The authors state (Line 154): “Finally, validation is a matter of degree, a value judgement within a particular decision-making context. Validation therefore constitutes a subjective process.” Yes, I agree that validation is a question of degree. So, validation must contain subjective choices (of thresholds), but does that make the process subjective?
[5] In the context of what the authors present, isn’t the key question how we decide on appropriate thresholds for the matter of degree of validation? The current discussion does not say much about how we find and agree on these thresholds.
[6] One issue is the full acceptance or the complete rejection of models for a specific purpose in the context of model validation in most (all?) studies. Isn’t this black and white view a key problem? How do you consider imperfect suitability of models? In current studies that include some type of validation, there is generally no consideration of the degree in which a model failed to reproduce the observations into future predictions. How would we solve this problem?
The authors state that (Line 352) “It is therefore important to state the range for which the model is credible.”. However, I do not think this is solving the issue. For once, it still implies that the model is correct if used in the right range, which I think is a very strong assumption.
[7] A key problem for impact of risk models – as far as I know – is the lack of data on flood (or other perils) impact data (e.g. doi.org/10.5194/gmd-14-351-2021). One reason why CAT modelling in practice is so concentrated with a few firms (which own such data). How can we overcome this problem? How can we “validate” without data?
[8] I like the inclusion of Sensitivity Analysis as strategy in Table 1 and in the wider discussion in the paper. Though I do think that its value is wider than discussed here. Wagener et al. (2022) discuss at least four questions that this approach can address in the context of model evaluation (used to avoid the term validation in line with the ideas of Oreskes et al.): (1) Do modeled dominant process controls match our system perception? (2) Is my model's sensitivity to changing forcing as expected? (3) Do modeled decision levers show adequate influence? (4) Can we attribute uncertainty sources throughout the projection horizon?
[9] Another interesting reference for the authors might be the study by Eker et al. (2018) who reviewed validation practices. They found, among other things, a total dominance of validation strategies using fit to historical observations (even in the context of climate change studies).
[10] Some of the points discussed here are also part of what others have called uncertainty auditing (doi.org/10.5194/hess-27-2523-2023) or sensitivity auditing (doi.org/10.1016/j.futures.2022.103041). These ideas might be interesting to the authors.

Thorsten Wagener

References
Eker, S., Rovenskaya, E., Obersteiner, M., & Langan, S. (2018). Practice and perspectives in the validation of resource management models. Nature Communications, 9, 5359.
Wagener, T., Reinecke, R., & Pianosi, F. (2022). On the evaluation of climate change impact models. WIREs Climate Change. https://doi.org/10.1002/wcc.772

Citation: https://doi.org/10.5194/egusphere-2024-856-RC1
- AC1: 'Reply on RC1', Bruno Merz, 12 Aug 2024
  
  Thanks a lot, Thorsten Wagener, for your thoughtful comments. We respond to your comments as follows:
  [1] Our understanding of validation is explained in lines 53 – 71 (“… evaluating its [the model’s] ability to achieve its intended purpose. In essence, the evaluation of an FHRA model's validity is determined by its fitness for its intended purpose, reframing the criteria for assessment from correspondence to reality to alignment with decision-making needs …” and “… Our focus on a fit-for-purpose approach follows earlier arguments … the process of structured reasoning about the level of confidence needed to support a particular decision and the credibility of the assessment of risk in that context …”. However, we agree that a more explicit definition of the term validation is helpful for the reader, and we will re-write these 2 paragraphs accordingly. For a more explicit definition, we will follow Eker et al. (2018) that you propose in your comment [9].
  Actually, we don’t use the term verification in our paper, but verifiability which is explained in Table 2 and Figure 1.
  [2] On the question whether we should use the term validation: Yes, Oreskes et al. argue that the terms validate and verify are problematic: “ … both verify and validate are affirmative terms: They encourage the modeler to claim a positive result . And in many cases, a positive result is presupposed. For example, the first step of validation has been defined by one group of scientists as developing "a strategy for demonstrating [regulatory] compliance". Such affirmative language is a roadblock to further scrutiny. A neutral language is needed for the evaluation of model performance …” While Oreskes et al. note this problem, they don’t really suggest to substitute validation by evaluation. Evaluation is rather a procedure. Oreskes et al. (1994) mention that the term validation implies legitimacy, e.g., a contract that has not been nullified is a valid one. They mention, however, that this term is often misused by using it in a sense of verification, i.e., consistency with observations, and in a sense that a valid model is a realistic representation of physical reality, i.e., “truth”. Our definition of validation as being fit for purpose resonates with the notation by Oreskes et al. in a sense that we agree on a “contract” for model use for a specific decision-making purpose, given certain model properties and quality ensuring procedures, and this makes it legitimate or valid. In addition, the term validation is extremely widespread. The majority of papers discussing the simulation of environmental systems use validation, including those papers that agree that “do not see how there could be a single way to simplify the system reach a “true” model”. Thus, to better connect to the existing literature and terms used, we prefer to use validation instead of the evaluation. However, we will add a disclaimer that validation has this affirmative touch.
  On the discussion of “true”: We have checked how we use “true” or “truth” in the manuscript. There are only 3 instances where we use them:
  “ … Firstly, validation can establish legitimacy but not truth. Truth is unattainable because …” (Line 150) and “… The ideal model-building process utilizes an initial model to make testable predictions, then takes measurements to test and improve it (Ewing et al., 1999). This predictive validation approach appeals because the modeler is unaware of the truth at the time of the model experiment and is therefore not subject to hindsight bias…” (Line 190).
  We will keep the first 2 instances, as we basically say what you mean (“A model cannot be true”). Concerning the third instance, we will substitute “truth” by “measurements”.
  [3] We will delete this sentence in the revision as we don’t need it to make our point, which is that validation can establish legitimacy but not truth.
  [4] We agree that there is a difference between subjective choices and a subjective process and will reformulate as follows: “Validation therefore includes subjective choices”.
  [5] It is indeed an important question how we decide on appropriate thresholds. Here, a basic problem is that the specific thresholds and the ways how to decide on them certainly vary between different contexts. We believe that our framework can help to decide, in a certain context, whether the specific model is valid enough. We follow Howard (2007) who discussed the related problem of what a good decision is. According to Howard, a decision should not be strictly judged by its outcome: A good decision does not always lead to a good outcome and a bad decision does not always lead to a bad outcome. Instead, a good decision is governed by the process that one uses to arrive at a course of action. Howard then defined 6 (decision quality) elements and argued that good decisions are those in which all of these elements are strong. Similarly, we think that our framework can support the discussion about the degree of validation of a model. We think that a good validation is governed by the process that one uses to evaluate a model. Applying our framework (Table 2), we argue that a specific model is validated when all 7 criteria are fulfilled to an extent that is appropriate in the specific context. Of course, the problem still remains what “appropriate in the specific context” means. However, given the large range of contexts where FHRAs are performed, we think that it is not possible to discuss the many ways that parties involved in a FHRA might decide whether a model is valid enough.
  In the revised version, we will add a discussion on your important question along the lines outlined above.
  [6] Yes, at some point in the process of validating a FHRA, there is typically a decision that the specific model is valid enough (despite its imperfect suitability). And such decisions are required, because a flood protection measure needs to be designed (or any other real-world decision needs to be taken) based on a concrete model output (which can be a single number, a probability distribution, a range of what-if scenarios, etc.). In such a situation, one could consider the degree to which model is valid (its validity) in the decision context. Often, this is already done. For instance, reliability engineering (e.g. Tung, 2011) considers (aleatory and epistemic) uncertainty in the design of structures. In a situation where the available model is less able to reproduce the observations, one can consider this lower validity in a wider probability distribution of the load (external forces or demands) on the system or the resistance (strength, capacity, or supply) of the system. This, in turn, will lead to higher design values due to our high uncertainty represented in the specific model. In a situation where one has several, alternative models, each associated with a measure of how valid they are (e.g. by quantifying their agreement with observations), one can weigh these models to obtain the concrete model output required for a specific decision. We agree with Thorsten Wagener that the full acceptance of models is a key problem and we will extend our manuscript by a discussion on that problem, including some reflections on how one could incorporate the degree of validity in decision-making contexts.
  We think that stating this range (by specifying the range of return periods, failure mechanisms etc., for which data exist, should be specified, as should those cases for which observations are unavailable) is extremely helpful. However, we will clarify in the revised version that we don’t assume that the model is correct even when it is valid enough to be used for decisions in a specific context.
  [7] We have mentioned the challenge of the lack of data at several locations in the original manuscript, most prominently in Lines 92 – 100: “… One fundamental problem is that flood risk, i.e. the probability distribution of damage, is not directly or fully observable. Extreme events that lead to damage are rare, and the relevant events may even be unrepeatable, such as the failure of a dam (Hall and Anderson, 2002). The rarity of extreme events results in a situation characterized by both limited data availability and increased data uncertainty. This uncertainty relates to data against which the flood model can be compared. For instance, streamflow gauges often fail during large floods, and losses are not systematically documented and reported losses are highly uncertain. In addition, input data is often insufficient for developing a viable flood model. For example, levee failures depend on highly heterogeneous soil properties, and levee-internal characteristics are typically unknown. Thus Molinari et al. (2019) conclude that “a paucity of observational data is the main constraint to model validation, so that reliability of flood risk models can hardly be assessed…”. Our framework is our answer to this challenge. The framework goes beyond the current, most prominent view on model validation (which is strongly focussed on data validation, i.e. comparing simulation against observation) by adding validation elements and criteria (see Figure 1) that can be applied without observations.
  In the revised version, we will discuss more clearly how our framework addresses the question of validating without data.
  [8] Thanks a lot for this wider perspective on sensitivity analysis which we will consider in the revised version.
  [9] and [10] Thanks a lot for these references, which is highly relevant and will be included in our revision.
  References:
  Howard RA (2007) The foundations of decision analysis revisited. In: Edwards W, Miles RF, Von Winterfeldt D (eds) Advances in decision analysis: from foundations to applications. Cambridge University Press, Boston, pp 32–56
  Tung, Y.-K. (2011). Uncertainty and reliability analysis in water resources engineering. Journal of Contemporary Water Research and Education, 103(1), 4.
  
  Citation: https://doi.org/10.5194/egusphere-2024-856-AC1
RC2:
'Comment on egusphere-2024-856', Anonymous Referee #2, 01 Jul 2024

There is almost no consideration given here to the consequences side of risk. Without that, there is a real danger that you are validating hazard rather than risk. You quote Bates (2023) but none of the pre existing research on which that paper is based is credited, fundamentally about comparisons between insurance claims and risk as a modelled. That research gets closest to the proper validation of risk (as opposed to hazard) than much other less comprehensive investigations.
Also, insufficient attention is given to the biases involved in flood risk assessment being undertaken by those who benefit from large risk numbers. Many of those developing risk models are engineers intimately concerned with projects to construct flood risk reduction measures, hence the tendency for exaggeration by the models in comparison with real world data on actual flood impacts. Indeed, it seems that a review of most models show widespread exaggerations over anticipated consequences.
The part of the paper on Emergency Management is not very convincing. It is very brief and the claims for its advancement over current practises cannot easily be verified. So what are the improvements, and how do they come about? Indeed, the final paragraph of that section rather implies that modelling extreme events in real time is not likely to be credible. That raises the question about the credibility of models generally, rather than simply a discussion of the way that the results can be validated. It also suggests this is rather a bad example.

Citation: https://doi.org/10.5194/egusphere-2024-856-RC2
- AC2: 'Reply on RC2', Bruno Merz, 12 Aug 2024
  
  Thanks for these helpful comments. We respond to these comments as follows:
  [1] Our paper is not a review on the validation of flood hazard and risk assessments, but a perspective paper which attempts to offer a broader view on validation. To this end, we have tried to cite examples that cover the entire range of flood hazard and risk assessments. However, as validating risk is more difficult than validating hazard, we agree that putting more emphasis on examples that deal with validating the modelling of flood consequences will make the paper more useful. In the revision, we will thus add more examples on validating consequences / risk.
  [2] The point that engineers tend to exaggerate flood risk in order to benefit from large risk numbers is highly interesting. However, we could not find evidence for the statement of the reviewer “… Indeed, it seems that a review of most models show widespread exaggerations over anticipated consequences …”. There are very few papers that point in this direction. (One example is ‘Penning-Rowsell, E. C. (2021). Comparing the scale of modelled and recorded current flood risk: Results from England. Jour. Flood Risk Manag’. It compares the modelled numbers of flood risk for England with loss figures quantified in terms of insurance claims data and finds that modelled results are between 2.06 and over 9.0 times the comparable flood losses measured in terms of the compensation paid.)
  While we agree that the danger of exaggeration exists, ethical standards of the engineering profession and the public and political scrutiny of large infrastructure projects serve as checks against such behaviour. In many regions/countries, there is a strong emphasis on demonstrating the effectiveness and cost-benefit ratio of flood protection measures, which suggests a focus on justifying any proposed measures rather than exaggerating risks or promoting unnecessary construction. However, we will extend the manuscript and add a short discussion on this point that the reviewer raised.
  [3] We don’t understand the statement: “… Indeed, the final paragraph of that section rather implies that modelling extreme events in real time is not likely to be credible…”. The final paragraph is “… The use case of emergency flood management exemplified in Table 3 reflects the situation that flood models for the management of extraordinary situations cannot rely on typical elements of validation, such as comparing model simulations with observed data. Only rarely does observed data about inundation, defence failures, and impacts exist for a particular region. This is precisely why it is all the more important to safeguard the usability and credibility of the models applied…” Why does this paragraph imply that modelling in real time is not credible? We think that models can be useful and credible even when we don’t have (many) observations.
  We have chosen the example of emergency management deliberately because this is an area without much data to validate models. We think that our framework is helpful exactly in such situations, when there is little data and when the typical approach of validation (compare against observations) is not possible. However, we will extend this example by going more into depth when presenting the validation elements.
  
  Citation: https://doi.org/10.5194/egusphere-2024-856-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Publish subject to minor revisions (review by editor) (16 Aug 2024) by Gregor C. Leckebusch

AR by Bruno Merz on behalf of the Authors (26 Aug 2024) Author's response Author's tracked changes Manuscript

ED: Publish as is (05 Oct 2024) by Gregor C. Leckebusch

AR by Bruno Merz on behalf of the Authors (11 Oct 2024)

Executive editor

The manuscript addresses an important topic of the useful and correct use of flood hazard and risk assessment approaches. It could serve as a benchmark overview of best practice and thus be of enormous value for the flood hazard community and any related application out of this. The manuscript addresses the important role of stakeholder participation, objectivity, and verifiability when assessing flood hazard and risk.