Flood loss modelling is a crucial part of risk assessments. However, it is subject to large uncertainty that is often neglected. Most models available in the literature are deterministic, providing only single point estimates of flood loss, and large disparities tend to exist among them. Adopting any one such model in a risk assessment context is likely to lead to inaccurate loss estimates and sub-optimal decision-making. In this paper, we propose the use of multi-model ensembles to address these issues. This approach, which has been applied successfully in other scientific fields, is based on the combination of different model outputs with the aim of improving the skill and usefulness of predictions. We first propose a model rating framework to support ensemble construction, based on a probability tree of model properties, which establishes relative degrees of belief between candidate models. Using 20 flood loss models in two test cases, we then construct numerous multi-model ensembles, based both on the rating framework and on a stochastic method, differing in terms of participating members, ensemble size and model weights. We evaluate the performance of ensemble means, as well as their probabilistic skill and reliability. Our results demonstrate that well-designed multi-model ensembles represent a pragmatic approach to consistently obtain more accurate flood loss estimates and reliable probability distributions of model uncertainty.

Effective management of flood risk requires comprehensive risk assessment studies that consider not only the hazard component, but also the impacts that the phenomena may have on the built environment, economy and society (Messner and Meyer, 2006). This integrated approach has gained importance over recent decades, and with it so has the scientific attention given to flood vulnerability models describing the relationships between flood intensity metrics and damage to physical assets, also known as flood loss models. A large number of models have become available in the scientific literature. However, despite progress in this field, many challenges persist in their development, and flood loss models tend to be quite heterogeneous. This often results in practical difficulties when they are to be applied in risk assessment studies (Gerl et al., 2016; Jongman et al., 2012), as described below.

Flood damage mechanisms are complex, being dependent on different properties of flood events, such as water depth, flow velocity and flood duration, as well as on the physical characteristics of the exposed assets (Kelman and Spence, 2004). Precautionary and socio-economic factors can also influence their degree of vulnerability (Thieken et al., 2005). Building accurate and reliable flood loss models that account for all these factors is a challenging task. Model development is hampered by limited knowledge about damage-influencing factors, as well as limited data availability (Merz et al., 2010). It is therefore unsurprising that traditional flood loss models tend to be rather simple, often using water depth as the only explanatory variable to describe damage and loss to coarsely defined groups of assets (Green et al., 2011; Smith, 1994). However, the limited predictive ability and high degree of uncertainty associated with such models have been acknowledged (Krzysztofowicz and Davis, 1983; Merz et al., 2004), and more complex models that consider additional explanatory variables have been developed (Dottori et al., 2016; Elmer et al., 2010; Merz et al., 2013). Regardless, uncertainty in flood loss modelling is to some extent inevitable (Schröter et al., 2014).

Furthermore, flood loss models are usually developed for specific regions, ranging from country to catchment or municipality level, with smaller scales making up the majority of models (Gerl et al., 2016). Lack of available flood loss models in many regions often leads to the transfer of models in space, resulting in their application to contexts with different built environments and/or socio-economic settings than originally intended. However, this is generally done with insufficient justification, and flood loss models have been shown to offer lower predictive ability under such circumstances (Cammerer et al., 2013; Jongman et al., 2012; Schröter et al., 2014).

In addition, flood loss models are most often constructed for specific flood types (e.g. fluvial flood, flash flood, coastal flood) and will usually be poorly suited to estimate loss due to flood events with other dominant damaging processes (Kreibich and Dimitrova, 2010; Kreibich and Thieken, 2008). Models also vary in the way loss is expressed, which can be either in monetary terms or as a fraction of the value of the element at risk (Messner et al., 2007). These are referred to respectively as absolute and relative flood loss models, the latter being better suited than the former for application across different study cases (Krzysztofowicz and Davis, 1983). Further differences may exist in terms of other model attributes.

Due to this large heterogeneity, it is difficult to identify flood loss models that, given their attributes, are potentially the most appropriate for application in specific risk assessment studies. Ideally, for any given application setting, a perfectly suited model (e.g. similar type of asset, no spatial transfer required, validated with local evidence) would be available and unambiguously identifiable, but unfortunately this is far from the case. The lack of an established procedure to select suitable flood loss models from the many available in the literature means that model selection is often done rather arbitrarily (Scorzini and Frank, 2015), which can negatively impact the quality of flood loss estimations and lead to suboptimal investment decisions based on model outcomes (Wagenaar et al., 2016).

A critical issue in flood loss modelling is uncertainty (Merz et al., 2004), which is usually high and can significantly contribute to overall uncertainty in flood risk analyses (de Moel and Aerts, 2011). Model uncertainty is mainly related to parameter representation, whereby fewer parameters than those theoretically needed to describe physical damage processes are used, and to insufficient data and/or knowledge about damage processes (Wagenaar et al., 2016). Quantifying uncertainty is imperative, as this information is required to make informed decisions in the context of flood risk management (Downton et al., 2005; Peterman and Anderson, 1999; USACE, 1992). However, the vast majority of flood loss models currently available in the literature are deterministic (Gerl et al., 2016), providing single point estimates of loss. Such estimates are unable to meet the decision needs of different stakeholders, who may have differing risk attitudes or cost-benefit ratios for risk mitigation measures (Merz and Thieken, 2009). Moreover, the uncertain nature of flood loss estimations means that the performance of any given deterministic model that appears appropriate for a certain application can be limited, as large disparities may exist even among seemingly comparable models (Jongman et al., 2012; Merz and Thieken, 2009). This makes flood risk estimates highly sensitive to loss model selection (Apel et al., 2009; Wagenaar et al., 2016). It is thus clear that adopting a single deterministic model for the estimation of flood losses is not recommended, as the information it provides is insufficient for optimal decision-making, and the results are very likely to be inaccurate. Even though research on flood loss modelling has recently started to move into the probabilistic domain (Custer and Nishijima, 2015; Dottori et al., 2016; Kreibich et al., 2017; Schröter et al., 2014; Vogel et al., 2012), probabilistic models are still scarce.

Multi-model ensembles have been successfully applied in scientific fields such as hydrology or weather forecasting to tackle similar issues to those discussed above. Ensemble means have been shown to almost always outperform individual models (Georgakakos et al., 2004; Gleckler et al., 2008; Reichler and Kim, 2008), and the combination of the output of different models can be a pragmatic approach to estimate model uncertainty (Palmer et al., 2004; Weigel et al., 2008). However, in the context of vulnerability modelling, the concept of combining multiple models is relatively new. Rossetto et al. (2014) and Spillatura et al. (2014) have proposed the use of mean model estimates as part of their studies on fragility and vulnerability curves, respectively, for seismic risk assessment, but model performance is not evaluated and uncertainty quantification is not discussed. The potential use of multi-model ensembles in flood vulnerability assessment has not been addressed before.

This study therefore aims to answer the following research questions:

Can multi-model ensembles be used to improve the accuracy of flood loss estimations?

Are multi-model ensembles able to represent model uncertainty and provide reliable probabilistic estimates of flood loss?

How should such ensembles be constructed?

We first propose a framework to rate flood loss models according to their
potential skill and suitability as participating members in such ensembles.
We then construct various multi-model ensembles, based both on the rating
framework and on a simulated state of non-informativeness, differing in terms
of participating members, ensemble size, and weighting criteria, and evaluate
their performance. Twenty flood loss models
available in the literature are adopted, and losses are modelled for
residential buildings in two application cases, corresponding to flood events
that took place in Germany in 2002 and in Italy in 2010. Based on the
results, which are shown and discussed in Sect.

The flood loss model catalogue developed by Gerl et al. (2016) was used as
the basis for model selection in this study. We first identified all
deterministic models describing loss to residential buildings, and then
excluded models based on the following criteria:

the documentation is insufficient for model implementation;

the model uses explanatory variables that are not available in most practical applications;

the model has a functional form that is considered inappropriate (e.g. too simplistic or discretized);

the model is based on the same dataset as another model deemed more appropriate for the application settings (this is to ensure model independence and avoid potential biases in the resulting ensembles).

Models included in this study, including some of their properties.

Based on this procedure, 20 deterministic flood loss models for
residential buildings were adopted. The catalogue developed by Gerl
et al. (2016) provides information on the properties of each model, which is
necessary to assess model suitability according to the framework proposed in
Sect.

Input variables for the Mulde and Caldogno application cases.

Each model is implemented to compute flood losses for the two application cases described in Sect.

The predictive performance of single loss models and ensemble means is
evaluated in terms of accuracy and systematic bias, using respectively the
root mean squared error (RMSE) and the mean bias error (MBE). For n data
points with observed losses y_i and corresponding model estimates ŷ_i, these
are given by RMSE = [(1/n) Σ_i (ŷ_i − y_i)²]^(1/2) and MBE = (1/n) Σ_i (ŷ_i − y_i).
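For concreteness, the two error metrics can be computed as follows (a minimal sketch; the function and variable names are ours):

```python
import math

def rmse(predicted, observed):
    """Root mean squared error: penalises large deviations quadratically."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def mbe(predicted, observed):
    """Mean bias error: positive values indicate systematic overestimation."""
    n = len(observed)
    return sum(p - o for p, o in zip(predicted, observed)) / n
```

RMSE measures accuracy (always non-negative), while the MBE retains the sign of the deviations and therefore reveals systematic over- or underestimation.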

The probabilistic skill of ensembles is evaluated using the continuous ranked
probability score (CRPS), which is defined as the integrated squared
difference between the cumulative distributions of predictions and
observations (Weigel, 2012). We adopt the expression for the CRPS derived by
Hersbach (2000), which is described as follows. Consider a set of

The CRPS can be interpreted as an error measure, with lower values corresponding to higher probabilistic skill.
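When the ensemble is read through its empirical CDF, the CRPS admits a simple kernel form, mean|X − y| − ½·mean|X − X′|, which is equivalent to the integrated squared difference between predicted and observed cumulative distributions. A minimal sketch (function name is ours):

```python
def crps_ensemble(members, obs):
    """CRPS of a raw ensemble, treating the members as an empirical CDF.

    Kernel form: mean|X - y| - 0.5 * mean|X - X'|, where the second mean
    runs over all ordered pairs of ensemble members.
    """
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term1 - term2
```

For a single-member (deterministic) "ensemble", the CRPS reduces to the absolute error, which is why it generalises deterministic error measures to the probabilistic setting.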

To assess ensemble reliability (i.e. whether ensemble predictions and
observations are statistically indistinguishable), the rank histogram is
adopted, which is constructed as follows. Consider an
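A minimal sketch of the construction (function names are ours; ties between members and observations are ignored for simplicity):

```python
def rank_histogram(member_matrix, observations):
    """Rank histogram of observations within their ensembles.

    member_matrix: one list of m member predictions per case;
    observations: one observed value per case.
    Each observation falls into one of m+1 rank bins; a reliable ensemble
    yields roughly uniform counts across the bins.
    """
    m = len(member_matrix[0])
    counts = [0] * (m + 1)
    for members, obs in zip(member_matrix, observations):
        rank = sum(1 for x in members if x < obs)  # 0 .. m
        counts[rank] += 1
    return counts
```

U-shaped histograms indicate under-dispersion, dome shapes over-dispersion, and sloped shapes systematic bias.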

2002 flood along the Mulde River, in Germany. The figure shows the municipalities considered in the case study (grey), the estimated flood extent and water depths (blue), and the location of the residential grid cells (orange).

Floods are a recurring natural hazard in the Mulde catchment
(7400

Results of individual model applications in the Mulde case: root mean square error (RMSE) and mean bias error (MBE), sorted by RMSE.

The data used for this application case are listed in Table 2, and the
results of individual model applications in terms of error statistics are
shown in Table 3. The flood extent and water depths were estimated through
hydro-numeric simulations (Apel et al., 2009) and hydraulic transformation
(Grabbert, 2006). Return periods of flood peak discharges were derived from
annual maximum series of mean daily discharges by Elmer et al. (2010). For
the estimation of contamination indicators, inundation durations, flow
velocity indicators and precautionary measure indicators, computer-aided
telephone interviews with affected households were used (Thieken et al.,
2005). The average floor areas of residential buildings and average building
values are based on official statistical data about total living area for
different types of residential buildings per district, and standard
construction costs per square metre gross floor area (Kleist et al., 2006).
Asset values with a spatial resolution corresponding to the inundation map
(i.e. 10

2010 Bacchiglione River flood in Caldogno, Italy. The figure shows the estimated flood extent and water depths (blue), and the location of the residential buildings considered in the study (orange).

From 31 October to 2 November 2010, the Veneto Region was affected by
persistent rain, particularly in the pre-alpine and foothill areas, with
accumulated rainfall exceeding 500

Results of individual model applications in the Caldogno case: root mean square error (RMSE) and mean bias error (MBE), sorted by RMSE.

The data used for this application case are listed in Table 2, and the results of individual model applications in terms of error statistics are shown in Table 4. The inundation characteristics were estimated using a coupled 1-D/2-D model of the study area between the municipalities of Caldogno and Vicenza, and validated using data from sources such as aerial surveys and interviews with the local population. Building areas were derived from the cadastral map issued by the Veneto Region. Building properties (i.e. building type, structural type, quality, number of floors, and year of construction) were assessed through direct surveys of each damaged building. Building values were estimated based on data from the Chamber of Commerce of Vicenza. Losses to residential buildings were provided by the municipality of Caldogno and amount to a total of EUR 7.55 million. These correspond to actual restoration costs that were collected and verified within the scope of the loss compensation process by the state. Further details can be found in Scorzini and Frank (2015).

Ensembles are finite sets of deterministic realisations of a random variable, whereby the prediction given by each ensemble member is assumed to represent an independent sample from an underlying true probability distribution (Hamill and Colucci, 1997). Ensembles can be used to account for various sources of uncertainty in physical processes, namely initial condition, parameter, and model uncertainty. The latter can be achieved by combining the output of different models to create a so-called multi-model ensemble (Weigel, 2012). In this section, we investigate how best to translate this concept to the field of flood loss modelling, and to what extent multi-model ensembles can improve the skill and usefulness of flood loss estimations.

The first challenge in constructing a multi-model ensemble to estimate flood loss for a certain future application is identifying models that are better suited to be participating members. One of the requirements for the construction of successful multi-model ensembles is that participating models are skilful; if a model is consistently worse than the others in terms of prediction quality, it should not be included (Hagedorn et al., 2005). Unfortunately, testing the level of skill of a model in predicting loss, for a certain type of asset and application setting, is often not possible. Such an exercise would involve applying each candidate model to estimate loss for a past flood event with similar characteristics, and quantifying its performance based on past loss observations for the same assets. However, data required to perform such assessments are usually not available, as scarcity of data is still a major problem in the field of flood risk (Merz et al., 2010). Moreover, exposure and vulnerability tend to change over time, which is likely to affect loss estimates (Tanoue et al., 2016). Another issue of a more practical nature is that collecting, implementing and comparing flood loss models is laborious and time consuming. Because of the economic constraints that inevitably exist in any practical application, most users will likely have limited time to invest in that task. This becomes more problematic as the already large number of models available in the literature continues to increase.

A more practicable approach is to evaluate the suitability and potential
performance of each model in estimating loss, for a given application
setting, based on its properties. This is advantageous, as it does not
require that each model be tested explicitly, and can instead be achieved by
making use of the information contained in a model metadata catalogue such as
the ones developed by Gerl et al. (2016) or Pregnolato et al. (2015).
However, models differ at various levels, and a model that is potentially
superior regarding some of its properties may be inferior in terms of others
(see Sect.

a probability tree of model properties is set up through expert elicitation. A set of

once the probability tree is set up, it can be used to assign scores to and rank flood loss models. Because the tree covers the entire space of possible categories within each property, all flood loss models will necessarily have a set of properties that matches one of the tree paths. Any model can thus be assigned a score that is equal to the probability of its respective path. When assigned to a certain number of models rather than to all the possible combinations of model properties, such scores no longer have a specific probabilistic meaning, nor are they intended to. Instead, the scores of different candidate models in a pool can be used to establish a relative degree of belief among them. This effectively provides users with information on their potential performance, in relation to the other models in the pool, through a structured and simple-to-use procedure.
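The scoring step can be sketched as follows. The probability tree below is a toy example: the properties, categories and subjective probabilities shown are invented for illustration and are not the values elicited in this study.

```python
# Toy probability tree: each node (model property) has categories with
# subjective probabilities elicited from experts. Values here are invented.
TREE = {
    "loss_type":  {"relative": 0.7, "absolute": 0.3},
    "transfer":   {"none": 0.6, "within_country": 0.3, "cross_country": 0.1},
    "flood_type": {"same": 0.8, "different": 0.2},
}

def model_score(properties, tree=TREE):
    """Score of a model = probability of the tree path matching its properties."""
    score = 1.0
    for prop, category in properties.items():
        score *= tree[prop][category]
    return score

# Relative degrees of belief follow from ranking the scores within the pool.
scores = {
    "model_A": model_score({"loss_type": "relative", "transfer": "none",
                            "flood_type": "same"}),
    "model_B": model_score({"loss_type": "absolute", "transfer": "cross_country",
                            "flood_type": "different"}),
}
```

Because every model matches exactly one path, the product of probabilities along that path yields a single comparable score per model.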

We apply this framework to the models and test cases presented in
Sect.

Proposed set of properties (probability tree nodes) that are considered relevant to assess the performance of flood loss models for buildings, and respective categories and subjective probabilities.

Model scores for the Mulde application case.

Model scores for the Caldogno application case.

Root mean square error (RMSE) and mean bias error (MBE) of the means of ensembles of increasing size, with models included sequentially from highest to lowest score, starting with the highest ranked single model. Blue crosses and orange circles refer to ensembles weighted equally and differently, respectively.

While model properties are expected to be informative for performance, they
are not presumed to explain it fully. However, if model properties are
useful in assessing the performance of models in relation to one another,
some degree of correlation
between model scores and different performance metrics should exist. We
evaluate this using the Spearman's rank correlation
coefficient
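As a sketch, Spearman's rank correlation between model scores and an error metric can be computed as follows (this simplified version does no tie handling, which is an assumption of the sketch):

```python
def _ranks(values):
    """Ranks 0..n-1 of the values (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    mean = (n - 1) / 2
    num = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    den = sum((x - mean) ** 2 for x in ra)  # both rank vectors share this variance
    return num / den
```

A strong negative coefficient between model scores and RMSE would indicate that higher-scored models tend to produce smaller errors.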

The objective of the analyses presented in this section is twofold: to assess to what extent ensemble means are able to improve skill in the estimation of flood losses, and to investigate how such ensembles should be constructed. Regarding the latter, two questions require particular attention: firstly, which and how many models to include as participating members, and secondly, how to weight those members. Both the ensemble size and the model weighting scheme are likely to have an effect on skill.

In this exercise, the models and application cases described in
Sect.

Losses given by ensemble means are estimated using two approaches: firstly,
by assigning equal weights to all models, and secondly, by weighting them
differently. Some considerations are warranted here. In the construction of
an equal-weighted multi-model ensemble, the underlying hypothesis is that the
models are independent and equally skilful, a condition that is most often
not satisfied. For this reason, adopting
different weights may increase the quality of multi-model predictions.
However, finding optimal weights is not straightforward, and previous studies
show that weighting models differently may result in different outcomes
ranging from slight increases to degradation in performance (Doblas-Reyes
et al., 2005; Hagedorn et al., 2005; Knutti et al., 2010). Here, we aim to
assess how weights affect ensemble-mean performances in estimating flood
loss, again by reproducing a practical situation where the skill of models in
a certain future application is not known. Therefore, assigned weights
instead reflect the user's confidence in each model (Marzocchi et al., 2015).
Because the framework proposed in Sect.
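The two weighting schemes can be sketched as a single routine (a hypothetical helper; by default all models are weighted equally, otherwise the supplied score-based weights are normalised to sum to one):

```python
def ensemble_mean_loss(member_predictions, weights=None):
    """Weighted ensemble-mean loss per building.

    member_predictions: one list of per-building loss estimates per model.
    weights: one weight per model (e.g. model scores); equal weights if None.
    """
    m = len(member_predictions)
    if weights is None:
        weights = [1.0] * m
    total = sum(weights)
    w = [x / total for x in weights]  # normalise so weights sum to one
    n = len(member_predictions[0])
    return [sum(w[k] * member_predictions[k][j] for k in range(m))
            for j in range(n)]
```

Summing the per-building ensemble means then yields the total estimated loss for the event.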

Ensemble-mean performances are calculated in terms of RMSE and MBE, which are
shown in Fig. 4 for the ensembles of different sizes – starting with
a single model, the highest ranked for each case – and using the two
weighting schemes described above. A number of observations can be made from
this figure. Firstly, multi-model ensembles of any size, built by adding
models with the highest degrees of belief first, considerably outperform the
highest ranked single model in terms of both RMSE and MBE. This is observed
for both application cases, the only exception being the MBE of some
ensembles in the Caldogno case. Secondly, the picture given by the
two different weighting approaches is mixed; while in some cases there is
improvement by weighting ensemble members differently, in others the opposite
is observed. The weighting approach generally does not have a significant
impact on error metrics, especially when compared to the model selection.
Thirdly, in both cases, the largest improvements in ensemble-mean
performances are obtained after the first few highest ranked models are
added. In relative terms, the impact of including additional models after
that is lower. For example, in the Mulde and Caldogno case studies, the best
performances are obtained with ensembles using respectively the
highest-scoring four and six models. From a practical point of view, this is
a particularly interesting finding because, as mentioned previously, it may
not be feasible to implement a large number of models, and users may
therefore be interested in parsimonious ensembles with the least number of
models that lead to high predictive skill. However, in terms of probabilistic
estimates of loss, smaller ensembles are less useful, which also needs to be
taken into account when deciding on which ensemble size to use, as further
discussed in Sect.

Note that from here on, the equal-weighted expert-based multi-model ensembles shown in Fig. 4 will be used as a basis for other analyses and further discussion, and for the sake of brevity will be referred to as EEM-ensembles.

RMSE and MBE of the EEM-ensemble means, represented by blue crosses, and single model predictions, by red plus signs.

Some of the above observations draw comparisons between multi-model ensembles and individual models, for which the highest ranked single model is used as reference. Even though that model may not necessarily correspond to the highest performing model (which it does not in either of the application cases used here; see Tables 3, 4 and 5, 6), in a practical application case, users have no way of knowing which model is the “best”. The above results very clearly demonstrate that in such a situation, using a multi-model ensemble is preferable. However, it is also insightful to assess how the constructed multi-model ensembles perform in relation to the other single models. Therefore, in Fig. 5, the error metrics of the predictions given by EEM-ensemble means and single models are presented, showing that the former consistently outperform the latter. Note that ensembles are not expected to outperform every single model in every possible situation, and it is possible that in some application cases, certain models have such high accuracy that combining them with other models results in lower performances. The problem is that it is usually not possible to identify such models beforehand. For example, in the Mulde case, the Luino model slightly outperforms the constructed ensembles in terms of RMSE. This model consists of a simple stage-damage function that refers to a single building type, and was derived from data from a flood in Italy. It cannot therefore be expected to perform as well consistently if applied to other analogous case studies. Overall, better performances should be obtained by using multi-model ensembles (Hagedorn et al., 2005).

The framework proposed in Sect.

RMSE and MBE of 20 000 multi-model ensemble means, generated by simulating a state of non-informativeness, whereby each participating member is assigned a random weight. Blue crosses and red plus signs refer respectively to the EEM-ensemble means and the single model predictions.

Scatter plots of the RMSE and MBE that result from the above procedure are
presented in Fig. 6 for both case studies. The same error metrics regarding
the EEM-ensembles and the single models are also included. The plots show
that a wide range of possible outcomes in terms of RMSE and MBE exist when
random weights are assigned to models within the framework of a state of
non-informativeness. While the lower bounds of the resulting convex hull are
defined by the error metrics of the lowest-performing models, the upper
bounds (i.e. highest performances) are given not by any single model, but
instead by multi-model ensembles, as expected. In this regard, it is clear
that the model rating framework based on expert judgement and subjective
probabilities proposed in Sect.
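One way to generate such random-weight ensembles under simulated non-informativeness can be sketched as follows; the exact subsetting and weighting protocol is an assumption of this illustration, not necessarily the procedure used in the study:

```python
import random

def random_ensemble(member_predictions, rng):
    """One ensemble under simulated non-informativeness.

    A random subset of models is drawn, and each member receives a random
    weight; weights are normalised to sum to one. Returns the per-building
    ensemble-mean losses.
    """
    m = len(member_predictions)
    size = rng.randint(1, m)                 # random ensemble size (1..m)
    subset = rng.sample(range(m), size)      # random participating members
    raw = [rng.random() for _ in subset]
    total = sum(raw)
    w = [x / total for x in raw]             # random weights, summing to one
    n = len(member_predictions[0])
    return [sum(w[i] * member_predictions[k][j] for i, k in enumerate(subset))
            for j in range(n)]
```

Repeating this many times (e.g. 20 000 draws) and computing RMSE and MBE for each draw traces out the cloud of possible ensemble outcomes shown in Fig. 6.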

The plots also show that it is possible to create certain ensembles that lead
to better skill in relation to the ones developed based on expert judgement.
However, the potential relative degree of improvement is very low in both
test cases, more markedly so in the Caldogno case, which reinforces the idea
that the approach proposed in Sect.

In Sect.

It is first necessary to make clear what the probabilistic meaning of a multi-model ensemble is. Multi-model ensembles do not directly provide probability distributions of a certain variable; instead, ensemble predictions are a priori only finite sets of deterministic realisations of that variable. The question then arises of how a probability distribution can be obtained from such ensembles. The simplest approach is to adopt a frequentist interpretation of the ensembles, whereby the probability of a certain event is estimated by the fraction of ensemble members predicting it. However, such an approach can only produce reasonable probabilistic estimates if many ensemble members are available. Better probabilistic estimates may in principle be obtained by dressing the ensemble members with kernel functions or by fitting a suitable parametric distribution to them, provided that this is done in an appropriate manner (Weigel, 2012).
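Under the frequentist interpretation just described, for instance, the probability that loss exceeds a given threshold is simply the fraction of ensemble members predicting it (a minimal sketch; the function name is ours):

```python
def exceedance_probability(members, threshold):
    """Frequentist reading of the ensemble: the probability of exceeding a
    threshold is the fraction of ensemble members above it."""
    return sum(1 for x in members if x > threshold) / len(members)
```

With only a handful of members, this estimate is coarse (probabilities come in steps of 1/m), which is precisely why larger ensembles, kernel dressing, or parametric fitting yield smoother and potentially better probabilistic estimates.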

Continuous ranked probability score (CRPS) of the EEM-ensembles for the Caldogno application case.

Regardless of the method that is used to obtain probabilistic estimates from multi-model ensembles, it is first important to evaluate the “raw” ensembles, with minimum interference from the ensemble interpretation model that is used. This can be achieved using the continuous ranked probability score (CRPS) (Bröcker, 2012; Hersbach, 2000). We calculate the CRPS for the EEM-ensembles, and present the results in Fig. 7. This is done for the Caldogno case study, as the low number of data points in the Mulde case (19) is insufficient for such analysis.

Rank histogram relative to the 20-model ensemble for the Caldogno application case.

The probabilistic skill of the ensembles is observed to increase (i.e. the
CRPS decreases) with the number of participating members. This
is to some extent expected, as ensemble size is known to have an effect on
probabilistic skill scores, which is explained by the fact that probabilistic
estimates derived from ensembles become more unreliable as the size of the
ensemble gets smaller (Weigel, 2012). This highlights the need of using
a considerable number of models when the objective is to obtain reliable
(i.e. statistically consistent) probabilistic estimates of flood loss.
Another requirement to achieve this is that the ensemble itself is reliable,
in the sense that ensemble members and observations are sampled from the same
underlying probability distributions or, in other words, that they are
statistically indistinguishable from each other (Leutbecher and Palmer,
2008). Even an ensemble of infinite size is unable to yield reliable
probabilistic estimates if its members are not reliable (e.g. if they are
heavily biased). For illustration, we assess reliability considering an
ensemble comprising all 20 models implemented in this study using the
rank histogram, which is shown in Fig. 8. As expected, the ensemble is not
perfectly reliable; however, the counts do tend to oscillate around

Probabilistic estimates of total loss, relative to model
uncertainty, for the Caldogno application case, based on 10 000 realisations
of loss to each building.

Finally, we illustrate the simplest approach to obtain a probabilistic
distribution of flood losses using a multi-model ensemble. For each building,
a value of loss is randomly generated using the reverse transform sampling
method, whereby a number
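The sampling procedure can be sketched as follows, using a nearest-rank inverse of the empirical CDF formed by the member predictions for each building; independent sampling across buildings is an assumption of this sketch:

```python
import random

def sample_total_loss(member_predictions, n_realisations, rng):
    """Probabilistic total loss via inverse transform sampling.

    member_predictions: one list of per-building loss estimates per model.
    For each building and realisation, a uniform random number selects a
    quantile of the empirical CDF formed by the m member predictions
    (nearest-rank inverse); per-building draws are summed into a total.
    """
    m = len(member_predictions)
    n = len(member_predictions[0])
    # Per-building empirical CDF: sorted member predictions
    sorted_cols = [sorted(member_predictions[k][j] for k in range(m))
                   for j in range(n)]
    totals = []
    for _ in range(n_realisations):
        total = 0.0
        for col in sorted_cols:
            u = rng.random()                    # u ~ U(0, 1)
            total += col[min(int(u * m), m - 1)]
        totals.append(total)
    return totals
```

The resulting set of totals approximates the distribution of total loss attributable to model uncertainty, as shown in the figure for the Caldogno case.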

Statistical post-processing techniques may be used to improve the reliability of probabilistic predictions. This is common practice in the field of numerical weather prediction, for example. However, in that case, relatively long time series of past observational data for a certain variable (e.g. temperature) at a certain location are usually available, and such data continue to be collected, which allows the predictive system to be calibrated and the forecasts verified. This is in contrast with the case of flood loss estimations, where loss models necessarily need to be transferred due to the rarity of the events and the difficulty in obtaining data. In the particular case of probabilistic loss estimates based on ensembles, it is therefore necessary to investigate how best to improve their reliability for future applications by considering data from previous flood events often occurring in different contexts. In addition, as mentioned previously, the reliability of probabilistic estimates may also be improved by using a more sophisticated ensemble interpretation method (i.e. kernel dressing or parametric distribution fitting). However, the most appropriate approach to do this in the case of flood loss modelling also needs to be investigated. These topics are beyond the scope of this article.

Flood loss modelling is associated with considerable uncertainty that is often neglected. In fact, most currently available flood loss models are deterministic, providing only single point estimates of loss. Users interested in performing a risk assessment will typically select one such model from the large number available in the literature, based on their perception of which one is the most suitable for the application case at hand. However, this is generally done rather arbitrarily. Moreover, the uncertain nature of flood loss estimations means that the performance of any single deterministic model may vary considerably from case to case, as large disparities in model outcomes exist even among apparently comparable models. This approach is therefore flawed at two main levels: first, flood risk estimates are highly sensitive to the selection of the flood loss model, and second, deterministic estimates of loss do not lead to optimal decision-making. In this study, we have proposed a novel approach to tackle these issues and advance the state of the art of flood loss modelling, based on the application of the concept of multi-model ensembles. This technique, which is widely used in fields such as weather forecasting, consists of combining the outcomes of different models in order to improve prediction skill and sample model uncertainty.

In order to support ensemble construction, we have first proposed a framework to assess the suitability of flood loss models for specific application cases, based on some of their main properties, through expert knowledge. This approach is advantageous as it does not require that all candidate models be implemented beforehand, which is often not achievable in practice. Based on this framework, we have proposed a scoring scheme for flood loss models for residential buildings and applied it to the 20 models and two application cases used in this study. The obtained model scores show a strong, statistically significant negative rank correlation with error metrics, suggesting that the proposed approach is useful and that expert judgement is informative for model performance and selection.
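The kind of evaluation described above can be sketched as follows, using a tie-free Spearman rank correlation between expert-based scores and model errors. The scores and errors are synthetic placeholders, not the study's values:

```python
# Simple Spearman rank correlation (assumes no ties among values).
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented example: higher score = better-rated model; errors are MAE-like.
scores = [9.1, 8.4, 7.7, 6.9, 6.0, 5.2, 4.8, 3.5, 2.9, 1.6]
errors = [0.10, 0.14, 0.12, 0.20, 0.25, 0.22, 0.31, 0.35, 0.40, 0.55]

rho = spearman(scores, errors)
print(round(rho, 3))  # → -0.976: better-rated models tend to have lower errors
```

A strongly negative value of this kind is what indicates that the expert-based rating is informative about out-of-sample model performance.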

The constructed ensembles have been shown to considerably outperform the highest-ranked single models in the estimation of flood losses. This demonstrates that in a practical application, where model performances cannot be tested beforehand, using multi-model ensembles will result in more skilful loss estimates. Ensemble means were also tested against all single models, consistently showing higher accuracy. Equal-weighted ensembles generally displayed performances comparable to the score-weighted ones. The largest improvements in ensemble-mean performance were observed after the first few highest-ranked models were added to the ensembles, which is a useful finding for practical applications, where it is not always feasible to implement a large number of models. We have also simulated a state of non-informativeness and randomly generated a large set of multi-model ensembles, representative of all possible ensembles that can be constructed using the 20 flood loss models adopted in this study. The ensembles based on the expert-based scoring approach were among the most skilful, highlighting its value in the construction of multi-model ensembles. Results also suggest that model selection is more important than model weighting. Further insight may be gained by testing the approach in other application cases and with a different set of flood loss models.
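The stochastic generation experiment can be sketched as below: ensembles with random members, sizes and weights are drawn repeatedly, and the error of each ensemble mean is recorded. All numbers here are invented for illustration only:

```python
import random

random.seed(42)

# Hypothetical loss estimates (e.g. EUR million) from 20 candidate models,
# and a hypothetical observed loss for one event.
member_losses = [round(random.uniform(2.0, 8.0), 2) for _ in range(20)]
observed_loss = 5.0

def random_ensemble_error():
    """Absolute error of the mean of a randomly drawn, randomly weighted ensemble."""
    size = random.randint(2, 20)
    members = random.sample(member_losses, size)
    weights = [random.random() for _ in members]
    total = sum(weights)
    mean = sum(w / total * m for w, m in zip(weights, members))
    return abs(mean - observed_loss)

errors = [random_ensemble_error() for _ in range(10_000)]
print(min(errors), sum(errors) / len(errors), max(errors))
```

Comparing the error of a deliberately constructed (e.g. score-weighted) ensemble against this distribution of randomly generated ones shows where it falls within the space of all possible ensembles.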

Larger ensembles showed higher probabilistic skill than smaller ones, which results from the increased intrinsic unreliability of ensembles as the number of participating members decreases. Therefore, while only a limited number of models is necessary to obtain accurate mean estimates of loss, additional effort in model implementation is recommended when the objective is to derive a probability distribution of loss that captures model uncertainty. For the Caldogno case study, we have illustrated how such a distribution can be constructed, adopting a simple equal-weighted ensemble comprising all 20 models. The results demonstrate that the use of multi-model ensembles represents a simple and pragmatic way of obtaining reliable flood loss distributions, which are more useful for decision-making than single point estimates of loss. Reliability may be further improved by calibrating the ensembles and/or adopting more sophisticated ensemble interpretation methods, which warrants further research.
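Constructing such a distribution from an equal-weighted ensemble amounts to treating the member estimates as equally likely samples of the uncertain loss. A minimal sketch, with invented member outputs standing in for the 20 models' estimates:

```python
import numpy as np

# Hypothetical total-loss estimates (EUR million) from 20 ensemble members
# for a single flood event; values are invented for illustration.
member_losses = np.array([
    3.1, 4.5, 2.8, 5.2, 6.0, 3.9, 4.1, 7.3, 2.5, 4.8,
    5.5, 3.3, 6.7, 4.0, 5.9, 3.6, 4.4, 6.2, 2.9, 5.1,
])

# Equal weights: the empirical distribution of the members is the loss distribution.
mean_loss = member_losses.mean()
p10, p50, p90 = np.percentile(member_losses, [10, 50, 90])

print(f"mean={mean_loss:.2f}, median={p50:.2f}, "
      f"10-90% range=[{p10:.2f}, {p90:.2f}]")
```

Unlike a single point estimate, the resulting percentile range conveys the spread due to model uncertainty, which is the quantity of interest for risk-based decision-making.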

The observed and modelled losses for both case studies are available in the Supplement.

The authors declare that they have no conflict of interest.

This research was partly supported by the European Union's Horizon 2020
research and innovation programme, through the IMPREX project (grant
agreement no. 641811) and the H2020 Insurance project (grant agreement
no. 730459). Further support has been received from Guy Carpenter and
Company Ltd. (