Flood damage assessment is usually done with damage curves that depend only on the water depth. Several recent studies have shown that supervised learning techniques applied to a multi-variable data set can produce significantly better flood damage estimates. However, creating and applying a multi-variable flood damage model requires an extensive data set, which is rarely available, and this is currently holding back the widespread application of these techniques. In this paper we enrich a data set of residential building and contents damage from the Meuse flood of 1993 in the Netherlands to make it suitable for multi-variable flood damage assessment. Results from 2-D flood simulations are used to add information on flow velocity, flood duration and return period to the data set, and cadastre data are used to add information on building characteristics. Next, several statistical approaches are used to create multi-variable flood damage models, including regression trees, bagging regression trees, random forests and a Bayesian network. Validation on data points from a test set shows that the enriched data set in combination with the supervised learning techniques delivers a 20 % reduction in the mean absolute error, compared to a simple model based only on the water depth, despite several limitations of the enriched data set. We find that, with our data set, the tree-based methods perform better than the Bayesian network.

Decision making in flood risk management is increasingly based on studies that quantify the flood risk rather than only the flood hazard. Flood damage estimation is therefore becoming increasingly important (Merz et al., 2010). Flood risk assessment supports policy makers in deciding which flood risk management measures are most efficient in reducing flood risks and how much investment is cost efficient. With the European Union Floods Directive (EC, 2007) now fully in place, national flood risk assessments are being developed with the final aim to support flood risk management plans. In the Netherlands, such flood damage assessment has been used to derive the optimal protection standard for flood protection (Kind, 2013; van der Most, 2014), using the current Dutch standard method for damage modelling (Kok et al., 2005). Also, for insurance applications, more precise estimates of flood damage are required.

Flood risk assessments require flood damage models. These models typically predict the damage as fraction of the potential damage, based on the water depth, and average building repair and replacement costs for different types of buildings (Messner et al., 2007; Jonkman et al., 2008). Similar approaches are also applied to other natural hazards, for example for landslides (Papathoma-Köhle et al., 2015), and the software package HAZUS can be used for floods, earthquakes and hurricanes (Scawthorn et al., 2006). Alternative approaches to calculate flood risk also exist, such as vulnerability indicators (Papathoma-Köhle, 2016).

Simple flood damage models often do not perform well, as shown by their validation (e.g. Jongman et al., 2012). This is because water depth alone cannot explain the full complexity of the flood damaging processes, and several studies have found only low correlation coefficients (typically below 0.5) between the water depth and the flood damage (e.g. Merz et al., 2013; Pistrika and Jonkman, 2009). Furthermore, often no local data are available on flood damage, and therefore a relationship between the water depth and damage either needs to be estimated or transferred from other areas (Wagenaar et al., 2016). This can cause errors, as simple models hold many implicit assumptions that may not be valid for the situation the model is transferred to. For instance, Elmer et al. (2010) showed that the same damage function cannot be applied to an event with a low flood probability and to an event with a high flood probability. These implicit assumptions cause large unexplained differences between flood damage functions (Wagenaar et al., 2016; Gerl et al., 2016). However, transferability can be improved when a model describes more variations of the damaging process and when more variables are included in the damage models (e.g. when the flood probability is explicitly part of the model). Similar problems are also present in the modelling of other natural hazards. For example, Fuchs et al. (2007) found that building materials are very important for debris flow damage modelling and that models can therefore not always be transferred in space and time.

Current approaches suffer from two main limitations. First, they rely on limited information, usually taking into account only the water depth as a predictor of a fraction of the average maximum damage. Second, they are deterministic in nature, while it has been shown that the uncertainties in this approach are large but generally not quantified (e.g. in the Dutch standard method; Egorova et al., 2008). Some of the multi-variable methods are able to provide probability distributions, rather than deterministic estimates of damage.

Recently, multi-variable flood damage models have been created with a German data set based on telephone interviews. Thieken et al. (2005) found that, apart from the water depth, the contamination of the flood water and precautionary measures were also important for estimating the flood damage. In Thieken et al. (2008) these extra variables were included in a simple multi-variable flood damage model as a surcharge. Using information from this same database, Merz et al. (2013) used regression and bagging trees and Vogel et al. (2014) used Bayesian networks to predict the flood damage. Spekkers et al. (2014) applied regression trees to estimate pluvial flood damage. Van Oostegem et al. (2015) applied the Tobit estimation technique to a multi-dimensional data set in Belgium to estimate pluvial flood damages. Schröter et al. (2014) showed that these multi-variable flood damage models perform better than simple flood damage models (up to 25 % reduction in mean absolute error, MAE), both when tested on their own data set and on data sets from other floods. Also, some multi-variable approaches (Bayesian networks, bagging trees and random forests) generate probability distributions of estimated damage, and thus provide information on uncertainties of the estimates. Therefore, multi-variable flood damage models are a promising approach to improve flood damage modelling.

The application of multi-variable flood damage models for flood risk management studies is still difficult because of the large data requirements. Running a multi-variable flood damage model for a new area requires for every object several variables on the flood hazard and building characteristics that are not yet typically collected. Creating new multi-variable flood damage models is currently rarely done because they also require records of flood damage at building level.

More commonly available (although still rare) are simple data sets that hold records with the flood damage that occurred for each building with sometimes a few other variables (such as location or water depth). Such data sets may have been created for compensation purposes or to build simple flood damage models but may miss most of the desired variables. An example of such a data set is the flood damage data set collected after the Meuse flood of 1993 in the Netherlands which is used here. Previously this data set has been described in Wind et al. (1999) and in more detail in WL Delft (1994). In this paper we explore the use of supervised learning techniques to build flood damage models based on a data set that is very different from the data sets used in previous studies (i.e. the German data set applied by Merz et al., 2013, and Schröter et al., 2014). The data set in this paper was collected by insurance experts directly after the flood for compensation purposes and covers all affected buildings. This is different from the German data set, which was collected a year after the flood for research purposes based on a sample of the affected buildings. The data are also different in that in the original study only a few variables were collected, in contrast to the German data set, where all variables (except return period) were based on telephone interview answers. In this study several methods are applied to enrich the Meuse 1993 flood damage data set with extra flood hazard and building characteristic variables. We will answer the question of whether this enriched data set from a different source than previous studies is also suitable to build a multi-variable flood damage model. The expectation is that a multi-variable model performs better than a model based on a single variable (water depth) and that even data with limited quality will improve the results.

Two-dimensional hydraulic simulations of the 1993 flood on the Meuse are used to enrich the data set with additional flood characteristics. Cadastre data are used to enrich the Meuse data set with extra building characteristics. Four different supervised learning techniques are then applied to this enriched data set: a regression tree, bagging regression trees, random forest and a Bayesian network. A part of the data set will be held back and will only be used for validation. This validation is then used to determine whether the enriched data set combined with supervised learning techniques performs better than a traditional damage function based on the original data set of water depths. In this paper we will focus on predicting absolute flood damages rather than relative flood damages. This is because the exact building values are not available.
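The validation set-up described above can be sketched as follows. The data here are synthetic stand-ins (variable names and numbers are invented for illustration, not taken from the Meuse data set): part of the data is held back for validation, and the MAE of a multi-variable model is compared against a depth-only root-function reference fitted on the same training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in for an enriched data set: depth, velocity, duration, area.
X = rng.uniform(0, 3, size=(n, 4))
damage = 10_000 * np.sqrt(X[:, 0]) + 2_000 * X[:, 1] + rng.normal(0, 1_000, n)

# Hold back part of the data set, to be used only for validation.
X_tr, X_te, y_tr, y_te = train_test_split(X, damage, test_size=0.3,
                                          random_state=0)

# Reference model: damage = a * sqrt(depth), fitted by least squares
# (closed form: a = sum(y * sqrt(h)) / sum(h)).
a = np.sum(y_tr * np.sqrt(X_tr[:, 0])) / np.sum(X_tr[:, 0])
mae_ref = mean_absolute_error(y_te, a * np.sqrt(X_te[:, 0]))

# Multi-variable model trained on the same split.
rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
mae_multi = mean_absolute_error(y_te, rf.predict(X_te))
```

On data of this kind, where variables beyond the water depth carry real signal, the multi-variable model yields a lower MAE on the held-back test set than the depth-only reference.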

The data set available for this research is based on the Meuse flood of
22 December 1993 in the province of Limburg in the Netherlands (WL Delft,
1994). Although no dike breaches occurred in this event, several towns and
urban areas located close to the river were affected. The flood caused a
total of 254 million guilder (price level 1993) in direct damages, which is
approximately EUR 180 million today (price level 2016). The flood inundated
180 km².

Damage information was collected in the context of a compensation arrangement for flood damage by the national government. All data were collected by sending damage experts from insurance companies to the affected buildings several weeks after the flood event. Directly after the damage data were collected in 1994, the data were shared with WL Delft (now Deltares) to create a flood damage model. WL Delft received 5780 records of damage to residential buildings. The damage to privately owned residential buildings was collected by an organization called “Stichting Watersnood 1993”, whereas the damage to companies and to the structure of rental residential buildings was collected by a second organization, “Stichting Watersnood Bedrijven 1993”. The structural damage data for rental residential buildings were therefore collected separately from the rest of the residential damage information, and were only shared with WL Delft (1994) in a partially aggregated form. WL Delft (1994) presumably distributed this partially aggregated damage over the individual rental residential buildings; the exact method was, however, not reported and the original data set is no longer available. We therefore had to work with a data set that includes unknown manual actions, and the structural damage data are consequently of inconsistent quality. The contents damage has no such problems. Furthermore, the percentage of rental residential buildings in the affected area of Limburg is believed to be relatively low, limiting the impact of this data problem.

Another issue with the data set is that for privacy reasons the exact locations of the buildings were not shared with WL Delft. Only the six-digit postal code was available for this study, which makes it difficult to enrich the data set, as between 1 and 20 buildings share the same six-digit postal codes in the data set.

Scatter plot showing the relation between water depth and damage in the original data set.

In the original data set the water depth (relative to the ground floor level)
was estimated by the experts that surveyed the damage. The quality of the
water depth estimate is questioned by WL Delft (1994; report 9) because it was not the main aim of the survey and the
experts visited several weeks after the water had receded. A plot of the
water depth (see Fig. 1) and the damage does not show an obvious relation. The
correlation between the water depth and the damage is weak (the Pearson
correlation coefficient is low).

The final data set also contains information on the number of inhabitants per building, whether the house has a basement and whether the house is attached to other houses. These data are not described in any of the available reports, so the collection methods are unknown, but the recorded values are clear enough to incorporate in this study. Two more variables are included in the WL Delft data set and also not described in any available report: emergency actions and ownership of the house. The meaning of the values found in the data set for these variables is not sufficiently clear, and they could therefore unfortunately not be taken into account in this study.

To improve the data set, additional information is required on both the flood hazard and exposure variables. The results of a 2-D flood simulation and cadastre data were used to upgrade the data set, in terms of hazard and exposure information, respectively. Because no observational data are available on flood characteristics other than the water depth, a simulation of the flood event was performed. In the 2-D flood simulation tool WAQUA (Rijkswaterstaat, 2013), a verified model of the state of the Meuse during the 1993 flood was available (Becker, 2012) and this was applied in this study to get extra variables. Using this model, a new simulation was run using a discharge boundary condition at Eijsden and a water level boundary condition at Keizersveer for the period 1 November 1993 to 31 January 1994. This simulation was used to create a maximum water depth map, a flood duration map, a flood return period and a flow velocity map at a spatial resolution varying between 10 and 40 m.

The maximum water depth and flow velocity are standard outputs of WAQUA. Flood duration is, however, not a standard output and is more difficult to obtain from a 2-D flood simulation because the drainage also needs to be included in the schematization (Wagenaar, 2012). During the 1993 Meuse flood, most drainage occurred because of the natural slope in terrain and therefore the 2-D flood simulation implicitly includes most of the drainage because the discretized bed level is included. The flood duration can then be calculated by analysing the time-varying maps of the water depth and calculating for every cell the time between the moment a cell is inundated and the moment the cell is dry again. However, some cells in the digital elevation map in WAQUA are surrounded by cells that have a higher elevation. These cells do not drain in the 2-D flood simulation and are still inundated at the end of the simulation. For these cells the flood duration has been calculated based on the change in water depth. If the water depth in a cell stays the same in the simulation for 24 subsequent hours the cell is considered dry at the moment this stable water depth is first reached.
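As a sketch, the duration rule described above could be implemented as follows, assuming the time-varying water depths are available as a (time, y, x) array with hourly time steps. The function name and array layout are our assumptions for illustration, not the WAQUA interface.

```python
import numpy as np

def flood_duration(depth, dry_thresh=0.0, stable_hours=24):
    """Per-cell flood duration (in time steps) from a (time, y, x) stack of
    water-depth maps: the duration runs from first inundation until the cell
    is dry again; cells still wet at the end of the simulation are considered
    dry from the moment their depth stays constant for `stable_hours`
    consecutive steps."""
    t, ny, nx = depth.shape
    duration = np.zeros((ny, nx), dtype=int)
    for i in range(ny):
        for j in range(nx):
            series = depth[:, i, j]
            wet = series > dry_thresh
            if not wet.any():
                continue
            start = int(np.argmax(wet))               # first wet time step
            if not wet[-1]:                           # cell drains in the run
                end = t - int(np.argmax(wet[::-1]))   # one past last wet step
            else:                                     # non-draining cell
                end = t
                for k in range(start, t - stable_hours + 1):
                    if np.all(series[k:k + stable_hours] == series[k]):
                        end = k                       # stable depth reached
                        break
            duration[i, j] = max(end - start, 0)
    return duration
```

For cells that drain naturally the rule reduces to "time between first wet and first dry"; the stable-depth clause only kicks in for the enclosed cells that never drain in the simulation.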

Simulations were also run with the same Meuse 1993 schematization for design
discharges with return periods of 1, 10, 50, 100, 250 and 1250 years. These
design discharges are based on HR2006 (Diermanse, 2004) and amount to
1300, 2260, 2869, 3109, 3431 and 4000 m³ s⁻¹, respectively.

These maps (water depth, flow velocity, flood duration and return periods) were linked to the original damage records using cadastre data. The data of the cadastre have exact building locations, postal codes, living area within the residential buildings, the building footprint area and the construction year. The building year was used to filter the data to find the building stock of 1993. Then, based on the building locations, the 2-D flood simulation results were linked to the cadastre data.

Description of the variables in the flood damage data set for the Meuse flood of 1993.

This combination of cadastre data and 2-D flood simulation data is then used to make the link with the original flood damage records. First, per postal code, a list is made of the damage records in the postal code area, ranked by the water depth in the original damage records. Then a second list is made of the objects per postal code according to the cadastre, ranked by the simulated water depth. The cadastre objects, combined with the 2-D flood simulation data, are then linked per postal code based on the water depth rank. This results in a join between the original damage records, cadastre data and 2-D flood simulation results. Table 1 gives an overview of the available records in this combined data set.

The method of joining cadastre objects with damage records within a postal code area based on water depth rank is error prone. The modelled water depth is on average 30 cm larger than the recorded water depth. This is possibly because of the difference in reference level of both data sources, as the recorded water depth is relative to the floor level and the modelled water depth is relative to the digital elevation map. Not all houses have the same floor elevation and both the recorded and the modelled water depth are uncertain because of recording and model imprecisions. It is therefore likely that some damage records have been linked to the wrong object. However, errors will likely be limited because the join on postal codes is accurate. Object and flood variables are generally similar for buildings within the same postal code area (e.g. houses within a street are typically similar to each other), so these errors are expected not to significantly disturb the general trends in the data. The errors are therefore considered acceptable given that the purpose of the data set is only to build a flood damage model. If significant errors are present this would result in a reduced performance of the supervised learning algorithms on the test set. A relatively simple alternative to this water depth rank method is also applied. In this alternative, the average value at all building locations in the postal code area was assigned to each of the objects in the postal code.
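A minimal sketch of this rank-based join with pandas; the table contents and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
damage = pd.DataFrame({
    "postcode": ["6131AB", "6131AB", "6131AB", "6137CD"],
    "recorded_depth": [0.2, 0.9, 0.5, 0.3],
    "damage": [4000, 15000, 8000, 5000],
})
cadastre = pd.DataFrame({
    "postcode": ["6131AB", "6131AB", "6131AB", "6137CD"],
    "simulated_depth": [0.35, 1.10, 0.60, 0.40],
    "floor_area": [95, 120, 110, 80],
})

# Rank both tables by water depth within each postal code ...
damage["rank"] = damage.groupby("postcode")["recorded_depth"].rank(method="first")
cadastre["rank"] = cadastre.groupby("postcode")["simulated_depth"].rank(method="first")

# ... and join damage records to cadastre objects on (postcode, rank).
enriched = damage.merge(cadastre, on=["postcode", "rank"])
```

Because the join key is the rank rather than the absolute depth, the systematic 30 cm offset between recorded and modelled depths does not by itself break the matching; only rank reversals within a postal code lead to wrong links.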

Several supervised learning techniques have been applied to the enriched data set to build multi-variable flood damage models. The different supervised learning techniques all have different ways to generalize the training data in such a way as to give useful predictions of the total damage.

These multi-variable flood damage models are compared to two different reference models to assess the value of the enriched data set and to assess the value of multi-variable flood damage models in general. In what follows, the different supervised learning algorithms applied are described in further detail.

The first reference model only uses the square root of the water depth (see Eq. 1) to predict the flood damage. This model represents the damage functions commonly applied today in flood risk management studies because many damage functions have approximately the shape of a root function (e.g. Scawthorn et al., 2006; Thieken et al., 2008; Penning-Rowsell et al., 2005; Sluijs et al., 2000). Merz et al. (2012) applied the same method to get a reference damage function. The purpose of this reference model is to see the benefits of using more data.

The root function (Eq. 1) is fitted to the data set in such a way that the
coefficients minimize the squared error between the predicted and observed damages.
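This fit has a simple closed form. A sketch follows, assuming Eq. (1) takes the form damage = a · √(water depth); the helper name is ours.

```python
import numpy as np

def fit_root_function(depth, damage):
    """Least-squares fit of the reference model damage = a * sqrt(depth):
    minimizing sum((damage - a*sqrt(depth))**2) over a gives the closed form
    a = sum(damage * sqrt(depth)) / sum(depth)."""
    s = np.sqrt(depth)
    return float(np.sum(damage * s) / np.sum(s * s))

# Example where the data follow the root shape exactly: a recovers 10000.
depth = np.array([0.25, 1.0, 2.25])
damage = np.array([5000.0, 10000.0, 15000.0])
a = fit_root_function(depth, damage)
```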

The second reference model uses multi-variable linear regression to fit a linear model to the data. This model represents simpler/traditional techniques to make a multi-variable model from data. The purpose of this reference model is to see the benefits of potentially better techniques to build multi-variable models from data. Multi-variable linear regression is for example used in Islam (1997) to make multi-variable flood damage models. Linear regression is used without transformations of the input variables because there is no clear indication in the data that there are non-linear relationships (for example see Fig. 1).

To ensure that the model captures general trends and does not fit too strongly to the observed data (overfitting), the LASSO technique is used. This technique applies a penalty on the size of the coefficients, discouraging the model from relying too heavily on any of the variables. LASSO yields sparse models, meaning some coefficients become exactly zero, which indicates that the corresponding variables are not useful for the prediction. The LASSO technique is therefore also useful for variable selection. For this to work correctly, the data are normalized before training the model.

The multi-variable linear regression was carried out with the Scikit-learn library in Python (Pedregosa et al., 2011). LASSO requires an alpha parameter, which determines the strength of the penalty applied. Several alpha values were tried (0, 0.5, 1 and 10). The model is very insensitive to the alpha value (all formulations perform about equally well), and an alpha value of zero performed best on all indicators. It was therefore not optimized further, and alpha was set to zero. When alpha is zero the method reduces to the ordinary least-squares method, so no overfitting prevention is in place and LASSO is effectively not used. This shows that overfitting is not an issue for relatively simple techniques such as linear regression with this data set and number of variables.
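The penalty and its variable-selection effect can be illustrated with Scikit-learn as follows. The data are synthetic, and the alpha value here is deliberately large, chosen only to make the selection visible.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
depth = rng.uniform(0, 3, n)
irrelevant = rng.uniform(0, 3, n)          # variable unrelated to the damage
y = 10_000 * depth + rng.normal(0, 500, n)

# Normalize the data before training, as described above.
X = StandardScaler().fit_transform(np.column_stack([depth, irrelevant]))

# The L1 penalty shrinks the coefficients; the irrelevant variable is driven
# to exactly zero, which is how LASSO performs variable selection.
model = Lasso(alpha=2000.0).fit(X, y)
```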

Decision trees are a way to represent complex relationships between data and classes in a tree structure. A decision tree can be seen as a series of binary questions (nodes) leading to an answer in the form of a class (leaf). A question can be related to any variable at any value (e.g. is the water depth smaller than 0.5 m).

A regression tree is similar to a decision tree, but instead of classes it returns real numbers. In theory, regression trees can be very large and have a separate leaf for each unique value in the data set. However, it is more common to combine several similar values in the same leaf and represent them with a summary statistic (typically the mean). In that case the regression tree is an approximation of the relationship.

Regression tree learning algorithms can create optimal regression trees based
on a data set. In this paper the data set consists of 4398 flood damage records
(incomplete records are discarded) with 11 variables for each damage record
(see Table 1). The regression tree algorithm aims to split the data set into
subsets in such a way that the mean squared error (MSE) of the predicted
total damage for all observations is minimized compared to the observed data.
It does this by calculating the reduction for all candidate splitting
variables and split values and then picking the combination that
maximizes the MSE reduction.
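A minimal sketch with Scikit-learn's DecisionTreeRegressor on synthetic data (the numbers are invented); splits are chosen to maximize the MSE reduction, and limiting the depth keeps the tree an approximation rather than a lookup table.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 800
X = rng.uniform(0, 3, size=(n, 2))      # e.g. water depth and flow velocity
# Step in the damage at 1 m depth plus a linear velocity effect.
y = np.where(X[:, 0] < 1.0, 2_000.0, 12_000.0) + 1_000 * X[:, 1]

# Each split maximizes the reduction in mean squared error.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
pred_shallow = tree.predict([[0.5, 1.0]])[0]
pred_deep = tree.predict([[2.5, 1.0]])[0]
```

The first split the algorithm finds is (almost surely) near the 1 m depth threshold, because that split gives by far the largest MSE reduction.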

Another method to avoid overfitting and generally improve the accuracy of decision/regression trees is bootstrap aggregating, also called bagging. The idea behind the method is to resample the data set multiple times and to build a new regression tree for each resampled data set. This results in an ensemble of regression trees. The resulting flood damage is then the average of the ensemble of regression trees. Resampling is done by building several data sets by randomly picking records from the original data set (each record is allowed to be used multiple times in the same data set). Every resampled data set therefore randomly leaves out a fraction of the observations and puts more weight on other observations because they are picked multiple times. Bagging regression trees also lead to probabilistic outcomes because the ensemble of trees can be seen as a probability distribution of the outcome.
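A sketch of bagging with Scikit-learn on synthetic data: each tree sees a bootstrap resample, the ensemble mean is the damage estimate, and the spread of per-tree predictions can be read as a distribution of the estimate.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=(500, 2))
y = 5_000 * np.sqrt(X[:, 0]) + rng.normal(0, 500, 500)

# Each of the 50 trees is trained on a bootstrap resample of the records.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                       random_state=0).fit(X, y)

# The per-tree predictions form an ensemble; their mean is the estimate and
# their spread describes the uncertainty of the outcome.
per_tree = np.array([t.predict([[1.0, 1.0]])[0] for t in bag.estimators_])
estimate = per_tree.mean()
```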

A random forest is a more advanced variation of bagging regression trees. Apart from building multiple trees with resampled data sets it also randomly excludes a subset of variables at each decision split. This will result in an ensemble of regression trees each based on a different set of damage records and each leaving out a different number of variables at each decision split. For this paper the default settings of Scikit learn are applied. In our case this means that eight variables are left out at each decision split.
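The per-split variable subsampling can be sketched as follows (synthetic data): with `max_features=3` on 11 variables, eight variables are left out at each decision split, matching the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 3, size=(600, 11))   # 11 variables, as in Table 1
y = 8_000 * np.sqrt(X[:, 0]) + 1_000 * X[:, 1] + rng.normal(0, 500, 600)

# Only 3 of the 11 variables are candidates at each decision split,
# i.e. eight variables are left out per split.
rf = RandomForestRegressor(n_estimators=200, max_features=3,
                           random_state=0).fit(X, y)

# Two hypothetical buildings that differ only in water depth.
row_low = np.full((1, 11), 1.5)
row_low[0, 0] = 0.1                     # shallow water
row_high = np.full((1, 11), 1.5)
row_high[0, 0] = 2.9                    # deep water
```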

A Bayesian network is a type of probabilistic graphical model that represents a set of random variables and their conditional dependences in a directed acyclic graph (DAG) structure. Each variable in the network may be observed or represented as a prior probability distribution and dependences between variables are represented with edges representing joint probability distributions. The edges in a Bayesian network are directed, which means there is a direction in which the influence of one variable flows to the other. From this network, inference can be used in order to utilize knowledge of one variable to make predictions about other variables.

Bayesian networks and probabilistic graphical models in general are used in many different fields, such as bioinformatics (e.g. Mourad et al., 2011), image processing (e.g. Sudderth and Freeman, 2008) and speech recognition (e.g. Bilmes, 2002). Recently, they have also been applied to flood damage modelling (Vogel et al., 2014; Schröter et al., 2014; Van Verseveld, 2014). Schröter et al. (2014) found that their performance is often better than that of the different types of tree methods. Furthermore, a Bayesian network can give its result as a probability distribution and does not require information about each variable in order to make predictions. If fewer variables are available, the Bayesian network handles this by adjusting the probability distribution of the outcome. This makes it ideal for transfer of models to other locations where fewer data are available than for the location where the model was originally based. Furthermore, it returns (for each object) probability distributions rather than deterministic values, which is valuable for assessing uncertainties within the damage model estimates.

A Bayesian network can be discrete, continuous or a combination. In this
paper fully discrete Bayesian networks are used, in which all variables are
discretized into bins. Given a network, the probability of a particular set of
discrete variable values can be calculated with the following formula, i.e.
as the product over all variables of the probability of each variable given
its parents in the graph:

P(x₁, …, xₙ) = ∏ᵢ P(xᵢ | parents(xᵢ))
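A toy sketch of this factorization with a three-node discrete network; the structure (depth and duration as parents of damage) and all probabilities are invented for illustration only.

```python
# Prior distributions of the parent variables.
p_depth = {"low": 0.7, "high": 0.3}
p_duration = {"short": 0.6, "long": 0.4}

# Conditional probability table: P(damage | depth, duration).
p_damage = {
    ("low", "short"): {"small": 0.9, "large": 0.1},
    ("low", "long"): {"small": 0.7, "large": 0.3},
    ("high", "short"): {"small": 0.4, "large": 0.6},
    ("high", "long"): {"small": 0.2, "large": 0.8},
}

def joint(depth, duration, damage):
    """P(depth, duration, damage) = P(depth) * P(duration)
    * P(damage | depth, duration): the product over all nodes of each
    variable's probability given its parents."""
    return (p_depth[depth] * p_duration[duration]
            * p_damage[(depth, duration)][damage])

# The joint distribution sums to one over all state combinations.
total = sum(joint(a, b, c) for a in p_depth for b in p_duration
            for c in ("small", "large"))
```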

A data-driven Bayesian network can derive all its conditional probability tables (CPTs) from the data and can even derive its graph structure from the data. For this paper, two Bayesian networks were constructed: a data-driven Bayesian network with both the graph structure and the CPTs derived from the data set, and an expert network in which the graph structure was defined in an expert session but the CPTs were derived from the data set. All calculations were done with a Python library called libpgm (Cabot, 2012). This library follows the methodology described in Koller and Friedman (2009).

The CPTs are learned with maximum likelihood estimation. This method estimates the (joint) probability distributions based on the number of observations. The discretization assumptions have an impact on the maximum likelihood estimation. If the variables are discretized into a large number of bins, more combinations of states become possible. The number of combinations grows exponentially with the number of bins of the parent variables. Too fine a discretization therefore quickly leads to more possible states than available data points, which results in poor performance of the maximum likelihood estimation. Koller and Friedman (2009) call this one of the key limiting factors in learning Bayesian networks from data. Too coarse a discretization, on the other hand, is also not desirable because it limits the precision of the Bayesian network. For this study a balance was found by trying several discretization resolutions until the best result was found based on the MAE criterion.

Discretization was achieved by splitting the data into bins with an equal number of data points in each bin. This works better than making equal-sized bins because of the large extremes, especially in the damage data. Equal-sized bins would either require increasing the number of bins, which is detrimental to the maximum likelihood estimation (leaving bins that contain no observations), or the bins would be so large that a majority of the data points end up in the same bin, which would limit the Bayesian network performance. The number of bins per variable was chosen based on the MAE on a test set, by varying the discretization of the most important variables until the smallest error was found. For the Bayesian network with the data-driven structure the number of bins chosen was slightly larger because that network is less complex than the expert network.
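The equal-frequency discretization can be sketched with pandas on synthetic heavy-tailed damage values (the distribution and bin count are illustrative assumptions); equal-width binning is shown for contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Damage data with large extremes, as described above.
damage = rng.lognormal(mean=8.5, sigma=1.2, size=1000)

# Equal-frequency bins: each bin holds (roughly) the same number of records.
bins = pd.qcut(damage, q=5, labels=False)
counts = np.bincount(bins)

# Equal-width bins, for contrast, pile most records into the lowest bin.
wide = pd.cut(damage, bins=5, labels=False)
wide_counts = np.bincount(wide, minlength=5)
```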

The performance of the Bayesian network on the testing data can be sensitive to the discretization. There are two possible alternatives to the discretization method applied in this paper: an optimization algorithm could be applied to determine the optimal discretization, or a continuous Bayesian network could be used (Friedman and Goldszmidt, 1996). Apart from avoiding the discretization problem, the advantage of a continuous Bayesian network is that it would probably perform better in predicting extreme values; a disadvantage is that the Bayesian network is then restricted to specific families of parametric probability distributions (Friedman and Goldszmidt, 1996). An optimization algorithm for the discretization can minimize the error introduced by the discretization but does not solve the fundamental problem of having too few data points.

The data-driven structure is also learned with the libpgm Python library. This library uses a constraint-based approach to structure learning, as described in Koller and Friedman (2009). In a constraint-based approach the structure is learned by calculating dependences and conditional dependences between the variables. When two variables remain dependent regardless of what they are conditioned on, an edge (connection) is formed. The algorithm follows this procedure to create the entire network. The result is shown in Fig. 4a.

Correlation coefficients between the different variables. See Table 1 for a description of the abbreviations.

Bayesian network learned from data

As an alternative to the data-driven structure, a structure was also made in an expert meeting involving several Deltares flood damage and Bayesian network experts (see acknowledgements). In this meeting the network was constructed based on a combination of expert judgement and the correlations shown in Fig. 3. The experts focused mainly on edges that they considered relevant for estimating the flood damage. The result is shown in Fig. 4b.

The relationship between the total, structural and contents damage is known
and not probabilistic: total damage = structural damage + contents damage.

The advantage of an expert-based network is that experts focus on the connections that matter most rather than on all possible connections. Furthermore, experts can include connections that are not found in this data set but are expected to exist in theory or in an independent test set. The advantage of a learned network is that new and previously unknown relationships between variables can be discovered. It was anticipated that the Bayesian networks in this paper would not be very sensitive to overfitting during the CPT learning. Koller and Friedman (2009) mention overfitting in the maximum likelihood estimation of Bayesian networks only in relation to a discretization that is too fine, and offer no techniques to counter overfitting in the maximum likelihood estimation. This expectation was checked by evaluating the Bayesian network on its own training data: if overfitting were an issue, the model would perform much better in predicting its own data than in predicting new data. This is not the case (for the expert model); the MAE is even slightly worse when calculated on its own data (0.622), the correlation coefficient and
In order to investigate the value of more data it is interesting to study the contribution of the different variables to the prediction accuracy. This can be done with the bagging trees and random forest methods. The importance of a variable is calculated as the (normalized) total reduction of the mean square error achieved by that variable during the training of the model, which allows the relative importance of the variables to be compared. This importance can be calculated for each regression tree in the ensemble, and the Scikit-learn library computes an overall importance by averaging the feature importances over the trees. This was applied in this study for the bagging trees. The variable importance was calculated separately for the total damage, the structural damage and the contents damage. For this calculation the data set in which the average per postal code is used for the new variables was applied. The water depth rank was not used because it could transfer some of the importance of the original water depth value to the new variables.
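The importance calculation described above can be sketched with Scikit-learn on synthetic stand-in data (the variables and coefficients below are invented for illustration and are not taken from the Meuse data set):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: damage driven mostly by the first column
# ("water depth"), weakly by the second, not at all by the third.
rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(400, 3))
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 400)

# Random forest exposes the normalized MSE-reduction importances directly.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rf_importance = rf.feature_importances_

# For bagged trees, average the per-tree importances over the ensemble.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                       random_state=0).fit(X, y)
bag_importance = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0)
```

Because the importances are normalized to sum to one, they can be compared between variables, as done in Fig. 5.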

Another way to study variable importance is the LASSO technique in multi-variable linear regression. LASSO can shrink the coefficients of unimportant variables exactly to zero; a variable whose coefficient is dropped to zero is thus identified as less important.
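A minimal LASSO sketch on synthetic stand-in data (the alpha value and the coefficients below are illustrative assumptions, not the values used in this study):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# Only the first two columns actually influence the response.
y = 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.5, 300)

# Standardize so the L1 penalty treats all coefficients comparably.
Xs = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.2).fit(Xs, y)
# The L1 penalty shrinks irrelevant coefficients exactly to zero.
coef = lasso.coef_
```

Standardizing the predictors matters here: without it, the size of a coefficient (and whether the penalty zeroes it) would depend on the units of the variable rather than on its importance.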

The different models were tested on a test set that was not used for training the models. Four indicators are used to rate the performance of the models: the MAE, the MBE, the Pearson correlation coefficient (r) and the coefficient of determination (R²).
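The four indicators can be computed directly, for example as follows (a straightforward sketch; the exact implementation used in the study is not shown in the paper):

```python
import numpy as np

def indicators(y_true, y_pred):
    """MAE, MBE (mean bias error), Pearson r and R^2 for scoring a
    damage model on a held-out test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_pred - y_true))       # average error size
    mbe = np.mean(y_pred - y_true)               # systematic over/under-estimation
    r = np.corrcoef(y_true, y_pred)[0, 1]        # linear association
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                     # explained variance
    return mae, mbe, r, r2
```

Note that MBE rewards models whose over- and underestimates cancel out over the event, which is why a model can score well on MBE while being weak on the other indicators.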

Results of the different models for four indicators: MAE, MBE, the correlation coefficient and the coefficient of determination.

The best-performing model based on the MAE indicator with different numbers of variables.

Variable importance: the contribution of different variables in reducing the error in the bagging regression trees (the chart follows the order of the legend).

Table 2 shows that, when the models can use all data, random forest and bagging regression trees perform best, and equally well. These two methods reduce the MAE by 12 % compared to a reference model using the same data (multi-variable linear regression). Bagging regression trees and random forest perform significantly better than normal regression trees, as was also noted by Merz et al. (2013) for flood damage in Germany. Random forest and bagging regression trees also outperform the Bayesian networks, and even the normal regression tree works better than the Bayesian networks. This contradicts the findings of Schröter et al. (2014) that in most cases Bayesian networks outperform regression trees; Schröter et al. (2014), however, used a very different data set from the one applied in this study.

Many explanations are possible for the relatively poor performance of the Bayesian networks. The discretization of the data is a possible problem: some trends could be too subtle to be captured by the rough discretization, but not enough data points are available for a more precise discretization. Perhaps there is still some room for improving the discretization, for example by applying an optimization algorithm that picks bin definitions such that the available information is used optimally (Vogel et al., 2012, applied such an algorithm). Another possible reason is that Bayesian networks might be more sensitive to low-quality data in combination with a small data set. Some of the CPTs applied in the Bayesian networks here are large, and conditional probabilities are based on a relatively small number of observations; a few wrong observations may then have a relatively large impact on the damage prediction.

In the data-driven network the variable of interest (total damage) is, in our test, only influenced by the water depth. This is because the water depth relative to the ground floor is known while the contents damage is not: the known water depth blocks the influence of all other variables, and the unknown contents damage has no influence because it is a target variable. In our test the data-driven Bayesian network is therefore, in practice, dependent only on the water depth. Hence, the structure learning ignores the other variables when the water depth relative to the ground floor is available. This is probably because the structure learning algorithm treats all variables as equally important and therefore only draws the most important edges (connections) towards the total damage. Other methods for structure learning (e.g. as described by Riggelsen, 2008) might give better results.

The multi-variable linear regression reference model performs well on the MBE but is clearly weaker on the other performance indicators. This shows that for predicting aggregate damage, for instance in policy studies, the more complex methods are less beneficial; this is different in cases where the damage to individual buildings matters, for instance for insurance rating purposes. The reference root function has a very large bias compared to the other models, probably because the shape of the root function is inappropriate for this flood event.

The models were trained with different numbers of variables to test whether the additional data are valuable. As expected, the best-performing model with a high number of variables always performs significantly better than the best-performing model with fewer variables (see Table 3). The additional data therefore seem to add value to the damage prediction despite the possible quality issues in these data. The MAE of the best-performing model using only the water depth (regression tree) can be reduced by a further 14 % by the best model using all data (random forest). The MAE of the root function fitted to the data (representing common practice) can be reduced by about 20 % using the random forest with all data.

The method of joining the extra data with the original data based on the water depth rank is not effective; simply taking the average value per postal code works better. The water depth rank probably sometimes assigns extreme variable values to the wrong objects, which disturbs some correlations in the data.

The total importance of the variables that were added in this study is about 30 % (Fig. 5), meaning that 30 % of the error reduction during the training of the bagging tree model originates from variables that were added to the data set. The added variables therefore clearly help to improve the prediction accuracy. This assessment was done without the water depth ranking join, because that join could assign some of the importance of the original water depth to the modelled water depth. The original water depth is by far the most important variable. Construction year is an important variable for the structural damage but not for the contents damage, as expected. Household size is quite important for the structural damage but insignificant for the contents damage. This is less obvious, but it could be that large families live on average in larger houses without having much more valuable contents on the ground floor. Return period is an important variable for both the structural and the contents damage. This was also expected, because people in areas that flood more frequently are expected to have more flood experience, resulting in better preparedness and therefore less damage. This effect is visible in the data, with the return period having an importance of about 10 %.

For the best-fitting multi-variable linear regression model (LASSO alpha

Box plots of the Meuse flood of 1993 per water depth class. The box shows the 25–75 % interval and the lines show the 5–95 % interval. The line in the middle of the box shows the median value. The labels on top of the plots show the number of observations per water depth class.

Additional data improve flood damage modelling, as evaluated on a test set, even if these data come from a collection of different sources and are of limited quality (error prone). The supervised learning algorithm is also important: given the same data, there are large differences between the algorithms. Random forests and bagging regression trees perform significantly better than normal regression trees and multi-variable linear regression. The Bayesian networks perform poorly compared to any of the tree-based methods.

Our current approach does not show that the additional variables are beneficial for the Bayesian networks. However, because the tree-based methods benefit from the additional data, it is likely that in some cases Bayesian networks could also benefit. The poor performance of the Bayesian networks contradicts earlier studies (Schröter et al., 2014) and could be due to the discretization method, the quality of the expert network, the network learning algorithm or problems with data quantity or quality.

The test set that was applied in this paper for the validation of the models was randomly selected from the data and consistently applied among all models. The accuracy of the model performance indicators could perhaps be further improved through some form of cross-validation. The tuning of the different models could also become more accurate if cross-validation were used instead of validation on a single test set only; for example, the optimization of the stopping criteria for the tree-based models and of the alpha value in the LASSO method for the multi-variable linear regression could be improved in this way. The expectation is that this would cause minor improvements in the results but would not influence the conclusions of this paper.
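Such a cross-validated tuning of the stopping criteria could, for example, look as follows (a sketch with Scikit-learn on synthetic stand-in data; the parameter grid and fold count are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(300, 2))
y = 10 * X[:, 0] + rng.normal(0, 1, 300)

# 5-fold cross-validation tunes the tree's stopping criteria on the
# training data instead of relying on a single fixed test set.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8],
                "min_samples_leaf": [5, 10, 20]},
    scoring="neg_mean_absolute_error",
    cv=5,
).fit(X, y)
best = search.best_params_
```

The MAE used to select the parameters is then averaged over the folds, which gives a more stable estimate than a single train/test split.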

This paper did not address another benefit of Bayesian networks, random forests and bagging trees: the incorporation of uncertainty. Bayesian networks do this explicitly within the method, while for bagging trees and random forests each tree can be seen as a possible damage estimate, so that together the trees represent a probability distribution.
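This per-tree interpretation can be sketched as follows (on synthetic stand-in data; the percentile interval is an illustrative choice):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=(300, 2))
y = 10 * X[:, 0] + rng.normal(0, 1, 300)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
x_new = np.array([[1.5, 1.0]])

# One damage estimate per tree; together the trees sketch a predictive
# distribution around the usual ensemble-mean prediction.
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])
mean_pred = forest.predict(x_new)[0]
lower, upper = np.percentile(per_tree, [5, 95])
```

The ensemble prediction is simply the mean of the per-tree estimates, so extracting the individual trees adds the spread information at no extra training cost.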

The methods applied in this paper provide an uncertainty estimate for a single object. For policy decision making it is often useful to aggregate these uncertainty estimates into a total uncertainty for the entire flood event. This can be done under the assumption that all objects are perfectly correlated with each other (one tree applies to the entire event, but which tree is uncertain) or under the assumption that all objects are independent of each other (each object has a different tree, but which tree is uncertain). Neither assumption is, however, completely correct (Wagenaar et al., 2016). The Bayesian network framework might offer a middle way to model this correctly: if each object has a copy of the original Bayesian network, and these Bayesian networks are linked together based on the locations of the objects, it can be taken into account explicitly that nearby objects are more likely to have similar damages. This could be an argument for preferring Bayesian networks over tree-based methods in the future.

The data set applied in this paper had many limitations. The most important limitation is that the exact locations of the objects are unknown, which made it difficult to link building and flood characteristics to the damage records. An attempt to do this by using the water depth rank performed worse than just using the average variable values per postal code. Despite this limitation, the added data still produced significantly better damage estimates. Another problem with the data set is the unknown manual adjustment of an unknown share of the data (rental residential buildings) in the structural damage records. These adjustments may have introduced a relationship between structural damage and some of the originally recorded variables that was not there in reality, which could in theory cause a slight overestimation of the prediction performance of the models on the test set. This effect is, however, expected to be small because most of the prediction improvements came from adding variables that were not available for the manual adjustments in 1994.

This study applied absolute damages rather than relative damages. This requires the supervised learning algorithms to implicitly predict information about the values at risk in addition to the vulnerability. The algorithms can do this with variables such as living area, footprint area, construction year and household size. This seems less error prone than estimating the values at risk with general rules of thumb based on assumptions about construction costs and content values, as such assumptions could introduce extra errors; therefore absolute damages were used in this study.

This paper trained flood damage models on just a single flood event. Ideally, training data should consist of multiple events so that the spectrum of possible damages on which the model is trained is larger. This is important especially for the transfer to other areas. Models that are trained on a single event can overfit on this event, and this problem would not show up if the model were tested with data from that same event (even if these specific data were not used for training the model). A good example is the good performance of the regression tree based on only the water depth compared with the fitted root function based on only the water depth. The root shape of a damage function, which many expert models use (see Sect. 2.2.1) and which makes physical sense, performs much worse than a more flexible model that can adjust to other relationships between damage and water depth. This is explained by Fig. 6, which shows a downward-sloping damage function beyond 90 cm of water depth, a shape very different from damage functions normally found in the literature. The root function model therefore starts producing large errors beyond 90 cm, while the regression tree can capture this trend well. This downward slope makes no sense physically but could be explained by other variables such as the return period. Return period could be a proxy for flood experience and better preparation, because houses that experienced large flood depths in 1993 are probably on lower ground and also experience floods more often in general. This relationship is probably not true for other types of events, for example large flood depths due to dike breaches. In that sense, the regression tree is overfitting on this single flood event.

Supervised learning techniques can help to create and improve flood damage models. They have many theoretical advantages over deterministic damage functions based only on water depth. The application of supervised learning in flood damage modelling remains challenging in practice because of limited data availability. In this paper we utilized data sources different from those in previous studies to acquire these data and showed that the methods, especially the tree-based ones, are also beneficial on this data set. Future work may merge available data sets from different events and different countries in order to develop a model that can be applied using several hazard variables and that also works outside the areas for which flood damage data are available.

The dataset “Observed flood damages from the 1993 Meuse flood in the Netherlands with added flood and building characteristics” is given in the supplement below.

The authors declare that they have no conflict of interest.

This article is part of the special issue “Damage of natural hazards: assessment and mitigation”. It is a result of the EGU General Assembly 2016, Vienna, Austria, 17–22 April 2016.

We thank our colleague Kathryn Roscoe for advice on the Bayesian networks and our colleagues Karin de Bruijn and Marcel van der Doef for their input in constructing the expert Bayesian network. We would also like to thank the
editor (Thomas Thaler) and the reviewers (one anonymous reviewer and Sven Fuchs) – their detailed constructive comments and suggestions helped to
substantially improve this paper. This research has received funding from
the European Union's Horizon 2020 research and innovation programme under
grant agreement number 641811 (Improving predictions and management of
hydrological extremes – IMPREX); see also