Quantification of Continuous Flood Hazard using Random Forest Classification and Flood Insurance Claims at Large Spatial Scales: A Pilot Study in Southeast Texas

Pre-disaster planning and mitigation necessitates detailed spatial information about flood hazards and their associated risks. In the U.S., the FEMA Special Flood Hazard Area (SFHA) provides important information about areas subject to flooding during the 1% riverine or coastal event. The binary nature of flood hazard maps obscures the distribution of property risk inside of the SFHA and the residual risk outside of the SFHA, which can undermine mitigation efforts. Machine-learning techniques provide an alternative approach to estimating flood hazards across large spatial scales at low computational expense. We present a pilot study for the Texas Gulf Coast Region using Random Forest Classification to predict flood probability across a 30,523 km² area. Using a record of National Flood Insurance Program (NFIP) claims dating back to 1976 and high-resolution geospatial data, we generate a continuous flood hazard map for twelve USGS HUC-8 watersheds. Results indicate that the Random Forest model predicts flooding with a high sensitivity (AUC 0.895), especially compared to the existing FEMA regulatory floodplain. Our model identifies 649,000 structures with at least a 1% annual chance of flooding, roughly three times more than are currently identified by FEMA as flood prone.


Introduction
In the US, pluvial and fluvial flood events are some of the most damaging environmental hazards, averaging $3.7 billion annually and totaling over $1.5 trillion in losses since 1980 (National Centers for Environmental Information, NCEI). This trend represents an increase of about 15% in flood losses per year since 2002, despite large-scale efforts to mitigate losses over the same period (Kousky and Shabman, 2017). To offset the rising costs associated with extreme flood events, pre-disaster planning and mitigation necessitates detailed spatial information about flood hazards and their associated risks.
In the US, the Federal Emergency Management Agency (FEMA) Special Flood Hazard Area (SFHA) -the area of inundation associated with a 1% annual exceedance probability -provides a basis for community and household planning and mitigation decisions (Blessing et al., 2017). These maps, intended to be used to set flood insurance rates, have become the de facto indicator of flood risk nationwide and are the primary reference point for a vast array of decisions related to flood risk. The maps are also costly to produce, with an estimated $107 million to $480 million (USD) per year required to maintain them (ASFPM, 2020).
Machine learning (ML) methods provide a potential alternative for estimating flood hazards based on historical records of flood loss, especially in resource-limited areas. One ML algorithm that has been shown to be particularly effective for flood hazard mapping is the random forest, which has recently been used to create a spatially complete floodplain map of the conterminous United States (Woznicki et al., 2019) and to predict flood insurance claims at US Census tract and parcel levels (Knighton et al., 2020). Although initial work has shown that random forests improve the prediction of flood hazard, no studies have used historic records of structural flood damage to estimate a probabilistic floodplain, and many previous studies have used sampling procedures that can undermine model reliability. There has also been little to no effort to compare random forest predictions against existing regulatory floodplains.
We address these gaps by introducing a novel method to map flood hazards continuously across large spatial scales using a random forest classification procedure trained on 40 years of historic flood damage records from the National Flood Insurance Program (NFIP). Using the NFIP data and high-resolution geospatial data (e.g., topographic, land use/land cover, and soils data), we generate flood hazard maps for a large coastal area in the Texas Gulf Coast Region. We then compare our modeled outputs against the FEMA floodplains using multiple metrics at regional and community scales. The following sections provide further background on machine learning algorithms and their application to hazards research (Section 2), describe the methods and data used for our analysis (Section 3), and present the model results (Section 4). This is followed by a discussion (Section 5) and conclusions (Section 6).
Background

Data-driven models -those that use statistical or machine learning algorithms for empirical estimation -are prevalent in water resources research (Solomatine and Ostfeld, 2008) and are rapidly gaining popularity for the prediction and estimation of flooding in the hydrological sciences (Mosavi et al., 2018). For instance, data-driven models have been used to predict stream discharge rates (Albers and Déry, 2015) and to estimate insured flood damage at the household (Wagenaar et al., 2017) or aggregated (Brody et al., 2009) level. Data-driven approaches have also been used to identify flood probability during a flood event, i.e., flood hazard (Mobley et al., 2019). When data in an area are sparse, these models can help better describe the system (Rahmati and Pourghasemi, 2017). Often, however, these approaches require large datasets for an accurate representation (Solomatine and Ostfeld, 2008). To address this data problem, many data-driven models rely on non-traditional data sources; for example, flood waters can be identified at a given location using video frames (Moy de Vitry et al., 2019).

Flood hazard models use a dichotomous dependent variable and driver layers to predict the likelihood of flooding across the landscape.
This dichotomous variable can come from a variety of datasets, such as high water marks or flood losses (Knighton et al., 2020). Depending on data availability, models have predicted flood hazard across time (Darabi et al., 2019) or for specific events (Lee et al., 2017). These models produce one of two outputs: a probability of flooding (Darabi et al., 2019) or classes of susceptibility (Darabi et al., 2019; Dodangeh et al., 2020). Both outputs give more information than the current SFHA dichotomy of in or out of the flood zone. Through this process, researchers are able to map large geographic areas (Hosseini et al., 2019), with the potential to scale further given sufficient data availability. This approach has been shown to estimate flooding quickly (Mobley et al., 2019) and accurately (Bui et al., 2019).
The amount of data available and how it is structured affect the techniques available for predicting flood hazard. Flood hazard models require point locations where flooding has occurred. If non-flooded locations are unavailable, pseudo-absences can be randomly generated for modelling (Barbet-Massin et al., 2012). Common algorithms used for predicting flood hazard include neural networks (Janizadeh et al., 2019a; Bui et al., 2019), Support Vector Machines (SVM) (Tehrany et al., 2019a), and decision trees, such as Random Forests (Woznicki et al., 2019; Muñoz et al., 2018). While other algorithms have been used for predicting flood hazard (Bui et al., 2019), these three are often used in machine learning due to their maturity in research and their generalizability. SVMs are well suited to small sample sizes, but computation times grow quadratically as sample sizes increase (Li et al., 2009). This computational complexity undermines scalability, a primary benefit of data-driven models over physical models. Neural networks are often cited as highly accurate (Mosavi et al., 2018) but often come with a reproducibility problem (Hutson, 2018).
Finally, computational requirements for decision trees are lower than those of the other two algorithms. The Random Forest algorithm comes from the decision tree family of models and is highly generalizable within the parameters of its drivers. Decision trees are non-parametric and use logic to branch at different values within the independent variables to best fit the classification (Quinlan, 1986). By creating numerous trees and democratizing the decision, ensemble classifiers reduce overfitting of the final model while maintaining accuracy (Breiman, 2001). Each tree is given a random subsample of the independent variables to predict the dependent variable. These ensemble classifiers are computationally efficient and maintain a high degree of sensitivity (Belgiu and Drăguţ, 2016). Random forests are capable of identifying interactions between independent variables and the dependent variable regardless of the effect size (Upstill-Goddard et al., 2013).
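As a minimal sketch of this ensemble idea (on synthetic data rather than the NFIP dataset used in this study), a single decision tree can be compared with a forest of trees that vote on the final class:

```python
# Toy illustration (synthetic data, not the study's NFIP dataset): many trees,
# each fit to a bootstrap sample with a random subset of predictors, vote on the
# final class, which typically generalizes better than any single deep tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```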

Study Area
We use the USGS Watershed Boundary Dataset to delineate a region encompassing thirteen Hydrologic Unit Code (HUC) 8-digit watersheds that drain to Galveston Bay and the intercoastal waterway, as well as the Lower Trinity and Brazos Rivers and the San Bernard River (Figure 1). The resulting study area encompasses 28,000 km² and includes two major population and economic centers: the Houston-The Woodlands-Sugar Land area, also known as Greater Houston, home to 4.8 million people, and the Beaumont-Port Arthur-Orange area, also known as the Golden Triangle, home to 0.4 million people (U.S. Census Bureau, 2019). Including rural areas, the total population of the study region is estimated to be around 8 million, accounting for around 25% of the population of Texas (U.S. Census Bureau, 2019). Frequency estimates of tropical cyclone landfall range from once every nine years in the eastern part of the study region to once every nineteen years in the southwest part of the region (NHC, 2015). Historical storm surge ranges as high as 6 m along the coast; notable surge events include the Galveston Hurricanes (1900, 1915), Hurricane Carla (1961), and Hurricane Ike (2008). Several previous studies have demonstrated that flood hazards are not well represented by the FEMA SFHA (Highfield et al., 2013; Brody et al., 2013; Blessing et al., 2017), suggesting that a large proportion of the population along the Texas coast is not only vulnerable to flooding but also that decision makers lack the information to properly account for it.

NFIP Claims
Between 1976 and 2017, there were 184,826 NFIP claims, each of which was geocoded to the parcel centroid. The NFIP flood loss dataset provides the location, total payout, and structural characteristics for each claim. Claims provide a reliable indication of the presence of flooding but fail to identify locations that have not flooded, making the dataset presence-only. Random Forest algorithms require a binary dependent variable identifying presence and absence locations. A pseudo-random sample of background values can be used as a proxy for locations where flooding is absent (Barbet-Massin et al., 2012); therefore, background sample locations were based on a random selection of structures. The Microsoft building footprints dataset was used due to its open availability and high accuracy (https://github.com/Microsoft/USBuildingFootprints). Structures and claims were matched using a one-to-one sample by watershed and year; this one-to-one matching reduces potential bias from an unbalanced dataset (Chen et al., 2004). Samples for which any independent variable had null values were removed, but the final ratio remained close to 50% claims, with a sample size of 367,480.
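A minimal sketch of this one-to-one pseudo-absence sampling, assuming hypothetical pandas DataFrames `claims` and `structures` with `huc8` and `year` columns (the study's actual field names and preprocessing steps are not shown here):

```python
import pandas as pd

def build_training_set(claims: pd.DataFrame, structures: pd.DataFrame,
                       seed: int = 0) -> pd.DataFrame:
    """For each (watershed, year) group of claims, draw an equal number of
    randomly selected structures as non-flooded (pseudo-absence) samples."""
    absences = []
    for (huc8, year), group in claims.groupby(["huc8", "year"]):
        pool = structures[structures["huc8"] == huc8]
        n = min(len(group), len(pool))
        absences.append(pool.sample(n=n, random_state=seed).assign(year=year, flooded=0))
    presences = claims.assign(flooded=1)
    # Keep only columns shared by both frames (the predictor drivers plus labels).
    combined = pd.concat([presences, *absences], join="inner", ignore_index=True)
    return combined.dropna()  # drop samples with null driver values
```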

To parameterize flood hazard, several contextual variables were considered as potential predictors of flooding across the study area (see Table 1). These variables can be divided into two main categories: (1) topographic: elevation and distance features which drive watershed response; and (2) hydrologic: overland and soil characteristics which govern infiltration and runoff. The variables were collected at different scales based on data availability; all variables were resampled to a 10-meter raster and snapped to the Height Above Nearest Drainage (HAND) dataset, as sketched below. The following subsections outline the variables, the reasoning behind their inclusion, and previous data-driven flood hazard studies that used them.
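A minimal sketch of the resampling and alignment step, using rasterio to warp a coarser driver raster onto the 10 m HAND grid (file names are placeholders, and the study's exact preprocessing workflow may differ):

```python
import numpy as np
import rasterio
from rasterio.warp import Resampling, reproject

# Align a driver layer (e.g., 30 m imperviousness) to the 10 m HAND grid so that
# all predictor rasters share the same extent, resolution, and cell alignment.
with rasterio.open("hand_10m.tif") as ref, rasterio.open("impervious_30m.tif") as src:
    aligned = np.empty((ref.height, ref.width), dtype="float32")
    reproject(
        source=rasterio.band(src, 1),
        destination=aligned,
        dst_transform=ref.transform,   # snap to the HAND grid
        dst_crs=ref.crs,
        resampling=Resampling.bilinear,
    )
```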

Hydrologic
Hydrologic variables explain how stormwater moves across the landscape and, therefore, can help differentiate between low and high flood potential areas. The amount of stormwater that a given area can receive is a function of its flow accumulation potential, which is primarily mediated by three factors: the soil's ability to absorb water (i.e., saturated hydraulic conductivity), roughness (i.e., Manning's n), and imperviousness. Flow accumulation measures the total upstream area that flows into each raster cell based on a flow direction network as determined by the NED (Jenson and Domingue, 1988). Areas with high flow accumulation are more susceptible to receiving larger amounts of stormwater during a given rainfall event. Soil infiltration influences the rate and amount of stormwater that can be absorbed into the ground. When stormwater cannot move into the ground easily, it may result in additional runoff, particularly in urbanized and downstream areas. Two measures of soil infiltration are lithology and saturated hydraulic conductivity (Ksat), both of which have been shown to be strong predictors of flood susceptibility (Brody et al., 2015; Janizadeh et al., 2019b; Hosseini et al., 2019; Mobley et al., 2019). For this study, Ksat values were assigned to soil classes obtained from the Natural Resources Conservation Service's (NRCS) Soil Survey Geographic Database (SSURGO) using the values presented in Rawls et al. (1983), and then averaged across the upstream contributing area for each cell.

Imperviousness is another strong indicator of an area's ability to infiltrate water. Previous studies have found that increasing impervious surfaces as a result of urbanization reduces infiltration and causes increased surface runoff and larger peak discharges, making it an important aspect in determining flood frequency and severity (Anderson, 1970; Hall, 1984; Arnold Jr and Gibbons, 1996; White and Greer, 2006). Imperviousness has previously been shown to be highly important in the Houston region, where urban sprawl has greatly increased impervious cover and contributed to higher volumes of overland runoff (Brody et al., 2015; Gori et al., 2019; Sebastian et al., 2019). For this study, percent imperviousness was measured using the percent developed impervious surface raster from the National Land Cover Database (NLCD) for the years 2001, 2006, 2011, and 2016. Values range from 0-100% and represent the proportion of urban impervious surface within each 30-m cell.

Roughness influences the speed at which stormwater can move across the landscape as well as the magnitude of peak flows in channels (Acrement and Schneider, 1984). Previous engineering studies have corroborated the relationship between roughness and overland water flow using Manning's roughness coefficient (Anderson et al., 2006; Thomas and Nisbet, 2007). This coefficient, called Manning's n, has become a critical input in many hydrological models and has also been shown to be a good predictor of event-based flood susceptibility (Mobley et al., 2019). For this study, roughness values were assigned to each NLCD land cover class using the values suggested by Kalyanapu et al. (2010) and, like Ksat, were averaged across the contributing upstream area for each raster cell for the years 2001, 2004, 2006, 2008, 2011, 2013, and 2016.
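A minimal sketch of this upstream averaging, assuming that property-weighted and cell-count flow-accumulation rasters have already been produced with a terrain-analysis tool (file names are placeholders):

```python
import numpy as np
import rasterio

# The upstream mean of a property (Ksat or Manning's n) is the ratio of a
# property-weighted flow accumulation to the plain cell-count flow accumulation.
# Both rasters are assumed to share the 10 m grid alignment described above.
with rasterio.open("facc_weighted_ksat.tif") as w, rasterio.open("facc_cells.tif") as c:
    weighted_sum = w.read(1).astype("float64")
    cell_count = c.read(1).astype("float64")

upstream_mean_ksat = np.divide(
    weighted_sum, cell_count,
    out=np.zeros_like(weighted_sum),
    where=cell_count > 0,  # avoid division by zero outside the flow network
)
```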
Topographic

Elevation and slope were calculated using the National Elevation Dataset (NED), which was provided as a seamless raster product via the LANDFIRE website at a 30-m resolution (LandFire, 2010).
Three continuous proximity rasters were used in this study: distance to stream, distance to coast, and height above nearest drainage (HAND). Proximity to streams and the coastline has been shown to be a significant indicator of flood damage (Brody et al.), whereas proximity to the coast has been used less commonly as a predictor (Mobley et al., 2019). Both distance to stream and distance to coast were calculated based on the National Hydrography Dataset (NHD) stream and coastline features.
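One way such proximity rasters could be derived, as a sketch, is a Euclidean distance transform of a rasterized NHD stream (or coastline) mask; the study's exact GIS procedure is not specified here:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# stream_mask: boolean grid, True where a rasterized NHD stream cell is present.
# distance_transform_edt measures distance to the nearest zero cell, so the mask
# is inverted; `sampling` converts cell offsets to metres for a 10 m grid.
stream_mask = np.zeros((500, 500), dtype=bool)
stream_mask[250, :] = True  # toy example: one straight stream across the grid

distance_to_stream = distance_transform_edt(~stream_mask, sampling=(10.0, 10.0))
```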

HAND is calculated by defining the height of a location above the nearest stream to which the drainage from that land surface flows (Garousi-Nejad et al., 2019). Areas with high HAND values are more buffered from flooding because increasingly large volumes of stormwater over short durations are required to create peak flows that reach those locations. This measure has been used to calculate flood depths, the probability of insured losses from floods (Rodda, 2005), soil water potential (Nobre et al., 2011), groundwater potential (Rahmati and Pourghasemi, 2017), and flood potential (Nobre et al., 2011). HAND was downloaded from the University of Texas' National Flood Interoperability Experiment (NFIE) continental flood inundation mapping system (Liu et al., 2016) at a 10-m resolution.
Topographic wetness index (TWI) (Beven and Kirkby, 1979) is a popular measure of the spatial distribution of wetness conditions and is frequently used to identify wetlands. TWI quantifies the effects of topography on hydrologic processes and is highly correlated with groundwater depth and soil moisture (Sörensen et al., 2006). This measure has been found to be an influential and, in some cases, a significant predictor for estimating flood hazard (Lee et al., 2017b; Tehrany et al., 2019a). TWI is defined as

TWI = ln(A / tan(S0)),

where A is the contributing area (or flow accumulation) and S0 is the average slope over the contributing area. High values of TWI are associated with concave, low-gradient areas where water often accumulates and pools, making them more vulnerable to flooding.

Random Forest Model

Random Forest models categorize samples based on the highest predicted probability for each class. We developed a random forest algorithm at the parcel level using the variables in Table 1. This year-by-year assessment (based on the claim and structure samples matched by watershed and year) helps to identify the limitations of this flood hazard approach. All random forest computations were performed with the scikit-learn package (Pedregosa et al., 2011) in Python version 3.8.
Creating a properly calibrated Random Forest requires tuning two parameters to minimize error and selecting variables to improve generalizability. The two parameters tuned were the number of trees and the maximum tree depth. The number of trees controls how large a forest is used for the predictive model; increasing the number of trees reduces error rates and increases the attributes used in the decision (Liaw et al., 2002), but comes at the cost of increased computation time. Tree depth controls the maximum number of decisions that can be made, but too deep a tree increases the chance of the model overfitting the data and reduces generalizability. The final model used 200 trees and a maximum tree depth of 90, after optimizing error rates using the k-fold analysis. Variable reduction reduces the complexity of the model and decreases the likelihood of overfitting, while speeding up the final training and raster predictions. The out-of-bag (OOB) error score isolates a subset of the training dataset to measure error rates (Breiman, 1996) and to generate feature importances.
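As a sketch of how these two parameters could be tuned with k-fold cross-validation in scikit-learn (the search grid shown is illustrative, not the grid used in the study):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the study settled on 200 trees and a maximum depth of 90.
param_grid = {"n_estimators": [100, 200, 400], "max_depth": [30, 60, 90, None]}

search = GridSearchCV(
    RandomForestClassifier(oob_score=True, random_state=0, n_jobs=-1),
    param_grid,
    cv=5,                # k-fold cross-validation
    scoring="roc_auc",   # area under the ROC curve, as reported in the paper
)
# search.fit(X_train, y_train)   # X_train, y_train: balanced claim/structure samples
# print(search.best_params_, search.best_score_)
```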

Initially, all variables were added to the model; those with the lowest contribution to feature importance were then removed from the final model. A figure representing feature importance can be found in Appendix A.
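A minimal sketch of this feature-importance-based variable reduction (variable names and the number of variables dropped are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def drop_least_important(X: pd.DataFrame, y, n_drop: int = 2, seed: int = 0) -> pd.DataFrame:
    """Fit a forest on all candidate drivers, rank them by impurity-based feature
    importance, and drop the n_drop least informative columns."""
    forest = RandomForestClassifier(n_estimators=200, max_depth=90,
                                    oob_score=True, random_state=seed, n_jobs=-1)
    forest.fit(X, y)
    ranking = pd.Series(forest.feature_importances_, index=X.columns).sort_values()
    print("OOB score:", forest.oob_score_)
    print(ranking)  # lowest-importance drivers listed first
    return X.drop(columns=ranking.index[:n_drop])
```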
A series of metrics was used to assess whether the model is properly calibrated. Average accuracy and sensitivity were measured for each iteration of the k-fold analysis. Accuracy measures the percent of correctly identified flooded and non-flooded samples. Sensitivity estimates the probability that a flooded sample will be predicted with a higher likelihood of flooding than a non-flooded sample (Metz, 1978); a model with no discriminatory power would fall along the diagonal of the receiver operating characteristic (ROC) plot. Random forest outputs can be expressed either as a classification or as a probability; in this study, the final output was the probability of flooding.
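A short sketch of how discrimination (AUC) and calibration could be computed with scikit-learn on held-out samples (variable names are placeholders):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

def evaluate(y_true: np.ndarray, p_flood: np.ndarray):
    """Discrimination and calibration for predicted flood probabilities.
    y_true: 1 for claim (flooded) samples, 0 for sampled structures.
    p_flood: RandomForestClassifier.predict_proba(X)[:, 1] on held-out data."""
    auc = roc_auc_score(y_true, p_flood)  # 0.5 = no skill, 1.0 = perfect
    frac_flooded, mean_predicted = calibration_curve(y_true, p_flood, n_bins=10)
    # A well-calibrated model plots (mean_predicted, frac_flooded) near the 1:1 line.
    return auc, mean_predicted, frac_flooded
```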
As an additional validity check, we compared the RF prediction against the FEMA SFHAs by examining the amount of area and the number of structures exposed to specific annual exceedance probabilities. The RF model predicts the probability that a location flooded over the 42 years of the study. To convert the RF probabilities to annual exceedance probabilities (AEP), we used the following equation:

AEP = 1 - (1 - P)^(1/42),

where P is the RF-predicted probability of flooding over the 42-year record.
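A minimal sketch of this conversion (assuming independent years, which is the standard annualization of a multi-year probability):

```python
import numpy as np

def annual_exceedance_probability(p_record: np.ndarray, n_years: int = 42) -> np.ndarray:
    """Convert the probability of flooding at least once during an n-year record
    into an annual exceedance probability, assuming independent years:
    AEP = 1 - (1 - P) ** (1 / n_years)."""
    p_record = np.clip(p_record, 0.0, 1.0 - 1e-9)  # keep the root well defined
    return 1.0 - np.power(1.0 - p_record, 1.0 / n_years)

# Example: roughly a 34% chance of flooding over 42 years corresponds to ~1% per year.
# annual_exceedance_probability(np.array([0.344]))  # -> ~0.01
```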
This procedure allows for a more direct comparison between the FEMA SFHA and the modeled flood hazard.

The calibration plot (Figure 3) suggests that the model slightly underpredicts flooded points at lower probabilities and slightly overpredicts flooded points at higher probabilities. The underprediction is explained by the histogram below the plot: most non-flooded structures have a probability below 30%, while the largest proportion of flood loss points occurs above 90%.

Comparison with the FEMA SFHA
Flood hazard probabilities were converted to annual exceedance probabilities (Figure 4), which allowed us to compare the amount of area and the number of structures exposed to specific annual exceedance probabilities with the FEMA SFHAs. Note, as a reminder, that the 100-year floodplain corresponds to a 1% annual chance of flooding and the 500-year floodplain to a 0.2% annual chance.
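As a sketch of how such exposure counts could be tallied, the AEP raster can be sampled at building-footprint centroids and thresholded at the 1% and 0.2% levels (file and variable names are placeholders):

```python
import numpy as np
import rasterio

def count_exposed(aep_raster_path: str, centroids, threshold: float = 0.01) -> int:
    """Count structures whose centroid falls on a cell with AEP >= threshold.
    `centroids` is an iterable of (x, y) coordinates in the raster's CRS."""
    with rasterio.open(aep_raster_path) as src:
        aep_at_structures = np.array([value[0] for value in src.sample(centroids)])
    return int(np.sum(aep_at_structures >= threshold))

# n_100yr = count_exposed("aep_10m.tif", centroids, threshold=0.01)   # 1% AEP
# n_500yr = count_exposed("aep_10m.tif", centroids, threshold=0.002)  # 0.2% AEP
```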

Discussion
The results illustrate that flood hazards can be accurately estimated using a machine learning algorithm. The model is computationally efficient, scalable, and can be used to predict flood hazards over relatively large regions. Overall, the model creates an accurate representation of flood hazard for the study area and demonstrates strong discriminatory power. When compared against similar studies using machine learning approaches to predict flood hazards (Dodangeh et al., 2020; 2019a), our model demonstrates lower sensitivity. The differences are likely attributable to the high topographic relief of those study regions, where fluvial flood hazards predominate (Tehrany et al., 2019b) and likely contribute to the predictive capacity of the models. In contrast, Southeast Texas is characterized by little topographic relief, and flooding may emanate from pluvial, fluvial, and marine sources, making flood prediction more complex. In fact, when comparing model sensitivities across the study region, we find that model performance is higher in more steeply sloped inland areas than in flat coastal areas.
Another aspect impacting model performance was sample selection. A systematic approach that identifies areas that did not flood (Darabi et al., 2019) can be important to increasing model performance (Barbet-Massin et al., 2012). While the model is based on a comprehensive record of observed flood claims in the study area from 1976 to 2017, there is no comparable record of structures that have not flooded. One option would be to randomly sample areas that have no claims; however, this would not control for bias in the absence data and would come at the expense of model performance (Wisz and Guisan, 2009). To minimize this selection bias, we generated pseudo-absences by randomly selecting a sample of non-flooded structures by watershed and year (Phillips et al., 2009). Based on the calibration plot (shown in Figure 3), this is an appropriate assumption.
A novel outcome of this analysis is a set of statistically generated flood hazard maps that can be compared to FEMA's regulatory floodplains. That is, we used the predicted model likelihoods to generate 1% and 0.2% annual exceedance probability thresholds, or, equivalently, the 100-year and 500-year flood hazard areas. The statistically generated flood hazard areas differ from the regulatory floodplains in that they are 1) nearly three times as large and 2) capture areas that are hydrologically disconnected from streams and waterbodies. The findings suggest that this approach can better capture small-scale variability in flood hazard by implicitly detecting underlying drivers that manifest themselves through subtle changes in historic damage patterns and trends. This corroborates previous research findings that the current 100-year floodplain underestimates and fails to accurately represent flood risk, particularly in urban areas (Blessing et al., 2017; Galloway et al., 2018; Highfield et al., 2013).
It should be noted that the interpretation of the predicted random forest flood hazard areas differs from existing regulatory floodplains, in that they detect the return period of structurally damaging inundation. The differences between the statistically generated flood hazard areas and the regulatory floodplains are likely a result of the multiple advantages of a data-driven approach in identifying the conditions in the underlying drivers (Elith et al., 2011). In other words, a data-driven model can better capture the reality of flood hazard by using actual historic impacts while simultaneously identifying small-scale variations in flood exposure. By using historic losses, the random forest model accounts for extreme events that have occurred, but projecting future hazard is currently limited. The model does not incorporate precipitation patterns; this was by design, as we assumed that at this relatively small scale precipitation would be homogeneous across the region and add little insight. Even with precipitation added to the model, there would still be potential for error from underestimating the probability of future extreme events.

Conclusions
In this paper we demonstrate the efficacy of a random forest statistical model in spatially identifying flood hazards in Southeast Texas, encompassing the Houston metropolitan area. In comparison with the FEMA SFHA, we show that a statistically based flood hazard model can 1) better capture the reality of flood hazard by using actual historic impacts, 2) better capture small-scale variability in flood hazard by implicitly detecting underlying drivers that manifest themselves through subtle changes in historic damage patterns and trends, 3) avoid the uncertainty associated with estimating rainfall return periods, stormwater infrastructure characteristics, and flood depths, 4) easily include alternative drivers of flood hazard such as HAND and TWI, and 5) be quickly updated using recent insurance claim payouts.
Data availability. All independent drivers for the flood hazard model can be found at the Dataverse repository (https://dataverse.tdl.org/dataverse/M3FR). The flood hazard output can be found at the following URL: https://doi.org/10.18738/T8/FVJFSW (Mobley, 2020). Flood loss data cannot be publicly shared due to privacy concerns. The Python libraries used are as follows: scikit-learn (Pedregosa et al., 2011); RasterIO (Gillies et al., 2013-); NumPy (van der Walt et al., 2011).

Competing interests. The authors declare that they have no conflict of interest.