A global open-source database of flood-protection levees on river deltas (openDELvE)

Flood-protection levees have been built along rivers and coastlines globally. Current datasets, however, are generally confined to territorial boundaries (national datasets) and are not always easily accessible, posing limitations for hydrologic models and assessments of flood hazard. Here we present our work to develop a single, open-source global river delta levee data environment (openDELvE), which aims to bridge a data deficiency by collecting and standardising global flood-protection levee data for river deltas. In openDELvE we have aggregated data from national databases as well as data stored in reports, maps, and satellite imagery. The database identifies the river delta land areas that the levees have been designed to protect, and where additional data is available, we record the extent and design specifications of the levees themselves (e.g., levee height, crest width, construction material) in a harmonised format. openDELvE currently contains 5,089 km of levees on deltas, and 44,733.505 km² of leveed area in 1,601 polygons. For the 152 deltas included in openDELvE, on average 19% of their habitable land area is confined by verifiable flood-protection levees. Globally, we estimate that between 5% and 54% of all delta land is confined by flood-protection levees. The data is aligned to the recent standards of Findability, Accessibility, Interoperability and Reuse of scientific data (FAIR) and is open-source. openDELvE is

For models on larger scales, levees are too small to be included directly and are sometimes represented as a sub-grid feature or through a flood-attenuation proxy (Sampson et al., 2015). In both cases, poor data on levee existence and levee properties mean that their presence is often disregarded in global flood modelling (Trigg et al., 2016) and global delta modelling (Nienhuis et al., 2020). The lack of data on levees (which change and control water and sediment discharge) results in suboptimal modelling scenarios. One example is the WRI AQUEDUCT flood-risk tool (https://www.wri.org/applicaitons/aqueduct/floods/), which provides exceptional global-level data but does not include levees and therefore produces abstract scenarios for heavily leveed areas such as the Netherlands.
As an alternative to global levee data, FLOPROS (Scussolini et al., 2016) presents a global dataset on existing and policy-level flood protection standards. FLOPROS provides uniform, global coverage; however, individual feature-level data is omitted. Other approaches use (semi-)automated algorithms to locate and specify levees from LIDAR data (e.g. Steinfeld et al., 2013; Wing et al., 2019), but these are generally focussed on specific problem definitions and lack global applicability. A global levee database can help inform those algorithms and provide validation and calibration data.

Objective
The objective of openDELvE is to provide an attestable source of delta levee protection data, both for primary use in flood and hazard modelling and for secondary community use through increased data availability. To this end, we publish the data on a public website (http://www.opendelve.eu) following standard data types, together with a user-led amendment reporting function.

Overview
openDELvE is a collection of existing data on levees and protection features on deltas. We have collected data from vector, raster, and documentary sources. This results in two geospatial layers (one for leveed areas and one for levee lines) and a supporting index dataset, each linked to the respective delta by a unique identifier and cross-mapped to the river delta dataset of Edmonds et al. (2020). Our methods allow for replicable tracing, processing, assimilation, and display of the data. By storing references and data quality at the level of the individual feature, we aim to provide data that is open and transparent. Our work is underpinned by the principles of FAIR science to support reuse by producing data that is Findable, Accessible, Interoperable, and Reusable (Wilkinson et al., 2016). openDELvE development followed these steps: data definition (Sect. 2.2), data collection (Sect. 2.3), data processing (Sect. 2.4), data attribution (Sect. 2.5), data management (Sect. 2.6), and data assurance (Sect. 2.7).
https://doi.org/10.5194/nhess-2021-291 Preprint. Discussion started: 8 November 2021 © Author(s) 2021. CC BY 4.0 License.

Data definition
We followed our definition of levees from Sect. 1.1. Levees exist along coasts and rivers globally, but the scope of openDELvE is limited to river deltas (Sect. 2.4.1). We made use of a database of deltaic locations and deltaic area extent by Caldwell et al. (2019) and Edmonds et al. (2020). We further limited ourselves to storing information only on defences that are permanent features, not temporary or reactive measures. Temporary measures, such as sandbags and hoardings deployed for flash flooding or imminent but irregular flood issues, are not temporally constant, so they are usually not mapped and were not considered for inclusion in this database.
openDELvE is designed to represent levees as geospatially explicit vector data: lines and polygons. In source data that exists in reports, on maps, and in technical drawings, levee presence is often reduced to a raster map element, which needed to be sufficiently georeferenced and assessed for quality. Such material is nevertheless a valid data source and is included in our process.
We consider the age, source document, and data quality, as we recognise that data may be reworked and requoted a number of times in its lifespan.
openDELvE consists of three data elements: an index table and two vector layers (Table 1), each with a set of standardised attributes (Table 2). Levee data in openDELvE include a data quality class and a direct link to the source dataset. We devised the data quality criteria included in Table 3:

A (Excellent)
Vector data
First-order data source (i.e., scientific papers, governmental geospatial data, original publication)

Irrecoverable issues with data quality
Could not confirm existence of data from other sources using satellite imagery with resolution ≤ 25 m
Temporary or reactive measures only (e.g., sandbags)

Data collection
We conducted extensive literature searches using a variety of web searching platforms (i.e., Clarivate Web of Science, Google Search, Google Scholar, OCLC WorldCat) as well as data aggregation platforms (e.g., re3data.org, DataCite, data.gov.uk, data.gov, data.gov.au). The search process is documented at the delta level as a log with diary-style entries in the Delta Index table (see Table 1). Sources for each individual levee are stored at the feature level. This allowed us to record our rationale and decision-making process, so that both viewers and onward developers of the dataset are aware of the steps taken and the explanations for decisions taken in data handling.
With an international scope, searching often required country- or location-specific terms (e.g., 'tanggul', meaning levee or embankment in Indonesian) to aid data discovery. These were supplemented regionally, along with a vocabulary of common delta and levee terms, when using academic paper and internet indexing services.
Funding reports from World Bank projects on flood defence activities have also contributed to the database. Financing documents often contain maps, so we include data from the World Bank where it was discovered in our searches, released publicly, had been reviewed, and contained levee feature-level data.

When it was not possible to find data in areas where levees were expected, the place was identified by name using the address search (gazetteer) function in ArcGIS, and basic internet searching was then performed to find reports of damage related to floods or sea level rise. Finally, we made use of the world satellite imagery layer within ArcGIS to review areas where levee source data was inaccessible and to assess visually whether levees were likely to be present. We also used this imagery to verify areas that we believed to be uninhabited and classified them accordingly: where satellite imagery confirmed no visible levees, the delta was set to 'No result'. If levees were visible but we could not verify them with alternative data sources, we set the delta to 'Pending' where external enquiries were taking place, and the relevant note was entered in the Journal. We identify deltas as 'Not Processed' if we have yet to manually review available sources and no national vector dataset was discoverable for processing via our automated tool.

Many deltas in the delta dataset may be small and uninhabited, have inaccessible data, or have data that we were unable to convert into a format that we could add to the database. We collectively group these deltas as having "No result" in terms of data collection. Note that this does not always mean there is no data. For example, data from the Database nazionale della AgriNature in TErra (DANTE, formerly known as: ItaliaN LEvee Database [INLED]) (Barbetta et al., 2015) was not suitable for processing because it only contains a start and end point coordinate for each levee. We classify these deltas under "No result" because processing them would require access to a detailed regional-level watercourses database and a high-resolution DEM, so that an interpretational algorithm could be trained to infer the levee course.
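The status assignment described above can be sketched as a small decision function. This is only an illustrative reading of the text, not the project's implementation: the function name, the argument names, and the 'Processed' label are assumptions, and the sketch assumes that an external enquiry is opened whenever levees are visible but unverified.

```python
from enum import Enum


class DeltaStatus(Enum):
    PROCESSED = "Processed"          # hypothetical label, not named in the text
    NO_RESULT = "No result"
    PENDING = "Pending"
    NOT_PROCESSED = "Not Processed"


def classify_delta(manually_reviewed, has_vector_dataset,
                   levee_data_verified, levees_visible):
    """Assign a delta status from four yes/no observations (all hypothetical names)."""
    # Nothing reviewed by hand and nothing available for the automated tool.
    if not manually_reviewed and not has_vector_dataset:
        return DeltaStatus.NOT_PROCESSED
    # Levee data found and confirmed from a usable source.
    if levee_data_verified:
        return DeltaStatus.PROCESSED
    # Levees seen in imagery but not confirmed by another source
    # (assumes an external enquiry is under way).
    if levees_visible:
        return DeltaStatus.PENDING
    # No usable data and no visible levees.
    return DeltaStatus.NO_RESULT


status = classify_delta(manually_reviewed=True, has_vector_dataset=False,
                        levee_data_verified=False, levees_visible=True)
print(status.value)
```

Keeping the rule in one function makes the order of the checks explicit, which matters because a delta can satisfy more than one condition.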
Where available, we include levee attributes (e.g., design storm, wall height, levee material, Table 2). This can inform modelling, so that the database can work as a stand-alone spatial tool for investigating river delta dynamics. Additionally, the data layers can be used to verify deductive models that detect levees by other means, including LIDAR and remotely sensed data, and to corroborate other data sources, such as OpenStreetMap. As we intend for the database to be globally comparable, we set up a cross-matching list (see Supplement) so that attributes of the levee lines layer were consistent between sources and languages. This list was then used for both manual and automated input so that different units of measure, classifications of levee and construction type, and key engineering data were harmonised.
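As a rough illustration of how such a cross-matching list might be applied, the sketch below maps source-specific units and terms onto one harmonised record. The table contents, field names, and the function name are assumptions made for this example and do not reproduce the supplementary tables; only 'tanggul' is a term taken from the text.

```python
# Hypothetical conversion factors to metres.
UNIT_TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}

# Hypothetical cross-matching of local terms to a harmonised levee type.
LEVEE_TYPE_TERMS = {
    "levee": "levee",
    "dike": "levee",
    "dijk": "levee",        # Dutch
    "tanggul": "levee",     # Indonesian, as mentioned in the text
    "embankment": "levee",
}


def harmonise(record):
    """Return a record with height in metres and a harmonised levee type."""
    height = record.get("height")
    unit = record.get("height_unit", "m")
    height_m = None
    if height is not None:
        height_m = round(float(height) * UNIT_TO_METRES[unit], 3)

    raw_type = record.get("type")
    # Unknown terms map to None so they can be reviewed by hand.
    levee_type = LEVEE_TYPE_TERMS.get(raw_type.strip().lower()) if raw_type else None
    return {"height_m": height_m, "levee_type": levee_type}


print(harmonise({"height": 10, "height_unit": "ft", "type": "Tanggul"}))
```

Returning `None` for unrecognised terms, rather than guessing, keeps gaps visible for manual review.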

Vector data processing
Where data was sourced in vector format, we defined a data processing algorithm in the ArcGIS® Model Builder (Supp. Fig. S1) to clip the imported data to the extent of river deltas from Edmonds et al. (2020) with a 100 km 'buffer zone'. This buffer zone is included to maximize openDELvE data usability, but it does not affect reported statistics on delta coverage.
All reported data statistics in this paper are for levees strictly within delta boundaries (Fig. 1). The buffer zone allows extended use of the dataset for upstream fluvial and sediment transport modelling. Additionally, should the dataset of Edmonds et al. (2020) be updated, it reduces the likelihood that levees are missed from the layer.
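The distinction between features retained in the buffer zone and features counted in the statistics can be sketched in plain Python. This is a simplified stand-in for the ArcGIS Model Builder workflow, not a reproduction of it: it assumes planar coordinates in kilometres, tests whole features by their vertices rather than cutting lines at the boundary, and all function names and category labels are hypothetical.

```python
import math

BUFFER_KM = 100.0  # buffer distance stated in the text


def point_in_polygon(pt, poly):
    """Ray-casting test; poly is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def dist_to_segment(pt, a, b):
    """Shortest planar distance from pt to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = pt, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    t = 0.0
    if length_sq > 0:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def dist_to_polygon(pt, poly):
    """Zero inside the polygon, otherwise distance to the nearest edge."""
    if point_in_polygon(pt, poly):
        return 0.0
    return min(dist_to_segment(pt, poly[i], poly[(i + 1) % len(poly)])
               for i in range(len(poly)))


def classify_levee(line, delta):
    """Return 'in_delta' (counted), 'in_buffer' (kept, not counted) or 'outside'."""
    dists = [dist_to_polygon(v, delta) for v in line]
    if all(d == 0.0 for d in dists):
        return "in_delta"
    if min(dists) <= BUFFER_KM:
        return "in_buffer"
    return "outside"


delta = [(0, 0), (50, 0), (50, 50), (0, 50)]
print(classify_levee([(10, 10), (20, 20)], delta))
print(classify_levee([(60, 10), (120, 10)], delta))
```

In a real GIS workflow the clip would split lines at the buffer edge and work in a projected coordinate system; the sketch only shows why buffered features can be stored without entering the delta statistics.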

The automated import process created in the ArcGIS® Model Builder is distributed with the dataset so that data can be repeatedly processed and added to the database both now and in the future. We supplemented this by creating conversion tables (Supp. Table S2) so that levee attributes, where available, are comparable at a global scale.

Non-vector data processing
We georeferenced map and documentary data where the location was visible on a contemporary map and the map could be referenced with fewer than 5 reference points. This ensured that we were not extensively distorting the source map, so that we could trace the features as accurately as possible. Where no georeferencing within 5 reference points was possible, the appropriate data quality class was assigned. Where the map had too few defining features to be suitably georeferenced at all (e.g. the map was created with too few topographical features, or substantial engineered or geological change resulted in differences between the map and the modern-day situation), the data source was set aside and documented in the log. Furthermore, where aerial photography was analysed, we defined a set protocol for the inference of leveed area (Supp. Fig. S3).
Data in the "Levee Lines" layer is currently limited to vector levee data sources and does not exist for raster data sources.
Ongoing work includes manual review and development of (semi-)automated processing steps to retrieve levee lines from raster sources.

Data attribution
Every task performed is recorded in the journal for audit purposes, and each entry into the layers is attributed to the data source, including a full literature reference, the source URL, and a DOI (where available). This ensures that we can display this data interactively and that the original source remains permanently available. We also included any digital identifiers from vector datasets so that the individual feature can be tracked and mapped over subsequent data revisions.
We timestamp each entry into the delta index and additionally flag deltas that need manual review in the future. This has no effect on data quality; however, it ensures that there is a robust process in the future to signal amendments needed or entries where it is apparent that undocumented or inaccessible data sources are available. This not only supports local maintenance, but also prevents repetition of previous search activities.

Data management
The resulting data layers for levee area and levee line features were created in ArcGIS Pro and hosted on an ArcGIS Online data hub (http://www.opendelve.eu). Additionally, we maintain ongoing research data exports in the DataverseNL environment as the database develops. Data is stored in three defined entities as per Table 1, and we store each layer within its own container in the public ArcGIS Online® environment. These layers are then published publicly, both as part of the ArcGIS Online Directory and for modern GIS clients via Web Feature Service (WFS).
The openDELvE platform facilitates interactive and community-driven maintenance of the dataset through an amendment form and additional messages in all metadata files. The project remains actively maintained by the authors at Utrecht University, who assign permanent identifiers (DOIs) to the research dataset and develop the website alongside it.

Data assurance
Before releasing the dataset, we performed several checks on the data and metadata (Table 4). We then generated metadata compliant with the EU INSPIRE geospatial metadata standard (European Parliament, 2007) using the built-in ArcGIS® Pro wizard for each data element (Table 2) and for the dataset in its entirety. This included interactive help-text for the model builder GUI. We self-validated the metadata files using the metadata wizard in the ArcGIS® Pro system.

Type: Criteria
Duplicate Check: There are no duplicate delta polygon IDs (PolygonID) in the index
Orphan Check: All linked delta polygon IDs matched a delta polygon in the dataset; there were no unsuccessful joins between the data layers
Null Check: Where there was no match to a delta polygon, this returned -1; where it was not (yet) possible to match the polygon to a delta, this returned null
Visual Check: Visually verify that the data appear as can reasonably be expected (i.e., within 100 km of delta polygon border, within proximity of water feature, of a shape that is coincident to fluvial
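The duplicate, orphan, and null checks in Table 4 lend themselves to a short automated sketch. The example below is illustrative only: it assumes each row is a dictionary carrying a PolygonID field (the field name comes from Table 4; the function names and sample rows are hypothetical), and it does not cover the visual check.

```python
from collections import Counter

NO_MATCH = -1  # value returned where no delta polygon matched (Table 4)


def find_duplicates(index_rows):
    """PolygonIDs that occur more than once in the index."""
    counts = Counter(row["PolygonID"] for row in index_rows)
    return sorted(pid for pid, n in counts.items() if n > 1)


def find_orphans(index_rows, layer_rows):
    """Linked PolygonIDs in a layer that have no counterpart in the index."""
    known = {row["PolygonID"] for row in index_rows}
    linked = {row["PolygonID"] for row in layer_rows
              if row["PolygonID"] not in (None, NO_MATCH)}
    return sorted(linked - known)


def count_unmatched(layer_rows):
    """Count features with no match (-1) and features not yet matched (null)."""
    ids = [row["PolygonID"] for row in layer_rows]
    return {"no_match": ids.count(NO_MATCH), "not_yet_matched": ids.count(None)}


# Made-up sample rows, for illustration only.
index_rows = [{"PolygonID": 1}, {"PolygonID": 2}, {"PolygonID": 2}]
layer_rows = [{"PolygonID": 1}, {"PolygonID": 7},
              {"PolygonID": NO_MATCH}, {"PolygonID": None}]

print(find_duplicates(index_rows))
print(find_orphans(index_rows, layer_rows))
print(count_unmatched(layer_rows))
```

A release would pass the first two checks when both lists are empty; the third function only reports counts, since -1 and null are valid states rather than errors.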

How representative is openDELvE?
As summarised in Table 5, we found that 19% of the geomorphic delta area (which can include the shallow marine portions of the delta front) processed in openDELvE is protected by a levee. This should be considered a rough estimate. For deltas covered by nationally maintained databases (e.g. Mississippi, Rhine-Meuse) the data quality is good: there is rich metadata and little chance of false negatives (no levee in openDELvE where levees are nevertheless present). Data quality and coverage in other deltas (e.g. Ganges-Brahmaputra, Mekong) is poorer, and this appears to be linked to the lack of a