A Multi-strategy-mode-waterlogging-prediction Framework for Urban Flood Depth

Flood is one of the most disruptive natural disasters, causing substantial loss of life and property damage. Coastal cities in Asia face floods almost every year due to monsoon influences. Early notification of flooding events enables governments to implement focused preventive actions. Specifically, short-term forecasts can buy time for evacuation and emergency rescue, providing flood victims with timely relief. This paper proposes a novel multi-strategy-mode waterlogging prediction (MSMWP) framework for forecasting waterlogging depth based on time series prediction and machine learning regression. The framework integrates historical rainfall and waterlogging depth to predict near-future waterlogging in time under future meteorological circumstances. An expanded rainfall model is proposed to account for the positive correlation of future rainfall with waterlogging. By selecting a suitable prediction strategy, tuning the model parameters, and comparing different algorithms, the optimal prediction configuration is selected. In actual-value testing, the selected model has high computational efficiency, and the accuracy of predicting the waterlogging depth 30 minutes ahead reaches 86.1%, which is superior to many data-driven waterlogging-depth prediction models. The framework is useful for accurately and promptly predicting the depth at a target point; the prompt dissemination of early-warning information is crucial to preventing casualties and property damage.


Introduction
With the development of globalization, extreme weather and climate events occur frequently and cause a series of disasters.
According to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), global extreme weather and climate events have increased and intensified over the past 50 years, and such events will occur more frequently in the future (Jefferson, 2015). According to the Global Risk Report 2021 released by the World Economic Forum (WEF), extreme weather and climate events ranked first in probability of occurrence among the top 10 global risks for four consecutive years from 2017 to 2020. Flood disaster is one of the most destructive natural disasters, usually caused by extreme weather and climate events, and mainly includes river basin floods, mountain floods, storm surges, urban waterlogging, and other disaster types. Physics-based simulation of such events demands extensive domain knowledge and expertise (Kim et al., 2015). Although relatively fine simulation results can be obtained, comprehensive and large-scale calculations are commonly required, so these methods are difficult to apply to large-scale urban flood risk identification.

Statistical methods
Flood-vulnerable locations have been identified using statistical techniques and sophisticated algorithms. A statistical model based on static Bayesian networks was used to detect floods (Hong et al., 2016). A method using Bayesian parameter estimation was proposed in 2014: it estimated the Topographic Wetness Index (TWI) threshold based on the inundation curve calculated over a spatial window to identify locations susceptible to waterlogging (Jalayer et al., 2014). Taking nine types of factors into account, floods could be predicted through weighted superposition processing in GIS (Mukherjee and Singh, 2019). Statistical methods are simple, practical, and easy to operate, and are mainly used for risk assessment and sensitivity analysis. However, they rely on historical statistics and lack a prediction process, so they often yield only semi-quantitative predictions.

Data-driven models
Data-driven models can numerically predict floods solely from historical data without requiring knowledge of the underlying physical processes. They are used to induce regularities and patterns, offering easier implementation at low computational cost and faster execution than physical models (Faizollahzadeh Ardabili et al., 2018). Machine learning methods have contributed greatly to the development of prediction systems over the past two decades, providing better performance and cost-effective solutions. The characteristics of these methods need to be clarified with respect to the type and amount of available training data and the type of prediction task, e.g., water level or streamflow. Jia (2022) classified urban catchment areas to realize waterlogging risk prediction based on unmanned aerial vehicle images and machine learning algorithms. Puttinaovarat (2020) proposed a novel flood forecasting system based on fusing meteorological, hydrological, geospatial, and crowdsourced big data in an adaptive machine learning framework. Ensemble prediction systems (EPSs) were demonstrated to improve model accuracy in flood modeling (Zhang et al., 2018). Prediction accuracy has also been improved with the discrete wavelet transform (DWT), which decomposes the original data into frequency bands and thereby extends flood prediction lead times.
Neural networks have been widely used for flood prediction. Kim (2016) developed an ANN forecast model with hourly lead time consisting of meteorological and hydrodynamic parameters of three typhoons. Danso-Amoako (2012) provided a rapid system for predicting floods with an ANN; an R² value of 0.70 proved that the tool was suitable for predicting flood variables with high generalization ability. Kourgialas (2015) created a modeling system based on ANNs for predicting extreme flow 3 h, 12 h, and 19 h ahead of a flood, which was more effective than conventional hydrological models. In hourly forecasting, NARX worked better than BPNN for short lead times; the NARX network produced an average R² value of 0.7, showing that it is effective in urban flood prediction (Chang et al., 2014). Some studies defined waterlogging prediction as a classification problem: treating it as binary classification, Ke (2020) divided disaster records into flood and non-flood events and compared 14 models. Other studies used regression to predict the change of waterlogging water level. Wu (2020) constructed a regression model with a deep learning algorithm, the Gradient Boosting Decision Tree (GBDT), to predict the depth of urban flooded areas.
Combining the GBDT model with hydrological variables, they learned the relationship between each conditioning factor and the occurrence of waterlogging through training, then predicted the range and depth of waterlogging.

Hybrid machine learning methods
Most studies used a single algorithm or model for prediction and tested it on different datasets to assess generalization ability. To improve prediction quality, there has been an ever-increasing trend toward hybrid machine learning methods, such as FFRM-ANN (Hsu et al., 2010), ANN-hydrodynamic models, SVM-FR (Tehrany et al., 2015), WNN-BB, and RNN-SVR (Hong, 2008). The application of machine learning methods to waterlogging disaster prediction also has shortcomings. First, if the data are scarce or do not cover a variety of tasks, the models' ability to learn decreases. Second, the performance of each machine learning algorithm may vary across task types; for example, some algorithms perform well for short-term predictions but not for long-term predictions. These characteristics need to be clarified with respect to the available training data and the type of prediction task.

MSMWP Framework
Accumulated rainfall is one of the most direct factors in the formation of waterlogging. Through data correlation analysis, we conclude that there is a functional relationship between rainfall and waterlogging depth, which is related to soil permeability, impervious area, air humidity, and drainage system capacity in the area. To improve the accuracy of waterlogging depth prediction, this paper proposes a prediction framework for urban waterlogging depth (as shown in Figure 1) called MSMWP (multi-strategy-mode waterlogging prediction), based on a variety of machine learning strategies, modes, and algorithms for time series data. The waterlogging prediction process under this framework is as follows.

Step1. Data preprocessing
Statistical analysis, box-plot tests, and correlation analysis were used to deal with missing values and outliers. Redundant data were eliminated according to the configuration conditions of the model, and an interpolation method was selected to impute the missing data after unifying the data sampling rate. Data processing goes through five steps, beginning with: (1) Correlation analysis.
Calculate the correlation of the various meteorological data (rainfall, wind speed, temperature, minimum pressure, etc.) with waterlogging depth.
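The correlation-analysis step above might be sketched as follows; the helper functions and the column names are illustrative assumptions, not the paper's actual code or field names.

```python
# Sketch of the correlation-analysis step, assuming the meteorological
# series are already aligned on a common time index.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_factors(depth, factors):
    """Rank candidate meteorological factors by |r| against depth."""
    scores = {name: pearson(series, depth) for name, series in factors.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Ranking factors by absolute correlation is one simple way to justify keeping rainfall as the dominant input, as the paper later does with the 0.61 coefficient.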

Step2. Training mode setting
In this paper, the accumulated rainfall data set R and the historical waterlogging depth data set D are used to predict the waterlogging depth in the future. By adjusting the data combination method Ф, a new data set X can be constructed by Eq. (1): each input vector x_i combines r_i (r ∈ R) and d_i (d ∈ D) in the set sharding mode; the vectors loop through all the training data and are combined into the input data set X (Eqs. 2 to 9), which is a high-dimensional matrix.
Y is the label of the model, which stands for the output of the regressor; it is a vector with the same length as X.
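The combination step described here might be sketched as below; the function name and the parameters m (rainfall window length), n (depth window length), and h (prediction horizon in steps) are our reading of the paper's setup, not its actual code.

```python
# Hedged sketch of the data-combination method: each training sample
# couples the last m rainfall values with the last n depth values, and
# the label is the waterlogging depth h steps ahead.
def build_samples(rain, depth, m, n, h):
    """Slide over aligned rainfall/depth series and emit (X, Y) sets."""
    X, Y = [], []
    start = max(m, n)
    for t in range(start, len(rain) - h):
        x = rain[t - m:t] + depth[t - n:t]   # concatenated input vector
        X.append(x)
        Y.append(depth[t + h])               # depth h steps ahead as label
    return X, Y
```

Varying m and n here is exactly what distinguishes the five training modes that follow.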
There are five training modes under the MSMWP Framework.
(1) Only multi-R input (R). Through the analysis of data correlation, the maximum correlation coefficient between rainfall and waterlogging was found to be 0.61. It can be concluded that there is an obvious positive correlation between rainfall and waterlogging depth, which shows that it is feasible to use accumulated rainfall to predict waterlogging.
(2) Multi-R and single-D input (mR&D). In reality, waterlogging often occurs after it has been raining for some time, so waterlogging is delayed relative to rainfall. The fluctuation of waterlogging is a continuous physical process affected by multiple factors, so the waterlogging depth at the next moment is most closely related to the previous one. In this mode, only one historical waterlogging datum is selected as input.
(3) Single-R and multi-D input (R&mD). This mode mirrors multi-R and single-D: both rainfall and waterlogging data are taken as input, but the proportion of rainfall input is reduced while the proportion of waterlogging depth input is increased.
(4) Multi-R and multi-D input (mR&mD). This mode covers more rainfall and waterlogging depth information: it can better extract the characteristics of the time series, balance the coupling weights of the two data sets, and better conform to the temporal law of the rainfall-waterlogging process.
(5) Expanded multi-R and multi-D (E-mR&mD). This paper proposes a new training mode for waterlogging prediction. The predicted value is related not only to past rainfall and the rainfall at the current time point; the subsequent change of rainfall also largely affects waterlogging depth.
This mode makes up for the lack of future rainfall information in mode (4) and better reflects the dominant role of accumulated rainfall. In real applications, real-time rainfall forecast data would be added; due to the lack of rainfall forecast data for this area, sliding rainfall data are used as an approximation. There are two main reasons for this. First, the expanded part (15-30 minutes) accounts for only 12.5%-25% of the sliding rainfall window (2 hours or longer) and has little effect on the whole.
Second, rainfall forecasts, especially short-term forecasts of heavy rain, are now more than 90 percent accurate. Note that this paper does not consider an only-multi-D input mode, because its input matrix X would contain only the change of waterlogging depth over time and would ignore the accumulated rainfall. In that mode, the predictive ability of the model decreases rapidly as the prediction horizon extends, so it is not suitable for long-term warning. The results of this mode are discussed in the latter part of this paper.

Step3. Machine learning regressor setup
Prediction of future data based on historical data is here defined as a regression problem. We adopt a sliding window to slice the time series data in cycles. Traversal is performed in order of the data index to preserve the characteristics of continuous change along the time dimension. Eight types of regression algorithms are selected, all of which can perform both one-dimensional and multidimensional regression output: Linear Regression (LR), Tree Regression (TR), Random Forest Regression (RFR), KNN Regression (KNN), Ridge Regression (RR), Kernel-Ridge Regression (KRR), Lasso Regression (LaR), and Elastic Net (ETN). These eight methods are frequently employed in time series prediction. As a simple method, linear regression has good applicability although it is sensitive to outliers. Building on general linear regression, the objective function of ridge regression adds L2 regularization, which balances fitting error against parameter simplicity and gives the model good generalizability. Yu (2007) realized hydrological time series prediction using a ridge regression algorithm based on feature space, and the Kernel Ridge regression approach was effectively applied to the prediction of monthly mean precipitation (Ali et al., 2020). Shen (2021) took human action prediction from EEG signals as an example to study multivariate time series prediction based on elastic net and high-order fuzzy cognitive maps. Lasso regression is regularized by applying an L1 penalty to the loss function; Wang (2017) used Lasso regression to accurately predict stock market fluctuations. A tree regression model was created to analyze and predict time series of air pollution, since tree regression can describe complex nonlinear data (Gocheva-Ilieva et al., 2019). Wu (2017) used a random forest regression algorithm to analyze the time series of weekly influenza-like incidence with good results, and Martinez (2017) proposed a time series prediction method using a KNN regression algorithm.
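As an illustration, the eight regressors could be instantiated with scikit-learn as follows; the hyperparameters shown are library defaults or placeholders, not the tuned values used in the paper.

```python
# The eight regression methods named above, as one might configure them
# in scikit-learn for the waterlogging-depth regression task.
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge

REGRESSORS = {
    "LR": LinearRegression(),
    "TR": DecisionTreeRegressor(),
    "RFR": RandomForestRegressor(n_estimators=100),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "RR": Ridge(alpha=1.0),           # L2 regularization
    "KRR": KernelRidge(kernel="rbf"),  # ridge with a kernel function
    "LaR": Lasso(alpha=0.1),           # L1 regularization
    "ETN": ElasticNet(alpha=0.1),      # mixed L1/L2 penalty
}
```

Keeping the models in one dictionary makes it straightforward to sweep all eight over the five training modes in a loop.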

Step4. Evaluation of model performance
In this paper, evaluation is divided into two stages: the test stage and the prediction verification stage. The indicators in the test stage include three categories: R² score, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). In the actual-value verification stage, a time series of a specific length is taken and evaluated in two parts. First, to test the model's ability to predict the variation trend of waterlogging depth, time series covering water rising, plateau, and falling are intercepted. Second, by comparing predicted values with actual values, the Absolute Percent Error (APE) is used to calculate the model's rate of correct prediction, namely Accuracy (ACC).
However, APE alone cannot fully evaluate the model, so the Absolute Error (AE) is needed as a supplement: when the waterlogging depth is low, a large APE may correspond to a small AE.
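A minimal sketch of these metrics follows; the APE-based accuracy (counting predictions within a tolerance) is our reading of the paper's ACC, and the 15% threshold is an assumption since the paper does not state one.

```python
# Evaluation metrics used in the test and verification stages.
from math import sqrt

def r2_score(y_true, y_pred):
    """Coefficient of determination R²."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred, tol=0.15):
    """Share of nonzero points whose absolute percent error is within tol."""
    hits = sum(abs(t - p) / abs(t) <= tol for t, p in zip(y_true, y_pred) if t != 0)
    n = sum(1 for t in y_true if t != 0)
    return hits / n
```

Excluding zero-depth points from `accuracy` mirrors the paper's caveat that APE is undefined or misleading at low depths, which is why AE is kept as a supplement.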

Step5. Prediction strategy setting
Different mode settings can greatly affect the training results of the model, and the training strategy is equally crucial. This paper compares three training strategies: Recursive, Single-Output Coupling, and Multi-Output. The optimal prediction strategy is selected by comparing their performance on the waterlogging prediction test set and in the actual-value test.
The time series can be divided into two parts, the historical data and the prediction data. The time interval of each data set should be uniform; even if sensor sampling rates differ, the data should be processed onto a common interval, which we define as Δt (Ben Taieb et al., 2012). To predict s time steps after the current time t, the total prediction horizon is defined as H = s × Δt, according to the number of time steps covered by the time span of the desired output variables. The set of desired output variables can therefore be expressed as {d_{t+1}, d_{t+2}, ⋯, d_{t+s}}. Over this span, we use a common notation f* in Eq. (11) to denote the functional dependency between inputs and outputs:

d_out = f(x_h) + ε    (10)

where x_h is the input of each set, d_out stands for the outputs, f expresses the functional relationship between them, and ε stands for modeling error, disturbances, or noise.
i. Recursive strategy (Rec). The most intuitive forecasting strategy is the Recursive (also called Iterated or Multi-Stage) strategy. The result of the first prediction step is embedded as the final element of the input vector for the subsequent prediction, yielding the second-step prediction, and so on. In Eq. (12), when s = 1, the observations d_t, d_{t−1}, ⋯, d_{t−n} are used to predict d̂_{t+1}. When s = 2, the previous predicted value d̂_{t+1} is shifted into the input vector, the remaining elements move forward one position in turn, and the original oldest element d_{t−n} is removed. The model is iterated recursively, with a mixed vector of historical and forecast data as the model input; this procedure is given as Algorithm 1. The first prediction d̂_{t+1} is obtained from historical data alone; the window is then moved forward one position, the new predicted value is appended, and the next d̂ is obtained from the model, followed by comprehensive error evaluation of the results.
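The recursive iteration described above can be sketched in a few lines; `model_predict` stands in for any of the trained one-step regressors, and the function name is ours.

```python
# Minimal sketch of the Recursive (Rec) strategy: a one-step model is
# iterated s times, each prediction shifted into the input window.
def recursive_forecast(model_predict, window, s):
    """model_predict: one-step predictor taking a fixed-length window."""
    window = list(window)
    preds = []
    for _ in range(s):
        y_hat = model_predict(window)
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # drop oldest value, append prediction
    return preds
```

Because each step feeds on the previous prediction, any one-step error propagates, which is the weakness the next strategy avoids.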

ii. Single-Output Coupling strategy (SOC)
The Single-Output Coupling strategy is similar to the Direct strategy proposed by Hamzacebi (2009). Different machine learning models have been used to implement the Direct strategy for multi-step-ahead forecasting tasks, for instance neural networks (Khashei and Bijari, 2010), nearest neighbors (Sorjamaa et al., 2007), and decision trees (Guimarães Santos and Silva, 2014). It consists of forecasting each horizon independently of the others. The biggest difference from the Recursive strategy is that Single-Output Coupling does not use any approximated values to compute the forecasts, so errors do not accumulate: the error of an earlier prediction does not influence later predictions.
Each step is supported by a corresponding model trained on its own independent data (Eq. 13). When s = 1, this is identical to one-step prediction; when s > 1, each model predicts across a horizon of up to s steps. Finally, the single outputs are coupled by Eq. (14) into a new forecast time series [d_{t+1}, d_{t+2}, ⋯, d_{t+s−1}, d_{t+s}].
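The per-horizon training and the coupling step might look as follows; the function names and the tuple-based sample format are our assumptions.

```python
# Sketch of Single-Output Coupling: one independent model per horizon
# step, each trained on its own labels; forecasts are coupled afterwards.
def soc_labels(series, window_len, s):
    """For each horizon step k (1..s), build (input, label) pairs so that
    model k is trained independently on the depth k steps ahead."""
    per_step = []
    for k in range(1, s + 1):
        pairs = [(series[t - window_len:t], series[t + k - 1])
                 for t in range(window_len, len(series) - k + 1)]
        per_step.append(pairs)
    return per_step

def soc_forecast(models, window):
    """Couple the independent per-horizon predictions into one series."""
    return [m(window) for m in models]
```

Since every model sees only observed history, no approximated value enters any input, which is the no-error-accumulation property noted above.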

iii. Multi-Output strategy (MO)
The two previous strategies (Recursive and Single-Output Coupling) may be considered Single-Output strategies, which neglect the stochastic dependencies between future values and consequently limit forecast accuracy. The Multi-Output strategy instead requires multiple-response modeling techniques: the output is no longer a scalar quantity but a vector of length s. Using only one model, it outputs a time series of s time intervals (Eq. 15). Compared with the Single-Output Coupling strategy, this strategy is simple to operate and fast to compute. Its disadvantage is that some regression algorithms, such as Bayesian regression, GBRT, and AdaBoost, do not directly support multidimensional output.
where F is a vector-valued function and E is a noise vector.
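A sketch of the Multi-Output strategy with one of the paper's regressors: scikit-learn's RandomForestRegressor supports multi-output targets natively, which is why algorithms like it remain eligible here. The function name and hyperparameters are ours.

```python
# Multi-Output strategy: a single model emits the whole length-s vector.
from sklearn.ensemble import RandomForestRegressor

def fit_multi_output(X, Y_matrix):
    """Y_matrix has one row per sample and s columns (one per step)."""
    model = RandomForestRegressor(n_estimators=50)
    model.fit(X, Y_matrix)  # multi-output fit: each target column learned jointly
    return model
```

A single `predict` call then returns all s future depths at once, trading per-horizon specialization for speed and joint modeling of the output steps.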

Step6. Actual value testing
To prevent over-fitting or insufficient prediction, the performance of the framework must be tested on actual waterlogging data sets after training and testing are complete. N groups of continuous time series are selected for the actual-value test, and the application results on actual data are discussed.

Research area and objectives
As an important city in south China and a representative of China's special economic zones, Shenzhen is one of the core cities of the Guangdong-Hong Kong-Macao Greater Bay Area.Over the past 40 years, Shenzhen's GDP has grown rapidly from 270 million yuan in 1980 to 2,767.024 billion yuan in 2021.
In the process of rapid urban development, Shenzhen also faces many challenges from natural disasters and accidents, which often pose serious threats to urban public safety and security. Located on the southeast coast of China, Shenzhen has a subtropical monsoon climate; influenced by the Pacific monsoon current, it receives abundant rainfall all year round. Rainfall is unusually concentrated from June to September, when heavy or extremely heavy rains occur frequently. In particular, frontal rainfall in the Shenzhen area is shaped by topography, which often produces sudden local rainstorms of short duration. According to statistics of rainfall data in Shenzhen from 1960 to 2012, rainfall events with daily totals of more than 50 mm occurred 8.8 times per year on average, of which 75.2% were heavy rain, 21.4% torrential rain, and 3.4% extremely torrential rain (Chen et al., 2020). In May 2014, Pingshan District, Shenzhen, was hit by a sudden rainstorm, with 261 mm of rainfall in three hours, flooding 150 houses and affecting 2,600 people. The rainstorm event on April 11, 2019 brought the heaviest April rainfall since Shenzhen began keeping meteorological records and led to 11 deaths (Liu et al., 2020).
This paper focuses on areas vulnerable to waterlogging in Shenzhen. A data-driven prediction model of urban rainstorm waterlogging depth is established, which can realize advance perception and accurate prediction of the water level change at a waterlogging point. Observation points for meteorological data and waterlogging depth cover all 10 districts of Shenzhen; their spatial distribution is shown in Figure 2. The Thiessen polygon algorithm is used to divide the area according to the geographical locations of the meteorological stations, yielding the rainfall coverage of each station; the polygon around each station indicates the region in which that station's meteorological data are used (Figure 3) (Men et al., 2020). Through this partition, each of the 170 waterlogging depth observation stations is assigned a unique corresponding meteorological input, unifying the model's input-output relationship in the spatial dimension.
Because the waterlogging sensors were deployed in batches, the total operating time and stored data volume of each sensor vary. Among the 170 waterlogging sensors, we selected sensor_123, which has a long operation time and a large amount of data, as the research object. Through data analysis and consistency testing of the meteorological data, precipitation was determined to be the most influential factor on waterlogging depth. The rainfall data include sliding rainfall with different window lengths: R10M, R30M, R01H, R02H, R03H, R06H, R12H, R24H, and R72H denote the sliding rainfall values over 10 minutes, 30 minutes, 1 hour, 2 hours, 3 hours, 6 hours, 12 hours, 24 hours, and 72 hours respectively; D denotes the waterlogging depth. Correlation analysis between the sliding rainfall and the waterlogging depth data of each station (Figure 4) shows that R02H has the largest correlation with waterlogging depth, at 0.61. In Figure 4, darker colors indicate lower correlation and lighter colors higher correlation. Due to the special working mechanism of the waterlogging depth sensor, the data sampling rate is not uniform: in periods with no water accumulation (waterlogging depth of 0), data are collected at irregular intervals of several hours or even several days, while in periods when the water level changes dramatically, the sensor can collect data as often as once per minute, which complicates the research. Considering that the interval of the rainfall data is 5 minutes, and to balance model accuracy and training efficiency, the waterlogging depth data are first resampled to match the rainfall data on the time scale. Data interpolation is then performed on the blank intervals introduced by resampling, which does not destroy the original characteristics of the data; since the fluctuation of waterlogging is a smooth and continuous process, the waterlogging-depth curve is smooth. Five commonly used interpolation methods, Cubic, Quadratic, Linear, Zero, and Nearest, are compared in this paper. The optimal interpolation method is determined by comparing the MAPE between the interpolated data and the actual data and by observing the fit between the interpolated curve and the actual one. The Linear interpolation method is finally applied (Cubic and Quadratic may produce negative values, while Zero and Nearest have obvious step characteristics inconsistent with the continuous nature of the waterlogging depth). After this analysis, we obtain a total of 527 non-zero interpolated data points, accounting for 0.46% of the total data set (143,424). After determining the location correspondence between rainfall stations and waterlogging observation stations and unifying the time scale, the two data sources were integrated into one data set. From Figure 5, it can be seen that in most time intervals the rainfall is 0 and waterlogging records are sparse; this is because there is no rainfall most of the time, which is consistent with reality. Although the proportion of impervious surface in urban construction areas increases annually, the frequency of waterlogging caused by surface runoff has decreased thanks to the continuous improvement of drainage systems and the application of sponge city engineering. However, during strong typhoons or heavy rain, the drainage volume still cannot meet the city's needs; the drainage system becomes overloaded and large amounts of urban surface runoff accumulate in low-lying areas. In this study, considering surface infiltration, canopy interception, and evaporation, surface runoff cannot form under extremely low rainfall or no rainfall, so the waterlogging depth then remains 0.
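The resampling and linear-interpolation step might look as follows with pandas; the function name and the example timestamps are illustrative.

```python
# Align irregular depth readings to the 5-minute rainfall grid, then
# fill the blank intervals by linear interpolation, matching the
# paper's chosen method.
import pandas as pd

def resample_depth(timestamps, depths, freq="5min"):
    """Resample irregular depth samples onto a uniform grid and interpolate."""
    s = pd.Series(depths, index=pd.to_datetime(timestamps))
    return s.resample(freq).mean().interpolate(method="linear")
```

Linear interpolation cannot produce negative depths between non-negative samples, which is one reason it is preferred here over cubic or quadratic splines.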
To avoid interference from such factors in model training, the minimum rainfall threshold is set here at 5 millimeters. By searching the entire data set, the start and end timestamps of each rainfall event interval are located, named R_STA and R_END, with the rainfall duration denoted R_DUR; in the full data set, R_STA and R_END are paired to index the start and end of each rainfall event. A total of 251 rainfall event time series (982.34 hours in total, with an average rainfall duration of 3.91 hours) were obtained by screening the 143,424 data points from this site, and a new dataset, Rain_Set, with 12,309 data points (Figure 6) was constructed. This method eliminates the interference of a large number of dry-weather inputs to the model, which is shown to improve the efficiency and accuracy of the model calculation.
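The event-screening step could be sketched as below; the function name is ours, and for simplicity the threshold is applied pointwise to the rainfall series rather than to any specific sliding window.

```python
# Scan the aligned rainfall series and keep only contiguous intervals at
# or above the 5 mm threshold, yielding (R_STA, R_END) index pairs.
def extract_events(rain, threshold=5.0):
    """Return (start, end) index pairs of contiguous rainfall events."""
    events, start = [], None
    for i, r in enumerate(rain):
        if r >= threshold and start is None:
            start = i                      # R_STA: event begins
        elif r < threshold and start is not None:
            events.append((start, i - 1))  # R_END: event ends
            start = None
    if start is not None:                  # event running at series end
        events.append((start, len(rain) - 1))
    return events
```

Concatenating only the flagged intervals reproduces the Rain_Set idea: the model never trains on long dry stretches that carry no waterlogging signal.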

Step2: Training mode setting
Based on the MSMWP framework, after data preprocessing, different training modes are constructed by changing Ф to adjust the input combination mode. Each rainfall time series is processed by cyclic cutting with sliding windows, giving a matrix X as in Eq. (16). In the five modes, input vectors of different dimensions can be constructed by adjusting the values of m and n, and multiple models are trained. The goal of the model is to accurately predict the waterlogging depth at a certain time. Assume that the current time is t; the predicted target values of the future waterlogging depth are d_{t+1}, d_{t+2}, d_{t+3}, ⋯, d_{t+s−1}, d_{t+s}. When m = 6 and n = 1, the model selects the rainfall of the 30 minutes and the waterlogging of the 5 minutes before time t as training input; when m = 12 and n = 3, it selects the rainfall of the 1 hour and the waterlogging of the 15 minutes before time t. The combinations of m and n for the five modes are shown in Table 1, for example:

X = [ [5, 18, 41, 63, ⋯, 97, 99, 130, 147],
      [18, 41, 63, 74, ⋯, 99, 130, 147, 151],
      [41, 63, 74, 83, ⋯, 130, 147, 151, 158] ]

Since rainfall is the fundamental factor affecting waterlogging, the dimension of the rainfall input in the latter two modes is generally higher than that of waterlogging. In expanded multi-R and multi-D, the rainfall input is split with t as the dividing line to emphasize the influence of the subsequent continuous rainfall on the model; for example, m = 12(9:3) means that the rainfall of the 45 minutes before t and the 15 minutes after t is selected as input.
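The m(past:future) split in mode (5) could be coded as follows; the function and its arguments are our illustration of the construction, not the paper's implementation.

```python
# Build one E-mR&mD input vector around index t: the rainfall window
# spans m_past steps before t and m_future steps after t (the sliding
# rainfall used as a forecast proxy), while the depth window stays
# strictly historical.
def expanded_input(rain, depth, t, m_past, m_future, n):
    """Assemble the expanded multi-R and multi-D input for time index t."""
    r = rain[t - m_past:t + m_future]  # past + forward rainfall
    d = depth[t - n:t]                 # historical depth only
    return r + d
```

With m = 12(9:3), for instance, the call would be `expanded_input(rain, depth, t, 9, 3, n)`, giving 9 past and 3 forward rainfall values per sample.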
The strategy in Step 5 influences the label selection of the model. The Recursive strategy requires direct prediction of d_{t+1} and then recursion up to the label d_{t+s}.

Step4: Evaluation of model performance
The constructed data set was divided into a training set and a test set at a ratio of 70% to 30%. Different modes, strategies, and regression algorithms were applied for training and evaluated by RMSE, MAE, and R² score. The fit of the predicted curve to the actual data on the test set was compared, and the test results of each configuration were analyzed and ranked.

Step5: Prediction strategy setting
The Rec strategy is set to replace only the last value of the waterlogging vector at a time. The Single-Output Coupling strategy outputs the waterlogging at 5 minutes, 10 minutes, and so on up to (s × 5) minutes ahead, so the labels are d_{t+1}, d_{t+2}, d_{t+3}, ⋯, d_{t+s}. The Multi-Output strategy outputs the whole waterlogging vector covering (s × 5) minutes at once, so its label is the vector [d_{t+1}, d_{t+2}, ⋯, d_{t+s}].

Result and discussion
This section presents and discusses the testing results of the different modes and forecasting strategies. For each mode, we report the results obtained with the eight regression methods. Based on the results of actual-value verification, the different prediction strategies are discussed.

Testing result
The time series intercepted by rainfall events were integrated into a new data set with a total of 12,039 data points. The first 70% constituted a training set of 8,428 points, and the last 30% a testing set of 3,611 points. On the testing set, the eight regression methods were used to test the 5 modes, and the model structure within each mode was varied by adjusting the input parameters m and n. The testing results of the different modes are shown in Table A1 to Table A4 in the Appendix and in Table 2. The numbers in brackets give the ranking of the evaluation indicators (MAE, RMSE, R² score) among the different modes of the same algorithm. Bold font indicates the best results of the same algorithm across modes, and underlined numbers indicate that the indicator is in the top 50% of the best (bold) results across algorithms. Taking mode (3)-KRR as an example: among the KRR optima, its RMSE of 0.0051 ranks second, its MAE of 0.0013 ranks third, and its R² score of 0.9779 ranks third, so the underlined count of KRR is 3. When m changes from 6 to 24, the R² score of RR rises from 0.1209 to 0.3954, 3.27 times the initial value; in this mode, the larger m is, the more information the model learns and the better the testing performance. Comparing predicted with actual values, all eight regression methods show large noise when the actual waterlogging depth is 0. Even though KRR and LR achieve good trend prediction at the peak, large noise fluctuations remain most of the time (Figure 7). This phenomenon may be caused by the lack of historical waterlogging time series input, leaving the noise poorly suppressed. In mode (2), the prediction performance of LR and TR improves as m increases, as in mode (1); however, RFR, RR, and KRR are not sensitive to the parameter changes.
In mode (3), a larger parameter n does not necessarily yield a better model. For example, the optimal results of RFR, RR and KRR are obtained when n=6, where the R² scores of all three exceed 0.977, indicating that early waterlogging-depth information does not help predict future waterlogging depth and may even cause interference. For LR, with the same number of parameters, the result of mode (3) is better than that of mode (2). The main reason is that mode (3) extracts more historical waterlogging-depth information, which evolves as a continuous process in short-term prediction.
However, this does not mean that the model performs best overall, because it contains insufficient rainfall information and may not perform well in the practical application of prediction.
Mode (4) couples multiple rainfall and waterlogging inputs. Overall, the results of mode (4) are better than those of mode (2) and mode (3) with one-dimensional input (m=1 or n=1). In this mode, the TR and RFR methods achieve the best testing results: TR reaches an R² score of 0.9778 and an RMSE of 0.0050 (m=24, n=3), and RFR reaches an R² score of 0.9803 and an RMSE of 0.0049 (m=6, n=3). Mode (5) expands the rainfall input and considers the influence of future rainfall. When the parameters are set to m=6 (3:3) and n=6, the LR evaluation indicators (MAE=0, RMSE=0 and R² score=1) are abnormal; the same problem appears for RFR, TR, and other methods.
The main reason is that the prediction label is included in the input time series, so the result for m=6 (3:3), n=6 in mode (5) should be excluded from the discussion. Based on the performance of mode (5) on the testing set, its advantage grows as the prediction lead time is extended, because mode (5) expands the rainfall input and considers the influence of future rainfall. In particular, rainfall with short duration and high intensity is especially suited to this mode.
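A minimal sketch of how a mode (5)-style sample could be assembled, concatenating past rainfall, forecast rainfall, and past depth into one input vector. The window sizes, index arithmetic, and the leakage guard are illustrative assumptions, not the paper's exact slicing:

```python
import numpy as np

def mode5_samples(rain, depth, m_past=12, m_future=6, n=6, s=6):
    """Illustrative mode (5) windowing: past + future rainfall plus past depth.

    Input  X_t = [R_{t-m_past+1..t}, R_{t+1..t+m_future}, D_{t-n+1..t}]
    Label  y_t = D_{t+s}
    """
    X, y = [], []
    for t in range(max(m_past, n) - 1, len(rain) - max(m_future, s)):
        past_rain = rain[t - m_past + 1 : t + 1]
        future_rain = rain[t + 1 : t + 1 + m_future]   # forecast rainfall
        past_depth = depth[t - n + 1 : t + 1]
        # Leakage check: the depth label D_{t+s} must never appear inside
        # any input window, otherwise MAE=0 / R²=1 artifacts arise.
        X.append(np.concatenate([past_rain, future_rain, past_depth]))
        y.append(depth[t + s])
    return np.array(X), np.array(y)

rain = np.linspace(0, 1, 200)      # placeholder series
depth = np.linspace(0, 0.5, 200)
X, y = mode5_samples(rain, depth)
print(X.shape, y.shape)
```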
As shown in Figure 8, (a) uses only 30 minutes of waterlogging data for prediction, while (b) uses 30 minutes of waterlogging together with 90 minutes of expanded rainfall as input. As the prediction time increases, the difference in prediction performance between them grows. For 40-minute waterlogging prediction with the TR method, the R² score of (b: 0.6841) is 2.4 times that of (a: 0.2847). With the RFR method, the R² score and MAE of (b: R² score=0.7428, MAE=0.0044) also significantly exceed those of (a: R² score=0.6488, MAE=0.0053). This matters especially for medium-to-high values: for the high value (0.30 m) and the medium value (0.13 m), the predictions of (a) are 0.82 m and 0.73 m, and such large errors lead to poor accuracy in the prediction of medium-to-large waterlogging. Comparing the regression methods, LR, TR, RFR, RR and KRR perform relatively better, reflecting the strong generalization ability of these models (Figure 9). KRR, Ridge Regression with a kernel function added, is better suited to high-dimensional data; in this study it shows slightly stronger regression performance than RR. The comparison (Figure 9(a)(b)) shows that LR and KRR predict high values well but suppress noise poorly at low and zero values, with the model fluctuating constantly around the x axis. RFR underpredicts the highest value but performs well for the other high values, and its noise control at low values is better. TR has the best noise control at the zero value, but its curve is not smooth and takes a ladder shape at high values.
Of course, this is related to the principle of each algorithm. For RFR, setting the parameter n_estimators to 100 resolves the ladder problem of TR (Figure 9(c)). LR, RFR, KRR and TR show strong fitting ability on the training set (TR reaches MAE=0.0000, RMSE=0.0000, R² score=1.0000) (Figure 10), while KNN and ETN show relatively poor fitting ability. KNN, LaR and ETN have weak fitting and generalization ability and are not suitable for regression prediction of such data (Figure 11). The KNN method has negative R² scores in modes (1), (2), (4) and (5). For data sets in which most feature values are 0, prediction performance is often poor, which explains why the KNN results are the worst (Figure 11(a)). LaR and ETN are mode-insensitive and give the same results in mode (1) and mode (2); within each mode, however, the results change as m and n change. The poor results of LaR (Figure 11(c)) may arise because the method suits multi-variable models, selecting variables by adjusting the λ value to compress the variable coefficients, whereas this study has few variables. Similarly, ETN (Figure 11(b)) works well when many features are interconnected, but this model has a small number of features, so a method whose basic principle is similar to Lasso Regression also performs poorly.
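The ladder-shaped curve of a single tree, and its smoothing by an ensemble, can be reproduced with a toy example. The data here are synthetic; only the n_estimators=100 setting is taken from the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = np.sort(rng.random((300, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(300)

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

grid = np.linspace(0, 1, 200).reshape(-1, 1)
tree_pred = tree.predict(grid)
forest_pred = forest.predict(grid)

# A single tree yields a piecewise-constant ("ladder") curve; averaging
# 100 trees smooths it, so the forest takes many more distinct values.
print(len(np.unique(tree_pred)), len(np.unique(forest_pred)))
```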
To sum up, mode (5) performs better than any other mode, indicating that short-term waterlogging prediction that accounts for the future rainfall trend is more realistic. LR appears to achieve good prediction results in all five modes; however, it cannot be ignored that the original waterlogging-depth data are sparse and uneven and must undergo resampling and interpolation, so an actual-value test is needed to judge whether the LR method is really applicable to prediction. To highlight the prediction performance of each strategy, we set s=6 for 30-minute prediction, which is a long forecast for real-time water levels. In practical application, a 15- to 20-minute prediction is more common, and a 20-minute prediction fully meets the requirement of releasing early-warning information in advance and dispatching the nearest traffic and fire personnel to the scene (Georgiadou et al., 2010). We evaluate multiple indicators: Absolute Error (AE), Absolute Percent Error (APE) and Time Cost. Mode (5) with m=18 (12:6) and n=6 was selected as the basic parameter setting to evaluate model performance under each strategy. Figure A1 in the appendix shows the isometric segmentation of the verification data set under the different strategies; the first 30 minutes of the time series are shown as curves of actual versus predicted values for each method. KNN, LaR and ETN perform poorly in the prediction application and have high AE and APE values, so these three methods are not discussed further in the analysis of results. The Absolute Error between the predicted and actual values of each strategy is also examined (Figure A2 in the appendix), with a tolerance of 0.02 m set to exclude reasonable error. APE reflects the accuracy of the model, but at low values it does not independently reflect performance, because even small errors (<0.01 m) can cause APE to rise (Figure A3 in the appendix). Therefore, the means of AE and APE (MAE and MAPE) were used to evaluate model accuracy, and the variances of AE and APE (V-AE and V-APE) were used to evaluate robustness. In conclusion, LR, TR, RFR, RR and KRR are superior to the other methods, consistent with the results on the testing set. Models using the Rec or MO strategies have better performance and robustness, with an average accuracy of more than 85% for predicting waterlogging depth in the next 30 minutes. For short-term prediction, such as 15-minute prediction, the accuracy can reach 93%, and the robustness of the model improves further. As can be seen from Figure A1 to Figure A3 in the appendix, when s=3 the prediction curves of RFR, LR, KRR and the other methods basically match the actual values, and the AE and APE of each group are almost within tolerance.
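One plausible implementation of these accuracy and robustness indicators, applying the 0.02 m tolerance to exclude reasonable error. The exact treatment of zero depths and the tolerance rule are assumptions, not the paper's stated formulas:

```python
import numpy as np

def evaluate(actual, pred, tol=0.02):
    """Illustrative AE/APE indicators with a 0.02 m tolerance."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ae = np.abs(pred - actual)
    ae = np.where(ae <= tol, 0.0, ae)      # errors within tolerance ignored
    nonzero = actual > 0                   # APE is undefined at depth 0
    ape = np.zeros_like(ae)
    ape[nonzero] = ae[nonzero] / actual[nonzero]
    return {"MAE": ae.mean(), "MAPE": ape[nonzero].mean(),
            "V-AE": ae.var(), "V-APE": ape[nonzero].var()}

# Hypothetical actual vs. predicted depths (m)
actual = np.array([0.00, 0.05, 0.13, 0.30, 0.10])
pred   = np.array([0.01, 0.06, 0.14, 0.26, 0.10])
print(evaluate(actual, pred))
```

The means measure accuracy; the variances measure robustness across groups.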

Conclusions
The prediction and early warning of urban rainstorm and waterlogging disasters has always been a key problem. Predicting in advance the rapid water-level rise caused by short-term heavy rainfall is both challenging and meaningful.
Waterlogging caused by rainstorms usually accumulates in low-lying areas of the city, such as poorly drained blocks and roads, underpass tunnels, bridge culverts, municipal plumbing manholes, and underground shopping malls or parking lots. Accurate prediction of waterlogging is essential for emergency decision-making and disaster response. Government emergency departments can issue warning information to the public in time and notify traffic management departments to rush to the scene to block the relevant roads, culverts, tunnels, and so on. Effective prediction and monitoring will help minimize casualties and property losses.
A multi-strategy-mode waterlogging-prediction framework is proposed, covering how to preprocess the raw data and how to select training modes for different machine learning algorithms. Within this framework, different prediction strategies were discussed and used to predict multiple dimensions of waterlogging. The results show that the mode with expanded multi-R and multi-D input performs better than any other mode, and that five of the regression algorithms are well suited to waterlogging prediction.
The Recursive and Multi-Output strategies both show good performance and robustness, but the MO prediction strategy is not only more accurate but also more efficient. We note that the Recursive strategy performed poorly in the research of Ben (2012), mainly because of the periodic characteristics of that data. In this study, the physical characteristics of waterlogging mean that the water-level change is generally a monotonically increasing or decreasing process, so Rec can also perform well on the prediction of non-periodic data with an obvious trend.
In this paper, we were concerned only with the lead time for a single identified site. In the future, increasing the number of sensors can enrich the geographic information of waterlogging point locations, including DEM, slope, positive and negative terrain, infiltration rate and other information, so that this kind of model can be extended to the spatial dimension for prediction.
Through grid analysis, all position points in the study area can be traversed and a waterlogging risk map can be drawn.

Figure 1: The Multi-strategy-mode waterlogging prediction framework (MSMWP Framework). Due to the advantages of a black-box model in data-driven methods, machine learning methods can summarize these factors into an overall mechanism. Making full use of the characteristics of accumulated rainfall data will help improve the accuracy of waterlogging prediction.
) and waterlogging depth. (2) Data screening: use domain knowledge to set thresholds to identify abnormal data. (3) Resampling: different sensors have different working mechanisms, so their sampling intervals differ; the sampling interval is unified through the Resampling function in Python to prepare for model training. (4) Data interpolation: data interpolation completes the data after resampling so that the time series is continuous and fits the real situation. (5) Sliding-window segmentation and data integration: according to the mode structure requirements, the time series are segmented with a sliding window and fed into the model through data integration.
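Steps (3) and (4) can be sketched with pandas; the timestamps and depth values below are hypothetical, and only the 5-minute target interval comes from the study:

```python
import pandas as pd

# Hypothetical raw depth readings with irregular, sensor-dependent timestamps
raw = pd.Series(
    [0.00, 0.03, 0.12, 0.08],
    index=pd.to_datetime(["2019-03-08 00:02", "2019-03-08 00:09",
                          "2019-03-08 00:21", "2019-03-08 00:33"]),
)

# Step (3): unify the sampling interval to 5 minutes via resample(),
# then step (4): interpolate so the series is continuous in time.
uniform = raw.resample("5min").mean().interpolate(method="time")
print(uniform)
```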

Figure 2: Waterlogging sensors and weather stations in Shenzhen.

Figure 3: Rainfall region division obtained by the Thiessen polygon algorithm in Shenzhen.

4.2 Step 1: Data preprocessing
There are two main sources of data in this study. The first is the meteorological observation data of Shenzhen provided by the Shenzhen Meteorological Bureau, covering March 8th, 2019, to August 17th, 2020, and including rainfall, wind speed, visibility, temperature and humidity observations at 242 stations in the city. The second is the waterlogging-depth sensor data of 170 observation stations in the city, provided by the Shenzhen Municipal Water Bureau with an accuracy of 1 cm. Changes in waterlogging depth affect parameters such as refractive index and pressure; the sensor senses these changes and converts the physical signals into electrical signals, which are transmitted to the database through optical fibers. The longest time range is from January 1st, 2019, to July 18th, 2020.

Figure 4: Data correlation analysis of rainfall data at weather station_G3795 and waterlogging-depth data at sensor_123; the maximum correlation coefficient with D is 0.61 (D and R02H) and the minimum is 0.26 (D and R06M).
Interpolated data are mainly concentrated at the beginning and end of the waterlogging process, and the values are generally low. This part of the data preprocessing unified the model input-output relationship in the time dimension; the time range of the final rainfall and waterlogging-depth data was unified from 00:00:00 on March 8th, 2019, to 23:55:00 on July 18th, 2020, with an interval of 5 minutes.

Figure 7: Testing results using the LR and KRR methods when m=18.

Figure 8: Testing results of mode (5) and only-D prediction using TR and RFR (both with n=6).

Figure 10: Fitting performance of different regression methods on training set when m=18, n=6.

Figure 11: Performance of the KNN, ETN and LaR methods on the testing set when m=18, n=6: (a) KNN, (b) ETN, (c) LaR.

4.7.2 Step 6: Actual value verification
Actual-value verification takes a subset of the testing set, so the first 85% of the full data set is selected as the training set, which increases the number of training samples and improves the training ability of the model. The time series from 2020/5/26 13:00 to 2020/5/26 17:30 was selected, lasting 4.5 h (Figure 12) and covering the complete process of waterlogging.

Figure A1: Predicted and actual values of the different strategies (the x-axis represents the predicted steps within each group).

The rainfall vector can be represented by [R_{t+1} R_{t+2} ⋯ R_{t+m-1} R_{t+m}] and the waterlogging-depth vector by [D_{t+1} D_{t+2} ⋯ D_{t+n-1} D_{t+n}]. The sharding mode is realized by adjusting the sliding-window sizes m and n. The combined input vector can be represented as [R_{t+1} R_{t+2} ⋯ R_{t+m} D_{t+1} D_{t+2} ⋯ D_{t+n}]. Through the continuous iteration of the sliding window, the prediction output Ŷ = {Ŷ_s} of the selected series is obtained. 1: Using the training data, select different models and regressors for model training. 2: for i ∈ (1, N) do: test the model N times and select the best output.
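The recursive (Rec) strategy, in which a one-step model is iterated and each prediction is fed back into the sliding window, can be sketched as follows. The linear model, window length, and synthetic monotone series are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# Synthetic monotone series standing in for a rising water level
series = np.cumsum(rng.random(300)) / 300

n = 6  # depth-window length
X = np.array([series[i:i + n] for i in range(len(series) - n)])
y = series[n:]
model = LinearRegression().fit(X, y)   # one-step-ahead model

window = series[-n:].copy()
preds = []
for _ in range(6):                     # s = 6 steps of 5 min = 30 min
    nxt = model.predict(window.reshape(1, -1))[0]
    preds.append(nxt)
    window = np.append(window[1:], nxt)  # slide: feed prediction back in
print(np.round(preds, 4))
```

Because the water-level change is generally monotone, the fed-back predictions stay on-trend, which is why Rec remains competitive here despite its known weakness on periodic data.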

Table 4: Model performance using the SOC strategy.
.39s to 83.28s. It should be noted that although the KRR algorithm has good model performance and robustness, its time cost is too high, reaching 241.28 s under the Rec strategy, which makes it difficult to meet the requirement of updating the calculation within 5 minutes. LR, TR and RR can all be updated within three seconds due to their simple structures. RFR has a high time cost because it must traverse all its trees, but it can still meet the update-time requirement (only 65.13 s under the MO strategy).