Online Urban Waterlogging Monitoring Based on Recurrent Neural Network for Classification of Microblogging Text

With global climate change and rapid urbanization, urban flood disasters are spreading and becoming increasingly serious in China. Urban rainstorms and waterlogging have become an urgent challenge that calls for real-time monitoring and further prediction to improve urban construction. In this paper, we trained a recurrent neural network (RNN) model to classify microblogging posts related to urban waterlogging, and established an online monitoring system for urban waterlogging caused by flood disasters. We manually curated more than 4,400 waterlogging posts to train the RNN model so that it can precisely identify waterlogging-related posts on Sina Weibo and detect urban waterlogging in a timely manner. The RNN model was thoroughly evaluated, and our experimental results showed that it achieved higher accuracy than traditional machine learning methods such as SVM and GBDT. Furthermore, we built a nationwide map of urban waterlogging based on the most recent two years of microblogging data.


Introduction
Due to climate change and rapid urbanization, flood disasters worldwide have become increasingly frequent and serious, leading to traffic jams, environmental pollution, and risks to residents' travel and health (Wheater et al., 2009; Kuklicke et al., 2016; Sofia et al., 2017). It is therefore crucial to address early warning and monitoring of flood disasters for the sake of life and property safety. However, natural disasters and emergencies such as earthquakes and floods are difficult to predict, as there is usually not enough data to train an effective prediction model.
Existing methods for flood early warning generally build forecasting models on meteorological and hydrological data. Researchers build hydrological models that take various factors into account and simulate the occurrence, progression and consequences of flood disasters (Tawatchai et al., 2005; Yu et al., 2015; Anselmo et al., 1996; Lima et al., 2015). Subsequently, multilevel thresholds corresponding to different warning levels are determined from the simulation results of the hydrological models. Some scholars have also developed dedicated methods for early warning of flood disasters. For example, Xiao et al. developed a flood forecasting and early warning method based on similarity theory and a hydrological model to extend the lead time and achieve dynamic rolling forecasting (Xiao et al., 2019). Kang et al. proposed a flood warning method based on dynamic critical precipitation (Kang et al., 2019). To meet the practical needs of flood early warning, Liang et al. constructed a new "grade-reliability" comprehensive evaluation of the accuracy of flood early warning. The method combines grade prediction accuracy criteria with uncertainty analysis to evaluate the reliability of the predicted results. Also, Liang discussed how to determine the flood early warning and forecasting time by rising-rate analysis, based on how the flood rising rate changes over time in historical data.
https://doi.org/10.5194/nhess-2020-335 Preprint. Discussion started: 16 November 2020. © Author(s) 2020. CC BY 4.0 License.
The rapid development of the mobile internet and smartphones has boosted various social media platforms, such as Weibo and Twitter. Weibo has become very popular among Chinese people, and accordingly it has become an important source of information on floods and natural disasters (Robinson et al., 2014). Some researchers have explored social media to extract information about disasters. For example, de Bruijin et al. proposed a database for detecting floods in real time on a global scale using Twitter. This database was developed using 88 million tweets, from which they derived over 10,000 flood events (de Bruijin et al., 2019). Cheng et al. used Sina Weibo data to reveal public sentiment toward natural disasters on social media in the context of East Asian culture. They analyzed the Pearson correlation between information dissemination and precipitation, and identified important accounts and their information in social networks through visual analysis (Cheng et al., 2019). Barker et al. developed a prototype of a national-scale Twitter data mining pipeline for improved stakeholder situational awareness during flooding events across Great Britain, by retrieving relevant social geodata grounded in environmental data sources (flood warnings and river levels) (Barker et al., 2019). Wang et al. analyzed the subject words and user sentiments of earthquake events based on the week-long discussion on Sina Weibo after the earthquake in Japan (Wang et al., 2012). Zhang et al. used the Shanghai Bund stampede as an example to analyze and evaluate the government's ability to release information and respond in emergencies, based on the response time, response speed, content and interaction of government microblogs after the incident. They put forward concrete suggestions for making government information release more effective (Zhang et al., 2015).
Text classification is a hot topic in the field of Natural Language Processing (NLP) (Hu et al., 2018; Kim, 2014), and has been widely used to identify important information of interest from social media. In this paper, we employed text classification of Weibo posts to identify urban waterlogging caused by floods. By manually collecting more than 4,400 waterlogging-related Weibo posts from 2017 to 2018, we built a gold-standard dataset to train a text classification model. We tested three popular models, namely recurrent neural network (RNN), support vector machine (SVM) and gradient boosting decision tree (GBDT), and found that the RNN achieved the best performance when evaluated on an independent test set that contains 400 Weibo posts (positive and negative) from 2019. Furthermore, we built a nationwide map of urban waterlogging based on the most recent two years of microblogging data, and a monitoring system based on a WeChat applet that alerts the user via voice alarm when he/she approaches a waterlogging point.
To the best of our knowledge, this is the first manually validated and the largest dataset of Weibo posts related to urban flood deposits, and it is ready for building text classification models that identify Weibo posts that truly report waterlogging events. Also, we are the first to build a nationwide map with more than 6,000 waterlogging points, which covers most cities in China. Furthermore, the RNN model trained on our dataset can precisely identify flood deposits via online Weibo classification, and our monitoring system based on a WeChat applet would help users reduce the risk and loss caused by floods.

Materials and methods
The overall framework of our method is shown in Figure 1.

Data sources
The Weibo posts were obtained using the Sina Weibo API. To exclude completely irrelevant posts, we downloaded only the Weibo posts that include the keywords "淹" or "积水" (drowning or waterlogging in Chinese). All microblogging texts were saved in a comma-separated values (CSV) file to form a data table, including the author, the microblogging text, the posting time and the location of the author. To localize urban waterlogging points geographically from the content of Weibo posts, we also downloaded a catalog of communities in 307 cities across China, including community name, geographical location, floor area ratio, greening rate and other information, from anjuke.com, a well-known housing website in China.

Data cleaning
Weibo posts contain a large number of repetitions, because the commenting function forwards the original text and image content. A hot post may be commented on many times, but the main body of the resulting Weibo posts is the same. The "@ (retweet)" function also produces many duplicates. We removed duplicates with a string-matching rule that compares a number of leading characters of two posts. For example, if the first 15 characters of two posts were the same, they were considered duplicates and only one was kept.
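The prefix-matching rule can be illustrated with a short Python sketch. The function name `drop_prefix_duplicates` and the sample posts are hypothetical; the prefix length of 15 is the example value given in the text, and keeping the first occurrence is an assumption, since the text does not say which copy was retained.

```python
def drop_prefix_duplicates(posts, prefix_len=15):
    """Keep the first post seen for each distinct leading `prefix_len` characters."""
    seen = set()
    unique = []
    for text in posts:
        key = text.strip()[:prefix_len]  # posts shorter than prefix_len use their full text
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Made-up example posts: the second one is a forward of the first.
posts = [
    "暴雨过后，小区门口的道路积水严重，车辆无法通行，请大家绕行",
    "暴雨过后，小区门口的道路积水严重，车辆无法通行，请大家绕行 //@user: 转发",
    "今天天气不错",
]
unique_posts = drop_prefix_duplicates(posts)
```

A set of prefixes keeps the pass linear in the number of posts, which matters when hundreds of thousands of posts are compared.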
On the other hand, a large share of the Weibo posts were actually irrelevant to floods and waterlogging. For example, many posts contained the Chinese keywords for drowning or waterlogging mentioned above, but discussed diseases such as "hydrocephalus" or "knee dropsy". Such Weibo posts had nothing to do with our task. We manually checked all the Weibo posts and removed those irrelevant to flood waterlogging. We finally obtained the Weibo posts that were closely related to urban flood deposits.

Localization of flood deposits
First, the Weibo posts that contained words like "certain community" and "highland" were excluded, as the location of the flood deposits mentioned in these posts was difficult to determine.
Next, we extracted the terms describing communities, roads and orientation from the posts. Subsequently, we matched these terms against the catalog of communities of 307 cities in China so that we could determine the geographic location of the flood deposits reported in these Weibo posts. In practice, we found many overlaps among community names, because some communities share the same name but are located in different cities. Matching on the community name alone would lead to wrong localization of flood deposits, especially when no city or province is mentioned in the post. We therefore manually checked the communities whose names occur in more than one city to ensure that the flood deposits were correctly located.
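The matching step and the name-collision problem can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the study: the function `locate_post`, the catalog layout, the community names and the coordinates are all hypothetical placeholders, and ambiguous cases are only flagged here, whereas in the study they were resolved by manual checking.

```python
def locate_post(text, catalog):
    """Return (status, entry); status is 'located', 'ambiguous' or 'unmatched'.

    catalog maps a community name to a list of (city, lat, lon) entries.
    """
    for name, entries in catalog.items():
        if name not in text:
            continue
        if len(entries) == 1:
            return "located", entries[0]
        # Same community name in several cities: use a city named in the post, if any.
        named = [entry for entry in entries if entry[0] in text]
        if len(named) == 1:
            return "located", named[0]
        return "ambiguous", None  # left for manual checking
    return "unmatched", None


# Placeholder catalog; names and coordinates are invented for illustration.
catalog = {
    "阳光花园": [("南京", 32.05, 118.78), ("武汉", 30.58, 114.29)],
    "翠屏山庄": [("南京", 31.93, 118.80)],
}
```

Returning an explicit "ambiguous" status keeps the automatic step conservative: a post is only located when exactly one catalog entry is consistent with its text.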
Finally, we noticed that multiple Weibo posts were often located at the same flood deposit. For such posts, we kept only one if they were posted within one day, and kept all of them if they were posted on different days, because the rainfall events that led to the waterlogging were different. In total, we obtained 4,451 Weibo posts that were successfully located at urban flood deposits. Some examples are shown in Table 1.
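The per-day rule can be sketched as below. The function name, the tuple layout and the timestamp format are hypothetical, and "within one day" is interpreted here as "on the same calendar day", which is an assumption; a rolling 24-hour window would need a different grouping.

```python
from datetime import datetime


def one_post_per_site_per_day(located_posts):
    """located_posts: iterable of (site_id, 'YYYY-MM-DD HH:MM', text) tuples.

    Keeps the first post for each (site, calendar day) pair.
    """
    kept, seen = [], set()
    for site_id, timestamp, text in located_posts:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date()
        if (site_id, day) not in seen:
            seen.add((site_id, day))
            kept.append((site_id, timestamp, text))
    return kept


# Made-up records: two posts about site "A" on the same day, one on a later day.
records = [
    ("A", "2018-07-05 08:10", "post 1"),
    ("A", "2018-07-05 17:45", "post 2"),
    ("A", "2018-07-12 09:00", "post 3"),
    ("B", "2018-07-05 08:30", "post 4"),
]
kept_records = one_post_per_site_per_day(records)
```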

Selection of negative samples
The Weibo posts that contain the specific location of a flood deposit were taken as positive samples and labeled as 1. Negative samples were selected from the Weibo posts that contain the Chinese keywords for drowning or waterlogging but are irrelevant to floods and waterlogging. For example, posts that actually discussed diseases such as hydrocephalus or knee dropsy were regarded as negative samples. The positive samples were excluded during the selection of negative samples. Finally, we built a training dataset that includes 4,451 positive samples (labeled as 1) and 246,341 negative samples (labeled as 0).

Extract feature vectors and construct training set
To train a text classifier, it is necessary to transform a Weibo post, which is typically a string of words, into a feature vector suitable for classification. The first preprocessing step consisted of Chinese word segmentation and removal of stop words. Thereafter, we constructed a bag of words by extracting the unique words in the training set, and then built feature representations of each Weibo post based on the word vector space and the word2vec model.

Data preprocessing
Because Sina Weibo posts are written in Chinese, the text has no natural separator between words. Therefore, it is necessary to perform Chinese word segmentation on the Weibo posts, which is a basic step in Chinese natural language processing. We used the Jieba tool to segment the Weibo posts into words. Chinese has many common auxiliary words and prepositions, such as "的 (of)" and "在 (in)", which should be removed with the help of a stop-word dictionary. Also, many words that appeared in the Weibo posts were useless for our task, such as "视频 (video)" and "微博 (Weibo)". These words were also added to the stop-word dictionary.

Feature representation
The purpose of feature representation is to encode the content of a Weibo post as a numeric vector. We considered two popular methods, TF-IDF and word2vec. TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used term-weighting scheme in information retrieval and data mining. Word2vec models the context of words and the semantic relationship between words and their context, and maps words to a low-dimensional real-valued space to generate the corresponding word vectors (Wang et al., 2018). This paper used both TF-IDF and word2vec for the feature representation of Weibo posts. TF-IDF was used to build feature vectors as input to the SVM and GBDT classifiers, while word2vec was used for the RNN, SVM and GBDT classifiers.
The TF-IDF scheme is a statistical method for evaluating the importance of a word in a document. TF is the term frequency, formally written as $tf(t,d)$, which is the frequency with which the term t appears in the document d and reflects the correlation between t and d. IDF is the inverse document frequency, formally written as $idf(t)$, which quantifies how the term is distributed over a collection of documents. It is commonly calculated as $idf(t) = \log(N/n_t + 0.01)$, in which $N$ represents the total number of documents and $n_t$ represents the number of documents in which term t appears (Xiong et al., 2008). TF-IDF represents the importance of relevant terms in the document space. It is calculated as in Eq. (1) (Salton et al., 1998):
$w(t,d) = tf(t,d) \times idf(t)$ (1)
where $w(t,d)$ represents the weight of term t in document d.
The larger the TF-IDF value of a term, the higher its importance in the document. By calculating the TF-IDF value of each word in a Weibo post, we can construct a real-valued vector representation of the post. This paper used the TfidfVectorizer in the scikit-learn Python package to calculate the TF-IDF value of each word in the Weibo posts. According to the TF-IDF values, the top 5,000 words were selected to construct the dictionary, and each post was subsequently converted into a 5,000-dimensional vector.
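As a worked illustration of the weighting described in this section, the following minimal Python sketch computes $tf(t,d) \times \log(N/n_t + 0.01)$ directly. It is not the TfidfVectorizer implementation: scikit-learn applies a differently smoothed IDF and L2 normalization by default, so its numbers will differ. The function name and the toy documents are hypothetical, and taking TF as the relative frequency within a post is an assumption.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_t: number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term] + 0.01)
            for term, count in term_counts.items()
        })
    return weights


# Toy segmented posts (made up): "积水" occurs in two of the three documents.
docs = [["道路", "积水", "严重"], ["膝盖", "积水"], ["道路", "拥堵"]]
w = tfidf_weights(docs)
```

In the first toy document, "严重" receives a larger weight than "积水" because it appears in fewer documents, which is the behaviour the IDF factor is meant to produce.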
Word2vec uses a neural network to map text content into vector representations in a K-dimensional space, and similarity in the vector space can be used to represent the semantic association between words (Mikolov et al., 2013). Word2vec takes a large corpus of text as its input to produce a vector space, and assigns each unique word a distributed representation in that space. We used the Gensim library (Gesim, 2019), with the urban waterlogging-related Weibo posts as input, to train a 100-dimensional word2vec model with the skip-gram architecture and obtain the vector representation of each word. The structure of the skip-gram model is shown in Figure 2. Its underlying rationale is to predict the context given a certain word. The specific calculation is given in Eq. (2).
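To make the skip-gram idea of predicting the context from a given word concrete, the sketch below enumerates the (center word, context word) training pairs that a skip-gram model learns from. It shows only how the pairs are formed, not the Gensim training itself; the function name, the example sentence and the window size are hypothetical, since the window used in the study is not stated.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up segmented post, with a window of one word on each side.
pairs = skipgram_pairs(["小区", "门口", "积水", "严重"], window=1)
```

With a window of 1, "积水" is paired with its two neighbours but not with "小区", which lies two positions away.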

Undersampling
Our whole dataset contains 246,341 negative samples and 4,451 positive samples. Because the negative samples vastly outnumber the positive samples in the training dataset, the imbalance often leads to an ill-structured decision boundary that is overwhelmed by the majority class and ignores the minority class (Chawla et al., 2004). For example, at an imbalance ratio of 99:1, a classifier that minimizes the error rate would classify all examples into the majority class to achieve a low error rate of 1%, while in fact all minority examples are misclassified. Therefore, the imbalance problem must be handled carefully to build a robust classifier when the degree of imbalance is large (Liu et al., 2008). We therefore undersampled the majority class: we selected the same number of negative samples as positive samples, and then combined them to create a balanced dataset to train the classification model. The undersampling process was repeated enough times to ensure that every sample was seen by the classifier.
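One possible way to realize repeated undersampling so that every negative sample is used at least once is sketched below. It is an assumption for illustration, not necessarily the sampling scheme of the study: the negatives are shuffled and cut into chunks of the same size as the positive set, and each chunk is paired with all positives. The function name and the toy samples are hypothetical. Under this scheme, the counts reported above (4,451 positives, 246,341 negatives) would give 56 balanced rounds.

```python
import random


def balanced_rounds(positives, negatives, seed=0):
    """Build balanced training sets that together cover every negative sample.

    Assumes len(negatives) >= len(positives).
    """
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    k = len(positives)
    rounds = []
    for start in range(0, len(shuffled), k):
        chunk = shuffled[start:start + k]
        if len(chunk) < k:
            # Top up the last chunk from earlier negatives so the round stays balanced.
            chunk += rng.sample(shuffled[:start], k - len(chunk))
        rounds.append([(x, 1) for x in positives] + [(x, 0) for x in chunk])
    return rounds


# Toy data: 3 positives and 7 negatives give 3 balanced rounds.
positives = ["p1", "p2", "p3"]
negatives = [f"n{i}" for i in range(7)]
rounds = balanced_rounds(positives, negatives)
```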

Training of classifiers
We tested three popular models, namely recurrent neural network (RNN), support vector machine (SVM) and gradient boosting decision tree (GBDT), using both TF-IDF features and the distributed representation derived from the word2vec algorithm.

Recurrent neural network (RNN)
Recurrent neural networks (RNNs) are among the most commonly used models in natural language processing. An RNN processes sequential data: it takes a sequence as input, performs recursion along the direction in which the sequence evolves, and connects all nodes (recurrent units) in a chain. The recurrent neural network and its unfolded structure are shown in Figure 3 (LeCun et al., 2015). The RNN introduces a directed loop that passes the hidden state $h_{t-1}$ of the previous time step forward to compute the hidden state $h_t$ at the current time step, so that information persists and dependencies between earlier and later inputs can be captured (Liu et al., 2018). In this way, the current hidden state depends on both the current input and the previous hidden state, which gives the network its memory function.
Unfortunately, it is difficult for a plain RNN to learn long-distance dependencies in a sequence, which degrades classification performance. Therefore, this paper adopted an RNN improved with long short-term memory (LSTM). The LSTM network (Hochreiter et al., 1997) is a special RNN that can learn long-distance dependencies. The key design of the LSTM is the cell state that runs through the entire network. "Gate" structures remove information from or add information to the cell state, thereby updating the hidden state of each layer. In this paper, each hidden layer of the RNN was replaced with an LSTM cell that has a memory function. The LSTM has great advantages in processing time series and language text sequences (Niu et al., 2018). The network structure of the LSTM is shown in Figure 4. The LSTM can effectively overcome the vanishing gradient problem. Intuitively, the forget gate controls how much of each unit of the memory cell is erased, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state (Liu et al., 2016). In the LSTM update equations, $\sigma(\cdot)$ denotes the sigmoid activation function, $\otimes$ denotes element-wise multiplication, $W_*$ and $U_*$ are the weight matrices of the network, $b_*$ is the bias term, and $i_t$, $o_t$ and $f_t$ are the values of the input gate, output gate and forget gate at time t, respectively.
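For reference, the gate mechanism described above is usually written as follows. This is the standard LSTM formulation, given as a sketch rather than as the exact equations of this study. The symbols $x_t$ (input at time t), $c_t$ (cell state), $\tilde{c}_t$ (candidate cell state) and the $W$/$U$/$b$ naming of the weights and biases are notation assumed here.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t \\
h_t &= o_t \otimes \tanh(c_t)
\end{aligned}
```

The additive form of the cell-state update, $c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$, is what lets gradients flow across many time steps and mitigates the vanishing gradient problem.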
During the experiments, the input layer of the RNN model received the matrix of word vectors for the words in a sentence. For example, if a microblog post contains n words and the word vectors have dimension k, the input matrix has size n × k. The gate structure of the LSTM then removes information from or adds information to the cell state of the network to update the hidden state of each layer. The hidden state is fed into the softmax layer, which outputs the final classification result.

Support vector machine
The support vector machine (SVM) classifier is widely used to solve binary classification problems. The SVM model is defined as the linear classifier with the maximum margin in the feature space. Its basic idea is to find the separating hyperplane that divides the training data set correctly and has the largest geometric margin (Tan et al., 2008). The SVM model is suitable for classification in high-dimensional spaces and performs well on small data sets.

Gradient boosting decision tree
Gradient boosting decision tree (GBDT) is a boosting algorithm proposed by Friedman (Friedman, 2000) in 2001. GBDT is an iterative decision tree algorithm composed of multiple decision trees, and the outputs of all trees are added up to make the final decision. Each new model is built along the gradient descent direction of the loss function of the previously built model, so that GBDT performs better than traditional boosting algorithms, which reweight the correctly and wrongly classified samples after each round of training (Feng et al., 2017).

Confusion matrix
The confusion matrix, also known as the contingency table or error matrix, is a table that visualizes the performance of a classifier. The rows represent the predicted values, whereas the columns represent the actual values.
The categories used in analysis are false positive, true positive, false negative, and true negative. The structure of the confusion matrix is shown in Table 2.
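The four categories can be counted from the true and predicted labels as in the following minimal Python sketch, where label 1 denotes a waterlogging-related post. The function name and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy labels for six posts.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
```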

Hyperparameters optimization
When building a deep learning model, the selection of hyperparameters is very important. This paper used a grid search to find the optimal hyperparameters of the RNN. Grid search is an optimization strategy that exhaustively enumerates specified candidate parameter values to select the best ones. The hyperparameters to be selected include the optimizer, the batch size and the keep probability, among others. To reduce the computational overhead, we fixed the other parameters and varied one parameter at a time in the experiments.

The optimizer is used to update the weights when training a deep neural network. We tested several optimizers, including SGD and the adaptive methods Adagrad, Adadelta and Adam. As can be seen from Figure 5a, the Adam optimizer achieved the highest accuracy, 77%. Therefore, we selected Adam as the optimizer in the subsequent performance evaluation experiments.
The batch size is the number of samples processed in one training step, and its value affects the training speed and model optimization. As shown in Figure 5b, the accuracy of the RNN model declines for batch sizes above 16 but remains above 0.76 across the range from 16 to 128. When the batch size is 16, the accuracy reaches its highest value, 0.7857. Therefore, the batch size was set to 16. We adjusted the dropout keep probability to prevent overfitting and improve the generalization ability of the model. As can be seen from Figure 5c, when the keep probability is equal to 0.5, the RNN model reaches its highest accuracy, 0.7854. Therefore, the keep probability was set to 0.5.
In addition, the learning rate and the number of epochs were tuned to 0.001 and 5, respectively. The selected values of the important parameters are shown in Table 3.
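The tuning procedure, in which one hyperparameter is varied while the others are held fixed, can be sketched as follows. This is a simplified illustration: `sweep_one_at_a_time` is a hypothetical helper, the candidate values for batch size and keep probability are assumed, and `toy_score` is a made-up stand-in for training and validating the RNN, not measured accuracy.

```python
def sweep_one_at_a_time(evaluate, defaults, grid):
    """Tune each hyperparameter in turn, keeping the best value found so far."""
    best = dict(defaults)
    for name, candidates in grid.items():
        scores = {}
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            scores[value] = evaluate(trial)  # in practice: train, then validate
        best[name] = max(scores, key=scores.get)
    return best


def toy_score(params):
    """Made-up scoring function used only to make the sketch runnable."""
    bonus = {"SGD": 0.0, "Adagrad": 0.1, "Adadelta": 0.1, "Adam": 0.2}
    return (bonus[params["optimizer"]]
            - params["batch_size"] / 1000
            - abs(params["keep_prob"] - 0.5))


grid = {
    "optimizer": ["SGD", "Adagrad", "Adadelta", "Adam"],
    "batch_size": [16, 32, 64, 128],     # assumed candidate values
    "keep_prob": [0.3, 0.5, 0.7, 0.9],   # assumed candidate values
}
defaults = {"optimizer": "SGD", "batch_size": 64, "keep_prob": 1.0}
best_params = sweep_one_at_a_time(toy_score, defaults, grid)
```

Compared with a full grid over all combinations, this sweep evaluates the sum rather than the product of the candidate counts (12 instead of 64 trials here), at the cost of ignoring interactions between hyperparameters.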

Verification of flood deposits based on AutoNavi waterlogging map
In cooperation with the Public Meteorological Service Center of the China Meteorological Administration, AutoNavi Map, a well-known online map service provider in China, has released a nationwide map of urban waterlogging based on AutoNavi's own road and traffic data and on historical flood deposits reported by traffic police. The AutoNavi map app visualizes the urban flood deposits, which can be retrieved by users.
We collected the flood deposits in the AutoNavi waterlogging map, including the degree of flooding and the latitude and longitude of each flood deposit. Next, we selected the flood deposits in Nanjing from the AutoNavi waterlogging map to check how many of them overlapped with the flood deposits identified from Weibo posts.

Performance measures
We adopted a variety of measures to evaluate the performance of our proposed method, including accuracy (ACC), precision (P), recall (R) and F1-measure (F1). The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also used as criteria for performance evaluation. The accuracy (ACC) is defined as the ratio of the correctly classified samples to the total number of samples, as in Eq. (5). The precision (P) is the proportion of correctly classified positive samples among all samples predicted as positive, as in Eq. (6). The recall (R) is the ratio of the correctly classified positive samples to the total number of positive samples, as in Eq. (7). The F1-measure (F1), also referred to as the F1-score, is the harmonic mean of precision and recall, as in Eq. (8) (Tharwat, 2018). The higher the value of F1, the better the performance of the method. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (Receiver, 2020). The AUC metric is the area under the ROC curve (Tharwat, 2018). The higher the AUC value, the better the performance (Yu et al., 2014).
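These measures can be computed from the confusion-matrix counts as in the sketch below. The function names and toy inputs are hypothetical. The AUC is computed through its rank interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count one half), which equals the area under the ROC curve.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


acc, prec, rec, f1 = classification_metrics(tp=2, fp=1, fn=1, tn=2)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a 400-post test set; a sort-based computation would be preferable for much larger sets.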

Performance evaluation on validation set
We tested two types of feature representation, TF-IDF and word2vec, on three classifiers: SVM, GBDT and RNN. Note that the RNN model cannot be applied to the TF-IDF feature representation, as its input requires sequences. We randomly selected 200 samples from each of the positive and negative sample sets as the validation set (400 samples in total), and used the remaining samples as the training set. The performance measures were computed from the predictions of the learned models on the validation set. The process was repeated 50 times, and the averages were reported as the final performance measures. As shown in Table 4, the RNN classifier based on the word2vec feature representation achieved the best performance. The accuracy, recall and F1 value of this model exceed 96%, and the AUC value reaches 0.99, indicating that the RNN is an effective model for online classification of waterlogging-related posts. According to the experimental results, the RNN model based on word2vec features generally performs better than the traditional classifiers, namely GBDT+TF-IDF (ACC=0.86), GBDT+word2vec (ACC=0.81), SVM+TF-IDF (ACC=0.88) and SVM+word2vec (ACC=0.87).

Performance evaluation on Independent set
To further verify the effectiveness of the RNN model, we built an independent test set to evaluate the model. The independent test set contains 400 Weibo posts (200 positive and 200 negative) from 2019. These posts were downloaded using the keywords "淹" or "积水" (drowning or waterlogging in Chinese) and then manually checked. Each trained model was applied to the independent test set and the performance measures were computed, as shown in Table 5 and Figure 6. The RNN model based on word2vec features significantly outperforms all other models and achieves the highest accuracy, recall and F1 values. It also achieved the largest AUC value, 0.95. Furthermore, the word2vec feature representation effectively improves the performance of the text classification models on the independent test set. This shows that the word vectors generated by the word2vec model are more informative than traditional TF-IDF in expressing the features of microblogging posts related to urban waterlogging.
For both GBDT and SVM, the models trained on word2vec features performed better on the independent test set than those trained on TF-IDF. In fact, we find similar results in Table 4: RNN+word2vec obtained the best performance, followed by SVM+word2vec.
[...] of which only 29 posts were actually related to these flood deposits. Taking all these posts as input, the trained RNN model achieved an accuracy of 0.836. This experiment confirmed that the RNN model proposed in this paper is applicable to online monitoring of urban waterlogging. It is worth noting that the AutoNavi waterlogging map is no longer updated, so there is a pressing demand for a new monitoring system for urban waterlogging.

Nationwide map of urban waterlogging
We built a nationwide waterlogging map with more than 6,000 flood deposits based on microblogging data from 2017 to 2018, as shown in Figure 8. The map was generated with ArcGIS 10.6. The small black dots represent the flood deposits (Figure 8a), and the orange dots represent the number of flood deposits within each province (Figure 8b): the more flood deposits in a province, the larger its orange dot. We also notice an overall correlation between economic development and the number of urban flood deposits. We used the GDP of each province as its background color in the map, i.e., the higher the GDP, the darker the color. It can be seen that the provinces in the eastern and central regions are more developed than those in the northeastern and western regions of China, as shown in Figure 8b. Accordingly, we found that the number of flood deposits in the eastern and central provinces is larger than in the northeastern and western provinces. This phenomenon may be caused by rapid urbanization and ground hardening, which greatly reduce water permeability. The potential economic loss in developed areas is larger, so a real-time monitoring system for urban waterlogging disasters is more important there.

Monitoring system based on WeChat applet
To facilitate the use of the nationwide map of urban waterlogging, we developed an applet based on WeChat, a popular mobile social app in China. The applet visualizes all urban waterlogging points on the map of China, as shown in Figure 9. When the user clicks on a waterlogging icon, a popup dialog displays detailed information about the waterlogging point, such as the location description, longitude and latitude. In addition, the applet runs a background monitoring process that computes the distance from the current position to nearby waterlogging points. If the distance to the nearest waterlogging point is less than a predefined threshold, the applet triggers a voice alarm, such as "warning, warning, waterlogging ahead". This function would greatly benefit car drivers in rainstorm weather.
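The distance check behind the voice alarm can be sketched as follows. The sketch is written in Python for consistency with the rest of the analysis rather than in the applet's own language. The function names, the 500 m threshold and the coordinates are hypothetical, as the threshold used by the applet is not stated. The great-circle distance is computed with the haversine formula.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def should_alarm(position, waterlogging_points, threshold_m=500.0):
    """True if the nearest waterlogging point is closer than threshold_m (assumed value)."""
    if not waterlogging_points:
        return False
    nearest = min(haversine_m(*position, *point) for point in waterlogging_points)
    return nearest < threshold_m


# Placeholder coordinates for illustration only.
points = [(32.0600, 118.7900)]
near = should_alarm((32.0610, 118.7900), points)   # about 111 m away
far = should_alarm((32.1000, 118.7900), points)    # several kilometres away
```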

Discussion
In this paper, we trained text classification models to identify microblogging text related to urban waterlogging caused by flood disasters. By manually collecting more than 4,400 waterlogging-related Weibo posts from 2017 to 2018, we built a gold-standard dataset to evaluate three text classification models: RNN, SVM and GBDT. Our experimental results showed that the RNN model achieves higher accuracy than the other two classifiers on an independent test set. We also found that the feature representation extracted by word2vec improves performance compared with the traditional TF-IDF features. Furthermore, we built a nationwide map of urban waterlogging based on the most recent two years of microblogging data, and a monitoring system based on a WeChat applet that alerts the user via voice alarm when he/she approaches a waterlogging point.
The limitation of our study is that the number of positive samples in the data set is relatively small. In future work, we plan to extend the scale of the data set to build a predictive model with better performance. Meanwhile, we will consider other deep learning models, such as convolutional neural networks (CNN), to integrate remote sensing images with social media, improve the prediction of urban flood deposits and develop a more powerful monitoring system for urban waterlogging.