Comparison of machine learning techniques for reservoir outflow forecasting

Reservoirs play a key role in many human societies due to their capability to manage water resources. In addition to their role in water supply and hydropower production, their ability to retain water and control the flow makes them a valuable asset for flood mitigation. This is a key function, since extreme events have increased in the last decades as a result of climate change, and therefore the application of mechanisms capable of mitigating flood damage will be essential in the coming decades. A good estimation of the outflow of a reservoir can be an advantage for water management or early warning systems. When historical data are available, data-driven models have proven to be a useful tool for different hydrological applications. In this sense, this study analyses the efficiency of different machine learning techniques to predict reservoir outflow, namely multivariate linear regression (MLR) and three artificial neural networks: multilayer perceptron (MLP), nonlinear autoregressive exogenous (NARX) and long short-term memory (LSTM). These techniques were applied to forecast the outflow of eight water reservoirs of different characteristics located in the Miño River (northwest of Spain). In general, the results obtained showed that the proposed models provided a good estimation of the outflow of the reservoirs, improving the results obtained with classical approaches such as considering the reservoir outflow equal to that of the previous day. Among the different machine learning techniques analyzed, the NARX approach provided the best estimations on average.

[…] 17,000 km² (Confederación Hidrográfica del Miño-Sil, 2016) and constitutes an important region of hydroelectric generation. The Miño-Sil river system is one of the most important in the Iberian Peninsula and the one with the highest runoff-to-surface ratio. It is characterized by a pluvial regime with the maximum water flow in the winter season and a minimum in summer (Fernández-Nóvoa et al., 2017), presenting an average annual precipitation of 1,184 mm (Confederación Hidrográfica del Miño-Sil, 2016). Eight reservoirs were selected from the Miño-Sil river system with capacities ranging from 10 to 655 hm³ (see Table 1 for a summary of their main characteristics). A total of 19 years of data at a daily scale were provided, upon request, by the Minho-Sil River Basin Authority (Confederación Hidrográfica del Miño-Sil, https://www.chminosil.es) for the reservoirs under study. The period analyzed spans from October 1, 2000, to September 30, 2019. The time series data include the percentage of filled volume, the inflow and the outflow of the reservoir.

Machine learning models
The available dataset was divided into three different subsets, using roughly the first 70% of the data range for the training subset (from October 1, 2000, to September 30, 2013), the following 15% for the validation subset (from October 1, 2013, to September 30, 2016) and the remaining 15% for the test subset (from October 1, 2016, to September 30, 2019). This criterion was chosen for better interpretation and comparison of the output series produced by the models. Table 2 shows the statistics of the subsets for each reservoir and variable. Once the training phase is completed, the unbiased model performance is tested against the test subset. This approach helps to identify overfitting problems, where a model offers good performance with the dataset used during the training phase but is not able to generalize to new input data. For all the models, the inflow, outflow and volume percent values are used as input data to predict the outflow of the next day.
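As a sketch, assuming the daily series is stored in chronological order, the 70/15/15 partition described above can be expressed as a simple fractional split over the date range (the function name and interface are illustrative, not taken from the paper; the paper's own boundaries fall on hydrological-year dates, so "roughly 70%"):

```python
from datetime import date, timedelta

def chronological_split(dates, frac_train=0.70, frac_val=0.15):
    """Split a chronologically ordered sequence into train/validation/test
    parts, keeping temporal order (no shuffling, as required for time series)."""
    n = len(dates)
    i_train = int(n * frac_train)
    i_val = int(n * (frac_train + frac_val))
    return dates[:i_train], dates[i_train:i_val], dates[i_val:]

# Daily dates spanning the study period (2000-10-01 to 2019-09-30)
start, end = date(2000, 10, 1), date(2019, 9, 30)
dates = [start + timedelta(days=k) for k in range((end - start).days + 1)]
train, val, test = chronological_split(dates)
```

Shuffled splits would leak future information into training, which is why the partition is strictly chronological.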

Multivariate linear regression
The first ML technique, based on multivariate linear regression, was chosen to test the complex neural-network-based techniques against a more conventional approach. MLR can perform better than ANNs in certain applications where the available sample data are small (Markham and Rakes, 1998); in this sense, this model can help to assess whether the available dataset is big enough to use ANN-based models. A relation was established between the outflow and inflow measured on day d and the outflow of the next day d+1, which corresponds to the day under prediction. In addition, this adjustment was carried out not only for each dam but also for different filling levels, in order to also take into account this variable, which plays a key role in the capacity of the dam to retain water. In this case, sections of 10% of the dam filling were considered, which means one adjustment when the occupied volume of the dam is less than 10%, another when the occupied volume is between 10 and 20%, and so on.
Q̂d+1 = c0 + c1·Qd + c2·Id

where Q̂d+1 is the predicted outflow for day d+1, and Qd and Id are the measured outflow and inflow for day d. The three coefficients (c0, c1 and c2) were obtained from the linear fitting depending on the measured filling level of the corresponding dam.
The equation and procedure described above were applied to each dam, using in all cases the first 70 % of the data to obtain the adjustments and the last 30 % to test their efficiency. Although a validation phase is not considered in this methodology, the same dataset partitioning as in the neural network models was used to facilitate the comparison: the first 70 % of the data (training subset) is used to develop the model, and the last 30 % (validation and test subsets) to test its efficiency. Therefore, both the validation and test subsets act as test subsets in this methodology.
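The per-filling-level fitting described above can be sketched as follows. The helper names (`fit_binned_mlr`, `predict_binned_mlr`) are hypothetical, and an ordinary least-squares fit per 10 % volume bin is assumed:

```python
import numpy as np

def fit_binned_mlr(outflow, inflow, volume_pct, bin_width=10):
    """Fit one linear model Qd+1 = c0 + c1*Qd + c2*Id per filling-level bin.
    Returns a dict mapping bin index -> coefficient vector (c0, c1, c2)."""
    Q_next = outflow[1:]  # target: outflow on day d+1
    X = np.column_stack([np.ones(len(Q_next)), outflow[:-1], inflow[:-1]])
    bins = (volume_pct[:-1] // bin_width).astype(int)  # 0-10% -> 0, 10-20% -> 1, ...
    coeffs = {}
    for b in np.unique(bins):
        mask = bins == b
        coeffs[b], *_ = np.linalg.lstsq(X[mask], Q_next[mask], rcond=None)
    return coeffs

def predict_binned_mlr(coeffs, q_d, i_d, vol_d, bin_width=10):
    """Predict day d+1 outflow using the coefficients of the current bin."""
    c0, c1, c2 = coeffs[int(vol_d // bin_width)]
    return c0 + c1 * q_d + c2 * i_d
```

In practice some bins may contain few samples (e.g. a dam that is rarely below 10 % full), which is one reason the paper checks MLR against the data-hungrier ANN models.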

MLP Model
The second type of model developed in this research was ANNs, a computational approach that falls within machine learning. These models are inspired by the biological human brain (Farizawani et al., 2020). An ANN model, like a biological neural network, is formed by several simple processing units joined to each other through weighted connections (Taghi Sattari et al., 2012); the simple processing unit is called a node or neuron. Artificial neural networks present different advantages over traditional approaches to modelling data (Livingstone et al., 1997). Perhaps, according to Livingstone et al. (1997), the most outstanding advantage is that this type of model is capable of fitting complex nonlinear models. However, according to the same authors, this approach also has an important disadvantage: neural networks can suffer from overfitting and overtraining, which can be addressed through good architecture selection and the use of training/control groups to follow the evolution of the model (Livingstone et al., 1997).
The first type of ANN model developed in this research is an MLP-ANN (multilayer perceptron artificial neural network), that is, a feed-forward neural network with a back-propagation algorithm. In this type of ANN, the information moves only in the forward direction, that is, from the input neurons (in the input layer), crossing the hidden neurons (in the hidden layer), to the output neurons (RapidMiner Inc., 2022) (see Figure 2). The back-propagation algorithm is used to fit the model. This kind of supervised algorithm compares the predicted values with the real values to calculate the prediction error; this error is then fed back through the network to adjust each connection weight and reduce the prediction error in the next cycle (RapidMiner Inc., 2022), that is, these ANNs learn by adjusting the connection weights. This process continues until the error reaches a satisfactory level or until a previously established number of cycles has been completed (Taghi Sattari et al., 2012).
This type of artificial neural network is widely used in predictions related to the study of water movement and dam or reservoir management. In this sense, an MLP model with a back-propagation learning algorithm has been used to model the daily inflow into the Eleviyan reservoir (Iran) (Taghi Sattari et al., 2012). This model has been compared with a time lag […]

As previously said, the first ANN models developed were feed-forward neural networks with a back-propagation algorithm. In this kind of ANN, the information passes through different layers. In the input layer, the information is received from the database and sent to the hidden layer, where it is processed. Finally, this new information is sent to the output layer, where a result is generated. The number of neurons in each layer is determined by the nature of the problem. In this sense, the number of neurons in the input layer is fixed by the number of variables (inflow, outflow and volume (%) at day d) that will be used to predict the desired variable (outflow for day d+1). In the output layer, there are as many neurons as variables to be predicted (in this case, one). Finally, in this research, only one hidden layer was used, and the number of neurons was determined by the trial-and-error method (studied between one and seven). The number of cycles was studied between 1 and 131,072 in 17 steps on a logarithmic or linear scale, and the decay parameter was used to decrease the learning rate during the learning process (true or false). The best MLP model (linear or logarithmic scale) was selected based on the lowest RMSE value for the validation subset.

https://doi.org/10.5194/nhess-2022-171 | Preprint. Discussion started: 17 June 2022 | © Author(s) 2022. CC BY 4.0 License.

The different MLP models were implemented on a server (AMD Ryzen 7 1800X, eight-core processor, 3.60 GHz, with 16 GB of RAM) located at the Department of Physical Chemistry of the University of Vigo, Campus of Ourense. The operating system used was Windows 10 Pro 20H2 (64-bit). The MLP models were developed using the RapidMiner Studio 9.8.001 software.
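The paper developed its MLP models in RapidMiner Studio; purely as an illustration of the trial-and-error selection described above (one hidden layer, one to seven neurons, best validation RMSE wins), an equivalent search could be sketched with scikit-learn's `MLPRegressor`. The function name and the `max_iter` value are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def select_mlp(X_train, y_train, X_val, y_val, max_neurons=7, seed=0):
    """Trial-and-error search over 1..max_neurons hidden neurons in a single
    hidden layer, keeping the model with the lowest validation RMSE."""
    best_rmse, best_model = np.inf, None
    for n in range(1, max_neurons + 1):
        model = MLPRegressor(hidden_layer_sizes=(n,), max_iter=2000,
                             random_state=seed).fit(X_train, y_train)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        if rmse < best_rmse:
            best_rmse, best_model = rmse, model
    return best_model, best_rmse
```

Selecting on the validation subset (not the training subset) is what allows the held-out test subset to remain an unbiased estimate of performance.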

NARX model
The last two ANN techniques analysed fall under the umbrella of the so-called recurrent neural networks (RNNs). This kind of ANN is especially suitable for forecasting time series. Nonlinear autoregressive with exogenous inputs (NARX) neural networks are a type of RNN designed for tasks with long-term dependencies on the input data. They can converge and generalize faster than other ANNs (Lin et al., 1996). NARX can use previous input and output data, including a feedback delay for both input and output. There are two typical NARX model architectures: parallel (P) and series-parallel (SP) (Xie et al., 2009) (see Figure 3). In the first one, the output of the model is fed back into the neural network, whereas in the SP architecture the real output value is used during the training phase. This second approach has proven to be more stable and robust (Amirkhani et al., 2022; Narendra and Parthasarathy, 1990). NARX models have been used in multiple hydrological applications; for instance, a NARX model was used […]

Q̂d+1 = f(Qd, Qd-1, …, Qd-ny, Xd, Xd-1, …, Xd-nx)

where Q̂d+1 is the predicted outflow for day d+1, Qd is the measured outflow for day d, Xd are the exogenous input variables for day d and f is a nonlinear function that is approximated by an MLP. The parameters nx and ny refer to the input and output delays. In this work, the values of nx and ny were obtained using cross-correlation functions, and both parameters were set equal to 5 days. The number of hidden neurons was set equal to 8 by the trial-and-error method. The activation functions for the hidden layer of the neural network and the output layer are tan-sigmoid and linear, respectively. The Levenberg-Marquardt algorithm was used to train the model with the mean square error (MSE) as the loss function. The NARX models were developed using MATLAB software (MathWorks Inc., 2022).
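In the series-parallel architecture, measured outputs are fed back during training, so fitting f reduces to ordinary supervised learning on a lagged design matrix. The sketch below builds that matrix; the helper name is hypothetical and the indexing convention (the ny and nx most recent values, rather than the paper's MATLAB internals) is an illustrative assumption:

```python
import numpy as np

def narx_design_matrix(y, X, ny=5, nx=5):
    """Build the series-parallel NARX regression problem: predict y[d+1]
    from the ny most recent outputs and the nx most recent exogenous rows.
    Any approximator f (e.g. an MLP) can then be fitted on (rows, targets)."""
    lag = max(ny, nx)
    rows, targets = [], []
    for d in range(lag - 1, len(y) - 1):
        feats = np.concatenate([y[d - ny + 1:d + 1],          # delayed outputs
                                X[d - nx + 1:d + 1].ravel()])  # delayed inputs
        rows.append(feats)
        targets.append(y[d + 1])
    return np.array(rows), np.array(targets)
```

Once trained, closed-loop (parallel) forecasting replaces the measured outputs with the model's own past predictions, which is where the SP architecture's extra training stability pays off.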

LSTM model
The long short-term memory (LSTM) network, first proposed by Hochreiter and Schmidhuber (1997), employs the so-called LSTM cell, a type of RNN memory cell that stores a short-term state hd and a long-term state cd. It is capable of identifying meaningful input data and storing it in the long-term state, keeping these data as long as necessary and using them when needed. For this reason, this approach is well suited to capturing long-term patterns present in time series. As shown in Figure 4, at each time step, as the cd-1 state enters the cell, some data are dropped in the forget gate and some extra data coming from the input gate are added, resulting in cd. At the same time, a tanh function is applied to cd, and the result passes through the output gate to produce hd, which is equivalent to Q̂d+1. The input data xd and the short-term state hd-1 are fully connected to four layers. The main one, which outputs gd, analyses the input and the previous short-term state as in a regular RNN, but only the most important parts are stored in the long-term state (Géron, 2019):

gd = tanh(Wg [xd, hd-1] + bg)
where tanh is the activation function, Wg is a weight matrix and bg is a bias vector corresponding to the main layer. The remaining three layers are the gate controllers, which use a sigmoidal activation function. Their outputs range from 0 to 1 and, since these outputs are used in an element-wise product (⊗), they have the ability to open or close the gate. The forget gate, which outputs fd, controls which part of the long-term state will be erased:

fd = σ(Wf [xd, hd-1] + bf)

where σ is the sigmoid activation function. The input gate, which outputs id, controls which parts of the main layer will be added to the long-term state:

id = σ(Wi [xd, hd-1] + bi)

The last gate is the output gate, which outputs od and controls which parts of the long-term state should be included in this time step in hd and Q̂d+1:

od = σ(Wo [xd, hd-1] + bo)

Therefore, the cell output and the new short- and long-term states are defined as follows:

cd = fd ⊗ cd-1 + id ⊗ gd
hd = Q̂d+1 = od ⊗ tanh(cd)

The software used for the implementation of the LSTM models was TensorFlow (TensorFlow Developers, 2022). After an exploration of different hyper-parameters, the number of hidden layers was set to one, as no significant benefit was observed when using a higher number. The input window width chosen was 10 days. The input data were scaled using a standard scaler.
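As a minimal sketch, one LSTM time step can be transcribed directly from the gate equations above (the paper used TensorFlow; this plain-NumPy version, with randomly initialized parameters and a concatenated-input weight layout, is only meant to make the data flow concrete):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_d, h_prev, c_prev, W, b):
    """One LSTM time step. W and b hold the parameters of the four fully
    connected layers acting on the concatenation [x_d, h_{d-1}]:
    g (main), f (forget gate), i (input gate), o (output gate)."""
    z = np.concatenate([x_d, h_prev])
    g = np.tanh(W["g"] @ z + b["g"])   # candidate values (main layer)
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to erase from c
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: what to add to c
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: what to expose in h
    c_d = f * c_prev + i * g           # new long-term state
    h_d = o * np.tanh(c_d)             # new short-term state = cell output
    return h_d, c_d
```

Because the gate outputs lie in (0, 1) and act element-wise, each component of the long-term state can be independently kept, overwritten or exposed, which is what lets the cell carry information across many days.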

Metrics
In order to compare the different models and measure their accuracy, several statistical metrics widely used in hydrological applications were employed. More precisely, the Pearson coefficient of correlation (r), the ratio of the root mean square error to the standard deviation of the observed values (RSR), the Nash-Sutcliffe efficiency coefficient (NSE) (Nash and Sutcliffe, 1970) and the percent bias (PBIAS), which are defined by the following equations:

r = Σ (Qobs,i − Q̄obs)(Qfor,i − Q̄for) / [√Σ (Qobs,i − Q̄obs)² · √Σ (Qfor,i − Q̄for)²]

RSR = √Σ (Qobs,i − Qfor,i)² / √Σ (Qobs,i − Q̄obs)²

NSE = 1 − Σ (Qobs,i − Qfor,i)² / Σ (Qobs,i − Q̄obs)²

PBIAS = 100 · Σ (Qobs,i − Qfor,i) / Σ Qobs,i

where Qfor is the forecasted value, Qobs is the observed value, the overbar denotes the mean of the series and N is the total number of samples (all sums run from i = 1 to N). Following the criterion of Moriasi et al. (2007), the statistics for model evaluation can be divided into three categories. First, the standard regression statistics, which measure the linear relationship between the predictions made by a model and the observed data; in this category we considered the Pearson coefficient of correlation (r), which ranges between -1 and 1, with values close to 1 indicating a high degree of positive linear relationship. The second category is the dimensionless statistics, for which we have chosen the NSE. It ranges from -∞ to 1.0 and is a normalized statistic that computes the relative magnitude of the residual variance with respect to the variance of the observed data, with 1.0 being the optimal value. The last category comprises error index statistics, which quantify the deviation of the predicted values from the observed values in the data units used. In this last category, two statistics were chosen: on the one hand, the RSR, which is the ratio of the RMSE (root mean square error) to the standard deviation of the observed data, with 0 being the optimal value; on the other hand, the PBIAS, which calculates the average tendency of the predicted values to underestimate (positive PBIAS) or overestimate (negative PBIAS) the observed series.
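The four metrics above translate directly into NumPy (a minimal sketch; the function names are illustrative):

```python
import numpy as np

def r_pearson(obs, for_):
    """Pearson correlation coefficient between observed and forecasted series."""
    return np.corrcoef(obs, for_)[0, 1]

def rsr(obs, for_):
    """RMSE normalized by the standard deviation of the observations."""
    rmse = np.sqrt(np.mean((obs - for_) ** 2))
    return rmse / np.std(obs)

def nse(obs, for_):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    return 1.0 - np.sum((obs - for_) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, for_):
    """Percent bias: positive -> underestimation, negative -> overestimation."""
    return 100.0 * np.sum(obs - for_) / np.sum(obs)
```

Note that a constant forecast equal to the observed mean yields NSE = 0, so any NSE above zero means the model beats the observed-mean predictor.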

Results and discussion
The statistical parameters used to evaluate the performance of the models in predicting dam outflow are shown in Table 3. The MLP models with a linear scale are shown in this table because they present slightly better average adjustments in the validation phase than the models with a logarithmic scale. The results corroborate that the proposed models offer a "very good" performance for all the […] the maximum level of good functioning defined, and therefore it can be concluded that all the proposed models are able to provide an accurate prediction of dam outflow based on known parameters.
The first approximation to the prediction of dam outflow was made using the simple method of taking as the predicted outflow the outflow measured on the previous day. This is considered the baseline model, against which the rest of the models are compared. Figure 5 shows the metrics averaged over the reservoirs for each model and subset. As can be seen in RSR, NSE and r, all the models improve on the accuracy of the baseline model in every dataset. The performance of the ML models shows no evidence of overfitting problems, where a trained model learns very specific features of the training dataset and fails to generalize to new datasets. The MLR approach was able to outperform the baseline model on the whole dataset but lags behind the ANN-based models. All the ANN models showed similar accuracy based on the RSR, NSE and r metrics and provide a good generalization across the test subset. The LSTM models were slightly better than the MLP, whilst NARX showed the best performance.
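The persistence baseline described above amounts to shifting the observed series by one day; the main subtlety is aligning forecast and observation correctly before scoring. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def persistence_pairs(outflow):
    """Baseline model: the forecast for day d+1 is the outflow observed on
    day d. Returns the aligned (observed, forecast) arrays used for scoring."""
    outflow = np.asarray(outflow, dtype=float)
    obs = outflow[1:]    # what actually happened on days 1..N-1
    for_ = outflow[:-1]  # persistence forecast for those same days
    return obs, for_
```

Any candidate model is then evaluated on exactly the same (observed, forecast) alignment, so its metrics are directly comparable with the baseline's.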
As can be observed in the PBIAS (Figure 5), the baseline does not present any significant bias, since it is the same data series as the observed one but with a delay. All the ML models have a tendency to overestimate the series, especially in the test subset. The LSTM models have the lowest tendency to overestimate and the MLR models the highest. This fact can be very significant depending on the application of the model, where a more conservative estimation towards the worst case may be preferable.
In order to analyse the differences in performance across the different datasets and reservoirs for each model, Figure 6 shows the NSE values for each case. Taking into account that the NSE metric is very popular in hydrology studies, and that no significant differences were found with RSR and r, only the NSE is shown for the sake of clarity. The MLR models provide a good generalization in the Castrelo, Velle, Santo Estevo, San Martinho and Frieira reservoirs, where the performance on the test dataset is very similar to or even better than on the training dataset; these are low-capacity reservoirs (except for the Santo Estevo reservoir) where the regulation capacity is also lower. This tendency is also present in the ANN models, where the lower-capacity reservoirs (Castrelo, Velle, San Martinho and Frieira) show a better generalization ability, whilst in the higher-capacity reservoirs (Belesar, Barcena, Santo Estevo and Peares) the performance of the models on the test subset is lower than on the train and validation subsets. Looking more closely at the data in Table 3, a tendency is detected for reservoirs with a higher capacity to have worse statistics than those of lower capacity. This evidences that higher-capacity reservoirs have a greater ability to regulate the flow of the river according to the desired interest. In similar events, higher-capacity reservoirs have more […]

The NSE values for the test dataset for each reservoir and model are shown in Figure 7, aiming to spot differences in the models' performance depending on the reservoir. The Belesar, Castrelo and Santo Estevo reservoirs show the highest advantage for the ML models. Conversely, in the Barcena reservoir, only NARX and LSTM were able to improve on the baseline approach. The MLR models were able to outperform the baseline except in the Barcena and San Martinho cases, and in the Castrelo case MLR also improved on the MLP model. MLR was never the best option but usually provides results close to the ANN models.
The MLP models always performed better than the baseline model except in the Barcena case; they provide results similar to the other ANN models and very close to LSTM. The NARX models were able to improve on the baseline model and offered the best results in all the reservoirs except the Barcena reservoir, standing as the best overall performer. The LSTM models were also able to consistently outperform the baseline model, being the best model in the Barcena reservoir and the second-best model in the rest of the reservoirs, except in Santo Estevo, where it was outperformed by the MLP and NARX models. From these observations, it can be concluded that a per-reservoir analysis is advisable when developing a data-driven model, since none of the proposed methodologies can be chosen as the best for all the cases.

To illustrate the behaviour of the developed models, the Belesar reservoir was chosen for two main reasons: it has the highest capacity among the analysed reservoirs, and therefore a higher regulation capacity, and it does not have any other reservoir upstream that conditions its behaviour. A comparison of the predicted and observed flow time series for the test period in the Belesar reservoir is shown in Figure 8. It can be seen that all the methodologies show similar performance, but some differences can be highlighted. Both the MLR and MLP models underestimated the main peak of the series, whilst the NARX and LSTM models overestimated it. This fact should be considered when designing systems such as flood EWSs, where the worst-case estimation should be accounted for safety reasons. Conversely, for the lower flows in the dry season, the NARX model was the best performer, providing accurate predictions; however, the LSTM model had some difficulties at very low flow rates, especially in the summers of 2017 and 2019. This makes the LSTM model less suitable for water management systems, where accuracy in the dry season is essential for a better exploitation of the water resources.

Conclusions
This research paper presents an assessment of different ML techniques applied to one-day-ahead reservoir outflow prediction using the previous reservoir volume percent, inflow and outflow data. For this purpose, different models were developed and applied to several reservoirs in the Miño-Sil catchment. The analysis of the obtained results revealed that the proposed ML techniques obtained accurate predictions. The ML models provide significant improvements over the baseline model, showing a good generalization without significant signs of overfitting. On average, the MLR models were able to consistently improve