Improving computational efficiency of GLUE method for hydrological model uncertainty and parameter estimation using CPU-GPU hybrid high performance computer cluster

Abstract: The Generalized Likelihood Uncertainty Estimation (GLUE) method has thrived for decades, and a huge number of applications in the field of hydrological modeling have proved its effectiveness in uncertainty and parameter estimation. However, for many years the poor computational efficiency of GLUE has hampered its further application. A feasible way to solve this problem is to integrate modern CPU-GPU hybrid high performance computer cluster technology to accelerate the traditional GLUE method. In this study, we developed a CPU-GPU hybrid computer cluster-based, highly parallel, large-scale GLUE method to improve its computational efficiency. Intel Xeon multi-core CPUs and NVIDIA Tesla many-core GPUs were adopted. The source code was developed using MPICH2, C++ with OpenMP 2.0, and CUDA 6.5. The parallel GLUE method was tested with a widely used hydrological model (the Xinanjiang model) to investigate its performance and scalability. Comparison results indicated that the parallel GLUE method outperformed the traditional serial method and has good application prospects on supercomputer clusters such as the ORNL Summit and Sierra of the TOP500 supercomputers around the world.

The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications in C/C++ and Fortran, on platforms from the desktop to the supercomputer. In this research, we adopted the OpenMP implementation in Visual C++ 2010 to develop the parallel code; the OpenMP C and C++ application program interface lets us write applications that effectively use multiple processors. Visual C++ 2010 supports the OpenMP 2.0 standard.


Introduction
Therefore, uncertainty and parameter estimation have become a necessary procedure in the application of hydrological models (Li, 2005).

In recent years, hydrologists have carried out comprehensive studies on uncertainty analysis and parameter estimation for hydrological models and have obtained significant achievements. The Generalized Likelihood Uncertainty Estimation (GLUE) has become the most widely applied method for model uncertainty analysis and parameter estimation. The GLUE method was proposed by Beven in 1992 (Beven and Binley, 1992) for studying the equifinality phenomenon of hydrological models, and it has since been widely applied in hydrological studies. Li (Li and Liang, 2006; Shu et al., 2008) used three typical watersheds as examples and adopted the GLUE method to study the parameter uncertainty of the Xinanjiang model, which assessed and confirmed the applicability of the GLUE method. Huang (Huang and Xie, 2007) applied the GLUE method in the Xingfeng watershed of the Dong River and studied the uncertainty of TOPMODEL in hydrological forecasting. Muleta (Muleta and Nicklow, 2005) studied discharge and sediment simulation using the SWAT model in the Creek catchment in Illinois, U.S.A., and concluded that the sediment simulation results have higher uncertainty than the discharge simulation. The GLUE method has also been applied to uncertainty assessment and parameter estimation in the headwaters of the Hei River.

When utilizing the GLUE method, many hydrologists have improved the algorithm in order to study features of particular watersheds or to satisfy various requirements. Blasone (Blasone and Vrugt, 2008) replaced the Monte Carlo random sampling of GLUE with SCEM-UA sampling. They selected the feasible parameter sets according to the coverage of the estimated confidence intervals and thereby significantly improved the reliability of the confidence intervals.
Wei (Wei and Xiong, 2008) applied the above-mentioned method in the upper reaches of the Han River and obtained better estimation results. Liu (Liu, 2008; Liu et al., 2009) improved the likelihood criterion of the GLUE method and added multiple-criteria likelihood functions based on China's Specification for Hydrological Forecasting; the multiple-criteria likelihood functions promoted the uncertainty analysis capability and accuracy of the GLUE method. Lin and Chen (Lin et al., 2009) described the correlation among model parameters using a Copula function, proposed a Copula-GLUE-based parameter uncertainty estimation method for hydrological models, and compared the simulation results with the original GLUE. Deng (Deng et al., 2008) generated synthetic time series data to eliminate the influences of data and model structure and analyzed the effects of the likelihood function for a pollutant attenuation model; they found that the likelihood function has a great impact on the analysis results.

The computational cost of the GLUE method is expensive, due to the extensive feasible parameter search space of the Monte Carlo experiments combined with the huge number of hydrological model objective function evaluations. The traditional GLUE method consumes too much time, which prevents researchers and engineers from applying it to highly complex real-world applications. This is a huge impediment to the development of GLUE-based uncertainty analysis and parameter estimation methods.

Binley (Binley and Beven, 1991) attempted to assess the uncertainty associated with the predictions of distributed models using a distributed rainfall-runoff model. However, the realizations were computationally demanding even though they adopted a relatively simple distributed model, the Institute of Hydrology Distributed Model version 4 (IHDM4).
However, at that time, performing this level of computation required significant code enhancement in order to exploit an 80-node transputer parallel computer (Beven and Binley, 2014). Later on, Beven (Beven and Binley, 1992) noticed that the GLUE procedure is computationally intensive and improved its computational efficiency by implementing it on a local parallel processing computer. The transputer utilized by Binley and Beven was a 1980s parallel computer designed by David May at the University of Bristol and produced by Inmos, with chips designed to support pipes to other processors. The first floating-point transputer, the T800, was produced in 1987. It was used by Binley as TRAM daughter boards for PCs and programmed in a language called Occam. Later, the simulations ran on a parallel computer at Lancaster University, the Meiko Computing Surface, which contained some 80 transputers, each providing about 1 Mflops average performance. The GLUE procedure is well adapted to a distributed-memory parallel processor, particularly where each realization may be loaded onto a single processor. In general, 500 realizations of the IHDM were run for each storm using a 50-transputer array. Each set of runs for each storm took between 30 and 60 hours of computing time on the 50 processors. In that era, these pioneering studies were employing hardware and software that was new to the hydrological sciences and related disciplines.

With the development of computational performance, the constraints of computer hardware on the application of GLUE have been relaxed. However, it remains an issue, either because a model is particularly slow to run, so that it is still not possible to sample sufficient realizations, or because of a high number of parameter dimensions (Beven and Binley, 2014).
The largest number of runs used in a GLUE application that we know of is the two-billion-run application (Iorgulescu, 2005). This was for a model written in just a few lines of code but including 17 parameters for calibration. Two billion runs are still a small sample compared with a discrete sampling strategy with ten values for each parameter. They were constrained to 500 realizations for each (relatively short) event, and that was only possible in a reasonable time because they utilized an 80-node transputer system. More recently, GLUE calculations have been sped up for certain models using parallel graphics processing units. However, the GLUE method has not yet been fully integrated with the state-of-the-art of modern high-performance computing technology. The acceleration of GLUE using the new generation of CPU-GPU hybrid high performance computer clusters is of great significance and is in great need.

With the development of modern microelectronic technology, multi-core and many-core hybrid heterogeneous parallel computing platforms have become the rising star of the recent high-performance computing field, owing to their stupendous floating-point compute capability compared with traditional, older-generation computers. Recently, several heterogeneous supercomputers, such as Summit and Sierra, have shown excellent performance on the TOP500 test. The success of the CPU-GPU heterogeneous platform owes to its better cost-performance and energy-consumption ratios. The modern CPU-GPU hybrid computing platform has become the best choice for researchers and engineers who need high-performance computing (Fang et al., 2016). On the other hand, GPUs have been widely equipped in modern PCs, and CPU-GPU hybrid heterogeneous computer systems are easily available for scientific computing. The key to utilizing the computational power of a hybrid CPU-GPU system is code development on heterogeneous systems.
Nevertheless, heterogeneous parallel software development faces great challenges such as heterogeneous data communication, programming and optimization for the GPU architecture, joint compilation with multiple compilers, etc. Owing to the hard work of Intel, NVIDIA, and other hardware and software producers, many powerful and easy-to-use compilers and libraries have emerged, including GCC, ICC, PGI, VC, NVCC, MPI, OpenMP, and OpenCL. With these useful tools, source code development for heterogeneous parallel computing is easier to accomplish than before.

1.4 Problems to be resolved and content of this paper

With the arrival of the big data era, hydrological model uncertainty and parameter estimation require an unprecedentedly large amount of computing horsepower. This research focused on the GLUE method and the newly emerged modern CPU-GPU hybrid high performance computer cluster acceleration technology. In order to further improve the computational efficiency of GLUE-based Xinanjiang hydrological model uncertainty and parameter estimation, a CPU-GPU hybrid computer cluster-based parallel GLUE-Xinanjiang uncertainty and parameter estimation method was proposed. The parallel method was implemented on a CPU cluster and a GPU cluster, respectively. We utilized a total of 5 CPUs and 5 GPUs to achieve good acceleration results. Finally, the scalability issue was also investigated to prove the satisfactory robustness and portability of the parallel GLUE method.

The requirements of the traditional GLUE procedure for model uncertainty and parameter estimation are listed below:

1) A formal definition of a likelihood measure or a set of likelihood measures is required. For hydrological model applications, the Nash-Sutcliffe coefficient of efficiency (NSCE) is usually adopted as the likelihood measure or objective function.
It can be calculated as follows:

NSCE = 1 - [ Σ_{i=1}^{n} (q_sim,i - q_obs,i)^2 ] / [ Σ_{i=1}^{n} (q_obs,i - q̄_obs)^2 ]

where q_sim,i denotes the simulated discharge at time step i; q_obs,i denotes the observed discharge at time step i; q̄_obs denotes the mean of the observed discharge values; and n denotes the number of discharge data.

2) An appropriate definition of the initial range or distribution of parameter values is necessary for a particular model structure. Generally speaking, the ranges of parameters are predefined according to the physical meanings of the parameters of the specific hydrological model, and the uniform distribution is adopted in most cases since the distributions of the parameters are usually unknown.

3) Sampling of parameter sets in the feasible space can be achieved by utilizing the Monte Carlo approach, and their likelihood values should be evaluated with the objective function after obtaining the simulation results of the hydrological model.

4) The uncertainty band or optimality of different parameter sets can be evaluated based on their likelihood values.

In this study, we carried out the uncertainty and parameter estimation of the Xinanjiang daily model using the GLUE method. The Xinanjiang daily model focuses on water balance and hydrograph simulation. Therefore, the objective function (OBJ) adopted herein is the NSCE defined above, where q_sim,i denotes the simulated discharge at time step i; q_obs,i denotes the observed discharge at time step i; q̄_obs denotes the mean of the observed discharges; and n denotes the number of discharge data.

The parameters and state variables of the Xinanjiang model require two additional constraints to ensure the correctness of the model's physical meaning. The constraints are applied to parameters CG, CI, and CS (CG must be larger than CI, and CI must be larger than CS) and to the soil moisture W (W must be non-negative).
In order to enforce the first constraint in the GLUE procedure, before calculating the OBJ we test the CG, CI, and CS values to verify whether the constraint relationship is satisfied. If it is satisfied, we continue with the calculation of the OBJ; otherwise, we set the OBJ equal to a penalty term.

Proceedings of the modern parallel computing technology
The penalty term uses a penalty coefficient lambda, which was set to 1000 in this research. If the first constraint is satisfied, we run the Xinanjiang daily model with the hydro-meteorological data to generate the simulated discharge time series. After the model simulation finishes, a "flag" is returned to indicate whether the simulation succeeded. If the simulation stopped early and returned a "flag" indicating that the state variable W took negative values, we set the OBJ equal to another penalty term, in which WMUB denotes the upper boundary of parameter WM, and WM denotes the WM value generated by the Monte Carlo sampling of the GLUE method. This penalty term forces the algorithm to search towards larger WM values in order to avoid negative W values. If the above two constraints are both satisfied, we calculate the OBJ according to the objective function defined above.

The parallelization of the GLUE method involves two steps: the parallelization of the Monte Carlo sampling and the parallelization of the optimal parameter set reduction. We implemented the parallel GLUE method on a multi-core CPU computer cluster and on a many-core GPU computer cluster, respectively. A detailed description of the implementation can be found in the following paragraphs.
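The constrained objective-function evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact penalty expressions appeared as equations in the original text and are replaced here by assumed stand-in forms; only the penalty coefficient lambda = 1000 and the qualitative behavior (pushing the search towards larger WM) are taken from the paper.

```cpp
#include <cassert>

// Outcome of one Xinanjiang model run.
struct SimResult {
    bool ok;      // the returned "flag": false if state variable W went negative
    double nsce;  // likelihood (NSCE) value when the run succeeded
};

// Hypothetical sketch of the constrained OBJ evaluation. The two penalty
// formulas below are assumed stand-ins, not the paper's equations.
double evaluate_obj(double CG, double CI, double CS,
                    double WM, double WMUB,
                    const SimResult& sim,
                    double lambda = 1000.0) {
    // Constraint 1: CG > CI > CS must hold before the model is run.
    if (!(CG > CI && CI > CS))
        return -lambda;                       // assumed penalty form
    // Constraint 2: a failed run (negative W) is penalized less severely
    // as WM approaches its upper boundary WMUB, steering the search
    // towards larger WM, as described in the text.
    if (!sim.ok)
        return -lambda * (WMUB - WM) / WMUB;  // assumed penalty form
    return sim.nsce;                          // both constraints satisfied
}
```

With this shape, a violated ordering constraint short-circuits before the (comparatively expensive) model run, which matches the order of checks described in the text.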

Parallel uncertainty and parameter estimation - CPU computer cluster implementation
The parallel GLUE method was implemented on a multi-core CPU computer cluster consisting of 4 HP Z-series workstations hosting a total of 5 Intel Xeon E5-2630v3 multi-core CPUs. The flow chart of the CPU computer cluster implementation of the parallel GLUE method is demonstrated in figure 2.

The parallel GLUE method starts from the initial settings on the master node. The hydro-meteorological data are loaded from CSV (comma-separated values) files, which include daily rainfall, runoff discharge, evapotranspiration, and catchment geographical information. Besides, the program sets the total Monte Carlo sample number (NS), the likelihood threshold (TH), the model parameter boundaries, and the KG plus KI constraint. After the initial data loading and settings, the algorithm queries the number of slave nodes (NN) and the number of CPU cores in each slave node by using MPI and OpenMP, respectively. The workload quantity assigned to each slave node is calculated as follows:

NSS_i = NS × NC_i / NT

where NSS_i denotes the number of samples generated on slave node i; NC_i denotes the number of CPU cores in slave node i; NT denotes the total number of CPU cores; and i = 1, 2, …, NN.

After the Monte Carlo sampling and likelihood calculations, the OpenMP parallel reduction is started to find the optimal (largest-likelihood) parameter set of each slave node. Finally, the feasible parameter sets and the optimal parameter set are sent to the master node by using the MPI_Send API.

During the Monte Carlo sampling, likelihood calculations, and parallel reduction on all the slave nodes, the master node waits for the completion of these computations. Once all the above computations are completed, the master node receives the feasible and best parameter sets of each slave node by using the MPI_Recv API. Finally, the master node generates the uncertainty plots for the feasible parameter sets, chooses the parameter set with the largest likelihood function value as the optimal parameter set, and finishes the execution by using the MPI_Finalize API.
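The workload assignment above can be sketched as follows. The proportional share NSS_i = NS × NC_i / NT follows the formula in the text; handling the integer-division remainder by assigning it to the last node is an assumption of this sketch, chosen so that the shares always sum to NS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distribute NS Monte Carlo samples across slave nodes in proportion to
// each node's CPU core count NC[i]. The remainder from integer division
// is assumed here to go to the last node.
std::vector<long> partition_samples(long NS, const std::vector<int>& NC) {
    long NT = 0;
    for (int c : NC) NT += c;               // total number of CPU cores
    std::vector<long> NSS(NC.size());
    long assigned = 0;
    for (std::size_t i = 0; i + 1 < NC.size(); ++i) {
        NSS[i] = NS * NC[i] / NT;           // NSS_i = NS * NC_i / NT
        assigned += NSS[i];
    }
    NSS.back() = NS - assigned;             // remainder to the last node
    return NSS;
}
```

For example, with NS = 1000000 and a hypothetical cluster of three 8-core nodes and one 16-core node (NT = 40), the first three nodes each receive 200000 samples and the last receives 400000.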
Parallel uncertainty and parameter estimation - GPU computer cluster implementation

The parallel GLUE method was implemented on a many-core GPU computer cluster constructed from 4 HP Z-series workstations hosting a total of 5 NVIDIA Tesla K40c many-core GPUs. The flow chart of the GPU computer cluster implementation of the parallel GLUE method is demonstrated in figure 3.

The parallel GLUE method starts from the initial settings on the master node. The hydro-meteorological data are loaded from CSV (comma-separated values) files, which include daily rainfall, runoff discharge, evapotranspiration, and geographical information of the study catchment. The algorithm also sets the total Monte Carlo sample number (NS), the likelihood threshold (TH), the model parameter boundaries, and the KG plus KI constraint. After the initial data loading and settings, the algorithm begins the MPI execution and queries the number of slave nodes (NN), the number of GPUs in each slave node, and the number of GPU cores in each slave node by using MPI and CUDA, respectively. The workload quantity assigned to each slave node is calculated in the same proportional manner as in the CPU implementation.

After the likelihood calculations, the CUDA parallel reduction is started to find the feasible parameter sets and the best (largest-likelihood) parameter set of each slave node. Finally, the feasible parameter sets and the best parameter set are sent to the master node by using the MPI_Send API.

During the Monte Carlo sampling, likelihood calculations, and parallel reduction on all the slave nodes, the master node waits for the completion of these computations. Once all the above computations are completed, the master node receives the feasible parameter sets and the best parameter set of each slave node by using the MPI_Recv API. Finally, the master node generates the uncertainty plot, chooses the parameter set with the largest likelihood function value as the best parameter set, and finishes the execution by using the MPI_Finalize API.
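A CPU-side sketch of the tree-style maximum reduction that a CUDA kernel of this kind typically performs may help clarify the step. On the GPU, each comparison within a step would run in its own thread; the arithmetic below is the same, but the sketch is sequential and is not the authors' kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-style reduction over likelihood values: at every step, element i
// is compared with element i + stride, halving the active range until
// the largest value remains at index 0. On a GPU, the inner loop's
// iterations run as parallel threads within each step.
double reduce_max(std::vector<double> v) {
    std::size_t n = v.size();
    while (n > 1) {
        std::size_t stride = (n + 1) / 2;             // round up for odd lengths
        for (std::size_t i = 0; i + stride < n; ++i)  // parallel on the GPU
            if (v[i + stride] > v[i]) v[i] = v[i + stride];
        n = stride;
    }
    return v[0];
}
```

The same pattern, with an index carried alongside each value, yields the position of the best parameter set rather than only its likelihood.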
The hardware utilized in this research is an HP Z-series workstation computer cluster constituted by one HP Z820 and three HP Z840 workstations. The screen and the four workstations are demonstrated in figure 4. The USB KVM switch, which is used for controlling and switching among the four workstations with one set of screen, keyboard, and mouse, is shown in figure 5. This computer cluster has 5 Intel Xeon E5-2630v3 multi-core CPUs and 5 NVIDIA Tesla K40c many-core GPUs.

The Intel Xeon E5-2630v3 CPU is a high-end server-level OEM/tray microprocessor. It is a Haswell-EP architecture CPU manufactured with a 22nm process. It has 8 CPU cores and supports hyper-threading technology with up to 16 parallel threads. The base frequency of the CPU cores is 2.4GHz and the turbo frequency is 3.2GHz. The level 1 cache consists of 8x32KB 8-way set associative instruction and data caches. The level 2 cache consists of 8x256KB 8-way set associative caches. The level 3 cache is a 20MB shared cache. The CPU supports many features such as MMX instructions, SSE (streaming SIMD extensions), AVX (advanced vector extensions), and TBT 2.0 (turbo boost technology 2.0). The core voltage is 0.65V-1.3V. The maximum operating temperature is 72.1°C. The minimum power dissipation is 32 Watts in the C1E state and 12 Watts in the C6 state.

The Tesla K40c is a high-end professional graphics card by NVIDIA, launched in October 2013. Built on the 28nm process and based on the GK110B graphics processor, the card supports DirectX 12.0. The GK110B graphics processor is a large chip.
The card's GDDR5 memory is connected using a 384-bit memory interface. The GPU operates at a frequency of 745MHz, and the memory runs at 1502MHz. Being a dual-slot card, the NVIDIA Tesla K40c draws power from 1x6-pin plus 1x8-pin power connectors, with its power draw rated at 245W maximum. The Tesla K40c is connected to the rest of the system using a PCIe 3.0 x16 interface. The card measures 267mm in length and features a dual-slot cooling solution. Its price at launch was 7699 US Dollars.

The software in this research was developed on the Microsoft Windows 7 64-bit operating system. The software ecosystem applied in this study is constituted by MPICH2, Microsoft VC++ 2010 with OpenMP, and NVIDIA CUDA 6.5. MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard. The CUDA toolkit provides tools for optimizing the performance of applications, along with programming guides, user manuals, API references, and other documentation that helps developers quickly start accelerating applications with GPUs.

The study area of this research is the Ba River basin. The Ba River originates from the north slope of the Qinling Mountains, China. Its full length is 92.6km. The elevation difference from the headwater to the outlet of the river is 1142m. The total slope is 12.8%. The area of the research basin is 2577 km². The Ba River basin is an asymmetric watershed: the left-bank tributaries are sparse and long, while the right-bank tributaries are dense and short. The Ba River is a mountainous river.

The CPU cluster parallel GLUE performs similarly to the serial GLUE: its total execution time also increases approximately by a factor of 10 as the number of samples increases tenfold. However, the GPU cluster parallel GLUE does not behave like the above two versions. When the number of samples increases from 10000 to 100000, the total execution time increases by much less than a factor of 10 (from 1.4s to 2.3s).
When the number of samples increases from 100000 to 1000000, the total execution time also increases by less than a factor of 10 (from 2.3s to 13.4s), which indicates that with a higher computational burden (a larger number of samples), the GPU cluster parallel version's computational efficiency becomes relatively better. The GPU cluster parallel implementation performs much better than the CPU cluster parallel version, especially for large-scale hydrological model parameter estimation tasks.

The above paragraph assesses the speedup ratio between the parallel GLUE implementations and the serial GLUE implementation. This section compares the performances of the GPU cluster parallel GLUE and the CPU cluster parallel GLUE implementations. We can see in figure 8(B) that, with the increase of the number of samples, the GPU cluster version gradually becomes faster than the CPU cluster version. For relatively small-scale problems, such as a sample number of 10000, the speedup ratio is less than 1x (0.57x), which means the GPU implementation is slower than the CPU implementation. Nevertheless, when facing large-scale parameter estimation problems, such as sample numbers of 100000 or 1000000, the GPU cluster parallel implementation runs 3.78x or 6.36x faster, respectively, than the CPU cluster parallel implementation. These facts indicate that the GPU cluster parallel GLUE method is more suitable for large-scale model parameter estimation problems and runs much faster than the CPU cluster parallel GLUE method for these kinds of tasks.

3.3 Scalability analysis

In this section, the scalability of the parallel GLUE methods is analyzed based on the total execution time. Since we focused on the scalability analysis, we fixed the number of samples at 1000000. The GPU cluster runs with GPU boost and ECC turned off. We varied the number of CPUs or GPUs and measured the total execution time to compare the performances of the parallel GLUE methods.
The total execution times of the CPU cluster parallel GLUE and the GPU cluster parallel GLUE are demonstrated in figure 9. It can be inferred from the figure that the total execution time of the parallel GLUE methods decreases when more CPUs or GPUs are applied to accelerate the computation. With more CPUs or GPUs, the computational efficiency of the parallel GLUE methods improves significantly. These testing results indicate that the scalability of the parallel GLUE method is good and that the parallel method can be applied to highly parallel tasks with heavy computational burdens.
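The speedup ratios discussed in this and the following sections follow the usual definitions, which can be sketched as follows. The timing values in the test are illustrative only, not measurements from this study.

```cpp
#include <cassert>

// Speedup of a parallel run relative to a baseline (e.g. serial) run.
double speedup(double t_baseline, double t_parallel) {
    return t_baseline / t_parallel;
}

// Parallel efficiency on k devices: 1.0 corresponds to ideal linear scaling.
double efficiency(double t_baseline, double t_parallel, int k) {
    return speedup(t_baseline, t_parallel) / k;
}
```

A speedup below 1x, such as the 0.57x reported for the smallest workload, simply means the parallel run was slower than the baseline for that case.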

Analysis based on speedup ratio of CPU cluster GLUE and GPU cluster GLUE v.s. serial GLUE

In this section, the scalability of the parallel GLUE methods is analyzed based on the speedup ratio between the parallel GLUE methods and the serial GLUE method. Here we again fixed the number of samples at 1000000 and varied the number of CPUs or GPUs.

Fig. 10. Speedup ratio of CPU cluster parallel GLUE and GPU cluster parallel GLUE v.s. serial GLUE.

In this section, the scalability of the parallel GLUE methods is analyzed based on the speedup ratio between the two parallel GLUE methods. Here we focused on the scalability analysis by setting the number of samples to 1000000. The GPU cluster runs with GPU boost and ECC turned off. We varied the number of CPUs or GPUs and tested the speedup ratio.

Fig. 11. Speedup ratio of GPU cluster parallel GLUE v.s. CPU cluster parallel GLUE.

Conclusions
In this research, we developed an improved version of the GLUE method to accelerate its computation. A multi-core CPU and many-core GPU hybrid high performance computing hardware system and an MPI, OpenMP, and CUDA-based hybrid software development ecosystem were built. We applied the improved parallel GLUE method and the traditional serial GLUE method to the parameter estimation of a real-world application of the well-known Xinanjiang hydrological model. The following conclusions can be drawn.

(1) The parallel GLUE-Xinanjiang methods run much faster than their serial counterpart and achieved much smaller computational times and much higher speedup ratios. The computational efficiency comparison results indicated the satisfying performance of the parallel methods. Furthermore, the GPU cluster parallel GLUE method achieved the highest speedup ratio of up to 251.37x, which significantly improves the computational efficiency and makes the highly computation-intensive Monte Carlo-based GLUE method feasible and meaningful for real-world and real-time applications.

(2) The parallel GLUE methods have good and satisfactory scalability, which allows scientists and engineers to apply more CPU or GPU devices to further improve the computational efficiency and makes them capable of tackling large-scale problems. On supercomputers, the CPU-GPU hybrid high performance computer cluster-based parallel GLUE method can achieve even better performance and real-world application results in the near future.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.