Evaluation of two hydrometeorological ensemble strategies for ﬂash-ﬂood forecasting over a catchment of the eastern Pyrenees

. This study aims at evaluating the performances of ﬂash-ﬂood forecasts issued from deterministic and ensemble meteorological prognostic systems. The hydrometeorological modeling chain includes the Weather Research and Forecasting Model (WRF) forcing the rainfall-runoff model MARINE dedicated to ﬂash ﬂoods. Two distinct ensemble prediction systems accounting for (i) perturbed initial and lateral boundary conditions of the meteorological state and (ii) mesoscale model physical parameterizations have been implemented on the Agly catchment of the eastern Pyrenees with three subcatchments exhibiting different rainfall regimes.Different evaluations of the performance of the hydrometeorological strategies have been performed: (i) veriﬁcation of short-range ensemble prediction systems and corresponding streamﬂow forecasts, for a better understanding of how forecasts behave; (ii) usual measures derived from a contingency table approach, to test an alert threshold exceedance; and (iii) overall evaluation of the hydrometeorological chain using the continuous rank probability score, for a general quantiﬁcation of the ensemble performances. Results show that the overall discharge forecast is improved by both ensemble strategies with respect to the deterministic forecast. Threshold exceedance detections for ﬂood warning also beneﬁt from large hydrometeorological ensemble spread. There are no substantial differences between both ensemble strategies on these test cases in terms of both the issuance of ﬂood warnings and the overall performances, sug-gesting that both sources of external-scale uncertainty are important to take into account.

Abstract. This study aims at evaluating the performances of flash-flood forecasts issued from deterministic and ensemble meteorological prognostic systems. The hydrometeorological modeling chain includes the Weather Research and Forecasting Model (WRF) forcing the rainfall-runoff model MARINE dedicated to flash floods. Two distinct ensemble prediction systems accounting for (i) perturbed initial and lateral boundary conditions of the meteorological state and (ii) mesoscale model physical parameterizations have been implemented on the Agly catchment of the eastern Pyrenees with three subcatchments exhibiting different rainfall regimes.
Different evaluations of the performance of the hydrometeorological strategies have been performed: (i) verification of short-range ensemble prediction systems and corresponding streamflow forecasts, for a better understanding of how forecasts behave; (ii) usual measures derived from a contingency table approach, to test an alert threshold exceedance; and (iii) overall evaluation of the hydrometeorological chain using the continuous rank probability score, for a general quantification of the ensemble performances.
Results show that the overall discharge forecast is improved by both ensemble strategies with respect to the deterministic forecast. Threshold exceedance detections for flood warning also benefit from large hydrometeorological ensemble spread. There are no substantial differences between both ensemble strategies on these test cases in terms of both the issuance of flood warnings and the overall performances, suggesting that both sources of external-scale uncertainty are important to take into account.

Introduction
Flash floods are among the most devastating natural hazards worldwide, producing important human and socioeconomic losses. The western Mediterranean region is annually affected by several extreme precipitation events which lead to flash flooding. During the extended warm season, the early intrusion of upper-level cold air masses and the relatively high sea surface temperature boost the convective available potential energy of the low-level Mediterranean warm and moist air. This natural hazard results from the persistence of deep moist convection and intense precipitation over specific hydrographic catchments during several hours. As many western Mediterranean small-to-medium-sized river basins are highly urbanized, steep and close to the coastline, their hydrological responses are inherently short. Large, rapid and unexpected flows exacerbate flood damage. The development and evaluation of the state-of-the-art hydrometeorological forecasting tools is a major issue in the Hydrological cycle in the Mediterranean experiment (HyMeX; Drobinski et al., 2014). This program aims at addressing the following science questions, amongst others. How can we improve heavy rainfall process knowledge and prediction? How can we improve hydrological prediction?
Hydrometeorological forecasting tools can contribute to a better understanding and forecasting of flash floods so as to implement more reliable forecasting and warning systems over the western Mediterranean. Short-range quantitative precipitation forecasts (QPFs) by high-resolution numerical weather prediction (NWP) models are an effective Published by Copernicus Publications on behalf of the European Geosciences Union.
tool to further extend flood forecasting lead times beyond the basin response times. NWP models capture the initiation and evolution of small-scale and convectively driven precipitations, with similar spatial and temporal scales to the flashflood-prone catchments (Leoncini et al., 2013;Fiori et al., 2014;Ravazzani et al., 2016;Amengual et al., 2017). Although QPFs can be directly used to force one-way hydrological models, the hydrometeorological forecasts are impacted by different types of uncertainties. Uncertainties are inherent to each of the hydrometeorological chain components: model parameterization and structure, limitations of measuring devices providing observation data, and initial and lateral boundary conditions (Zappa et al., 2010).
External-scale inaccuracies in the hydrological models emerge from two distinct sources when forecasting deep moist convection and heavy rainfall with NWP models. First, errors arise from the complexity and nonlinearity of the physical parameterizations. Second, uncertainties emerge when representing the exact initial atmospheric state and boundary forcing across the scales where convection develops. But reliable spatial and temporal QPF distributions are necessary to render skillful quantitative discharge forecasts when coping with floods over small-and medium-sized basins. Otherwise, the issuance of precise and dependable early flood warnings is inhibited (Le Lay and Saulnier, 2007;Bartholmes et al., 2009;Cloke et al., 2013).
To alleviate the impact of these external-scale uncertainties, short-range ensemble prediction systems (SREPSs) are used to build hydrological ensemble prediction systems (HEPSs). SREPSs aim at sampling the set of plausible outcomes and at accounting for the most relevant uncertainties in the atmospheric forecasting process so as to increase. Uncertainties in the initial and boundary fields can be encompassed by conveniently perturbing initial and lateral boundary conditions (IC/LBCs, Grimit and Mass, 2007;Hsiao et al., 2013). Uncertainties in model parameterizations are dealt with by populating the ensemble with multiple combinations of equally skillful physical schemes (Stensrud et al., 2000;Jankov et al., 2005;Amengual et al., 2008Amengual et al., , 2017Tapiador et al., 2012). The inclusion of these uncertainties aims at improving the skill and spread of the HEPSs by introducing independent information of all the plausible atmospheric states and processes. Therefore, SREPSs are increasingly used in hydrologic prediction (Cloke and Pappenberger, 2009;Verkade et al., 2013Verkade et al., , 2017Siddique and Mejia, 2017;Benninga et al., 2017;Bellier et al., 2017Bellier et al., , 2018Edouard et al., 2018;Jain et al., 2018). Several studies have stated that probabilistic forecasts could improve decisionmaking if appropriately handled (e.g., Krzysztofowicz, 2001;Todini, 2004;Ramos et al., 2013;Antonetti et al, 2019). As stated by Zappa et al. (2011), each member of a meteorological ensemble can be fed into a hydrological model to generate a hydrological forecast.
However, the most appropriate methods for generating HEPSs and the quantification of their added value are still under assessment (Cloke and Pappenberger, 2009;Cloke et al., 2013). Further efforts devoted to examine the predictive skill of both ensemble strategies and how the externalscale uncertainties spread into the HEPSs become paramount for the optimal design of hydrometeorological operational chains over the flood-prone western Mediterranean area. The objective of the present work is to evaluate the predictive skill of two distinct HEPS generation strategies -accounting for perturbed IC/LBCs (PILB) and mixed physics (MPS) -for three flash-flood episodes over the Agly basin (Fig. 1). This catchment of the eastern Pyrenees has been selected as an experimental area as several subcatchments exhibit different rainfall regimes. Given the small size of the subcatchments (from 150 to 300 km 2 ), the localization of the precipitation patterns is crucial , and it is a challenge to implement QPFs for such small subcatchments. QPFs are generated by using the Weather Research and Forecasting Model (WRF; Skamarock et al., 2008). Next, 48 h WRF forecasts are propagated down through the MARINE hydrological model (Roux et al., 2011) to investigate the quantitative discharge forecasts in the timing and magnitude of these flash floods. The resulting HEPSs are examined using different criteria to illustrate the potential benefits of the probabilistic hydrometeorological forecast chains. The rest of the paper is structured as follows: Sect. 2 presents a short overview of the flash floods, the study area and the observational networks; Sects. 3 and 4 provide an insight into the hydrological and atmospheric models and the strategies for ensemble generation; and Sect. 5 presents the results. The last section summarizes main conclusions and provides further remarks.
2 Data and case studies

Overview of the Agly catchment
This study focuses on a catchment in the north side of the eastern Pyrenees, the Agly, as a test site for implementing the HEPS strategies. The Agly is a coastal river in the north side of the eastern Pyrenees (Fig. 1). It originates from an elevation of approximately 700 m and drains the Pyrenees foothills. It flows into the Mediterranean Sea at Le Barcarès and has a length of around 80 km. A dam dedicated to flood and water management controls approximately 400 km 2 of the catchment (Agence de l'eau Rhône Méditerranée & Corse, 2012). It is located just downstream of the confluence between the Agly and one of its main righthand tributaries, the Désix river, draining an area of around 150 km 2 (Fig. 1). The main left-hand tributary, the Verdouble river, drains an area of 300 km 2 located in a midmountain region, culminating between 400 and 500 m of altitude ( Fig. 1). Granite and gneiss cover about 300 km 2 of the mountainous part of the Agly catchment, promoting runoff already facilitated by the steep slopes. North of the catchment, the Corbières massif is dominated by limestones form- ing karstic networks. According to hydrogeological studies of the area, there are only losses in the Agly and Verdouble catchments due to the karstic system. These losses contribute to the streamflow of two resurgences draining the Corbières massif but located outside of the Agly catchment (Font Estramar and Font Dame resurgences) (Salvayre, 1989). The average loss rates are estimated between 0.3 and 1.5 m 3 s −1 for the Agly, depending on the river discharge, and between 0.7 and 2 m 3 s −1 on the Verdouble (Ladouche and Dörfliger, 2004). These are only average estimates based on observed discharges and assumptions about the functioning of the karst system, but they can be considered small enough not to be explicitly represented in flash-flood simulations. A total of 80 % of the catchment is covered by natural vegetation -forest (45 %), shrubby vegetation (17 %), maquis and scrubland (16 %) -while 18 % is used for agriculture, mainly vineyards.
The Agly catchment is subject to different climate regimes in connection with the distances from the sea and the mountainous reliefs: temperate oceanic in the north-west valley, mountain in the south-west part, and Mediterranean downstream. The rainfall regime varies from east to west with increasing annual cumulated precipitations: the mean annual cumulated precipitations  range from 600 mm at Torreilles (east, Fig. 1) up to 1174 mm at Saint-Louis-et-Parahou (west, Fig. 1) (DIREN Languedoc-Roussillon/SIEE-GINGER, 2008). Generally, the rainfall regime is highly variable, with very intense precipitation events in fall, winter, and spring and with very dry summers.

Available data
The precipitation measurements available on the Agly catchment come from two different observational networks.
-PLU: the operational hourly rain-gauge network for flood-monitoring purposes and data provided by the re-gional flood forecasting service, the Service de Prévision des Crues Méditerranée Ouest (SPCMO); -JP1: 1 km 2 quantitative hourly precipitation estimates ANTILOPE J +1 (ANalyse par spaTIaLisation hOraire des PrEcipitations) that come from a merging of radar data and rain-gauge measurements (Laurantin, 2008;Champeaux et al., 2009).
The hydrometric data were derived from the network of operational measurements at variable time steps (Hy-droFrance databank, http://www.hydro.eaufrance.fr/, last access: 20 November 2019). The stream gauges are located in five upstream stations not influenced by the dam (Table 1 and Fig. 1). Table 2 summarizes the main hydrological features of the five stations. This study will focus on three recent events beginning on 4 March 2013, 16 November 2013 and 28 November 2014, being highly variable (Table 3), with rainfall lasting respectively 3 d for the spring event and 4 d for the two fall events. The selected events have been labeled with the start date and the duration as follows: 20130304_3d, 20131116_4d and 20141128_4d. All the floods feature moderate specific peak discharges for flash floods, highlighting the high infiltration rates. The runoff coefficient is always higher for the eastern part (station no. 5, Table 3) than for the western part. The runoff coefficient is even higher than 1 for 20130304_3d at station no. 5. There is no definitive explanation for that, but several possibilities can be considered: (i) the very high soil moisture at the beginning of the event (65 %, Table 3), which can contribute to the runoff at the outlet via subsurface flows; (ii) an amount of snowmelt as there was a snowfall episode at the very end of February 2013 over the eastern Pyrenees and Corbières, with snow above 700 to 800 m; (iii) the uncertainties in the discharge and precipitation measurements; (iv) a possible supply from the karstic system (Fig. 1); however, this possibility is pretty unlikely as hydrological studies conclude to only losses in the Verdouble catchments due to the karstic system (Ladouche and Dörfliger, 2004). One event occurred in spring with an averagely moist soil (20130304_3d, Table 3), while the other two occurred in autumn with dry soils after the summery drought. The autumn episodes exhibit very different intensities: the specific peak discharges range from 0.3 to 0.6 m 3 s −1 km −2 for 20131116_4d and from 1 to 2 m 3 s −1 km −2 for 20141128_4d. Concerning the means of the maximum rainfall intensity over the catchment, they range from 8 to 14 mm h −1 according to PLU and from 9 to 11 mm h −1 according to JP1 for 20131116_4d as well as from 19 to 30 mm h −1 according to PLU and from 15 to 25 mm h −1 according to JP1 for 20141128_4d (Table 3). 20141128_4d is therefore much more intense than 20131116_4d according to both observed forcings even if JP1 forcing presents lower intensities. 20130304_3d is in between both episodes, with specific peak discharges ranging from 0.6 to 1.5 m 3 s −1 km −2 but lower rainfall intensities ranging from 7 to 11 mm h −1 according to PLU and from 6 to 11 mm h −1 according to JP1. These episodes are representative of the different seasonal rainfall regimes that lead to floods over the Agly. In spring, floods mainly originate from stratiform-type rainfall with moderate but persistent precipitation rates that can result in substantial accumulations. In autumn, floods are most likely driven by convective-type precipitations of shorter duration but high intensity. Figure 2 shows the spatial repartition of the cumulative rainfall for the three events for both forcings. The rain-gauge data have been interpolated using the Thiessen polygon methods (Thiessen, 1911). Variability in rainfall clearly emerges especially between the eastern, western and mountainous part.

Rainfall-runoff model
The MARINE model is a distributed mechanistic hydrological model specially developed for flash-flood simulations. It models the main physical processes in flash flooding: infiltration, overland flow, and lateral flows in soil and channel routing. Conversely, it does not incorporate low-rate flow processes such as evapotranspiration or base flow.
MARINE is structured into three main modules that are run for each catchment grid cell (Fig. 3). The first module allows the separation of surface runoff and infiltration using the Green-Ampt model (Green and Ampt, 1911). The second module represents subsurface downhill flow, based on the generalized Darcy law used in the TOPMODEL hydrological model (Beven and Kirkby, 1979). Lastly, the third module represents overland and channel flows. Rainfall excess is transferred to the catchment outlet using the Saint-Venant equations simplified with kinematic wave assumptions (Fread, 1992). The model distinguishes grid cells with a drainage network, where channel flow is calculated on a triangular channel section (Maubourguet et al., 2007); and from grid cells on hillslopes, where overland flow is calculated for the entire surface area of the cell. For more details about the MARINE model, the readers can refer to Roux et al. (2011), Garambois et al. (2015b) and Douinot et al. (2018).
The MARINE model works with distributed input data such as (i) a digital elevation model (DEM) of the catchment to shape the flow pathway and distinguish hillslope cells from drainage network cells, according to a drained area threshold; (ii) soil survey data to initialize the hydraulic and storage properties of the soil, which are used as parameters in the infiltration and lateral flow models; and (iii) vegetation and land-use data to configure the surface roughness parameters used in the overland flow model. As the MARINE model is event-based, it must be initialized to take into account the previous moisture state of the catchment. This is done by using the spatial daily root-zone saturation state, i.e., the ratio of the soil water content to the soil storage capacity at a spatial resolution of 8 km × 8 km, output from Météo France's SIM operational chain (Habets et al., 2008). The initial soil water content for MARINE is therefore directly obtained by multiplying the saturation state by the soil storage capacity of each cell.

Calibration/validation on the Agly catchment
MARINE requires parameter calibration so as to accurately reproduce hydrological behaviors. Based on previous sen- sitivity analyses by Garambois et al. (2013), five parameters are calibrated: soil depth C Z ; the transmissivity used in lateral subsurface flow modeling C T ; hydraulic conductivity at saturation C K ; and friction coefficients for low-and high-water channels, n L and n H , respectively. C T , C K and C Z are the multiplier coefficients for spatialized, saturated hydraulic conductivities and soil depths. Note that n L and n H are kept invariant throughout the drainage network. The spatial resolution of the MARINE model on all the Agly subcatchments is of 500 m. The calibration of the Agly catchment at the Saint-Paul-de-Fenouillet station (no. 2, Table 1 and Fig. 1) was performed by Garambois et al. (2015a) according to their proposed methodology. The events used for this calibration are older than those considered in the present study (20020411,20031204,20040221,20051115,20101010,20110315; see Garambois et al., 2015a). The cost function L NP is designed to evaluate the performance of the model (Roux et al., 2011;Garambois et al., 2015a): where Q s p and Q o p are respectively the simulated and observed peak runoff, T s p and T o p are the simulated and observed time to peak, and T c is the time of concentration of the catchment. L N denotes the efficiency coefficient (Nash and Sutcliffe, 1970): Table 3. Main features of the selected flash-flood events. Observed forcing PLU: network of 19 rain gauges; observed forcing JP1: 1 km 2 quantitative precipitation estimates; cumulated P (mm): mean ± standard deviation [max] of accumulated precipitation on the catchment during the whole event; max I (mm h −1 ): mean of the maximal rainfall intensity over the catchment; Q o p (m 3 s −1 ): peak discharge for the event; Q o p s (m 3 s −1 km −2 ): ratio of the peak discharge for the event to the drainage area of the subcatchment; T o p (dd hh:mm): date of the peak discharge; C r (-): observed runoff coefficient -ratio of the amount of runoff through the outlet to the amount of rainfall on the catchment; H ini (%): mean ± standard deviation initial soil moisture according to Safran-Isba-Modcou (SIM) daily root-zone humidity output (Habets et al., 2008).
where n is the number of observation data, and Q s and Q o are the simulated and the observed runoff. The estimated times of concentration of each subcatchment are given in Table 1, using the Bransby Williams formula (Pilgrim and Cordery, 1992): where T c (min) is the time of concentration, L (km) is the total length of the channel, A (km 2 ) is the drainage basin area and S (m m −1 ) is the average slope. Here, the formula for time of concentration is only used to normalize the peak time delay in the third term of Eq.
(1) with a characteristic time of the catchment, so the most important point is to always use the same procedure to make this term dimensionless. Note that the range of values for both L NP and L N spans from −∞ to 1, with 1 being the perfect score. Table 4 lists the L N and L NP efficiencies for the validation cases: the three studied events with different forcings and two older flash-flood events with available data, only used for the validation process of the hydrological model but not further studied. Table 4 and Fig. 4 show the following.
-Only one event (20130304_3d with PLU forcing) is well simulated at the five gauging stations.
-Only one event (20130304_3d with both PLU and JP1 forcings) is well simulated at mountainous station no. 1.
-All the other events are correctly simulated only for a part of the catchment: either the eastern part near the Mediterranean Sea (stations no. 3, no. 4 and no. 5), the south-west mountainous part (station no. 1), or the north-west continental part (station no. 2). This result does not seem to be directly linked with the rain-gauged distribution because, first of all, the rain-gauge network is quite dense in this catchment and rather well distributed: with 19 rain gauges for an area of around 1000 km 2 , the rain-gauge density is about 1 for 50 km 2 , whereas the rain-gauge density for the full network over mainland France is of 1 for 120 km 2 (Mounier et al., ). In addition, it is not always for the same part of the catchment that the model has the best performance: it depends on the event. Therefore, the same distribution of rain gauges sometimes leads to a correct simulation in terms of L NP cost function (Eq. 1) for a given even, while it leads to an unsatisfactory simulation for another event.
As expected, the different parts of the catchment exhibit various behaviors which are difficult to correctly simulate with a single calibration by just using observations at sta-tion no. 2. On the one hand, events with relatively moderate peak discharge are usually not correctly simulated by MA-RINE regardless of the observed forcing, as is the case of the 20090411_PLU and 20131116_4d events. Indeed, several authors have pointed out that specific peak discharges larger than 0.5 m 3 s −1 km −2 are one of the relevant criteria to define a flash flood Gaume et al., 2009). The 20090411_PLU and 20131116_4d events exhibit smaller peak discharges (Table 3), except for the 20131116_4d episode at station no. 2, where the results are correct for the PLU forcing (Fig. 4). When the simulated hydrographs are suitable for the eastern Agly, the discharge is overestimated over the western part (e.g., 20141128_4d; Fig. 4). Conversely, when the simulated hydrographs are correct over the western Agly, the peak discharges are underestimated in the eastern part as in the 20130304_3d episode. Difficulties in correctly simulating the hydrological responses over all the subcatchments arise due to the spatial variability of hydrological behavior across the Agly catchment, leading to a myriad of runoff responses that are difficult to encompass with single parameterizations of the infiltration process in hydrological models (Amengual et al., 2017).
With respect to the two major 20130304_3d and 20141128_4d events, both simulated with the two observed forcings, simulations are more satisfactory with the 1 km 2 quantitative precipitation estimates ANTILOPE J +1 for the eastern than for the western part. This may be due to the fact that the radar is located close to the sea, with the beams being orographically sheltered over the western Agly (Fig. 1). Several other calibration tests could have been carried out so as to improve the results of the hydrological model such as one calibration for each subcatchment. However, the main purpose of this study focuses on the potential of ensemble strategies to improve flash-flood forecasting. Furthermore, NWPmodel-driven runoff simulations have been compared both against the observed discharges and against the observed rain-gauge and radar-precipitation-driven runoff runs. Hence, the impact of the external-scale uncertainties on the quality of the distinct HEPS can be emphasized.

Meteorological tools
The fully compressible and nonhydrostatic WRF model has been employed to generate the ensemble members. The WRF setup consists of a single computational domain completely spanning the western Mediterranean region at 2.5 km spatial horizontal resolution (i.e., 767×575 grid points) and 50 vertical levels (Fig. 5). Deep moist convection is explicitly solved due to the high-spatial resolution. All the ensemble experiments have a temporal forecasting horizon of 48 h, starting at 00:00 UTC on the day before of the main observed peak floods. Starting on this day guarantees a suitable lead time to issue warnings to local water management services. For these hydrometeorological episodes lasting more than 2 d, succes-sive consecutive 48 h simulations have been performed, starting on the next days at 00:00 UTC. Hence, the initiation and subsequent evolution of the most active precipitation systems and the overall rainfall episodes are completely encompassed.
WRF simulations have been forced by using the global Ensemble Prediction System of the European Centre for Medium-Range Weather Forecasts (ECMWF-EPS). The MPS ensemble has been built by using the reference (i.e., unperturbed) run, while the PILB approach has considered a selected set of the overall ECMWF-EPS population. Finally, the hourly QPFs are used to force the MARINE model one way so as to build the HEPSs. In addition, the deterministic ECMWF forecasts have been also dynamically downscaled so as to have a control baseline for comparative purposes against the ensemble strategies.
Deterministic simulations have used the following physical parameterizations: the WRF single-moment six-class microphysics scheme, including graupel (WSM6; Hong and Lim, 2006); the 1.5-order Mellor-Yamada-Janjić boundary layer scheme (MYJ; Janjić, 1994); the Dudhia shortwave scheme (Dudhia, 1989); the RRTM longwave scheme (Mlawer et al., 1997); the unified Noah land surface model (Tewari et al., 2004); and the Eta similarity surface-layer model (Janjić, 1994). Note that the WRF configuration for the control simulations is the same as the daily operational setup run by the research Meteorology Group at the University of the Balearic Islands (http://meteo.uib.es/wrf, last access: 31 January 2020).

PILB ensemble
The operational ECMWF-EPS is formed by 51 membersthe reference and 50 perturbed forecasts -at T639 spectral resolution (20 km) and aims to cope with uncertainties related to the actual state of the atmosphere. The daily synoptic-scale uncertainties are encompassed by perturbing an initial analysis through the flow-dependent singular vector technique (Buizza and Palmer, 1995;Molteni et al., 1996). However, perturbed IC/LBCs can produce inadequate spread in the short range, before error growth on the synoptic scale becomes nonlinear (Gilmour et al., 2001). Therefore, the implemented PILB ensemble is based on dynamically downscaling these 20 ECMWF-EPS members exhibiting maximum perturbations in the initial and lateral boundary conditions over the WRF domain. This strategy seeks to ameliorate the aforementioned mismatch between the synoptic-scale error growth optimization time for the singular vectors and the subsynoptic error growth, which is more relevant for shortrange forecasts at small-and medium-sized basins (Ravazzani et al., 2016;Amengual et al., 2017).
At this aim, a k-means clustering algorithm using the principal components of the 500 hPa geopotential and 850 hPa temperature fields is applied to the entire ECMWF-EPS over the WRF numerical domain. Then, the 50 ensemble members Nat. Hazards Earth Syst. Sci., 20, 425-450, 2020 www.nat-hazards-earth-syst-sci.net/20/425/2020/ are categorized in 20 clusters and the 20 closest members to the centroids are used as initial and boundary fields for the PILB ensemble. Boundary fields are updated every 3 h, and physical schemes remain invariant for all the ensemble members and are the same as those used to run the deterministic WRF simulations.

Mixed-physics (MPS) ensemble
There is not an optimum set of physical numerical parameterizations when simulating severe weather and intense precipitation events. Several studies have shown that different combinations of physical parameterizations render similar performances (Jankov et al., 2005;Evans et al., 2012). That is, the meteorological variables are sensitive to a myriad of processes which are differently parameterized by capable numerical schemes. When simulating flash flooding driven by convective-type precipitation, cumulus parameterizations are the main candidates for direct uncertainty sampling. However, as convection is explicitly resolved, uncertainties arising from the microphysical subgrid processes and planetary boundary layer (PBL) schemes have been encompassed. The former regulates the distinct forms of rainfall; the latter accounts for the turbulent vertical fluxes of heat, momentum, and moisture within the PBL and throughout the atmosphere. Both physical mechanisms are also dominant when controlling deep moist convection. The MPS ensemble has been generated using all possible pairs (cloud microphysicsboundary layer) between the following schemes, summing up to 20 members.
-Microphysical schemes: (i) WRF single-moment sixclass (WSM6; Hong and Lim, 2006); (ii) Goddard (Tao et al., 1989); (iii) New Thompson (Thompson et al., 2008); and (iv, v) National Severe Storms Laboratory (NSSL) two-moment (Mansell, 2010) with two cloud condensation nuclei (CCN) prediction values of 0.5 × 10 9 and 1.0 × 10 9 cm −3 . On the one hand, all microphysics schemes involve the simulation of explicitly resolved liquid water, cloud, and precipitation and include mixed-phase transformations (i.e., the interaction of ice and liquid water). However, each microphysical parameterization treats differently the interaction among five or six moisture species (i.e., water vapor, cloud water, rain, cloud ice, snow and graupel); the physical processes of rain production, fall and evaporation; the cloud water accretion and autoconversion; condensation; and saturation adjustment and ice sedimentation. The western Mediterranean is affected by air masses of distinct signature (i.e., Saharan, Atlantic, purely Mediterranean or continental central European), featuring a high variability of aerosol concentration that influences the moist physical mechanisms. The inclusion of two different CCN concentrations copes with uncertainties in the aerosol characteristics. On the other hand, the choice of different PBL schemes can be crucial when correctly simulating the onset of mesoscale severe weather phenomena. PBL modulates the temperature and moisture profiles in the lower troposphere and the effects of turbulence in the daytime convective conditions (Hu et al., 2010;Coniglio et al., 2013). Finally, it is worth noting that the initial and lateral boundary conditions are kept invariant through all the MPS ensemble members. IC/LBC come from the ECMWF-EPS reference forecast for each individual case study, and lateral boundary conditions are updated every 3 h. 5 Results and discussion

Verification of the SREPS
The quantitative comparison of the spatial 48 h accumulated precipitations for the PILB and MPS experiments against the radar estimates provides a quality outlook of the ensemble performance for the selected episodes over the study region. Figures 6-8 indicate realistic spatial distributions for all the study cases: high rainfall accumulations in the upper tail distributions of both ensemble strategies are a good indication of the potential for heavy rainfall. The regional roughed topography (i.e., the pre-Pyrenees, Pyrenees and the Massif Central) is determinant in placing and focusing the probabilistic quantitative precipitation forecasts. Both approaches could succeed in issuing warning alerts before flash-flood scenarios in the region. However, SREPS reliability must be previously checked at basin scales. Flash-flood forecasting over a single medium-sized catchment is a challenging issue as many small-scale atmospheric factors concur in determining the location of deep convection and intense precipitation. A crucial feature in determining correctly the location of the rainfall amounts is to accurately simulate the south to northeasterly low-level moisture maritime flows impinging over the mountainous slopes of the Agly basin.
The 48 h rain-gauge (PLU) and radar-derived (JP1) rainfall amounts have been used to evaluate the forecasting ensemble skill at the relevant hydrological scales. To this end, the cumulative ensemble QPFs have been interpolated to all the available rain gauges and to the pixels of the radar do-main shown in Figs. 6 to 8 for each study case (Akima, 1978(Akima, , 1996Fig. 9). Most members of the PILB and MPS ensembles exhibit underestimations for the 4-5 March 2013 and 28-29 November 2014 experiments, while they exhibit overestimations for the 16-18 November 2013 simulations. Both strategies do not present remarkable differences in ensemble skill and spread when forecasting the total rainfall amounts (Fig. 10). Root mean squared errors (RMSEs) and correlations (r) are quite similar, indicating a slightly more accurate performance of the MPS or PILB ensemble strategy depending on the case study and the starting day of the experiment.
In addition, the skill of each ensemble strategy in predicting the probability for different accumulations -ranging from light to torrential rainfalls -has been assessed by means of the ROC (receiver operating characteristic) curves. The ROC curve expresses the true hit rate of a probabilistic forecast at different false-alarm rates, while the area under the ROC curve (AUC) quantifies the ability of the ensemble to discriminate between the occurrence or nonoccurrence of an event (Schwartz et al., 2010). ROC curves have been computed by using all the study cases, and the radarderived (JP1) rainfall accumulations have been employed as the observed baseline. The following 48 h accumulated precipitation thresholds have been considered: 5, 10, 15, 25, 50, 75, 100, 125, 150 and 200 mm. As the forecast probabilities are computed and verified against each pixel within the radar domain shown in Figs. 6 to 8, the statistical sample sums up to 54 145 members (7735 radar grid points times 7 ensemble experiments). Probabilistic QPFs from the PILB approach show slightly higher forecasting skills than MPS for small rainfall accumulations (i.e., ≤ 15 mm; Table 5 and Fig. 11). Even so, the AUCs are above 0.85 for both ensemble strategies. For moderate to high rainfall thresholds (25-75 mm), PILB and MPS are almost statistically indistinguishable, with AUCs well above 0.7. Depending on the precipitation limit, MPS or PILB features a slightly higher probabilistic forecasting skill. At greater thresholds (≥ 100 mm), PILB shows a larger discrimination ability, with areas slightly higher than 0.7 for all the cases, except the most extreme precipitation accumulation. On the other hand, MPS renders values close to but below 0.7. In general, both strategies exhibit an elevated quality of the probabilistic forecasts for low to moderate rainfall accumulations. Remarkably, the discrimination ability of the PILB strategy is maintained up to 150 mm. This result points to a more effective encompassing of uncertainties emerging from the IC/LBCs than from the microphysical and PBL physical inaccuracies likely due to the dominant role of the regional complex orography when controlling rainfall location. However, the high AUCs rendered by both ensemble strategies suggest accounting for both sources of uncertainty so as to obtain high-quality probabilistic quantitative precipitation forecasts.

Verification of streamflow forecasts
As mentioned by Bellier et al. (2017), the visual inspection of individual hydrographs is useful for a better understanding of how forecasts behave. The hydrological simu-    median and the 10th and 90th quantiles of each ensemble strategy, as well as the first-level alert from the flood warning center in France (SCHAPI), are also shown as references.
In general, the WRF deterministically driven hydrological forecasts often miss the peak times for all the hydrometric stations (Fig. 12). The HEPS improves this feature, even if biases in the EPS still remain as they are propagated down to the hydrological model. That is, the MPS-HEPS and PILB-HEPS exhibit slight underestimations (overestimations) for the 20130305_2d and 20141129_2d (20131117_2d) simula-tions. The observed peak time is included in the boxplots (minimum and maximum of all of the data) of the ensemble strategies for the five stations, whereas it is not included in the boxplot for the deterministic simulations at stations no. 1-3 as it can be seen in Fig. 13 for stations no. 1 and no. 5. It can also be appreciated that the peak timing delay is usually negative, independent of the experimental setup. Almost all the hydrometeorological simulations result in earlier peak timings than observed. The peak plot approach has been adopted to better appreciate the value of the ensemble strategies: all the ensemble members are joined in a single plot by calculating the deviation from the observed peak discharge and timing (Zappa et al., 2013;Ravazzani et al., 2016). Figures 14-16 summarize the simulations carried out for stations no. 2 and no. 5 and for simulations 20130305_2d, 20131117_2d and 20141129_2d. Results exhibit a high interevent variability as it might be expected given their different characteristics. Regarding the MPS-HEPS experiments, the observed peak lies in the range of variation of the ensemble for the 20130305_2d run at hydrometric stations no. 1 and no. 2 (Fig. 14). This fact can be ascribed to the large spread found in the driven peak discharges: deviations from the observation range from approximately −110 to +200 m 3 s −1 , while timing delays fluctuate from −26 to +15 h for station no. 2. Indeed, the 80 % confidence interval of the MPS-HEPS simulations never encompasses the observed discharge for this event. The same remarks also apply for the 20141129_2d case at stations no. 3-5 (Fig. 16) and for 20131117_2d at station no. 3. The 80 % confidence interval of the MPS-HEPS simulations encom- passes the observed discharge only for the 20131117_2d simulation at stations no. 2, 4 and 5 (Fig. 15) and for the 20141128_2d simulation at station no. 2.
The observed peak also lies in the range of variation of the PILB-HEPS ensemble strategy for the 20131117_2d run at stations no. 2-5 (Fig. 15) and for the 20141129_2d sim-ulation at the five gauge stations (Fig. 16). Concerning both episodes at gauge station no. 2, the PILB-HEPS spread is larger than MPS-HEPS in terms of the observed peak discharge although smaller for the observed peak time. That is, from −17 to +22 h for the MPS-HEPS and from −3 to +18 h for the PILB-HEPS for 20131117_2d as well as from www.nat-hazards-earth-syst-sci.net/20/425/2020/ Nat. Hazards Earth Syst. Sci., 20, 425-450, 2020 Figure 10. Statistical scores of the 48 h rainfall amounts for the PILB and MPS ensemble members when compared against the rain-gauge (PLU, a, c, e) and the radar-driven (JP1, b, d, f) observations. Boxes denote the p 25 and p 75 interquartile ranges, middle horizontal lines show the ensemble median, and whiskers display the best and the worst ensemble members. Note that the PILB and MPS ensembles start on the day indicated in the upper part of each subpanel.
−12 to +25 h for the MPS-HEPS and from −12 to +8 h for the PILB-HEPS for 20141129_2d. The opposite is found at station no. 5 for 20130305_2d and 20141129_2d. The 80 % confidence interval of the PILB-HEPS simulations encompasses the observed discharge only for the 20141128_2d run at station no. 2 and for the 20141129_2d run at stations no. 2 and 3 (Fig. 16). Given those results, it seems that there are no substantial differences between the both HEPS strategies on these test cases.

System reliability for flood warning
Results of all the performed hydrometeorological simulations lead to the conclusion that it is very difficult to correctly reproduce the spatial variability of the catchment behavior, even forcing the hydrological model with observed rainfall. The next step was therefore to test the ability of the hydrometeorological modeling strategies in issuing reliable flood warnings.
Nat. Hazards Earth Syst. Sci., 20, 425-450, 2020 www.nat-hazards-earth-syst-sci.net/20/425/2020/  Let us consider a forecast event that either occurs or does not occur. For flood forecasting, it usually consists in an alert threshold exceedance. The performance of a hydrometeorological prediction chain can be examined using a contingency table (Table 6).
Several metrics for the evaluation of flood warning performance can be derived from the contingency table by considering the number of hits (h), misses (m), false alarms (f) and correct negatives (n) for all the simulations. The proportion correct (PC), probability of detection (POD), false-alarm ratio (FAR), critical success index (CSI) and BIAS have the following properties (Nurmi, 2003): -The PC score corresponds to the ratio of correct warning forecasts and total forecasts. PC ranges from 0 to 1, with the latter being the perfect score. Note that the PC index does not differentiate between misses and false alarms.
-The probability of detection is the ratio of correctly forecast threshold exceedances to the total number of threshold exceedance observed. POD ranges from 0 (no hit) to 1, with 1 being the best. Note that for values equal to 1, there are no misses and all occurrences of the event were correctly forecast. However, POD does not penalize false alarms, and it can be artificially improved by overforecasting.
-The false-alarm ratio is the ratio of the number of false alarms to the total number of threshold exceedance forecasts. FAR ranges from 0 to 1, with 0 being perfect. That is, there are no false alarms and all warning forecasts were correct. Note that FAR does not penalize misses, and it can be artificially improved by underforecasting.
-Neither POD nor FAR can give a complete picture of forecasting success. The critical success index combines both aspects of the probability of detection and false-alarm ratio. Therefore, CSI is more balanced and better quantifies the correspondence between the observed and forecasted occurrences. This index is sensitive to hits and penalizes both misses and false alarms. CSI values range from 0 (no hit) to 1 (no misses, no false alarms), with 1 being the best. CSI ignores correct negatives as what is expected in the forecast is to be effective in case of alert.
-The frequency bias compares the number of times an event was forecast to the number of times an event was observed. If BIAS = 1, both frequencies are equal and the forecast is unbiased. If BIAS > 1 (< 1), there is an overforecast (underforecast) tendency: the event was forecast more (less) than it was observed.
As a first step, the probability of exceeding the warning threshold has been calculated for each ensemble strategy. The warning threshold that is used here is the first-level alert from the flood warning center in France (SCHAPI) as plotted in Fig. 12. Results are very similar for MPS-HEPS and PILB-HEPS: overall, with respect to the deterministic simulations, both ensemble strategies improve the forecast of threshold  g-i). Note that Q 50 is the ensemble median, Q 10 denotes the 10th ensemble quantile, Q 90 labels the 90th ensemble quantile, Q obs is the observed discharge, WRF is the WRF deterministically driven discharge experiment, PLU is the PLU-driven runoff simulation and JP1 denotes the JP1-driven discharge simulation. Alert 1 corresponds to the first-level alert.
exceedance for station no. 5 (Tautavel) and degrade it for station no. 2 (Saint-Paul-de-Fenouillet), whereas there is no clear trend for station no. 1 (Ansignan). As it has been stated in Sect. 3.2, when the hydrologic simulations are suitable for the eastern Agly (station no. 2), the discharge is overestimated over the western part (station no. 5). As most members of the PILB and MPS ensembles exhibit underestimations for the 4-5 March 2013 and 28-29 November 2014 events, both MPS-HEPS and PILB-HEPS result in less false alarms for station no. 5 and more misses for station no. 2. PILB and MPS ensembles also exhibit overestimations for the 16-18 November 2013 event but less than the determin-istic simulation; results are therefore the same as for the two other events. Figures 17 to 19 show the results for FAR, CSI and BIAS scores at the five hydrometric sections. These scores are calculated with respect to the observed discharges and by using all the runs of the different episodes. As 48 h simulations have been performed, these scores are based on the following seven experiments described in Sect. 5.2: 20130304_2d, 20130305_2d, 20131116_2d, 2013117_2d, 20131118_2d, 20141128_2d and 20141129_2d. Some tendencies can be highlighted from these results.  . The x axis shows the delay from the observed peak time; the y axis shows the deviation from the observed peak discharge. The triangles show the deviation of the simulations with ensemble member forcing (grey for MPS, light blue for PILB), the shapes with black contour show the deviation of the median of the HEPS simulations with ensemble member forcing, the pink circle shows the deviation of the simulation with JP1 forcing, the green circle shows the deviation of the simulation with PLU forcing and the dark blue square shows the deviation of the simulation with deterministic WRF forcing. Alert 1 (yellow dashed line) is the warning threshold; the black star is the observation used as normalized reference.
-The MPS-HEPS strategy overall performs better than the PILB-HEPS approach for the tested scores. However, both ensemble strategies' scores are very similar.
-No ensemble strategy performs best for station no. 2 for FAR and CSI: there is no false alarm at this station (Fig. 17), and therefore the CSI score is the best with respect to the other stations (Fig. 18).
-Although the ensemble improves the peak timing in some events, it does not improve the issuance of warn-ing, at least according to the five tested scores: the deterministic WRF simulation always has better scores than the median of both MPS-HEPS and PILB-HEPS, except for BIAS, and sometimes better than the maximum.
BIAS shows that both ensemble strategies tend to underestimate the discharge at all the gauge stations except station no. 1, in the mountainous part of the catchment (Fig. 19).
That is, MPS-HEPS and PILB-HEPS tend to underestimate the discharge at all the stations except over the mountainous part of the catchment. This is an indication of the paramount  importance of the orography when controlling the location of deep convection in the meteorological simulations. When orography does not play such an important role, forecasting the small-scale atmospheric features linked to the triggering and development of highly localized convective precipitation cores is more uncertain. As mentioned before, PILB-HEPS and MPS-HEPS tend to exhibit underestimations for both 20130305_2d and 20141129_2d simulations and overestimations for the 20131117_2d run. Conversely, the observed forcing and the deterministic forecast tend to overestimate the discharge, except for the two eastern stations, no. 4 and no. 5. We find here the consequences of the hydrological model calibration: when the simulated hydrographs are suitable for the eastern Agly, the discharge is overestimated over the western part (Sect. 3.2). Quantitative discharge forecasts can be evaluated against observed discharges but also against simulated discharges using observed forcings. As stated by several authors (Verkade et al., 2013;Bellier et al., 2017), the errors due to the parameters and structure of the hydrologic model are therefore not taken into account in the last case. This approach separates the impact of the external-scale uncertainties from Nat. Hazards Earth Syst. Sci., 20, 425-450, 2020 www.nat-hazards-earth-syst-sci.net/20/425/2020/   www.nat-hazards-earth-syst-sci.net/20/425/2020/ Nat. Hazards Earth Syst. Sci., 20, 425-450, 2020 those emerging from the hydrological model. Evaluations have been again performed by using the simulated discharges with observed forcing PLU and JP1 as the baseline instead of the observed flows. As expected, when only external-scale uncertainties are taken into account, the scores for the evaluation against simulated discharges with PLU or JP1 improve: PC, POD and CSI are higher, and there are no false alarms at three stations (no. 1-no. 3). However, the BIAS score shows that both ensemble strategies tend to highly underestimate the simulated discharge at all the stations, except at station no. 5 when compared to PLU and at stations no. 4 and no. 5 when compared to JP1 (Fig. 20). These stream gauges are located over the eastern part of the catchment. Again, the deterministic WRF simulations have better scores than the median of both HEPS, except for station no. 4, and the PC, POD, FAR and BIAS scores when compared to JP1.

Overall view of the modeling performance
Binary events highlight one aspect of the forecast, which is especially relevant to avoid casualties, damages or economic losses (Hersbach, 2000). To obtain a more general quantification of the ensemble performances, other criteria are necessary. Here, the overall discharge forecast at the five gaging stations is studied by using the continuous rank probability score (CRPS; Matheson and Winkler, 1976). The CRPS measures the differences between the forecast, P (x), and observation, P a (x), expressed as cumulative distributions of one parameter x (Eq. 4). This score has the dimension of the parameter and is equal to the mean absolute error (MAE) for a deterministic forecast. The following description is mainly retrieved from Hersbach (2000): where x is the parameter of interest, herein the discharge, and x a is the value that actually occurred. P (x) and P a (x) are the cumulative distributions of x and x a , respectively (Eqs. 5 and 6).
where ρ(x) is the probability density function of the forecast x.
where H is the Heaviside function. The minimum value of the CRPS is zero for a perfect deterministic forecast (i.e., P (x) = P a (x)).
Herein, the CRPS is averaged over the ensemble members and is therefore noted CRPS, while the x parameter corresponds to the discharge at the five gaging stations. The CRPS is very small for the simulations corresponding to the episode of November 2013 (i.e., 20131116_2d, 20131117_2d and 20131118_2d). This score is always below 10 m 3 s −1 for all stations and the MPS-HEPS and PILB-HEPS strategies. Conversely, the CRPS is quite high -above 50 m 3 s −1 -for the numerical runs of the event of November 2014 (i.e., 20141128_2d and 20141129_2d), especially at station no. 5. That is, the cumulative distributions of discharge are similar between the HEPSs and the observed discharges for the event of November 2013, but they are dissimilar for the episode of November 2014. Concerning the experiments for the episode of March 2013 (i.e., 20130304_2d and 20130305_2d), the CRPS is low for stations no. 1 and no. 3 (below 15 m 3 s −1 ) and higher for stations no. 2, no. 4 and no. 5 (close to or above 20 m 3 s −1 ).
To evaluate more easily the performances of the ensemble strategies, their performances are also compared against the efficiency of a reference forecast by using the skill score with respect to the CRPS (Eq. 7) (Bontron, 2004): The chosen reference forecast is the simulation performed with the deterministic forecast (WRF), and in that case the CRPS skill score is written as follows: A CRPSS of 1 corresponds to a perfect forecast (CRPS = 0), while a value of 0 indicates that the HEPS and the reference forecast have the same performances (CRPS = MAE(WRF)). Negative skill scores denote that the reference prediction performs better than the HEPS (CRPS > MAE(WRF)). Figure 21 shows that the two ensemble strategies exhibit very similar skill score CRPSS: in general, both ensemble strategies perform better than the deterministic WRF experiment, except for 20130304_2d and 20130305_2d; the main differences between both ensemble strategies are found for the 20131118_2d experiment -PILB clearly outperforms MPS at all the stream stations.
As stated before, selecting the runoff simulation driven by the deterministic weather forecast as the reference does not account for the errors due to the hydrological model. The CRPS skill score can also be calculated by using the simulation performed with the observed precipitation fields (PLU and JP1) as the reference:  .
Not surprisingly, both ensemble strategies have an overall lower performance when compared with the PLU-and JP1-driven runoff simulations, except for event of November 2013. It is interesting to note that for the 20131118_2d run, the PILB-driven runoff forecasts outperform the radardriven discharge simulation (Fig. 22, right). This is consistent with the previous analyses: events with relatively moderate peak discharge -as the event of November 2013 -are not correctly simulated by MARINE regardless of the observed forcing (Table 4), whereas the CRPS is very low for the ensemble simulations of the event of November 2013. As stated before, a low CRPS means that the cumulative distributions of discharge are similar between both HEPSs and the observed discharges for the event of November 2013, but they are dissimilar between the simulations with both observed forcings and observed discharges for the same event. This may be related to the fact that MPS-HEPS and PILB-HEPS exhibit overestimations for this event, maybe compensating for errors in the model structure that prevent the simulation with observed forcings for this event to be efficient. Both ensemble strategies outperform the hydrological simulations driven by observed forcings (PLU and JP1) for the mountainous station (no. 1: Ansignan) and the 20141128_2d, 20141129_2d, 20131116_2d and 20131118_2d runs. This result is consistent with the difficulty in obtaining satisfac- tory observations of rainfall in mountainous areas owing to sparse rain-gauge deployment and beam radar blockage.

Conclusion
One of the main scientific aims of the HyMeX program is to improve the hydrometeorological forecasting of flash floods over the western Mediterranean region. To this end, three of the most important floods that recently developed over the Agly basin have been selected as study cases. Flood forecasting is a challenging task over this region: high spatial and temporal variability in convective cores and rainfall intensity, strong nonlinearities in the rainfall-runoff transformation, and antecedent moisture conditions lead to a myriad of hydrological responses. This work has focused on coping with uncertainties emerging from the initial and lateral boundary conditions and formulation of numerical weather prediction models. To this end, potentialities of MPS-HEPS and PILB-HEPS ensembles have been examined so as to produce suitable flood forecasts over the Agly basin. The main conclusions are as follows.
-A better ensemble generation strategy at the regional scale has not been found. Similarities in the performance of the MPS and PILB approaches indicate that both sources of external-scale uncertainty contribute similarly to produce adequate levels of skill and spread in the probabilistic quantitative precipitation forecasts.
-Ensemble hydrometeorological simulations have turned out to be satisfactory for alarm detection, even if individual ensemble members can be far from the observations. Alarm systems benefit from large hydrometeorological ensemble spreads.
-The overall HEPS performances improved the deterministically driven runoff simulations.
Some unexpected results also raise interesting questions. For instance, the November 2013 event was poorly simulated using both observed forcings, but ensemble strategies improved the overall discharge forecast. What is the specificity of the November 2013 event that makes it poorly simulated? Is it due to the radar and rain-gauge location or to the initial state of the catchment? Is it due to the model structure itself, which does not represent all the hydrological processes involved (karstic system and snowmelt mainly)? These issues require further investigations and probably more test cases. The next logical approach will be to estimate the uncertainties in the hydrological modeling. Performing hydrological model ensembles to test the errors due to the model calibration is time-consuming. However, according to Douinot et al. (2017), it is also useful in identifying the strengths and weaknesses of the model when simulating the distinct hydrological processes. Hopefully, the future implementation of a hydrological model ensemble will provide the beginning of the answers to the above questions.
Code and data availability. Author contributions. HR, AA and RR provided the theoretical background, designed the methodology and analyzed the data. AA ran the meteorological model and wrote the corresponding part. HR ran the hydrological model and wrote the corresponding part.