Downsizing parameter ensembles for simulations of extreme floods

Anna E. Sikorska-Senoner¹, Bettina Schaefli²,³, and Jan Seibert¹,⁴
¹University of Zurich, Department of Geography, Zürich, Switzerland
²University of Lausanne, Institute of Earth Surface Dynamics, Lausanne, Switzerland
³Now at: University of Bern, Institute of Geography, Bern, Switzerland
⁴Swedish University of Agricultural Sciences, Department of Aquatic Sciences and Assessment, Uppsala, Sweden
Correspondence: Anna E. Sikorska-Senoner (anna.sikorska@geo.uzh.ch)

Which metrics are suitable for assessing the performance of such a reduced parameter ensemble against the reference (full) ensemble? Specifically, three different methods for reducing a full hydrological-model parameter ensemble to a handful of parameter sets are proposed and tested for deriving the uncertainty ranges of simulated rare flood events (up to a 10'000-year return period). All three methods rely on simulated annual maxima and are tested on continuous synthetic data (simulated with a hydrological model); the aim is thus (i) to provide simulation periods long enough for extreme flood analysis, (ii) to avoid the propagation of errors due to data or model calibration, and (iii) to be able to focus entirely on the uncertainty of the hydrological response. The principal idea underlying these methods is that the downsizing of the parameter ensemble can be performed with input time series much shorter than the full simulation time frame; the downsized ensemble can then be applied to the full time window for the analysis of extremes.

Study framework and objectives
The focus of this study is a fully continuous hydro-meteorological ensemble-based simulation framework for extreme flood estimation. The underlying ensemble of streamflow time series is built from meteorological scenarios and multiple hydrological model runs using a number of calibrated model parameter sets. A meteorological scenario represents a single realisation from a stochastic weather generator with constant model parameters. These meteorological scenarios are equally likely model realisations that differ in their precipitation and temperature patterns; together they represent the natural variability of the climate (and not the model uncertainty of the weather generator). These realisations are then used as inputs to a hydrological model to simulate the hydrological response. To account for hydrological modelling uncertainty, a range of different parameter sets is used for each meteorological scenario. These two sources of hydrologic variability then accumulate along the modelling chain and can be represented as an ensemble of possible hydrological responses (Fig. 1).
Within such a framework, we first want to understand how variable the simulated hydrological response is and, second, to develop methods for downsizing the model parameter ensemble to a smaller subset that can be handled within such a modelling chain for extreme flood simulations. This subset should represent the entire range of variability of the hydrological response at little computational effort and should also be transferable to independent time periods. Hereafter, we call this subset the representative parameter ensemble.
Downsizing of the parameter ensemble is particularly needed if (i) the probability distribution of the parameter sets is unknown because the sets result from independent calibrations or regionalisation approaches and only a limited number of sets can be run with the hydrological model, or (ii) the distribution is known but, due to time-consuming simulations, it is not possible to run the hydrological model for a full ensemble of multiple meteorological scenarios.
The question of how many parameter sets are needed to cover most of the simulation range is important. Here, however, we fix this number and instead test different selection approaches. Hence, for the purpose of our work, we would like this representative parameter ensemble to be composed of only three sets, which should be representative of a lower (infimum), a middle (median) and an upper (supremum) interval of the full hydrological ensemble (Fig. 1) and which, together, should enable the construction of predictive intervals for extreme flood estimates that represent the full variability range of all ensemble members. The infimum (Latin for 'smallest') and supremum (Latin for 'largest') refer to the greatest lower bound and the least upper bound (Hazewinkel, 1994), i.e., the largest interval bounding the ensemble from below and the smallest interval bounding it from above. The terms infimum and supremum are preferred over minimum and maximum, as the latter would imply a complete parameter ensemble range, whereas here we use the terms to describe the range of a given ensemble.
The key challenge for such a downsizing is that we would like to select parameter sets (i.e. select in the parameter space) based on how representative the corresponding simulations are in the model response space. Moreover, the downsized ensemble should not only be representative for the simulated time periods but also be transferable to independent time periods. The first question to answer is which model response space the selection should focus on. In the context of extreme flood estimation, focusing on the frequency distribution of annual maxima (AM) is a natural choice; we thus propose to use the representation of AMs in the Gumbel space as the reference model response space for parameter selection. The next step is the development of methods to select parameter sets that plot at certain locations in this model response space (i.e. in the Gumbel space). Given the nonlinear relationship between model parameters and hydrological responses, this selection has to be obtained via an inverse modelling approach: we first have to simulate all parameter sets and then decide which of them fulfil certain selection criteria in the model response space.
For this purpose, we developed three methods, based on (a) ranking, (b) quantiling and (c) clustering, described in detail in Sect. 2.2. The main idea behind all three methods is that the parameter set selection is made based on the full hydrological simulation ensemble but using only a limited simulation period that is much shorter than the time window of the full meteorological scenarios used within the simulation framework.
The following notation is used:

- I is the number of available parameter sets, with i = 1, 2, ..., I being the parameter set index; θ_i is the i-th parameter set of the hydrological model;
- J is the number of annual maxima (years) per hydrological simulation; y = 1, 2, ..., J is the simulation year (index of unsorted annual maxima), and j = 1, 2, ..., J is the index of sorted annual maxima;
- X_j is the j-th sorted annual maximum and X_y is the unsorted annual maximum from year y;
- M is the number of meteorological scenarios considered, with m = 1, 2, ..., M being the meteorological scenario index;
- S_m is the m-th meteorological scenario;
- H(θ_i | S_m) is the hydrological simulation computed using the i-th parameter set of the hydrological model and the m-th meteorological scenario;
- X_{y,i,m} is the annual maximum for year y extracted from H(θ_i | S_m);
- θ_inf, θ_med and θ_sup are the representative parameter sets of the hydrological model, i.e., infimum, median and supremum, which correspond to the intervals of the same names.

Developed methods for selecting the representative parameter sets
For the sake of simplicity, let us consider a single meteorological scenario S_m for now. Using S_m as input to a hydrological model combined with I parameter sets results in an ensemble of hydrological simulations, H(θ_{1,2,...} | S_m). The goal is now to select a limited number (here three) of hydrological model parameter sets, i.e., θ_inf, θ_med and θ_sup, from the available pool of I sets (I ≫ 3) based on the simulated annual maxima (AM). These AMs are extracted from the continuous hydrological simulation time series H(θ_{1,2,...} | S_m) using a maximum approach that selects the highest peak flow within each calendar year of each hydrological simulation (Fig. 2). This covers the situation where different model realizations (i.e., for i = 1, 2, ...) lead to different flood events being classified as the largest event of the year. In this way, we ensure that the largest flood event simulated within each year y and for each i-th parameter set is selected.
This means, however, that AMs selected for the same year y but with different parameter sets may originate from different flood events and even from different dominant flood processes, e.g. heavy rainfall or intensive snowmelt (Merz and Blöschl, 2003). This could be the case when one parameter set better represents processes driven by rainfall excess while others better represent snowmelt dynamics. For simplicity, we do not distinguish events by their flood genesis and pool all AMs together.
Using the above notation, the selection of representative parameter sets can be summarized as:

1. Hydrological simulation: the hydrological model is run with the meteorological scenario S_m for each of the I parameter sets, resulting in I continuous hydrological simulations H(θ_i | S_m).

2. Selection of annual maxima (AMs): for each i-th hydrological realisation, the annual maxima are selected as the highest peak flow within each simulation year y. This results in J AMs per hydrological simulation. The selection of AMs is repeated for all I hydrological simulations.

3. Selection of the three representative parameter sets based on the simulated AMs, following one of the three methods detailed below.

https://doi.org/10.5194/nhess-2020-79 Preprint. Discussion started: 26 March 2020. © Author(s) 2020. CC BY 4.0 License.
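The AM extraction in step 2 can be sketched as follows. This is a minimal illustration with a toy random ensemble and toy array sizes, not the study's actual HBV output:

```python
import numpy as np

def annual_maxima(q, steps_per_year):
    """Unsorted annual maxima X_y: the highest peak flow within each
    simulation year of one continuous simulation H(theta_i | S_m)."""
    n_years = len(q) // steps_per_year
    q = np.asarray(q[: n_years * steps_per_year], dtype=float)
    return q.reshape(n_years, steps_per_year).max(axis=1)

# Toy ensemble: I = 4 parameter sets, J = 3 "years" of 10 steps each.
rng = np.random.default_rng(42)
ensemble = rng.gamma(2.0, 5.0, size=(4, 30))
ams = np.vstack([annual_maxima(run, 10) for run in ensemble])
print(ams.shape)  # (I, J) = (4, 3)
```

Repeating this per parameter set yields the I × J block of annual maxima on which all three selection methods operate.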

Ranking
(a) AMs computed from the I hydrological simulations (i.e., using I model parameter sets) are sorted by magnitude from highest to lowest within each simulation year y independently (Fig. 2A).

(b) For each simulation year y, the AMs corresponding to the 5th, 50th and 95th rank for that year are selected.
(c) The parameter sets corresponding to the selected AM ranks are then attributed as the 5th, 50th and 95th parameter sets for each year y independently.
(d) The parameter sets selected in step (c) are compared over all J simulation years, and the sets chosen most often at the 5th, 50th and 95th ranks are retained as the parameter sets θ_R5, θ_R50 and θ_R95, representative for the entire simulation period and for the entire hydrological simulation ensemble.

Quantiling

(d) Finally, the ensemble members which lie closest in the Gumbel space to Q_5, Q_50 and Q_95, i.e., which received the smallest values of R_MSE,Q5, R_MSE,Q50 and R_MSE,Q95, respectively, are chosen as representative for the entire hydrological ensemble, and the parameter sets corresponding to these members, i.e., θ_Q5, θ_Q50 and θ_Q95, are retained as representative.
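The ranking selection can be sketched as below. The tie-breaking when several parameter sets are chosen equally often (first-encountered wins, via `Counter.most_common`) is an assumption of this sketch; the text does not specify how such ties are resolved:

```python
import numpy as np
from collections import Counter

def ranking_select(ams, ranks=(5, 50, 95)):
    """Ranking method (sketch): for each year, sort the I annual maxima
    in descending order, note which parameter set falls on each target
    rank, and keep the set chosen most often across all J years.
    `ams` has shape (I, J); ranks are 1-based positions out of I."""
    I, J = ams.shape
    chosen = {r: Counter() for r in ranks}
    for y in range(J):
        order = np.argsort(ams[:, y])[::-1]   # parameter sets, highest AM first
        for r in ranks:
            chosen[r][int(order[r - 1])] += 1  # set at rank r in year y
    return {r: chosen[r].most_common(1)[0][0] for r in ranks}
```

With I = 100 sets whose simulated maxima keep a stable order across years, the 5th, 50th and 95th ranked sets are recovered directly; when the order changes from year to year, the counts spread out, which is exactly the weakness of this method discussed below.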

Clustering
(a) Similar to the quantiling method, for each i-th parameter set the AMs computed with this parameter set are sorted by magnitude over the entire simulation period and plotted in the Gumbel space, creating I ensemble members of sorted AMs simulated with different parameter sets.
(b) These members are next clustered in the Gumbel space into three representative groups (clusters) based on all J simulation years, using k-means clustering with the Hartigan-Wong algorithm (Hartigan and Wong, 1979), as implemented in the function kmeans of the R package "stats" (R Core Team, 2019); see Fig. 2C.
(c) Next, these clusters are sorted by magnitude, and the 5th percentile is computed for the lower cluster, the 50th percentile for the middle cluster and the 95th percentile for the upper cluster, i.e., P_5, P_50 and P_95. Note that we use percentiles instead of cluster means to make this method comparable with the other two methods and to better cover the variability of the parameter sample.

(d) For each i-th ensemble member, the metric R_MSE is computed in relation to the three estimated cluster percentiles P_5, P_50 and P_95 as

R_MSE,i = sqrt( (1/J) Σ_{j=1}^{J} (X_{j,i} − P_j)² ),

where P_j denotes the value of the respective percentile curve (P_5, P_50 or P_95) at the j-th sorted annual maximum.

(e) For each of the three clusters, the ensemble member that lies closest to the cluster percentile, i.e., that received the smallest value of R_MSE, is selected as the representative member of that cluster, and the parameter sets corresponding to these members, θ_P5, θ_P50 and θ_P95, are retained as representative.
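A compact sketch of the clustering method (steps a-e) follows. Plain Lloyd-style k-means with a deterministic percentile initialization stands in for R's Hartigan-Wong kmeans used in the paper, and the percentile/RMSE conventions are this sketch's assumptions:

```python
import numpy as np

def cluster_select(sorted_ams, iters=50):
    """Cluster I sorted-AM curves (shape I x J) into three groups, then
    pick per cluster the member closest (RMSE) to its percentile curve
    P5 / P50 / P95. Lloyd iterations with percentile-based starting
    centers replace the Hartigan-Wong algorithm of the paper."""
    X = np.asarray(sorted_ams, dtype=float)
    centers = np.percentile(X, [10, 50, 90], axis=0)    # deterministic init
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                        # nearest center
        centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    order = np.argsort(centers.mean(axis=1))             # low, mid, high cluster
    picks = {}
    for name, k, p in zip(("inf", "med", "sup"), order, (5, 50, 95)):
        members = np.flatnonzero(labels == k)
        target = np.percentile(X[members], p, axis=0)    # P5 / P50 / P95 curve
        rmse = np.sqrt(((X[members] - target) ** 2).mean(axis=1))
        picks[name] = int(members[rmse.argmin()])        # theta_P5/P50/P95
    return picks
```

Unlike the ranking method, the returned indices are tied to whole sorted-AM curves, so the selected sets are representative over the entire simulation period by construction.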

For plotting, we use Gringorten's method (Gringorten, 1963) to compute the plotting positions of the AMs in the Gumbel plots:

k_j = −ln(−ln(p_j)),  with p_j = (j − 0.44) / (J + 0.12),

where k_j is the plotting position (Gumbel reduced variate) of the j-th (sorted) AM in the Gumbel space. Independently of the method applied (see Table 1 for a summary), each of these selection methods results in three (different) representative hydrological simulation ensemble members, which can be thought of as representing the lower (infimum), middle (median) and upper (supremum) interval of the full simulation range. The corresponding parameter sets are denoted θ_inf, θ_med, θ_sup. The simulations corresponding to these three parameter sets together create the so-called predictive interval, which can be used for extreme flood simulation studies.
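The Gringorten plotting positions can be computed with a small helper; the mapping of the non-exceedance probability to the Gumbel reduced variate is the standard one:

```python
import numpy as np

def gumbel_plotting_positions(n):
    """Gringorten plotting positions for n ascending-sorted annual maxima:
    non-exceedance probability p_j = (j - 0.44) / (n + 0.12), mapped to
    the Gumbel reduced variate k_j = -ln(-ln(p_j))."""
    j = np.arange(1, n + 1)                  # rank of ascending-sorted AMs
    p = (j - 0.44) / (n + 0.12)
    return -np.log(-np.log(p))

k = gumbel_plotting_positions(100)
# The largest of 100 AMs plots at p ~ 0.9944, i.e. a return period
# T = 1 / (1 - p) of roughly 179 years.
```

This unbiased formula avoids p = 1 for the largest observation, so every AM receives a finite position on the Gumbel axis.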

Comparison of three selection methods
The major difference between the three methods is that the ranking method operates on individual simulation years using a simple ranking of flow maxima independently of their frequency, i.e., it works on unsorted annual maxima. Note that in this way a different rank order of the I parameter sets may be obtained for each simulation year y. In the extreme case where different parameter sets are chosen for each year y, the choice of the representative sets over all simulation years may become problematic, as it is difficult to identify the parameter sets most frequently selected over all simulation years.
In contrast, the two other methods, quantiling and clustering, are performed on AMs sorted over all simulation years, i.e., in the flow frequency space. This enables statistical statements to be made about the selected parameter sets and about the predictive intervals constructed from them. Furthermore, the selected parameter sets can be assumed to be representative for the entire simulation period (see Table 1 for a detailed overview of the three methods).

Assessment of the behavior of the approach
Testing the methods on a time period different from the one used for the parameter ensemble downsizing is crucial for assessing how well the reduced ensembles cover the reference simulation ensemble. Thus, we propose to assess the behavior of the developed approach by repeating the selection of the three representative parameter sets with the three proposed methods for multiple (M) meteorological scenarios. Using multiple meteorological scenarios, first, accounts for the natural variability of the hydrological response due to climate variability and, second, allows us to evaluate the bias of the approach. In particular, with the help of multiple meteorological scenarios we explore how the choice of the representative parameter sets θ_inf, θ_med, θ_sup depends on the meteorological scenario.

Leave-one-out cross-validation
To evaluate the three selection methods, we perform a leave-one-out cross-validation study, in which one meteorological scenario S_r is removed from the analysis and the selection of the representative parameter sets is carried out based on all remaining meteorological scenarios, i.e., using all m = 1, 2, ..., M with m ≠ r. The selection methods are then evaluated against the meteorological scenario initially removed from the set. In detail, the following steps are executed for each of the three methods independently: (d) Evaluate the meteorological scenario S_r removed in step (a) against the predictive intervals of the S_{M−r} meteorological scenarios, to assess how well the identified intervals represent the ensemble members of this scenario S_r.
The procedure is repeated M times, so that each meteorological scenario is used once.
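The leave-one-out loop can be sketched generically; `select` and `evaluate` are placeholder callables standing in for the method-specific steps described in the text:

```python
def leave_one_out(scenarios, select, evaluate):
    """Leave-one-out cross-validation (sketch): for each meteorological
    scenario S_r, select the representative sets on the remaining M-1
    scenarios, then evaluate the held-out S_r against the resulting
    predictive intervals."""
    results = []
    for r in range(len(scenarios)):
        training = scenarios[:r] + scenarios[r + 1:]   # all m != r
        reps = select(training)                        # theta_inf/med/sup
        results.append(evaluate(scenarios[r], reps))
    return results
```

Each of the M passes yields one evaluation of an independent scenario, from which the metrics of Sect. 2.6.2 are computed.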

Multi-scenario evaluation
To further evaluate the three methods, we perform a simulation study using multiple (M) meteorological scenarios. Here, the three selection methods are executed on one meteorological scenario randomly selected (without replacement) from the M available scenarios and evaluated against all remaining scenarios. In detail, the following steps are executed for each of the three methods independently:

(a) Pick one meteorological scenario S_p out of the S_{1,2,...,M} scenarios available; (b) analyze the I simulated hydrological ensemble members of this scenario, H(θ_i | S_p), i = 1, 2, ..., I, resulting from the I parameter sets θ_i for S_p, and select the three representative parameter sets θ_inf,p, θ_med,p, θ_sup,p, as described above. The steps (a-e) are repeated M times, so that each meteorological scenario is used once. We call this evaluation a multi-scenario evaluation because the evaluation is performed against multiple meteorological scenarios at once (S_{M−p}), in contrast to the leave-one-out cross-validation (Sect. 2.5.1), where the evaluation is performed against only one meteorological scenario (S_r).
2.6 Evaluation criteria

2.6.1 Visual assessment

The simplest way of assessing the behavior of the three methods is a visual inspection of the curves plotted in the Gumbel space, which shows how well the selected members reproduce the simulation ensemble and, in particular, whether the assignment of the representative parameter sets is correct. For this purpose, we propose to plot all simulated hydrological ensemble members together with the selected representative members in the Gumbel space for each considered meteorological scenario m individually and to visually assess the assignment of the three selected parameter sets, θ_inf,m, θ_med,m, θ_sup,m, and the corresponding intervals, i.e., H(θ_inf,m | S_m), H(θ_med,m | S_m) and H(θ_sup,m | S_m). The order of the intervals' assignment is assumed to be correct if it holds in the Gumbel space that:

H(θ_inf,m | S_m) ≤ H(θ_med,m | S_m) ≤ H(θ_sup,m | S_m). (8)

We further define the ratio of incorrectly attributed scenarios, i.e., scenarios with mixed-up intervals for which Eq. 8 does not hold, as a measure of the bias:

R_bias = (1/M) Σ_{m=1}^{M} R_m,

where R_m is computed for the m-th scenario as R_m = 1 if the intervals are mixed-up and R_m = 0 otherwise.

2.6.2 Quantitative assessment

To quantitatively compare the three selection methods, we propose to compute the following three metrics:

(I) The ratio of simulation points lying outside the predictive intervals, computed for each m-th scenario as

R_spo,m = (1/I) Σ_{i=1}^{I} R_spo,m,i,

where R_spo,m,i is the ratio for the i-th parameter set of meteorological scenario m, computed over the simulation points j (in the Gumbel space) as the fraction of the J sorted AMs X_{j,i,m} falling outside the predictive intervals.

(II) In the leave-one-out cross-validation, the ratio of hydrological simulation ensemble members lying outside the predictive intervals, computed for each m-th scenario from the per-member ratios R_hso,m,i of simulation points outside the intervals.

(III) In the multi-scenario evaluation, the ratio of meteorological scenarios lying outside the predictive intervals, computed analogously for each scenario p from the per-scenario ratios R_mso,m.

With respect to R_spo, the question arises of how to define the ratio of simulation points lying outside the predictive intervals if multiple hydrological simulations (leave-one-out cross-validation) or multiple meteorological scenarios (multi-scenario evaluation) are considered. Here, we propose to use different percentiles, i.e., the 5th, 50th and 95th percentiles, to characterize the ratio of the simulation points lying outside the computed predictive intervals for each of the methods.
In a similar way, for R_hso and R_mso an additional condition must be defined, i.e., how many of the J hydrological simulation points for R_hso, or how many of the I hydrological simulation ensemble members for R_mso, must lie outside the defined
predictive intervals, so that the hydrological simulation H(θ i |S m ), or the meteorological scenario S m , is considered as lying outside these intervals.

For this purpose, we define a rejection threshold r_thr (dimensionless) that has to be reached for a meteorological scenario or hydrological simulation to be considered as lying outside the predictive intervals. In this work, we consider the values r_thr = {0.50, 0.25, 0.10, 0.05}.
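The rejection-threshold logic can be sketched as follows. The helper names are illustrative; in the study, the interval bounds would be the simulated infimum/supremum curves in the Gumbel space:

```python
import numpy as np

def outside_ratio(member_ams, lower, upper):
    """R_hso-style ratio (sketch): fraction of a member's J simulation
    points (sorted AMs) falling outside the [lower, upper] predictive
    interval. lower/upper may be scalars or per-point arrays."""
    x = np.asarray(member_ams, dtype=float)
    outside = (x < lower) | (x > upper)
    return outside.mean()

def flag_outside(member_ams, lower, upper, r_thr=0.5):
    """A member counts as lying outside the intervals once its
    outside-ratio reaches the rejection threshold r_thr."""
    return outside_ratio(member_ams, lower, upper) >= r_thr
```

For example, a member with 2 of its 10 points outside the interval has a ratio of 0.2: it is flagged at r_thr = 0.10 but not at r_thr = 0.25 or 0.50, which is why stricter (smaller) thresholds produce larger R_hso and R_mso values.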
These three metrics are computed for all three methods and for all M meteorological scenarios, and the median values over these M scenarios are taken as a measure for comparing the three methods.

With no known direct human influence within the catchment, it can be assumed to be close to natural (BAFU, 2017). This catchment is part of a large-scale extreme flood modelling effort for the entire Aare catchment in Switzerland (Viviroli et al., 2020). In this study, the HBV light version (Seibert, 1997; Seibert and Vis, 2012) with 15 calibrated parameters is used; see Table A1 in Appendix C for details on the model parameters and their calibration ranges. Model inputs are time series of precipitation and air temperature, and long-term averages of seasonally varying estimates of potential evaporation, all as area-average values for the entire catchment. These inputs are then redistributed over pre-defined elevation bands using two constant altitude-dependent correction factors for precipitation and temperature. The model output is streamflow at the catchment outlet at time steps identical to the input data (hourly in this study).
For the study catchment, the meteorological inputs (hourly precipitation totals, hourly air temperature means, average hourly evaporation sums) for the HBV model are derived from observed records at meteorological stations and are averaged to catchment mean values using the Thiessen polygon method. The recorded continuous hourly streamflow data at the catchment outlet (Olten station) cover the period 1990-2014.

Identification of multiple HBV parameter sets
To calibrate the HBV model described in Sect. 3.2, we used a multi-objective function F_obj with three scores: the Kling-Gupta efficiency (R_KGE) and the efficiency for peak flows (R_PEAK), which are both sensitive to peak flows, and a measure based on the mean absolute relative error (R_MARE), which is sensitive to low flows. F_obj is obtained by weighting these metrics. For details on R_KGE, see Gupta et al. (2009); for details on R_PEAK and R_MARE, see Vis et al. (2015).
The weights in F_obj were chosen based on our previous experience in modelling Swiss catchments (Westerberg et al., 2020). This objective function is then used together with a genetic algorithm approach (Appendix C) to calibrate the HBV model with hourly data. The available observational datasets are split into a calibration period (1990-2005) and a validation period. The calibration is repeated 100 times, resulting in 100 independent optimal parameter sets (see Appendix C). The median model efficiency measured with F_obj over all 100 runs is 0.7 in both the calibration and the validation periods, which can be considered a good model performance at an hourly scale.

Generation of synthetic meteorological scenarios using a weather generator
Meteorological scenarios of synthetic precipitation and temperature data for the Dünnern at Olten catchment are generated with the weather generator model GWEX developed by Evin et al. (2018), referred to in their paper as GWEX_Disag. This stochastic model is a multi-site precipitation and temperature model that reproduces the statistical behavior of weather events at different temporal and spatial scales. A major property of GWEX is that it uses heavy-tailed marginal distributions for generating extreme precipitation and temperature conditions. Moreover, it has been developed to generate long (≈1000-year) meteorological scenarios. GWEX_Disag generates precipitation amounts at a 3-day scale and then disaggregates them to a daily scale using a method of fragments (for details on the precipitation model see Evin et al. (2018); for details on the temperature model see Evin et al. (2019)).
The meteorological scenarios used in this study are a subset of the long-term meteorological scenarios developed for the entire Aare River basin using recorded data from 105 precipitation stations and 26 temperature stations in Switzerland (Evin et al., 2018, 2019). For this larger-scale research project, GWEX_Disag was set up using daily precipitation and temperature data from the period 1930-2015 and hourly records of precipitation and temperature from 1990-2015 for the Aare River basin. The daily values generated with GWEX_Disag were then disaggregated to hourly values using the method of meteorological analogues, which for each day in the simulated dataset finds an analogue day in the observed data, i.e., a day with a known hourly time structure. Next, catchment means were computed using the Thiessen polygon method. For the present study, 100 different meteorological scenarios (precipitation and temperature), each covering the same time frame of 100 years at an hourly time step, are available for the Dünnern at Olten catchment. The simulated data are assumed to be representative of current climate conditions, i.e., no variation due to climate or land use change, or river modification, is considered. Thus, differences between scenarios are exclusively due to the natural variability of the meteorological time series.

Generation of synthetic hydrological simulation ensembles
Finally, for our analysis, 100 meteorological scenarios with continuous data of 100 years of precipitation and temperature are run through the hydrological model with 100 calibrated parameter sets, yielding 100×100×100 combinations of annual maximum × parameter set × meteorological scenario.
These series of AMs are then used to test the developed methods for selecting representative parameter sets from the ensemble of 100 available sets.
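In terms of array shapes, the synthetic experiment amounts to a three-dimensional block of annual maxima; a small bookkeeping sketch:

```python
import numpy as np

# X_{y,i,m}: M scenarios x I parameter sets x J years (100 each here).
M, I, J = 100, 100, 100
ams = np.zeros((M, I, J))  # to be filled from the simulations H(theta_i | S_m)
print(ams.size)            # 1,000,000 annual maxima in total
```

Slicing this block along the scenario axis (`ams[m]`) yields the I × J matrix on which the three selection methods of Sect. 2.2 operate.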

Representative parameter sets
The representative parameter sets selected with each of the three methods are summarized over all 100 meteorological scenarios in the form of violin plots (Fig. 4). We present parameter sets by their unique indices, which are kept the same for all three methods. Some parameter sets are chosen more often than others as a representative set for different meteorological scenarios. This grouping is particularly visible for the supremum set in all three methods but is strongest for the clustering method; for the infimum set, the grouping effect is most pronounced for the clustering and the quantiling methods, and for the median set in the ranking method.

Figure 3. Setup of the experimental study.

The five most frequently chosen parameter sets for each method are summarized in Table 2. Although different parameter sets are usually selected by the different methods, in a few cases the same set is chosen by more than one selection method (highlighted in Table 2). Among the five most frequently chosen sets, the same parameter set is selected as the median set once by all three methods and twice by at least two methods. For the supremum set, among the five most frequently chosen sets, the same set is selected four times by at least two methods but never by all three. For the infimum set, only one set is chosen by two methods among the five most frequently chosen sets. Interestingly, for the supremum set in the clustering method, only four parameter sets out of all 100 available are chosen over all 100 scenarios.

Infimum, median and supremum intervals
Using the selected representative sets, representative intervals for extreme flood predictions are constructed for each of the 100 meteorological scenarios and each of the three selection methods. Examples of these intervals for two meteorological scenarios are presented in Figs. 5-6. Note that, apart from selecting representative intervals, the clustering method also groups all ensemble members into the three selected clusters.

A first visual assessment shows that the three methods lead to slightly different frequency intervals, particularly in the upper tail of the distribution, i.e., for the rarest (highest) flows, which are of greatest interest. Moreover, the ranking method leads to less symmetrically spread intervals, with the median and infimum intervals lying close to each other.
The other two methods lead to more symmetrically spread intervals.
For the quantitative assessment, the ratio of incorrectly attributed scenarios, i.e. with mixed-up intervals (R_bias), varies between the three methods and is the highest for the ranking method (R_bias = 0.54). For the clustering method, the three intervals are correctly attributed for all 100 meteorological scenarios tested (R_bias = 0.0). For the quantiling method, this ratio equals R_bias = 0.02 and is thus also very low. Hence, we can conclude that both the clustering and quantiling methods provide correctly attributed intervals with a bias ≤ 2%. For the ranking method, the correctness of the interval attribution is poor: in more than 50% of the meteorological scenarios, the simulations corresponding to the selected parameter sets lead to mixed-up frequency intervals.

Evaluation of the three selection methods
The behavior of the three selection methods is further evaluated with the 100 meteorological scenarios using the leave-one-out cross-validation test (Sect. 2.5.1) and the multi-scenario evaluation (Sect. 2.5.2) with the corresponding metrics (Sect. 2.6.2). Examples for two meteorological scenarios are presented in Figs. 7-8 for the leave-one-out cross-validation test and in Figs. 9-10 for the multi-scenario evaluation. From the visual assessment, it is difficult to judge the methods, as they seem to perform similarly well. However, the range of the predictive intervals obtained with the 99 remaining meteorological scenarios (one being held out) differs somewhat between the methods.

This is reflected in the quantitative assessment of the methods' behavior, summarized in Table 3. The leave-one-out cross-validation reveals that the quantiling method receives the highest values for both evaluation criteria, i.e., the median ratio of simulation points lying outside the predictive intervals (R_spo) and the median ratio of hydrological simulation ensemble members lying outside the predictive intervals (R_hso). Thus, this method performed poorest among the three methods tested here. Yet, with R_spo ≤ 0.14 for the 50th percentile and R_hso ≤ 0.05 for the threshold r_thr ≥ 0.50, even this method can be qualified as behaving well in the leave-one-out cross-validation. The ranking and clustering methods achieve similar values for these two metrics, with slightly lower values for the ranking method.
In summary, all criteria values are relatively low for all three methods; the computed criteria can thus only be used to order the methods by their behavior, and none of the methods is rejected.
In contrast to the above findings, the multi-scenario evaluation shows different results, with R_spo being lowest for the clustering method and largest for the ranking method. Similarly, the median ratio of meteorological scenarios lying outside the predictive intervals (R_mso) is lowest for clustering and highest for ranking for all considered threshold values (r_thr in Table 3).
Here too, all computed criteria values are relatively low, with R_spo ≤ 0.05 for the 50th percentile and R_mso = 0 for the threshold r_thr ≥ 0.50 for the poorest-behaving method (ranking). Hence, again all three methods can be qualified as behaving well in the multi-scenario evaluation, and only the order of their behavior can be established.

Behavior of the three selection methods
The results from our experimental study demonstrate that generally all three methods are capable of selecting representative parameter sets that yield reliable predictive intervals in the frequency domain, i.e., all three methods are fit-for-purpose for extreme flood simulation, with the ranking method, however, performing clearly less well than the others (larger bias, as visible in Sect. 4.2). As the developed methods rely on selecting three representative sets as infimum, median and supremum, they respect the maximum variability between individual ensemble members for a given meteorological scenario.
In the validation tests, however, the behaviour scores of the three methods were attributed differently depending on the evaluation criteria. To further compare the methods, we provide below a detailed discussion of the major differences and present a synthesis of how the methods rank on average (averaged across all scenarios) for the quantitative evaluation criteria, as well as for visual inspection (bias) and for additional ease-of-use criteria (Table 4).
From the visual assessment, it clearly appears that the ranking method is the most biased method (with more than half of all meteorological scenarios having mixed-up intervals), while the other two methods can be considered unbiased, with correctly attributed intervals for 98% (quantiling) or more (clustering) of all meteorological scenarios considered here (Sect. 4.2). As expected, these findings are further confirmed by the results from the multi-scenario evaluation, which yield the best behaviour for the clustering method and the worst for the ranking method (Sect. 4.3).

Figure 10. Example of multi-scenario evaluation for the three selection methods (meteorological scenario m = 93); description as in Fig. 9.

Table 4. Synthesis of scoring ranks attributed to the three methods for selecting representative parameter sets (based on quantitative metrics).

(Ranks are attributed from the best (1) to the worst (3).)

Interestingly, the leave-one-out cross-validation study, in contrast to the multi-scenario evaluation, attributes the lowest criteria value to the ranking method, i.e., ranks it as the best method (Table 4). This requires a careful interpretation and an understanding of how the predictive intervals are constructed in the two evaluation studies. In the leave-one-out cross-validation study, the representative parameter sets are selected and the predictive intervals are constructed based on 99 meteorological scenarios and then evaluated against the full simulation range corresponding to the left-out scenario. In other words, this test evaluates how well the selection methods applied to all but one scenario can predict the full simulation range of the left-out scenario. In the multi-scenario evaluation, the representative parameter sets are selected based on a single scenario, and the predictive intervals are then assessed by applying these three selected sets to the other 99 meteorological scenarios. This test quantifies how well the methods applied to a single scenario are transferable to all other scenarios.
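The leave-one-out procedure described above can be sketched in a few lines. The envelope and evaluation functions below are illustrative stand-ins (the paper's intervals come from the selected representative sets, not from a simple min/max pooling):

```python
import numpy as np

def envelope(arrays):
    """Illustrative predictive interval: element-wise min/max of the
    pooled, per-set sorted annual maxima of the kept scenarios."""
    pooled = np.sort(np.vstack(arrays), axis=1)
    return pooled.min(axis=0), pooled.max(axis=0)

def frac_outside(maxima, lower, upper):
    """Fraction of points of the left-out scenario outside the interval."""
    return float(((maxima < lower) | (maxima > upper)).mean())

def leave_one_out(scenario_maxima, select=envelope, evaluate=frac_outside):
    """Leave-one-out cross-validation over meteorological scenarios:
    intervals are built from all scenarios but one and evaluated against
    the full simulation range of the left-out scenario.
    scenario_maxima: list of (n_sets x n_years) arrays, one per scenario."""
    scores = []
    for m in range(len(scenario_maxima)):
        rest = scenario_maxima[:m] + scenario_maxima[m + 1:]
        lower, upper = select(rest)                  # intervals from kept scenarios
        left_out = np.sort(scenario_maxima[m], axis=1)
        scores.append(evaluate(left_out, lower, upper))
    return scores
```

The multi-scenario evaluation simply inverts this loop: select on one scenario, evaluate on the remaining 99.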
Hence, comparing the findings from these two evaluation studies shows that the ranking method has the lowest transferability, i.e., it performs poorly if a single scenario is used for selecting the representative sets (multi-scenario evaluation). Conversely, the ranking method outperforms the two other methods when a high number of meteorological scenarios is used for selecting the representative parameter sets (leave-one-out cross-validation). This means that the ranking method depends strongly on the choice of meteorological scenario, while the other two methods result in representative parameter sets that are transferable to other meteorological scenarios.
This outcome can be understood by considering how the selection methods are constructed: the ranking method depends strongly on the selected simulation period (and hence on the meteorological scenario) because the selection of the representative parameter sets is performed on unsorted annual maxima for each simulated year independently. The other two methods operate over the entire simulation period, which makes them less strongly dependent on individual simulation years. However, the ranking method can be considered the (computationally) easiest to apply because its selection criteria rely purely on ranking within individual simulation years. The other two methods need to be performed in the Gumbel space over the entire simulation period and, in the case of the clustering method, require some additional computational effort (which remains low, however, compared to the hydrologic simulation). The use of the Gumbel space in selecting the representative parameter sets helps, however, to interpret the constructed prediction intervals and to directly assign return periods to them.
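To make the contrast concrete, a whole-period selection in the spirit of the quantiling method can be sketched as follows. The details (plotting positions, distance measure, percentile targets) are our assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

def gumbel_variate(n_years):
    """Gumbel reduced variate for Weibull plotting positions p = i/(n+1),
    used as the x-axis when frequency curves are compared in Gumbel space."""
    p = np.arange(1, n_years + 1) / (n_years + 1)
    return -np.log(-np.log(p))

def select_representative_sets(annual_maxima):
    """Pick infimum / median / supremum parameter sets from an
    (n_sets x n_years) array of annual maxima: sort each set's maxima
    (its empirical frequency curve), then choose the sets closest to the
    0th, 50th and 100th percentile curves of the ensemble.  Returns the
    three selected row indices."""
    sorted_max = np.sort(annual_maxima, axis=1)      # per-set frequency curves
    targets = np.percentile(sorted_max, [0, 50, 100], axis=0)
    picks = []
    for t in targets:                                 # infimum, median, supremum
        dist = np.abs(sorted_max - t).mean(axis=1)    # mean distance to target curve
        picks.append(int(np.argmin(dist)))
    return picks
```

A ranking-style selection would instead operate column-wise on the unsorted maxima, year by year, which is exactly why its result varies with the simulation period.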
Overall, it appears that the clustering method behaves the best (with a median scoring rank of 1) due to its unbiasedness and due to a good performance achieved for all evaluation criteria, for both the leave-one-out cross-validation and the multi-scenario evaluation (Table 4).

Limitations and perspectives
This study proposes a framework for representative hydrological parameter selection to be used within fully continuous ensemble-based simulation frameworks that rely on meteorological inputs generated with a weather generator. Based on our experimental case study, we demonstrate that the proposed three methods are reliable for downsizing the available full parameter ensemble. We should emphasize that the presented methods are independent of the selected parameter calibration approach and of the selected hydrological response model and are thus readily transferable to any similar simulation setting, in particular also to settings where the full parameter samples to be downsized come from model regionalization (i.e., in applications to ungauged or poorly gauged catchments). Moreover, although the methods are tested with a bucket-type hydrologic model, the most valuable application of the proposed methods would be to computationally more demanding hydrological models, which can profit even more from a reduced computational demand.
Furthermore, the proposed approach is tested here using synthetic hydrologic data, i.e., using streamflow simulations of the hydrological model in response to meteorological scenarios. This use of synthetic data makes the results independent of the catchment properties and limits the effect of the hydrological model error and of errors in calibration data on the comparison of the methods.
We can, however, not directly assess here how much variability in the full hydrological ensemble is due to the climate variability and how much is due to the uncertainty resulting from the hydrological model parameters, because these two components are not linearly additive. This can easily be seen by comparing the ranges of the predictive intervals constructed using one scenario and 99 scenarios in the multi-scenario evaluation for two example scenarios in Figs. 9-10. In addition, any ensemble simulation also encompasses other uncertainty sources of the modeling chain, such as those resulting from the weather generator, from the structure of the selected hydrological model, or from the prediction of very rare flood events (Lamb and Kay, 2004; Schumann et al., 2010; Kundzewicz et al., 2017).

Hence, downsizing the hydrological model parameter sample can only aim at understanding and characterizing the hydrological part of the full hydrological ensemble resulting from a combination of multiple parameter sets and multiple meteorological scenarios. The methods are, however, not applicable for characterizing the climate variability (nor for downsizing the number of meteorological scenarios needed).
Moreover, in developing the selection methods, we did not distinguish between different flood types such as heavy rainfall excess or intensive snowmelt events (Merz and Blöschl, 2003; Sikorska et al., 2015). Also, as we focused only on large annual floods (annual maxima), we did not represent the flood seasonality in our analysis. Yet, some recent works emphasise the need to include such information on the flood type (Brunner et al., 2017) or on flood seasonality (Brunner et al., 2018b) in bivariate analyses of floods, or to represent a mixture of both flood type and flood seasonality in flood frequency analysis (Fischer et al., 2016; Fischer, 2018). Thus, the proposed selection methods could potentially be extended to account for different flood types during representative parameter selection, using, e.g., automatic methods of flood type attribution from long discharge series (Sikorska-Senoner and Seibert, 2020).
Finally, we downsize the parameter sample to three sets, which represent the predictive intervals of the full ensemble of hydrological responses fairly well, given different meteorological scenarios. The choice of three sets is motivated by common practice in flood frequency analysis: the three sets emulate the usual way of communicating median values along with prediction limits (Cameron et al., 2000; Blazkova and Beven, 2002; Lamb and Kay, 2004; Grimaldi et al., 2012b). Optionally, one could further downsize the parameter sample to two sets (i.e., infimum and supremum), which would represent the intervals only. Downsizing to more than three parameter sets (e.g., five) could have the advantage of containing more information on the uncertainty intervals, e.g., in case they are asymmetric, and should be explored in a further study.

Conclusions
In this study, we propose and test three methods for selecting representative parameter sets of a hydrological model to be used within fully continuous ensemble-based simulation frameworks. The three selection methods are based on ranking, quantiling and clustering of simulated annual maxima within a limited time window (100 years) that is much shorter than the full simulation period of thousands of years underlying the simulation framework. Based on a synthetic case study, we demonstrate that these methods are reliable for downsizing a parameter sample composed of 100 parameter sets to three representative sets that represent most of the full simulation range in the Gumbel space. Among the tested methods, the clustering method, which selects parameter sets based on cluster analysis in the Gumbel space, appears to outperform the others due to its unbiasedness and its transferability between meteorological scenarios. The ranking method, which is the only tested method that completes the parameter selection on non-sorted annual maxima, can clearly not be recommended for typical settings since it i) tends to result in mixed-up prediction intervals in the Gumbel space and ii) depends too strongly on the simulation period used for parameter selection and thus lacks transferability to other periods or other meteorological scenarios. Possible applications of these methods include all fully continuous simulation schemes for extreme flood analysis, particularly those for which computational constraints arise.

Appendix A: Semi-continuous versus fully continuous simulation approach
This appendix briefly discusses the conceptual difference between the semi-continuous and the fully continuous approach from two perspectives, i.e., from the hydrologic and the hydraulic perspective (Fig. A1; Calver and Lamb, 1995; Cameron et al., 2000; Blazkova and Beven, 2004; Hoes and Nelen, 2005; Winter et al., 2019). In both of these semi-continuous approaches, simulated or design hydrographs are further routed through the hydraulic model. In contrast to these semi-continuous approaches, in the fully continuous simulation approach, as seen from the hydraulic perspective, full discharge time series are routed through the hydraulic model. The locality map of the study catchment is presented in Fig. A2.

Appendix C: Details on the HBV model parameters and model calibration
For searching the best parameter sets within the defined parameter ranges (Table A1), a Genetic Algorithm and Powell optimization (GAP) approach (Seibert, 2000) is used. This approach is executed in two major steps. First, the GA optimization is performed, which relies on an evolutionary mechanism of selection and recombination of a user-defined number of parameter sets (i.e., the parameter population) randomly selected within the defined parameter ranges. The principal idea of this search is to regenerate the parameter sets from the subgroup of parameter sets selected using the defined objective function F_obj as a criterion, i.e., choosing the parameters that give the highest value of F_obj at the previous step of the model calibration. The search for the best parameter set is terminated at a user-defined maximum number of model iterations and results in a selected optimal parameter set. Second, the optimal parameter set obtained in the previous step is used as a starting point for a local optimization search using Powell's quadratically convergent method (Press et al., 2002). The parameter set finally achieved from the local optimization is retained as the best set. In this study, the total number of model iterations is set to 2500 for the GA and 500 for the local Powell optimization. The GAP optimization is repeated 100 times, resulting in 100 optimized parameter sets (Fig. A3).
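The two-step global-then-local structure of GAP can be sketched as follows. Both steps here are simplified stand-ins (a toy evolutionary loop for the GA, a coordinate-wise direction-set search for Powell's method); this illustrates the structure only and is not the GAP implementation of Seibert (2000):

```python
import numpy as np

rng = np.random.default_rng(42)

def gap_style_search(f_obj, bounds, n_generations=40, n_pop=50):
    """Maximize f_obj within box bounds in two steps, GAP-style:
    (1) a population-based global search, (2) a local refinement
    starting from the best set found globally."""
    lo, hi = np.array(bounds, dtype=float).T
    # step 1: evolutionary loop -- keep the fittest half, mutate it
    pop = rng.uniform(lo, hi, size=(n_pop, len(lo)))
    for _ in range(n_generations):
        scores = np.array([f_obj(p) for p in pop])
        parents = pop[np.argsort(scores)[-n_pop // 2:]]
        children = parents + rng.normal(0.0, 0.05 * (hi - lo), parents.shape)
        pop = np.clip(np.vstack([parents, children]), lo, hi)
    best = pop[np.argmax([f_obj(p) for p in pop])]
    # step 2: local refinement via shrinking line searches along each
    # coordinate direction (standing in for Powell's method)
    step = 0.1 * (hi - lo)
    for _ in range(100):
        for d in range(len(best)):
            for s in (step[d], -step[d]):
                trial = best.copy()
                trial[d] = np.clip(trial[d] + s, lo[d], hi[d])
                if f_obj(trial) > f_obj(best):
                    best = trial
        step *= 0.7
    return best
```

In a real calibration, `f_obj` would run the HBV model with the candidate parameter set and return the objective function value, which is why the number of objective evaluations (2500 + 500 in the study) dominates the cost.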