For extreme-flood estimation, simulation-based approaches represent an interesting alternative to purely statistical approaches, particularly if hydrograph shapes are required. Such simulation-based methods are adopted within continuous simulation frameworks that rely on statistical analyses of continuous streamflow time series derived from a hydrological model fed with long precipitation time series. These frameworks are, however, affected by high computational demands, particularly if floods with return periods

The quantification of extreme floods and associated return periods remains a key issue for flood hazard management

Most modern applications, however, require the estimation not only of extreme peak flows, associated flood volumes and durations but also of hydrograph shapes, in particular in the context of reservoir design or for safety checks of hydraulic infrastructure

Simulation-based methods for design or extreme-flood estimation have a long history in hydrology

Modern extensions of this approach, however, use continuous hydrological modelling for design flood estimation to generate either (i) a range of initial conditions for use in combination with design or randomly drawn storms

These fully continuous simulation schemes are particularly useful for studies where recorded discharge time series are too short for extreme-flood analysis

High computational power is particularly needed to provide estimates for high to extreme return periods (up to 1000 years and higher), as required for safety-related studies or for hydrological-hazard management. For such rare events, the large number of simulations in fully continuous frameworks can easily become prohibitive, particularly if the framework must also account for different sources of modelling uncertainty, such as input uncertainty (different weather generators) or the uncertainty in the hydrological model itself, which is often represented through the model parameters (using a distribution of model parameters rather than a single best set)

Studies dealing with modelling or data uncertainties in such continuous simulation frameworks are rare, as most previous studies have focused only on the uncertainty related to the hydrological-model parameters

Accordingly, for settings where full hydrological–hydraulic models are used for continuous simulation, some pre-selection of hydro-meteorological scenarios is often needed, particularly for computationally demanding, complex hydrological or hydraulic models. How this selection should be performed, i.e. based on which quantitative criteria, remains unclear. A particularity of the meteorological scenarios is that all scenarios generated with the same weather generator are different but equally likely realizations of the assumed climate conditions; in other words, they represent natural climate variability. As long as the continuous simulation scheme is of interest, i.e. if full time series are to be analysed without the possibility of extracting single events, the number of meteorological-input scenarios cannot be reduced without first simulating them with a hydrological model. This is due to the non-linear response of any hydrological model to meteorological input: an ensemble of input scenarios with identical statistical properties translates into hydrological scenarios with different statistical properties.

We are therefore essentially left with finding ways to reduce at least the computational requirements associated with hydrological-model parameter uncertainty; the alternative, shortening the simulated time series, is unattractive for the analysis of extremes. Accordingly, in this work, we assess different data-based methods for selecting a reduced-size ensemble of hydrological-model parameters for use within a continuous-simulation, ensemble-based hydro-meteorological framework. Our specific research questions are as follows. (1) How can we downsize (reduce) hydrological-model parameter ensembles for the simulation of rare floods so that the variability and range of the full ensemble are preserved as closely as possible? (2) Can such a reduced hydrological-model parameter ensemble be assumed to be reliable for the simulation of rare floods during the reference period (used for parameter ensemble downsizing) as well as during an independent validation period? (3) Which metrics are suitable for assessing the performance of such a reduced hydrological-model parameter ensemble against the reference (full) ensemble? Specifically, three different methods for reducing a full hydrological-model parameter ensemble to a handful of parameter sets are proposed and tested for deriving the uncertainty ranges of simulated rare flood events (return periods of up to 10 000 years). All three methods rely on simulated annual maxima and are tested on 10 000 years of continuous synthetic data (simulated with a hydrological model). Using synthetic instead of observed data is important here as only recently

The focus of this study is a fully continuous hydro-meteorological ensemble-based simulation framework for estimation of rare floods. The underlying streamflow time series ensemble is built based on meteorological scenarios and multiple hydrological-model runs using a number of calibrated model parameter sets. A meteorological scenario represents a single realization from a stochastic weather generator with constant model parameters. These meteorological scenarios are equally likely model realizations that differ in the precipitation and temperature patterns, and together they represent the natural variability in the climate (and not the model uncertainty in a weather generator). These realizations are then used as inputs into a hydrological model to simulate the hydrological response. To account for hydrological-modelling uncertainty, a range of different hydrological-model parameter sets is used for each meteorological scenario. These two sources of hydrological variability then accumulate along the modelling chain and can be represented as an ensemble of possible hydrological responses (Fig.

Framework overview. The infimum and supremum refer to the tightest bounds enclosing the ensemble simulation from below (greatest lower bound) and from above (least upper bound), respectively.

Within such a defined framework we first want to understand how variable the hydrological response simulation is and, second, develop methods to downsize the hydrological-model parameter ensemble to a smaller subset that could be dealt with within such a modelling chain for rare-flood simulations. This subset should represent the entire range of variability in the hydrological response but with little computational effort and should also be transferable to independent time periods. Hereafter, we call this subset the

Downsizing of the ensemble of hydrological-model parameters is particularly needed if (i) the probability distribution of the parameter sets is unknown because parameter sets result from independent calibrations or regionalization approaches, and only a limited number of sets can be run with the hydrological model, or (ii) the distribution is known (i.e. estimated from data), but due to time-consuming simulations it is not possible to run the hydrological model for a full ensemble of multiple meteorological scenarios.

The question of how many parameter sets are needed to cover most of the simulation range is important. Here, however, we fix this number and instead test different selection approaches. Hence, for the purpose of our work, we furthermore would like this

The key challenge for such a downsizing is the fact that we would like to select hydrological-model parameter sets (i.e. select in the parameter space) but based on how representative the corresponding simulations are in the model response space. Moreover, the downsized ensemble should not only be representative of simulated time periods but also be transferable to independent time periods. The first question to answer is which model response space the selection should focus on. In the context of rare-flood estimation, focusing on the frequency distribution of annual maxima (AMs) is a natural choice; we thus propose to use the representation of AMs sorted by their magnitudes (i.e. frequency space) as the

The next step is the development of selection methods to select hydrological-model parameter sets that plot into certain locations in the model response space. Given the nonlinear relationship between model parameters and hydrological responses, this selection has to be obtained via a post-modelling approach; i.e. we have to first simulate all parameter sets and then decide which parameter sets fulfil certain selection criteria in the model response space.

For that purpose, we developed three methods, which are based on (a) ranking, (b) quantiling, and (c) clustering, described in detail in Sect.

Next, for the purpose of this study, let us define the following variables:

Overview of the modelling chain and the selection methods of the representative parameter sets;

For the sake of simplicity, let us choose a single meteorological scenario

Using the above notations, the selection of representative parameter sets can be summarized as follows.

Simulation of continuous streamflow time series: the hydrological model is run with all available

Selection of annual maxima (AMs): for each

Selection of three representative parameter sets based on the simulated AMs, following the three methods detailed below.

AMs computed from

For each

Parameter sets that correspond to the selected AM ranks are then attributed as 5th, 50th and 95th parameter sets per

The parameter sets selected in step (c) are compared over all
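The ranking steps can be illustrated with a short Python sketch. It is not the authors' code and rests on stated assumptions: the AMs of one meteorological scenario are stored as an array with one row per parameter set and one column per simulation year (in calendar order, unsorted); the 5th, 50th and 95th percentiles are mapped to rank positions by rounding; and the comparison over all years is assumed to keep, for each role, the parameter set selected most often. The function names are hypothetical.

```python
import numpy as np

def rank_positions(n_sets, probs=(0.05, 0.50, 0.95)):
    """0-based positions in ascending order matching the given percentiles (assumed mapping)."""
    return [int(np.rint(p * (n_sets - 1))) for p in probs]

def ranking_method(am):
    """Sketch of the ranking method on unsorted annual maxima.

    am: array (n_sets, n_years); am[i, t] is the AM simulated with
        parameter set i in year t (years kept in calendar order).
    Returns indices of the (infimum, median, supremum) parameter sets.
    """
    n_sets, _ = am.shape
    positions = rank_positions(n_sets)
    # For every year separately, order the parameter sets by that year's AM
    order = np.argsort(am, axis=0)           # shape (n_sets, n_years)
    picks = order[positions, :]              # shape (3, n_years): set chosen per role and year
    selected = []
    for role in picks:
        # Assumption: keep the set that was picked most often over all years
        counts = np.bincount(role, minlength=n_sets)
        selected.append(int(np.argmax(counts)))
    return tuple(selected)
```

Because the ranking is done year by year, the selected set for a role can change from one year to the next; the majority vote above is only one plausible way of condensing these yearly picks.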

For each

The 5 %, 50 % and 95 % quantiles of these ensembles are computed at each

Next, for each

Finally, the ensemble members which lie closest to
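A minimal sketch of the quantiling steps follows, assuming that the AMs of one meteorological scenario are stored as an array with one row per parameter set and one column per simulation year. The distance between an ensemble member and a quantile curve is not reproduced in the text here, so the sketch assumes a sum of squared differences over all rank positions; the function name is hypothetical.

```python
import numpy as np

def quantiling_method(am):
    """Sketch of the quantiling method in the frequency space.

    am: array (n_sets, n_years) of annual maxima per parameter set.
    Returns indices of the (infimum, median, supremum) parameter sets.
    """
    # Sort each member's AMs by magnitude: one frequency curve per parameter set
    curves = np.sort(am, axis=1)
    # 5 %, 50 % and 95 % quantiles across the ensemble at each rank position
    q = np.quantile(curves, [0.05, 0.50, 0.95], axis=0)   # shape (3, n_years)
    selected = []
    for target in q:
        # Assumption: closeness measured as a sum of squared deviations
        dist = np.sum((curves - target) ** 2, axis=1)
        selected.append(int(np.argmin(dist)))
    return tuple(selected)
```

Unlike the ranking method, all comparisons here happen on sorted AMs, so the selected members describe the whole frequency curve rather than individual years.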

Similar to the quantiling method, for each

These members are next clustered into three representative groups (clusters) based on all

Next, these clusters are sorted by the magnitude of their cluster means, comparing percentiles in the upper tail of the distribution (here we used the 90th percentile). Using an upper-tail percentile is important because the methods focus on rare floods. However, we found that the method was insensitive to the percentile choice as long as it lies in the upper tail (i.e.

For each

For each of these three clusters, the ensemble member that lies closest to the cluster percentile, i.e. received the smallest value of
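The clustering steps can be sketched in the same setting (one row per parameter set, one column per simulation year). Two assumptions should be stated plainly: a plain k-means (Lloyd's algorithm) with a deterministic start stands in for whichever clustering algorithm was used, and the member closest to the cluster mean curve stands in for the member closest to the "cluster percentile", whose exact definition is not reproduced here. Clusters are ordered by the 90th percentile of their mean curves, as described above. All names are hypothetical.

```python
import numpy as np

def kmeans(x, k=3, n_iter=100):
    """Plain Lloyd's k-means with a deterministic start (stand-in clustering)."""
    # Start from members spread over the range of their upper-tail magnitude
    tail = np.percentile(x, 90, axis=1)
    start = np.argsort(tail)[np.linspace(0, len(x) - 1, k).astype(int)]
    centroids = x[start].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance of every member curve to every centroid
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([x[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
                        for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment consistent with the returned centroids
    labels = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, centroids

def clustering_method(am, tail_pct=90):
    """Sketch of the clustering method; am has shape (n_sets, n_years)."""
    curves = np.sort(am, axis=1)                  # frequency curve per parameter set
    labels, centroids = kmeans(curves, k=3)
    # Order clusters from low to high by an upper-tail percentile of their mean curve
    order = np.argsort(np.percentile(centroids, tail_pct, axis=1))
    selected = []
    for c in order:
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            raise ValueError("empty cluster; try another initialization")
        # Stand-in target: the cluster mean curve
        dist = ((curves[members] - centroids[c]) ** 2).sum(axis=1)
        selected.append(int(members[np.argmin(dist)]))
    return tuple(selected)
```

The returned tuple is ordered (infimum, median, supremum) because the clusters are sorted by their upper-tail magnitude before a representative member is picked from each.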

To visualize the selection methods, we use the Gumbel space (generalized extreme-value distribution Type I) with Gringorten's method
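For reference, Gringorten's plotting position assigns the i-th smallest of n annual maxima the non-exceedance probability F_i = (i - 0.44)/(n + 0.12); the Gumbel reduced variate is y = -ln(-ln F), and for annual data the return period is T = 1/(1 - F). A small NumPy helper (hypothetical name) returning these plotting coordinates might look as follows.

```python
import numpy as np

def gringorten_gumbel(values):
    """Gringorten plotting positions and Gumbel coordinates for a sample of AMs."""
    x = np.sort(np.asarray(values, dtype=float))   # ascending order
    n = x.size
    rank = np.arange(1, n + 1)                      # 1 = smallest value
    f = (rank - 0.44) / (n + 0.12)                  # non-exceedance probability
    y = -np.log(-np.log(f))                         # Gumbel reduced variate
    t = 1.0 / (1.0 - f)                             # return period in years (annual maxima)
    return x, f, y, t
```

With a 100-year series, the largest AM plots at T = 100.12/0.56, i.e. about 179 years, which shows why curves from 100 simulated years only reach moderately rare floods within a single scenario.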

Comparison of three methods for selecting representative parameter sets based on annual maxima (AMs).

While the three methods described in Sect.

The major difference between these three methods is that the ranking method is evaluated based on individual simulation years using simple ranking of flow maxima independently of their frequency; i.e. it works on unsorted annual maxima. Note that in this way, for each

In contrast to the ranking method, both other methods, i.e. quantiling and clustering, are performed on sorted AMs over all simulation years, i.e. in the flow frequency space. This enables statistical statements to be made about the selected parameter sets and about the predictive intervals constructed with the help of these parameter sets (as they are constructed on the entire simulation ensemble). Furthermore, selected parameter sets can be assumed to be representative over the entire simulation period (see Table

Testing the methods for a time period different from the one used for the parameter ensemble downsizing is crucial for assessing how well the reduced ensembles can substitute for the whole simulation ensemble. Thus, we propose to assess the behaviour of the developed approach by repeating the selection of the three representative parameter sets with the three proposed methods with multiple (

To evaluate the three selection methods, we perform a leave-one-out cross-validation simulation study, in which a meteorological scenario

Pick and remove one meteorological scenario

Analyse all other meteorological scenarios

Estimate the predictive intervals of these

Evaluate the meteorological scenario
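The leave-one-out loop can be sketched as follows, with assumptions stated explicitly because the exact steps are only partly reproduced above: every parameter set is assumed to have been simulated with every scenario (as in the experimental set-up); the infimum and supremum sets selected on the other scenarios are applied to the left-out scenario; and the 90 % PI is taken as the 5th percentile of the resulting infimum curves and the 95th percentile of the supremum curves. `select_by_mean` is a hypothetical placeholder for any of the three selection methods.

```python
import numpy as np

def select_by_mean(curves):
    """Hypothetical placeholder selector: members at the 5th, 50th and 95th
    percentile positions when ranked by their mean annual maximum."""
    n = curves.shape[0]
    order = np.argsort(curves.mean(axis=1))
    pos = [int(np.rint(p * (n - 1))) for p in (0.05, 0.50, 0.95)]
    return tuple(int(order[k]) for k in pos)

def leave_one_out(am_sorted, select_fn=select_by_mean):
    """Per-scenario ratio of left-out ensemble points outside the 90 % PI.

    am_sorted: array (n_scenarios, n_sets, n_years) of sorted AMs, i.e.
               every parameter set simulated with every scenario.
    """
    n_scen = am_sorted.shape[0]
    ratios = np.empty(n_scen)
    for j in range(n_scen):
        # Representative sets selected on every scenario except the left-out one
        picks = [select_fn(am_sorted[k]) for k in range(n_scen) if k != j]
        inf_sets = [p[0] for p in picks]
        sup_sets = [p[2] for p in picks]
        # Assumption: apply these sets to the left-out scenario and bound them
        lower = np.percentile(am_sorted[j, inf_sets], 5, axis=0)
        upper = np.percentile(am_sorted[j, sup_sets], 95, axis=0)
        members = am_sorted[j]                  # full ensemble of the left-out scenario
        outside = (members < lower) | (members > upper)
        ratios[j] = outside.mean()
    return ratios
```

Passing a different `select_fn` lets the same loop compare selection methods on identical data.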

To further evaluate the three methods, we perform a simulation study using multiple (

Pick one meteorological scenario

Analyse the

For all other remaining meteorological scenarios {

Compute the 5th percentile for

Evaluate the predictive intervals against all

The simplest way of assessing the behaviour of these three methods is a visual inspection of curves plotted in the frequency space (e.g. using the Gumbel distribution for plotting), which can show how well the selected members reproduce the simulation ensemble and particularly whether the assignment of the representative parameter sets is correct. For this purpose, we propose to plot all simulated hydrological ensemble members together with the selected representative members in the frequency space for each considered meteorological scenario

To quantitatively compare the three selection methods, we propose to compute the following five metrics:

The ratio of simulation points in the frequency space, i.e. sorted annual maxima, lying outside the predictive intervals is computed for each

In the leave-one-out cross-validation, the ratio of hydrological-simulation ensemble members lying outside the predictive intervals is computed for each

In the multi-scenario evaluation, the ratio of meteorological scenarios lying outside the predictive intervals is computed for each scenario

Relative band spread of PIs (

Overlapping pools of PIs (

With respect to

In a similar way, for

With regards to

These five metrics are computed for all three methods and for all
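Two of the five metrics lend themselves to a compact sketch. The ratio of points outside the PIs follows the definition given above; for the relative band spread, the sketch assumes the interval width divided by the median curve, averaged over all rank positions, which may differ from the exact definition used in the study. Function names are hypothetical.

```python
import numpy as np

def ratio_outside(points, lower, upper):
    """Fraction of frequency-space points (sorted AMs) outside [lower, upper]."""
    points = np.asarray(points, dtype=float)
    outside = (points < lower) | (points > upper)
    return float(outside.mean())

def relative_band_spread(lower, median, upper):
    """Mean PI width relative to the median curve (assumed definition)."""
    lower, median, upper = (np.asarray(a, dtype=float) for a in (lower, median, upper))
    return float(np.mean((upper - lower) / median))
```

Both functions accept curves evaluated at the same rank positions, so they can be applied unchanged to the leave-one-out and the multi-scenario settings.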

Location of the Dünnern at Olten catchment with a river network extracted from Swiss Map Vector 25

For testing the methods developed here, a small, near-natural catchment is preferable, i.e. one with little anthropogenic influence, in which hydrological responses are transparent and the generation of rare floods (peaks) is not affected by human constructions (dams, bridges). For this purpose, the Dünnern stream at Olten catchment with an area of 196 km

To simulate the hydrological catchment responses to meteorological scenarios, the HBV model (Hydrologiska Byråns Vattenbalansavdelning) is used. The HBV model is a semi-distributed bucket-type model, and it consists of four main routines: (1) precipitation excess, snow accumulation and snowmelt; (2) soil moisture; (3) groundwater and streamflow responses; and (4) run-off routing using a triangular weighting function. Due to the presence of the snow component, the HBV model is applicable to mountainous catchments
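As an illustration of the routing routine, a symmetric triangular weighting function with a base of MAXBAS time steps (the HBV routing parameter) can be discretized into weights that sum to one and convolved with the runoff generated in the catchment. This is a generic sketch with hypothetical function names; the exact discretization used in HBV light may differ.

```python
import numpy as np

def triangular_weights(maxbas):
    """Weights of a symmetric triangular unit response with base `maxbas` steps."""
    b = float(maxbas)

    def cdf(x):
        # Cumulative area under a triangle on [0, b] with its peak at b / 2
        x = np.clip(x, 0.0, b)
        return np.where(x <= b / 2, 2 * x**2 / b**2, 1 - 2 * (b - x) ** 2 / b**2)

    edges = np.arange(0, int(np.ceil(b)) + 1, dtype=float)
    w = np.diff(cdf(edges))          # area falling into each time step
    return w / w.sum()               # guard against rounding

def route(q, maxbas):
    """Spread generated runoff `q` over time with the triangular weights."""
    w = triangular_weights(maxbas)
    return np.convolve(q, w)[: len(q)]
```

For MAXBAS = 3, a unit impulse is spread as 2/9, 5/9 and 2/9 over three time steps, so the routed hydrograph keeps the runoff volume while delaying and smoothing the peak.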

In this study, the version HBV light

For the study catchment, meteorological inputs (hourly precipitation totals, hourly air temperature means, average hourly evaporation sums) for the HBV model are derived from observed records from meteorological stations and are averaged to the mean catchment values using the Thiessen polygon method. The recorded continuous hourly streamflow data at the catchment outlet (Olten station) cover the period 1990–2014.
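The Thiessen averaging step reduces to an area-weighted mean of the station series. The sketch below assumes that the area of each station's Thiessen polygon within the catchment has already been computed (e.g. in a GIS) and that the station series contain no gaps; the function name is hypothetical.

```python
import numpy as np

def thiessen_mean(station_values, polygon_areas):
    """Catchment average of station series weighted by Thiessen-polygon areas."""
    values = np.asarray(station_values, dtype=float)   # shape (n_stations, n_steps)
    areas = np.asarray(polygon_areas, dtype=float)      # polygon area inside the catchment
    weights = areas / areas.sum()                       # area fractions summing to 1
    return weights @ values                             # one catchment value per time step
```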

To derive multiple parameter sets of the HBV model, we propose a heuristic approach that relies on multiple independent model calibration trials using a genetic-algorithm (GA) approach (Appendix

The weights in

Here, the genetic algorithm is run 100 times, resulting in 100 independent optimal parameter sets (see Fig.
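How several performance criteria can be combined into a single calibration score is sketched below. The components (Nash-Sutcliffe efficiency of flows and of log-transformed flows) and the equal weights are hypothetical choices for illustration only; the objective function and weights actually used are described in the Appendix. In a multi-start setting, each independent GA run would maximize such a score from a different random initial population.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency (1 = perfect fit, 0 = no better than the mean)."""
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def weighted_objective(sim, obs, weights=(0.5, 0.5), eps=1e-6):
    """Hypothetical multi-objective score: weighted NSE of flows and log-flows."""
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    # Log-transformed flows emphasize low-flow performance; eps avoids log(0)
    scores = (nse(sim, obs), nse(np.log(sim + eps), np.log(obs + eps)))
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, scores) / w.sum())
```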

Note that the way to derive 100 parameter sets described above is one possible approach, and other calibration methods could be used (e.g. Monte Carlo or bootstrapping).

Meteorological scenarios of synthetic precipitation and temperature data for the Dünnern at Olten catchment are generated with the weather generator model GWEX developed by

The meteorological scenarios used in this study are a subset from the long-term meteorological scenarios developed for the entire Aare river basin using recorded data from 105 precipitation stations and from 26 temperature stations in Switzerland

For the present study, 100 different meteorological scenarios (precipitation and temperature) covering the same time frame of 100 years at an hourly time step are available for the Dünnern at Olten catchment (see Fig.

Finally, for our analysis, 100 meteorological scenarios with 100 years of continuous precipitation and temperature data and 100 calibrated parameter sets of the HBV model are available. The number 100 was chosen as a compromise between limiting the computational cost of the model calibrations and hourly simulations and maximizing the information content of the hydrological-parameter sample and of the climate variability. We chose the same number (100) for meteorological scenarios, parameter sets and simulation years so as not to favour any of these components in the comparison of the methods. The 100 meteorological scenarios are used as input into the HBV model to generate streamflow time series with each of the 100 HBV parameter sets. To set up the initial conditions of the model, a 1-year warm-up period is always used prior to the simulation period. To get an overview of the variability in such hydrological ensembles, see Fig.

From each of these continuous hydrological simulations, 100 annual maxima (AMs; one per calendar year) are selected (see Fig.
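Extracting one AM per calendar year from a continuous hourly series is a simple group-by-year maximum. A minimal NumPy sketch with a hypothetical function name, assuming that the calendar year of each time step is supplied alongside the flow values:

```python
import numpy as np

def annual_maxima(flow, years):
    """One annual maximum per calendar year from a continuous (e.g. hourly) series."""
    flow = np.asarray(flow, dtype=float)
    years = np.asarray(years)
    unique_years = np.unique(years)                     # sorted calendar years
    am = np.array([flow[years == y].max() for y in unique_years])
    return unique_years, am
```

Applied to a 100-year simulation, this yields the 100 AMs per parameter set and meteorological scenario on which all three selection methods operate.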

Set-up of the experimental study.

These series of AMs are next used to test the developed methods of selecting the representative parameter sets from the ensemble of 100 available sets.

The three representative parameter sets

The representative parameter sets selected with each of the three methods are summarized over all 100 meteorological scenarios in Table

Although different methods usually select different parameter sets, in a few cases the same set is chosen by more than one selection method. Among the three most frequently chosen sets, the same parameter set is selected as the median set once by all three methods and several times by at least two methods.

The variability in the selected hydrological parameter sets is presented in Fig.

Box plots showing the variability in the hydrological parameter sets selected as representative parameter sets by the three methods over 100 meteorological scenarios. The white box plots illustrate the entire parameter ensemble (i.e. 100 sets); outliers are not shown. I: infimum; M: median; and S: supremum set. Units as in Table

Using the selected representative sets, representative intervals for rare-flood estimations are constructed for each of the 100 meteorological scenarios and each of the three selection methods. Examples of these intervals for two meteorological scenarios are presented in Figs.

Example of the representative parameter sets' selection with three methods in the Dünnern at Olten catchment (meteorological scenario

Example of the representative parameter sets' selection with three methods in the Dünnern at Olten catchment (meteorological scenario

According to a first visual assessment, the three methods lead to slightly different frequency intervals, particularly in the upper tail of the distribution, i.e. for the rarest (highest) flows, which are of most interest. Moreover, the ranking method leads to less symmetrically spread intervals, with the median and infimum intervals lying close to each other. The other two methods lead to more symmetrically spread intervals.

For the quantitative assessment, the ratio of scenarios incorrectly attributed, i.e. with intervals being mixed up (

Example of leave-one-out cross-validation for the three selection methods and two meteorological scenarios. PIs represent the 90 % predictive intervals.

Example of multi-scenario evaluation for the three selection methods and two meteorological scenarios. PIs represent the 90 % predictive intervals.

The behaviour of the three selection methods is further evaluated with the 100 meteorological scenarios using the leave-one-out cross-validation test (Sect.

Metrics of the behaviour of the approach for three methods of selecting representative parameter sets and the predictive intervals in the leave-one-out cross-validation and in the multiple-scenario evaluation. The values represent the median values over all 100 scenario runs.

This is reflected in the quantitative assessment of the methods' behaviour, summarized in Table

In summary, all criteria values are relatively low for all three methods; the computed values can thus only be used to rank the methods by their behaviour, and none of the methods is rejected.

In contrast to the above findings, the multi-scenario evaluation reveals different results, with

Also, here all computed criteria values are relatively low, with

Analysis of overlapping PI pools (

Evaluation of the leave-one-out cross-validation and the multi-scenario test for the three selection methods using the relative band spread (

As the above shows, accepting or rejecting any of the three methods tested here is not straightforward. Apart from the ranking method, which showed a large bias, the other two methods, i.e. quantiling and clustering, performed similarly well. Yet these two methods provide quite different intervals (with a different spread). The validity and usefulness of these methods for selecting the representative parameter sets are thus further discussed below in Sect.

The results from our experimental study demonstrate that generally all three methods are capable of selecting representative parameter sets that yield reliable predictive intervals in the frequency domain, i.e. all three methods are fit for purpose for extreme-flood simulation, with the ranking method performing, however, clearly less well than the others (larger bias, as visible in Sect.

Synthesis of scoring ranks attributed to the three methods for selecting representative parameter sets (based on quantitative metrics). The ranks are attributed descending from the best (1) to the worst (3) behaviour. The median scoring rank (last line) corresponds to the median over all criteria.

In the validation tests, however, the three methods were scored differently depending on the evaluation criterion. To further compare the methods, we discuss the major differences below and present a synthesis of how the methods rank on average across all scenarios for the quantitative evaluation criteria, which we support with further qualitative evaluation criteria (Table

From the visual assessment, i.e. based on the method bias (

Interestingly, the leave-one-out cross-validation study, in contrast to the multi-scenario evaluation, attributes the lowest criteria value to the ranking method, i.e. ranks it as the best method (Table

In a similar way,

Overall based on scoring results from Table

We should emphasize that the presented methods are independent of the chosen hydrological-model calibration approach and of the hydrological-response model, and are thus readily transferable to any similar simulation setting. Although the calibration of a hydrological model lies beyond the scope of this paper, it is assumed that (at least) 100 parameter sets of a hydrological model can be made available for selecting the representative parameter sets. For that purpose, the hydrological model should be calibrated with observed data from a record long enough to cover rare floods, so that rare floods can be realistically simulated. In this work, to derive 100 parameter sets, we proposed a heuristic approach that relies on multiple independent model calibration trials using a genetic-algorithm approach and a multi-objective function. This method represents an interesting alternative to systematic sampling of the posterior parameter distributions (e.g. via Markov chain Monte Carlo sampling) or to any Monte Carlo method relying on a very high number of model runs. Its strength is that it can be applied to select parameter sets from independent model calibration settings (with different scores, calibration periods, etc.).

Note however that for the purpose of deriving 100 parameter sets, a continuous hydrological model does not necessarily require continuous calibration data, and it could be also calibrated to discrete data (e.g. using hydrological signatures;

Furthermore, the proposed approach is tested here using synthetic hydrological data, i.e. streamflow simulations of the hydrological model in response to meteorological scenarios. We chose synthetic instead of observed data to work with continuous simulations long enough to cover rare events and to minimize the influence of model errors arising from the calibration data and procedure. By using synthetic data as a reference (instead of observed data), the latter errors can be neglected here. The proposed methods should be tested with more catchments and other models to verify the ranking of the methods obtained in this study.

The selection methods proposed in this study enable one to choose representative parameter sets of a hydrological model and, based on those, to construct uncertainty predictive intervals (PIs) for extreme-flood analysis in the frequency space. Here, we tested the methodology using 100 meteorological scenarios that should represent natural climate variability and thus provide independent conditions for evaluating the methods. Such a method for constructing PIs from a hydrological model ensemble is a powerful tool that opens several avenues for further detailed uncertainty analysis. For instance, one may be interested in the contributions of different uncertainty sources to the total PIs, e.g. from the hydrological model or from natural climate variability. As these two components are not linearly additive, their separation is not straightforward.

In addition, any ensemble simulation also encompasses other uncertainty sources of the modelling chain, such as those resulting from the weather generator, from the structure of the selected hydrological model, from the prediction of very rare flood events, etc.

Downsizing the hydrological-model parameter sample can only aim to understand and characterize the hydrological part of the full hydrological ensemble resulting from a combination of multiple parameter sets and multiple meteorological scenarios. The variability in hydrological-model parameters arises from the parameter equifinality

Moreover, in developing the selection methods, we did not distinguish between different flood types such as heavy rainfall excess or intensive snowmelt events

Finally, we downsize the hydrological-model parameter sample to three sets which represent the predictive intervals of the full ensemble of hydrological responses fairly well given different meteorological scenarios. This number of three sets is motivated by the fact that it can be readily processed within a fully continuous ensemble-based framework using numerous climate settings. This is common practice in flood frequency analysis, and the three sets emulate the common practice of communicating median values along with prediction limits

Optionally, one could further downsize the hydrological-model parameter sample to two sets (i.e. infimum and supremum), which would represent the intervals only. Downsizing to more than three parameter sets (e.g. five or more) could have the advantage of containing more information on uncertainty intervals, e.g. in the case that they are asymmetric, and should be explored in further studies.

Possible applications of these selection methods include all studies where computational requirements are an issue, e.g. rare-flood analysis in safety studies concerning dam or bridge failures, including under climate scenarios, and the evaluation of rare floods due to changes in climatic variables using several emission scenarios and the propagation of different uncertainty sources. Finally, these methods could be used to quantify the contributions of different uncertainty sources to rare-flood estimates while reducing the hydrological-modelling effort required to propagate parametric uncertainty.

In this study, we propose and test three methods for selecting representative parameter sets of a hydrological model for use within fully continuous ensemble-based simulation frameworks. The three selection methods are based on ranking, quantiling and clustering of simulated annual maxima within a limited time window (100 years) that is much shorter than the full simulation period of thousands of years underlying the simulation framework. Based on a synthetic case study, we demonstrate that these methods are reliable for downsizing a hydrological-model parameter sample of 100 parameter sets to three representative sets that cover most of the full simulation range in the frequency space. Among the tested methods, the clustering method, which selects parameter sets based on cluster analysis in the frequency space, appears to outperform the others owing to its unbiasedness, its transferability between meteorological scenarios and its better performance for rare floods. The ranking method, the only tested method that performs the parameter selection on unsorted annual maxima, cannot be recommended for typical settings since it (i) tends to result in mixed-up prediction intervals in the frequency space and (ii) depends too strongly on the simulation period used for parameter selection and thus lacks transferability to other periods or other meteorological scenarios. Possible applications of these methods include all fully continuous simulation schemes for rare-flood analysis, particularly those subject to computational constraints, such as safety studies or scenario analysis.

For searching the best hydrological-model parameter sets within the defined parameter ranges (Table

Parameter ranges for the calibration of the HBV model.

For further details on

The optimized hydrological-model parameter sets are presented in Fig.

Violin plots (blue) summarizing 100 optimized parameter sets of the HBV model for the Dünnern at Olten catchment vs. initial calibration ranges (grey). Units as in Table

Flow duration curves and model performance metrics for calibration and validation periods over all 100 optimized parameter sets.

The variability in the precipitation depth (annual daily maxima) and temperature (annual daily minima and maxima) of 100 meteorological scenarios used in this study is presented in Fig.

Variability in 100 meteorological scenarios used in this study vs. observations.

Variability in 100 hydrological scenarios used in this study;

The observed discharge data for calibrating the hydrological model can be ordered from the FOEN (

AESS and JS developed the idea. AESS and BS jointly designed the details of this study. AESS performed the analyses and produced the figures. AESS, with a contribution from BS, wrote the first draft of the manuscript. The manuscript was edited by BS and revised by JS.

The authors declare that they have no conflict of interest.

The research was funded through the Forschungskredit of the University of Zurich, grant no. FK-18-118. The ScienceCloud provided by S3IT at the University of Zurich enabled the computation-intensive simulations to be run on virtual machines. Meteorological scenarios used in this study were made available through the project EXAR (Hazard information for extreme flood events on the rivers Aare and Rhine), project no. 15.0054.PJ/O503-1381, funded by the Swiss Federal Office for Environment (FOEN). The authors wish to give special thanks to Guillaume Evin, Anne-Catherine Favre and Benoit Hingray for preparing the meteorological scenarios; to Marc Vis for setting up virtual machines; and to Tracy Ewen for proofreading the manuscript. We thank the editor Paolo Tarolli, Emmanuel Paquet and one anonymous referee for their constructive comments, which improved the quality of the manuscript.

This research has been supported by the Forschungskredit of the University of Zurich (grant no. FK-18-118).

This paper was edited by Paolo Tarolli and reviewed by Emmanuel Paquet and one anonymous referee.