Evaluation of a compound distribution based on weather pattern subsampling for extreme rainfall in Norway

Simulation methods for design flood analyses require estimates of extreme precipitation for simulating maximum discharges. This article evaluates the multi-exponential weather pattern (MEWP) model, a compound model based on weather pattern classification, seasonal splitting and exponential distributions, for its suitability for use in Norway. The MEWP model is the probabilistic rainfall model used in the SCHADEX method for extreme flood estimation. Regional scores of evaluation are used in a split sample framework to compare the MEWP distribution with more general heavytailed distributions, in this case the Multi Generalized Pareto Weather Pattern (MGPWP) distribution. The analysis shows the clear benefit obtained from seasonal and weather patternbased subsampling for extreme value estimation. The MEWP distribution is found to have an overall better performance as compared with the MGPWP, which tends to overfit the data and lacks robustness. Finally, we take advantage of the split sample framework to present evidence for an increase in extreme rainfall in the southwestern part of Norway during the period 1979–2009, relative to 1948–1978.


Introduction
Flood estimation is important for design and safety assessments, flood risk management and spatial planning.It aims to assess the probability of occurrence of large events, e.g., discharges with return periods of 100 to 10 000 years.Estimation of events with such low probability is particularly arduous.It can only be based on a few data points repre-senting the most extreme events in a time series of a limited length.Thus extrapolation to long return periods is usually needed.In dam safety analyses, for example, return period estimations of 10 3 to 10 4 years are often used (Paquet et al., 2013).Methods for deriving such estimations can be classified into two main groups: statistical flood frequency analysis and precipitation-runoff modeling.Statistical flood frequency analysis is based on the analysis of an observed streamflow record for which the return periods of the highest events are modeled using extreme value theory, and magnitudes with longer return periods are estimated using the fitted statistical model.A drawback of this method is that it relies on local or regional streamflow data and is likely to be very sensitive to the density of observations (for the regional case) and to the type of distribution chosen (Klemes, 2000a, b).Furthermore, heavy rainfall is a major factor driving the occurrence of flooding, even in areas where snowmelt also plays a significant role, such as in Norway.Rainfall series are generally more abundant, often have longer periods of record, and they usually show stronger regional consistency.This observation is one of the main motivations of the GRADEX method (Guillot, 1993) which uses the distribution of rainfall to extrapolate the distribution of discharge.This has further led to the development of rainfall-runoff simulation methods for extreme flood estimation.The idea is to extend the database of streamflow by converting rainfall into surface runoff using a model of the catchment response.Input rainfall may be either observed or synthetic events with an estimated probability of occurrence (event-based method) or, either historical or synthetic rainfall records for gener-ating a continuous streamflow series (continuous simulation approach).
In Norway, a simple event-based rainfall-runoff model, PQRUT, has been used since the 1980s as a simulation method for dam safety analyses for which the magnitude of low frequency events (e.g., 500-, 1000-year peak inflow) and the probable maximum flood are required.Recently, a semi-continuous model, SCHADEX (Paquet et al., 2013) has been tested as an alternative approach for obtaining such estimates.SCHADEX has been developed and applied in France by Electricité de France (EDF) for dam spillway design since 2006.It has also recently been applied in different regions of the world (in France, Austria, Canada and Norway) (e.g., Brigode et al., 2014) and has been more extensively evaluated for three catchments in Norway in Lawrence et al. (2014).Of particular interest in Norway is the need for a method which takes the combined probability of extreme rainfall and snowmelt into account, for which SCHADEX is well suited in comparison with event-based approaches.It is expected that the SCHADEX method should give results more similar to those obtained with statistical flood frequency analysis based on observed discharge series, and this was found in two of the three catchments considered by Lawrence et al. (2014).However, a global evaluation of the SCHADEX method covering the range of conditions found in Norway has yet to be achieved and is a necessary precursor to the wider implementation of the method in standard practise.This article aims to make the first step towards such an evaluation.More specifically, we evaluate the rainfall probabilistic component of SCHADEX: the socalled multi-exponential weather pattern (MEWP) distribution (Garavaglia et al., 2010), a compound distribution based on season and weather pattern subsampling, for the whole of Norway.This approach is in contrast with the recent analyses of extreme precipitation in Norway based on annual maximum series and the application of a generalized extreme value distribution undertaken by Dyrrdal et al. (2014).In our work we analyze over threshold values for rainfall, rather than using a block maxima approach.Our goal is to evaluate the performance of MEWP at the national scale and to decide whether it should be preferred to simpler, and perhaps more classical, seasonal and nonseasonal distributions, or, further, whether its generalization towards heavy-tailed distributions should be considered.A brief analysis of trends in extreme precipitation is also performed based on the split samples used in the evaluation.

Data
Daily data from 368 precipitation stations in Norway were extracted from the European Climate Assessment and Dataset (ECA&D), a database of daily meteorological stations across Europe.From these 368 stations, 192 stations with at least 50 years of record with less than 10 % miss-ing data per year over the period 1948-2009 were selected for further analyses.Years with more than 10 % missing data are entirely replaced by 'NA', representing missing values.Figure 1 shows the location and altitude of the 192 stations.Station altitude ranges from sea level to approximately 1000 m a.s.l., i.e., none of the stations lie at the higher altitudes in the mountainous regions.All the stations above 500 m a.s.l., however, are found in the central southern inland region adjacent to zones of higher altitude.The network is denser in southern Norway, particularly along the coast, reflecting the higher population densities in this zone.This implies that southern Norway will have more weight in the model evaluation but we view this as preferable to deleting a number of stations to create a more spatially uniform network density.The mean number of observed years is 56 (maximum 62, minimum 50).
As already stated in Sect. 1, the main topic of this study is the evaluation of MEWP, the rainfall probabilistic model used in SCHADEX.SCHADEX aims to describe the distribution of floods by a stochastic simulation process which combines heavy rainfall events and catchment saturation states, including simulated snowmelt.In SCHADEX, heavy rainfall events are considered as 3-day centered precipitation events, being composed of a central rainfall and two adjacent rainfalls which are lower than the central one (Paquet et al., 2013).The value for central rainfall is simulated using a fitted MEWP distribution for the extreme rainfall (Garavaglia et al., 2011), and the 2 adjacent days are simulated conditionally, using contingency tables to account for the dependence of the magnitude of the rainfall on the day before and after the peak rainfall.Given that MEWP is a probabilistic model for heavy "central" rainfall, rather than for all daily rainfall values, a pre-processing of the data was required to select the central rainfall values exceeding the precipitation received on both the preceding and following days by 1 mm or more at each station.By doing this we obviously reduce the number of data available for analysis.In Norway about one-quarter of the days of record represent central rainfall values, and this is, on average, about one-half of the days with precipitation.However one advantage of this pre-processing is that central rainfall values at a given location can be expected to be independent since they are always separated by at least 1 day.For extreme values, this independence can be quantitatively assessed by computing the so-called extremal coefficients (Coles, 2001;Ferro and Segers, 2003) for the daily and central samples and comparing their respective values for each station.Extremal coefficients lie between 0 and 1 and the closer to 1, the less dependent the extremes.The inverse of the extremal coefficient can be more easily interpreted as the mean size of clusters at extreme level, i.e., roughly speaking, the mean number of consecutive values that are extreme.Using the estimation method of Ferro and Segers (2003) with a threshold equal to the 90 % quantile of daily rainfall, we find that extremal coefficients for daily rainfall are about 0.6, whereas those for central rainfall are about 0.8 (representing a mean cluster size of about 1.25 days).The central rainfall values can therefore be considered to be close to the case of complete independence.

Exponential and GPD models
Let X be the random variable of central rainfall at some location in Norway.We are interested in the distribution of extreme values, i.e., of Pr(X ≤ x) when x is large.Let us consider a (high) level α and write q α the α-quantile of X, i.e., such that α = Pr(X ≤ q α ).Then, for all x exceeding q α , we have the decomposition Extreme value theory (EVT) ensures that if the central rainfall values are independent and identically distributed and for large enough α, Pr(X ≤ x|X ≥ q α ) can be approximated by the distribution for all x ≥ q α , provided in Eq. ( 2) that x ≤ q α −σ α /ξ if ξ < 0. Parameter ξ in Eq. ( 2) is independent of α; this is the shape parameter which models the heaviness of the tail of the distribution.Parameter σ α > 0 in Eqs. ( 2) and (3) depends upon α and is called the scale parameter.Equations ( 2) and (3) imply that excesses (X − q α |X ≥ q α ) follow the generalized Pareto distribution (GPD) in Eq. ( 2) and the exponential distribution (EXP) with rate 1/σ α in Eq. (3).Models (Eqs. 2 and 3) have been widely used worldwide for modeling rainfall extremes.A good review is provided in the introduction of Serinaldi and Kilsby (2014).Equations ( 2) and (3) combined with Eq. (1) give the approximation of the distribution of X for all x ≥ q α : where α = Pr(X ≤ q α ).

MEWP and MGPWP models
In the previous section, we implicitly assumed that central rainfall, X, is identically distributed throughout the year.This assumption may be questioned.Indeed, different climatological processes trigger precipitation, leading to the occurrence of rainfall of different natures and intensities (e.g., convective vs. stratiform precipitation).Furthermore, rainfall occurrence and intensities often vary with season, reflecting both variations in temperature and in storm tracks, for example.For this reason, Garavaglia et al. (2010) proposed the use of subsampling based on seasons and weather patterns (WP).Each day of the record period is assigned to a WP.If S seasons and K WP are considered, then days are classified into S × K subclasses.The law of total probability gives, for all x, where p s,k is the probability that a given day is in season s and in WP k (thus s k p s,k = 1).The central rainfall values occurring in season s and WP k can be assumed to be identically distributed (Garavaglia et al., 2010).Thus the extreme value theory described in Sect.3.1.1can be applied to F s,k (x) = Pr(X ≤ x|season = s, WP = k).Let us consider a high level α (taken for simplicity constant for all F s,k ) and q α,s,k , the α quantile of F s,k .Application of Eq. (4) to F s,k gives the approximation for x ≥ q α,s,k , where G(x; σ α,s,k , ξ s,k ) is given by Eqs.
(2) and (3), where q α , σ α and ξ are respectively replaced by q α,s,k , σ α,s,k and ξ s,k .Thus, Eqs. ( 5) and ( 6) give, for all x ≥ q + α = max s,k q α,s,k , the approximation of the distribution of X: MEWP and MGPWP (Multi Generalized Pareto Weather Pattern) models are both defined by Eq. ( 7) with different choices of ξ s,k (Garavaglia et al., 2010(Garavaglia et al., , 2011)): in MEWP all ξ s,k are set to 0 -in which case G is the EXP distribution -while in MGPWP, ξ s,k is free to vary in the positive range.We exclude cases ξ s,k < 0 because they give bounded GPD distributions with an upper bound at q α,s,k −σ α,s,k /ξ s,k , which is usually unrealistically low for rainfall.Using the GPD with ξ s,k > 0 in Eq. ( 7) allows models with heavier tails than with the EXP distribution, which is light-tailed.Theoretically, other heavy-tailed distribution could be used for G in Eq. ( 7) but the GPD is justified by EVT and it provides a natural generalization of MEWP by allowing the ξ s,t to vary freely (within the positive range).Both models, MEWP and MGPWP, will be evaluated on the Norwegian data.To keep track of the level α and of the fact that S seasons and K WP are used in Eq. ( 7), we will respectively write these two models as MEWP(α, S, K) and MGPWP(α, S, K).Likewise, we write EXP(α) and GPD(α) to represent the basic cases of Eq. ( 4) when neither season nor WP are considered, corresponding to cases MEWP(α, 1, 1) and MGPWP(α, 1, 1).

Model estimation
Use of the EXP, GPD, MEWP and MGPWP models requires the choice of high enough thresholds such that EVT can be applied.Selection of an adequate threshold gives rise to a bias-variance tradeoff: the higher the threshold, the better the approximation of the tail of F (smaller bias), but at the same time, the higher the variance of the estimated parameters because a smaller number of exceedances are available.Graphical tools for threshold selection, such as mean residual life plots (Coles, 2001), are usually difficult to interpret in practice.Therefore, the common practice is to fix a high enough level α and to set thresholds q α,s,k to the empirical α quantile of rainfall occurring in season s and WP k.
Given α (and therefore q α ), the parameters that must be estimated for the EXP and GPD models (Eq.4) are those of G in Eqs. ( 2) and (3).Estimation is made by the method of L-moments (Hosking, 1990): where λ 1 and λ 2 are the sample L-moments of order 1 and 2 for the central rainfall exceeding q α , which are independent; see Sect. 2. In the GPD case, if ξ < 0, then ξ = 0 is imposed (i.e., the EXP distribution) to exclude bounded distributions.It should be noted that the choice of the L-moments method only affects the GPD case since for the EXP case, the commonly used L-moments, moments and maximum likelihood estimators coincide.For the GPD case, a separate analysis (not shown) reveals that the choice of the estimation method does not actually affect the regional evaluation very much because slight differences in estimation that occur at the local scale are smoothed out at the regional scale.
Parameters ξ s,k and σ α,s,k in G of Eq. ( 7) for MEWP and MGPWP are estimated likewise by the L-moments method, using the observed central rainfall of season s and WP k exceeding q α,s,k .Probability p s,k is estimated as the empirical proportion of days in season s and WP k.Estimation of F is then obtained for all x > q + α with Eq. ( 7).

Computation of return levels
The T -year return level r T is the level expected to be exceeded on average once every T years.It satisfies the relationship F (r T ) = 1 − 1/(T ζ ), where ζ is the mean number of central rainfall events per year.When F is EXP(α) or GPD(α), estimation of r T is obtained explicitly as where σα and ξ are the parameter estimates of F of Sect.3.2.
For the MEWP and MGPWP models, there is not an explicit formulation for rT and it is obtained numerically by solving F (r T ) = 1 − 1/(T ζ ) in Eq. ( 7).Equation 8shows that in GPD(α) model, rT is mainly influenced by the value of ξ .
For the MGPWP model, practice shows that for reasonable to large T (typically T > 50 years), rT is mainly influenced by the largest ξs,k .

Model evaluation
The goal of this evaluation is to assess which model performs better at the regional scale, i.e., for a set of N stations taken as a whole, rather than individually.We follow the split sample evaluation proposed in Garavaglia et al. (2011) and Renard et al. (2013).We divide the data for each station i into two subsamples, C i and C (2) i , and fit a given competing model on each of the subsamples, giving two estimated distributions F (1) i , estimated on C (1) i , and

estimated on C
(2) i .Our goal is to test the consistency between validation Nat.Hazards Earth Syst.Sci., 15, 2653-2667, 2015 www.nat-hazards-earth-syst-sci.net/15/2653/2015/data and predictions of the estimates, and the accuracy and stability of the estimates when calibration data change.For this, three scores are computed, assessing respectively stability (SPAN) and reliability (AREA(FF) and AREA(N T )) of the fits.These scores were proposed and used in Garavaglia et al. (2011) and Renard et al. (2013).
The SPAN criterion evaluates the stability of the return level estimation, when using data for each of the two subsamples.More precisely, for a given return period T and station i, , where r(1) T ,i , e.g., is the T -year return level for the distribution F (see Sect. 3.3) estimated on subsample C (2) i of station i. SPAN T ,i is the relative absolute difference in T -year return levels estimated on the two subsamples.It ranges between 0 and 2; the closer to 0, the more stable the estimations for station i.For the set of N stations, we obtain a vector of SPAN T of length N with a distribution which should remain reasonably close to zero.A rough summary of this information is obtained by computing the mean of the N values of SPAN T ,i , i = 1, . .., N : For competing models, the closer the mean is to 0, the more stable the model is.
The FF criterion is used to estimate the reliability in estimating the probability of occurrence of the maximum of independent variables.Let (X 1 , . .., X n ) be a set of n independent and identically distributed rainfall values with distribution F and Z = max n j =1 X j .Then Pr(Z ≤ x) = {Pr(X ≤ x)} n = {F (x)} n and, thus, the distribution of Z is F n .Therefore FF = {F (Z)} n follows the uniform distribution on (0, 1).Now write F1,i and F2,i , where the estimation of F for station i is obtained respectively for subsamples C (1) i and C (2) i .If F1,i and F2,i are good estimations of F , then FF should both be realizations of the uniform distribution.For the set of N stations, this gives two uniform samples ff (12)  and ff (21) of size N each.Hypothesis testing for assessing if the uniform assumption is valid is challenging because the ff i are not independent from site to site, due to the spatial dependence between data.Thus Renard et al. (2013) proposed to base comparison on the graphical analysis of cumulative distribution functions (CDFs), by inspecting how much the CDF of the ff diverge from the 1 : 1 line, corresponding to the CDF of uniform variates on (0, 1).A quantitative assessment of this divergence is provided by computing the area between both CDFs.However, we find such evaluation confusing because the value of the area depends on where, between 0 and 1, the divergence is located.An illustration of this is given in Fig. 2 for three simulated series of length 200 (which is about the number of stations).In case 0, the ff are all drawn from Unif(0, 1) (reference case).In cases 1 and 2, 80 % of the ff are drawn from Unif(0, 1) and 20 % are drawn from Unif(0, 0.1) in case 1 and from Unif(0.5, 0.6) in case 2. Departure of ff from the uniform case is sometimes not easy to interpret.However case 1 corresponds usually to a tendency towards an overestimation of the largest observation, while case 2 corresponds to a tendency towards overfitting the largest observation.In the CDF plot (upper left), the area value is as expected the lowest for case 0. However case 2 gives surprisingly also a very good score, whereas that of case 1 is 3 times as large.Therefore these criteria would falsely indicate a better performance (i.e., smaller area value) of case 2 (overfitting) as compared to case 1 (overestimation), although they both contain 20 % data diverging from the uniform on (0, 1).As an alternative, we prefer to base evaluation on divergence between densities rather than CDFs.A reasonable estimate of this latter is obtained by computing the empirical histogram of the ff with 10 equal bins between 0 and 1, and comparing it with the uniform density between 0 and 1 (which equals 1).For a more quantitative assessment, we compute the area between both densities as follows: where Card{. ..} denotes the cardinality of the set.The term inside the absolute value in Eq. ( 10) is the difference between densities in the th bin.The division by 18 forces the score to lie in the range (0, 1) with lower values indicating better fits (the worst case being all values lying in the same bin).Illustration of this computation is shown in Fig. 2 on the aforementioned simulated data (upper right and lower panels).The score for case 0 is again the lowest, however the value is larger than when comparing CDFs due to the discretization into bins.As expected, the criteria now give similar scores for cases 1 and 2, unlike the method based on CDFs.This leads us to base comparison on the new AREA score (Eq.10), giving preference to lower scores but keeping in mind that a score of 0.1 is already a good score since this is the mean AREA value we obtain when simulating uniforms on (0, 1).Returning to ff values of cross-validation, ff (12) and ff (21) , this gives us two scores of model evaluation, namely AREA(FF (12) ) and AREA(FF (21) ).
The N T criterion assesses reliability of the fit, as FF, but focuses on prescribed quantiles rather than on the overall maximum.Let (X 1 , . .., X n ) be a set of n independent and identically distributed rainfall values with distribution F , and let N T be the random variable equal to the number of exceedances of the T -year return level, i.e., N T = Card{X j ; F (X j ) > 1 − 1/(ζ T )}, where ζ is the mean number of observations per year.Since every event {F (X j ) > 1 − 1/(ζ T )} occurs with probability 1/(ζ T ), N T follows a binomial distribution with parameters (n, 1/(ζ T )).Let H T be the corresponding cumulative distribution function, i.e., such that H T (k) = Pr(N T ≤ k), k = 0, . .., n and H (−1) = 0.Because H T is not continuous, the probability-transformed indices H T (N T ) are not uniform.Thus, Renard et al. (2013) propose to consider the random variable N T such that and show that N T is uniform on (0, 1).Now, consider the estimates for a given station i and where ζ i is the mean number of central rainfall events per year at station i.If F (1) i and F (2) i are exact estimates for F , then n (12) T ,i (resp.n T ,i are realizations of the uniform distribution (Renard et al., 2013).For i ranging over the set of N stations, we thus obtain two vectors of size N of uniform samples, so that we can write n

Models considered
We wish to evaluate and compare the performance of EXP, GPD, MEWP and MGPWP for estimating central rainfall values across Norway.To apply the split sample procedure described in Sect.3.4 for each station i, we randomly divide years into two subsamples such that 50 % of the observed years are in sample C (1) i and the remaining 50 % are in sample C (2) i .This split sample procedure is applied to each station independently (meaning that years of C (1) i and C (1) i are very unlikely to all be equal for i = i ).This creates two new data sets, each comprising 192 stations with a maximum of 31 years of observations.
As is always the case for extreme value analysis, threshold choice is uncertain.We therefore considered a large set of thresholds with α between 0.50 and 0.97.The evaluation scores are then used to select both the best model and the best threshold(s).Choice of α as low as 0.50 may at first glance appear to be very low for studying extremes, but one has to remember that the data series are already preprocessed to include only central rainfall values.Days with central rainfall will tend to have higher intensities than a randomly selected day with rainfall, as by construction, the central rainfall series excludes the previous and following days with lower rainfall intensities (see Sect. 2).A threshold level of 0.50 corresponds actually to a level of about 0.75 for the daily (nonzero) rainfall values.
The estimation scheme can be summarized as follows.For each of the considered α values, we fit six models with the exponential distribution: -EXP(α), which is a particular case of MEWP, where S is one season and K is one weather pattern; -MEWP(α, 1, K), i.e., a combination of K WP distributions, where K = 4 or 8 (see below); -MEWP(α, 2, 1), i.e., a combination of two seasonal distributions.Choice of the seasons is explained below; -MEWP(α, 2, K), i.e., a combination of seasonal and WP distributions, with K = 4 or 8; and the six corresponding models with the GPD distribution.This gives in total 12 fits F (1) i and 12 fits F (2) i , for each station i and each level α.
For the cases involving the use of WP, we employ the weather-type (WT) classification described in Fleig (2011), following the "bottom-up" method presented in Garavaglia et al. (2010).Details of this scheme are also reported in Lawrence et al. (2014) and can be briefly summarized as follows: ascending hierarchical classification is first performed on the rain fields for days with rain, as described by 175 stations in Norway and the surrounding region.The average synoptic pattern (WT) associated with each rain-field class is then identified from an atmospheric pressure data set constructed from geopotential height data centered over Norway.Finally, every day of the period considered  is assigned to a WT using the proximity of its geopotential height data to one described by a WT.In the first instance (Fleig, 2011), eight distinct WTs were defined, seven corresponding to days with rain and one representing dry days.For the first application of SCHADEX in Norway (Lawrence et al., 2014), a grouping of the eight weather types into four weather patterns (WP) was made to improve the robustness of the MEWP models (Fig. 3) by increasing the number of values in the subsamples.In this paper we, however, use the term "weather patterns" (WP) to refer to both sets of classifications, i.e., having four or eight classes; and both the use of the full set of eight classes or the grouped set of four classes are evaluated.
In cases where subsampling is also undertaken by season, we impose a restriction of S being two seasons, representing the season-at-risk and the season-not-at-risk.Furthermore, we impose the season-at-risk to be composed of 2 to 4 consecutive months (the remaining months falling in the season-not-at-risk).The optimum choice of the months composing the season-at-risk is made following the procedure of Penot (2014), which is applied to each station and model separately, using the whole series (i.e., without splitting into C (1)  or C (2) ).The principle is to find the season-at-risk for which the estimated model fits at best the months with the highest risk (of extreme rainfall intensities).In detail, the procedure is as follows: -Step 1: compute the 12 mean monthly maxima of central rainfall.
-Step 3: compute the mean of these values over moving windows of size M months.
-Step 4: select the M consecutive months corresponding to the highest of these values.These M months define the season-at-risk.The remaining months define the season-not-at-risk.
-Step 6: compare the monthly fits to the monthly empirical distributions.This comparison is made with the KGE score (Kling-Gupta efficiency, Gupta et al., 2009), which is computed for a given month, m, as where F m and Fm are respectively the empirical and fitted distributions for month m.It should be noted that the KGE criterion is not the only score which could be used here, and was not necessarily developed for scoring distributions.However, the final result (i.e., the seasonal split selected) is not particularly sensitive to the score used.
-Step 7: compute a global KGE score as a weighted mean of these 12 KGE scores, with weights proportional to the mean monthly maxima, in order to force the model to have the best fits for the months with the highest risk.
-Step 10: compare the three global KGE scores obtained respectively for M = 2, 3, 4. Select the seasonal definition corresponding to the lowest of these scores.
This procedure is applied for each station and each model separately.This implies that, for a given station, the choice of season may vary among models.However, it was found that changes in the definition of the season-at-risk for a given station are very minimal (i.e., a few percent difference, and always pertain to the intermediate months that could well be classified into either of the two periods).We believe that these differences have very little influence on the evaluation of the model fits.For illustration, Fig. 4 shows the length of the season-at-risk and the first month of this season for the 192 Norwegian stations when using MEWP(0. in the eastern part.Furthermore, the intense season starts 1 month earlier in the eastern part than in the western part. The distinction between a heavy rainfall season beginning in the fall in western Norway vs. late summer in eastern Norway is associated with the two different mechanisms leading to heavy precipitation in each of these regions.In western Norway, heavy precipitation is most commonly derived from frontal activity leading to storms arriving from the southwest.The eastern part of Norway is in the lee of the mountainous area in the central zone of southern Norway, and is, therefore, somewhat sheltered from this storm activity.The heaviest precipitation in the eastern region generally occurs due to convective activity producing intense rain showers, often during the late summer months.It can also be noted that the spatial pattern of the precipitation seasons shows a good correspondence with previously published maps of precipitation regions in Norway (see e.g., Hanssen-Bauer and Førland, 2000, Fig. 1) and with the occurrence of days with precipitation over 10 mm (see Tveito et al., 2001, Fig. 2.5).The regional seasons will be used in Sect.4.2.2 to check the sensitivity of MEWP with respect to slight changes in the definition of the season-at-risk.

Model evaluation and selection
The SPAN, FF and N T scores presented in Sect.3.4 are used to assess the quality of the estimations.We use the three scores because they give complementary answers.Taken together, they allow a global evaluation of both the reliability and the stability of the fits.Different return periods T are also considered for SPAN and N T in order to evaluate different parts of the tail of the distribution.With large T we assess the very tail of the distribution while with small T we assess the bulk of the distribution.Scores are reported in Figs. 5 and 6 for the 12 models, using threshold values equal to the 0.5-, 0.7-and 0.9quantile of the central rainfalls.Keep in mind that all scores lie in the range (0, 1) and the closer to 0, the better the score.For each model and threshold, we depict three MEAN(SPAN T ) scores for T = 20, 100 and 1000 years, the value of AREA(FF (12) ) and the three AREA(N T ) values for T = 5, 10 and 20 years.Values of AREA(FF (21) ) and AREA(N (21) T ) are not shown as they are very similar.For the SPAN scores, it may seem highly questionable to extrapolate return levels up to 1000 years given that estimation is based on about 30 years of data.This is actually the level required by engineering practices and regulatory rules (if not higher) in many countries for risk assessment associated with dam safety.For example, in France 1000-or even 10 000-year return periods are used to design dam spillways (Paquet et al., 2013), and the 1000-year return period is also used as the design flood level for the higher risk classes of dams in Norway, whilst the probable maximum flood is used to assess the safety of these dams with respect to the potential for dam failure (Midttømme et al., 2011).
Figure 5 shows that for the exponential models, there is a clear benefit obtained from the use of seasonal splitting (case EXP vs. (S, K) = (2, 1)) and WP splitting (case EXP vs. (S, K) = (1, 4) and (1, 8)), and the combination of both seasonal and WP splitting performs even better (see cases (2, 4) and (2, 8)).Indeed, subsampling by season and WP creates groups of rainfall values that are more likely to be identically distributed and therefore more easily fitted than groups of rainfall values derived from different parent populations.Using eight rather that four WPs also slightly improves the N T scores, but the improvement is somewhat marginal when compared with the gain derived from sampling by season and WP.
Figure 5 surprisingly shows that for MEWP distributions, scores of N T improve when T increases, meaning that the bulk of the distribution is actually less well fitted than the tail.This may be due to the lack of flexibility of the exponential distribution.Using the more flexible GPD distribution (in the GPD and MGPWP models of Fig. 6) indeed tends to improve N 5 and N 10 .However, it clearly also degrades the FF scores.Keep in mind that FF is based on the maximum observed value (see Sect. 3.4) and, thus, permits an assessment of the quality of the fit of the very tail of the distribution.Therefore, although the bulk of the distribution tends to be better fitted with MGPWP distributions (N 5 and N 10 ), the very tail (FF) is overfitted, usually giving poorer FF scores.Figure 6 also shows a clear loss in stability (indicated by the SPAN scores) when using the MGPWP distribution.Figure 7 illustrates this issue by comparing the 100-year and 1000-year return levels estimated on C (1) and C (2) with the four MEWP models and the four MGPWP models, with a level α = 0.5.This shows a difference of up to 100 mm day −1 with MGPWP models for the 100-year return level and up to 300 mm day −1 for the 1000 year-return level, whereas the MEWP models are much more stable.This lack of robustness is due to the difficulty in estimating the shape parameter ξ of the GPD distribution, which has a large influence on the extrapolation to long return periods (see also page 528 of Garavaglia et al. (2011) or the upper right of page 350 of Serinaldi and Kilsby (2014)).Figure 8, on the left hand side, compares the values of ξ estimated on C (1) and C (2) by all MGPWP models.Values between −0.5 and 0.5 are mainly found, but differences between the two estimates vary in a similar range.Positive values, even when not very large (typically ξ > 0.1) lead to unrealistic return levels at extrapolation, with e.g., up to 600 mm day −1 for the 1000year return level in the MGPWP case versus 270 mm day −1 in the MEWP case (see Fig. 7). Figure 8, right, shows that estimates of ξ based on fewer than 1000 observations are highly variable.Similar variability in the shape of the GPD is found in Serinaldi and Kilsby (2014) for a worldwide data set.Cases with fewer than 1000 observations occur more of-  (2) for the four MGPWP models, with α = 0.5 (one point per station).MEWP models correspond to ξ = 0 (red points).Right: same ξ s as a function of the sample size with WP (black points) and without WP (white points) (one point per station and period).
ten when WP are considered, due to the additional subsampling which produces smaller data sets.However, the SPAN values of Fig. 6 show that even for the GPD and MGPWP with K = 1, robustness is very poor.This lack of robustness is an important limitation of their value and suitability for practical applications.
Regarding the choice of threshold, MEWP distributions give relatively stable scores for α between 0.5 and 0.7 (see Fig. 5) but there is a loss in stability as α increases over 0.9 (see green curves of SPAN scores in Fig. 5).For MEWP(α, 2, 8), which gives the best scores overall, the case α = 0.5 usually seems to be slightly better.Therefore we select the model MEWP(0.5, 2, 8) for further consideration.
It is interesting at this point to compare large return levels obtained with the selected MEWP(0.5, 2, 8) with those obtained for the other MEWP models with the same α. Figure 9 makes this comparison for the 100-year return levels.It appears that the other MEWP models tend to give lower return levels (i.e., positive values of the difference).This underestimation is more marked for the EXP model (mean underestimation of about 5 mm of the 100-year return level), and decreases when seasons (MEWP(0.5, 2, 1)) and WP (MEWP(0.5, 2, 4)) are used.Therefore, the use of more WPs helps to better model the heaviness of the tail.

Use of regional seasons
We have already mentioned in Sect.4.1 that the local definition of the seasons displays a regional pattern, with a seasonat-risk in late summer in the two eastern regions and in fall in the two western ones, as illustrated in Fig. 4. We test here the use of this regional definition of the seasons by fitting new MEWP(0.5, 2, 8) models and comparing the overall scores to those of the local definition of Sect.4.2.1.As shown in Table 1, scores of the two definitions are fairly similar, partic- ularly in light of the differences obtained between the models of Fig. 5. Robustness (SPAN) is slightly improved with the regional definition.However the fact that scores of both FF and N 20 are slightly better (i.e., smaller) when seasons are defined locally gives evidence of a better fit of the very tail with the local definition, and therefore probably a better extrapolation of return levels.Therefore, if one would want to select one and only one definition, we would be tempted to recommend the local one.However, if using MEWP at ungauged sites is of interest, the regional definition of the seasons of Fig. 4 provides a reasonable alternative.

Evidence of trend
The split sample procedure can be used to give insight about potential change in extreme rainfall in Norway over the period represented by the rainfall time series.For this we split the observed years of each station into two subsamples: C (1) contains all years between 1948 and 1978 and C (2) contains all the remaining years, between 1979 and 2009.So, in contrast with the previous analysis, all stations are assigned the same C (1) and C (2) and these are temporal instead of being random.Remember that ff (12) assesses how well the maximum of C (1) is fitted by the distribution estimated on C (2) , namely F (2) (see Eq. ).Therefore a parallel comparison of the density of the values of ff (12) i , for i = 1, . .., 192, for this temporal sampling compared to the random one of Sect.4.2.1 can give insight into increases or decreases in extreme rainfall in Norway between the two periods.The density of these values is shown in Fig. 10.
We see that ff (12) tends to have too many small values with respect to the uniform density under the temporal sampling, whereas it was fairly uniform under the random sampling of Sect.4.2.1 (a complementary analysis, not shown, revealed that very similar densities are obtained with other random splitting approaches).We conclude that F (2) tends to overestimate the probability of occurrence of the maxi- mum of C (1) under the temporal sampling.Broadly speaking this means that the maximum of C (1) tends to be too small with respect to that of C (2) .This indicates that extremes during the second-half of the observed period  tend to be higher than those of the first half .This is confirmed by a comparison of return levels obtained on both periods, as shown in Fig. 11.For the random sampling case, return levels are almost equal for C (1) and C (2) whereas in the temporal sampling case, 100-year return level is about 5 mm higher in C (2) , with 10 % of the stations showing an increase higher than 10 mm (vs. 3 % in the random case).As shown in Fig. 12, these 10 % stations lie mainly in the southwestern region, between Bergen and Stavanger, which is one of the most rainy areas in Norway, with 100-year return levels higher than 100 mm (Fig. 12, left).This brief analysis gives evidence for an increase in extreme rainfall intensities which may already be evident in observations for the southwestern region in Norway.This evaluation does not take the place of a full, detailed trend analysis per se, but rather should be taken as a motivation for such an analysis of trends.Our evaluation relies in particular on a somewhat arbitrary splitting of the years in the middle of the observation period.Assessment of possible trends, including when such trends started and their consistency over time is beyond the scope of this paper, but may be of interest in future studies.

Conclusions
This article evaluates a compound model based on weather pattern classification, seasonal splitting and exponential distributions, the so-called MEWP model, for its suitability for use in Norway.The MEWP model is the rainfall probabilistic model used within the SCHADEX method, which is currently being tested in Norway as an alternative simulation method for flood estimation.We show in particular the benefit gained by subsampling the heavy rainfall data according to season and weather pattern.Our results also indicate that models based on the exponential distribution perform better than those based on the more flexible generalized Pareto distribution, which tends to overfit the data and lacks robustness.
We have also demonstrated that a regional definition of seasons in MEWP is possible.Finally, we give evidence for an increase in extreme rainfall intensities in Norway in recent years, particularly in the southwestern region.Our analysis has also shown that the GPD distribution better models the bulk of the distribution of extremes, but fails to robustly estimate the tail, and therefore fails in extrapolation to large return levels.The reason for this failure is twofold: firstly, the lack of data for estimating such a flexible distribution when using a local approach; secondly, the inherent nature of the GPD, which is a heavy-tailed distribution when the shape parameter is positive, and can therefore tend to give unrealistic return levels for very long return periods.
To address this issue, a regional approach allowing the use of neighboring stations to infer MEWP distributions at local sites is of interest.Finally, there are also other, more flexible, distributions which may be more robust than the GPD distribution and could be used within the MEWP approach.This also represents an important topic for future work.
This study is the first extensive evaluation of MEWP in Norway.It has also been applied successfully in France (Garavaglia et al., 2011;Neppel et al., 2014), Austria and West Canada (Brigode et al., 2014).MEWP is a general model imposing no specific hypotheses on the data, so its application in other regions of the world is absolutely worth considering.The only limitation is that a classification into weather patterns suitable for evaluating extreme precipitation is needed as a precursor to such an analysis, but this is already available in several regions around the world (see Brigode et al., 2014)

Figure 2 .
Figure2.Graphical tools for model evaluation based on FF scores, for three simulated series of length 200.The CDF case (upper left) is the method ofRenard et al. (2013).The density case (upper right and lower panels) is the alternative method comparing densities (Eq.10).The dotted horizontal lines show the 95 % confidence interval for uniform variates on (0, 1) of length 200, based on 1000 simulations.
) should be realizations of a binomial with parameters n(1) i (resp.n (2) i ) and 1/(ζ i T ).Let H (1)T ,i and H(2) T ,i be the corresponding binomial cumulative distribution functions and let n (j k) T ,i , j, k = 1, 2, be uniform simulations are calculated as for FF by comparing the empirical densities of n (j k) T , j, k = 1, 2 to the theoretical uniform density, giving the two scores AREA(N

Figure 3 .
Figure 3. Weather pattern classification with four classes (denoted WT1 to WT4 above) and eight classes (WP1 to WP8 above).This is Fig. 5 of Lawrence et al. (2014).Case with four classes is obtained by combining the eight classes into four.The last class of each classification (respectively WT4 and WP8) represent dry days.

Figure 5 .
Figure5.Scores of evaluation for MEWP models, for α = 0.5, 0.7 and 0.9.Better scores have values closer to 0. Scores of SPAN T , for T = 20−, 100− and 1000-year return periods, are the mean scores of Eq. (9), while scores of FF and N T , T = 5, 10 and 20 years, are based on the density areas (Eq.10).

Figure 9 .
Figure 9. Box plot of the difference (in mm) between the 100-year return levels of MEWP(α, 2, 8) and the three other EXP-based models, for α = 0.5 (one point per station and period).

Figure 10 .
Figure10.Divergence in density between ff(12) and the uniform case, under random sampling (left) and temporal sampling (right), with corresponding scores AREA(F F ).The closer the bars are to 0, the better the fit is.The dotted horizontal lines show 95 % confidence interval for uniform variates.

Table 1 .
Scores of evaluation for the local and regional definition of the seasons.Better scores have values closer to 0. Scores of SPAN T , for T = 20, 100 and 1000 years, are the mean scores of (Eq.9), while scores of FF and N T , T = 5, 10 and 20 years, are based on the density areas (Eq.10).SPAN 20 SPAN 100 SPAN 1000 FF(12)N