Non-stationarity in annual and seasonal series of peak flow and precipitation in the UK

When designing or maintaining a hydraulic structure, an estimate of the frequency and magnitude of extreme events is required. The most common methods to obtain such estimates rely on the assumption of stationarity, i.e. the assumption that the process under study is not changing. Public perception of, and concern about, a changing climate have led to a wide debate on the validity of this assumption. In this work trends in annual and seasonal maxima of peak river flow and catchment-average daily rainfall are explored. Assuming a 2-parameter log-normal distribution, a linear regression model is applied, allowing the mean of the distribution to vary with time. For the river flow data, the linear model is extended to include an additional variable, the 99th percentile of the daily rainfall for a year. From the fitted models, dimensionless magnification factors are estimated and plotted on a map, shedding light on whether or not geographical coherence can be found in the significant changes. The implications of the identified trends from a decision-making perspective are then discussed, in particular with regard to the Type I and Type II error probabilities. One striking feature of the estimated trends is that the high variability found in the data leads to very inconclusive test results. Indeed, for most stations it is impossible to make a statement regarding whether or not the current design standards for the 2085 horizon can be considered precautionary. The power of tests on trends is further discussed in the light of statistical power analysis and sample size calculations.


Introduction
Understanding the relationship between the magnitude and frequency of hydrological extremes is of vital importance when designing hydraulic structures or when assessing the flood risk of a certain area. This relationship is typically obtained through frequency analysis of annual maxima series (AMS) of observed peak flows using statistical extreme value models (e.g., Stedinger et al., 1993; Institute of Hydrology, 1999). The detection of trends in hydrological records is a debated issue: the same direction of a signal in a data series can be identified by different methods, but these might give contrasting indications when it comes to evaluating the statistical and practical significance of the estimated signal; see Lins and Cohn (2005) for a full commentary on this. In fact, novel approaches are continuously being introduced, adapting the standard statistical methods to the actual properties found in the observed data series, which are in most cases relatively short and therefore only provide a limited view of very complex, variable and slow-changing processes. Examples of studies attempting to address issues of incomplete information on long-term change and variability in the flood series include Salas and Obeysekera (2013), who revise the methods for return period estimation using a geometric distribution and introduce changing probabilities over time; in order to reduce the variability of return period estimates obtained from short recorded annual maxima series, Macdonald et al. (2013) and Gaume et al. (2010) propose to include historical evidence of large floods; Cohn and Lins (2005) discuss the importance of accounting for long-term persistence in the data series and how this would affect tests for non-stationarity; Renard et al. (2008) discuss methods to simultaneously analyse data from homogeneous regions to assess regional consistency and field significance; Merz et al. (2012) point out that a more rigorous approach is needed when reporting cause-effect claims and stress the need for sound hypothesis-testing frameworks.
This study investigates trends in the annual and seasonal maximum peak river flow and catchment-average daily rainfall totals, and discusses the statistical testing framework by which trends are generally identified. First a simple trend model is applied to the observed series of both river flow and rainfall: assuming a 2-parameter log-normal distribution, an estimate for trends in time is then obtained by the least squares method. For the river flow series, the initial model with time as the only variable is further extended by including a process-related variable to account for the effect of the rainfall-related climate variability. Estimates for the time component in this latter model will give a better indication of whether any change can be detected in the high flow process itself. Finally, Sect. 5 discusses the implications that the estimated trends could have for decision making in terms of statistical hypothesis testing and power analysis, focussing on the annual peak river flow maxima model as this is most relevant for the design and maintenance of hydraulic structures.

Data
The different datasets employed in the study, the annual and seasonal (summer and winter) peak river flow and the catchment-average daily rainfall, are introduced below. An annual maximum for a water year is the maximum value recorded in the period from October to September. Winter events are the ones occurring in the October-March period, summer events the ones occurring in the April-September period.

Peak river flow data

The annual maximum series and seasonal maximum series of peak river flow were extracted from the monthly maximum peak flow data available from the UK National River Flow Archive (NRFA). Only catchments which were classified as being "suitable for QMED" and "suitable for Pooling" in the Environment Agency's HiFlows-UK dataset v.3.1.1 (Environment Agency, 2011b) were retained, to ensure that only flood data of reasonably good quality are included. In addition, a minimum record length of 20 yr was imposed for a catchment to be included in the study. A further quality control was done by manually inspecting records for which a Pettitt test indicated the presence of a change point. For some of the series where a change point was identified, a comparison between the annual series available from HiFlows-UK and the annual series extracted from the monthly data showed large discrepancies, mostly due to changes in the gauging structure and/or rating curve. Series for which HiFlows-UK reported changes in the rating curve or the gauging structure were removed; if no reason was found to justify a change in the data, the series were kept in the study. This was done to exclude stations in which unnatural changes have occurred, as these stations would often show large significant trends. As Renard et al. (2008) point out, keeping series which are affected by quality issues in the dataset might distort the perception of the size and direction of the natural changes. The time coverage of the peak flow series for the different hydrometric areas is shown in Fig. 1. A map with the location of the hydrometric areas can be found at http://www.environment-agency.gov.uk/hiflows/97299.aspx or in Marsh and Hannaford (2008), where the Severn and the Trent areas are both included in the EA Midlands hydrometric area. Note that stations are grouped into hydrometric areas based on the actual authority responsible for the maintenance of the gauging stations, not on the stations' hydrological characteristics. The data coverage begins in 1935 and the number of gauged catchments increases with time. By the mid 1970s most of the catchments included in the study are gauged, although missing data are present in some records for various reasons. For each station, the information is considered as missing if data were missing for more than three months in a water year. Water years in which the annual maximum was recorded during the summer months are shown in red. There are visible clusters of summer events in the different areas for some years; this is just one of the many indications of how correlated the series for neighbouring stations are. Although only 18 % of the annual maxima are recorded in the summer, these events are often some of the largest events in the whole record. For 30 % of the stations the largest peak in the series occurred during the summer months, and for 53 % of the stations the largest summer event is one of the three largest events in the whole series.
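The change-point screening described above can be sketched in a few lines. The following is a minimal illustration of the standard Pettitt test with its usual approximate p value; it is not the code used in the study, and the function name pettitt_test is ours.

```python
import numpy as np

def pettitt_test(x):
    """Pettitt change-point test: returns the most likely change-point
    index and the usual approximate two-sided p-value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sgn = np.sign(np.subtract.outer(x, x))   # sgn[j, i] = sign(x_j - x_i)
    # U_t sums the signs of all differences across a split after index t
    u = np.array([sgn[t + 1:, :t + 1].sum() for t in range(n - 1)])
    k = np.abs(u).max()                      # test statistic K = max |U_t|
    t_hat = int(np.abs(u).argmax())          # candidate change-point index
    p = min(1.0, 2.0 * np.exp(-6.0 * k**2 / (n**3 + n**2)))
    return t_hat, p

# Series with p < 0.05 would be flagged for the manual inspection step.
```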
Figure 2 shows how the proportion of annual maxima occurring in summer fluctuates between decades. Although the median proportion of summer events does not fluctuate much, the variability is very large. This may be related to the exaggeration of the rainfall divide between the northwest and the southeast of the country that occurred in the late 1980s and largely through the 1990s. The north and west were then even more than usually dominated by widespread frontal rainfall (orographically enhanced mainly in winter), whereas high flows in the southeast would to a relatively higher degree be caused by localised heavy convective rainfall in the summer. This would result in the north and west experiencing fewer summer flood events (compare the time series for west Scotland, W-SEPA, in Fig. 1), while the southeast would retain, or even increase, its relatively higher proportion of flood events in summer, resulting in the high variability seen in Fig. 2. The rainfall divide was associated with the location of the preferred mid-latitude storm track, as also captured by the increase in the North Atlantic Oscillation Index from the 1960s up to the 1990s (e.g., Osborn, 2006). This is also the main reason why trend analyses carried out for this period of record result in significant upward trends in high flows in the north and west (e.g., Hannaford and Marsh, 2008). In Fig. 3, boxplots of the median ratio of the observed annual and seasonal (winter, summer) maxima over the long-term median annual maximum (QMED) are shown, separately for each decade. The well-documented (Hannaford and Marsh, 2008) drier conditions of the years between 1965 and 1975 are visible for both the annual and seasonal maxima, but it would seem that the levels of river flows in the last decade have not been substantially different from observed levels in other decades.

Gridded daily rainfall data
Catchment-average daily rainfall (CADR) series were extracted from a gridded dataset at 1 km resolution, which covers the whole of the UK for the water years from 1961 to 2010. The dataset is obtained using interpolation methods applied to the observed values of a dense network of rain gauges; see Keller et al. (2005) for further details. From the CADR dataset, annual and seasonal maxima series of daily rainfall totals were extracted, in order to investigate whether any evidence of changes in the extreme rainfall pattern can be seen. In Fig. 4 the proportion of summer events in the annual and seasonal series for each decade is shown. About half of the rainfall annual maxima are recorded in the summer months, which is considerably more than the 18 % of annual maximum river flows that occur in summer. This shows the importance of drier soils in summer for inhibiting river flow formation. The (median of the) proportion of rainfall events occurring in summer is roughly inversely related to the North Atlantic Oscillation Index, which showed an increase from the 1960s to the 1990s (e.g., Osborn, 2006). As discussed for Fig. 2, the high value of this index towards the late 1980s and 1990s signifies a dominance of frontal rainfall in the hilly north and west, which tends to be orographically enhanced mainly in winter, thus reducing the proportion of annual maximum rainfall events occurring in summer in this area. This reduction in the median of the proportion of summer events is also discernible for the river flows in Fig. 2, albeit much less clearly. In Fig. 5 boxplots of the median ratio of the observed maximum rainfall over the long-term median annual maximum (RMED) are shown for each season, separately for each decade. The rainfall medians seem to be quite variable from decade to decade, with very different patterns for the different seasons. This is a different behaviour from the one seen for the river flow medians shown in Fig. 3.
For each catchment-average daily rainfall series, the value of the 99th percentile in each water year is also extracted. This value corresponds roughly to the 1 in 100 days rainfall event, and is used as an indication of the storminess of the year. Compared with the maximum value of a series, which could be highly influenced by singular rare events, the 99th percentile is a more stable indicator of whether a year has been characterized by larger or smaller rainfall extremes. The quantity has previously been used in a study by the UK Met Office (Met Office, 2013) to investigate long-term patterns in the national high rainfalls. Figure 6 shows a map of results for a Mann-Kendall trend test performed on the 99th percentile of the rainfall series for each catchment, and identifies catchments for which the 99th rainfall percentile appears to be changing in time. A consistent increase can be seen in the east of Scotland and some other scattered catchments around the country. For approximately 82 % of the catchments no change can be detected with a Mann-Kendall test at a 0.05 significance level.
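As an illustration, a basic version of the Mann-Kendall test applied to the yearly series of 99th percentiles could look as follows; this sketch omits the correction for ties for brevity, and the function name mann_kendall is ours.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(x):
    """Basic Mann-Kendall trend test (no correction for ties)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S: sum of the signs of all forward pairwise differences
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    p = 2.0 * norm.sf(abs(z))                # two-sided p-value
    return z, p

# z, p = mann_kendall(r99_series); a trend is flagged when p < 0.05.
```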

Final datasets for analysis
The analyses presented in the remainder of the paper are based on the catchments and water years for which both gridded rainfall data and at least 20 yr of river flow data were available. This corresponds to 446 catchments for the water years between 1961 and 2010. The mean and median record lengths for the high flows data are respectively 39.3 and 40 yr, and a total of 17 529 station-years have been included in the study. The selected catchments allow for a fairly good spatial coverage of the UK, although coverage of north Wales is poor, due to a lack of long records.

Methods
The evidence, or not, of changes in hydrological extremes for the whole of the UK is investigated using the approach suggested by Vogel et al. (2011). The core idea is to quantify in a simple way the expected change in the magnitude of events with a given return period over a defined time period. A 2-parameter log-normal distribution (LN2) is assumed for the annual and seasonal peak flow and daily rainfall maxima. For each catchment the observed flow and rainfall maxima series, respectively x_F and x_R, are log-transformed and inference is based on the quantities y_F = log(x_F) and y_R = log(x_R), which are by definition assumed to be normally distributed. The quantile function for the LN2 distribution is given by:

x_p = exp(µ_y + z_{1−p} σ_y)    (1)

where µ_y and σ_y correspond to the mean and the standard deviation of the log-transformed distribution and z_{1−p} is the quantile of the standard normal distribution which is exceeded with probability p. The 1 in T years event is calculated by taking p = 1/T. In the stationary case, the µ_y and σ_y parameters are assumed to be constant and can be estimated with different estimation procedures. In the non-stationary case, one or both of the LN2 parameters are assumed to be varying. Much effort has been put particularly into investigating whether the location parameter µ_y is changing; see for example the review of change detection by Kundzewicz and Robson (2004). A non-stationary extension of the stationary model in Eq. (1) can be defined by relating the change in the location parameter to time through a simple linear relationship:

µ_y(t) = β_0 + β_1 t    (2)

where t is a variable describing time (e.g. the series of water years). In the framework of a linear regression model this becomes

log(x_t) = β_0 + β_1 t + ε_t    (3)

where ε_t is a zero-mean, homoscedastic, normally distributed error term and x_t denotes the value at time t of the variable under study (either the peak flow or the daily rainfall maxima); it is assumed that observations at different time points t are independent of each other. Estimates for β_0 and β_1 can be obtained via standard linear regression methods, and a two-sided statistical test on H_0: β_1 = 0 will give an indication of non-stationarity in the process.
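A minimal sketch of how the model in Eq. (3) can be fitted and tested, assuming an annual maxima series and the corresponding water years; scipy.stats.linregress reports exactly the two-sided t test on the slope, and the function name fit_trend is ours.

```python
import numpy as np
from scipy import stats

def fit_trend(years, maxima):
    """Fit log(x_t) = beta0 + beta1 * t + eps_t (Eq. 3) by least squares."""
    y = np.log(np.asarray(maxima, dtype=float))
    res = stats.linregress(years, y)
    # res.pvalue is the two-sided test of H0: beta1 = 0
    return res.slope, res.stderr, res.pvalue
```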
The quantile function in the non-stationary case is obtained by substituting the constant location parameter µ_y in Eq. (1) with the expression in Eq. (2), and becomes:

x_p(t) = exp(β_0 + β_1 t + z_{1−p} σ_y)    (4)

Rather than comparing the estimated β_1 values, Vogel et al. (2011) suggest using a non-dimensional magnification factor M_∆t, defined as the ratio of the quantile function at a time (t + ∆t) and the quantile function at time t, which for the LN2 distribution is given by:

M_∆t = x_p(t + ∆t)/x_p(t) = exp(β_1 ∆t)    (5)

Magnification factors larger (smaller) than one indicate that the magnitude of the events occurring with probability p is increasing (decreasing). In other words, magnification factors larger (smaller) than one indicate that the 1 in T year event in the future will have a higher (lower) probability of happening than would be expected in the stationary case. When using a LN2 distribution the value of M_∆t only depends on the slope β_1 and the time span ∆t, and not on the chosen exceedance probability p.
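Given the fitted parameters, Eqs. (4) and (5) translate directly into code; a short sketch, with function names of our choosing:

```python
import numpy as np
from scipy.stats import norm

def magnification_factor(beta1, dt=10.0):
    """Magnification factor M_dt = exp(beta1 * dt) (Eq. 5)."""
    return np.exp(beta1 * dt)

def lognormal_quantile_ns(beta0, beta1, sigma_y, t, T=100):
    """Non-stationary LN2 quantile (Eq. 4): the 1-in-T-year event at time t."""
    z = norm.ppf(1.0 - 1.0 / T)   # z_{1-p}, exceeded with probability p = 1/T
    return np.exp(beta0 + beta1 * t + z * sigma_y)
```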
Other distributional assumptions would lead to more complex formulas with an explicit dependence on the return period 1/p. Laio et al. (2009) show that the 2-parameter log-normal distribution is an acceptable assumption for a large proportion of the catchments in the UK, and discuss the difficulties involved in testing a distributional assumption. A visual inspection of the residuals from the model in Eq. (3) seemed to confirm the adequacy of the normality assumption. If a deviation from normality was found in the data, this was often due to the presence of a very high or low annual maximum in the series: once the influential point is removed from the series the residuals show a normal behaviour. The non-robustness of linear regression to influential points is a well-known issue and the effects on the final estimates can be rather severe, especially if these outliers are located at the beginning or end of the series. The use of robust methods to fit the linear model in Eq. (2) was tested on many catchments and did not give substantially different results; if only one or two outliers are present in the central part of the data series, these will not have too strong an effect on the results. A visual check of the model residuals was also carried out for the rainfall data and did not raise major doubts about the normality assumption, although again some catchments showed a very small proportion of outliers. Taking logarithms of the data also means that extreme values are smoothed out and the data become less skewed. As discussed in Sect. 4, a Shapiro-Wilk normality test was performed on the model residuals for both rainfall and flow series, in order to evaluate the goodness of the distributional assumption. Results are only presented for data series which do not strongly deviate from the normality assumption.
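The residual screening could be coded along the following lines; the Shapiro-Wilk test is available as scipy.stats.shapiro, and the choice to drop the two residuals largest in absolute value is our reading of the procedure described in the text.

```python
import numpy as np
from scipy.stats import shapiro

def passes_normality(residuals, alpha_norm=0.01, drop_extremes=True):
    """Shapiro-Wilk normality check on regression residuals."""
    r = np.asarray(residuals, dtype=float)
    if drop_extremes and len(r) > 10:
        r = r[np.argsort(np.abs(r))[:-2]]   # remove the two most extreme points
    stat, p = shapiro(r)
    return p > alpha_norm
```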

A more complete approach to non-stationarity
The model in Eq. (2) is a rather simple model, relating the changes in the flow-generating process only to the time covariate. Visual inspection of flood time series typically shows a large variability between years, indicating a high level of climatic influence. In an attempt to better estimate any underlying trend, the 99th rainfall percentile was introduced as a second covariate. In this way it is possible to separate the effect of the rainfall climatology from the effect of time on the high flow process, verifying whether or not there are underlying changes in the high flow process itself. Consequently, the non-stationary model in Eq. (2) is updated to a multivariate model:

log(x_F,t) = β_0 + β_1 t + β_2 r_t + ε_t    (6)

where r_t is the 99th percentile of the daily rainfall in water year t. The value of β_1 in this model then describes the impact of time on the process, after the storminess of a given year has been taken into account. It is an indication of what is left to explain in the model when a process-related variable is also taken into account. The value of β_2 gives an indication of how important the storminess of the water years is in explaining the variability in the data: for some catchments, where the catchment characteristics or water management have a strong impact, this might be a less important variable. In this study the variable has been found to be significant for a large majority of the catchments, and it explains a fair proportion of the inter-year variability of the flood records; the 25th, 50th and 75th percentiles of the R² for models including the 99th rainfall percentile confirm this fair degree of explanatory power. The remainder of this study will primarily focus on the time-related magnification factor M_∆t, although the model in Eq. (6) could be used to assess the effect on flood risk of long-term forecasts of rainfall. This latter application is not pursued further here, as it would require long-term forecasts of catchment-averaged rainfall. Finally, although the inclusion of the 99th percentile of rainfall explains a large part of the variability in the flow process, the runoff process is complex and for a more complete model specification variables such as soil moisture deficit and urbanisation could be included. Soil moisture deficits, longer aggregations (months) of rainfall, evapotranspiration and/or temperature would help to describe the longer-term water balance and might improve the model, especially for more groundwater-dominated catchments, which respond more slowly to heavy rainfall events. The interaction between these variables would make their inclusion in the model a complex task, and the analyses presented in this work therefore build around the simpler model in Eq. (6).
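A sketch of how the extended model in Eq. (6) could be fitted with statsmodels; the function and variable names are ours.

```python
import numpy as np
import statsmodels.api as sm

def fit_trend_with_rainfall(years, flow_maxima, r99):
    """Fit log(x_F,t) = beta0 + beta1*t + beta2*r_t + eps_t (Eq. 6)."""
    X = sm.add_constant(np.column_stack([years, r99]))
    fit = sm.OLS(np.log(np.asarray(flow_maxima, dtype=float)), X).fit()
    # params/pvalues are ordered as (const, t, r_t)
    return fit.params[1], fit.pvalues[1], fit.rsquared
```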

Results
For all the annual and seasonal maxima series of both peak flow and rainfall, the decadal (10 yr) magnification factors M_10 = exp(10 β_1) were estimated for the simpler model in Eq. (2), in which time is the only explanatory variable. Further, only for the peak river flow series was the more complex model involving the 99th rainfall percentile described in Eq. (6) estimated, and the corresponding M_10 values computed. As discussed in Sect. 3, the modelling framework relies on the assumption that the log-transformed data are normally distributed. In order to avoid spurious results which could arise from severe model misspecification, a Shapiro-Wilk test for normality was performed on the model residuals at significance level α_norm = 0.01. As mentioned in Sect. 3, a more detailed look at the model residuals which appeared to be non-normal highlighted the fact that in many cases the low p values observed for a normality test would have been much larger if the highest or lowest observations in the series were taken out. The normality tests were then performed on the subset of residuals without the two most extreme points, and these results are used in the remainder of the paper. Results are presented (Figs. 8-10) for three models:

a. log(x_F,t) = β_0 + β_1 t + ε_t - a model for peak flow data with time as the only explanatory variable (left panel);

b. log(x_R,t) = β_0 + β_1 t + ε_t - a model for rainfall maxima with time as the only explanatory variable (middle panel);

c. log(x_F,t) = β_0 + β_1 t + β_2 r_t + ε_t - a model for peak flow data with time and the 99th quantile of daily rainfall as explanatory variables (right panel).

The M_10 values indicated as significant correspond to catchments for which the β_1 coefficient was found to be significantly different from 0 at the α_reg = 0.1 level, using the standard inference based on the t distribution. Note that taking α_reg = 0.1 for a two-sided test on β_1 = 0 will result in accepting as significantly different from 0 the same slopes which would have been identified by two separate one-sided t tests on β_1 at α_reg = 0.05. Indeed, rather than simply testing whether a generic change can be detected in the data, the more relevant question is whether an increase or a decrease can be detected. More discussion on the implications of the testing framework can be found in Sect. 5. In Fig. 8 the results for the annual maxima series are shown. Note that, from the formula in Eq. (5), magnification factors larger than one correspond to positive estimates of β_1. The picture for the winter maxima series (Fig. 9) seems to be less homogeneous: both upward and downward trends are visible, with some clustering of upward trends in the north west of England.
The results for the summer series are shown in Fig. 10. A noticeable feature of the M_10 values for the peak flows in the left panel is the large cluster of downward trends in the south and southeast of England, contrasting with the upward trends found in the north and west of Great Britain and in Northern Ireland. Rainfall maxima also seem to be decreasing in the southern part of England, although the magnitude of the change is much smaller than for river flows. Finally, results for model (c) show an even larger effect of time when the 99th percentile of annual daily rainfall is included in the model. Downward trends are visible in the south of England and some clusters also appear in the north and west. At the same time many of the upward trends in the left panel become smaller or not significant, i.e. once the storminess of a year is taken into account there is less evidence of upward trends in the summer high flows data.
The model in Eq. (6) includes the 99th percentile of rainfall as an explanatory variable, but the runoff process is complex and for a more complete model specification variables such as soil moisture deficit and urbanisation could be included. A changing climate with expected higher temperatures and increased evaporative demands, which deplete the underground water stores, would be consistent with lower summer high flows in slowly responding catchments, which are mostly located in the south and east of the UK. This is also the part of the country with a more continental, drier climate. Many of the records used in the analysis end in the summer of 2011, i.e. during the 2010-2012 drought that particularly affected the south and east of the country; see Kendon et al. (2013). Even though the records have been selected to be relatively long, the effect of ending the observation period in an exceptionally dry period could exacerbate the signal of downward trends. A re-analysis performed for the period until 2008, not shown here, confirms that this is the case, although the main pattern of decreasing trends in the south and east remains visible.
The winter and summer series for both river flow and rainfall give different results, highlighting different patterns across the regions of the UK. Looking at both annual and seasonal series can give a better understanding of possible changes in the hydrological processes, as the annual process is a combination of the complex interaction between the different seasonal processes.

Implications for decision making
The results presented in the previous section show that for some catchments the assumption of stationarity in the location parameter for the observed time series of extreme rainfall and flow can be rejected. In this section, the implications of these findings and the testing framework for non-stationarity are further investigated. The procedure currently recommended by Defra (2006) for considering the effect of climate change on design flood estimates in the UK is the use of precautionary safety factors. In practice, this is done by first conducting a flood frequency analysis using standard methods such as those presented in the Flood Estimation Handbook (e.g., Institute of Hydrology, 1999; Kjeldsen and Jones, 2009) based on the assumption of stationarity, and subsequently adding a safety margin of 20 % to the design flow to represent changes expected by 2085. For the final choice of design, it should be investigated whether this increase in design flow has a significant impact on the design/management of the hydraulic structure. The choice of 20 % as a safety factor was based on modelling studies reported by Reynard et al. (2004), who coupled downscaled UKCIP02 scenarios of rainfall with a hydrological model to assess future flood risk. Structures being constructed at this point in time should thus be over-engineered with a view to still complying with protection against the 100 yr event in the future (2085 in this case). Further studies (e.g., Environment Agency, 2011a) have used the UKCP09 projections of rainfall and temperature to estimate river flows and investigated the importance of catchment properties in the response to climate change. That study identifies regional change factor intervals and discusses how these should be employed. In order to keep the presentation more readable, the results discussed in the remainder of this work are obtained assuming a national safety margin of 20 %, which seems reasonable for the purpose, even considering the results in Environment Agency (2011a).

Having accepted the premise of increased flood risk and put the appropriate safety procedures in place, rather than investigating whether or not a trend is detectable in the data, it would be more relevant to investigate whether the trend in the data is larger than the increase that the current design criteria already take into account. This can also be seen as a test of whether the current precautionary measures are safe enough and whether they are supported by the currently observed levels of change.
Consequently, it is suggested here to shift the attention from a two-sided test on the presence of any trend (upward or downward) in the observed data, to a one-sided test in which it is investigated whether the observed trend exceeds the current safety margin. Starting from the guidelines by Defra (2006), this can be translated into the following hypothesis framework:

H_0: β_1 ≤ log(1.2)/85  vs.  H_1: β_1 > log(1.2)/85    (7)

where log(1.2)/85 is the yearly increase on the log scale implied by a 20 % change at the 2085 horizon. The future flood estimates in catchments for which the null hypothesis H_0 can be rejected would be expected to exceed the design flood value that would be obtained using the safety margin in the current guidelines. As Vogel et al. (2013) point out, the standard hypothesis testing framework is built with the purpose of having a small prefixed probability α (the significance level) of not accepting the null hypothesis when the null hypothesis is actually true (Type I error). An error of this type in the framework in Eq. (7) would lead to an increase in flood protection measures (likely a monetary investment) which would turn out not to be necessary. The price to pay in order to have a test with smaller probabilities of Type I errors is to actually perform a test with lower power, i.e. a reduced ability to identify a trend when a trend exists in the data. A Type II error in the framework in Eq. (7) would mean failing to detect that a structure is under-designed, so that it would be overtopped more frequently than expected in the stationary case. One might then rethink the trend detection routines in order to increase the power of the test, and not only focus on the Type I error. As discussed further in Sect. 5.1, due to the close relationship between α and β, for a given α the only way to reduce the probability of Type II errors is to reduce the variability of the test statistic, by either increasing the sample size (i.e. waiting more years) or improving the way in which the test statistic is estimated. This study tries to do the latter by adding relevant variables to the trend model. For many natural processes, evidence of change has been found in the data, and there is an increasing perception in the public discourse that changes are occurring in environmental and hydrological systems. Moreover, there is a high social cost in not being prepared to cope with increasing flood risk (Hall et al., 2012). In response to this change of perception, Vogel et al. (2013) urge the use of tests which shift the attention from a null hypothesis of no change to one in which the change is assumed to be happening. This radically changes the objective of the analysis and could be translated into the following hypothesis framework:

H_0: β_1 > log(1.2)/85  vs.  H_1: β_1 ≤ log(1.2)/85    (8)

In this case, the future flood estimates in catchments for which the null hypothesis H_0 can be rejected are expected not to exceed the design flood value that would be obtained using the safety margin in the current guidelines. Figure 11 shows the results for the annual river flow series when testing within the two different frameworks in Eqs. (7) and (8) at significance levels α_reg = 0.05 and α_reg = 0.01, based on the regression model presented in Eq. (6). Stations for which a large change was found in the flow series according to the testing framework in Eq. (7) are shown in the left panel of Fig. 11. For these sites the null hypothesis of a magnification factor smaller than 1.2 is rejected and there is an indication that the floods for these stations are going through important changes. These stations partially coincide with the stations for which the highest M_10 factors were found (see Fig. 8), although the map in the left panel of Fig. 11 adds the additional information on whether the estimated change is strong enough to raise safety issues according to the current design standards. In contrast, the right panel of Fig. 11 shows the stations for which the null hypothesis H_0: β_1 > log(1.2)/85 was rejected: these sites are the ones for which the data do not support the assumption that a worryingly large increase in the annual high flow process is occurring. Again, these stations are characterized by very low M_10 values in Fig. 8, but this map adds the additional information on whether the estimated change in the annual maxima can rule out the possibility of a long-term increase in the high flows. For the majority of the catchments (80 % at the α = 0.05 significance level) the null hypothesis is not rejected in either of the testing frameworks, suggesting that it is not possible to reject either the hypothesis that the increase in estimated design floods is smaller than 20 % or the hypothesis that it is larger than 20 %. This result shows how difficult it is to obtain definite information on change from such variable data. These results support the assertion by Lins and Cohn (2005) that "stationarity and non-stationarity are essentially indistinguishable" for river flows, given the currently available periods of record, when doing a single-site analysis.
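In code, the two one-sided tests in Eqs. (7) and (8) amount to comparing a single t statistic against plus or minus the critical value. A minimal sketch, where classify_station is our name and the slope estimate, its standard error and the residual degrees of freedom come from the fitted model in Eq. (6):

```python
import numpy as np
from scipy.stats import t as t_dist

B = np.log(1.2) / 85.0   # slope implied by a 20 % increase at the 2085 horizon

def classify_station(beta1_hat, se_beta1, dof, alpha=0.05):
    """One-sided tests of the trend slope against the design safety margin."""
    t_stat = (beta1_hat - B) / se_beta1
    crit = t_dist.ppf(1.0 - alpha, dof)
    if t_stat > crit:
        return "margin exceeded"       # H0 of Eq. (7) rejected
    if t_stat < -crit:
        return "safely below margin"   # H0 of Eq. (8) rejected
    return "inconclusive"              # neither null hypothesis rejected
```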

Testing and sample size
An important additional result from statistical power analysis theory is the possibility of calculating the sample size which would be needed, under certain specified assumptions, in order to attain a desired power (i.e. the probability of not committing a Type II error). The issue is considered a routine step in many fields such as clinical or behavioural research: when setting up a study, a decision needs to be made regarding the number of experimental units needed.
This choice is made based on the probability of the Type I and Type II errors that the researcher is willing to accept, the variability of the process under study and the precision that is needed. Summarizing, the following quantities need to be pre-specified:

- the significance level α,
- the power to be attained, π = 1 − β,
- the variability τ of the parameter under study, in this study the standard deviation of the regression coefficient, τ = s_β1,
- the effect size (ES) δ, an indication of the magnitude of the effect that would be relevant to the process of interest, in this study the trend.
The last quantity is rarely discussed in the standard presentation of the hypothesis testing framework, but is very relevant when calculating sample sizes, as it indicates the level of precision to be achieved. It can also be interpreted as an indication of where the alternative hypothesis really begins. In a test for H_0: β_1 = 0, it would be reasonable not to start rejecting the null hypothesis for a test statistic which gives an indication of, say, β_1 = 10^−26, but rather to allow an ES value δ such that for any |β_1| ≥ δ the null hypothesis can be rejected. The effect size can either be fixed beforehand by the researcher, or can be derived from properties expected to be found in the data based on previous studies. See among others Cohen (1992) for a discussion on how to obtain the ES, and Cohen (1990, 1994) for a detailed discussion on the ES and the importance of each pre-specified component in a power analysis. For a univariate model, like the simple regression in Eq. (2), the power π for a one-sided test with H_1: β_1 > δ is defined as:

π = P(T > t_{α,n−2})    (9)

where T is a t-distributed random variable with (n − 2) degrees of freedom and non-centrality parameter √n δ/τ. The standard deviation of the regression parameter is in this case estimated as s_β1 = σ/(√n s_x) (see Appendix A). The term t_{α,n−2} corresponds to the 1 − α quantile of a standard t distribution with (n − 2) degrees of freedom, i.e. the cutoff value which marks the beginning of the rejection region; t_{α,n−2} changes as a function of the sample size n and the significance level α. Equation (9) is often approximated as

π ≈ P(T > t_{α,n−2} − √n δ/τ)    (10)

where T is a standard t-distributed random variable with (n − 2) degrees of freedom. The decisions made on the size of each of (α, τ, δ), the three quantities used in Eq. (10), will have an effect on the sample size needed to attain the pre-specified power π. A typical value for π is π = 0.8, which translates into a probability of Type II error β = 0.2. For the commonly used significance level α = 0.05, a power π = 0.8 corresponds to a 4 : 1 ratio of the probability of Type II errors to the probability of Type I errors.
In most cases, the value of τ would be unknown and difficult to estimate from previous studies or the researchers' knowledge. However, for a univariate regression model the value of τ can be related to ρ, the correlation between the dependent and independent variables, which for the univariate case corresponds to the square root of the well-known coefficient of determination R² (see Appendix A for the derivation of this relation). Thus, taking τ² = (1 − ρ²) δ²/ρ², the formula in Eq. (10) can then be rewritten as:

π ≈ P(T > t_{α,n−2} − √(n ρ²/(1 − ρ²)))    (11)

which corresponds to the formula used by Vogel et al. (2013). Note that in this formula the effect size δ cancels out, and the power levels are completely determined by the sample size and the strength of the relationship between the dependent and independent variables. Alternatively, the value of τ can be estimated starting from the parameters σ and s_x, defined as the standard deviation of the model residuals ε and the sample standard deviation of the independent variable x, respectively. Taking τ = σ/s_x, the formula in Eq. (10) can be rewritten as:

π ≈ P(T > t_{α,n−2} − √n δ s_x/σ)    (12)

Once the effect size and the sample size have been fixed, power levels in Eq. (12) are determined by the variability of the model errors relative to the sample variability of the independent variable. In the particular case studied in this work, time, e.g. the water year, is the independent variable, so that s_x is known and increasing for increasing sample sizes (see Appendix A). From the formulas in Eqs. (11) and (12) it is then clear that the power of the test on the regression parameter β_1 is strictly connected with the variability of the dependent and independent variables in the model, and the strength of the relationship between them. For multivariate regression models like the one in Eq. (6), the relationships between the different independent variables also play a role. Thus, in order to do a power analysis for the effect of one of the independent variables, some further assumptions need to be made regarding these relationships. In order to keep the presentation more readable, the discussion in the remainder of this section is limited to univariate models.
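The two approximations in Eqs. (11) and (12) are straightforward to evaluate numerically; a sketch, with function names of our choosing:

```python
import numpy as np
from scipy.stats import t as t_dist

def power_from_r2(n, r2, alpha=0.05):
    """Approximate power from Eq. (11): depends only on n and rho^2 = R^2."""
    shift = np.sqrt(n * r2 / (1.0 - r2))
    return t_dist.sf(t_dist.ppf(1.0 - alpha, n - 2) - shift, n - 2)

def power_from_sigma(n, delta, sigma, s_x, alpha=0.05):
    """Approximate power from Eq. (12), with tau = sigma / s_x."""
    shift = np.sqrt(n) * delta * s_x / sigma
    return t_dist.sf(t_dist.ppf(1.0 - alpha, n - 2) - shift, n - 2)
```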
Computation of the sample size necessary to attain the desired power of a test requires a number of assumptions on the variability of the variables involved in the model. Depending on which information is more easily available and more reliable, one of the two formulas in Eqs. (11) and (12) can be used to investigate the relationship between sample size and power. Levels of power for increasing sample sizes computed using the formula in Eq. (11) are shown as curves in the left panel of Fig. 12: for higher R², smaller sample sizes are needed to attain a given power level. Since in the framework under study each measurement corresponds to a water year, assuming that a data series would start in a certain year, for example 1970, each sample size corresponds to an end-of-record year; the lower x axis of Fig. 12 indicates the year corresponding to each sample size. The left panel of Fig. 12 shows representative power functions obtained with the 25th, 50th and 75th percentiles of the R² for the fitted univariate models for flow data as in Eq. (2): the observed R² are fairly small, and if a sample size for a test for trend in the flow data was to be chosen based on the current levels of correlation between time and flow data, it might only be possible to obtain a reasonable power for the test by waiting another 500 yr. In the right panel of Fig. 12, power levels are instead computed using the formula in Eq. (12), based on representative values of the model error standard deviation σ for the fitted models as in Eq. (2). It would appear that reasonable power levels for a test on the regression coefficient for models as in Eq. (2) should be attained by the end of the 21st century. The huge difference in the sample sizes obtained using the two different formulas is partially due to the fact that, for the particular case at hand, when using the formula in Eq. (12) one can also include the information on the actual sample standard deviation s_x, which will necessarily increase for increasing sample sizes. When using the formula in Eq. (11), the information on the increased variability of the independent variable is not used. In a different experimental set-up, researchers might be able to control the sample standard deviation of the independent variable, but since this is not possible in the case under study it only makes sense to use this additional information. It should be pointed out that, when deciding on the sample size for a designed experiment, power analysis should not be performed ex post, after an experiment has been carried out, but rather ex ante, in the experiment design step. A researcher should have some knowledge of the variability of the process under study and can decide on the sample size based on this knowledge. This is clearly not a viable approach for flow data, as researchers have no control over the variability of the processes and more data can only be obtained by waiting more years. It is important to stress that the sample sizes indicated in Fig. 12 only give an indication of the time needed to attain a required power under some pre-specified conditions. The power analysis for a regression parameter as discussed in this section is valid for complete, independent data series. Some short-term and long-term autocorrelation is likely to be observed in hydrometric series, and would have an impact on the variability of the test statistic and therefore on the power levels. Another source of uncertainty that would require further corrections in the power analysis is the correlation between events recorded in the same year at different stations. Throughout the study each station has been analysed in isolation, i.e. the stations are taken into consideration independently, although correlations may exist between them, as Fig. 1 suggests. Not accounting for the spatial correlation can potentially lead to problems when trying to interpret the significance of the results. However, accounting for it is a non-trivial issue; see for example Davison et al. (2012) and Hüser and Davison (2013).
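The sample size calculation itself can be inverted numerically: increase n until the target power is reached, exploiting the fact that for a record of consecutive water years s_x is a known function of n (Appendix A). The sketch below reuses Eq. (12); the illustrative values of delta and sigma in the final comment are hypothetical, not estimates from the study.

```python
import numpy as np
from scipy.stats import t as t_dist

def years_needed(delta, sigma, target_power=0.8, alpha=0.05, n_max=1000):
    """Smallest record length n attaining the target power for a trend test."""
    for n in range(5, n_max):
        s_x = np.sqrt((n**2 - 1) / 12.0)   # sd of the time covariate 1, ..., n
        shift = np.sqrt(n) * delta * s_x / sigma
        power = t_dist.sf(t_dist.ppf(1.0 - alpha, n - 2) - shift, n - 2)
        if power >= target_power:
            return n
    return None                            # not attainable within n_max years

# e.g. years_needed(delta=np.log(1.2) / 85, sigma=0.4) for a slope at the
# safety-margin level against a hypothetical residual spread of 0.4.
```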
Another possible strategy to lower the number of years needed at each station to detect possible spatially coherent changes would be to apply a regional method for trend detection, such as the ones presented in Renard et al. (2008) and applied to a number of French and Alpine high river flow variables in, respectively, Giuntoli et al. (2012) and Bard et al. (2012). By analysing spatially and hydrologically coherent data together, stronger evidence can be found for or against the presence of non-stationarity in hydrological variables in a region. One possible obstacle to performing such a regional analysis for the British data is that a relevant analysis would need hydro-climatologically coherent regions to be defined. Some efforts have been made to define hydrologically coherent regions in the UK (e.g., Kingston et al., 2013; Svensson and Prudhomme, 2005), and the main division for the country would seem to be into two regions: the Northwest and the Southeast. These two large regions could be employed to perform a regional analysis, but they might not be as well defined as the ones employed in the above-mentioned studies. Identifying coherent regions for high flows and rainfall patterns in the country would be a necessary initial step to perform a regional analysis. This would require an additional effort which is beyond the scope of this paper. Moreover, for some regional tests, like the non-parametric approach of Renard et al. (2008), the records for all stations included in the regional study should have data available for the same water years: considering the many missing data which can be seen in Fig. 1, a careful selection of which stations should be included in a regional analysis also needs to be carried out.

Conclusions
This study has investigated the presence of trends in the location parameter of the distributions of annual and seasonal maxima series of peak river flow and daily rainfall totals recorded in the UK. Building on Vogel et al. (2011), a dimensionless magnification factor is estimated for the different catchments and the presence of local patterns is investigated by plotting the estimated factors on maps. For the peak river flow data the simple time trend model is expanded by adding a process-related variable: the 99th percentile of the daily rainfall for each water year. This work only pursued a model for µ, the location parameter of the distribution, assuming the higher-order moments, like the dispersion, to be constant. A detailed analysis of the variance function could be beneficial, although a reliable estimate of the variance would ideally require a higher number of observations for each station.
For the location parameter model, the 99th rainfall percentile explains a very large part of the variability seen in the flow observations. The advantage of adding a rainfall-related quantity is that any residual effect of time should be related more to the other unknown drivers of change than to precipitation, and that the variability of the slope estimate is reduced, thus giving more precise information. This is an attempt in the direction of a better attribution effort (Merz et al., 2012), and the framework could potentially accommodate additional variables other than time to better explain the residual variability in the model. Indeed, the evidence for changing high flows is slightly different when the 99th rainfall percentile is taken into account than when it is not. The annual series show only some small scattered clusters of decreasing and increasing trends. However, there is a strong indication of decreasing summer maximum flows, particularly in the south east of England. The last part of this work, in Sect. 5, discusses some aspects of the statistical testing approach used to detect non-stationarity and the implications for decision making. The definition of non-stationarity can be expanded into something more relevant than the frequently used null hypothesis of no trend (H_0: β_1 = 0), and the importance of Type II errors is discussed. Indeed, the statistical testing framework used in any study should be formulated by thinking carefully about the question that is relevant for the problem at hand. With the data used in this study, only for a very small proportion of stations can one of the two contrasting null hypotheses H_0: β_1 ≤ log(1.2)/85 and H_0: β_1 > log(1.2)/85 be rejected. That is to say, for more than 80 % of the stations neither hypothesis can be rejected, and it cannot be determined whether flood estimates are likely to exceed the current design criteria for the 2085 horizon, or whether they will be safely below them. This striking result is due to the high natural variability of the estimates of the regression coefficients: the trend signal is simply not strong enough to be really informative from a practical point of view. This is even more evident when computing the sample sizes which would be needed to attain relatively high power levels if the correlation values or the model errors were comparable to the ones obtained from the models fitted to the datasets used in this study. Methods to better account for, and use, the spatial correlation between nearby stations might lead to more informative results.
Appendix A

A1 Derivation of some key quantities

In this appendix some of the formulas used in the approximations in Eqs. (11) and (12) are derived, based on standard results for linear regression models. A comprehensive presentation of linear models can be found in Neter et al. (1996). In a model like the one in Eq. (3), y_i = β_0 + β_1 x_i + ε_i with ε_i ∼ N(0, σ²), a test on the coefficient β_1, with H_0: β_1 = b_1 against H_1: β_1 ≠ b_1, is based on the test statistic:

T = (β̂_1 − b_1)/s_β̂1    (A1)

with s_β̂1 the estimated standard deviation of the estimated coefficient β̂_1. Under the null hypothesis it can be shown that (β̂_1 − b_1)/s_β̂1 ∼ t_{n−2}. For a two-sided test at a significance level α, the null hypothesis would be rejected when |T| > t_{α/2,n−2}.
It can be shown that s_β̂1 = σ/(√n s_x), so that Eq. (A1) becomes

T = √n s_x (β̂_1 − b_1)/σ

To calculate the power of a test, it is necessary to make assumptions on σ and/or s_x: in a designed study s_x would be either known or kept under control, but this is not the case for a test on trend in time.

A3 Derivation of the variance of a sequence of water years
Let x be a sequence of numbers like the water years variable. Since var(X + a) = var(X), the variance for a water year record of length n corresponds to the variance of (1, . . ., n), which equals (n² − 1)/12; the variability s_x of the time covariate therefore grows with the record length n.
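For completeness, this variance follows from the standard sums of integers and squared integers:

```latex
\operatorname{var}(1,\dots,n)
  = \frac{1}{n}\sum_{i=1}^{n}\Big(i-\frac{n+1}{2}\Big)^{2}
  = \frac{1}{n}\sum_{i=1}^{n}i^{2}-\Big(\frac{n+1}{2}\Big)^{2}
  = \frac{(n+1)(2n+1)}{6}-\frac{(n+1)^{2}}{4}
  = \frac{n^{2}-1}{12}.
```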