Non-stationary extreme value analysis of ground snow loads in the French Alps: a comparison with building standards

In a context of climate change, trends in extreme snow loads need to be determined to minimize the risk of structure collapse. We study trends in 50-year return levels of ground snow load (GSL) using non-stationary extreme value models. These trends are assessed at a mountain massif scale from GSL data, provided for the French Alps from 1959 to 2019 by a meteorological reanalysis and a snowpack model. Our results indicate a temporal decrease in 50-year return levels from 900 to 4200 m, significant in the northwest of the French Alps up to 2100 m. We detect the most important decrease at 900 m with an average of −30% for return levels between 1960 and 2010. Despite these decreases, in 2019 return levels still exceed return levels designed for French building standards under a stationary assumption. At worst (i.e. at 1800 m), return levels exceed standards by 15% on average, and half of the massifs exceed standards. We believe that these exceedances are due to questionable assumptions concerning the computation of standards. For example, these were devised with GSL, estimated from snow depth maxima and constant snow density set to 150kgm−3, which underestimate typical GSL values for the snowpack.


Introduction
Extreme snow loads can generate economic damages and casualties. For instance, more than USD 200 million in roof damages occurred during the Great Blizzard of 1993 (O'Rourke and Auren, 1997). In 2006, at the Katowice International Fair, the roof of one of the buildings collapsed under a layer of snow, leading to 65 casualties and 140 in-jured (BBC News, 2006). In France, snow loads over Roussillon in 1986, caused both EUR 17 million in damages and a major power outage due to overloading of electrical cables and pylons by sticking snow (Vigneau, 1987;Naaim-Bouvet et al., 2000).
Ground snow load (GSL) is defined as the pressure exerted by accumulated snow on the ground, which can be directly associated with accumulated snow on structures, e.g. on roofs (Sanpaolesi et al., 1998). In detail, the observed height of accumulated snow is called snow depth (in m). The density of this snow can vary widely between precipitation particles (ρ SNOW ≈ 100 kg m −3 ) and a ripe snowpack (ρ SNOW ≈ 500 kg m −3 ). Multiplying snow depth by snow density gives the surface mass of snow (in kg m −2 ). Surface mass of snow corresponds to the snow water equivalent (SWE), which is the height of water (in mm) we could obtain if we melt all the snow in a 1 m 2 area. Indeed, since water density is ρ WATER = 1000 kg m −3 , we have that 1 mm of water on 1 m 2 has a surface mass of 1 kg m −2 . Snow load is the pressure exerted by this surface mass of snow (in N m −2 or Pa) and equals the SWE times the gravitational acceleration (g = 9.81 m s −2 ).
Snowpack variables related to GSL (snow depth, SWE) evolve with climate change. As shown in Table 1, the literature on past trends in snowpack variables for the Western Alps shows a decreasing trend. The literature on projected trends also points to a decrease (stronger for the second half of the 21st century under a high greenhouse gas emission scenario than with strong reductions in greenhouse gas emissions) for mean winter (December-May) SWE in the European Alps (IPCC, 2019). However, anthropogenic climate change impacts climatic variables in their averages as well as in their extremes (Klein Tank and Können, 2003;IPCC, 2012). For instance, annual maxima of snow depth have decreased in Switzerland (Marty and Blanchet, 2012). Projected trends in extreme snowpack variables are prone to strong uncertainties (Strasser, 2008;Beniston et al., 2018) as both mean winter temperature (IPCC, 2019) and winter precipitation extremes (Rajczak and Schär, 2017) are projected to increase in the European Alps.
The impact of climate change on GSL was not taken into account in current European standards for structural design, a.k.a Eurocodes (Sanpaolesi et al., 1998), which drive French standards (Biétry, 2005). These standards define that structures must withstand their own weight plus a pressure proportional to a characteristic value. The latter is the stationary 50-year return level of GSL, exceeded once every 50 years on average. Thus, studying trends in 50-year return levels of GSL is needed for updating these standards (Croce et al., 2018). In the literature, past and projected trends in 50-year return levels of GSL have rarely been investigated with the exception of Rózsás et al. (2016), Il Jeong and Sushama (2018), and Croce et al. (2018). In the French Alps, several studies focused on extreme snow variables (Biétry, 2005;Gaume et al., 2012Gaume et al., , 2013 and their spatial dependence (Nicolet et al., 2015(Nicolet et al., , 2016(Nicolet et al., , 2017(Nicolet et al., , 2018. However, trends in 50-year return levels of GSL remain unexplored. We fill these gaps by studying annual maxima of GSL provided every 300 m of altitude at a mountain massif scale for the 23 French Alps massifs. We rely on the SAFRAN-Crocus reanalysis (Vernay et al., 2019) produced by the SAFRAN-Crocus chain (Durand et al., 2009a;Vionnet et al., 2012) available for the period 1959-2019. The major advantage of this reanalysis is to benefit from an advanced snowpack model which provides daily estimates of ground snow load values, while previous studies relied on approximate values directly related to snow depth with a crude estimation of snow density (Biétry, 2005). Thus, our approach considers only natural snow processes: we do not account for snow removal throughout the year and consider all processes (accumulation, thaw-freeze, melt, compaction etc.) occurring during the winter season.
Our statistical methodology consists in applying stationary and non-stationary extreme value models to annual maxima time series. We select one model by massif and altitude with the Akaike information criterion (AIC) statistical criterion, validate the selected model with the Anderson-Darling test, and assess its significance with the likelihood ratio statistical test. Finally, for each massif and altitude, we compute the relative change of 50-year return levels of GSL between 1960 and 2010, and we compare the non-stationary return level in 2019 with the stationary return level designed for French building standards. This paper is organized as follows. Section 2 presents our data. Section 3 describes standards for ground snow load. Then, Sect. 4 explains our methodology. Results, discussion and conclusions are introduced in Sects. 5, 6 and 7, respectively.

Ground snow load data
The study area covers the French Alps which are located between Lake Geneva to the north and the Mediterranean Sea to the south (Fig. 1). The climate is contrasted, colder and wetter in the northern Alps and much drier in the southern Alps (Durand et al., 2009a). This region is typically divided into 23 mountain massifs of about 1000 km 2 . We rely on the SAFRAN-Crocus reanalysis (Vernay et al., 2019) from the SAFRAN-Crocus chain (Durand et al., 2009a;Vionnet et al., 2012) available from August 1958 to July 2019 at the scale of these massifs, every 300 m of altitude from 300 to 4800 m. Contrary to gridded products, this reanalysis assumes for a given altitude the homogeneity of the different variables at the scale of the massif. Also, annual maxima are available from 1959 to 2019. Indeed, annual maxima denote the maxima during a year centred on the winter season; for example, annual maxima for 1959 correspond to the maxima from the 1 August 1958 to the 31 July 1959.
To sum up, GSL equals SWE from the SAFRAN-Crocus reanalysis times the gravitational acceleration. We study time series of annual maxima of GSL for each massif from 1959 to 2019 every 300 m of altitude from 300 to 4800 m (Fig. 1).
The SAFRAN-Crocus reanalysis is produced by a chain of two models. First, SAFRAN meteorological reanalysis (Durand et al., 2009a) performs a spatialization of the weather data (precipitation, temperature, humidity, radiation, wind speed) over the massifs and altitudes. Then, the Crocus snowpack model (Vionnet et al., 2012) infers snow depth and SWE based on SAFRAN time series. Crocus is a onedimensional multilayer physical snow scheme, which simulates the snowpack evolution over time, by accounting for several processes such as thermal diffusion, phase changes and metamorphism.
The SAFRAN-Crocus reanalysis has been evaluated against various observation datasets, as reported in previous publications (Lafaysse et al., 2013;Vionnet et al., 2016;Revuelto et al., 2018;Vionnet et al., 2019). In most cases, the evaluation is carried out against in situ snow depth observations and remote sensing snow cover information. For example, Vionnet et al. (2016) evaluated SAFRAN-Crocus snow depth data against 79 observed snow depth data in the French Alps for the 2010-2014 time period, with mean bias and standard error values of 18 and 37 cm, respectively. This corresponds to typical values for snow modelling systems applied in various regions on Earth. Because of lower data availability, evaluations against observed SWE values are less frequent than against snow depth data, although we note that Crocus has been shown to perform extremely well compared to other snow cover models, in terms of SWE, across many observation sites worldwide (Krinner et al., 2018) and Table 1. Past trends in snowpack variables, snow depth (HS) and snow water equivalent (SWE), according to existing studies in the Western Alps, i.e. in Italy (IT), France (FR) and Switzerland (CH). In the Trend column, "north" and "south" refer to the considered country.  SAFRAN-Crocus exhibits satisfying performance in terms of snow depth and SWE in the Pyrenees , providing confidence, with respect to other existing datasets, in using this model chain for GSL values. Further model evaluations, using additional datasets, are required to continue assessing and improving the quality of the model chain. Furthermore, we highlight that we only use SAFRAN-Crocus reanalysis values on flat field, and we did not use simulations on slopes; hence it is not relevant to discuss the impact of slope and aspect on the results of this study.
3 Standards for ground snow load in the French Alps GSL French standards (Biétry, 2005) are based mostly on Eurocodes (Sanpaolesi et al., 1998) and on prior French standards. Each French department, and by extension each French Alps massif, is associated with a region (C or E) that sets characteristic 50-year return level values of GSL between 200 and 2000 m of altitude (Fig. 2). French standards were elaborated with annual maxima time series of snow depth on the ground measured at stations from 1945 to 1992. GSL data were approximated from annual maxima of snow depth and by assuming that snow density equals 150 kg m −3 . Following Eurocodes, the character-E. Le Roux et al.: Non-stationary extreme value analysis of ground snow loads istic value of GSL is defined as the 50-year return level of a Gumbel distribution (Sect. 4). This distribution was fitted using the least squares method and by removing the top annual maximum when considered exceptional (Biétry, 2005) according to a criterion not explicitly mentioned in the French report cited as reference. However, in the Eurocodes, the standard method was to consider the top maximum as exceptional if it was 1.5 times larger than the second largest maximum (Sanpaolesi et al., 1998). In our methodology, we do not remove the top annual maximum.

Statistical methodology
Following extreme value theory, we employ two stationary models and six non-stationary models for time series of annual maxima of GSL (Sect. 4.1). We select a single model for each time series (i.e. for each massif and altitude) with the AIC statistical criterion, validate this model with the Anderson-Darling test, and assess its significance with the likelihood ratio statistical test (Sect. 4.2). Finally, we compute the relative change of 50-year return levels of GSL between 1960 and 2010, quantify the uncertainty of return levels in 2019 to compare them with the stationary return levels designed for French standards (Sect. 4.3).

Stationary and non-stationary models based on extreme value distributions
Climate extremes are generally studied with statistics. As underlined in the IPCC special report on climate extremes, a large amount of statistical literature builds on extreme indices to examine moderate extremes (IPCC, 2012). However, since we focus on extremes that are more rare, it is recommended to rely on extreme value theory (EVT, Coles, 2001). Such statistical models provide and hypothesize additional prior information in order to compensate for the limited number of empirical observations that commonly span only several decades. These models can be used to extrapolate beyond the empirical observations and to estimate return levels (Sect. 4.3).
EVT offers a suitable framework to analyse extreme values, i.e. to model the form of the tail for almost any probability distribution. Asymptotically, as the central limit theorem motivates sample means modelling with the normal distribution, the Fisher-Tippett-Gnedenko theorem (Fisher and Tippett, 1928;Gnedenko, 1943) encourages sample maxima modelling with the generalized extreme value (GEV) distribution. This theorem justifies that the maximum of finitesized blocks with a large enough block size can be modelled with the GEV distribution. In practice, an annual maximum is thus usually considered a realization of a GEV distribution. Three parameters define the GEV distribution: a location µ, a scale σ > 0 and a shape ζ (a.k.a extremal index or tail index). The GEV distribution includes three specific types of distributions: Weibull (ζ < 0), Fréchet (ζ > 0) and Gumbel (ζ = 0). Thus, by definition, if Z represents an annual maximum of GSL, we can assume that Z follows a GEV distribution (i.e. Z ∼ GEV(µ, σ, ζ )), which implies the following: (1) In a context of climate change, a large amount of hydrological literature builds on non-stationary modelling (Milly et al., 2008) to assess whether a time series is generated by a unique probability distribution (stationary model) or if the generating probability distribution is changing (nonstationary model). Non-stationary extremes are usually studied with both non-stationary modelling and EVT (Katz et al., 2002). Annual maxima are assumed independent but not necessarily identically distributed (Serinaldi and Kilsby, 2015). Such approaches combine a stationary random component (a fixed extreme value distribution) with non-stationary deterministic functions that map each temporal covariate t to the changing parameters of the distribution (Montanari and Koutsoyiannis, 2014). In a non-stationary context, Zhang et al. (2004) showed that tests based on this parametric approach have stronger power of detection when compared with non-parametric methods.
We consider non-stationarity for both the Gumbel distribution and the more general GEV distribution, since they represent natural extensions of the Gumbel distribution which was used for French building standards (Sect. 3). For any model, we have Z(t) ∼ GEV(µ(t), σ (t), ζ (t)), as the Gumbel distribution corresponds to ζ (t) = 0. For a model M, we denote as θ M all parameters for its functions (µ(t), σ (t) and ζ (t)). We focus on simple linear functions due to the limited length of time series (60 years). The linearity starts in 1959, which is the first winter with available data. As shown in Table 2, we consider only models with a constant shape parameter but where the location and/or the scale parameter can vary linearly with years t.

Model selection, validation and significance
Model selection. Let z = (z 1959 , . . ., z 2019 ) represent a time series of annual maxima of GSL, i.e. for a massif and an altitude (Sect. 2). First, models are fitted with the maximum likelihood method. For every model M, we compute the maximum likelihood estimator θ M , which corresponds to the parameter θ M that maximizes the likelihood: Then, for each z, i.e. for each massif and altitude, we select the model M N with the minimal AIC value (Akaike, 1974), as it is the best information criterion in a non-stationary context with small sample sizes (Kim et al., 2017). We define Table 2 The selected model M N can be any model from Table 2 Model validation. Quantile-quantile (Q-Q) analysis is performed for all selected models. To apply this analysis to both stationary and non-stationary model, we rely on Katz (2012), who suggests the following: (i) to transform the data to stationary Gumbel and (ii) to use a Q-Q plot analysis on the transformed data with respect to a Gumbel distribution. Q-Q plots reveal that transformed data are well fitted by a stationary Gumbel distribution; hence that data are well fitted by the selected models (Appendix B). Moreover, according to the comparative study of Abidin et al. (2012), the most powerful goodness-of-fit test for the Gumbel distribution is a combination of the Anderson-Darling test and the maximum likelihood estimator. We apply this test on the transformed data using Saeb (2018) and found that we cannot reject the null hypothesis (samples generated from the Gumbel model) at the 5 % significance level for almost all our selected models (98 %), justifying their good fit. We refer to Appendix B for more details.
Model significance. If the selected model M N is not the model M 0 , then -since models are nested -we can compute the significance of M N with respect to M 0 with a likelihood ratio test (Coles, 2001). This test assesses whether there is enough evidence to reject the stationary Gumbel model M 0 in favour of the selected model M N . The null hypothesis can be stated as follows: the N additional parameters of the model M N can be set to zero. In other words, we want to check if setting to zero the N additional parameters of the model M N are supported by the data z. Under the null hypothesis, the likelihood ratio test statistic (LR) has an asymptotic χ 2 where∼ means distributed under suitable regularity conditions. In practice, the test works as follows. We first choose a 0.05 level of significance. Then, if LR is greater than q χ 2 N , the 1-0.05 = 0.95 quantile of the χ 2 N distribution, we reject the nested model M 0 in favour of the selected model M N . If the selected model M N is nonstationary, we consider the associated trend as significant.

Return levels
In a stationary context, the T -year return level, which corresponds to a return period of T years, is the classical metric to quantify hazards of extreme events (Cooley, 2012). For a stationary model, there is a one-to-one relationship between a return level (a quantile exceeded each year with probability p) and a return period (a duration exceeded every T = 1 p years on average).
In a non-stationary context, return level and return period concepts (Cooley, 2012) become further ambiguous, are prone to misconceptions and can lead to misleading conclusions (Serinaldi, 2015). We focus on the yearly level for a fixed probability of exceedance, a.k.a effective return level (Katz et al., 2002;Cheng et al., 2014), as it conveys best that hazard evolves with time.
For the stationary Gumbel model M 0 , the return level z p (θ M 0 ) is defined as the level exceeded each year with probability p. In other words, if Z denotes an annual maximum, then P (Z ≤ z p (θ M 0 )) = 1 − p. This return level is constant through time and equals z p (θ M 0 ) = µ 0 −σ 0 log(− log(1−p)). In this paper, we set p = 1 50 = 0.02 as it corresponds to the 50-year return period defined by For the selected model M N , the return level is defined as the yearly level for a fixed probability of exceedance p. For any model considered in Table 2, we ob- where we set µ 1 , σ 1 or ζ 0 to 0 if they are not defined in the model M N . For example, for the Gumbel model M 0 , the return level is constant: for any year For any considered model, the time derivative of the return level is constant, as . It quantifies the yearly change of return level. Thus, the relative difference of return levels between year t 1 and year t 2 is as follows: In the context of maximum likelihood estimation, uncertainty related to return levels can be derived by the delta method, which quickly provides confidence intervals both in the stationary and non-stationary case (Coles, 2001;Gilleland and Katz, 2016). First, the return level estimator associated with the maximum likelihood estimator simply equals z p ( θ M ). Then, due to the asymptotic normality of the maximum likelihood estimator (MLE), we can assume that, even with a finite number of data, the MLE is normally distributed. Therefore, under regularity conditions, limits of the 1 − α = 95 % confidence interval are θ M ± q α 2 × v z p ( θ M ), where q α 2 is the 1 − α 2 quantile of the standard normal distribution, and v z p is a function that maps each parameter θ M to the variance of the approximate normal distribution associated with its return level z p (θ M ). For a full expression of the function v z p and for details on the delta method, we refer to Theorem 2.4 of Coles (2001). In particular, this theorem explains that the delta method is valid for ζ 0 < 1, which is respected in our case as −0.5 ≤ ζ 0 ≤ 0.5 (Sect. 4.4). Also, uncertainty of non-stationary return levels z p ( θ M , t) can be obtained by incorporating the covariate t in the function z p .

Application
First, we exclude four time series of annual maxima with more than 10 % of zeros, i.e. years without GSL. Then, we fit models to time series and retain only those models with −0.5 ≤ ζ 0 ≤ 0.5. This impacts three time series. We make this choice because ζ 0 > 0.5 designates distributions with an "exploding" tail which are known to be physically implausible (Martins and Stedinger, 2000). Following Sect. 4.2, we select one model for each time series (i.e. for each massif and altitude) with the AIC statistical criterion. Then, we exclude the five time series (2 %) where the selected model does not pass the Anderson test. Finally, we assess if the selected model is significantly more appropriate than the stationary Gumbel model M 0 with a likelihood ratio test. Figure 3 shows that a stationary model, i.e. models M 0 and M ζ 0 , is selected for a majority (57 %) of time series studied (Sect. 2). Models with a linearity in both the location and scale parameters are the most frequently selected non-stationary models (22 %). For both stationary and nonstationary models, Gumbel models are always more often selected that their corresponding GEV models (Figs. 3, 4). All in all, we highlight that 39 % of selected models are significantly more appropriate than the stationary Gumbel model M 0 . Figure 4 depicts shape parameter values for the selected models at 900, 1800 and 2700 m. We notice that a majority of massifs are white, illustrating that a (stationary or nonstationary) Gumbel model (i.e. ζ 0 = 0) is selected (Sect. 5). This emphasizes that a Gumbel distribution often explains more succinctly the data than a GEV distribution. Also, with the GEV distribution, the estimated most likely shape parameter ζ 0 is often quite uncertain; that is, confidence intervals are large, which is the main reason why French standards did not rely on it. This uncertainty in ζ 0 likely comes from the limited length of time series. Therefore, additional data would enable a more robust estimation of ζ 0 and thus reduce uncertainty. In Fig. 4, we further observe that non-null shape parameters at low altitudes (900 m) are always positive (browncoloured massifs); that is, a Fréchet distribution is preferred. On the other hand, for high altitudes (1800 and 2700 m) nonnull shape parameters are always negative (green-coloured massifs); that is, a Weibull distribution is favoured. Similar results for the shape parameter have been observed for snow depth by Blanchet et al. (2009), Blanchet and Lehning (2010), and Schellander and Hell (2018). This reflects the different nature of annual maxima of GSL between low and high altitudes. At high altitudes, annual maxima are mainly due to snowpack accumulation during several months, while at low altitudes this accumulation is limited, and thus annual maxima roughly correspond to heavy precipitation.

Trends in return levels of ground snow load
Figure 5 maps the relative change of 50-year return levels of GSL between 1960 and 2010 (Eq. 4) at 900, 1800 and 2700 m (see Appendix A for maps at all altitudes). Quantitatively, for northwest massifs, we observe that return levels have decreased by up to 60 % at 900 m (dark blue), while at 1800 m this decrease is less marked (pale blue). Qualitatively, these decreasing trends are frequently due to significant changes both in the location and scale parameters of the Gumbel or GEV distribution (small and large diamondshaped filled markers). At 2700 m, or in the south at 900 and 1800 m, we often do not observe any trends (white), since stationary models are selected (small and large cross-shaped markers). Figure 6 emphasizes the evolution of decreasing trends between 900 and 4800 m of altitude. We observe that decreasing trends are significant for more than one-third of the massifs, located in the northwest of the Alps (Appendix A), up to 2100 m (black bars). In half a century, return levels have dropped on average by up to 30 % at 900 m. Until 3300 m, we observe a decline in the percentage of massifs with a significant decreasing trend. Above 3300 m, we do not find any sig-nificant decreasing trend. For both the relative decrease and the percentage of massifs with a decreasing trend, we notice a similar declining pattern. We also notice more decline between 3300 and 3900 m than at 3000 m, which echoes results from Lüthi et al. (2019), who found that, in the Alps above 3000 m, the relative decrease for projected winter mean of fresh SWE is more marked than at 3000 m (see their Fig. 8). We emphasize, however, that most meteorological observations used as input to the SAFRAN-Crocus reanalysis are situated below 2000 m. Therefore, trends beyond 2000 m altitude should be considered with great caution. Figure 7 illustrates that, for altitudes 300 and 600 m, in general no trends are found except for a few decreasing trends at 600 m and two time series (1 at 300 m, 1 at 600 m) with important increasing trends (+100 % for one massif at 600 m). Despite this important increase in relative change, annual maxima of snow load remain small (< 1 kN m −2 ). Indeed, we found that these annual maxima correspond to snow load accumulated in a few days and thus are mainly driven by heavy precipitation rather than a seasonal snowpack accumulation. In particular, we hypothesize that the important increasing trend observed in the south at 600 m (colour red) might be caused by a regional phenomenon, resulting from Mediterranean humid air masses flowing northward into the north of Italy and then westward to the eastern part of the French Alps, that might be intensifying with global warming (Garavaglia et al., 2010;Gottardi et al., 2012;Faranda, 2020).
To sum up trends in return levels of ground snow load, from 900 to 4800 m, either no trends or decreasing trends of 50-year return levels of GSL are found (Figs. 5, 6, Fig. A1), while at 300 and 600 m, no clear trends are found (Fig. 7).

Comparison of return levels of ground snow load with French standards
We compare 50-year return levels of GSL and their uncertainty (Sect. 4.3) to French standards first for two massifs   Figure 8 illustrates these levels and their uncertainty for two massifs (Vercors and Beaufortain) associated with different French standards regions. Standards are often exceeded at higher altitudes (e.g. at 1800 m). Also, Fig. 8 exemplifies the impact of accounting for decreasing trends in return levels. Indeed, we observe that return levels from the stationary Gumbel model M 0 (left) are often larger than effective return levels in 2019 (last year of data) from the selected model M N (right). Figure 9 sums up the comparison between French standards and 50-year return levels for all 23 massifs. We display (i) the percentage of massifs whose return level exceeds standards and (ii) the mean relative difference between return levels and standards. Limits of the confidence intervals are approximated as the percentage of exceedances for the limits of return levels' 95 % confidence interval are displayed with black bars. Limits of the confidence intervals for the mean relative difference are displayed in shaded blue. The number of massifs considered is equal to 7 at 300 m, 17 at 600 m and 23 at 900 m and above.
First, if we estimate return levels from data with the French standards method (Fig. 9 left), i.e. with a stationary Gumbel model M 0 , and GSL data approximated with snow depth obtained from reanalysis and ρ SNOW = 150 kg m −3 , then we observe few exceedances (less than 10 %) and that on average return levels remain below standards, as the mean relative difference remains below zero. Thus, in this setting, estimations from our reanalysis are consistent with French standards.  However, if we consider the actual GSL, i.e. computed with SWE from the reanalysis, then French standards drastically underestimate return levels. Indeed, with a stationary Gumbel model M 0 , then for altitudes above or equal to 900 m, French standards are exceeded for a majority of massifs (Fig. 9 centre). But, if we consider the selected model M N , i.e. if we account for the decreasing trend in 50-year return levels, we have fewer exceedances at all altitudes (Fig. 9 right). In the latter case, at worst, i.e. at 1800 m, return levels exceed standards by 15 % on average, and half of the massifs (60 %) exceed standards.
Furthermore, despite the fact that uncertainty intervals (black bars) can be large, it does not impact the main conclusions of this article. Indeed, in Fig. 9 right at 1800 m, we still have between 40 % and 80 % of massifs exceeding French standards.

Methodological considerations
We discuss in depth the statistical models chosen for this study. It is well-known that an annual-maximum-based approach can be wasteful in terms of data (Coles, 2001). However, since our objective is to estimate 50-year return levels and since we have 60 years of data, we still decide to rely on the annual-maximum-based approach (with the GEV distribution) rather than on the concurrent approach based on threshold exceedances (with the generalized Pareto distribution). Also, with the GEV distribution, our methodology is a direct extension of French building standards (Sect. 3).
For the non-stationary models, we focus on simple deterministic functions of time (µ(t), σ (t), ζ (t)) due to the limited length of time series. A linear non-stationarity seems preferable to a non-stationarity based on the Heaviside step function due to the continuous nature of climate change. We start the linear non-stationary at the initial year, i.e. 1959.
We decided to consider non-stationarity only for the location and scale parameter. Indeed, in the literature, a linear non-stationarity is considered sometimes only for the location parameter (Fowler et al., 2010;Tramblay and Somot, 2018) but more often both for the location and the scale (or log-transformed scale for numerical reasons) parameters (Katz et al., 2002;Kharin and Zwiers, 2004;Marty and Blanchet, 2012;Wilcox et al., 2018). Also, we consider a non-stationarity for both parameters because the scale parameters were not proportional to the location parameters, which could have otherwise simplified our parametrization. Finally, the shape parameter is typically considered constant in the literature, and we follow this approach.
For time series containing zeros, French standards rely on a mixed discrete-continuous distribution. They fit both a Gumbel distribution on non-zero annual maxima and the probability of having a non-zero annual maxima. However, with our reanalysis data, this approach sometimes leads to fitting non-stationary extreme value models with fewer than 40 non-zero annual maxima. Therefore, we rather decided to exclude any time series with more than 10 % of zeros (Sect. 4.4), to ensure that we fit models with more than 55 non-zero annual maxima. In practice, our approach gives 50year return levels close to the approach from French standards (absolute difference remains lower than 0.1 kN m −2 ).
6.2 On the limitation to approximate annual maxima of ground snow load with annual maxima of snow depth SWE times the gravitational constant equals GSL. However, most countries do not measure SWE but only have access to snow depth (HS) (Haberkorn et al., 2019). In that case, snow density is required to obtain SWE (and subsequently GSL) from HS (Sect. 1). In particular, French standards approximate annual maxima of GSL with annual maxima of HS and by assuming a constant snow density, equal to ρ SNOW = 150 kg m −3 . In Fig. 10, we highlight limitations of such approaches with our reanalysis that provides, for the whole snowpack, daily values of SWE, HS and thus of snow density. We find that annual maxima of GSL are always underestimated by French standards' approximation ( Fig. 10 left). The main reason is that, when annual maxima of GSL are reached, snow density is on average largely superior to ρ SNOW = 150 kg m −3 (Fig. 10 centre). Indeed, we observe that at the time of the annual maxima of GSL the snow den- sity is around ≈ 350 kg m −3 on average at 2700 m and close to ≈ 250 kg m −3 on average at 900 m. Finally, despite high variations along the years, we also notice that, when annual maxima of GSL are reached, snow depth can be much lower than the annual maxima of snow depth (Fig. 10 right), which is another argument against the use of snow depth maxima as a proxy for GSL maxima.

Conclusions
Based on both a reanalysis and a snowpack model, we detect an overall temporal decreasing trend of 50-year return levels of ground snow load (GSL) between 900 and 4200 m, which is significant up to 2100 m in the northwest of the French Alps. This confirms other studies in the Western Alps which also found overall decreasing trends in linked snowpack variables: SWE and snow depth. The largest decrease is found at 900 m with −30 % for return levels between 1960 and 2010. Despite these decreases, in 2019 return levels still exceed return levels designed for French building standards under a stationary assumption. At worst, i.e. at 1800 m, return levels exceed standards by 15 % on average, and half of the massifs exceed standards.
We hypothesize that this number of exceedances might be due to an underestimation of GSL by French standards. Indeed, these standards were devised with GSL estimated from snow depth maxima and constant snow density equal to 150 kg m −3 , which underestimate typical GSL values for the snowpack. Another reason for these exceedances might be ill-designed relationships between altitude and snow load. As shown in Fig. 2, French standards return levels increase linearly with altitude in three steps. Indeed, French standards (Biétry, 2005) follow previous national standards that advised for a linear relationship between altitude and snow load instead of relying on European standards' results that showed a quadratic relationship for the Alpine region (Sanpaolesi et al., 1998). Thus, at higher altitudes, French standards underestimate actual return levels, which might explain the augmenting percentage of exceedance observed with the altitude (Fig. 9 right).
Many potential extensions of this work could be considered. First, our methodology could be extended with more advanced definitions of non-stationary return levels (Rootzén and Katz, 2013;Serinaldi, 2015). Also, instead of considering time series of annual maxima as spatially independent, we believe that our analysis may benefit from an explicit modelling of the spatial dependence between extremes. Then, reanalyses are increasingly available at the European scale (e.g. Soci et al., 2016), which could be used for extending this work to a wider geographical scale. This requires, however, remaining cognizant of the limitations of such reanalyses, in particular (i) the temporal heterogeneity of the meteorological data input to these reanalyses (Vidal et al., 2010); (ii) the lack of observations at high altitudes, requiring caution in analysing trends for high-altitude locations; and (iii) model errors (e.g. snowpack model errors) which need to be taken into account when analysing the results.
Finally, even if, according to our analysis, GSL exceeds French standards return levels in the French Alps, (Fig. 9 right), few destructions related to snow loads actually occurred. Several reasons might explain that. First, French standards consider a coefficient that maps GSL return levels to roof snow load return level, i.e. multiplication by a coefficient that summarizes several roof features: shape, exposure and thermal transmission (Sanpaolesi et al., 1998). This coefficient might be overprotective. Also, following European standards, roof designers must add safety coefficients to ensure roofs' reliability. Indeed, they actually build roofs that withstand the sum of (i) the characteristic value of permanent action, i.e. self-weight, multiplied by a safety coefficient equal to 1.35 and (ii) the characteristic value of variable action, i.e. roof snow load return level, multiplied by a safety coefficient equal to 1.5 (Sanpaolesi et al., 1998 Eq. 8). Above all, French standards do not take into account that, after intense days of snowfall, the snow accumulated on the roof either slides off or is removed. In that case, the main risk lies in extreme snow events that might accumulate enough snow in a few days to exceed French standards. Undeniably, most known snow load destructions resulted from such intense snow events, sometimes combined with liquid precipitation that often heavily increases snow load. The response of these short but extreme and complex snow events to climate change might be an interesting topic for future research. In this section, we report, for every 300 m of altitude from 900 to 4200 m, the map of the relative change of 50-year return levels of GSL between 1960 and 2010 (Fig. A1). Trends at 4500 and 4800 m are not reported, since they only concern the Mont Blanc massif, where no significant trend is inferred at these altitudes. Quantile-quantile plot. Standard diagnosis tools for both stationary and non-stationary extreme value models (Coles, 2001;Katz, 2012) rely on a probability integral transformation f to the standard Gumbel distribution, i.e. Gumbel(0, 1).
Quantile-quantile (Q-Q) plot is a standard diagnosis based on the comparison of empirical quantiles (computed from the empirical distribution) and theoretical quantiles (computed from the expected distribution). On the one hand,z (1) , . . .z (61) are the empirical quantiles, which correspond to the ordered values of thez t . On the other hand, − log − log 1 62 , . . ., − log − log 61 62 r correspond to the theoretical quantiles. Indeed, ifZ ∼ Gumbel(0, 1), then P (Z ≤z) = exp −e −z = i 62 ↔z = − log − log i 62 . Thus, the Q-Q plot is comprised of the pairs { − log − log i 62 ,z (i) ; i = 1, . . ., 61}. In Fig. B1, we display Q-Q plots for the three time series of annual maxima of GSL displayed in Fig. 1. We observe that the left and the right Q-Q plots show a good fit, as the points stay close to the line. However, for the centre Q-Q plot, all points are close to the line, except for the highest empirical quantile that is largely above the corresponding theoretical quantile. As a whole, when observing all Q-Q plots (not shown) most time series show a good fit, except for a few time series (fewer than 10) which have a pattern similar to the centre Q-Q plot in Fig. B1. Anderson-Darling test. Q-Q plot is a qualitative tool to validate the goodness of fit for probability models. For the quantitative validation of the goodness of fit of the selected models, we rely on the Anderson-Darling statistical test, which is the most powerful test for the Gumbel distribution according to the comparative study of Abidin et al. (2012).
In practice, with this test, we assess whether the transformed annual maximaz (1) , . . .,z (61) are likely to be generated from a standard Gumbel distribution. Let n = 61 denote the number of samples, and F emp denotes the cumulative distribution function of the empirical (F gum for standard Gumbel) distribution. Then, Anderson-Darling test is based on the distance: where w(x) places more weight on the tail of the standard Gumbel distribution. For details, we refer to Abidin et al. (2012). We apply this test on the transformed data using Saeb (2018) and found that we cannot reject the null hypothesis (samples generated from the Gumbel model) at the 5 % significance level for almost all our selected models (98 %), justifying their good fit. As explained in Sect. 4.4, we exclude time series whose selected models do not pass this Anderson-Darling test.
Author contributions. ELR performed the analysis and drafted the first version of the manuscript. All authors discussed the results and edited the manuscript.
Competing interests. The authors declare that they have no conflict of interest.
Special issue statement. This article is part of the special issue "Advances in extreme value analysis and application to natural hazards". It is a result of the Advances in Extreme Value Analysis and application to Natural Hazard (EVAN), Paris, France, 17-19 September 2019.
Acknowledgements. The S2M data are provided by Météo-France -CNRS, CNRM Centre d'Etudes de la Neige, through AERIS. We are grateful to Eric Gilleland for his "extRemes" R package. Finally, we are indebted to Jacques Biétry for providing us the report on French standards with respect to ground snow load and for his explanations on their methodology. Review statement. This paper was edited by Ivan Haigh and reviewed by two anonymous referees.