Using high-resolution regional climate models to estimate return levels of daily extreme precipitation over Bavaria

Extreme daily rainfall is an important trigger for floods in Bavaria. The dimensioning of water management structures as well as building codes are based on observational rainfall return levels. In this study, three high-resolution regional climate models (RCMs) are employed to produce 10-year daily rainfall return levels and their performance is evaluated by comparison to observational return levels. The study area is governed by different types of precipitation (stratiform, orographic, convectional) and a complex terrain, with convective precipitation also contributing to daily rainfall levels. The Canadian 10 Regional Climate Model version 5 (CRCM5) at 12 km spatial resolution and the Weather and Forecasting Research model (WRF) at 5 km resolution both driven by ERA-Interim reanalysis data use parametrization schemes to simulate convection. The WRF at 1.5 km resolution driven by ERA5 reanalysis data explicitly resolves convectional processes. Applying the Generalized Extreme Value (GEV) distribution, all three model setups can reproduce the observational return levels with an areal average bias of +6.6 % or less and a spatial Spearman rank correlation of ρ > 0.72. The increase of spatial resolution 15 between the 12 km CRCM5 and the 5 km WRF setup is found to improve the performance in terms of bias (+6.6 % and +3.2 %) and spatial correlation (ρ = 0.72 and ρ = 0.82). However, the finer topographic details of the WRF-ERA5 return levels cannot be evaluated with the observation data because their spatial resolution is too low. Hence, this comparison shows no great further improvement (bias = +1.1 %, ρ = 0.82) of the overall performance compared to the 5 km resolution setup. Uncertainties due to extreme value theory are explored by employing three different approaches for the highest-resolution 20 WRF-ERA5 setup. The GEV distribution with fixed shape parameter (bias = +0.9 %, ρ = 0.79) and the Generalized Pareto (GP: bias = +1.3 %, ρ = 0.81) show almost equivalent results for the 10-year return period, whereas the Metastatistical Extreme Value (MEV) distribution leads to a slight underestimation (bias = -6.2 %, ρ = 0.86). From these results, it follows that high-resolution regional climate models are suitable for generating spatially homogeneous rainfall return level products. In regions with a sparse rain gauge density or low spatial representativeness of the stations due 25 to complex topography, RCMs can support the observational data. Further, RCMs driven by global climate models with emission scenarios can project climate change-induced alterations in rainfall return levels at regional to local scales. This would allow adjustment of structural design and, therefore, adaption to future precipitation conditions. https://doi.org/10.5194/nhess-2021-66 Preprint. Discussion started: 29 March 2021 c © Author(s) 2021. CC BY 4.0 License.

especially in regions with complex topography (Poschlod et al., 2021). Also, the intensity of short-duration hourly rainfall extremes could not be reproduced at 12 km spatial resolution.
When communicating the results of climate model projections to local or regional stakeholders, insurance companies, and governmental authorities in the field of flood prevention, hydrological modelling, dimensioning of reservoirs, buildings, and water infrastructure, these aforementioned biases may prevent the results from being accepted and implemented (Benjamin 65 and Budescu, 2018).
For shorter durations, many studies have shown that higher-resolution RCMs, so-called convection-permitting models (CPMs), improve the reproduction of high-intensity short-duration convectional precipitation events (Brisson et al., 2016;Coppola et al., 2018;Fosser et al., 2014;Kendon et al., 2014). A spatial resolution of a few kilometres is considered necessary by the RCM community to explicitly resolve convection (Langhans et al., 2012, Panosetti et al., 2020Prein et al., 2015), whereas at 70 broader-resolutions parametrization schemes are applied to represent convection. However, also long-duration rainfall return levels can be influenced by convectional precipitation. In Germany, convectional rainfall contributes to the 24-hourly return level for roughly 50 % of the area (Malitz and Ertel, 2015). Therefore, CPMs are expected to improve the estimations of these return levels as well. Additionally, the higher spatial resolution enhances the representation of complex terrain (Karki et al., 2017;Langhans et al., 2012;Poschlod et al., 2018). 75 Hence, in this study, three different high-resolution RCMs featuring 12 km, 5 km, and 1.5 km spatial resolution and driven by 30-year reanalysis data are applied to reproduce daily 10-year rainfall return levels over the complex terrain of the northern Pre-Alps and Alps. Based on interviews with stakeholders from the infrastructure sector and on legislative guidelines, Nissen and Ulbrich (2017) identified the 10-year return level as relevant threshold. Following this recommendation but also to avoid large extreme value statistical uncertainties based on the 30-year time series the 10-year return level is chosen in this study as 80 well representing "moderate extremes". The daily duration is relevant for the generation of riverine floods in the study area (Berghuijs et al., 2019;Keller et al., 2017;Merz and Blöschl, 2003), such as the two extreme flooding events in May 1999and August 2005in southern Bavaria, Austria, and Switzerland (BLFW, 2003Grieser et al., 2006;LfU, 2007;Stucki et al., 2020) induced by high daily precipitation sums. The daily 10-year return levels based on the three RCM setups are evaluated by means of an observational return level product using national datasets from Germany, Austria, and Switzerland. In a second 85 step, different extreme value distributions and sampling strategies are applied to the highest-resolution climate model dataset to explore uncertainties due to extreme value theory and to investigate possible improvements.
It is studied if RCMs can bridge the data gap of spatial homogeneous rainfall return levels and if higher spatial resolution can decrease the biases over areas with complex topography. Therefore, the study aims to evaluate the added value due to higher spatial resolution in terms of biases and spatial correlation between the climate model products and the observational product. The area of investigation is given by the domain of the highest-resolution RCM, which is centred over the state of Bavaria, and the available observational rainfall return level data (see Fig. 1). It covers south-eastern Germany, north-western Austria, north-eastern Switzerland and Liechtenstein. The area shows altitude levels below 100 m in the northwest in the Rhine plain 95 up to altitudes above 2500 m in the Alps. It covers various low mountain ranges, including the Ore Mountains, Odenwald, Swabian Jura and Bavarian Forest. The patterns of annual mean precipitation are governed by the complex topography (see Fig. 2). Different rainfall types (convectional, orographic, stratiform) contribute to this precipitation climatology (Malitz and Ertel, 2015). The lowest annual precipitation sums amount to 500 -700 mm in the north of the study area. The low mountain ranges induce orographic lifting leading to precipitation sums of 1000 to 1500 mm per year. The highest precipitation sums of 100 more than 2000 mm are found in the Alps, with dry valleys, such as the Inn valley having totals below 1000 mm. Annual average temperatures range from less than 0°C in the Alps to 10°C in northern Bavaria (DWD 2021, ZAMG 2021

Observational rainfall return level data
To evaluate the RCMs, an observation-based product is generated from the three national datasets described below. As these 110 datasets extend to the national borders and a little beyond, the arithmetic mean is calculated in the overlapping areas. To compare gridded precipitation from the RCMs and point measurements from the observations, Breinl et al. (2020) suggest an areal reduction of 5 % for pointwise 24-hourly 10-year return levels in Austria. However, to be consistent over the study area, no areal reduction factor is applied for the daily duration following Berg et al. (2019) and Poschlod et al. (2021).

Germany 115
The German weather service offers gridded return level data derived from rain gauge measurements (Malitz and Ertel, 2015).
The observations cover a period of maximum 1951 -2010, where only May -September are analysed as the highest rainfall amounts occur during these months. A peak-over threshold (POT) sampling strategy was applied for 2231 rain gauges, where the threshold corresponds to the available time period. A maximum of 2.718 events per year on average was considered. For these samples, an exponential distribution was fitted. The resulting rainfall return levels were spatially interpolated over 120 Germany at roughly 8 x 8 km² resolution. An uncertainty range of 15 % is assumed for the 10-year return levels, which is induced by measurement errors, uncertainties of the extreme value statistics and regionalization, and the internal variability of the climate system (Junghänel et al., 2017). Data are accessed from DWD (2020). As running window 24-hourly return periods are provided, the rainfall intensities are reduced by 14 % to transfer them to daily estimates. This relation between daily fixed windows and 24-hourly moving windows has been applied by Poschlod et al. (2021) following Barbero et al. (2019) and 125 Boughton and Jakob (2008).

Austria
The Austrian dataset follows a similar approach as the German dataset also applying POT sampling at 141 ombrographs and 843 ombrometers spatially interpolated to gridded return levels at 6 x 6 km² resolution (BMLRT, 2018). As the rain gauges are distributed inhomogeneously yielding too low return level estimations, the "orographic convective model" OKM (Lorenz 130 and Skoda, 2001) was employed to support the observations (Kainz et al., 2007). The resulting design rainfall is based on a combination of the observational data and the weather model simulations. Further details can be found in Kainz et al. (2007) and BMLRT (2006;2018). Data are accessed from BMLRT (2020). The 24-hourly return levels are adjusted to daily values using the reduction of 14 %.

Switzerland 135
MeteoSwiss (2021) provides pointwise daily rainfall return levels at 336 rain gauges. The observations cover the time period from 1966 to 2015. To increase the sample size, seasonal maxima were extracted and assumed to follow a Generalized Extreme Value (GEV) distribution. The GEV distribution is fitted via Bayesian estimation and the according return levels are generated (Fukutome et al., 2015). Since an areal comparison product is to be produced in this study, these point return levels are regionalised by means of ordinary kriging. 140

Climate model data
Three different RCM setups are used. The Canadian Regional Climate Model version 5 (CRCM5) driven by ERA-Interim, the Weather and Research Forecasting Model (WRF; Skamarock et al., 2008)

CRCM5 ERA-INTERIM 145
The CRCM5 at 0.11° resolution equalling roughly 12 km is driven by ERA-Interim reanalysis data. Convectional processes are parametrized due to the spatial resolution. Processes related to deep convection are calculated with the parametrization scheme by Kain and Fritsch (1990). The Kuo transient scheme (Bélair et al., 2005;Kuo, 1965) is applied to represent shallow convection. A more detailed documentation of the model setup and options used is given by Hernández-Díaz et al. (2012) and Martynov et al. (2012). Daily rainfall sums of 30-year time period of 1980 -2009 are extracted for this study. 150

WRF ERA-INTERIM
The WRF version 3.6.1 is set up in nested domains of 45 x 45 km², 15 x 15 km² and 5 x 5 km² spatial resolution in its nonhydrostatic mode and driven by ERA-Interim reanalysis data at 75 x 75 km² spatial resolution and 6-hourly temporal resolution . Spectral nudging is applied to reduce deviations from the large-scale forcing patterns in the reanalysis data (Wagner et al., 2018). Convection is parametrized with the Grell-Freitas scheme (Grell and Freitas, 2014). The detailed 155 model setup as well as an evaluation of different climate variables is given in Warscher et al. (2019). Here, daily rainfall data of the highest-resolution domain are used for the time period of 1980 -2009. Data are accessed from Warscher (2019).

WRF ERA5
The WRF model version 4.1 is configured with two one-way nested domains of 7.5 x 7.5 km² and 1.5 x 1.5 km² grid spacing centred over Bavaria (Collier and Mölg, 2020). The model is forced at the outer lateral boundaries by ERA5 reanalysis data at 160 30 x 30 km² spatial resolution and 3-hourly temporal resolution applying spectral nudging. The higher-resolution 1.5 km setup is assumed to explicitly resolve convection, and therefore no parametrization scheme is applied. The 30-year simulation was divided into 30 annual slices starting at 1 September of each year. A detailed description of the model setup and evaluation of various climate variables is provided in Collier and Mölg (2020). However, the authors emphasize that the applied schemes and the model configuration has not been optimized for the study area due to the high computational expenses of the high-165 resolution run. The physics and dynamics options used in the simulations are based on former convection-permitting WRF applications (e.g. Collier et al., 2019). In this study, daily rainfall sums from 1988 -2017 are extracted from the climate model data accessed from Collier (2020).

Sampling strategies 170
Extreme value theory (EVT) is applied to quantify the stochastic behaviour of a process at unusually large or small values. It is commonly used to calculate return levels for different rainfall durations. There, two classical approaches exist to sample these unusual ("extreme") rainfall intensities (Coles, 2001). For the block maxima (BM) approach, a single value is extracted https://doi.org/10.5194/nhess-2021-66 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. from a typically seasonal or annual block. This strategy ensures that the samples are distant from each other leading to very low serial dependence. However, not all sampled values might be extreme. Also, the information of more than one extreme 175 value per block is lost as these values are discarded.
The second approach peak-over threshold (POT) tries to overcome these drawbacks as all values s above a threshold u are sampled as extreme values (Balkema and de Haan, 1974;Picklands, 1975). Therefore, multiple values per block are allowed.
However, additional restrictions have to be introduced to ensure approximately independent samples. To prevent successive data points from being sampled that originate from one persistent rainfall event, the time series has to be de-clustered. 180 Therefore, a temporal threshold tdecluster is chosen and all values within the duration of tdecluster around the sampled extreme value are discarded (Coles, 2001).
For both classical approaches only a limited number of samples contributes to the database of extreme values. A newer approach by Marani and Ignaccolo (2015) samples all "wet" events assuming that the information of these "ordinary" values can be used to estimate the distribution of extreme values. Thereby, wet events are defined by a threshold twet. It has been 185 successfully applied for extreme daily precipitation by Zorzetto et al. (2016).

Extreme value distributions
According to the sampling strategy, different theoretical extreme value distributions (EVDs) are typically found to represent the distributions of the samples. 190 The Fisher-Tippett-Gnedenko theorem (Fisher and Tippett, 1928;Gnedenko, 1943) states that the distribution of the block maxima samples tends to follow the GEV distribution G with the sample size n → ∞: The location parameter µ governs the center, and the scale parameter σ governs the spread of the GEV distribution. The tail behaviour of G is defined by the shape parameter ξ determining whether the GEV follows the Weibull (ξ < 0), Gumbel (ξ = 0), or Fréchet (ξ > 0) distribution (Gilleland et al., 2017). Hence, the GEV is a very flexible distribution. The drawback of this flexibility shows up in a high estimation variance of ξ resulting in an unstable quantile estimate (Bücher et al., 2020).
For the POT approach, the exceedances y = su are sampled for the threshold u and samples s > u. Thereby, the number of 200 exceedances per year is assumed to follow a Poisson distribution (Davison & Smith, 1990). The exceedances y of the POT threshold u are described by the two-parameter Generalized Pareto (GP) distribution Smith, 1990, Martins andStedinger, 2001). The corresponding cumulative density function (CDF) is given by where y defines the precipitation excess over the threshold u of the POT sampling. The scale parameter β and shape parameter ξ describe the spread and tail behaviour of the GP distribution (Coles, 2001).
Both statistical frameworks can be expressed by the other one as the GP distribution corresponds to the tail distribution of the GEV (Coles, 2001;Goda, 2011;Serinaldi and Kilsby, 2014).
The newer approach by Marani and Ignaccolo (2015) features the Metastatistical Extreme Value (MEV) distribution. They 210 propose a framework supposing that the "meta-statistic" of the rainfall sums of wet events per year contains information about the intensity of extreme events. They assume the sampled wet days > twet to be independent following that the probability distribution of maxima ζm can be expressed as ( ) = ( ) , where nj is the number of wet events in a year and F(x) is a distribution describing the rainfall sums of these events. Based on the results of Wilson and Toumi (2005), the distribution of rainfall sums during wet days per year is found to follow a distribution with an exponential tail. They expressed precipitation 215 as the product of mass flux, specific humidity and precipitation efficiency. Following statistical relationships, they concluded that the tail of the distribution of the product of these three random variables is given by a stretched exponential form. Marani and Ignaccolo (2015) and Zorzetto et al. (2016) apply a Weibull distribution to describe this relationship. Hence, Weibull parameters have to be estimated for each year based on all wet events of a year. The MEV-Weibull CDF is given by where j is the year (j = 1, 2, …, M), and nj is the number of wet events in year j. Cj and wj describe the scale and shape of the Weibull distribution (Marani and Ignaccolo, 2015).

Applied approaches 225
For all three RCM setups, annual maxima of daily precipitation are extracted. Then for all grid cells trends were detected applying the Mann-Kendall test at the significance level of α = 0.05. There, the critical p-value is adjusted for multiple testing following Wilks (2016). No significant trends are found for the 30 sampled values at each grid cell of every RCM setup. The parameters of the GEV distribution G are optimized to the BM samples by estimating the L-moments (Hosking et al., 1985). This is carried out applying the R package "extRemes" by Gilleland and Katz (2016). Delicado and Goria (2008) recommend 230 the method of L-moments for sample sizes of n ≤ 50 as it is robust to outliers in the data. The Anderson-Darling test at the significance level of α = 0.05 is applied to ensure the goodness of fit of the estimated GEV distribution at each grid cell. Again, the critical p-value is adjusted for multiple testing. Based on these fits, the 10-year return levels are derived. The spatial   There, the location and scale parameters are governed by the topography, where the spatial distribution of these parameters is similar for all three RCM setups. The fitted shape parameter reveals a chaotic pattern with small patches of positive and 240 negative values differing for the three RCM setups.
The histograms of the parameters are given in the Supplementary Materials (Fig. S1). An exemplary fit for the grid cell of Munich is shown in Figure S2 for all three RCM setups. This EVT approach is referred to as GEV-LMOM.
To assess uncertainties due to the different EVT approaches described in sections 3.1 and 3.2, a modified GEV approach as well as the POT and MEV approaches are explored for the WRF-ERA5 data. As small samples lead to high uncertainties 245 estimating the shape parameter of the GEV distribution, Papalexiou and Koutsoyiannis (2013) recommend using a fixed value of ξ = 0.114. This approach is referred to as GEV-FIX.
For the POT approach, the daily rainfall time series is de-clustered applying a conservative threshold tdecluster of 5 days. Typical continental cyclones are found to last up to 2.25 days in Bavaria, whereas van Bebber type Vb cyclones can last up to 3 days (Hofstätter et al., 2018;Mittermeier et al., 2019). Hence, the threshold of 5 days ensures independent samples. Precipitation 250 intensities are assumed to be extreme when exceeding the threshold given by 90 events per 30-year period. This threshold has also been selected by Berg et al. (2019). It amounts to 23.4 mm as spatial average for the whole study area, where the lowest threshold is 12.7 mm. Trends are excluded in the same way as for the GEV-LMOM approach. For sample sizes of n > 50, Delicado and Goria (2008) and Madsen et al. (1997) recommend Maximum Likelihood Estimation (MLE) as optimization algorithm to fit an extreme value distribution. Following this recommendation, MLE is applied to fit the GP distribution to the 255 90 samples per grid cell using the software package by Gilleland and Katz (2016). The goodness-of-fit is assessed in the same way as for the GEV-LMOM approach. An exemplary fit for the grid cell of Munich is shown in Figure S3. This approach is referred to as GP-MLE.
For the MEV approach, I follow the procedure applied by Zorzetto et al. (2016). The Weibull distribution is fitted to the annual wet events by means of the probability weighted moments method (PWM, Greenwood et al., 1979). Wet days are defined by 260 exceedance of the threshold twet = 1 mm d -1 in accordance to WMO guidelines (Klein-Tank et al., 2009). This also accounts for the behaviour of RCMs to produce too many very low-intensity precipitation days ("drizzle-effect"; Gutowski et al., 2003).
The MEV fitting procedure and the calculation of return levels is carried out using the Python software package mevpy (Zorzetto, 2021). This approach is referred to as MEV-PWM.

Results
All approaches and their performance metrics are summarized in Table 1 original data (see Fig. S4 for the natively resolved observational product). The observational product shows the rainfall highest 270 intensities above 100 mm d -1 at the northern slopes of the Alps. The low mountain ranges of the Bavarian Forest, Swabian Jura, Odenwald and Ore Mountains also induce enhanced intensities between 70 mm d -1 and 100 mm d -1 . The lowest return levels are observed in the north of the study area amounting to intensities below 50 mm d -1 (Fig. 4b, e, h). The 12-km resolution CRCM5-ERA-I can reproduce the general spatial pattern with a Spearman rank correlation of ρ = 0.72 (Fig. 4a). The return levels are generally overestimated north of 48° N and underestimated south of 48° N as well as in the Ore Mountains (Fig. 4c). 275 The spatially averaged bias amounts to +6.6 %. The range of simulated rainfall return level intensities is similar to the observations for the whole study area (Fig. 5a) as well as for the southern alpine part (Fig. 5b). However, the histogram also reveals that the bias stems from simulating too few grid cells with return level intensities between 50 mm d -1 to 60 mm d -1 and too many grid cells with return levels at intensities of 70 mm d -1 to 90 mm d -1 (Fig. 5a).  The 10-year return levels based on the WRF-ERA-I at 5 km resolution can recreate the spatial pattern of the observations with a Spearman correlation of ρ = 0.82 (Fig. 4d). The higher intensities due to the orographic precipitation at the lower mountain 290 ranges and their spatial patterns are reproduced, though the intensity around the Bavarian Forest is underestimated. In the alpine area, the WRF-ERA-I simulates higher intensities than observed, especially in the Alps southeast of the Inn valley. The overall bias amounts to +3.2 %. The histogram of simulated return levels is similar to the observed histogram (Fig. 5a), however, the very-high intensities above 110 mm d -1 in the alpine area are overrepresented. Also the range of simulated return levels extends to over 140 mm d -1 (Fig. 5b). 295

Figure 5: Histograms of the resulting 10-year return levels in the whole study area (a-c) and the alpine area south of 48° N (d-f). Gaussian kernel density estimates are plotted to enhance the readability.
The 10-year return levels based on the WRF-ERA5 show a generally similar spatial pattern to the WRF-ERA-I (Figs. 4g and 300 3d). The spatial pattern of orographic precipitation around the low mountain ranges is recreated, whereby the intensities at the Bavarian Forest and the Odenwald are underestimated. The return levels in south-eastern Bavaria are underestimated as well.
As the WRF-ERA-I, also the WRF-ERA5 simulates high return levels above 100 mm d -1 in the Alps southeast of the Inn valley. This results in a Spearman correlation of ρ = 0.82. The spatial average of the bias amounts to +1.1 %. The range and distribution of the simulated return levels is very close to the observations for the whole study area (Fig. 5a) as well as south 305 of 48° N (Fig. 5b).    (Fig 4a, 6a and 6d). The spatial correlation between GEV-FIX and the observations amounts to ρ = 0.79 and the overall bias to +0.8 %. For the GP-MLE, the spatial correlation is ρ = 0.81 and the overall bias is +1.3 %. The MEV-PWM method also shows a similar spatial pattern (Fig. 6g), which is slightly more homogeneous than the GEV-LMOM or GP-MLE. The 10-year return levels based on 315 the MEV-PWM are estimated generally lower than by the classical approaches. The spatial correlation between MEV-PWM and the observations amounts to ρ = 0.86 and the overall bias to -6.2 %.

Discussion
Generally, the high values of the Spearman's rank correlation as well as the visual comparison to the observational product 320 ( Fig. 4) prove that all three RCM setups are able to capture the topographic and climatic differences within the study area.
Furthermore, the overall low bias of the return level indicates that the complex climate of heavy daily precipitation is reproduced by the climate models.
Both, the overall bias, and the spatial correlation imply that the WRF setups at 5 km and 1.5 km spatial resolution can better reproduce the observed return levels than the broader-resolution CRCM5-ERA-I. There, both WRF setups show similar 325 performance metrics. However, this equivalent performance may be caused by the spatial resolution and spatial representativeness of the observational data (see Fig. S4 for the native resolution). The German dataset is natively resolved at roughly 8 km, and the Austrian dataset at 6 km, whereas the Swiss data are given by single gauges interpolated via ordinary kriging. Hence, small-scale spatial features below such resolution cannot be evaluated by comparison to this observational product. Comparing the WRF-ERA-I and WRF-ERA5 (Figs. 4d and 4g) reveals a similar spatial pattern, where the higher-330 resolved WRF-ERA5 can especially add more topographically driven spatial variability in the Alps.

Uncertainties of the observational datasets
As the German, Austrian, and Swiss data are based on rain gauge measurements, these data are subject to the usual measurement inaccuracies leading to an underestimation of rainfall (Westra et al., 2014). For flat areas in Germany, this 335 deviation is estimated about 5 % during summer (Richter, 1995). In mountainous areas, this deviation is expected to increase due to higher wind speeds. According to Sevruk (1981) it amounts to 7 % for Switzerland during summer. In addition to these systematic underestimations, different rain gauge types yield varying rainfall measurements inducing additional uncertainty (Vuerich et al., 2009). This applies for the different meteorological networks in the study area (Kainz et al., 2007;Frei and Schär, 1998;Rauthe et al., 2013, Zolina et al., 2008. 340 Apart from these measurement errors, the gridded return level products suffer from a limited number of rain gauges (see Section 2.2), which also differ within their temporal coverage (Isotta et al., 2014 but also their spatial representativeness is important for an appropriate interpolation from point-wise measurements to gridded estimations (Ahrens, 2006). In mountainous areas, the spatial representativeness of a station is even more limited due to the heterogeneous topography. In addition, the station distribution with elevation is not representative as well. Due to easier 345 maintenance conditions more stations are located in valleys than on the tops of the mountains (Ahrens, 2006;Sevruk, 1997) leading to an underestimation for spatially interpolated rainfall in these areas (Isotta et al., 2014). Although the monitoring network density in the Alps makes this one of the best-monitored regions with complex topography, Isotta et al. (2014) estimate the "real" spatial resolution of the observations to be 10 -25 km. The regionalization of these point-wise measurements induces additional uncertainties. For the German dataset, the orography is employed as additional variable to interpolate the return 350 levels (Malitz andErtel, 2015 following Bartels, 1992). Due to the limited spatial representativeness of the rain gauges in the Alps, the weather model OKM at 1.5 km resolution (Lorenz and Skoda, 2001) was used to support the spatial interpolation of the Austrian return level data (BMLRT, 2018;Kainz et al., 2007). Thereby, not only the spatial distribution of return levels was supported by the weather model simulations, but also the intensity of the resulting design rainfall return levels. The return levels based on observations only are classified as "probably too low" due to the spatial distribution of the rain gauges, whereas 355 the weather model return levels are estimated to be "probably too high" (BMLRT, 2006;2018). Hence, the resulting design rainfall return level is a weighted averaging combination of the measured rainfall intensities and the intensities simulated by the weather model (BMLRT, 2006). This leads to the conclusion that the deviations of the 10-year return levels between the WRF setups (see Figs. 4f and 4i) and the observational data in the Austrian Alps may be caused by the limited spatial representativeness of the measurement stations. 360 For the Swiss data, ordinary kriging is applied to regionalize the available pointwise 10-year return levels. As different interpolation methods yield differing results (Hu et al., 2019), this processing step induces additional uncertainty.
In summary, it can be stated that the study area offers a good temporal and spatial coverage of measurements, especially compared to other regions in Europe (Poschlod et al., 2021), which are, however, subject to the uncertainties mentioned above.
Additionally, uncertainties due to the application of different EVT approaches contribute to the overall uncertainty, which are 365 discussed in Section 4.2.3 as they apply for both observations and climate model data.
Hence, the 10-year return levels (Fig. 4b, e, h) provide the best guess based on observations, but are still matter to substantial uncertainties, especially in the Alps.

Uncertainties of the RCM datasets
Generally, climate model simulations of historical conditions are subject to two major uncertainty factors (Hawkins and Sutton, 370 2009). Due to the chaotic nature of atmospheric processes the climate system is governed by internal variability. These nonlinear dynamics lead to the behaviour of the system that slightly differing starting conditions may result in significantly differing climate realizations (Deser et al., 2012). However, in this study the degree of internal variability is constrained as the RCMs are forced by reanalysis data. The large-scale atmospheric flows are imposed by the lateral boundary conditions, and, therefore this source of internal variability is not present in these three RCM setups (Christensen et al., 2001). Still, RCMs are governed by smaller-scale atmospheric variability. Alexandru et al. (2007) have shown that a 20-member RCM ensemble of the CRCM driven by the same lateral boundary conditions with slightly perturbed starting conditions leads to a reasonable spread of simulated precipitation. Even seasonal weather model forecast simulations, which are initialized every month, still show variability, especially for precipitation extremes (Kelder et al., 2020;Thompson et al., 2017). Hence, internal variability cannot be excluded as uncertainty source. 380 Since models can only represent a simplified image of reality, the structure of climate models leads to the second major uncertainty factor. Even though mainly physically-based, RCMs make use of parametrizations with a differing degree of complexity (Jerez et al., 2013). Model uncertainty includes all limitations of the climate model setup such as model-inherent simplifications, parameterizations and schemes, the lateral boundary conditions, nesting, nudging, spin-up times, and spatial resolution. 385 Multi-model experiments using the same boundary and starting conditions yield deviating simulations of the climate (Holtanová et al., 2019;Solman et al., 2013). Yet, also the same model applying differing physics options and parametrization schemes can lead to significant variability in the model results . Hence, climate model setups can be optimized by choosing different model options and schemes and comparing the simulations to observed climate conditions.
For the WRF-ERA-I setup, this has been carried out for the whole domain covering central Europe following Wagner et al. 390 (2018;Warscher et al., 2019). The CRCM5-ERA-I and the WRF-ERA5 setups are based on former applications of the respective climate model in different domains. Adapting the applied options to the study area could potentially improve the model performance (Collier and Mölg, 2020).
Furthermore, two different reanalysis datasets at 75 km and 30 km spatial resolution covering differing time periods are used to drive the RCMs. Stucki et al. (2020) argue that the difference of the driving conditions regarding the spatial resolution can 395 alter the simulation results, especially over complex terrain.
The different time windows (1980 -2009 and 1988 -2017) of the three model setups lead to different events being sampled.
Due to the small sample size, this variance can also lead to deviations in the resulting return levels.
The overall differences between the three RCM setups indicating model uncertainty are less apparent in the resulting 10-year return levels than in the evaluation of individual extreme events. For the close reproduction of extreme rainfall events, Stucki 400 et al. (2020) have shown that initialization of the RCM briefly before the respective events improves the performance at recreating rainfall intensities. Here, the RCMs are run in climate mode featuring transient 30-year simulations (CRCM5-ERA-I, WRF-ERA-I) or annual initialization (WRF-ERA5). It cannot be expected that single extreme events are closely reproduced due to the internal variability. Hence, such comparison is not appropriate to evaluate the skill of the model, but to visualize the differences due to internal variability and model uncertainties. Therefore, the daily rainfall intensities of the two extreme events 405 in May 1999 and August 2005 are given in the Supplementary Materials (Figs. S5 and S6). Furthermore, such a comparison makes it clear that the compared setups are climate model setups and not weather model setups, despite the high spatial resolution (Kelder et al., 2020). While the simulation of individual extremes can differ greatly, the 10-year return levels as a climatic indicator for extreme precipitation show a high degree of agreement (see Fig. 4). This suggests that despite all the https://doi.org/10.5194/nhess-2021-66 Preprint. Discussion started: 29 March 2021 c Author(s) 2021. CC BY 4.0 License. simplifications and differences leading to model uncertainty, the models can reproduce the climatic character of extreme 410 precipitation in the study area.

Uncertainties due to EVT
The concept of classical EVT (see Sections 3.1 and 3.2) holds under rather restrictive assumptions (Papalexiou et al., 2013) and each step featuring the choice of the distribution and fitting the distribution parameters induces additional uncertainty 415 (Miniussi and Marani, 2020). For the GEV approach, Eq. (1) holds for a large number of samples n (ideally the sample size n → ∞). In practice, the limited available time series make it very difficult to determine whether the distribution of extreme samples is close to its asymptotic GEV limit (Cook and Harris, 2004;Koutsoyiannis, 2004;Miniussi and Marani, 2020).
The POT approach partly overcomes the limitation of very low sample sizes by using the threshold u to define extreme events.
However, the choice of this threshold is crucial as the assumptions of a Poisson arrival of exceedances y as well as the GP 420 distribution of these exceedances only hold for a threshold u ensuring both the sampled events to be "extreme" and a large number of samples n (Picklands, 1975, Miniussi andMarani, 2020).
Furthermore, uncertainty is induced by the parameter optimization of the respective EVD to adapt the theoretical EVD to the extreme precipitation samples, even though appropriate methods are chosen (see Section 3.3; Muller et al., 2009). Assessing the goodness of fit by quality measures or statistical testing (e.g. the Anderson-Darling test) can lower the uncertainty due to 425 the aforementioned assumptions. However, the goodness of fit can only assess the quality of the fit between the theoretical EVD and the empirical distribution of the samples. It cannot evaluate if the samples are a "good representation" of possible extreme rainfall events within the boundaries of internal variability of the climate system.
Uncertainty is therefore apparent as the different sampling approaches, EVDs, and fitting methods may lead to differing estimations of rainfall return levels (Lazoglou and Anagnostopoulou, 2017). For the GEV-LMOM (Fig. 4g) and ) the mean absolute deviation (MAD) between the 10-year return levels based on both approaches amounts to a spatial average of 1.7 %. Hence, despite different sampling, distributions, and fitting methods, the results are almost equivalent.
However, both classical approaches still suffer from drawbacks. Papalexiou and Koutsoyiannis (2013) as well as Serinaldi and Kilsby (2014) argue that producing stable fits of the shape parameter of the GEV and GP distributions needs larger sample sizes than typically available. They have shown that the estimation of the shape parameter of the GEV and GP distributions is 435 dependent on the sample size, whereby it is also influenced by the geographical location. Papalexiou and Koutsoyiannis (2013) propose to restrict the shape parameter of the GEV to a window described by a normal distribution around the mean value of 0.114 or to apply a fixed shape parameter of ξ = 0.114. Indeed, the shape parameter of GEV-LMOM shows a heterogeneous spatial distribution with small patches of positive and negative values for all three RCM setups (Fig. 3c, f, i). This spatial distribution can be interpreted as "noise" due to too low sample sizes, where the 30 annual maximum precipitation events do 440 not fully represent the range of possible extreme precipitation within the boundaries of internal climate variability. The distribution of the shape parameter ξ based on all three RCM setups is centred around a value close to 0.114 (see Fig. S1c). However, Papalexiou and Koutsoyiannis (2013) suggest that 99 % of the distribution should be between 0 and 0.225, whereas the distribution of all three RCM setups reveals a larger spread. When restricting the shape parameter to ξ = 0.114, however, the 10-year return levels only differ very slightly from the GEV-LMOM resulting in an average MAD of 3.4 % (see Fig. 6a). 445 As ξ only defines the tail of the distribution it is more relevant for longer return periods.
In addition to unstable parameters fits, the sampling strategies of both classical EVT approaches only use a small fraction of available data. In this study, only 0.3 % (GEV) and 0.8 % (GP) of the daily precipitation sums from the RCMs are sampled.
Especially with respect to short available observation time series, but also with respect to the extensive computational power and related costs of such high-resolution climate models, this sampling is a waste of valuable information (Miniussi and 450 Marani, 2020). The sampling of the MEV approach overcomes this limitation and uses the information of rainfall intensities of all wet days as well as the frequency of these days. This is found to result in more stable fits (Zorzetto et al., 2016). Zorzetto et al. (2016) concluded that the MEV outperforms the classical GEV approach due to the more stable parameter fits if the return period exceeds the length of the available samples. However, this is not the case for the 10-year return levels based on 30 years of RCM data in this study. As also found by Zorzetto et al. (2016), the performance of the MEV-PWM at reproducing 455 the return levels for such a combination of return period and available sample size is slightly worse than of the GEV-LMOM (see Figs. 4 and 6).

Conclusion
Various combinations of high-resolution regional climate models driven by reanalysis data and state-of-the-art EVT approaches have been explored to reproduce 10-year return levels or daily rainfall. The increase in spatial resolution comparing 460 the 12 km CRCM5-ERA-I and the 5 km WRF-ERA-I setups to observations reveals added value in terms of spatial correlation and bias. The further increase in spatial resolution featuring the 1.5 km WRF-ERA5, accompanied by an explicit simulation of convective processes, cannot greatly improve the performance metrics. This is possibly since the observational product is resolved natively between 6 km and 8 km. Hence finer-scale spatial features cannot be evaluated by such comparison.
The resulting 10-year return levels based on the different applied EVT approaches show good agreement to each other and to 465 the observational product. This suggests that the methodological uncertainty for return levels of moderate extremes is relatively low. However, if return periods outside the sample size are to be extrapolated, the estimation uncertainty of the shape parameter governing the tail of the GEV and GP distributions becomes more important. Two approaches are studied to overcome this uncertainty. The GEV with fixed shape parameter shows 10-year return levels, which are almost equivalent to the threeparameter GEV optimized via L-moments. The rather new EVT approach by Marani and Ignaccolo (2015) featuring the MEV 470 distributions leads to a slight underestimation of 10-year return levels. Nevertheless, this methodology shows great potential for extrapolation of longer return periods due to the larger sampling and, therefore, increased stability of the fits (Zorzetto et al., 2016).
The question remains to be answered as to what the findings of this study can contribute to in practice. First, in regions with a low density of rain gauges, such RCM setups can contribute to a homogeneous spatial estimation of 475 return levels. Even in regions, where the rain gauges cannot represent the spatial heterogeneity, RCMs can be applied to support observational products. This is already being done in Austria using a convective-permitting weather model, and the results of this study reinforce such use of regional climate models. It is also conceivable to use the high-resolution spatial patterns of CPMs as an auxiliary variable for the interpolation of the return levels based on measured data (e.g. via kriging with external drift; Haberlandt, 2007 or spatial GEV models; Davison et al., 2012). A visualization of a simple combination approach for 480 such a subsequent enhancement is provided in Figure 7. Therefore, the differences at each grid cell between the observational product and the WRF-ERA5 GEV-LMOM are smoothed with a Gaussian filter and again added to the climate model return levels. However, this rather naive approach only serves to provide a visual impression of a possible enhancement. Second, large ensembles of RCMs can be set up to increase the sample size within the boundaries of the internal climate 490 variability. On the one hand, increased sample sizes lower the uncertainty related to EVT, on the other hand large ensembles enable to quantify uncertainties due to internal variability (Poschlod et al., 2021). Third, RCMs driven by global climate models following different emission scenarios allow to simulate climate change induced alterations of return levels (Ban et al., 2020). Even though an increase in extreme precipitation intensities is known for decades (Trenberth et al., 2003), there is a lack of operational implementation and adaptation. In 2004, a climate change surcharge of 495 a flat +15 % on top of the 100-year flood return level was introduced in Bavaria for the planning of flood protection facilities (LfU, 2021). However, such an adaptation for extreme rainfall is missing so far. Despite all model-specific uncertainties, the evaluation of RCMs in this study proved that they are suitable to reproduce daily extreme precipitation intensities over complex terrain.

Data availability
The observational rainfall return level data are available at the German weather service (DWD, 2020), the Federal Ministry of Agriculture, Regions and Tourism Austria (BMLRT, 2020), and MeteoSwiss (MeteoSwiss, 2021).
The daily precipitation of the WRF-ERA-I and WRF-ERA5 are publicly available at Warscher (2019) and Collier (2020), which is cordially acknowledged. The CRCM5-ERA-I data are available on request from the author. 505

Competing interests
The author declares that he has no conflict of interest.

Financial support 510
The author acknowledges the support within the project StarTrEx (Starkniederschlag und Trockenheitsextreme; Heavy precipitation and drought extremes; Az. 81-0270-82467/2019) by the Bavarian Environment Agency.