Characteristics of precipitation extremes over the Nordic region: added value of convection-permitting modeling

It is well established that using km-scale grid resolution for simulations of weather systems in weather and climate models enhances their realism. This study explores heavy and extreme precipitation characteristics over the Nordic region generated by the regional climate model, HARMONIE-Climate (HCLIM). Two model setups of HCLIM are used: ERA-Interim driven HCLIM12 spanning over Europe at 12 km grid spacing with a convection parameterization scheme and HCLIM3 spanning over the Nordic region with 3 km grid spacing and explicitly resolved deep convection. The HCLIM simulations are evaluated against a unique and comprehensive set of gridded and in situ observation datasets for the warm season from April to September regarding their ability to reproduce sub-daily and daily heavy precipitation statistics across the Nordic region. Both model setups are able to capture the daily heavy precipitation characteristics in the analyzed region. At sub-daily scale, HCLIM3 clearly improves the statistics of occurrence of the most intense heavy precipitation events and the amplitude and timing of the diurnal cycle of these events compared to its forcing HCLIM12. Extreme value analysis shows that HCLIM3 provides added value in capturing sub-daily return levels compared to HCLIM12, which fails to produce the most extreme events. The results indicate clear benefits of the convection-permitting model in simulating heavy and extreme precipitation in the present-day climate and therefore offer a motivating way forward to investigate climate change impacts in the region.

enhanced by orography (Førland et al., 1998). For instance, an organized convective system occurred on 31 August 2014 in the Malmö basin in southern Sweden generating very intense rainfall and leading to severe flooding (Olsson et al., 2017).
The accumulated rainfall reached ~150 mm within 6 hours. Although such extreme events are rare, previous studies have shown that precipitation extremes have become more frequent globally and in Europe over recent decades (van den Besselaar et al., 2013;Westra et al., 2013;Fowler et al., 2021). There is also evidence that the past trends in annual maximum daily precipitation are predominantly positive over the Nordic-Baltic region (Du et al., 2019;Dyrrdal et al., 2021).
Regional climate models (RCMs) project a future intensification of rainfall both on sub-daily (e.g., Lenderink and van Meijgaard, 2010;Kendon et al., 2014;Westra et al., 2014;Ban et al., 2015) and daily scales (e.g., Christensen and Christensen, 2003;Frei et al., 2006;Boberg et al., 2010;Ban et al., 2015) over mid to high latitudes of the Northern Hemisphere (Lucas-Picher et al., 2021 and references therein). However, similar to global climate models (GCMs), RCMs usually include sub-grid scale parameterization of convective processes including deep convection. One limitation from having to parameterize convection is the inaccuracy of the models to correctly represent, for instance, hourly intensities of extreme precipitation (Hanel and Buishand, 2010;Gregersen et al., 2013;Berg et al., 2019) and the diurnal cycle of rainfall intensity (Trenberth et al., 2003;Brockhaus et al., 2008;Prein et al., 2015;Beranová et al., 2018;Pichelli et al., 2021). The skill of RCMs to adequately represent short-duration precipitation extremes in the present and future climate is therefore of concern.
Due to increased computer capacity, running convection-permitting regional climate models (CPRCMs) with explicit deep convection and a high grid resolution (typically < 4 km) has recently become affordable on a climatic scale (see e.g., Coppola et al., 2020;Lucas-Picher et al., 2021;Pichelli et al., 2021). Since short-duration extreme events are often associated with smaller-scale spatial structures, there is a strong indication that these events are better represented using an increased model resolution. For instance, Fosser et al. (2015), Lind et al. (2016), Kendon et al. (2017), Leutwyler et al. (2017), Berthou et al. (2020), Fumière et al. (2020), Ban et al. (2021), and Caillaud et al. (2021) have found an added value of models with explicit deep convection compared to their coarser RCM counterparts with parameterized convection, especially in the ability of the CPRCMs to represent sub-daily rainfall characteristics over Europe. CPRCMs have been found to improve the diurnal cycle, frequency, and intensity of precipitation also over China, Africa, and the United States (Lucas-Picher et al., 2021 and references therein).
In this study, the main goal is to evaluate the performance of a regional climate model, cycle 38 of HARMONIE-Climate (HCLIM38 hereafter), with hourly output frequency in its ability to reproduce sub-daily and daily observed heavy and extreme precipitation statistics across the Nordic region for the summer half-year (April to September).
We focus on the statistics of intensities and frequencies of heavy precipitation events as well as on the ability of HCLIM38 to reproduce return levels that are commonly used to investigate short duration extremes from an urban planning perspective.
We utilize a 21-year-long simulation from a convection-permitting model setup with HCLIM38 at a grid resolution of 3 km spanning over the Nordic region. Lateral boundary data were provided by an intermediate model setup at 12 km grid resolution driven by reanalysis data. With a domain covering the Nordic region (see Fig. 1), the high grid resolution combined with the 21-year-long simulation period allows for a robust assessment of the added value in simulating precipitation extremes over the Nordic countries. The simulations have been evaluated and presented in previous studies by Lind et al. (2020) and Olsson et al. (2021a). However, Lind et al. (2020) focused mainly on general model evaluation, while Olsson et al. (2021a) evaluated heavy and extreme precipitation events only over the southern part of Sweden. Both studies found an added value of the convection-permitting HCLIM38 model setup in simulating the intensities and frequencies of mean and heavy precipitation events at sub-daily time scales. Previously, Lind et al. (2016) showed an added value of the high-resolution HCLIM model with the previous cycle 37 in representing precipitation extremes over the Alps. The current study deepens the understanding of the benefits of convection-permitting climate modeling over Northern Europe by extending the analysis by Olsson et al. (2021a) over the whole model domain and by studying extreme precipitation events with generalized extreme value (GEV) theory.

Model and experiment set-up
This study utilized the HCLIM38 regional climate model that is based on the ALADIN-HIRLAM numerical weather prediction (NWP) system (Lindstedt et al., 2015;Bengtsson et al., 2017;Termonia et al., 2018). The model is presented only briefly here because it is thoroughly described in Belušić et al. (2020). The HCLIM38 modeling system contains different model configurations that are each suitable for different spatial scales. We employed two model setups, HCLIM38-AROME at 3 km horizontal grid resolution and HCLIM38-ALADIN at 12 km horizontal grid resolution. HCLIM38-AROME is used with non-hydrostatic dynamics as it is designed for convection-permitting resolutions (< 4 km) (Seity et al., 2011;Bengtsson et al., 2017). The recommended option in HCLIM38 for grid resolutions over 10 km is HCLIM38-ALADIN that is used with hydrostatic dynamics (Termonia et al., 2018). HCLIM38-ALADIN originates from the limited-area version of the global model ARPEGE. From now on, HCLIM38-AROME at 3 km and HCLIM38-ALADIN at 12 km will be referred to as HCLIM3 and HCLIM12, respectively.
The experiment was performed using double nesting. The HCLIM12 run spans over a major part of Europe and eastern North Atlantic with 313 x 349 horizontal grid points (Fig. 1), 65 vertical levels, and a time step of 300 s. The global ERA-Interim reanalysis with a grid resolution of ~80 km (Dee et al., 2011) provided the boundary data for HCLIM12 every six hours. HCLIM12 provided boundary data every three hours for HCLIM3 that was run over the Nordic domain with 637 x 853 horizontal grid points, 65 vertical levels, and a time step of 75 s. The modeled periods covered 1997-2018, but the year 1997 was treated as a spin-up year and is thus not included in the analysis. Lind et al. (2020) provide more details of the experiments.
In the analysis, we focus mainly on the HCLIM3 domain. To account for the boundary effects, we removed approximately 100 km (33 grid points including the relaxation zone of 8 grid points) from each side of the HCLIM3 boundaries, which resulted in a domain that was analyzed in more detail (see the dashed outline in Fig. 1). Also, the HCLIM12 boundaries are sufficiently far away from the analyzed sub-domain (more than 500 km) (see e.g. Denis et al., 2002;Matte et al., 2017).

Observations
The simulated daily precipitation was compared with gridded observational datasets, E-OBS and Nordic Gridded Climate Dataset (NGCD), as well as with high-resolution national gridded datasets (see Table 1 for references) that were also used to analyze the hourly precipitation. In addition, the hourly precipitation was compared with the ERA5 reanalysis dataset and in situ rain gauge data. We evaluated only land points as most of the observational datasets are based on in situ gauge measurements over land. The used observational datasets and their references are summarized in Table 1 and described in more detail below.
The E-OBS dataset is based on the station series from the European Climate Assessment and Dataset (ECA&D) station network. The dataset spans from 1950 until the present and covers a pan-European domain with a grid spacing of 0.1° x 0.1° (~12 km). We utilized version 20.0e that consists of the ensemble means of 100-member realizations which can be taken as grid box averages (Cornes et al., 2020).
The NGCD dataset is a high-resolution dataset of gridded daily precipitation covering Finland, Sweden, and Norway (Tveito and Lussana, 2018). The dataset covers a period from 1971 until the present and has a grid spacing of 1 km x 1 km.
NGCD extends the national dataset of Norway, seNorge, that has been developed over the last 20 years (e.g. Tveito et al., 2005;Lussana et al., 2018a;Lussana et al., 2018b). The station data from Norway are extracted from the climate database of the Norwegian Meteorological Institute while the Finnish and Swedish station data are extracted from ECA&D. We employed the NGCD version 19.03 and type 2 data that utilize the Bayesian interpolation method.
ERA5 is a reanalysis product based on a combination of data assimilation and numerical models (Hersbach et al., 2020). The dataset provides hourly precipitation at a horizontal grid spacing of approximately 30 km. Because the dataset was produced with a numerical model, it is subject to deficiencies similar to those of other weather and climate models. The ERA5 forecast product has two separate initialization times, 06 UTC and 18 UTC. After initialization, the forecasts are run for 18 hours. To account for and reduce the effects of model spin-up in precipitation, we utilized the 7-18 h forecast hours from the 06 UTC analysis (representing 13-00 UTC) and the 7-18 h forecast hours from the 18 UTC analysis (representing 01-12 UTC). A similar approach was used e.g. in Crossett et al. (2020).
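The selection of ERA5 forecast hours described above can be illustrated with a minimal sketch. The function name `era5_valid_hours` is hypothetical, and the sketch assumes that each forecast hour is labeled by the UTC hour at which it is valid; it only shows that the two 7-18 h windows together cover every hour of the day exactly once.

```python
def era5_valid_hours(init_hour_utc, lead_hours=range(7, 19)):
    """Map forecast lead hours (7..18) to the UTC hour of day at which they are valid."""
    return [(init_hour_utc + lead) % 24 for lead in lead_hours]

# 06 UTC initialization: lead hours 7-18 are valid at 13, 14, ..., 23, 00 UTC
from_06 = era5_valid_hours(6)
# 18 UTC initialization: lead hours 7-18 are valid at 01, 02, ..., 12 UTC
from_18 = era5_valid_hours(18)

# Together, the two windows give one value for every hour of the day
full_day = sorted(from_06 + from_18)
```

Discarding the first six forecast hours of each run is what removes the spin-up period while still leaving a complete 24-hour series.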
We utilized three national high-resolution gridded datasets, namely seNorge2 (seNorge hereafter), Klimagrid Danmark (Klimagrid hereafter), and HIPRAD v2 (HIPRAD hereafter). The seNorge dataset provides hourly precipitation starting from 2010 with a grid spacing of 1 km over Norway (Lussana et al., 2018a). The dataset is based on in situ measurements that are interpolated using optimal interpolation and successive-correction schemes. Also, geographical coordinates and elevation are used as complementary information. The performance of this dataset is comparable to or even better than E-OBS, because of the higher effective resolution in seNorge (Lussana et al., 2018a). Despite this, seNorge underestimates precipitation over the mountainous region that has sparse data coverage. Klimagrid is a gauge-based gridded dataset with a grid spacing of 1 km over Denmark. The data consist of hourly precipitation for 2011-2019. At each time, an interpolation to the 1 km grid includes station information in all directions, weighted by distance; distance to the coastline is treated explicitly in the interpolation (Wang and Scharling, 2010). HIPRAD (Berg et al., 2016) is a gridded dataset covering Sweden with hourly resolution and a 2 km grid spacing. This dataset is based on radar data corrected by daily scaling factors using a 31-day running window and the PTHBV gridded data set for Sweden (Johansson and Chen, 2003). HIPRAD is available for 2000-2014, but due to gaps in the data, we utilized only the period of 2005-2014. In addition, grid points with suspected clutter effects were discarded from the analysis. These points were identified by comparing the distribution of daily intensity values of HIPRAD and its reference data PTHBV, and matches based on Perkins skill score (Perkins and Pitman, 2009) of the two probability density functions below 0.8 were rejected. In addition, we investigated the results from 102 in situ gauges over Sweden. 
However, the results were comparable to HIPRAD and are therefore not discussed in this paper.
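The clutter screening of HIPRAD grid points relies on the Perkins skill score, which measures the overlap of two empirical probability density functions. The sketch below is an illustration only: the function name, the 1 mm bin width, and the sample values are assumptions and are not taken from the study. It assumes non-negative daily intensities.

```python
def perkins_skill_score(sample_a, sample_b, bin_width=1.0):
    """Overlap of two binned relative-frequency distributions (1 = identical, 0 = disjoint)."""
    upper = max(max(sample_a), max(sample_b))
    n_bins = int(upper // bin_width) + 1

    def relative_frequencies(sample):
        counts = [0] * n_bins
        for value in sample:
            # Assumes value >= 0; the largest value falls into the last bin
            counts[min(int(value // bin_width), n_bins - 1)] += 1
        return [c / len(sample) for c in counts]

    freq_a = relative_frequencies(sample_a)
    freq_b = relative_frequencies(sample_b)
    # Sum the smaller of the two frequencies in every bin
    return sum(min(a, b) for a, b in zip(freq_a, freq_b))

# Hypothetical daily intensities (mm/day) at one grid point
radar_based = [0.0, 0.5, 2.0, 7.5]
reference = [0.2, 1.5, 2.5, 30.0]
score = perkins_skill_score(radar_based, reference)
keep_grid_point = score >= 0.8  # scores below 0.8 were rejected in the study
```

The score depends on the chosen binning, so a threshold such as 0.8 is only meaningful for a fixed bin width.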
We also utilized the daily and hourly annual maxima (AM) dataset that is extracted from in situ observations over the Nordic region. The daily data are available for Denmark, Finland, Norway, and Sweden for the years 1998-2018, while hourly maxima are available for Denmark, Norway, and Sweden covering the same period. Because this dataset includes only annual maxima, it could be used for the comparison of modeled return levels. The observations are extracted for each year utilizing all months, although the criteria for extraction varied between each country (Dyrrdal et al., 2021). For instance, the Swedish data was retrieved only if at maximum two days were missing from June to October whereas the Norwegian data was extracted using a limit of 30 missing days per year. The Finnish and Danish data were extracted without any limits, but the plausibility of low values was checked. Annual maxima were extracted using all months also from the model and other observational datasets instead of limiting the analysis to April-September (see Sect. 3.1). The locations of the in situ stations can be found in Fig. S1a.
It is important to keep in mind that gridded and in situ observations of precipitation are prone to uncertainties. These uncertainties originate, for instance, from instrument errors, post-processing (interpolation methods, quality checks), and different spatial scales (e.g. comparison of point measurements with modeled areal averages) (Eggert et al., 2015) as well as a high spatio-temporal climatic variability of precipitation (e.g. Prein and Gobiet, 2017;Kotlarski et al., 2019). Furthermore, precipitation undercatch can be substantial for snowfall or windy conditions (e.g. Adam and Lettenmeier, 2003;Rubel and Hantel, 2001). Based on Rubel and Hantel (2001), the undercatch in the Baltic Sea area might be around 20-50 % during winter and 2-5 % during summer. Uncertainty is also introduced during the interpolation process of point measurements onto a regular grid. For instance, sparse data coverage and complex topography can lead to a large underestimation of precipitation (e.g. Prein and Gobiet, 2017). Moreover, interpolation can impose a smoothing effect on the spatial variability and lead to an underestimation of extremes (Hofstra et al., 2010). The national datasets, excluding Klimagrid, used in this study include mostly fewer stations compared to their corresponding daily records. They also cover shorter periods and therefore include more uncertainties compared to the daily products. It is also worth noting that most of the stations located in the Scandinavian mountains are established below 1000 m above sea level (m a.s.l.) although the terrain height can reach 2000 m a.s.l. or more. This leads to uncertainties in precipitation values that are measured over mountainous areas and mountain ridges.
The NGCD dataset has been shown to underestimate precipitation exceeding 1, 10, and 25 mm/day by 5, 15, and 25 % on average, respectively, due to spatial smoothing (Tveito and Lussana, 2018). The correction factors for seNorge precipitation data have been estimated to vary between 0.7-3 depending on the region (Lussana et al., 2018a). The mean correction factor is 1.25, which means that precipitation is mainly underestimated by 25 %. The estimates of the effect of spatial smoothing were not available for Klimagrid, but precipitation from the gauge data in Denmark is underestimated by 1-2 % due to undercatch which is lower for higher intensities (Vejen et al., 2021). There are no uncertainty estimates for HIPRAD v2, but the newest version of HIPRAD (v3) is generally overestimating precipitation compared to gauge data (Olsson et al., 2021b).
It needs to be noted that HIPRAD includes an undercatch correction that was not applied to the in situ stations. Because the uncertainty estimates vary between the datasets and different intensities, we do not consider one acceptable uncertainty range (see e.g. Ban et al., 2021). Nevertheless, these uncertainties need to be kept in mind when analyzing the results.
To explore the effect of interpolation on the results, we additionally selected only the grid cells that included at least one weather station for further assessment. This so-called geographic sampling was performed for the seNorge and Klimagrid datasets. If a climate model has a horizontal resolution of tens of kilometers or below, there are likely grid cells where the model output is compared with an interpolated value instead of an actual measurement. This affects the evaluation of extreme precipitation as noted by Risser and Wehner (2020). However, Klimagrid is constructed to use exact station values in the grid points if there is not more than one station. In this case, the values for grid points containing one station are not areal averages, but rather comparable to point measurements. Figure S1b shows the locations of the stations that were used for geographic sampling and that were available during the entire period in question, 2010-2018 for seNorge and 2011-2018 for Klimagrid.
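Geographic sampling amounts to keeping only those grid cells that contain at least one station. A minimal sketch follows, under the assumption of a regular grid in projected coordinates (km) with a known lower-left corner; the function name, the grid geometry, and the station coordinates are all hypothetical and do not describe the seNorge or Klimagrid grids.

```python
import math

def stations_per_cell(stations_xy, x0, y0, spacing, n_cols, n_rows):
    """Count stations in each cell of a regular grid; returns {(row, col): count}."""
    counts = {}
    for x, y in stations_xy:
        col = math.floor((x - x0) / spacing)
        row = math.floor((y - y0) / spacing)
        # Ignore stations that fall outside the grid
        if 0 <= col < n_cols and 0 <= row < n_rows:
            counts[(row, col)] = counts.get((row, col), 0) + 1
    return counts

# Hypothetical station positions (km) on a 3 x 3 grid with 1 km spacing
stations = [(0.5, 0.5), (0.7, 0.2), (2.5, 1.5), (10.0, 1.0)]
counts = stations_per_cell(stations, x0=0.0, y0=0.0, spacing=1.0, n_cols=3, n_rows=3)

sampled_cells = set(counts)  # cells with at least one station
single_station_cells = {cell for cell, n in counts.items() if n == 1}
```

Keeping the per-cell count, rather than a simple mask, makes it possible to separate cells with exactly one station, which is the case where Klimagrid values are comparable to point measurements.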
We decided not to use so-called Areal Reduction Factors (ARFs) for the in situ data. ARFs are generally used to take into account temporal and spatial differences in the observations and model. In our study, the area of one grid cell is 9 km² in HCLIM3 and 144 km² in HCLIM12. The adjustment needed for HCLIM3 can be considered small whereas the adjustment could be over 10 % for HCLIM12 (Pavlovic et al., 2016). However, the literature proposes several different ways to adjust the values, which makes the use of adjustment factors uncertain. Therefore, we prefer not to adjust but rather assess the model outputs directly and comment in the text when needed.

Evaluation metrics of heavy precipitation
We analyzed daily and hourly heavy precipitation events with intensity and time-based metrics. These metrics included the average of precipitation values above the 95th and 99.9th percentiles of all days or hours (hereafter pXavg) following Berthou et al. (2020) and frequencies of heavy precipitation events of more than 10 and 20 mm/day or 5 mm/hour (hereafter R10mm, R20mm, and R5mm, respectively). R10mm and R20mm represent heavy and very heavy precipitation days, respectively, while R5mm represents heavy precipitation hours. No threshold was used for the percentile computations as recommended by Schär et al (2016). For the hourly scale, we also computed extra metrics that included the frequency distributions of precipitation intensity with a drizzle threshold of 0.1 mm/hour as well as the diurnal cycle of the 99.9th percentile events.
The percentiles were determined separately for each hour of the day. In this study, the term "heavy precipitation" is considered to represent the highest percentiles whereas by "extreme precipitation" we mean either annual maximum precipitation or return levels obtained with extreme value analysis (see Sect. 3.2). The seasonality of extreme precipitation events was investigated by sampling the annual maxima of hourly and daily precipitation events for each year separately for a period from April to September and computing the monthly occurrences.
A drizzle threshold of 0.1 mm/hour was applied to the data and similarly to Berg et al. (2019), a 24 hour dry period was used between events so that the events can be considered independent.
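The intensity and frequency metrics can be sketched for a single grid cell as follows. This is an illustration under stated assumptions: the function names are hypothetical, the percentile uses linear interpolation between order statistics (the study does not state which estimator was used), and the frequency is returned as a simple count of exceedances.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def pxavg(values, q):
    """Mean of all values above the q-th percentile of ALL time steps (dry ones included)."""
    threshold = percentile(values, q)
    tail = [v for v in values if v > threshold]
    return sum(tail) / len(tail) if tail else float("nan")

def count_exceedances(values, threshold):
    """Number of time steps with precipitation above the threshold (e.g. R10mm, R5mm)."""
    return sum(1 for v in values if v > threshold)

# Hypothetical daily series (mm/day): 0, 1, ..., 99
daily = [float(v) for v in range(100)]
p95avg = pxavg(daily, 95)               # mean of 95, 96, 97, 98, 99
r10mm = count_exceedances(daily, 10.0)  # days with more than 10 mm
```

Computing the percentile over all time steps, without a wet-day threshold, keeps the percentile value independent of differences in wet-day frequency between datasets.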
All metrics were computed for a period from the 1st of April to the 30th of September over the overlapping years between the model and observations. The results were computed for each grid cell separately and boxplots were used to show the spatial variability of the results. The results are shown mainly in the native grid. Remapping was performed prior to the analysis to the coarsest grid with a first-order conservative remapping method. However, remapped results did not change the conclusions, and are therefore not discussed in more detail. If not stated otherwise, the differences between the HCLIM model (mod) and observations (obs) for a metric (M) were computed as relative biases (%):

$$\mathrm{bias} = \frac{M_{\mathrm{mod}} - M_{\mathrm{obs}}}{M_{\mathrm{obs}}} \times 100$$

Extreme value analysis
To gain insight into the extreme precipitation events, we used extreme value analysis (Coles, 2001) at hourly and daily timescales. The analyzed period was 21 years (1998-2018), which was assumed to be stationary. We tested this assumption using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (Kwiatkowski et al., 1992). Based on this test, more than 90 % of the in situ stations and grid cells in the HCLIM model and gridded observations indicated stationary annual maximum values. The only exception was hourly Norwegian in situ data of which 80 % were stationary.
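The KPSS test statistic for stationarity around a constant level can be sketched for a single series of annual maxima as follows. This is not the implementation used in the study: the function name is hypothetical, and the lag truncation rule and the 5 % significance level are assumptions, since neither is stated in the text. The critical value 0.463 is the 5 % value tabulated by Kwiatkowski et al. (1992) for the level-stationary case.

```python
def kpss_level_stat(series, lags=None):
    """KPSS statistic for the null hypothesis of stationarity around a constant level."""
    n = len(series)
    mean = sum(series) / n
    resid = [v - mean for v in series]

    # Partial sums of the residuals
    partial, running = [], 0.0
    for e in resid:
        running += e
        partial.append(running)
    eta = sum(s * s for s in partial) / n**2

    # Long-run variance with Bartlett weights; lag rule is an assumed common choice
    if lags is None:
        lags = int(4 * (n / 100.0) ** 0.25)
    long_run_var = sum(e * e for e in resid) / n
    for s in range(1, lags + 1):
        weight = 1.0 - s / (lags + 1.0)
        autocov = sum(resid[t] * resid[t - s] for t in range(s, n)) / n
        long_run_var += 2.0 * weight * autocov
    return eta / long_run_var

CRITICAL_5PCT = 0.463  # Kwiatkowski et al. (1992), level stationarity

# Hypothetical 21-year series: one with a steady trend, one fluctuating around a level
trending = [float(t) for t in range(21)]
fluctuating = [10.0 if t % 2 == 0 else 12.0 for t in range(21)]

trend_rejected = kpss_level_stat(trending) > CRITICAL_5PCT
level_rejected = kpss_level_stat(fluctuating) > CRITICAL_5PCT
```

Because stationarity is the null hypothesis, a statistic below the critical value means only that stationarity is not rejected, which is a weak statement for series as short as 21 values.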
The generalized extreme value (GEV) distribution was fitted to the annual maximum precipitation data to estimate return values. The cumulative distribution function of the GEV for a random variable (x) can be written as:

$$G(x) = \exp\left\{-\left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}, \qquad 1 + \xi\,\frac{x-\mu}{\sigma} > 0$$

This function is defined by three parameters (location μ, scale σ, and shape ξ), which were estimated with a modified maximum-likelihood method that utilizes a Bayesian prior distribution for the shape parameter (Martins and Stedinger, 2000;Frei et al., 2006). Several other studies have utilized this method (Frei et al., 2006;Rajczak et al., 2013;Rajczak and Schär, 2017;Ban et al., 2020) because it prevents the estimation of unrealistic shape parameters in case the sample size is small (Martins and Stedinger, 2000). The L-moments method was also tested for parameter estimation, but it yielded very similar results to the modified maximum-likelihood method. When the parameters are known, return values for different return periods can be estimated from the quantile function:

$$x_T = \mu - \frac{\sigma}{\xi}\left[1 - \left\{-\ln\left(1 - \frac{1}{T}\right)\right\}^{-\xi}\right]$$

We computed return values for hourly (x1h) and daily (x1d) accumulated precipitation for return periods of T years (5, 10, and 20 years). We use abbreviations x1h.T and x1d.T to define the return values of 1 hour and 1 day precipitation, respectively, for a return period of T. The goodness-of-fit was checked with the Kolmogorov-Smirnov (KS) test that compares the empirical distribution function with a specified distribution function, in this case, the GEV distribution. Based on the KS test, the GEV fit was adequate for more than 99.9 % of the grid cells or stations in the models and observations both at daily and hourly scales.
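Once the three GEV parameters are available, return levels follow directly from the standard GEV quantile function (Coles, 2001). The sketch below assumes the convention in which a positive shape parameter ξ gives a heavy upper tail; the parameter values are hypothetical and the parameter estimation itself (the modified maximum-likelihood fit) is not shown.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV cumulative distribution function (positive xi = heavy upper tail)."""
    if abs(xi) < 1e-12:
        # Gumbel limit for xi -> 0
        return math.exp(-math.exp(-(x - mu) / sigma))
    z = 1.0 + xi * (x - mu) / sigma
    if z <= 0.0:
        # Outside the support: below the lower bound (xi > 0) or above the upper bound (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-z ** (-1.0 / xi))

def gev_return_level(return_period, mu, sigma, xi):
    """Value exceeded on average once every `return_period` years."""
    y = -math.log(1.0 - 1.0 / return_period)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu - sigma / xi * (1.0 - y ** (-xi))

# Hypothetical parameters for hourly annual maxima (mm/hour)
mu, sigma, xi = 10.0, 3.0, 0.1
return_levels = {T: gev_return_level(T, mu, sigma, xi) for T in (5, 10, 20)}
```

A useful consistency check is that evaluating the distribution function at the T-year return level gives 1 - 1/T. Note that some libraries define the shape parameter with the opposite sign, so the convention should be checked before reusing fitted parameters.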

Evaluation of heavy daily precipitation
Both HCLIM12 and HCLIM3 underestimate p99.9avg by 10-30 % over areas where p99.9avg values are the highest in E-OBS (coastal Norway, south and north of Sweden, the central part of Finland, and Germany), while this metric is overestimated by 10-40 % over other parts of Finland, Sweden, and Norway (Fig. 2). Overall, the average relative bias over the domain is positive: 23 % for HCLIM12 and even greater in HCLIM3 with 44 %. The results are in line with previous studies in which HCLIM38 has been shown to overestimate mean daily precipitation at 12 km resolution by 25-28 % during the summer period over the Nordic region (Toivonen et al., 2019;Lind et al., 2020). In addition, Lind et al. (2020) showed that HCLIM3 improved the representation of the mean daily precipitation in the area. However, the differences compared to E-OBS in daily heavy precipitation seem to be larger in HCLIM3 than in HCLIM12 in the current study.
It is worthwhile to note that the high-resolution gridded observation set, NGCD, gives around 18 % higher p99.9avg values compared to E-OBS. Therefore, some part of the overestimation in HCLIM3 (25 % for p99.9avg over the NGCD domain) can be due to E-OBS failing to capture the most intense precipitation events. Furthermore, overestimation of more than 50 % in HCLIM12 and HCLIM3 can mostly be found in Denmark, Baltic countries, and the eastern part of the analyzed domain.
These areas have a less dense in situ station network (see Fig. 1 in Cornes et al., 2020), which is argued to lead to smoothing of the extremes in the E-OBS dataset (Hofstra et al., 2010;Cornes et al., 2020). Previous studies have found similar issues when comparing mean precipitation from RCMs to E-OBS.
The spatial distribution of the biases is similar for R20mm and p99.9avg (Fig. 2). It seems the average percentile values are [...] concentration nuclei (CCN) numbers are used in HCLIM: 100/cm³ over the sea, 300/cm³ over land, and 500/cm³ over cities.
Sensitivity results performed over Norway showed that the negative bias in mean precipitation and extreme precipitation events in the coastal regions could be improved by more realistic CCN values in the model, especially in cases when an air mass is moving from ocean to land (Landgren, 2020). However, more evaluation regarding the improvements in the extremes would be needed as only two extreme precipitation cases were studied.
[...] Table S1). The relative biases in HCLIM3 were mainly positive, at 3-13 % (32-84 % in Denmark), the largest biases being recorded for the greater percentile. The main features of the relative biases of p95-99.9avg values, such as overestimation in HCLIM3 over all regions and underestimation in HCLIM12 over Finland and Sweden, were comparable with relative biases of the 70th to 99.99th percentiles of daily precipitation found in a study by Lind et al. (2020). In that study, HCLIM3 overestimated summertime (June-July-August) percentiles by 0-6 %, while HCLIM12 underestimated them by 6-13 % when compared to NGCD. Compared to the median values from the NGCD dataset (E-OBS in Denmark), HCLIM12 produces greater median values of R10mm leading to positive relative biases ranging from 3 to 27 % (Table S1). HCLIM12 underestimates the very heavy precipitation days (R20mm) in Finland and Sweden by 14-15 %, while these are overestimated over Norway and Denmark by 15 and 135 %, respectively. HCLIM3 has mainly positive biases for both metrics: 4-24 % for R10mm (-2 % over Finland) and 7-206 % for R20mm.
The largest biases for p95-99.9avg as well as for R10mm and R20mm values are seen for Denmark where the modeled values were compared to E-OBS instead of the high-resolution NGCD dataset. As discussed before, the ability of E-OBS to represent the heavy precipitation events is questionable and might lead to misleading results. For instance, the relative biases in HCLIM3 and HCLIM12 decrease substantially from 22-206 % to ± 15 % when the modeled values over Denmark are compared with the national high-resolution dataset, Klimagrid, instead of E-OBS. Another aspect is the clear added value that can be found for HCLIM3 over Sweden when the modeled values are compared to the national high-resolution HIPRAD dataset instead of NGCD. When comparing the values to NGCD, it seems that the relative biases would be smaller in HCLIM12 than in HCLIM3. However, a comparison with HIPRAD reveals that HCLIM12 greatly underestimates the observed values. At the same time, the assumed overestimation in HCLIM3 decreases. We note that the baseline used in constructing HIPRAD, namely PTHBV, includes a generic undercatch correction that might explain differences to NGCD.
Hu et al. (2020) also showed that E-OBS and ERA5 datasets underestimated the magnitude of daily extreme precipitation when compared to in situ data over Germany while the national high-resolution dataset was able to represent the extremes adequately. On the other hand, E-OBS and NGCD cover larger regions and a longer time period compared to national high-resolution observations, which makes them worthwhile to consider in this study.
It is therefore worth noting that the model results should be compared with several different observations to get a more realistic overview of the model biases as already suggested by Prein and Gobiet (2017). They also emphasized the importance of local in situ measurements in evaluating the modeled statistics of extremes. However, also in situ measurements include uncertainties (e.g. related to undercatch). For instance, Crespi et al. (2019) noted that precipitation climatologies over Norway were improved by combining another HCLIM model output at 2.5 km grid spacing with in situ measurements instead of using only local observations. The climatologies were especially improved over remote mountainous regions. Similar conclusions were obtained in a study by Lundquist et al. (2019) who noted that precipitation from the model might be more accurate compared to observationally based datasets in complex terrain for mid to northern latitudes. Therefore, the overestimation in HCLIM3 could partly be caused by the inability of the gridded observations to represent the upper tails of precipitation distribution, especially over the Scandinavian mountains.

Evaluation of extreme daily precipitation
A noticeable feature seen in daily return values (Fig. 4) is the systematic difference between observational datasets. For all countries and return periods, the return values based on E-OBS are smaller than those from NGCD, which are in turn smaller than those from the collection of observational stations (i.e. the AM dataset, see Table 1). This is most pronounced for Denmark, where median return values from the AM dataset are larger than those from E-OBS by ~50 %. The comparison between modeled and observed return values needs to be interpreted with this in mind. HCLIM12 and HCLIM3 overestimate daily return values by 0-5 % and 5-21 %, respectively (> 30 and 50 % in Denmark) compared to NGCD (E-OBS in Denmark) (Fig. 4, Table S1). The variability of return levels is mainly well captured by both HCLIM setups, although HCLIM12 overestimates the variability in Finland and underestimates it in Sweden and Norway. E-OBS seems to produce too large a spread of return values compared to in situ observations over Denmark. HCLIM3 produces very similar variabilities to the Danish in situ data whereas HCLIM12 underestimates them.
The inadequacy of E-OBS observations in capturing the rarest extreme precipitation events might explain the large differences in modeled return values when compared to E-OBS. This is confirmed when the model results are compared with the Danish in situ measurements: the relative biases are actually negative in HCLIM12 (around -15 %) meaning that this model setup does not capture very high intensities observed at the stations. Also, the relative biases in HCLIM3 decrease substantially. In Finland and Sweden, the results are similar between NGCD and in situ gauges, although the overestimation in HCLIM3 reduces slightly when the comparison is made against in situ stations instead of NGCD. Looking at the spatial distribution of the biases, negative biases in HCLIM12 can be found over Denmark and the coastal and mountainous areas of Norway and Sweden (Fig. S2). Positive biases can be found in the inland areas in both model setups.

Evaluation of heavy hourly precipitation
The following sections show the evaluation of modeled hourly precipitation over Denmark, Norway, and Sweden, for which national high-resolution gridded observations were available. HCLIM3 mainly overestimates p95avg and p99.9avg by 3-44 % compared to national high-resolution observations (Fig. 5 and Table S2), despite an underestimation of 6 % of p95avg in Sweden. In contrast, HCLIM12 underestimates these metrics by 5-37 % and overestimates the p95avg by 1 % in Denmark. HCLIM3 also overestimates the precipitation events of more than 5 mm (R5mm) by around 60 % (5 % over Sweden), while HCLIM12 underestimates these events by 43-69 %. Similarly, Lind et al. (2020) and Olsson et al. (2021a) found mostly positive biases for hourly values over the 95th percentile in HCLIM3 and underestimation of the percentile values by HCLIM12. overestimates the spread of p95avg and p99.9avg over Sweden, while the spread of R5mm is underestimated in all countries.
The spatial distribution of the signs of the relative biases in p99.9avg and R5mm is very homogeneous: negative biases are found over all three countries in HCLIM12 whereas the biases are mainly positive in HCLIM3 throughout the domain despite some negative biases in the northern parts of Sweden and Norway (Fig. S3). Furthermore, the spatial structure of the relative biases is very similar for p99.9avg and R5mm.
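The metrics compared above can be sketched for a single hourly time series as follows. This is only an illustration under stated assumptions: p95avg and p99.9avg are taken to be the mean of wet-hour values at or above the respective percentile, the wet-hour threshold of 0.1 mm/hour is assumed to apply here as well, and R5mm is taken as the number of hours exceeding 5 mm. All function names are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of an ascending list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def mean_above_percentile(hourly, q, wet_threshold=0.1):
    """Mean of wet-hour values at or above the q-th percentile (assumed pXXavg definition)."""
    wet = sorted(v for v in hourly if v >= wet_threshold)
    limit = percentile(wet, q)
    tail = [v for v in wet if v >= limit]
    return sum(tail) / len(tail)

def r5mm(hourly, limit=5.0):
    """Number of hours with more than `limit` mm of precipitation (assumed R5mm definition)."""
    return sum(1 for v in hourly if v > limit)

def relative_bias(model_value, obs_value):
    """Relative bias of a model metric with respect to observations, in percent."""
    return 100.0 * (model_value - obs_value) / obs_value
```

For example, `relative_bias(mean_above_percentile(model_series, 99.9), mean_above_percentile(obs_series, 99.9))` would give the relative bias in p99.9avg for one grid cell, where `model_series` and `obs_series` are hypothetical lists of hourly values.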
It should be noted that the ERA5 dataset did not capture well the intensities or the spread of the average precipitation over the 95th and 99.9th percentiles. Also, the R5mm values are substantially lower in ERA5 compared to the national high-resolution observations. This is not surprising since ERA5 has a coarse grid resolution and parameterized convective precipitation. Therefore, ERA5 should be used with caution in the evaluation of extreme precipitation from convection-permitting climate models.
The comparison with geographically sampled observational grid cells (i.e. selecting only grid cells with stations) over Norway and Denmark reveals even more negative relative biases in HCLIM12 and decreasing relative biases in HCLIM3 of all metrics (p95-99.9avg and R5mm; see Fig. 5 and Table S2). Without geographical sampling, the absolute relative biases of HCLIM12 seem to be lower than the biases of HCLIM3. When geographic sampling is accounted for, the biases are clearly lower in HCLIM3 while HCLIM12 considerably underestimates all metrics (Table S2). Geographical sampling could not be applied over Sweden as the HIPRAD dataset is mainly based on radar observations. However, even when geographical sampling is performed, which is generally recommended by Risser and Wehner (2020), a scaling mismatch will still be present. This is because the scales of any station network will always be different from the scales of the horizontal resolutions and remaining subgrid-scale parameterizations in regional climate models. Figure 6 presents probability density functions of hourly precipitation over Sweden, Norway, and Denmark. Added value can be found especially in the ability of HCLIM3 to represent the highest intensities in Sweden and Norway. HCLIM12, on the other hand, underestimates all intensities above 1.5-2 mm/hour over all three countries compared to the national datasets.
In Denmark, HCLIM3 overestimates the highest intensities. With geographic sampling, the highest intensity (with the density of 0.001) computed from the observational Klimagrid data increases from 9 to 11 mm/hour, which is closer to the value simulated by HCLIM3 (~ 12.5 mm/hour) than that of HCLIM12 (~ 7.3 mm/hour) (Fig. S4 b). Again, the added value found in HCLIM3 depends on the observational datasets that are used as a reference and can also be influenced by geographic sampling. The coarse-resolution ERA5 fails to capture the highest intensities compared to the higher resolution observations.

Figure 6. The mean probability density functions of hourly precipitation for (a) Sweden, (b) Norway, and (c) Denmark. The shading presents the 25th and 75th percentiles from the spatial distribution. 'Obs' refers to the national high-resolution gridded datasets. All data are presented on their native grids. A threshold of 0.1 mm/hour was used prior to analysis.
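A probability density function of hourly precipitation of the kind shown in Fig. 6 can be approximated with a simple histogram, as in the sketch below. The wet-hour threshold of 0.1 mm/hour follows the figure caption; the bin width of 1 mm/hour and the function name are assumptions for illustration. The sketch treats one time series, whereas the figure shows the mean over grid cells with the spread across them.

```python
def hourly_pdf(hourly, bin_width=1.0, wet_threshold=0.1):
    """Histogram-based density of wet-hour intensities.

    Returns a dict mapping the left edge of each non-empty bin (mm/hour)
    to the estimated probability density in that bin.
    """
    wet = [v for v in hourly if v >= wet_threshold]
    counts = {}
    for v in wet:
        idx = int(v // bin_width)
        counts[idx] = counts.get(idx, 0) + 1
    n = len(wet)
    # Dividing by n * bin_width makes the densities integrate to one.
    return {idx * bin_width: c / (n * bin_width) for idx, c in sorted(counts.items())}
```

Because the highest intensities are rare, densities in the upper tail rest on very few hours per grid cell, which is one reason why pooling over many grid cells and plotting on a logarithmic axis is common for this type of comparison.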
The diurnal cycle of the 99.9th percentile is better represented in HCLIM3 compared to HCLIM12, especially over Sweden and Norway (Fig. 7). However, HCLIM3 shows some overestimation which is the largest during the daytime with a mean bias (model-observations) over all hours of 0.4 mm/hour in Sweden, 0.8 mm/hour in Norway, and 1.5 mm/hour in Denmark (Table S2). HCLIM12 underestimates the precipitation intensities of all hours with a mean bias of -2.1 mm/hour in Sweden, -0.9 mm/hour in Norway, and -1.0 mm/hour in Denmark. The observed peak in the 99.9th percentile precipitation occurs in the late afternoon in Sweden and Norway. There is no clear peak in observations in Denmark, but a minimum occurs during the night. Overall, the afternoon peak is well captured by HCLIM3 in Sweden and Norway, while the nighttime minimum in Denmark is not well represented. While HCLIM12 does not show any clear peaks, ERA5 shows too early peaks in Sweden and Norway. It is known that models with parameterized convection tend to produce too early peaks of the diurnal cycle because the convection onset might be triggered too early (e.g. Brockhaus et al., 2008;Meredith et al., 2021). HCLIM12 and ERA5 also underestimate the observed intensities of the 99.9th percentile events throughout the day.
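The diurnal cycle of a high percentile and the mean bias over all hours can be sketched as follows. This is a minimal illustration, assuming that the percentile is computed separately for each hour of the day from wet hours only; the function names are hypothetical and the exact procedure behind Fig. 7 may differ.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of an ascending list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def diurnal_cycle(hours_of_day, values, q=99.9, wet_threshold=0.1):
    """Percentile of wet-hour precipitation for each hour of the day (0..23).

    Hours of the day without any wet values are returned as None.
    """
    by_hour = [[] for _ in range(24)]
    for hour, value in zip(hours_of_day, values):
        if value >= wet_threshold:
            by_hour[hour].append(value)
    return [percentile(sorted(v), q) if v else None for v in by_hour]

def mean_bias(model_cycle, obs_cycle):
    """Mean of (model - observations) over the hours where both are defined."""
    diffs = [m - o for m, o in zip(model_cycle, obs_cycle)
             if m is not None and o is not None]
    return sum(diffs) / len(diffs)
```

A positive value of `mean_bias` corresponds to the overestimation reported for HCLIM3, and a negative value to the underestimation reported for HCLIM12.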
Previously, Belušić et al. (2020), Lind et al. (2020), and Olsson et al. (2021a) have shown that HCLIM3 improved the representation of the diurnal cycles of mean precipitation as well as the 90th and 99th percentiles compared to HCLIM12. When geographical sampling is applied, HCLIM3 better captures the shape of the diurnal cycle also over Denmark, although the nighttime minimum still occurs too late compared to observations (Fig. S4 c, d). The mean bias in HCLIM3 decreases from 1.5 to 0.5 mm/hour in Denmark and from 0.8 to 0.5 mm/hour in Norway (Table S2). It is clear that involving all grid cells in the Klimagrid dataset deteriorates the comparison with HCLIM3. Lind et al. (2020) encountered similar problems in the ability of Klimagrid to capture the diurnal cycle of mean hourly precipitation. The problems might arise from the interpolation scheme used in Klimagrid, which interpolates spatially for each hour.
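Geographic sampling, i.e. restricting the comparison to grid cells that contain at least one station, can be sketched as below. The sketch assumes a regular latitude-longitude grid described by the coordinates of its southwestern corner and its cell size; real model and observational grids typically use rotated or projected coordinates, so the index calculation would differ in practice. All names are hypothetical.

```python
def cells_with_stations(stations, lat0, lon0, dlat, dlon, nlat, nlon):
    """Return the set of (row, col) indices of grid cells that contain a station.

    stations: iterable of (lat, lon) pairs.
    lat0, lon0: southwestern corner of the grid; dlat, dlon: cell size in degrees.
    Stations outside the grid are ignored.
    """
    cells = set()
    for lat, lon in stations:
        row = int((lat - lat0) // dlat)
        col = int((lon - lon0) // dlon)
        if 0 <= row < nlat and 0 <= col < nlon:
            cells.add((row, col))
    return cells

def sampled_mean(field, cells):
    """Mean of a gridded metric (list of rows) over the selected cells only."""
    selected = [field[row][col] for row, col in sorted(cells)]
    return sum(selected) / len(selected)
```

Applying `sampled_mean` to the same metric from the model and from the gridded observations restricts both to locations where the gridded product is directly constrained by measurements, which is the motivation for the sampling discussed above.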

Evaluation of extreme hourly precipitation
HCLIM3 captures the hourly return levels of all return periods substantially better than HCLIM12 (Fig. 8). HCLIM12  (Table S2). Both model setups (especially HCLIM12) underestimate the variability of return levels over all countries, although HCLIM3 captures the variability over Sweden well. As expected, ERA5 performs poorly in capturing the values and variabilities of all return levels compared to the in situ stations. The spatial structure of the relative biases of the return values with a return period of 10 years is very uniform in HCLIM12, as the biases are negative and of the same magnitude over all three countries (Fig. S5). Positive and negative relative biases of HCLIM3 simulated return levels are irregularly scattered over the whole domain. The underestimation of return levels and their variability by HCLIM3 and HCLIM12 might appear too large because we did not take into account areal reduction factors for the in situ data. Statistical extremes retrieved from point sources (e.g. in situ gauges) are generally expected to be higher than extremes from climate models that produce spatial averages, which causes a scaling mismatch (Chen and Knutson, 2008). Also, the differences in the temporal scales are not taken into account. The observed hourly precipitation is measured every 1 minute in Denmark, every 15 minutes in Sweden, and every 1 minute or 1 hour in Norway, while the model produces values for every full hour. For instance, Berg et al. (2019) reported a reduction factor of 1.21 when going from a point measurement to a 12 km grid resolution and from 1 minute temporal sampling time to 60 minutes. On the other hand, HCLIM3 would still show superior performance over HCLIM12 even if the hourly annual maxima from the in situ data were reduced by 20 %.
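The size of such a correction can be made concrete with a short calculation. The sketch below assumes that the reduction factor is applied by dividing the point-scale annual maxima by the factor; this reading of the factor, as well as the function names, are assumptions for illustration.

```python
def apply_reduction_factor(point_maxima, factor=1.21):
    """Scale point-based annual maxima down to an assumed grid-cell equivalent."""
    return [value / factor for value in point_maxima]

def equivalent_percent_reduction(factor):
    """Percentage by which values decrease when divided by `factor`."""
    return 100.0 * (1.0 - 1.0 / factor)
```

Under this reading, a factor of 1.21 corresponds to a reduction of about 17 %, which is slightly smaller than the 20 % reduction considered in the text.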
The fact that most of the annual maximum precipitation occurs during the convective season in the summer (see Sect. 4.5; Lutz et al., 2020; Dyrrdal et al., 2021) indicates that the reason for the superior performance of HCLIM3 over HCLIM12 might be the explicitly resolved deep convection in HCLIM3. In addition, the results obtained for HCLIM12 are in line with previous studies. Berg et al. (2019) concluded that the regional climate model simulations with 12.5 km resolution underestimated 10-year return levels for hourly durations over selected European countries including Sweden.
Although Fig. 8 shows promising results from the convection-permitting regional climate model (CPRCM) setup, it would be important to assess the ability of the model to capture the actual meteorological conditions that lead to extreme precipitation events. For instance, Coppola et al. (2020) showed that some specific observed extreme precipitation cases might be missed by CPRCMs. Although CPRCMs generally cannot be expected to reproduce single extreme events even in perfect boundary simulations, they might simulate events that were "missed" by reality, keeping in mind that reality is only one of the many realizations of climate. This leads to a better agreement of the long-term statistics of precipitation extremes between the model and observations. Olsson et al. (2021a) illustrated this point by analyzing how well the HCLIM model captured the observed extreme precipitation event occurring in August 2014 in Malmö. They concluded that HCLIM3 reproduced the event but with reduced intensity. However, another event similar to the Malmö case was found in HCLIM3 but in a different year, whereas HCLIM12 did not simulate events of the same magnitude. It should be noted that a 21-year simulation period is relatively short, and one might need to wait more than 21 summers to generate the most intense precipitation events. Nonetheless, this calls for more studies of the underlying processes and meteorological conditions of the simulated extreme events, which would bring us forward regarding the shortcomings in models and observations.

Seasonality of hourly and daily annual maximum precipitation
Figure 9 illustrates the occurrences of simulated hourly and daily annual maximum precipitation in Sweden, Norway, and Denmark compared to the national high-resolution datasets. Annual hourly and daily extreme events are most frequent in July and August, except for daily events in Norway. In Norway, the daily extreme precipitation events occur also in late autumn, especially near the coastal areas (Dyrrdal et al., 2021). Still, the larger density of occurrences in September in Norway is captured by both model setups. In addition, both HCLIM3 and HCLIM12 show good consistency with observational datasets, and overall, the differences resulting from the model setups are small.
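The monthly occurrence of annual maxima examined in this section can be tallied as in the following sketch, which finds the month of the largest value in each year and converts the counts to fractions. The input layout and the function names are assumptions for illustration; the densities in Fig. 9 may be computed differently, for example by pooling over grid cells.

```python
from collections import Counter

def annual_max_months(records):
    """Map each year to the month in which its largest value occurred.

    records: iterable of (year, month, value) tuples.
    Ties are resolved in favor of the first occurrence.
    """
    best = {}
    for year, month, value in records:
        if year not in best or value > best[year][1]:
            best[year] = (month, value)
    return {year: month for year, (month, _) in best.items()}

def monthly_fractions(months_by_year):
    """Fraction of years whose annual maximum falls in each month."""
    counts = Counter(months_by_year.values())
    n_years = len(months_by_year)
    return {month: counts[month] / n_years for month in sorted(counts)}
```

Comparing the output of `monthly_fractions` for a model series and an observed series shows, month by month, where the model places too many or too few annual maxima.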
HCLIM3 overestimates the occurrence of hourly extreme events in July over Sweden and Norway and, in contrast, underestimates these events in August. Geographical sampling does not substantially affect the results, although, for instance, the overestimated density of hourly events in July over Norway by HCLIM3 diminishes. Olsson et al. (2021a) found overestimated fractions of annual maxima in June and underestimated fractions in July and August in HCLIM3 and HCLIM12 over southern Sweden. However, they compared the results only over seven in situ stations in a specific area of Sweden, whereas this study considered the whole of Sweden. Although the results are not completely comparable, Olsson et al. (2021a) concluded that HCLIM3 did not improve the model performance regarding the monthly occurrences of annual maxima, which also seems to be the case in our study.

We analyzed the characteristics of heavy and extreme precipitation in 21-year-long convection-permitting climate simulations with non-hydrostatic dynamics at a 3 km grid spacing (HCLIM3) and compared them with climate simulations performed with 12 km grid spacing, hydrostatic dynamics, and parameterized convection (HCLIM12). These simulations have been evaluated and presented in previous studies (Olsson et al., 2021a), but this paper presents a more detailed evaluation of the extreme precipitation statistics utilizing both basic metrics, such as average precipitation over the 95 to 99.9th percentiles and frequencies of heavy precipitation events, as well as generalized extreme value (GEV) theory.
The evaluation was performed over the Nordic region with a special focus on the warm season from April to September. The results are summarized as follows:
• Daily heavy precipitation amounts and frequencies, as well as daily return levels, were well represented over the Nordic region by HCLIM (at both 3 and 12 km grid spacings).
• HCLIM3 was able to capture the most intense hourly precipitation events and their frequency with a slight overestimation, while these were underestimated by HCLIM12.
• Overall, HCLIM3 improved the representation of the probability density functions of hourly precipitation and the diurnal cycle of the 99.9th percentile events over Sweden and Norway. In particular, the shape of the diurnal cycle, peak time, and amounts were better captured by HCLIM3 whereas the peak was not visible in HCLIM12.
• A clear added value of HCLIM3 was seen in simulating return levels of hourly precipitation. HCLIM3 produced very similar precipitation intensities to in situ observations while HCLIM12 substantially underestimated them.
• Both models captured the seasonality of annual maximum precipitation with most of the daily and hourly events occurring in July and August.
This study confirmed that the coarser E-OBS and ERA5 datasets underestimate the most intense precipitation extremes in the Nordic region: the amounts and frequencies of heavy and extreme precipitation were substantially lower in these datasets compared to the high-resolution gridded observations. Therefore, the model results should be compared with several different datasets, preferably high-resolution or in situ observations, to get a better overview of the model biases. In general, part of the model biases might arise from the uncertainties in observations. The in situ network is sparse, especially over the Scandinavian mountains, and systematic undercatch of precipitation lowers the quality of observations in this region.
Nevertheless, the results indicate that high-resolution observations are crucial in the evaluation of high-resolution climate models. The model evaluation would especially benefit from datasets that merge both rain gauges and high-resolution temporally and spatially continuous weather radar data as well as from better uncertainty estimates for the observational datasets. Also, geographic sampling affected the results of model evaluation: the sampling decreased the overestimation in HCLIM3 and led to a greater underestimation of the observed values by HCLIM12. The hourly heavy precipitation intensities and amounts as well as the shape of the diurnal cycle were clearly better represented in HCLIM3 compared to HCLIM12 when geographic sampling was applied. The results presented in this study generally agree with previous studies of evaluation of convection-permitting regional climate models. These include a better representation of the diurnal cycle and the highest intensities of hourly precipitation and their frequencies in convection-permitting HCLIM3. Hence, we conclude that an added value can be found for the HCLIM38 model at convection-permitting scales in simulating heavy and extreme precipitation events over the Nordic region. Although investigating the origin of the added value in HCLIM3 is beyond the scope of this study, the results indicate that the improvements in HCLIM3 are due to explicitly simulated deep convection. The higher horizontal resolution might also play a role. However, understanding the underlying processes of precipitation extremes would be highly beneficial to gain information on the ability of the model to reproduce these events for the right reasons. Nonetheless, the results indicate that high-resolution convection-permitting climate models are valuable for the construction of the future projections of extreme precipitation as climate adaptation requires more robust and reliable projections of future changes.
Future work will investigate the changing characteristics of precipitation extremes due to climate change over Northern Europe with HCLIM3 and HCLIM12 while putting the results in context by comparing the HCLIM38 model with a larger RCM ensemble.
Code availability. The ALADIN and HIRLAM consortia cooperate on the development of a shared system of model codes. The HCLIM model configuration forms part of this shared ALADIN-HIRLAM system. According to the ALADIN-HIRLAM collaboration agreement, all members of the ALADIN and HIRLAM consortia are allowed to license the shared ALADIN-HIRLAM codes within their home country for non-commercial research. Access to the HCLIM codes can be obtained by contacting one of the member institutes of the HIRLAM consortium (see links at: http://hirlam.org/index.php/hirlam-programme-53). The access will be subject to signing a standardized ALADIN-HIRLAM license agreement (http://hirlam.org/index.php/hirlam-programme-53/access-to-the-models). Some parts of the ALADIN-HIRLAM codes can be obtained by non-members through specific licenses, such as in OpenIFS (https://confluence.ecmwf.int/display/OIFS) and Open-SURFEX (https://www.umr-cnrm.fr/surfex).