Bayesian hierarchical modeling of sea level extremes in the Finnish coastal region

Occurrence probabilities of extreme sea levels required in coastal planning, e.g. for calculating design floods, have been traditionally estimated individually at each tide gauge location. However, these estimates include uncertainties, as sea level observations typically have only a small number of extreme cases such as annual maxima. Moreover, exact information on sea level extremes between the tide gauge locations and incorporation of dependencies between the adjacent stations is often lacking in the analysis. 5 In this study, we use Bayesian hierarchical modeling to estimate return levels of annual maxima of short-term sea level variations related to storm surges in the Finnish coastal region. We use the generalised extreme value (GEV) distribution as the basis and compare three hierarchical model structures of different complexity against tide gauge specific fits. The hierarchical model structures allow to share information on annual maximum sea levels between the neighboring stations and also provide a natural way to estimate uncertainties in the theoretical estimates. 10 The results show that, compared to the tide gauge specific fits, the hierarchical models, which pool information across the stations, provide narrower uncertainty ranges both for the posterior parameter estimates and for the corresponding return levels on most of the tide gauges. The estimated shape parameter of the GEV model is systematically negative for the hierarchical models, which indicates a Weibull-type of behavior for the extremes along the Finnish coast. This also suggests that the hierarchical models can be used to estimate theoretical upper limits of the extremes of short-term sea level variations along the 15 Finnish coast. Depending on the tide gauge and hierarchical model considered, the median value of the theoretical upper limit was 47–73 cm higher than the highest observed sea level.

The theory on extreme value analysis is well documented in the literature (e.g., Coles et al., 2001). If block-maxima, such as the annual maximum sea levels are considered, extreme value theory tells us that the generalised extreme value distribution (GEV) is the only possible limiting distribution. GEV has been used to model return levels of annual maximum sea level heights in the Baltic Sea region in many previous studies (e.g., Ribeiro et al., 2014;Wolski et al., 2014;Marcos and Woodworth, 2017; 55 Soomere et al., 2018;Kudryavtseva et al., 2021). Most of these studies, however, have concentrated on individual tide gauge locations. As the time series of sea level extremes are relatively short, they generally do not allow reliable estimation of occurrence probabilities of very rare events (e.g. return period >1000 years). Furthermore, to be able to construct predictions for sea level extremes outside the tide gauge locations, information on their spatial dependencies and variations is needed. basis for extreme value analysis. However, they did not consider spatial dependencies explicitly in their analysis.
Issues with the limited sample size can be partly alleviated by pooling information on sea level extremes across the neighbouring tide gauges, which reduces uncertainty on the parameter estimates. One approach to pool information across multiple tide gauges is based on regional frequency analysis (RFA) (Dalrymple, 1960;Hosking and Wallis, 1997). In this approach, a reasonably homogeneous region is searched using some similarity criterion, tide gauges within the region are normalized 65 locally using flood-index and then pooled together before fitting a single extreme value distribution to the pooled data. This approach has been successfully applied to storm surges (e.g., Bardet et al., 2011;Bernardara et al., 2011). However, the results obtained with RFA might contain inconsistencies between different domains. Also, the method does not directly account for spatial dependencies between the tide gauges within the whole target region.
Bayesian hierarchical modeling approach allows more flexibly to incorporate spatial and other dependencies in statistical 70 models (e.g., Gelman et al., 2013). In our case, this means that tide gauge specific GEV parameters are described jointly at the population level, for example, assuming that they come from the same joint hyper-distribution. This hyper-distribution has its own hyper-parameters which need to be specified either from data or modeled by adding an additional layer to the model. Bayesian hierarchical models have been used to estimate extremes in other geoscientific fields (e.g., Cooley et al., 2007), but there are relatively few examples of their use when modeling sea level extremes. Coles and Tawn (2005) made 75 one of the first attempts to model storm surges within the Bayesian framework. Calafat and Marcos (2020) developed a novel hierarchical model to account for spatial dependencies on storm surge extremes over the North Sea and British Isles. They showed that Bayesian hierarchical modeling approach allows to estimate occurrence probabilities of sea level extremes outside gauge locations and to quantify uncertainty in the predicted sea levels.
This paper extends previous efforts on statistical modeling of sea level extremes on the Finnish coastal region. Our aim is 80 to assess how Bayesian hierarchical modeling -implemented in a simpler manner compared to Calafat and Marcos (2020) -performs in comparison to tide gauge specific models, when estimating theoretical return levels for extremes related to short-term sea level variations. Three hierarchical models are included, two of which take the spatial distance between the tide gauges into account in their model formulations. We perform a series of tests with the hierarchical models and evaluate their performance against the observed sea level extremes. Furthermore, we illustrate the theoretical return level estimates obtained 85 with the hierarchical models and their differences with respect to tide gauge specific estimates. We exclude long-term changes from the analysis and focus on short-term sea level variations controlled by meteorological factors. As we concentrate solely on stationary conditions, the potential time dependence of GEV parameters is not assessed.
The paper is structured in the following manner. In Section 2, we describe the tide gauge observations from the Finnish coast. In Section 3, we introduce the methods utilised in this paper, which is then followed by a summary of the main outcomes 90 of the study in Sect. 4. In Section 5, we discuss the shortcomings of this study and potential avenues for future research. The paper is closed with conclusions in Sect. 6.  was not captured by our simple hierarchical modeling approach.
Data set used here consists of annual maximum sea levels. Annual maxima were calculated from monthly maximum sea levels which were extracted from either the original continuous paper recordings, or later from the digital data at 1-minute resolution. Before year 1939, the values have been calculated from 4-hourly observations. Although there have been changes in the temporal resolution of the observations, we decided to include the earliest years in our study in order to use the longest 105 possible time series in our analysis. The measured sea levels have been converted to the height system N2000 from fixed tidegauge-specific reference systems. N2000 is the Finnish realisation of the European Vertical reference system, and the datum has been derived from Normaal Amsterdams Peil (NAP) (Saaranen et al., 2009). All time series extend to the end of year 2020 with some gaps in them due to missing data. To reduce the impact of long-term changes caused by climate change and post-glacial land uplift to the extreme value analysis, a linear trend was calculated separately for each tide gauge based on 110 4-hourly observations and then subtracted from the annual maximum values.
Studies in the Baltic Sea region have suggested that the annual maximum sea levels should be calculated for a year-long block that covers all winter months instead of using the calendar year due to seasonality in storminess (e.g., Johansson et al., 2001;Suursaar et al., 2002;Soomere et al., 2018;Kudryavtseva et al., 2021). We took this approach and calculated the annual maximum values between 1 April-31 March. This way the maximum value from each winter period was selected, which should 115 reduce the correlation between the annual maximum values. Furthermore, those years, which had more than one monthly maximum value missing from the winter half of the year were filtered out. The length of the pre-processed time series varied between 87-124 years. These are illustrated in Fig. 1.
The maximum sea level varies between 36-198 cm in the pre-processed time series. From Fig. 1, it is observed that the highest sea levels are typically observed in Kemi and Hamina, which are the tide gauges located closest to the ends of the Gulf 120 of Bothnia and the Gulf of Finland. Time series for these tide gauges also exhibit the highest year-to-year variability. These features have some implications for the statistical model design as discussed in the next section. We note that the highest annual maximum values used here differ to certain extent from the reported record heights in Finland (e.g., Wolski and Wiśniewski, 2020) due to detrending and as they have been previously defined against a different reference, the so-called theoretical mean sea level (Johansson et al., 2003).

125
Some individual values stand out from the time series panels. For example, markedly high sea levels were observed in January 1984 on some tide gauges located on the western coast of Finland. Such exceptional cases are of interest to us, because they are likely to some extent affect our statistical modeling results. We will briefly discuss the sensitivity of our GEV models for anomalously high observed sea levels in Sect. 4 using the aforementioned case as an example.

Extreme value analysis for annual maximum sea levels
We briefly summarise the main properties of the generalised extreme value distribution (GEV) applied to the extreme sea levels before discussing the chosen modeling approaches. We refer to Coles et al. (2001) for more statistical details on GEV distribution and its properties. Let Y i with i = 1, . . . , 12, be a random variable describing the annual maximum sea level (i.e., block-maxima) at the tide gauge i. The extreme value theorem states that for the normalised maxima of a sequence of inde-135 pendent and identically distributed random variables the GEV distribution is the only possible limiting distribution. GEV is a suitable model in our case, as the detrended annual maxima can considered to be independent of each other. The cumulative distributions function of GEV can be written as: In Eq. (1), µ ∈ R is the location, σ > 0 scale parameter and ξ ∈ R denotes the shape parameter. The tail behavior of GEV 140 distribution is strongly controlled by the shape parameter. If ξ < 0, y has a bounded upper tail at µ − ξ/σ, while in other cases the upper tail is unbounded and has either exponential (ξ = 0) or polynomial (ξ > 0) decay.
The quantiles of GEV distribution are obtained by inverting Eq. (1): This equation is used to calculate the return level y p associated with a certain exceedance probability p or conversely, the return 145 period T = 1/p. While other approaches such as peaks-over-threshold could be considered (e.g., Coles et al., 2001), we used GEV as it allows the direct modeling of annual maxima.
In the simplest case where the GEV parameters are estimated separately for each tide gauge (denoted as Separate hereafter), the model for the observations at i th tide gauge is expressed as y i ∼ GEV (µ i , σ i , ξ i ). We use Bayesian methods to estimate the distribution parameters. As we are particularly interested in robustly quantifying uncertainty in the GEV parameters and in 150 the predicted return levels, the Bayesian approach provides a natural way to do this. Bayes' theorem states that the posterior distribution of model parameters given observations y is Here, p(θ) is the prior distribution for the parameters and p(y|θ) the likelihood function. The symbol θ is a vector containing all the unknown parameters of the model. As the integral in the denominator in Eq.
(3) does not generally have a closed-155 form solution for our hierarchical models, we use Markov chain Monte Carlo (MCMC) methods to sample from the posterior parameter distribution p(θ|y). The implementation details such as the selected prior distributions for the parameters p(θ) (and for the hyper-parameters in hierarchical modeling cases) are provided in the supplementary material.

Hierarchical models for GEV parameters
Tide gauge specific parameter estimates tend to be uncertain, as we usually have relatively few observations in our disposal.

160
One way to include more information is to assume that the model parameters at different tide gauge locations are similar. A hierarchical approach is to assume that their values are bind together with a joint prior distribution, whose parameters are not assumed to be known, but estimated along the individual gauge specific parameters. For pooling information across the tide gauges, we tested three hierarchical formulations for the GEV distribution parameters. The first model (denoted as Common hereafter) is a simple extension to the separate fitting of GEV distribution at the tide gauge locations. The model is written as where N + denotes half-normal distribution. This model applies partial pooling to the data with the assumption that the tide gauge specific distribution parameters come from the same joint Gaussian distribution but it simultaneously allows different parameter values to be estimated for individual tide gauges. In addition to the tide gauge specific GEV parameters, we have six additional hyper-parameters (µ µ , σ µ , µ σ , σ σ , µ ξ , σ ξ ), that tie the individual parameters together. As the hyper-parameters 170 are unknown and estimated from the data, we need to assign further prior distributions on them.

GEV parameter hierarchy using splines
Common does not account for possible spatial dependencies in the sea level extremes between the tide gauges. As was mentioned previously, the northern part of the Baltic Sea consists of sub-basins like the Gulf of Bothnia and the Gulf of Finland.
This geometry strongly regulates the sea level variations in these regions (so-called bay effect), and the magnitude of sea level 175 extremes systematically increases towards the end of the two bays ( Fig. 1). We used the distance d calculated roughly along the coast between the tide gauge locations such that d Kemi = 0 as covariate information when modeling the GEV parameters.
In these models, the vector for the GEV parameters is expressed as θ = (µ(d), σ(d), ξ). Thus, we assume the shape parameter ξ does not depend on d but still gets its own value at each tide gauge, similarly as in Sect. 3.2. One benefit of this approach is that the model provides estimates of the GEV parameters and consequently return level estimates between the tide gauge 180 locations.
We tested two approaches to incorporate the distance dependence in our hierarchical models. In the first one, the location and scale parameter are modeled using B-splines (de Boor, 1978) (denoted as Spline hereafter). B-splines are defined by the degree p of the basis function polynomials and a non-decreasing, here equally-spaced, set of knots t = t 1 , . . . , t r . We have set r = 10. Spline functions are then constructed as a linear combination of B-spline basis functions. The model formulation for 185 Spline is where m = r + 2 is the number of cubic (p = 3) B-splines, B j are the B-spline basis functions and α j and β j are the spline coefficients to be estimated from the data. To avoid overfitting, we used cubic B-splines with first order random walk priors for the spline coefficients (e.g., Eilers and Marx, 1996;Lang and Brezger, 2004).

GEV parameter hierarchy using Gaussian processes
In the third hierarchical model, spatial dependence for the GEV parameters is accounted for using a Gaussian process (GP) prior. The model is written as: In Eq. 6, m µ and m σ are the mean and K µ and K σ the covariance functions for µ and σ, respectively. As we have a finite 195 number of tide gauge locations, this amounts to modelling both priors as multidimensional Gaussian distributions. To obtain smoothness in the neighbouring tide gauge estimates for the GEV parameters, we used squared exponential covariance function in our model: where d i and d j are tide gauge locations defined by their distances to the reference station at Kemi. Furthermore, α is spatial 200 standard deviation and ρ is the characteristic length scale, both of which are estimated from the data.
All models were fitted using MCMC simulations by R and Stan probabilistic programming language (Gabry andČešnovar, 2021;Stan Development Team, 2020a, b). Stan implements a variant of Hamiltonian Monte Carlo MCMC algorithm, socalled No-U-Turn Sampler (NUTS), which has shown to perform well in fitting complex hierarchical models (e.g., Calafat and Marcos, 2020). For each model, MCMC simulations were done with four parallel chains over 3000 iterations. The first 1000 205 iterations were removed as the burn-in period. Thus, the total number of draws obtained from the posterior distribution was 8000.  distance dependence is similar for both the Spline and GP model. Note that we do not attempt to extrapolate beyond Kemi or Hamina, as the resulting parameter estimates would be highly uncertain.

Results
We first evaluate the goodness-of-fit of the hierarchical models against the observations and compare their performance against the tide gauge specific GEV fits. We then have a more in-depth look at the simulated posterior parameter distributions for 215 the hierarchical models and illustrate how the return level estimates differ between Separate and the hierarchical models. We also briefly discuss whether some theoretical limits for the maximum sea level height could be inferred from the estimated parameters. The diagonal line is also shown in each panel.

Evaluation of model fit against observations
To obtain some insights on model adequacy and the goodness of the Separate and hierarchical model fits, quantile-quantile (Q-220 Q) plots calculated against the empirical return levels are shown in Fig. 3. As the Q-Q plots are very similar for all hierarchical models, the results are shown only for the Spline model. Visual inspection of the tide gauge specific panels shows that both GEV models fit reasonably well to the data in most tide gauges, although the empirical estimates tend to be outside the 95% uncertainty range in some cases. Fig. 3 also shows that although the median results are close to each other for these models, the uncertainty range is indeed larger for the Separate model in many locations. Fig. 4 shows the probability-probability (P-P) 225 plots for the same models. While the panels do not reveal any major discrepancies from the observed probabilities in most tide gauges, the match is less good for those tide gauges that are located close to the ends of the Bay of Bothnia and Gulf of Finland. In particular, the fit seems to be less good for Spline compared to Separate in Oulu. Nevertheless, both Q-Q and P-P  plots indicate that, with some tide gauge specific discrepancies with the observations, the models provide reasonable fits to the data.

Leave-one-out cross validation
To compare the relative performance of the GEV models, we performed leave-one-out cross validation with Pareto smoothed importance sampling (PSIS-LOO) implemented in the loo package in R (Vehtari et al., 2017(Vehtari et al., , 2020. loo provides the expected logarithmic point-wise predictive density elpd LOO for the out-of-sample predictions and also estimates the effective number of parameters. While Bayesian LOO cross-validation has some limitations for model selection (e.g., see discussion

235
in Gronau and Wagenmakers (2019a) and Gronau and Wagenmakers (2019b)), LOO is nevertheless useful in our case for highlighting possible differences in the model performance.  Table 1. Leave-one-out cross validation statistics for the implemented GEV models. The first column shows the difference in the expected logarithmic point-wise predictive density (elpdLOO) with respect to the best performing model and the second column the effective number of parameters. Values within the brackets are the standard errors for the estimated statistics.
The LOO cross-validation statistics are listed in Table 1. The first column shows the difference in elpd LOO with respect to the best model for which this value is zero, together with its standard error estimate. Spline has the highest elpd LOO value out of all models with GP having only slightly worse results. Separate has the smallest elpd LOO , which indicates that the 240 hierarchical models might provide a better fit to the observed annual extremes. The second column shows the effective number of parameters (P LOO ) together with the estimated standard error, whose value gives a rough estimate of model complexity. In line with ∆elpd LOO Spline has the smallest P LOO which is roughly half of that for the separate fits.

Posterior parameter distributions
We next take a look at the posterior parameter estimates obtained with MCMC. To first illustrate some of the properties of 245 the posterior parameter distributions, bi-variate parameter distributions are shown in Fig. 5 for the Separate and Spline model in the Kemi tide gauge. The panels in this plot show that µ and σ are negatively correlated with ξ to a certain degree for the Separate model, as the location and scale tend to increase for more negative shape parameter values. The parameters are less strongly correlated for the Spline model and the distribution shape is more Gaussian due to the effect of pooling. The bi-variate distributions look well identified for both models but for the Spline model cover visibly narrower parameter ranges compared 250 to Separate. Very similar results are obtained for the other hierarchical models and therefore, are not shown here. Figure 6 shows the posterior parameter distributions for all models. Both the location µ and scale σ parameter generally increase towards the ends of the two bays. µ obtains its largest values in Kemi, Oulu and Hamina tide gauges, where the median value for µ approaches or even exceeds 100 cm. The posterior median for σ is close to, or exceeds, 25 cm in the same tide gauges. Furthermore, the spread of the posterior parameter distribution tends to be largest in these tide gauges for both µ 255 and σ. Note that the posterior parameter uncertainty in σ tends to be somewhat larger for Separate compared to the hierarchical models. Variations in µ across the tide gauges are similar for all models, but for the scale parameter σ, Separate has slightly larger overall variations compared to the hierarchical models. This could be due to the shrinkage effect, which tends to bring the station specific parameter values of the hierarchical models towards the overall mean.
In contrast with µ and σ, the shape parameter ξ does not have as clear connection with the distance from the two bays. The 260 overall posterior median for ξ is around -0.16 for all models. However, for the Separate model the posterior median tends to  posterior parameter estimates is taken into account, ξ is consistently negative for all hierarchical models, although for Separate 265 the posterior distribution of ξ has a long tail towards positive values in some locations. The typically negative shape parameter value for all models suggests that annual maximum sea levels follow a Weibull-type distribution on the Finnish coast. This result is in line with the previous study by Marcos and Woodworth (2017), which also suggested a negative shape parameter for the Finnish and the neighbouring tide gauges. A slightly contrasting result was obtained by Soomere et al. (2018), who concluded that the shape parameter is close to zero along the Estonian coast. However, they used ocean model output instead 270 of observations in their analysis, and as the behavior of annual sea level extremes is likely different on the Estonian side of Gulf of Finland, their results are not directly comparable to ours.
When ξ < 0, the return level y p has an upper limit at y 0 = µ − σ/ξ, which could be used, at least in theory, to estimate the highest possible value that the annual maximum sea level can reach along the Finnish coast. We illustrate the upper limit estimates in the next section.

275
To conclude, compared to the separate fits, the hierarchical models have a narrower uncertainty range for the scale and shape parameter. Overall, posterior parameter distributions are very similar for the hierarchical models regardless of the tide gauge.
This shows that pooling information across the tide gauges narrows the uncertainty range in the posterior parameter estimates.

Posterior predictive simulations
The estimated GEV parameters shown in the previous section can be used to calculate theoretical estimates for the N-year 280 return levels on tide gauge locations. The estimated return levels are illustrated for all models and for two return periods in Fig. 7. The upper panel shows box-plots of 50-year return levels together with the return level estimated empirically from observations. The empirical estimate has been obtained by interpolating between the observed annual maximum values, as the slightly below the 95% uncertainty range. One must keep in mind that due to the short length of the available time series, the empirical return level estimates are uncertain and likely affected by sampling variability.
The lower panel in Fig. 7 shows estimates of much rarer, 1000-year return levels. In this case, differences between Separate and the hierarchical models have become noticeably more visible compared to the 50-year return level. The median values are slightly higher for the hierarchical models on tide gauges from Oulu to Pietarsaari, while the Separate model has markedly higher median values in many tide gauges from Vaasa to Hamina. For Separate model, the 95 % uncertainty range in the return level estimates tends to be smaller (larger) in the former (latter) locations, although the uncertainty range is in absolute terms largest for this model in all tide gauges. Hierarchical models predict very similar 1000-year return level values and also have similar uncertainty ranges in all locations. Most of the features in the return level estimates obtained with the Separate model are directly related to the larger uncertainty in the estimated shape parameter values, which also exhibit larger spatial variations 295 for this model than for the hierarchical models.
As was mentioned in Sect. 2, individual extremely high sea level observations might markedly influence model fitting.
To check this, we performed a sensitivity test in which the observation from January 1984 was left out of the time series before fitting the models to the data. Removal of this observation lead to a slightly more negative shape parameter value for the Separate model especially in Kaskinen and Pori, which reduced its median estimate of 1000-year return level closer to 300 the hierarchical models in these locations (not shown). Return level estimates for the hierarchical models were only slightly changed in this test, which highlights higher sensitivity of separate model fits to anomalous observations. We note that the 1000-year return level estimates require extrapolation to such cases that are not present in the observational records. Therefore, the aforementioned results should be interpreted cautiously. As discussed in the previous section, the mostly negative shape parameter suggests that, in theory, we could infer an upper 305 limit the sea level height can reach given the assumptions on data and models are met. As a sanity check for our models, we briefly illustrate how high values the hierarchical models give at the theoretical upper limit. We stress that the shown values should not be interpreted as actual limits for the sea level but, rather, as a hypothetical result provided by the hierarchical models.
The distribution of µ − σ/ξ is illustrated for the hierarchical models in Kemi tide gauge in Fig. 8. All distributions look very 310 similar and are positively skewed towards larger values. The median value is almost identical for all models and exceeds 260 cm. These values are 63-65 cm higher than the largest value in the observed time series which sounds reasonable, although the very long right tail highlights the substantial uncertainty associated with the upper level estimates.
The theoretical upper limit estimates are summarised for the rest of the tide gauges in Table 2. Unsurprisingly, the median estimates are highest at those tide gauges, which are located closest to the end of the Bothnian Bay and Gulf of Finland, where 315 they exceed 250 cm. As in Kemi, all hierarchical models give very similar results. Overall, differences between the highest observed sea levels and the theoretical upper limit vary between 47-73 cm, depending on the location. Interestingly, there is also a local maxima in the upper limit estimates in Kaskinen, which is similar to the feature that was seen in the 1000-year return level estimates for Separate in Fig. 7.

Tide gauge Common Spline GP
Kemi 263 (65) 265 (67) 264 (65) Oulu 248 (70) 251 (73) Table 2. Median value of the theoretical upper limit (µ − σ/ξ) of the annual maximum sea level in units of centimeters, calculated separately for each tide gauge. Differences to the highest observed annual sea levels are provided in the brackets.

320
To put the obtained results into a broader perspective, we briefly compare our results to those available in the literature from both statistical and dynamical model studies. We stress that this comparison is qualitative at the best, as it is difficult to make direct comparisons due to differences in the used methods and observations. Hanko and 186 cm in Helsinki) are higher than our 1000-year return level estimates for the hierarchical models and closer to their median upper limit estimates. Särkkä et al. (2017) performed an 850-year long simulation with a hydrodynamic ocean model using regional climate model simulations as input. In their results, sea level could reach 225 cm height in Helsinki, which is somewhat higher, but still similar to our median upper limit estimate in this location.
We recognise some limitations in our study. Firstly, long-term changes in mean sea level were excluded from the analysis by detrending the time series. Inclusion of mean sea level would be required, if we were to assess the overall flooding risk related to sea level extremes both in the present and future climate, as mean sea level has been a major driver of sea level extremes also 335 in the Finnish coastal region (e.g., Marcos and Woodworth, 2017). Furthermore, studies have suggested that future changes in the sea level extremes are strongly associated with changes in mean sea level in the Baltic Sea region (e.g., Meier et al., 2004;Gräwe and Burchard, 2012;Vousdoukas et al., 2016Vousdoukas et al., , 2017. In Finland, changing mean sea level is expected to increase flooding risk along the coast of Gulf of Finland and the Gulf of Bothnia, whereas post-glacial land uplift is likely to counter most of changes in the mean sea level in the Bothnian Bay .

340
Secondly, it was assumed that the time series of sea level extremes are stationary after detrending. This assumption is unlikely to be completely true. Studies have indicated that sea level extremes have exhibited variations in the Baltic Sea region over different time scales (Johansson et al., 2001;Ribeiro et al., 2014;Marcos and Woodworth, 2017;Kudryavtseva et al., 2021) and that these variations were associated with variations in large scale atmospheric conditions (Samuelsson and Stigebrandt, 1996;Johansson et al., 2001;Ribeiro et al., 2014;Marcos and Woodworth, 2017;Kudryavtseva et al., 2021;Passaro et al., 2021). As 345 with changing mean sea level, future climate is also expected to bring changes in storm surges in the Baltic Sea region (e.g., Vousdoukas et al., 2016Vousdoukas et al., , 2017Vousdoukas et al., , 2018. Our models do not account for non-stationarity in extremes related to short-term sea level variations and cannot be used to assess such changes. One remedy would be to model temporal dependence directly by allowing (e.g.) linear trends in the GEV parameters (e.g., Ribeiro et al., 2014;Marcos and Woodworth, 2017;Kudryavtseva et al., 2021). Physical covariates, which describe large scale 350 atmospheric circulation conditions around the Baltic Sea region, could also be used to account for interannual-to-decadal scale variations in the annual maximum sea levels. For example, Marcos and Woodworth (2017) assessed connections between sea level extremes and North Atlantic Oscillation (NAO) index in the North Atlantic region. Their results showed that even after taking the effect of mean sea level into account, NAO explained part of temporal variations in the sea level extremes in the Baltic Sea. We aim to incorporate such physical covariates in our models in the following studies.

Conclusions
In this study, Bayesian hierarchical modeling was applied to estimate theoretical return levels from annual maximum sea level on tide gauges located along the Finnish coast. Three hierarchical descriptions of the parameters of the generalised extreme value (GEV) distribution were compared against individual fits (Separate) at twelve tide gauges. The main motivation was to test how hierarchical description of distribution parameters affects the estimation of, and uncertainty in, GEV parameters 360 and the corresponding estimates of return levels. The simplest hierarchical model (Common) assumes that station specific distribution parameters come from the same joint Gaussian hyper-distributions. In addition, two hierarchical model structures based on B-splines and Gaussian processes (Spline and GP) were implemented in which the distance with respect to Kemi tide gauge along the Finnish coast was used as a covariate in the model formulations. From these models, Spline and GP allow, in principle, the estimation of return levels between the tide gauges.

365
The main results obtained from subsequent analysis are: -All hierarchical models provide similar results for the GEV parameters and reduce the posterior parameter uncertainty in comparison to the Separate model. In addition, the shape parameter is consistently negative along the Finnish coast with median values around -0.16 for all models. From theoretical perspective, this suggests that there might be an upper limit that the sea level extremes can reach along the Finnish coast.

370
-Examples for the return level estimates show that the 50-year return level is well captured by the hierarchical models.
For rarer (1000-year) events, individual fits tend to have a larger uncertainty range and they also give higher return level estimates compared to the hierarchical models in most locations.
-Median values of the theoretical upper limits computed with the hierarchical models indicate differences of 47-73 cm in comparison to the observed maximum sea levels in the observational time series. However, uncertainty in these estimates 375 is large and therefore, they should be interpreted with extreme caution.
While the results suggest that our hierarchical models provide an improvement over separate fits to tide gauge time series, the study is region specific and exploits regional geographical features. Thus, our results might not generalise to other regions.
Other modeling approaches should be included along our approach and compared with the local sea level observations to find the best solutions. One must also keep in mind that the results still contain uncertainties due to possibly missing components 380 (long-term temporal evolution, physical information) in the model formulation and the limited sample size for the observed extremes. To conclude, care needs to be taken when interpreting the above-mentioned results. Improvements for the tested models, e.g., by including missing non-stationary components, will be addressed in future studies.
Code and data availability. Code for reproducing this study has been made publicly available in ZENODO https://doi.org/10.5281/zenodo. 5805120 (Räty and Laine, 2021). The package contains the implemented Stan models, R scripts for running the models and for reproducing 385 the analysis made in this study. The detrended tide gauge time series of annual maximum sea levels required by the scripts are also available in ZENODO https://doi.org/10.5281/zenodo.5807461 (Räty and Johansson, 2021).
Author contributions. All authors contributed to the design of the study. MJ and JS provided the tide gauge data and OR pre-processed the data with the help of ML, JS and MJ. OR and ML implemented the statistical models and conducted the analysis presented in the paper. OR, UL and ML wrote the manuscript with critical comments from the other authors.