Correlation of wind waves and sea level variations on the coast of the seasonally ice-covered Gulf of Finland

Both sea level variations and wind-generated waves affect coastal flooding risks. The correlation of these two phenomena complicates the estimates of their joint effect on the exceedance levels for the continuous water mass. In the northern Baltic Sea the seasonal occurrence of sea ice further influences the situation. We analysed this correlation with 28 years (1992– 2019) of sea level data, and four years (2016–2019) of wave buoy measurements from a coastal location outside the City of Helsinki, Gulf of Finland in the Baltic Sea. The wave observations were complemented by 28 years of simulations with a 5 parametric wave model. The sea levels and significant wave heights at this location show strongest positive correlation (τ = 0.5) for southwesterly winds, while for northeasterly winds the correlation is negative (-0.3). The results were qualitatively similar when only the open water period was considered, or when the ice season was included either with zero wave heights or hypothetical no-ice wave heights. We calculated the observed probability distribution of the sum of the sea level and the highest individual wave crest from the simultaneous time series. Compared to this, a probability distribution of the sum calculated 10 by assuming that the two variables are independent underestimates the exceedance frequencies of high total water levels. We tested nine different copulas for their ability to account for the mutual dependence between the two variables. Copyright statement. TEXT

m (Leppäranta and Myrberg, 2009), while precipitation, evaporation and river runoff play a minor role. On even longer time scales, the Baltic Sea is affected by the global mean sea level rise, which is counterbalanced to varying extent by the local post-glacial land uplift (e.g. Pellikka et al., 2018;Vestøl et al., 2019).
The wind-generated waves in the GoF are influenced by the fetch, the seabed, and narrowness of the gulf (Kahma and Pettersson, 1994). The measured significant wave height has exceeded 5 m both during easterly and westerly winds (Pettersson 5 et al., 2013;Pettersson and Jönsson, 2005), but the significant wave height close to Helsinki is expected to be less than half of the open-sea value, since waves are effectively attenuated by the coastal archipelago (Björkqvist et al., 2019). Finally, both the sea level variations and the waves are affected by the ice season, which in the Baltic Sea is 5-7 months (Leppäranta and Myrberg, 2009). While the presence of ice somewhat attenuates sea level variability -mainly by blocking the piling-up effect of wind on the sea surface (Lisitzin, 1957) -it completely blocks waves if the ice concentration is high enough.

10
The maximum ice coverage ranges from 12.5 to 100 % of the surface area of the Baltic Sea (Leppäranta and Myrberg, 2009); on normal and severe ice winters the entire GoF becomes ice-covered, while on mild winters only the eastern part of the gulf freezes.
In this paper, we explore the correlation of sea level and wind waves near Helsinki on the northern coast of the Gulf of Finland, with an aim to determine its effect on the exceedance frequencies of their sum. The study is motivated by the high Meteorological Institute (FMI). Missing values were estimated by linear interpolation from the measurements at two other nearby tide gauges on the Finnish coast (Hanko and Hamina). There were 477 interpolated hourly values in 1992-2019, less than 0.2 % of the data. All sea levels in this study are given in relation to the N2000 height system, a Finnish realization of the common European height system (Saaranen et al., 2009).
The average sea level in 1992-2019 at Helsinki was +0.21 m, variations ranging from -0.73 to +1.69 m ( Table 1). The linear trend in the time series was -2 cm in 28 years -the land uplift and sea level rise practically canceling each other. These long-term phenomena thus have a negligible contribution to the sea level variations studied here.

Wind and temperature measurements
FMI's automatic weather stations provided wind speed, wind direction, and air temperature measurements ( Fig. 1) which were 5 used as a forcing for the parametric wave model (Sect. 2.4). From 1992 onwards these stations covered three areas of the Baltic Sea: i) Northern Baltic Proper (NBP) covered by Utö weather station (measuring height 19 metres above mean sea level), ii) Gulf of Finland (GoF) covered by Kalbådagrund and the Helsinki Lighthouse (31 and 33 metres above mean sea level), and iii) the coastal area near Suomenlinna, Helsinki covered by Harmaja (18 metres above mean sea level). The Helsinki Lighthouse weather station was only used for 2019 to cover a Kalbådagrund data gap. 10 We calculated hourly wind and air temperature values by either interpolating the older 3-hour data (up until mid to late 90s) or by calculating hourly averages of later denser measurements. All gaps shorter than six hours were interpolated. Continuous data blocks that were shorter than 12 hours were removed. Sea surface temperature (SST) data were only available from Harmaja starting end of June 1995. For other times and locations we estimated the SST using a mean yearly cycle from the GoF monitoring station LL7 (Fig. 1). The mean is defined from 434 CTD soundings from R/V Aranda between 1992 and 2017. where m 0 is the zeroth moment of the variance density spectrum (see Datawell, 2019). We picked every other value to get a data set with hourly time resolution. Of these values, the highest significant wave heights during 2016-2019 were 4.12 m (GoF) and 1.47 m (Suomenlinna); the mean values were 0.81 m and 0.33 m.

Simulated wave heights
We complemented the coastal wave measurements at Suomenlinna with 28 years (1992-2019) of simulations produced by 30 a parametric wave model. The model uses fetch-limited wave growth relations (Kahma and Calkoen, 1992), while also ac-counting for the wind duration and the atmospheric stratification. A description of the wave evolution scheme is presented in Appendix A.
First we simulated the wave conditions at the GoF wave buoy, which were needed as boundary conditions for the coastal wave hindcast. The GoF simulation accounted for the locally generated waves in the middle of the gulf and waves propagating from the eastern part of the basin and the Northern Baltic Proper -also simulated using the same model and the local wind. 5 We then determined the fraction of the GoF wave energy that arrived to Suomenlinna; these waves have longer periods than what can be generated by the local wind and the nearest fetch, and will henceforth be called "long waves". This long wave energy was quantified as the difference between the observed wave energy (for 2016) and the modelled local wave energy at Suomenlinna. The attenuation -as a function of the wind direction -was determined using a linear regression between the simulated wave energy at the GoF and the estimated long wave energy at Suomenlinna. The final time series of coastal 10 significant wave height was determined from the sum of the simulated local wave energy at Suomenlinna and the attenuated GoF wave energy. The simulated significant wave heights were validated using wave buoy data from 2016-2019. At GoF the -0.02 m bias and 0.28 m RMSE (N=25293) were slightly worse than for state-of-the-art numerical wave models (e.g. Tuomi et al., 2011;Björkqvist et al., 2018). Nonetheless, the accuracy was sufficient to serve as boundary conditions for the coastal simulations 15 at Suomenlinna, which had a 0.00 m bias and 0.10 m RMSE (N=13574) (Fig. 2). For Suomenlinna the validation statistics compare well with those from high-resolution numerical wave model simulations (Björkqvist et al., 2020), but the parametric model showed a slight tendency to overestimate the highest significant wave heights (Fig. 2 b). This is natural, since a linear regression cannot account for the increased wave-bottom interaction for the higher and longer waves during stronger winds that were not captured in the calibration period. We can, however, conclude that the quality of the simulated wave height time

Ice conditions and wave statistics types
Data originating from FMI's ice charts described the ice conditions at the southern point of the Suomenlinna islands (Fig. 1).
The data represent an area with a radius of roughly 1 km and is therefore fairly representative of the wave buoy location. Of the statistics available for the entire study time (1992-2019) we used the start and end dates of the permanent ice-cover. During this time -defined as our ice period -the ice concentration is typically high (over 80 %), making it reasonable to set the significant 5 wave height to zero. The length of our ice period varied from zero (winters 2007-2008, 2017-2018, and 2018-2019) to 139 days (winter 2002-2003). Tuomi et al. (2011) defined four different wave statistics for water bodies with a seasonal ice cover. Using this classification, we compiled the following data sets (Table 1): -Data set N (hypothetical no-ice): full simulated H s data set. As the wave model does not account for ice, this represents a hypothetical situation with no ice present.
-Data set M (measurement statistics): wave observations only (this time is a true subset of the ice-free period because of measurement gaps). 15 We divided the 28-year data sets (F, I, and N) into seven consecutive 4-year periods, from 1992-1995 to 2016-2019. These 4-year time series were, where applicable, used to study temporal variability.

Probability distribution of the total water level
We defined the total water level, z max , as the hourly maximum level to which the continuous water mass reaches as a combined effect of the still water level, z still , and the wave action. To do this, we first assumed that z still does not change during an hour. 20 Neglecting site specific coastal effects, z max in such case is determined by the highest individual wave crest, η max . (Note that wave crest denotes the height above z still , thus being half of the actual height of the highest wave.) At Suomenlinna the highest single wave crest during 30 minutes is approximately 92 % of the significant wave height (Björkqvist et al., 2019), which we rounded up to η max = H s for hourly values. With these assumptions, the maximum water level elevation is The probability distributions of z max were then calculated directly from the time series obtained from the observed and simulated z still and H s data. The distributions of z still and H s both describe the actual duration of the phenomenon, and thus 5 have units of h yr −1 . The distribution of z max , on the other hand, has a unit of events/year, where "event" is defined as the instant hourly maximum total water level.
If simultaneous time series of z still and H s are not available, the cumulative distribution function (CDF) of z max needs to be calculated based on the respective univariate distributions. If the two variables are assumed to be independent (for dependent variables, see Sect. 2.7), the CDF of their sum is the convolution where f x denotes the probability density function (PDF) of z still , and F y denotes the CDF of H s (see Leijala et al. (2018) for more details).

Copula-based probability distributions
The Sklar's theorem (Nelsen, 2006) states that if H is a bivariate CDF with marginal CDFs F 1 and F 2 , then a copula C exists 15 such that Thus, the joint distribution H can be expressed as a function of the marginal distributions F 1 and F 2 and the copula C, which describes the dependency of the variables x 1 and x 2 . If x 1 and x 2 are independent, the copula is simply the product ). An extensive introduction to copula methods is provided by Genest and Favre (2007). A benefit of the method 20 is that the marginal distributions can be of any form.
We studied the suitability of nine different copulas (Gumbel, Frank, Clayton, Normal, t-EV, Hüsler-Reiss, Galambos, Tawn and Plackett) in representing the dependence structure of the observed z still and H s data (2016-2019). We calculated the parameters for each copula with the inversion of the Kendall's τ estimator. To quantify the goodness-of-fit for each of these copulas, we calculated the Cramér-von Mises statistic S n as defined in Equation (2)

Marginal distributions and extrapolation
Probabilities of sea level extremes have been estimated using extreme value distributions with e.g. block maxima (annual/monthly) or peak-over-threshold values (e.g. Arns et al., 2013). However, when the focus is on the sum of sea level and wave height (Eq. 1), it is essential to use simultaneous values, and the extremes generally do not co-occur. Our purpose was to study the correlation between these two variables in general, on a hourly time scale, not just the correlation of extremes. Thus, we chose 5 to base our analysis on the entire data set of hourly values.
Our 28 years of hourly data allow for a direct estimate of the exceedance frequency down to 0.036 h yr −1 , and we therefore had to extrapolate the distribution to cover rarer values. Extremes are typically evaluated using some extreme value distribution (Eelsalu et al., 2014;Särkkä et al., 2017;Kulikov and Medvedev, 2017), but this approach is not applicable to extrapolate hourly data needed for this study.

10
The probability distribution of the observed sea levels on the Baltic Sea coasts is slightly asymmetric: the high-end tail is thicker than the low-end tail (e.g. Johansson et al., 2001;Kulikov and Medvedev, 2017). The significant wave heights in the archipelago, again, show a clear tendency to saturate , and are therefore not modelled well simply by using some commonly used distribution type, such as the Weibull distribution (e.g. Battjes, 1972;Mathisen and Bitner-Gregersen, 1990). We chose to extend the distributions to high (low) extreme values by extrapolating with two-parameter exponential 15 functions F = e −λ(x−x0) fitted to the tail of the CCDFs (CDFs) at exceedance frequencies below 44 h yr −1 , following e.g Leijala et al. (2018). This method was applied consistently to the sea level, significant wave height (only high tail), and total water level data. The PDFs were obtained by numerically differentiating the CDFs.

20
The simultaneous hourly sea levels and significant wave heights (Fig. 3)  The hypothetical no-ice data (type N) show situations with low sea level and wave heights ranging from 0 to 1 m which 30 would have occurred during the ice-covered period, especially in DJF (Figs. 3 and 4). In MAM, also many high-sea-level and hypothetically-high-wave situations occurred during the ice-covered time.
As a measure of the correlation, we used the Kendall's tau coefficient (τ ). It measures rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. Thus, no assumption of a linear relationship between the variables is made. The sea level correlates stronger with the observed wave heights than the simulated wave heights from the same time period (the measurement seasons on 2016-2019) on all seasons except DJF ( Table 2). The observed wave data in DJF are very sparse (Fig. 4a)  the correlation, the only exception being DJF where the data set I shows clearly higher correlation than data sets F or N. There is a tendency of lower sea levels during the ice-covered period (the blue points with H s = 0 in Fig. 4a) which likely enhances the correlation.

Probability distributions
The CCDF of sea level from 2016-2019 shows lower high tail than the CCDF from the entire period 1992-2019 (Fig. 6a).  The shape of the CCDF obtained from the observed wave heights differs from those obtained from the simulated wave heights (Fig. 6b), the exceedance frequency decreasing more rapidly with increasing wave height in the observed CCDF than in the 5 simulated ones. This is in accordance with the parametric wave model validation (Fig. 2d) which showed an overestimation of highest significant wave heights.
The CCDFs of z max calculated by summing the observed sea levels and observed or simulated wave heights according to  The CCDFs based on data sets F, I and N show only minor differences ( Fig. 6 Table 3 shows the parameters of the different copulas fitted to the sea levels and significant wave heights. In Fig. 7, CCDFs obtained from these copulas and the marginal distributions are compared with the observed CCDF and the CCDF obtained by assuming the two variables independent. In accordance with the above results, the observed CCDF shows higher probabilities of high total water levels than the independent case (Eq. 2).

5
The copulas are divided into two distinct groups. The Clayton, Frank and Plackett copulas remain close to the independent case. On the other hand, Tawn, Gumbel, t-EV, Galambos and Hüsler-Reiss copulas capture more of the effect of dependence on the probability distribution, giving higher probabilities for high z max . The Normal copula falls between these two groups. This behaviour is confirmed by the Cramér-von Mises statistics (Table 3), where smaller S n values correspond to better fit. For the latter group S n < 1, while in the first group S n varies from 1.22 to 3.97. The Normal copula, again, falls between these groups with S n = 1.17. Note that even if the tail of the Normal copula is closer to the observations in Fig. 7 than the five copulas with smaller S n , those five copulas more closely follow the observations above 10 events/year, where most of the mass of the distribution is, leading to a better fit.

5
We tested the effect of wind direction on the copula results by doing the copula analysis separately for two subsets of the data: situations with southwesterly wind (160 • -340 • , Fig. 7b) and northeasterly wind (340 • -160 • , not shown). The results for the southwesterly-wind subset are qualitatively similar to those including all data: copulas are divided into two groups.
The stronger correlation (τ = 0.269) results in somewhat higher probabilities of high total water levels in the distributions.
In the northeasterly-wind case, on the other hand, the non-existent correlation (τ = -0.012) leads to the copula analysis being 10 somewhat pointless, as all the copulas end up close to the independent case.
To further illustrate the differences in bivariate distributions based on different copulas, we simulated an ensemble of 21 546 samples from each, these being of the same size as the observed data set (Fig. 8). The copulas with S n < 1 (Tawn, Gumbel, t-EV, Galambos and Hüsler-Reiss) tend to show a positive tail dependence: the slight bending of the scatter cloud into a tail on the upper right corner. The copulas with S n > 1 do not show such an apparent positive tail. 15

Discussion
As the wave conditions in the archipelago markedly differ from those on the open sea, a multi-year time series of wave data from a coastal location such as Suomenlinna provides an excellent material for studies. As more data accumulate, it will provide more comprehensive knowledge on the local wave behaviour. We compensated for the shortness of the wave buoy time series by using 28 years of simulated wave data. The lightweight parametric wave model is considerably faster to run than high-resolution numerical wave models. Nevertheless, the bias and RMSE of the simulations were comparable.
The validation suggested that the highest wave heights are overestimated in the simulations. The attenuation of longer waves through bottom processes is indirectly accounted for in the model through the calibration against measurements. Nonetheless, this calibration might not hold for even harsher weather conditions and even longer waves. While the simulations offer a good 5 tool for assessing the general dependence between the wave height and sea level variations, they should not be used as is for analyzing extreme values. If the wave simulations of the parametric model are to be used for extreme conditions, the wavebottom interactions need to be accounted for in a more explicit manner, and such improved model needs to be re-validated.
One of the advantages of a wave model is its ability to simulate the wave conditions during ice-covered periods, or periods when there is a risk of freezing and the wave buoy cannot be kept in the sea. This extends the time series and provides an 10 opportunity to analyse "what would happen if there was no ice?" Such question is relevant when we consider the conditions in the GoF in a warmer future climate, for instance. Our results did not indicate that the wave or sea level conditions, or the correlation of these, would markedly differ from those occurring during the ice-free period in the present climate.
The wave conditions vary significantly among 4-year periods (Fig. 6). The observational period of this study (2016)(2017)(2018)(2019) was relatively calm: the 4-year maximum sea level (1.11 m) was lowest among the seven consecutive 4-year sets, while the 15 maximum simulated significant wave height (1.89 m) ranked fourth. To obtain reliable estimates for probabilities of extreme sea levels on the Finnish coast, at least 30 years of data are generally used (e.g. Pellikka et al., 2018;Kahma et al., 2014).
Our results obtained from 4-year time series can only be considered a methodological analysis on the applicability of different methods for obtaining exceedance frequencies of high total water levels. They do not provide any flooding risk estimates for practical applications. Additionally, considering flooding on the coast, the relevant parameter is the wave run-up. While our meeting the coastline. On a steep coast, a more appropriate estimate for the total water level would be z still + 2 · H s due to wave reflection , while on a shallow coast the eventual run-up on the coastline is less than z still + H s .
The behaviour of the correlation with respect to wind direction (Fig. 5) is readily explained by the geography of the GoF.
Southwesterly winds push water into the gulf and simultaneously raise high waves, leading to positive correlation. Northeasterly winds, on the other hand, push water out from the GoF while still raising waves, which shows up as a negative correlation.
When the correlation is calculated from the entire data set, these two opposing situations with different mechanisms likely cancel each other and reduce the overall correlation. To more completely describe the effect of correlation on the distribution 5 of the total water level, and thus the flooding risks, one option might be to use three variables instead of two: still water level, significant wave height and wind direction. We leave this as an option for future studies.
The sea level processes in the GoF act on different time scales, from the sub-daily storm surges to sub-weekly Baltic Sea internal variations, and weekly and longer-term changes in the total water volume. From one day to one week, the Baltic Sea mainly responds to wind and air pressure forcing like a closed basin. The response time of such variations, e.g. the transport of 10 water between the GoF and the Baltic Proper in a seiche oscillation, is usually of the order of one day. In such time scale, the wind-generated waves have time to develop, and these phenomena co-occur.
The sea level variations in time scale longer than a week usually involve changes in the Baltic Sea water volume. These are related to longer-term (up to 60 days; Särkkä et al., 2017) prevailing wind conditions, which control the in-or outflow of water in the Danish Straits. These should not have a direct relationship with the local waves. However, the longer-term weather 15 conditions correlate with short-term weather phenomena too: e.g. prevailing westerly winds are related to more low-pressure systems with strong winds travelling over the Baltic Sea, and might thus contribute to the overall correlation.
Our results demonstrated that the assumption of independence of sea level and significant wave height leads to an underestimation of the total water levels corresponding to certain exceedance frequencies. The exceedance frequency of 0.01 events/year, for instance, is relevant for flooding risk considerations. The total water levels corresponding to this frequency 20 were underestimated by 0.7 m in the 28-year data set, and 0.1-1.2 m in the 4-year data sets (the difference of the distributions in Figs. 6c and 6d). With lower exceedance frequencies, the underestimation generally increases. This underlines the need to take the correlation into account when estimating probabilities of water level extremes on the coast. It turned out that a suitable copula function is able to incorporate the effect of the correlation on the CCDF to some extent.

25
Our main conclusions, related to the research questions presented in Sect. 1, are as follows.
The sea level variations and significant wave heights show a positive correlation in general (τ = 0.20). The correlation depends on wind direction: southwesterly winds lead to strongest positive correlation (up to τ = 0.5), while northeasterly winds lead to zero or negative correlation. Ice conditions did not have an apparent effect on the correlation: including hypothetical no-ice wave heights during the ice season did not markedly alter the correlation, and neither did the setting of the wave height 30 to zero during the ice-covered time.
The 4-year probability distributions of the total water level, calculated as a sum of the observed sea level and observed/simulated wave heights, give 0.1-1.2 m higher values for extremely high sea levels (0.01 events/year) than the distributions calculated by assuming the two variables independent. This underlines the importance of accounting for the dependence between the variables when calculating the probabilities of high total water levels e.g. for flooding risk estimates.
The probability distribution of the sum based on the copula approach is closer to the observed distribution than the distribution based on the independence assumption. Of the nine copulas studied, the Tawn, Gumbel, t-EV, Galambos and Hüsler-Reiss copulas proved most suitable for our data.

5
Code and data availability. The wave, wind and sea-level observations are available through FMI's open data portal. The code for the parametric wave model will be made available in a public repository upon the publication of the final paper.
Appendix A: Model scheme for wave evolution The core idea of the wave model is based on known universal properties of different dimensionless parameters, namely the energy, peak frequency, phase speed, fetch, and time: 15 The wave growth in the model is quantified using the fetch limited wave growth relations of Kahma and Calkoen (1992): The coefficients p 1 , p 2 , q 1 , q 2 for the wave growth relations depend on the atmospheric stratification as presented by the authors. 20 Instead of assuming that the waves are fetch limited at each time step, the model simulates a more realistic wave growth and decay by also accounting for the limitation of wind duration through the following propagation scheme: 1. Calculate the dimensionless fetch,X ε i−1 , based on the energy from the previous time step, ε i−1 , and the current wind speed, U i .
3. Determine an enhanced dimensionless time step as ∆t +t 0 , where ∆t is the fixed time step of the model (e.g. 1 h).
5. Calculate the minimum dimensionless fetch for a fully developed sea (using the wind speed at 10 metre height) as: 6. Set the relevant fetch toX i = min{X t ,X F D ,X}, whereX is the physical dimensionless fetch.
7. Calculate the new energy, ε i , and peak frequency, f i p , usingX i , U i , and Equations A6 and A7.   Competing interests. No competing interests are present.