Identifying the non-exceedance probability of extreme storm surges as a component of natural-disaster management using tidal-gauge data from Typhoon Maemi in South Korea

Global warming, one of the most serious aspects of climate change, can be expected to cause rising sea levels. These, in turn, have been linked to unprecedentedly large typhoons that can cause flooding of low-lying land, coastal invasion, seawater flows into rivers and groundwater, rising river levels, and aberrant tides. To prevent loss of life and property damage caused by typhoons, it is crucial to estimate storm-surge-related risk accurately. This study therefore develops a statistical model for estimating the non-exceedance probability of extreme storm surges, based on surge data pertaining to Typhoon Maemi, which struck South Korea in 2003. Specifically, non-exceedance probability models of the typhoon-related storm surge were estimated via clustered separated peaks-over-threshold simulation, and various distribution models were fitted to the empirical data to investigate the risk posed by storm-surge height. This process found that the Weibull distribution fitted Typhoon Maemi's peak total water levels better than the other distribution models. The findings provide a viable method of predicting the economic losses associated with typhoons, and corresponding models for managing emergency situations arising from natural disasters, which can be useful to those tasked with planning for such events.

A storm surge is defined as the difference between observed and predicted sea level. In the present study, Busan, a major metropolitan area on the south-eastern coast of South Korea, has been used as a case study. According to Yoon and Kim's (2012) calculations, the sea level around Busan rose by an average of 1.8mm/year from 1960 to 2010, i.e., roughly the same as the global trend over the same period.

Typhoon trends in South Korea
The Korean Peninsula is bounded by three distinct sea-systems, generally known in English as the Yellow Sea, the Korea Strait, and the Sea of Japan. This exposure has often led to severe damage to its coastal regions. According to the Korea Ocean Observing and Forecasting System (KOOFS), Typhoon Maemi in September 2003 had a maximum wind speed of 54 meters per second (m/s) and caused US$35 billion in property damage, as shown in Table 1. All three of the highest peaks ever recorded by South Korea's tidal-gauge stations also occurred in that month. The most typhoon-heavy month there is August, followed by July and September, with two-thirds of all typhoons occurring in July and August. Tables 2 and 3, below, present statistics about typhoons in South Korea over a 68-year period and the most recent 10 years, respectively; Figure 1 shows the track of Typhoon Maemi from September 4 to September 16, 2003. As can be seen from Figure 1, Typhoon Maemi passed into Busan from the southeast, causing direct damage upon landfall, after which its maximum 10-minute sustained wind speed was 54 m/s. Typhoon Maemi prompted the insurance industry, the Korean government, and many academic researchers to recognize the importance of advance planning and preparations for such storms, as well as for other types of natural disasters.

Tidal gauge stations in South Korea
Effective measures for reducing the damage caused by future typhoons, especially the design and re-design of waterfront infrastructure, will require accurate prediction of storm-surge height. As of 2019, South Korea operated 17 tidal-gauge stations, of which eight had been collecting data for 30 years or more. They were located on the western (n=5), southern (n=10), and eastern coasts (n=2). This study focuses on the 15 tidal-gauge stations located on the southern and western coasts (Figure 2). The remaining two stations were excluded because the majority of typhoons do not arrive from the east or make landfall on that coast. The hourly tidal data for this study have been provided by the Korea Hydrographic and Oceanographic Agency (KHOA) and are used with that agency's permission.

Highest recorded water levels
The western tidal-gauge stations are located at Incheon, Gyeongin, Changwon, Gunsan, and Mokpo. Each of the five has operated for a different length of time, ranging from 2 to 61 years. After collecting the sea levels observed hourly by each station throughout their respective periods of operation, the top three sea-level heights at each were obtained. These heights, which are shown in Table 4, are clearly correlated with the dates of arrival of typhoons. The same approach was applied to the data from the 10 tidal-gauge stations on the south coast, as shown in Tables 5 and 6, below.

Tidal gauge station at the City of Busan in South Korea
One of the focal tidal-gauge stations has observation records covering more than half a century. It is located on the south coast in Busan, South Korea's second-largest city. Thanks to its location near the sea, Busan's international trade has boomed, and as a consequence it now boasts the largest port in South Korea. The Nakdong, the longest and widest river in South Korea, also passes through it. Due to these geographical characteristics, Busan has been very vulnerable to natural disasters, and the importance of accurately predicting the characteristics of future storms is increasingly recognized by its government and other stakeholders. The top three sea-level heights at the tidal-gauge station there are shown in Table 6, below. Looking at the sea-level history in Figure 3, it is clear that the data trend between 1956 and 1961 is anomalous. As this may have been due to quality-control issues with the observations from that period, it has been excluded from this study, and only data from 1962 to 2019 have been used, as shown in Figure 4. KHOA makes hourly observations of water height at the Busan tidal-gauge station, and our annual means have been calculated from these hourly data. As seen in Figure 4, plotting MSL for each year confirms that short-term water-level variation merely masks the long-term trend of sea-level increase.
Therefore, acting on the assumption that MSL variation is a function of time, a linear regression was performed, with the resulting slope coefficient indicating the rate of increase (Yoon and Kim 2012), i.e., y=0.24x-443.96 in our case.
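As a sketch of this step, the regression can be reproduced with an ordinary least-squares fit; the year and MSL values below are invented placeholders, not the Busan record.

```python
# Least-squares trend of annual mean sea level (MSL) against year.  The data
# here are illustrative stand-ins; the study's actual fit on Busan data
# (1962-2019) gave y = 0.24x - 443.96, i.e. roughly 0.24 cm/year of rise.
def linear_fit(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

years = [1962, 1975, 1990, 2005, 2019]    # hypothetical sample years
msl_cm = [27.0, 30.5, 34.0, 37.8, 41.0]   # hypothetical annual MSL (cm)
slope, intercept = linear_fit(years, msl_cm)
print(f"trend: {slope:.2f} cm/year")
```

The slope of the fitted line is the quantity of interest here: it is the estimated rate of sea-level rise over the record.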

Relationship between sea level and typhoons
When a storm occurs, surge height tends to increase, and these larger surges can cause natural disasters such as floods. In this study, before calculating the height of a surge, we took account of the dates and times when the three greatest sea-level heights were observed, as well as the dates and times when typhoons occurred. These data are presented side by side in Table 7. As can be seen from Table 7, above, the three highest recorded sea levels at each south-coast tidal-gauge station corresponded with the occurrence of typhoons in 20 out of 30 cases. The dates and times of the three highest sea levels observed during Busan's entire observation period (1962-2019) all coincided with Typhoon Maemi passing out of the area.
As well as US$35 billion in property damage, Typhoon Maemi caused 135 casualties in Busan and nearby cities (National Typhoon Center 2011). However, other typhoons, notably Thelma, Samba, and Megi, also caused very significant damage, as shown in Table 1.

Research objective
The objective of this study is to estimate the probability, expressed as a risk in years, of typhoon-induced high water levels, in a form that can be used by South Korea's government agencies, insurance companies, and construction industry. And, although this study focuses on a specific city-region, the proposed probabilistic methodologies should be applicable to other coastal regions in South Korea and around the world.
2 Literature review

Prior studies of Typhoon Maemi
Most previous studies devoted to avoiding or reducing natural-disaster damage in South Korea have focused on storm characteristics, such as storm track, rainfall, radius, and wind-field data. Their typical approach has been to create synthetic storms that can be utilized to predict real storms' paths and estimate the extent of the damage they would cause. Kang (2005) investigated the inundation and overflow caused by Maemi at one location near the coast, using a site survey and interviews with residents, and found that the storm surge increased water levels by 80%. Hur examined the storm using a numerical model. However, while past research on Typhoon Maemi has used such input data as atmospheric pressure, wind fields, typhoon radius, storm speed, latitude, longitude, and tidal-gauge data, the tidal-gauge data have not been used for estimating the exceedance probability of storm surge. For instance, Kim

Return period estimates for Hurricane Sandy
While no prior research has estimated return periods for typhoons, some has done so for hurricanes. For example, Lopeman argued that Sandy's perpendicular impact angle with respect to the shore as it passed to the south of Manhattan's port was of critical importance, based on an analysis of the tracks of other hurricanes of similar intensity. They estimated the return period for Sandy's water level to be 714 years, with a 95% CI of 435-1,429 years.
Zervas (2013) estimated the return periods for extreme events using monthly mean water-level data from the U.S. National Oceanographic and Atmospheric Administration, recorded at the tidal-gauge station in Battery Park, New York. Using generalized extreme-value (GEV) distribution and maximum-likelihood estimation (MLE) methods, Zervas calculated that the return period for Sandy's peak water level was 3,500 years; but sensitivity analysis suggested that the estimated results were probably inaccurate, given the GEV fit's sensitivity to the range of years used. Once Sandy was excluded, the return period was 60,000 years. This difference in results suggests that GEV distribution of the yearly maximum water level is not a realistic method for estimating extreme events in the New York Harbor area.
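The GEV-plus-MLE approach Zervas used can be sketched as follows. The annual maxima here are synthetic (quantile-matched draws from a chosen GEV, not the Battery Park record), and `scipy.stats.genextreme` is assumed to be available.

```python
import numpy as np
from scipy.stats import genextreme

# Synthetic "annual maxima": 58 quantile-matched draws from a chosen GEV.
# A real analysis would use the station's observed yearly maxima instead.
q = (np.arange(1, 59) - 0.5) / 58
annual_max = genextreme.ppf(q, c=-0.1, loc=150, scale=20)

# Fit a GEV by maximum likelihood, then convert a candidate level into a
# return period via 1 / (1 - F(level)).
c_hat, loc_hat, scale_hat = genextreme.fit(annual_max)
level = 250.0  # candidate extreme water level (cm)
p_exceed = genextreme.sf(level, c_hat, loc=loc_hat, scale=scale_hat)
return_period = 1.0 / p_exceed
print(f"return period for {level:.0f} cm: about {return_period:.0f} years")
```

As the Zervas results illustrate, the fitted tail (and hence the return period) is very sensitive to which years enter the sample.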
Building on their own past research, Lopeman et al. (2015) were the first researchers to estimate Sandy's return period using tidal-gauge data. They proposed that CSPS should be used, and that tide fluctuation, surge, and sea-level rise should all be dealt with separately, as of these three phenomena, only surge is truly random. This approach led them to calculate the return period as 103 years, with a 95% CI of 38-452 years. The sharp differences in the results of the past studies cited above are due to wide variations in both the data they used and their assumptions. The present study therefore applies all of the methods used in previous studies of Hurricane Sandy's return period to estimate that of Typhoon Maemi, and in the process, establishes a new model.
https://doi.org/10.5194/nhess-2020-379 Preprint. Discussion started: 21 November 2020 c Author(s) 2020. CC BY 4.0 License.

Generalized extreme value distribution
Extreme events are hard to predict because data points are so few, and predicting their probability is particularly difficult due to their asymptotic nature. Extreme-value probability theory deals with how to find outlier information, such as maximum or minimum data values during extreme situations. Examining the tail events in a probability distribution is very challenging.
However, it is considered very important by civil engineers and insurers, due to their need to cope with low-probability, high-consequence events. For example, the designs and insurance policies of bridges, breakwaters, dams, and industrial plants located near shorelines or other flood-prone areas should account for the probability, however low, of major flooding.
Various probability models for the study of extreme events could potentially be used in the present research, given that its main theme is the extreme high water levels caused by typhoons. Extreme-value theories can be divided into two groups, according to how they are defined. In the first, the entire interval of interest is divided into a number of subintervals. The maximum value from each subinterval is identified as the extreme value for it, and then the collection of these extreme values converges to a GEV distribution. In the second group, values that exceed a certain threshold are identified as extreme, and converge to a generalized Pareto distribution (GPD). In the following two subsections, we will discuss the block maxima (BM) and the peaks-over-threshold (POT) methods, as illustrations of these two groups, respectively (Coles 2001).

Block maxima method
The BM approach relies on the distribution of the maximum extreme values, given by

M_n = max(X_1, X_2, ..., X_n),    (1)

where the series X_1, X_2, ..., X_n comprises independent and identically distributed random variables; n is the number of observations in a year; and M_n is the annual maximum.
Data is divided into blocks of specific time periods, with the highest values within each block collectively serving as a sample of extreme values. One limitation of this method is the possibility of losing important extreme-value data because only the single largest value in each block is accounted for, and thus, the second-largest datum in one block could be larger than the highest datum in another.
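The BM sampling step can be illustrated with a toy series; the block size here stands in for the 8,760 hourly observations of a real year.

```python
# Split a series into fixed-size blocks and keep each block's maximum.
hourly = [12, 48, 33, 51, 7, 22, 61, 18, 40, 44, 9, 30]  # toy water levels
block_size = 4  # stand-in for one "year" of hourly data

block_maxima = [max(hourly[i:i + block_size])
                for i in range(0, len(hourly), block_size)]
print(block_maxima)  # → [51, 61, 44]

# The limitation described above: 48 in the first block is discarded even
# though it exceeds 44, the maximum of the third block.
```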

Peaks-over-threshold method
The POT method can address the above-mentioned limitations of BM, insofar as it gathers all the data points that exceed a certain prescribed threshold, and uses limited data more efficiently because it relies on relatively large or high values instead of only the largest or highest ones. All values above the threshold, known as exceedances, can be described by a separate tail distribution. The basic function of this threshold is to separate the larger or higher values from the rest of the data; the set of exceedances constitutes the sample of extreme values. This means that, although POT can capture potentially important extreme values even when they occur close to each other, selecting a threshold that will yield the best description of the extreme data can be challenging (Bommier 2014). An exceedance is defined as

y = x - u, for x > u,    (2)

where u is the threshold and X is a random variable. Also, as shown in Equation (3), the distribution of exceedances F_u(y) can be defined by conditional probabilities:

F_u(y) = P(X - u <= y | X > u), y >= 0.    (3)

According to Bommier (2014), the distribution of exceedances F_u(y) can be generalized by GPD under the assumption that, for a sufficiently high threshold u, F_u(y) can be described by H(y), where y_i is the i-th exceedance. Equation (4) expresses the GPD,

H(y) = 1 - (1 + ξy/σ)^(-1/ξ),    (4)

with the X_i being independent and identically distributed random variables; σ, the scale; ξ, the shape; and u, the threshold. All values above u are considered tail data (extreme values). When calculating the return level that is exceeded once every N years (the N-year return period), Equation (5) describes the probability of exceedance over a threshold:

P(X > x | X > u) = (1 + ξ(x - u)/σ)^(-1/ξ).    (5)
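Equations (4)-(5) can be evaluated directly. The parameter values below are arbitrary illustrations, not the study's fitted values, and `scipy.stats.genpareto` is assumed to be available.

```python
from scipy.stats import genpareto

# Illustrative GPD parameters only: threshold (cm), scale, shape.
u, sigma, xi = 30.0, 8.0, 0.1

x = 60.0  # a water level of interest (cm)
# P(X > x | X > u) = (1 + xi*(x - u)/sigma)^(-1/xi): GPD survival function
# evaluated at the exceedance x - u.
p = genpareto.sf(x - u, c=xi, loc=0, scale=sigma)
print(f"P(exceedance beyond {x - u:.0f} cm above threshold) = {p:.4f}")
```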
If the exceedances above the threshold are rare events (as measured by the number of observations per year), we can expect the number of exceedances N to follow a Poisson distribution. The mean number of exceedances per unit of time, λ̂, describes that distribution. In other words, λ̂ can be estimated by dividing the number of exceedances by the number of years in the observation period.
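This estimate is a one-line computation; the numbers below reuse the 31.2cm case reported later in the text (189 clusters over the 54 years of data).

```python
# Mean exceedance rate: clusters per year of observation (Poisson intensity).
n_clusters = 189   # exceedance clusters above a 31.2 cm threshold (see text)
n_years = 54       # years covered by the dataset (see text)

lam = n_clusters / n_years
print(f"estimated rate: {lam:.2f} clusters/year")
```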
Combining the POT and Poisson processes with GPD allows us to describe the probability of the extreme values that exceed the designated threshold, as per Equation (7), in which the Poisson rate λ̂ scales the GPD tail probability:

P(X > x) = λ̂ (1 + ξ(x - u)/σ)^(-1/ξ).    (7)

And when Bayes' theorem is applied to the role of GPD in conditional probability, Equation (7) can be rewritten as Equation (8). To determine the height of surges from publicly available KHOA data, it is first necessary to predict sea levels. Equation (9) expresses the relationship among the observed water level, the predicted water level (tidal-fluctuation height), and the residual (surge) at time t_i:

y_obs(t_i) = y_pred(t_i) + s(t_i),    (9)

where i = 1, 2, …, n; n is the length of the time series of the input dataset; y_pred(t_i) is the predicted water height at t_i; y_obs(t_i), the observed water height at t_i; and s(t_i), the surge height.
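Equation (9) amounts to a subtraction per time step, sketched below with invented heights.

```python
# Surge as the residual of Equation (9): observed minus predicted water level.
observed = [101.2, 103.5, 110.8, 118.4]   # hypothetical hourly heights (cm)
predicted = [100.0, 102.0, 104.0, 105.0]  # hypothetical harmonic predictions

surge = [round(o - p, 1) for o, p in zip(observed, predicted)]
print(surge)  # → [1.2, 1.5, 6.8, 13.4]
```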

Separation of tidal gauge data (harmonic analysis)
In this study, a standard harmonic analysis was performed to calculate predicted sea-level height based on hourly data. First, this technique was used to estimate the tidal components from the total seawater-level data, allowing residuals to be isolated so that surge data could be calculated once sea-level rise had been estimated. Second, the estimated constituents were used to predict tidal fluctuations in the years simulated via Monte Carlo. Then, the TideHarmonics package in R (Stephenson 2017) was used for the estimation of tidal components, as detailed below. To account for long astronomical cycles (LAC), nodal-correction functions for both the amplitude and phase are used. With these corrections, each tidal component takes the form

f_j(t) A_j cos(σ_j t + u_j(t) - p_j),

where f_j(t) and u_j(t) represent the nodal corrections for the amplitude and phase, respectively. In this formulation, the amplitude and phase parameters to be estimated are denoted by A_j and p_j (in degrees). Finally, the reference signal, by which the phase-lag p_j is calculated, is set to refer to the origin t_0.
The summation term in the predicted series can alternatively be written in terms of these corrected constituents. Over the timespan covered by the data, the harmonic tidal constituents were estimated, and a constant mean sea level was assumed across all years of data available.
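As a minimal sketch of harmonic estimation (a single M2 constituent, no nodal corrections; the study itself used the much fuller TideHarmonics implementation in R), the amplitude and phase can be recovered by ordinary least squares:

```python
import numpy as np

M2_PERIOD_H = 12.4206012          # M2 constituent period in hours
w = 2 * np.pi / M2_PERIOD_H

# Synthetic hourly "observations": mean 100 cm, amplitude 40 cm, phase 1 rad.
t = np.arange(240.0)
eta = 100.0 + 40.0 * np.cos(w * t - 1.0)

# eta(t) = mean + a*cos(wt) + b*sin(wt): solve for [mean, a, b] by OLS,
# then convert (a, b) back into an amplitude and a phase.
X = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
mean, a, b = np.linalg.lstsq(X, eta, rcond=None)[0]

amplitude = np.hypot(a, b)
phase = np.arctan2(b, a)
print(f"amplitude = {amplitude:.1f} cm, phase = {phase:.2f} rad")
```

With many constituents, the design matrix simply gains a cosine/sine column pair per constituent frequency.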

Observed, predicted and residual water levels
Because observed sea level usually differs from predicted sea level, Figure 5 displays the observed sea level at Busan's tidal-gauge station in blue, the predicted sea level at the same location, as calculated through harmonic analysis, in green, and the surge height in red. The threshold-selection algorithm (Figure 7) computes the threshold such that the resulting yearly number of exceedance clusters approximates a chosen target rate. In this case, an exceedance cluster is a set of consecutive surge observations that lie above the threshold. Hence, rather than choosing an "ideal" threshold according to some criterion or other, the algorithm simply finds the threshold that forces a chosen target rate to occur. Accordingly, a study of this kind could set its target rate as the average rate observed over a given period, or as a value that the authors find reasonable in light of their knowledge of historical data for their focal area.
Next, the algorithm iteratively updates the threshold to allow a computationally intensive, but not exhaustive, exploration of possible threshold values between its minimum value (i.e., here, minimum observed surge height) and its maximum one (i.e., maximum observed surge height). Specifically, it first sets the threshold to 0cm, and then iteratively overwrites it according to the following steps.
(1) The exceedance clusters produced at a given iteration and given threshold are identified, and the resulting annual storm rate computed.
(2) If the annual storm rate arrived at in step (1) is equal to, or about equal to, the chosen target, then the threshold from the previous iteration is the result, and the algorithm is stopped.
(3) If the annual storm rate is not close enough to the target…
(b) …but is larger than the target rate, then a vector collecting the maximum height of the clusters is built and sorted in descending order. The threshold is then updated by setting it as equal to the C-th element of this vector, where C is the integer closest to the value of 54 (the number of years covered by the dataset) multiplied by the target rate. This updated threshold is used in the next iteration of the algorithm, and steps (1) through (3) are repeated.
Figure 11 displays six possible thresholds. The first, of 31.2cm, was based on a target rate of 3.5 (189 clusters) and is shown in red. The dark-blue line represents the second threshold, of 30.54cm (target rate=4, clusters=217); the purple, a threshold of 29.56cm (target rate=4.5, clusters=246); the green, a threshold of 29.15cm (target rate=5, clusters=274); the sky-blue, a threshold of 28.33cm (target rate=6, clusters=324); and the orange, a threshold of 26.53cm (target rate=8, clusters=431). Figure 11 shows that, as expected, when the target rate increases, the threshold decreases, and as the threshold decreases, the number of clusters (i.e., storm events) increases. Conversely, the lower the target rate, the lower the number of clusters and the higher the threshold. Thus, if the desired number of storms is three per year, the algorithm will converge in three iterations and set the threshold level to 32.01cm; this results in a total of 164 storm events (clusters) over the timespan covered by our data.
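A compact sketch of the iteration described above might look like this. It covers only the rate-too-high branch spelled out in the text, adds a loose stopping tolerance and an iteration cap defensively, and runs on a synthetic surge series rather than the Busan record.

```python
import random

def clusters_above(surge, threshold):
    """Group consecutive surge values above the threshold into clusters."""
    clusters, current = [], []
    for s in surge:
        if s > threshold:
            current.append(s)
        elif current:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters

def find_threshold(surge, years, target_rate, tol=0.25, max_iter=50):
    """Iteratively move the threshold until clusters/year ~ target_rate."""
    threshold = 0.0
    for _ in range(max_iter):
        cl = clusters_above(surge, threshold)
        rate = len(cl) / years
        if abs(rate - target_rate) <= tol:
            break
        # Update step from the text: set the threshold to the C-th largest
        # cluster maximum, where C is about years * target_rate.
        maxima = sorted((max(c) for c in cl), reverse=True)
        C = max(1, round(years * target_rate))
        threshold = maxima[min(C, len(maxima)) - 1]
    return threshold, len(clusters_above(surge, threshold))

random.seed(7)
surge = [random.gauss(10, 12) for _ in range(5000)]  # synthetic surge (cm)
thr, n = find_threshold(surge, years=10, target_rate=3)
print(f"threshold {thr:.2f} cm -> {n} clusters ({n / 10:.1f}/year)")
```

On well-behaved data the update converges in a handful of iterations, mirroring the three-iteration convergence reported for the three-storms-per-year target.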
Conversely, if the desired target rate is 10 storms per year, the threshold is significantly lower (25.43cm), and the total number of storm events more than trebles, to 539 clusters (Figures 12 and 13). For the times between cluster peaks (interarrival times), this study therefore uses a gamma (exponential) distribution, under the common POT assumption that the arrivals of the maxima of exceedance clusters are regulated by a Poisson process.

Clustering of the storm surge data (Relationship among target rate, threshold and clusters)
For peak height, on the other hand, GPD is typically used, because some representation-theorem results from extreme-value statistics indicate that, if the cluster maxima follow a Poisson process, then the intensity (in this case, height) of the cluster peaks follows a GPD distribution (Lopeman et al. 2015; Zhong et al. 2014). However, a Weibull distribution has been applied to peak storm-surge heights in this study because it fits the data better, especially with regard to the right-hand tail.
Because the rise ratio is by definition between 0 and 1 and does not appear to be evenly distributed along the interval [0, 1], a beta distribution was used, as it is commonly employed to model continuous random variables that occur within that range (Lopeman et al. 2015). Duration follows a lognormal distribution, which was used for two reasons previously articulated by Lopeman. We then estimated the parameters of the statistical model via MLE. For the reasons given in the previous section, the interarrival times for each season were fitted with an exponential distribution; the rise ratios with a beta distribution; and the peak heights with a Weibull distribution in which the location parameter was equal to the threshold. Detailed descriptions of how we applied each of these methods are provided in turn below.
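For instance, the beta fit for the rise ratios might be sketched as below; the rise-ratio sample is simulated rather than taken from the surge clusters, and scipy's `beta.fit` with the support pinned to [0, 1] is assumed.

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
# Simulated rise ratios in (0, 1); real values come from the surge clusters.
rise = rng.beta(2.0, 5.0, size=300)

# Fix location 0 and scale 1 so the fitted support is exactly [0, 1].
a_hat, b_hat, _, _ = beta_dist.fit(rise, floc=0, fscale=1)
print(f"alpha = {a_hat:.2f}, beta = {b_hat:.2f}")
```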

Maximum likelihood estimation
If we assume that an independent and identically distributed data sample (x_1, …, x_n) is observed from a population with a distribution of interest parametrized by an unknown variable θ, which the study wants to estimate, the MLE estimator θ̂ is defined as

θ̂ = argmax_θ Σ_i log f(x_i; θ),

where f(x; θ) denotes the probability-density function of the distribution of interest, parametrized by θ. The distributions of interest for the data in this study were chosen as follows:
(1) T_i ~ Exponential(λ), where T_i denotes the interarrival time between the peak of the i-th cluster and the peak of the (i+1)-th cluster. This distributional assumption is equivalent to assuming that a Poisson process governs peak-surge arrivals.
(2) R_i ~ Beta(α, β), where R_i denotes the rise ratio of the i-th cluster.
(3) P_i - u ~ GPD(σ, ξ), where GPD denotes the generalized Pareto distribution; P_i, the peak surge height of the i-th cluster; and u, the selected threshold.
For the exponential distribution (i.e., interarrival times), the exact solution of the maximization problem stated above can be derived in closed form. For the GPD distribution (i.e., peak exceedances) and the beta distribution (i.e., rise ratios), the problem is solved numerically. A full description of the MLE estimation algorithm for interarrival times, rise ratios, and peak exceedances is provided below.
Procedure:
(a) Compute the MLE estimate λ̂ for the exponential interarrival rate as λ̂ = n / Σ_i T_i.
(b) Compute the MLE estimates α̂ and β̂ for the beta parameters α and β, by numerically solving the first-order equations, in which ψ(·) denotes the digamma function.
Figure 17 shows the GPD cumulative-distribution function as estimated by maximum likelihood, and the empirical distribution function, with the latter shown as dots. Each dot represents the observed proportion of exceedances below a certain height in a given season (blue: cold season, red: warm season), while the corresponding value on the fitted line of the same season gives the probability that the exceedances are below that height, per the estimated GPD distribution. Figures 18 and 19 indicate that the Gamma and Weibull distributions fitted to the data for the cold and warm seasons via maximum likelihood are better than the GPD distribution across a variety of target rates, as measured by mean square error (MSE). This confirms the conclusions suggested graphically by the plots.
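Step (a) and the Weibull fit for peak heights can be sketched as follows. All data values are invented, and scipy's `weibull_min.fit` with a fixed location (`floc`) is assumed to mirror the location-equals-threshold constraint.

```python
import numpy as np
from scipy.stats import weibull_min

# (a) Closed-form MLE for the exponential interarrival rate: n / sum(T).
interarrival = np.array([12.0, 30.0, 8.0, 45.0, 25.0])  # days, illustrative
lam_hat = len(interarrival) / interarrival.sum()
print(f"lambda-hat = {lam_hat:.3f} events/day")

# Peak heights: numerical Weibull MLE with the location parameter fixed at
# the threshold, as in the study (heights invented for illustration).
u = 30.0                                                # threshold (cm)
peaks = u + np.array([1.2, 3.5, 0.8, 6.1, 2.2, 4.4, 9.3, 1.7, 2.9, 5.0])
shape, loc, scale = weibull_min.fit(peaks, floc=u)
print(f"Weibull shape = {shape:.2f}, scale = {scale:.2f}, loc fixed at {loc}")
```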

Conclusion
Typhoons cause numerous fatalities and immense property damage, and their frequency has recently been increasing.
Nevertheless, comprehensive typhoon risk assessments are not yet sufficiently developed to estimate either the damage levels from such events, or the probability of their occurrence. If they are to effectively plan for typhoons, governments and the insurance industry will need accurate estimates of both. Prompted by the high levels of damage inflicted by the high surge during South Korea's most severe typhoon, Maemi, this research has estimated the risk of storm surges through non-exceedance probability using a maximum-likelihood method. Its findings afford us an opportunity to compare various statistical probability models of typhoon-induced surges, which in turn can enhance the quality of estimation of the risk of these natural hazards. Non-exceedance probability can also be a useful, geographically sensitive tool for government agencies, insurance companies, and construction companies conducting risk assessments, setting insurance prices, preparing safety guidelines, and setting policies aimed at reducing typhoon-related damage and financial losses.
Even though the present research investigated various non-exceedance probability distributions of typhoon-driven storm surges, it only used a single extreme event in a specified region. As such, its findings may not be applicable to other regions, each of which has its own unique weather conditions, geographic features, and tidal characteristics. Future research should therefore include tidal and environmental data from a range of different regions and various extreme events to confirm the present study's findings. Also, various natural-hazards indicators and environmental factors such as wind speed, pressure, rainfall, landslides, distance to waterways, and so forth may be useful variables in estimating the exceedance probabilities of typhoons and other natural hazards, and thus be beneficial to risk assessment and mitigation.
Return periods based on various non-exceedance probability models should also be considered in future research, since elaborated return-period estimation can be utilized for better disaster relief and emergency planning. Advanced statistical methods such as Monte Carlo simulation, as well as deep-learning techniques, can be applied to make typhoon return-period estimates even more accurate.