Short-term prediction of extreme sea-level at the Baltic Sea coast by Random Forests
Abstract. We have designed a machine-learning method to predict the occurrence of extreme sea-level at the Baltic Sea coast with lead times of a few days. The method is based on a Random Forest Classifier and uses sea level pressure, surface wind, precipitation, and the prefilling state of the Baltic Sea as predictors for daily sea level above the 95 % quantile at seven tide-gauge stations representative of the Baltic coast.
The method is purely data-driven and is trained with sea-level data from the Global Extreme Sea Level Analysis (GESLA) data set and with the meteorological reanalysis ERA5 of the European Centre for Medium-Range Weather Forecasts. These records cover the period from 1960 to 2020; one part of them is used to train the classifier and another part to estimate its out-of-sample prediction skill.
The method is able to satisfactorily predict the occurrence of sea-level extremes at lead times of up to 3 days and to identify the relevant predictor regions. The sensitivity, measured as the proportion of correctly predicted extremes, is of the order of 70 %, depending on the station. The proportion of false warnings, related to the specificity of the predictions, is typically as low as 10 to 20 %. For lead times longer than 3 days, the predictive skill degrades; at 7 days, it is comparable to random skill.
The importance of each predictor depends on the location of the tide gauge. Usually, the most relevant predictors are sea level pressure, surface wind and prefilling. Extreme sea levels in the northern Baltic are better predicted by surface pressure and the meridional surface wind component. By contrast, for stations located in the south, the most relevant predictors are surface pressure and the zonal wind component. Precipitation was not a relevant predictor for any of the stations analysed.
The Random Forest classifier does not need to be particularly complex, and the computing time to issue predictions is typically a few minutes. The method can therefore be used as a pre-warning system that triggers the application of more sophisticated algorithms to estimate the height of the ensuing extreme sea level, or as a warning to run larger ensembles with physically based numerical models.
Status: closed
RC1: 'Comment on nhess-2023-21', Anonymous Referee #1, 14 May 2023
AC2: 'Reply on RC1', Eduardo Zorita, 21 Sep 2023
We thank the reviewer for their interest and time in reviewing our manuscript. We describe in the following how we intend to revise the manuscript to address their concerns and suggestions.
The original comments are boldfaced; our responses are in normal text.
Major comments
1. The model description and setup lack some crucial details, include partially incorrect statements, and seem partially questionable:
a. The description of the input predictors in line 223f is not sufficiently clear. What is actually being used as input predictors for the meteorological variables? Those are 2D fields, but are averages computed over the whole field (e.g. over all grid points and hourly data)?
The predictors are extracted from ERA5 reanalysis data, as discussed in Section 2.3. Indeed, the averaging was not clearly stated, as we had in mind only readers with a climatological background. We will clarify this point in the revised version. The averaging is performed only in the temporal domain (hours to days), not in the spatial domain. The storm surge response to atmospheric forcing is not local, either in space or in time. Winds and atmospheric pressure set up perturbations on the ocean surface that then travel over spatial scales of several hundreds of kilometres over several days. Therefore, the whole spatial fields need to be considered as predictors to identify the areas that may provide the best predictability at different lead times. Hence, the grid points remain as defined in the ERA5 dataset. We will add a note on this point.
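For readers outside climatology, a minimal sketch of this preprocessing, assuming xarray and the ERA5 short names msl, u10 and v10; the synthetic dataset and the helper function are illustrative, not the manuscript's actual pipeline:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for an hourly ERA5 subset over the Baltic region;
# "msl", "u10", "v10" are the ERA5 short names for mean sea-level pressure
# and the 10-m wind components.
rng = np.random.default_rng(0)
time = pd.date_range("2000-01-01", periods=48, freq="h")
lat, lon = np.arange(54, 66), np.arange(10, 30)
ds = xr.Dataset(
    {v: (("time", "latitude", "longitude"),
         rng.normal(size=(time.size, lat.size, lon.size)))
     for v in ["msl", "u10", "v10"]},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# Temporal averaging only: hourly fields are reduced to daily means.
daily = ds.resample(time="1D").mean()

# No spatial averaging: every grid point of every field remains a predictor,
# so each day becomes one long feature vector (fields concatenated).
def to_feature_matrix(daily_ds, variables=("msl", "u10", "v10")):
    n_days = daily_ds.sizes["time"]
    fields = [daily_ds[v].values.reshape(n_days, -1) for v in variables]
    return np.concatenate(fields, axis=1)   # (n_days, n_fields * n_gridpoints)

X = to_feature_matrix(daily)                # here: (2, 3 * 12 * 20)
```

Each column of X corresponds to one grid point of one field, which is what later allows the importance maps to be interpreted spatially.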
Would it not have been an alternative to use the hourly values of only those hours corresponding to the relevant occurrence of the storm surge (the maximum over the day, as defined before)?
As explained in the previous point, the occurrence of storm surges (in the Baltic Sea and elsewhere) depends on weather situations and physical patterns developing over several days before the event itself. Hence, extracting only the value of the predictors a few hours before the onset and at the location of the tide gauge would not yield a sufficient predictor. The link between atmospheric forcing and storm surge is not local.
Or alternatively, some other summary statistics of the forecast fields, such as either local values close to the station of interest, or averages over different sub-domains?
This is related to the previous points. As local events at a single station can be induced by weather patterns far away (extreme examples of this are the so-called teleconnections), it is not useful to only look at physical patterns close to the stations themselves. The goal was to extract physical patterns in specific domains of the research area, so averaging over sub-domains would blur this information.
b. If all missing values are set to -999, doesn’t this imply problematic behavior since they would potentially be grouped together with other “low” values of the corresponding predictor variable (unless the RF specifically accounts for missingness, which seems unlikely given the limited complexity of the trees chosen in the study)?
Using -999 to code the missing values affects the RF models that use Prefilling as the sole predictor. As soon as other predictors from ERA5 were used, the missing values of the Prefilling were discarded. This handling is somewhat inconsistent and will be amended in the revised version. It only potentially affects a few models (with PF as the sole predictor) and a few time steps.
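For the revised handling, a minimal sketch of discarding the days with missing Prefilling instead of coding them as -999 (pandas; the column names are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical predictor table with one row per day; -999 marks missing Prefilling.
df = pd.DataFrame({
    "prefilling": [12.0, -999.0, 8.5, 15.2],
    "extreme":    [0, 0, 1, 0],
})

# Treat the sentinel as a proper missing value and drop those days, so that
# -999 is never interpreted by the trees as an extremely low sea level.
df = df.replace(-999.0, np.nan).dropna(subset=["prefilling"])
```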
c. line 270: Trees up to a maximum depth of only 3 are considered. In my view, this leads to trees that are less complex than usually applied in many practical applications. The complexity of individual DTs is usually controlled via the minimum node size, which tends to be something small, typically below 50. What typical minimum node size do you see for the chosen maximum depth, and how do results for more complex DTs compare?
The main reason for using shallow decision trees was to keep the total number of degrees of freedom of the tree small enough. This point is related to the previous points raised by the reviewer. As the predictors are the full 2-D fields, i.e. several grid cells, a tree depth of 50 or more would include a very large number of tree parameters that would require many extremes to calibrate. We did not check the typical minimum node size of our trees. In addition, we realized that for deeper DTs the computational time increases heavily, indicating that the calibration algorithm struggled to find optimal values for all parameters of a deep tree.
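Should a reader want to check this, the minimum terminal-node size of such shallow trees can be read off the fitted forest directly; a minimal sketch assuming scikit-learn, with synthetic data standing in for the flattened ERA5 fields:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 500))              # stand-in for flattened ERA5 fields
y = (rng.random(2000) < 0.05).astype(int)     # ~5 % "extreme" days

rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf.fit(X, y)

# Smallest number of training samples ending up in any leaf, per tree:
min_leaf_sizes = [
    tree.tree_.n_node_samples[tree.tree_.children_left == -1].min()
    for tree in rf.estimators_
]
print("smallest leaf across the forest:", min(min_leaf_sizes))
```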
2. In light of the previous comment, since I assume the averaging is done over all grid points and hours for the input predictors, I find the figures showing full meteorological fields, which are then used to infer the meteorological interpretation of the importance of this predictor somewhat misleading. While often tendencies and values in specific regions are discussed, the model only receives the overall average as an input.
See Comment 1a: Input predictors are spatially resolved. There is no spatial averaging involved.
3. A main limitation of the study is that only simple binary predictions of storm surges are considered. Over the past two decades, large parts of the meteorological and forecasting literature have transitioned towards probabilistic predictions, and random forests enable (for example) probabilistic classifier models in a straightforward manner.
We agree that RFs allow for probabilistic or continuous predictions. Our study is a first step to test this method for storm-surge prediction. As explained in the previous responses, the predictors are the fully spatially resolved fields of a few meteorological forcings, and therefore the number of model parameters is already considerable. We intend to progress this study to a continuous prediction of water levels. Nevertheless, the model leads to interesting physical patterns for predictions (which are in line with theoretical expectations) despite its simplicity and low computational time. Our present study is a first step towards a machine-learning-based forecast scheme, and it will be further developed to include quantitative forecasts, which can then be more easily aligned with operational forecasts.
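As a first step in that direction, the present classifier could already be queried for class probabilities instead of a hard 0/1 decision; a minimal sketch, again assuming scikit-learn and synthetic data in place of the real predictors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 500))              # stand-in for flattened ERA5 fields
y = (rng.random(2000) < 0.05).astype(int)     # ~5 % "extreme" days

rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0).fit(X, y)

# Mean predicted class probability across the trees serves as an exceedance
# probability; thresholding it at 0.5 recovers the deterministic decision.
p_extreme = rf.predict_proba(X)[:, 1]
warning = p_extreme > 0.5
```

Such probabilities could then be verified with proper scores (e.g. the Brier score) in a future, fully probabilistic version of the scheme.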
4. The study does not consider any competitive benchmark models – neither a simple non-ML classification model (e.g. logistic regression), nor a climatological forecast or an operational storm surge model. At the very least, a simple climatological model that accounts for the seasonality in the target variable would have been essential to consider as a naive benchmark to allow for a fair assessment of the RF models. For now, the models are compared to benchmarks that either always predict a 1 or a 0 only.
In the submitted version of the manuscript, we also compared our model with a naive model that would predict an extreme 5 % of the time. It is true that we did not directly compare our model results to a competitive benchmark, and we plan to pursue this avenue as far as possible.
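To make this naive benchmark concrete, a minimal sketch of a classifier that issues warnings at roughly the 5 % climatological base rate (scikit-learn's DummyClassifier with synthetic labels; a seasonal climatology could be built analogously per calendar month):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(2)
y_train = (rng.random(5000) < 0.05).astype(int)
y_test = (rng.random(1000) < 0.05).astype(int)
X_train = np.zeros((5000, 1))                 # features are ignored by the dummy
X_test = np.zeros((1000, 1))

# Flags an extreme about 5 % of the time, matching the climatological rate.
naive = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
print("naive hit rate:", recall_score(y_test, naive.predict(X_test)))
```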
The reviewer’s suggestion of comparing with an operational forecast is worth pursuing, and we will try to implement it in the revised version. However, this implementation is not straightforward. Each Baltic country has its own operational forecast system, and usually they do not issue forecasts at lead times ranging from seven days down to one day. We will, however, try to offer the reader a fair comparison with the operational forecasts.
5. Section 5 contains the main results of the paper, which are organized via “case studies” of comparisons of individual model configurations based on different combinations of input variables. As noted elsewhere, in principle I find this approach of connecting model performance and physical “interpretation” interesting; however, this makes the results section cumbersome to read and difficult to follow, since it is challenging to keep track of all the different comparisons and model variants (for example since a quantitative overall comparison is missing). A more common approach in the ML literature would have been to simply supply all the considered input predictors (the number of which would still be manageable) to the RF model, and let the model itself learn which combinations and connections are important. Then, in a second step, feature importance methods can be used to analyze what the model has learned and this can be connected to the meteorological domain expertise.
We agree with the idea of the common approach. It is sensible to feed the RF with all predictors first and filter the combinations it relies on. The reason we did not do that is the underlying theory of storm surge development. Hence, we only tested combinations of predictors that were in line with the theoretical explanation of storm surges. We then wanted to infer the spatial patterns of physical predictors within the research area and their importance compared to each other.
Minor comments
1. line 35: Wouldn’t it have been an interesting alternative to compare directly to an operational storm surge forecasting system?
See response to comment 4. We will pursue this comment, but we already acknowledge that an apples-to-apples comparison will be hard to achieve.
2. line 55: “... but rather try to identify recurring patterns in the data…”: Isn’t that the same as what the statistical models do?
This is, to some extent, a matter of taste. Data-driven machine learning models can also be considered ‘statistical models’. There may be some difference in scope, as statistical models would, in principle, try to be aligned with physical models, albeit in a simplified or aggregated form, whereas machine learning models are completely utilitarian, i.e. they completely gloss over any underlying physical link between predictors and predictands.
3. The description of the meteorological background in the introduction is fairly long. Overall, the authors seem to have had the intention of combining a detailed meteorological analysis of the physical processes relevant to the prediction problem at hand, with a detailed study of random forest based forecasting methods. While this certainly is an interesting idea, it makes the paper fairly lengthy and not easy to follow, see also the comments on the presentation of the results.
We will condense this section, highlighting the key points that may be useful for readers with climatological or machine-learning backgrounds. For instance, the main points raised by the reviewer were prompted by the implicit, not explicitly stated, assumption that storm surges are driven by the spatial patterns of the meteorological forcings and not only by the direct local wind or pressure. We acknowledge that this needs to be more clearly explained, and we will try to do so in the revised version.
This physical link has, in turn, consequences for the RF model, for instance for the permissible depth of the trees; this also needs to be explained more clearly.
4. It does not become clear from Section 2.2 what the predictand actually is, what kind of predictions are sought for (deterministic / probabilistic etc).
The predictand is the occurrence of an extreme sea level (daily sea level above the 95th-percentile threshold). In the present state of the method, the prediction is deterministic.
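For clarity, a minimal sketch of how this binary predictand could be constructed from a tide-gauge record (pandas; the synthetic series stands in for the actual GESLA processing chain):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an hourly tide-gauge record (e.g. one GESLA station).
rng = np.random.default_rng(0)
idx = pd.date_range("1960-01-01", "1960-12-31 23:00", freq="h")
sea_level = pd.Series(rng.normal(scale=0.3, size=idx.size), index=idx)

# Daily maxima and the station-specific 95th-percentile threshold.
daily_max = sea_level.resample("1D").max()
threshold = daily_max.quantile(0.95)

# Binary predictand: 1 on days whose maximum exceeds the threshold, else 0.
y = (daily_max > threshold).astype(int)
```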
5. In my view, the setup is sufficiently simple such that Figure 3 could be left out without losing any relevant information.
See our response to point 6.
6. Personally, I don’t see the particular relevance of using this exact version of Figure 4. It does not directly fit the setting here (“averaging” is not really appropriate in a binary prediction setting). Further, it remains unclear whether the image, or the architecture was taken from the linked website. I would have preferred a merged version of Figures 4 and 5, which is more adapted to the situation at hand.
We will blend Figures 3, 4 and 5 into one figure. This figure will display the concept of a random forest (for readers not familiar with RF but interested in storm surge prediction) and indicate the (spatially resolved) predictors and the categorical predictand.
7. The description of random forests needs improvement, as it lacks details and is partially incorrect:
a. line 249: Typically, all the round nodes in the lowest row in Figure 5 would be called terminal nodes / leafs.
Point taken
b. line 252: “... random sample of the test data”. This should be “training data” instead.
Point taken
c. line 254: It should probably also be explained how a single DT arrives at a binary decision, not only the final RF
We will include a short explanation.
d. line 257: Does this mean you re-train the models to compute the importances? A computationally less costly and often applied alternative is to use permutation-based methods that do not require re-training.
The importance is computed directly by the RF routine, not independently by us. It indeed uses permutation-based methods.
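For readers who want to reproduce this kind of diagnostic, a minimal sketch of a permutation-based importance estimate computed on held-out data with scikit-learn (synthetic inputs; no re-training of the forest is required):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 50))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.6).astype(int)   # feature 0 matters

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the validation set and record the skill drop.
imp = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=0)
print("most important feature index:", imp.importances_mean.argmax())
```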
e. line 269f: “... will lead to less overfitting”: In my view, this is only true for very low numbers of trees. Otherwise, with a growing number of trees, the RF model will converge to the “true tree model”, but this will not have a clear effect on overfitting. A reference should be provided here if you disagree.
We agree with the reviewer, and we will expand the sentence.
f. line 278: Why is the oob_score parameter important here? This (to me) seems to be relevant only when the number of trees is based on monitoring the oob estimates of the forecast error, which is not done here.
The overall structure of the individual DT, including the tree depth, is tested by the algorithm using the out-of-bag score. Once this structure is fixed and the Random Forest is set up, the overall error on validation data is calculated. This out-of-sample validation error is the one quoted in this study. We will add a few sentences to clarify this choice.
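A minimal sketch of this two-step workflow, assuming scikit-learn (the depth grid and the synthetic data are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 50))
y = (X[:, 0] > 1.6).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: choose the tree depth with the out-of-bag score (no validation data used).
best_depth = max(
    (1, 2, 3, 5),
    key=lambda d: RandomForestClassifier(
        n_estimators=100, max_depth=d, oob_score=True, random_state=0
    ).fit(X_tr, y_tr).oob_score_,
)

# Step 2: fix the structure and quote the error on independent validation data.
rf = RandomForestClassifier(
    n_estimators=100, max_depth=best_depth, random_state=0
).fit(X_tr, y_tr)
print("validation accuracy:", rf.score(X_va, y_va))
```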
8. line 341: “... contains more instances … due to the hourly recording time”: This is not clear to me, wasn’t the target aggregated to daily values? Why do the time scales not match here?
The sentence in the manuscript is unclear. The use of the hourly values pertains only to the model runs with the Prefilling as the sole predictor. The reviewer is right, as this is inconsistent with the handling of the other predictors. This, however, affects only a few model runs. We will change this in the revised version of the manuscript.
9. The discussion of the results in Section 5 often talks about good / bad models, without giving a clear indication of what specific true positive / false negative / … rates are required for a model to be “good”. Again, this indicates that a sensible, non-naive benchmark is missing.
In the submitted version, we referred to good models as those displaying a clearly better rate of hits than a naive prediction scheme. In the new version, we will estimate the number of hits and misses of extreme sea level from the operational products (categorized using the 95th-percentile threshold), so that the reader can see what a ‘state-of-the-art’ model prediction looks like.
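For reference, the hit rate (sensitivity) and the false-alarm rate used in such comparisons follow directly from the confusion matrix; a minimal sketch with placeholder arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder observed / predicted exceedance flags (1 = extreme day).
y_obs  = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_obs, y_pred).ravel()
hit_rate = tp / (tp + fn)            # sensitivity: correctly predicted extremes
false_alarm_rate = fp / (fp + tn)    # 1 - specificity: false warnings on calm days
print(hit_rate, false_alarm_rate)
```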
10. The discussion/conclusion is missing some summary of the overall findings: If I were to develop a RF model for storm surge extremes, what predictors and model setup should I use?
The main take-home message is that the best predictors depend on the station. Although physically, both pressure and wind components may contribute to extreme sea level, the ranking of predictors depends on the particular station. We will better highlight this take-home message.
Technical comments
1. line 7f: The discussion of the cross-validation setup does not need to be part of the abstract
Point taken: we will amend the abstract.
2. line 34-35: What is the difference between forecasting and predicting, and why is it relevant here?
There is actually no difference. We will rephrase the sentence.
3. line 53/54: Abbreviation “ML” introduced twice.
Point taken
4. line 62: Reference should be in parentheses.
Point taken
5. Figure 1: Reference should not be in parentheses.
Point taken
6. line 260: There is a “?” after “small”
It will be corrected
7. Figure 8: What does that refer to?
It stands for lead time. It will be corrected
8. line 528: increased -> improved
Point taken
Citation: https://doi.org/10.5194/nhess-2023-21-AC2
RC2: 'Comment on nhess-2023-21', Anonymous Referee #2, 22 Aug 2023
The manuscript presents a new methodology (Random Forest machine learning) that may be used for the prediction of storm surges at different locations in the Baltic Sea. I am not an expert in ML techniques, so the other reviewer has done a great job on that; however, my main concern is:
- The presented methodology tests the application of RFs to obtain Baltic storm surge predictions, i.e., this approach has the potential to be used for operational coastal flood forecasts. Along that line, I miss (in the discussion) a comparison with other operational storm surge forecasts in the Baltic, i.e., it would be nice to discuss whether your approach has any benefit over existing operational systems (quality of forecast, false alarms, model run time and prerequisites, ...). Having a discussion that tries to explain the physical background of your results is ok, but you are using a method that does not care about these relations (as it is self-learning) and instead has the goal of getting the best prediction possible. For that reason, a comparison with other storm surge forecasts is mandatory.
Other than this comment, the manuscript can be improved a bit in terms of clarifying results vs. discussion vs. conclusions, for example:
- Lines 28-29. The sentence "This study ..." does not belong here but at the end of the Introduction, as it confuses the story line - please remove it.
- Line 58. "Several studies ..." may start new paragraph.
- Fig. 1 caption. Change to Wolski and Wisniewski (2020)
- Section 1.1 is too long, more appropriate for a review paper, please shorten it and keep the most relevant for the manuscript.
- Lines 227-234. I cannot follow how PF is estimated, please provide mathematically precise definition of PF.
Citation: https://doi.org/10.5194/nhess-2023-21-RC2
AC1: 'Reply on RC2', Eduardo Zorita, 21 Sep 2023
We thank the reviewer for their interest and time in reviewing our manuscript. We describe in the following how we intend to revise the manuscript to address their concerns and suggestions.
The original comments are boldfaced; our responses are in normal text.
1) The presented methodology tests the application of RFs to obtain Baltic storm surge predictions, i.e., this approach has the potential to be used for operational coastal flood forecasts. Along that line, I miss (in the discussion) a comparison with other operational storm surge forecasts in the Baltic, i.e., it would be nice to discuss whether your approach has any benefit over existing operational systems (quality of forecast, false alarms, model run time and prerequisites, ...). Having a discussion that tries to explain the physical background of your results is ok, but you are using a method that does not care about these relations (as it is self-learning) and instead has the goal of getting the best prediction possible. For that reason, a comparison with other storm surge forecasts is mandatory.
We agree that a comparison with operational forecasts can be illustrative, and we plan to pursue this avenue as far as possible. However, this comparison cannot be completely apples-to-apples. The operational forecasts usually do not publish continuous forecasts for different lead times, say ten days to one day. Also, our forecast is not quantitative but categorical (above or below the 95% quantile). Thus, the result depends on the period used to derive the quantiles themselves. Our present study is a first step towards a machine-learning-based forecast scheme, and it will be further developed to include quantitative forecasts, which can then be more easily compared with operational forecasts.
Nevertheless, we will give a sense of the comparative skill of our method relative to products derived from operational forecasts.
2) - Lines 28-29. The sentence "This study ..." does not belong here but at the end of the Introduction, as it confuses the story line - please remove it.
We disagree with the reviewer on this point. Actually, it is a matter of writing technique or style. The reviewer is alluding to a classical introduction structure, which, starting from the more general, increasingly narrows the scope down to the paper's objective at the end of the introduction. This leaves the reader wondering during the whole introduction what this objective really is.
The inclusion of the specific objective at the end of the first paragraph, to inform the reader early on, is called ‘upfronting’. It has the advantage of giving the reader the exact direction of the endpoint, so that they can better understand the subsequent introduction.
This writing technique is described in many editorial manuals and books, for instance:
Schimel, J. (2012). Writing science: how to write papers that get cited and proposals that get funded. OUP USA.
3) - Line 58. "Several studies ..." may start new paragraph.
point taken
4) - Fig. 1 caption. Change to Wolski and Wisniewski (2020)
point taken
5) - Section 1.1 is too long, more appropriate for a review paper, please shorten it and keep the most relevant for the manuscript.
We will heed the advice of the reviewer and shorten Section 1.1. However, as an interdisciplinary manuscript at the interface between machine learning and climate, we cannot assume that all readers have the same background. For instance, reviewer #1 clearly has a background in machine-learning methods, but inquires and even suggests whether the predictors for extreme sea level at one station could be the value of the wind or pressure at the grid cell closest to that station, which from the climatological point of view is not meaningful (obviously, the whole spatial field over a broad region should be taken as predictor). Therefore, we must consider the backgrounds of these two groups of potential readers and come up with a solution that provides enough background for both.
6) - Lines 227-234. I cannot follow how PF is estimated, please provide mathematically precise definition of PF.
The Prefilling state is a predictor and is represented in the Baltic Sea research literature by the sea level at the station Landsort, since this station is considered a good indicator of the mean volume of the Baltic Sea. We will explain it more clearly in the revised version.
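Purely as an illustration of one possible formalization (the exact definition in the revised manuscript may differ; the synthetic series and the lead-time handling below are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily-mean sea level at Landsort, which serves
# as a proxy for the mean volume (filling state) of the Baltic Sea.
rng = np.random.default_rng(0)
days = pd.date_range("1960-01-01", periods=365, freq="D")
landsort = pd.Series(rng.normal(scale=0.2, size=days.size), index=days)

# One possible Prefilling predictor for a forecast with a lead time of
# `lead` days: the Landsort level observed `lead` days before the target day.
lead = 3                                   # illustrative lead time
prefilling = landsort.shift(lead)
```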
Citation: https://doi.org/10.5194/nhess-2023-21-AC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 671 | 240 | 53 | 964 | 42 | 47 |
Cited
3 citations as recorded by crossref.
- Technical note: Extending sea level time series for the analysis of extremes with statistical methods and neighbouring station data, K. Dubois et al., https://doi.org/10.5194/os-20-21-2024
- Sea level variability and modeling in the Gulf of Guinea using supervised machine learning, A. Ayinde et al., https://doi.org/10.1038/s41598-023-48624-1
- Projecting Changes in the Drivers of Compound Flooding in Europe Using CMIP6 Models, T. Hermans et al., https://doi.org/10.1029/2023EF004188