This work is distributed under the Creative Commons Attribution 4.0 License.
Transferability of machine learning-based modeling frameworks across flood events for hindcasting maximum river flood depths in coastal watersheds
Abstract. Despite applications of machine learning (ML) models for predicting floods, their transferability for out-of-sample data has not been explored. This paper developed an ML-based model for hindcasting maximum flood depths during major events in coastal watersheds and evaluated its transferability across other events (out-of-sample). The model considered spatial distribution of influential factors, which explain underlying physical processes, to hindcast maximum river flood depths. Our model evaluation in a HUC6 watershed in Northeastern US showed that the model satisfactorily hindcasted maximum flood depths at 116 stream gauges during a major flood event, Hurricane Ida (R2 of 0.92). The pre-trained, validated model was successfully transferred to three other major flood events, Hurricanes Isaias, Sandy, and Irene (R2 > 0.71). Our results showed that ML-based models can be transferable for hindcasting maximum river flood depths across events when informed by the spatial distribution of pertinent features and underlying physical processes in coastal watersheds.
Status: closed
RC1: 'Comment on nhess-2023-152', Anonymous Referee #1, 05 Feb 2024
Manuscript Reference: NHESS-2023-152
Manuscript Title: Transferability of machine learning-based modeling frameworks across flood events for hindcasting maximum river flood depths in coastal watersheds
Summary: This paper explores the application of ML to hindcasting maximum river flood depths across events when informed by the spatial distribution of pertinent features and underlying physical processes in coastal watersheds. Trained within the same watershed, the ML model was then transferred in time to predict out-of-sample events for three major storm events. Acceptable performance was noted at 100+ gauge stations in the domain.
General Comments:
- The content of the literature review (and motivation overall) was helpful and sound (applicable references to previous studies), highlighting gaps in the current field and therefore valid contributions that would be useful for those wanting to hindcast flood depths for tropical storms without the need to use a hydrologic model.
- Scientific Significance: An ML model that moves past simple categorical prediction of expecting a flood or not in a watershed to estimate actual flood depths at stream gauges, and an approach that allows for hindcasting of storm events in the watershed not previously seen or trained on by the model, are two main contributions of this work. While it is necessary for communities to understand expected depths along flood plains and other areas in a watershed, the study takes a key first step towards establishing a methodology for maximum flood depth estimation at point locations in a stream with an ML model that requires input features that could be harnessed from available datasets. Though limited to point estimations of flood depths for the study in a single HUC-06 watershed, the methodology permits application/implementation to watersheds of various sizes and locations. The study addresses scientific questions within the scope of NHESS.
- Scientific Quality: The PCA and SHAP analyses for assessing the importance of features on flood depth estimation will be very useful for the hydrological community. Of course these features will vary per watershed, and future work may identify potentially important features currently not included that would improve performance – but feature analyses such as these, coupled with ML models, are what help make the black box of ML models interpretable and help hydrologists piece together valuable information on the physics behind flood events in a watershed without the need for setting up and running hydrologic models. Overfitting is always a concern for ML models, but some steps were taken to help by reducing "redundant" variables through computing correlations among variables.
- Presentation Quality: The overall presentation of the manuscript was well-organized and easy to follow – from the perspectives of both hydrology and ML. Figures and tables are easy to understand. There are minor grammar edits needed, but language otherwise was precise and fluent.
Minor Suggestions/Technical corrections:
- Considering the mentioned overestimation of shallow flood depths and underestimation of high flood depths, perhaps a median metric would be a more robust test of performance (than MAE)?
- Also, given the vast difference noted between the NRMSE and the simpler FQ ratios, it would be interesting to see the scatter plot of simulated vs observed max flood depths for the storm events. This may even shed some light on where performance is good and not so good, helping to improve max flood depth estimation.
- The ML model caters to multiple types of flooding (both fluvial and coastal) – as the intention was to hindcast depths at stream gauges for these events, using the same model for locations where inland flooding is more likely as well as for those more prone to coastal flooding (and vice versa); is there a trend or pattern noted for performance in areas susceptible to fluvial versus coastal flooding or storm surge? Given the finding that Hurricanes Isaias and Sandy overestimated flood depths further from the storm tracks (tracks that were further from coastal locations in both cases) – is there a chance that the model is skewed towards one flood type?
- Line 643 – review tense.
- Review Legend in Figure 6 (lowest interval overlaps with the next)
- It was useful to see the datasets and sources/references summarized in a table (e.g. Table 3).
- Perhaps abbreviations need only be defined at their initial mention – e.g. for machine learning.
Citation: https://doi.org/10.5194/nhess-2023-152-RC1
AC1: 'Reply on RC1', Ebrahim Ahmadisharaf, 19 Apr 2024
The comment was uploaded in the form of a supplement: https://nhess.copernicus.org/preprints/nhess-2023-152/nhess-2023-152-AC1-supplement.pdf
RC2: 'Comment on nhess-2023-152', Anonymous Referee #2, 26 Feb 2024
=== General Comments
Disclosure: The reviewer is an ML/AI expert with very limited expertise in the Earth Sciences.
The paper showcases the application of an ANN (likely, a multi-layer perceptron) to hindcast maximum flood depth in a HUC6 watershed in the Northeastern US based on a broad collection of (geographic, hydrologic, meteorologic, topographic, etc.) features, which were determined via a feature selection process. The model was trained on one major flood event and its "transferability" to other major flood events was evaluated, yielding, in general, positive results. The paper claims, in essence, the following main contributions: (i) predicting maximum flood depth, instead of the presence/absence of inundation that has been examined in the literature so far, and (ii) appraising the usefulness of a data-driven model on major flood events not used in training/calibration ("transferability").
According to the reviewer's opinion, strengths and weaknesses of this work are as follows:
Strengths:
+ The broad range of features that has been considered in this study.
+ The discussion in subsection 4.2.2 about feature importance (pertains to explainability) seems valuable. Such discussions are often missing from similar works.
Weaknesses:
- The paper's motivation is rather weakly framed and its contributions seem rather of limited extent. Why would this work be valuable to Earth scientists?
- Reproducibility of this work seems to be rather low, as important modeling details are not provided. Also, the code and data have not been made publicly accessible.
- Lack of a baseline model/approach that can be compared against. How can one tell whether the prediction quality reported in this work is up to par without any comparisons to established approaches? Perhaps, a natural candidate here would have been a physics-based or hybrid (data-driven physics-based) model.
- There is a lack of rigor surrounding several ML-related concepts/topics discussed in the paper.
Based on all this, the reviewer's ratings are as follows:
* Scientific Significance: 3 (Fair)
* Scientific Quality: 2 (Good)
* Presentation Quality: 2 (Good)
Finally, the reviewer hopes that the provided feedback will be useful towards improving this paper.
=== Specific Comments
1) Lines 101-103: "ML and DL models can make satisfactory predictions (in terms of minimum error in estimating flood characteristics like depth) and generate valuable insights."
By nature, most general purpose DL architectures are notoriously opaque in terms of being interpretable. Hence, it would be very useful to see here some examples of such insights.
2) Lines 114-115: "... and different ML models were examined on the same dataset."
Examined in what sense?
3) Lines 182-184: "Next, we assessed the transferability of our developed model across three other extreme events — Hurricanes Isaias, Sandy, and Irene — in the same watershed."
Is "transferability" the right term to use here, instead of "generalization ability" or something equivalent? In other words, is "transferability" used in place of out-of-sample (test) performance?4) Lines 386-387: "The observed flood data and features were split into training and testing sets, with 70% to 90% of the data used for training and 10% to 30% for testing (Joseph 2022; Nguyen et al. 2021)."
How many samples were used for training and how many for testing, and why was such a split selected? Also, Line 403 states "We allocated 90% of the data for training and 10% for testing," which conflicts with what is said here.
5) Lines 391-394: "We used the Random Search cross-validation approach (Boulouard et al. 2022; Hashmi 2020) to perform hyper-parameter optimization. This approach performs a randomized search on hyperparameters using cross-validation. The hyperparameters we optimized here included the number of layers, units, activation functions, optimizer, regularization rate, batch size, and epochs."
As there are several hyper-parameters involved, a Bayesian search method would be more pertinent and fruitful than a plain random search. There are several handy Python-based implementations out there that could have been employed (e.g., see https://optuna.org/); a sketch is given below.
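For illustration, here is a minimal sketch of what such a Bayesian search could look like. This is a hypothetical example, not the paper's code: a scikit-learn MLPRegressor stands in for the ANN, and the toy arrays X_train/y_train are made up.

```python
# Hypothetical sketch of a Bayesian hyper-parameter search with Optuna.
# Assumptions (not from the paper): MLPRegressor stands in for the ANN,
# and X_train/y_train are made-up stand-ins for the training split.
import numpy as np
import optuna
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 10))  # stand-in feature matrix
y_train = rng.standard_normal(200)        # stand-in flood-depth targets

def objective(trial):
    model = MLPRegressor(
        hidden_layer_sizes=(trial.suggest_int("units", 10, 200),),
        alpha=trial.suggest_float("alpha", 1e-4, 1.0, log=True),  # regularization rate
        batch_size=trial.suggest_categorical("batch_size", [8, 16, 32]),
        activation=trial.suggest_categorical("activation", ["relu", "tanh", "logistic"]),
        max_iter=600,
    )
    # 5-fold cross-validated negative MSE (higher is better)
    return cross_val_score(model, X_train, y_train,
                           scoring="neg_mean_squared_error", cv=5).mean()

study = optuna.create_study(direction="maximize")  # TPE (Bayesian) sampler by default
study.optimize(objective, n_trials=100)
print(study.best_params)
```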
6) Lines 400-401: "Cross-validation was performed using a 5-fold cross-validation strategy during the hyperparameter optimization process."
Why was 5-fold cross-validation used in favor of hold-out validation, which uses a separate, dedicated validation set?
7) Lines 534-536: ""Rain-MAX" and "Rain-Mean" suggested that they offer similar information about maximum and average rainfall values across the watershed. Consequently, "Rain-Mean" was excluded from consideration."
Shouldn't "Rain-Mean" to be retained instead as being potentially much less noisy than "Rain-MAX"?8) Lines 554-555: "The optimization process involved 500 fits, with each fit considering 100 candidates for each of the five folds in the cross-validation."
It would be helpful if much more explanation were provided here.
9) Lines 560-562: "The best hyperparameters were identified as follows: 50 units, a regularization rate of approximately 0.104, the sgd optimizer, one layer, 600 epochs, a batch size of 8, and the elu activation function."
What was the precise hyper-parameter search domain (e.g., range of number of units, choices of optimizers, etc.)? Also, acronyms such as "sgd" and "elu" should be spelled out somewhere. Regarding the latter, is this the activation function used in the hidden layer? How about the output layer? Does it use a linear activation function? Finally, the conclusion is that a one (hidden) layer network is the best model as estimated via cross-validation; this is somewhat surprising and, curiously enough, is not commented on.
10) Section 4.3. "Examining the machine learning (ML) model transferability across flood events"
What motivates this "transferability" study? Only intellectual curiosity? Why not train with data from all three identified flood events? Why wouldn't (or would, for that matter) someone expect that the model, trained on a single flood event, will also perform well for other floods in the same watershed? Are there important variables not captured by the features considered in this study? And, if so, why were these omitted?
=== Technical Corrections
1) Overall: The paper constantly mentions that it employs an "ANN," which is a non-specific term. I suspect that it uses a multi-layer perceptron and, if so, the paper needs to reflect this.
2) Line 89: "Machine learning (ML) and deep learning (DL) models..."
As DL models are ML models, perhaps, this could be slightly rephrased like "Machine learning (ML) and, in particular, deep learning (DL) models, ..."
3) Lines 95-96: "..., and through their intricate nonlinear structures and algorithms."
Consider rephrasing this.
4) Line 105: "...neural networks (ANNs), random forest, convolutional neural networks..."
Please note that convolutional neural networks are ANNs.
5) Lines 180-181: "The developed ML-based model combined the ANN algorithm with feature selection methods and geospatial data."
Perhaps, this should be rephrased as "The developed ML-based model combined an ANN with feature selection methods and geospatial data."
6) Line 299: "... remove any fake depressions, ..."
Perhaps, rephrase to "... remove any spurious depressions, ..." or something similar.
7) Lines 350-351: "The PCA components were evaluated based on their absolute values, allowing us to quantify the contribution of each feature to the overall variance."
This statement is somewhat confusing and could be improved. PCA components are the eigen-vectors of the data's covariance matrix. What does it mean to take their absolute value and how does this help in quantifying the contribution of each feature to the overall variance? The latter is typically accomplished by considering the corresponding eigen-values, which are necessarily non-negative.
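For concreteness, a minimal sketch of the eigenvalue-based route alluded to above (hypothetical: X is a made-up stand-in for a standardized feature matrix, not the paper's data):

```python
# Hypothetical sketch: eigenvalue-based variance contributions in PCA.
# X is a made-up stand-in for a standardized feature matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))     # stand-in [n_samples, n_features] matrix

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # variance captured per component (from eigenvalues)
loadings = np.abs(pca.components_)    # |loading| of each feature in each component
print(loadings[0])                    # feature magnitudes in the first principal component
```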
8) Line 354: "... the PCA captures both linear and non-linear relationships."
PCA is considered a "linear (affine)" method, as it assumes that the features are a linear-affine function of some latent variables. What is meant by non-linear relationships here?
9) Lines 356-357: "Through PCA, we determined which principal components in the feature set captured the most variation."
This is the very definition of what constitutes a principal component, so it is redundant to state.
10) Lines 368-370: "One of the key advantages of using ANN is its capacity for generalization, as highlighted by Maier et al. (2023), allowing the model to perform well on unseen data, making it robust and reliable for real-world flood estimations."
As any reasonably selected/constructed model is capable of generalization (provided it is well-parameterized), this statement here should probably be removed or rephrased.
11) Lines 373-379: "ANNs are computing systems inspired ... based on the input data it is processing (McCulloch and Pitts, 1943)."
To conserve space, this paragraph could be pruned, as it talks about very well-known facts about neural networks.
12) Lines 387-389: "The numerical features in the data were standardized using the StandardScaler function from the Scikit-learn library of python."
Instead of referring to a software implementation here, it would be more helpful to describe the type of scaling in a couple of sentences.
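For reference, the scaling in question is a per-feature z-score standardization, x' = (x - mean) / std; a toy sketch (the data below are made up):

```python
# Toy sketch of what StandardScaler does: per-feature z-score standardization.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])                    # made-up data
X_scaled = StandardScaler().fit_transform(X)   # each column now has mean 0, unit variance
print(X_scaled)
```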
13) Lines 389-391: "Hyperparameter optimization is a step in improving the performance of ML models. This process involves identifying the optimal hyper-parameter values for ML classifiers."
Strictly speaking, it is synonymous with model selection based on generalization performance. As a matter of fact, the paper points this out shortly after. Also, hyper-parameter optimization can be applied to any model family regardless of what task it addresses, not only to ML-based (i.e., data-driven) classifiers.
14) Lines 391-394: "We used the Random Search cross-validation approach (Boulouard et al. 2022; Hashmi 2020) to perform hyper-parameter optimization. This approach performs a randomized search on hyperparameters using cross-validation. The hyperparameters we optimized here included the number of layers, units, activation functions, optimizer, regularization rate, batch size, and epochs."
In the first sentence, only Hashmi 2020 seems to be useful, unless the sentence is framed differently (e.g., "as also adopted by Boulouard et al. 2022"). The second sentence seems redundant.
15) Line 395: "The best hyperparameters were selected based on the negative mean squared error."
While this may be true, it may be better to plainly state that "hyperparameters were selected based on cross-validation MSE."
16) Lines 406-407: "This allocation of 10% for testing, combined with these methodologies, is designed to enhance the model's ability to generalize across diverse scenarios."
How is this statement substantiated? A test set's purpose is to honestly appraise a model's generalization performance. It is not supposed to be leveraged in any way to train/select models according to their generalization abilities.
17) Line 423: "2.2.4. Model interpretation"
Perhaps, in order to preserve the nuanced distinction between model explainability versus interpretability, "Model Explainability" is more pertinent as a heading, given the content of this subsection. Same comment applies to line 574.
18) Line 563: "This meticulous hyperparameter optimization approach..."
I am unsure if a (plain) random hyper-parameter search can be regarded as "meticulous," when, in theory, one could do an exhaustive grid search.
19) Lines 576-577: "The SHAP values measure the contribution of a feature to the estimation for each sample in comparison to the estimation made by a model trained without that feature."
Maybe "estimation" needs to be replaced with "prediction" here.20) Figure 5: Some legends feature an underscore, which should be removed.
20) Figure 5: Some legends feature an underscore, which should be removed.
21) Line 581: "The most influential features in estimating flood depths are antecedent water level..."
From where is this concluded? I do not see "antecedent water level" as a label in Figure 5.
22) Line 679: "...hyperparameter set was used as the optimal parameterization scenario."
What is meant by this?
23) Lines 679-681: "This deterministic approach does not incorporate the uncertainty from model parameterization. Probabilistic models are needed to address this uncertainty."
What is meant by "uncertainty from model parameterization"? Is it "uncertainty from model misspecification"? Also, if that is so, how can probabilistic models address this? Some details are warranted here.24) Lines 731-732: "We recommend that future work compares the performance of our ML-based model to traditional physically-based and morphologic-based models using the same datasets."
Why is this recommended and not performed in this study? Besides, the paper reports results against no baseline performance, which is problematic when assessing its usefulness.
Citation: https://doi.org/10.5194/nhess-2023-152-RC2
AC2: 'Reply on RC2', Ebrahim Ahmadisharaf, 19 Apr 2024
The comment was uploaded in the form of a supplement: https://nhess.copernicus.org/preprints/nhess-2023-152/nhess-2023-152-AC2-supplement.pdf