Nowcasting thunderstorm hazards using machine learning: the impact of data sources on performance
Jussi Leinonen
Ulrich Hamann
Urs Germann
John R. Mecikalski
Download
- Final revised paper (published on 25 Feb 2022)
- Preprint (discussion started on 22 Jun 2021)
Interactive discussion
Status: closed
-
RC1: 'Comment on nhess-2021-171', Tomeu Rigo, 07 Aug 2021
Dear Authors,
Please find attached my review suggestions for your manuscript.
Best regards
-
AC1: 'Reply on RC1', Jussi Leinonen, 24 Sep 2021
Dear Dr. Rigo,
We thank you for your constructive comments. Please find below our answers to your comments. The original comments are posted in italic font and the point-by-point responses are under each comment in normal font. The specific changes made to the manuscript in response to the comment are described in bold font.
Best Regards,
Jussi Leinonen (on behalf of all authors)
General comments
The document is well-addressed, easy to follow, and well-documented. My major concerns regard the type of data and/or methodologies considered, in view of the main objective: “we seek to understand the impact on thunderstorm nowcasting from the new generation of geostationary satellites, which, compared to the previous generation, provide higher-resolution imagery, additional image channels and lightning data”:
- In my opinion, there is a lack of coherence in the considered model: if the authors analyze an American region and all the data are from American sources, why, in the case of the NWP, have they considered the European one? Could you clarify this point?
3/4 of our team are from a Meteorological Service based in Europe, whose primary focus is on improving weather products in that region. We are carrying this study out in the American region mostly out of necessity rather than out of specific intent to analyze the US thunderstorm environment.
More specifically, to fulfill the objective of working with MTG-like data, we need to carry this study out in regions where GOES-R data is available, as MTG itself is not yet operational. The use of GOES-R limits us roughly to the Americas, from which we chose a part of the US as a study area (as the higher time resolution CONUS product is available there), which in turn required us to use the NEXRAD network as our radar data source. On the other hand, the ECMWF IFS model is available in the US, so unlike with the satellite and radar, we are not constrained to choosing an American model for the source of our NWP data. In fact, we prefer using ECMWF since, as mentioned in Sect. 2.2.4, it should allow us to adapt the knowledge and tools developed in the course of this study to later research in Europe.
We have added some clarifying remarks in the first paragraph of Sect. 2.2.4 to add information mentioned in the response above.
- In a similar way, if the analysis is focused on operational methods, why have you not taken into account the operational methodology used in Switzerland (or a similar one, maybe provided by the NWS or NOAA)? The use of different types of techniques can lead to significant errors. Could you justify, with at least one example, the similarity of the results of both techniques?
The current operational procedure in Switzerland uses an empirical thunderstorm intensity rank (TRT rank). Therefore, it is difficult to directly compare precipitation intensity, lightning activity and hail occurrence to the TRT rank. The assumption underlying the TRT rank forecast is persistence. The paper compares the predictive skill of the ML models to persistence and is hence, in effect, a comparison to the Swiss operational algorithm.
In order to keep the conclusions of this paper general and applicable to several environments and operational agencies, and facilitate the replication of the results, we decided to use very generic methodology rather than any particular operational scheme. However, we feel this was not well communicated in the submitted version and we have added a mention of this in Sect. 3.1.
- According to Figure 3 and other comments in the discussion and the conclusions, how optimistic are you about the improvement of the nowcasting using this technique?
We show in Figure 3 and other results in this study that we can improve results over simple baseline methods. The significance of the improvements in practical use depends on the end user and the application. The results could be applied to several different use cases (e.g. lightning prediction for aviation, flood warnings for hydrological services, or real-time hazard warnings for the general public). Different countries and meteorological services also use different procedures to issue warnings. Therefore, in this paper we would prefer to evaluate and quantify the results and leave it to the individual end users to decide whether the improvements are worth the implementation of the ML prediction scheme. However, we have made an improvement to Figure 3 that shows the relative error of the rain rate that corresponds to the reflectivity error, thus providing a more concrete way to judge the improvement.
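For illustration only (the specific Z-R relation behind the revised figure is not stated in this discussion), assuming a power-law Z-R relation such as Marshall-Palmer (Z = 200 R^1.6), a reflectivity error in dB maps to a relative rain-rate error as in this minimal sketch:
```python
def rain_rate_relative_error(delta_dbz, b=1.6):
    """Relative rain-rate error implied by a reflectivity error of delta_dbz (in dB),
    assuming a power-law Z-R relation Z = a * R**b (the prefactor a cancels out)."""
    return 10.0 ** (delta_dbz / (10.0 * b)) - 1.0

# Example: a 1.2 dB reflectivity error with the Marshall-Palmer exponent b = 1.6
print(f"{rain_rate_relative_error(1.2):.1%}")  # -> about 19%
```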
And what could be the ways of improving the ML technique (e.g. other data sources, other thresholds, the ML itself) in the future?
In the final paragraph of the Conclusions, we already discuss potential future directions for the study, including additional data sources and alternative ML methods (specifically, neural networks). We are not quite sure what the reviewer means by “other thresholds”. However, we have added a mention in this paragraph that the methodology could also be extended for use with other types of hazards.
Minor comments
- Figure 3: If the methodology (Section 3) considers that the maximum reflectivity should exceed 37 dBZ, I cannot understand the figure, because it shows that most of the time MaxZ does not reach this threshold, and in one case it never exceeds this value.
This is because the threshold for selecting the cases is based on a single-pixel value, i.e. MAXZ exceeding 37 dBZ in a single pixel. Meanwhile, in Fig. 3 we are showing the MAXZ predictand, defined as the mean of the column maximum reflectivity over the 25-km diameter neighborhood. As the mean is smaller than the maximum, in many cases the shown MAXZ does not exceed 37 dBZ. However, thank you for pointing this out. This seeming discrepancy was not well explained in the original text; we have revised the caption of Fig. 3 and the text of Sect. 4.1 to make it clearer what is shown and why the MAXZ is sometimes below 37 dBZ.
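A minimal sketch of this distinction, assuming a Cartesian reflectivity grid with known pixel spacing (the function name, grid layout and the choice to average in dBZ rather than linear units are illustrative assumptions, not taken from the paper):
```python
import numpy as np

def maxz_predictand(refl_dbz, center_ij, pixel_km=1.0, diameter_km=25.0):
    """Mean of the column-maximum reflectivity over a circular neighborhood
    around the tracked cell (here averaged directly in dBZ).

    refl_dbz:  reflectivity volume of shape (levels, ny, nx), in dBZ
    center_ij: (row, col) index of the cell centre
    """
    col_max = refl_dbz.max(axis=0)                        # column-max reflectivity field
    yy, xx = np.ogrid[:col_max.shape[0], :col_max.shape[1]]
    dist_km = np.hypot(yy - center_ij[0], xx - center_ij[1]) * pixel_km
    neighborhood = dist_km <= diameter_km / 2.0           # 25-km-diameter disc
    return col_max[neighborhood].mean()

# The case-selection criterion, by contrast, is a single-pixel test
# (col_max >= 37 dBZ at one pixel), so this neighborhood mean can
# legitimately stay below 37 dBZ.
```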
- In the same way as the previous point: what is the reason for selecting such a low reflectivity threshold (37 dBZ), considering that severe thunderstorms present values much higher than this? Could you explain the motivation for your choice?
The 37 dBZ threshold was selected based on various earlier studies which identify thunderstorms based on reflectivity thresholds between 30 dBZ and 40 dBZ. We have added more explanation and several references in section 3.1.1 to support this. Severe thunderstorms can indeed have reflectivity values much higher than this, but one goal of nowcasting is also to identify those thunderstorms that may later become severe, so that advance warnings can be provided. Therefore, we need to choose a threshold at which the thunderstorms are not yet severe but rather have potential to become severe storms.
- Figure 4: I assume that the higher the value on the y-axis, the worse the performance. But how do you really quantify the quality of the performance? E.g. POD values close to 0 (1) are very bad (good) skill values, or the opposite: FAR close to 1 (0) indicates bad (good) performance.
For variables like MAXZ shown in Fig. 4, which are predicted as a value of the variable rather than a probability of some event occurring (as with, for example, LIGHTNING-OCC), showing the MAE or RMSE error metric is how the performance of the prediction is usually quantified. Beyond such metrics, the performance depends on the specific application and the needs of the end user. As mentioned in our response to the reviewer’s earlier question, in order to keep the conclusions of our study general, we have left such considerations outside the scope of the paper.
- Paragraph at L290: how good do you assume your performance is in operational terms, given an increase of the MAE of 1.2 dB? Can you explain this?
Please see our response to the previous comment and to the earlier comment from the same reviewer starting with “According to Figure 3…”
- The occurrence of hail is poorly dependent on the occurrence of 45 dBZ, for different reasons: values are concentrated at low levels, or the freezing height is much higher than the EchoTop45. In my opinion, choosing the VIL parameter gives a better correlation with hail occurrence (also poor, but less so in any case). I would like you to provide a clarification of your selection.
We do not intend to represent hail directly with the occurrence of 45 dBZ. As indicated in the second sentence of the third paragraph of Sect. 3.2, we refer to the heuristic from the Probability of Hail (POH) metric, which uses the height difference between the 45 dBZ echo top and the freezing level to indicate hail. This difference is undefined if the 45 dBZ reflectivity does not exist in the vertical column. Therefore, we split the prediction into two components: ECHO45-OCC, which predicts whether the 45 dBZ reflectivity occurs, and ECHO45-HT, which predicts the height of the echo top and which is only used if ECHO45-OCC predicts that the 45 dBZ echo occurs. The freezing level is obtained from the NWP data, so there is no need to predict it with the ML model.
In an operational setting, the hail prediction using our scheme would work roughly as follows:
- If the ECHO45-OCC model predicts that a 45 dBZ reflectivity will be present, use the ECHO45-HT model to predict the 45 dBZ echo top height, then subtract the freezing level from this in order to compute the POH.
- If the ECHO45-OCC model predicts that a 45 dBZ reflectivity will not be present, predict that no hail will occur.
We have summarized the above description at the end of Sect. 3.2 in order to clarify this point further.
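A minimal sketch of this two-stage logic (all function and variable names here are hypothetical, and the POH lookup itself is not reproduced):
```python
def predict_hail(features, freezing_level_km, echo45_occ_model, echo45_ht_model,
                 poh_from_height_diff):
    """Two-stage scheme outlined above (hypothetical names throughout).

    ECHO45-OCC decides whether a 45 dBZ echo will be present; only then is
    ECHO45-HT used to predict the 45 dBZ echo-top height, from which the
    probability of hail (POH) follows via the height difference to the
    NWP freezing level.
    """
    if not echo45_occ_model.predict(features):
        return 0.0                                    # no 45 dBZ echo predicted -> no hail
    echo_top_km = echo45_ht_model.predict(features)
    return poh_from_height_diff(echo_top_km - freezing_level_km)
```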
- Paragraph at L320: The increase of the influence of the NWP with forecast lead time is a well-known fact. Could you be more concrete about the weight of this data source in your results? How do you explain the “valleys” in the relationship with the radar data (Fig. 6)?
Our best explanation for the valleys is that they are simply noise. Since the models for each time step are trained independently, they may converge to slightly different feature importances. We now mention this in the paragraph.
We think that it is useful to mention the increase of the influence of NWP with longer lead times, but we now also state that it is a known result and support this with a reference.
Citation: https://doi.org/10.5194/nhess-2021-171-AC1
-
RC2: 'Comment on nhess-2021-171', Anonymous Referee #2, 16 Aug 2021
This paper presents an analysis of the utility of different data within a machine-learning nowcasting tool for thunderstorm hazards. ML techniques have improved and become more common in recent decades with advances in computing resources and the increasing availability of high-temporal- and high-spatial-resolution observations.
Specifically, this paper explores the role in hazard forecasting of ground-based radar, different satellite data (lightning, DEM, and imagery), and numerical weather prediction, using a gradient-boosted-tree machine learning algorithm. It especially emphasizes the role of the satellite data, and it tests the results in an area in the northeastern USA with a climatology similar to that of areas in central Europe, so the technique can be extrapolated to other regions. It also uses open-source data, which makes the research presented here even more valuable for the scientific community around the world, allowing other researchers to reproduce the experiments and/or continue the research.
I congratulate the authors, since the paper is well-written, clear, and brief, and it allows a good understanding of the authors' findings through clear and complete figures, which make it accessible to non-ML experts too. Results are well described with a comprehensive discussion and supported by scientific evidence.
Nevertheless, I suggest some small additions to the conclusion section which I believe will give more emphasis to the importance of the research presented here. Those comments/suggestions are described below.
General comments:
The authors state that the objective of the project is "to provide a systematic assessment of the value of various data sources for nowcasting hazards caused by thunderstorms using a ML approach." Particularly: "we seek to understand the impact on thunderstorm nowcasting from the new generation of geostationary satellites, which, compared to the previous generation, provide higher resolution imagery, additional image channels and lightning data."
However, although discussed in the results, I think that the conclusions could benefit from an explicit mention of the findings regarding that data source, reinforcing that, although the best results are obtained with the NEXRAD data, GLM has a positive impact offshore or in areas without radar coverage. It could be good to briefly mention the importance of satellite methods in areas without good radar coverage nowadays, such as, for instance, Africa, where the new generation of EUMETSAT data and derived products will also be available in the future, and where ML techniques could perhaps be implemented if computational resources are available.
I strongly believe that adding such a brief discussion, and some numbers regarding the computational cost of running this ML algorithm, would give the readers and possible future researchers/operational-tool developers a better idea of whether ML would improve their hazardous-thunderstorm nowcasting tools, or whether it is better to stay with ground-based instrumentation data and NWP (for those regions where no such data is available).
Technical comments:
L65: Please, change to read “as well as a region of the Atlantic..”
L135: Please, change to read “as well as their energies...”
L156: Is “elevation gradient” and “Surface gradient” here used indistinctly? Please, clarify.
L366 and L375: Should this be “1.2%” and “4.6%”, perhaps?
L427: Please, change to read: “may also expose the training process to the problem of overfitting...”
Citation: https://doi.org/10.5194/nhess-2021-171-RC2
-
AC2: 'Reply on RC2', Jussi Leinonen, 24 Sep 2021
Dear Reviewer 2,
We thank you for your constructive comments. Please find below our answers to your comments. The original comments are posted in italic font and the point-by-point responses are under each comment in normal font. The specific changes made to the manuscript in response to the comment are described in bold font.
Best Regards,
Jussi Leinonen (on behalf of all authors)
The authors state that the objective of the project is " to provide a systematic assessment of the value of various data sources for nowcasting hazards caused by thunderstorms using a ML approach" Particularly: "we seek to understand the impact on thunderstorm nowcasting from the new generation of geostationary satellites, which, compared to the previous generation, provide higher resolution imagery, additional image channels and lightning data."
However, although discussed in the results, I think that the conclusions could benefit from an explicit mention of the findings regarding that data source, reinforcing that, although the best results are obtained through the NEXRAD data, GLM has a positive impact offshore or in areas without radar coverage.
It could be good to briefly mention the importance of satellite methods in areas without good radar coverage nowadays, such as, for instance, Africa, where the new generation of EUMETSAT data and derived products will also be available in the future, and where ML techniques could perhaps be implemented if computational resources are available.
We agree that this could have been emphasized more and have added a discussion of it in the third paragraph of the conclusions. It is now stated: “The GLM lightning data are highly useful for lightning prediction; for other targets, they provide more modest benefits, although they can still provide improvements to nowcasting performance particularly when radar data are not available. More generally, the results confirm that satellite data can be used to provide ML-based nowcasts in areas without radar coverage, such as over the oceans and in less-developed regions lacking ground-based radar networks.”
I strongly believe that adding such tiny discussion, and some numbers regarding the computational costs of running this ML algorithm, would give the readers and possible future researchers/operational-tools developers a better idea of whether ML would improve their hazardous thunderstorms nowcasting tools, or if it is better to remain with ground-based instrumentation data and NWP (for those regions where no such data is available).
In addition to the changes mentioned in our previous response, we have also added a paragraph at the end of section 3.4 describing the computational costs of training and evaluation.
Technical comments:
L65: Please, change to read “as well as a region of the Atlantic..”
This was corrected as indicated by the reviewer.
L135: Please, change to read “as well as their energies...”
Also corrected as requested.
L156: Is “elevation gradient” and “Surface gradient” here used indistinctly? Please, clarify.
Yes, we have changed both mentions to “elevation gradient” to remove the ambiguity.
L366 and L375: Should this be “1.2%” and “4.6%”, perhaps?
We think it is more correct and less ambiguous to use the term “percentage points” in these cases where we are comparing two percentage values. For instance, in the case of L366, the difference of 15.3% and 14.1% is 1.2 percentage points, but expressed in percent it could also be the relative difference of those values (8.5%).
L427: Please, change to read: “may also expose the training process to the problem of overfitting...”
We thank the reviewer for pointing this out, corrected.
Citation: https://doi.org/10.5194/nhess-2021-171-AC2
-
RC3: 'Comment on nhess-2021-171', Anonymous Referee #3, 16 Aug 2021
General comments:
The authors present a well-written article that explores the contribution of different data sources in the nowcasting of impactful weather events by way of decision trees. My chief comment concerns exercising caution when using blanket statements about drawbacks of “brute-force” strategies in machine learning (e.g., Page 20, starting Line 424), considering a line of thinking in deep learning that more data is better than less, given that a properly structured neural network with representative training, validation, and testing sets should effectively handle counterproductive data sources by properly reducing their weight in very high-dimensional, multivariate space. Granted that decision trees, rather than neural networks, are the focus of this study, but it may be beneficial to mention that the given recommendations may not hold in the case of neural networks when much more data are available, being that such an avenue is proposed as being part of future work.

True, also, that computer resources can become a limiting factor when all available data are included, so there is an aspect of practicality to consider, but given close margins of error in the authors’ two analysis studies, and lack of input-dependent hyperparameter tuning, the relative skill of the different experiments is questionable. With some tuning, one could imagine different results in Figs. 8 and 9. Many of the “surprising” results might just be owed to noise in the tuning space plus random noise. This last point about random noise holds especially true given that the authors employed an 80/10/10 split for training/validation/testing, rather than a more robust method like k-fold cross-validation, which would better demonstrate that the model is/is not very sensitive to random data assignment. We also have not been shown skill metrics on the training set and validation set, which when compared with the results shown on the testing set would give the reader a sense of variance in the model. If there is high variance in the model, then result comparisons with narrower margins are difficult to trust. The authors should discuss training/validation set results, even briefly, to demonstrate that there is low variance in their model.
Specific comments:
Page 2, Line 35: Please include reference to Bedka et al. 2018, “The Above-Anvil Cirrus Plume: An Important Severe Weather Indicator in Visible and Infrared Satellite Imagery,” https://doi.org/10.1175/WAF-D-18-0040.1, and Bedka and Khlopenkov 2016, “A probabilistic pattern recognition method for detection of overshooting cloud tops using satellite imager data,” https://doi.org/10.1175/JAMC-D-15-0249.1.
Page 3, Line 66: Fix typo, “…as well as an region…”
Page 7, Line 175: Can the authors explain their reasoning for choosing 37 dBZ?
Page 7, Line 177: If there is more than one pmaxZ with equal dBZ >= 37 connected within 25 km, what determines which pmaxZ gets excluded, and how do you ensure the center-most pmaxZ is not excluded?
Page 10, Line 258: The decision on whether to use MSE or MAE should depend on the importance of outliers in your training and validation sets. If the outliers are “real,” that is, if they are not the result of corrupted data and therefore it is important to detect them, then MSE is the correct loss function to use. Otherwise, if the outliers are corrupted data that are unimportant to detect, then MAE should be chosen because MAE gives less weight to outliers. If using MAE rather than MSE, can the authors demonstrate that outliers in their datasets are unimportant?
Page 10, Line 270: One concern I have is that a given set of hyperparameters is not one-size-fits-all for testing different model setups, which is the main purpose of this article. Changing the data sources in order to assess their importance using the same hyperparameters each time might not be conclusive given that there may be some combination of hyperparameters that results in better performance with one source compared to another. Did the authors use the same hyperparameters for all input sources assessed? That is, when the authors did an “informal manual search of the parameter space,” did they do this only for one input source? To be convincing, the authors should search the hyperparameter space for more than one input source (assuming they haven’t already) to prove that performance is demonstrably insensitive to the changes, and the relative skill between each case stays consistent.
Page 11, Figure 3: Except for the very start of the ‘blue’ case, these tracked cells do not depict an active thunderstorm given the defined threshold of 37 dBZ. The purpose of the article is to analyze ML-based nowcasting of thunderstorm hazards, so it would be more relevant to see a figure that better satisfies the authors’ definition of an active thunderstorm.
Page 11, Line 287: “… with MAXZ > 37 dBZ …”: Like the previous point, most examples in Figure 3 do not show a MAXZ > 37 dBZ. If the MAXZ threshold was reached at some point prior to t=-60 min, then please explain this in the text and/or figure caption.
Page 14, Line 325: Combining the total source importance in this way seems questionable given that you claim a large selection of well-correlated but poor-performing variables add up to an importance comparable to the much higher skill radar variables – also considering the fact that the NWP variables are likely tapping into the same information, and that signal is amplified by being picked up by many variables. With this in mind, can the authors comment further on the value added by the inclusion of the b) and d) figure panels?
Page 14, Line 327: The text seems to suggest GLM has more contribution than ASTER, but the Figure 6b suggests the opposite (or appears to). Can the authors please clarify?
Page 17, Line 359: Again, the way Fig. 6b was arrived at seems flawed and maybe suggests more importance assigned to ECMWF features than is the case. Figure 6a shows almost no significant skill in inclusion of the ECMWF variables.
Page 18, Line 360: “… because the other results in Fig. 8a–b do not suggest in any way … ”: As a style suggestion, consider removing “other” and “in any way”, as they seem unnecessary and detract from the sentence.
Page 18, Line 370: “… as can be seen by comparing the columns to each other”: Similarly, this phrase is unnecessary.
Page 19, Line 389: It would alleviate ambiguity if the authors could explicitly state why not all panels in Fig. 9 have a bottom right corner showing climatology.
Page 19, Line 402: Grouping features by data source overcomes the burden of testing all possible combinations of input features, but it doesn’t solve the problem of understanding the sensitivity of said combinations (which, as rightly stated, would be implausible to determine in this manner). I would suggest simply making clear that the problem overcome is the former one I mentioned, and that this is a reasonable alternative approach.
Page 20, Line 411: “… moderate to high importance …” is questionable. Instead saying “… of some importance …” would be more agreeable.
Citation: https://doi.org/10.5194/nhess-2021-171-RC3
-
AC3: 'Reply on RC3', Jussi Leinonen, 27 Sep 2021
Dear Reviewer 3,
We thank you for your constructive comments. Please find below our answers to your comments. The original comments are posted in italic font and the point-by-point responses are under each comment in normal font. The specific changes made to the manuscript in response to the comment are described in bold font.
Best Regards,
Jussi Leinonen (on behalf of all authors)
General comments:
My chief comment concerns exercising caution when using blanket statements about drawbacks of “brute-force” strategies in machine learning (e.g., Page 20, starting Line 424), considering a line of thinking in deep learning that more data is better than less, given that a properly structured neural network with representative training, validation, and testing sets should effectively handle counterproductive data sources by properly reducing their weight in very high-dimensional, multivariate space. Granted that decision trees, rather than neural networks, are the focus of this study, but it may be beneficial to mention that the given recommendations may not hold in the case of neural networks when much more data are available, being that such an avenue is proposed as being part of future work.
In the authors’ experience, even with neural networks there is a benefit from proper selection of features, given that reducing the weight of useless or redundant features wastes network capacity that could have been used better if such features had not been present to begin with. However, the reviewer has a point in that this remains to be seen in future work. We have added a mention in the final paragraph of the conclusions, where future directions are discussed, that neural networks may be better able to utilize large datasets.
True, also, that computer resources can become a limiting factor when all available data are included, so there is an aspect of practicality to consider, but given close margins of error in the authors’ two analysis studies, and lack of input-dependent hyperparameter tuning, the relative skill of the different experiments is questionable. With some tuning, one could imagine different results in Figs. 8 and 9. Many of the “surprising” results might just be owed to noise in the tuning space plus random noise. This last point about random noise holds especially true given that the authors employed an 80/10/10 split for training/validation/testing, rather than a more robust method like k-fold cross-validation, which would better demonstrate that the model is/is not very sensitive to random data assignment. We also have not been shown skill metrics on the training set and validation set, which when compared with the results shown on the testing set would give the reader a sense of variance in the model. If there is high variance in the model, then result comparisons with narrower margins are difficult to trust. The authors should discuss training/validation set results, even briefly, to demonstrate that there is low variance in their model.
We are aware that small differences in the results might be simply noise and therefore have attempted not to overinterpret small differences in the individual results of Figs. 8 and 9, but rather concentrate on the general patterns found in these figures. While there were already some remarks pointing this out in Sect. 4.3, we have added further discussion to emphasize that the broader patterns are more robust than the individual results. To support these conclusions, following the reviewer’s suggestion, we have added figures showing the results of the exclusion studies (equivalent to Figs. 8 and 9) for the training and validation sets in the Appendix. In the revised text of Sect. 4.3, we now mention these figures and briefly cross-compare the results between the testing, validation and training sets.
Specific comments:
Page 2, Line 35: Please include reference to Bedka et al. 2018, “The Above-Anvil Cirrus Plume: An Important Severe Weather Indicator in Visible and Infrared Satellite Imagery,” https://doi.org/10.1175/WAF-D-18-0040.1, and Bedka and Khlopenkov 2016, “A probabilistic pattern recognition method for detection of overshooting cloud tops using satellite imager data,” https://doi.org/10.1175/JAMC-D-15-0249.1.
These references were added.
Page 3, Line 66: Fix typo, “…as well as an region…”
This was also pointed out by Reviewer 2 and has been corrected.
Page 7, Line 175: Can the authors explain their reasoning for choosing 37 dBZ?
Reviewer 1 also asked about this. The 37 dBZ threshold was selected based on various earlier studies which identify thunderstorms based on reflectivity thresholds between 30 dBZ and 40 dBZ. We have added more explanation and several references in section 3.1.1 to support this.
Page 7, Line 177: If there is more than one pmaxZ with equal dBZ >= 37 connected within 25 km, what determines which pmaxZ gets excluded, and how do you ensure the center-most pmaxZ is not excluded?
We do not explicitly attempt to ensure that the center-most pmaxZ is selected; however, the procedure described in Sect. 3.1.1 selects the points with the highest dBZ first, and therefore the most significant cells tend to be selected, except when they are excluded due to being close to even more significant ones.
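A simplified sketch of this greedy selection behaviour (not the exact procedure of Sect. 3.1.1; array layout and function name are illustrative assumptions):
```python
import numpy as np

def select_cells(max_dbz, coords_km, threshold_dbz=37.0, min_dist_km=25.0):
    """Greedy selection in descending reflectivity order: a candidate pixel is
    kept only if no already-selected (i.e. stronger) pixel lies within
    min_dist_km of it.

    max_dbz:   1-D array of column-maximum reflectivity per pixel, in dBZ
    coords_km: array of shape (n_pixels, 2) with pixel coordinates in km
    """
    candidates = np.flatnonzero(max_dbz >= threshold_dbz)
    order = candidates[np.argsort(max_dbz[candidates])[::-1]]   # strongest first
    selected = []
    for idx in order:
        dists = [np.hypot(*(coords_km[idx] - coords_km[s])) for s in selected]
        if all(d >= min_dist_km for d in dists):
            selected.append(idx)
    return selected
```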
Page 10, Line 258: The decision on whether to use MSE or MAE should depend on the importance of outliers in your training and validation sets. If the outliers are “real,” that is, if they are not the result of corrupted data and therefore it is important to detect them, then MSE is the correct loss function to use. Otherwise, if the outliers are corrupted data that are unimportant to detect, then MAE should be chosen because MAE gives less weight to outliers. If using MAE rather than MSE, can the authors demonstrate that outliers in their datasets are unimportant?
Our study deals with complex multi-source data that are processed with motion-detection and feature-extraction algorithms. Although we made an effort to make this process robust, in such an environment it is to be expected that some outliers will occur due to bad data or failed processing. As the reviewer points out, MSE is more sensitive to these than MAE. Of course, some outliers may also be “real” outliers that result from unusual natural behavior, but in the interest of robustness we believe that it is better to use MAE. One further point that supports the choice of MAE is that a model trained with MAE loss achieved better MSE in the validation set than an equivalent model trained with MSE loss. While we had omitted this mention from the originally submitted version, we have now added it to this paragraph. We also added some citations on the relative merits of MAE vs MSE.
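A minimal sketch of such a comparison, using LightGBM's built-in MAE (L1) and MSE (L2) objectives; the data arrays are assumed to be prepared elsewhere and the hyperparameter values are illustrative placeholders:
```python
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_val, y_val: placeholders for the prepared feature/target arrays.
params = dict(num_leaves=48, max_depth=5, n_estimators=500)  # illustrative values
model_mae = lgb.LGBMRegressor(objective="regression_l1", **params).fit(X_train, y_train)
model_mse = lgb.LGBMRegressor(objective="regression_l2", **params).fit(X_train, y_train)

# Compare the MSE of both models on the validation set.
for label, model in [("trained with MAE", model_mae), ("trained with MSE", model_mse)]:
    print(label, "-> validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```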
Page 10, Line 270: One concern I have is that a given set of hyperparameters is not one-size-fits-all for testing different model setups, which is the main purpose of this article. Changing the data sources in order to assess their importance using the same hyperparameters each time might not be conclusive given that there may be some combination of hyperparameters that results in better performance with one source compared to another. Did the authors use the same hyperparameters for all input sources assessed? That is, when the authors did an “informal manual search of the parameter space,” did they do this only for one input source? To be convincing, the authors should search the hyperparameter space for more than one input source (assuming they haven’t already) to prove that performance is demonstrably insensitive to the changes, and the relative skill between each case stays consistent.
Doing a thorough hyperparameter search for all input combinations would be very time-consuming and computationally expensive, and our search convinced us that the results were not very sensitive to changes over a reasonable hyperparameter space. However, the reviewer makes a good point that this does not guarantee that this is true for all combinations. To test this, we reran the data exclusion analysis of lightning occurrence from Fig. 8c-d, varying two of the typically most influential parameters of LightGBM: the maximum depth of the trees and the maximum number of leaf nodes. The depth was increased from 5 to 6 (theoretically increasing the predictive power) and the number of leaf nodes reduced from 48 to 24 (theoretically decreasing the predictive power). The results are shown below. These sensitivity experiments show that, even though the individual numbers may change slightly (typically less than 2%, but sometimes somewhat more), the patterns seen in Fig. 8c-d remain stable even when the hyperparameters are changed.
The figure below is equivalent to Fig. 8c-d but with the maximum tree depth increased from 5 to 6:
The figure below is equivalent to Fig. 8c-d but with the maximum number of leaf nodes reduced from 48 to 24:
(The blurriness of the images above is due to the maximum 500x500 pixel size imposed by the comment submission system.)
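A sketch of how such a sensitivity rerun could be set up (run_exclusion_analysis is a hypothetical helper standing in for the full exclusion experiment of Fig. 8c-d, i.e. training one LIGHTNING-OCC model per data-source combination and returning the skill scores):
```python
import lightgbm as lgb

variants = {
    "baseline (depth 5, 48 leaves)":     dict(max_depth=5, num_leaves=48),
    "deeper trees (depth 6, 48 leaves)": dict(max_depth=6, num_leaves=48),
    "fewer leaves (depth 5, 24 leaves)": dict(max_depth=5, num_leaves=24),
}
for label, params in variants.items():
    estimator = lgb.LGBMClassifier(objective="binary", **params)
    print(label, run_exclusion_analysis(estimator))  # hypothetical helper
```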
Page 11, Figure 3: Except for the very start of the ‘blue’ case, these tracked cells do not depict an active thunderstorm given the defined threshold of 37 dBZ. The purpose of the article is to analyze ML-based nowcasting of thunderstorm hazards, so it would be more relevant to see a figure that better satisfies the authors’ definition of an active thunderstorm.
Reviewer 1 had a similar comment so we will reproduce the answer here: This is because the threshold for selecting the cases is based on a single-pixel value, i.e. MAXZ exceeding 37 dBZ in a single pixel. Meanwhile, in Fig. 3 we are showing the MAXZ predictand, defined as the mean of the column maximum reflectivity over the 25-km diameter neighborhood. As the mean is smaller than the maximum, in many cases the shown MAXZ does not exceed 37 dBZ. However, thank you for pointing this out. This seeming discrepancy was not well explained in the original text; we have revised the caption of Fig. 3 and the text of Sect. 4.1 to make it clearer what is shown and why the MAXZ is sometimes below 37 dBZ.
Page 11, Line 287: “… with MAXZ > 37 dBZ …”: Like the previous point, most examples in Figure 3 do not show a MAXZ > 37 dBZ. If the MAXZ threshold was reached at some point prior to t=-60 min, then please explain this in the text and/or figure caption.
Please refer to our response to the previous comment.
Page 14, Line 325: Combining the total source importance in this way seems questionable given that you claim a large selection of well-correlated but poor-performing variables add up to an importance comparable to the much higher skill radar variables – also considering the fact that the NWP variables are likely tapping into the same information, and that signal is amplified by being picked up by many variables. With this in mind, can the authors comment further on the value added by the inclusion of the b) and d) figure panels?
The feature importance shown here is defined as the total gain of a given feature, that is, the reduction in the loss function attributed to the gradient boosting model using that feature. Since, by this definition, the gain is additive, it is our understanding that it is appropriate to calculate the total gain of a group of features by adding together the gains of the individual members of the group.
As for why the individual features seem to be poorly performing, consider the following, highly simplified, example: There are two predictors, A and B, which are very highly correlated. When the decision tree creates a split, it chooses either A or B to use as the basis of the split. Since A and B are highly correlated, the tree might use either A or B depending on which one happens to be slightly better in that particular circumstance. Therefore, the gain is attributed near-randomly to either A or B and thus split near-evenly between them. If the model is instead trained using only feature A, it will make every split based on A, and therefore A will be attributed a much higher gain – indeed, nearly the total gain of A and B in the first case.
Therefore, we do not really claim that poorly performing variables add up to a high importance – but rather that the high importance is split between multiple variables, and thus each of the individual variables seems to be performing poorly. For this reason, we think that the b) and d) panels make it clearer to the reader that the model is actually using the ECMWF variable to make decisions, even though the decisions are split between many individual variables. As for why the ECMWF variables have high importance in the feature importance analysis but low importance in the source exclusion analysis, we have already discussed this at length in the submitted manuscript.
We have edited this paragraph to add some clarity about the points presented above.
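A minimal sketch of the additive grouping described above, using LightGBM's per-feature gain (the feature-to-source mapping is a hypothetical input):
```python
from collections import defaultdict

def source_importance(booster, feature_to_source):
    """Sum per-feature gains into per-source totals. Because the gain attributed
    to each feature is additive, the importance of a data source is simply the
    sum over its member features. feature_to_source is a hypothetical mapping
    from feature name to data source (e.g. "NEXRAD", "ABI", "GLM", "ECMWF", "ASTER")."""
    gains = booster.feature_importance(importance_type="gain")
    names = booster.feature_name()
    totals = defaultdict(float)
    for name, gain in zip(names, gains):
        totals[feature_to_source[name]] += gain
    return dict(totals)
```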
Page 14, Line 327: The text seems to suggest GLM has more contribution than ASTER, but the Figure 6b suggests the opposite (or appears to). Can the authors please clarify?
While both contributions are very minor, the reviewer appears to be correct. We suspect, though unfortunately cannot verify, that this was text left over that referred to an earlier version of the analysis. The wording was changed to “the GLM and ASTER data contribute to a lesser extent”.
Page 17, Line 359: Again, the way Fig. 6b was arrived at seems flawed and maybe suggests more importance assigned to ECMWF features than is the case. Figure 6a shows almost no significant skill in inclusion of the ECMWF variables.
Please refer to our response to the reviewer’s comment regarding page 14, line 325.
Page 18, Line 360: “… because the other results in Fig. 8a–b do not suggest in any way … ”: As a style suggestion, consider removing “other” and “in any way”, as they seem unnecessary and detract from the sentence.
We agree that these can be removed without loss of meaning, and the sentence has been edited accordingly.
Page 18, Line 370: “… as can be seen by comparing the columns to each other”: Similarly, this phrase is unnecessary.
This has also been removed as suggested by the reviewer.
Page 19, Line 389: It would alleviate ambiguity if the authors could explicitly state why not all panels in Fig. 9 have a bottom right corner showing climatology.
This has been clarified in the caption of Fig. 9.
Page 19, Line 402: Grouping features by data source overcomes the burden of testing all possible combinations of input features, but it doesn’t solve the problem of understanding the sensitivity of said combinations (which, as rightly stated, would be implausible to determine in this manner). I would suggest simply making clear that the problem overcome is the former one I mentioned, and that this is a reasonable alternative approach.
This is a fair point, we have reworded the text here as: “Testing all possible combinations of input features would have quickly become implausible as the number of features increased, but grouping the features by data source allowed us to cover the most realistic situations of missing data, where an entire data source is unavailable…”
Pag 20, Line 411: “… moderate to high importance …” is questionable. Instead saying “… of some importance …” would be more agreeable.
This was reworded as suggested by the reviewer.
Citation: https://doi.org/10.5194/nhess-2021-171-AC3
Peer review completion
This study addresses the short-term prediction (nowcasting) of severe thunderstorms using machine learning. Machine-learning models are trained with data from weather radars, satellite images, lightning detection and weather forecasts, and with terrain elevation data. We analyze the benefits provided by each of the data sources for predicting the hazards (heavy precipitation, lightning and hail) caused by the thunderstorms.