Articles | Volume 25, issue 11
https://doi.org/10.5194/nhess-25-4655-2025
© Author(s) 2025. This work is distributed under the Creative Commons Attribution 4.0 License.
Ensemble random forest for tropical cyclone tracking
Download
- Final revised paper (published on 24 Nov 2025)
- Supplement to the final revised paper
- Preprint (discussion started on 18 Mar 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-252', Anonymous Referee #1, 05 May 2025
- AC1: 'Reply on RC1', Pradeebane Vaittinada Ayar, 21 Jul 2025
- RC2: 'Comment on egusphere-2025-252', Anonymous Referee #2, 20 May 2025
- AC2: 'Reply on RC2', Pradeebane Vaittinada Ayar, 21 Jul 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (05 Aug 2025) by Piero Lionello
AR by Pradeebane Vaittinada Ayar on behalf of the Authors (05 Aug 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (26 Aug 2025) by Piero Lionello
RR by Anonymous Referee #1 (11 Sep 2025)
RR by Anonymous Referee #2 (29 Sep 2025)
ED: Publish subject to minor revisions (review by editor) (05 Oct 2025) by Piero Lionello
AR by Pradeebane Vaittinada Ayar on behalf of the Authors (08 Oct 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (02 Nov 2025) by Piero Lionello
AR by Pradeebane Vaittinada Ayar on behalf of the Authors (03 Nov 2025)
Manuscript
Review of “Ensemble Random Forest for Tropical Cyclone Tracking”
Overview
This work applies Random Forest (RF) models to track tropical cyclones using environmental variables from a global reanalysis (ERA5), with the eventual goal of using the RF tracker in long-running climate simulations. The Eastern Pacific and North Atlantic TC basins were chosen for investigation. Random Forests were trained by categorizing localized boxed regions in each basin as either containing a TC or TC-free, and associating statistics of ERA5 environmental variables in each box with the binary events. Mean sea level pressure, relative vorticity, column water vapor, and thickness were used because they represent different facets of the physical mechanisms underlying TCs. Statistics computed from these variables are included as inputs during RF training.
Training is conducted with 6-fold cross-validation to generate a range of RF solutions, which are then used to compute MCC, POD, and FAR over a series of subsampling experiments; the authors note that TC-free samples vastly outnumber TC samples. Generally, a ratio of 25:1 is found to be reasonable, with POD and FAR tradeoffs as the ratio is increased or decreased. Detection skill is notably better than the baseline UZ method in both basins. Further investigation of skill suggests the model primarily misses TCs of low intensity and short duration. The authors also devise analyses to interpret physical meaning, although I have some comments on this aspect of the analysis below.
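For concreteness, the three verification scores discussed here follow directly from the 2x2 contingency table of TC vs. TC-free boxes; a minimal sketch with illustrative counts (not the authors' actual numbers), reflecting the heavy class imbalance:

```python
import math

def scores(tp, fn, fp, tn):
    """POD, FAR, and MCC from a 2x2 contingency table of TC / TC-free boxes."""
    pod = tp / (tp + fn)                   # probability of detection (hit rate)
    far = fp / (tp + fp)                   # false-alarm ratio
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom      # Matthews correlation coefficient
    return pod, far, mcc

# Illustrative counts: TC boxes are rare relative to TC-free boxes.
pod, far, mcc = scores(tp=80, fn=20, fp=40, tn=1000)
print(f"POD={pod:.3f} FAR={far:.3f} MCC={mcc:.3f}")  # POD=0.800 FAR=0.333 MCC=0.702
```

Unlike POD and FAR, the MCC uses all four cells of the table, which is why it is a sensible headline score for this imbalanced problem.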
Overall, the authors have employed RFs in a novel and potentially innovative application area: tracking TCs in global reanalyses. The manuscript could benefit from improved grammar and clarity in places, along with consideration of additional analyses or methods to strengthen the scientific presentation. I look forward to seeing a carefully revised manuscript.
Comments
McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the Black Box More Transparent: Understanding the Physical Implications of Machine Learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
Technical Edits and Questions
Generally: the authors should spend a substantial amount of time proofreading the document for lingering grammar issues.
Line 48: Change to “this study focuses on data-driven algorithms using machine learning”. Sometimes “so-called” can have a negative/inappropriate connotation, which I don’t believe was your intent.
Lines 93-96: While I understand it is a long-held tradition to include a “table of contents paragraph” in this manner, you can remove this paragraph – it has no particular value for readers. The scientific structure of manuscripts has remained unchanged for decades and every reader knows that methods will come next, results afterward, and so on. If a reader is interested in a particular section, they can seek out the section header to know what is contained within.
Line 99: Remove this single line
Line 103: Remove “cyclonic” – seasons are not “cyclonic”. Alternatively, can adjust to “cyclone seasons”
Line 106: “Track records that do not provide”
Lines 106-107: If a TC undergoes extratropical transition, how is the transition from TC to extratropical TC handled? Also, how is TC demise to depression stage handled? Only TC achievement is mentioned here (i.e., genesis).
Line 131: Moisture is misspelled
Lines 136-138: The description here appears to contain two conflicting statements. First, the text says a vector of ones and zeros is constructed for every box: is this for every grid point in the box? The next sentence says each box is encoded as a single 1 or 0. Some additional clarity, and perhaps a rewording of these sentences, is needed. I suspect it is the latter, but the wording is confusing.
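To make the ambiguity concrete, the per-box reading would reduce each box at each time step to one binary label, something like the sketch below (the box convention and function name are my assumptions, not the authors'):

```python
def label_box(box, tc_centers):
    """One reading of the method: a box is labeled 1 if any best-track TC
    center (lat, lon) falls inside it at this time step, else 0.
    `box` is (lat_min, lat_max, lon_min, lon_max); names are hypothetical."""
    lat_min, lat_max, lon_min, lon_max = box
    return int(any(lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
                   for lat, lon in tc_centers))

centers = [(15.0, -45.0)]                       # one storm center, N. Atlantic
print(label_box((10, 20, -50, -40), centers))   # box contains the center -> 1
print(label_box((25, 35, -50, -40), centers))   # box misses the center   -> 0
```

The per-grid-point reading would instead produce a whole vector of labels for each box; the revised text should state unambiguously which of the two is meant.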
Line 134: Why are the boxes not immediately adjacent to one another? Could a TC be missed if it lies outside of the boxes in the white areas of Figure 1?
Lines 139-140: What is the motivation for synthesizing the ERA5 data in the boxes into single-statistic values? Other works have used spatial regions to encode relevant spatial relationships into RFs (see Hill et al. 2020, 2021, 2023, 2024; Schumacher et al. 2021) and have had tremendous success, including deducing how those spatially oriented data contribute to forecast skill (Mazurek et al. 2025). Others tackling severe weather hazards have taken a synthesizing approach too (see Clark and Loken 2022; Loken et al. 2022). Were there any tests that also included the full box of ERA5 data, to demonstrate that the single-value statistics were the better methodological choice?
Loken, E. D., A. J. Clark, and A. McGovern, 2022: Comparing and Interpreting Differently Designed Random Forests for Next-Day Severe Weather Hazard Prediction. Wea. Forecasting, 37, 871–899, https://doi.org/10.1175/WAF-D-21-0138.1.
Clark, A. J., and E. D. Loken, 2022: Machine Learning–Derived Severe Weather Probabilities from a Warn-on-Forecast System. Wea. Forecasting, 37, 1721–1740, https://doi.org/10.1175/WAF-D-22-0056.1.
Mazurek, A. C., A. J. Hill, R. S. Schumacher, and H. J. McDaniel, 2025: Can Ingredients-Based Forecasting Be Learned? Disentangling a Random Forest’s Severe Weather Predictions. Wea. Forecasting, 40, 237–258, https://doi.org/10.1175/WAF-D-23-0193.1.
Hill, A. J., R. S. Schumacher, and M. R. Green, 2024: Observation Definitions and their Implications in Machine Learning-based Predictions of Excessive Rainfall. Wea. Forecasting, https://doi.org/10.1175/WAF-D-24-0033.1.
Hill, A. J., R. S. Schumacher, and I. L. Jirak, 2023: A new paradigm for medium-range severe weather forecasts: probabilistic random forest-based predictions. Wea. Forecasting, https://doi.org/10.1175/WAF-D-22-0143.1.
Hill, A. J., and R. S. Schumacher, 2021: Forecasting excessive rainfall with random forests and a deterministic convection-allowing model. Wea. Forecasting, https://doi.org/10.1175/WAF-D-21-0026.1.
Schumacher, R. S., A. J. Hill, M. Klein, J. Nelson, M. Erickson, S. M. Trojniak, and G. R. Herman, 2021: From random forests to flood forecasts: A research to operations success story. Bull. Amer. Meteor. Soc., https://doi.org/10.1175/BAMS-D-20-0186.1.
Hill, A. J., G. R. Herman, and R. S. Schumacher, 2020: Forecasting severe weather with random forests. Mon. Wea. Rev., https://doi.org/10.1175/MWR-D-19-0344.1.
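To illustrate the tradeoff being asked about: collapsing each box to summary statistics versus retaining the full gridded field changes the predictor vector substantially. A toy sketch, assuming an 8x8 box per variable (the function names and statistics chosen are mine, for illustration only):

```python
import statistics

def stat_features(field):
    """Collapse one variable's box of grid points to a few summary
    statistics, as the manuscript (I believe) does."""
    vals = [v for row in field for v in row]
    return [min(vals), max(vals), statistics.mean(vals), statistics.pstdev(vals)]

def full_features(field):
    """Alternative: keep every grid point, preserving spatial structure
    (the Hill et al. style of predictor construction)."""
    return [v for row in field for v in row]

# A toy 8x8 box for one variable (e.g., relative vorticity); values arbitrary.
box = [[(i * 8 + j) * 0.1 for j in range(8)] for i in range(8)]
print(len(stat_features(box)))   # 4 predictors per variable per box
print(len(full_features(box)))   # 64 predictors per variable per box
```

A side-by-side test of the two feature sets would directly answer whether the single-value statistics discard spatial information the RF could otherwise exploit.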
Lines 147-148: This sentence is not needed – can be removed. All of this information is contained in the section headers.
Lines 174-175: To be consistent with both the machine learning and atmospheric science literature, the "calibration" phase should be referred to as the "training" phase of the ERF. You then use cross-validation to validate the trained model on withheld periods; you don't use those withheld periods to "calibrate" the models.
Line 188: Should RF actually be ERF?
Line 188: Did you consider alternative probability thresholds (beyond just 50%) to assign detected tracks (D)?
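A simple threshold sweep would expose the POD/FAR tradeoff directly. A sketch with made-up ensemble probabilities and labels (`pod_far` is my illustration, not the authors' implementation):

```python
def pod_far(probs, labels, threshold):
    """POD and FAR when tracks are assigned wherever the ensemble
    probability meets `threshold` (illustrative only)."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    pod = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (tp + fp) if (tp + fp) else 0.0
    return pod, far

probs = [0.1, 0.2, 0.4, 0.55, 0.6, 0.7, 0.9, 0.95]   # made-up probabilities
labels = [0, 0, 1, 0, 1, 1, 1, 1]                     # made-up truth
for thr in (0.3, 0.5, 0.7):
    pod, far = pod_far(probs, labels, thr)
    print(f"thr={thr}: POD={pod:.2f} FAR={far:.2f}")
```

Raising the threshold trades POD for FAR; reporting the curve (or choosing the threshold that maximizes MCC) would be more informative than fixing 50% a priori.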
Lines 251-253: This text is best reserved for the figure caption – please move there if not already. This text is just describing the figure, not the science.
Figure 3: It would be good to see the full distribution of MCC scores for the 100 RFs plotted as error bars, akin to a 95% confidence interval. Are the MCC values truly statistically indistinguishable? (It is hard to tell, but perhaps this detail is plotted as the light blue lines? If so, please make these lines clearer so they can be discerned, and describe them in the figure caption.)
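A percentile-based interval across the 100 RFs is cheap to compute and would settle whether the MCC differences fall within sampling noise. A sketch with synthetic scores (the real values would come from the authors' Figure 3):

```python
import random

# Stand-in MCC scores for the 100 cross-validation RFs (synthetic).
rng = random.Random(0)
mcc_scores = sorted(rng.gauss(0.70, 0.02) for _ in range(100))

# Percentile-based 95% interval: drop the lowest and highest ~2.5% of scores.
lower = mcc_scores[2]     # ~2.5th percentile of 100 sorted values
upper = mcc_scores[97]    # ~97.5th percentile
median = (mcc_scores[49] + mcc_scores[50]) / 2
print(f"median={median:.3f}, 95% interval=[{lower:.3f}, {upper:.3f}]")
```

If the intervals for the different subsampling ratios overlap substantially, the text should soften any claims that one ratio outperforms another.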
Lines 273-274: What is meant by “calibration experiments”? Are you just evaluating the model’s ability to detect storms over the testing period for which it was trained? It is to be expected that POD will be high and FAR low.
Lines 283-284: Isn't a missed track, by definition, lower probability? Aren't hits/misses defined by probabilities greater than or less than 50%? The box plots in Figure 5b are largely constrained by the methods used and don't necessarily provide much scientific support for "FA are less likely to happen than hits". The authors should reconsider the usefulness of this analysis in light of their methodological choices.
Lines 320-322: As mentioned earlier, they are also prescribed by the authors, so these results are not extremely surprising. See major comment above.
Lines 348-349: This information is once again best reserved for the figure caption.
Figure 10: This is an excellent figure that clearly demonstrates how the RFs are learning the relevance of each predictor to drive the yes/no predictions.