I appreciate the authors’ hard work on the manuscript, and I think the changes they have made have improved it. I am also glad that the authors have decided to make their code available, which will be very useful to other researchers. Overall, I think the manuscript makes a useful contribution and could be accepted for publication after minor revisions.
However, the methodology is still unclear to me from the text as written, and I recommend revising it. I have listed my main areas of confusion below, and I hope this helps the authors clarify the text.
First, are models trained individually for each spatial point? This is not explicitly stated anywhere, as far as I can tell.
It is also not clear from the text how the CV and training/test evaluation are executed. For example, lines 151-152 (in the file with tracked changes): ‘The creation of training, validation, and testing subsets is crucial to avoid overfitting and achieve reasonable estimates of model performance. The data set was divided into the first 80% for training and 20% for validation data.’ This does not explicitly state that the split is in time, and although training, validation, and test sets are mentioned, in fact only a training and test split is performed.
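For clarity, this is the kind of explicitly chronological split I would expect the text to describe (a minimal sketch with placeholder data and variable names of my own, not the authors’ code):

```python
import pandas as pd

# Placeholder table: one row per year with illustrative columns (not the authors' data)
df = pd.DataFrame({
    "year": list(range(1991, 2021)),
    "precip_mean": list(range(30)),
    "yield_anomaly": list(range(30)),
}).sort_values("year")

split = int(len(df) * 0.8)                      # chronological split: first 80% of years
train, test = df.iloc[:split], df.iloc[split:]
print(f"train: {train.year.min()}-{train.year.max()}, test: {test.year.min()}-{test.year.max()}")
```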
Additionally, it is unclear how the LOYO-CV is performed: the text states that a ‘fixed window’ of 10 years was used, with one year as a test set; is this repeated so that every year serves once as the validation set and the resulting scores are averaged, as is usually done for CV? ‘Fixed window’ reads as if only a single split is made, but I assume the authors do not mean that.
Line 160: ‘the best fit model was determined by employing a leave-one-year-out cross-validation approach (LOYOCV)’: I do not understand this step. Did the authors train multiple models, with identical hyperparameters and features, on different subsets of years and select the best one according to its performance on the corresponding validation year? This would not be advisable, as the validation performance would then depend on the year rather than on the model itself. Or were the hyperparameters or features selected based on the average CV performance across all folds, as would be more typical?
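For reference, this is the standard protocol I have in mind, where each year is held out once and the scores are averaged across folds (a sketch with synthetic placeholder data, assuming a scikit-learn setup, not the authors’ actual configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for one location: 10 years x 12 samples per year (illustrative only)
years = np.repeat(np.arange(2010, 2020), 12)
X = rng.normal(size=(len(years), 5))                       # e.g. climate features
y = X[:, 0] + rng.normal(scale=0.5, size=len(years))       # e.g. yield anomaly

# Each fold holds out one full year; the reported score is the average over folds,
# so no single year alone determines which configuration looks best.
scores = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X, y, groups=years, cv=LeaveOneGroupOut(), scoring="r2",
)
print(f"per-year R2: {np.round(scores, 2)}, mean: {scores.mean():.2f}")
```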
Lines 175-177: ‘Different models were trained considering different combinations of input data, including precipitation means, temperature means, and combinations of means and extreme climate indices. The goal of this experiment was to identify the most important climate indices.’ I would recommend describing this more explicitly: which subsets of features were tested, and how were they evaluated?
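To illustrate the level of detail I am asking for, an explicit description could correspond to something like the following, where each candidate feature subset is scored with the same CV protocol (feature names and subsets are placeholders of mine, not the authors’ indices):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
years = np.repeat(np.arange(2010, 2020), 12)
X = pd.DataFrame(
    rng.normal(size=(len(years), 4)),
    columns=["precip_mean", "temp_mean", "rx5day", "tx90p"],   # placeholder names
)
y = X["precip_mean"] + 0.5 * X["rx5day"] + rng.normal(scale=0.5, size=len(years))

# Candidate feature subsets, each evaluated with the same LOYO-CV protocol
subsets = {
    "means only": ["precip_mean", "temp_mean"],
    "means + extreme indices": ["precip_mean", "temp_mean", "rx5day", "tx90p"],
}
for name, cols in subsets.items():
    scores = cross_val_score(
        RandomForestRegressor(random_state=0), X[cols], y,
        groups=years, cv=LeaveOneGroupOut(), scoring="r2",
    )
    print(f"{name}: mean CV R2 = {scores.mean():.2f}")
```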
I would suggest that the authors explicitly state which years are covered by the training and test sets, on which years the LOYO-CV is done, and, further on in the text, on which years/data the results for each figure are calculated. This should also be included in Figure 1.
Additionally, there are multiple ways of calculating feature importance from random forest models. I am assuming that the authors refer to the internal impurity-based (e.g. entropy) feature importance. This should be stated explicitly. I would also advise that the authors consider additionally calculating the permutation-based feature importance on the validation or test years. Its agreement or disagreement with the impurity-based feature importance would help readers assess the robustness of the findings.
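For concreteness, a minimal sketch of how the two importance measures could be compared (synthetic placeholder data; I am assuming a scikit-learn implementation, which may differ from the authors’ code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))                                   # placeholder features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120)
X_train, X_test = X[:96], X[96:]                                # chronological 80/20 split
y_train, y_test = y[:96], y[96:]

rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Impurity-based importances are a by-product of fitting on the training data ...
print("impurity-based:   ", np.round(rf.feature_importances_, 2))

# ... whereas permutation importance can be computed on unseen (validation/test) years
perm = permutation_importance(rf, X_test, y_test, n_repeats=30, random_state=0)
print("permutation-based:", np.round(perm.importances_mean, 2))
```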
Other points of confusion:
- Lines 128-129 introduce ‘explainable’ and ‘operational’ features but do not define them. Also, lines 137-138: ‘Other relevant aspects, such as relevancy, explainability, and operationability, will be explained in the following steps.’ As far as I can see, these terms aren’t explained later in the text.
- Line 144: ‘The SHAP method uses a second model, most commonly the RF model…’ I do not think this is true: SHAP is used to explain the original model, not a second one (see the sketch after this list).
- Line 162: ‘The models were trained and optimized on the training and validation datasets’ This is unusual; did the authors intend to write only training datasets, not validation datasets?
- Outlier removal: lines 276-277 state that the authors remove outliers using the interquartile range, but from the supplementary material it appears that only a few data points are removed, which does not make sense to me. Additionally, the supplementary material discussion of the outlier removal is confusing: in lines 30-31 the authors state ‘Removal of outliers is a complex problem since we are working with extreme events’, but this is followed by an explanation of the trend and heteroskedasticity removal process rather than the outlier removal. Then, in line 52: ‘After obtaining a consistent time series corrected for outliers, trends, and heteroskedasticity’ - but the outlier removal occurs afterwards, as far as I can tell (line 59: ‘To eliminate potential outliers, we excluded values considering each year and state’).
- Lines 277-278: ‘Changes in technology in seed production, fertilizers, and land management, also known as technological trends (Liu and Ker, 2020) were removed by Local Polynomial Regression Fitting (LOESS)’ - LOESS would remove all trends, including those due to, e.g., climate change, not only technological ones. I would recommend mentioning this.
- Supplementary material line 137: the sentence ending in ‘indicating the significant role of both rainfall’ is incomplete.
- Figure 3: Where do the error bars come from here? Which variables were used as predictive features? Are these metrics calculated on the test set?
- Lines 340-342: Where do these ranges come from?
- Lines 349-354: Where are the results discussed here shown?
- Figure 4: The hazard types don’t correspond, as far as I can tell, to the CID types and categories from Table 2. What do these labels mean?
- Line 420: ‘This technique allowed us to extract insights by coupling the results of a Random Forest model’ - as noted above, I don’t think SHAP works by coupling one model to another; it is intended to explain the original model.
- In lines 466-468 and in the Supplementary material (lines 95-101), the authors discuss whether or not to keep the most extreme years in the training or test set, or to remove them. I understand from the text that they were kept in the dataset, but it is not stated whether they were placed in the training or the test set.
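Regarding the two SHAP-related points above (manuscript lines 144 and 420): this is a minimal sketch of the usual SHAP workflow, in which the explainer is built directly on the already-fitted model rather than on a second model (all data and names are illustrative; I am assuming a scikit-learn + shap setup, which may differ from the authors’ exact code):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))                     # placeholder features
y = X[:, 0] + rng.normal(scale=0.5, size=100)

rf = RandomForestRegressor(random_state=0).fit(X, y)

# The explainer wraps the already-fitted RF itself; no second model is trained.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)
print(shap_values.shape)                          # one SHAP value per sample and feature
```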
Technical corrections:
- Line 136: I would remove the sentence ‘Feature selection is a pre-processing step in machine learning models’. It is confusing as the feature selection is described later in the text.
- ‘ML’ is introduced as an abbreviation early in the text, but the authors continue to use ‘machine learning’ afterwards. I would also advise using ‘RF’ as an abbreviation for random forest to improve readability.
- Figure 1 is helpful, but there are multiple typos and minor formatting issues. E.g. ‘Filter Highly correlated variables’ should be ‘Filter highly correlated variables’; ‘Boostrap RF model’ should be ‘Bootstrap RF model’.
- Line 170: ‘To achieve this, we used the Random Forest model’ This is repeated multiple times in the text, and could be removed.
- Line 175: ‘Different models were trained considering different combinations of input data’ - I believe the authors refer to different combinations of features or variables, rather than data.
- Line 207: Typo - ‘The ~~the~~ SHAP explanations was performed’
- Lines 239, 240, 244: Maize and soybean should not be capitalised here.
- Figures S1 and S2: Typo - ‘eath’ should be ‘each’
- Supplementary line 86: The reference is erroneously capitalised: RODRIGUES et al. (2013)
- Supplementary line 130: Typo - ‘Table SS2’ should be ‘Table S2’
- In the Supplementary material, Section 4 still refers to an XGBoost model.
- Lines 364-365: Typo - ‘The analysis can be of variable importance for soybean datasets is shown in Table S1’
- Line 413: Typo - ‘climate impact-divers’ should be ‘climate impact-drivers’
- Lines 461-462: Typo - ‘however, the for IBGE is 150mm’
- Figure 7: Please include the units for e.g. precipitation.
- Line 471: ‘exemplify’ is the wrong word, I think - perhaps ‘present’? |