Multivariate regression trees as an ‘explainable machine learning’ approach to exploring relationships between hydroclimatic characteristics and agricultural and hydrological drought severity
Abstract. The typical causes of droughts are lower precipitation and/or higher than normal evaporation in a region. The region’s characteristics and anthropogenic interventions may enhance or alleviate these events. Evaluating the multiple factors that influence droughts is complex and requires innovative approaches. To address this complexity, this study employs a combination of modelling and machine learning tools to assess the relationship between hydroclimatic characteristics and the severity of agricultural and hydrological droughts. The Soil and Water Assessment Tool (SWAT) is used for hydrological modelling. Two model outputs, soil moisture and streamflow, are used to calculate the drought indicators for the subsequent drought analysis. Other simulated hydroclimatic parameters are treated as hydroclimatic drivers of droughts. A machine learning technique, the multivariate regression tree approach, is then applied to identify the hydroclimatic characteristics that govern agricultural and hydrological drought severity. The case study is the Cesar River basin (Colombia).
Our research indicates that multiple parameters influence the Cesar River basin’s exposure to agricultural and hydrological droughts. Accordingly, the basin can be divided into three distinct areas. First is the upper part of the river valley. Due to precipitation shortfalls and high potential evapotranspiration, this region is very susceptible to agricultural and hydrological droughts. The second area is the middle part of the river valley. This area is likewise very susceptible to agricultural and hydrological droughts; however, severe drought conditions are brought on by inadequate rainfall partitioning and an unbalanced water cycle that favours water loss through percolation and evapotranspiration. Third, the Zapatosa marsh and the Serrania del Perijá foothills present moderate exposure to agricultural and hydrological droughts. Mild drought conditions appear to be related to the capacity of the subbasins to retain water, which lowers evapotranspiration losses and promotes percolation. Our results show that the presented methodology, in combining hydrologic modelling and machine learning techniques, provides valuable information about an interplay between the hydroclimatic factors that influence drought severity in the Cesar River basin.
Ana Paez-Trujilo et al.
Status: open (until 15 Jun 2023)
- RC1: 'Comment on nhess-2023-50', Anonymous Referee #1, 27 Apr 2023
- RC2: 'Comment on nhess-2023-50', Samuel Jonson Sutanto, 18 May 2023
The paper investigates the use of multivariate regression trees (MVRT) to characterize drought in a water basin in Colombia. The approach represents a valid solution to interpret the complex interplay between climatic forcing, basin characteristics and drought conditions. While the paper is generally well-structured, it lacks some clarifications that are fundamental to understanding the overall goal of the study and to allowing reproducible outcomes.
My main concern regards the lack of clarity in the objective of the study. The title fails to mention that it is an application to a specific case study. In addition, it is not clear until deep in the “results section” what exactly the authors mean by drought severity. For almost the entire paper the readers are left wondering what exactly is modelled with the MVRT. Is it the severity of a series of events on the entire basin? Is it the spatial distribution of the severity? This should be made clear already in the objective described in the introduction, and then detailed in the methodology.
Another related issue of the paper is the lack of specific details on the application of MVRT to the given case study. Most of the description is rather generic, and does not answer key questions about the specific application. The authors state that one of the advantages of MVRT is the capability to output multiple variables, but it is never clarified why this is needed here and how this is exploited.
A lot more can be said on the “explainable” portion of the study. The authors provide some comments on the outcomes of the two MVRTs, but the link between these outputs and a physical interpretation is lacking. In both the discussion and the conclusion sections (as well as in the abstract), the authors stress how a main finding is the division of the domain in 3 macro regions. However, it is not clear how this conclusion is drawn from the outputs of the MVRTs, and how the MVRTs are “explained” to derive this conclusion. At the moment, it seems that this conclusion is derived from previous knowledge of the area rather than the actual outcomes of the study.
In addition, the outcomes of the two MVRTs are rather different, and it would be interesting to discuss the analogies and differences between the two (in spatial patterns, explanatory variables, etc.). In the current version, the two analyses are almost independent from each other. Is the division in 3 macro regions valid for both agricultural and hydrological droughts? If yes, how is that so given the differences in the trees?
Finally, given the focus on drought, I would have expected a validation of the model also in terms of drought quantities, especially low-flow conditions. The validation of the SWAT model should be expanded to demonstrate reasonable performance during drought conditions, and possibly extended to soil moisture as well.
Overall, the paper can potentially bring a valid contribution to the field of machine learning applied to drought studies, but I recommend addressing my major concerns before considering the manuscript for publication.
L12-13. You mention anthropogenic interventions and region’s characteristics, but those are factors that are barely included in your analysis. If this is a key point of your study, it should be better reflected in the analysis.
L51. “MAY play…” Actually, I have the impression from your results that some of these quantities do not play a major role, at least in your study region.
L53-55. Again, you stress the role of human interventions but only marginally included them in the study.
L76. This is the right place to highlight why a multivariate approach may be needed here.
L87. Please better link this line and figure to the rest of the text reported later (description of the methodology).
L91. Please mark these three sub-regions in the map for the people not familiar with the region.
L105. You mention pasture here, but no “pasture” class is reported in the Figure. Please align the text with the figure.
L121. I would link this sentence to the next.
L142. I assume that CN2 is the initial CN for soil moisture condition 2, since the actual CN is a variable. Please clarify.
L143. No calibration on the Manning factor?
L152. Since your focus is on hydrological drought, I suggest adding some evaluation metrics focused specifically on low flow. It is a well-known issue that NSE may return high values even when low flow conditions are not well represented due to a good matching of flood values. Also, given the relevance of soil moisture in your study, some kind of validation/evaluation of the performances in terms of soil moisture is needed.
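To illustrate the point about NSE, here is a minimal sketch comparing NSE with NSE computed on log-transformed flows, one commonly used low-flow-oriented variant. The flow values are synthetic and invented purely for illustration, and the offset `eps` is an assumed small constant; none of this comes from the manuscript.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - (sum of squared errors / variance term of obs)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def log_nse(obs, sim, eps=0.01):
    """NSE on log-transformed flows; eps (assumed value) avoids log(0)."""
    log_obs = [math.log(o + eps) for o in obs]
    log_sim = [math.log(s + eps) for s in sim]
    return nse(log_obs, log_sim)

# Synthetic monthly flows (m3/s), invented for illustration only.
obs = [1.0, 1.2, 0.8, 50.0, 120.0, 60.0, 5.0, 1.5, 1.0, 0.9, 0.7, 0.6]
# Hypothetical simulation: flood peak well matched, low flows overestimated fivefold.
sim = [5.0, 6.0, 4.0, 52.0, 118.0, 58.0, 6.0, 7.5, 5.0, 4.5, 3.5, 3.0]

print(f"NSE     = {nse(obs, sim):.3f}")
print(f"log-NSE = {log_nse(obs, sim):.3f}")
```

In this constructed case NSE stays close to 1 because the squared errors are dominated by the flood months, while the log-transformed score drops markedly. This is the kind of contrast a low-flow-focused metric would expose.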
L162. No details are provided on the soil profile. Is it a single soil layer? How deep is it? Please clarify.
L190. What is the reference period? 1987-2018? Clarify.
L194. This sentence is not clear to me. Does the 30% refer to the total area of the basin, meaning that a minimum number of sub-basins (covering at least 30% of the total area) need to be in moderate drought?
L196. You mention short periods, but I do not see any constraints on the duration of an event. Please better clarify the definition of drought event used here (i.e. starts when at least 30%...., and ends when…). Also, if any kind of spatial or temporal pooling is performed please clarify.
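As an example of the level of detail I am asking for, one possible event definition can be sketched as follows. This is only an assumption about what the authors might intend: an event starts when the fraction of basin area in drought reaches the threshold, ends when it falls below it, and is kept only if it lasts at least `min_duration` time steps. The function name, the `min_duration` parameter and the input series are hypothetical.

```python
def find_events(area_fraction, threshold=0.30, min_duration=1):
    """Return (start, end) index pairs of runs where area_fraction >= threshold.

    area_fraction: per-time-step fraction of basin area in (at least moderate) drought.
    min_duration:  hypothetical minimum event length in time steps.
    """
    events = []
    start = None
    for t, frac in enumerate(area_fraction):
        if frac >= threshold and start is None:
            start = t                      # event begins
        elif frac < threshold and start is not None:
            if t - start >= min_duration:  # keep only sufficiently long events
                events.append((start, t - 1))
            start = None                   # event ends
    # Close an event that is still running at the end of the series.
    if start is not None and len(area_fraction) - start >= min_duration:
        events.append((start, len(area_fraction) - 1))
    return events

# Invented monthly series of drought-affected area fraction.
fractions = [0.1, 0.35, 0.5, 0.2, 0.4, 0.1, 0.3, 0.6, 0.7]
print(find_events(fractions))                  # all runs above the threshold
print(find_events(fractions, min_duration=2))  # single-month events removed
```

Stating the start rule, the end rule and any minimum duration explicitly in this way would make the six events in Section 3.3 reproducible.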
L198. The PCA has a very limited role in this study. I suggest reevaluating the need to include this section and this analysis in the study.
L216-221. This is a rather generic description of the methodology. Please contextualize the method to your study. This section should answer the questions: What is a predictand (see comment below)? Why are there multiple predictands? Why do you need MVRT instead of a simple RT?
L223. The response variables need to be better identified here. The generic “drought severity” used here leaves a lot of questions to the readers. Is it a time series of event severity for each sub-basin? A time series over the entire basin? Just a single value (average or similar)? This needs to be clarified here (and eventually detailed later) in order to justify the multivariate dimension of the problem.
L223-229. Related to the previous point. Here you first give the impression that agricultural and hydrological drought severities are the two “multivariate” variables. Then, you clarify that the two are studied separately, leaving open the question of what the “multivariate” variable is. This can only be indirectly inferred from the results section, but it must be clearly stated already here.
Since section 2.5 is supposed to be the main methodology section, you need to significantly extend this section and add all the needed clarifications. Also, a link to the flow chart should be reported here.
L234. Again, similarly to the previous section, it is not clear what average means here. Is it a spatial average? A temporal average? Do you use time series of spatial-average values for each sub-basin or just a single value? This can be indirectly inferred from the results, but it should be made clear here.
L240. Following the previous comment: so, do you have 3 values for each sub-basin as response variables? Are the frequencies in the 3 categories then the “multivariate” response?
L251. Which two groups?
L276. This sentence seems to imply that two methods are used to choose the size, which is in contrast with the next sentence. Please clarify.
L294-295. This should be clarified in the methodology and not here.
I am not 100% sure that the data reported in sections 3.1 and 3.2 are results of the study. They may fit better in the “Data and method section”, since they do not bring much to the discussion on the use of MVRT.
Section 3.3. It is not clear how these 6 events are derived from the methodology described in section 2.3. There, only a minimum fraction of the area in the sub-basin is defined, and nothing is said on duration/continuity of an event. Is there any constraint on duration? Did you remove the minor events? Please clarify.
Table 5. There is a typo on event 4 (IV).
L310. This should be made clear much sooner in the text, and clearly highlight that the multivariate of the MVRT is referring to the 3 categories.
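To make concrete what I understand the multivariate response to be, here is a minimal sketch of the split criterion used in multivariate regression trees: the sum of squared deviations from the node centroid, summed over all response columns. It assumes, as my reading of the results suggests, that each sub-basin carries a vector of three frequencies (one per drought category). The predictor values, the response vectors and the function names are invented for illustration and are not taken from the manuscript.

```python
def node_sse(rows):
    """Sum of squared deviations from the node centroid over all response columns."""
    if not rows:
        return 0.0
    n_resp = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_resp)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(n_resp))

def best_split(x, y):
    """Return (threshold, sse_after) for the one-predictor split minimising total SSE."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = (None, node_sse(y))  # no split: impurity of the parent node
    for k in range(1, len(order)):
        lo, hi = order[:k], order[k:]
        if x[lo[-1]] == x[hi[0]]:
            continue  # cannot split between identical predictor values
        sse = node_sse([y[i] for i in lo]) + node_sse([y[i] for i in hi])
        if sse < best[1]:
            best = ((x[lo[-1]] + x[hi[0]]) / 2, sse)
    return best

# Hypothetical sub-basins: annual precipitation (mm) as the single predictor ...
prec = [800, 900, 1000, 1600, 1700, 1800]
# ... and invented frequencies of (category 1, category 2, category 3) as the response.
freq = [
    [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.2, 0.4, 0.4],
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.7, 0.3, 0.0],
]
threshold, sse_after = best_split(prec, freq)
print(threshold, round(sse_after, 3), round(node_sse(freq), 3))
```

A single split on one predictor separates the two groups of response vectors jointly, which is what distinguishes the multivariate tree from fitting three separate univariate trees. A short statement of this kind in Section 2.5 would resolve most of my questions about the response variables.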
Section 3.4. As I said before, this has very marginal impacts on the analysis. In the end, you included all the variables in the MVRT analysis, but some of them were not actually used in the final trees (and some only very marginally). What does this say about the usefulness of the PCA in this case? I suggest removing this part and focusing more on analyzing the variables used in the two final MVRTs and the differences between the two trees.
L334-342. Was an analysis on a limited number of explanatory variables also performed? As an example: how different are the results if only ET and PREC are used? Are some leaves really necessary? As an example, h) and i) are separated only at the end and based on WYLD, but the plots in Fig. 9 are quite similar. Are all 12 leaves relevant, considering that you then discuss only 3 macro regions? Some leaves are also quite small (just 2 basins for b) and k) for instance); if these are relevant, then they shouldn’t be grouped in the 3 macro regions in the discussion and conclusion sections.
The same considerations are true for the results on hydrological drought.
L424-426. This should be better supported by some synthetic results, rather than leaving the extraction of meaningful information to the readers.
L514-521. This explanation is a little lacking, since the explanatory variables and the targets are both derived from the same modelling framework. I am wondering if some variables that are relevant for the hydrological drought were not included in the analysis.
L523-529. Even if 9/11 were included, some have a very limited role and appear only in hydrological drought. This discussion needs to be expanded, and a more in-depth comparison of the two trees needs to be added.
L542. Is this true also for hydrological drought?
L546-547. This subdivision in three sub-areas is never highlighted in the results, and it is not evident how and why these three sub-areas are the same for agricultural and hydrological droughts, given that different trees and explanatory variables are identified.