the Creative Commons Attribution 4.0 License.
Multivariate regression trees as an ‘explainable machine learning’ approach to exploring relationships between hydroclimatic characteristics and agricultural and hydrological drought severity
Ana Paez-Trujilo
Jeffer Cañon
Beatriz Hernandez
Gerald Corzo
Dimitri Solomatine
Abstract. The typical causes of droughts are lower precipitation and/or higher than normal evaporation in a region. The region’s characteristics and anthropogenic interventions may enhance or alleviate these events. Evaluating the multiple factors that influence droughts is complex and requires innovative approaches. To address this complexity, this study employs a combination of modelling and machine learning tools to assess the relationship between hydroclimatic characteristics and the severity of agricultural and hydrological droughts. The Soil and Water Assessment Tool is used for hydrological modelling. Model outputs, soil moisture and streamflow are used to calculate the drought indicators for the subsequent drought analysis. Other simulated hydroclimatic parameters are treated as hydroclimatic drivers of droughts. A machine learning technique, the multivariate regression tree approach, is then applied to identify the hydroclimatic characteristics that govern agricultural and hydrological drought severity. The case study is the Cesar River basin (Colombia).
Our research indicates that multiple parameters influence the Cesar River basin’s exposure to agricultural and hydrological droughts. Accordingly, the basin can be divided into three distinct areas. First is the upper part of the river valley. Due to precipitation shortfalls and high potential evapotranspiration, this region is very susceptible to agricultural and hydrological droughts. The second area is the middle part of the river valley. This area is likewise very susceptible to agricultural and hydrological droughts; however, severe drought conditions are brought on by inadequate rainfall partitioning and an unbalanced water cycle that favours water loss through percolation and evapotranspiration. Third, the Zapatosa marsh and the Serrania del Perijá foothills present moderate exposure to agricultural and hydrological droughts. Mild drought conditions appear to be related to the capacity of the subbasins to retain water, which lowers evapotranspiration losses and promotes percolation. Our results show that the presented methodology, in combining hydrologic modelling and machine learning techniques, provides valuable information about an interplay between the hydroclimatic factors that influence drought severity in the Cesar River basin.
Ana Paez-Trujilo et al.
Status: open (until 15 Jun 2023)
RC1: 'Comment on nhess-2023-50', Anonymous Referee #1, 27 Apr 2023
General comment
The paper investigates the use of multivariate regression trees (MVRT) to characterize drought in a water basin in Colombia. The approach represents a valid solution to interpret the complex interplay between climatic forcing and basin characteristics and drought conditions. While the paper is generally well-structured, it lacks some clarifications that are fundamental to understanding the overall goal of the study and to allowing reproducible outcomes.
My main concern regards the lack of clarity in the objective of the study. The title fails to mention that it is an application to a specific case study. In addition, it is not clear until deep in the “results section” what exactly the authors mean by drought severity. For almost the entire paper the readers are left wondering what exactly is modelled with the MVRT. Is it the severity of a series of events on the entire basin? Is it the spatial distribution of the severity? This should be made clear already in the objective described in the introduction, and then detailed in the methodology.
Another related issue of the paper is the lack of specific details of the application of MVRT to the given study case. Most of the description is rather generic, and does not answer key questions about the specific application. The authors state that one of the advantages of MVRT is the capability to output multiple variables, but it is never clarified why this is needed here and how this is exploited.
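To make concrete what "multiple output variables" means for a regression tree, the sketch below searches for a single split on one explanatory variable that minimizes the within-group sum of squares summed over all response columns, which is the usual impurity for a multivariate regression tree. It is only an illustration, not the authors' implementation: the function names, the precipitation values and the three response columns (months spent in three hypothetical drought categories per sub-basin) are all assumed for the example.

```python
from typing import Sequence, Tuple


def multivariate_sse(rows: Sequence[Sequence[float]]) -> float:
    """Squared deviations from the column means, summed over all response columns."""
    if not rows:
        return 0.0
    total = 0.0
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        total += sum((value - mean) ** 2 for value in column)
    return total


def best_split(x: Sequence[float], y: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (threshold, impurity after the split) for one explanatory variable x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_threshold, best_sse = float("nan"), float("inf")
    for k in range(1, len(order)):
        low, high = x[order[k - 1]], x[order[k]]
        if low == high:
            continue  # no threshold can separate identical values
        left = [y[i] for i in order[:k]]
        right = [y[i] for i in order[k:]]
        sse = multivariate_sse(left) + multivariate_sse(right)
        if sse < best_sse:
            best_threshold, best_sse = (low + high) / 2, sse
    return best_threshold, best_sse


# Hypothetical data: one entry per sub-basin.
precip = [600.0, 650.0, 700.0, 1400.0, 1500.0, 1600.0]  # assumed annual precipitation (mm)
months = [  # assumed months spent in three drought categories
    [40, 30, 20], [42, 28, 22], [38, 32, 18],
    [60, 10, 2], [58, 12, 4], [62, 8, 3],
]

sse_before = multivariate_sse(months)
threshold, sse_after = best_split(precip, months)
print(threshold, sse_before, sse_after)
```

A full tree would repeat this search over every explanatory variable and recurse into each child node. The point of the sketch is only that all response columns jointly determine where the split falls, which is what distinguishes an MVRT from a set of independent univariate trees.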
A lot more can be said on the “explainable” portion of the study. The authors provide some comments on the outcomes of the two MVRTs, but the link between these outputs and a physical interpretation is lacking. In both the discussion and the conclusion sections (as well as in the abstract), the authors stress how a main finding is the division of the domain into 3 macro regions. However, it is not clear how this conclusion is drawn from the outputs of the MVRTs, and how the MVRTs are “explained” to derive this conclusion. At the moment, it seems that this conclusion is derived from previous knowledge of the area rather than the actual outcomes of the study.
In addition, the outcomes of the two MVRTs are rather different, and it would be interesting to discuss the analogies and differences between the two (in spatial patterns, explanatory variables, etc.). In the current version, the two analyses are almost independent from each other. Is the division into 3 macro regions valid for both agricultural and hydrological droughts? If yes, how is that so given the differences in the trees?
Finally, given the focus on drought, I would have expected a validation of the model also in terms of drought quantities, especially low-flow conditions. The validation of the SWAT model should be expanded to highlight reasonable performance during drought conditions, and possibly expanded to soil moisture as well.
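The low-flow concern can be illustrated with a small sketch comparing the Nash-Sutcliffe efficiency (NSE) with the same score computed on log-transformed flows, one commonly used way of giving low flows more weight. The flow values, the function names and the small constant `eps` are assumptions made for this example; they are not taken from the manuscript, and other low-flow metrics could serve equally well.

```python
import math
from typing import Sequence


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of observations."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((o - s) ** 2 for o, s in zip(obs, sim))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - numerator / denominator


def log_nse(obs: Sequence[float], sim: Sequence[float], eps: float = 0.01) -> float:
    """NSE on log-transformed flows; eps (assumed value) guards against log(0)."""
    return nse([math.log(o + eps) for o in obs], [math.log(s + eps) for s in sim])


# Hypothetical monthly flows (m3/s): the peak is matched well,
# but the simulated low flows are several times too high.
obs = [1.0, 1.0, 1.0, 1.0, 50.0, 120.0, 60.0, 5.0, 1.0, 1.0]
sim = [5.0, 5.0, 5.0, 5.0, 52.0, 118.0, 58.0, 9.0, 5.0, 5.0]

score = nse(obs, sim)
score_log = log_nse(obs, sim)
# NSE stays close to 1 because the high flows dominate the squared errors,
# while the log-transformed score is markedly lower.
print(round(score, 3), round(score_log, 3))
```

Reporting both scores side by side would show whether a good NSE is driven mainly by the high-flow months.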
Overall, the paper can potentially make a valid contribution to the field of machine learning applied to drought studies, but I recommend accommodating my major concerns before considering the manuscript for publication.
Specific comments
L12-13. You mention anthropogenic interventions and region’s characteristics, but those are factors that are barely included in your analysis. If this is a key point of your study, it should be better reflected in the analysis.
L51. “MAY play…” Actually, I have the impression from your results that some of these quantities do not play a major role, at least in your study region.
L53-55. Again, you stress the role of human interventions but only marginally included them in the study.
L76. This is the right place to highlight why a multivariate approach may be needed here.
L87. Please better link this line and figure to the rest of the text reported later (description of the methodology).
L91. Please mark these three sub-regions in the map for the people not familiar with the region.
L105. You mention pasture here, but no “pasture” class is reported in the Figure. Please align the text with the figure.
L115. Reference?
L121. I would link this sentence to the next.
L142. I assume that CN2 is the initial CN for soil moisture condition 2, since the actual CN is a variable. Please clarify.
L143. No calibration on the Manning factor?
L152. Since your focus is on hydrological drought, I suggest adding some evaluation metrics focused specifically on low flow. It is a well-known issue that NSE may return high values even when low flow conditions are not well represented due to a good matching of flood values. Also, given the relevance of soil moisture in your study, some kind of validation/evaluation of the performance in terms of soil moisture is needed.
L162. No details are provided on the soil profile. Is it a single soil layer? How deep? Please clarify.
L190. What is the reference period? 1987-2018? Clarify.
L194. This sentence is not clear to me. Does the 30% refer to the total area of the basin, meaning that a minimum number of sub-basins (covering at least 30% of the total area) need to be in moderate drought?
L196. You mention short periods, but I do not see any constraints on the duration of an event. Please better clarify the definition of drought event used here (i.e. starts when at least 30%...., and ends when…). Also, if any kind of spatial or temporal pooling is performed please clarify.
L198. The PCA has a very limited role in this study. I suggest reevaluating the need to include this section and this analysis in the study.
L216-221. This is a rather generic description of the methodology. Please contextualize the method to your study. This section should answer the questions: What is a predictand (see comment below)? Why are there multiple predictands? Why do you need MVRT instead of a simple RT?
L223. The response variables need to be better identified here. The generic “drought severity” used here leaves a lot of questions to the readers. Is it a time series of event severity for each sub-basin? A time series over the entire basin? Just a single value (average or similar)? This needs to be clarified here (and eventually detailed later) in order to justify the multivariate dimension of the problem.
L223-229. Related to the previous point. Here you first give the impression that agricultural and hydrological drought severities are the two “multivariate” variables. Then, you clarify that the two are studied separately, leaving open the question of what the “multivariate” variable is then. This can be only indirectly inferred from the results section, but it must be clearly stated already here.
Since section 2.5 is supposed to be the main methodology section, you need to significantly extend this section and add all the needed clarifications. Also, a link to the flow chart should be reported here.
L234. Again, similarly to the previous section, it is not clear what average means here. Is it a spatial average? A temporal average? Do you use time series of spatial-average values for each sub-basin or just a single value? This can be indirectly inferred from the results, but it should be made clear here.
L240. Following the previous comment: so, do you have 3 values for each sub-basin as response variables? Are the frequencies in the 3 categories then the “multivariate” response?
L251. Which two groups?
L276. This sentence seems to imply that two methods are used to choose the size, which is in contrast with the next sentence. Please clarify.
L294-295. This should be clarified in the methodology and not here.
I am not 100% sure that the data reported in sections 3.1 and 3.2 are results of the study. They may fit better in the “Data and method section”, since they do not bring much to the discussion on the use of MVRT.
Section 3.3. It is not clear how these 6 events are derived from the methodology described in section 2.3. There, only a minimum fraction of the area in the sub-basin is defined, and nothing is said on duration/continuity of an event. Is there any constraint on duration? Did you remove the minor events? Please clarify.
Table 5. There is a typo on event 4 (IV).
L310. This should be made clear much sooner in the text, clearly highlighting that the multivariate aspect of the MVRT refers to the 3 categories.
3.4 As I said before, this has a very marginal impact on the analysis. In the end, you included all the variables in the MVRT analysis, but some of them were not actually used in the final trees (and some only very marginally). What does this say about the usefulness of the PCA in this case? I suggest removing this part and focusing more on analyzing the variables used in the two final MVRTs and the differences between the two trees.
L334-342. Was an analysis on a limited number of explanatory variables also performed? As an example: how different are the results if only ET and PREC are used? Are some leaves really necessary? As an example, h) and i) are separated only at the end and based on WYLD, but the plots in Fig. 9 are quite similar. Are all 12 leaves relevant, considering that you then discuss only 3 macro regions? Some leaves are also quite small (just 2 basins for b) and k) for instance); if these are relevant, then they shouldn’t be grouped in the 3 macro regions in the discussion and conclusion sections.
The same considerations are true for the results on hydrological drought.
L424-426. This should be better supported by some synthetic results, rather than leaving the extraction of meaningful information to the readers.
L514-521. This explanation is a little lacking, since the explanatory variables and the targets are both derived from the same modelling framework. I am wondering if some variables that are relevant for the hydrological drought were not included in the analysis.
L523-529. Even if 9/11 were included, some have a very limited role and appear only in hydrological drought. This discussion needs to be expanded, and a more in-depth comparison of the two trees needs to be added.
L542. Is this true also for hydrological drought?
L546-547. This subdivision in three sub-areas is never highlighted in the results, and it is not evident how and why these three sub-areas are the same for agricultural and hydrological droughts, given that different trees and explanatory variables are identified.
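The questions about the drought event definition (L194-L196 and Section 3.3) can be made concrete with a short sketch of one possible delineation rule. It assumes an event starts in the first month in which at least 30% of the basin area is in drought and ends in the last consecutive month meeting that condition, with an optional minimum duration used to drop minor events. The ending rule, the `min_duration` parameter and the area fractions are assumptions for illustration; the manuscript may use a different rule.

```python
from typing import List, Sequence, Tuple


def delineate_events(
    area_fraction: Sequence[float],
    threshold: float = 0.30,
    min_duration: int = 1,
) -> List[Tuple[int, int]]:
    """Return (start_month, end_month) index pairs, inclusive, for runs of
    consecutive months whose drought-affected area fraction is at or above
    the threshold and that last at least min_duration months."""
    events: List[Tuple[int, int]] = []
    start = None
    for month, fraction in enumerate(area_fraction):
        if fraction >= threshold:
            if start is None:
                start = month  # first month of a new run
        elif start is not None:
            if month - start >= min_duration:
                events.append((start, month - 1))
            start = None
    # Close a run that is still open at the end of the record.
    if start is not None and len(area_fraction) - start >= min_duration:
        events.append((start, len(area_fraction) - 1))
    return events


# Hypothetical monthly fractions of basin area in at least moderate drought.
fractions = [0.10, 0.35, 0.50, 0.20, 0.31, 0.05, 0.40, 0.45, 0.60]

all_events = delineate_events(fractions)
long_events = delineate_events(fractions, min_duration=2)
print(all_events)   # [(1, 2), (4, 4), (6, 8)]
print(long_events)  # [(1, 2), (6, 8)]
```

Stating the start rule, the end rule and any minimum duration in this explicit form would show how the six reported events follow from the method.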
Citation: https://doi.org/10.5194/nhess-2023-50-RC1
RC2: 'Comment on nhess-2023-50', Samuel Jonson Sutanto, 18 May 2023
Summary
This paper uses a Multivariate Regression Tree (MVRT) machine learning approach to identify the drivers of soil moisture and hydrological droughts in the Cesar River Basin, Colombia. The Soil and Water Assessment Tool (SWAT) was used to simulate the soil moisture and streamflow and later these variables were used to indicate droughts in the basin. 11 model inputs and outputs were used as drought drivers. The authors found that three distinct areas could be identified that have different drought drivers. The upper part of the basin is strongly influenced by precipitation deficit and high potential evapotranspiration leading to soil moisture and streamflow drought. The middle part is more prone to streamflow drought and is influenced by rainfall, percolation, and evapotranspiration. Last, retention capacity which reduces water loss by evapotranspiration and percolation is the main driver for the lower region.
Assessment
This paper explores the relationship between 11 potential hydrological drivers and soil moisture and streamflow drought severities using the MVRT machine learning technique. The manuscript is very interesting and readable. I have a few minor comments below and three general comments, but only for clarification. I believe this work is well suited for NHESS.
General Comments
I have three general comments regarding the manuscript but all of them are only for clarification and improvement of the manuscript.
- I am wondering why the authors used SMDI and SSI to identify soil moisture/agriculture and streamflow droughts, respectively. These are two different methods. Why don’t the authors use the Standardized Soil Moisture Index (SSMI) in order to have a method comparable with SSI, since both are standardized indices? I suggest writing a clarification of why the authors decided to use SMDI instead of SSMI.
- I do not get the importance of PCA analysis in your study. Here the authors used the PCA to further confirm the key drivers of droughts obtained from the MVRT. However, more explanation in the text about the PCA results and how these confirm the MVRT results is lacking. For example, from the 11 variables, which variables have the higher explained variances, and how to read the loading factors in the sense of what positive and negative signs mean? From the MVRT, I can see that ET, precipitation, and percolation are key drivers for agriculture drought (correct me if I am wrong) and the key drivers for streamflow drought are precipitation and water yield.
- An explanation of why the authors used different CVRE, relative error, and EV values for agriculture and streamflow droughts is needed.
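On the first general comment, the appeal of using standardized indices for both variables is that one procedure is applied to soil moisture and streamflow alike. The sketch below shows a simple nonparametric version: each value is ranked, converted to an empirical non-exceedance probability with the Weibull plotting position, and mapped to a standard normal quantile. This is an assumed simplification for illustration only. In practice such indices are usually computed separately per calendar month and often with a fitted distribution, ties are not handled here, and the input values are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def standardized_index(values: Sequence[float]) -> List[float]:
    """Map each value to a standard normal quantile via its empirical rank.

    Uses the Weibull plotting position rank / (n + 1); ties are not treated
    specially (assumption made for brevity).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    index = [0.0] * n
    standard_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        probability = rank / (n + 1)
        index[i] = standard_normal.inv_cdf(probability)
    return index


# Hypothetical values for one calendar month across five years.
soil_moisture = [0.21, 0.35, 0.28, 0.15, 0.30]  # assumed volumetric fraction
streamflow = [12.0, 40.0, 25.0, 8.0, 30.0]      # assumed flow (m3/s)

ssmi_like = standardized_index(soil_moisture)
ssi_like = standardized_index(streamflow)
print([round(v, 2) for v in ssmi_like])
print([round(v, 2) for v in ssi_like])
```

Because both series pass through the same transformation, the resulting values share one scale, and the same thresholds can be used to classify drought categories for either variable.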
Line by line comments
L refers to line and P refers to page.
P1L17: Maybe add the words “such as” -> ….model outputs, such as soil moisture and streamflow….
P1L26: the authors may replace the word “brought on” with “caused”
P2L46-50: Here the authors describe drought propagation. I suggest stating this clearly so that the readers understand what drought propagation is. Moreover, the authors also used the term propagation a few times in the next paragraph. The authors may also add a drought propagation study by Van Loon et al. (2012).
P3L66-68: Write the references (two studies) directly after the sentence.
P3L79-80: Same, write such as or the authors may re-write it as: “Soil moisture and streamflow obtained from the SWAT model are used to…….”
P3L80-81: The authors mention “other simulated hydroclimatic parameters….” -> mention them already.
P3L86: I suggest restructuring section 2. Section 2 will be Study location and methods and thus section 2.1 will be study location and section 2.2 will be methods. Swap figures 1 and 2 accordingly. In the present form, the authors first mention the flowchart describing the data and method, but then no explanation follows. The study location is placed after this one sentence about data and method, and then section 2.2 returns to the method again.
P5L117: Figure 2. Please label this figure as 2a and 2b and refer to these figures in the text above. I also strongly suggest the authors change the color label for Figure 2b. Using red for water bodies, blue for grass, and black for forest is not common. Change the color codes to the most commonly used colors to represent the land use.
P6L143: Maybe reverse the abbreviations and full names. I suggest writing the full name first and then the abbreviation.
P8L195: Please write the minimum threshold. Is it 30%?
P9L198: PCA analysis. Please state in this section that the authors used only the first through third components.
P9L215: MVRT method. Some explanation of why the authors used only this single method is encouraged.
P10L234: value -> values
P10L242-243: I am wondering why the authors use the total number of months for each drought category and not monthly values. By doing this, are the response variables then only 1 total number of SM drought months and 1 total number of streamflow drought months? I thought the input variables for both explanatory and response were monthly data or at least yearly data.
P11L251: What are these two groups?
P13L292: Figure 3. I suggest placing the panel letters (a, b, c, and so on) at the top of the figure. Moreover, please use different colors for observed and simulated for better visibility.
P13L296: The authors may re-write the sentence into “……of the parameters, which are the curve number, slope, and soil type at…..”
P14L300: Figure 4. Please describe the soil types. What is soil type a, b, c, and d? I could not find it anywhere.
P16L316: PCA. Please see my general comment.
P17L347-348: Please re-write this sentence: “This leaf contains no instance of severe…” It is unclear what the authors mean by “no instance”. Also, write Figure 9b after the sentence.
P18L374: Figure 8. Please mention that a, b, c, d, and so on are the number of n in each decision tree. Same for all figures. The figure caption should be self-explanatory and detailed.
P20L411: I see that the MVRT has higher (g) and lower (h) than 0.5 mm.
P22L476: Please mention the selected drivers.
P23L484: The authors stated “previous studies”. Please mention those studies.
P24L524: What do the authors mean by eleven out of nine potential drivers? Usually 9 out of 11 and not vice versa.
P25L555: Here the authors mention other ML techniques. This is the reason I suggest the authors describe why the MVRT was selected over the others.
Reference
Van Loon et al.: Evaluation of drought propagation in an ensemble mean of large-scale hydrological models, Hydrol. Earth Syst. Sci., 16, 4057–4078, 2012.
Citation: https://doi.org/10.5194/nhess-2023-50-RC2
Viewed
- HTML: 266
- PDF: 53
- XML: 11
- Total: 330
- BibTeX: 3
- EndNote: 4