Estimating soil moisture conditions for drought monitoring with random forests and a simple soil moisture accounting scheme

. Soil moisture is a key variable for drought monitoring, but soil moisture measurements networks are very scarce. Land-surface models can provide a valuable alternative for simulating soil moisture dynamics, but only a few countries have such modelling schemes implemented for monitoring soil moisture at high spatial resolution. In this study, a soil moisture accounting model (SMA) was regionalized over the Iberian Peninsula, taking as a reference the soil moisture simulated by a high-resolution land-surface model. To estimate the soil water holding capacity, the sole parameter required to run the SMA model, two approaches were compared: the direct estimation from European soil maps using pedotransfer functions or an indirect estimation by a machine learning approach, random forests, using as predictors altitude, temperature, precipitation, potential evapotranspiration and land use. Results showed that the random forest model estimates are more robust, especially for estimating low soil moisture levels. Consequently, the proposed approach can provide an efﬁcient way to simulate daily soil moisture and therefore monitor soil moisture droughts, in contexts where high-resolution soil maps are not available, as it relies on a set of covariates that can be reliably estimated from global databases.


Introduction
Soil moisture droughts have strong impacts on vegetation and agricultural production (Raymond et al., 2019;Tramblay et al., 2020;Vicente-Serrano et al., 2014;Pena-Gallardo et al., 2019). There is a growing interest in simple indicators to monitor drought events at short timescales that could be related to impacts (Li et al., 2020;Noguera et al., 2021). In particular, soil moisture indicators could be more relevant than climatic ones to monitor potential impacts of droughts on agriculture and natural vegetation (Piedallu et al., 2013). Since actual soil moisture measurements remain very scarce, soil moisture simulated from land-surface models is an interesting proxy to develop simplified methodologies that could be applied on data-sparse regions. Landsurface models (LSMs) are valuable tools for a fine-scale monitoring of drought events; however, their implementation requires accurate forcing data and computational resources (Almendra-Martín et al., 2021;Quintana-Seguí et al., 2019;Barella-Ortiz and Quintana-Seguí, 2019). Global implementation also exists but with a coarser resolution and driven by reanalysis data (Rodell et al., 2004; that may not be adequate for local-scale applications. Only very few countries have land-surface schemes implemented at the national level to monitor droughts . Remote sensing is another option which allows for monitoring soil moisture (Dorigo et al., 2017;Brocca et al., 2019). Microwave sensors allow for the monitoring of surface soil moisture (first 5 cm for L-band-based products and skin for C-band-based products), without the interference of clouds. However, surface soil moisture is not enough for most applications, which require root zone soil moisture, which is the water resource in the soil available to plants. Furthermore, passive L-band products, such as SMOS (Soil Moisture and Ocean Salinity; Martínez-Fernández et al., 2016) or SMAP (Soil Moisture Active Passive; Mishra et al., 2017), have a low resolution, and active C-band products, such as Sentinel-1 (Bauer-Marschallinger et al., 2019), which have a higher resolution, suffer from higher noise and are more sensitive Y. Tramblay and P. Quintana Seguí: Estimating soil moisture conditions for drought monitoring to vegetation. Thus, even though remote sensing is very useful, it still has problems to be surmounted. The resolution of passive L-band products can be increased using optical data (NDVI, normalized difference vegetation index; LST, landsurface temperature), by means of downscaling algorithms (Merlin et al., 2013;Fang et al., 2021), but then the resulting product is sensitive to cloud cover. Also, some progress has been made in deriving root zone soil moisture from surface soil moisture estimations using an exponential filter (Stefan et al., 2021) calibrated using the SURFEX LSM (Surface Externalisée; Masson et al., 2013), but these products are in their early stages and are not operational yet.
Simplified methodologies to estimate and monitor the status of soil moisture are needed in contexts where LSM data are not available and where remote sensing products fall short, such as areas and time periods with dense vegetation or high soil roughness which may affect their accuracy (Escorihuela and Quintana-Seguí, 2016). Different modelling approaches have been proposed, either with conceptual soil moisture accounting models or computational variants of the antecedent precipitation index (Willgoose and Perera, 2001;Javelle et al., 2010;Brocca et al., 2014;Zhao et al., 2019;Li et al., 2020). The general availability of spatial estimates of soil moisture content would help introduce soil moisture into drought monitoring systems, improving their scope and usefulness. Furthermore, this would also facilitate the creation of long-term reanalysis, based on meteorological forcing data, and future climate change studies, without the need for running LSMs. However, to apply this model type at a regional or national scale, there is a need to estimate their parameters over the area of interest. For that purpose, regionalization methods have been employed in hydrology for decades to estimate the parameters of hydrological models in ungauged basins (Blöschl and Sivapalan, 1995;He et al., 2011;Hrachowitz et al., 2013). Several methods exist, based on either catchment similarity or the direct estimation of model parameters using regression techniques with physiographic attributes. For soil moisture modelling, up to now only very few studies have considered these approaches to apply soil moisture accounting models at ungauged locations (Grillakis et al., 2021) or estimate root zone soil moisture using machine learning methods (Carranza et al., 2021).
The goal of the present study is to regionalize a simple soil moisture accounting (SMA) scheme that could be used to monitor soil moisture droughts. The SMA model considered in the present study requires a single parameter, the maximum soil water holding capacity. Two different approaches are compared to estimate this parameter regionally: direct estimation with soil maps or with a machine learning technique, namely random forests.

Study area and data
The study area of this work is the Iberian Peninsula, which is located between the Mediterranean Sea and the Atlantic Ocean and thus is influenced by both synopticscale systems, which often come from the Atlantic side, and mesoscale heavy-precipitation events, which often come from the Mediterranean side. The Iberian Peninsula presents a marked relief, with a large and high central plateau and different mountain ranges, which heavily influence the spatial patterns of precipitation, enhancing it windward and decreasing it leeward, generating areas of high precipitation in the west, northwest and north and very dry areas on the central plains and, especially, in the southeast. As a consequence the Iberian Peninsula has a heterogeneous distribution of average annual rainfall, with values ranging from 2000 mm yr −1 to less than 100 mm yr −1 . All this has a strong influence on the spatial and temporal variability of soil moisture and soil moisture regimes, having wet regimes in the west and north, where the soil is hardly stressed, and semi-arid areas elsewhere, with a wet (energy-limited) and a dry (water-limited) season, with a dry down that might be interrupted by convective events. All this makes the modelling of soil moisture in Iberia a rather challenging task.
Daily precipitation, temperature and potential evapotranspiration (PET) were retrieved from the SAFRAN-Spain database (Quintana-Seguí et al., 2017). SAFRAN (Durand et al., 1993) is a meteorological reanalysis that produces gridded datasets by combining the outputs of a meteorological model and all available observations using an optimal interpolation algorithm. It has been implemented over France  and recently over the Iberian Peninsula (Quintana- Seguí et al., 2017) with a 5 km × 5 km spatial resolution. The SAFRAN dataset used in this study includes not only observations from the Spanish part of the Iberian Peninsula but also ingested data from Portugal. The SURFEX LSM  has been run using SAFRAN-Spain as the meteorological forcing dataset and on the same grid, as was done in Quintana- . SURFEX uses the ECOCLIMAP2  physiographic database, and it uses the ISBA (Interaction Sol-Biosphère-Atmosphère) scheme (Noilhan and Mahfouf, 1996) for natural surfaces. ISBA has different options; we have used ISBA-DIF, the multi-layer diffusion version (Boone, 2000;Habets et al., 2003). From this simulation, we have extracted the soil moisture of the first 60 cm of the soil by calculating the weighted average of the soil layers that fall within this range. This simulated soil moisture over the Iberian Peninsula is considered herein as the observed reference, in the absence of dense monitoring networks of soil moisture (Martínez-Fernández et al., 2016). From the ECO-CLIMAP2 database, elevation and land cover data have also been retrieved and aggregated into the following nine categories: water, bare, ice/snow, urban, forest, grass, dry crops, irrigated crops and wetlands.
We also use the European Soil Database (ESDB) produced by the European Soil Data Centre (Panagos et al., 2012). The ESDB contains information on soil characteristics, including soil depth and texture for topsoil (0-30 cm) and subsoil (30-70 cm) layers at a grid resolution of 1 km. The total available water content (TAWC) is a volumetric parameter describing the water content between the field capacity and permanent wilting point, as a function of the available water content, presence of coarse fragments and depth (Reynolds et al., 2000). In the ESDB, water content at the field capacity and permanent wilting point were determined following the equation from van Genuchten (1980) to estimate the soil water retention curve (Hiederer, 2013). The parameters of the equation are provided by a pedotransfer function (Wösten et al., 1999) for the volumetric soil water content computed from the soil water retention curve. The pedotransfer function uses soil texture, organic carbon content and bulk density to determine the parameters of the soil water retention curve (Hiederer, 2013).

Soil moisture accounting model
The soil moisture model considered here has been previously applied in several studies for applications related to soil moisture monitoring (Anctil et al., 2004;Javelle et al., 2010;Tramblay et al., 2012Tramblay et al., , 2014. It consists in the SMA part of the GR4J model (Génie Rural à 4 paramètres Journalier; Perrin et al., 2003), driven by precipitation and PET, which represents a conceptual formulation of the impact of precipitation and PET on the soil water balance, using a soil reservoir of fixed depth A. This parameter represents the maximum capacity of that reservoir, which can be assumed to be equivalent to the soil water holding capacity (Perrin et al., 2003;Javelle et al., 2010;Tramblay et al., 2014).
The soil reservoir has a net outflow when PET exceed rainfall.
If P t ≤ PET t , In all the other cases it has a net inflow.
where S * can never exceed the maximum reservoir capacity. Finally, the outflow from the storage reservoir due to percolation is taken into account using The level of the soil reservoir is given by S/A, ranging between 0 and 1, which provides a soil wetness index (SWI) for the catchment. The outputs of SURFEX soil moisture are first normalized with the maximum and minimum values to obtain an SWI consistent with the SMA model output. Then, the SMA model parameter A is calibrated using this normalized SURFEX soil moisture as a reference. The SMA model is calibrated for each grid cell independently using soil moisture simulated with SURFEX covering the full Iberian Peninsula domain. The Nelder-Mead simplex algorithm is used for the calibration with the Nash efficiency criterion. To regionally estimate the values of A, two different methods are compared: the direct estimation of A with TAWC from ESDB soil maps or its indirect estimation with machine learning methods, namely random forests using 5 km × 5 km grid physiographic and climatic properties.

Regionalization with soil maps
The first approach consists in using the total available water content from the ESDB to estimate the A parameter for each grid cell. In the present work, the TAWC of subsoil and topsoil layers have been added and averaged at the scale of 5 km × 5 km, matching the spatial resolution of the SAFRAN grid. Then, these estimates have been used to set the A parameter of the SMA model. Thus, this regionalization approach is based on the a priori estimation of the A parameter from soil maps solely.

Regionalization with random forests
Random forests (RFs; Breiman, 2001) belong to the class of machine learning techniques. RFs are based on a bootstrap aggregation (Breiman, 1996) of classification and regression trees (Breiman et al., 2017). They generate a bootstrap sample from the original data and trains a tree model using this sample. The procedure is repeated many times, and the bagging's prediction is the average of the predictions. Among the many advantages of RFs, they are fast, non-parametric, robust to noise in the predictor variables, able to capture nonlinear dependencies between predictors and dependent variables, and can simultaneously incorporate continuous and categorical variables (Tyralis et al., 2019). The drawbacks are that they are complex to interpret and cannot extrapolate outside the training range. Given their advantages, this algorithm is particularly suited for the estimation of spatial variables such as soil properties (Booker and Woods, 2014;Hengl et al., 2018;Gagkas and Lilly, 2019;Stein et al., 2021). In the present work, an RF model is generated to estimate the values of the A parameter of the SMA model, representing the soil water holding capacity, with the properties of the 5 × 5 km grid cells, namely altitude, land cover, mean annual precipitation, temperature and PET, using random forests.
To estimate the reliability of the method, the 5 km × 5 km grid cells covering the Iberian Peninsula have been split randomly into a training sample containing 70 % of the cells (15 636 data points) and a testing sample with the remaining 30 % cells (6701 data points). The random selection of the training and testing sets have been performed using a Latin hypercube sampling (McKay et al., 1979) to ensure homogeneous sampling over the Iberian Peninsula. Given that the RF trees cannot be interpreted directly, as for example the weights in a linear regression, we additionally implemented an out-of-bag predictor importance estimation by permutation (Loh and Shih, 1997) to measure how influential the predictor variables in the model are at predicting the response. The influence of a predictor increases with the value of this measure. If a predictor is influential in prediction, then permuting its values should affect the model error. If a predictor is not influential, then permuting its values should have little to no effect on the model error.

Validation on the ability to detect dry soil moisture conditions
To compare the efficiency of the two methods compared to estimate the A parameter of the SMA model, the SMA model was run using the two methods, and all daily values of soil moisture below the 10th percentile were extracted, corresponding to dry soil conditions. Only the grid cells in the testing sample were considered for this validation. We computed different verification scores to assess the relative efficiency of the two methods to reproduce daily soil moisture below the 10th percentile using the ISBA simulated soil moisture as a benchmark: the probability of detection (POD), the false-alarm ratio (FAR) and the Heidke skill score (HSS) summarizing the global efficiency to detect dry periods (Jolliffe and Stephenson, 2011). These scores are based on the contingency table between forecasts (or simulated values in the case of the present study) and observations (Table 1). POD is the probability of detection (Eq. 1); FAR is the number of false alarms per the total number of warnings or alarms (Eq. 2); and HSS is a skill score ranging from −∞ to 1 (Eq. 3), for categorical forecasts where the proportion of correct measure is scaled with the reference value from correct forecasts due to chance.

Calibration of the SMA model
The calibration results of the SMA model against SURFEX soil moisture provide very good model performance, with a mean Nash coefficient equal to 0.94, indicating its ability to reproduce the soil moisture dynamics as simulated by SURFEX. Nash values below 0.5 are found for 1.21 % of grid cells (n = 273); these are only for areas located in the mountainous range affected by snow processes above 1500 m a.s.l. (Fig. 1). This outcome is expected; since the SMA model does not include a snow module, it cannot reproduce snow dynamics in these areas. However, high-elevation areas with seasonal snow cover are not the areas most at risk of soil moisture droughts for agricultural activities in Spain.
The calibrated values of the A parameter of the SMA model range from 60 to 250 mm, depending on the location (Fig. 2). There is no significant correlation between A and the mean annual precipitation or the aridity index (P /PET). This highlights the interplays between soil properties and climate to explain the spatial variability of the soil water holding capacity.

Regional estimation of the A parameter
The values of the calibrated A parameter are related to the properties of the 5 km × 5 km grid cells using random forests. First, an out-of-bag predictor importance estimation by permutation is applied to compute the overall performance of RFs and estimate the relative influence of each predictor. When using the A out-of-bag estimates to run the SMA model, the loss of performance is very small; the decrease in Nash values in validation is on average equal to −0.0019 (with a maximum decrease of −0.04). This is due to the low sensitivity of the SMA model to the value of A, given that the error in the estimation of A is in the range of 10 mm (RMSE of 13.18 mm). This type of validation mimics the case when the estimation at one single location is required, yet since all the remaining points are used for the estimation, it makes the approach in that case very robust. The relative importance for each predictor is plotted in Fig. 3, indicating that precipitation and potential evapotranspiration are the two most important predictors, followed by altitude. On the contrary, the land cover attributes for each grid cell are the least important predictors, and removing them from the RF model does not   significantly change the results. This shows the relative importance of climatic variables in the spatial variability of the soil moisture holding capacity.
To estimate the robustness of the method, we applied a split-sample validation into a testing and a training sample.
The results are presented for the testing set (Fig. 4). The performance in terms of Nash for the SMA model with A estimated by random forests or soil map is very similar, with a mean Nash value equal to 0.86 (median of 0.89) with RFs and 0.81 (median of 0.85) with soil maps. The Nash values in validation (testing set) are low, or even negative, only for mountainous ranges, as expected. Overall, the spatial patterns of the Nash coefficients obtained with RFs or the ESDB are very similar too. There are no significant relationships between model efficiency and the aridity index or the presence of irrigated areas, as identified in the ECOCLIMAP2 land cover database.

Estimation of dry soil conditions
A further validation is made for daily soil moisture below the 10th percentile corresponding to dry soil conditions. We computed the probability of detection (POD), the false-alarm ratio (FAR) and the Heidke skill score (HSS) summarizing the global efficiency to detect dry periods. For both approaches to estimate A, the mean POD is very high, close to 97 %, while FAR is close to 3 %. But these average results hide some discrepancy in the different regions (Fig. 5): the efficiency is the highest for the northwestern region, the wettest areas of Spain, with the most important increase of HSS and POD, associated with a decrease in FAR, using random forests, while in the southern and central parts of Spain, the performance is lower on average and very similar to the two regionalization approaches. For the wettest parts of the Iberian Peninsula, POD remains higher than 94 %, and FAR is lower than 6 %; it is the region where the main improvements with RFs are observed. As shown in Fig. 5, the results with random forests mostly follow the climate conditions, with improved estimations in the wettest regions of the northern and northwestern parts of Spain. For the estimation with EU soil maps, the results seem related to soil depth and, to a lesser extent, land cover. Indeed, higher scores are found in regions with shallow soils, such as those of plutonic (Galicia, western parts of the Extremaduran mountainous ranges and Douro basin) or metamorphic  where soil depth is probably overestimated. On average, the RF estimation method outperforms the approach based on the ESDB (Fig. 6), with more stable results in terms of HSS, since all values obtained with RFs are above 0.4, while with the ESDB for the grid cells, the HSS scores drops to values close to zero.

Summary and conclusions
In this study, a simple model allowing for the monitoring of soil moisture conditions was regionalized over the entire Iberian Peninsula, taking as a reference the soil moisture simulated by a high-resolution land-surface model. Two different regionalization methods have been compared, either by the direct estimation of the soil water holding capacity from European soil maps or by random forests, using covariates such as altitude, temperature, precipitation, potential evapotranspiration and land cover. Results have shown that the estimation by random forests is more robust notably to estimate low soil moisture levels. Despite similar average performance between the two methods, the use of soil maps to set the water holding capacity reveals less stable results in some cases, most probably related to the uncertainties in the pedotransfer functions used. While these pedotransfer functions are process-based predictive functions of certain soil properties, random forests are not based on physical processes and are tailored to provide the best estimates in a statistical sense. Therefore, they provide a valuable alternative in contexts where high-resolution soil maps are not available, since they rely on a set of covariates that can be reliably estimated from global databases, such as satellite or reanalysis products (Funk et al., 2015;Hersbach et al., 2020;. It should be noted that the results presented herein are highly dependent on the quality of land-surface simulations, in the absence of dense monitoring networks of in situ soil moisture data; thus these results suffer from the same limitations as LSMs, notably, the lack of human process representation in these models (notably irrigation). However, new remote sensing irrigation estimates are being developed (Massari et al., 2021); as a consequence, once the RF model is trained, irrigation estimates could be added to the precipitation forcing data in order to include the human impacts on soil moisture estimations. The results show that this ap-proach could allow for cheaply extending the value of highresolution LSM simulations to areas where no LSM is implemented (i.e. northern Africa), as long as the climate conditions belong to the range of values used to train the model, mostly in terms of precipitation and potential evapotranspiration ranges. Thus, the model train over the Iberian Peninsula could be applied to other similar areas such as northern Africa, Italy or Greece. As a perspective, other simulations from countries where high-resolution LSM simulations are available, such as France or the USA, could be added to the database in order to expand the coverage over different physiographic and climate contexts (Ma et al., 2021). Consequently, the benefits of LSM simulations of soil moisture could be expanded to other areas, provided those suitable forcing datasets are available. Furthermore, if public meteorological and hydrological organizations were to create soil moisture observation networks, cleverly designed to cover the most relevant climates of their countries, this approach could be used to train the model using these observations and then regionalize the results to the rest of the territory, thus, converting an in situ observation dataset into a gridded dataset with a much greater spatial coverage.
Code availability. Codes (such as MATLAB scripts and functions) are available upon request.
Author contributions. YT and PQS designed the study. YT carried out the data processing, analyses and visualization and wrote the initial draft of the paper. PQS extracted all the required data, performed the SURFEX simulations and reviewed the initial draft of the paper.
Competing interests. The contact author has declared that neither they nor their co-author has any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Special issue statement. This article is part of the special issue "Hydrological cycle in the Mediterranean (ACP/AMT/G-MD/HESS/NHESS/OS inter-journal SI)". It is not associated with a conference.
Acknowledgements. This work is a contribution to the HyMeX programme through the HUMID (grant no. CGL2017-85687-R, AEI/FEDER, UE) and ANR HILIAISE projects. We thank Jaime Gaona (Instituto de Investigación en Agrobiotecnología -CIALE, Universidad de Salamanca, Villamayor, Salamanca, Spain) for his comments on some aspects of the manuscript and two anonymous reviewers for their suggestions to improve the manuscript.
Review statement. This paper was edited by Christian Barthlott and reviewed by two anonymous referees.