Using principal component analysis to incorporate multi-layer soil moisture information in hydrometeorological thresholds for landslide prediction: an investigation based on ERA5-Land reanalysis data

. A key component for landslide early warning systems (LEWSs) is constituted by thresholds providing the conditions above which a landslide can be triggered. Traditionally, thresholds based on rainfall characteristics have been proposed, but recently, the hydrometeorological approach, combining rainfall with soil moisture or catchment storage information, is becoming widespread. Most of the hydrometeorological thresholds proposed in the literature use the soil moisture from a single layer (i.e., depth or depth range). On the other hand, multi-layered soil moisture information can be measured or can be available from reanalysis projects as well as from hydrological models. Approaches using this multi-layered information are lacking, perhaps because of the need to keep the thresholds simple and two-dimensional. In this paper, we propose principal component analysis (PCA) as an approach for deriving two-dimensional hydrometeorological thresholds that use multi-layered soil moisture information. To perform a more objective assessment we also propose a piecewise


Introduction
The impact of landslides triggered by rainfall is constantly increasing due to landscape modifications, i.e., urbanization, deforestation, land changes, and the abandonment of rural areas (Roccati et al., 2019).Landslides can cause serious damage to man-made structures and land, as well as loss of natural resources and lives.The role of landslide risk in human well-being is highlighted by the fact that more than 4800 landslide occurrences were documented from 2004 to 2016, with over 55 000 reported fatalities at a global scale (Froude and Petley, 2018).Furthermore, landslides triggered by rainfall have been identified as the cause of approximately 90 % Published by Copernicus Publications on behalf of the European Geosciences Union. of fatalities globally (Haque et al., 2016;Sultana, 2020), and, from an economic point of view, annual losses were estimated to total USD 20 billion (Sim et al., 2022).
Over the last decades, an increasing number of studies also focused on the potential effects of climate change on landslide phenomena (McInnes et al., 2007;Dijkstra and Dixon, 2010;Crozier, 2010), pointing out that there are some unresolved issues, such as the abundance, activity, frequency, and return period of landslides in response to the projected climate change (Gariano and Guzzetti, 2016;Peres and Cancelliere, 2018).In light of these considerations and, after recent catastrophic landslides worldwide, there is high interest from scholars and civil protection agencies in the development of landslide early warning systems (LEWSs), which can serve as an aid in predicting possible slope movements and thus as a risk mitigation tool (Roccati et al., 2020;Highland and Bobrowsky, 2008;Chae et al., 2017).
Landslide-triggering thresholds are a key component of LEWSs.In general, empirical rainfall thresholds, which relate the occurrence of landslides to rainfall event characteristics such as intensity, duration, total amounts, or a combination thereof, are commonly applied for the majority of regional LEWSs (Guzzetti et al., 2007(Guzzetti et al., , 2008;;Segoni et al., 2018a;Aleotti, 2004).When information on non-triggering rainfall is also available, thresholds can be determined as the best classifiers based on the confusion matrix (Berti et al., 2012;Staley et al., 2013;Peres andCancelliere, 2014, 2021;Postance et al., 2018).In the last decade, there has been an increasing interest in the development of hydrometeorological thresholds that consider rainfall characteristics and subsurface hydrological variables, such as soil moisture content and catchment storage information (Uwihirwe et al., 2022;Mirus et al., 2018a, b;Thomas et al., 2018;Segoni et al., 2018b;Wicki et al., 2020Wicki et al., , 2021;;Bogaard andGreco, 2018, 2016;Reder and Rianna, 2021;Marino et al., 2020;Palau et al., 2021;Conrad et al., 2021).These studies demonstrate improvements of the prediction performances with the hydrometeorological approach, with respect to the traditional precipitation-based thresholds, even if not all climatic areas have been explored, so further applications are still useful.Furthermore, none of the previous studies take into account the possibility to exploit the information from a soil moisture profile or multi-layered soil moisture information, corresponding to several depths or depth ranges.This is most likely because thresholds have to be kept simple, i.e., twodimensional, for being effectively communicated to decision makers.
In the present work, we propose an approach that allows taking into account the multi-layer soil moisture information within hydrometeorological thresholds while keeping these two-dimensional thanks to a statistical technique named principal component analysis (PCA) (Jolliffe, 2002).This technique allows us to find the linear combination between soil moisture at different depth layers which retains as much as possible the information content of the multi-ple layers together, capitalizing on the presence of correlation between the soil moisture at different depths.The proposed approach is also intended to test whether multi-layer soil moisture information may provide better predictive performance than the single-layer one by comparing the relative prediction performances in terms of receiver operating characteristic (ROC) indices, such as the well-known true skill statistic (TSS).We carry out our investigation using observed precipitation in combination with ERA5-Land reanalysis soil moisture data, available at four different depth layers with a 0.1 • × 0.1 • ( ∼ = 9 km) resolution (Hersbach et al., 2020).Recent studies proved that the main climate variables (i.e., soil moisture, temperature, precipitation) obtained from third-generation atmospheric and reanalysis datasets (i.e., ERA5 project) have a reasonable accuracy in reproducing in situ measurements (Dorigo et al., 2011;Li et al., 2020;Beck et al., 2021), though accuracy issues still remain significant.The case study of the Sicily region is used to test the proposed methodology.
The paper is organized as follows.First, the procedure for the dataset creation and description of the methodology leading to the proposed approach to implement multilayer soil moisture data in hydrometeorological thresholds are presented in the "Material and methods" section.Then, the "Study area" section describes the relevant features of the study area, namely the Sicily island (southern Italy).Next, the results and discussion concerning the performance obtained in correspondence with all identified rainfall-triggering thresholds are presented in the "Results and discussion" section.Finally, conclusions are drawn in the last section.

Dataset construction
The construction of the rainfall and landslide events dataset is a key step that involves different types of data (i.e., observed landslides, rainfall events, and reanalysis data of soil moisture).As schematically illustrated in Fig. 1, in the first step the FraneItalia project (Calvello and Pecoraro, 2018) is employed to collect information regarding the observed landslides, as it is a thorough spatiotemporal inventory of historical landslides that have impacted the Italian territory since 2010, including both occurrences that resulted in fatalities and occurrences that did not.
The first classification criterion by the FraneItalia catalog is based on the number of landslides triggered by the same rainfall event in a given geographic area.Specifically, single landslide events (SLEs) and areal landslide events (ALEs) are distinguished for records referring to single or multiple landslides, respectively.Both SLEs and ALEs are then categorized into one of three classes in relation to their impacts, in order to track whether a landslide occurrence resulted in casualties or missing people (C1, very severe), injured people and evacuations (C2, severe), or no one was physically harmed (C3, minor).The data on occurrence location, the date the landslide occurred, the source of information, and the number of landslides for ALEs are further details that have also been included in the catalog, together with the onset and duration of the landslide occurrence and its consequences.
Thanks to this accurate level of detail, it is possible to filter only the landslide events triggered by rainfall, which are precisely those to take into consideration in our study.
The CTRL-T (Calculation of Thresholds for Rainfallinduced Landslides-Tool) code (Melillo et al., 2018) is subsequently used for the identification of the rainfall events that were more likely to be responsible for the observed slope failures.Specifically, CTRL-T automatically and objectively reconstructs rainfall events and the triggering conditions responsible for the failure using a set of adjustable parameters to account for different morphological and climatic settings.Briefly, the tool consists of distinct modules with specific purposes.Among these, one module operates the reconstruction of rainfall events in term of duration (D, in h) and cumulated event rainfall (E, in mm) using a continuous hourly rainfall time series and setting several climate and spatial parameters such as the warm period in a year (C W ), the cold period in a year (C C ), the resolution of the rain gauge (G S ), the instrumental sensitivity of the rain gauge and the minimum value exceeding which the isolated hourly measurements are considered relevant (E R ), and the radius of the buffer to assign each landslide to the closest rain gauge (R B ). Furthermore, in order to account for seasonality (i.e., different evapotranspiration rates in different periods of the year), addi-tional rainfall parameters can be set by the user, namely the dry interval separating isolated rainfall measurements (P 1 ); the time periods used to remove irrelevant amounts of rainfall, (P 2 ), and (P 3 ); and the minimum dry period separating two rainfall events (P 4 ).The readers are referred to Melillo et al. (2018) for more detailed information on these parameters.A further module instead performs the selection of the rain gauge representative for the landslide.Once the maximum allowed distance between a landslide and a rain gauge is defined as a circle of radius R B specified by the user, if more than one rain gauge is located within the circle, then the rainfall events from each rain gauge are weighted based on the rain gauge-landslide distance and the rainfall event characteristics (cumulated rainfall and duration).More specifically, given the multiple rainfall conditions (MRCs) that are most likely responsible for the slope failures as pair of rainfall event duration (D L ) and cumulated event rainfall (E L ), or a set of two or more pairs, each MRC is assigned a weight to select the representative rain gauge and the rainfall conditions associated with the landslide.The weight is proportional to the inverse square distance between the rain gauge and the landslide (d −2 ), the cumulated rainfall (E L ), and the rainfall mean intensity (E L D −1 L ): Thus, among all the identified MRCs, those with the highest weights w are defined as the maximum probability rainfall conditions (MPRCs), and these reconstructed rainfall conditions were assumed as the triggering rainfall events.Lastly, Fig. 2, depicts how the duration of a triggering rainfall event is defined.Specifically, when a landslide occurs during a dry period the whole event that preceded it is considered as a trig-  gering rainfall event; otherwise, just the rainfall that occurred before the landslide occurrence is taken into account.
As shown in Fig. 1, the last step for the dataset set-up consists of the association of soil moisture data to the beginning of each rainfall event, both triggering and non-triggering ones.In this regard, the ERA5-Land reanalysis dataset is used.It provides the volume of water ϑ [m 3 m −3 ] at four distinct soil depths levels (i.e., 0-7, 7-28, 28-100, and 100-289 cm).The ERA5-Land soil moisture data are provided at the hourly scale as grid data with a horizontal resolution of 0.1 • × 0.1 • .Thus, being at the same temporal resolution of rainfall time series, the soil moisture values representative of the closest cell to the rain gauge that recorded the rainfall event are associated, without delay, to the considered event.

Principal component analysis
Principal component analysis (Jolliffe, 2002) is a multivariate technique that analyzes a data table in which observations are described by several intercorrelated quantitative dependent variables to extract the important information from the table and to represent it as a set of new orthogonal variables called principal components (Abdi and Williams, 2010).
Precisely, the data are transformed according to a new coordinate system having the x axis, known as the first principal axis, characterized by the highest data variation.Along the successive axes (e.g., the second principal axis, the third principal axis, and so on), the data are characterized by increasingly lower variation.Each succeeding principal component explains the maximum amount of variance feasible with the requirement that it is orthogonal to the previous principal components.In practice, identifying the eigenvalues and eigenvectors of the covariance matrix is the formal mathematical equivalent of solving the PCA problem.The direction along which the data have the highest variance is the eigenvector, while the related eigenvalue is a quantification of the variance in the data along the corresponding eigenvector.Accordingly, the first principal component is the eigenvector with the greatest eigenvalue, followed by the eigenvector with the second-highest eigenvalue, and so on.Thus, the so computed principal components are employed for the projection of the data into the new coordinate space (Kherif and Latypova, 2019).
Practically, in our study, θ (Eq.2) represents the soil moisture data table for which to compute the principal components, specified as an n-by-p matrix.Rows correspond the total number n of the considered rainfall events (i.e., observations), and the number of columns to the four depths levels at which the initial soil moisture data are provided (i.e., variables).
In matrix notation, the transformation of the original variables to the principal components is written as (8)

Thresholds' identification
First, we identify the traditional rainfall intensity-duration power-law thresholds.The I D threshold has the form of a power law I = αD −β , where I [mm h −1 ] represents the rainfall intensity, i.e., the average precipitation rate over the considered period; D [h] represents the duration of the rainfall event; α is the intercept parameter; and β is the slope parameter.After reconstructing the rainfall events with the methodology explained for the dataset creation, and after calculating the main variables (i.e., mean rainfall intensity and duration), an optimization tool (i.e., the MATLAB ® particle swarm optimization toolbox) is used with the aim to search for the best possible α and β curve parameters able to maximize the The highest performances correspond to TSS = 1 when the model produces no false or missing predictions.
Afterwards, the analysis is focused on the identification of the hydrometeorological threshold trough a novel parametric equation that represents the lower boundary between triggering and non-triggering rainfall events on the basis of the mean rainfall intensity and the reanalysis of soil moisture values.In this context, we propose a piecewise linear equation as a reliable relationship able to well classify the events on the semi-log plane: where I and ϑ correspond to rainfall intensity and to soil moisture values, respectively.This parametric form of the threshold has been devised based on the visual inspection of the scatter plot of triggering and non-triggering events (i.e., heuristically) and corroborated by comparison with other relationships proposed in the literature -specifically, the power law and the simple bilinear (as opposed to a linear or more complex power or high-degree polynomial) (Uwihirwe et al., 2022;Thomas et al., 2019;Mirus et al., 2018b).x 0 , x 1 , y 0 , and y 1 are the threshold's parameters that must be estimated.In this regard, these parameters are computed by adopting the same objective function and optimization procedure as those used for the identification of the parameters of the power-law I D threshold, i.e., the TSS objective function (Eq.11) and the MATLAB ® particle swarm global optimization toolbox.Therefore, Eq. ( 12) is used to derive the hydrometeorological thresholds employing single-and multi-layer soil moisture data, respectively.Specifically, the mean rainfall intensity (I ) and the soil moisture at each of the four depth levels (ϑ 1 , ϑ 2 , ϑ 3 , ϑ 4 ) available from the ERA5-Land reanalysis data are used for the single-layer approach, while the mean rainfall intensity (I ) together with the first principal component of soil moisture, i.e., the linear combination of soil moisture at the four depths corresponding to the minimum information loss (highest explained variance), are used for the multi-layer approach.The TSS values obtained in the applications considering soil moisture, both single-and multi-layered, (hereinafter indicated as TSS par ) are being compared to one another, as well as to the TSS value obtained for the power-law I D threshold (hereinafter TSS pl ).

Study area
The study area selected for our study is the island of Sicily (southern Italy, 37.75 • N, 14.25 • E) which, with an area of ∼ 25 700 km 2 , is the largest island of the Mediterranean Sea.
A hilly morphology (62 %) dominates the landscape in the island, while the rest is characterized by a mountainous and flat morphology, especially in the eastern part of the island around Catania.The terrain average elevation is about 400 m above sea level, ranging from 0 to 3320 m on the peak of the Etna volcano.Geologically, the Sicily island arose during the Neogene, when the European and African plates converged.Thus, Sicily stands out for its complex geological and lithological features which, cooperatively with anthropic activities (e.g., changes in land use, management of forest), have generated a wide range of different types of soil (Venturella, 2004).The climate is warm-temperate, with hot and dry summers, especially on the southern coasts, and higher and more frequent precipitation during the colder winter months, in the mountainous internal areas (Pumo et al., 2019).Mean annual precipitation ranges between 700 and 800 mm, and autumn and winter are the rainiest seasons.The most severe rainfall events frequently hit the eastern side of the island and, specifically, the eastern side of the Etna volcano and the flanks of the Peloritani mountains, with the greatest precipitation peaks on the Ionian side (Gariano et al., 2015).On the other hand, south Sicily is distinguished by lower precipitation than the mean values recorded in the rest of the region, since it is located at a lower height and is exposed to the hot and dry African winds (Alecci and Rossi, 2007).
Figure 3 shows the geographical context of Sicily, the rain gauge locations for the period 2009-2018 (Distefano et al., 2022), and the observed landslide locations.In more detail, 207   latitude coordinates (WGS84 datum), together with the initiation time, were retrieved.
Concerning the observed rainfall measurements, we consulted the data provided by the regional water observatory (Osservatorio delle Acque, OdA), the SIAS (Sicilian Agrometeorological Information Service), and the Regional Civil Protection Department (DRPC), namely the three main gauging networks installed in Sicily.
This enabled an hourly time series to be reconstructed for the precipitation over the period 2009-2018.As previously explained in Sect.2.1, using these continuous rainfall time series, the rainfall events were identified using the CTRL-T research code.For the calibration of these regional parameters required by CTRL-T, we referred to a previous application of the algorithm to the Sicily island (Melillo et al., 2015).Specifically, according to this approach, the dry period (no rain) has been set equal to 48 h (P 4,warm ) between April and October (warm season, C w ), while it has been set equal to 96 h (P 4,cold ) from November to March (cold season, C c ).Indeed, in line with Köppen (1936) and Trewartha (1968), it is reasonable to assume that in Sicily, due to the Mediterranean climate, the warm period is longer than the cold one.The rain gauge sensitivity G S has been set equal to 0.2 mm, while the rain gauge search radius R B has been established equal to 16 km.Table 2 summarizes adopted values for mentioned CTRL-T parameters.

Principal component analysis
An explorative analysis was carried out, to investigate the correlation between the four soil moisture depths (ϑ 1 , ϑ 2 , ϑ 3 , ϑ 4 ).The plot shown in Fig. 4 represents the correlation matrix between all pairs of variables, together with the Pearson's correlation coefficients.
Overall, all the four soil moisture depths are related to each other.Specifically, the diagonal subplot between the upper two depth levels ϑ 1 and ϑ 2 has the highest correlation, with a Pearson correlation coefficient equal to 0.85.This suggests that PCA can be adopted in order to find out the linear combination expressing the correlation between the involved soil moisture variables.
The preliminary step, required when PCA is performed, is to center the data on the mean values of each variable, namely by subtracting the mean.This step allows the cloud of data to be centered on the origin of the principal components, but it affects neither the spatial relationships of the data, nor the explained variance along the variables.At this stage, it was possible to proceed with PCA and, according to Eqs. ( 4)-( 7), the four principal components of soil moisture were defined as follows: The loading values of each principal component are intended as the weights a ij (Eq.3); therefore, the higher the value of the weight, the larger the contribution of a variable to the component associated with the weight.The sign of a loading indicates whether a variable and a principal component are positively or negatively correlated.Here, although overall slightly large loadings correspond to the first principal component, none of the four variables has a strong relationship with a particular principal component.
Figure 5a shows the scree plot representing the total percentage of variance explained by each of the four principal components.The chart reveals the decreasing rate at which variance is explained by additional principal components.Figure 5b represents a grouped bar plot indicating the estimated loadings corresponding to each of four principal components as reported at Eqs. ( 13)-( 16).
Because dimensionality reduction is a goal of PCA, several criteria can be considered for determining how many principal components should be examined and how many should be ignored (Rencher, 1998).A few of the criteria that can be considered include the following: (i) ignore principal components at the point at which the next principal component offers little increase in the total explained variation; (ii) ignore the last principal component whose explained variations are all roughly equal; (iii) include all principal components up to a predetermined total explained variation.In our study, the third criterion was applied considering a threshold value of 75 %.Therefore, only the first principal component was considered as it guaranteed the desired explained variation of about 75 %.

Threshold identification
CTRL-T tool reconstructed 144 landslide events out of the 207 landslides retrieved by the FraneItalia database.Four dif-Table 2. CTRL-T parameters for the reconstruction of the rainfall events used in the present study.ferent triggering rainfall events, representing a range of triggering conditions, were selected within the database, and the precipitation time series together with the soil moisture time series are plotted in Fig. 6.As expected, the upper two soil moisture layers are those that are most similar to precipitation trends, as well as the first principal component of soil moisture S 1 , computed using Eq. ( 13).Overall, a greater variability in soil moisture values can be observed in correspondence with ϑ 1 and ϑ 2 , which assume maximum values about equal to 0.4 in correspondence with all the analyzed triggering rainfall events.
First, the power-law I D threshold maximizing TSS was identified (Fig. 7).In particular, the plot shows the triggering events as red points, while the non-triggering events, since there are a very large number, are better represented by a color map indicating the relative frequency of non-triggering rainfall events, following a plotting technique inspired by Leonarduzzi et al. (2017).For this threshold a TSS pl = 0.50, corresponding to a TPR pl = 0.76 and FPR pl = 0.26, is obtained.Figure 8 shows the obtained thresholds when the mean rainfall intensity and the soil moisture at each of the four depth levels are considered.As can be seen, especially in correspondence with the upper two depths (i.e., 0-7, 7-28 cm), the triggering rainfall events are located, for the most part, on the upper right side of the graph, suggesting that the equation proposed for the identification of the thresholds (Eq.12) well fits this trend.Furthermore, at all depths taken into consideration, there is a noticeable clustering of the highest relative frequency values of non-triggering rainfall events below the related parametric threshold.All four identified thresholds have better performance than the ID threshold.Specifically, higher TSS values were obtained for the first two depths, with a TSS par equal to 0.71, while significantly lower values of TSS par (0.61 and 0.54) are obtained with the third and fourth soil moisture level, respectively.
Moving to the multi-layer approach, the optimal parametric threshold identified using the mean rainfall intensity and first principal component of soil moisture is presented in Fig. 9.In this case, a TSS par = 0.71 was obtained.
Table 3 summarizes the TSS values in correspondence with the analyzed thresholds, together with the values of parameters (Eq.12) estimated for the parametric thresholds.
Overall, the results relative to the hydrometeorological thresholds corroborate other studies showing their better predictive performance when compared to the traditional I D threshold.For the specific case study of Sicily, thresholds based on multi-layered soil moisture information have similar predictive performances to thresholds based on singlelayered information.This points out that the two shallowest depth layers are of the greatest relevance for landslide triggering in Sicily.This may not be the case for other case study areas, and the proposed approach of comparing multi-vs.single-layer information allows us to define which layers of soil are most relevant in controlling landslide triggering in a given region.

Conclusions
In this study, a framework based on PCA aimed at introducing multi-layer soil moisture information within hydrometeorological threshold identification has been proposed.Our investigation, relative to Sicily, corroborates previous studies showing higher performances for hydrometeorological thresholds compared to the traditional I D power-law thresholds.Specifically, a significant improvement of performances was found with hydrometeorological thresholds, leading to TSS values of up to 0.71, which were much higher than those obtained with the traditional approach (TSS = 0.50).The application of PCA to soil moisture data at various depths turned out to be a valuable approach to include multi-layer soil moisture information while keeping the thresholds twodimensional, though for the case study region, multi-layer information seemed not so relevant, as performances corresponding to the two uppermost layers are similar to those corresponding to the PCA combination of all four layers.Comparison of prediction performances relative to thresholds based on multi-versus single-layer soil moisture information provides a mean to assess which soil depth intervals retain the most relevant information for improving thresholds' predictive performances.This represents a strategic tool supporting decision-making in LEWSs development.
Finally, it is worth mentioning that our investigation considered ERA5-Land soil moisture data, whose actual use for landslide prediction is limited by the fact that they are made available with a delay of some weeks from real time.However, this delay is expected to be significantly reduced in the near future in light of the increasing computational capabilities.In this regard, the valuable improvements, gained despite the inherent uncertainty of reanalysis data, further encourage the installation of monitoring networks for direct in situ soil moisture measurements with enhanced spatial and temporal resolutions, as with these observations even higher improvements are to be expected.Future developments of this research will consider other geographical regions in order to further explore the role of multi-layer soil moisture.

Figure 1 .
Figure1.Schematization of the procedure followed for dataset construction.

Figure 2 .
Figure 2. Sketch illustrating how the duration of a triggering rainfall event is defined (adapted from Peres et al., 2018).

Figure 4 .
Figure4.Correlation matrix between the four soil moisture level depths (ϑ 1 , ϑ 2 , ϑ 3 , ϑ 4 ).Each off-diagonal subplot contains a scatterplot of a pair of variables with a least-squares reference line, the slope of which is equal to the displayed Pearson correlation coefficient.Each diagonal subplot contains the distribution of a variable as a histogram.

Figure 5 .Figure 6 .
Figure 5. (a) Total variance explained by each principal component; (b) estimated loadings for each principal component S i .

Figure 7 .
Figure 7. Traditional power-law threshold on the log-log plane between observed mean rainfall intensity (I ) and duration (D).

Figure 9 .
Figure 9. Parametric threshold on the semi-log plane between observed mean rainfall intensity (I ) and first principal component of soil moisture (S 1 ).
12 ϑ 13 ϑ 14 ϑ 21 ϑ 22 ϑ 23 ϑ 24 11 a 12 a 13 a 14 a 21 a 22 a 23 a 24 a 31 a 32 a 33 a 34 a 41 a 42 a 43 a 44 A instead represents the principal component loadings (i.e., coefficients) table, specified as a p-by-p matrix.The rows of matrix A are called the eigenvectors, and these specify the orientation of the principal components relative to the original variables.

Table 3 .
TSS values in correspondence with each analyzed scenario and parameters (x 0 , y 0 , x 1 , y 1 ) estimated for the parametric thresholds.