Natural Hazards and Earth System Sciences Technical Note : Assessing predictive capacity and conditional independence of landslide predisposing factors for shallow landslide susceptibility models

The aim of this study is to identify the landslide predisposing factors’ combination using a bivariate statistical model that best predicts landslide susceptibility. The best model is one that has simultaneously good performance in terms of suitability and predictive power and has been developed using variables that are conditionally independent. The study area is the Santa Marta de Penagui ̃ o council (70 km2) located in the Northern Portugal. In order to identify the best combination of landslide predisposing factors, all possible combinations using up to seven predisposing factors were performed, which resulted in 120 predictions that were assessed with a landside inventory containing 767 shallow translational slides. The best landslide susceptibility model was selected according to the model degree of fitness and on the basis of a conditional independence criterion. The best model was developed with only three landslide predisposing factors (slope angle, inverse wetness index, and land use) and was compared with a model developed using all seven landslide predisposing factors. Results showed that it is possible to produce a reliable landslide susceptibility model using fewer landslide predisposing factors, which contributes towards higher conditional independence.


Introduction
Recent developments in GIS software and increasing computing power allow a substantially high number of independent variables to be used in empirical, data-driven landslide susceptibility models.Recent studies in landslide susceptibility models usually involve over a dozen variables considered as predisposing factors of slope instability (e.g. Lee et al., 2002 (13 variables); Lee and Choi, 2004 (15 variables); van der Eeckhaut et al., 2010 (9 variables); Sterlacchini et al., 2011 (9 variables)).Nevertheless, the evaluation of the weight of each landslide predisposing factor within the predictive model through a thorough sensitivity analysis is frequently missing.In addition, the application of statistic bivariate methods to assess landslide susceptibility assumes conditional independence (CI) of the landslide predisposing factors (Bonham-Carter et al., 1989;Agterberg et al., 1993;Van Westen, 1993;Agterberg and Cheng, 2002;Thiart et al., 2003;Thiery et al., 2007).Blahut et al. (2010) pointed out that spatial probabilities are overestimated when conditional independence is not verified.
In this study, the aim is to determine the best combination of landslide predisposing variables using a bivariate statistical model, based on the assessment of goodness of fit and predictive power, using variables that have a high degree of conditional independence.In addition, we assess the number of unique conditions within each landslide susceptibility model associated to each combination of landslide predisposing variables.This number should be minimized when landslide susceptibility maps are made for land use planning and management in order to avoid the over partitioning of the study area.

Study area
The study area (Fig. 1) is Santa Marta de Penaguião council (70 km 2 ), located in the Northern Portugal.This area is part of the Iberian Hercynian Massif where plutonic and metamorphic rocks are dominant.Geomorphologically, the study area is located in a transition area between the Portuguese north-western mountains and the Douro Valley (Ferreira, 1991).Tectonic deformation explains the vigorous down cutting of the rivers, deep incised valleys and steep slopes.The elevation ranges from 70 m to 1416 m.The bedrock of the study area includes mainly contrasting metamorphic rocks, (e.g.phyllites, metagreywakes, metaquartzitic greywakes and quartzite).These rocks are strongly fractured, and weathered materials are abundant in clay-rich laminated rocks, especially in those areas where agricultural terraces were built for plantations of vineyards.Almost 52 % of the study area is covered by vineyards.Vineyards dominate on south exposed slopes, and have been supported for centuries by man-made terraced structures built with schist stone.Recently, these structures have been replaced by mechanical land embankments which were built with anthrosoils and schist rock.These structures are not supported by walls which increases slope instability.
The main landslide triggering factor in the study area is rainfall (Pereira, 2010).The mean annual precipitation ranges from 700 mm (at the bottom of fluvial valleys) to 2500 mm (on Marão Mountain).

Data and methods
The methodological flowchart used to select and validate the best landslide susceptibility model is shown in Fig. 2.This methodology is based on both the assessment of the predictive capacity and the conditional independence of landslide predisposing factors within landslide predictive models.

Landslide inventory and landslide predisposing factors
The landslide inventory was produced for the study area between 2005 and 2010, using aerial photo-interpretation (1/5000 scale) and field work.The boundaries of the landslides were drawn over detailed field maps (1/5000 scale), later vectorized and stored in a GIS database.All landslides  were validated in the field, thus allowing the landslide distribution map to be very reliable.The landslide inventory (Fig. 3a) includes 767 shallow translational slides (11 landslides per km 2 ).Each landslide has, on average, 136 m 2 and the depth of the slip surface typically ranges from 1 to 1.5 m.Table 1 summarizes the set of factors considered as the main landslide predisposing factors in the study area.The digital elevation model (DEM) was used to derive slope angle, slope aspect, slope curvature and the inverse wetness index.The DEM was created from elevation points and 10 m contour lines that were initially in vector format.Thus, the DEM and derivative data were produced with a pixel size of 10 m (100 m 2 ), resulting in a raster image containing 684 637 pixels to cover the study area.This pixel size was considered adequate taking into account the resolution of the cartographic database.Given that one of the main objectives is to choose the best set of variables that control shallow translational slides, the predisposing factors were not classified using a priori information on landslide distribution in order to not overestimate the susceptibility scores.
The slope angle was divided in 7 classes (<5 • ; 5-10 • ; 10-15 • ; 15-20 • ; 20-25 • ; 25-30 • ; >30 • ) using class limits indicated in the literature (e.g.Zêzere et al., 2004).The slope aspect was split into nine classes: the eight major directions and a class for flat areas.The total slope curvature was calculated using ArcGIS 9.3.1 on a cell-by-cell basis and taking into account a surface composed of a 3 × 3 window.The obtained results were divided into three classes: concave, straight, and convex.The Inverse Wetness Index is the ratio of Slope by Specific Catchment Area.A logarithmic scale was used to define five classes (0; 0-0.0001; 0.0001-0.001;0.001-0.01;0.01-0.18)that are adjusted to the natural breaks of variable distribution.
A classification of the main landforms was performed and mapped by manual delineation based on the interpretation of a DEM at 1/10 000, geological maps and field observations.This resulted in the identification of eight main geomorphological classes (Table 1).Six lithologic units were defined exploring the official geological maps at 1/50 000.The lithological boundaries were rectified by fieldwork where errors were found in the geological maps.Land use data were obtained from Corine Land Cover (CLC) 2000 at 1/100 000 and aerial photo interpretation.Field work for validation purposes was performed in 2009 on a 1/5000 basis to improve the quality of land use information.Twelve land use classes were identified (Table 1).
Although a high rainfall gradient was found within the study area, rainfall was not considered as a predictor for landslide spatial distribution.Rainfall is crucial to understand the temporal activity of landslides but it does not explain the landslide spatial distribution at the basin scale.
Finally, seven variables were mapped and converted into a raster format with a 10 m pixel and later reclassified for modelling purposes.

Weighting of variables and evaluation of models' goodness of fit
The weighting of class variables for landslide susceptibility evaluation was made applying the bivariate statistical method of Information Value.This method was proposed by Yan (1988) and is well described in Yin and Yan (1988) and Zêzere (2002).
For each class within each landslide predisposing theme, the Information Value was calculated using the following equation (Yin and Yan, 1988): where: Ii = Information Value of variable xi; Si = number of pixels with landslides (depletion area) of type y within variable xi; Ni = number of pixels with variable xi; S = number of pixels with landslides (depletion area) of type y; N = number of pixels of the study area.
When Ii is negative, the variable xi is not relevant to explain the spatial landslide distribution.Positive Ii values indicate a direct relationship between the presence of variable and slope instability, which increase with the increase of the score.In Eq. (1), Ii is not possible to obtain if Si is zero.In such cases, Ii was forced to be the first decimal lower than the lowest Ii obtained within the theme (see Table 1).
Total Information Value for a j pixel was determined by (Yin and Yan, 1988): where: m = number of variables; Xji = is either 0 or 1 if the variable is not present or present in the pixel j , respectively.The Information Value Method was applied to the total possible combinations among the above mentioned seven landslide predisposing factors.To this end, 120 landslide susceptibility models were computed, including 21 models with 2 variables, 35 models with 3 variables, 35 models with 4 variables, 21 models with 5 variables, 7 models with 6 variables, and one model with 7 variables.
The goodness of fit of each landslide susceptibility model was evaluated using a Receiver Operating Curve (ROC) (Swets, 1988) and by computing the Area Under the ROC Curve (AUC).The ROC curve plots the True Positive Rate (TPR) versus the False Positive Rate (FPR), where TPR is the proportion of landslide area that is correctly classified as susceptible and FPR is the proportion of non-landslide area classified as susceptible (Fratinni et al., 2010).

Assessment of conditional independence
The existence of conditional independence among landslide predisposing factors is a prerequisite of any bivariate statistical method.In this work, two tests were performed to the entire dataset to assess conditional independence: the Overall Conditional Independence (OCI) and the Agterberg and Cheng Conditional Independence Test (ACCIT) (Agterberg and Cheng, 2002).The ArcSDM toolbox (Sawatzky et al., 2009) was used to evaluate OCI and ACCIT.The ArcSDM tool requires the dependent (landslide) layer to be a point shapefile.Therefore, the centroid of each landslide depletion area was extracted from the original landslide inventory.The point shapefile representing landslides was cross-tabulated individually with each landslide predisposing factor.Variable classes were weighted and integrated using the Weights of Evidence method (Bonham-Carter, 1994;Bonham-Carter et al., 1989, 1990) for the abovementioned 120 landslide predisposing factors combinations (results not showed in this paper) in order to obtain the variables' combination response (Post Probability and Post Probability Standard Deviation).The OCI is the ratio between observed and the expected number of training points.Values below 1 usually indicate conditional dependence among two or more landslide predisposing factors (Bonham-Carter, 1994).The ACCIT is a statistic test for confidence in which the predicted training points (T ) are greater than the observed training points (n) and applies a one-tailed test of the null hypothesis such that T −n = 0 (Agterberg and Cheng, 2002).According to Agterberg and Cheng (2002), probability values greater than 95 % or 99 % indicate that the hypothesis of conditional independence should be rejected, whilst any value greater than 50 % indicates that some conditional dependence occurs.

Selection of the landslide susceptibility model and evaluation of prediction capacity
The best landslide susceptibility model was identified considering simultaneously the model goodness of fit and the conditional independence of landslide predisposing factors within the model.The chosen landslide susceptibility model obtained with the minimum number of independent variables was compared with the model based on the entire dataset of landslide predisposing factors.In order to allow the visual comparison between maps, each landslide susceptibility map was divided in 10 classes, each one representing 10 % of the study area.
Additionally, the prediction skill of the selected landslide susceptibility models was independently assessed.The original landslide dataset was partitioned into two subgroups (modelling and validation) using temporal, spatial and random criteria.Therefore, three new landslide susceptibility models were built using the landslide modelling groups.The model prediction skill was assessed by cross-validating susceptibility results with the landslide validation group.Model results were compared through ROC curves plotting TPR and FPR.

Results and discussion
Table 1 summarises the Information Values obtained with the total set of shallow translational slides and with landslide sub-sets corresponding to different partition criteria.
Information Values computed with the total set of landslides (Table 1) were integrated and 120 landslide susceptibility models were produced using all possible combinations of landslide predisposing factors (Table 2).For each landslide susceptibility model the ROC curve and the corresponding AUC were computed, as well as the number of unique terrain conditions and two tests of conditional independence (OCI and ACCIT).
Figure 4 shows the range values of Area Under the ROC Curve, Terrain Unique Conditions and OCI and ACCIT tests according to the number of landslide predisposing factors for each predictive model.The AUC tends to increase with the increasing number of landslide predisposing factors, although such increment is not linear.The best AUC (0.804) was obtained using all seven variables.As expected, the average number of unique conditions increases almost exponentially with the increasing number of predisposing factors.The unique conditions number ranges from a minimum of 12 in a model with two variables (curvature + lithology) to a maximum of 8094 in the model built with seven variables.
The ACCIT results were subtracted from 100 to ease comparison with OCI (Fig. 4).The average conditional independence decreases consistently for both tests with the increasing number of landslide predisposing factors used in the model.Generally, conditional independence has a large range of values in susceptibility models built from two to five landslide predisposing factors.In comparison, all models built using six or seven variables have low values of conditional independence in both tests (lower than 10 % for any combination of variables).
Figure 5 is a scatter diagram of the Area Under the ROC Curve and ACCIT for the 120 landslide susceptibility models.As expected, the model built with all seven variables (yellow diamond) has the highest AUC (0.804), but the AC-CIT equals zero.This indicates that despite the goodness of fit of this model being high, the set of variables used to develop the model has a high degree of conditional dependence.
Following the Agterberg and Cheng (2002) indication, the landslide susceptibility models with 1-(ACCIT/100) below 0.5 have some conditional dependence and should be rejected if the test result is at least below 0.05.In the present study, we reject models with 1-(ACCIT/100) below 0.4 (vertical line in Fig. 5).Accordingly, the red diamond in Fig. 5 is the selected landslide susceptibility model.This model (model 70 in Table 2) simultaneously maximizes the AUC (0.751) and minimizes the number of variables used (three variables: slope angle + land use + Inverse Wetness Index).Slope angle performs quite well as it is simultaneously highly spatially correlated with landslide distribution and has good conditional independence relative to other variables.Land use correlates well with landslides and the co-variance between the Inverse Wetness Index and other variables is low.
Figure 6 shows the shallow translational slides susceptibility map built with the selected model (Fig. 6a).For comparison, the map developed using all seven variables is also shown (Fig. 6b).Each landslide susceptibility map was classified in 10 classes, each one covering 10 % of the study A correspondence table was produced to analyse the degree of overlap of the spatial distribution of susceptibility between the two maps (Table 3).Results indicate that the overall degree of overlap is high; however, there are some striking differences between classes.Whilst only 26 % of area matches exactly the same decile class, 65 % is in the same decile or the deciles immediately adjacent.However, this overlap over three classes is systematically higher in the first four deciles (ranging from 68 % to 84 %) than in the following four deciles (ranging from 52 % to 60 %).The implication of this finding is that the three-and seven-variable models assign high-susceptibility to the same areas, which means that three variables are sufficient to identify the most susceptible areas to shallow translational slides.As usual in any data-driven landslide susceptibility assessment, there are some landslides occurring in areas with low predicted susceptibility, in both maps A and B. This is explained by the limitations of the available landslide predisposing factors to reproduce the complete slope instability system.
Finally, the prediction skill of the selected landslide susceptibility model was evaluated by computing new ROC curves based on the partitioning of landslide inventory using temporal, spatial and random criteria.The same procedure was applied to the seven-variable model for comparison purposes.The temporal partition of landslide inventory uses the year 2002 as a threshold to identify the landslide modelling group (prior to 2002 there are 611 landslides, which is equivalent to an unstable area of 94 200 m 2 ) and the landslide validation group (after 2001 there are 156 cases corresponding to an unstable area of 10 600 m 2 ) (Fig. 3b).The difference in size of the modelling and validation groups is a consequence of the occurrence of over 80 % of shallow translational slides in a single event in 2001.
A spatial partition of the landslide inventory was also performed, in which the spatial occurrence was split into West (modelling group) and East (validation group) as shown in Fig. 3c.Modelling and validation groups include 380 cases (unstable area = 58 400 m 2 ) and 387 cases (unstable area = 46 400 m 2 ), respectively.
The landslide inventory was also partitioned according to a random criterion (Fig. 3d).The landslide modelling group included 383 cases (unstable area = 52 900 m 2 ) and the landslide validation group included 384 cases (unstable area = 51 900 m 2 ).The landslide modelling groups were used to reassess Information Values (Table 1) and to build new landslide prediction models.Landslide validation groups were crossed with these models to obtain the ROC curves and corresponding AUC shown in Fig. 7.The AUC of the three-variable model is only slightly lower than the one obtained with the sevenvariable model when the temporal partition is used (0.723 and 0.776, respectively).Indeed, the validation results using spatial and random partitions of the inventory show that the predictive capacity of the three-variable model is higher than the seven-variable model.This difference is remarkably notorious if the spatial partition is used (0.743 vs. 0.652) but less so with the random partition (0.743 vs. 0.732).
Overall, the validation results show that the predictive capacity of the three-variable model is either very similar or better than the model developed using all seven variables.

Conclusions
All possible combinations of up to seven variables related to landslide predisposing factors were modelled in this study, resulting in the quantitative comparison of 120 models.Model suitability tends to increase in a non-linear way with the increasing number of landslide predisposing factors, whilst the number of unique conditions of the model increases exponentially.Conditional Independence, on the contrary, decreases with the number of variables to the point that all models built with six or seven themes variables had to be rejected when tested for Conditional Independence.
The best landslide susceptibility model was selected on the basis of a threshold of 60 % for the Agterberg and Cheng Conditional Independence Test (Agterberg and Cheng, 2002).Therefore, the best model proved to be built with only three landslide predisposing variables (slope angle + Inverse Wetness Index + land use) which obtained an Area Under the ROC Curve of 0.751 (less 0.053 than model developed with seven variables).Slope angle controls the shear forces acting on hillslopes and the Inverse Wetness Index reflects the decisive role that water infiltration in soils has on the development of shallow slides.Furthermore, the land use variable exhibits the highest spatial relationship with landslide distribution, which concentrates in vineyard areas, thus reflecting human interference on slope instability in the study area.In comparison, lithology is substantially less important as a factor for the occurrence of landslides because most of these (75 %) were located in a single lithological unit (the Desejosa Formation, Table 1), which is present in 73 % of this territory.
The model of spatial distribution of landslide susceptibility built with three variables is not significantly different from the model produced with seven variables.Therefore, it was shown that it is possible to produce a reliable landslide susceptibility model using only a few landslide predisposing factors and fulfilling the conditional independence hypothesis.

Fig. 1 .
Fig. 1.Location of the study site Santa Marta de Penaguião.

Fig. 2 .
Fig. 2. Methodological flowchart used to select the best landslide susceptibility model.

Fig. 3 .
Fig. 3. Inventory of shallow translational slides in the study area (a) and landslide groups used to validate landslides susceptibility models, generated with temporal (b), spatial (c) and random (d) partitioning criteria.

Fig. 4 .
Fig. 4. Range values of Area Under the ROC Curve (a), Unique Conditions (b) and Overall Conditional Independence and Agtenberg and Cheng Conditional Independence Test (c) for the complete set of landslides' predisposing factor combinations.The Agtenberg and Cheng Conditional Independence Test was subtracted from 100 to ease comparison.

Fig. 5 .
Fig.5.Scatter plot between Agterberg and Chen Conditional Independence Test and ROC curve AUC for 120 landslide susceptibility models built using the set of possible combinations of landslide predisposing factors.Yellow diamond -Model with seven themes; red diamond -Selected model (three themes).The red line marks the limit of acceptable conditional independence among variables within landslide susceptibility models.

Fig. 7 .
Fig. 7. ROC curves of the selected landslide susceptibility model and of the model based on seven variables.Landslide groups used for model validation where obtained using temporal, spatial, and random criteria for landslide database partition.

2012 982 S. Pereira et al.: Assessing predictive capacity and conditional independence of landslide predisposing factorsTable 1 .
Thematic layers and class information values computed with the total set of landslides and the landslide modelling groups obtained with temporal, spatial, and random partitions.Variable classes more related with landslide distribution are highlighted in bold.

Table 3 .
Overlap degree of the spatial distribution of susceptibility classes between the two shallow translational slides susceptibility maps.The grey boxes highlight the same decile and the deciles immediately adjacent.