Landslide susceptibility estimation by random forests technique: sensitivity and scaling issues

Despite the large number of recent advances and developments in landslide susceptibility mapping (LSM), there is still a lack of studies focusing on specific aspects of LSM model sensitivity. For example, the influence of factors such as the survey scale of the landslide conditioning variables (LCVs), the resolution of the mapping unit (MUR) and the optimal number and ranking of LCVs has never been investigated analytically, especially on large data sets. In this paper we attempt this experimentation, concentrating on the impact of model tuning choices on the final result rather than on the comparison of methodologies. To this end, we adopt a simple implementation of the random forest (RF), a machine learning technique, to produce an ensemble of landslide susceptibility maps for a set of different model settings, input data types and scales. A random forest is an ensemble of bootstrapped binary decision trees that relates a set of predictors to the actual landslide occurrence. Since the model is nonparametric, it can incorporate a range of numerical or categorical data layers and there is no need to select unimodal training data as, for example, in linear discriminant analysis. Many widely acknowledged landslide predisposing factors are taken into account, mainly related to lithology, land use, geomorphology, and structural and anthropogenic constraints. In addition, for each factor we also include in the predictor set a measure of the standard deviation (for numerical variables) or the variety (for categorical ones) over the map unit. As in other systems, the use of RF enables one to estimate the relative importance of the single input parameters and to select the optimal configuration of the classification model. The model is initially applied using the complete set of input variables; then an iterative process is implemented in which progressively smaller subsets of the parameter space are considered. The impact of the scale and accuracy of the input variables, as well as the effect of the random component of the RF model on the susceptibility results, is also examined. The model is tested in the Arno River basin (central Italy). We find that the dimension of the parameter space, the mapping unit (scale) and the training process strongly influence the classification accuracy and the prediction process. This, in turn, implies that a careful sensitivity analysis making use of traditional and new tools should always be performed before producing final susceptibility maps at all levels and scales.


Introduction
Landslide susceptibility maps (LSM) are useful for land planning, natural risk management and the development of mitigation measures. They depict the relative probability of occurrence of a given type of landslide in a given area, without taking into consideration the probability of occurrence in time (Brabb, 1984).
LSMs can be obtained in a variety of ways and a very ample literature is available on the subject, relying on at least 20 yr of history of susceptibility assessment for mass movements. The first method ever adopted is probably the so-called heuristic mapping, carried out by a team of expert geomorphologists through the definition of a set of conditioning factors leading to landslide development in a given area, on the basis of field surveys and aerial photograph interpretation supported by ancillary map data such as geological maps. This approach, on the one hand, has the advantage of providing a way to exploit the expert knowledge and judgment of the geomorphologist and has been used in recent times as well (see e.g. Cardinali et al., 2002; Casagli et al., 2004). On the other hand, it has the drawback of being subjective: in an interesting study, for example, Ardizzone et al. (2002) showed that the same area, independently surveyed by three teams of geomorphologists, produced three very different LSMs, as a result of spatial positioning inconsistencies in the boundaries of mapped landslide polygons.
For this reason many authors started to propose quantitative assessment methods, based on a set of uniquely defined conditioning factors to increase LSM reproducibility and on a series of weighting techniques to improve accuracy and robustness. A large part of the quantitative methods to produce LSMs relies on regression or classification approaches. The techniques most widely used are probably discriminant analysis (Carrara, 1983; Chung and Fabbri, 1995; Baeza and Corominas, 1996) and logistic regression (Hosmer and Lemeshow, 2000; Lee, 2005; Manzo et al., 2013), although other techniques have proved themselves reliable and in some cases more flexible, such as e.g. artificial neural networks (ANN) (Bianchi and Catani, 2002; Lee et al., 2003, 2004; Ermini et al., 2005; Yilmaz, 2009a), linear regression (Atkinson and Massari, 1998), fuzzy membership (Kanungo et al., 2006), and conditional probability or Bayesian methods (Yilmaz, 2010a).
Often, when compared to each other, such methods seem quite equivalent if applied to large areas where multiple landslide typologies coexist (Guzzetti et al., 1999; Kanungo et al., 2006; Carrara et al., 2008; Rossi et al., 2010; Yilmaz, 2009b, 2010a) and produce similar results starting from the same input data, even though much depends on the ability of the practitioner in calibrating the various parameters and fine-tuning the model so as to obtain a high-quality result. Recently, on this account, Rossi et al. (2010) suggested that optimal susceptibility predictions might be obtained through the combination of suitable basic LSMs generated by different methods rather than by the application of a single prediction.
Several efforts have also addressed how to best measure the quality of the LSMs produced by different methods, as well as determining the influence of mapping errors or mapping choices on the final results. In particular, Frattini et al. (2010) propose a complete framework for the quantitative assessment of LSM quality and also discuss the possible impact of using different methods in terms of cost-benefit analysis. They conclude that ROC (receiver operating characteristic) curves are at present the best quantitative tool to measure LSM quality. As far as mapping errors or model assumptions are concerned, only a few studies are available in the literature that try to gain deeper insight into the actual impact of modelling choices on the final result. Among them, an important contribution has been provided by the study of Guzzetti et al. (2006), which explores the influence of using different types of mapping units for the production of an LSM in central Italy. They test a discriminant analysis method against an ensemble of 350 different sets of map units, concluding that every LSM product should include such a sensitivity analysis in order to obtain a map of the spatial distribution of the estimation error, necessary to complement LSM information. In particular, they highlight the importance of exploring LSM model calibration and validation. In general, there seems to be a variability of results within an ensemble of single-model runs as high as among different model-type runs.
Despite all such efforts, however, there is still a lack of studies focusing on specific aspects of LSM model sensitivity. For example, the influence of factors of paramount importance such as the survey scale of the landslide conditioning variables (LCVs), the resolution of the mapping unit (MUR) and the optimal number and ranking of LCVs has never been investigated analytically, especially on large data sets. We have reconstructed and summarized the main lines of LSM model sensitivity in a mental map, which is included in the auxiliary material for reasons of space. Only some of the aspects highlighted in it have been treated in published papers so far.
In this paper, thus, we attempt this experimentation concentrating on the impact of model tuning choices on the final result, rather than on the comparison of methodologies. To this end, we adopt a simple implementation of the random forest (RF) classification family to produce an ensemble of landslide susceptibility maps for a set of different model settings, input data types and scales. RF classification and regression methods offer a very flexible environment for testing model parameters and mapping hypotheses, allowing for a direct quantification of variable importance. The model choice is, in itself, quite innovative since it is the first time that such a technique, widely used in remote sensing for image classification (Ham et al., 2005; Pal, 2005) but rarely adopted in landslide studies (Brenning, 2005; Vorpahl et al., 2012), is used for the production of an LSM over a large area.
We apply the model to a test area in central Italy, the hydrographic basin of the Arno River (ca. 9000 km²), and we present and discuss the results obtained. We also use the outcomes of the parameter sensitivity analysis to investigate the different roles of environmental factors in the test area.

Random forest classification
As a basic model for LSM we used a random forest implementation based on Matlab (MathWorks, version 7.11, treebagger object (RFtb) and methods). Random forest classification is basically a machine-learning algorithm for non-parametric multivariate classification first developed by Breiman (2001). RF approaches have been adopted in psychological studies (Strobl et al., 2009) and remote sensing image classification (Ham et al., 2005; Pal, 2005). They are being increasingly used, however, also in environmental modelling (Prasad et al., 2006; Strobl et al., 2008; Bachmair and Weiler, 2012) and, even though only rarely, in landslide susceptibility (Brenning, 2005; Vorpahl et al., 2012). The algorithm exploits random binary trees which use a subset of the observations through bootstrapping techniques: from the original data set a random selection of the training data is sampled and used to build the model, while the data not included are referred to as "out-of-bag" (OOB) (Breiman, 2001). Furthermore, a random selection of predictor variables is used to split each node of the trees. Each tree is developed so as to minimize classification errors, but the random selection influences the results, thus making a single-tree classification very unstable. For this reason, the RF-type methods make use of an ensemble of trees (the so-called "forest"), thereby ensuring model stability. The RF technique has several advantages with respect to other, more widely used, multivariate regression or classification methods. Firstly, it does not require assumptions on the distribution of explanatory variables; secondly, it allows for the mixed use of categorical and numerical variables without resorting to indicator (or dummy) variables; and, thirdly, it is capable of accounting for interactions and nonlinearities between variables. These are big advantages that limit the generation of outliers, especially when working with terrain variables with a high frequency of missing data and an intrinsic uncertainty in the assignment to the correct class even in surveyed areas (see e.g. the case of correctly defining the type of soil for a given areal extent for which only a few point locations have been directly surveyed).
A further advantage, even more so for our study, is the ability of RFtb models to provide information on the statistical weight of each single variable on the overall result. The algorithm estimates the importance of a variable by looking at how much the prediction error increases when the OOB data for that variable are permuted while all others are left unchanged (Liaw and Wiener, 2002). This capability can be fruitfully exploited to study the relative importance of the different explanatory variables, a quite important but often neglected aspect of LSM.
In the Matlab implementation of RFtb, the model output is a membership probability to one of the two possible classes, "landslide" and "no-landslide". An estimator of the ensemble error is the "out-of-bag error" (OOBE), which is computed comparing the out-of-bag predicted responses against the true responses.
The overall performance of the model, instead, can be assessed through the misclassification probability (MP), given by the average classification errors (commission and omission) after a given number of runs on a specific RF configuration (see the following sections for details).
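For illustration only, the core of such a set-up can be reduced to a few calls to the Matlab treebagger object. The sketch below is indicative and assumes hypothetical predictor and label arrays (X, Y, Xtest, Ytest); the option and property names follow the older TreeBagger interface referenced later in the text (e.g. OOBPermutedVarDeltaError) and may differ in recent Matlab releases.

```matlab
% Minimal RFtb sketch (MATLAB Statistics Toolbox). X is an n-by-p matrix of LCV
% values and Y a cell array of labels 'landslide'/'no-landslide' (illustrative).
T  = 200;                                          % number of trees (see the T# test below)
rf = TreeBagger(T, X, Y, 'Method', 'classification', ...
                'oobpred', 'on', 'oobvarimp', 'on');

oobe = oobError(rf);                               % OOBE as a function of the grown trees
ensembleOOBE = oobe(end);                          % OOBE of the complete forest

% Class membership probability for unseen pixels: scores(:,k) is the fraction of
% trees voting for class rf.ClassNames{k}; the 'landslide' column is used as the
% susceptibility estimate.
[predLabels, scores] = predict(rf, Xtest);

% Misclassification probability (MP) on an independent, hypothetical test set.
mp = mean(~strcmp(predLabels, Ytest));
```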

Model tests
Before starting the experimentation concerning model parameter scale, accuracy and sampling, we performed a series of tests on the influence of basic RF treebagger (RFtb) settings on the modelling results.
As we have seen in the previous section, the RFtb implementation requires some preliminary user choices on the forest complexity and on the control over the random component of the model itself.
In particular, we are interested in assessing the stability of the model performance over two important settings: the RFtb tree number and the number of runs required to obtain a consistent prediction of the dependent variable.

Number of trees
A single realization of a RFtb model ends up in a forest structure whose tree number T# is usually established by the user. There is no specific rule nor a best a priori value for T# and, even with complex classification or regression models, there is no guarantee that the accuracy of the results would increase proportionally with T#.
In the case of prediction of landslide susceptibility at regional scale we have in general a problem of unknown a priori complexity which demands a strong computational effort. Therefore, in applying a RF-type method, a preliminary exploratory step has to be performed in order to evaluate a T# value ensuring an optimal cost-benefit ratio.
A technique to find the optimal range of T# is based on running a basic model performance test with increasing structure complexity (increasing T#). We applied this technique to the study case, plotting the overall accuracy of the prediction (in terms of out-of-bag classification errors, OOBE) versus the number of grown trees (T#). We propose to choose as the working T# the value at which the OOBE stops decreasing and starts oscillating around a stabilized value (see the results section).
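A possible way to implement this test, again only as a sketch under the same illustrative variable names used above, is to grow a single large forest and read the cumulative OOBE curve returned by oobError:

```matlab
% Sketch of the T# test: grow one large forest and inspect how the cumulative
% OOBE behaves as trees are added; the working T# is chosen where the curve flattens.
maxTrees = 500;
rf = TreeBagger(maxTrees, X, Y, 'Method', 'classification', 'oobpred', 'on');

oobe = oobError(rf);          % oobe(t) = OOB error of the ensemble made of the first t trees
plot(1:maxTrees, oobe);
xlabel('Number of grown trees T#');
ylabel('Out-of-bag classification error (OOBE)');

% e.g. pick the smallest T# whose OOBE is within 1 % of the final (stabilized) value
tol  = 0.01 * oobe(end);
Topt = find(abs(oobe - oobe(end)) <= tol, 1, 'first');
```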

Assessment of the random component on results
Given a constant T#, for each model run the RFtb method rebuilds a new ensemble, using random choices for the order in which the trees themselves are stored and added to the structure. The RF-type algorithms offer a very robust set of techniques in order to perform this kind of random choice and ensure optimal performance. However, we believe that an assessment of the impact of this randomized choice on the overall model performance is needed to define the best set-up for our experiment and to propose a general framework for using RFtb classification and regression methods as a tool for landslide susceptibility estimation in large areas.
For this reason, we performed a series of tests devoted to the evaluation of the noise due to the random nature of the model and of the number of runs that are needed to ensure stability in the model results.
To this end, we compared the model results in terms of OOBE over different numbers of runs. Using a fixed resolution of 10 m for all the variables (see Sect. 2.5 for a complete description of the variables used), we averaged the OOBE for each parameter over 1, 10 and 100 runs and compared the results in terms of both relative errors and relative variable ranking.
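The run-to-run comparison can be sketched as a simple loop in which the forest is rebuilt from scratch and the OOBE and permutation importances are stored for each run; all variable names remain illustrative:

```matlab
% Sketch of the multiple-run test: rebuild the forest nRuns times and compare
% OOBE and per-variable importance across runs (1, 10 and 100 runs in the paper).
nRuns = 10;
nVars = size(X, 2);
impAll  = zeros(nRuns, nVars);
oobeAll = zeros(nRuns, 1);

for r = 1:nRuns
    rf = TreeBagger(200, X, Y, 'Method', 'classification', ...
                    'oobpred', 'on', 'oobvarimp', 'on');
    impAll(r, :) = rf.OOBPermutedVarDeltaError;   % permutation importance per LCV
    e            = oobError(rf);
    oobeAll(r)   = e(end);
end

impMean = mean(impAll, 1);     % average importance over the runs
impStd  = std(impAll, 0, 1);   % run-to-run spread (compare with the +/-3 std range in Table 1)
cv      = impStd ./ impMean;   % coefficient of variation, as reported in Table 1
```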

Parameter set tests
After the first set of tests on the model structure and internal functionality we concentrated on the analysis of the parameter set. In particular, among the many possible issues to be explored, we focused on some of the aspects that are less considered in previous landslide susceptibility studies: the influence of each single parameter on the final result, the importance of parameter resolution (in terms of pixel size and terrain unit scale), and the impact of the dimension of the training set on the final result. In the study, we preferred a pixel approach over the slope unit concept to allow for a multi-resolution analysis to be carried out. Furthermore, preliminary tests carried out in a parallel research effort show that the choice of the best method to represent classification units in LSM is dependent on scale but that in general pixel units may be considered a more flexible approach in many cases (Trigila et al., 2013).

Optimal parameter set at different resolutions
An important question, in landslide susceptibility studies, is whether or not a conditioning variable (LCV) is actually useful and needed for the classification.
The susceptibility map is not only influenced by the total number of variables (LCV#) but also by the resolution at which the analysis is performed using those variables. For this reason we prepared and performed tests where, at different resolutions, a fixed-structure RFtb model is repeatedly run with decreasing LCV#. This means running the RFtb estimator with the complete LCV set and then removing one parameter at a time (the least important in terms of increase in prediction error when the variable is permuted across the out-of-bag observations, according to the Matlab function OOBPermutedVarDeltaError, http://www.mathworks.it/it/help/stats/treebagger.oobpermutedvardeltaerror.html), so as to reduce the parameter space. Every predictor set was applied to the test points and evaluated in terms of misclassification probability (MP), to identify the optimal configuration for each working resolution. This assessment is very important, especially when there is a limited possibility of increasing the training set along with the LCV#. A larger LCV# would require larger training sets, and it is often impossible to have the required number of verified samples to carry out a satisfactory model training for a large LCV#.
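A minimal sketch of this backward elimination loop, assuming the same illustrative arrays as above, could look as follows (the actual implementation details of our experiments may differ):

```matlab
% Backward feature elimination sketch: at each step the forest is retrained, the
% least important LCV (lowest OOBPermutedVarDeltaError) is dropped, and the
% misclassification probability on the test points is stored.
vars  = 1:size(X, 2);          % indices of the LCVs still in play
mpLog = nan(1, numel(vars));   % MP recorded for each parameter-space size

while numel(vars) > 1
    rf = TreeBagger(200, X(:, vars), Y, 'Method', 'classification', ...
                    'oobpred', 'on', 'oobvarimp', 'on');
    mpLog(numel(vars)) = mean(~strcmp(predict(rf, Xtest(:, vars)), Ytest));

    imp = rf.OOBPermutedVarDeltaError;
    [~, worst]  = min(imp);    % least useful variable in the current subset
    vars(worst) = [];          % remove it and repeat with a smaller LCV#
end

[bestMP, bestLCVnum] = min(mpLog);   % optimal LCV# for this resolution (MUR)
```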
We test this feature selection method for six different map resolutions to verify the influence of scale on parameter ranking and map accuracy in terms of the full ROC AUC (area under the curve). The AUC was calculated by applying the model to the area as a whole and then comparing the output of the model with the available landslide database (both landslide and non-landslide sites). The mapping unit resolution (MUR) is defined in raster terms as the pixel size, and we used for this study 10, 20, 50, 100, 250 and 500 m resolutions. It is important to highlight that the original scale of the input data (e.g. DEM or thematic maps) is constant throughout the experiments, as we only varied the pixel size of the derived raster maps expressing the spatial distribution of each variable.
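For each resolution, this AUC evaluation can be sketched with the Statistics Toolbox function perfcurve, applied to the susceptibility scores predicted for the whole basin; Xall, inventoryLabels and mur are hypothetical names, and rf is a forest trained as in the sketches above:

```matlab
% Sketch of the full-area AUC evaluation: the 'landslide' membership probability
% is taken as the susceptibility score and compared with the inventory labels.
[~, scores] = predict(rf, Xall);                 % Xall: LCV values for every pixel of the basin
posCol = strcmp(rf.ClassNames, 'landslide');     % column holding the landslide votes
susceptibility = scores(:, posCol);

[rocX, rocY, ~, auc] = perfcurve(inventoryLabels, susceptibility, 'landslide');
plot(rocX, rocY);
xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('MUR = %d m, AUC = %.2f', mur, auc));
```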

Influence of training set dimension on results
Directly connected to the resolution of the parameters used in LSM is the problem of how large the training set should be in order to stabilize statistical predictions. For obvious practical reasons, it is important to find out the minimum number of samples (mS#) required to calibrate a model, a quantity which is a function of the dimension of the parameter space LCV#. Usually, the larger the LCV#, the larger the mS# needed for model calibration.
The parameter space sampling method used to build the training set is also important. Mainly, the sampling can be completely random or guided by heuristic rules. In the first case we do not have any control over which classes or occurrences of a given LCV are sampled, whilst in the second case we can constrain the sampling so that every class is represented at least once.
We performed two types of tests. In the first, the performance of the model in terms of AUC was analysed using a constant mS# proportion (10 % of the study area) with random sampling at different resolutions (MUR = 10, 20, 50, 100, 250 and 500 m). In the second, the same model performance was tested using a constant map resolution (50 m) and a variable mS# proportion (from 0.5 to 50 % of the study area). We also tested the impact of using random versus ordered selection methods for training set sampling.
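The two sampling strategies can be sketched as follows, where the pixels of a raster stack are addressed by linear index; the raster size and all other names are illustrative:

```matlab
% Sketch of random versus regular-grid (ordered) training-set selection.
nRows = 2000; nCols = 1500;                % illustrative raster dimensions
frac   = 0.10;                             % mS# as a fraction of the study area
nCells = nRows * nCols;
nTrain = round(frac * nCells);

% (a) random sampling: nTrain cells drawn without replacement
perm      = randperm(nCells);
idxRandom = perm(1:nTrain);

% (b) regular-grid ("block") sampling: take every k-th cell in reading order
k       = floor(nCells / nTrain);
idxGrid = 1:k:nCells;
idxGrid = idxGrid(1:min(nTrain, numel(idxGrid)));

% each index set is then used to extract the corresponding rows of X and Y,
% e.g. Xtrain = X(idxRandom, :); Ytrain = Y(idxRandom);
```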

Test area and landslide database
The selected test site is the hydrographic basin of the Arno River, central Italy. The area covers 9100 km² and is located in the northern Apennines, a complex thrust-belt system composed of several tectonic units and sedimentary basins. The relief is characterized by a succession of NW-SE ridges (made up of Mesozoic/Tertiary flysches and calcareous units) and Pliocene-Quaternary sedimentary basins.
The Arno Basin has a temperate climate with dry summers; November and March are the rainiest months. However, the typical rainfall amounts exhibit strong local differences, and the mean annual precipitation ranges from 800 mm in the Chiana Valley to 1800 mm on some parts of the Apennine ridges.
Landslides are very common in the study area. The geological settings and the lithological characteristics of the area affect the typology and occurrence of landslides, which mainly consist of slow-moving rotational slides (IAEG, 1990; Bertolini et al., 2004; Catani et al., 2005). The majority of the landslides are reactivations of dormant slides and the frequency of first-time landslides is very low; as a consequence, the landslide susceptibility chiefly depends on the presence or absence of known instability. To establish the spatial distribution of existing landslides we used a detailed database (Catani et al., 2005), recently updated (Rosi et al., 2012), containing more than 27 000 landslides (Fig. 1). For the variable-resolution analysis, the landslide polygons were rasterized from the high-resolution vector data to six different pixel sizes (MURs) using standard vector-to-raster conversion.

Input parameters
The choice of the input parameters is a fundamental step in the susceptibility assessment process. While some parameters are extensively used in landslide susceptibility (e.g. lithology and slope gradient), the effectiveness of many others (e.g. higher derivatives of elevation, soil depth, aspect, and structural settings) is still debated and seems to depend on the methodology adopted, the physical settings of the study area and the landslide typology (Carrara and Guzzetti, 1995; Baeza and Corominas, 1996; Segoni et al., 2012).
The number of parameters to be adopted is also debated: effective landslide susceptibility assessments have been carried out with just a few parameters (Ohlmacher and Davis, 2003; Dahl et al., 2010; Akgün, 2012; Pereira et al., 2012) as well as with a considerable number of parameters (Guzzetti et al., 1999; Lee et al., 2002; Gorsevski et al., 2006; Lee and Pradhan, 2007; Nefeslioglu et al., 2011; Felicísimo et al., 2013). However, a high number of parameters does not necessarily guarantee the quality of the results: it can be demonstrated (Pradhan and Lee, 2010; Floris et al., 2011; Manzo et al., 2013) that an increase in the number of model parameters can even worsen the accuracy of the LSM.
Automated procedures of forward selection of variables in landslide susceptibility mapping have been proposed for several techniques (Carrara et al., 2008; Van den Eeckhaut et al., 2009; Costanzo et al., 2012). The use of the RF treebagger algorithm can be a valuable tool in assisting the decision on how many (and which) attributes should constitute the optimal configuration of the susceptibility assessment, since it provides quantitative estimates of variable importance. An initial, expert-driven selection of the input parameters is therefore not necessary in this study, as the feature selection procedure will automatically sort LCVs according to importance ranking. We initially used three main kinds of input parameters: morphometric attributes, thematic attributes and rainfall-related attributes.
Morphometric attributes are quantitative parameters used to characterize landforms. All of them are related to geomorphologic processes and consequently to landslide susceptibility. The original resolution of the Arno Basin DEM is 10 m. The data layers were resampled to 20, 50, 100, 250 and 500 m resolution considering the mean value of the pixels and, separately for each of them, a series of topographic attributes were extracted with the same pixel size. To encompass the spatial variability of the topographic attributes in the modelling, we defined another series of variables: for each morphometric attribute we considered the standard deviation (for numerical attributes) or the variety (for categorical attributes) calculated on a 3 by 3 pixel kernel window.
-Elevation (ELE; ELE_STD). Elevation is commonly used in landslide susceptibility assessments as different altitudes may be related to different environmental settings (e.g. vegetation, temperature, rainfall regime, etc.) (Dai and Lee, 2002; Costanzo et al., 2012; Felicísimo et al., 2013; Günther et al., 2013; Sabatakakis et al., 2013). The standard deviation of elevation is closely related to the relative relief and can be considered as an indicator of the potential energy for erosion and mass wasting (Oguchi, 1997; Günther et al., 2013; Kayastha et al., 2012).
-Slope (SLO; SLO_STD). Slope angle is one of the most important preparatory factors as it strongly controls the shear forces acting on hillslopes, therefore it has been widely used in LSM (Aleotti and Chowdhury, 1999; Guzzetti et al., 1999). In addition to the value directly derived from the DTM, its standard deviation was used as an indicator of the potential energy for erosion and mass wasting.
-Curvature. Curvature is traditionally used to describe the physical characteristics of an area with respect to erosion and runoff processes (Zevenbergen and Thorne, 1987) and to identify landforms related to landslide bodies (Evans, 1998; Ohlmacher, 2007; Catani et al., 2010). Various curvature components can be computed as the second derivative of the surface topography. In this study we used the following four curvature components.
-Curvature s.s. (CUR; CUR_STD): for each pixel, the second derivative of elevation is computed in two directions (steepest descent and normal to the steepest descent) in a 3 by 3 pixel kernel window, as in Moore et al. (1991), then averaged.
-Profile curvature (CPR; CPR_STD) expresses the rate at which the slope gradient changes towards the direction of maximum slope. It affects the acceleration/deceleration of superficial flux and thus the erosion/deposition of the hillslope's loose material.
-Plan curvature (CPL; CPL_STD) is calculated orthogonally to the direction of the maximum slope and it can be used to characterize the convergence and divergence of flow and to discriminate between watersheds and hollows channelized by a 0th order hydraulic network.
-Combo curvature (CCU; CCU_VAR). This is a categorical variable obtained by the combination of the values of plan and profile curvature assumed in each pixel. The use of this attribute contributes to the characterization of slope morphology and flow. Profile and planar curvature were reclassified into three classes (concave, flat, convex) using the values −1 and 1 as class breaks. Afterwards, the two rasters were overlain to find nine possible curvature combinations, which provide information about the shape of the hillslope.
-Aspect (ASP; ASP_VAR). Aspect represents the orientation in space of each pixel composing the landscape. This variable can play a key role in landslide susceptibility as it may influence the exposition of the terrain to different amounts of rainfall and solar radiation, thus conditioning the terrain humidity and the vegetation growth (Guzzetti et al., 1999; Dai and Lee, 2002; Demir et al., 2013). In this study the aspect was used as a categorical variable after reclassifying its angular values on the basis of the facing direction with respect to the eight main cardinal directions.
-Flow accumulation (FLA; FLA_STD; LFA; LFA_STD). This attribute expresses the upslope contributing area of each pixel (i.e. the size of the area drained by a specific pixel in the map), which has been used in landslide susceptibility assessments as it can be put in relation with water flux or with potential soil saturation (Catani et al., 2005, 2010; Felicísimo et al., 2013; Xu et al., 2013). Because of the wide extent of the study area, FLA values have a very wide range; therefore, we also introduced in the analysis the LFA attribute, which was calculated as the logarithm of the flow accumulation.
-TWI (TWI; TWI_STD). The topographic wetness index (TWI) is defined as ln(A/tanβ), where A is the aforementioned upslope contributing area (or flow accumulation) and β is the slope angle (Beven and Kirkby, 1979). This index is commonly used to characterize the spatial distribution of soil moisture, therefore it is used in landslide susceptibility assessments (Devkota et al., 2013; Pereira et al., 2012; Costanzo et al., 2012; Felicísimo et al., 2013).
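As a purely illustrative sketch, the DEM-derived attributes and their kernel statistics described above could be computed as follows (stdfilt belongs to the Image Processing Toolbox; dem, cs and the contributing-area grid are hypothetical inputs, not the actual processing chain used for the Arno Basin):

```matlab
% Illustrative derivation of a few morphometric LCVs from a DEM raster.
dem = 100 * peaks(200);                      % placeholder DEM; a real grid would be loaded here
cs  = 10;                                    % cell size in metres (10 m resolution)

[fx, fy] = gradient(dem, cs);                % finite-difference partial derivatives
slopeDeg = atand(hypot(fx, fy));             % SLO: slope angle in degrees

slopeStd = stdfilt(slopeDeg, ones(3));       % SLO_STD: standard deviation over a 3x3 window
eleStd   = stdfilt(dem, ones(3));            % ELE_STD

% TWI = ln(A / tan(beta)), with A the upslope contributing area and beta the slope
% angle in radians; a small offset avoids division by zero on flat cells.
contribArea = ones(size(dem)) * cs^2;        % placeholder: a real flow-accumulation grid goes here
beta = atan(hypot(fx, fy));
twi  = log(contribArea ./ (tan(beta) + eps));
```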
Thematic attributes were derived by means of GIS analyses from specific thematic maps.

Lithology (LIT; LIT_VAR). Lithology is largely acknowledged as one of the most important driving variables in susceptibility assessments, since it directly reflects the geomechanical and hydraulic properties of the bedrock and influences the characteristics of the soil coverage (Dai and Lee, 2002; Catani et al., 2005; Costanzo et al., 2012). In this work we used a 1 : 100 000 lithotechnical map of the Arno Basin, previously used in other studies (Catani et al., 2005), where all the geological formations are grouped into eight classes based on their geotechnical properties: cohesive soils; granular soils; indurated rocks; weakly cemented conglomerates and carbonate rocks; weak rocks; marls and compact clays; rocks with pelitic layers; and complex (mainly pelitic) units.
Since the alternation of different lithologies may contribute to slope instability, we also included in the modelling an additional variable defined as the variety of the classes in a 3 by 3 pixel kernel window (LIT_VAR variable).
Land cover (COV; COV_VAR). Landslide susceptibility is highly influenced by the vegetation cover, and the use of land may be used to indirectly account for the human interference on hillslopes (Varnes and IAEG, 1984; Costanzo et al., 2012; Pereira et al., 2012). A land use map with nine classes (urban fabric; crops and permanent cultivations; grasslands; heterogeneous cultivated lands; forests; rangelands; scrublands; wetlands) was derived from a 1 : 50 000 scale land cover map (Catani et al., 2005). As for the lithology, the variety of land cover classes in a 3 by 3 pixel window was included amongst the susceptibility variables as well.
Distance to roads (RDS). In some circumstances roads can be considered a landslide predisposing factor: heavy traffic may cause vibrations and sudden increases/decreases of stress, while the construction of roads sometimes requires anthropogenic modification of the hillslope profile or of the drainage system, such as road cuts, fills, culverts, ditches, etc. (Collins, 2008; Ramakrishnan et al., 2013). Therefore, distance to roads has been successfully used in landslide susceptibility assessments (Devkota et al., 2013; Demir et al., 2013; Feizizadeh and Blaschke, 2013; Ramakrishnan et al., 2013). We included this preparatory factor in our analysis as a continuous variable by calculating the distance of each pixel from an existing 1 : 10 000 scale shapefile of the road network.
Distance to faults (FTS). Faults are widely used as predisposing factors in landslide susceptibility studies (Devkota et al., 2013; Demir et al., 2013) since they can be related to earthquake-induced landslides and because they can be associated with a decrease in the strength parameters of the bedrock and with anomalous groundwater conditions. Faults and other relevant tectonic features of the area were extracted from 1 : 100 000 geological maps and a raster was set up in which each pixel assumes the value of the distance to the closest fault.
Distance to rivers (RIV). The stream network is an important feature in the geomorphological setting of an area and may directly or indirectly be linked to landslides (Devkota et al., 2013; Demir et al., 2013; Feizizadeh and Blaschke, 2013). Similarly to roads and faults, a shapefile of existing streams (Strahler order > 0) was extracted from 1 : 25 000 technical maps and a raster of distances from the hydraulic network was set up.
Rainfall data have rarely been used in landslide susceptibility models (Günther et al., 2012; Sabatakakis et al., 2013; Schicker and Moon, 2012; Feizizadeh and Blaschke, 2013), mainly because rainfall is considered one of the main triggering factors rather than a predisposing factor. It is often assumed that rainfall is related to the temporal occurrence of landslides and not to their spatial distribution (Pereira et al., 2012). However, this assumption can be considered valid over small areas where rainfall characteristics can be considered quite homogeneous, while over large areas different rainfall regimes can be observed. We therefore included in our analysis some attributes related to rain without inserting actual rainfall measurements but considering
the predisposition of the territory to be struck by a rainstorm of a given typology. We defined a series of variables Rp_a_t (RP_100_24; RP_100_72; RP_240_24; RP_300_72; RP_30_1; RP_50_6; RP_600_120) expressing the return period (RP, in years) of a rainstorm characterized by a given total rainfall amount (a, expressed in mm) in a given time lapse (t, expressed in hours). These data were already known for 111 locations corresponding to pluviometric stations distributed across the territory and were extended to the whole study area using a suitable geostatistical interpolation algorithm.
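Only as an indicative sketch, and using a simple scattered-data interpolant as a stand-in for the (unspecified) geostatistical algorithm actually adopted, the station values of one return-period variable could be gridded as follows; xs, ys, rpStation and the grid limits are hypothetical:

```matlab
% Stand-in gridding sketch: extend point values of e.g. RP_240_24 measured at the
% rain gauges to a regular raster grid (not the geostatistical method of the paper).
F = scatteredInterpolant(xs, ys, rpStation, 'natural', 'nearest');
[Xq, Yq] = meshgrid(xMin:cs:xMax, yMin:cs:yMax);   % target grid at resolution cs
rpGrid = F(Xq, Yq);                                % RP_240_24 raster layer
```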
-Random value (RND). The risk of introducing so many variables into the modelling is ending up with a chaotic and unstable system; therefore, we defined a control variable that assumes for each pixel a random value from 0 to 100. This "random variable" was used as a benchmark to better identify "useless" or "pejorative" variables that could be less effective than a random choice of values.

Model tests results
In order to identify the minimum number of trees required to minimize OOBE, we repeatedly ran the training sequence and calculated the OOBE as a function of increasing T#. Figure 2 shows that the OOBE stabilizes starting from T# = 100.
Considering that the calculation time depends on T#, we chose 200 trees as the optimal configuration of the model. Considering a 10 m resolution and the full parameter set, for each LCV the OOBE was calculated executing 1, 10 and 100 runs. Results show that the value obtained with 1 run is almost always comparable to the other values, falling within the range identified by ±3 standard deviations (with the only expected exceptions represented by RND and LIT_VAR). Furthermore, in order to compare the mean values obtained with 10 and 100 runs, a t test was implemented: the calculated value is always lower than the critical value, so the null hypothesis that the two values (10- and 100-run averages in Table 1) belong to the same distribution can never be rejected (in Table 1 the corresponding probability value is also shown). We can assert that the number of model runs does not affect the OOBE. In Table 1 we also report the coefficient of variation (CV), considering the mean and standard deviation calculated for 10 and 100 runs. The highest values are obtained for RND and LIT_VAR: the former does not affect the classification result, while the latter makes little sense at 10 m resolution, being derived from a 1 : 100 000 scale map.
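The comparison of the 10- and 100-run means can be sketched with a standard two-sample t test (Statistics Toolbox); oobe10 and oobe100 are illustrative vectors of per-run OOBE values for one LCV:

```matlab
% Two-sample t test on the per-run OOBE values of a single LCV.
[h, p] = ttest2(oobe10, oobe100);   % h = 0: the null hypothesis of equal means is not rejected
if h == 0
    fprintf('10- and 100-run OOBE means indistinguishable (p = %.3f)\n', p);
end
```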

Parameter set and resolution tests results
In order to find the best LCV parameter set, a fixed-structure RFtb model has been run at different resolutions in the test area applying a feature selection procedure, i.e. progressively decreasing the LCV#. The optimal configuration of the parameter set, expressed in terms of misclassification probability (MP) for each tested resolution, is reported in Table 2. The MP is calculated on test points randomly sampled over the entire area.
Out of the 35 parameters considered in the analysis, a combination of 24 parameters represents the best configuration for the resolutions of 20 and 50 m, while combinations of 9, 15, 14 and 22 parameters are the best sets for the 10, 100, 250 and 500 m resolutions, respectively.
For each resolution the parameters were ranked on the basis of their importance. The random variable was invariably discarded during the first step of the feature selection procedure as the least important (and always pejorative) explaining variable. Rainfall return periods are always included in the optimal set at all different scales, whereas some factors, such as CCU, TWI and ASP, are always excluded; ELE always has a high rank. The resolutions at 20 and 50 m have almost identical parameter sets with very similar ranks for each parameter. Considering the 10 m resolution, most DEM-derived parameters, land use and lithology are discarded.
An interesting by-product of the feature selection procedure is the ability to quantify the parameter importance for each LCV# set. Figure 3 shows an example of how the rank of a parameter varies with the map unit resolution (MUR) and with decreasing LCV#: when the parameter is discarded its rank is displayed in grey. It can be noticed how the rank varies with the number of parameters and depending on the MUR. The white boxes point out the LCV# which resulted in the optimal set for each MUR. Some examples of this plot are presented in the discussion. The complete set of rank-MUR-LCV# plots is included in the auxiliary material for reasons of space.
As described in the previous section, in order to find the best training set in terms of resolution and minimum sample dimension (mS#) we performed two types of tests. The first type, reported in Fig. 4, considered a random sampling of 10 % of the study area at different resolutions. The model performance compared in terms of AUC is highest for the 50 m resolution (AUC = 0.88) and lowest for the 10 m resolution (AUC = 0.54).
The second type of test was performed using a variable sample dimension mS# (from 0.5 to 50 %) at a constant resolution of 50 m. The results displayed in Fig. 5 highlight that, as expected, the higher the mS#, the higher the resulting AUC.

Model testing implications
Concerning model stability, the main result of the model testing phase is that RFtb approaches in landslide studies are feasible and robust, provided that a RF structure of suitable complexity is used and a preliminary stability test is performed to check for model instability due to the random selection component. This means that, before searching for the optimal parameter set, it is necessary to configure the RFtb structure for maximum stability.
The final structure uses T# = 200, which implies a computational effort easily manageable by the great majority of desktop computers. Applications in different fields often require much larger forest densities (Bachmair and Weiler, 2012). Moreover, the random component of the RFtb algorithm does not seem to compromise in any way model stability over multiple runs, which further simplifies the practical implementation of this approach in LSM.
Using the model configuration derived from the preliminary tests we can safely assume that model performance in the following tests is not influenced by model structure but only dependent on the input parameter set (number, typology, scale and accuracy of LCVs).

Parameter importance and scale issues
Regarding the tests of RFtb performance in classifying the study area for landslide susceptibility, the optimal set of LCVs to be included in the analysis depends strongly on the scale of analysis, i.e. on the resolution of the terrain unit of reference (MUR).
The LCV ranking and significance are clearly influenced by terrain unit size or scale (MUR, which is equivalent to resolution throughout the document).
Table 2 shows that the optimal parameter configurations change with scale in a notable manner. Several LCVs which are significant at one scale can be completely negligible at another. This is probably connected to the original resolution of the survey, mapping and measurement of each single variable. For example, the slope curvature variables (CPL_STD, CUR_STD, CUR, CPL, CPR) do not seem to have any influence on landslide susceptibility when MUR = 10 m. This is probably due to the fact that the dimension of slope curvature which is meaningful for landslide occurrence has a scale larger than that. The slope profile or planar convexities/concavities with dimensions limited to 30 m (the window dimension used to compute curvature at 10 m resolution) are probably not significant for discriminating landslide versus non-landslide units in the study area because they fail to include both detachment and deposition areas.
The only curvature-related LCV that is never discarded is the standard deviation of profile curvature (CPR_STD), which assumes high ranking in LSM classification for MUR = 500 m (coarser scale) and average ranking at MUR = 50 and 100 m (medium scales).
This result may have a notable importance in LSM per se. The most immediate implication might be that RFtb multiscale analysis can reveal important clues concerning the scaling characteristics of the landslide size distribution in a given area. In particular, for the case at hand, the frequency-size distribution (as computed by Catani et al., 2005, and Convertino et al., 2013) shows a scaling in agreement with the classical double-Pareto recognized in most of the landslide inventories worldwide (Stark and Hovius, 2001; Guzzetti et al., 2002; Stark and Guzzetti, 2009; Brunetti et al., 2009; Van den Eeckhaut et al., 2007), with exponents of 0.4-0.5 (small landslides) and 1.8-1.9 (larger landslides) and a rollover around 10⁴ m². This means that only a small percentage of landslides in the Arno Basin have an area smaller than the 10 m resolution window size for neighbour DEM analysis (30 m × 30 m = 900 m²). Curvature at that scale (and also slope) simply cannot capture the shape of terrain connected to most landslides.
Another interesting issue is related to the lack of importance that the prediction model assigns to the topo-hydrological covariates. The TWI and RIV indexes are always discarded in the optimal configuration at all scales, whilst the contributing area (FLA_STD, FLA and their logarithms LFA, LFA_STD) performs only slightly better, being discarded or low-ranked. We believe that this reflects the fact that these variables are more related to earth flows or rapid landslides in low-order channels of the hydrographic network (such as debris flows) (Montgomery and Dietrich, 1994; Costanzo et al., 2012; Felicísimo et al., 2013; Pereira et al., 2012; Xu et al., 2013), which are almost absent from the historical inventory used.
Conversely, the LCVs connected to the major triggering factor in the area, rainfall, are by far the most important, rivalled only by the elevation-derived covariates (ELE, ELE_STD). The pattern of landslide distribution in the geographic space seems to be strongly connected to the spatial pattern of the interpolated return time for rainfall events, as shown in Fig. 6. In particular, the most relevant types of rainfall events which dictate failure distribution are 30 mm in 1 h and 240 mm in 24 h; given the typical rainfall regime of the area, these values can be considered as characteristic of intense rainstorms. The areas which are more subject to this type of event are also characterized by a higher spatial probability of landslide occurrence. This is quite a new result for the Arno River basin, where previous studies did not consider triggering factors (Catani et al., 2005). The ranking of rainfall parameters remains stable across scales, which means that the autocorrelation of the Rp variables obtained for the study area on the basis of the rainfall gauge distribution is strong within the range considered in the analysis (max MUR = 500 m). Since the Rp distribution has been obtained starting from sample points with an average inter-distance in the order of 10 km, our analysis shows that this is a suitable scale at which to represent meteorological phenomena for landslide triggering prediction. Present-day state-of-the-art weather forecast systems offer a series of predictions at the meso-β scale (20-100 km) or at the meso-γ scale (2-20 km) that for local studies and predictions can be pushed to decametric pixel sizes only by making use of statistical downscaling techniques (Mercogliano et al., 2013). This implies that, even though at the moment the rainfall forecast models do not have the resolution needed for accurate real-time deterministic landslide prediction, they probably have a resolution which is suitable for LSM applications.
The low "classification-power" of many other classical LCVs may again be related to scale issues.For example, the LCVs connected to soil type (LIT, LIT_VAR), derived from a geological map at the 1 : 100 000 scale, do not appear to be important at any scale.We believe that the local differences in lithology or soil/rock composition usually represented in the geological maps at that scale are not accurate enough to capture the local factors leading to landslide occurrence or, alternatively, that they do not actually represent the true surface situation in which regolith and soil cover are the real object of mass failure instead of bedrock material.This consideration cannot be underestimated and has important implications on the way geologists should map surface deposits when the final objective is landslide prediction and forecasting.
The strong linkages between the LCVs' nature and the landslide size distribution in a given area are also highlighted in the overall classification results obtained at the different scales. The mapping accuracy, evaluated using the area under the ROC curve (AUC), is again dependent on scale, with the best performance obtained at the 50 m resolution (see Fig. 7 for a visual comparison). Our hypothesis is that the 50-100 m scale is the one that best represents the compromise between landslide size and LCVs' accuracy in the Arno River basin, given the available input data. This implies that, in agreement with the suggestions of Guzzetti et al. (2006), our findings underline the need to perform a sensitivity analysis whenever an effective LSM has to be produced for land planning and/or civil protection purposes in a given area.
Figure 8 depicts a new type of plot (see also Fig. 3) which can be considered a support tool for sensitivity analysis in LSM. In it, five LCVs are considered as an example of the effect of MUR on the susceptibility classification power (colour scale for parameter importance) and on the optimal data set configuration. We suggest that this plot, or similar graphic tools, should be routinely used to test model performance and to study the sensitivity of susceptibility estimates to model settings before performing statistical predictions on landslide occurrence.
Another important issue emerging from the results is the evident (and expected) influence of the training set dimension (mS#) on the classification performance in terms of AUC (Fig. 5). At first sight, this would only seem the confirmation of an obvious principle well known to everyone using statistics. However, this aspect of LSM has always been remarkably absent in the majority of published studies, which usually report a unique training sample size without discussing the implications of changing this constant value. From our results it is clear that no comparison between different LSMs would be possible without carefully assessing the MUR and the mS# used by the modellers. Furthermore, the sampling method itself influences the model performance and classification power (Yilmaz, 2010b). Figure 9 illustrates the impact of using two different sampling methods to choose the training set (random versus regular grid selection over the entire study area). For low percentages of sampling (5 and 10 %) the random choice is by far preferable to the regular grid sampling. This is even more important because in most of the practical cases of LSM application to the real world of civil protection and risk mitigation, we cannot decide a priori the mS# and we are forced to use what we have. LSM approaches ensuring high AUC performance only when using an mS# higher than e.g. 50 % are useless in areas where only a very low percentage of the terrain has already been mapped with an accuracy suitable for model training.

Conclusions
We performed a series of tests to understand how model tuning and model parameters can affect landslide susceptibility mapping in a well-studied area of Tuscany (the Arno River basin).
We used a specific version of the random forest classification method in which variable importance can be assessed across a series of resolutions and parameter sets, so as to understand the actual impact of model choices on the final result in terms of classification performance.
The main results we have obtained are that the optimal number of parameters varies with scale and resolution and that the importance of each given landslide conditioning variable is influenced by the model settings and the available data. Also, the choice of the training set (both for dimension and location) is of key importance for obtaining accurate results.
All our results concur in the conclusion that model sensitivity analyses of tuning choices, parameter sets and scale issues are of paramount importance in LSM and should always be performed before producing maps to be used for effective landslide risk mitigation.

Fig. 1. Location of the study area and distribution of landslides over the Arno River basin (red polygons).

Fig. 2. Plot showing the decrease of OOBE with increasing number of trees T# in the RF structure. A sill value is reached for T# > 100. A working value of T# = 200 was chosen for the RFtb model structure used in the tests and experiments.

Fig. 3. Rank-MUR-LCV# plot example illustrating the variation of parameter relative importance (expressed as rank using the colour ramp on the right) with parameter space (number of parameters used, LCV#) and map unit resolution (MUR, in m). Grey colours correspond to combinations of MUR and LCV# in which the parameter importance was estimated as poor or where the parameter was discarded. The white boxes indicate the combination of MUR-LCV# leading to the best classification for each resolution (see Table 2 for the optimal set description).

Fig. 4. ROC plots and AUC values for the best classifications obtained at different resolutions. The plots refer to a model training with mS# = 10 %. The 50 m resolution is the best, with AUC = 0.88, whilst no discriminant capability is shown by the RFtb used at 10 and 20 m resolutions. MUR = 100, 250 and 500 m display intermediate accuracies in terms of AUC.

Fig. 5. ROC plots and AUC values for a fixed resolution (MUR = 50 m) at increasing training sample size (0.5 % < mS# < 50 %). Results clearly highlight that AUC increases with mS#, suggesting the need to search for the best trade-off for this value, especially when applying LSM to nearly unknown areas.

Fig. 6. Map of the spatial distribution of the return time for a rainfall of 240 mm in 24 h, with the landslide inventory in overlay.

Fig. 7. Susceptibility maps at different resolutions for mS# = 10 %, showing the spatial distribution of what is summarized in Fig. 4.

Fig. 8. Rank-MUR-LCV# plots relative to the behaviour of five LCVs at different resolutions. The plot description is as in Fig. 3. The plots visually depict the efficiency of elevation and rainfall (Rp 240 mm, 24 h) in correctly classifying landslide proneness at all scales and for almost every parameter space dimension (LCV#).

Fig. 9. Influence of the training sampling on the overall classification results. The plot illustrates ROC curves for regular grid (block) versus random sampling at two different mS# values (5 and 10 %). The best performance is by far that offered by the random sampling scheme, at least at the tested mS# values.

Table 1. Out-of-bag errors (OOBE) relative to the LCVs used, for different numbers of model runs. Results show that the number of runs does not seem to influence the LCV impact on classification.

Table 2. Optimal configurations of the parameter set for each map unit resolution (MUR = 10, 20, 50, 100, 250 and 500 m). Discarded parameters are omitted. Numbers represent the rank of each parameter according to OOBE. MP is the overall misclassification probability of the given ensemble. Some of the parameters are never present in the optimal sets.