Sample size m tt rs : investigating the e ff ect of sample size on a logistic regression debris flow susceptibility model

Predictive spatial modelling is an important task in natural hazard assessment and regionalisation of geomorphic processes or landforms. Logistic regression is a multivariate statistical approach frequently used in predictive modelling; it can be conducted stepwise in order to select from a number of candidate independent variables those that lead to the best model. In our case study on a debris flow susceptibility model, we investigate the sensitivity of model selection and quality to different sample sizes in light of the following problem: on the one hand, a sample has to be large enough to cover the variability of geofactors within the study area, and to yield stable results; on the other hand, the sample must not be too large, because a large sample is likely to violate the assumption of independent observations due to spatial autocorrelation. Using stepwise model selection with 1000 random samples for a number of sample sizes between n = 50 and n = 5000, we investigate the inclusion and exclusion of geofactors and the diversity of the resulting models as a function of sample size; the multiplicity of different models is assessed using numerical indices borrowed from information theory and biodiversity research. Model diversity decreases with increasing sample size and reaches either a local minimum or a plateau; even larger sample sizes do not further reduce it, and approach the upper limit of sample size given, in this study, by the autocorrelation range of the spatial datasets. In this way, an optimised sample size can be derived from an exploratory analysis. Model uncertainty due to sampling and model selection, and its predictive ability, are explored statistically and spatially through the example of 100 models estimated in one study area and validated in a neighbouring area: depending on the study area and on sample size, the predicted probabilities for debris flow release differed, on average, by 7 to 23 percentage points. In view of these results, we argue that researchers applying model selection should explore the behaviour of the model selection for different sample sizes, and that consensus models created from a number of random samples should be given preference over models relying on a single sample.


Introduction
Spatial modelling, i.e. finding and applying a model of the spatial distribution of some phenomenon, can be used for two slightly different purposes: first for regionalisation, i.e. the transfer of findings from the surveyed area to some larger region.In geomorphology, the methodological framework for regionalising the occurrence of a process or a landform (that is associated with the activity of geomorphic processes) is termed "predictive geomorphological mapping" (Luoto and Hjort, 2005).It can be helpful in reducing time, cost and, to some degree, subjectivity associated with area-wide geomorphological mapping (van Asselen and Seijmonsbergen, 2006).Second, models are applied to identify areas where the phenomenon might occur in the future (even/especially where they is no evidence of recent activity).The probability (in space) for any spatial unit of experiencing an event forms an important factor of the hazard term in quantitative risk assessment, although for a complete formulation one also needs to consider the temporal probability and the magnitude-frequency relationship of events (Guzzetti et al., 2006).However, spatial modelling includes some temporal aspect as well.Specifically for landslides, the most important (see Pike et al., 2003, for more) underlying assumptions are (i) that landslides can occur and/or have occurred in the larger area wherever the conditions are equal or similar to those in the surveyed area and (ii) that future events will take place under the same or similar conditions as they did in the past (e.g.Fabbri et al., 2003).
A number of different approaches are used for spatial modelling of landslides (Guzzetti et al., 1999); they differ both with respect to spatial (presence/absence of whole landslide body or parts of it; observations based on different terrain subdivisions) and methodological aspects (heuristic or rule-based, statistical and physically-based approaches).For this study, we have chosen a statistical approach on three grounds: -Heuristic techniques achieve good results in some cases (e.g.Zimmermann et al., 1997;van den Eeckhaut et al., 2010), especially where the data base is too poor for the application of statistical approaches.Yet they involve a high degree of Introduction

Conclusions References
Tables Figures

Back Close
Full subjectivity (Davis, 1986) as they rely on the quality of expert knowledge.Hence, it is not surprising that physically-based and statistical models are more common than rule-based methods.
-Models based on quantitative approaches, i.e. physically-based and statistical ones, are data-based, more objective and reproducible, and "calibration and validation of the models support transparency and rationality of the prediction" (Pistocchi et al., 2002, p. 766).The process of establishing a quantitative model should be guided by, but is not only based on, a-priori understanding of the target phenomenon; moreover, it also implies an analytical procedure which might lead to insights in the relative importance of influencing factors.
-Physically-based approaches (Montgomery and Dietrich, 1994;Pudasaini et al., 2005) aim at modelling the processes which lead to failure.Landslide models tend to be highly parametrized, i.e. they require a large number of parameters.The values of these parameters are difficult or even impossible to obtain, especially for large areas and considering their spatiotemporal variability.In contrast, statistical models do not simulate in detail the processes leading to failure, but rely on statistical relationships between its occurrence and spatial factors (directly measurable or "substituted" by proxy data).In debris-flow research, where basic conditions of the process are incompletely understood, statistical approaches are often favoured over physically-based ones (Wichmann, 2006).Carrara et al. (2008) compare physically-based and statistical models for regional debris flow susceptibility; although all models yielded comparable results, a statistical approach (discriminant analysis) was considered the most suitable method.
In this study, we apply the method of multivariate logistic regression to the identification of potential debris flow initiation sites in a high mountain catchment; the spatial unit is the raster cell (as opposed to, e.g., slope units, see van den Eeckhaut et al., 2009).

Conclusions References
Tables Figures

Back Close
Full Terhorst , 2006) or "certainty factor" (e.g.Binaghi et al., 1998), and artificial neural networks (e.g.Liu et al., 2006), logistic regression belongs to the most frequently chosen approaches to spatial modelling of landslides (Atkinson et al., 1998;Ohlmacher and Davis, 2003;Begueria and Lorente, 2003;Brenning, 2005;Ayalew and Yamagishi, 2005;Beguería, 2006b;van den Eeckhaut et al., 2006;Meusburger and Alewell, 2009;van den Eeckhaut et al., 2010;Atkinson and Massari, 2011;Ruette et al., 2011;Guns and Vanacker, 2012).Recently, some published studies dealt specifically with debris flow susceptibility models on the regional scale; for the identification of potential release areas, a range of different approaches has been used, including heuristic (Horton et al., 2008;Kappes et al., 2011;Fischer et al., 2012) and statistical ones (Heckmann and Becht, 2009;Blahut et al., 2010b,a).The so delineated release areas can be used as starting points for models that predict the pathways, lateral extent, runout length and other relevant properties of debris flows (e.g.Blahut et al., 2010b;Kappes et al., 2011), which is important for hazard assessment and has also been used in geomorphological applications, for example research on sediment cascades (Wichmann et al., 2009;Heckmann and Schwanghart, 2013).In order to use a model for prediction, a sample has to be drawn, and the model parameters of the population are estimated based on that sample.Sampling is essential because event and non-event units show spatial autocorrelation (see Sect. 1.2), and dependent data lead too easily to the rejection of null hypotheses and the incorrect declaration of parameters as significant (Legendre, 1993, explains this for ecological models, see also van den Eeckhaut et al., 2006).Using a stepwise approach, the predictor variables for an effective, yet parsimonious model are selected from a set of candidate geofactors (Sect.3.2.2).Brenning (2005) found that logistic regression with stepwise variable selection yielded the lowest error rates in his comparison of different statistical methods.The choice of predictor variables will understandably depend on the sample (Guns and Vanacker, 2012), and it is also clear that the aim of every susceptibility model should be a reliable prediction that does not too much depend on the sample taken to select the variables and estimate the model parameters.Several previous studies do not involve sampling at all (e.g.Ohlmacher Introduction

Conclusions References
Tables Figures

Back Close
Full and Davis, 2003;Ruette et al., 2011), i.e. they use all available data for estimating the model parameters.The majority of studies uses only one single sample (e.g.Atkinson et al., 1998;van den Eeckhaut et al., 2006;Meusburger and Alewell, 2009), the size of which usually depends on the number or size (in terms of raster cells) of landslide initiation zones (see Sect. 1.1).Recognising the dependence of model results on the sample, Brenning (2005) takes 50 samples to compare error rates across different sample sizes and statistical methods.Begueria (2006a) and Guns and Vanacker (2012) apply 50-fold replication in order to estimate the robustness of the modelling result with respect to sampling, van den Eeckhaut et al. ( 2010) calculate an ensemble of 25 models from different samples of their data.Hjort and Marmion (2008) conduct repeat sampling to explore the influence of sample size on the predictive power of (among others) multiple logistic regression models for predictive geomorphological mapping.
The present study has two main foci that will be developed in detail in the following subsections: first, we explore the sensitivity of model selection to sample size.Sections 1.1 and 1.2 will explain why the sample size must neither be too small nor too large.Here, the main aim of the study is to investigate if an "optimal" sampling size can be found as a compromise between samples too small and too large.Second, we quantify the uncertainty inherent in a stepwise modelling approach, with respect to (i) the selection of geofactors, (ii) model parameters, and (iii) the spatial pattern of uncertainty in the resulting susceptibility map.This study aim will be developed in Sect.1.3.

Constraints on sample size 1: why the sample must not be too small
In inferential statistics, confidence intervals are calculated for population parameters based on a sample; the width of the former depends, besides the desired confidence level, on the sample size.Small samples result in large standard errors and wide confidence intervals for the population parameters.In case of regression parameters, small samples cause the estimation to be uncertain, and there is a higher risk of parameters being insignificant (because the respective confidence interval includes zero).With respect to replicate sampling and model selection, it is expected that the diversity of Introduction

Conclusions References
Tables Figures

Back Close
Full models (and hence the dependence of the models on the sample) will be large in this case.
Moreover, in a large study area, a small sample is unlikely to cover the variability of geofactors, especially if several of them are part of the model.Here, a larger sample would include more information on the study area and would possibly provide a better model.There are rules of thumb that estimate the minimum sample size for a regression analysis on the basis of a constant (e.g.> 50), of the ratio of observations and predictor variables, or of a combination of the latter; such rules have been explored in light of significance, power and effect size, e.g. by Green (1991), who found "some support" for the rule of thumb n min ≥ 50 + 8m where n min is the minimum sample size and m is the number of predictor variables.
In this study, when we speak of sample size, we always address a sample of "nonevents", i.e. a sample of raster cells without debris flow initiation.If a random sample referred to all raster cells, including event and non-event cells, the number of event cells in the sample would certainly be smaller than in the original inventory, which would cause a loss of information particularly for those cells that represent the target of the modelling exercise; therefore, all initiation areas will be represented in the models, and only the size of the non-event sample is varied in our investigation.

Constraints on sample size 2: why the sample must not be too large
While it is intuitive that larger samples contain more information that can be used by the model, and the model might be better, there are several reasons why the sample size must not be too large either.King and Zeng (2001) argue that the non-event sample size has to be kept as small as possible because of the disproportionate cost and effort of acquiring data for many variables and observations not related to the target phenomenon (event).Like in political science, the acquisition of observations is costly in ecology (with the application of regression models to the spatial prediction of species distribution), especially given the complexity of the investigated systems reflected in large numbers of predictors and the Introduction

Conclusions References
Tables Figures

Back Close
Full logistic difficulty of mapping the presence or absence of a species in remote areas (see e.g.Stockwell and Townsend Peterson, 2002).An important justification for predictive geomorphological mapping (Luoto and Hjort, 2005) is that area-wide field mapping is time-consuming, difficult in remote or inaccessible areas, and may suffer from subjectivity (van Asselen and Seijmonsbergen, 2006;Hjort and Marmion, 2008).However, in contrast to the examples from political and ecological science, many if not most variables in predictive geomorphological mapping are easily derived from digital elevation models (that are available globally, with ever increasing accuracy and resolution) and remote sensing data.This does not change the effort required for mapping the target phenomenon ("events"), but the motivation for limiting sample size of non-events appears to be quite different, as it does not so much refer to the effort of data acquisition (quantity and quality).In our study, an empirical analysis of the stability of model selection as a function of sample size is therefore given preference over the mere adoption of some ratio of event : non-event observations given in the literature (e.g. by King and Zeng, 2001), mostly without justifying the particular choice of this ratio.
Other reasons for restricting sample size are overparameterisation and overfitting of the model (Hjort and Marmion, 2008, and references therein).Increasing sample sizes cause standard errors and confidence intervals in parameter estimation to decrease.In a stepwise model selection, very large samples are expected to cause parameter significance (as a criterion for variable inclusion), and hence the number of included variables, to increase (risk of overparameterisation).Such inclusion of more information does not necessarily lead to better model performance; Stockwell and Townsend Peterson (2002) describes "plateaus" wherein new data add little to model performance.In some cases, inclusion of more data even causes worse performance, because a model fit to a very specific set of information may perform poorly on new data (risk of overfitting, see Stockwell and Townsend Peterson, 2002, and references therein).Brenning (2005), however, states that overfitting is "not a serious problem for logistic regression".
The most serious reason for limiting the sample size is spatial autocorrelation.Logistic regression generally requires few assumptions to be met; the most important Introduction

Conclusions References
Tables Figures

Back Close
Full are (i) the independence of observations and (ii) uncorrelated independent variables.While violations of the second assumption can be avoided by testing for multicollinearity and excluding variables (see Sect. 3.2.1), the first assumption proves to be critical when dealing with spatial data.Geofactors tend to have very similar values in a close neighbourhood, a property called spatial autocorrelation.If several observations from nearby sites are included in a model, the independence assumption will not hold.In case of the generalised linear modelling approach adopted in this study, the maximum likelihood method that is used to estimate the model parameters strictly requires the observations to be independent (e.g., Hosmer and Lemeshow, 2000).Moreover, Atkinson and Massari (2011) explain that (spatial) autocorrelation of the geofactors causes the model residuals to be spatially autocorrelated (which is not acceptable as model residuals have to be uncorrelated), and that this may lead to "incoherent significance estimates for the parameters" (see also Brenning, 2005).In previous studies applying logistic regression to landslide susceptibility analysis, this problem is frequently ignored (no sample is used at all, see above).In some instances, the risk of autocorrelation is dealt with for "events" only, as geofactors tend to be homogeneous (and consequently strongly autocorrelated) on landslide terrain (Atkinson and Massari, 2011).In order to mitigate the issue of spatial autocorrelation, some authors choose one raster cell for each landslide source area on a systematic basis.Atkinson et al. (1998)  tiguous (and hence potentially strongly spatially autocorrelated) sample of hundreds of landslide initiation cells from entering the model.Spatial autocorrelation has also been accounted for in model validation (Brenning, 2005).However, as Atkinson and Massari (2011) point out, autocorrelation in the geofactors is frequently not adequately accounted for in the regression model.While the latter study proposes an autologistic model (see also Brenning, 2005), we will try to warrant independence of observations through the choice of an adequate sampling size (see Sect. 3.3.2):as the number of sampled raster cells in a finite study area increases, the average distance between those cells will decrease, and finally the independence assumption will no longer hold given the spatial autocorrelation of the geofactors.
Normally, a logistic regression model is fit to a sample where the ratio of event : nonevent cases is approximately 1 : 1.Then, the so-called cutoff, the value of the model result that discriminates between event and non-event, equals 0.5.King and Zeng (2001) explain that the number of non-events should be typically 2-5 times higher than that of events.In this case, the cutoff needed to translate the model result to a classification (event or non-event) would need to be adjusted accordingly.Because the ratio of event : non-event spatial units (raster cells, but also lumped spatial units, Begueria and Lorente, 2003) usually is by far smaller (in our study areas, the ratio of release area cells to the total study area is 1 : 200 and 1 : 500, respectively), a bias towards small probabilities arises; this problem has been addressed by the development of "rare events logistic regression" (King and Zeng, 2001).Besides endogenous stratified sampling (including all events and a random sample of non-events in the model), these authors propose corrections for the intercept and for the estimated probabilities.Rare events logistic regression was applied in landslide susceptibility modelling by van den Eeckhaut et al. (2006) and Guns and Vanacker (2012).In many studies, endogenous stratified sampling has been adopted, and the authors chose event : non-event ratios of 1 : 1 (e.g.Brenning, 2005;Meusburger and Alewell, 2009;van den Eeckhaut et al., 2010), 1 : 2 (Wang and Sassa, 2005), 1 : 5 (van den Eeckhaut et al., 2006), or 1 : 10 ( Begueria and Lorente, 2003;Begueria, 2006a;Guns and Vanacker, 2012).Finally, Introduction

Conclusions References
Tables Figures

Back Close
Full  1998), who use only the central cell of each landslide as the event sample, sample as many non-event cells as it is required to attain the ratio of landslide to non-landslide area.
In our study, we adopt stratified random sampling by a random sample of one cell for each debris flow initiation zone, and a random sample of non-event cells.The size of the latter is then varied in order to explore the effect on stepwise model selection; hence, we do not pre-select an event : non-event ratio.Rare event correction according to King and Zeng (2001) is not applied.

Uncertainty: model selection, parameters, spatial patterns
The result of the investigations motivated in the previous subsections is a suitable sample size reaching a compromise between sample sizes too small and too large, seeking above all a stable model selection (i.e.low diversity of geofactors remaining in the repeat stepwise selection), and also sample independence (i.e.avoiding spatial autocorrelation).Even with an optimised sample size in that respect, the selection of predictor variables will still depend on the specific sample.As different predictor variables, with their distinct spatial structure, will be part of the model when the procedure is repeated with a different sample, the spatial pattern of the resulting susceptibility map will also differ from time to time, and the predictive power of the model might be different as well.The second main goal of this study is to elucidate three aspects of this uncertainty: (i) geofactors and how often they are included after stepwise selection, (ii) the range of model parameters estimated for the replications, and (iii) the spatial distribution of differences in the estimated susceptibility.This is important because in the majority of studies employing sampling for model calculation, only one sample is taken; on the other hand, all studies involving repeat sampling concentrate on the set of geofactors, the parameters and the predictive ability of the models, and do not investigate how this affects the spatial distribution of susceptibility.Introduction

Conclusions References
Tables Figures

Back Close
Full

Study area
This study has been conducted in two adjacent subcatchments of the Horlachtal valley, a tributary of the Oetztal valley, located in the Austrian central Alps (Stubai Alps).The two valleys, the Zwieselbachtal (ZBT, ca.19 km 2 ), and the Larstigtal (LT, ca.7 km 2 ), strike approximately S-N and have a typical trough cross-section.Due to their adjacency, they are similar in their natural characteristics.Figure 1 shows the location and and an overview of the catchments.The most important properties of the study areas are listed in Table 1; the Horlachtal and its subcatchments are described in more detail by Rieger (1999) and Geitner (1999).
The lithology of both valleys is dominated by gneiss and mica schist, metamorphic granites can also be found.Pleistocene glaciations have shaped the valleys and are evidenced by glacial landforms (e.g.moraines, cirques, roches moutonnées).Glacial cirques are concentrated on the east-facing valley sides whereas the west-facing valley sides are marked by extensive scree slopes.Currently, the two catchments are formed primarily by fluvial and gravitational processes such as rock falls and debris flows.
Sediment transport through the catchments is limited as the valleys exhibit largely disconnected subsystems (at least with respect to the transport of coarse sediment, see Heckmann and Schwanghart, 2013) separated by alluvial reaches of the Zwieselbach and Larstigtal creeks, respectively.These reaches are located immediately upstream of the terminal moraines of the Little Ice Age and of the particularly well-preserved terminal moraines of the Egesen stadial (corresponding to the Younger Dryas, ca.11 to 12 ka BP, recent datings for the European Alps are listed by Ivy-Ochs et al., 2008).
Debris flows in both study areas can be termed slope-type debris flows of type 2 according to Zimmermann et al. (1997).Events of this type initiate on scree slopes following failure caused by positive pore water pressure following intense rainfall and progressive erosion.This is often the case at the base of rock walls where debris flow formation is triggered by the so-called "firehose-effect" (Johnson and Rodine, 1984) which describes concentrated flux of water out of the rockface onto the talus.Slope type Introduction

Conclusions References
Tables Figures

Back Close
Full debris flows can be regarded as transport-limited processes, thus their frequency is primarily controlled by hydroclimatic events (Bovis and Jakob, 1999).In the study area, rain intensities of around 20 mm within half an hour have been reported to trigger debris flows (Becht, 1995;Rieger, 1999), while Zimmermann et al. (1997) suggest regional intensity-duration thresholds of about 11 mm per hour.The threshold is comparatively low, which has been attributed to the low mean annual precipitation (Hagg and Becht, 2000) of ca.1000 mm (Becht, 1995).Vegetation primarily consists of dwarf shrub heath, alpine meadows and pioneer vegetation.At elevations of > 2300-2500 m, bedrock and scree are predominant.In general, more than 60 % of the study area are completely lacking vegetation cover.

Debris flow inventory
As every statistical approach, logistic regression requires an inventory of targets (here: a map of debris flow initiation areas) for the dependent variable, and maps of (potentially) influencing factors as independent variables, hereafter referred to as geofactors.The dependent variable (here: debris flow initiation) is observed as a binary variable (1: presence, 0: absence).The debris flows inventory of the Zwieselbachtal and Larstigtal catchment was compiled using orthophoto and field maps (Thiel, 2013), updating an earlier inventory for which debris flows had been surveyed using a total station (Rieger, 1999).It contains 81 events within the Zwieselbachtal, and 64 events within the Larstigtal.Debris flows areas are represented by polygon features (which had to be converted to raster format for the pixel-based approach of this study), and divided in three zones related to geomorphic activity: erosion (indicated by incision), transition (indicated by a channelised reach accompanied by levées) and the depositional lobe(s).Introduction

Conclusions References
Tables Figures

Back Close
Full Conceptionally, as the susceptibility map specifically aims at predicting potential initiation zones, the "event" samples for the regression models should be taken from the erosional zones, preferably from the uppermost part as the latter represents the area where events typically started (and probably will also initiate in the future).The strategy of using only the detachment zone of a mass movement for susceptibility modelling has been advocated by several workers (see for example van den Eeckhaut et al., 2006;Heckmann and Becht, 2009), Magliulo et al. (2008), however, report that this restriction does not automatically lead to better results.The initial idea of manually setting one raster cell for each debris flow initiation zone was discarded, because placing this raster cell in the channelised part would introduce a bias towards larger catchment areas and concave plan curvature.Therefore, a GIS procedure was used to select, for each debris flow erosional zone, the area that is higher than the P75 percentile of elevation, i.e. the uppermost 25 %.The raster cells belonging to the initiation zone of each debris flows are coded with an ID, allowing for a stratified random sampling of one cell per debris flow event for each regression model.

Digital terrain model
Before model selection (see Sect. 3.2.2),geofactors conceptually related to debris flow initiation have been preselected.Debris flow initiation is related to (i) the availability of mobile debris (ii) on steep slopes and (iii) large amounts of water, typically provided by intense rainfall.Not all influencing factors in these three groups (material, relief, water) can be directly measured or calculated; many of them, however, can be derived from a DEM, either directly or as proxies.Although geological and landcover maps were available, we tried to use only geofactors that can be derived from (high-quality) digital elevation models (DEM) in order to test the feasibility of DEM-based modeling (such high-quality DEMs are increasingly available for large parts of the world).
For the derivation of several topographical parameters used as geofactors for the regression models, we used a raster digital elevation model (DEM) with a resolution of Introduction

Conclusions References
Tables Figures

Back Close
Full 1 m that was interpolated from an airborne LiDAR survey in the year 2006.For most applications, and for the modelling itself, the original DEM (DEM1) was resampled to a raster resolution of 5 m (DEM5).Apart from saving memory and computing time, the resampling smoothes the DEM so that very fine-scale topography is no longer contained in the resulting DEM5.This effect is desired, as debris flow initiation is not expected to result from microscale topography.Information on available sediment is usually provided by landcover and/or geological maps.The former mainly contain information on vegetation (that might in cases stabilise soils and sediments), the latter focus on different types of bedrock.In this study, the "available sediment" group is represented by one single geofactor (roughness class).This geofactor is derived from a cluster analysis of slope (DEM5, see below) and roughness.Roughness was calculated as the "vector ruggedness measure" (Sappington et al., 2007) on the DEM1 within a moving window of radius 5 m, and the result was resampled to the same resolution and extent as the DEM5 using the nearest neighbour approach.This analysis yields two clusters closely representing (i) bedrock and (ii) areas covered by sediments.For the Zwieselbachtal, this unsupervised classification could be validated with a very detailed landcover map created from orthophoto imagery; the φ coefficient of the mapped vs. the DEM-based classification was 0.78.The reason for the satisfactory fit is the characteristic fine-scale roughness 1 of bedrock areas that can easily be discerned on a shaded relief map, together with the existence of a sharp threshold of slope (resembling the angle of internal friction) above which an area cannot be covered by unconsolidated scree.Leaving out the information on landcover/vegetation is not expected to be decisive in our case study, because the study areas are only sparsely covered with vegetation, mostly grass, and forest is widely missing, at least in the areas relevant for debris flow genesis.
Relief parameters were derived from the DHM5 using the algorithm of Zevenbergen and Thorne (1987) implemented in SAGA GIS (http://www.saga-gis.org).As slope sta-

Conclusions References
Tables Figures

Back Close
Full bility, especially for scree, is a function of slope, this parameter is expected to be very important for debris flow initiation.As both valley axes have a north to south orientation (resulting in a strong bias towards east-and west-facing slopes), and as the physical role of aspect cannot be described unambiguously, it was not included in the analysis.
Plan and profile curvatures were derived with the same algorithm as slope, but from a DEM1 smoothed with a moving window mean filter with a radius of 10 m.This was deemed necessary because of the extremely noisy character of fine-scale curvature; medium-scale curvature is expected to be a better proxy variable for convergent flow of water (plan curvature) and changes in flow velocity (profile curvature).
Relief parameters related to the local catchment area are derived from the DEM5 as proxies for the availability of water for debris flow initiation.We calculated the specific catchment area (SCA) as the local flow accumulation per unit contour length using a multiple flowdirection algorithm (Freeman, 1991).Heavy rainfall on steep bedrock slopes is expected to be converted almost entirely to Hortonian overland flow; on talus slopes bordering steep rockfaces, this runoff can cause the initiation of debris flows, especially were it enters the talus in a channelised manner ("firehose effect", see e.g.Johnson and Rodine, 1984;Coe et al., 2008).However, if the sediment is coarse grained, large amounts of water are expected to infiltrate; this leads to a decrease of hydrological connectivity, and at least to an attenuation of the increase of runoff with increasing catchment size.Therefore, we re-calculated the catchment area, accumulating only bedrock cells in the roughness class map instead of every DEM5 raster cell.
The modified SCA map hence refers to the size of the bedrock catchment draining into each raster cell.

The susceptibility model
Multivariate logistic regression (Hosmer and Lemeshow, 2000) forms part of the family of generalised linear models (GLM); in contrast to ordinary linear models, a function of the expected value of a response variable is modelled by a linear combination of continuous or discrete predictor variables.In logistic regression, the response variable Introduction

Conclusions References
Tables Figures

Back Close
Full is binary (Bernoulli distribution); here, it takes the values 0 (no debris flow initiation) and 1 (debris flow initiation).The response function is the logit transform of the probability p ∈ ]0, 1[ that the response variable takes the value 1: Since the logit is within the interval ]−∞, ∞[, it can be modelled as a linear combination of predictor variables X 1 . ..X n : where β 0 is the intercept and β 1 . ..β n are the model parameters.These are estimated using a maximum likelihood approach.
The spatial data are generated and managed in SAGA GIS, including the derivation of relief parameters (Sect.3.1.2);for the statistical analysis, they can be directly read from the SAGA native data format using the RSAGA package (Brenning, 2009) for the statistical software R (R Development Core Team, 2012).Logistic regression is then performed using the glm and stepAIC functions of the MASS package (Venables and Ripley, 2002).For reasons explained in the introduction, we estimate the model parameters for a sample (the size of which we will try to optimise in this study) of "event" (debris flow initiation) and "non-event" cells; sampling is also performed in R.
The resulting susceptibility maps are written back to SAGA data format for visualisation and further spatial analysis.They contain the probability that the dependent variable takes the value 1, i.e. that debris flow initiation will take or has taken place.

Multicollinearity analysis
Besides sample independence, an important prerequisite for the application of GLM is the absence of multicollinearity, i.e. that the predictor variables are not correlated with each other.In order to test for multicollinearity, we applied the vif function of the Introduction

Conclusions References
Tables Figures

Back Close
Full car package (Weisberg and Fox, 2010) to a full model (i.e.including all geofactors described in Sect.3.1), yielding the variance inflation factors (VIF) of each geofactor.
Although no binding rules exist for their interpretation, several authors who conduct a multicollinearity analysis apply a very strict threshold of 2 above which variables are considered multicollinear, and are excluded from the model (e.g.van den Eeckhaut et al., 2006Eeckhaut et al., , 2010;;Guns and Vanacker, 2012).However, the most common rule of thumb is reported to be the "rule of 10" (using VIF = 10 as a threshold for severe multicollinearity), and the use of strict thresholds of VIF appears to be questionable (O'brien, 2007).The analysis of VIFs yields values of 1.18 and 1.47 for the two curvature variables, and 1.77 for SCA.Roughness and slope have VIFs of 2.06 and 2.76, respectively, which is only slightly above the threshold used in other studies, so we decided to keep all candidate variables.

Stepwise selection of predictor variables
An important task in susceptibility modelling is model building, i.e. the selection of the independent variables (geofactors).In Sect.3.1, several candidate variables are described that conceptionally explain the spatial distribution of debris flow initiation.Model building is achieved in this study through an automatic stepwise variable selection (function stepAIC Venables and Ripley, 2002).Starting from a full model, i.e. a model including all variables, variables are removed (or re-included) in order to minimise the Akaike Information Criterion (AIC, Akaike, 1973) which is calculated from the likelihood function of the model and the number of predictor variables.The AIC penalises for the number of predictor variables, i.e. it increases with the number of variables, and it decreases with a larger likelihood function indicating a better model.Hence, although there is no theoretical justification of the AIC (Sachs and Hedderich, 2006), this procedure is suitable in practice for selecting a parsimonious model, i.e. a best-fit model using as few variables as possible (Brenning, 2005).Menard (2002) states that stepwise model selection is inappropriate for theory testing, but this is not the goal of our study.Landslide susceptibility models include theoretical considerations, but aim at ef-Introduction

Conclusions References
Tables Figures

Back Close
Full fectively predicting landslide occurrence rather than explaining it.If, for example, the geofactor slope was excluded from a model, we would not conclude that slope was no factor of debris flow initiation.However, the results of stepwise logistic regression are often used to rank the controlling factors by importance (e.g.van den Eeckhaut et al., 2006).Stepwise procedures can be applied as a backward selection, as in this study (and, e.g., in Brenning, 2005;Ruette et al., 2011), but also as a forward selection (Beguería, 2006b; Meusburger and Alewell, 2009;Atkinson and Massari, 2011).Menard (2002) explains that backward selection is in some cases superior to the forward procedure.Note that the stepwise procedure used here and in Brenning (2005) differs from other studies where the decision of keeping of dropping predictor variables is based on significance of model improvement (e.g.Beguería, 2006b;Meusburger and Alewell, 2009;Guns and Vanacker, 2012), not on an information criterion.Recently, alternative approaches for model selection have been proposed (e.g.Calcagno and Mazancourt, 2010), they will be tested in future research.

Model validation
It has been stressed that a modelling study without proper validation is useless (Chung and Fabbri, 2003).Many studies in susceptibility modeling use spatial or temporal cross-validation (space or time partition, cf.Chung and Fabbri, 2003) within the same study area, i.e. the data are split either systematically or randomly into training and test datasets (Chung and Fabbri, 2003;Begueria, 2006a) according to their location or time of occurrence.Here, we estimate model parameters based on samples drawn from the Zwieselbachtal catchment, and apply the resulting models to the neighbouring Larstigtal catchment.Hence, training and test areas are completely independent.For each model run, the predictive ability is evaluated using receiver operating curves (ROC) or prediction-rate curves sensu Chung and Fabbri (2003), plotting true positive against false positive rates.The advantage of ROC is that they yield a threshold-independent measure of predictive ability; in our case, we do not have to define a threshold of modelled landslide probability below which we do not recognise susceptibility.Additionally, Introduction

Conclusions References
Tables Figures

Back Close
Full as a single measure of predictive ability, the area under the curve (AUC) is calculated (Hosmer and Lemeshow, 2000;Begueria, 2006a); this parameter falls in the range [0.5,1],where 0.5 is equivalent to random prediction and 1 to a perfect prediction.

Exploring the effect of sample size
In the introduction, we have argued why the sample size should be neither too small nor too large.Here, we describe (i) how the effect of sample size on the diversity of models is explored, and (ii) how we constrain the upper limit of sample size.

Sample size and model diversity
For small sample sizes, the geofactor composition of the resulting model depends extremely on the random sample, because small samples cannot sufficiently cover the diversity of geofactors within the study area.We hypothesise that with increasing sample size, the diversity of relevant models (selected by the stepwise procedure) first decreases towards a plateau that can be explained with the overall variability of geofactors in the study area; when the sample size approaches the size of the study area, the variability of models will eventually decrease to zero.Such a behaviour is similar to the dependence on sample size of the predictive power of predictive geomorphological models explored by Hjort and Marmion (2008).We analyse model diversity by repeating the stepwise model selection 1000 times with independent samples of different sample sizes (between 50 and 5000).Specifically, a stratified sampling scheme has been adopted; one single raster cell is randomly selected from each debris flow initiation zone, and the sample size of non-event cells (from the remaining area) is varied.The choice of non-event sample sizes in relation to event sample size ranges from 1 : 1.6 to 60 : 1, thus including the recommendations of King and Zeng (2001) and the alternatives chosen in landslide susceptibility studies, e.g. 5 : 1 (van den Eeckhaut et al., 2006), or 10 : 1 (Beguería, 2006b;Guns and Vanacker, 2012).Introduction

Conclusions References
Tables Figures

Back Close
Full For each of the 1000 replications, the geofactors that remain in the "best" model (with respect to the AIC) after stepwise selection are saved in a table.Each geofactor can then be evaluated by the percentage of replications which it was part of (cf.Guns and Vanacker, 2012).A model consisting of a distinct set of geofactors defines a "model species"; theoretically, 2 g different model species exist for g geofactors.The diversity of the 1000 replicate models calculated for each sample size is evaluated using the proportions of model species using three measures: (i) the number k of model species ("species richness"), (ii) the Shannon diversity index, also known as Shannon information entropy, and (iii) the Simpson index.The Shannon index was developed in information theory (Shannon, 1948) and has been widely applied in ecology as an index measure of biodiversity (e.g.Magurran, 2004).In geomorphology, it has been used to assess the uncertainty of drainage routing and watershed delineation (Schwanghart and Heckmann, 2012).In our study, it is calculated as where i = 1. ..k represents the i -th of k different model species, and p i is the probability of occurrence of the i -th species, estimated by n i /N, the proportion of the i -th model species found in N individual stepwise modeling runs (here: 1000).The logtransformed Simpson index (Simpson, 1949) has been developed for measuring biodiversity; it is considered superior to the H as it is independent of sample size (Magurran, 2004).It is calculated as where n i is the absolute frequency of the i -th model species and N is the number of individual models (here: 1000).smaller numbers of different model species and a high degree of dominance of one or few species.

Sample size and spatial autocorrelation
In our study, the spatial autocorrelation of a dataset is explored with the empirical semivariogram, which is typically used for geostatistical interpolation techniques such as Kriging (Webster and Oliver, 2007).It is derived from point measurements by evaluating the semivariance of values of a variable (geofactor) for pairs of points separated by a specific distance.One important property of the semivariogram is the range; points separated by a distance below this range are autocorrelated.Brenning (2005) uses the range of the empirical correlogram of the residuals of a logistic regression model (180 m in his study) to constrain the minimum distance between training and test data points in spatial cross-validation.Similarly, we estimate the range parameter of the variogram of each geofactor to constrain the sample size: we argue that the average distance between raster cells in the (non-event) sample should not fall within the autocorrelation range(s) of the geofactors included in the model in order to keep the non-event sample as uncorrelated or independent as possible.As the average distance implies that some points in the sample will be closer neighbours, we concede that this strategy minimises spatial autocorrelation rather than preventing it.Assuming a set of randomly distributed points (here: raster cells), the average distance d to the nearest neighbour can be estimated by Eq. ( 5): (Clark and Evans, 1954), where ρ is the density of the sample, i.e. the sample size n divided by the study area (here: the number of raster cells within the study area multiplied with 25 m 2 , the area of each cell).For each study area, d is calculated as a function of n and used to estimate the upper boundary for the "suitable sample size".
Instead of using the highest autocorrelation range (i.e. that of the geofactor with the Introduction

Conclusions References
Tables Figures

Back Close
Full most far-reaching spatial autocorrelation) as a crisp, absolute upper limit of sample size, we take into account d (n) as it progressively falls below the autocorrelation range of more and more geofactors, and we regard the corresponding n as progressively less acceptable.An upper limit is finally reached when the smallest autocorrelation range from the set of geofactors is undercut.
Figure 2 shows the empirical geofactor semivariograms and the practical range parameter (i.e. the range where 95 % of the sill is reached) of the fitted variogram models.Depending on the geofactor, spherical and exponential models were used.It is obvious that some geofactors, e.g.slope, are autocorrelated on multiple scales.In these cases, the lower range is used; however, it appears that a sample which is independent with respect to all geofactors is not possible.

Variability of model results
The investigations described in the previous sections have the aim of quantifying and reducing the dependence of the results on the sample while maintaining sample independence.Once a suitable sample size is estimated, we investigate the variability of model results (both quantitatively and with respect to its spatial distribution).In order to do so, we repeat 100 times the sampling, model selection, fitting and application for the Zwieselbachtal area, creating a stack of 100 gridded susceptibility maps of the whole study area.The median of 100 probabilities in each raster cell is taken as a consensus model (Marmion et al., 2009) and the final susceptibility map.The interquantile range IQR90 = p 0.95 − p 0.05 , that encompasses 90 % of the modelled susceptibility values as a non-parametric measure of dispersion, quantifies the uncertainty caused by sampling and stepwise model selection.As this measure is calculated for each raster cell, the respective map can be used to visualise the spatial distribution of model uncertainty (not with respect to the true probability, but with respect to model variability).In addition, the distribution of the parameter coefficients of the 100 models, and their predictive power (ROCs and AUC, see Sect.3.2.3)can be displayed and analysed.Introduction

Conclusions References
Tables Figures

Back Close
Full

Investigation of sample size effects
Before we approach the question of an optimal range of sample sizes, we take a look at the results of model selection as a function of sample size.Specifically, Fig. 3 shows, for each geofactor, the number of models that retained this geofactor after the AICbased selection procedure.The six geofactors that were eligible for model selection were slope, specific catchment area (SCA), the interaction of the latter factors (denoted "Slope*SCA" in Fig. 3), the roughness category which distinguishes bedrock from debris-mantled slopes, and the two curvature variables.The first row of the diagram shows positive changes in the selection.While roughness and profile curvature gradually increase their membership with larger sample sizes (roughness starting from almost zero), the interaction term Slope*SCA quickly attains 100 % (i.e.all of the 1000 samples lead to models containing this variable) already with small samples.Here, it is important to mention that interaction terms may only be part of a model if their marginals (here: slope and SCA) are also contained.This is the case, as the latter variables are contained in all models, irrespective of sample size (second row of Fig. 3).
The proportion of models containing the geofactor plan curvature is very low, starting with about 20 % and decreasing in larger samples.
If the "success" of a geofactor in the model selection procedure is a measure of its importance, then the most important variables are slope, SCA, the interaction of the latter, and profile curvature.The importance of roughness and plan curvature is low, but the number of models containing roughness surpasses that of models containing plan curvature already at sample sizes below 1000.These findings are consistent with previous work on (slope type) debris flow susceptibility: Heckmann and Becht (2009) and Wichmann et al. (2009), for example, use slope, landcover, and a variable called the CIT index (Montgomery and Foufoula-Georgiou, 1993).The latter is calculated as the specific catchment area times the squared tangent of slope.Similarly to the CIT index, the interaction term slope*SCA can be interpreted physically (mathematically, Introduction

Conclusions References
Tables Figures

Back Close
Full it is the product of the two geofactors) as a geofactor predicting high erosion potential (slope and catchment areas as proxies for the abundance and energy of surface runoff; see also discussion in Sect.4.2.1).In comparing several models (discriminant analysis and logistic regression) Carrara et al. (2008) observed that factors relating to slope gradient (here: slope), landcover and availability of detrital material (here: roughness as a proxy), and active erosional processes (here: slope and SCA) best described debris flow initiation.Figure 4 evaluates the diversity of models selected by the AIC-based procedure as a function of sample size.The diversity is expressed as the number of model species (i.e.models defined by a given combination of geofactors) in 1000 samples (center), and is quantified using the Shannon and Simpson diversity measures (bottom).The number of model species declines exponentially to reach a stable minimum of 8 species at a sample size of n = 1000.Even for the largest sample size in our analysis (n = 5000), differences between the 1000 samples result in as many as 8 different model species.The diversity measures show a local minimum at n = 300 and n = 350, respectively; for these sample sizes, the number of model species is higher, but the distribution of the 1000 models across this number of species is more uneven, i.e. few species make up the lion's share of the selections, and the rest is represented only by a few cases.For larger sample sizes, model diversity slightly increases again and reaches a more or less stable value.Sample sizes much larger than 5000 (not shown) lead to a decrease of the diversity indices; when the sample size approaches the size of the population (i.e. the complete study area), the stepwise procedure of course yields only one model species, and the diversity indices attain their absolute minimum (0).The plateau of the diversity measures is also reflected in the model composition shown in Fig. 3 where all geofactors (except roughness) exhibit only slight changes with sample sizes larger than ca.1000.
We interpret the minimum of the diversity indices as a minimum of the dependence of model selection on the sample and therefore the corresponding sample size (300-350) as a data-based recommendation (for our case study) which is, in our opinion,

Conclusions References
Tables Figures

Back Close
Full better than the adoption of arbitrary recommendations with respect to absolute or relative sample sizes.As to the latter, the sample size of 300-350 (non-event) cells corresponds to a ratio of event : non-event of 1 : 3.7 to 1 : 4.3, which is approximately consistent with the 1 : 5 ratio used by van den Eeckhaut et al. (2006) and with the recommendation (1 : 2-1 : 5) of King and Zeng (2001).It is also in the range of the ratio of event to non-event cells in our study areas (about 1 : 500 in ZBT, 1 : 200 in LT), a ratio that has been used by Atkinson et al. (1998).Considering Green's rule of thumb (Green, 1991) reported in the introduction (Sect.1.1), the six candidate geofactors in our case study would require a minimum sample size of ca.100.Hjort and Marmion (2008), who investigate the predictive power of different models estimated with different sample sizes, state that a "level of robust predictions" is attained, with all statistical techniques, at a sample size of n = 200.The local minima, however, do not appear to be always present, depending on the choice of geofactors and the study area used for model selection (not shown), but there is always at least a conspicuous knickpoint in the empirical diversity diagram where an increase in sample size does not lead to a significant reduction of model diversity.The analysis of the LT data, for example, shows a plateau, not a local minimum, of model diversity, and this is only reached between n = 1000 and n = 2000, a sample size which is already becoming problematic with respect to spatial autocorrelation (see next paragraph).The LT is smaller than the ZBT, has a smaller number of debris flows, but a higher debris flow density (events per square kilometer), hence there does not appear any conspicuous relationship of the existence and location of plateaus or local minima and the aforementioned study area properties, a problem which is left open to future research.In Sect.3.3.2,we proposed the mean distance between sampled locations in relation to ranges of spatial autocorrelation as an upper constraint of sample size.Figure 4 (top) shows the expected mean distance between nearest neighbours as a function of sample size (see Sect. 3.3.2).Additionally, the horizontal dashed lines indicate the autocorrelation ranges of the geofactors mentioned above (cf.Fig. 2).As the red curve intersects the autocorrelation ranges of more and more geofactors, the sample of the Introduction

Conclusions References
Tables Figures

Back Close
Full corresponding size is more and more likely to violate the independence assumption.
The decreasing suitability of larger samples to this end is visualised across the whole Fig. 4 through darker shades of grey.The optimal sample sizes indicated by the red arrows in the bottom part of the diagram belong to a range of sample sizes that are within the autocorrelation range of one single geofactor only.In this case, it is the "large scale" range of slope (ca.800 m, slope is autocorrelated also at smaller spatial scales with a range of ca.200 m; see Fig. 2).We consider this only a minor violation of the independence assumption, so that the sample size recommended above remains optimal also with respect to the spatial autocorrelation issue raised in Sect.1.2.
From the observations reported here, we recommend to look at the behaviour of model diversity with sample size, to select a sample size with respect to a local minimum or a plateau in model diversity, and to keep in mind the upper constraint of sample size with respect to spatial autocorrelation.

Model parameters
In this section, the results of the procedure described in Sect.3.4 are evaluated.the center of channelised debris flow paths (with highly concave plan curvature), but also at the boundary of these areas, which are highly (plan) convex.Conversely, the profile curvature coefficient is strictly negative, which means that a concavity in the long profile increases the probability of debris flow initiation.The explanation for this finding is a morphological one: the typical locations of debris flow initiation (facilitated by the firehose effect, see Sect. 1) at the contact of steep rock faces and the corresponding talus cones are marked by large negative (i.e.concave) profile curvatures.The mostly negative coefficients for slope and SCA are difficult to interpret, as one would expect that the probability of debris flow initiation would increase with steeper slopes and with larger catchment areas.However, this problem appears to be only a mathematical one, as the interaction term of slope and SCA is present in the model.Therefore, the coefficient of slope (alone) models the effect of slope where SCA is zero (and vice versa); the coefficient for the interaction term is positive, indicating higher probabilities with steep slopes and large catchment areas, which is conceptionally correct.The interaction term plays an important role in the model: without it, the positive relationship of SCA with debris flow release causes the modelled susceptibility to increase even in the comparatively flat valley bottoms.Under these conditions, slope-type debris flows cannot occur; Rickenmann and Zimmermann (1993) report starting zone slopes for type 2 debris flows (that type which occurs in our study areas) between 26.5 and 38 • , with catchment sizes of up to 1 km 2 ; Takahashi (1981) gives a lower threshold for debris flow initiation of 15 • .Generally, there appears to be a trend that the minimum slope angle required for debris flow release decreases with larger catchment areas (Rickenmann and Zimmermann, 1993;Heinimann et al., 1998;Horton et al., 2008), so there is, besides the CIT index (cf.Sect.4.1), one more theoretical justification for including the interaction of slope and SCA.

Susceptibility maps
The a model result (here: the susceptibility map containing the debris flow initiation probability) depends on the spatial pattern of the geofactors that form part of the model.Figure 6 shows a section of the susceptibility map that can be seen as a consensus model (see Marmion et al., 2009) as every raster cell contains the median of 100 model predictions, the coefficients of which have been summarised in the previous chapter (Fig. 5).Susceptibility in both valleys has been predicted using the model estimated with ZBT data only.The whole map is part of the Supplement of this paper.On the map, debris cones are highlighted by yellowish to reddish colours indicating medium to high probability of debris flow release.The distal parts of the cones are characterised by lower (if any) susceptibility, while their apices, and channel-like portions of the upslope area show the highest values.Most of the valley floor and most steep parts of the rockwalls have very low to zero susceptibility.This can be seen in detail in the upper row of Fig. 7; virtually all mapped debris flows (including not only the depositional lobes, but the whole process area) have high to very high susceptibility values in their upper part, and it can be stated that the spatial pattern of debris flow occurrence appears to be reproduced well by the model.This visual validation also reveals problems.The zones of highest susceptibility, indicated by violet colours, extend very far upslope along very steep channel-like features within the rockwalls.Many of these locations appear to be too steep for debris to accumulate (one of the preconditions for debris flow generation); for this problem, we offer two explanations: first, an analysis of slope values within the mapped starting zones (see Sect. 3.1.1)reveals that ca.75 % of slope values within the initiation areas are within a physically meaningful range (below ca.40 • ), while the remaining values clearly speak against the accumulation of debris in these location.This can be attributed in part to mapping errors (Ardizzone et al., 2002), where the upper portion of a debris flow area is spuriously extended into very steep bedrock channels that are in part poorly identifiable on aerial imagery.Another source of this error, probably to a lower degree, is a mismatch in the exact location of the rockwall-talus contact between the DEM (which is decisive for the model) and the aerial photo.Second, a linear modelling approach is not capable of modelling complex non-linear relationships Introduction

Conclusions References
Tables Figures

Back Close
Full such as the one of slope and debris flow release: conceptionally, susceptibility should increase, starting from some minimum slope, up to a maximum and then decrease again, until the susceptibility reaches zero at slope gradients that are prohibitive for the formation and persistence of sediment storage needed for debris flow generation.The GLM approach, however, only handles monotonic relationships between independent and dependent variables, e.g. an increase of susceptibility with slope.Problems of this kind could be solved by using other approaches, for example the weights-of-evidence, certainty factor, or generalised additive models (GAM, see e.g.Hjort and Luoto, 2011).
A novel output of our model replication exercise is the quantification of the variation in model results and the assessment of its spatial distribution.The model uncertainty addressed here is due to the sampling and model selection procedure only.For each raster cell of the susceptibility map, we computed not only the median, but also the interquantile range (IQR90) between the p 0.95 and p 0.05 quantiles; the corresponding map can be seen in the Supplement and on Fig. 7, bottom row.In the whole study area, the IQR90 has a highly positively skewed distribution that ranges from 0.0 to 0.98; it has a mean of 0.081, i.e. debris flow release probability predicted by the 100 models varies by 8 percentage points, on average.In the ZBT area (that was used to estimate the models) this value equals 0.073, while in the LT area it is slightly higher (0.103).For samples taken according to the "1 : 1 event to non-event" rule (n = 81 non-event cells), the average IQR90 is 0.190 (ZBT), 0.230 (LT) and 0.200 (total study area), respectively.The expected variability is consistently higher for smaller samples, and with the application of a model to another area.Generally, the lowest uncertainty is found for both the lowest and the highest susceptibility values.On the uncertainty maps, the largest standard deviations occupy spatially coherent areas along the zones of high susceptibility, but additionally in considerable portions of the valley bottom where the slope gradient is low.In some places, the spatial pattern of uncertainty is consistent with the fact that profile curvature is included in only about 60 % of the models; here, zones of high curvature (both concave and convex) are characterised by high IQR90 values.Such zones of high uncertainty may generally occur where a high (or low) pre-Introduction

Conclusions References
Tables Figures

Back Close
Full dicted susceptibility relies on one parameter only, that is not part of all models.In our opinion, the map adds information to the susceptibility map that can be useful for its interpretation.

Validation
The variability of model parameters and predictions is also reflected in the validation.A first, qualitative, validation is done by visually inspecting the susceptibility map (here: the median of 100 models, Figs. 6 and 7).Each model is quantitatively validated by means of a receiver-operating-curve (ROC, see Sect.3.2.3)using data from the Larstigtal (LT) only; hence, the data used to estimate the model parameters (from the ZBT area) and the validation data are completely independent, and the corresponding diagram represents a "prediction curve" (Chung and Fabbri, 2003).Split-sample validation approaches such as cross-validation, spatial and temporal partitions (Chung and Fabbri, 2003) do not warrant such independence when, for example, subsets of the same inventory are used to estimate model parameters and to validate the resulting model in one study area.
Figure 8 top shows the prediction curves for the 100 models, and the distribution of the corresponding area under the curve (AUC).The 100 curves are located quite close to each other, and there are no conspicuous extreme outliers.The AUC reaches 0.83, on average; the predictive ability of a model calculated in the LT area and applied to the ZBT (not shown) is even higher with AUC = 0.9.In total, the observed AUCs are within the range of many published studies (e.g.0.69-0.8;Ruette et al., 2011, 0.84, Ayalew andYamagishi, 2005, 0.89-0.93;van den Eeckhaut et al., 2010) and can be regarded as satisfying.Interestingly, the sample size did not influence the predictive ability of the model ensemble -both n = 81 and n = 350 have very similar mean AUC values.However, the smaller sample size leads to a much larger spread of the different prediction curves and consequently also of the AUC values.In our case, a single sample of events and non-events at a ratio of 1 : 1 (see, for example, Brenning, 2005;Meusburger and Alewell, 2009)  a comparatively poor model (AUC 0.75), although the expected AUC is approximately the same.We deduce from our results a recommendation to create susceptibility maps from model ensembles, because they are supposed to yield a more reliable result on the one hand and an estimation of (sample-induced) uncertainty on the other.Similarly, Marmion et al. (2009) propose "consensus models"; in their study, results from different predictive modelling approaches are combined using several methods, among them the median that was used in our study to combine the results of 100 models generated with the same method, but from independent random samples.

Conclusions
In this paper, we investigated the effect of sample size on a logistic regression model with a parameter selection procedure that is based on an information criterion (AIC).
Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | and van den Eeckhaut et al. (2006), for example, use the center of each landslide source area.Similarly, Vanwalleghem et al. (2008) use the center of each topographic depression, and the center of each gully in their study predicting the spatial distribution of closed depressions and gullies under forest.Different authors sample source areas on different grounds; besides spatial autocorrelation, Atkinson et al. (1998) explain their approach with the aim of preventing model bias towards larger landslides (in a full sample of events, more data would enter the model from larger source areas than from smaller ones).Begueria and Lorente (2003) use one raster cell for each debris flow initiation zone because the raster size (10 m) of the data in their study corresponds to the size of a typical debris flow scar.All approaches have in common that they prevent a con-Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Atkinson et al. ( Discussion Paper | Discussion Paper | Discussion Paper | Screen / Esc Printer-friendly Version Interactive Discussion Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Screen / Esc Printer-friendly Version Interactive Discussion Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Screen / Esc Printer-friendly Version Interactive Discussion Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | H and D combine the number of model species and their relative frequency in one single number: both a large number of different models and an even distribution of the latter (i.e. each model species has more or less the same share of the 1000 individual models) cause H and D to increase, while it decreases with Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | shows the distribution of the estimated coefficients for each of the geofactors.Additionally, the percentage below the parameter name gives the proportion of models that contained the respective geofactor after stepwise selection.The coefficients were estimated using 100 independent random samples of n = 81 + 350 (event + non-event sample) in the ZBT area.The geofactors slope, SCA, and their interaction are part of every model, followed in decreasing order by profile curvature, plan curvature, and roughness class.The spread of the coefficients is low for most of the geofactors, with the exception of the two curvature parameters.The coefficient for plan curvature has the largest range, and it takes positive and negative values, which makes the interpretation very difficult; this is probably caused by the fact that the random sampling of event cells from the upper erosional zones in the debris flow inventory will select locations in Discussion Paper | Discussion Paper | Discussion Paper | previous analyses have shown the dependence of models found through AICbased model selection on the respective sample and its size.The spatial pattern of Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | could have resulted in a good (AUC 0.84), but also in Discussion Paper | Discussion Paper | Discussion Paper |

Fig. 2 .Fig. 5 .
Fig. 2. Empirical variograms of geofactors used in this study.Note that slope and the ratio of bedrock catchment area and catchment area are autocorrelated at different spatial scales.