A scoring test on probabilistic seismic hazard estimates in Italy

Probabilistic estimates of seismic hazard represent a basic element for seismic risk reduction strategies and they are a key element of seismic regulation. Thus, it is important to select the most effective estimates among the available ones. An empirical scoring strategy is described here and is applied to a number of time-independent hazard estimates available in Italy both at national and regional scale. The scoring test is based on the comparison of outcomes provided by available computational models at a number of accelerometric sites where observations are available for 25 years. This comparison also allows identifying computational models that, providing outcomes that are in contrast with observations, should thus be discarded. The analysis shows that most of the hazard estimates proposed for Italy are not in contrast with observations and some computational models perform significantly better than others do. Furthermore, one can see that, at least locally, older estimates can perform better than the most recent ones. Finally, since the same computational model can perform differently depending on the region considered and on average return time of concern, no single model can be considered as the best-performing one. This implies that, moving along the hazard curve, the most suitable model should be selected by considering the specific problem of concern.


Introduction
Seismic hazard assessment is a basic tool for risk estimates necessary to develop effective preventive strategies against seismic damage.Being in essence a forecasting of future ground shaking, uncertainty is a basic element of seismic hazard and it requires specific formalizations based on a probabilistic assumptions (probabilistic seismic hazard assessment -PSHA) to manage available information by providing likelihood estimates for each possible ground-shaking level (hazard curve).Information considered for this purpose includes deterministic (e.g.geometry of seismogenic structures or seismic waves propagation patterns) and statistical (e.g.average seismicity rates) elements.The latter ones aim at managing the lack of information about important ingredients of seismic hazard (e.g.seismogenic activity of the faults).Actually, many PSHA procedures exist, that are mainly differentiated for the relative roles played by deterministic and statistical elements.Procedures span from purely deterministic approaches assuming a nearly complete knowledge of the seismic process (e.g.Peresan et al., 2011) to purely statistical analyses assuming a nearly complete ignorance of the underlying physical processes (e.g.Kagan and Jackson, 1994;Frankel, 1995;Albarello and Mucciarelli, 2002), including balanced combinations of deterministic and statistical elements to manage the aleatory variability (e.g.Cornell, 1968;McGuire, 1978).Outcomes of alternative approaches, as well as PSHA outputs using similar methods, may present strong differences and this makes mandatory an evaluation of the respective heuristic value and effectiveness.Arrogating ageless Shakespeare's words "Shall I compare thee to a summer's day", comparison of subjects with different nature is always difficult.Actually, the effectiveness of any considered procedure (which includes both computational aspects and data used to feed the model) is uncertain and this is managed by associating with each procedure a degree of "belief" (again in the form of a probability).As hazard estimates are the combination of deterministic elements and relative variabilities, both aleatory and epistemic uncer-

D. Albarello et al.: A scoring test on probabilistic seismic hazard estimates in Italy
tainties have to be considered and they contribute to the estimate of the hazard curve.
While each PSHA procedure is applied to manage the aleatory uncertainties via probabilistic modelling, assessment and management of epistemic uncertainty are more controversial topics.Given any ith PSHA model H i , epistemic uncertainty can be defined as the probability P (H i ) expressing the degree of belief in the effectiveness of that model.This formalization allows the management of epistemic uncertainty within a coherent frame (Albarello and D'Amico, 2015).A key aspect is the way to assess P (H i ), i.e. by scoring H i .
Two general approaches exist for this purpose.The first one relies on "ex ante" expert evaluations of the actual reliability (in terms of internal robustness or coherency with current knowledge about the underlying physical process) of the elements constituting the relevant PSHA procedure H i (e.g. the geometry of considered seismic sources, the ground-motion attenuation relationship, etc.): these evaluations are usually combined in the frame of a "logic tree" (e.g.SSHAC, 1997;Kammerer and Ake, 2012).The second kind of approach is "ex post" and considers a comparison of procedure outcomes ("forecasts") with observations.Some examples of empirical testing procedures have been provided by Mucciarelli et al. (2000), Albarello andD'Amico (2005, 2008), Beauval (2011) and recently by Tasan et al. (2014).Ex ante and ex post approaches can be seen as complementary in the frame of a Bayesian view aiming at combining different PSHA models to obtain a "comprehensive" one, taking advantage of different competing models (Albarello and D'Amico, 2015).
Only the ex post approach will be considered here to score on an empirical basis a number of PSHA models available for the Italian area using a simple procedure described in the next section.Then, the data set of observations used for scoring the PSHA models will be described.The scoring test has been performed in the frame of the research agreement between the National Civil Defence Department (DPC) and the National Institute of Geophysics and Volcanology (INGV), namely the S2-2012 Project (https://sites.google.com/site/ingvdpc2012progettos2/home),after which observed maximum peak ground acceleration (PGA) values at a subset of available accelerometric stations were provided for a long observation time window (Pacor et al., 2013), and a repository of released PSHA results has been compiled too (Faccioli and Vanini, 2013).The data used in this study are given as Supplement to motivate alternative analyses and/or methodological comparisons.

Empirical scoring and testing
The bulk of any empirical scoring procedure is the evaluation of a probability L = P (E|H i ) that expresses the degree of belief (likelihood L) that a set of observed ground mo-tion occurrences E ("evidence") will happen in the case that the PSHA computational model H i provides a correct hazard estimate (Albarello and D'Amico, 2008).Given the model H i and the set of sites E * t where ground shaking has been monitored during the control interval t * of duration equal to the hazard exposure time t, the model's likelihood L i can be estimated from the control sample E * t .If the seismic occurrences e s are mutually independent (usual assumption in PSHA computational models) and if, over the duration of the control interval, a total of N * out of S sites have experienced ground shaking above their given thresholds g 0S , then we have where each value P (e S |H i ) is the hazard estimated (i.e. the exceedance probability for g 0S ) by the ith model at the sth site for the exposure time t = t * .If time stationarity is assumed in the PSHA model, only the overall duration of the exposure time is of concern: this fact is not true when time-dependent PSHA models are considered.Of course, one should take into account that several possible combinations of sites and events may exist that result in the same configuration of the available evidence: all sites characterized in H i by the same exceedance probability are equivalent.It is worth noting, however, that the likelihood value in Eq. ( 1) also depends on the number of sites considered and on the P values of concern: this implies that comparison among different models by using respective likelihoods should be performed by considering the same values for S and P .If this is not the case, some kind of "rescaling" is necessary.The rescaling could be performed by considering instead of Eq. (1) the "support" function l, which is the log-likelihood ratio as defined by Edwards (1972) in the form where r is a reference log-likelihood value computed as in Appendix A as a function of P (e S |H i ) and S.
It can be seen (Kagan and Jackson, 1994) that the probability distribution of the support l function is nearly normal.This formulation allows the using of the reference value in Appendix A and the relevant standard deviation to compute a Studentized form of l as where the denominator is provided in Eq. (A5).In general, values of Z i near to 0 indicate best-performing models while Z i > 2 indicates models providing outcomes significantly different from observations.In this case, the model should be considered "unreliable".In this frame, the value Z i can be considered as the "score" of the ith model: the smaller Z, the better the computational model is.
Other possibilities exist for testing PSHA procedures against the evidence E (e.g.Schorlemmer and Gerstenberger, 2007;Schorlemmer et al., 2007).Counting is one of these procedures.In this case, a binary variable e s (g 0 ) is defined which assumes the value of 1 if, during the control interval t * (which has the same extension as the hazard exposure time t), at least one earthquake occurred producing a ground motion in excess of g 0 at the sth site; otherwise e s (g 0 ) = 0. We define the "control sample" E t * as the set of S realizations of the variable e s (g 0 ) at S sites.The ith considered PSHA computational model H i provides a probability P si for the event e s (g 0 ) =1 given by where the dependence on g 0 and t is omitted to simplify the notation.Expectation µ si and standard deviation σ si relative to the Bernoulli variable e s are expressed as and The number N * of sites out of the S sites considered for testing that experienced at least one earthquake during t * with ground shaking greater than g 0 is In terms of probabilistic prediction (forecasts) provided by the H i PSHA computational model, N * is a random variate with the expectation In the hypothesis that e s are the realizations of the stochastic process modellized in the PSHA computations, one can assume that where e s and e z are the realizations of the Bernoulli variable defined above at two generic sth and zth sites.In this case, the standard deviation of the random variable N * is When S is relatively large, the Lyapunov variant of the central limit theorem (e.g.Gnedenko, 1976) Equation ( 11) allows us to evaluate whether a potential disagreement between the experimental value N * and the "forecast" µ i (N * ) is statistically significant, thus making the H i PSHA computational model "not confirmed" by the set of S observations.

Evidence: long lasting accelerometric recordings
The selection of observed data to be used for scoring PSHA models is a key element of the present analysis.The Italian accelerometric database ITACA (Luzi et al., 2008;Pacor et al., 2011), reporting records of accelerometric stations operating in Italy since 1974, has been considered for our purpose.A number of accelerometric sites continuously operating for some decades were selected by considering available station books (courtesy of F. Pacor and R. Puglia from INGV-Milan and A. Gorini, Dept. of National Civil Defence, Fig. 1).Finally, we selected 71 stations operating during the time span 1979-2004.These time boundaries were chosen in order to maximize the number of stations contemporaneously active.These stations are unevenly distributed all over the Italian area (Fig. 2) and are located both on rock or different kind of soils classified according to the National Seismic Code NTC08 (NTC, 2008).In particular: 25 stations are located on soil type A (Vs 30 > 800 m s −1 , where Vs 30 is the average shear-wave velocity in the uppermost 30 m of underground), 30 on soil type B (Vs 30 in the range 360-800 m s −1 ), 13 on soil type C (Vs 30 in the range 180-360 m s −1 ), 1 on soil type D (Vs 30 < 180 m s −1 ) and 2 on soil type E (i.e.soils type C and D but with seismic bedrock at a depth in the range 3-20 m from the surface).Most of the stations (52) lay on a flat outcrop (topography Type T1 according to the NTC08 code), 14 on smooth morphology (type T2, i.e. on a surface dipping in the range 15-30 • ) and 5 on rough topography (type T3, i.e. surface dipping more than 30 • ).Note that the site classification based on Vs 30 mimics the one adopted by the Eurocode8 (EN-1998(EN- , 2004)).One can see that most of the stations lay on stratigraphic/geomorphological configurations different from the "reference" site condition (i.e.flat outcrop of a rigid bedrock with Vs 30 > 800 m s −1 ) generally considered for seismic hazard estimates.This fact implies that, in order to compare hazard outcomes with observations, some correction terms should be considered to "reduce" observed accelerometric data to "reference" values.In the present study, such a correction term has been assumed equal to the amplification factor stated by the NTC08 regulation code, which includes both stratigraphic and topographic effects: it assumes values depending on the soil type and topographic class at the site, but also on the hazard estimated on the reference outcrop.Figure 1.ON-OFF status for accelerometric stations declared to be continuously operating for at least 30 years, as reported on ITACA V1.1 database (Pacor et al., 2011(Pacor et al., , 2013)).Blue circles show potential triggering conditions, computed as (mean PGA + 1 sd) > 0.01 g, using CPTI11 earthquake catalogue (Rovida et al., 2011) and ITA10 GMPE (Bindi et al., 2011); black crosses are the effective recordings available.Data acquired on continuous-mode recording in 2009-2011 have not been considered for the existing time gap between the analogue and new digital equipment.
tion coefficients computed at the 71 accelerometric stations and considered for testing are mapped in Fig. 2; details are given in Sect.3.2.3 of NTC08.These coefficients represent a first approximation to site-specific hazard, coherent with the common practice for buildings that do not require specific studies; they have been used to correct maximum PGA values observed on horizontal components in the time interval 1979-2004.Regarding available recordings, 12 out of 71 stations have no records at all for this 25 year long period.In these cases, we assumed the sensitivity threshold of the early deployed accelerometer (0.01 g, i.e. 9.8 cm s −2 ) as the maximum "observed" value.We checked possible problems with data completeness, which nevertheless we acknowledge are difficult to be properly fixed.For this purpose, PGA values expected at all the sites due to the occurrence of nearby earthquakes have been computed (synthetic "observations"), on the basis of epicentral information (CPTI11 earthquake catalogue, Rovida et al., 2011) and the ground motion prediction equation (GMPE) ITA10 of Bindi et al. (2011).We acknowledge that synthetic observations are not complete, as the catalogue itself is declustered and filtered on damage threshold.In particular, synthetic PGA values have been considered as potential observations at the relevant site if they exceed the sensitivity trigger threshold.Synthetic PGA values obtained at ALT (Auletta, Salerno) and PTL (Pietralunga, Perugia) stations are plotted in Fig. 3, and compared with recorded data: note that even if some data are possibly missing (blue circles in Fig. 1 correspond to values above the sensitivity threshold of the accelerometric network), on average, the maximum observed values in the time window considered for the analysis is coherent with the expectations.We estimated that missing maximum PGA should have occurred on about 5 % of stations.In order to evaluate the possible role of this incompleteness on our results, a sensitivity test has been performed by generating, via Monte Carlo procedure, a large number of artificial data sets.Each set has some "incomplete" stations, randomly selected with a fixed probability (0.05, 0.1, etc.); if a station is considered to be affected by incompleteness, the maximum acceleration actually observed is substituted by the sensitivity threshold (9.8 cm s −2 ).The scores of artificial data sets with respect to a forecast are thus obtained.The average scores and the respective standard deviations, associated with each incompleteness probability, show that final results are not significantly affected by incompleteness percentages lower than 20 %.
On the subset of selected stations, the observed maximum PGA values span from about 1 cm s −2 for M < 4.5 earthquakes at about 40-60 km distance (e.g. at ARI Ariano Irpino, Avellino) to the 490 cm s −2 at NCR (Nocera Umbra, Perugia) for the main shock of the long-lasting Colfiorito Umbria-Marche sequence (26 September 1997, M w = 6.0 at 11 km distance).Station codes, coordinates, site conditions and the maximum registered PGA values are given in Supplement A.

Models: PSHA in Italy
Italy has three maps, or groups of maps, of PSHA which have been turned into regulation acts, therefore having an impact on society: as shown in Fig. 4, these maps were released in 1979, 1996-1999 and 2004, and they were adopted by laws, with some delays, following their release, always after deadly earthquakes.
The 1979 map (Gruppo di Lavoro Scuotibilità, 1979) is expressed in terms of macroseismic intensity and belongs to the so-called generation of "historical probabilism" (Muir Wood, 1993).In essence, key elements for its formulation were: an earthquake catalogue, an empirical relationship for attenuating intensity (without uncertainties) and Gumbel type I statistics of shaking at the sites (Gumbel, 1958).The map was transformed into seismic classes (categories) with given prescription rules after the 1980 Irpinia M w 6.9 earthquake (about 3000 casualties), and a series of laws from 1981 to 1984 regulated the municipalities (Petrini et al., 1980;Slejko, 1993).
The other maps belong conceptually to the generation of the "seismotectonic probabilism" (Muir Wood, 1993).This second group of maps was released in 1996 (Slejko et al., 1998), and refined in 1998-1999(Albarello et al., 2000): they are maps in terms of macroseismic intensity and PGA (for additional details refer to Table 1, Project frame GNDT).The refinements, mostly due to changes in seismicity rate interpolation and GMPEs, came after the long and highly damaging Umbria-Marche sequence (known as Colfiorito sequence, 11 casualties) in 1997-1998.These maps were the basis of the revision of seismic law approved in 2003, after the collapse of a school in San Giuliano of Puglia (2002 Molise earthquake) that killed 27 pupils and their teacher.The same law (Ord.3274/03) stated the rules for preparing a new reference national hazard map.
The third and ultimate map was released in 2004 (Gruppo di Lavoro MPS, 2004) and it was provided by supplementary elaborations (maps for PGA and spectral accelerations for several return periods) in the following years (see MPS04 andS1 2004-2006 in Table 1, Montaldo et al., 2007;Stucchi et al., 2011); it became the official reference document for seismic re-classification in 2006 (Ord.3519/06), and later it Latest PSHA for Europe, first regional project in GEM initiative (http://www.globalquakemodel.org/).New European historical and instrumental catalogues, full logic-tree of GMPE set on tectonic regionalization, combination of area sources, distributed seismicity and larger events concentrated on faults, with new maximum magnitude scheme for the whole region.Results progressively released via the SHARE portal.* For the explanation of acronyms see the references listed in Table 2.
was fully embedded together with the supplementary elaborations in the building code NTC08 (NTC, 2008).After some years of partial application, compulsory rules for its adoption have been stated after the 2009 L'Aquila M w 6.3 earthquake (about 300 deaths).
In the frame of the S2-2012 annual project funded under the decennial agreement of DPC and INGV, a research team (Politecnico of Milan, Faccioli and Vanini, 2013) selected and collected, after the compilation of an online form, some PSHA results for Italy.Data are stored in electronic files or worksheets; a summary list and a short report are freely avail- able at https://sites.google.com/site/ingvdpc2012progettos2/deliverables/d1-1 (last access: 31 August 2014).
The PSHA outcomes are mostly provided in terms of PGA values and comprise the only shaking parameter considered for scoring so far.Models are released by referring to one or a limited number of return periods (i.e.thresholds of exceedance probability in given exposure time).Figure 5 shows the comparison of expected PGA values for the models hav-  ing approximately the same return period (i.e.475 years, or 10 % probability of exceedance in 50 years) at two localities, in northern (Modena) and southern Italy (Potenza); remarkably, the Po Plain and southern Apennines have been set as priority regions by the first year of DPC-INGV research agreement.As time-dependent models (blue labels in Fig. 5) refer to origin time set up in 2010, they cannot be used in our retrospective testing.In Tables 1 and 2 the list of selected models and their references are given; a synoptic graphical representation of results referred to the whole of Italy is given in Fig. 6 The SHARE model (Giardini et al., 2013, represented in Fig. 6 by ID9 frame) has not been stored in the repository of S2-2012 project.SHARE results have been progressively released since 2013, and are available at the SHARE Portal http://www.efehr.org:8080/jetspeed/portal/hazard.psml.
All the PGA values used for the scoring test are given in an Excel file (Supplement A).The values refer to the computation node nearest to the selected accelerometric sites previously described.These data are provided for motivating additional testing by the scientific community.

Results
In order to compare observations and predictions provided by each PSHA model, the time span covered by both should be the same.In general, PSHA outcomes have the form of a PGA value g 0 characterized by a fixed exceedance probability in a time span of duration t (the exposure time) at the sth site.Actually, as all of the considered PSHA models are based on the assumption that the seismogenic process is Poissonian, the following relation holds: where λ si (g 0 ) is the annual rate of exceedance for the threshold g 0 and P is the exceedance probability at the sth site, for the relevant exposure time t and the acceleration threshold g 0 if the ith model is considered.
In the present case, t lasts 25 years (i.e. the time span contemporary covered by 71 accelerometric observations, see above).However, most of the PSHA models provide exceedance probabilities for a different exposure time t (in general 30 or 50 years), i.e.P e s |H i , t .Thus, in order to apply Eq. 1-Eq.11, some conversion tool is necessary to compare hazard estimates and observations.This conversion takes advantage of the stationarity Poissonian character of seismic occurrences assumed by the PSHA models.In this case, in fact, one has that: The above formula can be used to compute the exceedance probability relative to the acceleration threshold g 0 for a   Malagnini et al., 2000Malagnini et al., , 2002;;Morasca et al., 2006;De Natale et al., 1988;Patanè et al., 1994Patanè et al., , 1997)).given exposure time ( t) when the exceedance probability is supplied for another exposure time ( t ).The value P si then is considered for testing.Since some models also provide g 0 values corresponding to different exceedance probabilities, they were scored by considering each realization as an independent "forecast".In general, since in the same model lower exceedance probabilities correspond to longer return times and to higher g 0 values, different scoring can be attributed to different parts of the hazard curve.
Thus, for each PSHA model, a set of P si values is computed for the sites considered for testing, for t = 25 years.Consequently, the binary variable e S (g 0 ) is computed: equal to 1 when at the sth site the value g 0 was exceeded in the time interval 1979-2004, and to 0 otherwise.
On this basis, the score Z (Eq. 3) was computed for each PSHA model.This value is considered as the empirical score of the model: the lower Z the more effective are the results of the relevant model.The overall number of exceedances (Eq.7) was also compared with the values expected in the relevant PSHA model (Eq.8): if this difference exceeds two times the relevant standard deviation (Eq.10), the PSHA model is considered to be not compatible with observations (Eq.11).3); the lower the best.Model IDs are given in Table 1; Tyear indicates the mean return time the elaboration refers to.

Scoring models at the national scale
Except for ID7 and ID8, all models have nationwide coverage, thus allowing the scoring on the full set of 71 selected accelerometric stations.Some models have been given for different return periods; they give a final set of 12 realizations from 7 models.Comparison of expected versus observed occurrences is shown in Fig. 7; models are sorted according to the relevant return period.Despite the fact that some models tend to slightly underestimate the observed number of exceedances, in all the cases these discrepancies are not significant according to Eq. ( 11).This, however, does not mean that all the models equally fit to the observations.In fact, when the score factor Z is considered (Fig. 7b), one can see that significant differences exist in the performances of the considered models at the different return times.
The best-performing model is the 1996 GNDT model at intermediate return time (ID1, RT = 284 years) followed by the MPS04-like area-based source model using Cauzzi and Faccioli (2008) GMPE (ID5) for a 984 year return period; notably, models obtained under different theoretical assumptions or computational choices behave nearly the same: as an example one can see the results provided by the ID6 model (smoothed seismicity approach by Akinci, 2010), the ID5 one (the one provided by Meletti et al., 2009 with the standard Cornell-McGuire approach, by considering the same single ground motion prediction equation used in ID5), and ID9 (produced in the frame of the SHARE project).On the other hand, the same model performs in different ways at different return times: e.g.see the ID1 best performing at a return time of 284 years and providing a worse performance at a shorter return time of 94 years (Fig. 7b).As models that explore different parts of the hazard curve have controversial scoring (i.e.different scores for different return times), it is not easy to identify a single "best" performing model.

Scoring models at the regional scale
The same test has been performed at the regional scale, for the two selected priority regions of the southern Apennines and the Po Plain, for which ad hoc regional PSH estimates have been released during the S2-2012 project (Fig. 8).Thus, the same subset of accelerometric stations on national and regional PSHA models have been manually selected and controlled to exclude sites not considered for the relevant hazard estimate.
In the southern Apennines, all the models provide results that are compatible with observations that refer to 21 sites (Fig. 9a).When the score factor Z is considered (Fig. 9b) the best-performing models at about 457 years are the one derived from smoothed seismicity model (ID6) and the MPS04 model (ID3).Similarly, the best-performing model at the shortest times is the one provided by Akinci (2013) (ID7).Thus, some PSHA evaluations seem to be more adequate to represent the observed shaking on that southern region.Note that the scoring positions of the long-term predictions of the The same analysis performed for the Po Plain area uses only 12 stations; again, it indicates that all except the ID8 computational model provide results that are compatible with observations (Fig. 9c).In this case, the scoring indicates several best performing models (i.e.MPS04, ID4, at RT = 2475 years; smoothed seismicity and MPS04like models ID 6 and ID 5, at RT = 984 years; SHARE, MPS04-like and GNDT 1996 results, ID9, ID6 and ID1, at RT = 475 years).Note also that the underestimation of results released in the frame of the S2-2012 project (ID8) gives Z values higher than 3.
In order to better visualize the impact of scoring in hazard estimates, the PGA values provided by the models at different return times are labelled with their relevant Z values (Fig. 10), for the cities of Potenza (southern Apennines) and Modena (Po Plain).We believe this kind of analysis should help in defining a comprehensive PSHA, no longer based on logic tree procedures, or expert elicitation, but on the strength of observation data.

Conclusions
Nowadays, the scientific community is looking for a coherent, formal and robust procedure for testing probabilistic seismic hazard estimates.Like it is for weather forecasts, the availability of observational data of the last years is not comparable with the previous decades and probably will faster Red dots and b/w diamonds represent the national and regional models, black and labels respectively the absolute Z values (Eq. 3) on the whole set and regional subsets of stations.changes of approaches of seismic hazard than ever before.An extensive empirical test of seismic hazard estimates in Italy has been carried out by evaluating quantitatively their performances.In particular, an empirical scoring procedure has been applied to a number of PSHA computational models in the frame of the DPC-INGV S2-2012 seismological research project; many of the considered models provided outcomes that were included in the Italian Seismic Regulation Code and this fact strengthens the importance of evaluating their reliability.Twelve realizations from seven timeindependent PSHA models available at the national scale plus six maps from two models at the regional scale have been collected; a set of accelerometric stations continuously operating in the time interval 1979-2004 has been analysed, using the maximum observed PGA at each station as testing parameter.Site-specific corrections were applied for PGA values at accelerometric stations where possible amplification effects are expected due to the local soil conditions.These correction coefficients are the ones set up in the Italian seismic code (NTC08).The scoring results obtained suggest some preliminary conclusions as follows: 1. Nearly all the considered models provide outcomes that are compatible with available observations; 2. The most recent models are not necessarily the bestperforming ones; 3. None of the models analysed can be considered as the best performing at all the considered return times; 4. Testing done on sub-regions reveals different features with respect to the national scale, but the reasons should be investigated with other cases.
One may wonder that in the list of observed PGA maxima, aftershocks could appear instead of the mainshocks.This seems in contrast with a basic assumption of the PSHA models that only include independent events (mainshocks) in computations.The implicit underlying assumption is that the mainshock is the earthquake providing the largest acceleration for a sequence, at each site.However, in many situations, this is not the case; an astonishing case is that of station DMN, in the northwestern Alps, with more than 200 cm s −2 for a M = 3 earthquake.As the basic outcome of any PSHA is the maximum ground shaking one can reasonably (i.e. at any fixed probability) expect, irrespective to the causative earthquake, we believe our test follows cautious criteria.If strong aftershocks are responsible of PGA values larger than the one resulting from the main event, and if this fact occurs many time, the "maximum acceleration" forecasted by the considered PSHA may underestimate the actual hazard, so therefore resulting simply wrong.This study has focused on some open questions which remain to be addressed in future: 1. Site-specific PSHA or calibrated amplification functions at the accelerometric stations are necessary to avoid the over-simplification here adopted; they may play a key role in scoring results: specific activities have been planned on these subject in the prosecution of S2 Project started in 2014 (see Task 2 and 4, at https://sites.google.com/site/ingvdpc2014progettos2/); 2. Completeness of accelerometric records relative to accelerometric sites is a critical aspect for validation; we overcome the problems by considering the maximum PGA in a quite long time period, but further analyses are needed to fully exploit the observations provided by the actual Italian databases.

Figure 2 .
Figure 2. Location of the accelerometric stations considered for empirical testing.Dotted pins refer to stations on soil type A or A * (* for hypothesized conditions) in Eurocode 8 (EN-1998, 2004) classification, pin colours represent simplified amplification factors for PGA (NTC, 2008) used to accomplish stratigraphic and topographic site response: green = 1, yellow = 1.2-1.35,violet ≥ 1.35.

Figure 3 .
Figure 3. Observed and synthetic PGA values at two stations in the time span 1975-2004.The observed data are plotted by b/w symbols (values as reported in ITACA 1.1.database and in Pacor et al., 2013); the computed PGA values represented respectively by green circle (mean), red and blue triangles (±1 standard deviation), have been obtained by ITA10 GMPE applied to the CPTI11 earthquake catalogue; synthetic PGA are plotted only if the red triangle (mean PGA + 1s) is greater than 0.01 g, common triggering threshold of that time, as chosen in Fig. 1.(a) Auletta station ALT (Salerno, in southern Apennines); (b) Pietralunga station PTL (Perugia, Central Italy).

Figure 4 .
Figure 4. Timeline of PSHA maps in Italy relevant for regulation; orange symbols represent deadly earthquakes that occurred in the last 40 years.

Figure 5 .
Figure 5.Comparison at two sites of expected PGA values (with 10 % probability of exceedance in 50 years) from collected PSHA models (redrawn fromFaccioli and Vanini, 2013).Modena is located in the Po Plain, at about 20-30 km distance from the main earthquakes of the 2012 Emilia sequence; Potenza is at about 90 km distance from the recursive sequences that affected the border of Calabria and Basilicata Regions, in southern Apennines, since 2011.Time-dependent models listed in this graph (labels in blue) have not been used in this analysis.

Figure 6 .
Figure 6.Synoptic view of PSHA maps collected by S2-2012 Project at the national scale.Model ID refers to Table1; the vertical axis shows approximately the return period the elaborations refer to; on the x-axis, a rough timeline of results release (from 1996 to 2013) with main earthquakes occurrences (orange symbols); full size maps and other graphic details are given in Supplement B.

Figure 7 .
Figure 7. Results of scoring for national-based PSHA models.(a) Observed versus computed number of stations exceeding the predicted PGA values g 0 .Results are sorted according to the return period and subordinately on model IDs.(b) Final scores: the y-axis represents the absolute value of Z score as given in Eq. (3); the lower the best.Model IDs are given in Table1; Tyear indicates the mean return time the elaboration refers to.

Figure 8 .
Figure 8. Synoptic view of regional PSHA maps collected by S2-2012 project.Model IDs refer to Table 1; the vertical axis shows approximately the return period the elaborations refer to.The colour scale is automatically adjusted on values; full size maps and other graphic details are given in Supplement B.
PSHA and scores at two selected sites.(a) Modena, in the Po Plain; (b) Potenza in the southern Apennines.

Table 1 .
List of selected PSHA models.Exceedance prob. of 10 % in 50 years, percentile 16, 50 and 84.Data points collected by S2-2012 Project refer only to the priority areas of Po Plain and southern Apennines: the data sampled on a 0.05 • grid on the whole country are available at http://zonesismiche.mi.ingv.it/Same approach and input data of previous model 3, MPS04, additional probabilities of exceedance in 50 years have been computed during the project S1 (2004-2006): 2 and 39 % are selected in this analysis.PGA values on a 0.05 • step grid for all Italy.

Table 2 .
List of references for the selected PSHA models.