Fault network reconstruction using agglomerative clustering: applications to southern Californian seismicity

Abstract. In this paper we introduce a method for fault network reconstruction based
on the 3D spatial distribution of seismicity. One of the major drawbacks of
statistical earthquake models is their inability to account for the highly
anisotropic distribution of seismicity. Fault reconstruction has been
proposed as a pattern recognition method aiming to extract this structural
information from seismicity catalogs. Current methods start from simple
large-scale models and gradually increase the complexity trying to explain
the small-scale features. In contrast the method introduced here uses a
bottom-up approach that relies on initial sampling of the small-scale
features and reduction of this complexity by optimal local merging of
substructures. First, we describe the implementation of the method through illustrative
synthetic examples. We then apply the method to the probabilistic absolute
hypocenter catalog KaKiOS-16, which contains three decades of southern
Californian seismicity. To reduce data size and increase computation
efficiency, the new approach builds upon the previously introduced catalog
condensation method that exploits the heterogeneity of the hypocenter
uncertainties. We validate the obtained fault network through a
pseudo-prospective spatial forecast test and discuss possible improvements for
future studies. The performance of the presented methodology attests to the
importance of the non-linear techniques used to quantify location
uncertainty information, which is a crucial input for the large-scale
application of the method. We envision that the results of this study can be
used to construct improved models for the spatiotemporal evolution of
seismicity.



Owing to the continuing advances in instrumentation and the improving coverage of seismic networks, earthquake detection magnitude thresholds have been decreasing while the number of recorded events has been increasing. As governed by the Gutenberg-Richter law, the number of earthquakes above a given magnitude increases exponentially as the magnitude is decreased (Ishimoto and Iida, 1939; Gutenberg and Richter, 1954). Recent studies suggest that the Gutenberg-Richter law might hold down to very small magnitudes corresponding to interatomic-scale dislocations (Boettcher et al., 2009; Kwiatek et al., 2010). This implies that there is practically no upper limit on the amount of seismicity we can expect to record as our instrumentation capabilities continue to improve. Although considerable funding and research efforts are being channeled […]

i) For a given dataset featuring N hypocenters, we first construct an agglomerative hierarchical cluster (AHC) tree based on Ward's minimum variance linkage method (Ward, 1963). Such a tree starts out with a cluster for each data point (i.e., with zero variance) and then progressively branches into an incrementally decreasing number of clusters. At each step, the merging of two clusters is based on the minimum of the Ward distance D_w given by:

$$D_w(C_i, C_j) = \sum_{x \in C_{ij}} \| x - r_{ij} \|^2 - \sum_{x \in C_i} \| x - r_i \|^2 - \sum_{x \in C_j} \| x - r_j \|^2 \qquad (1)$$

In this equation, C_ij is the cluster formed by merging clusters C_i and C_j, x represents the hypocenters, and r (with the proper subscript) is the centroid of each cluster. Hence, clusters i and j are merged if the sum of squares in Eq. (1) is minimized when they are merged into the single cluster ij. The number of branches in the tree is thus reduced by one, and the remaining clusters are used to decide which ones will be merged at the next iteration. This merging of clusters/branches continues until only a single cluster remains. "Cutting" the AHC tree at the D_w level corresponding to the desired number of branches allows one to choose the number of clusters (from 1 to N) used to represent the original dataset.
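For concreteness, step i) can be sketched in a few lines of Python. The example below is a minimal illustration, assuming SciPy's implementation of Ward's linkage and a synthetic stand-in for the hypocenter array; it is not the exact code used in this study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
hypocenters = rng.normal(size=(500, 3))  # hypothetical (N, 3) x, y, z array

# Each row of Z encodes one merger: the two cluster indices, the Ward
# distance D_w at which they merge, and the size of the merged cluster.
Z = linkage(hypocenters, method="ward")

# "Cutting" the tree at a desired number of branches yields the cluster
# labels representing the dataset at that level of detail.
labels = fcluster(Z, t=20, criterion="maxclust")  # e.g., 20 clusters
```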

ii) Since our goal is to obtain a fault network where segments are modeled by Gaussian kernels, we begin by estimating how many such kernels can be constructed with the clusters featured in the AHC tree. At its most detailed level (N clusters), no such kernels exist, as they would collapse on each data point and become singular. At the next level (N−1 clusters), we have the same problem. We thus incrementally reduce the level, traversing the AHC tree, until we obtain a first cluster featuring 4 hypocenters, which defines the first non-singular cluster. We then continue our traverse down the tree, replacing each cluster having 4 or more points by a Gaussian kernel. At each level of the tree, we count the number of such non-singular kernels; the maximum count attained during this traverse is referred to as the holding capacity.
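The traverse and counting described above can be sketched as follows; the function is a hypothetical illustration (reusing the linkage matrix Z from the previous sketch), and the exact definition used in the study may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

def holding_capacity(Z, n_points, min_size=4):
    """Maximum number of non-singular (>= min_size point) clusters
    encountered while cutting the AHC tree at every possible level."""
    best = 0
    for n_clusters in range(2, n_points + 1):
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        sizes = np.bincount(labels)
        best = max(best, int(np.sum(sizes >= min_size)))
    return best
```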

iii) Once we determine the holding capacity, all points that are not associated with any Gaussian kernel are assigned to a uniform background kernel that encloses the whole dataset. The boundaries of this kernel are defined as the minimum bounding box of its points, and its uniform spatial density is defined as the number of points divided by the volume of this bounding box. This representation facilitates the calculation of an overall likelihood and allows us to compare models of different complexity using the Bayesian Information Criterion (BIC) (Schwarz, 1978), given by:

$$\mathrm{BIC} = \sum_{i=1}^{N} \ln L(x_i) - \frac{k}{2} \ln N \qquad (2)$$

where L is the likelihood of each data point, k is the number of free parameters of the mixture model and N is the total number of data points. The value of k is calculated as k = 10 N_C − 1 (where N_C is the number of kernels in the mixture), since each kernel requires 3 (mean vector) + 6 (covariance matrix) + 1 (weight) = 10 free parameters. The same parameterization applies to the uniform background kernel, with its bounding box taking the place of its covariance matrix. The number of free parameters k is reduced by 1 because the weights have to sum to unity, and hence knowing N_C − 1 of them is sufficient.
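In code, Eq. (2) amounts to the following minimal sketch; log_likelihoods is a hypothetical array holding ln L(x_i) for each data point under the full mixture.

```python
import numpy as np

def bic(log_likelihoods, n_kernels):
    """BIC of Eq. (2): total log-likelihood minus the complexity penalty."""
    N = len(log_likelihoods)
    k = 10 * n_kernels - 1  # 3 (mean) + 6 (covariance) + 1 (weight) per kernel
    return np.sum(log_likelihoods) - 0.5 * k * np.log(N)
```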
iv) At the holding capacity, the large number of kernels is likely to constitute an overfitting model for the dataset. Therefore, we iteratively merge pairs of Gaussian kernels until an optimal balance between fitness and model complexity is reached. For each pair of kernels, the BIC gain is defined as the change in BIC resulting from their merger. Notice that each merging of a pair of kernels decreases k by 10; thus a given merger can be considered only if the reduction of the penalty term is greater than the decrease of likelihood (i.e., BIC gain > 0). Figure 3b shows such a BIC gain matrix calculated for the initial model with 77 clusters. Notice that a Gaussian cluster is not allowed to merge with the background kernel. The BIC gain > 0 criterion, which essentially drives and terminates the merging process, is similar to a likelihood ratio test (Neyman and Pearson, 1933; Wilks, 1938), with the advantage that the models tested do not need to be nested.
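One plausible way to form the candidate kernel for such a merger is the moment-preserving combination of two weighted Gaussians sketched below; this particular merging rule is our assumption, not necessarily the implementation used here.

```python
import numpy as np

def merge_kernels(w1, mu1, S1, w2, mu2, S2):
    """Moment-preserving merge of two weighted 3D Gaussian kernels."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    d1 = (mu1 - mu)[:, None]
    d2 = (mu2 - mu)[:, None]
    S = (w1 * (S1 + d1 @ d1.T) + w2 * (S2 + d2 @ d2.T)) / w
    return w, mu, S

# Since a merger removes 10 free parameters, the penalty term in Eq. (2)
# drops by 5*ln(N); the merger is admissible only if
#   BIC_gain = (logL_merged - logL_pair) + 5 * np.log(N) > 0
```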

The computational demand of the BIC gain matrix increases quadratically with the number of data points. To make our approach feasible for large seismic datasets, we introduce a preliminary check that considers clusters as candidates for merging only if they overlap within a confidence interval of σ√12 in any of their principal component directions.

The factor √12 is derived from the variance of a hypothetical uniform distribution over a planar surface: a uniform distribution of width L has variance L²/12, so that σ√12 recovers the full extent L of the corresponding segment (for details see Ouillon et al., 2008).
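A hedged sketch of this pre-screening test is given below; the precise geometric form of the overlap condition is our assumption, and the original construction is detailed in Ouillon et al. (2008).

```python
import numpy as np

def may_merge(mu1, S1, mu2, S2):
    """Candidate check: centers closer than sqrt(12)*sigma along at
    least one principal axis of either kernel."""
    d = mu2 - mu1
    for S in (S1, S2):
        sig2, axes = np.linalg.eigh(S)      # principal variances and axes
        extents = np.sqrt(12.0 * sig2)      # sigma * sqrt(12) per axis
        proj = np.abs(axes.T @ d)           # center offset along each axis
        if np.any(proj <= extents):
            return True
    return False
```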

During all steps of the merging procedure, the data points are in a state of soft clustering, meaning that they have a finite probability of belonging to any given kernel. A deterministic assignment can be achieved by assigning each point to the kernel that provides the highest responsibility (as per the definition of a mixture model), which is referred to as hard clustering.
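In a mixture-model setting, this hard assignment can be sketched as follows (a minimal illustration using SciPy's multivariate normal density; the kernel lists are hypothetical inputs).

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_assign(X, weights, means, covs):
    """Assign each point in X (N, 3) to the kernel with the highest
    responsibility w_c * N(x | mu_c, S_c)."""
    resp = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    ])
    return np.argmax(resp, axis=1)
```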

This dichotomy between stochastic and deterministic inference gives rise to two different implementations of the merging criterion: 1) a local criterion, considering only the two candidate clusters and the data points assigned to them through hard clustering; and 2) a global criterion, considering the whole mixture and the whole dataset. In effect, the local criterion tests the information gain for two kernels versus one kernel on a subset of the data, whereas the global criterion considers N_c versus N_c − 1 kernels on the whole mixture and dataset. Figure 4 shows the resulting final reconstructions for the two criteria.

The initial atomization step produces a total of 394 proto-clusters that are iteratively merged using the two different criteria (local and global). The resulting fault networks are given in Figure 5. Comparing the two fault networks, we observe that the local criterion provides a much more detailed structure that is consistent with the large-scale features of the global one. We also observe that some mergers performed under the global criterion decrease the overall likelihood. The global criterion favors these mergers to reduce the complexity penalty in Equation (2).

Our second observation is that the background kernel attains a higher weight of 11% using the local criterion, compared to only 5% for the global one. Keeping in mind that both criteria are applied to the same initial set of proto-clusters, and that there are no mergers with the background kernel, we argue that the difference between the background weights is due to density differences in the tails of the kernels. We investigate this in Figure 6 for the simple 1D case, considering mergers between two boxcar functions (analogous to planes in 3D) approximated by Gaussian functions. We observe that the merged Gaussian has higher densities in its tails compared to its constituents. The effect is amplified when the distance between the merging clusters is increased (Figure 6b). Hence, in the local case, the peripheral points are more likely to be associated with the background kernel due to the lower densities at the tails of the small, unmerged clusters.
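This tail effect can be checked numerically. The short sketch below, with an assumed separation d and evaluation point, compares the tail density of a moment-matched merged Gaussian with that of its two constituents.

```python
import numpy as np
from scipy.stats import norm

d = 2.0                   # assumed separation between the two unit boxcars
s = np.sqrt(1.0 / 12.0)   # std. dev. of a Gaussian fitted to a unit boxcar
s_merged = np.sqrt(1.0 / 12.0 + d**2 / 4.0)  # moment-matched merged kernel

x = d / 2 + 1.5           # an arbitrary point in the tail region
tail_pair = 0.5 * (norm.pdf(x, -d / 2, s) + norm.pdf(x, d / 2, s))
tail_merged = norm.pdf(x, 0.0, s_merged)
print(tail_merged > tail_pair)  # True: the merged kernel is heavier-tailed
```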

Location uncertainties in these catalogs are assumed to be normally distributed and hence expressed either in terms of a horizontal and a vertical standard deviation, or with a diagonal 3x3 covariance matrix. With the development of the KaKiOS-16 catalog, we extended this simplistic representation to allow arbitrarily complex location PDFs to be modeled with mixtures of Gaussians. Such mixture models, consisting of multiple Gaussian kernels, were found to be the optimal representation for 81% of the events, requiring an average of 3.24 Gaussian components (the rest were optimally modeled using a single Gaussian kernel). Therefore, we first needed to generalize the condensation methodology, which was initially developed for single kernels, to accommodate the multiple-kernel representation. In the original version, all events are initiated with equal unit weights. They are then ranked according to their isotropic variances, and weights are progressively transferred from the high-variance to the low-variance events according to their overlap. In the generalized version, the unit weight of each event is first distributed among the Gaussian components of its location PDF. All kernels are then ranked according to their isotropic variance and the weights are transferred as in the original method, with the additional constraint that weight transfers between kernels of the same event are not allowed (see Figure 7a, b).

The KaKiOS-16 catalog contains 479,056 events whose location PDFs are represented by a total of 1,346,010 Gaussian components (i.e., kernels). Condensation reduces this number to 600,463, as weights from events with high variance are transferred to better-located ones. Nevertheless, in Figure 8 we see that nearly half of these components amount to only 10% of the total event weight. The computation time scales with the number of components, while the information content is proportional to the number of events. Hence the large number of components accounting for a relatively low number of events would make the computation inefficient. A quick solution could be to take the components with the largest weights constituting 90% or 95% of the total mass, mimicking a confidence interval. Such a "solution" would depend on the arbitrary cutoff choice and would have the potential to discard data that may be of value for our application.
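Such a cutoff would amount to something like the sketch below, in which the heaviest components are retained until they reach a chosen fraction of the total mass; the function and the 90% level are purely illustrative.

```python
import numpy as np

def top_mass_components(weights, mass=0.90):
    """Indices of the heaviest components making up `mass` of the total."""
    order = np.argsort(weights)[::-1]                  # heaviest first
    csum = np.cumsum(weights[order]) / np.sum(weights)
    return order[: np.searchsorted(csum, mass) + 1]
```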

We can avoid such an arbitrary cut-off by exploiting the fact that the condensed catalog is essentially a Gaussian mixture model (GMM) representing the spatial PDF of earthquake occurrence in southern California. We can then, in the same vein as the hard clustering described previously, assign each event to its most likely GMM component (i.e., kernel). If we consider each event individually, the most likely kernel would be the one with the highest responsibility. However, for a globally optimal representation we need to find the best representative kernel for each event among all other kernels. To do this, we sample the original (uncondensed) PDF of each event with 1000 points and then calculate the likelihood of each sample point with respect to all the condensed kernels. The event is assigned to the kernel that provides the maximum likelihood for the highest number of sample points (see Figure 7c, d). As a result of this procedure, the 479,056 events are assigned to 93,149 distinct kernels. The spatial distribution of all the initial condensed kernels is given in Figure 9a, while the kernels assigned at least one event after the hard clustering are shown in Figure 9b. Essentially, this procedure can be viewed as using the condensed catalog as a prior for the individual event locations. The use of accumulated seismicity as a prior for focusing and relocation has been proposed by Jones and Stewart (1997).
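A minimal sketch of this assignment rule is given below, assuming the 1000 samples per event are already drawn from the uncondensed location PDFs; whether the kernel weights should enter the per-sample likelihood is our assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def assign_event(samples, weights, means, covs):
    """Assign an event to the condensed kernel that maximizes the
    likelihood of the largest number of its sample points."""
    dens = np.column_stack([
        w * multivariate_normal.pdf(samples, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    ])
    winners = np.argmax(dens, axis=1)     # best kernel per sample point
    return np.bincount(winners).argmax()  # kernel winning the most samples
```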

An immediate observation is related to the events associated with the 1986 Oceanside sequence (Wesson and Nicholson, 1988) located at coordinates (-75, -125). The kernel associated with these events is virtually absent in the fault networks reconstructed from 5 initial subsets (Figure 10a, b). This can be explained in terms of the atomization procedure. In the case of 5 initial subsets, the offshore Oceanside seismicity falls in a subset containing onshore faults such as the Elsinore fault at coordinates (0, -75). Because these faults have a more coherent spatial structure compared to the diffuse Oceanside seismicity, their proto-cluster holding capacity is higher. Hence the atomization procedure continues increasing the number of clusters while the diffuse Oceanside seismicity is left to be absorbed by the background kernel.

This is analogous to image compression: a raw image requires much more storage space compared to one obtained by JPEG compression. Although JPEG compression is an elaborate algorithm, it produces a representation that is much simpler. In the same vein, the fault reconstruction method uses regularities in the data to obtain a simpler, more optimal representation.

Another contributing factor to the performance of the fault network is the utilization of the location uncertainty information that facilitates condensation. This has two consequences: 1) decreasing the overall spatial entropy, thus providing a clearer picture of the fault network, and 2) reducing the effect of repeated events occurring on each segment, thus providing a more even prior on all segments.