Hidden-State Modelling of a Cross-section of Geoelectric 1 Time Series Data Can Provide Reliable Intermediate-term 2 Probabilistic Earthquake Forecasting in Taiwan 3 4

Abstract. Geoelectric time series (TS) has long been studied for its potential for probabilistic earthquake forecasting, and a recent model (GEMSTIP) directly used the skewness and kurtosis of geoelectric TS to provide Time of Increased Probabilities (TIPs) for earthquakes in several months in future. We followed up on this work by applying the Hidden Markov Model (HMM) on the correlation, variance, skewness, and kurtosis TSs to identify two Hidden States (HSs) with different distributions of these statistical indexes. More importantly, we tested whether these HSs could separate time periods into times of higher/lower earthquake probabilities. Using 0.5-Hz geoelectric TS data from 20 stations across Taiwan over 7 years, we first computed the statistical index TSs, and then applied the Baum-Welch Algorithm with multiple random initializations to obtain a well-converged HMM and its HS TS for each station. We then divided the map of Taiwan into a 16-by-16 grid map and quantified the forecasting skill, i.e., how well the HS TS could separate times of higher/lower earthquake probabilities in each cell in terms of a discrimination power measure that we defined. Next, we compare the discrimination power of empirical HS TSs against those of 400 simulated HS TSs, then organized the statistical significance values from these cellular-level hypothesis testing of the forecasting skill obtained into grid maps of discrimination reliability. Having found such significance values to be high for many grid cells for all stations, we proceeded with a statistical hypothesis test of the forecasting skill at the global level, to find high statistical significance across large parts of the hyperparameter spaces of most stations. We therefore concluded that geoelectric TSs indeed contain earthquake-related information, and the HMM approach to be capable at extracting this information for earthquake forecasting.


seismic hazard rates that varies across location for the next 50 years. Long-term EQ forecasting such as 86 PSHA can be valuable for location-specific seismic risk evaluation, thereby providing guidelines or 87 criteria for local construction projects. For example, a building that is expected to last 100 years must be 88 able to withstand 10 large EQs of the magnitude that occurs once every 10 years locally. What long-term 89 EQ forecasting does not do, would be to tell people how to do things differently at any time.

91
For intermediate-term EQ forecasting, the aim is to detect deviations of EQ rates from their long-time 92 values, to assess increased probabilities of EQs within the next one to ten years. For example, if a region 93 usually has a magnitude-6 EQ every 10 years, and 15 years have passed without one, the region would 94 be in a state of increased probability. (1990), that also took the EQ catalog as input to produce as output TIP for strong EQs (defined 100 specifically for different regions) within the next half to a few years. In the literature, we also found the 101 self-organizing spinodal (SOS) model (Chen, 2003; Rundle et al., 2000), which used the increased 102 activity of medium-sized EQs as precursors to large EQs that could occur within the next several years 103 or decades. Finally, one of the more successful methods at this time scale is pattern informatics (Nanjo 104 et al., 2006), which was demonstrated to be effective at predicting ≥ 5 EQs in Japan between 2000 105 and 2009. Intermediate-term EQ forecasting can, for example, help local authorities prioritize inspections 106 and reinforcements of old buildings over the construction of new ones. 107 108 Short-term EQ forecasting use a variety of methods to forecast the time, place, and magnitude of a 109 specific large EQ. Here we commonly find methods using the EQ catalog as input data, and apply HMM was trained to track the waiting time between EQs with magnitudes above 4 in southern California 113 and western Nevada (Yip et al., 2018), giving EQ forecasts for up to ten days in the future. Apart from 114 using EQ catalog data, there is an increased variety of methods using other data inputs, such as the widely 115 used Seismic Electric Signals (SESs) ( Varotsos et al., 1993), to look for EQ precursors in 117 the form of abnormal changes to the geoelectric potential. In addition to looking for specific SES-type 118 precursors, we also found papers using methods such as artificial neural networks (ANNs) ( the data of 20 geoelectric stations in Taiwan (Fig. 1) and studied the association between skewness and 132 kurtosis of the geoelectric data and ML ≥ 5 EQs, where ML is the Richter magnitude scale. Through 133 statistical analyses, they found significant correlations between geoelectric anomalies and these large 134 EQs. They then developed an EQ forecasting algorithm named GEMSTIP to extract TIPs for future EQs. 135 TIPs were identified through differences in the distributions of skewness and kurtosis with those found 136 during normal periods. Moreover, Jiang et al. (2020) investigated the geoelectric signals before, during, 137 and after EQs by the shifting correlation method, and found that the lateral and vertical electrical 138 resistivity variation and subsurface conductors might amplify SESs, which agreed with the findings by 139 Sarlis (Sarlis et al., 1999) and Huang (Huang and Lin, 2010).

146
Inspired by these findings, in this paper we wanted to take a closer look at the relationship between the 147 EQ times and statistical indexes of geoelectric TSs, namely correlation ( ), variance ( ), skewness ( ), 148 and kurtosis ( ). During initial explorations, we computed the TSs of these indexes (see Sect we find the darkest pixels concentrated in the long-TTF regime, whereas in the short-TTF regime, the 158 pixels show a lower variability in their intensities. We suspect that this second phenomenon is the result 159 of fewer samples at longer TTFs.

166
To overcome this problem, which is created by superimposing the index TSs of different lengths between 167 EQs, we decided to discover such regimes directly from the geoelectric TSs by using HMMs. forecasting of specific EQs. In this regard, we also note that the same HMM approach described in this 186 paper can be applied to many other geophysical high-frequency time series data, such as geomagnetic or 187 GPS ground movement data, even though we only used geoelectric data as the input of the HMM, to 188 show that the underlying seismic dynamics is indeed clearly separable into distinct regimes of higher 189 versus lower seismic activities (as supported by (Yip et  The 1-Hz geoelectric TSs data used in this paper was provided by the 20 monitoring stations located 214 across Taiwan (see Fig. 1), which are collectively named Geoelectric Monitoring System (GEMS). The 215 spacings among stations are generally 50 km. The geoelectric data here is the self-potential data, which 216 is the natural electric potential differences in the earth, measured by dipoles placed 1-4 km apart within 217 each station. Each station can output two sets of high-frequency geoelectric TSs, measuring alone the 218 NS direction and the EW direction. Depending on the spatial constraints of some stations, the azimuths 219 of the dipoles might deviate from the exact NS or EW directions by 10-40°. For the purpose of this study, 220 we used the geoelectric TSs provided by the GEMS with the same time span as the EQ catalog data, 221 which is from January 2012 to December 2018. We down sampled the data to 0.5 Hz, and used these in 222 subsequent analyses.

224
The HMMs that we will show in Sect. 3 partitioned the 20 geoelectric TSs into two HSs, distinguished 225 by the local statistics of their geoelectric fields. We believe these HSs can also exhibit different 226 seismicities within their time durations. To check this, we used EQ catalog data compiled by the Central 227 Weather Similar to the GEMSTIP model, our HMM modelling also searched for EQ-related information in 252 skewness and kurtosis TSs computed from the geoelectric TS, we conveniently utilized the insight from 253 Chen

273
Next, we present the definitions for each index. Within each time window, let us write the geoelectric 274 field as { -} -.%,…,"! . The correlation that we used in this paper is the lag-1 Pearson autocorrelation where 2 is the mean of { -} -.%,…,"! . Additionally, we observed astronomically extreme values in the 288 TSs for most stations, which were caused by unknown technical errors, and we therefore considered 289 them outliers that have to be removed for consistent data quality. We discuss how we removed them in 290 detail in Supporting Information Sect. A. From here onwards, the TSs will always refer to those after 291 the outlier-removal process.

293
The skewness of { -} -.%,…,"! , or the sequence's third standard central moment, is defined as: 294 where 2 is the standard deviation of { -} -.%,…,"! . It is a real number measuring how asymmetric the 296 distribution of { -} -.%,…,"! is about the mean. For a perfectly symmetric distribution such as the normal 297 distribution, the skewness is 0. A positive skewness means the distribution has a longer tail to the right, 298 and a negative skewness means the distribution has a longer tail to the left.

300
The kurtosis of { -} -.%,…,"! , or the sequence's fourth standard central moment, is defined as: 301 It is a real number measuring how frequently extreme values (values very far from the mean) appear in 303 the distribution. The higher the number, the more frequently extreme values can be found. As a reference, 304 the kurtosis of the normal distribution is = 3. If > 3, we say that the distribution is leptokurtic, 305 meaning the distribution has fatter tails and more frequent extreme values compared to the normal 306 distribution. If < 3, the distribution is said to be platykurtic, meaning the distribution has thinner tails, 307 and extreme values appear less frequently compared to the normal distribution. 308 309

Estimation of HMM Using the Baum-Welch Algorithm 310 311
A Markov model is a stochastic model that can be used to describe a system whose future state 41% is 312 drawn from a set of states { 5 } 5.%,…," with probabilities 6←8 = R 41% = 6 S 4 = 8 T conditioned 313 https://doi.org/10.5194/nhess-2021-378 Preprint. Discussion started: 4 January 2022 c Author(s) 2022. CC BY 4.0 License. by its current state 4 . The probabilities 6←8 can be organized into a transition matrix , where 314 ( , ) = 6←8 . The HMM is an extension of the Markov model, with the additional property that the 315 system state 4 is not explicitly known, hence the word "hidden" in the name. Instead, what can be 316 observed from an HMM at any time is an observable 4 drawn from a size-observable set 317 9.%,…,: . Just as in a Markov model, the future state 41% of an HMM is drawn from the set 318 { 5 } 5.%,…," with probabilities 6←8 (similarly conditioned by the current state 4 ) taken from the 319 transition matrix . At time , the observable 4 is emitted with a probability ( 4 = 9 | 4 = 5 ) 320 that depends on which HS 4 = 5 the system is in. These probabilities can be organized into an × 321 emission matrix , where ( , ) = ( 4 = 9 | 4 = 5 ). Additionally, we call the HS probability 322 Maximization methods (Bilmes, 1998). Starting from randomly initialized model parameters , the 332 algorithm runs recursively to maximize the likelihood of the model given the observation TS. When the 333 algorithm converges, we will obtain a set of estimated model parameters f = R g , g , h T, as well as a 334 posterior probability ( 4 = 5 |{ 4 } 4.%,…,< , f ) TS. We include more details on the BWA in Sect. 2.5. 335 Additionally, for readers who want an intuitive demonstration of how HMM and BWA works, we 336 attached a simulation of a simple HMM and its BWA application in Supporting Information Sect. B.

338
HMMs are traditionally applied in fields such as speech recognition ( of our knowledge, there is no past HMM study on geoelectric TSs for EQ forecasting. In this paper, we 344 argue that the HMM is an objective tool, because the HSs were estimated only from the geoelectric TSs, 345 and thereafter validated against the EQ catalog. We believe this statistical procedure limits the bias that 346 we could introduce into our prediction model when we optimized the model. This will be even clearer 347 by the end of Sect. 2.5 where we summarize the entire procedure. In the context of this study, we assume for simplicity two seismicity states of the earth crust beneath each 352 station. These are our HSs { % , * }, since they cannot be directly observed. as input. In this section, 359 we discuss possible ways to convert 4 into discrete observations for the BWA, and why we chose one 360 particular method for implementation. 361 362 One way to do so would be to model each component of 4 as samples drawn from known distributions, 363 such as a normal distribution or a gamma distribution. Unfortunately, as we can see from

383
The BWA has no problem dealing with high-dimensional problems, provided the inputs are discrete. 384 However, this method would work well only if the overall number of possible observations is small. If 385 we use 50 bins for each of the eight indexes, there would be = 50 A ≈ 3.91 × 10 %3 possible 386 observations, meaning the emission matrix would be of dimension 3.91 × 10 %3 -by-2. Reducing the 387 number of bins to just 10 for each index, we still have = 10 A possible observations. This latter space 388 is still too large for the BWA to search through exhaustively in a reasonable amount of time, even though 389 we feel 10 bins for each index may already be too coarse and likely to miss subtle details. Furthermore, 390 with so many possible observations, we expect the emission probabilities to be significantly different 391 from 0 only for a very small subset of the possible observations. 392 393 We such as 4 . We chose to use the k-means clustering for discretizing 4 because of its low computational 400 cost as well as its reliability in grouping similar feature vectors in the feature space. In so doing, we 401 created a discrete feature space with reasonable size, as high-level labels of different geoelectric 402 dynamics.

468
As the iteration goes, the BWA improves the likelihood of observing the input observation TS 469 % , * , … , < given the model parameters ( , , ), which converges when the improvements on the 470 posterior probability ( % , * , … , < |( , , )) become minimal. In practice, we found that 30 471 iterations were long enough for most models to converge. We therefore obtained the estimated model 472 parameters R g , g , hT = ( , , ) , as well as the posterior probability TS of ( 4 = 473 5 | % , * , … , < , g , g , h) for both HSs and all , which we write in short form as: we noted that BWA assigns the indexing of HSs randomly; therefore, the % of one station is not 476 guaranteed to be equivalent to the % of another station.

478
We cannot simply do the above BWA estimation once to get R g , g , hT, because the BWA converges to where the BWA converges. In order to obtain a global optimum result within a reasonable computation 482 time, we ran 15 BWA estimations in parallel for each station, with different random initial parameters. 483 For each station, we then chose the model with the highest model score given by 484 ' % , * , … , < |R g , g , hT' for subsequent analysis. Later in Fig. 4(a), we also show all 15 HMMs to 485 demonstrate how consistent the converged models are. We can write the posterior probability TS of this 486 For each initial condition, the BWA randomly assigns one HS to be % , and the other to be * . To show 489 all 15 HMMs simultaneously in Fig. 4(a), we need to standardize % and * across all HMMs. For this 490 purpose, we set g as the "standard". For the remaining 14 posterior probabilities [ ] 8.*,…,%, , we 491 checked their Expected Absolute Difference = R| g − }T from g , whose value ranges 492 from 0 and 1. If > 0.5, is more similar to g than to g , and we proceed to swap the HS 493 indexing for the 4V HMM by assigning ( ) ≡ and ( ) ≡ . Otherwise, 494 corresponds to the same HS as g , and we leave its HS indexing unchanged. In this way, we standardized 495 all 15 models so that their can be visualized together in Fig. 4(a), with the g TSs sorted by their 496 model scores ' % , * , … , < |R g , g , hT', and the optimal model at the first row. In Fig. 4(

506
We summarize the procedures used to obtain g , starting from a pair of geoelectric TSs for each GEMS 507 station in the form of a flow chart in Fig. 5. It is noteworthy that the full procedure contains essentially 508 only 2 hyperparameters: and ! . The figures shown in the results section will use the optimal 509 hyperparameters, whose identification procedure will be discussed in detail later in Sect. 3.5. 510 Additionally, for each station's optimal HMM, we plotted the distribution of indexes ( , , , ) at both 511 HSs in Supporting Information Sect. D.

EQ Grid Map, EQ Frequency, and EQ Frequency Ratio 519 520
Up to this point, we did not incorporate any EQ catalog information into g for each station. Unlike 521 many past EQ studies looking for specific precursory features within the geoelectric data, we made no 522 specific assumptions regarding what these EQ precursors might look like. Instead, we let the BWA search 523 for specific precursory features within the 8-dimensional feature space.

525
After the HMM modeling, we then checked locally whether % and * would effectively partition time 526 periods with significantly lower EQ probabilities from those with significantly higher EQ probabilities. 527 We think of one HS as a passive state (with significantly lower EQ probabilities) and the other HS as an 528 active state (with significantly higher EQ probabilities), but we cannot call the former % and the latter 529 * because we have not yet standardized these HS labels across the 20 stations. To do so, we need to 530 match the HS TS of each station to the EQ catalog to determine the EQ frequencies of % and * for 531 this station, and use % and * as the HS labels of the active and passive states respectively (relabeling 532 when necessary). In the remainder of this section, we describe in detail how this is done.

534
For each GEMS station we started from g , and classified time periods across the 7 years as belonging 535 to two sets % and * . The time point 8 was assigned to % if ™ 4 % ( % ) > 0.5 and * if ™ 4 % ( * ) > 536 0.5. After this is done, we checked how EQs are distributed between % and * for different regions 537 across Taiwan. For this task, we first made a 16-by-16 grid map of Taiwan, so that EQs within the same 538 grid cell ( , ), for and in [0,1, … ,15], are grouped together (see Fig. 6).  where % is the number of EQs occurring within % , * is the number of EQs occurring within * , 548 | % | is the total duration of % time periods, and | * | is the total duration of * time periods. From 549 Fig. 6, we see that the spatial distribution of EQs is highly heterogeneous, so we may find a grid cell with 550 about 10 EQs but also another grid cell with about 1000 EQs. This tells us that we should not directly 551 compare the EQ frequencies, but should instead compare the EQ Frequency's Ratio, defined as: (14) 553 For any cell containing at least one EQ, the range of its # is [0,1]. Intuitively, any cell with # < 0.5 554 is a region having lower EQ frequency in % compared to * ; and any cell with # > 0.5 is a region 555 having a higher EQ frequency in % compared to * . For example, for a cell with # = 0.2, ?:,% is 556 https://doi.org/10.5194/nhess-2021-378 Preprint. Discussion started: 4 January 2022 c Author(s) 2022. CC BY 4.0 License.
only 1/4 of ?:,* . The # value quantifies how one HS has a higher or lower EQ frequency than the 557 other. In Sect. 3, we will present how we deep dived into the spatial-temporal correlations between HS 558 TSs ( g ) and EQ activities for all 20 stations, starting from 20 grid maps of # values. 559 560 561 3 Results and Discussions 562 563 In this section, we present the results obtained for all 20 stations, as well as additional treatments that we 564 felt are necessary to investigate whether the HS TSs have significant forecasting power for EQs. 565 566

EQ Frequency's Ratio ( ) Grid Maps 567 568
Once we obtained the g TS for each station, the natural first step of our analysis was to examine the 569 # values for all cells in the 16-by-16 grid map. We show this procedure for CHCH station in Fig. 7,  570 where we visualize the grid maps for % and * in Figs 7(a) and (b) respectively, to clearly show how 571 many EQs occurred during % and * . The resulting # grid map is shown in Fig. 7(c), where there 572 are cells with values close to 0.5 (white-color cells) and cells with values far from 0.5 (red for close 573 to 0; green for close to 1). White-color cells are regions whose EQ activities are weakly correlated with 574 the HSs, since the time periods of % and * are not very different in terms of EQ frequency; whereas 575 red/green cells are regions with significantly lower/higher EQ frequencies in % .

583
As can be seen in Fig. 7(c), for different regions the HS with higher EQ activities can be either % or 584 * . This is true not only for CHCH station, but also for all 20 stations, whose # grid maps are shown 585 in Fig. 8. Although there is no consistent pattern of any state corresponding to higher EQ activities 586 globally, we see in Fig. 8 that there are regions whose # values are far from 0.5 across many stations. 587 This means that statistically speaking, one of the HSs has higher EQ activities than the other. In fact, if 588 the active HS has a lot more EQs than the passive HS, it is also likely that the active HS cover most of 589 the larger EQs (e.g., > 5), which is a good attribute for potential EQ forecasting applications. This 590 phenomenon is shown in Supporting Information Sect. E, where we visualized the EQ frequency 591 distributions across different magnitudes for both HSs, for three selected cells with most EQ events. hyperparameters individually specified for each station in Fig. 12).

597
All in all, the findings in this section is important, but we cannot directly decide % or * to be the 598 proxy for increased EQ probabilities, because they cannot be associated consistently with the active or 599 the passive state. Instead, we should understand % and * as two high-level, fuzzy labels for tectonic 600 dynamics related to EQ activities in different regions. There can be elements such as rock and soil 601 formations, the underground water system, and fault lines, forming a complex dynamical system that 602 influences where and when EQs become active. A concrete mapping between EQ activities and specific 603 elements of the complex dynamical system would be very difficult, as this will involve high-resolution 604 https://doi.org/10.5194/nhess-2021-378 Preprint. Discussion started: 4 January 2022 c Author(s) 2022. CC BY 4.0 License. subterrain surveys. Nevertheless, we can still measure how well % and * can partition the time 605 periods so that one HS can have significantly more EQs than the other. To show this more clearly, we 606 created grid maps of discrimination power and present them in the next section. 607 608

Discrimination Power ( ) Grid Maps 609 610
We defined the discrimination power for each cell as: The value of ranges from 0 to 0.5, with 0.5 being the most discriminative since all EQs are found 613 in one HS, and 0 being the least discriminative since EQ frequencies are identical between the two HSs.

614
We show the grid maps of for 20 stations in Fig. 9, which are easier to interpret compared to the grid 615 maps in Fig. 8 where we had to use two different colors. Intuitively, for a region with = 0.25 (not 616 uncommon), one of its HSs would have an EQ frequency three times that of the other HS. It can be noted 617 that cells around the edge of the map tend to have very high values, because there are very few EQ 618 events in these cells. This is not a problem, as we will take the number of EQs into account later in Sect.

625
In some cells, we find values close to 0.5, which seems to suggest that the seismicity associated with 626 % is very different from that associated with * . However, looking at Fig. 9, we see large variations in 627 values across the cells, and more importantly among some neighboring cells. We therefore wonder 628 whether regions with high values are statistically significant, or the products of random temporal of HSs would produce the highest value of 0.5. To address this concern, we investigated the 632 significance of the grid maps of through statistical tests in the next section. Since we had the optimal HMMs for the 20 stations, we can test cellular statistical significance levels 637 that their HSs can indeed separate time periods of higher/lower EQ probabilities, using grid maps 638 shown in Fig. 9. Specifically, for each grid cell and an empirical HS TS we carried out a statistical 639 hypothesis testing using the following null hypothesis: any random HS TS would achieve the same or 640 higher performance (in terms of value). To create random HS TSs for the hypothesis testing, we chose 641 to directly simulate the HMM using the same model parameters R g , g , hT as the empirical HMM of the 642 corresponding station. For each hypothesis testing of an empirical HS TS (actual HS TS obtained for 643 each station), we created 400 simulated HS TSs, which were then used to create 400 grid maps of the 644 Discrimination Power . In Fig. 10, we show the empirical HS TSs alongside a random sample of 10 645 simulated HS TSs for YULI, SHRL, CHCH, and SIHU to illustrate the simulated counterparts. After this, 646 in each cell, we had one empirical value of that we can compare against a distribution of 400 647 simulated values of . This allows us to compute for each cell the probability that its empirical value 648 is higher than its simulated counterparts. We named this quantity the Discrimination Reliability $ , 649 defined for each cell in the grid map as: 650 .
(16) 651 In the language of statistical hypothesis testing, the p-value for the test is given by = 1 − $ . The 652 value of $ ranges from 0 to 1. If $ is close to 1, we are confident that the discrimination power 653 of the empirical HS TS is statistically significantly high; otherwise, we have no such confidence. we can better appreciate the utility of HS TSs across the grid map, since the $ value is a statistical 664 significance measure of the HS-EQ correlation, unlike the discrimination power . To explain this, let 665 us take the example of station LIOQ (upper left of Fig. 11), whose physical location is marked by the 666 blue star, within a dark-red grid cell of $ = 0.992. This means that the empirical HS TS performs 667 better than random guesses (i.e., simulated HS TSs) at separating time periods of low/high EQ 668 frequencies, with a statistical significance of = 0.008. This means that it is improbable for a simulated 669 HS TS to have such a high , and therefore the empirical HS TS is unlikely to be a product of random 670 chance. This is a very strong demonstration of the mutual information between the HS TS obtained from 671 geoelectric TS, and the EQ catalog that was not used to train the HMM.

673
In the proximity of station LIOQ located within 22.55-23.58° , we can see a clear pattern of cells 674 with $ ≥ 0.9 (dark-red color), while $ ≥ 0.9 occasionally for most cells outside this general region. 675 This pattern suggests the geoelectric information from station LIOQ is approximately local. This is 676 consistent with the logical requirement for direct/indirect structural relation between station LIOQ and 677 region X, such as being close to the same subterrain fault line, for the information at station LIOQ to be 678 useful for region X. As a corollary, information given by station LIOQ is less likely to be useful for far 679 away regions, as they are less likely to have such structural relations with station LIOQ. In application 680 scenarios, this means that the state of EQ probabilities of region X can be estimated using stations closer 681 to the region. Last but not least, it is also worth mentioning that most cells at the edge of the map seldom 682 have high $ values. This is consistent with the fact that these cells typically have very few EQ events 683 to provide high statistical significance.  Based on our discoveries on the HS-EQ correlations so far, it is clear that the HS TSs can provide usable 690 EQ forecasts for real-world applications. For example, in a high-performance grid cell such as the one 691 where LIOQ is situated, the corresponding HS TS can tell us confidently ( = 0.008) whether the current 692 time is within the "active state" featuring more frequent EQs or the "passive state" featuring less frequent 693 EQs. Let us note that the above statement is about how the EQ frequency deviates from its long-term 694 value. Since the regime switches after every few months to every few years (e.g., CHCH in Fig. 10 to weigh the information given by all 20 stations. For example, we can consider only stations with $ ≥ 705 $_K8-at the given grid cell. The user-defined threshold $_K8-can take on constant values (e.g., 0.9) 706 across the grid map, or be location specific, such as being lower (e.g., 0.8) for grid cells where few of 707 the 20 stations have $ ≥ 0.9. We hope to explore this in future works. 708 709 Due to the nature of our HSs, we cannot use them to forecast specific EQs or issue evacuation alarms. 710 What the HSs can do, however, is to provide information with forecasting skill to decision makers, in 711 regions where the HS switched from the passive state to the active state convincingly (i.e., the observed 712 active state is persistent and not a temporary fluctuation), to take courses of action that can lower the 713 potential damage with minimal costs. For example, in the passive state, the building inspection authority 714 can prioritize the inspection and issuing safety permits to new projects over re-inspections of old 715 buildings. With the arrival of an active state that might last a few months to a few years, local authorities 716 would have the incentive to clear up pending re-inspection works, so that fewer old buildings are exposed 717 to potential EQ damage. Other than the re-inspection of old buildings, local authorities could also 718 increase the frequency of safety education and drills to vulnerable groups such as students and 719 construction workers, to reduce potential injuries or fatalities due to panic or lack of understanding. 720 Additionally, disaster relief services may use the HS's information to re-deploy the stockpile of relief 721 materials such as food, clothing, tents, and first-aid kits, whenever necessary. In doing so, the stockpile 722 of relief materials can be brought closer to high-risk regions within a convincing active state, to be 723 distributed to victims more cost-effectively after a major EQ. 724 725 3.4 Global-level Significance Tests of the Forecasting Power 726 727 From Fig. 11 alone, we have demonstrated the HS TSs are able to separate time periods of low/high EQ 728 probabilities for regions (cells in the grid map) with high $ values. While the forecasting power of HS 729 TSs in each of these cells is statistically significant, the more critical among us may wonder whether 730 some of these cells can be significant purely by chance, even though there is in reality no persistent 731 correlation between EQs and HSs. For example, any simulated HS TS in Fig. 10 would have at least a 732 few cells with high $ values. Therefore, in this next section, we will answer the question of "whether 733 https://doi.org/10.5194/nhess-2021-378 Preprint. In order to answer this question, we need to define a performance metric that can quantify the 738 performance of each station with a single value, instead of a grid map of $ values. We start by 739 assuming that all stations have zero forecasting skill, but as a result of our statistical test, some cells may 740 still end up with high $ by chance. A truly informative station should have significantly more cells 741 with high $ than random guesses. Taking the number of EQs into the consideration, we further propose 742 that a truly informative station should have significantly higher EQs counts located in high-performing 743 cells. On the grid map, let us define cells with $ ≥ $_K8-as satisfactory cells, and the rest as 744 unsatisfactory cells, where $_K8-is the user-defined threshold that determine how high the $ 745 should be in order to be considered "high-performing". As mentioned earlier, it is possible to work out 746 schemes that allow for regionally acceptable $_K8-. Here for simplicity let us consider a scheme with 747 a uniform $_K8-across all cells in the grid map. With this setting we can proceed to define the single-748 value performance metric for each station, as the ratio of EQs in satisfactory cells, or ?:> as: 749 located within satisfactory cells, and are therefore "forecasted" by the station to the level required by the 753 user (i.e., $_K8-). Therefore, to show that a station has more forecasting power than random guesses, 754 we proceed to test a given station against the null hypothesis is that a random guess (simulated HS TS) 755 can have the same or higher ?:> than the empirical HS TS. 756

757
We carried out this hypothesis test station by station, by first computing the ?:> values of its empirical 758 HS TS as well as for 400 HS TSs simulated using the HMM parameters for the given station. We then 759 defined the global confidence level as: 760 Last but not least, the histograms for each station in Fig. 12 are created with individually optimized 778 hyperparameters, namely ! (length of time window to compute indices , , , and , in days) and 779 (number of clusters for the k-means clustering). The optimal hyperparameter values for each station 780 are indicated in the titles for each station. Let us discuss the details of this optimization process in the 781 next section.

826
To wrap up this section, let us describe how to select the optimal hyperparameter for each station. We 827 did this in two steps: first, we selected the hyperparameters with highest values (1 for many 828 stations); next, in case of ties, we chose the hyperparameter with the highest ?:> as the winner. For 829 example, for WANL station in Fig. 14, there are many cells with = 1. We therefore proceeded to 830 check the heatmap for station WANL in Fig. 13, and identified the hyperparameter combination ! = 831 0.03 and = 80 as optimal, since it has the highest ?:> value. Using this selection procedure, we 832 identified the optimal hyperparameter for each station, and used these individually optimal 833 hyperparameters to create Figs 7 to 12. This selection procedure could also be adapted for real-world 834 applications, when more historical data and computational power are available, to provide even better 835 model performances. EQ forecasting is an important research topic, because of the potential devastation EQs can cause. As 841 pointed out by many past studies, there is a correlation between features within geoelectric TSs and 842 individual large EQs. In those studies, different features of geoelectric TSs were explored for their use 843 of EQ forecasting, among which the GEMSTIP model was the first one to directly use statistic index TSs 844 of geoelectric TSs to produce TIPs for EQ forecasting. Inspired by this, we took a second look at the 845 relationship between these statistic indexes and the timing of EQs, and found out that there is an abrupt 846 shift of the indexes' distribution along the TTF axis. This suggested that there are at least two distinct 847 geoelectric regimes, which can be modeled and identified using a 2-state HMM. This motivation is 848 further backed by the knowledge that there can be drastic tectonic configuration changes before and after 849 a large EQ, one important aspect of which being the telluric changes identified in the region around the higher/lower EQ frequencies, we would expect to also find two matching geoelectric regimes with 853 https://doi.org/10.5194/nhess-2021-378 Preprint. Discussion started: 4 January 2022 c Author(s) 2022. CC BY 4.0 License. contrasting statistical properties, which can be of good utility for EQ forecasting.

855
Specifically, we modeled the earth crust system as having two HSs identifiable with distinctive 856 geoelectric features encoded by 8 index TSs from each station. To obtain the HMM for each station, we 857 needed to run the BWA, which is most convenient to use with a discrete observation TS input. Therefore, 858 we used the k-means clustering to convert the continuous TS of 8-dimensional index vectors into a 859 discrete observation TS, and subsequently obtained a converged HMM for each station. We then 860 investigated whether these HS TSs provide informative partitions of EQs, i.e., one of the HS can be 861 interpreted as a passive state with less frequent EQs, while the other one as an active state with more 862 frequent EQs. For this task, we defined the EQ frequency's ratio ( # ), which is the frequency of EQs in 863 one of the HSs divided by the total frequency of the EQs. Using # we further defined the 864 discrimination power ( ), to measure how differently one HS is from the other HS in terms of the EQ 865 frequency. We then plotted 16-by-16 grid maps of # and for all 20 stations, and tested the statistical 866 significance of in each cell, by comparing the empirical value against the distribution of from 400 867 simulated HS TSs, to end up with the grid maps of discrimination reliability ( $ ) for all 20 stations. To 868 further investigate the statistical significance level at the global scale, we defined ?:> to measure the 869 percentage of total EQs located within satisfactory cells, i.e., cells having $ ≥ $_K8-for a user-870 specified $_K8-value. This $_K8-value can be easily customized for different cells, but in this paper, 871 we used a constant $_K8-value across the grid map for demonstration. By comparing the ?:> value 872 of the empirical model against those of 400 simulated models, we obtained one global significance value 873 for each station, namely the global confidence level ( ). This tells us how confident we can be that 874 information contained in the empirical HS TSs can be used for EQ forecasting.

876
Finally, we showed how we optimized the values through a grid-search in the 2-dimensional 877 hyperparameter space and obtained the optimal combination of [ ! , ] individually for each station. 878 As a result, among the 20 stations with optimized hyperparameters, there are 19 stations with > 879 0.95, 15 of which having > 0.99. Additionally, the confidence levels are also robust across the 880 hyperparameter space for most stations. Based on these positive results, the Hidden Markov Modelling 881 of the index TSs computed from geoelectric TSs is indeed a viable way to extract information that can 882 be useful for EQ forecasting. As discussed in greater detail in Sect. 3.3, in real-world scenarios, the HS 883 TSs can be useful for intermediate-term EQ forecasting either directly (for high $ cells) or as input 884 features for higher-level algorithms that take information from all 20 stations (for low $ cells). Beyond 885 our demonstration of extracting EQ-related information from geoelectric TSs, the HMM approach 886 described in this paper can also be explored on other high-frequency geophysical data, such as those 887 from geomagnetic, geochemical, hydrological, and GPS measurements, for EQ forecasting. 888 889 At this point, we would like to address the issue of out-of-sample testing (or cross-validation) to support 890 the validity of our model. There are two ways to do this: (1) split a long time series into a training data 891 set to calibrate the model, and a testing data set to validate the model; and (2) use whatever time series 892 data is available to calibrate the model, before collecting more data to validate the model. If the model is 893 statistically stationary (its parameters do not change with time), both approaches are acceptable. However, 894 many would agree that an out-of-sample test with freshly collected data (approach (2)) is more 895 impressive, especially if it is done in real time. We would certainly like to try this, and are writing a grant 896 application to fund such a validation study. For this paper, however, we were not even able to use 897 https://doi.org/10.5194/nhess-2021-378 Preprint. Discussion started: 4 January 2022 c Author(s) 2022. CC BY 4.0 License.