This work is distributed under the Creative Commons Attribution 4.0 License.
How can seismo-volcanic catalogues be improved or created using robust neural networks through weakly supervised approaches?
Abstract. Real-time monitoring of volcano-seismic signals is complex. Typically, automatic systems are built by learning from large seismic catalogues, where each instance has a label indicating its source mechanism. However, building complete catalogues is difficult owing to the high cost of data labelling. Current machine learning techniques have achieved great success in constructing predictive monitoring tools; however, catalogue-based learning can introduce bias into the system. Here, we show that while monitoring systems recognise almost 90 % of the events annotated in seismic catalogues, other information describing volcanic behaviour is not considered. We found that weakly supervised learning approaches have the remarkable capability of simultaneously identifying seismic traces that are unannotated in the catalogue and correcting mis-annotated ones. When a system trained with a master dataset and catalogue is used as a pseudo-labeller within the framework of weakly supervised learning, information related to volcanic dynamics can be revealed and updated. Our results offer the potential for developing more sophisticated semi-supervised models to increase the reliability of monitoring tools; for example, more sophisticated pseudo-labelling techniques involving data from several catalogues could be tested. Ultimately, there is potential to develop universal monitoring tools able to account for unforeseen temporal changes in the monitored signals at any volcano.
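For readers unfamiliar with the pseudo-labelling idea summarised in the abstract, the sketch below illustrates one common weakly supervised loop: a model trained on the labelled master catalogue assigns provisional labels to unlabelled target windows, and only high-confidence predictions are added to the training set before retraining. This is a minimal illustration assuming generic feature arrays and a stand-in scikit-learn classifier; it is not the authors' architecture, feature set or threshold.

```python
# Minimal pseudo-labelling loop (illustrative; not the authors' pipeline).
# X_master, y_master: labelled feature windows from the master catalogue.
# X_target: unlabelled feature windows from the target volcano (NumPy arrays).
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in for the paper's neural networks

def pseudo_label_loop(X_master, y_master, X_target, confidence=0.9, n_rounds=3):
    X_train, y_train = X_master.copy(), y_master.copy()
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    for _ in range(n_rounds):
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_target)            # class probabilities per window
        conf = proba.max(axis=1)                         # confidence of the best guess
        labels = model.classes_[proba.argmax(axis=1)]    # pseudo-labels for the target windows
        keep = conf >= confidence                        # retain only confident pseudo-labels
        X_train = np.vstack([X_master, X_target[keep]])
        y_train = np.concatenate([y_master, labels[keep]])
    return model, labels, conf
```

The manuscript's models are deep neural networks rather than forests, and its confidence threshold is tied to a drift-adaptation step; the snippet only conveys the general mechanism by which unannotated or mis-annotated traces can be revisited.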
Status: final response (author comments only)
RC1: 'Comment on nhess-2024-102', Gordon Woo, 27 Aug 2024
The authors are to be commended for tackling an important, but very challenging issue. However, the equivocal nature of the results obtained is reflected in the interrogatory title.
For the outcome to be at all persuasive, the analyses should be undertaken for multiple Master catalogues and test databases, not just those from Deception Island and Popocatepetl. At the very least, there should be many test databases, not just a seismic experiment conducted in 2002. What would be the outcome for data from 2022?
In lines 235 to 240, assumptions are made about the equivalence of the marginal and conditional distributions for the source and target domains. For other selections of source and target, these assumptions may be much more tenuous.
Progress in volcano hazard forecasting requires a multidimensional monitoring approach, covering geodetic and geochemical monitoring, as well as modelling of volcano dynamics. Machine learning will be an important auxiliary tool to support volcanological decision-making. However, a restriction to seismological data is too limited.
An extended version of this paper might be a useful contribution to the volcano machine learning literature.
Citation: https://doi.org/10.5194/nhess-2024-102-RC1 - AC1: 'Reply on RC1', Manuel Titos Luzon, 17 Oct 2024
RC2: 'Comment on nhess-2024-102', Anonymous Referee #2, 28 Aug 2024
This manuscript can be better written, and its science better executed. As it stands, the manuscript appears too ambitious in scope. The results being presented are insufficient to deliver the intended scientific message, and the methods described lack sufficient details for reproducibility and related discourse.
In this work, the authors put a strong emphasis on using weakly supervised frameworks to improve seismo-volcanic catalogs for eruption forecasting or early warning. While there is novelty in the application of weakly supervised approaches in the context of volcano seismology, far too many details are left out on the seismology front, and the discussion in relating (improved) seismic catalogs to eruption onset or characterization is clearly absent.
Consider the following issues:
1. Throughout the manuscript, the authors utilize catalogues derived from Deception Island and Popocatepetl. Strong words are used to assert their robustness and quality, yet readers lack information on the related monitoring network, duration of observation (Deception Island), and contemporaneous volcanic activity for which each catalog was constructed. There is some passing discussion on "seismic attenuation processes" and "source radiation patterns", as well as the proposed underlying mechanisms behind each signal type from literature, but how do we know for sure if there is no information on the seismic network geometry, source-receiver distance, or eruption style being recorded?
2. Even if we were to assume that the catalogs are 100% accurate in their labels, this does not mean that they are necessarily suited for machine learning applications. When building a classification model, at least some care must be taken to balance the labeled dataset, especially if accuracies are being used as a metric. A perfectly labeled catalog could still be deficient in certain classes which the model hopes to classify. In such cases, the resultant biases need to be more thoroughly discussed.
3. The authors introduce a set of (typical) labels used in volcano seismology, but fail to show clear examples from each dataset until late in the paper. An early figure showing the different classes from each volcanic setting (waveforms and spectrograms) would have been really informative on how the human experts had distinguished the different signal types, and what classifications they are hoping to achieve with their models.
4. Although the algorithm framework is shown in Figure 1, the data pre-processing and feature engineering steps are too opaque. What does the "stream data" entail exactly? Why was a bandpass filter of 1-20 Hz chosen? How many stations are being used to constrain each label? Are we looking at the vertical component only? Is the instrumentation the same at Deception Island and Popocatepetl? What is the feature space here? If the features were indeed learned in the "deep learning" sense, it was not entirely clear to me how they were computed. (An illustrative pre-processing sketch is given after this list of issues.)
5. The authors mention that volcano monitoring and eruption forecasting involves a multidisciplinary approach. However, much of the manuscript is aimed at improving a catalog using machine learning techniques, which only involves the discipline of volcano seismology. The translation of this information into understanding unrest and hazards is absent. If the authors were to show that rapid catalog improvement could result in near-real-time characterization of real volcanic unrest, it would have made a far more convincing case. Unfortunately, this was not done or shown.
6. A key issue in volcano seismology machine learning literature is that volcanoes do not behave uniformly over time. Unrest signatures can vary from eruption to eruption, and from volcano to volcano. One way to make the applicability of this work more convincing could be to show its "temporal transferability" for one volcano in between different eruptive periods, before showing its applicability at a completely different volcano (i.e. "volcano transferability") as the authors have attempted in this work.
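To make the pre-processing questions in point 4 concrete, a typical single-station workflow of the kind being asked about might look as follows. The file name, window length and detrending choices below are assumptions for illustration; only the 1-20 Hz band is taken from the manuscript, and the authors' actual choices of components, stations and windowing are exactly what the reviewer asks to be documented.

```python
# Typical seismic pre-processing sketch (illustrative; the authors' exact
# choices of component, window length and stations are not documented).
from obspy import read

st = read("example_station.mseed")   # hypothetical continuous record
st.detrend("demean")
st.detrend("linear")
st.filter("bandpass", freqmin=1.0, freqmax=20.0, corners=4, zerophase=True)

# Slice the stream into fixed-length windows that a classifier would label.
windows = []
window_length = 4.0                  # seconds; assumed, not taken from the paper
for tr in st:
    t = tr.stats.starttime
    while t + window_length <= tr.stats.endtime:
        windows.append(tr.slice(t, t + window_length))
        t += window_length
```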
As the manuscript stands, it seems more suited for a journal like IEEE, where novel applications of machine learning techniques are discussed. In the context of NHESS or any other Earth Science journal, I would hope to see a more rigorous discussion of (1) the labeled dataset, (2) the different ML architectures, (3) the contextual volcanic unrest for which the seismic signals are observed, and (4) the relation between seismic catalogs and eruption forecasting.
Citation: https://doi.org/10.5194/nhess-2024-102-RC2 - AC2: 'Reply on RC2', Manuel Titos Luzon, 17 Oct 2024
RC3: 'Comment on nhess-2024-102', Anonymous Referee #3, 29 Aug 2024
This paper deals with the automatic classification of seismic signals in volcanic environments. The authors suggest that weakly supervised machine learning approaches can be used to improve the detection and classification of signals, in comparison to direct transfer learning methods. Although the subject is very interesting, I agree with previous reviewers that the work must be improved before being considered for publication.
1) As stated in RC2, the manuscript lacks a description and a discussion of the phenomena to put the results in the perspective of volcanic monitoring and eruption forecasting. In its present form, you focus on signal classification, and not on the understanding of "volcano dynamics" as stated in the conclusion (l.455). A more detailed description of the data acquisition and catalog construction methodology is also missing.
2) In the introduction, the authors mainly rely on their own publications for transfer learning approaches (e.g. "Based on our experience", l.86), but are they really the only team working on transfer learning methods for seismic signal classification? It is also not clear to what extent this paper is novel compared to previous works on the same subject by the authors' research team (citations l.86), or to other applications/studies in the literature. The literature review in the introduction is also, in my opinion, lacking key elements, in particular regarding existing fully unsupervised approaches, which have proven effective in volcanic contexts (e.g. Steinmann et al. (2024). Machine learning analysis of seismograms reveals a continuous plumbing system evolution beneath the Klyuchevskoy volcano in Kamchatka, Russia. JGR: Solid Earth, 10.1029/2023JB027167). What are the limitations of fully unsupervised machine learning? How is your approach complementary? Similarly, you only refer to recent publications on early-warning systems based on seismic monitoring (Rey-Devesa et al., 2023): you must be more explicit on the different approaches and use more references.
3) The description of the catalog must be improved. As suggested by other reviewers, you need to explain more clearly what the different classes of seismic events correspond to, both in terms of physical processes and of the features used for classification. Their names must be homogenized (e.g. use only GAR or BGN for background noise), and you should show in a single figure/table how many events of each class there are in the two catalogs. As this is not done, it is sometimes difficult to understand your results (e.g. what are the 5 and 7 seismic categories used in Table 2?). In the same perspective, the features used for classification must be given, as well as the methodology used to compute them. It is also not clear to me whether the catalogs associate labels with successive, constant-length time windows over the full signal, or with time windows of various lengths, defined manually and corresponding to specific events.
4) The methodology of the weakly supervised learning must be more clearly explained, at least in an Appendix. It is not clear how the assumptions stated in l.236 to 241 are important, and how the results can be interpreted if they are not verified (l.242-243 -> are the marginal distributions indeed the same? l.247-249: I don't understand the logical link suggested by "therefore" between the assumptions and the possibility of using weakly supervised learning). Figure 1 must also be improved. In particular, the iterative refinement process is not displayed. Following remark 3), it is also not clear in the Figure what the signal in B) corresponds to: a portion of the signal identified manually in the catalog, or continuous data? For the same reason, it is not clear to me what the "dataset" mentioned in the text and in D) corresponds to. You should also explain how you define the threshold used for the drift adaptation method (l.266), and how you choose to stop the iterative refinement (what is the "desired result", l.270?). More generally, there are many technical terms that could be clarified for non-expert readers (e.g. "self-consistency" (equivalent to accuracy?), "softmax", "argmax softmax", "confusion", "accuracy", ...); a small numerical sketch of some of these terms is given after this list.
5) The presentation of the Results can also be improved. As mentioned above, since the event classes are not clearly defined, it is not always easy to understand the results. Regarding the cross-validation: you use it for the direct TL approach, but not for the weakly supervised TL; why? Besides, isn't it interesting to look at the variations of the accuracy, in addition to the mean accuracy, to see whether the learning is stable? Table 5 must be presented and discussed in more detail: it is referred to only in the discussion and after a reference to Table 6.
6) A major argument of your work is that catalogs can be biased, and that the accuracy of ML techniques should thus not be the only criterion of a classifier's efficiency. Although this is worth saying, you say it repeatedly; e.g. l.360 to 375 is only about this point and only repeats what is already said in the introduction: I don't see what is new in this paragraph. Another unclear point is that you present the Popo2002 catalog as a high-quality catalog, but then suggest that some VTE, LPE and TRE are misclassified (l.396-708). Then, you refer to differences between the catalog and your classification as "an 'error' that was not really an error" (l.404), but as I understand it, this is based only on the judgment of "a geophysical expert" (l.406). Why is this judgment more reliable than the classification obtained thanks to the "quality of the human team" mentioned l.103?
7) You state l.355 that you have "verified" that weakly supervised approaches could "significantly enhance the detection and identification capabilities". However, it is not clear what the enhancement refers to. In comparison to what? How do you quantify the enhancement? Thus, I don't think the Results section correctly illustrates this sentence. As a matter of fact, Table 4 shows that a weakly supervised approach improves the accuracy in comparison to a direct application of the MASTER-DC classifier to the Popo database, but then you state that the accuracy is not necessarily a good indicator of a classifier's efficiency. On the contrary, if you consider the accuracy as a robust indicator, then the classic TL approach yields better results than the weakly supervised approach (compare Tables 2 and 4).
8) Another argument you put forward is that the weakly supervised approach allows more events to be detected than appear in the catalogue. However, this is expected, as you apply your classifier to more time windows than the Popo2002 catalogue contains (2139 labelled events in the Popo catalogue versus more than 20,000 time windows labelled with your classifier). Besides, the number of labelled events differs depending on the classifier (compare the sums of the columns in Table 6); why is that so? The real question, which I think you do not fully answer in your paper, is: do weakly supervised classifiers detect more events, and in a more robust way, than classical ML methods and direct TL approaches?
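As an aid to point 4), the following numerical sketch shows what "softmax", "argmax softmax" and a confidence threshold usually denote in pseudo-labelling. The logits and the threshold value are invented for illustration and do not come from the manuscript.

```python
import numpy as np

def softmax(z):
    """Convert raw network outputs (logits) into class probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.1, 0.3, -1.0, 0.5])   # invented network outputs for one window
probs = softmax(logits)                     # ~[0.71, 0.12, 0.03, 0.14]
pseudo_label = int(np.argmax(probs))        # "argmax softmax": the most likely class (0 here)
confidence = float(probs.max())             # ~0.71

threshold = 0.8                             # assumed drift-adaptation threshold
accept = confidence >= threshold            # False here: the window would be left unlabelled
```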
For these reasons, I suggest the authors thoroughly review their work before considering it again for submission. Their work is of great interest and importance. However, its implications, both in terms of pure classification problems and in terms of volcano monitoring, are not sufficiently investigated.
Citation: https://doi.org/10.5194/nhess-2024-102-RC3 - AC5: 'Reply on RC3', Manuel Titos Luzon, 17 Oct 2024
RC4: 'Comment on nhess-2024-102', Anonymous Referee #4, 06 Sep 2024
The authors have taken on an interesting and challenging topic, using machine learning to classify volcanic seismic signals. However, there are several important areas where the paper needs improvement to better explain the methods and show how this research can be useful for volcano monitoring and eruption prediction.
- The paper focuses mainly on classifying seismic signals but does not clearly explain how this helps us understand volcano activity or predict eruptions. The authors should add more detail about how these results can be used in real-world volcano monitoring systems, including other important data types like geodetic (ground movement) and geochemical data. This would make the study more useful for predicting volcanic hazards.
- The paper does not provide enough information about the seismic data used. The authors should explain more clearly how they collected the data, what each type of seismic event means, and how the events were labeled. A table or figure showing how many events of each type were found would help the reader understand the data better. The authors should also show examples of different signal types earlier in the paper to make it clearer how the classification works.
- The paper lacks detail about how the data was processed. For example, the authors mention using a bandpass filter (1-20 Hz), but they do not explain why. They should also explain which components of the signal were analyzed (e.g., vertical component) and whether the same stations and equipment were used for both volcanoes. Providing these details will make the study more transparent.
- The authors suggest that their method can detect more seismic events, but they don’t provide enough examples of how this would help in real-time volcano monitoring. It would strengthen the paper if they could show how these improved classifications lead to better volcano hazard assessments or warnings. Additionally, volcanoes can behave differently over time. It would be useful to see if the model works well over different eruption periods or at different volcanoes.
Citation: https://doi.org/10.5194/nhess-2024-102-RC4 - AC4: 'Reply on RC4', Manuel Titos Luzon, 17 Oct 2024
RC5: 'Comment on nhess-2024-102', Anonymous Referee #5, 06 Sep 2024
The Authors apply different machine learning techniques to create, from a database of daily seismic recordings (if I understood well) of a certain volcanic area (Deception Island volcano), the seismic catalogue of another volcanic area (Popocatepetl volcano), labelling the type of event following some criteria (not adequately and quantitatively described). To do this, the Authors use a high-quality database of seismic events, already labelled by human supervision and collected in another volcanic area. The purpose is to reduce or eliminate the use of human work for labelling the seismic events in the catalogue. This purpose is very important and interesting: the growth of seismic networks in recent years has had the undebatable advantage of improving seismic monitoring both in volcanic and in tectonic areas, but at the same time it has increased the amount of data to be analysed by seismologists. So an automatic system able to detect and label seismic events in a volcanic or tectonic context can be very useful, if the system is reliable, and it will reduce human supervision to a minimum or even eliminate it altogether, working as a human would do.
The Authors conclude that the three ML approaches produce different results and that all of them are able to detect a number of events much greater than in the existing catalogue (except in one case), and I think that this is a very important and intriguing result.
I appreciate the work and the idea, but the manuscript has many problems that I try to list.
Reading the description of the work done, it is not clear how a researcher could verify the results and replicate the work with their own database. The description of the method is very confusing, and only someone who has already used the same techniques could follow and understand the steps. Moreover, the description both of the method and of the data used is only qualitative and discursive, never detailed and quantitative.
The manuscript is full of unnecessary repetition, and the Authors should make an effort to re-read the manuscript and be more concise. As examples: Lines 281-294 repeat something already written in the manuscript, and the paragraph in Lines 170-184 should be moved to the Introduction section and rewritten to avoid repetition.
The same concepts are referred to with different acronyms. As an example, the three techniques employed to achieve the purposes of the manuscript (specified in Lines 184-186) are referred to as ANN or as ML in different parts of the manuscript, generating great confusion among all the acronyms. I suggest simplifying and reducing the use of acronyms to those strictly needed.
Regarding the used database, the Authors use a qualitative language that does not help to understand. Some examples:
Line 115. Where can the original labelled database be consulted? Has it already been released?
Line 134. What do the Authors mean by "subset of data considered the most reliable"? Which criteria did they use for reliability?
Line 139. As above, what do the Authors mean by "the most representative and of the highest quality"? How do they measure quality?
Lines 149-154. Where can the Popocatepetl 2002 catalogue be found? A citation is missing here.
Line 152. What are the classes of events? Adding a table would help.
Line 229. What does the phrase "The target domain (denoted as Dt) is the Popo2002 dataset (whose available seismic catalogue will not be considered)" mean? At line 149 the Authors state that Popo2002 consists of 4,883 events; what type of events are they? I understand that the data used for the target domain are a subset of the Popo2002 data, collected excluding earthquakes. Is that correct? I think the Authors should be more concise and clearer in describing both the technique and the data used.
In conclusion, I suggest publication after a deep rewriting of the manuscript that does justice to the work done, makes it understandable also to those who have never used the specified ML techniques before and makes the proposed method replicable for other interested scientists.
Citation: https://doi.org/10.5194/nhess-2024-102-RC5 - AC6: 'Reply on RC5', Manuel Titos Luzon, 17 Oct 2024
RC6: 'Comment on nhess-2024-102', Anonymous Referee #6, 06 Sep 2024
This is a solid paper that reports on systematic evaluation of volcanic earthquakes using several machine learning (ML) techniques. I should state at the outset that I am a volcano seismologist with more than a decade of experience, but I have never directly used any ML or AI techniques. Hence my comments are of a more general nature.
The approach in the paper is thorough. The results are repeatable, which is good. The results also show that it is possible to get more out of the data, which is always welcome. The procedures are efficient, so it is possible, in principle, to obtain similar results in much less time than it takes an experienced geophysicist to manually process the data. But this brings up a philosophical point: what is reality? Manual or ML? I would think that a manual effort by an experienced person would be the benchmark, and ML results would be judged relative to them. The paper mostly does this, with a few exceptions.
The paper is mostly well written. Here are a few corrections keyed to line numbers:
36 – the V in VLP stands for very, not ultra
44 – is this frequency in Hz or frequency of occurrence?
51-52 – inconsistent use of parentheses ( )
143 – confront? Odd word choice
184-186 – at this point I had a hard time keeping all the acronyms in my head. I suggest adding a table of acronyms.
233 – spacing
Table 2 – are all values percentages? Needs better labeling
Table 3 – add bolder vertical lines between three main sections; spell out the abbreviations in notes at the bottom of the table (Tables should stand alone)
Table 5 – same comments as Table 3
357 – used
381 – “unbiased” but how determined? This is the place that made me rethink the question of what is reality, as described above.
Table 6 – is all the time with no events equal to the background?
Overall, the paper is in good shape and is suitable for publication with minor revisions as indicated above.
Citation: https://doi.org/10.5194/nhess-2024-102-RC6 - AC7: 'Reply on RC6', Manuel Titos Luzon, 17 Oct 2024
RC7: 'Comment on nhess-2024-102', Anonymous Referee #7, 09 Sep 2024
Summary
I have now had the opportunity to read and review the manuscript “How can seismo-volcanic catalogues be improved or created using robust neural networks through weakly supervised approaches?”, in which the authors use machine learning techniques and a dataset from Deception Island as the master catalog to create and compare a new catalog for Popocatepetl in Mexico. While there are a lot of caveats and author interpretations in this research, the science, information and methods are interesting. The manuscript shows modest progress in ML techniques that can be used as the basis for future research. Below I list a few major comments for the review along with some line-by-line comments. Additionally, I would like to make a note about the subject matter: I feel this research would be more suited to a different journal, and I was a bit surprised when I saw this manuscript was submitted to Natural Hazards and Earth System Sciences.
Major comments:
-What about other signals when building the model? There is a lot of source noise in volcanic terrains; how do these methods work when you introduce, for example, mass flows, edifice collapse, rock falls, ballistics, etc.? In the same train of thought, how did leaving these out affect the outcomes? Furthermore, how about teleseismic earthquakes; how does the classification work on these?
-“Early warning” is capitalized on line 24, but is not anywhere else, stay consistent throughout.
-Have you looked at the source depth of the signal? Differing characteristics can occur depending on depth, so you may have a problem similar to the attenuation issue.
-I think the length of the training dataset is too short; how can we get a sense of what goes on at a volcano from just two months of data? Similarly, please explain why you are using a pre-eruptive model on a volcano that is in a phase of unrest. The signal characteristics are going to be different, as are the types of signals, as I mentioned before.
-Some acronyms do not match, I tried to correct some in my Line-by-line comments, but it got too out of hand. A good example is the constant change between VT and VTE.
-While the frequency band of 1-20 Hz is fine, I am wondering about the difference between sensors. This paper does not mention any details about the sensors. What is the sampling rate of each sensor? Are they all the same, or are they different at different volcanoes? The details about each sensor are very important in knowing which frequency range can be used. Furthermore, are the sensors broadband or not? Is every signal from the vertical component? If so, how about using the horizontal components?
-How did you choose which time window to use? What if there is a signal longer than 4 seconds, e.g. tremor, mass flow?
-You only train on one volcanic environment or master. I would like to see what the results would be if you used multiple environments from different volcanoes to make the master.
-Most of the text in the methods section should be in the introduction. I suggest making a section in the introduction describing different kinds of methods people used in the past and then in the methods, explain the techniques you used for this research. Most of everything before section 3.1 should be in the introduction.
-I would like to see some comments about computing power and time. Some ML models and processes take a lot of computing resources as well as extended processing times. I would like to see a paragraph discussing these stats in the manuscript. What would I need to reproduce or do a similar computation at my observatory?
-I would like to know how much human work or time goes into creating this new catalog. Since it is a supervised learning technique, you still need human input and review, so how much time/effort are we gaining?
-In Lines 400-408: The training mislabeled tremor events, and you say this error was not actually an error; how can this be? The algorithm mislabeled, which means it did not work. Furthermore, reading your explanation further signals that this technique cannot be applied universally across different volcanoes. The attenuation effects you mention point to the fact that this would be difficult to do universally. A human had to go back in and review every event to make sure the event was labeled correctly, so how does this save time, or how is it a better option?
-I would like to see more one-on-one comparison statistics in reference to Table 6. It is great that the algorithm found more events, but how many of the catalog events did it find, and how many of the “human” events did it miss? Also, how many of these “new” events are real? Do the humans perform better for certain signal types and vice versa? How does each signal classification compare to the others? (A minimal per-class comparison sketch is given after these major comments.)
-Some paragraphs are quite repetitive; try to go over the manuscript and cut some of this out.
-A point on universality, every volcano is different even in the pre-eruption context of this manuscript. Some volcanoes do not even display signs of activity before erupting, so how can these ML techniques be considered universal at this point?
-References are not in alphabetical order
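One way to obtain the one-on-one statistics requested in the comment on Table 6 is a per-class comparison restricted to time windows that both the human catalogue and the model have labelled; the recall of each class then answers "how many of the catalog events did it find". The sketch below uses scikit-learn with short hypothetical label lists and is not taken from the manuscript.

```python
# Per-class agreement between human catalogue labels and ML labels on the
# same set of time windows (hypothetical toy labels, illustrative only).
from sklearn.metrics import confusion_matrix, classification_report

classes = ["VT", "LP", "TRE", "BGN"]
y_catalogue = ["VT", "LP", "LP", "BGN", "TRE", "VT"]   # human labels
y_model     = ["VT", "LP", "TRE", "BGN", "TRE", "LP"]  # ML labels

print(confusion_matrix(y_catalogue, y_model, labels=classes))
print(classification_report(y_catalogue, y_model, labels=classes, zero_division=0))
```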
Line by line comments:
Line 16: delete “remarkable”
Line 25: delete "accomplish”
Line 31: is the reference the same as line 39? McNutt
Line 36: change “ultra” to “very”
Line 39: change “debris flows” to “mass flows”
Line 40: there are 2 “)” after “1974”
Line 50: there are 2 spaces between “Neural Networks” and “as Bayesian”
Line 50: Bayesian spelled wrong
Line 52: check Bueno et al citations and reference listed at line 541
Line 52: no Titos et al 2017 in references
Line 211: change “2027” to “2017”
Line 289: why is Transfer Learning capitalized in some spots but not in others?
Line 312-313: use both “VTE events” and “VT events”, switch to just “VT”
Line 314: delete the “E” after the “LP”
Line 345: delete “dramatically”
Line 357: change “use” to “used,”
Line 357: delete "significantly”
Line 358: change “being” to “be”
Line 389: delete one of the “)” after fig.3a
Line 391: same as line 389
Line 393: delete the “E” after “VT”
Line 415: delete “remarkable”
Line 445: I suggest changing “several monitoring systems”; only seismic is used
Line 575: no year for publication
Figure 2: label points the same as in text or spell out each
Figure 4: please properly label each signal instead of the legend in the corner
Citation: https://doi.org/10.5194/nhess-2024-102-RC7 - AC3: 'Reply on RC7', Manuel Titos Luzon, 17 Oct 2024
RC8: 'Comment on nhess-2024-102', Anonymous Referee #8, 20 Sep 2024
General Comments:
The manuscript presents a highly relevant approach that combines machine learning with weakly supervised methods for seismic-volcanic event detection. The application of these techniques to geophysical event detection is an exciting and promising field of study, and I commend the authors for their effort in tackling such a complex problem. The subject matter is particularly valuable given the growing interest in leveraging machine learning models for natural hazard monitoring, and the use of weak supervision opens new possibilities for working with limited labeled data, a common challenge in seismology and volcanology.
However, while the approach is interesting, the manuscript, in its current form, requires substantial rewriting to improve clarity, structure, and the strength of its arguments. There are several critical issues that need to be addressed before the manuscript can be considered for publication:
- Methodology Section Reconstruction: The methodology section lacks sufficient clarity and structure. Key concepts such as UMAP, the Leave-One-Out cross-validation method, and the iterative processes involved in the pseudo-labeling task are either insufficiently explained or poorly integrated into the overall narrative. The methodology needs to be rewritten to clearly define these elements and their role in the overall framework, ensuring that readers can follow the steps taken in the model development and evaluation process. (An illustrative sketch of UMAP and Leave-One-Out cross-validation is given after this list.)
- Justification for Using a Single Dataset in Transfer Learning: The authors attempt to justify the use of a single dataset in their transfer learning approach, but the arguments presented are not convincing. As the authors themselves note, ‘it could change when using a different test dataset,’ suggesting that model performance may not generalize well to other geological settings. The authors need to make a stronger case for why the use of a single dataset is valid for this weakly supervised learning approach. Ideally, the manuscript should explore the potential limitations of this approach or, alternatively, incorporate multiple datasets from different volcanic settings to demonstrate broader applicability.
- Overall Structure and Writing Quality: The manuscript, though scientifically significant, suffers from poor structure and unclear writing, which detracts from its scientific contributions; this has resulted in several instances where key ideas are poorly expressed or ambiguously presented. A thorough revision of the manuscript is needed to ensure that the concepts and findings are communicated effectively. I suggest the authors consider restructuring the entire manuscript to enhance readability, focusing particularly on tightening the introduction, improving transitions between sections, and making the arguments in the discussion more robust.
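For readers unfamiliar with the two tools named in the first point above, the following short sketch shows what UMAP projection and Leave-One-Out cross-validation typically look like in code. The feature matrix, labels and classifier are random placeholders and do not reflect the authors' configuration.

```python
# Illustrative UMAP embedding and Leave-One-Out cross-validation
# (random placeholder data; not the authors' configuration).
import numpy as np
import umap                                            # umap-learn package
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))                          # 60 windows, 16 features
y = rng.integers(0, 3, size=60)                        # 3 invented classes

# 2-D embedding for visual inspection of class structure.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
print("UMAP embedding shape:", embedding.shape)

# Leave-One-Out: each sample is held out once; this becomes expensive for large catalogues.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```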
In conclusion, while the study introduces an interesting and timely approach to seismic-volcanic event detection using machine learning, the manuscript requires significant rewriting to better articulate its methodology and address critical gaps in the explanation of its approach. I recommend a major revision to enhance clarity, strengthen the justification for key methodological choices, and improve the overall presentation of the research.
Specific comments & Technical corrections:
- 1. Introduction.
- line 50: "Bayesian" misspelled
- line 99: references for the master dataset?
- lines 99-100: "has already been successfully applied in different DL architectures"; references?
- line 102: references for the Popo dataset? And why is it of high quality?
- line 105: It would be very useful to provide more information about the volcanic dynamics observed in the proposed datasets, especially as machine learning developments and methodologies are evolving to incorporate physics-based input.
- 2. Seismic data and catalogues.
- line 125: "...on the application of HMM models, etc."; references?
- line 130: "While it is true that not all types of signals are present in this 'Master database', especially those associated with ongoing eruptive processes." So perhaps it would be important to have a master dataset that includes this information as well. It is crucial to incorporate datasets representing different stages of volcanic unrest and to clarify which specific stages the machine learning models are most useful for.
- line 145: A more detailed description of Popocatépetl’s volcanic activity is needed, including its cyclical behavior of effusive activity, dome formation followed by explosive events, tremor signals, and other relevant features.
- line 148: Are there any references available for this group of geophysicists or their work?
- Table 1: nice.
- Data & sensors: It would be ideal to provide a clearer explanation of the types of instruments being used, including whether all components are available, sampling, etc., as well as details on the sensors. For example, are all instruments capable of measuring all types of events in both datasets? Nowadays, seismic networks are densified with a combination of broadband and short-period sensors, which may influence data quality, coverage, and distance to volcanic sources. The proximity of sensors to the volcanic source is critical, as it directly affects the resolution and accuracy of the recorded data.
- 3. Methodology.
- lines 234 - 246: about marginal and conditional distributions: a need for clarity:
The authors' explanation regarding the assumptions of marginal and conditional distributions in the pseudo-labeling task could benefit from greater clarity. Specifically, they state that the marginal distributions of the source and target domains are assumed to be the same (P_s(X_s) = P_t(X_t)), perhaps implying that the input features (seismic windows) in both domains are similarly distributed. They also assume that the conditional distributions of the source and target domains are the same (Q_s(Y_s | X_s) = Q_t(Y_t | X_t)), suggesting that the relationship between input features and event types is identical across both datasets.
Key challenge and potential problem:
The text acknowledges that while the marginal distributions of the input features may be the same, the conditional distributions might differ between the source and target domains. This introduces a key challenge: even though seismic signals may “look similar” across different datasets (i.e., the marginal distributions are similar), the relationship between these signals and the seismic events they represent (i.e., the conditional distribution) may vary.
This discrepancy can create a potential problem when using pseudo-labeling and transfer learning techniques. If the model is trained assuming that the conditional distributions are the same, it may misclassify events in the target domain, especially if the seismic signatures there correspond to different types of events than in the source domain. This issue could result in reduced accuracy and reliability of event detection in the target domain, undermining the effectiveness of the model’s generalization.
Conclusion: The Need for Diverse Datasets
This challenge is crucial because it highlights a potential flaw in the transfer learning approach: the assumption that conditional distributions are the same across different volcanic settings may not always hold. To address this, it may be necessary to collect and incorporate datasets from a wider range of volcanic regions, where the relationships between seismic features and event types can vary. Doing so would enable the development of more robust models that can better generalize across domains, improving the accuracy and reliability of event detection in different geological contexts. This would strengthen the use of transfer learning techniques and ensure that models are more adaptable to varying volcanic behaviors.
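Written out compactly in the notation quoted above, the two assumptions under discussion are:

```latex
\begin{align}
  P_s(X_s) &= P_t(X_t) && \text{(marginal distributions of the input windows)}\\
  Q_s(Y_s \mid X_s) &= Q_t(Y_t \mid X_t) && \text{(conditional distributions of labels given inputs)}
\end{align}
```

The concern expressed here is that the second equality may fail between volcanic settings even when the first approximately holds, which is precisely the situation in which pseudo-labels transferred from the source domain become unreliable.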
- Figure 1: the figure is of poor quality in the PDF file. Do steps A, B, etc., correspond to the actual process in your proposed methodology?
- line 276: reference missing.
- 4. Results.
- line 291: review grammar (“..using as training..”)
- line 302: The text on self-consistency should be explained and included in the methodology section (‘We apply the Leave-One-Out cross-validation method’).
- line 329: Should Section 4.3 be renumbered as Section 4.2?
- line 343: These iterations need to be clearly specified in the methodology section, as you mention the goal of “until a reliable catalog is achieved”.
- Line 344: The authors mention that “however, it could change when using a different test dataset” which highlights an important point regarding model generalization. While their approach is based on a single dataset, this raises questions about its robustness across varying geological settings. To truly validate the effectiveness of the model, it would be crucial to demonstrate its performance using multiple datasets from different volcanic environments. By doing so, they could provide stronger evidence that the model can generalize across diverse conditions, rather than being tailored to a specific dataset. The authors need to convincingly argue why relying on a single dataset is sufficient, or alternatively, why incorporating multiple datasets might be necessary for ensuring broader applicability.
- 5. Discussion.
- line 357: It would be helpful to clarify the phrase “when effectively use” throughout the text to strengthen the main arguments. Perhaps the grammar could be reviewed in that sentence.
- line 365: Should Fig. 1 be renumbered as Fig. 2?
- Figure 2: UMAP should be introduced in the methodology section and connected to the general objectives.
Citation: https://doi.org/10.5194/nhess-2024-102-RC8 - AC8: 'Reply on RC8', Manuel Titos Luzon, 17 Oct 2024