the Creative Commons Attribution 4.0 License.
Development of a seismic loss prediction model for residential buildings using machine learning – Ōtautahi / Christchurch, New Zealand
Quincy Ma
Pavan Chigullapally
Joerg Wicker
Liam Wotherspoon
Download
- Final revised paper (published on 22 Mar 2023)
- Preprint (discussion started on 01 Sep 2022)
Interactive discussion
Status: closed
- RC1: 'Comment on nhess-2022-227', Zoran Stojadinovic, 22 Sep 2022
1. The overall quality of the preprint (general comments)
The overall quality of the preprint is good. The topic of mapping a building representation directly to monetary compensation is exciting and significant for the science community. The research is well structured and explained.
However, since the dataset is substantially large and the prediction task relatively easy (three broad classes), this reviewer expected somewhat better results in the confusion matrix. It therefore seems reasonable that the model could improve with some adjustments.
2. Individual scientific questions/issues (specific comments)
Here are some suggestions to improve the model and preprint:
a) Dataset.
For better prediction results, the authors should preserve (and demonstrate) the original data distribution from the initial dataset when merging and filtering instances. For the same reasons, this reviewer believes that it is necessary to include the class of undamaged buildings in the dataset (with 0$ compensations), unavoidable when mapping damage states.
b) Building representation.
For a better presentation of the mapping problem, the authors should show the data distribution for more features (like in Figure 6 for construction type). Surprisingly, the height of buildings isn’t included in the building representation even though it captures the dynamics of the building. A feature like “number of floors” could possibly be informative.
c) More discussion on the confusion matrix would be helpful. For example, how do the authors explain the accuracy difference between b) and c) in Figure 12? Does it have to do with PGA ranges of earthquakes? The worse prediction seems to be "Predicted Medium / Actual OverCap". How to explain this?
d) It would be interesting to evaluate the prediction accuracy for the sum of compensations for all buildings. It is reasonable to expect good prediction accuracy for total cost since errors would cancel out each other. But it is difficult to perform without precise “OverCap” values.
e) Finally, how do the authors evaluate the usefulness of the research and model implementation for new earthquakes? Namely, what about the changing value of money over time and frequent changes in market prices? How to implement the model without the class of undamaged buildings (this version could work just if damaged buildings were pre-selected)?
3. Technical corrections
There is a significant number of needed technical corrections. Some examples are highlighted in the attached file. The authors should carefully check the paper.
AC1: 'Reply on RC1', Samuel Roeslin, 22 Nov 2022
1 a) Dataset.
For better prediction results, the authors should preserve (and demonstrate) the original data distribution from the initial dataset when merging and filtering instances.
For the same reasons, this reviewer believes that it is necessary to include the class of undamaged buildings in the dataset (with $0 compensation), which is unavoidable when mapping damage states.

Thanks for the comment.
This model has been developed for insurance purposes, with the aim of helping the Earthquake Commission (EQC) understand the possible loss distribution across Christchurch for any future earthquake. EQC’s interest is concentrated on damaged buildings for which a claim might be lodged. It should be noted that the data used to train the machine learning (ML) model pertains to buildings for which one or more claims were lodged as part of the Canterbury Earthquake Sequence (CES). Obtaining detailed information on undamaged buildings was not part of the scope. Thus, extracting reliable data for an undamaged category is not straightforward. Simply assuming that a building was not damaged because no EQC claim was lodged is not satisfactory, as it is not possible to confirm that the remaining buildings had proper insurance coverage. Moreover, for some buildings that suffered slight damage, the owners might have decided to cover the cost of repairs themselves to avoid paying the excess.
Nevertheless, following the suggestion, we tried to include a fourth category for undamaged buildings and explored its influence on the ML model performance. The EQC data includes a few instances with zero compensation (BuildingPaid = NZ$0). Figure i shows the number of instances for 4 Sep 2010 and 22 Feb 2011. At 7%, the number of such instances is very limited compared to the “low” and “medium” categories. Despite the low number in the no-damage category, a new machine learning model was trained (accounting for class imbalance). Figure ii shows the confusion matrix for the Random Forest algorithm trained using four categories. It can be seen that the overall accuracy dropped and that the model has difficulty making predictions for the zero-damage category. It was thus decided to keep only three categories (low, medium, and overcap).

Figure i: Number of instances in the BuildingPaid categories in the filtered dataset, including the category with zero compensation: (a) 4 Sep 2010, (b) 22 Feb 2011
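As a rough sketch of this retraining with a fourth class, the following uses entirely synthetic data (feature values, class proportions, and names are placeholders, not the EQC dataset) to show one common way of accounting for class imbalance in a Random Forest, via class weighting:

```python
# Hypothetical sketch: retraining a Random Forest with a fourth
# "zero" damage class, using class weighting to counter imbalance.
# Data and proportions are illustrative, not the EQC dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 9))  # nine attributes, as in the paper
# Imbalanced labels: "zero" is only ~7% of instances
y = rng.choice(["zero", "low", "medium", "overcap"], size=n,
               p=[0.07, 0.50, 0.33, 0.10])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te),
                      labels=["zero", "low", "medium", "overcap"])
print(cm)
```

Even with weighting, a class this small often remains hard to predict, which is consistent with the drop in performance observed above.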
Figure ii: Confusion matrix for Random Forest algorithm including the category with BuildingPaid = 0
2 b) Building representation.
For a better presentation of the mapping problem, the authors should show the data distribution for more features (like in Figure 6 for construction type).
Surprisingly, the height of buildings isn’t included in the building representation even though it captures the dynamics of the building. A feature like “number of floors” could possibly be informative.
Re: data distribution
Thanks for the suggestion. A new table has been added showing the nine attributes used in the model. The table gives information about the type and distribution of each attribute.

Re: height of the buildings
Thanks for the remark. The authors are aware that building height (or number of storeys) has been included as an attribute in ML models in similar studies (Ghimire et al., 2022; Harirchian et al., 2021; Mangalathu et al., 2020; Stojadinović et al., 2022). The non-inclusion of building height in this study was dictated by the availability of the information in the dataset, not by deliberate choice. While the original EQC data has an attribute for the number of storeys, the value is missing for many instances. Figure iii shows the number of instances available for each storey category. It can be seen that for 4 Sep 2010 and 22 Feb 2011, storey information is missing for 58% and 93% of the instances respectively. The available information on building height is thus very limited. Where available, the data for 4 Sep 2010 shows that 10,37 buildings have one storey, 2,677 two storeys, and 140 three storeys. Similarly, for 22 Feb 2011, 1,522 of the buildings have one storey, 727 two storeys, and 91 three storeys. Selecting only instances where the number of storeys is known would have been very limiting for training an ML model. It should be noted that EQC cover is limited to residential dwellings. This study is thus limited to residential buildings, which in Christchurch are mostly one-storey houses.
Accurate information on the number of storeys could not be obtained from the RiskScape database either. The attribute ‘Storeys’ is reported as a float, which appears to have been calculated by dividing the building floor area 'BLDGFL_1' by the building footprint 'FOOTP_1'.
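To illustrate why such a derived float is unreliable, the following sketch reproduces the computation with made-up floor areas and footprints (the column names follow the RiskScape metadata; the values are invented):

```python
# Illustrative sketch of how a float 'Storeys' value can arise from
# floor area divided by footprint. Values are invented, not RiskScape data.
import pandas as pd

df = pd.DataFrame({
    "BLDGFL_1": [185.0, 240.0, 95.0],  # total floor area (m^2)
    "FOOTP_1":  [120.0, 110.0, 95.0],  # building footprint (m^2)
})

# The ratio rarely lands on an integer for real buildings with garages,
# partial upper floors, or digitising error in the footprint polygon.
df["Storeys_float"] = df["BLDGFL_1"] / df["FOOTP_1"]
df["Storeys_rounded"] = df["Storeys_float"].round().astype(int)
print(df)
```

Rounding such a ratio can easily assign two storeys to a single-storey house with a large attached garage, which is why the derived value was not trusted here.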
To retain high data accuracy, it was decided not to include the number of storeys from the EQC dataset in this study. This has been clarified in section 3.2 of the paper.

Figure iii: Number of instances for each storey category (EQC data): (a) 4 Sep 2010, (b) 22 Feb 2011
3 c) More discussion on the confusion matrix would be helpful. For example, how do the authors explain the accuracy difference between b) and c) in Figure 12? Does it have to do with PGA ranges of earthquakes? The worst prediction seems to be “Predicted Medium / Actual OverCap”. How to explain this?

Thanks for the comment.
We are still in the process of retraining the ML model with the suggestions provided by both reviewers. Section 8 of the paper will be updated.

4 d) It would be interesting to evaluate the prediction accuracy for the sum of compensations for all buildings. It is reasonable to expect good prediction accuracy for total cost since errors would cancel each other out. But it is difficult to perform without precise “OverCap” values.

Thanks for the suggestion.
Building claims larger than NZ$100,000 (+GST) were handled by private insurers. Unfortunately, private insurers were not inclined to make their data available for this research work. As mentioned in paragraph 5.2, the current data is thus soft-capped at NZ$100,000 (+GST), making the analysis of total costs impossible with the currently available data.

5 e) Finally, how do the authors evaluate the usefulness of the research and model implementation for new earthquakes? Namely, what about the changing value of money over time and frequent changes in market prices?
How to implement the model without the class of undamaged buildings (this version could work only if damaged buildings were pre-selected)?

Re: model implementation for new earthquakes
Thanks for the comment. The ML model presented in this paper was designed to be easily retrainable. This enables the addition of new instances to the training set after the occurrence of a new earthquake. However, one challenge is that claim amounts are not readily available after an earthquake, as on-site assessment of building damage can be spread over a long period of time. To circumvent this issue, building damage should be assessed on a representative set of buildings. This subset, for which the damage extent is known, can then be added to the training set and the ML model retrained. The ML model can then be used to make predictions for the entire building portfolio. This approach is schematically described in a new version of Fig 10.
The selection of a representative training set can be made after an event, based on the effects of the earthquake, or prior to the event, using a predetermined representative subset of the residential buildings in Christchurch. Special care should be taken to ensure that the selected buildings can produce a satisfactory seismic loss assessment at the city level. Recent discussions with experts highlighted the uniqueness of the Canterbury region. They mentioned that the analysis of damage observations across the CES showed that the main earthquake events affected different areas of Christchurch. They thus suggested that the selection of a representative set of buildings should take into account geographical characteristics, the liquefaction setting, and building characteristics.
When expert opinion is not available, similar studies showed that ML could even be employed in the selection of a representative building set (Mangalathu & Jeon, 2020).
The actual selection of a representative set of buildings for Christchurch is beyond the scope of this study.

Re: Change in value of money
Thanks for the remark. The authors agree with the necessity to consider the evolution of the market over time. Here again, the ease of retraining the ML model when new or updated training data becomes available should be highlighted. The authors are aware of the step changes in the value of the EQC cap over time (e.g., since 1 Oct 2022 the cap is NZ$300,000 + GST; Earthquake Commission (EQC), 2022). Nevertheless, Fig 8 showed that for the 4 Sep 2010 earthquake most of the claims fell in the ‘low’ and ‘medium’ categories. Even for the 22 Feb 2011 earthquake, which was unprecedented in the extent of damage caused in the Canterbury region, many claims still relate to the ‘low’ and ‘medium’ categories. It is thus believed that the value of the model lies in its ability to make predictions for those categories (‘low’ reflecting the limit of initial cash settlement consideration, ‘medium’ for buildings with more damage but where claims are still fully addressed by EQC alone, and ‘overcap’ where private insurers come into consideration for higher levels of damage).

Re: no undamaged buildings class
Thanks again for the comment. Please see our reply to comment #1. This ML model has been developed with the purpose of being used in an insurance setting. The focus is thus on being able to predict the possible damage and loss extent for buildings with EQC claims in future earthquakes.

6 Technical corrections
There is a significant number of needed technical corrections. Some examples are highlighted in the attached file. The authors should carefully check the paper.

Thanks for having highlighted those typos. The errors have been corrected.
References
Earthquake Commission (EQC). (2022). EQC Insurance Overview. https://www.eqc.govt.nz/what-we-do/insurance-overview/
Ghimire, S., Guéguen, P., Giffard-Roisin, S., & Schorlemmer, D. (2022). Testing machine learning models for seismic damage prediction at a regional scale using building-damage dataset compiled after the 2015 Gorkha Nepal earthquake. Earthquake Spectra. https://doi.org/10.1177/87552930221106495
Harirchian, E., Kumari, V., Jadhav, K., Rasulzade, S., Lahmer, T., & Raj Das, R. (2021). A Synthesized Study Based on Machine Learning Approaches for Rapid Classifying Earthquake Damage Grades to RC Buildings. Applied Sciences, 11(16), 7540. https://doi.org/10.3390/app11167540
Mangalathu, S., & Jeon, J.-S. (2020). Regional Seismic Risk Assessment of Infrastructure Systems through Machine Learning: Active Learning Approach. Journal of Structural Engineering, 146(12). https://doi.org/10.1061/(ASCE)ST.1943-541X.0002831
Mangalathu, S., Sun, H., Nweke, C. C., Yi, Z., & Burton, H. V. (2020). Classifying earthquake damage to buildings using machine learning. Earthquake Spectra, 36(1), 183–208. https://doi.org/10.1177/8755293019878137
RiskScape. (2015). Asset Module Metadata. https://github.com/samroeslin/samroeslin.github.io/blob/gh-pages/assets/pdf/RiskScape-Asset_Module_Metadata-2015-12-04.pdf initially hosted on https://wiki.riskscape.org.nz/images/7/75/New_Zealand_Building_Inventory.pdf
Stojadinović, Z., Kovačević, M., Marinković, D., & Stojadinović, B. (2022). Rapid earthquake loss assessment based on machine learning and representative sampling. Earthquake Spectra, 38(1), 152–177. https://doi.org/10.1177/87552930211042393
Citation: https://doi.org/10.5194/nhess-2022-227-AC1
RC2: 'Comment on nhess-2022-227', Anonymous Referee #2, 27 Sep 2022
The authors have presented a novel ML-based approach to estimating the loss to buildings after an earthquake in 3 categories - low, medium, and high. As part of model training, the authors have performed spatial data merging between 3 datasets, and only selected the subset of buildings with the least chance of erroneous data attributes. The authors have also focused on the 4 earthquakes in NZ around 2011 with the maximum number of data points. The paper is well structured, and is fairly easy to follow.
However, I found some of the key information about the ML approach missing or confusing in the paper, and have highlighted it below. I believe that the paper would be further improved substantially by adding or clarifying the ML approach. I have listed both my major and minor concerns below.
- The selection of the test set is unclear in the paper. It also appears that the test set has been erroneously used as a validation set. If that is the case, then it is difficult to assess the generalizability of the authors’ conclusions. It would be helpful to clarify how the test set was selected and used in this study. Additional comments regarding test set are also included below with specific line references.
- While it is a suitable approach to only select the 4 events with the highest number of claims during model training, the other events with fewer claims could be used for testing purposes. This would not only ensure that no data leakage occurred between the training and test sets, but also enable the authors to validate the generalizability of their models more effectively.
- It would improve the paper if the authors added their thoughts on some of the potential use cases of this research. While the authors’ conclusions indicate a promise for using ML within this domain, it was unclear how this model and approach could be used in the future. For example, if training data is needed each time an earthquake occurs, then is one of the use cases to manually collect a subset of ground truth data for building losses, train a model, and then apply it widely to the rest of the buildings?
- Further discussion of the model metrics such as recall and precision would be helpful. For example, a recall of only 20% for overcap, and 49% for low loss category indicates that 80% and 51% of these losses, respectively would be missed when implementing this model. Depending on the model’s use cases, this could have a significant impact on the model’s utility. Further discussion of the most appropriate metric (or their combinations), given the model’s use cases would also improve the paper. For example, why was accuracy selected as the primary evaluation metric for choosing the best performing model?
- I appreciated that the authors listed the distribution imbalance of different features, such as construction type. However, the paper could be further improved by adding the model performance in those different feature categories. This would enable the reader to understand in which categories the model performs better than others.
- Given the relatively low performance of the ML model (as highlighted above for recall), adding a section on error analysis would substantially improve the paper. In error analysis for ML, the objective is to identify the cases in which the model does not perform well. This error analysis is often used in ML modeling to improve model performance and generalizability.
- Figure 13 is missing, and appears to be a repeat of Figure 12. Hence, Section 9 - Insights - could not be reviewed.
- It would further improve the paper if the authors added some information about their hyperparameter tuning methodology, and which search strategy they used.
- Line 50 - While the authors are completely correct in the paragraph at line 50, this paper deals with ML for structured data, for which the goal is often to surpass human performance since humans are generally unable to identify all patterns in millions of data points with hundreds of features, often found in these problems. Hence the paragraph does not apply to the ML scope of this paper. It may be suitable to remove the paragraph within the scope of this paper, or change “human-level performance” to “baseline model”, which would be a more suitable term in this case.
- Line 65 - latter -> later
- Line 73 - Suggest adding reference/url for the source of the data.
- Line 83 - It would be helpful to further describe Figure 2. Why is there a difference in the number of claims and buildings?
- Line 95 - I was curious about the accuracy of the Riskscape dataset. For example, are the building characteristics determined statistically from Census data similar to HAZUS in the US, or was it based on collecting data from building records so that it is expected to be fairly accurate? If possible, it would be helpful in the paper to include some information describing Riskscape’s data collection methodology and comment on its expected accuracy.
- Line 115 - Although a reference is provided to the authors’ previous work, it would be helpful to summarize the major reasons for incorrect merging using direct spatial joins within this paper to help understand the issue without having to read the previous work.
- Line 122 - It would be helpful if the authors added the percentage of addresses in each of the 3 categories - 1-1 match with titles, 0-1 match, and many-1 match.
- Line 131 - It would be helpful if the authors added the percentage of RiskScape data that was discarded.
- Line 132 - I was unable to understand the intent described in this paragraph, especially the first and the last sentences.
- Table 1 - The table is very helpful. However, the action taken for 2 points LINZ and 1 point Riskscape was unclear. The above mentioned percentages of data could also be added to Table 1 instead.
- Line 150 - It would be helpful if the authors added the methodology used to merge soil conditions, and liquefaction occurrence with street address. Did they use the same inverse distance weighted interpolation as seismic demand?
- Line 172 - The reason for discarding claims with maximum value lower than or higher than $115,000 is unclear. Is it because this wasn’t possible and hence the data is erroneous?
- Line 240 - It is unclear which event was selected as the test set. From my understanding of line 243, one of the 4 events was selected as test set, and the other 3 events as training+validation sets. However, it also appears from the sentence that in different instances of the model, a different event was selected as a test set so as to determine the most generalizable model. If that is the case, the test set was erroneously used as a validation set, since the model cannot be changed at any point after evaluating its performance on the test set. It would be helpful to clarify the selection of the test set, and ensure that it was only used once at the end to evaluate the performance of the final developed model.
- Line 254 - It would be helpful if the authors added how the min-max scaling was implemented with respect to training, validation, and test sets.
- Line 286 - It is unclear which limitations related to random forest model the authors are referring to.
- Figure 11 - The SVM model does not appear to have been modeled correctly as its output prediction is always the medium category, hence it has been reduced to a trivial model.
- Line 295 - It appears that the model was selected based on the best performing model on the test set. This indicates that the test set was not used correctly, as the model selection can only be done using validation sets. The test set must only be used to show the performance of an already selected model on it.
- Line 326 - While the authors raise an accurate point about the lack of claims information exceeding $115,000, it is not clear how that data could have benefitted this study since the claims have been bucketed and all those claims greater than $115,000 are already expected to be included in the over-cap category.
Citation: https://doi.org/10.5194/nhess-2022-227-RC2
AC2: 'Reply on RC2', Samuel Roeslin, 22 Nov 2022
We would like to thank referee #2 for the detailed and constructive feedback. We are grateful for the thoughtful comments and suggestions regarding the application of machine learning to real-world data. We are currently in the process of exploring and trying to implement the requested changes.
1 The selection of the test set is unclear in the paper. It also appears that the test set has been erroneously used as a validation set. If that is the case, then it is difficult to assess the generalizability of the authors’ conclusions. It would be helpful to clarify how the test set was selected and used in this study. Additional comments regarding the test set are also included below with specific line references.

Thanks for the comment.
The selection of the training, validation, and test sets has been revised. The updated version is schematically shown in Fig 10. The training and the test set are now sourced from the same earthquake. The validation set is implemented using k-fold cross-validation as part of the hyperparameter tuning. Four models have been trained and tested using data from the four main earthquakes in the CES (4 Sep 2010, 22 Feb 2011, 13 Jun 2011, 23 Dec 2011).

Figure i: New Figure 10
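A minimal sketch of this revised scheme, on synthetic data (the split sizes and model settings are assumptions, not the paper's exact configuration): hold out a test set from one event's data, tune via k-fold cross-validation on the training portion only, and touch the test set once at the end.

```python
# Sketch of the revised evaluation scheme: per-event held-out test set,
# with validation done via k-fold CV inside the training portion.
# Data is synthetic; sizes and settings are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 9))
y = rng.choice(["low", "medium", "overcap"], size=1000)

# Hold out a final test set from the same event's data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Validation happens inside 5-fold CV on the training portion only
model = RandomForestClassifier(n_estimators=100, random_state=1)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# The test set is used exactly once, after model selection
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(cv_scores.mean(), test_acc)
```

This separation is what addresses the referee's concern: no model choice is ever made on the basis of test-set performance.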
2 While it is a suitable approach to only select the 4 events with the highest number of claims during model training, the other events with fewer claims could be used for testing purposes. This would not only ensure that no data leakage occurred between the training and test sets, but also enable the authors to validate the generalizability of their models more effectively.

Thank you for the suggestion.
As mentioned in the reply to comment #1, the process of selecting the training, validation, and test set has been reviewed. The training and test set are now clearly separated thus ensuring that no data leakage can occur. Additionally, the way the model should be applied to future earthquakes has been clarified in Fig 10 as well as in the paper.
We are here deliberately focusing on the claim data related to the four main earthquakes rather than the claim data pertaining to the aftershocks. Figure ii shows the situation in the area around Christchurch as of December 2012 (one year after the end of the CES). It can clearly be seen that the earthquakes with the largest magnitudes were: 4 Sep 2010, 22 Feb 2011, 13 Jun 2011, and 23 Dec 2011. Severe building damage, and hence higher claims, are mainly related to those four earthquakes, as the seismic intensity and liquefaction occurrence were greater for those events; hence their selection in this study.

Figure ii: Map of the region around Christchurch showing the epicentre locations of the four main earthquakes and multiple aftershocks in the Canterbury earthquake sequence (CES) (O’Rourke et al., 2014, originally from GNS)
3 It would improve the paper if the authors added their thoughts on some of the potential use cases of this research. While the authors’ conclusions indicate a promise for using ML within this domain, it was unclear how this model and approach could be used in the future. For example, if training data is needed each time an earthquake occurs, then is one of the use cases to manually collect a subset of ground truth data for building losses, train a model, and then apply it widely to the rest of the buildings?

Thanks for the suggestion.
Please see our reply to comment #5 of reviewer #1 concerning the implementation of the model.
Clarifications have been added in section 8 of the paper.

4 Further discussion of the model metrics such as recall and precision would be helpful. For example, a recall of only 20% for overcap, and 49% for the low loss category indicates that 80% and 51% of these losses, respectively, would be missed when implementing this model. Depending on the model’s use cases, this could have a significant impact on the model’s utility. Further discussion of the most appropriate metric (or their combinations), given the model’s use cases, would also improve the paper. For example, why was accuracy selected as the primary evaluation metric for choosing the best performing model?

Thanks for the comment.
As studies applying machine learning in an earthquake engineering context (see section 2.1 of the paper) are now more common, many publications include background explanations of the theory behind ML (e.g., Harirchian et al. (2021)). The authors are aware of the different metrics related to classification (Figure iii) but decided not to include generic explanations of ML in order to keep the paper to a reasonable length.

One of the main reasons for conveying the model performance using accuracy was to enable benchmarking of this ML model against the performance of other ML models for damage prediction. While a direct comparison would be improper, as the earthquakes selected, model attributes, and algorithms are not the same, most of the current studies report model performance using accuracy (Ghimire et al., 2022; Harirchian et al., 2021; Mangalathu et al., 2020; Stojadinović et al., 2022).
Nevertheless, following the suggestion, the ML model is currently being retrained with emphasis on ‘recall’. We will observe the outputs and possible benefits and update section 8 of the paper accordingly.
Figure iii: Details of a confusion matrix for a binary class problem
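For a multi-class problem, the per-class figures the referee quotes can be read directly off the confusion matrix: recall divides each diagonal entry by its row total (actual class), precision by its column total (predicted class). The matrix below is illustrative, constructed with row totals of 100 so the recall values match the percentages quoted in the comment; it is not the paper's actual matrix.

```python
# Reading per-class recall and precision off a confusion matrix
# (rows = actual class, columns = predicted class). Values are
# illustrative, chosen to reproduce the recalls quoted above.
import numpy as np

labels = ["low", "medium", "overcap"]
cm = np.array([
    [49, 45,  6],   # actual low
    [20, 70, 10],   # actual medium
    [ 5, 75, 20],   # actual overcap
])

recall = cm.diagonal() / cm.sum(axis=1)     # fraction of each actual class caught
precision = cm.diagonal() / cm.sum(axis=0)  # fraction of each prediction that is right
for lab, r, p in zip(labels, recall, precision):
    print(f"{lab}: recall={r:.2f}, precision={p:.2f}")
```

Accuracy is the diagonal sum over the grand total, which is why it can look acceptable even when the recall of a rare class such as ‘overcap’ is poor.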
5 I appreciated that the authors listed the distribution imbalance of different features, such as construction type. However, the paper could be further improved by adding the model performance in those different feature categories. This would enable the reader to understand in which categories the model performs better than others. Thanks for the suggestion.
This is currently work in progress. Once the model is retrained, we will have a closer look at the model performance for each of the categories. We will add interesting findings to the paper.

6 Given the relatively low performance of the ML model (as highlighted above for recall), adding a section on error analysis would substantially improve the paper. In error analysis for ML, the objective is to identify the cases in which the model does not perform well. This error analysis is often used in ML modeling to improve model performance and generalizability.

Thanks for the suggestion.
We are currently doing a more detailed analysis of the model to identify the scenarios causing problems. We will add a paragraph to the paper discussing the cases in which the model does not perform well.

7 Figure 13 is missing, and appears to be a repeat of Figure 12. Hence, Section 9 - Insights - could not be reviewed.

Thanks for the note. Fig 13 will be re-uploaded.

8 It would further improve the paper if the authors added some information about their hyperparameter tuning methodology, and which search strategy they used.

Thanks for the suggestion.
For the hyperparameter tuning we used the randomized search cross-validation approach. We selected RandomizedSearchCV over GridSearchCV since, for larger datasets, RandomizedSearchCV often matches or outperforms GridSearchCV at a much lower computational cost.
As suggested in comment #4, recall is currently being considered. This is done during the hyperparameter tuning process by adjusting the scoring metric to recall.
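A minimal sketch of this tuning approach (parameter ranges, data, and budget are illustrative assumptions, not the paper's actual search space): RandomizedSearchCV over a small Random Forest grid, scored on macro-averaged recall so that the rare ‘overcap’ class counts as much as the frequent ones.

```python
# Sketch of randomized hyperparameter search scored on recall.
# Parameter ranges and data are illustrative, not the paper's setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 9))
y = rng.choice(["low", "medium", "overcap"], size=600)

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

# scoring="recall_macro" averages per-class recall, weighting each
# class equally regardless of its frequency in the data
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2),
    param_distributions,
    n_iter=5, cv=3, scoring="recall_macro", random_state=2)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Sampling a fixed number of configurations (`n_iter`) rather than the full grid is what makes the randomized search cheaper than GridSearchCV on larger datasets.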
Clarifications will be added to section 7 of the paper.

9 Line 50 - While the authors are completely correct in the paragraph at line 50, this paper deals with ML for structured data, for which the goal is often to surpass human performance, since humans are generally unable to identify all patterns in millions of data points with hundreds of features, often found in these problems. Hence the paragraph does not apply to the ML scope of this paper. It may be suitable to remove the paragraph within the scope of this paper, or change “human-level performance” to “baseline model”, which would be a more suitable term in this case.

Thanks for the remark.
We have reformulated the paragraph to make it clearer that in this case, the ML model can surpass the human-level performance.
As explained in section 11, the performance of this ML model was assessed against the outputs of the software RiskScape v1.0.3.

10 Line 65 - latter -> later

Thanks for the note. The typo has been corrected.
11 Line 73 - Suggest adding reference/url for the source of the data.

Thanks for the suggestion.
Unfortunately, the EQC data is not public. We signed a confidentiality agreement with EQC, which made the data available to us for research purposes only.

12 Line 83 - It would be helpful to further describe Figure 2. Why is there a difference in the number of claims and buildings?

Thanks for the comment.
In some cases, multiple claims might have been lodged for the same earthquake event. The clarification has been added in the paper: “For a particular earthquake event, it sometimes happens that multiple claims pertaining to the same building were lodged. Figure 2 thus specifically differentiates between the number of claims and the distinct number of buildings affected.”

13 Line 95 - I was curious about the accuracy of the Riskscape dataset. For example, are the building characteristics determined statistically from Census data similar to HAZUS in the US, or was it based on collecting data from building records so that it is expected to be fairly accurate? If possible, it would be helpful in the paper to include some information describing Riskscape’s data collection methodology and comment on its expected accuracy.

The building characteristics data was obtained from the RiskScape - Asset Module Metadata (RiskScape, 2015). A copy of the document has been attached in the Appendix. According to the documentation, the “data in this inventory is partly derived from information purchased from Quotable Value (QV) Ltd, together with ‘industry knowledge’ and data gathered from surveying the area. All QV data is applied at the meshblock level, and the RiskScape attributes are derived from this information so as to provide a suitable model of the actual building stock.”
Further information can also be found in (King & Bell, 2006; Reese et al., 2007).14 Line 115 - Although a reference is provided to the authors’ previous work, it would be helpful to summarize the major reasons for incorrect merging using direct spatial joins within this paper to help understand the issue without having to read the previous work. Thanks for the comment.
The paragraph has been rewritten to include explanations related to the location of the EQC and RiskScape coordinates. It is now clearly stated in the paper that the coordinates provided as part of the EQC dataset relate to the location of the street address while the RiskScape coordinates are located in the actual centre of the footprint of a building. In some cases, buildings from neighbouring properties are located closer to the street address than the actual building. For those cases, the use of spatial join functions and spatial nearest neighbour joins led to unsatisfactory outputs.15 Line 122 - It would be helpful if the authors added the percentage of addresses in each of the 3 categories - 1-1 match with titles, 0-1 match, and many-1 match. Thanks for the comment.
As suggested in comment #18, the percentages have been added to Table 1.16 Line 131 - It would be helpful if the authors added the percentage of RiskScape data that was discarded. Thanks for the comment.
A comment mentioning that 27.1% of the RiskScape instances were having 3 or more RiskScape datapoints within a LINZ property title as been added at the end of the phrase. Additionally, as suggested in comment #18, the percentages have also been added to Table 1.17 Line 132 - I was unable to understand the intent described in this paragraph, especially the first and the last sentences. Thanks for your comment.
Fig 3 showed that some LINZ property titles included more than one point (i.e. building) per property title (polygon). This poses challenges for an automated data merging process as it needs to be ensured that the RiskScape and EQC information get assigned to the correct building.
As mentioned in line 121: “The merging process was thus started with instances having a unique street address per property”. Those instances having a unique building per polygon were used to merge RiskScape attributes and EQC info to the buildings constraining the merging within a property title. The actions for merging scenarios related to property titles having 1 point per LINZ property are listed in Table 1 rows #1 to #3.
It is known that the training of an ML model is often benefiting from more training data. Thus, options to merge more data were explored. It was found that among all the LINZ data, 7% of the LINZ property titles have two street address points (e.g. two buildings within one polygon). We explored if there would be an automated way to merge RiskScape attributes for such cases and thus get more data. For instances having 2 buildings (LINZ points) per property and 2 RiskScape points, the automated merging was done joining the RiskScape attribute to the closest LINZ point and filtered using the building floor area. Among the instances where a LINZ property title had 2 buildings it was also found that some of the RiskScape data only had one point per property title. While a human could make an educated guess to find which of the two buildings the RiskScape attribute were pertaining to, no satisfying automated approach could be developed to merge these instances. This case was thus discarded.
More details can be found in section 5.6.6 Properties with two street addresses and one or multiple RiskScape
Instances of the thesis (Roeslin, 2021). Section 5.6.6 also includes multiple figures showing examples of properties having two LINZ NZ street addresses per polygon. To keep this paper to a reasonable length such figures were not included.18 Table 1 - The table is very helpful. However, the action taken for 2 points LINZ and 1 point Riskscape was unclear. The above mentioned percentages of data could also be added to Table 1 instead. Thanks for the comment.
Re: action for 2 LINZ points and 1 RiskScape
The wordiness has been removed. In short, those points were discarded.Re: Percentage to Table 1
Thanks for the suggestion. The percentages have been added in brackets for each LINZ and RiskScape scenario of Table 1. Out of the selected instances in Christchurch, 89% have 1 street address point per LINZ property title, 7% two addresses per property, and 4% have three instances or more. For the RiskScape data, after merging in ArcMap it was found that 29.3% of the properties have one RiskScape instance, 43.6% have two instances, and 27.1% have three RiskScape points or more.19 Line 150 - It would be helpful if the authors added the methodology used to merge soil conditions, and liquefaction occurrence with street address. Did they use the same inverse distance weighted interpolation as seismic demand? Thanks for the comment.
The seismic demand captured through the peak ground acceleration (PGA) was interpolated using the inverse distance weighted (IDW) technique, applying the IDW spatial analyst function in ArcMap (Esri, 2021).
In the paper, the explanation on how the seismic demand was interpolated from the ground motion recordings obtained from GeoNet is explained in section 3.3.20 Line 172 - The reason for discarding claims with maximum value lower than or higher than $115,000 is unclear. Is it because this wasn’t possible and hence the data is erroneous? Thanks for pointing that out.
The EQC data entails a feature called ‘EQC Building Sum Insured’. As EQC provided a maximum cover of NZ$100,000 (+ GST) at the time of the Canterbury Earthquake Sequence, it was expected that all the individual dwellings for which one or multiple claims have been lodged during the CES would have an EQC Building Sum Insured equal to NZ$115,000. However, the initial exploratory data analysis (EDA) revealed that for some of the instances the building sum insured was not at NZ$115,000 (see Figure vi). As only a few instances were not equal to NZ$115,000 (2,594 instances for 4 Sep 2010 and 2,329 instances for 22 Feb 2011), it was decided not to include those instances (in the objective of retaining accurate data for the training of the ML model).Figure iv: Number of instances in the attribute EQC Building Sum Insured (categorised to simplify the visualisation)
21 Line 240 - It is unclear which event was selected as the test set. From my understanding of line 243, one of the 4 events was selected as test set, and the other 3 events as training+validation sets. However, it also appears from the sentence that in different instances of the model, a different event was selected as a test set so as to determine the most generalizable model. If that is the case, the test set was erroneously used as a validation set, since the model cannot be changed at any point after evaluating its performance on the test set. It would be helpful to clarify the selection of the test set, and ensure that it was only used once at the end to evaluate the performance of the final developed model. Thanks for the remark.
As mentioned in the reply to comment #1, the selection of the training, validation, and test set has been revised. All of those sets are now coming from the same earthquake event. For new earthquakes, a sample of buildings will be selected and added to the training set. Figure 10 has been updated to reflect those changes.22 Line 254 - It would be helpful if the authors added how the min-max scaling was implemented with respect to training, validation, and test sets. Thanks for the comment.
The min-max scaling of the numerical features was performed using the sklearn.preprocessing.MinMaxScaler available in scikit-learn (Pedregosa et al., 2011). A pipeline containing the min-max scaler was created. All sets were passed through the same pipeline.23 Line 286 - It is unclear which limitations related to random forest model the authors are referring to. Thanks for pointing that out.
The limitations mentioned in this section refer to the overall limited model performance not to the limitations of the random forest algorithm itself. The first section of the phrase has been rewritten to clarify that it is related to the overall model accuracy.24 Figure 11 - The SVM model does not appear to have been modeled correctly as its output prediction is always the medium category, hence it has been reduced to a trivial model. Thanks for pointing that out.
We are currently having a deeper look at the retraining of the SVM model.25 Line 295 - It appears that the model was selected based on the best performing model on the test set. This indicates that the test set was not used correctly, as the model selection can only be done using validation sets. The test set must only be used to show the performance of an already selected model on it. Thanks for the comment.
Sorry for the confusion. The paragraph has been rewritten and Fig 10 has been improved to clarify that the training, validation, and testing are initially done on one earthquake only.26 Line 326 - While the authors raise an accurate point about the lack of claims information exceeding $115,000, it is not clear how that data could have benefitted this study since the claims have been bucketed and all those claims greater than $115,000 are already expected to be included in the over-cap category. Thanks for the remark.
The ML model currently presented in this study is categorical. Yet, the target attribute BuildingPaid is initially numerical but soft-capped at NZ$100,000 (+GST) or NZ$115,000. Having the actual numerical distribution of the extent of the claims might enable a deeper analysis of the target attribute and can possibly enable better performance for a regression model. This might alleviate the need to transform BuildingPaid from a numerical attribute to a categorical feature.References
Esri. (2021). ArcMap - IDW (Spatial Analyst) - Documentation. https://desktop.arcgis.com/en/arcmap/latest/tools/spatial-analyst-toolbox/idw.htm
Ghimire, S., Guéguen, P., Giffard-Roisin, S., & Schorlemmer, D. (2022). Testing machine learning models for seismic damage prediction at a regional scale using building-damage dataset compiled after the 2015 Gorkha Nepal earthquake. Earthquake Spectra. https://doi.org/10.1177/87552930221106495
Harirchian, E., Kumari, V., Jadhav, K., Rasulzade, S., Lahmer, T., & Raj Das, R. (2021). A Synthesized Study Based on Machine Learning Approaches for Rapid Classifying Earthquake Damage Grades to RC Buildings. Applied Sciences, 11(16), 7540. https://doi.org/10.3390/app11167540
King, A., & Bell, R. (2006). Riskscape New Zealand - A Multihazard Loss Modelling Tool. Proceedings of 2006 New Zealand Society for Earthquake Engineering Conference, Napier, New Zealand, Paper 34, 9. https://nzsee.org.nz/db/2006/Paper30.pdf
Mangalathu, S., Sun, H., Nweke, C. C., Yi, Z., & Burton, H. v. (2020). Classifying earthquake damage to buildings using machine learning. Earthquake Spectra, 36(1), 183–208. https://doi.org/10.1177/8755293019878137
O’Rourke, T. D., Jeon, S. S., Toprak, S., Cubrinovski, M., Hughes, M., van Ballegooy, S., & Bouziou, D. (2014). Earthquake response of underground pipeline networks in Christchurch, NZ. Earthquake Spectra, 30(1), 183–204. https://doi.org/10.1193/030413EQS062M
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 12, 2825–2830. https://doi.org/10.1007/s13398-014-0173-7.2
Reese, S., Bell, R., & King, A. (2007). RiskScape: a new tool for comparing risk from natural hazards. Water & Atmosphere, 15(3), 24–25. https://niwa.co.nz/sites/niwa.co.nz/files/import/attachments/tool.pdf
RiskScape. (2015). Asset Module Metadata. https://github.com/samroeslin/samroeslin.github.io/blob/gh-pages/assets/pdf/RiskScape-Asset_Module_Metadata-2015-12-04.pdf
Roeslin, S. (2021). Predicting Seismic Damage and Loss for Residential Buildings using Data Science [University of Auckland]. https://hdl.handle.net/2292/57074
Stojadinović, Z., Kovačević, M., Marinković, D., & Stojadinović, B. (2022). Rapid earthquake loss assessment based on machine learning and representative sampling. Earthquake Spectra, 38(1), 152–177. https://doi.org/10.1177/87552930211042393
Supplement - Appendix
RiskScape. (2015). Asset Module Metadata.
-
AC3: 'Reply on RC2', Samuel Roeslin, 22 Nov 2022
We would like to thank referee #2 for the detailed and constructive feedback. We are grateful for the thoughtful comments and suggestions regarding the application of machine learning to real-world data. We are currently exploring and implementing the requested changes.
Referee #2 comments and the authors' responses (numbered as in the review):

Comment 1: The selection of the test set is unclear in the paper. It also appears that the test set has been erroneously used as a validation set. If that is the case, then it is difficult to assess the generalizability of the authors’ conclusions. It would be helpful to clarify how the test set was selected and used in this study. Additional comments regarding the test set are also included below with specific line references.

Response: Thanks for the comment.
The selection of the training, validation, and test set has been revised. The updated version is schematically shown in Fig 10. The training and the test set are now sourced from the same earthquake. The validation set is implemented using k-fold cross-validation as part of the hyperparameter tuning. Four models have been trained and tested using data from the four main earthquakes in the CES (4 Sep 2010, 22 Feb 2011, 13 Jun 2011, 23 Dec 2011).

Figure i: New Figure 10
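The revised per-event protocol (test data held out from the same earthquake, validation via k-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the study's actual code; the dataframe layout and the column names `event` and `loss_class` are assumptions.

```python
# Sketch of the revised protocol: train, validate (k-fold CV), and test
# within a single earthquake event. Column names ('event', 'loss_class')
# are hypothetical placeholders, not the actual EQC schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

def train_per_event(df: pd.DataFrame, features: list, target: str = "loss_class"):
    """Train one model per earthquake event, as in the updated Figure 10."""
    models = {}
    for event, sub in df.groupby("event"):
        X, y = sub[features], sub[target]
        # Hold out a test set from the SAME event; validation happens
        # inside cross_val_score via k-fold CV on the training portion.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)
        clf = RandomForestClassifier(random_state=0)
        cv_acc = cross_val_score(clf, X_tr, y_tr, cv=5).mean()  # validation
        clf.fit(X_tr, y_tr)
        models[event] = (clf, cv_acc, clf.score(X_te, y_te))  # test accuracy
    return models
```

Because the split is stratified per event, the test set is touched only once per model, after validation is complete.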
Comment 2: While it is a suitable approach to only select the 4 events with the highest number of claims during model training, the other events with fewer claims could be used for testing purposes. This would not only ensure that no data leakage occurred between the training and test sets, but also enable the authors to validate the generalizability of their models more effectively.

Response: Thank you for the suggestion.
As mentioned in the reply to comment #1, the process of selecting the training, validation, and test set has been reviewed. The training and test set are now clearly separated thus ensuring that no data leakage can occur. Additionally, the way the model should be applied to future earthquakes has been clarified in Fig 10 as well as in the paper.
We are here deliberately focusing on the claim data related to the four main earthquakes rather than the claim data pertaining to the aftershocks. Figure ii shows the situation in the area around Christchurch as of December 2012 (one year after the end of the CES). It can clearly be seen that the earthquakes with the larger magnitudes were: 4 Sep 2010, 22 Feb 2011, 13 Jun 2011, 23 Dec 2011. Severe building damage, and hence higher claims, is mainly related to those four earthquakes, as the seismic intensity and liquefaction occurrence were greater for those events; hence the selection of those earthquakes in this study.

Figure ii: Map of the region around Christchurch showing the epicentre locations of the four main earthquakes and multiple aftershocks in the Canterbury earthquake sequence (CES) (O’Rourke et al., 2014, originally from GNS)
Comment 3: It would improve the paper if the authors added their thoughts on some of the potential use cases of this research. While the authors’ conclusions indicate a promise for using ML within this domain, it was unclear how this model and approach could be used in the future. For example, if training data is needed each time an earthquake occurs, then is one of the use cases to manually collect a subset of ground truth data for building losses, train a model, and then apply it widely to the rest of the buildings?

Response: Thanks for the suggestions.
Please see our reply to comment #5 of reviewer #1 concerning the implementation of the model.
Clarifications have been added in section 8 of the paper.

Comment 4: Further discussion of the model metrics such as recall and precision would be helpful. For example, a recall of only 20% for overcap, and 49% for the low loss category, indicates that 80% and 51% of these losses, respectively, would be missed when implementing this model. Depending on the model’s use cases, this could have a significant impact on the model’s utility. Further discussion of the most appropriate metric (or their combinations), given the model’s use cases, would also improve the paper. For example, why was accuracy selected as the primary evaluation metric for choosing the best performing model?

Response: Thanks for the comment.
As studies applying machine learning in an earthquake engineering context (see section 2.1 of the paper) are now more common, many publications include background explanations of the theory behind ML (e.g., Harirchian et al. (2021)). The authors are aware of the different metrics related to classification (Figure iii) but decided not to include generic explanations of ML in order to keep the paper to a reasonable length.

One of the main reasons for reporting the model performance as accuracy was to enable benchmarking of this ML model against other ML models for damage prediction. While a direct comparison would be improper, as the earthquakes selected, model attributes, and algorithms are not the same, most current studies report model performance using accuracy (Ghimire et al., 2022; Harirchian et al., 2021; Mangalathu et al., 2020; Stojadinović et al., 2022).
Nevertheless, following the suggestion, the ML model is currently being retrained with emphasis on ‘recall’. We will observe the outputs and possible benefits and update section 8 of the paper accordingly.
Figure iii: Details of a confusion matrix for a binary class problem
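To make the metrics discussion concrete, the sketch below computes per-class precision and recall from a 3-class confusion matrix. The counts are invented for illustration, chosen only so that the recall values echo the 20% (over-cap) and 49% (low loss) figures quoted in comment #4; they are not the paper's actual results.

```python
# Illustration only: per-class precision and recall from a 3-class
# confusion matrix. The counts below are made up, NOT the paper's results.
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """cm[i, j] = count of instances with actual class i predicted as j."""
    tp = np.diag(cm).astype(float)
    recall = tp / cm.sum(axis=1)     # TP / (TP + FN), computed row-wise
    precision = tp / cm.sum(axis=0)  # TP / (TP + FP), computed column-wise
    accuracy = tp.sum() / cm.sum()
    return precision, recall, accuracy

cm = np.array([[49, 40, 11],    # actual low
               [15, 70, 15],    # actual medium
               [10, 70, 20]])   # actual over-cap
prec, rec, acc = per_class_metrics(cm)
```

The example shows how an overall accuracy can mask very different per-class recalls, which is exactly the reviewer's point.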
Comment 5: I appreciated that the authors listed the distribution imbalance of different features, such as construction type. However, the paper could be further improved by adding the model performance in those different feature categories. This would enable the reader to understand in which categories the model performs better than others.

Response: Thanks for the suggestion. This is currently work in progress. Once the model is retrained, we will have a closer look at the model performance for each of the categories. We will add interesting findings to the paper.

Comment 6: Given the relatively low performance of the ML model (as highlighted above for recall), adding a section on error analysis would substantially improve the paper. In error analysis for ML, the objective is to identify the cases in which the model does not perform well. This error analysis is often used in ML modeling to improve model performance and generalizability.

Response: Thanks for the suggestion. We are currently doing a more detailed analysis of the model to identify the scenarios causing problems. We will add a paragraph to the paper discussing the cases in which the model does not perform well.

Comment 7: Figure 13 is missing, and appears to be a repeat of Figure 12. Hence, Section 9 - Insights - could not be reviewed.

Response: Thanks for the note. Fig 13 will be reuploaded.

Comment 8: It would further improve the paper if the authors added some information about their hyperparameter tuning methodology, and which search strategy they used.

Response: Thanks for the suggestion.
For the hyperparameter tuning, we used the randomized search cross-validation approach. We selected RandomizedSearchCV over GridSearchCV since, for larger datasets, RandomizedSearchCV often outperforms GridSearchCV.
As suggested in comment #4, recall is currently being considered. This is done during the hyperparameter tuning process adjusting the scoring for recall.
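A minimal sketch of the setup described above, assuming scikit-learn: a pipeline (min-max scaling, mirroring the reply to comment #22, followed by a random forest) tuned with RandomizedSearchCV and scored on macro-averaged recall. The parameter ranges are illustrative assumptions, not the values used in the study.

```python
# Sketch only: pipeline (min-max scaling + random forest) tuned with
# RandomizedSearchCV, scored on macro recall instead of plain accuracy.
# The parameter ranges are illustrative assumptions, not the study's values.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([("scale", MinMaxScaler()),
                 ("rf", RandomForestClassifier(random_state=0))])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"rf__n_estimators": randint(100, 500),
                         "rf__max_depth": randint(3, 30)},
    n_iter=20,
    scoring="recall_macro",  # emphasise recall rather than plain accuracy
    cv=5,
    random_state=0,
)
# search.fit(X_train, y_train) fits the scaler inside each CV fold,
# so scaling statistics never leak from the validation folds.
```

Wrapping the scaler in the pipeline is what makes "all sets passed through the same pipeline" safe: the scaler is fitted on training folds only and merely applied to validation and test data.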
Clarifications will be added to section 7 of the paper.

Comment 9: Line 50 - While the authors are completely correct in the paragraph at line 50, this paper deals with ML for structured data, for which the goal is often to surpass human performance since humans are generally unable to identify all patterns in millions of data points with hundreds of features, often found in these problems. Hence the paragraph does not apply to the ML scope of this paper. It may be suitable to remove the paragraph within the scope of this paper, or change “human-level performance” to “baseline model”, which would be a more suitable term in this case.

Response: Thanks for the remark.
We have reformulated the paragraph to make it clearer that in this case, the ML model can surpass the human-level performance.
As explained in section 11, the performance of this ML model was assessed against the outputs of the software RiskScape v1.0.3.

Comment 10: Line 65 - latter -> later

Response: Thanks for the note. The typo has been corrected.
Comment 11: Line 73 - Suggest adding a reference/URL for the source of the data.

Response: Thanks for the suggestion.
Unfortunately, the EQC data is not public. We signed a confidentiality agreement with EQC which made the data available to us for research purposes only.

Comment 12: Line 83 - It would be helpful to further describe Figure 2. Why is there a difference in the number of claims and buildings?

Response: Thanks for the comment.
In some cases, multiple claims might have been lodged for the same earthquake event. The clarification has been added in the paper: “For a particular earthquake event, it sometimes happens that multiple claims pertaining to the same building were lodged. Figure 2 thus specifically differentiates between the number of claims and the distinct number of buildings affected.”

Comment 13: Line 95 - I was curious about the accuracy of the Riskscape dataset. For example, are the building characteristics determined statistically from Census data similar to HAZUS in the US, or was it based on collecting data from building records so that it is expected to be fairly accurate? If possible, it would be helpful in the paper to include some information describing Riskscape’s data collection methodology and comment on its expected accuracy.

Response: The building characteristics data was obtained from the RiskScape - Asset Module Metadata (RiskScape, 2015). A copy of the document has been attached in the Appendix. According to the documentation, the “data in this inventory is partly derived from information purchased from Quotable Value (QV) Ltd, together with ‘industry knowledge’ and data gathered from surveying the area. All QV data is applied at the meshblock level, and the RiskScape attributes are derived from this information so as to provide a suitable model of the actual building stock.”
Further information can also be found in King & Bell (2006) and Reese et al. (2007).

Comment 14: Line 115 - Although a reference is provided to the authors’ previous work, it would be helpful to summarize the major reasons for incorrect merging using direct spatial joins within this paper to help understand the issue without having to read the previous work.

Response: Thanks for the comment.
The paragraph has been rewritten to include explanations related to the location of the EQC and RiskScape coordinates. It is now clearly stated in the paper that the coordinates provided as part of the EQC dataset relate to the location of the street address, while the RiskScape coordinates are located at the actual centre of the footprint of a building. In some cases, buildings from neighbouring properties are located closer to the street address than the actual building. For those cases, the use of spatial join functions and spatial nearest neighbour joins led to unsatisfactory outputs.

Comment 15: Line 122 - It would be helpful if the authors added the percentage of addresses in each of the 3 categories - 1-1 match with titles, 0-1 match, and many-1 match.

Response: Thanks for the comment.
As suggested in comment #18, the percentages have been added to Table 1.

Comment 16: Line 131 - It would be helpful if the authors added the percentage of RiskScape data that was discarded.

Response: Thanks for the comment.
A comment mentioning that 27.1% of the RiskScape instances have 3 or more RiskScape data points within a LINZ property title has been added at the end of the sentence. Additionally, as suggested in comment #18, the percentages have also been added to Table 1.

Comment 17: Line 132 - I was unable to understand the intent described in this paragraph, especially the first and the last sentences.

Response: Thanks for your comment.
Fig 3 showed that some LINZ property titles included more than one point (i.e. building) per property title (polygon). This poses challenges for an automated data merging process as it needs to be ensured that the RiskScape and EQC information get assigned to the correct building.
As mentioned in line 121: “The merging process was thus started with instances having a unique street address per property”. Those instances having a unique building per polygon were used to merge the RiskScape attributes and EQC information to the buildings, constraining the merging to within a property title. The actions for merging scenarios related to property titles having one point per LINZ property are listed in Table 1, rows #1 to #3.
The training of an ML model often benefits from more training data, so options to merge more data were explored. It was found that, among all the LINZ data, 7% of the LINZ property titles have two street address points (e.g. two buildings within one polygon). We explored whether there would be an automated way to merge RiskScape attributes for such cases and thus obtain more data. For instances having two buildings (LINZ points) per property and two RiskScape points, the automated merging was done by joining the RiskScape attributes to the closest LINZ point and filtering using the building floor area. Among the instances where a LINZ property title had two buildings, it was also found that some of the RiskScape data had only one point per property title. While a human could make an educated guess as to which of the two buildings the RiskScape attributes pertained to, no satisfactory automated approach could be developed to merge these instances. This case was thus discarded.
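The closest-point-plus-floor-area merging rule described above could be sketched roughly as follows. All field names and the tolerance value are hypothetical, and the real workflow operated on GIS data rather than plain coordinate lists.

```python
# Hypothetical sketch of the 2-buildings-per-title merge described above:
# each RiskScape point is joined to the closest LINZ address point within
# the same property title, and the match is kept only if the floor areas
# agree within a tolerance. All field names are illustrative assumptions.
import math

def merge_two_point_title(linz_pts, riskscape_pts, area_tol=0.2):
    """linz_pts / riskscape_pts: lists of dicts with x, y, floor_area."""
    merged = []
    for rs in riskscape_pts:
        # Nearest LINZ street-address point (Euclidean distance).
        nearest = min(linz_pts,
                      key=lambda p: math.hypot(p["x"] - rs["x"],
                                               p["y"] - rs["y"]))
        # Keep the pair only if the floor areas are consistent.
        if abs(nearest["floor_area"] - rs["floor_area"]) <= area_tol * rs["floor_area"]:
            merged.append((nearest["address"], rs))
    return merged
```

The floor-area filter is what distinguishes this from a plain nearest-neighbour join, which, as noted above, mismatched buildings on neighbouring properties.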
More details can be found in section 5.6.6, “Properties with two street addresses and one or multiple RiskScape instances”, of the thesis (Roeslin, 2021). Section 5.6.6 also includes multiple figures showing examples of properties having two LINZ NZ street addresses per polygon. To keep this paper to a reasonable length, such figures were not included.

Comment 18: Table 1 - The table is very helpful. However, the action taken for 2 points LINZ and 1 point Riskscape was unclear. The above mentioned percentages of data could also be added to Table 1 instead.

Response: Thanks for the comment.
Re: action for 2 LINZ points and 1 RiskScape point - The wordiness has been removed. In short, those points were discarded.

Re: percentages in Table 1 - Thanks for the suggestion. The percentages have been added in brackets for each LINZ and RiskScape scenario in Table 1. Out of the selected instances in Christchurch, 89% have one street address point per LINZ property title, 7% have two addresses per property, and 4% have three instances or more. For the RiskScape data, after merging in ArcMap it was found that 29.3% of the properties have one RiskScape instance, 43.6% have two instances, and 27.1% have three RiskScape points or more.

Comment 19: Line 150 - It would be helpful if the authors added the methodology used to merge soil conditions, and liquefaction occurrence with street address. Did they use the same inverse distance weighted interpolation as seismic demand?

Response: Thanks for the comment.
The seismic demand captured through the peak ground acceleration (PGA) was interpolated using the inverse distance weighted (IDW) technique, applying the IDW spatial analyst function in ArcMap (Esri, 2021).
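Conceptually, the IDW interpolation works as follows. This is a minimal sketch of the technique, not the ArcMap implementation, and the station values are made up.

```python
# Minimal sketch of inverse distance weighted (IDW) interpolation of PGA
# from recording stations to a building location, mirroring conceptually
# what the ArcMap IDW tool does. Station data are made-up examples.
import math

def idw(stations, x, y, power=2.0):
    """stations: list of (sx, sy, pga). Returns interpolated PGA at (x, y)."""
    num, den = 0.0, 0.0
    for sx, sy, pga in stations:
        d = math.hypot(sx - x, sy - y)
        if d == 0.0:             # exactly at a station: return its value
            return pga
        w = 1.0 / d ** power     # closer stations get larger weights
        num += w * pga
        den += w
    return num / den

stations = [(0.0, 0.0, 0.30), (10.0, 0.0, 0.50), (0.0, 10.0, 0.40)]
```

The interpolated value is always a weighted average of the station readings, so it stays within the range of the observed PGAs.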
How the seismic demand was interpolated from the ground motion recordings obtained from GeoNet is explained in section 3.3 of the paper.

Comment 20: Line 172 - The reason for discarding claims with maximum value lower than or higher than $115,000 is unclear. Is it because this wasn’t possible and hence the data is erroneous?

Response: Thanks for pointing that out.
The EQC data entails a feature called ‘EQC Building Sum Insured’. As EQC provided a maximum cover of NZ$100,000 (+ GST) at the time of the Canterbury Earthquake Sequence, it was expected that all the individual dwellings for which one or multiple claims were lodged during the CES would have an EQC Building Sum Insured equal to NZ$115,000. However, the initial exploratory data analysis (EDA) revealed that for some of the instances the building sum insured was not NZ$115,000 (see Figure iv). As only a few instances were not equal to NZ$115,000 (2,594 instances for 4 Sep 2010 and 2,329 instances for 22 Feb 2011), it was decided not to include those instances (with the objective of retaining accurate data for the training of the ML model).

Figure iv: Number of instances in the attribute EQC Building Sum Insured (categorised to simplify the visualisation)
Comment 21: Line 240 - It is unclear which event was selected as the test set. From my understanding of line 243, one of the 4 events was selected as test set, and the other 3 events as training+validation sets. However, it also appears from the sentence that in different instances of the model, a different event was selected as a test set so as to determine the most generalizable model. If that is the case, the test set was erroneously used as a validation set, since the model cannot be changed at any point after evaluating its performance on the test set. It would be helpful to clarify the selection of the test set, and ensure that it was only used once at the end to evaluate the performance of the final developed model.

Response: Thanks for the remark.
As mentioned in the reply to comment #1, the selection of the training, validation, and test set has been revised. All of those sets now come from the same earthquake event. For new earthquakes, a sample of buildings will be selected and added to the training set. Figure 10 has been updated to reflect those changes.

Comment 22: Line 254 - It would be helpful if the authors added how the min-max scaling was implemented with respect to training, validation, and test sets.

Response: Thanks for the comment.
The min-max scaling of the numerical features was performed using the sklearn.preprocessing.MinMaxScaler available in scikit-learn (Pedregosa et al., 2011). A pipeline containing the min-max scaler was created. All sets were passed through the same pipeline.

Comment 23: Line 286 - It is unclear which limitations related to the random forest model the authors are referring to.

Response: Thanks for pointing that out.
The limitations mentioned in this section refer to the overall limited model performance, not to the limitations of the random forest algorithm itself. The first part of the sentence has been rewritten to clarify that it relates to the overall model accuracy.

Comment 24: Figure 11 - The SVM model does not appear to have been modeled correctly as its output prediction is always the medium category, hence it has been reduced to a trivial model.

Response: Thanks for pointing that out.
We are currently having a deeper look at the retraining of the SVM model.

Comment 25: Line 295 - It appears that the model was selected based on the best performing model on the test set. This indicates that the test set was not used correctly, as the model selection can only be done using validation sets. The test set must only be used to show the performance of an already selected model on it.

Response: Thanks for the comment.
Sorry for the confusion. The paragraph has been rewritten and Fig 10 has been improved to clarify that the training, validation, and testing are initially done on one earthquake only.

Comment 26: Line 326 - While the authors raise an accurate point about the lack of claims information exceeding $115,000, it is not clear how that data could have benefitted this study since the claims have been bucketed and all those claims greater than $115,000 are already expected to be included in the over-cap category.

Response: Thanks for the remark.
The ML model currently presented in this study is categorical. However, the target attribute BuildingPaid is initially numerical, soft-capped at NZ$100,000 (+GST), i.e. NZ$115,000. Having the actual numerical distribution of the claims would enable a deeper analysis of the target attribute and could possibly yield better performance for a regression model. This might alleviate the need to transform BuildingPaid from a numerical attribute into a categorical one.

References
Esri. (2021). ArcMap - IDW (Spatial Analyst) - Documentation. https://desktop.arcgis.com/en/arcmap/latest/tools/spatial-analyst-toolbox/idw.htm
Ghimire, S., Guéguen, P., Giffard-Roisin, S., & Schorlemmer, D. (2022). Testing machine learning models for seismic damage prediction at a regional scale using building-damage dataset compiled after the 2015 Gorkha Nepal earthquake. Earthquake Spectra. https://doi.org/10.1177/87552930221106495
Harirchian, E., Kumari, V., Jadhav, K., Rasulzade, S., Lahmer, T., & Raj Das, R. (2021). A Synthesized Study Based on Machine Learning Approaches for Rapid Classifying Earthquake Damage Grades to RC Buildings. Applied Sciences, 11(16), 7540. https://doi.org/10.3390/app11167540
King, A., & Bell, R. (2006). Riskscape New Zealand - A Multihazard Loss Modelling Tool. Proceedings of 2006 New Zealand Society for Earthquake Engineering Conference, Napier, New Zealand, Paper 34, 9. https://nzsee.org.nz/db/2006/Paper30.pdf
Mangalathu, S., Sun, H., Nweke, C. C., Yi, Z., & Burton, H. v. (2020). Classifying earthquake damage to buildings using machine learning. Earthquake Spectra, 36(1), 183–208. https://doi.org/10.1177/8755293019878137
O’Rourke, T. D., Jeon, S. S., Toprak, S., Cubrinovski, M., Hughes, M., van Ballegooy, S., & Bouziou, D. (2014). Earthquake response of underground pipeline networks in Christchurch, NZ. Earthquake Spectra, 30(1), 183–204. https://doi.org/10.1193/030413EQS062M
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Reese, S., Bell, R., & King, A. (2007). RiskScape: a new tool for comparing risk from natural hazards. Water & Atmosphere, 15(3), 24–25. https://niwa.co.nz/sites/niwa.co.nz/files/import/attachments/tool.pdf
RiskScape. (2015). Asset Module Metadata. https://github.com/samroeslin/samroeslin.github.io/blob/gh-pages/assets/pdf/RiskScape-Asset_Module_Metadata-2015-12-04.pdf
Roeslin, S. (2021). Predicting Seismic Damage and Loss for Residential Buildings using Data Science [University of Auckland]. https://hdl.handle.net/2292/57074
Stojadinović, Z., Kovačević, M., Marinković, D., & Stojadinović, B. (2022). Rapid earthquake loss assessment based on machine learning and representative sampling. Earthquake Spectra, 38(1), 152–177. https://doi.org/10.1177/87552930211042393
Supplement - Appendix: RiskScape. (2015). Asset Module Metadata.