Reply on RC2

This said, I would also encourage the authors to think a bit more broadly about the context in which they are laying out these suggestions. Specifically, with the rapidly evolving soil C sequestration landscape and an infusion of private interests into the soil C world (e.g. https://seqana.com/, IndigoAg, etc.), how should academics, industry, NGOs and government agencies maintain data access and communication in what's potentially a more crowded, active (and presumably better funded) field? I appreciate this touches on ideas that are broader and somewhat more existential than the soils data challenge the paper more narrowly addresses, but it seems relevant to contextualize the broader landscape of who harmonizes soils data and why, beyond how it can be done better.

The uses of soil databases in research contexts are varied (for example, Earth system model benchmarking; Collier et al., 2018), but there are also private economic impacts of having soil data available. Soil health metrics in public databases could affect land valuation, and there is increasing interest in soil carbon data from carbon markets for offsetting CO2 emissions. As mentioned in the geolocation section, specific information on the nutrient and water retention of a soil can make it more or less valuable, making landowners reluctant to release data. More recently, an increasing interest in generating carbon offsets by increasing soil sequestration has led to a proliferation of new venture corporations that either generate new soil data or use available soil data to define land management practices that increase soil C stocks (e.g. IndigoAg, CIBO Technologies, Seqana, Regrow, Nori, LoamBio). Companies generally treat data that they collect or process as part of their intellectual property, which is kept private. While there is clearly scientific value in these data, it is unclear how researchers, landholders, and private companies will negotiate the use and integration of these data into research outputs. Nonetheless, privately held data would also benefit from connecting with community-developed standards.
My remaining comments are relatively minor, and largely intended to clarify aspects of the text.
I'm not sure I agree with the statement in Lines 43-45. Modeled soil properties (here I'm thinking of hydraulic and thermal properties) rely on pedotransfer functions that use input data of soil physical characteristics (texture and organic matter content). None of these 'soil properties' are used for benchmarking or evaluation, making me wonder what the growing need for more data is really for, especially if ILAMB already uses information on soil C stocks and inferred turnover times?
We can see how this was confusing. This was meant to refer to carbon and nutrient stocks, but on review this section is unclear. We are removing the ILAMB sections (ln 41-45) and replacing them with the following: "A number of databases have been compiled in soils data around specific themes or measurement types including: soil carbon and nitrogen  Table XX for a complete list with database creation strategies)."
Moreover, data products like SoilGrids already exist, which seem to have a wealth of data that can be used as inputs for or evaluation of Earth system models. Are you suggesting new efforts should go into recreating or augmenting the data processing wheel that informs ISRIC data products (SoilGrids and the Harmonized World Soils Database)? I don't get the sense this is what the authors are envisioning. I also appreciate that "This is just one of many potential uses for harmonized soil data", but I do worry that as written the authors are implying that the harmonized datasets we do have somehow do not reflect FAIR principles.
We contend that soil data products (like SoilGrids and HWSD) are not the same as an aggregated soil database and that a soil database is necessary to generate these data products but has other use cases as well. We address this, and related comments from R1, beginning on line 45.

Suggested text:
Soil resources curated by ISRIC (https://www.isric.org/) provide another example of how soil data feed into larger products. After archival on ISRIC servers, datasets from individual providers are incorporated into the World Soil Information Service workflow (WoSIS; https://www.isric.org/explore/wosis). The WoSIS workflow includes mapping diverse data contributions to a standard data model, harmonization, and distribution. Distribution includes a database, as defined in this paper (the WoSIS Soil Profile Database; https://www.isric.org/explore/wosis/faq-wosis#How_should_the_WoSIS_datasets_be_cited?), as well as derived data products, such as SoilGrids (Hengl T, de Jesus JM, MacMillan RA, Batjes NH, Heuvelink GBM, et al.
Agreed. We will integrate this language on ln 73-74 into the figure and headings.
Lines 79-84: I appreciate the challenge you're trying to articulate, but it kind of seems like you're suggesting reviewers or journals need better evaluation of data publishing standards. I wonder if this is really where the responsibility should lie, specifically because I don't think that, as a community, we're well trained in best data management practices.
Good point. We did not intend for the responsibility to lie with peer-reviewed journals; rather, we will shift the focus to highlight the challenge of who would be responsible, so that it reads as more of an open question. We will add the following to ln 81: "... "high standard" are and who is responsible for ensuring these standards are met. To complicate matters, key…"
I think given better information, data providers would happily provide more useful datasets to repositories, but don't know how. Maybe this is what's implied in line 83 with data providers who 'become frustrated'? I realize you're trying to be brief here (and maybe a solution is articulated in Section 3), but I do worry that the takeaway message from this paragraph is 'currently archived data are incomplete and therefore useless, and we're not really going to tell you how to make them better'.
Good point. We propose extending this paragraph and adding to ln 84: "This is not to say that archiving data for the purpose of meeting funder requirements or reproducing the associated analysis cannot be useful in and of itself. However, this does not automatically lend the data to integration in a database."

Line 87, what's a harmonized template?
We agree this is unclear; we will reframe it as 'aggregator provided template'.

Line 99: What are TRUST and CARE? If an aim of this manuscript is to broadly educate soil-minded scientists on best data practices, the features of these practices should be briefly articulated (not just referenced).
We propose adding to line 97: In general, however, we feel that direct collaboration between data providers and data aggregators is a critical relationship to nurture. Other critical relationships for good data governance have been articulated by recent extensions of the FAIR Principles, including TRUST and CARE. The TRUST Principles (Transparency; Responsibility; User focus; Sustainability; Technology) articulate key features for trustworthy digital repositories, which are essential for preserving data access and reuse over time (Lin et al., 2020). The CARE Principles for Indigenous Data Governance (Collective benefit; Authority to control; Responsibility; Ethics) position decisions related to data management and reuse in the context of Indigenous cultures and knowledge systems, highlighting actions that ultimately support Indigenous data sovereignty (Carroll et al., 2020). As the community continues to converge on shared tenets of good data governance, it is becoming increasingly clear that "just put it in a repository" is only the beginning.

Line 105. These different transcription / translation methods are nicely described in the text, with examples in Appendix A. Would a table help emphasize similarities and differences of databases listed in Appendix
Capturing these differences in table form was challenging, which is why we went with the narrative structure; however, we will create a table capturing some of these database strategies and add it as a new table.
Finally, both ISRaD and SoDaH were organized with the nested hierarchy established with ISCN. Should this be mentioned? Should ISCN be highlighted in the text (a number of coauthors have contributed to this effort)? This hierarchical organization of the data is implied, but maybe not explicitly established in the metadata and data models we are or should be using.
Good point. We will add the ISCN connection to these two project descriptions.

Section 2.2: It seems like scripted transcription requires clear dictionaries, vocabulary and metadata to be successful, but based on the text in 2.1 this is not common, OR is this just happening in keyed translation?
Both manual transcription and scripted methods require clear metadata descriptions, which are formalized in different ways. We will add this point on line 129: "While this approach has the most explicit need for clear semantic resources, these are also essential for creating effective manual transcription templates and protocols."

Section 2.3 is pretty brief. Would additional examples be helpful here to illustrate how different efforts have gap-filled or pruned their data? How do these databases expand, which seems an important aspect of curation (although discussed in 2.4 for COSORE)?
With respect, these strategies are extremely diverse and beyond the scope of this paper; see lines 148-149.
Line 275: I might add something aboveground to this list (as vegetation, land use, productivity and climate are also important for belowground measurements, but are rarely colocated with the belowground measurements being collected).
We will extend section 2.3 to talk about annotation of soil observations with aboveground data (e.g., ISRaD annotating mean annual temperature and precipitation). Specifically, on ln 147: "These activities include expanding the environmental context for a particular soil; for example, extracting net primary productivity and land use classification from satellite products."
Soils are not unique, and many of these are broad challenges in environmental data. Specifically, we propose adding on ln 60: "The approach and issues outlined in this paper are undoubtedly not unique to soils and are relevant to a wide range of scientific data, particularly environmental data. However we present this as a case study of soil specific database construction."
I'm 100% behind the suggestions and vision the authors laid out, but I do wonder a bit: to what end? What are the pressing questions that a massive new soils database will let us address? Given the diversity of soil uses, measurements, and communities, is a database of databases really what we need? OR is the soil science community well enough served by individual collections of data that are focused on more topical areas like radiocarbon, respiration fluxes, spectral databases, or bulk C stocks? I realize this isn't your grant proposal, but presumably it's heading that way. The text clearly delineates data providers and data aggregators, but who are the data users that will ultimately do something with these datasets once they're wrangled into something more useful?

Section 3.2 (or in the introduction
You are correct, of course, that this paper focuses on data aggregators as a class rather than on data users. We chose to do this because the user community is exceptionally diverse, whereas data aggregation is a common activity across this group. Respectfully, we choose to focus this paper on the how instead of the why.