How well are hazards associated with derechos reproduced in regional climate simulations?

An 11-member ensemble of convection-permitting regional simulations of the fast-moving and destructive derecho of June 29 – 30, 2012 that impacted the northeastern urban corridor of the US is presented. This event generated 1100 reports of damaging winds, significant wind gusts over an extensive area of up to 500,000 km², caused several fatalities and resulted in widespread loss of electrical power. Extreme events such as this are increasingly being used within pseudo-global warming experiments that seek to examine the sensitivity of historical, societally-important events to global climate non-stationarity and how they may evolve as a result of changing thermodynamic and dynamic context. As such it is important to examine the fidelity with which such events are described in hindcast experiments. The regional simulations presented herein are performed using the Weather Research and Forecasting (WRF) model. The resulting ensemble is used to explore simulation fidelity relative to observations for wind gust magnitudes, spatial scales of convection (as manifest in high composite reflectivity), and both rainfall and hail production as a function of model configuration (microphysics parameterization, lateral boundary conditions (LBC), start date, and use of nudging). We also examine the degree to which each ensemble member differs with respect to key mesoscale drivers of convective systems (e.g. convective available potential energy and vertical wind shear) and critical manifestations of deep convection; e.g. vertical velocities, cold pool generation, and how those properties relate to correct characterization of the associated atmospheric hazards (wind gusts and hail). Here, we show that the use of a double-moment, 7-class scheme with number concentrations for all species (including hail and graupel) results in the greatest fidelity of model simulated wind gusts and convective structure against the observations of this event.
We further show very high sensitivity to the LBC employed and specifically that simulation fidelity is higher for simulations nested within ERA-Interim than ERA5.
[...] MU-CAPE from a WRF simulation at 3 km of a Super Derecho in Kansas on 8 May 2009 was in excess of 3000 J kg⁻¹ (Weisman et al., 2013). MU-CAPE for the 3 July 2003 derecho in the Midwest was ∼ 500 J kg⁻¹ (Metz and Bosart, 2010). The June 2012 derecho that forms the focus of this research is remarkable not only for the number and intensity of wind gusts but also in terms of the Convective Available Potential Energy (CAPE) in the genesis region and near Washington DC.
For example, CAPE estimates for the 0000Z 30 June 2012 rawinsonde sounding from Sterling VA (IAD) were ∼ 5500 J kg⁻¹. Here we employ maximum or most unstable CAPE (MU-CAPE) as our primary index of the ability of the atmosphere to support deep convection. MU-CAPE is computed for tp and tp ± 3 hours from the 3-dimensional fields of pressure, temperature, and [...]
[...] over simulations using ERA-Interim. Our finding has important implications for construction of hindcast simulations for use in Surrogate or Pseudo Global Warming (PGW) numerical experiments to quantify the potential impact of global warming on extreme weather events using regional models (Li et al., 2019). In such simulations an historically important extreme event is first simulated using contemporary LBC and then the simulation is repeated using LBC and IC perturbed to represent the change in, for example, air temperatures and water vapor availability. The difference between these two realizations is interpreted as the impact of global climate non-stationarity. Our work indicates that use of ERA5 for IC and LBC may not always result in improved baseline simulations of the extreme event in the contemporary climate, and the simulation deficiencies may render evaluation of the PGW response highly uncertain. The relatively low skill of the 11 WRF ensemble members for this [...]

Over 20 deaths were reported during the 29-30 June 2012 derecho event, and there was widespread property damage and extensive power outages (Halverson, 2014). According to one report, power outages impacted over half of all homes within West Virginia and "approximately 600,000 citizens were still without power a week later" (Kearns et al., 2014). Many homes in West Virginia also lost access to clean water supply due to power failures at water treatment facilities (Kearns et al., 2014). During the evening of 29 June over 1.4 million people in the Washington DC metro area lost power, some of them for almost a week during a period of relatively high heat stress (Short, 2016). Ohio, Virginia and West Virginia had the largest number of customers without power (Halverson, 2014), and an analysis in 2016 found this event was the single largest cause of power outages in the state of Maryland (Short, 2016). Most forecast models operating in 2012 did not predict either extensive deep convection or a significant severe weather event (Guastini and Bosart, 2016; Schumacher and Rasmussen, 2020), although once the event had initiated the Storm Prediction Center (SPC) commenced issuance of severe weather warnings (Halverson, 2014). This event has subsequently been the subject of extensive research in terms of characterization of the environmental context (Bentley and Logsdon, 2016; Guastini and Bosart, 2016; Shourd and Kaplan, 2021), and has formed the basis of several modelling studies designed, for example, to examine whether model fidelity is enhanced by data assimilation (Fierro et al., 2014). Our research is not focused on methods to improve forecasts of such events but rather on evaluating the inherent ability of the Weather Research and Forecasting (WRF) model to reproduce key aspects of this event in the contemporary climate as a function of model configuration, in order to lay the foundation for examining how such events may evolve in the future.

Synthesis of insights and outcomes from previous simulations of deep convection and derecho events
Past research has illustrated that use of nested domains with convection-permitting resolutions (i.e., dx < 4 km), where the convective parameterization is deactivated and convective processes are partially resolved by explicit model physics, typically enhances simulation fidelity of deep convection (Prein et al., 2015). Emerging research has shown that using scale-aware convective parameterizations throughout the model gray zone resolution helps to smooth the transition from the parameterized to resolved convective scale, leading to smaller errors in the timing and intensity of precipitation (Mahoney et al., 2016; Wagner et al., 2018; Jeworrek et al., 2019). However, model fidelity as a function of model configuration remains an open research question. As described below, model fidelity is a strong function of the precise cloud microphysics scheme applied, model grid spacing, lateral boundary conditions and the degree/manner in which the model parameterizations interact (Wang and Seaman, 1997; Warner, 2010).
https://doi.org/10.5194/nhess-2021-373 Preprint. Discussion started: 10 December 2021 © Author(s) 2021. CC BY 4.0 License.
Compute times for simulations with WRF and other atmospheric models exhibit a relatively high dependence on cloud microphysics schemes (Barrett et al., 2019). Single-moment schemes do not predict the particle size distribution for each species, which is instead derived from fixed parameters. They are thus more computationally efficient. Double-moment schemes add a prediction equation for number concentration per species (cloud, water, ice, snow, hail, graupel). The trade-off between the increased compute time from more advanced microphysics and meaningful forecast improvement is significant, such that the additional compute expense may not always be warranted (Jeworrek et al., 2019). Nevertheless, as the model resolution transitions through the gray zone to kilometer-scale resolution, the microphysics begins to directly influence convective and cloud scale motions through latent heating/cooling and the weight of condensate; thus a double-moment scheme should be used at such scales (Morrison et al., 2020). Spectral bin schemes may offer an additional fidelity enhancement but are even more computationally demanding (Shpund et al., 2019). One analysis of hail prediction for an event that impacted Oklahoma City employed a horizontal grid spacing of 500 m and compared three different bulk microphysics (MP) schemes: the Milbrandt-Yau double-moment scheme (MY2), the Milbrandt-Yau triple-moment scheme (MY3), and the NSSL variable density-rimed ice double-moment scheme (NSSL). The authors found all three schemes generated skillful predictions for the areal coverage of severe surface hail (hail diameter (D) ≥ 25 mm), but the schemes, and the NSSL scheme in particular, exhibited less skill for significant severe hail (D ≥ 50 mm) (Labriola et al., 2019).
Microphysics parameterizations are not only critical to production of solid precipitation (hail and graupel) but also to simulation of cold pool development and production of downbursts and outflow boundaries (Adams-Selin et al., 2013). Squall lines are well suited to microphysics sensitivity studies because mature squall lines contain a range of ice hydrometeor types (Xue et al., 2017).
Much of the prior research examining squall line sensitivity to microphysics has been conducted with bulk schemes due to the added computational demand of bin schemes (Fovell and Ogura, 1989; McCumber et al., 1991; Morrison et al., 2015; Fan et al., 2017). These studies have shown considerable spread among different microphysics schemes, and this has been linked to varying representation of cold pool dynamics (Morrison et al., 2012; Morrison et al., 2015).
No optimal grid spacing has been found for simulation of MCSs including derechos. A previous analysis of 14 simulated MCSs found finer grid spacing was associated with better reproduction of the cold pool (grid spacing of 1 km showed enhanced skill over 3 km) but that forward propagation speeds of the MCS better matched observations for the simulations at 3 km (Squitieri and Gallus Jr, 2020). Further, simulations of a derecho that impacted northern France, Belgium, the Netherlands and northwestern Germany on 3 January 2014 also found more realistic representation of the derecho intensity at a grid spacing of 1.1 km relative to simulations at 2.8 km (Mathias et al., 2019).
Other studies have examined the sensitivity to model initial and lateral boundary conditions (IC and LBC) (Hohenegger et al., 2006; Johnson, 2014). Modelling of the major derecho event that tracked over Belarus, Lithuania, Latvia, Estonia and Finland on 8 August 2010 with the HARMONIE model applied at a 2.5 km grid spacing found a strong dependence on IC and LBC and a time delay (of approximately 1 hr) in derecho passage approximately 15 hours into the simulation (Toll et al., 2015). Nested simulations of a European derecho event using the COSMO regional model found significant improvement in the simulation fidelity with use of ERA5 for the LBC over simulations using ERA-Interim (Mathias et al., 2019).

Objectives
It is important to emphasize that the research presented herein is cast within the framework of using short simulations with a Convection Permitting Regional Climate Model (CPRCM) to reproduce specific extreme events, where the CPRCM is nested with LBC from a reanalysis product (Lucas-Picher et al., 2021). By simulating only a few days, this case study (or storyline) approach permits many simulations to be performed and evaluated so that model dependencies can be fully investigated (Mathias et al., 2019; Lucas-Picher et al., 2021). Accordingly, the objectives of this work are to build and evaluate an ensemble of WRF simulations performed in a hindcast mode (i.e. with reanalysis-derived LBC) that differ in terms of the microphysics schemes applied, the LBC, start date, and use of nudging, and to use that ensemble to:
1) Evaluate the relative fidelity of regional climate simulations using different microphysics schemes for an historically important high-wind mesoscale convective event. The five microphysics schemes applied range in sophistication, from cloud-scale single-moment [Goddard (Tao et al., 1989)] to double-moment [Thompson (Thompson et al., 2008), Morrison, Milbrandt-Yau (Milbrandt and Yau, 2005b)] to double-moment with particle shape and density prediction [NSSL (Mansell et al., 2010a)].
2) Evaluate how the fidelity of WRF varies for different LBC, start times and with and without nudging. The two reanalysis products used to provide the initial and lateral boundary conditions are ERA-Interim (Dee et al., 2011) and ERA5 (Hersbach et al., 2020).
For objectives 1 and 2 we evaluate fidelity with respect to: peak reflectivity and spatial extent of reflectivity at the time of maximum deep convection, cumulative precipitation, presence/absence of hail, and peak wind gusts. We also provide context for the fidelity assessment during the derecho with conditions during a subsequent frontal passage. We also seek to address a third objective:
3) Evaluate the degree to which the processes involved in generation of gust fronts from derechos are represented in the WRF ensemble simulations. In this part of the analysis, we are seeking to assess the differential fidelity of the ensemble members in terms of a range of diagnostic properties, the vertical structure of deep convection, the vertical velocities, and metrics of cold pool production.
This research is being performed as part of a project designed to examine how historically important extreme events may be modified in an evolving climate. Thus, while there is evidence that data assimilation can substantially enhance forecast and hindcast skill (Johnson et al., 2015; Johnson and Wang, 2016; Federico et al., 2019; Bachmann et al., 2020), no data assimilation is performed here.

WRF simulations
All the simulations presented herein were performed with WRF model version 3.8.1. The optimal domain size, number of nests and parent-grid ratio to be used in convection-permitting simulations are open questions (Prein et al., 2015), but there is evidence of bulk convergence (i.e. diminishing change of domain-wide properties as a function of grid spacing) at approximately 1 km (Panosetti et al., 2019). Accordingly, simulations performed herein use a grid spacing of 1.33 km in the innermost domain (d03, see Figure 1a for the simulation domains) that covers a domain of almost 400 by 400 km (i.e. above the recommended target of 300 by 300 km for convection-permitting regional climate model simulations (CPRCM) (Lucas-Picher et al., 2021)). A single domain configuration and inner nest grid spacing is used in all members of the ensemble because prior research has generally found sensitivities related to cloud microphysical parameterizations are larger than those associated with mesh refinement at kilometer scales (Roh and Satoh, 2014). Model configuration settings that are consistent across all simulations are shown in Table 1 while the settings for which the 11 ensemble members differ are shown in Table 2. Here we use a fixed outer WRF simulation domain grid spacing of 12 km with lateral boundary conditions (LBC) from both ERA5 (dx ~ 30 km) and ERA-Interim (dx ~ 80 km), consistent with recommendations that the maximum step in resolution at the domain boundary is < 12 (Lucas-Picher et al., 2021). Because the goal of this research is to establish whether WRF can generate a derecho of the given intensity when provided only the large-scale environmental context, in most simulations no nudging is applied, and a relatively large simulation domain is selected. Two initialization dates are included in the ensemble. Most simulations are initialized at 0000 UTC on 26 June, approximately 4 days before the peak of the event. These are equivalent in type to true 'climate mode' simulations (i.e. those initialized well ahead of the event genesis). Another two are initialized at 0000 UTC on 28 June, approximately 2 days before the peak of the event but much closer to the event genesis, and thus are closer to a 'weather-wise mode' where the model initialization is a few hours before the event commences.
Additional WRF output diagnostics options are employed. The 'output_diagnostics=1' setting is used to output climate diagnostics to a separate history file (wrfxtrm) every hour for domain 1, and every 10 minutes for domains 2 and 3. Advanced settings for NSSL are not used here. The 'hail_opt' switch for Morrison is used to run this scheme with hail. A Morrison simulation without hail is also run for comparison. The Goddard scheme does not include hail by default, but in this simulation 'gsfcgce_hail=1' is used to run the Goddard scheme with hail. The 'do_radar_ref=1' namelist setting is used to compute radar reflectivity using microphysics-scheme-specific parameters in the Goddard, Thompson, and Morrison ensemble simulations. This option is not available for the NSSL and Milbrandt-Yau schemes, but radar reflectivity is still calculated by the model for those schemes without using the microphysics parameters. Two radar reflectivity estimates are provided by WRF: REFL_10CM (i.e. radar reflectivity in each vertical grid cell at a wavelength of 10 cm) and REFD_MAX (maximum derived radar reflectivity).
Composite reflectivity (cREF) is used here for comparison with RADAR estimates and is the maximum value for each WRF column and time step.
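The column-maximum definition of cREF can be illustrated with a minimal sketch. It assumes the reflectivity for one output time is held as a nested list indexed [level][south_north][west_east]; the function name and the sample values are hypothetical and are not taken from the simulations.

```python
def composite_reflectivity(refl_3d):
    """Return the column-maximum reflectivity (dBZ) for one output time.

    refl_3d is a nested list indexed [level][south_north][west_east]
    (an assumed layout for a single time step of a 3-D reflectivity field).
    """
    n_lev = len(refl_3d)
    n_y = len(refl_3d[0])
    n_x = len(refl_3d[0][0])
    return [
        [max(refl_3d[k][j][i] for k in range(n_lev)) for i in range(n_x)]
        for j in range(n_y)
    ]


# Two model levels on a 2 x 2 grid (illustrative values only).
refl = [
    [[10.0, 45.0], [5.0, 20.0]],   # lower level
    [[30.0, 40.0], [-5.0, 55.0]],  # upper level
]
cref = composite_reflectivity(refl)
```

In practice the same reduction would be applied to every output time step before any comparison with RADAR-derived composite reflectivity.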

Model evaluation
The ensemble of WRF simulations is evaluated against observations from National Weather Service (NWS) dual-polarization RADARs (Crum et al., 1998; Seo et al., 2015) and the NWS Automated Surface Observation System (ASOS) (Schmitt and Chester, 2009). There are four RADAR stations within the innermost WRF simulation domain (d03) and nine in the second domain (d02). There are 34 ASOS stations in domain d03 and 149 in domain d02 (Figure 1a).

ASOS data
The following parameters from the 5-minute ASOS data set are used in the model evaluation and diagnostic interpretation:
• Gust wind speeds (Ugust, m s⁻¹): Sustained and gust wind speeds within the ASOS network are measured using Vaisala 2-D sonic anemometers deployed at 10 m a.g.l. The data are sampled at 1 Hz and digitally output as a 3-second moving average wind speed. The gust wind speeds reported here represent the maximum 3-second wind speed measured in each 5-minute period when gust criteria are met. Gusts are reported in knots and are rounded up to the nearest whole knot.
• Air temperature (T, °C): measured at 2 m a.g.l. using a platinum-wire resistance thermometer.
• Accumulated precipitation (PPT, mm): Hourly precipitation is measured by a heated, tipping-bucket rain gauge. The data are reported in hundredths of an inch and converted to metric units herein.
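The unit conversions implied by the ASOS reporting conventions above (knots for gusts, hundredths of an inch for precipitation) can be sketched as follows. The function names are hypothetical; the conversion factors are the standard definitions of the knot and the inch.

```python
KNOT_TO_MS = 1852.0 / 3600.0  # one knot is one nautical mile (1852 m) per hour
INCH_TO_MM = 25.4             # exact by definition


def knots_to_ms(gust_knots):
    """Convert an ASOS gust report in knots to m/s."""
    return gust_knots * KNOT_TO_MS


def hundredths_inch_to_mm(ppt_hundredths):
    """Convert precipitation reported in hundredths of an inch to mm."""
    return ppt_hundredths * 0.01 * INCH_TO_MM


gust_ms = knots_to_ms(50)             # about 25.7 m/s
ppt_mm = hundredths_inch_to_mm(100)   # one inch, i.e. 25.4 mm
```

Because gusts are rounded up to whole knots before reporting, converted values carry a resolution of roughly 0.5 m s⁻¹.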
A light emitting diode weather identifier instrument is used to differentiate rain and snow at ASOS stations (Wade, 2003), but hydrometeors such as hail are only reported at ASOS stations with human observers. Thus for ~ 400 fully automated ASOS stations across the US there are no hail reporting functions. Hence, hail occurrence reported by the ASOS network (including the portion within the current domain of interest) is likely to be negatively biased. ASOS facilities with a surface-based observer also augment the reports with flags to indicate the presence of thunderstorms. These data are presented herein to supplement evidence of high reflectivity from RADAR.

RADAR
Dual polarization Doppler S-band WSR-88D RADAR form the basis of the NWS network (Crum et al., 1998;Seo et al., 2015).
Scans are performed at between nine and fourteen elevation angles (0.5° to 19.5°) depending on precipitation conditions. Data are collected with a standard azimuthal resolution of 1° and range resolution of 0.25 km (NOAA, 2016b, 2017). Data used herein are restricted to within 200 km of each RADAR station.
Four key RADAR-derived properties sampled at 10-minute intervals are used in the WRF model evaluation:
• Composite reflectivity (cREF, dBZ), which is the maximum reflectivity in each vertical column.
• Precipitation rate (mm hr⁻¹) derived from reflectivity using Z-R relationships (NOAA, 2016b).
• Hail reports and MESH: Hail presence in cloud is derived from reflectivity, aspect ratio of hydrometeors, vertically-integrated liquid, and altitude of the melting layer (NOAA, 2016a). Hail reports include the geographic position and the 75th percentile hailstone diameter (or maximum estimated size of hail, MESH) (Wallace et al., 2019). In the current work, a distinction is drawn between hail reports with MESH > 5 mm and those without. This diameter threshold for classifying hydrometeors as hail (as opposed to graupel) comes from the ASOS conventions (Nadolski, 1998).
• Radial wind speeds (m s⁻¹) are presented herein (Figure 2) from the 0.5° elevation angle and are computed from the Doppler shift (Alpert and Kumar, 2007).
All RADAR measurements are sampled at a 10-minute interval to match the WRF output and are re-gridded onto the WRF grid used for domain d03 prior to their use in the model evaluation. Where two RADARs cover the same area the data are averaged using inverse-distance weighting. RADAR coverage of domain d03 is almost complete. RADAR data are available for 86436 total grid cells in d03, which is 99.4% of the total number of WRF grid cells.
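The averaging of overlapping RADAR coverage can be illustrated with a minimal inverse-distance weighting sketch. The text does not state the weighting exponent or whether averaging is done in dBZ or in linear units, so the exponent of 1, the direct averaging of the supplied values, and the function name are all assumptions; the 200 km range limit follows the restriction stated for the RADAR data.

```python
def idw_average(values, distances_km, max_range_km=200.0, power=1.0):
    """Inverse-distance-weighted mean of RADAR values at one grid cell.

    values and distances_km hold one entry per RADAR. RADARs farther than
    max_range_km are ignored. Returns None if no RADAR is within range.
    The exponent `power` is an assumed parameter, not taken from the paper.
    """
    num = 0.0
    den = 0.0
    for value, dist in zip(values, distances_km):
        if dist > max_range_km:
            continue
        if dist == 0.0:
            return value  # grid cell coincides with the RADAR site
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den if den > 0.0 else None


# A cell 50 km from one RADAR and 150 km from another (illustrative values).
merged = idw_average([40.0, 20.0], [50.0, 150.0])
```

With these inputs the nearer RADAR receives three times the weight of the farther one, so the merged value lies closer to 40 than to 20.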

Assessing and attributing model fidelity
The WRF simulation period encompasses both the derecho that forms the focus of this research and a subsequent frontal passage. [...] the period of the spatial extent of cREF > 40 dBZ in domain d03 relative to RADAR observations (Figure 3a). This is consistent with previous research that indicates WRF simulations not subject to data assimilation exhibit timing offsets when simulating extreme precipitation events (Knist et al., 2020). For this reason, and because the purpose of the current work is to examine whether a CPRCM simulation can generate atmospheric hazards associated with a derecho, the model evaluation is performed within a framework such that time-synchronization is not required. The storm peak time (tp) is defined independently for each ensemble member and the RADAR observations as the time of maximum exceedance of 40 dBZ during the Derecho period and the Front period, respectively. WRF output at tp is used to characterize the intensity and characteristics of each event.
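The definition of tp can be sketched as follows, under the assumption that "maximum exceedance of 40 dBZ" refers to the largest number of grid cells with cREF > 40 dBZ within the period of interest. The function names and sample fields are hypothetical.

```python
def convective_cell_count(cref, threshold_dbz=40.0):
    """Count grid cells whose composite reflectivity exceeds the threshold."""
    return sum(1 for row in cref for value in row if value > threshold_dbz)


def storm_peak_index(cref_series, threshold_dbz=40.0):
    """Index of the first time step with the largest count of cells above threshold."""
    counts = [convective_cell_count(c, threshold_dbz) for c in cref_series]
    return max(range(len(counts)), key=counts.__getitem__)


# Three 10-minute composite reflectivity fields on a 2 x 2 grid (illustrative).
series = [
    [[45.0, 10.0], [20.0, 30.0]],  # 1 cell above 40 dBZ
    [[45.0, 50.0], [41.0, 30.0]],  # 3 cells above 40 dBZ
    [[45.0, 50.0], [39.0, 30.0]],  # 2 cells above 40 dBZ
]
tp_index = storm_peak_index(series)
```

Applying the same function separately to each ensemble member and to the re-gridded RADAR fields gives a tp per data set, so no time-synchronization between model and observations is needed.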
The fidelity of each ensemble member with respect to storm severity and spatial extent during the Derecho and Front periods is assessed using geospatial maps of composite reflectivity, precipitation accumulation and type, and maximum wind speeds, and is summarized using the following metrics:
• cREF > 40 dBZ Ratio: This metric is the ratio of areal extent of WRF grid cells with composite reflectivity > 40 dBZ at tp, divided by the RADAR-derived estimate. Use of cREF > 40 dBZ as the index of the spatial coverage of deep convection is based on past research (Parker and Knievel, 2005; Schumacher and Johnson, 2005). The spatial coverage for other thresholds is shown in Figure 3b.
• Max Gust Ratio: This metric is the ratio of the maximum over-land 10 m wind speed in each timestep from each WRF ensemble member divided by the maximum wind gust speed from any ASOS station. This is thus a basic metric of the degree to which each WRF ensemble member produces wind gusts that approach the most severe gusts observed by the ASOS network.
• Total Precipitation Ratio: This metric is the ratio of precipitation accumulation in all d03 grid cells for which RADAR retrievals are available to the RADAR observations. Hail occurrence from the WRF ensemble members is also evaluated against RADAR and ASOS observations along with the presence of 'significant hail'. Grid cells in d03 are classified as containing 'significant hail' in the WRF simulations if there is > 1 mm of hail and/or graupel accumulation, and in RADAR observations for MESH > 5 mm.
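The three ratio metrics listed above can be sketched as follows. This is a simplified illustration, assuming equal-area grid cells (so that a cell count ratio equals an area ratio) and assuming that cells without RADAR retrievals are marked with None and excluded from both the model and RADAR totals. All names and values are hypothetical.

```python
def cref_area_ratio(model_cref, radar_cref, threshold_dbz=40.0):
    """Ratio of model to RADAR cell counts with cREF above the threshold."""
    model_n = 0
    radar_n = 0
    for m_row, r_row in zip(model_cref, radar_cref):
        for m, r in zip(m_row, r_row):
            if r is None:  # no RADAR retrieval: skip the cell for both
                continue
            model_n += m > threshold_dbz
            radar_n += r > threshold_dbz
    return model_n / radar_n


def max_gust_ratio(model_wind_ms, asos_gust_ms):
    """Maximum modeled over-land 10 m wind speed divided by maximum ASOS gust."""
    return max(model_wind_ms) / max(asos_gust_ms)


def total_precip_ratio(model_ppt, radar_ppt):
    """Model to RADAR accumulated precipitation over cells with RADAR data."""
    model_sum = 0.0
    radar_sum = 0.0
    for m_row, r_row in zip(model_ppt, radar_ppt):
        for m, r in zip(m_row, r_row):
            if r is not None:
                model_sum += m
                radar_sum += r
    return model_sum / radar_sum


# Illustrative 2 x 2 fields and station lists (not results from the ensemble).
area = cref_area_ratio([[45.0, 30.0], [50.0, 42.0]], [[41.0, 44.0], [None, 43.0]])
gust = max_gust_ratio([12.0, 18.5, 15.0], [20.0, 37.0, 25.0])
ppt = total_precip_ratio([[10.0, 20.0], [30.0, 40.0]], [[20.0, None], [30.0, 50.0]])
```

A ratio of 1 indicates agreement with the observations, values below 1 indicate underestimation by the model and values above 1 indicate overestimation.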
As described above, and indicated by Figure 3, the timing of peak intensity and transit of the derecho across the innermost domain is not consistent across the WRF simulations and/or between the WRF simulations and observations. Given this research is being performed in the context of a project designed to improve simulation of atmospheric hazards in the contemporary and possible future climates, we assess fidelity without requiring temporal synchronization. Thus, in the following we focus much of our evaluation of the simulations on their ability to reproduce the intensity and spatial extent of the derecho and thus define the time of peak intensity (tp) independently for each ensemble member. While we present some of the evaluation in terms of the degree of spatial agreement with in-situ and remote sensing data using Spearman correlation of geospatial values at tp, we also include analyses that examine the absolute intensity of, for example, reflectivity and wind gusts without requiring geospatial coherence between the model and the observations. In these analyses we are addressing the question: was the peak intensity of the event captured even if that peak is displaced in space and time? In considering these decisions it is worth reemphasizing that the purpose and concept of this analysis is not to assess deterministic (forecast) predictability but the representation of the convective system. The metrics of fidelity described above are considered here in the context of the environmental setting (convective available potential energy and vertical wind shear), along with descriptors of the storm dynamics (vertical velocities, cloud depth, downburst intensity and cold pool generation/intensity) during the Derecho period. Many of these diagnostic analyses focus on the time of maximum coverage of high reflectivity (tp) during the derecho as assessed for each individual ensemble member and/or over a window of 3 hours around that time.
The metrics used are described in the following.

1. Convective Available Potential Energy (CAPE) is a measure of the available vertically integrated buoyant energy.
Multiple indices of convective potential have been proposed (Kunz, 2007). Derechos are frequently associated with CAPE values in excess of 2400 J kg⁻¹ in the genesis region and can increase to 4500 J kg⁻¹ during the propagation of the derecho, with later observational analyses indicating that Most Unstable Convective Available Potential Energy (MU-CAPE) has a 75th percentile value of nearly 4000 J kg⁻¹ and a peak of 8500 J kg⁻¹ (Evans and Doswell, 2001). MU-CAPE from a WRF [...]
2. Wind shear from the ground to 6 km (S6) is often used to differentiate environments associated with significant severe thunderstorms from less severe events (Brooks et al., 2003). In an analysis of observational data, shear vector magnitudes in the ambient environment close to derechos ranged from 1 to 36 m s⁻¹, which were slightly lower than those manifest in idealized simulations of bow echoes (Evans and Doswell, 2001). Mid-level shear has also been shown to help maintain deep convective systems (Coniglio and Stensrud, 2001; Chen et al., 2015). S6 is presented based on output at tp for all ensemble members.
3. ZR20: The model height at which the 90th percentile base reflectivity falls below 20 dBZ. It is used as a proxy for cloud top height in areas of deep convection and thus is computed using only cells with cREF > 40 dBZ.
4. Two metrics of the intensity of vertical motions are presented. For each grid cell within 50 km of one where cREF > 40 dBZ, the layer with the highest standard deviation of vertical velocities (σ(w)) at tp is found. The magnitude of σ(w) is used to provide information about the intensity of vertical motions, which to first order should be a function of MU-CAPE. The height at which the maximum variability in vertical velocities occurs is used to provide information regarding the vertical structure of convection.
5. Cold pools are a key component contributing to organization and propagation of MCS (Engerer et al., 2008). They are generated by evaporative cooling, precipitation drag, and downdrafts and are key to triggering and organizing persistent convection (Knippertz et al., 2009; Schumacher, 2015). An analysis of cold pools associated with 39 MCS in Oklahoma found mean surface pressure perturbations associated with cold pools range from 3.2 hPa to 4.5 hPa and mean temperature perturbations range from 9.5 to 5.4 K depending on the MCS stage (Engerer et al., 2008). To account for the presence of substantial topographic variability within d03, the intensity of cold pools at the surface associated with the derecho is quantified using anomalies from the simulation mean temperature or pressure in that grid cell over the entire simulation period. Both are computed for WRF grid cells within 50 km of all cells with cREF > 40 dBZ:
a. 95% temperature deviation: This metric describes the lowest 5 percent (coldest) 2 m air temperature anomalies close to the regions with most active convection.
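The temperature-based cold pool metric (item 5a) can be sketched as follows. This is a minimal illustration that interprets the metric as the 5th percentile of 2 m temperature anomalies at tp; that interpretation, the linear-interpolation percentile, the brute-force distance search and all names are assumptions rather than the method used to produce the results reported here.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def near_convection_mask(cref, dx_km, radius_km=50.0, threshold_dbz=40.0):
    """True for cells within radius_km of any cell with cREF above threshold."""
    n_y, n_x = len(cref), len(cref[0])
    cores = [(j, i) for j in range(n_y) for i in range(n_x)
             if cref[j][i] > threshold_dbz]
    return [
        [any(math.hypot(j - cj, i - ci) * dx_km <= radius_km for cj, ci in cores)
         for i in range(n_x)]
        for j in range(n_y)
    ]


def cold_pool_t2_metric(t2_series, cref_tp, tp_index, dx_km):
    """5th percentile of 2 m temperature anomalies (K) near deep convection.

    Anomalies are departures from each cell's mean over the whole series,
    which removes the fixed temperature offset due to terrain height.
    """
    n_t = len(t2_series)
    mask = near_convection_mask(cref_tp, dx_km)
    anomalies = []
    for j, row in enumerate(mask):
        for i, keep in enumerate(row):
            if keep:
                mean_t = sum(t2_series[t][j][i] for t in range(n_t)) / n_t
                anomalies.append(t2_series[tp_index][j][i] - mean_t)
    return percentile(anomalies, 5.0)


# One row of three cells spaced 40 km apart, two output times (illustrative).
cref_tp = [[45.0, 0.0, 0.0]]
t2 = [[[300.0, 300.0, 300.0]], [[290.0, 296.0, 250.0]]]
metric = cold_pool_t2_metric(t2, cref_tp, tp_index=1, dx_km=40.0)
```

In this example the third cell lies 80 km from the convective core and is therefore excluded, so the metric reflects only the two cells within 50 km. The same approach applies to surface pressure anomalies, using the upper rather than lower tail.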
Because the variables and metrics considered here are not Gaussian distributed, Spearman rank correlations (Wilks, 2011) are used to describe their co-variability. Rank correlation coefficients are computed between the model fidelity metrics and the diagnostic metrics across the 11 ensemble members to identify which model properties most greatly influence model skill. [...] (Table 3).
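The Spearman rank correlation is the Pearson correlation of the ranked values, which can be sketched in a few lines. The handling of ties by average ranks is a common convention and is assumed here; the function names and sample values are hypothetical and do not represent ensemble results.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but strongly non-linear relationship (illustrative values).
rho = spearman([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 25.0, 100.0, 1000.0])
```

Because only the ordering matters, the coefficient is insensitive to the skewed distributions typical of quantities such as precipitation totals and wind gusts.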
When remapped to the WRF grid, the RADAR data indicate 2148 of the almost 90,000 grid cells experienced significant hail during the Derecho period (Table 4). These locations identified by the RADAR detection algorithm as exhibiting hail and MESH > 5 mm are distributed throughout domain d03 (Figure 6). The WRF ensemble members, particularly those that employ the Milbrandt microphysics scheme, indicate much greater spatial coverage of hail (Table 4). When the threshold of > 1 mm hail accumulation is applied to the WRF output the occurrence of hail greatly decreases, and rather few grid cells show hail above this threshold (Table 4).
During the Front period the situation is reversed in that RADAR observations show limited areas with high precipitation totals over 40 mm and 2152 grid cells where hail was detected in clouds. Areas with substantial precipitation accumulation are only evident from RADAR in bands in the south of the domain, in regions where hail is also indicated by the RADAR detection algorithm (Figure 7). Two-thirds of the domain shows little or no precipitation in either RADAR or ASOS data. All non-nudged WRF ensemble members indicate a positive bias in domain-wide precipitation and over-predict the occurrence of hail (Table 4). All four non-nudged ensemble members that use the Milbrandt microphysics scheme also indicate multiple locations with hail accumulation above 1 mm. The number of grid cells with RADAR detection of hail (3078) shows closest agreement with the Morrison+Hail simulation (3000) (Table 4). Using MESH > 5 mm and WRF hail accumulation of 1 mm as indicative of substantial hail, the closest accord for the Front period is found for the Milbrandt-628 ensemble member (Table 4). Simulations that use the Milbrandt microphysics scheme tend to have deep layers with base reflectivity above 35 dBZ and lower spatial variability (Figure 8), consistent with the high production of hail (Table 4). In contrast to the other ensemble members, the nudged simulations with LBC from both ERA5 and ERA-Interim indicate that the region of highest inferred RADAR base reflectivity at tp is displaced from the ground (Figure 8).
Links between deep convection, downdrafts and near-surface wind gusts are highly complex (Geerts, 2001; Kuchera and Parker, 2006; Brown and Dowdy, 2021b), and this, combined with observational limitations, means very little previous research has quantified skill in model simulations of wind gusts generated by downdrafts from deep convection. Consistent with evidence presented above of spatial displacement of the regions of deepest convection, the spatial correlation coefficients of maximum wind speeds between the individual ensemble members and ASOS wind gust observations (see time series in Figure 9a and spatial maps in Figure 10) are also low (Table 3). As with precipitation and RADAR reflectivity, wind speeds are underestimated during the Derecho period and overestimated during the Front period. Some ensemble members (again, Milbrandt-XXXX and Morrison) produce wind gusts during the Derecho period that are within a factor of 0.6 of the ASOS maximum observed wind gust, but only one of the ensemble members generates a wind gust anywhere in domain d03 that exceeds the NWS definition of 'severe wind' (i.e. wind gusts at 10 m a.g.l. above 25.7 m s⁻¹), while multiple time periods and ASOS stations reported wind gusts above this threshold (Figure 9b). Indeed, the highest 2% of modeled wind speeds are substantially lower than the equivalent near-surface gust observations (Figure 9b). Only the Morrison, Milbrandt-626-ERA-I and Milbrandt-628-ERA-I ensemble members exhibit 98th percentile wind speeds (sampled at the model time step in both space and time over land grid cells) that lie within 50% of the ASOS observations of wind gusts (Figure 9b).
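The comparisons used in this paragraph (ratio of modeled to observed maximum gust, agreement of 98th percentile values to within 50%, and exceedance of the 25.7 m s⁻¹ severe-wind threshold) can be sketched as below. The function and key names are hypothetical, the percentile uses linear interpolation as an assumed convention, and the inputs are simply flat lists of wind speeds in m s⁻¹.

```python
import math

SEVERE_GUST_MS = 25.7  # NWS 'severe wind' threshold quoted in the text (m/s)


def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0-100) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (pct / 100.0) * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def gust_fidelity(model_speeds, obs_gusts, pct=98.0):
    """Summarize how modeled wind speeds compare with observed gusts."""
    model_p = percentile(model_speeds, pct)
    obs_p = percentile(obs_gusts, pct)
    return {
        # Ratio of the modeled to the observed maximum.
        "max_ratio": max(model_speeds) / max(obs_gusts),
        # Ratio of the upper-percentile values.
        "pct_ratio": model_p / obs_p,
        # True when the modeled percentile lies within 50% of the observed one.
        "within_50pct": abs(model_p - obs_p) <= 0.5 * obs_p,
        # Whether either data set contains a 'severe' gust.
        "model_severe": max(model_speeds) > SEVERE_GUST_MS,
        "obs_severe": max(obs_gusts) > SEVERE_GUST_MS,
    }
```

A ratio below 1 in `max_ratio` or `pct_ratio` corresponds to the underestimation of gusts described for the Derecho period.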
While some of the offset between observed point measurements of 3-second duration wind gusts and grid-cell average wind speeds at the model time step of 3.33 seconds is expected due to the spectral truncation inherent in grid-cell average modeled wind speeds (Pryor et al., 2012), it is interesting to note that virtually all members of the model ensemble overestimate peak wind gusts during the frontal passage (Figures 9a and 11). The two ensemble members that use ERA-Interim IC and LBC are associated with the highest wind speeds and greatest accord with near-surface measurements from ASOS during the Derecho period (Figures 9-11).
The sensitivity to LBC in simulations with Milbrandt (e.g. Figures 4 and 10) is inconsistent with past research (Majewski, 1997).
Despite the higher resolution and larger data assimilation volumes in ERA5, simulations with ERA-Interim LBC produced better spatial agreement with observations from RADAR and ASOS. For simulations with the Milbrandt microphysics scheme that are initialized on 26 June at 0000 UTC the correlation coefficients are -0.412 vs. 0.225 for ERA5 and ERA-Interim respectively, while for the simulations started on 28 June 0000 UTC the correlation coefficients are 0.318 and 0.669 (Table 3). The spatial correlation for peak cREF is also higher in simulations with ERA-Interim LBC (Table 3). An examination of the IC generated by WRF real for 26 June 00Z (Figure S2) indicates that higher pressure is more prevalent and broader in ERA5 than in ERA-Interim, particularly across the derecho genesis region of the Midwest. The derecho event came at the end of an extended period of high near-surface temperatures. While the ERA-Interim and ERA5 fields at the model initialization time are superficially similar (on the two dates), some differences are evident (Figure S2). For example, on 26 June the region of elevated 2 m temperature extends further north and east in ERA-I, and the SLP anomalies (and suppressed lower tropospheric specific humidity) associated with the anticyclone over the Great Lakes are slightly more intense in ERA5. On 28 June the region of elevated 2 m temperatures extends further east in ERA-I. Much larger differences are naturally evident in the initialization from each of the reanalyses across the two start dates (26 June vs. 28 June). The weaker, but evident, influence from model initialization time (e.g. Figure 10) is consistent with information from the short-term forecasting community, although interestingly the spatial fields of precipitation accumulation exhibit higher agreement with observations in ensemble members initialized on 26 June.
https://doi.org/10.5194/nhess-2021-373 Preprint. Discussion started: 10 December 2021 © Author(s) 2021. CC BY 4.0 License.

Linking fidelity to metrics of CAPE, downbursts, and cold pool generation
As described above, there is considerable spread among the ensemble members in terms of their fidelity relative to remote sensing and in situ observations. Here we seek to link model skill in reproducing aspects of derecho intensity (maximum wind gust, precipitation, and spatial coverage of cREF > 40 dBZ) to metrics of convective potential, specifically MU-CAPE and wind shear between the ground and 6 km, plus metrics of convective intensity, specifically indices of cold pool intensity, vertical velocities and cloud top height. We begin by describing the magnitudes and spatial variability of the diagnostic metrics in each ensemble member.
Consistent with estimates of parcel CAPE from rawinsonde soundings for this event and modeling of other derechos (Gatzen, 2004; Coniglio et al., 2011; Weisman et al., 2013; Celiński-Mysław and Matuszko, 2014) observational estimates for derecho events over the contiguous US between 1988-1993 (Evans and Doswell, 2001). The nudged ensemble members, plus Morrison+Hail and NSSL, indicate relatively low shear. The degree to which MU-CAPE decreases by tp+3 varies considerably across the ensemble members (SM Figures S3-S5 and Table 5). The change in 50th percentile MU-CAPE values across domain d03 ranges from ∼ 0 in the ensemble members NSSL and Thompson to ≥ 900 J kg⁻¹ in ensemble members Morrison, 628-ERA-I. Indeed, the change in median MU-CAPE is ∼2000 J kg⁻¹ in the Milbrandt-626-ERA-I and Milbrandt-628-ERA-I ensemble members that also showed highest agreement with observations of the spatial extent of high cREF, total precipitation accumulation, maximum wind gusts and large hail (Table 5). Other metrics that describe convective intensity and are diagnosed at tp also indicate substantial variability across the ensemble members. Modeled vertical velocities at or close to 5 km height at tp are highest in the Goddard, Morrison, Milbrandt-626-ERA-I and Milbrandt-628-ERA-I ensemble members (Figure 12, see also SM Figure S8), which also show substantial coverage of upward velocities in excess of 3 m s⁻¹ and proximal regions with substantial downdrafts of greater than 3 m s⁻¹. This is manifest as high values of the standard deviation of vertical velocities within 50 km of grid cells with cREF > 40 dBZ (Table 5). Goddard, Morrison, Milbrandt-626-ERA-I and Milbrandt-628-ERA-I are also the ensemble members with the highest maximum near-surface wind speeds (Figure 10 and Table 5).
The estimate of cloud top height derived using a threshold of base reflectivity from each model layer ranges from a low of 9 km (Morrison+Hail) to over 13.5 km in all ensemble members that employ the Milbrandt microphysics schemes and that were not subject to nudging (Table 5).
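One plausible reading of this estimate is the height of the highest model layer whose reflectivity reaches a chosen threshold. The sketch below illustrates that reading for a single model column; the function name is hypothetical and the threshold is left as a required argument because its value is not stated in this section.

```python
def cloud_top_height(heights_km, reflectivity_dbz, threshold_dbz):
    """Height (km) of the highest layer with reflectivity >= threshold_dbz.

    heights_km and reflectivity_dbz are per-layer values for one model column.
    Returns None when no layer reaches the threshold.
    """
    top = None
    for z, dbz in zip(heights_km, reflectivity_dbz):
        if dbz >= threshold_dbz and (top is None or z > top):
            top = z
    return top
```

Applying this column by column and summarizing over the convective region would give a domain-level value comparable in kind to those in Table 5.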
Cold pool intensity, as measured by the highest 5-percent of sea-level pressure anomalies (95th percentile SLP) and lowest 5-percent of temperature anomalies (i.e. 95th percentile negative temperature perturbations), also exhibits substantial variability between ensemble members. This is consistent with previous research that has examined microphysics scheme spread and its associated impact on cold pool properties and dynamics (Xue et al., 2017). The lowest 5-percent temperature deviations vary from -1.38 to -5.58 K (Table 5 and example fields shown in Figure 13 for the Morrison and Milbrandt-628-ERA-I ensemble members).
The upper end of this range is thus consistent with the cold pool intensities from the experimental study of derechos from Oklahoma that indicated maximum (point) temperature anomalies of 5.4 to 9.5 K (Engerer et al., 2008). Four ensemble members (Goddard, , and the two simulations with ERA-Interim LBC) also exhibit 95th percentile SLP deviations of above 2 hPa (Table 5 and example fields shown in Figure 13). While it is challenging to evaluate the simulation of these cold pools due to the limited spatial coverage of the ASOS network, the range of SLP and near-surface temperature anomalies from these ensemble members is broadly consistent with those calculated from the ASOS observations.
The Spearman correlation coefficients (r) between the three metrics of model fidelity from this 11-member ensemble are > 0.9, indicating that a simulation that exhibits atypically high skill with respect to maximum wind speed is also likely to perform well in describing the spatial extent of high cREF and accumulated precipitation (Table 5). The storm intensity metrics all also exhibit positive r but of varying magnitude. For example, there is only a weak rank correlation between cloud top height and vertical velocities (r < 0.38).
Simulated wind gusts at the surface are a product of downdrafts/downbursts and resulting gust fronts. Accordingly, the highest 5% of downward vertical velocities exhibits a Spearman correlation coefficient (r) with the ratio of modeled to observed maximum wind gusts of 0.90 (Table 5). All ensemble members that exhibit higher maximum gust ratios also exhibit stronger downdrafts (i.e. the largest negative vertical velocities), stronger vertical wind shear and a higher median MU-CAPE change. Consistent with past research that examined ensemble spread for simulated squall lines from use of different microphysics schemes (Morrison et al., 2015; Xue et al., 2017), the two cold pool metrics are also shown to be predictive of model fidelity for wind gusts associated with the derecho. That is, models that generate the strongest cold pools (as measured by either the near-surface temperature or pressure anomalies) tend to be those that perform best in terms of the associated near-surface wind gusts (r across the 11 members is 0.72-0.75, see Table 5). Metrics of cold pool dynamics are also predictive of other aspects of simulation fidelity (e.g. extent of cREF), consistent with their importance for the triggering and organization of persistent convection.
Although the two ensemble members that exhibit highest fidelity with respect to the areal coverage of cREF (Morrison and Milbrandt-628-ERA-I) also exhibit relatively high skill in reproducing precipitation and wind gusts, as illustrated by Figure 13 these simulations generate different morphologies of the derecho. Specifically, the region of high cREF is much more spatially homogeneous in Morrison than in Milbrandt-628-ERA-I. Further, the cold pool intensity at tp exhibits important differences. The region of elevated SLP is much more marked in Morrison but the associated temperature anomaly is much smaller than that from Milbrandt-628-ERA-I; this may be linked to the lower elevation of downdraft maximum intensity in the Morrison ensemble member (Table 5).
In those ensemble members that perform comparatively poorly in terms of reproducing key aspects of the derecho (e.g. Morrison+Hail, NSSL, Thompson and both nudged simulations), MU-CAPE is not consumed in sufficient amounts, resulting in under-production of deep convection during the derecho (Table 5). This leaves excess MU-CAPE available for the subsequent frontal passage, resulting in excess production of convective cells, wind gusts, cREF > 40 dBZ and precipitation (Figures 6, 8 and 11). This may have implications for climate-scale (long-term) simulations from CPRCM and specifically for inference regarding temporal sequencing of deep convection and associated hazards such as flooding.

Summary and conclusions
Severe wind gusts associated with derechos represent an important natural hazard resulting from MCSs. Efforts to improve simulations of deep convection in both weather forecasting and climate projections have been hampered by conceptual gaps in understanding of small scale cloud processes, a lack of observations of both the associated hazards and hydrometeor properties on the microscale (Morrison et al., 2020), and challenges in representing scale linkages in numerical models. Additionally, advanced schemes tend to be computationally expensive (Xue et al., 2017), which may limit their utility in CPRCM simulations. Accordingly, while a limited number of studies have sought to examine how severe convective wind environments might change in the future (Brown and Dowdy, 2021a), very few robust hindcast ensemble simulations exist for specific events that can be leveraged in a pseudo-global warming framework. Evaluating the inherent ability of models to reproduce key aspects of historic, poorly forecast severe events will facilitate the further development of model parameterization schemes, allow selection of optimal model configuration for simulating high impact events (Dai et al., 2021) and provide context for examining how such events might change in the future.
Revisiting the main objectives of this work, we sought to evaluate an ensemble of simulations with WRF that differ in terms of the microphysics schemes applied, start date, the lateral boundary conditions and use of nudging. The main findings of this study are:
1. This 11-member WRF ensemble tends to underestimate the spatial extent of high composite reflectivity, near-surface wind speed and precipitation during the Derecho period and overestimate cREF, wind speed and precipitation during a subsequent frontal passage. The bias with respect to the subsequent front is linked to a negative bias in MU-CAPE depletion during the derecho. The use of a double-moment, 7-class scheme with number concentrations for all species (including hail and graupel) [Milbrandt-Yau] results in the greatest model fidelity for maximum wind speeds, hail, and precipitation accumulation. This is consistent with numerous studies that have shown increased fidelity when using double-moment, bulk microphysics schemes with number concentrations for ice, graupel, and hail (Morrison et al., 2015).
2. Model settings such as initialization time and LBC exhibit a strong signal in driving different convective conditions and result in a large spread of the associated natural hazards: wind gusts and hail. The ensemble spread from changing the microphysics scheme and the resulting simulated dynamic and thermodynamic convective structures (Xue et al., 2017) is similar to that caused by changing the lateral boundary conditions. The higher fidelity associated with use of ERA-Interim reanalysis data as opposed to ERA5 is unexpected. Nested simulations of a European derecho event using the COSMO regional model found significant improvement in the simulation fidelity with use of ERA5 for the LBC over simulations using ERA-Interim. Our finding has important implications for construction of hindcast simulations for use in Surrogate or Pseudo Global Warming (PGW) numerical experiments to quantify the potential impact of global warming on extreme weather events using regional models (Li et al., 2019). In such simulations an historically important extreme event is first simulated using contemporary LBC and then the simulation is repeated using LBC and IC perturbed to represent the change in, for example, air temperatures and water vapor availability. The difference in these two realizations is interpreted as the impact is consistent with past research that has indicated forecast errors in the simulation of deep convection have a doubling time of only a few hours (Prein et al., 2015). This represents an important challenge for simulations of these atmospheric hazards.
3. The diagnostic metrics applied here to represent pre-conditioning of the environment plus key dynamic and thermodynamic aspects of the storm (development and propagation of squall lines, downbursts and cold pool development) are highly predictive of the relative skill of individual model ensemble members. This seems to imply that, although the ensemble members incompletely resolve key outcomes of the derecho (e.g. the intensity of the wind gusts), their relative ability in terms of the associated dynamics indicates that the better performing ensemble members are generating 'the right answers for the right reasons'.
Due to the computational demand, a spectral bin microphysics scheme was not used here, even though such schemes have been shown to outperform double-moment bulk schemes in a weather forecasting context (Fan et al., 2017; Xue et al., 2017). Future work in the field of model fidelity and scheme sensitivity that examines historically significant weather events would benefit from even larger ensembles and, as computing developments allow, the use of more conceptually realistic spectral bin microphysics parameterization schemes.

Code availability
The WRF code version used in this study (v3.8
FL, TJS and SCP performed the analyses, and prepared the figures/tables. SCP and TJS developed the initial manuscript. All authors contributed to the final manuscript.

Competing interests
The authors declare that they have no conflict of interest.

Figure 1) and physics settings (see also
Varies - see Table 2
Longwave radiation: RRTM (Mlawer et al., 1997)
Shortwave radiation: Dudhia (Dudhia, 1989)

29-Jun-2012 21:30:00 to 30-Jun-2012 13:30:00) from WRF and ASOS observations. In this analysis WRF output for maximum time step wind speeds (dt = 6 sec) is sampled at the 34 ASOS locations and compared with the maximum 3-second ASOS wind gust measurements (see spatial fields in Figure 10). Also shown are the Spearman rank correlations between spatial fields of total accumulated precipitation from WRF output relative to RADAR estimates and ASOS in situ measurements. In these analyses the correlations between WRF and the RADAR data are for all WRF grid cells sampled by the RADAR (