A tailored multi-model ensemble for air traffic management: Demonstration and evaluation for the Eyjafjallajökull eruption in May 2010

High quality volcanic ash forecasts are crucial to minimize the economic impact of volcanic hazards on air traffic. Decisionmaking is usually based on numerical dispersion modeling with only one model realization. Given the inherent uncertainty of such approach, a multi-model multi-source term ensemble has been designed and evaluated for the Eyjafjallajökull eruption in May 2010. Its use for air traffic management is discussed. Two multi-model ensembles were built: the first is based on the output 5 of four dispersion models and their own implementation of ash ejection. All a priori model source terms were constrained by observational evidence of the volcanic ash cloud top as a function of time. The second ensemble is based on the same four dispersion models, which were run with three additional source terms: (i) a source term obtained with background modeling constrained with satellite data (a posteriori source term), (ii) its lower bound estimate, and (iii) its upper bound estimate. The a priori ensemble gives valuable information about the probability of ash dispersion during the early phase of the eruption, when 10 observational evidence is limited. However, its evaluation with observational data reveals lower quality compared to the second ensemble. While the second ensemble ash column load and ash horizontal location compare well to satellite observations, 3D ash concentrations are negatively biased. This might be caused by the vertical distribution of ash, which is too much diluted in all model runs, probably due to defaults in the a posteriori source term and vertical transport and/or diffusion processes in all models. Relevant products for the air traffic management are horizontal maps of ash concentration quantiles (median, 75 %, 15 99 %) at a fine-resolved flight level grid. These maps can be used for route optimization in the areas where ash does not pose a direct and urgent threat to aviation. Cost-optimized consideration of such hazards will result in much less impact on flight cancellations, reroutings, and traffic flow congestions. 1 https://doi.org/10.5194/nhess-2021-96 Preprint. Discussion started: 9 April 2021 c © Author(s) 2021. CC BY 4.0 License.

In WRF-Chem, external meteorological data are only used as input and boundary conditions. Therefore, they need to extend over a slightly larger area than the modeling domain itself. For this study, these data were obtained from ECMWF every three hours with 0.25 • horizontal resolution and at 91 vertical levels. 155 For each volcanic source term, volcanic ash mass was distributed between 10 aerosol bins with diameter sizes ranging from about 3.9 µm to 2 mm following the S2 classification scheme of Stuefer et al. (2013). Fine volcanic ash with particles smaller than 31.25 µm accounted for 27.5 % of total emitted ash.
Mean concentration of fine volcanic ash between selected flight levels was calculated after interpolating the hourly-resolved model output to the regular 0.1 • × 0.1 • EUNADICS-AV grid. 160

Source terms
Every model uses two types of source terms: an a priori souce term and an a posteriori source term. A priori source terms are computed from simple schemes, e.g., using data such as plume height estimates as input information. The a posteriori source term was obtained after inversion of satellite total column estimates (Stohl et al., 2011).
For all models, the a priori source terms are, in a nutshell, derived from a plume model using plume height estimates from 165 radar data (Arason et al., 2011). The plume models are respectively, PLUMERIA (Mastin, 2007) for FLEXPART, F-PLUME (Folch et al., 2016) for MOCAGE, and the Mastin et al. (2009) relationship, which relates plume height with emitted mass per time step, for MATCH and WRF-Chem. MATCH and WRF-Chem assume an umbrella-shaped plume with slightly different structures: while MATCH assigns 10 % of the mass to below 25 % of the plume height, 15 % of the mass up to 50 % of the plume height, and the remaining 75 % of the mass to above 50 % of the plume height, WRF-Chem assigns 25 % of the mass 170 from the vent height to the umbrella base (75 % of the plume top height) and 75 % of the mass to the umbrella (Stuefer et al., 2013;Hirtl et al., 2019). Furthermore, radar plume heights used in both models were slightly different. The records of radar plume heights used in MATCH are those reported at the time of the Eyjafjallajökull eruption. Reports were provided for various durations from 3 to 21 hours (in steps of 3 hours). Some longer durations may just be a concatenate of repeated height levels in consecutive reports. For WRF-Chem, the 75 % quantile of plume-top altitude statistics calculated for 3 hour time intervals The time-height evolution of the source term intensity differs significantly from one model to the other (Fig. 1). While all source terms emit most ash at high altitudes, FPLUME/MOCAGE emits very little ash at medium levels and only little ash near the ground. The transition from high to low emissions is smoother for the other models. Furthermore, topography of the volcanic vent, which is actually located at 1666 m above sea level (asl), is fully considered in FPLUME/MOCAGE and 180 FLEXPART but neglected in MATCH and WRF-Chem.
Contrary to the a priori source terms, the a posteriori source term is the same for all models. Described by Stohl et al. (2011), it is based on a Bayesian inverse modelling approach using FLEXPART in forward mode to establish the sensitivities between the emissions and the measurements. In this study, altitude and time resolved volcanic ash emissions were derived for the Eyjafjallajökull eruption in April and May 2010 using satellite observations. The data used was total column ash measurements 185 of the Spinning Enhanced Visible and Infrared Imager (SEVIRI) aboard the geostationary Meteosat Second Generation (MSG) satellite and the Infrared Atmospheric Sounding Interferometer (IASI) aboard the polar orbiting Metop satellite, utilizing their high temporal coverage (SEVIRI) and enhanced sensitivity to ash (IASI). Furthermore, a priori estimated ash emissions based on observed plume heights and the eruption column model PLUMERIA (Mastin, 2007) were used. Validations with independent observations revealed improved model results when using the inversion-derived a posteriori ash emission source 190 term rather than an a priori one (Stohl et al., 2011).
Based on an optimal estimation algorithm, the Stohl et al. (2011) source term provides an optimal value of the emission.
There is, however, also uncertainty associated with the source term that is quantified as error variance after the inversion. In order to use this uncertainty, two additional source terms are built: a "lower bound" estimate of the ash emission and an "upper bound" estimate ( Fig. 2). At each time and vertical level (t, z) above the vent, Stohl et al. (2011) provide an optimal mass 195 estimate and its error variance (m, σ 2 ). Lower and upper bounds are derived from the assumption that these bounds equal the 15 % and 85 % quantiles, respectively, of a log-normal distribution with maximum likelihood m and variance σ 2 . The choice of a log-normal distribution is preferred to a normal one because it does not allow for negative values, despite the fact that the Stohl et al. (2011) inversion assumed a normal distribution of emission flux errors.

VACOS reference ash observations 200
The Volcanic Ash Cloud properties Obtained from SEVIRI algorithm (VACOS, Piontek et al., 2021b) derives volcanic ash coverage, ash optical thickness at 10.8 µm, mass column loads, volcanic ash plume height, and volcanic ash effective particle radius from data of the passive SEVIRI imager aboard the geostationary MSG satellite (Schmetz et al., 2002). SEVIRI is a twelve-channel instrument with a spatial resolution of 3 km at the sub-satellite point that diminishes towards the edges of the Earth disk. Its temporal resolution is 15 min in the operational mode over the Equator at 0°E for the entire disk. Thus, this 205 instrument is well-suited for the continuous monitoring of volcanic ash clouds in the atmosphere.
The VACOS retrieval of volcanic ash used here is the follow-on version of the algorithm Volcanic Ash Detection Using Geostationary Satellites (VADUGS) developed after the Eyjafjallajökull eruption in (Kox et al., 2013Graf et al., 2015;WMO, 2015WMO, , 2017de Laat et al., 2020;Bugliaro et al., 2021). Like other spaceborne retrievals of volcanic ash (e.g. Prata, 1989a;Prata and Grant, 2001;Francis et al., 2012;Prata and Prata, 2012;Pavolonis et al., 2013;Pugnaghi et al., 2013;Piscini 210 et al., 2014) it exploits the spectral signatures of volcanic ash in the atmosphere, in particular the reverse absorption effect between two window channels centred at 10.8 and 12.0 µm (Prata, 1989a, b) and the information coming from all thermal SEVIRI channels. Thus, it is applicable during day and night. Moreover, VACOS uses auxiliary information like satellite viewing angle, surface temperature obtained from a numerical weather prediction (NWP) model, or clear sky brightness temperatures derived from the SEVIRI observations. VACOS consists of four artificial neural networks (ANNs) with three hidden layers and 100 neurons each. While the development of the VADUGS and VACOS algorithms follows the ideas implemented in the related ice cloud retrievals COCS ("Cirrus Optical properties derived from CALIOP and SEVIRI algorithm during day and night", Kox et al., 2014) and CiPS ("Cirrus Properties from SEVIRI", Strandgren et al., 2017), the most important difference consists in the fact that the volcanic ash retrievals are trained using simulated SEVIRI thermal observations instead of collocated observations of ice clouds of the spaceborne lidar CALIPSO/CALIOP (Winker et al., 2009) as for COCS and CiPS. The VACOS input data 220 set consists of brightness temperatures for the SEVIRI thermal channels corresponding to observations of volcanic ash under a multitude of meteorological conditions with a multitude of microphysical and macrophysical properties like plume bottom and top height as well as volcanic ash concentrations. Ash optical properties have been computed according to a set of representative refractive indices to encompass a large variety of possible volcanic eruptions (Piontek et al., 2021c). Furthermore, VACOS is trained to detect volcanic ash in cloud-free environments and also above liquid water clouds. Ice clouds are excluded a priori 225 through the application of the dedicated ice cloud algorithm COCS mentioned above. VACOS has a fairly good volcanic ash detection probability for ash layers with column loads between 0.2 and 1 g m −2 (between 1 and 10 g m −2 ) of approximately 93 % (99 %) and also allows for the quantification of the ash load of the plume with a mean absolute percentage error of ca. 40 % (26 %). These values have been derived for a simulated test data set, using the retrieved ash optical depth at 10.8 µm, a typical mass extinction coefficient of 0.2 m 2 g −1 and a threshold value of 0.2 g m −2 (Piontek et al., 2021a).

Airborne measurements
During the period of the study, airborne measurements were reported in the literature. A lidar onboard the Facility for Airborne Atmospheric Measurements (FAAM) BAe-146 research aircraft (Marenco et al., 2011) scanned ash layers to estimate ash load, ash layer height, and ash 3D concentrations above the United Kingdom and the North Sea. During the period of interest, measurements were performed on 14, 16, 17, and 18 May. Furthermore, DLR flights reported in-situ estimates of 3D ash 235 concentrations above the North Sea, Germany, and the Netherlands (Schumann et al., 2011) on 13, 16, 17, and 18 May. In the present study, these data sets are used to assess the performance of individual models and both ensembles.

Grids and post-processing
All model results were provided with hourly resolution on a common 0.1 • × 0.1 • horizontal grid from 30 • W to 40 • E in longitude and from 30 • N to 75 • N in latitude. The vertical discretization is defined by 13 flight levels from FL50 up to FL650 240 with 50 hecto feet steps. This common grid has been designed to facilitate the computation of scores on a uniform domain.
The horizontal and vertical resolutions are significantly higher compared to past studies (Kristiansen et al., 2012;Dacre et al., 2016).  Over all dates considered, VACOS ash load is within the range of values found in the literature.
The Fraction Skill Score (FSS) is a meaningful metric to assess the performance of volcanic ash dispersion simulations by 260 determining the scale over which a simulation has skill for the location of ash plumes along the horizontal dimensions (Harvey and Dacre, 2016). It is calculated as: with N being the total number of grid points in the verification area, and M j (r) and O j (r) being the fractions of contaminated grid points within the circle of radius r around point j, for the model and the observations, respectively. Before the computation 265 of F SS(r), a normalization step was applied, where the G most contaminated grid points were determined for VACOS and model data. For VACOS, all grid points (within the verification area) with ash load higher than 0.2 g m −2 are assumed to be contaminated; G is defined as number of these grid points. For each model output, the G grid points with the highest ash concentration in the domain are kept for further analysis and used to calculate the FSS. This normalization step helps to keep only the most intense ash features in both models and observations and avoids the problem due to different magnitudes of ash 270 load in the observation and the model. A model has skill at a given scale if the FSS is above 0.5; the higher the FFS, the better the model performance.

Ash location
Generally speaking and not surprisingly, the FSS (Fig. 4) increases with the radius of detection (50 km, 200 km, 500 km). While the models FSSs-50km do not always exceed the 0.5 threshold, the FSSs-500km are clearly higher than 0.5, except after 19 275 May, when the eruption stopped and scores are less relevant. The FSSs based on the a posteriori source term (right panels) are generally higher than the ones based on the a priori source terms, particularly for the 50 km and 200 km radii. There is a large variability of scores between the models, although this variability is lower for the models with the same a posteriori source term. Snapshot ash load maps can help to analyse the FSS scores and the performances of the models and/or of the observation shortcomings, e.g., clouds hindering satellite observations. At first sight, the differences between the different a posteriori source term bounds for the same model are in a similar range as the differences between two models with the same best estimate a posteriori source term. Comparison with ash load using the a priori source term (first row of Fig. 5) shows that the simulations with the a posteriori source terms do not reach the extreme values that can be reached with the simulation with the a priori source terms. The analysis and evaluation of model ash load on 295 14 May at 15 UTC and on 17 May at 16 UTC (suppl. material) support these findings.
As a conclusion, the comparison of FSSs and of ash column load maps helps to understand the relative performance of models and the origin of their differences. When using different (a priori) source terms, large differences in ash load magnitude can be observed. Using the same a posteriori source term generates simulations with similar magnitudes of ash load and similar location of ash in the models, but some differences in ash load and details in plume shapes are obvious. If models were forced 300 by meteorological forecasts (instead of analyses), larger differences in plume location and ash load can be expected.

Ash vertical distributions
Cross-sections of ash concentrations on 16 May at 14 UTC (Fig. 6)  The presence of ash at all levels can be due to the source term or due to the representation of vertical processes in the models, 310 and it is possible to provide arguments to help disentangling the two. For most of the source terms, except the MOCAGE a priori one, ash is injected at all layers. Before 16 May at 14 UTC (Fig. 1), the WRF-Chem a priori source term also concentrates ash in the upper levels. Since MOCAGE and WRF-Chem ash tends to extend largely along the vertical (Fig. 6), vertical parameterizations and processes including grid-scale vertical velocity, diffusion, aerosol sedimentation and/or vertical resolution should probably be improved. However, from this study alone, it is not possible to conclude whether the a posteriori source 315 term is too much diluted along the vertical or not.

Building ensembles and evaluating their quality
Ensemble forecasting has a 30-years long history in meteorology (Buizza, 2019), from which some guidelines and pitfalls can be learned for an extension to volcanic dispersion forecasting. A first lesson learned is that all possible sources of uncertainty should be taken into account. The possible methods to take into account the uncertainty (perturbations or stochastic represen-320 tation of features) can be diverse and have been an active field of research (Leutbecher et al., 2017). Another lesson learned is that the evaluation of ensembles is critical and is usually done based on long-period data sets for which homogeneous ensembles are run and compared against measurements. Usually, the evaluation metrics are used to further design the perturbation methods or bounds. Regarding rare events such a volcanic eruptions, for which observations are rare, evaluation of ensembles is clearly a more difficult task.  Evaluation of ensemble performance is done against VACOS ash location to assess the horizontal spread and against aircraft observations of lidar and in-situ measurements in order to evaluate simulated local ash load and ash concentrations along flight routes.

350
Like for the individual model outputs, ash location of ensemble outputs is evaluated using the FSS metric applied on a smaller domain shown in Fig. 3 at 0.2 • resolution. While the FSS can differ significantly from one model to another (Fig. 4), the FSS of ensemble quantiles (Fig. 8) are quite close to each other, except for some specific points in time. In that case, FSSs of the median are higher than the Q99 FSSs. In general, the a posteriori ensemble median performs better than the a priori ensemble median. This is also generally true for the higher quantiles, except for the Q99 on 17 and 18 May, when the Q99 quantiles of 355 the a posteriori ensemble show lower performance than the a priori one. However, the differences are rather small. A possible explanation is that the Q99 a posteriori ensemble (suppl. material) does not catch the ash patch in the north of the Netherlands on 17 May (Fig. 3).
To summarize the capacity of ensemble quantiles to catch the regions of highest ash load, the different ensemble quantiles have similar performance, and the a posteriori ensemble performs generally better than the a priori ensemble.

Evaluation of ash column load
Evaluation of ash column load is done against the most precise airborne lidar measurements from Marenco et al. (2011). Fig. 9 summarizes how the ensemble performs at different points along three flight tracks: on 14 May around 15 UTC, on 16 May around 14 UTC, and on 17 May around 16 UTC. Not all the flight track is taken into account, but only some locations (referred to as F in Fig. 7), which correspond to points where the highest values of ash load were measured. For comparison, ensemble 365 results are extracted from the few (4 to 9) grid points obtained along the flight track where those highest values were measured.
VACOS data extracted at the same grid points are also superimposed. Although such evaluation does not provide a rigorous probabilistic evaluation of the ensemble, it helps to characterize the ensemble dispersion and shows how the ensemble quantiles match the observations.
For the three episodes (Fig. 9), the airborne measurements fall in the range of the median, Q75, or Q99 ensemble values, 370 which shows that the both ensembles capture the real ash load. The range from the median to Q99 is smaller for the a posteriori ensemble than for the a priori ensemble: the ensemble dispersion is smaller using the a posteriori source term, though keeping it large enough. The median of the a posteriori ensemble performs best, except that it is biased low on 17 May at 16 UTC, but the median of the a priori ensemble is always biased low. Of course, further calibration based on different test cases should be done to validate these results.

Evaluation of 3D concentrations
The evaluation of ash concentration is performed against DLR and FAAM aircraft measurements taken during their flight routes. In-situ ash measurements are rare, though some exist, particularly for the phase of the eruption studied in the present article. Table 1 compares in-situ airborne measurements with ensemble values (using the a priori and a posteriori source terms).
The maps of concentration values for different ensemble quantiles and the flight locations are shown in Fig. 10.

380
A general conclusion based on Table 1 is that the ash Q99 values based on the a priori source term are much higher than based on the a posteriori source term. This is consistent with the fact that the a priori member with maximum ash concentration yields generally higher values than the a posteriori member with maximum ash concentration (Figs. 5 and 9). On 13 May at 14 UTC (map not shown), the ash concentrations are low (around 10 to 20 µg m −3 ), in the measurements, but also in the ensemble, which are in a good agreement for both source terms. On the 14 th , the aircraft flew through a rather ash-loaded area (between 500 and 900 µg m −3 ), which is also obviously polluted by ash according to the ensemble, as shown by the corresponding map (Fig .10), but with much lower concentration values (100 to 500 µg m −3 for both source terms).
However, where high ash concentrations are measured (above 200 µg m −3 ), the ensemble values generally tend to be lower.
Even for the flight routes where modelled ash column loads are in reasonable agreement with the measurements (FAAM flights, Fig. 9), ash concentrations at the flight levels are significantly lower. An explanation, consistent with conclusions in previous 390 parts of this article, is that ash is too much diluted along the vertical, due to the shape of the source term or to vertical dilution processes during ash transport. This is also true for the a posteriori source term which is constrained by ash load satellite estimates, but where the vertical distribution of ash is mainly constrained by the a priori. Their ensemble was based on one model with different model parameters, source terms, and meteorology. It was evaluated on a synthetic hypothetical use case.
In this study on a real case, we found considerable differences in ash location and ash concentrations due to the model 400 choice indicating that a multi-model ensemble increases information about uncertainty. Even though the a posteriori ensemble was, in general, in better agreement with the observations, an a priori ensemble is preferred for ATM as it is available in nearreal-time also during the early phase of an eruption. The a posteriori source term can only be computed if a sufficient number of measurements is available. Due to the limited number of ground-based measurements, the missing temporal coverage of measurements on polar-orbiting satellites, and only a few instruments on geostationary satellites, it usually takes a couple of 405 days to gather sufficient high-quality ash measurements to compute the a posteriori source term. Thus, a better quality of past ash dispersion and therefore better initial conditions for upcoming ash forecasts, can only be obtained after several hours or even a few days. An a priori ensemble includes different realizations of the source term, the most important component of ash dispersion uncertainty. The wide range of these a priori ash dispersion forecasts, however, might result in too conservative ATM, which can only be eased when a refined a posteriori source term is available. An intermediate approach could be to update 410 the a priori source term continuously by constraining the assumptions of the source term evolution with updated measurements.
The evaluation of both ensembles revealed that ash location and load is in good agreement with the observations. However, the vertical structure of ash clouds is not correctly represented by the models and ash concentrations at individual flight levels are biased low. Therefore, care must be taken concerning flight rerouting. Flying between distinct layers of ash would only be possible knowing the precise vertical structure of ash clouds, which is clearly not the case. However, model performance 415 allows flying above clouds and also around highly contaminated regions.
Ash concentrations smaller 2 mg m −3 are considered safe. Ash concentrations higher than 80 mg m −3 are definitely considered unsafe (Clarkson et al., 2016). Following a conservative approach, we therefore recommend avoiding regions within the 2 mg m −3 contour line of the Q99 of the apriori ensemble. Looking at Fig. 10 2010). Knowing that the vertical distribution of ash is not correctly represented in the models, vertical cross sections can nevertheless be used to estimate the upper limit of the ash cloud. The cross sections shown in Fig. 11  can be used to optimize the cost/loss function, in a similar approach as the one developed by other end-users of meteorological ensemble forecasting (Richardson, 2000).
This approach would lead to a better management during future volcanic eruption crises. Probabilistic ash concentration forecasts, combined with safe "no-fly" areas could become the future operational ash products, allowing cost-optimized consideration of such hazards. This will result in much less impact on flight cancellations, reroutings, and traffic flow congestions 435 during volcanic ash events.

Summary and conclusions
This article has presented an inter-comparison of volcanic ash forecasts using different models and different source terms for the Eyjafjallajökull eruption in May 2010. Furthermore, a methodology to build and use an ensemble for ATM was discussed.
Most important findings include:

440
large differences in ash location and ash load were found when models were run with their individual a priori source terms, which confirms that ash dispersion forecasts are highly sensitive to the volcanic ash source term; perturbed a posteriori source terms can be shared and used as input for any model yielding a multi-model ensemble; this a posteriori ensemble performs satisfactorily for ash location in two dimensions and for ash column load.
the main shortcoming of all simulations is the vertical representation of ash concentration, which is evenly distributed 445 over a wide vertical range without distinct layers of ash; therefore, the vertical distribution of ash needs to be improved -in relation to the source term, even after inversion, but also as a consequence of vertical aerosol processes in models (sedimentation, diffusion, aggregation). Even with a vertically-layered source term, there is high vertical diffusion of concentrations some hours after the emission; ash does not pose a direct and urgent threat to aviation. Probabilistic ash concentration forecasts combined with safe "no-fly" areas can become a future operational product for ATM; the a priori ensemble is available in near-real-time also in the early phase of an eruption but the wide range of ash dispersion might be too conservative for ATM. This behaviour might be case-dependent, although experience shows that a priori source terms rather overestimate ash emissions. Therefore, flight rerouting can be based first on an a priori 455 ensemble, and only at an later stage, when the a posterori ensemble is available, less conservative approaches can be taken for flying through low-concentration ash clouds.
In this study, only source term and model process uncertainty have been taken into account. In real conditions, the meteorological forecast error cannot be neglected and would also increase the spread in plume location and ash column loads.
A rigorous evaluation of any ensemble should be done for a large number of cases, which is difficult for rare events such as 460 volcanic eruptions. Besides, only few measurements are available, hindering a comprehensive ensemble evaluation. The use of observations by assimilation along the vertical (such as lidar data) could also improve the model and ensemble representation of ash, even though such measurements remain rare and are available only where the ash plume is thin enough to be penetrated by the lidar.
The proposed methodology cannot only applied for ash dispersion during volcanic eruptions but also for other air pollutants, 465 such as SO 2 , desert dust, or forest fires. Every airspace closure or even re-routing of planes can immediately increase the costs for airliners which will lead to a certain risk acceptance to fly nevertheless at least through regions which are below the safety-critical pollutant concentration threshold. For future natural disasters, cost and disruption of air traffic can be eliminated to a great extent by including the results of dispersion models into flight planning software to apply cost-based trajectory optimizations.

470
Data availability. The ensemble data is available in NetCDF, CF-compliant format, upon request to the corresponding author. The ash concentration, as described in the article, cover the percentiles Q50, Q75, Q99, on 13 FL, on a 0.1°resolution grid. Instants of validity are from 13 to 20 May 2010, at an hourly step. Two ensembles are available: one using the a priori source terms, one using the a posteriori source terms.