Evaluation of forest fire models on a large observation database

This paper presents the evaluation of several fire propagation models using a large set of observed fires. The observation base is composed of 80 Mediterranean fire cases of different sizes, which come with the limited information available in an operational context (burned surface and approximate ignition point). Simulations for all cases are carried out with four different front velocity models. The results are compared with several error scoring methods applied to each of the 320 simulations. All tasks are performed in a fully automated manner, with simulations run as first guesses with no tuning for any of the models or cases. This approach leads to a wide range of simulation performance, including some of the bad simulation results to be expected in an operational context. Disregarding the quality of the input data, it is found that the models can be ranked based on their performance and that the most complex models outperform the more empirical ones.


Introduction
Model evaluation requires comparing predicted with observed values and is critical to establishing a model's potential errors and credibility. The first step in evaluating model performance is to be able to score single simulation results against observations. Such scoring methods have been the subject of several studies, initiated by Fujioka (2002) and recently compiled and extended in Filippi et al. (2013). Basically, it is clear from all these studies that a single value cannot be representative of model performance, as it gives only limited insight into all aspects of performance, while analysis by a human eye provides a better understanding of what was good and what went wrong.
Nevertheless, the problem of evaluating model performance must be tackled, as it is important to know whether a parameterisation or a new formulation is superior, and to continue the process of enhancing models, codes and data. From an operational point of view, it also appears that models are used with a clear lack of systematic evaluation, as noted recently by Alexander and Cruz (2013). A major step in such model evaluation is to compare observed rates of spread (ROS) with simulated ROS, as many data exist in the literature. Cruz and Alexander (2013) carried out such a comparison with the clear and reassuring conclusion that well-built empirical and semi-empirical models may provide a good ROS approximation. Our study focuses on the use of these models to simulate the overall two-dimensional fire spread and its corresponding burned area. Whilst evaluating the absolute model performance is, as yet, out of the question, it is proposed here to evaluate specific model performance. This specific evaluation is linked to a typical model usage and to a territory. The typical usage proposed corresponds to the plausible "first guess" case, where only an ignition location is known, with no direct observation of the wind or fuel moisture near the fire. The selected area is the Mediterranean island of Corsica, where numerous wildfires occur every year in a variety of configurations. The test consists of running simulations with four different models and a large number of observations, and compiling the results in the form of comparison scores between simulated and observed fires. These simulations must be run in a fully automated manner, without observation biases introduced by manual adjustment of fuel, wind or ignition location intended to enhance results. The overall results are the distributions of scores, using different scoring methods; this fulfills the goal of ranking the models according to their specific use.
The four different models are presented in the first section along with the simulation method used to compute the front propagation. The second section details the evaluation method, data preprocessing and numerical set-up. The results are presented and discussed in the last section, along with focuses on specific cases.

Models description
Fire propagation modelling can refer to a vast family of codes, formulations, systems or even data sets (Sullivan, 2009a, b). As this study focuses on the evaluation of large scale fire simulation, our selected definition of "fire model" is the formulation of the fire-front velocity. A velocity is obviously not enough to obtain fire progression and burned areas. A fire-front solver code and input data are needed. These two are the same for all models and described in the next section. Note that the proposed model selection is unfortunately not exhaustive of all existing formulations, but rather focuses on representing some kind of evolution in the model types.
Depending on their complexity, the models can take into account the terrain slope, the atmospheric properties (wind velocity v, air density ρ_a and temperature T_a), a spatial characterisation of the fuels (mass loading σ, density of live/dead fuel ρ_{l,d}, height e, surface-to-volume ratio S_v, emissivity ε_v and moisture content m, defined as the fraction of water over total weight) and the fuel combustion properties (ignition temperature T_i, calorific capacity c_{p,v}, combustion enthalpy h, stoichiometry s and mass exchange rate due to pyrolysis σ̇). Each model prognoses the fire-front velocity V in the direction normal to the front, n, pointing towards the unburned fuel.

3 % model
The first and simplest model makes the major assumption that the fire propagates at 3 % of the wind velocity, as long as fuel is available, regardless of vegetation changes or terrain slope. In practice, in order to compute the velocity everywhere on the fire front, the wind component normal to the front, W_s = v·n, is taken as the wind velocity and V = 0.03 W_s. This "rule of thumb" model is sometimes used by firefighters (with caution, because of its lack of reliability). Here, it will serve the purpose of hopefully being the lowest reference in terms of performance.
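As an illustration, the rule above amounts to a one-line projection of the wind onto the front normal. The zero floor for backing winds below is our own assumption in this sketch, not part of the paper's description.

```python
import numpy as np

def three_percent_ros(wind, normal):
    """Front-normal rate of spread for the 3 % rule of thumb.

    wind   : wind velocity vector (m/s)
    normal : outward unit normal of the front (towards unburned fuel)
    """
    w_s = float(np.dot(wind, normal))  # wind component normal to the front
    return max(0.03 * w_s, 0.0)        # no backing spread in this sketch

# A 10 m/s wind aligned with the front normal gives 0.3 m/s of spread.
print(three_percent_ros(np.array([10.0, 0.0]), np.array([1.0, 0.0])))
```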

Rothermel model
The quasi-empirical Rothermel model (Rothermel, 1972) forms the basis of the United States National Fire Danger Rating System and of the fire behaviour prediction tool BEHAVE (Andrews, 1986). It builds on earlier work by Byram and Fons (1952) and is based on a heat balance developed by Frandsen (1971). It relies heavily on a set of parameters obtained from wind tunnel experiments in artificial fuel beds (Rothermel and Anderson, 1966) and Australian experiments (McArthur, 1966). This model is also widely used in Mediterranean countries for various purposes (fire risk and behaviour), but usually requires some adjustments by experts in order to be fully efficient on all fuel types. Since this study is designed as a blind test for each model, these adjustments were not performed here. The default values given in Rothermel (1972), such as the moisture of extinction (M_χ = 0.3) and mineral damping (η_s = 1), were therefore used. Since its first development, the Rothermel model has not changed in its formulation, but users have adapted coefficients and added optional sub-models to fit specific cases. In this paper, we have used the latest revision from its original team of authors (Andrews et al., 2013), which includes, in particular, a revised wind speed limit function to lower the spread rate dependence on strong winds.
This quasi-empirical model also uses a number of fitted parameters (in US customary units). The spread rate reads

V = I_R ξ (1 + φ_w + φ_s) / (ρ_b ε Q_ig),

with reaction intensity

I_R = Γ' w_n h η_M η_s,  η_M = 1 − 2.59 r_M + 5.11 r_M² − 3.52 r_M³,  r_M = m/M_χ.

A propagating flux ratio is given by

ξ = exp[(0.792 + 0.681 S_v^0.5)(β + 0.1)] / (192 + 0.2595 S_v).

The wind factor is

φ_w = C W_s^B (β/β_op)^−E,  with C = 7.47 exp(−0.133 S_v^0.55), B = 0.02526 S_v^0.54 and E = 0.715 exp(−3.59 × 10⁻⁴ S_v).

The slope factor is

φ_s = 5.275 β^−0.3 tan² α.

An optimal packing ratio

β_op = 3.348 S_v^−0.8189,

with the packing ratio given by β = ρ_b/ρ_p. The maximum reaction rate is

Γ'_max = S_v^1.5 / (495 + 0.0594 S_v^1.5),  where Γ' = Γ'_max (β/β_op)^A exp[A(1 − β/β_op)] and A = 133 S_v^−0.7913.

In Andrews et al. (2013), a new wind limit function is imposed over the original model as a cap on the effective wind speed, W_s ≤ 96.8 I_R^{1/3}.
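For concreteness, the Rothermel quantities above can be assembled into a single spread-rate routine. This is a sketch of the standard published formulation, not the paper's exact implementation; the example fuel values are illustrative grass-like inputs, not taken from the paper's Table 1.

```python
import math

def rothermel_ros(w_n, h, S_v, rho_p, rho_b, m, M_chi, eps, Q_ig,
                  W_s, tan_slope, eta_s=1.0):
    """Rothermel (1972) spread rate in US customary units (ft, min, Btu, lb),
    with the Andrews et al. (2013) cap on the effective wind speed."""
    beta = rho_b / rho_p                          # packing ratio
    beta_op = 3.348 * S_v ** -0.8189              # optimal packing ratio
    A = 133.0 * S_v ** -0.7913
    gamma_max = S_v ** 1.5 / (495.0 + 0.0594 * S_v ** 1.5)
    gamma = gamma_max * (beta / beta_op) ** A * math.exp(A * (1 - beta / beta_op))
    r_m = min(m / M_chi, 1.0)
    eta_m = max(0.0, 1 - 2.59 * r_m + 5.11 * r_m ** 2 - 3.52 * r_m ** 3)
    I_R = gamma * w_n * h * eta_m * eta_s         # reaction intensity
    xi = math.exp((0.792 + 0.681 * math.sqrt(S_v)) * (beta + 0.1)) \
        / (192.0 + 0.2595 * S_v)                  # propagating flux ratio
    C = 7.47 * math.exp(-0.133 * S_v ** 0.55)
    B = 0.02526 * S_v ** 0.54
    E = 0.715 * math.exp(-3.59e-4 * S_v)
    W = min(W_s, 96.8 * I_R ** (1 / 3))           # wind limit (Andrews et al., 2013)
    phi_w = C * W ** B * (beta / beta_op) ** -E   # wind factor
    phi_s = 5.275 * beta ** -0.3 * tan_slope ** 2 # slope factor
    return I_R * xi * (1 + phi_w + phi_s) / (rho_b * eps * Q_ig)

# Illustrative grass-like fuel: net load 0.032 lb/ft², heat content 8000 Btu/lb,
# S_v = 3500 1/ft, particle density 32 lb/ft³, bulk density 0.034 lb/ft³,
# 8 % moisture with M_chi = 0.12, eps = 0.961, Q_ig = 339 Btu/lb.
r_calm = rothermel_ros(0.032, 8000.0, 3500.0, 32.0, 0.034, 0.08, 0.12,
                       0.961, 339.0, W_s=0.0, tan_slope=0.0)
r_wind = rothermel_ros(0.032, 8000.0, 3500.0, 32.0, 0.034, 0.08, 0.12,
                       0.961, 339.0, W_s=440.0, tan_slope=0.0)  # ~5 mph midflame
```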

Balbi model
The Balbi model, like Rothermel's, can be classified as a quasi-physical model. Its formulation is based on the assumption that the front propagates as a radiating panel in the direction normal to the front. The model verifies that, for a specific wind, terrain and fuel configuration, the absorbed energy equals the combustion energy directed toward the unburned fuel. This energy is the sum of a "radiant" part from the flame and a "conductive" part within the fuel layer. The assumption is also made that only a given portion χ_0 of the combustion energy is released as radiation, because the flame is viewed as a tilted radiant panel with an angle γ towards the unburned fuel. The propagation velocity of the front is written as the sum of two terms. The first is the contribution of the vegetation undergoing pyrolysis (B is the Stefan-Boltzmann constant and h_w is the water evaporation enthalpy). The second term accounts for the propagation by radiation, where H_{R+} is the Heaviside function for positive reals and μ is an evolution coefficient of the ratio between radiated energy and released combustion energy. The surface-to-volume ratio is denoted S_v, and τ is the burning duration given by the Anderson model (Anderson, 1969). The flame tilt angle γ depends on the slope angle α and the wind v projected onto the front normal vector n. A major assumption of the Balbi and Rothermel models is that the fire always travels at a stationary speed that verifies V = λ/τ, and that all energy is absorbed within the fuel bed for the computation of V_0. Because of these assumptions, the front velocity does not depend on the local fire state (previous intensity, front curvature, depth). The front cannot accelerate or go to extinction; its velocity depends only on the local fuel, wind and terrain properties. These assumptions are required to compute an a priori potential rate of spread without knowing explicitly the local front depth λ or its curvature κ, as in the BEHAVE tool.
Later versions of Rothermel added sub-models for acceleration or extinction. A more fundamental approach was developed for the Balbi model with a non-stationary formulation.

Balbi non-stationary model
By using the front tracking solver, the local front depth λ and curvature κ are always available as numerical diagnostics of the front. Introducing these variables into the model is rather simple and removes strong assumptions. In the updated formulation, β_d is a radiation damping ratio that depends on fuel packing, and the second (radiative) term, proportional to (1 + sin γ − cos γ) H_{R+}(γ), no longer requires the stationary-speed assumption.
The main disadvantage of the model is that it is now tied to a solver able to locally and continuously diagnose λ and κ at a reasonable numerical cost, introducing, in the process, some additional numerical errors. The solver used for this study is the front tracking code "ForeFire".

Fire propagation method
The fire propagation solver ForeFire uses a front tracking method that relies on a discretisation with Lagrangian markers. The outward normal n_i of marker i defines the direction of propagation, and v_i is the local front speed. The maximum distance d_m allowed between two consecutive markers is called the perimeter resolution. If two markers are further apart than this distance, a remapping of the front is carried out in order to keep the resolution constant. A filtering distance d_f = d_m/2 is also needed to avoid the over-crossing of two markers and a potential inversion of the normal. The advection scheme is a first-order Euler scheme in space,

x_i^(n+1) = x_i^(n) + δl n_i,   t_i^(n+1) = t_i^(n) + δl / v_i,

where ·^(n+1) refers to the value of a variable at the next step and the subscript i to the marker index. The spatial increment δl determines the resolution of the front propagation and should be smaller than the smallest spatial scale influencing the fire propagation, usually fire breaks such as roads; i.e. in typical simulations δl ≈ 1 m.
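A minimal marker-based front tracker along these lines can be sketched as follows. The circular test front and uniform speed are our own toy setup; ForeFire's actual scheme and data structures differ.

```python
import numpy as np

def advance_front(markers, speed, dl):
    """One Euler step: each marker moves a distance dl along its outward
    normal; the elapsed time for the step is dl / v_i (uniform v here)."""
    n = len(markers)
    normals = np.empty_like(markers)
    for i in range(n):
        tangent = markers[(i + 1) % n] - markers[i - 1]
        normal = np.array([tangent[1], -tangent[0]])  # tangent rotated -90°
        normals[i] = normal / np.linalg.norm(normal)
    return markers + dl * normals, dl / speed

def remesh(markers, d_m):
    """Insert midpoints wherever consecutive markers drift further apart
    than the perimeter resolution d_m."""
    out = []
    n = len(markers)
    for i in range(n):
        a, b = markers[i], markers[(i + 1) % n]
        out.append(a)
        if np.linalg.norm(b - a) > d_m:
            out.append(0.5 * (a + b))
    return np.array(out)

# Toy run: circular ignition front of radius 1 m expanding at 0.5 m/s,
# counterclockwise marker ordering so the computed normals point outward.
theta = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)
front = np.stack([np.cos(theta), np.sin(theta)], axis=1)
t = 0.0
for _ in range(10):
    front, dt = advance_front(front, speed=0.5, dl=0.1)
    t += dt
    front = remesh(front, d_m=0.3)
# After ten 0.1 m steps the mean radius is close to 2 m.
```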

Evaluation method
A selection of 80 fire cases has been compiled into an observation database for this study. For each fire, the required initial data are preprocessed to generate the initial conditions and the data required by the selected propagation model. The simulations are then run by distributing the computation of the different cases. Once a simulation result is available for an observation/simulation pair, the comparison is computed with the scores introduced in Sect. 3.3.

Observation database
The observation database is composed of 80 fires that all occurred on the Mediterranean island of Corsica between 2003 and 2006. These cases were extracted from the Prométhée database (http://www.promethee.com/), a repository of wildfire observations managed by the "Institut Géographique National" (http://www.ign.fr/). This database offers precise burned surfaces for many wildfires, along with information on the ignition date and approximate location. Nevertheless, this information cannot always be trusted, so each case was reviewed with a field expert in order to validate the ignition points and dates, which reduced the number of relevant observations. Corsica was selected because field expertise was available, as well as adequate data and homogeneity in fire dynamics given the relatively limited data set (mostly shrubs and Mediterranean maquis). Fire sizes in the selection range from 1 ha to several hundred hectares, in order to be representative of all potential model uses.

Data preprocessing
The first step in launching a simulation is to compile and format the data so that they can be processed by the fire propagation solver. For the fire simulation code ForeFire, the input data are composed of a configuration file, an elevation field, one or several fields of wind direction and speed, a land use field and a fuel field. The design of the study implies that the data were not tailored for any model or test case. All of the preprocessing is thus done automatically from the ignition location and date. We are aware that this automatic generation will produce input data that might seem extremely unrealistic compared to the observation. In particular, the wind direction and fuel state can be significantly different from the real values. For the considered fires, or for any fire that may ignite at any time and anywhere, the exact inputs to the simulation models are not observed or stored. This study tries to rely on the best available data provided in an operational context. The simulation configuration defines the size of the domain, the date and the numerical set-up. The rest of the input data are extracted from data files with the same domain extent and date using the Geospatial Data Abstraction Library (GDAL) and its tool ogr2ogr (GDAL development team, 2013), using a conformal projection.
The elevation field originates from the Institut Géographique National, Base de Données ALTIude (IGN BDAlti) at a 25 m resolution. It is originally available in DEM ASCII format (Digital Elevation Model, American Standard Code for Information Interchange).
The wind field used in the simulation is extrapolated with the "WindNinja" mass consistent code (Forthofer, 2007) from station data. The extrapolation is especially expected to rectify the wind velocity in valleys. WindNinja inputs consist of the elevation field and the wind station data. The wind field is outputted at the same resolution as the elevation field and is used for the whole duration of the simulation (under the strong assumption that the wind does not change direction during the fire).
WindNinja input data originate from the nearest of 12 automatic 10 m high weather stations of the Météo France network, relatively evenly distributed across the island. The selection algorithm is very simple: it ranks the stations by distance from the ignition point, selects the closest, and uses the nearest date and time available after the reported ignition time. It would certainly have been more relevant to select the closest upstream station, but such data are unfortunately unavailable for most of the fires because of the low density of the station network. The closest-station method was thus selected because it was thought to be the most suitable first guess in an automatic, operational context.
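The closest-station selection reduces to ranking distances from the ignition point; the station names and coordinates below are purely illustrative, not the actual Météo France network.

```python
import math

def nearest_station(ignition, stations):
    """Rank stations by distance to the ignition point and return the
    closest. `stations` maps a station name to its (x, y) position in the
    same conformal projection as the ignition point."""
    def dist(pos):
        return math.hypot(pos[0] - ignition[0], pos[1] - ignition[1])
    return min(stations, key=lambda name: dist(stations[name]))

# Illustrative positions (m) in a shared projection, not real coordinates.
stations = {"Ajaccio": (0.0, 0.0), "Bastia": (50_000.0, 120_000.0)}
print(nearest_station((10_000.0, 15_000.0), stations))
```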
Fuel distribution data are taken in vector format (shape file) from the Institut Géographique National, Institut français de l'environnement (IGN IFEN), with 10 local fuel classes corresponding to the main burning vegetation in Corsica. Fuel parameters are derived from Anderson et al. (1981) for the mixed/grass and pine forest types, while the "Proterina" parameterisation (Santoni et al., 2011) is used for the shrubs and maquis class. Among the 10 possible fuels, these four classes (shrubs, pine, mixed, maquis) were the only burning fuels across all cases, with a large majority of maquis. Values for these fuels are given in Table 1.
Nat. Hazards Earth Syst. Sci., 14, 3077-3091, 2014, www.nat-hazards-earth-syst-sci.net/14/3077/2014/

Since little is known about the fuel state, the fuel moisture is set to moderate to high water stress for every fuel model, which corresponds to high fire danger on a summer's day.

Comparison scores
Our evaluation relies on scoring methods that were analysed (and, for two of them, proposed) in Filippi et al. (2013). We denote S(t) the simulated burned surface at time t; the final simulation time is t_f. At the end of the observed fire, t_f^o (the superscript "o" stands for observation), the observed burned surface is S^o(t_f^o). Ω is the simulation domain. The arrival time at point X is denoted T(X) for the simulation and T^o(X) for the observation. Finally, the area of a surface S is denoted |S|. We assume that the ignition time is 0.

We relied on the following scores, with t_u = max(t_f, t_f^o). The Sørensen similarity index,

SC = 2 |S(t_f) ∩ S^o(t_f^o)| / (|S(t_f)| + |S^o(t_f^o)|),

and the Jaccard similarity coefficient,

J = |S(t_f) ∩ S^o(t_f^o)| / |S(t_f) ∪ S^o(t_f^o)|,

compare the final simulated and observed areas.
Kappa statistics, κ = (P_a − P_e)/(1 − P_e), compute the frequency with which the simulation agrees with the observation (P_a), with an adjustment (P_e) that takes into account agreement by chance. The arrival time agreement and the shape agreement both take into account the dynamics of the simulation. Even though there is only one observation at the end of the fire (i.e. the final burned surface), these scores were designed to partially evaluate the time evolution of the simulation dynamics. Further details on these scores may be found in Filippi et al. (2013).
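On gridded burn maps, the area-based scores reduce to a few set operations. A minimal sketch, assuming boolean rasters of the final burned surfaces (our simplification of the polygon comparison), reads:

```python
import numpy as np

def burn_scores(sim, obs):
    """Sørensen, Jaccard and kappa scores between two boolean burn maps
    (True where a cell is burned at the end of the fire)."""
    sim, obs = np.asarray(sim, bool), np.asarray(obs, bool)
    inter = np.logical_and(sim, obs).sum()
    union = np.logical_or(sim, obs).sum()
    n = sim.size
    sorensen = 2 * inter / (sim.sum() + obs.sum())
    jaccard = inter / union
    # Observed agreement P_a and chance agreement P_e for the kappa score.
    p_a = (inter + np.logical_and(~sim, ~obs).sum()) / n
    p_e = (sim.sum() * obs.sum() + (~sim).sum() * (~obs).sum()) / n ** 2
    kappa = (p_a - p_e) / (1 - p_e)
    return sorensen, jaccard, kappa

sim = np.array([[1, 1, 0], [1, 0, 0]], bool)  # simulated burned cells
obs = np.array([[1, 0, 0], [1, 1, 0]], bool)  # observed burned cells
print(burn_scores(sim, obs))
```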

Simulation set-up
The four models were run using the ForeFire code. In all, the 80 cases resulted in a total of 320 simulations. Each simulation is set up using the ignition point, which defines where to carry out the simulation. The simulation domain is centred on the ignition point. In the north-south and east-west directions, the domain size is about four times the extension of the fire. The spatial increment δl depends on the fire size: if the final observed burned area is A in m², then δl = max(1, log_10 A − 4) m. The filtering distance d_f is set to 20 δl.
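The resolution rule can be checked directly; for instance, a 1 ha fire (10⁴ m²) runs at δl = 1 m and a 1000 ha fire (10⁷ m²) at 3 m.

```python
import math

def spatial_increment(area_m2):
    """Perimeter resolution (m) from the observed burned area:
    delta_l = max(1, log10(A) - 4) with A in m²."""
    return max(1.0, math.log10(area_m2) - 4.0)

print(spatial_increment(1e4))   # 1 ha fire  -> 1.0 m
print(spatial_increment(1e7))   # 1000 ha    -> 3.0 m
d_f = 20 * spatial_increment(1e7)  # filtering distance, here 60 m
```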
All simulations were carried out at most until the burned area equalled the observed final burned area. In practice, for small fires, the area burned in the simulation may be larger, because the stopping criterion is checked only every 7 min (so the over-development cannot exceed 7 min of fire propagation). A simulation can also stop earlier if the front velocity is 0 everywhere (a stopped fire).
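The stopping rule can be sketched as a loop over 7 min chunks. The `ToySim` class and its interface below are hypothetical stand-ins, not the ForeFire API; the toy run also shows how a small fire can overshoot its target area within a single 7 min chunk.

```python
import math

def run_until_observed(sim, observed_area, check_minutes=7.0):
    """Advance a simulation in fixed chunks, stopping once the simulated
    burned area reaches the observed final area, or earlier if the whole
    front has stalled (a hypothetical solver interface)."""
    while sim.burned_area() < observed_area:
        if sim.max_front_speed() == 0.0:  # stopped fire: exit early
            break
        sim.advance(check_minutes)
    return sim.burned_area()

class ToySim:
    """Circular fire growing at a constant radial speed (illustration only)."""
    def __init__(self, ros_m_s):
        self.ros, self.t = ros_m_s, 0.0
    def advance(self, minutes):
        self.t += 60.0 * minutes
    def burned_area(self):
        return math.pi * (self.ros * self.t) ** 2
    def max_front_speed(self):
        return self.ros

# A 1 ha (1e4 m²) target is overshot to ~13.9 ha in one 7 min chunk
# at 0.5 m/s: the area check only happens after the chunk has burned.
area = run_until_observed(ToySim(ros_m_s=0.5), observed_area=1.0e4)
```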

Results
In this section, we investigate the performance of the models on all fire simulations. All simulations are concisely described in the Appendix, with plotted contours (simulations and observation). The performance of each model on an individual case is hard to analyse, since it can vary greatly depending on the quality of the data. We essentially draw conclusions that are supported by the overall performance. Moreover, it is expected that a large number of scores will be very low, mostly because of the poor input data (whose quality cannot be known in advance). We decided not to drop these low scores, in order to remain representative of actual model performance in an operational context.

Distribution of the scores
An important aspect of the comparison is to select and understand the way scores are presented. At first, let us consider the distribution of the 80 scores for each model and scoring method. The 80 scores are sorted here in ascending order and plotted. It is important to note that the sorting is carried out independently for each model. Therefore, the number on the abscissa corresponds to a rank in the list of sorted simulations (per model), but not to the same fire case across the four evaluated models. Figure 1 shows the distributions for all scoring methods. The distributions for the classical scores (the Sørensen similarity index, Jaccard similarity coefficient and kappa coefficient) are very similar, with the worst score distributions for the 3 % model. The Balbi model gives significantly better results. The Rothermel model arguably brings additional improvements over the Balbi simulations for the lower scores in the distribution. The non-stationary Balbi model clearly provides the best overall results for this data set. A less clear ranking is found with the distributions of shape agreements (Fig. 1e), except that the 3 % model shows poor performance. The Rothermel model is a clear first in the distributions of the arrival time agreements (Fig. 1d), with the 3 % model close behind for the higher scores. We point out that all observed fires are assumed to last 24 h at most (because no data for the duration of the fire were found), which is a very rough approximation of the actual duration of the fire. In these conditions, arrival time agreement is a less relevant scoring method than the other scores, as it strongly penalises over-prediction. Nevertheless, it points out that the Rothermel model is less prone to over-prediction than all the other models. This is probably due to its wind limit function, which slows down the front and is generally diffusive.
In Fig. 1, the scores are sorted in ascending order, independently for each model. It is therefore impossible to compare the performance for each individual fire and to make sure that a model consistently shows better performance. The score distributions are shown in Fig. 2 with an ascending sorting based on the mean score across the models. The sorting is therefore the same for all models and their scores remain paired in the plot. There is clearly a large variability in the performance of the models at each individual fire. There is, however, an overall ascending tendency and, more importantly, for most fires, the non-stationary Balbi model shows the best performance, with Rothermel coming second and the 3 % model being the worst. It means that there is a consistent improvement for all types of fires and conditions. The conclusion is not so clear concerning the arrival time agreement, since it may lack relevance in this context. This scoring method requires a precise fire duration, while we assume that all fires lasted 24 h. Considering all scores, the large variance in the models' performance suggests that the generation of large ensembles of simulations may be needed in the simulation process of a single fire. Indeed, with a single simulation for a single fire and poor data quality, a poor simulation is likely to occur.
Based on all the distributions, one conclusion is that the non-stationary Balbi model consistently gives the best results. The second model is Rothermel, with stationary Balbi a close third. It is worth noting that the switch to a non-stationary model, where the fire depth is taken into account, leads to significant improvements, at least as large as the changes due to the fire spread rate formulation. This suggests that the representation of the fire in the numerical model is a key aspect of a simulation, and that making it more "physical" by removing assumptions from the model can enhance its versatility and performance. It also suggests that more physics is needed to attain state-of-the-art model performance, even in a context where the input data may be of poor quality. Overall, it is interesting to note that, despite the poor data quality (which we might expect in operational forecasts), the models can be ranked; hence even a poor database can help objective model development.

A look at the individual simulations
For many simulations, the input data are very poor, hence the dramatically low performance. When the data are probably reliable, the models may perform reasonably well, which is detected by the scoring methods. For each scoring method and model, we found the simulation with the highest performance across the 80 fires. These selected cases are shown in Table 2. The Sørensen, Jaccard and kappa scores essentially identify the same "best" cases. This is consistent with the strong similarities found between the score distributions for these scoring methods (Sect. 4.1). The dynamics-aware scores (arrival time agreement and shape agreement) select a variety of fire cases. It is more difficult to understand the reason for these selections from the final contours, since these scores take into account the full dynamics. Figures A1-A3 present all simulation results of the data set. It can be observed that there are obvious numerical biases in these simulation results. In particular, the smaller fires (< 1 ha, such as R3 Alata or O0 Propriano) are systematically over-predicted by both Balbi models, which tend to predict higher propagation velocities. Since the area criterion used to stop the simulation is checked only every 7 min, the simulated fire growth becomes relatively large considering the small fire size. Nevertheless, as all simulations were to be handled with the same method, the same delay was kept for all cases. Note that we included these much smaller fires in the test database because they are the most frequent fires to be fought.
Human supervision of these simulations would have helped to rectify the automatic wind data or the misplaced ignition points that clearly appear to be erroneous in many of them. Such a supervised rectification was carried out for case "C2" in the Appendix, Ghisonaccia (95 ha). The results before and after tweaking the parameters can be seen in Fig. 3. The nearest station apparently provided a bad estimate of a very local wind direction. The ignition location also appeared to be slightly misplaced in the original report, as it was not directly on the main propagation axis. Hence, the wind direction and the ignition location were manually adjusted in order to better reproduce the final burned area.
In an operational context, the burned area is not known in advance, but better results can still be obtained with better models. Let us consider more closely two fires that appear in Table 2: the Oletta case (21 August 2004; N3 in the Appendix) and the Santo Pietro di Tenda case (1 July 2003; B0 in the Appendix). In Fig. 4, we plotted the former case. Here, it is clear that the Balbi model shows a much better performance than the other models. The overall shape of the final contour simulated by the Balbi model has a similar aspect ratio to the observed one, while the Rothermel model and the 3 % model show large over-burning at the rear and the head of the fronts, respectively. So, even when the wind direction is correct, there can still be large differences between the model simulations. If we only considered the Balbi simulations for this fire case, we might conclude that a model can reach a good prediction skill once proper input data are provided. Nevertheless, the performance of the other models questions the reliability of the input data. It is likely that the input data are erroneous and compensating for deficiencies in the Balbi model. It is quite possible that the Balbi model would underperform with the exact data. This suggests that exposing the actual prediction skill of fire simulation requires the use of several models and the exploration of the full probability distribution of the uncertain data.

Table 2. The best simulations, according to the scoring method and the fire model. "ATA" stands for arrival time agreement. "Shape" refers to shape agreement. The final contours may be found in Figs. A1-A3, and further information about the fires is available in Table A1.
In Fig. 5, we plotted the case for which the Balbi model and Rothermel model attained their best shape agreement. The visual agreement with the observed contour does not seem as good as in the previous fire case. Hence the dynamics of the fire must have significantly contributed to the  shape agreement. This suggests that taking into account the front dynamics in the evaluation can cast a different light on a model.

Forecast reliability
In an operational context, the ability of a model to predict that a location will burn can be of high importance. One way to assess the reliability of a model in this context is to evaluate its burn probability: when the model predicts that a fire will burn a given location, we compute the frequency with which this actually happens. Conversely, we are interested in the frequency with which a model correctly predicts that a location will burn; we refer to this indicator as the detection skill. The detection skill may be perfect if the model burns very large regions: any location that is actually burned in reality will be burned in the simulation as well. On the contrary, the first indicator (burn probability) will be perfect if the fire stops right after it started: at the ignition point, the fire was observed, so the location was burned, just as the model "predicted". Consequently, we computed both the burn probability and the detection skill (Table 3).

Table 3. Burn probability and detection skill for the four models, all fires included. The burn probability is the frequency with which the model correctly predicts that a location will be burned in reality. The detection skill is the frequency with which a burned location (in the observations) is also burned in the simulation.

The burn probability and detection skill are consistent with previous results. The non-stationary Balbi model is the best in terms of detection. The Rothermel model is, however, slightly better for the burn probability. These are followed by the (stationary) Balbi model, which is close to the Rothermel model in terms of detection skill. When the Rothermel model predicts that a location will be burned, the real fire burns it in 31 % of all cases. If the real fire reached a given location, the non-stationary Balbi model predicted it in 41 % of all cases. It is noteworthy that the performance can vary a lot from one model to another.
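On boolean burn maps, the two indicators are the precision and recall of the burned class; a minimal sketch of the computation (our own formulation, with illustrative data) reads:

```python
import numpy as np

def reliability_indicators(sim, obs):
    """Burn probability (precision: observed burn given a simulated burn)
    and detection skill (recall: simulated burn given an observed burn),
    from boolean burn maps accumulated over all fires."""
    sim, obs = np.asarray(sim, bool), np.asarray(obs, bool)
    hits = np.logical_and(sim, obs).sum()
    burn_probability = hits / sim.sum()
    detection_skill = hits / obs.sum()
    return burn_probability, detection_skill

sim = np.array([1, 1, 1, 0, 0], bool)  # cells the model burned
obs = np.array([1, 0, 1, 1, 0], bool)  # cells that actually burned
print(reliability_indicators(sim, obs))
```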
It is obvious that the 3 % model with a detection skill of 11 % is of very low value compared to the 41 % of the non-stationary Balbi model.
Overall, the performance may not be good enough for a model to be reliable in operational applications. The performance spread among the models suggests that further development on the model has the potential to immensely improve the results. Nonetheless, it is clear from the simulation results that the input data play a key role. In particular, an erroneous wind direction spoiled so many simulations that good local meteorological forecasts alone would be likely to dramatically improve performance.

Conclusions
The objective of this paper was to evaluate fire spread models and their relevance in a realistic operational context with limited information. We considered a database of 80 fires whose final burned surfaces were observed. We simulated these fires in a purely automated manner, using only a poor data set that may be available in this operational context. The meteorological values were taken at the closest observation station, even though the actual wind direction at the exact fire location may be significantly different. The vegetation cover and the associated fuel load were not tuned by any means for any cases or models.
Despite the crude application setting, we were able to rank these fire spread models. Overall, the non-stationary version of the Balbi model gave the best results. The higher relative performance of the non-stationary Balbi model was observed across most of the 80 cases. The stationary version of the Balbi model showed a significantly lower performance. This suggests that the numerical treatment of the fire front is a key aspect, just like the rate of spread. The Rothermel model (Andrews et al., 2013), with a new imposed wind speed limit function, was ranked second. Finally, we observed that the most empirical model, taking 3 % of the wind velocity, is clearly not a good option. Overall, it is estimated that current fire spread models may benefit from a better physical description, even with poor data quality. To strengthen this conclusion, we can add that the performance of the original Rothermel model (whose results are not reported here) is significantly lower than that of the version used in this work; the original Rothermel model consistently performs significantly worse than the stationary Balbi model.
We evaluated the skill of the models at forecasting that a certain location will be burned. We also evaluated whether the locations burned in a simulation are likely to be burned in reality. In both cases, there is a wide spread in the models' skills. In addition, the skill of the best model may not be high enough for reliable use in an operational context. Therefore, we conclude that further work on the physical models is still needed, and improved input data, especially for the wind direction, are obviously required. For instance, the largest gains in performance may be obtained from local micrometeorological forecasts that properly predict the wind direction.
Considering the high uncertainties in both the models and the input data, the simulation context is barely deterministic. There is a need for probabilistic approaches, such as those developed by Finney et al. (2011). With model results being so variable, a promising direction may be the use of ensembles of models, together with perturbed input data. Such developments could be used for uncertainty quantification, risk assessment and ensemble-based forecasting.

Table A1. Information about all 80 fires shown in Figs. A1-A3. The size is the final burned area in hectares. WS stands for wind speed, in m s⁻¹. WD is the wind direction in degrees, defined clockwise, with 0 corresponding to a westerly wind (90 to a southerly wind). The resolution in metres is in column "Res.". The final three columns give the best scores (among the four models) found for the Jaccard similarity coefficient (BJ), arrival time agreement (BA) and shape agreement (BS).