Predictive modeling of hourly probabilities for weather-related road accidents

. Impacts of weather on road accidents have been identiﬁed in several studies with a focus mainly on monthly or daily accident counts. This study investigates hourly probabilities of road accidents caused by adverse weather conditions in Germany on the spatial scale of administrative districts using logistic regression models. Including meteorological predictor variables from radar-based precipitation estimates, high-resolution reanalysis and weather forecasts improves the prediction of accident 5 probability compared to models without weather information. For example, the percentage of correctly predicted accidents (hit rate) is increased from 30% to 70%, while keeping the percentage of wrongly predicted accidents (false alarm rate) constant at 20%. When using ensemble weather forecasts up to 21 h instead of radar and reanalysis data, the decline of model performance is negligible. Accident probability has a non-linear relationship with precipitation. Given an hourly precipitation sum of 1 mm, accident probabilities are approximately 5 times larger at negative temperatures compared to positive temperatures. The 10 ﬁndings are relevant in the context of impact-based warnings for road users, road maintenance, trafﬁc management, as well as rescue forces.

and low visibility. The winter storm events were identified by using radar-and station-based observations. Malin et al. (2019) observe a sharp increase of relative accident risk if road surface temperatures drop below the freezing point.
Since the first weather impact models for road accidents (Scott, 1986), various types of models have been used in this context. Most popular are generalized linear models (GLMs), e.g. Poisson regression for accident counts or logistic regression for accident probabilities (e.g. Fridstrøm et al., 1995;Caliendo et al., 2007;Keay and Simmonds, 2006), but also other methods 5 like state-space (Hermans et al., 2006b) or autoregressive models (Brijs et al., 2008;Scott, 1986;Bergel-Hayat and Depireb, 2004) have been applied. Mostly, statistical models for weather impact on road accidents are used in an inferential way; they test hypotheses for variable relations by means of statistical hypothesis testing for parameters significance of prescribed predictor variables, also referred to as explanatory modeling (Shmueli et al., 2010). This contrasts to predictive modeling, where statistical models are used for prediction of yet unobserved instances of the target variable (e.g., accident counts or 10 probabilities). In practice, predictive models are built and assessed using cross-validation.
This study follows the predictive modeling approach: We build and assess the skill of logistic regression models for hourly probabilities of weather-related road accidents at the scale of administrative districts in Germany. The aim is to assess model performance at small spatial and temporal scales, as well as identifying relevant meteorological predictor variables for optimizing the predictive skill. We thus seek an adequate functional relationship between hourly precipitation and accident probability 15 under different temperature conditions and district characteristics. Instead of station-based observations, we use a gridded radarbased precipitation product and a new high-resolution regional reanalysis. Additionally, using ensemble weather forecasts, we assess the predictive skill of the accident model for lead times of up to 21 hours. Section 2 describes data and pre-processing approaches. Statistical models and associated verification methods are described in Sect. 3. Results of model verification and the application of the models in a case study of a snowfall event are presented in 20 Sect. 4, which is followed by a discussion and conclusions in Sect. 5. Straßenverkehrsunfälle, 2007-2012, own calculations). Heavy road accidents include all accidents with injuries, fatalities or write-offs. Minor accidents are not included in the data set. In total 2,392,329 accidents were reported during the 6-year period under investigation. Most accidents were indicated by the police as being caused by driver behaviour. However,7.7% (184,201) of the accidents were indicated as being caused by adverse road conditions, which includes a wet, snowy or icy road, but also mud or dirt on the road. This class of accidents, which we refer to as weather-related accidents, is selected to generate the 30 response variable used in the logistic regression models. The location of the individual accidents is available on the level of administrative districts (Landkreise). Because of several territorial reforms during the study period, all accidents are assigned to boundaries of the 401 administrative districts as they existed in 2017. For each district an hourly time series is created, which is one if at least one accident happened within an hour and zero otherwise. In total this results in 21,076,961 data points, of which 0. 80% (168,404) contain at least one weather-related accident.

Radar-based precipitation data
Gridded hourly precipitation sums derived from the RADOLAN data set (Bartels et al., 2004) are available from the German Meteorological Service at a spatial resolution of 1 × 1 km . The RADOLAN combines radar reflectivity, measured by the 16 5 C-band Doppler radars of the German weather radar network, and ground-based precipitation gauge measurements. As from radar reflectivity we cannot directly infer the precipitation amount at the ground but only the amount of reflection in the lower troposphere, observations from rain gauges are used to calibrate the precipitation amounts estimated from the radar reflectivity in an online-procedure typically used for nowcasting. Before calibration, a statistical clutter filtering is applied and orographic shadowing effects are corrected for. The RADOLAN project aims at combining the benefits of high spatial resolution of the 10 radar network with the accuracy of gauge-based measurements.

Reanalysis data
A reanalysis produced by a novel convective-scale regional reanalysis system for Central Europe (COSMO-REA2;Wahl et al., 2017) is used to generate meteorological predictor variables for the logistic regression models. The reanalysis results from the integration of COSMO-REA2 (a physical model for the atmosphere) with various heterogeneous observational data 15 assimilated. COSMO-REA2 was developed within the framework of the Hans-Ertel Center for Weather Research (https:// www.hans-ertel-zentrum.de). It contains different gridded atmospheric and surface variables for Central Europe at a spatial resolution of 2 km and at hourly time steps. Deep convection is explicitly resolved by the model, while shallow convection is parameterized using the Tiedke scheme (Tiedtke, 1989). In addition to conventional station-based observations, radar-derived rain rates are assimilated using latent heat nudging. On hourly to daily time scales, the assimilation of radar information 20 substantially improves the parameterized precipitation compared to other reanalysis datasets (Wahl et al., 2017).

Ensemble weather forecasts
Weather forecasts are used to study the predictability of accident probabilities based on weather forecasts with an ensemble prediction system (EPS). We use the regional high-resolution ensemble forecasting system COSMO-DE-EPS, which run operationally at the German Meteorological Service (DWD) before May 2018 with a spatial resolution of 2.8 km for the area of 25 Germany. The COSMO-DE-EPS is initiated every 3 h with a lead time τ of up to 21 h. τ is the difference between the time the model simulation is initialized and the time the forecast is valid for. For example, if the model simulation is initialized at 0 UTC, τ = 21 h corresponds to the forecast for 21 UTC. For each initialization time 20 ensemble members are available, generated using different global model forecasts as initial and lateral boundary conditions and variations of parameterizations for unresolved processes as described in detail in Gebhardt et al. (2011) andPeralta et al. (2012). The spread of the ensemble 30 members allows an estimation of the forecast uncertainty. Similar to COSMO-REA2, precipitation rates derived from radar observations are assimilated at forecast initialization using latent heat nudging (Stephan et al., 2008).
For our study, a post-processed product of the archived COSMO-DE-EPS forecasts for the years 2011 and 2017 was provided by the DWD. Instead of archiving the forecast data on the original model grid, area averages of 21×21 grid boxes (56×56km) around 758 DWD owned gauge stations are stored. This drastically reduces the large amount of data, which facilitates their 5 processing.

Data preparation
We aggregate the different meteorological variables to the level of administrative districts. For the station-based COSMO-DE-EPS forecasts a weighted mean of all available stations in the vicinity of the districts was calculated using the probability 10 density function of a bi-variate circular symmetric normal distribution as the weighting function. A standard deviation of 25 km proved to be most appropriate, as it corresponds well to the average district area.
For a fair comparison of RADOLAN and COSMO-REA2 with COSMO-DE-EPS forecasts, the same aggregation is applied to the gridded RADOLAN and COSMO-REA2 products: the areal averages around the 758 gauge stations is computed as described in section 2.4 and the data is aggregated to the district level by applying the weighting function, as described above. 15

Logistic regression
Logistic regression models are used to model the probability of a certain event based on independent predictor variables (e.g. Menard, 2002). Here, we model hourly accident probabilities. If P t is the probability that an accident occurs in a 1h timeinterval (t − 1h, t], the logistic model equation is 20 where α is the intercept term, X t = (X t1 , ...X tn ) the set of n predictor variables, and β = (β 1 , ..., β n ) are the corresponding parameters. α and the β i are estimated using maximum likelihood. If the effects of the two predictor variables X ti and X tj are not additive (i. e. the effect of X ti on P t depends on the state of X tj ), interaction terms can be added to the model equation. If X ti and X tj are continuous variables, for example, this can be achieved by adding β ij X ti X tj to the linear term in Eq. 1, with β ij quantifying the combined effect of X ti and X tj . For more detailed description of interactions, see Wood (2017).

25
The parameters of the logistic regression model can be easily converted to the odds ratio OR = exp β i . The odds ratio for a given term X ti describes the change of the odds of the event to occur in case of a unit change in X ti .

Assessing model performance
Parameter estimatesβ i associated to individual predictor variables X ti can be tested for being significantly different from 0 using the p-values of a two-tailed z-test (Dobson and Barnett, 2008).

30
Different logistic models are compared with information criteria. The most popular is the Akaike Information Criterion (AIC; Akaike, 1974) defined as where k is the number of parameters used in the model andL is value of the likelihood at its maximum. Fitted to the same data the model with lower AIC is to be preferred. The AIC penalizes models with more parameters to prevent overfitting.

5
The Brier Score (BS) is a proper score to measure accuracy of probabilistic forecasts for binary events, as they result from a logistic regression model. Based on Brier (1950) the BS can be defined as where f t is the forecast probability, o t is the observed outcome of the event (o t ∈ {0, 1}), t lables the events and N is the total number of events. However, Benedetti (2010) has shown that the Brier Score may not be suitable when forecasting very rare 10 (or very frequent) events. He suggests the use of the logarithmic score (or absolute score) where a = −(2 ln 2) − 1 is simply a scaling factor, making LS comparable in size to BS. The LS is frequently used in the field of statistical mechanics and information theory and fulfills three basic desiderata (i.e. additivity, exclusive dependence on physical observations, and strictly proper behaviour).

15
By defining a threshold u, a probabilistic forecast (0 ≤ f t ≤ 1) can be transformed into a binary forecast, which is either positive (accident) or negative (no accident) if the forecast probability falls above (f ≥ u) or below the threshold (f < u) , respectively. The true positive rate (TPR, or hit rate) is the number of correctly predicted positive events divided by the total number of positive events. The false positive rate (FPR, or false alarm rate) is the number of incorrectly predicted negative events divided by the total number of negative events. The receiver operating characteristic (ROC) curve is a common way to 20 illustrate the performance of a logistic regression model as a binary classifier, by plotting the TPR against the FPR for various thresholds 0 < u < 1 (Hanley and McNeil, 1982). The area under the ROC curve (AUC) is frequently used for measuring the ability of a model to discriminate between positive and negative events. The AUC ranges between 0.5 and 1, which compares to random guessing and perfect discrimination, respectively. For a given FPR the corresponding TPR can be identified based on the ROC curve. In this study, we compare the TPR of different models while selecting u so that the FPR is kept constant at 25 0.2.
A skill score SS is a relative measure of how much a forecast S f outperforms a reference forecast S r , defined as where S p is the score of a perfect forecast. In this study we use the BS to compute the Brier Skill Score (BSS, S p,BS = 0), the LS to compute the Logarithmic Skill Score (LSS, S p,LS = 0) and the AUC to compute a skill score based on the ROC curve 30 (AUCSS, S p,AU C = 1).
While AIC penalizes large numbers of model parameters to avoid overfitting, in cross-validation techniques model parameter are estimated on a training data set and scores are computed on an independent testing data set. Here, we use a yearly crossvalidation approach. Model parameters are estimated on a data set with one year of data left out and scores are calculated for this respective year. This is repeated several times until for all years a score has been estimated. The score is then averaged over all years and used for model comparison.

5
To understand the behaviour of the model, the predicted accident probabilities of the regression models can be compared to non-parametric estimates for accident frequencies within bins of specific parameter ranges. For example, a predicted accident probability for negative temperatures and a precipitation amount of 1 mm/h at 7:00 local time can be compared to the relative accident frequency for all time steps that showed negative temperatures and precipitation amounts in an interval of 1 ± 0.1 mm at 7:00. The uncertainty of model probability forecasts is estimated by computing the 95% confidence interval based on asymp-10 totic standard errors. The uncertainty related to the non-parametric estimates of the accident frequency is estimated by using a bootstrapping approach. The observed accident frequency is computed 10,000 times after drawing random samples with replacement from the available data. The range between the 0.025 and 0.975 quantile of the resulting distribution of values can be used to construct a 95% confidence interval around the average observed accident frequency.

Models without weather information
The models NULL and HOUR predict the accident probabilities for each district without using weather information (see Table 1 and Table 2 for a detailed description of predictor variables and models, respectively). The simplest model is the NULL model, using only the intercept and the time average accident probability P for each district as a predictor. P is transformed into P using the inverse logistic function. By using P in the logistic regression equation, a linear relationship between P and the 20 hourly accident probability is established. By introducing P we can distinguish between different districts using a single model parameter. Alternatively, we could include an individual intercept parameter for each district. However, this would require the estimation of 401 parameters. By adding interaction terms, the number of parameters would increase even more, making the model inapplicable.
The model HOUR includes an additional categorical variable H specifying the time of day in hours (local time), which 25 describes the diurnal cycle additionally to the average accident probability of each district. These two models are used as reference models to assess the benefit of adding weather information.

Models using radar and reanalysis data
Accident, radar and reanalysis data overlap in time for the years from 2007 to 2012. For this time period, a binary predictor variable with hourly resolution for the near surface temperature T REA (temperature at 2 m height) is derived from COSMO-30 REA2, which distinguishes between temperatures above and below 0°C. Furthermore, a continuous variable P r RAD with the hourly precipitation sum in mm/h is used. In model RAD the model HOUR is extended by adding T REA and (P r RAD ) 0.2 as direct effects. Different combinations of exponents have been tested to transform the precipitation, but 0.2 lead to the best results in terms of model skill. In the model RAD_INT the two-point interaction terms between P , H, T REA and (P r RAD ) 0.2 are added to the model equation. Model parameter estimates result from using data from all districts simultaneously. However, the skill scores are calculated for each district individually within the cross-validation procedure. This allows us to compare the performance of the model in different districts.

5
Additionally, we fit the models to the individual districts, yielding models RAD_IND and RAD_INT_IND 1 , respectively. On the one hand, these models capture the district specific characteristics; on the other hand, the amount of available data points for each model is strongly reduced, which complicates the estimation of model parameters, in particular for districts with low accident numbers. These models are used to quantify the benefit of having one model for all districts.
3.4.3 Models using weather forecast data 10 The overlapping time period of accident data and COSMO-DE-EPS data are the years 2011 and 2012. For this time period temperature and precipitation is aggregated to district level as before for all 20 ensemble members. This is done separately for all forecast lead times τ , ranging between 1 h after forecast initialization and 21 h after initialization.
The COSMO-DE-EPS provides hourly forecast data, but is initialized only every three hours. 3. Probability-averaged ensemble: In case of the model EPS_PMEAN_INT accident probabilities are predicted using the models EPS_MEM i _INT for the individual ensemble members, but the ensemble mean of the predicted probabilities is calculated before using it to compute the scores in the cross-validation procedure.
The models EPS_HOUR, EPS_RAD_INT correspond to the models HOUR, RAD_INT, but are fitted separately to the data available for each lead time, to allow a direct comparison to the models using COSMO-DE-EPS data.

Models using radar and reanalysis data
The time average hourly probability that at least one weather-related accident occurs in an administrative district is referred to 5 as P . It ranges from less than 0.001 for smaller districts with few inhabitants to more than 0.05 for densely populated cities.
The NULL model simply gives P for each district and serves as a reference model. As expected, the AUC is 0.5, indicating that the model is not able to distinguish between accident and non-accident cases (   The modeled accident probabilities as a function of P r RAD are shown for 7:00 local time for a district with an average probability for weather-related accidents of P = 0.01 (Fig. 2, top row). Non-parametric probability estimates are calculated 25 for precipitation bins with a width of 0.1 mm/h including only districts with P = 0.01 ± 0.002. In general, accident probabilities are lowest at P r RAD = 0 and show a steep increase with increasing precipitation with a decreasing slope at higher precipitation rates. Probabilities are higher at temperatures below 0°C. At P r RAD 1 mm/h probabilities are about 5 times higher if temperatures are below 0°C. For RAD the modeled probabilities fit well to the non-parametric probability estimates at P r RAD < 0.5 mm/h, but overestimate probabilities at higher precipitation rates. In contrast, the model RAD_INT shows 30 reduced probabilities, which fit much better to the non-parametric probability estimates. The curved shape of the functional relationship between precipitation and probability is realized by taking precipitation to the power of 0.2. The value 0.2 was found to be the best choice after testing a series of different exponents, other functional relationships as log 1 + P r, as well as categories of precipitation.
The modeled probabilities as a function of H are shown for P r RAD = 0 mm/h (solid lines) and P r RAD = 0.5 mm/h (dashed lines) for P = 0.01 (Fig. 2, middle row). Non-parametric probability estimates are calculated using time steps with P r RAD = 0 mm/h (circles) and P r RAD = 0.5±0.25 mm/h (triangles) including only districts with P = 0.01±0.002. In general, accident 5 probabilities show a pronounced diurnal cycle with maximum probabilities during morning and afternoon rush hours. RAD overestimates the observed probabilities in particular during the morning hours with precipitation at negative temperatures.
The model RAD_INT is able to capture the observed diurnal cycle more precisely.
The modeled probabilities as function of P are shown for P r RAD = 0 mm/h (solid lines) and P r RAD = 0.5 mm/h (dashed lines) at H = 7 h (Fig. 2, bottom row). Non-parametric probability estimates are calculated using time steps with P r RAD = 10 0 mm/h (circles) and P r RAD = 0.5 ± 0.25 mm/h (triangles) including districts with P = 0.01 ± 0.002. In general, the probabilities show a monotonic increase with P , which justifies the introduction of P as a predictor to distinguish between different districts. The predictions of RAD and RAD_INT are relatively similar and lie mostly within the confidence intervals of the observed probabilities. variable HOUR or are included in an interaction term with this variable. This might indicate the diurnal cycle could be modeled sufficiently with less than the 23 coefficients used here. However, we do not expect a large impact of such a reduction on the metrics that have been discussed earlier.
Next, we compare the models RAD and RAD_INT, which are fitted to all districts simultaneously, to the models RAD_IND and RAD_INT_IND, which are fitted to all districts individually. Fig. 3 shows the difference of the AUCSS between RAD and 25 RAD_IND (red) and between RAD_INT and RAD_INT_IND (black) as a function of P . P provides direct information about how many accident cases were available in the time series used for training the models. In general, the AUCSS differences are mostly negative, indicating that the models fitted to each district individually perform poorer than the models including all districts. The AUCSS differences decrease with increasing P , i.e. increasing accident numbers. Furthermore, the AUCSS differences are larger for the more complex models with interaction terms. The results are similar for the LSS (not shown).

30
Based on the results of this section, we can conclude that RAD_INT should be preferred over RAD, since it achieves the best scores and better represents the functional relationship between probability and precipitation as well as the diurnal cycle.
Furthermore, RAD_INT preforms better than RAD_INT_IND, which is fitted to each district individually.

Models using weather forecast data
The model RAD_INT showed the best performance among the models predicting accident probability using radar and reanalysis data (Sect. 4.1). In this section the model formulation of RAD_INT is modified to allow the use of COSMO-DE-EPS ensemble weather forecasts. To facilitate the modeling procedure, the variables H and P are combined into a single variable P H by using the model HOUR, which effectively results in a district-specific diurnal cycle of accident probabilities (see The model EPS_RAD_INT uses T REA and P r RAD and serves as a reference, representing the best available model based on reanalysis and radar data. The AUCSS of EPS_RAD_INT as a function of lead time shows a three-hourly cyclic pattern 10 with maximum values of around 0.5 at lead times 1, 4, 7, etc. and 0.47 in between (Fig. 4,  include only 0 UTC, 3 UTC, 6 UTC, etc. As a consequence, there are three sets of lead times that are associated to different 15 sets of hours of the day, which correspond to the repetitive three-hourly pattern in the AUCSS. Apparently, the model performs differently for these three sets of hours, possibly due to different traffic characteristics during the specific hours.

Case study
The models RAD_INT and EPS_PMEAN_INT are used in a case study with adverse winter weather conditions on Dec. 3rd, 2012. At temperatures below the freezing point the fronts of a low pressure system lead to snowfall in large parts of Germany.

30
These weather conditions lead to a total number of 280 accidents classified by the police as caused by road condition. The majority of the accidents occurred in southern and western Germany 2 .
For the district of Stuttgart, which was located within the affected area, the RADOLAN data shows low precipitation amounts in the early morning and higher precipitation amounts of up to 0.3 mm/h in the afternoon (Fig. 5a). The COSMO-DE-EPS forecast, which was initialized on Dec. 3rd, 2012 at 00:00 UTC (02:00 h local time), shows ensemble mean precipitation 5 amounts of more than 0.6 mm/h in the afternoon and a large spread between the ensemble members.
The temperature in COSMO-REA2 is below 0°C until 19:00 h and then changes to warmer conditions (Fig. 5b). All ensemble members of COSMO-DE-EPS predict the change to positive temperatures two hours earlier than observed.
The accident probability of EPS_RAD_INT shows the combined effect of the average diurnal cycle, RADOLAN precipitation and COSMO-REA2 temperature (Fig. 5c). It shows a peak of 0.07 in the morning during rush hour at low precipitation 10 amounts at freezing temperatures, a drop to 0.02 at noon when RADOLAN shows no precipitation, a maximum peak of 0.22 in the afternoon, when precipitation is strongest. In general, the accident probability of EPS_PMEAN_INT matches well with EPS_RAD_INT. However, it slightly overestimates the morning peak and overestimates the afternoon peak due to the too intense and persistent precipitation.
The hourly accident probability P is useful for authorities to assess how likely the occurrence of an accident is in a certain 15 district at a certain point in time. However, it does not reflect the risk of an individual road user, as it does not distinguish whether P changes due to weather-related effects, due to a change in traffic density along the diurnal cycle, or due to the district characteristic. For example, a road user traveling from a district with a high average accident probability P to a district with a low P would observe a decrease of P , also if the weather conditions remain the same. Therefore, to estimate the impact on an individual road user, we compare P to P 0 , the probability under conditions without precipitation and positive 20 temperatures (Fig. 5c, dotted line). The fraction P/P 0 gives the amplification of the actual predicted probability P compared to warm and dry conditions (Fig. 5d). In case of the forecast for Dec. 3rd, 2012, the amplification factor ranges between 50 in the afternoon when the precipitation amount is high and 5 around noon when precipitation amount is low. This factor could be a potential weather impact forecast product.
On Dec. 3rd, 2012 at 17:00 h local time, the COSMO-DE-EPS overestimates the precipitation amount in large parts of 25 western and southern Germany, compared to RADOLAN (Fig. 6). The area with temperatures below 0°C is captured relatively well, compared to COSMO-REA2. The accident probability P is largest where high precipitation amounts and freezing temperatures occur. Spatially, P is relatively inhomogeneous, which reflects the large differences in average accident probability between the individual districts. P/P 0 , representing the increase in accident probability of individual drivers, is spatially more homogeneous.
Police reports of heavy road accidents in Germany were used to construct hourly time series based on weather-related accidents caused by adverse road conditions for German administrative districts. Different meteorological datasets aggregated to district level were used in logistic regression models to predict hourly accident probabilities. Models of different complexity were compared after calculating different skill scores using a yearly cross-validation approach. The best model with respect to 5 these scores included district-specific average accident probability, the hour of the day, hourly precipitation and temperature, as well as their interaction terms. The model reached a hit rate (TPR) of 0.7, when the false alarm rate (FPR) was fixed at 0.2. With the same false alarm rate, a model without meteorological parameters only reached a hit rate of 0.3. It was shown that the probability of weather-related accidents increases non-linearly with increasing hourly precipitation. Given an hourly precipitation of 1 mm, the accident probability is approximately 5 times higher at negative temperatures, compared The target variable of this study were weather-related road accidents. The accidents included in the analysis were indicated by the police as being caused by adverse road conditions, which includes a wet, snowy or icy road, but also mud or dirt on the road. Thus, the categorization of the accident cause is based on the subjective decision of the police at the location of the accident. This might introduce a bias to the results whose direction or extent is hard to estimate. For example, a large number 20 of accidents that occur during adverse weather conditions are likely to be unrelated to the weather but are caused only by inattention of the driver. Police officers might still categorize these accidents as being weather-related in unclear situations.
It should be kept in mind that this could lead to an overestimation of weather-related accident probabilities in the models developed in this study.
It is known that the main parameters affecting accident probability are traffic flow and density. In an optimal case one 25 would use measurements of these variables as a model predictor for accident probability. However, traffic measurements are not continuously available for all administrative districts. Additionally, measurements of traffic flow are mainly available for highways and federal roads and might not be representative for municipal roads, where the majority of the accidents occur. Furthermore, in an operational setting, where the model is applied for predicting future accident probabilities, traffic measurements are not available. Therefore, we decided not to directly include traffic measurements in the models. Instead, the 30 hour of the day was used as a categorical predictor variable to capture the average diurnal cycle of accident probability. It was shown that this approach is able to reasonably represent the inner-day variability of accident probability. The introduction of additional factors like at weekends or holidays did not lead to a significant improvement of the model. It is a challenging task to combine accident data, which is available for the area of administrative districts, with meteorological data, which is usually available in the form of point observations or gridded data. Different ways of aggregating meteorological data to district level were tested and the approach based on distance-weighted averaging, which is presented in this study, showed the best results.
The temperature at 2 m height was used in this study to include the effect of negative temperatures in the statistical model 5 in a relatively simple approach. It has the benefit, that the temperature at 2 m height is a well-established meteorological parameter, which is measured at most stations and available in all weather forecasting models. However, it might not reflect the conditions at the road surface, which can deviate from the conditions at 2 m height. Also, the choice of 0°C as a fixed threshold is a simplified approach, since ground frost or snowfall could also occur at higher 2 m temperatures. By using non-linear approaches like generalized additive models (Wood, 2017) a smooth transition between positive and negative temperatures 10 could be established in future studies. Furthermore, it might be detrimental that area averaged temperatures are used, which does not fully represent topographic variations within the area of a certain district. A more complex approach could make use of a road surface model, which includes the combined effects of precipitation, evaporation, and road surface temperatures in a more sophisticated way (e.g. Juga et al., 2013).
In addition to the weather parameters presented in this study, other parameters like snow fall amount or combined measures 15 of cloud cover and sun angle to describe the impact of sun glare were tested as potential predictor variables. Furthermore, advanced predictor selection techniques like genetic algorithms (Calcagno et al., 2010) and the least absolute shrinkage and selection operator (Tibshirani, 1996) were applied, to find optimal combinations of parameters. However, none of the results were able to significantly improve the skill of the best models presented in this study, as measured by the cross-validation approach. 20 We found that the probability of weather-related accidents depends on hourly precipitation to the power of 0.2. This exponent should not be understood as a universal relationship. Instead, it is likely to depend on different aspects of the road system (e.g. how fast is the water able to leave the road surface) or the average car characteristics (e.g. the share of cars equipped with assistance systems, or the type tires). It may even change in time, as road and car qualities improve.
In this work we showed two ways of modeling probabilities in different districts: first, by creating a model that distinguishes 25 between different districts based on their average accident probability and, second, by creating a model for each district individually. We found that to first approach lead to higher skill scores, particularly for districts with low accident numbers.
Including additional district-specific parameters describing the characteristics of the road network or topographic conditions could help to further refine the model.
This study shows that a skillful relationship between meteorological parameters and weather-related road accidents can 30 be established. Forecasts of probabilities of weather-related road accidents, as presented in this study, might be useful for authorities (traffic management, police or emergency services) on the one hand and road users on the other hand. However, it is reasonable to provide the information about accident risk in different, user specific formats, which were introduced in Sect. 4.3. Authorities might be primarily interested in aggregated risk information for their region of interest, e.g. the occurrence probability of accidents in an administrative district. On the other hand, a road user is rather interested in his individual risk.

35
The individual risk is better reflected through an amplification of risk compared to certain reference conditions (e.g. warm and dry weather).
It was shown that impact-based warning can lead to better actions of the recipients (Weyrich et al., 2018). Furthermore, Hemingway and Robbins (2019) state that information about weather impacts can be helpful for operational meteorologists when issuing weather warnings. This was found using a prototype impact model for predicting the risk of road disruption due 5 to the wind-induced overturning of vehicles. In this context, the accident model presented in this study can be considered a useful tool for reduction of road traffic risk.   Table 2) based on P and H