Machine learning models to predict myocardial infarctions from past climatic and environmental conditions

. Myocardial infarctions (MI) are a major cause of death worldwide, and temperature extremes, e.g., during heat waves and cold winters, may increase the risk of MI. The relationship between health impacts and climate is complex and is inﬂuenced by a multitude of climatic, environmental, socio-demographic, and behavioral factors. Here, we present a Machine Learning (ML) approach for predicting MI events based on multiple environmental and demographic variables. We derived data on MI events from the KORA MI registry dataset for Augsburg, Germany between 1998 and 2015. Multivariable predictors include 5 weather and climate, air pollution (PM 10 , NO, NO 2 , SO 2 , and O 3 ), surrounding vegetation, as well as demographic data. We tested the following ML regression algorithms: Decision Tree, Random Forest, Multi-layer Perceptron, Gradient Boosting and Ridge Regression. The models are able to predict the total annual number of MI reasonably well (adjusted R 2 = 0 . 59 − 0 . 71 ). Inter-annual variations and long-term trends are captured. Across models the most important predictors are air pollution and daily temperatures. Variables not related to environmental conditions, such as demographics need to be considered as well. 10 This ML approach provides a promising basis to model future MI under changing environmental conditions, as projected by scenarios for climate and other environmental changes. models than the regression models presented here. Overall, the models’ capacity to give reasonable estimates of possible future developments of MI based on the predictive features appears robust. In a next step we aim to apply the trained models to scenarios of future climatic, environmental and demographic conditions. This will allow estimating future changes in MI taking into account climatic, as well as other environmental and demographic changes that has most often not been done. These changes could also include further improvements in air quality, or increased ’greening’ of urban environments 475 with vegetation. New estimates of changes in MI can be based on these ML methods, using ensembles of projected climate change, demographic scenarios including ageing population, and environmental scenarios contingent on societal transformation (e.g., electriﬁcation of trafﬁc, greening of cities). Such estimates will enable to gauge the sensitivity of the complex health-environment interactions, and beneﬁts of proposed environmental and health interventions in urban areas.


Introduction
Myocardial infarctions (MI), commonly known as heart attacks, are a major cause of death worldwide, responsible for a substantial amount of cardiovascular related mortality, as well as a major contributor to disability. The estimated prevalence 15 of MI worldwide in 2015 was close to 16 million, with 33.000 years lived with disability attributed to the condition (Vos et al., 2016). In light of ageing western societies as well as ongoing environmental and climatic change, MI is likely to remain Several studies have employed statistical methods as well as ML to predict infectious diseases, such as malaria transmissions (Zinszer et al., 2012;Sewe et al., 2017). Zhang et al. (2014) studied heat-related mortality, and identified relevant temperature and humidity variables using Random Forests. Other studies applied ML to evaluate risk of MI or to predict acute MI based 55 on data such as patient history, blood markers, or electrocardiogram, but lack an environmental dimension (e.g., Tamarappoo et al., 2021;Commandeur et al., 2020).
In this study we correlate the total number of MI events on a given day to the coinciding and preceding environmental hazards. We employ several machine learning algorithms in a data-driven setting, using a range of meteorological, environmental, demographic and health variables. Our hypothesis is that the ML algorithm will predict a higher number of MI events 60 when temperatures are extremely high or extremely low and/or when other environmental conditions deteriorate, for instance when air pollution is high. We estimate the importance of the predictive variables in the models. We also assess the effects on different sub-groups, depending on location (urban/rural) as these may exhibit different vulnerabilities (Gabriel and Endlicher, 2011), and patient properties (age, smoking, and diabetes). The ML models that are presented can be used to estimate future health outcomes, using a set of scenarios for changes in climatic, environmental and demographic variables.
We expect however that none of the risk factors that are included in our models is strong enough to directly trigger MI in an otherwise healthy person. Instead, these environmental and demographic factors must be assumed to increase the statistical likelihood of vulnerability to MI over longer periods of time. Many common root causes of MI, e.g., atherosclerosis, are chronic and acquired over time and usually in large part due to lifestyle risks such as consumption of alcohol, tobacco, obesity, emotional stress, lack of exercise. Some of the risk factors covered in this study, e.g., air pollution or lack of green spaces, 70 can also contribute to the development of chronic disease. Once vulnerable, some factors, e.g., high ambient temperatures, are thought of as being able to directly trigger an event as well (Chen et al., 2019). In light of these considerations, it becomes clear that predicting the exact timing of MI events is complex, as they are generally the result of developments that can span decades, after which they might be triggered randomly or as a result of otherwise benign circumstance, e.g., physical exercise, emotional stress or exposure to air pollution or heat stress during summer time. 75 However many of the risk factors that we cover in this study can still modify the individual likelihood of suffering from MI. Moreover, our research motivation is to eventually estimate the long-term tendencies in MI due to climate change. In light of this, we do not expect for the models to be able to accurately provide predictions on a daily basis. Instead, we decided to aggregate our model results on an annual basis. This should allow for some of the inherent randomness to average out and allow a more statistical view on the developments. 80 In the next Section 2, we present the methods used to develop the ML models. In Section 3, we describe the process to derive the input for our data-driven approach. In Section 4 the results of the simulations and their performance are given. In Section 5 we discuss the results and give an outlook for using the models to project future MI events, and finally in Section 6 we provide the conclusions.
In this section, we present the approach to modelling the occurrence of MI events from a large variety of data and discuss the ML methods that were applied. We also consider correlations among the features and describe how we selected suitable parameters for the ML algorithms.

A supervised learning problem for MI events
There are many ways to build ML models from data, e.g., classification or regression based algorithms. In this study, we 90 focus solely on regression methods, as the registry data is case-only, i.e., by design each participant is bound to have an MI.
It is therefore not possible to build individual risk profiles or estimate the likelihood of suffering MI, due to the lack of a healthy control group. The data, however, constitutes a very good basis for predicting the number of expected MI events with regression.
The target variable, i.e., the data one is interested in learning from or about in our case is the time series of daily events of 95 MI observed in the study region. In addition, many of the co-occurring environmental variables that have a plausible causal relation to this target variable are collected and used as predictors in the training process.
We apply a range of different ML algorithms in order to test the performance of different learning approaches for predicting MI events based on the data. We use the scikit-learn package for performing the calculations (see Pedregosa et al., 2011;Pedregosa, F. et al.). The figures were produced using disability-friendly colormaps (Crameri et al., 2020). Before discussing 100 the individual regression methods that are applied to the problem, we first describe how the supervised learning problem is posed based on the daily time series of MI events and the corresponding predictive variables. Here, we use the sliding window method.
For any given day d let y d be the number of MI events and x i,d the value of the i-th predictive variable on that day (e.g., daily maximum temperature or daily mean PM 10 ). To work with standard regression algorithms a fixed number of features 105 must be selected and together with the target value y d be provided as training input. The variables x i,d represent a time series and therefore only a subset of them should be selected as a feature of the regression problem, namely the conditions on the day of prediction. Past conditions, however, might also have an influence on current events, both long and short term. The sliding window method allows for this by selecting the features with a lag n, referred to as the window size. The merits of allowing for shorter or longer memories are difficult to estimate. For instance, the effects of extremely high temperatures on MI are 110 generally expected to be short-term (Breitner et al., 2014), ranging from immediate effects to up to two weeks lag. The vector of features, i.e., the training (or test) instance on day d, is then given as: where n is the windows size and m the number of variables. Each predictive variable then yields n features and the total number of features for this problem is n·m. Accumulating the x d and y d for all days into a matrix X and a vector y yields input that can directly be used with the scikit-learn regression algorithms. We applied the ML methods and associated scikit-learn 115 4 https://doi.org/10.5194/nhess-2021-389 Preprint. Discussion started: 7 January 2022 c Author(s) 2022. CC BY 4.0 License. classes listed in Table 1, together with their abbreviations as used in the remainder of this paper. Note that some features, such as the slowly changing demographic variables, were not subject to the sliding window and instead simply used the value on the day of prediction. For this study, after testing different lags between 1 and 14 days, we exclusively used a lag of n = 3 days as this resulted in the best overall scores.
Note that throughout this paper, we use the terms predictor and feature in an interchangeable manner, namely to refer to 120 the features of the supervised learning problem derived above (the vector X and its components). The term variable is used when referring to data that is not (directly) used as a feature of the learning problem, e.g., MI counts, or when addressing predictors in a more general manner. For instance, temperature is a variable used in this study, but it is to be differentiated from the mean temperature feature TMK. This differentiation is important as the features depend on the chosen window size, i.e., a window size of n = 3 results in three individual features, that represent the mean temperature on three different days during 125 and preceding the MI event.
In addition to the environmental and socio-demographic predictors, we also added a random feature to be able to use its importance as a benchmark. Predictors less important than the random feature can be assumed to be irrelevant.

Scaling and random split
Different magnitudes of the features can have adverse effects as the results could be biased towards those variables given in 130 nominally large units relative to others. To avoid this, we apply the sklearn.preprocessing.StandardScaler class to the input, resulting in features that are centered around 0 with unit variance. Second, we withhold parts of the data from the training to have independent data instances for validation. We apply sklearn.model_selection.train_test_split with shuffle, resulting in a random 75%/25% split of the data in training and test portions. The 25% of data not used for training the algorithms are used for validation. Splitting the data randomly means that the underlying time series lose their natural temporal order. This has 135 implications when visualising and interpreting model results that we will cover in a later section, but it reduces the likelihood of autocorrelations (e.g., seasonal signals) present in the time series to result in overoptimistic predictions. Note that in order to split the data randomly, the random number generator has to be initialised with a seed. We found that different random seeds can result in significantly different results. To avoid reporting results that are strongly dependent on the chosen seed, we repeated all calculations with 100 randomly selected seeds. The result with the R 2 -score closest to the average score of the 140 5 https://doi.org/10.5194/nhess-2021-389 Preprint. Discussion started: 7 January 2022 c Author(s) 2022. CC BY 4.0 License. ensemble was then selected as a representative example of model capability. Moreover, as the dependency on the random seed is likely related to unbalanced splits, we employed a simple stratification strategy. The data is stratified along the number of MI occurrences observed, i.e., data points with the same number of MI are split among test and training in a representative way. This is especially important for rare events, such as 5 or more MI in one day. The dependency on the random seed was substantially reduced in this way, but significant differences between different seeds could still be observed.

Feature reduction
A common concern in the application of ML algorithms is the influence of correlated features, where redundancy in the related variables can lead to an overemphasis of their influence on the target variable. This can be counteracted by choosing only one of the correlated features, usually the one that has the strongest correlation with the target variable. In our case, we aimed to include as many variables as possible that could reasonably have an effect on MI. The downside is that some 150 features, for instance maximum, minimum, and mean temperature, are highly correlated on a daily basis. A visualisation of the correlation between the predictors used in this study is shown in appendix Figure A8. To address this issue we tested the option of transforming the data to a smaller feature space using principal component analysis (PCA). The resulting principal components are uncorrelated to each other and the risk of introducing spurious or overly strong relationships into the training data is reduced while retaining most of the original information. We used sklearn.decomposition.PCA and opted to retain at 155 least 98% of the variance. Having the principal components as optional features allowed us to compare predictions with PCA to estimate the potential adverse effects of correlations present in our data. The results using the PCA data (not shown here) did not improve, suggesting that using the original set of features does not introduce spurious relations. Moreover, using PCA leads to a reduction of interpretability, as the principal components are linear combinations of the original features, without a clear relation to the original variables.

Hyperparameter optimisation
The ability of the ML algorithms listed in Table 1 to produce accurate predictions is dependent on the selection of appropriate hyperparameters. These parameters generally control specific aspects of the underlying methods, such as the maximum depth of a decision tree, the number of neurons in a layer, or the strength of regularisation. With regularisation, a penalty is added as model complexity increases, which helps to avoid overfitting. In this study, we used the sklearn.model_selection.GridSearchCV 165 class to optimise hyperparameters over predefined parameter spaces with 5-fold cross validation. For this study we used the R 2 as the governing score to make decisions on optimal parameters. Repeating the procedure for all five possible splits and averaging the computed scores yields the final score for the given parameter values. Iterating over the hyperparameter space the parameter set with the best overall score is selected. Using cross-validation allows to produce more robust generalisation error estimates without having to reserve a dedicated cross-validation set that would not be available for training. Moreover, 170 by using folds based only on 75% of the training data, no information from the remaining 25% data is used for optimising the models and validation through parameter selection, i.e., leakage is prevented.
Ideally, a very comprehensive subset of the available hyperparameters would be optimised over a large and dense parameter space. In practice, time and available computing resources limit the extent to which an exhaustive optimisation can be carried out. Due to the curse of dimensionality, the number of possible combinations of parameters to test can easily become pro-175 hibitively. This applies even when the individual training process is rather fast and makes the optimisation too time-consuming to compute for practical purposes. We therefore performed a partial optimisation to circumvent this issue, i.e., we optimised over rather sparse parameter spaces and a limited number of the available parameters.
Here, some of the hyperparameters are chosen to optimise the models based on their purpose, the importance of what they control, and prevalence between the methods present in this paper (see Table 2). The listed parameters are used as model 180 optimisation variables simultaneously. As an example, in tree-based methods (DTR and RF) the selected parameters include the maximum depth of the tree as well as splitting criteria for internal nodes.
On the other hand, the procedure limits the extent to which the exponential growth of the parameter space with increasing numbers of parameters or increasing density results in prohibitive execution times. This makes the workload more manageable and works as a trade-off between accuracy and feasibility. Table 2 shows a list of the selected hyperparameters for all the 185 methods used as well as their optimised values.
Fortunately, for most of these models there are parallel implementations of their algorithms to speed up the process. Here, we used the Intel® extension package for scikit-learn, called scikit-learn-intelex. It has been designed to dynamically patch scikit-learn estimators making it much faster to optimise the models used in this study with minimal changes to the code. It is important to note, however, that the extension does not enhance all scikit-learn algorithms, such as the gradient boosting 190 regressor.

Data
The dataset used in this study is highly heterogeneous along many dimensions, with differences ranging from file format, metadata conventions, spatial coverage (e.g., regional, local) and resolution, to temporal frequency (e.g., daily, monthly, annual) and representation (e.g., raster, polygon and point data). The capabilities of ML algorithms strongly depend on the quality of the 195 data used. In this section, we give an overview of the data used in this study and describe the workflow applied to homogenise and prepare these. Table 3 lists all environmental and demographic predictive variables that were used for this study in addition to the MI data and their characteristics, as well as the source datasets and associated references.

KORA MI registry
The health dataset for our study is the KORA/MONICA MI Registry (see Tunstall-Pedoe et al., 1994;Holle et al., 2005), index (BMI), smoking status, and preexisting conditions such as diabetes. Although no detailed information is provided on the location of an MI event, they can be assigned to either the urban (City of Augsburg) or one of the two rural counties 205 (Landkreise) of the study region (Landkreis Augsburg and Aichach-Friedberg). As pointed out earlier, the individual, i.e., participant-specific information, could not be used as predictive data due to the nature of the regression approach, which aims to predict the gross number of MI in the population, rather than the outcomes of specific individuals. It is, however, possible to use this data to confine investigations to subgroups, e.g., to inhabitants of either urban or rural areas, and also to the elderly or smokers, albeit at the cost of being limited to a smaller subset of the overall data. In total the number of recorded MI is 210 n = 34618. Until 2008 the study was limited to participants of up to 74 years of age, with n = 30081 records total in that category. Figure 1 shows the aggregated number of MI per year and the mean annual cycle for the population aged under 75.
The yearly maximum in MI is observed during the winter months, whereas the summer time shows the lowest occurrences. To generate the ground truth for our regression problem, we counted the total daily number of MI observed in the KORA study and used the resultant time series as input for the ML algorithms.

Air temperature and humidity
Given the research goal of constructing relations between extreme temperature events and MI events, air temperature close to the ground is the most important factor to consider as the most direct measure of human exposure to heat and cold. The relatively small spatial scale of the study region (1998 km 2 ) puts high demand on the data in terms of resolution and accuracy.
At the same time, daily environmental data are required for our approach. 220 We opted to derive a 1x1 km grid for the study period between 1985 and 2015 from daily DWD station data in the vicinity of Augsburg and its neighboring districts. To this end, we applied universal Kriging with linear drift to the daily values at the temperature stations shown in Figure 2. The resulting gridded datasets (minimum, maximum and mean temperature) were aggregated to the counties comprising the study region. This relatively simple approach proved to be accurate enough to obtain realistic aggregated daily time series for the study region, as shown by the reasonable predictions in this paper. 225 We also include humidity related features in the models to gauge their relative importance in the models, despite the fact that humidity is often considered less important. For this project only relative humidity from the DWD was available. We applied the same Kriging procedure as used for temperature. To account for possible effects of perceived heat stress expressed by simultaneous high humidity and high temperatures we included apparent temperature. Measures of apparent temperatures relate a given temperature to the ambient humidity to account for the perceived temperature differences between dry and 230 humid conditions. Here, we applied a formula by Davis et al. (2016) to derive the apparent temperature based on temperature and humidity data. The specifics of the computation can be found in the Appendix A1.
In a next step, the data was aggregated for the three different counties within the model region, the urban and the two rural areas by computing weighted area means. The resulting daily time series can be readily used as input to the machine learning models, as described in Section 2.

Air quality
Air pollution is complex and several particulate and gaseous pollutants should be considered in investigating its effects on MI (e.g., Chen et al., 2018;Bourdrel et al., 2017).  Table A1 in the Appendix gives an overview of the selected measuring stations and their urban-rural categories, the corresponding pollutants data and their availability. Figure 2 gives an overview of the selected temperature and air quality measurement stations. We determined the aggregated daily means by calculating the mean values of the aforementioned stations, taking into consideration their location proximity to the city centers, traffic-loaded inner-city streets, on industrial areas, on the outskirts, or the large-scale background pollution. Such hot spots of pollution should have large effect on the calculated daily 245 mean.
The map indicates that there are only few air quality stations within the study region (five blue circles, and an archived station with red border circle). Moreover, these stations have not all been always active during our study period. In this case we use merely the active stations. However, if none of the regularly used station in the counties had recorded data on a given day, especially for the surrounding counties, alternative stations (light blue dots in Figure 2) from outside the study region were used as

255
The Normalised Difference Vegetation Index (NDVI) is an indicator of the greenness of the natural vegetation and other vegetation types such as agriculture and parks and gardens. It is widely used for ecosystems monitoring. In this study NDVI also is used as a proxy for shade as well as potential local cooling effect of vegetation by absorbing sunlight and through evapotranspiration. The NDVI_v2_1km database of the Copernicus Global Land Service (CGLS) vegetation products is freely available at a 1x1 km spatial resolution starting on April 1998 measured every ten days. We extracted the NDVI in our region 260 of interest and a cubic spline interpolation has been used to upscale the temporal resolution from 10 days to daily values. Given the very gradual rate of change in vegetation cover and consequently the NDVI, we assume this interpolation does not produce large errors. Note that due to lack of availability of NDVI data before 1998, training and testing of the algorithms had to be  a given year.

Weekly predictions of MI events
Our models produce daily predictions of MI events based on the environmental and demographic features within the given window size. We found that the models are not able to reproduce the daily variability of MI with sufficient accuracy. As an 280 example we show the daily predictions aggregated to 7-day intervals to increase visibility in Figure 3. The resultant scores are given in Table 4 for both training and validation respectively. Although the seven-day predictions fit reasonably well for the training period, for the testing period the models do not predict 7-day variations (or day-to-day predictions) accurately enough. The predictions are too close to the mean and lack the variability displayed by the observations. This is likely related to randomness as well as risk factors that affect MI events that 285 were not considered in the models. For instance, the temperature or air quality predictors may not sufficiently capture actual local circumstances, but also information about the built environment and other conditions that can not be easily accounted for is missing. Figure 4 shows the model performances on both the training and test sets as well as the actually observed MI as a reference for 290 the five ML models given in Table 1. After training the models and performing the daily prediction on the test set, the results were aggregated to annual sums. By aggregating the model results to an annual basis, some of the inherent randomness is averaged out. Based on the annualised prediction results and time series of observed MI, the performance scores were derived (see Table 5). The training scores demonstrate that the ML models are able to predict the year-to-year variations quite well, with adjusted R 2 scores between 0.86 and 0.93. The performance on the test dataset is relevant for assessing the generalisation error 295 for previously unseen data. In contrast to the training data, the results on the test set are less but still reasonably accurate, with adjusted R 2 scores between 0.59 and 0.71, showing that inter-annual variations and long-term trends are largely captured. The RR and MLP models exhibit the best performance, showing that both well-tuned linear models as well as neural networks are able to simulate the relations between environmental conditions and MI events. The DTR shows the lowest overall performance by comparison.

Feature importance
In Figure 5 we show a condensed rendition of the feature importance where related variables have been grouped together of each model; except for the MLP which does not support feature importance within the scikit-learn framework. Note that variables subject to the sliding window were aggregated over the window length of three days to improve readability. Moreover, features related to time such as the current year or the day of the week were also aggregated to a single group. More detailed 305 plots retaining the differentiation of all features and window days can be found in the Appendix (see Figures A6 and A7). The latter Figure also shows that many of the original demographic features carry little to no weight. We therefore reduced the granularity of the demographic data to the age groups 0 − 29, 30 − 49, 50 − 74 and > 75, generally yielding improved results.
While the performance of the models differs, some trends can be observed. Overall, the single most important group is air quality, followed by temperature and demographic predictors. In contrast to apparent temperature humidity only exhibits low 310 importance across all models. Time related features as well as NDVI exhibit the lowest explanatory power. NDVI is ranked very closely to the random feature by all models, and in some cases even slightly lower. All other features are consistently ranked above the random feature, indicating that they have relevance in predicting MI occurrence.

Subgroup analysis
The models were also applied to subgroups of the population, albeit at the expense of a reduced amount of available training 315 data (see Table 6) for an overview). For this analysis we selected a total of 5 subgroups: the urban (Augsburg city) and rural population (two adjacent counties) respectively, the elderly (people aged between 60 and 74), diabetics, and active smokers.
The data was reduced to include only participants with the associated attribute. The training procedure was then repeated as detailed for the general case on the resulting subsets. As expected, the validation scores dropped considerably for all subgroups, likely a consequence of reduced amounts of training data. We refer to the Appendix for detailed results, but for the urban and 320 rural subgroups adjusted R 2 scores between 0.6 and 0.37 were observed in validation (see Tables A2 and A3). Both subgroups, being of almost equal size, performed comparably well, with the urban population exhibiting slightly lower scores however.  Table 6. Overview of the number of cases as well as age and sex distribution for the different study populations considered.
The validation results for the elderly population (see Figure A3 and Table A4) are more accurate (adjusted R 2 between 0.5 and 0.68) than for the urban and rural populations, although the number of training samples is much higher in both of those cases.

325
The results for the diabetic population are shown in Figure A4 and Table A5. As observed with the elderly, the scores for the diabetic part of the population (adjusted R 2 between 0.44 and 0.55) are comparable to those of the (much bigger) rural and urban subgroups, albeit slightly lower.
The results for the smoking population are shown in Figure A5 and the scores are given in Table A6. For this group adjusted R 2 validation scores drop to around 0.42 on average, indicating a less accurate fit than for all the other subgroups. This is 330 consistent with the smoker group being the smallest of the explored subgroups resulting in the lowest amount of training data as well.
Overall, the skill of the models is clearly reduced when limited to subsets of the overall data. The decrease in performance, however, is quite different between subgroups, especially when taking into account their relative sizes. A particularly interesting question is whether the variable importance for any one subgroup changes substantially in comparison to the general 335 population. Figure 6 shows the difference in variable importance for each of the subgroups in relation to that of the general population. To aid readability related features have been grouped again. Considerable differences between subgroups, models and feature groups can be observed. For instance, three out of four models agree that air quality, humidity and demographics are particularly important for predicting urban MI, while giving less weight to the temperature related features. The importance of time related indicators reduced consistently over the general population. In most cases the importance of the random feature is 340 also reduced, indicating increased robustness of the results. NDVI exhibits mostly increased importance, but with considerable inter-model spread.
For the rural population the results suggest slightly increased relevance of air quality, and a bit more pronounced increases in humidity and NDVI. Temperature and apparent temperature mostly align with the results for the general population. The demographic indicators are less relevant when compared to the general case, as are the time related features. The random 345 feature is mostly ranked as less important.
For the elderly, the models are undecided on air quality, with a slight tendency towards reduced importance. On the other hand, the weight of the demographic features is emphasized in comparison to the general case. Less importance is also attributed to relative humidity and apparent temperature. On both temperature and time related features, as well as the random feature the models are undecided, whereas NDVI is mostly found to be more important.

350
For the diabetic population, the models mostly agree that demographic features and temperature are more important in predicting MI for this group in comparison to the general population. On the other hand, relative humidity, time related features and the random feature are ranked lower. NDVI is mostly considered to be more important overall. Only for the air quality group and the apparent temperature are the models undecided, with a tendency towards decreased importance in both cases.
Lastly, for the group of active smokers the models mostly suggest an increased importance of air quality as well as de-355 mographic features and NDVI for the prediction of MI. Relative humidity as well as apparent temperature and time-related features are overall considered less important. The models are undecided on temperature but with a tendency of increased relevance. The same holds true for the random feature, albeit with the opposite tendency.

Discussion
To our knowledge, this is the first study building and testing machine learning models that include more than only weather 360 variables (such as Zhang2009 for heat mortality) for predicting MI incidence. The developed ML models have varying skill in predicting MI. At the daily to 7-day time scales, randomness seems too large to produce meaningful predictions. However, when the prediction are aggregated to annual sums, the models are very well capable of reproducing the inter-annual variability of observed MI as well as the long-term trends, also for the validation datasets. This is comparable to the performance of methods used for predicting malaria incidence (e.g., Sewe et al., 2017). In terms of performance scores the models achieve 365 very similar outcomes both in training and validation (see Table 5), indicating some robustness of the predictions. More qualitative differences emerge, however, when investigating feature importance. There are substantial differences between the ML models in terms of some features ( Figure 5). Most models rank air quality variations and temperatures among the most important features, but a large spread can be observed for humidity (UPM) and the time-related features. This indicates at least some inherent uncertainty. The feature importance for the MLP model, which had the highest performance, was not assessed 370 in this study, due to limitations within the scikit-learn framework.
Classical epidemiological approaches like general linear or additive models are mostly interested in explaining the direction and corresponding uncertainty of an association between one or several (environmental) risk factors and health thereby adjusting for potential confounding factors. In case of potential non-linearities, the shape of the exposure-response curve is usually modeled as a smooth function. However, the models are limited in case of high correlation and/or high-dimensional interactions between the covariates. The suggested ML approaches can (partly) handle these issues and offer the possibility to compare the importance and predictive performance of a multitude of environmental predictors.
The training scores in many cases are close to the maximum, with adjusted R 2 values greater than 0.85. This may be indicative of overfitting, possibly opening room for improving further on the generalisation by applying stronger regularisation.
While the models were adjusted by optimising the hyperparameters, not all possible parameter values have been explored. For 380 instance, in the case of the tree-based models pruning is an effective way to reduce overfitting. We did not apply pruning, because this is a very time-consuming approach and we had to limit the extent to which we could perform an exhaustive optimisation. For the MLP and RR models regularising parameters were explicitly included in the optimisation, but possibly the ranges were not wide enough to achieve the best trade-off between training and validation.
The model results are sensitive to the selection of the random seed that is used in making the initial train-test split. We 385 found that changes in the random seed routinely had greater impact than the choice of hyperparameters. One way of dealing with this would be to also include this random seed in the optimisation process. Currently, only the random seeds used for randomly selecting the folds in cross-validation and in initialising the regressors are optimised. In light of the strong influence of the initial split, however, we opted to instead test over a range of possible seeds and select the results closest to the average performance of the models, not to overstate our results. The sensitivity to the initial split may indicate a lack of data, but is 390 likely mostly due to unbalanced splits. We reduced this sensitivity by employing a simple but effective stratification strategy.
This reduced the variation across seeds, but does not entirely resolve the issue. Possibly, more intricate stratification approaches may reduce the dependency even further.
We were already able to indicate differences between different geographical regions, e.g. urban and rural populations. For instance, humidity and air quality are more important predictors for the urban population, compared to the overall population.

395
The models could be further improved by increasing the spatial representation, as the environmental predictors also would support this. Increased spatial representation would also allow for additional exposure metrics to be established and more predictors, such as those related to building structures, their insulation and energy efficiency.
We used a sliding window of three days to allow for lagged effects. Depending on the variable, however, longer windows might be more appropriate such as for cold exposure in winter. Ideally every predictor would have its own individual window 400 length, derived as part of the optimisation process.
An additional area of possible improvement is the environmental data. Some variables such as the NDVI and the air quality indicators were not fully available for the period between 1985 and 2015, effectively limiting the data available to the period from 1998 to 2015. Many variables were also derived from station networks after cleaning and applying a simple Kriging method. It is possible that reducing bias in these data, for instance by using more sophisticated interpolation methods or 405 additional data sources such as from remote sensing, could further improve the results.
Finally, it could also be worthwhile to study the direction of the change related to a given feature as to get a better grasp on the qualitative relationships between predictors and MI events in the models. It should be noted, however, that other than RR all methods used are nonlinear and may therefore not exhibit simple directional effects.
All models consistently stressed the importance of air quality variables. Climate impact studies, especially related to MI, 410 might therefore benefit from carefully analysing possible future developments of these variables. Electrification of traffic, reduction of fossil fuel and related changes might yield substantial improvements in air quality in the future. Instead of just focusing on projected changes in, e.g., temperature and humidity, scenarios for air quality need to be considered as well, to gain more precise insight into possible impacts of climate change on health and possible strategies to reduce these.
Current research and data availability in the fields of climate modelling, and demographic and environmental scenario 415 development provide many opportunities to use the developed ML models from our research for projecting future health risks. Ensembles of regional climate models provide climate projections with the highest spatial resolution. For the study region, EURO-CORDEX simulations (Jacob et al., 2014(Jacob et al., , 2020 can be considered as they provide the largest ensemble at a high spatial (0.11°, i.e., 12 km) and temporal (daily) resolution climate simulations that are available today. Several of the predictors used in this study could be derived from the EURO-CORDEX ensemble, namely temperatures and in many cases 420 relative humidity, dew-point temperatures and therefore also apparent temperatures. Alternatively, an ensemble of convection permitting decadal regional climate simulations at roughly 3km, both for historic and future conditions, has been created within CORDEX FPS (e.g., Ban et al., 2021). Using an ensemble of near-future (2035-2065) climate model simulations allow for scenario uncertainty, internal climate variability, and climate model uncertainty to be assessed (Hawkins and Sutton, 2011) when comparing the changes in MI to the reference historical simulations. Projections of vegetation changes, as represented by NDVI, at the required spatial scale are not readily available. On the other hand, it can be reasonably assumed that the potential for increased greenness in the inner city is limited. Likewise, the potential for substantial effects from added green in the rural surroundings of Augsburg is low, as it is already ubiquitous there.
We therefore believe that moderate up-or downscaling of NDVI patterns observed in the past and present may suffice to yield suitable estimates of possible future developments, such as adaptation measures of increasing vegetation to reduce the urban 435 heat island effect or offset impacts from overall warming due to anthropogenic climate change.
The air quality projections are related to the emission scenarios used by global climate models. For the CMIP6 climate models, estimates of regional surface air quality are available at the global model scale Turnock et al. (2020). These projections could be used to scale the observed daily air quality observations, but more exhaustive and local projection data would be preferred. To date, however, regional climate models do not feature the necessary complex chemical models to accurately 440 model the transport, dispersion and diffusion of pollutants. Here, we also excite a pragmatic up-or downscaling (depending on related socioeconomic scenarios, e.g., fossil fuels vs. electrification of traffic) of the observed patterns to bridge the gap between the projections of future air pollution at the global level (from the global climate models) to the local level.

Conclusions
We have developed an approach for predicting MI events using multi-variable Machine Learning methods, based on environ-445 mental and demographic data for a case in Augsburg, Germany. Given that health outcomes depend on a multitude of factors, we applied this data-driven approach to establish relevant relationships, and collected several different variable datasets. We acquired heterogeneous data on MI events from the KORA MI registry, as well as weather, environmental and demographic data from various sources to create a meaningful and consistent daily time series of the predictive features and the target variable.
to three days. Five different regression algorithms were trained on this data, based on random 75/25 train-test splits for the period between April 1998 and December 2015. An extensive effort has been carried out to optimise a selection of the various hyperparameters modifying the behaviour of the regressors, based on 5-fold cross validation with respect to the R 2 scores.
Applying the trained models on the unseen test data allowed an estimation of the generalisation error of the models. We found that the daily predictions do not show meaningful predictions of MI events. We found that the annually aggregated predictions 455 agree moderately well with the observed MI events, accurately reflecting observed trends and inter-annual variability of MI.
The match between observations and the model predictions is supported by the observed validation scores, with adjusted R 2 scores ranging between 0.59 and 0.71. Overall, the models displayed comparable skill, but the RR and MLP models slightly outperformed the tree-based methods. The least accurate results were produced by the DTR model. Moreover, analysing the feature importance where possible, we found that despite similar overall scores, the relative weight put into different features 460 can vary substantially between the models. This emphasised the necessity to consider ensembles of models, as it allows to gauge the model spread and estimate inherent uncertainty. In this study, all models explored show that air quality is the most important feature to predict MI, followed by absolute and apparent temperatures, the latter including humidity. We also applied the models to various vulnerable subgroups, such as the elderly or diabetic patients, resulting in only slightly reduced skill scores due to the reduced amounts of training data.

465
Possibilities to improve the current approach are manifold. One aspect is to improve on the quality of the predictive data, e.g., by using more advanced point to space methods rather than ordinary Kriging as was used in this study. Given the relatively good performance of MLP in this study more complex and potent neural networks could be explored to possibly improve on this further. Moreover, different ML approaches could be explored, such as density estimation and Bayesian methods, yielding estimates of relative risk of different groups to suffer MI. Such estimates could be more readily compared with commonly used 470 epidemiological models than the regression models presented here. Overall, the models' capacity to give reasonable estimates of possible future developments of MI based on the predictive features appears robust. In a next step we aim to apply the trained models to scenarios of future climatic, environmental and demographic conditions. This will allow estimating future changes in MI taking into account climatic, as well as other environmental and demographic changes that has most often not been done. These changes could also include further improvements in air quality, or increased 'greening' of urban environments 475 with vegetation. New estimates of changes in MI can be based on these ML methods, using ensembles of projected climate change, demographic scenarios including ageing population, and environmental scenarios contingent on societal transformation (e.g., electrification of traffic, greening of cities). Such estimates will enable to gauge the sensitivity of the complex healthenvironment interactions, and benefits of proposed environmental and health interventions in urban areas.

A1 Derivation of apparent temperature
We have computed the apparent temperature according to: where T a is the apparent temperature, T the near-surface mean temperature and T d the near surface dewpoint temperature.
Dewpoint temperature, however, was not available for this study. To facilitate estimating the apparent temperature we therefore first derived another humidity related quantity: vapour pressure. Applying again universal Kriging with linear drift, we arrived at 1x1 km gridded data for vapour pressure, applying the Magnus formula to estimate the dewpoint temperature: where a = 7.5, b = 237.3, v = log 10 pv 6.1078 with p v the vapour pressure. Applying these formulas to the gridded temperature and humidity data derived before yields a 1x1 km grid for apparent temperature. Note that the formulas were independently applied to mean, maximum and minimum temperature. Subsequent aggregation over the model region then completed the preparation of apparent temperature as input feature.

Stations:
Augsburg/Königsplatz  Figure A5. Annually aggregated predictions of MI in the smoking population for all models. Predicted (solid) and observed MI (dashed) for training (a) and test (b) sets.      Code and data availability. The data used in this paper are available from third party sources. The principle MI registry data is available from the KORA data set, and can be applied for at HMGU here: https://epi.helmholtz-muenchen.de/. Other data sources (environmental and demographic data) are available from the sources quoted in the paper (Table 3). The code used for the ML models and data pre-and post-processing is available on request from the authors.
ogy, University Hospital of Augsburg. Many thanks for their support go to the local health departments, the office-based physicians and the clinicians of the hospitals within the study area. Finally, we express our appreciation to all study participants.
This work was supported by the Helmholtz Zentrum München, German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education, Science, Research and Technology and by the State of Bavaria and the German Federal Ministry

505
of Health. This research received also supported from the Faculty of Medicine, University of Augsburg, and the University Hospital of Augsburg, Germany. Since the year 2000, the collection of MI data has been co-financed by the German Federal Ministry of Health to provide population-based MI morbidity data for the official German Health Report (see www.gbe-bund.de).