Leveraging multi-model season-ahead streamflow forecasts to trigger advanced flood preparedness

Abstract. Disaster planning has historically allocated minimal effort and finances toward advanced preparedness; however, evidence supports reduced vulnerability to flood events, saving lives and money, through appropriate early actions. Among other requirements, effective early action systems necessitate the availability of high-quality forecasts to inform decision making. In this study, we evaluate the ability of statistical and physically based season-ahead prediction models to appropriately trigger flood early preparedness actions based on a 75% or greater probability of surpassing the 80th percentile of historical seasonal streamflow for the flood-prone Marañón River and Piura River in Peru. The statistical prediction model, developed in this work, leverages the asymmetric relationship between seasonal streamflow and the ENSO phenomenon. Additionally, a multi-model (least squares combination) is evaluated against current operational practices. The statistical and multi-model predictions demonstrate superior performance compared to the physically based model for the Marañón River by correctly triggering preparedness actions in all four historical occasions. For the Piura River, the statistical model proves superior to all other approaches, and even achieves an 86% hit rate when the required threshold exceedance probability is reduced to 50%, with only one false alarm. Continued efforts should focus on applying this season-ahead prediction framework to additional flood-prone locations where early actions may be warranted and current forecast capacity is limited.


Introduction and motivation
Globally, flood catastrophes lead all natural hazards in terms of mortality and cause billions of dollars in damages annually (Doocy et al., 2013; IFRC, 2020; Lee et al., 2018; Munich RE, 2012, 2018). Government agencies and relief organizations have historically prioritized disaster relief, allocating the majority of financial resources to response efforts in a reactionary mode, in lieu of pre-disaster preparedness. However, forecast-based early action (FbA) initiatives are now recognized as a critical component of disaster risk reduction (World Disasters Report 2009: Focus on early warning, early action, 2009). While no strict definition for FbA exists, the term generally refers to initiatives that provide assistance [...] actions such as unconditional cash disbursements targeting vulnerable households can yield a benefit regardless of whether or not the event occurs (Wilkinson et al., 2018). Forecast models that proficiently predict extreme events at lead times permitting early action are critical for minimizing false positives and false negatives. In addition to short-term weather forecasts, which are commonly viewed as skillful, medium- to long-range climate forecasts have also been demonstrated to improve preparedness protocols, resulting in reduced mortality, morbidity, and resource demands (Braman et al., 2013), yet their applications have been limited, predominantly as a result of moderate forecast performance and significant uncertainty.
Hydrologic models are essential components of flood early warning systems and can typically be divided into two categories. Physically based (or dynamical) models simulate physical processes such as infiltration and runoff to produce streamflow predictions and are often forced with climate predictions downscaled from general circulation models (GCMs) or numerical weather models. Statistical (also called empirical or data-driven) models forgo the parameterization of complex physical processes in favor of understanding the lagged relationships between precipitation or streamflow and antecedent land, atmosphere and ocean conditions. Statistical and physical models have been successfully applied to seasonal prediction of hydrologic variables including precipitation and streamflow (e.g. Badr et al., 2013; Block & Rajagopalan, 2009). Both frameworks have their own set of advantages and disadvantages, with prediction skill varying according to season and location. While statistical models are not intended to provide a complete understanding of the hydro-climate system, they offer an appealing complement to physically based models by focusing solely on the prediction variable of interest (Zimmerman et al., 2016).

This study evaluates multiple season-ahead forecast approaches, namely locally tailored statistical and existing global-scale physical models, to individually and collectively inform advanced flood preparedness actions, using Peru as a case study.
Typically, only physically based forecast approaches are used operationally; however, augmenting them with a locally tailored statistical forecast may considerably improve forecast performance and opportunities for preparedness.

Flood impacts in Peru
Peru experiences catastrophic flooding with relative frequency, resulting in significant adverse economic and health impacts.

Hydroclimatology of Peru
While floods are common throughout many regions of Peru, climate and hydrology vary dramatically. The hydroclimatology of Peru is broadly characterized by a disruption of tropospheric flow caused by the Andes cordillera, which maintains an arid climate along the Pacific coast and wet conditions in the Amazon basin to the east (Garreaud et al., 2009). Particularly along coastal Peru, a major source of interannual variability in precipitation and temperature is the El Niño Southern Oscillation (ENSO) phenomenon, a system of ocean-atmosphere feedbacks in the tropical Pacific (Garreaud et al., 2009). In the southern coastal region, the warm, positive phase of ENSO (El Niño) is associated with below average precipitation (Wu et al., 2018). In northwest Peru, strong El Niño years are often associated with above average precipitation, most notably during the 1982-83 and 1997-98 El Niño events which coincided with extreme rainfall and flooding (Bayer et al., 2014).
However, the impacts of similarly intense El Niño events are variable. Despite very strong El Niño conditions in 2015-2016, rainfall and flood impacts in Peru were minimal (French & Mechler, 2017; Ramirez & Briones, 2017; Venkateswaran et al., 2017). El Niño events can span the equatorial Pacific region (e.g. 1982-83, 1997-98) or they can be confined to the coast of northern Peru and Ecuador (Ramirez & Briones, 2017). The latter type is known as a "coastal El Niño" or "El Niño costero" and occurred in 1925 and 2017, in both cases resulting in extreme rainfall and flooding (Ramirez & Briones, 2017; Takahashi & Martínez, 2017). While El Niño conditions are associated with extreme events along the coast, La Niña (cool, negative phase of ENSO) conditions can also produce slightly higher than average streamflow (see Figure 2b).
In the Amazon basin, while the literature has described relationships between climate patterns and hydrometeorological variables, the way in which climate variables influence flood risk remains understudied (Towner et al., 2020) as a result of the nonlinear relationship between precipitation and streamflow (Stephens, Day, Pappenberger, & Cloke, 2015).
Hydrometeorological regimes in the Amazon basin are diverse and are driven by seasonal warming of the northern and southern hemispheres and the migration of the Intertropical Convergence Zone (Espinoza Villar et al., 2009). Precipitation in the Peruvian austral summer (DJFM) is dominated by the South American Monsoon season, which enhances the north Atlantic trade wind (Zhou & Lau, 1998), as well as by deep convection that recycles moisture over Amazonia (Garreaud et al., 2009). El Niño conditions and above-average sea surface temperatures (SST) in the tropical north Atlantic, south Atlantic, and Indian Oceans are associated with decreased rainfall in the northern portion of the basin and increased rainfall in the south (Marengo, 2004). La Niña conditions are weakly associated with increased precipitation in the western Amazon basin (Garreaud et al., 2009).

Flood early action plan
In October 2019, the International Federation of Red Cross and Red Crescent Societies (IFRC) approved an Early Action Plan (EAP) submitted by the Peruvian Red Cross for flooding in the Peruvian Amazon. The plan is based in part on an extension of the Global Flood Awareness System (GloFAS) called GloFAS-seasonal, a global streamflow forecast model developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) that couples seasonal climate forecasts from GCMs to a physically based hydrology model (Emerton et al., 2018). Early actions, which involve the prepositioning of supplies and release of funds, are triggered when 75% of GloFAS ensemble members forecast streamflow above the 80th percentile (IFRC, 2019) at a 45-day lead time. Because GloFAS exhibits only modest forecast skill in Peru when detecting floods at short lead times (Bischiniotis et al., 2019), there is an opportunity to leverage complementary prediction frameworks to improve forecast performance. Similarly, an EAP is in development for the Piura basin in coastal northwest Peru to address extreme precipitation and flooding.
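The ensemble-based trigger described above can be sketched in a few lines. This is a minimal illustration, not the operational EAP implementation: the function names, the 12-member ensemble, and the flow values are hypothetical, and the threshold is assumed to be supplied as a precomputed 80th-percentile flow.

```python
from typing import Sequence


def exceedance_probability(members: Sequence[float], threshold: float) -> float:
    """Fraction of ensemble members whose streamflow is above the threshold."""
    if len(members) == 0:
        raise ValueError("ensemble must contain at least one member")
    return sum(1 for m in members if m > threshold) / len(members)


def triggers_early_action(
    members: Sequence[float],
    threshold: float,
    required_probability: float = 0.75,
) -> bool:
    """Act when the share of members above the threshold meets the requirement."""
    return exceedance_probability(members, threshold) >= required_probability


# Hypothetical ensemble (m3/s) and hypothetical 80th-percentile flow.
ensemble = [26000.0] * 9 + [21000.0] * 3
print(triggers_early_action(ensemble, threshold=25000.0))
```

Keeping the required probability as a parameter makes it straightforward to explore less strict trigger levels than 75%.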

Case study locations
Study locations prone to riverine flooding were identified by collaborators at the Red Cross Climate Center in Lima, Peru, and the EAPs, namely the Marañón River at San Regis and the Piura River at Puente Sánchez Cerro (Figure 1). The Marañón is a tributary to the Amazon River, east of the Andes, with a basin covering approximately one-half (362,000 km²) of the Peruvian Amazon River basin. Here, tropical lowland forest (below 600 m elevation) is the dominant ecozone followed by tropical montane forest (above 600 m elevation) (Kvist & Nebel, 2001). The Piura River basin above Puente Sánchez Cerro is significantly smaller in size (7,435 km²), consists of tropical shrubland and tropical mountain systems, and is generally classified as arid with precipitation averaging less than 50 mm/year for elevations below 500 m (FAO 2001; Rodriguez et al., 2005). Throughout this paper, the names of the monitoring stations will be used to describe the stations and the basins they delimit.

Streamflow variability
Daily streamflow data for each location (1999 at San Regis, 1971 at Puente Sánchez Cerro) were provided by the Peruvian Meteorological Agency, El Servicio Nacional de Meteorología e Hidrología del Perú (SENAMHI). Monthly mean streamflow at Marañón exhibits a statistically significant autocorrelation at one- and two-month lags; however, monthly streamflow at Piura exhibits no significant autocorrelation. This is predominantly an effect of catchment size and watershed memory, and is an important feature for streamflow prediction.
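A lagged autocorrelation check of this kind can be sketched as follows. The significance test used in the study is not stated, so the approximate 95% white-noise bound of 1.96/sqrt(n) used here is an assumption, as are the function names and the toy series.

```python
import math
from typing import Sequence


def lag_autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if lag < 1 or lag >= n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("series has zero variance")
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


def is_significant(r: float, n: int, z: float = 1.96) -> bool:
    """Assumed test: compare |r| with the approximate white-noise bound z / sqrt(n)."""
    return abs(r) > z / math.sqrt(n)


# Toy monthly series (hypothetical values), not SENAMHI data.
flows = [1.0, -1.0] * 10
r1 = lag_autocorrelation(flows, 1)
print(r1, is_significant(r1, len(flows)))
```

In practice the seasonal cycle would likely be removed (for example, by working with monthly anomalies) before computing the lags.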

The high flow season during which floods are likely to occur is computed using an approach modified from Lee et al. (2015). This season is defined as the three consecutive months with the largest combined number of days having streamflow

FMA while the annual maximum discharge occurred in FMA in 40 out of 47 years. Clearly, high flow conditions also occur outside these seasons; however, they are not captured in this study, as the focus is on the likelihood of high flow conditions within the target season only.
3 Statistical approach to season-ahead streamflow prediction

Potential local-scale predictor variables
Ocean-land-atmospheric variables representative of slowly evolving hydro-climatic conditions offer prospects for predicting streamflow from a season-ahead lead. This includes considering pre-season large-scale ocean-atmosphere teleconnections and basin-scale hydrologic processes such as observed streamflow, precipitation, soil moisture, and temperature (Table 2).
Predictions of seasonal average streamflow are issued on the first day of the three-month high flow season identified in Sect. 2, leveraging predictors based on values in the preceding months. Potential predictors must be statistically significantly correlated with streamflow (p < 0.1) to be retained.
https://doi.org/10.5194/nhess-2021-25 Preprint. Discussion started: 9 February 2021. © Author(s) 2021. CC BY 4.0 License.
Precipitation data used in this study are drawn from the Peruvian Interpolation data of SENAMHI's Climatological and hydrological Observations (PISCO) v2.1 dataset (Aybar et al., 2020), provided by SENAMHI and accessed via the International Research Institute for Climate and Society (IRI; http://iridl.ldeo.columbia.edu). PISCO contains monthly and daily precipitation at a 0.1 degree grid resolution from 1981 to 2017, and is based on the Climate Hazards group InfraRed Precipitation with Stations (CHIRPS; Funk et al., 2015) quasi-global precipitation product calibrated with SENAMHI station data. Basin-averaged precipitation over January-February is included as a potential predictor for the Marañón at San Regis (Table 2). January and February precipitation each also correlate significantly, though less strongly than the January-February average; to maintain model parsimony we included only the latter as a potential predictor. The Piura catchment is approximately 2% the size of the Marañón and only basin-averaged precipitation in January significantly correlates with streamflow (Table 2).
Soil moisture data (0.5°, monthly) are provided by the National Oceanic and Atmospheric Administration (NOAA) Climate Prediction Center (Fan & van den Dool, 2004). Atmospheric moisture transport can occur over long distances and across catchment boundaries; to capture potential signals of soil moisture on streamflow variability, a principal component analysis is conducted on one-month-ahead gridded soil moisture across northern South America, and the first principal component (PC) is retained as a potential predictor. Basin-averaged mean air temperature in the month prior to the forecast, provided by NOAA (https://psl.noaa.gov/), is also considered (Table 2).

Given that the Piura basin is relatively small and within-season precipitation is an important contributor to seasonal streamflow, FMA precipitation (mm/day) predictions derived from the mean of two GCM members (NASA GEOSS2S and NCEP CFSv2) of the North American Multi-Model Ensemble (NMME) are also evaluated. The two models have exhibited superior performance in terms of RMSE, temporal correlation, and Heidke Skill Score in northwest Peru compared to other NMME models when simulating January, February and March precipitation across lead times of one to six months (Wang & Vavrus, 2020). Individually, each model's FMA precipitation prediction correlates with streamflow at 0.76; when averaged, correlation increases to 0.84 (Table 2).

Potential large-scale predictor variables
A common approach for identifying SST regions for use as predictors is to search for stable correlations between the predictand (streamflow in this case) and SSTs over a moving window of historical data (Gámiz-Fortis et al., 2010; Ionita et al., 2015). However, the state of ENSO can influence the mean state of the atmospheric-oceanic system, which in turn affects the relevant teleconnections between SSTs and precipitation or streamflow (Zimmerman et al., 2016). This asymmetric relationship between ENSO and streamflow may prove challenging from a traditional modeling perspective. At our study sites, the distributions of seasonal streamflow shift and change shape according to the state of ENSO, though significant variability within each phase exists (Figure 2). A Nino Index Phase Analysis (NIPA; Giuliani et al., 2019; Zimmerman et al., 2016) approach is advantageous in such cases, capturing the variance and signals within each phase separately, and thus addressing the overall asymmetric challenges. [...] correlated with streamflow (Figure 3), the first and second PCs are extracted as potential predictors in the statistical model.
For Piura (Marañón) the first and second PCs explain 83% and 7% (84% and 6%) of the variance, respectively, and only the first PC significantly correlates with streamflow. Selecting SST regions based on the preseason state of the Niño 1+2 anomaly index instead of MEI did not materially change results at Piura.
Given that SLP evolves more quickly than SSTs, only the single-month values prior to the target season are evaluated; otherwise the process mirrors SST selection. SLP data are from the NCEP/NCAR Climate Data Assimilation System I (Kalnay et al., 1996) and accessed via the IRI data library. Only regions statistically significantly correlated at p < 0.05 are included.

Statistical prediction model
For each location, a principal component regression (PCR; coupled principal component analysis and multiple linear regression) framework is adopted to predict seasonal streamflow by ENSO phase. This results in two PCR "submodels" for the Marañón River at San Regis and three for the Piura River at Puente Sánchez Cerro, where the submodel used for prediction in a given year is selected based on preseason MEI. For example, in 1998 the preseason (NDJ) average MEI value is 2.43, so the positive phase submodel is selected to predict Piura River FMA streamflow. In each submodel, relevant predictors by ENSO phase are included; predictor variable types may be included in some submodels and not others, depending on their correlation with streamflow in that phase. A subset of PCs is retained for input into the multiple linear regression, given as:

$$y_t = \beta_0 + \beta_1 PC_{1,t} + \dots + \beta_n PC_{n,t} + e_t \tag{1}$$

where $y_t$ is observed seasonal streamflow in year $t$, $\beta_0$ is a constant, $\beta_1 \dots \beta_n$ are regression coefficients, $PC_{1,t} \dots PC_{n,t}$ are the PCs retained, and $e_t$ is the residual or error. There are numerous methods for selecting the appropriate number of PCs to retain; here, the first two PCs are retained unless the model has two or fewer predictors, in which case only the first PC is retained.
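The phase-based routing to a submodel can be illustrated with a short sketch. The MEI bounds separating the phases are not given in this section, so the values of -0.5 and 0.5 below are placeholders, and the function names and coefficients are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def enso_phase(preseason_mei: float, lower: float = -0.5, upper: float = 0.5) -> str:
    """Classify the preseason ENSO state; the bounds are assumed placeholders."""
    if preseason_mei >= upper:
        return "positive"
    if preseason_mei <= lower:
        return "negative"
    return "neutral"


def pcr_predict(coefficients: Sequence[float], pcs: Sequence[float]) -> float:
    """Evaluate y = b0 + b1*PC1 + ... + bn*PCn for one year."""
    if len(coefficients) != len(pcs) + 1:
        raise ValueError("need one constant plus one coefficient per PC")
    return coefficients[0] + sum(b * pc for b, pc in zip(coefficients[1:], pcs))


def predict_by_phase(
    preseason_mei: float,
    submodels: Dict[str, Sequence[float]],
    pcs: Sequence[float],
) -> Tuple[str, float]:
    """Pick the submodel for the year's phase and apply it to that phase's PCs."""
    phase = enso_phase(preseason_mei)
    return phase, pcr_predict(submodels[phase], pcs)


# Hypothetical coefficients for three submodels (constant, PC1, PC2).
submodels = {
    "positive": [10.0, 2.0, -1.0],
    "neutral": [5.0, 1.0, 0.5],
    "negative": [6.0, 0.8, 0.2],
}
print(predict_by_phase(2.43, submodels, pcs=[3.0, 4.0]))
```

The PCs passed in are assumed to have been computed from the predictors belonging to the selected phase.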
To favor parsimonious models, the optimal subset of predictors is selected according to the generalized cross-validation (GCV) score function (P. Block & Rajagopalan, 2007), given as:

$$GCV = \frac{\sum_{t=1}^{N} e_t^2 / N}{\left(1 - m/N\right)^2} \tag{2}$$

where $e_t$ is the model error, or difference between observed and predicted values, $m$ is the number of predictors, and $N$ is the number of data points (time steps). GCV penalizes the use of additional predictors; lower scores indicate a better tradeoff between minimizing prediction errors and the number of predictors included.
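As a worked illustration of this selection step, the sketch below scores candidate predictor subsets. It assumes the commonly used form of the score, GCV = (sum of squared errors / N) / (1 - m/N)^2, which matches the variables defined above; the function names and the error values are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def gcv_score(errors: Sequence[float], n_predictors: int) -> float:
    """Assumed GCV form: mean squared error divided by (1 - m/N)^2."""
    n = len(errors)
    if n == 0 or n_predictors >= n:
        raise ValueError("need more data points than predictors")
    mean_squared_error = sum(e * e for e in errors) / n
    return mean_squared_error / (1.0 - n_predictors / n) ** 2


def best_subset(candidates: Dict[str, Tuple[Sequence[float], int]]) -> str:
    """Return the name of the candidate subset with the lowest GCV score."""
    return min(candidates, key=lambda name: gcv_score(*candidates[name]))


# Hypothetical candidates: (model errors, number of predictors).
candidates = {
    "one predictor": ([1.0, -1.0, 1.0, -1.0], 1),
    "three predictors": ([0.9, -0.9, 0.9, -0.9], 3),
}
print(best_subset(candidates))
```

In this toy case the larger subset has slightly smaller errors but a much higher score, which shows how the penalty term favors parsimony.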
To evaluate the performance of each submodel, a drop-one-year cross-validation hindcast is constructed, refitting the regression coefficients each year, to produce a deterministic seasonal streamflow prediction. When model residuals are normally distributed, according to the Shapiro-Wilk test with alpha = 0.05, an error distribution is created by taking 1000 random samples. Otherwise, an error distribution is derived by directly sampling the model residuals with replacement 1000 times. The resulting error distribution is then added to the cross-validated deterministic prediction to create a probabilistic streamflow prediction. This process is repeated for each year to create a probabilistic hindcast for all years in the submodel. Hindcasts from each submodel are subsequently joined to create a full observational period hindcast.
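The hindcast procedure can be sketched with a single-predictor regression standing in for the PCR submodel. Several details are assumptions made for illustration: the normality check is reduced to a caller-supplied flag (the Shapiro-Wilk test is not in the Python standard library), the normal error distribution is taken as zero-mean with the residual standard deviation, and the cross-validated residuals are used as the model residuals.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has zero variance")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def drop_one_year_hindcast(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Predict each year from a model refit without that year."""
    x, y = list(x), list(y)
    predictions = []
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(intercept + slope * x[i])
    return predictions


def probabilistic_hindcast(
    x: Sequence[float],
    y: Sequence[float],
    n_samples: int = 1000,
    residuals_normal: bool = False,
    seed: int = 0,
) -> List[List[float]]:
    """Add a sampled error distribution to each cross-validated prediction."""
    rng = random.Random(seed)
    predictions = drop_one_year_hindcast(x, y)
    residuals = [obs - pred for obs, pred in zip(y, predictions)]
    sigma = statistics.pstdev(residuals)
    ensembles = []
    for pred in predictions:
        if residuals_normal:
            # Assumed: zero-mean normal errors with the residual spread.
            errors = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
        else:
            # Resample the residuals directly, with replacement.
            errors = rng.choices(residuals, k=n_samples)
        ensembles.append([pred + err for err in errors])
    return ensembles
```

Each inner list is one year's probabilistic prediction, from which an exceedance probability for the high flow threshold can be read directly.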

GloFAS and multi-model predictions
Predictions from the physically based GloFAS model for the two study locations are available from ECMWF (https://www.globalfloods.eu/general-information/data-and-services/). GloFAS forecasts are issued on the first day of every month and consist of 25 ensemble members predicting mean weekly streamflow 17 weeks out from the issue date; only predictions for weeks 1-13 (approximately three months) are retained. A mean bias correction is applied to the GloFAS ensemble mean according to the difference between mean observed and predicted seasonal streamflow across all years. A quantile mapping approach, relating the cumulative distribution functions of observed and predicted streamflow, was also tested (Hashino et al., 2006); however, forecast skill did not substantially differ from the mean bias correction approach. In addition to evaluating the statistical model and GloFAS independently, a multi-model forecast is also constructed utilizing a least squares linear regression to assign weights according to the relative Pearson correlation strength between observed streamflow and each model's predictions (P. J. [...]).
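The two post-processing steps can be sketched as follows. The exact weighting scheme is not fully specified in the text, so the combination below is one plausible reading: an ordinary least squares fit of observed streamflow on the two forecasts without an intercept. The function names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def mean_bias_correct(forecast: Sequence[float], observed: Sequence[float]) -> List[float]:
    """Shift forecasts by the difference between forecast and observed means."""
    bias = sum(forecast) / len(forecast) - sum(observed) / len(observed)
    return [f - bias for f in forecast]


def least_squares_weights(
    f1: Sequence[float], f2: Sequence[float], observed: Sequence[float]
) -> Tuple[float, float]:
    """Solve the 2x2 normal equations for observed ~ w1*f1 + w2*f2."""
    a11 = sum(a * a for a in f1)
    a12 = sum(a * b for a, b in zip(f1, f2))
    a22 = sum(b * b for b in f2)
    b1 = sum(a * y for a, y in zip(f1, observed))
    b2 = sum(b * y for b, y in zip(f2, observed))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("forecasts are collinear; weights are not unique")
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def combine(f1: Sequence[float], f2: Sequence[float], weights: Tuple[float, float]) -> List[float]:
    """Weighted multi-model forecast."""
    w1, w2 = weights
    return [w1 * a + w2 * b for a, b in zip(f1, f2)]


# Hypothetical hindcasts from two models and matching observations.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [2.0, 1.0, 4.0, 3.0]
obs = [0.8 * a + 0.2 * b for a, b in zip(model_a, model_b)]
print(least_squares_weights(model_a, model_b, obs))
```

Normalizing the fitted weights so they sum to one yields relative shares comparable to percentage weights.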

Forecast verification and performance measures
Forecast performance for the three models (statistical, GloFAS, and multi-model) is evaluated at both locations by Pearson correlation coefficient, Rank Probability Skill Score (RPSS), Probability of Detection (POD), False Alarm Ratio (FAR) and Threat Score (TS).
RPSS is an extension of the rank probability score (RPS), which measures the categorical accuracy of a forecast (Wilks, 2011). Here, two categories are selected to represent high flow and non-high flow conditions, with the 80th percentile of observed seasonal streamflow representing the threshold. The RPS is the sum of the squared differences between the cumulative forecast and observed categorical probabilities, and is given as:

$$RPS = \sum_{m=1}^{J} \left[ \sum_{j=1}^{m} y_j - \sum_{j=1}^{m} o_j \right]^2 \tag{3}$$

where $J$ is the number of categories, $y_j$ is the forecast probability in the $j$th category, and $o_j$ is 1 if the event is observed in that category, otherwise 0. RPS scores range from 0 to 1. RPSS indicates the relative skill of the forecast compared to a reference forecast and takes the form:

$$RPSS = 1 - \frac{RPS_{forecast}}{RPS_{reference}} \tag{4}$$

RPSS can vary from -∞ to 1; values above 0 are considered skillful compared to the reference forecast, and a value equal to 1 indicates a perfect categorical forecast. Mean RPSS values across all hindcast years are presented; the reference forecast is based on historical averages (i.e. climatology).
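A two-category calculation of these scores can be sketched as below. The sketch assumes the cumulative form of the RPS given by Wilks (2011) and a climatological reference of 0.8 and 0.2 for the non-high flow and high flow categories; the function names and forecast values are hypothetical.

```python
from typing import Sequence


def rps(forecast_probs: Sequence[float], observed_category: int) -> float:
    """Sum of squared differences between cumulative forecast and observation."""
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    score = 0.0
    for j, probability in enumerate(forecast_probs):
        cumulative_forecast += probability
        if j == observed_category:
            cumulative_observed = 1.0  # stays 1 for all later categories
        score += (cumulative_forecast - cumulative_observed) ** 2
    return score


def mean_rpss(
    forecasts: Sequence[Sequence[float]],
    observed: Sequence[int],
    reference: Sequence[float],
) -> float:
    """Average of yearly RPSS values relative to a fixed reference forecast."""
    scores = [
        1.0 - rps(f, o) / rps(reference, o) for f, o in zip(forecasts, observed)
    ]
    return sum(scores) / len(scores)


# Categories: index 0 = non-high flow, index 1 = high flow.
climatology = (0.8, 0.2)
print(mean_rpss([(0.1, 0.9)], [1], climatology))
```

A forecast that simply repeats climatology scores an RPSS of 0 under this definition.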

POD, or "hit rate," describes the fraction of observed extreme (e.g. high flow) events that are correctly predicted and is calculated as:

$$POD = \frac{hits}{hits + misses}$$

where a perfect score is 1 (Wilks, 2011). Because POD can be artificially improved by issuing more extreme predictions, it must be evaluated in combination with FAR. FAR describes the fraction of predicted extreme events that did not occur, or "false alarms", calculated as:

$$FAR = \frac{false\ alarms}{hits + false\ alarms}$$

where a perfect score is 0 (Wilks, 2011).
TS, also called the "critical success index," is the number of correctly predicted extreme events divided by the total number of times that an extreme event is either predicted or observed, calculated as:

$$TS = \frac{hits}{hits + misses + false\ alarms}$$

where a perfect score is 1 (Wilks, 2011). TS is preferred over accuracy (the sum of true positives and true negatives divided by the total number of events) for situations where the extreme category is rarely observed. As previously stated, the extreme category is classified as seasonal streamflow values in the top 20% (80th percentile) of observations.
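These three categorical measures follow directly from counts of hits, misses, and false alarms. The sketch below uses the standard definitions from Wilks (2011); the function names and the ten-year sequences are illustrative and are not taken from the study's tables.

```python
import math
from typing import Sequence, Tuple


def contingency_counts(
    predicted: Sequence[bool], observed: Sequence[bool]
) -> Tuple[int, int, int]:
    """Count hits, misses, and false alarms over paired years."""
    hits = sum(1 for p, o in zip(predicted, observed) if p and o)
    misses = sum(1 for p, o in zip(predicted, observed) if not p and o)
    false_alarms = sum(1 for p, o in zip(predicted, observed) if p and not o)
    return hits, misses, false_alarms


def _ratio(numerator: int, denominator: int) -> float:
    """Return NaN when a score is undefined (zero denominator)."""
    return numerator / denominator if denominator else math.nan


def pod(hits: int, misses: int) -> float:
    return _ratio(hits, hits + misses)


def far(hits: int, false_alarms: int) -> float:
    return _ratio(false_alarms, hits + false_alarms)


def threat_score(hits: int, misses: int, false_alarms: int) -> float:
    return _ratio(hits, hits + misses + false_alarms)


# Illustrative ten-year record of forecast triggers and observed events.
predicted = [True, True, True, True, True, False, False, False, False, False]
observed = [True, True, True, True, False, True, True, True, False, False]
h, m, fa = contingency_counts(predicted, observed)
print(pod(h, m), far(h, fa), threat_score(h, m, fa))
```

Correct negatives do not enter any of the three scores, which is why TS remains informative when the extreme category is rare.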

Large-scale predictor regions
The locations of SST regions that correlate significantly with streamflow vary according to the phase of ENSO (Figure 3).

Piura streamflow in El Niño years is positively associated with equatorial Pacific SSTs, encompassing the Niño 1+2 and Niño 3 regions (Figure 3a). This finding aligns with previous work demonstrating that above-average precipitation in northwest Peru is driven primarily by ENSO (e.g., Lagos et al., 2008). Strong El Niño years (e.g. 1983, 1998) [...] Indian Oceans. While El Niño episodes have been linked to below-average precipitation in the Amazon basin (Garreaud et al., 2009; Marengo, 2004), significant teleconnections between equatorial Pacific SSTs and Marañón streamflow are not identified here (Figure 3b).

Final predictor selection
Of the potential predictors listed in Table 2, a subset is selected for each statistical forecast submodel based on correlation significance and model parsimony as described in Section 3 (Table 3). This results in the first PC of statistically significant pre-season SST regions being included in all submodels for both locations. Pre-season streamflow is included in both submodels for Marañón, in line with its greater temporal autocorrelation, while it is included in only the positive phase submodel for Piura. No pre-season precipitation observations are included for Marañón; for Piura the GCM precipitation forecast is included in the negative phase submodel and pre-season observed precipitation is included in the positive and neutral phase submodels.

Statistical model forecasts
The primary focus of this study is to predict occurrence of high flow conditions to initiate flood preparedness actions. The probabilistic statistical forecast model at each location effectively captures interannual variability and extremes (Figs. 4 [...] (Table 4). El Niño years are associated with lower forecast uncertainty for Marañón; the average standard deviation of error distributions is 42% smaller than in La Niña years. For Piura, La Niña conditions result in lower forecast uncertainty; the average standard deviation of error distributions is 73% larger for years in the neutral phase and 17% larger in El Niño years. Despite low streamflow in many years, the forecast model for Piura captured the approximate magnitude of the top three extremes in 1983, 1998 and 2017 (Figure 5). An analysis of flood reports from news media and global disaster databases including EM-DAT and the Dartmouth Flood Observatory indicates that flooding along the Piura River occurred in each of these years, though not necessarily at the station itself.

Multi-model forecasts
For the multi-model forecast, least squares weighting results in a significantly higher weight (83% and 72% for the Marañón and Piura, respectively) assigned to the statistical model, and therefore multi-model Pearson correlation and RPSS values are similar to the independent statistical forecast model (Table 5). The Marañón multi-model detects all four true positives in the upper category, two more than GloFAS and the same as the statistical model. The Piura multi-model detects four true positives, two fewer than the statistical model and one more than GloFAS. For both Piura and Marañón, the multi-model forecast improves POD, FAR and TS compared to GloFAS (Table 6).

Triggering early action
While verification metrics offer useful ways to evaluate forecast performance, a forecast's true value is determined by the end user (Hartmann et al., 2002). Because floods are the main hydro-meteorological threat in the Peruvian Amazon (IFRC, 2019) and Piura basins, correctly predicting the years with high seasonal streamflow is of outsized importance compared to predicting low-flow years. The Peruvian Red Cross early action protocol steps for flooding are triggered when a forecast predicts a 75% chance (probability) of streamflow above the 80th percentile (threshold). This criterion is applied to the three forecasts (statistical model, GloFAS, and multi-model) to understand when actions would be triggered based on each forecast at San Regis on the Marañón River and at Puente Sánchez Cerro on the Piura River.

Based on this criterion, four years in the historical record qualify for early action at San Regis (2009, 2012, 2013, 2015). Of these four, the statistical model predicts action in all four years and GloFAS in two (2009 and 2012) (Figure 6). While an [...] million (Caviedes, 1984; "Emergency Events Database (EM-DAT)," 1988; Peru - Floods Fact Sheet #1, Fiscal Year (FY) 1998; "Peru floods: Four killed as Piura bursts its banks," 2017; French & Mechler, 2017) (Figure 7). The statistical model included one false positive in 2000 with an 81.3% predicted probability of exceedance (observed streamflow was at the 74th percentile). Additional historical years (2001, 2002, 2008 and 2012) also meet the criteria for early action with evidence of flooding in the Piura province, collectively resulting in 60 deaths and affecting 508,000 people ("Emergency Events Database (EM-DAT)," 1988), although streamflow magnitudes were substantially lower. Of these, the statistical model captured one (2012) while GloFAS failed to capture any. A modified trigger mechanism enables capturing some of these lower-magnitude events without additional false positives; if early action is triggered based on just a 50% probability of exceeding the 80th percentile, the statistical model also triggers in 2001 and 2008 (thus capturing 6 of the 7 observed events). However, this study forgoes any systematic attempt to assess when early actions may or may not be warranted (e.g. determining an optimal threshold) in favor of illustrating that additional skill in detecting observed early action triggers is possible with the use of tailored statistical and multi-model forecasts. Further refinement of effective trigger levels also requires understanding regionally specific flood impacts and expected benefits of early action.

Varying the probability required to trigger action
The trigger mechanism for early action, which requires a 75% probability of streamflow above the 80th percentile, suggests a tolerance for a FAR of 0.25 for an unbiased forecast. Indeed, the tolerance for false positives when implementing early action is an open question for decision makers and may depend on numerous technical, institutional and political factors. In both locations, the probability of exceeding this threshold can be reduced significantly below 75% while remaining at or below an acceptable FAR, thereby enabling the forecasts to capture additional high-flow events. At Puente Sánchez Cerro, lowering the probability can lead to the capture of six out of seven events by the statistical and multi-model forecasts (improving from 4 and 3 events at the 75% probability, respectively) while still maintaining a low FAR (Figure 8b and 8d). At San Regis the statistical and multi-model approaches both detect all four triggers at 75% probability, but no additional false positives are introduced by either forecast until the required trigger probability is reduced to approximately 50% (Figure 8c). For GloFAS, the benefit of additional events captured is not realized until the required trigger probability is well below 50% (Figure 8a and 8b), at which point the FAR is above 0.25 (Figure 8c and 8d). False positives incurred by reducing the trigger probability may also be offset by a stopping mechanism in which action is halted if the forecast is not confirmed 30 days later (IFRC, 2019).
Threat Score (TS), a validation metric that describes the degree to which triggering of observed events corresponds to triggering of events based on forecasts, is one method to evaluate the benefits of additional true positives against the costs of additional false positives when true positives are relatively rare. TS is maximized at required trigger probabilities of 53% to 84% (36% to 61%) and 44% to 83% (27% to 44%), respectively, for the statistical and multi-model approaches for Marañón (Piura) (Figure 8e and 8f). By comparison, TS for GloFAS is nearly always lower and generally less variable, reaching its maximum of 0.57 (0.44) at required trigger probabilities of 25% to 28% (21% to 24%) for Marañón (Piura). Thus, a 75% required trigger probability tends to be a relatively strict level, as it often significantly surpasses the required trigger probability yielding the highest TS.
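The search for the trigger probability that maximizes TS can be sketched as a simple sweep over candidate levels. The yearly exceedance probabilities, observed events, and candidate levels below are hypothetical, and the function names are introduced only for illustration.

```python
from typing import Dict, List, Sequence


def threat_score_by_trigger(
    exceedance_probs: Sequence[float],
    observed_events: Sequence[bool],
    trigger_levels: Sequence[float],
) -> Dict[float, float]:
    """Threat Score obtained at each candidate required trigger probability."""
    scores = {}
    for level in trigger_levels:
        hits = misses = false_alarms = 0
        for prob, event in zip(exceedance_probs, observed_events):
            triggered = prob >= level
            if triggered and event:
                hits += 1
            elif event:
                misses += 1
            elif triggered:
                false_alarms += 1
        total = hits + misses + false_alarms
        scores[level] = hits / total if total else float("nan")
    return scores


def best_trigger_levels(scores: Dict[float, float]) -> List[float]:
    """All candidate levels that share the highest defined Threat Score."""
    valid = {level: s for level, s in scores.items() if s == s}  # drop NaN
    best = max(valid.values())
    return [level for level, s in valid.items() if s == best]


# Hypothetical yearly exceedance probabilities and observed high flow events.
probs = [0.9, 0.6, 0.3, 0.55, 0.1]
events = [True, True, False, False, False]
scores = threat_score_by_trigger(probs, events, [0.75, 0.6, 0.5])
print(scores, best_trigger_levels(scores))
```

In this toy record the 75% level misses one event and the 50% level adds a false alarm, so the intermediate level scores highest.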

Implications of binary trigger mechanism
The binary nature of the trigger mechanism is vulnerable to situations where similar observed conditions result in early action in one instance but not in another. Marañón River streamflow, which averages 24,600 m³/s during the MAM season, [...] percentile and so did not count as an observed trigger (the stated mechanism requires that streamflow exceed the 80th percentile). From an operational standpoint, such edge cases raise the question: should some amount of early action still occur? Absent a direct physical basis underpinning the streamflow magnitude required to trigger early action (e.g. setting a threshold based on when a levee begins to overtop), two events of similar magnitude, one slightly above and one below the threshold, are likely to produce similar impacts, with early actions likely to yield similar benefits. Moreover, early action in response to two such events may suggest that the action taken "in vain" yields fewer or no benefits compared to actions initiated in response to a true positive. For example, when GloFAS triggers early action for the Marañón River in 2017 (Figure 6), this is considered a false positive despite observed conditions falling less than 1% below the threshold, illustrating a potential weakness of both the trigger mechanism and categorical evaluation of forecasts in general. This reinforces the need to also evaluate forecasts with complementary performance measures paired with local contextual knowledge. A modified trigger approach could incorporate multiple tiers of early actions triggered by increasing levels of forecast confidence. Likewise, if forecast confidence later decreases, a tiered stopping mechanism could halt actions in reverse order.
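A tiered trigger with a reverse-order stopping mechanism could take a form like the following sketch. The tier probabilities and the assignment of actions to tiers are hypothetical and are not proposed in the EAP; they only illustrate the bookkeeping involved.

```python
from typing import List, Sequence, Tuple

Tier = Tuple[float, str]

# Hypothetical tiers: (required exceedance probability, action label).
TIERS: List[Tier] = [(0.50, "pre-position supplies"), (0.75, "release funds")]


def active_tiers(probability: float, tiers: Sequence[Tier]) -> List[str]:
    """Actions whose required probability is met, lowest tier first."""
    return [name for threshold, name in sorted(tiers) if probability >= threshold]


def tier_changes(
    previous_probability: float, current_probability: float, tiers: Sequence[Tier]
) -> Tuple[List[str], List[str]]:
    """Actions to start (ascending order) and to stop (reverse order)."""
    before = active_tiers(previous_probability, tiers)
    after = active_tiers(current_probability, tiers)
    to_start = [name for name in after if name not in before]
    to_stop = [name for name in reversed(before) if name not in after]
    return to_start, to_stop


# Forecast confidence rises, then partly falls at the next forecast update.
print(tier_changes(0.30, 0.80, TIERS))
print(tier_changes(0.80, 0.60, TIERS))
```

Under such a scheme, a year falling just short of the strictest level would still prompt the lower-tier action rather than none at all.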

Conclusion
This paper describes a method by which locally tailored season-ahead statistical forecasts can improve the detection of trigger-based early actions and is illustrated with a case study for two sites in Peru. The statistical forecast developed in this study, as well as a multi-model ensemble forecast composed of the statistical and an operational physically based model, consistently outperforms the aforementioned physically based model for both study locations. Detection of additional high-flow events is possible by lowering the forecast probability required to trigger actions while maintaining a low false alarm ratio.
While higher seasonal average streamflow values typically imply a greater probability of both flooding and the need for early action, lower seasonal average streamflow values may obscure high daily peaks that nonetheless result in flood impacts. Thus, even a perfect seasonal forecast may not reflect all instances where early action is justified. Additionally, because the statistical model developed here is optimized for performance across all years, further refinement prioritizing the detection of appropriate trigger levels for early action in high flow years may be warranted. Such efforts could involve alternative modeling frameworks (e.g. logistic regression), additional predictors, and evaluation of category selection applied in the prediction process.
Code availability. Code used in this study is available upon request.

Data availability. Streamflow data used in this study are from SENAMHI. While the dataset is not public, it may be made available upon request. PISCO precipitation data are available at piscoprec.github.io. Climate data obtained from NOAA are available at noaa.gov.
Author contributions. PB was responsible for conceptualization. CK developed and evaluated the prediction model with input from PB and DL. JB facilitated access to project resources (including datasets and documents) and provided contextual information. CK prepared the manuscript with editing contributions from all authors. PB and DL were responsible for project administration and PB was responsible for funding acquisition.
Competing interests. The authors declare that they have no conflict of interest.