Machine learning analysis of lifeguard flag decisions and recorded rescues

Rip currents and other surf hazards are an emerging public health issue globally. Lifeguards, warning flags, and signs are important and, to varying degrees, effective strategies to minimize risk to beach users. In the United States and other jurisdictions around the world, lifeguards use coloured flags (green, yellow, and red) to indicate whether the danger posed by the surf and rip hazard is low, moderate, or high, respectively. The choice of flag depends on the lifeguard(s) monitoring the changing surf conditions along the beach and over the course of the day using both regional surf forecasts and careful observation. The chosen flag may not be consistent with the beach user's perception of the risk, which may increase the potential for rescues or drownings. In this study, machine learning is used to determine the potential for error in the flags used at Pensacola Beach and the impact of that error on the number of rescues. Results of a decision tree analysis indicate that the flag colour chosen by the lifeguards was different from what the model predicted for 35 % of days between 2004 and 2008 (n = 396/1125). Rescues occurred on only 17 % of the days when the predicted and posted flag colours differed, but those days are associated with ∼ 60 % of all rescues between 2004 and 2008. Further analysis reveals that the largest number of rescue days and total number of rescues are associated with days where the flag deployed overestimated the surf and hazard risk, such as a red or yellow flag flying when the model predicted a green flag would be more appropriate based on the wind and wave forcing alone. While it is possible that the lifeguards were overly cautious, it is argued that they most likely identified a rip forced by a transverse-bar and rip morphology common at the study site.
Regardless, the results suggest that beach users may be discounting lifeguard warnings if the flag colour is not consistent with how they perceive the surf hazard or the regional forecast. Results suggest that machine learning techniques have the potential to support lifeguards and thereby reduce the number of rescues and drownings.


Introduction
Rip currents are the main hazard to recreational swimmers and bathers and, in recent years, have been recognized as a serious global public health issue (Brighton et al., 2013; Woodward et al., 2013; Kumar and Prasad, 2014; Arozarena et al., 2015; Brewster et al., 2019; Vlodarchyk et al., 2019). Rips are strong, seaward-directed currents that can develop on beaches characterized by wave breaking within the surf zone (Castelle et al., 2016) and are capable of transporting swimmers a significant distance away from the shoreline into deeper waters. Weak swimmers or those who try to fight the current can become stressed and experience panic (Brander et al., 2011; Drozdzewski et al., 2012), leading to increased adrenaline, an elevated heart rate and blood pressure, and rapid and shallow breathing. On recreational beaches in Australia and the US, rips have been identified as the main cause of drownings and are believed to be responsible for nearly 80 % of all rescues (Brighton et al., 2013; Brewster et al., 2019). It is estimated that the annual number of rip current drownings exceeds the number of fatalities caused by hurricanes, forest fires, and floods in Australia (Brander et al., 2013), while rip-related drownings on a relatively small number of beaches in Costa Rica account for a disproportionately large number of violent deaths in the country (Arozarena et al., 2015). However, recent evidence suggests that public knowledge of this hazard is limited (Brander et al., 2011; Williamson et al., 2012; Brannstrom et al., 2014, 2015; Gallop et al., 2016; Fallon et al., 2018; Ménard et al., 2018; Silva-Cavalcanti et al., 2018; Trimble and Houser, 2017) and that few people are interested in rip currents compared to other hazards (Houser et al., 2019).
Published by Copernicus Publications on behalf of the European Geosciences Union. C. Houser et al.: Machine learning analysis of lifeguard flag decisions and recorded rescues
Many beaches have warning signs at primary access points to warn beach users of the rip hazard, but recent studies suggest that signs may not be effective (e.g. Matthews et al., 2014; Brannstrom et al., 2015). Many beaches also use a combination of beach flags to designate either the location of supervised and safe swimming areas (e.g. Australia and the UK) or areas and times to avoid entering the water (e.g. Costa Rica and the US). Unfortunately, not every country uses the same flagging convention, and there are regional variations that can lead to confusion amongst beach users. The United States and Canada use green, yellow, and red coloured flags to indicate whether the danger posed by the surf and rip hazard is low, moderate, or high, respectively (Houser et al., 2017). A beach manager or lifeguard decides on the surf hazard and the flag colour to fly based on a combination of daily updates on rip conditions provided by local lifeguards as well as a rip forecast from the US National Weather Service (NWS). Most rip forecasts are based on a simple correlation between the number of rip-related rescues and meteorological and oceanographic conditions on that day (Lushine, 1991a, b; Lascody, 1998; Engle et al., 2012; Dusek and Seim, 2013; Kumar and Prasad, 2014; Scott et al., 2014; Moulton et al., 2017). These forecasts do not account for the surf zone morphology, which may be conducive to the development of rips on days when wave breaking is relatively weak. Even on green flag days, the presence of shore-attached nearshore bars (called a transverse-bar and rip morphology; Wright and Short, 1984) can force a current of ∼ 0.5 m s−1 that can pose a threat to weak swimmers (Houser et al., 2013).
Rip currents can still be present even if a regional forecast predicts that the hazard potential is low based on wind and wave conditions. Beach users can be at risk if the flag colour is based solely on the regional forecast. To be effective, the flag system requires lifeguards to continuously assess surf conditions and monitor swimmers and bathers, and ultimately intervene if someone does not heed the warning implied by a yellow or red flag indicating moderate and high ("do not enter the water") hazard levels, respectively. Recent evidence suggests that many beach users do not adhere to warnings if their own experience (whether accurate or not) or the behaviour of others on the beach contradicts the hazard indicated by the warning flag (Houser et al., 2017; Ménard et al., 2018). Beachgoers may lose trust in authority (i.e. the lifeguards) if a forecast is perceived, wrongly or rightly, to be inaccurate (Espluga et al., 2009). If the forecast is for dangerous surf conditions and a yellow or red flag is placed on the beach when conditions appear to the beach user to be relatively calm, the beach user may discount or ignore the forecast now and in the future if they enter the water and do not experience any difficulties. Trust and confidence in the authority figures can be eroded if beach users believe that the lifeguards are being overly cautious. It can be difficult to change (or reset) public perception about the accuracy of the flag system once a discrepancy is perceived, and subsequent visits and experiences may confirm the biases of the beach user (Ménard et al., 2018). It is a situation analogous to the boy who cries wolf (Wachinger et al., 2013).
This study examines the consistency of flag warnings at Pensacola Beach, Florida, between 2004 and 2008, when daily data are available for flag colour, wind, and wave forcing, as well as the daily number of rescues performed by lifeguards. A decision tree, a form of machine learning, is used to predict the posted flag colour using lifeguard observations in combination with wind and wave forcing. The modelled flag colour, based solely on wave and wind forcing, can be compared to the flag colour posted by the lifeguards on a particular day to identify days when there is a difference and how that influences the number of rescues performed on that day. It is hypothesized that there will be a greater number of rescues performed on days when there is a difference between the predicted and posted flag colour. Specifically, it is hypothesized that a greater number of rescues will occur on days when the model underestimated the hazard level compared to the lifeguard, who made their decision based on local observations including the presence of semi-permanent rip channels. In this scenario, the public may believe that the lifeguard is being overly cautious, leading to people entering the water.

Study site
The analysis was completed at Pensacola Beach, Florida (Fig. 1), where records of daily flag colours, wind and wave forcing, and lifeguard-performed rescues between 2004 and 2008 are available. The beaches of the Florida Panhandle have been described as "the worst in the nation for beach drowning" (Tuscaloosa News, 2002), based on the presence of semi-permanent rips along the length of the island (Houser et al., 2011; Barrett and Houser, 2012). These rips can be active and pose a threat to swimmers when conditions may appear to be safe for swimming (Houser et al., 2013). During the period of the study (2004–2008), the Santa Rosa Island Authority maintained a flagging system to alert beach users about the heavy surf and rip hazard based on the NWS rip forecast. The highest flag colour for that day was recorded by the Santa Rosa Island Authority, along with the number of prevents, assists, and rescues. The Santa Rosa Island Authority reserves the rescue definition for those persons in extreme difficulty who, in the opinion of the lifeguard, would have drowned without assistance. Rescues, assists, and prevents are recorded regardless of whether they are conducted in a "guarded" area, a designated swimming area where there are typically many beach users (Casino Beach, Fort Pickens Gate Beach, and Park East), or along the ∼ 13 km of unguarded beach where lifeguards conduct regular patrols and respond to emergency calls. As shown by Barrett and Houser (2013), there are rip current hotspots with semi-permanent alongshore variation in the nearshore morphology due to a ridge and swale bathymetry on the inner shelf (Fig. 2). The innermost bar varies alongshore at a scale of ∼ 1000 m, consistent with the ridge and swale bathymetry (Houser et al., 2008), and tends to exhibit a transverse-bar and rip morphology immediately landward of the deeper swales (Barrett and Houser, 2012; see Fig. 1). Historically, most drownings and rescues on this popular beach have occurred at these rip hotspots because they correspond to the main access points along the island (Houser et al., 2015b; Trimble and Houser, 2017).
Santa Rosa Island experienced widespread erosion and washover during Hurricane Ivan in September 2004. The storm reinforced the alongshore variation in the nearshore bar morphology and forced the bars farther offshore. As described in Houser et al. (2015a), the nearshore bars migrated landward and recovered to the beachface for 3 years following the storm. During this period, the inner-bar morphology transitioned from a rhythmic bar and beach morphology to a transverse-bar and rip morphology before ultimately attaching to the beachface in May 2008 (Houser and Barrett, 2010). This changing bar morphology is a primary control on the presence of rip channels, with the greatest density of rips present in 2005 as the innermost bar first started to develop a transverse-bar and rip morphology (Houser et al., 2011).

Methodology
Offshore wave conditions and wind forcing are based on long-term meteorological and oceanographic records from an offshore wave buoy located ∼ 100 km southeast of the study area (buoy 42039; Fig. 1). Between 2004 and 2008, this was the closest buoy to Pensacola Beach and had been previously used to estimate the incident wave field (Wang and Horwitz, 2007; Claudino-Sales et al., 2008, 2010; Houser et al., 2011), and it was the basis for the rip hazard forecast at Pensacola Beach until a new buoy was placed closer to the beach in 2009. The available wave data from buoy 42039 included offshore significant wave height, significant wave period, and direction, and the wind data included speed and direction. Local water level data were acquired from a station at the Port of Pensacola just north of the study site. A decision tree analysis was used to determine what combination of wave and wind forcing was associated with the flag posted by the Santa Rosa Island Authority on that day. After training on the available dataset, the model produces a decision tree that can be used for future decisions about what flag colour should be posted, although further training would be required to validate and operationalize the model. The modelled (i.e. predicted) flag colour is then compared to the posted flag colour for all days to determine if there is a relationship between the flag colour and the number of rescues. The comparison is also used to determine if there is a specific combination of wind and wave forcing on the days when the modelled flag colour and the posted flag colour do not align.
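The comparison of modelled and posted flags described above can be sketched in a few lines of Python. This is only an illustration: the function name, the category labels, and the four daily records are hypothetical and are not taken from the study data. It assumes the flags are ranked green < yellow < red, as in the flag convention described in the Introduction.

```python
# Hazard rank of each flag colour (green < yellow < red).
FLAG_RANK = {"green": 0, "yellow": 1, "red": 2}

def compare_flags(posted: str, predicted: str) -> str:
    """Classify one day by how the posted flag relates to the modelled flag."""
    diff = FLAG_RANK[posted] - FLAG_RANK[predicted]
    if diff > 0:
        return "posted_more_cautious"
    if diff < 0:
        return "posted_less_cautious"
    return "match"

# Hypothetical daily records: (posted flag, predicted flag, rescues that day).
days = [
    ("yellow", "green", 6),
    ("green", "green", 0),
    ("yellow", "red", 2),
    ("red", "red", 1),
]

# Tally (number of days, number of rescues) for each comparison category.
summary = {}
for posted, predicted, rescues in days:
    key = compare_flags(posted, predicted)
    n_days, n_rescues = summary.get(key, (0, 0))
    summary[key] = (n_days + 1, n_rescues + rescues)

print(summary)
```

Keeping the direction of the difference, rather than a simple match or mismatch, allows days with a more cautious posted flag to be separated from days with a less cautious one.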
A decision tree model was developed using the Chi-square Automatic Interaction Detector (CHAID) technique developed by Kass (1980). The goal of the CHAID analysis is to build a model that helps explain how independent variables (wind speed, wave height, wave period, wave direction, wind direction, and water level) can be merged to explain the results in a given dependent variable. To develop a decision tree, the first step is declaring the root node; this corresponds to the target variable that will be predicted throughout the model. Then, the independent variable that provides the most information about the target values is identified. The root node is then split on this independent variable into statistically significant different subgroups using the F test. These subgroups are then split using the predictor variables that provide the most information about them. CHAID analysis continues this process until terminal nodes are reached and no splits are statistically significant. Previous use of CHAID analysis in hazard studies includes landslide prediction (e.g. Althuwaynee et al., 2014), farmer perception of flooding hazard (Bielders et al., 2003; Tehrany et al., 2015), and property owner perception and decision making along an eroding coast (Smith et al., 2017).
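The split-selection step can be illustrated with a minimal sketch. It assumes the flag colour is treated as a categorical target and ranks candidate predictors with a Pearson chi-square statistic; this is an illustrative substitute for the F test named above, and it omits the category merging and significance adjustments of a full CHAID implementation. The six days of binned data and all variable names are hypothetical.

```python
from collections import Counter

def chi_square(pairs):
    """Pearson chi-square statistic for a list of (category, flag) pairs."""
    n = len(pairs)
    row_totals = Counter(cat for cat, _ in pairs)
    col_totals = Counter(flag for _, flag in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for cat, row_n in row_totals.items():
        for flag, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(cat, flag)] - expected) ** 2 / expected
    return stat

# Hypothetical days with predictors already binned into "low"/"high"
# (illustrative values only, not the study data).
days = [
    {"wave_height": "low", "wind_speed": "low", "flag": "green"},
    {"wave_height": "low", "wind_speed": "high", "flag": "green"},
    {"wave_height": "low", "wind_speed": "low", "flag": "green"},
    {"wave_height": "high", "wind_speed": "high", "flag": "yellow"},
    {"wave_height": "high", "wind_speed": "low", "flag": "yellow"},
    {"wave_height": "high", "wind_speed": "high", "flag": "red"},
]

# Score each candidate predictor by its association with the flag colour.
scores = {
    name: chi_square([(d[name], d["flag"]) for d in days])
    for name in ("wave_height", "wind_speed")
}

# The node is split on the predictor with the strongest association.
best = max(scores, key=scores.get)
print(best, scores)
```

Both predictors are binned into the same number of categories here so that the raw statistics are comparable; with unequal numbers of categories the comparison would need to be made on p values instead.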

Results
The decision tree model was trained on the 1125 d with complete data between 2004 and 2008. Over this same period there were 145 d with rescues. The annual number of rescues and rescue days (i.e. days with one or more rescues) varied by year, with a peak in both the total number of rescues and the number of rescue days in 2005. The number of rescues was at a minimum in 2007, while the number of rescue days was at a minimum in 2006 (Fig. 3). The number of rescues decreased linearly between 2005 and 2007 as the nearshore bar morphology continued to recover following Hurricane Ivan and welded to the beachface, consistent with previous observations at the site (Houser et al., 2011). It is important to note that the CHAID analysis does not incorporate nearshore morphology as an independent variable because changes in nearshore morphology were not tracked daily over the study period. In this respect, differences between the posted and predicted flag colour may reflect lifeguard observations of nearshore morphology conducive to the development of rip currents despite winds and waves typical of green flag conditions.
The decision tree analysis suggests that the posted flag colour was not predicted by the model on 35 % of days between 2004 and 2008 (n = 396). There was a total of 342 rescues over 66 d when the model predicted a different flag than was posted, representing over 60 % of all rescues (Table 1). By comparison, 40 % of all rescues (n = 224) occurred over 79 d when the predicted and posted flags were the same. χ² analysis suggests that the number of rescue days is significantly greater at the 95 % confidence level when the predicted and posted flags are different (χ² = 7.77, p ∼ 0.005). This supports the hypothesis that there are a greater number of rescues performed on days when there is a discrepancy between the predicted and posted flag colour.
χ² analysis was also used to determine if the number of rescue days depends on whether the model predicts a flag of greater or lesser hazard compared to the posted flag (Table 2). Results suggest that the number of rescue days depends on the direction of the difference between the predicted and posted flags (χ² = 18.11, p ∼ 0.0001). The number of rescue days was over-represented when the posted flag colour was red or yellow, but the model predicted that the flag should have been yellow or green, respectively, suggesting that posting what a beach user may perceive as an overly cautious flag can present a danger. These 47 d were associated with 268 of the total 566 rescues between 2004 and 2008, or ∼ 5.7 rescues per rescue day (268/47) when the island authority posted a more cautious flag than was predicted by the model. In comparison, the number of rescues (n = 298) was under-represented on days when the posted flag suggested conditions were not as hazardous as the model predicted (n = 74) or was identical to the model prediction (n = 224).
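Both χ² values can be checked from the day counts reported in the text. The sketch below is a worked check rather than the authors' procedure: the 2 × 2 and 3 × 2 contingency layouts are assumptions about how Tables 1 and 2 were arranged, the day totals for each group are summed from the counts given for each posted and predicted flag combination, and the 19 rescue days with a less cautious posted flag are derived as 66 − 47.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Columns: rescue days, days without a rescue.
# Assumed layout: flags differ (396 d) versus flags agree (1125 - 396 = 729 d).
by_agreement = [
    [66, 396 - 66],
    [79, 729 - 79],
]

# Assumed layout: direction of the difference between posted and predicted flag.
by_direction = [
    [47, 218 - 47],  # posted more cautious: 168 + 20 + 30 = 218 d
    [19, 178 - 19],  # posted less cautious: 83 + 15 + 80 = 178 d; 66 - 47 = 19 rescue days
    [79, 729 - 79],  # posted and predicted flags agree
]

chi2_agreement = chi_square(by_agreement)
chi2_direction = chi_square(by_direction)

# Upper-tail probabilities: closed forms for 1 and 2 degrees of freedom.
p_agreement = math.erfc(math.sqrt(chi2_agreement / 2))
p_direction = math.exp(-chi2_direction / 2)

print(round(chi2_agreement, 2), round(chi2_direction, 2))  # 7.77 18.11
```

Under these assumptions the statistics agree with the reported values of 7.77 and 18.11, which suggests that both tests compare rescue days with non-rescue days across the flag-difference groups.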
The greatest number of rescues was performed on days when the posted flag was yellow (moderate hazard, moderate surf and/or currents), but the model predicted a green flag (low hazard, relatively calm surf and/or currents) based on the wind and wave forcing. Specifically, a total of 231 rescues were performed on 37 of the 168 d when the posted flag was yellow, and the model predicted that the flag colour should be green. In comparison, there were only 12 rescues on 3 of 20 d when the posted flag was red (high hazard, strong surf and/or currents) and the model-predicted flag colour was green. Finally, there were 25 rescues performed on 7 of 30 d when a red flag was posted, and the model predicted a yellow flag was appropriate. The number of rescues and rescue days when the posted flag was more cautious than predicted by the model were at a maximum in 2005 and linearly decreased to a minimum in 2007 as the bar morphology recovered from Hurricane Ivan. While there were fewer-than-expected rescue days when the posted flag was green or yellow and the model predicted a yellow or red flag, rescues were still performed on those days. There was a total of 66 rescues on 13 of 80 d when the posted flag was yellow, but the model predicted a red flag should be posted (Table 3). Only seven rescues were performed on 5 of the 83 d when the posted flag was green and the model predicted a yellow flag, with even fewer rescues performed on days when the posted flag was green but should have been red. The number of rescues and rescue days when the posted flag was lower than the predicted flag decreased from 2004 to 2007, with a statistically significant outlier in 2008. The large number of rescues in 2008 is the result of 2 d with 13 rescues each (19 April and 14 September), when a yellow flag was being flown, but the model predicted a red flag was more appropriate. This suggests that the difference between posted and predicted flag colours can vary inter-annually with changes in the nearshore morphology and/or changes in the individual who makes the flag decision.

Table 3 values (rows: posted flag; columns: predicted flag; number of days, with number of rescues in brackets):
Posted   Predicted G   Predicted Y   Predicted R
G        … (48)        83 (7)        15 (1)
Y        168 (231)     154 (125)     80 (66)
R        20 (12)       30 (25)       100 (51)
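The headline percentages follow directly from the counts for each combination of posted and predicted flag (Table 3). The short tally below is a sketch for checking that arithmetic; the variable names are illustrative, and only the combinations in which the two flags differ are needed.

```python
FLAGS = ("green", "yellow", "red")
RANK = {flag: i for i, flag in enumerate(FLAGS)}

# (posted, predicted) -> (days, rescues), for days when the two flags differ.
mismatch = {
    ("green", "yellow"): (83, 7),
    ("green", "red"): (15, 1),
    ("yellow", "green"): (168, 231),
    ("yellow", "red"): (80, 66),
    ("red", "green"): (20, 12),
    ("red", "yellow"): (30, 25),
}
TOTAL_DAYS = 1125     # days with complete data, 2004 to 2008
TOTAL_RESCUES = 566   # all rescues over the same period

mismatch_days = sum(days for days, _ in mismatch.values())
mismatch_rescues = sum(rescues for _, rescues in mismatch.values())

# Rescues on days when the posted flag was more cautious than the modelled flag.
cautious_rescues = sum(
    rescues
    for (posted, predicted), (_, rescues) in mismatch.items()
    if RANK[posted] > RANK[predicted]
)

print(mismatch_days, round(100 * mismatch_days / TOTAL_DAYS))           # 396 35
print(mismatch_rescues, round(100 * mismatch_rescues / TOTAL_RESCUES))  # 342 60
print(cautious_rescues)                                                 # 268
```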

Discussion
Results of the present study suggest that over 60 % of all rescues at Pensacola Beach, Florida, between 2004 and 2008 occurred on days when the posted hazard flag was different from the flag colour predicted by a decision tree model.
The posted flag colour was not predicted by the model on 35 % of days between 2004 and 2008 (n = 396), with one or more rescues occurring on 66 of those days (∼ 17 %). While rescues did not occur on a vast majority of the days when the posted and predicted flag colours were different, days when the predicted and posted flag colours were different accounted for a majority of the rescues. This is not to suggest that the Santa Rosa Island Authority made a mistake in their flag choice. Rather, the results suggest that the difference between the posted and predicted flag colour could be associated with the lifeguards noting that the nearshore had a transverse-bar and rip morphology, which is common at this location. The morphology of the nearshore and other variables that could influence whether a beach user will enter the water or not (e.g. weather, number of beach users, or presence of seaweed) are not captured by the current model, which is based on wind and wave forcing alone. The model developed in this study is similar to rip forecasts produced by the US National Weather Service (NWS) and does not include local variables known to the beach manager based on experience and years of careful observation. Discrepancies between the predicted and posted flag colours provide a basis for future model development and expansion. Incorporating more data into the model will cause it to evolve and better capture the variables that influence the colour of flag chosen by the lifeguards, while ensuring that the model remains computationally efficient. Introducing additional variables, such as nearshore morphology, to the model has the potential to better capture a lifeguard or beach manager's understanding of what constitutes dangerous surf conditions at their beach. At the same time, it is also important to examine the accuracy of beach managers and lifeguards in assessing the nearshore morphology and potential for rip development.
The model predictions and most forecasts are based solely on wind and wave forcing (Lushine, 1991a, b; Lascody, 1998; Engle et al., 2012; Dusek and Seim, 2013; Kumar and Prasad, 2014; Scott et al., 2014; Moulton et al., 2017). Noticeably absent from the current model is surf zone morphology, which ultimately determines whether a rip can develop under those conditions or not. The beach manager and lifeguard can observe the nearshore morphology and assess the potential for rip development, which would lead to them putting out a yellow or red flag when the model would predict a green or yellow flag as being appropriate. While beach managers and lifeguards are being prudent, their assessment may not conform to those of the beach user who decides on whether the water is safe or not based on wave-breaking conditions (Caldwell et al., 2013; Brannstrom et al., 2014, 2015). Most beach users assume that larger breaking waves are more dangerous, and many will not enter the water if they (and the model) believe that it is a red flag condition. This may partially explain why there were fewer-than-expected rescues on days when the posted flag colour was green or yellow and the model predicted a yellow or red flag, respectively. Independent of the flag or warning signs, beach users appear to be making personal decisions about the surf and rip hazard (Brannstrom et al., 2015) based on experience at the site or elsewhere (see Ménard et al., 2018). Whether this causes beach users to lose confidence in the lifeguards and other authorities managing the beach is an important question for future research.
A large number of rescues occurred when the posted flag was yellow, but the model predicted the wind and wave forcing warranted a green flag. Rightly or wrongly, the beach user will observe that wave breaking is limited and assume that conditions must be safe. As shown by Caldwell et al. (2013) and Brannstrom et al. (2014), most beach users along the Gulf Coast of the US assume that the calm flat water of a rip is safer than adjacent areas where the waves are breaking. The lifeguard, however, may observe a bar morphology that is conducive to the development of rips and post a yellow flag to warn about the potential for rips, despite the weak wind and wave forcing. As observed by Barrett and Houser (2012), rips with speeds of ∼ 0.5 m s−1 can develop on green flag days because of the transverse-bar and rip morphology that is present in the inner nearshore. This would suggest that posting a green flag should never be permitted when wind and swell waves are breaking over the bar, even if the regional forecast suggests a low-level hazard that day. As shown by Scott et al. (2014), rescues are still possible with seemingly fine-weather conditions when a green flag would be predicted by the model or in regional forecasts. Even in the presence of a small swell wave, breaking can be induced as water levels fall with the tide (Castelle et al., 2016).
It is difficult for beach users to spot a rip or assess the potential for rip development, and they may assume that the lifeguard is being overly cautious if they perceive fine-weather conditions and the lifeguard posts a yellow or red flag. Going to the beach is a reward-based activity, and many people commit significant personal and financial investment to be at the beach (Ménard et al., 2018). If they believe that the lifeguard is wrong, they will ignore the warning and remain committed to entering the water. The longer and more often their perceptions are inconsistent with the experience and knowledge of the lifeguard, the more trust in authority is lost: a beach that is perceived to be safe based on experience will always be considered safe despite warnings to the contrary (Ménard et al., 2018). This is an example of confirmation bias, in which an opinion quickly becomes entrenched and subsequent evidence is either used to bolster the belief or rapidly discarded. How this can be addressed to reduce the number of rescues is an important focus for future research on rips and other hazards in general.
The results of this study also highlight the limitations of regional rip forecasts that are used in the US and elsewhere around the world. A forecast based solely on the wind and wave forcing does not account for the nearshore morphology, which determines the potential for rip development. This raises one of the most important considerations for future modelling efforts based on machine learning techniques: the model will only be accurate if the bar morphology and the conceptual knowledge of the lifeguard are included as input variables. Getting the beach user to observe and heed that forecast and warning, however, will remain a challenge.

Conclusions
Lifeguards and beach managers decide on warnings and flag colours based on careful monitoring of the changing surf conditions along the beach and over the course of the day using both regional surf forecasts and direct observation. A decision tree analysis predicts a flag colour different from the one flown on ∼ 35 % of days between 2004 and 2008 (n = 396/1125). Rescues occurred on only ∼ 17 % of those days (n = 66), yet those days account for ∼ 60 % of the total number of rescues. The posting of a yellow flag when the model would predict a green flag based solely on the wind and wave forcing was found to be associated with the largest number of rescues over the study period. Variables such as the nearshore morphology and the potential for rip development are not included in traditional forecasts or the model developed in this paper, and most beach users use a simple assessment of wave breaking to determine if the water is safe. Even though a lifeguard will post the appropriate flag based on direct observation of the bar morphology and experience, the beach user, like simple models based solely on meteorological data, may not believe that warning and still enter the water. This suggests that reducing the number of rip and surf rescues will require addressing confirmation bias on the part of beach users, which can cause them to lose confidence in the lifeguards.

Figure 1. Map of study site showing location of the flagged section of beach and approximate location of the wave buoy used in the analysis and for regional rip forecasts (ESRI, 2019).

Figure 2. Satellite image of the flagged section of beach in April 2004 (before Hurricane Ivan) showing the presence of transverse-bar and rip morphology of the innermost bar and the variable nature of the outermost bar. The aerial image is not necessarily representative of the nearshore morphology throughout the remainder of the study (© Google Earth 2019).

Figure 3. Interannual variation in number of rescues and rescue days at Pensacola Beach between 2004 and 2008.

Table 3. Number of days and rescues (in brackets) based on the combination of posted and predicted flag colours.

Table 1. Results of χ² analysis of posted and predicted flag colour versus rescue and no rescue days at Pensacola Beach, Florida, between 2004 and 2008.

Table 2. Results of χ² analysis of posted and predicted flag colour versus rescue and no rescue days at Pensacola Beach, Florida, between 2004 and 2008.