Review article: Detection of informative tweets in crisis events

Messages on social media can be an important source of information during crisis situations, be they short-term disasters or longer-term events like COVID-19. They can frequently provide details about developments much faster than traditional sources (e.g. official news) and can offer personal perspectives on events, such as opinions or specific needs. In the future, these messages can also serve to assess disaster risks. One challenge for utilizing social media in crisis situations is the reliable detection of informative messages in a flood of data. Researchers have started to look into this problem in recent years, beginning with crowd-sourced methods. Lately, approaches have shifted towards an automatic analysis of messages. In this review article, we present methods for the automatic detection of crisis-related messages (tweets) on Twitter. We start by showing the varying definitions of importance and relevance relating to disasters, as they can serve very different purposes. This is followed by an overview of existing, crisis-related social media data sets for evaluation and training purposes. We then compare approaches for solving the detection problem based (1) on filtering by characteristics like keywords and location, (2) on crowdsourcing, and (3) on machine learning techniques with regard to their focus, their data requirements, their technical prerequisites, their efficiency and accuracy, and their time scales. These factors determine the suitability of the approaches for different expectations, but also their limitations. We identify which aspects each of them can contribute to the detection of informative tweets, and which areas can be improved upon in the future. We point out particular challenges, such as the linguistic issues concerning this kind of data. Finally, we suggest future avenues of research, and show connections to related tasks, such as the subsequent semantic classification of tweets.

data, as most other social media sources do not offer outside researchers a way to obtain large amounts of their data, or are not commonly used in a way that facilitates gaining information quickly during a disaster.
In the next section, we will examine the problem definition more closely. Section 3 introduces some already existing social media data sets useful for analyzing the task of retrieving informative tweets and for training as well as testing modeling approaches. In section 4, we will then show how such approaches have been implemented so far, grouped into filtering, crowdsourcing, and machine learning methods. Section 5 then goes into detail about the challenges these approaches frequently face, while section 6 briefly describes some related problems. We finish with suggestions for new developments in section 7, and a conclusion in section 8.

Problem definition
The task of finding social media posts in a crisis may appear clearly defined at first, but quickly becomes more convoluted when attempting an exact definition. Existing publications have gone about defining their problem statement in a variety of ways. An overview is provided in table 1.
What emerges from this table is a trichotomy between the concepts "related", "relevant", and "informative". As a tentative definition, we subsume that "related" encompasses all messages that make implicit or explicit mention of the event in question.
The "relevant" category is a subset of these, comprised of messages that contain actual information pertaining to the event. "Informative" messages, finally, offer information useful to the user of the system. Not all publications necessarily follow this pattern, and the lines between these categories are blurry. In reality, many border cases arise, such as jokes, sarcasm, and speculation. In addition, the question of what makes a tweet informative, or even relevant, is highly dependent on who is asking this question, i.e. who the user of this system is. Such users are often assumed to be relief providers, but could also be members of the government, the media, affected citizens, their family members, and many others. Building on top of this, each of these users may be interested in a different use case of the system. For instance, humanitarian and governmental emergency management organizations are interested in understanding "the big picture", whereas local police forces and firefighters desire to find "implicit and explicit requests related to emergency needs that should be fulfilled or serviced as soon as possible" (Imran et al., 2018). Moreover, some of these use cases may require a high precision of the detected tweets while possibly missing some important information; others may be more accepting of false alarms while focusing on a high recall.

Table 1 (excerpt):
Crisis related: message related to a crisis situation in general, without taking informativeness or usefulness into account
Non-crisis related: message that is not related to a crisis situation
Relevant: any information that is relevant to disaster events, including useful information but also jokes, retweets, and speculation (Stowe et al., 2018)
Irrelevant: not related to a disaster event (Stowe et al., 2018)

These questions are commonly not explicitly taken into account in publications focusing on technical solutions. In practical application scenarios, however, they add complexity to the system. Moreover, these varying definitions make existing approaches and data sets difficult to compare.
Data sets are needed to evaluate and compare detection approaches, and to create models for automatic detection and other tasks. For these reasons, several such data sets have already been created. As mentioned above, Twitter is the most fruitful source of data for this use case; therefore, available data sets have mainly focused on Twitter data. One corpus offers a balanced compilation of labeled tweets from 48 different events covering the ten most common disaster types. A distinction can also be made between corpora focusing on natural disasters and those also including man-made disasters. Events2012 goes even further, containing around 500 events of all types, including disasters.
Annotations vary between these data sets. Some of them do not contain any labels beyond the type of event itself, while others are labeled according to content type (e.g. "Search and rescue" or "Donations"), information source (first-party observers, media, etc.), and priority or importance of each tweet (CrisisLexT26 and TREC-IS 2019A).
A general issue with these data sets lies in the fact that researchers cannot release the full tweet content due to Twitter's redistribution policy. Instead, these data sets are usually provided as lists of tweet IDs, which must then be expanded to the full information ("hydrated"). This frequently leads to data sets becoming smaller over time, as users may choose to delete their tweets or make them private. Additionally, the teams creating these corpora have mainly focused on English- and occasionally Spanish-language tweets to facilitate their wider usage for study. More insights would be possible if tweets in the language(s) of the affected area were available. However, Twitter usage also varies across countries. Another factor here is that less than 1% of all tweets contain geolocations (Sloan et al., 2013), which are often necessary for analysis.
In recent months, specific data sets for the COVID-19 pandemic have been collected; we also list these in the table. This is a unique case, as the crisis has been progressing for much longer than all other covered events, and much higher numbers of tweets have consequently been produced. It has also affected most of the world, rather than just particular regions.
The following paragraphs describe the data sets in more detail.

Events2012 This data set contains 120 million tweets, of which around 150,000 were labeled as belonging to one of 506 events (which are not necessarily disaster events) (McMinn et al., 2013).
CrisisLexT26 This data set contains tweets collected during 26 crises, mainly natural disasters like earthquakes, wildfires and floods, but also human-induced disasters like shootings and a train crash. Amounts of these tweets per disaster range between 1,100 and 157,500. In total, around 285,000 tweets were collected. They were then annotated by paid workers on the CrowdFlower crowdsourcing platform according to three concepts: informativeness, information type, and tweet source.

CrisisNLP Similar to CrisisLexT26, the team behind CrisisNLP collected tweets during 19 natural and health-related disasters and published them for research (Imran et al., 2016a). Collected tweets range between 17,000 and 28 million per event, making up around 53 million in total. Out of these, around 50,000 were annotated both by volunteers and by paid workers on CrowdFlower with regard to information type.
CrisisMMD CrisisMMD is an interesting special case because it only contains tweets with both text and image content. 16,000 tweets were collected for seven events that took place in 2017 in five countries. Annotation was performed by Figure Eight for text and images separately. The three annotated concepts are: informative/non-informative, eight semantic categories (like "Rescue and volunteering" or "Affected individuals"), and damage severity (only applied to images) (Alam et al., 2018b).
Epic This data set with a focus on Hurricane Sandy was collected in a somewhat different manner than most others. The team first assembled tweets containing hashtags associated with the hurricane, and then aggregated them by user. Out of these users, they selected those who had geotagged tweets in the area of impact, suggesting that these users would have been affected by the hurricane. Then, 105 of these users were selected randomly, and their tweets from a week before landfall to a week after were assembled. This leads to a data set that in all probability contains both related and unrelated tweets by the same users. Tweets were annotated according to their relevance as well as 17 semantic categories (such as "Seeking info" or "Planning") and sentiment (Stowe et al., 2018).
Florence The Florence data set contains 600,000 tweets collected in the area affected by Hurricane Florence in the week of September 10, 2018, to September 17, 2018. These were not originally pre-filtered in any way; therefore, only a subset of them is related to Hurricane Florence. Such possible subsets were determined in a number of ways, including a filtering with various approaches. The overlap between these results was interpreted to contain related tweets with high confidence; this leaves around 20,000 tweets.

TREC-IS 2019A A crisis classification task named "Incident Streams" has been a part of the Text REtrieval Conference (TREC).

covid19_twitter This is a larger data set of COVID-19-related tweets. Collection started on January 1st with very few found tweets, and was expanded on March 11. Tweets are collected worldwide from the API using 13 keywords; 238 countries are covered. There is also a version without retweets, which contains around 100,000 messages. The top 1,000 terms, bigrams, and trigrams are also contained in the data set (Banda et al., 2020).

GeoCoV19 This is an even larger data set of coronavirus-related tweets. Messages were collected on a basis of 800 multilingual keywords, starting on February 1, 2020. For nearly all of these tweets, geolocation information was retrieved, either from the tweet's own geolocation metadata, from user location or place information metadata, or from place mentions in the tweet's text. Non-coordinate information has been geocoded to coordinates (Qazi et al., 2020).

Detection approaches
As described above, users generate huge amounts of data on Twitter every second, and finding tweets related to an ongoing event is not trivial (Landwehr and Carley, 2014). Several detection approaches have been presented in the literature so far. We will group them into three categories: filtering by characteristics, crowdsourcing, and machine learning approaches.

Filtering by characteristics
The most obvious strategy is the filtering of tweets by various surface characteristics as shown in (Kumar et al., 2011), for example. Keywords and hashtags are used most frequently for this and often serve as a useful pre-filter for data collection.
The Twitter API allows searching directly for keywords and hashtags, or recording the live stream of tweets containing them, meaning that this approach is often a good starting point for researchers. This is especially relevant because only 1% of the live stream can be collected for free (also see section 5); when a keyword filter is employed, this 1% is more likely to contain relevant tweets. A drawback of fixed keyword lists is that the vocabulary describing an event shifts during the event. To tackle this problem, updating the keyword list based on new messages has been proposed.
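As a minimal, hypothetical sketch (the tweet structure and keyword list below are invented for illustration; real collection would go through the Twitter streaming API), keyword and hashtag pre-filtering can look like this:

```python
def matches_keywords(text, keywords):
    """Case-insensitive check whether a tweet text contains any tracked keyword or hashtag."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)

def filter_stream(tweets, keywords):
    """Keep only tweets whose text matches at least one keyword."""
    return [t for t in tweets if matches_keywords(t["text"], keywords)]

# Invented example tweets and keyword list
tweets = [
    {"id": 1, "text": "Major #flood on Main Street, need sandbags"},
    {"id": 2, "text": "Great coffee this morning"},
    {"id": 3, "text": "Flooding reported near the river"},
]
related = filter_stream(tweets, ["#flood", "flooding", "evacuation"])
```

Note that this simple substring matching also illustrates the drawbacks discussed here: it retrieves unrelated messages that happen to contain a keyword, and it misses related messages using vocabulary not on the list.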
Another problem with keyword lists is that unrelated data that contains the same keywords may be retrieved (Imran et al., 2015).
Nevertheless, such approaches have been used in insightful studies, e.g. in (de Albuquerque et al., 2015), where keyword-filtered tweets during a flood event were correlated with flooding levels.
Geolocation is another frequently employed feature that can be useful for retrieving tweets from an area affected by a disaster.
However, this approach misses important information that could be coming from a source outside the area, such as help providers or news sources. Additionally, only a small fraction of tweets is geo-tagged at all, leading to a large amount of missed tweets from the area (Sloan et al., 2013).
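Filtering by geolocation can be sketched as a simple bounding-box test over the small fraction of tweets that do carry coordinates (the coordinates and area of interest below are invented for illustration):

```python
def in_bounding_box(lat, lon, south, west, north, east):
    """True if a coordinate lies within a rectangular area of interest."""
    return south <= lat <= north and west <= lon <= east

# Hypothetical geotagged tweets (lat, lon); in reality only ~1% of tweets carry coordinates
tweets = [
    {"id": 1, "coords": (29.76, -95.37)},  # inside the affected area
    {"id": 2, "coords": (40.71, -74.01)},  # outside the affected area
    {"id": 3, "coords": None},             # no geotag: cannot be filtered this way
]
affected = [
    t for t in tweets
    if t["coords"] and in_bounding_box(*t["coords"], 28.0, -97.0, 31.0, -93.0)
]
```

The third tweet illustrates the limitation mentioned above: without a geotag, a tweet from inside the area is silently lost by this filter.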

Crowdsourcing approaches
To resolve these problems, other strategies were developed. One solution lies in crowdsourcing the analysis of tweets, i.e. asking human volunteers to manually label the data. Naturally, due to the large number of incoming tweets, many helpers are necessary. Established communities of such volunteers can be activated quickly in a disaster event. One example of such a community is the Standby Task Force.

To facilitate their work, platforms have been developed over the years. One of the most well-known systems is Ushahidi. This platform allows people to share situational information in various media, e.g. by text message, by e-mail, and of course via social media. There was also an attempt to integrate automatic analysis tools into the platform (named "SwiftRiver"), but this was discontinued in 2015.
Such automatic analysis tools are the motivation for AIDR (Imran et al., 2015). AIDR was first developed as a quick response to the 2013 Pakistan earthquake. Its main purpose lies in using machine learning methods to streamline the annotation process. In a novel situation, users first choose their own keywords and regions to start collecting a stream of tweets. Then, volunteers annotate relevant categories. A supervised classifier is then trained on these given examples, and is automatically applied to new incoming messages. A front-end platform named MicroMappers also exists. AIDR is available in an open-source format as well. It has been used in the creation of various data sets and experiments.
Another contribution to crowdsourcing crisis tweets is CrisisTracker (Rogstadius et al., 2013). In CrisisTracker, tweets are also collected in real-time. Locality-Sensitive Hashing (LSH) is then applied to detect clusters of topics (so-called stories), so that volunteers can consider these stories jointly instead of single tweets. The AIDR engine has also been integrated to provide topic filtering. As a field trial, the platform was used during the Syrian civil war in 2012. CrisisTracker is also available free and open-source (https://github.com/JakobRogstadius/CrisisTracker/), but maintenance stopped in 2016.
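The clustering idea behind such systems can be sketched with a toy MinHash/LSH implementation: tweets whose hash signatures collide in at least one band become candidates for the same "story". This is a simplified stand-in for illustration, not CrisisTracker's actual code; the example texts are invented:

```python
def shingles(text, n=3):
    """Character n-grams of a normalized tweet text."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def minhash_signature(shingle_set, seeds):
    """One min-hash value per seed; similar sets tend to share many values."""
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def lsh_buckets(texts, num_seeds=20, band_size=2):
    """Group texts whose signatures collide in at least one band of hash values."""
    seeds = list(range(num_seeds))
    buckets = {}
    for idx, text in enumerate(texts):
        sig = minhash_signature(shingles(text), seeds)
        for b in range(0, num_seeds, band_size):
            key = (b, tuple(sig[b:b + band_size]))
            buckets.setdefault(key, set()).add(idx)
    return buckets

# Two near-duplicate tweets and one unrelated one (invented examples)
texts = [
    "flood waters rising downtown",
    "flood waters rising downtown now",
    "sunny day and nothing happening",
]
buckets = lsh_buckets(texts)
```

With high probability, the two near-duplicates (indices 0 and 1) end up sharing a bucket, while unrelated texts rarely do; volunteers would then review such buckets as stories.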

Machine learning approaches
As another avenue, various machine learning approaches for automatically detecting crisis-related tweets have been developed over the years. Employed algorithms include Latent Dirichlet Allocation (Resch et al., 2018; Kireyev et al., 2009), Naive Bayes models (Li et al., 2018), Support Vector Machines (Sakaki et al., 2010; Stowe et al., 2018), and Random Forests (Kaufhold et al., 2020). An overview of these more traditional machine learning approaches as of 2015 can be found in (Imran et al., 2015).
In recent years, deep learning techniques, i.e. neural networks, have come to the forefront of research. We will focus on these here.
On a general level, the problem falls under the umbrella of event detection as shown, for example, in (Chen et al., 2015; Feng et al., 2016; Nguyen and Grishman, 2015). Caragea et al. (2016) first employed Convolutional Neural Networks (CNN) for the classification of tweets into those related to flood events and those unrelated. Lin et al. (2016) also applied CNNs to social media messages, but for the Weibo platform instead of Twitter. In many of the following approaches, a type of CNN developed by Kim for text classification is used (Kim, 2014), such as in (Burel and Alani, 2018; Kersten et al., 2019). A schematic is shown in figure 1. These methods achieve accuracies of around 80% for the classification into related and unrelated tweets.
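The Kim-style architecture can be sketched as a NumPy forward pass: convolutions of several filter widths over the word embedding matrix, max-over-time pooling, and a dense output layer. The weights here are untrained random values, so this only illustrates the data flow and tensor shapes, not a working classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def kim_cnn_forward(embedded, filter_widths=(3, 4, 5), num_filters=8):
    """Forward pass of a Kim (2014)-style text CNN on one tweet.
    embedded: (sequence_length, embedding_dim) matrix of word vectors.
    Returns logits for two classes (related vs. unrelated)."""
    seq_len, dim = embedded.shape
    features = []
    for w in filter_widths:
        # One random (untrained) filter bank per width: (num_filters, w, dim)
        filters = rng.standard_normal((num_filters, w, dim)) * 0.1
        # Valid 1D convolution over the word sequence
        conv = np.stack([
            np.tensordot(embedded[i:i + w], filters, axes=([0, 1], [1, 2]))
            for i in range(seq_len - w + 1)
        ])                                    # (positions, num_filters)
        conv = np.maximum(conv, 0)            # ReLU
        features.append(conv.max(axis=0))     # max-over-time pooling
    feat = np.concatenate(features)           # fixed-size feature vector
    weights = rng.standard_normal((feat.size, 2)) * 0.1
    return feat @ weights                     # 2-class logits

# A toy "tweet" of 20 tokens with 50-dimensional random embeddings
tweet = rng.standard_normal((20, 50))
scores = kim_cnn_forward(tweet)
```

The max-over-time pooling is what makes the output size independent of tweet length, which is why this architecture handles variable-length messages naturally.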

200
In (Burel and Alani, 2018) as well as in (Burel et al., 2017a) and (Nguyen et al., 2016), this kind of model is also used for information type classification. Burel et al. (2017a) integrate semantic concepts and entities from DBPedia. The resulting system is packaged as CREES (Burel and Alani, 2018), a service that can be integrated into other platforms similar to AIDR. The model in (Nguyen et al., 2016) has the ability to perform online learning, i.e. updating the model as new blocks of tweets arrive.

Figure 1. CNN for text classification as proposed by Kim (2014).
A crucial component of a social media classification model is the embedding of the text data at the input (i.e. how words or sentences are mapped to numeric values that the model can process). Many approaches employ word2vec, a pre-trained word embedding that was first presented in 2013 (Mikolov et al., 2013). A version specifically trained on crisis tweets is presented in (Imran et al., 2016a). More sophisticated models have been developed in the meantime, such as BERT (Devlin et al., 2019), for which a crisis-specific version has also been proposed. A comparison of various word and sentence embeddings for crisis tweet classification can be found in (ALRashdi and O'Keefe, 2019).
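A common baseline use of such word embeddings is to represent a whole tweet as the average of its word vectors. The three-dimensional vectors below are toy values invented for illustration; a real system would load pre-trained embeddings such as word2vec:

```python
import numpy as np

# Toy "pre-trained" word vectors (a real system would load e.g. word2vec weights)
vectors = {
    "flood":  np.array([0.9, 0.1, 0.0]),
    "water":  np.array([0.8, 0.2, 0.1]),
    "rescue": np.array([0.1, 0.9, 0.2]),
}

def embed_tweet(text, vectors, dim=3):
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

v = embed_tweet("Flood water everywhere", vectors)
```

The resulting fixed-size vector can then feed any downstream classifier; more sophisticated models like BERT instead produce context-dependent representations.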
All of the approaches mentioned above aim to generalize to any kind of event without any a priori information. The transferability of pre-trained models to new events and event types is thoroughly investigated in (Wiegmann et al., 2020b). However, a real-world system may not need to be restricted in this way; in many cases, its users will already have some information about the event, and may already have spotted tweets of the required type. This removes the need to anticipate any type of event. It also directs the system towards a specific event rather than any event happening at that time. Li et al. (2018) and Mazloom et al. (2019) showed that models adapted to the domain of the event perform better than generalized models. Alam et al. (2018a) propose an interesting variant for neural networks: Their system includes an adversarial component which can be used to adapt a model trained on a specific event to a new one (i.e. a new domain).

Kruspe et al. (2019) propose a system that does not assume an explicit notion of relatedness vs. unrelatedness (or relevance vs. irrelevance) to a crisis event. As described above, these qualities are not easy to define, and might vary for different users or different types of events. The presented system is able to determine whether a tweet belongs to a class (i.e. crisis event) implicitly defined by a small selection of example tweets by employing few-shot models. The approach is expanded upon in (Kruspe, 2019). Another perspective on this is shown in (Kruspe, 2020), where tweets are not classified into explicit "related" or "unrelated" classes, but rather clustered by topic at time of publication; these clusters can shift over time.
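The few-shot idea can be sketched as nearest-prototype classification: each class is defined implicitly by the mean embedding of a handful of example tweets, and a new tweet is assigned to the closest prototype. The two-dimensional embeddings below are invented for illustration; this is a simplified stand-in for the cited models, which use learned encoders:

```python
import numpy as np

def prototype(examples):
    """Class prototype: mean of the few example embeddings."""
    return np.mean(examples, axis=0)

def classify(query, prototypes):
    """Assign the query embedding to the class with the nearest prototype."""
    return min(prototypes, key=lambda name: np.linalg.norm(query - prototypes[name]))

# Hypothetical 2-D embeddings of a few example tweets per class
protos = {
    "event":     prototype(np.array([[0.9, 0.1], [0.8, 0.3], [1.0, 0.2]])),
    "unrelated": prototype(np.array([[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]])),
}
label = classify(np.array([0.85, 0.2]), protos)
```

The appeal of this setup is that no explicit "related"/"unrelated" definition is needed; a user simply supplies a handful of example tweets to define the class of interest.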

Challenges
None of the approaches presented are able to solve the problem of detecting tweets in disaster events perfectly. In some respects, this is due to technical limitations; however, there are several difficulties inherent to the task itself, which we will discuss in this section.

Ambiguous problem definition As described in section 2, the task of tweet detection in disasters is ill-defined and heavily dependent on the final user of the detection product. Annotation experiments also show that even if the goal is clearly stated, inter-rater agreement is commonly low, with raters often interpreting both the problem statement as well as tweet content very differently (Stowe et al., 2018). This problem becomes even more pronounced when annotating more fine-grained labels, e.g. for content type classes or for priority.
Twitter also forbids direct redistribution of tweet content, meaning that the described data sets are only available as lists of tweet IDs. This introduces two difficulties: One, retrieving the actual tweet content ("hydrating") can take a very long time for large data sets due to the rate limit. Two, tweets may become unavailable over time because their creator deleted them or their whole account, or because they were banned. In some cases of older data sets, this means that a significant portion of the corpus cannot be used anymore, impeding reproducibility and comparability of published research.
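The practical impact of rate limits on hydration can be estimated with a simple batching calculation. The batch size of 100 IDs per request and 300 requests per 15-minute window used below are illustrative figures, not guaranteed current API limits:

```python
def batches(ids, size=100):
    """Split a list of tweet IDs into request-sized batches for hydration."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def requests_needed(num_ids, size=100, per_window=300):
    """Rough number of API requests and 15-minute rate-limit windows needed,
    under the illustrative limit assumptions above."""
    n = -(-num_ids // size)        # ceiling division: requests
    windows = -(-n // per_window)  # ceiling division: rate-limit windows
    return n, windows

# E.g. hydrating a corpus of 285,000 tweet IDs (the size of CrisisLexT26)
n, windows = requests_needed(285_000)
```

Under these assumptions, hydrating such a corpus already takes thousands of requests spread over multiple rate-limit windows, which is why hydration of large data sets "can take a very long time", and every deleted or private tweet discovered along the way shrinks the usable corpus.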

Apart from access limitations, Twitter and legal restrictions also regulate what researchers are allowed to do with this data. As an example, the Twitter user agreement states (Twitter, Inc., 2020): "Unless explicitly approved otherwise by Twitter in writing, you may not use, or knowingly display, distribute, or otherwise make Twitter Content, or information derived from Twitter Content, available to any entity for the purpose of: (a) conducting or providing surveillance or gathering intelligence, including but not limited to investigating or tracking Twitter users or Twitter Content; (b) conducting or providing analysis or research for any unlawful or discriminatory purpose, or in a manner that would be inconsistent with Twitter users' reasonable expectations of privacy; (c) monitoring sensitive events (including but not limited to protests, rallies, or community organizing meetings); or (d) targeting, segmenting, or profiling individuals based on sensitive personal information, including their health (e.g., pregnancy), negative financial status or condition, political affiliation or beliefs, racial or ethnic origin, religious or philosophical affiliation or beliefs, sex life or sexual orientation, trade union membership, Twitter Content relating to any alleged or actual commission of a crime, or any other sensitive categories of personal information prohibited by law." Many interesting research questions are not identical to, but related to, the problematic usages described in this statement, e.g. inference on a user basis or the monitoring of protests. Researchers must therefore be careful not to step into prohibited territory.
Lack of geolocation In a disaster context, knowing exactly where a tweet was sent is often crucial to the usability of this information. Twitter provides several ways of detecting geolocation. The most precise of them is the option for users to send their coordinates along with the tweet. However, only about 1% of tweets contain this information (Sloan et al., 2013). A tweet's location can also be estimated from the location stated in the user profile, or by analyzing the tweet's content with regard to mentions of geolocations. For operationalization, a geocoding to coordinates is then required, which can be provided by services such as Google Maps or OpenStreetMap's Nominatim. Unfortunately, these geolocations are prone to errors, e.g. because a user mentions a position other than their own, because they might be traveling, or because the center coordinates of a city are too imprecise to be usable.
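Content-based location estimation can be sketched as a lookup of place mentions in a gazetteer. Real systems would query a geocoding service such as Nominatim; the tiny gazetteer and tweet below are invented for illustration:

```python
# Hypothetical mini-gazetteer mapping place names to (lat, lon) center coordinates
GAZETTEER = {
    "houston": (29.76, -95.37),
    "miami": (25.76, -80.19),
}

def geocode_mentions(text, gazetteer):
    """Return center coordinates for every gazetteer place mentioned in the tweet text."""
    words = text.lower().replace(",", " ").split()
    return [gazetteer[w] for w in words if w in gazetteer]

coords = geocode_mentions("Heavy rain in Houston, stay safe", GAZETTEER)
```

The sketch also makes the error sources visible: a matched place name need not be the author's actual position, and the returned city-center coordinates may be too coarse for operational use.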

6 Related tasks
Once tweets related to a disaster event have been discovered, many further analysis steps are possible. We will only touch upon those briefly here. As described in section 3, some of the available data sets have already been annotated with these additional concepts.
A popular next step is the classification into semantic or information type classes. Such classes may include sentiments, affected people seeking various types of assistance, media reports, warnings and advice, etc. No common set of such classes exists; in the CrisisNLP and CrisisLexT26 corpora, 9 and 7 classes are used respectively, with some overlap. For the TREC Incident Streams challenge, potential end users were questioned about their classes of interest, resulting in a two-tier ontology with 25 classes on the lower tier. As an added difficulty, classes often overlap in tweets; for this reason, TREC allows multiple labels per tweet. Furthermore, annotators often disagree whether an information type is present in a tweet.

290
Another way of further discerning between tweets is a distinction between levels of informativeness or priority. This can be implemented either with discrete classes (low/medium/high importance), on a continuous numerical scale, or as a ranking of tweets. The CrisisLexT26 and TREC-IS 2019A data sets contain such annotations.
Apart from approaches processing single tweets, the analysis of the spatio-temporal distribution and development of discussed topics within affected areas at different scales may provide valuable insights (Kersten and Klan, 2020). Other research focuses on the detection of specific events, or types of events (e.g. floods, wildfires, or man-made disasters) (e.g. Burel et al. (2017b)).
This can often be helpful when social media is used as an alerting system. Additionally, models specialized to event types can be more precise and allow for different distinctions than general-purpose models (Wiegmann et al., 2020b); detection of the event type enables the automatic selection of such a more specialized method.
Apart from these text-based tasks, image analysis can also be a helpful source of information. As an example, images posted on social media can be used to determine the degree of destruction in the aftermath of a disaster (Alam et al., 2017; Nguyen et al., 2017b).

https://doi.org/10.5194/nhess-2020-214 Preprint. Discussion started: 10 July 2020 (c) Author(s) 2020. CC BY 4.0 License.

Future work
Many interesting new analysis tasks are conceivable based on the detection methods described so far, particularly when employing automatic methods. A good starting point to identify relevant practical issues related to acquisition tasks that could potentially be solved by analyzing social media data is provided in (Wiegmann et al., 2020c). Here, opportunities and risks of disaster data from social media are investigated by means of a systematic review of currently available incident information.
One aspect that has not been considered in research so far is how an event changes over time. New approaches could be used to analyze the spatiotemporal development of disasters, and how this could be utilized in disaster prevention. During the course of an event, clustering methods could be employed to rapidly detect novel developments such as sub-events or new topics. This is particularly relevant for relief providers, who require extremely fast situation monitoring.
As described in section 5, localizing information coming from Twitter is often a challenge. Approaches that are able to deal with this lack of information are necessary. This could be implemented either by deriving location by some other means, or by spatiotemporal and semantic analysis of large sets of tweets to cross-reference and check information.
As mentioned above, languages other than English have usually not been included in research on this topic. Multilingual approaches would be a very helpful next step to facilitate usage of such methods in regions of the world where English is not the main language. Another aspect of the data that has not often been used so far is the images posted by users. In particular, a multimodal joint analysis of text and images is very interesting from both the research and the usage perspective. The CrisisMMD data set is an interesting first step in this direction.
As described in section 4, some crowdsourcing approaches already integrate machine learning-based methods. In future work, expanding human-in-the-loop approaches would be very useful.
In general, social media is usually not the only source of information and cannot provide a full picture of the situation. Therefore, an integration with other information sources, such as earth observation data, media information, or governmental data, is highly relevant.

Conclusion
In this review paper, we gave an overview of current methods to detect tweets pertaining to disaster events. There are three main ways to approach this problem: filtering tweets by characteristics such as location and contained keywords or hashtags, crowdsourcing, and machine learning-based methods. Each of these has its advantages and disadvantages, but machine learning appears to be the current main avenue of research, with big improvements in the past few years. To train and test these models, various data sets spanning one or multiple events have been created and are available online. However, these usually only provide IDs of the tweets, which leads to changes in the data sets over time.
Most approaches aim to find "relevant" or "informative" tweets, but struggle to define these concepts. Other difficulties include the subjectivity of classes and tweet interpretations, data limitations, linguistic difficulties, and legal issues.
Nevertheless, large strides have been made in the past years to tackle this problem, and research in this area remains highly active. Many related and novel analysis tasks are possible in the future.
Author contributions. Anna Kruspe wrote this paper with assistance and input from Jens Kersten and Friederike Klan.
Competing interests. The authors declare that they have no conflict of interest.