Leveraging Social Media to Map Disasters
This is a summary of a project I developed in collaboration with Ryan Stewart, Hakob Abjyan and Jonathan Slapnik to leverage Twitter data to map disasters. The code and presentation are available here!
PROBLEM STATEMENT
When responding to disasters, it is critical to map and identify the locations of survivors who need assistance. In recent history, social media has grown rapidly, with new platforms launching every year. Now that a majority of people use social media, it has become increasingly helpful when natural disasters hit.
Social media can help identify isolated communities at risk, locations of survivors, areas where assistance teams should be sent for search and rescue, levels of damage, places where more information needs to be collected, and where resources should be allocated. People will tweet that they or others need help, and they will post pictures and videos to Twitter, Facebook, Instagram, Snapchat, and YouTube showing the conditions they are currently facing.
Since these platforms are updated every minute, leveraging them can be very helpful in figuring out where help needs to be sent next. When disasters hit, there are areas that need help but have not yet received it. It is critical that we are able to map and identify the locations of survivors who need aid.
FINDING RELEVANT DATA
As is normal with web scraping, there are some limitations to what data we can actually acquire. One of our goals for this project was to create either a map of locations in need or a list of latitudes and longitudes of those locations.
We worked with the Twitter Standard API. As a team we faced two limitations related to this API. First, standard API access only allowed us to pull 100 tweets at a time and only provides access to tweets from the past 7 days. Therefore, we worked under the assumption that FEMA, or an organization working for FEMA, could run our code through Enterprise-level API access, which would give them much greater access to Twitter data.
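For context, here is a minimal sketch of what pulling tweets through the Standard Search API could look like using the tweepy library (3.x style); the credentials, query size, and resulting column names are placeholders rather than our production code.
import tweepy
import pandas as pd

# Placeholder credentials -- replace with your own developer keys.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# The Standard Search API returns at most 100 tweets per request and only covers roughly the last 7 days.
statuses = tweepy.Cursor(api.search, q='#HurricaneFlorence2018', lang='en',
                         tweet_mode='extended', count=100).items(1000)

florence_tweets = pd.DataFrame(
    [{'tweet': s.full_text, 'created_at': s.created_at} for s in statuses])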
A second limitation concerned the geo-data. While Twitter collects location data about tweets, only a small percentage of users opt to share that data with third parties, which means the geolocation field is empty for the vast majority of tweets. We considered using the location tied to a user's account, as opposed to the location from which they posted a tweet, but that location represents the place where the account was created and was often more misleading than helpful.
During our research we found that social media companies like Facebook have provided FEMA with access to restricted geolocation data during emergencies[1]. As an alternative for the future, FEMA might negotiate with Twitter for access under similar terms.
For the purposes of this project, we used a third-party database that allowed us to pull almost 10,000 tweets related to the hashtag #HurricaneFlorence2018[2]. To work around the geolocation limitation, we artificially simulated geolocation data for those tweets by creating a random distribution of longitudes and latitudes centered in the southeastern US. In practice, an emergency response organization would have real location data tied to the tweets it pulls.
###
# CREATING DEMONSTRATION LATITUDE AND LONGITUDE DATA.
import numpy as np

florence_tweets['Latitude'] = 0
florence_tweets['Longitude'] = 0
for x in range(len(florence_tweets['Latitude'])):
    florence_tweets['Latitude'][x] = format((np.random.random() + np.random.uniform(25, 39)), ".5f")
florence_tweets.Latitude = [float(i) for i in florence_tweets['Latitude']]

index = 0
for x in florence_tweets['Latitude']:
    if x > 37:
        florence_tweets['Longitude'][index] = format(-(77.5001 + np.random.random() + np.random.normal(0.0000, .8000)), ".5f")
    elif x > 35:
        florence_tweets['Longitude'][index] = format(-(77.6001 + np.random.random() + np.random.normal(0.0000, .8000)), ".5f")
    elif x > 32:
        florence_tweets['Longitude'][index] = format(-(81.0001 + np.random.random() + np.random.normal(0.0000, .8000)), ".5f")
    elif x > 30:
        florence_tweets['Longitude'][index] = format(-(82.5001 + np.random.random() + np.random.normal(0.0000, .6500)), ".5f")
    elif x > 28:
        florence_tweets['Longitude'][index] = format(-(81.5001 + np.random.random() + np.random.normal(0.0000, .4000)), ".5f")
    else:
        florence_tweets['Longitude'][index] = format(-(80.6001 + np.random.random() + np.random.normal(0.0000, .3000)), ".5f")
    index += 1
florence_tweets.Longitude = [float(i) for i in florence_tweets.Longitude]
In order to map the geolocation data, we decided to use GeoPandas, an open-source library created specifically to make it easy to work with geospatial data within pandas.
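The plotting code further below assumes the simulated latitude/longitude pairs have already been converted into Point objects and that a US base map has been loaded; a minimal sketch of that step, with a placeholder shapefile path, might look like this:
import geopandas as gpd
from shapely.geometry import Point

# Turn the simulated lat/long columns into shapely Point objects (longitude first, then latitude).
geometry = [Point(xy) for xy in zip(florence_tweets['Longitude'], florence_tweets['Latitude'])]
florence_tweets = gpd.GeoDataFrame(florence_tweets, geometry=geometry, crs='EPSG:4326')

# Placeholder path: any US state-boundary shapefile will do for the base map used in the plots below.
usa = gpd.read_file('./shapefiles/us_states.shp')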
Graph 1. Tweets reported during Hurricane Florence 2018
MODELING
There are two distinct elements to modeling a solution to this problem: the first is processing the text, and the second is mapping the relevant tweets. To tackle the first, we used some common natural language processing techniques (tokenizing, stemming, and removing punctuation and stopwords) to extract the meaningful words from each tweet while ignoring words that do not provide any additional analytical value. This allowed us to identify the most frequently used words across all tweets.
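A simple counter over the cleaned tokens is enough to produce the frequencies behind the chart below; a minimal sketch, assuming the token lists live in clean_florence_tweet_tokens (built in the code further down this post), is:
from collections import Counter

# Assumes clean_florence_tweet_tokens holds one list of cleaned tokens per tweet.
word_counts = Counter(word for tweet_tokens in clean_florence_tweet_tokens for word in tweet_tokens)
print(word_counts.most_common(20))  # the top words, as plotted in Graph 2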
Graph 2. Most Common Words in Tweets
However, the frequency of a word alone does not indicate that a tweet is important. We decided to create an unsupervised learning model that would allow us to classify our tweets into categories.
To define a framework for classifying tweets in a relevant way, we used as a guideline the categories of emergency and permanent work defined by FEMA's Damage Assessment Operations Manual[3]. To the emergency work category, which covers debris removal and emergency protective measures, we added relevant words that people might use on Twitter when reporting emergencies. This bag of words represents a non-exhaustive collection of words related to demands for an urgent response.
## URGENT BAG OF WORDS
emergency_work = ['help', 'people', 'flood', 'leave', 'rescue', 'sos', 'come',
                  'debris', 'removal', 'junk', 'waste', 'property', 'tree', 'private', 'cubic', 'yard', 'creek',
                  'removal', 'roads', 'levees', 'unsafe', 'structures', 'water', 'floodplains',
                  'critical', 'evacuation', 'shelter', 'emergency', 'transport', 'access', 'safe', 'rescue',
                  'barricades', 'fire', 'generator', 'safety', 'hazard', 'need', 'lost', 'seen', 'missing', 'flooding', 'reach', 'trying']
The second bag of words took as its base the permanent work category of FEMA's operations manual, which covers roads and bridges, water control facilities, buildings and equipment, utilities, and parks, recreation, and other facilities. To this bag of words we added other relevant words that people on Twitter might use. This is a non-exhaustive collection of words related to demands for a less urgent response.
## LESS URGENT BAG OF WORDS
permanent_work = ['help', 'vegetation', 'mud', 'silt', 'bridge', 'waterways', 'facility', 'utilities', 'park', 'traffic', 'replacement', 'control',
                  'repairs', 'stabilization', 'remediation', 'surfaces', 'bases', 'shoulders', 'ditches', 'drainage',
                  'sidewalk', 'guardrails', 'signs', 'decking', 'pavement', 'channel', 'alignment', 'irrigation', 'erosion',
                  'prevention', 'dams', 'reservoirs', 'basins', 'canals', 'aqueducts', 'coastal', 'shoreline', 'pumping',
                  'building', 'mechanical', 'electrical', 'basement', 'painting', 'treatment',
                  'power', 'transmission', 'natural gas', 'sewage', 'permanent', 'restoration', 'communication', 'systems',
                  'inspection', 'assessment', 'beach', 'park', 'playground', 'pool', 'docks', 'golf', 'tennis',
                  'ball', 'port', 'harbor']
In order to identify whether a tweet warranted action, we used a Word2Vec model. Word2Vec is a neural network model that takes words and converts them into vectors. The idea behind Word2Vec is that it takes something the computer cannot understand (i.e. human language) and turns it into something it can: in this case, vectors in 300-dimensional space. By taking the average of the vectors of all the words in a sentence, tweet, or any list of words, we can use that "average vector" to gauge the overall meaning of the message, in the same spirit as sentiment analysis.
Word2Vec can make highly accurate guesses about a word's meaning based on past appearances, and words with more similar meanings usually have closer vectors. Rather than training our own model, we used pre-trained vectors built from Google News, covering roughly 3 million words and phrases[4].
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
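As a quick sanity check on the loaded vectors (illustrative only, not part of the classification pipeline), gensim lets you inspect nearest neighbours and pairwise similarities directly:
# Illustrative only: confirm that semantically related words sit close together in the vector space.
print(model.most_similar('flood', topn=5))   # nearest neighbours of 'flood'
print(model.similarity('flood', 'rescue'))   # cosine similarity between two individual words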
We then defined the vectors associated with our urgent and less urgent bags of words.
# CREATE THE URGENT VECTOR
emerg_vect = np.zeros((1, 300))
counter = 0
for word in emergency_work:
    if word not in model.vocab:
        continue
    else:
        temp = model.word_vec(word)
        emerg_vect = emerg_vect + temp
        counter += 1
emerg_vect = emerg_vect / counter
emerg_vect = np.squeeze(emerg_vect)
# print(emerg_vect)

# CREATE THE LESS URGENT VECTOR
permanent_vect = np.zeros((1, 300))
counter = 0
for word in permanent_work:
    if word not in model.vocab:
        continue
    else:
        temp = model.word_vec(word)
        permanent_vect = permanent_vect + temp
        counter += 1
permanent_vect = permanent_vect / counter
permanent_vect = np.squeeze(permanent_vect)
We then broke each tweet down into its component words (tokens).
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
# LOAD YOUR DATA. WE USE A DATASET FOR DEMONSTRATION PURPOSES. IF COLLECTING DATA
#... THROUGH AN API, YOU CAN EITHER SAVE THE DATAFRAME OF TWEETS AS A CSV AND SPECIFY THE FILE
#... PATH OR FEED THE OUTPUT DIRECTLY INTO THE CODE BELOW.
florence_tweets = pd.read_csv('./hurricane_florence_tweets.csv')
florence_tweets.drop('Unnamed: 0', axis=1, inplace=True)
florence_tweets.rename({'text': 'tweet'}, axis=1, inplace=True)
english_stops = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
# IF USING ANY DATA BESIDES THE SET PROVIDED, ENSURE THAT YOUR LIST COMPREHENSION POINTS TO THE TWEET TEXT COLUMN
florence_tweet_tokens = [tokenizer.tokenize(tweet.lower()) for tweet in florence_tweets.tweet]
clean_florence_tweet_tokens = []
for tweet in florence_tweet_tokens:
    clean_tweet_tokens = []
    for word in tweet:
        if word not in english_stops and word not in ['@', 'rt', 'https', 'co', 'hurricaneflorence2018'] and '@' not in word:
            clean_tweet_tokens.append(word)
    clean_florence_tweet_tokens.append(clean_tweet_tokens)
Each of those words was assigned a vector from the Word2Vec model loaded with the Google News vectors. After each word was assigned a vector, we calculated the average vector for each tweet.
We then compared the average vector of each tweet to the vector of each bag of words (urgent response and less urgent response) using cosine similarity: the dot product of two vectors divided by the product of their magnitudes measures the angle between them. If a tweet's average vector is closer to the urgent vector, the tweet is assigned a 1; if it is closer to the less urgent vector, it is assigned a 0.
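The loop below performs this comparison inline; factored out as a standalone illustration (not the project's code), the calculation boils down to:
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes = cosine of the angle between a and b.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_tweet(tweet_vect, urgent_vect, less_urgent_vect):
    # 1 = urgent, 0 = less urgent, mirroring the inline comparison in the loop below.
    return int(cosine_similarity(tweet_vect, urgent_vect) >= cosine_similarity(tweet_vect, less_urgent_vect))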
# THIS IS THE NLP STEP. THE LOOP WILL TURN EACH WORD IN YOUR LIST OF TOKENIZED TWEETS (or
#... tokenized sentences) INTO A VECTOR, CLASSIFY THE WORD AS EMERGENCY OR NON-EMERGENCY USING COSINE
#... SIMILARITY, AND ASSIGN THE ENTIRE TWEET (or sentence) A VALUE BASED OFF OF THE BASELINE CLASSIFICATION
#... SCORE.
target = []  # WE WILL FILL THIS WITH OUR CLASSIFICATIONS FOR EACH FULL TWEET
for tweet in clean_florence_tweet_tokens:
    counter = 0
    temp_vect = np.zeros((1, 300))
    for item in tweet:
        if item not in model.vocab.keys():  # IF A WORD IS NOT IN THE Word2Vec MODEL, IT IS NOT INCLUDED
            continue
        else:
            temp_vect = temp_vect + model.word_vec(item)
            counter += 1
    if counter == 0:
        counter = 1
    temp_vect = np.squeeze(temp_vect) / counter
    # THE BELOW STEP IS CALCULATING AND COMPARING THE COSINE SIMILARITIES. THE DOT PRODUCT IS CALCULATED
    #... FOR THE TWEET VECTOR AND THE EMERGENCY VECTOR AND FOR THE TWEET VECTOR AND THE NON-EMERGENCY
    #... VECTOR. AFTER CALCULATING THE DOT PRODUCT, WE DIVIDE BY THE ABS. VALUE OF THE TWO GIVEN VECTORS
    #... TO GET THE COSINE VALUE FOR THE ANGLE BETWEEN THE VECTORS. THE GREATER THE COSINE VALUE, THE CLOSER
    #... TWO VECTORS ARE TO ONE ANOTHER, AND TWEETS ARE ASSIGNED A CLASSIFICATION OF EMERGENCY OR NON-
    #... EMERGENCY ACCORDINGLY.
    if (np.dot(temp_vect, emerg_vect) / (np.linalg.norm(emerg_vect) * np.linalg.norm(temp_vect))
            >= np.dot(temp_vect, permanent_vect) / (np.linalg.norm(permanent_vect) * np.linalg.norm(temp_vect))):
        target.append(1)
    else:
        target.append(0)
RESULTS
We then added the classification of each tweet to our original data frame.
florence_tweets['target'] = target
Once we have classified the tweets as urgent or less urgent, we map them based on their location. The map below shows the tweets in both categories, and the code used to produce it follows.
import matplotlib.pyplot as plt
# PLOTTING
#... NOTE: YOU CAN ONLY PLOT DATA ON A MAP IF YOU HAVE CREATED Point Objects FROM LAT/LONG DATA IN THE STEPS ABOVE
fig, ax = plt.subplots(figsize=(100, 100))
usa.plot(ax=ax, color='gray')  # USA .shp FILE READ IN ABOVE

# PLOTTING CLASSIFICATIONS SEPARATELY
florence_tweets[florence_tweets.target == 0].geometry.plot(marker='*', color='yellow', markersize=900, ax=ax, label='Tweet Loc')
florence_tweets[florence_tweets.target == 1].geometry.plot(marker='^', color='red', markersize=900, ax=ax, label='Tweet Loc')

# ADJUST MAP AREA DEPENDING ON TARGET AREA. xlim CORRESPONDS TO LONGITUDE RANGE, ylim CORRESPONDS TO
#... LATITUDE RANGE
plt.xlim(-85, -60)
plt.ylim(20, 50)
Graph 4. Urgent and Non-Urgent Tweets
CONCLUSIONS AND NEXT STEPS
The main takeaways of this project are:
- Accessing geolocation data in social media can be difficult, but not impossible. Public agencies like FEMA might have the political capital to negotiate better access to this type of information from tech companies in times of emergency.
- Word2Vec and cosine similarity between word vectors can be used to classify tweets or posts from any source in a meaningful way.
- As next steps, we would like to work with emergency experts to optimize our bag of words for hurricane response, and also to develop bags of words that apply to other types of disasters.
- It would also be important to run our model on tweets with real geolocation data to evaluate its results.
[1] TechCrunch, 2017. "Facebook will share anonymized location data with disaster relief organizations". https://techcrunch.com/2017/06/07/facebook-will-share-anonymized-location-data-with-disaster-relief-organizations/
[2] TAGS, 2018. https://tags.hawksey.info/
[3] FEMA, 2016. "Damage Assessment Operations Manual".
[4] Google, 2018. "Google News Vectors". https://groups.google.com/forum/#!topic/word2vec-toolkit/z0Aw5powUco