Natural Language Processing for Detecting Victims in Emergency Situations

Idowu Odesanmi · Published in Nerd For Tech · Nov 23, 2021

How social media posts and texts during emergencies can be used to improve the reach and response time of first-responders and aid agencies.

image: sunnewsonline.com

From fires, flooding, landslides, chronic hunger, epidemics and armed conflict to road accidents, earthquakes and even violent winds, crises and disasters form a part of our daily lives. They tend to define how we interact with one another and with the natural environment.

Another thing that is continuously redefining how we interact with one another in these modern times is social media. Its access and coverage are unlike those at any other point in human history. By leveraging the ubiquity of social media and the power of machine learning, there is no limit to how much we can improve our lives and provide timely aid to those in crisis in Nigeria.

This motivation — using social media posts and texts to detect people in need of relief during emergencies — defines the scope of the challenge I attempted to tackle. Follow along as I walk you through my work.

Dataset

For this work, the dataset required is a collection of post-crisis social media posts and text messages with annotations for identifying the manner of aid needed.

One method of getting the desired dataset is to tediously scrape, filter, clean and label tens of thousands of messages from social platforms like Twitter and Facebook. This approach can take months of arduous manual work, depending on the manpower available, but it gives the best control over the data distribution and can be easily tailored to a particular location or region.

Fortunately, FigureEight Inc., now Appen, already did all the hard work of gathering such data. The FigureEight disaster-response dataset consists of 26,249 post-disaster messages collected from multiple platforms in the aftermath of several disasters. Each message was then annotated with a binary label (0 or 1) for each of the 36 predefined aid categories.

For clarity, if a message has label ‘1’ under a particular aid category, it means the message composer requires that aid, and a label of ‘0’ means the opposite. All in all, this dataset is well suited to building a model that can detect people needing aid through their social media posts.

Methodology

Machine Learning Pipeline (Source: GOSMAR.EU)

Since the goal is to build a model that can take in thousands of raw text messages during a crisis and detect posts of people in need of emergency aid, there is a need to develop a data cleaning (or processing) pipeline and a machine learning pipeline for training the model.

ETL Pipeline

ETL is short for Extract-Transform-Load. An ETL pipeline represents a set of processes or steps for extracting data from a source (.csv file, in our case), transforming the data and loading it into a database.

Dataset overview

From the overview image above, it is clear that the categories column is a mix of strings (category names) and integers (class labels). With a custom ETL pipeline (built with Python), the raw text data in the format above can be transformed as seen below.

Extract data

# Import the raw data files
import pandas as pd

messages = pd.read_csv(messages_filepath)
categories = pd.read_csv(categories_filepath)

# Merge the two datasets on their shared key column
df = pd.merge(messages, categories)

Transform data

# Split the categories column into separate columns.
categories_table = df.categories.str.split(';', expand=True)
categories_table.columns = categories_table.iloc[0].apply(lambda x: x[0:-2])

# Clean class values for each category and cast as numeric.
for column in categories_table:
    categories_table[column] = categories_table[column].str[-1].astype('int')

# Replace the raw categories column with the expanded table.
df = pd.concat([df.drop(columns='categories'), categories_table], axis=1)

# Remove duplicates from the data.
df = df.drop_duplicates(subset='message')

# Filter out rows where the 'related' category has a non-binary class.
df = df[df['related'] != 2]

Load Data into an SQLite Database

# Create an SQL engine with the database name
from sqlalchemy import create_engine
engine = create_engine('sqlite:///' + database_filename)

# Load the cleaned data into the database, replacing the table
# if the defined name already exists.
df.to_sql(table_name, engine, index=False, if_exists='replace')

If you’re interested in giving the code a spin, please visit the project repo here. The end product of the ETL pipeline is a dataset ready to be fed into a model for training.

ETL product — “ML-ready” dataset

ML Pipeline

Although a complete machine learning pipeline can be said to consist of data preprocessing (ETL), modelling and deployment steps, I have separated the pipelines into logical chunks of modular Python scripts to make debugging easier.

The ML pipeline employed here is made up of four major steps: text vectorization (and tokenization), Term Frequency-Inverse Document Frequency (Tf-Idf) normalization, classifier training and hyperparameter tuning.

While the text vectorization step takes in our raw text messages and converts them into a matrix of token (individual word) counts, the Tf-Idf transformer takes in a matrix of token counts and transforms it into a normalised tf-idf representation.
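As a quick standalone illustration of those two steps (the sample messages here are made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["we need water and food", "water is flooding the streets"]

# Step 1: raw text -> matrix of token counts
counts = CountVectorizer().fit_transform(docs)

# Step 2: token counts -> normalised tf-idf representation
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.toarray().round(2))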

Source: filotechnologia.blogspot.com

For a classification problem like this, there are quite a few algorithms we can train and validate to determine the best-performing one. A Naive Bayes classifier, an AdaBoost classifier and a Random Forest classifier were all fitted to our transformed message data, and the Random Forest classifier slightly edged out the others in performance.

Source: freecodecamp.org

Scikit-learn provides an easy-to-use Pipeline class, which I took advantage of for this step.

Scikit-Learn Pipeline
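Since the pipeline itself is shown only as an image above, here is a minimal, runnable sketch of how such a pipeline could be assembled. The step names ('vect', 'tf_idf', 'multi_clf') match the hyperparameter grid in the next section; the toy data and default estimator settings are my own assumptions.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Vectorize -> tf-idf -> one Random Forest per aid category
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tf_idf', TfidfTransformer()),
    ('multi_clf', MultiOutputClassifier(RandomForestClassifier()))
])

# Toy fit on two made-up messages with two made-up categories,
# just to show the shapes involved.
X_toy = ["we need water and food", "the storm flooded our street"]
y_toy = [[1, 0], [0, 1]]
pipeline.fit(X_toy, y_toy)
print(pipeline.predict(["please send water"]))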

Hyper-Parameter Tuning

After the model has been trained in the ML pipeline, we can go a step further to improve the model performance by selecting the optimum hyperparameters for the model.

# Pipeline hyperparameter ranges to tune.
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_features': (None, 5000, 10000),
    'tf_idf__use_idf': (True, False),
    'multi_clf__estimator__min_samples_leaf': [1, 2],
    'multi_clf__estimator__n_estimators': [10, 20, 100],
    'multi_clf__estimator__max_depth': [None, 5, 10],
    'multi_clf__estimator__min_samples_split': [2, 3, 5]
}

By exhaustively searching over the specified hyperparameter grid, the model is retrained with the optimum parameters and evaluated. I used scikit-learn's GridSearchCV for this purpose.
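Wiring the grid search up is short; a sketch, assuming the pipeline and parameters objects above and a conventional training split (the fold count and other settings here are assumptions):

from sklearn.model_selection import GridSearchCV

# Exhaustively evaluate every parameter combination with cross-validation.
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3, n_jobs=-1)
cv.fit(X_train, y_train)

# Keep the best pipeline found by the search.
model = cv.best_estimator_
print(cv.best_params_)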

Model Evaluation

The performance of our classification model was evaluated using Precision, Recall and F1 scores. They are calculated mathematically as:

Evaluation Metrics
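For reference, the standard definitions, with TP, FP and FN denoting true positives, false positives and false negatives:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1-score = 2 * (precision * recall) / (precision + recall)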

where positive and negative represent 1 and 0 respectively for each of the 36 aid categories mentioned above. Below is a table and a plot showing the trained model performance across each category. Click here to see a more interactive metric plot.

36 Categories            precision    recall    f1-score
related                    0.83        0.96      0.89
request                    0.84        0.49      0.62
offer                      0.00        0.00      0.00
aid_related                0.78        0.67      0.72
medical_help               0.45        0.04      0.07
medical_products           0.76        0.07      0.14
search_and_rescue          0.73        0.08      0.14
security                   0.33        0.01      0.02
military                   0.38        0.02      0.04
child_alone                0.00        0.00      0.00
water                      0.85        0.45      0.58
food                       0.84        0.61      0.71
shelter                    0.79        0.40      0.53
clothing                   1.00        0.08      0.15
money                      0.60        0.03      0.05
missing_people             1.00        0.01      0.03
refugees                   0.67        0.04      0.08
death                      0.85        0.19      0.31
other_aid                  0.62        0.04      0.07
infrastructure_related     0.33        0.00      0.01
transport                  0.64        0.04      0.07
buildings                  0.75        0.16      0.26
electricity                0.67        0.02      0.04
tools                      0.00        0.00      0.00
hospitals                  0.00        0.00      0.00
shops                      0.00        0.00      0.00
aid_centers                0.00        0.00      0.00
other_infrastructure       0.00        0.00      0.00
weather_related            0.87        0.62      0.73
floods                     0.90        0.32      0.48
storm                      0.82        0.36      0.50
fire                       0.00        0.00      0.00
earthquake                 0.92        0.71      0.80
cold                       0.57        0.04      0.08
other_weather              0.50        0.03      0.06
direct_report              0.81        0.36      0.50
micro avg                  0.82        0.51      0.63
Model Metrics

Ultimately, which evaluation metric to emphasise (or prioritise) for each aid category depends on whether it is more costly to waste resources responding to wrongly flagged messages (precision) or to fail to detect messages from people who need help in an emergency (recall), or some balance of both (F1 score).
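For completeness, a report like the one above can be generated with scikit-learn's classification_report; a sketch, assuming X_test/y_test are the held-out split and category_names lists the 36 categories:

from sklearn.metrics import classification_report

# Predict all 36 binary labels for the held-out messages and
# summarise precision, recall and f1 per category.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=category_names))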

Data Imbalance
  • The bar plot above shows the label distribution (0 vs 1) for the five most balanced categories in the original dataset. Most of the other categories show a heavy imbalance between the positive (1) and negative (0) labels. For example, important categories like 'fire', 'missing_people', 'offer', 'hospitals', 'electricity' and 'transport' have very few positive messages (a quick way to check this is sketched below).
  • Even with stratified sampling embedded in the grid search cross-validation, which helps to maintain class weights, the model still fails miserably for these categories. As a result, we see extreme (very high or very low) precision and/or recall values for these categories in the table above.
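A one-liner is enough to expose the imbalance; a sketch, assuming df is the cleaned DataFrame from the ETL step and category_names lists the 36 category columns:

# Count the positive (1) labels per category and surface the rarest ones.
label_counts = df[category_names].sum().sort_values()
print(label_counts.head(10))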

All that’s left is to save the trained model for future use, and it is as easy as it sounds. Simply import the Python pickle module and dump the model.

import pickle

with open(model_filepath, 'wb') as f:
    pickle.dump(model, f)
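Loading the model back later is the mirror image:

import pickle

# Deserialise the saved model from disk.
with open(model_filepath, 'rb') as f:
    model = pickle.load(f)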

Note: Unless otherwise stated, all code in this work is written in Python 3. Several other dependencies, libraries and packages were used in the course of the project. If you’d love to dig a little deeper, kindly visit the repository on GitHub for all the scripts you will require (and please leave a star there ⭐️.)

If you’ve followed me up till this point, well done! 👍

Deployment

In a bid to breathe a bit of life into this project and tangibly showcase the detection capabilities of the trained model, I decided to build and deploy a web app around the model. I used the Flask framework to build the backend, while Bootstrap and jQuery were used to design the interface. If you would love to take it for a spin, click here to visit, at your own risk 😅.

Web App interface
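A minimal sketch of what such a Flask backend could look like; the route name, model path and response format here are my own assumptions (the actual app lives in the repo):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained pipeline once at startup (path assumed).
with open('models/classifier.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    message = request.form['message']
    # The pipeline vectorizes internally, so raw text goes straight in.
    labels = model.predict([message])[0]
    return jsonify({'message': message, 'labels': labels.tolist()})

if __name__ == '__main__':
    app.run(debug=True)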

Conclusion

I’m excited about the range of applications this solution could have in different localities across Nigeria. Millions of people post live updates and call out for help on social media platforms in Nigeria during crises and emergencies. With an overall (micro-averaged) F1 score of 0.63, this project opens up a world of possibilities for developing better-performing solutions for use in our communities.

The methods I applied here can be improved upon. For example, a more experienced reviewer recently advised me that an XGBoost classifier tuned with Bayesian optimization could give much better performance.

Thank you so much for taking the time to read through! If you also see the potential in a solution like this, please drop a clap👏 , leave a comment and most importantly, share, share and share👊!!!!

If you’d like to replicate this work, improve on the approach, or just can’t get enough of the technical details, please visit GitHub for the code and how to reuse it. Special thanks to the team at Udacity for their guidance and feedback during this work.

If you’d like to collaborate on other interesting projects or want to share some new insight, please reach me on LinkedIn, Twitter, or by mail at idowuodesanmi@gmail.com.
