Untangling Cluttered Data for Visual Analysis: A Walk Through NLP Feature Engineering

Building an NLP pipeline for feature extraction within eight weeks

Pratyush Priyadarshi
Omdena
6 min read · Jul 9, 2021


Authors: Gagan Bhatia and Pratyush Priyadarshi


A walkthrough of a Natural Language Processing (NLP) pipeline for feature extraction, scraping Twitter, Google, and more than 1,200 PDF files through automated APIs. The approach allowed us to gather several billion dollars' worth of not-for-profit grant data across six countries for further NLP analysis. Finally, the team built an interactive dashboard visualizing the distribution of the grants.

This article was originally published on Omdena’s blog.

To read more articles on NLP applications and How-to’s, check here.

Problem Statement

Every year, governments, philanthropies, the private sector, and other grantmakers from across the globe allocate a significant portion of their budgets to grants aimed at furthering a variety of causes. Despite what seems like an abundance of funds flowing through the social sector, many not-for-profits suffer from a lack of resources. A significant reason is the lack of transparency of grant information.

It is estimated that up to 80–90 billion Australian dollars in grants are disbursed each year.

Our Community is a social enterprise that provides information, tools, and advice to thousands of social sector organizations to support their crucial work of building stronger communities. Our Community’s Innovation Lab, together with Omdena, took on the challenging task of tackling unstructured information and building solutions that would facilitate positive social change for NGOs in need.

Traditional methods to find and monitor grants are time-consuming, expensive, and limited. The aim of this project is to help get money flowing between grantmakers (funders) and the not-for-profit sector, providing the necessary capital to enable positive social change.

BUT wait for it, where is the data?

As with every problem that tries to leverage Artificial Intelligence for a viable solution, a lack of data can hamper any real progress.

Data is growing at an astonishing rate every minute, and the majority of it is unstructured. Historically, unstructured data has been ignored because of the complexity involved in dealing with it, but since most human information is embedded in this form, it can no longer be ignored.

That’s where Natural Language Processing comes into the picture: a subfield of artificial intelligence that enables computers to understand and interpret human language.

In this article, we will focus mainly on how we generated data from various unstructured sources.


How we got the unstructured data

Major chunks of the data were stored in PDF format; they held valuable insights and therefore couldn’t be ignored.

In our case, we ended up with more than 1,200 PDF files to download and scrape from various websites. Doing this manually would have been cumbersome, so we decided to automate the entire process by designing a microservice that used RESTful APIs under the hood.

We leveraged the Flask framework to develop the RESTful APIs, which were then deployed on AWS EC2 as a containerized service using Docker. A CSV file containing links to all the PDF files is uploaded to the service, the PDFs are automatically downloaded onto the EC2 instance and parsed using the GROBID service, and the parsed data from all the PDFs is collated into a single file that is uploaded to AWS S3.


That’s the great thing about microservices: they are standalone programs that can be developed and readily deployed for different users. We wanted our code to be reusable, so building the microservice around RESTful APIs was a no-brainer. Flask is a Python-based micro-framework that comes in handy when you need to quickly develop small web applications.
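
To make the flow concrete, here is a minimal sketch of what such a Flask endpoint could look like. The route name, the GROBID URL, and the S3 bucket are illustrative assumptions, not the project’s actual code.

```python
import csv
import io

import boto3
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # assumed GROBID deployment
S3_BUCKET = "grants-parsed-data"  # hypothetical bucket name


@app.route("/process-pdfs", methods=["POST"])
def process_pdfs():
    """Accept a CSV of PDF links, download and parse each PDF, collate the output to S3."""
    csv_file = request.files["links"]
    reader = csv.reader(io.StringIO(csv_file.read().decode("utf-8")))

    parsed_docs = []
    for row in reader:
        pdf_url = row[0]
        pdf_bytes = requests.get(pdf_url, timeout=60).content         # download the PDF
        tei = requests.post(GROBID_URL, files={"input": pdf_bytes})   # parse it with GROBID
        parsed_docs.append(tei.text)

    # Collate all parsed documents into one file and upload it to S3.
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key="parsed/collated.xml",
        Body="\n".join(parsed_docs).encode("utf-8"),
    )
    return jsonify({"pdfs_parsed": len(parsed_docs)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```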

By this stage, it had become apparent that Google search was our go-to source, so we certainly couldn’t stop there. We decided to automate the scraping process to collect all data returned by Google searches on certain keywords, which we achieved through Apify.
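
A hedged sketch of such a scrape with Apify’s Python client is shown below; the actor name follows Apify’s public Google Search Results Scraper, and the token and query are placeholders.

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

# Run Apify's Google Search Results Scraper actor on our keywords.
run = client.actor("apify/google-search-scraper").call(
    run_input={"queries": "not-for-profit grants Australia"}
)

# Each dataset item holds the results returned for one query.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for result in item.get("organicResults", []):
        print(result.get("title"), result.get("url"))
```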

After this, we certainly couldn’t ignore Twitter — after all, some of the major action is taking place on that platform. We decided to scrape relevant data from Twitter as well.
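
The article doesn’t pin down the Twitter tooling; one possible sketch uses Tweepy against the Twitter API v2, with the bearer token and query as placeholders.

```python
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# Pull recent tweets about grant funding, excluding retweets.
response = client.search_recent_tweets(
    query="grant funding not-for-profit -is:retweet",
    tweet_fields=["created_at", "author_id"],
    max_results=100,
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text)
```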

Once we had our unstructured data in a single place, the next challenge was feature engineering: preprocessing the data and extracting features for further analysis. For this task, we brought in NLTK and spaCy, two powerful NLP libraries for such use cases.

spaCy comes loaded with named entity recognition (NER) and part-of-speech (POS) tagging. NER locates named entities in unstructured text and classifies them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. Through spaCy’s NER we identified major features, including the grantmaker countries and the money being awarded. This became our final dataset for visualization.
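
A minimal spaCy sketch of that NER step (the model name and the sample sentence are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

text = "The foundation awarded $2.5 million in grants to charities across Australia."
doc = nlp(text)

# Keep only the entity types that map to our features: places and monetary values.
features = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ("GPE", "MONEY")]
print(features)  # e.g. [('$2.5 million', 'MONEY'), ('Australia', 'GPE')]
```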

Now finally, the NLP analysis and visualization

After the cumbersome process of feature engineering, we finally reached the point of visualizing our structured data in a dashboard. We created the dashboard using Streamlit and Plotly. To visualize the dataframe created after feature extraction, we displayed the dataset with interactive features to help users understand the data better.
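
A skeleton of such a Streamlit + Plotly app is shown below; the file name and the "country"/"amount" column names are assumptions about the extracted dataset, not the project’s actual schema.

```python
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Distribution of grants")

df = pd.read_csv("extracted_grants.csv")  # hypothetical output of the NER step

# Show the raw dataframe so users can sort and filter it interactively.
st.dataframe(df)

# Bar chart of total grant money per country.
totals = df.groupby("country", as_index=False)["amount"].sum()
fig = px.bar(totals, x="country", y="amount", title="Grant money by country")
st.plotly_chart(fig)
```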

In addition to processing unstructured data using NLP, the team wrote numerous site-specific scrapers to extract data from the web that was already structured in tables.
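
For pages that already publish grants as HTML tables, such a site-specific scraper can be as simple as pandas.read_html; the URL here is a placeholder.

```python
import pandas as pd

# read_html returns one DataFrame per <table> found on the page.
tables = pd.read_html("https://example.org/grants-register")
grants_table = tables[0]
print(grants_table.head())
```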

The overall approach allowed us to gather several billion dollars' worth of grant data for analysis across six countries.

Extracted Feature set from Apify

To give you a little flavor of the final dashboard, we will leave you with a screenshot.

Interactive dashboard showing the distribution of grants

Recognition

This project was made possible by 50 technology changemakers who built solutions over eight weeks to facilitate positive social change for NGOs in need. A special thanks to Our Community who gave us the opportunity to use our AI skills for good.

A huge shout out to our task managers and all the collaborators.

Join Omdena’s high-impact projects and build your experience.
