Using NLP to enrich findings in the social sciences

Unique research collaborations at UC Berkeley

Keyan Nasseri
8 min read · May 6, 2020
A comparison of traditional and DL-based approaches to common NLP tasks (source: DataJango)

The internet generates a staggering amount of text — every day, thousands of news articles, hundreds of millions of Tweets, and billions of words total are produced and published online. For researchers in both academia and industry, this wealth of information presents both an opportunity and a challenge.

Text data’s utility has been widely demonstrated across fields: hedge funds analyze financial news in quantitative models of asset prices, political scientists aggregate social media posts to study the extent of government censorship, and economists use text from news outlets to study the drivers and effects of political slant. At the same time, extracting meaningful results from text data is not a simple process. Language is not only inherently varied across individuals, and thus difficult to standardize, but it also lacks a natural numerical representation. In other words, there is no obvious way to featurize text for analysis in a model — this is, of course, one of the primary focuses of NLP.

It follows that using text data effectively in the social sciences requires integrating computational techniques from NLP with traditional research methods. Cyrus Dioun, a fellow at the Berkeley Institute for Data Science and a management professor at the University of Colorado, Denver, is an industrial organization researcher doing exactly that. Dioun leverages text data in his research with the help of UC Berkeley computer science students, who implement the crucial NLP tasks that make such data useful.

Over the past couple of months, we (Abrar Rahman and myself) have had the opportunity to work with Professor Dioun on two of his current projects, one investigating the impact of a NASA aviation incident reporting program, and the other studying consumer preferences and firm behavior in the nascent, high-growth legal cannabis industry. In both cases, we contributed by implementing NLP algorithms in Python which transform large amounts of raw text data into useful explanatory variables (features) in a social science regression model.

Storefront identification and tracking for cannabis industry dataset

The rapidly-growing legal cannabis market is already worth over $13 billion, but consumer preferences and firm behavior in the space are not well understood (source: Statista)

The cannabis industry in the US has been in a legal grey area since its inception, making it a fascinating topic for researchers studying how emerging markets grow and develop. Though cannabis has some level of legality (whether through decriminalization, legal medical use, or regulated recreational use) in all but 11 states, it is still illegal at the federal level. Meanwhile, in the midst of the COVID-19 pandemic, some states have gone as far as declaring cannabis an “essential” industry, while Democratic candidate Joe Biden has made full federal decriminalization a part of his platform.

One of the major goals of our cannabis industry research is to study behavior at the storefront level, where ‘storefront’ refers to one specific retail outlet, typically either a dispensary or delivery service. For example, we might want to identify the factors that predict businesses’ success or failure in the market, find the rate at which consolidation is taking place, or study the dichotomy between innovation and imitation across competitors. In order to answer questions like these, we need to be able to track unique cannabis storefronts over time.

Cannabis product data

In order to begin our analysis of the cannabis markets, we turned to online listings where local dispensaries, brands, and delivery services display their information. The dataset we scraped included product info and a host of additional fields, as shown below.

The list of fields included in the raw dataset

The raw data contained a number of challenging artifacts which hindered our analysis efforts, so we further cleaned the dataset. For instance, phone numbers were listed in a variety of formats, from +1 (XXX) XXX-XXXX to XXXXXXXXXX and everything in between; we wrote a short Python script using regular expressions (regex) to resolve them all to a single standardized format. In addition, key pieces of information such as company name or email, essential for extracting any meaningful insights from the data, were often missing. In those cases, we simply dropped the affected rows from the dataset.
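
To give a sense of what this cleanup can look like in practice, here is a minimal sketch of regex-based phone normalization and row filtering. The column names (phone, company_name, email) and the file name are illustrative assumptions, not the dataset’s actual schema.

```python
import re
import pandas as pd

def normalize_phone(raw):
    """Reduce a US phone number to its 10 digits, e.g. '+1 (510) 555-0123' -> '5105550123'."""
    if pd.isna(raw):
        return None
    digits = re.sub(r"\D", "", str(raw))          # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                       # strip a leading US country code
    return digits if len(digits) == 10 else None

# Illustrative column and file names.
df = pd.read_csv("cannabis_listings.csv")
df["phone"] = df["phone"].apply(normalize_phone)

# Drop rows missing fields that are essential for matching storefronts later.
df = df.dropna(subset=["company_name", "email"])
```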

Further, the data were collected in 9 waves, each months apart, a design that is crucial from an econometric perspective because it allows the industry to be analyzed over time. This necessitated additional preprocessing steps focused on reconciling data across waves, which are explained below.

Storefront movement algorithm

As mentioned previously, one of our key objectives was to create a list of storefronts from the loosely-assembled dataset of product descriptions, as this was a key prerequisite for any future research. Note that the data were often inconsistent in certain key details: addresses would shift, businesses would change their names, phone numbers would be rotated. This is somewhat expected considering the young age of the industry, but we also suspected that some of this movement could be attributed to fraudulent storefronts getting caught operating without licenses, only to rebrand and reopen to continue the scheme. Thus, we could not rely solely on the name of the business to match storefronts across our 9 waves of data.

Diagram of storefront movement algorithm

To address this, we developed a comprehensive storefront movement algorithm (SMA), which initially groups the product descriptions by address and phone number (the assumption being that observations with the same address or phone number are extremely likely to be from the same storefront). Each one of these groupings is assigned a unique ID.
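
A simplified sketch of this initial grouping step is below. It treats “same address or same phone number” as a connected-components problem, so observations sharing either field end up under one ID; the column names and the use of a union-find structure are assumptions about the implementation, not a literal transcription of it.

```python
import pandas as pd

def assign_storefront_ids(df):
    """Group rows that share an address OR a phone number and give each group a storefront ID."""
    parent = {}

    def find(key):
        # Walk up to the root of the group, compressing the path as we go.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    def union(a, b):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    # Link each row's address key to its phone key, so shared values chain rows together.
    for _, row in df.iterrows():
        union(("addr", row["address"]), ("phone", row["phone"]))

    # Map each group's root to a sequential storefront ID.
    roots, ids = {}, []
    for _, row in df.iterrows():
        root = find(("addr", row["address"]))
        ids.append(roots.setdefault(root, len(roots)))

    df = df.copy()
    df["storefront_id"] = ids
    return df
```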

After this first step, SMA uses a matching system we developed based on a number of different fields (phone, address, URL, and email): the number of fields that match is used to link a storefront with the previous wave. In other words, an entry in wave i is assigned the unique ID of the storefront in wave i-1 with which it has the most fields in common (or fewest changes). If an entry’s fields do not match those of any previous entry, we create a new unique storefront ID.
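
The wave-linking logic can be sketched roughly as follows. The field list mirrors the one above, but the function name and data layout (a list of dicts per wave) are illustrative rather than our exact implementation.

```python
# Fields compared between an entry and the storefronts seen in the previous wave.
MATCH_FIELDS = ["phone", "address", "url", "email"]

def link_to_previous_wave(entry, previous_wave, next_id):
    """Return (storefront_id, next_unused_id) for `entry`.

    `previous_wave` is a list of dicts, each a prior entry carrying its 'storefront_id'.
    If no field matches any previous entry, a brand-new ID is issued.
    """
    best_id, best_matches = None, 0
    for prev in previous_wave:
        matches = sum(
            1 for field in MATCH_FIELDS
            if entry.get(field) and entry.get(field) == prev.get(field)
        )
        if matches > best_matches:
            best_id, best_matches = prev["storefront_id"], matches

    if best_matches == 0:
        return next_id, next_id + 1   # nothing in common: start a new storefront
    return best_id, next_id           # reuse the best-matching previous storefront's ID
```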

Implementation and results

Example of a unique storefront the algorithm identified despite a change in name and address

SMA ended up doing a very good job of tracking storefronts over time despite inconsistencies in name, address, and other key fields. One illustrative example is included above — though this Northern California storefront changes its name and address between waves (note the 6-month difference in ‘access_date’ between the left and right entries), SMA correctly identifies it as one unique business. Upon visual inspection of the output, we saw this behavior for hundreds of different dispensaries and delivery services.

To go deeper into the implementation, above is an example intermediate output from SMA which illustrates the importance of the multiple-field matching strategy. The numbers on the right-hand side indicate which fields line up with different storefronts in the previous wave, and on the left-hand side are the unique IDs of those previous storefronts. The underlined example is the only one with no matches to the previous wave of data, so it is the only new unique ID created in the sample shown — the rest were carried over from previous waves.

Cosine similarity for NASA aviation incident reports

ASRS dataset

An example of a confidential ASRS report (source: Michael Dorneich)

The Aviation Safety Reporting System, or ASRS, is a program jointly administered by NASA and the FAA that allows pilots to anonymously report close-call aviation events that do not result in an accident, highlighting potential safety issues without fear of reprisal. The overall goal of our aviation project is to assess how these reports impact safety practices at airports; to that end, we were tasked with algorithmically evaluating the similarity of temporally adjacent reports using cosine similarity.

Cosine similarity

Cosine similarity is a straightforward method for computing the similarity of two vectors by taking the cosine of the angle between them (equivalently, their normalized inner product). It is often used to compare vectorized forms of documents since it produces an interpretable similarity value between -1 (least similar) and 1 (identical), as opposed to metrics like Euclidean distance, which can take unbounded values.
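
For concreteness, here is a tiny NumPy sketch of the computation on two toy word-count vectors (scikit-learn’s cosine_similarity does the same thing on whole matrices); the example vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy word-count vectors over the same four-word vocabulary.
doc1 = [2, 0, 1, 1]
doc2 = [1, 1, 0, 1]
print(cosine_similarity(doc1, doc2))  # ~0.71: the documents are fairly similar
```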

Text representation

A diagram of the doc2vec algorithm, which generates a numerical, vectorized representation of a document (source: Gidi Shperber)

However, to compute similarity, the ASRS reports first must be represented in a numerical, vectorized format. We used three approaches to text representation: doc2vec, tf-idf, and Bag of Words (BOW). BOW counts the occurrences of each word in a text and forms a vector of these frequencies, tf-idf additionally downweights words that appear across many documents, and doc2vec learns a dense representation using a shallow neural network. The links above provide gentle introductions to each of these methods.
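
As a rough sketch of how these three representations can be produced with scikit-learn and gensim (the two example reports are placeholders, and parameters like vector_size are arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

reports = [
    "aircraft climbed through its assigned altitude",
    "pilot reported an altitude deviation during climb",
]

# Bag of Words: raw word counts per report.
bow_vectors = CountVectorizer().fit_transform(reports)

# tf-idf: word counts reweighted by how rare each word is across the corpus.
tfidf_vectors = TfidfVectorizer().fit_transform(reports)

# doc2vec: a learned dense embedding per report (gensim 4.x API).
tagged = [TaggedDocument(words=r.split(), tags=[i]) for i, r in enumerate(reports)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
doc2vec_vectors = [model.dv[i] for i in range(len(reports))]
```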

Implementation and results

We use pandas, sklearn, nltk, and gensim in our implementation. Before vectorizing, we preprocess the report text by stemming and lemmatizing, among other cleaning steps, which ensures that slightly different forms of the same word (e.g. “climbs” and “climbing”) are not treated as separate words. We also serialize the results of text vectorization in order to save memory.
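
A hedged sketch of what these preprocessing and caching steps can look like with nltk and pickle is below; the tokenizer, the specific stemmer and lemmatizer, and the file path are assumptions rather than necessarily the exact choices we made.

```python
import pickle
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") and nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, lemmatize, then stem so forms like 'climbs' and 'climbing' collapse together."""
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(lemmatizer.lemmatize(tok)) for tok in tokens if tok.isalpha()]

# Cache the vectorization results so later runs can reload them instead of recomputing.
def save_vectors(vectors, path="vectors.pkl"):
    with open(path, "wb") as f:
        pickle.dump(vectors, f)

def load_vectors(path="vectors.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```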

For a comparison of how two given text vectorization approaches (doc2vec/tf-idf/BOW) could differ from each other, see the below boxplots of the distribution of calculated cosine similarity values on the dataset.

Side-by-side comparison of similarity results using Bag of Words and tf-idf text representation approaches

As expected, the approaches differ significantly. Because the BOW approach does not take into account how frequent words are across the entire dataset (which tf-idf does), its similarity values are spread over a far wider range, which also leaves fewer points flagged as outliers. In comparison, tf-idf has a much narrower distribution but, as a result, more outliers. Note that in both cases, we do not see any negative values; this is because BOW and tf-idf generate vectors with non-negative entries, since they essentially count word frequency (which cannot be negative).

Next steps

The technical work that we have contributed to the ASRS and cannabis projects sets the stage for future research efforts. With cosine similarity and storefront tracking implemented, both can move forward: similarity values are now ready to be used as exogenous variables in a model of aviation report impact, while the previously discussed firm-level analyses can now be applied to the cannabis dataset. These two projects will add to our understanding of aviation safety, a crucial concern for the 50 percent of Americans who fly at least once a year, as well as our understanding of the unique ways in which nascent and/or stigmatized markets behave, with important implications for entrepreneurs, investors, and regulatory bodies.
