Using Natural Language Processing to Analyze Sentiment Towards Big Tech Market Power

Harrison Zhang · The Startup · Aug 12, 2020


ABSTRACT
NLP (Natural Language Processing) is a branch of artificial intelligence geared towards allowing computers to interact with humans through an understanding of human (natural) languages. This study focuses on training an NLP model for a sentiment analysis of Big Tech policy by scraping and analyzing reactions to Big Tech articles linked on Reddit, using PRAW, a Reddit-specific web scraping API. Posts were scraped from the r/politics subreddit, a forum dedicated to the discussion of American politics. I found a somewhat substantial skew towards support for policies intended to inhibit Big Tech power.

MOTIVATION
In the wake of Congress's Big Tech hearing [1], many social media activists began to release anti-Big Tech posts and graphics, along with tangentially related blasts against billionaires and their role in wealth inequality. Yet every such post hosted a contentious comments section split between those against sweeping antitrust moves and those supportive of them. This prompted me to wonder what the true sentiment towards Big Tech's market power actually was.

METHODS
By web scraping Reddit with the PRAW API, a list of this year's top 100 articles about Big Tech was compiled from the r/politics subreddit. Since these articles all loosely involved policies intended to inhibit FAANG market power, using NLP to analyze the top-level comments on each post could provide an adequate representation of sentiment towards Big Tech. And because the subreddit is dedicated to discussing American politics, each comment could reasonably be inferred to carry a noticeable negative or positive sentiment towards a Big Tech policy.
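As a minimal sketch of this scraping step, here is roughly what the PRAW calls look like. The credentials, keyword filter, and variable names are illustrative assumptions on my part; the full annotated version lives in the Colab.

```python
import praw

# Hypothetical credentials; register a "script" app at reddit.com/prefs/apps.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="big-tech-sentiment-study",
)

# Pull this year's top r/politics posts and keep the Big Tech articles.
KEYWORDS = ("big tech", "amazon", "apple", "facebook", "google", "antitrust")
posts = []
for submission in reddit.subreddit("politics").top(time_filter="year", limit=None):
    if any(k in submission.title.lower() for k in KEYWORDS):
        posts.append(submission)
    if len(posts) == 100:
        break

# Gather each post's top-level comments along with their net vote scores.
comments_by_post = {}
for submission in posts:
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    comments_by_post[submission.id] = [(c.body, c.score) for c in submission.comments]
```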

Both the Reddit scraping and the machine learning model were coded in a single file in this Google Colab (the code can be run in the environment), which also provides more in-depth annotations alongside the code. All machine learning code was written using the TensorFlow library.

The model was trained on the TensorFlow IMDb dataset, an open-source collection of 50,000 movie reviews split into 25,000 reviews for training and 25,000 for validation, randomized automatically upon initialization. This dataset was assumed to transfer to the Reddit comment data because reviews and political discussion draw on a similar pool of opinionated words. To validate this assumption, I manually sampled and rated 20% of the Reddit posts and their comments, for comparison against the algorithm's predictions later on.
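For reference, the dataset can be loaded in one call with TensorFlow Datasets; this mirrors what the Colab does, though the exact loading code there may differ.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# imdb_reviews ships as 25,000 training and 25,000 test (validation) examples,
# each a (review_text, label) pair with 0 = negative and 1 = positive.
train_ds, val_ds = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    shuffle_files=True,
)

train_ds = train_ds.shuffle(10_000).batch(32).prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.batch(32).prefetch(tf.data.AUTOTUNE)
```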

After experimenting with training the neural network under a supervised learning setup and comparing it against the validation data, it was determined that a good model would be a sequential model with the layers shown in Fig. 1.

This model ultimately consisted of 1,356,841 trainable parameters (see Fig. 1.). The binary_crossentropy loss function was used because the intention was to categorize each comment as either a negative or a positive reaction to the article. The adam optimizer was chosen because it works particularly well with NLP models, and accuracy was used as the sole evaluation metric.
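Fig. 1 holds the exact layer summary, which is not reproduced here. The sketch below shows a plausible sequential model in roughly this parameter range; the vocabulary size, embedding width, and recurrent layer are my assumptions, not the study's confirmed architecture. It continues from the loading sketch above.

```python
import tensorflow as tf

VOCAB_SIZE = 10_000  # assumed; the true vocabulary size is part of Fig. 1
EMBED_DIM = 128      # assumed embedding width

# Map raw review strings to integer token sequences inside the model itself.
vectorize = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
vectorize.adapt(train_ds.map(lambda text, label: text))

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sentiment score in [0, 1]
])

model.compile(
    loss="binary_crossentropy",  # binary negative/positive classification
    optimizer="adam",
    metrics=["accuracy"],
)
```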

During testing, the model was run for 10 epochs, recording accuracy, val_accuracy, loss, and val_loss at each epoch. The curves were graphed using the matplotlib library (see Fig. 2.).
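A minimal version of that training-and-plotting loop, continuing the sketch above, might read:

```python
import matplotlib.pyplot as plt

history = model.fit(train_ds, validation_data=val_ds, epochs=10)

# Recreate the Fig. 2-style curves: accuracy and loss versus epochs.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="accuracy")
ax1.plot(history.history["val_accuracy"], label="val_accuracy")
ax1.set_xlabel("epoch")
ax1.legend()
ax2.plot(history.history["loss"], label="loss")
ax2.plot(history.history["val_loss"], label="val_loss")
ax2.set_xlabel("epoch")
ax2.legend()
plt.show()
```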

Comparing the accuracy and loss versus epochs graphs (see Fig. 2.) made it evident that maximizing val_accuracy while minimizing val_loss would require an early-stopping callback. Since val_accuracy had roughly plateaued by 0.93 and val_loss had hit its minimum, training should stop around that point: beyond 0.93 accuracy, val_loss would climb, signaling a risk of overfitting, for little to no gain in val_accuracy (and possibly a decrease). Thus, a callback was written to halt training once the 0.93 accuracy benchmark was hit. Depending on the training run, the benchmark could be hit anywhere between 4 epochs (see Fig. 2. (a)) and 10 epochs (see Fig. 2. (b)).
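A callback along these lines would do it. The study does not state whether it monitored training or validation accuracy; this sketch watches val_accuracy, matching the plateau described above.

```python
class StopAtBenchmark(tf.keras.callbacks.Callback):
    """Stop training once validation accuracy reaches the 0.93 benchmark."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get("val_accuracy", 0.0) >= 0.93:
            print(f"\nHit the 0.93 benchmark at epoch {epoch + 1}; stopping.")
            self.model.stop_training = True

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[StopAtBenchmark()],
)
```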

DATA
After running each Reddit comment through the trained model, each comment was assigned a predicted rating (visible in the Colab code). The model returns a sentiment on a scale of 0 to 1, with 0 being a fully negative sentiment, 0.5 neutral, and 1 a fully positive sentiment. The raw floats for the comment ratings are somewhat difficult to pick out of the rest of the console output; however, I compiled a list of overall post ratings, derived from the weighted average of the comment ratings within each post, that can be printed with relative readability (see Table 1.).
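Continuing the earlier sketches, scoring the scraped comments would look roughly like this; the comments_by_post mapping comes from the hypothetical scraping snippet above.

```python
import tensorflow as tf

# Each comment gets a float in [0, 1]: 0 = fully negative, 0.5 = neutral,
# 1 = fully positive. Posts without comments get a bare 0 placeholder,
# mirroring Table 1.
comment_ratings = {}
for post_id, comments in comments_by_post.items():
    if not comments:
        comment_ratings[post_id] = 0
        continue
    texts = [body for body, votes in comments]
    preds = model.predict(tf.constant(texts)).flatten()
    comment_ratings[post_id] = [
        [float(p), votes] for p, (body, votes) in zip(preds, comments)
    ]
```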

DATA ANALYSIS
Each post rating was computed as a weighted average of the individually rated comments within that post. The weight was each comment's scraped net upvote score, Reddit's mechanism whereby users either downvote (-1) or upvote (+1) a comment. There is no floor on net upvotes, so a comment's score can go negative. This weighting penalizes widely rejected sentiments and lets well-respected comments carry more influence.
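A sketch of that comment-level weighting, reusing the ratings dictionary from the snippet above:

```python
def weighted_mean(values, weights):
    """Average of values weighted by (possibly negative) net vote counts.

    Assumes the total weight is nonzero.
    """
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Collapse each post's comment ratings into one vote-weighted post rating.
post_ratings = {}
for post_id, rated in comment_ratings.items():
    if rated == 0:  # placeholder: no comments were scraped
        post_ratings[post_id] = 0
        continue
    sentiments = [s for s, votes in rated]
    votes = [votes for s, votes in rated]
    post_ratings[post_id] = weighted_mean(sentiments, votes)
```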

After adding each of the post ratings to a post_rating list, the list was iterated through again to create another weighted average, this time based on the votes on each post. Since Reddit upvotes on news posts tend to correlate with article exposure, this helps adjust for the difference in sample size between more popular and less popular posts.

As the post_rating list was passed through these iterations, a few data points were deleted: those containing only a 0 rather than a nested list (see Table 1.). The nested list identified posts that could be rated, while a bare 0 marked posts with no comments to scrape; the latter were left out of the final calculations.

The negative post ratings also had to be filtered out (see Fig. 3. (a)). These arose when a post's comments were controversial enough to accumulate negative net vote scores, which flipped the sign of the post sentiment during the weighting multiplication. Since these negative values could not be accurately mapped to a specific sentiment intensity for the opposite position, the three affected posts were removed from the pool.
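In code, that cleanup might look like the following, continuing the sketch:

```python
# Keep only posts with a usable rating.
usable = {}
for post_id, rating in post_ratings.items():
    if rating == 0:  # placeholder: no comments to scrape (Table 1)
        continue
    if rating < 0:   # net-downvoted comment pools flipped the sign (Fig. 3 (a))
        continue
    usable[post_id] = rating
```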

Although the algorithm output a weighted sentiment rating per post, the value's meaning was still ambiguous: it only captured the predicted sentiment towards the article, not towards Big Tech itself. I therefore went back through each of the 100 articles and manually skimmed the headlines to confirm which side of the issue each article took. Since the articles generally covered anti-Big Tech policies, the returned values were subtracted from 1. This subtraction swaps the sides, so that support for anti-Big Tech policy sits at the negative (0) end of the scale and opposition to it sits at the positive (1) end (see Fig. 3. (b)).
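The scale flip and the final exposure-weighted average over posts (the second weighting described two paragraphs up) can then be sketched as follows; the posts list comes from the hypothetical scraping snippet.

```python
# Flip the scale: 0 = supports anti-Big Tech policy, 1 = opposes it.
flipped = {post_id: 1 - rating for post_id, rating in usable.items()}

# Weight each post by its net upvotes, a rough proxy for article exposure.
post_votes = {s.id: s.score for s in posts}
overall = weighted_mean(
    list(flipped.values()),
    [post_votes[post_id] for post_id in flipped],
)
print(round(overall, 4))  # the study reports approximately 0.4032
```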

After this data manipulation, the algorithm produced a sentiment score of approximately 0.4032: a substantial, but not overwhelming, negative sentiment towards Big Tech market power.

CONCLUSION
From this study, it can be concluded that there is currently a substantial anti-Big Tech sentiment. However, if there were a way to effectively convert the aggressively downvoted data points into a scaled opposite sentiment, the study could more accurately reflect the true Big Tech sentiment in America.

A limitation to note is Reddit’s young, left-leaning political skew. In 2016, Barthel [2] found that 47% of Reddit users identify as liberals compared to the estimated 24% across US adults. Scraping other forums to compare the sentiment may be an interesting next step.

As a sanity check for the model, I had manually sampled 20% of the posts before running the algorithm, scoring every fifth post by hand; these scores averaged out to approximately 0.3031. Since that falls reasonably close to the model's 0.4032 prediction, the model was judged to be a reasonably accurate, albeit still imperfect, fit.
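For concreteness, that every-fifth-post sample can be drawn with simple slicing, assuming the posts sit in a list as in the scraping sketch:

```python
# Every fifth post: a 20-post (20%) systematic sample of the 100 scraped posts.
hand_sample = posts[::5]
```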

Provided that the necessary logistics are not an issue, a further step in the study could entail either compiling a separate, subreddit-specific labeled dataset to improve the model or training a model in an unsupervised learning setting.

REFERENCES
[1]: Kang, Cecilia, and David McCabe. “Lawmakers, United in Their Ire, Lash Out at Big Tech’s Leaders.” The New York Times, 29 July 2020, www.nytimes.com/2020/07/29/technology/big-tech-hearing-apple-amazon-facebook-google.html.

[2]: Barthel, Michael. “How the 2016 Presidential Campaign Is Being Discussed on Reddit.” Pew Research Center, 26 May 2016, www.pewresearch.org/fact-tank/2016/05/26/how-the-2016-presidential-campaign-is-being-discussed-on-reddit/.

ACKNOWLEDGEMENTS
I would like to thank Kevin Trickey for his help refining the study design, Fang Wang for her aid in the model validation process and data science expertise, and Kevin Chen for his research insight.
