Hi! This post was born as a means to fulfill a homework assignment on putting various AWS products to use in a practical setting. A simple step-by-step guide would have sufficed, but that would probably have been more dreadful both to write & to read. Anyhow, when thinking of use-cases for speech recognition, translation, or comprehension, you would be forgiven for letting your imagination run wild.
Indeed, using AWS translation & comprehension/sentiment analysis services on various news websites’ articles to compare sentiments came to my mind. Being more practical myself, & generally feeling the need to come up with something offering a better return on invested effort, I looked for something more immediately relevant to my professional circumstances:
I’m a 26-year-old MSc Business Analytics student at CEU, with an undergrad in Business Administration in Hotel Management & 5 years of experience in that field. As a relative newbie to Data Science, Analytics & all that, 2 things are constantly on my mind: a. finding employment in the field & b. trying to gain a level of domain/market understanding similar to what I came to have with hotels.
Obviously, 1 project is not going to achieve either, but I think I might just be able to put my web-scraping skills to the test & incorporate Amazon Comprehend into the workflow, to build something that helps me get a sense of what employers in this field are like, based on what employees are posting on a major employment website, Glassdoor. For the purposes of this post I’ll focus on the IT sector, though obviously condensing Data Science & Analytics down to 1 sector is rather limiting.
The generalized use-case is figuring out how a company is perceived by its employees. You’d be wise to say the approach below does not answer this question in itself; after all, Amazon Comprehend can ‘only’ categorize reviews by their detected sentiment & detect key phrases within each review. I will illustrate this with 1000 reviews of the 10 best-ranked Tech companies (100 reviews each), per Glassdoor’s company ranking, having scraped their Companies & Reviews sections.
Scraping Glassdoor & Using Amazon Comprehend in R
Unaware of any APIs Glassdoor is hosting & hoping to expand upon this project in the near future, I created 2 functions in R. The 1st gets top-ranked companies’ websites for a given industry sector of interest, using the httr package.
The 2nd collects reviews from a chosen number of pages per company, using rvest, the SelectorGadget browser extension & loops. For illustration, I used 100 reviews for each of 10 companies. A review on Glassdoor consists of a general comment plus 1 Pro & 1 Con of working for the company. Being a student seeking Data Science & Analytics jobs in the near future, I plan on writing functions to scrape salary & jobs data too. I’ll post them on my GitHub account for your review for sure.
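As a rough illustration of the parsing step, here is a minimal sketch that pulls the 3 parts of each review out of an already-downloaded page with rvest (version 1.0.0 or later is assumed). The function name parse_reviews() & the CSS selectors (.review, .comment, .pros, .cons) are made-up placeholders, not Glassdoor’s real markup; you would swap in whatever SelectorGadget shows you. The toy HTML string at the end is likewise invented, only to show the call.

```r
library(rvest)

# Extract comment, pros & cons from one parsed review page.
# NOTE: all four selectors are hypothetical placeholders.
parse_reviews <- function(page,
                          review_css  = ".review",
                          comment_css = ".comment",
                          pros_css    = ".pros",
                          cons_css    = ".cons") {
  reviews <- html_elements(page, review_css)
  # html_element() returns one match (or a missing node) per review,
  # so the three columns stay aligned even if a part is absent.
  data.frame(
    Comment = html_text(html_element(reviews, comment_css), trim = TRUE),
    Pros    = html_text(html_element(reviews, pros_css), trim = TRUE),
    Cons    = html_text(html_element(reviews, cons_css), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Invented toy page, just to demonstrate the call
toy_page <- read_html(paste0(
  '<div class="review"><p class="comment">Great place</p>',
  '<p class="pros">Benefits</p><p class="cons">Long hours</p></div>'
))
parse_reviews(toy_page)
```

When looping over several pages per company, a Sys.sleep() of a few seconds between requests is a cheap way to lower the chance of getting blocked.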
Having obtained the Glassdoor data (& hopefully not having been blocked), you still need an AWS account, which takes ca. 10 minutes to set up. From there, create an access key on the Identity & Access Management (IAM) console for secure API interaction & save it to your R project folder. In R, install & load the aws.comprehend library & provide your newly generated access key to be able to use it. See the snippet below & do check out cloudyr’s GitHub if you’re interested in further details.
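One possible way to hand the key over, sketched under a few assumptions: the key was downloaded as a CSV with the columns "Access key ID" & "Secret access key", & the cloudyr packages pick credentials up from the standard AWS environment variables. The helper name set_aws_credentials(), the file name & the default region are illustrative choices, not part of aws.comprehend.

```r
# Read an IAM access key from a CSV file & expose it through the
# standard AWS environment variables.
# ASSUMPTION: the CSV has "Access key ID" & "Secret access key" columns.
set_aws_credentials <- function(csv_path, region = "us-east-1") {
  keys <- read.csv(csv_path,
                   check.names = FALSE,          # keep spaces in names
                   stringsAsFactors = FALSE,
                   fileEncoding = "UTF-8-BOM")   # drop a BOM if present
  Sys.setenv(
    AWS_ACCESS_KEY_ID     = keys[["Access key ID"]][1],
    AWS_SECRET_ACCESS_KEY = keys[["Secret access key"]][1],
    AWS_DEFAULT_REGION    = region
  )
  invisible(TRUE)
}

# Typical use (file name is hypothetical):
# install.packages("aws.comprehend")   # once
# library(aws.comprehend)
# set_aws_credentials("accessKeys.csv")
```

Keeping the key in a file that is excluded from version control, rather than pasting it into the script, avoids publishing it by accident.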
Okay, so with the needed reviews obtained & AWS set up, we can go forward. To obtain sentiments, I simply looped through the review comments, using the detect_sentiment() function. You want to be cautious about the amount of data you feed to AWS functions at once, because you might get an error. Since the output is 5 columns, I stored it in a temporary list & used rbindlist() from the data.table library to obtain a structured table, then cbind-ed company names & the specific comment at the end. From here, I looped through this data frame again to keep only the confidence score of the sentiment that was actually assigned, so I could remove the Mixed, Positive, Negative & Neutral columns & narrow my data:
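A minimal sketch of that loop might look like the following. It assumes each call returns a one-row table with a Sentiment label (upper case, e.g. "POSITIVE") plus Mixed, Negative, Neutral & Positive score columns. The function name get_sentiments() & the detect_fun argument are additions for illustration: the latter defaults to aws.comprehend::detect_sentiment() but lets you plug in a stand-in to try the logic without an AWS account. The second pass is done here with matrix indexing instead of another loop.

```r
library(data.table)

# Score each comment & keep only the confidence of the assigned label.
# detect_fun is injectable so the logic can be tried without AWS.
get_sentiments <- function(companies, comments,
                           detect_fun = aws.comprehend::detect_sentiment) {
  tmp <- vector("list", length(comments))
  for (i in seq_along(comments)) {
    # one comment per request keeps every call small
    tmp[[i]] <- detect_fun(comments[i], language = "en")
  }
  res <- as.data.frame(rbindlist(tmp))

  # ASSUMPTION: labels are upper case, score columns are title case
  score_cols <- c(MIXED = "Mixed", NEGATIVE = "Negative",
                  NEUTRAL = "Neutral", POSITIVE = "Positive")
  scores <- as.matrix(res[, score_cols])
  picked <- match(toupper(as.character(res$Sentiment)), names(score_cols))
  # pick, per row, the score belonging to the assigned sentiment
  res$Confidence <- scores[cbind(seq_len(nrow(res)), picked)]

  # drop the four per-class columns, attach company & comment
  res <- res[, setdiff(names(res), score_cols), drop = FALSE]
  data.frame(Company = companies, Comment = comments, res,
             stringsAsFactors = FALSE)
}

# Comments_n_Sentiments <- get_sentiments(reviews$Company, reviews$Comment)
```

Here reviews is a hypothetical data frame holding the scraped company names & comments.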
I mentioned before that a review on Glassdoor contains 3 elements: 1 general text (which I called a comment & just showed how I looped through to detect its sentiment), plus 1 Pros & 1 Cons section. I can only presume these are places to further specify what is/was good & bad about an employee’s experience. Given their pro-contra nature, I feared these would either be mostly categorized as neutral or misrepresent how the commenter feels overall. So I instead used the detect_phrases() function to look for key phrases, under the assumption that these texts would point to the positive/negative aspects of one’s experience. I again looped through each observation row & stored the function outputs in a temporary list, to eventually rbind them. I also added indices to be able to track which review a specific key phrase was detected in (by joining columns from the previously obtained data):
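The key-phrase loop can be sketched along the same lines. The function name get_key_phrases(), the ReviewID column & the injectable detect_fun argument are assumptions made for this example; by default the function calls aws.comprehend::detect_phrases(), which is assumed to return one row per detected phrase.

```r
library(data.table)

# Collect key phrases per text & remember which review they came from.
get_key_phrases <- function(texts,
                            detect_fun = aws.comprehend::detect_phrases) {
  tmp <- vector("list", length(texts))
  for (i in seq_along(texts)) {
    # empty texts are skipped rather than sent to the API
    if (is.na(texts[i]) || !nzchar(trimws(texts[i]))) next
    out <- detect_fun(texts[i], language = "en")
    if (is.null(out) || nrow(out) == 0) next  # no phrase detected
    out$ReviewID <- i                         # index back to the review
    tmp[[i]] <- out
  }
  # rbindlist() skips the NULL entries left by the skipped reviews
  as.data.frame(rbindlist(tmp, fill = TRUE))
}

# df_Pros <- get_key_phrases(reviews$Pros)
# df_Cons <- get_key_phrases(reviews$Cons)
```

If the sentiment table also carries a ReviewID column (for example seq_len(nrow(Comments_n_Sentiments))), then merge(df_Pros, Comments_n_Sentiments, by = "ReviewID") attaches the company & detected sentiment to every phrase.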
With these 3 data frames obtained (df_Pros, df_Cons & Comments_n_Sentiments), we are ready for some EDA. As I mentioned earlier, I imagine using this code to get a better understanding of what it is like to work for a company, should I be so lucky one day to do so for any of those analyzed here. The 2 starting questions I would have are a) whether it is a positive experience overall, based on the sentiments Amazon Comprehend detected, & b) how confident Amazon Comprehend is in assigning comments to each sentiment category.
To answer b) with the geom_density() plot above, Amazon Comprehend seems to detect Positive & Neutral comments most confidently, while its confidence is quite uniformly distributed when it classifies comments as Mixed or Negative. This may also result from the infrequency of mixed & negative comments. After all, only 100 comments were collected per company, & the companies analyzed are the 10 best-ranked IT companies, so overwhelmingly positive experiences are expected. This is confirmed by the bar chart below, showing how comments were categorized by Amazon Comprehend:
As presumed, positive comments dominate, followed by neutral-categorized comments as the 2nd most frequent. This may be due to people very rarely having terrible experiences at such well-established companies, or due to longer comments having a better chance of being characterized as neutral. A larger sample is required to draw a conclusion, for which the code is provided below.
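For reference, the density plot & the bar chart discussed above could be produced roughly as follows with ggplot2. This is a sketch that assumes the data frame has a Sentiment column & a numeric Confidence column; both function names are made up for the example.

```r
library(ggplot2)

# Distribution of the confidence score, one curve per sentiment
plot_confidence_density <- function(df) {
  ggplot(df, aes(x = Confidence, fill = Sentiment)) +
    geom_density(alpha = 0.4) +
    labs(x = "Confidence of assigned sentiment", y = "Density")
}

# Number of comments per detected sentiment
plot_sentiment_counts <- function(df) {
  ggplot(df, aes(x = Sentiment, fill = Sentiment)) +
    geom_bar(show.legend = FALSE) +
    labs(x = "Detected sentiment", y = "Number of comments")
}

# plot_confidence_density(Comments_n_Sentiments)
# plot_sentiment_counts(Comments_n_Sentiments)
```

Adding facet_wrap(~ Company) to either plot would split it per company, which becomes more informative once more than 100 reviews per company are collected.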
To follow up on these initial questions, I would be interested in finding out what was good or bad about each experience, leading me to the Pros & Cons data frames. I have also been intrigued by text clouds ever since my bachelor graduation ceremony, where a live version showed the most frequent responses to a question. So I took the opportunity & used the quanteda library, 1st plotting the general comments colored by the sentiment Amazon Comprehend detected.
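A sketch of how such a sentiment-colored cloud can be built with quanteda is given below, under some assumptions: the data frame has Comment & Sentiment columns, & the plotting function lives in the separate quanteda.textplots package, as it does in quanteda 3. The helper name build_sentiment_dfm() is made up, & this version counts single words only, whereas a cloud with two-word terms would need an extra n-gram step.

```r
library(quanteda)

# Build a document-feature matrix with one "document" per sentiment.
build_sentiment_dfm <- function(df) {
  corp <- corpus(df, text_field = "Comment")
  toks <- tokens(corp, remove_punct = TRUE)
  toks <- tokens_remove(toks, stopwords("en"))  # drop English stopwords
  # pool all comments sharing the same detected sentiment
  grp <- factor(docvars(corp, "Sentiment"))
  dfm_group(dfm(toks), groups = grp)
}

# sentiment_dfm <- build_sentiment_dfm(Comments_n_Sentiments)
# A comparison cloud colors words by the group they are typical of:
# quanteda.textplots::textplot_wordcloud(sentiment_dfm,
#                                        comparison = TRUE,
#                                        max_words = 100)
```

The same helper would work for the Pros & Cons tables if their phrase column is passed as the text field & a grouping column such as the company is used instead of the sentiment.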
While I wouldn’t say I found definitive reasons why the people writing reviews had the type of experience they had, the text cloud above shows that those with a negative detected sentiment mentioned ‘management’ most frequently, while those with a positive detected sentiment mentioned ‘great’ ‘company’, ‘good’ ‘place’ & ‘work to’.
Plotting the same text cloud using the df_Pros & df_Cons data frames, however, I would expect to see more concrete aspects. Using df_Pros, it seems the most frequent terms were ‘great’ ‘benefits’, ‘work culture’, ‘good’ ‘company’, ‘people’ & ‘opportunities’. On the flip side, using df_Cons to try to detect the sources of negative experiences, the most frequent distinguishable terms were ‘company’ ‘management’ & ‘work hours’.
Also, find my code snippet for this section’s plots, as well as a few others that were left out:
In this article, I tried to show you how Amazon Comprehend can be used in combination with a little web scraping in R to find out about current & former employees’ experiences with their employers. I used detect_sentiment() & detect_phrases(), together with the httr, rvest & data.table packages.
Admittedly I’d be honoured to join any of the companies plotted above in the future, but I see bigger value in this framework when looking for jobs further afield, at firms I am not yet familiar with. This being a 1st go at using Comprehend, I am sure I did not come near the full potential of this AWS service, but I do believe the framework above can already provide you with useful insights.
Finally, though scraping employee reviews for analysis is already interesting in itself, I am planning to write similar functions to get Jobs & Salary data, to have a framework for finding suitable jobs, as well as to be able to quote an informed figure when asked the uncomfortable question of ‘What is your expected salary?’