Background
On August 12, 2015, the U.S. District Court for the District of Columbia issued its decision in the Washington Alliance of Technology Workers v. U.S. Department of Homeland Security civil action. The Court found that DHS had failed to satisfy proper procedural requirements in its original 2008 interim rule, effectively rendering all STEM extensions held by current F-1 students invalid. Since displacing hundreds of thousands of F-1 students would have draconian consequences for the industry, the Court gave DHS until February 12, 2016 to issue a new rule. DHS proposed a new OPT STEM extension rule on October 16, 2015, which is open for comments until November 18, 2015 here.
Since the comment data is publicly available, we downloaded and played around with it.
Motivation
Although this was mostly an academic endeavor, the idea behind analyzing this data was to understand how many people support OPT, what the most common patterns in the language of the comments are, and whether there are any other interesting insights to be found.
Analysis
Our first order of business is to identify how many people support the new rule. For details on data collection and labeling, see the Appendix.
Yeas and Nays
We used a Naive Bayes classifier with Laplace add-one smoothing to predict whether a comment supports or opposes OPT. The classifier performed really well on the dev/test splits:
Train set accuracy: 1.000000
Dev set accuracy: 0.924658
Test set accuracy: 0.906667
The three splits were made from the manually labeled data in the following proportions: train (70%), dev (20%) and test (10%). The dev set isn’t really used here since the NB classifier does not require weight tuning; it practically works as another test set. Applying this classifier to the entire set of ~13697 comments reveals:
Number of YES=9733, NO=3964
Percentage of YES=0.710594, NO=0.289406
So ~71% of the comments support OPT while ~29% oppose it.
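For reference, here is a minimal sketch of how such a classifier could be trained and applied using scikit-learn. The file names, column names and label strings below are placeholders, not the ones used in our notebook; the actual code is in our GitHub repository.

```python
# Minimal sketch of the support/oppose classifier (placeholder file/column names).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

labeled = pd.read_csv("labeled_comments.csv")        # ~670 hand-labeled comments
train, rest = train_test_split(labeled, test_size=0.3, random_state=0)
dev, test = train_test_split(rest, test_size=0.3333, random_state=0)  # 70/20/10 overall

vectorizer = CountVectorizer()                       # bag-of-words counts
X_train = vectorizer.fit_transform(train["comment"])

clf = MultinomialNB(alpha=1.0)                       # Laplace (add-one) smoothing
clf.fit(X_train, train["label"])

for name, split in [("Dev", dev), ("Test", test)]:
    preds = clf.predict(vectorizer.transform(split["comment"]))
    print(name, "set accuracy:", accuracy_score(split["label"], preds))

# Apply the classifier to every scraped comment.
all_comments = pd.read_csv("all_comments.csv")
votes = clf.predict(vectorizer.transform(all_comments["comment"]))
print("YES:", (votes == "YES").sum(), "NO:", (votes == "NO").sum())
```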
More than words
Before we read more into the language of these comments, let’s verify whether the data, at least approximately, follows Zipf’s law. The following is a log-log plot of word rank vs. frequency of occurrence.
Since Zipf’s law says the frequency of occurrence is inversely proportional to the rank, the log-log plot should be approximately a straight line (with slope -1), which it loosely is.
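Here is a small sketch of that sanity check, assuming `words` is a flat list of lower-cased tokens pooled from all the comments:

```python
# Sanity check against Zipf's law: log-log plot of word rank vs. frequency.
# Assumes `words` is a flat list of lower-cased tokens pooled from all comments.
from collections import Counter
import matplotlib.pyplot as plt

counts = Counter(words)
freqs = sorted(counts.values(), reverse=True)   # frequency of rank 1, 2, 3, ...
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs)
plt.xlabel("rank")
plt.ylabel("frequency of occurrence")
plt.title("Zipf's law check")
plt.show()
```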
Now that we are more confident of the sanity of the data, let’s have a look at the language used by the two camps. Here are the word clouds of the two kinds of comments:
Here is a plot of the 20 most frequent words overall and their ranks (most frequent word has a rank of 1) in the supporting and opposing comments:
After excluding the obvious contextual words like “opt”, “stem” and “extension”, the top words in the supporting camp are “students”, “international”, “country” and “people”. Similarly, the top words in the opposing camp are “foreign”, “american”, “workers” and “program”. A few interesting things to note:
- The word “students” is used far more often in the supporting comments. The opposing comments prefer “foreign workers”.
- The word “citizen” is used almost exclusively in opposing comments.
- The word “people” has a rank of 7 in the supporting comments and a rank of 20 in the opposing comments.
- In the opposing comments, the word “extension” has a rank of 429, while “opt” has a rank of 4. This probably implies that the opponents are concerned about OPT itself, not just the extension.
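The rank comparisons above boil down to counting words separately in each camp; a sketch, assuming `supporting` and `opposing` are lists of tokenized comments:

```python
# Sketch: rank of each word within the supporting and the opposing comments.
# Assumes `supporting` and `opposing` are lists of tokenized comments
# (each comment itself a list of lower-cased words).
from collections import Counter

def word_ranks(comments):
    counts = Counter(word for comment in comments for word in comment)
    # The most frequent word gets rank 1.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

yes_ranks = word_ranks(supporting)
no_ranks = word_ranks(opposing)

for word in ["students", "foreign", "people", "extension", "opt"]:
    print(word, "| rank in supporting:", yes_ranks.get(word),
          "| rank in opposing:", no_ranks.get(word))
```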
Name calling (without any name-calling)
Given that the top words in the opposing comments are “american”, “jobs”, “foreign”, “workers” and “citizens”, it appears that the concern is mostly about the loss of American jobs to foreign workers. This is also clear from a quick read of the opposing comments. This probably means the opposing comments mostly come from US citizens. To verify this, we try to identify whether the author of a comment is a US citizen (or long-time resident). We do this by comparing the first and last names of the author against a dataset of common US names taken from the SSA website and the 2010 US Census [2]. To be conservative, we consider a name as belonging to a US citizen only if both the first and the last names are in the top x percentile (e.g., at the 90th percentile, we consider a name US only if both the first and last names are in the top 10% of the names database).
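Here is a minimal sketch of that percentile check, assuming `first_names` and `last_names` are lists sorted from most common to least common (built from the SSA and Census files) and consistently upper-cased:

```python
# Sketch of the percentile-based "US name" check. Assumes `first_names` and
# `last_names` are lists sorted from most common to least common, built from
# the SSA first-name file and the 2010 Census last-name file respectively.
def top_percentile(names, percentile):
    # At the 90th percentile, keep only the top 10% of names, and so on.
    cutoff = int(len(names) * (100 - percentile) / 100.0)
    return set(names[:cutoff])

def is_us_name(first, last, percentile=90):
    common_first = top_percentile(first_names, percentile)
    common_last = top_percentile(last_names, percentile)
    # Conservative: both the first AND the last name must be common.
    return first.upper() in common_first and last.upper() in common_last
```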
Since the US is a nation of immigrants, this method is very flaky: many names classified as belonging to US citizens are actually Asian names, and it’s hard to tell whether those comments were made by international students or naturalized citizens. Conversely, many people whose names are not common in the names database are in fact US citizens. So there is both a false positive and a false negative problem due to the inherent diversity of the US population.
Either way, there are a few interesting things to note:
- Left: Notice how the light blue line (-ves among US) goes from ~45% to ~55% as we restrict the set of people to high-confidence US names. Also, ~80% of the negatives have US names. However, ~40% of the positives still have US names.
- Right: Increasing the percentile alone does not give better results because, while it drops the international names, it also drops a lot of rare US names. In the right plot, we explicitly remove the most common Asian names. This isn’t a particularly bad strategy because an overwhelming number of comments are by foreign Asian students. Notice that once we do this, at the 65th percentile, ‘US among +ves’ drops to 20%, while ‘US among -ves’ stays about the same at 80%. Also notice that at the 65th percentile, ‘-ves among US’ rises to 60% from the earlier 45%.
As unreliable as this method is, it does broadly confirm that most of the opposition comes from US citizens, which isn’t surprising at all and would likely be true if such a proposal were made in any other country.
Since it’s easy to plot data, here is a plot of the top 20 last names in both camps.
Fear of the dark
Wouldn’t it be interesting to know what the opposing comments say that the supporting comments do not? Yes, most certainly! Here is a word cloud of the top words in the opposing comments that are not present in the supporting comments:
Some snippets of where these words appear:
- artificially: The program would allow U.S. companies to hire foreign citizens who have been pursuing a degree for at least nine months in the U.S., artificially expanding the pool of available workers for jobs particularly in science, technology, engineering, and mathematics fields
- 65% (the cloud generator does not support numbers, but this is the second most common ‘word’): OPT removed $4 billion from Social Security and Medicare trust funds. Also employers get around H-1 B Cap. Employers save 7.65% when they hire foreign students instead of U.S. Workers. They do not pay FICA or Medicare taxes.
- 2013 (same, numbers not supported by cloud generator): OPT has denied American workers more than 430,000 jobs during the years 2009–2013
- 568: There are approximately 568,000 F-1 students in the U.S. in addition to 98,000 in 12-month OPT programs and 30,000 in 29-month OPT programs. Giving employers incentive to hire from this giant pool of workers undermines the job opportunities for American STEM workers.
- Grassley: Charles Grassley of Iowa (in a letter to President Obama) had concerns about administratively using the STEM OPT program to establish a de facto shadow H-1B program, in violation of Congressional intent. The Regulation does this by making all STEM OPT foreign student eligible for replacing American workers. American workers are not protected.
- administratively: By increasing the total amount of time a foreign student may work in OPT after each degree to 3 years — the same amount of time that an H-1B visa would be valid — there is little doubt that the Administration has administratively established a de facto shadow H-1B program, in violation of Congressional intent.
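The “exclusive” vocabulary behind this word cloud is just a set difference over the two camps’ tokens; a sketch, with the same token-list assumptions as before:

```python
# Sketch: words used in opposing comments that never appear in supporting ones.
# Assumes `supporting_words` and `opposing_words` are flat lists of tokens.
from collections import Counter

opposing_counts = Counter(opposing_words)
supporting_vocab = set(supporting_words)

exclusive = {w: c for w, c in opposing_counts.items() if w not in supporting_vocab}
for word, count in sorted(exclusive.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(word, count)
```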
One more thing…
We created a small page that collects live data from regulations.gov and predicts the vote for each comment. We expect the data to catch up to the current snapshot of the site by the end of the day on 11/18.
Web: http://vikeshkhanna.webfactional.com/opt
API:
- Get the whole dump of the dataset: curl 'http://vikeshkhanna.webfactional.com/opt/api/dump'
- Get paginated results at /api/data/<PAGE_NO>/<ROWS_PER_PAGE>: curl 'http://vikeshkhanna.webfactional.com/opt/api/data/0/50'
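If you prefer Python over curl, something like the following should work, assuming the endpoints return JSON (the exact response schema may differ):

```python
# Pulling the predictions programmatically instead of via curl. This assumes
# the endpoints return JSON; the exact response schema is not documented here.
import requests

dump = requests.get("http://vikeshkhanna.webfactional.com/opt/api/dump").json()
page = requests.get("http://vikeshkhanna.webfactional.com/opt/api/data/0/50").json()
print("rows in full dump:", len(dump))
```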
Appendix
Some gory details.
Data collection
The list of links to all the comments can be exported from the regulations.gov website here. However, each of these links loads a bare-bones HTML page and populates the comment through JavaScript. We used a headless browser (PhantomJS) to download all the comments and their authors. There are a total of 13697 comments in our dataset.
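The actual scraper drives PhantomJS directly; an equivalent sketch in Python, via Selenium’s PhantomJS driver, looks roughly like this (the CSS selectors are placeholders, not the real structure of the regulations.gov page):

```python
# Sketch of the scraping step using Selenium's PhantomJS driver from Python.
# The actual scraper was driven through PhantomJS directly; the CSS selectors
# below are placeholders, not the real structure of the regulations.gov page.
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.implicitly_wait(10)                # give the JavaScript time to populate the page
comments = []
for url in comment_urls:                  # the list exported from regulations.gov
    driver.get(url)
    text = driver.find_element_by_css_selector(".comment-text").text    # placeholder
    author = driver.find_element_by_css_selector(".author-name").text   # placeholder
    comments.append((url, author, text))
driver.quit()
```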
Data Labeling
Since there are no explicit votes on the site, the data is completely unlabeled. However, a quick glance through some of the comments reveals striking similarities in the words and language used, both in the supporting and in the opposing comments. A Naive Bayes (NB) classifier works really well for a problem like this. At first, we thought of training the NB classifier on the Congressional floor-debate data. However, that data has language very specific to, well, congressional floor debates. We had our doubts about how well it would perform on language that is far more colloquial and lacks the political context of the debates. We decided to manually label the dataset ourselves and ended up labeling ~670 comments [1]. You can download the labeled data here.
Website
The website, which is available here, is written in web.py and hosted on Webfaction. It uses a MySQL database to store the link, author, comment, timestamp and our prediction for each comment. The database is populated by a long-running shell script (currently running under screen and restarted by cron every 30 minutes) that downloads the latest data dump from regulations.gov and adds new rows to the DB. The pickled Naive Bayes probabilities are used to predict the vote for each new comment, and the result is then persisted into the MySQL table. The front-end uses jQuery and Bootstrap.
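A rough sketch of that periodic update step, assuming the pickle holds a (vectorizer, classifier) pair and a simple `comments` table (the real script also handles downloading and diffing the regulations.gov dump):

```python
# Rough sketch of the periodic update step. Assumes the pickle holds a
# (vectorizer, classifier) pair and that new rows parsed from the latest
# regulations.gov dump are available as `new_rows`; table and column names
# are placeholders.
import pickle
import MySQLdb

with open("nb_model.pkl", "rb") as f:
    vectorizer, clf = pickle.load(f)

db = MySQLdb.connect(user="opt", passwd="...", db="opt")
cur = db.cursor()
for link, author, comment, timestamp in new_rows:
    vote = clf.predict(vectorizer.transform([comment]))[0]
    cur.execute(
        "INSERT INTO comments (link, author, comment, ts, prediction) "
        "VALUES (%s, %s, %s, %s, %s)",
        (link, author, comment, timestamp, vote),
    )
db.commit()
db.close()
```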
[1] We wrote a script that made it really easy to manually label the data. The iPython notebook and all other scripts are available in our GitHub repository.
[2] The first name database is taken from the Social Security Administration website. The last names database is taken from US Census Bureau 2010 census.