Identify “Bullshit” in Today’s Internet Era through Exploratory Data Analysis

Ben Niu
Spring 2019 — Information Expositions
Jan 28, 2019

In today’s era of fast, low-cost information exchange, people’s opinions and conclusions are exposed on the Internet everywhere and at all times. Fake news, misinformation, and bullshit are being produced more than ever.

Definitions of “bullshit” vary slightly from source to source. Merriam-Webster defines “bullshit” as nonsense. Wikipedia explains that “It is mostly a slang profanity term meaning “nonsense”, especially as a rebuke in response to communication or actions viewed as deceptive, misleading, disingenuous, unfair or false.” Urban Dictionary defines it as “A blatant lie, a fragrant untruth, an obvious fallacy.” What these definitions have in common is the idea that “bullshit” is nonsense.

Harry Frankfurt, of Princeton University, was an early researcher on the subject of “bullshit”. He notes that when he began writing his essay, little work had been done on the subject. Today it is among the most common things on the Internet. He also argues that bullshit is different from lying, and that it is a greater enemy of the truth than lies are. “On Bullshit” makes several statements about bullshit, for example: “Bullshit is unavoidable whenever circumstances require someone to talk without knowing what he is talking about. Thus the production of bullshit is stimulated whenever a person’s obligations or opportunities to speak about some topic are more excessive than his knowledge of the facts that are relevant to that topic.” This explains precisely why bullshit is unavoidable, especially in today’s Internet era: in public life, people are often pushed by various demands to talk at length about subjects of which they are to some degree ignorant.

Methods for identifying and calling out “bullshit” have therefore become important for improving the online conversation environment. One such method is exploratory data analysis. The following is an example of “bullshit” identified through exploratory data analysis.

I am currently a junior at the University of Colorado at Boulder, with one year of college left. Somewhat reluctantly, I started searching for information about graduate school applications. The screenshot below is from Quora. It shows the most-upvoted answer to the question “How important is the role of SOP in graduate admissions?”

Screenshot from Quora

The answer uses the phrase “the most” twice to claim that the SOP (statement of purpose) is the most critical factor in the graduate admission process. However, this differs from what I found in other sources of information on graduate admission. I therefore turned to data analysis to check whether the claim is true.

I found a graduate admissions dataset on Kaggle that is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya. It contains several parameters that are considered important when applying to Master’s programs:

1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
5. Undergraduate GPA (out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1)

The statistical overview of the dataset
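
A statistical overview like the one pictured can be produced in a few lines. The sketch below is only an illustration and not necessarily how the original table was made: it uses the Python standard library, and the rows in SAMPLE_CSV are made-up values standing in for the Kaggle file. The column names are assumed to match the dataset's headers.

```python
import csv
import io
import statistics

# Made-up sample rows for illustration only; NOT taken from the Kaggle file.
# Column names are assumed to match the dataset's headers.
SAMPLE_CSV = """GRE Score,TOEFL Score,SOP,CGPA,Chance of Admit
320,110,4.0,8.8,0.80
305,102,3.0,8.0,0.62
330,116,4.5,9.4,0.92
298,98,2.5,7.6,0.48
312,106,3.5,8.4,0.71
"""


def load_columns(text):
    """Parse CSV text into a dict mapping column name -> list of floats."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def summarize(columns):
    """Return count, mean, sample std, min and max for every column."""
    return {
        name: {
            "count": len(values),
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),
            "min": min(values),
            "max": max(values),
        }
        for name, values in columns.items()
    }


columns = load_columns(SAMPLE_CSV)
summary = summarize(columns)
for name, stats in summary.items():
    line = "  ".join(f"{key}={value:.2f}" for key, value in stats.items())
    print(f"{name:16s} {line}")
```

To run this on the real data, the inline string would be replaced by the contents of the downloaded CSV file. Checking the minimum and maximum of each column against the documented ranges (for example, GRE out of 340) is a quick way to catch bad rows before going further.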

Next, I made a heatmap of every parameter’s correlation with “Chance of Admit” to find out which parameter is most strongly associated with it, and so which factor matters most in the admission process.

This heatmap shows the correlation score between the parameter on the left and the parameter on the bottom.
The duplicated half of the correlation scores is dropped to make the heatmap easier to read.
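
The numbers behind such a heatmap can be sketched without any plotting library. The example below is an illustration under assumptions, not the code that produced the figure: the rows are made-up values, the score is assumed to be the Pearson correlation coefficient, and the helper name pearson is hypothetical. It builds the full symmetric matrix, keeps only the cells below the diagonal so that each pair appears once, and then ranks the parameters by their correlation with Chance of Admit.

```python
import math

# Made-up rows for illustration only; NOT taken from the Kaggle file.
data = {
    "GRE Score": [320, 305, 330, 298, 312],
    "SOP": [4.0, 3.0, 4.5, 2.5, 3.5],
    "CGPA": [8.8, 8.0, 9.4, 7.6, 8.4],
    "Chance of Admit": [0.80, 0.62, 0.92, 0.48, 0.71],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


names = list(data)

# Full symmetric matrix: corr[a][b] equals corr[b][a].
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Keep only the cells below the diagonal, so each pair appears once.
lower = {
    a: {b: corr[a][b] for b in names[:i]}
    for i, a in enumerate(names)
}

for a in names:
    cells = "  ".join(f"{lower[a][b]:5.2f}" for b in lower[a])
    print(f"{a:16s} {cells}")

# Rank the other columns by their correlation with the target column.
target = "Chance of Admit"
ranking = sorted(
    ((name, corr[name][target]) for name in names if name != target),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:16s} {score:.2f}")
```

With pandas, DataFrame.corr() returns the same kind of matrix in one call. The pure-Python version is used here only to make each step visible.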

The heatmap shows that CGPA (0.87), GRE Score (0.8), and TOEFL Score (0.79) are the three factors most strongly correlated with Chance of Admit. SOP has a correlation score of only 0.68 with Chance of Admit. By this measure, the answer on Quora was wrong.

Furthermore, the charts above show in more detail how applicants’ Chance of Admit is distributed against each parameter. All of the charts show the same trend: the higher the score, the higher the chance of admission.
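
The trend can also be checked numerically rather than by eye. The sketch below is one hedged way to do it, assuming made-up (GRE Score, Chance of Admit) pairs and a hypothetical helper named mean_by_band: it groups applicants into 10-point GRE bands and compares the mean Chance of Admit from one band to the next.

```python
from collections import defaultdict

# Made-up (GRE Score, Chance of Admit) pairs for illustration only;
# NOT taken from the Kaggle file.
rows = [
    (298, 0.48), (303, 0.55), (305, 0.62), (312, 0.71),
    (318, 0.74), (320, 0.80), (327, 0.88), (330, 0.92),
]

WIDTH = 10  # size of each GRE band, chosen arbitrarily for this sketch


def mean_by_band(pairs, width):
    """Average the second value within fixed-width bands of the first."""
    bands = defaultdict(list)
    for score, chance in pairs:
        start = (score // width) * width  # e.g. 312 -> 310 when width is 10
        bands[start].append(chance)
    return {start: sum(v) / len(v) for start, v in sorted(bands.items())}


band_means = mean_by_band(rows, WIDTH)
for start, mean in band_means.items():
    print(f"GRE {start}-{start + WIDTH - 1}: mean chance of admit = {mean:.2f}")

# True when every band's mean is at least as high as the band before it.
values = list(band_means.values())
is_non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
print("Trend is non-decreasing:", is_non_decreasing)
```

The same grouping works for any of the other parameters, with a band width suited to that parameter's scale.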

Quora is a high-quality platform for exchanging information, and I personally enjoy reading it often. However, even when professionals and specialists are the ones answering the questions, we still have to be careful about bullshit.

Reference

Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.
