Quora Insincere Question Classification

Detect Toxic Content to Improve Online Conversations

Priyanka Patel
Analytics Vidhya
19 min read · Oct 1, 2019


In this blog I’ll explain how I classified questions from the Quora Insincere Questions dataset using both machine learning and deep learning methods.

Overview Of The Problem:

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions — those founded upon false premises, or that intend to make a statement rather than looking for helpful answers.

Quora has come up with a competition where we develop models that identify and flag insincere questions.

Problem Statement:

Predict whether a question asked on Quora is sincere or not.

Evaluation Metrics:

The metric is the F1 score between the predicted and the observed targets. There are just two classes, but the positive class makes up just over 6% of the total, so the target is highly imbalanced. This is why a metric such as F1, which combines the precision and the recall of the classifier as F1 = 2 · (precision · recall) / (precision + recall), is appropriate for this kind of problem.


Data Overview:

Quora provided a good amount of training and test data for identifying insincere questions. The training data consists of about 1.3 million rows and 3 columns.

File descriptions

  • train.csv — the training set
  • test.csv — the test set
  • embeddings — pre-trained word embeddings (GloVe, fastText wiki-news, Paragram and GoogleNews Word2Vec) provided with the competition

Data fields

  • qid — unique question identifier
  • question_text — Quora question text
  • target — a question labeled “insincere” has a value of 1, otherwise 0

Exploratory Data Analysis:

Load Train and Test Data-set:

First, load the train and test datasets and check their shape and the number of data points in each.
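A minimal loading sketch, assuming pandas and the file names from the competition data page:

```python
import pandas as pd

# File names as provided by the Kaggle competition.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print("Train shape:", train_df.shape)   # roughly (1306122, 3)
print("Test shape:", test_df.shape)
train_df.head()
```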

Distribution of data points among output classes:

We can see that the dataset is highly imbalanced, with only 6.19% insincere questions.
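Continuing from the loading snippet above, the imbalance can be checked with a simple value count:

```python
# Share of insincere (target = 1) questions in the training set.
class_counts = train_df["target"].value_counts()
print(class_counts)
print("Insincere fraction: {:.2%}".format(class_counts[1] / class_counts.sum()))
```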

Basic Feature Engineering:

We can add some features as part of the feature engineering pipeline for the Quora Insincere Questions Classification challenge.

Some features that I have included are listed below:

  • freq_qid = Frequency of qid
  • qlen = Length of Question (in characters)
  • n_words = Number of words in Question
  • numeric_words = Number of numeric words in Question
  • sp_char_words = Number of special characters in Question
  • unique_words = Number of unique words in Question
  • char_words = Number of characters in Question

I added the above features because they help us understand the data better and determine which engineered features are useful and which ones to discard.
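A sketch of how such features could be computed with pandas; the exact definitions below are my reading of the list above, not the author’s original code:

```python
import re

def add_basic_features(df):
    # Add the simple count-based features listed above to a copy of the dataframe.
    df = df.copy()
    df["freq_qid"] = df.groupby("qid")["qid"].transform("count")
    df["qlen"] = df["question_text"].str.len()
    df["n_words"] = df["question_text"].str.split().str.len()
    df["numeric_words"] = df["question_text"].apply(
        lambda q: sum(w.isdigit() for w in q.split()))
    df["sp_char_words"] = df["question_text"].apply(
        lambda q: len(re.findall(r"[^\w\s]", q)))
    df["unique_words"] = df["question_text"].apply(
        lambda q: len(set(q.lower().split())))
    df["char_words"] = df["question_text"].str.replace(" ", "").str.len()
    return df

train_df = add_basic_features(train_df)
```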

Data Preprocessing:

The text data is not entirely clean, thus we need to apply some data preprocessing techniques.

Preprocessing techniques for Data Cleaning:

  1. Removing Punctuation

The data contains a number of special characters; we’ll use replace to remove them.
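A small sketch of this step; the exact character set is an assumption:

```python
import string

# Characters to strip: string.punctuation plus a few common unicode symbols.
PUNCTS = string.punctuation + "“”‘’—…«»"

def clean_punctuation(text):
    # Replace every special character with a space so words are not glued together.
    for p in PUNCTS:
        text = text.replace(p, " ")
    return text

clean_punctuation("Why can’t we just be nice?!")  # -> "Why can t we just be nice  "
```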

2. Cleaning Numbers

3. Correcting Misspelled Words

For better embedding coverage we’ll replace misspelled words using a misspell mapping and regex functions.
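A sketch of the mapping-plus-regex idea; the dictionary entries below are only illustrative:

```python
import re

# A tiny illustrative mapping; the real dictionary would cover many more terms.
MISPELL_MAP = {"colour": "color", "centre": "center", "qoura": "quora",
               "whta": "what", "narcisist": "narcissist"}

_mispell_re = re.compile("(%s)" % "|".join(map(re.escape, MISPELL_MAP.keys())))

def replace_misspellings(text):
    # Substitute each known misspelling with its corrected form.
    return _mispell_re.sub(lambda m: MISPELL_MAP[m.group(0)], text)

replace_misspellings("whta is qoura?")  # -> "what is quora?"
```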

4. Removing Contractions

Contractions are words written with an apostrophe (e.g. “don’t”, “I’m”); we expand them to their full forms.

5. Removing Stopwords

6. Stemming

Stemming is the process of reducing words to their base forms using crude heuristic rules. For example, one rule could be to remove ‘s’ from the end of any word, so that ‘cats’ becomes ‘cat’.

7. Lemmatization

Lemmatization is very similar to stemming but it aims to remove endings only if the base form is present in a dictionary.
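Both steps can be illustrated with NLTK (my choice of library, not necessarily the author’s):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download("wordnet") may be needed once for the lemmatizer.

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["cats", "studies", "better", "running"]
print([stemmer.stem(w) for w in words])          # ['cat', 'studi', 'better', 'run']
print([lemmatizer.lemmatize(w) for w in words])  # ['cat', 'study', 'better', 'running']
```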

Once the processing functions are defined, we apply the steps above to clean the text in both the train and test data.
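A sketch of how the cleaning functions from the sketches above could be chained and applied to both dataframes:

```python
def clean_text(text):
    # Chain the cleaning steps described above (only two are shown here).
    text = text.lower()
    text = clean_punctuation(text)      # step 1: punctuation
    text = replace_misspellings(text)   # step 3: misspellings
    # ... numbers, contractions, stopwords, stemming/lemmatization go here
    return text

for df in (train_df, test_df):
    df["question_text"] = df["question_text"].apply(clean_text)
```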

Analysis of Extracted Features:

From the word cloud we can see that words like Muslim, Trump, black, Indian, etc. appear frequently in insincere questions.

After applying Data preprocessing and cleaning, our text is ready for Classification. I have applied both Conventional and Deep Learning Methods for Classification.

Let’s first understand Machine Learning Methods of Classification.

Machine Learning Methods

Advanced NLP Text Processing:

1. Bag of Words (CountVectorizer)

Bag of Words (BoW) is a representation of text that describes the presence of words within the text data. The intuition is that two similar text fields will contain similar words and will therefore have similar bags of words; further, from the words alone we can learn something about the meaning of the document.

CountVectorizer converts a collection of text documents to a matrix of token counts. For this I have selected the n-gram range to be 1–3 and min_df to be 3 for building the vocabulary.

We run these features for machine learning models like Logistic Regression, Naive Bayes and LightGBM.
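As an illustration, here is a sketch using scikit-learn with the stated settings (ngram_range=(1, 3), min_df=3); the validation split and the Logistic Regression settings are my own choices, not the author’s exact setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    train_df["question_text"], train_df["target"],
    test_size=0.2, stratify=train_df["target"], random_state=42)

# Bag-of-words features with the settings described above.
bow = CountVectorizer(ngram_range=(1, 3), min_df=3)
X_train_bow = bow.fit_transform(X_train)
X_val_bow = bow.transform(X_val)

clf = LogisticRegression(solver="liblinear", class_weight="balanced")
clf.fit(X_train_bow, y_train)
print("F1:", f1_score(y_val, clf.predict(X_val_bow)))
```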

2. Term Frequency — Inverse Document Frequency (TF-IDF)

Term Frequency (tf): gives the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document. Each document has its own tf.

Inverse Document Frequency (idf): used to weight rare words across all documents in the corpus; words that occur rarely in the corpus have a high idf score. It is given by idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents that contain the word w.

Combining these two gives the TF-IDF score for a word in a document in the corpus; it is simply the product tf-idf(w) = tf(w) × idf(w).

TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. Here, too, the n-gram range is 1–3 and min_df is 3 for building the vocabulary.
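A corresponding sketch, reusing the X_train/X_val split from the bag-of-words example above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 3), min_df=3)
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)
# The same Logistic Regression / Naive Bayes / LightGBM models can now be
# fitted on these features exactly as with the bag-of-words matrix.
```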

We again run these features through machine learning models such as Logistic Regression, Naive Bayes and LightGBM.

3. Hashing Feature (HashingVectorizer)

HashingVectorizer is designed to be as memory efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside of this method is that once vectorized, the features’ names can no longer be retrieved.

HashingVectorizer converts a collection of text documents to a matrix of token occurrences. Here 2**10 is the number of features (columns) in the output matrix.
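A short sketch with scikit-learn, again reusing the earlier split (alternate_sign=False is my choice, so the counts stay non-negative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_train_hash = hv.transform(X_train)   # hashing is stateless, so no fit is required
X_val_hash = hv.transform(X_val)
```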

For this, I tried Logistic Regression and LightGBM.

4. Word2Vec Feature (Word Embeddings)

Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in vector space with several dimensions. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc.

Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer and one output layer.

For this too, I tried Logistic Regression and LightGBM.
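One common way to turn pre-trained Word2Vec embeddings into question-level features is to average the word vectors; this sketch assumes gensim and the GoogleNews file shipped with the competition (the path is an assumption):

```python
import numpy as np
from gensim.models import KeyedVectors

# Path is an assumption; the competition provides this file in the embeddings folder.
w2v = KeyedVectors.load_word2vec_format(
    "embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin",
    binary=True)

def question_vector(text, dim=300):
    # Average the Word2Vec vectors of the words found in the vocabulary.
    vecs = [w2v[w] for w in text.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_train_w2v = np.vstack([question_vector(q) for q in X_train])
X_val_w2v = np.vstack([question_vector(q) for q in X_val])
```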

Now that we have seen the conventional methods in detail, let’s see how I applied deep learning methods.

Deep Learning Methods

I’ll explain the deep learning models that I tried for this classification: TextCNN, BiLSTM and Attention.

Note: For all models, sigmoid is used as the activation function in the output layer, and the models are compiled with the Adam optimizer and binary cross-entropy as the loss function.

1. TextCNN

TextCNN mainly uses a one-dimensional convolutional layer and a max-over-time pooling layer. Suppose the input text sequence consists of n words, and each word is represented by a d-dimensional word vector. Then the input example has a width of n, a height of 1, and d input channels. The calculation of TextCNN can be divided into the following steps:

  1. Define multiple one-dimensional convolution kernels and use them to perform convolution calculations on the inputs. Convolution kernels with different widths may capture the correlation of different numbers of adjacent words.
  2. Perform max-over-time pooling on all output channels, and then concatenate the pooling output values of these channels in a vector.
  3. The concatenated vector is transformed into the output for each category through the fully connected layer. A dropout layer can be used in this step to deal with over-fitting.

A 2D convolution layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. I chose 36 as the dimensionality of the output space (the number of filters) and a list of filter sizes that specifies the height of the Conv2D window, its width being the embedding dimension. ReLU is used as the activation and he_normal as the kernel initializer, and max pooling is applied on top.
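A minimal tensorflow.keras sketch of such a TextCNN; maxlen, vocabulary size, embedding dimension and the list of filter sizes are assumptions, while the 36 filters, ReLU, he_normal and max pooling follow the description above:

```python
from tensorflow.keras import layers, models

maxlen, vocab_size, embed_dim = 70, 50000, 300   # assumptions
num_filters, filter_sizes = 36, [1, 2, 3, 5]

inp = layers.Input(shape=(maxlen,))
x = layers.Embedding(vocab_size, embed_dim)(inp)
x = layers.Reshape((maxlen, embed_dim, 1))(x)    # (width n, channels d) as a 2D "image"

pooled = []
for fs in filter_sizes:
    conv = layers.Conv2D(num_filters, kernel_size=(fs, embed_dim),
                         activation="relu", kernel_initializer="he_normal")(x)
    # Max-over-time pooling collapses each feature map to a single value.
    pooled.append(layers.MaxPool2D(pool_size=(maxlen - fs + 1, 1))(conv))

z = layers.Concatenate(axis=1)(pooled)
z = layers.Flatten()(z)
z = layers.Dropout(0.1)(z)
out = layers.Dense(1, activation="sigmoid")(z)

model = models.Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```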

This model gave an f1-score of 0.6101.

2. Bidirectional LSTM

Bidirectional LSTMs are supported in Keras via the Bidirectional layer wrapper.

This wrapper takes a recurrent layer (e.g. the first LSTM layer) as an argument.

It also allows you to specify the merge mode, that is how the forward and backward outputs should be combined before being passed on to the next layer.

A bidirectional RNN is effectively just two RNNs, where one gets fed the sequence forward while the other gets the sequence fed backward.

[Figure: A bidirectional RNN]

For the simplest explanation of a bidirectional RNN, think of an RNN cell as a black box that takes a hidden state (a vector) and a word vector as input and gives out an output vector and the next hidden state. The box has some weights, which are tuned by backpropagation of the loss. The same cell is applied to all the words, so the weights are shared across the words in the sentence; this is called weight sharing.

For a sequence of length 4, like “The quick brown fox”, the RNN cell gives 4 output vectors, which can be concatenated and then used as part of a dense feedforward architecture.

In a bidirectional RNN, the only change is that we read the text in the usual order as well as in reverse. So we stack two RNNs in parallel, and hence we get 8 output vectors to append.

Once we get the output vectors, we send them through a series of dense layers and finally a softmax layer to build a text classifier.

Here 64 is the size (dimension) of the hidden state vector as well as of the output vector. Setting return_sequences=True keeps the output for the entire sequence. So what is the output dimension of this layer? 70 (maxlen) × 128, i.e. 64 × 2 from concatenating the two directions.

Note: CuDNNLSTM is a fast implementation of the LSTM layer in Keras that only runs on a GPU.
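A minimal tensorflow.keras sketch of the BiLSTM described above; maxlen, vocabulary size and embedding dimension are assumptions, and layers.LSTM is used in place of CuDNNLSTM (in TF 2 it dispatches to the fused cuDNN kernel automatically on GPU):

```python
from tensorflow.keras import layers, models

maxlen, vocab_size, embed_dim = 70, 50000, 300   # assumptions

inp = layers.Input(shape=(maxlen,))
x = layers.Embedding(vocab_size, embed_dim)(inp)
# Bidirectional wrapper concatenates forward and backward outputs: (maxlen, 128).
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(16, activation="relu")(x)
x = layers.Dropout(0.1)(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```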

The BiLSTM model gave an F1-score of 0.6272.

3. Attention Models

Attention was presented by Dzmitry Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate” that reads as a natural extension of their previous work on the Encoder-Decoder model.

Attention is proposed as a solution to the limitation of the Encoder-Decoder model encoding the input sequence to one fixed length vector from which to decode each output time step. This issue is believed to be more of a problem when decoding long sequences.

Attention is proposed as a method to both align and translate.

Alignment is the problem in machine translation that identifies which parts of the input sequence are relevant to each word in the output, whereas translation is the process of using the relevant information to select the appropriate output.

We want to create scores for every word in the text, which is the attention similarity score for a word.

To do this, we start with a weight matrix (W), a bias vector (b) and a context vector (u), all of which are learned by the optimization algorithm. On this note, I’d like to highlight something I like a lot about neural networks: if you don’t know some parameters, let the network learn them. We only have to worry about designing the architecture and choosing the parameters to tune.

Then there is a series of mathematical operations. We can think of u1 as a nonlinearity applied to the RNN output for a word, u1 = tanh(W·h1 + b). Then v1 is the exponential of the dot product of u1 with the context vector u, v1 = exp(u1 · u); intuitively, v1 will be high if u1 and u are similar. Since we want the scores to sum to 1, we divide each v by the sum of all v’s to get the final scores s.

These final scores are then multiplied by RNN output for words to weight them according to their importance. After which the outputs are summed and sent through dense layers and softmax for the task of text classification.
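As a rough sketch, the scoring just described can be packaged as a custom Keras layer; the shapes and initializers here are my assumptions, not the author’s exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

class Attention(layers.Layer):
    # Additive attention over RNN outputs, following the description above (a sketch).

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="att_W", shape=(dim, dim),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="att_b", shape=(dim,),
                                 initializer="zeros")
        self.u = self.add_weight(name="att_u", shape=(dim, 1),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):                                          # h: (batch, timesteps, dim)
        u1 = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)  # nonlinearity on RNN output
        v = tf.exp(tf.tensordot(u1, self.u, axes=1))            # similarity with context vector u
        s = v / tf.reduce_sum(v, axis=1, keepdims=True)         # scores sum to 1 (a softmax over time)
        return tf.reduce_sum(s * h, axis=1)                     # importance-weighted sum of outputs

# Usage after a Bidirectional LSTM that returns sequences:
#   x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
#   x = Attention()(x)
```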

Note: CuDNNLSTM is a fast implementation of the LSTM layer in Keras that only runs on a GPU.

This model gave an F1-score of 0.6305, which is better than any other model I tried.

Result

The scores reported above were obtained with 5-fold stratified cross-validation.

Future Work

For all of the above models, I did not do any hyperparameter tuning. You can try to improve performance by tuning hyperparameters with Hyperopt, grid search or random search.

Conclusion:

The deep learning model, a CuDNNLSTM with Attention, performed better than any other model, giving an F1-score of 0.6305.

For this classification problem, I first performed the necessary data preprocessing and cleaning, such as removing punctuation, contractions and stopwords, replacing misspelled words, and stemming and lemmatizing the text.

After that I applied machine learning classification methods such as Naive Bayes, Logistic Regression and LightGBM, using sklearn’s text feature extraction methods (CountVectorizer, TF-IDF, HashingVectorizer) and Word2Vec embeddings.

To get better results I also tried deep learning models such as TextCNN and Bidirectional LSTM, with and without Attention.

The code for this implementation is available via the link.

Also, a big shoutout to mlwhiz, whose blog on this project helped me implement it.

Finally, thanks to all of you for reading my blog.

References:

https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/

https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39
