My Take on 30 Questions to test a data scientist on Natural Language Processing with Interactive Code — Part 1

GIF from this website

Shivam Bansal is a data scientist with extensive experience in Natural Language Processing and Machine Learning, and he has an amazing blog about Natural Language Processing. If anyone is interested, please check his work out; his posts are super informative. Today, I’ll try to answer some of the questions from his 30 NLP questions from this blog.

Also, I am not going to answer the questions in numeric order, and for every question I’ll try to find the right answer and link to supporting material. However, I am always open to learning and growing, so if you know a more optimal solution please comment down below.


Q1) Which of the following techniques can be used for the purpose of keyword normalization, the process of converting a keyword into its base form?

So keyword normalization is the process of converting a word (keyword) into its most basic form. One example of this can be converting sadden, saddest, or sadly into the word sad (since it is the most basic form). Knowing this, now let’s look at the options we can choose from.

Image from this website
Image from this website

So from the images above we can directly see that both stemming and lemmatization are techniques used to convert a word into its most basic form. (They even give the example of cars, car’s → car.) Finally, let’s see what the other two choices mean. (This article also does an amazing job explaining the difference between stemming and lemmatizing.)

Image from this website
Image from this website

As seen above, Levenshtein distance measures how different two strings are (the number of single-character edits needed to turn one into the other), and Soundex is used to index words by their pronunciation. Hence they are not appropriate tools for keyword normalization. Finally, let’s actually look at how this looks in Python.

Red Box → Normalized Keywords from the original word

As seen above, the words studies/studying have changed to studi or study after stemming/lemmatization, and we can confirm our solution is correct.
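For readers without the notebook open, here is a minimal pure-Python sketch of the same idea, using toy suffix rules and a toy lemma dictionary (a real project would use NLTK’s PorterStemmer and WordNetLemmatizer instead; the rules and lemma table below are illustrative only):

```python
# Toy stemmer: chop/replace a few common suffixes, crude rules in the
# spirit of the Porter stemmer (a real stemmer has many more rules).
def crude_stem(word):
    for suffix, repl in (("ies", "i"), ("ing", ""), ("ly", ""), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

# Toy lemmatizer: look the word up in a dictionary of base forms.
LEMMAS = {"studies": "study", "studying": "study", "saddest": "sad"}

def crude_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ("studies", "studying"):
    print(w, "->", crude_stem(w), "/", crude_lemmatize(w))
```

Note how the stemmer produces studi (a truncated string that may not be a real word), while the lemmatizer maps to the dictionary form study — the same contrast seen in the screenshot above.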


Q2) N-grams are defined as the combination of N keywords together. How many bi-grams can be generated from given sentence:

I first needed to answer the question of what an N-gram is, and I found one Stack Overflow question that gives an excellent answer.

Image from this website

So assuming that we are not using # (word boundary), below are all of the bi-gram combinations we can generate from the sentence “Analytics Vidhya is a great source to learn data science”.

[“Analytics Vidhya”, “Vidhya is”, “is a”, “a great”, “great source”, “source to”, “to learn”, “learn data”, “data science”]. Meaning there are 9 bi-gram combinations in total. Now let’s look at the implementation.
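The implementation can be sketched in plain Python by zipping the token list against itself shifted by one (nltk.bigrams would do the same job):

```python
sentence = "Analytics Vidhya is a great source to learn data science"
tokens = sentence.split()

# Pair each token with its successor to form the bi-grams.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(len(bigrams))  # 9
print(bigrams)
```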

Finally, we can confirm our solution is correct.


Q3) How many trigrams phrases can be generated from the following sentence, after performing following text cleaning steps

We already know what a tri-gram is (from question 2). Now let’s take a deeper look at what stop-word removal is, as well as replacing punctuation.

Image from this website

From one simple Google search we know that stop-word removal is the process of removing words such as ‘is’, ‘a’, and ‘the’. Next we need to replace every punctuation character with a single space. For this we need to know which punctuation characters are available in Python’s string library, and we can get this by doing something like….
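Something along these lines — the standard library exposes the punctuation set as a plain string constant:

```python
import string

# The 32 ASCII punctuation characters, including '#', '-', '@' and '_'
print(string.punctuation)
```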

Knowing all of this, we can first remove all of the stop-words from the sentence, giving us… (To see the list of stop-words please click here)

“#Analytics-vidhya great source learn @data_science.”

And now let’s replace all of the punctuation with a single space, giving us…

“Analytics vidhya great source learn data science”

Finally, let’s create our tri-grams, which gives us the list [“Analytics vidhya great”, “vidhya great source”, “great source learn”, “source learn data”, “learn data science”], which has a length of 5 (giving us the answer c). Now let’s take a look at the implementation.

From above we can see that we have gotten the same sentence “Analytics vidhya great source learn data science”.
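Putting the two cleaning steps together, a sketch of the pipeline might look like the following (the stop-word list here is trimmed to just the words that occur in this sentence — real stop-word lists are much longer):

```python
import string

# Trimmed stop-word list for illustration; real lists have 100+ entries.
STOPWORDS = {"is", "a", "to", "the"}

sentence = "#Analytics-vidhya is a great source to learn @data_science."

# Step 1: remove stop-words.
kept = [w for w in sentence.split() if w.lower() not in STOPWORDS]
cleaned = " ".join(kept)

# Step 2: replace every punctuation character with a single space.
cleaned = "".join(" " if ch in string.punctuation else ch for ch in cleaned)
tokens = cleaned.split()

# Step 3: build tri-grams from consecutive token triples.
trigrams = [" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:])]

print(" ".join(tokens))       # Analytics vidhya great source learn data science
print(len(trigrams))          # 5
```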


Q4) Which of the following regular expression can be used to identify date(s) present in the text object:

This is a tricky question (at least for me), but with a simple implementation we can see that none of the given regexes match the dates.

And from there we know that one working regex can be (‘\d{4}-\d{2}-\d{2}|\d{2}/\d{2}, \d{4}’), which does not exist as an option; hence the answer would be D).
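A quick sanity check of that regex — the sample text below is made up, but it contains one date in each of the two formats the pattern’s alternation covers:

```python
import re

# Assumed sample text containing both date formats.
text = "Launched on 2017-09-04; revised on 12/05, 2017."

# YYYY-MM-DD, or DD/MM followed by ", YYYY".
pattern = r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}, \d{4}"

print(re.findall(pattern, text))  # ['2017-09-04', '12/05, 2017']
```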


Q5) Which of the following models can perform tweet classification with regards to context mentioned above?

Before diving into this question, let’s do a simple review of what SVM and Naive Bayes are.

Image from this website
Image from this website

Without going into the details of each classifier, we can already tell they are similar in the sense that both are supervised learning algorithms. However, our question states that we have only collected tweets and nothing more — no labels. Hence neither of them can be used as-is; but if labels were included, I think we could use both of them, and then the answer would be both.


Q6) You have created a document term matrix of the data, treating every tweet as one document. Which of the following is correct, in regards to document term matrix?

All of these options seem correct to me for now, but let’s dive deeper into what a document term matrix is.

Image from this website
Example from this website

From the two images above we can get an idea of what a document term matrix is. I would simply describe it as an easier method to represent text or sentences, in vector form. Now, let’s go over the options. Removing stop-words means removing words such as ‘is’, which removes columns from the matrix, hence it counts as dimensionality reduction. I understand normalizing as keyword normalization, converting studying into study, etc.; so if we have two sentences with the words studying and study, both words will be represented as study — again dimensionality reduction. Finally, I think converting all of the words into lower case is also dimensionality reduction, since it reduces the number of distinct words we need to represent all of the sentences. Therefore, I think the answer is D).
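The lower-casing point can be illustrated with a toy document term matrix (the two example documents below are made up for the sketch):

```python
from collections import Counter

# Two toy documents with mixed casing of "data" and "science".
docs = ["Data science is fun", "I love data Science"]

def doc_term_matrix(docs, lowercase=False):
    """Build a vocabulary and a count matrix (one row per document)."""
    token_lists = [(d.lower() if lowercase else d).split() for d in docs]
    vocab = sorted({w for toks in token_lists for w in toks})
    matrix = [[Counter(toks)[w] for w in vocab] for toks in token_lists]
    return vocab, matrix

v_raw, _ = doc_term_matrix(docs)                  # case-sensitive vocabulary
v_low, _ = doc_term_matrix(docs, lowercase=True)  # lower-cased vocabulary

print(len(v_raw), len(v_low))  # fewer columns after lowercasing
```

Here “Data”/“data” and “science”/“Science” collapse into single columns after lowercasing, shrinking the matrix.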


Interactive Code

For Google Colab, you would need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding!

To access the code used in this post please click here.


Final Words

These questions are really a good place to start with NLP. Not only am I able to practice implementation, but I also gain some theoretical knowledge.

If any errors are found, please email me at jae.duk.seo@gmail.com, if you wish to see the list of all of my writing please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.


Reference

  1. 30 Questions to test a data scientist on Natural Language Processing [Solution: Skilltest — NLP]. (2017). Analytics Vidhya. Retrieved 24 May 2018, from https://www.analyticsvidhya.com/blog/2017/07/30-questions-test-data-scientist-natural-language-processing-solution-skilltest-nlp/
  2. Shivam Bansal, Author at Analytics Vidhya. (2018). Analytics Vidhya. Retrieved 24 May 2018, from https://www.analyticsvidhya.com/blog/author/shivam5992/
  3. lemmatization — Google Search. (2018). Google.ca. Retrieved 24 May 2018, from https://www.google.ca/search?q=lemmatization&rlz=1C1CHBF_enCA771CA771&oq=Lemmatization&aqs=chrome.0.0l6.308j0j7&sourceid=chrome&ie=UTF-8
  4. Levenshtein distance. (2018). En.wikipedia.org. Retrieved 24 May 2018, from https://en.wikipedia.org/wiki/Levenshtein_distance
  5. Soundex. (2018). En.wikipedia.org. Retrieved 24 May 2018, from https://en.wikipedia.org/wiki/Soundex
  6. What exactly is an n Gram?. (2018). Stack Overflow. Retrieved 24 May 2018, from https://stackoverflow.com/questions/18193253/what-exactly-is-an-n-gram
  7. Generate bigrams with NLTK. (2018). Stack Overflow. Retrieved 24 May 2018, from https://stackoverflow.com/questions/37651057/generate-bigrams-with-nltk
  8. Stopwords. (2018). Ranks.nl. Retrieved 25 May 2018, from https://www.ranks.nl/stopwords
  9. Brownlee, J. (2017). How to Clean Text for Machine Learning with Python. Machine Learning Mastery. Retrieved 25 May 2018, from https://machinelearningmastery.com/clean-text-machine-learning-python/
  10. How to replace punctuation in a string python?. (2018). Stack Overflow. Retrieved 25 May 2018, from https://stackoverflow.com/questions/12437667/how-to-replace-punctuation-in-a-string-python/12437721
  11. [online] Available at: https://www.quora.com/How-do-I-remove-stopwords-from-a-file-using-python [Accessed 25 May 2018].
  12. Python Regular Expressions. (2018). www.tutorialspoint.com. Retrieved 25 May 2018, from https://www.tutorialspoint.com/python/python_reg_expressions.htm
  13. Python/Regex — How to extract date from filename using regular expression?. (2018). Stack Overflow. Retrieved 25 May 2018, from https://stackoverflow.com/questions/7728694/python-regex-how-to-extract-date-from-filename-using-regular-expression
  14. Python regex match OR operator. (2018). Stack Overflow. Retrieved 25 May 2018, from https://stackoverflow.com/questions/19821487/python-regex-match-or-operator
  15. Support vector machine. (2018). En.wikipedia.org. Retrieved 25 May 2018, from https://en.wikipedia.org/wiki/Support_vector_machine
  16. Naive Bayes classifier. (2018). En.wikipedia.org. Retrieved 25 May 2018, from https://en.wikipedia.org/wiki/Naive_Bayes_classifier
  17. Document-term matrix. (2018). En.wikipedia.org. Retrieved 25 May 2018, from https://en.wikipedia.org/wiki/Document-term_matrix