My Take on 30 Questions to test a data scientist on Natural Language Processing with Interactive Code — Part 2

Jae Duk Seo
May 27, 2018
GIF from this website

Shivam Bansal is a data scientist with extensive experience in Natural Language Processing and Machine Learning, and he has an amazing blog post about natural language processing. If anyone is interested, please check out his work; it is super informative. Today, I’ll try to answer some of the questions from his 30 NLP questions. (Please click here to see part 1.)

Also, I am not going to answer the questions in numeric order. For every question, I’ll try to find the right answer and link to the sources I used. However, I am always open to learning and growing, so if you know a better solution, please comment down below.


Q7) Which of the following features can be used for accuracy improvement of a classification model?

Let’s look over our options more closely.
Frequency count of terms → Counting how often each word appears in the given sentence. (I would consider this feature engineering. As a very simple example, a sentence with more negative words might indicate a negative emotion.)

Image from this website

Vector Notation of sentence → The process of converting a sentence or document into a vector. (We can think of this as feature engineering as well.)

Image from this website

Part of Speech Tag → The process of labeling each word in the sentence with its part of speech, such as noun, verb, etc.

Image from this website

Dependency Grammar → Using this, we may be able to perform dependency parsing to create more sophisticated features for each sentence.

Image from this website

So it seems like all of these methods can be used to create higher-level features from a given text. Hence I will say we can use all of them.
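To make the “vector notation” option a bit more concrete, here is a minimal bag-of-words sketch using only plain Python. The three example sentences and the helper names (`build_vocabulary`, `to_vector`) are made up for illustration; real projects usually rely on a library vectorizer and a proper tokenizer.

```python
def build_vocabulary(sentences):
    """Map each distinct lowercase word to a column index, in sorted order."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(words)}

def to_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Made-up sentences, used only to show the shape of the result.
sentences = ["the dog is cute", "the cat is cute", "the dog chased the cat"]
vocab = build_vocabulary(sentences)
vectors = [to_vector(s, vocab) for s in sentences]

for s, v in zip(sentences, vectors):
    print(v, "<-", s)
```

Each sentence becomes a fixed-length list of counts, which is the kind of input a classifier can consume directly.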

Finally, let’s take a look at a simple implementation of Frequency count of terms and Part of Speech Tag.
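A minimal sketch of the frequency-count feature, using only Python’s standard library. The example sentence and the tiny negative-word list are made up for illustration; a real project would use a proper tokenizer and sentiment lexicon. Part-of-speech tags would normally come from a library such as NLTK (`nltk.pos_tag`), which is left out here to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Hypothetical negative-word list, made up for this illustration only.
NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate"}

def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def frequency_features(sentence):
    """Return per-term counts plus the total number of negative words."""
    counts = Counter(tokenize(sentence))
    negative_total = sum(counts[w] for w in NEGATIVE_WORDS)
    return counts, negative_total

counts, negative_total = frequency_features(
    "The food was bad, the service was bad, and I hate waiting."
)
print(counts.most_common(3))
print("negative words:", negative_total)
```

The term counts and the negative-word total could both be passed to a classifier as features.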


Q8) What percentage of the total statements are correct with regards to Topic Modeling?

I really like how this question is formed. Before moving on, let’s see what topic modeling is. (Even from the name we can kind of guess what it means, but just to be sure: it seems to be the process of summarizing a sentence/document by the topics it contains.)

Image from this website
Example of Topic Modeling from this website

This can be thought of as an unsupervised learning problem: if we already knew the topic of each document, there would be no need to extract topics from unlabeled data.

Image from this website

And with one simple Google search we can see that topic modeling is performed with Latent Dirichlet Allocation, not Linear Discriminant Analysis.

For the third option, it seems unreasonable to think that the number of topics does not depend on the size of the data. For example, let’s say our data consists of only one sentence, “That dog is so cute”. Well, the topic is dog, and it would be extremely hard to talk about 2 billion topics in one sentence, hence I think there is ‘some’ dependency between the number of topics and the size of the given data. But for the final option, even if we have 400 billion documents, if half of them only talk about dogs and the other half only talk about cats, we only have two topics. So the number of topics and the size of the given data are not directly proportional to one another. Hence 0 percent: none of the statements are correct.


Q9) In Latent Dirichlet Allocation model for text classification purposes, what does alpha and beta hyper-parameter represent-

Well, I currently have no idea about the inner details of Latent Dirichlet Allocation, so let’s do some research. (Also, strangely, this question has False / True after each option.)

Image from this blog

Thankfully, there is no shortage of explanations of the hard topic of LDA. This blog post and this blog post do a great job explaining it. And thankfully lol, another user on Data Science Stack Exchange asked the exact same question.

Image from this website

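So in other words, the alpha value controls the density of topics within each document, and the beta value controls the density of words within each topic. A higher alpha means each document is likely to contain a mixture of most topics, and a higher beta means each topic is likely to contain a mixture of most words. (The main topic can be dog, or dog food, dog food price etc…) Hence the answer is D).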
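To get a feel for what “density of topics within each document” means, here is a small simulation sketch. It only samples topic mixtures from a symmetric Dirichlet prior (it is not a full LDA implementation), and the number of topics, number of simulated documents, and the two alpha values are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) distribution.

    Uses the standard construction: draw independent Gamma(alpha, 1)
    values and normalize them so they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def average_largest_share(alpha, num_topics=5, num_docs=500, seed=0):
    """Average weight of the dominant topic across simulated documents."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(num_docs)]
    return sum(shares) / num_docs

low_alpha = average_largest_share(alpha=0.1)
high_alpha = average_largest_share(alpha=10.0)
print(f"alpha=0.1  -> dominant topic share ~ {low_alpha:.2f}")
print(f"alpha=10.0 -> dominant topic share ~ {high_alpha:.2f}")
```

With a small alpha, most simulated documents are dominated by a single topic; with a large alpha, the weight is spread more evenly across topics. Beta plays the same role for the words inside each topic.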


Q10) Solve the equation according to the sentence “I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon”.

Since we already know how to do POS tagging and frequency counting, let’s just implement this and get the answer.

If we count all of the nouns (including the pronoun), all of the verbs, and the words that appear more than once, we get 7, 4, and 2 words respectively. Hence the answer is D).
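The counts can be checked with a short sketch. The tags below are assigned by hand for illustration (they are not the output of a real tagger such as NLTK), and the pronoun “I” is counted together with the nouns, as in the answer above.

```python
from collections import Counter

sentence = "I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon"
words = sentence.split()

# Hand-assigned coarse tags (an assumption for illustration, not tagger output).
tags = {
    "I": "PRON", "am": "VERB", "planning": "VERB", "to": "OTHER",
    "visit": "VERB", "New": "NOUN", "Delhi": "NOUN", "attend": "VERB",
    "Analytics": "NOUN", "Vidhya": "NOUN", "Hackathon": "NOUN",
}

# Count nouns (plus the pronoun), verbs, and words that occur more than once.
nouns = sum(1 for w in words if tags[w] in {"NOUN", "PRON"})
verbs = sum(1 for w in words if tags[w] == "VERB")
repeated = [w for w, c in Counter(words).items() if c > 1]

print(nouns, verbs, len(repeated))  # 7 4 2
```

The two repeated words are “to” and “Delhi”, each appearing twice.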


Q11) In a corpus of N documents, one document is randomly picked. The document contains a total of T terms and the term “data” appears K times.

What is the correct value for the product of TF (term frequency) and IDF (inverse-document-frequency), if the term “data” appears in approximately one-third of the total documents?

This question blew my mind since I had no idea what it was talking about. But here it goes: let’s first look at the equations for term frequency and inverse document frequency.

Image from this website

TF = K / T
[ number of times the term ‘data’ appears in the document / total number of terms in the document ]
IDF = log(N / (N/3)) = log(3)
[ total number of documents / number of documents containing the term ‘data’ ]

Hence, combining those two terms, we get (K/T) * log(3), giving us the answer B).
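As a quick sanity check, the same calculation can be sketched in Python. The values of K, T, and N below are hypothetical (the question leaves them symbolic), and the natural logarithm is assumed since the question does not specify a base.

```python
import math

def tf_idf(term_count, total_terms, num_docs, docs_with_term):
    """TF-IDF as defined above: (K / T) * log(N / document frequency)."""
    tf = term_count / total_terms
    idf = math.log(num_docs / docs_with_term)
    return tf * idf

# Hypothetical values: K = 3 occurrences, T = 100 terms, N = 300 documents,
# with the term appearing in one-third of them (100 documents).
K, T, N = 3, 100, 300
score = tf_idf(K, T, N, N / 3)
print(score, (K / T) * math.log(3))
```

Both printed values agree, because N cancels out of the IDF term whenever the term appears in exactly one-third of the documents.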


Q12) Which of the following documents contains the same number of terms and the number of terms in the one of the document is not equal to least number of terms in any document in the entire corpus.

From the start we can remove option A), since d1 and d4 do not contain the same number of terms (d1 has 6 while d4 has 4). The documents in every other option do have the same number of terms.

The second portion was VERY confusing: “the number of terms in the one of the document is not equal to least number of terms in any document in the entire corpus”. I have no idea what this portion is asking. If anyone knows what the above sentence means, please let me know in the comment section.


Interactive Code

For Google Colab, you need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy Coding!

To access the code used in this post please click here.


Final Words

Again, these questions are a great place to practice fundamental NLP topics.

If any errors are found, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.


Reference

  1. My Take on 30 Questions to test a data scientist on Natural Language Processing with Interactive…. (2018). Medium. Retrieved 27 May 2018, from https://medium.com/@SeoJaeDuk/my-take-on-30-questions-to-test-a-data-scientist-on-natural-language-processing-with-interactive-5b3454a196ef
  2. 30 Questions to test a data scientist on Natural Language Processing [Solution: Skilltest — NLP]. (2017). Analytics Vidhya. Retrieved 27 May 2018, from https://www.analyticsvidhya.com/blog/2017/07/30-questions-test-data-scientist-natural-language-processing-solution-skilltest-nlp/#comment-153570
  3. Ultimate Guide to Understand & Implement Natural Language Processing. (2017). Analytics Vidhya. Retrieved 27 May 2018, from https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
  4. Word Frequency Counter. (2018). Writewords.org.uk. Retrieved 27 May 2018, from http://www.writewords.org.uk/word_count.asp
  5. How can a sentence or a document be converted to a vector? (2018). Stack Overflow. Retrieved 27 May 2018, from https://stackoverflow.com/questions/30795944/how-can-a-sentence-or-a-document-be-converted-to-a-vector
  6. Navigli, R. (2017). Lecture 7: part-of-speech tagging. Naviglinlp.blogspot.ca. Retrieved 27 May 2018, from http://naviglinlp.blogspot.ca/2017/04/lecture-7-part-of-speech-tagging.html
  7. Dependency Grammar — Google Search. (2018). Google.ca. Retrieved 27 May 2018, from https://www.google.ca/search?q=Dependency+Grammar&rlz=1C1CHBF_enCA771CA771&oq=Dependency+Grammar&aqs=chrome..69i57j69i59l3j69i60j0.104j0j7&sourceid=chrome&ie=UTF-8
  8. Count frequency of words in a list and sort by frequency. (2018). Stack Overflow. Retrieved 27 May 2018, from https://stackoverflow.com/questions/20510768/count-frequency-of-words-in-a-list-and-sort-by-frequency
  9. Is there a library for splitting sentence into a list of words in it? (2018). Stack Overflow. Retrieved 27 May 2018, from https://stackoverflow.com/questions/7026620/is-there-a-library-for-splitting-sentence-into-a-list-of-words-in-it
  10. Python Programming Tutorials. (2018). Pythonprogramming.net. Retrieved 27 May 2018, from https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
  11. topic modeling — Google Search. (2018). Google.ca. Retrieved 27 May 2018, from https://www.google.ca/search?q=topic+modeling&rlz=1C1CHBF_enCA771CA771&source=lnms&sa=X&ved=0ahUKEwjp-OGkmabbAhWHn4MKHTIxDhwQ_AUICSgA&biw=1173&bih=954&dpr=1
  12. Topic modeling complaints to the CFPB. (2017). Austinbrian.github.io. Retrieved 27 May 2018, from https://austinbrian.github.io/blog/cfpb-topic-modeling/
  13. linear discriminant analysis topic modeling — Google Search. (2018). Google.ca. Retrieved 27 May 2018, from https://www.google.ca/search?rlz=1C1CHBF_enCA771CA771&ei=Dc4KW6a3C6TNjwSOgqOIDQ&q=linear+discriminant+analysis+topic+modeling&oq=linear+discriminant+analysis+topic+modeling&gs_l=psy-ab.3...865.2623.0.2691.0.0.0.0.0.0.0.0..0.0....0...1c.1.64.psy-ab..0.0.0....0.gzt07D89c_U
  14. Latent Dirichlet allocation. (2018). En.wikipedia.org. Retrieved 27 May 2018, from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
  15. What does the alpha and beta hyperparameters contribute to in Latent Dirichlet allocation? (2018). Data Science Stack Exchange. Retrieved 27 May 2018, from https://datascience.stackexchange.com/questions/199/what-does-the-alpha-and-beta-hyperparameters-contribute-to-in-latent-dirichlet-a
  16. Your Easy Guide to Latent Dirichlet Allocation — Lettier — Medium. (2018). Medium. Retrieved 27 May 2018, from https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
  17. Introduction to Latent Dirichlet Allocation. (2018). Blog.echen.me. Retrieved 27 May 2018, from http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
  18. 5. Categorizing and Tagging Words. (2018). Nltk.org. Retrieved 27 May 2018, from https://www.nltk.org/book/ch05.html
  19. Does Python have a string ‘contains’ substring method? (2018). Stack Overflow. Retrieved 27 May 2018, from https://stackoverflow.com/questions/3437059/does-python-have-a-string-contains-substring-method
  20. Nouns and pronouns | Ask The Editor | Learner’s Dictionary. (2018). Learnersdictionary.com. Retrieved 27 May 2018, from http://www.learnersdictionary.com/qa/nouns-and-pronouns
  21. Tf-idf :: A Single-Page Tutorial — Information Retrieval and Text Mining. (2018). Tfidf.com. Retrieved 27 May 2018, from http://www.tfidf.com/
