Hi Jennifer Koenig, glad to hear the story was helpful for you. I don't think your question has a definitive answer, though. It depends a lot on the text you're working with and what you really need to get out of it.

For example, if you're processing technical content, let's say in medicine, and your goal is to capture certain professional/scientific words to tackle some kind of classification problem afterwards, you might only need to keep a very small percentage of the vocabulary if those words turn up among the most common words in the content. On the contrary, if those words are rare throughout the text and you find them only within the least frequent 20% of words, then you would need to keep most of the vocabulary.

All in all, I would say keep your objective in mind and, at first, visualize the entire word-frequency distribution at least once using CountVectorizer, to get a notion of what's happening with your content. Then choose a threshold accordingly :)

    Gonzalo Ferreiro Volpi
