Vineet Kumar
Sep 1, 2018

Thank you for your kind words Chris! :)

You are absolutely right to wonder about the big drop. I should have been more careful in analyzing this! I checked and found that the blank sentence occurs about 13M times! It should definitely be excluded from the sentences. That still leaves the question of 78M -> 53M. You can see the distribution of the top 10K sentences here: https://github.com/vineetm/tf-similar-sentences/blob/master/data/sentences.10k.counts.txt
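For anyone who wants to repeat this kind of check, here is a minimal sketch of counting duplicate sentences. It is not the script used to produce the counts file; the `count_sentences` helper and the tiny in-memory sample are made up for illustration, and a real run would read one sentence per line from the sentence file.

```python
from collections import Counter

def count_sentences(lines):
    """Count how often each sentence occurs, after stripping surrounding whitespace."""
    return Counter(line.strip() for line in lines)

# Hypothetical sample standing in for the real sentence file (one sentence per line).
sample = ["Paris is the capital of France.", "", "History", "", "History", "  "]
counts = count_sentences(sample)

blank = counts[""]            # occurrences of the blank sentence
total = sum(counts.values())  # all sentences, duplicates included
unique = len(counts)          # distinct sentences

print(f"total={total} unique={unique} blank={blank}")
for sentence, n in counts.most_common(3):
    print(n, repr(sentence))
```

Comparing `total` with `unique`, and looking at `counts[""]` on its own, shows how much of a drop comes from blank lines versus other repeated sentences.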

You will notice that most of the top-occurring ‘sentences’ are not really sentences, but single words, and perhaps page titles/annotations. We start seeing some useful sentences from line 566. Perhaps short sentences and wiki markup sentences should be filtered more carefully.
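A rough sketch of what such a filter could look like is below. Everything in it is an assumption rather than what the repository does: the `keep_sentence` helper, the `MIN_TOKENS` threshold of 4, and the `MARKUP` pattern are illustrative guesses that would need tuning against the counts file.

```python
import re

MIN_TOKENS = 4  # assumed threshold, not taken from the repository
# Assumed heuristic: brackets, braces, pipes, angle brackets, '=' anywhere,
# or a leading list/indent character, suggest leftover wiki markup.
MARKUP = re.compile(r"[\[\]{}|<>=]|^[*#:;]")

def keep_sentence(sentence, min_tokens=MIN_TOKENS):
    """Return True if the sentence is non-blank, long enough, and markup-free."""
    text = sentence.strip()
    if not text:
        return False  # blank sentence
    if len(text.split()) < min_tokens:
        return False  # single words, page titles, short annotations
    if MARKUP.search(text):
        return False  # looks like wiki markup
    return True

# Hypothetical candidates for illustration.
candidates = [
    "",
    "History",
    "[[Category:Physics]]",
    "== See also ==",
    "Paris is the capital of France.",
]
kept = [s for s in candidates if keep_sentence(s)]
print(kept)
```

A whitespace token count is a blunt instrument; it will also drop legitimate short sentences, so the threshold is a trade-off between cleanliness and coverage.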

Written by Vineet Kumar

Machine Learning and Deep Learning enthusiast. TensorFlow hacker. Love Python! Research Software Engineer at IBM Research Labs, New Delhi, India.