Sep 1, 2018 · 1 min read
Thank you for your kind words, Chris! :)
You are absolutely right to wonder about the big drop; I should have analyzed it more carefully. I checked and found that the blank sentence occurs about 13M times! It should definitely be excluded from the sentences. That still leaves the question of 78M -> 53M. You can see the distribution of the top 10K sentences here: https://github.com/vineetm/tf-similar-sentences/blob/master/data/sentences.10k.counts.txt
You will notice that most of the top-occurring ‘sentences’ are not really sentences but single words, and perhaps page titles or annotations. We start seeing some useful sentences from line 566. Perhaps short sentences and wiki markup sentences should be filtered more carefully.
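To make the filtering idea concrete, here is a rough sketch of what a stricter filter could look like before counting. It is not the code from the repo: the names `MIN_TOKENS`, `MARKUP_RE`, `is_useful` and `count_sentences` are made up for illustration, and the token threshold and markup patterns are guesses that would need tuning against the actual data.

```python
import re
from collections import Counter

# Assumed threshold: treat anything under 3 whitespace-separated tokens
# as a title or annotation rather than a sentence.
MIN_TOKENS = 3

# Assumed patterns for lines that consist only of wiki markup:
# a heading (== Title ==), a single link ([[...]]), or a single template ({{...}}).
MARKUP_RE = re.compile(r"=+[^=]+=+|\[\[[^\]]*\]\]|\{\{[^}]*\}\}")


def is_useful(sentence: str) -> bool:
    """Return True if the sentence is non-blank, long enough, and not pure markup."""
    stripped = sentence.strip()
    if not stripped:
        return False  # blank sentence
    if len(stripped.split()) < MIN_TOKENS:
        return False  # single words, short titles
    if MARKUP_RE.fullmatch(stripped):
        return False  # heading, link, or template on its own
    return True


def count_sentences(sentences):
    """Count occurrences of each sentence that passes the filter."""
    return Counter(s.strip() for s in sentences if is_useful(s))
```

Running `count_sentences` over the raw sentences and comparing the total and distinct counts with the unfiltered numbers should show how much of the drop comes from blanks and fragments rather than from genuinely repeated sentences.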
