Hi Daniel – Thanks for this post and this great blog.
Prithiviraj Damodaran

Prithiviraj, thanks for the kind words. As for NER systems for English text only recognizing named entities in title case, that’s usually a function of how they are trained. If they’re trained on sentences from grammatical long-form documents, then it’s reasonable for them to expect named entities to be in title case. Indeed, restricting their attention to title-case strings and phrases is a great way to improve both accuracy and efficiency.

But you can certainly train a model that ignores case. For example, Stanford NLP includes a model for caseless English: https://stanfordnlp.github.io/CoreNLP/download.html. And, as you’ve noted, Google also ignores case in its search queries.

How you train the model should be appropriate to where you will apply it. If you’re analyzing news articles, it’s probably a good idea to take advantage of capitalization as a signal. In contrast, most English search queries are lowercase.
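To make the trade-off concrete, here is a minimal sketch in Python of how a token-level feature extractor might switch capitalization signals on or off. The function name `token_features` and the `use_case` flag are hypothetical, chosen only for illustration; this is not how Stanford NLP or any particular library builds its caseless models.

```python
def token_features(tokens, i, use_case=True):
    """Build a feature dict for tokens[i] (hypothetical helper, for illustration).

    With use_case=False, every feature is computed from the lowercased
    token, so "Paris" and "paris" look identical to the model.
    """
    word = tokens[i]
    lower = word.lower()
    feats = {
        "lower": lower,
        "suffix3": lower[-3:],
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
    }
    if use_case:
        # Capitalization signals: useful for edited text such as news,
        # but misleading for lowercase search queries.
        feats["is_title"] = word.istitle()
        feats["is_upper"] = word.isupper()
    return feats
```

With `use_case=False`, the features for "paris" in a query like `flights to paris` match those for "Paris" in `Flights to Paris`, so a model trained this way has to rely on context and vocabulary rather than capitalization.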

Hope that helps!
