Input text, output stopwords
With the new Node module, stopword-trainer, it’s as simple as that.
For natural language processing and information retrieval, stop words are essential, weeding out words that bear little or no meaning. We already have the stopword module with stopword lists for 19 languages. But if your language is missing, there is now stopword-trainer, so anyone can generate stopword lists for any language or content. All it needs is a lot of text in a language to do a good job ranking the words in a corpus from most to least likely stopword.
The actual calculation is Term Frequency * Document Frequency: a combination of how many times a word is used in total, and in how many documents it appears.
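To make the calculation concrete, here is a minimal sketch of that scoring in plain JavaScript. This is an illustration of the formula described above, not the stopword-trainer implementation itself; the tokenizer (a simple `[a-z]+` match) is an assumption for the example.

```javascript
// Score = term frequency (total occurrences) * document frequency
// (number of documents containing the term). Highest score = most
// likely stopword.
function stopwordScores(documents) {
  const termFrequency = {};
  const documentFrequency = {};
  for (const doc of documents) {
    // Naive tokenizer for the example; note it skips numbers entirely
    const words = doc.toLowerCase().match(/[a-z]+/g) || [];
    for (const word of words) {
      termFrequency[word] = (termFrequency[word] || 0) + 1;
    }
    // Count each term at most once per document
    for (const word of new Set(words)) {
      documentFrequency[word] = (documentFrequency[word] || 0) + 1;
    }
  }
  return Object.keys(termFrequency)
    .map(term => ({ term, score: termFrequency[term] * documentFrequency[term] }))
    .sort((a, b) => b.score - a.score);
}

const ranked = stopwordScores([
  'the cat sat on the mat',
  'the dog barked at the cat',
  'a bird flew over the house'
]);
// "the" appears 5 times across all 3 documents, so it ranks on top
```

Even on this tiny corpus, a word like “the” scores far above the content words, which is exactly the property the trainer exploits at scale.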
How to make the module better
There are two really obvious features to add to make the module better. One is to let it process only part of the text, typically the body text, where you have whole sentences, and lots of them. This keeps fields without real prose from adding noise from unimportant words that bear no meaning.
The other feature would be to protect some words from being classified as stopwords. Keyword fields typically contain the same words across many documents, so their words would easily be misidentified as stopwords. There should be a way to protect words found in certain fields from ending up on the stopword list.
Explained by example
In the reuters-21578-json module you get five fields per document: title, body, date, places and id. The id and date fields are just noise if your regular expression matches numbers too (it currently doesn’t, but that will be configurable at a later stage). And the title may contain words with a higher content-to-noise ratio than regular text. That leaves body and places. The latter is a keyword field for geographical places, which means its words are anything but meaningless. It’s really a vector defining the content, and it should be kept out of the calculation, no matter what.
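A sketch of what that field filtering could look like, using the reuters-21578-json field names from above. The stopword-trainer call itself is left out since its API isn’t shown here; this only illustrates keeping the body text and never letting id, date or places reach the trainer at all.

```javascript
// Keep only the selected fields of each document as training text.
// Fields not listed (id, date, places, ...) are simply never seen
// by the trainer, so their words can't become stopwords.
function extractTrainingText(documents, fields = ['body']) {
  return documents.map(doc =>
    fields.map(field => doc[field] || '').join(' ').trim()
  );
}

const docs = [
  { id: 1, date: '26-FEB-1987', title: 'Grain report',
    body: 'Wheat prices rose today.', places: ['usa'] }
];
const trainingText = extractTrainingText(docs);
// → ['Wheat prices rose today.']
```

Dropping a field entirely is the bluntest form of protection; a finer-grained option would be to feed the places values into a protected-words list instead.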
What the future could hold
The automatic stopword generator for your content
What counts as a stopword will always vary, depending on the content. A language doesn’t have just one way of being used. Every organisation, sub-culture or group of people has its own jargon. The combination of words they use, misuse and don’t use is unique to them. For search-index and norch, you could export the index on a schedule, feed it to the stopword-trainer, and constantly get an up-to-date list of stopwords when indexing and re-indexing your content.
Variable length stopword lists
The stopword lists available in the stopword module come from various places. Some are sorted alphabetically, and some may be sorted from highest to lowest “stopwordiness”. If we knew that all of them were sorted by “stopwordiness”, we could easily choose where to cut off the list, depending on whether we need few or many stopwords.
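With a list sorted from highest to lowest “stopwordiness”, a variable-length list is just a matter of where you cut. A small sketch of what that could look like; the option names (`count`, `fraction`) are made up for the example:

```javascript
// sortedTerms must already be ordered from most to least "stopwordy".
// Cut by an absolute count, or by a fraction of the full list.
function cutStopwordList(sortedTerms, options = {}) {
  if (options.count) {
    return sortedTerms.slice(0, options.count);
  }
  if (options.fraction) {
    return sortedTerms.slice(0, Math.ceil(sortedTerms.length * options.fraction));
  }
  return sortedTerms;
}

const list = ['the', 'of', 'and', 'a', 'in'];
const short = cutStopwordList(list, { count: 3 });  // → ['the', 'of', 'and']
```

An aggressive index might take the top half of the list, while a precision-sensitive one takes only the top few entries, all from the same trained list.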
Need stopwords for a language?
Create a new issue over at the stopword module, and we can discuss how to collaborate on creating a new stopword list.
Also, if you happen to know Punjabi well, I’m very interested in some help creating a stopword list for that language. I’m trying to cover the most used languages in the world.