Published in Inside_Wallapop

Spanish Plural Stemmer: Matching plural and singular forms in Spanish using Lucene

At Wallapop we build our search engine on top of Solr/Lucene. Most of our users use Spanish as their main language, so we have worked on several topics to enhance searches in Spanish against our catalogue. This article and the related code are our first attempt to contribute back to the community some of our work on enhancing Lucene’s capabilities for processing Spanish text.

“Stemming” is the process of reducing a word to its stem, the base to which prefixes and suffixes attach, or to its root form, known as a lemma. Stemmers use a number of approaches to reduce whatever inflected form is encountered to that base.

In this article, we introduce a new stemmer implementation for Spanish that precisely stems plural forms into singular forms while respecting gender information. Our aim is to contribute this implementation to the Lucene community.

First, we will describe current stemming implementations, their behaviour and possible applications. Second, we will introduce the use case driving our development. Third, we will present our solution for stemming plural into singular. Finally, we will conclude with an analysis of the different Spanish stemmers in action.

Analysis of current Lucene stemming implementations

In this section, we will analyze the behaviour of the different available implementations of Spanish stemmers in Lucene.

Snowball-based stemmer for Spanish (https://snowballstem.org/).

Characteristics

  • General-purpose stemmer: stems plural/singular forms, masculine and feminine inflections, verbal forms, adverbs…
  • It will increase recall but precision can be drastically reduced
  • Lots of collisions between words with really different meaning
  • Use it when recall is more important than precision
  • Does not stem plural words of foreign origin (e.g. complots, bits, punks, robots)
  • Does not support invariants like “gafas”, “caries”, etc.

SpanishLightStemmer: an algorithmic approach based on the algorithm described in “Report on CLEF-2001 Experiments” by Jacques Savoy.

Characteristics

  • Designed to stem plural to singular form and feminine and masculine inflections to the same root
  • It will increase recall but precision can be reduced depending on the use case/information need
  • Use it when the distinction between singular and plural is not relevant and gender is also not relevant
  • Does not stem plural words of foreign origin (e.g. complots, bits, punks, robots)
  • Some collisions between words with really different meaning
    * “caro” (expensive), “cara” (face/expensive)
    * “barra” (bar), “barro” (mud)
  • Does not support invariants like “gafas”, “caries”, “seis”, etc.
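
The collisions listed above follow directly from removing gender endings. The sketch below is a deliberately simplified illustration in plain Java, not Lucene's actual light stemmer: the class `GenderFoldingSketch` and its single rule (drop a final "s", then a final "a" or "o") are assumptions made for this example only.

```java
final class GenderFoldingSketch {
    // Hypothetical, simplified rule: drop a trailing "s" (plural),
    // then a trailing "a" or "o" (gender). Real light stemmers apply more rules.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.length() > 3 && w.endsWith("s")) {
            w = w.substring(0, w.length() - 1);
        }
        if (w.length() > 3 && (w.endsWith("a") || w.endsWith("o"))) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        // Unrelated words end up with the same token once the gender ending is removed.
        System.out.println(stem("caro") + " / " + stem("cara"));
        System.out.println(stem("barra") + " / " + stem("barro"));
    }
}
```

Once the final vowel is dropped, nothing is left in the token to tell “barra” from “barro”, which is why this family of stemmers only fits use cases where gender is irrelevant.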

SpanishMinimalStemmer: an algorithmic approach with very minimal rules. It tries to stem only plural to singular, but in several cases it does not produce the correct singular form. In those cases, the stemmed plural will not match the non-stemmed version of the word. It can also collide with other, unrelated words.

Characteristics

  • Designed to stem just plural to singular form
  • Distinguishes between masculine and feminine forms
  • Folds ‘ñ’ to ‘n’: this generates collisions between wildly different words
    * e.g. “peñas” (hills) clashes with “penas” (pains)
    * e.g. “cañas” (canes) clashes with “canas” (white hair)
    * e.g. “años” (years) clashes with “anos” (anuses)
  • Does not support invariants like “gafas” (stemmed to “gafa”) or “caries” (stemmed to “cari”). These words are stemmed and can collide with unexpected words (e.g. in e-commerce “gafa” is a brand of appliances, so it collides with “gafas”; “cari” is a slang word for lover).
  • Does not correctly stem some types of plurals, especially those ending in ‘es’ where the preceding character is one of `a, b, f, g, h, i, j, l, m, p, r, t, u, v`. In the following table, you can find some examples.
  • There are other cases where it fails to produce a matching token for plural and singular, such as words that end in ‘y’ whose plural uses ‘i’ (e.g. “jerseis” should stem to “jersey” but stems to “jersei”)
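
To make the ‘ñ’ problem concrete, here is a small plain-Java sketch. It does not reproduce the stemmer's real code: `foldAndStrip` and `stripOnly` are hypothetical helpers that only drop a final "s", with and without folding ‘ñ’ first, which is enough to show why the tokens collide.

```java
final class EnyeFoldingSketch {
    // Hypothetical helper: drop a final "s" from words longer than three characters.
    private static String dropFinalS(String w) {
        return w.length() > 3 && w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
    }

    // Folds "ñ" to "n" before stripping the plural "s".
    static String foldAndStrip(String word) {
        return dropFinalS(word.toLowerCase().replace("ñ", "n"));
    }

    // Strips the plural "s" but keeps "ñ" as a distinct letter.
    static String stripOnly(String word) {
        return dropFinalS(word.toLowerCase());
    }

    public static void main(String[] args) {
        // With folding, both words produce the same token.
        System.out.println(foldAndStrip("años") + " / " + foldAndStrip("anos"));
        // Without folding, the tokens stay apart.
        System.out.println(stripOnly("años") + " / " + stripOnly("anos"));
    }
}
```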

Note: Stemmers are not required to generate grammatically correct tokens, but they must generate the same token for the singular and plural forms of a word. Generating correct singular forms also reduces the chance of collisions with other words, which increases recall while maintaining precision.
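
The matching requirement in this note can be written as a simple check. The following is a minimal sketch in plain Java; `StemMatchCheck`, `matches` and the naive `dropFinalS` stemmer are hypothetical names introduced for illustration and are not part of Lucene.

```java
import java.util.function.UnaryOperator;

final class StemMatchCheck {
    // True when the stemmer maps both forms to the same token,
    // which is what index-time and query-time matching relies on.
    static boolean matches(UnaryOperator<String> stemmer, String singular, String plural) {
        return stemmer.apply(singular).equals(stemmer.apply(plural));
    }

    public static void main(String[] args) {
        // Hypothetical naive stemmer used only for this example: drop a final "s".
        UnaryOperator<String> dropFinalS =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;

        // "amigo" and "amigos" end up sharing a token.
        System.out.println(matches(dropFinalS, "amigo", "amigos"));
        // "jerseis" becomes "jersei", which never equals "jersey".
        System.out.println(matches(dropFinalS, "jersey", "jerseis"));
    }
}
```

A check like this, run over a list of known singular/plural pairs, is one way to compare stemmers before choosing one.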

Use case: matching singular and plural whilst respecting gender information

This is a very common use case in e-commerce and second-hand classifieds, where you want to match plural and singular versions of the same keyword, to increase recall without losing precision, but you don’t want to lose gender information. There are lots of different types of products where there is no relevant difference if they are described in singular or plural form in this context. The user’s information need is covered by both, thus giving the same results for those keywords makes total sense.

In the images below we can see two different user searches that target the same type of product. The results differ because we are not applying any form of stemming. Using stemming here helps the user find the product they are looking for: a bike for a child.

Searching for “bicicleta niños”
Searching for “bicicleta niño”

In some use cases gender is irrelevant, so the Spanish Light Stemmer will do the trick. However, a different approach to stemming plurals is needed if we want to retain gender. In many e-commerce or classifieds searches, users want to distinguish gender (e.g. in fashion products). Lucene 8 introduced the SpanishMinimalStemmer, which tries to cover this use case specifically but fails to provide precise stemming in some cases. Thus, we think our custom implementation could be of interest to anybody facing a similar use case.

We developed in-house a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. Our goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

You can find a pre-release version here: https://github.com/xaviersanchez/lucene/pull/1

Characteristics

  • Algorithmic approach based on the Spanish rules for building plural forms,
    as defined in Wikilengua (http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n))
  • Designed to stem just plural to singular form
  • Distinguishes between masculine and feminine forms
  • It will increase recall but precision can be reduced depending on the use case/information need
  • Stems plural words of foreign origin
    * e.g. complots, bits, punks, robots
  • Support for invariant words: same plural and singular form, or words where a plural does not make sense
    * crisis, jueves, lapsus, abrebotellas, etc.
  • Support for special cases
    * yoes, clubes, itemes, faralaes
  • Use it when the distinction between singular and plural is not relevant but gender is relevant
  • Produces meaningful tokens in singular form
    * No strange stems like “amig”: as stated in previous sections, stemmers are not required to generate grammatically correct tokens, but generating correct stems reduces the chance of collisions with other words
  • We are preparing it for release to the community
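
The characteristics above suggest a lookup order: invariant words first, then special cases, then general suffix rules. The following plain-Java sketch illustrates that order only. It is not the SpanishPluralStemmer code from the pull request; the class name, the word lists (taken from examples in this article) and the two suffix rules are assumptions, and plurals formed with consonant + "es" or involving accents are left out.

```java
import java.util.Map;
import java.util.Set;

final class PluralStemmerSketch {
    // Invariant words are returned unchanged.
    private static final Set<String> INVARIANTS =
        Set.of("crisis", "jueves", "lapsus", "abrebotellas", "gafas", "caries");

    // Special plurals are mapped to an explicit singular.
    private static final Map<String, String> SPECIAL =
        Map.of("yoes", "yo", "clubes", "club", "itemes", "item");

    static String stem(String word) {
        String w = word.toLowerCase();
        if (INVARIANTS.contains(w)) {
            return w;
        }
        String special = SPECIAL.get(w);
        if (special != null) {
            return special;
        }
        if (w.length() > 3 && w.endsWith("ces")) {
            // z -> ces plurals (luz/luces); a fuller rule set would also have to
            // exclude singulars that already end in "ce".
            return w.substring(0, w.length() - 3) + "z";
        }
        if (w.length() > 3 && w.endsWith("s")) {
            // Vowel + "s" plurals and foreign plurals such as "robots".
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"amigos", "amigas", "robots", "crisis", "clubes", "luces"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because only the plural ending is removed, “amigos” and “amigas” keep distinct tokens, and the output is a readable singular rather than a truncated stem.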

Stemmers in Action

In this section, we will show the output of the different Spanish stemmers when applied to different types of words. The focus is essentially plural stemming. The table tries to show the different cases discussed in each stemmer’s section.

Some General Guidelines

  • Look for unwanted collisions: use SetKeywordMarkerFilter to mark tokens such as brands, models and names as keywords so that they are not stemmed
  • Choose a stemmer for a specific purpose: think about what you want to achieve with stemming
    * removing plurals
    * removing gender
    * increasing recall at the cost of precision: “I want to see everything” vs. “I want a specific document or type of document”
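
In Lucene, SetKeywordMarkerFilter flags matching tokens as keywords, and stem filters placed after it leave those tokens untouched. The sketch below mimics that idea in plain Java without any Lucene dependency; `ProtectedTermsSketch`, `protect`, the naive `dropFinalS` stemmer and the sample brand term are hypothetical and used only for illustration.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

final class ProtectedTermsSketch {
    // Wraps a stemmer so that protected terms (brands, models, names) pass through unchanged.
    static UnaryOperator<String> protect(Set<String> protectedTerms, UnaryOperator<String> stemmer) {
        return token -> protectedTerms.contains(token) ? token : stemmer.apply(token);
    }

    public static void main(String[] args) {
        // Hypothetical naive stemmer: drop a final "s".
        UnaryOperator<String> dropFinalS =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;

        // "levis" stands in for a brand name that should never be stemmed.
        UnaryOperator<String> safe = protect(Set.of("levis"), dropFinalS);
        System.out.println(safe.apply("levis"));     // protected term, left as is
        System.out.println(safe.apply("camisetas")); // regular term, stemmed
    }
}
```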

The following table summarizes the usage of the different stemmers according to the desired outcome.

