Using Edit-Distance Algorithms To Build a Product Listing Spelling Corrector

Maanuj
Queenly Engineering
7 min read · Apr 29, 2022

Introduction

Misspelling or mistyping is an issue everyone faces, whether intentional or accidental. On Queenly, if a seller misspells a designer’s name, it hurts them in concrete ways: potential buyers sorting by designer may never see the listing, and buyers may question the seller’s credibility. Such mistakes can also hurt a product’s ranking in search. And we get it; sellers are human too!

By building this spelling corrector, we help sellers get their dresses onto the market with accurate descriptions. That raises a listing’s visibility, and with it the chance of a sale. The result is better product quality on both sides: sellers are more likely to get their dress sold, and buyers get a better experience searching for a dress of their choosing.

Explanation

Most spell checkers work through Natural Language Processing (or NLP for short), in which a program is given a set of guiding rules to fulfill some purpose, whether that be responding to a query, writing a book, or, in this case, correcting spellings.

Some big companies develop their own NLP models by training them on large datasets. Training is the process of using data to teach a model to output predictions or perform a specific task based on its underlying algorithm. But not everyone has access to such an extensive dataset; in that case, a hand-picked algorithm serves as the guiding rules the program follows to produce the wanted output.

Now that we know we’ll most likely be using an algorithm, the next step is determining exactly which one. Algorithmic approaches differ in methodology, complexity, and accuracy, so we’ll need to determine which one suits our specific use case best.

Architecture Overview

General architecture for processing product listing data and maintaining a product quality feedback loop

Here, we’re adding spell-checking of dress titles as a key workflow in our product quality platform.

As a bit of background: general spell-checkers will not necessarily work for our product datasets, because many proper nouns (designer labels and domain keywords) are not found in standard dictionary datasets. When sellers list products on Queenly, our engineering platform batch processes the product data associated with each listing to extract common names of fashion design labels (e.g., Mac Duggal, a popular prom dress label), domain-specific terms, and key search terms. After another data processing layer, these corrected popular keywords are stored internally as a canonicalized dataset.

Our spell-checker will use this canonicalized dataset to correct the titles of these product listings as a new data workflow. Below, we compare and contrast different algorithmic approaches.
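To make the canonicalization step concrete, here is a minimal, hypothetical sketch: normalize the raw designer strings from batch-processed listings, then keep only spellings seen often enough to trust. The function name, the `min_count` threshold, and the sample data are illustrative, not our production code.

```python
from collections import Counter

def canonicalize(raw_names, min_count=5):
    """Normalize raw designer strings (case, whitespace) and keep
    only spellings frequent enough to treat as canonical."""
    counts = Counter(" ".join(name.lower().split()) for name in raw_names)
    return {name for name, n in counts.items() if n >= min_count}

# Illustrative data: the rare misspelling "Mac Dugal" is filtered out.
listings = ["Mac Duggal"] * 6 + ["Sherri Hill"] * 5 + ["Mac Dugal"]
print(canonicalize(listings))  # contains "mac duggal" and "sherri hill"
```

A frequency cutoff like this is a crude but effective filter: genuine designer names recur across many listings, while one-off misspellings don’t.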

Comparing and Contrasting Spell Check Algorithms

Cosine Similarity

Cosine Similarity algorithm (Image source: Wikipedia)

Cosine Similarity is used to determine how similar two vectors are. If the Cosine Similarity yields 0, the two vectors are orthogonal to one another (90 degrees) and are not a match. In contrast, the closer it gets to 1, the smaller the angle between the two vectors and the stronger the match. This algorithm is the most common first approach for similarity calculations, but it isn’t the most accurate, as comparison studies show (see Sources).
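As a sketch of how cosine similarity applies to words rather than pre-made vectors: one common choice is to represent each word as a vector of character-bigram counts, then take the cosine of the angle between those vectors. The bigram representation here is our choice for illustration.

```python
from collections import Counter
from math import sqrt

def bigrams(word):
    """Count the overlapping two-character substrings of a word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two bigram count vectors:
    dot product divided by the product of the vector lengths."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("mac duggal", "mac duggal"))  # identical words → 1.0
print(cosine_similarity("abc", "xyz"))                # no shared bigrams → 0.0
```

Identical words share every bigram, so the vectors point the same way (score 1.0); words with no bigrams in common are orthogonal (score 0.0).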

Levenshtein Distance

Levenshtein Distance algorithm (Image source: Wikipedia)

The Levenshtein Distance algorithm determines how many single-character “steps” (insertions, deletions, or substitutions) separate two words. The lower the distance between two words, the more similar they are; the larger the distance, the less similar.

Although this algorithm tells us which candidate words have the smallest distance from the input, it isn’t that accurate: a word only a few steps away can be an entirely different word than the one intended.
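A minimal sketch of the standard dynamic-programming formulation, which keeps only the previous row of the edit-distance table:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn string a into b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from each prefix of a to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("jovani", "jovanni"))  # → 1 (one inserted 'n')
```

A single doubled letter costs one step, which is exactly the kind of typo this metric handles well.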

Jaro-Winkler

Jaro-Winkler algorithm (Image source: Wikipedia)

The Jaro-Winkler algorithm is also a measure of distance between two strings: a score closer to 1 means the two words are similar, and a score closer to 0 means they are not. It is excellent at what it does because, when checking for matching characters, it accepts matches at approximately the same location rather than requiring the exact position, as the Levenshtein Distance algorithm does. The Jaro-Winkler algorithm also favors words that share a common prefix, whereas the plain Jaro algorithm doesn’t, which affects the results. For this project, we opted for the Jaro-Winkler algorithm.
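A sketch of both pieces: the Jaro score counts characters that match within a sliding window and penalizes transpositions, and the Winkler adjustment then boosts the score for a shared prefix (up to four characters) by a scaling factor, here the conventional 0.1.

```python
def jaro(s1, s2):
    """Jaro similarity: characters match if they appear within a
    window of half the longer string's length, minus one."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up out of order.
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a shared prefix (capped at 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("MARTHA", "MARHTA"))  # ≈ 0.961 (classic worked example)
```

The MARTHA/MARHTA pair is the textbook example: six matches, one transposition, and a three-character shared prefix lift the Jaro score of ~0.944 to ~0.961.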

Result Comparisons

To compare the results of the Levenshtein Distance and Jaro-Winkler algorithms, we ran both on the listings dataset. For Levenshtein Distance, we classified a possible match as a distance of six or below, taking the candidate with the lowest number of steps as the correct designer. For Jaro-Winkler, we used a scaling factor of 0.1, taking the candidate with the highest percentage match as the correct designer.

Coverage (Similarity Hit Rate)

Coverage (Similarity Hit Rate) of Jaro-Winkler vs Levenshtein Distance

The result was that Jaro-Winkler had more hits (more words flagged as incorrect and matched to a correction) than Levenshtein Distance: a 98.2% hit rate versus 93.6%, a 4.6 percentage-point difference.

Accuracy

Accuracy of Jaro-Winkler vs Levenshtein Distance

We measured accuracy by sampling 100 random results and checking each correction by hand. Jaro-Winkler had an accuracy of 95%, while Levenshtein Distance had an accuracy of 86%. This shows that, with its current settings, Jaro-Winkler is superior in accuracy, beating Levenshtein Distance by roughly 9 percentage points.

One example of an inaccuracy: a listing’s designer was entered as Blush, with the actual designer being Blush Proms. Levenshtein Distance proposed Lulus as the correction, while Jaro-Winkler correctly proposed Blush Proms.

Speed

Time taken in milliseconds per word
Time taken in milliseconds per word, averaged over every 25 words

The graphs above show the difference in the time each algorithm takes to process the entire dataset. As we can see, the Jaro-Winkler algorithm was ~3000ms faster than the Levenshtein Distance algorithm, making it the more efficient of the two.

Takeaways

Some takeaways for Levenshtein Distance were:

  • The algorithm is better suited to words missing one or two characters here and there.
  • Reducing the distance threshold may increase overall accuracy, but may also decrease the total hit rate.

Some takeaways for Jaro-Winkler were:

  • The algorithm is better suited to determining what a word is supposed to say because it takes the approximate location of letters into account.
  • Reducing the algorithm’s sensitivity may increase the hit rate but may also decrease total accuracy.
  • It has been very effective for our use case, since it checks the seller-entered designer against a canonicalized list of designers. If a candidate designer meets the similarity threshold, Jaro-Winkler corrects it accurately.
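As a sketch of that correction step: the snippet below keeps a listing’s entered designer unless a canonical name clears a similarity threshold. To keep the example self-contained we use Python’s `difflib` ratio purely as a stand-in similarity score; the production check uses Jaro-Winkler, and the designer list and 0.85 threshold here are illustrative.

```python
from difflib import SequenceMatcher

# Illustrative canonical list; see the canonicalized dataset described above.
CANONICAL_DESIGNERS = ["mac duggal", "sherri hill", "jovani", "blush proms"]

def correct_designer(entered, threshold=0.85):
    """Return the best-matching canonical designer if it clears the
    similarity threshold; otherwise keep the seller's original entry."""
    normalized = entered.lower().strip()
    best = max(CANONICAL_DESIGNERS,
               key=lambda d: SequenceMatcher(None, normalized, d).ratio())
    score = SequenceMatcher(None, normalized, best).ratio()
    return best if score >= threshold else normalized

print(correct_designer("mac dugal"))        # → "mac duggal"
print(correct_designer("totally unknown"))  # no match clears 0.85 → unchanged
```

The threshold is the knob mentioned above: lowering it corrects more entries (higher hit rate) at the cost of occasionally “correcting” a designer we don’t actually carry.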

Impact and Future Work

Our goal at Queenly is to create a trusted marketplace for formalwear. The integration of the Jaro-Winkler algorithm marks an important step in creating more product-quality workflows. As Queenly’s inventory grows, we’ll be able to expand our spell corrector linearly to a broader range of designers for different events and cultures. We’ll also explore applying the Jaro-Winkler algorithm and other spell-correction approaches to product listing descriptions and common search-term autocomplete suggestions.

Thank you for reading this! Huge thanks to everyone who has helped out on this project.

To learn more about engineering at Queenly, check out the rest of our Engineering Blog. To view and apply to open opportunities, visit our Careers page.

Sources

  • P. Sitikhu, K. Pahi, P. Thapa and S. Shakya, “A Comparison of Semantic Similarity Methods for Maximum Human Interpretability,” 2019 Artificial Intelligence for Transforming Business and Society (AITB), 2019, pp. 1–4, doi: 10.1109/AITB48515.2019.8947433.
