Natural Language Processing (Part 41)-Building the model

Coursesteach
8 min read · Apr 28, 2024

📚 Chapter 5: Autocorrect and Minimum Edit Distance

Introduction

In the previous blog, I briefly mentioned the four steps required to implement autocorrect. Now you’re going to take a deep dive into those four steps and look at each one in detail.

Sections

Step 1: Identify a misspelled word
Step 2: Find strings that are 1, 2, 3, or any number n edits away
Step 3: Filter candidates
Step 4: Calculate word probabilities
Summary

Natural Language Processing with Probabilistic Models

Section 1- Step 1: Identify a misspelled word

When the string deah is encountered, how do you know it’s a misspelled word? If it’s spelled correctly, you will find it in the dictionary; if not, it’s probably misspelled. So if a word is not in the dictionary, flag it for correction. Recall that you’re not searching for contextual errors, just spelling errors. There are much more sophisticated techniques that identify words that are probably incorrect by looking at the words surrounding them, some of which you’ll visit later in the course. But for now, flagging a word as incorrect based only on its spelling is a simple and powerful model that works well. A word like deer will pass through this filter just fine because it is spelled correctly, regardless of how it fits the context.
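The dictionary check described above can be sketched in a few lines of Python. This is only an illustration: the vocabulary below is a made-up toy set and the function name is_misspelled is hypothetical, not something taken from the course material.

```python
# Hypothetical toy vocabulary; a real one would be built from a large corpus.
vocab = {"i", "am", "happy", "because", "learning", "dear", "deer", "friend"}

def is_misspelled(word, vocab):
    """Flag a word for correction when it is not found in the vocabulary."""
    return word.lower() not in vocab

print(is_misspelled("deah", vocab))  # True: not in the vocabulary
print(is_misspelled("deer", vocab))  # False: spelled correctly, context is ignored
```

Using a set for the vocabulary keeps each lookup fast even when the vocabulary is large.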

Section 2- Step 2: Find strings that are 1, 2, 3, or any number n edits away

When I say n edit distance, I’m referring to an edit distance of n, such as an edit distance of one, an edit distance of two, and so on. An edit is an operation performed on a string to change it into another string. Edit distance counts the number of these operations, so the n edit distance tells you how many operations away one string is from another.

Consider the insert operation first. This edit adds a letter to a string at any position. For example, start with the word to: insert p at the end and you get top, or insert w in the middle and you get two.

A delete operation removes a letter. For example, start with the word hat: delete t from the end and you get ha, delete h from the front and you get at, or delete a from the middle and you get the string ht.

A switch edit swaps two adjacent letters. For example, take the string eta: switch t and a and you get eat, or switch e and t and you get tea. Notice that you are switching two letters that are next to each other. This does not include switching two letters that are not adjacent, such as switching the e and the a to make ate.

A replace edit changes one letter to another. For example, take the word jaw: change w to r and you get jar, or change j to p and you get paw.

Using the four edits (insert, delete, switch, and replace), you can modify any string. By combining these edits, you can build a list of all possible strings that are n edits away. For autocorrect, n is usually 1 to 3 edits. You’ll implement each of these edits in this week’s programming exercise and combine them to get a list of all strings that are two edits away from the original input string.
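As a rough sketch of how the four edits could be generated in Python, consider the following. The function names and the choice of lowercase English letters as the alphabet are assumptions made for illustration; the course’s programming exercise may structure this differently.

```python
import string

LETTERS = string.ascii_lowercase  # assumption: lowercase English letters only

def splits(word):
    """All ways to cut the word into a (left, right) pair."""
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]

def insert_letter(word):
    """Add one letter at any position."""
    return {l + c + r for l, r in splits(word) for c in LETTERS}

def delete_letter(word):
    """Remove one letter."""
    return {l + r[1:] for l, r in splits(word) if r}

def switch_letter(word):
    """Swap two adjacent (and different) letters."""
    return {l + r[1] + r[0] + r[2:]
            for l, r in splits(word) if len(r) > 1 and r[0] != r[1]}

def replace_letter(word):
    """Change one letter into a different letter."""
    return {l + c + r[1:]
            for l, r in splits(word) if r for c in LETTERS if c != r[0]}

def edit_one(word):
    """All strings one edit away from the word."""
    return (insert_letter(word) | delete_letter(word)
            | switch_letter(word) | replace_letter(word))

def edit_two(word):
    """All strings reachable by applying two edits in a row."""
    return {e2 for e1 in edit_one(word) for e2 in edit_one(e1)}

print(sorted(delete_letter("hat")))  # ['at', 'ha', 'ht']
print(sorted(switch_letter("eta")))  # ['eat', 'tea']
```

Note that edit_two as written also contains the original word (for example, an insert followed by a delete of the same letter), and that most of the generated strings are not real words.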

Section 3- Step 3: Filter candidates

Notice how many of the strings generated in Step 2 do not look like actual words. You only want to keep the real, correctly spelled words from your candidate list. To do that, compare each string against a known dictionary or vocabulary, just like in Step 1. This time, if the string does not appear in the dictionary, remove it from the list of candidates. Being left with a list of actual words only is good progress.

That covers the first three steps of building the autocorrect model: identifying the misspelled word, finding the strings that are n edits away, and filtering the candidates down to actual words. The fourth and final step comes next.
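Filtering can be sketched as a set intersection between the candidate strings and the vocabulary. Both sets below are hypothetical toy data chosen for illustration, and filter_candidates is an illustrative name rather than the course’s own function.

```python
# Hypothetical toy vocabulary and candidate strings, for illustration only.
vocab = {"dear", "deal", "dean", "death", "yeah"}
candidates = {"deah", "dear", "deal", "dewh", "eah", "yeah", "deaht"}

def filter_candidates(candidates, vocab):
    """Keep only the candidate strings that are actual words in the vocabulary."""
    return set(candidates) & set(vocab)

print(sorted(filter_candidates(candidates, vocab)))  # ['deal', 'dear', 'yeah']
```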

Section 4- Step 4: Calculate word probabilities

Now that you have a list of actual words, you can move on to Step 4: calculating the probability of each possible correct word.

The final step is to calculate word probabilities and find the most likely word among the candidates. For example, the word and is more common than the word ant in any given body of text, also called a corpus. This is how autocorrect knows which word to substitute for the incorrect one.

To understand this better, look at this example sentence: “I am happy because I am learning.” To calculate the probability of a word in the sentence, you first need to calculate the word frequencies. You also need to count the total number of words in the body of text, or corpus. Normally a corpus would be much larger: imagine every issue of a certain magazine ever published, or all of the Harry Potter books. To keep this example as simple as possible, the corpus here is defined as this one sentence.

For example, the word I appears twice. The word am also appears twice. Each of the remaining words (happy, because, learning) appears once. The total number of words in this corpus is seven. The probability of any word within the corpus is the number of times the word appears divided by the total number of words.
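The counting described above can be sketched with Python’s collections.Counter. Splitting on whitespace and keeping the original capitalization are simplifying assumptions that happen to suit this one-sentence corpus; a real corpus would need proper tokenization.

```python
from collections import Counter

corpus = "I am happy because I am learning"

words = corpus.split()   # naive whitespace tokenization (assumption)
counts = Counter(words)  # word frequencies
total = len(words)       # total number of words in the corpus

# Probability of a word = count of the word / total number of words
probs = {word: count / total for word, count in counts.items()}

print(total)                      # 7
print(counts["I"], counts["am"])  # 2 2
print(round(probs["am"], 4))      # 0.2857
```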

For example, the word am appears twice and the size of the corpus is seven, so its probability is 2/7, or about 0.29. For autocorrect, you find the candidate word with the highest probability and choose it as the replacement. And that’s it.
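Choosing the replacement can be sketched as taking the maximum over the filtered candidates. The probability values below are invented purely for illustration (they are not measured from any corpus), and best_candidate is a hypothetical helper name.

```python
def best_candidate(candidates, probs):
    """Return the candidate with the highest probability.

    Words missing from probs are treated as having probability 0.
    """
    return max(candidates, key=lambda word: probs.get(word, 0.0))

# Hypothetical probabilities, made up for this example.
probs = {"dear": 0.0005, "deal": 0.0003, "yeah": 0.0001}

print(best_candidate(["deal", "dear", "yeah"], probs))  # dear
```

In this sketch an empty candidate list makes max raise a ValueError, so a fuller implementation would need to decide what to return when no real word is found.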

Summary

In summary, to implement autocorrect you did the following. You entered a word to correct, for example the misspelled word deah, and then followed the four steps inside the autocorrect model to get its replacement.

You identified deah as being misspelled by checking it against known words. Then you made a list of all the strings that are n edits away. You filtered this list of strings to include only the ones that are actual words in a given dictionary. Then you calculated the word probabilities for each of these words and selected the word with the highest probability as the autocorrect replacement. That’s a lot to cover, but breaking it down step by step gives you a good intuition for how to implement autocorrect. You also now understand edits and edit distance and how they can be used to measure similarity between words. You have seen the four steps required to implement autocorrect for this week’s programming assignment. Next, get ready to apply these concepts to building a metric that is very common in NLP for measuring similarity between words, strings, and more: in the next blog, we will look at how to evaluate the similarity between two strings, for example a word with a typo and the same word without it.

Source

1- Natural Language Processing with Probabilistic Models (Coursera)