Natural Language Processing (Part 43)-Minimum edit distance algorithm

Coursesteach
6 min readMay 26, 2024

--

📚Chapter5: Autocorrect and Minimum Edit Distance

I’ll teach you about dynamic programming with one example. Once you understand this example, you’ll be able to use dynamic programming for other problems too. You have the source word “play” listed down the left column. Across the top row, you have the target word “stay” and an empty string placeholder. Each string starts with a hashtag symbol as a placeholder. The reason for this will become clear in the next steps. There are also some extra rows and columns to make it easier to follow along.

Figure 1

Now, the goal is to fill out this distance matrix, D. For example, D(2, 3) will show the minimum distance from “pl” to “sta.” So basically, if you take the first two letters of “play” and the first three letters of “stay,” you get “pl” to “sta.” You can apply this formula to any element in the matrix by using the row and column numbers. In this case, if i is 2 and j is 3, you get “pl” to “sta.”

Figrue 2

So basically, from the slides before, you learned that each “D” of “ij” will be at its lowest distance between the start of the source word and index “i” and the start of the target word and index “j”. So basically, if you have a source string of length m and a target string of length n, when you reach the bottom right corner D of a grid with m x n cells, you’ll find the smallest number of edits needed to turn one string into the other.

Figure 3

You will calculate this from the shortest sub strength to the full strength that is. Begin with the shortest string in the top-left corner and proceed downwards, then to the right. The underlying concept is that solutions can be developed by building upon each sub-problem and integrating their solutions.

For instance, calculating the minimum edit distance between two characters is straightforward. Then, you increment the problem size by one letter at a time, expanding upon your existing knowledge.

Example

Figure 4

Before beginning, remember the cost for each edit: one for insert and delete, and two for replace.

First steps

The initial step involves transforming the source’s empty string and entering the target’s empty string. The edit distance, which measures the number of edits needed to transform one string into another, is zero when both the source and target strings are empty. There is no formula here, it’s just a special case so far, so good.

Second Steps

Then you can proceed to change ‘p’ to an empty string, and this can be done where they perform the delete operation at a cost of one.

Then convert the empty string to ‘s’, which can be done by inserting at this operation at a cost of one.

Now, consider the term “P2S.”, there is more than one possible way to make this transformation. Every potential sequence of edits is referred to as a path that begins with P, and by appending S to the end, you obtain PS. Then, delete ‘P’ from the beginning to obtain ‘s’, which incurs a cost of one insertion and one deletion. At this juncture, it should be noted that the cost of inserting a letter ‘s’ has been calculated. The calculation is located in the red box at the top, which determines the transition from an empty string to ‘s’. Essentially, you calculate the cost of deleting ‘p’ and add it to the cost previously stored in the red box.

This is “one plus one equals two,” in the second possible path, starting with “p.” You can delete ‘P’ to obtain a hashtag, and then insert ‘s’ to get ‘s’, taking into account that the cost of deleting ‘p’ has already been calculated.

In the blue box on the left, the value represents the cost of transitioning from ‘p’ to ‘#’ by deleting ‘p’. Therefore, you can calculate the cost of inserting ‘p’ and add it to the value stored in the blue box on the left. The total is one plus one, which equals two.

The third method involves replacing ‘p’ with ‘s’, transitioning from ‘p’ to ‘s’ in a single step at a cost of two. Visually, this can be represented as moving directly from a green box to an orange box. The green box represents the cost of no edit, which is zero, and a replacement incurs a cost of two, thus zero plus two equals two.

Next, you determine the minimum value among these three paths, which is two in this instance. You then insert this minimum value into the cell, representing the shortest distance from p to s.

Building on what you’ve already completed, you’ll see that to fill in a cell, you need to be aware of the cell above, the one to the left, and the one diagonally above to the left, indicated here in red, blue, and green. By doing this, you can take advantage of some calculations that have already been done.

I will demonstrate how to create a general formula to populate the remaining matrix. Initially, I’ll guide you through completing the first column and row, ensuring that each subsequent cell has access to its respective red, blue, and green cells as dependencies. This is the next step in the process.

In this blog, you have learned about the cost associated with each operation: inserting and deleting cost one, while replacing costs two. You have also discovered how to populate your table. In the next blog, you will explore a much faster method to populate your table.

Please Follow and 👏 Clap for the story courses teach to see latest updates on this story

🚀 Elevate Your Data Skills with Coursesteach! 🚀

Ready to dive into Python, Machine Learning, Data Science, Statistics, Linear Algebra, Computer Vision, and Research? Coursesteach has you covered!

🔍 Python, 🤖 ML, 📊 Stats, ➕ Linear Algebra, 👁️‍🗨️ Computer Vision, 🔬 Research — all in one place!

Don’t Miss Out on This Exclusive Opportunity to Enhance Your Skill Set! Enroll Today 🌟 at

Natural Language Processing with Probabilistic models Course

Natural Language Processing course

🔍 Explore cutting-edge tools and Python libraries, access insightful slides and source code, and tap into a wealth of free online courses from top universities and organizations. Connect with like-minded individuals on Reddit, Facebook, and beyond, and stay updated with our YouTube channel and GitHub repository. Don’t wait — enroll now and unleash your NLP potential!”

Stay tuned for our upcoming articles where we will explore specific topics related to NLP in more detail!

Remember, learning is a continuous process. So keep learning and keep creating and sharing with others!💻✌️

Note:if you are a NLP export and have some good suggestions to improve this blog to share, you write comments and contribute.

👉📚GitHub Repository

👉 📝Notebook

Ready to dive into data science and AI but unsure how to start? I’m here to help! Offering personalized research supervision and long-term mentoring. Let’s chat on Skype: themushtaq48 or email me at mushtaqmsit@gmail.com. Let’s kickstart your journey together!

Contribution: We would love your help in making coursesteach community even better! If you want to contribute in some courses , or if you have any suggestions for improvement in any coursesteach content, feel free to contact and follow.

Together, let’s make this the best AI learning Community! 🚀

To Do List

1- Collects Keys points from the blogs

👉WhatsApp

👉 Facebook

👉Github

👉LinkedIn

👉Youtube

👉Twitter

Source

1- Natural Language Processing with Probabilistic Models (Coursera)

--

--