Maximum Matching (Word Segmentation) Algorithm — Python Code

Anshul Sachdev
3 min readAug 16, 2019

--

‘This is insane’. It’s easy to map it’s semantics right? Well how about ‘Thisisinsane’ your mind still reads it as ‘This is insane’. Well how does it do it? How will a computer do it? There are languages like Chinese which have no spaces between them. So how does a computer map meaning of such languages? Today, let’s talk about an algorithm that helps in doing so. It’s called the maximum matching algorithm.

Algorithm:

Step-1: Start with first character of the given string.

Step-2: Search the longest word in list starting with this character.

Step-3:

If match is found, boundary is marked.

Else, character is treated as word.

Let us take a string “thisisinsane”. We will start with the first character ‘t’. There is only a single word that can be formed and that is ‘the’. So we move on to the character next to the first maximum word(the). So the only word that can be formed starting with the character ‘i’ is ‘is’. So we add this word to the tokens list and move to the next character which is ‘i’. The words that can be formed with this character are ‘in’ and ‘insane’, we take the word with the greater length(i.e.,insane) and add it to the list.

Therefore the list of tokens that we get is: [ ‘this’ , ‘is’ , ‘insane’ ]

Some other examples are:

  1. ‘thecatinthehat’ — [‘the’ , ‘cat’ , ‘in’, ‘the’, ‘hat’]

Note: In our python code we will get the output as [‘theca’ , ‘tint’ , ‘he’, ‘hat’] because ‘theca’ is apparently a word. But in the books it is not considered as a word and this can be seen as their output [‘the’ , ‘cat’ , ‘in’, ‘the’, ‘hat’]. Both the outputs are correct. It just depends on the words that are in the corpus you take.

2. ‘tomorrowissunday’ — [‘tomorrow’ , ‘is’, ‘sunday’]

Note: In our python code we have taken an nltk corpus’ words which contains ‘Sunday’ but not ‘sunday’(with lowercase s). That is why we will first convert all the words to the lowercase words and store those words in the new list called lowercaseCorpus.

Flow Diagram:

Flow Diagram

Python Code:

from nltk.corpus import words

string = "thisisinsane"
tokens = []
lowercaseCorpus = [x.lower() for x in words.words()]
i = 0
while i < len(string):
maxWord = ""
for j in range(i, len(string)):
tempWord = string[i:j+1]
if tempWord in lowercaseCorpus and len(tempWord) > len(maxWord):
maxWord = tempWord
i = i+len(maxWord)
tokens.append(maxWord)

print(tokens)

Output: [‘this’, ‘is’, ‘insane’]

In the above code we have used nltk.corpus package to get the list of English words. The rest of the code is self-explanatory.

Hope you find this blog useful. If you did do like(clap) it and share your thoughts in the comments section below.

--

--