How to Build an Autocorrect in Python

Example of how to build an Autocorrect in Python by taking the vocabulary from a corpus

George Pipis
Published in The Startup
3 min read · Oct 1, 2020

Description

We assume that you are familiar with the concepts of String Distance and String Similarities. You can also have a look at the Spelling Recommender. We will show how you can easily build a simple Autocorrect tool in Python with a few lines of code. What you will need is a corpus to build your vocabulary and the word frequencies. The idea is the following:

  • You enter a word. If this word exists in the vocabulary, we assume that it is correct.
  • If the word does not exist in the vocabulary, we try to find the most similar words, ordered by their frequency probability in the corpus.
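The two rules above can be sketched end to end on a toy example. This is only a minimal illustration under stated assumptions: the tiny corpus, the helper name suggest, and the use of difflib.SequenceMatcher from the standard library as the similarity measure are all stand-ins chosen here, not necessarily the metric used later in this post.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy corpus standing in for a real text (assumption for illustration only).
corpus = "the whale swam and the ship sailed while the whale dived".split()

counts = Counter(corpus)
total = sum(counts.values())
# Relative frequency ("frequency probability") of each vocabulary word.
probs = {word: count / total for word, count in counts.items()}


def suggest(word, n=3):
    """Hypothetical helper: return the word if known, else the n closest words."""
    word = word.lower()
    # Rule 1: a word found in the vocabulary is assumed to be correct.
    if word in probs:
        return [word]
    # Rule 2: rank vocabulary words by similarity, breaking ties by frequency.
    ranked = sorted(
        probs,
        key=lambda v: (SequenceMatcher(None, word, v).ratio(), probs[v]),
        reverse=True,
    )
    return ranked[:n]


print(suggest("whale"))  # known word, returned unchanged
print(suggest("whle"))   # unknown word, closest candidates first
```

Here "whle" is equally similar to "whale" and "while" under this measure, so the more frequent "whale" is ranked first. Any other string distance could be swapped in for the similarity function.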

Build the Vocabulary

We will work with the Moby Dick book. Let’s start.

import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

# Read the whole book and convert it to lowercase
with open('moby.txt', 'r') as f:
    file_name_data = f.read()
file_name_data = file_name_data.lower()

# Split the text into tokens made of letters, digits and underscores
words = re.findall(r'\w+', file_name_data)

# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")

We get:

The first ten words in the text are: 
['moby', 'dick', 'by', 'herman', 'melville', '1851', 'etymology', 'supplied', 'by', 'a']
There are 17140 unique words in the vocabulary.
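Note that the \w+ pattern keeps digits (which is why '1851' appears above) and splits on apostrophes and hyphens. A quick check on a made-up sentence, which is not taken from the book, shows the effect:

```python
import re

# Made-up sample sentence (assumption for illustration, not from moby.txt).
sample = "Don't panic: the well-known whale is 90 feet long!"

# Same tokenization as above: lowercase, then keep runs of word characters.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
# ['don', 't', 'panic', 'the', 'well', 'known', 'whale', 'is', '90', 'feet', 'long']
```

If fragments such as 't' are unwanted in the vocabulary, a stricter pattern or a post-filter on token length is one possible option.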

Get the Word Frequencies

We have already built a list of words called words, and now we can count how often each word occurs. We can use the Counter class from the collections module.

# Count the occurrences of every word in the text
word_freq_dict = Counter(words)
print(word_freq_dict.most_common()[0:10])
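The raw counts can be turned into the frequency probabilities mentioned in the description by dividing each count by the total number of words. A minimal sketch, assuming a short stand-in list instead of the real words list and a hypothetical dictionary name probs:

```python
from collections import Counter

# Short stand-in list (assumption); in the post this would be the
# `words` list built from moby.txt.
words = ["the", "whale", "the", "sea", "the", "whale"]

word_freq_dict = Counter(words)
total = sum(word_freq_dict.values())

# Relative frequency of each word: count divided by the total word count.
probs = {word: count / total for word, count in word_freq_dict.items()}
print(probs["the"])  # 0.5
```

The probabilities sum to 1, so they can be used directly to rank candidate corrections that are equally similar to the input word.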

George Pipis
Sr. Director, Data Scientist @ Persado | Co-founder of the Data Science blog: https://predictivehacks.com/