Enhancing Language Learning with AI: A Practical Tutorial on Building a Voice Assistant for Pronunciation, Grammar, and Vocabulary

6 min read · Mar 16, 2023

In this tutorial, we will build a language learning voice assistant in Python. The assistant will recognize pronunciation errors, teach grammar, and provide vocabulary practice. We will train the voice recognition model on Common Voice, an open dataset of voice recordings published by Mozilla.

By the end of this tutorial, you will have built a language learning voice assistant that can recognize and correct pronunciation errors, teach grammar rules, and provide vocabulary practice exercises.

Prerequisites:

To follow this tutorial, you will need the following:

Basic knowledge of Python programming
Python 3.6 or later installed on your machine
An internet connection
A microphone for voice input

Agenda:

Introduction to the Common Voice dataset
Installing necessary Python packages
Building a voice recognition model using Common Voice dataset
Implementing pronunciation error recognition and correction
Teaching grammar rules
Providing vocabulary practice exercises

Step-by-step Tutorial:

Introduction to the Common Voice dataset:

Common Voice is an open dataset provided by Mozilla, not an API. It contains a large number of voice recordings from people of various ages, genders, and accents, each paired with the sentence that was read. We will be using this dataset to train our voice recognition model.

To use Common Voice, you will need to register on the Common Voice website and download the necessary files. Once you have done this, you can use the files to train your voice recognition model.

Installing necessary Python packages:

We will be using several Python packages to build our language learning voice assistant. You can install these packages using pip, the Python package installer.

pip install numpy
pip install pandas
pip install matplotlib
pip install tensorflow
pip install keras
pip install scikit-learn
pip install librosa
pip install pyaudio
pip install soundfile
pip install nltk
pip install SpeechRecognition

Building a voice recognition model using Common Voice dataset:

First, we will download the Common Voice dataset from the Mozilla website. Once we have the dataset, we can use it to train our voice recognition model.

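import os
import tarfile
import urllib.request

# If this URL is unavailable, use the download link from the Common Voice website.
url = "https://common-voice-data-download.s3.amazonaws.com/cv_corpus_v1.tar.gz"
filename = "cv_corpus_v1.tar.gz"
foldername = "cv_corpus_v1"

# Download the archive only if it is not already on disk.
if not os.path.exists(filename):
    urllib.request.urlretrieve(url, filename)

# Extract only once; an existing folder is taken to mean extraction already happened.
if not os.path.exists(foldername):
    os.mkdir(foldername)
    with tarfile.open(filename) as tar:
        tar.extractall(foldername)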
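Archives fetched over the network are worth checking before extraction. The following is a minimal sketch, not part of the original steps, that rejects archive members whose paths would land outside the target folder. The helper names is_within_directory and safe_extract are hypothetical, and the check covers path names only (it does not inspect symbolic links).

```python
import os
import tarfile

def is_within_directory(directory, target):
    """Return True if target resolves to a path inside directory."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)
    return os.path.commonpath([directory, target]) == directory

def safe_extract(archive_path, destination):
    """Extract archive_path into destination, refusing members that escape it."""
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            member_path = os.path.join(destination, member.name)
            if not is_within_directory(destination, member_path):
                raise ValueError(f"Unsafe path in archive: {member.name}")
        # Every member path was checked above, so extract the whole archive.
        tar.extractall(destination)
```

With this helper, the extraction step could call safe_extract(filename, foldername) in place of tar.extractall(foldername).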

Next, we will preprocess the data by extracting the audio files and their corresponding transcriptions.

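import pandas as pd

df = pd.read_csv("cv_corpus_v1/cv-valid-train.csv")
# Keep the two columns we need and drop rows with a missing transcript.
df = df[["filename", "text"]].dropna()
# str.isalpha() is False for any text containing a space, so this keeps
# single-word transcripts only; each remaining word becomes one class label.
df = df[df["text"].str.isalpha()]
df = df[df["text"].str.len() > 5]
# Shuffle the rows so the later validation split is not ordered.
df = df.sample(frac=1).reset_index(drop=True)

audio_path = "cv_corpus_v1/train_files/"

for i, row in df.iterrows():
    file_url = "https://common-voice-data-download.s3.amazonaws.com/" + row["filename"]
    file_path = audio_path + row["filename"]

    if not os.path.exists(file_path):
        # The filename may include a subdirectory, so create it first.
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        urllib.request.urlretrieve(file_url, file_path)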
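To see what the two text filters keep, here is a small sketch in plain Python that applies the same conditions to a handful of made-up transcripts. The function name keep_transcript and the sample strings are hypothetical; pandas applies the same per-string checks through df["text"].str.isalpha() and df["text"].str.len().

```python
def keep_transcript(text):
    """Mirror the two pandas filters: letters only, and longer than 5 characters."""
    return text.isalpha() and len(text) > 5

# Made-up transcripts for illustration only.
samples = ["yes", "learning", "good morning", "window", "it's"]
kept = [text for text in samples if keep_transcript(text)]
print(kept)
```

Any transcript containing a space or punctuation is dropped, so only single words of six or more letters survive. That is worth keeping in mind when judging how much of the corpus is actually used for training.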

Now that we have preprocessed the data, we can use it to train our voice recognition model. We will convert each audio clip into Mel Frequency Cepstral Coefficients (MFCCs), which are commonly used in speech recognition, and average them over time so that every clip is represented by a fixed-length vector of 13 values.

import librosa
import numpy as np

def extract_features(file_path):
    # sr=None keeps the file's native sampling rate instead of resampling.
    y, sr = librosa.load(file_path, sr=None)
    # mfccs has shape (13, n_frames).
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Average over frames to get one 13-value vector per clip.
    mfccs_scaled = np.mean(mfccs.T, axis=0)
    return mfccs_scaled

def build_dataset(audio_path, df):
    X = []
    y = []

    for i, row in df.iterrows():
        file_path = audio_path + row["filename"]

        try:
            feature = extract_features(file_path)
            X.append(feature)
            y.append(row["text"])
        except Exception as e:
            # Skip unreadable files but report them.
            print(f"Error processing file {file_path}: {str(e)}")

    return np.array(X), np.array(y)

X_train, y_train = build_dataset(audio_path, df)
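To try extract_features without waiting for the corpus download, it can help to have a small audio file on hand. The sketch below writes a one-second sine tone to a WAV file using only the standard library. The function name write_test_tone and its default values are assumptions chosen for illustration.

```python
import math
import struct
import wave

def write_test_tone(path, frequency=440.0, duration=1.0, sample_rate=16000):
    """Write a mono 16-bit sine tone to path and return the number of samples."""
    n_samples = int(duration * sample_rate)
    frames = bytearray()
    for n in range(n_samples):
        # Half amplitude keeps the signal well inside the 16-bit range.
        value = 0.5 * math.sin(2.0 * math.pi * frequency * n / sample_rate)
        frames += struct.pack("<h", int(value * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return n_samples
```

After calling write_test_tone("tone.wav"), passing "tone.wav" to extract_features should give back an array of 13 values, one averaged coefficient per MFCC.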

We can now train our voice recognition model using the extracted features. The model is a small feed-forward classifier that treats each distinct transcript as one class.

from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import Adam

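# Map each distinct transcript to an integer class, then to a one-hot vector.
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_train_categorical = to_categorical(y_train_encoded)

# Two hidden layers with dropout, followed by a softmax over all classes.
model = Sequential()
model.add(Dense(512, input_shape=(X_train.shape[1],)))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(len(le.classes_)))
model.add(Activation("softmax"))

# Recent Keras versions name this argument learning_rate; the older lr name is deprecated.
adam = Adam(learning_rate=0.0001)
model.compile(loss="categorical_crossentropy", optimizer=adam, metrics=["accuracy"])

# Hold out the last 10% of the samples for validation.
model.fit(X_train, y_train_categorical, batch_size=32, epochs=100, validation_split=0.1)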
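The label preparation step can be easier to follow on a toy example. The following sketch reproduces, in plain Python, what LabelEncoder and to_categorical do to the transcripts: sort the distinct labels, replace each label by its position in that sorted list, and turn each position into a one-hot row. The helper names encode_labels and one_hot are hypothetical and are not part of scikit-learn or Keras.

```python
def encode_labels(labels):
    """Return the sorted distinct classes and the integer code of each label."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    return classes, [index_of[label] for label in labels]

def one_hot(indices, num_classes):
    """Turn each integer code into a row with a single 1.0."""
    return [
        [1.0 if i == index else 0.0 for i in range(num_classes)]
        for index in indices
    ]

# Made-up transcripts for illustration only.
classes, encoded = encode_labels(["window", "garden", "window", "bridge"])
targets = one_hot(encoded, len(classes))
print(classes)
print(encoded)
print(targets)
```

The network's final softmax layer has one unit per entry in classes, which is why its size is len(le.classes_) in the model above.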

Implementing pronunciation error recognition and correction:

Now that we have trained our voice recognition model, we can use it to recognize pronunciation errors and suggest corrections. We will be using the Natural Language Toolkit (NLTK) package for this purpose.

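import nltk
from nltk.tokenize import word_tokenize

# Newer NLTK releases may ask for "punkt_tab" and
# "averaged_perceptron_tagger_eng" instead of these resource names.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def predict_word(file_path):
    # Keras expects a batch, so reshape the 13 MFCC values to (1, 13).
    features = extract_features(file_path).reshape(1, -1)
    # predict_classes() was removed from recent Keras; take the argmax instead.
    class_index = int(np.argmax(model.predict(features), axis=-1)[0])
    return le.classes_[class_index]

def recognize_errors(text):
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)

    for i in range(len(pos_tags) - 1):
        current_tag = pos_tags[i][1]
        next_tag = pos_tags[i + 1][1]

        # Heuristic used here: a noun directly followed by a verb is
        # treated as a candidate error.
        if current_tag.startswith("NN") and next_tag.startswith("V"):
            error_word = pos_tags[i][0]
            # Assumes a recording named "<word>.mp3" exists under audio_path.
            correction = predict_word(audio_path + error_word + ".mp3")
            text = text.replace(error_word, correction)

    return text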
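A simpler check that does not rely on the trained model is to compare what the recognizer heard against the sentence the learner was asked to read. This is a swapped-in technique rather than the approach used above, shown here as a minimal sketch with the standard library's difflib. The function name mismatched_words and the example sentences are hypothetical.

```python
import difflib

def mismatched_words(target, heard):
    """Return (expected_words, heard_words) pairs where the two sentences differ."""
    target_words = target.lower().split()
    heard_words = heard.lower().split()
    matcher = difflib.SequenceMatcher(a=target_words, b=heard_words)
    problems = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Anything other than "equal" marks words that were replaced,
        # skipped, or added by the speaker.
        if tag != "equal":
            problems.append((target_words[i1:i2], heard_words[j1:j2]))
    return problems

print(mismatched_words("the ship is big", "the sheep is big"))
```

Each pair lists the words that were expected next to the words that were recognized, which gives the assistant something concrete to ask the learner to repeat.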

Teaching grammar rules:

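Next, we will teach our voice assistant to recognize and correct grammar errors. We will be using NLTK again for this purpose, this time through its interface to a Stanford CoreNLP server, which must already be running locally on port 9000.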

from nltk.parse import CoreNLPParser
from nltk.tokenize import sent_tokenize

# CoreNLP is not an NLTK download; start a CoreNLP server separately and
# point the parser at it.
parser = CoreNLPParser(url="http://localhost:9000")

def recognize_grammar_errors(text):
    corrected_sentences = []

    for sentence in sent_tokenize(text):
        # raw_parse() takes a plain string and yields parse trees.
        parse_tree = next(parser.raw_parse(sentence))
        for subtree in parse_tree.subtrees():
            if subtree.label() == "VP":
                verb = subtree.leaves()[0]
                if verb.lower() not in ["is", "am", "are", "was", "were", "be", "been"]:
                    # Assumes a recording named "<verb>.mp3" exists under audio_path.
                    features = extract_features(audio_path + verb + ".mp3").reshape(1, -1)
                    # predict_classes() was removed from recent Keras; take the argmax instead.
                    class_index = int(np.argmax(model.predict(features), axis=-1)[0])
                    # Replace the first leaf of the verb phrase in place.
                    subtree[subtree.leaf_treeposition(0)] = le.classes_[class_index]
                    break

        # Rebuild the sentence from the (possibly modified) tree.
        corrected_sentences.append(" ".join(parse_tree.leaves()))

    corrected_text = " ".join(corrected_sentences)
    return corrected_text
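For readers who want to experiment without running a CoreNLP server, a grammar rule can also be written as a small lookup table. The sketch below checks one rule only, agreement between a pronoun and the present tense of "be", and is an illustration under stated assumptions rather than the parser-based approach above. The names EXPECTED_BE, BE_FORMS, and check_be_agreement are hypothetical.

```python
# Hypothetical rule table: the present-tense form of "be" expected after each pronoun.
EXPECTED_BE = {
    "i": "am",
    "you": "are", "we": "are", "they": "are",
    "he": "is", "she": "is", "it": "is",
}
BE_FORMS = {"am", "is", "are"}

def check_be_agreement(sentence):
    """Return the corrected sentence and a list of (wrong, right) pairs."""
    words = sentence.split()
    fixes = []
    for i in range(len(words) - 1):
        subject = words[i].lower()
        verb = words[i + 1].lower()
        expected = EXPECTED_BE.get(subject)
        # Only flag a form of "be" that disagrees with the pronoun before it.
        if expected and verb in BE_FORMS and verb != expected:
            fixes.append((words[i + 1], expected))
            words[i + 1] = expected
    return " ".join(words), fixes

print(check_be_agreement("They is happy"))
```

Returning the list of fixes alongside the corrected sentence lets the assistant explain which rule was broken instead of silently rewriting the learner's sentence.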

Providing vocabulary practice:

Finally, we will provide vocabulary practice to the user by generating random vocabulary quizzes.

We will be using the WordNet database, which is a lexical database for English that is used extensively in natural language processing and computational linguistics.

from nltk.corpus import wordnet

nltk.download("wordnet")

def generate_vocabulary_quiz():
    # Collect (word, definition) pairs for every noun synset.
    entries = [
        (synset.name().split(".")[0], synset.definition())
        for synset in wordnet.all_synsets("n")
    ]

    # Pick four distinct entries by index, then choose one as the answer.
    option_indices = np.random.choice(len(entries), size=4, replace=False)
    correct_position = np.random.randint(len(option_indices))
    correct_word = entries[option_indices[correct_position]][0]

    print("What is the definition of the following word?")
    print(correct_word)

    for i, entry_index in enumerate(option_indices):
        print(f"{i + 1}. {entries[entry_index][1]}")

    try:
        answer = int(input("Enter the correct option number: "))
    except ValueError:
        answer = None  # Non-numeric input counts as a wrong answer.

    if answer == correct_position + 1:
        print("Congratulations! Your answer is correct.")
    else:
        print(f"Sorry, your answer is incorrect. The correct answer is {correct_position + 1}.")
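Separating quiz construction from the print and input calls makes the logic easier to test. Here is a minimal sketch of that idea using a few hand-written entries in place of WordNet. The names SAMPLE_ENTRIES, build_quiz, and is_correct are hypothetical, and the definitions are written for this example rather than taken from WordNet.

```python
import random

# Hypothetical sample entries; in the tutorial these come from WordNet.
SAMPLE_ENTRIES = [
    ("bridge", "a structure that carries a road or path across an obstacle"),
    ("garden", "a piece of ground used for growing plants"),
    ("window", "an opening in a wall that lets in light"),
    ("kettle", "a container used for boiling water"),
    ("ladder", "a frame with rungs used for climbing"),
]

def build_quiz(entries, rng, n_options=4):
    """Return the quiz word, the list of option definitions, and the correct index."""
    options = rng.sample(entries, n_options)
    correct_index = rng.randrange(n_options)
    word = options[correct_index][0]
    definitions = [definition for _, definition in options]
    return word, definitions, correct_index

def is_correct(answer_text, correct_index):
    """Check a typed answer (1-based); non-numeric input counts as wrong."""
    try:
        return int(answer_text) == correct_index + 1
    except ValueError:
        return False

word, definitions, correct_index = build_quiz(SAMPLE_ENTRIES, random.Random(0))
print(word)
for number, definition in enumerate(definitions, start=1):
    print(f"{number}. {definition}")
```

Passing in a seeded random.Random instance makes a quiz reproducible, which is convenient when checking that the correct option really matches the displayed word.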

Putting it all together:

Now that we have implemented all the necessary functionalities, we can put them together to build our language learning voice assistant.

import speech_recognition as sr

r = sr.Recognizer()

def recognize_speech():
    with sr.Microphone() as source:
        print("Speak now...")
        audio = r.listen(source)
        try:
            text = r.recognize_google(audio)
            print("You said: " + text)
            return text
        except sr.UnknownValueError:
            print("Sorry, I could not understand what you said.")
            return ""
        except sr.RequestError as e:
            print("Could not request results from Google Speech Recognition service; {0}".format(e))
            return ""

def start_voice_assistant():
    print("Welcome to the language learning voice assistant.")
    while True:
        print("What would you like to do?")
        print("1. Recognize pronunciation errors and suggest corrections")
        print("2. Teach grammar rules")
        print("3. Practice vocabulary")
        print("4. Exit")
        choice = input("Enter your choice: ")
        
        if choice == "1":
            text = recognize_speech()
            if text:
                corrected_text = recognize_errors(text)
                print(f"Corrected text: {corrected_text}")
        elif choice == "2":
            text = recognize_speech()
            if text:
                corrected_text = recognize_grammar_errors(text)
                print(f"Corrected text: {corrected_text}")
        elif choice == "3":
            generate_vocabulary_quiz()
        elif choice == "4":
            print("Thank you for using the language learning voice assistant.")
            break
        else:
            print("Invalid choice. Please try again.")
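As the menu grows, the chain of if and elif branches can be replaced by a lookup table that maps each choice to a function. The following is a small sketch of that design, with placeholder actions standing in for the real features. The name make_menu and the handlers shown are hypothetical.

```python
def make_menu(handlers):
    """Build a dispatcher from a dict mapping a choice string to (label, function)."""
    def run_choice(choice):
        entry = handlers.get(choice)
        if entry is None:
            return "Invalid choice. Please try again."
        _label, action = entry
        return action()
    return run_choice

# Placeholder actions standing in for the assistant's real features.
handlers = {
    "1": ("Recognize pronunciation errors", lambda: "pronunciation check"),
    "2": ("Teach grammar rules", lambda: "grammar check"),
    "3": ("Practice vocabulary", lambda: "vocabulary quiz"),
}
run_choice = make_menu(handlers)
print(run_choice("3"))
print(run_choice("9"))
```

Because the labels live in the same table as the functions, the menu text can be printed by looping over handlers, so adding a feature means adding a single entry.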

Testing the voice assistant:

We can now test our voice assistant by running the start_voice_assistant() function.

start_voice_assistant()

When the function is executed, the voice assistant will welcome the user and present a menu of options:

Welcome to the language learning voice assistant.
What would you like to do?
1. Recognize pronunciation errors and suggest corrections
2. Teach grammar rules
3. Practice vocabulary
4. Exit

The user can choose from the following options:

Recognize pronunciation errors and suggest corrections: The voice assistant will prompt the user to speak a sentence and then recognize any pronunciation errors and suggest corrections.
Teach grammar rules: The voice assistant will prompt the user to speak a sentence and then recognize any grammar errors and suggest corrections.
Practice vocabulary: The voice assistant will generate a random vocabulary quiz and prompt the user to answer it.
Exit: The voice assistant will exit.

The user can continue to choose options until they choose to exit the voice assistant.

And that’s it! You have now built a language learning voice assistant that can recognize pronunciation errors, teach grammar rules, and provide vocabulary practice.

In this tutorial, we have learned how to build a language learning voice assistant using Python.

We have covered the fundamental concepts of speech recognition, natural language processing, and machine learning, and how they can be combined to create a personalized and interactive language learning experience.

By leveraging available datasets such as Common Voice by Mozilla, we can train our voice assistant to recognize pronunciation errors, teach grammar rules, and provide vocabulary practice.

With the knowledge gained in this tutorial, you can continue to expand and customize your language learning voice assistant to fit your unique needs and preferences.

Thank you for following along with this tutorial, and I hope you found it informative and helpful. Happy learning!