How I Coded My Own Private French Tutor Out of ChatGPT
Step-by-step guide to how I used the latest AI services to teach me a new language, from architecture to prompt engineering
The code for the foreign-language tutor discussed here can be found in the companion
repo on my GitHub page, and you are free to use it for any non-commercial purpose.
So after postponing it for quite some time, I decided to resume my French studies. As I signed up for the class, a thought struck me — what if I could program ChatGPT to be my personal French tutor? What if I could speak to it, and it would speak back to me? Being a Data Scientist working with LLMs, this seemed like something worth building. I mean, yes, I could just speak to my wife, who’s French, but that’s not as cool as designing my own personal tutor out of ChatGPT. Love you, honey ❤️.
But seriously now, this project is a little more than just “another cool code-toy”. Generative AI is making its way into every field of our lives, and Large Language Models (LLMs) seem to be taking the lead. The possibilities of what a single person can do these days with access to these models are jaw-dropping, and I decided this project is worth my time — and yours too, I believe — for two main reasons:
- Using ChatGPT as the well-known online tool is powerful, but integrating an LLM into your code is a whole different thing. LLMs are still somewhat unpredictable, and when an LLM — or any other GenAI model — sits at the core of your product, you need to learn how to really control GenAI. And it’s not as easy as it sounds.
- Getting a first working version took only a few workdays. Before GenAI and LLMs, this would have taken months, and would probably have required more than a single person. The power of using these tools to create powerful applications fast is something you really have to try yourself — that’s the future, as far as I can see. We’re not going back.
Plus, this project can actually do good. My mom really wants someone to practice her English with. Now she can, and it costs less than $3 a month. My wife’s mom wants to start studying Korean. Same thing, same cost. And of course I use it too! This project really helps people, and it costs less than a small cup of coffee. That’s the real GenAI revolution, if you ask me.
Starting from Scratch
Looking at this project from a high-level perspective, I needed four components:
- Speech-to-text, to transcribe my voice into words
- A Large Language Model, preferably a chat-tuned LLM, which I can ask questions and get responses from
- Text-to-speech, to turn the LLM’s answers into voice
- Translation, to convert any French text I do not fully understand into English (or Hebrew, my native language)
Luckily, it’s 2023, and all of the above are readily accessible. I also chose to use managed services and APIs rather than running any of these locally, as inference is much faster that way. The low prices of these APIs for personal use made this decision a no-brainer.
After playing around with several alternatives, I chose OpenAI’s Whisper and ChatGPT as my speech-to-text and LLM, and Google’s Text-to-Speech and Translate as the remaining modules. Creating API keys and setting these services up was super simple, and I was able to communicate with all of them through their native Python libraries in a matter of minutes.
What really struck me after testing all these services is that the tutor I’m constructing is not just an English-to-French teacher; as Whisper, ChatGPT, and Google Translate & TTS support dozens of languages, it can be used to learn pretty much any language while speaking any other language. That’s insane!
Architecture and Threading
Let’s first make sure the overall flow is well understood: (1) We begin by recording the user’s voice, which is (2) sent to Whisper API, and returns as text. (3) The text is added to the chat history and sent to ChatGPT, which (4) returns a written response. Its response is (5) sent to Google Text-to-speech, which returns a sound file that will be (6) played as audio.
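The six-step loop above can be sketched in a few lines of Python. This is only an illustration of the data flow — the function names (`transcribe`, `chat`, `synthesize`) are placeholders standing in for the real service calls, not the actual SDK functions or the repo’s code:

```python
# Sketch of one conversational turn. All three external services are
# stubbed out so only the flow of data between the steps is shown.

def transcribe(audio_bytes):          # (2) Whisper: audio -> text
    return "Bonjour, comment allez-vous ?"

def chat(history):                    # (4) ChatGPT: history -> reply text
    return "Très bien, merci !"

def synthesize(text):                 # (5) TTS: text -> sound file bytes
    return b"fake-wav-bytes"

def run_turn(history, audio_bytes):
    user_text = transcribe(audio_bytes)                       # (2)
    history.append({"role": "user", "content": user_text})    # (3)
    reply = chat(history)                                     # (4)
    history.append({"role": "assistant", "content": reply})
    audio = synthesize(reply)                                 # (5)
    return history, audio             # (6) the caller plays the audio

history = [{"role": "system", "content": "You are a French teacher."}]
history, audio = run_turn(history, b"recorded-voice")
```

In the real app, each of the stubs is an API round-trip, which is exactly why the later threading section matters.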
My first practical step was to break this down into components and design the overall architecture. I knew I’d need a UI, preferably a web UI, as it’s just easier to launch apps through the browser these days than to maintain a standalone executable. I’d also need a “backend”, which would be the actual Python code communicating with all the different services. But in order to provide a real-time, flowing experience, I realized I’d need to break it into different threads.
The main thread runs the majority of the code: it transcribes my recording to text (via Whisper), displays that text on the screen as part of the chat, and then displays the tutor’s written response (as received from ChatGPT) on the chat screen as well. But I had to move the tutor’s text-to-speech to a separate thread — otherwise:
- the tutor’s voice would only be heard once the entire message had been received from ChatGPT, and its response might be long
- it would block the user from responding while the tutor speaks
That’s not the “flowing” behavior I’d like to have; I’d like the tutor to begin speaking as its message is being written on the screen, and certainly not to block users and prevent them from responding just because audio is still playing.
To achieve that, the text-to-speech part of the project was split into two additional threads. As the tutor’s response is received from ChatGPT token-by-token, every complete sentence is passed to a second thread, from which it is sent to the text-to-speech service and converted into sound files. I’d like to emphasize the word files here — since I’m sending text to the TTS service sentence-by-sentence, I also get multiple audio files, one per sentence, which need to be played in the correct order. These sound files are then played from a third thread, making sure audio playback does not block the rest of the program from running.
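Here is a minimal sketch of that three-thread split, with the TTS service and the audio player replaced by simple stand-ins (the function names and sentence-boundary heuristic are mine, not the repo’s). Queues keep the sentences — and therefore the audio files — in order:

```python
import queue
import threading

sentence_q = queue.Queue()   # main thread -> TTS thread
audio_q = queue.Queue()      # TTS thread -> playback thread

def feed_tokens(tokens):
    """Main thread: group streamed tokens into complete sentences."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            sentence_q.put(buf.strip())
            buf = ""
    if buf.strip():
        sentence_q.put(buf.strip())
    sentence_q.put(None)  # end-of-response marker

def tts_worker(synthesize):
    """TTS thread: one audio file per sentence, preserving order."""
    while (sentence := sentence_q.get()) is not None:
        audio_q.put(synthesize(sentence))
    audio_q.put(None)

def playback_worker(play, played):
    """Playback thread: play files as they arrive, never blocking main."""
    while (clip := audio_q.get()) is not None:
        played.append(play(clip))

played = []
t1 = threading.Thread(target=tts_worker, args=(lambda s: f"file({s})",))
t2 = threading.Thread(target=playback_worker, args=(lambda c: c, played))
t1.start(); t2.start()
feed_tokens(["Bon", "jour", ". ", "Ça ", "va ", "?"])
t1.join(); t2.join()
```

Because both hand-offs go through FIFO queues, playback order always matches sentence order, even though synthesis and playback run concurrently with the token stream.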
Making all this work, along with several other issues originating from UI-server interactions, was the complicated part of this project. Surprising, huh — software engineering is where things get hard.
Designing the UI
Well, a UI was something I knew I’d need, and I also knew pretty much how I’d like it to look — but coding a UI is beyond my expertise. So I decided to try a novel approach: I asked ChatGPT to write my UI for me.
For this I used the actual ChatGPT service (not the API), and used GPT-4 (yes, I’m a proud paying customer!). Surprisingly, my initial prompt:
Write a Python web UI for a chatbot application. The text box where
the user enters his prompt is located at the bottom of the screen, and
all previous messages are kept on screen
delivered an amazing first result, ending up with a Python-Flask backend, jQuery code, HTML, and matching CSS. But that was only about 80% of the functionality I was hoping for, so I spent roughly 10 hours going back and forth with GPT-4, optimizing and upgrading my UI one request at a time.
If I made it look simple, I want to state clearly that it wasn’t. The more requests I added, the more GPT-4 got confused and delivered malfunctioning code, which at some point was easier to correct manually than to ask it to fix. And I had a lot of requests:
- Add a profile picture next to each message
- Add a button to every message that replays its audio
- Add a button to every French message that will add its translation below the original text
- Add a save-session and a load-session buttons
- Add a dark-mode option, make it choose the right mode automatically
- Add a “working” icon whenever a response from a service is pending
- And many many more…
Still, even though GPT’s code almost never worked out of the box, considering that I have very little knowledge of front-end development, the results are amazing — far beyond anything I could have done myself just by Googling and StackOverflowing. I’ve also made a lot of progress in learning how to craft better prompts. Thinking about it, perhaps I should write another blog post just on the lessons learned from literally building a product from the ground up alongside an LLM… (well, I did).
Prompt Engineering
For this part of the post, I will assume you have some basic knowledge of how communication with a Chat LLM (like ChatGPT) works via API. If you don’t, you might get a little lost.
Last but most-certainly not least — I had to make GPT take the role of a private tutor.
As a starting point, I added a System Prompt to the beginning of the chat. As the chat with an LLM is basically a list of messages sent by the user and the bot to each other, a System Prompt is usually the first message of the chat, which describes to the bot how it should behave and what is expected of it. Mine looked something like this (parameters encapsulated by curly-braces are replaced by run-time values):
You are a {language} teacher named {teacher_name}.
You are on a 1-on-1 session with your student, {user_name}. {user_name}'s
{language} level is: {level}.
Your task is to assist your student in advancing their {language}.
* When the session begins, offer a suitable session for {user_name}, unless
asked for something else.
* {user_name}'s native language is {user_language}. {user_name} might
address you in their own language when felt their {language} is not well
enough. When that happens, first translate their message to {language},
and then reply.
* IMPORTANT: If your student makes any mistakes, be it typo or grammar,
you MUST first correct your student and only then reply.
* You are only allowed to speak {language}.
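Filling in this template and making it the first message of the chat is straightforward. A minimal sketch (the template is abbreviated, and the names “Juliette” and “Alice” are made-up example values, not from the actual app):

```python
# Abbreviated version of the templated system prompt shown above.
SYSTEM_TEMPLATE = (
    "You are a {language} teacher named {teacher_name}. "
    "You are on a 1-on-1 session with your student, {user_name}. "
    "{user_name}'s {language} level is: {level}."
)

def build_history(language, teacher_name, user_name, level):
    """Fill the template and place it at the head of the message list."""
    system_prompt = SYSTEM_TEMPLATE.format(
        language=language, teacher_name=teacher_name,
        user_name=user_name, level=level)
    # The System Prompt is always the first message of the chat.
    return [{"role": "system", "content": system_prompt}]

history = build_history("French", "Juliette", "Alice", "beginner")
```

Every later user and assistant message is simply appended after this first `system` entry before the history is sent to the API.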
This actually yielded pretty good results, but it seemed like the effectiveness of the behavioral instructions I gave the bot (“correct me when I’m wrong”, “always respond in French”) decayed as the chat went on.
Trying to fight this vanishing behavior, I came up with an interesting solution: I manipulated the user messages before sending them to GPT. Whatever the user’s message was, I appended additional text to it:
[User message goes here]
---
IMPORTANT:
* If I replied in {language} and made any mistakes (grammar, typos, etc),
you must correct me before replying
* You must keep the session flow; your response cannot end the session.
Try to avoid broad questions like "what would you like to do", and prefer
to provide me with related questions and exercises.
* You MUST reply in {language}.
Adding these at the end of every user message made sure the LLM responded exactly the way I wanted it to. It is worth mentioning that this long suffix is written in English, while the user’s message might not be. This is why I added an explicit separator (the ---) between the original message and my addition, ending the context of the original message and starting a new one. Also note that as this suffix is appended to the user’s message, it is written in first person (“I”, “me”, etc.). This little trick improved results and behavior dramatically. While it may go without saying, it is worth emphasizing that this suffix is not displayed in the chat UI, and the user has no idea it is added to their messages. It is inserted behind the scenes, right before the message is sent with the rest of the chat history to ChatGPT.
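In code, the trick is a one-line transformation applied on the way to the API. A sketch (the abbreviated suffix and function name are illustrative; the sample French message deliberately contains a spelling mistake the tutor should catch):

```python
# Abbreviated version of the reminder suffix shown above.
SUFFIX_TEMPLATE = """
---
IMPORTANT:
* If I replied in {language} and made any mistakes (grammar, typos, etc),
  you must correct me before replying
* You MUST reply in {language}.
"""

def augment_user_message(text, language):
    # What gets sent to ChatGPT; the chat UI keeps displaying `text`
    # unchanged, so the user never sees the suffix.
    return text + SUFFIX_TEMPLATE.format(language=language)

# "Je m'appel" is a deliberate typo ("m'appelle") for the tutor to correct.
api_message = augment_user_message("Je m'appel Pierre", "French")
```

Since the suffix is re-attached to every message, the behavioral instructions stay at the very end of the context window, where the model is least likely to ignore them.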
One more behavior I wanted was making the tutor speak first, meaning ChatGPT would send the first message as the session begins, without waiting for the user to initiate. That is, apparently, not something ChatGPT was designed to do.
What I found out when attempting to make ChatGPT reply to a message history containing only the System Prompt is that ChatGPT “lost it” and began creating a chat with itself, playing both the user and the bot. No matter what I tried, I wasn’t able to make it properly initiate the session without the user saying something first.
And then I had an idea. When the session is initialized, I send ChatGPT the following message on behalf of the user:
Greet me, and then suggest 3 optional subjects for our lesson suiting my
level. You must reply in {language}.
This request was designed to make GPT’s response look exactly like what I thought a proper initialization of the session by the bot should look like. I then removed my message from the chat, making it seem as if the bot had kicked off the session by itself.
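The remove-the-hidden-message step boils down to a pop from the history list. A sketch, with the real ChatGPT API call stubbed out as `chat_fn` (the function name and the fake reply are mine):

```python
KICKOFF = ("Greet me, and then suggest 3 optional subjects for our "
           "lesson suiting my level. You must reply in {language}.")

def start_session(history, chat_fn, language):
    """Kick-off trick: ask on the user's behalf, then hide the question."""
    history.append({"role": "user",
                    "content": KICKOFF.format(language=language)})
    greeting = chat_fn(history)       # the actual ChatGPT call goes here
    history.pop()                     # drop the hidden kick-off message
    history.append({"role": "assistant", "content": greeting})
    return history

# Stand-in for the API: always returns a canned greeting.
fake_chat = lambda msgs: "Bonjour ! Voici trois sujets pour notre leçon..."
history = start_session(
    [{"role": "system", "content": "You are a French teacher."}],
    fake_chat, "French")
```

After `start_session` returns, the history contains only the System Prompt and the assistant’s greeting, so both the UI and all later API calls see a session the tutor apparently opened on its own.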
Summary
What kicked off as a funny little whim became reality in practically no time, built entirely in the spare time of one quite busy man. The fact that tasks like these are now so simple to accomplish never ceases to amaze me. Just one year ago, having something like ChatGPT available was sci-fi, and now I can shape it from my own personal laptop.
This is the beginning of the future, and whatever is to come — at least I know I’ll be ready for it with one more foreign language. Au revoir!