How I learn vocabulary 📖

Vladislav Kuznetsov
Nov 3 · 7 min read

Hi everyone!

At my previous company, I took English classes twice a week. Our teacher made us learn new words and usually started each class by checking our vocabulary: she asked us to translate random words from topics we had already covered. It was hard, because we didn’t have time to memorize them. So I bought a vocabulary notebook and started writing new words down in it. Over time I realized that I was forgetting the words in that notebook. It just wasn’t convenient to carry the notebook everywhere, to look up a forgotten word, and to write new words down neatly.

My awful vocabulary notebook

A vocabulary notebook in 2019? Are you serious? Where is your device, dude? Exactly! I could have used my mobile phone, but I didn’t want to install a new application just for this. That is probably why I like my profession: a developer can build whatever they need for themselves. I decided to create a Telegram bot.

The initial idea was simple: a dictionary application. It’s a piece of cake for any developer, isn’t it? In just a few minutes my bot was ready.

Skeleton of the bot

Whenever I wanted to practice my vocabulary, I sent the /ask command to the chat. Like a teacher, the bot started asking me random words from my dictionary, which I’d sent to it before, and checked my translation, or vice versa. If I didn’t know the right answer, I could send the face I usually make in such situations, ‘:(’, and get the translation in the next message. I could also ask about forgotten words or add new ones at any time. The whole process reminded me of the funny Member Berries from the South Park episode. I named my bot after them, and it has never let me down since.
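The ask-and-check flow described above can be sketched without any Telegram code at all. This is only an illustration, not the bot's actual implementation: the `VocabularyQuiz` class, its method names, and the "Not quite" reply are all hypothetical.

```python
import random

SAD_FACE = ":("


class VocabularyQuiz:
    """In-memory stand-in for the bot's dictionary and its /ask flow (hypothetical)."""

    def __init__(self, rng=None):
        self.words = {}      # word -> translation
        self.current = None  # word currently being asked
        self.rng = rng or random.Random()

    def add(self, word, translation):
        """Store a new word pair."""
        self.words[word] = translation

    def ask(self):
        """Pick a random word and return the question text."""
        self.current = self.rng.choice(list(self.words))
        return f"Do you member {self.current}?"

    def answer(self, text):
        """Check a reply to the last question; ':(' reveals the translation."""
        expected = self.words[self.current]
        reply = text.strip()
        if reply == SAD_FACE:
            return f"Remember that: {expected}"
        if reply.lower() == expected.lower():
            return "Of course!"
        return f"Not quite. Remember that: {expected}"


quiz = VocabularyQuiz()
quiz.add("Circumstances", "Обстоятельства")
print(quiz.ask())
print(quiz.answer(":("))
```

In a real bot, `ask` and `answer` would be called from the handlers for the /ask command and for plain text messages.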

I added a few words from my vocabulary notebook and started using the bot on my way to work. Subway, traffic jams: on average 30 minutes twice a day, which is good practice. But over time I got tired of the same questions. My bot didn’t distinguish between words I’d already learned and words I couldn’t recall. There were situations like this:

Do you member Circumstances?
Обстоятельства
Of course!

Do you member Circumstances?
Обстоятельства
Of course!

Do you member Circumstances?
Обстоятельства
Of course!

…

I’ve already learned this word. Please don’t ask me that again and again… but still remember it and keep checking it, just less often.
See /help for instructions.

A little idea popped into my head: the bot could choose the next word based on the mistakes I’d made. It would ask me the words I couldn’t memorize or kept getting wrong. In other words, the probability of being asked a word I didn’t know had to go up. For that, I collected the following statistics for each pair of words and kept them in my Firebase storage.

I decided to keep all statistics per session. A session is the time interval between the first word asked and the first word asked after a long pause (~20 min). I hoped this would let me work out the ideal length of time for a user to learn new words effectively, because I suspected that in an overly long session it becomes quite difficult to memorize more words.
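The session definition above can be sketched as a small grouping function. This is a minimal illustration under assumptions: the function name is hypothetical, and treating a pause of exactly 20 minutes as still belonging to the same session is my choice, not something the project states.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)  # the "long pause" from the text


def split_into_sessions(timestamps, gap=SESSION_GAP):
    """Group question timestamps into sessions separated by pauses longer than `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # first question, or first one after a long pause
    return sessions


start = datetime(2019, 11, 3, 9, 0)
asked = [start, start + timedelta(minutes=5), start + timedelta(minutes=40)]
sessions = split_into_sessions(asked)
print([len(s) for s in sessions])
```

Here the 35-minute pause before the last question starts a second session, so the first session holds two questions and the second holds one.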

When I deployed my application on Heroku, I wanted to find out how to pass data from the Python backend to the JS frontend when only the backend knows about the data. I suspected that my approach wasn’t the correct way, but it was possible. Just for fun, I created a simple React application with charts that show some general statistics and the progress in the last session. By default the site shows my data, but you can append a Telegram userId after the host name.

https://memberries-vocabulary.herokuapp.com/<userId>

Each time the bot picks the next word, it filters out the words that were asked within a recent time interval. Thanks to that, the bot knows which words it has asked recently. I choose 7 (just a magic number) random words from the whole vocabulary so that the bot doesn’t ask the same words twice. This also lets words I’ve learnt before get onto the list when the bot starts asking.
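A rough sketch of this filtering step might look like the following. The 10-minute window, the function name, and the shape of the `last_asked` mapping are assumptions made for illustration; the article only fixes the sample size of 7.

```python
import random
from datetime import datetime, timedelta

SAMPLE_SIZE = 7                        # the "magic number" from the text
RECENT_WINDOW = timedelta(minutes=10)  # hypothetical value, not given in the article


def pick_candidates(last_asked, now, rng=random):
    """Drop words asked within RECENT_WINDOW, then sample up to SAMPLE_SIZE of the rest.

    `last_asked` maps each word to the datetime it was last asked, or None if never.
    """
    eligible = [
        word
        for word, asked_at in last_asked.items()
        if asked_at is None or now - asked_at > RECENT_WINDOW
    ]
    return rng.sample(eligible, min(SAMPLE_SIZE, len(eligible)))


now = datetime(2019, 11, 3, 9, 0)
last_asked = {f"word{i}": None for i in range(10)}
last_asked["recent"] = now - timedelta(minutes=1)
print(pick_candidates(last_asked, now))
```

Because the sample is drawn uniformly, words that are already well known still have a chance of landing on the list.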

After that, I analyze the statistics for these words and calculate the probability of making a mistake on each. Here numpy helps me: I use the random.choice method, which accepts a p argument with the probabilities associated with each entry in my array. As a result, I choose the next 7 words based on the mistakes I’ve made.
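To make the weighting concrete, here is a minimal sketch of sampling by mistake rate. It uses numpy's newer `Generator.choice`, which takes the same `p` argument as `numpy.random.choice`. The `(mistakes, attempts)` layout of `stats` and the add-one smoothing are my assumptions, not necessarily how the bot stores or weights its statistics.

```python
import numpy as np


def pick_by_mistakes(stats, size=7, rng=None):
    """Sample words without replacement, favouring those with a high mistake rate.

    `stats` maps word -> (mistakes, attempts). Add-one smoothing keeps every
    probability above zero, so well-known words can still be asked occasionally.
    """
    rng = rng or np.random.default_rng()
    words = list(stats)
    rates = np.array([(m + 1) / (a + 2) for m, a in stats.values()])
    p = rates / rates.sum()  # probabilities passed to choice must sum to 1
    size = min(size, len(words))
    return rng.choice(words, size=size, replace=False, p=p).tolist()


stats = {
    "Circumstances": (0, 20),  # never wrong in 20 attempts
    "Indefatigable": (9, 10),  # wrong 9 times out of 10
    "Dilatory": (5, 5),        # always wrong
}
print(pick_by_mistakes(stats, size=2))
```

With these numbers, "Circumstances" gets a weight of roughly 3%, so it shows up only when you get lucky.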

Since then, the number of my sad faces ‘:(’ in our chat has increased. Now the bot knows which words to ask so that I make a mistake. It asks the words I can’t memorize or always get wrong. I’m sure that’s the way to stimulate your brain to memorize a lot of information.

Do you member Indefatigable?
‘:(’
Remember that: Неутомимый

Do you member Dilatory?
‘:(’
Oooh, remember: Медлительный

Do you member Abnegation?
‘:(’
Keep in mind that: Отречение

…

I know nothing. Bot, please ask me something I know.
Circumstances, for example. Normally talked!
Ho-ho-ho!!! Only if you get lucky (my Creator ranks your vocabulary by the magical number 7)

After the first version of my chat bot, I started reading a book in English. After a few pages I realized that looking words up in the dictionary and sending messages to my bot took a lot of time. I skipped about 10 new words per page because I couldn’t afford the time. I started underlining each unknown word in the book so I could check its meaning later. But I was always too lazy to send all the underlined words to the chat. Instead, I kept thinking that a simpler way had to exist.

Being a developer is a kind of superpower. I’m serious. You can create anything you need, and it’s amazing. The new challenge in front of me was called Optical Character Recognition. Fortunately, that problem had already been solved, and I could use the pytesseract library. All I needed to do was crop my photo down to the underlined words.

My solution was to find the underlines with the help of OpenCV. I filtered the photo by a range of blue colors to get a dark mask, and then found Hough lines in the picture. I created a pandas data frame, grouped it by coordinates to split the page into text lines, and found the start and end coordinates of each word on a line. The font size and the horizontal distance between words were estimated approximately.
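The grouping step can be illustrated without any image at all. The sketch below takes line segments in the `(x1, y1, x2, y2)` form that `cv2.HoughLinesP` returns and groups them into text lines by their vertical position. Plain Python stands in for the pandas grouping here, and the function name and the 10-pixel `tolerance` are hypothetical.

```python
def group_into_rows(segments, tolerance=10):
    """Group (x1, y1, x2, y2) segments into rows whose y-coordinates are close.

    `tolerance` is a hypothetical pixel threshold for "same text line".
    Returns a list of rows, top to bottom, each sorted left to right.
    """
    rows = []
    for seg in sorted(segments, key=lambda s: (s[1] + s[3]) / 2):
        y = (seg[1] + seg[3]) / 2  # vertical midpoint of the underline
        if rows and abs(y - rows[-1]["y"]) <= tolerance:
            rows[-1]["segments"].append(seg)
        else:
            rows.append({"y": y, "segments": [seg]})
    return [sorted(r["segments"], key=lambda s: min(s[0], s[2])) for r in rows]


# Three underlines: two on the same text line, one further down the page.
segments = [(300, 102, 380, 101), (40, 100, 120, 100), (60, 160, 150, 161)]
for row in group_into_rows(segments):
    print(row)
```

Each segment's x-range then gives the approximate start and end of an underlined word, which is where the crop rectangle would go.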

The recognition logic

I colored different words on purpose, to make checking easier. After cropping, I got a set of photos with the underlined words. All I needed to do was use pytesseract to get an array of words and call a translation API for each item in the array. The bot shows the recognized and translated word pairs. I can copy the pair I need and save it to my vocabulary by sending it to the chat.

Searching underlined words on the photo
Example of Bot words translation

I know it’s not a brilliant result. I crop my photos by approximate coordinates. Photos can be taken from different books, and their quality will never be ideal. I can’t recognize words wrapped across lines, and it’s always hard to get full translations. Besides, pytesseract unfortunately sometimes can’t recognize words near the bend of the page.

But this simple solution is enough for a start. For your own start, check the GitHub link at the end of the story. I believe that the distance between underlined unknown words will always be short. Besides, if a word can’t be recognized, it can still be looked up in the dictionary.

I’ve come across a few problems that I must mention:

  1. The quality of photos gets worse when Telegram compresses them. In most cases, recognition worked better locally than in the chat. I discovered that the compressed files are much smaller, and the quality drops accordingly (3 MB -> 100 KB).
  2. Heroku doesn’t let you rely on local storage, in line with the Twelve-Factor App principles. It means you can’t inspect temp files created during a request. To debug those files, I had to use AWS S3 buckets.

In conclusion

It was a good experience. When I started, I didn’t even think I would use Firebase, OpenCV, pytesseract, Flask, and AWS services and deploy all of that to Heroku. I like my bot, I like Python, Telegram, and my profession. I highly recommend building anything that can help you in your daily life and sharing it.

You can take a look at my GitHub project:
https://github.com/Squirre1/words-book-bot

And start learning your vocabulary:
@words_book_bot

Thank you for your attention!
Good luck in your learning!
