From Pixels to Words: Building a Text Recognition System with YOLOv8 and NLP, 1/2

Arthur Lagacherie
Published in The Deep Hub
5 min read · Jul 18, 2024

Discover how a simple image can be transformed into readable text using YOLOv8 and NLP.

Hello everyone! Today we’ll train not one model but three to build a working reading bot. The goal of this story is a system that takes a photo of text and extracts the text from it (printed text to begin with; we’ll see about handwriting recognition later).

My plan is the following:

  1. Train a YOLOv8 model to detect words
  2. Train another YOLOv8 model that finds the letters in the crops produced by the previous model
  3. Fine-tune an NLP model to correct the errors and add spaces

Word Detection

Before detecting letters, we want to cut the text into words, which makes it easier for the next model to detect the letters.

To accomplish this task we will train a YOLOv8n model, but first we need a dataset (naturally), so I took 17 images and labeled them on Roboflow (17 turned out to be enough for YOLOv8 on this task).

You can see and download the dataset here.

If you’d like to know more about labeling and training, you can read my article in which I explain exactly how to go about it.
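For readers who prefer to see the training step as code, here is a minimal sketch using the Ultralytics Python package. It is not necessarily the exact setup used for this article: the dataset path, the number of epochs, and the image size are assumptions, and the helper names `weights_name` and `train_detector` are made up for illustration.

```python
# The five YOLOv8 detection sizes, from smallest to largest.
YOLOV8_SIZES = ("n", "s", "m", "l", "x")


def weights_name(size="n"):
    """Return the pretrained checkpoint name for a YOLOv8 size letter."""
    if size not in YOLOV8_SIZES:
        raise ValueError(f"unknown YOLOv8 size: {size!r}")
    return f"yolov8{size}.pt"


def train_detector(data_yaml, size="n", epochs=50, imgsz=640):
    """Fine-tune a pretrained YOLOv8 detector on a YOLO-format dataset.

    epochs and imgsz are assumed values, not the ones used in the article.
    """
    # Imported here so weights_name() still works without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights_name(size))  # downloads the checkpoint if needed
    model.train(data=data_yaml, epochs=epochs, imgsz=imgsz)
    return model


# Example call (hypothetical path to the data.yaml of a Roboflow YOLOv8 export):
# train_detector("words-dataset/data.yaml", size="n")
```

The `size` parameter is there because the same function can be reused with a larger checkpoint if the smallest one is not accurate enough.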

I trained it for 5 minutes on a Google Colab GPU and it gave some good results:

The great thing about Ultralytics is that you can test your model directly and see the results on the site: you upload your image and it applies the trained model to it. So I gave the website a picture of a sentence:

We can see that it is not perfect: some words aren’t framed. But if I decrease the confidence threshold:

All the words are framed. Nice!
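To see why lowering the threshold frames more words, here is a small sketch of what a confidence threshold does. The detections below are made-up numbers, not the model's real output, and `keep_confident` is a hypothetical helper.

```python
def keep_confident(detections, threshold):
    """Keep only the detections whose confidence is at least `threshold`."""
    return [d for d in detections if d["conf"] >= threshold]


# Made-up detections for a three-word sentence: (x1, y1, x2, y2) boxes
# with the confidence the detector assigned to each one.
detections = [
    {"box": (10, 12, 80, 40), "conf": 0.91},
    {"box": (95, 11, 150, 41), "conf": 0.18},
    {"box": (165, 13, 240, 39), "conf": 0.62},
]

print(len(keep_confident(detections, 0.25)))  # 2: the low-confidence word is dropped
print(len(keep_confident(detections, 0.10)))  # 3: every word is framed
```

With the Ultralytics Python API the same knob is the `conf` argument of `model.predict(...)`. The trade-off is that a lower threshold also lets more wrong boxes through.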

Here’s the link to the model you can download and try out:

Letter Recognition

Now that we’ve got the first part of our model for separating words, we can move on to the second part of our model:

  • letter recognition
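Between the two models there is a small glue step: each detected word has to be cut out of the photo so the letter model only sees one word at a time. Here is a rough sketch of that step, assuming the word detector returns boxes as (x1, y1, x2, y2) pixel coordinates. The function names and the `row_tolerance` value are invented for the example, and the image is a plain list of pixel rows to keep it dependency-free.

```python
def reading_order(boxes, row_tolerance=10):
    """Sort (x1, y1, x2, y2) boxes top-to-bottom, then left-to-right.

    A box joins the current text line when its top edge is within
    `row_tolerance` pixels of the top edge of that line's first box.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if lines and abs(box[1] - lines[-1][0][1]) <= row_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b[0])]


def crop(image, box):
    """Cut the (x1, y1, x2, y2) region out of an image stored as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


# Four made-up word boxes on two text lines, given in random order.
boxes = [(60, 52, 100, 70), (5, 10, 40, 30), (50, 13, 90, 31), (8, 50, 50, 72)]
ordered = reading_order(boxes)

# A tiny fake 4x6 "image" just to show the slicing.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = crop(image, (1, 1, 3, 3))
```

With a NumPy image the crop would simply be `image[y1:y2, x1:x2]`. Sorting the boxes first means the words come out in the order a person would read them.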

For letter recognition we will follow the same process as for the previous model. So I began by creating a dataset of 51 images with all the letters labeled.

You can see and download it here.

After this, I trained a YOLOv8n model.

But... it doesn’t work very well, even if I set the threshold to zero. 😔

But I have an idea! The YOLOv8 model is available in several sizes. I used the smallest one (YOLOv8n), but there are 4 larger models (s, m, l, and x). So I took the “s” model, which is a little larger.

https://docs.ultralytics.com/models/yolov8/#supported-tasks-and-modes

After training, the “s” model gave results that were not excellent but were better than those of the “n” model.

But with the threshold at its minimum, it doesn’t get everything right: it detects two “e”s.
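One possible way to clean up this kind of double detection after the fact, assuming the two boxes sit on top of the same letter (the article does not say whether they do), is to keep only the most confident box among those that overlap heavily. This is a sketch with made-up boxes and hypothetical helper names, not something used in this article.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def drop_duplicates(detections, iou_threshold=0.5):
    """Greedy suppression: visit boxes from most to least confident and
    keep a box only if it does not heavily overlap one already kept."""
    kept = []
    for det in sorted(detections, key=lambda d: d["conf"], reverse=True):
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Made-up letter detections: the second "e" almost covers the first one.
letters = [
    {"label": "e", "box": (10, 0, 20, 10), "conf": 0.8},
    {"label": "e", "box": (11, 0, 21, 10), "conf": 0.3},
    {"label": "t", "box": (0, 0, 9, 10), "conf": 0.7},
]
cleaned = drop_duplicates(letters)
print([d["label"] for d in cleaned])  # ['e', 't']
```

The 0.5 overlap threshold is an arbitrary choice. If the extra “e” is in a different place entirely, this will not remove it, and more data or a bigger model is the better fix.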

So I added 13 images to the dataset and trained the “m” model for longer than the “s” model. With a larger model, a larger dataset, and a longer training time, I think it will be good.

This is not perfect: it detects “s, 1”. So let’s add more images to the dataset (30)! And train a bigger model, “l”.

It works very well!! All the detected letters are correct.

We can see the model doesn’t detect all the letters 100% of the time, but it has good results, and the NLP model should be able to reconstruct the right words from this.
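To make concrete what the NLP model will receive, here is a small sketch that turns letter detections into a raw string by reading the boxes from left to right. The detections are invented for the example (one “l” of “hello” is deliberately missing) and `letters_to_string` is a hypothetical helper.

```python
def letters_to_string(detections):
    """Join detected letter labels from left to right,
    ordering them by the horizontal center of each box."""
    def center_x(det):
        x1, _, x2, _ = det["box"]
        return (x1 + x2) / 2

    return "".join(det["label"] for det in sorted(detections, key=center_x))


# Made-up detections for the word "hello", in the arbitrary order a
# detector might return them, with one "l" missed.
detections = [
    {"label": "o", "box": (34, 0, 42, 12)},
    {"label": "h", "box": (0, 0, 10, 12)},
    {"label": "l", "box": (22, 0, 26, 12)},
    {"label": "e", "box": (12, 0, 20, 12)},
]

print(letters_to_string(detections))  # helo
```

A string like “helo” is exactly the kind of imperfect input the third model will have to correct.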

The model.

This is the end of part one. I hope you enjoyed this article, and if you did, there’s no reason why you shouldn’t clap for it. = )

I am a French high school student with a passion for artificial intelligence. I like to share my curiosity with others.👍