Generating Synthetic Indic Scene Text Recognition Data
India is a diverse country, home to hundreds of local languages and dialects, so scene text in the wild appears in many different languages. This creates a need for text recognition models that can handle all of these languages. However, building such models requires language-specific real scene text data, which is scarce for most languages and abundant only for English. One solution to the data scarcity problem is to generate high-quality, realistic synthetic data, and luckily for us there is a method, SynthText, that lets us do just that.
SynthText is a fast, scalable engine for generating realistic synthetic images of text that blends well into the geometry of a given scene. It was originally used to generate synthetic data for English, but it can be extended to Indic languages as well. We have done so by generating a text recognition dataset of around ten thousand samples for the following Indic languages: Hindi, Bangla, Odia, Telugu, Tamil, Malayalam, Kannada, Marathi and Sanskrit. The text generation pipeline can be summarised as follows:
- Sampling a word from a text corpus (Wikipedia articles, news articles) and a background image pulled from a Google search.
- Segmenting the background image into contiguous regions based on local color and texture cues obtained by thresholding gPb-UCM contour hierarchies.
- Predicting a dense depth map with an off-the-shelf CNN, which helps orient and fit the text to a contiguous region.
- Rendering the text once its location and orientation are known, using a color palette learned on the IIIT 5K dataset. The palette is a set of color pairs approximating foreground and background colors; the pair whose background color matches the target region’s color is selected to render the text.
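As a rough illustration of the last step, here is a minimal sketch of picking a color pair for a target region. It is not the SynthText implementation: the function name `nearest_color_pair`, the toy palette values, and the choice of squared Euclidean distance in RGB space are all assumptions made for this example, and the real generator may compare colors differently.

```python
def nearest_color_pair(region_rgb, palette):
    """Return the (foreground, background) pair whose background is
    closest to the region color.

    Assumption: closeness is squared Euclidean distance in RGB space.
    `palette` is a list of (foreground_rgb, background_rgb) tuples.
    """
    def sq_dist(c1, c2):
        return sum((x - y) ** 2 for x, y in zip(c1, c2))

    # Compare only the background half of each pair with the region color.
    return min(palette, key=lambda pair: sq_dist(pair[1], region_rgb))


# Hypothetical palette; a real one would be learned from cropped word images.
toy_palette = [
    ((255, 255, 255), (0, 0, 0)),      # white text on black
    ((0, 0, 0), (255, 255, 255)),      # black text on white
    ((255, 255, 0), (0, 0, 128)),      # yellow text on navy
]

foreground, background = nearest_color_pair((10, 10, 100), toy_palette)
print(foreground, background)
```

The text would then be drawn in `foreground`, since the region it lands on already resembles `background`.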
To render text in other languages, we first need to download fonts and a text corpus for the language we want to generate data for. Google Fonts is a good resource for downloading fonts.
We faced a couple of challenges when adapting the generator to a new language. First, some of the generated examples were either mirrored or inverted; a fix is to ensure that the determinant of the rotation sub-matrix of the homography is always positive. Second, a language's text corpus can contain a mixture of characters from multiple languages, so it is prudent to filter out text instances containing characters that are not in the language's character set. The filtering can be carried out using regular expressions and knowledge of the Unicode ranges for different languages. Unicode ranges for different languages can be found here.
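Both fixes can be sketched in a few lines of Python. This is an illustrative sketch rather than the code from the adapted generator: the names `preserves_orientation`, `SCRIPT_RANGES`, and `filter_words` are made up for this example, the homography is assumed to be a 3x3 nested list whose upper-left 2x2 block is the rotation sub-matrix, and the ranges listed are the standard Unicode blocks for each script.

```python
import re


def preserves_orientation(H):
    """Check that the upper-left 2x2 block of a 3x3 homography has a
    positive determinant. A sample that fails this check would be
    rejected or re-sampled."""
    a, b = H[0][0], H[0][1]
    c, d = H[1][0], H[1][1]
    return a * d - b * c > 0


# Unicode blocks (inclusive) for a few Indic scripts.
SCRIPT_RANGES = {
    "devanagari": ("\u0900", "\u097F"),  # Hindi, Marathi, Sanskrit
    "bengali": ("\u0980", "\u09FF"),
    "oriya": ("\u0B00", "\u0B7F"),
    "tamil": ("\u0B80", "\u0BFF"),
    "telugu": ("\u0C00", "\u0C7F"),
    "kannada": ("\u0C80", "\u0CFF"),
    "malayalam": ("\u0D00", "\u0D7F"),
}


def filter_words(words, script):
    """Keep only the words made up entirely of characters from one script block."""
    lo, hi = SCRIPT_RANGES[script]
    pattern = re.compile(f"[{lo}-{hi}]+")
    return [w for w in words if pattern.fullmatch(w)]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mirrored = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]  # horizontal flip
print(preserves_orientation(identity), preserves_orientation(mirrored))

corpus = ["नमस्ते", "hello", "भारत2020", "দেশ"]
print(filter_words(corpus, "devanagari"))
```

Note that each block also covers that script's own digits and punctuation, and that joiner characters such as U+200C and U+200D live outside these blocks, so words containing them would be dropped by this simple filter. Whether that is acceptable depends on the corpus.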
As mentioned before, the generator was used to create a synthetic Indic text recognition dataset. The dataset is stored in LMDB format because of its low disk storage footprint, which is helpful when dealing with large synthetic datasets. It has already been preprocessed and split into training and validation sets, so model training can start right away. Apart from the dataset and the adapted generator, we also provide code for training single-task or multi-task text recognition models, along with pretrained baseline models for the languages present in the dataset. The dataset can be downloaded from here, code with instructions to generate additional samples or samples for a new language can be found here, and code to start training new language text recognition models can be found here.
We hope that these resources will help in advancing the state of the art in text recognition for Indic languages. If you enjoyed reading the article, hit that clap button below!
References
Synthetic Data for Text Localisation in Natural Images https://arxiv.org/abs/1604.06646