ShareOCR: The end-to-end OCR for Indic content: Part 2

Tech @ ShareChat · Published in ShareChat TechByte · 11 min read · Nov 19, 2021

Written by Rishabh Jain, Praveen Dhinwa, Vikram Gupta, Hastagiri Vanchinathan, Debdoot Mukherjee

ShareChat is India’s leading social media platform & Moj is India’s number one short video platform, with one of the highest monthly active user counts. Both allow users to share their opinions and record their lives, all within the comfort of their preferred languages. Our content consists of images and videos, most of which are created and uploaded by our users.

In part 1 of this blog, we talked about building an in-house OCR system on our user-generated images and videos. We spoke about data challenges, text detection and steps for text recognition. We also touched on why the text embedded in videos and images is an extremely important feature to us.

While that pipeline works well, our data is far more nuanced and challenging because of the multiple languages involved and our specific use case. In this part, we discuss the following problems and their solutions:

  Problem 1: Imperfect bounding boxes
  Problem 2: Support for multiple languages
  Problem 3: Box weaving for comprehensible full text
  Problem 4: Relevant text extraction
  Problem 5: OCR on videos

Dynamic Padding on Text Detection output

As shown in Fig 1, the bounding boxes generated by the text-detection algorithm are imperfect and often fall short along the horizontal axis, which degrades text recognition in the later steps. This is not a fault of the algorithm but of the ground truth itself.

Fig 1: Red: Original coordinates in data. Green: Coordinates by our approach. Original coordinates often cut characters in between as compared to corrected coordinates. Verified by human labeling on random examples

A common approach is to pad the boxes horizontally by some pixels so they cover the entire word. Pad the boxes a little less, and we still get broken words. Pad them a little more, and they can collide or carry extra padding, which hampers the text recognition step: non-meaningful words get created, or the actual word shrinks relative to the bounding box.

Following similar lines, we first tried adding a constant number of pixels to each side of the box. But since boxes vary in aspect ratio and length, this did not perform well on our metrics.

Then we tried percentage-based padding, where each side of a box is expanded by x% of its length. This approach performed a little better but was still not up to the mark: the increase was so slight that it hardly mattered for 1–2 character words, while for 6–7 character words the expansion led to extra padding and collisions.

So, we designed an intuitive, dynamic padding algorithm. Put simply, it simultaneously grows the lengths of all the boxes in an image up to a fixed threshold, such that they do not collide with each other, stopping early when a collision is imminent. The green boxes in Fig 1 are the result of the dynamic padding algorithm and look better than the original red boxes.

The exact algorithm we used is as follows:

  1. Each box looks at the nearest box on its left and right that falls in its line-neighborhood. The line-neighborhood of a box is the region the box would cover if its sides parallel to the horizontal axis were extended to the ends of the image on each side.
  2. Each box is expanded on the left by at most half the distance to its left neighbor, and similarly on the right. The expansion is also thresholded by a fraction of the original box length. For the occasional box with no neighbor on a side, we cap the expansion by a maximum threshold based on the original box length.

Fig 2 illustrates the dynamic padding strategy; a minimal sketch of the idea follows the figure.
Fig 2: Dynamic Padding approach shown in diagram
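The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2); the expansion-cap fraction is an illustrative placeholder for the tuned threshold, not our production value.

```python
# A minimal sketch of the dynamic padding described above.
def _same_line(a, b):
    """True if box b vertically overlaps the horizontal extension of box a."""
    return not (b[3] <= a[1] or b[1] >= a[3])

def dynamic_pad(boxes, max_expand_frac=0.5):
    padded = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cap = max_expand_frac * (x2 - x1)  # never grow beyond a fraction of the box length

        # Distance to the nearest neighbour on each side within the line-neighborhood.
        left_gap = min((x1 - b[2] for j, b in enumerate(boxes)
                        if j != i and _same_line(boxes[i], b) and b[2] <= x1), default=None)
        right_gap = min((b[0] - x2 for j, b in enumerate(boxes)
                         if j != i and _same_line(boxes[i], b) and b[0] >= x2), default=None)

        # Expand by at most half the gap to the neighbour, capped by the threshold;
        # with no neighbour on a side, fall back to the cap alone.
        grow_left = cap if left_gap is None else min(cap, max(0.0, left_gap / 2))
        grow_right = cap if right_gap is None else min(cap, max(0.0, right_gap / 2))
        padded.append((x1 - grow_left, y1, x2 + grow_right, y2))
    return padded
```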

We compared all the padding techniques by training a recognizer for each of them and comparing accuracy metrics. The dynamic padding approach outperformed the other two techniques significantly. The direct advantage of dynamic padding can be seen in Fig 3 as well.

Fig 3: Applying dynamic padding on corrupted data. Recognition Results: सपने (Incorrect) vs अपने (Correct)

Incorporating new languages

Script Identification Module

At ShareChat & Moj, we serve our users and creators in 15+ Indic languages. One approach is to train a single text-recognition model, as described in part 1, by combining the vocabularies of every language. Computationally, training such a model is infeasible. The alternative is to place a script-identification module before the text-recognition module. This module is best framed as an image-classification problem, in which we categorize bounding-box-level crops into one of several script categories. English, though not directly one of the 15+ languages of our ecosystem, is still present in large numbers, since it appears in significant proportions across content in all languages.

Script vs Language Identification Module?

Many Indic languages and dialects share a character list with other Indic languages. For example, Assamese and Bengali share the same script, and so do Rajasthani and Bhojpuri with Hindi. It does not make sense to use a language-identification module instead of a script-identification module, as the data for the former would essentially be imperfect. So, we decided to use a script-identification module.

We faced two major challenges while developing the script-identification module:

  1. Enabling support for languages that share the same script. Languages like Hindi, Marathi and Bhojpuri share a script but have vastly different word vocabularies, as shown in Fig 4.
  2. Handling the incorrect examples in our underlying data (recall the data issues from part 1 of this blog).

To handle the first challenge, we understood the linguistic details of each language and verified our observations by seeing what characters are shared by different languages from our data. We finally arrived at an exhaustive set of 11 different scripts: English-Latin, Devanagari, Gurumukhi, Bangla, etc.

Fig 4: Marathi vs Hindi. Both use the Devanagari script, but their word vocabularies are quite different

To handle the second challenge of incorrect examples in our data, we first tried removing them with a dictionary built from our data combined with a standard dictionary of words in that language. With a vast number of unique words, this removed many true examples as well.

Then we checked the script of every word in our data against the language of the post it came from. Further, we fixed which scripts can be expected for a particular post language, drawing on the linguistic knowledge of our diverse content team. Based on these observations, we found two kinds of false examples:

  1. Indic-language words that were not expected to be present in that post language
  2. Words that are neither English nor Indic in their characters. We called this category Gibberish and included it as the 12th category in our script-identification module. This Gibberish category also helped us with false instances where something looks like text and is detected by our detector but is rejected (labeled Gibberish) by our script-identification module. One example can be seen in Fig 5.
Fig 5: Blue boxes from the text detection step are classified as “Gibberish” in the script identification step, helping us detect false positives

We finally trained an image-classification model to predict one of the 12 script categories.
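To make the classification step concrete, here is a minimal sketch of a 12-way crop classifier; the ResNet-18 backbone, input size and class handling are our illustrative assumptions, not the production setup.

```python
# A minimal sketch of a 12-way script classifier over word-level crops.
import torch
import torch.nn as nn
from torchvision import models

NUM_SCRIPTS = 12  # 11 scripts (English-Latin, Devanagari, Gurumukhi, Bangla, ...) + Gibberish

class ScriptClassifier(nn.Module):
    def __init__(self, num_classes=NUM_SCRIPTS):
        super().__init__()
        self.backbone = models.resnet18(weights=None)  # any small CNN backbone works
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, crops):        # crops: (B, 3, H, W) resized word-level crops
        return self.backbone(crops)  # (B, 12) script logits

model = ScriptClassifier()
logits = model(torch.randn(4, 3, 64, 256))  # four dummy word crops
preds = logits.argmax(dim=1)                # crops predicted as Gibberish are dropped later
```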

Text Recognition

Once a crop is identified with a particular script, we send it to the text recognizer. As discussed in part 1 of this blog, we used the same architecture for the text-recognition step; only the character vocabulary changed with each script.

To generate the vocab for each script, we used the 94 most common characters of each language. For most scripts, these covered more than 98% of the characters occurring in that language. The choice of 94 characters came from English, as described in part 1, and we stuck with it to use transfer learning efficiently.
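As a concrete illustration, a per-script vocabulary of the 94 most frequent characters could be built from ground-truth transcriptions roughly as follows; the helper name and inputs are hypothetical.

```python
# A rough sketch of building a 94-character vocabulary for one script.
from collections import Counter

def build_vocab(transcriptions, vocab_size=94):
    counts = Counter(ch for word in transcriptions for ch in word)
    vocab = [ch for ch, _ in counts.most_common(vocab_size)]
    coverage = sum(counts[ch] for ch in vocab) / sum(counts.values())
    return vocab, coverage  # coverage exceeded 98% for most of our scripts

vocab, coverage = build_vocab(["শুভ", "সকাল", "বন্ধুরা"])  # toy example
```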

We used the same architecture for all eleven scripts, excluding Gibberish, whose crops are rejected and not sent further. We used the pre-trained ASTER weights of one script's recognizer to train another. In the end, we had eleven different text recognizers, one for each script.

Scripts with a single associated language performed well on this task. But scripts such as Devanagari, which have multiple associated languages (Hindi, Marathi, Bhojpuri, etc.), failed to generalize when trained on data from one language and tested on other languages of the same script. This is because models like ASTER inherently memorize the sequences of characters that form meaningful words; with different languages sharing the same script, such patterns of sounds and characters do not generalize easily. To handle these cases, we built a mixed bag of data consisting of all the languages of a single script, in proportion to their share of our traffic, while artificially allocating a minimum quota to each one.
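The mixed-bag construction could look roughly like the sketch below; the quota fraction, traffic shares and helper names are illustrative assumptions, not our production values.

```python
# An illustrative mixed training bag for one script (e.g., Devanagari):
# sample each language in proportion to traffic, but guarantee a minimum share.
import random

def mixed_bag(samples_by_lang, traffic_share, total, min_share=0.10):
    """samples_by_lang: {lang: [word crops]}, traffic_share: {lang: fraction}."""
    langs = list(samples_by_lang)
    # Reserve the minimum quota first, then split the rest by traffic share.
    quotas = {l: int(min_share * total) for l in langs}
    remaining = total - sum(quotas.values())
    norm = sum(traffic_share[l] for l in langs)
    for l in langs:
        quotas[l] += int(remaining * traffic_share[l] / norm)
    return [s for l in langs
            for s in random.sample(samples_by_lang[l],
                                   min(quotas[l], len(samples_by_lang[l])))]
```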

Fig 4 lists some of the languages supported on ShareChat & Moj against the primary languages whose characters we used to create the vocab.

Fig 4: Vocab-Created-From vs Languages Supported. Some minor changes were made, as specified in the text

Box Weaving

Though from the outside our problem resembles OCR in the wild, the primary distinction is content intelligence. We must not only perform detection, identification, and recognition correctly, but also weave the content so that it is readable and understandable to humans, let alone models. On the surface this seems easy, but the way pieces of content are placed on a background makes it hard to decide what each blob of image-text represents: is it connected to something else, or should it be split further because it is a union of many independent sections?

We first looked at several deep-learning approaches that could solve the problem but could not find one useful for us. We finally settled on a Box-Weaving algorithm that we designed and verified with manual annotations on some data samples.

Here is our approach:

  1. We compute the average box height and width in each image and manually tune a height and width expansion ratio. Each box is expanded in all four directions using these two factors.
  2. Once expanded, some of these boxes intersect with each other. In graph space, where each node represents a box, an edge joins boxes that intersect.
  3. We run a connected-components algorithm to find individual components, each a blob belonging to the same text segment.
  4. We then arrange these segments top-down and left-right in the full-text generation, based on their centers and sizes.
  5. To order the words inside each segment, we use a manually tuned vertical padding value that also handles slightly rotated text lines.

A minimal sketch of this weaving logic is given below.
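The sketch assumes axis-aligned boxes (x1, y1, x2, y2) paired with their recognized words; the expansion ratios and the row-bucketing divisor (standing in for the tuned vertical padding) are illustrative assumptions.

```python
# A simplified sketch of the box-weaving idea.
import networkx as nx

def weave(boxes, words, h_ratio=0.5, w_ratio=0.25):
    def expand(b):
        x1, y1, x2, y2 = b
        dw, dh = w_ratio * (x2 - x1), h_ratio * (y2 - y1)
        return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)

    def intersects(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    # Expanded boxes that overlap become edges of a graph; each node is a box.
    expanded = [expand(b) for b in boxes]
    g = nx.Graph()
    g.add_nodes_from(range(len(boxes)))
    g.add_edges_from((i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))
                     if intersects(expanded[i], expanded[j]))

    segments = []
    for comp in nx.connected_components(g):
        # Order words inside a segment top-to-bottom, then left-to-right;
        # the divisor (20 px) plays the role of the tuned vertical padding.
        idx = sorted(comp, key=lambda i: (round(boxes[i][1] / 20), boxes[i][0]))
        segments.append((min(boxes[i][1] for i in comp),  # segment top, for ordering
                         min(boxes[i][0] for i in comp),
                         " ".join(words[i] for i in idx)))
    segments.sort()  # top-down, left-right over segments
    return "".join(f"<Seg{k + 1}>{text}" for k, (_, _, text) in enumerate(segments))
```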

We also experimented with training the same EAST model over text lines to obtain line orientation, but the classical top-to-bottom, left-to-right approach outperformed the deep learning approach in our case. We also understand that this cannot weave vertical or highly rotated text lines correctly, but such cases are rare given our data and user behavior. Illustrations of the box-weaving output are given in Fig 5 and Fig 6.

Fig 5: A sample image from one of our Bengali Posts
FullText for Fig 5:<Seg1>জীবনটা অনেকছােটো !<Seg2>শুভ সকাল বন্ধুরা<Seg3>তাই হাঁসাে এমন করেযাতে কষ্টভ তােমার কাছে।এসে আনন্দ পায় ।
Fig 6: A sample image from one of our Hindi Posts. The red boxes of text are rejected by our relevant text extraction module
FullText for Fig 6:<Seg1>दोस्त दवा से भी ज्यादेअच्छे होते हैक्योकिदवा तो expire हो जाती है..<Seg2> --------(Not Relevant)--------omessagekaro<Gibberish> more<Seg3>पर अच्छे दोस्तों की कोईexpiry date नहीं होती..<Seg4>good morning ji

Relevant Text Extractor

While outputting the full text, not all of its words are vital to us. Often, websites, contact information, author signatures, and other logos are intertwined with the full text and degrade the performance of downstream models in the content-intelligence pipeline. So we built a relevant text extractor that uses the full text from the previous steps, along with other bounding-box-level metadata, to return relevant information alongside the full text.

We identify flags that make a segment from the box-weaving module irrelevant to us. The flags commonly include:

  1. Logo-specific stop words
  2. Contact information
  3. URL
  4. If the sizes and number of bounding boxes are below a certain threshold, we reject the content.
  5. To remove author signatures, watermarks, and brand names, we use other attributes, including the aspect ratio of boxes, anomalous rotations, scripts, size differences between the main content and other boxes, and their placement relative to the full text. A simplified filter along these lines is sketched after this list.
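The regexes, stop words and thresholds below are hypothetical stand-ins for the tuned rules described above, not the actual production flags.

```python
# An illustrative relevance filter over woven segments.
import re

LOGO_STOP_WORDS = {"subscribe", "like", "share", "follow"}
URL_RE = re.compile(r"(https?://|www\.)\S+", re.I)
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def is_relevant(segment_text, boxes, min_boxes=2, min_box_area=400):
    words = segment_text.lower().split()
    if any(w in LOGO_STOP_WORDS for w in words):
        return False
    if URL_RE.search(segment_text) or PHONE_RE.search(segment_text):
        return False
    # Tiny segments (few boxes, small boxes) are likely watermarks or signatures.
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    if len(boxes) < min_boxes or max(areas) < min_box_area:
        return False
    return True
```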

We compared the results with the manual annotation and tuned the parameters to our use case. The output of one such case is shown in Fig 6, which clubs both the box-weaving module and the relevant text extractor module.

Getting Relevant Image Frames from Videos

To enable our image-based ShareOCR to solve the problem of video OCR as well, we apply a decoding algorithm to the videos and sample the relevant frames out of each video.

Here is our approach:

  1. First, we decode the video using an FFMPEG decoder. The decoder breaks the video into frames and associates a scene-change value s with each frame, which lies between 0 and 1: 0 indicates that the frame is identical to the previous frame, and 1 indicates a high degree of dissimilarity between the current frame and the previous one.
  2. We take a set of at most 16 frames whose scene-change value exceeds a particular threshold.
  3. These frames are passed to a RESNET architecture, which generates a k-dimensional feature vector for each frame. We map the frames into this k-dimensional space and keep the four frames that are most dissimilar to each other post RESNET.
  4. These four frames are then fed to our OCR model separately, and the outputs are combined for the complete video.

A rough sketch of this frame-selection stage is given below.
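The sketch assumes FFmpeg's scene expression as the scene-change value, a torchvision ResNet for the frame features, and a greedy farthest-point pick for the four frames; the paths, threshold and backbone are illustrative.

```python
# A rough sketch of the frame-selection stage for video OCR.
import subprocess, glob
import torch
from torchvision import models, transforms
from PIL import Image

# 1. Keep frames whose scene-change score exceeds the threshold (here 0.3).
subprocess.run([
    "ffmpeg", "-i", "video.mp4",
    "-vf", "select='gt(scene,0.3)'", "-vsync", "vfr",
    "frames/%03d.jpg",
], check=True)
frames = sorted(glob.glob("frames/*.jpg"))[:16]  # cap at 16 candidates

# 2. Embed each candidate frame with a ResNet backbone (fc layer removed).
backbone = torch.nn.Sequential(*list(models.resnet50(weights=None).children())[:-1]).eval()
prep = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
with torch.no_grad():
    feats = torch.stack([backbone(prep(Image.open(f)).unsqueeze(0)).flatten() for f in frames])

# 3. Greedily pick up to 4 frames that are far apart in feature space.
chosen = [0]
while len(chosen) < min(4, len(frames)):
    dists = torch.cdist(feats, feats[chosen]).min(dim=1).values
    dists[chosen] = -1                           # never re-pick a chosen frame
    chosen.append(int(dists.argmax()))
selected_frames = [frames[i] for i in chosen]    # fed to ShareOCR one by one
```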

Final Remarks

Developing ShareOCR has been a ride: integrating different deep learning networks for each task, creating algorithms to tackle various data inaccuracies, and solving complex but niche problems specific to our use case. Some of these lessons apply broadly when dealing with Indic-language content, especially for languages that the AI community hasn’t picked up much. We look forward to your comments on ShareOCR.

Cover illustration by Ritesh Waingankar
