HappiestMinds uses PyTorch to automatically extract critical product information for a large US Retailer
The first wave of digital transformation has settled in, and every business selling directly to its consumers has had to embrace an omni-channel strategy. However, several teething troubles crop up as a traditional business moves its operations into the alluring but complicated digital landscape. This rings especially true for retail, which must navigate a plethora of challenges around suppliers, inventory management, multiple SKUs, and customer engagement and loyalty, to name just a few.
When a retailer dives into the digital ecosystem by setting up an online storefront, it becomes imperative that all their product pages are factual, engaging, and responsive. Any page that falls short of this cardinal rule can lead to a substandard customer experience, lost revenue, and a high rate of returns.
One of our large retail clients was plagued by exactly these issues: their existing process relied on manually checking the correctness of information for millions of SKUs listed across their digital storefronts, and it simply could not scale. To make matters worse, they were working with thousands of suppliers who provided product information in multiple formats.
Our client wanted to verify whether the information printed on the product matched the product description on the webpage, and whether the vendor contact details had been included. The need of the hour was an AI-driven automated solution that could do this with pinpoint accuracy.
CREATING A DEEP-LEARNING BASED SOLUTION POWERED BY PYTORCH
As the team at Happiest Minds set about this task, our initial approach was to extract all the text from a product image using PyTesseract. While a library like Tesseract made our lives easier by extracting text from an image, it had limitations in accurately identifying italics and small-sized text. It also required a lot of pre-processing to remove non-relevant information, such as logos and promotional content, from the input image.
This made us pivot towards a deep learning model with pre-trained weights to accelerate solution development and save valuable time for our client. The decision to go with PyTorch as the deep learning framework of choice was an easy one, as it comes with a multitude of pre-trained open-source models tailored specifically for images. Moreover, as PyTorch is Pythonic, it is easier to use and faster to implement, while providing a way to fine-tune pre-trained model weights that significantly improves task-specific accuracy. PyTorch's data parallelism is highly useful for building and fine-tuning models even with large amounts of data, and most cloud platforms readily support PyTorch, which tipped the scales heavily in its favor.
EXTRACTING THE MOST OUT OF CRNN — AN END TO END TRAINABLE NEURAL NETWORK
We at Happiest Minds chose CRNN (Convolutional Recurrent Neural Network) for extracting text from the millions of product images we were working with. CRNN combines a CNN, an RNN, and CTC (Connectionist Temporal Classification) for image-based sequence-recognition tasks, such as scene text recognition and OCR, and typically provides better text-extraction accuracy, even on product images. For the removal of non-relevant information, we used custom processing, explained in the pipeline below.
The following pipeline is designed to extract text information from product images. The text displayed in product images can vary due to illumination, viewing angle, and font size, and we considered these problems as part of the pipeline design.
1. Text bounding-box detection
The first step is to detect the bounding boxes of words. EAST (An Efficient and Accurate Scene Text Detector) is a convolutional neural network used for scene text detection; it extracts text bounding boxes from product images. These bounding boxes are combined and processed in the following stages.
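As a simplified illustration of what this stage produces, the sketch below decodes an EAST-style score map and geometry map into axis-aligned word boxes. The function name, the 4-pixel feature-map stride, and the decision to ignore EAST's rotation channel are illustrative assumptions for brevity, not our production code:

```python
import numpy as np

def decode_east_outputs(scores, geometry, score_thresh=0.5, stride=4):
    """Decode EAST-style outputs into axis-aligned word boxes.

    scores:   (H, W) text-confidence map.
    geometry: (4, H, W) distances from each cell to the top, right,
              bottom, and left edges of its word box (rotation angle
              ignored here for brevity).
    Returns a list of (x1, y1, x2, y2, score) tuples.
    """
    boxes = []
    ys, xs = np.where(scores >= score_thresh)
    for y, x in zip(ys, xs):
        # Each feature-map cell maps to a point on a `stride`-pixel grid.
        cx, cy = x * stride, y * stride
        top, right, bottom, left = geometry[:, y, x]
        boxes.append((cx - left, cy - top, cx + right, cy + bottom,
                      float(scores[y, x])))
    return boxes
```

In the real detector, these raw boxes would additionally be merged with non-maximum suppression before the later stages.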
2. Transform and construct lines
The word bounding boxes extracted using EAST are quadrilaterals that are sometimes skewed or rotated. These irregular bounding boxes are reshaped into proper rectangles using perspective transforms, which straightens the skewed text.
These bounding boxes are then clustered into lines by sorting and grouping their X and Y co-ordinates. Clustering the text into lines helps construct relevant, meaningful sentences.
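The two operations in this step can be sketched as follows: `order_quad` puts a quadrilateral's corners into the order a perspective transform expects, and `cluster_boxes_into_lines` groups word boxes into reading-order lines. The function names and the pixel tolerance are illustrative assumptions:

```python
def order_quad(pts):
    """Order four corner points as top-left, top-right, bottom-right,
    bottom-left -- the corner ordering a perspective transform expects."""
    pts = sorted(pts, key=lambda p: p[0] + p[1])       # tl has min x+y, br max
    tl, br = pts[0], pts[-1]
    mid = sorted(pts[1:3], key=lambda p: p[1] - p[0])  # tr has min y-x
    tr, bl = mid[0], mid[1]
    return [tl, tr, br, bl]

def cluster_boxes_into_lines(boxes, y_tol=10):
    """Group word boxes (x1, y1, x2, y2) into text lines: boxes whose
    vertical centers lie within `y_tol` pixels of a line's first box
    join that line; each line is then sorted left-to-right."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        cy = (box[1] + box[3]) / 2
        if lines and abs(cy - lines[-1]["cy"]) <= y_tol:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"cy": cy, "boxes": [box]})
    return [sorted(line["boxes"], key=lambda b: b[0]) for line in lines]
```

In the pipeline, the ordered corners of each quad feed the perspective warp, and the clustered lines are what get stitched back into sentences.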
3. Bounding box detection of non-relevant region
The areas covered by logos and promotional ads are considered non-relevant regions for our use case, because they are not part of the product description.
For a given product image, the non-relevant region can be detected by taking the convex hull of the largest contour. This assumption was agreed upon with the business for the set of products in scope.
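In practice, contour extraction and convex hulls typically come from an image-processing library such as OpenCV (`cv2.findContours` / `cv2.convexHull`). As a self-contained sketch of the hull computation itself, here is Andrew's monotone-chain algorithm applied to a contour's points:

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull; returns hull vertices in
    counter-clockwise order. Used here to outline the non-relevant
    (logo/promo) region given the points of the largest contour."""
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in points:                     # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):           # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]       # endpoints are shared
```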
4. Removing text in non-relevant regions
The text residing in the non-relevant region mostly contains the product name, offers, or discounts, which are not part of our analysis. This text is removed by calculating the median point of each bounding box: if the median lies within the non-relevant region, the text is eliminated; otherwise it is retained.
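The median-based filtering can be sketched as below. For brevity the non-relevant region is approximated here as an axis-aligned rectangle rather than the convex-hull polygon, and the names are illustrative:

```python
def filter_boxes(boxes, region):
    """Drop word boxes whose median (center) point falls inside the
    non-relevant region; keep the rest. `region` is an axis-aligned
    rectangle (x1, y1, x2, y2) for simplicity -- the full pipeline
    would test against the convex-hull polygon instead."""
    rx1, ry1, rx2, ry2 = region
    kept = []
    for (x1, y1, x2, y2) in boxes:
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2   # median point of the box
        if rx1 <= mx <= rx2 and ry1 <= my <= ry2:
            continue                             # inside logo/promo area: discard
        kept.append((x1, y1, x2, y2))
    return kept
```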
5. Scale the smaller text for better accuracy
One of the common issues in text extraction is handling text in smaller bounding boxes. We address this by scaling up the bounding boxes using Lanczos resampling before passing them into CRNN, which leads to significantly better results.
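A minimal sketch of the upscaling step, assuming Pillow is used for image handling; the function name and the 4x factor are illustrative choices:

```python
from PIL import Image

def upscale_crop(crop, factor=4):
    """Upscale a small word crop with Lanczos resampling before OCR.
    Small text often sits in boxes only a few pixels tall; Lanczos
    interpolation preserves stroke edges better than nearest-neighbor
    or bilinear resampling."""
    w, h = crop.size
    return crop.resize((w * factor, h * factor), Image.LANCZOS)
```

The same effect is available in OpenCV via `cv2.resize` with `interpolation=cv2.INTER_LANCZOS4`.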
6. Text Extraction
A PyTorch implementation of CRNN (Convolutional Recurrent Neural Network) with pre-trained weights is used to extract the text information. Compared to Tesseract, we observed higher accuracy on text in italics and smaller-sized fonts.
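CRNN inference itself needs trained weights, but the decoding stage it relies on, greedy CTC decoding of the per-timestep outputs, can be sketched independently. The alphabet and blank index below are illustrative assumptions:

```python
import numpy as np

def ctc_greedy_decode(logits, alphabet, blank=0):
    """Greedy CTC decoding of CRNN outputs: take the argmax class at
    each timestep, collapse consecutive repeats, then drop blanks.

    logits:   (T, C) per-timestep scores; column `blank` is the CTC blank.
    alphabet: string mapping class index 1..C-1 -> character.
    """
    best = np.argmax(logits, axis=1)
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # new, non-blank character
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)
```

A blank between two identical argmax classes is what lets CTC emit doubled letters, which is why the collapse happens before blanks are removed.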
GREATER ACCURACY, INCREASED COVERAGE AND ACCELERATED AUDITS TO DRIVE CUSTOMER SATISFACTION
By using pre-trained PyTorch models, we reduced development complexity while speeding up the process, delivering results for our client well ahead of schedule. Using the pipeline described above along with pre-trained PyTorch models, we achieved 90% accuracy in relevant text extraction. And because PyTorch is supported by most cloud providers, we were able to productionize the pipeline with far less effort.
The solution helped our retail client maintain the quality of product images and ensure the sanctity of the information provided, driving down the number of returns and complaints. We helped ensure the factual correctness of everything from product descriptions to vendor details, while accurately capturing and reporting policy violations. Our solution, driven by PyTorch, reduced the herculean effort involved in the manual validation of products, driving product coverage up from 30% to 80% while reducing the audit cycle time from 12 weeks to 2 weeks.
References:
1. EAST: An Efficient and Accurate Scene Text Detector
2. CRNN: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition