How to moderate images based on text and logo using ML/DL?

Rohan Shah
The Algorithmic Minds
9 min read · Aug 20, 2022

In this blog, we are going to classify images as appropriate or inappropriate for our use case based on the text and logos they contain. Say you own an e-commerce website where merchants upload images of their products. You have to moderate these images and remove the ones that contain social-media logos/IDs, discount watermarks, promotions of the merchant's own website, sponsored content, or any other keyword that is inappropriate for your e-commerce site. For that, we need an automatic moderation system that can flag each image as appropriate or inappropriate. Since an image can contain both inappropriate text and inappropriate logos/image patches, we need both an object detection task and an optical character recognition (OCR) task.

Let’s start with Object Detection from Images.

Detecting social-media logos in images:

We can perform logo detection on images using classical image processing techniques as well as modern deep learning methods.

  1. Image processing techniques : here we have template matching and handcrafted features like HOG, SIFT, ORB, etc., fed into a simple classifier that takes these features as input and classifies whether the image contains a logo or not (or even does multi-class classification). OpenCV is the popular tool with implementations of all these classical methods. These methods generally don't require huge training data and are mostly applied in an unsupervised way (a minimal template-matching sketch appears after this list).
  • Pros : We don't require a labeled/annotated image dataset (which takes a lot of human effort and time).
  • Cons : These techniques are not scalable and will not work with complex backgrounds, different color contrasts, etc. (Occlusion, clutter, shadows, etc. are also not handled by classical techniques, but those rarely come up in our case of identifying social-media logos.)

2. Deep learning models : here we have region-based object detectors like R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN, which are all two-stage detectors (first find regions of interest for object localization, then classify each region: does an object exist, and if so, which category) designed for accuracy, and YOLO and SSD, which are single-stage (proposal-free) object detectors designed for speed and real-time use. These models are supervised in nature, so we require a large amount of data and computing power to train them.

  • Pros : Deep learning models are robust to scaling, backgrounds, different color contrasts, occlusion, clutter, etc.
  • Cons : We require a labeled/annotated image dataset (which takes a lot of human effort and time) and GPUs for training. The annotations (the location of each object in the image plus its class label) are expensive to produce and need to be precise to get the most out of the model. (Thanks to transfer learning, far less data is needed than for training from scratch.)
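Before moving on to YOLO, here is a minimal OpenCV template-matching sketch of the classical approach from item 1 above. The image paths and the 0.8 similarity threshold are illustrative assumptions; plain template matching only works when the logo appears at roughly the template's scale and orientation.

import cv2
import numpy as np

# Classical approach: slide a logo template over the image and measure similarity
img = cv2.imread('product.jpg', cv2.IMREAD_GRAYSCALE)        # hypothetical paths
template = cv2.imread('logo_template.png', cv2.IMREAD_GRAYSCALE)

result = cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
locations = np.where(result >= 0.8)                          # similarity threshold
print('logo found:', len(locations[0]) > 0)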

How YOLO works :

YOLO is an abbreviation for 'You Only Look Once'. It is a single-shot algorithm, meaning it detects objects in a single stage, compared to two-stage algorithms like R-CNN, Fast R-CNN, etc. Previous algorithms repurpose classifiers to perform object detection, while YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Credit : https://arxiv.org/pdf/1506.02640.pdf

Unlike sliding-window and region proposal-based (R-CNN) techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Architecture :

Credit : https://arxiv.org/pdf/1506.02640.pdf

Parameters :

S = the image is divided into an S × S grid (S = 7 in the figure above)

B = number of bounding boxes predicted per cell (B = 2 in the figure above; each box consists of x, y, w, h, and a confidence score)

C = number of classes (C = 20 in the PASCAL VOC dataset, C = 80 in the COCO dataset)

The output is an S × S × (B·5 + C) tensor. On PASCAL VOC, YOLO's prediction is therefore a 7 × 7 × 30 tensor (S = 7, B = 2, C = 20, so 2·5 + 20 = 30).

  • YOLO divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
  • Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Confidence is defined as Pr(Object) ∗ IOU(truth, pred). If no object exists in that cell, the confidence score should be zero; otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
  • Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and any ground-truth box.
  • Each grid cell also predicts one set of C conditional class probabilities, Pr(Class_i | Object), regardless of the number of boxes B.
  • At test time we multiply the conditional class probabilities and the individual box confidence predictions, Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU(truth, pred) = Pr(Class_i) ∗ IOU(truth, pred), which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
Credit : https://arxiv.org/pdf/1506.02640.pdf
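To make the output layout concrete, here is a minimal NumPy sketch of decoding a YOLOv1-style prediction tensor into per-box class scores. The channel layout (B boxes of 5 values each, then C class probabilities) follows the paper; the random tensor is just a stand-in for real network output.

import numpy as np

S, B, C = 7, 2, 20                              # PASCAL VOC settings
pred = np.random.rand(S, S, B * 5 + C)          # stand-in for the network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # each box: x, y, w, h, confidence
class_probs = pred[..., B * 5:]                 # Pr(Class_i | Object), one set per cell

# Class-specific confidence per box: Pr(Class_i | Object) * Pr(Object) * IOU
box_conf = boxes[..., 4:5]                      # (S, S, B, 1)
scores = box_conf * class_probs[:, :, None, :]  # (S, S, B, C)

best_class = scores.argmax(axis=-1)             # (7, 7, 2): most likely class per box
best_score = scores.max(axis=-1)                # (7, 7, 2): its score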

Custom YOLOv5 :

So for our case, we are going to fine-tune YOLOv5 on our custom dataset. First, we need to create the dataset. There are lots of open-source online tools available for this purpose, such as Label Studio, LabelMe, and Make Sense, where you can draw a bounding box around each logo and assign a label.

After creating the dataset for fine-tuning YOLOv5 on our data, we first clone the repository and install all required libraries:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt # install

After that, open the yolov5/data/coco128.yaml file, set the train and validation data paths, change the value of nc to the number of classes and names to the list of class names, and remove the download line if you have the data on your local machine.
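As an illustration, the edited data file might look like the following; the paths and the two class names are placeholders for your own dataset.

# custom_logos.yaml (hypothetical example)
train: ../datasets/logos/images/train  # path to training images
val: ../datasets/logos/images/val      # path to validation images

nc: 2                                  # number of classes
names: ['facebook_logo', 'instagram_logo']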

That's it. Now you only have to train the YOLOv5 model on this data, by running the command below:

# Fine-tune YOLOv5s on our dataset (shown here for 3 epochs)
$ python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
# --img 640: resize images to 640 px
# --batch: vary the batch size based on your resources
# --epochs: start with ~200 and increase/decrease based on model performance
# --data: the YAML file where YOLO will find the dataset paths
# --weights: pretrained weights are recommended and downloaded automatically
#            (yolov5s is the small version of YOLOv5)
# you can add other parameters too, e.g. --device for GPUs, --cache

Now you can see the results in the yolov5/runs folder. To test the model on images, run the command below:

$ python detect.py --weights runs/train/exp4/weights/best.pt --img 640 --conf 0.1 --source ./test --save-txt
# --source: path to the test folder containing test images
# --conf: confidence-score threshold for detected objects

You can also convert your YOLOv5 model into ONNX format for easier use. For that, run the command below:

python export.py --weights runs/train/exp6/weights/best_ckpt.pt --include onnx
# for best_ckpt.pt, check your runs/train/expX folder where X is the highest number

That's it. Now your custom YOLOv5 model is ready for your classes. To use it in your project, just load the best_ckpt.onnx file created by the command above:

import torch

# Model
model = torch.hub.load('ultralytics/yolov5','custom', './best_ckpt.onnx')

# Images
img = 'link/path/list' # or file, Path, PIL, OpenCV, numpy, list

# Inference
results = model(img)

# Results
results.print()
# you can also convert the results into a pandas DataFrame
res = results.pandas().xyxy[0]  # columns: xmin, ymin, xmax, ymax, confidence, class, name
res.head()

Our logo detection model is ready. Now let's move on to OCR.

Extracting Text From Images (OCR):

On the market, we have some very good paid API services for OCR, like Amazon Textract, Microsoft's Cognitive Services, Google Cloud Vision, etc. At the same time, Pytesseract, EasyOCR, PaddleOCR, and Keras-OCR are good open-source libraries that are freely available, and they give results comparable to the paid API services. Many people think OCR is a solved problem, but when it comes to complex backgrounds, different fonts, distortion, noise, low quality, etc., OCR models do not work well. For our purpose, we will still use these models, either with some preprocessing or assuming the input images are of good quality.

Here, we will go through open-source OCR only, with code for EasyOCR to extract text from images. Before that, let's start with how OCR works:

How we extract text from an image

I/P Image Preprocessing :

We are not going into the depths of image processing techniques for improving the quality of the overall image, or of only the patch that contains text (after the text-detection phase). A few standard techniques you can try are grayscale conversion, rescaling, denoising, binarization/thresholding, and deskewing; explore what works for your task, because there is no single technique that works for all data.
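As a starting point, here is a minimal OpenCV sketch of a few of these steps. The specific choices (2× upscaling, non-local-means denoising, Otsu thresholding) are illustrative assumptions to tune per task.

import cv2

def preprocess_for_ocr(path):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)          # drop color information
    gray = cv2.resize(gray, None, fx=2, fy=2,
                      interpolation=cv2.INTER_CUBIC)      # upscale small text
    gray = cv2.fastNlMeansDenoising(gray, h=10)           # reduce noise
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarize
    return binary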

Text Detection :

This stage detects the text in the image and creates a bounding box around each portion of the image that contains text. Object detection techniques (YOLO, R-CNN, etc.) will also work here. The EAST detector (Efficient and Accurate Scene Text detector) is also one of the best detectors for finding the location of text. It can be used with any text recognition method, and an implementation is available through the OpenCV library.

# download the pretrained EAST model
!wget https://raw.githubusercontent.com/sanifalimomin/Text-Detection-Using-OpenCV/main/frozen_east_text_detection.pb

A tutorial on using this model: https://pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector

Text Detection
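As a rough sketch, loading the downloaded model and running its forward pass with OpenCV's DNN module looks like the following. The input image path is a placeholder, and decoding the score/geometry maps into rotated boxes is omitted; the tutorial linked above covers that part.

import cv2

net = cv2.dnn.readNet('frozen_east_text_detection.pb')
image = cv2.imread('sample.jpg')                      # placeholder input image
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),  # W and H must be multiples of 32
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',  # text confidence map
                                'feature_fusion/concat_3'])       # box geometry map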

Text Recognition :

This stage transforms the image (the patch from the text-detection stage) into a string of characters or a sentence. Basically, we have 2 different techniques for recognition: 1) character recognition and 2) word recognition. Character recognition recognizes each individual character in the image, while word recognition combines these characters using language models or lexicons. One popular technique for recognition is CRNN (Convolutional Recurrent Neural Network), a combination of a CNN, an RNN, and CTC (Connectionist Temporal Classification) loss. Refer to the CRNN paper for details.

Text Recognition
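To make the CRNN structure concrete, here is a tiny PyTorch sketch: a CNN extracts a feature map, the width dimension becomes the sequence of time steps for a bidirectional LSTM, and a linear layer outputs per-step class logits for the CTC loss. The layer sizes are illustrative assumptions, not the original CRNN configuration.

import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.cnn = nn.Sequential(                 # visual feature extractor
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                 # halve height, keep width for time steps
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, n_classes + 1)   # +1 for the CTC blank token

    def forward(self, x):                         # x: (N, 1, 32, W) grayscale strip
        f = self.cnn(x)                           # (N, 128, 8, W/2)
        f = f.permute(0, 3, 1, 2).flatten(2)      # (N, W/2, 128*8): width = sequence
        seq, _ = self.rnn(f)                      # (N, W/2, 512)
        return self.fc(seq)                       # per-time-step class logits

ctc_loss = nn.CTCLoss(blank=0)                    # aligns logits with target strings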

EASY OCR code :

First, install EasyOCR using pip (pip install easyocr). Below is the code for EasyOCR.

import cv2
import easyocr
import matplotlib.pyplot as plt

# create the Reader object
r = easyocr.Reader(['en'], gpu=True, detector=True, recognizer=True)

def easy_ocr(img):
    # text detection: returns horizontal boxes and free-form (rotated) boxes
    t = r.detect(img)
    cp = img.copy()
    regions = []
    bboxes = []
    # horizontal boxes: each entry is [xmin, xmax, ymin, ymax]
    for i in range(len(t[0][0])):
        xmin, xmax, ymin, ymax = t[0][0][i]
        regions.append(cp[max(0, ymin):ymax, max(0, xmin):xmax])
        bboxes.append([max(0, ymin), ymax, max(0, xmin), xmax])
    # free-form boxes: each entry is four corner points
    for i in range(len(t[1][0])):
        tl = tuple(int(max(0, x)) for x in t[1][0][i][0])  # top-left corner
        br = tuple(int(max(0, x)) for x in t[1][0][i][2])  # bottom-right corner
        if abs(tl[1] - br[1]) <= 2 or abs(tl[0] - br[0]) <= 2:
            continue  # skip degenerate (near-empty) boxes
        regions.append(cp[min(tl[1], br[1]):max(tl[1], br[1]),
                          min(tl[0], br[0]):max(tl[0], br[0])])
        bboxes.append([min(tl[1], br[1]), max(tl[1], br[1]),
                       min(tl[0], br[0]), max(tl[0], br[0])])
        cv2.rectangle(cp, tl, br, (0, 255, 0), 2)
    plt.imshow(cp)
    # text recognition on each detected region
    text = []
    confidence_score = []
    for reg in regions:
        op = r.recognize(reg)
        text.append(op[0][1])
        confidence_score.append(op[0][2])
    return bboxes, text, confidence_score

Results :

EasyOCR output: some of the words and the phone number are correctly extracted from the images, while some words have mistakes.

YOLO logo detection output: logos are correctly identified with more than 90% probability.

Now, on the output of the OCR, we can use some text matching or a classifier to flag the image as inappropriate or appropriate (e.g., if we identify a phone number, the image is inappropriate). For this, we need a set of blacklisted keywords for our task. YOLO gives us our desired output directly, since we trained it on our custom dataset. Finally, we do an OR operation on the outputs of both models to finalize the image as appropriate or inappropriate.
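As an illustration, the final decision could be a simple OR over the two signals. The blacklist, the naive phone-number pattern, and the detection format below are all hypothetical stand-ins for your own setup.

import re

BLACKLIST = {'discount', 'sale', 'follow us', 'instagram', 'facebook', 'www'}
PHONE_RE = re.compile(r'\b\d{10}\b')     # naive 10-digit phone-number pattern

def is_inappropriate(ocr_texts, yolo_detections, conf_thresh=0.5):
    joined = ' '.join(ocr_texts).lower()
    text_flag = any(kw in joined for kw in BLACKLIST) or bool(PHONE_RE.search(joined))
    logo_flag = any(d['confidence'] > conf_thresh for d in yolo_detections)
    return text_flag or logo_flag        # OR of the two models' signals

# hypothetical outputs from the two models:
print(is_inappropriate(['Call 9876543210 for discount'],
                       [{'confidence': 0.93, 'name': 'instagram_logo'}]))  # True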

Summary :

The techniques/methods above are one way to detect objects and text in images, and they worked pretty well for me.

Any suggestions are most welcome. Please share your thoughts in the comments section.
