Detection and Recommendation — Where Catalogue Meets Real World

Using unannotated images to build a detection network

Published in Analytics Vidhya, Jun 30, 2021

Image recognition has advanced rapidly over the last 10 years, mainly because of the evolution of convolutional neural networks (CNNs) and their application to real-world image recognition and localization tasks. Years of work by researchers and scientists have pushed these systems toward near-human performance in terms of:

Identifying and localizing objects in images by training innovative and complex neural network architectures
Providing real-time solutions (low inference time) without compromising on accuracy
Consistency across different environments and image properties

Developers have identified different approaches and AI/ML techniques to solve different image recognition scenarios. Our use case was object detection followed by image recommendation, which can serve as a standalone component for any computer-vision-based recommendation system. Below is the process flow of such an image recommendation system:

Process flow of detection based image recommendation

An image catalogue of ~180k images was chosen for recommendation, and a total of 22 apparel categories were identified for training. Below is the distribution of catalogue images:

YOLOv5 for Detection

YOLO performs single-stage detection and provides a state-of-the-art solution for object detection with real-time prediction at inference. More details of YOLOv5 can be found here. If you want to learn more about the evolution of YOLO, you can read this.

YOLO Detection with Catalogue Images:

Catalogue images were captured in a constrained environment, and each image was cropped so that it contains only the apparel object. A model trained only on such images would struggle with the variability of real-world images. The two techniques below helped to overcome this issue:

Each whole catalogue image was treated as the object: a bounding box was drawn around the entire image, leaving a minimal gap from the edges.
A random amount of padding was added on each side of the image. For each side, the padding was drawn at random between 10% and 50% of the image dimension; based on a randomly generated flag, the upper limit was extended by an additional 2 times the dimension. The code snippet for random padding is added here:

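import random
import cv2 as cv

# input_path, input_img_path and img_name are assumed to be defined elsewhere
img = cv.imread(input_path+img_name)
y,x,_ = img.shape                       # original image height and width
gap = 5                                 # margin between the box and the image edge
rm = round(random.random())             # random 0/1 flag: allows up to 2x extra padding
padding_top = random.randint(int(0.1*y),int(0.5*y)+rm*2*y)
padding_bottom = random.randint(int(0.1*y),int(0.5*y)+rm*2*y)
padding_right = random.randint(int(0.1*x),int(0.5*x)+rm*2*x)
padding_left = random.randint(int(0.1*x),int(0.5*x)+rm*2*x)
# copyMakeBorder expects (src, top, bottom, left, right, borderType)
image = cv.copyMakeBorder(img, padding_top, padding_bottom, padding_left, padding_right, cv.BORDER_CONSTANT)
height,width,_ = image.shape
w,h = x-2*gap,y-2*gap                   # box width and height in pixels
x,y = padding_left+gap,padding_top+gap  # top-left corner of the box in the padded image
x,y = int(x + w/2), int(y+h/2)          # centre of the box
x,y,w,h = x/width, y/height, w/width,h/height   # normalize to (0,1)
cv.imwrite(input_img_path+img_name,image)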

After padding, the objects appear in different parts of the image and with different shapes. The padded image is then resized. The x,y,w,h coordinates are also saved so that they can be passed as labels to the object detection network.

Prepare YOLO Format Dataset:

--data
        --images
            -- train
                - image1.jpg
                - image2.jpg
            -- valid
                - image11.jpg
                - image22.jpg
        --labels
            -- train
                - image1.txt
                - image2.txt
            -- valid
                - image11.txt
                - image22.txt
           
Example: for any file present in /images/train, the filename should be the same in /labels/train, but the extension will be '.txt'.

What is in the .txt file:

YOLO expects a .txt file for each image saved in /images/train or /images/valid, containing the necessary information (l,x,y,w,h)
    l : label of that image
    x,y : co-ordinates of the centre pixel of the object
    w,h : width and height of the object

Note: All the values (x,y,w,h) are normalized to (0,1)
    x,w = x/W,w/W &
    y,h = y/H,h/H
    where W,H are the width & height of the original image respectively
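The label format described above can be sketched in a few lines of Python. This is only an illustration: the helper names `to_yolo_label` and `from_yolo_label` and the example numbers are hypothetical, and the box is assumed to be given in pixels as a top-left corner plus width and height.

```python
def to_yolo_label(label, left, top, box_w, box_h, img_w, img_h):
    """Convert a pixel box (top-left corner + size) to a YOLO label line."""
    x_c = (left + box_w / 2) / img_w   # normalized x of the box centre
    y_c = (top + box_h / 2) / img_h    # normalized y of the box centre
    w = box_w / img_w                  # normalized box width
    h = box_h / img_h                  # normalized box height
    return f"{label} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


def from_yolo_label(line, img_w, img_h):
    """Recover (label, left, top, width, height) in pixels from a label line."""
    label, x_c, y_c, w, h = line.split()
    box_w, box_h = float(w) * img_w, float(h) * img_h
    left = float(x_c) * img_w - box_w / 2
    top = float(y_c) * img_h - box_h / 2
    return int(label), round(left), round(top), round(box_w), round(box_h)


# Hypothetical example: class 3, a 200x300 box at (100, 50) in a 400x400 image
line = to_yolo_label(3, left=100, top=50, box_w=200, box_h=300, img_w=400, img_h=400)
print(line)
print(from_yolo_label(line, img_w=400, img_h=400))
```

Converting a label back to pixels in this way is a cheap check that the saved x,y,w,h values really describe the intended box before training starts.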

YOLO was trained with minor configuration changes to adjust the augmentation parameters for rotation, perspective transformation, etc. Images were scaled to 320x320 pixels, and the YOLOv5 medium model was trained for 40 epochs. The large model was not used due to hardware constraints.

Challenge with Detection using Catalogue Images:

Random padding helps the model learn, to some extent, the variability of real images in terms of object position and size. However, the resulting detection model still struggled to place bounding boxes correctly on real-world images. The main reason is that the catalogue images lack the variation in background and pixel-level properties that arises when pictures are taken in different environments.

Fine Tune YOLO:

We collected ~5k samples from Flickr, Reddit and other popular websites and tagged the images using makesense.ai, which lets you download the annotations in a YOLO-consumable format. The main idea behind collecting images manually was to increase the variability of the training images in terms of:

variation in number of objects the model expects in one image
variation in background in the training images
variation in size, shape and position for the apparel objects
variation in different types of objects for a single class

The YOLOv5 model was then fine-tuned using these real-world images. A large jump in mAP (to 0.78, compared with only 0.42 previously) was observed after fine-tuning.

A few examples from the detection network are shown below:
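For context, mAP is built on intersection over union (IoU): a predicted box counts as a match only if its IoU with a ground-truth box exceeds a threshold. The sketch below computes IoU for two boxes in the normalized (x, y, w, h) centre format used in the label files. The function name `iou` and the sample boxes are hypothetical, and the article does not state which IoU threshold its mAP figures use.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    # Convert centre format to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is zero when the boxes do not intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes, shifted by 0.1 along x
ground_truth = (0.5, 0.5, 0.4, 0.4)
prediction = (0.6, 0.5, 0.4, 0.4)
print(round(iou(ground_truth, prediction), 3))
```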

YOLO Detection with Confidence Score of 0.4

A few tricks that helped boost the performance further:

Train at a smaller image size if you have low-resolution images. During inference, a relatively larger image size can be used if high-resolution images are available.
Inference with augmentation comes at the cost of around a 30–50% increase in inference time. However, recognition ability improves significantly when test-time augmentation is activated. For example, in our application detection without augmentation takes around 80–100 ms, whereas detection with test-time augmentation takes approximately 110–150 ms, with significantly better accuracy.
Do not change hyperparameters unless necessary, since the YOLOv5 defaults are the result of years of tuning across many scenarios.
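As a quick sanity check on the test-time augmentation timings quoted above, the relative overhead can be worked out directly. This sketch assumes that the lower and upper ends of the two timing ranges correspond to each other, which the text does not state; `relative_overhead` is a hypothetical helper name.

```python
def relative_overhead(base_ms, tta_ms):
    """Percentage increase in inference time when augmentation is enabled."""
    return (tta_ms - base_ms) / base_ms * 100


# Timings quoted in the text: 80-100 ms without TTA, 110-150 ms with TTA
low = relative_overhead(80, 110)
high = relative_overhead(100, 150)
print(f"TTA overhead: {low:.1f}% to {high:.1f}%")
```

Under that pairing the overhead falls between 37.5% and 50%, which is in line with the "around 30–50%" figure.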

Recommendation for Detected Objects

So far, we have built and validated a detection network that predicts bounding boxes and the apparel category. The next task is to create a recommendation network that takes us back to the catalogue images and recommends items based on the detected objects. A few assumptions here include:

Recommendation should be fast.
Recommendation should take gender information into account.
Recommendation should allow customization for color preference, season preference, size preference, etc.

In this use case, object detection was our primary focus: we wanted to validate whether cropped catalogue images can serve as a starting point for building an efficient AI application that also works on real-world images. To build the recommender system, we did not use YOLOv5 as a feature descriptor. Instead, we experimented with a standalone CNN-based classification model as a separate component for image recommendation. We used the VGG16 network to build two separate classifiers. The roles of these two models are:

VGG for Gender Classification: A separate model was trained in parallel on the catalogue images for gender classification. An accuracy of 92% was observed on catalogue images.
VGG for Category Classification: The catalogue images were also used to train a category classifier. The purpose of this model is to serve as a feature descriptor. We used a standalone model for category classification so that we can replace the model or change the size of the last feature embedding layer according to our modelling choice or hardware configuration. An overall accuracy of 84% was observed for apparel category classification using VGG16. For this use case, we extracted a 2048-dimensional feature vector for measuring the distance between images.

The code snippet for such a network is given here:

import torch.nn as nn
from torchvision import models

class BotNet(nn.Module):
    def __init__(self,n_classes=None):
        super(BotNet, self).__init__()
        # pretrained VGG16 backbone, frozen so that only the new layers are trained
        self.model = models.vgg16(pretrained=True)
        self.n_classes = n_classes

        for param in self.model.parameters():
            param.requires_grad = False

        # 25088 = 512*7*7, the flattened VGG16 feature map for a 224x224 input
        self.fc1 = nn.Linear(25088,2048)
        self.fc2 = nn.Linear(2048,256)
        self.fc3 = nn.Linear(256,self.n_classes)
        self.relu = nn.ReLU()
        self.logSoftMax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.model.features(x)
        x = x.view(x.shape[0], -1)
        y = self.fc1(x)
        y = self.relu(y)
        x = self.fc2(y)
        x = self.relu(x)
        x = self.fc3(x)
        out = self.logSoftMax(x)
        # y: 2048-dimensional feature vector, out: class log-probabilities
        return y,out

model = BotNet(n_classes=22)
# print(model)

Once the category classification model was trained, all 180k catalogue images were passed through it and the 2048-dimensional feature vector was saved for each image. The feature vectors were saved in one group per gender and category combination, so that a search for the closest-ranked images only needs to look at the relevant group of features.

import time
import cv2
import torch
import albumentations as A
from albumentations.pytorch import ToTensorV2

# model, data, identifiers, device and working_dir are assumed to be defined elsewhere
start=time.time()
image_size = 224
transform = A.Compose([
    A.Resize(image_size,image_size),
    A.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225]),
    ToTensorV2()
])
image_path = '/images/'
#identifiers is a list of strings containing gender_category information
#data has image_id,gender,category information and is saved as a pandas dataframe
for idnt in identifiers:
    df_filt = data[data.identifier==idnt]
    dict_to_save = {}
    all_feats = []
    all_images = []
    for i,row in df_filt.iterrows():
        img_name = row['image']
        gender = row['gender']
        label = row['final_category']
        try:
            image = cv2.imread(image_path+img_name)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            x,y,_ = image.shape
            if (x>90) and (y>90):  # skip very small images
                # Resize and normalize the image
                transformed = transform(image=image)
                transformed_image = transformed["image"]
                transformed_image = transformed_image.unsqueeze(0)  # add batch dimension
                transformed_image = transformed_image.cuda()
                with torch.no_grad():  # no gradients needed for feature extraction
                    feats,_ = model(transformed_image)
                all_feats.append(feats.to(device))
                all_images.append(img_name)
        except Exception as e:
            # unreadable or corrupt images are skipped
            print('Skipping', img_name, e)
    dict_to_save['feats'] = all_feats
    dict_to_save['images'] = all_images
    print(idnt,len(all_images))
    torch.save(dict_to_save, working_dir+idnt+'.pt')
end = time.time()
print('Time Taken: ', str(end-start))

Recommendation then takes place based on the predicted gender and predicted category, by looking up the corresponding gender_category feature set.

from pathlib import Path
import cv2
import numpy as np
import torch

# feats_dict, transform, model_gender, model_category and gender_mapper
# are assumed to be defined elsewhere

#define euclidean distance and find n closest matches
def euclidean(p1,p2):
    p1,p2 = p1.cuda(),p2.cuda()
    dist =torch.dist(p1,p2)
    return dist

#find index corresponding to the closest ranked images
def find_n_closest(tag,feats,n=5):
    feats = feats.cuda()
    targets = feats_dict[tag]['feats']
    n_dist = [euclidean(feats,t.cuda()).detach().cpu().numpy() for t in targets]
    # order the n+1 smallest distances, then skip the closest one (index 0)
    match_indices = np.argpartition(n_dist,range(n+1))[1:1+n]
    match_images = np.array(feats_dict[tag]['images'])[match_indices]
    return match_images

#recommend based on predicted gender and category
def get_recommendation(img,input_data_path=None):
    match_images=None
    image = cv2.imread(input_data_path+img)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Resize and normalize the image
    transformed = transform(image=image)
    transformed_image = transformed["image"]
    transformed_image = transformed_image.unsqueeze(0)  # add batch dimension
    transformed_image = transformed_image.cuda()

    with torch.no_grad():
        gender_output = model_gender(transformed_image).cpu()
        feats,_ = model_category(transformed_image)
    _, gender_pred = torch.max(gender_output, 1)
    gender = gender_mapper[gender_pred.cpu().numpy()[0]]

    #_, cat_pred = torch.max(category_output, 1)
    #category = category_mapper[cat_pred.cpu().numpy()[0]]
    # the category is read from the file name of the detected crop
    category = ' '.join(Path(img).stem.split('_')[-1].split(' ')[:-1])
    if category in ['skirt', 'dress']:
        gender = 'women'
    if category in ['tie']:
        gender = 'men'
    refer_tag = gender+'_'+category
    match_images = find_n_closest(tag=refer_tag,feats=feats)

    return img,refer_tag,match_images

Restricting the search to one gender_category group greatly reduces the response time during inference. However, it comes at the cost of incorrect recommendations whenever either the apparel category or the gender is misclassified. Misclassification occurs mainly because of the limited capability of the model and the difference in background between catalogue images and real-world images.
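The group lookup described above can be illustrated without a GPU. The sketch below is a simplified, pure-Python stand-in and not the implementation used in this project: the `catalogue` dictionary, the file names and the 2-dimensional vectors are all hypothetical (the real feature vectors are 2048-dimensional), and `heapq.nsmallest` replaces the partition-based ranking.

```python
import heapq
import math


def find_n_closest(query, group, n=5):
    """Return the names of the n images in `group` closest to `query`.

    `group` is a list of (image_name, feature_vector) pairs that all belong
    to the same gender_category tag.
    """
    scored = ((math.dist(query, feats), name) for name, feats in group)
    return [name for _, name in heapq.nsmallest(n, scored)]


# Hypothetical feature store, keyed by gender_category tag
catalogue = {
    "women_dress": [
        ("d1.jpg", [0.0, 0.0]),
        ("d2.jpg", [1.0, 1.0]),
        ("d3.jpg", [5.0, 5.0]),
    ],
    "men_shirt": [("s1.jpg", [0.1, 0.1])],
}

# Only the "women_dress" group is searched, so "s1.jpg" is never compared
matches = find_n_closest([0.9, 0.8], catalogue["women_dress"], n=2)
print(matches)
```

The search cost scales with the size of one group rather than with the whole catalogue, which is where the saving in response time comes from. It is also why a wrong gender or category prediction sends the query to the wrong group entirely.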

Scope of Improvement:

The scope for improvement is broken down into two main components, as described below:

Model Performance Improvement:

Improve recommendation accuracy: The current solution generalizes poorly because the gender and category classification networks are trained only on catalogue images. Fine-tuning these classification models with real-world images can reduce the generalization error.
Include more classes such as age category, season, etc. to provide more customization and user-specific recommendations
Improve gender classification accuracy by changing the classification model architecture
Improve gender classification by including class information in the detection network
Use YOLOv5 as a feature descriptor

Application Performance Improvement:

Response time minimization: About 90% of the processing time is observed to be I/O bound. The current solution takes ~5 sec to provide detection or recommendation output. The response time could be reduced by up to 90%, enabling fast, real-time recommendations.
User preferences for recommendation (e.g. clickable bounding boxes, extended recommendations for specific images)
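The article does not say which I/O operations dominate the response time. If repeatedly reading the saved feature groups from disk is part of that cost (an assumption), one possible mitigation is to keep each group in memory after its first load. The sketch below shows the idea with `functools.lru_cache`; `load_group` and `LOAD_CALLS` are hypothetical names, and the function body is a placeholder for the real disk read.

```python
from functools import lru_cache

LOAD_CALLS = []  # records every simulated disk read, for illustration only


@lru_cache(maxsize=None)
def load_group(tag):
    """Load the feature group for one gender_category tag, at most once."""
    LOAD_CALLS.append(tag)
    # Placeholder for the real read of the saved feature file for this tag
    return {"feats": [], "images": []}


load_group("women_dress")
load_group("women_dress")  # served from the cache, no second disk read
load_group("men_shirt")
print(LOAD_CALLS)
```

Whether this helps in practice depends on how much memory the feature groups need and on where the I/O time is actually spent, so it is worth profiling first.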

Conclusion:

The experiment showed the efficacy of YOLO for state-of-the-art object detection. With proper architecture and code design, the application can be used for real-time recommendation. Such an application can be integrated with any chatbot or webcam-based application.

References: