Create custom object detection without using the TensorFlow API

Darshil Modi · Published in Analytics Vidhya · 7 min read · Nov 8, 2021

Object detection has been one of the most widely used applications of computer vision. Whether it's Tesla's self-driving mode or just a simple mask detection model, object detection and localisation is the way to go. I was equally amazed when object detection was introduced to me, but when I did a simple Google search, all I could find was the TensorFlow object detection API. Even when I added "without using tensorflow API" to my keywords, it still showed results using the TensorFlow API. This immensely motivated me to create a custom object detection model, removing the dependency on high-level APIs.

TensorFlow's object detection API is pretty cool, but I found it quite clumsy and time-consuming. Its GitHub repository also keeps updating, which doesn't match the available tutorials and blogs, so it becomes difficult for beginners to keep up. In this blog, I am going to explain how to create an accurate object detection model using a CNN without the object detection API. We will, however, still import the Keras and TensorFlow libraries to build the model. So, let's get started!

The Concept

Just try to recall our traditional CNN image classification model: we take an RGB image as input, let's say 200x200, and predict whether it's a cat or a dog. So our input shape becomes (None, 200, 200, 3) and our output shape becomes (None, 1) for the binary label, where None represents the batch size.

from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(200, 200, 3)))  # channels-last, matching (None, 200, 200, 3)
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))              # single sigmoid unit: cat vs dog
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

We are going to use the same concept here. For object detection, we have an RGB image as input and our output is the two corner coordinates (x, y) of the bounding box, so our model will have four outputs in total: xmin, ymin, xmax, ymax. Hence our output shape in this case will be (None, 4).
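Concretely, only the head of the network changes: the classifier above ends in a single sigmoid unit, while a box regressor ends in four linear units. A minimal sketch of the regression head, reusing the same imports and convolutional base as the previous snippet:

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(200, 200, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(4))   # xmin, ymin, xmax, ymax: linear outputs, since these are regression targets

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])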

bounding box coordinates — source: ResearchGate

Data preparation

Let's divide this phase into two parts: X and Y. X will be the images and Y will be the coordinates.

For training, we will have to manually annotate the images using an annotation tool. For this purpose, we will use LabelImg. The process is quite simple and you can follow the steps in its GitHub readme.

Once we have the images and their XML files in the same folder, we will use xml_to_csv.py to convert the XML files into a CSV. You can download this file from here.
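To understand what the script below parses, here is roughly what a Pascal VOC annotation from LabelImg looks like (filename, class name, and values are illustrative):

<annotation>
    <filename>img_001.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>plate</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>120</xmin>
            <ymin>90</ymin>
            <xmax>380</xmax>
            <ymax>260</ymax>
        </bndbox>
    </object>
</annotation>

The positional indexing in the script (member[0] for the class name, member[4][0..3] for the box) relies on this element order.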

import os
import glob
import pandas as pd
import xml.etree.ElementTree as ET

def xml_to_csv(path):
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),   # width
                     int(root.find('size')[1].text),   # height
                     member[0].text,                    # class name
                     int(member[4][0].text),            # xmin
                     int(member[4][1].text),            # ymin
                     int(member[4][2].text),            # xmax
                     int(member[4][3].text))            # ymax
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

def main():
    image_path = os.path.join(os.getcwd(), 'foldername')
    xml_df = xml_to_csv(image_path)
    xml_df.to_csv('filename.csv', index=None)

main()

Don't forget to change "foldername" and "filename.csv" in the above script. (In the snippets below, the exported file is named labels.csv.)

We now have a CSV file listing each image and its corresponding bounding-box coordinates.

Data generator

Now that we have the images, i.e. our X, and the CSV containing coordinates, i.e. our Y, we will generate the data for model training.

Step 1: First we will load the CSV file that we just generated using xml_to_csv.py

req_images = pd.read_csv('labels.csv')   # the CSV exported by xml_to_csv.py
fin_images = req_images.drop_duplicates(subset='filename', keep="last")  # one row per image
fin_images.head()

This shall load our csv and look something like this:

xml to csv converted file
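The dataframe has one row per image; with illustrative values it looks roughly like this:

filename       width  height  class  xmin  ymin  xmax  ymax
img_001.jpg      640     480  plate   120    90   380   260
img_002.jpg      640     480  plate    60   110   300   240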

Step 2: Write code to generate images from the CSV and resize them to 96x96, as this reduces training time and computation. Note that we return two objects: the resized images and a source_images list, which we will use later to match keypoints to images.

import cv2
import numpy as np

def generate_images():
    source_images = fin_images['filename'].to_list()
    images = []

    for i in source_images:
        path = 'train/' + i
        input_image = cv2.imread(path)
        input_image = cv2.resize(input_image, (96, 96))
        images.append(input_image)

    images = np.array(images) / 255.0   # scale to [0, 1], matching the test-time preprocessing
    print('images_shape:', images.shape)
    return images, source_images

model_input_images, source_images = generate_images()

Step 3: Similarly, we will load the bounding box coordinates and rescale them to match our 96x96 resized images.

Since we resized our images to 96x96, the bounding box points change accordingly. We compute the width and height ratios, then divide xmin and xmax by the width ratio and ymin and ymax by the height ratio. For example, a 640x480 image has a width ratio of 640/96 ≈ 6.67, so an xmin of 320 becomes 320/6.67 ≈ 48. This gives us coordinates in the 96x96 frame, which we will use for model training.

fin_images['width_ratio'] = fin_images['width'] / 96
fin_images['height_ratio'] = fin_images['height'] / 96
fin_images['xmin'] = fin_images['xmin'] / fin_images['width_ratio']    # scale x to the 96-pixel frame
fin_images['xmax'] = fin_images['xmax'] / fin_images['width_ratio']
fin_images['ymin'] = fin_images['ymin'] / fin_images['height_ratio']   # scale y to the 96-pixel frame
fin_images['ymax'] = fin_images['ymax'] / fin_images['height_ratio']

To check that we have rescaled correctly, just make sure that every coordinate in the dataframe is at most 96 (obviously!), for instance with the one-line check below.
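A quick sanity check, assuming the dataframe from the previous step:

# every scaled coordinate must fit inside the 96x96 frame
assert (fin_images[['xmin', 'ymin', 'xmax', 'ymax']] <= 96).all().all(), 'some coordinates fall outside the 96x96 frame'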

Step 4: Just like the images, we will generate Y, i.e. the normalised coordinates. Here we validate that we are taking the correct keypoints for each image by matching the filename against the image name in the source_images list.

def generate_keypoints():
    keypoint_features = []

    for i in source_images:
        try:
            mask = fin_images[fin_images['filename'] == i]
            mask = mask.values.tolist()
            keypoints = mask[0][4:8]                      # xmin, ymin, xmax, ymax
            newList = [int(x) / 96 for x in keypoints]    # normalise to [0, 1]
            keypoint_features.append(newList)
        except Exception as e:
            print('error for', i, ':', e)

    keypoint_features = np.array(keypoint_features, dtype=float)
    return keypoint_features

model_input_keypoints = generate_keypoints()

Just to make sure everything has been done correctly so far, load any random X and Y and plot the rectangle on the image, as in the sketch below. If the bounding box appears correctly, you have performed the normalisation correctly. If not, have a cup of coffee and try the above steps again!
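A quick sketch for this check, assuming the arrays generated in the previous steps (the index 0 is arbitrary):

import matplotlib.pyplot as plt

idx = 0   # pick any training example
img = (model_input_images[idx] * 255).astype(np.uint8).copy()       # back to pixel values for drawing
x1, y1, x2, y2 = [int(v * 96) for v in model_input_keypoints[idx]]  # back to the 96x96 frame
cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 0), 2)
plt.imshow(img)
plt.show()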

Testing our generated X and Y by manually plotting the bounding box

Step 5: Finally, we will create the model architecture. You can create any CNN-based sequential model and tweak its complexity and optimiser according to your needs.

from keras.models import Sequential
from keras.layers import Convolution2D, LeakyReLU, BatchNormalization, MaxPool2D, Flatten, Dense, Dropout

input_shape = (96,96,3)
no_of_keypoints = 4 #xmin, ymin, xmax, ymax

model = Sequential()
# Input dimensions: (None, 96, 96, 3)
model.add(Convolution2D(32, (3,3), padding='same', use_bias=False, input_shape=input_shape))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 96, 96, 32)
model.add(Convolution2D(32, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
# Input dimensions: (None, 48, 48, 32)
model.add(Convolution2D(64, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 48, 48, 64)
model.add(Convolution2D(64, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
# Input dimensions: (None, 24, 24, 64)
model.add(Convolution2D(96, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 24, 24, 96)
model.add(Convolution2D(96, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
# Input dimensions: (None, 12, 12, 96)
model.add(Convolution2D(128, (3,3),padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 12, 12, 128)
model.add(Convolution2D(128, (3,3),padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
# Input dimensions: (None, 6, 6, 128)
model.add(Convolution2D(256, (3,3),padding='same',use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 6, 6, 256)
model.add(Convolution2D(256, (3,3),padding='same',use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
# Input dimensions: (None, 3, 3, 256)
model.add(Convolution2D(512, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 3, 3, 512)
model.add(Convolution2D(512, (3,3), padding='same', use_bias=False))
model.add(LeakyReLU(alpha = 0.1))
model.add(BatchNormalization())
# Input dimensions: (None, 3, 3, 512)
model.add(Flatten())
model.add(Dense(512, activation='linear'))
model.add(Dropout(0.3))
model.add(Dense(no_of_keypoints))
model.summary()

Don't forget to keep the last dense layer as:

model.add(Dense(no_of_keypoints)) # 4 in our case

It will be 4 in our case since we have four keypoints as our output: xmin, ymin, xmax, ymax.

Step 6: Once your sequential model is ready, just compile and fit it using the generated X (model_input_images) and Y (model_input_keypoints).

from keras.callbacks import EarlyStopping, ReduceLROnPlateau

earlyStopping = EarlyStopping(monitor='loss', patience=30, mode='min', baseline=None)
rlp = ReduceLROnPlateau(monitor='val_loss', factor=0.7, patience=5, min_lr=1e-15, mode='min', verbose=1)  # optional: add to the callbacks list if desired

model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mse'])
history = model.fit(model_input_images, model_input_keypoints, epochs=500, batch_size=16, validation_split=0.15, callbacks=[earlyStopping])

Step 7: That marks the end of our object detection training. Now it's time to test our efforts! Remember that we have to perform the same pre-processing steps that we used during training.

image_test = "final_test1.jpg"
test_image = cv2.imread(image_test)
test_image = cv2.resize(test_image, (96,96))
cv2.imwrite('final_test.jpg', test_image)

images = np.array(test_image) / 255.0   # same scaling as in training
print(images.shape)
test = np.expand_dims(images, axis=0)   # add the batch dimension: (1, 96, 96, 3)
print(test.shape)

ans = model.predict(test)
ans = ans * 96                          # denormalise back to the 96x96 pixel frame
print(ans)

Step 8: Let's plot the coordinates on the image to visualise our results.

test_image = cv2.imread('final_test.jpg')
test_image = cv2.resize(test_image, (96,96))
xmin, ymin, xmax, ymax = ans[0].astype(int)   # predictions come back with shape (1, 4)
out = cv2.rectangle(test_image, (xmin, ymin), (xmax, ymax), (255,0,0), 2)
plt.imshow(out)
cv2.imwrite('final_test.jpg', out)
object detection model output

The biggest advantage of this pipeline is that you can easily tweak the model and the other parameters as per your requirements. You can also try transfer learning with it, as sketched below. Just make sure to keep 4 neurons in the last dense layer.
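A minimal sketch of the transfer-learning variant, assuming a MobileNetV2 backbone (an arbitrary choice for illustration; any Keras application model works similarly):

from keras.applications import MobileNetV2
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense

base = MobileNetV2(input_shape=(96, 96, 3), include_top=False, weights='imagenet')
base.trainable = False   # freeze the pretrained backbone

model = Sequential()
model.add(base)
model.add(GlobalAveragePooling2D())
model.add(Dense(512, activation='relu'))
model.add(Dense(4))      # keep 4 neurons in the last dense layer: xmin, ymin, xmax, ymax
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mse'])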

Future improvements and scope

  • Figure out how to modify this approach for multiple bounding boxes
  • Add one more output for predicting a confidence score
  • Tweak the data generation pipeline if you are using a labelling tool that exports coordinates in a format other than Pascal VOC

Feel free to write comments in case you face any errors or issues. And if you try this pipeline and get amazing predictions, don't hesitate to flaunt your results in the comments.

For customised computer vision and AI based products and services, visit us at Think In Bytes or take a look at our portfolio!

Darshil Modi
AI enthusiast, trying to catch up with the tide! GitHub: @darshil3011 · LinkedIn: https://www.linkedin.com/in/darshil3011 · Supporting www.thinkinbytes.in