Running a PyTorch-based object detection model on an Edge device, with pre- and post-processing steps

Madhur Zanwar
Published in Eumentis
9 min read · Mar 1, 2024
Running an object detection model on mobile devices.

This article is the third in a series of four articles on object detection on Edge devices. The first two articles can be found here:

Need to add link

  1. Building an object detection pipeline for edge inference.
  2. Porting PyTorch based object detection model to mobile optimized format.

In the previous article, we converted our object detection model, trained on the web, into the .ptl format, a mobile-optimized version. In this post, we will guide you through the steps to run that model on a mobile device.

In addition to loading the model and running inference, we'll also cover how to replicate the pre- and post-processing steps we implemented on the web. The pre-processing steps involve cropping and overlaying the image. Below is a step-by-step guide to help you navigate through these processes.
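Loading the .ptl model itself is a one-liner with react-native-pytorch-core's lite-interpreter API. Here's a minimal sketch; the helper name and modelPath are illustrative, not from the original article:

import { torch } from 'react-native-pytorch-core';

// modelPath is a hypothetical local path to the .ptl file produced in the previous article
export async function loadModel(modelPath) {
  // load the lite-interpreter model for on-device inference; the returned
  // module exposes the forward() method used later in this article
  return await torch.jit._loadForMobile(modelPath);
}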

To begin, we utilize the functions provided by react-native-pytorch-core to load the image and retrieve its height and width.

import { ImageUtil } from 'react-native-pytorch-core';
import RNPhotoManipulator from 'react-native-photo-manipulator';

export default async function detectObjects(image, path) {
  // load the image from its file path
  const fly_image = await ImageUtil.fromFile(path);
  // get the image width
  let imageWidth = fly_image.getWidth();
  // get the image height
  let imageHeight = fly_image.getHeight();
  // calculate the image aspect ratio
  let img_aspect_ratio = imageWidth / imageHeight;

Once we have obtained the height and width, we perform operations to ensure that both values are even. If they are not, we use a cropping function provided by react-native-photo-manipulator to make them even.

// will hold the path of the cropped image, if cropping happens
let path_saved;
if (imageWidth % 2 !== 0 && imageHeight % 2 !== 0) {
  // crop the last pixel column and the last pixel row to make both width and height even
  const cropRegion = {
    // starting x coordinate
    x: 0,
    // starting y coordinate
    y: 0,
    // crop height
    height: imageHeight - 1,
    // crop width
    width: imageWidth - 1,
  };
  // photo manipulator returns the path of the cropped image
  path_saved = await RNPhotoManipulator.crop(path, cropRegion);
} else if (imageWidth % 2 !== 0 && imageHeight % 2 === 0) {
  // crop the last pixel column to make the width even
  const cropRegion = {
    x: 0,
    y: 0,
    height: imageHeight,
    width: imageWidth - 1,
  };
  path_saved = await RNPhotoManipulator.crop(path, cropRegion);
} else if (imageHeight % 2 !== 0 && imageWidth % 2 === 0) {
  // crop the last pixel row to make the height even
  const cropRegion = {
    x: 0,
    y: 0,
    height: imageHeight - 1,
    width: imageWidth,
  };
  path_saved = await RNPhotoManipulator.crop(path, cropRegion);
}
// if both height and width are already even, no cropping is needed

// If the image was cropped, its height and/or width have changed, so we re-read them
if (path_saved) {
  const new_dimension_image = await ImageUtil.fromFile(path_saved);
  // updated height and width
  imageHeight = new_dimension_image.getHeight();
  imageWidth = new_dimension_image.getWidth();
}
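A side note, grounded in what we do later for the cropped tiles: images loaded through ImageUtil hold native memory until they are released, so once the dimensions have been read the handle can be freed (the same applies to the image loaded inside the if block above). A minimal sketch:

// release the native image handle once its dimensions have been read
await ImageUtil.release(fly_image);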

Since the object to be detected is considerably smaller than the image, we opted for a tiling approach to enhance accuracy: the image is divided into tiles of either 640x640 or 960x960 pixels. For the tiles to be cropped accurately, the height and width must be multiples of the chosen tile size, so we perform operations that adjust the image size to be a multiple of the tile size.

// calculate the new width and height of the image, each rounded up to a multiple of tile_size for proper tiling
let new_imageWidth = tile_size * Math.ceil(imageWidth / tile_size);
let new_imageHeight = tile_size * Math.ceil(imageHeight / tile_size);

// set height and width equal to each other to get a square padded image
const square_side = Math.max(new_imageHeight, new_imageWidth);
new_imageWidth = square_side;
new_imageHeight = square_side;

To create an image with the newly calculated height and width, we generate a blank white image of the specified size and then overlay the original image onto it. This process ensures that the resulting image matches the new height and new width, enabling proper tiling operations.

// padding offsets that centre the original image on the square canvas
const pd_w = Math.floor((new_imageWidth - imageWidth) / 2);
const pd_h = Math.floor((new_imageHeight - imageHeight) / 2);
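To make the arithmetic concrete with an illustrative example (the numbers are not from the article): a 3000x4000 image with a 960-pixel tile rounds up to 3840x4800, which squares to 4800x4800, giving padding offsets pd_w = (4800 - 3000) / 2 = 900 and pd_h = (4800 - 4000) / 2 = 400.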

In our initial trials, generating a blank white image from scratch for every input caused memory spikes. So instead of creating one each time, we bundled a white image of the maximum possible size with the app. Whenever the calculated new height and width are smaller, we simply crop the bundled white image to the desired dimensions, and the cropped result serves as the canvas for placing the original image.

// new_imageHeight and new_imageWidth are the calculated dimensions of the squared image.
// The bundled white image is 9600 x 9600, the highest resolution we support. If either
// dimension is smaller than 9600, we crop the bundled white image down to size; if both
// are exactly 9600, we use the bundled image as-is.
if (new_imageHeight !== 9600 || new_imageWidth !== 9600) {
  const cropRegion = {
    x: 0,
    y: 0,
    height: new_imageHeight,
    width: new_imageWidth,
  };
  // crop returns the path of the cropped image
  blank_image_absolute = (
    await RNPhotoManipulator.crop(blank_image_absolute, cropRegion)
  ).split('://')[1];
  blank_image_absolute = 'file://' + blank_image_absolute;
}

After preparing the squared white image, we position our original image on top of it. This step allows us to seamlessly execute the tiling process. Consider this scenario: if the original image had a height or width that wasn’t even or wasn’t a multiple of our tile size, the last tile, whether in the row or column direction, would be incomplete.

// we perform the overlay operation with the help of react-native-photo-manipulator
let overlay_path;
let overlayed_image_path;
if (path_saved) {
  overlay_path = await RNPhotoManipulator.overlayImage(
    blank_image_absolute,
    path_saved,
    {x: pd_w, y: pd_h},
  );
  overlayed_image_path = overlay_path.split('://')[1];
} else {
  const original_image_path = 'file://' + path;
  overlay_path = await RNPhotoManipulator.overlayImage(
    blank_image_absolute,
    original_image_path,
    {x: pd_w, y: pd_h},
  );
  overlayed_image_path = overlay_path.split('://')[1];
}

Now we are ready to carry out the tiling process. In an initial effort to reduce processing time, we implemented a strategy based on the aspect ratio: we discard the first and last row (or column) of tiles, reasoning that our dataset typically had objects centered within the image, making it unlikely to find them near the borders. This reduced the number of tiles from 49 (a 7 x 7 grid) to 35, since dropping two of the seven rows or columns removes 14 tiles.
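The loops below also rely on a few values that are never defined in the article's fragments: tile_size and the imageScaleX / imageScaleY factors that map the model's 640 x 640 input back to tile coordinates. A minimal sketch of how they might be set, assuming the 960-pixel tiles and 640-pixel model input described later in the preprocessing section:

// we used 960 x 960 tiles (640 x 640 was the other option considered)
const tile_size = 960;
// the model input is 640 x 640, so predictions must be scaled back up to tile coordinates
const imageScaleX = tile_size / 640; // 1.5
const imageScaleY = tile_size / 640; // 1.5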

// finalresultBoxes accumulates the detections from every tile
const finalresultBoxes = [];
let answers;
if (img_aspect_ratio >= 1) {
  // rejection: the first and last row of tiles are skipped
  for (let i = tile_size; i < new_imageHeight - tile_size; i = i + tile_size) {
    for (let j = 0; j < new_imageWidth; j = j + tile_size) {
      answers = await preprocessing(
        j,
        i,
        tile_size,
        overlay_path,
        pd_w,
        pd_h,
        model,
        imageScaleX,
        imageScaleY,
      );
      for (let m = 0; m < answers.length; m++) {
        finalresultBoxes.push(answers[m]);
      }
    }
  }
} else {
  // rejection: the first and last column of tiles are skipped
  for (let i = 0; i < new_imageHeight; i = i + tile_size) {
    for (let j = tile_size; j < new_imageWidth - tile_size; j = j + tile_size) {
      answers = await preprocessing(
        j,
        i,
        tile_size,
        overlay_path,
        pd_w,
        pd_h,
        model,
        imageScaleX,
        imageScaleY,
      );
      for (let m = 0; m < answers.length; m++) {
        finalresultBoxes.push(answers[m]);
      }
    }
  }
}
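For completeness, once both loop variants finish, detectObjects can hand back the accumulated detections and close (a sketch; the exact return shape is up to your app):

// all detections that survived non-max suppression, in original-image coordinates
return finalresultBoxes;
}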

We’ll now take a deeper look into the preprocessing function. Here, we load an image, crop it, convert it to a tensor, apply specific manipulations to the tensor, and ultimately feed it to the model for the inference task.

import { media, torch, torchvision } from 'react-native-pytorch-core';

// shorthand for the torchvision transforms namespace
const T = torchvision.transforms;

async function preprocessing(
  j,
  i,
  tile_size,
  padded_image_path,
  pd_w,
  pd_h,
  model,
  imageScaleX,
  imageScaleY,
) {
  const resultBoxes = [];
  // coordinates of the tile to crop out of the padded image
  const cropRegion = {x: j, y: i, height: tile_size, width: tile_size};
  // react-native-photo-manipulator returns the path of the cropped tile
  let cropped_path = (
    await RNPhotoManipulator.crop(padded_image_path, cropRegion)
  ).split('://')[1];
  // load the cropped tile
  const cropped_image = await ImageUtil.fromFile(cropped_path);
  // convert it to a blob
  const blob = media.toBlob(cropped_image);
  // and from the blob, build a tensor of shape [height, width, channels]
  let new_tensor = torch.fromBlob(blob, [
    cropped_image.getHeight(),
    cropped_image.getWidth(),
    3,
  ]);
  // apply transformations on the tensor: HWC -> CHW, scale to [0, 1], resize to the model input size
  new_tensor = new_tensor.permute([2, 0, 1]);
  new_tensor = new_tensor.div(255);
  const resize = T.resize([640, 640]);
  new_tensor = resize(new_tensor);
  // add a batch dimension
  const formattedInputTensor = new_tensor.unsqueeze(0);

  // feed the tensor to the model for inference
  const output = (await model.forward(formattedInputTensor))[0];

  // release the cropped image to free native memory
  await ImageUtil.release(cropped_image);

  // The bounding box coordinates returned by the model are relative to the tile,
  // so we shift them back into original-image coordinates via the padding offsets.
  const startX = j - pd_w;
  const startY = i - pd_h;

  // run the non-max-suppression algorithm on the raw model output
  const results = outputsToNMSPredictions(
    output,
    startX,
    startY,
    imageScaleX,
    imageScaleY,
  );

  // store the surviving detections in the desired form
  for (const result of results) {
    const match = {
      objectClass: result.classIndex,
      bounds: result.bounds,
      score: result.score,
    };
    resultBoxes.push(match);
  }
  return resultBoxes;
}

Post-processing

Here, we run the non-max-suppression algorithm, using intersection over union (IoU) to weed out overlapping predictions at a defined IoU threshold.

function outputsToNMSPredictions(
  prediction,
  startX,
  startY,
  imageScaleX,
  imageScaleY,
) {
  // the confidence threshold above which we keep a prediction
  const predictionThreshold = 0.01;
  // the IoU threshold used by non-max suppression
  const iOUThreshold = 0.45;
  // maximum number of boxes to keep after non-max suppression
  // (the value is an assumption; the article does not specify it)
  const nMSLimit = 15;
  const results = [];
  // the model output has shape [4 + number of classes, number of predictions]
  const rows = prediction.shape[0];
  const columns = prediction.shape[1];
  const numberOfClass = rows - 4; // 1 in our case, since we have a single class

  for (let n = 0; n < columns; n++) {
    // confidence score of this prediction
    const score = prediction[4][n].data()[0];
    // process only predictions whose confidence is above our threshold
    if (score > predictionThreshold) {
      // centre coordinates and size of the bounding box
      const x = prediction[0][n].data()[0];
      const y = prediction[1][n].data()[0];
      const w = prediction[2][n].data()[0];
      const h = prediction[3][n].data()[0];

      // Our tile size was 960 x 960, but we resized tiles to 640 x 640 before passing
      // them to the model, so the bounding box coordinates must be scaled back
      // to the 960 x 960 tile size.
      const left = imageScaleX * (x - w / 2);
      const top = imageScaleY * (y - h / 2);

      // bounding box coordinates w.r.t. our original image
      const bound = [
        startX + left,
        startY + top,
        w * imageScaleX,
        h * imageScaleY,
      ];

      // construct the result and add it to the results array
      const result = {
        classIndex: 0, // we only have a single class
        score: score,
        bounds: bound,
      };
      results.push(result);
    }
  }
  return nonMaxSuppression(results, nMSLimit, iOUThreshold);
}

The non-max-suppression algorithm is implemented below:

function nonMaxSuppression(boxes, limit, threshold) {
  // argsort the boxes by confidence score, from high to low
  const newBoxes = boxes.sort((a, b) => {
    return b.score - a.score;
  });
  const selected = [];
  const active = new Array(newBoxes.length).fill(true);
  let numActive = active.length;
  // The algorithm is simple: start with the box that has the highest score.
  // Remove any remaining boxes that overlap it more than the given threshold
  // amount. If there are any boxes left (i.e. these did not overlap with any
  // previous boxes), then repeat this procedure, until no more boxes remain
  // or the limit has been reached.
  let done = false;
  for (let i = 0; i < newBoxes.length && !done; i++) {
    if (active[i]) {
      const boxA = newBoxes[i];
      selected.push(boxA);
      // stop once we have kept the maximum allowed number of boxes
      if (selected.length >= limit) {
        break;
      }
      for (let j = i + 1; j < newBoxes.length; j++) {
        if (active[j]) {
          const boxB = newBoxes[j];
          // suppress boxB if it overlaps boxA beyond the IoU threshold
          if (IOU(boxA.bounds, boxB.bounds) > threshold) {
            active[j] = false;
            numActive -= 1;
            if (numActive <= 0) {
              done = true;
              break;
            }
          }
        }
      }
    }
  }
  return selected;
}
// this function calculates the intersection over union (IoU) of two bounding boxes,
// each given as [x, y, width, height]
function IOU(a, b) {
  const areaA = a[2] * a[3];
  if (areaA <= 0.0) return 0.0;
  const areaA_bottom_right_x_coordinate = a[0] + a[2];
  const areaA_bottom_right_y_coordinate = a[1] + a[3];
  const areaB = b[2] * b[3];
  if (areaB <= 0.0) return 0.0;
  const areaB_bottom_right_x_coordinate = b[0] + b[2];
  const areaB_bottom_right_y_coordinate = b[1] + b[3];
  // top-left corner of the intersection rectangle
  const intersectionLeftX = Math.max(a[0], b[0]);
  const intersectionLeftY = Math.max(a[1], b[1]);
  // bottom-right corner of the intersection rectangle
  const intersectionBottomX = Math.min(
    areaA_bottom_right_x_coordinate,
    areaB_bottom_right_x_coordinate,
  );
  const intersectionBottomY = Math.min(
    areaA_bottom_right_y_coordinate,
    areaB_bottom_right_y_coordinate,
  );
  // clamp to zero when the boxes do not overlap
  const intersection_width = Math.max(intersectionBottomX - intersectionLeftX, 0);
  const intersection_height = Math.max(intersectionBottomY - intersectionLeftY, 0);
  const intersection_area = intersection_width * intersection_height;

  return intersection_area / (areaA + areaB - intersection_area);
}
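As a quick sanity check of the formula (the example values are illustrative, not from the article): two identical boxes should give an IoU of 1, and disjoint boxes an IoU of 0.

console.log(IOU([0, 0, 10, 10], [0, 0, 10, 10])); // 1
console.log(IOU([0, 0, 10, 10], [20, 20, 5, 5])); // 0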

This completes the post-processing steps. Watch out for our next post, where we compare web and mobile results.
