Tracking a Cat with TensorFlow Object Detection Model and Coral Edge Device for Inference

Alexander Komarov
10 min read · Apr 6, 2023


In this article I describe how I set up a network of microcontrollers at home to track my cat in real time and record where he spends his time (for example, how often he eats or drinks).

The first part covers building a model with the TensorFlow Object Detection API and converting it to TFLite and an Edge TPU compatible format to run inference on the Coral Edge dev board.

There will be another article where I build a pose classifier by training the TensorFlow Object Detection model on my custom keypoints, which are the joints of the cat’s body.

I will apply transfer learning to the SSD MobileNet V2 FPNLite 640x640 pre-trained model, which was trained on the COCO dataset. You can find the full list of models here. It has 90 classes, and since they already include the category “cat”, I will extract the pre-trained weights of that category for my model.

Dataset preparation

For the TensorFlow Object Detection API, your dataset must be in TFRecord format.

It must be annotated in the COCO data format, because the pre-trained model was originally trained on the COCO dataset. Every item in the dataset should contain the following information:

1. An RGB image for the dataset encoded as JPEG or PNG.

2. A list of bounding boxes for the image. Each bounding box should contain:

  • The bounding box coordinates (with the origin in the top left corner) defined by 4 floating point numbers [ymin, xmin, ymax, xmax]. Note that we store the normalized coordinates (x / width, y / height) in the TFRecord dataset.
  • The class of the object in the bounding box.
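The normalization mentioned above can be sketched as follows. This is a minimal illustration, not the generator from my repository: the function name normalize_box and the clamping to [0, 1] are assumptions made for the example.

```python
def normalize_box(xmin_px, ymin_px, xmax_px, ymax_px, width, height):
    """Convert a pixel-space box to normalized [ymin, xmin, ymax, xmax]."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    def clamp(value):
        # Keep annotations that touch or slightly cross the border inside [0, 1].
        return min(max(value, 0.0), 1.0)

    return [
        clamp(ymin_px / height),
        clamp(xmin_px / width),
        clamp(ymax_px / height),
        clamp(xmax_px / width),
    ]


# A 640x512 image with a box from (64, 128) to (320, 384) in pixels.
box = normalize_box(xmin_px=64, ymin_px=128, xmax_px=320, ymax_px=384,
                    width=640, height=512)
print(box)  # [0.25, 0.1, 0.75, 0.5]
```

Note the output order: y values come before x values, and y is divided by the height while x is divided by the width.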

This page describes all the requirements well and has some examples. You can also check the TFRecords generator file in my repository for this project.

There are many free tools nowadays for annotating keypoints and bounding boxes. I personally use Label Studio. I run it locally via Docker, and it’s completely free for such tasks. This is an example of one of my keypoint annotation tasks for the cat.

In total, I annotated around 300 photos of my cat at home, in addition to some pictures of cats from the PASCAL VOC2012 dataset that were relevant to my task.

Running the script generates the TFRecords, whose path must be set in the tf_record_input_reader section of the pipeline config file.

Important to note: the TensorFlow Object Detection API always returns coordinates in (y, x) order.
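Because of this ordering, it is easy to swap axes when going back to pixels. A small sketch of the conversion, assuming a normalized box in [ymin, xmin, ymax, xmax] order; the helper name to_pixel_box is hypothetical:

```python
def to_pixel_box(box, width, height):
    """Map normalized [ymin, xmin, ymax, xmax] to pixel (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * width),   # left
        round(ymin * height),  # top
        round(xmax * width),   # right
        round(ymax * height),  # bottom
    )


print(to_pixel_box([0.25, 0.1, 0.75, 0.5], width=640, height=512))
# (64, 128, 320, 384)
```

The (left, top, right, bottom) order of the result is the one most drawing libraries expect, which is why the axes are reordered here.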

Labels map

In our case it’s pretty easy: we have only one item to detect. The file looks like this; it is a specific format we have to adhere to for the TF2 Object Detection API to recognise the label map correctly.

item {
  name: "cat"
  id: 1
  display_name: "cat"
}

Important to notice: the category id always starts from 1 as 0 is reserved for the background in TF2.
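If you have more than one class, writing the file by hand gets error-prone. Here is a minimal sketch that renders the same text format from a list of names and numbers the ids from 1; the function name build_label_map is an assumption for illustration.

```python
def build_label_map(class_names):
    """Render label map text; ids start at 1 because 0 is the background."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(
            "item {\n"
            f'  name: "{name}"\n'
            f"  id: {class_id}\n"
            f'  display_name: "{name}"\n'
            "}\n"
        )
    return "\n".join(items)


print(build_label_map(["cat"]))
```

The returned string can be written to a .pbtxt file and referenced from the pipeline config.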

Model preparation

Choosing pre-trained model

I am using the pre-trained SSD MobileNet V2 FPNLite 640x640 model and fine-tuning it afterwards.

It is important to choose a model with the SSD architecture, as only SSD-based models are currently supported for conversion to the TensorFlow Lite format. As we are planning to run inference on the Coral Edge device for real-time streaming, we need the TFLite format.

Regarding TPU compatibility: currently, SSDMetaArch models are supported on TPUs. FasterRCNNMetaArch is going to be supported in the future.

Pipeline configuration

Here is what my pipeline config looks like. My model will be trained on a TPU, which is why all input files (the TFRecords dataset, the label map and the fine_tune_checkpoint of the pre-trained model) must be stored in Google Cloud Storage. For example, the fine_tune_checkpoint entry will look like fine_tune_checkpoint: "gs://FOLDER_NAME/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0"

In my case fine_tune_checkpoint_type doesn’t matter, as I am going to modify the code that loads the pre-trained checkpoint so that it extracts specifically the “cat” weights for the heads of my model.

The number of classes changes from the default 90 to 1 in our case.
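For orientation, here is a sketch of where these settings sit, assuming the standard layout of an SSD pipeline config. It is a fragment, not my full config: the file names label_map.pbtxt and train.tfrecord are hypothetical placeholders, and all other fields are omitted.

```protobuf
model {
  ssd {
    num_classes: 1  # changed from the default 90
  }
}
train_config {
  fine_tune_checkpoint: "gs://FOLDER_NAME/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0"
}
train_input_reader {
  label_map_path: "gs://FOLDER_NAME/label_map.pbtxt"  # placeholder name
  tf_record_input_reader {
    input_path: "gs://FOLDER_NAME/train.tfrecord"  # placeholder name
  }
}
```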

Restoring weights from pre-trained model

This is how I do partial loading of the pre-trained model. The detection model consists of the feature extractor detection_model._feature_extractor (MobilenetV2 FPN Feature Extractor in this case) and the heads predictor detection_model._box_predictor.

The heads predictor starts with the early layers base_tower_layers_for_heads, followed by the class prediction and box regression heads.

_box_prediction_head predicts the bounding boxes while _prediction_heads predicts the category. The default model is pre-trained on 90 COCO categories. We keep base_tower_layers_for_heads and _box_prediction_head for partial restoring. However, we won’t restore _prediction_heads, because we want new classes and a number of classes different from the default 90. Thus, after partially restoring, we can start re-training the model. This is the workflow for creating new classes.

import tensorflow as tf

# Wrap only the parts of the box predictor that should be restored
fake_box_predictor = tf.compat.v2.train.Checkpoint(
    _base_tower_layers_for_heads=detection_model._box_predictor._base_tower_layers_for_heads,
    # _prediction_heads=detection_model._box_predictor._prediction_heads,
    # (the classification head is deliberately not restored)
    _box_prediction_head=detection_model._box_predictor._box_prediction_head,
    _head_scope_conv_layers=detection_model._box_predictor._head_scope_conv_layers,
)

tmp_model_checkpoint = tf.compat.v2.train.Checkpoint(
    _box_predictor=fake_box_predictor,
)

model_checkpoint = tf.compat.v2.train.Checkpoint(model=tmp_model_checkpoint)
# Restore the wrapped variables; expect_partial() silences warnings about the skipped head
model_checkpoint.restore(train_config.fine_tune_checkpoint).expect_partial()

# Run model through a dummy image so that variables are created
image, shapes = detection_model.preprocess(tf.zeros([1, 640, 640, 3]))
prediction_dict = detection_model.predict(image, shapes)
_ = detection_model.postprocess(prediction_dict, shapes)
print('Weights restored!')

However, in our case we work with cats. This category already exists among the 90 pre-trained COCO categories. Therefore it makes sense to load the weights for this category specifically.

That’s what my script below does. Basically, we just need to take the depthwise kernel, pointwise kernel and bias weights for the cat category (number 17 in the COCO label map) from the pre-trained checkpoint and assign them to our new detection model.

We can retrieve, for example, the pointwise kernel from the checkpoint as reader.get_tensor('model/_box_predictor/_prediction_heads/class_predictions_with_background/_class_predictor_layers/0/pointwise_kernel/.ATTRIBUTES/VARIABLE_VALUE')

It has the shape (1, 1, 128, 546), where 546 stands for the background class (the first 6 elements) plus 90 classes (each with 6 elements). As the cat corresponds to category number 17, we extract its part of the pointwise kernel as pointwise_kernel[:,:,:,6*18:6*19].
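The channel arithmetic can be checked with a few lines of plain Python. This is only a sketch that takes the layout described above at face value (6 values per class slot, background first); the helper names are made up for the example.

```python
VALUES_PER_SLOT = 6   # width of each class slice, as described in the text
COCO_CLASSES = 90
NEW_CLASSES = 1       # only "cat"


def head_channels(num_classes, values_per_slot=VALUES_PER_SLOT):
    """Output channels of the class head: one slot per class plus background."""
    return values_per_slot * (num_classes + 1)


def slot_range(slot, values_per_slot=VALUES_PER_SLOT):
    """Half-open channel range [start, stop) covered by one slot."""
    return slot * values_per_slot, (slot + 1) * values_per_slot


print(head_channels(COCO_CLASSES))  # 546
print(head_channels(NEW_CLASSES))   # 12
print(slot_range(0))                # (0, 6)
print(slot_range(18))               # (108, 114)
```

So the pre-trained head has 546 output channels, the one-class head has 12, and concatenating the background slice with one 6-wide class slice gives exactly the 12 channels the new head expects. The last line reproduces the 6*18:6*19 slice used in the text.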

###### My custom part #####
reader = tf.train.load_checkpoint(train_config.fine_tune_checkpoint)
class_head = detection_model._box_predictor._prediction_heads.get('class_predictions_with_background')
head_pred_vars = class_head.trainable_variables  # [depthwise_kernel, pointwise_kernel, bias]
prefix = 'model/_box_predictor/_prediction_heads/class_predictions_with_background/_class_predictor_layers/0/'
suffix = '/.ATTRIBUTES/VARIABLE_VALUE'

pointwise_kernel = reader.get_tensor(prefix + 'pointwise_kernel' + suffix)  # (1, 1, 128, 546)
depthwise_kernel = reader.get_tensor(prefix + 'depthwise_kernel' + suffix)  # (3, 3, 128, 1)
bias = reader.get_tensor(prefix + 'bias' + suffix)  # (546,)

# The depthwise kernel does not depend on the number of classes, so it is copied as is
assert head_pred_vars[0].shape == depthwise_kernel.shape
head_pred_vars[0].assign(depthwise_kernel)

# Keep only the background slice and the cat slice of the pointwise kernel
filtered_pointwise_kernel = tf.concat([pointwise_kernel[:, :, :, :6], pointwise_kernel[:, :, :, 6*18:6*19]], axis=-1)
assert head_pred_vars[1].shape == filtered_pointwise_kernel.shape
head_pred_vars[1].assign(filtered_pointwise_kernel)

# Same filtering for the bias
filtered_bias = tf.concat([bias[:6], bias[6*18:6*19]], axis=0)
assert head_pred_vars[2].shape == filtered_bias.shape
head_pred_vars[2].assign(filtered_bias)

image, shapes = detection_model.preprocess(tf.zeros([1, 640, 640, 3]))
prediction_dict = detection_model.predict(image, shapes)
_ = detection_model.postprocess(prediction_dict, shapes)
print('Weights restored!')

Now our newly created model has one class output and can detect cats pretty well. Now we can fine-tune it by re-training on our dataset.

Training on TPU

As mentioned earlier, for TPU training all objects must be stored in Google Cloud Storage. This is the command to start training.

!python3 models/research/object_detection/model_main_tf2.py --model_dir=gs://immo-280017-haro-bucket/trained_tpu_ssd_model0 --use_tpu=True --pipeline_config_path=ssd_gc_pipeline.config

In case of this error: "Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation", you might need to add this block to your training script:

import os
os.environ['TF_ENABLE_GPU_GARBAGE_COLLECTION'] = 'false'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

It is recommended to check the loss in TensorBoard.

In my case 17k steps were already enough for high accuracy. However, keep in mind that I started training with fully pre-trained weights for the cat category (normally, _prediction_heads is not restored).

Converting to TF Lite

Exporting TFlite graph

First, we invoke export_tflite_graph_tf2.py to generate a TFLite-friendly intermediate SavedModel. This will then be passed to the TensorFlow Lite Converter for generating the final model.

!python3 models/research/object_detection/export_tflite_graph_tf2.py --pipeline_config_path=ssd_gc_pipeline.config --trained_checkpoint_dir=gs://FOLDER_NAME/trained_tpu_ssd_model0 --output_directory=exported_ssd_tflite_model

Only SSD meta-architectures are supported for now. The expected output SavedModel would be in the directory exported_ssd_tflite_model (which is created if it does not exist).

Post-training quantization

The next step is model optimisation, which here means quantization. Model quantization is a technique that allows for reduced-precision representations of weights and, optionally, activations for both storage and computation. Here is the full guide on how to decide on your model optimization strategy.

The following decision tree helps you select the quantization schemes you might want to use for your model, simply based on the expected model size and accuracy.

In my case I did post-training quantization to be able to run the model on Coral Edge device.

I applied the full integer quantization technique, as the Edge TPU hardware requires all parameters and activations to be quantized to integers. Here is the full table on how to decide which quantization technique you need.

For simple cases with dynamic range quantization, you can use the tflite_convert command line tool like this:

!tflite_convert --saved_model_dir=/content/exported_ssd_tflite_model/saved_model --output_file=tflite/model.tflite

For full integer quantization, we need a representative dataset for calibration. This dataset can be a small subset (around 10–500 samples, depending on the case) of the training or validation data. Check out the GitHub repository of this project for my representative dataset generator function.
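To give an idea of the shape such a generator takes, here is a minimal sketch. It is not the function from my repository: it assumes the calibration images are already loaded as uint8 RGB arrays of shape (640, 640, 3), that the model has a single float input normalized to [-1, 1], and the name make_representative_dataset is made up for the example.

```python
import numpy as np


def make_representative_dataset(images, limit=200):
    """Build a calibration generator from uint8 RGB arrays of shape (640, 640, 3)."""
    def representative_dataset():
        for image in images[:limit]:
            pixels = np.asarray(image, dtype=np.float32)
            normalized = (2.0 / 255.0) * pixels - 1.0  # scale to [-1, 1]
            # Yield a list with one array per model input,
            # each with a leading batch dimension.
            yield [normalized[np.newaxis, ...].astype(np.float32)]
    return representative_dataset


# Synthetic frames standing in for real calibration photos.
frames = [np.random.randint(0, 256, size=(640, 640, 3), dtype=np.uint8)
          for _ in range(3)]
representative_dataset = make_representative_dataset(frames)
```

The returned function takes no arguments, which is the form the converter's representative_dataset attribute expects.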

One thing to note regarding the representative dataset: the data must be normalized. In our model the SSD feature extractor has an embedded pre-processor:

preprocessed_images = (2.0 / 255.0) * raw_inp - 1.0

This formula corresponds to mean=127.5 and std=127.5, as in the documentation.
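The equivalence is easy to verify: (x - 127.5) / 127.5 = x / 127.5 - 1 = (2 / 255) * x - 1. A quick numerical check in plain Python (the function names are only for illustration):

```python
def normalize_scaled(pixel):
    """The pre-processor form: (2 / 255) * x - 1."""
    return (2.0 / 255.0) * pixel - 1.0


def normalize_mean_std(pixel, mean=127.5, std=127.5):
    """The mean/std form: (x - mean) / std."""
    return (pixel - mean) / std


# Both forms agree over the whole uint8 range, up to floating point rounding.
for pixel in range(256):
    assert abs(normalize_scaled(pixel) - normalize_mean_std(pixel)) < 1e-12
```

Either way, pixel value 0 maps to -1, 127.5 maps to 0 and 255 maps to 1.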

Here is my code for quantization.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('/content/exported_ssd_tflite_model/saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8, tf.lite.OpsSet.SELECT_TF_OPS, tf.lite.OpsSet.TFLITE_BUILTINS]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.allow_custom_ops = True
converter.representative_dataset = representative_dataset  # calibration data generator
tflite_model = converter.convert()

# Write the quantized flatbuffer to disk
with open('model_with_repres_ds_int_custom_haro_ds.tflite', 'wb') as w:
    w.write(tflite_model)

The following command then compiles it for Edge TPU devices.

!edgetpu_compiler model_with_repres_ds_int_custom_haro_ds.tflite --show_operations

The Edge TPU compiler can be installed as follows:

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
sudo apt-get update
sudo apt-get install edgetpu-compiler

For me, the result was the following:

Most of the operations were mapped to the Edge TPU, which makes model inference quite fast.

Setting up the cameras and Coral Edge dev board

This is what my ESP32 camera microcontroller looks like. It costs around 18 EUR on Amazon.

Here are the 3D prints for this microcontroller.

Eventually these small devices were set up all around my apartment. This is, for example, how they look in the kitchen.

Here is the full guide on how to make it work in Python. You can set it up in an IDE such as Thonny, for example. In this tutorial, the point of interest is Chapter 30, which shows how to set up a camera web server in MicroPython. Eventually, every time you plug the ESP32 controller in, the camera web server automatically starts transmitting video to the specified port.

This is what my Coral Edge dev board looks like. It has 4GB of memory.

On eBay I’ve found a case for it and now it looks pretty cool.

In the Coral getting started guide, there is a step-by-step tutorial on how to flash the board and connect to it via SSH afterwards.

Here is the script for running inference on the Coral Edge device.

from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter
from PIL import Image
import numpy as np
import time

interpreter = make_interpreter('my_model_integer_quant_edgetpu.tflite')
interpreter.allocate_tensors()  # required before tensors can be read or written
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Quantization parameters of the input and of the first output
scale = input_details[0]['quantization_parameters']['scales']
zero_point = input_details[0]['quantization_parameters']['zero_points']
output_scale = output_details[0]['quantization_parameters']['scales']
output_zero_point = output_details[0]['quantization_parameters']['zero_points']

# I need a vertical flip due to the position of my cameras
image = Image.open('image.jpg').transpose(Image.FLIP_TOP_BOTTOM).convert('RGB').resize((640, 640), Image.LANCZOS)
# normalization to [-1, 1], then quantization to int8
raw_inp = (2.0 / 255.0) * np.asarray(image) - 1.0
raw_inp = raw_inp / scale + zero_point
common.set_input(interpreter, raw_inp.astype(np.int8))
start = time.time()
interpreter.invoke()
print('elapsed %s' % (time.time() - start))
output_data = interpreter.get_tensor(output_details[0]['index'])
# dequantize the int8 output back to real values
dequant_output = (output_data - output_zero_point) * output_scale

The first output, output_details[0], gives the probability of the cat being seen in the picture, while the second output, output_details[1], gives the bounding box points.

The output results need to be dequantized. That’s why I use the formula dequant_output = (output_data - output_zero_point) * output_scale
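Quantization and dequantization are inverse affine mappings, which a small round trip makes concrete. This is a sketch in plain Python: the scale and zero point below are hypothetical values, not the ones from my model, and the rounding and clipping to the int8 range are choices made for the example.

```python
def quantize(value, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to int8: q = round(value / scale) + zero_point, clipped."""
    q = round(value / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Inverse mapping: value = (q - zero_point) * scale."""
    return (q - zero_point) * scale


scale, zero_point = 1.0 / 256.0, -128  # hypothetical output parameters
q = quantize(0.98, scale, zero_point)
print(q, dequantize(q, scale, zero_point))  # 123 0.98046875
```

The round trip does not return exactly 0.98, because only 256 distinct values can be represented; the error is at most half of the scale.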

To draw the bounding box from the output_data, run the following

from object_detection.utils import visualization_utils as viz_utils
detected_keypoints = [output_data[0,0,:2],output_data[0,0,2:]]
viz_utils.draw_bounding_box_on_image(image, *detected_keypoints[1], *detected_keypoints[0])

The class probabilities output for this picture are the following:

We see 98% probability that there is a cat in the picture.

Compared with the float TFLite model we see quite a speed-up, as the float TFLite model runs inference in around 600 ms.
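When comparing latencies like this, a single timed call can be misleading, since the first inference on the Edge TPU also includes loading the model onto the device. Here is a minimal sketch of a fairer measurement; the helper average_latency_ms is hypothetical and the workload is a stand-in for the real inference call.

```python
import time


def average_latency_ms(run_inference, warmup=1, runs=10):
    """Average wall-clock time of run_inference() in milliseconds."""
    for _ in range(warmup):
        run_inference()  # not timed: the first call often includes one-off setup
    start = time.perf_counter()
    for _ in range(runs):
        run_inference()
    return (time.perf_counter() - start) * 1000.0 / runs


# Stand-in workload; on the board this would be something like interpreter.invoke
print(f"{average_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```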


This is how I take the stream from my ESP32 cameras

import cv2
import PIL.Image

cap = cv2.VideoCapture('http://IP_ADDRESS:5050/video')
while True:
    ret, frame = cap.read()
    if not ret:
        break  # stream ended or the frame could not be read
    cv2_im_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    image = cv2.resize(cv2_im_rgb, (640, 640))
    # convert to PIL Image
    im_pil = PIL.Image.fromarray(image)
    # here run the inference

Eventually, drawing the predicted bounding boxes on each frame gives the following result for the camera installed in the corridor.

As a result, I now have statistics on where at home my cat spends most of his time and how often he comes to eat or drink.
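How such statistics can be derived from per-frame detections is sketched below. This is an illustration under assumptions, not the code from my repository: it assumes each camera is tied to one location, that detections arrive as (timestamp in seconds, location, score) tuples sorted by time, and the threshold and max_gap values are arbitrary.

```python
from collections import defaultdict


def time_per_location(detections, threshold=0.5, max_gap=2.0):
    """Sum the seconds the cat was seen at each location.

    detections: iterable of (timestamp_seconds, location, score), sorted by time.
    """
    totals = defaultdict(float)
    last_seen = {}
    for timestamp, location, score in detections:
        if score < threshold:
            continue  # no confident detection in this frame
        previous = last_seen.get(location)
        # Count the interval only if the previous sighting is recent enough;
        # otherwise treat this frame as the start of a new visit.
        if previous is not None and timestamp - previous <= max_gap:
            totals[location] += timestamp - previous
        last_seen[location] = timestamp
    return dict(totals)


log = [
    (0.0, "kitchen", 0.98), (1.0, "kitchen", 0.95), (2.0, "kitchen", 0.30),
    (3.0, "kitchen", 0.97), (10.0, "corridor", 0.99), (11.5, "corridor", 0.96),
]
print(time_per_location(log))  # {'kitchen': 3.0, 'corridor': 1.5}
```

Counting the number of separate visits (for example, how often he comes to the bowl) would follow the same pattern, incrementing a counter whenever a new visit starts.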

My GitHub repository for this project is linked below.

Stay tuned: soon I will post the second part, where I will add custom keypoint detection for the joints of the cat’s body, followed by the pose classifier.