Machine Learning: Detecting Dropped Pacifiers

Hank Jacobs
Nov 6, 2020


I recently became a parent. As expected, it’s been full of challenges, one of which is sleep. Our little one is a decent sleeper but has one requirement any time she goes down: a pacifier. Not just any pacifier, her pacifier. If she drops it, we place it back in her mouth so she can fall asleep or stay asleep. My wife and I oblige each and every time because, hey, there’s nothing more peaceful than a sleeping baby. That said, it means constantly keeping one eye on the baby monitor, watching for a dropped pacifier.

Each time I replaced the pacifier, I had the same thought: there must be a way technology can solve this. I’ve always eyed machine learning from afar but never had the time or a problem to apply it to. This was the perfect opportunity, so I dove right in. I had heard of a nifty machine learning platform called TensorFlow and figured that would be a great place to start. Lucky for me, it already had an API for object detection and easy-to-follow instructions for training a custom object detector.

Labeling with LabelImg

After setting up my workspace, the first step was sourcing some images to train the model with. I had set aside a handful of shots from our baby monitor and started with roughly 50 images of various baby poses, both pacifier-in and pacifier-out. Easy enough. Then it was time to label… and label… and label some more. I used LabelImg and worked through each and every image. If the pacifier was in, I labeled the baby and pacifier together as a single “baby_pacifier” object. If it was out, I labeled “baby” and “pacifier” separately.

[Image: TensorBoard showing training progress]

After labeling, munging the resulting XML into TFRecord files, and splitting my dataset into train and test sets, I was ready to train a model. I first went down the path of training my own model from scratch but quickly realized that might require thousands of images and hundreds of hours of training. I went back to following the instructions and used the pre-trained model referenced there as a starting point. We were off to the races… very slow races, that is. Since I didn’t have access to an NVIDIA GPU and didn’t want to bother with cloud TPUs, I ran the training job locally on my laptop. To measure progress, I also started an evaluation job against the test set and spun up a TensorBoard instance to visualize everything. The training job’s performance left much to be desired but, lo and behold, I started seeing results a few hours later.
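As an aside, the “munging” step looks roughly like this: parse LabelImg’s Pascal VOC XML and write tf.train.Example records in the shape the Object Detection API expects. This is a simplified sketch, not the exact script I ran; the file paths are placeholders, and the class mapping just covers my three labels.

```
# Rough sketch: convert LabelImg (Pascal VOC) XML annotations into a TFRecord file.
# Paths and the label -> id mapping are illustrative.
import glob
import os
import xml.etree.ElementTree as ET

import tensorflow as tf

LABELS = {"baby": 1, "pacifier": 2, "baby_pacifier": 3}

def xml_to_example(xml_path, image_dir):
    root = ET.parse(xml_path).getroot()
    filename = root.find("filename").text
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)

    # Read the raw JPEG bytes referenced by the annotation.
    with tf.io.gfile.GFile(os.path.join(image_dir, filename), "rb") as f:
        encoded_jpg = f.read()

    xmins, xmaxs, ymins, ymaxs, classes_text, classes = [], [], [], [], [], []
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        xmins.append(float(box.find("xmin").text) / width)
        xmaxs.append(float(box.find("xmax").text) / width)
        ymins.append(float(box.find("ymin").text) / height)
        ymaxs.append(float(box.find("ymax").text) / height)
        classes_text.append(name.encode("utf8"))
        classes.append(LABELS[name])

    def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    def _floats(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    def _ints(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

    feature = {
        "image/height": _ints([height]),
        "image/width": _ints([width]),
        "image/filename": _bytes([filename.encode("utf8")]),
        "image/source_id": _bytes([filename.encode("utf8")]),
        "image/encoded": _bytes([encoded_jpg]),
        "image/format": _bytes([b"jpeg"]),
        "image/object/bbox/xmin": _floats(xmins),
        "image/object/bbox/xmax": _floats(xmaxs),
        "image/object/bbox/ymin": _floats(ymins),
        "image/object/bbox/ymax": _floats(ymaxs),
        "image/object/class/text": _bytes(classes_text),
        "image/object/class/label": _ints(classes),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write everything into one record file; in practice, split into train and test sets first.
with tf.io.TFRecordWriter("train.record") as writer:
    for xml_path in glob.glob("annotations/*.xml"):
        writer.write(xml_to_example(xml_path, "images").SerializeToString())
```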

[Image: first successful-ish detection]

Though it was working, it consistently mislabeled objects even after ~50k training iterations. Unsure of what to do, I took a wild guess and decided more training data was needed. +300 labeled images and many wasted hours later, we were back at it. After another day’s worth of training, I was finally seeing consistently solid results from the test set. Also, the model was able to accurately label most new images I fed it. Cool. So now we have a working model that can differentiate between a baby and a baby with a pacifier. Now what?

Enter MediaPipe. In its own words, MediaPipe “offers cross-platform, customizable ML solutions for live and streaming media.” I had experimented with it a few months back for pose tracking on live video and knew it had out-of-the-box support for object detection, as well as for consuming live video with backpressure for slow models. By converting my SavedModel to a TensorFlow Lite model, I was able to plug it into MediaPipe’s object detection example. I did struggle briefly with the SavedModel-to-TensorFlow Lite conversion, but using the latest build of TensorFlow via tf-nightly did the trick.
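For reference, the conversion itself boils down to a few lines with the TFLite converter; the struggle was getting it to run cleanly, which is where tf-nightly came in. Paths here are placeholders.

```
# Minimal sketch of the SavedModel -> TensorFlow Lite conversion.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
tflite_model = converter.convert()

# Write the flatbuffer out so MediaPipe's object detection example can load it.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```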

After overcoming that hiccup, I was able to stream video from my baby monitor and detect whether the little one had her pacifier in or not. Unfortunately, it would work for a minute and then fall behind; it quickly became evident that the model I had chosen was just too slow for live video. I remembered a document I had stumbled across in my research outlining the performance of the various pre-trained models TensorFlow offers, and decided to retrain using the fastest one listed: SSD MobileNet v2 320x320. Two days later, I finally had a model that could keep up with live video!
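If you try something similar, a quick sanity check before wiring a model into MediaPipe is to time single inferences with the TFLite interpreter; if one frame takes longer than your camera’s frame interval, it will never keep up. A rough sketch (not the exact benchmark I ran):

```
# Rough per-frame latency check for a converted TFLite model.
import time

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Feed a dummy frame matching the model's input shape and dtype.
dummy = np.zeros(input_details["shape"], dtype=input_details["dtype"])

runs = 50
start = time.time()
for _ in range(runs):
    interpreter.set_tensor(input_details["index"], dummy)
    interpreter.invoke()
print(f"~{(time.time() - start) / runs * 1000:.1f} ms per frame")
```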

[Image: object detection on video]

So far so good, but that solution still required me to watch a video feed and wait for a box to change. Being an SRE at my day job, and inspired by a tweet, I figured a natural next step would be to wire up metrics tracking the confidence of each label and trigger an alert when they changed. To accomplish this, I extended MediaPipe with a custom calculator that exposed Prometheus metrics and configured my home Prometheus server to scrape it. Though it did work, I found it difficult to craft the right query to detect a drop in the “baby_pacifier” label and a rise in the “baby” label while handling rapid changes over a short interval. Back to the drawing board.
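The calculator itself was C++ inside MediaPipe, but the idea is simple enough to sketch in Python with prometheus_client: one gauge per label holding the most recent confidence score, exposed on a port for Prometheus to scrape. The metric name and port here are made up for illustration.

```
# Illustration only: expose per-label detection confidence as Prometheus gauges.
from prometheus_client import Gauge, start_http_server

confidence = Gauge(
    "detection_confidence",
    "Most recent confidence score per detected label",
    ["label"],
)

def record_detections(detections):
    """detections: iterable of (label, score) pairs from the object detector."""
    for label, score in detections:
        confidence.labels(label=label).set(score)

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for the home Prometheus server
    # ...feed record_detections() from the detection loop...
```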

[Image: baby metrics]

My next attempt was something a little more rudimentary. I extended MediaPipe to print any detected labels to standard out, then wrote a small Go program that parsed MediaPipe’s output, detected when the label changed from “baby_pacifier” to “baby”, and sent me a push notification. To prevent rapid bursts of notifications, it debounced over five seconds before firing. To my relief, and to my wife’s eye roll, it worked in production for the first time over dinner that evening.
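The real program was Go, but the logic is small enough to sketch here in Python: read one detected label per line from MediaPipe’s output, and only notify once the pacifier has been out for the full debounce window. send_push() is a stand-in for whatever push notification service you use.

```
# Illustration of the notifier logic (my actual version was a small Go program).
import sys
import time

DEBOUNCE_SECONDS = 5

def send_push(message):
    # Stand-in for a real push notification call.
    print(f"PUSH: {message}", file=sys.stderr)

def main():
    dropped_since = None   # when we first saw "baby" without "baby_pacifier"
    notified = False       # only notify once per drop

    for line in sys.stdin:
        label = line.strip()
        if label == "baby_pacifier":
            dropped_since, notified = None, False
        elif label == "baby":
            if dropped_since is None:
                dropped_since = time.time()
            elif not notified and time.time() - dropped_since >= DEBOUNCE_SECONDS:
                send_push("Pacifier dropped!")
                notified = True

if __name__ == "__main__":
    main()
```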

In summary, rather than “nap when the baby naps”, I did what any not-so-sane parent would do and over-engineered a solution to a problem I didn’t really have. I trained a TensorFlow model to detect the difference between my newborn daughter with and without her pacifier, fed that model live video using MediaPipe, and wrote a program that parses its output to trigger a push notification whenever the pacifier drops. Needless to say, the many hours it took to build this have saved me countless minutes over the past few weeks.
