Honeycomb Rewind

Taking your baby from zero to one

Andy Isaacson
Honeycomb
11 min read · Nov 26, 2022


Honeycomb’s Queen Bee of Product, Nicole Wee, posing with our Director of Cuteness, baby Blaze.

Inspiration

At Honeycomb, we build an app that lets newly blessed parents share their most precious, most emotionally gripping, most sensitive photos and videos with their loved ones near and far. Our app hosts some truly droolworthy content (the new baby in the family!) and does it privately and securely. As parents ourselves, we know what a crazy busy time that first year is — so busy you barely have time to reflect on how fast the changes are happening. Time flashes by, and that got our VP of Product, Nicole Wee, thinking.

We were taking a break between sprints, running a mini hackweek, and she had just seen “Everything Everywhere All At Once” (go stream it if you haven’t yet — fantastic film!). With baby Blaze in her lap, she was struck by a sequence of Michelle Yeoh’s different lifetimes superimposed right on top of each other. You could see the edges of the frame change between her various lives, but the core facial features were fixed in space. Blaze was growing up so fast, and Nicole had approximately a bajillion photos of her from those months — what if we could make a tool to line them all up, putting the faces right in the same spot, from birth to year one? What if you could see her first year flash by in an instant?

Michelle Yeoh’s face flashing between different timelines, from the trailer for “Everything Everywhere All at Once”

Nicole shared this idea with our engineering team at the perfect time. We had just started running user-uploaded photos through AWS Rekognition, which gave us a whole suite of facial analysis data (position, pose, age, gender, emotion, smile, eyes open, presence or absence of a beard, and more) that we’d been waiting to put to good use. Additionally, we already had a scriptable video generation pipeline assembling short videos on demand. Most of the backend infrastructure we needed to build this was already in place, so we set to work.

To get the entire Rewind app experience where we wanted it, we needed to make sure that it would:

Be built to scale: We hope to be able to accommodate everyone, everywhere, all at once. User scale is a great problem to have, and this would be good practice deploying a pipeline that scales with it.

Be more automagical, require less interaction: New parents are busy! We decided early on that this should be a tool that builds a movie from the user’s library with little to no direct input.

Surprise and delight: We know a baby’s first year is exciting and magical. Our final product should reflect that joy!

The moving painted portraits from the Harry Potter films: Harry would never settle for stiff and lifeless (Warner Bros.)

Scaling: video generation on AWS Lambda

Early in developing Honeycomb, we realized that we wanted the app to feel very alive, seeming more like a glance at Harry Potter’s portrait wall than a static archive of photos. That meant generating bespoke video, and lots of it, even though 90% of our media uploads are photos. The solution we found was “post previews” — small, dreamy montages with subtle animations, rendered to a video file. Slow dissolves, motion effects, subtle nods to Ken Burns’ style of moving still pictures. To create these montages, we crafted a system that could spit out mp4s from user-contributed source media on demand, scaling to thousands of simultaneous requests.

Running in a Python-based Lambda, our video generator reads from a queue of new posts, pulls down the freshly uploaded media, and renders a few different sizes of the montage. It runs reasonably quickly, and with the ability to allocate GBs of RAM or disk if needed, we’re not concerned about being under-resourced. Developing the PreviewGenerator Lambda involved creating a custom Docker image (instead of the layer approach) to make sure the underlying environment had all the necessary binary libraries. We built our new Rewind generator on that base image; a rough sketch of the assembly step follows the dependency list below. The image includes:

  • MoviePy — An open-source, Python-based video composition framework that allows for sequencing, effects, and transformations.
  • ffmpeg — Used by MoviePy under the hood to read and write frames to video files.
  • imagemagick — Used to open, process, and manipulate image files.
  • libheif + libde265 — The system libraries that allow us to read the .HEIC files created by iPhones.
  • pango + freetype — Text layout and font rendering libraries (plus a command-line tool) for rendering clean text to an image, for titles.
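For a sense of what that assembly step looks like, here’s a minimal MoviePy (1.x) sketch of a “post preview” style montage: a few stills with a gentle Ken Burns push-in and slow cross-dissolves, written out as an mp4. The file names, durations, and sizes are placeholders, not our production values.

```python
# Minimal MoviePy (1.x) sketch of a "post preview" montage: slow zooms and
# cross-dissolves over a handful of stills, rendered to an mp4.
# File names, durations, and sizes are illustrative placeholders.
from moviepy.editor import ImageClip, concatenate_videoclips

photos = ["IMG_0001.jpg", "IMG_0002.jpg", "IMG_0003.jpg"]

clips = [
    ImageClip(path)
    .set_duration(2.5)
    .resize(height=720)
    .resize(lambda t: 1 + 0.03 * t)  # gentle Ken Burns-style push-in
    for path in photos
]

# Cross-dissolve between stills by overlapping each clip with the previous one.
faded = [clips[0]] + [clip.crossfadein(0.8) for clip in clips[1:]]
montage = concatenate_videoclips(faded, method="compose", padding=-0.8)
montage.write_videofile("preview.mp4", fps=24, codec="libx264", audio=False)
```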

While Rewind was designed to dovetail nicely into Honeycomb (your finished video is waiting for you when you make an account), we wanted it to be a distinctly separate app. There are no sign-ins required or PII collected with the photo uploads, so the app starts by generating a random UUID as an identity. Photos get uploaded to a publicly writable (but not readable) “dropbox” in S3. When all the media transfers have finished, the app uploads a final JSON configuration file reflecting any additional user choices. Since all upload completion events get broadcast from S3, a preprocessor Lambda listens for just the right kind of file to be announced. Anytime one of those JSON configuration files lands, a smidgen of TypeScript kicks into gear (I like to think of these Lambdas as a small robot army), submits the images listed in the configuration file to Rekognition for face detection, and accumulates the resulting facial analysis data in another JSON file. At this point, we’ve amassed all of the data we need, and are ready to begin assembling the Rewind video in earnest.
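The production preprocessor is written in TypeScript, but the Rekognition call at its heart looks roughly like this boto3 sketch (the bucket and key names are made up for illustration):

```python
# Rough Python/boto3 sketch of the face-detection step the preprocessor runs
# for each image listed in the uploaded JSON config. The real preprocessor is
# TypeScript; bucket and key names here are placeholders.
import json
import boto3

rekognition = boto3.client("rekognition")

def detect_faces(bucket: str, keys: list[str]) -> list[dict]:
    """Return one FaceDetail record per detected face across all images."""
    results = []
    for key in keys:
        response = rekognition.detect_faces(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            Attributes=["ALL"],  # adds age range, smile, emotions, etc. on top of the defaults
        )
        for face in response["FaceDetails"]:
            results.append({"image": key, "face": face})
    return results

if __name__ == "__main__":
    faces = detect_faces("rewind-dropbox", ["uploads/abc123/IMG_0001.jpg"])
    print(json.dumps(faces, indent=2, default=str))
```

Requesting Attributes=["ALL"] is what returns the age range and other attribute fields on top of the default bounding box, pose, and quality data we lean on in the next section.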

Filtering for usable faces

Once all the face recognition data has been compiled, we need to go through the hundreds of instances of face data, each one giving details on a single face from a single photo, and use the metadata that we’re given to rule out any faces that aren’t going to work well. These are the criteria:

  1. Filter out any faces that aren’t estimated to be age 0–3
    Notice the large range! Rekognition is one of the few face recognition tools that gave us anything close to a usable age estimate for infants. Other age detection models, like AgeNet, are trained mostly on photos of adults and seem much more concerned with estimating the ages of older humans than younger ones. Throw out the adult faces — we just want the babies!
  2. Filter out any faces that aren’t looking at the camera
    Rekognition gives us a “pose” for each face: an angle in degrees for pitch, roll, and yaw. Pitch tells us if the face is looking up or down, yaw is an indicator of how much the head is turned left or right, and roll tells us if the entire head is tilted or upside down. We filtered out faces outside of a tight range of pitch and yaw, as that implied that the baby would be looking away from the camera, and the face alignment wouldn’t feel right. The roll is the one dimension of the pose that’s easy to correct for, since we just need to rotate the image.
  3. Filter out any faces that are too small in the frame, or too close to edges
    Anything too small would need to be enlarged, leaving some faces sharp and some faces a blurry mess of pixels. Anything too close to the edge would leave a strange black void after a rotation, as the outside of the image would get moved into the frame.

A final filter makes sure the images are appropriately sharp; a rough sketch of the whole pass is below. Armed with a high-quality set of input images, we’re ready to line ’em up.
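Put together, the whole pass boils down to a handful of checks against each Rekognition FaceDetail record. The sketch below shows the shape of it; the thresholds are illustrative guesses rather than the exact values we shipped.

```python
# Sketch of the face-filtering pass over Rekognition FaceDetail records.
# Thresholds are illustrative guesses, not the exact values we shipped.

MAX_AGE = 3            # keep faces Rekognition estimates could be age 0-3
MAX_PITCH_YAW = 20.0   # degrees; larger means the baby is looking away
MIN_FACE_WIDTH = 0.15  # bounding box width as a fraction of image width
EDGE_MARGIN = 0.05     # keep faces away from the frame edges (fraction)
MIN_SHARPNESS = 50.0   # Rekognition quality score, 0-100

def face_is_usable(face: dict) -> bool:
    age = face["AgeRange"]
    pose = face["Pose"]
    box = face["BoundingBox"]  # Left/Top/Width/Height as ratios of the image

    # 1. Babies only: the estimated age range has to reach down to 0-3.
    if age["Low"] > MAX_AGE:
        return False

    # 2. Looking at the camera: tight pitch/yaw window (roll gets corrected later).
    if abs(pose["Pitch"]) > MAX_PITCH_YAW or abs(pose["Yaw"]) > MAX_PITCH_YAW:
        return False

    # 3. Big enough in frame, and not hugging the edges.
    if box["Width"] < MIN_FACE_WIDTH:
        return False
    if (box["Left"] < EDGE_MARGIN or box["Top"] < EDGE_MARGIN
            or box["Left"] + box["Width"] > 1 - EDGE_MARGIN
            or box["Top"] + box["Height"] > 1 - EDGE_MARGIN):
        return False

    # 4. Sharp enough to enlarge without turning to mush.
    return face["Quality"]["Sharpness"] >= MIN_SHARPNESS

# usable = [rec for rec in all_faces if face_is_usable(rec["face"])]
```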

First test of face registration. The blue box represents the face boundaries as reported by Rekognition.

Image registration — or keeping it all together

One of my first jobs after college was in a brain imaging research lab. We took PET and fMRI scans of people’s brains, and before we could analyze them, we’d do a registration pass, to align them in 3D space. It’s a neat word that comes from the world of printing, where multiple ink layers would need to be lined up just right to blend into new colors (remember those slightly misaligned pink/yellow/blue plus signs on the bottoms of the comics page?). To really sell the timelapse illusion, we needed the face to appear not to move across all of our images. We needed perfect registration, a familiar problem, though this time with a much cuter input set (babies > brains). Rekognition gives us the bounding box of any face it detects, as well as locations of several other facial features. We found that simply lining up the bounding boxes and correcting for the roll rotation was good enough to sell the effect, even if we ignored the other identified facial features.
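Here’s a rough Pillow-based sketch of that registration step (not our production code): undo the roll by rotating around the face center, scale so the face width matches a fixed target width, then crop a frame that drops the face center onto the same target spot every time. The output size, target box, and the sign convention on the roll angle are assumptions here.

```python
# Sketch of registration: undo the roll, then scale and crop so every face's
# bounding box lands on the same target box in the output frame.
from PIL import Image

OUT_W, OUT_H = 1080, 1920            # output frame size (portrait); placeholder
TARGET = (0.30, 0.35, 0.40, 0.40)    # target face box: left, top, width, height (fractions)

def register_face(path: str, box: dict, roll_degrees: float) -> Image.Image:
    img = Image.open(path)

    # Face center in pixels, from Rekognition's fractional bounding box.
    cx = (box["Left"] + box["Width"] / 2) * img.width
    cy = (box["Top"] + box["Height"] / 2) * img.height

    # 1. Undo the head tilt by rotating the whole image around the face center.
    #    (Depending on Rekognition's roll sign convention, you may need -roll_degrees.)
    img = img.rotate(roll_degrees, resample=Image.BICUBIC, expand=False, center=(cx, cy))

    # 2. Scale so the face width matches the target face width.
    scale = (TARGET[2] * OUT_W) / (box["Width"] * img.width)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    cx, cy = cx * scale, cy * scale

    # 3. Crop an OUT_W x OUT_H window that puts the face center at the target spot.
    target_cx = (TARGET[0] + TARGET[2] / 2) * OUT_W
    target_cy = (TARGET[1] + TARGET[3] / 2) * OUT_H
    left, top = round(cx - target_cx), round(cy - target_cy)
    return img.crop((left, top, left + OUT_W, top + OUT_H))
```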

A little music to get the feeling right

Now that the visual effect was coming off properly, it looked great but felt lifeless in silence. It was neat, but didn’t generate that “awwwww” factor we wanted. We needed music. Working with a pianist friend, we recorded a track that conveys the gentle sweetness of tender moments, juxtaposed with the frenetic energy that comes along with a new infant. The gentle, but insistent, piano track hits home and amplifies the emotional impact 1,000x. Going through by hand, we found the precise timing of every beat to the millisecond, so that the photo changes land naturally on the beat.
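Mechanically, cutting to the beat is simple once you have the beat map: each still holds until the next beat timestamp, so every transition lands on the music. Here’s a small MoviePy sketch of that idea; the beat times, file names, and track name are placeholders.

```python
# Sketch of cutting stills to the beat: each photo holds until the next beat
# timestamp, so transitions land on the music. Beat times, file names, and the
# audio track are placeholders; the real beat map was tapped out by hand.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

beat_times = [0.0, 0.52, 1.04, 1.57, 2.09, 2.61]  # seconds, hand-measured
photos = ["face_00.jpg", "face_01.jpg", "face_02.jpg", "face_03.jpg", "face_04.jpg"]

# One duration per photo: the gap between consecutive beats.
durations = [b2 - b1 for b1, b2 in zip(beat_times, beat_times[1:])]
clips = [ImageClip(p).set_duration(d) for p, d in zip(photos, durations)]

video = concatenate_videoclips(clips)
video = video.set_audio(AudioFileClip("piano_track.mp3").subclip(0, video.duration))
video.write_videofile("rewind.mp4", fps=30, codec="libx264", audio_codec="aac")
```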

To find the right musical composition, we knew we wanted something upbeat with an even tempo, to match the quick changes in the visual transitions. With upcoming video generation projects, we’re giving ourselves a little more freedom to play with musical pacing, telling a story through the evolution of a song and using slow or quick video as fits the mood. Ultimately, the music plays a huge role in tugging at the heartstrings and delivering the “surprise and delight” this project needs.

Automagically picking the right photos

Remember, the original idea was that Rewind should require almost zero interaction from the busy parent with their hands full, so we didn’t want to just throw up a photo picker and wish them luck. This meant using a separate bit of AI/ML to do the selection, running on the user’s device. All of the video processing was ultimately happening on our backend, but it needed to be fed the right kinds of input to be useful. The app needed a way to automatically tell which pictures in the user’s photo library were likely to contain a baby’s face, and approximately where it was in the frame so we could clip it out. It honestly blows my mind just how accessible computer vision and machine learning tools on mobile and desktop have become in the last few years! Apple’s CoreML tools make crafting performant, bespoke computer vision models seem simple. With literally zero previous experience in building ML models, our engineer Geoff Golda was able to use the apps bundled with Apple’s developer tools to create two different models, trained from images already existing in the Honeycomb datastore.

A screenshot of Create ML in action, training one of the models used in Rewind.

The first model is a classifier, essentially asking, “Is this the kind of photo that contains a baby?”. The answer comes back relatively quickly with a yes or no, but doesn’t say much more than that. We need to know where in the frame the faces actually lie!

To find the face positions, we built an object detector that tells us the bounding box of each face it identifies, and we use that to extract a close-up frame. However, we found that adult faces were too often making their way through the picker, so running the original classifier again on the cropped sections usually rules those out.
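On device this cascade runs through Core ML, but the logic itself is straightforward. Here’s a Python sketch of it, where baby_classifier and face_detector are hypothetical stand-ins for the two Create ML models rather than real APIs:

```python
# Sketch of the two-stage picker cascade. On device these two models are
# Create ML / Core ML models; `baby_classifier` and `face_detector` below are
# hypothetical stand-ins, and `image` is assumed to be a PIL-style image.

def pick_baby_faces(image, baby_classifier, face_detector, threshold=0.8):
    """Return cropped face regions that both stages agree contain a baby."""
    # Stage 1: cheap whole-image gate -- "is this the kind of photo with a baby?"
    if baby_classifier(image) < threshold:
        return []

    crops = []
    # Stage 2: the object detector proposes face bounding boxes.
    for box in face_detector(image):
        crop = image.crop(box)  # (left, top, right, bottom)
        # Stage 3: re-run the classifier on the crop to weed out adult faces.
        if baby_classifier(crop) >= threshold:
            crops.append(crop)
    return crops
```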

We went through several rounds of tuning these, combined with other packages we tried and ruled out (e.g. “expo-face-detector”), to hit a balance of speed and accuracy for the automagic photo picker. We didn’t want it to take TOO long, but it needed to give mostly right answers as well. AI only seems “smart” when it’s right nearly all the time — even the occasional wrong choice can shatter the illusion. We could probably spend lots more time fine-tuning this “magic photo picker” bit, but the marginal gains get pretty small. And it’s certainly not perfect yet. We still get a decent number of false positives (adult faces that the user can manually nix) and probably plenty of false negatives we don’t even realize. We found it was important to let the user manually nix any of the photos after we rendered the first pass, or they could end up with an oddly placed monkey or book cover in the mix.

The finished product

Click to watch on YouTube Shorts.

Learning

We walked away from this experiment with some solid learnings to take into our future product work. Experience has taught us how to whittle down our iteration loop for developing new video generation scripts, defaulting to postage-stamp-size renders while we’re still working out the logic.

Observability quickly became an issue for the pipeline: without building out more tooling, we couldn’t tell where the process had succeeded or where jobs had gotten gummed up in the works. We started making heavy use of a dedicated internal Slack channel for automated status updates, as well as recording the input configuration and output result of every job in our database, to easily reconstruct failing cases.

We also learned that baking in assumptions about our best-fitting users could severely limit our ability to test and publicize. The app assumes that you’ve got a TON of high quality photos of your baby in your Photo Library, and while many Insta-crazed new moms and dads certainly do, the assumption doesn’t hold for many users. Some of our team, for example, have never actually run the app on their own devices, for the simple reason that their photo libraries just don’t fit the profile, and so no media gets selected. We like to be opinionated about our product niche, but it can also make disseminating an app like this extremely difficult.

Finally, we kept running into the limitations of text rendering in Linux using command-line tools. Despite our best efforts and hacks, we still haven’t been able to render color emoji inline with our black and white text, and we’ve seen how pervasive emoji use is among our mobile-first consumers. I’m still dreaming up hacks of Mac-based (iOS-based?) runners that render lovely text to some cache using Apple’s text layout libraries and send the results to the Lambdas as inputs. Crazy suggestions welcome!

Conclusion

Honeycomb is rooted in helping parents save their memories of that first year. We’re always looking for new ways to make capturing it just a little more of a delightful reward, and Rewind fits that bill nicely. Try it for yourself, especially if you’re the kind of parent that took a lot of photos in that first year of parenthood. And if that’s NOT you, but you know somebody who is, send them the link, and have them try it out! Regardless, please pass along any feedback to our team — we’d love to hear it! — or you can contact me directly here. Enjoy!

You can find Honeycomb Rewind in the app store here:
https://apps.apple.com/us/app/baby-photo-timelapse-video-app/id1638278114

Andy Isaacson
Honeycomb

Head o' Engineering for Honeycomb, startup veteran, lifelong learner, father of three