How Ollie’s AI System Works

Mahyar McDonald
Published in Ollie · 11 min read · Dec 5, 2023

Ollie helps people organize their photos to make them useful. People take a lot of photos, but often get overwhelmed going through their personal photo collection, having to sort through screenshots, a blurry photo of a funny car, and a beautiful sunset photo. It’s easier to take a photo than to browse through or organize your photos.

To make this stream of photos useful for people, we help them organize them into three categories:

  • Favorite (top 20%): The beautiful sunset photo, or the best photo of ten in a burst shot
  • Keep: The receipt you need to keep for work
  • Trash: That blurry photo or screenshot you don’t care about anymore.

Manually sorting your photos one by one into categories gets tedious fast, so our app suggests which category each photo should go in via a set of AI models we have created.

This presents another problem, though. We have found that what people like in photos is unique to them. Some may like food photos, others might like artistic blurry photos, and others hate both blurry photos and food photos. As a result, you usually cannot make a generic model that determines whether a photo is ‘good’ or not. If you could, we would be satisfied with the “zero explicit input” models that Apple and Google already use to create highlight reel albums for their users today.

This led us to our current strategy: bundle a set of seed models with our app and train them on device as the user expresses their preferences.

To solve this issue, we fine-tune part of our initial generic AI model based on the user’s explicit input. As people organize their photos with our app, we train our model with new photos from the user’s camera roll and their organization decisions. As the user corrects the model with their real preferences, it becomes more accurate at predicting their personalized preferences.

Since people’s photos are private, and uploading them to a centralized server to train a personalized model would be slow, we do this training on the user’s device in a rate/update cycle.
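At a high level, one pass of that cycle looks something like the sketch below. All the names here are illustrative rather than our actual API, and the real training runs on-device, not in Python:

```python
# Illustrative sketch of one pass of the on-device rate/update cycle.
# All names are hypothetical; the real implementation runs on the user's
# device, not in Python.

def rate_update_cycle(model, photo_library, user):
    # 1. The current model suggests a rating for each new photo.
    suggestions = {photo.id: model.predict(photo)
                   for photo in photo_library.new_photos()}

    # 2. The user confirms or corrects those suggestions while organizing.
    decisions = user.review(suggestions)  # {photo_id: "favorite" | "keep" | "trash"}

    # 3. The corrected labels become new on-device training examples.
    examples = [(photo_library.get(pid), label)
                for pid, label in decisions.items()]

    # 4. Fine-tune only the personalization layers on-device, then repeat
    #    the cycle the next time the user organizes photos.
    model.fine_tune(examples)
    return model
```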

Current Model Architecture

We use multiple independent models in Ollie to implement its functionality today. We have various models for:

Photo Set Classifiers: Evaluate photos as part of a set

  • Determining if a photo is a duplicate or related in a cluster
  • Choosing the best photo in a cluster

Single Photo Classifiers: Evaluate each photo independently of the others

  • Trash vs. Non-Trash Photo: “Should we delete this photo?” classifier
  • Favorite vs. Non-Favorite Photo: “Is this a good photo?” classifier
  • Detecting screenshots
  • Initially targets about 20% of photos to be marked as ‘favorite’, following the Pareto distribution.

Basic Favorite Model Example

To make our current favorite rating model, we use transfer learning. Transfer learning is a machine learning technique where a model developed for a specific task is repurposed as the starting point for a model on a second task. It leverages knowledge (features, weights, etc.) from the initial task to improve learning efficiency and performance on the new task, especially useful when the new task has limited data. This approach is akin to applying insights gained from solving one problem to a related but different problem, capitalizing on prior learning to accelerate subsequent learning processes.

In our case, we add a few layers on top of MobileNet. MobileNet is a small image classifier model that takes a photo as input and returns classification tags such as “beach” or “car” describing the contents of the photo. It’s a fairly small and lightweight model that executes quickly with good performance for its size. We feed its learned features into our custom layers on top of it, then train the new composite with our internal, employee-curated photo dataset to create our initial base model.
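As a rough illustration, here is what that kind of transfer-learning setup might look like in Keras. The layer sizes, head shape, and hyperparameters are illustrative, not our exact production topology:

```python
# Minimal transfer-learning sketch: a frozen MobileNet backbone with a small
# trainable head on top. Layer sizes and hyperparameters are illustrative.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3),
    include_top=False,      # drop the original ImageNet classification head
    pooling="avg",          # pool features into a single embedding vector
    weights="imagenet",
)
base.trainable = False      # keep the pretrained features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),   # custom layers on top
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # favorite vs. non-favorite
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# model.fit(curated_photo_dataset, ...)  # trained on an internal curated set
```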

We are currently developing a far more complicated multi-modal model set that determines whether a photo is a favorite, keep, or trash in one shot, while still being fast to execute and updatable on the user’s device. We are running many different experiments and it is an active work in progress, but our first attempt has been surprisingly effective.

AI on the Edge

Currently we train our initial seed model on employee photos with our own desktop & cloud computers. Once we create a topology & initial data set that works well with employee data, we add it to the model experiment list on our CDN.

Once it’s added to the list, users’ apps download the new model dynamically and start running it alongside the model the user is currently using. We then watch the accuracy, precision & recall it generates on users’ phones after multiple training runs with their own photo data, and compare it against the current top model.

If the new model is a winner, it replaces the previous model the user was using; if it isn’t, we don’t fully deploy it. We don’t know the contents of the user’s photos when we track this: the app only reports what our AI recommended a photo be rated as, whether the user readjusted that rating, and which model was used. Basically a (random_photo_uuid, ai_rating, user_rating, model_id) log. Essentially, we are doing model A/B testing on our users’ phones.
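Conceptually, each record in that log is just a small tuple. Here is a minimal sketch of how such records could be compared across candidate models; the field and function names are illustrative, and no photo contents ever leave the device:

```python
# Sketch of the kind of anonymous telemetry record described above.
# Field names are illustrative; only ratings and random ids are reported.
from dataclasses import dataclass

@dataclass
class RatingEvent:
    random_photo_uuid: str   # random per-photo id, not derived from the photo
    ai_rating: str           # what the model suggested: "favorite" | "keep" | "trash"
    user_rating: str         # what the user actually chose
    model_id: str            # which candidate model produced the suggestion

def accuracy(events: list[RatingEvent], model_id: str) -> float:
    """Share of suggestions the user left unchanged for a given model."""
    scored = [e for e in events if e.model_id == model_id]
    if not scored:
        return 0.0
    return sum(e.ai_rating == e.user_rating for e in scored) / len(scored)
```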

We also plan to create a ‘retraining’ system that retains an embedding cache (~25 KB per photo) from our base MobileNet model for photos deleted from the user’s phone. This would let us bring new models up to speed faster, but we haven’t implemented it yet. The embedding cache will remain on the user’s phone to preserve their privacy.
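A hedged sketch of what such an embedding cache could look like, assuming the frozen MobileNet backbone produces the per-photo feature vector (all names here are illustrative, not an implemented API):

```python
# Sketch of the planned embedding cache: store a compact MobileNet feature
# vector per photo instead of the photo itself, so new models can be
# retrained quickly on-device. All names are illustrative.
import numpy as np

class EmbeddingCache:
    def __init__(self, embedding_model):
        self._embedding_model = embedding_model   # e.g. the frozen MobileNet backbone
        self._store: dict[str, np.ndarray] = {}   # photo_uuid -> feature vector

    def add(self, photo_uuid: str, image_tensor: np.ndarray) -> None:
        # Keep only the pooled feature vector; the pixels never leave the
        # device and can be discarded (or deleted) afterwards.
        embedding = self._embedding_model.predict(image_tensor[None, ...])[0]
        self._store[photo_uuid] = embedding.astype(np.float16)

    def training_pairs(self, labels: dict[str, str]):
        # Pair cached embeddings with the user's ratings so a new head model
        # can be retrained without re-decoding photos that may already be gone.
        return [(self._store[uuid], label)
                for uuid, label in labels.items() if uuid in self._store]
```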

Preparing the Data

A significant part of building a great model starts from data engineering — collecting the raw information, understanding the “shape” of it, and preparing it for the model.

This is also a big part of what we do in training our initial generic model¹ for first installs, and in how we feed data into the fine-tuning process that updates the user’s personalized model.

[1] Our generic initial model is only trained with employee data or otherwise donated photo curation sets. We don’t use any photos or data from our actual users to train our initial generic model.

Example AI Photo Organization Data Issues

With photos, you still need to apply some heuristics in how you initially feed data into the model. A few classes of ‘bad’ data we discovered caused problems are:

Not Handling Similar or Duplicate Photos Properly

When people take photos, they often take 5+ shots of the same scene in short succession, then choose only one or two to keep and delete the rest. A similar situation is keeping one of several identical photos and deleting the rest.

Take a cluster of sunset photos: they’re all good photos on their own, but because you have many similar ones, you would probably pick only one of them as a favorite, and delete or keep the rest.

Naively feeding these results into any model will confuse it greatly, because the base model you transfer-learn from will probably output nearly identical features for photos that carry different training labels. This introduces noise into the dataset, and certain user behavior patterns can confuse it further if you naively train on photo clusters. With the right “bad” combination of user preferences and clusters, you end up with a useless photo organization model.

To deal with this, we filter clusters in the training set. Currently we take only the single “top” photo of each cluster, according to a simple decision tree, and feed that into the model. So if you have 1 favorite and 4 “keeps”, we train the model only with the favorite and skip the “keeps” to avoid this issue.

This also means you need a completely different model for evaluating ‘what’s the best photo in a cluster’ vs. ‘what rating should this individual photo get’.
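Here is a minimal sketch of that filtering rule, assuming clusters have already been detected and each photo carries the user’s rating. The data shapes and the simple max-by-rating rule are stand-ins for our actual decision tree:

```python
# Hedged sketch of the cluster-filtering rule described above: from each
# similar-photo cluster, keep only the single "top" photo for training and
# drop the rest, so near-identical images don't get conflicting labels.

RATING_ORDER = {"favorite": 2, "keep": 1, "trash": 0}

def training_photos(clusters: list[list[dict]]) -> list[dict]:
    """clusters: each inner list holds photos like {"id": ..., "rating": ...}."""
    selected = []
    for cluster in clusters:
        if len(cluster) == 1:
            selected.append(cluster[0])   # lone photos train as-is
            continue
        # Take only the highest-rated photo in the cluster (a stand-in for
        # the simple decision tree) and skip its near-duplicates.
        top = max(cluster, key=lambda p: RATING_ORDER[p["rating"]])
        selected.append(top)
    return selected
```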

Data Imbalance Issues

People tend to like only the top 20% of their photos, and often do not have a set of ‘deleted’ photos lingering on their phone the way they do with their current favorites. This makes bootstrapping model personalization more difficult, because you don’t want a model that just guesses according to the distribution of favorite, keep, or trash photos the user tends to choose. This means you have to use various techniques, such as sampling to keep the training dataset balanced so the model actually learns the content of photos, along with other techniques we won’t go into here.
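One simple version of that rebalancing is to downsample the over-represented classes before training. A sketch of that idea (our actual pipeline may combine this with other techniques, such as class weighting):

```python
# Hedged sketch of rebalancing the on-device training set by downsampling
# the over-represented classes, so the model learns photo content rather
# than the user's base rate of each rating.
import random

def balanced_sample(examples: list[tuple], seed: int = 0) -> list[tuple]:
    """examples: (photo, label) pairs with labels like 'favorite'/'keep'/'trash'."""
    by_label: dict[str, list[tuple]] = {}
    for photo, label in examples:
        by_label.setdefault(label, []).append((photo, label))

    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))  # equal count per class
    rng.shuffle(balanced)
    return balanced
```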

Data Drift (The “Ex-Partner” Issue)

Another issue is something we call the ‘ex-partner’ problem. You favorite all the photos of your partner, then you break up, and now all the photos of your partner go into the trash, yet you still want to favorite photos of your kids even when your ex is in them. Or you are really into motorcycling, have a motorcycle accident, and never want to see motorcycle photos again. Like the duplicate photo issue above, this creates chaotic noise: the user liked something a lot, then stopped liking it on a dime.

This is a pretty complicated issue that depends on the nature of the data drift and the user’s preferences, and it is not something we have gotten to fixing yet, since our app hasn’t been around long enough for it to develop into much of a problem.

We have several ideas, such as ‘period models’, where we create multiple models that each govern a specific time range, or a feature that removes certain things like ‘motorcycles’ or ‘this face’ from the user’s training set and retrains their personal models with the new dataset. It will be a pretty fun issue to solve.

We even have the funny idea for a mini app that will edit your ex out of all of your photos via cropping or AI subject erasing, but that is for the future 🙂.

Stalin would’ve loved an “erase your ex” app!

The Issue with Apple & Google Photo Data

As you can see, there are many interesting data curation issues in personalized photo curation. If you look at the AI albums that Apple and Google Photos create, they are very thematic, based on a sophisticated combination of heuristics & AI models that find themes such as “beach time”, “vacation”, or “a face you take a lot of photos of”. They then observe user engagement with these generated album types and make more of the ones that increase engagement metrics.

The problem with these albums is what I call the “line up at the stadium” issue. They can include photos that aren’t that great on their own, such as a photo of the line up outside the stadium, because they’re attached to an event you really care about and engage with: going to the stadium and seeing the ball game. They tend to pick photos that are a mix of meh and good. But if they add a filter for line ups, they might exclude good photos with your buddies in the line up. It’s a hard thing to balance.

The Luxury of the Explicit Signal

Apple and Google work with the assumption that the user gives no explicit signal (an up-or-down for every photo) and few implicit signals (favoriting, sharing, engagement). Because of that, it is incredibly difficult for them to gather a dataset about what a user actually cares about within their personal photos.

To be clear, it’s a completely reasonable product decision to build the “hands off” model.

  • Human nature: It’s really difficult to get people to engage with their photos habitually. It’s just a lot of work. We’re dedicated to making that a reality, but right now only a small percentage of users care enough about their photos to be drawn to it. For us, that’s enough to build a business on. For Google and Apple, who have 1B+ users, it doesn’t move the needle at all. This is the classic “innovator’s dilemma”.
  • Business incentives: Users tend to delete about 50% of their photos once they really get going, so genuinely helping users organize their photos would reduce cloud storage revenue; they would be working against their own business interests.

With our app, though, our incentives are aligned. We give users the final choice on every photo. This lets us quickly learn their explicit preferences and create better results than Apple Photos and Google Photos do. Our aim is not to replace these products, which are frankly excellent in their own right; our aim is to work alongside them and make the experience of using them even better. For example, say you spent a weekend in Tahoe and Apple Photos creates a “memory” for it. If you didn’t give any explicit signals, the output would be a rather indiscriminate set from Tahoe. With Ollie, because you’re favoriting ~20% of the photos from that trip, the “memory” that gets created will reflect your own preferences. Eventually, our goal is to auto-organize a user’s photo stream once accuracy hits a certain point and show them only the good photos.

Summary

  • We have made a local, private AI app that learns what you like as you organize your photos.
  • We currently use many small models at Ollie to implement our features, one example being a MobileNet-based model specialized for our rating case.
  • We are actively developing a multimodal model for all rating types that should be much better.
  • Like most AI projects, a significant chunk of our work is data management. Learning to deal with these issues and learning about the problem space is, for the most part, the interesting part. There are still many things to learn!
  • Apple & Google are stuck on organizing photos well due to various internal structural counter-incentives that keep them from collecting explicit preference data from their users, which makes this a prime problem for a small AI startup to solve.

Next Articles

This is part of a series of blog posts where we go into:
