What I learned from Photo Bot

It’s been an interesting week since we first released Photo Bot. I thought I’d share a few insights from examining how each of the AIs performed on the dozens of photos I sent personally, as well as those from friends and others who signed up to use Photo Bot. As a reminder, here’s the link to the demo: https://muxgram.com/photobot

Google

Google is very good at identifying specific details. When I took a photo of a bowl of mussels, Google was the only one that identified the image as “mussels”, listing it among its top descriptors. Microsoft just saw “food”, while Amazon got closer with “seafood” but guessed incorrectly that it was clams.

Google also seems especially good at identifying the make and model of a car. In my previous post, I noted that it accurately identified my car as a BMW X3. More recently, walking home after lunch one day, I took a picture of a nice sports car parked on the side of the road, and Google correctly identified it as a Ferrari 458.

However, Google doesn’t seem to do as well with older cars, like this classic Mercedes 250SL, which it could only identify as an “antique car” or “luxury vehicle”.
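For the curious, here’s a minimal sketch of how label detection like this might be queried from Google’s Cloud Vision API. To be clear, this isn’t Photo Bot’s actual code; it assumes the google-cloud-vision Python client and that Google Cloud credentials are already configured.

```python
# Minimal sketch of Google Cloud Vision label detection (not Photo Bot's actual code).
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key.
from google.cloud import vision

def google_labels(image_path):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    # Each annotation pairs a descriptor ("mussels") with a confidence score.
    return [(label.description, label.score) for label in response.label_annotations]

print(google_labels("mussels.jpg"))
```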

Amazon

Amazon’s performance seemed more erratic. Sometimes it would be the only one to identify the object correctly, as with this photo of a peacock.

But then it would fail to identify something as simple as a dog in the photo, mistaking it for a potted plant.

More often, though, Amazon would be in the right ballpark for general objects, as with this lamp.

Overall, it does OK with simple objects, but when it tries to be specific it tends to be either very right or very wrong.
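For comparison, the equivalent call to Amazon goes through Rekognition. Again, this is a sketch rather than Photo Bot’s actual code, assuming boto3 and configured AWS credentials:

```python
# Minimal sketch of Amazon Rekognition label detection (not Photo Bot's actual code).
# Assumes AWS credentials are configured in the environment.
import boto3

def amazon_labels(image_path, max_labels=10):
    client = boto3.client("rekognition")
    with open(image_path, "rb") as f:
        response = client.detect_labels(
            Image={"Bytes": f.read()},
            MaxLabels=max_labels,
        )
    # Each label pairs a name ("Peacock") with a confidence percentage.
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]

print(amazon_labels("peacock.jpg"))
```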

Microsoft

Microsoft is very good at describing how general objects relate to one another. It completely nailed this photo my friend Karen sent in as “a statue of person riding a horse in the city”.

Same with this photo my friend Geoff took while he was driving: “a car driving down a busy highway” was the result.

But Microsoft can also produce nonsensical descriptions, like this one of a tow truck: “a yellow and black truck sitting on top of a car”.

And because Microsoft tends to look for general objects it’s already familiar with, like a car or a road, it can completely miss the most prominent object in the photo. In this case, it missed the mailbox and only saw the car parked on the side of the road. (To be fair, all three missed the mailbox in this photo.)
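Those sentence-style captions come from the “describe” feature of Microsoft’s Computer Vision API. Here’s a rough sketch of calling it over REST; the endpoint region, API version, and key handling are assumptions on my part, not Photo Bot’s actual code:

```python
# Minimal sketch of Microsoft's Computer Vision "describe" endpoint,
# which produces captions like "a statue of person riding a horse in the city".
# The region, API version, and key are assumptions (not Photo Bot's actual code).
import requests

ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/describe"

def microsoft_captions(image_path, subscription_key):
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/octet-stream",
    }
    with open(image_path, "rb") as f:
        response = requests.post(ENDPOINT, headers=headers, data=f.read())
    response.raise_for_status()
    # The service returns candidate captions ranked by confidence.
    captions = response.json()["description"]["captions"]
    return [(c["text"], c["confidence"]) for c in captions]

print(microsoft_captions("tow_truck.jpg", subscription_key="YOUR_KEY"))
```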

So where does this all lead?

These comparisons make me think about user expectations. People submitting photos seemed to want Photo Bot to work like web search: send a picture of a restaurant, and the AI should identify the specific restaurant. A picture of a skyscraper should identify which skyscraper it is, and a picture of a shampoo bottle should identify the specific product.

That expectation may be a window into connecting the offline and online worlds. If a computer vision AI system could identify a specific product, landmark, or location, and do it in real time, now that would be amazing. I think this experiment also points to AR as the ideal device for manifesting this kind of computer vision AI. Given what I’ve seen so far with Photo Bot, I don’t think it’s ready yet, though it’s probably much closer than the general public might assume.
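Interestingly, the landmark piece of that expectation is something today’s APIs already attempt. As a hedged sketch (same google-cloud-vision client and credential assumptions as before), Google’s API exposes landmark detection that returns a landmark’s name and coordinates:

```python
# Minimal sketch of Google Cloud Vision landmark detection, one piece of the
# "which skyscraper is it?" expectation (same assumptions as the earlier sketch).
from google.cloud import vision

def google_landmarks(image_path):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.landmark_detection(image=image)
    # Each annotation names the landmark and includes its latitude/longitude.
    return [
        (landmark.description,
         [(loc.lat_lng.latitude, loc.lat_lng.longitude) for loc in landmark.locations])
        for landmark in response.landmark_annotations
    ]

print(google_landmarks("skyscraper.jpg"))
```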

If Magic Leap actually delivers on its promise, and Google Lens performs exceptionally well in the real world with a very high detection rate, that could be a killer combination. But I do wonder if the initial successes will come from enterprise applications rather than consumer products. True, that’s different from what’s happened in the last couple of decades: Google, Facebook, and Amazon all grew into massive consumer product successes. But for decades before that, Microsoft, IBM, and Intel were built on enterprise success. Apple is the anomaly, having spanned both eras with phenomenal success.

In the next couple of decades, emerging technologies like AI, self-driving cars, AR/VR, and drones may swing back toward enterprise success. This is not to say that such technologies won’t reach the average consumer eventually, but it might be the enterprise that creates the first successful business models and defines the next generation of tech companies.

All of this reminds me of the classic 1969 Honeywell ad for the “Kitchen Computer”, when the company was trying (very early!) to sell the notion of a computer for everyday use; it’s predictably awful that the popular use they envisioned was women in kitchens. I do think companies will try to sell AI, VR, drones, etc. to the mass market from the get-go (in fact, they are already doing so), but as with the Kitchen Computer, my guess is that there will be a timing mismatch between when the average consumer is ready and when the technology is good enough. Those with the resources (Google, Amazon, Apple, Facebook) can stick it out for long-term wins; startups that dig into this arena too early on the consumer side may not make it.