Voice Search vs. Visual Search

What brands need to know

Published in

IPG Media Lab

9 min read · Oct 12, 2017

Voice search on Alexa vs. Visual search on Pinterest Lens

Over the past decade, the shift from desktop to mobile has changed the way people search. Indeed, by May 2015, Google reported that over 50% of its search queries came from mobile devices, and that number has been steadily growing since. This has made search not only more readily accessible at any time or place, but also more contextually relevant, thanks largely to the GPS location services on our phones. Naturally, many search marketers adapted their SEO and targeting tactics to capitalize on the opportunities that mobile search brought.

Now, thanks to rapid advances in artificial intelligence, we’re on the verge of another tectonic shift that will see search transform from a mobile-first practice to an AI-led one. Market research firm Gartner predicted that by 2019, more than half of all mobile searches will be voice or visual searches. With voice search, powered by natural language processing (NLP), and visual search, powered by computer vision, search is about to become far more intuitive than it is today. Together, they will revolutionize the way that people look for information, conduct pre-purchase research, and discover new products and services.

The Current Landscape of Future Search

Before jumping into drafting new SEO strategies and revamping the sales funnel, we should first examine the current landscape of both new search methods.

In the past two years or so, the consumer tech industry has been laying the groundwork for voice and visual search. Voice search currently has a solid lead over visual search in terms of deployment. Voice assistants on mobile such as Apple’s Siri and Google Assistant (the successor to Google Now) have supported voice search since they launched roughly half a decade ago. User adoption has been steadily growing, and as of January 2017, about half of online consumers were using voice search on a daily basis.

But the latest development in voice search is quickly upending the status quo. The surging popularity of Amazon’s Echo smart speakers is effectively unbundling voice search from mobile and integrating search into connected home devices. As a result, Alexa is quickly emerging as a worthy challenger to the incumbent Google in voice search, making up for the shortcomings of running on Microsoft’s Bing search engine by pulling information from third-party Alexa skills and, for shopping-related queries, from Amazon’s own site. Interestingly, Apple recently switched the search engine for Siri from Bing to Google, which should improve Siri’s web search results and make it better at conducting voice search.

In comparison, visual search is still in a nascent stage of development. The “search by image” feature of Google Images, launched back in 2011, serves as a predecessor to the visual search tools in development today. But unlike today’s more sophisticated visual search tools, which come with object recognition, it simply searched for identical or similar-looking images across the internet by matching pixels.

In the past few months, social scrapbooking site Pinterest has emerged as an unexpected leader in bringing visual search to market with Pinterest Lens. In February, the company launched this visual search tool on its iOS app as a public beta for mobile users to try out. Then in March, it was revealed that Samsung would use Pinterest Lens to power the visual search functions of its virtual assistant Bixby on its Galaxy S8 phones. Fast forward to late September, when Target announced that it would start integrating Pinterest Lens into its mobile app to aid shopping and product discovery.

Google Lens was one of the major announcements of the Google I/O 2017 keynote in May. Though it’s easy to see how Google may integrate it into future versions of Android as a built-in feature, for the time being, Google is only talking about shipping a “preview” of Lens with its Pixel 2 phones later this year. Despite the slow and tentative rollout, Google Lens still has much potential to become the best visual search engine as Google’s capabilities in AI and computer vision grow stronger.

Beyond the two aforementioned Lenses, there are also some independent visual search startups that are providing custom solutions for brands. For example, fashion brands ASOS and Neiman Marcus both worked with unnamed tech partners to bring visual search into their apps, letting shoppers “snap and find” the items they are looking for. Then there is eBay, the ecommerce marketplace that is reportedly developing its own proprietary visual search engine to better help shoppers discover listings across its site.

Two Disparate Search Experiences

Despite their common AI-driven core, these two new forms of online search create two disparate user experiences that brands will need to distinguish and take into consideration when coming up with a new search strategy. Neither is inherently better than the other, but depending on the goal and context, one will sometimes be more suitable and efficient for maximizing customer engagement. Specifically, visual search and voice search differ from each other on the following three fronts.

Input Determines Search Focus

Text-based search requires very low bandwidth and works seamlessly across desktop and mobile devices. In contrast, voice search and visual search respectively use audio and image as input, therefore requiring a certain amount of bandwidth in order to trigger the search queries. The differences in input medium also mean that although both are more intuitive to use than text-based search, they each have their own limitations.

For example, when conducting voice search, users need to already know the proper names of the subjects they are searching for, whereas visual search is perfect for learning about unknown subjects thanks to the nonverbal nature of object recognition technologies. On the flip side, however, this means visual search is inherently tied to physical objects and the “here and now,” whereas voice search can easily handle queries about abstract concepts like the weather or business hours. Therefore, visual search places its main focus on discovery, whereas voice search focuses on convenient access to information and updates on known subjects.

Output Shapes User Experience

Voice search and visual search also deliver their results in fundamentally different ways, resulting in distinct user experiences. For voice search, the results are usually delivered via voice assistants in a conversational way, which means that only the top result (that is, what each voice assistant determines to be the most relevant result for the query) will be presented to the user. Those results are selected based on the priorities of the platform owners themselves, such as the retail partners they have or the location data provider they use, as well as their knowledge of the user and their preferences. Only when users actively follow up with more questions are more options surfaced. In comparison, visual search at its current stage focuses on surfacing all the relevant results to help users learn more about the objects they pointed their camera at.

The conversational nature of voice search lends itself to a friendly and intimate experience that is ready for follow-up questions, whereas the mechanical “point-and-snap” nature of visual search makes it more of a silent, one-and-done interaction that prioritizes efficiency over user-friendliness. For example, when users ask for the address of a local shoe store, they can follow up with a question about its business hours and parking availability. But when they take a picture of a pair of shoes for visual search in order to identify the brand, they don’t get to follow up with a question to see if the same pair of shoes comes in a different color.

Accessibility Impacts Use Cases

While voice and visual search are considerably easier and quicker to access than traditional keyword-based search, they each carry varying levels of friction and accessibility that shape their availability and use cases. Voice search is already available on most mobile handsets and is proliferating in the home with the growing adoption of smart speakers. As more and more home appliances become connected and voice-activated, it is not hard to imagine that voice search will be integrated into a bevy of smart home devices as well, opening it up to new contexts of usage. In the not-so-distant future, you might just be able to ask your smart fridge where to buy avocados or how to get Chinese food delivered.

Visual search, on the other hand, will be largely tethered to mobile devices for now because it requires a good camera and substantial processing power, both of which would be hard to build into wearables today. In addition, the “point-and-search” gesture required by visual search also means that it is unsuitable for most connected devices. The obvious path forward for unshackling visual search from handsets is smart glasses, which align the camera with the field of vision and therefore transform the trigger gesture from “point-and-search” to “look-and-search,” providing contextual information and actions for any object we can see. But until smart glasses can overcome their manufacturing and adoption hurdles, visual search will remain first and foremost a mobile feature.

What Brands Need To Do

Given these differences between the two, brands will need to think carefully when they tackle these two emerging search methods. In general, here are three principles that brands should heed when it comes to voice and visual search.

1. Index your site and work with platform owners

The shift toward voice and visual search will present new opportunities for brands to work with search platform owners to make sure that the search results deliver the relevant branded content to users. This is especially urgent in voice search, where only top results are likely to be surfaced.

Brands will need to revamp their websites and online stores to make sure that all images and site maps are properly indexed and optimized for voice and visual search engines to parse. According to Gartner’s estimate, early adopter brands that redesign their websites to support visual and voice search will increase digital commerce revenue by 30% by 2021.
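
For teams wondering what properly indexed images can look like in practice, one common approach is an image sitemap, which lists the images that appear on each page so that crawlers can find them. The short Python sketch below is only an illustration under stated assumptions: the function name `build_image_sitemap` and the example.com URLs are hypothetical, and it covers general image indexing rather than the requirements of any particular voice or visual search engine.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the sitemaps.org protocol and Google's image sitemap extension.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """Return sitemap XML as a string.

    `pages` maps a page URL to the list of image URLs shown on that page.
    (Hypothetical helper, written for illustration only.)
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        # One <image:image> entry per image that appears on the page.
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URLs, not real product pages.
    print(build_image_sitemap({
        "https://example.com/shoes": [
            "https://example.com/img/shoes-front.jpg",
            "https://example.com/img/shoes-side.jpg",
        ],
    }))
```

A sitemap like this only tells crawlers where the images are. Descriptive file names, alt text, and product metadata on the pages themselves would still need separate attention.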

2. Determine which one of the two to prioritize

The inherent differences between voice and visual search result in different use cases. From a brand marketing perspective, this means that service-oriented brands, such as those in travel and fitness, will be relying more on voice search, while product-oriented brands, such as those in CPG and fashion industries, will be more geared towards visual search.

For some brands, both voice and visual search will be important lead-generating channels for different aspects of the business. For example, auto brands will need to invest in visual search so that buyers can easily identify their car models, and they will also need to optimize for voice search to make it easier for people to find their nearest dealerships. For entertainment brands, voice search would be great for finding local theaters and checking showtimes, and visual search would be great for surfacing additional content from out-of-home ads.

3. Get ready to move from SEO toward SCO

As more and more consumers start to pick up voice and visual search, we will soon enter a whole new world of search optimization where traditional SEO practices won’t suffice. As for potential media opportunities, search platform owners may add sponsored messages to voice search or “sponsored similar items” to visual search, but none has yet figured out how to do so without severely disrupting the user experience. Nevertheless, brands need to look ahead and start shifting their SEO focus toward an SCO (Search Channel Optimization) approach that acknowledges and optimizes for the differing characteristics and applications of the two new search methods discussed earlier.

Looking ahead, we see an inevitable convergence of the two AI-powered search methods to create a full-on AI search. Instead of using just one input medium (text, image, or audio), we will be using all three to ask our AI assistant to search for whatever we are seeing, hearing, or thinking about, just like we would ask a human friend in real life. Contextual information such as location, time of the day, and the weather, coupled with a myriad of personal preferences, will all be taken into account in delivering the most relevant search results. But until that convergence happens, brand marketers should focus primarily on optimizing search in terms of the expanding platforms and channels.