Visual search: The next generation of search interaction and how ViSenze is taking on the challenge

by Li Guangda, CTO and Co-Founder of ViSenze

ViSenze — visual search for fashion

Non-text search is on the rise.

For the past few decades, text search has been the main way people interact with computers to find information. However, new forms of search that use images and voice as input have become more popular in recent years.

Google’s CEO Sundar Pichai revealed in his keynote speech at I/O earlier this year that 20% of the search volume on Android mobile phones globally comes from voice search. That’s 1 in every 5 searches.

There are two likely reasons why voice search has finally taken off: recent advances in deep learning have pushed the accuracy of speech recognition above 90%, and millennials are getting used to this new way of human-computer interaction.

The demand for visual search

Visual search, on the other hand, uses images as queries and is an even newer way to find information. It offers users a totally different angle, allowing them to search for things they cannot describe in words.

Globally, billions of photos are uploaded to the internet from mobile and desktop devices, creating a huge demand for indexing and understanding this visual content as a foundation for building better search and discovery experiences.

ViSenze started working on visual search technology back in 2012, in a research lab at the National University of Singapore (NUS). Over the last 3 to 4 years, we have observed a rise in the usage of visual search, similar to the adoption rate of voice search. In fact, the number of API calls from our customers has increased by 300% since then.

Taobao, the Chinese online marketplace founded by Alibaba Group, has noticed a similar trend of rapid user adoption of visual search: 55.1% of visual searches on mobile are performed by Generation Z, and the GMV from visual search in a single day reaches tens of millions of RMB.

The rise of visual search can be attributed to improvements in search accuracy, driven by deep-learning-based training, as well as to the availability of cheaper yet more powerful computing resources.

ViSenze’s approach to building visual search technology

Our visual search technology is built upon many years of research at NUS, a world leader in multimedia and social media research. Even though we are very focused on R&D, our efforts are heavily driven by customer requests. We want to bring more value to our customers while polishing our core product capabilities at the same time.

So, after years of R&D and working with clients to improve our core product, our key competitive advantages lie in the following areas:

Deep understanding of vertical domain

In the process of matching a target object, we extract a range of visual features using models trained on specific vertical domains, such as fashion. We explore, iterate on and build many advanced deep learning models to arrive at the best features for each domain.

We then develop (deep) learning algorithms to fuse these features for effective retrieval. The system is then robustly and continuously tuned based on our patented technology, deep algorithmic knowledge, and in-house know-how (a trade secret).
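To make the idea of fusing features concrete, here is a minimal sketch of one common approach, weighted late fusion. It is an illustration only, not the method used in our system: the feature names, the weights and the `fuse_features` helper are all hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse_features(features, weights):
    """Weighted late fusion (illustrative assumption, not a production method).

    Each feature vector is L2-normalized, scaled by the square root of its
    weight, and concatenated. The dot product of two fused vectors is then
    the weighted sum of the per-feature cosine similarities.
    """
    parts = [
        np.sqrt(w) * l2_normalize(np.asarray(f, dtype=np.float64))
        for f, w in zip(features, weights)
    ]
    return np.concatenate(parts)

# Hypothetical colour and shape descriptors for a query image and a product.
weights = [0.3, 0.7]
query = fuse_features([[0.9, 0.1, 0.0], [0.2, 0.8]], weights)
product = fuse_features([[0.8, 0.2, 0.0], [0.1, 0.9]], weights)

# Similarity of the fused representations (higher means more alike).
similarity = float(query @ product)
```

Because the weights sum to 1 and every feature is normalized, each fused vector has unit length, so the similarity stays in the range of a cosine score.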

Continuous self-learning capability

Our overall system is an advanced AI system with continuous self-learning capability. This is achieved by constantly tuning and improving our machine learning algorithms for object detection, feature extraction, feature integration, and more.

We also employ our in-house technology to harvest more training images to improve the search accuracy significantly. Through rigorous usage (processing millions of images every day), our system improves its performance over time.

Robust backend built for accelerated processing on distributed architecture

Our backend system achieves large-scale, high-speed matching through several innovations.

First, we introduced feature compression, which shrinks the overall feature set used to represent visual objects without sacrificing accuracy. This was done through algorithmic improvement and careful testing and tuning. A compressed feature set saves not only computation but also storage.
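As a rough illustration of what feature compression can look like, the sketch below applies simple 8-bit scalar quantization to float features. This is a generic textbook technique chosen for the example, not a description of our own compression algorithm, and the random features stand in for real image descriptors.

```python
import numpy as np

def quantize(features):
    """Map float features to uint8 codes using a per-dimension min/max range."""
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    # Guard against constant dimensions to avoid dividing by zero.
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((features - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximately reconstruct the original floats from the uint8 codes."""
    return codes.astype(np.float32) * scale + lo

# Stand-in data: 1,000 hypothetical 128-dimensional float32 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128)).astype(np.float32)

codes, lo, scale = quantize(features)
restored = dequantize(codes, lo, scale)

compression_ratio = features.nbytes / codes.nbytes  # float32 -> uint8
max_error = float(np.abs(restored - features).max())
```

Here the codes take a quarter of the original storage, and the reconstruction error per value is bounded by about half a quantization step. Whether that error is small enough to leave search accuracy intact has to be checked on real data.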

Second, our hashing technology came into play. We developed advanced algorithms to map similar objects into the same hash space, which allows us to retrieve a large set of similar objects very quickly for more detailed processing.
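The general idea of hashing similar objects together can be sketched with random-hyperplane locality-sensitive hashing, a well-known public technique. It is used here purely as a stand-in, since our own hashing algorithms are not described in this article; the `HyperplaneIndex` class and the item names are hypothetical.

```python
from collections import defaultdict

import numpy as np

class HyperplaneIndex:
    """Toy random-hyperplane LSH index (illustrative stand-in only)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # one hyperplane per bit
        self.buckets = defaultdict(list)              # hash key -> item ids
        self.vectors = {}                             # item id -> vector

    def _key(self, v):
        # Each bit records which side of a hyperplane the vector falls on,
        # so vectors pointing in similar directions tend to share a key.
        return ((self.planes @ v) > 0).tobytes()

    def add(self, item_id, v):
        v = np.asarray(v, dtype=np.float64)
        self.vectors[item_id] = v
        self.buckets[self._key(v)].append(item_id)

    def query(self, v, top_k=5):
        v = np.asarray(v, dtype=np.float64)
        # Cheap step: fetch only the candidates in the query's bucket.
        candidates = self.buckets.get(self._key(v), [])

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        # Detailed step: re-rank the candidates by exact cosine similarity.
        ranked = sorted(candidates,
                        key=lambda i: cosine(v, self.vectors[i]),
                        reverse=True)
        return ranked[:top_k]

base = np.array([1.0, 2.0, -0.5, 0.3])
index = HyperplaneIndex(dim=4)
index.add("shoe_1", base)
index.add("shoe_2", 2.0 * base)   # same direction, so same bucket
index.add("bag_1", -base)         # opposite direction, so every bit flips

matches = index.query(base)
```

The lookup touches only one bucket, so the expensive exact comparison runs on a small candidate set. A real system would typically use several hash tables to reduce the chance of missing near neighbours.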

Third, we implemented distributed processing in the cloud, spreading the computation across multiple systems.
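One simple way to picture distributed matching is a sharded index: each shard searches its own slice of the catalogue and a coordinator merges the partial results. The sketch below simulates the shards with threads on a single machine, which is an assumption made for the example; the function names and data are hypothetical and do not reflect our actual architecture.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def search_shard(shard, query, top_k):
    """Return the top_k (score, item_id) pairs from one shard."""
    ids, vectors = shard
    scores = vectors @ query                 # dot-product similarity
    best = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), ids[i]) for i in best]

def distributed_search(shards, query, top_k=3):
    """Search every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, query, top_k), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(top_k, merged)

# Stand-in catalogue: 9 unit-length vectors split across 3 shards.
rng = np.random.default_rng(1)
vectors = rng.normal(size=(9, 4))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
ids = [f"item_{i}" for i in range(9)]
shards = [(ids[i:i + 3], vectors[i:i + 3]) for i in range(0, 9, 3)]

results = distributed_search(shards, query=vectors[4])
```

Because every shard returns its own top k, the merged top k is the same as searching the whole catalogue at once, while each shard only scans a fraction of the data.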

And lastly, we developed an advanced system to facilitate scalable cloud resource acquisition to meet dynamic computation requirements at minimal costs.

Our entire backend system is built for robustness, efficiency and scalability. Many clients have put it to the test during evaluations, stress testing it with extreme search requirements, and it has delivered in each situation.

Full flexibility in tuning visual search results

When using our visual search API, users can define search parameters and a data schema based on their requirements. For example, search results can be filtered by price, brand or any other attribute they choose.
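The effect of such filters can be illustrated with a small sketch. It does not call or reproduce the ViSenze API: the `Hit` record, its fields and the `filter_hits` helper are hypothetical names invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One visually similar product (hypothetical schema for illustration)."""
    product_id: str
    score: float   # visual similarity, higher is better
    brand: str
    price: float

def filter_hits(hits, brand=None, max_price=None):
    """Keep hits matching the optional filters, ordered by similarity."""
    kept = [
        h for h in hits
        if (brand is None or h.brand == brand)
        and (max_price is None or h.price <= max_price)
    ]
    return sorted(kept, key=lambda h: h.score, reverse=True)

hits = [
    Hit("dress_01", 0.97, "BrandA", 120.0),
    Hit("dress_02", 0.91, "BrandB", 45.0),
    Hit("dress_03", 0.88, "BrandA", 60.0),
    Hit("dress_04", 0.80, "BrandA", 35.0),
]

# Visually similar items from one brand under a price ceiling.
affordable = filter_hits(hits, brand="BrandA", max_price=80.0)
```

The filters narrow the candidate set while the visual similarity score still decides the final order.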

In many cases, users may be interested in an object other than the main one detected automatically by the system. To offer more flexibility, we provide a simple UI that lets users select the object they are interested in.

It is very difficult to find another player in this market that has reached the same standard. The level of customization we provide to developers through the API is also thorough and user-friendly.

Systematic evaluation and automated testing procedures

In addition, we have developed an in-house quality control procedure that enables us to continuously improve our search speed and accuracy. It is all about systematic evaluation, fast feedback flow and performance monitoring.

ViSenze’s roadmap to raise the bar further

As a tech start-up, while we sharpen our lead in providing the best visual search solutions in the market, we are also building new technologies and planning for long-term growth.

Image recognition technology, large-scale video processing capacity, as well as big data analytics capacity, are all part of the plan:

  1. Recognition technologies. These enable us to identify a wide range of general objects, as well as domain-specific objects such as different styles of apparel or shoes. The technology is available for both images and videos. We already have some paying customers using it, even though we have yet to productize the technology.
  2. Video recognition technologies. With these, we are able to perform in-video object recognition, and hence advertising, in real time. We have partners in the US and China who are interested in applying our technology.
  3. Big data analytics. With a huge amount of user data and access history, we are mining the online behaviours of e-commerce users, along with their profiles, preferences and purchase habits.

We believe that the power of the technology — especially in deep learning and large-scale training — will push visual search applications to the next level.

Some advice for developers looking to get into AI

There are a lot of companies that treat the algorithm as their core competency. However, these companies will lose their edge if they cannot keep up with the trends in deep learning.

The rapid release of new deep learning algorithms and more mature open-source software packages has allowed competitors to have shorter R&D cycles. So rather than the algorithm itself, the most valuable asset is a company’s data acquisition capacity, especially for high-quality training data.

The algorithm is no longer a secret, and this has changed the R&D skill set requirements for a lot of industries. Since the deep learning field is still relatively new, there are a lot of areas in which engineers from different domains need to level up quickly: GPU programming, parallel computing infrastructure design, and a deep understanding of machine learning.

Engineers who can master the above will be the most sought after in the market. At the same time, there is a drop in demand for algorithm engineers who focus only on specific domains.

To learn more about trends in the visual technology space, attend Guangda’s speech during the AI With The Best Online Conference, taking place 24–25 September 2016, where he shares crucial insights and experience on how we built AI capabilities for the fashion vertical, and lessons learned along the way.

About the author: As the Principal Investigator behind the patent-pending visual recognition technology at ViSenze, Guangda has authored more than 19 international publications in video and image analysis. He obtained his PhD in media computing from NUS and was a post-doctoral fellow at the same university prior to co-founding ViSenze. Guangda is a regional finalist for the MIT Technology Review’s 35 Innovators Under 35 Awards.