Ganesha: Gateway to Product Understanding

Yusuf Azis Henny Tri Yudhantoro
Published in Tokopedia Data · 8 min read · Sep 26, 2019

Products, buyers, and sellers. These are the backbone of any Customer to Customer (C2C) marketplace platform. C2C is a model where the end-users of the platform are the customers: they can use the platform as sellers, buyers, or both. Tokopedia, as the leader of the C2C e-commerce market, has seen massive growth year after year. Our platform makes it easy for users to browse and instantly get what they want, which also attracts people to start their own online businesses.

With the growing population of buyers and the wide array of products being sold on our platform, transactions are happening at a rapid pace. Every year, more and more products are purchased on our platform. Showcased products have become the centerpiece of our concern, as they connect our buyers and sellers. We realized that to improve our customers' experience, we need to focus our efforts on understanding the products being sold on Tokopedia.

The number of products has grown 200-fold since 2013

Product problems

One of the most prominent problems is product depiction. Our users are free to enter the product's title, images, description, and any other properties that depict their product. It is not uncommon for some of these properties to be unclear or ambiguous. For example, product pictures are often blurred, making it harder for potential buyers to inspect the product. These cases reduce the discoverability and the appeal of the product, and ultimately hurt our customers: products with unclear depictions may suffer conversion loss. We have provided tips and tutorials on using our features to prevent this problem. And although only a small percentage of uploaded products are badly depicted, with a myriad of products uploaded daily, the loss accumulates.

Another problem is product terms and conditions (T&C) violations. A set of rules was established, based on the laws and regulations in Indonesia, defining what kinds of products are allowed to be sold on our platform. Although the rules are specific, many products still violate these T&C. The existence of prohibited goods is a danger to our customers and, even more so, to society itself. To ensure a clean and safe environment for all of our customers, we need to routinely check the products being sold on our platform.

Understand First, Better Experience Later

Understanding the data is a cornerstone of building a viable solution. With this in mind, we realized it is paramount for us to get a better understanding of our products. With the ever-increasing number of products being uploaded to Tokopedia, we utilize Artificial Intelligence (AI) to assist us in gathering this knowledge.

We learn from our product data to find out what makes a good title, description, and image, then generalize this as Machine Learning (ML) models that can assess our product properties and score them based on their 'quality'. Furthermore, we also created ML models to classify products that violate our T&C. These models are then deployed behind Application Programming Interfaces (APIs) that assess product properties.

With these APIs in place, we can help lift the conversion rate by including the quality score as a discoverability factor, and by automatically educating sellers, inside the system, to depict their products better. We also educate users not to upload prohibited goods, which keeps Tokopedia's platform clean for everyone at minimum cost.

Product understanding is not simple

ML APIs are a good way to aid manual assessment, but assessing a product is a little tricky. The challenge comes from the nature of the product itself. Take a look at this illustration to understand the situation.

Illustration of product states

When we ask a simple question, "what kind of product is product id 123?", the answer depends on when the question is asked. Suppose we uploaded an iPhone X product at t0. After a while, we realized it was actually supposed to be a case for the iPhone X, so we changed the title, image, description, and price at t1 and t2. These changes can happen indefinitely, until the product is deleted at tn.

Suppose we decided to trigger our models when the seller uploads a product. This is not sufficient, since sellers can change the product properties any time they want right after the upload. We call each of these change checkpoints a product state. Each product obviously has a different number of states, and we never know how many there will be, since the seller has full control. So it is clear that we need to keep track of the state to understand what the product is.
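As a rough sketch of how a state could be represented, here is a minimal Go version. All names here are hypothetical, not Ganesha's actual schema:

```go
package ganesha

import (
	"fmt"
	"time"
)

// ProductState is a snapshot of the properties a seller can change.
// Every add/edit/sweeping event bumps StateVersion, so the pair
// (ProductID, StateVersion) identifies exactly one state.
type ProductState struct {
	ProductID    int64
	StateVersion int64
	Title        string
	Description  string
	ImageURLs    []string
	Price        int64
	CapturedAt   time.Time
}

// Key is the identifier used to track a state, e.g. "123:2" for the
// second change of product 123.
func (s ProductState) Key() string {
	return fmt.Sprintf("%d:%d", s.ProductID, s.StateVersion)
}
```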

The processing time also differs from one inference to another. Generally, inference on text is faster than on image. If we only served a synchronous ML API, text processing would go well, but image prediction could hit timeouts. We need to manage the requests so that users can do fire and forget against the API: fire and forget is a mechanism where users hit the API once and let it return the results whenever the inference is complete.
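To make fire and forget concrete, here is a minimal sketch using the official go-nsq client for NSQ, the message queue Ganesha uses (more on that below). The topic names and payload are invented for illustration; only the NSQ calls themselves are real:

```go
package main

import (
	"log"

	nsq "github.com/nsqio/go-nsq"
)

func main() {
	cfg := nsq.NewConfig()

	// Fire: publish the inference request once and return immediately.
	producer, err := nsq.NewProducer("127.0.0.1:4150", cfg)
	if err != nil {
		log.Fatal(err)
	}
	req := []byte(`{"product_id": 123, "state_version": 1}`)
	if err := producer.Publish("product_infer_request", req); err != nil {
		log.Fatal(err)
	}

	// Forget: the result arrives later on a separate topic, whenever
	// the (possibly slow) text or image inference finishes.
	consumer, err := nsq.NewConsumer("product_infer_result", "example-channel", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		log.Printf("got inference result: %s", m.Body)
		return nil // returning an error would requeue the message
	}))
	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	select {} // keep the process alive to receive results
}
```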

Another problem comes from the number of requests. There are at least three events where our models are used: when users add a product for the first time, when they edit the product, and when the product admin sweeps existing products. Combined, those events create a huge Requests Per Second (RPS) demand, making it too risky to let requests hit the inferring APIs unmanaged; we could lose track of which request contains the latest state of the product.

Thus, we came up with an experimental gateway with the following tasks on product inferring (a sketch of the message it carries follows the list):

  • Collect all the parameters needed for the product inferring API
  • Simplify the process for users of the API, so they just need to publish and subscribe to several topics in message queueing
  • Keep track of product states and manage the fulfillment of those states regardless of the processing time needed on each API
  • Orchestrate the inferring stacks
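For illustration, the message flowing through the gateway could look like the envelope below. Every field name here is an assumption, since the real wire format is internal:

```go
package ganesha

// InferRequest is a hypothetical envelope for one product state as it
// travels through Ganesha's topics. Tasks lists which inferring APIs
// (title scoring, image scoring, T&C check, ...) the state must pass.
type InferRequest struct {
	ProductID    int64    `json:"product_id"`
	StateVersion int64    `json:"state_version"`
	Source       string   `json:"source"` // "add", "edit", or "sweeping"
	Title        string   `json:"title"`
	Description  string   `json:"description"`
	ImageURLs    []string `json:"image_urls"`
	Tasks        []string `json:"tasks"`
}
```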

We call this gateway Ganesha; here is an overview of the whole process.

Overview of Ganesha

The existence of Ganesha does not disrupt the existing business flow. However, to understand the products better, there are layers of inference that need to be done efficiently. Suppose we pre-process both image and text in the first layer; the results can then be used for classification, clustering, or scoring. Having Ganesha in place improves inferring efficiency, since we can manage the flow and decide which parts are reusable by others.

Inside Ganesha, we communicate through NSQ message queueing to achieve this fire-and-forget style. There are three main consumers: the Active Consumer, the Knowledge Collector Consumer, and the Knowledge Checker Consumer.
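Here is a sketch of how such consumers could be wired to their topics with go-nsq. The topic, channel, and lookupd addresses are all invented for illustration:

```go
package ganesha

import nsq "github.com/nsqio/go-nsq"

// Hypothetical topic names; the real ones are internal.
const (
	topicAddEdit   = "product_add_edit" // active products
	topicSweeping  = "product_sweeping" // idle products
	topicTaskDone  = "infer_task_done"  // one inferring API finished
	topicStateDone = "state_completed"  // all tasks on a state finished
)

// startConsumer subscribes a handler to one topic on its own channel.
func startConsumer(topic, channel string, h nsq.Handler) (*nsq.Consumer, error) {
	c, err := nsq.NewConsumer(topic, channel, nsq.NewConfig())
	if err != nil {
		return nil, err
	}
	c.AddHandler(h)
	if err := c.ConnectToNSQLookupd("127.0.0.1:4161"); err != nil {
		return nil, err
	}
	return c, nil
}
```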

Active Consumer

Active Consumer process

The Active Consumer is responsible for subscribing to product change events from active products (the add/edit topic) and idle products (the sweeping topic), and for distributing the requests to the inferring APIs. The inferring APIs process every request, each with a different processing time, and the results are published to a shared topic that the next consumer subscribes to.
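Continuing the sketch above (reusing InferRequest and the topic constants), an Active Consumer handler might look like this; callInferAPI stands in for an HTTP call to one inferring service and is entirely hypothetical:

```go
package ganesha

import (
	"encoding/json"

	nsq "github.com/nsqio/go-nsq"
)

// activeHandler receives one product state and fans it out to every
// inferring API listed in the request. NSQ can run many handler
// goroutines, so slow image tasks do not block fast text tasks.
func activeHandler(producer *nsq.Producer, apiURL map[string]string) nsq.Handler {
	return nsq.HandlerFunc(func(m *nsq.Message) error {
		var req InferRequest
		if err := json.Unmarshal(m.Body, &req); err != nil {
			return err
		}
		for _, task := range req.Tasks {
			result, err := callInferAPI(apiURL[task], req)
			if err != nil {
				// Requeue; a real implementation would dedupe
				// tasks that already succeeded before the retry.
				return err
			}
			if err := producer.Publish(topicTaskDone, result); err != nil {
				return err
			}
		}
		return nil
	})
}

// callInferAPI is a placeholder: in reality this would be an HTTP POST
// to one inferring service, returning its JSON result.
func callInferAPI(url string, req InferRequest) ([]byte, error) {
	return nil, nil
}
```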

Knowledge Collector Consumer

Knowledge Collector Consumer Process

The Knowledge Collector is responsible for pooling all the answers from the inferring APIs. We keep track of how many inferring tasks have finished on each product state and keep pooling answers until all tasks are complete. Once the tasks on a particular state are finished, it publishes a message to the next consumer to signal that the process is finished and the results are ready to be sent to the end-users.
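A sketch of that collecting logic, again with invented names; Store abstracts whatever keeps the per-state answers and counters (a Redis counter would be one natural option, but the real storage is not published):

```go
package ganesha

import (
	"encoding/json"

	nsq "github.com/nsqio/go-nsq"
)

// Store abstracts the per-state bookkeeping.
type Store interface {
	SaveAnswer(stateKey, task string, payload []byte) error
	IncrDone(stateKey string) (int64, error)      // tasks finished so far
	ExpectedTasks(stateKey string) (int64, error) // tasks this state needs
}

// taskResult is a hypothetical envelope for one finished inferring task.
type taskResult struct {
	StateKey string          `json:"state_key"`
	Task     string          `json:"task"`
	Payload  json.RawMessage `json:"payload"`
}

// collectorHandler pools answers and signals the checker once every
// inferring task on a state has reported back.
func collectorHandler(store Store, producer *nsq.Producer) nsq.Handler {
	return nsq.HandlerFunc(func(m *nsq.Message) error {
		var res taskResult
		if err := json.Unmarshal(m.Body, &res); err != nil {
			return err
		}
		if err := store.SaveAnswer(res.StateKey, res.Task, res.Payload); err != nil {
			return err
		}
		done, err := store.IncrDone(res.StateKey)
		if err != nil {
			return err
		}
		expected, err := store.ExpectedTasks(res.StateKey)
		if err != nil {
			return err
		}
		if done == expected {
			return producer.Publish(topicStateDone, []byte(res.StateKey))
		}
		return nil
	})
}
```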

Knowledge Checker Consumer

Knowledge Checker Consumer Process

The Knowledge Checker is the final gate of Ganesha. It verifies state completeness and prepares the message, with the inference results inside, for the appropriate end-users. After publishing the results, it updates the state status, deletes the product cache, and records the checkout time.
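A sketch of that final gate; StateRepo and the end-user topic name are, once more, invented:

```go
package ganesha

import (
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// StateRepo abstracts the state status store and the product cache.
type StateRepo interface {
	Results(stateKey string) (payload []byte, complete bool, err error)
	MarkDone(stateKey string, checkedOutAt time.Time) error
	DeleteCache(stateKey string) error
}

// checkerHandler verifies completeness, ships the combined inference
// results to end-users, then cleans up the state bookkeeping.
func checkerHandler(repo StateRepo, producer *nsq.Producer) nsq.Handler {
	return nsq.HandlerFunc(func(m *nsq.Message) error {
		stateKey := string(m.Body)
		payload, complete, err := repo.Results(stateKey)
		if err != nil {
			return err
		}
		if !complete {
			return nil // incomplete states are handled by the sweeping cron
		}
		if err := producer.Publish("product_knowledge", payload); err != nil {
			return err
		}
		if err := repo.DeleteCache(stateKey); err != nil {
			return err
		}
		return repo.MarkDone(stateKey, time.Now())
	})
}
```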

The product cache is one of the most important parts besides these consumers, since every state may have different product properties and we never know how long those properties stay relevant. Some tasks can also fail, in which case the checker trigger is never sent. To address this, we created a cron that regularly sweeps for incomplete states and re-asks the APIs using the last properties held in the cache.
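The recovery cron could be as simple as the loop below; the five-minute interval and the repository methods are assumptions:

```go
package ganesha

import (
	"log"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// Sweeper exposes the two lookups the recovery cron needs.
type Sweeper interface {
	IncompleteStates() ([]string, error)           // states past their deadline
	CachedRequest(stateKey string) ([]byte, error) // last properties in cache
}

// sweepIncomplete periodically re-asks the inferring APIs for states
// whose tasks failed or never completed, replaying the cached request.
func sweepIncomplete(repo Sweeper, producer *nsq.Producer) {
	for range time.Tick(5 * time.Minute) {
		keys, err := repo.IncompleteStates()
		if err != nil {
			log.Printf("sweep: %v", err)
			continue
		}
		for _, key := range keys {
			body, err := repo.CachedRequest(key)
			if err != nil {
				log.Printf("sweep %s: %v", key, err)
				continue
			}
			// Replaying into the add/edit topic re-runs the whole flow.
			if err := producer.Publish(topicAddEdit, body); err != nil {
				log.Printf("sweep %s: %v", key, err)
			}
		}
	}
}
```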

With these thorough steps, we solve several problems:

  • Ensure that each state of the product is captured and inferred with a minimum error rate
  • Orchestrate the inferring stacks, and ensure any improvements on product inferring could be used interchangeably
  • Cover many end-users who can utilize our models with minimum effort on their side

Data Science is a very interesting domain nowadays. We can solve almost anything with Machine Learning; the tricky part is how we define "solve". In most cases, we use a per-project metric as the lens to define it. But if we look deeper at the root cause of the problem, we understand that modeling is only one half, and continuous end-to-end utilization is the other half. We can add more lenses, such as long-term business goals, model reusability, and technical integration, to arrive at a more holistic definition of "solve". This, in my opinion, is mandatory for any Data Science team.

I hope you found this article useful. And if you're ready to tackle the infinite number of hurdles we've got, hit our career page here and join our amazing team :)
