The Overlooked Effect of Data Size

Ahmet Hatipoglu
Published in Trendyol Tech
Nov 30, 2022

Special thanks to Özgür Karakaya, with whom I co-wrote this blog post.
All credit goes to the PDP Team members who developed the solution before I joined this amazing team.

PDP (Product Detail Page) is the team at Trendyol responsible for providing information on specific products. This information includes a product's price, size, color, or any other details customers want to know before making a purchase. The information can be presented in two different ways. The first is the single product detail page: when you click on a product, you are navigated to a page that displays that specific product's information. The team is also responsible for providing product detail data to other views on the website, e.g. your favorites list, recommended products, and checkout. Both were handled by one service called Product Detail API, backed by a Couchbase NoSQL database, serving single requests (the single product detail page) as well as requests in bulk (favorites list, etc.).

Prior to the Black Friday event of 2021, the team had set a goal of handling 8 million Requests Per Minute for bulk requests. One problem was that the service performed poorly under high throughput, showing increased memory usage and a considerable number of errors. The challenge was to figure out a way to optimize the service.

So here is the journey.

Brainstorming & Analysis

During the analysis phase, the PDP Team found that, domain-wise, clients who send bulk requests do not use all the fields in the database document. The first optimization point was to introduce a new document model without those unnecessary fields. The advantages of this restructuring are:

  • data size optimization
  • bulk-content domain focus

Another thing we noticed was that some of the main conversion operations can be done at what we call Index-time (while writing to the database), as sketched below.
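To make the idea concrete, here is a minimal sketch in Go of what moving the trimming to Index-time looks like. The field names are invented for illustration (the real document models aren't shown in this post); the point is that the conversion runs once, while writing, instead of on every read.

package main

// ProductContent mimics the full source document, including fields that bulk
// clients never read. All field names here are hypothetical.
type ProductContent struct {
	ID          string   `json:"id"`
	Name        string   `json:"name"`
	Price       float64  `json:"price"`
	Description string   `json:"description"` // not needed by bulk clients
	Reviews     []string `json:"reviews"`     // not needed by bulk clients
}

// ProductSummary is the trimmed, bulk-focused model written at Index-time.
type ProductSummary struct {
	ID    string  `json:"id"`
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// ToSummary drops the unnecessary fields; in the proposed architecture this
// runs while writing to the database, not while serving an HTTP request.
func ToSummary(c ProductContent) ProductSummary {
	return ProductSummary{ID: c.ID, Name: c.Name, Price: c.Price}
}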

Below is a diagram of the existing and new architecture at the time:

AS-IS Architecture & Proposed Architecture

The dotted outer boxes define the bounds of a service. Notice that in the first illustration, one service was responsible for handling the entire pipeline of operations. The optimization was to split the conversion pipeline into two separate processes, as seen in the second architecture: the Index-time Conversion process and the Bulk Content Service. The latter is a read service that handles HTTP requests. The former turned out to be an event-driven process that reads data from ProductContent CB (the Source of Truth), partially processes it, then stores it in a Couchbase bucket for the latter to use. But more on that later.
Under the proposed architecture, the main gain is that the runtime process is optimized, which in turn reduces the response time of the service.

In this blog, we'll be focusing on the Index-time process. For more information on the Bulk Content Service, check out this post by Enes Turgut.

P.S. I’m using the term runtime here rather loosely. In the context of this blog, we’re defining runtime as the sequence of operations triggered by an incoming HTTP request.

Existing Solutions

So far, the team had settled on the new architecture. Now it was time to implement it. What we were trying to achieve was a near real-time Couchbase-to-Couchbase conversion solution. Couchbase already offers replication across clusters; however, the shape of the data needs to change, so normal replication is obviously not the way to go. Wouldn't it be awesome if there were a way to manipulate the data during the replication process? While skimming through the Couchbase documentation, we stumbled upon the Couchbase Eventing Service. Did we just hit the jackpot?

Couchbase Eventing:

The Couchbase Eventing Service is a framework to operate on changes to data in real time. Events are changes to data in the Couchbase cluster.

Couchbase Eventing seemed to be a good fit for us because you can:

  • Propagate changes to other systems
  • Enrich a document in real-time
  • Do all of the above with no maintenance overhead

This meets our requirement of removing unnecessary fields and applying some conversion logic to the documents.
However, promising as it is, it is not suitable for high-complexity conversion logic, mainly because the feature requires a single JavaScript file. This is bad news: business logic is constantly evolving, new features are continuously being added, and more features mean more development. Going this way would sacrifice code modularity and ease of deployment. At the end of the day, we decided not to go down this path.

No jackpot.

Our Solution (Bulk Content Data Converter)

Inspired by CB Elasticsearch Connector.

The Couchbase Elasticsearch Connector replicates your documents from Couchbase Server to Elasticsearch in near real time. The connector uses the high-performance Database Change Protocol (DCP) to receive notifications when documents change in Couchbase.

Replace Elasticsearch with Couchbase and you get precisely what we need. All we need to do is to accept document mutations (updates) from the DCP queue, process them, then write them into a new bucket.
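Conceptually, that boils down to a loop like the sketch below. The Mutation type and the channel feeding it are stand-ins for whatever DCP client delivers the events (they are not a real API), and the target write uses the Couchbase Go SDK (gocb) purely as an example; the actual converter's internals may differ.

package main

import (
	"encoding/json"
	"log"

	"github.com/couchbase/gocb/v2"
)

// Mutation is a hypothetical representation of a DCP event: the document key
// plus the full post-update document body.
type Mutation struct {
	Key  string
	Body []byte
}

// run consumes mutations, converts each document, and upserts it into the target bucket.
func run(mutations <-chan Mutation) error {
	cluster, err := gocb.Connect("couchbase://target_host", gocb.ClusterOptions{
		Username: "user", Password: "pass",
	})
	if err != nil {
		return err
	}
	target := cluster.Bucket("targetBucketName").DefaultCollection()

	for m := range mutations {
		var source map[string]interface{}
		if err := json.Unmarshal(m.Body, &source); err != nil {
			log.Printf("skipping malformed document %s: %v", m.Key, err)
			continue
		}
		if _, err := target.Upsert(m.Key, convert(source), nil); err != nil {
			return err
		}
	}
	return nil
}

// convert stands in for the domain conversion logic: keep only the fields
// bulk clients need and reshape them for the summary model.
func convert(source map[string]interface{}) map[string]interface{} {
	return map[string]interface{}{
		"id":    source["id"],
		"name":  source["name"],
		"price": source["price"],
	}
}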

Bear in mind the solution needs to be able to:

  • Scale as the number of documents and mutations grows.
  • Perform the end-to-end process (accepting messages from the queue, converting the data model, and writing it into a new bucket) in near real-time.
  • Manipulate existing documents, not only new ones, when fields are added to or removed from the document model.
  • Provide a mechanism to regenerate documents from the ground up in case of severe data corruption.

The downside is that the solution must be built almost entirely from scratch. All the same, the team decided to move forward this way.

Well, there’s a lot to unpack here, let’s summarize first:

Final Architecture

The above architecture shows the final solution. Ultimately, Product Detail Service will handle single requests, whilst Bulk Content Service will handle requests in bulk. The services are backed by different Couchbase sources: ProductContent CB represents the Source of Truth, while ProductSummary CB represents a summarized, pre-processed version of it. The conversion is handled in the Bulk Content Data Converter in near real-time by accepting document mutations from the DCP queue. Each document mutation represents the state of the document after the update has happened. This is another way of saying that the converter automatically gets the entire updated document from ProductContent CB on each update.

To understand exactly how this solution scales, one should take a look at the converter's configuration file:

{
  "group": {
    "name": "v20220909",
    "membership": {
      "memberNumber": 1,
      "totalMembers": n
    },
    "limit": {
      "bytes": "5m",
      "actions": 100,
      "timeout": "1m",
      "concurrency": 2
    },
    "version": 1,
    "ignoreDelete": false
  },
  "dcp": {
    "sourceConfig": {
      "hosts": ["source_couchbase_host"],
      "network": "auto",
      "bucket": "bucketName",
      "metadataBucket": "metadataBucketName",
      "scope": "",
      "collections": [],
      "compression": true,
      "flowControlBuffer": "16m",
      "persistencePollingInterval": "100ms"
    },
    "targetConfig": {
      "hosts": ["target_couchbase_host"],
      "bucket": "targetBucketName"
    }
  }
}

The solution is developed with configurability in mind. The system operates in parallel, with n total converters, as seen in the group.membership.totalMembers path. Each converter has a memberNumber and is responsible for handling m vBuckets. As of October 31st, 2022, Trendyol has approximately 300 million documents spread across 1024 vBuckets, with 8 total converters each taking a slice of those 1024 vBuckets. Thus scaling is just a matter of increasing the total number of converters.
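The post doesn't spell out the exact assignment algorithm, but a straightforward (hypothetical) way of slicing the 1024 vBuckets by memberNumber and totalMembers could look like this:

package main

import "fmt"

// vbucketsFor returns a contiguous range of vBucket IDs for a member.
// memberNumber is 1-based, matching the configuration file above.
func vbucketsFor(memberNumber, totalMembers, totalVBuckets int) (start, end int) {
	perMember := totalVBuckets / totalMembers
	start = (memberNumber - 1) * perMember
	end = start + perMember - 1
	if memberNumber == totalMembers { // last member picks up any remainder
		end = totalVBuckets - 1
	}
	return start, end
}

func main() {
	// With 8 converters and 1024 vBuckets, each member handles 128 vBuckets.
	for member := 1; member <= 8; member++ {
		s, e := vbucketsFor(member, 8, 1024)
		fmt.Printf("member %d handles vBuckets %d-%d\n", member, s, e)
	}
}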

Another important piece of configuration is group.name, which represents the consumer group where an offset is stored (think Kafka consumer groups). Changing this value resets the offset index, which means the entire database state will be re-sent via the DCP queue. This is especially practical when the document model needs to change: adding fields to, or removing them from, the target bucket requires updating all stored documents, including documents that would otherwise never have been updated in the target bucket simply because no mutations occurred on the source documents. It can also serve as a full database regeneration mechanism in case of severe data corruption.

For more information about the configuration, kindly check the documentation link.

Another optimization point was to reduce the target document size by using abbreviations or single characters as JSON keys. Domain experts and PMs aren't expected to make sense of this document, which is why we can get away with such an evil hack. Below is an example of the document:

Target document
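Since the actual target schema isn't reproduced here, the Go struct below is a hypothetical illustration of the trick: full field names in code, single-character keys on the wire.

package main

import (
	"encoding/json"
	"fmt"
)

// TargetProduct shows how JSON struct tags shrink the stored keys.
// Field names and abbreviations are illustrative only.
type TargetProduct struct {
	Name     string  `json:"n"`
	Price    float64 `json:"p"`
	ImageURL string  `json:"i"`
	InStock  bool    `json:"s"`
}

func main() {
	doc, _ := json.Marshal(TargetProduct{Name: "T-Shirt", Price: 149.99, ImageURL: "cdn/img/1.jpg", InStock: true})
	fmt.Println(string(doc)) // {"n":"T-Shirt","p":149.99,"i":"cdn/img/1.jpg","s":true}
}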

And as of 31st of October, the result is…

Bucket size Before & After

This simple hack allowed us to reduce the bucket size from 2.92 TB to 595 GB, a whopping size reduction of approximately 80%!
You may also have noticed that the document count in the target DB is lower by approximately 12 million. The reason is that we excluded out-of-stock products that have not been updated for 12 months. We may still need those documents in the source bucket, but it wouldn't make sense to keep them here, hence the count difference.
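The exclusion rule itself is a simple predicate; assuming a stock flag and the LMD field described in the next section, it could look something like this:

package main

import "time"

// shouldIndex reports whether a source document belongs in the target bucket:
// out-of-stock products untouched for the last 12 months are skipped.
// The inputs are assumed fields, not the real document schema.
func shouldIndex(inStock bool, lastModified, now time.Time) bool {
	if inStock {
		return true
	}
	return lastModified.After(now.AddDate(-1, 0, 0)) // updated within the last 12 months
}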

Monitoring & Performance

So far, we have the solution up and running. But how exactly can we tell whether it's performant enough? What if it is, hypothetically, doing the conversion but lagging, say, 2 seconds behind? That would be an utter disaster! Such eventual consistency would leak all the way to the website's UI and severely affect the user experience. Manually triggering a change on the production database isn't possible, and shouldn't be the way to test in the first place. A systematic approach is needed to pinpoint the source of the lag if it happens. Is the lag happening because of congestion in the DCP queue? Or is it because the converter has some sort of bug?

To answer these questions we introduced three metrics as shown in the figure below:

Monitoring Metrics

The source document contains the timestamp of the last update, which we call the LMD (Last Modified Date). Once a mutation is received by the converter, the "Waiting in DCP queue" time can be calculated easily by subtracting the LMD from time.now(). The metrics are then exposed to Prometheus, as shown in the graphs below:

Bulk Content Converter latency Metrics
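For instance, the queue-wait measurement is just a subtraction plus a histogram observation. The sketch below uses the Prometheus Go client; the metric name, buckets, and wiring are assumptions, not the converter's actual code.

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dcpQueueWait records how long a mutation waited in the DCP queue,
// derived from the document's LMD (Last Modified Date).
var dcpQueueWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "bulk_converter_dcp_queue_wait_seconds",
	Help:    "Time a mutation spent in the DCP queue before reaching the converter.",
	Buckets: prometheus.ExponentialBuckets(0.01, 2, 12), // 10ms up to ~20s
})

func init() {
	prometheus.MustRegister(dcpQueueWait)
}

// observeQueueWait is called as soon as a mutation is received by the converter.
func observeQueueWait(lmd time.Time) {
	dcpQueueWait.Observe(time.Since(lmd).Seconds())
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}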

Limitations:

Remember when I said earlier that scaling is just a matter of increasing the total number of converters? Well, this comes with a little caveat: the solution's configuration is a tad hardwired to the infrastructure it speaks with. Consumer auto-rebalancing on the DCP queue and the distribution of consumers across vBuckets are tied to the static configuration each consumer has on startup. To change that, one needs to change the configuration file itself, which requires a new deployment. Furthermore, each consumer needs its own static configuration file, which means each consumer needs, in Kubernetes terminology, its own deployment resource files. At Trendyol we widely use ArgoCD with Kustomize to manage our deployments, so scaling the solution requires adding, in ArgoCD terminology, a new ApplicationSet.
So far, our scale of 8 consumers is handling the big load with near real-time efficiency. But as the source bucket gets bigger and the number of mutations grows, the consumers will need to be manually reconfigured.

In-House Usage by Other Teams

Feeling confident about the developed CB-to-CB connector, the PDP Team decided to present it to other teams within Trendyol. The team got lots of positive feedback, and before we knew it, the Search Team decided to fork the project and use it in their domain, where they store domain events in a Couchbase database and use the CB-to-CB connector to generate another document model for later use. Such adoption of in-house developed solutions inspires all of us to move forward and keep improving and refining our usage of technology.

Apply for our open positions here!
