Why we switched from Python to Go in Avito’s recommendations engine

Vasiliy Kopytov · AvitoTech · March 2, 2023

Hi there! My name is Vasiliy Kopytov, and I lead the recommendations development group at Avito. We work on the systems that show users personalized adverts on the site and in the apps. Using our main recommendation service as an example, I will show you when it makes sense to switch from Python to Go and when to leave everything as it is. At the end I will share some tips on optimizing Python services.

How recommendations work at Avito

Anyone who goes to the homepage of a site or app sees a personalized adverts feed — recommendations. The load on our main recommendation service, which is responsible for generating an endless feed of adverts on the home page, is about 200,000 requests per minute. Total traffic is up to 500,000 requests per minute for recommendations.

This is what the recommendations look like in the app and on the website

The service selects the most suitable adverts out of 130 million active adverts (items) for each user. Recommendations are generated based on every single action of a person during the last month.

The service, which we call Representation, works according to the following algorithm (a simplified code sketch follows the diagram below):

1. The service accesses the user’s history storage and retrieves the aggregated history of actions and interests from it.

Interests are a set of categories and subcategories of adverts that the person has recently viewed. For example, kids’ clothing, pets or home supplies.

2. Then it passes the history and interests as a set of parameters to the first-level ML models.

The first-level ML models are underlying services. Right now we have four such models. They predict items using different machine learning algorithms, and at the output of each service we get a list of recommended item ids.

3. We filter the ids against the user’s history and end up with about 3,000 items per user.

4. The most interesting part: Representation internally uses a second-level ML model based on CatBoost to rank the adverts coming from the first-level models in real time.

5. From this data we prepare features — the parameters used to rank the recommendations. To do this, we fetch each item’s data by id from storage (a sharded 1 TB Redis database): title, price and about 50 other fields.

6. The service passes the features and the items to the second-level ML model based on the CatBoost library. The output is a ranked feed of adverts.

7. Next, Representation applies the business logic. For example, it boosts in the feed those adverts for which premium placement has been paid (boost VAS).

8. We cache the generated feed of about 3,000 adverts and return it to the user.

The algorithm by which the recommendation feed is generated
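To make the flow easier to follow, here is a highly simplified sketch of the pipeline in Python. All names in it (get_user_history, first_level_models, catboost_model and so on) are hypothetical and do not match the real service code — it only illustrates the eight steps above.

async def build_feed(user_id: str) -> list:
    # Step 1: aggregated history of actions and interests
    history, interests = await get_user_history(user_id)

    # Step 2: first-level ML models return candidate item ids
    candidates = []
    for model in first_level_models:
        candidates += await model.predict(history, interests)

    # Step 3: filter against the user's history, ~3,000 items remain
    candidates = filter_by_history(candidates, history)

    # Step 5: fetch item data from storage and prepare ranking features
    items = await load_items(candidates)
    features = prepare_features(items, history)

    # Steps 4 and 6: second-level CatBoost model scores the candidates in real time
    scores = catboost_model.predict(features)

    # Step 7: business logic, e.g. boosting paid placements (boost VAS)
    feed = apply_business_logic(rank(items, scores))

    # Step 8: cache and return the feed of ~3,000 adverts
    await cache_feed(user_id, feed)
    return feed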

Why we decided to rewrite the recommendations engine

Representation is one of the most heavily loaded services in Avito: it processes 200,000 requests per minute. The service didn’t become that way all at once: we were constantly introducing something new and improving the quality of the recommendations. At some point, it started consuming almost as many resources as the rest of the Avito monolith. It became difficult for us to roll out the service in the daytime, during peak hours, because of the lack of resources in the cluster — that is when most developers were deploying their services.

Interaction map of Avito services. The size of the circle indicates how many cluster resources the service is consuming

Along with the growth of resource consumption, the service’s response time also grew. During peak loads, users could wait up to 1.6 seconds for their recommendations — an 8-fold increase over the past two years. All this threatened to block further development and improvement of the recommendations.

The reasons for this are fairly obvious:

  1. High IO-bound load. In Representation, each request spawns about 20 coroutines — blocks of code that run asynchronously while processing network requests.
  2. CPU-bound load from real-time calculations by the ML model, which fully occupy the CPU while advert ranking takes place.
  3. The GIL. Representation was originally written in single-threaded Python, where the Global Interpreter Lock makes it impossible to combine IO-bound and CPU-bound workloads in a way that uses resources efficiently.

How we solved the recommendation engine challenge

Let me tell you what helped us cope with these loads in Python.

1. ProcessPoolExecutor

ProcessPoolExecutor creates a pool of workers from processor cores. Each worker is a separate process running on a separate core. You can pass a CPU-bound load to a worker so that it does not slow down other processes.

In Representation, we originally used ProcessPoolExecutor to separate CPU-bound and IO-bound workloads. In addition to the main Python process that serves requests and makes network calls (IO-bound), we dedicated three workers to the ML model (CPU-bound).

We have an asynchronous service on aiohttp that serves requests and handles the IO-bound load well. ProcessPoolExecutor creates a pool of workers; the CPU-bound workload can be handed off to such a worker so that it does not slow down the main service process and hurt the latency of the whole service.

The time gain from using ProcessPoolExecutor is about 35%. To measure it, we made the code synchronous and disabled ProcessPoolExecutor, so that the IO-bound and CPU-bound workloads ran in a single process.

Without ProcessPoolExecutor, the response time increased by 35%

What it looks like in the code:

async def process_request(user_id):
    # I/O task
    async with session.post(
        feature_service_url,
        json={'user_id': user_id},
    ) as resp:
        features = await resp.json()

    return features

We have an asynchronous handler that processes the request. For those not familiar with the async/await syntax: these keywords mark coroutine switch points.

That is, at the await on the seventh line of the snippet, one coroutine goes to sleep and hands execution over to another coroutine that has already received its data, thus saving CPU time. This is how Python implements cooperative multitasking.

def predict(features):
    preprocessed_features = processor.preprocess(features)
    return model.infer(preprocessed_features)


async def process_request(user_id):
    # I/O task
    async with session.post(
        feature_service_url,
        json={'user_id': user_id},
    ) as resp:
        features = await resp.json()

    # Blocking CPU task
    return predict(features)

Now suppose we need to run a CPU-bound task from the ML model. At the call to predict, our coroutine blocks the whole Python process, so all the service’s requests queue up and the response time grows, as we saw earlier.

executor = concurrent.futures.ProcessPoolExecutor(max_workers=N)


def predict(features):
    preprocessed_features = processor.preprocess(features)
    return model.infer(preprocessed_features)


async def process_request(user_id):
    # I/O task
    async with session.post(
        feature_service_url,
        json={'user_id': user_id},
    ) as resp:
        features = await resp.json()

    # Non-blocking CPU task
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, predict, features)

This is where ProcessPoolExecutor with its pool of workers comes in and solves the problem. On the first line we create the pool. At the end of the snippet, run_in_executor takes a worker from it and moves the CPU-bound task to a separate core, so the predict function runs asynchronously with respect to the parent process and does not block it. The nice thing is that all of this is wrapped in the usual async/await syntax: CPU-bound tasks are awaited just like IO-bound ones, but under the hood there is extra magic with processes.

ProcessPoolExecutor allowed us to reduce the overhead of the real-time ML model, but at some point even that was not enough. We started with the most obvious step — profiling and finding the bottlenecks.

2. Service profiling

Even if the service is coded by experienced programmers, there is room for improvement. To understand which parts of the code are slow and which are fast, we profiled the service using the py-spy profiler.

The profiler produces a flame graph in which the width of each horizontal bar shows how much CPU time a section of code consumes. The first thing you notice are the three bars on the right — these are our child processes that run the ML-model scoring.

Rec-representation profiling result. For example, one can see that the ProcessPoolExecutor workers for the ML model take up almost the same amount of CPU resources

On the flame graph we saw some interesting details:

  • 7% of the CPU time is spent serializing data between processes. Serialization is the conversion of data into bytes. In Python, this process is known as pickle, and the reverse one is known as unpickle.
  • 3% of the time is spent on the ProcessPoolExecutor overhead — preparing a pool of workers and distributing the load between them.
  • 6.7% of the time is spent in json.loads and json.dumps, serializing and deserializing data for network requests.

In addition to the percentage breakdown, we wanted to know the absolute time different sections of code take. To do this, we again disabled ProcessPoolExecutor and ran the ML-model ranking synchronously.

Without the ProcessPoolExecutor, ranking is faster, because all the CPU time goes only to feature preparation and scoring by the ML model; there is no overhead from pickle/unpickle and no IO-wait

But the trade-off remains: a specific piece of code became faster, while the service as a whole became slower.

After experimenting, we found out:

  • The overhead of ProcessPoolExecutor is about 100 milliseconds.
  • IO-bound requests from coroutines wait 80 milliseconds each — that is, a coroutine falls asleep and the event loop gets back to it only after 80 ms to resume its execution. In Representation there are three large groups of IO-bound requests, so a total of 240 milliseconds is spent on IO-wait.

This is when we first thought about switching to Go, since it has a more efficient goroutine scheduling model out of the box.

3. Split CPU-bound and IO-bound workload into two separate services

One of the big changes we tried was moving the ML model into a separate rec-ranker service. Representation kept only the network requests, while scoring ran in rec-ranker: we passed it all the necessary data and got scores back for ranking. The idea was to shave off some latency and scale the two parts independently.

The experiment showed that we save time on model inference, but lose about 270 milliseconds on transferring data over the network and on json.loads/json.dumps. We need to transfer about 4 MB per request, and up to 12 MB for very active users. After scaling, the number of rec-ranker replicas was not much smaller than the old Representation, and the response time did not change. For our case, splitting into two services turned out to be a dead end, so we went back to the previous Representation implementation.

4. Evaluated Shared Memory

In the Representation service, data is transferred between processes via pickle/unpickle. Instead, processes that share data can point to a shared memory location. This saves serialization time.

Our upper-bound estimate was that we could gain about 70 milliseconds on serialization, with a comparable reduction in overall request time, since pickle/unpickle is a CPU-bound load that blocks the main Python process handling user requests. We drew this conclusion from the profile: pickle/unpickle takes only 7% of CPU time, so shared memory would not buy us much.
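For reference, this is roughly the approach we evaluated — a minimal sketch that hands a NumPy feature matrix to another process through Python’s multiprocessing.shared_memory instead of pickling it. The function names and shapes are illustrative.

# A minimal sketch of the idea we evaluated: sharing a NumPy feature matrix
# between processes through shared memory instead of pickle/unpickle.
# All names and shapes here are illustrative.
import numpy as np
from multiprocessing import shared_memory


def share_features(features: np.ndarray):
    """Copy the matrix into a shared block once; workers attach to it by name."""
    shm = shared_memory.SharedMemory(create=True, size=features.nbytes)
    view = np.ndarray(features.shape, dtype=features.dtype, buffer=shm.buf)
    view[:] = features[:]  # a single memcpy instead of pickle + unpickle
    return shm  # keep the handle alive; pass shm.name, shape and dtype to the worker


def attach_features(name: str, shape, dtype):
    """Called in the worker process: no deserialization, just a view over shared memory."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)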

5. Preparation of features in Go

We decided to test Go’s efficiency on a part of the service first. For the experiment we chose the most CPU-intensive task in the service — feature preparation.

Features in the recommendation service are item data and user actions: for example, the title of the advert, its price, and information about impressions and clicks. There are about 60 parameters that affect the output of the ML model. We prepare all this data for the 3,000 items and send it to the model, which returns a score for each item; the scores are used to rank the feed.

To link the Go code for preparing the features to the rest of the service code in Python, we used ctypes.

def get_predictions(
    raw_data: bytes,
    model_ptr: POINTER(c_void_p),
    size: int,
) -> list:
    raw_predictions = lib.GetPredictionsWithModel(
        GoString(raw_data, len(raw_data)),
        model_ptr,
    )
    predictions = [raw_predictions[i] for i in range(size)]
    return predictions

This is what calling the feature preparation from Python looks like. The lib module is a compiled Go package that exposes the GetPredictionsWithModel function. We pass it bytes with the item data and a pointer to the ML model; all the features are prepared by the Go code.
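For context, here is roughly how such a library can be wired up on the Python side. The library file name and the exact argument and return types of GetPredictionsWithModel are assumptions for illustration; only the general approach — go build -buildmode=c-shared plus ctypes — is what we actually rely on.

# A hedged sketch of loading the compiled Go library via ctypes.
# The file name and the exact signature below are assumptions for illustration.
from ctypes import CDLL, POINTER, Structure, c_char_p, c_double, c_longlong, c_void_p

lib = CDLL("./librepresentation.so")  # built with: go build -buildmode=c-shared


class GoString(Structure):
    # Mirrors the GoString struct from the cgo-generated header
    _fields_ = [("p", c_char_p), ("n", c_longlong)]


lib.GetPredictionsWithModel.argtypes = [GoString, POINTER(c_void_p)]
lib.GetPredictionsWithModel.restype = POINTER(c_double)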

The results were impressive:

  • feature preparation in Go is 20–30 times faster;
  • the whole ranking step is 3 times faster, even taking into account the extra serialization of data into bytes;
  • the home page response time dropped by 35%.
Preparing features in Go accelerated the loading of the site’s main page from 1060 to 680 milliseconds
Time to rank recommendations with the ML model, including feature preparation. Note that in the Go case the code is synchronous and ProcessPoolExecutor is not used

Results

After all the experiments, we drew four conclusions:

  • Feature preparation in Go for 3,000 items per request is 20–30 times faster, which saves about 30% of the response time.
  • ProcessPoolExecutor eats up about 10% of the time.
  • Three groups of IO-bound requests spend about 25% of the time on idle waiting.
  • After switching to Go, we would save about 65% of the response time.

Rewriting everything in Go

There is an ML model inside representation-go. It might seem that ML is only at home in Python, but our model is built on CatBoost, which has a C API that can be called from Go. That is what we took advantage of.

Below is a bit of Go code. I won’t dwell on it too much; I’ll only point out that inference gives the same results as in Python. C is the cgo pseudo-package that provides Go with an interface to C libraries.

// CalcModelPrediction comes from the CatBoost C API (c_api.h)
if !C.CalcModelPrediction(
    model.Handler,
    C.size_t(nSamples),
    floatsC,
    C.size_t(floatFeaturesCount),
    CatsC,
    C.size_t(categoryFeaturesCount),
    (*C.double)(&results[0]),
    C.size_t(nSamples),
) {
    return nil, getError()
}

One problem is that the ML model is still trained in Python, and for it to learn on the same features it will see in production, it is important that the two implementations do not drift apart.

So we started preparing the training features with the Go service code as well. Training takes place on separate machines: the Go service code is fetched there, the features are prepared by that code and saved to a file, then a Python script picks up the file and trains the model on it. As a bonus, feature preparation for training also became 20–30 times faster.
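The training step itself stays in Python. Here is a minimal sketch of what such a script could look like with the CatBoost Python API — the file names, column layout and parameters are illustrative, not the real pipeline:

# A minimal sketch of the Python training step, assuming the Go code has already
# dumped the prepared features to a file. Paths, columns and parameters are illustrative.
import pandas as pd
from catboost import CatBoost, Pool

data = pd.read_csv("features_prepared_by_go.csv")      # produced by the Go service code
train_pool = Pool(
    data=data.drop(columns=["target", "group_id"]),
    label=data["target"],
    group_id=data["group_id"],                          # ranking is grouped per user/request
)

model = CatBoost({"loss_function": "YetiRank", "iterations": 1000})
model.fit(train_pool)
model.save_model("ranker.cbm")                          # the same .cbm file is loaded via the C API in Go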

Representation-go showed great results:

  • Main page response time dropped threefold, from 1280 to 450 milliseconds.
  • CPU consumption dropped by a factor of 5.
  • RAM consumption dropped by a factor of 21.
The service is three times faster when written in Go compared to Python

This unlocked further development of the recommendations — we can keep adding heavy features.

When to migrate a service from Python to Go

In our case, switching to Go brought the desired result. Based on our experience with the recommendations engine, we identified three conditions when you should consider switching to Go:

  1. the service has a high CPU-bound load
  2. at the same time, the IO-bound load is high
  3. you need to send a large amount of data over the network to prepare features.

If you only have IO-bound workloads, you’re better off sticking with Python. Switching to Go won’t buy you much time; you’ll only save resources, which is not that important for small to medium-sized workloads.

If the service uses both workloads, but doesn’t transmit as much data across the network as we do, there are two options:

  1. Use ProcessPoolExecutor. The time overhead is not very large while the service is not huge.
  2. When the traffic grows too high, split it into two services — a CPU-bound one and an IO-bound one — so they can be scaled separately.

Optimising the service, where to start

Profile your service. Use py-spy like we did, or another Python profiler. Chances are your code doesn’t have huge sub-optimal areas, but a closer look at all the small ones can add up to a decent improvement. You may not need to rewrite all of your code.

Run py-spy in non-blocking mode:

py-spy record -F -o record.svg -s --nonblocking -p 1

This is the first flame graph we got, before any optimization. The first thing that caught my eye was that a noticeable chunk of time was spent on JSON request validation, which in our case was not really necessary, so we removed it. Even more time went into json.loads/json.dumps for all network requests, so we replaced the standard json module with orjson.

I will conclude with some tips:

  1. Use the request validator wisely.
  2. Use orjson for parsing in Python, or jsoniter in Golang.
  3. Reduce network load — compress data (zstd). Optimize database storage and reads/writes (Protobuf/MessagePack). Sometimes it is faster to compress, send and decompress than to send uncompressed data (see the sketch after this list).
  4. Look at the sections of code that take the longest to execute.
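To illustrate tips 2 and 3, here is a small sketch that serializes a payload with orjson and compresses it with zstd (via the zstandard package) before sending it over the network; the function names are illustrative.

# A small sketch for tips 2 and 3: orjson for fast (de)serialization,
# zstd for compressing the payload before it goes over the network.
import orjson
import zstandard as zstd

compressor = zstd.ZstdCompressor(level=3)
decompressor = zstd.ZstdDecompressor()


def encode(items: list) -> bytes:
    return compressor.compress(orjson.dumps(items))


def decode(payload: bytes) -> list:
    return orjson.loads(decompressor.decompress(payload))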
