How to improve API performance with concurrency

PengB
ADEO Tech Blog
7 min read · Jan 10, 2024


At Adeo, managing the landing zones on cloud providers is the responsibility of my team, “Cloud Services”. A key feature of the landing zone is permission management; in other words, who can do what on which cloud resources. This year we finalized the integration with MyCloudConsole, Adeo’s central product service management application. It allows users to map user groups, which are based on group members’ business roles (Product Owner, Architect, Ops, Developer, Analyst, etc.), directly to a GCP project, for example as owner, editor or viewer.

We also delivered a cool feature that allows users to bulk update the permissions on multiple landing zones in one shot. For example, I can grant the user group “support@example.com” the basic role Viewer on 20 or more GCP projects in one click.

To do that, we provide a REST API (as Adeo is an API- and data-driven company) as well as a web application.

Only landing zone owners are authorized to use this API to manage permissions.

Here comes the interesting discussion I would like to share with you: how we improved the performance of our API, which depends on a GCP API that is … slow.

Serial vs Concurrent processing

We need to check that the requester is the owner of all the GCP projects on which he wants to set new permissions. My first idea was to list each project’s owners, both direct and inherited from folders, then check whether the requester is one of them, and loop through this check one by one for all projects.
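As a minimal sketch (all names here are hypothetical, with a stubbed lookup standing in for the real GCP call), the serial version of this check looks like:

```python
from typing import Callable, Dict, List

def check_not_owner_serial(
    project_ids: List[str],
    email: str,
    get_owners: Callable[[str], List[str]],  # one slow GCP call per project
) -> List[str]:
    """Return the projects where `email` is NOT among the owners."""
    not_owner = []
    for project_id in project_ids:  # each iteration blocks on a slow API call
        owners = get_owners(project_id)
        if email not in owners:
            not_owner.append(project_id)
    return not_owner

# Stubbed owner lookup in place of the real Cloud Asset call:
fake_owners: Dict[str, List[str]] = {
    "proj-a": ["alice@example.com"],
    "proj-b": ["bob@example.com"],
}
result = check_not_owner_serial(
    ["proj-a", "proj-b"], "alice@example.com", lambda p: fake_owners[p]
)
# → ["proj-b"]: alice owns proj-a but not proj-b
```

The total time of this loop is the per-call latency multiplied by the number of projects, which is what we observed below.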

Listing GCP project owners is not that complex with the help of the Google-provided Cloud Asset Inventory API and its Python client. My team is familiar with this API, as we have already used it for other applications.

We use Python and FastAPI to build our applications. Everything seems ready. Let’s start the dev…

No surprise: as a good Cloud Ops, I dev fast 😉.

When the first version was deployed, I couldn’t wait to test it. I called the API with 5 projects as input and waited for the response. After about 20 seconds, I got it. Then I tested with 1 project; it responded in about 5 seconds. The API was really slow, and the “owner check” took most of the processing time.

Deep diving with Datadog APM, we can see in the following capture that the “owner check” using the Google Cloud Asset Inventory API takes about 3.23 seconds.

For 5 projects in serial processing, multiply that by 5: about 15–16 seconds. If we requested more projects, we could wait for minutes. That makes for a bad user experience. In the rest of this article, I will focus on the “owner check” processing time.

At that moment, we had no other choice but to use this slow Google API to get each project’s owners. So we thought about migrating the “owner check” process from serial to concurrent processing. The slow API calls don’t need heavy CPU compute; they just wait for responses. The idea was to run the checks concurrently: for 5 projects it would be much faster, taking only 3–4 seconds instead of 15–16, just like for a single input project. Let’s go this way and try to validate this hypothesis.

Python provides a coroutine mechanism that allows tasks to run concurrently using the asyncio library. On the Google side, the Cloud Asset Inventory Python async client is also provided.
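The hypothesis can be illustrated with a small self-contained simulation, using asyncio.sleep as a stand-in for the slow API call (the timings and names here are illustrative, not our production code):

```python
import asyncio
import time

SIMULATED_API_LATENCY = 0.05  # stand-in for the ~3s Cloud Asset call

async def fake_owner_check(project_id: str) -> str:
    await asyncio.sleep(SIMULATED_API_LATENCY)  # pure I/O wait, no CPU work
    return project_id

async def check_serial(projects: list) -> list:
    # one call after another: total time ~= n * latency
    return [await fake_owner_check(p) for p in projects]

async def check_concurrent(projects: list) -> list:
    # all calls in flight at once: total time ~= 1 * latency
    return await asyncio.gather(*(fake_owner_check(p) for p in projects))

projects = [f"proj-{i}" for i in range(5)]

start = time.perf_counter()
asyncio.run(check_serial(projects))
serial_time = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(check_concurrent(projects))
concurrent_time = time.perf_counter() - start
```

With 5 simulated calls, the concurrent version finishes in roughly one latency period instead of five, which is exactly the ratio we hoped for on the real API.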

Coding time …

The async coroutine functions are more complex; here are some key parts of the code if you are interested.

async_cai.py:


async def get_owners_async(
    self,
    project_id: str,
) -> List[str]:
    """Get owners bindings on project (includes inherited bindings)."""
    client_cloud_asset_async_iam = CloudAssetAsyncIAMAPI()  # based on AssetServiceAsyncClient()
    owners: List[GCPProjectOwners] = await client_cloud_asset_async_iam.get_project_identities_by_role(
        project_id=project_id,
        role="roles/owner",
    )
    owners_email = []
    for owner in owners:
        owners_email += [
            owner_email.split(":")[1] for owner_email in owner.owners if owner_email.split(":")[0] == "user"
        ]
    return owners_email

async def get_not_owner_project(self, sem: asyncio.Semaphore, project_id: str, email: EmailStr) -> str:
    """Get project_id if email is not Owner of the project."""
    async with sem:
        owners = await self.get_owners_async(
            project_id=project_id,
        )
        if email not in owners:
            return project_id
        return ""

async def check_not_owner_main_process(self, project_id_list: List[str], email: EmailStr) -> List[str]:
    """Check owner in concurrent mode."""
    MAX_CONCURRENCY_TASKS = 5
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY_TASKS)
    tasks = [
        asyncio.create_task(self.get_not_owner_project(semaphore, project_id, email))
        for project_id in project_id_list
    ]
    return await asyncio.gather(*tasks)

main_api.py:


# run the coroutine
asyncio.run(async_cai.check_not_owner_main_process(project_id_list, user_email))

I tested the new code and got the tracing from Datadog. In the following capture, we validated the hypothesis above: running 5 “owner checks” concurrently takes about 5.2 seconds instead of 15 seconds in serial processing.

I then sent 10 projects to the API, which was configured to run a maximum of 5 checks concurrently. As we can see in the following capture, the first 5 cloud asset calls start and run concurrently, and as soon as a coroutine finishes, the API immediately starts another one, until all 10 checks are done.
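This “at most 5 in flight, refill as soon as one finishes” behavior comes from asyncio.Semaphore. A small self-contained sketch (hypothetical names, simulated latency) shows the cap in action:

```python
import asyncio

MAX_CONCURRENCY_TASKS = 5
in_flight = 0
peak_in_flight = 0

async def fake_owner_check(sem: asyncio.Semaphore, project_id: str) -> str:
    global in_flight, peak_in_flight
    async with sem:  # at most MAX_CONCURRENCY_TASKS coroutines pass this point at once
        in_flight += 1
        peak_in_flight = max(peak_in_flight, in_flight)
        await asyncio.sleep(0.05)  # simulated slow GCP call
        in_flight -= 1
    return project_id

async def main(project_ids: list) -> list:
    sem = asyncio.Semaphore(MAX_CONCURRENCY_TASKS)
    tasks = [asyncio.create_task(fake_owner_check(sem, p)) for p in project_ids]
    return await asyncio.gather(*tasks)

# 10 projects, but never more than 5 checks in flight at the same time
# (with these timings, peak_in_flight typically reaches exactly 5)
results = asyncio.run(main([f"proj-{i}" for i in range(10)]))
```

All 10 tasks are created up front; the semaphore queues the ones beyond the limit until a slot frees up, which matches the waves visible in the Datadog trace.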

It’s still slow by the standards of Jakob Nielsen’s research on response times published in 1993, but much better than serial processing. The performance improved from 15s to 5s for 5 concurrent calls, which is 67% faster.

This solution then ran in production for months.

More is not always better

You may ask why we don’t run more tasks in concurrency, only 5 here (MAX_CONCURRENCY_TASKS = 5). Another point I must make is that for the number of concurrent jobs, more is not always better. If you use Python asyncio in an API to call another API concurrently, you must tune it, because the target API may have behaviors like rate limit protection, and the result also depends on your compute and network resources. I also tested with a maximum of 10 concurrent calls. After many tests, I figured out that it takes a bit longer than a maximum of 5 for a request totaling 10 projects.
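To find a good value, you can benchmark several limits. Here is a sketch of such a sweep with simulated latency; note that in this idealized simulation the highest limit always wins, precisely because it has no rate limiting or resource contention, which is why the real tuning must be done by measuring against the actual API:

```python
import asyncio
import time

async def fake_call(sem: asyncio.Semaphore) -> None:
    async with sem:
        await asyncio.sleep(0.05)  # simulated API latency, no rate limiting

async def run_batch(limit: int, total: int) -> float:
    """Wall time to run `total` calls with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(sem) for _ in range(total)))
    return time.perf_counter() - start

async def sweep(total: int = 10) -> dict:
    # measure wall time for several concurrency limits
    return {limit: await run_batch(limit, total) for limit in (1, 5, 10)}

timings = asyncio.run(sweep())
# Against a real API with quotas, retries and rate limits, the curve can
# flatten or reverse (as we observed: 10 was slower than 5 for us).
```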

You must tune it and find the configuration that best covers your API’s typical usage.

Migrate Google API from Cloud Asset to Troubleshooter

Concurrent processing is quite cool, but we didn’t stop investigating Google APIs, trying to find one faster than Cloud Asset.

We contacted Google support several more times to try to find a better-performing API. Finally, a nice guy told us to test the Troubleshooter API that Google uses internally. This API supports IAM checking: we can send a user, a role and a project in the payload, and it tells us whether that permission or role is granted to the user. That matches exactly what we were looking for.
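As a hedged sketch of what such a check might look like with the google-cloud-policy-troubleshooter Python package (the client, request and enum names below are my assumptions from that package’s surface, not code from our production service; note that the request carries a single permission, e.g. one implied by the Owner role):

```python
def project_resource_name(project_id: str) -> str:
    # Full resource name format the Troubleshooter API expects for a project
    return f"//cloudresourcemanager.googleapis.com/projects/{project_id}"

async def is_permission_granted(principal: str, project_id: str, permission: str) -> bool:
    # Import inside the function so the sketch can be read and run without the
    # package installed; requires google-cloud-policy-troubleshooter (assumed API).
    from google.cloud import policytroubleshooter_v1

    client = policytroubleshooter_v1.IamCheckerAsyncClient()
    response = await client.troubleshoot_iam_policy(
        request={
            "access_tuple": {
                "principal": principal,  # e.g. "alice@example.com"
                "full_resource_name": project_resource_name(project_id),
                "permission": permission,  # e.g. "resourcemanager.projects.setIamPolicy"
            }
        }
    )
    # GRANTED means the principal effectively holds the permission
    return response.access == policytroubleshooter_v1.AccessState.GRANTED
```

One call answers the yes/no question directly, instead of listing all owners and searching through them.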

This API has a Python client, and since it also provides an async client, we can use it in async mode for concurrent processing as well.

After coding, I took the same test case as before.

Here is what it looks like for 1 call: it takes 1.72s instead of 3.2s with Cloud Asset.

We gained 1.5 seconds. A small gain in processing time, but a big step: the performance improved by 46%.

Here is the result for 5 “owner checks” in concurrency: it takes 2.02s compared to the previous 5.19s with Cloud Asset, a 61% performance improvement. That is huge!

If you are interested in the result of the same test case in serial processing with the Troubleshooter API: it takes about 11 seconds.

With the Troubleshooter API, concurrent processing is 82% faster (2s vs 11s) than serial processing.

Conclusion

At the beginning, using Cloud Asset and serial processing, it took 15s to accomplish 5 “owner checks”. In the end, after migrating to the Troubleshooter API and concurrent processing, the same test finishes in 2s. That is about 87% faster.

In this article, I presented the use cases of 1, 5 and 10 concurrent processes, because that matches most of our users’ behavior. I didn’t run a heavy load test like 1,000 concurrent processes; the approach presented here may hit its limits there, but if you have other use cases or ideas, feel free to contact me. Any ideas are welcome.

This is a real story of how we developed and improved the performance of our services with the help of observability tools. I hope it may be helpful if you have the same use case. Try running bulk processing concurrently, and never stop looking for better tools.

Thanks for reading.

PengB

an IT guy @Adeo DevDataOps. Like nature and self-challenge. Former @WorldlineExpertCommunity.