Write robust APIs, the right path, the right library

Glauco Uri · Published in Geek Culture · Jan 25, 2023 · 8 min read

Python software architecture

Photo by Vicky Sim on Unsplash

Nowadays, architectures like microservices, or more generally service-oriented architectures, are widely used. This is most evident in architectures that rely on external services such as a database, a mail server, or a log server, but something similar also happens between the actors of the same piece of software.

In this article, I suggest Tenacity, a useful Python library that can save us a lot of time and headaches. Given an API organized into clean sub-services, Tenacity enhances the robustness of the interactions between those services in a really simple and effective way.

Starting from an intuitive e-commerce use case, I follow the SOLID principles to split the problem into small logical pieces, ready to be used with Tenacity in order to create a production-grade service.

Now, coming back to the idea that in software different pieces of code do different things: this remains true even when they are glued together in the same code base without clear boundaries between responsibilities. Some people switch to the term "framework" here, meaning that even a single program, technically called a monolithic application, can be rethought as a composition of multiple services, a kind of intra-service architecture. This is where the S.O.L.I.D. principles can help light the way. These winning practices strongly emphasize a few very effective concepts for creating well-structured code.

Starting from the beginning, the "S" stands for single-responsibility (SRP): each function or class should be responsible for a single section of the logic. This is a crucial point, because by identifying the boundaries of each piece of code we are able to write functions, classes, inheritance, and all dependencies in a clean and well-organized structure. Organizing the code into isolated areas also promotes separation of concerns (SoC), another crucial pattern needed to let the code grow clear and well structured.
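As a minimal sketch of the idea (the function names are invented purely for illustration), each function owns exactly one concern, so callers compose small pieces instead of one big blob of logic:

def fetch_cart(user_id):
    ...  # talks only to the database

def compute_total(cart):
    ...  # pure pricing logic, no I/O

def charge_customer(user_id, amount):
    ...  # talks only to the payment service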

Please refer to the official documentation of these principles for more details.

Let’s start with a simple use case

Suppose we are a company writing an e-commerce service: we must move data between the layers of the architecture.

  • From the database, we gather information about articles, stocks, shopping carts, and prices.
  • For authentication and payments, we should call an external service.
  • We should inform the user in some manner, maybe via e-mail, so another service comes into play.
  • No less important are the infrastructure services: maybe, during the whole data flow, we collect pieces of information to send to yet another service, building a rich collection of overall performance data and eventually applying some early-warning rules. If the average response time degrades, we will have to turn on new resources to help us, but… wait… this sounds like a new service to be integrated!
  • And the package tracking service? It can be very sexy in our application.
  • What about super-fast SMS promotion services?

Ok, these are only some examples in which the term service is used to identify functionality that is covered elsewhere. At this point it is not so relevant whether it is external, for example a managed solution from a cloud provider, or internal to our framework.

Focusing solely on the purchase function, our code should look like this:

Samples are in Python-like pseudocode; they will be very easy to read even for less experienced readers.

def purchase_something():
    check_user_auth()
    check_goods_availability()
    do_transaction()
    place_order()
    give_reward_to_user()
    send_confirmation_email()
    send_something_to_metrics_server()

These steps describe a more or less reasonable data flow and are organized to respect the SoC principle.

check_user_auth: we must ensure the user is legitimate, that he or she can access the resource, and, in general, check the account status.

check_goods_availability: stock availability, what if another user has bought the last piece?

do_transaction: Ok, let's start to touch money; we must be sure that the whole amount has been paid before starting any movement of goods.

place_order: This is, for sure, the most critical step. Now we can begin to move goods: we must inform the warehouse, ensure that the goods are really available and in good condition (what if a package turns out to be damaged?), put all the products together, and ship them. For simplicity, I've omitted steps like emptying the shopping cart and the data integration with the shipment company.

give_reward_to_user: This step is explicit because it represents all post-sale operations: new computations, new accesses to the DB or to other services. But let me spoil something… what if something goes wrong here? Should the entire sale be aborted?

send_confirmation_email, send_something_to_metrics_server: these steps are less interesting because they don't add any value to our transaction, but they could still fail in some manner.

The service is almost ready…

Yes, it is! But with a plain approach, the entire transaction works ONLY if all interactions between services work perfectly.

If not?
This code is simply not resilient.

In the real world, we have to face problems such as network issues related to DNS or defective devices. Disks break, external services can be offline due to maintenance, and dozens of other things can go wrong. Databases do not always respond with the same performance, and the same goes for the email provider, the metrics service, and so on. Any of these can trigger unexpected timeouts or totally isolate one or more services.

We can start to gain awareness of these kinds of hitches by looking at the biggest and most reliable cloud providers: each publishes an SLA to inform all users that no service is 100% guaranteed.

These are some uptime SLAs from the main cloud providers:

Even though ours is not a mission-critical application, it is very annoying to lose business, isn't it?

So what can we do to mitigate these failures?

Add robustness to our code! We must identify and isolate what must work, what can be postponed, and what can possibly be left out.

Let’s start to split the problem and give some priorities.

In real life, the sales department will be very disappointed to learn that an entire day of sales has been lost because we could not send an e-mail. Honestly, not all steps are mandatory to conclude a sale, which means that not all steps are equally important. Let's look at the exercise again with this element in mind and check which steps are mandatory for our business and which are not.

def purchase_something():
    # Mandatory
    check_user_auth()
    check_goods_availability()
    do_transaction()
    # Postponable
    place_order()
    give_reward_to_user()
    send_confirmation_email()
    # Optional or expendable
    send_something_to_metrics_server()

Ok, this gives us a clear view of the prioritization:

Mandatory steps: must be computed in real time and give the user a response as soon as possible; if something goes wrong, no way, the sale must be aborted.

Postponable steps: these introduce a concept of asynchronicity into our code, which means this kind of work is well suited to a producer-consumer architecture. We can confirm the order to our user within a few seconds, but the order itself can be processed in minutes or hours. In the worst case, we have time to fix the problem and reprocess all orders the day after.

Optional or expendable steps: the kind of operations that can be recovered later, perhaps nightly, or lost entirely; in the worst case we will lose some optional data.

If the code is well organized, we now know exactly which steps cannot fail, which ones would better not fail, and which ones are allowed to fail.
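For the postponable class in particular, a minimal producer-consumer sketch could look like the following (an in-process queue and stubbed step functions, purely for illustration; a real system would more likely use a message broker or task queue):

import queue
import threading

# Stubs standing in for the postponable steps from the pseudocode above.
def place_order(order): ...
def give_reward_to_user(order): ...
def send_confirmation_email(order): ...

postponable_jobs = queue.Queue()

def worker():
    # Consumer: drains postponable work whenever it arrives.
    while True:
        order = postponable_jobs.get()
        place_order(order)
        give_reward_to_user(order)
        send_confirmation_email(order)
        postponable_jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def confirm_purchase(order):
    # Producer: the mandatory steps have already succeeded inline,
    # so the postponable ones are just queued and acknowledged.
    postponable_jobs.put(order)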

Introducing failover code

From here on, the article becomes more technical and more Pythonic.

How do we manage code that can fail, or that must never fail? By adding some failover handling to our code.

In the last few years I have seen a lot of self-made solutions for handling failovers, so let's run down the main approaches.

  1. the try/catch approach.

try:
    do_stuff()
except:
    do_stuff()  # something goes wrong, but we try again!

When an error occurs in the do_stuff() function, we trap the error and rerun the function. This can cause a lot of unexpected behavior related to atomicity, transactions, and so on. Applying this approach, we quickly notice that a single try cannot cover many cases. Delays are also sometimes used to let the remote service solve its issue before calling it again.

2. multiple try/catch with rules or delay

try:
    do_stuff()
except:
    wait()
    do_checks()
    try:
        do_stuff()
    except:
        ...

Ok, this is a little more insistent, but the problem is only postponed: we cannot definitively establish how many try/catch blocks will cover all cases. So the next idea is to try indefinitely until the function finally succeeds.

3. loop until it works

while True:
    try:
        do_stuff()
        break
    except:
        pass

We quickly find this one too insistent, so much so that it may never end. So we must introduce some rules to stop it after a certain number of tries or after a certain amount of time spent trying to do the same thing.
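Hand-rolled, those rules usually end up looking something like the sketch below (purely illustrative, with an arbitrary attempt cap and a crude backoff):

import time

MAX_ATTEMPTS = 5

def call_with_retries(func):
    # Stop after a fixed number of attempts, waiting a bit longer each time.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return func()
        except Exception:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(2 ** attempt)  # crude exponential backoff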

These are only some introductory examples and just the tip of the iceberg: those who follow this approach, and there are many, will have to take several additional aspects into consideration, besides writing far-from-trivial tests.

It's clear now that we need a specific solution for these use cases, a robust, well-tested, specialized library: let me introduce Tenacity.


Tenacity

Tenacity is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything. It originates from a fork of retrying which is sadly no longer maintained. Tenacity isn’t api compatible with retrying but adds significant new functionality and fixes a number of longstanding bugs.

This is a clear example (from the official documentation) of what this library can do to make our code more resilient.

@retry
def never_gonna_give_you_up():
    print("Retry forever ignoring Exceptions")
    raise Exception

Decorators are a pretty and concise way to ‘add’ tenacity to the code.

With the retry decorator, we can apply some useful rules such as (see the sketch after this list):

  • retry if a certain value is not returned
  • retry a given number of times
  • retry for a certain amount of time
  • retry if some exception occurs
  • wait (or not) between each retry, and establish whether and how the wait time changes
  • execute a function before and after any attempt
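Several of these rules can be combined in a single decorator. Here is a hedged sketch (call_payment_service is an invented name; the keyword arguments are the ones documented by Tenacity):

import logging

from tenacity import (retry, retry_if_exception_type, stop_after_attempt,
                      stop_after_delay, wait_exponential, before_log, after_log)

logger = logging.getLogger(__name__)

@retry(
    retry=retry_if_exception_type(ConnectionError),     # retry only on this exception
    stop=stop_after_attempt(5) | stop_after_delay(30),  # give up after 5 tries or 30 seconds
    wait=wait_exponential(multiplier=1, max=10),         # exponential backoff between tries
    before=before_log(logger, logging.DEBUG),            # hook executed before each attempt
    after=after_log(logger, logging.DEBUG),               # hook executed after each attempt
)
def call_payment_service():
    ...  # the fragile call we want to protect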

The library itself is mature and well documented. I could add examples from my own code to show how it can be used, but my intention was to accompany you this far using logical steps and plausible examples; for more information, refer to the official documentation. In my own projects, I solved the problem of switching from synchronous to asynchronous processing in a clean way by applying only a couple of decorators. Another aspect I find crucial is that with this approach we work on well-tested and fully reproducible code.
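Purely as an illustration (the retry parameters below are arbitrary and not taken from the author's code), two of the steps from our use case could be decorated like this:

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

# Mandatory step: a few quick retries on transient network errors,
# then give up so the sale can be aborted.
@retry(retry=retry_if_exception_type(ConnectionError),
       stop=stop_after_attempt(3),
       wait=wait_fixed(1))
def do_transaction():
    ...

# Expendable step: retry more patiently; if it still fails, the caller can
# catch the error and simply lose some optional data.
@retry(stop=stop_after_attempt(10), wait=wait_fixed(30))
def send_something_to_metrics_server():
    ...

The mandatory step fails fast so the sale can be aborted quickly, while the expendable one is given much more slack.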

So don’t reinvent the wheel, this is the right way to integrate multiple resilient and fully testable services in a robust and maintainable manner.

Features declared

  • Generic Decorator API
  • Specify stop condition (i.e. limit by number of attempts)
  • Specify wait condition (i.e. exponential backoff sleeping between attempts)
  • Customize retrying on Exceptions
  • Customize retrying on the expected returned result
  • Retry on coroutines
  • Retry code block with a context manager
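The last feature, for instance, follows the pattern shown in the official documentation, roughly like this sketch (do_stuff is a placeholder):

from tenacity import Retrying, RetryError, stop_after_attempt

def do_stuff():
    ...  # placeholder for the fragile block

try:
    # Retry an arbitrary code block instead of a whole decorated function.
    for attempt in Retrying(stop=stop_after_attempt(3)):
        with attempt:
            do_stuff()
except RetryError:
    pass  # all three attempts failed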

Disclaimer: I'm not related, directly or indirectly, to the Tenacity library or to anyone involved in its development. I'm only a happy user.
