How relying too much on a framework can sometimes introduce issues

Marco Mornati
Published in Decathlon Digital
Aug 10, 2020 · 7 min read

Today, when we think of Java, most of the time we immediately think of Spring Boot with Hibernate and all the frameworks that help us code faster: Spring Initializr, a module choice, 3 lines of code and 10 annotations, and you can test your first application.
But this magic also comes with some drawbacks.
The issue is that when you delegate everything to a framework, at some point you can lose control and hit problems that take days to understand and fix. We are going to talk about Java and Spring, just to have a practical example, but the same can happen with other frameworks or languages.

In this post, we would like to share how, on a long-lived application, one developed and maintained over several years by many different developers, new features can break everything. Or, on the other hand, how a single annotation can couple the performance of an external API to your application and make it unstable (or completely broken) until the external service comes back.

Application Context

The application we will use for this post is a payment proxy, but the story would be the same for any proxy application that makes an external call and has some internal logic of its own.

Payment Proxy Application - Functional Architecture

Imagine something like the previous image:

  • Customers make an API call to the application to make a payment.
  • We need to store this information! In case of a network problem, a payment outage, or whatever else, we need to save the request and its status somewhere. Let's put it in the local database.
  • Based on the call, the proxy triggers the right payment method, which lives outside the application stack and infrastructure.
  • The answer is returned to the original client.

That’s easy, isn’t it?

What follows is the story of the LOTR Ltd company implementing this payment proxy.

First Implementation

With the provided information, the first dev team started with a simple application built around three main components (sketched just below):

  • PaymentController: a controller exposing a /payment API
  • TransactionService: a service storing information into the database
  • PaymentService: a REST service making the external HTTP call
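Here is a minimal sketch of what that first version might have looked like. This is our reconstruction, not the original code: the three classes would live in separate files, PaymentRequest, PaymentResponse, Transaction, Status, and TransactionRepository are assumed supporting types, and the provider URL is made up.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@RestController
public class PaymentController {

    private final PaymentService paymentService;

    public PaymentController(PaymentService paymentService) {
        this.paymentService = paymentService;
    }

    @PostMapping("/payment")
    public PaymentResponse pay(@RequestBody PaymentRequest request) {
        return paymentService.pay(request);
    }
}

@Service
public class PaymentService {

    private final RestTemplate restTemplate = new RestTemplate();
    private final TransactionService transactionService;

    public PaymentService(TransactionService transactionService) {
        this.transactionService = transactionService;
    }

    public PaymentResponse pay(PaymentRequest request) {
        // Call the external payment provider (URL is made up for the example)
        PaymentResponse response = restTemplate.postForObject(
                "https://payment-provider.example/pay", request, PaymentResponse.class);

        // The transaction is saved only after a successful provider response
        transactionService.save(request, response);
        return response;
    }
}

@Service
public class TransactionService {

    private final TransactionRepository repository;

    public TransactionService(TransactionRepository repository) {
        this.repository = repository;
    }

    @Transactional
    public Transaction save(PaymentRequest request, PaymentResponse response) {
        return repository.save(new Transaction(request, response.getStatus()));
    }
}
```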

The enhancements over time

The application was used at LOTR Ltd for several years, and developers came and went, one after another. In 10 years, LOTR Ltd had 50 to 60 developers working on the project codebase, adding features, refactoring, … and of course, from time to time, bugs were found that required a quick fix.

Step 1: Save all transactions
The first version of the application worked but, during the first months, the dev team discovered it was not as useful as expected. It all started a week after the go-live, when a customer called the LOTR Ltd support center:

C: “Good morning, I made an order 5 minutes ago but I had problems with the payment. Can you help me, please?”
D: “Good morning, thanks for calling. I can see nothing here about your order, but I will check with our technical team and call you back in a few minutes.”

Unfortunately, even the technical team was not able to see anything about the payment transaction. After a few hours, they got it: the production code was saving the payment only if the REST service returned a success response. In any other case, the payment proxy answered with an error (or at least something interpreted as an error by the caller), but the transaction status was simply lost.

After a few hours the dev team came up with a fix:
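The original gist is no longer embedded in this export; based on the description that follows, the fix plausibly looked something like this (a reconstruction using the same hypothetical classes as above):

```java
// New method in TransactionService
@Transactional
public Transaction create(PaymentRequest request) {
    return repository.save(new Transaction(request, Status.PROCESSING));
}

// Revised PaymentService.pay()
public PaymentResponse pay(PaymentRequest request) {
    // Store the transaction in PROCESSING state BEFORE calling the provider,
    // so a trace exists even if the call fails
    Transaction transaction = transactionService.create(request);

    PaymentResponse response = restTemplate.postForObject(
            "https://payment-provider.example/pay", request, PaymentResponse.class);

    // Update the status with the payment provider's response
    transaction.setStatus(response.getStatus());
    return response;
}
```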

The transaction is now stored with a PROCESSING status before the call is made, and then updated with the payment provider's response. Great job, guys.

Step 2: Last transaction status not updated
During the first hours after the fix, the support center called the technical team back:

C: Good morning, we are checking the latest orders, but it seems payments are not being processed. All the payments are in PROCESSING state. Is there a problem on your side?

Indeed, the dev team had shipped a quick fix, but something had not been correctly tested.
After a while, they understood where the problem was: the transaction status was being changed outside a @Transactional method, so nothing was persisted to the database. A new fix was delivered just a few minutes later, and everything finally worked as expected.
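That second gist has also been lost in this export; given the explanation later in this post, the fix most plausibly moved @Transactional up onto the method calling the payment service, so the entity stays managed and the status change is flushed on commit. Again, a reconstruction:

```java
// Fix #2 (reconstruction): make the whole flow transactional so that the
// Transaction entity stays managed and setStatus() is flushed on commit.
@Transactional
public PaymentResponse pay(PaymentRequest request) {
    // create() uses the default REQUIRED propagation, so it joins this transaction
    Transaction transaction = transactionService.create(request);

    // The external HTTP call now runs INSIDE the database transaction,
    // holding a pooled connection for the entire duration of the call
    PaymentResponse response = restTemplate.postForObject(
            "https://payment-provider.example/pay", request, PaymentResponse.class);

    transaction.setStatus(response.getStatus());
    return response;
}
```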

The Armageddon Day

At LOTR Ltd everything worked well for years, and the dev team kept going: new features, new payment methods for customers, code improvements, ...
But one day, for no apparent reason, the application started crashing and reboot-looping, and the dev team could not figure out how to make it stable again.

Can you spot where the LOTR Ltd dev team went wrong in the previous snippet?

Years after the first line of code, nobody from the original team is left to explain why certain things were added or how everything works. Fixing this problem could take a very long time.

Problem and solution

We can now step out of our story, and I'm pretty sure every one of you has recognized something already seen in real life. That's the life of almost all IT projects, and it is the reason we need proper documentation and tested, readable code. But that is another story…

The problem comes from the @Transactional annotation. We moved it onto the method calling the external payment service, which keeps the database connection open until the end of the method.
This automatically couples the performance of our payment proxy to that of the external services: when a payment provider starts answering our HTTP calls more slowly, each database connection is held for longer, and under high traffic the connection pool gets exhausted.
A database usually responds within a few milliseconds, up to seconds for complex queries; for this reason, in a "standard" architecture, the connection pool is configured with only 10 or 20 connections, which should be enough for hundreds of concurrent API calls per second.
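For reference, with Spring Boot's default HikariCP pool that sizing is plain configuration; the values below are common defaults, not necessarily what this application used:

```properties
# Maximum number of connections in the pool (HikariCP's default is 10)
spring.datasource.hikari.maximum-pool-size=10
# How long a request waits for a free connection before failing, in ms
spring.datasource.hikari.connection-timeout=30000
```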

To fix the problem, we just need to handle the transaction update and the HTTP call separately.
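Here is a minimal sketch of that separation, under the same assumptions as the earlier snippets (the actual implementation lives in the repository linked at the end): each database operation gets its own short transaction, and the slow HTTP call runs with no connection held.

```java
@Service
public class PaymentService {

    private final RestTemplate restTemplate = new RestTemplate();
    private final TransactionService transactionService;

    public PaymentService(TransactionService transactionService) {
        this.transactionService = transactionService;
    }

    // No @Transactional here: the slow HTTP call must not hold a DB connection
    public PaymentResponse pay(PaymentRequest request) {
        // Short transaction #1: create the PROCESSING record, release the connection
        Transaction transaction = transactionService.create(request);

        // Slow external call, made while holding NO database connection
        PaymentResponse response = restTemplate.postForObject(
                "https://payment-provider.example/pay", request, PaymentResponse.class);

        // Short transaction #2: persist the final status
        transactionService.updateStatus(transaction.getId(), response.getStatus());
        return response;
    }
}

@Service
public class TransactionService {

    private final TransactionRepository repository;

    public TransactionService(TransactionRepository repository) {
        this.repository = repository;
    }

    @Transactional
    public Transaction create(PaymentRequest request) {
        return repository.save(new Transaction(request, Status.PROCESSING));
    }

    @Transactional
    public void updateStatus(Long id, Status status) {
        // Re-load the entity inside the transaction: it is managed here,
        // so the change below is flushed automatically on commit
        Transaction transaction = repository.findById(id).orElseThrow();
        transaction.setStatus(status);
    }
}
```

With this shape, a connection is borrowed from the pool only for the few milliseconds each short transaction actually needs.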

Gatling Load Test

The previous image shows what happens during a high-load period when one of the external services becomes slower: the left side is before the fix, the right side after.
Before the fix, most calls in the application are waiting for a database connection to create the payment transaction before calling the provider. Because the connection is not released until the end of the whole process, requests queue up and wait until the HTTP connection times out. The result is an error on the client side, with most payments failing.
After the fix, the application behaves normally: calls come in, the transaction is created and stored in the database (with the connection released a few milliseconds after the operation), and then the HTTP call is made while waiting for the response. Since the external service is slower, we still wait longer, and with this implementation we block an HTTP thread until the final answer is sent to the client. But on that side of the architecture there are usually far more resources available than on the database side.

Database connection usage time

This is what happens in the application before and after the fix: the mean connection usage time drops drastically, back to the values we should expect for a database connection.

Conclusion

This is a very simple application we created to show you a possible problem. With only a few lines of code, it was easy to see where the problem was and how to fix it; in real life, this kind of application can have several services linked to each other, and very long classes that take a long time to debug.
There are several other improvements you could add here, and I know you are already thinking of them. Async HTTP calls? Yes, sure, a proxy should always be async… even if that consumes more memory; at some point, resources will be limited somewhere :)
Object immutability? That too! When objects can be modified anywhere in your code, debugging becomes difficult.

The lesson learned here is that even when things seem magical, developers should always keep an eye on what is actually happening. With a simple overview we could see that the application was crashing because no database connections were available, but we had to dig further to understand the what, and then find the why and the where. This can take a very long time.

If you are interested in playing with what is described here, a sample application is available in this GitHub repository:

  • The persist_problem branch contains the application after the first fix: the transaction is stored on every call and its status is then updated, but the update is never persisted to the database.
  • The fix_transactional branch contains the version after the second fix: it works, right up until the “Armageddon” crash described above.
  • The main branch contains the final, working version.
