Fixing “SSL error: decryption failed or bad record mac”

Philamer Sune
4 min read · May 31, 2020


A couple of days ago, I was greeted by a message that a particular API endpoint in our application was returning 500 Server Error. I was curious, because I had tests for that endpoint and had even called it manually a couple of times while coding. I logged in and immediately saw this error — django.db.utils.OperationalError: SSL error: decryption failed or bad record mac — pointing at a line that queries the database. I grabbed the terminal and tried python manage.py shell to see if I could get results back from my models. Ha! I could! This is weird.

The affected part of the application is where the system updates the user’s registration tier. There are three steps in our registration process: (1) the user registers and is assigned tier=initial_reg; (2) the user completes their profile with the important information and is assigned tier=pending; (3) an admin verifies the submitted information and, if approved, the user is assigned tier=full_reg. All other endpoints that query the database work properly, and I was starting to scratch my head over why it only happens when an admin approves a registration. It also doesn’t help that the endpoint in question works correctly in my local Docker development setup. After a lot of pdb and serious keyboard banging, I finally found the issue.

When an admin approves the registration, the system generates a Membership ID; angels sing, signifying a revered divine event in the user’s life within the app. The ID is in a specific format, so unfortunately I can’t use a UUID. The generated ID is then saved in the database, and if an IntegrityError occurs, another one is generated until the save succeeds. Membership ID generation can therefore block the request, so I implemented a timeout() utility method which spawns a new process and times out after a specified maximum number of seconds. This is where it all goes awry.
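The generate-and-retry pattern looks roughly like this. This is a toy stand-in, not the project code: it uses an in-memory SQLite table with a UNIQUE constraint, and both the `MEM-` ID format and the function names are hypothetical.

```python
import random
import sqlite3

# Toy stand-in for the real models: a table with a UNIQUE membership ID.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member (id INTEGER PRIMARY KEY, membership_id TEXT UNIQUE)"
)

def generate_membership_id():
    # Hypothetical formatted ID; the real format is project-specific.
    return "MEM-%05d" % random.randrange(100000)

def assign_membership_id():
    # Keep generating until the INSERT succeeds; a duplicate ID violates
    # the UNIQUE constraint and raises IntegrityError, so we just retry.
    while True:
        candidate = generate_membership_id()
        try:
            conn.execute(
                "INSERT INTO member (membership_id) VALUES (?)", (candidate,)
            )
            conn.commit()
            return candidate
        except sqlite3.IntegrityError:
            continue
```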

A helper function to limit execution of a function in seconds.
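A minimal sketch of that helper, assuming the same shape as the project’s version (the exact signature and defaults here are my own): the child process ships its result back through a queue, and if it outlives the limit it is terminated and a default value is returned instead.

```python
import multiprocessing
import queue

def _call(result_q, func, args, kwargs):
    # Runs in the child process and ships the result back to the parent.
    result_q.put(func(*args, **kwargs))

def timeout(func, args=(), kwargs=None, max_seconds=5, default=None):
    """Run func in a separate process; give up after max_seconds."""
    result_q = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=_call, args=(result_q, func, args, kwargs or {})
    )
    proc.start()
    proc.join(max_seconds)
    if proc.is_alive():
        # Still running past the limit: kill it and fall back to default.
        proc.terminate()
        proc.join()
        return default
    try:
        return result_q.get(timeout=1)
    except queue.Empty:
        return default
```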

The SSL error: decryption failed or bad record mac occurs either when the certificate is invalid or when the message hash (MAC) has been tampered with; in our case it’s the latter. Django creates a single database connection the first time it queries the database. Any subsequent calls reuse this existing connection until it expires or is closed, at which point Django automatically creates a new one on the next query. The PostgreSQL engine in Django uses psycopg2 to talk to the database; according to the documentation it is level 2 thread safe — safe to share across threads, but not across processes. The timeout() method, however, uses the multiprocessing module: the forked child inherits a copy of the parent’s open connection, and once both sides write to the same SSL stream the record sequence diverges and the MAC check fails. There are different ways to fix this. We can either (1) use plain threads instead of spawning a new process, or (2) use a new database connection inside the timeout() method. We could also (3) scrap the timeout() method altogether and handle the async task properly via Celery.
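For illustration, option (1) could look like this hypothetical thread-based variant. Because no process is forked, the SSL stream of the shared database connection is never duplicated; the trade-off is that a timed-out thread cannot be killed and keeps running in the background.

```python
import threading

def timeout_via_thread(func, args=(), kwargs=None, max_seconds=5, default=None):
    # Hypothetical sketch of option (1): run func in a thread and wait
    # at most max_seconds for it to finish.
    result = {}

    def runner():
        result["value"] = func(*args, **(kwargs or {}))

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(max_seconds)
    if worker.is_alive():
        # A thread cannot be terminated; it keeps running after we stop
        # waiting, which is the main caveat of this approach.
        return default
    return result.get("value", default)
```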

The timeout() method is a simple, generic, quick way to limit the execution time of a method. We can use it for anything that takes time, not only things that query databases, and I think most of the time spawning a new process is safer than using threads. At this point I also don’t want to introduce and maintain another moving part like Celery yet; so we’re choosing to use a new database connection to fix the issue! Unfortunately, it isn’t as pretty as it sounds. To get a new database connection, you just close the existing one and Django creates a new one. But this doesn’t work if you are inside a database transaction — it fails with InterfaceError: connection already closed. There’s also no easy way to pass a new database connection when using the models. The closest is the .using() method, where you specify a database alias. You could probably overload the cached property django.db.connections.databases and add your alias with a new connection, but I wouldn’t really do that! Haha! So we’re left with executing custom SQL directly:

A sample rawSQL query using a new database connection.
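A sketch of what that helper might look like. The function name and the idea of a generic statement runner are my own; the point is only the pattern — build a fresh psycopg2 connection from settings.DATABASES, run one statement, close it — so the forked child never touches the connection the parent holds.

```python
def run_on_fresh_connection(sql, params=()):
    """Execute one statement on a brand-new psycopg2 connection.

    Hypothetical sketch, not the exact project code: the connection
    parameters come from Django's settings, and the connection lives
    only for this one statement.
    """
    # Imported lazily so the sketch stays self-contained.
    import psycopg2
    from django.conf import settings

    db = settings.DATABASES["default"]
    conn = psycopg2.connect(
        dbname=db["NAME"],
        user=db["USER"],
        password=db["PASSWORD"],
        host=db["HOST"],
        port=db["PORT"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            conn.commit()
            # SELECTs have a description; UPDATE/INSERT do not.
            return cur.fetchall() if cur.description else None
    finally:
        conn.close()
```

Called from inside timeout()’s child process, this sidesteps the shared SSL stream entirely, at the cost of dropping back to hand-written SQL instead of the ORM.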

The issue we talked about could probably have been detected earlier if I had run the integration tests against PostgreSQL, as in production, instead of in-memory SQLite. I also could have tested that it works properly in the dev environment, which uses a proper PostgreSQL. Silly me! Overall, the debugging process has kept me busy during this quarantine period while working from home. It was also a little fun and a good learning experience on the way to becoming a better Software Engineer.

Keep safe folks! Let’s come back stronger after this pandemic.
