Post Mortems @ LBB

Akarsh Satija
Jun 16, 2017

A list of our stupidities and careless mistakes.

Mongo went down

End of May ’17

  • It was around 9:30 AM when I got an AWS notification that our MEAN stack’s EBS environment health had changed to Severe. I checked on my phone to confirm whether it was a false alarm. It wasn’t; the website was down. I was driving to the office, so I pulled over and started my laptop. #DevOpsGuy
  • Our Node app was down. According to the logs, it was “Unable to connect to mongoDB”. I logged in to the Mongo instance, and it looked like the mongod daemon was down. I tried restarting it before checking for the cause, but it wouldn’t start.
  • Next, the Mongo logs. The last few entries made it clear that Mongo was not starting because of low disk space. Damn… I had to clear up some space just to start Mongo, and only then look into the cause.
  • There were a lot of Mongo dumps in the /tmp directory. They had been backed up to S3 but were never deleted from the machine. I cleaned up /tmp and started Mongo. Phew… it came up fine.
  • It turned out the script we were using to take backups wasn’t cleaning up after itself (a sketch of the missing cleanup step follows this list).
  • Why did it go down at that particular time? We had a cron for automatic backups scheduled at 0343 hrs, but when we moved to AWS Mumbai we never reconfigured it for the new timezone, so it was actually running at around 0913 hrs (0343 + 0530).
  • Why did this suddenly become an issue when it had been running fine for months? To reduce cost, we had shrunk our Mongo instance from a 1 TB SSD to roughly a 50 GB SSD.
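
For context, here is a minimal sketch of what a backup job with proper cleanup looks like, written as a small Node script. This isn’t our actual script; the dump directory, bucket name and file layout are made up, and it assumes mongodump and the AWS CLI are installed on the instance.

    // backup-and-clean.js: a sketch, not our real backup script.
    // Assumes mongodump and the AWS CLI are installed; paths and bucket are placeholders.
    const { execSync } = require('child_process');
    const fs = require('fs');
    const path = require('path');

    const DUMP_DIR = '/tmp/mongo-backups';
    const BUCKET = 's3://example-backup-bucket/mongo';

    function runBackup() {
      const stamp = new Date().toISOString().replace(/[:.]/g, '-');
      const dumpPath = path.join(DUMP_DIR, 'dump-' + stamp);
      fs.mkdirSync(DUMP_DIR, { recursive: true });

      // 1. Take the dump.
      execSync('mongodump --out ' + dumpPath, { stdio: 'inherit' });

      // 2. Ship it to S3.
      execSync('aws s3 cp --recursive ' + dumpPath + ' ' + BUCKET + '/' + stamp + '/', { stdio: 'inherit' });

      // 3. The step our original script was missing: delete the local copy
      //    so dumps don't pile up and eat the disk.
      fs.rmSync(dumpPath, { recursive: true, force: true });
    }

    runBackup();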

LBB.in went down — WordPress

Sun, 29 Jan 2017

  • At around 6 in the evening our website went down with an error saying “Error in creating a database connection”, while we were in Mumbai waiting for our flight at the airport.
  • It happened a few days after we migrated from WP Engine (London) to AWS Mumbai.
  • Debug mode on: about 1 request in 5 was still getting a 200, but latency was very high. Surprisingly, the cache was also failing for most of the routes.
  • The first thing I did was check on the AWS console whether the database was up or not. It was up but under heavy load; there were 100+ active connections that weren’t getting closed. I logged into one of our EBS instances to check whether the database was reachable.
  • I wasn’t sure what these connections were. We didn’t have any crons configured for that time. We had to kill all of them to bring the site back (see the sketch after this list for one way to do that).
  • First, I restarted EBS and RDS to kill all the active connections.
  • The website came back but was still flaky, and connections kept increasing. Why? No idea yet.
  • Assumption: since we hadn’t removed the domain from WP Engine, and it is a managed service, there must have been some crons configured there, running at this time against the DB and pinging our domain for execution.
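
For reference, one way to inspect and clean up runaway MySQL connections from a Node script is sketched below. This isn’t what we ran that day (that was a blunt restart); it assumes the mysql2 client, and the host, credentials and idle threshold are placeholders.

    // kill-idle-connections.js: a rough sketch, assuming the mysql2 client.
    // Host, credentials and the 300-second idle threshold are placeholders.
    const mysql = require('mysql2/promise');

    async function killIdleConnections() {
      const conn = await mysql.createConnection({
        host: 'example-rds-host',
        user: 'admin',
        password: process.env.DB_PASSWORD,
      });

      // List every connection the server currently holds.
      const [rows] = await conn.query('SHOW FULL PROCESSLIST');

      for (const row of rows) {
        // Kill connections that have been idle ("Sleep") for more than 5 minutes.
        if (row.Command === 'Sleep' && row.Time > 300) {
          await conn.query('KILL ' + row.Id);
        }
      }

      await conn.end();
    }

    killIdleConnections().catch(console.error);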

Future Precautions:

  • Checked for and removed all of WP Engine’s leftover plugins
  • Removed the install on WP Engine
  • Enabled memcache

Used memory kept increasing | Node app

  • We had been facing this issue for a really long time: the memory used by the app on our instance kept on increasing.
  • We had to restart the process manually whenever it reached the memory limit.
  • We tried a lot of debugging using profiling and node-inspector.
  • After a lot of effort, we found that the issue was in the newly deployed events scraping module. An array variable in that module was assigned globally in the app, and its elements kept accumulating, which kept pushing memory usage up.
  • Solution: we moved that variable into the triggered function and passed it further along as an argument, so that garbage collection could take over after the function finished executing (a simplified before/after sketch follows this list).
  • That was a very poor coding practice; there are plenty of debugging write-ups on the internet that cover it.
  • PS: profiling is a very important part of testing.
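
A simplified before/after sketch of the leak pattern, with made-up module and variable names (our real scraping module was obviously bigger):

    // BEFORE: a module-level array shared across every scrape run.
    // Nothing ever clears it, so it grows until the process hits its memory limit.
    let scrapedEvents = [];

    function scrapeEventsLeaky(rawPages) {
      for (const page of rawPages) {
        scrapedEvents.push(...parseEvents(page)); // keeps accumulating forever
      }
      return scrapedEvents;
    }

    // AFTER: the array lives inside the function and is passed along as a
    // return value/argument, so once the run finishes nothing references it
    // and the garbage collector can reclaim it.
    function scrapeEvents(rawPages) {
      const events = [];
      for (const page of rawPages) {
        events.push(...parseEvents(page));
      }
      return events;
    }

    function parseEvents(page) {
      // Placeholder parser, just to make the example self-contained.
      return [{ title: String(page).slice(0, 20) }];
    }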

High Latency at the time of notifications

  • Latency on article posts used to go very high while notifications were being sent; the WebViews were taking forever to load during that window.
  • Because of the increased memory usage while notifications were going out, normal HTTP requests took a lot of time to get executed. #NodejsIssues
  • The problem was that notifications for all the different cities and OSes were going out in one go. They were batched, but the batches were still all queued at once.
  • Node was trying to process all the batches at once, since they were queued simultaneously, and every notification scheduled for the same time also got queued at once.

Solution:

  • Limited notification execution to one city per 10 minutes.
  • Made the queue process batches one after another instead of all at once (see the sketch after this list).
  • And ultimately moved notification sending to a different instance altogether.
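
A minimal sketch of the throttling idea (not our actual worker): process each city’s batches one after another and wait ten minutes between cities, so the HTTP-serving side of the process isn’t starved. City names, batch contents and the push call are placeholders.

    const TEN_MINUTES = 10 * 60 * 1000;

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Send one city's batches strictly one after another instead of all at once.
    async function sendCity(cityBatches) {
      for (const batch of cityBatches) {
        await sendPushBatch(batch); // placeholder for the actual GCM/APNS call
      }
    }

    // Work through the cities sequentially, with a 10-minute gap between them.
    async function sendAll(notificationsByCity) {
      for (const batches of Object.values(notificationsByCity)) {
        await sendCity(batches);
        await sleep(TEN_MINUTES);
      }
    }

    async function sendPushBatch(batch) {
      // Placeholder: pretend the batch went out.
      console.log('sent a batch of ' + batch.length + ' notifications');
    }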

Multiple notifications (up to 40–50 in one go)

  • There was a time when a single app install started receiving multiple notifications in one go: the same notification delivered to it several times.
  • The issue was widespread on Android.
  • We got this complaint once in a while from our non-tech internal team, but not from our users at all; the issue mostly affected our developer team.
  • The issue was that we kept a collection of tokens for unauthenticated users, and every fresh install generated a new token that got saved into that collection. On top of that, the same token could be stored against multiple users in our DB if a person signed in again with a different ID.
  • So duplicate notifications were going out, and since our app team kept reinstalling the app, they were registering multiple tokens for the same device on our server.

Culprit: GCM doesn’t expire a token very soon after the app is uninstalled; APNS does it sooner, but not instantly. On top of that, they assign a different token to the same device on each install.

Solution:

  • We removed duplicate tokens from the DB.
  • We used GCM’s ghost notifications to figure out dead devices and alias devices.
  • Now we send bulk notifications to unique tokens only (a rough sketch of the dedupe step follows this list).
  • Pheww…..
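
A rough sketch of the dedupe step, assuming a Mongoose-style token collection; the DeviceToken model, the batch size and sendPushBatch are placeholders, not our real code:

    const BATCH_SIZE = 500;

    async function sendToUniqueTokens(DeviceToken, payload) {
      // One device can be registered many times (reinstalls, multiple user IDs),
      // so collapse the token list to unique values first.
      const tokens = await DeviceToken.distinct('token', { platform: 'android' });
      const unique = [...new Set(tokens)];

      // Send in fixed-size batches rather than one giant request.
      for (let i = 0; i < unique.length; i += BATCH_SIZE) {
        const batch = unique.slice(i, i + BATCH_SIZE);
        await sendPushBatch(batch, payload); // placeholder for the GCM call
      }
    }

    async function sendPushBatch(tokens, payload) {
      // Placeholder: pretend the batch went out.
      console.log('sent "' + payload.title + '" to ' + tokens.length + ' tokens');
    }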

That’s a list of a few of the major issues I faced at LBB.

#Peace
