Post Mortems @ LBB

Akarsh Satija
Jun 16, 2017

A list of our stupidities and careless mistakes.

Mongo went down

End of May ’17

  • It was around 9:30 AM when I got an AWS notification that our MEAN stack’s EBS environment health had changed to Severe. I checked on my phone to confirm whether it was a false alarm. It wasn’t; the website was down. I was driving to the office, so I pulled over and started my laptop. #DevOpsGuy
  • Our Node app was down. According to the logs, it was “Unable to connect to mongoDB”. I logged in to the Mongo instance, and it looked like the mongod daemon was down. I tried restarting it before checking for the cause, but it wouldn’t start.
  • Next, the Mongo logs. The last few entries made it clear that Mongo was not starting because of low disk space. Damn… I had to clear up some space just to start Mongo, and only then look into the cause.
  • There were a lot of Mongo dumps in the /tmp directory. They had been backed up to S3 but were never deleted from the machine. I cleaned up /tmp and started Mongo. Phew… it came up fine.
  • It turned out the script we were using to take backups wasn’t cleaning up after itself (a sketch of the missing cleanup step follows this list).
  • Why did it go down at that particular time? We had a cron for automatic backups scheduled at 0343 hrs, but when we moved to AWS Mumbai we never reconfigured it for the new timezone, so it was actually running at around 0913 hrs (0343 + 0530).
  • Why did this suddenly become an issue when it had been running fine for months? To reduce cost, we had shrunk our Mongo instance from a 1 TB SSD to roughly a 50 GB SSD.
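
For context, here is a minimal sketch of what a backup job with proper cleanup looks like, written as a small Node script. This isn’t our actual script; the dump directory, bucket name and file layout are made up, and it assumes mongodump and the AWS CLI are installed on the instance.

    // backup-and-clean.js: a sketch, not our real backup script.
    // Assumes mongodump and the AWS CLI are installed; paths and bucket are placeholders.
    const { execSync } = require('child_process');
    const fs = require('fs');
    const path = require('path');

    const DUMP_DIR = '/tmp/mongo-backups';
    const BUCKET = 's3://example-backup-bucket/mongo';

    function runBackup() {
      const stamp = new Date().toISOString().replace(/[:.]/g, '-');
      const dumpPath = path.join(DUMP_DIR, 'dump-' + stamp);
      fs.mkdirSync(DUMP_DIR, { recursive: true });

      // 1. Take the dump.
      execSync('mongodump --out ' + dumpPath, { stdio: 'inherit' });

      // 2. Ship it to S3.
      execSync('aws s3 cp --recursive ' + dumpPath + ' ' + BUCKET + '/' + stamp + '/', { stdio: 'inherit' });

      // 3. The step our original script was missing: delete the local copy
      //    so dumps don't pile up and eat the disk.
      fs.rmSync(dumpPath, { recursive: true, force: true });
    }

    runBackup();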

LBB.in went down — WordPress

Sun, 29 Jan 2017

  • At around 6 in the evening our website went down with an error saying “Error in creating a database connection”, while we were in Mumbai waiting for our flight at the airport.
  • It happened a few days after we migrated from WP Engine (London) to AWS Mumbai.
  • Debug mode on: about 1 request in 5 was still getting a 200, but latency was very high. Surprisingly, the cache was also failing for most of the routes.
  • The first thing I did was check on the AWS console whether the database was up or not. It was up but under heavy load; there were 100+ active connections that weren’t getting closed. I logged into one of our EBS instances to check whether the database was reachable.
  • I wasn’t sure what these connections were. We didn’t have any crons configured for that time. We had to kill all of them to bring the site back (see the sketch after this list for one way to do that).
  • First, I restarted EBS and RDS to kill all the active connections.
  • The website came back but was still flaky, and connections kept increasing. Why? No idea yet.
  • Assumption: since we hadn’t removed the domain from WP Engine, and it is a managed service, there must have been some crons configured there, running at this time against the DB and pinging our domain for execution.
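
For reference, one way to inspect and clean up runaway MySQL connections from a Node script is sketched below. This isn’t what we ran that day (that was a blunt restart); it assumes the mysql2 client, and the host, credentials and idle threshold are placeholders.

    // kill-idle-connections.js: a rough sketch, assuming the mysql2 client.
    // Host, credentials and the 300-second idle threshold are placeholders.
    const mysql = require('mysql2/promise');

    async function killIdleConnections() {
      const conn = await mysql.createConnection({
        host: 'example-rds-host',
        user: 'admin',
        password: process.env.DB_PASSWORD,
      });

      // List every connection the server currently holds.
      const [rows] = await conn.query('SHOW FULL PROCESSLIST');

      for (const row of rows) {
        // Kill connections that have been idle ("Sleep") for more than 5 minutes.
        if (row.Command === 'Sleep' && row.Time > 300) {
          await conn.query('KILL ' + row.Id);
        }
      }

      await conn.end();
    }

    killIdleConnections().catch(console.error);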

Future Precautions:

  • Checked for and removed all of WP Engine’s leftover plugins
  • Removed the install on WP Engine
  • Enabled memcache

Used memory kept increasing | Node app

  • We had been facing this issue for a really long time: the memory used by the app on our instance kept on increasing.
  • We had to restart the process manually whenever it reached the memory limit.
  • We tried a lot of debugging using profiling and node-inspector.
  • After a lot of effort, we found that the issue was in the newly deployed events scraping module. An array variable in that module was assigned globally in the app, and its elements kept accumulating, which kept pushing memory usage up.
  • Solution: we moved that variable into the triggered function and passed it further along as an argument, so that garbage collection could take over after the function finished executing (a simplified before/after sketch follows this list).
  • That was a very poor coding practice; there are plenty of debugging write-ups on the internet that cover it.
  • PS: profiling is a very important part of testing.
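
A simplified before/after sketch of the leak pattern, with made-up module and variable names (our real scraping module was obviously bigger):

    // BEFORE: a module-level array shared across every scrape run.
    // Nothing ever clears it, so it grows until the process hits its memory limit.
    let scrapedEvents = [];

    function scrapeEventsLeaky(rawPages) {
      for (const page of rawPages) {
        scrapedEvents.push(...parseEvents(page)); // keeps accumulating forever
      }
      return scrapedEvents;
    }

    // AFTER: the array lives inside the function and is passed along as a
    // return value/argument, so once the run finishes nothing references it
    // and the garbage collector can reclaim it.
    function scrapeEvents(rawPages) {
      const events = [];
      for (const page of rawPages) {
        events.push(...parseEvents(page));
      }
      return events;
    }

    function parseEvents(page) {
      // Placeholder parser, just to make the example self-contained.
      return [{ title: String(page).slice(0, 20) }];
    }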

High Latency at the time of notifications

  • Latency on article posts used to go very high while notifications were being sent; the WebViews were taking forever to load during that window.
  • Because of the increased memory usage while notifications were going out, normal HTTP requests took a lot of time to get executed. #NodejsIssues
  • The problem was that notifications for all the different cities and OSes were going out in one go. They were batched, but the batches were still all queued at once.
  • Node was trying to process all the batches at once, since they were queued simultaneously, and every notification scheduled for the same time also got queued at once.

Solution:

  • Limited notification execution to one city per 10 minutes.
  • Made the queue process batches one after another instead of all at once (see the sketch after this list).
  • And ultimately moved notification sending to a different instance altogether.
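
A minimal sketch of the throttling idea (not our actual worker): process each city’s batches one after another and wait ten minutes between cities, so the HTTP-serving side of the process isn’t starved. City names, batch contents and the push call are placeholders.

    const TEN_MINUTES = 10 * 60 * 1000;

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Send one city's batches strictly one after another instead of all at once.
    async function sendCity(cityBatches) {
      for (const batch of cityBatches) {
        await sendPushBatch(batch); // placeholder for the actual GCM/APNS call
      }
    }

    // Work through the cities sequentially, with a 10-minute gap between them.
    async function sendAll(notificationsByCity) {
      for (const batches of Object.values(notificationsByCity)) {
        await sendCity(batches);
        await sleep(TEN_MINUTES);
      }
    }

    async function sendPushBatch(batch) {
      // Placeholder: pretend the batch went out.
      console.log('sent a batch of ' + batch.length + ' notifications');
    }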

Multiple notifications (up to 40–50 in one go)

  • There was a time when a single app install started receiving multiple notifications in one go: the same notification delivered to it several times.
  • The issue was widespread on Android.
  • We got this complaint once in a while from our non-tech internal team, but not from our users at all; the issue mostly affected our developer team.
  • The issue was that we kept a collection of tokens for unauthenticated users, and every fresh install generated a new token that got saved into that collection. On top of that, the same token could be stored against multiple users in our DB if a person signed in again with a different ID.
  • So duplicate notifications were going out, and since our app team kept reinstalling the app, they were registering multiple tokens for the same device on our server.

Culprit: GCM doesn’t expire a token very soon after the app is uninstalled; APNS does it sooner, but not instantly. On top of that, they assign a different token to the same device on each install.

Solution:

  • We removed duplicate tokens from the DB.
  • We used GCM’s ghost notifications to figure out dead devices and alias devices.
  • Now we send bulk notifications to unique tokens only (a rough sketch of the dedupe step follows this list).
  • Pheww…..
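
A rough sketch of the dedupe step, assuming a Mongoose-style token collection; the DeviceToken model, the batch size and sendPushBatch are placeholders, not our real code:

    const BATCH_SIZE = 500;

    async function sendToUniqueTokens(DeviceToken, payload) {
      // One device can be registered many times (reinstalls, multiple user IDs),
      // so collapse the token list to unique values first.
      const tokens = await DeviceToken.distinct('token', { platform: 'android' });
      const unique = [...new Set(tokens)];

      // Send in fixed-size batches rather than one giant request.
      for (let i = 0; i < unique.length; i += BATCH_SIZE) {
        const batch = unique.slice(i, i + BATCH_SIZE);
        await sendPushBatch(batch, payload); // placeholder for the GCM call
      }
    }

    async function sendPushBatch(tokens, payload) {
      // Placeholder: pretend the batch went out.
      console.log('sent "' + payload.title + '" to ' + tokens.length + ' tokens');
    }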

That’s a list of a few of the major issues I faced at LBB.

#Peace
