Startup Fuck-ups:
How we lost 25% of our monthly revenue overnight

The first in a blog series called “Startup Fuck-ups” on turning bad experiences into good practices.

Insync
Oct 30, 2014

Last month, one of our developers noticed that we were hitting 75% disk capacity on the server used by our mailer application. We checked it out because that wasn’t supposed to happen if things were running smoothly. What we found was that the system had been keeping a pile of failed jobs, and these were eating up space they shouldn’t have been.

To fix the issue, we put together a script to delete the failed items, since retrying them didn’t appear to work. After running the script, we restarted the mailer application. A few hours later, we started receiving a lot of emails and tweets: for every one we answered, ten more came in from customers. Three months’ worth of emails had been backed up, and they were only hitting our customers’ inboxes just then.
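To give a rough idea of the kind of cleanup script involved (the stack details here are hypothetical; assume a Redis-backed job queue with a mailer:failed list, not our actual setup):

```python
# Rough sketch only: the queue backend isn't named above, so assume a
# Redis-backed job queue with a hypothetical "mailer:failed" list.
import redis

r = redis.Redis(host="localhost", port=6379)

FAILED_KEY = "mailer:failed"  # hypothetical list holding the failed jobs

print("Purging %d failed jobs from %s" % (r.llen(FAILED_KEY), FAILED_KEY))

# Retries weren't working, so we simply dropped the failed entries to
# reclaim disk space, then restarted the mailer workers afterwards.
r.delete(FAILED_KEY)
```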

https://twitter.com/onedurr/status/510300616233455616

The replies we received showed that most of our users were both understanding and pretty chill about us flooding their inboxes unintentionally.

Problem solved? Not really.

Let’s backtrack a little to the months before.

We noticed three specific things:

  • our conversions from trial to paid upgrades were down roughly 40%
  • internal notifications that typically came on time weren’t being received; and
  • replies from our email series had dropped by 75% (because they had nothing to reply to—but we didn’t know that)

Naturally, since our conversions were down, we had to think up big solutions—because a drop in sales equated to a big problem.

We broke down the timeline of what went wrong and where, pulling together input from all sides of the company: we held meetings with our marketing team, revisited our internal workflow, and generally got our house in order in every way we could think of.

We had gathered the data. Something was up, we knew that much. We just couldn’t find the actual root cause, which (three months down the line and a mad scramble for an email apology later) turned out to be something a lot smaller than any of these things: our mailer.

The apology email that we sent on September 12.

Things like this hurt young startups. And by things, what we actually mean is the line of thinking that big problems always equal big solutions.

Was this a big problem? Yeah.

Our figures speak for themselves: twenty-five percent lost in monthly revenue is BIG anywhere.

Was the cause big? Definitely not. We were just looking in all the wrong places.

It’s said that prevention is always better than the cure, but nobody actually listens to that until the problem bites them in the ass. At the time, we didn’t know that there was something to prevent. Now that we do, here are the things we’ve set in place:

  1. Monitor the mailer better.
  2. Document all processes.
  3. Communicate any irregularity (no matter how small it may seem).

Monitoring.

We’re now using Datadog to monitor the mailer’s health.

We’ve set individual alerts on pending-email thresholds and on the number of active mailer workers over a given period.

If the mailer queue exceeds a given size, or the number of emails sent during a specific period drops below a certain count, we get notified straight away.

You see, we used to look in on it manually—which is just time-consuming and stupid when there are better ways to check.
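To make that concrete, here’s a simplified sketch of how a mailer process might report queue depth and active worker count to Datadog via DogStatsD. The metric names and Redis keys are illustrative stand-ins rather than our actual configuration:

```python
# Simplified sketch: push mailer health metrics to Datadog via DogStatsD
# so that monitors can alert when either gauge crosses its threshold.
# Metric names and Redis keys are illustrative stand-ins, not our real setup.
import time

import redis
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)
r = redis.Redis(host="localhost", port=6379)

while True:
    pending = r.llen("mailer:pending")            # jobs still waiting to be sent
    workers = r.hlen("mailer:worker_heartbeats")  # workers currently checked in
    statsd.gauge("mailer.queue.pending", pending)
    statsd.gauge("mailer.workers.active", workers)
    time.sleep(60)
```

With something like this, the alert thresholds live in the Datadog monitors themselves, so tuning them doesn’t mean touching the mailer.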

Documenting.

We’ve attempted to replicate the issue and isolate exactly what caused the mailer workers to die. Unfortunately, we haven’t come up with anything conclusive, except that the workers stalled even while the code we had in place for them continued to run.

We knew three things about the mailer: (1) it sends emails on demand; (2) the queue on the server shouldn’t grow much if existing and new jobs are continuously being processed; and (3) since each worker holds its own connection to the server, any drop in connections means the number of active workers has gone down.
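Written down as a check, those three assumptions might look something like this (the Redis keys, heartbeat scheme, and thresholds below are hypothetical, and only here to show how the expectations can be made explicit):

```python
# Sketch: the three assumptions above written down as an explicit check.
# Redis keys, the heartbeat scheme, and the thresholds are all hypothetical.
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

MAX_PENDING = 10_000     # (2) the queue shouldn't grow past this if jobs keep flowing
MIN_WORKERS = 4          # (3) fewer live workers than this means some have died
HEARTBEAT_MAX_AGE = 120  # seconds; a stale heartbeat means a worker has stalled

def mailer_healthy() -> bool:
    pending = r.llen("mailer:pending")
    if pending > MAX_PENDING:
        print("queue is backing up: %d pending jobs" % pending)
        return False

    # Each worker periodically writes its own timestamp; an entry that has
    # gone stale means the worker is still registered but no longer working.
    heartbeats = r.hgetall("mailer:worker_heartbeats")
    now = time.time()
    alive = [w for w, ts in heartbeats.items() if now - float(ts) < HEARTBEAT_MAX_AGE]
    if len(alive) < MIN_WORKERS:
        print("only %d healthy workers, expected at least %d" % (len(alive), MIN_WORKERS))
        return False

    return True
```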

That these things weren’t happening should have been taken into account. But with so many other things going on, we figured the mailer was an isolated incident, one that had nothing to do with the sales drop that seemed more pressing at the time.

It wasn’t. In fact, the mailer dying on us cut our lifeline to the customers.

This little guy was in charge of checking in to see if people were interested in moving from trial to paid. It was also the one that made sure people knew we’d received their payments, or that they were lagging behind.

Had we better documented the mailer’s processes, dependencies, and functions, we could have worked out that, since it wasn’t doing what it was supposed to, its failure was directly connected to why our sales weren’t doing well.

Communication.

We already knew that our mailer application was responsible for sending forum notifications and scheduled emails to check in with our users. Some members of the team noticed that the notifications weren’t being sent; and over on the support end, we also got fewer replies to our “check in” emails.

Had any of us just spoken up about these issues to each other, we could have investigated things sooner. We let them go because, like we mentioned, we thought that whatever was happening to the mailer couldn’t possibly be that bad. Now, no matter how negligible something appears, we make it a point to speak up (just in case someone else has picked up on something too).

At the risk of sounding cheesy: best to assume that everything’s connected.

Despite every counter-measure we’ve now put into play, this doesn’t change the fact that we’ve had to do the equivalent of putting a band-aid on a shotgun wound. When you have a goal as big as letting your fledgling company grow, you can’t take what you have for granted (which we did) and assume that things will run smoothly all the time (because they won’t).

Yes, this kind of thinking should be a given — but execution is a whole different animal from theory, and the things that sound the easiest to do are the hardest to put into practice.

We’re laying out the facts as they are because this lesson in accountability and immediate response isn’t only applicable to a young business like ours. We’re sharing it in the hopes that other startups and entrepreneurs will have something to take away.

Ours is that even the (seemingly) little things can have a pretty major impact. It’s been a hard lesson learned. A very expensive one.

Many thanks to the team for giving input and proofing this piece.

Noelle Pico is a creative with a passion for writing, music and all around geekery. She recently left corporate to work at Insync and has not looked back since. She occasionally tweets at @thenoeychu about random thoughts and her foray into tabletop gaming. You can email her at noellepico@insynchq.com.
