Engineering Update — Sept 2, 2016

It has been a crazy few weeks around here for the engineering team.

As no doubt some of you have noticed (okay, many of you; we see our own stats), we’ve been having some growing pains. And they’ve been hitting us in more ways than one! Not only are we now popular enough that certain kinds of fraudulent use of our service are more common, but we’ve also been dealing with some downtime. So I’m going to talk about that a little bit.

Companies that are growing experience downtime as they work through bottlenecks. This happens at all levels of an organization. A few months back, one of our biggest bottlenecks was customer support: we were completely maxed out. Response times were way above the hours they are now; they stretched into days and sometimes weeks. We had a large influx of new customers, and our processes had a number of problems that would have kept us from sustaining that pace. To reduce the pressure, we switched tools (Intercom to Helpscout, which gave better metrics and let us respond thoughtfully instead of in a chat box), we hired a few people to help handle the volume, and we started assigning one engineer at a time to focus exclusively on the Customer Success queue for a week. That helped, and most of the time you’re now seeing response times on your questions that are far better than they were at the beginning of the year.

Now, we’re starting to see some bottlenecks in the application itself. And however many possible causes there were for bottlenecks on the Customer Success side of our organization, there are many, many more in engineering.

I’m gonna pause a second. I don’t want to come off defensive here. I think it’d be easy to read it that way, or like I’m making excuses. I’m not. Here at ConvertKit, we say Teach Everything You Know, and we say Work in Public. When we’ve been experiencing more downtime than normal, the working in public part is a little scarier. But that doesn’t mean it isn’t valuable.

Okay, back to the thing. Especially these last few weeks, we’ve been hit hard. And it’s one of those things that feels like whack-a-mole. Continued pressure on all parts of the application reveals that what worked for much smaller numbers is not holding up as well now. This is fine (it’s part of growth), but in the meantime it’s stressful for everyone. Conversations within the engineering team get a little tense when we start talking about adding more application health alerts that might wake us up in the middle of the night. Conveniently, we have a distributed team (Slovenia, Catalonia, Thailand and the US), so most of the time we’ve got someone who isn’t bleary-eyed and can check out what’s going on.

Anyway, that’s all the ancillary stuff. Let’s actually dive in a little to what we’ve been working on.

  • Monitoring, measuring and improving performance
    You may remember about a month back when I said we’d been working hard on broadcasts and were getting them out a whole lot faster. The speed of those is awesome. So awesome, in fact, that it causes new kinds of fun. As we flood messages out of our system, statuses on those messages come pouring back in (statuses like “delivered”, “bounced”, those kinds of things), and we do a lot of rapid updates to make sure the interface can keep up and that we don’t accidentally send the same message twice (a rough sketch of that kind of duplicate-send guard follows this list). That, combined with sending out sequence emails, can sometimes cause issues. Diagnosing them is tricky, because often our only window to really see what’s going on is while it’s happening. We’ll keep evaluating and improving, and do the common-sense things we can to keep it going.
  • Heading off spammers
    This is a taxing job. Sometimes emotionally: fraudulent accounts are draining. But we’ve been working hard both to clean up the damage from spammers that have already leaked through and to prevent future ones from joining or causing damage. We’re a little tight-lipped about some of this, because we’re big enough that I know fraudulent users are reading these words, and we don’t want to give any hints on how to weasel in.
  • Improving deliverability
    We’ve done a few different things here. First, we’ve switched out some IPs that had already landed on blacklists. Second, we’re increasing our internal ability to change sending domains and IPs (two big factors that can get a message stopped as spam once their reputation is damaged). The other thing with deliverability is that we’re still small enough that, over the past few weeks, we’ve often been able to help customers directly in the instances where messages didn’t hit the inbox.
  • Scaling up all the things
    We have a lot of levers we can pull, and we’ve been pulling them. We upgraded our web servers and have been tuning them. We’ve upgraded database sizes and the power behind them (a few times in as many weeks, actually).
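
To make the duplicate-send point above a little more concrete, here’s a minimal sketch of the general idea. This is illustration only, not our actual application code: it uses Python and SQLite, and names like claim_send and record_status are made up. The technique is simply to claim each (broadcast, subscriber) pair under a unique constraint before mailing, so two workers can’t both send it, and to let the status events that come back update that same row.

```python
# Minimal sketch of a duplicate-send guard (illustration only, not our
# production code). The schema and function names here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE deliveries (
        broadcast_id  INTEGER NOT NULL,
        subscriber_id INTEGER NOT NULL,
        status        TEXT    NOT NULL DEFAULT 'queued',
        PRIMARY KEY (broadcast_id, subscriber_id)
    )
""")

def claim_send(broadcast_id, subscriber_id):
    """Return True only for the first worker to claim this (broadcast, subscriber) pair."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO deliveries (broadcast_id, subscriber_id) VALUES (?, ?)",
        (broadcast_id, subscriber_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 means another worker already claimed it

def record_status(broadcast_id, subscriber_id, status):
    """Apply a status event coming back from the mail provider ('delivered', 'bounced', ...)."""
    conn.execute(
        "UPDATE deliveries SET status = ? WHERE broadcast_id = ? AND subscriber_id = ?",
        (status, broadcast_id, subscriber_id),
    )
    conn.commit()

if claim_send(42, 1001):
    pass  # only the first claimer hands the message to the mailer
record_status(42, 1001, "delivered")  # later, the provider reports what happened
print(conn.execute("SELECT * FROM deliveries").fetchall())
```

The nice property of this kind of guard is that it lives in the database rather than in application memory, so it still holds when sends are spread across many workers and servers.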

In addition to that, we’ve set up some better ways to keep you informed about application health (these are a little aggressive right now; we’ve been telling you about almost every hiccup, which is overkill). We’ve also decided to bring in a couple of consultants for a while who can focus with us explicitly on scaling, performance and deliverability.

We had the opportunity last week for our engineering team to spend some time all together in person. Out of that, we’ve set some goals that we’re all feeling pretty good about:

  • Average response time for web of 200ms or less (as of 8/31, the average over the last three months was 400ms, over the last month it was 700ms, and over the previous seven days it was 372ms). “Web” means serving forms, serving landing pages, and the UI of the application itself. (A rough sketch of this kind of measurement follows this list.)
  • Average response time for API of 200ms or less (as of 8/31, the average over the last seven days was 1200ms; don’t cry). The API gets hit by form submissions, third-party plugins like WordPress, and custom scripts.
  • Reduce exceptions by half, then to hundreds at most per day (as of 8/31, we were at… thousands and thousands a day). This is mostly internal, but it is a situation that’s grown as the app has grown. Basically, we’ve been wading through a lot of notifications we set up for ourselves many moons ago, so this is just about reducing noise.
  • Split up the application. We’re running right now on two different servers — API and Web — and we can split that out further. Doing that is going to help us monitor specific functions in isolation (delivering embedded forms, tracking links, serving the interface).
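
To make those first two response-time goals a bit more concrete, here’s a tiny sketch of the kind of arithmetic behind them: average (plus a rough 95th percentile) response time per service, computed from request timings. The log entries below are made up; in practice these numbers come out of monitoring tools rather than a list in memory.

```python
# Rough sketch of per-service response-time summaries (hypothetical data,
# not our real monitoring pipeline). Targets: web and API averages <= 200ms.
from statistics import mean, quantiles

# Each entry: (service, response time in milliseconds)
request_log = [
    ("web", 180), ("web", 420), ("web", 95), ("web", 230),
    ("api", 1300), ("api", 900), ("api", 1450), ("api", 1100),
]

def summarize(log, target_ms=200):
    by_service = {}
    for service, ms in log:
        by_service.setdefault(service, []).append(ms)
    for service, times in sorted(by_service.items()):
        avg = mean(times)
        p95 = quantiles(times, n=20)[-1]  # rough 95th percentile
        flag = "ok" if avg <= target_ms else "over target"
        print(f"{service}: avg {avg:.0f}ms, p95 {p95:.0f}ms over {len(times)} requests ({flag})")

summarize(request_log)
```

Averages hide a lot, so percentiles are worth watching alongside them; splitting the application by function (the last goal above) should also make these per-service numbers easier to read in isolation.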

Some of that got a little technical, but I’m okay with that. It’s about working in public, so consider it a peek behind the curtain. :) Cheers, all.
