Reflecting on the past five months at Plan
Since this is my first post I thought I’d introduce myself.
My name is Thomas Potaire but my friends call me Teapot. I have been doing software engineering professionally for almost 8 years now.
I used to work at CBS Interactive in the Games department: GameSpot, GiantBomb, Metacritic and others. I then joined Twitter, where I spent a year building communication tools before moving to the Video department for the remaining three years.
It’s already been 5 months since I joined Plan as the Head of Platform.
Behind the fancy title, I am mostly working on the unseen: everything around the API.
Where we come from
My first week was intense. Our servers were having a tough time and couldn’t handle the load.
What load? How could it not handle it? We couldn’t possibly have that much load.
My first few weeks were spent fixing bugs, since that’s the best way to get up to speed with a codebase, and with all those crashes I couldn’t help but think I had caused them.
The graphs above showcase what was happening multiple times a day, every day. We would restart the servers, things would be stable for a while, and then suddenly everything would spiral out of control until the servers became completely unresponsive.
But why? All we had were the two graphs above. No idea why.
We only had one success metric: how hard did the servers crash on Monday mornings, as users started their productive week? They crashed hard. Restarting them would tame the issue for a few minutes at best.
After a little bit of work, I got more graphs and a few more success metrics.
Yes, that’s five million milliseconds. I look at this graph now and I still can’t believe it’s possible. Something was definitely keeping our servers busy.
By measuring the number of requests versus the number of responses, I found out one of our API endpoints never returned a response.
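For anyone curious what that measurement looks like, here is a minimal sketch of the idea in Express with TypeScript; the counters and the logging interval are hypothetical, not our actual instrumentation.

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Hypothetical counters; in practice these would be reported to a stats backend.
let requestsReceived = 0;
let responsesSent = 0;

app.use((req: Request, res: Response, next: NextFunction) => {
  requestsReceived += 1;
  // "finish" fires once the response has actually been written to the client.
  res.on("finish", () => {
    responsesSent += 1;
  });
  next();
});

// If requestsReceived keeps climbing while responsesSent stalls, some endpoint
// is holding requests open and never responding.
setInterval(() => {
  console.log(`requests=${requestsReceived} responses=${responsesSent}`);
}, 10000);
```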
Unfortunately, fixing that endpoint didn’t entirely solve our issues. The servers were now responsive and the graphs looked healthy, but memory usage would spike, requiring the servers to be restarted. The API was plagued with chained callbacks spread across multiple files, some of which took multiple seconds to execute even though the API returned a response immediately. It was difficult to know which part kept the servers busy or where the memory leak was. Since the servers were now healthy and restarting them occasionally was acceptable, I didn’t spend much time on this issue.
Takeaways
- always measure and define success metrics.
- do not implement dangling async code; use job queues or cron jobs instead (see the sketch below).
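To illustrate that second takeaway, here is a contrived sketch of what I mean by dangling async code, next to the shape I now prefer; the jobQueue interface is a hypothetical stand-in for whichever queue you use.

```typescript
import { Request, Response } from "express";

// Hypothetical stand-ins so the sketch is self-contained.
declare function recalculateEverything(payload: unknown): Promise<void>;
declare const jobQueue: { enqueue(name: string, data: unknown): Promise<void> };

// Dangling async: the response goes out immediately while the slow work keeps
// running in the background, where its errors and memory usage go unnoticed.
function dangling(req: Request, res: Response): void {
  res.json({ ok: true });
  recalculateEverything(req.body); // un-awaited promise, nothing tracks it
}

// Preferred shape: hand the work to a job queue (or a cron job) so it can be
// measured, retried and rate limited independently of the request.
async function queued(req: Request, res: Response): Promise<void> {
  await jobQueue.enqueue("recalculate", { payload: req.body });
  res.json({ queued: true });
}
```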
The turning point
The current Plan API is a beast of its own. Many features are interlocked and heavily rely on realtime updates. I am in awe of what Darshan was able to accomplish; the code I inherited was the result of years of learning. Once we got through some of the hurdles, it was time to start building the features our users coveted.
I hit a wall.
The first feature we wanted to work on was recurring meetings, or in Google terms, recurring events. It made sense: we also wanted to build recurring tasks, and since the Google API had already implemented recurring events successfully, we could learn from it.
The plan made perfect sense.
And yet an implementation that involved refactoring the existing code was unrealistic, as I had learned during the previous failed optimization attempts. The complex spaghetti code was not covered by tests and a smooth rollout to our users was out of the question, so the chances of breaking everything were high.
The other option was a rewrite on the same stack while upgrading some of the libraries. It quickly became clear it would be costly: any attempt at upgrading the codebase or making it testable led to additional bugs.
I led major rewrites in my previous jobs and learned a lot from each of them. Rewrites are incredibly hard. At GameSpot we had more than a decade’s worth of data and features to support; the first review, written back in May 1996, can still be read today. At Twitter, we had to leverage extremely scalable services designed to be consumed in very specific ways that did not match our product requirements. Both rewrites were exhausting. It’s a ton of work: every day can bring new surprises, resources can be pulled away, and the deadline will slip numerous times, all while the higher-ups stress the need for growth.
A rewrite in a new project could be the third solution. The previous code had one endpoint that served 99% of all traffic; let’s call it synchronize for future reference. It was the ideal candidate for an incremental rewrite because its code wasn’t shared and, more importantly, it was buggy.
As soon as the idea crossed my mind, the prior years flashed back. It was unexciting and made me feel exhausted in advance. An incremental rewrite is a lot better than a full rewrite, but experience told me it was still wishful thinking. I forced myself to shut the idea down, until I caved a week later and suggested it to Darshan and Leo.
It was December 2017.
Takeaways
- Rewrites are rollercoasters. Thrilling, exhausting and rewarding.
- Mistakes are always made. Your teammates will forever blame you for all the mistakes and secretly ❤️ you.
Project foundation
The initial requirements: documentation, continuous testing, a standardized API, dev(eloper) tools and DRY code. Together they would bring ease of maintenance, increased performance and faster development cycles.
We named the project “platform”. A surface people or things can stand on. Not as fancy as Phoenix but it conveys meaning.
We all had experience with JavaScript and decided to use TypeScript. There is no point in not using it: it has a solid track record and is widely adopted. For better or worse, it’s JavaScript with types. As a former Scala engineer, I find it both frustrating and incredibly smart that TypeScript does not attempt to fix JavaScript. Type safety improves our dev tools, and generics help with writing DRY code.
We picked Express for our framework and OpenAPI 3.0 for our API standard. Those two are widely adopted and well documented.
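To give a rough idea of how the two fit together, here is a simplified sketch (hypothetical endpoint and schema, not our actual code): the OpenAPI 3.0 description lives next to a thin Express handler, so a fragment like this could be merged into the generated documentation.

```typescript
import express, { Request, Response } from "express";

// Hypothetical OpenAPI 3.0 fragment describing the endpoint.
export const taskPaths = {
  "/tasks/{taskId}": {
    get: {
      summary: "Fetch a single task",
      parameters: [
        { name: "taskId", in: "path", required: true, schema: { type: "string" } },
      ],
      responses: { "200": { description: "The requested task" } },
    },
  },
};

const app = express();

// The matching Express route; the handler stays thin and delegates elsewhere.
app.get("/tasks/:taskId", (req: Request, res: Response) => {
  res.json({ id: req.params.taskId });
});
```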
Because code changes require the server to be restarted, we automate the restarts with nodemon. Since it detects changes, it might as well run lint and tests. The former improves our dev tools and the latter helps with continuous testing.
Continuous integration with Travis CI improves both dev tools and continuous testing. Pull requests can only be submitted if both lint and tests succeed. Upon submission, both are executed once more and a production Docker image is built and uploaded to our registry. Deploying is one click away.
This developer environment is probably one of the best I’ve had the chance to work with.
It’s fast. It’s safe. It’s impartial. It’s flexible.
Takeaways
- Focus on your developer environment. Tools make people productive.
- Set up tests as early as possible. Testable code is difficult to write.
- Set up lint and save time during code reviews.
- Leverage automation. It’s simple to set up, so why wouldn’t you?
- Choose a type-safe language. If not for you then for your teammates.
- Performance can be improved later so long as your code is tested.
Architecture and design
The previous API used the classic MVC structure, but as the project grew it became difficult to browse. One feature might be spread across multiple files while another was buried at the bottom of a single file in the same directory. Most of the code was contained in more than 70 files split across 3 directories.
Each request to the synchronize endpoint would execute for multiple seconds and cannibalize all the server resources, so the endpoint was switched on and off with an environment variable and run on separate servers.
The new architecture addresses both of those issues. It took inspiration from the architecture of front-end projects at Twitter. All the features are grouped in components. A component contains its own MVC logic.
Each API endpoint is named after the component and routed to a function in the handlers directory. Data structures live in models, and helper functions in the services directory.
A component can interact with another using an interface defined at its root. If we ever implement micro-services it wouldn’t be a difficult refactor.
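As a simplified example (the names here are hypothetical), the file at a component’s root is the only thing another component is allowed to import:

```typescript
// components/tasks/index.ts — the component's public interface.
// Hypothetical names; internally it delegates to its own handlers, models and services.
export interface TaskSummary {
  id: string;
  title: string;
  completed: boolean;
}

export async function listTasksForMilestone(milestoneId: string): Promise<TaskSummary[]> {
  // The real implementation would query the component's models.
  return [];
}

// components/milestones/services.ts — another component consumes only that interface:
// import { listTasksForMilestone } from "../tasks";
```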
Each library is abstracted in its own directory. Some may be replaced (stats, logs, mail, cron) while others may not (websocket, routing, datastore).
A library can consume another.
A component can consume a library but a library cannot consume a component.
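Here is what such an abstraction looks like, sketched with hypothetical names: the rest of the codebase only sees the small interface, so swapping the underlying client stays contained in one directory.

```typescript
// libraries/stats/index.ts — a replaceable wrapper around whichever stats client we use.
export interface StatsClient {
  increment(metric: string, value?: number): void;
  timing(metric: string, milliseconds: number): void;
}

export const stats: StatsClient = {
  increment: (metric: string, value: number = 1): void => {
    // Forward to the concrete client (statsd, Datadog, etc.) here.
  },
  timing: (metric: string, milliseconds: number): void => {
    // Forward to the concrete client here.
  },
};
```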
Aside from a few exceptions, all handler functions consist of one line calling a library that handles the GET, POST, PATCH and DELETE methods. The library takes care of access control, data lifecycles and realtime updates, among other things, and each behavior can be overridden per component. This is where we use TypeScript generics the most.
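Here is roughly what a handler looks like, as a simplified sketch; the crud helper and its options are hypothetical stand-ins for our internal library.

```typescript
import { Router } from "express";

// Hypothetical generic CRUD helper: given an entity type and a path, it wires
// up the GET, POST, PATCH and DELETE routes, access control and realtime updates.
declare function crud<T>(options: {
  router: Router;
  path: string;
  beforeSave?: (entity: T) => Promise<T>; // one of the overridable hooks
}): void;

interface Milestone {
  id: string;
  title: string;
  dueDate: string;
}

// components/milestones/handlers.ts — the typical "one line" handler.
export function registerMilestoneRoutes(router: Router): void {
  crud<Milestone>({ router, path: "/milestones" });
}

// A component that needs custom behavior overrides a hook instead of
// rewriting the whole lifecycle.
export function registerMilestoneRoutesWithHook(router: Router): void {
  crud<Milestone>({
    router,
    path: "/milestones",
    beforeSave: async (milestone) => ({ ...milestone, title: milestone.title.trim() }),
  });
}
```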
The software architecture is predictable in that it is familiar and organized.
As mentioned above, the software is shipped in a Docker image, hosted on Amazon EC2 instances, exposed with HAProxy and orchestrated by Rancher. Our instances are fronted with an Amazon Application Load Balancer (ALB) since it supports websockets natively. Requests are routed based on geolocation and can automatically fail over to another datacenter using Cloudflare load balancing.
We chose AWS over GCP (Google Cloud Platform) for its wider adoption. A migration from one to the other wouldn’t be too difficult as Rancher supports both.
Due to my lack of expertise in system operations, I found this architecture well suited for the occasional downtime I caused.
The initial implementation used the Rancher Cloudflare DNS service available in the catalog, which registers the hosts’ public IP addresses in the Cloudflare DNS pool. Whenever a container crashed or a host became unavailable, its IP would be removed from the pool and the entire API would become unavailable. After many hours, I realized my WiFi’s DNS hadn’t been updated with the proper addressing and couldn’t resolve our servers. Amazon ALB solved the problem by providing static addressing.
I wish the initial implementation had worked better so we could manage most of our systems in one user interface; the current implementation requires extra steps to expand to new regions, though fortunately they are now all documented. On top of that, some hosts frequently get out of sync with the Rancher server and must be restarted from the AWS console. There have been a few occasions where all the hosts in a datacenter went out of sync.
Overall, I am satisfied with both software and system architecture.
Takeaways
- Divide your features. Keep your API as simple and as product agnostic as possible.
- Cloudflare packs so many features at a competitive price. You would be ill-advised not to use them.
- Despite its user-friendliness, Rancher requires some sysops expertise.
- I recommend using Rancher over AWS Fargate.
How the rewrite has been
The synchronize endpoint was migrated and started serving a small amount of traffic on December 20th. After a few weeks of tweaking, we began progressively routing all traffic to it on January 2nd.
The rewrite was largely successful. We were thrilled with the results, and it was clear we wanted to invest more in the platform. We were targeting two features: recurring meetings and third-party integrations.
I had hoped for a smooth incremental rewrite, but Plan is a beast and, sure enough, every day brought a new surprise.
For instance, creating a task updates a milestone and a list, and each data lifecycle generates realtime updates across multiple browser tabs, devices and users. Realtime updates for a specific piece of data are only received after querying it, and the front-end queries almost all the data on the initial page load (the subject of another post). This meant I had to write all the necessary endpoints to guarantee the integrity of the existing user experience.
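To make that concrete, a single create triggers a cascade roughly like the following (a heavily simplified sketch with hypothetical names):

```typescript
// Hypothetical stand-ins for the real components and the realtime library.
declare const tasks: { create(input: object): Promise<object> };
declare const milestones: { recalculate(id: string): Promise<object> };
declare const lists: { recalculate(id: string): Promise<object> };
declare const realtime: { broadcast(event: string, payload: object): void };

async function createTask(input: { title: string; milestoneId: string; listId: string }) {
  const task = await tasks.create(input);
  const milestone = await milestones.recalculate(input.milestoneId);
  const list = await lists.recalculate(input.listId);

  // Every affected entity emits a realtime update so other browser tabs,
  // devices and users see the change without reloading.
  realtime.broadcast("task.created", task);
  realtime.broadcast("milestone.updated", milestone);
  realtime.broadcast("list.updated", list);

  return task;
}
```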
Takeaway
“never two without three” — French and Italian proverb
If it happened twice, it’ll happen a third time. Rewrites are hard. This rewrite was no different.
Where we are at
Today, we have 63 documented API endpoints on the new platform, the API is ready to be shipped, and as of yesterday 100 users have been using it. The overall success rate is promising, so we’ll be slowly migrating the remaining users over the course of next week.
Where we are heading
I suspect time will be spent on performance improvements. As I mentioned above, the client loads all the things, and there will certainly be some tweaks to make on the API code.
With the use of the OpenAPI standard, a public API wouldn’t be too far of a stretch. Third-party integrations are definitely in the pipeline.
There are features we cannot wait to deliver. Recurring meetings, recurring tasks and many others.
Questions?
You can comment below, reach out on Twitter @teapot or send an email at thomas@getplan.co.
Plan is hiring
Help us foster straightforward and meaningful collaboration between people everywhere