Moving fast and not breaking things at scale
When you are a small startup, you can crank out and ship features fast because you have fewer customers, a smaller product footprint, and little chance that anyone will notice a 5-minute blip. When a startup scales from “100 customers and 5 engineers” to “15000+ customers, 200+ engineers, and 3000+ nodes”, engineering leaders need to scale not only the code but also the cadence of shipping features without breaking things. Last year we did 115 releases to production on a product called “Egnyte Connect”, which is a cadence of roughly twice a week. I have learned a lot from reading the technical blog posts written by large platform companies. In this article, I will share the approach taken by Egnyte as a way to pay it forward to the developer community.
What not to do at a growing startup?
At a SaaS startup I used to work for, we did only 4 releases a year. There was even a production release freeze from Thanksgiving to Christmas. For each release, we would cut a branch from production, hundreds of engineers would write code for 2+ months, and then we would cut a release branch and start a 2-week process of testing and baking it in UAT. After that came a weekend-long process of migrating the database and releasing code to hundreds of nodes in production. I distinctly remember a release where the team and I spent 36 hours in the office just releasing to production: the Oracle database migration took 18 hours, and the next 18 hours went into releasing the code and immediately patching fixes. After every release, we would also spend the next 2+ weeks fixing corner cases and performance regressions that manual and automated testing had not caught. Of course, when hundreds of engineers write code for 2 months, the sheer volume of change means that minor bugs or performance issues in one area can surface as major regressions in other parts. In such a case, it is very hard to hunt down the root causes under pressure because everything is a suspect. While I enjoyed working at that startup, the release process was painful, so I decided I would never again do quarterly releases at my next startup.
What do we do at Egnyte?
Ten years ago at Egnyte, we were doing monthly releases, and it took a lot of determination and many iterations to move to a process that allows us to release twice a week and, if needed, patch multiple times a week. The current process allows even someone fresh out of school to make critical changes to the distributed file system code with confidence. To strike the right balance between shipping fast and staying safe, we rely on:
- Code reviews
- Automated unit tests
- An elaborate feature flag system
- End-to-end REST API automation tests
- Automated smoke tests and a small manual sanity check
- Releasing to internal release candidate environments first, then production
- Ansible, Puppet, Liquibase, Kubernetes
- Monitoring tools like New Relic, Grafana, PagerDuty
- Parallel builds
In most companies, there is a single decision point: the merge. Once you send something to QA, unless you explicitly revert it, it makes the next release and gets deployed to production. At Egnyte, our system works differently, and we have 2 decision points:
- When to merge to QA
- When to merge to production
This allows us to select what goes to production based on many factors, such as quality, timing, and conflicts with other issues. In my opinion, this second decision point is critical to “not breaking things”.
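One way to picture the two decision points is as two independent gates on a ticket. The sketch below is illustrative only: the `Ticket` class, the status names, and both helper functions are hypothetical and are not Egnyte's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    IN_PROGRESS = auto()
    RESOLVED = auto()      # gate 1 passed: the engineer says "ready for QA"
    QA_APPROVED = auto()   # gate 2 passed: QA says "ready for production"


@dataclass
class Ticket:
    key: str
    status: Status


def qa_candidates(tickets):
    """Tickets whose changes should be present on the QA stack."""
    return [t.key for t in tickets
            if t.status in (Status.RESOLVED, Status.QA_APPROVED)]


def production_candidates(tickets):
    """Tickets whose changes are allowed to ride the next release."""
    return [t.key for t in tickets if t.status is Status.QA_APPROVED]


# Hypothetical ticket keys, used only for illustration.
tickets = [
    Ticket("PROJ-1", Status.IN_PROGRESS),
    Ticket("PROJ-2", Status.RESOLVED),
    Ticket("PROJ-3", Status.QA_APPROVED),
]
```

In this model, a ticket that is resolved but not yet approved can sit in QA for as long as needed without blocking, or leaking into, the next production release.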
The development process at Egnyte is a variant of the Gitflow and Kanban approaches, and it goes like this:
- For every JIRA ticket, the engineer creates a branch from the last released branch of the repository.
- The engineer writes the code and decides whether it needs a feature flag or not.
- On the creation of a merge request (MR), GitLab spins up a pipeline per commit and runs all unit tests. At this point, the engineer can keep pushing more commits. When the MR is ready, they post it on the Slack channel for review and collect 2+ reviews from other engineers familiar with the area.
- Other engineers do a functional code review. They look for things like code smells and whether proper tests were added for the new code, and they do risk analysis, such as whether the feature requires a feature flag and whether the code is backward compatible.
- If the code is placed behind a feature flag, the feature is OFF by default, and the engineer writes tests that evaluate the code with the flag both ON and OFF.
- After collecting a minimum of 2 code reviews, the engineer resolves the ticket and specifies the QA scope in the ticket.
- At a scheduled interval (every 5 minutes or so), our CI/CD pipeline collects all tickets resolved since the last release, creates a new branch called “integration-candidate” from the “integration” branch, and merges the MRs of all resolved tickets into it. This branch gets deployed to the QA stack, and we run our end-to-end REST API automation tests on it.
- The QA team also runs some manual tests. If a ticket has a manual QA scope, they execute it and approve the ticket. Upon approval, the merge request is merged to the “integration” branch, the ticket is marked for release candidate, and the latest “integration” branch is deployed to the Release candidate environment.
- Every Monday, the PMO team does risk analysis on the release and approves the last tagged commit on “integration” that was deployed to the Release candidate environment for a production deployment.
- Every Wednesday, during off-hours, our automation performs a rolling deployment of the release to production across our various data centers and the public cloud.
- Following this Kanban approach, the engineer only worries about getting their ticket approved and merged to the “integration” branch, where it automatically catches the next release train, which may leave next week or the next day. We also rely on a continuous history on our release branch, “integration”: no spontaneous reverts and no manual cherry-picks. Everything tested and approved by QA is ready to be released, which is a huge advantage over our previous flow.
- At this point, the engineer has the flexibility to turn on their feature selectively for:
a. A node
b. A pod
c. Entire data center
d. A single customer
e. A set of customers
f. A percentage of customers
g. A probabilistic request distribution
- Some big features require large data migrations. In that case, we keep the migration feature ON in the background and, as each customer is migrated, we turn their product feature flag ON. More on this is coming in a future blog post.
- In special refactoring situations where we can’t use feature flags, we leave the ticket in the Resolved state for weeks. All tests run on it daily, and it bakes in the QA environments. This MR’s code won’t make it into a release until we are ready to approve it, while other tickets continue to ship in the next release.
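The targeting options listed above (node, pod, data center, customer, set of customers, percentage of customers, probabilistic request distribution) can be sketched as a small rule evaluator. This is a minimal illustration under stated assumptions, not Egnyte's flag system: `FlagRule`, `Context`, `customer_bucket`, and `is_enabled` are hypothetical names, and hashing the customer ID for the percentage rollout is one common technique that the article does not confirm.

```python
import hashlib
import random
from dataclasses import dataclass, field


@dataclass
class FlagRule:
    """Hypothetical targeting rule for one flag. Defaults mean OFF everywhere."""
    nodes: set = field(default_factory=set)
    pods: set = field(default_factory=set)
    data_centers: set = field(default_factory=set)
    customers: set = field(default_factory=set)   # one customer or a set
    customer_percent: int = 0                     # 0..100, stable per customer
    request_probability: float = 0.0              # 0.0..1.0, rolled per request


@dataclass
class Context:
    node: str
    pod: str
    data_center: str
    customer: str


def customer_bucket(flag_name, customer):
    """Map a customer to a stable bucket in [0, 100) so a rollout does not flap."""
    digest = hashlib.sha256(f"{flag_name}:{customer}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name, rule, ctx, rng=random):
    """Return True if the flag is ON for this request context."""
    if ctx.node in rule.nodes or ctx.pod in rule.pods:
        return True
    if ctx.data_center in rule.data_centers or ctx.customer in rule.customers:
        return True
    if customer_bucket(flag_name, ctx.customer) < rule.customer_percent:
        return True
    # Probabilistic request distribution: each request is sampled independently.
    return rng.random() < rule.request_probability
```

Because the percentage bucket is derived from the flag name and customer ID, the same customer keeps the same answer across requests, while raising `customer_percent` only ever adds customers.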
Building large features at Egnyte
We take large feature projects, divide them into phases, and follow a cadence of shipping every week or two. I call the process “incremental evolution” rather than revolution. For example, we recently worked on a project to achieve a 10x performance improvement on folder operations like move/copy by splitting a large move into sub-parts and executing each part in a separate transaction. The project took 3+ months, but we were shipping regularly. We broke the project into the following phases:
- Implement a metrics feature flag for the long transaction project, add metrics like the number of subfolders/versions, and collect more granular stats about large move/copy operations.
- Turn the metrics feature flag ON for selected customers only and keep collecting data.
- Every week, add more metrics and gather more data.
- Build an explain plan feature that divides the large move/copy into sub-parts, and use it only if the number of subfolders to move is greater than 50K. This way, we build the execution plan and collect data, but the real move/copy still happens using the older approach.
- Keep making algorithm changes, collect more metrics for selected large move/copy operations, and experiment with various parameters to come up with efficient splits.
- Write new code behind the execution feature flag to execute the large move/copy in sub-parts. This took more than a month to implement.
- Turn the feature ON in QA/UAT and OFF in production and bake it.
- Turn the feature ON for selected customers in production and watch its metrics. In case of issues, turn the feature OFF, make changes to the code, and push the fixes during the next release.
- Keep increasing the number of customers the feature is ON for and, at some point, enable it for all customers.
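The "plan first, execute later" phases above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: the flat list of subfolders, `BATCH_SIZE`, and the function names `build_plan` and `move_folder` are assumptions. Only the 50K threshold comes from the text.

```python
from typing import Callable, List, Sequence

SPLIT_THRESHOLD = 50_000  # from the article: plan only above 50K subfolders
BATCH_SIZE = 10_000       # assumption: the real split parameters are not given


def build_plan(subfolders: Sequence[str],
               batch_size: int = BATCH_SIZE) -> List[List[str]]:
    """Split the subfolders into consecutive batches, one per transaction."""
    return [list(subfolders[i:i + batch_size])
            for i in range(0, len(subfolders), batch_size)]


def move_folder(subfolders: Sequence[str],
                legacy_move: Callable[[Sequence[str]], None],
                batched_move: Callable[[List[str]], None],
                execution_flag_on: bool,
                record_metric: Callable[[str, int], None]) -> str:
    """Move subfolders and report which path did the real work."""
    if len(subfolders) <= SPLIT_THRESHOLD:
        legacy_move(subfolders)
        return "legacy"
    plan = build_plan(subfolders)
    # Metrics are recorded even when the plan is not executed (shadow mode).
    record_metric("planned_batches", len(plan))
    if not execution_flag_on:
        legacy_move(subfolders)  # the older approach still does the real move
        return "legacy"
    for batch in plan:
        batched_move(batch)      # each batch would run in its own transaction
    return "batched"
```

Keeping plan construction separate from execution is what lets the plan run in production for weeks, with its metrics inspected, before any customer's move actually goes through the new code path.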
Motivations to ship frequently
- Personally, as an engineer, I feel more productive if I have something tangible to show at the end of the day, and shipping frequently maintains the excitement and keeps me motivated.
- It forces the engineer to think in terms of an MVP and be more creative with time and resources.
- Dependencies on the team are identified early, leading to a concrete plan.
- Coding behind feature flags reduces risk significantly.
- It achieves a snowball effect, and the project gains more momentum as we come closer to turning the feature ON for everyone.
- Because management sees progress every week, there are fewer context switches, and it allows them to plan a deterministic roadmap for bigger projects.
Making it even faster
We are not yet satisfied with the process, and we continuously look for pieces to optimize to make the development and release process even faster. Our next goal is to ship daily, which is a 10x harder problem than shipping twice a week. The main challenges that we need to solve before we take a crack at it are:
- Create more decoupled services, so we can release each service independently. Currently, we deploy multiple service groups at the same time due to coupling between them.
- Break teams into small autonomous squads.
- More decoupled automated unit, perf, and smoke tests.
- Scale the automatic schema changes process in some larger datastores.
- Move all compute to public cloud and Kubernetes so we can move to immutable deployments and better rollback support.
- More monitoring for faster detection and rollback.
- Eliminate remaining manual tests for anything except UI.
- Faster builds
Businesses and customers demand innovation and stability, while engineers crave speed and autonomy. At Egnyte, the feature flag system lets us keep both sides happy and strike a fine balance between speed and reliability. If you are interested in being a part of our amazing team at Egnyte, check out our Jobs page or reach me at https://www.linkedin.com/in/kpatelatwork.