Our DevOps Journey @SparkPost Engineering
From On-Premises Software to Continuous Deployment in the Cloud
These days, we have our heads in the clouds. Or the cloud, rather, as we engineer a fast-moving cloud service and deploy software dozens of times each week. SparkPost helps customers send billions of emails each month using our cloud APIs.
But that has not always been the case.
SparkPost began as Message Systems, which for over seven years specialized in an on-premises commercial software product called Momentum. If you’re unfamiliar with this legendary messaging platform, Momentum powers the email infrastructure of many large senders including Twitter, LinkedIn, Comcast and our own cloud email delivery service SparkPost. Our journey to the cloud has taken us from quarterly release cycles of Momentum to continuous delivery of SparkPost. We safely and quickly deploy new features and fixes to production as soon as they are ready. And now we are ready to gradually remove that final manual push button deployment step for most services and transition more fully to continuous deployment.
We decided to provide a cloud service to take advantage of the growth and transition to cloud computing in the email market. It was a fantastic opportunity for us to bring the power of Momentum to a broader developer audience. Our cloud service has brought the company meteoric growth in 2016. We have now completed our transformation to a cloud-first engineering team and company, which required a significant infrastructure and organization evolution.
Here’s what worked and didn’t work on our devops journey towards continuous deployment.
Meanwhile, the core Momentum development team was also building RESTful APIs to support templates and message generation. All of this new functionality was in support of Momentum v4, the next major release of our on-premises product, and would prove to be an excellent API-first foundation on which to build our cloud service when the time came. The core Momentum team gradually adopted a more agile workflow, which was no small feat considering the size and maturity of this code base. It was a huge improvement over the prior days of development throwing code over the wall to QA. However, the build and test cycles were still measured in days and weeks.
To the Cloud!
Our Managed Cloud service launched in mid-2014 and was essentially Momentum hosted in AWS. We did this under the assumption that our Tech Ops team could build out a customer environment in AWS and then install Momentum, just as any on-premises customer would. We targeted this offering at our traditional enterprise customers who did not want to operate the Momentum email infrastructure themselves. The newly formed Tech Ops team consisted of former Support and Remote Management team members and was separate from Engineering at the time. With little initial AWS experience, they did a great job building our first generation of AWS infrastructure.
We chose AWS because it allowed us to get going quickly and provided a lot of flexibility that would not have been available had we built out our own data center. Nevertheless, we borrowed heavily from a normal data center approach, especially when it came to networking, since that is what our team had the most experience with. Additionally, we treated our managed cloud business as just another customer, albeit an important one, without making any fundamental changes to the underlying product or to how we built, shipped, and deployed it. As we rapidly added more features, it wasn’t long before we realized that this approach was problematic. Disconnects between the dev teams and the operations team resulted in inefficiencies. The traditional on-premises installation and upgrade methods were not compatible with a rapidly changing cloud service.
A Startup Within a Startup
Meanwhile, that same summer we formed a small team focused on delivering the beta release of our as-yet-unnamed public cloud service targeted at developers. This team included a handful of application developers, along with a few engineers from the Momentum and Tech Ops teams. We took the approach of “a startup within a startup” to ensure focus on the mission and avoid distraction or blockers from the core on-prem enterprise business. This team built out our second generation AWS environment based on lessons learned from Managed Cloud. Collaboration improved between development and operations. Now developers deployed code on their own (manually) and provided more guidance on the infrastructure.
To bring this new service to life, the app dev team designed and built a new Web UI with the help of the awesome UX service provider Intridea. The team followed a lightweight Kanban process with very little overhead. By September we settled on the name “SparkPost” and began to sign up beta users while we readied things for our official beta launch at our user conference later that year.
Following the well received beta of SparkPost we realized we needed to reorient the broader engineering team towards the cloud. We had ambitious goals for the official SparkPost launch in early 2015. Many features including self-service billing and compliance measures (to keep the spammers and phishers out) were on our to-do list. We also targeted additional client libraries and had to make important improvements to performance, scalability, and usability.
To move faster we had to tackle the challenge of reliably and frequently deploying changes to the production environment. While some of our microservices were better suited to a move towards continuous deployment, the Momentum software was not. The challenges we encountered included lengthy build times and an overnight regression test suite whose numerous flaky test cases slowed us down. We also started with a homegrown installation utility, written in Perl, for installations and upgrades. We had designed this utility for our on-premises customers, who installed and upgraded software very infrequently, and it proved clunky for our use case.
To tackle these problems head on we decided to fully embrace the continuous delivery model and committed to tackling two short term objectives: to automate the deployment of any change to a UAT environment within 1 hour and to deploy Momentum to SparkPost production environment twice a week.
At this time we switched all of the engineering teams over to Kanban and incorporated all the learnings from the initial SparkPost beta team.
During the next few months there were a number of dramatic results to come out of this concerted effort to adopt continuous delivery. One change was a deliberate switch in who was responsible for doing software deployments and a resulting decrease in deployment times and unintended service interruptions. Rather than the developers providing software and instructions to the operations team, the development team took over this responsibility while still getting valuable assistance from the operations team. To solve our deployments problem we created a new cross-functional “Deployment Team” which included members from each dev team and operations.
The Deployment Team experimented with several approaches and tools before choosing Bamboo and Ansible to automate the deployment of database, code, and configuration changes. Within a short period of time the team had automated the nascent build and deployment pipelines for each service. We removed any long running test suites from the critical path, and we incorporated automated upgrade, smoke tests, and rollback scripts. The on-premises installer script was finally obsolete.
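The upgrade, smoke test, and rollback sequence described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not our actual Bamboo and Ansible pipeline: the `deploy` function, the `DeployResult` type, and the callables passed in are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeployResult:
    version: str        # version left running after the attempt
    rolled_back: bool   # True if a smoke test failed and we reverted
    log: List[str] = field(default_factory=list)


def deploy(
    current_version: str,
    new_version: str,
    upgrade: Callable[[str], None],
    smoke_tests: List[Callable[[], bool]],
    rollback: Callable[[str], None],
) -> DeployResult:
    """Upgrade, run fast smoke tests, and roll back on the first failure.

    Long running regression suites are deliberately not part of this path;
    only quick checks belong in `smoke_tests`.
    """
    log = [f"upgrade {current_version} -> {new_version}"]
    upgrade(new_version)
    for test in smoke_tests:
        name = getattr(test, "__name__", "smoke_test")
        if not test():
            log.append(f"{name} failed; rolling back to {current_version}")
            rollback(current_version)
            return DeployResult(current_version, True, log)
        log.append(f"{name} passed")
    return DeployResult(new_version, False, log)
```

In a real pipeline the `upgrade` and `rollback` callables would wrap whatever the automation tool runs (for example, a playbook invocation), and the smoke tests would hit the freshly deployed service. Keeping them as plain callables makes the control flow easy to test without touching any infrastructure.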
We achieved a reasonably good continuous delivery and deployment pipeline by the time of the GA launch in April 2015 and we were deploying several times a week during business hours, including not just the many lightweight microservices but also the Momentum platform.
Another big and positive result was the dramatic reduction in our cycle time. In 2014 our cycle time averaged around 8 days for all issues, but within a few months this dropped to 6 days in 2015. Even more stunning, average cycle times for user stories dropped from 22 days to less than 10 days. This was even after moving the goalposts on the definition of done from “verified in UAT” to “verified in production”. We were pleased to discover that our reduced cycle times resulted in greater velocity and improved quality, with all teams getting a lot more done faster and better.
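For readers who want to track the same metric, here is a minimal sketch of how an average cycle time could be computed. It assumes cycle time means the number of calendar days between work starting on an issue and the issue meeting the definition of done; the function names and the sample dates are made up for illustration.

```python
from datetime import date
from statistics import mean
from typing import Iterable, Tuple


def cycle_time_days(started: date, done: date) -> int:
    """Calendar days from start of work to meeting the definition of done."""
    return (done - started).days


def average_cycle_time(issues: Iterable[Tuple[date, date]]) -> float:
    """Mean cycle time over (started, done) pairs.

    Raises statistics.StatisticsError if `issues` is empty.
    """
    return mean(cycle_time_days(started, done) for started, done in issues)
```

Note that changing the definition of done (say, from verified in UAT to verified in production) changes the `done` date for every issue, so numbers from before and after such a change are only comparable if that is kept in mind.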
As an important enabler to these improvements we adopted an MVF (minimum viable feature) approach that clearly identified the customer need but let the development teams drive the solutions in an incremental way focusing on delivering quickly, eliminating a lot of the upfront requirements analysis and technical design.
We learned to listen more to our developer user community and took advantage of our shorter development cycle times to quickly deliver fixes and improvements that users wanted.
Over time the development teams gradually evolved their processes to fully incorporate unit, acceptance, and performance testing, and we eliminated the separate QA function. Some of the QA team members transitioned into development and some moved into the Deployment Team.
Around this time we discontinued our traditional Project Management Office (PMO) which had centrally controlled all development projects. We decentralized responsibility for delivery to the individual development team managers, embedding Product Owners directly within those teams. This helped further reduce overhead and increased agility.
3rd Generation Infrastructure
We learned many valuable lessons operating email infrastructure at scale in AWS and by mid 2015 had completed the move to our third generation of infrastructure. Now we properly leverage Amazon’s VPCs, security groups, ELBs, EC2 instance types, CloudFormation provisioning instead of Terraform, EBS and ephemeral storage, and even more service resilience using clusters spanning multiple availability zones. We switched from Nagios to Circonus for monitoring. An outbound email proxy now separates the management of outbound IP addresses from the MTAs which allowed us to easily add more MTAs independently of the number of IP addresses we send email with.
During this time we formalized on-call schedules across not just the operations team but also the development teams, with the understanding that everyone shares responsibility for the health of our production environment. This was increasingly important since most changes were deployed to production using automated deployment pipelines built by the dev teams. Our Deliverability team and Technical Account Managers also use Opsgenie for on-call rotations. Besides shortening the resolution time for production issues, this approach empowered the development teams to make the necessary improvements to minimize and often eliminate the source of production issues, resulting in a much more reliable service.
We improved a number of important processes including Change Management (CM) and Root Cause Analysis (RCA). Our CM procedures cover all deployments and changes to our production environments. Change Management helps prevent negative impact to customers by enforcing a thorough testing and review process for all production changes. This approach has greatly improved transparency and risk management for us and reduced the number of off-hour fire-drills. Not all CM tickets are the same; we account for differences between standard and emergent changes and we do not require separate CM tickets for changes deployed through our automated deployment pipelines.
Our RCA process helps us properly identify the root cause of customer impacting events and follows the “5 whys” approach. We don’t use RCAs to place personal blame; instead we focus on the corrective actions (technology, process, or training) to ensure we do not fail the same way twice.
It’s important we optimize our time to find and fix a bug in production rather than slow things down too much in a futile effort to prevent all bugs with testing. We use our continuous delivery and deployment processes to quickly fix and deploy a patch confidently.
With so many customer facing changes made by different teams, we need effective internal communication without introducing unnecessary blocking dependencies. A core group from Product Management, Product Owners, Support, and technical team leads have a “scrum of scrums” each week to ensure there is sufficient awareness of coming changes.
To further help spread awareness of changes, we automatically post a daily summary of “customer impacting” JIRA tickets to an internal Slack channel, and one of our Slack bots automatically posts any major changes there throughout the day. Slack is a fantastic tool and we use it extensively throughout the company for team, topic, and interpersonal communication.
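As a rough idea of what such a daily summary could look like, the sketch below builds the JSON body for a Slack incoming webhook, which accepts a payload with a `text` field. Everything else here is an assumption made for the example: the `Ticket` type, the `customer-impacting` label name, and the message wording are hypothetical, and fetching tickets from JIRA and sending the HTTP POST are left out.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    key: str            # e.g. "API-123" (made-up key format)
    summary: str
    labels: List[str]


# Hypothetical label used to mark tickets that customers will notice.
CUSTOMER_IMPACTING = "customer-impacting"


def build_daily_summary(tickets: List[Ticket], day: str) -> str:
    """Return the JSON body for a Slack incoming webhook post."""
    impacting = [t for t in tickets if CUSTOMER_IMPACTING in t.labels]
    if not impacting:
        text = f"No customer impacting changes for {day}."
    else:
        lines = [f"Customer impacting changes for {day}:"]
        lines += [f"- {t.key}: {t.summary}" for t in impacting]
        text = "\n".join(lines)
    return json.dumps({"text": text})
```

Separating the message building from the sending keeps the bot easy to test: the function above is pure, and the code that actually posts to the webhook URL can stay a thin wrapper around it.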
By early 2016 when MailChimp announced they were discontinuing their developer oriented standalone Mandrill transactional email service, we were in an excellent position to fill that gap for developers and offer them a much needed alternative.
As these developers came over to SparkPost in droves we were able to easily scale our platform and quickly add features that were in high demand, including subaccounts and dedicated IP pools.
Additionally, we ramped up our Developer Relations team to support this influx of developers. This team is grounded in the SparkPost Engineering department. The team’s mission is to support developers through client libraries, tools, content and even *gasp* direct human interaction. We love to interact with our developer community at various events, hackathons and on our community Slack team. You can find our upcoming events on developers.sparkpost.com.
Site Reliability Engineering
In the summer of 2016 we collapsed Tech Ops into Engineering to improve collaboration and efficiency. At this time we created a new Site Reliability Engineering (SRE) team that incorporated a number of functions into a single cross-functional team within Engineering. This finally broke down any remaining walls between development and operations. This new team fully embraces the “infrastructure as code” approach and has oversight of cloud infrastructure, deployments, upgrades, monitoring, and auto-scaling while promoting discipline, safety, and a positive customer experience.
Due to the efforts of this team we made significant improvements in our availability and reductions in customer impacting events. We also built out more comprehensive and actionable monitoring and alerting which has improved the overall customer experience — and boosted the morale of the team.
The Road Ahead
Besides lots of great features on the roadmap, we are rolling out the fourth generation of our AWS infrastructure. This latest iteration includes further decoupling of service tiers, improved automation and monitoring, replacing or augmenting some traditional technologies with AWS services such as SQS and DynamoDB, advances in auto-scaling of service tiers, and improvements in our outbound email proxy technology. We have some API performance improvements in the works as well, which users will love. We will complete the final transition of application configuration management from Puppet to Ansible and fill in any system configuration management gaps with Puppet. Early experiments with Amazon’s container services should begin to make their way into production. All of these advances will help make SparkPost APIs even more reliable, more scalable, and faster.
This year we will likely move some services to continuous deployment. This means eliminating the manual button pushing for the final production deployment step. There is a little more work to do on our automated smoke tests and roll-back scripts before we are ready and we are exploring options such as canary releases.
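To make the canary idea concrete, here is a minimal sketch of the kind of decision rule a canary release might use: send a small share of traffic to the new version, compare its error rate to the current version, and then promote, roll back, or keep waiting. This is one common way to frame it, not a description of anything we have built. The `Stats` type, the `min_requests` and `tolerance` parameters, and their default values are all assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    errors: int   # failed requests observed
    total: int    # all requests observed


def error_rate(stats: Stats) -> float:
    return stats.errors / stats.total if stats.total else 0.0


def canary_decision(
    baseline: Stats,
    canary: Stats,
    min_requests: int = 500,    # assumed minimum sample before deciding
    tolerance: float = 0.01,    # assumed allowed extra error rate
) -> str:
    """Return "wait", "rollback", or "promote" for a canary deployment."""
    if canary.total < min_requests:
        return "wait"           # not enough traffic to judge yet
    if error_rate(canary) > error_rate(baseline) + tolerance:
        return "rollback"       # canary is clearly worse than baseline
    return "promote"
```

A rule like this would sit between the automated smoke tests and the full production rollout, replacing the human who currently pushes the final button with an explicit, reviewable threshold.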
Through disciplined devops practices and a lot of hard work over the past few years, the SparkPost team has achieved tremendous success. But we realize this is a journey and there is always more to do as we continue to scale up and move faster.
VP, Engineering and Cloud Operations at SparkPost
Reposted from SparkPost Blog