OpsGenie is on a journey to reap the benefits of serverless architecture

Engineers are coding and deploying new product features using the AWS Lambda service and moving existing apps to serverless

Sezgin Kucukkaraaslan
A Cloud Guru
May 12, 2017


We started the OpsGenie start-up journey in 2011 with three senior full-stack developers who were very experienced in building on-premise Java enterprise products that specialized in integration, data enrichment, and management of first-generation infrastructure monitoring tools. We saw an opportunity in the market and decided to use our expertise to build an internet service for Alert/Incident Management.

But stepping into the SaaS world would bring many unknowns. Operational complexity, high availability, scalability, security, multi-tenancy, and much more would be our challenges. The first thing we decided was that sticking with AWS technologies would help us overcome many of those challenges. Even if there were better alternatives out there, we started to use fully or partially managed Amazon services for our computing, database, messaging, and other infrastructure needs that I cannot remember right now.

As many start-ups do, we started coding with a single git repository. But somehow we didn’t have a monolithic architecture. It was still a monolith, of course, in the sense that it was built from the same code repository. :) We separated customer-facing applications from the ones that did heavy calculations in the background. OpsGenie architecture was composed of the following components, in the early days:

Web: A typical backend-for-frontend server written in Grails that served the OpsGenie web application and the mobile REST APIs.

REST API: A lightweight HTTP server and an in-house framework built on top of Netty, which provided the OpsGenie REST API.

Engine: A standalone Java SE application that calculated who should receive a notification, and when.

Sender: A standalone Java SE application that talked to third-party providers in order to send email, mobile push, SMS, and phone notifications.

We were operating in two zones of Amazon’s Oregon region, and we designed the architecture so that all types of applications had one member alive in every zone, during deployments. We put front-end servers behind Amazon Elastic Load Balancers, and all inter-process communications were made via asynchronous message passing with SQS. That provided us with great elasticity, in terms of redundancy, scalability, and availability.
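The decoupling described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's in-memory `queue.Queue` stands in for an SQS queue, and the `engine` and `sender` functions, the message fields, and the recipient names are all hypothetical, not OpsGenie's actual code.

```python
import queue
import threading


def engine(out_queue, alert_id, recipients):
    """Hypothetical Engine: decide who is notified, enqueue one message each."""
    for recipient in recipients:
        out_queue.put({"alert_id": alert_id, "recipient": recipient})


def sender(in_queue, delivered):
    """Hypothetical Sender: consume messages until a None sentinel arrives."""
    while True:
        message = in_queue.get()
        if message is None:
            break
        # A real sender would call an email, SMS, or push provider here.
        delivered.append(
            f"notify {message['recipient']} about {message['alert_id']}"
        )


def run_demo():
    notifications = queue.Queue()  # in-memory stand-in for an SQS queue
    delivered = []
    worker = threading.Thread(target=sender, args=(notifications, delivered))
    worker.start()
    engine(notifications, "alert-1", ["alice", "bob"])
    notifications.put(None)  # tell the sender to stop
    worker.join()
    return delivered
```

The producer never calls the consumer directly, so either side can be restarted, redeployed, or scaled without the other noticing. Note that this in-memory queue is strictly first-in, first-out, whereas a standard SQS queue offers only best-effort ordering and at-least-once delivery, so a real consumer would also need to tolerate duplicates.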

Then the same old story happened. We encountered the same obstacles and opportunities that every successful startup meets along the way: the product aroused keen interest, which led us to develop many more features, handle support requests, recruit new engineers, and so on! As a result, the complexity of our infrastructure and code base increased.

Our architecture began to look like the following:

Before I mention the problems that emerged with this architecture, I’d better talk a little bit about the engineering culture we were developing:

We had embraced the evolutionary and adaptive nature of Agile software development methodologies even before we started OpsGenie. We were already practicing Test Driven Development. We started to use a modified version of Scrum once the team grew beyond eight to ten developers. We accepted the importance of lean startup methodologies and fast feedback loops. We committed to the work needed to continually evolve our organization, culture, and technology in order to serve better products to our customers.

Even though the term is relatively new, we embraced the technical aspects of DevOps from the very beginning. We have been practicing Continuous Integration and Continuous Delivery. We have continuously monitored our infrastructure, software, logs, web, and mobile applications. Also, as soon as a new developer joined the company, got his or her hands a little bit dirty with the code, and understood the flows, he or she began to participate in on-call rotations to solve problems before our customers noticed them. And we continue to honor our commitments to an engineering culture based on such ideas and practices.

After this brief insight, I hope that the problems that we have faced seem more understandable. Here they are:

  • Common code base: Although we strive to follow object-oriented best practices and SOLID principles, it becomes inevitable that your code base gets messy and dirty, as your product and its features grow bigger and bigger. When competition is tough, business pressure increases — which then raises your technical debt and bugs.
  • New engineer onboarding: Bringing on new engineers is made even more difficult by the situation above. When a junior — or even a senior — developer joins the company, he or she immediately encounters a huge technology stack and a complex code base. Dealing with that can result in significant delays before a new developer becomes productive.
  • Complex release cycles: When many changes are shipped in each release, the release process can become risky and uncomfortable.
  • Slower deployment: As your application grows, the time needed to deploy, start, and stop it increases.
  • Difficulty diagnosing production problems: When your application sends signals of slowing down or experiences resource shortages, it becomes difficult to find where the problem originated — especially if your application performs many heavy tasks and sophisticated business flows.
  • Inefficient scalability: If you detect that a business flow is slowing down and needs to be scaled, you need to scale the whole application, which is highly inefficient, in terms of resource utilization.
  • Failures affect the entire system: When your application crashes, or its container goes down, all flows running in it also crash.
  • Lack of ownership: When a large group of developers participates in the implementation of your software, their sense of ownership decreases as your application goes through many different phases of development, testing, deployment, and operations. This can negatively affect your company’s Mean Time to Resolve (MTTR) performance.

As I mentioned before, we were not the first internet service company facing these kinds of challenges. Many more out there survived and succeeded on a massive scale. All we had to do was learn from their experiences and figure out the way that was most appropriate for us.

Microservices

Much has been said and discussed about microservices. There are floods of articles, blogs, books, tutorials, and talks about them on the internet. So I have no need or desire to explain the term, “microservices.”

Although pioneering companies like Amazon and Netflix switched to this architectural style in the previous decade, I think use of the term “microservices” exploded when Martin Fowler published his blog post about the concept in 2014. Amazon CTO Werner Vogels described their patterns as SOA in an interview published in 2006.

Instead of giving a complete definition, Martin Fowler addressed nine common characteristics of a microservices architecture:

Componentization via services

Organized around business capabilities

Products not projects

Smart endpoints and dumb pipes

Decentralized governance

Decentralized data management

Infrastructure automation

Design for failure

Evolutionary design

When we looked at our architecture, we realized that we were not far from those ideals, so a move to a microservices-oriented architecture was within reach. Our most critical need was to organize as cross-functional teams in order to implement different business capabilities in different code bases. We already had at least some organizational expertise with the other characteristics Fowler described.

Serverless (FaaS)

So, why are we moving to a serverless architecture instead of simply implementing microservices? There are several advantages to using AWS Lambda instead of building Dockerized applications on top of Amazon ECS or deploying them to a PaaS like Heroku:

  • Capacity Management: The most important advantage of serverless technologies is that they can automatically scale your functions according to load. Although it is possible to configure PaaS or CaaS solutions to scale up and down, according to thresholds that you can define on some metrics, it is completely transparent in the FaaS world. Even if you have one customer today and thousands of customers tomorrow, you have nothing to worry about. AWS Lambda handles the load seamlessly, which greatly reduces operational complexity. Also, scaling happens at a micro-level, which optimizes your resource consumption.
  • Pay-as-you-use: AWS charges you at the invocation level. If your functions do not receive a load, you pay nothing. With micro-level capacity management and a pay-as-you-use model, it is evident that your costs will be optimized, too. That also creates some interesting opportunities… For example, you can easily determine if a particular product feature of yours is profitable or not.
  • Automatic Recovery: With sub-second startup times (well, not in Java), you don’t have to worry about things like failover, load balancing, etc. If your function crashes for some reason, AWS immediately spins up a new container for you, and the whole process happens entirely behind the scenes.
  • Versioning: When you deploy a Lambda function, you can assign a version to it, and different versions of your functions can co-exist. You can call whichever version of your function you want from your client code. AWS Lambda also has alias support for your functions so that you may not need to change your client code in order to execute different versions. This helps you to easily implement Blue/Green Deployment, Canary Release, or A/B testing in the backend.
  • Centralized Log Management: All logs of your Lambda functions go directly to Amazon CloudWatch with zero configuration.
  • Small Isolated Modules: The function-as-a-service model pushes you to implement your software in highly cohesive and loosely coupled modules, which helps you avoid a lot of technical debt.
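To make the versioning point in the list above more concrete, here is a minimal in-memory sketch of how an alias can let clients keep calling one stable name while traffic shifts between versions. It does not call any AWS API: the `Alias` class, the `canary_weight` parameter, and both handlers are hypothetical names invented for this illustration, and the weighting scheme is an assumption rather than a description of how Lambda itself routes requests.

```python
import random


def handler_v1(event):
    """Hypothetical version 1 of a function."""
    return {"version": "1", "greeting": f"Hello, {event['name']}"}


def handler_v2(event):
    """Hypothetical version 2 of the same function."""
    return {"version": "2", "greeting": f"Hi there, {event['name']}!"}


# Published versions co-exist side by side.
VERSIONS = {"1": handler_v1, "2": handler_v2}


class Alias:
    """A stable name that points at one version, optionally with a canary."""

    def __init__(self, name, primary, canary=None, canary_weight=0.0):
        if not 0.0 <= canary_weight <= 1.0:
            raise ValueError("canary_weight must be between 0 and 1")
        self.name = name
        self.primary = primary
        self.canary = canary
        self.canary_weight = canary_weight

    def invoke(self, event, rng=random):
        # Send roughly canary_weight of the calls to the canary version.
        use_canary = (
            self.canary is not None and rng.random() < self.canary_weight
        )
        version = self.canary if use_canary else self.primary
        return VERSIONS[version](event)


# Clients only ever reference the alias, never a version number.
live = Alias("live", primary="1", canary="2", canary_weight=0.1)
response = live.invoke({"name": "Ada"})
```

Under this sketch, a Blue/Green switch amounts to repointing `primary` from one version to the other, while a canary release raises `canary_weight` step by step. In both cases the client code stays unchanged.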

At the beginning of 2017, we recruited three senior engineers who had no previous knowledge of OpsGenie’s infrastructure and code base, and very little experience with cloud technologies. They started to code an entirely new product feature in a separate code base, to be deployed to the AWS Lambda service. In four months, they did an excellent job.

They prepared our development, testing, and deployment base, and they implemented a brand new product extension as well. What they accomplished was a full-blown application, not simple CRUD functions, database triggers, or any other officially referenced architectural pattern. As I write these lines, they are sending it to production to open it to some beta customers.

When we feel safe, and our delivery pipeline stabilizes, we plan to split our applications, domain by domain, and move them to serverless. Follow our journey on Medium in our engineering blog, where we will keep sharing our experiences with serverless.
