Learnings after a day at Serverless Days Boston
Conference link: https://boston.serverlessdays.io/
The theme of the conference was less about why serverless is the future and more about what to consider when building serverless applications. It featured talks from AWS, Microsoft and Google, along with many startups in the cloud space. Even though a majority of the speakers talked about AWS, the focus of the conference was to drive engagement and innovation in the serverless space, irrespective of the cloud provider.
The conference was single day and single track, so you never had to worry about choosing between two talks scheduled at the same time.
What is Serverless
The conference had 10 different talks and each speaker had a different explanation for serverless, but I liked this particular explanation from Ben Kehoe, Cloud Robotics Research Scientist at iRobot.
Serverless is a way of abstracting the infrastructure from the developer. The point of Serverless is
- Not about Functions as a Service
- Not about using managed services
- Not about reducing operational burden
- Not about cost savings
- Not about Technology
but about driving maximum business value to the customers. Serverless lets teams focus on business value because we largely don’t have to worry about configuring, maintaining or scaling servers.
Other Additional Benefits to the Business
- Easier to outsource parts of the application
- Granular security makes auditing easier and minimizes the blast radius
- Agility and Faster Development cycles
As someone who had no real production-level exposure to serverless infrastructure, I realized that serverlessness is a direction, not a destination. I can categorize the takeaways into 4 sub-themes.
- Serverless mindset to any stack.
- Increase Observability of serverless infrastructure through Software Ownership
- Security in distributed systems
- Make cost a first class metric
Serverless is a state of mind and its principles can be applied to any stack. It means focusing on business value and relying on managed services for dependencies, with as few distractions as possible. Since serverless is a trait, we can think of it as an infinitely tall ladder: in a team’s serverless journey, there is always a rung above the team. Here are some ways that we could bring this mindset into our work
- Always find your team’s part of the business value and focus complete efforts on it
- To improve focus, know the direction of the thing you’re building, consider constraints and make choices
- Always focus on building the smallest thing that could get the job done (Minimalism)
- Be afraid of the enormity of the possible. For example, when evaluating a technology, always check how much more something could do than what you need, and try to reduce that as much as you can. Code is a liability.
- Embrace configuration as code. Coding is not the most important part, and if something can be achieved by configuring a managed service, prefer that over building it yourself. After all, days of coding only save you hours of configuration.
- Have people automate their jobs and take on new problems to solve
- If you can flowchart a workflow, you can serverless that workflow
This, I would say, was the core theme of the conference. Developers in the industry are spending more and more time debugging issues in an attempt to understand their systems. It doesn’t mean that they are bad at debugging, but that they are using tools designed for different problems. We need tight feedback loops in our systems.
Debugging is hard even in a monolithic environment; I am pretty sure everybody agrees with me. Now, consider a serverless environment on AWS. If you’re using Lambda, a function can be invoked by ~50 different sources and can itself invoke other Lambdas. Distributed environments get complex really fast, which makes them even harder to debug, and hence being able to observe how our systems interact in real time in production becomes essential. Here are some techniques/ideas to improve observability of systems.
- Log high cardinality metrics as a part of events
- Have a structured event that is captured end to end and bring in sampling if there is a necessity to control costs / bandwidth
- Move towards Observability Driven Development from Test Driven Development. TDD doesn’t help debug production issues
- Do not rely on dashboards, as they are for passive interaction and cannot easily be used to debug a failure. They are also usually artifacts of some past failure and do not convey the overall system health
- Do not test only on staging/local environments and trust that things will work well in production. Test your code as close to the production environment as possible.
- Aggregates are the devil and they eliminate the ability to ask questions
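To make the first two ideas in the list concrete, here is a minimal sketch of a structured event that carries high cardinality fields and is sampled before it is emitted. The field names, the `sample_rate` default and the rule of always keeping server errors are my own assumptions for illustration, not something a speaker or a vendor prescribed.

```python
import json
import random
import time


def build_event(request_id, user_id, route, duration_ms, status):
    """Collect everything known about one request into a single wide event."""
    return {
        "timestamp": time.time(),
        "request_id": request_id,  # high cardinality: unique per request
        "user_id": user_id,        # high cardinality: unique per user
        "route": route,
        "duration_ms": duration_ms,
        "status": status,
    }


def should_keep(event, sample_rate, rng=random.random):
    """Keep every server error; keep roughly 1 in `sample_rate` other events."""
    if event["status"] >= 500:
        return True
    return rng() < 1.0 / sample_rate


def emit(event, sample_rate=10, rng=random.random):
    """Print and return the serialized event, or return None if sampled out."""
    if not should_keep(event, sample_rate, rng):
        return None
    # Record the effective rate so counts can be re-weighted later.
    kept_rate = 1 if event["status"] >= 500 else sample_rate
    line = json.dumps({**event, "sample_rate": kept_rate}, sort_keys=True)
    print(line)
    return line
```

Because the raw event is kept instead of a pre-computed aggregate, you can still ask new questions later, such as "which user saw the slow requests on this route?".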
honeycomb.io is one of the companies in this space that is providing solutions to the Observability problem for serverless applications.
Some of the challenges in handling errors in a serverless environment are
- Event driven design
- No server to connect to
- No current state
- Cannot store logs
- Cannot run debugging agents
To circumvent some of these problems, most of the cloud providers give an option to Retry a failure, but it is not the right solution to the problem, as troubleshooting with retries is also hard. Retries might change the flow of the application if your code is not idempotent.
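The point about idempotency can be illustrated with a small sketch. It assumes each event carries a unique `idempotency_key` (a hypothetical field name) and uses an in-memory dictionary as a stand-in for a durable store; a real system would need a persistent store and an atomic conditional write, which this sketch does not attempt.

```python
class InMemoryStore:
    """Stand-in for a durable store such as a database table (assumption)."""

    def __init__(self):
        self._seen = {}

    def get(self, key):
        return self._seen.get(key)

    def put(self, key, value):
        self._seen[key] = value


def handle_payment(event, store, balances):
    """Apply a charge once, even if the same event is delivered again."""
    key = event["idempotency_key"]
    previous = store.get(key)
    if previous is not None:
        # A retry of an event we already processed: return the saved result
        # instead of charging the account a second time.
        return previous
    account = event["account"]
    balances[account] = balances.get(account, 0) - event["amount"]
    result = {"account": account, "balance": balances[account]}
    store.put(key, result)
    return result
```

With this shape, a retried delivery of the same event leaves the balance unchanged, so the retry no longer alters the flow of the application.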
Distributed tracing and distributed logging help with these things. AWS X-Ray is a good place to start, but there are many other tools that can help you observe and handle errors better in a production environment, like Jaeger, OpenTracing and OpenCensus.
Measure, Debug, Collaborate and Fix is the mantra!
Your code is your responsibility. In a serverless public cloud environment, security is a shared responsibility between the cloud service provider and the consumer. In AWS, where a Lambda can be triggered by ~50 different event sources, an attack on your application could come from any direction. OWASP ServerlessGoat is a deliberately insecure but realistic application maintained by OWASP, and it can be used to understand the various kinds of vulnerabilities in a serverless application. The most common security mistake in a serverless environment, however, is over-provisioning permissions for services. So, for example, in AWS, always adopt the Role-Per-Function model and SAM managed policies, so that each function gets only the limited access it needs. To prove this, the speaker showed how an attacker could break into a job portal (which uses a Lambda to respond to an email a candidate has sent and saves the record in DynamoDB) and retrieve data from the backend through that same Lambda, because it happened to be over-provisioned. A similar example is linked below. Make sure to check it out.
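As a separate illustration of my own (this is not the speaker's demo), here is a rough sketch of how you might flag over-provisioned statements in an IAM-style policy document. It only looks for wildcard actions and resources in `Allow` statements; the helper name `find_overbroad_statements`, the account ID and the table name are hypothetical, and a real audit would need to check far more than this.

```python
def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_overbroad_statements(policy):
    """Return (statement index, reason) pairs for wildcard Allow statements."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append((index, "wildcard action"))
        if any(r == "*" for r in resources):
            findings.append((index, "wildcard resource"))
    return findings


# Over-provisioned: the function can do anything to any DynamoDB table.
broad_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"},
    ],
}

# Scoped: the function can only write items to one (hypothetical) table.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Candidates",
        },
    ],
}
```

In the job portal scenario, a function that only needs to save a record would look like `scoped_policy`; with something like `broad_policy`, the same function could also be used to read the data back out.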
When building serverless systems, reliability should be a part of the operating model. Reliability here is the trustworthiness of a system’s ability to delight the customer. The forces that drive reliability are DevOps (culture) and SRE (practice). Multiple factors impact serverless systems, like availability, efficiency, monitoring and capacity planning. The name “serverless” is misleading, because there is some server somewhere at the bottom of the stack that is actually executing your logic. So not just the Lambda costs, but the costs of all the moving parts (for example, in AWS: Lambda costs, API Gateway costs, S3 costs etc. for your application) should be considered when coming up with the SLA, SLO and SLI metrics for your system. A system that puts a company out of business is not a reliable system, and no company has an unlimited supply of funds, so cost should become a first class metric when defining the reliability of a system. A first class metric should have real time data, have context, be measurable and have a clear definition of good and bad, which is what you define using your Service Level Objectives.
Making Cost a first class metric and merging it with the principles of DevOps is being called FinDevOps and its principles are pretty simple, yet powerful
- There should be a tight correlation between cost and well-architected systems. For example, have a metric for “At what point do we degrade the customer experience to reduce costs?” If you say that your application’s response time will be ~4 seconds during Thanksgiving, when you expect a lot of traffic to your app, as opposed to ~1 second on any other day, that’s totally fine as long as the decision is made consciously
- Focus on Return on Investment on cloud costs, so that effective decisions are made
- Track gross margins to determine the cloud spend
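As a rough sketch of what treating cost as a first class metric could look like, the snippet below adds up the cost of all the moving parts, turns it into a cost per request, and compares it against a target in the same way you would check a latency SLO. All the dollar figures, the request volume and the target value are made-up placeholders, not real AWS prices.

```python
def cost_per_request(monthly_costs, monthly_requests):
    """Total cost of every service involved, divided by requests served."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    return sum(monthly_costs.values()) / monthly_requests


def gross_margin(revenue, cloud_spend):
    """Fraction of revenue left after paying for the cloud."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cloud_spend) / revenue


def cost_slo_met(monthly_costs, monthly_requests, max_cost_per_request):
    """A clear definition of good and bad: are we under the cost target?"""
    return cost_per_request(monthly_costs, monthly_requests) <= max_cost_per_request


# Hypothetical monthly bill in dollars, broken down by service.
costs = {"lambda": 120.0, "api_gateway": 70.0, "s3": 10.0}
requests_served = 2_000_000

print(cost_per_request(costs, requests_served))
print(cost_slo_met(costs, requests_served, max_cost_per_request=0.0002))
print(gross_margin(revenue=1000.0, cloud_spend=sum(costs.values())))
```

The useful part is not the arithmetic but that the target is written down, so a decision to trade customer experience for cost is made consciously.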
I also recommend Google’s SRE course if you want to learn more about building and monitoring reliable systems.
Additional Material suggested by speakers
Overall, I found the conference extremely useful and would recommend it to others in the future, whether or not you use serverless applications, as the learnings can be applied to any environment.