Monitoring & Alerting for Lambda Functions at Scale

Monitor a fleet of Lambda functions with CloudWatch and AWS Chatbot

Ronny Roeller
NEXT Engineering
3 min read · Feb 3, 2024

When it comes to software and system management, having a strong monitoring and alerting setup is like having a dependable watchdog for your systems. Let’s break down why it’s crucial, the main rules to follow, and how to manage sensitive data without stepping over privacy lines.

Why Monitoring & Alerting Is a Big Deal

Think of a monitoring system as your system’s health check. Without it, your system’s performance could swing wildly — one minute everything’s fine, and the next, you’re firefighting a major issue.

Here’s why it’s vital:

  • Keep Performance Steady: You want a smooth-running system, not a roller coaster of performance highs and lows.
  • Catch Issues Early: It’s better to spot — and ideally fix — a problem before your users stumble upon it.
  • Think Ahead: It’s about anticipating what might go wrong and planning for it, so you’re not caught off guard.

Key Principles for a Solid Setup

Setting up a monitoring and alerting system isn’t just about installing some tools. It’s about sticking to some core principles to make sure it does its job well:

  • Make Alerts Meaningful: Every alert should be clear on what the issue is and what needs to be done.
  • Keep the Right People in the Loop: Make sure the alerts go to the team members who can actually fix the problem.
  • Cut Down on Noise: Too many unnecessary alerts and people start ignoring them, even the important ones.
  • Keep Improving: Use feedback to make your monitoring system better over time.

Handling Data Privacy

In today’s world, protecting customer data is critical. The challenge is to give your team the information they need without crossing privacy boundaries. Here’s our approach:

  • Share What’s Safe: We share general info (like metadata) that doesn’t reveal sensitive details.
  • Keep Tight Control on the Rest: Only the Ops team gets to access the sensitive data, ensuring customer information stays secure.
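One way to enforce this split is an allow-list: only explicitly approved metadata fields are copied into whatever gets shared, and everything else stays behind. The sketch below is a minimal illustration; the type `ErrorContext`, the constant `SHAREABLE_KEYS`, the function `toShareable`, and the field names are all hypothetical and not taken from our production code.

```typescript
// Hypothetical error context; the field names are illustrative assumptions.
type ErrorContext = Record<string, unknown>;

// Allow-list: only these keys are ever forwarded to the shared channel.
const SHAREABLE_KEYS = ["environment", "functionName", "logStream", "timestamp"];

// Copies allow-listed keys only, so newly added fields stay private by default.
function toShareable(context: ErrorContext): ErrorContext {
  const safe: ErrorContext = {};
  for (const key of SHAREABLE_KEYS) {
    if (key in context) {
      safe[key] = context[key];
    }
  }
  return safe;
}
```

An allow-list fails closed: a field that nobody has reviewed is never shared, whereas a deny-list would leak it until someone remembers to block it.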

Our Practical Approach: Monitoring in Action

Theory is good, but practice makes perfect. Here’s how we’ve put these ideas into action:

From Error to Slack Notification

  1. Logging Errors: Our Lambda functions log errors via console.error.
  2. Gathering Logs: The logs are collected in AWS CloudWatch.
  3. Picking Out Errors: A CloudWatch subscription filter triggers a Lambda function, which extracts the errors from the logs and enriches them with metadata (e.g. the environment in which the error was detected and which Lambda function threw it).
  4. Sending Notifications: These errors with their metadata are then sent to SNS (Simple Notification Service).
  5. Chat Integration: AWS Chatbot, subscribed to the SNS topic, picks up these notifications.
  6. Alerting the Team: Finally, Chatbot pushes these notifications to Slack to alert the team.
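Steps 3 to 5 can be sketched roughly as follows. This is a simplified illustration under several assumptions, not our actual forwarder: it starts from the already decoded subscription payload, it assumes the default Node.js text log format (where console.error lines carry a tab-delimited ERROR level) and the default /aws/lambda/<function-name> log group naming, and it assumes the SNS message uses AWS Chatbot’s custom notification schema. The names `LogsPayload`, `ErrorAlert`, `extractAlerts`, and `toChatbotMessage` are hypothetical.

```typescript
// Decoded CloudWatch Logs subscription payload. In a real handler it arrives as
// event.awslogs.data (base64-encoded, gzip-compressed JSON) and is decoded first.
interface LogsPayload {
  messageType: string;
  logGroup: string;
  logStream: string;
  logEvents: { id: string; timestamp: number; message: string }[];
}

// Hypothetical alert shape: metadata only, no raw log text.
interface ErrorAlert {
  environment: string;
  functionName: string;
  logGroup: string;
  logStream: string;
  timestamp: number;
}

// Assumption: default text log format with a tab-delimited level field.
const isErrorLine = (message: string): boolean => message.includes("\tERROR\t");

function extractAlerts(payload: LogsPayload, environment: string): ErrorAlert[] {
  // Control messages only check that the destination is reachable.
  if (payload.messageType !== "DATA_MESSAGE") return [];
  // Assumption: default log group naming, /aws/lambda/<function-name>.
  const functionName = payload.logGroup.replace(/^\/aws\/lambda\//, "");
  return payload.logEvents
    .filter((event) => isErrorLine(event.message))
    .map((event) => ({
      environment,
      functionName,
      logGroup: payload.logGroup,
      logStream: payload.logStream,
      timestamp: event.timestamp,
    }));
}

// SNS message body, assuming AWS Chatbot's custom notification schema.
function toChatbotMessage(alert: ErrorAlert): string {
  return JSON.stringify({
    version: "1.0",
    source: "custom",
    content: {
      textType: "client-markdown",
      title: `Error in ${alert.functionName} (${alert.environment})`,
      description:
        `Log stream: ${alert.logStream}\n` +
        `Time: ${new Date(alert.timestamp).toISOString()}`,
    },
  });
}
```

In this sketch the environment is passed in by the caller (for instance from a configuration value of the forwarder), and the notification carries only metadata, in line with the privacy approach described earlier.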

From Notification to Action

  1. Finding the Details: A link in the Slack alert takes you right to the relevant log in CloudWatch.
  2. Digging Deeper: In CloudWatch, you can search for the ‘ERROR’ tag to get to the bottom of the issue.
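The link in the alert can be built from the log group and log stream names. The sketch below mirrors the URL format we see in the CloudWatch console, where each path segment is URL-encoded twice and every "%" is then replaced by "$". This format is an observation rather than a documented API, so treat it as an assumption that may change; `encodeSegment` and `logStreamUrl` are hypothetical helper names.

```typescript
// Assumption: the console double-encodes each segment and swaps "%" for "$".
function encodeSegment(value: string): string {
  return encodeURIComponent(encodeURIComponent(value)).split("%").join("$");
}

// Builds a console deep link to one log stream in the given region.
function logStreamUrl(region: string, logGroup: string, logStream: string): string {
  return (
    `https://${region}.console.aws.amazon.com/cloudwatch/home?region=${region}` +
    `#logsV2:log-groups/log-group/${encodeSegment(logGroup)}` +
    `/log-events/${encodeSegment(logStream)}`
  );
}
```

For example, the log group /aws/lambda/checkout becomes $252Faws$252Flambda$252Fcheckout in the link.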

[Bonus] Streamlining Subscription Filters for Efficiency

Managing a vast number of Lambda functions manually can be a daunting task, akin to herding cats. To streamline this process and ensure efficiency, we’ve automated the creation and management of CloudWatch Subscription Filters.

Here’s how we’ve tackled it:

  • Automated Filter Creation: A Lambda function collects all our Lambda functions across all our CloudFormation stacks. If the CloudWatch subscription filter is missing for any of them, it adds the filter automatically.
  • Daily Updates for Accuracy: We run this Lambda function once per day to ensure our alerting system remains up-to-date with the latest system changes.
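The core of such a job is a simple diff between the functions that exist and the log groups that already have a filter. Here is a minimal sketch of that diff as a pure function, assuming the default /aws/lambda/<function-name> log group naming. The names `logGroupFor` and `findUnsubscribedLogGroups` and the `forwarderName` parameter are hypothetical. In a real job, the inputs would come from AWS APIs such as CloudFormation ListStackResources and CloudWatch Logs DescribeSubscriptionFilters, and each result would be passed to PutSubscriptionFilter.

```typescript
// Assumption: default log group naming, /aws/lambda/<function-name>.
const logGroupFor = (functionName: string): string => `/aws/lambda/${functionName}`;

// Returns the log groups that still need a subscription filter.
function findUnsubscribedLogGroups(
  functionNames: string[],
  subscribedLogGroups: Set<string>,
  forwarderName: string
): string[] {
  return functionNames
    // Skip the forwarder itself so its own logs cannot re-trigger it.
    .filter((name) => name !== forwarderName)
    .map((name) => logGroupFor(name))
    .filter((logGroup) => !subscribedLogGroups.has(logGroup));
}
```

Excluding the forwarding function is a deliberate choice in this sketch: subscribing it to its own log group could create a loop in which every processed error produces new log events to process.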

By automating the management of Subscription Filters, we’ve not only saved valuable time but also enhanced the reliability and responsiveness of our monitoring system.

In a nutshell, a smart monitoring and alerting system isn’t just about having the right tools. It’s about being proactive, clear in communication, and smart about data privacy. It’s about ensuring your system doesn’t just run, but runs smoothly, and your team is ready to tackle issues head-on, efficiently, and effectively.

Happy coding!

Ronny Roeller is CTO at nextapp.co, a product discovery platform for high-performing teams that bring their customers into every decision.