Dev IRL: How to ingest Heroku Log Drains with Nodejs on AWS CloudWatch? Part 1: Architecture
Well, this was a question I had never asked myself until a few months ago, when I realized that my favorite log management provider, Logentries, had discontinued their Heroku Add-on (fun fact: without notice 🤡).
So what do you do when you must find a solution and no alternative is a perfect match for your needs? DIY.
Bonus: (almost) all you have to do is use the AWS built-in code editor. No complex deployment pipeline or even a Git repo is needed.
This story is the first of a 6-part series. You can find all the other parts below:
- Part 1: Architecture <<< 📍You are here!
- Part 2: Get the logs
- Part 3: Handle the drains
- Part 4: Sending events with SQS
- Part 5: Ingest the logs
- Part 6: The alert system
- Bonus Part: SpeedRun
TL;DR
You can go to the SpeedRun version of this series for a more straight-to-the-point perspective. You’ll find all the final Lambda functions’ code along with their policies and the general architecture, but you’ll need a good understanding of the AWS platform to wrap it all together.
Keep reading for the implementation details!
A little bit of backstory
For years I used the Logentries Heroku Add-on as a logging and alert service for my apps. Quite simple to use, a Heroku-tailored alert system out of the box, and free of charge (with 7-day log retention and 3 GB of monthly storage). This fit my needs perfectly.
One day, I couldn’t add the add-on anymore. No specific information was provided, but I immediately searched for an alternative (and I was right to do so, since they made an official announcement of their end of service several months later). Sad but true, none of the other add-ons were a drop-in replacement. To name a few:
- Sumo Logic doesn’t include the alert system in their free plan. You have to subscribe to their first paid tier at $108/mo, while the Heroku app itself is billed $7/mo…
- Coralogix doesn’t have a free plan (well, to be fair, they have a 4-day free trial).
- Logtail could have been the perfect match if you didn’t have to set the ~30 Heroku-specific alerts manually for each app, and if said alerts didn’t arrive with hours of delay, which renders them quite useless.
I soon realized that I’d have to implement a logging/alert system myself:
- to keep this rather critical aspect of my apps under my control, at least the basics: a minimal log history and critical alerts in real time
- to lower the costs
The architecture
First and foremost, let’s think about how to organize the workflow and what services to use. I chose the AWS ecosystem due to my knowledge of the platform and the fact that Heroku uses AWS EC2 instances under the hood.
Getting and parsing the logs
The first step is to parse the log drain. By nature this task:
- is intermittent: the logs are generated by web traffic or server events (like dyno cycling). There is no need to deploy a web server that is always running.
- should be quick: this is “only” string parsing in the end
- should be able to scale: the amount of logs depends on the web traffic, and you don’t know when your app will be slashdotted.
These 3 statements define a perfect job for a serverless function, don’t you think? So the first task will be to create an AWS Lambda function triggered by an API endpoint. This function will parse the logs, decide whether or not to store them, and check if an alert should be triggered.
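To give you an idea of what this first function could look like, here is a minimal Node.js sketch, assuming an API Gateway proxy integration that passes the raw drain payload in `event.body` (the regex and field names are illustrative only; the actual parsing is covered in detail later in the series):

```javascript
// heroku-drains (sketch): parse a Heroku HTTPS log drain payload.
// Assumes an API Gateway proxy integration passing the raw logplex body
// in event.body; the regex and field names are illustrative only.
exports.handler = async (event) => {
  const lines = (event.body || '')
    .split('\n')
    .filter((line) => line.trim().length > 0);

  const logs = lines
    .map((line) => {
      // Heroku drains send syslog-style messages with an octet-count prefix,
      // e.g. "83 <40>1 2023-01-01T12:00:00+00:00 host app web.1 - Some message".
      // Splitting on newlines is a simplification that works for typical payloads.
      const match = line.match(/^\d+ <(\d+)>1 (\S+) (\S+) (\S+) (\S+) - (.*)$/);
      if (!match) return null;
      const [, priority, timestamp, host, source, dyno, message] = match;
      return { priority, timestamp, host, source, dyno, message };
    })
    .filter(Boolean);

  // Next steps, covered later in the series: decide which logs to keep,
  // push them to the right SQS queue and invoke the alarm function if needed.
  console.log(`Parsed ${logs.length} log line(s)`);

  return { statusCode: 200, body: 'OK' };
};
```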
Sending the logs to CloudWatch
CloudWatch is the place where you’ll store your logs for monitoring, consulting, creating alerts and further analysis. The service itself is very powerful but the learning curve is rather steep and the interface not so friendly.
To put it simply:
- the log itself is a message (that can be formatted in JSON for further analysis)
- all the messages are stored in a log stream, which you can organize the way you want by source (a database, an EC2 instance, a specific build of an app, …)
- all the log streams are stored in a log group, which you can organize the way you want by service (a database server, a specific set of errors from your server, a Lambda function, etc.)
Given those 3 levels of hierarchy, you have great control over how to organize your logs, and the ability to set a specific log retention period for each log group (and save some bucks).
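For illustration, here is how such a log group and stream with a retention policy could be created with the AWS SDK for JavaScript v3 (bundled with recent Node.js Lambda runtimes); the names below are placeholders:

```javascript
// Sketch: create a log group with a 7-day retention policy and a log stream
// in it. Names are placeholders; assumes the AWS SDK for JavaScript v3.
const {
  CloudWatchLogsClient,
  CreateLogGroupCommand,
  CreateLogStreamCommand,
  PutRetentionPolicyCommand,
} = require('@aws-sdk/client-cloudwatch-logs');

const client = new CloudWatchLogsClient({});

async function setupLogDestination() {
  const logGroupName = '/heroku/my-app'; // one group per service
  const logStreamName = 'production';    // one stream per source

  await client.send(new CreateLogGroupCommand({ logGroupName }));
  await client.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays: 7 }));
  await client.send(new CreateLogStreamCommand({ logGroupName, logStreamName }));
}
```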
The thing is, you won’t be able to simply put the logs as they come, because the AWS PutLogEvents API (needed to add new logs) has a throttling limit of 5 req/s per log stream, not to mention the batch size limit of 1 MB and 256 KB per message. So you’ll need to create a message queue, where each item will be consumed at a specific rate to comply with this limit.
But you won’t do that within the Lambda function itself, since the whole point of a serverless function is to execute as quickly as possible (remember that you are charged for each ms of execution). So waiting for a message to be added to the log stream and then sleeping 0.2 s before sending the next one is out of the question.
Luckily, AWS SQS can do just that for you! It will act as a message broker to add all your logs to CloudWatch and can scale automatically to ingest the volume of logs. But remember the limit: 5 req/s per log stream. So you won’t use a single SQS queue for all your log sources (e.g. all your apps), but rather create a dedicated SQS queue for each log stream (i.e. one for each app).
In the end, all you have to do is send a new message from your Lambda function to the SQS queue and process it later, at a specific rate.
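A minimal sketch of that step, assuming the AWS SDK v3 SQS client and a JSON payload that carries the target log stream name (both are illustrative choices):

```javascript
// Sketch: push a parsed log to the app's dedicated SQS queue.
// The queue URL and message shape are assumptions for illustration.
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({});

async function enqueueLog(queueUrl, logStreamName, log) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      // The consumer will read the target log stream name from the payload.
      MessageBody: JSON.stringify({ logStreamName, log }),
    })
  );
}
```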
Putting the logs in CloudWatch
Adding a log is not very expensive and quite fast to do. Since you don’t know how many logs you’ll have to store (you may have no logs at all for a while), this looks like a job for another Lambda function.
Remember that a Lambda function can have a lot of different kinds of triggers, even multiple triggers of the same kind depending on their nature. It turns out you can add as many SQS triggers as you want to a given Lambda function!
The trick here is to trigger a Lambda function through the right SQS queue, with the name of the log stream contained in the message. This way, the function will put the log message in the desired log stream with a simple PutLogEvents API call. The Lambda invocation rate will be controlled by the Event Source configuration, where you can customize the BatchSize and MaximumBatchingWindowInSeconds properties depending on your traffic. This blog post gives interesting insights on how to do it, as well as this guide.
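Here is a rough sketch of what this consumer function could look like, assuming the message shape from the previous sketch and a placeholder log group name (the real implementation is detailed later in the series):

```javascript
// heroku-drains-storage (sketch): consume SQS records and push them to
// CloudWatch Logs. Assumes the message shape from the previous sketch
// and a placeholder log group name.
const {
  CloudWatchLogsClient,
  PutLogEventsCommand,
} = require('@aws-sdk/client-cloudwatch-logs');

const cloudwatch = new CloudWatchLogsClient({});

exports.handler = async (event) => {
  // One invocation receives up to BatchSize records, buffered for at most
  // MaximumBatchingWindowInSeconds (both set on the event source mapping).
  const logEvents = event.Records
    .map((record) => {
      const { log } = JSON.parse(record.body);
      return {
        timestamp: Date.parse(log.timestamp) || Date.now(),
        message: JSON.stringify(log),
      };
    })
    .sort((a, b) => a.timestamp - b.timestamp); // events must be in chronological order

  // One queue per log stream, so every record in the batch targets the same stream.
  const { logStreamName } = JSON.parse(event.Records[0].body);

  await cloudwatch.send(
    new PutLogEventsCommand({
      logGroupName: '/heroku/my-app', // placeholder
      logStreamName,
      logEvents,
    })
  );
};
```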
Handling the alerts
In a similar way, we can send alerts from a dedicated Lambda function when something is detected during the parsing phase. We’ll use the AWS SNS service to do it, and won’t necessarily need to throttle the SNS publish API calls, depending on your region’s limits and quotas. My region’s 9k messages per second limit should be enough for my needs. A simple Lambda invocation from one function to the other will suffice.
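For illustration, such a fire-and-forget invocation could look like this (the function name and payload shape are placeholders):

```javascript
// Sketch: fire-and-forget invocation of the alarm function from the parsing
// function. The function name and payload shape are placeholders.
const { LambdaClient, InvokeCommand } = require('@aws-sdk/client-lambda');

const lambda = new LambdaClient({});

async function triggerAlert(alert) {
  await lambda.send(
    new InvokeCommand({
      FunctionName: 'heroku-drains-alarms',
      InvocationType: 'Event', // asynchronous: don't wait for the result
      Payload: Buffer.from(JSON.stringify(alert)),
    })
  );
}
```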
However, you certainly don’t want to spam yourself in case of a global outage. A restriction system needs to be implemented, e.g. sending only 10 messages per hour for a given alert. Since the alert sending is handled by a Lambda function, you don’t have a way to “remember” across invocations how many alerts have been sent so far.
On closer look, this is a piece of information that:
- needs to be stored elsewhere, e.g. in a remote database
- must be accessed as fast as possible
- doesn’t need to have a long lifespan (1 day max.)
This is exactly what Redis databases are for! You can choose the provider you want, but I usually go for the RedisCloud free tier that comes with 30 MB of storage and 30 simultaneous connections.
That way, this Lambda function will be invoked each time an alert must be sent, check whether the alert is still allowed to go out, and send it through SNS if needed.
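Putting it together, here is a minimal sketch of the alarm function, assuming a node-redis client, a hypothetical per-alert counter key and an SNS topic ARN provided through environment variables (the real implementation is covered later in the series):

```javascript
// heroku-drains-alarms (sketch): throttle alerts with Redis, then publish
// to SNS. The key scheme, limit, payload shape and environment variables
// are assumptions for illustration.
const { createClient } = require('redis');
const { SNSClient, PublishCommand } = require('@aws-sdk/client-sns');

const sns = new SNSClient({});
const MAX_ALERTS_PER_HOUR = 10;

exports.handler = async (event) => {
  const redis = createClient({ url: process.env.REDIS_URL });
  await redis.connect();

  try {
    // Count how many times this alert has fired during the current hour.
    const key = `alert:${event.alertName}`;
    const count = await redis.incr(key);
    if (count === 1) {
      await redis.expire(key, 3600); // reset the counter after one hour
    }
    if (count > MAX_ALERTS_PER_HOUR) return; // silently drop the alert

    await sns.send(
      new PublishCommand({
        TopicArn: process.env.SNS_TOPIC_ARN,
        Subject: event.alertName,
        Message: event.message,
      })
    );
  } finally {
    await redis.quit();
  }
};
```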
Wrap it up
Here is the global diagram of our system:
Regarding the workflow:
- Heroku sends an HTTPS POST request to our AWS API Gateway endpoint.
- The endpoint triggers the heroku-drains Lambda function, which parses and formats the logs.
- The formatted logs are sent to the heroku-drains-storage function through dedicated SQS queues.
- The heroku-drains-storage function is triggered by the SQS queues, and stores each log in a dedicated AWS CloudWatch Log stream with a PutLogEvents API call.
- If an alert needs to be sent, the heroku-drains function invokes the heroku-drains-alarms function, which sends an SNS notification with a publish API call.
Now let’s start the implementation in the next part: Get the logs 🚀