Developer Analytics with Elasticsearch
Everyone knows analytics are an important part of any product. We use Google Analytics and Mixpanel for studying customer behavior and website statistics. But these analytics only help the marketing and product teams. What we as developers need to know is how each of our APIs is performing, what is causing delays, and when they happen.
Since most of our infrastructure is on AWS, we already had the benefit of CloudWatch logs. But these were not sufficient. We wanted more.
But why?
Our existing logging system would not give us the exact details of what was changed in which request and when. This was a big hurdle while moving to a microservice architecture, as logging across many services was needed for proper debugging.
Setting Up Elasticsearch
Our Elasticsearch servers were set up, and according to the documentation we just had to send a PUT request to Elasticsearch to insert data into it. This works fine for low traffic. But when it comes to handling over 2,500 log events a minute, 2,500 individual PUT requests to Elasticsearch was not a solution anyone would prefer. We had no control over how many requests the ES service could handle, and even the largest available ES instance would not make this scalable.
We definitely needed some kind of buffering before sending the data to the ES service, so that ES would not crash or stop responding during request spikes.
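For context, this is roughly what that per-event PUT looks like. It is only a minimal sketch: the endpoint, index name, and document fields are hypothetical placeholders, and authentication/request signing for a managed ES domain is omitted.

```python
# Minimal sketch of "just send a PUT request" to index one log event.
# Endpoint, index name and fields are placeholders; auth/signing is omitted.
import json
import requests

ES_ENDPOINT = "https://search-example-domain.us-east-1.es.amazonaws.com"  # hypothetical

def index_log_event(event: dict, doc_id: str) -> None:
    """Index a single log event as a document in a hypothetical `api-logs` index."""
    url = f"{ES_ENDPOINT}/api-logs/_doc/{doc_id}"
    resp = requests.put(
        url,
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )
    resp.raise_for_status()

index_log_event({"path": "/v1/orders", "latency_ms": 132, "status": 200}, doc_id="req-123")
```

One request per event is exactly what breaks down at 2,500+ events a minute.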
Kinesis Firehose
After some discussion, and after ruling out Lambda and the possibility of building our own buffering service, we stumbled upon AWS Kinesis Firehose.
Firehose is a fully managed delivery stream which, fortunately for us, can deliver data directly to ES.
The best things about Firehose are:
- A built-in buffering system based on buffer size or time interval.
- It is fully managed.
- It supports record transformation out of the box.
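As a rough sketch of what this looks like from the application side: code only hands each event to the delivery stream and lets Firehose handle buffering and delivery to ES. The stream name and event fields below are made-up placeholders, not our production values.

```python
# Sketch: push one log event to a Firehose delivery stream and let Firehose
# buffer and flush it to Elasticsearch. Stream name and fields are placeholders.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

def push_log_event(event: dict) -> None:
    """Send one log event to the delivery stream; Firehose buffers before delivering."""
    firehose.put_record(
        DeliveryStreamName="api-logs-to-es",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

push_log_event({"path": "/v1/orders", "latency_ms": 132, "status": 200})
```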
If the ending is happy, then it is not the end of development 😆
Firehose solved a lot of our problems, but it had problems of its own.
Each Firehose delivery stream could only support one (index, type) combination.
This meant that if we wanted a separate type for each of our models, we would need to create a new delivery stream every time a new model was added. Totally unscalable.
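To make the limitation concrete, here is a hedged sketch of creating a stream with an ES destination via boto3. Every name and ARN is a placeholder; the point is that the configuration has room for exactly one IndexName/TypeName pair per stream.

```python
# Sketch of why one stream pins one (index, type): the ES destination of a
# Firehose stream takes a single IndexName/TypeName. All names/ARNs are placeholders.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="orders-model-to-es",  # one stream per model...
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-es",        # placeholder
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/example",   # placeholder
        "IndexName": "orders",   # ...because only one index
        "TypeName": "order",     # ...and one type fit here
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        "S3Configuration": {     # backup bucket required by Firehose
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-es",
            "BucketARN": "arn:aws:s3:::example-firehose-backup",
        },
    },
)
```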
Possible Solutions
- Stream data through Firehose to Lambda and have Lambda send requests to ES.
- Use Terraform or CloudFormation to create new Firehose streams on every deployment.
Both had problems of their own. The problem with Lambda was that Lambda could scale, but ES would still not be able to handle thousands of concurrent requests.
Terraform was the better option, but any infrastructure created by Terraform could no longer be changed from the AWS console without breaking away from the Terraform state.
Finally, we decided to add developer guidelines so that logs are streamed to a set of predefined ES indexes.
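As a hypothetical illustration of what such a guideline could look like in code (the category names and stream names are invented), each service logs into a fixed set of categories, each backed by an existing delivery stream and its predefined index:

```python
# Hypothetical sketch of the guideline: services may only log into predefined
# categories, each mapped to an existing Firehose stream / ES index.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Predefined categories -> existing delivery streams (placeholder names).
STREAMS = {
    "api_request": "api-request-logs",
    "db_change": "db-change-logs",
    "cron": "cron-logs",
}

def log_event(category: str, event: dict) -> None:
    """Route a log event to the predefined stream for its category."""
    if category not in STREAMS:
        raise ValueError(f"Unknown log category: {category}")  # enforce the guideline
    firehose.put_record(
        DeliveryStreamName=STREAMS[category],
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

log_event("api_request", {"path": "/v1/orders", "latency_ms": 132})
```

No new streams are needed when a model is added; new data simply lands in one of the existing indexes.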
What changed after this?
1. After implementing this logging, we detected a few users who were hitting some of our APIs at unusually high request rates.
2. We changed our cron schedules to run when server load was at its minimum.
3. Each DB change was mapped to a particular request, which in turn was mapped to a session. We could even tell which of our production servers handled that request (see the sample record after this list).
4. Each request was traced to its IP, which enabled us to blacklist countries during a DDoS attack.
5. We were able to find which APIs were heavily used on each platform and how our internal tools were interacting.
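To give a feel for how those mappings hang together, here is a hypothetical shape of one enriched log record. All field names and values are invented for illustration, not our actual schema.

```python
# Hypothetical shape of a single enriched log record: a DB change is tied back
# to the request, the session, the server that handled it, and the client IP.
sample_record = {
    "event": "db_change",
    "model": "Order",
    "changed_fields": ["status"],
    "request_id": "req-7f3a",    # which request caused the change
    "session_id": "sess-19bd",   # which user session that request belonged to
    "server": "prod-web-03",     # which production server handled it
    "client_ip": "203.0.113.42", # traced IP, useful during a DDoS
    "timestamp": "2018-06-14T09:21:07Z",
}
```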
There is a lot more we are extracting from these logs, and it is helping us deal with various request behaviors efficiently. And truly, if the ending is happy, then it is not the end of development.