K8s Logging Cost on GCP
How did we cut our k8s logging cost on the cloud by 80%? 🤔
Before that, let's take a look at the solutions available out there.
Open source: ELK, Grafana Loki, Fluentd, Graylog, etc.
Paid: Sumo Logic, Loggly, Datadog, CloudWatch, Splunk, etc.
So, we are on the cloud and had a strict budget, which meant we couldn't introduce a new tool just for logging; we had to build the solution within the known space, i.e., a simple solution using only the cloud services we already use. And so we started k8s logging on the cloud.
After 2 months, we got a new problem:
"Buddy, we are spending too much on logging, and the price is fluctuating a lot. What do we do now?"
Post-investigation, we found 2 major causes of the billing:
- Log ingestion rate
- Log storage cost
Fixing the cost issue
This solution can be implemented on any cloud, provided the following capabilities are available:
- Filtering logs based on log level
- Routing logs to a specific location/storage
How did we fix it?
- Route the k8s application logs to a separate cloud log storage bucket and retain the data for 28 days. That means you get separate control over your k8s service logs (see the first sketch after this list).
- Now think: do you really need the INFO-level logs all the time? On an average day, how often do you actually check the application logs?
So we thought: enable a log filter for severity at WARNING/ERROR and above, and save only that, since in practice those are the logs that actually help us troubleshoot the application. By doing this we were able to manage the high ingestion rate on Stackdriver/log storage, and because the ingestion rate is much more expensive than storing logs, it was the main driver of our cost. The volume difference between INFO- and ERROR-level logs is usually huge, and that is why we went after it (the first sketch below shows such a filter).
- Minimize logging at the application level using log levels and custom logging. This reduces the log generation rate so that only the required details get printed. This is for prod especially; on dev you can flood it 😅 (see the second sketch below).
- Put monitoring on crashing/failing/restarting pods, because in many cases such applications throw lots of logs/traces. If you track them via alerting instead, you can save a lot of unwanted log ingestion (see the third sketch below).
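To make the routing and severity filtering concrete, here is a minimal sketch using the google-cloud-logging Python client. The project, sink, and bucket names are placeholders, not our real setup:

```python
# Minimal sketch: route WARNING-and-above k8s container logs to a dedicated
# Cloud Storage bucket. All names below are placeholders.
from google.cloud import logging

client = logging.Client(project="my-gcp-project")

# Cloud Logging query: keep only k8s container logs at WARNING or above.
log_filter = 'resource.type="k8s_container" AND severity>=WARNING'

# Destination bucket; the 28-day retention itself is configured on the
# bucket (e.g., a lifecycle rule that deletes objects older than 28 days).
destination = "storage.googleapis.com/my-k8s-logs-bucket"

sink = client.sink("k8s-warning-logs", filter_=log_filter, destination=destination)
if not sink.exists():
    sink.create()
```

The routing alone isn't what cuts the bill; the savings come when the noisy INFO/DEBUG entries are also excluded from the default ingestion path (on GCP, an exclusion filter on the `_Default` sink), so they are never ingested at all.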
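For the application-level bullet, a sketch using Python's standard logging module; the service name and environment variable are hypothetical:

```python
# Prod emits WARNING and above only; dev stays verbose. Standard library only.
import logging
import os

level = logging.WARNING if os.getenv("APP_ENV") == "prod" else logging.DEBUG
logging.basicConfig(
    level=level,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.info("cart recalculated")         # dropped on prod, visible on dev
logger.error("payment gateway timeout")  # always emitted
```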
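For the pod-crash monitoring, one option (an assumption on our side, not the only way) is a log-based metric that an alerting policy watches, so you alert on event counts instead of ingesting every stack trace. The filter is illustrative; the exact fields depend on how your cluster exports Kubernetes events:

```python
# Sketch: a log-based metric counting crash-loop events for alerting.
# The filter fields are assumptions about the exported event format.
from google.cloud import logging

client = logging.Client(project="my-gcp-project")

crash_filter = 'resource.type="k8s_pod" AND jsonPayload.reason="BackOff"'

metric = client.metric(
    "pod-crash-loops",  # placeholder metric name
    filter_=crash_filter,
    description="Counts pods stuck in CrashLoopBackOff",
)
if not metric.exists():
    metric.create()
```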
Result
At the end of the whole exercise we saw almost 80% cost savings on our logging. And comparing it with a self-managed logging stack like ELK/Grafana Loki, the cost came out close to cloud logging anyway. So why put in more effort when the managed service gets you there?
Issues with this approach
- If you store only ERROR logs, developers will raise a concern: how do we check the INFO logs on production? The solution: use the log router and build some automation to update its config, enabling full logs for your application whenever you need them and disabling them once you are done (see the sketch after this list).
- The cloud log viewer tool might not be that good for watching and parsing logs, and developers can raise a concern about that too.
- You have to manage IAM access for devs so that they can access the cloud logging service.
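Here is a rough sketch of that toggle automation with the google-cloud-logging client; the sink name and filters are the placeholders from the earlier sketch:

```python
# Flip the sink between "everything" and "WARNING-and-above", then flip back.
from google.cloud import logging

client = logging.Client(project="my-gcp-project")

FULL = 'resource.type="k8s_container"'
LEAN = 'resource.type="k8s_container" AND severity>=WARNING'

def set_full_logs(enabled: bool) -> None:
    sink = client.sink("k8s-warning-logs")  # placeholder sink name
    sink.reload()                           # fetch the current config
    sink.filter_ = FULL if enabled else LEAN
    sink.update()                           # push the new filter

set_full_logs(True)   # debugging starts: INFO flows again
# ... investigate the issue ...
set_full_logs(False)  # back to the cheap steady state
```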