Sending Kubernetes logs to Amazon Elasticsearch Service

In this article, I’ll explain how to send Kubernetes logs to Amazon Elasticsearch Service, and why.

What do we have?

We have a production Kubernetes cluster managed by kops in AWS.

We want to collect logs from this cluster, especially from Nginx Ingress, and ship them to Elasticsearch. It sounds pretty easy, so let’s start.

So we have a simple plan for sending logs to Elasticsearch.

On every k8s node, a Fluentd pod collects logs and pushes them to the Elasticsearch domain.

Which Elasticsearch should I use?

So we need to set up an Elasticsearch cluster, and there are several ways to do that:

  • a self-hosted cluster on EC2 instances (cheaper than a SaaS offering, but we have to maintain it ourselves)
  • Amazon Elasticsearch Service (more expensive than a self-hosted cluster, no maintenance, Open Distro support out of the box, harder to scale)
  • another SaaS offering (even more expensive than AWS, but there are many tools for analyzing and visualizing)
  • an EFK stack (Elasticsearch, Fluentd, Kibana) in the same k8s cluster (extremely cheap, but riskier to keep everything in one place)

We also need to take into account that we can buy reserved instances for the Elasticsearch domain, or even use spot instances for a self-hosted cluster or the k8s nodes. So it depends on your workload and budget.

Pricing

AWS Elasticsearch Service:

  • $120 for an m4.large node with 2 CPUs and 8GB RAM
  • $76 for a 512GB EBS volume

Self-hosted EFK stack:

  • $77 for an m5.large node with 2 CPUs and 8GB RAM
  • $56 for a 512GB EBS volume

It would be even cheaper on k8s because we can use underutilized resources in the cluster. I didn’t research other SaaS offerings, but I’ve heard they are more expensive than AWS.
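
To put the two options side by side, here is a quick back-of-the-envelope sum of the figures above:

# Amazon Elasticsearch Service: m4.large node + 512GB EBS
echo $((120 + 76))   # 196
# Self-hosted EFK: m5.large node + 512GB EBS
echo $((77 + 56))    # 133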

In this article we use Amazon Elasticsearch Service, and this is the method:

How it works

Setting up an Elasticsearch domain on AWS

Our cluster contains 1 node because we don’t keep any critical data there.

How to set up Elasticsearch domain manually: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-createupdatedomains.html

Currently, the latest supported ES version is 6.5.

I also want to mention that the maximum disk space depends on the instance type. For example, an r5.large node allows up to 1024GB of disk space, while an m4.large allows only 512GB.
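
If you want to check those limits yourself, the AWS CLI can list them per instance type (the type and version below are just the ones discussed above):

aws es describe-elasticsearch-instance-type-limits \
  --instance-type m4.large.elasticsearch \
  --elasticsearch-version 6.5 \
  --region eu-west-1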

You can choose either Internet or VPC access to your domain, depending on your network infrastructure. A VPC access policy is the better option, but here I’ll show how to work with an Internet-accessible domain.

Here is an example Terraform config for creating a new ES domain:

provider "aws" {
region = "eu-west-1"
}
resource "aws_elasticsearch_domain" "elastic_logs_domain" {
domain_name = "elastic-logs-domain"
elasticsearch_version = 6.5
  cluster_config {
instance_type = "m4.large.elasticsearch"
instance_count = 1
dedicated_master_enabled = false
}
  ebs_options {
ebs_enabled = true
volume_type = "gp2"
volume_size = "512"
}
# This is access policy for lambda with ES curator
access_policies = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::1234567890:role/es-cleaner-lambda",
"arn:aws:iam::1234567890:root"
]
},
"Action": "es:*",
"Resource": "arn:aws:es:eu-west-1:1234567890:domain/elastic-logs-domain/*"
}
]
}
POLICY
}
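
To create the domain, run the usual Terraform workflow:

terraform init
terraform plan
terraform apply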

After the new cluster starts, you can check its status. It will be yellow if you have only 1 node in the cluster, and that’s OK — replica shards simply have no second node to be allocated to.
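
Creating a domain takes a while. One way to watch its progress (and to grab the endpoint we’ll need later) is the AWS CLI; the --query below just filters the fields I care about:

aws es describe-elasticsearch-domain \
  --domain-name elastic-logs-domain \
  --region eu-west-1 \
  --query 'DomainStatus.{Endpoint:Endpoint,Processing:Processing}'

The yellow/green health itself is visible in the AWS console (or via the ClusterStatus.yellow CloudWatch metric).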

In this config, I added an access policy for the es-cleaner-lambda Lambda function, which we use for index retention. If you want to use AWS Lambda with Curator, I recommend reading this article: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/curator.html

ES-proxy

In our case, we use Internet access to the AWS Elasticsearch Service domain instead of VPC access. Also, we don’t want to whitelist static IP addresses in the domain policy.

We can solve this with an Elasticsearch proxy that runs in the same k8s cluster, receives logs from Fluentd, and forwards them to our domain signed with AWS credentials.

Let’s create a deployment for it:

cat << 'EOF' > es-proxy-deployment.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: es-proxy
  namespace: logging
  labels:
    app: es-proxy
spec:
  selector:
    matchLabels:
      app: es-proxy
  strategy:
    type: Recreate
  replicas: 1
  template:
    metadata:
      labels:
        app: es-proxy
    spec:
      volumes:
        - name: varlog
          emptyDir: {}
      containers:
        - image: gcr.io/learning-containers-187204/vke-es-proxy
          name: es-proxy
          env:
            - name: AWS_ACCESS_KEY_ID
              value: "${AWS_ACCESS_KEY_ID}"
            - name: AWS_SECRET_ACCESS_KEY
              value: "${AWS_SECRET_ACCESS_KEY}"
            - name: ES_ENDPOINT
              value: "${ES_ENDPOINT}"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
EOF

Create an IAM user with Elasticsearch access and get its AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
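
Here is a minimal sketch of doing this with the AWS CLI — the user name es-logs-writer and the policy name are just placeholders, and the inline policy simply mirrors the domain ARN from the Terraform config above:

aws iam create-user --user-name es-logs-writer
aws iam put-user-policy --user-name es-logs-writer --policy-name es-logs-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "es:*",
      "Resource": "arn:aws:es:eu-west-1:1234567890:domain/elastic-logs-domain/*"
    }]
  }'
aws iam create-access-key --user-name es-logs-writer

Then export the credentials and deploy the configuration: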

export AWS_ACCESS_KEY_ID=123456
export AWS_SECRET_ACCESS_KEY=123qwe456tyu
export ES_ENDPOINT=https://search-elastic-logs-domain-123asdqwe123.eu-west-1.es.amazonaws.com
cat es-proxy-deployment.yaml | envsubst | kubectl create -f -
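
It takes a few seconds for the pod to come up; you can wait for the rollout to finish:

kubectl rollout status deployment/es-proxy -n logging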

Also, we need to create a service for it:

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: es-proxy
  namespace: logging
  labels:
    app: es-proxy
spec:
  ports:
    - port: 9200
      name: http-es-proxy
  selector:
    app: es-proxy
  clusterIP: None
EOF

Checking:

kubectl get deploy,svc -n logging
NAME                             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/es-proxy   1         1         1            1           24d

NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/es-proxy   ClusterIP   None         <none>        9200/TCP   24d

So, our proxy service is ready, let’s go ahead.
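
Before wiring up Fluentd, you can make sure the proxy really reaches the domain by running a one-off curl pod in the same namespace (assuming the proxy listens on 9200, as the service expects):

kubectl run es-proxy-test -n logging --rm -it --restart=Never \
  --image=curlimages/curl -- curl -s http://es-proxy:9200

A healthy setup returns the usual Elasticsearch JSON banner with the cluster name and version.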

Fluentd

As you remember, we need to run Fluentd on every node in the cluster to collect k8s logs.

It’s easy to deploy with a Helm chart; let’s generate a values file for it:

cat << EOF > fluentd-values.yaml
elasticsearch:
  host: es-proxy # This is our ES proxy
  port: 9200
  buffer_chunk_limit: 2M
  buffer_queue_limit: 8
  logstash_prefix: 'kubernetes' # Index prefix for general k8s logs
configMaps:
  ingress.conf: |-
    # Parser for Nginx ingress logs
    <filter kubernetes.var.log.containers.nginx-ingress-**.log>
      @type parser
      format /(?<remote_addr>[^ ]*) - \[(?<proxy_protocol_addr>[^ ]*)\] - (?<remote_user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<request>[^\"]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*) "(?<referer>[^\"]*)" "(?<agent>[^\"]*)" (?<request_length>[^ ]*) (?<request_time>[^ ]*) \[(?<proxy_upstream_name>[^ ]*)\] (?<upstream_addr>[^ ]*) (?<upstream_response_length>[^ ]*) (?<upstream_response_time>[^ ]*) (?<upstream_status>[^ ]*)/
      time_format %d/%b/%Y:%H:%M:%S %z
      key_name log
      types server_port:integer,code:integer,size:integer,request_length:integer,request_time:float,upstream_response_length:integer,upstream_response_time:float,upstream_status:integer
      reserve_data yes
    </filter>
    <match kubernetes.var.log.containers.nginx-ingress-**.log>
      @id elasticsearch-ingress
      @type elasticsearch
      @log_level info
      include_tag_key true
      type_name _doc
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      logstash_format true
      logstash_prefix "ingress" # Index prefix for ingress logs
      reconnect_on_error true
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes-ingress.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action block
      </buffer>
    </match>
EOF

With this configuration, we deploy a DaemonSet with Fluentd, which collects k8s logs and sends them to the es-proxy service under the kubernetes index prefix.

We also added a separate configuration for Nginx Ingress logs, which parses them and sends them to the proxy under the ingress index prefix.

Installing the chart:

helm install --name fluentd --namespace logging stable/fluentd-elasticsearch -f fluentd-values.yaml
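
Before heading to Kibana, it’s worth checking that the chart actually scheduled a Fluentd pod on every node:

kubectl get ds,po -n logging
kubectl get nodes   # the number of fluentd pods should match the number of nodes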

So we deployed Fluentd; let’s check what we have in Kibana.

Follow the link to your Kibana in the AWS Console and open the Management → Index Patterns section. Create the index patterns kubernetes-* and ingress-*.

Kibana interface

You can also get some basic dashboards for Kibana here.

Kibana dashboard for k8s logs

A few calculations

Here is the CPU and RAM utilization of our logging pods:

kubectl top po
NAME                                  CPU(cores)   MEMORY(bytes)
es-proxy-5b94658c8-kdmbg              3m           12Mi
fluentd-fluentd-elasticsearch-56v9f   9m           99Mi
fluentd-fluentd-elasticsearch-7hc7p   8m           187Mi
fluentd-fluentd-elasticsearch-t9tm7   10m          156Mi

We write logs to our domain not only from the Kubernetes cluster but also from other sources; in total it’s about 7 million records, or 5.2GB of logs, per 24 hours.

CPU utilization (for 2 CPU/8GB RAM node):

CPU utilization

How do you size the cluster? It depends, but here is a good article that can help you: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
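
As a rough back-of-the-envelope example for our volume (the 14-day retention and single replica here are assumptions for illustration, not something prescribed above):

# ~5.2GB of logs per day, 14 days of retention, 1 replica
awk 'BEGIN { printf "%.0f GB\n", 5.2 * 14 * (1 + 1) }'   # ~146 GB, well within the 512GB volume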

See ya!