PART III: Comprehensive guide to setting up the three pillars of observability in a Kubernetes cluster within an AWS private network

Bibin Kuruvilla
7 min read · Mar 10, 2023


Set up traces using Grafana and Tempo in the observability cluster and Grafana Agent in the tenant cluster

The document is published in 3 parts: PART I for metrics, PART II for logs and PART III for traces.

Part III: Setting up traces

Tools explained
Grafana: To analyse and visualize traces (along with the metrics and logs from the earlier parts). Grafana was already installed in PART I.
Tempo: A highly scalable distributed tracing backend with object storage support.
Grafana Agent: Collects and ships traces from the application cluster to Tempo in the observability cluster.
HotROD application: A sample application used to generate traces for testing purposes.

Set up Tempo in the observability cluster using Helm
Create a tempo-values.yaml file (make sure multitenancy is enabled, and always pass the AWS key/ID via secret environment values in production; it is added directly here only to keep the tutorial simple). The Tempo ConfigMap rendered by the chart should end up looking like this:

apiVersion: v1
data:
  overrides.yaml: |
    overrides:
      {}
  tempo.yaml: |
    multitenancy_enabled: true
    usage_report:
      reporting_enabled: true
    compactor:
      compaction:
        block_retention: 24h
    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
              endpoint: 0.0.0.0:14250
            thrift_binary:
              endpoint: 0.0.0.0:6832
            thrift_compact:
              endpoint: 0.0.0.0:6831
            thrift_http:
              endpoint: 0.0.0.0:14268
        opencensus: null
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
    ingester: {}
    server:
      http_listen_port: 3100
    storage:
      trace:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          bucket: bibin-tempo-data
          forcepathstyle: true
          # set to true only if the endpoint is plain http (no TLS)
          insecure: true
          access_key: "AXXXXXXXXXXL"
          secret_key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxH"
        wal:
          path: /var/lib/enterprise-traces/wal
    querier: {}
    query_frontend: {}
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: tempo
    meta.helm.sh/release-namespace: tempo
  creationTimestamp: "2023-03-06T03:58:29Z"
  labels:
    app.kubernetes.io/instance: tempo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: tempo
    app.kubernetes.io/version: 2.0.1
    helm.sh/chart: tempo-1.0.1
  name: tempo
  namespace: tempo
  resourceVersion: "157004"
  uid: 3612c55c-cb0e-4528-9ce4-bfc49829dd8d
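
For reference, the tempo-values.yaml that produces a configuration like the ConfigMap above can stay quite small. Below is a rough sketch for the grafana/tempo chart; the key names (multitenancyEnabled, retention, storage, receivers) are assumptions based on the chart's default values file, so verify them against your chart version, and prefer injecting the AWS credentials via secret environment variables instead of plain values.

tempo:
  multitenancyEnabled: true
  reportingEnabled: true
  retention: 24h
  storage:
    trace:
      backend: s3
      s3:
        endpoint: s3.us-east-1.amazonaws.com
        bucket: bibin-tempo-data
        forcepathstyle: true
        insecure: true
        access_key: <aws-access-key-id>      # placeholder, use a secret in production
        secret_key: <aws-secret-access-key>  # placeholder, use a secret in production
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318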

Install Grafana Tempo using Helm

kubectl create ns tempo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install tempo grafana/tempo -n tempo --values tempo-values.yaml

If all is fine, you will see output like the below

NAME: tempo
LAST DEPLOYED: Mon Mar 6 03:58:29 2023
NAMESPACE: tempo
STATUS: deployed
REVISION: 1
TEST SUITE: None

Check if tempo pods are running fine
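
For example (the exact pod name will differ in your cluster):

kubectl get pods -n tempo
# expect the tempo pod (e.g. tempo-0) to be in Running state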

Expose the Tempo service on port 4317 using the command below (the Grafana Agent will ship traces to this port)

kubectl expose service tempo --type=LoadBalancer --target-port=4317 --name=tempo-nlb -n tempo --dry-run=client -o yaml > tempo-nlb.yaml

Edit tempo-nlb.yaml and add annotations for NLB so that an internal network load balancer is created

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app.kubernetes.io/instance: tempo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: tempo
    app.kubernetes.io/version: 2.0.1
    helm.sh/chart: tempo-1.0.1
  name: tempo-nlb
  namespace: tempo
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  ports:
  - name: pushport
    port: 4317
    protocol: TCP
    targetPort: 4317
  selector:
    app.kubernetes.io/instance: tempo
    app.kubernetes.io/name: tempo
  type: LoadBalancer
status:
  loadBalancer: {}

Apply it using the below command

k apply -f tempo-nlb.yaml

Find the ELB address

[observe@bastion ~]$ k get svc -n tempo | awk '{print $4}' | grep elb
Output: a0f2479duyrtcfsghkjxxxxxac-a8e26a6b24db8ffc.elb.us-east-1.amazonaws.com

Check the above NLB using curl from within the VPC (as it is an internal NLB)

[observe@portx-observability-bastion ~]$ curl a0f2479duyrtcfsghkjxxxxxac-a8e26a6b24db8ffc.elb.us-east-1.amazonaws.com:4317
Output: Received HTTP/0.9 when not allowed

If you receive the above message, the NLB is fine: port 4317 speaks gRPC rather than plain HTTP, so curl complaining about an HTTP/0.9 response simply confirms that something is listening behind the load balancer.

Next, create a VPC endpoint service in the observability cluster's VPC, backed by the Tempo NLB created above

Then set up an interface endpoint in the application cluster's VPC using the endpoint service name (vpce-svc-…) from the previous step, as sketched below
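
If you prefer the AWS CLI to the console for these two PrivateLink steps, the commands look roughly like the sketch below; all ARNs, IDs, subnets and security groups are placeholders you must replace with your own values.

# In the observability VPC: create an endpoint service backed by the Tempo NLB
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns <tempo-nlb-arn> \
  --no-acceptance-required

# In the application cluster VPC: create an interface endpoint to that service
aws ec2 create-vpc-endpoint \
  --vpc-id <app-cluster-vpc-id> \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxxxxxxxxxx \
  --subnet-ids <app-private-subnet-id> \
  --security-group-ids <sg-allowing-tcp-4317>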

Check that the VPC endpoint works using curl from within the application cluster's VPC

curl vpce-099ad1fxxxxxx75-vpce-svc-0b32c00d6e1cb4511.us-east-1.vpce.amazonaws.com:4317
Output: Received HTTP/0.9 when not allowed

You get the same message, which means the endpoint is fine

Set up Grafana Agent in the application cluster.

kubectl create ns grafana-agent
MANIFEST_URL=https://raw.githubusercontent.com/grafana/agent/v0.27.0/production/kubernetes/agent-traces.yaml NAMESPACE=grafana-agent /bin/sh -c "$(curl -fsSL https://raw.githubusercontent.com/grafana/agent/v0.27.0/production/kubernetes/install-bare.sh)" | kubectl apply -f -

Create a file agent-traces-config.yaml (or simply edit the existing ConfigMap) with the values below.

Make sure to use the VPC endpoint DNS name as the remote_write endpoint address (the agent ships traces to this address)

kind: ConfigMap
metadata:
  name: grafana-agent-traces
apiVersion: v1
data:
  agent.yaml: |
    traces:
      configs:
      - batch:
          send_batch_size: 1000
          timeout: 5s
        name: default
        receivers:
          jaeger:
            protocols:
              grpc: null
              thrift_binary: null
              thrift_compact: null
              thrift_http: null
            remote_sampling:
              strategy_file: /etc/agent/strategies.json
              tls:
                insecure: true
          opencensus: null
          otlp:
            protocols:
              grpc: null
              http: null
          zipkin: null
        remote_write:
        - basic_auth:
            password: awsome
            username: bibink
          endpoint: vpce-099ad1fxxxxxx75-vpce-svc-0b32c00d6e1cb4511.us-east-1.vpce.amazonaws.com:4317
          insecure: true
          headers:
            X-Scope-OrgID: bibinNet
          retry_on_failure:
            enabled: false
        scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: kubernetes-pods
          kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - action: replace
            source_labels:
            - __meta_kubernetes_namespace
            target_label: namespace
          - action: replace
            source_labels:
            - __meta_kubernetes_pod_name
            target_label: pod
          - action: replace
            source_labels:
            - __meta_kubernetes_pod_container_name
            target_label: container
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
  strategies.json: '{"default_strategy": {"param": 0.001, "type": "probabilistic"}}'

Apply the above file

kubectl -n grafana-agent apply -f agent-traces-config.yaml
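
The agent does not reload its ConfigMap automatically, so restart the agent pods after applying the new configuration. The resource type and name below are assumptions about what the bundled agent-traces.yaml manifest creates; check with kubectl -n grafana-agent get deploy,ds and adjust accordingly.

kubectl -n grafana-agent rollout restart deployment grafana-agent-traces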

Verify Grafana-agent pods are running

k get pods -n grafana-agent

Add the Tempo datasource in Grafana
Note: Use the same X-Scope-OrgID and basic-auth values as in agent-traces-config.yaml
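
If you provision Grafana datasources from files rather than the UI, a sketch of the equivalent Tempo datasource is below. The URL assumes Tempo's HTTP port 3100 on the tempo service in the tempo namespace of the observability cluster; adjust it to match your setup.

apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    # assumed in-cluster address of the Tempo query endpoint
    url: http://tempo.tempo.svc.cluster.local:3100
    basicAuth: true
    basicAuthUser: bibink
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      basicAuthPassword: awsome
      httpHeaderValue1: bibinNet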

Now go to Grafana Explore and run a query. You will see traces if you already have an application generating them.

In my case there was no application generating traces in the tenant/application cluster, so I deployed the HotROD application to generate traces for testing.

How to deploy the HotROD application

As the first step, install Kompose on your bastion host. Kompose will be used to convert the docker-compose file to Kubernetes manifest files.
For installation see https://kompose.io/installation/

Then clone the Jaeger repo (which contains HotROD) and convert it using Kompose

git clone https://github.com/jaegertracing/jaeger.git
cd jaeger/examples/hotrod
kompose convert

Output will be as below

INFO Kubernetes file "hotrod-service.yaml" created
INFO Kubernetes file "jaeger-service.yaml" created
INFO Kubernetes file "hotrod-deployment.yaml" created
INFO Kubernetes file "hotrod-jaeger-example-networkpolicy.yaml" created
INFO Kubernetes file "jaeger-deployment.yaml" created

We are interested only in the "hotrod-service.yaml" and "hotrod-deployment.yaml" files.

Edit "hotrod-deployment.yaml" and add the environment variables. They should point to the Grafana Agent service we set up earlier

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: kompose convert
    kompose.version: 1.28.0 (c4137012e)
  creationTimestamp: null
  labels:
    io.kompose.service: hotrod
  name: hotrod
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: hotrod
  strategy: {}
  template:
    metadata:
      annotations:
        kompose.cmd: kompose convert
        kompose.version: 1.28.0 (c4137012e)
      creationTimestamp: null
      labels:
        io.kompose.network/hotrod-jaeger-example: "true"
        io.kompose.service: hotrod
    spec:
      containers:
      - args:
        - all
        env:
        - name: JAEGER_AGENT_HOST
          value: grafana-agent-traces.grafana-agent.svc
        - name: JAEGER_AGENT_PORT
          value: "6831"
        - name: JAEGER_SAMPLER_TYPE
          value: const
        - name: JAEGER_SAMPLER_PARAM
          value: "1"
        - name: JAEGER_TAGS
          value: app=hotrod
        - name: OTEL_EXPORTER_JAEGER_ENDPOINT
          value: http://grafana-agent-traces.grafana-agent.svc:14268/api/traces
        image: jaegertracing/example-hotrod:latest
        name: hotrod
        ports:
        - containerPort: 8080
        resources: {}
      restartPolicy: Always
status: {}

Edit "hotrod-service.yaml" and modify it to use LoadBalancer

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    io.kompose.service: hotrod
  name: hotrod-lb
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    io.kompose.service: hotrod
  type: LoadBalancer
status:
  loadBalancer: {}

Apply the manifests

kubectl apply -f hotrod-deployment.yaml
kubectl apply -f hotrod-service.yaml

Find the ELB URL

[user@bastion ~]$ kubectl get svc hotrod-lb | awk '{print $4}' | grep elb
a7f2fe4b4exxxxxxxxx0c71334-105225129.us-east-1.elb.amazonaws.com

Access the above ELB URL in a browser on port 8080 and click around to generate traces
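
If a browser is not handy, you can also drive some traffic from the bastion host with curl. The /dispatch endpoint and customer ID below come from the HotROD demo and the hostname is a placeholder; adjust them if your version differs.

# fire a handful of ride requests to generate traces
for i in $(seq 1 20); do
  curl -s "http://<hotrod-elb-dns>:8080/dispatch?customer=123" > /dev/null
done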

Then tail the hotrod pod logs and check that traces are being generated

kubectl logs hotrod-655768f69b-lr77p | grep -i trace | tail -5

Now log in to Grafana and check whether traces are coming in

You can see traces, which shows the setup is working fine.

Optional steps

Now, to make things easier, add a derived field to the Loki datasource so that you can jump directly from a log line to the corresponding trace in Tempo (no need to copy and paste the trace ID into the Tempo query field)

Use the below values (a provisioning-file equivalent is sketched after this list)
Name: TraceID
Query: ${__value.raw}
Regex: ((\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+))
Internal Link: Toggle ON and select Tempo DS from dropdown
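
For completeness, the same derived field can be declared in a Loki datasource provisioning file, as sketched below; the Loki URL and the Tempo datasource UID are placeholders for whatever you configured in PART II and above.

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: <your-loki-url-from-part-ii>
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "<trace-id regex from the list above>"
          url: "${__value.raw}"
          datasourceUid: <tempo-datasource-uid>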

Now go to Grafana Explore, select Loki as the datasource and execute a query.

You can now see logs with highlighted trace IDs that link directly to Tempo

And that's it. You now have an observability cluster with metrics, logs and traces, all of which can be visualized and analysed using Grafana.

Thanks for reading!


Bibin Kuruvilla

DevOps Engineer - AWS | Kubernetes | Terraform | Ansible