Probing Endpoints with Blackbox-Exporter: How? Why?

Yasintahaerol
Published in Trendyol Tech
8 min read · Nov 16, 2021

We developed a project using blackbox-exporter as the pe-container team and decided to write about our journey. Don’t hesitate to ask any questions 🙏 Special thanks to Emin Aktaş and Necatican Yıldırım for their help 🙏

Tracking the status of endpoints is yet another burden for people working in a big company like Trendyol. That makes it all the more important to probe endpoints in order to prevent incidents before they happen. This is where the Blackbox exporter comes in.

🌚 Blackbox exporter probes internal or external endpoints over HTTP/S, TCP, ICMP, and DNS, and generates metrics from the results, such as response times.

  • It gathers information about the SSL certificate, so you can create alerts for expired or invalid certificates.
  • It observes a variety of endpoints, and the metrics it exposes (DNS lookup time, HTTP latency, etc.) can be used to fire alerts when something goes amiss.

Blackbox exporter can be deployed in different ways. One of them is on Kubernetes. Today we will focus on deploying it on Kubernetes and using a Helm chart to configure it.

Before starting, it’s good to have some familiarity with prometheus-operator. Here’s a brief introduction to the concepts we will be using, just in case.

Prometheus Operator continuously watches the Kubernetes API server for configuration changes and compares the actual state with the desired state. Then it takes the necessary actions to reach the desired state. The Operator defines several custom resource definitions (CRDs). One of them is ServiceMonitor.

What is a ServiceMonitor?

From the documentation:

ServiceMonitor, which declaratively specifies how groups of Kubernetes services should be monitored. The Operator automatically generates Prometheus scrape configuration based on the current state of the objects in the API server.

  • Using ServiceMonitors enables us to configure Prometheus’ scrape targets dynamically.
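
To make this concrete, here is a minimal sketch of a ServiceMonitor. The resource name, namespace, label, and port name are illustrative assumptions rather than values from our setup.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # hypothetical name
  namespace: monitoring          # assumed namespace
spec:
  selector:
    matchLabels:
      app: example-app           # selects Services carrying this label
  endpoints:
    - port: metrics              # name of the Service port to scrape
      path: /metrics
      interval: 30s
```

Any Service matching the selector becomes a scrape target without touching the Prometheus configuration by hand.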

What is a Probe?

From the documentation:

Probe defines monitoring for a set of static targets or ingresses.

  • A Probe is a declarative way of defining how a set of ingresses or static targets is monitored. In terms of what it does, a Probe resembles a ServiceMonitor. When a Probe is created in the cluster, Prometheus automatically picks up the corresponding scrape configuration and starts probing the targets.

Why are these CRDs important?

  • Any Probe or ServiceMonitor can be set up independently, so you don’t have to make manual changes to the Prometheus configuration. These CRDs take care of the integration.
  • Teams can create their own resources without affecting each other.
  • Deployment and troubleshooting are easy.

How can we use blackbox-exporter?

Now it is time for some hands-on work. This article shows three different methods, with examples of how we configure each of them. To make things clearer, I will separate services into external and internal. Let me start with external services.

External Services

There are two ways to probe external services. The first is creating a ServiceMonitor; the second is creating a Probe. For our use case, Probe resources are the better fit: each team can scrape metrics about its external services by creating a Probe.

  1. ServiceMonitor

If you deploy blackbox-exporter using Helm, it is easy to configure a serviceMonitor. The chart has a section that lets us activate serviceMonitor. When this property is enabled, the necessary configuration is created automatically for you. All URLs to be probed are specified in the targets section.
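
As a rough illustration, the relevant part of a values file might look like the sketch below. It assumes the prometheus-community prometheus-blackbox-exporter chart; the target name and URL are placeholders, and the key names should be checked against the chart version you install.

```yaml
# values.yaml (sketch; assumes the prometheus-community/prometheus-blackbox-exporter chart)
serviceMonitor:
  # Create ServiceMonitor resources for the targets listed below
  enabled: true
  defaults:
    interval: 30s
    scrapeTimeout: 30s
    module: http_2xx
  targets:
    # Hypothetical target; name and url are placeholders
    - name: example-homepage
      url: https://www.example.com
      # Optional per-target overrides
      interval: 60s
      module: http_2xx
```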

  2. Probe

Alternatively, a Probe resource can be deployed in the cluster for external targets. At the end of the day, both approaches produce similar results. In the prober section, enter the FQDN of the relevant blackbox-exporter service.
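
A minimal Probe could look like the sketch below. The resource name, namespace, job name, and the blackbox-exporter service FQDN are assumptions for illustration; replace them with the values from your own cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: example-external-probe   # hypothetical name
  namespace: monitoring          # assumed namespace
spec:
  jobName: external-services     # hypothetical job name
  interval: 30s
  module: http_2xx
  prober:
    # FQDN and port of the blackbox-exporter Service (assumed service name; 9115 is the exporter's default port)
    url: blackbox-exporter.monitoring.svc.cluster.local:9115
  targets:
    staticConfig:
      static:
        - https://www.trendyol.com
```

Depending on how the Prometheus resource is configured, the Probe may also need labels that match its probeSelector.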

Internal Services

prometheus-operator offers a feature for this. We will create a job in prometheus.yml. With the kubernetes_sd_configs feature (using the service role), development teams can define an annotation on their services to have metrics collected by blackbox-exporter. As in the service example below, if a service has the annotation “prometheus.io/probe: true”, Prometheus will automatically start sending probe requests for it to blackbox-exporter. Also, with the power of Prometheus’s relabeling mechanism, it is possible to probe a variety of other sources such as the Consul catalog, endpoints, etc. Moreover, various module definitions can be added to the prometheus-operator configuration. Common modules are HTTP/S, TCP, ICMP, etc.
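
The job described above could look roughly like the sketch below, modeled on the example Kubernetes configuration that ships with Prometheus. The job name and the blackbox-exporter address are assumptions, not our exact configuration.

```yaml
scrape_configs:
  - job_name: kubernetes-services-probe   # hypothetical job name
    metrics_path: /probe
    params:
      module: [http_2xx]
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      # Keep only Services annotated with prometheus.io/probe: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: "true"
      # Pass the discovered Service address to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Send the scrape itself to blackbox-exporter (assumed address)
      - target_label: __address__
        replacement: blackbox-exporter.monitoring.svc.cluster.local:9115
      # Keep the probed address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
```

The relabeling is what redirects the scrape: Prometheus discovers the Service, moves its address into the target parameter, and then calls blackbox-exporter instead of the Service itself.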

A simple service example with the prometheus.io/probe: true annotation:

$ kubectl describe svc nginx
Name: nginx
Namespace: monitoring
Labels: app=nginx
Annotations: prometheus.io/probe: true
Selector: app=nginx
Type: NodePort
IP Families: <none>
IP: <ip>
IPs: <ip>
Port: http 80/TCP
TargetPort: 80/TCP
NodePort: http 30301/TCP
Endpoints: <ip>
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>

🔔 How to create alert rules?

Numerous alert rules can be configured for blackbox-exporter. Alerts can be created for different issues such as SSL certificate expiration, probe slowdown, or an unreachable service. These alerts can be broadcast to different channels via webhook.

https://awesome-prometheus-alerts.grep.to/rules.html#blackbox-1
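
As a sketch of what such rules can look like under prometheus-operator, here is a PrometheusRule covering the three cases mentioned above, in the spirit of the collection linked above. The resource name, namespace, thresholds, durations, and severities are illustrative assumptions; the metric names are the ones blackbox-exporter exposes.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-exporter-rules   # hypothetical name
  namespace: monitoring           # assumed namespace
spec:
  groups:
    - name: blackbox
      rules:
        # Fires when a probe has been failing for 1 minute
        - alert: BlackboxProbeFailed
          expr: probe_success == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Probe failed for {{ $labels.instance }}"
        # Fires when a certificate expires in fewer than 30 days
        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "SSL certificate for {{ $labels.instance }} expires in less than 30 days"
        # Fires when probes take longer than 1 second on average
        - alert: BlackboxSlowProbe
          expr: avg_over_time(probe_duration_seconds[1m]) > 1
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Probe for {{ $labels.instance }} is slow"
```

Depending on the Prometheus resource's ruleSelector, the PrometheusRule may need matching labels before it is picked up.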

🌅 How to visualize data?

Blackbox metrics can be turned into a human-readable form using detailed Grafana dashboards. Many dashboard templates for blackbox exporter are available, depending on your needs.

BONUS

Let us get our hands dirty with the Blackbox Exporter to understand how it works. So far we have provided the necessary parameters via Prometheus, which does the dirty job for us. Time to do it ourselves.

Blackbox Exporter gets its abilities through modules. Here are a couple of examples: it can make basic HTTP requests such as GET or POST and expect a 2xx status code within the timeout period, or it can match a regex against the response body or headers. If you want more details about the probes and options, check the documentation.

We can now probe any target with the http_2xx module, which is defined in the configuration file along with the other module configurations.
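
For reference, a module section of the exporter's configuration file might look like the sketch below. The http_2xx module mirrors the one used in this article; the second module, including its name, body, and regex, is a hypothetical example of a POST request with a body check.

```yaml
# blackbox.yml (sketch)
modules:
  # Plain GET that expects a 2xx response within 5 seconds
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []   # empty list means the default: any 2xx
      follow_redirects: true
  # Hypothetical module: POST with a body; fails unless the response body matches the regex
  http_post_body_check:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
      fail_if_body_not_matches_regexp:
        - "ok"
```

The module name is what gets passed in the module parameter of a probe request.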

Simply calling the URL http://localhost:9115/probe?target=www.trendyol.com&module=http_2xx returns Prometheus metrics.

probe_success is the first metric we should check. A value of 1 means that the probe succeeded.

We can also debug a probe by adding debug=true to the end of the URL, like this: http://localhost:9115/probe?target=www.trendyol.com&module=http_2xx&debug=true

The response then shows more details, along with our module configuration.

Logs for the probe:
ts=2021-11-10T12:03:19.539609322Z caller=main.go:320 module=http_2xx target=www.trendyol.com level=info msg="Beginning probe" probe=http timeout_seconds=5
ts=2021-11-10T12:03:19.539685705Z caller=http.go:335 module=http_2xx target=www.trendyol.com level=info msg="Resolving target address" ip_protocol=ip6
ts=2021-11-10T12:03:19.570921716Z caller=http.go:335 module=http_2xx target=www.trendyol.com level=info msg="Resolved target address" ip=104.17.133.16
ts=2021-11-10T12:03:19.570980068Z caller=client.go:251 module=http_2xx target=www.trendyol.com level=info msg="Making HTTP request" url=http://104.17.133.16 host=www.trendyol.com
ts=2021-11-10T12:03:19.74709647Z caller=client.go:492 module=http_2xx target=www.trendyol.com level=info msg="Received redirect" location=https://www.trendyol.com/
ts=2021-11-10T12:03:19.747202186Z caller=client.go:251 module=http_2xx target=www.trendyol.com level=info msg="Making HTTP request" url=https://www.trendyol.com/ host=
ts=2021-11-10T12:03:19.747223777Z caller=client.go:251 module=http_2xx target=www.trendyol.com level=info msg="Address does not match first address, not sending TLS ServerName" first=104.17.133.16 address=www.trendyol.com
ts=2021-11-10T12:03:20.085912327Z caller=main.go:130 module=http_2xx target=www.trendyol.com level=info msg="Received HTTP response" status_code=200
ts=2021-11-10T12:03:20.309809321Z caller=main.go:130 module=http_2xx target=www.trendyol.com level=info msg="Response timings for roundtrip" roundtrip=0 start=2021-11-10T12:03:19.571052069Z dnsDone=2021-11-10T12:03:19.571052069Z connectDone=2021-11-10T12:03:19.631688228Z gotConn=2021-11-10T12:03:19.631718525Z responseStart=2021-11-10T12:03:19.747027915Z tlsStart=0001-01-01T00:00:00Z tlsDone=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
ts=2021-11-10T12:03:20.309844977Z caller=main.go:130 module=http_2xx target=www.trendyol.com level=info msg="Response timings for roundtrip" roundtrip=1 start=2021-11-10T12:03:19.747300002Z dnsDone=2021-11-10T12:03:19.751055881Z connectDone=2021-11-10T12:03:19.846510737Z gotConn=2021-11-10T12:03:19.914806905Z responseStart=2021-11-10T12:03:20.085834663Z tlsStart=2021-11-10T12:03:19.846537968Z tlsDone=2021-11-10T12:03:19.914701661Z end=2021-11-10T12:03:20.309796122Z
ts=2021-11-10T12:03:20.309911491Z caller=main.go:320 module=http_2xx target=www.trendyol.com level=info msg="Probe succeeded" duration_seconds=0.770276769



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.031248997
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.770276769
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.156121312
probe_http_duration_seconds{phase="processing"} 0.286337175
probe_http_duration_seconds{phase="resolve"} 0.035004882
probe_http_duration_seconds{phase="tls"} 0.068163702
probe_http_duration_seconds{phase="transfer"} 0.223961445
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 222945
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 2
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 1.231528671e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.652864248e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.652864248e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="0315524193aa6ceb020b85a8311534d51d7b32d0344895687c57b9f0928eb9bb"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.3"} 1



Module configuration:
prober: http
timeout: 5s
http:
  ip_protocol_fallback: true
  follow_redirects: true
tcp:
  ip_protocol_fallback: true
icmp:
  ip_protocol_fallback: true
dns:
  ip_protocol_fallback: true

Conclusion

Monitoring plays a prominent role in big tech firms like @trendyol. Our teams have their own dashboards and metrics to follow, so the monitoring system should be designed to stay open to new updates: an upcoming feature should work in the prod environment without affecting other teams. At that point, prometheus-operator plays a critical role. New configurations are automatically imported into Prometheus through ServiceMonitor and Probe resources, and prometheus-operator does the rest of the work.

You can easily probe your internal and external services using Blackbox exporter. We want to be informed immediately whenever a service is down and to create alerts for critical service problems, and many detailed dashboards can be built on Grafana for this purpose. In the future, we want to spread the use of blackbox-exporter to all teams by increasing the number of ServiceMonitor or Probe resources.

As mentioned before, you don’t have to change the Prometheus configuration manually to use blackbox-exporter. Using Probes and ServiceMonitors, the endpoints each team wants to follow can be passed to Prometheus as targets. In this way, teams create their own resources without affecting each other.
