Operationalizing Google Cloud Media CDN

Published in Google Cloud - Community · 8 min read · Jan 31, 2024

In the realm of content delivery networks (CDNs), steady state operations teams play a pivotal role in ensuring the seamless and reliable delivery of content to users worldwide. Acting as the custodians of CDN stability, these teams are responsible for maintaining optimal performance levels, proactively identifying and mitigating potential issues, and ensuring that CDNs can withstand even the most demanding workloads. By monitoring and analyzing CDN performance, they can identify and address potential issues before they impact users. In essence, steady state operations teams serve as the backbone of CDNs. Without their dedication and expertise, CDNs would be vulnerable to disruptions and outages, potentially impacting the availability and performance of critical online services.

The purpose of this blog is to help Media CDN users gain more visibility into Media CDN performance, monitor the various aspects of the service, and (if needed) troubleshoot Media CDN better and faster. At a high level, this blog addresses:

Built-in Monitoring for Media CDN: how and when to use it
Cloud Logging for Media CDN: how and when to use it
Metrics Explorer: how and when to use it
Alerting: how and when to use it

This will be a long blog post, so I request readers to stay patient and focus on the details shown in the snippets. I am sure you will love the power of observability that Media CDN offers its customers. Let’s dive in.

Built-in Monitoring for Media CDN

In Media CDN terminology, an Edge Cache Service is a public endpoint that makes thousands of global edge locations available for delivering media to your users. Therefore, from a Media CDN configuration point of view, caching and routing configurations are defined under the ‘Edge Cache Service’.

Each Media CDN service offers a built-in monitoring dashboard, which gives a glimpse of how the service is performing. The following are examples:

Image 1: Monitoring Page of Media CDN service

This snippet lets the customer choose the time range for the displayed information. The maximum time range is 6 weeks. Once a time range is selected, the following information is displayed:

Requests
Cache egress
Cache fill
Error %
Cache results (HIT, MISS, etc.)
Client-based information such as audience countries

Image 2: Monitoring Page of Media CDN service (contd..)

Moving data of hit ratios over the configured time range. This is very helpful when a customer wants to see whether the hit ratio dipped during a given time interval.
Moving data of cache egress bandwidth over the configured time range.
Moving data of HTTP response codes over the configured time range. This is very useful when a customer wants to see whether 4xx / 5xx error codes spiked in a given time interval. A graph showing all 2xx responses reflects good health of the Media CDN edge service.

Image 3: Monitoring Page of Media CDN service (contd..)

The above snippet shows additional information about requests going to the origin:

How many requests went to the origin
Moving data of the response codes returned by the origin

Image 4: Monitoring Page of Media CDN service (contd..)

The above charts show useful monitoring data for latencies. Please note that all latencies in the out-of-the-box Media CDN monitoring are shown as 99th percentile, 95th percentile, 50th percentile, and mean values.

Moving data of total latency over the configured time range, calculated as the distribution of latencies from when the request was received by the proxy until the client acknowledged the last response byte to the proxy.
Moving data of origin time to first byte over the configured time range, calculated as the distribution of latencies from when the request was sent by the proxy until the response headers were received from the origin.
Moving data of HTTP time to first byte over the configured time range, calculated as the distribution of latencies from when the request was received by the proxy until the first byte of the response was sent, with client location information.
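
To make the percentile view concrete, here is a minimal offline sketch that computes p50, p95, p99, and mean from a list of latency strings in the format Media CDN logs use (for example "0.050127437s"). The function names and the nearest-rank method are my own choices for illustration; the dashboard works on distribution metrics and may compute percentiles differently.

```python
import math
from statistics import mean


def parse_seconds(value: str) -> float:
    """Convert a duration string such as "0.050127437s" to float seconds."""
    return float(value.rstrip("s"))


def nearest_rank(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies: list) -> dict:
    """Summarize latency strings the way the dashboard labels them (p99, p95, p50, mean)."""
    values = sorted(parse_seconds(v) for v in latencies)
    return {
        "p50": nearest_rank(values, 50),
        "p95": nearest_rank(values, 95),
        "p99": nearest_rank(values, 99),
        "mean": mean(values),
    }
```

Comparing p99 against p50 over the same window is a quick way to see whether a latency problem affects most requests or only a slow tail.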

Cloud Logging for Media CDN

By default, logging is disabled to minimize the data stored and the cost incurred.

Logging for Media CDN is enabled or disabled at the individual service level.

Use the following gcloud commands to enable or disable logging for a Media CDN service. The first command enables logging with a sample rate of 1.0, which logs every request; the second disables logging.

gcloud edge-cache services update YOUR_SERVICE \
    --enable-logging \
    --logging-sample-rate=1.0

gcloud edge-cache services update YOUR_SERVICE \
    --no-enable-logging

Customers should tune the sample rate so that not every request is logged when that level of detail is not needed. A sample logging query for Media CDN logs filters on the resource type that appears in the log entry below: resource.type="edgecache.googleapis.com/EdgeCacheRouteRule"

BEST PRACTICES

A best practice for managing Media CDN is to ensure that logging is enabled for production services.
The sampling rate is a trade-off between cost and the amount of information logged. Customers with large volumes of requests might prefer to sample logs rather than capture a log for every request.

Now let’s look at a sample log entry:
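
The trade-off can be put into rough numbers. The sketch below assumes each request is logged independently with probability equal to the sample rate (my assumption about how sampling behaves); the function names are hypothetical and nothing here calls a Google Cloud API.

```python
def expected_logged(requests: int, sample_rate: float) -> float:
    """Expected number of log entries when each request is logged with probability sample_rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    return requests * sample_rate


def estimate_total(logged: int, sample_rate: float) -> float:
    """Scale a count observed in sampled logs back to an estimated request total."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be greater than 0.0 and at most 1.0")
    return logged / sample_rate
```

For example, at a sample rate of 0.25, one million requests yield roughly 250,000 log entries, and any count taken from those logs should be multiplied by 4 to estimate the real total.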

{
  "httpRequest": {
    "protocol": "HTTP/2",
    "remoteIp": "xxx.1xx.yy.xx",
    "requestMethod": "GET",
    "requestSize": "3560",
    "requestUrl": "https://basicvod.example.com/transcoder-output/TOS/media-hd0000000000.ts",
    "responseSize": "1537623",
    "status": 206,
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
  },
  "insertId": "66468f56-0000-267f-855e-582429bd83a8@a1",
  "jsonPayload": {
    "@type": "type.googleapis.com/google.cloud.edgecache.v1.EdgeCacheLogEntry",
    "cacheId": "del-32c19183, del",
    "cacheKeyFingerprint": "9b394d858f20f143",
    "cacheMode": "FORCE_CACHE_ALL",
    "cacheStatus": "hit,uncacheable",
    "clientAsn": "24560",
    "clientCity": "Gurugram",
    "clientRegionCode": "IN",
    "enforcedSecurityPolicy": {
      "configuredAction": "ACCEPT",
      "name": "no_policy",
      "outcome": "ACCEPT",
      "priority": 2147483647
    },
    "httpTtfb": "0.006682726s",
    "latency": "0.050127437s",
    "metroIataCode": "DEL",
    "origin": "stage",
    "originIp": "172.217.166.187",
    "originalRequestId": "f25bcdc6-4d0c-427c-9434-35c560cc4361",
    "proxyRegionCode": "IN",
    "proxyStatus": "Google-Edge-Cache",
    "rangeHeader": "bytes=246851520-248388419",
    "requestId": "0b4abb04-7e8f-48c8-9906-7f2f9f1ce696",
    "tlsVersion": "TLS 1.3"
  },
  "logName": "projects/test-xxxxx/logs/edgecache.googleapis.com%2Fedge_cache_request",
  "receiveTimestamp": "2023-10-10T05:22:05.643564237Z",
  "resource": {
    "labels": {
      "location": "global",
      "matched_path": "/**.ts",
      "path_matcher_name": "path-matcher-0",
      "resource_container": "projects/xxxxxxxxxxx",
      "route_destination": "projects/xxxxxxxxxxx/locations/global/edgeCacheOrigins/stage",
      "route_type": "ORIGIN",
      "service_name": "basicvod"
    },
    "type": "edgecache.googleapis.com/EdgeCacheRouteRule"
  },
  "timestamp": "2023-10-10T05:22:05.048175437Z",
  "trace": "projects/xxxxxxxxxxx/traces/0b4abb04-7e8f-48c8-9906-7f2f9f1ce696"
}

Important ingredients of a log message and how to make the best use of them

1. Client-facing details: These details are in the ‘httpRequest’ and ‘jsonPayload’ sections.

The following are important details under the ‘httpRequest’ section:

Client IP
Which URL was requested
User-Agent
HTTP request Method
Request size

The following are important details under the ‘jsonPayload’ section:

Client ASN
Client City
Client Region
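
As a small illustration, the client-facing fields listed above can be pulled out of a log entry once it is loaded as a Python dict. The field names come from the sample log entry; the function name and the output keys are hypothetical.

```python
def client_details(entry: dict) -> dict:
    """Collect the client-facing fields of one Media CDN log entry into a flat dict."""
    http = entry.get("httpRequest", {})
    payload = entry.get("jsonPayload", {})
    size = http.get("requestSize")
    return {
        # Fields from the httpRequest section
        "client_ip": http.get("remoteIp"),
        "url": http.get("requestUrl"),
        "user_agent": http.get("userAgent"),
        "method": http.get("requestMethod"),
        # requestSize is logged as a string, so convert it when present
        "request_size_bytes": int(size) if size is not None else None,
        # Fields from the jsonPayload section
        "client_asn": payload.get("clientAsn"),
        "client_city": payload.get("clientCity"),
        "client_region": payload.get("clientRegionCode"),
    }
```

Missing fields come back as None rather than raising, which keeps the helper usable on entries that lack some of these details.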

2. Caching behavior details: These details are under the ‘jsonPayload’ section, including its nested ‘enforcedSecurityPolicy’ section.

Important fields are:

cacheMode
cacheStatus

For cacheStatus, please keep the following in mind:

If it says hit (or stale or revalidated), it’s a hit.
If it says fetch, it’s a miss.
If it only says uncacheable (and does not say miss anywhere), consider it uncacheable. Review Media CDN’s criteria for cacheable responses to determine whether it was truly cached.
If it says miss anywhere, it’s (probably) cacheable.
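
The rules above can be sketched as a small classifier. This is only an illustration: it assumes cacheStatus is a comma-separated list of tokens (as in the sample value "hit,uncacheable") and applies the rules in the order listed, which is my own choice of precedence.

```python
HIT_TOKENS = {"hit", "stale", "revalidated"}


def classify_cache_status(cache_status: str) -> str:
    """Map a cacheStatus value to "hit", "miss", "uncacheable", or "unknown"."""
    tokens = {t.strip().lower() for t in cache_status.split(",") if t.strip()}
    if tokens & HIT_TOKENS:
        return "hit"
    # "fetch" is a miss; "miss" anywhere means the response was (probably) cacheable
    if "fetch" in tokens or "miss" in tokens:
        return "miss"
    # Only reached when "uncacheable" appears without "miss"
    if "uncacheable" in tokens:
        return "uncacheable"
    return "unknown"
```

With this ordering, the sample entry's "hit,uncacheable" is counted as a hit, and a value that matches none of the rules is reported as "unknown" instead of being guessed.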

The enforcedSecurityPolicy section in the logs tells whether a Cloud Armor policy was applied to the Media CDN service. If so, the details of the policy, rule, and action are shown in this section of the logs.

Since most of the crucial information is present in the logs, customers are free to ingest these logs and build a view that suits them. For example (not an exhaustive list), the following questions can be easily answered:

Which are the top N countries accessing my assets using Media CDN?
Which are the top M cities accessing my assets using Media CDN?
Which are the top X accessed URLs?
For a given country, which cities is my end-user traffic coming from?
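
As a quick sketch, the policy outcomes can be tallied across a batch of log entries loaded as Python dicts. The field names (name, outcome) come from the sample log entry; the function name is hypothetical.

```python
from collections import Counter


def policy_outcomes(entries: list) -> Counter:
    """Count (policy name, outcome) pairs found in enforcedSecurityPolicy."""
    counts = Counter()
    for entry in entries:
        policy = entry.get("jsonPayload", {}).get("enforcedSecurityPolicy")
        if policy:  # skip entries that carry no policy section
            counts[(policy.get("name"), policy.get("outcome"))] += 1
    return counts
```

A sudden rise in any outcome other than ACCEPT for a given policy name is a hint to look at that policy's rules.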
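
The questions above can be answered with a few lines once log entries are exported and loaded as Python dicts. The sketch below is a minimal illustration: it treats clientRegionCode as the country (the sample entry shows "IN"), and the function names are hypothetical.

```python
from collections import Counter


def top_n(entries: list, section: str, field: str, n: int) -> list:
    """Return the n most common values of entry[section][field] with their counts."""
    counts = Counter(
        entry[section][field]
        for entry in entries
        if field in entry.get(section, {})
    )
    return counts.most_common(n)


def cities_for_country(entries: list, region_code: str) -> list:
    """Return cities and request counts for one client region code, busiest first."""
    payloads = (entry.get("jsonPayload", {}) for entry in entries)
    counts = Counter(
        p["clientCity"]
        for p in payloads
        if p.get("clientRegionCode") == region_code and "clientCity" in p
    )
    return counts.most_common()
```

For example, top_n(entries, "jsonPayload", "clientRegionCode", 5) gives the top five countries, and top_n(entries, "httpRequest", "requestUrl", 10) gives the ten most requested URLs. If logs are sampled, the counts are relative rather than absolute.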

Google Cloud Metrics for Media CDN

Cloud Monitoring supports metric types from Google Cloud services such as Media CDN. These metrics offer visibility beyond the built-in monitoring dashboards and the data collected in Media CDN logs. A customer can navigate to Metrics Explorer in Google Cloud and look for Media CDN metrics as follows:

The comprehensive list of available metrics is documented in ‘Metrics for Media CDN’ (see Useful Resources).

Common use cases of Metrics Explorer include (but are not limited to) finding more details about Common Media Client Data (CMCD), request counts, and Cloud Armor actions on requests. Metrics Explorer lets you apply filters to the various metrics, which helps narrow the results down to the data you need.

Example: A customer wants to find the sum of the egress bytes count served from the CDN (cache status = HIT) in the last 15 minutes. To do so, Metrics Explorer can be configured as follows:
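
As a rough cross-check outside Metrics Explorer, the same question can be approximated from exported log entries. This sketch assumes that httpRequest.responseSize is a reasonable stand-in for egress bytes and that cacheStatus contains the token "hit" for cache hits; both are my assumptions, and sampled logs will undercount. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(ts: str) -> datetime:
    """Parse a UTC timestamp such as "2023-10-10T05:22:05.048175437Z" (nanoseconds truncated)."""
    main, _, fraction = ts.rstrip("Z").partition(".")
    parsed = datetime.strptime(main, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    return parsed + timedelta(microseconds=micros)


def hit_egress_bytes(entries: list, now: datetime, window: timedelta = timedelta(minutes=15)) -> int:
    """Sum responseSize over cache hits whose timestamp falls within the window before now."""
    total = 0
    for entry in entries:
        tokens = [t.strip() for t in entry["jsonPayload"]["cacheStatus"].lower().split(",")]
        age = now - parse_timestamp(entry["timestamp"])
        if "hit" in tokens and timedelta(0) <= age <= window:
            total += int(entry["httpRequest"]["responseSize"])
    return total
```

Metrics Explorer remains the direct way to get this number; the log-based figure is mainly useful for spot checks on a small export.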

Alerting on Media CDN events

Alerting is an important part of GCP Cloud Monitoring and should be part of the steady state monitoring pipeline for Media CDN as well. The whole idea of setting up monitoring for Media CDN is to ensure that customers get notified when their Media CDN performance doesn’t meet defined criteria. Details of GCP alerting are documented in ‘GCP Alerting’ (see Useful Resources). At a high level, a customer has to plan for the following while defining an alerting policy:

Which Media CDN parameter needs to be monitored?
What is a safe threshold for that parameter?
If the threshold is breached, how should the customer be notified?

The example below shows a customer monitoring the ‘total latencies’ parameter of Media CDN. The customer wants to be notified if total latencies for *.ts files served from the CDN (cache status = HIT) breach the 4 second threshold.

When defining an alerting policy, the same Media CDN metrics documented on the Metrics Explorer page are available for selection. The customer can configure the available filters to narrow down the alerts.

As a next step, the customer defines the threshold values for the monitored parameters.

Finally, the customer chooses the notification methods.

In my example, the notification channel selected for the alerting policy was email. The following shows a sample email received when the threshold was breached:
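
To illustrate what the threshold condition expresses, here is an offline sketch that applies the same rule to exported log entries: .ts requests, served as cache hits, with latency above 4 seconds. It is not how Cloud Monitoring evaluates an alerting policy; the function name, the use of the latency log field, and the simple ".ts" suffix check (which ignores query strings) are my assumptions.

```python
def breaches(entries: list, threshold_seconds: float = 4.0) -> list:
    """Return requestIds of .ts cache hits whose logged latency exceeds the threshold."""
    slow = []
    for entry in entries:
        payload = entry["jsonPayload"]
        url = entry["httpRequest"]["requestUrl"]
        tokens = [t.strip() for t in payload["cacheStatus"].lower().split(",")]
        latency = float(payload["latency"].rstrip("s"))  # e.g. "0.050127437s"
        if url.endswith(".ts") and "hit" in tokens and latency > threshold_seconds:
            slow.append(payload["requestId"])
    return slow
```

Running a check like this over a recent export is a simple way to sanity-test a threshold before committing it to an alerting policy: a threshold that flags a large share of normal traffic is likely to be noisy.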

Useful Resources

Monitoring a Media CDN service

Metrics for Media CDN

GCP Alerting

Disclaimer: This is to inform readers that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author’s employer, organization, committee or other group or individual.