How to Debug an Unresponsive app on GCP

Alex Amies
Google Cloud - Community
8 min readMar 28, 2020

TL;DR: You can debug an unhealthy app using Google Cloud Logging and other GCP tools even if the app has become unresponsive. A test app is provided that simulates the problem in an experiment for the purpose of explaining the debugging process with App Engine Flex. Similar debugging tools and methods are broadly available on other Google Cloud compute environments.

Background

App Engine Flex and other serverless offerings on Google Cloud Platform are very convenient, leaving Google to manage the app for you. If a virtual machine crashes, App Engine will reboot or create a new instance for you so that your app is still available to users. However, it is also possible that your app may become unhealthy while the virtual machine is still healthy. Conditions that may lead to this include out of memory, deadlocks in the app, and I/O problems. Rebooting the app and / or the instance may be helpful for the availability of your app. However, there may be something systemic in your app that is causing the problem then you can get to the bottom of that using the methods described here.

Test app

The test app is a Python Flask application that serves a few requests successfully but then randomly goes into a simulated unhealthy state when it then takes 90 seconds on all subsequent requests, including health checks. The main handler is shown below. Full code is in this gist.

@app.route('/')def home():  """1 out of 10 requests change the app to an unhealthy state"""  msg = 'Welcome to the home page!\n'  if not get_app_healthy():    msg = 'Changing to an unhealthy state'    logging.error(msg)    time.sleep(90) # seconds    return msg  r = random.uniform(0.0, 1.0)  if r > 0.9:    msg = 'App is unhealthy'    logging.error(msg)    set_app_healthy(False)    time.sleep(90) # seconds  return msg

The handler for the readiness health check is shown below.

@app.route('/readiness_check')def readiness_check():  """Respond to the load balancer's readiness check"""  msg = 'OK'  if not get_app_healthy():    msg = 'App is unhealthy'    logging.error(msg)    time.sleep(90) # seconds  return msg

Deploying the app

A deployment descriptor app.yaml is provided for App engine Flex. The timeout for the readiness health check in App Engine Flex is set to 4 seconds, which is performed every 5 seconds.

readiness_check:  path: "/readiness_check"  check_interval_sec: 5  timeout_sec: 4  failure_threshold: 2

Deploy the app to Flex with the commands below.

Set your project as the default in gcloud

GOOGLE_CLOUD_PROJECT=[Your project]gcloud config set project $GOOGLE_CLOUD_PROJECT

Enable App Engine

gcloud app create

Deploy the app

gcloud app deploy

Browse to the app

gcloud app browse

You should see a successful response in your browser, unless you got unlucky. In that case you can try again.

Set up Apache Bench In the Cloud Shell or Linux command line to use for sending HTTP requests. You can use the Google Cloud Shell if you local development environment is not Debian.

sudo apt-get install apache2-utils

Send some requests to your app

ab -n 1000 -c 5 https://${GOOGLE_CLOUD_PROJECT}.appspot.com/

We expect that this will overload the app. Navigate to the App Engine dashboard to see how the app is performing. You should see high latency, as in the chart below although it may take a few minutes for the monitoring data to appear.

App Engine Flex Latency
App Engine Flex Latency

Suppose we did not know that our app had an intentionally injected problem to simulate bad health? How would you debug the problem? After noticing a general problem in the monitoring dashboard you should check the logs. Navigate to the Cloud Logs Viewer and select App Engine | stderr, as shown below

Screenshot: App Engine Standard Error Logs
Screenshot: App Engine Standard Error Logs

Notice that the latency for the app suddenly jumped up from 1 millisecond to over 30 seconds and errors increased. If we browse through the logs you may find some errors from the runtime container, in this case Flask or Gunicorn, such as shown below:

App Engine Flex Standard Error Log Detail
App Engine Flex Standard Error Log Detail

Notice that the instance name is given as well. This allows you to identify the instance, in case you need to SSH to it, which you can do with the gcloud app instances ssh command. Suppose that your app is so unhealthy that it does not emit any log messages. There are also log messages for the NGINX health checks that timed out that can be checked, as shown below.

App Engine Flex NGINX Health Check Logs
App Engine Flex NGINX Health Check Logs

Notice that the user agent is GoogleHC for health check and the 502 HTTP response code due to the health check client timing out (upper red box). These logs also give the instance name (lower red box). You can use the instance name from the logs to view details about the instance. Navigate to the App Engine | Instances screen. Select the instance in question and look at the monitoring charts. The chart for CPU is shown below.

App Engine Flex Instance CPU Utilization
App Engine Flex Instance CPU Utilization

The CPU usage is kind of high but maybe not so high that it was the cause of the problem. You can also check the charts for memory and disk from the dropdown.

At this point we should suspect our app because there is nothing obviously wrong with the virtual machine instance. There are a number of tools that you can use for detecting problems in your app, such as Cloud Trace, which can tell you where the time in your app is spent, Stackdriver Profiler that can tell you where in your code the CPU resources are being spent and Stackdriver Debugger to introspect the state of your app. These tools work on the basis of samples and are appropriate for production use. However, you need to set them up in advance of an incident.

To get to the bottom of a problem where the process running the app is totally unresponsive, you may need to SSH to the virtual machine and use other tools to inspect the state of threads or memory or do packet captures. Hopefully, this will be rare and you will be able to get the information that you need from the Cloud Console in most cases without having to resort to that.

Health checks

The test app has a custom handler for readiness checks. App Engine Flex readiness checks confirm that an instance can accept incoming requests. By contrast, a liveness check verifies that the virtual machine and Docker container are running. Other configurations on Google Cloud Platform have health checks that are similar but differ in some details. Kubernetes has a similar concept of liveness and readiness checks. Google Cloud Load Balancer also has its own flavor of health checks.

Other runtimes and languages

Runtimes have different characteristics in the way that they manage applications. For example, if we change the sleep statement in the home() handler to exit the process

sys.exit(1)

then Flask will restart the app. You can verify this by running the app locally

python3 -m venv envsource env/bin/activatepip3 install  -r requirements.txtpython3 main.py

If you send ten or so requests with curl to the app in another terminal

curl http://localhost:8080

Then you can verify that Flask restarts the app. Not all runtimes handle process exits in the same way. In production, the App Engine Flex app.yaml file specifies the gunicorn HTTP server, which manages a pool of worker processes. If one process exits it will not make your app unavailable because there are other processes already running to handle incoming requests.

Runtimes for Java and Go differ from Python typically exiting the process without restart. In that case the Docker container will also exit, which will be detected by a liveness check resulting in a restart of the VM instance. We can do a similar experiment by making a modification to the App Engine Flex Go quickstart to exit the process when we get an unlucky number as shown below

func handle(w http.ResponseWriter, r *http.Request) {  num := rand.Intn(10)  if num == 9 {    os.Exit(1)  }  if r.URL.Path != "/" {    http.NotFound(w, r)    return  }  fmt.Fprint(w, "Hello world!")}

Then when deployed to App Engine Flex production you can see from the logs that the VM instance is shutdown and restarted.

App startup

Fast app startup is critical to rapid recovery in the case of an outage. If you have a problem that brings down your fleet, then getting a fix rolled out fast will help reduce your recovery time. In addition, slow starting apps that handle a large amount of traffic are susceptible to thundering herd problems. That is because incoming traffic can overwhelm instances before they are ready to process traffic. You can add a custom readiness check, as described above, to address this problem. However, it may still take time to ramp up a sufficient number of instances to handle incoming traffic. The duration that takes will affect your incident recovery time.

App Engine Flex takes considerably longer to roll out an app change than App Engine Standard. Flex also typically takes longer to roll out an app change than Google Kubernetes Engine (GKE). In the case of GKE an app change will often only involve loading a new container image, typically single digit seconds.

Log management

If you have a high traffic app you may use a log exclusion filter to avoid excessive logs, such as from successfully passing health checks. This should not prevent you from capturing the logs that you need for debugging with a minor adjustment to the filter:

resource.type="gae_app"logName="projects/[your project]/logs/appengine.googleapis.com%2Fnginx.health_check"httpRequest.status>200

The last line limits the log exclusion filter to unsuccessful requests.

Error handling

From the experiments above, it is clear that a process exit is handled better than a problem where the app is running but unresponsive. To avoid this is it important for the app to not catch general runtime errors that are not actionable within the context of your app code. Examples of these are RuntimeException in java, which is extended by OutOfMemoryError and other system level errors. The app should only catch application specific exceptions that are expected and that it can reasonably address or do not matter to the health of the app.

--

--