How to make cloud functions more reliable after development?

Pamela Peixinho
Blog Técnico QuintoAndar
13 min read · Mar 13, 2020


TL;DR: By monitoring our cloud functions and iterating on their development, we discovered problems, solved them, and made our data sync more reliable. This article walks through a real example of how to do it with cloud functions.

grey wall with two white security cameras
Monitoring (Unsplash)

How many times have you looked at a system and had no idea how it was working in production?

How many times did you only find out about a problem through a report from your users?

How many times have you seen a log that didn't have any meaningful text or info about the problem?

How many times have you seen an error report and couldn't find any related log with more info or the last action that happened?

I bet at least once!

This kind of problem, common as it is, causes real pain in software development. If I told you it was possible to find a bug in a cloud function in less than 2 minutes, would you believe me? What if I said this has already happened?

On May 30th, 2019, we were rebuilding the architecture responsible for syncing data between two databases when I received an email from Error Reporting in production. We were still in the beta of the new sync, so my teammate and I took a look to understand the problem.

  • We saw the error and its occurrences, along with part of the stack trace, in Error Reporting
  • We found the line that raised the error and looked it up in the project (to understand the context and the possible causes of the error)
  • We went from Error Reporting directly to the logs
  • In the Logs Viewer, we searched for the logs of the entire execution of that function
  • We could see the whole flow of the function and saw that one field of the payload was coming as null (while the previous and current data were not)
  • So we found the problem and fixed the service that was writing to Firestore

Just 2 minutes and 3 clicks, and the bug was found! Just by monitoring Error Reporting and putting the right info inside our logs, we can easily understand what our production serverless functions are doing.

Debugging with the right logs and the right tools

Monitoring and alerting are extremely important for any software under development, but they are often neglected. We faced these problems for real while rebuilding the sync architecture and code for the agents' PWA. So, from what we learned, we established best practices to monitor cloud functions (CFs) and integrated them into our team's process.

Nowadays, every new cloud function on my team has meaningful logs and is included in dashboards and alarms, so we know exactly what is happening, when, and how. I venture to say that we are error-proof. This DOESN'T mean that we will never have errors. It means that when we have an error, it's simple to find and solve.

As you saw in the story above, this is possible because of the following strategies:

  • Logs
  • Error reporting
  • Dashboard
  • Alarm

Let's take a little tour of these tools and the best practices that we learned by doing it wrong.

Logs

Even though logging is so important, it is usually done in a suboptimal way, or not done at all.

Logs have a lot of uses, but the main ones for cloud functions are:

  • Monitoring
  • Debugging

In our case, we created dashboards to monitor cloud functions using the logs, and this helped us spot the main unexpected behavior in our cloud functions: we were retrying CF executions more than we expected. After seeing this, we analyzed the cases by looking directly at the data written in the logs, along with Error Reporting, and verified that we were retrying non-transient errors.

In other words, non-transient errors are not temporary errors; we called them "not retryable errors". It means the function was being retried continuously on errors in the code whose output would not change no matter how many times we retried, unless we fixed the code. So, the logs were the heroes in this case 👩‍🚒
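For context, Firestore-triggered functions can be deployed with "retry on failure": a thrown error (or rejected promise) makes the platform retry the execution, while a normal return drops the event. Below is a minimal sketch of how the two classes of errors can be treated; the helpers (replicateDocument, isTransientError) and the document path are illustrative, not our actual code.

```javascript
const functions = require('firebase-functions');

// Illustrative stubs, not our actual code.
const replicateDocument = async (change) => { /* write change.after.data() elsewhere */ };
const isTransientError = (error) =>
  error.code === 'unavailable' || error.code === 'deadline-exceeded';

// With "retry on failure" enabled, a thrown error makes the platform retry
// the execution, while a normal return drops the event.
exports.syncDocument = functions.firestore
  .document('collection/{docId}')
  .onWrite(async (change, context) => {
    try {
      await replicateDocument(change);
      console.log('Finishing the execution of the function successfully');
      return null;
    } catch (error) {
      if (isTransientError(error)) {
        console.error('Finishing with retryable error', { error: error.message });
        throw error; // rethrow so Cloud Functions retries
      }
      // Non-transient ("not retryable") error: retrying will never change the output.
      console.error('Not retryable error - dropping the event', { error: error.message });
      return null; // resolve normally so the event is not retried
    }
  });
```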

Image of a Google Stackdriver dashboard with several graphs of occurrences, one titled per cloud function
Dashboard for monitoring some of our cloud functions

However, we all know that the most common use of logs is debugging errors! Using logs, you can trace the entire path the data has taken, along with all the actions your application performed on it.

Nice, but how do we create good logs that help us do this? There are some best practices that we adopted first on my team and then passed on to other teams using cloud functions. Now we're going to share them with you.

Best practices

For us at QuintoAndar, the main use of Cloud Functions as Firestore listeners is to replicate and enhance data. With that in mind, there are some basic requirements that we've learned are important.

Maybe it looks obvious right now, but while we were coding it was not so obvious! On my team, we had talked since the beginning about doing this right, but we still created some meaningless logs and others missing a lot of information, and we had to go back and improve them.

It's essential to have:

  • A log at the start of the function ("Will process")
  • Logs for every "decision/result" made by your code. Examples: dropping the event due to idempotency; dropping the event because there's no change between the previous and the current data; no matching rentProgress for the offer
  • A log for the final result (including errors and successes): finishing the execution of the function successfully; finishing with a retryable error; "Not retryable error — code: validation — missing data …"

However, more important than the log string is the data that goes along with the log as extra information. It helps us understand the context, the data handled at that moment, and possibly what caused the next action.

All logs should add the following data as extra:

  • The previous data (before the update that triggered the cloud function)
  • The current data (that is in Firestore right at that moment)
  • The data that you’re going to write (built object for replication)
Adding JSON.stringify to all objects in a PR
One of the PRs adding JSON.stringify to log a string instead of an unreadable object, centralizing the function call in one place

Tip: if you're using JS as we are, take care with objects and always use JSON.stringify. We had already forgotten it, and I had to create a PR just to apply it to all logs in a centralized way. There's a sketch below tying these logging points together.
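Here is the sketch mentioned above. It is only an assumption of how such a function could look (the document path, field names, and helpers are illustrative, not our actual code): log the start, every decision, and the final result, always attaching the previous, current, and to-be-written data as stringified extras.

```javascript
const functions = require('firebase-functions');

// Illustrative stubs, not our actual code.
const buildReplicationObject = (current) => ({ ...current, syncedAt: Date.now() });
const writeReplicatedDocument = async (doc) => { /* write to the other database */ };

exports.replicateOffer = functions.firestore
  .document('offers/{offerId}')
  .onWrite(async (change, context) => {
    const previous = change.before.exists ? change.before.data() : null;
    const current = change.after.exists ? change.after.data() : null;
    // Extra data attached to every log entry of this execution.
    const extra = JSON.stringify({ previous, current });

    console.log('Will process the offer replication', extra);

    // "Decision" logs: dropping the event when there is nothing to replicate.
    if (!current) {
      console.log('Dropping the event: document was deleted', extra);
      return null;
    }
    if (JSON.stringify(previous) === JSON.stringify(current)) {
      console.log('Dropping the event: no change between previous and current data', extra);
      return null;
    }

    // The data we are going to write (built object for replication).
    const toWrite = buildReplicationObject(current);
    console.log('Will write the replicated document', extra, JSON.stringify({ toWrite }));

    await writeReplicatedDocument(toWrite);
    console.log('Finishing the execution of the function successfully', extra);
    return null;
  });
```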

Following these basic rules works for ONE cloud function. However, when you have lots of cloud functions, you must follow another principle that is "easy to say and difficult to follow": consistency.

Across all logs of all cloud functions, it is important to be consistent, to help investigations and monitoring. That's why we've created a simple solution that makes logs easy to find: unified log messages. This means that all the messages that are going to be logged follow the same structure. Let's see what it looks like:

Sample of unified log messages

Using:

Sample of unified log message being used
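The samples above were screenshots in the original post. As an assumption of what such a unified structure could look like (the module and message names are illustrative), all messages are declared in one place and logged through a single helper, so every entry has the same shape:

```javascript
// unifiedLog.js - illustrative sketch, not the actual QuintoAndar module.
const LOG_MESSAGES = {
  WILL_PROCESS: 'Will process the document replication',
  DROPPED_NO_CHANGE: 'Dropping the event: no change between previous and current data',
  SUCCESS: 'Finishing the execution of the function successfully',
  NOT_RETRYABLE_ERROR: 'Not retryable error',
};

// One helper so every entry has the same shape: message + stringified extra data.
const log = (message, extra = {}) => console.log(message, JSON.stringify(extra));

module.exports = { LOG_MESSAGES, log };

// Usage inside a cloud function:
//   const { LOG_MESSAGES, log } = require('./unifiedLog');
//   log(LOG_MESSAGES.WILL_PROCESS, { previous, current });
```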

However, this is only simple if you create logs that are readable by humans!

If you just create a log with a message like:

service.saveDocument [object] or An error occurred

It's not readable at all. We need to create meaningful and easy-to-understand logs. When you're coding, always try to put yourself in another developer's shoes: imagine that you don't know what the code does, nor what the error means, and check whether you can still understand the log.

For example, you can understand:

Ignoring the contract id = X as it is not newer than the current one (extra, context)

Along with readable log messages, we should also pay attention to log levels. Imagine the confusion of seeing an info-level log with the text "error to save visit info".

Is it an error or info? What is causing the problem? Do I need to do anything?

Good examples of using the log levels
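As a small sketch of the idea (the function and messages are made up; the assumption is that console.log surfaces as an info/default entry and console.error as an error entry for Node.js Cloud Functions):

```javascript
// Match the log level to the meaning of the message.
function logVisitSaveResult(visitId, error) {
  if (error) {
    // A real failure: error level.
    console.error('Error to save visit info', JSON.stringify({ visitId, error: error.message }));
  } else {
    // Normal flow: info level.
    console.log('Visit info saved successfully', JSON.stringify({ visitId }));
  }
}
```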

Another important logging practice, and not such a trivial one, is to have unique identifiers so you can easily follow the entire trace. There are several ways to create these identifiers; however, for a single cloud function execution we don't need to, since it already has one built in:

Image of a log in Stackdriver with the execution_id field label highlighted
execution_id is the unique identifier of each cloud function execution

By following these simple best-practice "rules" for logging in cloud functions, you can greatly improve the way you monitor and debug. So, recapping:

  • Log the beginning, every action/decision, and results
  • Log all the data related to the function execution
  • Log consistently — prefer to unify the log messages
  • Always log messages that are totally readable for humans
  • Pay attention to log levels
  • Use unique identifiers

Using the logs

After writing good logs, you need to understand how to extract the maximum from them. Google Cloud Platform has some good tools, but they are not always simple to use, such as Stackdriver, which we're going to see in the next section.

Stackdriver

Let’s start with Stackdriver, what is it?

Stackdriver is Google's platform for monitoring and managing services, containers, applications, and infrastructure.

"Full observability for your code and applications: Stackdriver aggregates metrics, logs, and events from infrastructure" — Reference

There are several different features. In this article, we are going to talk about logging, error reporting, monitoring, and dashboards.

Logging

There are two views that we mainly use: the Logs Viewer and logs-based metrics.

The first one, the Logs Viewer, is a tool where we can see real-time and past logs, with useful filters such as execution ids, resources, or text queries.

The second one, logs-based metrics, is a different use of the logs: we can save a query and count the entries it matches, and from that we can create charts and alerts.

In Stackdriver's own words (Reference):

Logs-based metrics are Stackdriver Monitoring metrics that are based on the content of log entries. You can use logs-based metrics in Stackdriver Monitoring charts and alerting policies.

Logs viewer

There are several ways to query the logging system:

  • Search by the text field (log message)
  • Search by Cloud Function name or by all cloud functions (other Google resources)
  • Search by log level
  • Search by time
  • Advanced queries
  • Show matching entries
  • execution_id
  • Show resource in the context
  • And so on…

In this article, I'm not going to teach you how to use the Google logging platform. I'm going to focus on tips & tricks for easy debugging using logs, and show how this worked for us with the example below:

(If you want a tutorial, see the Google Cloud docs at the end of the article.)

QuintoAndar is a digital real estate agency that is transforming housing in Brazil. One of the main parts of our system is a real-time booking service that allows users to book visits with homeowners and real estate agents with no manual steps. We use cloud functions to sync the visits with the agents' agendas.

Now, imagine that something went wrong and a visit didn't show up in the agent's agenda. This can cause a bad experience: the user waiting at the front door and the agent not showing up. So, you need to understand what went wrong before this happens. The only thing you know is that the scheduled visit id is 4221578.

So, you start by searching for that ID within the log message that you've coded:

A search entry in the field with info about the first occurrence
Find the log entry of the visit

So, you have found one occurrence of the start of the function. Now you need to understand the entire execution of the function, and you can use the execution_id parameter via the "Show matching entries" action:

Using the button with the filter "Show matching entries", you can filter by the execution
Filter by matching entries (using execution_id)

Make sure to remove the string in the search field and leave only the execution_id assertion in the query field.

list of logs with the info about the execution

Now I can see exactly what happened in that cloud function execution! In this case, it was aborted because of Firestore limitations, which means the Cloud Function will retry and eventually fix it. You can see the retried execution using the same execution_id as the one that failed:

Searching by execution_id and having unified logs were the two tricks that saved my life many times when trying to understand the origin of problems.
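For reference, the "Show matching entries" action essentially builds an advanced filter like the one below, which you can also type directly into the query field (the function name and execution_id value here are illustrative):

```
resource.type="cloud_function"
resource.labels.function_name="syncVisitToAgenda"
labels.execution_id="4ubpa7qjq23b"
```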

Logs-based metrics

The Logs Viewer gave us a zoomed-in view of the logs. Now we're going to see how to build the base for a zoomed-out view of the logs, based on occurrences, latency, or any other usage. This will be the source for creating charts (for dashboards) and alarms.

Imagine that you need to know how many times a cloud function started and completed the replication of a document, or how many times a function exceeded its retries. You can get this information by taking the log message and saving a metric for that log entry.

After this, you can check the occurrences by creating a chart, or you can set up an alarm using it. It's good practice to have consistent naming for all of them, for example:

{cloud_function_group}_{status}{action}

agents_willUpdateTheDocumentation

agents_successToUpdateTheDocumentation

agents_errorToUpdateTheDocumentation
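As a sketch (the function name and message here are illustrative), a logs-based metric such as agents_errorToUpdateTheDocumentation is just a saved filter; every log entry that matches it increments a counter that charts and alerting policies can then use:

```
resource.type="cloud_function"
resource.labels.function_name="agentsUpdateDocumentation"
textPayload:"Error to update the documentation"
```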

Error Reporting

It's much easier when you have the possible problems mapped out and can use logs to create metrics and alarms that tell you something is going wrong. However, let's face reality: we can't map all the errors. We can map only the business errors, but there are unexpected errors that are usually development mistakes. That's why Error Reporting exists in the first place!

Error Reporting is a service, similar to Sentry, that catches unexpected errors and puts them in a UI where it's easy to see when an error was first seen, when it was last seen, its number of occurrences, and the beginning of the stack trace. It also sends an email for every new error or regression.

There's no secret to using it. It is automatically set up when you use cloud functions. It has auto-reload and a field to mark the issue as open, acknowledged, or resolved.

Best practices

  • Grouping: errors should not have variables in their title, because Error Reporting groups by the title. Put the variables in the extra content instead, e.g. console.error("Title", { extra: id, error }) (see the sketch after this list)
  • Keep an eye on it continuously
  • Act immediately
  • Never ignore any report
  • Link to the task / Pull request that will fix it
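As an illustration of the grouping rule above (the function and names are made up), keep the error title constant and move everything variable into the extra content:

```javascript
// Illustration only: a constant title lets Error Reporting group occurrences,
// while the variables travel as extra data.
function reportSaveVisitError(visitId, error) {
  // Bad:  console.error(`Error to save visit ${visitId}`)  - one group per visit id.
  // Good: constant title, variables in the payload.
  console.error('Error to save visit', JSON.stringify({ visitId, error: error.message }));
}
```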

Dashboard

Since a cloud function's executions can matter to the business, for example a failure meaning users don't see the info they need, it's good practice to use a dashboard to get the most zoomed-out view of what's happening; from there you can use the other resources to find the error or improve performance.

It's really easy to create a dashboard based on the logs of cloud functions. The first time, I confess I got angry because I couldn't save the chart, since I was doing it wrong. As users, we always blame the software :)

First, define what you need for an overview; after that, save a logs-based metric, then go to the dashboard and create a new chart using the saved query. Easy!

Best practices

  • Have a dashboard for each functionality you write a cloud function for
  • In the dashboard, have one chart that shows will, done, and error together for your functionality
  • Have a generic one to look at a specific error/occurrence
  • Look also at execution times
  • Look at the time to execute, i.e., how the performance is doing
  • Remember that your dashboard is your macro view

Alarms

Just as we used our logs for debugging and error reporting, we can also use them to create alarms that monitor when something violates the metrics you have defined.

After releasing the beta version of the new sync, as I mentioned earlier, my team created a dashboard to have a full view of what was happening. Along with it, we created an alarm with what we thought was a reasonable threshold to alert us when retries were exceeded. After setting up the alarm, it started to flood our Slack channel:

We were retrying too much until the retries were exceeded

So, we started investigating and found the problem: we were retrying errors that cannot be fixed by retrying. Then we fixed it, the alarm stopped alerting us, and the sync worked nicely.

Best practices

  • Have meaningful alarms that fire only when really needed
  • Always iterate on and improve the alarm: you need to iterate to find the correct alarms and the correct thresholds
  • If it is always alerting, the problem is not the threshold of the alert; you need to solve the underlying issue

So, that's it! The article turned out a little longer than I expected, but it has all the lessons learned after developing a lot of cloud functions as Firestore listeners.

Remember, all these tools are complementary, and we should use the best of each one, aligned with what's being developed. Logs are the core and smallest piece: when done wrong, they can turn all your monitoring into complete chaos. So, write good logs, monitor your cloud functions, and be happy.

Thanks for reading! I'm always open to feedback, recommendations, or questions, so feel free to contact me!

Pamela Peixinho, Lead Software Engineer @ QuintoAndar

LinkedIn: https://www.linkedin.com/in/pamepeixinho
Twitter: https://twitter.com/pamepeixinho
GitHub: https://github.com/pamepeixinho

My last name is LittleFish, “sea” you later! 😉
