Gmail API’s push notifications bug and how we worked around it at Hiver

Raghav C S
Published in Hiver Engineering
Feb 24, 2023

The Hiver Shared Mailbox is helpdesk software that helps teams on Gmail collaborate better. Because it is built on top of Gmail, it relies heavily on the Gmail API to sync actions across users. So naturally, any hiccup in this integration has the potential to cause massive disruption to Hiver’s functionality.

We recently had one of these small “hiccups,” which initially seemed like a harmless one-off but later turned into a source of major annoyance for the whole product. At its worst, the delays in email syncing affected 200 Hiver users. In total, our team spent 2 weeks hunting down this bug that plagued the Gmail Watch API.

Some background info before jumping in:

Gmail Push Notifications

This is Gmail’s own description of their push notification service.

The Gmail API provides server push notifications that let you watch for changes to Gmail mailboxes. You can use this feature to improve the performance of your application. It allows you to eliminate the extra network and compute costs involved with polling resources to determine if they have changed. Whenever a mailbox changes, the Gmail API notifies your backend server application.

The Watch API

To configure Gmail accounts to send notifications to your Cloud Pub/Sub topic, simply use your Gmail API client to call watch() on the Gmail user mailbox similar to any other Gmail API call.

The Watch API request: the request takes the Cloud Pub/Sub topic name and, optionally, specific label IDs for filtering notifications, along with the ID of the user whose mailbox you want to watch.

# `gmail` is an already-authorized Gmail API client.
request = {
    'labelIds': ['INBOX'],  # only notify about changes to these labels
    'topicName': 'projects/myproject/topics/mytopic'
}
gmail.users().watch(userId='me', body=request).execute()

The Watch API response:

If the watch() request is successful you will receive a response like:

{
  historyId: 1234567890
  expiration: 1431990098200
}

The historyId shown above is the current historyId for the user, that is, the identifier of the mailbox’s current history record. You can get all the changes to the user’s mailbox that occurred after this history record using the history.list() API.
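To make the history.list() step concrete, here is a minimal sketch of paging through the history records that follow a known historyId. It assumes `service` is an already-authorized Gmail API client built with google-api-python-client; the helper name `iter_history` is hypothetical and not part of Hiver’s code.

```python
from typing import Any, Dict, Iterator


def iter_history(service: Any, start_history_id: str,
                 user_id: str = "me") -> Iterator[Dict[str, Any]]:
    """Yield history records newer than start_history_id, page by page."""
    params = {"userId": user_id, "startHistoryId": start_history_id}
    while True:
        response = service.users().history().list(**params).execute()
        # The "history" key is absent when there is nothing new.
        yield from response.get("history", [])
        next_token = response.get("nextPageToken")
        if not next_token:
            break
        params["pageToken"] = next_token
```

A caller would typically persist the newest historyId it has processed and pass it as `start_history_id` on the next run.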

The expiration is a Unix timestamp, in milliseconds, representing when the watch will expire for the user. Once the watch expires, your Pub/Sub topic stops receiving notifications about mailbox changes.
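As a small illustration of working with the expiration value, the sketch below converts it to a UTC datetime and checks whether a watch is close to expiring. The function names and the one-day renewal margin are assumptions chosen for the example, not values from Hiver’s system.

```python
from datetime import datetime, timedelta, timezone
from typing import Union


def watch_expiry(expiration_ms: Union[int, str]) -> datetime:
    """Convert the watch() expiration (epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(int(expiration_ms) / 1000, tz=timezone.utc)


def needs_renewal(expiration_ms: Union[int, str], now: datetime,
                  margin: timedelta = timedelta(days=1)) -> bool:
    """Return True when the watch expires within `margin` of `now`."""
    return watch_expiry(expiration_ms) - now <= margin
```

Running a check like this on a schedule and calling watch() again before the deadline keeps notifications flowing for that mailbox.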

Hiver’s sync engine

The Hiver sync engine is responsible for syncing Gmail actions between users, which is what makes the Shared Mailbox collaborative. It needs to be notified of every change that occurs in a mailbox, because the rest of the product flow depends on those changes.

Figure: The Hiver Sync Engine

The Pull Service is the first service in the sync engine’s data flow. It receives the pub/sub notifications about changes to users’ mailboxes. Once it identifies a change in a user’s mailbox, it simply passes the context to the next service in the flow, which fetches and processes the user’s Gmail histories.

The History Processing Service retrieves the Gmail histories of a Hiver user and processes the relevant ones.

Figure: The Pull and History Processing Services
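To illustrate what a service like the Pull Service has to do with each incoming notification, here is a minimal sketch of decoding a Gmail notification. Gmail publishes a base64url-encoded JSON payload containing `emailAddress` and `historyId` in the Pub/Sub message’s `data` field; the function name is hypothetical, and how Hiver actually hands the result to the next service is not described in this post.

```python
import base64
import json
from typing import Tuple


def decode_gmail_notification(data: str) -> Tuple[str, str]:
    """Decode the `data` field of a Gmail Pub/Sub message.

    Returns (email_address, history_id).
    """
    # Restore any base64 padding that was stripped in transit.
    padded = data + "=" * (-len(data) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return payload["emailAddress"], str(payload["historyId"])
```

The returned historyId only says that something changed; the changes themselves still have to be fetched with history.list().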

Now that we have all the information, we can move on to the actual issue :)

The Fault in our stars :(

Following the classic Murphy’s Law adage, this particular issue struck us at the most inopportune moment. The entire sync engine team was on a work-cation, enjoying the sight of the setting sun and cooling their heels at a beachside shack in the beautiful west coast paradise that is Goa. We should’ve known that it was too good to be true: sipping on piña coladas in beach shorts and warming our toes in the sand was not characteristic of a typical Wednesday evening, when customer traffic is at its highest.

And then the alert came in: “Sync not working for a user”. Investigations were slow, as the system health was pretty good. We couldn’t find any anomalies in our lag or infra metrics. It seemed like something was affecting only this one specific user. Only after what felt like an hour of debugging did we discover that the Pull Service was not receiving notifications for the affected user. The issue appeared to be on the Gmail side, which left us scratching our heads.

We couldn’t figure out what was wrong with the pub/sub at the moment, so we decided to just start polling for changes. After writing a quick hotfix to poll every minute, we fixed the issue for now and closed the alert.

The strange thing is that the pub/sub seemed to fix itself a few days later. Notifications for the user started arriving again, so we turned off our fallback polling fix. (Ominous music plays in the background)

We decided to treat the issue as a one-off for now and hoped Gmail had fixed the bug on their end. Alas, the peace did not last long, as the same sort of thing happened a couple more times, affecting different users. More users reported sync delays, and we found out that we were not receiving pub/sub notifications for them either.

During the follow-ups, we tried refreshing the watch manually for one of the affected users, and this seemed to fix the issue. We verified this a couple more times on similar cases, and it kept working.

Our theory was that stopping the watch temporarily and then calling the watch() API again restarted the pub/sub notifications. However, this was only a manual fix, and we had no way of avoiding the inconvenience altogether.

The Quest for answers

Obviously, our first instinct was to look for a solution online. Unfortunately, all we found were a couple of unsolved issues posted on Stack Overflow and Google’s Issue Tracker.

We double-checked our implementation of the pub/sub and watch API, and ensured we weren’t hitting any quota limits or the watch expiration.

The next step was reaching out to Gmail support. But all they did was give us some guidelines to follow as good practices. There was no mention of any bug or issue on their end.

After a couple of weeks of trying to figure this out to no avail, we decided to go ahead with an in-house solution as a permanent fix for this issue.

Problem solving time

The first thing we did was to implement a system of detection, i.e., identifying the users for whom we were not receiving any notifications. At this point, we did not know how many users were impacted, because a user with little traffic on their mailbox might not have noticed the delay.

If we could actually detect the issue for a user, fixing it could be as simple as calling the stop() and watch() APIs for that user.
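The stop-then-watch refresh can be sketched as below. This assumes `service` is an authorized Gmail API client for the affected user and reuses the topic and label values from the earlier watch() example; the helper name `refresh_watch` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def refresh_watch(service: Any, topic_name: str,
                  label_ids: Optional[List[str]] = None,
                  user_id: str = "me") -> Dict[str, Any]:
    """Stop the current watch on a mailbox and register a new one."""
    # users.stop() cancels push notifications for this mailbox.
    service.users().stop(userId=user_id).execute()
    body: Dict[str, Any] = {"topicName": topic_name}
    if label_ids:
        body["labelIds"] = label_ids
    # users.watch() returns the new historyId and expiration.
    return service.users().watch(userId=user_id, body=body).execute()
```

Because each call counts against API quota, a refresh like this is best reserved for mailboxes that have been confirmed as affected.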

Now, here’s where things become complicated. From the notifications alone, it’s impossible to differentiate between a user whose notifications have stopped arriving and one who simply had no mailbox changes for a while. Without this key differentiation, we might end up calling the watch() API too frequently and exhausting the quota limit.

A slightly unconventional solution :D

The only way to figure out whether the pub/sub is working correctly is to make a change in the mailbox and see whether the corresponding notification arrives. If we suspect that the pub/sub is not working, we can trigger a harmless change in the affected mailbox and then check if the notification arrived.

To elaborate, for each user, we store the time the last notification was received as a simple key-value pair in an in-memory cache like Redis. With this, we can tell at any given time how long it has been since the last notification arrived.
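The bookkeeping described above could look roughly like this. To keep the sketch self-contained it uses a plain dictionary in place of Redis, and the class and method names are hypothetical; in a real deployment the same two operations would map onto a cache SET and GET per user.

```python
import time
from typing import Dict, Optional


class LastNotificationStore:
    """Tracks when the last push notification arrived for each user."""

    def __init__(self) -> None:
        self._seen: Dict[str, float] = {}

    def record(self, email: str, now: Optional[float] = None) -> None:
        """Store the arrival time of a notification for `email`."""
        self._seen[email] = time.time() if now is None else now

    def seconds_since(self, email: str,
                      now: Optional[float] = None) -> Optional[float]:
        """Seconds since the last notification, or None if none was seen."""
        last = self._seen.get(email)
        if last is None:
            return None
        return (time.time() if now is None else now) - last
```

The optional `now` argument exists only so the timing logic can be exercised without waiting on a real clock.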

Figure: The Watch Fallback Module

If more than 20 minutes have passed since the last notification, that’s an indication that watch() might not be working.

This is when we trigger our harmless mailbox change. We achieve this by pushing an automated email that doesn’t show up in any inbox, and then adding and removing the UNREAD label on that email. If the pub/sub is working properly, we will receive a notification, and the last_notification_received_time will be updated.

The 20-minute mark is something we felt was a good enough interval to perform our checks. It’s neither too soon nor too late.

Figure: Test email to trigger the push notification
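One way such a probe could be built with the Gmail API is sketched below. The post does not say exactly how the automated email is created or hidden, so this is an assumption: it inserts a message without any labels via messages.insert (so it stays out of the inbox) and toggles UNREAD via messages.modify. All function names and the subject line are hypothetical, and whether a label change on such a message produces a notification depends on the label filter used in the watch() call.

```python
import base64
from email.message import EmailMessage
from typing import Any, Dict


def build_probe_message(address: str) -> Dict[str, str]:
    """Build the request body for a small self-addressed probe email."""
    msg = EmailMessage()
    msg["To"] = address
    msg["From"] = address
    msg["Subject"] = "watch-probe"
    msg.set_content("Automated probe; safe to ignore.")
    # The Gmail API expects the RFC 2822 message base64url-encoded in "raw".
    return {"raw": base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")}


def insert_probe(service: Any, address: str, user_id: str = "me") -> str:
    """Insert the probe without labels and return its message id."""
    body = build_probe_message(address)
    result = service.users().messages().insert(userId=user_id, body=body).execute()
    return result["id"]


def toggle_unread(service: Any, message_id: str, user_id: str = "me") -> None:
    """Add and then remove UNREAD to produce a harmless mailbox change."""
    messages = service.users().messages()
    for body in ({"addLabelIds": ["UNREAD"]}, {"removeLabelIds": ["UNREAD"]}):
        messages.modify(userId=user_id, id=message_id, body=body).execute()
```

Reusing one probe message per mailbox, rather than inserting a new one on every check, would keep the mailbox from filling up with test emails.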

When we run our checks again after another 20 minutes, we know whether the last_notification_received_time was updated or whether the watch() API has stopped working. In the latter case, we log the issue and fix it by calling stop() and watch().

Figure: Watch Fallback Flowchart
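The two-step check can be expressed as a small decision function. This is a sketch of the logic as described in this post, not Hiver’s actual module: the names, the `probe_sent_at` bookkeeping, and the exact comparisons are assumptions, while the 20-minute interval comes from the text.

```python
from enum import Enum
from typing import Optional

CHECK_INTERVAL_SECONDS = 20 * 60  # the 20-minute mark from the post


class Action(Enum):
    NONE = "none"
    SEND_PROBE = "send_probe"
    REFRESH_WATCH = "refresh_watch"


def next_action(now: float, last_notification_at: float,
                probe_sent_at: Optional[float]) -> Action:
    """Decide what the periodic fallback check should do for one user."""
    if now - last_notification_at <= CHECK_INTERVAL_SECONDS:
        return Action.NONE  # a notification arrived recently
    if probe_sent_at is None or probe_sent_at < last_notification_at:
        return Action.SEND_PROBE  # quiet mailbox: trigger a harmless change
    if now - probe_sent_at >= CHECK_INTERVAL_SECONDS:
        return Action.REFRESH_WATCH  # the probe went unanswered
    return Action.NONE  # probe sent, still waiting for the next check
```

Keeping the decision separate from the API calls makes the timing rules easy to test, and it means stop() and watch() are only called once a probe has actually gone unanswered.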

Aftermath

Once we tested and deployed the fix, we found out that at least 200 users were affected by the watch() API bug, and our fix ended up working for most of them. Now, that was the kind of statistic to make us feel good after a couple of rough weeks :)

In closing

We hope recounting our experience helps you in your own personal Gmail development endeavours :)

In the meantime, we’re hiring at Hiver, and if you want to help solve interesting problems like these, feel free to hit us up!
