“Drive Magic” II.

Balázs Németh, Senior software developer @Doctusoft

If you haven’t read my previous post on this topic, I urge you to do so. It gives some context that might be required for understanding everything I’m referring to. Not to mention that it might be useful for you on its own anyway ;) I touched on some areas where I think the Drive API is lacking (like methods used by the UI that are not available publicly), constraints present due to the original underlying architecture (like how folders were acting like labels), and how some small issues in Drive’s backend can have a huge influence on products using it (like no-op actions triggering change notifications). You can find it here.

So… let’s start with the toughest nut to crack. I say this because this was by far the biggest black box, and it has cost us the most time to address in accordance with our quality standards. In many cases we had to strive for as high a throughput as possible, which resulted in multiple iterations to tweak and perfect the system. However minor a new piece of information might have seemed, it could easily have resulted in a sizeable refactor.

Oh boy, you are in a world of pain on this one. Do you know what the official recommendation is for mitigating rate limit exceeded issues? Use exponential backoff. That’s it. You would probably think they would help you in some way to avoid spamming the API unnecessarily, right? Nah. That’s for losers. You are a pro. Solve it yourself. (Check how the GitHub API handles rate limits, for example, and you will understand what I’m talking about: https://developer.github.com/v3/#rate-limiting). There is no official explanation of how they calculate it or how it works. Nothing. All you can do is run benchmarks and make educated guesses based on the findings. If you are lucky you might get hints from support. Apart from that…

Yes, exponential backoff works… and it’s enough… up to a point. When you reach that point and see how much of your resources (and essentially money) gets burnt on requests that get rejected due to a rate limit, you start to think about how to optimize it. The first step, going under the rate limit by a big margin so you won’t have issues, can be rather easy. The next one, when you try to make as many requests as possible because you want lower latency and/or you simply have a lot of tasks to do and need the throughput, is where it gets tricky. You essentially have to implement your own solution that changes the throughput based on the rate limits you encounter. Don’t get me wrong. Using exponential backoff and reducing the throughput on our side when we reach the rate limit is must-have functionality. My problem is that it’s the ONLY input the whole system can use.
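To make the idea a bit more concrete, here is a minimal sketch of backoff combined with a self-adjusting target rate. Everything in it is an assumption for illustration: `RateLimitError` is a hypothetical stand-in for whatever your client raises on a rate limit response, `AdaptiveRate` is a simple additive-increase / multiplicative-decrease controller, and the numbers are arbitrary. It is not what we run in production, just the general shape.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a 'rate limit exceeded' API response."""


class AdaptiveRate:
    """Target request rate (requests/s) that only learns from rate limit hits."""

    def __init__(self, initial=1.0, floor=0.1, ceiling=10.0, step=0.1, cut=0.5):
        self.rate = initial
        self.floor, self.ceiling = floor, ceiling
        self.step, self.cut = step, cut

    def on_success(self):
        # Additive increase: creep upwards while calls keep succeeding.
        self.rate = min(self.ceiling, self.rate + self.step)

    def on_rate_limited(self):
        # Multiplicative decrease: back off sharply when the API pushes back.
        self.rate = max(self.floor, self.rate * self.cut)


def call_with_backoff(fn, rate, max_retries=5, base_delay=1.0,
                      max_delay=64.0, sleep=time.sleep):
    """Call fn(), retrying rate-limited attempts with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            result = fn()
        except RateLimitError:
            rate.on_rate_limited()
            if attempt == max_retries:
                raise
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 1s of jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 1))
        else:
            rate.on_success()
            return result
```

A scheduler would then pace itself from `rate.rate`, for example by waiting roughly `1 / rate.rate` seconds between dispatching calls. Note that the only feedback going in is still "did we get rate limited or not", which is exactly the limitation I’m complaining about.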

Then, as if this weren’t enough, the rate limit changes based on the nature of your calls. This is where the educated guesses come into play:

  • Read-only calls have higher throughput than write calls.
  • Burst throughput is higher than sustained.
  • Read throughput seems to be unaffected by how many distinct files the calls target, whereas write throughput drops if concurrent calls try to modify the same file.
  • Read throughput seems to be the same regardless of the concurrency of calls, whereas write throughput suffers if the concurrency is too high.

I thought about creating fancy charts for this, but I realized they might be outdated by the time you read this. The behaviour has already changed multiple times over the years. But don’t be sad. I certainly don’t mind. In fact I enjoy it. I like the challenge of doing the same task over and over again to investigate the behaviour.

Was it convincing enough? Anyway let’s just hope it’s still valid and use these relatively safe magic numbers. Sustained read 10/s, sustained write 1/s.

Ohh, I almost forgot. The rate limiting is applied on a per-user basis, and a call on a file counts against the file’s owner, not the user we make the modification as. So if you want to implement some throttling, you will have to know whom the file belongs to. Logical, isn’t it? ;)
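Put together, the magic numbers and the per-owner rule suggest something like the following sketch: one token bucket per (owner, call kind) pair. The class names, the `clock` parameter and the choice of burst capacity are all my assumptions for illustration; only the 10/s and 1/s figures come from the measurements above, and they may well be outdated.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class OwnerThrottle:
    """Keeps separate read and write buckets for every file owner."""

    # Sustained rates from the text; treat them as guesses, not guarantees.
    RATES = {"read": 10.0, "write": 1.0}

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.buckets = {}

    def try_acquire(self, owner, kind):
        key = (owner, kind)
        if key not in self.buckets:
            rate = self.RATES[kind]
            # Assumption: allow a burst of at most one second's worth of calls.
            self.buckets[key] = TokenBucket(rate, capacity=rate, clock=self.clock)
        return self.buckets[key].try_acquire()
```

Before each call you would look up the owner of the target file and ask `throttle.try_acquire(owner, "write")`; if it returns `False`, requeue the task instead of burning a request on a near-certain rejection.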

Yet another thing that doesn’t work the way you would probably expect. At least it certainly didn’t for me. When I first discovered batching I thought (okay, maybe hoped) that a batch request would count as only one request towards the rate limit. It would have been heaven. Obviously that is not the case. I know, I shouldn’t dream. Okay, my next instinct was that it is only there to alleviate the overhead of executing HTTP requests, and that each call inside the batch counts as it normally would. Even the documentation explicitly states that now (https://developers.google.com/drive/v2/web/batch). Doesn’t matter. In practice it’s somewhere in the middle. It increases throughput, but not in a linear or consistent way. If you still have no need for a system like the one described in the “Rate limiting” section, you are rather lucky, as you can utilize this to a certain degree. On the other hand, if you already have one and you are aiming at consistent “as-big-as-possible” throughput, then it’s one more unknown factor in your equation. GL&HF.
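For reference, this is roughly what batching looks like with the Google API Python client. It is a sketch under assumptions: `service` is a client built with `googleapiclient.discovery.build`, `requests` is a list of not-yet-executed request objects, the helper name `execute_in_batches` is mine, and the default of 100 calls per batch reflects the documented limit as I remember it. Keep in mind that every call inside a batch still has to be accounted for in your throttling.

```python
def execute_in_batches(service, requests, callback, batch_size=100):
    """Send prepared API requests in batches; returns the number of batches.

    `callback` is invoked once per inner request by the client library
    with (request_id, response, exception).
    """
    sent = 0
    for start in range(0, len(requests), batch_size):
        batch = service.new_batch_http_request(callback=callback)
        for request in requests[start:start + batch_size]:
            batch.add(request)
        batch.execute()
        sent += 1
    return sent
```

In the callback you would check `exception` for rate limit errors per inner request, because a batch can partially succeed.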

It was rather easy. Just get a sufficiently big folder structure with enough files in it, shared with a lot of users/groups… and then start modifying things a lot. Every shared file has subscribers listening for changes, and the more people it is shared with, the more of those exist. When you make a single API call, it might trigger a lot more events in the backend. If you make a LOT of API calls, well, that is how you get weeks- or months-long queues on the Drive backend.

Imagine you have a resource which has a list property and you try to update/patch that property. I’m pretty sure you have an idea of how it should behave for empty or missing values. For example, what would you expect when an empty list is provided? I would expect the list to be cleared. Well, that is not how drive.files.update/patch works for file.parents (https://developers.google.com/drive/v3/reference/files/update). Simply nothing happens. Rather confusing, given that if you provide at least one parent value it works as anyone would expect. I know it has “addParents” and “removeParents” optional query parameters, but FFS, why the inconsistent behaviour, and why does it force developers to fetch the current parent list before clearing it out? Not to mention there isn’t a single warning or piece of documentation anywhere mentioning that it won’t work like that. It just silently skips it. The very same issue happens with drive.files.copy. I would expect the copy to be created without a parent if I provide an empty list, but it makes no difference. You literally have to make another call to remove every parent the newly copied file has. Makes sense, doesn’t it? :D
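The workaround boils down to a get followed by an update with `removeParents`. Here is a minimal sketch, assuming `service` is a Drive v3 client built with the Google API Python client; the helper name `clear_parents` is mine. Whether Drive still accepts a file without any parent is something to verify against the current documentation, as this reflects the behaviour at the time of writing.

```python
def clear_parents(service, file_id):
    """Remove every parent of a file, since an empty `parents` list is ignored."""
    # First call: fetch the current parents, because we must name them explicitly.
    current = service.files().get(fileId=file_id, fields="parents").execute()
    parents = current.get("parents", [])
    if not parents:
        return current  # nothing to remove
    # Second call: removeParents takes a comma-separated list of parent IDs.
    return service.files().update(
        fileId=file_id,
        removeParents=",".join(parents),
        fields="id, parents",
    ).execute()
```

The same helper can be called on the result of a copy to strip the parents the new file inherited.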

The following have been fixed for a while now, but they are definitely worth mentioning. You know, just to realize how much better it is now :)

There are obvious response codes like 404 Not Found. Some of them mean the request shouldn’t even be retried. Some of them mean it should be retried immediately within the same request. Some of them mean it should be retried later. Handling those was never an issue. Documentation on the matter was a lot scarcer than it is now (https://developers.google.com/drive/v2/web/handle-errors), so figuring out what to handle was a matter of trial and error and logging whatever unhandled responses you got. It was not convenient, but eventually you could cover the errors you usually got… and then, just because why not, they changed a few of those undocumented response codes without any warning or documentation at all. Just because why not. We were not happy. At all.
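The three buckets map naturally onto a small classifier. The sketch below is an assumption-laden illustration, not the mapping we actually used: the `Action` enum and `classify` are hypothetical names, the status/reason pairs are the ones I would expect from the current error handling guide, and anything unrecognised is logged so you can extend the table.

```python
import logging
from enum import Enum

log = logging.getLogger(__name__)


class Action(Enum):
    FAIL = "fail"                # do not retry at all
    RETRY_NOW = "retry_now"      # fix something (e.g. refresh token), retry at once
    RETRY_LATER = "retry_later"  # back off and requeue

# Error reasons Drive reports on 403 responses caused by rate limiting.
RATE_LIMIT_REASONS = {"rateLimitExceeded", "userRateLimitExceeded"}
KNOWN_FAILURES = {400, 403, 404}


def classify(status, reason=None):
    """Map an HTTP status (and optional error reason) to a retry decision."""
    if status == 403 and reason in RATE_LIMIT_REASONS:
        return Action.RETRY_LATER
    if status == 429 or 500 <= status < 600:
        return Action.RETRY_LATER
    if status == 401:
        # Assumption: the caller refreshes credentials before retrying.
        return Action.RETRY_NOW
    if status not in KNOWN_FAILURES:
        # The trial-and-error part: record what we have not seen before.
        log.warning("Unhandled response: status=%s reason=%s", status, reason)
    return Action.FAIL
```

Keeping the mapping in one place at least means that when a response code changes without warning, there is a single function to patch.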

There was a time when any existing file was accepted as a parent. Even documents. It was a nice way to hide files from prying eyes, although I’m sure it wasn’t intended. :) Also, a file’s ownership could be transferred to a group and eventually be lost completely. :) Anyway, both of these bugs have since been fixed.

Fortunately this seems to be a thing of the past, but if you have ever received an error from the Document List API, I’m sure you will never forget it. Getting a whole HTML page inside the response. Pure epicness.

Even the much better/newer Drive API deserves an honorary mention. I mean, who wouldn’t want to read error logs with Japanese messages :)

As you have seen, I have listed a lot of issues, missing features and odd behaviours, and trust me… there are more :) … and I still prefer working on apps running on GCP that rely heavily on the Drive API over many other things I could be doing. Obviously a big part of that subjective view comes from GCP, so it’s not all about Drive :) … and despite my long rant, most of these issues have been addressed somehow already. Some of them were fixed by Google, but for most of them we managed to find some kind of workaround. You know, sometimes we actually have to work for our salaries, and it’s not always about waiting for compilation/deployment/tool execution ;)


Originally published at medium.com on February 12, 2018.