Data detour

Gil Goldenberg
Published in Gett Tech
10 min read · Jul 1, 2024

A case study of an excessive data consumption fire

Most mobile engineers are anxious about the day a ludicrous bug might be assigned to them, something so big, nasty and absurd that you simply don’t know where to even start.

These bugs arise from time to time, and handling them is part of our responsibility. No one (or at least almost no one) likes them, but they exist…

Since I moved into the Team Lead position of our mobile developers team here @ Gett, I’ve enjoyed the fruit of being able to delegate such issues to my beloved teammates, and obviously they are nailing it, but the opportunities I had to personally deal with such a bug became fewer and fewer as time went by.

Until recently.

A vague message came up in our team’s Slack channel from a tech support engineer, saying he had received a couple of reports from drivers claiming the Gett iOS driver app was exhausting their cellular data plans. A few screenshots showing a crazy amount of data being used, and an ask for help.

Our driver app is a working tool used by Gett’s drivers for long periods of time each day. A session might last 10 hours. A large amount of data usage is expected, but the numbers that were shared seemed unrealistic.

We ran a quick debugging session to make sure no obvious leak existed, and nothing came up. I asked the tech support engineer to open a trivial production bug and went on with my other, higher-priority tasks, knowing my iOS developers would tackle it at some point, like the rest of the production bugs.

Two weeks went by and another message was sent on that thread, asking for an update on the matter and adding a few more examples of such claims, together with a rise in the bug’s severity, making it a critical one. A critical bug at Gett is treated with a sense of urgency, committing to an SLA of 2 weeks.

An illustration of the 2-week SLA for solving a critical bug

Wait a minute.

The latest driver app version released to our drivers was already a month old, and any REAL issue would have already picked up pace by now. A total of 5 driver complaints can’t be treated as a critical bug that forces us to drop everything and immediately go on an adventure which, by the nature of the bug, is a unique one and requires some complex debugging. The developer capacity I have in the team is limited, and I must secure the business needs and commitments of ongoing projects.

We agreed to reduce the severity to a major bug, and I asked to gather more info that would give us a lead on where to start looking, including the driver logs (an in-house mechanism we built that allows our tech support team to fetch the app’s local logs into our databases on demand). Also, due to the nature of the issue, I asked the tech support engineer to try to get the consent of an affected driver to log into his profile, so we could debug a real use case.

While doing so, I privately discussed it with our iOS tech lead, trying to come up with some ideas, while also weighing the possibility that this bug might get higher attention and become a critical one after all, which would force us to shift capacity and deal with it instead of other critical agendas.

This possibility became the reality.

A couple of days later, on a Thursday evening, while wandering around the supermarket, I got an email updating me that the bug had been raised to the highest severity: it’s a fire!

Simultaneously, I received a few Slack messages from a few operations managers, asking me for updates on the fire. What’s the status? Is someone working on it? Is there an ETA for a solution? We got a driver who reported a data consumption of an unreal 4.7GB..! A day! That’s bad!

A screenshot shared by one of the drivers showing the massive daily usage of 4.7GB by the Gett Driver app.

Data costs money. If you’re bleeding ~5GB a day, you’ll finish a month’s worth of plan in no time. In Israel, data plans are normally unlimited, so one could claim it’s not that critical, but in the UK or the US these plans are expensive. This required my immediate attention.

I quickly completed my shopping and headed back home to dig in.

What could go wrong? Where should I put my focus? How do I even start?!

Knowing I had just gotten my own to-be-hated ludicrous bug, I felt the sense of ownership settling over me.

I assumed that with such a high volume of data bleeding, there must be files involved. Big files.

Three candidates came to my mind:

  1. In-app webviews loading pages with massive amounts of data.
  2. Our logs mechanism, which uploads log files upon request.
  3. Some 3rd-party SDK that is leaking.

The first one was quickly ruled out, and my brain ran a few scenarios on the 2nd one:

Logs

An internal issue with the logs mechanism: that must be it! All complaints are from the UK, and this mechanism has been broken in our UK environment on the server side (for a few months now). This malfunction must be causing our app to keep uploading the log files. Something is getting stuck, and we bleed.

I quickly set up the configuration needed to debug this scenario, and soon after hit the first wall. It’s a dead end. Everything is working fine here.

External SDK

Our driver app is a very complex app. Over the past decade and a half of its existence it has evolved into a gigantic app providing everything our drivers might need, allowing them to manage their entire work routine in the app. From receiving job offers and taking them, to managing their expected monthly income and statistics, to learning materials, legal materials, gamification features, or getting help via in-app chats with customer support representatives, all fueled by analytics support: we’ve got it all.

It could be anything…

Feeling the gloominess of defeat, I called it a day, though I knew it was going to haunt me in my dreams that night.

Indeed it came to my dreams, together with a hypothesis: one of the SDKs we’re using must be sending some big files constantly. The question is, which one? What file type is big enough to leave such a footprint? Probably there are images involved. Which SDK deals with images?

Our customer service SDK. We use Intercom for that functionality. One of its features is sending photos in a chat with a support representative. I quickly finished my morning coffee and turned on my Mac. Proving this should be easy. All I need to do is open a fake chat, open a network sniffing tool (we use Proxyman for that) and see what’s going on.

Indeed Intercom leaks! Each time a chat screen is opened, Intercom downloads all the chat content, including every image shared in that conversation. EACH TIME!

Jackpot! Euphoria! Solved the mystery in less than a day! I still got it.

A celebration illustrating the sense of joy I experienced at that moment

Sharing the new findings with the tech support engineer, I asked him to verify whether the affected drivers indeed use Intercom in the app and regularly send photos in the chats.

I could safely start my weekend.

The second wall

At the beginning of the week we got an update from tech support: all the drivers use Intercom regularly (so far so good…), however only one of them sends photos regularly! (Come on… seriously?! Another dead end?) Also, there was news that one of the drivers had granted his consent to use his account for live debugging!

Awesome, this is actually good news: if something is wrong with that driver’s account, we’ll quickly find it by logging in on his behalf with all the debug tools at our disposal!

So the 2nd wall was a bit of a softer wall: we proved there is a data leak bug in the SDK, which is great progress; we can escalate the issue with Intercom and improve our app’s stability.

It’s not the bug we were hunting down, but it was still great value for my invested time so far, and there was still hope with the driver’s consent to use his profile.

Rollercoaster of root causes

We intended to log into that driver’s account, turn on Proxyman and start the monkey testing: play around with the app, opening screens, closing them. In short, abuse the hell out of it, hoping to catch something…

30 seconds in, we found it!

In the app menu there is a small thumbnail of the driver’s image. It seemed like each time the menu opened, we downloaded the driver’s full-size image. This driver had a 5.5MB profile image, so downloading it on each menu open was a) redundant, and b) probably the root cause of the data drain we were looking for.

I immediately started to phrase the description of the issue in my head:

Surprisingly, we don’t have a caching mechanism around that area that would let us download the image only once. We need to set up such a mechanism, and it will eliminate the data leak. Such a fix requires a new version release, meaning only drivers who update the app will enjoy the fix. Fair enough.

Since many drivers usually take their time updating the app, we had to think of a mitigation plan:

  1. Critical Escalation: Escalate the issue with Intercom’s support and hope for a quick update on their side.
  2. Internal Fix: Set up a caching mechanism to avoid redundant fetching of the driver’s profile image (see the sketch after this list).
  3. Mitigation Maneuver: A manual reduction of image size for the 5 affected drivers, offering immediate relief.
  4. Scripted Resolution: Automate the image size reduction for the entire fleet of drivers working with us.
  5. Future proofing: Add limitations and validation to the size of the image one can upload in our back office UI.
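
To make the internal fix (item 2) concrete, here is a minimal sketch of what such a caching layer could look like. The ProfileImageCache name and its shape are illustrative assumptions for this post, not our production code:

import UIKit

/// Illustrative sketch only: a simple in-memory cache so the driver's profile
/// image is fetched from the network once per app session instead of on every
/// menu open.
final class ProfileImageCache {
    static let shared = ProfileImageCache()

    private let cache = NSCache<NSURL, UIImage>()

    /// Returns the cached image if present; otherwise downloads it once and stores it.
    func image(for url: URL, completion: @escaping (UIImage?) -> Void) {
        if let cached = cache.object(forKey: url as NSURL) {
            completion(cached)
            return
        }
        URLSession.shared.dataTask(with: url) { [weak self] data, _, _ in
            let image = data.flatMap { UIImage(data: $0) }
            if let image = image {
                self?.cache.setObject(image, forKey: url as NSURL)
            }
            DispatchQueue.main.async { completion(image) }
        }.resume()
    }
}

An NSCache-based approach like this trades a bit of memory for network: the image is downloaded once and evicted automatically under memory pressure, so repeated menu opens cost nothing.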

Let’s test our mitigation plan. I’m already logged into the 5.5MB driver’s account, so let me just upload a smaller version of his profile image in our back office site and validate that this indeed resolves the issue.

I hit upload on an equivalent, smaller version of the driver’s image file (120KB in size) and opened the app again. Proxyman in the background, I open the menu: a request to download the 120KB! A 40-times-smaller response. Profit!

If the driver had seen data consumption 40 times lower than his reported 4.7GB, he would have seen around 120MB a day. That definitely wouldn’t be alarming…

Let’s see how much data I can rack up downloading only 120KB each time, simulating a stressed driver checking his score every second or so (causing a new download of his profile image each time).

Closing the menu. Opening the menu. No new request. What?!

What’s going on? Where is our missing caching? Shouldn’t there be a request on each menu open?
I try again — no request. Was I wrong?

One step back: I upload a different image, 2MB this time, to make it stand out from the sea of network requests. Opening and closing the menu a few more times: 1 request.

There is a caching mechanism after all! The narrative took an unexpected turn.

I dove deep into our network layer code, laid my eyes on the caching configuration setup and found it. Caching exists… up to a limit of 5MB. 😅

public enum Constants {
    /// Default timeout for requests that don't define their own
    public static let defaultTimeout: TimeInterval = 30

    /// In-memory capacity for cached requests, in MB
    public static let cacheMemoryCapacity = 5

    /// On-disk capacity for cached requests, in MB
    public static let cacheDiskCapacity = 20
}

self.sessionManager.session.configuration.urlCache =
    URLCache(memoryCapacity: Constants.cacheMemoryCapacity * 1024 * 1024,
             diskCapacity: Constants.cacheDiskCapacity * 1024 * 1024,
             diskPath: nil)

Wow. That’s even better! This narrows the scope of the issue down even further: the 5.5MB profile image simply doesn’t fit within the 5MB cache limit, so it gets downloaded again on every menu open. Awesome!
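
For illustration only, and not necessarily the exact change we ended up shipping, the most direct knob to turn on the networking side is the cache capacity itself, so that a multi-megabyte profile image can actually fit:

public enum Constants {
    /// In-memory capacity for cached requests, in MB
    /// (illustrative values, raised so a multi-megabyte image can fit)
    public static let cacheMemoryCapacity = 50

    /// On-disk capacity for cached requests, in MB
    public static let cacheDiskCapacity = 200
}

Whether to rely on URLCache with larger limits or on an explicit image cache like the sketch earlier is a design choice; either way, the cache has to be large enough to actually hold the responses you care about.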

Conclusions: Lessons learnt from the data bandwidth detour

As the dust settled, our debugging odyssey left us with some valuable insights:

  • The importance of swift and decisive action in the face of critical bugs became evident.
  • Collaboration with cross-functional teams, such as our dev team, the tech support team and the DevOps team, showcased the importance of clear communication.
  • An ego-free approach and a creative mindset proved to be key in nailing down the root cause of the issue. The tech support engineer could have sent the message and forgotten about it. We could have disregarded the issue, claiming a 5-user issue is not a real issue. The DevOps engineer could have just claimed this is a mobile app issue and provided no help.
  • The manual interventions underscored the significance of real-world user feedback and the necessity for quick mitigation strategies.
  • I still got it 😜

In the end, the bandwidth turbulence that threatened our drivers’ cellular plans and our good name (and obviously our good night’s sleep) became a great story to tell and a great learning experience.
