When to rebuild a feature and how to do it
Successful products are about delivering value. But when a product has already been adopted yet suffers from core issues, how do you decide to make a radical change?
We were in that situation a couple of months back, deciding whether to keep patching one of our most loved features so that it would ‘just work’ or to rebuild it on the fly.
Move fast, break things
Most startup teams are highly frugal. They live and breathe the mantra ‘Move fast, break things’, coined by a Silicon Valley entrepreneur who happens to own Facebook, Instagram, and WhatsApp. Product teams throw ideas out of the window in a jiffy, and the tech that enabled those features goes with them.
To put it simply, when a company is hoping to get more users for its product, the tech it ships to test its ideas is generally at the bottom of the pyramid: quick to build, not built to last.
Out of the hundreds of features that get shipped, most don’t work, but a few do, and you tend to structure your entire product around the ones your users love. One such feature for us is Practice Partner Calling (internally we call it P2P Calling), which helps our users practice spoken English with a partner on the app. Users spend 130,000 minutes practicing on it every single day.
The beginning of a new beginning
The original version of P2P was tested for a few weeks before launch across different third-party WebRTC providers and was written entirely in 10 days. Any feature that is built fast and works for users is considered perfect, both for the company and for the users.
But that’s precisely when we started to encounter the limitations of tech that wasn’t built to last or to scale. We spent days, and eventually months, keeping the feature working. Every hole in the ship was plugged as soon as we could find its source, because we couldn’t afford to let it sink. The cost of rebuilding, in terms of developer time, was just extremely high.
One particular bug took our developers 4 days just to find, and another took a couple of days to fix and test. That is when we realized there was no room left for more patches. That is when we decided to rebuild this feature from scratch.
How bad was the problem?
Initially, our approach was to fix whatever broke. Every 14-day release had at least one fix for the P2P feature. Things started getting out of hand when we spent 4 days just tracking down an issue that was occurring in production. Another challenge was the number of people working on this particular feature. Every new person brought more complexity, along with the challenge of understanding the whys and hows of unfamiliar code that was only ‘just’ working. There was no documentation to make it understandable.
At one point, Naman Mahendra made a complete map and UML diagram of the feature, which enabled the team to understand the scope of the problem. Solving the problem was like fixing a leaky boat in the middle of a storm: every time we plugged one hole, two more opened up.
The decision to rebuild
Before we decided to rewrite the entire feature, we held long discussions to find all the possible ways we could solve the problem. Another reason for these discussions was to work out solutions to the edge cases.
Through these discussions, everyone became clear about one thing: this time, when we rebuilt P2P, we were not going to make it ‘just work’. For every developer on the team, this feature had become a child that they wanted to raise from the start. Also, at a time when we were already clocking 115,000 minutes of learning practice every day, we had earned the privilege and the validation to build tech that is built to last and scale. So that’s what we did: we rewrote our most loved feature from scratch.
Our goals, in the beginning, were simple:
- Analyze and figure out all possible existing issues.
- Solve for 10x higher concurrent users.
- Try to move from the bottom of the coding pyramid to the top.
Prepping for maximum agility
The majority of our users are from Tier 2+ towns and cities in India, where even owning an Android phone can be a luxury. Some users on our latest release still use a phone running Android Lollipop (version 5), and we serve about 4,500 different devices across Android versions 5 through 12.
Along with the challenge of solving for all of these different Android versions and policies, the biggest challenge that made us sweat was network issues: slow internet, no internet, internet disconnects, two users on very different internet speeds, and so on.
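To make the "internet disconnects" case concrete, here is a minimal sketch of one common way to handle a dropped connection: retry with exponential backoff and jitter. This is an illustration under our own assumptions, not the code from the feature. The names `backoff_delays` and `reconnect`, the `connect` callback, and the default timings are all hypothetical.

```python
import random
import time


def backoff_delays(count, base=0.5, cap=8.0, rng=None):
    """Yield `count` wait times in seconds, doubling the ceiling each time.

    `base` and `cap` are illustrative defaults, not values from the post.
    """
    rng = rng or random.Random()
    for attempt in range(count):
        ceiling = min(cap, base * (2 ** attempt))
        # Pick a random wait up to the ceiling so that two users dropped at
        # the same moment do not retry in lockstep.
        yield rng.uniform(0, ceiling)


def reconnect(connect, max_attempts=5, sleep=time.sleep, rng=None):
    """Call the (hypothetical) connect() until it succeeds or attempts run out."""
    delays = list(backoff_delays(max_attempts - 1, rng=rng))
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller show an error
            sleep(delays[attempt])
```

Passing `sleep` and `rng` in as parameters keeps the retry logic testable without real waiting, which helps when reproducing flaky-network cases on many devices.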
For almost two weeks, we tinkered and experimented with possible solutions to all these edge cases at every level of the system (both backend and Android) and of the network architecture, to find the best possible solution that could be built and shipped within 6 weeks.
Our days were full of performance graphs, memory profiles, UML, use case diagrams, high-level diagrams, low-level diagrams, and design patterns to improve performance even by a microsecond for the MVPs of all sub-features. Finally, we tested the design, broke the feature away from our monolith, and made it a separate microservice.
The night before the dawn
Once the design was finalised on both Android and backend, we estimated that it would take us a maximum of 6 weeks to build, test, merge and release the feature, which we later realised was a gross miscalculation on our part. After exceeding deadlines (when we thought we were almost done with the process), we found a major flaw in our design during internal testing. It got worse when we found another major bug in production the very next day. It was as if we were fighting a battle.
The week that followed showed what relentless pursuit and extreme focus could achieve. All of us went back to the drawing board, rewrote our objectives, and figured out and solved EVERY SINGLE edge case. This was followed by restructuring major chunks of our codebase and attempting to reproduce all possible edge cases on tens of devices under different conditions. What worked for us was the same child-like love for the feature, for which we would do ANYTHING.
Things coming together
After about 50 days, things started coming together. Everything that we had hoped to achieve finally started to materialise.
We conducted internal testing with our 35+ team members. Everything worked flawlessly! I don’t think anything could have matched that feeling of joy and contentment.
Kill the old ways, don’t wait for them to die
Rewriting features can be tough, but what’s tougher is deciding when and how to make the call to overhaul.
At the time, we thought we were late in rewriting our most loved feature, because the cost of fixing things in every sprint had become too high for us. In retrospect, we can say that it was the perfect time to make that decision.
We knew that things were slowly reaching a stage where it would be impossible to fix anything, but by making all those small fixes and holding things together for about 6 months, we figured out all the possible causes of failure. Identifying these causes gave us much better insight while creating the objectives for this 2-month-long sprint.
By the end of the sprint, we managed to achieve the following results with our newly written Practice Partner Calling feature:
- Call connection time was down from 8 to 2 seconds.
- Significant increase in median time that people spend per call.
- Exchange of all possible states such as mute, hold, and connectivity issues amongst users on call.
- Improvement in call connection percentage (number of connected calls).
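For readers who want to track similar numbers, here is a minimal sketch of how results like these could be computed from call logs. It is an illustration, not our actual analytics code. The record fields (`connected`, `setup_seconds`, `talk_seconds`) are hypothetical, and it assumes that call connection percentage means connected calls divided by attempted calls.

```python
from statistics import median


def call_metrics(calls):
    """Summarise a list of call records.

    Each record is a dict with hypothetical keys:
      connected     - True if the call was established
      setup_seconds - time from dialling to connection (connected calls only)
      talk_seconds  - time spent on the call (connected calls only)
    """
    if not calls:
        raise ValueError("need at least one call attempt")
    connected = [c for c in calls if c["connected"]]
    summary = {
        # Assumed definition: connected calls / attempted calls.
        "connection_pct": 100.0 * len(connected) / len(calls),
        "median_setup_seconds": None,
        "median_talk_seconds": None,
    }
    if connected:
        summary["median_setup_seconds"] = median([c["setup_seconds"] for c in connected])
        summary["median_talk_seconds"] = median([c["talk_seconds"] for c in connected])
    return summary


# Made-up sample data, only to show the shape of the input.
sample = [
    {"connected": True, "setup_seconds": 2, "talk_seconds": 300},
    {"connected": True, "setup_seconds": 3, "talk_seconds": 600},
    {"connected": True, "setup_seconds": 1, "talk_seconds": 120},
    {"connected": False},
]
print(call_metrics(sample))
```

We use the median rather than the mean for time per call because a handful of very long or very short calls would otherwise skew the number.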
A lot of our developers’ and QA team’s time went into building this over the last few months. White Star Line may not have been able to, but we surely managed to return our majestic Titanic to its true glory.
A huge shoutout to Gopal Agarwal, Naman Mahendra, Harish Kumar, Bhaskar Singh and Harsh Mehta for contributing to this blog post!