Lessons From a Rewrite Gone Right

Published in

Kustomer Engineering

8 min read · Feb 8, 2022

David Druker and John Chen are frontend engineers at Kustomer and the authors of this blog post.

Never do a rewrite. Ever since Joel Spolsky wrote “Things You Should Never Do” 20 years ago, avoiding rewrites has been accepted as best practice across the software industry.

We recently faced the same question as the Netscape team in Joel’s post: to rewrite or not to rewrite. Our results were very different. Was Joel wrong? Are there circumstances where a rewrite is actually the better option? With this experience behind us, we want to share our answers to these questions.

Today, live chat is Kustomer’s biggest messaging channel by volume, even bigger than email. Our live chat web app is loaded more than half a billion times a month all over the world. It is one of the fastest and most lightweight live chat apps in the customer service industry, with a complete bundle size of less than 200KB of JavaScript, while other chat apps can be 5X bulkier at up to a megabyte. Our chat app also has decoupled UI and API components, so our clients have the much-requested ability to build their own chat interface that connects to the Kustomer platform.

But things weren’t always like this.

Just a year ago, we failed to close a deal because we didn’t have the one feature a major potential client wanted: the ability to build their own UI. Additionally, scalability and reliability problems lurked within our live chat app. These were long-standing issues that required architectural improvements, including:

A large bundle size: After collecting information about the live chat apps of our competitors, we found our app was actually competitively sized. However, we had several ideas for how to significantly reduce our bundle size further given the chance.
Namespace pollution bugs: As an embed, our chat widget shares the same DOM as our customers’ websites. Because we support a wide range of browsers, including IE11, we used polyfills which sometimes clashed with our customers’ code and dependencies. Additionally, since every customer’s website was unique, each bug was unique, creating a huge time sink for both the customer experience and engineering teams.
Message reliability issues: As Kustomer grew and we signed more international customers, we started to notice a growing number of performance issues related to low bandwidth and less stable network connections. Intermittent blips in connectivity could mean a customer would never receive a support agent’s message, which was unacceptable.

This is when we faced the question: to rewrite or not to rewrite?

For many software projects, freezing feature development for at least a few months to work on a rewrite is simply out of the question, and an incremental refactor is the only option. In our case, because the live chat app was not Kustomer’s core piece of software, a rewrite was easier to justify, and the company was aligned on potentially having to do one.

We did eventually choose to rewrite the codebase, and here’s how we made our decision.

Rough initial estimates told us a refactor would take around three months and a rewrite would take one to two months longer. However, each of the above changes carried substantial added complexity and risk if done through an incremental refactor. In fact, it could be argued any of them could have justified a rewrite on its own.

On the other hand, we knew from literature such as Joel’s blog post that rewrites could be a dangerous proposition. A common reason organizations choose a rewrite over a refactor is that new patterns can look so different from existing patterns that a rewrite without the baggage seems easier to execute.

And this is true — in the beginning.

As a rewrite progresses, some parts of the old code may prove more difficult to replicate in the new codebase, and the goal of solving old problems may give way to new, unanticipated problems. The original estimates start to balloon, and for every step you take, the end of the tunnel seems to grow two steps further away.

Essentially, even with a rewrite, we faced substantial complexity and risk. The difference was that with an incremental refactor, we were confronted with the risk upfront, while a rewrite’s risk was deferred. For example, one of our ideas to reduce the bundle size was to move from React to Preact. We could see immediately that this would be difficult to do incrementally, as it would require us to maintain both UI libraries while migrating functionality over. It would also require a lot of effort to thoroughly test for regressions over several months.

With a rewrite, the risks weren’t as obvious, but there were many open questions, any of which could block the rewrite entirely and force a do-over. For example, does Preact offer the features we need to achieve feature parity with our original chat app written in React? Does Preact have any open issues that could block us in the future?

Before we could start a rewrite, we needed to create a list of all of our open questions such as the ones above. Then we would need to address these open questions and make sure that our attempt to solve an existing problem would not simply spawn several new problems later.

We did this by:

Reaching out to mentors and other senior engineers both internally and externally to help add and answer questions on our list.
Reading the book Third Party JavaScript. This is a must-read for building web apps that live on a customer’s website. Both our use of src-less iframes and our use of the HTML data attribute in our embed script came from the book.
Reading documentation on all the new technologies we were considering like Preact, src-less iframes, and PubNub to add and answer more open questions.
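
To make the src-less iframe and data attribute ideas concrete, here is a minimal TypeScript sketch of how an embed script could combine them. It is an illustration under assumptions, not the actual Kustomer implementation: the names `EmbedConfig`, `readEmbedConfig`, `mountWidgetFrame`, the `data-api-key` and `data-lang` attributes, and the small `...Like` interfaces (stand-ins for the browser's DOM types) are all hypothetical.

```typescript
// Hypothetical settings shape; the field names are illustrative only.
interface EmbedConfig {
  apiKey: string;
  lang: string;
}

// Minimal structural stand-ins for the DOM objects used below, so the sketch
// does not depend on DOM typings being available.
interface FrameDocumentLike {
  open(): unknown;
  write(html: string): void;
  close(): void;
}
interface FrameLike {
  title: string;
  contentDocument: FrameDocumentLike | null;
}
interface HostDocumentLike {
  createElement(tag: "iframe"): FrameLike;
  body: { appendChild(node: FrameLike): unknown };
}

// Read settings from the embed script tag's data-* attributes, for example
// <script src="widget.js" data-api-key="abc" data-lang="fr"></script>.
// Browsers expose data-api-key as dataset.apiKey.
function readEmbedConfig(dataset: Record<string, string | undefined>): EmbedConfig {
  const apiKey = dataset["apiKey"];
  if (!apiKey) {
    throw new Error("embed script is missing data-api-key");
  }
  return { apiKey, lang: dataset["lang"] ?? "en" };
}

// Static shell written into the iframe; the widget would render into #chat-root.
const WIDGET_SHELL =
  '<!DOCTYPE html><html><head></head><body><div id="chat-root"></div></body></html>';

// Create an iframe without ever setting src, attach it to the host page,
// and write the widget's own document into it.
function mountWidgetFrame(host: HostDocumentLike): FrameLike {
  const frame = host.createElement("iframe");
  frame.title = "Chat";
  host.body.appendChild(frame);
  const inner = frame.contentDocument;
  if (!inner) {
    throw new Error("iframe document is not available");
  }
  inner.open();
  inner.write(WIDGET_SHELL);
  inner.close();
  return frame;
}
```

In a browser, the host page's `document` and the embed script element's `dataset` would play the roles of `host` and `dataset`. Because the iframe has no src, it starts as a same-origin blank document the embed script can write into, while the host page's stylesheets do not cascade into it and any polyfills loaded inside it stay on the iframe's own globals.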

After compiling the open questions, we set aside an entire sprint (two weeks) primarily just to find answers. During this sprint, our two-person team pair programmed a barebones version of the chat app, which let us answer questions such as:

How are we going to make sure our app is supported in IE11?
Is our app protected from our customers’ CSS?
How would we test our Preact code?

We also thoroughly scrutinized the original chat app code to ensure our new design supported feature parity.

By the end of the two weeks, we were armed with a working prototype that modeled what we expected to be our final implementation. If we wanted, we could have just kept building on top of the prototype and turned it into the product. By confronting the hidden risk of a rewrite upfront with a prototype, we could more confidently estimate the amount of effort the rest of the rewrite would take. Only when we saw that our updated estimates were close to our initial rough estimates, a timeline the company considered reasonable, were we finally able to make the decision to rewrite.

So was Joel wrong?

We still like Joel’s blog post, and we still think it gives a valuable perspective on the risks of a rewrite. However, as widely applicable as that advice is, our project shows that certain factors can make a rewrite a reasonable option.

Looking back, we see the prototype as the critical reason why our rewrite went the way it did. However, there were also other important factors that we feel played a large role in our eventual success:

1. Enough Time to Validate our Technical Design

Our plan for the rewrite contained several entirely new ideas without precedent at Kustomer, so the most important factor to the success of the project may have been the fact that we could spend enough time prototyping and validating our hypotheses to ensure our planned solutions could really work out in the wild. This required us both to spend an entire sprint not focused on delivering features, but just experimenting and researching, with possibly nothing to show for it after. Through this process, we recognized the amount of trust we were given and the value of working in an environment that encourages engineers to build things the right way. Both of Kustomer’s founders are engineers and they continue to make substantial contributions to the platform even to this day. Without this environment, it may have been more difficult to push for a rewrite which would require several months of no new features on our existing chat product.

2. Reasonable Scope of Project

Something else to consider is that it was possible for us to complete a prototype of our chat app project in two weeks. We realize as an app grows larger, the process of creating a list of all your open questions and making a working prototype can take significantly longer as you try to ensure all needed features can be accommodated by the new design. As that process takes longer, the potential benefit of the rewrite may start to wane.

3. Small Team

The team behind the frontend rewrite was small (two developers) and having few channels of communication meant everyone was always in sync. This meant we could set up meetings very easily, often meeting multiple times a day for quick discussions, and could shift directions very easily, because there was only one other person to communicate changes to. What is actually considered a small team? This depends on the project you are working on, but we would recommend no more than the number of easily parallelizable divisions of labor in the project. For us, a team of two was perfect since it meant one person could work on the UI layer while the other worked on the API layer.

Challenges

On the other hand, the project was not without challenges, and there are a few things we would have done differently:

1. Clarifying Feature Parity

One of the primary goals of any rewrite is to attain feature parity with the existing product. However, during our planning phase, we did not spend enough time figuring out where that line was to be drawn. The problems came as we approached release and realized that functionality we thought was safe to drop was actually crucial for some of our customers, and dropping it would be a dealbreaker. This forced us to shift our feature parity goalposts late in the game. While we were able to deal with this without greatly impacting project timelines, we could have avoided it by spending more time verifying our customers’ usage through code-level tracking rather than only through customer interviews.
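
One way code-level tracking like this could look is a thin wrapper that counts calls to each public API method. The sketch below is hypothetical: `trackUsage`, the `counts` object, and the example method names are made up for illustration, and a real version would report the counts to an analytics backend instead of keeping them in memory.

```typescript
// Generic function type for public API methods; `any[]` lets methods with
// arbitrary parameter lists be wrapped.
type ApiMethod = (...args: any[]) => unknown;

// Return a copy of `api` in which every method increments a per-method
// counter in `counts` before delegating to the original implementation.
function trackUsage<T extends Record<string, ApiMethod>>(
  api: T,
  counts: Record<string, number>,
): T {
  const wrapped: Record<string, ApiMethod> = {};
  for (const name of Object.keys(api)) {
    const original = api[name];
    wrapped[name] = (...args: any[]) => {
      counts[name] = (counts[name] ?? 0) + 1;
      return original.apply(api, args);
    };
  }
  return wrapped as T;
}

// Example with made-up method names.
const usageCounts: Record<string, number> = {};
const chatApi = trackUsage(
  {
    open: () => "opened",
    sendMessage: (text: string) => text.length,
  },
  usageCounts,
);
chatApi.open();
chatApi.sendMessage("hello");
chatApi.sendMessage("are you there?");
```

Methods whose counters stay at zero across all customers are candidates for dropping, while any method with real traffic belongs on the feature parity list.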

2. Spending More Time on a Migration Strategy

During project planning, not enough time was spent deciding on an optimal migration strategy.

There were several possible approaches:

Allowing our customer to embed both versions of the chat product at the same time.
Only allowing one to be embedded at a time, but making the APIs identical so there were no breaking changes after switching over.
Requiring a rip and replace with breaking API changes and only allowing one version at a time.

Our eventual implementation required our users to go with the last option. To encourage our customers to migrate, after we released the new chat app to production, we imposed a generous sunset date on the older version. We also released a migration guide to help our customers find out how the original chat app’s API methods mapped to the new chat app’s. Ultimately, our customers were able to move over without too much trouble, but this process could have been smoother.

Kustomer is growing, and the new live chat app was recently handed over successfully to a new team of excellent engineers. As we embark on new projects, we feel more prepared thanks to the knowledge and experience we’ve gained through this journey. We hope that when you find yourself at a crossroads, facing the same question we faced a year ago, the lessons we learned can be useful to you too.

Written by John Chen