Software reliability & legacy code — Sustaining Engineering Story

Joanna Boruń
Legacy Systems Diary
6 min read · Jun 18, 2018

This is going to be a story about a monolith, legacy components, rapid development, reliability (or lack thereof) and change.

Chapter 1

Let me start by giving you some background to the story.

Once upon a time, in the early 2000s, Wikia was built on the foundation of MediaWiki. Like most projects from that era, much of the platform was not written with today’s best software development practices in mind. Wikia’s engineering team had been extending the MediaWiki platform since 2004, and those efforts suffered from similar technical debt and bad practices.

Within the organization, an SLA process was created to ensure that the product and engineering organization resolved problems efficiently, respecting the priority levels determined by Product Managers.

In general, it accomplished its primary goal, but it also brought many challenges:

  • responsibility for problematic areas of the platform was concentrated in only a few of the engineering teams
  • some portions of the platform were so poorly understood that it took engineers a significant amount of time to fix bugs
  • due to its unpredictable nature, this work often wreaked havoc on the schedules of the heavily burdened teams
  • context switching costs were very high, further decreasing productivity
  • the little knowledge gained about legacy systems remained contained within a few teams and individuals, which further perpetuated the problems

This setup persisted for years, until its impact grew so large that the vast amount of time product teams spent dealing with non-product issues could no longer be ignored.

In January 2016, the exec staff decided on a silver bullet for these problems: the creation of a new engineering team, Sustaining Engineering (SUS), which was to take over all the legacy features and uncork the product teams’ pipelines so that they could focus on product development work.

The SUS team’s goal was to:

  • eliminate external distractions to allow other teams to focus on their goals
  • distribute knowledge about legacy products/systems throughout the product/engineering organisation
  • increase the predictability of engineering resource availability for individual teams
  • help inspire high quality software development

The team was to consist of a Tech Lead, an Engineer, an Engineering Manager, a Product Manager and Community Support. On top of that there were rotational roles: every other Engineer (including QA team members) in the organization was committed to spending four weeks per year (in either two two-week chunks or a continuous four weeks) as part of the team.

SUS Manager trying to get people on SUS rotation

And this is how the Sustaining Engineering team came to life.

Chapter 2

So we had this team to take over all the legacy components, save the day and make our software fine and dandy — sounds great!

However, it’s easier said than done. Don’t get me wrong, we had excellent engineers with great domain knowledge and their hearts in the right place, but it was way too easy to fall into the trap of “mindlessly” shuffling bugs with no foreseeable end, leaving the SUS team in constant firefighting mode.

That’s why we came up with the following idea: take over just a few components at a time, look at all the issues on the books for those components, treat them as symptoms and try to identify the root causes.

This way the SUS team focused on making each feature stable, reliable and working properly instead of just fixing reported and incoming bugs.

This approach made a huge difference: tickets became clues that the team used to direct its efforts in the right direction. We were working on the areas which generated these issues and making the platform better. As components were stabilized, the related tickets were bulk closed; our stats went through the roof and the team, instead of being demotivated by constant firefighting, was actually excited by its work.

That worked really well, but all fine things must end eventually. We hit many obstacles along the way, like:

  • problems with pulling engineers from other teams for SUS rotations,
  • the context losses which came with shorter rotations,
  • the requirement to dramatically increase the pace of component takeover.

But the hardest obstacle came with significant company-level changes, which reduced the SUS team by removing the QA (sick!), the Product Owner and most of the rotating Engineer slots, leaving SUS with only one rotational engineer.

Apart from the obvious effects of not having enough engineers and no QA, we also faced the problem of not having a Product Owner and therefore dealing directly with all the stakeholders (at some point pretty much everybody was a stakeholder due to the massive component handover), each doing their best to pressure the team with their ‘most important’ tickets. As you can imagine, without a Product Owner the SUS team had some difficulties deciding what was the most important thing to do…

Chapter 3

“The real problem is not whether machines think but whether men do.”

As the words of wisdom go, “change is the end result of all true learning”, and that is what we had to do again. In response to a clear need for a new process that would help us prioritise our work, we came up with Mycroft Holmes (High-Optional, Logical, Multi-Evaluating Supervisor), aka Mike.

As we couldn’t rely on different stakeholders’ qualitative assessments of component value, we started to rely entirely on component statistics to prioritise our work.

Our Mycroft Holmes tool started gathering metrics for all our components based on Jira tickets, Community Support’s gradation and the components’ code quality. The most important code quality metrics we included were:

  1. cyclomaticComplexity: measures the number of linearly independent paths through a program’s source code and is used to gauge how complex the program is (see the sketch below this list):

  • 1 to 10: fairly simple code, introducing insignificant risk
  • 11 to 20: the code introduces an average level of risk
  • 21 to 50: very complex code, associated with high risk
  • above 50: unstable code, carrying a very high risk

  2. afferentCoupling: the number of classes affected by this class
  3. efferentCoupling: the number of classes that this class depends on
  4. maintainabilityIndex: based on Halstead’s metrics, LOC and the cyclomatic complexity number:

  • below 64: low maintainability; the project probably has technical debt
  • 65–84: medium maintainability; the project has problems, but nothing really serious
  • above 85: high maintainability; the project is probably in good shape
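To make those thresholds a bit more tangible, here is a minimal, purely illustrative Python sketch of how cyclomatic complexity is counted by hand; the function and its ticket fields are made up for the example and are not part of Mike.

```python
# Hypothetical example: counting decision points by hand.
# Cyclomatic complexity = 1 (base path) + number of decision points.

def classify_ticket(ticket: dict) -> str:
    if ticket.get("blocker"):                             # +1 decision point
        return "P1"
    for label in ticket.get("labels", []):                # +1 decision point (loop)
        if label == "legacy" and ticket.get("reopened"):  # +1 (if) and +1 (and)
            return "P2"
    return "P3"

# 1 + 4 = cyclomatic complexity of 5, which falls into the
# "fairly simple, insignificant risk" bucket (1 to 10).
```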

The above-mentioned metrics, along with JIRA ticket stats, Community Support input, component usage volumes and a few other factors, were each assigned a weight. Metrics were gathered automatically every Friday evening and uploaded to ‘Mike’.
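Just to illustrate the idea (the real metric names, weights and components were configured in Mike and are not shown here), a weighted priority index of this kind could be computed roughly like this:

```python
# Sketch of a weighted component priority index.
# Metric names, weights and component names are hypothetical.

WEIGHTS = {
    "open_jira_tickets": 0.30,        # volume of reported problems
    "community_support_grade": 0.25,  # how painful the component is for users
    "cyclomatic_complexity": 0.20,
    "maintainability_index": 0.15,    # inverted below: low maintainability -> higher priority
    "usage_volume": 0.10,
}

def priority_index(metrics: dict) -> float:
    """Combine normalized (0..1) metric values into a single priority score."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = metrics.get(name, 0.0)
        if name == "maintainability_index":
            value = 1.0 - value  # well-maintained code should rank lower
        score += weight * value
    return score

components = {
    "ComponentA": {"open_jira_tickets": 0.8, "community_support_grade": 0.6,
                   "cyclomatic_complexity": 0.7, "maintainability_index": 0.3,
                   "usage_volume": 0.9},
    "ComponentB": {"open_jira_tickets": 0.4, "community_support_grade": 0.2,
                   "cyclomatic_complexity": 0.5, "maintainability_index": 0.7,
                   "usage_volume": 0.3},
}

# List components from highest to lowest priority.
for name, metrics in sorted(components.items(),
                            key=lambda item: priority_index(item[1]),
                            reverse=True):
    print(f"{name}: {priority_index(metrics):.2f}")
```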

As a result we got a logical list of components with their relative priority index, which not only replaced a Product Manager, but also provided data which dramatically improved our negotiating power with the stakeholders regarding component value.

Chapter 4

Of course, that’s not the end. By now we’ve got our QA back and over 60 components in our ownership, and we’ve had some spectacular accomplishments along the way; the search optimization work alone resulted in:

  • Index size: from 404 to 225 GB (down by 44%)
  • Number of documents: from 250 to 168 million (down by 33%)
  • Average Solr query time: from 4 to 1.9 ms

Average Solr query time (in ms). Spikes come from replication; the smaller the index, the smaller the impact of replication on response times.

Index size (in GB)

Huge performance improvements on databases

Here’s the graph showing the number of database queries before and after the fix was released.

The road from the start to where we are now was not an easy one. Sometimes it felt like a rollercoaster ride, on other occasions like pushing forward through a swamp, but it has always been a rewarding experience.

As things stand now, we’ve got our QA back, have over 60 components under our wing and are enjoying the work we do. Not everything is peachy though, as our team is facing new issues and problems, but… that’s a separate story for another time ;)
