Dealing with a rotten banana

Code is not a thing carved in stone. It evolves to meet constantly changing requirements. It will eventually go away and be replaced by The Next Great Feature. But does your data follow the code on its journey? Does it go away when your feature’s code hits /dev/null?

Maciej Brencz
Legacy Systems Diary
4 min readFeb 21, 2018

--

The other half of this banana is (believed to be) healthy

Working in a team that deals with a legacy software is a great learning experience. You can see how bad choices from N years ago impact the current system. How badly performed code changes or removals leave traces in the data storage.

The story of page_vote table

Our main, MediaWiki-powered application is ten years old. We started with few dozens of wikis to reach the number of 400k active wikis today. A huge amount of features were introduced, some were later extended, some were removed or replaced.

Back in Wikia’s early days we implemented a simple five stars rating system that either a registered user or an anonymous visitor could use to vote on any article on a wiki. Every wiki got a new table named page_vote with columns for storing article ID, user ID (if a vote was from a registered user), rating (how many stars were selected) and an IP address (used for preventing multiple votes from the same user — you may argue how well it worked in the days of ADSL ;). That worked great.

Back then. Now in 2018 we no longer have five stars voting system. It was removed when we changed the look and feel of the site. Code removal was quite simple - you commit the code, deploy it and ticket closed. However, no database cleanup took place.

page_vote table today

So we’re in 2018. page_vote table is now only used to up-vote forum threads. Our automated reporting tool (that scans the application logs and creates the tickets) filed a bug that an IPv6 addresses cannot be insert into page_vote tables. Ok — we thought — let’s just add a varbinary column that will store IPv6 in an efficient format, migrate the old data and we’re fine.

However, as we dig deeper into the issue it turned out that IP data is never actually used by SELECT queries. index_digest tool found that three indices that this table has are not used and are partially redundant. Next discovery was the fact that nowadays this table is used by a feature that only allows registered users to vote. And that there were still entries from The Old Days when one could vote for any article on the side. And that there were entries for no longer existing articles. And that there were redundant entries for the same article and registered user ID (obviously, there was no unique key restriction).

Thankfully, MediaWiki comes with a maintenance script that can be run across all wikis and perform database cleanup tasks. We added a task there that takes care of cleaning the page_vote table:

  • all entries for anons can go away
  • all entries for articles that were handled by the star rating can go away
  • all entries for no longer existing articles can go away
  • IP column can go away
  • column with a vote rating (1–5 starts) can go away, forum thread up-vote is always treated as +1, all the code does is select count(*)
  • current table indices can go away and a new, unique key can be introduced

Results

So far ~5% of wikis were processed by the cleanup script, however we started with the oldest ones, so the savings from removing The Rotten Banana are quite huge there.

  • rows removed — 74%
  • tables size dropped from 1130 to 235 MiB (~80%), including a replication we saved 7.8 GiB of DB storage
  • page_vote table now has a unique key that helps MySQL replication

Lessons learned

  • Code re-usability is great. But that does not necessarily apply to database tables. You may be building your new feature on top of a storage that is already rotten and should actually be removed.
  • Bugs that seem to be easy to fix are quite often a sign of a bigger issue. Investigate, dig deeper, you won’t regret.
  • Design your table with data consistency in mind — use proper types, unique and foreign keys.
  • Automate schema updates and cleanup queries to keep your data in a healthy state and tables schema in sync.

--

--

Maciej Brencz
Legacy Systems Diary

Poznaniak z dziada-pradziada, pasjonat swojego rodzinnego miasta i Dalekiej Północy / Enjoys investigating how software works under the hood