Platform Team’s Technical Decision Process — Actor Migration Case Study

Martyna Nowicka
Fandom Engineering
Nov 10, 2021

As the Platform team at Fandom, we are responsible for providing core functionalities and shared services used by all the teams that work on the Unified Community Platform. As a part of our job, we are responsible for keeping our platform secure, performant, and technically up-to-date.

That includes keeping MediaWiki (the wiki engine we use) upgraded to the newest version, giving users a secure platform with the latest features. This article describes our decision-making process for a project that was blocking us from upgrading MediaWiki to the new version.

MediaWiki is open-source software, written mostly in PHP, that was created to power Wikipedia; its development started in 2002. Currently, it's used by multiple websites to host wikis for their users. At Fandom, we use it to give users the ability to create, read, and contribute to pop-culture-related communities.

So why do we want to keep MediaWiki up-to-date? Every MediaWiki release brings security and performance improvements, as well as new features that make the user experience better. Additionally, third-party extensions and their updates are very often incompatible with older MediaWiki versions.

What was blocking us from migrating to the newest version of MediaWiki?

While planning the work needed to upgrade MediaWiki, we learned that we would have to migrate our wikis to use actors before the actual MediaWiki upgrade. As the MediaWiki documentation explains:

Historically, MediaWiki has stored references to users (such as the author of a revision or an image) as an [<id>, <name>] pair, with [0, <IP address>] for anonymous edits. MediaWiki 1.31 introduces the concept of actors (someone who makes an action, such as an edit or a log event; currently either a registered user or an anonymous one).

So why were actors introduced at all? It's a waste of space to store user data in each table that references users: instead of storing a user ID and user name in multiple tables, as MediaWiki did previously, we now store only an actor ID. Another reason mentioned in the MediaWiki documentation is that the old database schema was causing issues on wikis with a lot of revisions and was making the user rename process complicated, as user data was duplicated in several places.
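
To make the difference concrete, here is an illustrative lookup of a revision's author against the new schema (MediaWiki 1.35+, where the revision table has a rev_actor column), using MediaWiki's database abstraction. This is a sketch only; $revId is a placeholder.

```php
// Old schema: every referencing table duplicated the user reference, e.g. a
// revision row carried both rev_user (ID) and rev_user_text (name).
//
// New schema: the revision row stores only rev_actor; the user ID and name
// live once in the actor table (actor_id, actor_user, actor_name).
$revId = 1; // placeholder revision ID
$dbr = wfGetDB( DB_REPLICA );
$author = $dbr->selectRow(
    [ 'revision', 'actor' ],
    [ 'actor_user', 'actor_name' ],
    [ 'rev_id' => $revId ],
    __METHOD__,
    [],
    [ 'actor' => [ 'JOIN', 'actor_id = rev_actor' ] ]
);
```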

Because of all that, we had to prepare our wikis to be able to use actors alongside users. It was possible to start using actors right away, on the older MediaWiki version, and we would have to migrate our communities to actors anyway: starting from MediaWiki 1.35, the ID and name columns are removed from all database tables used by MediaWiki code. That means we can only have and use columns that refer to the actor table.

Challenges (aka problems) we faced

One of the most important parts of the project was deciding how we wanted to approach actor synchronization in our infrastructure.

A quick explanation: in our setup we have multiple database clusters (aka servers). There's a main, shared one that stores general information about all of our wikis, like their URLs, founders, and configuration. In addition to this, we have a few database clusters that are used to store the wiki databases. For example: on the shared cluster we keep information about the Hollow Knight wiki's founder, its date of creation, its domains, etc., while on cluster X we have the Hollow Knight wiki database that stores all articles, revisions, changelog info, and so on.
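
For illustration, a split like this could be expressed with MediaWiki's LBFactoryMulti configuration. The sketch below is hypothetical (cluster names, hosts, and credentials are made up, and Fandom's real setup is more involved):

```php
// LocalSettings.php — hypothetical multi-cluster layout, not Fandom's actual config.
$wgLBFactoryConf = [
    'class' => 'LBFactoryMulti',
    // Which section (database cluster) each wiki database lives on; wikis not
    // listed here fall back to the DEFAULT section (the shared cluster).
    'sectionsByDB' => [
        'hollowknight' => 'c1',
    ],
    // Load weights for the servers in each section.
    'sectionLoads' => [
        'DEFAULT' => [ 'db-shared' => 1 ],
        'c1' => [ 'db-c1' => 1 ],
    ],
    'hostsByName' => [
        'db-shared' => '10.0.0.10',
        'db-c1' => '10.0.0.11',
    ],
    // Connection settings shared by every server entry.
    'serverTemplate' => [
        'user' => 'wikiuser',
        'password' => 'secret',
        'type' => 'mysql',
        'flags' => DBO_DEFAULT,
    ],
];
```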

In the past, we had a similar problem related to user replication/synchronization. The solution we ended up with was a mix of synchronization using MediaWiki and MySQL. When user data is modified or a new user is created, a MediaWiki hook is triggered and the user ID is saved in a special table in the shared database. Next to that, there's a job that runs periodically. It reads from this table and populates the modified user data across all clusters using a MediaWiki maintenance script we've written.
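
A rough sketch of that flow is below. The queue table, class name, and connection handling are hypothetical stand-ins, not Fandom's actual code; LocalUserCreated is a core MediaWiki hook.

```php
class UserSyncHooks {
    // Fired by MediaWiki after a new account is created; a similar handler
    // would react to profile/settings changes.
    public static function onLocalUserCreated( User $user, $autocreated ) {
        // In reality this is a connection to the shared cluster.
        $dbw = wfGetDB( DB_MASTER );
        $dbw->insert(
            'user_sync_queue',                    // hypothetical queue table
            [ 'usq_user_id' => $user->getId() ],
            __METHOD__,
            [ 'IGNORE' ]
        );
    }
}

// A periodic job then runs a maintenance script that reads user_sync_queue and
// copies the modified user rows to every database cluster.
```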

We had some issues with this user replication process being ineffective and delayed: user data was occasionally missing on some of the clusters. One side effect was that user accounts existed on some of the wikis but were missing on others, making it impossible to contribute to them. That meant the supposedly automatic replication process required engineers' attention far too often. We wanted to avoid that with actors and decided to discuss other options for actor storage.

In addition to that, in projects like this we need to think about the scale of our wiki platform. The actor or user replication projects would be less complex if we had 1k wikis and 10k users, but that's not the case. In the last 2 weeks we had 50k pageviews, 33k users online, and 180 edits every minute on around 290k wikis. That's a lot of user and wiki data stored in our databases and affected by the actor migration project.

Decision-making process

We came up with four ideas for the actor storage problem; two of them required some sort of data replication mechanism. We also decided to discuss whether replication is needed for the actor data at all, and if not, how we would store the data.

Shared database cluster as a source of truth about actors & MySQL replication

In this solution, our idea was to run an asynchronous job, once a day or so, that updates the cluster copies of the actor table based on the data stored in the shared cluster, similarly to the user replication.

The main weakness of this solution was that it would be another complicated process in our infrastructure, possibly causing problems like conflicts in user data between database clusters. In addition to that, we were worried that it might cause an increased load on the shared cluster, which could cause performance issues, and that it would be error-prone like the user replication process.

❌ It was a deal-breaker for us and we decided to not test this solution.

Shared database cluster as a source of truth & MediaWiki synchronization

In this solution, MediaWiki takes care of actor replication: when an actor makes an edit, posts a comment, registers, and so on, MediaWiki copies the actor data from the shared cluster to the other ones, right after it's requested.
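
A minimal sketch of that idea, assuming hypothetical function names and a placeholder shared-cluster database ID; the real logic would also handle errors, races, and logging:

```php
// Copy an actor row from the shared cluster to the local wiki's cluster the
// first time it is needed there. All names here are illustrative.
function syncActorToLocalCluster( $userName ) {
    $localDbw = wfGetDB( DB_MASTER );

    // Nothing to do if the local cluster already knows this actor.
    $exists = $localDbw->selectField(
        'actor', 'actor_id', [ 'actor_name' => $userName ], __METHOD__
    );
    if ( $exists ) {
        return;
    }

    // Fetch the canonical row from the shared cluster (the source of truth)...
    $sharedDbr = wfGetDB( DB_REPLICA, [], 'shared_db' ); // placeholder database ID
    $row = $sharedDbr->selectRow(
        'actor',
        [ 'actor_id', 'actor_user', 'actor_name' ],
        [ 'actor_name' => $userName ],
        __METHOD__
    );

    // ...and copy it verbatim, so actor_id stays identical across clusters.
    if ( $row ) {
        $localDbw->insert( 'actor', (array)$row, __METHOD__, [ 'IGNORE' ] );
    }
}
```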

We had two concerns with this solution: we would need to modify core MediaWiki code (which we try to avoid, to make upgrades easier), and we were worried about the increased load on the shared database. On the other hand, this idea had a lot of pros: synchronization would happen automatically when needed, right after the user takes an action, errors would be handled by MediaWiki, and we wouldn't have to deal with another complicated process in our stack.

✅ A lot of thumbs up for this solution when discussing it.

Actor tables stored per database cluster & no synchronization between clusters

This was a completely different idea — we were thinking about keeping separate actor tables in each database cluster and not synchronizing them, meaning that they would differ from each other.

Not needing synchronization or any code changes was actually the only pro. It would be very problematic to move wikis between database clusters, which we may need to do to redistribute load or disk usage, as the actor data would be inconsistent. For the same reason, it would be impossible (or very hard) to switch to a solution with synchronization once the actor tables diverged. We also were not sure how MediaWiki would behave in a setup with an actor table shared within one cluster rather than across all of them; we wondered whether it might cause issues with wikis used as file repositories for wikis from other database clusters.

❌ All of that was a big no for us and this also did not make it to the testing phase.

Actor tables stored in the local wiki databases

This was very similar to the previous solution, but rather than keeping actor data per database cluster, we were thinking about keeping it in the local wiki databases, meaning that each wiki has its own actor table with its own data.

This solution was an upgraded version of the previous one and kept its benefits: no need to modify MediaWiki and no need to synchronize the data. We were also convinced that it would scale well (with no single actor table in a shared cluster that would keep growing) and that we could still migrate wikis between database clusters. We did wonder whether it would be a waste of storage space, since actor data would be duplicated in each wiki database, which might be problematic given that we have around 290k wiki databases, and, again, whether it would be very hard to switch to a solution with actor synchronization later.

✅ We decided that we want to try one of the solutions without synchronization so this one also gets a thumb up.

Experimentation phase

After the discussion, we made a decision: we would try two solutions, the shared database cluster as a source of truth with MediaWiki synchronization, and actor tables stored in the local wiki databases.

We started by testing actor synchronization using MediaWiki on a couple of wikis, to see if it was easy to implement and if it actually worked; for the remaining wikis, we decided to test the solution with actor data stored in the local wiki databases. During implementation, we made sure to send plenty of logs from the synchronization, to verify that it worked as expected and that no users would be affected in case anything went wrong. Additionally, we could use the MediaWiki mechanism that allows using the old database schema for writes and the new one for reads during the actor migration. We used the wgActorTableSchemaMigrationStage variable, which specifies the stage of the actor migration and, depending on its value, tells MediaWiki to only write actor data, only read it, or do both.
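
For reference, a typical staged rollout of that setting in LocalSettings.php might look like the sketch below (the SCHEMA_COMPAT_* constants exist in MediaWiki 1.32+; the exact stages and the migrateActors.php backfill step depend on the MediaWiki version you run):

```php
// Stage 1: write both the old user/user_text columns and the new actor
// columns, but keep reading the old ones.
$wgActorTableSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD;

// Stage 2: after backfilling existing rows (maintenance/migrateActors.php),
// switch reads to the actor table while still writing both.
// $wgActorTableSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW;

// Stage 3: once everything checks out, use the new schema only.
// $wgActorTableSchemaMigrationStage = SCHEMA_COMPAT_NEW;
```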

After the testing phase, it was time to make a final call and check if the experiment answered our questions about both of the solutions:

  • Does the actor synchronization actually work? Does it happen automatically, without requiring our attention? Yes and yes.
  • Did we modify the MediaWiki code a lot? No! We've only added two hooks in the User class that are triggered by MediaWiki, and the whole functionality is handled in our custom extension (a rough sketch of that wiring follows after this list).
  • Do future releases of MediaWiki handle synchronization for us? Partially: the new MediaWiki version introduces new classes for handling actor data, so there's no need to introduce changes to the User class.
  • Is there an increased load on the shared database cluster when synchronizing, as we feared? Nothing like that happened.
  • Is actor data stored in the local wiki databases actually a waste of storage space? How many users actually contribute to more than one wiki, or to wikis from multiple clusters? As one of our OPS engineers pointed out, increased storage per wiki is not problematic; we should make sure that the solution is scalable and resilient.
  • Do we think we might want to switch to a different solution in the future? There’s a possibility that it might happen.
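
The sketch mentioned above: the hook name, handler class, and registration are hypothetical, not the actual Fandom extension code, and only illustrate the shape of the change.

```php
// Added to MediaWiki's User class (one of the two small additions):
//     Hooks::run( 'FandomActorChanged', [ $this ] );

// Registered in our custom extension's setup:
$wgHooks['FandomActorChanged'][] = 'ActorSyncHooks::onFandomActorChanged';

class ActorSyncHooks {
    public static function onFandomActorChanged( User $user ) {
        // Ensure the actor row for this user exists on the local wiki's
        // cluster, copying it from the shared cluster when it is missing
        // (same idea as the syncActorToLocalCluster() sketch shown earlier).
    }
}
```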

We decided to go with the MediaWiki synchronization solution: the code we implemented worked pretty well, and we liked that the synchronization doesn't require any action from engineers. Even our fears did not come true: increased load on the shared cluster has not been a concern at all in the 4 months since the wiki migrations.

Conclusions

Surprisingly, the migration process went quickly and smoothly, without major issues. There was no need to speed up the migration: the initial estimate of the migration time was correct, and it took us around 1 month to migrate all wikis to use actor tables.

For me, as one of the tech leads of this project, the biggest lesson learned is how important it is to have a brainstorming session before starting a project with a lot of unknowns, and to take different perspectives into consideration. Without discussing all of the possible solutions and testing the best ones, I don't think we would have been able to finish this project so easily and without major issues.

Originally published at https://dev.fandom.com.
