Using diffs to manage game data

Akash Khan
Akatsuki Taiwan Technology
Apr 22, 2019

At Akatsuki, we manage game data using MySQL databases which often get updated on a weekly basis. The database content is managed using CSV/YAML/Excel files which get checked into version control.

So, just like in the case of source code management, a PR gets created, reviewed, merged, and consequently deployed. However, since git is a stupid content tracker that doesn't care about the underlying file types, the PRs often end up being difficult to review and can lead to bugs.

To understand the problem better, let us consider a simplified example. We have a single table called characters which controls character settings in our game, with the following schema:

 -------------------------------------
| character_id | name   | atk | def |
 -------------------------------------
| 1            | chara1 | 100 | 100 |
| 2            | chara2 | 150 | 80  |
 -------------------------------------
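
For reference, here is the same table in YAML form (a sketch of one common layout, keyed by the character id; the exact layout of your files may differ):

    1:
      character_id: 1
      name: chara1
      atk: 100
      def: 100
    2:
      character_id: 2
      name: chara2
      atk: 150
      def: 80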

Now, on testing, we find that chara2 is a little OP, so we decide to nerf him and reduce his attack to 120. Our resulting PR would look something like this:

CSV diff ~ YAML diff but YAML has advantage due to clear column checking
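
Since the original screenshots are not reproduced here, this is roughly what the two text diffs would look like (a sketch based on the table above):

    # CSV diff
    -2,chara2,150,80
    +2,chara2,120,80

    # YAML diff
    2:
      character_id: 2
      name: chara2
    - atk: 150
    + atk: 120
      def: 80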

Just by looking at the PR, you can see that Excel diffs are pretty useless here. Since git treats Excel files as binary blobs, it is impossible to review changes without downloading the new file and checking it.
The CSV diffs still look reviewable but might become difficult as the number of changes increases, while YAML diffs are clearly the winner here for ease of reviewing.
Similarly, for the case of adding new characters, CSV diffs look better, but YAML diffs win once again because it's easy to verify whether the columns are set appropriately (check out this PR for an example).

YAML files show the column value mapping so it’s easier to check data

Now we'd like to make hit points a character-specific stat. Sounds pretty simple, right? Just add an hp column to the table and assign hit points to existing characters.

Here's what the diff would look like:
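
The screenshot itself isn't reproduced here, but the YAML version would be roughly this (a sketch; the hp values are invented for illustration):

    # YAML diff
    1:
      character_id: 1
      name: chara1
      atk: 100
      def: 100
    + hp: 500
    2:
      character_id: 2
      name: chara2
      atk: 120
      def: 80
    + hp: 450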

Excel diffs, as expected, remain impossible to decipher, but now even the CSV diffs become noticeably harder to review.
YAML diffs still retain their readability and show only the data that has actually changed.

So, problem solved then? Just use YAML files for verifying PRs and there won't be any issues!

Not really…

As pretty as YAML diffs look, the file format itself is very inefficient for storing structured data. For example, representing a table with m columns and n rows as a CSV takes O(n) lines (n lines really, but word wrap while displaying adds some overhead). On the other hand, the same table in YAML format takes O(mn) lines. This quickly adds up, and pretty soon the diffs grow too large to be reviewed at all.
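
To put numbers on it: a table with 20 columns and 1,000 rows is about 1,000 lines as CSV, but roughly 21,000 lines in the YAML layout shown earlier (one key line plus one line per column for each row).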

Solution

So how do we tackle this issue? The hint lies in the title of this post :P

Let's take a step back and think about what we're trying to achieve here. We need a way for game designers to be able to verify and review the data that is going to get deployed.
For example, in the first case above, during review we only care that chara2 has his atk stat changed from 150 to 120 and that all other columns are unchanged. We don't care about the values in the other rows, since they are not changing and don't need to be reviewed.

So, in essence, the reviewer only needs to verify that the following change is correct:

 -------------------------------------
| character_id | name   | atk | def |
 -------------------------------------
| 2            |        | 120 |     |
 -------------------------------------

Representing the data in this format greatly reduces the amount of effort needed to review the change, and it scales very well because the diff size is independent of the number of rows the original table contains.
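
In CSV form (the format the merge script sketched below consumes), that entire diff sheet would be just:

    character_id,name,atk,def
    2,,120,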

Implementation

To implement this, we essentially need to do the opposite of what git diff does: given a base file and a diff, generate the merged file. This can be quite challenging if we try to handle the generic case. Git assumes nothing about the underlying file, but we are not bound by this restriction. Since we know the structure of the files being compared, we can make some assumptions to make things easier. Some of the assumptions I made are:

1) Both the base file and the diff file are in CSV format. (Not really important; just used for quick and easy parsing.)
2) The table has a unique key column. This is extremely important and forms the basis of our algorithm.

Once we have the above conditions met, our merge algorithm becomes really simple.

1) Loop through all rows in the diff CSV.
2) For each row, find the corresponding row inside the base CSV. The unique key ensures a one-to-one mapping.
3) Generate the merged headers (in case new columns are being added).
4) Overwrite the base row with the diff row's contents.

Sample implementation
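
The sample implementation embedded in the original post is not reproduced here, so below is a minimal Python sketch of the algorithm. The function name, file handling, and default key column are my own simplifications, not the script used in our project:

    import csv

    def merge_diff(base_path, diff_path, out_path, key="character_id"):
        # Read both tables as lists of {header: value} dicts.
        with open(base_path, newline="") as f:
            base_rows = list(csv.DictReader(f))
        with open(diff_path, newline="") as f:
            diff_rows = list(csv.DictReader(f))

        # 3) Merge the headers, in case the diff adds new columns (e.g. hp).
        base_headers = list(base_rows[0].keys()) if base_rows else []
        diff_headers = list(diff_rows[0].keys()) if diff_rows else []
        headers = base_headers + [h for h in diff_headers if h not in base_headers]

        # The unique key column gives a one-to-one mapping into the base table.
        merged = {row[key]: {h: row.get(h, "") for h in headers} for row in base_rows}

        # 1) Loop through all rows in the diff.
        for diff_row in diff_rows:
            # 2) Find (or create) the corresponding base row via the unique key.
            target = merged.setdefault(diff_row[key], dict.fromkeys(headers, ""))
            # 4) Overwrite base values; empty diff cells mean "unchanged".
            for h, v in diff_row.items():
                if v != "":
                    target[h] = v

        # Write out the merged table.
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=headers)
            writer.writeheader()
            writer.writerows(merged.values())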

That's it! Well, not really… We have handled new rows being added and old rows being modified, but what about handling deletions?

Since we explicitly skip the empty columns in the diff CSV, we need a way to mark deleted data. There are a few ways of handling this, but the way I handled it is by adding a delete_mode? option to the script.
If this flag is set, the script treats empty rows as deleted data; otherwise it treats them as regular difference data.
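
In the sketch above, that could be a delete_mode keyword argument (my naming, mirroring the script's delete_mode? option), checked inside the loop over diff rows:

    # Inside the loop over diff rows, before the merge step:
    values = [v for h, v in diff_row.items() if h != key]
    if delete_mode and all(v == "" for v in values):
        merged.pop(diff_row[key], None)  # an all-empty row marks a deletion
        continue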

Well that’s pretty much it for the basic idea. The real version used in our project has some additional data integrity checks and options which are not really important for this article.

If you wish to implement a similar approach, you should keep the following in mind:

Pros

  • Easy to keep track of changes across different versions. Want to know what was deployed 2 weeks ago? Just check the diff sheets; no need to go through git logs to find the source.
  • The set of differences is almost always smaller than the actual file, which makes it easier to review.
  • Each set of differences can be grouped based on features/intent. This also enables parallel testing, since two teams can test their own sets of changes without having to worry about messing up the other team's work.
  • In order to modify the data, you need to explicitly state the unique key being modified. This helps reduce the chance of errors from copy-pasting incorrect data from adjacent rows.

Cons

  • Sometimes it's necessary to preview the entire data set at once, which is difficult to do in this case. We overcame this by adding a script to upload the merged CSV to Google Sheets, where it can be previewed.
  • Since you cannot copy-paste data most of the time, the data entry process becomes more time-consuming.

Overall, for us the pros far outweigh the cons, and the algorithm is simple enough to be extended to handle our requirements. We have been using this in one of our projects for almost a year now, and it has helped speed up our development process significantly.
