The inefficiency of storing code as text
This article is also published on my website: https://fwouts.com/articles/why-storing-code-as-text-sucks
Every major version control system handles files as plain text. Git, SVN, Perforce, and even Google’s newest system, Piper, are all text-based. None of them understands code the way humans do.
This leads to absurd situations.
Case study: manipulating JSON with Git
Let’s take the following example of a JSON file in a Git repository:
A first developer wants to add a new property:
A second developer, starting from the same code, wants to add another property:
The first developer will have no issues merging their code into master. However, the second developer will see the following error when they run git rebase master:
$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: Add author
Using index info to reconstruct a base tree...
M package.json
Falling back to patching base and 3-way merge...
Auto-merging package.json
CONFLICT (content): Merge conflict in package.json
error: Failed to merge in the changes.
Patch failed at 0001 Add author
Use 'git am --show-current-patch' to see the failed patch
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
When they look at the file, they’ll see a butchered JSON file:
This is exactly what Git is expected to produce. It unfortunately cannot provide any better suggestions, because it does not understand what the JSON format is. As far as Git is concerned, a file is simply a large string, broken up into multiple lines.
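To make the line-based view concrete, here is a minimal Python sketch using the standard library's difflib. The file contents are hypothetical stand-ins, not the original example from this article, and the overlap check is only a simplification of what a three-way merge does, not Git's actual algorithm.

```python
import difflib

# Hypothetical package.json contents, split into lines the way a
# line-based tool sees them. These values are illustrative only.
BASE = [
    '{',
    '  "name": "my-package",',
    '  "description": "An example package"',
    '}',
]
FIRST_DEV = [
    '{',
    '  "name": "my-package",',
    '  "description": "An example package",',
    '  "version": "1.0.0"',
    '}',
]
SECOND_DEV = [
    '{',
    '  "name": "my-package",',
    '  "description": "An example package",',
    '  "author": "Jane Doe"',
    '}',
]


def changed_ranges(base, new):
    """Return (start, end) index ranges of base lines that `new` modifies."""
    matcher = difflib.SequenceMatcher(a=base, b=new, autojunk=False)
    return [
        (i1, i2)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


first = changed_ranges(BASE, FIRST_DEV)
second = changed_ranges(BASE, SECOND_DEV)

# Both edits rewrite the same base line, so a line-based merge
# cannot apply them independently.
overlap = set(first) & set(second)
print(first, second, overlap)
```

In this sketch both developers touch the same base line (the description line, which gains a trailing comma), and that is all a line-based tool can see: two different rewrites of the same region.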
What should the outcome have been?
One could argue that Git had no way of guessing which outcome was best. Setting aside the issue with comma placement, should author be added after description or after name?
Option A:
Option B:
These two options are very different. Or at least they look different. Conceptually though, they are absolutely identical because a JSON object is an unordered set of name/value pairs.
You may argue that some software relies on the order of keys defined in a JSON file. Such software does not use JSON. It uses its own custom format that looks and feels like JSON, but isn’t really JSON.
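The equivalence can be checked with any standard JSON parser. Here is a small Python sketch; the two documents are hypothetical stand-ins for options A and B, with made-up values.

```python
import json

# Hypothetical stand-ins for options A and B:
# the same name/value pairs, in a different key order.
OPTION_A = """{
  "name": "my-package",
  "description": "An example package",
  "author": "Jane Doe"
}"""
OPTION_B = """{
  "name": "my-package",
  "author": "Jane Doe",
  "description": "An example package"
}"""

# As text, the two files differ...
same_text = OPTION_A == OPTION_B

# ...but once parsed, they are the same object:
# Python's dict equality ignores key order.
same_object = json.loads(OPTION_A) == json.loads(OPTION_B)

print(same_text, same_object)
```

A text-based tool only ever sees the first comparison; a format-aware tool could use the second.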
Text storage forces us to choose one representation
We’ve established that options A and B are the same thing. They each correspond to one particular text representation. But because Git (like its alternatives) only handles text, we need to pick one.
We shouldn’t need to. It’s a waste of:
- time (handling merge conflicts, spacing, styling issues)
- storage (storing every single character, whether it’s relevant or not)
- performance (parsing code from an inefficient text format)
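One way to picture "not having to choose" is a tool that maps every equivalent representation to a single canonical form before storing or comparing it. The sketch below assumes Python's standard json module; canonical is a hypothetical helper, and the sample documents are made up.

```python
import json


def canonical(text: str) -> str:
    """Re-serialize a JSON document with sorted keys and no extra whitespace.

    Hypothetical helper: every text representation of the same object
    maps to the same string.
    """
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# Two made-up representations of the same object:
# different key order, different spacing.
compact = '{"name": "my-package", "author": "Jane Doe"}'
pretty = '{\n  "author": "Jane Doe",\n  "name": "my-package"\n}'

assert compact != pretty
assert canonical(compact) == canonical(pretty)
```

With a form like this, key order and spacing stop being something developers have to agree on or merge by hand.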
Of course these issues are not specific to JSON. They apply to every single text-based computer language — that is, virtually all of them.
It seems like we’ve stuck with text-based storage because it’s the easiest format for humans to deal with. However, is this a good enough reason? Could we not come up with a smarter way of storing code, even if it means introducing an abstraction layer that allows us to keep the text interface we’ve grown to love?
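As a rough illustration of what such an abstraction layer might do, here is a sketch of a three-way merge that works on parsed JSON objects rather than on lines. It is not how Git or any existing tool works: merge_objects is a hypothetical function, it only handles top-level keys, and the sample data is made up.

```python
_MISSING = object()  # sentinel meaning "this key is absent"


def merge_objects(base: dict, ours: dict, theirs: dict) -> dict:
    """Three-way merge of flat JSON objects, decided key by key."""
    merged = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:
            value = o  # both sides agree (or both removed the key)
        elif o == b:
            value = t  # only "theirs" changed this key
        elif t == b:
            value = o  # only "ours" changed this key
        else:
            raise ValueError(f"conflicting changes to key {key!r}")
        if value is not _MISSING:
            merged[key] = value
    return merged


# Hypothetical data: two developers each add one property.
base = {"name": "my-package", "description": "An example package"}
ours = {**base, "version": "1.0.0"}
theirs = {**base, "author": "Jane Doe"}

merged = merge_objects(base, ours, theirs)
print(merged)
```

Because the merge is decided per key, two developers adding different properties never conflict. A conflict is raised only when both change the same key to different values.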