GitHub’s Large File Storage is no panacea for Open Source — quite the opposite
I was pretty excited a few months ago when GitHub and other code hosting sites announced efforts to improve on a critically lacking aspect of the popular Git source control system, namely its cumbersome handling of large binary files.
A little bit of background
Without getting into too much detail, Git is a distributed SCM system, which means that every user of a code repository has a full copy of the history locally on their computer. For large projects with a long history, this can be rather big.
The main problem with Git is that binary files are stored “as is” in the history of the project, so that every single revision of a binary file (even if just a single byte has changed) is stored in full. This makes the size of a repository that includes such files rapidly balloon up, and makes Git particularly unsuited to storing large binary assets.
Source files, on the other hand, are mostly text, so they are handled more intelligently: typically only the differences between revisions are stored in the commits.
As such the handling of large files is not necessarily a big issue for many smaller projects, which tend to be mostly a set of text files comprising their source code. But for most modern development (such as when developing apps) it is usually a necessity to store assets such as high-resolution images along with the source code of the project. These add up very quickly as changes are made to them over time.
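As a rough back-of-the-envelope sketch of that growth, assuming (as described above) that every revision of a binary asset is kept in full and that compression does not help. The asset size and revision count are made-up numbers, purely for illustration:

```python
def history_size_mb(asset_size_mb: float, revisions: int) -> float:
    """Size added to the repository history by one binary asset,
    assuming each revision is stored in full (no delta savings)."""
    return asset_size_mb * revisions


# Hypothetical numbers: a 20 MB prebuilt library that was rebuilt 30 times.
full_history = history_size_mb(20, 30)
latest_only = history_size_mb(20, 1)

print(f"every clone carries ~{full_history:.0f} MB of history for this file")
print(f"a checkout of the latest revision only needs ~{latest_only:.0f} MB")
```

Under those assumptions, everybody who clones the repository pays for all thirty copies, even though they will only ever use the last one.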
With Git ballooning in popularity over the last few years, largely because of the popularity of social coding sites like GitHub, several projects have been in development to address that issue with the software. The two biggest ones are Git-LFS and git-annex, with the first one looking like it is quickly becoming the de facto standard. As a matter of fact, Git-LFS is GitHub’s adopted solution for this problem.
My Own Usage of GitHub
My personal GitHub account hosts mostly public open-source projects. I also use Git routinely with other hosting providers such as Atlassian’s Bitbucket (which provides free unlimited repos for small teams) and the open-source GitLab (which can be deployed on your own hardware for free).
A few months ago, I joined the AudioKit team on GitHub and have contributed a number of large commits over time. One of the aspects I have been handling is the production of suitable universal libraries of the third-party projects that AudioKit depends on. This makes it considerably easier on AudioKit’s users as building these libraries properly is no small task.
Up till now, we stored these large files as binaries in the AudioKit repo, which came with all the downsides mentioned earlier. This is becoming even more of a problem these days with Apple introducing the Bitcode technology for binaries, which makes these libraries bigger than ever before. As such, I was excited by the introduction of official Git-LFS support on GitHub as it would be a perfect fit to handle these files.
After a pretty lengthy beta period, a few weeks ago GitHub finally announced that the technology was now available for all users of the site. We had not enabled LFS for AudioKit previously: for a public open-source project, leaving our users unable to check out the code and contribute would have been a problem, to say the least.
After the announcement, I was rather enthusiastic to finally have a solution for AudioKit; one of my more recent commits added the needed attributes to allow usage of LFS for the next set of binaries in AudioKit. The disclosed free tier for storage and bandwidth was rather limited, but it didn’t seem like it would be a problem for us. I’m sure we wouldn’t have minded even shelling out a modest amount of money to pay for storage if that made life easier on our users — after all AudioKit already has a paid organization account.
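For readers who have not seen them, the attributes in question live in a .gitattributes file at the root of the repository. The sketch below only generates the kind of lines that the git lfs track command writes there. The file patterns are assumptions chosen for illustration, not the ones actually used in AudioKit:

```python
def lfs_attribute_line(pattern: str) -> str:
    """Return a .gitattributes entry that routes `pattern` through Git LFS."""
    return f"{pattern} filter=lfs diff=lfs merge=lfs -text"


# Hypothetical patterns: the post does not say which files were tracked.
patterns = ["*.a", "*.dylib"]
gitattributes = "\n".join(lfs_attribute_line(p) for p in patterns) + "\n"

print(gitattributes, end="")
```

Once such lines are committed, any newly added file matching a pattern is handed to LFS instead of being stored directly in the Git history.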
Attempting to use LFS with the AudioKit repo
The first test of this new LFS would come quickly as I added support for the new Apple TV (running tvOS) to AudioKit. Along with it came a new set of binaries that I included in my pull request.
We quickly ran into some very strange authentication issues. My pull request was merged into the project, but none of the other users could get to the binary files in LFS. Thinking it was a problem on my system, I tried a number of things to attempt to solve this, and eventually created a small test repository specifically to come up with a reproducible case. It quickly became clear that things started to break when the repository containing LFS files was forked.
This was very strange to me as there was no indication anywhere that this would fail. All the GitHub documentation on LFS continually stresses that LFS is as transparent as it gets in daily usage: we should be able to keep our usual workflow, including forks and pull requests. It made no sense to me that this somehow would not be the case.
Thinking this might be a legitimate bug or issue, I filed a bug report with the git-lfs project on GitHub. The next day, I got an answer from the project lead on my report, which also got closed at the same time.
It turns out that this is actually expected behavior on GitHub right now. If you decide to use LFS with your project on GitHub, then you have to kiss goodbye to the ability to have it forked and to receive pull requests.
Case in point: if a very popular GitHub repository (such as the one for the Linux kernel) decided to start using LFS for some of its files, it would instantly alienate all of its users. They would no longer be able to properly fork the project, or even clone it to get its binary files stored via LFS. Nobody would be able to send a pull request to Linus as a result without considerable effort. This should be unacceptable on its face, and likely Linux sources would be out of GitHub the next day.
I honestly couldn’t believe that GitHub would be willing to do something that shortsighted, visibly motivated by greed from the cash they thought they could extract from some of their users. But this is madness to me, and I don’t see how this can work, for one simple reason: paying for LFS storage doesn’t actually fix any of this!
You basically have to sacrifice the core of the functionality of GitHub just to get a small reprieve on the inconvenience of storing large files in your repository.
The core of the issue seems to be their mind-boggling decision to charge for bandwidth usage, alongside storage. While I believe that charging for storage in this instance is perfectly acceptable, the bandwidth charge just makes no sense whatsoever. It makes no business sense, and moreover it makes absolutely no technical sense as it is actively impeding — right now — an otherwise very useful technology.
The way this bandwidth charge is getting in the way is simple: files downloaded from your LFS bucket count against your quota, even if you are not the user requesting them. Thus, a popular public repository would trigger a lot of downloads, rapidly depleting the bandwidth assigned to the owner of the repository itself.
This is completely batshit. The side effect of this pernicious, greedy pricing model is to effectively discourage popular projects that could use something like LFS from being hosted on GitHub — it shuts down collaboration, the key metric in evaluating the popularity of projects, and ironically what helped GitHub grow to become the company it is today.
GitHub’s current answer to this bandwidth conundrum is to keep LFS functionality to private projects. And by private, they do mean projects that cannot get forked even by their own private collaborators. Also: you do need a paid account to even have private repos.
This in itself wouldn’t be such a problem if this fact was disclosed up front. It would certainly be very disappointing, but users wouldn’t feel ambushed into this. All the marketing material pimping GitHub’s LFS support does NOT stress — or even mention — that fact at all. I do not believe this is unintentional.
As a matter of fact, if they were being up front about this, they would explicitly disable the “Fork” button on projects that have started using LFS. At the very least, they could give a stern warning to the repo owner before they unwittingly condemn their precious code to the dark corners of the antisocial web.
Instead, what you get right now when forking a project using LFS is the following: nothing. Everything is oklee-dokley. Fork away! The forked project even shows the LFS files in the GitHub interface, as if everything worked. Except you can’t actually reach those files in any practical way. And you will start running into the host of mysterious errors that I ran into.
At its heart, LFS is really just another interface to a storage bucket. The main difference with something like Dropbox or Google Drive is mostly the fact that it is designed to work specifically with Git. In many ways, it is a lot more limited than what your vanilla cloud file services provide. You cannot browse the contents of your LFS storage, for instance: GitHub will only give you a summary of your usage in the billing section of your settings.
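To make the “storage bucket” description a bit more concrete: what Git actually commits for an LFS-tracked file is a small text pointer, while the real content sits on the LFS server. The sketch below builds such a pointer following the published Git LFS pointer format. The sample bytes are a placeholder standing in for a real binary:

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for `content`."""
    digest = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest}\n"
        f"size {len(content)}\n"
    )


# Placeholder payload; a real one would be a multi-megabyte library.
pointer = lfs_pointer(b"pretend this is a large static library")
print(pointer, end="")
```

The pointer stays around a hundred bytes no matter how large the file is, which is why the repository itself stays small. It is also why a fork that cannot reach the bucket ends up with pointers and nothing else.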
Yet, you don’t see any of the popular cloud file providers charge for bandwidth. They will happily charge you for premium storage options (usually with a free tier to get started), but they never charge for bandwidth usage.
Google does not do this. Dropbox does not do this. Box.net does not do this. For fuck’s sake, even Microsoft does not do this. And to make things worse, Microsoft’s own code hosting solution, Visual Studio Online, which also happens to implement Git LFS, does not charge for bandwidth (or even for storage!).
You know you’ve seriously screwed the pooch when this die-hard open-source enthusiast is seriously considering a Microsoft product as an alternative.
My guess is that some high-level greedy marketing dickwad, completely unaware of the asinine implications of his brilliant idea, signed off on this dumb-as-a-bag-of-rocks pricing model. He then directed the grunts to somehow implement his grand vision on GitHub’s servers. That’s when shit started to hit the fan.
There are plenty of technical reasons why the current approach is completely unsustainable. Let me list the various ways this is screwed up.
- The free LFS storage tier includes 1GB of storage and just 1GB of monthly bandwidth. The bandwidth amount is ridiculously low. If you have a repo with 500MB of binary assets to upload to LFS, then after your initial upload and the first download, you’d have no bandwidth left and nobody could access those files until the next billing cycle. Who the hell does GitHub think they are? AT&T? This looks more like pricing for a shitty cell data plan than for an actual professional service.
- Apparently all access to LFS files counts against your bandwidth quota. Since this would quickly run out with popular repos on the free tier, GitHub is forced to handle user authentication to access what would normally be public files. It then forcibly refuses access to anybody but the original uploader and the repo’s contributors, preventing anybody else from getting to the files. This conveniently puts the pressure on the repo’s owner to sign up for pricier LFS data plans.
- No more social features. LFS renders completely inoperative the primary way to collaborate on GitHub: forks and pull requests. At this point GitHub becomes just another dumb private file hosting service with some version control on top. You’d be served just as well with Subversion over Dropbox. Did I mention that you can’t even buy your way out of this?
- GitHub doesn’t charge for bandwidth in any of their other products. Even their GitHub Pages product (which I use to host my personal website, and also hosts the AudioKit page), a fully fledged web hosting solution, is free to use.
- Not using LFS will actually increase bandwidth usage for GitHub. Without LFS, entire repos containing the entire history of binary files can get cloned for free by any user, as many times as they wish. This status quo is much more bandwidth-intensive on GitHub servers than offloading files to LFS, which only needs to trigger downloads for the revisions of files being checked out. Yet this bandwidth hog is currently free. Wouldn’t you like to pay GitHub to help them save some money instead, at the expense of your own functionality? Sounds like a winner!
- This will effectively drive users away from GitHub. Other organizations implementing LFS seem to be a lot more sensible about these issues and certainly won’t try to enforce unrealistic bandwidth limitations on their servers. There is competition in this field: Bitbucket, GitLab, even Visual Studio Online. Users will be able to collaborate as they normally would on these sites, all the while using LFS to manage their large files.
- It doesn’t help that GitHub is apparently obfuscating the situation with misleading marketing. If your implementation of LFS has very serious limitations, you owe it to your customers to disclose those up front. Not doing so is very shady, to say the least.
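The arithmetic behind the first bullet above can be sketched as follows. This is a minimal model, not GitHub’s actual accounting: it treats 1GB as 1000MB, and whether the initial upload counts against the quota is an assumption exposed as a flag.

```python
def remaining_bandwidth_mb(quota_mb, asset_mb, downloads, uploads_count=True):
    """Bandwidth left in the billing cycle after `downloads` full downloads.

    `uploads_count` is an assumption: it models the initial upload being
    charged against the same quota as the downloads.
    """
    used = asset_mb * downloads
    if uploads_count:
        used += asset_mb
    return quota_mb - used


FREE_QUOTA_MB = 1000  # free tier, treating 1GB as 1000MB for simplicity
ASSETS_MB = 500       # the 500MB of binary assets from the example

# Initial upload plus a single download by anybody at all.
print(remaining_bandwidth_mb(FREE_QUOTA_MB, ASSETS_MB, downloads=1))

# Even if uploads were free, the second download would exhaust the quota.
print(remaining_bandwidth_mb(FREE_QUOTA_MB, ASSETS_MB, downloads=2, uploads_count=False))
```

Either way, a public repository with any audience at all would burn through the free tier within the first couple of clones.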
Hoping For A Resolution
I am hopeful that GitHub, as an organization, will come to its senses and fix this situation sooner rather than later. It really seems as simple as throwing away the bandwidth-charging model. Not having to track individual users’ usage would make these problems go away, and GitHub could still turn out a decent profit by keeping their users and charging them just for extra storage. That’s the way they’ve done things so far, and it worked out pretty well for them, without compromising their core value: fostering open collaboration over code.
I have been a long-time enthusiastic user of their service, and I feel like on this one they’ve really let down the open-source community at large.
Let’s hope they listen.