How to download large files from GitHub

I recently ran into a limitation of the otherwise amazing GitHub API.

My goal was to download a big file (specifically, a tar.gz) from a private repository using the command line, in a bash script to be precise. As it turned out, this was much more difficult than I initially thought.

First of all, the OAuth token. Nothing amazing here: just head to the proper, detailed help page, follow the instructions and get this thing done. Easy peasy.

Test it with curl:

curl -H 'Authorization: token YOURSUPERSECRETCODEGOESHERE' -L 'https://api.github.com/repos/YOUR_ACCOUNT/YOUR_REPO/contents/YOUR/PATH/TO/A/FILE'

A couple of things to notice here. First, if you don’t specify an Accept header, GitHub by default sends you a JSON document with everything it knows about that file. The content of the file is, understandably, in the content key of the returned JSON. Second, and very important: with this method you can only fetch files that are no bigger than 1 MB.
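
If you do stick with the default JSON response, the value under content is Base64-encoded, with literal \n escapes sprinkled in. Here is a minimal sketch of decoding it in plain shell. The RESPONSE value is a made-up, trimmed sample (a real response carries many more keys), decode_content is a hypothetical helper name, and the sed pattern assumes the pretty-printed "content": "..." layout, so treat it as an illustration rather than a robust JSON parser.

```shell
# Made-up, trimmed /contents response; a real one has many more keys.
RESPONSE='{
  "name": "hello.txt",
  "encoding": "base64",
  "content": "aGVsbG8g\nd29ybGQK\n"
}'

decode_content() {
  # $1 = JSON text of a single-file /contents response
  # 1. keep only the value of the "content" key
  # 2. drop the literal \n escapes that wrap the Base64 text
  # 3. decode
  printf '%s\n' "$1" |
    sed -n 's/.*"content": *"\([^"]*\)".*/\1/p' |
    sed 's/\\n//g' |
    base64 --decode
}

decode_content "$RESPONSE"
```

For anything beyond a quick test, the raw Accept header described next is the less fragile route.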

Let’s address the first issue: I want to get the file directly, without parsing any returned JSON. The key is to provide the Accept header with the proper value set:

curl -H 'Authorization: token YOURSUPERSECRETCODEGOESHERE' -H 'Accept: application/vnd.github.v3.raw' -L 'https://api.github.com/repos/YOUR_ACCOUNT/YOUR_REPO/contents/YOUR/PATH/TO/A/FILE'

The file will be returned as it is, no JSON.
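
Since the goal is a bash script, the call above could be wrapped along these lines. This is only a sketch: contents_url and fetch_raw are hypothetical helper names, and GITHUB_TOKEN is an assumed environment variable holding the OAuth token, so that the secret stays out of the script itself.

```shell
# Hypothetical helpers; GITHUB_TOKEN is an assumed environment variable.
contents_url() {
  # $1 = account, $2 = repository, $3 = path of the file inside the repo
  printf 'https://api.github.com/repos/%s/%s/contents/%s\n' "$1" "$2" "$3"
}

fetch_raw() {
  # $1 = account, $2 = repository, $3 = path in repo, $4 = local file name
  # --fail makes curl exit non-zero on an HTTP error instead of saving
  # the error body as if it were the file.
  curl --fail -sS -L \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    -H 'Accept: application/vnd.github.v3.raw' \
    -o "$4" \
    "$(contents_url "$1" "$2" "$3")"
}

# Usage (performs a real request, so it is left commented out):
# fetch_raw YOUR_ACCOUNT YOUR_REPO YOUR/PATH/TO/A/FILE local-copy
```

With --fail in place, the script can simply test the exit status of fetch_raw to notice that something went wrong.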

The second issue, regarding the 1MB limit, is where I hit the wall.

The error that you get back tells you that the “simple” /contents API won’t do for files that are too big, and that you have to use a different one instead: the Git Data API, whose blobs live under /git/blobs. You basically need to ask GitHub for a blob, which is your file, but you can refer to it only by its SHA, not by its file name.

So your next problem is: how do I get the SHA of my file? As it turned out (and I’d like to be wrong here), it’s impossible to get it from the file resource itself. There is no way to ask GitHub for just the meta-information of a file: if you ask for a file resource (the aforementioned JSON, which happens to also contain the SHA), GitHub will also try to give you its content. And if the file is too big… you get nothing at all.

The solution I found is that you can still use the /contents API, but to ask for the content of the directory containing your fatty, stubborn file. You’ll then get, as JSON, an array containing the data for each of the files that reside inside the directory. And guess what? Each array element contains the SHA of the file.
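
A rough sketch of that directory request, with the directory and file name derived from the full path. list_dir is a hypothetical helper name and GITHUB_TOKEN an assumed environment variable holding the token; the real call is left commented out because it needs network access and valid credentials.

```shell
# FILE_PATH is the placeholder path used throughout this post.
FILE_PATH='YOUR/PATH/TO/A/FILE'
DIR_PATH=$(dirname "$FILE_PATH")   # directory to list
TARGET=$(basename "$FILE_PATH")    # file name to look for in the listing

list_dir() {
  # $1 = account, $2 = repository, $3 = directory path inside the repo
  # No Accept header here: the default JSON listing is what we want.
  curl --fail -sS -L \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    "https://api.github.com/repos/$1/$2/contents/$3"
}

echo "listing '$DIR_PATH' to find '$TARGET'"
# RESPONSE=$(list_dir YOUR_ACCOUNT YOUR_REPO "$DIR_PATH")
```

Note that for a file sitting in the repository root, dirname returns ".", so that case would need its own handling.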

You’ll then parse the JSON, get the SHA of your file, and use the /git/blobs endpoint of the Git Data API to (finally) download it.

curl -s -H "Authorization: token YOURSUPERSECRETCODEGOESHERE" -H "Accept: application/vnd.github.v3.raw+json" -o "YOUR_LOCAL_FILE_NAME" -L "https://api.github.com/repos/YOUR_ACCOUNT/YOUR_REPO/git/blobs/THE_SHA"
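
One way to double-check the download: the SHA in the blob URL is Git's object ID for the file, which is the SHA-1 of a short header ("blob", the size in bytes, a NUL byte) followed by the content itself. The sketch below recomputes it locally, assuming sha1sum is installed (on macOS, shasum would be the closest equivalent); blob_sha is a hypothetical helper name.

```shell
blob_sha() {
  # $1 = path of a local file
  # Git hashes "blob <size>\0" followed by the raw file content.
  size=$(wc -c < "$1" | tr -d ' ')
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

# Usage, once the file has been downloaded:
# [ "$(blob_sha YOUR_LOCAL_FILE_NAME)" = "THE_SHA" ] && echo "download looks intact"
```

If the two values differ, the file on disk is not the blob you asked for (for example, an error body saved in its place).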

Oh, one last thing. As I said, I needed to do it all in a bash script, so at a certain point I had to parse JSON in it (to get the SHA from an array). Well, I used node ;)

# RESPONSE contains the JSON from the /contents API call and TARGET is the name of the file I need
# Caveat: this breaks if the JSON contains a single quote or a backslash
SHA=$(echo "var r=JSON.parse('${RESPONSE}');r.forEach(function(v) { if(v.name=='${TARGET}'){console.log(v.sha);}});" | tr -d '\n' | node)
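
If node is not around, a rougher alternative is to lean on the way the listing is laid out. This sketch assumes the response is pretty-printed with one key per line and that "name" comes before "sha" within each entry; both are assumptions about the formatting, not guarantees, and the RESPONSE below is a made-up sample. sha_for is a hypothetical helper name.

```shell
# Made-up, trimmed directory listing; real entries carry more keys.
RESPONSE='[
  {
    "name": "README.md",
    "path": "dist/README.md",
    "sha": "1111111111111111111111111111111111111111",
    "type": "file"
  },
  {
    "name": "bundle.tar.gz",
    "path": "dist/bundle.tar.gz",
    "sha": "2222222222222222222222222222222222222222",
    "type": "file"
  }
]'
TARGET='bundle.tar.gz'

sha_for() {
  # $1 = JSON listing, $2 = file name to look up
  printf '%s\n' "$1" | awk -v target="$2" '
    # Remember that we have seen the entry whose name matches exactly.
    index($0, "\"name\": \"" target "\"") { found = 1; next }
    # The next "sha" line belongs to that entry: print its value and stop.
    found && /"sha":/ { split($0, parts, "\""); print parts[4]; exit }
  '
}

SHA=$(sha_for "$RESPONSE" "$TARGET")
echo "$SHA"
```

A real JSON parser remains the safer choice; this only helps when adding a dependency is not an option.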

Image credit: http://pixabay.com/p-41294/