Sneaky Git number encoding

Concert
Apr 25, 2018

Paul looks at how arbitrarily large numbers are encoded in git pack files

Squeezing every last bit of data — original image by Keith Williamson

Whilst implementing some native Haskell tools to inspect and alter git repositories, I came across the excellent “Unpacking Git Packfiles” article by Aditya Mukerjee. In it he introduces the idea that a pack file is intended to increase the efficiency of transferring git objects over the network by packing them as densely as possible.

He then goes on to detail how git’s pack index files and pack files are structured, which proved invaluable in turning these daunting, opaque binary structures into meaningful Haskell data structures. However, two aspects of size encoding tripped me up horribly.

The wrong end of the stick

My first difficulty was in decoding object sizes from pack file object headers. Aditya tells us that

[pack file object data] … begin by telling us the size of the object that the packfile contains. This size is encoded as a variable-length integer with a special format.

Because the integer has a variable length, the first bit of each byte — also called the MSB, for “most significant bit” — is reserved. That bit tells us whether the next byte belongs as part of the variable-length integer that we’re decoding. If it’s a 1, we should read the next byte. An easy way to check for this is to check if the byte is less than 128, which is 10000000 in binary.

Naively, I thought this was going to be straightforward and set about implementing it. An example decoding would, I thought, look like this:

+-- Keep reading
|        +-- Keep reading
|        |        +-- Stop
v        v        v
10000001 10010000 00100000
 |<--->|  |<--->|  |<--->|
 Data(A)  Data(B)  Data(C)
== 1 0010000 0100000
  (A)  (B)     (C)
== 18464
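
In Haskell, that naive reading translates into something like the sketch below, working over a plain list of bytes (decodeBE and its shape are my own illustration, not code from git or from Aditya’s article):

import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.Word (Word8)

-- Naive big-endian decode: treat each successive 7-bit chunk as less
-- significant, shifting everything read so far left to make room.
decodeBE :: [Word8] -> Integer
decodeBE = go 0
  where
    go acc (b : bs)
      | testBit b 7 = go acc' bs -- continuation bit set: keep reading
      | otherwise   = acc'       -- high bit clear: this was the last byte
      where acc' = (acc `shiftL` 7) .|. fromIntegral (b .&. 0x7f)
    go acc [] = acc

Sure enough, decodeBE [0x81, 0x90, 0x20] evaluates to 18464, matching the diagram above.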

It turns out that while Aditya’s nice diagram says we read the most significant chunks first (i.e. in network/big-endian order), the git source code confirms that they’re actually encoded with the least significant chunk first (i.e. little-endian order). That ends up looking like

+-- Keep reading
|        +-- Keep reading
|        |        +-- Stop
v        v        v
10000001 10010000 00100000
 |<--->|  |<--->|  |<--->|
 Data(A)  Data(B)  Data(C)
== 100000 0010000 0000001
    (C)     (B)     (A)
== 526337
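
Adjusting that sketch for little-endian chunk order is a matter of shifting each new chunk further left, rather than shifting the accumulator (decodeLE is again just an illustrative name):

import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.Word (Word8)

-- Little-endian decode: the first chunk is the LEAST significant,
-- so each later chunk lands seven bits further to the left.
decodeLE :: [Word8] -> Integer
decodeLE = go 0 0
  where
    go shift acc (b : bs)
      | testBit b 7 = go (shift + 7) acc' bs -- keep reading
      | otherwise   = acc'                   -- last byte
      where acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
    go _ acc [] = acc

decodeLE [0x81, 0x90, 0x20] evaluates to 526337, and on a single byte decodeLE and decodeBE agree, a point that becomes important shortly.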

Git provides some really useful plumbing tools for inspecting pack files, including letting you know how big an object is supposed to be (the third column in the output below):

$ git verify-pack --verbose .git/objects/pack/pack-e3d0187af6509597f36b7b7429380e26b81ea22a.idx
dc78b37d73a4757310c2d8928cd4974eef64c8ef commit 384 264 12
ef7a0857a607ca2d2ef376064527aca9a7342e7b commit 252 178 276
5570579cc2531e37b9a30c14f5f3675ae6abdd42 commit 250 174 454
e10436e4c95c0179feb29a2b1262d7e8b0f07d8a commit 256 180 628
ab9adc3c7cbb8299cbd99ae5704961a2840c42e8 commit 311 223 808
a4903dc5e53d11d0ad0926c08838acf6e30751b0 commit 59 71 1031 1 ab9adc3c7cbb8299cbd99ae5704961a2840c42e8
ee38075d2249a01b470c032452e2d533c57c1f8a commit 312 212 1102
82b30a1b621aab372c9ad3edc4815e5717071af3 commit 268 184 1314
3ec0c9e6f670500d9d840b57aab6d2c04e9fb64d commit 255 175 1498
363b668809273ad0a3098183792b70114ea3a408 commit 282 192 1673
e6aecaa41308643915f67489d4b6f28b4962436f commit 69 77 1865 1 363b668809273ad0a3098183792b70114ea3a408
44c8c213c0ca89dbc08773e07a587a9b50a726b1 commit 249 177 1942
...

Comparing my decodings and the decodings from git-verify-pack was a great way to notice the little-endian/big-endian discrepancy and take a closer look at the encoding.

Because big-endian and little-endian schemes encode single-chunk-sized numbers identically, small object sizes* actually decoded just fine even though I was reading the most significant chunk first. In practice this meant that my bug went unnoticed for quite some time, because I was testing against a toy repository with only a few small objects!

Off by 0, 128, 16512…

The second stumbling block I encountered was how file offsets in object delta headers were encoded. Aditya says

For [offset delta entries], instead of the 20-byte base object name, there’s another variable-length integer. Again, the MSB tells us where the last byte of the integer is. This integer represents a negative offset — it tells us how many bytes to scan backwards in the current packfile in order to find the base object.

I assumed that “another variable-length integer” referred to the same little-endian encoding as we’ve just explored. However, looking at actual pack file data and following the offsets manually using git verify-pack revealed that, again, something was amiss — the offsets didn’t point to the right objects.

Instead of the little-endian result, our example byte string decoded as follows:

+-- Keep reading
|        +-- Keep reading
|        |        +-- Stop
v        v        v
10000001 10010000 00100000
 |<--->|  |<--->|  |<--->|
 Data(A)  Data(B)  Data(C)
== 10 0010001 0100000
 (A?!) (B?!)    (C)
== 34976

The final seven bits of the resulting number match the final seven bits of the encoded data, which hints that we’re decoding the most significant chunk first. However, if we look at the plain big-endian parse result side by side with the result above, we can clearly see the discrepancy in all but the last chunk:

18464 ==  1 0010000 0100000  -- Big-endian result
34976 == 10 0010001 0100000  -- Mysterious kinda-big-endian result

It turns out that pack file delta offsets are encoded even more efficiently than simply chunking with continuation. When displayed in the format above, it’s clear that each 7-bit chunk of data that came from a byte with a continuation bit set (i.e. all but the last chunk) is also incremented by one when decoding.
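
In code, that amounts to adding one to the accumulated value every time a continuation bit is seen, before shifting in the next chunk. Here is a sketch in the style of the decoders above (decodeOffset is my own name for it):

import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.Word (Word8)

-- Offset decode: big-endian chunk order, plus an extra one added to
-- the accumulated value at every continuation byte.
decodeOffset :: [Word8] -> Integer
decodeOffset = go 0
  where
    go acc (b : bs)
      | testBit b 7 = go (acc' + 1) bs -- continuation: add one, read on
      | otherwise   = acc'
      where acc' = (acc `shiftL` 7) .|. fromIntegral (b .&. 0x7f)
    go acc [] = acc

decodeOffset [0x81, 0x90, 0x20] evaluates to 34976, agreeing with the diagram above.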

As far as I can tell, this detail hasn’t really been talked about anywhere that discusses pack file encodings! It’s a neat little trick which works because the mere presence of a continuation byte signals at least one bit of information.

In the naive big-endian encoding, the following byte strings would produce identical results of zero:

                  00000000 == 0
         10000000 00000000 == 0
10000000 10000000 00000000 == 0
...

In the “add one on continuation” scheme, the continuation bytes themselves lead to different results:

                  00000000 == 0
         10000000 00000000 == 128   == 2^7
10000000 10000000 00000000 == 16512 == 2^14 + 2^7

This extra bit of information can lead to space savings in the case that, when decoding, the added ones overflow into the next 7-bit chunk. For example 16385 would naively be big-endian encoded as 10000001 10000000 00000001, but can be packed into two bytes in the “add one” scheme: 11111111 00000001.
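
The encoding side mirrors this with a subtract-one: before emitting each continuation chunk, one is taken off the value that remains. Here is a sketch (encodeOffset is my own illustrative name, and the input is assumed non-negative):

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Offset encode: emit the least significant chunk last; subtract one
-- from the remaining value before each earlier (continuation) chunk,
-- balancing the add-one performed during decoding.
encodeOffset :: Integer -> [Word8]
encodeOffset n = go (n `shiftR` 7) [fromIntegral (n .&. 0x7f)]
  where
    go 0 acc = acc
    go m acc =
      let m' = m - 1
      in go (m' `shiftR` 7) (fromIntegral ((m' .&. 0x7f) .|. 0x80) : acc)

encodeOffset 16385 yields the two bytes [0xff, 0x01] from the example, and decodeOffset (encodeOffset n) == n round-trips.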

You can see from the git source code that this is indeed how git encodes these negative base-object offsets: the encoder decrements the data chunk by one at each step, the mirror image of the increment applied when decoding.

Today I learned

So, like most things, at their heart the encodings in git pack files are nothing to be scared of. However, they did lay some traps for the unwary explorer, and I wouldn’t describe them as “trivial”. This is especially true given the somewhat ad-hoc choice of encodings used in various places!

Don’t forget you can check out our native Haskell git tools repository on GitHub. I hope you found this useful, and happy git unpacking!

Update: We’ve just released two packages to Hackage for encoding (bytestring-builder-varword) and decoding (attoparsec-varword) using these schemes!

*Small numbers in this very specific context actually meant “up to four bits”. This is because in the first byte of the pack file object header, the most significant three bits of the 7-bit data chunk are used to encode the object type, as you can see in the diagram in Aditya Mukerjee’s article.
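
For completeness, here is a tiny sketch of splitting up that first header byte along those lines (headerByte is an illustrative name):

import Data.Bits (shiftR, testBit, (.&.))
import Data.Word (Word8)

-- First byte of a pack file object header: one continuation bit,
-- three bits of object type, and the four low bits of the size.
headerByte :: Word8 -> (Bool, Word8, Word8)
headerByte b = (testBit b 7, (b `shiftR` 4) .&. 0x7, b .&. 0xf)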
