GSoC Week 2: redesign the table format

Abhradeep Chakraborty
5 min readJun 27, 2022

--

If you saw my previous blog GSoC Week 1: Let’s get started, you know that I am working on integrating a lookup table extension to Git’s bitmap file. It would remove the overhead of loading each and every bitmap by parsing only the necessary bitmaps.

I submitted my initial version of patch series on last Monday (IST). After that, a series of review came. Some of these reviews suggested major changes which I am listing down below. But before that, let us recap the table format first -

The initial table format contain a list of commit oids (for each bitmap entry), a list of <offset, xor_pos> pairs (for each bitmap entry). The i’th <offset, xor_pos> pair denotes the bitmap position and xor-commit position (respectively) of the i’th commit id in the list. By saying xor-commit position, I mean the position of the commit oid (in the commit id list) with whose bitmap the current commit’s bitmap need to xor.

Fact: I know if you’re a non Git developer, you probably are not getting what I just have said ;). Don’t worry, I will surely make detailed articles on Git internals later so that you can also take part in the contribution of Git! For now, read the ProGit book.

Major changes suggested in the review comments

  • Use commit positions instead of commit oids in the table:
    Commit ids are of 8 bytes and are used here to know which bitmap belongs to which commit. So, the table size becomes Nr_entries * (8+ 4+ 4) bytes. We can save some spaces if we can store the midx or pack-index position (4 byte) of the commit instead of storing the commit ids. This would save Nr_entries * 4 bytes.
    At first, I thought this would add extra cost — We first have to get the index position of given commit, match that with the list of commit positions in the table and then get the offset of that commit. But Taylor said, this would be negligible and I should try it.
  • Use 8 byte offset positions instead of 4 byte:
    The initial version uses 4 byte offsets which might not be sufficient for holding the bitmap positions. So, Derrick ( GSoC mentor for another Git project named More Sparse Index Integrations) and Taylor suggested to make it a 8 byte unsigned integer instead.
  • Use iterative approach for parsing xor bitmaps:
    There were two functions stored_bitmap_for_commit() and lazy_bitmap_for_commit() which were being used recursively to find the xor bitmaps. Derrick suggested to use a stack instead. The logic is someting like this — For every xor_pos, push the commit positions to the stack and finally pop them one by one. For each popped positions use the previous bitmap as xor_bitmap and compute the current bitmap.
  • Use <commit-pos, offset, xor_pos> triplets:
    In the initial version, the table first has a list of commit oids and then the list of <offset, xor_pos> pairs. Taylor suggested to redesign the table such that all the related information come together forming a triplet. E.g. instead of having two separate lists (one for commit ids, and another for <offset, xor_pos> pairs), the table now contains a single list of commit-pos, offset, xor_pos> triplets. The i’th triplet contains the information of the i’th bitmap.

Besides this, there were some other reviews also but those are relatively small. So I didn't include them here.

Implementing the suggestions

My first job was to redesign the table. I started implementing the “writing table to the bitmap” part. I didn’t face any problem regarding it since most of the implementations are same as the initial version. The tricky part here was to get the commit positions given the commit ids. As I am relatively new in the community, I didn’t know what functions are available for it. But after some investigation, I suddenly remembered that bitmaps already store the commit positions. So I could store those positions into an array and pass it to the write_lookup_table() function! Thus I got what I wanted!

Then I moved to the read part. It was the most time consuming task for this week. I had to rewrite the patch itself. During this time I realized that the previous version had some bugs in it. E.g. the store_bitmap() function throws a Duplicate entry in bitmap index error whenever it sees an already stored bitmap is trying to be stored again. This error is obvious if all the bitmaps are getting parsed one by one. But in our case, we are trying to parse only the necessary bitmaps. So in this process this may happen that a bitmap has already been parsed and another bitmap has a xor relation with it. Whenever we get to this bitmap from the bitmap that has a xor relation with it (using iterative approach we discussed previously), store_bitmap is throwing the above error and is returning NULL. Below is my fix:

/* A 0 return code means the insertion succeeded with no changes,
* because the SHA1 already existed on the map. If lookup table
* is NULL, this is bad, there shouldn't be duplicated commits
* in the index.
* If table_lookup exists, that means the desired bitmap is already
* loaded. Either this bitmap has been stored directly or another
* bitmap has a direct or indirect xor relation with it. */
if (ret == 0) {
if (!index->table_lookup) {
error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
return NULL;
}
return kh_value(index->bitmaps, hash_pos);
}

I also converted the previously implemented recursive search into a iterative parsing. The process was not smooth — I faced many errors here. Some of them were fixed quickly, some of them took two or three days to fix. But at the end, all was finally fixed and was ready for another round of review. Here is the latest patch series.

I think this format is way simpler and more structured than the previous one. Huge shoutout to Taylor for suggesting this format!

Next step

While writing the blog, I realised that I could optimise the parsing process further. The idea is — when we are filling the xor commit positions into the stack , if somehow we can know that the bitmap belonged to the current commit position has already been stored, we don’t need to iterate further for xor positions. Because we know that those bitmaps has been stored already. This way, we can stop unnecessary iterations.

So my next step is to implement the above idea and to resubmit it.

--

--

Abhradeep Chakraborty

I am a 3rd year IT student. I like to build full stack web apps and have a strong interest in core development. Currently learning cloud technologies.